THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

The Rambling DBA: Jonathan Kehayias

The random ramblings and rantings of frazzled SQL Server DBA

A tale of CHECKDB failures cause by 3rd party file-system drivers

I’ve had some issues the last few days with being able to focus on anything really productive.  I have a lot of things I need to focus on, and a lot of things that I want/need to be working on, but the last few days have been somewhat bleh for me.  It happens occasionally, but never really lasts long.  Well tonight I got pulled from my slump by my cell phone going crazy just as I was going to be.  What should have been a quick report from my scheduled CHECKDB jobs on all my servers quickly turned into a nightmare as every one of my servers began paging out at 10pm with errors like:

2009-12-10 01:35:44.04 spid75      The operating system returned error 1784(The supplied user buffer is not valid for the requested operation.) to SQL Server during a write at offset 0x000000d5d2e000 in file 'E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2009-12-10 01:35:44.04 spid75      Error: 17053, Severity: 16, State: 1.
2009-12-10 01:35:44.04 spid75      E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16: Operating system error 1784(The supplied user buffer is not valid for the requested operation.) encountered.
2009-12-10 01:35:44.07 spid98      DBCC CHECKDB (DatabaseName) WITH all_errormsgs, no_infomsgs, data_purity executed by Domain\UserName found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 9 seconds.
2009-12-10 01:35:44.45 spid98      Error: 926, Severity: 21, State: 6.
2009-12-10 01:35:44.45 spid98      Database 'DatabaseName' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
2009-12-10 02:05:27.17 spid97      A read of the file 'E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16' at offset 0x00000110d7e000 succeeded after failing 1 time(s) with error: 38(Reached the end of the file.). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2009-12-10 02:05:29.82 spid97      Error: 8966, Severity: 16, State: 2.
2009-12-10 02:05:29.82 spid97      Unable to read and latch page (1:5164) with latch type SH. 38(Reached the end of the file.) failed.
2009-12-10 02:05:29.82 spid97      Error: 8966, Severity: 16, State: 2.
2009-12-10 02:05:29.82 spid97      Unable to read and latch page (1:5189) with latch type SH. incorrect pageid (expected 1:5189; actual 0:0) failed.
2009-12-10 02:05:48.41 spid97      Error: 823, Severity: 24, State: 2.
2009-12-10 02:05:48.41 spid97      The operating system returned error 38(Reached the end of the file.) to SQL Server during a read at offset 0x000000276fe000 in file 'E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2009-12-10 02:05:48.42 spid97      Error: 3313, Severity: 21, State: 2.
2009-12-10 02:05:48.42 spid97      During redoing of a logged operation in database 'DatabaseName', an error occurred at log record ID (311168:9684:2). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.

Nothing to bring things back into focus like a mass of potential corruption problems across every production server that I have.  What was most interesting is that ever one of the failures involved a server that uses SAN storage.  Time to wake up my support staff, starting with a member of our server team that is one of the best I have ever worked with.  A couple of phone calls, he was asleep, and he jumped online and started looking at things.   He couldn’t find any issues with the VM’s or the physical server having the issue, so the problem got escalated to waking up our SAN admin.  Meanwhile I through a few tweets out onto Twitter, with a couple directly at Paul Randal (blog/twitter) who is the master of all things CHECKDB.  I also got assistance, voluntarily I might add, from Mr Denny (blog/twitter), Wes Brown (blog/twitter), and Buck Woody(blog/twitter).  Many thanks gents, I really do appreciate it.

What transpired was four hours of what seemed to be completely fruitless troubleshooting involving calls to our SAN vendor and multiple diagnostics from every direction possible, trying to isolate where exactly the problem was originating.  There were no changes to the SQL Servers (at least that we thought of initially, keep reading) and the checks had previously succeeded without a problem.  Were it not for Paul Randal, instead of writing this blog post, I’d still be trying to troubleshoot this problem.  Almost four hours into troubleshooting this problem a simple tweet from Paul jarred my memory of a recent change that turned out to be the actual problem.

At the end of last week, Diskeeper 2009 was upgraded to Diskeeper 2010 on the servers by our server team.  Diskeeper 2009 had been there for at least a year without problems so who would have expected it to be the issue.  Turns out the new version has an option enabled called IntelliWrite that causes this problem to occur.  Disabling this option immediately resolve the problem.  Paul blogged about this problem a long time ago, and recalled reading this when I clicked on it but it was the last thing that I would have thought about.

Search Engine Q&A #14: Beware 3rd party file-system drivers with DBCC CHECKDB

Moral of the story, don’t use file system filtering software on your SQL Servers.  It can and will cause you to lose sleep, wake up your support staff, and go on a wild goose chase for hours before figuring out that there isn’t really a problem.

What made this problem really perplexing is that running CHECKDB in single user mode with TABLOCK succeeded without failure.  This is led us to thinking that the problem was creating snapshots or sparse files in NTFS which was tested by creating a manual database snapshot on a couple of the Enterprise Edition Servers.  This succeeded, which I can only guess lead Paul to asking about the the file-system drivers. 

EDIT:

After searching around some this morning this has also happened to at least one other person so its not specific to my environment:

http://philjax.spaces.live.com/blog/cns!1AFDFE016BEE99E9!13100.entry

Published Thursday, December 10, 2009 3:25 AM by Jonathan Kehayias

Comments

 

Wes Brown said:

That's why I was asking about anything that was doing snap shots in the back ground, should have been more specific maybe could have saved you a couple of hours :(. My side company builds tools that compress the network stream and databases on disks and this was something we hit very early during the development cycle. That's why we test, test, test. The Inteliwrite thing from diskkeeper would make me nervous on a SQL database since it is reordering writes under the covers and breaking the guarantee that data is hardened to disk.

You guys sending this up to the DK team?

Glad everything worked out!

-wes

December 10, 2009 8:23 AM
 

Michael Materie said:

Hi Jonathan,

Our apologies for the headache the software caused you guys. The alerts are certainly alarming (no doubt), and obviously did not come at a good hour, but as you likely noticed there is no actual data corruption. The issue is specifically that IntelliWrite's timing in expanding a file changed what the Database Consistency Check was expecting. Thereby DBCC incorrectly reports consistency errors. Turning off IntelliWrite makes the alerts go away, and then running DBCC confirms that no errors actually occurred.

We've repro'ed the issue and already have a fix. We're starting a field test tomorrow (on Friday, Dec. 10th). The fixed should be broadly available later next week or early the following week. If anyone is interested in testing this, you can contact Diskeeper at fieldtest@diskeeper.com.

Until the fix is available, disabling IntelliWrite on a database volume corrects the issue.

Michael Materie

Diskeeper Corporation

December 10, 2009 1:44 PM
 

The Rambling DBA: Jonathan Kehayias said:

Less than 24 hours ago I blogged about a veritable nightmare I had with CHECKDB and the resulting findings

December 11, 2009 12:49 AM
 

Jeff York said:

Happened to me as well. Thanks for posting this; it was a lifesaver!

December 15, 2009 4:25 PM
 

The Rambling DBA: Jonathan Kehayias said:

A few weeks ago I blogged about A tale of CHECKDB failures caused by 3rd party file-system drivers that

December 29, 2009 1:32 PM
 

Lucm said:

Somehow related problem: A while ago I found a catch-22 with CHECKDB and SCANDISK raising issues with each other on account of ntfs sparse files IIRC. There is a hotfix available from MSFT for this problem.

December 29, 2009 3:28 PM
 

Jane in London said:

Me three!  Thank you SO MUCH!  I was about to succumb to a fit of the screaming habdabs..

I'll blog it tomorrow.

January 4, 2010 12:42 PM
 

Mike M said:

Thanks for posting your findings.  This saved me hours of troubleshooting one of my SQL servers today.  

January 6, 2010 10:16 AM
 

Cynthia Scott in Portland said:

I had the exact same problem.  Thank you so much for posting this.  It hit the nail on the head!

January 8, 2010 2:16 PM
 

Bryan Fulkersin said:

I just happened to be browsing through server logs, and saw numerous inconsistencies amongst 3 HUGE production databases.   Fearing the worst, and searching for a solution, I came across you post.  As soon as Diskeeper 2010 was mentioned, the light bulb came on.

Turned off Intelliwrite, re-ran CHECKDB, and everything is fine.

THANKS AGAIN!

January 20, 2010 12:27 PM
 

Todd Nichols said:

As a .NET guy being forced to also wear a DBA hat, this article was a lifesaver.  THANK YOU!!!

April 13, 2010 12:04 PM
Anonymous comments are disabled

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement