I’ve had some issues the last few days with being able to focus on anything really productive. I have a lot of things I need to focus on, and a lot of things that I want/need to be working on, but the last few days have been somewhat bleh for me. It happens occasionally, but never really lasts long. Well tonight I got pulled from my slump by my cell phone going crazy just as I was going to be. What should have been a quick report from my scheduled CHECKDB jobs on all my servers quickly turned into a nightmare as every one of my servers began paging out at 10pm with errors like:
2009-12-10 01:35:44.04 spid75 The operating system returned error 1784(The supplied user buffer is not valid for the requested operation.) to SQL Server during a write at offset 0x000000d5d2e000 in file 'E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2009-12-10 01:35:44.04 spid75 Error: 17053, Severity: 16, State: 1.
2009-12-10 01:35:44.04 spid75 E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16: Operating system error 1784(The supplied user buffer is not valid for the requested operation.) encountered.
2009-12-10 01:35:44.07 spid98 DBCC CHECKDB (DatabaseName) WITH all_errormsgs, no_infomsgs, data_purity executed by Domain\UserName found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 9 seconds.
2009-12-10 01:35:44.45 spid98 Error: 926, Severity: 21, State: 6.
2009-12-10 01:35:44.45 spid98 Database 'DatabaseName' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
2009-12-10 02:05:27.17 spid97 A read of the file 'E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16' at offset 0x00000110d7e000 succeeded after failing 1 time(s) with error: 38(Reached the end of the file.). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2009-12-10 02:05:29.82 spid97 Error: 8966, Severity: 16, State: 2.
2009-12-10 02:05:29.82 spid97 Unable to read and latch page (1:5164) with latch type SH. 38(Reached the end of the file.) failed.
2009-12-10 02:05:29.82 spid97 Error: 8966, Severity: 16, State: 2.
2009-12-10 02:05:29.82 spid97 Unable to read and latch page (1:5189) with latch type SH. incorrect pageid (expected 1:5189; actual 0:0) failed.
2009-12-10 02:05:48.41 spid97 Error: 823, Severity: 24, State: 2.
2009-12-10 02:05:48.41 spid97 The operating system returned error 38(Reached the end of the file.) to SQL Server during a read at offset 0x000000276fe000 in file 'E:\SQLData\MSSQL.1\DatabaseName.mdf:MSSQL_DBCC16'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2009-12-10 02:05:48.42 spid97 Error: 3313, Severity: 21, State: 2.
2009-12-10 02:05:48.42 spid97 During redoing of a logged operation in database 'DatabaseName', an error occurred at log record ID (311168:9684:2). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.
Nothing to bring things back into focus like a mass of potential corruption problems across every production server that I have. What was most interesting is that ever one of the failures involved a server that uses SAN storage. Time to wake up my support staff, starting with a member of our server team that is one of the best I have ever worked with. A couple of phone calls, he was asleep, and he jumped online and started looking at things. He couldn’t find any issues with the VM’s or the physical server having the issue, so the problem got escalated to waking up our SAN admin. Meanwhile I through a few tweets out onto Twitter, with a couple directly at Paul Randal (blog/twitter) who is the master of all things CHECKDB. I also got assistance, voluntarily I might add, from Mr Denny (blog/twitter), Wes Brown (blog/twitter), and Buck Woody(blog/twitter). Many thanks gents, I really do appreciate it.
What transpired was four hours of what seemed to be completely fruitless troubleshooting involving calls to our SAN vendor and multiple diagnostics from every direction possible, trying to isolate where exactly the problem was originating. There were no changes to the SQL Servers (at least that we thought of initially, keep reading) and the checks had previously succeeded without a problem. Were it not for Paul Randal, instead of writing this blog post, I’d still be trying to troubleshoot this problem. Almost four hours into troubleshooting this problem a simple tweet from Paul jarred my memory of a recent change that turned out to be the actual problem.
At the end of last week, Diskeeper 2009 was upgraded to Diskeeper 2010 on the servers by our server team. Diskeeper 2009 had been there for at least a year without problems so who would have expected it to be the issue. Turns out the new version has an option enabled called IntelliWrite that causes this problem to occur. Disabling this option immediately resolve the problem. Paul blogged about this problem a long time ago, and recalled reading this when I clicked on it but it was the last thing that I would have thought about.
Search Engine Q&A #14: Beware 3rd party file-system drivers with DBCC CHECKDB
Moral of the story, don’t use file system filtering software on your SQL Servers. It can and will cause you to lose sleep, wake up your support staff, and go on a wild goose chase for hours before figuring out that there isn’t really a problem.
What made this problem really perplexing is that running CHECKDB in single user mode with TABLOCK succeeded without failure. This is led us to thinking that the problem was creating snapshots or sparse files in NTFS which was tested by creating a manual database snapshot on a couple of the Enterprise Edition Servers. This succeeded, which I can only guess lead Paul to asking about the the file-system drivers.
EDIT:
After searching around some this morning this has also happened to at least one other person so its not specific to my environment:
http://philjax.spaces.live.com/blog/cns!1AFDFE016BEE99E9!13100.entry