THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Merrill Aldrich

Long I/Os and TCP Timeouts are not a SQL Server Problem, Except When They Are

My coverage of old SQL Server problems seems to be turning into a blog series. In any case, here's at least a reminder, or perhaps just more fodder for the search engines.

Two types of errors pop up on occasion that look very much like network or disk problems, but sometimes the cause is simpler than one would think, and is actually a SQL Server issue:

Case 1: Clients or SQL Agent experience intermittent TCP communication failures like this series from an Agent error log:

[298] SQLServer Error: 64, TCP Provider: The specified network name is no longer available. [SQLSTATE 08S01]

[298] SQLServer Error: 64, Communication link failure [SQLSTATE 08S01]

[298] SQLServer Error: 16389, Communication link failure [SQLSTATE 08S01]

Case 2: The SQL Server error log reports a warning:

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [G:\ <some file> ] in database [myDatabase] (7).  The OS file handle is 0x0000000000000A98.  The offset of the latest long I/O is: 0x0000003b683600

Usually the advice I find on Case 2 is that it's your disk system having a problem; sometimes the advice is basically to ignore the issue unless it's persistent or really terrible.

However, if you've ever had a busy log file that would not allow writes for 30 seconds, it'll get your attention, and make you passionate about finding the solution to this problem :-). Log writes are, of course, sychronous (that's the whole point of the transaction log) so no log writes means that everything stops.

So, the articles that talk about I/O systems being busy or overwhelmed are not wrong, also the articles recommending examination of drivers and firmware, but there is one more place to look. Check these symptoms against your system:

  1. You do have enough I/O performance to meet your workload. Disks are not overwhelmed, perhaps not even busy. It's possible to examine this with perfmon physical disk counters. Still, you see the errors above.
  2. When the SQL Error log reports the long I/Os, the disk affected is less busy, not more busy than normal. Again, it isn't a case of the disk system not keeping up.
  3. When the error logs report the long I/Os or network timeouts, perhaps one CPU core on your server is busy, maybe 90% or 100% utilized, even while the other cores migth be running normally.

This combination can mean the CPU and I/O affinity masks are set incorrectly on the SQL instance, which can cause the normal CPU workload from SQL Server queries and processes to contend with the CPU core(s) that are identified for disk I/O and slow the processing of I/O requests inside SQL Server.

If I understand it correctly, if you explicitly identify any cores for I/O and for CPU affinity, rather than leaving the default Automagic behavior, and you make the error of having those assignments overlap such that some of the same cores have both I/O and CPU affinity set, then the behavior of the scheduler changes and it's possible to have this contention.

Imagine an I/O and CPU intensive process like DBCC CHECKDB. Suppose further that all the disk I/O requests for some disk are set by the scheduler to run on, say, CPU 4. Further suppose that the scheduler also assigned the CHECKDB process itself to CPU 4. Even on an 8 core box with major disk throughput capability, you can wind up with just one very busy core that is a bottleneck not for disk activity per se, but for the I/O requests themselves. Your disk goes idle, because it's not being asked to do work, and the requests for disk-related work are blocked at the CPU.

CPU and IO affinity should either be set to Automatic (probably preferred for any but the biggest systems) or to a mutually exclusive set of CPU cores. See http://msdn.microsoft.com/en-us/library/ms189629.aspx.

Published Tuesday, June 22, 2010 11:18 AM by merrillaldrich

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Ben Thul said:

Luckily, there's an easy way to check if this problem exists on your server(s).  Take the values from sys.configurations for the affinity mask and the affinity i/o mask and bitwise-and them together.  If the value is 0, you're good to go.  If not, fix it.

June 22, 2010 2:52 PM
 

David Warner said:

You mention case 1 and 2 near the start of the blog but after that you only seem to discuss case 2 where I/O is taking a long time. Can you point out the sections that discuss the TCP communication failures please?

January 2, 2014 4:23 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement