
Joe Chang

IO Queue Depth Strategy

I am finally getting around to setting up an SSD array, starting with a few units, then working up to perhaps 20 units over 2 controllers and 4 x4 SAS ports. During initial testing I observed very high disk latency, in the range of 100ms+ for reads and up to 400ms+ for writes in certain operations. This occurs during periods of exceptionally high disk queue depth. So the questions are: 1) does high queue depth (and the accompanying latency) actually improve performance, 2) does it cause problems for other operations, including responsiveness, and 3) can this be avoided without giving up performance?

Let's start with single disk drive performance principles, then proceed to disk arrays, caching RAID controllers, and finally to the SAN system, with consideration for SSD characteristics. Along the way, certain pieces of storage performance advice that are commonly cited without reference to their origin, and more importantly their scope, can be traced back to their source. Then the SQL Server IO queue depth policy is reviewed. With this, we can understand why the simple model SQL Server uses in trying for the best performance can overload the storage system, causing periods of poor responsiveness. A more comprehensive model of hard disk, SSD, and storage system performance characteristics, along with input on whether OLTP or DW behavior is desired, could significantly improve SQL Server responsiveness while staying very close to the best achievable throughput.

Hard Disk IOPS Theory

The standard theory of disk drive random IO rate is that average access time is the sum of the rotational latency, the average seek time, the transfer time, command overhead, and propagation delays. For small block IO, only the first two items make a significant contribution. The average rotational latency for a 15K disk drive is 2ms, and the typically cited average seek time for a 15K 3.5in drive is 3.4ms, plus the remaining terms, for an average total access time of around 5.5ms. Then come the two key qualifiers:

  1) for random accesses to data distributed over the entire disk
  2) at queue depth one: one IO issued, the next issued only after the completion of its predecessor

A 15K disk should support 180 IOPS. Too many people neglect to mention the two key qualifiers.
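As a quick sanity check on the arithmetic, here is a minimal sketch using the typical figures quoted above (not measurements from any particular drive):

```python
# Rough single-disk random IOPS estimate at queue depth 1.
# Figures are the typical 15K 3.5in values quoted above, not measured data.

rotational_latency_ms = 2.0   # half a revolution at 15,000 RPM: 60000/15000/2
avg_seek_ms = 3.4             # typical quoted average seek over the full stroke
other_ms = 0.1                # transfer, command overhead, propagation (small block)

access_time_ms = rotational_latency_ms + avg_seek_ms + other_ms
iops_qd1 = 1000.0 / access_time_ms

print(f"average access time: {access_time_ms:.1f} ms")       # ~5.5 ms
print(f"random read IOPS at queue depth 1: {iops_qd1:.0f}")  # ~180
```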

Short Stroke Effect

If the accesses are confined to data within a limited range of disk cylinders, then the average seek time is lower, and IOPS should be correspondingly higher.

Hard Disk Random Read IOPS versus Queue Depth

Now consider the implication of the second qualifier: random read IOPS versus queue depth. At higher queue depth, meaning multiple IOs outstanding at the disk, the controller on the disk drive itself can reorder the IOs, with the net effect of reducing the time between each IO. This increases IOPS at the expense of higher latency for each individual IO.

There is a small gain at queue depth 2, perhaps to 200 IOPS, a larger gain at queue depth 4 to about 240 IOPS, then roughly 40-50 additional IOPS with each doubling of queue depth up to 32, with only a small gain from there to queue depth 64. (The disk drives of 2005 had a 64-deep task queue; it is now 128-deep.) With each doubling of the queue depth, the latency nearly doubles.
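For concreteness, here is a minimal sketch of that relationship. The IOPS values are just the approximate figures quoted above (illustrative, not measured), and the latency follows from Little's Law (average latency = queue depth / IOPS) assuming a steady stream of outstanding IO:

```python
# Approximate 15K-disk full-stroke random read IOPS at each queue depth,
# taken from the figures quoted above (illustrative, not measured).
iops_by_qd = {1: 180, 2: 200, 4: 240, 8: 285, 16: 330, 32: 375, 64: 400}

for qd, iops in iops_by_qd.items():
    # Little's Law: outstanding IO = IOPS x latency, so latency = QD / IOPS
    latency_ms = qd / iops * 1000.0
    print(f"QD {qd:3d}: {iops:4d} IOPS, ~{latency_ms:5.1f} ms per IO")
```

The printed latencies roughly double with each doubling of queue depth, matching the behavior described above.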

The figures below show both the short-stroke effect and the queue depth impact on IOPS and access latency.

[Figure: IOPS versus queue depth for various disk space utilizations]

[Figure: Latency versus queue depth for various disk space utilizations]

Individually, both effects improve hard disk performance, but the two combined produce more dramatic benefits. In online transaction processing, response time and hence disk IO latency is as important as throughput performance. Hence the common rule adopted was to keep (time-)averaged queue depth below 2 per disk (disregarding transient spikes) for online transaction processing. (This rule was popular in the days of 5400 and 7200RPM drives).

In batch processing, where there is not a live person waiting for each transaction to complete, the strategy is to drive queue depth higher for the improved throughput. In DW/DSS, all such rules go out the window: any throughput capability not being utilized is throughput squandered. Note the importance of the qualifiers, and yet people feel that it is acceptable to state the rule of queue depth per disk below 2 without qualification.

RAID Controllers and Disk Arrays

In the earlier days, we had just a bunch of disks (JBOD). The major RDBMS handled this situation by supporting multiple files, and file groups as well, for each database. And then there was light (sorry, wrong book): RAID and RAID controllers. And people saw that RAID was good, having fewer "disks" to manage at the operating system and database level. A disk array now appeared to the operating system as a single disk, and performance counters were usually read from the operating system, not the storage system.

The rule of queue depth 2 per disk (and its accompanying qualifiers) does not translate directly to the queue depth reported by the operating system performance counters. So it became popular to cite a latency rule instead: data access latency should not be higher than roughly 10-20ms, corresponding to the rule of queue depth 2 per disk, with the additional qualifiers having long since been forgotten.

In general, data access latency below 10ms was usually an indication that transaction response time should be good. Latency in the range of 10-20ms corresponds to acceptable response time. Latency over 20ms corresponds to a heavily loaded disk system. More importantly, any temporary surge would push the disk IO into the higher queue depth range, with sharp spikes in response times. So even if average transaction response time is deemed acceptable, there could be a noticeable tail of the distribution experiencing very poor response.
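Little's Law (outstanding IO = IOPS x latency) is what ties the latency rule back to the original per-disk queue depth rule. A minimal sketch, with made-up example numbers for a 10-disk LUN:

```python
# Translate an OS-level (LUN) latency observation back into per-disk queue
# depth via Little's Law. The 1800 IOPS / 10 disk figures are made-up examples.

def per_disk_queue_depth(iops: float, avg_latency_ms: float, disks_in_lun: int) -> float:
    outstanding_io = iops * (avg_latency_ms / 1000.0)   # Little's Law
    return outstanding_io / disks_in_lun

for latency_ms in (5, 10, 20, 40):
    qd = per_disk_queue_depth(iops=1800, avg_latency_ms=latency_ms, disks_in_lun=10)
    print(f"{latency_ms:2d} ms average latency -> ~{qd:.1f} outstanding IO per disk")
# 5 ms -> ~0.9 per disk (comfortable), 10-20 ms -> ~2-4 (loaded), 40 ms -> ~7 (heavy)
```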

RAID Group Random Read IOPS versus Queue Depth

So what should be the IO characteristics of a set of disks in a RAID group? If disk read IOs were issued one at a time, i.e., queue depth 1, then one disk in the group would get the IO. The other disks would be idle, for 180 IOPS from the RAID group. At queue depth 2, it is likely the two IOs will go to different disks, so two disks will operate at queue depth 1, with the remaining disks idle. Only when the average queue depth is one per disk in the group will IOPS reach 180 per 15K disk, with some distribution probability adjustments.

For queue depth at some whole integer multiple of one per disk, the expected IOPS per disk is similar to the single disk IOPS versus queue depth. The important point is that at queue depth 1, a given thread will experience single disk IOPS, not disk group IOPS.
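The "distribution probability adjustments" can be made concrete with a balls-in-bins estimate of how many disks in the group actually have work at a given queue depth. A minimal sketch, assuming IOs land on disks uniformly at random:

```python
# Expected number of distinct disks with at least one IO when q random IOs
# are spread over an n-disk RAID group (uniform placement assumed).

def expected_busy_disks(n_disks: int, queue_depth: int) -> float:
    return n_disks * (1.0 - (1.0 - 1.0 / n_disks) ** queue_depth)

n = 8
for q in (1, 2, 4, 8, 16, 32):
    print(f"queue depth {q:2d} on {n} disks: ~{expected_busy_disks(n, q):.1f} disks active")
# QD 8 on 8 disks only keeps ~5.3 disks busy on average; it takes roughly
# QD 16-32 before all 8 disks are reliably active.
```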

Log Write Latency

In the old direct-attach storage days, the only advice was to have a dedicated RAID 1 disk pair for each high transaction volume database log. It was rarely mentioned, but the pure sequential small block log write could achieve latency on the order of 0.3ms, and around 3000-5000 IOPS.

SAN vendors frequently suggest not bothering with dedicated physical disks for each high transaction volume log: everything will be fine, trust them. Even if one did provide dedicated disks, and perhaps even a dedicated service processor, the SAN still could not achieve a very low log write latency. As SAN systems became pervasive, Microsoft changed SQL Server to allow more log writes to be in flight (32 for 64-bit SQL Server and xxKB?).

RAID Small Block Random Write

RAID level write overhead is a subject covered elsewhere and is not discussed here. People like to cite RAID 5 and 10 rules without qualification, but the rule commonly cited only applies to small block random writes. On a non-caching controller, we would expect write IO to have similar IOPS characteristics as read IO, adjusted for RAID level overhead, both theoretical and controller specific.

Caching RAID Controllers and Read IO

In other discussions, I have explained why read cache is counter-productive. In essence, the database engine itself is a data cache that is much closer and less expensive to access than cache on the storage controller. Next, in a properly configured system, the database engine should have a much larger buffer cache than the storage system. It is very unlikely that anything in the storage controller cache will be accessed again. Finally, the overhead of read-caching is significant in a storage system configured for high IOPS performance, since the controllers incur caching overhead for blocks that will almost never be read from cache.

Do you write weekly TPS reports at work that never get read?
[Figure: IOPS versus queue depth]
Put this in your report just to see who actually comments on it, with the correct TPS cover sheet of course!

Read caching is usually disabled in TPC benchmark systems for the reasons just cited. One reputable source stated that a small 2MB (not GB!) read cache per LUN to enable read-ahead is the preferred strategy. I recall someone asserting that a specific server system with 48GB memory showed IO performance improvement when SAN cache was increased from 80GB to 120GB. What this fact proves could be argued along more than one angle.

Caching RAID Controllers and Write IO

And now to the random write IO performance characteristics on a caching RAID controller. We have mostly skirted around write IO until now. There is a reason for this. Below is the small block random write IOPS pattern with a caching RAID controller.

[Figure: Small block random write IOPS versus queue depth on a caching RAID controller]

When SQL Server or the operating system sends one or more write IOs to the RAID controller, the IO is written to the controller cache, and a completion signal is sent back to the source. The next IO is then sent. There is almost no variation in IOPS versus queue depth. Latency is very low until the write volume reaches the IOPS limit. Beyond this, the write cache fills up and latency becomes high, until the source throttles back on write IO.
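The behavior reduces to two regimes, which a toy sketch can make explicit (the cache acknowledgement latency and destage limit below are illustrative assumptions, not measurements):

```python
# Toy model of write latency against a caching RAID controller: two regimes.
CACHE_ACK_LATENCY_MS = 0.2   # assumed latency to acknowledge a write into cache
DESTAGE_LIMIT_IOPS = 4000    # assumed rate the controller can flush to the disks

def write_latency_regime(offered_write_iops: float) -> str:
    if offered_write_iops <= DESTAGE_LIMIT_IOPS:
        # cache absorbs the writes; queue depth makes essentially no difference
        return f"~{CACHE_ACK_LATENCY_MS} ms"
    # cache fills; completions are paced by destaging and the source is throttled
    return "high and climbing until the source throttles back"

for rate in (1000, 3000, 5000):
    print(f"{rate} write IOPS offered -> latency {write_latency_regime(rate)}")
```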

Large Storage Systems with Large RAID Groups

As systems became more powerful, with compute performance growing 40% per year and hard disk performance averaging less than 10% per year (7.2K to 10K to 15K, then nothing until SSD), it was necessary to build storage systems with a very large number of disks. During this period, SAN systems became pervasive, especially for large storage systems.

It was soon noticed that the SAN could not deliver anywhere near the expected IOPS based on the number of disks. One cause was traced to the FC HBA default queue depth setting of 32 (per adapter, now per target?). The reasoning behind the default was the SAN vendor doctrine of shared storage. To prevent one host from generating too much load, the IO was throttled with the HBA queue depth setting so all hosts could get a share of IO volume.

If one were to measure IOPS versus the HBA queue depth setting on LUNs comprised of many disks, one should find that IOPS increases with higher queue depth all the way up to the maximum. This behavior was in fact already described in the RAID Group IOPS versus queue depth section above.

FC HBA Queue Depth Setting

Note that in the early days, the HBA queue depth setting applied to either the HBA or each HBA FC port. On the more recent Emulex FC HBAs, the default is now queue depth 32 per LUN, with the option of per LUN or for the entire target. (QLogic uses the term Execution Throttle?) I believe that in one of the few TPC-C benchmark reports with a SAN storage system, a reference was made to changing the HBA queue depth from 32 to 254 without additional explanation.

The TPC-C systems all have very large disk arrays, and in that situation it is of course proper to dial the HBA queue depth to maximum. Eventually, this was noted and the recommendation to change HBA queue depth from 32 to 255 made its way into various Microsoft documents. The ones I have seen gave no explanation of the underlying cause and effect, nor supporting measurements.

So now what about a SAN with a small disk array? Especially if the queue depth setting is per LUN, and each LUN is comprised of 4 disks? Should the queue depth setting be increased to 254? I suggest following my guidelines for IOPS and latency versus queue depth with adjustments for the number of disks per LUN, all weighed against whether the objective is OLTP responsiveness or batch/DSS pure throughput.

Sequential IO

None of the IOPS versus queue depth discussion so far pertains to sequential disk IO. For large block sequential IO, a queue depth of 1 per LUN seems to be sufficient to generate maximum IO bandwidth. I am inclined to think that the IO size times the queue depth should be larger than the number of disks in the array times the RAID stripe size, the reasoning being that each disk will then have IO to process, but I have not verified this hypothesis.
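As a sketch of that (unverified) hypothesis, the minimum queue depth to cover one full stripe would be roughly disks x stripe unit size / IO size; the disk count and stripe unit below are illustrative assumptions:

```python
import math

# Minimum queue depth so that outstanding bytes cover one full stripe:
# IO size x queue depth >= disks x stripe unit size (unverified hypothesis).
def min_queue_depth(io_size_kb: int, disks: int, stripe_unit_kb: int) -> int:
    return math.ceil(disks * stripe_unit_kb / io_size_kb)

for io_kb in (512, 64, 8):
    qd = min_queue_depth(io_size_kb=io_kb, disks=8, stripe_unit_kb=64)
    print(f"{io_kb:3d} KB IO on an 8-disk, 64KB-stripe array -> queue depth >= {qd}")
# 512 KB -> 1, 64 KB -> 8, 8 KB -> 64
```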

Increasing queue depth beyond the minimum necessary to achieve near maximum bandwidth will only serve to increase latency. In a mixed small and large block IO load, a higher queue depth on the large block IO might improve the large block portion of throughput, but this has not been studied. In a SAN, there are some suggestions that a higher queue depth may be necessary to reach maximum sequential bandwidth, along with multiple LUNs per RAID group. A satisfactory, plausible explanation has not been provided.

SQL Server IO Characteristics

There are several Microsoft documents that describe SQL Server IO in detail. A selection includes the CSS SQL Server Engineers blog post How It Works: Bob Dorr's SQL Server I/O Presentation and the Microsoft-SQL-Server-IO-Internals slide deck, KB 917047 Microsoft SQL Server I/O subsystem requirements for the tempdb database, and the SQL Server Technical Article by Emily Wilson, Mike Ruthruff, and Thomas Kejser, Analyzing I/O Characteristics and Sizing Storage Systems for SQL Server Database Applications. SQL Server 2008 R2 Books Online covers this under Buffer Management, and Craig Freedman's Random Prefetching post discusses asynchronous IO.

In brief, in a table scan operation, SQL Server will issue IO to try to stay 1024 pages ahead of the scan in Enterprise Edition and 128 pages in Standard Edition. In the random-like 8KB accesses for key lookups and loop join inner source rows, SQL Server switches from synchronous to asynchronous IO at an estimated 25 rows (a source other than me is needed to confirm this).
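Translating the read-ahead target into outstanding IO counts shows why the IO size matters so much for the queue depths observed in the next section; a minimal sketch:

```python
# The Enterprise Edition read-ahead target expressed as outstanding IO counts.
PAGE_KB = 8
READAHEAD_PAGES = 1024                    # 128 for Standard Edition
readahead_kb = READAHEAD_PAGES * PAGE_KB  # 8 MB kept ahead of the scan

for io_kb in (8, 64, 512):
    print(f"{io_kb:3d} KB IO -> up to ~{readahead_kb // io_kb} outstanding IOs")
# 8 KB IO implies roughly a thousand outstanding IOs; 512 KB IO only ~16
```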

High Queue Depth SQL on SSD

In a table scan query without lock hints, a read queue depth of over 1300 was observed. The IO size was 8KB, and read latency went above 200ms even on SSD storage. With a table lock, the IO size was around 500K (probably mostly 512K plus a few small block IOs), disk latency was less than 50ms, and queue depth was around 40.

For key lookup 8KB IO, queue depth was around 160 with 7ms latency. With HDD storage and 20 or so disks, queue depth 160 works out to 8 per disk, a reasonable number for good IOPS without excessive latency.

On SSD, any queue depth of more than 1-2 per LUN should achieve maximum IOPS, and latency is below 0.1ms. Marc Bevand, on the Zorinaq blog, pointed out that IOPS at queue depth 1 is essentially a measure of latency. Suppose an SSD is rated at 100us latency and 30K IOPS for 8KB IO (30K x 8KB = 240MB/s). Then the queue depth 1 IOPS should be 10K (1,000,000 us/s / 100us), and in theory queue depth 3 should reach 30K IOPS. Keeping queue depth at the bare minimum necessary for maximum IOPS does not degrade the query generating the huge IO volume, while providing good responsiveness for other concurrent queries.
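The arithmetic, as a minimal sketch using the assumed ratings from the example above:

```python
# Queue depth 1 IOPS is just the reciprocal of latency; the queue depth
# needed to reach rated IOPS follows from Little's Law. Ratings assumed above.
LATENCY_US = 100          # assumed rated read latency
RATED_IOPS = 30_000       # assumed rated 8KB IOPS (30K x 8KB = 240 MB/s)

iops_at_qd1 = 1_000_000 / LATENCY_US                 # 10,000 IOPS
qd_for_rated = RATED_IOPS * LATENCY_US / 1_000_000   # ~3 outstanding IOs

print(f"IOPS at queue depth 1: {iops_at_qd1:.0f}")
print(f"queue depth needed for {RATED_IOPS} IOPS: ~{qd_for_rated:.0f}")
```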

High Write Latency in Creating Clustered Indexes

The Create Clustered Index command was observed to generate the highest write latency. Queue depth was 500, latency was 600ms+ and IO size averaging 100KB. Now creating large clustered indexes does not occur during the working day. Still, there is no point issuing so many outstanding IO. With either a caching RAID controller or SSD, write IO bandwidth can be saturated even at low queue depth. Driving IO so high only makes the system highly unresponsive for any function requiring IO to the affected drives.

IO Queue Depth Summary

We have explored in brief the key components that are impacted by IO queue depth. The following are the main points to consider.

1) Random reads to hard disks have the most to gain from a sufficiently high queue depth.

2) Sequential IO does not (or should not?) need deep queue operation beyond what is necessary to keep all disks busy. Staying 1024 pages ahead of the scan seems reasonable for large block IO, but I would not flood the queue with 8KB IO. This strategy should be adjusted based on IO size, or perhaps we should ask why it is necessary to issue 8KB IO at all if the table is not fragmented.

3) Random writes to a RAID controller with write cache do not need a deep queue for best performance.

4) SSDs do not need a deep queue for maximum performance.

A) SQL Server appears to follow a set-piece strategy on IO queue depth, dependent only on edition, Standard or Enterprise. Neither the number of disks behind each LUN nor the usage model (OLTP versus DW/DSS) is considered.

The proposed strategy is as follows. 1) Sequential IO should not try for 1024 pages ahead if the IO size is 8KB. 2) Write IO to controllers with write cache should use a lower queue depth. 3) It is important to adjust random read IO queue depth based on the type of storage, HDD or SSD. 4) It is helpful to adjust HDD random read IO by usage model, OLTP or DW/DSS. 5) It is helpful to adjust HDD random read IO based on disks per LUN.
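To make the proposal concrete, here is a hypothetical sketch of how such a policy might pick a target queue depth. The function, its inputs, and the specific numbers are illustrative assumptions of mine, not anything SQL Server actually does:

```python
# Hypothetical queue depth policy reflecting points 1-5 above (illustrative only).
def target_queue_depth(storage: str, workload: str, disks_per_lun: int,
                       sequential: bool, write_cached: bool) -> int:
    if sequential or write_cached:
        # just enough to keep the stripe busy / the write cache fed
        return max(2, disks_per_lun)
    if storage == "ssd":
        return 4                               # a few outstanding IOs already saturate an SSD
    per_disk = 2 if workload == "oltp" else 8  # favor latency (OLTP) or throughput (DW/DSS)
    return per_disk * disks_per_lun

print(target_queue_depth("hdd", "oltp", disks_per_lun=4, sequential=False, write_cached=False))  # 8
print(target_queue_depth("hdd", "dw",   disks_per_lun=4, sequential=False, write_cached=False))  # 32
print(target_queue_depth("ssd", "dw",   disks_per_lun=4, sequential=False, write_cached=False))  # 4
```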

Some of the above could be detected automatically. Others might require a parameter setting, sp_configure seems appropriate. As much as we would like a universal answer (42) independent of user action, having adjustments could greatly improve the usability of SQL Server. Today a number of operations can render the SQL Server system completely unresponsive for the duration due to disk queue flooding, even with SSD storage. Only very large perfectly configured storage systems would have immunity.

PS - The performance data that I cited could be broadened in scope prior to making the change in SQL Server IO queue depth strategy. I am only providing enough to support my arguments, and Microsoft probably has more resources than I to put into this.

More references

Joe Sack's SQL Server Blog post SQL Server and HBA Queue Depth Mashup also references Christian Bolton's blogcast.

(Update 2010-24) Tempdb
SQL Server is also fairly aggressive on tempdb IO queue depth, probably having been tuned on a high disk-count storage system (as critical database server storage systems should be), and so issues IO at high queue depth. On all but the most powerful storage systems, this results in high disk latencies, and even then latency might still be somewhat high.
Many people want to interpret the best practice guideline of less than 10-20ms as a law of xxx engraved in stone, rather than as just one point in the queue depth-latency relation. Hence the (incorrect) conclusion drawn is that tempdb is overloaded, instead of (the more correct interpretation) that SQL Server is simply loading up the queue for best throughput.

Synchronous and Asynchronous IO
I mentioned above the cutover from synchronous to asynchronous IO at 25 rows or thereabouts. Consider a transaction processing server, where a transactional query involving 20 rows also generates 20 read IOs for the key lookups or loop join. The IOs are issued serially; the next IO is not issued until the previous one completes, 5ms each with a properly configured low queue depth storage system. The complete query will take 100ms, 20 rows x 5ms/row (the query does not wait on data page writes). Even if there are many disks able to handle simultaneous IO, only 1 IO is issued by the query at a time.
Now consider a report that generates hundreds or thousands of IOs, issued asynchronously. A batch of 100 IOs could complete in 10-20ms, but disk latency is also driven up to 10-20ms. The effect is that reporting queries are prioritized by their high queue depth disk IO, while transaction queries are penalized twice: their IO is serial, and the latency of each IO is now longer.
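The arithmetic behind this, as a minimal sketch (latency figures are the illustrative values used above):

```python
# Serial transactional IO versus an asynchronous report batch.
ROWS = 20
LATENCY_QUIET_MS = 5    # per-IO latency on a lightly loaded array
LATENCY_LOADED_MS = 15  # per-IO latency once reports drive the queue depth up

print(f"transaction, quiet storage:  ~{ROWS * LATENCY_QUIET_MS} ms")    # 100 ms
print(f"transaction, report running: ~{ROWS * LATENCY_LOADED_MS} ms")   # 300 ms
# The report's 100 async IOs complete together in roughly one latency period,
# so the report barely suffers while the serial transaction slows ~3x.
```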
This is another reason why it is important to have fine-grained control over the IO queue depth strategy.

See: http://www.qdpma.com/Storage/IoQueueDepthStrategy.html for updates.

Published Monday, October 18, 2010 5:43 PM by jchang

Comments

 

this is said:

Amazing work.

Kalen needs to read your blog.

And Paul Randall.

Just to see how large systems work.

October 21, 2010 5:20 PM
 

jchang said:

thanks, I actually had the chance to talk with Mark Souza at SQL Bits earlier this month. He told me that many of the configuration and tuning recommendations are based on investigations conducted on their big test systems, such as HBA Queue Depth 255. I suggested an investigation of the key best practice tuning parameters, such as # of data/tempdb files, etc., on Small, Medium and Large (and possibly VM) reference configurations (not in so many words).

But first I would like to get access to the SQLCAT labs to do a comprehensive investigation on parallelism for which I posted my initial thoughts:
http://www.qdpma.com/CBO/ParallelismComments.html
http://www.qdpma.com/CBO/CBO09_ParallelismIII.html
http://www.qdpma.com/tpch/TpchStudiesSF10.html
When I have baked the ideas more, I will post here. After that I could do the IO and other BP study. Hopefully Jimmy May and Lindsey Allen (oh yeah, thanks for the nice dinner 2 weeks ago) will help me out in this. I would like to avoid a complete refurbish of my own lab, and this is something Microsoft should fund.

One of the difficulties for consultants (and lecturers like Kalen and Paul) who travel a lot is that they don't have direct access to their labs. I am hoping there will be a notebook with a x4 PCI-E gen 2 ExpressCard port to which I can plug in a SAS controller connected to 8 SSDs in a lightweight external enclosure. Then we can do serious IO studies while traveling and do heavy IO demos during sessions too.

One more thing: I have to interpret various counters etc. to assess what SQL Server is doing with regard to IO; Paul could have looked at the source code (unless he wrote it and does not have to look). But hopefully I have explained why the queue depth strategy should be more complex, to handle different storage array sizes and to adjust the desired behavior by usage model.

October 21, 2010 5:47 PM
 

Nasser said:

Hi Joe,

great article as always, I just wanted to ask about the SSD part: did you run your tests on a consumer SSD or on Enterprise SSDs? Because I was checking the specs of a new HP server with a Fusion-IO disk (the HP IO Accelerator) and I was shocked when I saw the numbers!!  They claim to achieve more than 100K IOPS using one disk only?

October 25, 2010 1:11 AM
 

jchang said:

There should be no surprise about the Fusion-IO IOPS number, because an SSD is not really a single device the way a hard disk is. In fact, the performance of an individual NAND chip is rather mediocre. But an SSD must be comprised of many NAND chips, which can be wired in parallel. Compare this to a hard disk: if we add a platter, the capacity is increased, but this does not change either the sequential or random IOPS.

Read performance of complete SSD devices is usually more or less the same between SLC/MLC, Enterprise or consumer grade. SLC has much higher write cycles than MLC. The main difference between MLC Enterprise and consumer is the uncorrectable read error rate, as the enterprise model will want to keep this as low as possible, while consumers can tolerate infrequent failures. Another difference might be in sustained write performance, as the enterprise model will have more allocation for over-provisioning. I suspect the 50GB enterprise model has the same number of chips as a 64GB consumer model; it's just that the enterprise model allocates an extra 14GB for over-provisioning.

Finally, a word of caution. The Fusion-IO spec is for 512byte IOPS. The original SLC ioDrive PCI-E gen1 x4 can support 770MB/s, meaning 8KB IO will be capped at 96K, and probably be less for random 8KB.

Oh yeah, I am using consumer, do you think I am made of money? anyways there will be new SSDs coming out in early 2011, so any SSD I buy now will be obsolete in a few months, another reason to go cheap right now.

October 25, 2010 11:00 AM
 

locca said:

Hi Joe,

  Great article. I just wonder why write IO performance levels off with thread scaling?

Thanks.

April 26, 2011 2:48 AM
 

jchang said:

I think you meant queue depth scaling. Suppose the RAID group is comprised of 8 disks. On reads, if queue depth is 1, then only 1 of the 8 disks gets activity, so we really need queue depth 8 to get 1 IO per disk. Further increasing queue depth improves performance because of the command queuing effect. Caching has no effect on reads if the data is not in cache, which for databases is generally true.

Now consider writes to the RAID group with caching enabled. The write IO is cached, so the controller immediately signals back to the OS that the IO is done. The OS sends the next write IO. So it does not matter what the queue depth is.

April 27, 2011 12:12 PM
 

locca said:

Joe, thanks. In this kind of batched workload testing, wouldn't the write cache fill up in a few seconds and then have no effect, just as with reads? From the straight line, can I say that write IO should always go to the write cache first, and if there is no write cache available, the write IO has to wait?

April 27, 2011 9:41 PM
 

thibaut kanfrey said:

Hi,

I just loved your article!

I am curious what storage array you went for ?

(and how exactly it is configured)

freefrie@yahoo.fr

July 17, 2011 10:21 AM
 

Joe Chang said:

Storage has changed dramatically over the last three years driven by SSD developments. Most of the key

April 2, 2013 6:33 PM
 

jchang said:

The unfortunate thing is that when I get to a customer site, almost always the storage system has already been purchased and configured. I have to work with what is already there. Sometimes I try to get it reconfigured, but that means taking the database offline.

At one site, they had two EMC Clariions with 210 (possibly 240) 15K disks each. What could go wrong? Well, only 2 single-port FC HBAs were on each system. The SANs cost $400K+ each, but I could not get extra FC HBAs for 2 years. When they did get the extra dual-port FC HBAs, the SAN admin refused to connect the new FC ports!

Now I do not mind that the SAN vendor sales rep needed to pay for his/her children's private school, or that they needed a new BMW. Go ahead sell your expensive stuff. Just give me great performance.

For my own lab, I am strictly direct attach because it's my money, and I do not need HA, even though direct attach could be made pretty reliable. An overly complex SAN appears on paper to be HA, but the complexity introduces problems of its own.

April 18, 2013 4:28 PM
 

Mike W said:

I'm having a strange issue where I was running my SAP Netweaver SQL instance on SQL 2008 with an IBM SVC front end, with an IBM FlashSystem 820 SSD array housing all the volumes in a dedicated fashion.  Under SQL 2008 I was achieving normal throughput levels with 100 microsecond latency as expected. Once SQL was upgraded to SQL 2012, the latency has gone up to the 3-7ms range, apparently due to the dual 4Gbps HBAs installed on the server being completely saturated by read IO on the SQL data file volumes.  I have done a full stats update against all tables, yet the throughput from SQL continues to flood the HBAs, driving latency up dramatically.  Have you seen similar issues with SQL 2012?  How would you consider addressing it?

April 1, 2014 11:51 AM
 

jchang said:

for old posts like this, please email or I may not find your comment.

I was not aware of significant IO strategy changes from 2008R2 to 2012, but there was a significant change somewhere in the 2008 RTM-to-R2 period. I am assuming that there were no hardware changes, FC HBA queue depth settings, etc.

If I had to guess, the only thing that comes to mind that would cause the observed behavior is a change in the degree of parallelism. Someone may have set either CTOP or Max DOP differently on the old system, and on the new system you are using the default, or not?

In either case, both 2008R2 and 2012 have a peculiar IO queue depth strategy in table scan queries (which if I am interpreting it correctly, may be somewhat idiotic?)

When the DOP is less than the number of files for the FG holding the table, SQL Server issues IO at low QD. When DOP is equal or higher, but the number of files is less than 16, then SQL Server goes with high IO QD.

So if previously you had DOP less than number of files, and now DOP is higher, you could see very high QD IO in table scans now, versus low QD before. Just a guess.

April 1, 2014 3:39 PM
