THE SQL Server Blog Spot on the Web


Joe Chang

Storage Performance for SQL Server

There is a perplexing lack of material on overall storage performance strategy for database servers. Too much of the content is narrowly focused on tuning techniques to improve performance over an unknown base. First, storage system IO performance is well understood from the science and engineering perspective. Second, there is sufficient information on SQL Server IO patterns. It follows that the overall storage performance objective is to achieve the proper balance between capability and cost. Storage performance is achieved by distributing load across several disk controllers and a (very) large number of disk drives, after which the storage capacity is likely to be many times larger than the database. Any document that only discusses sizing storage to meet the database space requirement is written by someone who should not be giving advice on database performance. A key element in this storage performance strategy is keeping the amortized cost per disk reasonable. Testing and tuning is the final step, verifying that the storage system performance meets the design specification and requirements.


Storage Performance Requirements

OLTP and data warehouse applications have inherently incompatible objectives. A transactional database server needs sufficient storage performance to support peak volume processing while maintaining good or acceptable responsiveness. In recent years, most database servers have acquired sufficient memory that this is no longer an issue. The remaining major issue is handling transient IO surges that disrupt transaction processing. If it is not practical to achieve complete immunity, then tolerable limits to the magnitude and duration of the impact on transaction processing should be established. A data warehouse should be able to power through large queries with as much bandwidth and as many IOPS as warranted. Rules on disk performance counter limits can be disregarded. The storage performance was purchased, so one may as well use it.


There are several causes of transient IO surges. Examples include checkpoints, large queries, and other internal housekeeping activities. A log backup or transaction rollback can also severely degrade transaction processing because this breaks the normal pure sequential IO pattern. One common recommendation is to offload large queries to a separate report server. But it is not possible to completely suppress IO surges from other activities. Brute force capability is necessary in the storage system regardless. Once the storage system is configured with brute force capability, it may also be possible to run large queries without degrading transactions. (Large queries on transaction processing servers should be controlled by setting the max degree of parallelism to 1, or perhaps no more than 1 out of 4 processors.)


The key to preventing transaction processing disruptions in the IO subsystem is maintaining low latency reads from data files and very low latency writes to the log file, both of which are necessary to complete transactions. Even if the entire database resides in memory, a large query requiring temporary workspace could cause data to be flushed out of the buffer cache, requiring reads for subsequent queries. Writes to data files are not required to complete a transaction, as these are handled by the lazy writer and checkpoints. Reads from the log should only occur during transaction log backups and possibly during a very large transaction rollback, which should not occur in a properly designed transaction server during busy hours.


Single Hard Disk Drive Performance

A very brief discussion of disk drive performance characteristics is warranted. The difference between random and sequential IO is well known, if not fully appreciated. Recent generation 10,000rpm (10K) disk drives can sustain sequential transfer rates of 80MB/sec on the outer tracks and 40MB/sec on the inner tracks; for 15K drives, the figures are 125MB/s on the outer and 70MB/s on the inner tracks. Random IO performance is constrained by disk rotational speed and seek time. The disk media on a 10K drive completes a full rotation in 6ms, for an average rotational latency contribution of 3ms; the contribution is 2ms for a 15K drive. Current generation 10K drives have an average seek time of 4.6ms for reads and 5.2ms for writes. A 15K drive might have average seek times of 3.5ms for reads and 4.2ms for writes. For a small 8KB access, the transfer time is approximately 0.1ms. Other contributions are negligible, so the total latency for random reads is about 8ms on 10K and 5.6ms on 15K drives. This leads to the frequently quoted small block random IO performance of 125 IOPS for 10K and 175 IOPS for 15K drives.
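The service-time arithmetic above can be sketched as a quick calculation. The seek, rotation, and transfer figures are the nominal values quoted above, not measurements:

```python
# Rough single-disk random read IOPS from seek + rotation + transfer times.

def random_iops(rpm, avg_seek_ms, transfer_ms=0.1):
    """Estimate small-block (8KB) random read IOPS for one drive."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a rotation on average
    service_time_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
    return 1000.0 / service_time_ms

print(round(random_iops(10_000, 4.6)))  # ~130 IOPS for a 10K drive
print(round(random_iops(15_000, 3.5)))  # ~179 IOPS for a 15K drive
```

These come out slightly above the commonly quoted 125 and 175 IOPS figures, which build in a little margin.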


Implied in the above calculations is that disk IO is issued serially, one at a time, and that data is distributed across the entire disk. When data is restricted to a narrow section of the disk, the average seek time is lower: the short-stroke effect. When multiple disk IO requests are issued simultaneously, the disk re-orders the IO sequence (command queuing) for higher throughput at the expense of longer latency. If data is distributed over the entire disk, high queue depth operation quickly leads to very high latency, exceeding 20ms at queue depth 4 and over 100ms at queue depth 32 per disk. This is the underlying cause and effect behind the maximum queue depth 2 recommendation. A very important matter rarely discussed is the short-stroke effect combined with high queue depth operation. When only a tiny fraction of the disk is used for the active database, preferably 5% or less, latency increases much more slowly with queue depth, staying under 20ms at queue depth 8 and under 40ms at queue depth 16, and random IO performance can exceed 400 IOPS per disk. This characteristic is the key to maintaining strong transaction processing resiliency. It also runs counter to the arguments made for SANs that favor high disk space utilization.
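One crude way to see why short-stroking helps is to model seek time as a fixed settle component plus a travel component that grows roughly with the square root of the seek distance. The settle value and the square-root model here are illustrative assumptions, not drive specifications:

```python
import math

def avg_seek_ms(stroke_fraction, full_stroke_avg=4.6, settle=1.5):
    """Crude seek model: settle time plus sqrt-of-distance travel time."""
    travel = (full_stroke_avg - settle) * math.sqrt(stroke_fraction)
    return settle + travel

print(round(avg_seek_ms(1.00), 2))  # full-stroke 10K drive: 4.6 ms
print(round(avg_seek_ms(0.05), 2))  # short-stroked to 5%: ~2.19 ms
```

Cutting average seek time roughly in half, combined with command queuing over a narrow band of tracks, is what allows per-disk random IO rates well above the full-stroke figures.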


SQL Server IO

The SQL Server engine has internal mechanisms governing disk IO. The details concerning what operations and the set points are not disclosed as the strategy is probably still evolving. In the past, there were differences in IO strategy between Standard and Enterprise editions. The assumption was that Enterprise Edition would be employed on systems with high performance storage systems suitable for aggressive high queue depth IO operation.


One objective of the SQL Server IO strategy is to prevent flooding the disk queue during checkpoints, which would effectively shut down transaction processing. Some builds of SQL Server 2000 were not effective at this, while early builds of SQL Server 2005 were successful. In large queries, queue depth is throttled, but not enough to prevent excessive impact on transaction processing on weak storage systems, and perhaps too much for best data warehouse performance on strong storage systems. For execution plans with loop joins or bookmark lookups, SQL Server issues single IO in a serialized manner (one IO is issued, the next IO starts after the first is completed) for estimated row counts of 20 and lower. At estimated row counts of 30 and higher, SQL Server issues multiple concurrent IO, but does not drive the disks for maximum throughput. In any case, a query requiring disk IO involving 30-plus estimated rows can execute faster than the similar plan involving 20 estimated rows, because the disk IO is issued concurrently instead of one at a time serially. In a simple table scan operation, SQL Server issues 256K-1MB IO at low queue depth. Scans followed by a hash operation generate 8K IO, which is not the best choice. See the paper Microsoft SQL Server I/O Basics, Chapter 2, and the included references for more details.
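The serial-versus-concurrent effect can be illustrated with back-of-envelope numbers. The 8ms per-read latency, the 10-disk array, and the row counts here are assumptions for illustration only:

```python
import math

disk_latency_ms = 8.0  # one random read on a 10K drive (assumed)

# 20 estimated rows: IOs issued one at a time, each waits for the previous
serial_ms = 20 * disk_latency_ms

# 30 estimated rows: IOs issued concurrently across, say, 10 disks
disks = 10
concurrent_ms = math.ceil(30 / disks) * disk_latency_ms

print(serial_ms, concurrent_ms)  # 160.0 ms serial vs 24.0 ms concurrent
```

Even with 50% more rows to fetch, the concurrent plan finishes far sooner because the disk latencies overlap instead of adding up.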


Since SQL Server cannot know whether it is being used for transaction processing or data warehousing, it cannot automatically determine the best disk IO settings. There really should be a setting for the database usage model. In addition, it would be helpful to declare the number of physical disk drives in each array. This would allow the SQL Server engine to know how many concurrent IOs can be issued while maintaining low latency, without relying entirely on disk access time feedback. The SQL Server engine's internal IO governor is beneficial, but this alone cannot mask transient IO surges effectively on a weak disk system. If IO is not throttled enough, disk access latency can become excessive. If IO is throttled too much, the full IO performance capability of the disk system is not realized.


System and Controller IO

The major server systems today, based on the Intel 5100, 7300 or AMD/nVidia chipsets, all have very powerful IO capability with 28 or more PCI-E lanes, allowing for 7 or more x4 PCI-E slots. Intel reported that the 5100 chipset can sustain 3GB/s disk IO. The other two chipsets should be able to equal or exceed this. Many systems configure fewer slots with a mix of x4 and x8 widths. The issue is that the first generation PCI-E RAID controllers could not fully utilize the x8 PCI-E bandwidth (1.5-1.6GB/sec). It is possible the second generation PCI-E controllers can, but there is very little information on this. A single PCI-E RAID controller is capable of sustaining 800MB/sec in a x4 PCI-E slot.


To realize the full system IO capability, it is necessary to spread IO across as many controllers or HBAs as allowed by the number of independent PCI-E slots, and across as many disk drives as warranted. Technically, 8 drives could saturate a single PCI-E controller in pure sequential IO. In practice, few SQL operations will generate pure sequential IO. Common external storage enclosures hold up to 14-16 disks. The HP disk enclosures hold 10, 12, and 25 disks. Configuring one or two fully populated enclosures with a total of 12-30 disks for each PCI-E controller is a perfectly sound strategy.
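The arithmetic behind the 8-drive saturation figure, using a nominal 100MB/s sequential rate per drive (a rough midpoint of the outer and inner track rates quoted earlier, assumed for illustration):

```python
controller_mb_per_s = 800  # sustained by one RAID controller in a x4 PCI-E slot
drive_seq_mb_per_s = 100   # nominal per-drive sequential rate (assumed midpoint)

drives_to_saturate = controller_mb_per_s / drive_seq_mb_per_s
print(drives_to_saturate)  # 8.0 drives in pure sequential IO
```

Since real SQL workloads rarely sustain pure sequential IO, attaching 12-30 disks per controller, as suggested above, is not overkill.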


Let us examine the disk configurations employed in the TPC benchmarks. In a recent TPC-C publication, an HP ProLiant ML370G5 with 2 x Xeon X5460 quad core processors is configured with 7 P800 RAID controllers, 600 36GB 15K disk drives for data and 28 72GB 15K drives for log. There are 100 disk drives connected to each of 6 controllers for the data files. The total disk space on the 600 data drives without RAID overhead is 20TB. The size of the data files is 1.8TB, so the active portion of disk space used is about 10%. Many TPC-C publications are cost-performance optimized, meaning it is reasonable to assume each of the 600 data disks is loaded to approximately 200 IOPS, for 120K IOPS total. The TPC-C benchmark generates exceptionally high random IO activity, infrequently seen in actual transaction processing databases. To support this IO rate, a number of special steps are involved: a custom driver to bypass the Windows mini-port architecture, raw partitions instead of the NTFS file system, and disk performance counters disabled in both the operating system and SQL Server. In typical systems running below 10K IOPS, these special precautions have no impact. Raw partitions might be useful for disk activity on the order of 20K IOPS or higher. Disabling performance counters is not advisable in almost all cases.
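The arithmetic behind this configuration (unformatted capacity, so the totals come out slightly above the round figures in the text):

```python
data_disks = 600
raw_space_gb = data_disks * 36            # 21,600 GB, roughly the 20TB class
data_files_gb = 1800                      # 1.8TB of data files

active_fraction = data_files_gb / raw_space_gb  # ~0.083; about 10% after RAID overhead
total_iops = data_disks * 200                   # ~200 IOPS per short-stroked disk

print(round(active_fraction, 3), total_iops)  # 0.083 120000
```

The low active fraction is exactly the short-stroke regime described earlier, which is what makes 200 IOPS per disk sustainable.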


The TPC-H disk configurations are probably more meaningful to actual heavy duty critical database servers than the TPC-C configurations. The large DSS-type queries in the TPC-H benchmark are effectively heavy transient IO surges. There is some variation in TPC-H disk configurations depending on the size of the database and the amount of system memory configured. The HP report for the 1TB scale factor (1TB data, plus another 300GB of indexes) on a ProLiant DL585G2 with 32GB memory is a suitable representation of a heavy duty database server. The disk configuration is 8 RAID controllers (7 PCI-E and 1 PCI-X) connected to a total of 200 disks for data, 25 disks per controller. One controller also connects 2 disks for the OS and 4 disks for logs. The TPC-H workload does not generate significant log load; a heavy duty transaction server should have the logs on a dedicated controller.


Storage Performance Sizing

Not all applications require the full performance capability of the above TPC-H storage configuration. A basic level of capability can be achieved with 2 controllers and 2 external disk enclosures with 12-16 disks each. An intermediate capability level is achieved with 4 controllers and 4 disk enclosures with a total of 48-60 disk drives. Really serious capability requirements should employ 6-8 disk controllers and 120-200 disk drives. The cost for one controller and one enclosure populated with 73GB 15K drives (Dell MD1000) is about $7500. The basic capability cost using direct attach storage is then about $15K, $30K for intermediate capability, and $60-100K for heavy duty capability. The Dell MD3000, which supports clustering, is more expensive. SAN storage systems are much more expensive, with an amortized cost per disk exceeding $2000.
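The cost tiers above follow from one number: roughly $7,500 per controller-plus-enclosure unit (the per-unit price is the estimate quoted in the text):

```python
unit_cost = 7500  # one PCI-E RAID controller + one MD1000 with 73GB 15K drives

tiers = {"basic": 2, "intermediate": 4, "heavy duty": 8}
costs = {name: units * unit_cost for name, units in tiers.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:,}")
# basic: $15,000 / intermediate: $30,000 / heavy duty: $60,000
```

The upper end of the heavy duty range reflects larger enclosures and more disks per controller than the single-unit price assumes.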


Testing and Verification

For any given storage configuration, based on the number of controllers, number of disks, and RAID level, the theoretical random IOPS and sequential bandwidth can be calculated. The actual realized capability may be far below what should be possible. Always test the configuration. Storage vendors like to offer "advanced" capabilities without fully disclosing the performance overhead. Also be aware of disk partition alignment. It is important to test IO performance over the anticipated size of the database.



Modern server systems have considerable IO performance capability, but realizing it requires distributing load over multiple storage controllers and a great many disk drives. If the resulting configuration has far more storage capacity than needed by the primary database, this is an important performance bonus, providing the capability to handle heavy IO surges while maintaining reasonable IO latency. The IO governor in the SQL Server engine helps, but cannot hide weak storage performance. Always test to verify that the actual storage performance is in line with the expected performance level.



Disk Partition Alignment on RAID

Disk partition alignment is a problem caused by the Windows operating system on RAID stripes. For whatever reason from long ago, the OS allocates the first 63 sectors (of 512 bytes, or 31.5KB in all) for its own use, so the first partition starts on the 64th sector. A common stripe size on RAID arrays is 64KB. The problem is that with this offset, random 8KB reads or writes can straddle a stripe boundary, landing on two different stripes of the RAID array, i.e., two physical disks. In theory, a fraction of the 8K accesses then each turn into two disk accesses, degrading random 8K IO performance on the order of 10-20%. There are various reports of greater performance impact. In some storage systems, unaligned accesses bypass the cache, so the actual impact could be greater. Another possibility is that a particular storage system has become seriously fragmented over time, and the simple act of reformatting the file system has some benefit. Windows Server 2008 finally corrects this situation. See the presentation by Jimmy May on SQL Server disk partition alignment. There is also more information on disk partition alignment targeted at Microsoft Exchange.
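A small script can count how many 8KB accesses straddle a 64KB stripe boundary, and hence touch two disks, given the legacy 63-sector offset. This is a simple model of the geometry, not a measurement:

```python
SECTOR = 512
OFFSET = 63 * SECTOR   # 31.5KB legacy partition offset
STRIPE = 64 * 1024
IO_SIZE = 8 * 1024

# 8KB IOs aligned to 8KB boundaries within the partition; the pattern
# repeats every stripe, so checking one stripe's worth is sufficient.
positions = STRIPE // IO_SIZE
split = sum(
    1 for i in range(positions)
    if (OFFSET + i * IO_SIZE) % STRIPE + IO_SIZE > STRIPE
)
print(f"{split} of {positions} 8KB IOs span two disks")
```

Under the same geometry, a full 64KB extent read always spans two stripe units instead of one, so larger IOs suffer proportionally more from misalignment.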


SAN Cache

For some inexplicable reason, people seem to want to believe that a large cache on the SAN can compensate for a low disk count. A separate disk cache for a database engine is fundamentally a silly idea: the database engine is a disk cache specially optimized for databases. Suppose the database has an 8GB data buffer. What do you really expect to be in a 4GB SAN cache that is not already in the database engine's internal cache? If the SAN cache is larger than the database cache, i.e., system memory, then one should probably find another person to do system configuration. Accessing cache in the database engine is much more efficient than going out to the storage controller cache. All that having been said, a large write cache on the storage controller is beneficial in handling checkpoints and other write bursts. RAID controllers should probably be offered with 2-4GB of cache, not the 256-512MB typically offered.

Published Tuesday, March 4, 2008 1:46 PM by jchang




Joe Chang said:

I should have included a pointer to Andy's blog, which lists key sources from MS. Jimmy's handle is AspiringGeek.

March 5, 2008 9:24 AM

Jeff M said:

Joe, thank you so much for taking the time to write this post!  

March 5, 2008 9:29 AM

Linchi Shea said:

> Disk partition alignment is a problem caused by the Windows operating system in RAID stripes.

Although RAID stripes can add complication to the disk partition misalignment issue, the two are not necessarily related.

> A separate disk cache for a database engine is fundamentally a silly idea.

I'm not sure I agree. If you look at the I/O path hierarchy, there is also a cache hierarchy. If you don't find something in the database buffer pool, it still helps to find it in the SAN cache. But the key benefit is really with writes.

March 5, 2008 11:53 AM

jchang said:

yes, partition alignment and RAID are separate topics, but this is one issue,

partitions not aligned to 32K or 64K boundaries are not an issue on single disks or, in theory, RAID 1, because an IO request for any given block goes to only 1 disk

partition alignment is only an issue in RAID stripes, including RAID 5, 10, even 0, because an IO that should have only gone to 1 disk, ends up going to 2

I had to cut a lot of details and supporting data out of this post to keep it reasonably short, to be more clear, a read cache is ineffective.

First, consider a data buffer in system memory larger than the storage controller cache. In theory, there should be nothing in the storage cache that is not in the database cache, with exceptions for differences in caching algorithms. But mostly, the storage cache duplicates what's already in system memory. This was the double buffering subject from long ago.

Next, if storage cache is larger than database cache, yes, IO does get handled by the cache. But is this a good idea? First, the CPU cost (on the database server) of going out to storage is much higher than if it were in-process, i.e., in the SQL Server instance. Second, some memory is wasted because of double buffering.

Third, caching is not free of impact. That is, if you built a storage system with and without cache, the overhead of caching will degrade the performance of IOs that actually go to disk.

Look at the few TPC-C publications that use a SAN for storage. Read the full disclosure carefully. Note that cache is disabled on the data partitions.

All that having been said, I will say an Oracle guy who does a lot of EMC work says that a very tiny 2MB read cache per LUN is recommended because this enables read-ahead to work; use the rest for write cache (on transaction apps only)

I think EMC CLARiiON defaults to read-ahead disabled, because everywhere I have been, sequential performance really stinks, 10-12MB/sec per disk, which is what one would expect if read-ahead is disabled

March 5, 2008 2:36 PM

Linchi Shea said:

I agree that in the simplest case (i.e. with a single disk), read cache doesn't benefit read-ahead. This basically says that read cache doesn't benefit sequential reads. I'll have to respond in more detail with a blog post because I can't post any chart in the comment section.

March 5, 2008 3:46 PM

jchang said:

the Oracle guy actually said read cache does benefit read ahead, just that you only need 2MB per LUN, not 4GB.

I am saying read cache in the storage controller is not effective in reducing disk load, as this should be done in system memory by the database engine

March 5, 2008 4:02 PM

Linchi Shea said:

Regardless of what the Oracle guy said, I have reproducible data to show that read cache is not good for sequential reads. I'm not claiming this is universally true, but I have observed it in my own tests.

March 5, 2008 5:56 PM

jchang said:

my fault, I thought you were saying the opposite, I have not tested the tiny 2M read cache myself, but it is very apparent that a cache does add overhead, hence read cache should be kept small or disabled. the main point is that one should not expect storage cache to reduce the spindle count & bandwidth

March 5, 2008 6:15 PM

Paul Nielsen said:

> A separate disk cache for a database engine is fundamentally a silly idea.

I think it depends on the transaction rate and database size. One client of mine just increased their performance by increasing the SAN cache from 80 GB to 120 GB. The database server has 48GB of RAM. So at some scales the SAN cache is less silly.

March 5, 2008 9:58 PM

jchang said:

then about 40GB+ of the SAN cache is wasted; they should have gone to a server that supports 128 or even 256GB memory. It also depends how many disks there are. At this level, if they have fewer than 200-400 disks over 8 FC ports, they are probably under-configured storage-wise

March 5, 2008 10:51 PM

Linchi Shea said:

In the next several blog posts, I’ll share with you some empirical results concerning the performance

March 24, 2008 12:20 PM

Glenn Berry said:

Nice post Joe.  Don't you think the new TPC-E benchmark is a little more realistic than TPC-C in terms of disk I/O requirements?

March 25, 2008 2:09 PM

jchang said:

The TPC-C IO load was seriously getting out of hand, to the point of distorting the server system architecture required to run the benchmark. Do you remember the original Intel Merced platform that did not see the light of day? It had 24 PCI slots. This is unavoidable in a write (IUD) intensive benchmark, even by increasing the tpm-C to warehouse ratio. Writes must be written to disk eventually.

TPC-E significantly reduced IO load probably by increasing locality of hot data, rather than data size and by shifting to a much more read intensive query set, which can be cached to eliminate IO.

I am not inclined to think either is realistic, as 200-plus disks on a 2 x Xeon 5460 is bigger than anything I have seen. I guess real apps tend to have highly CPU intensive queries, so maybe 100 disks on a 2-socket might be better

March 28, 2008 3:35 PM

Magnus said:

Just wanted to say that this is the best post on the subject that I've read in a long while.

April 13, 2008 12:46 PM

John Couch said:

Ok, I have a question. I see Joe and Linchi going back and forth about disk alignment and RAID. Yes they are two different things, and something about not needing to align the disks on RAID 1? Or if you are writing to a single drive? I am lost. The article is great, I just want to understand better. Here are the questions I posed on another forum, and have yet to get an answer.

I recently read some information regarding aligning the disk for use with SQL Server. Meaning, since SQL Server handles data in chunks of 64KB, the disk drive SQL Server writes to should be configured to work in 64KB chunks. The articles I am referring to are listed below, and I am wondering what other people's thoughts are. I am also interested in understanding how to determine whether my disks are aligned and/or how to align them. Based on the articles, I see my offset is 32KB on my drives, but I am also trying to work with the formula in the Microsoft KB article to determine how/where exactly I become aligned. The formula is:

((Partition offset) * (Disk sector size)) / (Stripe unit size)


1. Does SQL Server read/write data to disk in 64KB (Extent)? Or is it 8KB (Page) Level?

2. Where do I find the values listed in the formula above? I am a little confused how they got the values to determine where the alignment occurred.

3. Do you only do the alignment on drives where SQL Server is writing data/log files?

4. Does setting your offset to 64KB affect the way the OS stores data on the drives? Meaning, a 65KB file will actually use 128KB? And in that case the remaining 63KB is wasted?

The articles I am referencing are listed below.

Based on this article, I have these questions:

1. Does this only matter on RAID volumes? Or, assuming I am only writing to a single disk, should I bother aligning??

2. Can you align the C:\ drive in a similar way?

3. Do all the OS's 2000, 2003 and 2008 allocate the 63 sectors??? for its own use?

Thanks in advance for your help.


September 15, 2008 5:34 PM

jchang said:

Questions 1-4:

1. If the execution plan shows key lookups, or loop joins, which require disk IO, more than likely it will be 8K read IO, except when the cache is cold, in which case it will be 64K. Scans and large index seek ranges will probably generate 64K or larger IO; I have seen 384K. There are situations that generate 8K sequential. Writes are whatever is required: if you modified 8K pieces, then 8K; if you modified entire 64K extents, then 64K.

2. Partition offset: unless you are on Windows 2008, if you created the partitions from the GUI, as opposed to diskpart, you are at a 31.5K offset, which is seriously bad news for RAID 0, 5, and 10. Disk sector size in Windows is always 512 bytes for current and previous generations (hard disks can do bigger, hd vendors want MS to go bigger). The RAID stripe size is set in the RAID controller, probably 64K by default.

3. All RAID 0, 5, and 10 arrays should get 64K alignment. Single disk JBOD and RAID 1 (2 disks mirrored) don't care, as this only affects stripes. Also, if the OS boot drive is not aligned, maybe you don't want to reinstall, so keep the big important data and log off it. It is mostly data that this matters for. Get the big data file partitions aligned. Master, model, pubs, adventure works etc should have very little activity, so who cares.

4. No influence on space efficiency; it reduces IO for disk access. Keep the NTFS cluster size at the default 4K if you want space efficiency for many small files. If you have big partitions for just a few SQL related files, don't worry; i.e., file server and SQL Server should be separate.
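The KB formula from the question amounts to a divisibility check: a partition is aligned when its byte offset is an exact multiple of the stripe size. The sector and stripe sizes below are the typical values discussed in this thread:

```python
def is_aligned(offset_sectors, sector_bytes=512, stripe_bytes=64 * 1024):
    """True when the partition's byte offset is a multiple of the stripe size."""
    return (offset_sectors * sector_bytes) % stripe_bytes == 0

print(is_aligned(63))    # False: legacy 63-sector (31.5KB) offset
print(is_aligned(128))   # True: 64KB offset set via diskpart
print(is_aligned(2048))  # True: Windows 2008's 1MB default
```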

article questions:

1. correct

2. you can align C, but save this for the next OS install

3. 2000 and 2003 do the goofy 31.5K (from the GUI), 2008 does 1MB align

Try to find the powerpoint: Partition Alignment for SQL Server.

Look up cscript vbs syntax to make sure the script is correct, particularly with respect to carriage returns.

Save the following text as GetPartitionOffsets.vbs


strComputer = "."

Set objWMIService = GetObject("winmgmts:\\" & strComputer & "\root\CIMV2")

Set colItems = objWMIService.ExecQuery( _

   "SELECT * FROM Win32_DiskPartition",,48)

'Wscript.Echo "-----------------------------------"

Wscript.Echo "Win32_DiskPartition instance"

For Each objItem in colItems

   Wscript.Echo "DiskIndex: " & objItem.DiskIndex & "  -- Name: " & objItem.Name & "  --  StartingOffset: " & objItem.StartingOffset

Next


Execute via command line

C:\>cscript GetPartitionOffsets.vbs

September 16, 2008 11:10 AM

Russel McDonald said:

Hi Joe! How have you been! I miss working with you at Quest, but anyhow I am at GoldenGate now, replicating SQL Server databases. Quite fun!

Anyhow, to get right to the point, if you can shed light on this question of mine I will owe you a favor.

We have been struggling with an issue here for over a week that I finally resolved I think is a dirty cache in a RAID 1 or SAN driver. The symptom is that our logreader reads from the SQL Server active ldf right while SQL server is writing to it, and we end up getting a buffer that has an older part of an older VLF in the first sector(s) in our buffer followed by the new VLF they are writing in the later sectors in our buffer. They do write over the entire area of course, but we seem to miss the first sector(s) due to split IOs likely. So we have early data FIRST followed by new data later in our buffer. It is non-buffered IO, HOWEVER, (1) we call DuplicateHandle on the same handle SS has, and (2) customer has  RAID 1 on a SAN. And here is the real problem...we rebuild the ldf layout (read in each VLF header in the ldf) then go back and re-read the spot that we had the "split IO" on a moment earlier (3 or 4 seconds at least) (and even verify that SS is now writing in yet the NEXT VLF) and STILL get the same problem. So my question is, is it possible that a split IO read was in progress on the SAN, and at the same time before it completed a SQL Server write came in, did not see any kernel buffer to update or make dirty yet, so passed the write through, and then finally the read completed leaving a dirty cache that the driver did not know was dirty? What do you think?

November 25, 2008 11:35 AM

jchang said:

Russel, not sure, send me an email. I do not like the idea of constantly reading the ldf. I do recall an EMC fellow telling me of firmware problems of this sort. Try making a log of what you are reading, and compare it with the transaction log backup; that should discern any metadata SQL uses

November 28, 2008 6:08 PM

Denis Gobo said:

Wow, it has been already a year since I wrote A year in review, The 21 + 1 best blog posts on SQLBlog

December 31, 2008 10:37 AM


Joe Chang said:

Storage has changed dramatically over the last three years driven by SSD developments. Most of the key

April 2, 2013 6:33 PM

Mathiau said:

But the same still applies for alignment for those not using all SSD.

April 19, 2015 3:47 PM



About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
