There is a perplexing lack of material on overall storage performance strategy for database servers. Too much of the available content focuses narrowly on tuning techniques to improve performance over an unknown base. First, storage system IO performance is well understood from a science and engineering perspective. Second, there is sufficient information on SQL Server IO patterns. It follows that the overall storage performance objective is to achieve the proper balance between capability and cost. Storage performance is achieved by distributing load across several disk controllers and a (very) large number of disk drives. As a result, the storage capacity is likely to be many times larger than the database. Any document that only discusses sizing storage to meet the database space requirement is written by someone who should not be giving advice on database performance. A key element in this storage performance strategy is keeping the amortized cost per disk reasonable. Testing and tuning is the final step in verifying that the storage system performance meets the design specification and requirements.
Storage Performance Requirements
OLTP and data warehouse applications have inherently incompatible objectives. A transactional database server needs sufficient storage performance to support peak volume processing while maintaining good or acceptable responsiveness. In recent years, most database servers have acquired sufficient memory that this is no longer an issue. The remaining major issue is handling transient IO surges that disrupt transaction processing. If complete immunity is not practical, then tolerable limits on the magnitude and duration of the impact on transaction processing should be established. A data warehouse, by contrast, should be able to power through large queries with as much bandwidth and IOPS as warranted. Rules on disk performance counter limits can be disregarded. The storage performance was purchased, so one may as well use it.
There are several causes of transient IO surges. Examples include checkpoints, large queries, and other internal housekeeping activities. A log backup or transaction rollback can also severely degrade transaction processing because it breaks the normally pure sequential IO pattern of the log. One common recommendation is to offload large queries to a separate report server, but it is not possible to completely suppress IO surges from other activities, so brute force capability is necessary in the storage system regardless. Once the storage system is configured with brute force capability, it may also be possible to run large queries without degrading transactions. (Large queries on transaction processing servers should be controlled by setting the max degree of parallelism to 1, or perhaps no more than 1 out of 4 processors.)
The key to preventing transaction processing disruptions in the IO subsystem is maintaining low latency reads from data files and very low latency writes to the log file, both of which are necessary to complete transactions. Even if the entire database resides in memory, a large query requiring temporary workspace could cause data to be flushed out of the buffer cache, requiring reads for subsequent queries. Writes to data files are not required to complete a query, as these are handled asynchronously by the lazy writer. Reads from the log should only occur during transaction log backups and possibly during a very large transaction rollback, which should not occur in a properly designed transaction server during busy hours.
Single Hard Disk Drive Performance
A very brief discussion of disk drive performance characteristics is warranted. The difference between random and sequential IO is well known, if not fully appreciated. Recent generation 10,000rpm (10K) disk drives can sustain sequential transfer rates of 80MB/sec on the outer tracks and 40MB/sec on the inner tracks. For 15K drives, the figures are 125MB/s on the outer tracks and 70MB/s on the inner tracks. Random IO performance is constrained by disk rotational speed and seek time. The disk media on a 10K drive completes a full rotation in 6ms, for an average rotational latency contribution of 3ms, and 2ms for a 15K drive. Current generation 10K drives have an average seek time of 4.6ms for reads and 5.2ms for writes. A 15K drive might have average seek times of 3.5ms for reads and 4.2ms for writes. For a small 8KB access, the transfer time is approximately 0.1ms. Other contributions are negligible, so the total latency for random reads is about 8ms on 10K drives and 5.6ms on 15K drives. This leads to the frequently quoted small block random IO performance of 125 IOPS for 10K and 175 IOPS for 15K drives respectively.
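The latency arithmetic above can be written out as a simple model. The seek times and transfer time are the "typical" figures quoted in the text, not measurements from any particular drive; the quoted 125/175 IOPS figures come from rounding the latency up slightly.

```python
# Back-of-envelope model of small-block random read latency, using the
# representative figures quoted above (assumed, not measured values).
def random_read_iops(rpm, avg_seek_ms, transfer_ms=0.1):
    """Return (latency in ms, IOPS) for serialized single random reads."""
    rotational_latency_ms = 0.5 * 60000.0 / rpm  # half a rotation, on average
    total_ms = rotational_latency_ms + avg_seek_ms + transfer_ms
    return total_ms, 1000.0 / total_ms

for rpm, seek_ms in [(10000, 4.6), (15000, 3.5)]:
    latency, iops = random_read_iops(rpm, seek_ms)
    print(f"{rpm}rpm: {latency:.1f}ms per 8KB random read, ~{iops:.0f} IOPS")
```

The 10K case works out to about 7.7ms (roughly 130 IOPS before rounding down to the conventional 125), and the 15K case to 5.6ms (about 178 IOPS, conventionally quoted as 175).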
Implied in the above calculations is that disk IO is issued serially, one at a time, and that data is distributed across the entire disk. When data is restricted to a narrow section of the disk, the average seek time is lower: the short-stroke effect. When multiple disk IO requests are issued simultaneously, the disk re-orders the IO sequence (command queuing) for higher throughput at the expense of longer latency. If data is distributed over the entire disk, high queue depth operation quickly leads to very high latency, exceeding 20ms at queue depth 4 and over 100ms at queue depth 32 per disk. This is the cause and effect underlying the common recommendation of a maximum queue depth of 2 per disk. A very important matter rarely discussed is the short-stroke effect combined with high queue depth operation. When only a tiny fraction of the disk is used for the active database, preferably 5% or less, latency increases much more slowly with queue depth (less than 20ms at queue depth 8 and less than 40ms at queue depth 16) and random IO performance can exceed 400 IOPS per disk. This characteristic is the key to maintaining strong transaction processing resiliency. It also runs counter to the arguments made for SANs in allowing high disk space utilization.
SQL Server IO
The SQL Server engine has internal mechanisms governing disk IO. The details concerning which operations are governed, and what the set points are, have not been disclosed, as the strategy is probably still evolving. In the past, there were differences in IO strategy between Standard and Enterprise editions. The assumption was that Enterprise Edition would be employed on systems with high performance storage suitable for aggressive high queue depth IO operation.
One objective of the SQL Server IO strategy is to prevent flooding the disk queue during checkpoints, which would effectively shut down transaction processing. Some builds of SQL Server 2000 were not effective at this, while early builds of SQL Server 2005 were successful. In large queries, queue depth is throttled, but not enough to prevent excessive impact on transaction processing on weak storage systems, and perhaps too much for best data warehouse performance on strong storage systems. For execution plans with loop joins or bookmark lookups, SQL Server issues single IO in a serialized manner (one IO is issued, the next IO starts after the first is completed) for estimated row counts of 20 and lower. At estimated row counts of 30 and higher, SQL Server issues multiple concurrent IO, but does not drive the disks for maximum throughput. In any case, a query requiring disk IO involving 30-plus estimated rows can execute faster than the similar plan involving 20 estimated rows, because the disk IO is issued concurrently instead of one at a time serially. In a simple table scan operation, SQL Server issues 256K-1MB IO at low queue depth. Scans followed by a hash operation generate 8K IO, which is not the best choice. See the paper Microsoft SQL Server I/O Basics, Chapter 2, and included references for more details.
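The counterintuitive point that 30 estimated rows can finish before 20 can be illustrated with a toy timing model. The 8ms per-read latency and the concurrency of 4 below are assumptions chosen for illustration, not SQL Server internals:

```python
# Toy model: serialized vs concurrent random reads (illustrative only;
# the 8ms latency and the in-flight count of 4 are assumptions).
latency_ms = 8.0                        # one random read on a 10K drive
serial_20 = 20 * latency_ms             # 20 reads issued one at a time
concurrent_30 = (30 / 4) * latency_ms   # 30 reads with ~4 kept in flight
print(f"20 serial reads: {serial_20:.0f}ms, "
      f"30 concurrent reads: {concurrent_30:.0f}ms")
```

Under these assumptions the 20-row plan spends 160ms waiting on disk while the 30-row plan, despite issuing more IO, completes in roughly 60ms, because the array services several requests at once.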
Since SQL Server cannot know whether it is being used for transaction processing or data warehousing, it cannot automatically determine the best disk IO settings. There really should be a setting for the database usage model. In addition, it would be helpful to declare the number of physical disk drives in each array. This would allow the SQL Server engine to know how many concurrent IOs can be issued while maintaining low latency, without relying entirely on disk access time feedback. The SQL Server engine's internal IO governor is beneficial, but it alone cannot mask transient IO surges effectively on a weak disk system. If IO is not throttled enough, disk access latency can become excessive; if IO is throttled too much, the full IO performance capability of the disk system is not realized.
System and Controller IO
The major server systems today based on the Intel 5100, 7300 or AMD/nVidia chipsets all have very powerful IO capability, with 28 or more PCI-E lanes allowing for 7 or more x4 PCI-E slots. Intel reported that the 5100 chipset can sustain 3GB/s of disk IO. The other two chipsets should be able to equal or exceed this. Many systems configure fewer slots with a mix of x4 and x8 widths. The issue is that the first generation PCI-E RAID controllers could not fully utilize the x8 PCI-E bandwidth (1.5-1.6GB/sec). It is possible that the second generation PCI-E controllers can, but there is very little information on this. A single PCI-E RAID controller is capable of sustaining 800MB/sec in a x4 PCI-E slot.
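The slot and chipset numbers above can be reconciled with simple arithmetic. PCI-E 1.x carries 2.5Gbit/s per lane, and 8b/10b encoding leaves 2.0Gbit/s, i.e. roughly 250MB/s, of payload bandwidth per lane per direction:

```python
# PCI-E 1.x payload bandwidth per direction: 2.5Gbit/s raw per lane,
# 8b/10b encoding leaves 2.0Gbit/s = ~250MB/s of payload per lane.
def pcie1_mb_per_sec(lanes):
    return lanes * 250

x4_theoretical = pcie1_mb_per_sec(4)   # 1000MB/s, vs ~800MB/s sustained
x8_theoretical = pcie1_mb_per_sec(8)   # 2000MB/s, vs the 1.5-1.6GB/s cited
# Seven x4 controllers at the ~800MB/s each actually sustains would exceed
# the ~3GB/s the 5100 chipset was reported to sustain, so at full build-out
# the chipset, not the slots, is the ceiling:
aggregate_mb = min(7 * 800, 3000)
print(x4_theoretical, x8_theoretical, aggregate_mb)
```

This suggests that with seven controllers, the reported 3GB/s chipset limit, rather than per-slot bandwidth, bounds total disk throughput.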
To realize the full system IO capability, it is necessary to spread IO across as many controllers or HBAs as allowed by the number of independent PCI-E slots, and across as many disk drives as warranted. Technically, 8 drives could saturate a single PCI-E controller in pure sequential IO. In practice, few SQL operations will generate pure sequential IO. Common external storage enclosures hold up to 14-16 disks. The HP disk enclosures hold 10, 12, and 25 disks. Configuring one or two fully populated enclosures with a total of 12-30 disks for each PCI-E controller is a perfectly sound strategy.
Let us examine the disk configurations employed in the TPC benchmarks. In a recent TPC-C publication, an HP ProLiant ML370G5 with 2 x Xeon X5460 quad core processors is configured with 7 P800 RAID controllers, 600 36GB 15K disk drives for data and 28 72GB 15K drives for logs. There are 100 disk drives connected to each of 6 controllers for the data files. The total disk space on the 600 data drives without RAID overhead is approximately 21TB. The size of the data files is 1.8TB, so the active portion of disk space used is under 10%. Many TPC-C publications are cost-performance optimized, meaning it is reasonable to assume each of the 600 data disks is loaded to approximately 200 IOPS, for 120K IOPS total. The TPC-C benchmark generates exceptionally high random IO activity, infrequently seen in actual transaction processing databases. To support this IO rate, a number of special steps are involved. A custom driver to bypass the Windows mini-port architecture is one. Raw partitions are used instead of the NTFS file system. Disk performance counters are disabled in both the operating system and SQL Server. In typical systems running below 10K IOPS, these special precautions have no impact. Raw partitions might be useful for disk activity on the order of 20K IOPS or higher. Disabling performance counters is not advisable in almost all cases.
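The configuration arithmetic above is worth sanity-checking, since the raw capacity and active fraction drive the short-stroke argument made earlier:

```python
# Sanity check of the TPC-C storage arithmetic quoted above.
data_disks, gb_per_disk, iops_per_disk = 600, 36, 200
raw_gb = data_disks * gb_per_disk        # raw capacity before RAID overhead
active_fraction = 1800 / raw_gb          # 1.8TB of data files
total_iops = data_disks * iops_per_disk  # assumed per-disk load
print(f"raw: {raw_gb / 1000:.1f}TB, active: {active_fraction:.0%}, "
      f"total IOPS: {total_iops:,}")
```

The 600 data drives hold 21.6TB raw, of which the 1.8TB of data files occupy about 8%, squarely in the short-stroke regime, and the assumed per-disk load yields the 120K IOPS total.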
The TPC-H disk configurations are probably more meaningful to actual heavy duty critical database servers than the TPC-C configurations. The large DSS-type queries in the TPC-H benchmark are effectively heavy transient IO surges. There is some variation in TPC-H disk configurations depending on the size of the database and the amount of system memory configured. The HP report at the 1TB scale factor (1TB of data, plus another 300GB of indexes) on a ProLiant DL585G2 with 32GB memory is a suitable representation of a heavy duty database server. The disk configuration is 8 RAID controllers (7 PCI-E and 1 PCI-X) connected to a total of 200 disks for data, 25 disks per controller. One controller also connects 2 disks for the OS and 4 disks for logs. The TPC-H workload does not generate significant log load. A heavy duty transaction server should have the logs on a dedicated controller.
Storage Performance Sizing
Not all applications require the full performance capability of the above TPC-H storage configuration. A basic level of capability can be achieved with 2 controllers and 2 external disk enclosures with 12-16 disks each. An intermediate capability level is achieved with 4 controllers and 4 disk enclosures with a total of 48-60 disk drives. Really serious capability requirements should employ 6-8 disk controllers and 120-200 disk drives. The cost for one controller and one enclosure populated with 73GB 15K drives (Dell MD1000) is about $7,500. The basic capability cost using direct attach storage is then about $15K, $30K for intermediate capability, and $60-100K for heavy duty capability. The Dell MD3000, which supports clustering, is more expensive. SAN storage systems are much more expensive, with an amortized cost per disk exceeding $2,000.
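The tier costs follow directly from the per-unit figure. The sketch below uses the ~$7,500 controller-plus-enclosure figure quoted above as an assumed unit price; heavy duty configurations often attach more enclosures than controllers, which pushes the total toward the $100K end of the quoted range:

```python
# Rough direct-attach cost model (assumed unit price from the text:
# ~$7,500 for one controller plus one populated 73GB 15K enclosure).
UNIT = 7500
for tier, units in [("basic", 2), ("intermediate", 4), ("heavy duty", 8)]:
    print(f"{tier}: {units} controller+enclosure units ~ ${units * UNIT:,}")
```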
Testing and Verification
For any given storage configuration, based on the number of controllers, number of disks, and RAID level, the theoretical random and sequential IO performance can be calculated. The actual realized capability may be far below what should have been possible. Always test the configuration. Storage vendors like to offer "advanced" capabilities without fully disclosing the performance overhead. Also be aware of disk partition alignment. It is important to test IO performance over the anticipated size of the database.
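Purpose-built tools are the right way to run such tests, but the core measurement idea is simple enough to sketch. The fragment below times random 8KB reads against a scratch file; it is a minimal illustration only, since a real test must target a file spanning the anticipated database size and defeat OS caching (e.g., unbuffered IO), which this sketch does not do:

```python
import os
import random
import tempfile
import time

def measure_random_reads(path, file_size, block=8192, count=100):
    """Time `count` random reads of `block` bytes; return avg latency in ms."""
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.lseek(fd, random.randrange(0, file_size - block, block), os.SEEK_SET)
            os.read(fd, block)
        return (time.perf_counter() - start) / count * 1000.0
    finally:
        os.close(fd)

# Demo against a small scratch file; numbers from a file this small mostly
# reflect the OS cache, not the disks.
size = 8 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(size))
    scratch = f.name
avg_ms = measure_random_reads(scratch, size)
print(f"average random 8KB read latency: {avg_ms:.3f}ms")
os.remove(scratch)
```

Comparing the measured latency and IOPS against the per-disk figures discussed earlier, multiplied out by the array size, is what verifies the configuration actually delivers its theoretical capability.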
Modern server systems have considerable IO performance capability. Realizing it requires distributing load over multiple storage controllers and a great many disk drives. If the resulting configuration has far more storage capacity than the primary database needs, this excess is in fact an important performance bonus, providing the capability to handle heavy IO surges while maintaining reasonable IO latency. The IO governor in the SQL Server engine helps, but cannot hide weak storage performance. Always test to verify that the actual storage performance is in line with the expected performance level.
Disk Partition Alignment on RAID
Disk partition alignment is a problem caused by the Windows operating system on RAID stripes. For historical reasons, the OS allocates the first 63 sectors (of 512 bytes each, or 31.5KB in all) for its own use, so the first partition starts on the 64th sector. A common stripe size on RAID arrays is 64KB. The problem is that with this 31.5KB offset, some random 8KB reads or writes straddle a stripe boundary and therefore touch two stripes of the RAID array, i.e., two physical disks. In theory, every 8 accesses for 8KB blocks result in 9 disk accesses, for a degradation of roughly 11% in random 8K IO performance. There are various reports of greater performance impact. In some storage systems, unaligned accesses bypass the cache, so the actual performance impact could be greater. Another possibility is that a particular storage system has become seriously fragmented over time, and the simple act of reformatting the file system has some benefit. Windows Server 2008 finally corrects this situation. See the presentation by Jimmy May (firstname.lastname@example.org) on SQL Server Disk Partition Alignment. There is more information on disk partition alignment targeted at Microsoft Exchange (http://www.msexchange.org/tutorials/Disk-Geometry.html).
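The geometry is easy to check by enumerating one full 64KB cycle of 8KB page positions at the 63-sector offset and counting how many land across a stripe boundary:

```python
# Which 8KB pages straddle a 64KB stripe boundary when the partition
# starts at sector 63 (a 31.5KB offset)?
STRIPE, PAGE, OFFSET = 64 * 1024, 8 * 1024, 63 * 512

def straddles(file_off):
    """True if an 8KB page at this file offset spans two RAID stripes."""
    disk_off = file_off + OFFSET
    return disk_off // STRIPE != (disk_off + PAGE - 1) // STRIPE

# One full cycle of 8 page positions repeats across the whole partition.
crossing = sum(straddles(n * PAGE) for n in range(8))
print(f"{crossing} of every 8 page accesses splits into two disk accesses")
```

Exactly one page position per 64KB cycle splits, which is where the 9-disk-accesses-per-8 figure comes from; with the partition properly aligned (OFFSET set to 0 or a multiple of the stripe size), no page splits.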
For some inexplicable reason, people seem to want to believe that a large cache on the SAN can compensate for a low disk count. A separate disk cache for a database engine is fundamentally a silly idea. The database engine is itself a disk cache, specially optimized for databases. Suppose the database has an 8GB data buffer. What do you really expect to be in a 4GB SAN cache that is not already in the database engine's internal cache? If the SAN cache is larger than the database cache, i.e., system memory, then one should probably find another person to do system configuration. Accessing cache in the database engine is much more efficient than going out to the storage controller cache. All this having been said, a large write cache on the storage controller is beneficial in handling checkpoints and other write bursts. RAID controllers should probably be offered with 2-4GB of cache, not the 256-512MB typically offered.