THE SQL Server Blog Spot on the Web


Joe Chang

IO Cost Structure – Anticipating SSD arrays

An observant person has probably noticed that SQL queries requiring disk reads not only have longer duration but also higher CPU times. It is not hard then to deduce that disk access (for both HDD and SSD), which involves the OS performing an IO call, the SQL Server process finding a place in the buffer cache for the data pages, and possibly evicting current pages from the buffer cache,  has significant CPU cost. What then is the cost structure of SQL Server disk IO? Exposing this is not entirely simple, as it is necessary to generate specific IO operations in a repeatable manner.


A while back, I ran a series of tests to expose the CPU cost of various SQL operations when all data was resident in memory. At the time, it was not a pressing matter to examine disk IO cost. If disk IO was necessary, the duration due to disk latency was going to be much higher than the CPU time portion in most situations, the exception being the few environments with very powerful storage systems, typically with several hundred disk drives (properly configured and verified).


Now solid state disks (SSD) suitable for enterprise (or server storage) systems are on the verge of being practical for broad deployment. It is actually possible for a current generation 4-way server with a 20-40 SSD array (anyone thinking of a single SSD should let others do the thinking) to reach disk IO performance levels limited by CPU rather than access latency. So it is now important to have an understanding of disk IO cost structure, along with the implications for the cost based optimizer (CBO) concerning IO performance characteristics.



SQL Server provides several methods for collecting information on CPU and duration. These include 1) Profiler and Trace, 2) the SET STATISTICS TIME option, and for SQL Server 2005 and later, 3) the DMV dm_exec_query_stats. IO at the SQL Server level can be obtained from 1) the DMV dm_io_virtual_file_stats (equivalent to fn_virtualfilestats) and 2) the SET STATISTICS IO option. It is always advisable to compare information from SQL Server with Operating System performance counters as a sanity check. The DMV method was found to be useful for averaging multiple calls, but would not properly report parallel execution plans, so all tests here are for non-parallel plans (MAXDOP 1).
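As a sketch of the DMV methods above, the per-statement CPU, duration and physical reads can be pulled from dm_exec_query_stats, and file-level IO with read latency from dm_io_virtual_file_stats:

```sql
-- Average CPU, duration and physical reads per execution, for cached plans
SELECT TOP (20)
    qs.execution_count,
    qs.total_worker_time   / qs.execution_count AS avg_cpu_us,
    qs.total_elapsed_time  / qs.execution_count AS avg_duration_us,
    qs.total_physical_reads / qs.execution_count AS avg_phys_reads,
    SUBSTRING(st.text, qs.statement_start_offset/2 + 1, 100) AS stmt_start
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;

-- File-level IO: reads, bytes read and cumulative read stall per database file
SELECT DB_NAME(vfs.database_id) AS db, vfs.file_id,
    vfs.num_of_reads, vfs.num_of_bytes_read, vfs.io_stall_read_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs;
```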

In any case, the reported CPU and duration cost is for the entire query, not a specific portion. So to expose the disk component, it is necessary to compare the in-memory and the not in-memory (disk) results.

The first step is to build a set of test tables and SQL queries, exposing as many of the dependencies as possible, including total row length, data types, variable length columns, lock levels and so on. The next step is to ensure that measurements are repeatable.
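A minimal harness for the cold-versus-warm comparison might look like the following (the table name is hypothetical; note DBCC DROPCLEANBUFFERS only evicts clean pages, so a CHECKPOINT is needed first to flush dirty pages):

```sql
-- Cold cache run: flush dirty pages, then empty the buffer cache
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT COUNT(*) FROM dbo.TestTable OPTION (MAXDOP 1);  -- reads from disk

-- Warm cache run: repeat immediately; all pages are now in memory
SELECT COUNT(*) FROM dbo.TestTable OPTION (MAXDOP 1);
-- The CPU time difference between the two runs exposes the disk IO CPU cost
```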


Test System and Environment

The test system is a Dell PowerEdge 2900 with 2 x QC E5430 2.66GHz processors, 24GB memory, 1 PCI-E PERC6/I RAID controller and 8 15K SAS disk drives in two RAID 0 arrays. The operating system is Windows Server 2008 64-bit, with SQL Server 2008 64-bit (RTM). The test database tables for table scans are 20GB each. The test table for random IO is 64GB of (leaf level) data, plus indexes.


Table Scan

A table scan of a clustered index organized table, with no fragmentation, was observed to generate large block IO, from 192-384KB per read, depending on the lock level and the state of the buffer cache. Surprisingly, at the default isolation level (Read Committed) with an empty buffer cache, the non-parallel table scan generates a rather low 367MB/sec, compared with 707MB/sec at NOLOCK. The Read Committed read-from-disk rate was 680MB/sec with a warm cache.


The table below shows the cost per page for a table scan in CPU-microseconds for various row densities at NOLOCK. The test query for the first two result columns is a needle-in-haystack type, with a SARG that very few rows (10) meet, but with no suitable index, so a table scan is required. For the first result column, the entire table was already in memory. For the second result column, the buffer cache was empty. In both cases, no evictions were necessary. The third column is the in-memory result for a simple count and single column aggregate of all rows in the table. By running this test on tables with a range of row lengths (8000 bytes for 1 row/page, down to 25 bytes for 323 rows per page) it is possible to determine the cost associated with a page access and the cost per row.
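The two test query shapes described above can be sketched as follows (table and column names are illustrative):

```sql
-- Needle-in-haystack scan: no index on FilterCol, so a full scan is forced;
-- only ~10 rows satisfy the SARG, so per-row handling cost is minimal
SELECT ID, FilterCol
FROM dbo.ScanTest WITH (NOLOCK)
WHERE FilterCol = 12345
OPTION (MAXDOP 1);

-- All-rows aggregate variant: adds a per-row aggregation cost on top of
-- the page access cost
SELECT COUNT(*), SUM(IntCol)
FROM dbo.ScanTest WITH (NOLOCK)
OPTION (MAXDOP 1);
```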

[Table omitted: CPU-microseconds per page for table scans at various rows per page, with columns for the in-memory needle-in-haystack query, the same query from an empty buffer cache, and the in-memory all-rows aggregate. The measurement data did not survive formatting.]
From the above table, one can deduce that the cost per page to load data from disk to the buffer cache in large block IO (>192K/read) is in the range of 4.3-4.6 CPU-microseconds. In another series of tests, SQL Server was restricted to 2GB memory, so in a 20GB table scan, 2GB was already in memory, 18GB had to be read from disk, and 18GB had to be evicted from the buffer cache. The cost was about 8 CPU-micro-sec per page at low row density. This implies the cost of evicting 1 (clean, not dirty) page is just over 3 CPU-micro-sec per page, based on the assumption of 18GB read and 18GB evicted.


Key Lookup

A nonclustered index seek, followed by a key lookup (bookmark lookup in SQL Server 2000) can generate small block random IO. However, as we should know, the nonclustered key, on a table with a clustered index, implicitly includes the cluster key. To generate essentially random IO, it is necessary to include an extra column, with an essentially random distribution, between the leading key and the cluster key. One aspect of key lookup generated disk IO is that when the buffer cache is empty, SQL Server will issue a 64KB read instead of an 8KB read. When the buffer cache is full (it is not clear where the exact transition point occurs) SQL Server issues 8KB IO. Another aspect of small block random IO is scatter-gather IO. This is a special feature in the Windows file IO API that allows issuing a single call for multiple blocks. To determine random IO cost structure, it is necessary to test across a wide range of rows per query, observing the transition point from single block IO to scatter-gather IO. Tests should also start with a warm buffer cache, unless the intent is to determine the cost of 64KB IO, in which case it is necessary to ensure that for the duration of the test, all IOs are 64KB. It is also helpful to ensure that no two rows for a given key value reside in the same 8KB page or even 64KB extent, for better consistency between measurements.
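A sketch of the table design described above (all names hypothetical): placing an essentially random column between the leading nonclustered key and the cluster key decorrelates index order from physical row order, so each key lookup lands on an effectively random page:

```sql
CREATE TABLE dbo.LookupTest (
    ID       int       NOT NULL,   -- cluster key
    GroupKey int       NOT NULL,   -- leading nonclustered key
    RandVal  int       NOT NULL,   -- essentially random distribution
    Filler   char(100) NOT NULL DEFAULT '',
    CONSTRAINT PK_LookupTest PRIMARY KEY CLUSTERED (ID)
);

-- The cluster key (ID) is implicitly appended to the nonclustered index;
-- RandVal in between breaks any correlation with the clustered index order
CREATE NONCLUSTERED INDEX IX_LookupTest
ON dbo.LookupTest (GroupKey, RandVal);

-- Index seek on GroupKey, then one key lookup per qualifying row
SELECT Filler
FROM dbo.LookupTest
WHERE GroupKey = 42
OPTION (MAXDOP 1);
```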


The table below shows the CPU-micro-sec per row for queries from 10 rows per Select to 10,000 rows per Select. The first result column is for data in memory. The higher cost per row for a 10 row query should probably be attributed to the fixed costs for issuing a Select query involving an index seek and bookmark lookup. So the actual in-memory cost per row is probably 4 CPU-micro-sec. The second result column is the average CPU cost per row when disk access is required (64GB data, plus indexes) and the cache is warm (8KB per read). There may be more than one disk IO per row, depending on how much of the nonclustered index and clustered index upper levels are in memory, so it is necessary to normalize to the actual number of reads. The third result column is the CPU cost per 8KB read. The fourth column is from a test starting with a cold cache, for which the average size per read is 64KB for the entire test.


[Table omitted: CPU-micro-sec per row by rows per Select; columns: In-memory, Disk (8K), CPU per 8K Read, CPU per 64K Read. The measurement data did not survive formatting.]
In some of the above tests, the single IO to scatter-gather IO transition point was at 25 and 26 rows per query. In another set, the transition point was at 24 and 25 rows. The cost per 8KB read is about 26 CPU-micro-sec per 8K page for single block IO and at the onset of scatter-gather IO. However, the average cost drops to 16-17 CPU-micro-sec at 30 rows per Select query and higher. Notice that the cost per 64K read is only 44-50 CPU-micro-sec; definitely higher than a single 8K read, but much lower than 8 separate 8K reads, even when issued with scatter-gather IO.


The query for 25 key lookups, most of which require (single IO) disk access, had an average duration of 110ms, with an average disk read latency of 4.1ms. Note that even though data is distributed over 8 disk drives, if IO is issued one at a time synchronously, the individual query (or sequence of queries) does not benefit from the disk array. The overall system throughput, driven by many concurrent connections, does benefit from the multi-disk array. A query for 26 rows with key lookup had an average duration of 26ms, with disk latency of 8.6ms. So clearly, multiple IOs are issued concurrently, allowing the query to complete much more quickly, even though disk latency is higher. The average CPU was slightly lower at 26 rows, 690 micro-sec versus 820 at 25 rows.

This test was done with the explicit SQL query executed from SSMS, so that Profiler captures this as a TSQL event (also called a language event). Another surprise is that the multiple concurrent IOs did not occur at the 26 or even the 10,000 row point for queries executed with sp_executesql.

Previously, I had noted a difference in hash operation behavior between TSQL (and stored procedures executed as TSQL) and RPC calls. The hash operation in a language event call would not spool to tempdb even when the intermediate results were moderately large. A query plan with the hash operation called by RPC would spool to tempdb at a much lower point. A hash operation spooling to tempdb is more expensive than an in-memory hash. The lower set point presumably is helpful in maintaining overall system throughput with many concurrent users. So apparently there are set point differences between queries executed as TSQL or RPC.

From the client side, declare a command object as type Text to execute a stored procedure as TSQL. Appending parameters to the SQL will result in the use of sp_executesql. The preferred option is to execute a large query as a language event if it must run fast and is allowed to disrupt transactions, and as RPC if it is not as time critical and it is more important to not disrupt transactions.
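The two execution paths can be reproduced from SSMS (procedure name hypothetical); the first call arrives at the server as a language event, the second as an RPC via sp_executesql, which is what a client command object with appended parameters generates:

```sql
-- Language event (TSQL): the batch is sent as plain text
EXEC dbo.KeyLookupProc @GroupKey = 42;

-- RPC path via sp_executesql: same logical work, but observed to use
-- different set points for spooling and concurrent IO prefetch
EXEC sp_executesql
    N'EXEC dbo.KeyLookupProc @GroupKey = @p1',
    N'@p1 int',
    @p1 = 42;
```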


Below is some additional information demonstrating the single IO to scatter-gather IO transition occurring at either 25-26 rows or 24-25 rows. The first column is the rows per query. The second column is rows per sec when all data and indexes are in memory. The third column is rows per sec with a warm cache (generating 8KB per disk read) and the fourth column is the actual physical disk reads per sec. The fifth column is rows per sec for a cold cache (generating 64KB per disk read) and the sixth column is the actual physical disk reads per sec. When the rows per query are below the transition point, it does not matter how many disks or LUNs there are; the disk IO is single and sequential, and the rate is mostly determined by disk latency. Once over the scatter-gather transition point, a single query can benefit from a high count disk array.

[Table omitted: rows per query versus throughput; columns: in-memory rows/sec, disk 8K rows/sec, 8K Rd/s, disk 64K rows/sec, 64K Rd/s. The measurement data did not survive formatting.]
Storage Performance Implications

A table scan could potentially generate from 360MB/sec to 1.5GB/sec per processor core (after including the cost to read a page into memory) depending on the number of rows per page. A random load could generate 30-45K IOPS per core. There are two significant adjustments to the raw data. One, in scaling from 1 to 8 cores, there is not going to be an 8X increase in throughput; 6-7X would be very good. Second, the above costs were derived from special queries meant to expose cost structure. Specifically, the table scan is a needle-in-the-haystack type query, looking for a few rows with minimal logic. When a large number of rows are aggregated, the cost is higher. The cost is even higher if the logic is complicated. Consider the case of TPC-H query 1, which aggregates 7 columns with additional arithmetic calculations and does a group by (necessitating a hash aggregate instead of a stream aggregate). This query involves 9-10GB of data, and takes around 50 CPU-sec on a 2.66GHz Intel Core2 processor, for about 200MB/CPU-sec.


This is one of the major contention points with the SQL Server cost model. The purpose of the CBO is to determine the best table access strategy for a given query and data distribution statistics. In non-parallel queries, the exact split between CPU and IO cost does not really matter, so it does not matter that the cost model does not attempt to account for the CPU cost of logic operations. For example, SUM(col1) has the same model cost as SUM(col1), SUM(col2), except in sort and hash operations, where the size of the intermediate results does impact the cost model. In a parallel plan, CPU costs are reduced depending on the degree of parallelism, as a given amount of processing is divided over multiple processors, so the work is completed more quickly, while IO costs to permanent tables are not reduced. In a query with expensive logical operations, the actual CPU portion is far higher than what is represented by the SQL Server cost model. Because this is not accounted for, it is possible that a specific query may not get a parallel execution plan because of the high IO cost portion, when in fact a parallel plan would be very effective.
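The point about per-row logic cost can be seen by comparing estimated plan costs for a pair of queries like the following (table name hypothetical). The second query does several times the actual CPU work per row, yet the optimizer assigns it nearly the same estimated cost, since the cost model prices the scan, not the aggregate logic:

```sql
-- Single aggregate: one SUM per row
SELECT SUM(Col1)
FROM dbo.ScanTest;

-- Multiple aggregates plus arithmetic: far more actual CPU per row,
-- but nearly identical estimated cost in the execution plan
SELECT SUM(Col1), SUM(Col2), SUM(Col1 * Col2), AVG(Col1)
FROM dbo.ScanTest;
```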


So with the understanding that there is a wide range of variables, a reasonable target for an 8-core server system might be 2GB/sec sequential and 200K IOPS random. I will probably try for 4GB/sec sequential when I can build a SSD array.

Let me stress again, all of the above numbers are still very preliminary!

Also, the purpose of this for now is not to nail down the exact cost structure of SQL Server disk IO, but rather to set IO performance specifications for a server configured with an SSD array, or a really massive HDD storage system.


Tuning is still Important!

It will be really great to get SSD; we can already hit very high sequential rates with HDD, and with SSD we can also hit very high random IO. However, the CPU costs discussed above for HDD apply to SSD as well; it is the same code path. It is also clear that the CPU associated with IO is very substantial. A well designed database keeping hot data in memory is still very important. Do not let some nitwit tell you it is OK to rely on the SSD!


Published Thursday, September 4, 2008 2:23 PM by jchang




Scott R. said:


Thanks for the detailed analysis.  I’m still going through all your information and digesting it.  A lot of good stuff!

Regarding your test system and configuration: I recently ran some disk I/O throughput tests on similar server and disk configurations (Dell 2900, 2 X quad core, PERC-6i RAID controller, 10 SAS disks in 1 / 2 / 2 / 5-disk RAID-0 virtual disks) using IOMeter.  Sequential and random read throughput rates were about as expected, but both sequential and random write throughput rates were significantly lower than expected for 2- and 5-disk RAID-0 virtual disks (yet were as expected for the 1 disk virtual disk).  Write performance for 2- and 5-disk RAID-0 virtual disks were worse than for a 1-disk virtual disk.  Write performance for a 5-disk RAID-0 virtual disk was worse than a 5-disk RAID-5 virtual disk: up to 3 times worse for random writes and up to 11 times worse for sequential writes.  The tests were done to validate previous tests of poor application performance on this test server, with my main suspicions being the disk configuration.  I even repeated the 5-disk RAID-0 tests with an unaligned disk volume and 4 KB NTFS clusters, 1 MB aligned disk volume and 4 KB NTFS clusters, and 1 MB aligned disk volume and 64 KB NTFS clusters, with no significant changes in the write performance.

I plan to follow up with Dell, to see if they have any history with this issue.  My initial suspicion is either a RAID controller firmware issue with RAID-0 virtual disks of 2 or more physical disks, or that I really goofed something up (building the virtual disks, measuring the I/O throughput, etc.).  I have nothing against Dell – I want this to work well, but I can’t explain the resulting performance without their feedback.

If you have experience with this phenomenon or care to test it, I would be interested in hearing your feedback.

These findings just prove to me that a given behavior (i.e., RAID-0 is always the best performing for a given number of disks) should not be assumed without testing to confirm it (or disprove it).

From my reading of your analysis so far, you appear to be testing read performance (mostly or exclusively), so the issue I mention may not be directly impacting your tests.


Scott R.

September 4, 2008 3:14 PM

jchang said:

strange inexplicable behavior is the norm for RAID controllers. Post the actual numbers so it is easier for me to digest. Check what the RAID controller is set for: stripe size (64-128K is usually good), write cache (write through or write back), and the read/write mix (75% write or more). Finally, run perfmon to watch the disk IO while IOMeter is running. Way back, in the PERC2 or 3 days, the write cache would periodically shut down on the controller, leading to highly erratic results. I thought my write numbers were ok, but that was a while ago

September 4, 2008 3:42 PM

jchang said:

I can now confirm what you are seeing. A test backing up a DB on one 4 disk RAID 0 to another 4 disk RAID 0 shows 46MB/sec with horrible write latency. A backup to a single SATA drive runs at 57MB/sec. Changing write back cache to write through improves this to 65MB/sec. Backup to NUL yields 340MB/sec, which should be just somewhat higher than the RAID 0 write rate. A search on MegaSAS (the PERC6 LSI driver) shows many reports of this problem, mostly by Solaris people (which shows they are reporting this problem; Windows people are slackers!) When I get a chance, I will turn on the older box with PERC5 I & E, and test different RAID levels. Ask the Dell rep to forward this matter to LSI, it's their driver!

September 4, 2008 6:27 PM

GrumpyoldDBA said:

So how about using SSD for tempdb? I figure probably more transient data is held in tempdb, the main dbs tending to be in buffer cache hopefully, so I would have thought tempdb has the greatest actual io, which would give the greatest benefit overall on a "bang for buck" basis?

September 5, 2008 3:24 AM

jchang said:

Scott: Apparently this is mostly a PERC6 problem. I tested on my old PE2900 with PERC5/I and PERC5/E. All LUNs are 5 disk RAID 0. With both LUNs on the 5I, data on one LUN, backing up to the 2nd LUN: 176MB/sec. Data on the PERC5/I, backing up to the PERC5/E: 286MB/sec. The PERC should have been able to handle the simultaneous read and write at 400MB/sec each. It's too bad about the PERC6; it was supposed to have a pure PCI-E path, not limited to 800MB/sec as on the PERC5. It is either a firmware issue or a serious foul up in the controller silicon.

September 5, 2008 2:06 PM

Scott R. said:


Thanks for the feedback confirming the RAID-0 issues on the PERC-6i RAID controller, and confirming that the PERC-5 controllers do not appear to have the same issue.  I’m glad it’s not just me or some mistake I made!  I would have been more surprised if this RAID-0 firmware defect crossed over the boundary of different hardware processor chips (PERC-5 is based on the Intel IOP-333 and the PERC-6 is based on the LSI SAS1078).

The server, storage, and IOMeter test parameters I ran included:

-  4 I/O workload test profiles (i.e., test all of the edge of the spectrum workloads):

  *  100% read 100% sequential (RS)

  *  100% write 100% sequential (WS)

  *  100% read 100% random (RR)

  *  100% write 100% random (WR)

-  3 I/O block sizes: 8 KB, 64 KB, and 256 KB

-  8 concurrent I/O levels: 1, 2, 4, 8, 16, 32, 64, 128

-  2 virtual disk write cache policy settings: write-back (default) and write-through

-  1 virtual disk read cache policy setting: no read ahead (default)

-  1 virtual disk stripe element size setting: 64 KB (default)

-  1 NTFS allocation unit size: 4 KB (default)

-  1 disk alignment setting: unaligned – 31.5 KB / 63 sector offset (default)

Some selective tests also added these test variables:

-  2 NTFS allocation unit size: 4 KB (default), 64 KB

-  2 disk alignment settings: unaligned – 31.5 KB / 63 sector offset (default), 1 MB / 1,024 KB / 2,048 sector offset (set with DiskPart)

All of these 192 combinations were run against each disk configuration (1 / 2 / 5-disk RAID-0 and 5-disk RAID-5) on this server.  These same combinations were also run on other test servers of varying RAID-5 and RAID-10 virtual disks (but no other RAID-0 virtual disks on other servers).  The other test servers were older Dell 2900 with PERC-5 RAID controllers.

With this many combinations, I can’t easily show you all the numbers, but I will share some highlights:

Constants for test results below: PERC-6i, 64 KB stripe element, no read ahead RAID cache, 4 KB NTFS allocation unit size – throughput values / ranges are in MBps:

256 KB block, write-back RAID cache:

-  5 disk RAID-0, unaligned volume: RS 453-457, WS 28-31, RR 40-123, WR 15-18

-  5 disk RAID-0, 1 MB aligned volume: RS 474-623, WS 31-33, RR 37-136, WR 15-19

-  5 disk RAID-5, unaligned volume: RS 337-498, WS 317-345, RR 40-135, WR 50-52

-  5 disk RAID-5, 1 MB aligned volume: RS 397-498, WS 322-354, RR 41-134, WR 50-52

(Note the significant jump in RS performance – 36% – for RAID-0 unaligned versus aligned volumes – most other comparisons of aligned versus unaligned are 5-9% different)

256 KB block, write-through RAID cache:

-  5 disk RAID-0, unaligned volume: RS 455-457, WS 53-55, RR 28-123, WR 32-98

-  5 disk RAID-0, 1 MB aligned volume: RS 526-623, WS 56, RR 41-137, WR 31-105

-  5 disk RAID-5, unaligned volume: RS 365-498, WS 22, RR 37-134, WR 16-47

-  5 disk RAID-5, 1 MB aligned volume: RS 438-498, WS 22, RR 41-133, WR 16-47

(Note that RAID-0 write performance improved with write-through cache over write-back cache – 2X for WS and 2-7X for WR, but was still very poor compared to RAID-5 with write-back cache.  RAID-5 write performance degrades significantly with write-through cache, as expected – write-back cache is the required compensating mechanism for the expected write overhead with RAID-5.  Since write cache policy is set at a virtual disk level (and not just at a RAID controller level), virtual disks with differing RAID protection levels can choose to use individual write cache policies if desired.)

I don’t plan to use RAID-0 virtual disk configurations on production servers, but I can see situations where some non-production servers may selectively use RAID-0 virtual disks to contain costs (where warranted by system recovery policies).

I plan to follow up with Dell to hear their take on this issue.

Although I haven’t tested it yet, I look forward to testing the PERC-6 with RAID-0 / 5 / 10 virtual disks using larger stripe element sizes (PERC-5 maximum stripe element size is 128 KB, while PERC-6 can go to 1 MB).  Also worth testing is the RAID-6 protection option (2 parity disks per virtual disk for 2 disk outage survival – versus 1 disk with RAID-5) new to the PERC-6 (versus PERC-5 and earlier).  RAID-6 protection may be worthwhile to mitigate risk in larger virtual disks (10+ physical disks) if the write-back cache can adequately mask the write workload overhead of RAID-6 (6 I/Os per write request) as it appears to do for RAID-5 (4 I/Os per write request).

Many thanks again for your efforts on this side track.  I apologize if I diverted the main intent of your original post.  Hopefully others may benefit from this dialog.

Scott R.

September 5, 2008 6:13 PM

jchang said:

Grumpy (and Bashful, Doc, Dopey, Happy, Sleepy, Sneezy DBAs too), I am not sure tempdb is the right motivation for SSD. tempdb activity should be sequential, or high queue short-stroke, meaning it should be able to benefit from HDD characteristics. It is random low queue that really needs SSD. Also, I never believed the separate HDDs for data and temp BS; all of that is written by people w/o serious disk performance credentials.

Scott: Your numbers look good to me, except for the RAID 0 write, which I can duplicate on my PERC6/E, while the PERC5 e/i is not nearly as bad. So let's bug Dell to bug LSI for a fix. Please report back; if not, I will make a special post reporting the PERC6 problem. I would give serious consideration to a RAID stripe size of 128K or 256K; this used to be the Compaq Smart Array default, and there was a reason for it, regarding how the OS does things, though I am not sure if it is still valid. Anyways, the right RAID config is no read ahead (on the controller), write back, 64-256K stripe size, and proper partition alignment (default on W2K8)

September 7, 2008 10:46 PM

AlbertoFerrari said:


Thanks to you, I discovered myself to be among the Windows Slackers. :) I have a PERC/6 controller on a PE2900 server and using SQLIO I confirmed exactly the same bad performance on a RAID0 virtual disk (it's a development server, so RAID0 is a good candidate for fast low-cost disk).

I have written to DELL to ask for a fix, up to now no answer.

I think that having a dedicated post for the PERC/6 bug might be very useful, in order to share info about the bug without diverting this very interesting post, which is about SQL performance and great by itself.


September 8, 2008 5:53 PM

jchang said:

The slacker comment was because I did a google search on "megasas performance" and only saw unix/linux descriptions. The best was from the Solaris people, so their team was on the ball. One wonders if Dell has an internal person with deep performance expertise. Dell spends a lot of money on their Power Solutions magazine, but the content is more like advertising, instead of being deeply technical. HP does have such expertise, but they are not good about sharing such info publicly in a manner that lets their competitors get a free ride. I suppose only their major customers or partners get access.

I myself did test read and write on the PERC5, but did only read tests on the PERC6. I do want Dell to bug LSI on this, without expressly negative connotations yet. Like Scott said, few people use RAID 0 in production, and RAID 5 is fine (I have not tested RAID 10). If Dell does not respond, then I will buy an HP P800 and publish the performance comparison.

When I have time, I will move these comments to the storage performance post I did a while ago.

September 8, 2008 6:18 PM

AlbertoFerrari said:

Joe, Scott,

I got an answer today from DELL technical support that sounds pretty good. Here it is:

Dell is aware of some performance concerns with the PERC 6 controller when running RAID 0 and intend on addressing these in the upcoming firmware release scheduled for release at the end of this month (23rd of September onwards).

Please be aware that under most types of IO load on a RAID 0 volume, Dell recommend that the PERC 6 write cache be set to write-through mode, but there are some exceptions to this rule and scenario/real-life based testing is ideal to identify these performance increases.

I will make some tests after having applied the new firmware, more news later.

September 12, 2008 11:09 AM

Scott R. said:


Thanks for the update.  I suspected that Dell may have been aware of this issue, given the severity of the problem and the timeline of the upcoming fix.  Like you, I am hopeful that the upcoming PERC-6 firmware upgrade will resolve the issue, and look forward to testing it.

Dell had posted two fairly recent PERC-6 performance white papers that also use IOMeter for disk performance and throughput testing:

PowerEdge Raid Controller 6 With MD1000 and MD1120 Performance Analysis Report (June 2008)

PowerEdge Expandable RAID Controller 6 Performance Analysis Report (November 2007)

Both white papers have two 100% write scenarios: SQL Server log and web server log.  The throughput results for RAID-0 on these two test scenarios were considerably better than what I got (and I assume what you or Joe got).  What I don’t understand is how Dell got better results than us - unless Dell’s tests were run with different PERC-6 firmware than what we used or some other impacting operational difference.

Per your comment on write-through caching: My tests did include both write-back and write-through caching.  While write-through tests were better than write-back tests in most cases, the result was still poor performance – most important:

-  2-disk and 5-disk RAID-0 tests with write-back and write-through performed worse in write performance than 1-disk (all cases for write sequential and write random, except as noted below)

-  5-disk RAID-0 tests with write-through / write random started out worse than 1-disk, but got better than 1-disk with higher concurrent I/Os – best case was still less than 2x better than 1-disk – well below expectations

-  RAID-0 write performance results were progressively worse with larger I/O block sizes.  Of the three block sizes tested (8 KB, 64 KB, and 256 KB), the poor write performance results were “least poor” with the smaller (8 KB) block size, and got relatively worse with the larger block sizes.  There was nothing close to positive linear scale-up across RAID-0 1 / 2 / 5 disk write tests as there was with the RAID-0 read tests.  In most cases, it was negative (less than same for 1-disk), or well less than 1 for 1 when positive.

Out of curiosity, what version of PERC-6 firmware and drivers were you using for your tests?  My tests were run with firmware version 6.0.2-0002 and driver version (as reported by OMSA).  There is a newer firmware version 6.0.3-0002 dated 06/23/2008 posted on the Dell support site, but the description of issues and fixes does not mention RAID-0 write performance.

Thanks again for your efforts, and your willingness to share test results and feedback from Dell.

Scott R.

September 12, 2008 12:23 PM

AlbertoFerrari said:


I updated to the latest version of both the driver and the firmware, as asked by technical support, in order to let them escalate the bug.

I am pretty sure too that they were aware of the bug; it seems unlikely that they could write a new firmware in a couple of weeks. Nevertheless the good news is that a new firmware is coming out.

The I/O block size test is very interesting. I set mine to 64K without many tests (really none), so maybe it will be worth making some tests and reformatting the disks after the new driver, to decide the best stripe size.

It might be very interesting for me if you can share your results with me, maybe this blog is not the best place to do it, can you email me to alberto dot ferrari at some info?


September 12, 2008 6:08 PM

mark cleary said:

I think you misunderstand scatter gather IO. The data from a scatter gather IO is always contiguous on disk. The structure used to pass the disk address has room for only one disk address. The in memory buffer can be discontiguous - the buffer parameter is a list of memory address & size pairs.

In SQL terms, a 64 kB read or write of a single extent will need 8 pages in the buffer. Without scatter gather, the pages must be contiguous (in virtual memory). As the buffer gets full it gets hard to find 8 contiguous pages free. When using read ahead with larger read sizes (eg 384 kb), you need 48 pages.

With scatter gather you only need 8 or 48 free pages, not 8 or 48 next to each other in virtual memory.

What you are seeing when the number of I/Os per second jumps is SQL Server starting to issue multiple reads in parallel. Kind of read ahead for random access.

July 8, 2009 6:04 AM

jchang said:

thanks, I looked up the ReadFileScatter API and it is just what Mark said it was.

On further digging, I found Craig Freedman's blog

which says what I was talking about above is asynchronous IO random prefetching (the other kind is sequential read-ahead)

Craig's blog also explained in detail how the execution plan annotates whether prefetching will be used, so I will work this into my DMV query, index and plan tool

July 8, 2009 1:48 PM


Joe Chang said:

Storage has changed dramatically over the last three years driven by SSD developments. Most of the key

April 2, 2013 6:33 PM



About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
