
Linchi Shea

Checking out SQL Server via empirical data points

How did Random I/Os Outperform Sequential I/Os?

Recently, when I was doing some I/O performance tests on an I/O path, I found that 8K random reads (and writes) significantly and consistently outperformed 8K sequential reads (and writes) in terms of I/O throughput (megabytes per second). I was puzzled.

With a traditional hard disk, which is made up of a stack of magnetic platters held by a spindle, an electro-mechanical access arm, a printed circuit board, and a hard-case enclosure, the average seek latency is always significantly higher than the average rotational latency (e.g. 9.5ms vs. 4.2ms on a 7200 rpm 500GB disk). Random I/O throughput is therefore always expected to be significantly lower than sequential I/O throughput.
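
As a rough back-of-the-envelope using those figures (a sketch in Python, ignoring transfer time; the numbers come from the example disk above, not from a measurement):

    SEEK_MS, ROT_MS = 9.5, 4.2   # average seek and rotational latency from above
    IO_KB = 8

    # A random 8K read pays seek + rotational latency on every single I/O.
    random_iops = 1000 / (SEEK_MS + ROT_MS)                       # ~73 IOPS
    print(f"random 8K: ~{random_iops * IO_KB / 1024:.2f} MB/s")   # ~0.57 MB/s

    # Sequential 8K reads keep the head in place, so throughput approaches the
    # drive's media transfer rate, i.e. tens of MB/s on a disk of that era.

That gap of roughly two orders of magnitude is why the result described below was so surprising.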

In fact, sequential I/Os have such a huge performance advantage over random I/Os that the computer industry has labored over the past few decades to reduce random I/Os and convert them to sequential I/Os, with techniques such as caching, transaction logging, sorting, and log-structured file systems.

Granted, the I/O path I was working with was not a traditional hard disk. It was a LUN presented from a SAN with a large amount of cache, and, to simplify somewhat, the LUN was a RAID 0 stripe set across 12 virtualized drives with a rather large stripe unit size (960K). But how should I explain why 8K random I/Os could outperform 8K sequential I/Os?

After some discussions with a storage professional, we came up with a theory consisting of the following three key factors:

  • Random I/Os were effectively hashed across the multiple drives that make up the RAID 0 device.
  • The relatively large RAID 0 stripe unit size of 960K caused 8K sequential I/Os to cluster on the same drive. Note that it takes 120 sequential 8K I/Os to fill a single 960K stripe unit (see the sketch after this list).
  • A base amount of cache was assigned to each drive in the RAID 0 set, so when random I/Os were hashed across all 12 drives, they benefited from a larger total amount of cache.
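
Here is a minimal sketch of the striping arithmetic behind the first two factors, in Python, assuming a simple round-robin RAID 0 layout (the SAN's actual layout may well be more involved):

    import random

    STRIPE_UNIT = 960 * 1024   # 960K stripe unit size
    NUM_DRIVES  = 12           # RAID 0 across 12 virtualized drives
    IO_SIZE     = 8 * 1024     # 8K I/Os
    FILE_SIZE   = 20 * 2**30   # size of the test file (20GB; see the comments below)
    NUM_IOS     = 120          # enough sequential 8K I/Os to fill one stripe unit

    def drive_for_offset(offset):
        """Map a byte offset on the LUN to the RAID 0 member drive holding it."""
        return (offset // STRIPE_UNIT) % NUM_DRIVES

    seq  = {drive_for_offset(i * IO_SIZE) for i in range(NUM_IOS)}
    rand = {drive_for_offset(random.randrange(0, FILE_SIZE, IO_SIZE))
            for _ in range(NUM_IOS)}

    print(len(seq))   # 1: all 120 sequential I/Os land on the same drive
    print(len(rand))  # typically 12: random I/Os spread across all the drives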

Do I have solid proof that these three factors were the root cause of 8K random I/Os outperforming 8K sequential I/Os? No, I don't. But I do have some circumstantial evidence supporting the theory.

First of all, if the theory is correct, I should see the same behavior with smaller I/Os such as 1K reads and writes. Indeed, 1K random I/Os outperformed 1K sequential I/Os on the same I/O path.

Secondly, if the theory is correct, I should not see the same behavior with larger I/Os, especially with an I/O block size that is not much smaller than 960K. Indeed, 128K random I/Os did not outperform 128K sequential I/Os on the same I/O path.

Thirdly, if the theory is correct, I should not see the same behavior on an I/O path that has fewer drives. Indeed, on a RAID 0 device with three drives in the same SAN, 8K random I/Os did not outperform 8K sequential I/Os.

Finally, if the theory is correct, I should not see the same behavior on a RAID 0 device with a much smaller amount of cache. Indeed, on a directly attached RAID 0 device, 8K random I/Os did not outperform 8K sequential I/Os.

Now, as mentioned, I'm not 100% confident about this theory; I can't prove it beyond a reasonable doubt. Hopefully, some of you reading this blog post know exactly what caused, or could have caused, 8K random I/Os to outperform 8K sequential I/Os. And if my explanation doesn't match up with yours, I'd love to hear your comments.

Published Wednesday, April 04, 2007 4:14 PM by Linchi Shea


Comments

 

Chris Leonard said:

Linchi,

I think your theory is reasonable, but I have another question: is that large stripe size causing this effect by helping the random I/Os or by hindering the sequential ones? In other words, does this configuration have an overall positive performance effect on your system? I believe that if a stripe size is too large, it can hinder the performance of SQL Server. For example, if SQL Server's maximum I/O request is still 64K (or one extent), having a stripe size of 960K makes the reads relatively sparse from a physical point of view. I thought that the classic white paper on SQL Server I/O tuning from Compaq a few years back recommended trying to keep the stripe size small because SQL Server never made I/O requests larger than 64K. If this is still true, is it possible that the SAN is simply misconfigured for sequential SQL Server I/O?

Cheers,

Chris

April 8, 2007 10:19 AM
 

Bert Corderman said:

Linchi,

I have seen the same performance situation before. I can provide more insight if you could provide me with:

1. More info on the SAN config

a. Vendor (EMC, 3Par, Hitachi)

b. Drive config: you said 12 virtualized drives… but what is the physical drive layout? Are the physical drives shared with other hosts?

c. Connection from host to SAN: how many paths?

Also, what were you using to test? Are you using a tool or running a SQL script?

Since you are using a “virtual” drive, you are not truly performing sequential I/O. The reason sequential I/O is usually much faster is that the largest cost in an I/O is aligning the disk head to the data; once there, streaming the data is fast. In your virtual case, the sequential I/O is actually random I/O in the physical sense.

As far as why you are getting worse performance on sequential vs. random I/O, I would guess that it has something to do with the threading, but I could add more insight if you answer the above questions.

“Finally, if the theory is correct, I should not see the same behavior on a RAID 0 device with a much smaller amount of cache. Indeed, on a directly attached RAID 0 device, 8K random I/Os did not outperform 8K sequential I/Os.”

If your database is very large and your I/O is truly random, the SAN cache will NOT help with read I/O. (Don’t let your vendors try to tell you otherwise ;) )

Bert

April 9, 2007 9:05 AM
 

Linchi Shea said:

Unfortunately, I can only discuss the issue at a rather generic level without naming any vendor, model/make, etc. I know that's not ideal, but there is not much I can do about that. My hope was that even at a rather generic level, useful discussions could take place and lead to a better understanding of the behavior.

By 'random/sequential', I was referring to I/Os at the file system level with respect to a test file (sized at 20GB). The test file is conceptually divided into a sequence of blocks (in this case 8K blocks), numbered sequentially from beginning to end. If the I/Os are issued in that sequential order without any gap, the I/Os are considered sequential. If a random number is picked to decide which block to read/write next, the I/Os are considered random. The tests were done with sqlio.exe from Microsoft, and there was no SQL Server involved. In other words, the tests were pure I/O tests.
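
For illustration, here is a minimal sketch of that definition in Python (the actual tests used sqlio.exe; this naive version goes through the OS file cache, which a real I/O test must bypass):

    import random

    IO_SIZE    = 8 * 1024        # 8K blocks
    FILE_SIZE  = 20 * 2**30      # 20GB test file
    NUM_IOS    = 10_000
    num_blocks = FILE_SIZE // IO_SIZE

    with open("testfile.dat", "rb") as f:   # hypothetical pre-created test file
        # Sequential: issue I/Os in block-number order with no gaps.
        for i in range(NUM_IOS):
            f.seek(i * IO_SIZE)
            f.read(IO_SIZE)

        # Random: pick each next block number at random.
        for _ in range(NUM_IOS):
            f.seek(random.randrange(num_blocks) * IO_SIZE)
            f.read(IO_SIZE)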

You can email me offline (if you wish).

Linchi

April 9, 2007 10:52 AM
 

bertcord said:

How about your drive configuration? You mentioned that you are using 12 virtualized drives. What are they virtualized onto? How many physical spindles does each virtual drive use? Do all 12 virtual drives use the same physical drives? Also, how many Fibre Channel connections are you using, and what are you using for multipathing?

When you say virtual drive, I am assuming you are talking about EMC MetaLUNs, or are you using a non-traditional SAN such as 3Par?

We use SANs from two vendors: EMC CX700 and 3Par S400 systems. On the 3Par, when a drive is created, the system actually lays the LUN out in 256MB chunks across all spindles in the system. This is what it sounds like you are doing. Is this correct? If not, can you explain what you mean by “virtualized drives”?

I understand the concept of sequential vs. random, and your definition in your comment above is correct. What I was trying to get at is that when you are using a “virtual drive”, your I/O is no longer sequential at the physical drive level. It might be for some of the data, but not like in traditional DAS storage. Depending on the SAN, it should be able to do a read-ahead to queue up the next spindle, but it looks like that is not the case.

For example, on our 3Par systems the chunklet written to disk is 256MB. If I create a 1TB LUN, these 256MB chunklets will be spread across all drives in the system. If I am doing a large table scan, it might read 256MB sequentially, but when it moves to the next chunklet, it could be on another physical disk completely.
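
That chunklet arithmetic is the same round-robin idea as the stripe sketch earlier in the post, just at 256MB granularity (the round-robin placement and the spindle count here are assumptions for illustration; 3Par's actual chunklet placement is more sophisticated):

    CHUNKLET  = 256 * 2**20    # 256MB chunklets
    NUM_DISKS = 40             # hypothetical spindle count

    def disk_for_offset(offset):
        # Assume chunklets are dealt out round-robin across all spindles.
        return (offset // CHUNKLET) % NUM_DISKS

    # A table scan streaming past a chunklet boundary jumps to another disk:
    print(disk_for_offset(CHUNKLET - 1), "->", disk_for_offset(CHUNKLET))   # 0 -> 1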

I stopped using SQLIO as it was limited to, I think, a 4GB file size… so I am not that familiar with it anymore. Do you get to specify how many threads and outstanding I/Os are running during the test? How many threads were you running in your test?

I had a specific case on a 2-node 3Par system where a single thread would be limited by the maximum throughput of a single node. So in this case the maximum throughput was 700MB/s, but doing sequential reads from multiple threads would allow better throughput.

I also don’t think comparing 8KB random reads to 8KB sequential reads is a good test of how SQL Server works. In my experience, SQL Server will issue fewer I/Os of a larger size when data is sequential. For large table scans I will consistently see 256KB I/O requests. In other words, if SQL Server knows that it needs eight 8KB pages, it will issue one 64KB I/O instead of eight 8KB I/Os.
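
A minimal sketch of that coalescing idea in Python (illustrative only, not SQL Server's actual read-ahead code):

    def coalesce(pages, page_size=8 * 1024, max_io=256 * 1024):
        """Merge runs of adjacent page numbers into (offset, length) I/O requests."""
        ios = []
        for p in sorted(pages):
            if ios and p * page_size == ios[-1][0] + ios[-1][1] and ios[-1][1] < max_io:
                ios[-1] = (ios[-1][0], ios[-1][1] + page_size)   # extend the current run
            else:
                ios.append((p * page_size, page_size))           # start a new request
        return ios

    # Eight adjacent 8KB pages collapse into a single 64KB request:
    print(coalesce(range(8)))   # [(0, 65536)]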

What is your email?  I also just registered on the site so you can get mine from there.

Bert

April 9, 2007 12:03 PM
 

Linchi Shea said:

Bert;

I can't seem to get your email address. But you can reach me at linchi.shea@<Remove-this-before-send>gmail.com

To answer your question, the virtualized drive I was referring to is an 8GB device that is carved out of a 3+1 RAID5 set or a RAID0 mirror set. In the case of RAID0, the 12 virtual drives do not use the same physical spindles, but in the case of RAID5 the 12 virtualized drives do wrap around to use some of the same physical spindles.

The version of SQLIO.exe I used (v1.5.SG) does not have a 4GB size limit. I ran different numbers of threads (in this case 2 and 32) with various numbers of outstanding I/Os (1, 4, ..., 128).
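
From memory, a test matrix like that maps to sqlio invocations along these lines (the flag spellings are recollections of the tool of that era, so treat the exact switches as an assumption):

    sqlio -kR -fsequential -b8 -t2  -o8   -s120 -LS testfile.dat
    sqlio -kR -frandom     -b8 -t32 -o128 -s120 -LS testfile.dat

where -kR means read, -f picks sequential or random, -b is the block size in KB, -t the thread count, -o the outstanding I/Os per thread, and -s the duration in seconds.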

I was not comparing 8K random vs. sequential I/Os to predict SQL Server performance. As mentioned, I was intrigued by their apparently 'non-intuitive' difference. It had nothing to do with SQL Server (well, the choice of the 8K block size was influenced by the SQL Server page size). Instead, it was a pure I/O test to see what the I/O subsystem could do. I understand that SQL Server tends to do large I/Os to better utilize the I/O subsystem; every decent DBMS does that.

In fact, I have tested, among other test configs, placing the tran log file (and separately the data file) of a SQL Server database on two LUNs that had different performance profiles in terms of random I/Os, and found practically no performance difference.

But when you combine the results from the 1K, 8K, and 128K tests, you get a pretty good idea of what your I/O subsystem can do and what you should expect from your SQL Server (once you understand how it issues I/Os at various block sizes for different operations).

Linchi

April 9, 2007 2:22 PM
 

Mike Hahn said:

This is an old thread but I thought I'd put in my 2 cents. I had a similar situation where random I/O outperformed sequential. What resolved it for me was modifying the offset of the partition on the RAID. There's a Microsoft article that explains how the stripe unit boundary can be misaligned with the partition starting point, making sequential reads slower: http://support.microsoft.com/kb/929491/en-us

For my hardware, my vendor recommended an offset of 128K.
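
For illustration, a minimal sketch of the alignment arithmetic in Python (the 64K stripe unit and the old 63-sector default partition start are assumptions chosen to show the classic misalignment case):

    STRIPE_UNIT = 64 * 1024    # assumed stripe unit size
    IO_SIZE     = 64 * 1024    # one SQL Server extent

    def boundary_crossings(partition_offset, num_ios):
        """Count I/Os that straddle a stripe-unit boundary (each costs two disk I/Os)."""
        crossings = 0
        for i in range(num_ios):
            start = partition_offset + i * IO_SIZE
            end   = start + IO_SIZE - 1
            if start // STRIPE_UNIT != end // STRIPE_UNIT:
                crossings += 1
        return crossings

    print(boundary_crossings(63 * 512, 1000))    # 1000: every I/O splits across two stripe units
    print(boundary_crossings(128 * 1024, 1000))  # 0: the 128K offset aligns every I/O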

October 21, 2008 3:53 PM
