THE SQL Server Blog Spot on the Web


Joe Chang

Data, Log and Temp file placement

First, especially for all the people with SAN storage, drive letters are of no consequence. What matters is the actual physical disk layout behind each RAID Group or LUN. Forget capacity, pay attention to the number of spindles supporting each RAID group. While there is nothing wrong with making 2 or more LUNs from the same set of disks, I prefer to make 1 big LUN from the RAID Group, then create multiple partitions from Windows, just to be absolutely clear what storage space shares spindles.

If the RAID group is shared with other applications, make sure the SLA guarantees read and write latency. One very large company conducted a stress test in its QA environment. The SAN admin carved the LUNs from the same pool of disks as production, but assumed the SAN was so powerful that this would not matter. It did, and it was not.

The general theory is to aggregate random IO loads into a common pool of disks, and to provide dedicated physical drives for each significant sequential load. For some reason, many people incorrectly interpret this as placing the log files for all databases into a common drive.

Each database log for a well-designed application, in theory, generates purely sequential IO, and some serious technical heavyweights say that a single pair of physical disks in RAID 1 can handle almost any SQL load (barring very limited exceptions). This is not true, however, if you do transaction log backups, have transaction rollbacks, replication, or database mirroring; so one might do some additional calculations. Also, if one places more than one log file on a common physical disk set, then even though the IO to an individual file might be sequential, the complete set of IO to the disk is not.

So does each log file get its own set of physical disks? If there are 50 databases in the instance, does this server need 50 pairs of disks in RAID 1 for logs?

OK, let's go back to the original theory. A single 15K disk can do approximately 200 small-block random IOPS for data spread across the entire disk. It can do 4,000-5,000 pure sequential IOPS in SQL Server-generated log writes. For each pair of physical disks you take away from the common pool, that is 200 random (write) IOPS lost (assuming RAID 10). If a database does not generate more than 200 log writes/sec and does not require sub-ms log write latency, it is better to leave its log in the common pool.
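The trade-off can be sketched numerically from the figures above (a back-of-envelope model, not a sizing tool):

```python
# Back-of-envelope: is a dedicated RAID 1 log pair worth taking out of the
# common pool? Numbers are the rough figures from the post, not measurements.

RANDOM_IOPS_PER_DISK = 200  # small-block random IOPS for one 15K disk

def random_write_iops_lost(dedicated_pairs):
    """Random write IOPS the common RAID 10 pool forfeits per RAID 1
    pair pulled out for a dedicated log (each write hits both mirrors,
    so a pair delivers ~200 random write IOPS, not 400)."""
    return dedicated_pairs * RANDOM_IOPS_PER_DISK

def dedicate_log_disks(log_writes_per_sec, needs_sub_ms_latency):
    """The post's rule of thumb: dedicate disks only when the log write
    rate exceeds what the pool would forfeit, or latency is critical."""
    return needs_sub_ms_latency or log_writes_per_sec > RANDOM_IOPS_PER_DISK

print(random_write_iops_lost(1))          # 200
print(dedicate_log_disks(50, False))      # False: leave the log in the pool
print(dedicate_log_disks(500, False))     # True: a dedicated pair pays off
```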

On the matter of tempdb, I don’t know why people keep suggesting separate disks. Where does the underlying theory come from? Vapor?
Both data and temp should be more or less random loads, so the common pool is usually better. If the load on data and temp were temporally steady (for example, 10K IOPS to data and 5K to temp), then you could argue for allocating 70 disks to data and 35 to temp.

But this is not how SQL Server behaves. Try running a query that aggregates a very large number of rows, and involves a join. To be specific, the execution plan should show a hash operation, whether for the join or for the aggregate. Check the hash operation cost details. The IO cost component should be non-zero. Otherwise your test query is too small to actually use tempdb (check with the SQL engine team to be sure on this).

Now watch data and temp IO over time (not just the difference between before and after). The temporal disk IO pattern is: read from data, write to temp, read from data, write to temp, and then perhaps, at the end, read from temp (but sometimes not; think about what this means). So if you had split your disks 50-50 between data and temp (excluding OS and log disks), then at any given time 50% of your disks are busy and the other 50% are idle. Is this the best use of your (gawd-awful expensive) resources (if you bought a SAN)?
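A toy model of that alternating pattern shows the cost of the split (all numbers invented for illustration):

```python
# Toy model of the alternating "read from data, write to temp" pattern.
# Illustrative numbers only: each phase needs 2000 IOPS, each disk does 200.

PHASES = ["read_data", "write_temp", "read_data", "write_temp", "read_temp"]
IOPS_NEEDED = 2000
IOPS_PER_DISK = 200

def phase_time(disks_available):
    # Time for one phase is inversely proportional to the disks serving it.
    return IOPS_NEEDED / (disks_available * IOPS_PER_DISK)

# Split layout: 10 disks for data, 10 for temp. Each phase touches only
# one of the two sets, so only 10 disks are ever active at a time.
split_total = sum(phase_time(10) for _ in PHASES)

# Common pool: all 20 disks serve whichever phase is running.
pooled_total = sum(phase_time(20) for _ in PHASES)

print(split_total, pooled_total)  # the split takes twice as long
```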

There are special considerations if certain data files depend on low latency response, and others benefit from high-queue depth operation, but hey, if life were simple, would your company need skilled people?

If the SAN vendor tells you it's OK to have one common pool for everything, then that person is a liar or incompetent, and frequently both.

More on storage at: 

Published Tuesday, March 23, 2010 12:56 PM by jchang


Adler Catalin said:

Nice post Joe,

I think most people don't realize that most of the classic SQL best practices assume one database per server (physical machine). In reality, one physical machine will host many databases, which complicates matters.

People should remember to analyze, measure, understand resource consumption and take appropriate measures.

Having said that i would like to ask you a question.

How about moving the 'first' tempdb files to a small RAM disk (max 512MB) for disk-constrained servers (think 1-socket, pedestal)? Tempdb would have at least 2 data files: one fixed in the RAM disk, and one expanding on the HDD. This would mostly isolate tempdb IOs from everything else, provided, of course, there is enough memory left for the server (SQL & OS).


March 24, 2010 2:03 AM

jchang said:

Concur that in the past, the competent technical analysis was all oriented around 1 heavy database per server. Hence many of the multi-db configuration strategies lacked good substantiating analysis.

In the past, most medium to large businesses were on 4-socket systems, with small businesses on 2. Today, with quad- or six-core processors, the reality is that a single-socket processor could handle the compute requirements of a medium and possibly even a large business. However, there is not much in the way of cost difference between a single-socket system and a dual-socket system, and a 2-socket system typically has much higher memory capacity when both sockets are populated. Also, two-socket systems like the HP DL/ML370G6 and Dell T710 have massive internal storage. Have you noticed that the per-bay cost of storage in these is less than in an external storage unit? So low-end systems should target a 2-way with the cheaper processors, just to get the memory capacity.

I would not blindly recommend a tempdb strategy without first looking at the app. The two ends of the tempdb spectrum are: 1) every gawd-dang stored proc seems to create a temp table, making for very high activity against the tempdb system tables, but very little actual file IO to temp; 2) big queries with hash or sort ops that actually write to and read from temp. In between, there might be a situation with heavy writes to temp but very little read.

In a heavy system, I would recommend 48-96 15K disk drives, which can handle high random IO, sequential IO and temp.

For many active databases, instead of one pair of disks for the log of each heavy database, I prefer SSD for the logs.

On lower-cost systems, my thinking is 64-bit SQL Server + a larger memory configuration + a high degree of parallelism, to raise the set point at which intermediate results have to be written to tempdb. On a 32-bit system, this occurs at a relatively low point (I worked it out once back in SQL 2000 days). On 64-bit, it is proportional to memory (32-bit is proportional only to non-AWE memory). The memory set point is also per thread, so a high degree of parallelism also increases the amount that can be kept in memory.
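As an illustrative model of that set point (this is not SQL Server's actual memory-grant algorithm; the per-thread fraction is a made-up knob):

```python
# Illustrative model only: the capacity for intermediate results before a
# spill to tempdb scales with grantable memory and with degree of
# parallelism, since the workspace set point applies per thread.

def spill_free_capacity_gb(grantable_memory_gb, dop, fraction_per_thread=0.05):
    """Rough capacity before spilling. fraction_per_thread is a
    hypothetical knob standing in for the per-thread grant share."""
    return grantable_memory_gb * fraction_per_thread * dop

print(spill_free_capacity_gb(64, dop=1))   # ~3.2 GB of headroom
print(spill_free_capacity_gb(64, dop=8))   # 8x more: DOP raises the set point
```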

Failing this, get an inexpensive SSD. I have no recent experience with RAM disks; for $200-400, I am not going to be bothered with issues that could be solved with a cheap SSD.

I am also very leery of the number-of-tempdb-files advice. The original complaint was for case 1) in SQL Server 2000 pre-SP3 days. The issue was supposedly fixed, and yet people keep talking about one tempdb file per core. No one seems interested in citing actual performance tests on a specific version. This advice generally comes from people who collect rules without caring why.

March 27, 2010 7:14 AM

Alejandro Mihanovich said:

Hi Joe,

Very interesting article. I work at Microsoft. I'm now an MCS guy (used to be a SQL server PFE). I'm writing a design document for SQL 2008 so I need to have good foundation on disk IO concepts. The only concept I'm missing comes from the following paragraph from the SQL predeployment I/O best practices.

"Maintain a balanced storage design. The architecture of specific HBAs (Host Bus Adapters) and their settings influence performance and throughput. But in general, more LUNs result in more independent I/O queues and potentially more concurrent outstanding I/Os. Assuming that the Storage Area Network (SAN) has no bottlenecks, one can expect better scalability with multiple LUNs and multiple paths between server HBAs and the SAN."

I don't see your mentioning of this in your arguments, but it looks that having many different LUNS creates more queues and will allow for more outstanding IOs (and thus more IOPS assuming no SAN bottlenecks). What are your comments on this ?

March 27, 2010 9:27 PM

jchang said:

Sorry about the lack of detail; I recall describing more of it in other places. I would suggest caution with the above quote. Balance between the number of IO channels (i.e., HBAs) and physical disks is essential. Sure, a LUN could be construed as a physical disk, but I have seen total idiots think that having multiple LUNs is all that is required, without realizing that 10 LUNs created on 10 physical disks is not the same as 10 LUNs on 100 physical disks. (Technically and originally, LUN just meant Logical Unit Number, but in SAN usage most people use the term for the logical disk created by the storage system and presented to the host OS as a physical disk.)

In brief, let's distinguish between two different disk queues. First, the HBA typically defaults to "Per Target" mode, where the default queue depth is allocated for each target (LUN). I think the other mode is per-HBA or something similar (look in the QLogic manual for this), meaning all LUNs on that HBA channel share one queue.

The other place a disk queue lives is on the disk itself. I think most SAS/FC disks allow for 32 outstanding IO requests; that is, the disk can only do elevator seeking across up to 32 queued IO requests.

Back to the HBA: I believe the default queue depth is 32, which means queue depth 32 per LUN. Some people, including MS people, blindly say increasing the queue depth to 128 or 254 improves performance. Well, sure, if you have a storage system with lots of physical disks; my guess is 8 or more disks per LUN. If there are only a small number of physical disks per LUN, then what's the point of increasing the queue depth? So I reiterate: it is grossly irresponsible to suggest increasing queue depth on the HBA without understanding the physical-disk-to-LUN mapping. Also, if multiple LUNs are created from the same physical disks, you still end up with a mismatched queue depth relative to physical disks.
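As a sketch of the arithmetic behind this, assuming per-target (per-LUN) queueing and the ~32-command per-disk queue mentioned above:

```python
# Sketch: how many outstanding IOs each physical disk may be asked to hold
# for a given HBA queue-depth setting and LUN layout. Assumes per-target
# (per-LUN) queueing and SAS/FC disks that accept ~32 queued commands.

DISK_QUEUE_LIMIT = 32  # typical per-disk command queue

def outstanding_per_disk(hba_queue_depth, disks_per_lun):
    """Outstanding IOs per physical disk if the host fills the LUN queue."""
    return hba_queue_depth / disks_per_lun

# Default depth 32 on a 10-disk LUN: ~3 IOs per disk, so raising the
# depth can genuinely help keep all spindles busy.
print(outstanding_per_disk(32, 10))    # 3.2
# Depth 254 on a 2-disk LUN: 127 IOs per disk, far beyond what the disk
# can usefully reorder; the excess just queues up and latency climbs.
print(outstanding_per_disk(254, 2))    # 127.0
```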

So, balance the number of IO channels, physical disks, and set queue depth appropriately, and even better, get a competent professional, not some idiot with a list of rules.

March 28, 2010 11:49 AM

Adler Catalin said:

Thanks for your reply Joe,

2 cheap SSDs (mirrored) would consume 2 SATA/SAS ports.

The HP DL380G6 has a max of 16 2.5in disks, Intel has 12, and the Dell 710 a max of 6 3.5in disks. Except for HP, that's not so much.

Wouldn't those 2 ports possibly be needed for something else?

I got this idea tuning our application. We use a lot of small tempdb tables (to transfer multiple values for a parameter to SQL; the solution needs to work on SQL 2005 also, so we use xml->#table->join) that could benefit from having an isolated IO path most of the time.

Most of our customers use single-socket servers (budget oriented), so we got the idea of using RAM disks for tempdb and using the 2 disks for logs/os/mdf.

PS: what kind of SSD is that cheap and can write fast?

March 29, 2010 8:04 AM

jchang said:

That's why you should get the HP DL/ML370G6, or the Dell T710 with 16 2.5in bays, not the 8 3.5in; the SSDs are 2.5in.

For an SSD to write fast, you need SLC; the cheap SSDs are MLC, such is life.

This xml-to-parameter approach sounds like a testing application I advised on a while back: the company heard about SQL 2000's XML capability and converted their whole app to pass in XML, then extract the parameters. That was a real performance killer. Use XML judiciously.

April 4, 2010 8:35 AM

Matt said:

Thanks for a great article.  I've been looking around for some help on how to configure my new storage array and haven't found much that is both accessible to me (I'm not a DBA!) and not just regurgitating the same rules of thumb without explanation.

Could I please ask you for a few thoughts on my specific circumstance?  I have a server with 2 146GB 15K disks in a RAID 1 where my OS resides. After that, I have 12 600GB 15K disks that I'm trying to figure out the best configuration for. The performance-sensitive application runs a database from an external vendor that will be roughly 800GB, with stored procedures that use temp tables a lot. I don't think I'll have high requirements for the logs; I'll just set the recovery model to simple.

I'm having trouble weighing the trade-offs: running the whole thing on RAID 5 (at the expense of slow writes to tempdb); the whole thing on RAID 1+0 (no write penalties, but now I only have 3TB of space, though I'm not sure why I have this voice in the back of my head telling me to conserve space, since there's no reason we'll need more than 3TB any time soon); or breaking this up into a few disks for data, a few for tempdb, a few for logs, etc. (let alone wondering whether I can save the 2 disks that would be in a RAID 1 for the logs by just putting them on my OS array).

You can probably tell I'm starting to get paralysis by analysis here... I've read just enough to be able to talk myself into and out of all those options.

Could you please tell me what your general recommendation might be?

October 15, 2010 1:02 AM

jchang said:

This is on the assumption that it is not high-volume transaction processing (otherwise 12 disks is inadequate). If you had read my material first, you would have gone with 2 x 24-disk enclosures of 146GB 15K drives instead of 12 x 600GB. But here we are, so I suppose you could make 4 3-disk RAID 5 sets, or 6 2-disk RAID 1 sets. Put a data file and a temp file on each. The log can go on the OS array (again, this is not a transaction processing app). RAID 5 has high overhead only for small-block random write IO, not sequential or large-block writes. So as long as this is not transactional, RAID 5 should be OK.
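To make the RAID 5 point concrete, here is the standard write-cost arithmetic (a textbook model, not measured figures):

```python
# Textbook model of disk operations per logical write for each RAID level.
# This is the arithmetic behind "RAID 5 is only expensive for small random
# writes": a full-stripe sequential write computes parity in memory, so the
# per-write cost falls to roughly one disk op (parity amortized and ignored
# here for simplicity).

def disk_ops_per_write(raid_level, small_random):
    if raid_level == "raid10":
        return 2                      # write both mirrors
    if raid_level == "raid5":
        # Small random write: read old data + read old parity +
        # write new data + write new parity = 4 disk ops.
        return 4 if small_random else 1
    raise ValueError(f"unknown RAID level: {raid_level}")

print(disk_ops_per_write("raid5", small_random=True))    # 4: the penalty
print(disk_ops_per_write("raid5", small_random=False))   # 1: sequential is fine
print(disk_ops_per_write("raid10", small_random=True))   # 2
```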

I have more material on my website, including

October 16, 2010 10:47 AM

Matt said:

Hi Joe,

Thanks so much for getting back to me.  Yes, the hardware arrived before I had seen any of your material. The specs were actually based on the recommendation of the data vendor, though on asking them any of these questions about the best way to organize the disks, it seems apparent that they came up with those specs without any real concrete reason for them.

For the 4 RAID 5 or the 6 RAID 1 arrays, should I then partition the disks like you do at the end of that linked article (one big partition for data, one 10GB partition for tempdb, and one partition for backup)? I would mention that the data vendor does not give any information about putting different tables in different filegroups; I think it'll just be one filegroup.

It's interesting that yours is the first site I've found that recommends using multiple raid arrays but puts data and tempdb on each instead of separating them.  Just shows how hard it is to find good/consistent advice on the internet!

October 16, 2010 5:57 PM

Matt said:

Also... I've read in a few places that it's potentially tempdb that needs the fastest disk subsystem. I know it's probably contingent on the server's workload, but should the tempdb (10GB) partition come first, to get the outer tracks on the disk, instead of the data partition?

October 16, 2010 11:39 PM

jchang said:

The fact that other people do not talk about multiple RAID arrays with multiple data files might be an indication that they have not worked with multiple arrays or are not aware of the concept.

Interpreting tempdb IO can be misleading, which is why some people think it needs fast disk. Sometimes it's because their system has so much memory that most data is already in memory, so only tempdb (and the log) actually generate disk IO.

The data should always go on the outer tracks.

October 17, 2010 8:18 AM

Matt said:

Thanks so much for the help Joe.  I've gotten to the point where the RAID arrays and database file structure are set up as you suggested, and I'm taking a look at perfmon and DMV queries for the kinds of queries we'll typically be running on the machine.

The performance counters on Temp Tables Creation Rate and Temp Tables For Destruction are awfully high, and the DMV queries are telling me that PAGEIOLATCH_SH may be a problem.

Do you think it's worth it for me to try splitting things up and using 1 or 2 of the Raid1 arrays just for tempdb and the others just for the data files rather than commingling them as it is now with one of each on every Raid array? Or is that entirely off base?

October 19, 2010 9:01 AM

Matt said:

I've also been noticing some behavior that I can't really explain.  4 of my Raid 1 arrays are behaving almost identically (sustaining read queue lengths of around 20 and write queue lengths of less than 1), while the last 2 arrays are completely different, but identical to each other (sustaining read queues of around 150 and write queues of around 10).

Does this make any sense to you?  The arrays are all set up the same as far as I know (same physical disks, same raid, same database files, etc...).

October 19, 2010 9:10 AM

jchang said:

What basis do you think supports separating data and temp?
Suppose data and temp share 8 drives: your data files get low activity, but your temp generates surges. By splitting data and temp onto 4 disks each, your temp surges would last twice as long while the data disks sit idle.

I would look at sys.dm_io_virtual_file_stats to see whether SQL Server is distributing IO evenly to the disks, or whether some other activity is generating IO. Also check the RAID controller settings: is one Direct IO and the other Cached IO? Etc.
You also need to assess whether this is important enough to warrant a proper investigation, which cannot be done on a blog.
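A minimal sketch of what to compute from that DMV: dividing its cumulative stall counters (io_stall_read_ms / num_of_reads, io_stall_write_ms / num_of_writes) gives average latency per IO, which can distinguish a lagging array from a healthy one even when the IO counts are equal. The sample numbers below are invented for illustration.

```python
# Sketch: average per-IO latency from sys.dm_io_virtual_file_stats
# cumulative counters. Sample values are invented for illustration.

def per_io_latency_ms(io_stall_ms, num_of_ios):
    """Average stall per IO; guards against a zero IO count."""
    return io_stall_ms / num_of_ios if num_of_ios else 0.0

# Equal IO counts, very different stalls -> per-IO latency tells the story:
healthy = per_io_latency_ms(io_stall_ms=50_000, num_of_ios=10_000)
lagging = per_io_latency_ms(io_stall_ms=1_000_000, num_of_ios=10_000)
print(healthy)   # 5.0 ms per IO
print(lagging)   # 100.0 ms per IO
```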

October 19, 2010 11:27 AM

Matt said:

Your explanation on data and temp makes perfect sense, thanks for that.

sys.dm_io_virtual_file_stats shows equal distribution across the disks in terms of the number of reads and writes and the bytes read and written, but the stalls on those 2 disks are 20x the others for reads and 100x for writes.

I just installed Dell OpenManage Sever Administrator to take a look at the RAID caching.  Each of the virtual disks is set up for:

Read Policy: Adaptive Read Ahead

Write Policy: Write Back

Stripe: 64kb

Disk Cache Policy: Disabled

I can't find any setting or description anywhere that differentiates among the physical disks or RAID arrays.  The only piece of information I can find that seems possibly related is that for my PERC H700 Integrated controller, Dell OMSA (System->Main System Chassis->Slots) says the "Data Bus Width" is x4.  Is that why I'm seeing good performance on 4 of the RAID arrays while 2 seem to be lagging?

October 19, 2010 7:59 PM

Matt said:

I just started another set of queries to take a look at the performance counters and now disks 1-5 all have pretty much identical counters and it's just the last disk that is lagging with high queues.  Strange.

Anyway, I don't want to keep sending you question after question as you've been more than helpful.  Thanks again for your expertise.

October 19, 2010 8:49 PM

jchang said:

the x4 PCI-E slot should be fine for 12 disks,

I cannot make a diagnosis with only a slim amount of data

you could restart SQL Server, back up the database TO DISK = 'NUL:'

then run my SQL Exec Stats tool

or you could also learn to use SQLIOSIM

October 19, 2010 8:49 PM

Matt said:

Hi Joe,  I had downloaded this earlier when I was looking through, but I hadn't used it yet.  I just ran it - what a wealth of information!  It's going to take me quite some time to work my way through all of it!

October 19, 2010 9:31 PM

Joe Chang said:

Storage has changed dramatically over the last three years driven by SSD developments. Most of the key

April 2, 2013 6:33 PM
