THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Joe Chang

Fast Track Data Warehouse 3.0 Reference Guide

Microsoft just release Fast Track Data Warehouse 3.0 Reference Guide version. The new changes are increased memory recommendation and the disks per RAID group change from 2-disk RAID 1 to 4-Disk RAID 10.

The earlier FTDW reference architecture cited 4GB memory per core. There was no rational behind this, but it was felt some rule was better than no rule. The new FTDW RG correctly cites the rational that more memory helps keep hash join intermediate results and sort operations in memory.

4-Disk RAID 10
FTDW versions 1 & 2 specified 2-disk RAID groups. The reason was the SAN storage system could not deliver good sequential performance IO with fat RAID groups (comprised of many drives). For some reason, a thin 2-disk RAID group could deliver 100MB/sec per disk. An EMC whitepaper added that a 3-disk RAID 5 group could also deliver good sequential performance. But RAID 5 does not have the necessary performance characteristics to support tempdb IO requirements. FTDW 3.0 now specifies 4-disk RAID 10 groups. The presumption is that 100MB/s sequential IO is still the objective. No explanation was given why a 4-disk RAID group is now specified. It is possible the firmware in SAN systems can now support good sequential performance with larger RAID groups. But an explanation is still desired.
Update 2011-03-16
Apparently the reason for the 4-disk RAID 10 group is to match the PDW configuration. No data was provided as to whether this could still meet the old 100MB/s per disk objective.

The reasons cited for the new memory recommendation in FTDW 3.0 are correct. My suggestion however, is that rather than cite specific memory configurations which just happen to be reasonable today, apply the underlying principal. The cost of server class ECC memory modules follow a consistent pattern. An 8GB DIMM today is $300 and a 16GB is $850 (from Crucial). The 32GB memory modules are just becoming available, and probably very expensive, say for example: $5000. The premium for the 16GB memory module is small relative to 2x8 GB modules.

The low cost of memory relative to the overall environment and the fact that memory has positive benefits, but no specific hard requirements drives the strategy. Fill the available memory sockets with either 8GB or 16GB memory modules (depending on system constraints). When the premium for 16GB modules disappears, then discontinue the 8GB module from consideration. When the premium for 32GB memory modules becomes less, then both 16 & 32GB modules can be considered.

My preference is still to employ direct-attach storage systems with RAID controllers in the server system. The latest PCI-E SAS RAID controllers cannot support 2.8GB/s, just under the net bandwidth of a PCI-E gen 2 x8 slot. Each 6Gbps SAS x4 channel can support 2GB/s. Modern 15K disk drives can deliver 150MB/s on the outer tracks. This infrastructure has an amortized cost structure of less than $700 per disk. The cost structure of low-end SAN systems specified by FTDW is typically 2X higher per disk. I also prefer to employ more 2.5in 15K HDDs over more channels than specified by FTDW over using 300GB+ 3.5in 15K HDDs. 

Index Light Strategy
FTDW endorses the index light strategy. The presumption is that the data warehouse is very large, in the multi-TB range, too large to fit in system memory. The storage strategy employs hard disk drives because an HDD storage system properly configured with IO distributed over multiple channels and many disk drives can deliver high sequential IO bandwidth economically.

The SQL Server query optimizer cost model is built on the model that table scans run at 10.8MB/s (or 1350 pages/sec) and random IO (from loop joins and key lookups) run at 320 IOPS. The absolute value of each in relation to the actual server is not important. The important point is 1350 pages/s sequential to 320 random IOPS ratio. 

The FTDW reference architecture is designed to achieve sequential IO of 100MB/sec per disk, or 12,800 pages/sec. A 3.5in 15K disk can support 175 IOPS (225 for 2.5in 15K) for random IO distributed over the entire drive at queue depth 1. At higher queue depth, 15K disks can support 320 IOPS. So the FTDW reference architecture IO performance characteristic has the ratio 12,800 pages sequential to 320 random IOPS.

    Sequential   Random
SQL Optimizer   1,350   320
FTDW   12,800   320
SSD/Memory   1-2   1

There is no provision to adjust the SQL Server query optimizer sequential-to-random ratio. For this reason, FTDW favors avoiding non-clustered indexes so that key lookups and loop joins are not an execution plan option. A good partitioning scheme to help contain the range of scan operations is an essential element of the index light strategy.

Tempdb on SSD
SQL Server issues IO to tempdb at high queue depth, typically 64K in size, to store the intermediate results of large hash joins and sort operations. The pattern is a write burst eventually followed by reads. A very large disk array can support this, but a small array of SSDs can support extraordinarily high IO rates at lower latency than HDDs.

The next consideration is that if only tempdb is on SSD, there is not a hard requirement to employ enterprise grade SSDs. Depending on the situation, some environment can consider consumer grade SSDs. The difference in cost between consumer and enterprise SSD is more than 10X ($30K per TB for enterprise, less than $3K for consumer, all without RAID). For comparison, an HDD storage designed for performance might cost $10K per TB in RAID 10.
Some HD storage systems can support SSD with SATA/SAS, but not all can support the very high IO rates possible with SSD, so it might be a good idea to employ PCI-E SSDs. This configuration does not support clustering, but clustering is not necessarily a true requirement for data warehouses. Going forward, expect storage systems to finally abandon the under-powerful obsolete processor/controllers, and switch to powerful Intel Xeon 5600 or later processors. Then intermixing SSD and HDD should not be a problem.

Index Heavy Strategy
There appears to be a tendency to treat data warehouses as contest of whose is bigger. The data warehouse exists to serve the organization. With good design, the active portion of a data warehouse may fit in the system memory of modern server systems. Current 2-socket systems support 192-256GB, 4-socket server supports 1TB and 8-socket systems support 2TB memory. The cost of 1TB memory, 64 x 16GB ECC DIMMs should be less than $64K.

Another possible strategy is to employ SSD for the main database itself, or the most critical portions of the data warehouse. In an SSD storage system, there is very little difference between large block “sequential” IO and small (8K) block “random” IO. The difference in throughput between 1 page of a (sequential, large block) table scan and a “random” 1 page loop join/lookup IO might be none or 2:1. It also turns out that for data in memory, the actual CPU cost of a table scan per page is approximately the same as 1 key lookup or loop join.

In this case, the strategy should be to employ (sufficiently) covered indexes as necessary. It may even be advantageous to force index in some cases because the optimizer assumes the cost ratio is 4 pages scan equal 1 random IO. Note that the index heavy strategy will result execution plans with loop joins and key lookups. These operations do not scale to high degrees of parallelism, where as the hash join has excellent scaling at very high degrees of parallelism. (It is suspected the reason is each row latches the root level, resulting in contention at DOP 8 and higher. I do not understand why there cannot be just a single latch for the entire query or even one per thread.)

Published Thursday, February 24, 2011 9:48 AM by jchang

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS



jimmie johns said:

Interesting Read.

Kejser has the Data and TempDB on the same luns (128) and the Tran logs spread across an 8-disk SSD from his 128-Core talk. He still pushing SANs over DAS...

Dont know why flash SSDs even exist as a product as DDR SSDs with battery give the IOPS...

February 24, 2011 3:17 PM

jchang said:

data and tempdb should share LUNs because the IO pattern is a sequence of: read from data, then write to temp, repeated. So having separate data and tempdb disks means only one set is active at a given time.

I am not sure TK advocates SAN over DAS as opposed to just acknowledging that big customers prefer to buy expensive SAN storage (which is true). Also, many people feel the compulsion to cluster the DW system almost as a reflex reaction.

As to whether the T-log needs to be on SSD, it depends. SQL Server 64-bit allows 32 in-flight log writes. I believe HDD can achieve 0.3ms latency. In theory, this would support 90K T-log write/sec. SSD can support 0.1ms latency. I am inclined to think that the 8-unit stripe is as much for capacity as for performance. Lets say a specific SSD can support 10,000 8KB (non-compressible) writes/sec. Then it would take several to support 80K assuming we could gain performance with striping (my concern is the small size of writes, perhaps a smaller stripe size is necessary?).

Flash SSD can get much better IOPS than HDD at say $30K/TB for enterprise grade, $3K/TB for consumer. DDR SSD might get 10X better latency than Flash, at say $60K+/TB. So its all a matter of price-performance requirements.

FYI, I have been an advocate of rethinking main memory. I would prefer a DRAM mounted directly on top of the CPU with thru-connect (Intel concept) as "memory", supported by a DRAM/flash system as data cache, or fast storage.

Wait a minute, which TK slidedeck? the OLTP one? this blog is about DW which is completely different than OLTP, where clustering (or mirroring) is essential

February 24, 2011 5:35 PM

anonymous said:

Yes, DRAM on chip sound interesting.

The 128Core OLTP TK deck/talk mentions this SAN arrangement (MDB and TEMP on same) with Trans on SSD because SQL cant go faster than the bad

His advice was to just pile the files into the Luns, one per core...

You are talking read-only DW...

February 25, 2011 4:00 PM

anon said:

My understanding was always to put TempDB on FASTEST IO because it supports all the merges, hashes, joins etc.

Since Tranlog is seqential, HDD seems to support "fast enough" for OLTP, guess TK points out what worked better at scale.

I still think DW kills SQL more than OLTP...hence the scalar and stream features present and future.

TKs talk was supporting 10,000 INCOMING trans for a bank.

That means .1ms per...HDDs cant rotate that fast...well maybe cached..

February 25, 2011 4:08 PM

jchang said:

I would rather say that it helps to have tempdb on the storage capable of extremely high volume of 64K read & writes to support the very high IO queue depth generated by SQL Server engine. Tempdb IO does not appear to be pure sequential, hence more disks than normally prescribed by FTDW would help. It just so happens that SSDs seem to fit the bill nicely.

The main storage system could have a mix of HDD and SSD, but I am concerned about the weak processor and limited IO bandwidth on many entry-level SAN storage systems. One exception seems to be the new EMC VNX with Xeon 5600 processors, but EMC has provided very limited architecture details on the VNX so far.  

February 26, 2011 5:23 AM

anon said:

Ill start to look at SAN evals, but I fear non-truthful reviwers out there not as forthcoming as you. Besides, not all reviewer's employers can afford the eval units, and if they are anything like MS, will force you to agree to an EULA before you say "um, your product sucks" publicly. Hence the lack of details such as EMC, DELL etc.

I am thinking these servers are more powerful than the SANs supporting them...

VITA in Virginia comes to mind...(every state service like DMV on one SAN): Hey Northrop Grumman: dumb idea!

February 27, 2011 1:42 PM

jchang said:

I would like to learn more architecture details about the new EMC VNX, which is based on the Intel Xeon 5600 processors, so now EMC finally has a balance of compute, memory and IO bandwidth. But EMC has not yet povided meaningful details.

I would also like to find out what really caused the remote site failure at VITA. My guess is that there were too many systems on the main storage system, and the remote site replication is not synchronized to a consistent state, meaning applications at the remote site could not be started. I really think we have too many powerpoint jockeys (arm chair quarterbacks) who think they have fault tolerance, because they bought it (at extraordinary high price) without realizing FT is really about operational skill.

February 27, 2011 6:37 PM

anon said:

I dont know for sure

but i heard someone say who was a contrctor that an ECC memory strip went south in the SAM, EMC flied guys in overnight to handle...

they have multiple state govt departments on the whole SAN

February 28, 2011 9:27 PM

Leave a Comment


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine

This Blog


Privacy Statement