
Joe Chang

Parallel Data Warehouse

The Microsoft Parallel Data Warehouse diagram was somewhat difficult to understand in terms of how the functionality of each subsystem relates to the configuration of its components. Now that HP has provided a detailed list of the PDW components, the diagram below shows the PDW subsystems with their component configurations (InfiniBand, FC, and network connections not shown).

[Diagram: PDW subsystems with component configurations; SAS connections to the storage units shown]

Observe that there are three different ProLiant server models, the DL360 G7, DL370 G6 and DL380 G7, in five different configurations suited to the requirements of each subsystem. There are also up to three different configurations of the P2000 G3 storage system: for the Control node cluster, the compute nodes, and the backup node.

Control Node
The Control nodes are ProLiant DL380 G7 servers with 2 Xeon X5680 six-core 3.33GHz 130W processors, 96GB memory, 14 internal 300GB 10K disks, and an external P2000 G3 with 5x450GB 15K disks. The Control nodes parse incoming queries into the queries reissued to the compute nodes, and also reassemble the results from each node into a single result set returned to the client. This would explain the use of powerful processors and a heavy memory configuration.

The purpose of the 14 internal disks is unclear, as one might expect result sorting to take place on the shared storage, unless this is done outside of SQL Server and also outside of the cluster shared resources. On reflection, that is reasonable: on a cluster failover there is no need to recover the intermediate results of queries in progress, as those queries will have to be reissued anyway.

The general idea is to distribute as much query processing to the compute nodes as possible, but there are situations that require intermediate data to be brought back to the control node for final processing. Once there are more environments running PDW, it may be worth evaluating whether a more powerful control node would be suitable, depending on the specific case.

Management Nodes
The management nodes are ProLiant DL360 G7 servers with a single Xeon E5620 quad-core 2.4GHz 80W processor, 36GB memory and 2 disks. If the management nodes have light compute and memory requirements, I am inclined to think that this functionality could be consolidated with the other subsystems. But this is not necessarily an important point.

Landing Zone
The Landing Zone is a ProLiant DL370 G6 with a single Xeon E5620. (The HP spec sheet also mentions the W5580, a quad-core 3.2GHz 130W 45nm Nehalem-EP processor.) The memory configuration cited is peculiar: 6x2GB plus 6x4GB DIMMs. A single-processor Xeon 5600 system should have a limit of 9 DIMMs, three per memory channel, so why not employ 9x4GB DIMMs for the same 36GB total?

The DL370 chassis accommodates up to 14 LFF (3.5in) disks, eliminating the need for an external storage unit. The Landing Zone actually employs 10x1TB 7.2K and 2x160GB 7.2K HDDs. It is unclear why the LZ system was not configured with 14 disks; it could also have been configured with all 1TB disks, with the OS using a small slice. There are now 2TB and 3TB disks in the 3.5in 7.2K form factor, and Seagate has a 2TB enterprise-rated 3.5in 7200RPM drive, but it is unclear when these ultra-large-capacity disks will be available for server systems, or even whether there is a need for additional capacity in this function.

Backup Node
The Backup node is a ProLiant DL380 G7 with 2 Xeon E5620 processors and 24GB memory. Presumably there are 2 internal disks to boot the OS. There is also an external P2000 G3 storage system with sufficient 3.5in drive capacity to support backups, assuming 4:1 compression.

The cited maximum capacity of PDW (with 40 compute nodes) is 500TB, presumably based on uncompressed data. At 4:1 compression this would imply that the Backup node needs about 125TB of net capacity, and the actual net storage works out to about 133TB (see the calculation further below). A reasonable assumption is that a 133TB database with page compression already applied might still yield another 2X reduction on a backup with compression.
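A minimal sketch of this arithmetic, using the figures above; the 4:1 and 2X compression ratios are assumptions of this post, not published specifications.

```python
# Rough backup sizing from the figures above. The 4:1 data compression and the
# additional 2X backup compression are assumptions, not published PDW specs.
uncompressed_tb = 500                       # cited maximum PDW capacity (40 nodes)
backup_capacity_tb = uncompressed_tb / 4    # ~125TB backup capacity at 4:1
net_storage_tb = 133                        # net RAID-10 capacity (calculated further below)
compressed_backup_tb = net_storage_tb / 2   # ~67TB if backup compression yields another 2X
print(backup_capacity_tb, compressed_backup_tb)   # 125.0 66.5
```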

The maximum configuration for the P2000 G3 is 96 LFF (3.5in) disks, with 7 additional disk enclosures. The P2000 G3 does support the 2TB 7.2K drive, so a P2000 with 5 additional disk enclosures, totaling 72x2TB disks, would meet the backup capacity requirement.
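A quick check of the enclosure count; the 12 LFF bays per enclosure are implied by the 96-disk maximum over 1 base unit plus 7 expansion enclosures, and RAID overhead on the backup unit is ignored.

```python
import math

# Enclosure count for the backup storage. 12 LFF bays per enclosure is implied
# by the 96-disk maximum with 7 expansion enclosures; RAID overhead is ignored.
bays_per_enclosure = 12
drive_tb = 2                     # 2TB 7.2K LFF drives supported by the P2000 G3
required_tb = 133                # net capacity to cover, from the estimate above

drives_needed = math.ceil(required_tb / drive_tb)            # 67 drives
enclosures = math.ceil(drives_needed / bays_per_enclosure)   # 6 enclosures total
print(drives_needed, enclosures, enclosures - 1)             # 67, 6, 5 expansion
```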

Compute Node
The Compute nodes are ProLiant DL360 G7 servers with 2 Xeon X5670 six-core 2.93GHz 95W processors, 96GB memory, 8 internal 300GB 10K HDDs, and one external P2000 G3 with the option of 11x300GB 15K, 11x1TB 7.2K, or 24x300GB 10K drives. The external storage unit is for permanent data storage; the internal disks are for tempdb.

There are now 600GB 3.5in 15K and 600GB 2.5in 10K drives, so it is possible that these will replace two of the current options in the near future. A single (42U) rack must support 10 compute nodes (plus 1 spare), 10 storage nodes, and the associated switches (2 InfiniBand, 2 Ethernet, 2 Fiber Channel). This precludes a 2U form factor for the compute node. The 1U DL360 G7 cannot support the 130W thermal envelope for each of 2 Xeon X5680 processors, so the 95W Xeon X5670 processors are employed instead.
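A rough rack-unit budget supports this point; the 2U height for the P2000 G3 and 1U per switch are my assumptions, since the HP list does not state them.

```python
# Rough 42U rack budget for one compute rack. The DL360 G7 is 1U; the 2U
# P2000 G3 and 1U per switch are assumptions, not from the HP component list.
def rack_units(compute_node_u):
    compute = 11 * compute_node_u      # 10 compute nodes plus 1 spare
    storage = 10 * 2                   # 10 P2000 G3 storage units at 2U each
    switches = 6 * 1                   # 2 InfiniBand + 2 Ethernet + 2 FC switches
    return compute + storage + switches

print(rack_units(1))   # 37U with 1U compute nodes -- fits in 42U
print(rack_units(2))   # 48U with 2U compute nodes -- does not fit
```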

If tempdb is actually on the 8 internal drives, then I wonder why the P2000 storage units employ RAID-10. Writes to permanent data are expected to be infrequent, negating the need for small-block random write IO performance (the most serious liability of RAID-5). Only tempdb activity is expected to generate non-sequential IO.
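To put a number on the capacity given up, below is a comparison of usable space per 24-disk storage unit under RAID-10 versus an illustrative RAID-5 layout; the two 12-disk groups are my example, not a documented PDW option.

```python
# Usable capacity per 24x300GB (278GB binary) storage unit: RAID-10 versus an
# illustrative RAID-5 layout of two 12-disk groups (not a documented PDW option).
disks = 24
disk_gb = 278                          # 300GB decimal is ~278GB binary

raid10_gb = (disks // 2) * disk_gb     # half the disks hold mirror copies
raid5_gb = 2 * (12 - 1) * disk_gb      # one parity disk per 12-disk group

print(raid10_gb, raid5_gb)             # 3336 vs 6116 GB usable
```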

Comments
Without more than very brief hands-on time with PDW, I have two comments at this time. One is that the P2000 G3 supports 1,572MB/s bandwidth. Ideally, for 24 disks per storage unit, we would like to target 2GB/s, possibly somewhat more. Hopefully the next-generation HP entry-level storage system will employ the Intel C5500 processor (or successor) or some comparable processor with adequate memory and IO bandwidth. I have heard that the desire is also to move the storage interface from FC to InfiniBand.
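For reference, the per-disk bandwidth these figures imply; the per-disk sequential rates are approximations.

```python
# Per-disk sequential bandwidth implied by the P2000 G3 limit versus a 2GB/s
# target for a 24-disk storage unit; per-disk figures are approximations.
disks = 24
p2000_mb_s = 1572                     # stated P2000 G3 bandwidth
target_mb_s = 2000                    # desired ~2GB/s per storage unit

print(p2000_mb_s / disks)             # ~65 MB/s per disk at the controller limit
print(target_mb_s / disks)            # ~83 MB/s per disk needed to hit the target
# Current 10K/15K drives can sustain well above 65MB/s sequential on outer
# tracks, so the storage unit, not the disks, is the likely bottleneck.
```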

The second comment is that SSD could be considered for tempdb. The 8 internal 300GB 10K drives might cost $2,000 (or whatever OEM volume pricing is), while the cost of a PCI-E 1TB consumer-grade SSD is approaching $2-3K. An enterprise-grade 1TB SSD is higher, depending on the vendor.

The maximum PDW configuration is 40 compute nodes. With the 24-disk storage units (300GB decimal, 278GB binary), there are 960 disks, excluding the capacity of the disks locally attached in the compute node internal bays. Net RAID-10 capacity is then 133,440GB binary, which could correspond to the 500TB uncompressed capacity. The 40-compute-node limit may be a connectivity limit. At some point, the 600GB 2.5in 10K drive should become available for the PDW system, doubling capacity. I am also wondering what would happen if an actual multi-unit PDW customer asked Microsoft for additional external storage units to be daisy-chained.
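The capacity arithmetic, tallied directly from the figures above:

```python
# Capacity arithmetic for the 40-compute-node maximum configuration above.
nodes = 40
disks_per_unit = 24
disk_gb = 278                         # 300GB decimal is ~278GB binary

total_disks = nodes * disks_per_unit              # 960 disks
net_raid10_gb = (total_disks // 2) * disk_gb      # 133,440 GB usable
print(total_disks, net_raid10_gb)
# 500TB of uncompressed user data over ~133TB net implies roughly 3.75:1 compression.
```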

New Ones:
On further consideration, if I could mix HDD and SSD on a SAS RAID controller in the compute node, I would go with 2 HDDs for the OS and 4-6 SATA/SAS SSDs, plus 1 PCI-E SSD in the open PCI-E slot.

Anonymous:
Let's figure $10K each for the heavy-configuration servers, $15K for the storage units, and $10K for each of the IB and FC switches; call it $500K hardware cost for the control rack plus one 10-node compute rack. There are 20 SQL Server processor licenses plus other components, so probably another $500K here. It will probably also involve another $500K in consulting for the deployment, so maybe $2M all told. I am thinking gold electro-plating on the rack might cost $10K.

Compare this with an 8-way ProLiant DL980 environment. Let's just suppose the PDW with 20 Xeon 5600 sockets has 2.5X the performance of 8 Xeon 7500 sockets. The DL980 with memory should cost around $150K, and direct-attach storage (8x24 disks) at $12.5K per unit comes to $100K; throw in another $50K of hardware to resemble the PDW. SQL Server EE licenses for 8 sockets are $240K. Suppose the consulting services for deployment on the SQL Server we are already familiar with come to $100K, bringing the total to roughly $700K. So $2M for 2.5X the performance of the $700K system seems reasonable, considering the strongly super-linear cost structure of scaling up.

Also, figure an Oracle Database Machine with 8 RAC nodes and 14 Exadata Storage nodes will also cost around $2M (somebody please look this up).

Published Thursday, March 10, 2011 2:58 PM by jchang


Comments

 

Joe said:

I see that the Intel 510 is in stock at NewEgg, but no word on availability for the OCZ Vertex 3. I am also puzzled by the Intel SSD 510 specifications.
The 250GB model performance specifications:
Sequential reads: 500MB/s, writes: 315MB/s (6Gbps interface)
Random reads: 20K IOPS, writes: 8K IOPS (4KB)
Then consider the latency: 65 usec read, 80 usec write.
So random reads at 4K will only generate 80MB/s. Presumably the 510 can sustain nearly 20K read IOPS for IO sizes up to about 24KB? Or is there a difference in the way the 510 handles sequential and random IO?

By contrast, the OCZ Vertex 3 (pre-release) with the SandForce 2200 controller is specified at 550/500MB/s sequential read/write and 60K IOPS random. The reason for the higher write performance is that the SandForce controller does compression, so for non-compressible writes the performance could be similar. But the random 4K IOPS generate 240MB/s, only about 2X off the large-block sequential rate.

Given the 65-usec read latency on the Intel 510, we would expect nearly 15K IOPS at queue depth 1, so a higher queue depth should achieve more than 20K?
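The arithmetic behind these figures, from the spec-sheet numbers quoted above:

```python
# Converting the Intel 510 (250GB) and OCZ Vertex 3 spec-sheet numbers above.
read_latency_us = 65
intel_random_read_iops = 20_000
io_size_kb = 4

print(intel_random_read_iops * io_size_kb / 1000)   # ~80 MB/s at 4KB random reads
print(1_000_000 / read_latency_us)                  # ~15,400 IOPS at queue depth 1

vertex3_random_iops = 60_000
print(vertex3_random_iops * io_size_kb / 1000)      # ~240 MB/s at 4KB random
```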

March 10, 2011 4:29 PM
 

Peter said:

April 22, 2012 7:34 AM
 

jchang said:

Peter, thanks for the Oracle link

May 12, 2012 10:13 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
