
Joe Chang

HP Oracle 10X Extreme Performance Data Warehouse with Exadata Storage Server

For several months, we have seen ads for the joint HP/Oracle RAC and Exadata storage combination touting extreme performance (10X faster) for large data warehouses. One thing I like about Oracle is that they have the courage to pursue technology with deep hardware design implications, even if it takes several iterations to iron out the major issues. I just got around to looking through the Oracle papers on this. Like OPS/RAC, the Exadata technology has implications for how hardware is built. Hardware vendors can be squeamish about designing silicon for special requirements when there is no viable installed base, which leads to a chicken-and-egg situation that many other vendors cannot break out of. Oracle dares to do this, and amazingly gets customers to shell out big chunks of money on the first iteration. This in turn gives gutless, risk-averse hardware vendors the justification to do their part.

 

I will start by saying that the SAN systems out there today are designed for transaction processing, not data warehousing. Most SAN systems are designed with a certain number of FC ports, with the intent of supporting 1-4 disk enclosures (typically 15 disk drives each) per FC port. The SAN controller (or service processor) is a significant portion of the overall cost. Configuring one enclosure per FC port leads to a higher amortized cost per disk than having multiple enclosures per FC port, but sequential bandwidth is still limited by the number of FC ports. Depending on the architecture, it is possible to sustain between 330MB/sec (loop architecture) and 390MB/sec (star architecture) per 4Gbit/sec FC port. So it can take 3 FC ports and 45 disks to support 1GB/sec, even though each individual 15K disk drive can sustain 125-160MB/sec. The amortized cost of a SAN system might be $2,000 per disk, so each 1GB/sec of throughput costs around $90K.
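
For those who want to check the arithmetic, here is a quick back-of-envelope sketch in Python; the per-port throughput and per-disk cost figures are the rough estimates from the paragraph above, not vendor list prices.

```python
# Rough SAN cost-per-bandwidth estimate, using the figures quoted above.
fc_port_MBps = 330       # sustainable MB/s per 4Gbit/s FC port (loop architecture)
disks_per_port = 15      # one 15-disk enclosure per FC port
cost_per_disk = 2_000    # rough amortized SAN cost per disk, in dollars

ports_needed = 3                               # ~3 x 330-390 MB/s covers 1GB/s
disks_needed = ports_needed * disks_per_port   # 45 disks
cost_per_GBps = disks_needed * cost_per_disk   # ~$90K per GB/s of scan bandwidth
print(ports_needed, disks_needed, cost_per_GBps)
```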

 

This is why I have advocated direct attach storage for data warehouses, where each 15 disk enclosure can sustain 800-1000MB/sec at an amortized cost of about $500 per disk. But most people do not like inexpensive high-performance storage solutions. And none of the expensive SAN systems provide sufficient bandwidth for really high-end data warehouse systems.

 

In the Exadata system, the interconnect between host and storage is InfiniBand, which signals at 5Gbit/s per lane over a x4-wide connector (like SAS), for a net bandwidth (after 8B/10B encoding) of 16Gbit/s, or 2GB/sec. The Exadata Storage Server (or cell) is an HP DL180G5 with 2 Xeon E5430 quad-core 2.66GHz processors, 8GB memory, a P400 RAID controller, 12 450GB 15K SAS or 1TB 7200 RPM SATA disks, and a dual-port InfiniBand HCA. Curiously, of the 5.4TB raw capacity with twelve 450GB drives, only 1.5TB is available. RAID 10 overhead brings the 5.4TB down to 2.7TB; some space is required for internal use, but 1.2TB of it seems rather large.
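
The link and capacity arithmetic for a single cell, again as a rough sketch using the quoted numbers rather than measurements:

```python
# InfiniBand DDR x4 link: 4 lanes x 5Gbit/s, minus 8B/10B encoding overhead.
lanes, signal_Gbps = 4, 5.0
net_GBps = lanes * signal_Gbps * 8 / 10 / 8      # 2.0 GB/s per link

# Capacity: 12 x 450GB SAS, mirrored, versus the 1.5TB Oracle says is usable.
raw_TB = 12 * 0.45                               # 5.4 TB raw
raid10_TB = raw_TB / 2                           # 2.7 TB after RAID 10
internal_TB = raid10_TB - 1.5                    # 1.2 TB unaccounted for
print(net_GBps, raw_TB, raid10_TB, internal_TB)
```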

 

A complete pre-configured HP/Oracle Database Machine full rack comprises 8 HP DL360 servers, each with two Xeon E5430 quad-core processors and 32GB memory, running Oracle RAC; 14 Exadata storage cells; 4 InfiniBand switches; and 1 Gigabit Ethernet switch for auxiliary communications. A half-rack has half of the above components. Each storage cell is listed as supporting 1,000MB/sec with SAS drives and 750MB/sec with SATA drives, and the listed bandwidth for the pre-configured full rack is 14GB/sec. Exadata bandwidth is stated to scale linearly with the number of racks, but without explicit performance numbers.

 

Compare the Exadata cell with an EMC CLARiiON CX4-960 mid-range SAN. The CX4-960 comprises 2 SPs, each with two quad-core processors and 16GB memory, for which the minimum meaningful configuration is 16 disk enclosures (240 disks) over 16 FC ports. So the resource allocation per SP is 2 quad-core processors, 16GB memory and 120 disks, with a probable sequential bandwidth of 3GB/sec (8 FC ports at roughly 380MB/sec each). The Exadata cell provides approximately the same compute power and 8GB memory for just 12 disks, targeting 1GB/sec sequential bandwidth.
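
Put side by side, the per-disk resource ratios implied above look like this (a sketch with the round numbers from this paragraph; actual configurations vary):

```python
#                 cores  mem_GB  disks  seq_MBps
systems = {
    "CX4-960 SP":   (8,    16,    120,   3_000),
    "Exadata cell": (8,     8,     12,   1_000),
}
for name, (cores, mem, disks, mbps) in systems.items():
    print(f"{name:13s} {cores/disks:.2f} cores/disk, "
          f"{mem/disks:.2f} GB/disk, {mbps/disks:.0f} MB/s per disk")
```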

 

The purpose of the massive compute power per disk in the Exadata cell relative to a standard SAN is to offload compute functions from the main database engine. Concentrating capability, be it compute power or IO bandwidth, in a single system is always difficult, so distributing work can be useful if it can be done effectively. One candidate is compression. (SQL Server 2008 can store tables and indexes with row- or page-level compression, which takes CPU resources. Oracle probably has comparable capability as well.) [Exadata is for Oracle systems only?] Offloading this to the storage element might be desirable. Finally, the Exadata cell is not just a storage engine, but can also handle database protocols.

 

In addition to commands such as "fetch this block and decompress it," the Exadata cell can also handle SELECT * FROM Table WHERE col = ‘SARG’ (Smart Scan Offload Processing). In a data warehouse, the expectation is that queries are ad hoc, with no indexes built for them, so the database must be able to power through very large table scans. This requires both IO bandwidth and CPU resources, as a database table scan is not a simple IO operation (see my other posts on this matter).

 

The very recent TPC-H report for the HP BladeSystem, dated June 3, 2009, uses Oracle Exadata Storage Servers (more on this below) and includes price information. The full cost for 6 Exadata storage cells and supporting components is $536,516. (Of this, each Exadata Storage Server is $24,000, and the Exadata software is $360,000. Interestingly, the list price of a similarly configured DL180G5 is $14,000.) Three-year support is another $479,846, for a total 3-year cost of approximately $1M. It is unclear how much of the discount applies to the Exadata versus the Oracle database software. The amortized list price per cell is $166K, or about $14K per disk. That Oracle can sell this clearly sends the message that people are not looking for cheap hardware in the data center.
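
Amortizing the quoted list prices over the 6 cells (72 disks) works out roughly as follows; which supporting components get lumped in is my assumption, so the per-cell figure lands a little above the one cited.

```python
cells, disks = 6, 6 * 12
hw_and_sw   = 536_516      # 6 cells + Exadata software + supporting components
support_3yr = 479_846
total_3yr   = hw_and_sw + support_3yr     # ~$1.0M over 3 years

print(total_3yr / cells)   # roughly $166-170K per cell, depending on what is counted
print(total_3yr / disks)   # ~$14K per disk
print(360_000 / disks)     # $5K of Exadata software licensing per disk
```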

 

Now, when HP published a 1000GB TPC-H report for an HP Superdome with 32 Itanium 2 9140 sockets (64 cores) on April 29, 2009, I was very puzzled. What was the purpose of this publication? HP had already published a 10TB result on a Superdome with 64 sockets (128 cores) of the same Itanium 2 9140 processors back in March 2008, with no real disparity between the two (acknowledging that results at different sizes are not directly comparable). Then, about one month later, HP published the 64-node Oracle RAC with Exadata storage result of June 2009 (note that the RAC servers are BL460 blades with 2 Xeon quad-core 3GHz processors, different from the pre-configured Database Machine).

 

System               SuperDome                   BL460 Cluster
Database             Oracle 11g + Partitioning   Oracle 11gR2, RAC
QphH@1000GB          123,323                     1,166,976
TPC-H Power          118,577                     782,608
TPC-H Throughput     128,259                     1,740,122
Total System Cost    $2,532,527                  $6,320,001
Processors           32 Itanium 9140 1.6GHz      128 X5450 3GHz
Cores                64                          512
Memory               384GB                       2080GB
Disks                768                         6 Exadata Cells
HBA                  64 Dual Port FC 4Gb/s       64 x 2 InfiniBand

 

Was the intent that people would see the Oracle RAC with Exadata storage result and draw conclusions from the approximately 10X performance difference versus another recent 1000GB result?

 

What conclusions can we draw from each of the two results? The first matter to understand is that the TPC-H scale factor 1000GB means the LineItem table, data only, is approximately 1000GB in size. The full database with all tables and indexes is approximately 1700GB. So the entire database fits in memory on the 512-core RAC system with 2080GB, but not on the 64-core system with 384GB.
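
The memory-fit point, in numbers (sizes as given above):

```python
db_total_GB = 1_700     # ~1000GB LineItem data plus remaining tables and indexes
for system, mem_GB in [("SuperDome, 64 cores", 384), ("BL460 RAC, 512 cores", 2_080)]:
    verdict = "fits in memory" if mem_GB > db_total_GB else "must be read from disk"
    print(f"{system}: {mem_GB}GB RAM vs ~{db_total_GB}GB database -> {verdict}")
```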

 

Next, the 64-core Itanium system has 64 dual-port FC adapters, meaning 128 4Gbit/sec ports, which could support 42GB/sec at 330MB/sec per port. But it is unlikely that 768 disk drives can sustain this volume (55MB/sec per disk) through a SAN. It is also interesting that this system was configured with EVA 4400 storage while other HP SuperDome Unix results employ the MSA1000. (It is nice to have 32 EVA SAN systems, or 256 MSA1000s, available for performance testing.)
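
The FC plumbing versus per-disk arithmetic, as a quick sketch using the 330MB/sec per-port figure:

```python
ports = 64 * 2                                  # 64 dual-port 4Gbit/s FC HBAs
per_port_MBps = 330
aggregate_GBps = ports * per_port_MBps / 1000   # ~42 GB/s of FC connectivity
per_disk_MBps = ports * per_port_MBps / 768     # ~55 MB/s each disk would need to sustain
print(aggregate_GBps, per_disk_MBps)
```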

 

Note that TPC-H query 1 (a table scan of most of the LineItem table) takes 169.8 seconds on the 64-core Itanium system and 10.3 seconds on the 512-core RAC system. This means that if the data had to be read from disk, the disk system would have to support 6GB/sec on the 64-core system and 97GB/sec on the 512-core system. The 64-core Itanium system definitely has to read data from disk, and its configured disks can easily support 6GB/sec (only 8MB/sec per disk), while the 512-core system has 6 Exadata storage cells, which per specifications support exactly 6GB/sec, nowhere near 100GB/sec. But all the data fits in memory, so disk reads for data do not occur, given the TPC-H sequence where the test run occurs after the database load and index build.
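
The bandwidth each system would need if Q1 actually hit disk, as a sketch (Q1 is treated here as a scan of the full ~1000GB LineItem table, which slightly overstates it):

```python
lineitem_GB = 1_000
need_superdome_GBps = lineitem_GB / 169.8        # ~5.9 GB/s
need_rac_GBps       = lineitem_GB / 10.3         # ~97 GB/s

per_disk_MBps = need_superdome_GBps * 1000 / 768 # ~8 MB/s per disk: easily done
exadata_spec_GBps = 6 * 1.0                      # 6 cells x 1GB/s (SAS) per the spec sheet
print(need_superdome_GBps, per_disk_MBps, need_rac_GBps, exadata_spec_GBps)
```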

 

The data load time was 1:07:12 on the 64-core Itanium and 2:22:57 on the 512-core RAC, which may also indicate the relative storage performance.

 

Another point worth noting is that there is an 8X difference in cores for a 9.5X difference in the QphH composite. However, the Core 2 architecture Xeon 3.0GHz cores (45nm) are much more powerful than the Itanium 2 1.6GHz cores (90nm). The gain in Power is 6.6X and the gain in Throughput is 13.6X. The TPC-H Power metric is based on a geometric mean of the 22 queries, some of which are small and others large. The geometric mean has the effect that a 2X speedup in a small query yields the same benefit as a 2X speedup in a large query. The issue is that getting a 10X gain in a small query is very difficult, so scaling on the Power metric is attenuated.
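
A toy illustration of the geometric mean effect (made-up per-query rates, not actual TPC-H timings): doubling the fastest query moves the mean by exactly the same factor as doubling the slowest.

```python
from math import prod

def geo_mean(xs):
    return prod(xs) ** (1 / len(xs))

base = [3600.0, 10.0] + [100.0] * 20               # 22 per-query rates, one tiny, one huge
faster_small = base.copy(); faster_small[0] *= 2   # 2X the already-fast query
faster_large = base.copy(); faster_large[1] *= 2   # 2X the long-running query

g = geo_mean(base)
print(geo_mean(faster_small) / g, geo_mean(faster_large) / g)   # both ~1.032, i.e. 2**(1/22)
```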

 

In summary, the two performance reports are definitely sufficient to assert that Oracle RAC can scale, but having single-node and 8-node performance reports for the BL460 would help confirm this. The two TPC-H reports say nothing about the Exadata storage system's performance, either its sustainable sequential bandwidth or the value of Smart Scan Offload processing. If the 64-node, 2080GB-memory Exadata configuration had been run at 10TB, then we might have an idea of its capabilities. Based on the 100GB/sec table scan estimate above, that would require 100 Exadata cells, which might be beyond its actual scaling capabilities.

 

The performance data cited in Oracle's Exadata whitepaper lack the details needed to attribute the source of the performance gain. Given that most SAN systems are horribly configured for data warehouse performance, it is quite probable that dropping in the preconfigured full-rack Exadata, with 14GB/sec of sustained table scan bandwidth, can easily generate the quoted numbers.

 

Notes on the Exadata Storage Server

At the time this product came out, the choice of the DL180G5 was reasonable. However, this system, based on the Intel 5100 chipset, has 1 x8 and 2 x4 PCI-E Gen 1 slots.

I am guessing that the dual-port InfiniBand HCA occupies the PCI-E x8 slot, that the P400 RAID controller occupies one of the x4 slots, and that the 12 internal disks are connected to one of the two x4 SAS ports on the P400.

Technically, each InfiniBand DDR (5Gbit/s per lane) x4 channel is 20Gbit/s, which after 8B/10B encoding is 16Gbit/sec (2GB/s), enough to fully consume a x8 PCI-E Gen 1 slot. The 2 IB channels are for path redundancy, not bandwidth aggregation, as the x8 PCI-E bandwidth is limited to 2GB/sec.

Since the disk drives are probably on a single x4 SAS port, this bandwidth is limited to 1GB/sec, even though each 15K disk drive can do 160MB/sec (not accounting for RAID 10 implications). So while there is 4GB/s of combined bandwidth on the InfiniBand links, only 2GB/sec can be sent over the PCI-E port to the IB HCA, and only 1GB/s can be driven from the RAID controller.
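
Chaining the limits described above makes the ~1GB/sec-per-cell figure easy to see; the numbers below are the estimates from this discussion, not measured values.

```python
hops_MBps = {
    "12 x 15K SAS disks (160MB/s each)": 12 * 160,   # 1920 MB/s of raw disk
    "single x4 SAS port to the P400":    1_000,      # ~1GB/s practical
    "PCI-E Gen 1 x8 slot to the IB HCA": 2_000,
    "2 x IB DDR x4 links (2GB/s each)":  4_000,
}
bottleneck = min(hops_MBps, key=hops_MBps.get)
print(f"bottleneck: {bottleneck} at ~{hops_MBps[bottleneck]} MB/s")
```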

 

Now that the DL180G6 is available, with 1 x16 and 2 x8 PCI-E Gen 2 slots, I would retain the dual-port x4 IB HCA (if it supports PCI-E Gen 2), move to the new P410 RAID controller and the 25-bay SFF drive cage (assuming the bays can be split across two x4 SAS channels), and use 24 146GB 15K or 300GB 10K SFF drives (there is no point offering a 250GB SATA option). This unit should support 2GB/sec.

I might even be tempted to bug HP to split the drive bays into 4 channels, 6 bays per channel, with x4 SAS on each channel. There would be 2 P410 controllers, each driving 2 channels. The 6 disks in each channel might not drive 1GB/s, but 4 channels might support 3GB/s? I priced this configuration at around $19K.

Too bad we cannot have a generic InfiniBand SAN and skip the Exadata software ($5K of licensing per disk?).

Published Tuesday, June 09, 2009 8:26 PM by jchang


Comments

 

mark cleary said:

Now that Windows Storage Server is available (via TechNet and MSDN for evaluation and development purposes) and an iSCSI target is included, you could build your own SAN. You'd have to substitute 10Gb Ethernet for the InfiniBand. Would be a nice experiment.

July 6, 2009 7:50 AM
 

jchang said:

It's tempting to investigate this; however, I do not think it will lead to more consulting business for me. Customers seem to be content spending lots of money on an expensive storage system, like one with 14 DAEs and 210 disks. Then I cannot get them to cough up another $4K for a pair of dual-port FC HBAs so there are 4 FC ports per cluster node instead of 2.

July 8, 2009 1:59 PM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
