Dell recently published a TPC-H report for the PowerEdge T610, 2 x Xeon 5570, with 4 FusionIO 80GB SSD storage devices at 100GB scale factor. So why have we not seen TPC-C or TPC-E OLTP benchmark results published?
It is much more feasible to run the TPC-H data warehouse benchmark on SSD because the Scale Factor 100 size is still allowed, for which the Lineitem table is 100GB for data only, not counting indexes or other tables. The full SF 100 TPC-H database is about 170GB for all tables and indexes. Additional space is required for tempdb.
The TPC-C and TPC-E benchmarks require the database size to be scaled with performance target ranges. Consider the Fujitsu TPC-E published result for the Primergy RX300 S5 with 2 Xeon 5570. The dual-socket Xeon 5570 system scored 800 tps-E, for which the required initial database size is about 3TB. The space actually allocated for the data files is approximately 4.5TB, plus another 85GB for log space.
| System | Fujitsu Primergy RX300 S5 |
| Processors | 2 x Intel Xeon X5570 |
| Memory | 96GB |
| RAID controllers | 5+1 |
| Disk enclosures | 30 |
| HDD | 360 (192 x 73GB 15K + 168 x 146GB 15K) |
| Storage cost | $148K + $49K for 3-year maintenance |
| Raw capacity | 35TB |
| RAID 10 capacity | 18TB |
| Estimated IOPS | 360 x 200 = 72K |
For the 360 15K disk drives, at 200 IOPS per disk, the small-block random IOPS capability of this storage system is 72K, excluding RAID 10 overhead. If the actual load is 10,000 IOPS (at the operating system) with a 50/50 read/write mix, then the raw IO to disk is 5K reads plus 2 x 5K writes, for a total of 15K IOPS to disk. So a 72K IOPS system can actually handle about 48K IOPS at a 50/50 read/write mix in RAID 10.
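The RAID 10 arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the model assumes nothing more than one disk IO per read and two per write (one for each side of the mirror). With the 72K raw figure it gives 48K at the host; rounding the raw figure up to 75K gives 50K.

```python
def raid10_disk_iops(host_iops: float, read_fraction: float) -> float:
    """Disk-level IOPS generated by a host load on RAID 10.

    Assumption: each read costs 1 disk IO, each write costs 2.
    """
    write_fraction = 1.0 - read_fraction
    return host_iops * (read_fraction + 2.0 * write_fraction)


def raid10_host_iops(raw_iops: float, read_fraction: float) -> float:
    """Host-visible IOPS that a given raw disk IOPS budget can sustain."""
    write_fraction = 1.0 - read_fraction
    return raw_iops / (read_fraction + 2.0 * write_fraction)


raw = 360 * 200  # 360 disks at an assumed 200 IOPS each = 72,000

print(raid10_disk_iops(10_000, 0.5))  # 10K host IOPS -> 15000.0 at the disks
print(raid10_host_iops(raw, 0.5))     # 72K raw -> 48000.0 at the host
print(raid10_host_iops(75_000, 0.5))  # 75K raw -> 50000.0 at the host
```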
If we consider that the active database resides on only about 17% of the disk space (3TB of 18TB after RAID 10 overhead), then there is some benefit from the short-stroke effect. If the average queue depth per disk were higher than 1, then command queuing would result in even higher IOPS per disk. The actual IOPS per disk might be anywhere from 200 to 300, depending on whether the emphasis was on pure performance or balanced price/performance.
Below is a proposed SSD (+ HDD for archival space) configuration.
| SSD + HDD configuration | |
| SSD capacity | 4.5TB |
| 60-day space | 13TB |
| SSD drives | 155 @ 32GB, or 62 @ 80GB |
| Cost for Intel SSD | $520 for X25-E, 32GB: $80K total; $340 for X25-M, 80GB: $21K total |
| HDD drives for 60-day space | 20 x 1TB SATA ($3200), or 42 x 450GB SAS |
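As a quick check on the SSD rows of the table, here is a small Python sketch that multiplies out the drive counts and unit prices. The dictionary layout and the function name are mine, purely for illustration; the counts and prices are the ones in the table.

```python
# Drive counts and unit prices as listed in the table above.
ssd_options = {
    "X25-E 32GB": {"count": 155, "size_gb": 32, "unit_price": 520},
    "X25-M 80GB": {"count": 62, "size_gb": 80, "unit_price": 340},
}


def summarize(option: dict) -> dict:
    """Total capacity (GB) and total cost ($) for one SSD option."""
    return {
        "total_gb": option["count"] * option["size_gb"],
        "total_cost": option["count"] * option["unit_price"],
    }


for name, option in ssd_options.items():
    print(name, summarize(option))
# X25-E 32GB {'total_gb': 4960, 'total_cost': 80600}
# X25-M 80GB {'total_gb': 4960, 'total_cost': 21080}
```

Both options come to the same 4,960GB, and the totals round to the $80K and $21K shown.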
In addition to the above, we need disk enclosures. Ideally I would like to place no more than 4-5 SSD devices on each x4 SAS port. A x4 3Gbps SAS port can support 1GB/s, but if an HBA/RAID controller with 2 x4 SAS ports is plugged into a x8 PCI-E gen 1 slot, we can only expect 1.6GB/s total (single direction) throughput. The Intel X25 SSDs are rated at 250MB/s sequential read, with sequential write at 170MB/s for the E and 70MB/s for the M.
The X25-E random 4K IO characteristics are 35K IOPS read and 3.3K IOPS write. In the absence of data, let's assume the 8K random read rate is 17.5K IOPS, half the 4K rate (it is probably higher). Under 8K random IO, then, the bandwidth requirement is only 140MB/s per device, and much less for read/write mixes.
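To make the bandwidth reasoning concrete, here is a rough Python sketch. It assumes the 8K random read rate is simply half the 4K datasheet rate (17.5K IOPS), which is a guess rather than a measured figure, and the helper names are hypothetical.

```python
def bandwidth_mb_per_s(iops: float, block_bytes: int) -> float:
    """Bandwidth in MB/s (decimal megabytes) for a given IOPS and block size."""
    return iops * block_bytes / 1_000_000


def devices_to_fill(port_mb_per_s: float, device_mb_per_s: float) -> float:
    """How many devices at a given rate it takes to saturate a port."""
    return port_mb_per_s / device_mb_per_s


read_4k_iops = 35_000
assumed_8k_iops = read_4k_iops / 2  # assumption: 8K rate is half the 4K rate

per_ssd = bandwidth_mb_per_s(assumed_8k_iops, 8192)
print(round(per_ssd, 1))            # about 143 MB/s, i.e. roughly 140MB/s

# Sequential case: 250MB/s SSDs on a x4 3Gbps SAS port (about 1GB/s)
print(devices_to_fill(1000, 250))   # 4.0 SSDs saturate the port
```

Under these assumptions, 4 SSDs reading sequentially fill one x4 port, while at 8K random IO each SSD needs considerably less of the port, which is consistent with wanting no more than 4-5 devices per port.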
The datasheet also lists 7K IOPS for 8K IO at a 2:1 R/W mix on the X25-E. No random IO data is listed for the X25-M in its datasheet, so it is not clear that the X25-M can meet the TPC-C/E random IO requirements for mixed R/W.
A 1U enclosure with 2 SAS ports (daisy-chained enclosures not expected) and 8-10 2.5in bays seems appropriate. The 2U enclosures with 24 bays should have 6 independent SAS ports. The 20 or so 3.5in SATA drives for the 60-day space requirement could be accommodated between the internal bays and 1 or 2 external enclosures. The next generation of systems and components should be PCI-E gen 2 (5Gbps per lane) and 6Gbps SAS, but we expect higher SSD bandwidths as well.
So the SSD cost structure does seem to support the TPC-E benchmark. It would probably also support the TPC-C benchmark as well, based on the HP DL370G6 result for a
The main issue above is that I have not included RAID overhead. It is my opinion that the SSD is not fundamentally a single-component device in the way that a disk drive, with its single motor, is. If the SSD were built with dual controllers and chip-kill ECC on the NAND, then the SSD would be inherently tolerant of a single component failure. Of course, this is not the case yet. I am just looking forward to when we can do without RAID on SSD. I am not convinced RAID controllers are going to be able to keep up with SSD arrays anyway.
Since the IOPS capability of the X25-E with SLC NAND is so high (I am not sure about the X25-M with MLC NAND), the higher small-block random write overhead of RAID 5 is not an issue. So 190 of the 32GB or 77 of the 80GB SSDs would be required in RAID 5.
I should briefly touch on the expected performance benefits of SSD over HDD. The Dell TPC-H result did seem to indicate some benefit from SSD, even though there was no otherwise similar HDD result to compare with. While the TPC-H data warehouse queries may generate many table scans, for which HDD is fine, there are still loop joins and key lookups, which generate pseudo-random IO. Several TPC-H queries also dump intermediate results to tempdb.
I am expecting TPC-C and TPC-E to show reasonable benefits from SSD over HDD. Consider the main TPC-C New Order transaction. A typical published TPC-C result might show an average response time of 0.3-0.4 sec. This procedure processes an order for up to 15 items (average of 10?), which means one update of the Stock table and one insert into the Order Line table for each item, and one insert into the New Order table, plus a few others. Since the TPC-C database is very large, each of the above steps might require a disk IO. On a perfectly configured disk system (for OLTP), the average latency could be as low as 5ms even when the entire system drives 200K IOPS.
Still, if you look at the New Order procedure, it is clear each item must be processed serially. The SQL Server engine might use the Scatter-Gather IO API to consolidate IO calls from multiple concurrent users, but each step in a New Order is issued sequentially, after the previous step completes. Since there are over 20 steps, if each step takes 5ms, then we can see why the average duration is well over 100ms.
With SSD, the IO latency should drop to 0.08 ms (80us), meaning the 20 steps should complete in the range of 2ms. Because there are fewer transactions "in-flight" at any given point in time, the expectation is that the SQL Server engine has less to keep track of.
Consider a system supporting 600,000 tpm-C. That's 10,000 New Order transactions per second. If each New Order procedure averages 0.3 sec, then there are about 3,000 New Order transactions in-flight at any point in time (plus others).
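The in-flight estimate is an application of Little's law (average number in the system = arrival rate x average time in the system). A minimal Python sketch follows; the function names are mine, and the HDD and SSD cases count only the serial IO wait of 20 steps, ignoring CPU and lock time. At exactly 0.3 sec the law gives 3,000 in-flight; an average of 0.33 sec would give 3,300.

```python
def in_flight(throughput_per_sec: float, avg_duration_sec: float) -> float:
    """Little's law: average number of transactions in progress."""
    return throughput_per_sec * avg_duration_sec


def serial_io_ms(steps: int, latency_ms: float) -> float:
    """IO wait for a procedure whose steps are issued one after another."""
    return steps * latency_ms


new_orders_per_sec = 600_000 / 60  # 600K tpm-C = 10,000 per second

print(round(in_flight(new_orders_per_sec, 0.3)))   # 3000 at 0.3 sec each

hdd_ms = serial_io_ms(20, 5.0)    # 20 steps at 5 ms   -> 100 ms
ssd_ms = serial_io_ms(20, 0.08)   # 20 steps at 80 us  -> about 1.6 ms
print(round(in_flight(new_orders_per_sec, hdd_ms / 1000)))  # 1000 waiting on IO
print(round(in_flight(new_orders_per_sec, ssd_ms / 1000)))  # 16 waiting on IO
```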
TPC-C also has performance/size scaling requirements. A 600K transactions per minute result requires approx 50,000 warehouses, each of which requires approx 84MB, for a database size of 4.2TB. The recent 600K tpm-C results required 1000+ disk drives (no RAID requirement) meaning the IOPS load is probably 200-300K, possibly a R/W mix close to 50/50.
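The sizing figures above multiply out as follows. This is just the arithmetic from the paragraph in Python form, using the approximate numbers given there (50,000 warehouses, 84MB each, 1,000 disks at 200-300 IOPS); the function names are illustrative.

```python
def tpcc_db_size_tb(warehouses: int, mb_per_warehouse: float) -> float:
    """Approximate TPC-C database size in TB (decimal units)."""
    return warehouses * mb_per_warehouse / 1_000_000


def iops_range(disks: int, low_per_disk: int, high_per_disk: int) -> tuple:
    """Low and high estimates of aggregate IOPS for a disk population."""
    return disks * low_per_disk, disks * high_per_disk


print(tpcc_db_size_tb(50_000, 84))  # 4.2
print(iops_range(1000, 200, 300))   # (200000, 300000)
```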
Since Wes says the FusionIO 640GB devices are out, let's consider what kind of system would be required. The FusionIO is built with a PCI-E interface, that is, it plugs directly into a PCI-E slot, so it probably comes with its own driver. The second-generation FusionIO matches up nicely with either PCI-E gen 1 x8 or PCI-E gen 2 x4 in terms of bandwidth.
For 4TB we need 7-8 of the 640GB drives. So, ideally, a system should be configured with 9 PCI-E gen 2 x4 slots, plus embedded devices (the extra slot or two is for additional network or SATA drives). The new Intel 5500 IOH has 36 PCI-E gen 2 lanes, plus the x4 gen 1 off the ESI. So a single IOH would support 9 x4 slots, plus GE and SAS off the ESI. The HP ML/DL370G6 actually uses 2 IOHs for a mix of x16, x8 and x4 slots.
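The device and slot counts can be checked with a short Python sketch. The helper names are hypothetical; the inputs are the 640GB device size and the 36-lane IOH from the paragraph above. Plain division gives a minimum of 7 devices for either 4TB or 4.2TB; the eighth would be headroom.

```python
import math


def devices_needed(capacity_gb: float, device_gb: float) -> int:
    """Minimum number of devices to hold the given capacity."""
    return math.ceil(capacity_gb / device_gb)


def slots_supported(total_lanes: int, lanes_per_slot: int) -> int:
    """Number of full-width slots a lane budget can feed."""
    return total_lanes // lanes_per_slot


print(devices_needed(4000, 640))  # 7
print(devices_needed(4200, 640))  # 7
print(slots_supported(36, 4))     # 9 x4 slots from one IOH
```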
Per Grumpy below, at this point in time, SSD devices have very different characteristics, particularly with regard to writes. Writes to NAND need to be in large blocks. Depending on how the SSD controller is implemented, expect some issues. So it may not yet be time to deploy transaction processing on SSD. DW might be worth considering. Still, we should see OLTP benchmarks plus accompanying details to better understand SSD characteristics. Where are the Bashful, Doc, Dopey, Happy, Sleepy and Sneezy DBAs?