
Joe Chang

First Nehalem TPC-H

Earlier I talked about the first TPC-C and TPC-E results for 2-way Nehalem, i.e., the Intel Xeon 5500 series. The results were spectacular relative to the previous-generation Xeon 5400 series (a 2.5X gain in the Intel slide deck for database OLTP), pretty much reaching the same range as the 4-way Xeon 7460.

I pointed out that while these were legitimate results, the TPC-C and TPC-E benchmarks generate a high call volume, about 1000 RPC stored procedure calls per second per core, meaning each call averages around 1 CPU-ms. This type of usage benefits from the Intel Hyper-Threading feature. The HT gain was around 10-20% back in the NetBurst days, and I am inclined to think it is now much larger with Nehalem, possibly 30-40%. An application like TPC-H would not benefit from HT, but Nehalem should still show a moderate performance gain over Core 2 on the basis of micro-architecture improvements alone (plus the integrated memory controller).
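For a rough sense of that arithmetic, here is a back-of-envelope sketch. The call rate, the 2.5X OLTP gain, and the 30-40% HT range are the estimates discussed above, not measurements:

```python
# Back-of-envelope check of the call-rate and HT estimates above.
calls_per_sec_per_core = 1000                 # approximate TPC-C/E RPC call rate per core
cpu_ms_per_call = 1000.0 / calls_per_sec_per_core
print(f"average CPU time per call: ~{cpu_ms_per_call:.0f} ms")

# If the overall OLTP gain is ~2.5X, a 30-40% HT contribution would leave
# the remainder for the cores, micro-architecture, and memory controller.
overall_oltp_gain = 2.5
for ht_gain in (1.3, 1.4):
    print(f"HT at {ht_gain:.1f}x leaves {overall_oltp_gain / ht_gain:.2f}x from everything else")
```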

Well, my thanks to Dell for publishing a TPC-H result for Nehalem. Notice to other vendors: get going, slackers! Below is the 2-way Xeon 5500 versus the Xeon 5400 or 5300, and the 4-way Xeon 7460 or 7350.

System          Configuration                                     TPC-H@100GB
T610            2 Xeon 5570 Quad-core 2.93GHz, 8M L3, 48GB             28,773
ML370G5         2 Xeon 5355 Quad-core 2.66GHz, 2x4M L2, 64GB           17,687
DL580G5         4 Xeon 7350 Quad-core 2.93GHz, 2x4M L2, 128GB          34,990

 

System          Configuration                                     TPC-C
DL370G6         2 Xeon 5570 Quad-core 2.93GHz, 8M L3, 144GB      631,766 (Oracle/Linux)
ML370G5         2 Xeon 5460 Quad-core 3.16GHz, 2x6M L2, 64GB     275,149
DL580G5         4 Xeon 7460 Six-core 2.66GHz, 16M L3, 256GB      634,825

 

System          Configuration                                     TPC-E
Fujitsu RX300   2 Xeon X5570 Quad-core 2.93GHz, 8M L3, 96GB      800.00
TX300 S4        2 Xeon X5460 Quad-core 3.16GHz, 2x6M L2, 64GB    317.45
Dell R900       4 Dunnington Six-core 2.66GHz, 16M L3, 64GB      671.35
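To put the three tables on one scale, it is worth computing the ratio of the 2-way Nehalem result to the 4-way Core 2 / Dunnington result in each benchmark. This small sketch uses only the published figures above:

```python
# Ratio of the 2-way Xeon 5570 result to the 4-way result in each table above.
results = {
    "TPC-H@100GB": (28_773, 34_990),     # T610 vs 4-way Xeon 7350 (DL580G5)
    "TPC-C":       (631_766, 634_825),   # DL370G6 vs 4-way Xeon 7460 (DL580G5)
    "TPC-E":       (800.00, 671.35),     # Fujitsu RX300 vs 4-way Dunnington (Dell R900)
}
for benchmark, (two_way, four_way) in results.items():
    print(f"{benchmark}: 2-way / 4-way = {two_way / four_way:.2f}")
# Roughly 0.82, 1.00, and 1.19: on the high call volume benchmarks the 2-way
# Nehalem catches or passes the 4-way system, while on TPC-H it does not.
```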

It is unfortunate that the last TPC-H results on Intel were for the 65nm Xeon 5300 and 7300 series (except for the Unisys 10TB 16-way). Let's just suppose that a Xeon 5470 3.33GHz would score 20% higher than the Xeon 5355 2.66GHz: the frequency difference might contribute 10%, and the micro-architecture improvements going from the 65nm to the 45nm Core 2 contribute the rest (the larger cache is not expected to benefit TPC-H's high row count queries). This would make the Xeon 5500 series about 35% faster on TPC-H than the 5400, which is more than I expected from the Core 2 to Nehalem architecture improvements alone.
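The arithmetic behind that 35% figure, with the Xeon 5470 score being purely hypothetical (only the 5355 and 5570 scores are published results):

```python
# Sketch of the extrapolation above.
xeon_5355_qphh = 17_687                 # ML370G5, TPC-H QphH@100GB (published)
xeon_5570_qphh = 28_773                 # Dell T610, TPC-H QphH@100GB (published)

hypothetical_5470 = xeon_5355_qphh * 1.20   # assumed +20% for a 3.33GHz 45nm Core 2
print(f"hypothetical Xeon 5470 score: ~{hypothetical_5470:,.0f}")
print(f"Nehalem gain over that: {xeon_5570_qphh / hypothetical_5470:.2f}x")  # ~1.36, i.e. ~35%
```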

Of course, this Dell result uses the Fusion-io SSDs that plug directly into PCI-E slots, instead of going through a RAID controller and then the SAS interface. I am looking through the individual TPC-H queries, comparing against both the 2-way 5355 and the 4-way 7350 results, and I think there is reason to believe that the SSD storage improves performance over a large array of disk drives. A large disk array can deliver sufficient sequential bandwidth, but some SQL operations will generate small block IO, and writes to tempdb should be much faster on SSD. The 4-way 16-core Core 2 (7350) has better overall performance than the 2-way 8-core Nehalem, but on some individual queries the Nehalem system scores better.

I am inclined to think the 2-way six-core Opteron (Istanbul) could be close to the 2-way quad-core Nehalem on TPC-H, despite Nehalem's large advantage in TPC-C/E. Intel has a myopic view that the big-dog processor should be reserved for 4-way and larger systems. (It may not make sense to put the 8-core Nehalem EX into a 2-way system if the 6-core 32nm Westmere core will be available soon.)

To reiterate, it is very important that all key benchmarks are published so we can get a good idea of what to expect under each circumstance. No one (among reasonable people) expects miracles and magic. It is a complete picture that is important. Knowing that you should expect 20% is better than a misguided belief or hope for 2X.

Some people think I am paranoid and deeply cynical. So I will now resemble this accusation. My thinking is that Intel had the full set of benchmark results months ago. It was pointed out that some benchmarks, mostly the ones that benefit from HT, showed huge gains, while others just showed good gains. Every organization has worthless marketing types who feel the need to justify their salary. So it was decided to withhold the DW results, just so the worthless crap marketing slides could show the big numbers instead of a complete picture. The complete picture is important, and we are happy that Nehalem has arrived. We can work with its actual performance characteristics: a spectacular gain on some workloads, a good gain on others. So stop tinkering with the slide deck!

There is the truth, the whole truth, and nothing but the whole truth. Some people can handle item 1, but know to stay well clear of items 2 and 3.

The use of the Fusion-io SSD is interesting. As mentioned above, it interfaces directly to PCI-E. The first generation uses PCI-E gen 1 x4 and can do 750MB/s (32K) read, 500MB/s write, and 116-119K IOPS (4K), in capacities of 80 and 160GB for SLC and 320GB for MLC. The second generation can do 1.5GB/s read, 1GB/s write, and 200K IOPS (4K), in capacities of 160/320GB SLC and 640GB MLC; its interface is PCI-E x8 gen 1 or x4 gen 2.
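A quick cross-check of those vendor figures, plus the aggregate for the 4-card Dell configuration described below:

```python
# Rough cross-check of the Fusion-io figures quoted above (vendor specifications,
# not measurements), plus the aggregate for the 4-card Dell TPC-H configuration.
first_gen_read_mb_s = 750
first_gen_iops_4k = 119_000

iops_as_mb_s = first_gen_iops_4k * 4 * 1024 / 1e6
print(f"4K random read expressed as bandwidth: ~{iops_as_mb_s:.0f} MB/s")

cards = 4
print(f"aggregate sequential read for {cards} cards: ~{first_gen_read_mb_s * cards} MB/s")
```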

The Intel 5520 chipset has 36 PCI-E gen 2 lanes plus the ESI. A 2-way Nehalem system can be built with 1 or 2 5520 IOHs. The Dell T610 has 1 IOH, with 2 x8 and 3 x4 slots available (one x4 for the internal SAS?). The Dell TPC-H config has 4 Fusion-io drives, which is fine for this test; an actual production system might want to configure more SSDs. The HP ML370G6 with 2 IOHs has 10 slots (2 x16, 2 x8, 6 x4, one for NICs). The x16 slots are useless for database servers because no network or storage IO adapter can really use x16 bandwidth, so those lanes are wasted in an unbalanced slot layout; the x16 slots might be useful for HPC or something. Hopefully Dell or HP will make a system with something like 7 x8 and 4 x4 slots. To make maximum use of the (current) Fusion-io SSDs, we would want 18 PCI-E x4 gen 2 slots, but I think Fusion-io could be persuaded to do a double-wide SSD instead.
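The lane budget behind those slot counts (the slot mixes are the ones proposed above, not a shipping Dell or HP configuration):

```python
# PCI-E gen 2 lane budget for a dual Intel 5520 IOH layout, as discussed above.
lanes_per_5520_ioh = 36
total_lanes = lanes_per_5520_ioh * 2
print(f"total gen 2 lanes with two IOHs: {total_lanes}")

# The "7 x8 + 4 x4" layout suggested above uses exactly that budget.
used = 7 * 8 + 4 * 4
print(f"7 x8 + 4 x4 slots use {used} lanes")

# An all-x4 layout for first-generation Fusion-io cards would give 18 slots.
print(f"all-x4 layout: {total_lanes // 4} slots")
```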

SuperMicro does have a dual 5520 IOH motherboard with 7 x8 PCI-E slots. The onboard SAS occupies an x8, and the dual GbE NIC takes another x4; it looks like one x4 is not wired. For a server, I would have used an x4 for the onboard SAS, because that may only connect the boot drives, and I would hang the GbE NIC off the south bridge ICH, making the x4 gen 2 slot available for 10GbE. I used to buy SuperMicro systems because their wide motherboard selection allowed me to get the one with the best IO arrangement for database servers, but when SAS came out, I had a hard time getting the right connectors. I may give them another try if Dell or HP does not do a 7 x8 PCI-E gen 2 system.

Published Wednesday, June 03, 2009 4:55 PM by jchang


Comments

 

Medina said:

I agree with you that "The use of the FusionIO SSD is interesting." For example, HP just submitted TPC-H results based on the same Fusion-io cards. According to their submission, the performance is nearly double that of Dell's recent submission, while also lowering the price to $1.08 per QphH@100GB.

http://www.tpc.org/results/individual_results/HP/HP_DL380G6_100GB_2P_2.93GHzQC_090803_ES.pdf

August 11, 2009 11:59 AM
 

jchang said:

The SF100 TPC-H database, with the SQL Server 2008 date type plus compression, should just fit in the DL380G6's 144GB of memory,

so the T610 + DL380G6 result is mostly a comparison of SSD versus in-memory. I will blog about this shortly.

Many people are infatuated with magical SSD IOPS numbers, without considering that the IO to load from storage is also much more CPU intensive than a simple fetch from the buffer cache.

August 11, 2009 3:41 PM

About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
