THE SQL Server Blog Spot on the Web


Joe Chang

Early TPC-C and H reports for Shanghai relative to Barcelona and Dunnington

Below are the first performance numbers for the new AMD 45nm Opteron (Shanghai) relative to other recent AMD and Intel Xeon results.

4-way TPC-C

Opteron 8360 Quad-core 2.5GHz (Barcelona) 2M L3, 471,883 (DL585G2)

Opteron 8384 Quad-core 2.7GHz (Shanghai) 6M L3, 579,814 (DL585G2)

Xeon 7350 Quad-core 2.93GHz (Tigerton) 2x4M L2, 407,079 (DL580G5)

Xeon 7460 Six-core 2.66GHz (Dunnington) 16M L3, 634,825 (DL580G5)

Xeon 7460 Six-core 2.66GHz (Dunnington) 16M L3, 684,508 (x3850M2)

(Note: Intel uses inclusive cache, so I just quote the size of the last level. Opteron is exclusive, so technically I should quote L2+L3, the L2 being 4x512K.) 

Both Opteron TPC-C systems have 256GB memory, 700 disk drives, so the Shanghai is running at lower memory per tpm-C and lower disk per tpm-C. Still, the performance gain from Barcelona (65nm) to Shanghai (45nm) is 23% for 8% higher frequency. For reference the 4-way dual-core Opteron result was 262,989 for the 8220 2.8GHz.
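The two percentages quoted above can be reproduced directly from the listed results. This is only a small arithmetic sketch; the dictionary layout and the helper name `pct_gain` are my own, the numbers are the ones from the list.

```python
# 4-way TPC-C results quoted above: (clock in GHz, tpm-C).
results = {
    "Opteron 8360 (Barcelona)": (2.5, 471_883),
    "Opteron 8384 (Shanghai)": (2.7, 579_814),
}


def pct_gain(new, old):
    """Percentage gain of `new` over `old`."""
    return (new / old - 1.0) * 100.0


old_ghz, old_tpmc = results["Opteron 8360 (Barcelona)"]
new_ghz, new_tpmc = results["Opteron 8384 (Shanghai)"]

tpmc_gain = pct_gain(new_tpmc, old_tpmc)  # throughput gain
freq_gain = pct_gain(new_ghz, old_ghz)    # clock frequency gain

print(f"tpm-C gain:     {tpmc_gain:.1f}%")
print(f"frequency gain: {freq_gain:.1f}%")
```

The tpm-C gain comes out at roughly three times the frequency gain, which is why the extra must come from somewhere other than clock speed.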

 

8-way TPC-H 300GB

Opteron 8360 Quad-core 2.5GHz (Barcelona) 2M L3, 52,860 QphH (DL785, S2K8 w/compression)

Opteron 8384 Quad-core 2.7GHz (Shanghai) 6M L3, 57,865 QphH (DL785, S2K8 w/compression)

Xeon 7350 Quad-core 2.93GHz (Tigerton) 2x4M L2, 46,034 (x3950M2, S2K5)

 

Both Opteron TPC-H systems have 256GB, 206 disks, and 7 disk controllers. The Xeon system is on S2K5, and the Opteron results are on S2K8 using the date data type in place of datetime, saving 12 bytes per row on LineItem. Page compression is also enabled, so the entire 300GB LineItem plus indexes and other tables essentially fits in memory. For these reasons the Opteron and Xeon results are not properly comparable.

The performance gain from Barcelona to Shanghai is 9% on the combined QphH, 11.7% on power, and 6.6% on throughput.
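As a sanity check, the TPC-H composite metric (QphH) is defined as the geometric mean of the power and throughput metrics, so the composite gain should be the geometric mean of the two component gains. A minimal sketch using only the percentages quoted above (variable names are mine):

```python
import math

power_gain = 0.117       # 11.7% gain on the power metric (from the text)
throughput_gain = 0.066  # 6.6% gain on the throughput metric (from the text)

# QphH = sqrt(Power * Throughput), so the ratio of two QphH values is
# the square root of the product of the component ratios.
composite_ratio = math.sqrt((1 + power_gain) * (1 + throughput_gain))
composite_gain_pct = (composite_ratio - 1) * 100

print(f"implied composite gain: {composite_gain_pct:.1f}%")
```

This lands on roughly 9%, in line with the combined figure quoted above.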

 

The gains can be divided between increased cache size, frequency and core improvements.

TPC-C is known to benefit from large cache, so the majority of the gain from Barcelona to Shanghai is probably due to cache. The Xeon 7460 has the best performance due to the combination of additional cores and the very large L3 cache.

TPC-H is known to be essentially cache size independent. So the fact that TPC-H improves by more than clock frequency says some gain is due to core improvements. For an 8% frequency increase, an expected performance gain of 5-6% is reasonable. So 3-4% can probably be attributed to core improvements (see below). There is no published TPC-H result for the Intel six-core Dunnington. This is probably an indication that the Intel FSB architecture cannot properly feed 24 cores with a single MCH. The Intel Core 2 significantly outperforms Opteron at the single-core level (non-parallel large queries), but Opteron catches up at high degrees of parallelism with better memory bandwidth (more memory channels).
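The attribution above is simple subtraction, and can be sketched as follows. The 5-6% expected return from an 8% clock increase is my estimate as stated in the text, not a measured value, and the additive split is a simplification (a multiplicative split would give nearly the same answer at these magnitudes).

```python
total_gain_pct = 9.0                   # composite QphH gain, Barcelona to Shanghai
freq_gain_pct = (2.7 / 2.5 - 1) * 100  # clock increase, 2.5GHz to 2.7GHz

# Assumption from the text: an 8% clock increase yields only 5-6% more
# performance, because memory latency does not scale with core frequency.
expected_from_freq = (5.0, 6.0)

# Whatever is left over is attributed to core improvements
# (low estimate first, high estimate second).
residual = tuple(total_gain_pct - e for e in reversed(expected_from_freq))

# Implied frequency scaling efficiency (performance gain per unit clock gain).
scaling_efficiency = tuple(e / freq_gain_pct for e in expected_from_freq)

print(f"frequency gain: {freq_gain_pct:.1f}%")
print(f"attributed to core improvements: {residual[0]:.0f}-{residual[1]:.0f}%")
print(f"implied scaling efficiency: {scaling_efficiency[0]:.2f}-{scaling_efficiency[1]:.2f}")
```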

 

There is still no SPEC CPU Integer 2006 Base for Shanghai (there is for SPEC CPU rate, but I still like looking at the non-rate). Core i7 results look decent, though not spectacular: a 10% gain from the QX9770 at 3.2GHz to the i7 at 3.2GHz. The goal of each tick and tock is 40%, but this is very difficult to do on SPEC INT. Hopefully we will see Core i7 (45nm) at 3.5-3.8GHz in 2009.

Prior to the Shanghai launch, AMD made noise that Shanghai would have 20-30% better performance than Barcelona at the same clock. In a complex processor architecture, it would not be unusual for a minor design mistake, when corrected, to yield a significant performance gain on a specific operation but only a modest gain on a complete suite of operations. AMD was vague as to whether the 20-30% was specific or broad. Anyways, I just talked to the person who did the AMD results. In the past he had made subtle contributions to achieving best in class performance. Between the Barcelona and Shanghai results, there was a minor change that contributed 2%. So this would reduce the contribution from any core improvements in Shanghai.

 

On Scott's comments, there are only TPC-E results for Intel Xeon and Itanium. One major vendor says they have done TPC-E for Opteron but has elected not to publish. Let me just say benchmarking is a vicious world where second place is for losers, even if it's by just 1%. Oracle and DB2 refuse to publish. Anyways:

4 x X7460 Six-core (24 cores) 2.66GHz, 128GB  729.65 tpsE (IBM x3850 M2)

12 x X7460 Six-core (64 of 72 cores used) 2.66GHz, 384GB, 1,400 tpsE (NEC 5800)
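For a rough sense of how the big box scales, the two results can be normalized per core. This is only a back-of-the-envelope sketch: the systems come from different vendors with different memory and storage configurations, so the ratio is indicative at best. The dictionary layout is mine; the numbers are from the two lines above.

```python
# TPC-E results quoted above: (cores used, tpsE).
tpce = {
    "IBM x3850 M2, 4 x X7460": (24, 729.65),
    "NEC 5800, 12 x X7460": (64, 1400.0),
}

# Throughput per core actually used in each run.
per_core = {name: tps / cores for name, (cores, tps) in tpce.items()}

# How much per-core throughput the 64-core run retains relative to 24 cores.
per_core_ratio = per_core["NEC 5800, 12 x X7460"] / per_core["IBM x3850 M2, 4 x X7460"]

for name, value in per_core.items():
    print(f"{name}: {value:.1f} tpsE per core")
print(f"per-core throughput at 64 cores vs 24 cores: {per_core_ratio:.2f}")
```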

 

I am glad NEC decided to enter a Xeon for big iron. There is a role for big boxes even though they require special tuning skills. They had an Itanium, but Intel is far behind the ball in getting Itanium architecture and manufacturing process current. There is no way a 90nm processor (Montvale) can compete with 45nm X86/64. Even the upcoming 65nm Tukwila will probably fall short, more so once the big Nehalem 8-core server variant arrives. I am here at PASS now, but have not had a chance to bug NEC on details of their 5800 architecture.

 

The SPEC CPU results mentioned in the comment below are SPEC int_rate, which is a throughput test. I want the SPEC CPU Integer base (no special compile flags, and no substituting hand-coded assembly for compiler-generated code). For reference (I will check if the Xeon 5460 is on a server or workstation platform)

Dell has now published Shanghai SPEC CPU integer results

X5460 3.16GHz 2x6M, 25.3

X7460 2.66GHz, 16M L3, 21.7

Opteron 8360 2.5GHz 14.4

Opteron 8384 2.7GHz 16.9

Core i7 965 3.2GHz 30.2

The reason I say SPEC CPU integer base is important is that it is a reasonable predictor of standard non-parallel execution plans for a wide range of SQL operations. Many people have a favorite query they like to run to evaluate new systems. So even though the Opteron can achieve very good results on the 8-way for well-tuned parallel plans (and SPEC int rate), single-threaded operations fall far short of the Core 2 based Xeon. So it is important to test single-thread, throughput and parallel plan performance separately.
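To put the single-thread gap in numbers, the SPEC CPU integer scores listed above can be normalized by clock frequency. A small sketch, using only the figures from the list (the dictionary and variable names are mine); note that score per GHz still folds in cache and memory effects, so it is not a pure measure of core efficiency.

```python
# SPEC CPU integer results quoted above: (clock in GHz, score).
spec_int = {
    "X5460": (3.16, 25.3),
    "X7460": (2.66, 21.7),
    "Opteron 8360": (2.5, 14.4),
    "Opteron 8384": (2.7, 16.9),
    "Core i7 965": (3.2, 30.2),
}

# Score per GHz as a crude per-clock comparison.
per_ghz = {name: score / ghz for name, (ghz, score) in spec_int.items()}

# Shanghai relative to the fastest Core 2 based Xeon in the list.
shanghai_vs_x5460 = spec_int["Opteron 8384"][1] / spec_int["X5460"][1]

for name, value in sorted(per_ghz.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {value:.2f} per GHz")
print(f"Opteron 8384 score as a fraction of X5460: {shanghai_vs_x5460:.2f}")
```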

Published Wednesday, November 19, 2008 2:58 PM by jchang


Comments

 

Scott R. said:

Joe,

Thanks for your great summary.  Always good info from you.

Did you see the recent (11/6) TPC-E results for a 12-way 6-core server (72 cores total)?  They had to disable 8 cores to get the total number of cores down to 64 - the maximum currently supported by Windows Server.  It's a good thing that Microsoft is looking at increasing the number of cores supported by Windows Server to 256 in the next version (http://sqlblog.com/blogs/denis_gobo/archive/2008/11/06/9881.aspx).

Scott R.

November 19, 2008 3:00 PM
 

Greg Linwood said:

Nice summary - thanks Joe

November 19, 2008 7:49 PM
 

Anand said:

Joe,

Great point about TPC-H improvements on Shanghai based servers.

BTW, SPEC CPU Int results for DL785 are posted at http://h18004.www1.hp.com/products/servers/benchmarks/index.html .

November 21, 2008 4:45 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
