THE SQL Server Blog Spot on the Web


Joe Chang

Intel Xeon 5600 (Westmere-EP) and AMD Magny-Cours Performance Update

HP has just released TPC-C and TPC-E results for the ProLiant DL380G7 with two Xeon 5680 3.33GHz 6-core processors, allowing a direct comparison with their DL385G7 with two Opteron 6176 2.3GHz 12-core processors. Last month I complained about the lack of performance results for the Intel Xeon 5600 6-core 32nm processor line for 2-way systems. This might have been deliberate, so as not to complicate the message for the Xeon 7500 8-core 45nm (for 4-way+ systems) launch two weeks later. http://sqlblog.com/blogs/joe_chang/archive/2010/04/07/intel-xeon-5600-westmere-ep-and-7500-nehalem-ex.aspx

Below is the updated chart from my previous blog post.

Architecture  Process  Processors             TPC    2-way           4-way         8-way         16-way
Core2         65nm     Xeon 5300 QC, 7300 QC  TPC-C  251,300         407,079       841,809       -
                                              TPC-E  5160 only       479.51        804.0         1,250.0
                                              TPC-H  17,686@100      34,990@100    46,034@300    -
Barcelona     65nm     QC                     TPC-C  -               471,883       -             -
                                              TPC-E  -               -             -             -
                                              TPC-H  -               -             52,860@300    -
Core2         45nm     Xeon 5400 QC, 7400 6C  TPC-C  275,149         634,825       Linux DB2     -
                                              TPC-E  317.45          729.65        1,165.56      2,012.8 (R2)
                                              TPC-H  -               -             -             102,778@3T
Shanghai      45nm     QC                     TPC-C  -               579,814       -             -
                                              TPC-E  -               635.4         -             -
                                              TPC-H  -               -             57,685@300    -
Istanbul      45nm     6C                     TPC-C  -               -             -             -
                                              TPC-E  -               -             -             -
                                              TPC-H  -               -             91,558@300*   -
Nehalem       45nm     Xeon 5500 QC, 7500 8C  TPC-C  661,475†        1,807,347     -             -
                                              TPC-E  850.0           2,022.64      3,141.76      -
                                              TPC-H  51,086@100      -             162,601@3TB   -
Westmere      32nm     Xeon 5600 6C, 7600 12C TPC-C  803,068         future        future        future
                                              TPC-E  1,110.1         future        future        future
                                              TPC-H                  future        future        future
Magny-Cours   45nm     12C                    TPC-C  705,652         1,193,472     future        future
                                              TPC-E  887.4           1,464         future        future
                                              TPC-H  71,438.3@100G   107,561@300G  future        future

* also SF 1TB report
† Xeon W5580 3.2GHz, versus X5570 2.93GHz for other Xeon 5500 results

The Xeon 5680 scores 13.8% higher than the Opteron 6176 in TPC-C and 25.1% higher in TPC-E, despite having half as many cores. The individual physical core in Westmere is faster than the Opteron core on SPEC CPU 2006 Integer base (adjusted to exclude parallel components). There is no meaning in comparing frequency between completely different processor architectures.
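As a quick check of the percentages quoted above, the 2-way results from the chart can be compared directly. This is only a minimal sketch: the helper name `pct_advantage` is made up for illustration, and the inputs are the DL380G7 (Xeon 5680) and DL385G7 (Opteron 6176) figures from the chart.

```python
# 2-way results taken from the chart: (Xeon 5680, Opteron 6176).
RESULTS = {
    "TPC-C (tpmC)": (803_068, 705_652),
    "TPC-E (tpsE)": (1_110.1, 887.4),
}


def pct_advantage(a, b):
    """Percentage by which score a exceeds score b (hypothetical helper)."""
    return (a / b - 1.0) * 100.0


for name, (xeon, opteron) in RESULTS.items():
    print(f"{name}: Xeon ahead by {pct_advantage(xeon, opteron):.1f}%")
```

Dividing each score by the core count (12 for the Xeon system, 24 for the Opteron system) in the same way would give a rough per-core comparison, although TPC results are system-level numbers and depend on memory and storage as well.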

Below are SPEC CPU 2006 Integer base results for the Opteron 8435 2.6GHz and Xeon X5680 3.33GHz:

Processor      AMD Opteron 8435   Intel Xeon X5680
System         HP DL585           Dell T710
Freq           2.6GHz             3.33GHz
Cores          4x6                2x6
400 perlb      15.9               27.6
401 bzip2      12.6               20.9
403 gcc        14.0               25.6
429 mcf        17.7               45.4
445 gobmk      14.9               26.1
456 hmmr       17.5               50.9
458 sjeng      14.6               27.6
462 libq       66.9               664
464 h264ref    22.1               39.9
471 omnetpp    13.1               22.5
473 astar      12.6               21.8
483 xalanc     16.3               39.7
base           17.3               39.1
w/o libq       15.36              30.21

The purpose of excluding libquantum is to compare single-core performance. The Intel compiler can parallelize libquantum, so it is not a single-core result(?). I am somewhat inclined to also exclude hmmer, because the Intel 11.1 compiler made a substantial improvement over their 11.0 compiler. The AMD results are on the PGI 8.0 compiler, which may not have either optimization.
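The "base" and "w/o libq" rows can be reproduced approximately from the per-benchmark ratios, since the SPEC composite is a geometric mean. The sketch below assumes the rounded ratios in the table as inputs and uses the full SPEC benchmark names; the helper names `geomean` and `composite` are illustrative. Because the published composite is computed from unrounded ratios, the full-suite value can differ in the last digit.

```python
import math

# SPECint2006 base ratios from the table: (Opteron 8435, Xeon X5680).
RATIOS = {
    "perlbench": (15.9, 27.6),
    "bzip2": (12.6, 20.9),
    "gcc": (14.0, 25.6),
    "mcf": (17.7, 45.4),
    "gobmk": (14.9, 26.1),
    "hmmer": (17.5, 50.9),
    "sjeng": (14.6, 27.6),
    "libquantum": (66.9, 664.0),
    "h264ref": (22.1, 39.9),
    "omnetpp": (13.1, 22.5),
    "astar": (12.6, 21.8),
    "xalancbmk": (16.3, 39.7),
}


def geomean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def composite(column, exclude=()):
    """Composite score for one processor column, skipping excluded benchmarks."""
    return geomean(r[column] for name, r in RATIOS.items() if name not in exclude)


for label, col in (("Opteron 8435", 0), ("Xeon X5680", 1)):
    full = composite(col)
    no_libq = composite(col, exclude={"libquantum"})
    print(f"{label}: base {full:.1f}, w/o libquantum {no_libq:.2f}")
```

Passing `exclude={"libquantum", "hmmer"}` shows the effect of also dropping hmmer, the other benchmark singled out above.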

 

Published Wednesday, May 12, 2010 10:29 AM by jchang

Comments

 

sql_noob said:

question about these for OLTP systems. we're looking to buy a pair of HP Proliant DL 380 G6's with 5650's sometime in the next few months. they are going to replace G5's with 5345's. this is going to be for a single instance cluster where most of the workload is small 1 second queries. mostly selects, with some updates and inserts. best guess is 20-30 million queries per day.

currently we're on max degree of parallelism 0. i had the idea to increase it a little to force each connection/query/thread into its own core. only problem is that we can't test the workload in QA or any other system.

i tried testing it on a system with a different workload type and didn't like the results.

so my question is, will we see a performance increase or decrease? not looking to see query execution cut from 1 second to half a second but will we be able to process more queries/threads at once if they go through separate cores?

May 12, 2010 10:09 AM
 

Glenn Berry said:

On most of the benchmarks I have seen, the 32nm Xeon 5650 does about three times better than the old 65nm Xeon 5345, which is based on the first-generation Core2 architecture.

Whether you will see a dramatic overall increase in performance and throughput depends on what other bottlenecks you might have. For instance, can your I/O subsystem keep up with the load?

May 12, 2010 10:30 AM
 

jchang said:

Very Important: Get a decent QA system. I would say that the 5345 is a decent processor and not due for replacement. But if you do not currently have a QA system, then by all means get the latest 2-way system, for example an X5650, so that you do have a QA system.

Next: a 1 second query today is not a small query, it's 2-3 billion CPU-cycles per core. Small is a query that runs in 10 milli-sec. It is important to specify both the average CPU and duration (elapsed time) per query. If your average query consumes 1 CPU-sec, then each core can do 86,400 per day, assuming uniform traffic, all at max load. On an 8-core system, that's 691,200 per day, not 20-30 million. So I suppose you meant 1-sec duration and the CPU is less. Presumably more than 30X less to support 20-30M per day.

Anyways, 30M queries per day means 1.25M per hour. Assuming you have peak hours, your load might actually need to hit 3.6M per hour, or 1000 per sec. If you have a single core system, the average query cost would need to be 1 milli-sec, but on a 12-core system, you could afford 12-ms average.
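The capacity arithmetic in this comment can be written out as a short sketch. The function names and the idea of treating CPU as the only limit are assumptions for illustration; real throughput also depends on I/O, locking, and how evenly the load arrives.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_queries_per_day(cores, cpu_sec_per_query):
    """Upper bound on daily volume if every core is fully busy all day."""
    return cores * SECONDS_PER_DAY / cpu_sec_per_query


def cpu_budget_ms(cores, peak_queries_per_sec):
    """Average CPU-ms a query may use before the cores saturate at peak rate."""
    return cores * 1000.0 / peak_queries_per_sec


# 1 CPU-sec per query: one core, then an 8-core system.
print(f"{max_queries_per_day(1, 1.0):,.0f} queries/day on 1 core")
print(f"{max_queries_per_day(8, 1.0):,.0f} queries/day on 8 cores")

# 30M queries/day averaged over 24 hours, and the assumed 3.6M/hour peak.
avg_per_hour = 30_000_000 / 24
peak_per_sec = 3_600_000 / 3600
print(f"{avg_per_hour:,.0f} queries/hour on average")
print(f"{peak_per_sec:,.0f} queries/sec at peak")

# CPU budget per query at that peak rate.
for cores in (1, 12):
    print(f"{cores} core(s): {cpu_budget_ms(cores, peak_per_sec):.0f} CPU-ms per query")
```

The 3.6M per hour peak is the comment's own assumption about peak-to-average ratio; substituting a measured peak rate changes the per-query budget proportionally.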

Assuming a range of query costs, a few might benefit from a parallel query plan, possibly maxdop 2 or 4. I would run tests on a QA system, find out which queries could generate a parallel plan, then explicitly allow these to have a parallel plan, leaving everything else non-parallel.

Forcing each connection to a specific core can help if your average CPU cost is well below 1 milli-sec. Above that I would not bother. Besides, this is a really complicated subject, that is best left to experts.

May 12, 2010 11:09 AM
 

Alen said:

Our QA servers are actually faster than production. we bought them last year. the last of the G5's. I think they are 5450 Xeons. no problem restoring the databases there, but getting the workload simulated is pretty much impossible. we might end up testing in production.

only reason we're replacing them is we need to upgrade our DR servers and it doesn't make sense to buy new hardware to sit around

May 12, 2010 1:55 PM
 

jchang said:

so what you need to do is figure out the average CPU per SQL call, i.e., the top level stored procedure called from the client/app server. the average duration (1 sec) has no meaning, other than it's kind of a long time on modern systems.

unless your average cpu per RPC is well under 1 milli-second (my estimate is under 0.3 ms), there will not be much gain in employing thread affinity.

May 12, 2010 3:37 PM
 

Linchi Shea said:

Joe;

After all those assumptions in your first comment above, one has to wonder what really applies? :-) Just kidding. But seriously, theoretical estimates are a rather slippery slope.

May 14, 2010 9:22 AM
 

jchang said:

what makes you think they are assumptions?

You should have read enough of my work by now to know I do not waste time with worthless crap.

As far as I know, I am the only one to have published quantitative analysis on high-call volume chatty apps, NUMA and affinity tuning. If you think my data above is not correct, test it out for yourself.

For reference, the TPC-C and TPC-E benchmark queries average on the order of 0.5 CPU-ms with the Xeon 5500/5600/7500 series (based on physical core CPU-sec, not logical). On the 2-way systems, there is some benefit with port-thread affinity alignment. On the bigger systems the benefit becomes more substantial.

In SAP type applications, where the average call is less than 0.1 CPU-ms, port-affinity alignment is absolutely crucial. Without it, the result is negative scaling on hard NUMA systems.

A few years ago, you did a performance eval of an 8-way(+?) Intel hard NUMA system versus a 4-way Opteron dual-core (soft-NUMA) system, finding almost no scaling beyond 4-way(?). Published benchmarks showed moderate scaling. The benchmarks used port-affinity tuning and you did not. The benchmark test you were using was on the order of 0.5-1.0 CPU-ms per call, which really needed the affinity tuning on NUMA systems. If the average call were 1000 CPU-ms, then it would not have mattered.

May 15, 2010 9:35 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
