THE SQL Server Blog Spot on the Web


Joe Chang

New HP ProLiant DL980, 580 and 585 G7 servers

HP has just announced the ProLiant DL580 G7 and DL980 G7 servers based on the Xeon 7500 series processors, and the DL585 G7 4-way server with the 12-core AMD Opteron 6100 series (Magny-Cours).
Apparently the reason for the delay is that the 8-way DL980 G7 employs custom silicon node controllers (XNC), and possibly also so that HP could make a splash by announcing all three systems at their big annual conference, the HP Technology Forum. The DL580 and 585 G7 are available now(?), and the 980 G7 should be available later in Q3.

While the Intel Xeon 7500 processor allows a glue-less 8-way system, HP felt that the design could be improved with node controllers. The node controllers reduce snoop traffic for a majority of memory accesses, and can achieve a 30% reduction in memory latency in some circumstances. It should be noted that HP needed to build a custom silicon crossbar (and node controllers) for their Superdome 2 system and the Itanium 9300 series processors, which use the same QuickPath Interconnect (QPI) as the Nehalem processors.

HP might have built a glue-less 8-way Xeon 7500 system if they had not already invested the effort to build the XNC for their Itanium systems. This also means that HP should have the components to build a 16-way Xeon 7500 system. If there were demand, could such a system be brought to market? Intel did say that there were 16-way Xeon 7500 system designs, but none have surfaced yet.

Dell has also released a 2-way TPC-E result for the Xeon 5600, and Fujitsu released a 4-way TPC-E result for the Xeon 7500.

Below is an updated performance table.

TPC     2-way         4-way         8-way         16-way

Core2 65nm (Xeon 5300 QC, 7300 QC)
TPC-C   251,300       407,079       841,809       -
TPC-E   5160 only     479.51        804.0         1,250.0
TPC-H   17,686@100    34,990@100    46,034@300    -

Barcelona 65nm (QC)
TPC-C   -             471,883       -             -
TPC-E   -             -             -             -
TPC-H   -             -             52,860@300    -

Core2 45nm (Xeon 5400 QC, 7400 6C)
TPC-C   275,149       634,825       Linux DB2     -
TPC-E   317.45        729.65        1,165.56      2,012.8 (R2)
TPC-H   -             -             -             102,778@3T

Shanghai 45nm (QC)
TPC-C   -             579,814       -             -
TPC-E   -             635.4         -             -
TPC-H   -             -             57,685@300G   -

Istanbul 45nm (6C)
TPC-C   -             -             -             -
TPC-E   -             -             -             -
TPC-H   -             -             91,558@300G*  -

Nehalem 45nm (Xeon 5500 QC, 7500 8C)
TPC-C   661,475†      1,807,347     -             -
TPC-E   850.0         2,022.64      3,141.76      -
TPC-H   51,086@100G   -             162,601@3TB   -

Westmere 32nm (Xeon 5600 6C, 7600 ?C)
TPC-C   803,068       future        future        future
TPC-E   1,110         future        future        future
TPC-H   73,974@100G   future        future        future

Magny-Cours 45nm (12C)
TPC-C   705,652       1,193,472     n/a           n/a
TPC-E   887.4         1,464         n/a           n/a
TPC-H   -             107,561@300G  n/a           n/a

* A result at scale factor (SF) 1TB was also reported
† Xeon W5580 3.2GHz, versus X5570 2.93GHz
Magny-Cours does not support >4 socket systems
IBM Power 780 with 2 x quad-core POWER7 4.14GHz TPC-C: 1,200,011

TPC-H 300GB: 4-way DL585 G7 vs 8-way ProLiant DL785 G6

A comparison of the TPC-H 300GB results for the 8-way ProLiant DL785 G6 and the 4-way DL585 G7 is interesting.

System    TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
DL785G6   109,067.1     76,860.0           91,558.2
DL585G7   129,198.3     89,547.7           107,561.2
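
The composite column can be cross-checked against the other two. TPC-H defines the composite QphH as the geometric mean of the Power and Throughput metrics. The following is a minimal Python sketch using only the figures in the table above; the function name `composite_qphh` is an illustrative choice, not something from the TPC reports.

```python
import math


def composite_qphh(power: float, throughput: float) -> float:
    """TPC-H composite metric: geometric mean of Power and Throughput."""
    return math.sqrt(power * throughput)


# (Power, Throughput, reported Composite) copied from the table above.
results = {
    "DL785G6": (109_067.1, 76_860.0, 91_558.2),
    "DL585G7": (129_198.3, 89_547.7, 107_561.2),
}

for system, (power, throughput, reported) in results.items():
    computed = composite_qphh(power, throughput)
    print(f"{system}: computed {computed:,.1f}, reported {reported:,.1f}")
```

The computed values should agree with the reported composites to within rounding, which is a quick sanity check when transcribing numbers from the full disclosure reports.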

The significant differences between the two systems are below. Both systems have the same number of total cores, the 8-way with 6-core processors and the 4-way with 12-core processors. The DL785G6 cores are 2.8GHz versus the DL585G7 at 2.3GHz, about a 20% difference. The DL585G7 has twice the memory, 512GB versus 256GB. For TPC-H at SF300, and using SQL Server 2008 page compression, 256GB is not quite sufficient to hold all of the database tables and indexes. With 512GB, there is more than sufficient memory for data, indexes and probably most hash join intermediate results (for minimal tempdb activity).

System           DL785G6        DL585G7
Processor        Opteron 8439   Opteron 6176
Sockets-Cores    8 x 6 = 48     4 x 12 = 48
Frequency        2.8GHz         2.3GHz
Memory           256GB          512GB
Storage          194 HDD        4 SSD
Windows Server   2008 EE SP1    2008 R2 EE
SQL Server       2008 EE SP1    2008 R2 EE

That the DL585G7 employs SSD storage is not expected to impact performance; the SSDs were probably chosen for lower cost. The 194 15K HDDs and 12 storage enclosures in the DL785 cost $110K, while the four 320GB Fusion-io drives in the DL585 cost $55K. If the DL585 had 256GB or less memory, then the SSD storage would have moderately better performance than HDD storage. Another significant difference is the set of improvements in Windows Server 2008 R2, several of which have a major impact on scaling to a high number of processor cores.

The chart below shows the TPC-H power query run times for the DL585G7 relative to the DL785G6.

[Chart: TPC-H Power query run times, DL585G7 relative to DL785G6]

Overall, the DL585G7 with 4 Opteron 6176 processors scores about 20% higher in TPC-H Power than the DL785G6 with 8 Opteron 8439 processors. For the individual queries, several are moderately faster, 3 are much faster, 5 are about the same, and 3 are actually significantly slower. The DL785 has faster processors, which should make all queries run faster. It is difficult to account for differences in the system architecture, as there may be differences in how the individual dies are connected. The greater memory on the DL585 is expected to make certain queries run faster. The scaling improvements in R2 (OS and SQL) might contribute significant gains in some queries, but may also have negative effects in others.
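
To put the clock-speed difference in perspective, the sketch below normalizes TPC-H Power by aggregate core-GHz (cores times clock). Core-GHz is a crude capacity proxy chosen here for illustration only; it is not a TPC metric, and the helper name `power_per_core_ghz` is hypothetical.

```python
# Figures from the tables above. "core-GHz" (cores x clock frequency) is an
# assumed, simplistic measure of raw compute capacity.
systems = {
    "DL785G6": {"cores": 48, "ghz": 2.8, "power": 109_067.1},
    "DL585G7": {"cores": 48, "ghz": 2.3, "power": 129_198.3},
}


def power_per_core_ghz(system: dict) -> float:
    """TPC-H Power divided by aggregate core-GHz."""
    return system["power"] / (system["cores"] * system["ghz"])


old, new = systems["DL785G6"], systems["DL585G7"]

power_ratio = new["power"] / old["power"]
efficiency_ratio = power_per_core_ghz(new) / power_per_core_ghz(old)

print(f"TPC-H Power, DL585G7 / DL785G6:        {power_ratio:.2f}")
print(f"Power per core-GHz, DL585G7 / DL785G6: {efficiency_ratio:.2f}")
```

The second ratio is larger than the first because the DL585G7 achieves its higher score with less aggregate core-GHz, which suggests that memory size, system architecture and the R2 software stack account for more than the headline gain.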

It would be very helpful to have access to the actual execution plans, along with execution statistics, to determine whether the differences can be attributed to plan differences or to differences in disk IO.

TPC-H 3000GB: 8-way Xeon 7560 vs 16-way Xeon 7460

Below are the TPC-H 3000GB results for the 8-way ProLiant DL980 G7 with the Xeon 7560 processor and the 16-way ES7000 with the Xeon 7460. The 32-way dual-core IBM 5GHz Power6 result is also shown.

System           TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
16 x Xeon 7460   120,254.8     87,841.4           102,778.2
8 x Xeon 7560    185,297.7     142,685.6          162,601.7
32 x Power6      142,790.7     171,607.4          156,537.3

Additional details are below:

System            ES7000        DL980G7      Power 595
Processor         Xeon 7460     Xeon 7560    Power6
Sockets-Cores     16 x 6 = 96   8 x 8 = 64   32 x 2 = 64
Hyper-Threading   no            yes          4/core?
Frequency         2.66GHz       2.26GHz      5.0GHz
Memory            1024GB        512GB        512GB
Storage           914 HDD       660 HDD      288 HDD
OS                2008 R2 DC    2008 R2 EE   AIX 6.1
Database          2008 R2 DC    2008 R2 EE   Sybase 15.1

The Unisys system may have been over-configured in disks and memory. Many of the TPC-H queries involve large table (or range) scans. If the entire database cannot be brought into memory, then there may not be much difference in the disk IO generated with either 512GB or 1TB of memory. More importantly, the Windows operating system and SQL Server versions match, so there is high confidence we are seeing mostly the difference between the two processor (and system) architectures.

The IBM system may appear to be under-configured in terms of the number of disk drives. But it does seem that other database engines are better than SQL Server at switching from pseudo-random to sequential scan IO operations, and can work fine with fewer disks.

While the Xeon 7400 series processor core was top of the line in its time, even the 4-way Xeon 7400 system had limited memory bandwidth (and channels). Scaling beyond 4-way was not a simple matter. Of course, the Xeon 7400 systems were still competitive with systems based on processors with better scalability, but weaker single core performance.

Based on the 16-way Xeon 7460 result, the expectation is that an 8-way Xeon 7460 would be in the range of 64,000 (102,778 / 1.6), i.e., doubling the number of processors should increase performance by 1.6X. In turn, there is sufficient reason to estimate that the Xeon 7560 is about 2.5X more powerful than the Xeon 7460 for data warehouse usage. This is less than the 2.77X observed in OLTP, which is in line with expectations because OLTP derives substantial benefits from Hyper-Threading (30%?) and data warehousing derives only a modest benefit from HT (10%?).
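
The estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the 1.6X-per-doubling rule of thumb as given; it is an assumption from the text, not a measured value, and the variable names are illustrative.

```python
SCALING_PER_DOUBLING = 1.6  # rule of thumb from the text (an assumption)

qphh_16way_7460 = 102_778.2  # TPC-H 3000GB composite, 16 x Xeon 7460
qphh_8way_7560 = 162_601.7   # TPC-H 3000GB composite, 8 x Xeon 7560

# Hypothetical 8-way Xeon 7460 result, obtained by undoing one doubling.
est_8way_7460 = qphh_16way_7460 / SCALING_PER_DOUBLING

# Data warehouse ratio at equal socket count (8-way vs estimated 8-way).
dw_ratio = qphh_8way_7560 / est_8way_7460

# OLTP reference point: 4-way TPC-E, Xeon 7500 vs Xeon 7400,
# taken from the performance table earlier in the post.
oltp_ratio = 2_022.64 / 729.65

print(f"Estimated 8-way Xeon 7460 QphH: {est_8way_7460:,.0f}")
print(f"Xeon 7560 / Xeon 7460, TPC-H:   {dw_ratio:.2f}X")
print(f"Xeon 7500 / Xeon 7400, TPC-E:   {oltp_ratio:.2f}X")
```

A different scaling assumption changes the estimate directly, so the 2.5X figure is only as good as the 1.6X rule behind it.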

The chart below shows the TPC-H power query run times for the 8-way Xeon 7560 relative to the 16-way Xeon 7460.

[Chart: TPC-H Power query run times, 8-way Xeon 7560 relative to 16-way 7460]

As with the earlier comparison, there is also wide variation in the individual queries. Many queries are 40% faster, two are about the same, two are actually slower, and one is more than 5X faster.

The chart below shows the TPC-H power query run times for the 32-way IBM p595 with 64 cores relative to the 8-way Xeon 7560, also with 64 cores.

[Chart: TPC-H Power query run times, 64-core POWER6 relative to 64-core Xeon]

The 64-core Xeon 7560 has 30% better TPC-H Power than the 64-core POWER6. The POWER6 in turn has 20% better TPC-H Throughput than the Xeon. Again, there is also wide variation in the individual queries. In queries 18 and 19, where Sybase is faster, the SQL Server execution plan shows key lookups at SF100. It would be helpful if HP could provide execution plans at SF3000. We should not draw too many conclusions when comparing completely different system architectures and completely different database engines. But I think this is a good hint for Microsoft to re-evaluate the execution plan cost formulas.
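
The percentages above follow directly from the published metrics. The sketch below recomputes them and also shows how close the two systems are on the composite, where the opposing Power and Throughput leads largely cancel. The helper `pct_advantage` is a hypothetical name used for illustration.

```python
# TPC-H 3000GB metrics copied from the results table earlier in this section.
xeon = {"power": 185_297.7, "throughput": 142_685.6, "composite": 162_601.7}
power6 = {"power": 142_790.7, "throughput": 171_607.4, "composite": 156_537.3}


def pct_advantage(a: float, b: float) -> float:
    """Percentage by which a exceeds b."""
    return (a / b - 1.0) * 100.0


xeon_power_lead = pct_advantage(xeon["power"], power6["power"])
power6_throughput_lead = pct_advantage(power6["throughput"], xeon["throughput"])
xeon_composite_lead = pct_advantage(xeon["composite"], power6["composite"])

print(f"Xeon lead in Power:        {xeon_power_lead:.1f}%")
print(f"POWER6 lead in Throughput: {power6_throughput_lead:.1f}%")
print(f"Xeon lead in Composite:    {xeon_composite_lead:.1f}%")
```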

TPC-H 1000GB: 8-way 6-core Opteron 785G6 vs 16-way quad-core Itanium

Below are the TPC-H 1000GB results for the 8-way ProLiant DL785 G6 with the Opteron 8439 processor and the 16-way Integrity Superdome 2 with the Itanium 9350.

System              TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
8 x Opteron 8439    95,789.1      69,367.6           81,514.8
16 x Itanium 9350   139,181.0     141,188.3          140,181.1

Additional details are below:

System            DL785 G6       Superdome 2
Processor         Opteron 8439   Itanium 9350
Sockets-Cores     8 x 6 = 48     16 x 4 = 64
Hyper-Threading   no             yes
Frequency         2.8GHz         1.73GHz
Memory            512GB          512GB
Storage           240 HDD        576 HDD
OS                2008 R2 EE     HP-UX
Database          2008 EE        Oracle 11g R2

The operating system and database engine are both completely different, so caution is warranted in comparing the results. Also very important is that the execution plans could also be very different in certain queries.

As the expectation is that doubling the number of processors should lead to approximately a 1.6X performance gain, we can see that the six-core Opteron 8439 is in the same neighborhood as the quad-core Itanium 9350. The individual Opteron processor is probably a little better than the Itanium at the socket level in the TPC-H Power test, but the Itanium has the advantage in throughput-oriented usage.
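
One way to make the socket-level comparison concrete is to scale the 16-way result down to a hypothetical 8-way equivalent using the same 1.6X rule of thumb. This is a sketch under that assumption only; no 8-way Itanium result is being claimed.

```python
SCALING_PER_DOUBLING = 1.6  # rule of thumb from the text (an assumption)

# TPC-H 1000GB metrics copied from the results table above.
opteron_8way = {"power": 95_789.1, "throughput": 69_367.6, "composite": 81_514.8}
itanium_16way = {"power": 139_181.0, "throughput": 141_188.3, "composite": 140_181.1}

# Hypothetical 8-way Itanium figures, obtained by undoing one doubling.
itanium_8way_est = {
    metric: value / SCALING_PER_DOUBLING for metric, value in itanium_16way.items()
}

for metric, opteron_value in opteron_8way.items():
    ratio = opteron_value / itanium_8way_est[metric]
    print(f"{metric:<10} Opteron / Itanium (8-way equivalent): {ratio:.2f}")
```

A ratio above 1 favors the Opteron at equal socket count and a ratio below 1 favors the Itanium, so the Power and Throughput rows point in opposite directions, as the paragraph above describes.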

The chart below shows the TPC-H power query run times for the 16-way Itanium relative to the 8-way Opteron.

[Chart: TPC-H Power query run times, 16-way quad-core Itanium relative to 8-way 6-core Opteron]

System Pricing

Below is the bare system pricing, based on HP's published TPC reports in 2010. Base system pricing only includes the system chassis and processors. There might be differences in memory pricing because of differences in memory type.

4-way       Intel         AMD
System      DL580G7       DL585G7
Processor   4xXeon 7560   4x6176
Price       $24,597       $11,235

2-way       Intel         AMD
System      DL380G7       DL385G7
Processor   2xXeon 5680   2x6176
Price       $6,189        $5,109

I do not consider the TPC overall pricing and price-performance metrics to be particularly helpful in comparative assessment. The software licensing may be much higher than the server costs when per-processor licensing applies, but is inconsequential for CAL mode in DW environments. Also, storage can be a very large part of the overall system cost. Some vendors can use SSD or direct-attach storage for favorable pricing, while other vendors employ more expensive SAN storage. In any case, the customer always picks their own storage, so the storage pricing in the TPC reports is of no consequence to the actual customer. That being said, there is no meaningful difference in the TPC-E Price/Performance between the AMD and Intel based platforms at either 2-way or 4-way.

The AMD objective for the near term (until Bulldozer?) is probably not to compete with Intel on a socket basis but on overall system price. The price difference between the ML/DL 380 and 385 is nominal at best, and so I would definitely prefer the 380 with the Xeon 5680, because any time I have a big single-threaded operation, I want the fastest core, not aggregate core-GHz.

At the 4-way level, between the DL580 and 585, there is a better story. If we add 256GB memory at $16K (the HP report says $32K for the DL585G7, but I think that is too high), then we are looking at $40.5K vs $27K, which is 1.5 to 1. This roughly corresponds to the difference in performance. So it all comes down to the cost of the other components: storage and software licensing.
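
The arithmetic behind the 1.5 to 1 figure, and how it lines up against performance, can be sketched as follows. The $16K memory price is the estimate given in the text, and the 4-way TPC-C and TPC-E results come from the performance table earlier in the post; those results are not necessarily from these exact HP systems, so treat the comparison as indicative only.

```python
MEMORY_256GB_USD = 16_000  # estimate from the text, not HP's list price

# Bare system prices (chassis and processors only) from the pricing table.
base_price = {"DL580G7": 24_597, "DL585G7": 11_235}
with_memory = {name: price + MEMORY_256GB_USD for name, price in base_price.items()}

price_ratio = with_memory["DL580G7"] / with_memory["DL585G7"]

# 4-way results from the performance table: Xeon 7500 vs Magny-Cours.
tpcc_ratio = 1_807_347 / 1_193_472
tpce_ratio = 2_022.64 / 1_464

print(f"Price with memory, Intel / AMD: {price_ratio:.2f}")
print(f"TPC-C, Intel / AMD:             {tpcc_ratio:.2f}")
print(f"TPC-E, Intel / AMD:             {tpce_ratio:.2f}")
```

With these inputs the price ratio lands between the TPC-E and TPC-C ratios, which is what "roughly corresponds" amounts to here.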

I am inclined to recommend the 2-way Xeon 5600 series for medium tasks, the 4-way Xeon 7500 series for big tasks, and the 8-way Xeon 7500 ($79K base) for really big tasks. Yes, the 4-way 12-core Opteron does fit in the gap between 2 x 5600 and 4 x 7500. But I am not sure this is a viable gap, as I would recommend the 2 x 5600 when the choice is between a middle and big system, in part because processors today are so powerful and because the Xeon 5600 is the fastest for single-threaded tasks.

ps - High-Availability requirements might steer towards the Xeon 7500 because of the MCA features etc.

pps - the HP PREMA whitepaper on their DL980 G7 system architecture says future Xeon systems up to 32-way(?) are possible. I am thinking this should be next year, pending good sales of the DL980.

Published Wednesday, June 23, 2010 11:03 AM by jchang

Comments

 

zaib said:

Well, this is totally biased: a comparison between the Intel 7500 and the new AMD 12-core. OK, it does make sense, but why include IBM POWER6? It's been a while since POWER7 launched. And one more thing: why can't we compare Intel Westmere and Itanium? I can bet that the new Intel easily outperforms Itanium (which is dying anyway). Do a fair comparison, like Intel with POWER7, and Itanium with POWER7 and Intel Westmere.

July 7, 2010 9:56 AM
 

jchang said:

TPC-H comparisons should only be made at the same scale factor, for example 1TB to 1TB (unless of course it's the exact same system).

IBM has not published a TPC-H result with POWER7, so I cannot make up numbers from thin air. If IBM publishes a POWER7 result, I will be happy to put it up.

IBM does have a Power 780 with 2xPOWER7 quad-core 4.14GHz TPC-C result of 1,200,011, with 512GB memory.

This is about 50% higher than the 2 Westmere six-core 3.33GHz result of 803,068 with 192GB memory.

IBM systems since POWER4 have done very well in TPC-C. I suspect a major contributor to the good POWER7 result is 4 logical processors per core, versus 2 for Westmere. (I am hoping Sandy Bridge will also go to 4). The extra memory helps, but not enough to account for such a margin.

I will suggest the HP Westmere result is a little low, compared with their 2 x Nehalem result of 631,766.

Finally, the 2 x POWER7 result is about the same as the 4 x Opteron 6176 12-core.

July 7, 2010 6:46 PM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
