Joe Chang

TPC-H Benchmarks - Westmere-EX versus RISC

There has been relatively little activity in TPC benchmarks recently, with the exception of the raft of Dell TPC-H results with EXASolution. It could be that systems today are so powerful that few people feel the need for benchmarks. IBM published an 8-way Xeon E7 (Westmere-EX) TPC-E result of 4,593 tpsE in August, slightly higher than the Fujitsu result of 4,555 tpsE published in May 2011. Both systems have 2TB memory. IBM prices 16GB DIMMs at $899 each, or $115K for 2TB (128 DIMMs), which works out to $57.5K per TB. (I think a 16MB DIMM was $600+ back in 1995!) The Fujitsu system has 384 SSDs of the 60GB SLC variety at $1,014 each, and IBM employed 143 SSDs of the 200GB eMLC variety at $1,800 each, for roughly 23TB and 29TB raw capacity respectively. Except for unusually write-intensive situations, eMLC or even regular MLC is probably good enough for most environments.

HP published a TPC-H 1TB result of 219,887 QphH for their 8-way ProLiant DL980 G7 with the Xeon E7-4870, 26% higher in the overall composite score than the IBM x3850 X5 with the Xeon E7-8870 (essentially the same processor). The HP system scores 16% higher in power and 37.7% higher in throughput. Both throughput tests were run with 7 streams. The HP system had Hyper-Threading enabled (80 physical cores, 160 logical) while the IBM system did not. Both systems had 2TB memory, more than sufficient to hold the entire database, data and indexes, in memory. The IBM system had 7 PCI-E SSDs, while the HP system had 416 HDDs in 26 D2700 disk enclosures, attached via 10 LSI SAS RAID controllers and 3 P411 controllers, plus 1 dual-port 8Gbps FC controller.

Also of interest are TPC-H 1TB reports published for the 16-way SPARC M8000 (June 2011) with SPARC64 VII+ processors and the 4-way SPARC T4-4 (Sep 2011). The table below shows configuration information for recent TPC-H 1000GB results.

TPC-H 1000GB   | IBM x3850 X5  | HP ProLiant DL980 G7 | IBM Power 780      | SPARC M8000     | SPARC T4-4
DBMS           | SQL 2K8R2 EE  | SQL 2K8R2 EE         | Sybase IQ ASE 15.2 | Oracle 11g R2   | Oracle 11g R2
Processors     | 8 Xeon E7     | 8 Xeon E7            | 8 POWER7           | 16 SPARC64 VII+ | 4 SPARC T4
Cores-Threads  | 80-80         | 80-160               | 32-128             | 64-128          | 32-256
Memory         | 2048GB        | 2048GB               | 512GB              | 512GB           | 512GB
IO Controllers | 7             | 13                   | 12                 | 4 arrays        | 4 arrays
HDD/SSD        | 7 SSD         | 416 HDD              | 52 SSD             | 4x80 SSD        | 4x80 SSD

The figure below shows TPC-H 1000GB power, throughput and QphH composite scores for the 4 x Xeon 7560 system (32 cores, 64 threads), the two 8 x Xeon E7 systems (80 cores, with 80 and 160 threads), the 8 x POWER7 (32 cores, 128 threads), the 16 x SPARC64 VII+ (64 cores, 128 threads) and the 4 x SPARC T4 (32 cores, 256 threads).

[Figure: TPC-H SF 1000 Results]

The HP 8-way Xeon system and both Oracle/Sun systems, one with 16 sockets and the newest with 4 SPARC T4 processors, are comparable, within 10% of each other.

An important point is that both Oracle/Sun systems and the IBM Power system are configured with 512GB memory, versus 2TB for the 8-way Xeon E7 systems, which is enough to keep all data and indexes in memory. There is still disk IO for the initial data load and for tempdb intermediate results. This is a good indication that Oracle and Sybase have been reasonably optimized for IO, in particular on when to use an index and when not to. I had previously raised the issue that the SQL Server query optimizer should consider the different characteristics of in-memory data, DW-optimized HDD storage (100MB/s per disk sequential) and SSD.

Sun clearly made tremendous improvements from the SPARC64 VII+ to the T4, with the new 4-way system essentially matching the previous 16-way. Of course, Sun had been lagging at the individual processor socket level until now. The most interesting aspect is that the SPARC T4 has 8 threads per core. The expectation is that server applications have a great deal of pointer-chasing code, that is: fetch memory, which determines the next address to fetch, with inherently poor locality.

A modern microprocessor with a 3GHz core frequency corresponds to a 0.33 nanosecond clock cycle. Local node memory access time might be 50ns, or 150 CPU-clocks. Remote node memory access time might be 100ns for a neighboring node, to over 250ns for multi-hop nodes, after cache coherency is taken into account. So depending on how many instructions are required for each non-cached memory access, we can expect each thread or logical core to have many dead cycles, possibly enough to justify 8 threads per core. What is surprising is that Oracle published a TPC-H benchmark with their new T4-4 and not a TPC-C/E, which would be more likely to emphasize pointer-chasing code than DW.
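As a rough illustration of the dead-cycle argument (the 20-cycle figure is an assumption, purely for the arithmetic): suppose a thread does about 20 cycles of useful work between non-cached memory accesses, each of which stalls for 150 cycles. Then:

  utilization per thread ≈ 20 / (20 + 150) ≈ 12%
  8 threads per core ≈ 8 × 12% ≈ 94% of core cycles usefully occupied

With stalls nearly an order of magnitude longer than the work between them, most of a core's cycles are dead from any single thread's perspective, which is what makes 8 threads per core plausible for this class of workload.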

Below are the 22 individual query times for the above systems in the power test (1 stream).

[Figure: TPC-H SF 1000 Queries 1-22]

Below are the 22 individual query power times for just the two 8-way Xeon E7 systems. Overall, the HP system (with HT enabled) has a 16% higher TPC-H power score, but the IBM system without HT is faster or comparable in 9 of the 22 queries. Setting aside the differences in system architecture, the net difference might be attributable to HT.

[Figure: TPC-H SF 1000, IBM and HP 8-way Xeon E7]

Below are the 22 individual query power times for the HP 8-way Xeon E7 and the Oracle SPARC T4-4 systems.

[Figure: TPC-H SF 1000, 8-way HP Xeon E7 and 4-way SPARC T4]

Published Monday, October 10, 2011 5:03 PM by jchang


Comments

 

Eric Strom said:

You mentioned that both Sybase and Oracle had better I/O optimization. Is this more a function of the OS or the database engine? That is, should that be more of a job for the OS or the database engine? I'm not implying bias, but would the tuning have to happen at both the OS and DB engine levels?

I've been interested in how the Windows OS has been handling the significantly different IO characteristics of SSDs vs. HDDs, and I haven't seen too much info on that as of yet. Do you have any insight on that, from a DB/HW guy's perspective?

October 17, 2011 11:19 AM
 

jchang said:

The SQL Server versus Sybase/Oracle difference is all in the DB engine, not the OS. The Windows OS has pretty good plumbing for I/O, with a reasonably rich API set. I do expect continued API improvements, perhaps in how asynch IO is handled, now that 1M IOPS is achievable in a reasonably priced IO configuration. It would also be nice if Windows would respect the FILE_FLAG_NO_BUFFERING flag in network file IO.

Anyway, back to SQL Server. In other blogs, I mention that the query optimizer's execution plan cost model rates a 1350-page scan (or 10.8MB/s) as equivalent in cost to 320 key lookups or nested loops join rows (to the inner source). In a modern DW-designed hard disk configuration, we might target 100MB/s+ per HDD, and expect 300-500 pseudo-random 8K IOPS for the access pattern of a high-row-count key lookup or nested loops join (possibly pre-sorted). This would mean a 12,800-page scan (100MB/s at 8K per page) is equivalent to 300-400 lookups.
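For what it's worth, the 1350:320 ratio lines up with the commonly cited costing constants of 0.000740741 (1/1350) per sequential page and 0.003125 (1/320) per random IO, so both sides of the ratio come out to a cost of exactly 1.0. A quick way to see where the cross-over falls for a given query is to compare the optimizer's estimated costs (TotalSubtreeCost) of a forced seek-plus-lookup plan against a forced scan. The table and index names below are hypothetical, purely for illustration, and FORCESCAN requires SQL Server 2008 R2 SP1 or later:

-- hypothetical table dbo.Orders with nonclustered index IX_Orders_CustomerID
SET SHOWPLAN_ALL ON;
GO
-- plan A: force the index seek + key lookup pattern
SELECT OrderDate, Amount
FROM dbo.Orders WITH (FORCESEEK, INDEX (IX_Orders_CustomerID))
WHERE CustomerID BETWEEN 1000 AND 2000;
GO
-- plan B: force the scan
SELECT OrderDate, Amount
FROM dbo.Orders WITH (FORCESCAN)
WHERE CustomerID BETWEEN 1000 AND 2000;
GO
SET SHOWPLAN_ALL OFF;
GO

Widening the CustomerID range raises the estimated lookup count until the un-hinted plan flips from the seek to the scan, which is the cross-over the 1350:320 ratio encodes.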

If the SQL Server query optimizer allowed for the more modern large-block (one or more 64K reads) to small-block (8K) ratio, then the execution plan would shift sooner from the lookup/loop join to the scan and hash join. This is why I think Sybase and Oracle can produce an IO-intensive (data > memory) TPC-H result.

Also, now that we have SSDs, many databases fit entirely in memory, and SAN systems typically retain the older large/small block IO ratio, the optimizer needs some intelligence in the IO cost model.

October 18, 2011 11:32 PM
 

Eric Strom said:

I follow; so the optimizer is effectively hard-coding the cost/benefit ratio of switching from random IO to sequential at ~10.8MB/s, when modern HDDs can get significantly more sequential throughput.

I will have to catch up on some of your older posts, in that regard.  Thanks for entertaining my tangential questions...

October 20, 2011 2:43 PM
 

jchang said:

See the Paul White blog post

http://sqlblog.com/blogs/paul_white/archive/2010/09/01/inside-the-optimizer-plan-costing.aspx

which mentions that it is possible to adjust the CPU and IO cost weightings:

DBCC SETCPUWEIGHT and DBCC SETIOWEIGHT

Because the CPU and IO cost contributions from a lookup and a scan are different, it is possible to adjust the lookup-scan cross-over point even though the IO portion has not changed. I just never got around to trying this to see which way it shifts. This technique could also be used to change the onset of parallelism.
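A minimal sketch of that experiment (these DBCC commands are undocumented and unsupported, so use a throwaway test instance only; as I recall from Paul White's post, each takes a single numeric weighting with a default of 1):

DBCC SETCPUWEIGHT(2);  -- double the weighting of the CPU cost elements
DBCC SETIOWEIGHT(1);   -- leave the IO cost weighting at its default
-- re-check the estimated plan of a borderline lookup-vs-scan query here
DBCC SETCPUWEIGHT(1);  -- restore the defaults when finished
DBCC SETIOWEIGHT(1);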

October 20, 2011 6:36 PM
 

jchang said:

Now that I have enough beer to think coherently (or is it the other way around?): increasing the CPU weight should lower the effective page-to-row cross-over point, while decreasing the CPU weight should raise the page-to-row cross-over toward the IO-only 1350:320 ratio. There does not seem to be a means of increasing the page/row ratio beyond that.

But increasing the CPU weight might also make it more likely to get a parallel plan, as (table/index) IO costs are not reduced with parallelism while CPU costs are. I am just not sure that it would actually produce a good parallel plan.

October 20, 2011 11:35 PM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
