THE SQL Server Blog Spot on the Web


Joe Chang

Benchmark Update - Astounding Fujitsu RX900 8-way Xeon 7560 TPC-E Scaling

Fujitsu just published an astounding TPC-E benchmark result of 3,800 tpsE for their 8-way Xeon 7560 system, the Primergy RX900 S1. Fujitsu had previously published a TPC-E result of 2,046.96 tpsE for their 4-way Xeon 7560 system, the Primergy RX600 S5. The new result shows 1.856X scaling (an 85.6% gain) from 4-socket to 8-socket.

Microsoft Windows Server 2008 R2 introduced core OS improvements that not only increased the number of logical processors supported from 64 to ???, but also removed many locks, including the dispatch scheduler lock. This improved high-end scaling (64 to 128 cores?) from 1.5X to 1.7X, based on tests with the HP Superdome and Itanium processors. At the time of this announcement, the Xeon 7500 processors were not yet available.

When the Xeon 7500 did become available in early 2010, the first TPC-E benchmarks were 2,022.64 and 3,141.76 tpsE for the 4-way and 8-way Xeon 7560 systems respectively. The scaling from 4S to 8S was 1.55X, well below the expectation of 1.7X set by Microsoft's 2008 R2 announcement. This was understandable, as the 8-way result was probably rushed to align with the product launch. Perfect benchmark results are ready on their own schedule, which is not always in time for marketing blitzes. (Of course, considering that the marketing budget may be paying for the benchmarks, it would be advisable to try really really hard to have a good result for product launch.)
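The scaling figures can be checked directly from the published tpsE numbers. The sketch below only divides the 8-way result by the 4-way result for each pair quoted in this post; the dictionary labels are mine, not names from the TPC reports.

```python
# TPC-E results quoted in this post (tpsE).
RESULTS = {
    "first Xeon 7560 results (early 2010)": {"4-way": 2022.64, "8-way": 3141.76},
    "Fujitsu RX600 S5 / RX900 S1": {"4-way": 2046.96, "8-way": 3800.00},
}

def scaling(four_way, eight_way):
    """Return the 4-socket to 8-socket throughput ratio."""
    return eight_way / four_way

for label, r in RESULTS.items():
    s = scaling(r["4-way"], r["8-way"])
    print(f"{label}: {s:.3f}X ({(s - 1) * 100:.1f}% gain)")
```

Perfect scaling would be 2.000X, so the second pair is much closer to ideal than the first.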

There are two apparent differences between the new Fujitsu and original NEC 8-way Xeon 7560 TPC-E reports. One is that the Fujitsu system uses SSD while the NEC system used HDD storage. The SSD configuration yields much better average response times, mostly in the Trade-Lookup and Trade-Update transactions, with reductions from 0.50/0.56 sec to 0.13/0.14 sec respectively (see the table below). In the 4-way Xeon 7560 TPC-E reports, the use of SSD over HDD yields a 1% improvement.

(My mistake, I should compare the Fujitsu RX600 2,046.96 tpsE result with the Dell 1,933.96 tpsE result, both systems at 512GB, for a 5.8% performance gain attributed to SSD over HDD. The 1% gain was compared to the IBM at 2,022.64 tpsE with 1TB memory and HDD storage.)
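For reference, the two percentages come from the same simple ratio, just with a different HDD baseline. This is a minimal sketch using only the three 4-way results quoted above; the variable names are mine.

```python
def gain_pct(new, base):
    """Percentage throughput gain of `new` over `base`."""
    return (new / base - 1.0) * 100.0

fujitsu_4way_ssd = 2046.96  # 512GB memory, SSD storage
dell_4way_hdd = 1933.96     # 512GB memory, HDD storage
ibm_4way_hdd = 2022.64      # 1TB memory, HDD storage

print(f"vs Dell (same memory):   {gain_pct(fujitsu_4way_ssd, dell_4way_hdd):.1f}%")
print(f"vs IBM (twice the memory): {gain_pct(fujitsu_4way_ssd, ibm_4way_hdd):.1f}%")
```

The Dell comparison holds memory constant, which is why it is the fairer estimate of what the SSD storage contributes.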

The other difference is that the Fujitsu system distributes network traffic over 6 GbE ports compared with 2 for the NEC system. There are 24 or so(?) RPC calls per TPC-E transaction, so the extra network ports might provide another minor improvement.

Nothing apparent can explain the 4S to 8S scaling improvement from 1.55X to 1.86X. This is certainly not impossible, as IBM figured out how to do this and better with their POWER4 line some years ago. At the time, I thought this was mostly the massive inter-processor bandwidth of the POWER4. Now it is more clear that the OS and database engine all contribute to nearly perfect scaling.

My thinking is that someone at Microsoft has been watching the performance traces and finally figured out the most critical points of contention. (Such persons are always nameless to the outside world, as this would upstage more established egos.) So I believe this is a new build of Windows and SQL Server, but build numbers do not seem to be obvious in the TPC reports, even though full disclosure is required. It is never some magic registry entry like Turbo Mode: ON.

ps The more serious impact of SSD may be evident in the maximum response times, which ran as high as 68.7 seconds for Trade-Result on the NEC system with HDD, and topped out at 7.1 seconds on the Fujitsu system with SSD. I am thinking that having an open transaction for 68 sec can have serious repercussions on an OLTP system. Curiously, though, the 4-way Fujitsu with SSD could not keep max response similarly low (18.53 sec on Trade-Lookup), while the 4-way IBM kept max response to 17 sec with HDD.

Transaction Response Times
The table below shows transaction response times, average and maximum, for the 8-way NEC with HDD and the Fujitsu with SSD storage. The SSD storage system has better average response time, with the biggest impact in Trade-Lookup and Trade-Update. The reduction in maximum response is more dramatic in 6 of the 10 transactions.

Transaction Response Times, Average and Maximum (seconds)

                     Avg Response       Max Response
System               NEC      Fujitsu   NEC      Fujitsu   weight   frames
Storage              HDD      SSD       HDD      SSD
Broker-Volume        0.05     0.06      2.88     6.72      4.9%     1
Customer-Position    0.02     0.05      43.55    3.49      13%      2
Market-Feed          0.03     0.03      48.81    3.48      1%       1
Market-Watch         0.03     0.05      2.77     2.83      18%      1
Security-Detail      0.01     0.02      2.89     3.79      14%      1
Trade-Lookup         0.50     0.13      49.09    3.30      8%       4
Trade-Order          0.07     0.10      45.96    3.74      10.1%    4
Trade-Result         0.07     0.13      68.73    7.10      10%      6
Trade-Status         0.02     0.03      60.23    6.51      19%      1
Trade-Update         0.56     0.14      3.46     3.79      2%       3
Data-Maintenance     0.11     0.07
Avg Response         0.0812   0.0635
tx in flight         2526.5   2389.1
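The "Avg Response" row is consistent with a mix-weighted mean of the ten per-transaction averages, which the sketch below reproduces from the table. The "tx in flight" estimate is my own assumption: it applies Little's law (arrival rate times mean response time) and takes the total transaction rate as tpsE divided by the 10% Trade-Result weight. That lands within roughly 1% of the table's figures, so the table may have been derived slightly differently.

```python
# Per-transaction mix from the table:
# name -> (weight in %, NEC avg response in s, Fujitsu avg response in s)
MIX = {
    "Broker-Volume":     (4.9, 0.05, 0.06),
    "Customer-Position": (13.0, 0.02, 0.05),
    "Market-Feed":       (1.0, 0.03, 0.03),
    "Market-Watch":      (18.0, 0.03, 0.05),
    "Security-Detail":   (14.0, 0.01, 0.02),
    "Trade-Lookup":      (8.0, 0.50, 0.13),
    "Trade-Order":       (10.1, 0.07, 0.10),
    "Trade-Result":      (10.0, 0.07, 0.13),
    "Trade-Status":      (19.0, 0.02, 0.03),
    "Trade-Update":      (2.0, 0.56, 0.14),
}

def weighted_avg(column):
    """Mix-weighted mean response time; column 1 = NEC, column 2 = Fujitsu."""
    total_weight = sum(row[0] for row in MIX.values())
    return sum(row[0] * row[column] for row in MIX.values()) / total_weight

def in_flight(tps_e, avg_response, trade_result_share=0.10):
    """Little's law estimate of concurrent transactions.

    Assumption (not from the TPC reports): tpsE counts only Trade-Result,
    so the total rate across the mix is tps_e / trade_result_share.
    """
    return tps_e / trade_result_share * avg_response

nec_avg = weighted_avg(1)
fujitsu_avg = weighted_avg(2)
print(f"NEC avg response:     {nec_avg:.4f} s")
print(f"Fujitsu avg response: {fujitsu_avg:.4f} s")
print(f"NEC in flight (est.):     {in_flight(3141.76, nec_avg):.0f}")
print(f"Fujitsu in flight (est.): {in_flight(3800.00, fujitsu_avg):.0f}")
```

Trade-Lookup and Trade-Update carry only 10% of the mix between them, yet on the NEC system they account for more than half of the weighted average, which is why the SSD gain shows up so clearly there.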

System configuration

System            NEC            Fujitsu
Processor         Xeon 7560      Xeon 7560
Sockets-Cores     8 x 8 = 64     8 x 8 = 64
Hyper-Threading   yes            yes
Frequency         2.26GHz        2.26GHz
Memory            1024GB         1024GB
IO                7 FC           14 SAS
Storage           1872 HDD       336 SSD
OS                2008 R2 DC     2008 R2 DC
Database          2008 R2 DC     2008 R2 DC
tps-E             3,141.76       3,800.00

Benchmark Summary 2010-09-28

Below is a summary of the best available TPC benchmark results for recent Intel Xeon and AMD Opteron server systems. Note that Westmere-EP 32nm and Nehalem-EX 45nm have been consolidated.

TPC 2-way 4-way 8-way 16-way
Xeon 5600 6C
7500 8C


2010 Sep 19

A TPC-H result was finally published for the 4-way Xeon 7500 @300GB on 14 Sep. A TPC-C result was also published for the 4-way 7500 on 27 Aug. There will probably not be a TPC-C for the 8-way DL980, as there may be a limitation for SQL Server in the ability to write to a single log file. HP seems to be the only vendor active in TPC-H. This could be because other companies have cut staff. Benchmarking is a specialized skill. It usually takes a dedicated person for each benchmark and environment. It is not the benchmark result that is important. It is the investigation into the root cause of bottlenecks to improve performance in the next iteration that is important. So this means only HP will be making contributions in DW.

Published Tuesday, September 28, 2010 6:13 PM by jchang




Greg Linwood said:

I think you might have missed another small difference between the SSDs used in the Fujitsu & HDDs used in the NEC benchmark..

The Fujitsu SSDs only cost ~$400k where the NEC HDDs cost ~$1.4M

So that 1% increase in performance from SSDs came with a $1M saving :)

Compare any two TPC SSD / HDD benchmarks & you'll see a huge saving from using SSD because you don't need to RAID them on large scale to get perf like you do with HDDs. Note the top TPC-C HDD benchmark (IBM) uses 11,000 HDDs + hundreds of enclosures, racks etc..

September 28, 2010 5:29 PM

jchang said:

I am only concerned with differences that impact performance. Normally I do not care about differences in price performance, as it is usually a silly game. The TPC price-configuration is also an artificiality of the benchmark. By rule, the database size is scaled with performance, and the active data is distributed across the entire database. In real life, a transactional database might be 10+TB, but usually the active data is highly concentrated. With proper cluster key design, it is possible to bring most of the active data into memory, greatly reducing disk IO.

In the few production environments that have active data that cannot be concentrated, well then, massive disk IO is the only answer. Also, in a live transaction server that will be driving 80K RPC per sec, the total cost of the environment will be in the $100M-2B range. No one will give a damn about $1M. If the system were to fail, it will be your @$$. So people tend to look for the $10M solution when a $1M solution works. Of course, spending tens of millions on storage did not help the State of Virginia in their outage.

TPC-E IO is mostly read, so RAID level has negligible impact on storage cost. For HDDs, it is the number of spindles necessary to support the IO load. For SSD, it is capacity. I think in the live system, it might be SSD for the most active data, HDD for less active and archival data, plus a few backups, plus a restored backup, plus snapshots.

September 28, 2010 6:53 PM

Greg Linwood said:

By far the largest cost component of any large scale server is the storage, so using SSDs on a $100M server (assuming it has HDDs) would obviously save MUCH more than just $1M - it would probably be more like $60M & very few organisations wouldn't want to save that sort of $$.

However, TPC-E certainly is highly read oriented. I have run TPC-E in a lab & know for sure that you're right about it being highly read oriented. My NDA precludes me from saying anything specific about my own results but I think it's fair to say that this benchmark is highly focussed on CPU spec, assuming the system is sufficiently cached.

When running the kit, you control the cache / tx rate essentially by controlling the # of customers configuration in the generation program (which in turn controls the size of the DB). All TPC-E benchmarks carefully craft the # of customers to reach a specific DB /cache size, which effectively eliminates a huge proportion of storage I/O. This is one reason why the relevance of TPC-E can be questioned - as it focusses mainly on testing CPU & RAM rather than whole systems. This is still very interesting of course & you do publish some interesting material in this area which I enjoy reading.

However, I do think that evaluating the total system cost vs tx rate is what interests most people & what's really new about the recent benchmarks is the switch to SSDs & the HUGE overall cost savings that come from doing so..

September 28, 2010 7:44 PM

jchang said:

Sure, but TPC benchmark system configuration cost and real world system costs have no relation. Almost all TPC configurations are direct-attach storage. Very very few use SAN. Almost all large production OLTP systems use SAN storage. Direct-attach storage with 146G 15K HDD might be $600 per disk on the low-end, $750 per disk on the NEC. A SAN with the 15K 146G drive might run $2000-4000 amortized per disk. The SSDs in the Fujitsu are $1100 each, plus $200 per disk for the enclosure. It still bothers me that the Crucial 64GB SSD is only $150. SSDs for a SAN might be hugely more. Yet people have no problem paying the 4-6X premium for SAN storage.
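As a rough cross-check, the per-disk prices here line up with the storage totals Greg quoted. The sketch multiplies the device counts from the two benchmark configurations (1872 HDD, 336 SSD) by the per-disk prices above. Whether the $750 NEC figure already includes enclosure cost is an assumption on my part, so treat the totals as approximate.

```python
# Device counts from the two 8-way TPC-E configurations.
NEC_HDD_COUNT = 1872
FUJITSU_SSD_COUNT = 336

# Per-device prices quoted in this thread (USD).
NEC_HDD_PRICE = 750        # assumed here to be all-in per disk
FUJITSU_SSD_PRICE = 1100
FUJITSU_ENCLOSURE_PER_DISK = 200

nec_storage = NEC_HDD_COUNT * NEC_HDD_PRICE
fujitsu_storage = FUJITSU_SSD_COUNT * (FUJITSU_SSD_PRICE + FUJITSU_ENCLOSURE_PER_DISK)

print(f"NEC HDD storage:     ${nec_storage:,}")
print(f"Fujitsu SSD storage: ${fujitsu_storage:,}")
print(f"Difference:          ${nec_storage - fujitsu_storage:,}")
```

That gives about $1.4M versus about $0.44M, so the roughly $1M gap is real for direct-attach storage. The argument in this thread is over whether that gap carries over to SAN pricing.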

This is why I say TPC system costs are irrelevant. It may seem obvious to you that people would want to save money. The reality is people buy the really really expensive storage system (on a per disk basis). Then when they need to save money, they sacrifice performance by cutting the number of disks, then by going to even fewer big capacity disks to reduce the enclosure cost, to guarantee crappy performance. Then people interested in saving money go with RAID 5 and sharing the storage system, as if performance was not already bad enough.

Anyways, the TPC configuration is really just what is needed for the benchmark specification. I would configure a hybrid SSD/HDD. The price difference between the Fujitsu and NEC is not hugely relevant; a real system is going to be much more expensive. Right now, no direct-attach storage system supports clustering, even though SAS should be able to do this. So we have to use SAN for clustering, yet no commercial SAN system is properly architected for SSD. All the main SAN systems use 3.5in disks with too few bays per enclosure, and daisy chain too many enclosures sharing the same IO channel.

The right way to do a storage system is 8-12 bays per IO channel, 8 for SSD, and 4 for HDD. And we may as well use 2.5in disks, not 3.5.

I do not see a technical problem with the way TPC-E stresses CPU-memory because they configured the storage system correctly. The problem with real world is people who don't know what they are doing configure storage incorrectly, and end up IO crippled. Sure we could build a benchmark that is required to be IO crippled, but all this would prove is we can be just as stupid as real world IT pretenders/dilettantes. They are why 30% of major IT projects are outright failures and another 30% seriously underachieve key objectives. These people do not belong in this profession, giving the rest of us a bad name.

September 28, 2010 10:04 PM

Greg Linwood said:

>TPC benchmark system configuration cost and real world system costs have no relation<

How do you draw this conclusion? TPC benchmarks are performed on real world hardware, with audited pricing. Most of our customers run SQL Server on hardware that has been used in TPC benchmarks..

If you believe that the hardware TPC benchmarks are run on have no relation to real world systems, why discuss these benchmarks at all??

I also disagree that "Almost all large production OLTP systems use SAN storage". Some do, others don't, simple as that. I'd put it to you that most truly large scale systems aren't run on SANs (not shared ones anyhow) as the risks of shared infrastructure are huge to such systems & the larger the system, the more likely the customer will invest in dedicated infrastructure..

Also, whilst there might be a very small number of businesses / govt who truly aren't budget conscious, this is such an extremely small group they would hardly be worth using as the target audience for a public blog. I can't imagine more than a tiny % of the readership of this blog having truly unlimited budget!!

September 29, 2010 5:26 AM

jchang said:

Ok, let's try this yet one more time.

The benchmark performance characteristics have meaning, especially the scaling from 2->4->8P, the difference between Xeon and Opteron and average response time between low queue HDD, high queue HDD and SSD, while the system cost does not. That is why I focus on the parts that have meaning, and ignore the parts that do not.

All large (4-way+) and almost all medium (2-way) OLTP systems I have seen are on SAN storage, and all are clustered too (a significant reason for SAN). The critical ones have dedicated SAN, usually because the DB team doesn't trust the SAN team. Potentially an OLTP system could use mirroring, but the ones I have seen cluster locally (with SAN) and mirror to a remote site. If your customers with large OLTP systems are not on SAN, then we have different customers.

I didn't say people have unlimited budget. I said people put large OLTP systems on SAN storage, which has a totally different cost structure between HDD and SSD than the direct-attach storage in TPC benchmark reports. So the cost savings in going from 1872x147G 15K HDD to 336x64G SSD in a direct attach storage system has no relation to the cost structure of a similar switch in a SAN system. Hence, I don't pay attention to this.

If people ran TPC-E benchmarks with the typical TPC-E configuration, I would pay attention to its cost structure. But people run their own OLTP applications on SAN systems, so I pay some attention to SAN cost structure.

ps. I strongly recommend direct-attach storage for DW because DW should have no firm clustering requirement, hence no requirement for SAN.

pps. Oh yeah, one more thing. The financial decision point between HDD and SSD is really one of capacity versus IOPS. On HDD, capacity is effectively free (direct-attach $500 per 146GB, not quite so on SAN at ~$2K per 146GB), so the cost is in the IOPS - 200 at low latency. For SSD, the IOPS are nearly free at typically 25K IOPS per SATA/SAS device (Fusion-IO might be 1M, but it is equivalent to an array of SSDs) but capacity is "expensive". So in most situations, we can just compare the HDD IOPS cost to the SSD GB cost with the type of storage we are interested in, direct-attach or SAN.

For each customer application, there will be a specific IOPS/GB ratio, to be equated to the storage cost structure. TPC-E and TPC-C have specific IOPS/GB ratios per benchmark rule. Again, the TPC-E IOPS/GB ratio is driven by the benchmark rule and has absolutely no direct mapping to every different production environment's IOPS/GB ratio. Hence the cross-over point in TPC-E and a specific production environment is different, even if they use direct-attach storage. So why waste time looking at TPC-E total system price? (The component prices are helpful because you can use them to calculate the cross-over point for your environment.)
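To make the cross-over idea concrete, here is a minimal sketch. It takes the direct-attach figures from this thread at face value ($500 per 146GB HDD at about 200 IOPS, and $1100 plus $200 enclosure per 64GB SSD at about 25K IOPS) and sizes each option by whichever of capacity or IOPS needs more devices. The two workloads are hypothetical, and the model ignores RAID overhead, controllers and SAN pricing.

```python
import math

# Per-device figures quoted in this thread (direct-attach, USD).
HDD = {"dev_gb": 146, "dev_iops": 200, "dev_price": 500}
SSD = {"dev_gb": 64, "dev_iops": 25_000, "dev_price": 1300}  # $1100 + $200 enclosure

def devices_needed(capacity_gb, iops, dev_gb, dev_iops):
    """Devices required to satisfy both the capacity and the IOPS target."""
    return max(math.ceil(capacity_gb / dev_gb), math.ceil(iops / dev_iops))

def storage_cost(capacity_gb, iops, dev_gb, dev_iops, dev_price):
    """Cost of the device count needed for the given workload."""
    return devices_needed(capacity_gb, iops, dev_gb, dev_iops) * dev_price

def crossover_iops_per_gb(hdd, ssd):
    """IOPS/GB ratio where HDD cost per IOPS equals SSD cost per GB."""
    hdd_cost_per_iops = hdd["dev_price"] / hdd["dev_iops"]
    ssd_cost_per_gb = ssd["dev_price"] / ssd["dev_gb"]
    return ssd_cost_per_gb / hdd_cost_per_iops

# Two hypothetical workloads: (capacity in GB, required IOPS).
WORKLOADS = {
    "large, cool data": (10_000, 20_000),
    "small, hot data": (2_000, 100_000),
}

for name, (gb, iops) in WORKLOADS.items():
    hdd_cost = storage_cost(gb, iops, **HDD)
    ssd_cost = storage_cost(gb, iops, **SSD)
    cheaper = "HDD" if hdd_cost < ssd_cost else "SSD"
    print(f"{name}: HDD ${hdd_cost:,} vs SSD ${ssd_cost:,} -> {cheaper}")

print(f"cross-over at about {crossover_iops_per_gb(HDD, SSD):.1f} IOPS per GB")
```

With these inputs, a workload needing more than about 8 IOPS per GB favors SSD and one needing less favors HDD. Substituting SAN per-device prices moves that point, which is the reason the TPC total system price does not transfer to a different environment.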

ppps. It is not necessary for a benchmark to be exactly like your application to be useful. In fact it is impossible for there to be benchmarks like every application. This is why it is important to break down a result into components that can be mapped to your application. Too many people get lost in the superficial metrics specific to the benchmark, but with no meaning in their application.

September 29, 2010 9:58 AM

Rob Volk said:

> It is never some magic registry entry like Turbo Mode: ON.

Well, there goes Intel's latest marketing ploy:

September 29, 2010 10:01 AM

jchang said:

Great spot Rob, actually this is valid, at least for many well-designed server apps. HT does help, by 30% or more, and more cache helps a little too. I wonder how many people are running server apps on their Gateway SX2841-09e. Of course, you are already paying the modest premium for HT and 24M LLC on the Xeon 7560. Still, I don't think this is a good idea. It's really better to have two SKUs. Asking customers to do something never works. Or it should be mandatory that this upgrade be done at time of purchase by the Geek Squad techs co-located with Best Buy.

Also, it will be xmas soon, we could ask Santa for the magic button to make our servers run faster.

September 29, 2010 10:21 AM

Glenn Berry said:

SQL Server 2008 R2 on top of Windows Server 2008 R2 allows you to go up to 256 cores. I am not sure whether Windows Server 2008 R2 by itself has a higher limit than 256. I wonder if it is a SQL Server limit or a Windows limit?

The Westmere-EX will have 10 cores, plus hyper-threading, for 20 logical cores per socket. A 16 socket Westmere-EX would then have 320 logical cores, which would exceed the 256 core limit.

October 4, 2010 10:16 AM

jchang said:

My understanding is the R2 limit is actually 1024 logical processors, but there may be a split between Enterprise Edition, for which CAL is allowed, and Datacenter Edition, for which only per processor licensing is available. MS policy is not to announce support for more than they have tested. So the enumeration scheme may work for more. 1024 lp could be 512 cores with HT, which would be more than a Superdome2 with 64 sockets, quad-core = 256 cores. I suppose HP has a 16 or 32-way Xeon 7x00 in development. I also assume Westmere-EX has taped out + first silicon. So 32 x 10 x 2 = 640?

My advice to Intel is to increase threads per core to 4, meaning 1280, because proper coding and continued work should make better use of HT. Of course, I do not know if the Intel cpu people read my blog and followed my advice verbatim.
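The arithmetic in the last two comments is just sockets x cores x threads, compared against the limits mentioned. A small sketch follows; the configurations are the ones speculated about in this thread, not announced products, and the two limits are the 256 cited by Glenn and the 1024 I gave as my understanding.

```python
SQL_LIMIT = 256       # limit cited by Glenn for SQL Server 2008 R2 on Windows 2008 R2
WINDOWS_LIMIT = 1024  # my understanding of the R2 limit; not verified here

def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Logical processor count for a given system layout."""
    return sockets * cores_per_socket * threads_per_core

# (sockets, cores per socket, threads per core); all speculative layouts.
CONFIGS = {
    "16-socket Westmere-EX with HT": (16, 10, 2),
    "32-socket Westmere-EX with HT": (32, 10, 2),
    "32-socket, 10 cores, 4 threads per core": (32, 10, 4),
    "Superdome2, 64 sockets, quad-core (cores only)": (64, 4, 1),
}

for name, layout in CONFIGS.items():
    lp = logical_processors(*layout)
    print(f"{name}: {lp} (over 256: {lp > SQL_LIMIT}, over 1024: {lp > WINDOWS_LIMIT})")
```

Only the 4-threads-per-core case exceeds 1024, while every Westmere-EX layout listed exceeds 256.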

October 4, 2010 11:15 AM

Nasser said:

Hi Joe,

I was looking at the TPC-C results, and I was surprised to see that HP has published a new TPC-C entry on SQL Server 2005 (Windows 2008 R2)!!! I thought Microsoft stopped supporting TPC-C and was only focusing on TPC-E and TPC-H? How could HP get away with that, plus why did HP use SQL 2005 not 2008 R2?

Finally, I wish Microsoft could come up with some decent TPC-C results (with the help of HP, NEC, DELL or UNISYS). I am arguing with Oracle Fanboys all the time, and all I hear from them is how MS can't keep up in TPC-C and switched to TPC-E?


October 9, 2010 9:35 AM

jchang said:

TPC-E was codeveloped by all the major software and hardware vendors: Microsoft, Oracle, IBM, Sybase etc. Everyone recognized that the TPC-C specification was distorting system configuration as time progressed, due in part to the different rates of performance growth in the components of the complete server system. The other factor was that as processors became immensely powerful, people threw in lots of read queries, while TPC-C remained pure transaction processing. This generates an even mix of read/write (a read forces the eviction of a dirty page).

The TPC-C disk IO pattern has become so difficult that neither the Oracle/Sun 7.6M nor the IBM 10M tpm-C result could use HDD; both used SSD storage, which is also very challenging for such a large database.

I think MS wanted the switch to TPC-E with SQL Server 2008. But HP still wanted to put out TPC-C for their own reasons. So MS allowed this for SQL Server 2005 (which means no more than 64 logical processors). Another reason there will not be a higher SQL Server TPC-C result on a single database is that TPC-C generates intensive log writes, which pushes the limit of what can be written in serialized fashion given current write latencies and the SQL Server log write in-flight limits. I do not think MS will introduce a multiple concurrent log file system.

I do think it is possible to distribute the TPC-C load across multiple databases on a single server and instance to get around the log write issue. Perhaps we can ask HP if they would try this on their DL980G7.

Anyways, Oracle was a codeveloper of TPC-E. They have no complaint on the technical aspects of TPC-E. As I said, probably in the Big Iron III blog, Oracle in fact runs TPC-E just fine, and can probably put out a very impressive big system scale-up result, possibly better than SQL Server. The reason, as I understand it, that Oracle will not publish a TPC-E is that they would then be obligated to put out an Oracle RAC TPC-E, which is the problem. TPC-C naturally partitions into localized activity, with only a small fraction of cross-district transactions, i.e., almost everything can stay on a RAC node. This is not the case with TPC-E. Oracle has made it company policy to push scale-out via RAC, not the agnostic scale up or out. But Oracle just changed the ODM to include 8-way, so they may put out an 8-way TPC-E.

Have you ever heard the saying: don't argue with the village idiot? People may not be able to tell which one is the idiot. I suggest not arguing with Oracle fanboys. Oracle is indeed a very sophisticated product (that used to require a great deal of knowledge to implement). Many simpletons think if they use a sophisticated product, then that makes them sophisticated too. The idiot can leave the village, but some things won't change.

October 9, 2010 11:59 AM

Grumpy Old DBA said:

October 13, 2010 6:16 AM



About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine


