Several TPC benchmark results for SQL Server published recently (and some not so recently) are of interest in the high-end performance sphere. Unisys published new TPC-E and TPC-H results for their 16-way ES7000 with the Intel Xeon X7460 six-core processor (96 cores total). HP published TPC-H results for their 8-way ProLiant DL785G6 with the AMD Opteron six-core (Istanbul) processor at scale factor 300GB back in July, and now at 1TB. (There are still no TPC-E results for the Opteron Istanbul processors).
The most recent Unisys benchmark results employ Windows Server 2008 R2 together with SQL Server 2008 R2, which supports greater than 64 processors, as the 16-socket Unisys with the six-core Intel Xeon has 96 cores total. Windows Server 2008 R2 also features significant core level changes, most prominently the removal of the dispatch scheduler lock, which was described as enabling much better scaling at the higher number of processor cores.
The main feature in the latest Unisys TPC-E benchmark performance result is Windows Server 2008 R2, which now allows the operating system to utilze all 96 processor cores in the fully populated ES7000. Compare the range of TPC-E performance results for the Intel Xeon X7460 processor from 4 to 16 sockets. The IBM x3850M2 4-way Xeon 7460 with 24 cores and 128GB memory achieved 729.65 tpsE. The 8-way Unisys ES7000, Xeon 7460 with 48 cores and 256GB memory achieved 1165.65.
Last year, Unisys published a result for their ES7000 7600R with 16 Xeon 7460 2.66GHz six-core processors on Windows Server 2008 Datacenter Edition and SQL Server 2008 EE, both 64-bit, which was limited to using 64 of the 96 processor cores in the 16-way ES7000. This provided only a 28% performance increase over the ES7000 with 8 processors and 48 cores. While the performance gain per core is in line with expectations, the performance relative to the 8-socket 48-core is less than desired.
The new Unisys ES7000 result of Nov 2009 is on Windows Server 2008 R2 and SQL Server 2008 R2, both 64-bit Datacenter Edition, which now allows full use of all 96 cores. There is a 34.8% increase in the TPC-E score from 64 to 96 cores. If this were the scaling from increasing the processor cores only, then it would be excellent. However some non-neglibile gain is expected from the operating system enhancements only, considering the high number of cores.
The scaling for each doubling of the number of processor sockets and cores from 4-sockets/24-cores to 8-sockets/48-cores is 1.60. The scaling from 8/48 to 16/96 is 1.73, including the software gains from Windows Server 2008 RTM to R2. The average scaling for each doubling from 4 to 16 sockets (48 to 96 cores) is 1.66. This is approximately in line with the Microsoft slidedecks citing a 1.7X gain for each doubling of processor cores.
Unisys TPC-H Results for the ES7000 16-way
Unisys recently published a TPC-H scale factor 3TB result for their 16-way ES7000 with Windows Server and SQL Server 2008 R2, enabling the use of all 96 cores. Unisys had previously published a SF 10TB result for 16-way on Windows Server 2008 RTM, restricted to using 64 of the 96 cores. A TPC-H scale factor 1TB means the Lineitem table data without compression is approximately 1TB (or was 1TB with the 8 byte date time data type). The full size of the SF 1TB database uncompressed can be from 1.4 to 1.7TB. Technically, we are not supposed to compare TPC-H results at different scale factors. But there are very few published results, so sometimes this cannot be avoided in comparative analysis. Also for comparison, there is a SF 3TB result for a 32-socket 64-core IBM system with the POWER6 processor, the AIX operating system and Sybase database engine.
The chart below shows for the two Unisys results, the ratio of individual query runtime for the 10TB to 3TB on the power runtime (stream 0). One reason we should be careful in comparing results from different scale factors is that the execution plan can be different, so there is no guarantee that the query cost is directly proportional to database size. For the moment, setting aside the possibility of fundamentally different execution plans. The 10TB is 3.3 times larger than 3TB. The 10TB result is on only 64 cores, while the 3TB is on 96 cores.
Without the difference in cores, the 10TB query times should be about 3.3X longer than the 3TB baseline. There should be an additional 20-30% for the difference in number of cores. The expectation then is that queries on the 10TB with 64 cores should take approximately 4 times longer than on the 3TB with 96 cores, not accounting for differences between RTM and R2, and differences in the execution plans. Query 9 runs slower than expected on the 10TB 64 core system. Queries 12 and 22 run slower on the 3TB 96 core system. As a matter of fact, the query times between SF 3TB and 10TB are curiously close to each other. The execution plan at 10GB for query 12 shows a key lookup, and for query 22 a nested loops join. If this occurs in the execution plan at 3 or 10TB, then the query time might be constrained by the disk random IO performance. Both Unisys 3TB and 10TB results are on systems with about 900 disks.
The chart below compares the Unisys with 96 cores (Core 2 architecture) running SQL Server relative to the IBM with 64 cores (POWER6 architecture) running Sybase. The IBM POWER6 system has 50% better overall TPC-H score in QphH than the Unisys with the Xeon 7460 processor. The advantage is less than 20% on the TPC-H power component, which is a single stream testing the ability to use all availabe compute resources for each query.
The IBM result is on a completely difference processor architecture as well as a completely different database engine, so large query-to-query differences are not unexpected, as evident from the chart. The Unisys system with Xeon processors and SQL Server is faster some queries. The IBM system with POWER6 processors and Sybase is faster on others. The IBM system has only 282 disks compared with over 900 disks on Unisys. Both systems should have sufficient sequential disk IO performance, but the Unisys will have much better random IO capability. The IBM system is almost twice as fast on the TPC-H throughput metric, which is not suprising because the POWER6 processor is designed for massive bandwidth, while the Intel processors upto the Core 2 architecture are designed for much more price constrained systems. (Even though the Xeon 7400 series are targeted at expandable systems, it has the same bus architecture as the Core 2 processors for desktop and mobile systems.)
Also important, these SF 3TB results with 1TB system memory should not be compared with the TPC-H SF100-300 results where the data essentially resides in memory versus, without consideration for that disk IO involves significant CPU consumption. The in-memory results have zero random IO and even the SSD results also have nearly no limits on random IO. I am still inclined to speculate that Nehalem EP/EX has much better TPC-H than Dunnington per socket, possibly because of much better memory bandwidth.
HP ProLiant DL785 Opteron TPC-H Results
The results below are on the 8-way HP ProLiant DL785. The first result is on the quad-core Opteron 8384 is on Windows Server 2008 RTM, SQL Server 2008 RTM. The second result with the six-core Opteron 8439 is on Windows Server 2008 SP1 and SQL Server 2008 SP1.
The third result with the six-core Opteron 8439 is on Windows Server 2008 R2 and SQL Server 2008 SP1. It is possible that the (locking) performance improvements in Windows Server 2008 R2 was desired, but because the system has less than 64 cores, SQL Server 2008 R2 was not required. This allows using CAL instead of per processor licensing.
Between the first two systems, the number of cores is increased by 50%, from 4 per socket to 6 per socket. The core frequency increased from 2.7GHz to 2.8GHz. The AMD Opteron Istanbul six-core also introduced HT-assist, which help maintain cache-coherency. HT-Assist is supposed to help transaction processing application more than DW, but no TPC-C or TPC-E benchmark has been published for Istanbul as of 17 November 2009. It is possible HT-Assist is beneficial, but not all the bugs have been worked yet. The difference between SQL Server 2008 RTM and SP1 is unknown.
|8 Opteron 8384 quad-core
|8 Opteron 8439 six-core
|8 Opteron 8439 six-core
The chart below compares the query run time for the SF 300GB for the quad-core relative to the size core. The expectation is that the six-core should be 20-30% faster than the quad-core, as scaling the performance of a single query with a parallel execution plan from DOP 32 to 48 is not a simple matter. Most queries appear to be 40% faster, so this might be attributed to the HT-Assist.
Queries 2, 6 and 17 show very little gain from 32 to 48 cores. This might be because these are relatively small queries which are difficult to scale by increasing the degree of parallelism.
The chart below shows the individual query times for the DL785G6 8-way Opteron six core scale factor 1TB relative to 300GB. The difference in the size of the database is 3.3X. There should be some improvement for Windows Server 2008 R2 over RTM.
However, the SF 300GB with compression fits mostly in memory, and the 1TB does not. So any query plans that generate random IO might be disproportionately longer than the difference in size. Query plan that require a table scan to disk may encounter only minor penalty as the storage system has sufficient sequential IO bandwidth.