|
|
|
|
-
Several TPC benchmark results for SQL Server published recently (and some not so recently) are of interest in the high-end performance sphere. Unisys published new TPC-E and TPC-H results for their 16-way ES7000 with the Intel Xeon X7460 six-core processor (96 cores total). HP published TPC-H results for their 8-way ProLiant DL785G6 with the AMD Opteron six-core (Istanbul) processor at scale factor 300GB back in July, and now at 1TB. (There are still no TPC-E results for the Opteron Istanbul processors).
The most recent Unisys benchmark results employ Windows Server 2008 R2 together with SQL Server 2008 R2, which supports greater than 64 processors, as the 16-socket Unisys with the six-core Intel Xeon has 96 cores total. Windows Server 2008 R2 also features significant core level changes, most prominently the removal of the dispatch scheduler lock, which was described as enabling much better scaling at the higher number of processor cores.
TPC-E Results
The main feature in the latest Unisys TPC-E benchmark performance result is Windows Server 2008 R2, which now allows the operating system to utilze all 96 processor cores in the fully populated ES7000. Compare the range of TPC-E performance results for the Intel Xeon X7460 processor from 4 to 16 sockets. The IBM x3850M2 4-way Xeon 7460 with 24 cores and 128GB memory achieved 729.65 tpsE. The 8-way Unisys ES7000, Xeon 7460 with 48 cores and 256GB memory achieved 1165.65.
Last year, Unisys published a result for their ES7000 7600R with 16 Xeon 7460 2.66GHz six-core processors on Windows Server 2008 Datacenter Edition and SQL Server 2008 EE, both 64-bit, which was limited to using 64 of the 96 processor cores in the 16-way ES7000. This provided only a 28% performance increase over the ES7000 with 8 processors and 48 cores. While the performance gain per core is in line with expectations, the performance relative to the 8-socket 48-core is less than desired.
| Processor |
GHz |
Memory |
Disks |
OS |
SQL Server |
tpsE |
Pub Date |
| 4S/24C/24T |
2.66 |
128GB |
400 |
W2K8 |
2008 RTM |
729.65 |
9-2008 |
| 8S/48C/48T |
2.66 |
256GB |
675 |
W2K8 |
2008 RTM |
1,165.56 |
4-2009 |
| 16S/64C/64T |
2.66 |
512GB |
775 |
W2K8 |
2008 RTM |
1,493.42 |
11-2008 |
| 16S/96C/96T |
2.66 |
1024GB |
882 |
W2K8 R2 |
2008 R2 |
2,012.77 |
11-2009 |
The new Unisys ES7000 result of Nov 2009 is on Windows Server 2008 R2 and SQL Server 2008 R2, both 64-bit Datacenter Edition, which now allows full use of all 96 cores. There is a 34.8% increase in the TPC-E score from 64 to 96 cores. If this were the scaling from increasing the processor cores only, then it would be excellent. However some non-neglibile gain is expected from the operating system enhancements only, considering the high number of cores.
The scaling for each doubling of the number of processor sockets and cores from 4-sockets/24-cores to 8-sockets/48-cores is 1.60. The scaling from 8/48 to 16/96 is 1.73, including the software gains from Windows Server 2008 RTM to R2. The average scaling for each doubling from 4 to 16 sockets (48 to 96 cores) is 1.66. This is approximately in line with the Microsoft slidedecks citing a 1.7X gain for each doubling of processor cores.
Unisys TPC-H Results for the ES7000 16-way
Unisys recently published a TPC-H scale factor 3TB result for their 16-way ES7000 with Windows Server and SQL Server 2008 R2, enabling the use of all 96 cores. Unisys had previously published a SF 10TB result for 16-way on Windows Server 2008 RTM, restricted to using 64 of the 96 cores. A TPC-H scale factor 1TB means the Lineitem table data without compression is approximately 1TB (or was 1TB with the 8 byte date time data type). The full size of the SF 1TB database uncompressed can be from 1.4 to 1.7TB. Technically, we are not supposed to compare TPC-H results at different scale factors. But there are very few published results, so sometimes this cannot be avoided in comparative analysis. Also for comparison, there is a SF 3TB result for a 32-socket 64-core IBM system with the POWER6 processor, the AIX operating system and Sybase database engine.
| System |
Processors |
GHz |
Memory |
Disks |
SF |
QphH |
Power |
Throughput |
| Unisys |
16S/64C/64T |
2.66 |
512GB |
928 |
10TB |
80,172.7 |
103,956.1 |
61,830.5 |
| Unisys |
16S/96C/96T |
2.66 |
1024GB |
914 |
3TB |
102,778.2 |
120,254.8 |
87,841.4 |
| IBM p595 |
32S/64C/?T |
5.0 |
1024GB |
288 |
3TB |
154,115.8 |
141,695.9 |
167,625.6 |
The chart below shows for the two Unisys results, the ratio of individual query runtime for the 10TB to 3TB on the power runtime (stream 0). One reason we should be careful in comparing results from different scale factors is that the execution plan can be different, so there is no guarantee that the query cost is directly proportional to database size. For the moment, setting aside the possibility of fundamentally different execution plans. The 10TB is 3.3 times larger than 3TB. The 10TB result is on only 64 cores, while the 3TB is on 96 cores.

Without the difference in cores, the 10TB query times should be about 3.3X longer than the 3TB baseline. There should be an additional 20-30% for the difference in number of cores. The expectation then is that queries on the 10TB with 64 cores should take approximately 4 times longer than on the 3TB with 96 cores, not accounting for differences between RTM and R2, and differences in the execution plans. Query 9 runs slower than expected on the 10TB 64 core system. Queries 12 and 22 run slower on the 3TB 96 core system. As a matter of fact, the query times between SF 3TB and 10TB are curiously close to each other. The execution plan at 10GB for query 12 shows a key lookup, and for query 22 a nested loops join. If this occurs in the execution plan at 3 or 10TB, then the query time might be constrained by the disk random IO performance. Both Unisys 3TB and 10TB results are on systems with about 900 disks.
The chart below compares the Unisys with 96 cores (Core 2 architecture) running SQL Server relative to the IBM with 64 cores (POWER6 architecture) running Sybase. The IBM POWER6 system has 50% better overall TPC-H score in QphH than the Unisys with the Xeon 7460 processor. The advantage is less than 20% on the TPC-H power component, which is a single stream testing the ability to use all availabe compute resources for each query.

The IBM result is on a completely difference processor architecture as well as a completely different database engine, so large query-to-query differences are not unexpected, as evident from the chart. The Unisys system with Xeon processors and SQL Server is faster some queries. The IBM system with POWER6 processors and Sybase is faster on others. The IBM system has only 282 disks compared with over 900 disks on Unisys. Both systems should have sufficient sequential disk IO performance, but the Unisys will have much better random IO capability. The IBM system is almost twice as fast on the TPC-H throughput metric, which is not suprising because the POWER6 processor is designed for massive bandwidth, while the Intel processors upto the Core 2 architecture are designed for much more price constrained systems. (Even though the Xeon 7400 series are targeted at expandable systems, it has the same bus architecture as the Core 2 processors for desktop and mobile systems.)
Also important, these SF 3TB results with 1TB system memory should not be compared with the TPC-H SF100-300 results where the data essentially resides in memory versus, without consideration for that disk IO involves significant CPU consumption. The in-memory results have zero random IO and even the SSD results also have nearly no limits on random IO. I am still inclined to speculate that Nehalem EP/EX has much better TPC-H than Dunnington per socket, possibly because of much better memory bandwidth.
HP ProLiant DL785 Opteron TPC-H Results
The results below are on the 8-way HP ProLiant DL785. The first result is on the quad-core Opteron 8384 is on Windows Server 2008 RTM, SQL Server 2008 RTM. The second result with the six-core Opteron 8439 is on Windows Server 2008 SP1 and SQL Server 2008 SP1.
The third result with the six-core Opteron 8439 is on Windows Server 2008 R2 and SQL Server 2008 SP1. It is possible that the (locking) performance improvements in Windows Server 2008 R2 was desired, but because the system has less than 64 cores, SQL Server 2008 R2 was not required. This allows using CAL instead of per processor licensing.
Between the first two systems, the number of cores is increased by 50%, from 4 per socket to 6 per socket. The core frequency increased from 2.7GHz to 2.8GHz. The AMD Opteron Istanbul six-core also introduced HT-assist, which help maintain cache-coherency. HT-Assist is supposed to help transaction processing application more than DW, but no TPC-C or TPC-E benchmark has been published for Istanbul as of 17 November 2009. It is possible HT-Assist is beneficial, but not all the bugs have been worked yet. The difference between SQL Server 2008 RTM and SP1 is unknown.
| Processor |
cores |
GHz |
Memory |
SF |
QphH |
Power |
Throughput |
| 8 Opteron 8384 quad-core |
32 |
2.7 |
256GB |
300 |
57,684.7 |
75,161.2 |
44,271.9 |
| 8 Opteron 8439 six-core |
48 |
2.8 |
256GB |
300 |
91,558.2 |
109,067.1 |
76,860.0 |
| 8 Opteron 8439 six-core |
48 |
2.8 |
512GB |
1000 |
81,514.8 |
95,789.1 |
69,367.6 |
The chart below compares the query run time for the SF 300GB for the quad-core relative to the size core. The expectation is that the six-core should be 20-30% faster than the quad-core, as scaling the performance of a single query with a parallel execution plan from DOP 32 to 48 is not a simple matter. Most queries appear to be 40% faster, so this might be attributed to the HT-Assist.

Queries 2, 6 and 17 show very little gain from 32 to 48 cores. This might be because these are relatively small queries which are difficult to scale by increasing the degree of parallelism.
The chart below shows the individual query times for the DL785G6 8-way Opteron six core scale factor 1TB relative to 300GB. The difference in the size of the database is 3.3X. There should be some improvement for Windows Server 2008 R2 over RTM.

However, the SF 300GB with compression fits mostly in memory, and the 1TB does not. So any query plans that generate random IO might be disproportionately longer than the difference in size. Query plan that require a table scan to disk may encounter only minor penalty as the storage system has sufficient sequential IO bandwidth.
|
-
-
At Intel Developer Forum 2009 last week, Microsoft disclosed significant advances in Windows Server 2008 R2 with the elimination of many locks, most prominently the Dispatch Scheduler lock, that impact the ability scale performance up to and beyond 64 cores. Look for the presentation: Microsoft & Intel Innovations in Hardware and Software to Deliver New Technology Experiences by Mark Russinovich, and Shiv Kaushik, Intel Developer Forum session SPCS003 (http://www.intel.com/idf/training-sessions/ or https://intel.wingateweb.com/us09/scheduler/catalog/catalog.jsp), also available as a webcast (http://www.intel.com/idf/pressroom/video.htm)
Earlier I talked about Big-Iron revival with the upcoming Intel Nehalem-EX eight-core processor. Intel will finally have a Xeon processor with high-end scaling potential. The Intel Server Update slide deck by Boyd Davis state that 8 OEMs have 15 system designs for 8-way and larger in the works. Several system vendors, including IBM, NEC and Unisys, have had 8-way to 32-way systems for several Xeon processor generations, but it was always apparent that performance scaling trailed off after 4 sockets. See the IDF session SPCS002 Technology Insight: Intelligent and Expandable High-End Intel Server Platform, Codenamed Nehalem-EX by Stephen Pawlowski for details.
Scaling performance on big iron NUMA systems involves a complex combination of matching the database architecture to the capabilities of the underlying SQL Server engine, the operating system and hardware architecture. For major improvements in two pillars of the foundation to arrive together is bound to generate excitement and anticipation.
SQL Server 2000 and later
In the old days, it was possible but very difficult to scale SQL Server 2000 on the contemporary hardware and operating system of that time. Many of the difficulties were in the SQL Server engine, but there were limitations with the Windows operating system and hardware as well. Certain operations in SQL Server 2000 could scale well beyond 4 cores or over multiple NUMA nodes. Other operations did not, and some operations even had severe negative scaling beyond 4 cores.
In order to scale on NUMA systems, it was necessary to design the database cluster keys, non-clustered indexes and write SQL queries so that the execution plan avoided problematic operations. Almost none of the details for this have ever been published (mine were originally published else where, but I will try to collect them on my website www.qdpma.com). Apparently vendors do not like to tell customers they should completely re-architect. As a database architect consultant, I do not see why this subject is so taboo. - Its just a simple matter of hiring a really expensive database architect consulant for 8-10 weeks right? People should do this more often, and this is my completely unbiased opinion.
Many developers and even data architects do not have adequate skills for generating efficient execution plans on ordinary SMP systems, let alone manipulating the execution plan to match a specific NUMA system architecture.
People have just tried to put an existing application on a big-iron system without any consideration for redesigning database architecture, or even using execution plan hints. Usually this leads to an uneven outcome, and frequently severe problems specific to NUMA systems. I believe enough people encountered such problems that a general awareness developed to avoid big-iron systems. In the last few years, I have encountered few people contemplating the purchase of the big-iron system, and these were usually the ProLiant DL785.
With SQL Server 2005, many scaling problems in the database engine were resolved. Service pack 2 provided another round of significant fixes. SQL Server 2008 introduced data warehouse performance enhancement, but I found these to be problematic and would sometimes rewrite a query to force the 2005 execution plans.
Windows Server Operating System
Microsoft makes ongoing enhancements to the core server operating system to improve performance scaling over many processors. Improvements were made from Windows 2000 to 2003, but a change in the handling of interrupts caused performance issues in large systems. This was resolve in a post SP1 hot-fix. At Microsoft WinHEC 2007, disk I/O handling in NUMA systems was discussed, but is was unclear whether this was a Windows Server 2008 RTM or R2 feature.
The Windows Server 2008 R2 extension to handle more than 64 processors will get most of the press, but the things that get less publicity, ie that are not understood by press people, are equally important.
See the follow WinHEC presentations for more details:
Win HEC 2008
ENT-T554 Windows Support For Greater Than 64 Logical Processors by Arie van der Hoeven
ENT-T555 Scaling More Than 64 Logical Processors: A SQL Perspective by Alex Verbitski and Pravin Mittal
WinHEC 2007
SVR-T332 NUMA I/O Optimizations by Bruce Worthington (linked provided by Konstantin Korobkov below, thanks)
The WinHEC ENT-T555 presentation mentions OLTP scaling of 1.7X from 64 to 128 Logical Processors. The IDF presentations states 1.7X scaling from 128 to 256 LP, but it is very possible these two presentations do not reference the same baseline.
Even though the core elements will soon be in place to enable broadly scalable performance on big-iron systems, the expectation is that it will take time for the SQL Server engine team, and the Windows operating system team to build enough experience on the Nehalem-EX NUMA platforms to make all of this work together out of box. In the meantime, there are a handful of consultants with deep NUMA performance tuning experience that can make this happen as is (not to be construed as a solicitation for services).
AMD Istanbul
AMD crowed loudly that Opteron with its integrated memory controller and Hyper-Transport scaled memory and inter-processor bandwidth with the number of processors, while the Intel systems up to the Xeon 7400 series were constrained on a shared processor front-side bus and the fixed memory bandwidth of discrete memory controller. See for example excerpts from: SQL Server 2005 and AMD64 –a winning team.

With the announcement of the six-core Opteron, codename Istanbul, we find out that AMD previously did not have mechanism for maintaining cache-coherency comparable to the Snoop Filter in the Intel chipsets. Without this, much of the available bandwidth is consumed by cache coherency traffic, limiting scaling in systems with 4 or more processor sockets. In Istanbul, the HT Assist, or Probe Filter feature uses 1M of the 6M L3 as a directory cache to track cache lines. AMD measured 42GB/s memory bandwidth with HT Assist versus 25.5GB/s without HT Assist.

So there is now an expectation that Opteron systems should have improved scaling in 4-way and larger systems. While HP and Sun have 8-way Opteron systems since the quad-core Barcelona, only TPC-H data warehouse benchmarks have been published. To date no TPC-C or TPC-E OLTP benchmarks have been published for 8-way Opteron systems, even the six-core Istanbul with HT-Assist. For that matter, no TPC-C or TPC-E benchmarks have been published for 4-way systems with the six-core Istanbul, even though 4-way quad-core Opteron systems have posted respectable results on both TPC-C and TPC-E. It is possible that this is not a simple feature to implement and the first attempt has issues. Hopefully a fixed version will available before too long and we can see OLTP benchmark results for subsequent generation Opteron systems.
Itanium
The Itanium processor and system architecture was designed for big system scaling, but the processor has languished at the 90nm dual-core Montvale. The 65nm quad-core Tukwila has encountered multiple delays to 2010? Itanium is now mostly positioned as having extensible reliability and availability features (Machine Check Architecture).
Per Linchi, see Benchmark Omissions for the Six-Core Intel Xeon AMD Opteron Processors
|
-
While reading through the documentation for the HP Storage Works 2000 MSA, I found the following performance numbers cited in the quickspecs for both the FC and SAS versions. The 2000fc version cited performance numbers for the fc (Fiber Channel), sa (SAS) and i (iSCSI) models. The 2000sa version only cites SAS results. The MSA 2000 Technical Cookbook also cites performance results.
The results cited in the FC model documentation for the SAS and iSCSI models cannot be correct, unless the SAS and iSCSI models were completely botched, and as far as I am aware of, it has been a really long time since HP/Compaq last botched a design this badly.
I am inclined to think that the results cited in the FC section are incorrect, and may inadvertently be used by HP sales rep to recommend the FC version. The configuration in the performance report below is somewhat ok for OLTP, 4 x 4Gbs FC ports or equivalent. Of course if I had 4 SAS RAID controllers, each connected to 2 MSA-60 with 12 15K HDDs, I could get 4-6GB/s sequential. For a clustered transaction processing system, you pretty much need the SAN to support shared disks. But I am inclined to ask people to consider database mirroring, each node employing direct attach storage.
For Data Warehouse applications, I would outright recommend the direct-attach MSA-60 and skip the SAN.
From HP StorageWorks 2000fc G2 Modular Smart Array on 2009 Sep 14
|
Performance numbers are a guideline as established by tests using RAW I/O in an Operating System Agnostic test lab environment. A total of 96 72GB 15K SAS drives were used in a dual controller configuration of eight 12 Disk vdisks, eight 806 GB volumes, and two volumes per host. Four hosts directly attached to the MSA2000fc G2 were used in this test configuration (results cannot be expected with a single host). Results were achieved in Sequential Writes with 256K blocks; all random tests were based on 8K block sizes. Number and type of applications, drive type, operating system used, and the number of hosts will affect overall performance.
|
| Workload |
MSA2300fc |
MSA2000sa |
MSA2000i |
| Host Connect |
4 Gb Fibre Channel |
3 Gb SAS |
1GbE Ethernet |
| |
|
|
|
| MSA2000 RAID 10 Performance Results (rounded) |
| Random Reads IOPs |
21,400 |
10,600 |
8,200 |
| Random Writes IOPs |
8,800 |
4,900 |
4,500 |
| Random Mix IOPs 60/40 read/write |
13,500 |
6,800 |
6,100 |
| Sequential Reads MB |
1,300 |
700 |
300 |
| Sequential Writes MBs |
560 |
350 |
260 |
| MSA2000 RAID 5 Performance Results (rounded) |
| Random Reads IOPs |
20,500 |
10,200 |
7,800 |
| Random Writes IOPs |
2,400 |
2,000 |
1,600 |
| Sequential Reads MBs |
1,300 |
700 |
300 |
| Sequential Writes MBs |
780 |
380 |
270 |
Random Mix IOPs 60/40 read/write |
5,900 |
3,300 |
3,200 |
| MSA2000 RAID 6 Performance Results (rounded) |
| Random Reads IOPs |
20,500 |
10,100 |
7,800 |
| Random Writes IOPs |
1,600 |
1,400 |
1,200 |
| Sequential Reads MBs |
1,300 |
700 |
300 |
| Sequential Writes MBs |
820 |
380 |
270 |
| Random Mix IOPs 60/40 read/write |
4,500 |
2,750 |
2,560 | |
From: HP StorageWorks 2000sa G2 Modular Smart Array
|
Performance numbers are a guideline as established by tests using RAW I/O in an Operating System Agnostic test lab environment. A total of 96 72GB 15K SAS drives were used in a dual controller configuration of eight 12 Disk vdisks, eight 806 GB volumes, and two volumes per host. Four hosts directly attached to the MSA2000sa G2 were used in this test configuration (results cannot be expected with a single host). Results were achieved in Sequential Writes with 256K blocks; all random test were based on 8K block sizes. Number and type of applications, drive type, operating system used, and the number of hosts will affect overall performance.
|
| Workload |
MSA2300sa G2 |
| Host Connect |
3Gb SAS |
| MSA2000 RAID 10 Performance Results (rounded) |
| Random Reads IOPs |
21861 |
| Random Writes IOPs |
14491 |
| Random Mix IOPs 60/40 read/write |
17700 |
| Sequential Reads MB |
1049 |
| Sequential Writes MBs |
531 |
| MSA2000 RAID 5 Performance Results (rounded) |
| Random Reads IOPs |
21019 |
| Random Writes IOPs |
2705 |
| Sequential Reads MBs |
1051 |
| Sequential Writes MBs |
644 |
Random Mix IOPs 60/40 read/write |
5897 |
| MSA2000 RAID 6 Performance Results (rounded) |
| Random Reads IOPs |
21029 |
| Random Writes IOPs |
1918 |
| Sequential Reads MBs |
1054 |
| Sequential Writes MBs |
736 |
| Random Mix IOPs 60/40 read/write |
4872 | |
The following slide is from the HP MSA2000 Technical Cook Book.

Update 2009 Oct 30
Apparently HP has updated the HP StorageWorks 2000fc G2 Modular Smart Array spec sheet
Performance numbers are a guideline as established by tests using RAW I/O in an Operating System Agnostic test lab environment. A total of 96 72GB 15K SAS drives were used in a dual controller configuration of eight 12 Disk vdisks, eight 806 GB volumes, and two volumes per host. Four hosts directly attached to the MSA2000fc G2 were used in this test configuration (results cannot be expected with a single host). Results were achieved in Sequential Writes with 256K blocks; all random tests were based on 8K block sizes. NOTE: Number and type of applications, drive type and number of drives, operating system used, and the number of hosts will affect overall performance. |
| MSA2 G2 Performance |
MSA2300fc G2 |
MSA2000sa G2 |
MSA2000i G2 |
| Protocol (host connect) |
4 Gb Fibre Channel |
3 Gb SAS |
1GbE Ethernet |
| MSA2000 RAID 10 Performance Results |
| Random Reads IOPs |
22,874 |
21,861 |
13,658 |
| Random Writes IOPs |
15,008 |
14,491 |
14,399 |
| Random Mix IOPs 60/40 read/write |
17,878 |
17,700 |
13,493 |
| Sequential Reads MB |
1,238 |
1,050 |
274 |
| Sequential Writes MBs |
532 |
532 |
266 |
| MSA2000 RAID 5 Performance Results |
| Random Reads IOPs |
22,044 |
21,000 |
12,335 |
| Random Writes IOPs |
2,714 |
2,714 |
2,714 |
| Random Mix IOPs 60/40 read/write |
5,325 |
4,926 |
5,325 |
| Sequential Reads MBs |
1,238 |
1,050 |
274 |
| Sequential Writes MBs |
725 |
617 |
266 |
| MSA2000 RAID 6 Performance Results |
| Random Reads IOPs |
21,975 |
21,000 |
12,292 |
| Random Writes IOPs |
1,876 |
1,876 |
1,876 |
| Random Mix IOPs 60/40 read/write |
4,293 |
3,865 |
3,851 |
| Sequential Reads MBs |
1,238 |
1,050 |
274 |
| Sequential Writes MBs |
772 |
729 |
266 |
|
-
To date, no 4-way or 8-way TPC-H data warehouse benchmark result has been published for the six-core Xeon X7460 and no TPC-C or TPC-E OLTP benchmark result has been published for six-core Opteron. Usually, the absence of published results means the results are not competitive, in one manner or another.
TPC-C, E and H results were published for the previous generation quad-core Intel Xeon X7350. TPC-C and TPC-E results were published for the follow-on six-core Xeon X7460, but there are no 4-way or 8-way TPC-H results. Unisys did publish a 10TB TPC-H result their 16-way ES7000 with the Xeon 7460, but there is no simple way to compare this with 4-way or 8-way results at 100 or 300GB scale factors.
There are very impressive 4-way quad-core Opteron 8384 2.7GHz TPC-C and TPC-E results of 579,814 tpm-C and 635.43 tpsE respectively, but not for the six-core Opteron. For TPC-H, there is a series of 8-way results at scale factor 300GB for the quad-core and six-core Opteron processors, though curiously no 4-way results for the Opteron after dual-core 8220.
One suspected reason for the lack of 4-way or 8-way TPC-H results on the Intel Xeon X7460 is that it cannot achieve meaningful performance gains over the quad-core Xeon X7350. The large 16M L3 cache on the X7460 helps on high-call volume (>10,000 RPC/sec) benchmarks like TPC-C and TPC-E, but not in the high-row count TPC-H queries with parallel execution plans, where the lower 2.66GHz frequency is also a liability.
In the dual-core era, 4-way Intel Xeon (with the Pentium 4 based NetBurst core) and AMD Opteron systems were very close on TPC-H. It might be that the 4-way quad-core Opteron processors were not competitive with the Xeon 7300 series (Core2 architecture cores) so no TPC-H results were published. The quad-core Opteron was very competitive, significantly better even, than the Xeon 7350 without a large shared cache, in TPC-C and TPC-E. The Opteron architecture does not need as large a cache as the Xeon, but does benefit from the large 6M L3 cache in Shanghai compared with the 2M L3 cache in Barcelona.
TPC-H results were published for quad-core and six-core Opteron in the HP ProLiant DL785 8-way systems at scale factor 300GB. Previously, IBM had published an 8-way SF 300 TPC-H result for the Xeon X7350. The first 8-way Opteron quad-core had a better result, so it is possible that the Xeon bus architecture could not scale to 8-way for DW type workloads.
Significantly, the 8-way six-core Opteron 8389 2.8GHz shows a very significant TPC-H performance gain over the quad-core Opteron 8384 2.7GHz (91,558.2 QphH@300GB versus 57,684.7), more than would be suggested by the 50% increase in the number of cores and nominal frequency increase, as scaling is less than linear. Presumably, this should be attributed to micro-architecture improvements between Shanghai and Istanbul.
The most significant improvement cited by AMD is HT-assist, which is essentially a snoop filter for maintaining cache coherency in the Hyper-Transport architecture. Now ever since the Opteron with integrated memory controller and HT introduction, AMD crowed about how memory and inter-processor bandwidth scaled with the number of processors, unlike the Intel architecture, where memory and inter-processor bandwidth was bottlenecked by the front-side bus.
Well AMD neglected to mention that their scalable bandwidth was also offset by increased inter-processor communication to maintain cache-coherency (see the article by Johan de Dela on Anandtech http://it.anandtech.com/IT/showdoc.aspx?i=3571). So now that AMD has the snoop filter capability and a very good TPC-H result for the Opteron with HT-assist, why is there not any published TPC-C or TPC-E OLTP benchmark results?
Note that Intel had difficultly with Snoop Filter in the 5000P/X chipset. The snoop filter improved some benchmarks, and cause degradation in others. So it would be no surprise if it takes or two generations to work out the issues. The expectation is that AMD will need to work out these issues if Magny-Cours is expected to compete with Nehalem-EX systems.
For Intel, the lack of competitive Xeon 7400 series DW benchmark results will be a moot point once the next-generation Nehalem-EX systems becomes available.
Anyways, these are my suspicions. System vendors are welcome to refute any of my opinions by publishing results. I was supriseb by the 2-way Xeon 5500 Nehalem TPC-H results.
|
-
A recently published TPC-H benchmark result, along with previously published results provide some insight into to the relative impact of SSD to disk drives, and SSD versus system memory. The TPC-H configurations are shown below. All results are at SF 100. The first two are on Windows Server 2003 sp1, and SQL Server 2005 sp1 and sp2 respectively use processors based on the Intel Core 2 architecture and HDD storage. The two recent results are both on Windows Server 2008 and SQL Server 2008 sp1 using the Intel Xeon 5570 processors (Nehalem architecture) and Fusion-IO SSD storage.
| System |
OS |
SQL |
Processors |
Memory |
Storage |
| Dell PE2900 |
W2K3 sp1 |
2005 sp1 |
2 Xeon 5355 |
48GB |
HDD |
| HP DL580G5 |
W2K3 sp1 |
2005 sp2 |
4 Xeon 7350 |
128GB |
HDD |
| Dell T610 |
W2K8 |
2008 sp1 |
2 Xeon 5570 |
48GB |
SSD |
| HP DL380G6 |
W2K8 |
2008 sp1 |
2 Xeon 5570 |
144GB |
SSD |
The system configurations above are not the best to make this comparison. Ideally we would like to be able see the TPCH SF 100 result for Dell T610 with HDD storage. This would provide two systems identical except for the storage, and two similar systems that differ in system memory with the large memory configuration sufficient to place the entire database in the buffer cache. But we can make estimates for the expected differences between the two processor architectures based on available information.
First a quick word about TPC-H. Scale Factor 100 means the LINEITEM table data (leaf level) should be 100GB, or did when the table definition employed the 8 byte DATETIME data type. With the 4 byte DATE type employed in SQL Server 2008, the SF 100 LINEITEM table drops to 90GB. The full TPC-H SF100 database with all tables and indexes was approximately 170GB in SQL Server 2005, decreasing to 150GB with the DATE type. With compression, the database size probably fits in the buffer cache with SQL Server maximum memory set to 135GB out of the system memory of 144GB.
TPC-H is comprised of 22 scored queries, and has two test sequences. The Power sequence runs the 22 queries sequentially using a single stream, which is essentially a test of the database engine parallel execution plan capability. The Throughput sequences run multiple concurrent streams. All systems in this group use 5 streams in the throughput test at this SF. It would be helpful to see the Power sequence query time at degree of parallelism 1, 2, 4, 8 and so on to the number of physical and logical cores, but this is not part of the TPC-H. The top level results are shown below.
| Processor |
Memory |
QphH |
Power |
Throughput |
| 2 Xeon 5355 |
48GB |
15,723.9 |
20,587.9 |
12,009.1 |
| 4 Xeon 7350 |
128GB |
34,989.9 |
46,708.6 |
26,211.3 |
| 2 Xeon 5570 |
48GB |
28,772.9 |
38,550.7 |
21,475,1 |
| 2 Xeon 5570 |
144GB |
51,110.6 |
72,134.0 |
36,313.6 |
| transition |
Memory |
QphH |
Power |
Throughput |
| 2x 5355 to 2x 5570 |
48GB |
83% |
87% |
79% |
| 2x 5570 to 4x 7350 |
48-128 |
21.6% |
21% |
22% |
| 2x 5570 |
48-144 |
78% |
87% |
69% |
The T610 with 2 Xeon 5570 result is very impressive, being 83% higher than the 2-way Xeon 5355 quad-core 2.66GHz Core 2 architecture (8 physical cores on both systems) with other difference being the SSD storage. The T610 is just 21% lower than the 4-way Xeon 7350 system with 16 Core 2 architecture cores and 128GB memory. The typical TPC-H storage configuration with hard drives can deliver substantial IO bandwidth, with 4-8 PCI-E SAS controllers and 80-200 disk drives. The SSD configuration, however, can also deliver spectacularly more pseudo-random small block IOPS than 200 disk drives and at much lower latency.
The DL380G6 with 144GB memory is even more impressive scoring 78% higher in the composite QphH score, 87% in Power and 69 % higher in overall than the T610 with 48GB.
The chart below the run time for each of the 22 TPC-H queries in the Power sequence (stream 0) on the Dell T610 with Xeon 5570 processors and SSD storage relative to the Dell PowerEdge 2900 with Xeon 5355 processors and HDD storage.

We can use the SPEC CPU 2005 integer benchmark to make a rough estimate for the difference in the processor performance between the Core 2 and Nehalem architectures. The SPEC CPU 2005 integer for the Xeon 5355 (2.6GHz) is 19.4 and for the Xeon 5570 31.5, with a very large gain in 2 components, which means some of the gain should be de-rated due to the nature of SPEC CPU where the very latest compiler improvements are incorporated. The 60% difference in performance implies that the execution time of the faster processor should be approximate 0.625 times that of the baseline processor. Since the reduction in query time is visibly better than 0.625 for many queries, we can attribute this to the SSD.
The chart below shows the TPC-H Power sequence query times for the DL380G6 with 144GB memory relative to the T610 with 48GB. Five queries, 1, 9, 13, 15 and 16, show only minor differences with query 13 actually being 15% faster with less memory, and another two, 3, 14, show 20% gain for the extra memory. Nine queries show gains of 2.5X or more and another shows 2X gain. So overall, there is a significant benefit to having data in system memory, i.e., buffer cache, even with the extraordinarily high IOPS and low latency of SSD storage.

The reason for this is that the CPU cost for the operating system to perform an IO operation, regardless of whether this is to disk or SSD storage devices, and then to evict an existing page in the buffer cache is high relative to the cost of many common SQL operations. See IO Cost Structure (or my blog on SQLblog) for more on this. The duration or elapsed time on disks drives is much higher than SSD, but the CPU cost cannot be evaded.
While the published performance data is not ideal, needing a result for a 2-way Xeon 5570 with 48GB memory and HDD storage for SQL Server 2008, the available results are sufficient to indicate that SSD can significantly improve performance relative to HDD even in data warehouse type applications, and the system memory is still best for database performance.
The chart below shows the stream 0 (Power) runtime for the 16-core (4xXeon 7350 Core 2) 128GB and hard drives, the 8-core (Xeon 5570 Nehalem) with 48G and SSD, and the 8-core (X5570) with 144GB and SSD relative to the 8-core (Xeon 5355 Core 2) 48GB memory with hard drives. In query 3, the 16-core with 128G is actually 20% slower than the 8-core with 48GB. This goes to show parallel execution plans can be very tricky. Without thorough investigation of how each query behaves at degree of parallelism 1, 2, 4, 8, 16 and so on, nasty surprises can and do occur. Big-iron system can be very useful, but a simple migration from a 4-way system can result in minimal or negative performance change. With expert investigation and troubleshooting, very respectable performance gains can usually be achieved.

For the majority of the queries, the 16-core 128G memory system has double or more improvement relative to 8-cores 48GB memory. Doubling the number cores alone should yield 30-40 reduction in query time as the gain is entirely due to the increase degree of parallelism from 8 to 16, which is a difficult aspect to scale. So it is apparent the additional memory probably contributes to performance significantly.
Update 2009-10-11
The original HP 2xX5570 publication has been withdrawn. Three results were later published, all with 2 x Xeon 5570 processors, 144GB memory. One is with the 4 Fusion IO drives, another uses 12 SATA SSD drives spread across 3 SAS controllers, and the third with 100 SAS hard disk drives.
| Processor |
Memory |
QphH |
Power |
Throughput |
| 100 HDD |
144GB |
50,738.4 |
67,712.9 |
38,019.1 |
| 12 SATA SSD |
144GB |
51,422.4 |
70,048.5 |
37,749.1 |
| 4 FusionIO SSD |
144GB |
51,085.6 |
72,110.5 |
36,190.8 |
The chart below shows the TPC-H power results on the 22 individual queries for the SATA SSD and FusionIO SSD relative to the SAS HDD. Most queries are either indentical or very close. The SSD is much faster for certain queries, but slower in others.

The two queries where SSD is much faster are 2 and 12. The execution plan for these at SF10 show nested loops joins to the medium tables ORDERS and PARTSUPP (I do not have the execution plan for SF 100 handy). HDD is faster on Q9, which shows scans and hash match joins between several tables.
|
-
Earlier, JK blogged and ranted about buying bigger hardware.
Let me assure people that talent for highly efficient and scalable (over many processor cores) was drained long ago, and not just from the recent emergence of multi-core processors. For that matter, I am not sure it was ever prevalent because developers built code on small single socket (and single core in the old days) boxes without consideration for the issues that only occur in SMP or NUMA systems. Of course, it is necessary a bad thing that a person of modest skills can build an application, so long as it is rebuilt before moving to the big time.
But we should look at the flip side, from the point of view of both the software vendor (who pays for developers to write code) and the CIO (of the company that pays the sys admin and DBA to maintain the production servers). The high level people responsible for running a business or department are almost never (exception for Bill G and a few others) proficient coders. To rise in business or administration, over the course of time, they learn to assess value on different metrics. The simplest metric is money, and related, head count.
Let’s start with the ISV that sells software that has high (perceived) value. Suppose the probable full deployment cost might be on the order of $10M (£, € or ¥). The software vendor might want to target a price of $1M for the software license, plus 10% per year for support. Now if this software were super efficient it might run on a 2 socket database server costing $10K each, (two servers in cluster + 2 for the DR cluster totaling $40K) plus storage, and a $4K web server (again, 2 for redundancy, plus 2 more for DR).
The CIO will then ask: why am I paying $1M for software that runs on $10K database server and a $4K web server. Now suppose if this software were grossly inefficient that it would require an 8-way database server costing $200K each (a total of 4 for a 2-node cluster + another pair for DR) and 10 web servers (plus another 10 for DR).
Well then, now it seems perfectly reasonable that the software price is $1M. If the application still did not run well, then it certainly justifies having lots of consultants on permanent assignment to make sure it works.
Now let’s look at this from the CIO point of view. Suppose your company has two or more enterprise wide database applications, each administered by a separate DBA. You being a super proficient SQL Server performance expert, enable your database to run on the 2-way quad-core system. The other guy, has no such skills, proposed and deployed an Oracle RAC solution spanning eight 4-socket quad-core servers (The intent here is not to pick on Oracle, Larry is exceptional at making money and has a really nice yacht to boot). Who will know that with proper tuning, it would have also run on a single 2-socket system (this could still be a 2 node RAC for redundancy).
When it comes time for the annual review, which DBA will rate higher, the one that maintains a simple $200K (hardware + database licensing) environment or the one that handles the complex $2M environment? The CIO and HR know that the other DBA has a complex environment because of how long it took to get set up and all the very expensive consultants that had to be hired to do it. You did yours without expensive outside help.
From the CIO and HR point of view, what is an appropriate pay scale for each DBA? Do either of the CIO or HR people have the technical knowledge to put a value on what you did to save the company money? Or are they using other metrics?
If you think all this is highly irrational, take a long look at how things work in your organization and comment.
Anyways, below is my latest hardware list for consideration and comparison.
2-socket systems
|
Vendor |
Dell |
Dell |
Dell |
HP |
HP |
|
Model |
PowerEdge 2900 |
PowerEdge T610 |
PowerEdge T710 |
ProLiant DL370G6 |
ProLiant DL385G6 |
|
CPU Series |
Xeon 5400 |
Xeon 5500 |
Xeon 5500 |
Xeon 5500 |
Opteron 2400 |
|
Architecture |
Core 2 |
Nehalem |
Nehalem |
Nehalem |
Istanbul |
|
Cores/Socket |
4 |
4 |
4 |
4 |
6 |
|
Hyper-Thread |
No |
Yes |
Yes |
Yes |
No |
|
DIMM sockets |
12 |
12 |
18 |
18 |
16 |
|
IOH |
5000P? |
5520 |
2 x 5520? |
2x5520 |
|
|
PCI-E |
Gen 1 |
Gen 2 |
Gen 2 |
Gen 2 |
? |
|
x16 |
|
|
1 |
2 |
|
|
x8 |
1 |
2 |
4 |
1+1 (NIC) |
2 |
|
x4 |
3 |
3+1 |
1+1 |
6 |
4 |
|
PCI-X |
2 |
|
|
|
|
|
Int. HDD |
8+2 |
8 |
8/16 |
6+6+2 |
6/16 |
|
|
|
|
|
|
|
|
Configuration |
|
|
|
|
|
|
Price |
$4,537 |
$5,546 |
$5,417 |
$8,809 |
$6,858 |
|
CPU |
2xE5440 |
2xX5550 |
2xX5550 |
2 X5550 |
2x2435 |
|
Memory |
12x4 GB |
12x4 GB |
12x4 GB |
12x4GB |
12x4GB |
4-socket systems (and one 8)
|
Vendor |
Dell |
Dell |
HP |
HP |
HP |
|
Model |
PowerEdge R900 |
PowerEdge R905 |
ProLiant DL580G5 |
ProLiant DL585G6 |
ProLiant DL785G6 |
|
CPU Series |
Xeon 7400 |
Opteron 8300 |
Xeon 7400 |
Opteron 8400 |
Opteron 8400 |
|
Architecture |
Dunnington |
Istanbul |
Dunnington |
Istanbul |
Istanbul |
|
Cores/Socket |
6 |
6 |
6 |
4 |
6 |
|
Hyper-Thread |
No |
No |
No |
No |
No |
|
DIMM sockets |
32 |
32 |
32 |
32 |
64 |
|
IOH |
7300 |
|
|
|
|
|
PCI-E |
Gen 1 |
|
|
|
|
|
x16 |
|
|
|
|
3 |
|
x8 |
4 (2x2) |
2 |
|
3 |
3 |
|
x4 |
3 |
5 |
|
4 |
5 |
|
PCI-X |
2 |
|
|
2 |
|
|
|
|
|
|
|
|
|
Configuration |
|
|
|
|
|
|
Price |
$20,236 |
$16,437 |
$26,268 |
$23,570 |
$57,285 |
|
CPU |
4xX7460 |
4x8435 2.6GHz |
4xX7460 |
4x8439 |
8x8439 2.8GHz |
|
Memory |
32x4 GB |
32x4GB |
32x4GB |
32x4GB |
64x4GB |
big iron
|
Vendor |
HP |
Unisys |
Unisys |
NEC |
|
|
Model |
ProLiant DL785G6 |
ES7000 7600R |
ES7000 7600R |
Express5800
A1160 |
|
|
System Price |
$48,997 |
$66,729 |
$135,003 |
$145,596 |
|
|
Memory Price |
+8,288 |
+19,136 |
$46,376 |
$50,344 |
|
|
CPU |
8x8439 2.8GHz |
8xX7460 |
16 x X7460 |
16 x X7460 |
|
|
Memory |
64x4 (256GB) |
256GB |
512GB |
512GB |
|
2-socket systems
The Dell PowerEdge 2900 is for comparison with the previous generation. Dell came out first in tower chassis with the T610, which did not seem to be the replacement for the 2900, especially considering that there was a R710 in 2U. Just recently, Dell released the T410 and T710 filling out the Xeon 5500 series 2-socket tower chassis lineup. There is only a very small price difference between the T610 and T710 at the base model. When configured with dual power supplies, then T710 is actually slightly less expensive than the T610.
Earlier, I said I liked the ProLiant ML/DL370G6 because it implemented two 5520 IOH devices for 72 PCI-E gen 2 lanes. But I did not like the 2 x16 slots because database servers cannot really use these extra-wide slots. A combination like 7 x8 + 4 x4 slots would have been better. Also, using one of the x8 slots for the 4 included GbE ports is a waste. This should have occupied a x4 slot, or better yet, just implement the pair of GbE ports on the ICH, which attaches off the ESI port instead squandering a valuable PCI-E slot. We can then use the PCI-E slots for our choice of IO.
The T710 appears to have 56 PCI-E lanes configured (1x16+4x8+2x4) but there is no documentation that actually says the T710 implements two of the 5520 IOH devices. Still, 5 wide slots (1x16 + 4x8) are better than 2 on the T610, but 6 x8 would have been better. Let the workstation people have the x16 slots. A pair of available x4 slots would be nice too (one is used by the internal storage controller).
Finally there is the ProLiant DL385G6, which supports the new six–core Istanbul. This makes the DL385G6 a really powerful web server, but I would prefer more IO bandwidth for databases. Also, I do not know if the PCI-E slots are Gen1 or Gen2.
The Dell R805 now also supports the six-core Opteron 2400 series. Price with 2 x 2435 (2.6GHz) and 32GB memory is $3734 (probably another $460 to bring it to the 48GB reference used above, because the 2900 has 12 DIMM slots).
4-socket systems
The 4-way landscape will change when the Nehalem-EX (Intel Xeon 7500?) systems come out. For now, we have Intel Xeon 7400 series and AMD Opteron 8400, both at 6 cores. Previously, I criticized Intel for being myopic in obsessive focus on the 4-way platform, i.e., not making the 6-core Dunnington available in 2-way systems. In earlier generations, the 4-way processor had nothing special over the 2-way, and was usually 1 year behind. Anyways, AMD is first to reach 6-core in 2-way. Hopefully, with the powerful 8-core Nehalem-EX, there will be 2-socket systems. For a long time, Microsoft has recommended 4-way as the default choice for database servers. I think this should now be a 2-way, once we get to 6-core or more.
This system should handle most loads. And even if it turns out that a larger system is needed, the 2-way didn’t cost much and can always be used for other purposes. When consolidating small databases, a few 2-way systems is more flexible than one big system.
Big Iron
The big iron systems are shown in comparison the 8-way ProLiant 785. Notice that there is only a slight price premium going from two 4-way Opteron systems to one 8-way system ($47K and $57K respectively). It used to the premium was much larger. The price on Unisys and NEC systems appear to be about $33-37K for each 4 socket node. It used to be that a 4-socket node in the big-iron systems was around $80-90K. Right now we cannot really use the full power of the 16-socket system with the six-core Xeon 7400, because Windows Server 2008 can only support 64 cores. Soon R2 will out, and we can see how SQL Server scales.
However, I really think we need the better QPI interconnect technology of the Nehalem-EX to properly benefit from 128+ cores.
For a long time AMD crowed about how the Opteron HT interconnect and memory bandwidth scaled with the number of sockets, while Intel was constrained by the FSB. Yet the largest Opteron system was the 8-socket from HP and Sun. Even though the ProLiant 785 was very impressive on TPC-H, yet strangely there have been no published TPC-C or TPC-E results. Now with Istanbul, AMD mentions they have added HT-assist, similar in function to the snoop filter on Intel chipsets. Without this, there is excessive traffic to maintain cache coherency. This is not a simple matter and I expect it will take AMD an iteration or two to work out the bugs. Intel had similar difficulties with the snoop filter in their chipsets.
|
-
I have updated this at the URL below,
later I will update this blog as well
|
-
I have made updates to my tool for automating SQL Server query exec stats, index usage, execution plan analysis. So I would like people to give it a try and send feedback. If there is an error, please report the file and line number (the top level is sufficient).
download at
http://www.qdpma.com/tools/ExecStats.zip (check http://www.qdpma.com/ for updates)
documentations is now at
http://www.qdpma.com/SQLExecStats.html
As before, the objectives are to:
1. Simplfy the process of collecting key query tuning information, which also means one can keep a running record of query stats and execution plans over time (daily or weekly).
2. Automatically examine the XML execution plans for information that to be displayed with the query exec stats, and build a cross-reference of between indexes and execution plan references.
3. Store the information in a compact self contained file, ie, not a massive performance data warehouse. Hopefully the DBA makes an effort to learn how to do basic query tuning. Still a reasonably complete set of information (query stats and plans tbut not database data) can be sent for someone else to examine, including an (horribly) expensive consultant who may or may not have any meaningful expertise. (Not to be construed as a solicitation for services)
I will update the documentation as soon as I can. Below is the main screen
Click the button to the right of the databases box to bring up all databases.
This will collect table and index info on multiple databases. The SQLExecStats programs also builds a cross-reference of index usage by execution plan. If the instance has many databases with a large number of table, and indexes, I suggest deselecting the inactive databases. Also, if in multi-db mode, consider deselecting the Stats Hdr check box
One of things I want is to see space information for each database. This view show how much file space is allocated for data and logs separately, and how much is use, and not used in each. Note sp_spaceused shows space for data and logs combined
Below is the virtual file IO. I have also calculated the Average bytes per rd and write, and the average ms/Rd and Write. Of course, this is a point in time calculation, which would include database backups. Later I will do differential file IO
I will add more later
|
-
Dell recently published a TPC-H report for the PowerEdge T610, 2 x Xeon 5570, with 4 FusionIO 80GB SSD storage devices at 100GB scale factor. So why have we not seen TPC-C or TPC-E OLTP benchmark results published?
Now it is much more feasible to run the TPC-H data warehouse benchmark on SSD because the Scale Factor 100 size is still allowed, for which the Line item table is 100GB for data only, not indexes or other tables. The full SF 100 tpch database is about 170GB for all tables and indexes. Additional space is required for tempdb.
The TPC-C and TPC-E benchmarks require the database size to be scaled with performance target ranges. Consider the Fujitsu TPC-E published result for the Primergy RX300 S5 with 2 Xeon 5570. The dual-socket Xeon 5570 system scored 800 tps-E, for which the required initial database size is about 3TB. The space actually allocated for the data files is approximately 4.5TB, plus another 85GB for log space.
|
System |
Fujitsu Primergy RX300 S5 |
|
Processors |
2 x Intel Xeon X5570 |
|
Memory |
96GB |
|
RAID controllers |
5+1 |
|
Disk enclosures |
30 |
|
HDD |
360 (192 73GB 15K + 168 146GB 15K) |
|
Storage cost |
$148K + $49K for 3 year maintenance |
|
Raw capacity |
35GB |
|
RAID 10 Capacity |
18TB |
|
Estimated IOPS |
360*200 = 72K |
For the 360 15K disk drives, based on 200 IOPS per disk, the small block random IOPS capability of this storage system is 72K, excluding RAID 10 overhead. If the actual load is 10,000 IOPS (at the operating system) with a 50/50 read/write mix, then the raw IOPS to disk is 5K reads and 2x5K writes for a total of 15K IOPS to disk. So a 75K IOPS system can actually handle 50K IOPS at 50/50 read/write mix in RAID 10.
If we consider that the active database resides on only 15% of the disk space (3TB of 18TB after RAID 10 overhead), then there is some benefit from the short-stroke effect. If the average disk queue depth per disk were higher than 1, then command queuing capability would result in even high IOPS per disk. The actual IOPS per disk might be anywhere from 200-300 depending on whether the emphasis was on pure performance or balanced price/performance.
Below is a proposed SSD (+ HDD for archival space) configuration.
|
SSD + HDD configuration |
|
|
SSD Capacity |
4.5TB |
|
60-day space |
13TB |
|
SSD drives |
155 @ 32GB
62 @ 80GB |
|
Cost for Intel SSD
$520 for X25E, 32GB
$340 for X25M, 80GB |
$80K
$21K |
|
HDD drives for 60-day space |
20 x 1TB SATA $3200
or 42 x 450GB SAS |
In additional to the above, we need disk enclosures. Ideally I would like to place no more than 4-5 SSD devices on each x4 SAS port. A x4 3Gbps SAS port can support 1GB/s, but if a HBA/RAID controller with 2 x4 SAS ports is plugged into a x8 PCI-E gen 1 slot, we can only expect 1.6GB/s total (single direction) throughput. The Intel X25 SSDs are rated at 250MB/s sequential read, 170MB/s write for the E, and 70MB/s write for the M (all sequential).
The X25-E random 4K IO characteristics are 35K IOPS read, 3.3K write. In the absence of data, let assume the 8K random read is 15.7K IOPS (probably higher), so under 8K random IO, the bandwidth requirement is only 140MB/s or much less for read/write mixes.
The data sheet also say 8K 2:1 R/W 7K IOPS for the X25-E. No random IO data is listed for the X25-M in the datasheet, so it is not clear the X25-M can meet the TPC-C/E random IO requirements for mixed R/W.
A 1U enclosure with 2 SAS ports (daisy chained enclosures not expected) and 8-10 2.5in bays seem appropriate. The 2U enclosures with 24 bays should have 6 independent SAS ports. The 20 or so 3.5in SATA drives for the 60-day space requirement could be accommodated between the internal bays and 1 or 2 external enclosures. The next generation systems and components should be PCI-E gen 2 (5Gbps per lane) and 6Gbps SAS, but we expect higher SSD bandwidths as well.
So the SSD cost structure does seem to support the TPC-E benchmark. It would probably also support the TPC-C benchmark as well, based on the HP DL370G6 result for a
The main issue above is that I have not included RAID overhead. It is my m opinion that the SSD is not fundamentally a single component device, like a disk drive with a single motor. If the SSD were built with dual controllers, and chip-kill ECC on the NAND, then the SSD would be inherently single component failure tolerant. Of course, this is not the case yet. I am just looking forward to when we can do without RAID in SSD. I am not convinced RAID controllers are going to be able to keep up with an SSD arrays anyways.
Since the IOPS capability of the X25E with SLC NAND (not sure for the X25M with MLC NAND), RAID 5 with the higher small block random write overhead is not an issue. So 190 of 32GB or 77 of the 80GB SSDs in RAID 5 would be required.
I should briefly touch on expected performance benefits of SSD over HDD. The Dell TPC-H result did seem to indicate some benefits from SSD, even though there was not a otherwise similar HDD result to compare with. The TPC-H data warehouse queries may generate many table scans for which HDD is fine, there are still loop joins and key lookups, which generate pseudo-random IO. Several TPC-H queries also dump intermediate results to tempdb.
I am expecting TPC-C and E to show reasonable benefits from SSD over HDD. Consider the main TPC-C new order transaction. A typical TPC-C published result might show an average response time 0.3-0.4sec. This procedures processing an order for upto 15 items (average of 10?) which means one update of the Stock table for each item, one insert in to the Order Line table, and one insert to the New Order table, plus a few a others. Since the TPC-C database is very large, each of the above steps might require a disk IO. On a perfectly configured disk system (for OLTP), the average latency could be as low as 5ms even when the entire system drives 200K IOPS.
Still, if you look at the New Order procedure, it is clear each item must be processed serially. The SQL Server engine might use the Scatter-Gather IO API to consolidate IO calls from multiple concurrent users, but in each step in the new order is issued sequentially, after the previous step completes. Since there are over 20 steps, if each step take 5ms, then we can see why the average duration is well over 100ms.
With SSD, the IO latency should drop to 0.08 milli-sec (80us), meaning 20 steps should in the range of 2ms. Because there are fewer transactions "in-flight" at any given point in time, the expectation is that the SQL Server engine has less to keep track of.
Consider a system supporting 600,000 tpm-C. Thats 10,000 new order transactions per second. If each new order procedure averages 0.3sec, then there are 3,300 new order transactions in-flight at any point in time (plus others).
TPC-C also has performance/size scaling requirements. A 600K transactions per minute result requires approx 50,000 warehouses, each of which requires approx 84MB, for a database size of 4.2TB. The recent 600K tpm-C results required 1000+ disk drives (no RAID requirement) meaning the IOPS load is probably 200-300K, possibly a R/W mix close to 50/50.
Since Wes say the FusionIO 640GB devices are out, lets consider what kind of system would be required. The FusionIO is built with a PCI-E interface, that is, it plugs into the PCI-E slot directly, so it probably comes with its own driver. The second generation FusionIO matches up nicely with either PCI-E gen 1 x8 or PCI-E gen 2 x4 in terms of bandwidth.
For 4TB we need 7-8 of the 640GB drives. So, ideally a system should be configured with 9 PCI-E gen 2 x4 slots, plus with embedded devices (the extra slot or two is for additional network or SATA drives). The new Intel 5500 IOH has 36 PCI-E gen 2 lanes, plus the x4 gen 1 off the ESI. So a single IOH would support 9 x4 slots, plus GE and SAS off the ESI. The HP ML/DL370G6 actually uses 2 IOHs for a mix of x16, x8 and x4 slots.
Per Grumpy below, at this point in time, SSD devices have very different characteristics, particulary with regard to writes. Writes to NAND need to be in large blocks. Depending on how the SSD controller is implemented, expect some issues. So it may not be time yet to deploy transaction processing to SSD. DW might be worth considering. Still, we should see OLTP benchmarks plus accompaning details to better understand SSD characteristics. Where are Bashful, Doc, Dopey, Happy, Sleepy and Sneezy DBAs?
|
-
Everyone should know by now how really useful the DMVs dm_exec_query_stats, and dm_db_index_usage_stats and the associated DMFs for obtaining the SQL and XML plan: dm_exec_sql_text, dm_exec_query_plan, and dm_exec_text_query_plan. Of course it has been explained that dm_exec_query_stats is not a replacement for SQL Server Profiler and SQL Trace. The DMV is only reliable if execution plans are not frequently evicted from the procedure cache and if SQL is not frequently recompiled.
Still the DMV dm_exec_query_stats is popular because the results are available with a simple query. The complexity of setting up a trace, making sure it is not adversely impacting server performance, making sure space is available, and most of all, the effort of parsing the trace are all avoided.
So what do we do next? From the top query statistics we can start looking at the SQL and the execution plans. One minor annoyance is that if we use CROSS APPLY to dm_exec_query_plan, each XML plan has to be saved individually. This can be annoying if one has to remote into to a PC in the server room and this connection is slow. There is much information in the XML plan that would be helpful if we could see it in a grid along with the execution statistics.
From the index usage statistics, we can eliminate unused indexes. There might also be indexes that are infrequently used. Sometimes it is obvious that certain indexes can be consolidated. Other times, it is necessary to examine the execution plan to determine if another index is sufficient. The XML plan analysis cross-references in which plans each index is used. All of this is tedious work which can be automated. So this is the purpose of the SQLExecStats tool.
A few people have tried it on production systems. There are still bugs, and feature requests, but hopefully I can turn this into something useful over time.
lets use this url for the latest build rather than changing the url for each build
http://www.qdpma.com/tools/ExecStats.zip
The above build is now 20090628. The (poor excuse for) documention is also online http://www.qdpma.com/SQLExecStats.html
Feedback is welcome, in particular, send me the output files so I can try to figure out if this is collecting the right information. Support will be based on availability.
Note: SQL Server 2008 sp1 fix now correctly reports parallel execution plan total_worker_time (CPU). It is not fixed on SQL Server 2005 sp3, latest cu?
Limitations:
European Localization The XML plan is stored as a string, with numbers in US-en format. My program extracts plan cost and estimated row numbers from the XML plan. The previous version did not specify any localization, but I think the default is the host environment. Hence the code double.Format(value) is interpreted as whatever the local environment is instead of US-en. The builds since 20090623 specifies the US-en format in most places.
If some one with European number formats could run this program and send me the output, I would appreciate it.
Bug fixes, features etc
2009-06-22: The password now should display *, I don't do security work, and it shows. Some formatting was changed to right align, comma thousands separator, Very large numbers in Excel cannot be changed, 4 bytes integers are represented as numbers, larger numbers are represented as text. I will do more formatting changes later
2009-06-28
The error described by Zen occurs with case sensitive collations. Apparently one SQL query I issued from the executable was not a case correct with the columns I declared on the receiving side, and this is sufficient to cause the data table load to fail.
The current build is really an intermediate build. I am adding the ability to collect table and index information for multiple databases and not everything is consistent yet.
Be careful on trying to collect for too many databases, and too many execution plans, there are checks for over size yet.
2009-07-09
I am still in the middle of changing the tool to have the option of collecting table and index info from all databases. So some of the UI is not entirely consistent. However, all bug fixes are in the new version.
Previously there were SQL queries that were not case correct with client-side table definitions, and when the default collation is case sensitive, an error occurred. This should be corrected.
Also, previously I issued DBCC SHOW_STATISTICS([schema.table],[stat])
which can generate an error if there is a hyphen '-' in the table or schema. I changed this to
('schema.table',[stat]) I am not sure if this is the most safe format, should it be: ('schema.table','stat')? anyways, go to the main url, http://www.qdpma.com/
|
-
For several months, we have seen ads for the joint HP/Oracle RAC and Exadata storage combination talking about extreme performance (10X faster) for large data warehouses. One thing I like about Oracle is that they have courage to pursue technology with deep hardware design implications, even if it takes several iterations to iron out the major issues. I just got around to looking through the Oracle papers on this. Like OPS/RAC, the Exadata technology has implications for how hardware is built. Hardware vendors can be squeamish on designing silicon for special requirements if there is not a viable installed base. This leads to the chicken and egg, which comes first situation that many other vendors cannot successfully initiate. Oracle dares to do this, and amazingly get customers to shell out big chunks of money on the first iteration. This in turn provides justification for gutless risk averse hardware vendors to do their part.
I will start by saying that the SAN systems out there today are designed for transaction processing, not data warehousing. Most SAN systems are designed with a certain number of FC ports, with the intent of supporting 1-4 disk enclosures (typically 15 disk drives each) per FC port. The SAN controller (or service processor) is a significant portion of the overall cost. Configuring one enclosure per FC port leads to a higher amortized cost per disk than having multiple enclosures per FC port, but sequential bandwidth is still limited by the number of FC ports. Depending on the architecture, it is possible to sustain between 330MB/sec (loop architecture) and 390MB/sec (star architecture) per 4Gbit/sec FC port. So it can require 3 FC ports and 45 disks to support 1GB/sec, even though each individual 15K disk drive can sustain 125-160MB/sec. The amortized cost of a SAN system might be $2,000 per disk, so each 1GB/sec through-put costs around $90K.
This is why I have advocated direct attach storage for data warehouses, where each 15 disk enclosure can sustain 800-1000MB/sec at an amortized cost of about $500 per disk. But most people do not like inexpensive high-performance storage solutions. And none of the expensive SAN systems provide sufficient bandwidth for really high-end data warehouse systems.
In the Exadata system, the interconnect between host and storage is InfiniBand, which signals at 5Gbit/s using a x4 wide connector (like SAS) for a net bandwidth (after 8B/10B encoding) of 16Gbit/s or 2GB/sec. The Exadata Storage Server (or cell) is an HP DL180G5 with 2 Xeon E5430 quad-core 2.66GHz processors, 8GB memory, a P400 RAID controller, 12 450GB 15K SAS or 1TB 7200 RPM SATA disks and dual-port Infiniband HCA. Curiously, of the 5.4TB raw storage with 12 450GB drives, only 1.5TB is available. With RAID 10 overhead, there is 2.7TB. Some space is required for internal use, but 1.2TB seems to be rather large.
A complete pre-configured HP/Oracle Database Machine full rack comprises 8 HP DL360 servers with two Xeon E5430 quad-core processors, and 32GB memory for running Oracle RAC, 14 Exadata storage cells, 4 InfiniBand switches (and 1 Gigabit Ethernet switch for auxiliary communications). A half-rack has half of the above components. Each storage cell is listed as supporting 1,000MB/sec with SAS drives and 750MB/sec with SATA drives. The listed bandwidth for the pre-configured full-rack is 14GB/sec. It is stated that Exadata bandwidth scales linearly with the number of racks, but without explicit performance numbers.
Compare the Exadata cell with an EMC CLARiiON CX4-960 mid-range SAN. The CX4-960 comprises 2 SPs, each with two quad-core processors, 16GB memory per SP, for which the minimum meaningful configuration is 16 disk enclosures (240 disks) over 16 FC ports. So the resource allocation per SP is 2 quad-core processors, 16GB memory and 120 disk, with probable sequential bandwidth of 3GB/sec (380MB/sec per FC port). The Exadata cell provides approximately the same compute power, 8GB memory, for 12 disks targeting 1GB/sec sequential bandwidth.
The purpose of the massive computer power per disk in the Exadata cell relative to a standard SAN is to offload compute functions from the main database engine. Concentrating capability, be it compute power or IO bandwidth, in a single system is always difficult, so distributing work can be useful if it can be done effectively. One candidate is compression. (SQL Server 2008 can store tables and indexes with row or page level compression, which take CPU resources. Oracle probably has comparable capability as well.) [Exadata is for Oracle systems only?] Offloading this to the storage element might be desirable. Finally, the Exadata cell is not just a storage engine, but can also handle database protocols.
In addition to a command such as fetch this block, and decompress, the Exadata can also handle SELECT * FROM Table WHERE col = ‘SARG’ (Smart Scan Offload Processing). In a data warehouse, that expectation is that queries are ad-hoc, for which indexes have not been built. So a data warehouse must be able to power through very large table scans. This requires both IO bandwidth and CPU resources, as a database table scan is not a simple IO operation (see my other posts on this matter).
The very recent TPC-H report for the HP BladeSystem of June 3, 2009 uses the Oracle Exadata Storage Servers (more on this below) and has price information. The full cost for 6 Exadata storage cells and supporting components is $536,516. (Of this, each Exadata Storage Server is $24,000, and the cost of the Exadata software is $360,000. Interesting, the price of a similarly configure DL180G5 is $14,000.) Three-year support is another $479,846 for a total 3-year cost of approximately $1M. It is unclear how much of the discount applies to the Exadata versus the Oracle database software. The amortized list price per cell is $166K or 14K per disk. That Oracle sells this clearly sends the message that people do not look at cheap hardware for the data center.
Now when HP published a 1000GB TPC-H report for a HP Superdome with 32 Itanium 2 9140 sockets, and 64-cores on April 29, 2009, I was very puzzled. What was the purpose of this publication? HP had already published a 10TB result on a Superdome with 64 sockets (128 cores) of the same Itanium 2 9140 processors back in March 2008 with no real disparity (acknowledging that results at different sizes are not directly comparable). Then about one month later, HP published the 64-node Oracle RAC with Exadata storage result of June 2009 (note that the RAC servers are BL460 with 2 Xeon quad-core 3GHz processors, different from the pre-configured database machine).
|
System |
SuperDome |
BL460 Cluster |
|
Database |
Oracle 11g + Partitioning |
Oracle 11gR2, RAC |
|
QphH@1000GB |
123,323 |
1,166,976 |
|
TPC-H Power |
118,577 |
782,608 |
|
TPC-H Throughput |
128,259 |
1,740,122 |
|
Total System Cost |
$2,532,527 |
$6,320,001 |
|
Processors |
32 Itanium 9140 1.6GHz |
128 X5450 3GHz |
|
Cores |
64 |
512 |
|
Memory |
384GB |
2080GB |
|
Disks |
768 |
6 Exadata Cells |
|
HBA |
64 Dual Port FC 4Gb/s |
64 x 2 Infini-band |
Was the intent that people would see the Oracle RAC with Exadata storage and draw conclusions based on the approximately 10X difference in performance with another recent 1000GB result?
What conclusions can we draw from each of the two results? The first matter to understand is that the TPC-H scale factor 1000GB means the LineItem table, data only, is approximately 1000GB in size. The full database with all tables and indexes is approximately 1700GB. So the entire database fits in the 512-core RAC system with 2080GB and not in the 64 core system with 384GB.
Next the 64 core Itanium system has 64 dual-port FC adapters, meaning 128 x 4Gbit/sec ports, which could support 42GB/sec based on 330MB/sec per 4Gbit/s FC port. But it is unlikely that 768 disk drives can sustain this volume (55MB/sec per disk) in a SAN. It is also interesting that system was configured with the EVA 4400 while other HP SuperDome Unix results employ the MSA1000 storage. (It is nice to have 32 EVA SAN systems or 256 MSA 1000s available for performance testing).
Note that TPC-H query 1 (a table scan of most of the LineItem table) takes 169.8 seconds on the 64 core Itanium, and 10.3 seconds on the 512 core RAC system. This means that if the data had to be read from disk, then the disk system would have to support 6GB/sec on the 64 core and 97MB/sec on the 512 core system. The 64 core Itanium system definitely has to read data from disk and the configured disk can easily support 6GB/sec (only 8MB/sec per disk) while the 512 core system has 6 Exadata storage systems which can support exactly 6GB/sec per specifications, nowhere near 100MB/sec. But all the data fits in memory so disk reads for data does not occur given the TPC-H sequence where the test run occurs after the database load and index build.
The data load time on the 64-core Itanium was 1:07:12 and 2:22:57 on the 512-core RAC, which also may indicate relative storage performance
Another point that can be noted is that there is an 8X difference in cores for a 9.5X difference in performance. However, the Core 2 architecture Xeon 3.0GHz cores (45nm) are much more powerful than the Itanium 2 1.6GHz cores (90nm). The gain in Power is 6.6X and 13.6X in through-put. TPC-H scored is based on a geometric mean of 22 queries, some of which are small and other large. The geometric mean has the effect that a 2X speedup in a small query has the small benefit as 2X in a large query. The issue is the getting 10X gain in a small query is very difficult so scaling on Power is attenuated.
In summary, the two performance reports are definitely sufficient to assert that Oracle RAC can scale, but having single node and 8 node performance reports for the BL460 would be confirm this. The two TPC-H reports say nothing of Exadata storage system performance, either in the sustainable sequential bandwidth or the value of the Smart Scan Offload processing. If the 64 node, 2080GB memory Exadata storage result had been reported at 10TB, then we might have an idea of its capabilities. Based on the 100GB/sec table scan estimate above, it would require 100 Exadata cells, which might beyond its actual scaling capabilities.
The performance data cited in Oracle’s Exadata whitepaper lack details to attribute the source of the performance gain. Given that most SAN systems are horribly configured for Data Warehouse performance, it is quite probable that dropping in the preconfigure full rack Exadata with 14GB/sec sustained table scans can easily generate the quoted numbers.
Notes on the Exadata Storage Server
At the time this product came out, the choice of the DL180G5 was reasonable. However, this system, based on the Intel 5100 chipset, has 1 x8 and 2 x4 PCI-E Gen 1 slots.
I am guessing that the Infini-band dual port HCA occupies the PCI-E x8 slot, the P400 RAID controller occupies one of the x4 slots and that the 12 internal disks are connected to one of two x4 SAS ports on the P400.
Technically, the each Infiniband DDR (5Gbit/s) x4 channel is 20Gbit/s, which after 8B/10B encoding is 16Gbit/sec (2GBytes/s) which could fully consume a x8 PCI-E Gen 1 channel. There are 2 IB channels for path redundancy, not bandwidth aggregation, as the x8 PCI-E bandwidth is limited to 2GB/sec.
Since the disk drives are probably on a single x4 SAS port, this bandwidth is limited to 1GB/sec, even though each 15K disk drive can do 160MB/sec (not accounting for RAID 10 implications). So while there is 4GB/s combined bandwidth on the Infiniband links, only 2GB/sec can be sent over the PCI-E port to the IB HCA, and only 1GB/s can be drive from the RAID controller.
Now that the DL180G6 is available, with 1 x16 and 2 x8 PCI-E Gen 2 channels, I would retain the IB dual port x4 (if it supports PCI-E gen 2), go to the new P410 RAID Controller, the 25-bay SFF drive bay (assuming the drive bays can be split across two x4 SAS channels), and 24 146GB 15K or 300GB 10K SFF drives (There is no point offering a 250GB SATA option). This unit should support 2GB/sec.
I might even be tempted to bug HP to split the drive bays 4 channels, 6 bays per channel, x4 SAS on each channel. There would be 2 P410 controllers, each driving 2 channels. The 6 disks in each channel might not drive 1GB/s, but 4 channels might support 3GB/s? I priced this around $19K.
To bad we cannot have a generic Infiniband SAN, and skip the Exadata software ($5K licensing per disk?).
|
-
Now that I have actually kind of read this paper, I will comment on it. I say kind of because I cannot focus on non-technical matters, the same way I get really frustrated trying to explain something to a person that just cannot understand the difference between an argument substantiated by hard analysis and an argument that seems logical but is not the predominant underlying effect.
First, Microsoft competes with Oracle, and MS gets hit with this all the time so of course they need to collect in one place their best arguments. On a side note, many Microsoft whitepapers list the author(s). When an artist paints a masterpiece, he/she will put a signature on it because he takes pride in his work. Not all talented artists can paint what they want and must accept commercial work (having food to eat is not overrated?). In such cases, he does not want his peers to know that he had to stoop to do such work and prefers it be anonymous. There is no listed author for this paper.
The first argument in the paper is valid. Only a small percentage of customers have actually deployed RAC. Back in the OPS days, one prominent expert said that he had never seen for himself an OPS deployment that actually achieved positive scaling, but he did hear from someone he considered competent on this matter that one customer did. RAC is much better than OPS (to get something right, you do have to screw it up once or twice or trice). The Oracle people I talk (more hurling insults/ridicule back and forth than talking actually) with do say that the technical skills to deploy RAC are not common among Oracle DBAs.
But the fact that Oracle RAC is not really required by most people does not stop sales and marketing from making a big deal out of it. Funny how CIOs are influenced by the scaling argument (men are most susceptible the size thing) even though it is of no consequence to their specific environment.
Of course RAC is expensive. The alternative is buying big iron which is also expensive. If a solution is not really painfully expensive, is it any good? If one project manager turns what should have been a $1M project into a $10M project and another manager does his project for $1M, who will know it should have been a $1M project? Who will get the higher job ranking come evaluation time? That is: the next promotion and big raise?
When a project is so hideously expensive, the CIO must go to the CEO, who in turn may need to go the board for approval. Well if this project becomes really messed up, the big bosses are not inclined to declare it a failure because they will look bad as well, that is, they had endorsed the project. When a small project runs into difficulty, no matter who is at fault, it is not hard for the CEO/CIO to pin the blame on a lowly project manager, i.e., fire him.
If what you need is SQL Server cluster style fail-over redundancy, then the expense of RAC licensing does not make sense. So RAC only makes sense if scaling performance is needed. The paper talks about scale-out OLTP [by which I mean with distributed partitioned views]. Scale out on any DBMS is not simple and not just for the reasons described. If you understand in detail how the [SQL Server] cost base optimizer (CBO) works with respect to local and remote data, you will understand the horrible implications. [In the execution plan with remote data, the row count estimates are mostly 1,000 or 10,000 rows. When two sources of 1,000 rows each are joined, the output is 1M rows. When this is joined to a large local table, the plan is frequently a table scan.][Technically, RAC scales out on hardware, but does not have a DPV database design.]
If one wanted to point out an issue with Oracle RAC for OLTP, the most apparent is the near complete absence of published benchmark results. There is one RAC TPC-C (there are no Oracle TPC-E period) and it was done long ago (12/2003). The TPC-C RAC pub was 16 HP Integrity rx5670 each with 4 Itanium 2 1.5GHz for a total of 64 cores, a score of 1,184,893tpm-C at total cost $6.5M, versus the contemporary result for a HP Superdome, same 64 Itanium 2 processors, scored 1,008,144 tpm-C at $8.4M cost. The RAC had 17% better performance at 22% lower cost. Looking closely at the price detail, the cost of memory for the rx5670 was $1.4M versus $5M on the Superdome. It is a little difficult to compare pricing because the Superdome discount was about 45% versus 25% for the clustered rx5670. The major software licensing difference was $640K for RAC and $320K for partitioning. So a big chunk of the price advantage is because of a memory pricing anomaly.
|
Nodes |
1 |
16 |
|
System |
Superdome |
rx5670 |
|
Database |
Oracle 10g |
Oracle 10g+RAC+Part |
|
Report Date |
11/04/2003 |
12/08/2003 |
|
Tpm-C |
1,008,144 |
1,184,893 |
|
Total System Cost |
$8,397,262 |
$6,541,770 |
|
Price/Performance |
$8.33 per tpm-C |
$5.52/tpm-C |
|
Processors |
64 Itanium 2 1.5GHz 6M |
16x4 Itanium 2 1.5GHz 6M |
|
Memory |
1024GB |
768GB (16x48) |
|
Disks |
2100+120 |
672+1344+224 |
|
HBA |
28 FC 2Gb/s |
64 x FC |
|
Costs |
|
|
|
Processors |
$1,280,000
($40K ea) |
$528,000
($8.25K ea) |
|
Memory |
$4,998,400
$39K for 8GB |
$1,440,000
$7500 for 4GB |
|
Server Subtotal |
$7,085,433 |
|
|
Storage |
$5,032,188 |
|
|
Server+Storage |
|
$4,694,618 |
|
Oracle 10g |
$1,280,000 |
$1,280,000 |
|
RAC+Partitioning |
|
$960,000 |
|
Discounts |
$7,000,000 |
$1,900,000 |
Since then, there have been no Oracle RAC TPC-C publications. This usually means there is no good news. If MS wants to criticize RAC for OLTP, I am ok with it. In other posts, I argued that going forward, the new Intel QPI and existing AMD HT interconnects should allow building big-iron (scale-up) systems with better scaling than RAC, on account of the higher bandwidth and lower latency that can be achieved versus RAC going over Infiniband. This is a theoretical argument that needs actual measurements to assess its validity.
On the Data Warehouse side, there are many Oracle RAC TPC-H publications. From the results, I think RAC has decent scalability, and I am really happy that I can bring a better balance of processor power, memory and storage than I can with big iron, or rather, a max'ed out server system (i.e., expensive big capacity DIMMs, and high priced storage). So apparently my arguments above on interconnect bandwidth and latency are not as important on DW. MS does mention they will soon have their own MPP solution, so that good, because I am too old to learn Oracle, and I am relatively happy (meaning I am bitching a lot) doing big SQL Server projects.
I will expand on what I mean by balance. Lets compare 2 recent TPC-H 1000GB results. A 64-node, BL460c with 2 QC Xeon 5450 3.0GHz, 32GB memory per node, (128 sockets, 512 cores, 2TB memory) scored 1,166,976 QphH at total cost $6.3M compared to a 32-socket, 64 core Superdome, 384GB memory score of 123,323 QphH at cost $2.5M. Put aside for this discussion the fact that Itanium is still dual core on a 90nm process while Xeon is quad core on 45nm. For 8X the number of cores, the performance gain is 9.5X. Keeping on mind that the Xeon 5450 core is about 50% faster than the Itanium 1.6GHz based on SPEC CPU int 2006, this is about right. Also the RAC system used the Exadata storage which offloads some processing, but probably not too much.
The cost breakdown for the Itanium is $736K for processors, $800K for memory, $1.3M for storage (EVA SAN), $470K for Oracle (+54% discount on all). In the RAC-Xeon system, it is $160K for processors, $85K for memory, $200K for Infiniband, $540K for storage, $6M for Oracle (30% discount on Oracle only). OK, this is not what I was really getting at. If you max out a system, it means buying the 8GB DIMMs which cost 4X more than the 4GB DIMMs but does not contribute proportionately higher performance. The big gain in the RAC system is that its possible to configure enough memory to fit the database and make do with a less powerful storage system. This argument goes out the window if the RAC memory cannot encompass the entire DB plus working space. Still, it is good to have this avenue.
|
System |
Superdome |
BL460 Cluster |
|
Database |
Oracle 11g + Partitioning |
Oracle 11gR2, RAC |
|
QphH@1000GB |
123,323 |
1,166,976 |
|
TPC-H Power |
118,577 |
782,608 |
|
TPC-H Throughput |
128,259 |
1,740,122 |
|
Total System Cost |
$2,532,527 |
$6,320,001 |
|
Processors |
32 Itanium 9140 1.6GHz |
128 X5450 3GHz |
|
Cores |
64 |
512 |
|
Memory |
384GB |
2080GB |
|
Disks |
768 |
128+6x12 |
|
HBA |
64 Dual Port FC 4Gb/s |
64 x 2 Infiniband |
|
Costs |
|
|
|
Processors |
$736,000
($23K ea) |
$159,872
($1.2K ea) |
|
Memory |
$850,000
$35,611 per 16GB |
$92,000
$339 per 8GB |
|
HBA |
$233,680 ($3.5K ea) |
$199,000 (HCA+switch) |
|
Server Subtotal |
$2,397,614 |
$696,419 |
|
Storage |
$1,390,861 |
$536,516 |
|
Software |
$624,245 |
$6,425,600 |
|
List Price (ex maint.) |
$4,412,720 |
$ |
|
Discounts |
$2,210,000 |
$1,965,000 |
|
Maintenance (net) |
$330,067 |
$626,054 |
|
-
Earlier I talked about the first TPC-C and TPC-E results for 2-way Nehalem, ie, the Intel Xeon 5500 series. The results were spectacular relative to the previous generation Xeon 5400 series, (2.5X gain on the Intel slide deck for database OLTP) and were pretty much hitting the same range as 4-way Xeon 7460.
I pointed out that while these were legitimate results, the TPC-C and TPC-E benchmarks generate high call volume, about 1000 RPC stored procedure calls per second per core. Meaning each call averages around 1 CPU-ms. This type of usage benefits from the Intel Hyper-Threading feature. It was around 10-20% back in the NetBurst days. I am inclined to think it is now much larger with Nehalem, possibly 30-40%. An application like TPC-H would not benefit from HT. Nehalem should still show a moderate performance gain over Core2 on the basis on micro-architecture improvements alone (plus the integrated memory controller).
Well, my thanks to Dell for publishing a TPC-H for Nehalem. Notice to other vendors: get going, slackers! Below is the 2-way Xeon 5500 versus Xeon 5400 or 5300, and 4-way Xeon 7460 or 7350.
System Configuration TPC-H@100GB
T610 2 Xeon 5570 Quad-Core 2.93GHz, 8M L3, 48GB 28,773
ML370G5 2 Xeon 5355 Quad-Core 2.66GHz, 2x4M L2, 64GB 17,687
DL580G5 4 Xeon 7350 Quad-Core 2.93GHz, 2x4M L2, 128GB 34,990
System Configuration TPC-C
DL370G6 2 Xeon 5570 Quad-core 2.93GHz, 8M L3, 144GB 631,766 (Oracle/Linux)
ML370G5 2 Xeon 5460 Quad-core 3.16GHz, 2x6ML2, 64GB 275,149
DL580G5 4 Xeon 7460 Six-core 2.66GHz 16M L3, 256GB 634,825
System Configuration TPC-E
Fujitsu RX300 2 Xeon X5570 Quad-core 2.93GHz, 8M L3, 96GB 800.00
TX300 S4 2 Xeon X5460 Quad-core 3.16GHz, 8M L2, 64GB 317.45
Dell R900 4 Dunnington Six-core 2.66GHz, 16M L3, 64GB 671.35
It is unfortunate that the last TPC-H results on Intel were the 65nm Xeon 5300 and 7300 series (except for the Unisys 10TB 16-way). Lets just suppose that a Xeon 5470 3.33GHz would score 20% higher than the Xeon 5355 2.66GHz. The 20% frequency difference might contribute 10%, and the micro-architecture improvements going from the 65nm to 45nm Core2 contribute the rest (The larger cache is not expected to benefit TPC-H high row count queries). This would make the Xeon 5500 series 35% faster on TPC-H than the 5400, which is more than I expected just from the Core 2 to Nehalem architecture improvements.
Of course, this Dell result uses the Fusion-IO SSD drives that plug directly into the PCI-E slots, instead of going through a RAID controller, then the SAS interface. I am looking through the individual TPC-H queries, comparing against both the 2-way 5355 and 4-way 7350 results. I think there is reason to believe that the SSD storage improves performance over a large array of disk drives. A large disk array can deliver sufficient sequential bandwidth, but some SQL operations will generate small block IO, and writes to tempdb should be much faster. The 4-way 16-core Core 2 (7350) has better overall performance than 2-way 8-core Nehalem, but on some individual queries, the Nehalem system scores better.
I am inclined to think the 2-way Opteron six-core (Istanbul) could be close to 2-way Nehalem quad-core on TPC-H, despite the large advantage in TPC-C/E. Intel has a myopic view that the big-dog processor should be reserved for the 4-way+ systems. (It may not make sense to put 8-core Nehalem EX into a 2-way system if the 6-core 32nm Westmere core will be available soon.)
To reiterate, it is very important that all key benchmarks are published so we can get a good idea of what to expect under each circumstance. No one (among reasonable people) expects miracles and magic. It is a complete picture that is important. Knowing that you should expect 20% is better than a misguided belief or hope for 2X.
Some people think I am paranoid and deeply cynical. So I will now resemble this accusation. My thinking is Intel had the full set of benchmark results months ago. It was pointed out that some benchmarks, mostly the ones the benefit from HT, showed huge gains, while others just show good gains. Every organization has worthless marketing types that feel the need to justify their salary. So it was decided to withhold the DW results, just so the worthless crap marketing slides could show the big numbers instead of a complete picture. The complete picture is important, we are happy that Nehalem has arrived. We can work its actual performance characteristics; spectacular gain on some, good gain on others. So stop tinkering with the slide deck!
There is the truth, the whole truth, and nothing but the whole truth. Some people can handle the item 1, but know to stay well clear of 2 and 3,
The use of the FusionIO SSD is interesting. As mentioned above, it interfaces directly to PCI-E. The first generation was PCI-E gen 1 x4, and can do 750MB/s (32K) read, 500MB/s write, 116-119K IOPS (4K), in capacities of 80 and 160GB for SLC, 320GB for MLC. The second generation can do 1.5GB/s read, 1GB/s write, 200K IOPS (4K), in capacities 160/320 SLC, 640GB MLC. The interface is PCI-E x8 gen 1 or x4 Gen 2.
The Intel 5520 chipset has 36 PCI-E gen 2 lanes plus the ESI. A 2-way Nehalem system can be built with 1 or 2 5520 IOHs. The Dell T610 has 1 IOH for 2 x8 and 3x4 slots available (x4 for the internal SAS?). The Dell TPC-H config has 4 Fusion-IO drives, which is fine for this test. An actual production system might want to configure more SSDs. The HP ML370G6 with 2 IOHs has 10 slots (2x16, 2x8, 6x4, one for NICs). The x16 slots are useless for database servers because no network or storage IO adapters can really use x16 bandwidth, and definitely cannot make good use of the unbalanced slots. The x16 slots might be useful for HPC or something. Hopefully Dell or HP will make a system with something like 7 x8 and 4 x4 slots. Now to make maximum use of the (current) Fusion IO SSDs, we would have 18 PCI-E x4 gen 2 slots, but I think Fusion IO could be persuaded to do a double wide SSD instead.
SuperMicro does have a dual 5520 IOH motherboard with 7 x8 PCI-E slots. The onboard SAS occupies a x8, and the dual GbE NIC takes another x4. It looks like one x4 is not wired. For a server, I would have used a x4 for the onboard SAS because that may only connect the boot drives, and I would stick the GbE NIC off the south bridge ICH. The x4 Gen2 should be made available for 10GbE. I used to buy SuperMicro systems because their wide motherboard selection allowed me to get the one with the best IO arrangement for database servers. But when SAS came out, I had a hard time getting the right connectors. I may give them another try if Dell or HP does not do a 7 x8 PCI-E Gen 2 system.
|
-
Yesterday Intel held a product announcement press event for the upcoming Nehalem EX, which will succeed the current Xeon 7400 series based on the Core 2 micro-architecture for "expandable system", i.e., 4-way and higher, in late 2009 or early 2010. The current Xeon 5500 series (also Nehalem architecture) has 4 cores, 8M shared L3, 2 QPI links, and 3 DDR3 memory channels. Nehalem EX has 8 cores, 24M shared L3 cache, 4 QPI links and 4 FBD memory channels (there is now a Scalable Memory Buffer between the memory interface and memory, did Intel just move the AMB from the DIMM to the motherboard?).
AMD has also recently discussed their plans. The current quad-core Shanghai gets a frequency bump from 2.7GHz to 3.1GHz, and a six core Istanbul should be released very soon (June, announced at 2.6GHz). See the Johan de Gelas Anandtech article on Istanbul. It describes HT assist, (essentially a snoop filter for HT) as using 1M of the L3 cache. The HP ProLiant DL585G6 for Istanbul also appears to be HT version 3.0 or HT3, upping the HT transfer rate from 2GT/s to 4.4GT/s.
Later on, there will be Magny-Cours, which would be 2 Istanbul die in one package. Istanbul has six cores, 3 Hyper Transport links and 2 memory channels. In Magny-Cours, the two six core chips are linked by one HT link, so the external package will have 12 cores, 4 HT links and 4 memory channels. After this, a new improved micro-architecture would arrive?
Now there have been big iron Windows systems for many years. The HP Superdome supports up to 64 Itanium 2 sockets. The problem has been that Intel has not kept pace with Itanium. The current Itanium 9100 series, Montvale, is a 90nm dual core, while the Xeon line is at 45nm and six+ cores. Tukwila, the 65nm quad-core Itanium that should have been launched in 2008, was recently delayed until 2010. Supposedly Itanium should finally be caught up on process technology in 2011 with the 32nm Poulson. Unisys (ES7000 7600R), NEC (Express5800/A1160) and IBM (x3950M2) all have had 16-socket capable Xeon systems for a while. HP has the 8-way ProLiant DL785G5 for Opteron processors (I really would like to get the architectural diagram for how HP connects the 8 sockets). I have not followed Sun since I focus on Windows/SQL Server. (Sun has the 8-way x4600 for Opteron. see http://www.sun.com/servers/x64/x4600/arch-wp.pdf for an architectural diagram on how 8 Opterons are connected in a twisted ladder)
Still, I consider this to be a revival or perhaps true arrival of big iron because of the issues in the past on scaling beyond 4-sockets, both in terms of performance and price-performance.
Previously, there were technical challenges in scaling the Intel Xeon beyond 4 sockets, both for the system vendors in designing such a system, and the DBA/developer in getting their application to scale beyond 4-sockets. For an OEM to build an 8-way+ system, it required the effort to built custom chips, the market volume was low, and Intel kept changing the FSB. All of this meant there was a big step up in price per socket going from a 4-socket system to 8, 16 or 32.
This was the rational for Oracle RAC. Instead of buying really expensive big-iron hardware, one can buy lower cost high volume hardware and really expensive software licenses. Think about it. Scaling up on big iron or a RAC-type technology depends on interconnect bandwidth and latency. For either the Intel QPI or AMD HT, it should be possible to achieve far better bandwidth and latency in big-iron than a RAC-type solution. The best Infini-band can do now in a x4 link is 40Gbit/s (5GB/s) at approx 1us latency.
Now that there is prospect of stability in the Intel processor interconnect, my expectation is that we should now see 8-way+ systems at a less severe price premium over 4-way systems. (there will always be a premium because validating and supporting big systems requires deeper technical skills). On AMD Opteron, having the 4 HT ports from one package enables 8-way glue-less systems (with fewer hops) and helps in building 8-way+ (with glue?).
In the Intel announcement was that 8 OEMS have 15 or so 8-way+ (including 16 and 32-way) Nehalem EX systems in the works. IBM, NEC and Unisys are obviously 3 of the OEMs, given their recent commitment to big-iron Xeon. Fujitsu and Hitachi might be another 2, as the Japanese players love big-iron. Sun should be one for 6 of the 8 OEMs. I am guessing this means HP and Dell are the two remaining OEMs. HP is no surprise. They already have the 8-way Opteron. Their commitment to Itanium means that HP would have built a chipset around QPI for the next generation, which is the same processor interconnect on Nehalem.
Dell is the question. Their attitude might be that they do not expect to sell many big-iron systems, considering the technical difficulties they had in the past on this. To sell big iron, it is absolutely necessary to have top technical expertise to go into customer shops to find out if it is the right solution and what changes need to be made to deploy successfully. (OEM reps are invited to drop hints, even if its still a company secret, we will keep it just between us)
[OK, I forgot about SGI, they have big iron Itanium, which means if they do a chipset for the next gen Itanium with QPI, they can do a Nehalem-EX too. plus they just blogged this http://ceoblog.sgi.com/]
Up to Windows Server 2008 RTM, the OS does not support more than 64 cores, physical or logical. This limit will be lifted with Windows Server 2008 R2, accompanied by SQL Server 2008 R2(?). Both the Unisys 7600R and NEC A1160 posted TPC-E benchmark results for 16-sockets, but only 4 of the 6 cores in the Intel X7460 processor enabled, to stay under the current 64-core limit. Scaling was decent, but not spectacular, going from 721tps-E@4-sockets/24 cores, to 1156 tps-E@8S/48c, to 1400tps-E@12S/64c and 1568tps-E@16S/64c.
Note that scaling large/(hard) NUMA systems require proper use of port affinity settings, and how interrupts are handled. Windows 2008 R2 supposedly has a much improved disk I/O handling on NUMA systems.
The Intel announcement mentioned that 4-way Nehalem EX will have 2.5X+ performance over 4-way Xeon 7460, based on a very recent internal measurement using OLTP workload, i.e., TPC-C or TPC-E. This is also inline with the huge TPC-C & E gains posted by 2-way Xeon 5500 over Xeon 5400. Previously I discussed this matter. Each Nehalem core should have moderately better performance than a Core 2 micro-architecture core. Nehalem systems have more memory channels to better support multi-core scaling. The Nehalem EX 4-way system has 16 memory channels supporting 32 cores, versus the Xeon 7400 (7300 MCH) 4 memory channels supporting 24 cores. Nehalem EX will have 8 physical cores compared with 6 on Xeon 7460. Finally, both TPC-C and TPC-E benefit from Hyper-Threading, a feature from the Pentium 4 (NetBurst) micro-architecture (designed in Oregon), but not implemented by Core 2 (designed in Israel). Anyways, 2.5X over X7460 means 1.6M tpm-C or 1700 tps-E.
Now both TPC-C and TPC-E are OLTP benchmarks (workloads). The interpretation should not be that HT (and large cache) benefit OLTP workloads as in any one else's OLTP workload. Each TPC-C transaction involves on average 2.25 or so RPC calls (network roundtrip) and each TPC-E transaction involves approximately 22.3 RPCs. By looking at the recent results on Xeon 7460 or Opteron Quad-core, one can figure out that the average cost per RPC in both TPC-C and E is on the order of 1 CPU-millisecond (the duration of the complete RPC might be longer, say 80-400ms)
The correct interpretation should be that HT and large cache benefits high call volume applications, transaction processing or not. HT benefits mostly in the network round-trip. This was based on tests done on the previous version of HT, i.e., Pentium 4 architecture. I did not find one SQL operation that benefited from HT except in handling just the RPC overhead. The Quest LiteSpeed compression engine did show huge gains with HT, 40%. This indicates the theory behind HT is valid. One just needs to figure what in the SQL Server engine does not like HT. It is possible that the HT in Nehalem now works better with SQL Server.
The large cache reduces the (fixed) startup cost of an SQL operation, but not the incremental cost per additional rows. So if someone else's OLTP application average 10 CPU-ms per call, then it might not show as much gain going from Core 2 to Nehalem.
I suspect this is the reason Intel has not posted any TPC-H benchmark results. It should show some gain over Core 2, just not the spectacular gains in C & E. I am inclined to think that the 4-way Xeon 7460 is memory bandwidth constraint in TPC-H, and that is alleviated in Nehalem, but there are no published TPC-H results to substantiate this matter.
Dunnington and Nehalem EX are both 45nm. Dunnington has 1.9 billion transistors, 6 cores, there is a 3M L2 cache shared by each pair of cores, and a 16M L3 cache shared by all cores for a L2+L3 total of 25M. Nehalem EX has 2.3B transistors, 256K L2 cache dedicated for each core and 24M L3 cache for 26M L2+L3 cache. Granted there is a big increase in latency from L2 to L3. I would interest to see the supporting data (estimates made before the design work) for the big L2 caches in Dunnington.
Even with all of the improvements over time, on the hardware with Nehalem, integrated memory controllers, QPI, on the software stack, w2k8r2 and s2k8r8, scaling on NUMA systems is not trivial. What SQL execution plan operations scale?, what does not?, what might have negative scaling? etc, what problems can be fixed with code changes etc. All of this should be done with proper expertise. (Not to be construed as an advertisement or solicitation for services, this will not be cheap either)
PS -
I am neither advocating nor criticizing big-iron systems. The important point is that new systems coming every year are approximately 40% more powerful at comparable price ranges. That means the value of compute power depreciates at 30% per year (1/1.4 = 0.71). So it does not make sense to buy now for what you do not expect to need for 2+ years. Buy what need for the next year, and buy a new system after that, rotating the existing system to a less important task. Of course, if you work for an inflexible government agency that mandates replacement at 5 year intervals, or if buying the $1M system makes you more important than the other group that runs on a $30K system, well then go for it! On the flip side, one should not argue for the minimum system that meets requirements, but rather think about how massive compute power can be used to generate value.
I used have many complaints about Intel, particularly on the chipsets. Most have been addressed. The remaining complaint is that Intel has a twisted view that 4-way systems are special, ie, compared to 2-way systems. This is why the 6-core Dunnington is only used in the Xeon 7400 series and not the 5400 series, even though there is no reason it cannot be used in the 5400. The same applies to the upcoming 8-core Nehalem EX being only positioned in the 7000 line and not the 5000 line. AMD has no issues offering 6-core Istanbul in a 2-way. Hopefully, hardware vendors will have a better picture of customer interests, and offer a 2-way for Nehalem EX. Sure I know it is not cheap, this is why the different between men a boys is the size and price of their toys.
HP/Oracle just published a RAC TPC-H result with 64 BL640c blade servers at1000GB. This system comprised 128 quad core Xeon 5450 processors (512 cores), 32GB memory per node (64GB on one node). The total memory was 2080GB. The full database size should be around 1700GB. The 1000GB description is for just the LineItem table, not including the two non-clustered indexes and the other tables.
CPU memory Power Throughput QphH
32 Itanium2 DC 256GB 90,909 53,899 69,999
128 Core2 QC 2080GB 782,609 1,740,122 1,166,977
Based on the published Oracle RAC results, I should point out that RAC scaling on TPC-H does look good. The almost total lack of TPC-C (1 published?) may indicate an issue in scaling high-call volume applications. In the above mentioned Oracle RAC result, the blade server hardware costs were about $700K, $500K for storage, $3M for Oracle, $1.5M for RAC, $700K for partitioning, $700K for compression, $400K for support ($100K for unbreakable Linux support, if its unbreakable, why the support?) for about $6M in software, minus $1.8M in Oracle discounts. If I could charge that much, I would get myself a 400ft yacht. Never mind, Larry already did.
|
|
|
|
|
|