Intel Dunnington
Yet another update, with the published TPC results for the Intel X7460 six-core (Dunnington).
X7460 2.67GHz 3x3M L2, 16M L3
Dell and HP list availability as Sep 15, 2008; IBM lists availability on the x3950M2 as Dec 10.
Dell R900 with 4 x X7460, 2.67GHz, 6 core, 16M L3, $17,195
HP DL580G5 with 4 x X7460, 2.67GHz 6 core, 16M L3 $19,151
I think the IBM x3950M2 with 4 x X7460 is $41K (understanding that this system can be expanded to 16 sockets, and hence has a higher cost structure).
Tukwila Quad Core Itanium due in Q1 2009?
I am looking over the Intel IDF slides on Tukwila. It is a quad-core Itanium, probably with just minor improvements in the core (HT is specifically mentioned), but with an integrated memory controller and the QuickPath Interconnect (QPI) replacing the FSB. Tukwila will be 65nm when the first 45nm procs are 1 year old, meaning it is really 2 years late. (What I mean by this is that a 65nm quad-core Itanium could have been built in late 2006/early 2007, if the prep work had started early with clear objectives, and its performance would have been earth shattering relative to x86/64.) Frequency improvements are mentioned over the 90nm dual-core, which runs at about the same frequency as the 130nm single-core Madison. Still, Tukwila has a large cache, 6M L3 per core, massive bandwidth via QPI for good scaling characteristics (4 full width + 2 half width, allowing glueless 8-way), 4 DDR memory channels per socket, and HT, which is good for high call volume database apps. Intel mentioned about 2X performance, so they are probably targeting 740K tpm-C.
This means we will have the choice of: 1) the six-core Dunnington, with the most powerful CPU core on earth (prior to Nehalem), on a weak chipset with 4 memory channels supporting 4 sockets and 24 cores; 2) the new quad-core Itanium, with outstanding scaling characteristics plus HT for added throughput, but a weak core; 3) AMD Barcelona, also with very good scaling (8-way glueless), no HT, and a slightly better than weak core; 4) Nehalem, with what will be the most powerful core, the new QPI for good scaling, 3 memory channels per socket, and HT, but for the first year, only 2 sockets. Decisions, decisions.
Nehalem/Beckton
Due out in Q4 2008, the initial Nehalem will support 2-way, with 3 memory channels and 2 QPI links per socket. About a year later, Beckton, the MP server (4 sockets and up) version, comes out with 4 memory channels per socket and 4 QPI links.
Performance
TPC-C (Windows Server 2003, SQL Server 2005SP2)
4 x Intel X7460 six core 2.67GHz, 634,825 tpm-C
4 x AMD 8360 quad-core 2.5GHz 471,883 tpm-C
4 x Intel X7350 quad-core 2.93GHz 407,079 tpm-C
TPC-E (W2K8, S2K8, Dell PowerEdge R900 results)
4 x Intel X7460 six core 2.67GHz, 671.35 tps-E
4 x Intel X7350 quad core 2.93GHz, 451.29 tps-E
4 x AMD 8360 quad core ??
TPC-H (W2K3, S2K5, SF 100)
4 x Intel X7350 quad core 2.93GHz, 46,034 QphH
4 x Intel X7460 six core ??
4 x AMD 8360 quad core ??
TPC-H 300GB
8 x Intel X7350 QC 46,034 QphH (IBM x3950M2, W2K3, S2K5 SP2)
8 x AMD 8360 QC 52,860 QphH (HP DL785, W2K8, S2K8)
On TPC-C, the X7460 six core showed a 34% edge over the quad core AMD and a 56% advantage over the quad core X7350. Even with the large cache, this is higher than expected. At the time, I suspected HP did not pursue optimization with the 407K result.
On TPC-E, the six core showed a 49% edge over the older quad core.
This could indicate the 7300 chipset with 4 memory channels cannot properly scale the 4 QC 2x4M L2 processors, but can scale the new six core 16M L3 procs.
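The percentages above follow directly from the published scores. A minimal sketch of the arithmetic, using only the 4-socket results listed in this post; the function names and the per-core breakdown are my own illustration, not part of any TPC report:

```python
# Published 4-socket results quoted in this post: (score, total cores).
TPC_C = {  # tpm-C
    "X7460": (634_825, 24),
    "8360": (471_883, 16),
    "X7350": (407_079, 16),
}
TPC_E = {  # tps-E
    "X7460": (671.35, 24),
    "X7350": (451.29, 16),
}


def edge_pct(new_score, old_score):
    """Percentage advantage of new_score over old_score."""
    return (new_score / old_score - 1.0) * 100.0


def per_core(score, cores):
    """Throughput divided evenly across cores (a crude normalization)."""
    return score / cores


if __name__ == "__main__":
    base = TPC_C["X7460"][0]
    for name in ("8360", "X7350"):
        print(f"TPC-C X7460 vs {name}: {edge_pct(base, TPC_C[name][0]):.1f}%")
    print(f"TPC-E X7460 vs X7350: "
          f"{edge_pct(TPC_E['X7460'][0], TPC_E['X7350'][0]):.1f}%")
    for name, (score, cores) in TPC_C.items():
        print(f"TPC-C {name}: {per_core(score, cores):,.0f} tpm-C per core")
```

The per-core figure is only a rough normalization; it ignores differences in frequency, cache, memory configuration and tuning effort between the published systems.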
What's missing are comparable TPC-H numbers, especially at 100GB. The big cache on X7460 helps high call volume apps like TPC-C and E, but not TPC-H. Can the 7300 chipset drive the extra 8 cores (in the X7460 over 16 core in the X7350) in DW queries?
There is an 8xOpteron QC and 8xX7350 at 300GB, but the Opteron is on S2K8 while the X7350 is on S2K5, which has different characteristics.
The X7460 (Dunnington) is a clear winner at 4-way for the high call volume apps. There are not sufficient results in DW to make a call. AMD does have a small opening in the fairly low-priced 8-way (compared with hard-NUMA systems).
As much as I would like to buy one of these for my own use in researching SQL Server performance characteristics, I am holding my 2008 budget for a Nehalem system as soon as it comes out, and an SSD array. As soon as I can confirm an SSD can do 10K IOPS on random 8K reads (I see the IDF announcements that the new Intel SSD due early 2009 will do 30K IOPS at 4K), I will get a dozen to see what is involved in reaching 100K IOPS from SQL queries. A few years ago, a quick test on a TMS SSD SAN showed 45K, limited by the SQL Server side CPU. On Nehalem, the big question is whether the Hyper-Threading issues of NetBurst have been fixed.
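For a sense of scale on the IOPS targets, here is a rough back-of-envelope sketch. It assumes perfectly linear scaling across drives and ignores controller, bus and CPU limits, which is exactly what a real test would need to find out; the function names are hypothetical.

```python
import math


def throughput_mb_s(iops, io_kb):
    """Bandwidth in MB/s for a given IOPS rate and IO size (1 MB = 1024 KB)."""
    return iops * io_kb / 1024


def drives_needed(target_iops, per_drive_iops):
    """Minimum drive count for a target, assuming ideal linear scaling."""
    return math.ceil(target_iops / per_drive_iops)


if __name__ == "__main__":
    # One SSD at 10K IOPS on random 8K reads.
    print(throughput_mb_s(10_000, 8), "MB/s per drive")
    # The 100K IOPS goal at 8K.
    print(throughput_mb_s(100_000, 8), "MB/s aggregate")
    # Drives required at 10K IOPS each, before any real-world losses.
    print(drives_needed(100_000, 10_000), "drives minimum")
```

Under these assumptions ten drives would be the bare minimum, so a dozen leaves some headroom for less than ideal scaling.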
______________________________________________________________________
This is an update to the original post on Server Sizing for SQL Server to reflect the new quad-core Opteron systems.
The recommended server systems, as of Q3 2008, for line-of-business database applications are:
2-way Intel Xeon: HP ProLiant ML370G5 and Dell PowerEdge 2900 III
4-way Intel: Dell PowerEdge R900 and HP ProLiant DL580G5
2-way AMD Opteron: Dell PowerEdge R805
4-way AMD Opteron: Dell PowerEdge R905 and HP ProLiant DL585G5
8-way AMD Opteron: HP ProLiant DL 785G5
Processors
For 2-socket Xeon, the top processors include the X5460 3.16GHz, and the E5440 2.83GHz, both 2x6M cache.
For 4-socket Xeon, the X7350 2.93GHz 2x4M.
For 4 & 8 socket Opteron, the top processor is the 8360SE at 2.5GHz.
Processor Notes
I really do not want to get heavily into Xeon versus Opteron. It is too emotional a subject for many people and too infested with FUD driven by marketing people. This frequently involves valid technical points taken out of context. What it comes down to is the Core 2 architecture has by far the highest SPEC CPU integer (not rate) scores, and will generate the best results in certain categories of performance tests. This is most evident in single large query tests.
At the 4 & 8-socket level, AMD Opteron has the better memory architecture, with 2 DDR2 memory channels per socket, 8 total in a 4-socket system and 16 memory channels in an 8-socket system, compared with 4 in the 4-socket Xeon with the 7300 chipset. This may yield an advantage in full saturation tests, which are more difficult to run. So at the 4-socket level, the difference is that Xeon has more compute power in the processor cores, while Opteron can move data to and from memory faster. What is better: a 400HP engine with a transmission with 70% efficiency or a 310HP engine with a 90% transmission efficiency?
At the 8-socket level, Opteron is the best choice for most situations. The 8-socket Opteron (Barcelona) system has what is considered to be a soft NUMA architecture, meaning that the memory latency difference between local and remote nodes is low or inconsequential (i.e. do not set the NUMA flag). The IBM and Unisys big iron systems are considered hard NUMA, meaning that the memory latency difference between local and remote nodes is high. Hard NUMA systems can scale, but would most likely require specialized performance analysis skills which are not easily found.
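The engine question is close to a tie once the arithmetic is done, which is the point of the analogy. A small sketch of that calculation, together with the cores-per-memory-channel ratio implied by the channel counts above; the helper names are mine, and the ratio is only a crude indicator of memory pressure:

```python
def delivered_hp(engine_hp, transmission_efficiency):
    """Power reaching the wheels in the engine/transmission analogy."""
    return engine_hp * transmission_efficiency


def cores_per_channel(total_cores, total_channels):
    """How many cores share each memory channel."""
    return total_cores / total_channels


if __name__ == "__main__":
    print("400HP at 70%:", delivered_hp(400, 0.70))
    print("310HP at 90%:", delivered_hp(310, 0.90))

    # 4-socket systems: the Xeon 7300 chipset has 4 channels in total,
    # Opteron has 2 channels per socket (8 in total).
    print("X7350, 16 cores / 4 channels:", cores_per_channel(16, 4))
    print("X7460, 24 cores / 4 channels:", cores_per_channel(24, 4))
    print("Opteron, 16 cores / 8 channels:", cores_per_channel(16, 2 * 4))
```

The two engines deliver roughly 280HP and 279HP, so neither platform wins on paper; the workload decides.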
Additional comments
I rate the HP ProLiant ML370G5 over the PowerEdge 2900 III on technical grounds: more memory sockets, and more PCI-E sockets. On the same grounds, I rate the Dell PowerEdge R805 over the ProLiant DL385G5 because the R805 has 16 DIMM sockets over 8 for the DL385G5.
At the 4-socket level, for Intel platforms, the Dell and HP systems are sufficiently comparable.
Note that Dual-Core Opteron processors are an option in the 2 and 4-socket systems, but not the 8 socket DL785. The original and dual core Opteron processors have up to 3 full (16-bit) HT links, of which 2 connect to other processors, and 1 connects to an IO hub. In a 4-socket system, the processors are at the corners of a square, with each processor connected to processors on the two adjacent corners. Hence there is a far processor two hops away.
The Barcelona quad-core has up to 4 full width HT links, each of which can be split as two half-width (8-bit) HT links. In a 4-socket system, each processor can connect directly to all of the other three sockets with a full HT link, leaving one for IO. In an 8-socket system, each processor can connect directly to all seven other sockets with one half-wide HT link, leaving one half-wide link for IO. The HP 4-socket DL585G5 and 8-socket DL785 only support quad-core Opteron, not dual core, which may indicate the use of three full HT links to processors. The Dell R905 supports both dual and quad-core Opteron, which may indicate an older 2-hop to the far processor.
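The hop counts and link budgets described above can be checked with a small graph sketch. This only models processor-to-processor links as an undirected graph; it is my own illustration of the two layouts (square versus fully connected), not a description of any vendor's actual board routing.

```python
from collections import deque


def square(n):
    """Ring layout: each socket links only to its two neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def full_mesh(n):
    """Every socket links directly to every other socket."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def max_hops(adj):
    """Largest shortest-path hop count between any two sockets (BFS)."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


def half_links_left_for_io(sockets, full_links=4, half_width=True):
    """Links remaining for IO after a full mesh, in the unit actually used."""
    available = full_links * 2 if half_width else full_links
    return available - (sockets - 1)


if __name__ == "__main__":
    print("4-socket square:", max_hops(square(4)), "hops to far processor")
    print("4-socket mesh:", max_hops(full_mesh(4)), "hop")
    print("8-socket mesh:", max_hops(full_mesh(8)), "hop")
    print("4-socket, full links left:",
          half_links_left_for_io(4, half_width=False))
    print("8-socket, half links left:", half_links_left_for_io(8))
```

With 4 full-width links, a 4-socket mesh uses three and leaves one for IO; splitting into 8 half-width links lets an 8-socket mesh use seven and leave one half-width link for IO, matching the description above.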
Finally, until a quad-core on a current generation manufacturing process is available, Itanium holds the very high memory capacity (>256-512GB) and IO bandwidth (>10GB/sec) niche. It could be pointed out that, whatever the criticism of Itanium has been since its launch, at the time it was conceived in the 1990s it was a foregone conclusion that RISC would overwhelm x86, which would not be able to benefit from advanced design concepts. Intel was not content to be a Johnny-come-lately to the RISC party and, with HP, came up with a better idea than RISC. And yet, what processor today has the best SPEC CPU int?
IBM Xeon Systems
I have said before that I do not have recent experience with IBM systems. Just from looking at the IBM redbook on the x3850 M2, it looks very impressive. For this and the x3950 M2, IBM does their own chipset, which supports a NUMA architecture of up to 4 nodes of 4 sockets. I do not know if 8 nodes are still supported. What I like about the x3850 M2 memory controller is the 8 DDR2 memory channels. I really think the Intel 7300 with 4 memory channels is too weak to support 4 quad cores, and now 4 six core procs. Intel has always been afraid of the high-end chipset, obsessively looking at the entry point price, which drags down the high-end configuration. The IBM x3850 M2 did post a TPC-E of 479.51 tps-E over Dell's 451.29. The IBM system has 128GB memory compared with 64GB for Dell, so it is not clear if the 8 memory channels contributed.