AMD released information on its upcoming Bulldozer architecture in conjunction with Hot Chips 22; a less technical version is in the Bulldozer press release on the AMD website. The slide deck is on SlideShare. The principal new feature is AMD's solution for multi-threaded performance. AMD seems determined not to implement the same multi-threading solution adopted by Intel (Hyper-Threading), IBM POWER, and Sun. The new AMD Bulldozer architecture, shown in the diagrams below, has a complete core that appears as two cores to the operating system. Each sub-core has a dedicated integer scheduler, execution units, and L1 cache. The fetch/decode, floating-point, and L2 cache are shared by the complete core.
Additional Bulldozer architecture diagrams:
More detailed Bulldozer core architecture diagrams:
The Bulldozer chip-level view shown below has 4 complete double integer cores, with L3 cache and memory controllers shared by all cores.
The AMD press release mentions that the processor socket would have 16 cores visible to the operating system, so the socket level would consist of two chips. Bulldozer then has 33% more cores than the current 12-core Opteron 6100 series (Magny-Cours). AMD stated that 50% better performance was expected, corresponding to about 13% better performance per core. This could indicate that the Bulldozer core has micro-architectural improvements, or it could be that AMD is able to get 15-20% better frequency on the 32nm process than Magny-Cours can on the 45nm process. Note that the single-die six-core Istanbul could reach 2.8GHz on the 45nm process.
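The per-core figure comes from dividing the claimed throughput ratio by the core-count ratio. A minimal sketch of that arithmetic, using only the numbers quoted above (the function name is illustrative, not anything from AMD):

```python
def per_core_ratio(throughput_ratio: float, old_cores: int, new_cores: int) -> float:
    """Implied per-core performance ratio for a given socket-level throughput ratio."""
    core_ratio = new_cores / old_cores
    return throughput_ratio / core_ratio


# Magny-Cours: 12 cores per socket; Bulldozer: 16 cores per socket, 1.5x throughput claimed
core_increase = 16 / 12 - 1
per_core_gain = per_core_ratio(1.5, old_cores=12, new_cores=16) - 1

print(f"core count increase: {core_increase:.0%}")
print(f"implied per-core gain: {per_core_gain:.1%}")
```

The result is 12.5%, which the text rounds to 13%. This assumes throughput scales linearly with core count, which is the best case.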
AMD claims the additional integer unit requires 12% extra die space, for a potential 70%(?) performance gain on some applications. A 70% throughput gain for 12% die size is a spectacular achievement. However, there is the curious matter of the Bulldozer die size. The quad-core Shanghai on 45nm has a die size of 258mm2 with 6M L3 cache. The six-core Istanbul, also on 45nm, is 346mm2 with 6M L3. A shrink/compaction of quad-core Shanghai to 32nm should have a die size in the range of 130mm2. The 12% for the double integer execution would only increase die size to about 145mm2, which is rather small for a high-end server processor, even if the processor socket will have 2 dies.
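The 130mm2 estimate is consistent with an ideal shrink, where area scales with the square of the feature size. The sketch below assumes that ideal scaling, which real processes rarely achieve exactly, so treat the output as a rough lower bound rather than a prediction:

```python
def shrink_area(area_mm2: float, old_nm: float, new_nm: float) -> float:
    """Ideal shrink: die area scales with the square of the linear feature size."""
    return area_mm2 * (new_nm / old_nm) ** 2


# Quad-core Shanghai, 258mm2 on 45nm, shrunk to 32nm (assumption: ideal scaling)
shanghai_32nm = shrink_area(258, old_nm=45, new_nm=32)

# AMD's claimed 12% die-space cost for the second integer unit
with_second_integer_unit = shanghai_32nm * 1.12

print(f"Shanghai at 32nm: about {shanghai_32nm:.0f} mm2")
print(f"with 12% added:   about {with_second_integer_unit:.0f} mm2")
```

This lands at roughly 130mm2 and 146mm2, in line with the figures above.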
Intel Hyper-Threading Recap: Simultaneous and Temporal
Several Intel processor architectures have Hyper-Threading, including the Pentium 4 (NetBurst), the more recent Itanium (Montecito forward) and even Atom. The original Pentium 4 implementation was simultaneous multi-threading (SMT). The Itanium implementation is time-slice or temporal multi-threading. Nehalem, like the Pentium 4, is SMT rather than temporal (as Wikipedia also says), and Atom appears to be SMT as well.
The general idea in microprocessor architecture has always been to make the best use of the available transistors. Long ago, the objective was single-threaded performance at the processor socket level (and multi-threaded performance at the system level). For the last several years, the objective has shifted to multi-threaded performance within a reasonable power envelope, with consideration for the fact that single-threaded performance is still important.
The ideal microprocessor design has all units uniformly running at maximum load continuously. Of course, the ideal cannot be achieved across a broad range of applications, except perhaps for brief intervals.
In a single-threaded architecture, the pattern of Moore's law has been that each doubling of the transistor budget could increase performance by about 40%. In a multi-core design, the aggregate throughput could be linear with the number of cores, or transistor budget. One criterion is the power budget of the combined cores. In many multi-core designs, the clock frequency is restricted to below the top frequency capability of the single core. So multi-core processors can achieve better scaling than Moore's Law if an application is effectively multi-threaded.
Early in the original Pentium 4 (Willamette) design phase, it was thought that simultaneous multi-threading (SMT), as it was called before Hyper-Threading, could contribute a 30% throughput gain in multi-threaded server applications at a cost of approximately 10% in transistor budget. As it turned out, unanticipated complications resulted in less gain in the key TPC-C benchmark, probably on the order of 10%.
It is possible, or suspected, that all of the Pentium 4 HT performance gain can be attributed to the network round-trip portion, with no gain in the core SQL Server engine. In addition, there was erratic behavior in other applications, probably because neither the operating system nor the application was properly designed for an HT processor architecture. This is especially evident in parallel execution plans. The Prescott-based Pentium 4 processor did show a 40-50% performance gain on Quest LiteSpeed database backup compression, probably the highest reported HT performance gain. Hyper-Threading does have potential, but there are issues that need to be worked out.
For the Nehalem generation, an informal statement puts the HT performance gain in the range of 30% for high call volume workloads, and in the range of 10% for DW, but additional details are necessary before making conclusive statements. If the original Willamette die size impact were still the case, then the 30% gain in transaction processing for 10% die size is a good trade. However, the 10% gain in DW is only marginal, about the same as having more non-HT cores.
In a test with a custom non-transactional (no locking code) index search engine, an astounding, nearly 100% performance gain was observed with HT in a 2-way quad-core Nehalem system. The code was almost entirely a repetitive sequence of string comparison followed by a memory fetch. If the comparison operation is shorter than a local memory fetch (50-60ns, or 150-180 CPU-cycles), then a 100% gain would seem to be possible. Perhaps there are still unresolved HT contention points in the SQL Server engine.
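The reasoning behind the 100% figure can be illustrated with a toy model. This is a sketch under strong assumptions, not a description of how Nehalem actually schedules threads: each loop iteration is a fixed compute phase followed by a fixed memory stall, two hardware threads share one execution resource, and their stalls overlap freely. The 3.0GHz clock is an assumption implied by the 50-60ns to 150-180 cycle conversion, and the 20ns compare time is purely hypothetical.

```python
def ns_to_cycles(ns: float, clock_ghz: float = 3.0) -> float:
    """Convert a latency in nanoseconds to CPU cycles (assumed 3.0GHz clock)."""
    return ns * clock_ghz


def ht_gain(compute_ns: float, fetch_ns: float) -> float:
    """Throughput gain from a second hardware thread in the toy model.

    One thread completes one iteration every (compute + fetch).
    Two threads complete two iterations every max(2 * compute, compute + fetch),
    because compute phases must be serialized while the stalls can overlap.
    """
    single = 1.0 / (compute_ns + fetch_ns)
    dual = 2.0 / max(2 * compute_ns, compute_ns + fetch_ns)
    return dual / single - 1.0


print(ns_to_cycles(50), ns_to_cycles(60))      # 150.0 180.0
print(f"{ht_gain(20, 55):.0%}")                # compare shorter than fetch
print(f"{ht_gain(55, 20):.0%}")                # compare longer than fetch
```

In this model the gain is 100% whenever the compute phase is no longer than the stall, and it falls off as compute starts to dominate. Any contention between the two threads, such as locking, would pull a real workload below the model.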
With the Bulldozer architecture, AMD is asserting that the Fetch/Decode and FP units are under-utilized in previous Opteron architectures. Of course, Bulldozer is targeted at server applications, which tend to emphasize integer performance. Hence the Bulldozer design has two integer cores sharing the under-utilized units.
Preliminary Assessment: AMD Bulldozer versus Intel Westmere-EX and Sandy Bridge
Any assessment at this point is speculative without actual measurement details. The various press releases mention that all three expected 2011 processors (AMD Bulldozer, Intel Westmere-EX on the high end and Intel Sandy Bridge up to the mid-point) have reached first silicon. It is unknown whether meaningful performance numbers have been generated with the pre-production samples, as some references are to estimated performance and some are to simulations. However, based on the pre-production statements that have been made, which have been reasonably reliable in the past, we can draw some expectations.
At this point in mid-2010, the three main processors are:
- Intel Xeon 5600 (Westmere), six-core, 32nm
- Intel Xeon 7500 (Nehalem-EX), 8-core, 45nm
- AMD Opteron 6100 (Magny-Cours), 2-chip x six-core, 45nm
In 2-way (socket) systems, the Xeon 5680 3.33GHz has 25% better TPC-E (OLTP) performance than the Opteron 6176 2.3GHz, and the two are roughly comparable on TPC-H (DW). On 4-way systems, the Xeon 7560 2.26GHz has 38% better TPC-E than the Opteron 6176. However, the 4-way Opteron has a much lower system cost than the Xeon 7500, for roughly comparable price performance. The 4-way Opteron price-point advantage would be negated by the SQL Server Enterprise Edition per-processor license cost, but it is a better fit for Standard Edition and CAL licensing. The other factor to consider is that the Xeon 5680 has nearly 2X the single core performance of the Opteron, and the Xeon 7560 is 35% better.
In the mid-2011 time frame, the expected processors would be:
- Intel Sandy Bridge, ?-cores, 32nm
- Intel Westmere-EX 10-core, 32nm
- AMD Bulldozer, 2-chip x 4 double cores, 32nm
The expectation is that Bulldozer, with a 50% throughput performance gain over Magny-Cours, will be better than the current six-core Westmere, but the two could be comparable if Sandy Bridge has 8 cores, each with the same single-core performance as a Westmere core. Given that both Westmere and Sandy Bridge are 32nm, an increase in the number of cores at the same power would require either lower frequency or better power efficiency. Westmere and Sandy Bridge would continue to have nearly 2X the single core performance of Bulldozer.
In 4-way systems, Westmere-EX will have 25% more cores than Nehalem-EX, and should be able to improve on the 2.26GHz frequency of Nehalem-EX. (There was a 50% increase in the number of cores from Nehalem to Westmere, both at 3.33GHz.) This should allow Westmere-EX to maintain an advantage over Bulldozer. AMD will maintain the midway price point between the 2-way and 4-way Intel systems.
I am sure there will be heated (vehement?) arguments over my expectations. But please, I would like to hear quantitative analysis rather than emotional outbursts.
Additional Comments - 2010-09-02
Consider some of the implications of Moore's Law. Suppose we start with a baseline processor with core size 50mm2. If the goal is a high-end workstation product, we might target a total silicon budget of 400mm2, in which case we could have 8 cores total. The aggregate throughput performance of the processor socket for a trivially multi-threaded application, that is one in which there is zero overhead for coordinating threads and no resource contention, is simply the number of cores times the baseline single core performance, 8 in this case.
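Core size (mm2) | # of cores | Single core performance | Aggregate throughput
50              | 8          | 1.0                     | 8
100             | 4          | 1.4                     | 5.6
200             | 2          | 2.0                     | 4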
Now suppose that we desire greater single core performance. The pattern established by Moore's law at a given manufacturing process is that doubling the transistor budget (die area, to be more precise, as logic and cache memory transistors have very different densities) should yield a 40% performance gain for a single-threaded design. The second processor, at 100mm2, should have 40% better performance relative to the baseline. However, we can only fit 4 cores on a 400mm2 die, for an aggregate throughput performance of 5.6. The third example, at 200mm2, should have twice the baseline single core performance, but only 2 will fit on the die, for an aggregate throughput performance of 4.
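The three cases can be reproduced with a short calculation. This is a sketch of the rule of thumb as stated, with the 50mm2 baseline, 400mm2 budget and 40% gain per doubling taken from the text; the function name and defaults are illustrative.

```python
import math


def core_tradeoff(core_mm2, base_mm2=50, budget_mm2=400, gain_per_doubling=1.4):
    """Return (core count, single core performance, aggregate throughput).

    Single core performance is relative to the baseline core and grows by
    gain_per_doubling for each doubling of core area.
    """
    doublings = math.log2(core_mm2 / base_mm2)
    single = gain_per_doubling ** doublings
    cores = budget_mm2 // core_mm2
    return cores, single, cores * single


for size in (50, 100, 200):
    cores, single, total = core_tradeoff(size)
    print(f"{size:>3}mm2 core: {cores} cores, single {single:.2f}, aggregate {total:.2f}")
```

Strictly, two doublings at 40% each give 1.96 rather than 2, so the 200mm2 case computes to an aggregate of 3.92, which the text rounds to 4.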
It is evident that throughput-oriented applications with very little multi-threading overhead/contention favor many basic cores, while applications that are not easily multi-threadable favor more powerful cores at the expense of throughput.
Now let's see how Opteron and Nehalem match up with theory. Using the 45nm Opteron core as the reference, the 45nm Nehalem core has nearly twice the integer performance (floating point ratio?). The six-core Istanbul die size is 346mm2, for 58mm2 per core including 1.5M cache (512K L2 and 1M L3). The 45nm quad-core Nehalem is 268mm2, or 67mm2 per core including 256K L2 and 2M L3. This is a serious problem for the Opteron core, which has only a slightly smaller die size per core. The single core (integer) performance is too far below Nehalem (Core 2 as well) and the die size is too large to compete on aggregate throughput performance.
Per the discussion below on core size, chip-architect has a diagram of 32nm core size with and without L2; I will look for one on 45nm cores.
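To make the comparison concrete, the sketch below divides the quoted die sizes by core count and then normalizes the "nearly twice" integer performance estimate by area. The relative performance values are my rough estimate from the text (Opteron = 1.0, Nehalem = 2.0), not measured data, and the per-core area includes each core's share of cache and uncore.

```python
# (die area in mm2, core count), as quoted in the text
DIES = {
    "Istanbul (45nm)": (346, 6),
    "Nehalem (45nm)": (268, 4),
}

# Assumption: relative integer performance per core, with "nearly twice" taken as 2.0
REL_PERF = {"Istanbul (45nm)": 1.0, "Nehalem (45nm)": 2.0}


def area_per_core(name: str) -> float:
    """Die area divided evenly across cores, including shared cache and uncore."""
    die_mm2, cores = DIES[name]
    return die_mm2 / cores


for name in DIES:
    area = area_per_core(name)
    perf_per_100mm2 = REL_PERF[name] / area * 100
    print(f"{name}: {area:.0f}mm2 per core, {perf_per_100mm2:.2f} relative performance per 100mm2")
```

Under these assumptions Nehalem delivers roughly 1.7 times the integer performance per unit of die area, which is the gap the paragraph above describes.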