
Joe Chang

AMD Bulldozer Comments

AMD released information on its upcoming Bulldozer architecture in conjunction with Hot Chips 22, a less technical version of which is in the Bulldozer press release on the AMD website. The slide deck is at slideshare. The principal new feature is AMD's approach to multi-threaded performance. AMD seems determined not to implement the multi-threading solution adopted by Intel (Hyper-Threading), IBM POWER, and Sun. The new AMD Bulldozer architecture, shown in the diagrams below, has a complete core (what AMD calls a module) that appears as two cores to the operating system. Each sub-core has a dedicated integer scheduler, integer execution units, and L1 data cache. The fetch/decode unit, floating-point unit, and L2 cache are shared by the complete module.

[Bulldozer module architecture diagram]

Additional Bulldozer architecture diagrams:

[Bulldozer architecture diagram]

More detailed Bulldozer core architecture diagrams:

[Bulldozer core architecture diagrams]

The Bulldozer chip-level view shown below has four complete dual-integer-core modules, with the L3 cache and memory controllers shared by all cores.

[Bulldozer chip-level diagram]

The AMD press release mentions that the processor socket will have 16 cores visible to the operating system, so the socket would consist of two chips. Bulldozer then has 33% more cores than the current 12-core Opteron 6100 series (Magny-Cours). AMD stated that 50% better performance is expected, corresponding to roughly 13% better performance per core. This could indicate that the Bulldozer core has micro-architectural improvements, or it could be that AMD is able to get 15-20% better frequency on the 32nm process than Magny-Cours achieves on the 45nm process. Note that the single-die six-core Istanbul could reach 2.8GHz on the 45nm process.
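
As a quick sanity check on that arithmetic, here is a minimal sketch (the 16-core and 50% figures are AMD's statements; the 12.5% result is what rounds to the ~13% quoted above):

```python
# Per-core arithmetic: 16 Bulldozer cores per socket vs. 12 for Magny-Cours,
# with AMD's stated ~50% socket-level throughput gain.
bulldozer_cores, magny_cours_cores = 16, 12
socket_gain = 1.50

core_ratio = bulldozer_cores / magny_cours_cores   # 1.33x more cores
per_core_gain = socket_gain / core_ratio - 1       # ~0.125, i.e. ~13% per core
print(f"{core_ratio:.2f}x cores -> {per_core_gain:.1%} gain per core")
```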

AMD claims the additional integer unit requires 12% extra die space, for a potential 70%(?) performance gain on some applications. A 70% throughput gain for 12% die size would be a spectacular achievement. However, there is a curious matter of Bulldozer die size. The quad-core Shanghai on 45nm is 258mm2 with 6M L3 cache. The six-core Istanbul, also on 45nm, is 346mm2 with 6M L3. A shrink/compaction of quad-core Shanghai to 32nm should have a die size in the range of 130mm2. The 12% for the double integer execution units would only increase the die size to about 145mm2, which is rather small for a high-end server processor, even if the processor socket will have two dies.
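
The die size estimate can be sketched the same way (the 45nm figure is from above; the shrink factor assumes an idealized linear area scaling from 45nm to 32nm, which real processes only approximate):

```python
# Estimated 32nm die size for a Shanghai-class quad-core plus Bulldozer's
# claimed 12% for the second set of integer execution units.
shanghai_45nm_mm2 = 258                   # quad-core Shanghai, 6M L3, 45nm
shrink_factor = (32 / 45) ** 2            # ~0.51 ideal area scaling

shanghai_32nm_mm2 = shanghai_45nm_mm2 * shrink_factor   # ~130 mm2
with_second_int_core = shanghai_32nm_mm2 * 1.12         # +12%, ~145-146 mm2
print(f"32nm shrink ~{shanghai_32nm_mm2:.0f} mm2, "
      f"plus the second integer core ~{with_second_int_core:.0f} mm2")
```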

Intel Hyper-Threading Recap: Simultaneous and Temporal
Several Intel processor architectures have Hyper-Threading, including the Pentium 4 (NetBurst), the more recent Itanium (Montecito forward), and even Atom. The original Pentium 4 implementation was simultaneous multi-threading (SMT). The Itanium implementation is time-slice or temporal multi-threading, while the Nehalem and Atom implementations are SMT.

The general idea in microprocessor architecture has always been to make the best use of the available transistors. Long ago, the objective was single-threaded performance at the processor socket level (and multi-threaded performance at the system level). For the last several years, the objective has shifted to multi-threaded performance within a reasonable power envelope, with consideration for the fact that single-threaded performance is still important.

The ideal microprocessor design has all units uniformly running at maximum load continuously. Of course, the ideal cannot be achieved across a broad range of applications, except perhaps for brief intervals.

In a single-threaded architecture, the pattern accompanying Moore's Law has been that each doubling of the transistor budget could increase performance by about 40%. In a multi-core design, the aggregate throughput can be linear with the number of cores, that is, with the transistor budget. One constraint is the power budget of the combined cores: in many multi-core designs, the clock frequency is restricted to below the top frequency capability of a single core. So multi-core processors can achieve better scaling than the single-thread Moore's Law pattern if an application is effectively multi-threaded.

Early in the original Pentium 4 (Willamette) design phase, it was thought that simultaneous multi-threading (SMT), as it was called before Hyper-Threading, could contribute a 30% throughput performance gain in multi-threaded server applications at a cost of approximately 10% in transistor budget. As it turned out, unanticipated complications resulted in less gain in the key TPC-C benchmark, probably on the order of 10%.

It is possible, or at least suspected, that all of the Pentium 4 HT performance gain can be attributed to the network round-trip portion, with no gain in the core SQL Server engine. In addition, there was erratic behavior in other applications, probably because neither the operating system nor the application was properly designed for an HT processor architecture. This is especially evident in parallel execution plans. The Prescott-based Pentium 4 processor did show a 40-50% performance gain on Quest LiteSpeed database backup compression, probably the highest reported HT performance gain. Hyper-Threading does have potential, but there are issues that need to be worked out.

For the Nehalem generation, an informal statement puts the HT performance gain in the range of 30% for high call volume workloads, and in the range of 10% for DW, but additional details are necessary before making conclusive statements. If the original Willamette die size impact still holds, then the 30% gain in transaction processing for 10% die size is a good trade. However, the 10% gain in DW is only marginal, about the same as having more non-HT cores.

In a test with a custom non-transactional (no locking code) index search engine, an astounding nearly 100% performance gain was observed with HT on a 2-way quad-core Nehalem system. The code was almost entirely a repetitive sequence of a string comparison followed by a memory fetch. If the comparison operation is shorter than a local memory fetch (50-60ns, or 150-180 CPU cycles), then a 100% gain would seem to be possible. Perhaps there are still unresolved HT contention points in the SQL Server engine.
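
A back-of-the-envelope model suggests why a near-2X gain is plausible for this access pattern. This is my own toy model with assumed values for the clock and the comparison cost, not the actual test code:

```python
# Toy model of HT gain when each iteration is a short compare followed by a
# long memory fetch: one thread's compute hides under the other's stall.
CPU_GHZ = 3.0          # assumed Nehalem-class clock
FETCH_NS = 60          # local memory latency quoted above (50-60 ns)
COMPARE_CYCLES = 40    # hypothetical cost of the string comparison

stall = FETCH_NS * CPU_GHZ                      # 180 cycles stalled per fetch
one_thread = 1.0 / (COMPARE_CYCLES + stall)     # iterations per cycle, 1 thread

# Two hardware threads: gain is capped at 2x and by the compute-bound limit.
two_threads = min(2 * one_thread, 1.0 / COMPARE_CYCLES)
print(f"stall = {stall:.0f} cycles, HT gain ~ {two_threads / one_thread:.2f}x")
```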

With the Bulldozer architecture, AMD is asserting that the Fetch/Decode and FP units are under-utilized in previous Opteron architectures. Of course, Bulldozer is targeted at server applications, which tend to emphasize integer performance. Hence the Bulldozer design has two integer cores sharing the under-utilized units.

Preliminary Assessment: AMD Bulldozer versus Intel Westmere-EX and Sandy Bridge

Any assessment at this point is speculative without actual measurement details. The various press releases mention that all three expected 2011 processors (AMD Bulldozer, Intel Westmere-EX at the high end, and Intel Sandy Bridge up to the mid-range) have reached first silicon. It is unknown whether meaningful performance numbers have been generated with the pre-production samples, as some references are to estimated performance and some are to simulations. However, based on the pre-production statements that have been made, which have been reasonably reliable in the past, we can draw some expectations.

At this point in mid-2010, the three main processors are:

  1. Intel Xeon 5600 (Westmere), six-core, 32nm
  2. Intel Xeon 7500 (Nehalem-EX), 8-core, 45nm
  3. AMD Opteron 6100 (Magny-Cours), 2-chip x six-core, 45nm

In 2-way (socket) systems, the Xeon 5680 3.33GHz has 25% better TPC-E (OLTP) performance than the Opteron 6176 2.3GHz, and the two are roughly comparable on TPC-H (DW). In 4-way systems, the Xeon 7560 2.26GHz has 38% better TPC-E than the Opteron 6176. However, the 4-way Opteron has a much lower system cost than the Xeon 7500, for roughly comparable price-performance. The 4-way Opteron price-point advantage would be negated by the SQL Server Enterprise Edition per-processor license cost, but it is a better fit for Standard Edition and CAL licensing. The other factor to consider is that the Xeon 5680 has nearly 2X the single-core performance of the Opteron, and the Xeon 7560 is 35% better.

In the mid-2011 time frame, the expected processors would be:

  1. Intel Sandy Bridge, ?-cores, 32nm
  2. Intel Westmere-EX 10-core, 32nm
  3. AMD Bulldozer, 2-chip x 4 dual-core modules, 32nm

The expectation is that Bulldozer, with a 50% throughput performance gain over Magny-Cours, will be better than the current six-core Westmere, but could be merely comparable to Sandy Bridge if Sandy Bridge has 8 cores with the same single-core performance as each Westmere core. Given that both Westmere and Sandy Bridge are 32nm, an increase in the number of cores at the same power would require either lower frequency or better power efficiency. Westmere and Sandy Bridge would continue to have nearly 2X the single-core performance of Bulldozer.

In 4-way systems, Westmere-EX will have 25% more cores, and should be able to improve over the 2.26GHz frequency of Nehalem-EX. (There was a 50% increase in the number of cores from Nehalem to Westmere, both at 3.33GHz.) This should allow Westmere-EX to maintain an advantage over Bulldozer. AMD will maintain the midway price point between the 2-way and 4-way Intel systems.

I am sure there will be heated (vehement?) arguments over my expectations. But please, I would like to hear quantitative analysis rather than emotional outbursts.

Additional Comments - 2010-09-02

Consider some of the implications of Moore's Law. Suppose we start with a baseline processor with a core size of 50mm2. If the goal is a high-end workstation product, we might target a total silicon budget of 400mm2, in which case we could have 8 cores total. The aggregate throughput performance of the processor socket for a trivially multi-threaded application, that is, one in which there is zero overhead for coordinating threads and no resource contention, is simply the number of cores times the baseline single-core performance, 8 in this case.

Core Size    # of Cores    Core Performance    Aggregate Perf
50mm2        8             1.0                 8.0
100mm2       4             1.4                 5.6
200mm2       2             2.0                 4.0

Now suppose we desire greater single-core performance. The pattern established by Moore's Law at a given manufacturing process is that doubling the transistor budget (die area, to be more precise, as logic and cache memory transistors have very different densities) should yield a 40% performance gain for a single-thread design. The second processor, at 100mm2, should have 40% better performance relative to the baseline. However, we can only fit 4 such cores in 400mm2, for an aggregate throughput performance of 5.6. The third example at 200mm2 should have twice the baseline single-core performance, but only 2 will fit on the die, for an aggregate throughput performance of 4.
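
The table values follow from treating single-core performance as roughly the square root of core area, which is the ~40% gain per doubling pattern described above; a minimal sketch:

```python
# Reproduce the core size vs. aggregate throughput table under the assumption
# that single-core performance scales as the square root of core area.
BUDGET_MM2 = 400    # total silicon budget from the example
BASE_MM2 = 50       # baseline core size, performance defined as 1.0

for core_mm2 in (50, 100, 200):
    cores = BUDGET_MM2 // core_mm2
    core_perf = round((core_mm2 / BASE_MM2) ** 0.5, 1)   # 1.0, 1.4, 2.0
    print(f"{core_mm2}mm2: {cores} cores x {core_perf} = {cores * core_perf:.1f}")
```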

It is evident that throughput-oriented applications with very little multi-threading overhead/contention favor many basic cores, while applications that are not easily multi-threadable favor more powerful cores at the expense of throughput.

Now let's see how Opteron and Nehalem match up with theory. Using the 45nm Opteron core as the reference, the 45nm Nehalem core has nearly twice the integer performance (floating point ratio?). The six-core Istanbul die size is 346mm2, or 58mm2 per core including 1.5M cache (512K L2 and 1M L3). The 45nm quad-core Nehalem is 268mm2, or 67mm2 per core including 256K L2 and 2M L3. This is a serious problem for the Opteron core, which has only a slightly smaller die size per core. The single-core (integer) performance is too far below Nehalem (and Core 2 as well), and the die size is too large to compete on aggregate throughput performance.
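
Putting a rough number on that disadvantage (the die sizes and the ~2X integer ratio are the figures from above; performance per mm2 as the metric is my framing):

```python
# Rough integer performance per unit die area, Nehalem vs. Opteron (45nm).
istanbul_mm2_per_core = 346 / 6    # ~58 mm2 incl. 512K L2 + 1M L3 per core
nehalem_mm2_per_core = 268 / 4     # ~67 mm2 incl. 256K L2 + 2M L3 per core
nehalem_core_perf = 2.0            # estimated, relative to Opteron core = 1.0

ratio = (nehalem_core_perf / nehalem_mm2_per_core) / (1.0 / istanbul_mm2_per_core)
print(f"Nehalem integer performance per mm2 ~ {ratio:.1f}x the Opteron core")
```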

Update 2010-09-17

Per the discussion below on core sizes, chip-architect has a diagram of 32nm core sizes with and without L2; I will look for one for 45nm cores:

http://www.chip-architect.com/news/Llano_vs_SandyBridge_vs_Westmere.jpg

Published Sunday, August 29, 2010 11:46 PM by jchang

Comments

 

Glenn Berry said:

Yet another good analysis, Joe.  I am sure you have seen AnandTech's Sandy Bridge preview.

http://www.anandtech.com/show/3871/the-sandy-bridge-preview-three-wins-in-a-row/1

Even though it focuses on mobile and desktop versions, it is an interesting read. I really hope that Bulldozer is released on time and that it lives up to expectations. If AMD is not a viable competitor to Intel, Intel slows down their release cycles, which hurts all of us.

August 31, 2010 7:58 AM
 

jchang said:

AnandTech and others are saying Sandy Bridge should provide a 10% performance gain at the single-core level. The main new feature is integrated high-performance graphics. The quad/dual-core desktop and mobile versions are supposed to come out first.

If you recall, Intel decided to skip a desktop quad-core Westmere, offering only the dual-core die for mobile and a six-core die for server/workstation + extreme (sometimes with 2 cores disabled).

So I am thinking Intel may elect to skip a 32nm 6/8-core Sandy Bridge-EP for servers, as we are not interested in integrated graphics, and just go straight to a 22nm 8-core Ivy Bridge-EP. This is not their official plan, but when the realities of schedule and manpower come home, practical decisions have to be made.

August 31, 2010 2:48 PM
 

inf64 said:

Some corrections on the core sizes used for Shanghai and Nehalem. The Shanghai single core size without the L2 is 15.6mm^2, while the Nehalem single core size without L2 is 24.4mm^2. The Intel design has much better L3 cache density (SRAM cell size), and therefore the Intel quad-core Nehalem chip has a similar size to the quad-core Shanghai. But looking at the core logic size, Nehalem has a much greater single core size (~50%), and AMD needs exactly 50% more cores (Istanbul) to match that level of performance on average.

September 9, 2010 6:24 AM
 

jchang said:

Thanks, this is great info on the bare core sizes, but it should take 2X the core complexity (size) to get 40% more performance, so 1.5X core complexity should translate, by the Moore's Law pattern, to a 22% performance gain. And yet Nehalem (45nm) 2.93GHz SPECint CPU base is 31.5 (25.6 excluding libquantum) versus 18.3 for the Opteron 8439 2.8GHz (16.3 excluding libquantum). Granted, Intel has a top compiler team, but this is also seen in SQL operations. The net is that AMD really needs to improve core performance for the die size / transistor count.

September 9, 2010 8:33 AM
 

inf64 said:

The 2.93GHz Nehalem (5570) has a 3.33GHz Turbo mode for single-threaded workloads. That's roughly 19% over the 2.8GHz Opteron you used in your comparison. The difference is around 31% with Turbo taken into account (without the inflating libquantum subtest). With Bulldozer, not only will the IPC per core go up, but there will be an equally aggressive (if not better) Turbo mode for single and poorly threaded workloads. So my guess is that Bulldozer, with its complete ISA support and new and improved design, will be on similar/equal footing with Sandy Bridge in SPEC.

September 9, 2010 8:53 AM
 

jchang said:

I picked the 2.93GHz because it was the top frequency at Nehalem launch, not because it's close to the Opteron 2.8GHz frequency. I am sure all Xeon and Opteron processors are thermally constrained and the cores are running well below the electrical design point. I must have missed the Turbo mode; that would be important. Given that Opteron reached 3GHz at 90nm, even at a conservative estimate of 20% per process step, a 32nm part could be at 5GHz, but perhaps 4GHz in turbo might be possible.

Still, I was thinking the difference in performance per die size was because AMD has a synthesized core, while Intel does hand layout.

September 10, 2010 10:58 AM
 

inf64 said:

IMO 4GHz on 32nm with Bulldozer is the maximum clock for desktop and workstation parts (8 cores / 4 modules max). For server parts (16 cores / 8 modules) it should be ~2.5-3GHz max, and with Turbo mode on those same parts ~3.4GHz (<= half the cores inactive for TDP headroom).

As for the synthesized core comment, AFAIK the only design AMD used with this technique is the new Bobcat core. All previous designs (K7/K8/10h) are hand layout (except for L2/L3 cache design and density, where Intel has had a big upper hand until now).

September 11, 2010 7:33 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
