At Intel Developer Forum 2009 last week, Microsoft disclosed significant advances in Windows Server 2008 R2 with the elimination of many locks, most prominently the Dispatch Scheduler lock, that impact the ability to scale performance up to and beyond 64 cores. Look for the presentation: Microsoft & Intel Innovations in Hardware and Software to Deliver New Technology Experiences by Mark Russinovich and Shiv Kaushik, Intel Developer Forum session SPCS003 (http://www.intel.com/idf/training-sessions/ or https://intel.wingateweb.com/us09/scheduler/catalog/catalog.jsp), also available as a webcast (http://www.intel.com/idf/pressroom/video.htm).
Earlier I talked about Big-Iron revival with the upcoming Intel Nehalem-EX eight-core processor. Intel will finally have a Xeon processor with high-end scaling potential. The Intel Server Update slide deck by Boyd Davis states that 8 OEMs have 15 system designs for 8-way and larger in the works. Several system vendors, including IBM, NEC and Unisys, have had 8-way to 32-way systems for several Xeon processor generations, but it was always apparent that performance scaling trailed off after 4 sockets. See the IDF session SPCS002 Technology Insight: Intelligent and Expandable High-End Intel Server Platform, Codenamed Nehalem-EX by Stephen Pawlowski for details.
Scaling performance on big-iron NUMA systems involves a complex combination of matching the database architecture to the capabilities of the underlying SQL Server engine, the operating system and hardware architecture. Having major improvements in two pillars of the foundation arrive together is bound to generate excitement and anticipation.
SQL Server 2000 and later
In the old days, it was possible but very difficult to scale SQL Server 2000 on the contemporary hardware and operating system of that time. Many of the difficulties were in the SQL Server engine, but there were limitations with the Windows operating system and hardware as well. Certain operations in SQL Server 2000 could scale well beyond 4 cores or over multiple NUMA nodes. Other operations did not, and some operations even had severe negative scaling beyond 4 cores.
In order to scale on NUMA systems, it was necessary to design the database cluster keys and non-clustered indexes, and to write the SQL queries, so that the execution plan avoided problematic operations. Almost none of the details for this have ever been published (mine were originally published elsewhere, but I will try to collect them on my website www.qdpma.com). Apparently vendors do not like to tell customers they should completely re-architect. As a database architect consultant, I do not see why this subject is so taboo. It's just a simple matter of hiring a really expensive database architect consultant for 8-10 weeks, right? People should do this more often, and this is my completely unbiased opinion.
Many developers and even data architects do not have adequate skills for generating efficient execution plans on ordinary SMP systems, let alone manipulating the execution plan to match a specific NUMA system architecture.
People have just tried to put an existing application on a big-iron system without any consideration for redesigning the database architecture, or even using execution plan hints. Usually this leads to an uneven outcome, and frequently to severe problems specific to NUMA systems. I believe enough people encountered such problems that a general awareness developed to avoid big-iron systems. In the last few years, I have encountered few people contemplating the purchase of a big-iron system, and the system in question was usually the ProLiant DL785.
With SQL Server 2005, many scaling problems in the database engine were resolved. Service Pack 2 provided another round of significant fixes. SQL Server 2008 introduced data warehouse performance enhancements, but I found these to be problematic and would sometimes rewrite a query to force the 2005 execution plans.
Windows Server Operating System
Microsoft makes ongoing enhancements to the core server operating system to improve performance scaling over many processors. Improvements were made from Windows 2000 to 2003, but a change in the handling of interrupts caused performance issues in large systems. This was resolved in a post-SP1 hot-fix. At Microsoft WinHEC 2007, disk I/O handling in NUMA systems was discussed, but it was unclear whether this was a Windows Server 2008 RTM or R2 feature.
The Windows Server 2008 R2 extension to handle more than 64 processors will get most of the press, but the things that get less publicity, i.e., that are not understood by press people, are equally important.
See the following WinHEC presentations for more details:
WinHEC 2008
ENT-T554 Windows Support For Greater Than 64 Logical Processors by Arie van der Hoeven
ENT-T555 Scaling More Than 64 Logical Processors: A SQL Perspective by Alex Verbitski and Pravin Mittal
SVR-T332 NUMA I/O Optimizations by Bruce Worthington (link provided by Konstantin Korobkov below, thanks)
The WinHEC ENT-T555 presentation mentions OLTP scaling of 1.7X from 64 to 128 Logical Processors. The IDF presentation states 1.7X scaling from 128 to 256 LP, but it is very possible these two presentations do not reference the same baseline.
Even though the core elements will soon be in place to enable broadly scalable performance on big-iron systems, the expectation is that it will take time for the SQL Server engine team and the Windows operating system team to build enough experience on the Nehalem-EX NUMA platforms to make all of this work together out of the box. In the meantime, there are a handful of consultants with deep NUMA performance tuning experience who can make this happen as is (not to be construed as a solicitation for services).
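To put the 1.7X figures in perspective, here is a minimal sketch of the arithmetic, not anything taken from either presentation. The function name and the "efficiency" measure (achieved speedup divided by ideal linear speedup) are my own choices for illustration. The last calculation assumes, purely hypothetically, that the two results could be chained on a common baseline, which, as noted, may well not be the case.

```python
def scaling_efficiency(speedup, procs_before, procs_after):
    """Achieved speedup as a fraction of ideal linear scaling."""
    ideal = procs_after / procs_before
    return speedup / ideal

# Figures quoted from the WinHEC and IDF presentations.
winhec_eff = scaling_efficiency(1.7, 64, 128)
idf_eff = scaling_efficiency(1.7, 128, 256)

# Hypothetical only: chain the two steps as if they shared a baseline.
chained_speedup = 1.7 * 1.7
chained_eff = scaling_efficiency(chained_speedup, 64, 256)

print(f"64 -> 128 LP efficiency:  {winhec_eff:.0%}")
print(f"128 -> 256 LP efficiency: {idf_eff:.0%}")
print(f"64 -> 256 LP (hypothetical chain): {chained_speedup:.2f}X, {chained_eff:.0%}")
```

Each doubling at 1.7X is 85% of linear. If the two steps did compound, the four-fold increase in logical processors would deliver a bit under 3X.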
AMD crowed loudly that Opteron with its integrated memory controller and Hyper-Transport scaled memory and inter-processor bandwidth with the number of processors, while the Intel systems up to the Xeon 7400 series were constrained by a shared processor front-side bus and the fixed memory bandwidth of a discrete memory controller. See for example excerpts from: SQL Server 2005 and AMD64 –a winning team.
With the announcement of the six-core Opteron, codename Istanbul, we find out that AMD previously did not have a mechanism for maintaining cache coherency comparable to the Snoop Filter in the Intel chipsets. Without this, much of the available bandwidth is consumed by cache coherency traffic, limiting scaling in systems with 4 or more processor sockets. In Istanbul, the HT Assist, or Probe Filter, feature uses 1MB of the 6MB L3 as a directory cache to track cache lines. AMD measured 42GB/s memory bandwidth with HT Assist versus 25.5GB/s without HT Assist.
So there is now an expectation that Opteron systems should have improved scaling in 4-way and larger systems. While HP and Sun have had 8-way Opteron systems since the quad-core Barcelona, only TPC-H data warehouse benchmarks have been published. To date no TPC-C or TPC-E OLTP benchmarks have been published for 8-way Opteron systems, even with the six-core Istanbul with HT Assist. For that matter, no TPC-C or TPC-E benchmarks have been published for 4-way systems with the six-core Istanbul, even though 4-way quad-core Opteron systems have posted respectable results on both TPC-C and TPC-E. It is possible that this is not a simple feature to implement and the first attempt has issues. Hopefully a fixed version will be available before too long and we can see OLTP benchmark results for subsequent generation Opteron systems.
The Itanium processor and system architecture was designed for big system scaling, but the processor has languished at the 90nm dual-core Montvale. The 65nm quad-core Tukwila has encountered multiple delays, apparently now to 2010. Itanium is now mostly positioned as having extensible reliability and availability features (Machine Check Architecture).
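As a rough check on the HT Assist trade-off, the sketch below works out the relative bandwidth gain and the L3 capacity left for data. The variable names are mine, and I am assuming the L3 figures are megabytes. This is only arithmetic on the numbers AMD reported, not a model of how the probe filter works.

```python
# Memory bandwidth figures as reported by AMD (GB/s).
bw_with_ht_assist = 42.0
bw_without_ht_assist = 25.5

# L3 cache sizes in MB (assumed units): total, and the part used as directory.
l3_total_mb = 6
l3_directory_mb = 1

bandwidth_gain = bw_with_ht_assist / bw_without_ht_assist
l3_data_mb = l3_total_mb - l3_directory_mb
l3_given_up = l3_directory_mb / l3_total_mb

print(f"Bandwidth gain with HT Assist: {bandwidth_gain:.2f}X")
print(f"L3 left for data: {l3_data_mb} MB ({l3_given_up:.0%} given up to the directory)")
```

In other words, giving up about one sixth of the L3 buys roughly 1.65X the measured memory bandwidth.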
Per Linchi, see Benchmark Omissions for the Six-Core Intel Xeon AMD Opteron Processors