Back in the 1997 time frame, I gave a presentation projecting when 64-bit operating systems and applications would become pervasive on high volume platforms. My best estimate was that this would be some time in 2003 or 2004. This meant that hardware platforms should be ready in 2001 or 2002 to allow for reasonable software availability, beginning with the operating system and a few key applications. And yet here we are in 2008 with most people still running predominantly 32-bit environments. There is even reluctance to run 32-bit applications on a 64-bit operating system to facilitate the transition to full 64-bit.
First, what do we mean by 64-bit? It used to mean any of: the size of the internal registers, the external data bus, or the address bus. Recently, the term refers mostly to the virtual address space (VAS, sometimes called the linear address space), which requires registers of the same size (or larger) for an efficient implementation. The data bus width went to 64-bit long ago (128-bit in the case of Itanium). Going forward, bus width is no longer relevant because off-chip communications are transitioning to point-to-point links whose width is not related to the internal architecture.
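The jump in addressable range is easy to quantify. A quick sketch (my own illustration, not tied to any particular OS, which will reserve part of the range for the kernel):

```python
# Raw span of a flat virtual address space for common pointer widths.
# Actual usable space is smaller: e.g. 32-bit Windows normally gives
# a process 2GB (3GB with the /3GB switch) of the 4GB total.
for bits in (16, 32, 64):
    print(f"{bits}-bit VAS: {2**bits:,} bytes")
```

The 16 to 32-bit step bought a factor of 65,536; the 32 to 64-bit step buys over 4 billion, which is why this transition should last considerably longer.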
Just as in the 16 to 32-bit transition, people ask whether, or even presume that, 64-bit is faster than 32-bit. Technically, this is neither true nor false by itself; like most things in life, it is somewhat complicated. When applications outgrow the current VAS architecture, a series of convoluted procedures is employed to work around the limited VAS size. In the next larger VAS, these elaborate mechanisms are no longer necessary, which allows simpler operation and possibly better performance. That is, until the next transition is required.
If the entire program resides in L1 cache and there are no differences in the instruction set architecture (ISA), there should be no difference between a 32-bit and a 64-bit program. A 64-bit program has 64-bit pointers, just as a 32-bit program has 32-bit pointers; the size of the pointer is usually tied to the size of the VAS. In the AMD64 ISA (Intel uses the term Intel 64, and IA-64 for Itanium), instructions remain variable length, with operands of 32 or 64 bits; the Itanium ISA packs three instructions into a 128-bit bundle. In any case, even if most operands are 32-bit, the 64-bit program's memory working set is larger because of its 64-bit pointers. This means the 64-bit program can incur more L1 and L2 cache misses. Depending on the latency differences between L1, L2 (L3 in some cases) and any off-die cache or main memory, there will be a performance penalty from the delays in accessing the next instruction. The penalty should be small for L1 misses caught by a very low latency L2, more significant for an on-die L3 where applicable, and much larger for off-die cache or memory accesses.
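The working set inflation is easy to see with a pointer-heavy structure such as a binary tree node. A rough sketch (my own model, assuming the usual ILP32 versus LP64 conventions with natural alignment):

```python
def node_size(ptr_bytes, int_bytes=4):
    """Bytes for a node {int key; node *left; node *right;},
    padded up to pointer alignment."""
    raw = int_bytes + 2 * ptr_bytes
    return (raw + ptr_bytes - 1) // ptr_bytes * ptr_bytes

print(node_size(4))  # 12 bytes with 32-bit pointers
print(node_size(8))  # 24 bytes with 64-bit pointers (4 of them padding)
```

The same tree occupies twice the cache space, so a structure that fit in a 1MB L2 as 32-bit code may spill into L3 or main memory as 64-bit code.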
Now, the AMD64 ISA is not a simple extension of the 32-bit X86 (Intel IA-32) ISA to handle 64-bit linear addressing. The 32-bit X86 ISA has 8 (somewhat) general purpose registers, a liability inherited from the old 16-bit 8086 ISA. The 8086 was not meant to be a high performance microprocessor, having 29K transistors on a 3 micron process. The slightly later Motorola 68000, with 70K transistors on a comparable 3.5 micron process, had 16 registers (8 data, 8 address). The later Intel 80386 had 276K transistors on 1.5 micron, so 130K transistors might have been possible on a mature 3 micron process. RISC microprocessors of the early 1980s went with 32 general purpose registers. Itanium, intended for a 1998-99 introduction, implemented 128 general purpose registers.
The AMD64 ISA features 16 general purpose registers in 64-bit mode, a significant innovation for an extension of the X86 ISA, and one with important performance implications. The 8 register architecture of X86 resulted in binary code in which a sizeable portion of the instructions merely copy contents between registers and memory, back and forth, to free up registers for the immediate task. In the 1990s, it was thought that the X86 ISA incurred a 15-20% performance penalty relative to an otherwise comparable RISC microprocessor with 32 registers, just from the superfluous instructions that the larger register set avoids.
An examination of an open source data compression algorithm with a short code sequence showed that about 10% of the instructions were register to memory copies of a temporary nature. The same code compiled for AMD64 showed no register to memory copies for temporary storage. When compiled for Itanium with optimization, entire loops were unrolled within the available registers. The general idea is that more registers allow more complex code with few avoidable register to memory temporary copies, balanced against the criterion of not slowing register access by adding rarely used registers.
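The effect can be sketched with a toy spill model (entirely my own simplification; a real register allocator is far more sophisticated): every value live beyond the architectural register count forces a copy out to memory.

```python
def spill_copies(live_counts, registers):
    """Forced register-to-memory temporary copies, given the number
    of simultaneously live values at each point in a code region."""
    return sum(max(0, live - registers) for live in live_counts)

# Hypothetical inner loop with up to 12 values live at once.
live = [4, 7, 10, 12, 9, 5]
print(spill_copies(live, 8))    # 7 copies with the 8 x86 registers
print(spill_copies(live, 16))   # 0 copies with the 16 AMD64 registers
```

With 32 or 128 registers, as in the RISC and Itanium designs mentioned above, even much more complex regions stay spill-free.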
The actual performance characteristics of the compression code showed the 32-bit version to be about 1-2% faster than the 64-bit version. There was no difference in performance between the 32-bit program on a 32-bit or a 64-bit OS. It is possible that the temporary memory copies were handled by the L1 cache. Another explanation might have to do with the Intel Pentium 4 NetBurst architecture. NetBurst has 96 (or so) physical registers, even though only 8 are visible to the ISA, with a register renaming scheme implemented along with out of order execution. So it is possible that the temporary register to memory copies never occur, or otherwise cause no penalty.
All of the above discussion supports the premise that 64-bit should show no significant performance difference, better or worse, unless it avoids the complicated contortions that 32-bit code must employ to stay within the 32-bit VAS.
The Windows operating system requires address space for certain memory structures such as the non-paged pool, the paged pool, and System PTEs. This is discussed in detail in various Microsoft documents. A number of environments have shown that the OS can be constrained by the 32-bit size limits.
Now, 32-bit SQL Server can use more than 4GB of physical memory for data buffers. This involves PAE in the 32-bit OS and the AWE feature. AWE memory cannot be used for the stored procedure cache or other functions. Personally, I have never thought that procedure caches should be allowed to become very large. A very large procedure cache is usually an indication that something has gone horribly wrong with the design of the application (not the SQL Server engine). This can happen when the application does not use stored procedures, or uses dynamic SQL within a stored procedure (defeating the purpose of the stored procedure), or when someone decided it would be a good idea to have hundreds or thousands of databases. In any case, such a procedure cache is almost entirely plans that will never be used again, while plans that do need to be reused may get evicted.
In a well designed transaction processing application, the vast majority of physical memory is used for data buffers. The penalty for using AWE memory instead of memory within the VAS appears to be small, probably 5-10%. Still, a particular application that uses address space outside of the 8KB pages managed by the SQL engine may eventually cause severe VAS fragmentation after extensive uptime (or sooner), such that large contiguous spaces are no longer available. This can cause severe performance degradation, or it can trigger the internal garbage collector, which appears as a complete system lockup. These events are not easily revealed in a simple performance test.
It is in large query performance that the benefits of fully 64-bit SQL Server and Windows are most directly observed. A large query needs address space (and memory) to handle intermediate results, particularly for hash and sort operations. This memory must be directly addressable, not in the AWE region. If the intermediate results are too large, they are spooled out to the temp database. In the 32-bit version, the address space that can be allocated for this purpose is limited; in the 64-bit version, far more can be kept in memory before the temp database is required.
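A rough sketch of the spill decision (illustrative numbers of my own, not SQL Server's actual memory grant logic):

```python
def runs_in_memory(intermediate_mb, workspace_mb):
    """True if a hash/sort intermediate result fits in the directly
    addressable workspace; otherwise it spools to the temp database."""
    return intermediate_mb <= workspace_mb

# Assumed workspace: roughly 1,500MB left in a 32-bit VAS after code,
# stacks and caches; tens of GB directly addressable for 64-bit.
for mb in (500, 2_000, 12_000):
    print(mb, "MB: 32-bit", runs_in_memory(mb, 1_500),
          "| 64-bit", runs_in_memory(mb, 48_000))
```

Only the 500MB intermediate result stays in memory on the assumed 32-bit limits; all three do on 64-bit, which is what the TPC-H results below reflect.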
A quick performance test can be done using the TPC-H data generator and the tables, indexes and queries from recently published reports. The scale factor 1 database is used as a control. This database has a Line Item table with just under 1GB of data, and total data plus indexes of 1.7GB, which fits within the 3GB 32-bit address space (or nearly 4GB for 32-bit SQL Server on 64-bit Windows). AWE memory is not enabled. Performance counters indicate no disk activity to data, and nearly none to temp. The second test is the SF 10 database, with a total test size of 17GB.
Surprisingly, SQL Server 2005 64-bit showed about a 10% improvement in both CPU usage and runtime duration (based on the sum of the 22 queries, not the geometric mean required for TPC-H reporting) over SQL Server 2005 32-bit at SF 1. At SF 10, the improvement was about 20% in both CPU and duration. There was very little disk activity to data in both the 32 and 64-bit test runs. There was a moderate level of temp database activity in the 32-bit test and almost none in the 64-bit test.
I have just started SQL Server 2008 performance testing. When I have properly vetted the results, I will discuss in the usual light detail.