
Joe Chang

Memory Latency and NUMA

It should be intuitively obvious that round-trip memory access latency is one of the most important factors in modern server system architecture for transaction processing databases. Yet this is a topic that no one talks about. Vendors do not want to discuss this because no near or long-term actions are planned. Outsiders cannot write a meaningful article because too much important information is missing. In some respects, there is nothing practical we can do about latency in terms of the memory components that we elect to employ. However, we can influence latency because Intel processors since Nehalem in 2009/10 have integrated memory controllers. Hence all multi-socket systems since then have non-uniform memory access (NUMA), and NUMA is one mechanism that determines latency.

We will start by looking at modern Intel processor architecture and the time scales involved. Then continue by examining system architecture with respect to memory latency. From there, we can do basic calculations on how we expect memory latency to impact transaction processing performance. This basic model still needs to be backed up by real experimental measurements, but there is enough to provide the basis for further investigations. Pending such, one of the important conclusions is that it is time to re-examine fundamental assumptions on server sizing.

Broadwell, Xeon E3 v4

The diagram below is a representation of a modern Intel mainline (desktop and mobile) processor. The proportions are from Broadwell, the 5th generation Intel Core i3, i5, and i7 and Xeon E3 v4, as Sky Lake has a different layout arrangement. The four cores, system agent, and graphics communicate over the interconnect ring. The memory controller and IO are in the system agent, but the signals that go off-chip are in the interface shown as a separate unit.

[Diagram: representation of a 4-core Broadwell client processor]

Never mind the above; I don't think that's how the Broadwell 4-core die is laid out. See below left. Sky Lake is below right.

[Die layouts: Broadwell 4-core (left), Sky Lake (right)]

Intel and system vendors like to talk about specifications such as processor frequency and memory transfer rate. Recent generation Intel processor cores are capable of operating in the high 3 to low 4GHz range. It is quite possible that they could run even higher, but are power constrained.

DDR4 Interlude

The memory interface is DDR4 at 1866 and 2133MT/s. The term used is mega-transfers per second instead of MHz, because the memory clock is cited in MHz and the data transfer rate is eight times that clock. The address and command signals run at one-half the data rate. Using MT/s for the data transfer rate is so much clearer.
(edit 2016-12-26)
In DDR4, there is a memory clock, 266.67MHz for example, an I/O bus clock that is four times higher, 1066.67MHz, and the data transfer rate at double the I/O clock, 2133MT/s.
DDR memory timings are specified at the memory interface and consist of 4 values: CL, RCD, RP and RAS, the last of which is frequently not cited; sometimes only the CAS Latency is cited. The values are in terms of I/O clock cycles.

In database transaction processing, the memory access pattern is largely unpredictable, amounting to random memory row accesses, so the latency is RP + RCD + CL, for Row Precharge, Row-to-Column Delay, and CAS Latency. For registered DDR4-2133, all three values are 15 (does this include the extra cycle for registered memory?). The 2133MT/s corresponds to a 1066.67MHz I/O clock, so each cycle is 0.938ns, and 15 cycles is 14ns. The total memory latency at the memory interface is then about 42ns?
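
As a quick check on that arithmetic, here is a small C sketch computing the random row access latency from the timings. The DDR4-2133 with 15-15-15 timings matches the example above; the values are parameters, not a claim about any particular DIMM.

```c
/* A quick check on the DDR4 arithmetic above. Assumes DDR4-2133 with
   15-15-15 (CL-RCD-RP) timings; these are parameters, not a claim about
   any particular DIMM. */
#include <stdio.h>

int main(void) {
    double data_rate_mts = 2133.0;                /* mega-transfers/sec    */
    double io_clock_mhz  = data_rate_mts / 2.0;   /* I/O clock = half rate */
    double cycle_ns      = 1000.0 / io_clock_mhz; /* ~0.938ns per cycle    */
    int cl = 15, rcd = 15, rp = 15;               /* in I/O clock cycles   */

    /* A random access to a new row must precharge the open row (RP),
       activate the new row (RCD), then issue the column read (CL). */
    printf("I/O clock cycle: %.3f ns\n", cycle_ns);
    printf("random row access, RP+RCD+CL: %.1f ns\n",
           (rp + rcd + cl) * cycle_ns);
    return 0;
}
```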

Cache and Memory Latency

There are applications in which processor frequency and memory bandwidth matter a great deal. But neither is particularly important for transaction processing databases. The diagram below calls out some details between the processor core, various levels of cache, and memory that are more relevant to databases.

[Diagram: core, L1/L2/L3 cache, and memory latencies]

At 3GHz, the processor core cycle time is 0.33ns, the inverse of the frequency. The L1 cache is cited as having 4-cycle latency. The L1 is part of the processor core execution pipeline, so to some degree L1 latency is hidden. L2 cache latency is cited as 12 cycles. It is not certain whether this is fixed in cycles or is actually time-based, something like 4ns.

L3 cache latency probably depends on the number of cores and other circumstances. It is shown here as 40+ cycles. If L3 latency is actually time-based, 15ns for example, then the number of cycles would depend on the core clock rate. I am not sure if the L3 latency incorporates L2 latency. Memory latency is probably somewhat over 50ns, plus the L3 latency. How much of this is at the DRAM chip versus the transmission delay between the processor and DRAM and back?

Igor Pavlov provides both the 7z benchmark and results for many processors at 7-cpu, source code included. 7-cpu lists Haswell L3 at 36 cycles, and memory at L3 + 57ns. Sky Lake L3 is 42 cycles and memory at L3 + 51ns.
(edit 2016-12-26)
(This seems to imply that the transmission delay from processor to memory and back is about 10ns?)
Intel has their own Memory Latency Checker utility. One of the Intel pages shows an example with local node memory latency as 67ns, and remote node at 125ns, which will be used in the examples below.
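
For anyone who wants to measure this on their own hardware, the standard technique, and roughly what 7-cpu and the Intel utility do, is a pointer chase: dependent loads through a randomly permuted array much larger than L3, so the prefetchers cannot help and each load pays the full round trip. A minimal sketch, assuming Linux/POSIX timing and far less rigor than the real tools:

```c
/* Minimal pointer-chase sketch of what latency tools measure: dependent
   loads through a single-cycle random permutation of a 64MB array, far
   larger than L3, so each load is a full memory round trip. Assumes
   Linux/POSIX for clock_gettime; compile with -O2. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (64UL * 1024 * 1024 / sizeof(size_t))  /* 64MB of indexes */
#define ITERS 10000000L

int main(void) {
    size_t *next = malloc(N * sizeof(size_t));
    if (next == NULL) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;
    /* Sattolo's algorithm: a random permutation that is one big cycle,
       so the chase visits every slot and cannot be prefetched. */
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long k = 0; k < ITERS; k++) p = next[p];  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Printing p keeps the compiler from discarding the loop. */
    printf("avg dependent-load latency: %.1f ns (%zu)\n", ns / ITERS, p);
    free(next);
    return 0;
}
```

On a 2-socket system, pinning the thread and the memory allocation to different nodes (with numactl, for example) exposes the local versus remote difference.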

Broadwell EP and EX

The above diagrams were for the desktop and Xeon E3 processors. We are really more interested in the Intel EP/EX processors used in the Xeon E5 and E7 product lines. The latest Xeon E5 and E7 processors are v4, based on the Broadwell core. There are 3 layouts, HCC, MCC and LCC with 24, 15 and 10 cores respectively. Intel provides functional layouts for all three, but the actual die layout is provided only for the LCC model.

The Broadwell HCC representation is shown below. There are 6 rows and 4 columns of cores, 24 cores total. The 24-core model is only available in the E7. The top E5 has only 22 cores enabled. Two columns of cores communicate over the interconnect dual ring (counter-rotating). The two double rings are connected by a switch. The Intel functional diagram actually shows both QPI and PCI-E on the left side.

[Diagram: Broadwell HCC representation, 6 rows by 4 columns of cores]

Below are my representations of the MCC model on the left, with 3 columns of 5 rows for 15 cores, and the LCC model on the right, with 2 columns of 5 rows for 10 cores. In the LCC model, there is no ring switch. Do the PCI-E and memory controllers actually overhang the cores, meaning that the space to the right of the cores is blank? As even the LCC is a high margin product, an exact and efficient fit is not necessary?

[Diagrams: MCC representation (left), LCC representation (right)]

The MCC arrangement, with both QPI and PCI-E on the left side of the switch connected to the left ring and a memory controller on each side of the ring switches, matches the Intel functional layout, but I do not know if there is overhang. Regardless of the actual placement of the controllers, the interface signals for QPI and PCI-E probably do run along the length of the upper edge, and the interface for the memory signals probably runs along most of the lower edge.

I am inclined to believe that L3 latency is higher in the E5/E7 processors, as the path is longer and more complicated. On the LCC die, there are 10 cores plus the memory, QPI and PCI-E controllers on one ring. However, if the desktop and E3 processors have only one ring (single direction), then it is possible that the bidirectional ring in the E5/E7 processors helps keep L3 latency low? Presumably latencies on the MCC and HCC dies are longer than on the LCC because both rings must be checked?

Edit 2017-01-04
Search: Haswell Cluster on Die (COD) Mode, but filter out Call of Duty. An Intel slide deck on this suggests that memory latency is higher when crossing the coherency bus switches.

Xeon E5 v4

Below is a representation of a 2-socket Xeon E5 system with the HCC die. Two of the 24 cores are marked out, as the E5 has a maximum of 22 cores. The E5 has 2 full QPI links, and both are used to connect to the other processor. For a core in the left socket, the memory attached to that socket is local memory, and memory attached to the other socket is remote memory.

[Diagram: 2-socket Xeon E5 system, HCC die]

It should not be difficult to appreciate that there is a large difference in memory access time between local and remote memory nodes. The Intel Memory Latency Checker example has 67ns for local and 125ns for remote memory, which will be the values we use in a following example. I am not certain if these values are for unbuffered or registered memory. Unbuffered memory should have lower latency, but registered memory is available in larger capacities, 64GB versus 16GB.

Xeon E7 v4

Below is a representation of the 4-socket Xeon E7 system. The E7 has 3 full QPI links, one connecting directly to each of the three other processors. All remote processors are then said to be 1-hop away. The difference of significance here between the E5 and E7 systems is that the E5 memory channel connects directly to memory. The E7 connects to a scalable memory buffer (SMB, other names have been used too), that splits into two memory channels on the downstream side. Because there are so few acronyms in effect, the interface from processor to SMB is SMI. The SMB doubles the number of DIMM sites, and in effect, doubles the memory capacity per socket.

[Diagram: 4-socket Xeon E7 system, SMBs between the processors and memory]

This difference in the memory arrangement between processors designed for 2 and 4-socket systems has been a recurring pattern in Intel system architecture for a long time, though it was not present in all systems, and there were variations. In the current generation, there is a four-socket version of the E5, which does not have the SMB in the path to memory, but each processor only has two QPI links, so one of the remote sockets is two-hops away.

Long ago, maximum memory capacity was very valuable for database servers in bringing IO down from impossible levels. The extra latency incurred from the SMB chip was worth the price. Since then, memory configurations have increased to stupendous levels, and data read IO has been reduced to negligible levels. NAND flash has also become very economical, allowing storage to be capable of formerly impossibly high IOPS. Of course, this occurred after it was no longer absolutely essential.

In more recent years, with maximum memory configuration sometimes more than double what is already extravagant, we might not want to incur the extra latency of the SMB?

Highly Simplified Example

Before proceeding, I will take the opportunity to say that the modern microprocessor is an enormously complex entity in so many ways that it defies characterization in anything less than an exceedingly complicated model. The processor has multiple cores. Each core has pipelined, superscalar, out-of-order execution. Each core has dedicated L1 I+D caches and a dedicated unified L2 cache. And there is a decoded micro-op cache as well. The processor has an L3 cache shared among the cores.

That said, I will now provide an example based on an extremely simplified model of the core-processor memory complex.

Suppose we have a fictitious 3GHz processor, cycle time 0.33ns, with a very fast cache, that executes one instruction per cycle when all memory is in cache. Suppose further that memory latency is 67ns, and that long-latency L3 cache effects are not considered.

Suppose we have a similarly fictitious transaction of 10M instructions. If everything is in cache, the transaction completes in 3.33 million nanoseconds, or 3.33 milliseconds, and our performance is 300 transactions per second per core.

Now suppose that 5% (1/20) of instructions require a round-trip memory access before proceeding to the next step. The 0.95 fraction of instructions that hit cache consume 9.5M cycles. The 0.05 fraction of 10M is 0.5M instructions that miss cache. Each of these requires a round-trip memory access of 67ns, or 201 cycles, for 100.5M cycles. The total time to complete the transaction is now 110M cycles. Performance is now 27.27 transactions per second per core instead of 300 tps.

GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3.0 | 67ns | n/a | 1 | 67ns | 201 | 0.05 | 110.0M | 27.27 | -

Now suppose that we have a 2-socket system, meaning two memory nodes, and that we have not specially architected our database to achieve higher memory locality than expected from random access patterns. Any memory access is equally likely to be in either node. The local node memory latency continues to be 67ns and remote node memory access is 125ns. Average memory access is now (67+125)/2 = 96ns, or 288 cycles.

GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3.0 | 67ns | 125ns | 2 | 96ns | 288 | 0.05 | 153.5M | 19.54 | 39.09

Without a database architected to achieve memory locality, we have lost 28% performance per core (1 - 19.54/27.27). Of course, we did double the number of cores, so throughput has increased by 43% (2*19.54/27.27). Alternatively, the performance per core in the single socket system is 39.5% better than in the 2-socket system (27.27/19.54). This magnitude is important. Did your vendors and infrastructure experts forget to mention this?

Now suppose we are on a 4-socket Xeon E7-type system with the same database, so memory access is equally probable to any of the four memory nodes. Local memory access is 25%, and the other 75% is remote to one of the three other sockets. All sockets are directly connected, so all remote nodes are one hop away. Now, recall that the Xeon E7 has a memory buffer in the path between the processor (memory controller) and memory.

Let's suppose that the SMB adds 15ns additional latency. (I do not know what the number really is. It is not free. The magic of doubling memory capacity comes at a price.) Local node memory access is now 82ns and the remote node is 140ns. Average memory access is (82 + 3*140)/4 = 125.5ns, or 377 cycles.

GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3.0 | 82ns | 140ns | 4 | 125.5ns | 377 | 0.05 | 197.75M | 15.17 | 60.68

We have now just lost another 22% performance per core going from a 2-socket E5 to the 4-socket E7 type systems (1 - 15.17/19.54). Total throughput is 55% better than the two-socket (2*15.17/19.54). Performance per core is 28.8% better on the 2-socket than on the 4-socket.

The performance per core on the 1-socket is 80% better than on the 4-socket (27.27/15.17). The 4-socket has 2.22X better throughput than the 1-socket (4*15.17/27.27).

Scaling - NUMA

Below are all three of the above cases in a single table.

GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3.0 | 67ns | n/a | 1 | 67ns | 201 | 0.05 | 110.0M | 27.27 | -
3.0 | 67ns | 125ns | 2 | 96ns | 288 | 0.05 | 153.5M | 19.54 | 39.09
3.0 | 82ns | 140ns | 4 | 125.5ns | 377 | 0.05 | 197.75M | 15.17 | 60.68
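
As a check on the arithmetic, here is the toy model as a small C program. The 67ns/125ns latencies and the 15ns SMB adder are the assumed values from above, not measurements. Looping over 3, 2, and 1GHz also reproduces the frequency table in the next section and the 1GHz figures cited after it.

```c
/* The toy model above as a program: a 10M-instruction transaction, one
   instruction per cycle on a cache hit, 5% of instructions stalling on a
   full memory round trip. The 67/125ns latencies and the 15ns SMB adder
   are the assumed values from the text, not measurements. */
#include <stdio.h>

static void model(double ghz, double local_ns, double remote_ns, int sockets) {
    double instr = 10e6, miss_fraction = 0.05;
    double cycle_ns = 1.0 / ghz;
    /* Random placement: a 1-in-sockets chance of local, the rest remote
       (all remote nodes assumed one hop away). */
    double avg_ns = (local_ns + (sockets - 1) * remote_ns) / sockets;
    double mem_cycles = avg_ns / cycle_ns;
    double total_cycles = instr * (1.0 - miss_fraction)
                        + instr * miss_fraction * mem_cycles;
    double tps_core = ghz * 1e9 / total_cycles;
    printf("%.1fGHz %d-socket: avg mem %5.1fns (%5.1f cyc), %6.2fM cycles, "
           "%5.2f tps/core, %6.2f total tps\n", ghz, sockets, avg_ns,
           mem_cycles, total_cycles / 1e6, tps_core, sockets * tps_core);
}

int main(void) {
    double freq[] = {3.0, 2.0, 1.0};
    for (int i = 0; i < 3; i++) {
        model(freq[i], 67.0,   0.0, 1);  /* 1-socket: all local    */
        model(freq[i], 67.0, 125.0, 2);  /* 2-socket E5            */
        model(freq[i], 82.0, 140.0, 4);  /* 4-socket E7: +15ns SMB */
    }
    return 0;
}
```

The 4-socket case computes as 376.5 memory cycles, which the tables round to 377.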
Scaling - Frequency

We can also do the same calculations based on a similarly fictitious 2GHz processor.

GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
2.0 | 67ns | n/a | 1 | 67ns | 134 | 0.05 | 76.5M | 26.14 | -
2.0 | 67ns | 125ns | 2 | 96ns | 192 | 0.05 | 105.5M | 18.96 | 37.91
2.0 | 82ns | 140ns | 4 | 125.5ns | 251 | 0.05 | 135.0M | 14.81 | 59.26

Notice that we did not lose much performance in stepping down from 3 to 2GHz. We could even further step down to 1GHz and still be at 23.26, 34.78 and 55.36 tot tps for 1, 2, and 4 sockets respectively. It is important to stress that this is based on the assumption of a transaction processing workload having the characteristic of serialized memory accesses.

Disclaimer

All of the above is based on a highly simplified model. Real and rigorous testing should be done before drawing final conclusions. Regardless, there is no way anyone can claim that the difference between local and remote node memory access latency is not important, unless the database has been architected to achieve a high degree of memory locality.

Front-Side Bus Systems

In the days before Intel integrated the memory controller, four processors connected to a memory controller in a system with uniform memory access. The Pentium Pro arrangement of 4P on one bus is represented on the left, although the diagram is actually closer to the 450NX or later.

[Diagrams: Pentium Pro 4-processor shared bus (left), 4-way Xeon 7300 with MCH (right)]

The system on the right represents the 4-way Xeon 7300, each quad-core processor on its own bus connected to the 7300 MCH. Intel had already committed to the Pentium 4 quad-pumped bus in 2000/01. Vendors were expecting a long stable infrastructure, so Intel delayed the switch-over from bus to point-to-point until 2009.

Pre-Nehalem NUMA Systems

A system with 16 or more processors could be built with a custom memory + node controller connecting four processors, memory and also a crossbar. The crossbar in turn connects multiple node controllers to form a system having non-uniform memory access (NUMA).

[Diagram: NUMA system, node controllers connected through a crossbar]

In the old NUMA systems, some SQL Server operations ran fine, and other operations had horrible characteristics, far worse than would be suggested by the remote-to-local node memory latency ratio. So it is possible that there are other NUMA effects with far greater negative impact. Some may have since been resolved, while others may still be present but less pronounced in modern NUMA systems.

Hyper-Threading (HT)

An important fact to notice is that a very high fraction of CPU cycles are stall cycles, in which the processor core does nothing while waiting for a round-trip memory access. This is why Hyper-Threading is highly effective. While one logical processor is waiting for a memory access to complete, the other thread can run. Scaling on the logical processors can be nearly linear for a transaction processing workload.

Note, the first generation of Intel Hyper-Threading was on the Intel Pentium 4 (NetBurst) processors. In that generation, the implementation was two threads running simultaneously, in each clock cycle trying to fill the superscalar execution units. The first generation of HT was problematic. It could have been because it was too aggressive to try to execute two threads simultaneously, or it could have been simply that the Windows operating system and SQL Server engine at the time did not know how to properly use HT. The next generation of Intel processor architecture, Core 2, did not have HT.

Then in Nehalem, HT returned, except that this time, it was a time slice implementation. Only one thread executes on any given cycle. When the executing thread encounters a memory or other long latency operation, the processor core switches to the other thread. If anyone has doubts on HT based on experience or hearsay from the Pentium 4 generation, forget it. The Nehalem and later HT is highly effective for transaction processing workloads. There used to be several SKUs with HT disabled in the Xeon E5/7 v1-3 generations. Pay close attention and pass on the no-HT SKUs.

The question to ask is why Intel does not increase the degree of HT. The generic term is Simultaneous Multi-Threading (SMT). Both IBM POWER and Oracle SPARC processors are or have been at 8-way SMT. Granted, one of the two mentioned that scaling to 8-way SMT was tricky. It is high time for Intel to increase HT to 4-way.

Database Architecture

In the above examples, the simple model suggests that scaling to multiple sockets is poor on the assumption of a transaction processing database without a means of achieving memory locality. (There is supposed to be an HPE whitepaper demonstrating the importance of the SQL NUMA tuning techniques in a properly designed database.) Just what does a database architected for NUMA mean? Naturally, this will have to be expounded in a separate article.

But for now, take a close look at both the TPC-C and TPC-E databases. The TPC-C database has all tables leading with a common key, Warehouse Id, that provides a natural organizational structure. The TPC-E database has 5 transaction tables with a common key value but does not use the identity property. Instead it uses a function that must read, then update, a table to determine the next value.

The Case for Single Socket

Naturally, the database and application should be architected together with the SQL Server NUMA tuning options to support good scaling on multi-socket NUMA systems. If we neglected this in the original design, I am sure many DBA-developers know how well such a suggestion would be received by management.

Is there another option? Well yes. Get rid of the NUMA latency issue with a non-NUMA system. Such a system has a single processor socket, hence one memory node. Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.

Still, a single quad-core processor today is 40 times more powerful than a 4-way system from twenty years ago (100,000 tpm-C per core is probably possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996). The Xeon E3 could probably support many medium sized organizations. Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each). My recollection is that many IO problems went away at the 32-64GB level. And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4's.

But I am really talking about a single-socket Xeon E5. In the v4 generation, we could have up to 22 cores, though we should start by looking at the 10-core E5-2630 v4 at $667, stepping up to the 16-core 2683 v4 at $1846 before going to the 20-22 core models at $3226 and $4938.

 

 

The Xeon E5 has 40 PCI-E gen 3 lanes. It might be convenient if there were a motherboard with 1 PCI-E x8 and 8 x4, because NVMe PCI-E SSDs are more common and economical with the x4 interface. Supermicro does have a UP Xeon E5 motherboard (X10SRL-F) with 4 x8, 2x4 gen3 plus 1 x4 gen2. It only has 8 DIMM sites out of 12 possible with the E5, but that is probably good enough.

Summary

A strong explanation was provided showing why round-trip memory latency is very important in transaction processing. One implication of this is that scaling to multiple sockets is poor due to the NUMA effect. A remedy is to architect the database and application together, working with the SQL Server NUMA tuning options, to achieve locality. Alternatively, give serious consideration to a single-socket, yet still very powerful, system. A second implication is that processor frequency is less important for transaction processing, though it might be important for other aspects. The memory latency effect also supports the argument that Hyper-Threading is highly effective and that Intel really needs to increase the degree of HT.

Addendum

OK, I didn't show why database transaction processing incurs the round-trip memory latency. It has to do with the b-tree index, in which we read through a page to find the right pointer to the next level. We access the memory for that pointer, then read through to find the next pointer. I will try to do a diagram of this later. But if someone can dig through an open source database engine, please send a code example.
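
Pending that diagram, and absent a dig through a real engine, below is a generic C sketch (not taken from any database engine) of the effect: at each b-tree level, the child pointer is not known until the current page has been read, so a lookup on a cold index pays one dependent memory round trip per level. The page layout and fan-out are purely illustrative.

```c
/* Generic sketch, not from any real engine: why a b-tree lookup is a
   chain of dependent memory accesses. The child pointer for the next
   level is unknown until the current page is read, so a lookup on a
   cold index pays one memory round trip per level. Layout and fan-out
   are illustrative only. */
#include <stdint.h>

#define KEYS_PER_PAGE 300   /* plausible fan-out for an 8KB page */

typedef struct Page {
    int is_leaf;
    int nkeys;
    int64_t keys[KEYS_PER_PAGE];
    union {
        struct Page *child[KEYS_PER_PAGE + 1]; /* interior pages */
        int64_t row_id[KEYS_PER_PAGE];         /* leaf pages     */
    } p;
} Page;

/* Returns the row id for key, or -1 if not found. */
int64_t btree_lookup(const Page *page, int64_t key) {
    while (!page->is_leaf) {
        /* The binary search touches one page; after the first access
           to the page these are mostly cache hits. */
        int lo = 0, hi = page->nkeys;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (page->keys[mid] <= key) lo = mid + 1; else hi = mid;
        }
        /* The dependent load: this pointer was not known until the page
           was read, and the child page is unlikely to be in cache, so
           this is the ~70-125ns round trip at every level. */
        page = page->p.child[lo];
    }
    for (int i = 0; i < page->nkeys; i++)
        if (page->keys[i] == key) return page->p.row_id[i];
    return -1;
}
```

A hash index mostly collapses this chain into a single bucket probe, which is consistent with the Hekaton note below.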

Several of the images were updated.

It would be nice if Intel were so helpful as to provide L3 latencies for the Xeon E5 v4 LCC, MCC and HCC models. What are the memory latencies for local and remote nodes in E5 v4? How much latency does the SMB in the Xeon E7 add?

Note that the Xeon E3 and client-side processors use unbuffered memory. While the Xeon E5 can use unbuffered memory, unbuffered DIMMs are currently limited to 16GB, while registered memory is available in capacities up to 64GB.

The Xeon D is for specialized embedded applications and not suited for the single-socket database server. It has only 4 DIMM sites?

Supermicro has an Ultra product line targeting specialized applications. One of the features they call Hyper-Speed. They claim that with very high quality design and components, it is possible to reduce (memory?) latency via low jitter. I would like to know more about this. But the only option is dual-socket Xeon E5, and I am more interested in single-socket. The emphasis seems to be on RHEL, and high frequency trading? There are examples for determining which processor socket a NIC is attached to, and whether a thread is running on a core in that socket. These hardware organization detection tools really should be incorporated into Windows as well. I have tried to use the WMI API from C#, but some things require coding in C or possibly assembly?
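
For the topology part, at least, there are documented Win32 APIs that do not require WMI. Below is a minimal sketch using GetNumaHighestNodeNumber and GetNumaNodeProcessorMaskEx; mapping a NIC to its NUMA node takes additional work through the device tree, which I leave aside.

```c
/* Hedged sketch: enumerating NUMA nodes and their processor masks with
   the documented Win32 calls; no WMI required for this part. Mapping a
   NIC to its node takes more work through the device tree and is left
   aside here. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) return 1;
    for (USHORT node = 0; node <= (USHORT)highest; node++) {
        GROUP_AFFINITY aff;
        /* Which logical processors (and processor group) belong to
           this memory node. */
        if (GetNumaNodeProcessorMaskEx(node, &aff))
            printf("node %u: group %u, mask 0x%llx\n",
                   node, aff.Group, (unsigned long long)aff.Mask);
    }
    return 0;
}
```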

It was stressed that round-trip memory latency impacts transaction processing databases. Column-store DW avoids this problem by marching through memory sequentially. The main intent of Hekaton memory-optimized tables was to eliminate the need for locks. But the other part was the use of a hash index, which happens to reduce the number of memory round-trip operations.

Additional references
Intel Core i7 / Xeon 5500 Series, data source latency (approximate):
L1 cache hit: ~4 cycles
L2 cache hit: ~10 cycles
L3 cache hit, line unshared: ~40 cycles
L3 cache hit, shared line in another core: ~65 cycles
L3 cache hit, modified in another core: ~75 cycles
remote L3 cache: ~100-300 cycles
local DRAM: ~60ns
remote DRAM: ~100ns
D. Levinthal (Intel) paper

Ulrich Drepper paper (freebsd.org copy)

Published Sunday, December 18, 2016 12:30 PM by jchang


Comments

 

RichB said:

As always, thanks :)

December 18, 2016 8:38 PM
 

adeeb said:

nice work

March 6, 2017 8:05 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
