THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Linchi Shea

Checking out SQL Server via empirical data points

Performance impact: the cost of NUMA remote memory access

These days if you get a new server-class machine to run SQL Server, you can almost be 100% sure that it’ll be running on NUMA hardware. The recent AMD Opteron and Intel Nehalem-based processors are all built on some form of NUMA architecture.

The current consensus is that as the number of processors grows, their shared memory bus can easily get congested and becomes a major impediment to scalability. NUMA hardware solves this scalability challenge by dividing the processors into groups, with each group having a private memory bus (or memory controller). Such a group of processors is often built into a single physical package, and known as a socket. By limiting the number of processors inside a group (or a socket), the congestion on the private memory bus can therefore be controlled. A NUMA hardware-based server scales by increasing the number of such processor groups and connect them via specially-designed high-speed interconnect. In AMD Opteron, the interconnect among the NUMA nodes is called HyperTransport, and in Intel Nehalem it’s called QuickPath interconnect.

Of course, there is a cost of connecting these sockets whether it’s with HyperTransport or with QuickPath. And that is, accessing memory on another NUMA node via the interconnect is slower than accessing memory local to the same NUMA node without going over an interconnect. Accessing memory on another NUMA node is called remote memory access, whereas accessing memory on the same NUMA node is called local memory access.

The exact cost differential between remote memory access and local memory access varies across processors and processor clock frequencies. For instance, this Dell whitepaper has some test results on the Xeon 5500 processors, showing that local memory access can have 40% higher bandwidth than remote memory access, and the latency of local memory access is around 70 nanoseconds whereas remote memory access has a latency of about 100 nanoseconds.

On your system, you may not be able to run a memory benchmark to measure the local to remote memory access differential. But you can get a rough idea using a free Sysinternals tool called CoreInfo, which you can download from Microsoft. In particular, the CoreInfo output has a section on approximate cross-NUMA node access cost. On my test server, it looks like the following:

Calculating Cross-NUMA Node Access Cost...
Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01
00: 1.0 1.2
01: 1.3 1.0

You may get slightly different results across multiple runs of CoreInfo even on the same system. But the above result shows that, for instance, accessing memory node 00 from node 01 is about 30% more expensive than accessing memory local to node 00.


The question that a SQL Server professional may ask is: how should I expect this local-to-remote memory access cost differential to show up in a SQL Server workload?

The exact impact, of course, depends on many factors. Nonetheless, it would be nice to get a feel for the worst case scenario.

A worst case scenario is when SQL Server has to do 100% remote memory access in order to process a query. That is, the workers that process the queries from a workload are all running on one NUMA node, while the pages for the database are all cached in memory local to another NUMA node.

I did some search to see what might have been said on this subject, but could not find any published empirical data points that show the performance impact of remote memory access on a SQL Server workload. So I decided to construct a test and find it out myself. This post reports the findings from the test.

Test Environment

A brief summary of the test environment is in order. The test server is an HP ProLiant 360 G7 with two Intel Xeon X5690 (Westmere-EP) processors at 3.47GHz. Since hyperthreading is enabled, each socket has 12 logical processors. The total physical memory is 64GB in eight 8GB DDR3 modules, four modules (or 32GB in total) per socket. Each X5690 processor (or socket) has six cores ad 12MB of L3 cache. The OS is Windows Server 2008 R2 Enterprise X64 Edition with Service Pack 1. The DBMS is Microsoft SQL Server 2008 R2 Enterprise X64 Edition (10.50.1600). Max server memory is set to 50,000MB for the SQL Server instance.

Test Setup

To prepare for the test, the steps outlined below are followed:

  1. Follow the BOL article,  How to: Configure the Database Engine to Listen on Multiple TCP Ports, to configure the SQL Server instance to listen on two additional ports: 50001 and 50002.
  2. Follow the BOL article, How to: Map TCP/IP Ports to NUMA Nodes, to configure the SQL Server instance to affinitize connections to port 50001 to node 1, and connections to port 50002 to node 2.
  3. Restart the SQL Server instance, and verify that it’s listening on ports 50001 and 50002.
  4. Connect to the SQL Server instance on port 50001 by specifying NYCSQL01,50001 (assuming that the SQL Server instance name is NYCSQL01), and execute the attached SQL script to create the test tables and populate them with the test data. the main test table tran_stats is populated with 50,000,000 rows, and is about 9GB in size. Note that each NUMA node has 32GB of physical memory, so there is more than enough room to cache the entire test table on node 1.
  5. Use the perfmon counters under SQLServer:Buffer Node (primarily Database Pages) to verify that the test table is entirely loaded into memory local to node 1

Two Test Scenarios

With all the test data cached on node 1, we run two test scenarios:

  • 100% local memory access. Limit query processing to the cores on node 1 only. We accomplish this by connecting the workload driver to port 50001 only.
  • 100% remote memory access. Limit query processing to the cores on node 2 only. We accomplish this by connecting the workload driver to port 50002 only.

Now, from my previous post and from Bob Dorr’s post, we know that we cannot have query parallelism if we want to have the control outlined above. So for this test, maxdop is set to 1 throughout. And while we are running the test, we watch the perfmon counters under SQLServer:Buffer Node to verify that indeed we are getting only local memory access or only remote memory access, depending on which scenario is being tested.

The two key counters to watch are:

  • SQLServer:Buffer Node\Local node page lookups/sec, and
  • SQLServer:Buffer Node\Remote node page lookups/sec

In the case of 100% local memory access, perfmon counter SQLServer:Buffer Node\Remote node page lookups/sec should have a zero value. And it should be zero for all the nodes because we don’t have any other activity on the SQL Server instance. In the case of 100% remote memory access, perfmon counter SQLServer:Buffer Node\Local node page lookups/sec is expected to have a zero value for all the nodes.

Test Queries

We use the same reporting query as we used in some of the previous posts (such as this one on hyperthreading). The query is reproduced as follows:

select COUNT(*) from ( 
   select id_server_config,id_thread_name,id_tran_type,id_tran_status,id_users,id_cpus, 
             COUNT(*) as cnt 
  from tran_stats t1 join dimension t2 
      on t1.server_config = t2.server_config and 
           t1.thread_name = t2.thread_name and 
           t1.tran_type   = t2.tran_type and  
           t1.tran_status = t2.tran_status and 
           t1.users       = t2.users and 
           t1.cpus        = t2.cpus 
group by id_server_config,id_thread_name,id_tran_type,id_tran_status,id_users,id_cpus 
) as v
option (maxdop 1)

To ensure that the processors are sufficiently loaded, this query is run in streams over multiple concurrent connections to port 50001 for the 100% local memory access scenario, and to port 50002  for the 100% remote memory access scenario. Since we are interested in the worst case scenario, we need to test various numbers of concurrent streams and settle on one that shows the largest local-to-remote memory access differential. In our case, that number is 32, i.e. with 32 concurrent query streams, we see the largest performance difference between 100% local memory access and 100% remote memory access.

Test Results

The following chart shows the largest throughput difference in terms of queries per hour between 100% local memory access and 100% remote memory access:


As I mentioned in my previous post on SQL Server NUMA behavior, I ran into two surprises when playing with NUMA affinity. One is that NUMA affinity doesn’t work with query parallelism. The other surprise, to a less extent, is that the cost differential between local memory access and remote memory access is not as large as I expected from reading all the published materials on the Nehalem-based NUMA architecture. In fact, no matter where you read it, you’d always be warned of the cost of remote memory access, and that often would be the very first thing you’d be told.

So it’s no surprise that I’d be surprised by the test result presented in the above chart. More specifically, the largest difference I could find between 100% local memory access and 100% remote memory access is about 5%, and I was definitely expecting more than that for this particular 100% cached workload.

This in a way is good news because that means we probably shouldn’t worry too much about data load not being balanced across the NUMA nodes, especially when we have some other performance issues to contend with.


Appendix. Code for loading the test data

drop table tran_stats
create table dbo.tran_stats (
    server_config     varchar(100),
    users             int,
    cpus              smallint,
    thread_name       varchar(20),
    tran_count        int,
    tran_type         varchar(20),
    tran_status       varchar(20),
    tran_begin        datetime,
    tran_end          datetime,
    resultset_size    smallint,
    filler            char(100))
set nocount on
truncate table tran_stats
declare @i int = 1
declare @batch_size int = 100000
begin tran
while @i <= 50000000
       insert into tran_stats with (tablock) (server_config,users,cpus,thread_name,tran_count,tran_type,tran_status,tran_begin,tran_end,resultset_size,filler)
       select 'X580 G' + right('0' + cast((@i % 40) as varchar(5)), 2),
               ((@i % 5) + 1) * 10,
               ((@i % 8) + 1) * 10,
               'VC' + cast(@i % 100 as varchar(15)),
               case when @i % 3 = 0 then 'READ-ONLY'
                    when @i % 3 = 1 then 'WRITE-ONLY'
                    when @i % 3 = 2 then 'READ-WRITE'
               case when @i % 3 = 0 then 'SUCCESS'
                    when @i % 3 = 1 then 'FAILURE'
                    when @i % 3 = 2 then 'ROLLBACK'
                 DATEADD(second, @i % 50, getdate()),
                 @i % 100,
              if @i % @batch_size = 0
                  commit tran
                  begin tran
              select @i = @i + 1
if @@TRANCOUNT > 0
commit tran
create clustered index cix_tran_stats on tran_stats(tran_begin)
drop table Dim_server_config
create table Dim_server_config (
   id_server_config int identity,
   server_config  varchar(100)
drop table Dim_thread_name
create table Dim_thread_name (
   id_thread_name int identity,
   thread_name  varchar(20)
drop table Dim_tran_type
create table Dim_tran_type (
   id_tran_type int identity,
   tran_type  varchar(20)
drop table Dim_tran_status
create table Dim_tran_status (
   id_tran_status int identity,
   tran_status  varchar(20)
drop table Dim_users
create table Dim_users (
   id_users int identity,
   users int
drop table Dim_cpus
create table Dim_cpus (
   id_cpus int identity,
   cpus int
insert Dim_server_config(server_config)
select distinct server_config 
  from (select top 100000 server_config from tran_stats with (nolock)) as v
insert Dim_thread_name(thread_name)
select distinct thread_name 
  from (select top 100000 thread_name from tran_stats with (nolock)) as v
insert Dim_tran_type(tran_type)
select distinct tran_type 
  from (select top 100000 tran_type from tran_stats with (nolock)) as v
insert Dim_tran_status(tran_status)
select distinct tran_status 
  from (select top 100000 tran_status from tran_stats with (nolock)) as v
insert Dim_users(users)
select distinct users 
  from (select top 100000 users from tran_stats with (nolock)) as v
insert Dim_cpus(cpus)
select distinct cpus 
  from (select top 100000 cpus from tran_stats with (nolock)) as v
if OBJECT_ID('dbo.dimensions') is not NULL
   drop table dbo.dimensions
select distinct * into dbo.dimensions
  from Dim_server_config, Dim_thread_name, Dim_tran_type, Dim_tran_status, Dim_users, Dim_cpus
Published Monday, January 30, 2012 7:17 PM by Linchi Shea

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS



GrumpyOldDBA said:

Is it possible to disable NUMA or is it built into the Hardware? The reason I ask is that a couple of years back I wanted servers with lots of memory but not lots of cpus, e.g. 1TB ram with 2 or 4 sockets, I was told that due to the architecture I could only use 1TB of ram with 8 sockets. I was never able to get a dcefinitive answer on this other than some discussions on an Oracle forum where there was some discussion of NUMA degrading performance and disabling it thus making memory flat across all sockets.  I could never find anyone to ask about this interesting subject - which I'm glad you're blogging. My interest point is how performance may be impacted when databases are say > 10TB and the server has 1TB of ram. ( assume my database would be running with parallelism )

I sort of have an eye on this as at some point we'll be wanting to upgrade our 16 core prod servers and I'm unsure of which way to go, so you whole series of posts have been extremely useful, thank you

January 31, 2012 3:32 AM

Linchi Shea said:

You can't disable NUMA at the hardware level. However, many machines have a BIOS setting that allows node interleaving, which effectively presents memory as if it were on a SMP machine. By default, this setting is disabled, and I don't know any real-world scenario where disabling NUMA at BIOS provides a performance advantage. But I have not played with this, and don't have any first-hand experience. Nor am I motivated to spend any time on it because I really don't see any real chance that I'd be disabling NUMA this way.

It makes sense that you have to go to a certain number of processors to get a certan amount of memory because there is a limited number of DIMM count per cartridge per processor socket, and there is a maximum size per DIMM. In particular, if you go back a few years when there were no 16GB/32GB DIMMs, you would have to have more sockets to support more DIMMs, therefore more total RAM.

January 31, 2012 12:31 PM

Greg Linwood said:

An excellent article Linchi, I think the best I've ever read from your blog. It's not only interesting content but very well presented covering background info, describes the test well & a good analysis of the results. The use of a commonly used server also puts it in nice perspective for most readers.

I'm not surprised by the results as we've never detected any significance from NUMA in SQL performance work our team does, but we do sometimes get asked by clients who are suspicious as to whether NUMA is a factor. I'll point them to this article for further reading, thx..

January 31, 2012 8:47 PM

Linchi Shea said:

Thanks Greg! I actually also tried OLTP workloads (e.g. TPC-C) in the same test setup, and didn't see any difference at all. It's more intutive with OLTP than with reporting in that the time it takes to get a page from another node accounts for an extremely small fraction of the total response time in an OLTP query. But for a reporting workload that doesn't return much data to the client, it's all about page lookups on the server 9actually massive number of page lookups in the range of 170,00~200,000 page lookups/sec), and we know that a remote lookup is more than 5% more expensive than a local page lookup in such a machine, thus the surprise.

February 1, 2012 8:14 AM

Greg Linwood said:

200k page lookups/sec is fairly high for OLTP but I think the DL360 G7 could probably stretch further than this, though I'm in Nepal this week & don't have easy access to our perf repository to verify.

How large is the DB after the setup / load script is run? I'm wondering what cache ratio you're getting with 64GB RAM & 50GB Max Server Memory & whether this is potentially a factor, or whether any other bottlenecks might be obfuscating?

February 1, 2012 12:09 PM

Linchi Shea said:

As mentioned, I was interested in the worst case scenario. For that, I didn't want any other resource bottleneck that may obscure the impact of remote memory access. For instance, if SQL Server has to wait for a page to be retrieved from disk, that would be an eternity compared to retrieving a page from a remote node, and in such as scenario you would most definitely not see any impact from remote memory access in the final performane measure (whatever tha may be). To answer your question, I made sure that the entire test table was cached on a single node. Each NUMA node had 32GB of RAM, and the table itself was about 9GB.

Yes, a DL360 G7 can handle more than 200,000 remote page lookups in terms of raw throughput because 200,000 pages/sec translates into about 1.5GB/sec, far below the theoretical throughput of QuickPath interconnect. But in this case, SQL Server was doing more than just looking up pages.

February 1, 2012 1:42 PM

Bill said:

Excellent post. One other area around NUMA and performance that has come up recently is the use of trace flag 834 (Large Page Allocations). On large memory machines with NUMA that have the trace flag enabled and the max_server_memory is considerabley less than the physical memory of the box, we have seen an uneven distribution of memory across numa nodes. You can review you SQLServer\BufferNode counters to see that some nodes have 0 target pages.

The resolution is to dedicate the box to SQL Server (these type of boxes propably are ready are), increase the max_server_memory or remove trace flag 834. Then you will see the memory evenly distributed across NUMA nodes.

Bill -- Microsoft CSS

February 4, 2012 6:08 PM

Ewald Cress said:

Thanks for this fantastic post, Linchi. I've been puzzling over this issue for quite a while, because while it's easy to understand the theoretical cost of remote access, there are very few explanations of how this actually affects SQL Server as a system. Seems more and more like it's a level of abstraction that we can just ignore in practical terms.

February 6, 2012 3:33 AM

Linchi Shea said:

In my exchange with Greg Linwood above, I mentioned not seeing any difference with an OLTP workload. That's not entirely accurate. I just checked my notes and found there was some small difference, similar to the case of the reporting workload discussed in the main body of this post. In addition, the OLTP workload can actually push up page lookups much higher than 200,000/sec seen in the reporting workload. I'll repeat the OLTP test and post the results later.

February 6, 2012 9:40 AM

Lonny Niederstadt said:

Thanks! Excellent framework for some future tests I'll have to run.  In the context of intensive batch ETL and batch report workloads, I think that disabling NUMA with trace flag 8015 could be quite valuable. Even though the differences between local and remote memory access latency and bandwidth seem intimidating, we'll have to see what the difference leads to in query execution time.  If its small, the benefits of disabling NUMA may still greatly outweigh the memory access benefits for our workload.  

The main reason I'm interested in trace flag 8015 is avoiding the possible "away list" query memory inflation that Bob Dorr describes in the link below.  In extreme cases this might cause pending query memory grants.  At any rate, it could lower the page life expectancy for database blocks already under eviction pressure on the NUMA node due to activity on the node.

January 25, 2013 4:55 PM

That Guy said:

I was under the impression this would have little impact if the memory controller or bus was not stressed?

I expect the advantages of local memory access come about because of increase in available memory channels and therefore bandwidth rather than need to pass through two memory controllers.

So what would happen if we had queries using local memory and queries using remote memory going through the same memory controller. If we did this across all NUMA nodes we would be back in the day of the memory through the northbridge approach and effectively throwing a whole lot of bus bandwidth out.

Would this make any difference to SQL Server. Well - perhaps not. SQL should make heavy use of of the system bus and memory bandwidth because it is heavily concerned with moving data around rather than doing to much work with it. However, in reality do not really know how much is actually moving things around in memory vs CPU time.

Perhaps if we were to look at this topic in more detail we would start seeing where the apparent value in NUMA comes from.

I do note that a Google study on the subject did find NUMA had a "significant" impact of 10-20%. They are dealing with the issue with a fixed real world workload type on servers with 2 and 4 NUMA nodes. Well worth a read - see

August 16, 2013 8:40 PM

Leave a Comment


About Linchi Shea

Checking out SQL Server via empirical data points

This Blog


Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement