These days if you get a new server-class machine to run SQL Server, you can almost be 100% sure that it’ll be running on NUMA hardware. The recent AMD Opteron and Intel Nehalem-based processors are all built on some form of NUMA architecture.
The current consensus is that as the number of processors grows, their shared memory bus can easily get congested and becomes a major impediment to scalability. NUMA hardware solves this scalability challenge by dividing the processors into groups, with each group having a private memory bus (or memory controller). Such a group of processors is often built into a single physical package, and known as a socket. By limiting the number of processors inside a group (or a socket), the congestion on the private memory bus can therefore be controlled. A NUMA hardware-based server scales by increasing the number of such processor groups and connect them via specially-designed high-speed interconnect. In AMD Opteron, the interconnect among the NUMA nodes is called HyperTransport, and in Intel Nehalem it’s called QuickPath interconnect.
Of course, there is a cost of connecting these sockets whether it’s with HyperTransport or with QuickPath. And that is, accessing memory on another NUMA node via the interconnect is slower than accessing memory local to the same NUMA node without going over an interconnect. Accessing memory on another NUMA node is called remote memory access, whereas accessing memory on the same NUMA node is called local memory access.
The exact cost differential between remote memory access and local memory access varies across processors and processor clock frequencies. For instance, this Dell whitepaper has some test results on the Xeon 5500 processors, showing that local memory access can have 40% higher bandwidth than remote memory access, and the latency of local memory access is around 70 nanoseconds whereas remote memory access has a latency of about 100 nanoseconds.
On your system, you may not be able to run a memory benchmark to measure the local to remote memory access differential. But you can get a rough idea using a free Sysinternals tool called CoreInfo, which you can download from Microsoft. In particular, the CoreInfo output has a section on approximate cross-NUMA node access cost. On my test server, it looks like the following:
Calculating Cross-NUMA Node Access Cost...
Approximate Cross-NUMA Node Access Cost (relative to fastest):
00 01
00: 1.0 1.2
01: 1.3 1.0
You may get slightly different results across multiple runs of CoreInfo even on the same system. But the above result shows that, for instance, accessing memory node 00 from node 01 is about 30% more expensive than accessing memory local to node 00.
Objective
The question that a SQL Server professional may ask is: how should I expect this local-to-remote memory access cost differential to show up in a SQL Server workload?
The exact impact, of course, depends on many factors. Nonetheless, it would be nice to get a feel for the worst case scenario.
A worst case scenario is when SQL Server has to do 100% remote memory access in order to process a query. That is, the workers that process the queries from a workload are all running on one NUMA node, while the pages for the database are all cached in memory local to another NUMA node.
I did some search to see what might have been said on this subject, but could not find any published empirical data points that show the performance impact of remote memory access on a SQL Server workload. So I decided to construct a test and find it out myself. This post reports the findings from the test.
Test Environment
A brief summary of the test environment is in order. The test server is an HP ProLiant 360 G7 with two Intel Xeon X5690 (Westmere-EP) processors at 3.47GHz. Since hyperthreading is enabled, each socket has 12 logical processors. The total physical memory is 64GB in eight 8GB DDR3 modules, four modules (or 32GB in total) per socket. Each X5690 processor (or socket) has six cores ad 12MB of L3 cache. The OS is Windows Server 2008 R2 Enterprise X64 Edition with Service Pack 1. The DBMS is Microsoft SQL Server 2008 R2 Enterprise X64 Edition (10.50.1600). Max server memory is set to 50,000MB for the SQL Server instance.
Test Setup
To prepare for the test, the steps outlined below are followed:
- Follow the BOL article, How to: Configure the Database Engine to Listen on Multiple TCP Ports, to configure the SQL Server instance to listen on two additional ports: 50001 and 50002.
- Follow the BOL article, How to: Map TCP/IP Ports to NUMA Nodes, to configure the SQL Server instance to affinitize connections to port 50001 to node 1, and connections to port 50002 to node 2.
- Restart the SQL Server instance, and verify that it’s listening on ports 50001 and 50002.
- Connect to the SQL Server instance on port 50001 by specifying NYCSQL01,50001 (assuming that the SQL Server instance name is NYCSQL01), and execute the attached SQL script to create the test tables and populate them with the test data. the main test table tran_stats is populated with 50,000,000 rows, and is about 9GB in size. Note that each NUMA node has 32GB of physical memory, so there is more than enough room to cache the entire test table on node 1.
- Use the perfmon counters under SQLServer:Buffer Node (primarily Database Pages) to verify that the test table is entirely loaded into memory local to node 1
Two Test Scenarios
With all the test data cached on node 1, we run two test scenarios:
- 100% local memory access. Limit query processing to the cores on node 1 only. We accomplish this by connecting the workload driver to port 50001 only.
- 100% remote memory access. Limit query processing to the cores on node 2 only. We accomplish this by connecting the workload driver to port 50002 only.
Now, from my previous post and from Bob Dorr’s post, we know that we cannot have query parallelism if we want to have the control outlined above. So for this test, maxdop is set to 1 throughout. And while we are running the test, we watch the perfmon counters under SQLServer:Buffer Node to verify that indeed we are getting only local memory access or only remote memory access, depending on which scenario is being tested.
The two key counters to watch are:
- SQLServer:Buffer Node\Local node page lookups/sec, and
- SQLServer:Buffer Node\Remote node page lookups/sec
In the case of 100% local memory access, perfmon counter SQLServer:Buffer Node\Remote node page lookups/sec should have a zero value. And it should be zero for all the nodes because we don’t have any other activity on the SQL Server instance. In the case of 100% remote memory access, perfmon counter SQLServer:Buffer Node\Local node page lookups/sec is expected to have a zero value for all the nodes.
Test Queries
We use the same reporting query as we used in some of the previous posts (such as this one on hyperthreading). The query is reproduced as follows:
select COUNT(*) from (
select id_server_config,id_thread_name,id_tran_type,id_tran_status,id_users,id_cpus,
COUNT(*) as cnt
from tran_stats t1 join dimension t2
on t1.server_config = t2.server_config and
t1.thread_name = t2.thread_name and
t1.tran_type = t2.tran_type and
t1.tran_status = t2.tran_status and
t1.users = t2.users and
t1.cpus = t2.cpus
group by id_server_config,id_thread_name,id_tran_type,id_tran_status,id_users,id_cpus
) as v
option (maxdop 1)
To ensure that the processors are sufficiently loaded, this query is run in streams over multiple concurrent connections to port 50001 for the 100% local memory access scenario, and to port 50002 for the 100% remote memory access scenario. Since we are interested in the worst case scenario, we need to test various numbers of concurrent streams and settle on one that shows the largest local-to-remote memory access differential. In our case, that number is 32, i.e. with 32 concurrent query streams, we see the largest performance difference between 100% local memory access and 100% remote memory access.
Test Results
The following chart shows the largest throughput difference in terms of queries per hour between 100% local memory access and 100% remote memory access:
As I mentioned in my previous post on SQL Server NUMA behavior, I ran into two surprises when playing with NUMA affinity. One is that NUMA affinity doesn’t work with query parallelism. The other surprise, to a less extent, is that the cost differential between local memory access and remote memory access is not as large as I expected from reading all the published materials on the Nehalem-based NUMA architecture. In fact, no matter where you read it, you’d always be warned of the cost of remote memory access, and that often would be the very first thing you’d be told.
So it’s no surprise that I’d be surprised by the test result presented in the above chart. More specifically, the largest difference I could find between 100% local memory access and 100% remote memory access is about 5%, and I was definitely expecting more than that for this particular 100% cached workload.
This in a way is good news because that means we probably shouldn’t worry too much about data load not being balanced across the NUMA nodes, especially when we have some other performance issues to contend with.
Appendix. Code for loading the test data
drop table tran_stats
go
create table dbo.tran_stats (
server_config varchar(100),
users int,
cpus smallint,
thread_name varchar(20),
tran_count int,
tran_type varchar(20),
tran_status varchar(20),
tran_begin datetime,
tran_end datetime,
resultset_size smallint,
filler char(100))
go
set nocount on
go
truncate table tran_stats
go
declare @i int = 1
declare @batch_size int = 100000
begin tran
while @i <= 50000000
begin
insert into tran_stats with (tablock) (server_config,users,cpus,thread_name,tran_count,tran_type,tran_status,tran_begin,tran_end,resultset_size,filler)
select 'X580 G' + right('0' + cast((@i % 40) as varchar(5)), 2),
((@i % 5) + 1) * 10,
((@i % 8) + 1) * 10,
'VC' + cast(@i % 100 as varchar(15)),
@i,
case when @i % 3 = 0 then 'READ-ONLY'
when @i % 3 = 1 then 'WRITE-ONLY'
when @i % 3 = 2 then 'READ-WRITE'
END,
case when @i % 3 = 0 then 'SUCCESS'
when @i % 3 = 1 then 'FAILURE'
when @i % 3 = 2 then 'ROLLBACK'
END,
GETDATE(),
DATEADD(second, @i % 50, getdate()),
@i % 100,
'abc'
if @i % @batch_size = 0
begin
commit tran
begin tran
end
select @i = @i + 1
end
if @@TRANCOUNT > 0
commit tran
go
create clustered index cix_tran_stats on tran_stats(tran_begin)
go
drop table Dim_server_config
go
create table Dim_server_config (
id_server_config int identity,
server_config varchar(100)
)
go
drop table Dim_thread_name
go
create table Dim_thread_name (
id_thread_name int identity,
thread_name varchar(20)
)
go
drop table Dim_tran_type
go
create table Dim_tran_type (
id_tran_type int identity,
tran_type varchar(20)
)
go
drop table Dim_tran_status
go
create table Dim_tran_status (
id_tran_status int identity,
tran_status varchar(20)
)
go
drop table Dim_users
go
create table Dim_users (
id_users int identity,
users int
)
go
drop table Dim_cpus
go
create table Dim_cpus (
id_cpus int identity,
cpus int
)
go
insert Dim_server_config(server_config)
select distinct server_config
from (select top 100000 server_config from tran_stats with (nolock)) as v
insert Dim_thread_name(thread_name)
select distinct thread_name
from (select top 100000 thread_name from tran_stats with (nolock)) as v
insert Dim_tran_type(tran_type)
select distinct tran_type
from (select top 100000 tran_type from tran_stats with (nolock)) as v
insert Dim_tran_status(tran_status)
select distinct tran_status
from (select top 100000 tran_status from tran_stats with (nolock)) as v
insert Dim_users(users)
select distinct users
from (select top 100000 users from tran_stats with (nolock)) as v
insert Dim_cpus(cpus)
select distinct cpus
from (select top 100000 cpus from tran_stats with (nolock)) as v
go
if OBJECT_ID('dbo.dimensions') is not NULL
drop table dbo.dimensions
go
select distinct * into dbo.dimensions
from Dim_server_config, Dim_thread_name, Dim_tran_type, Dim_tran_status, Dim_users, Dim_cpus
go