
Joe Chang

PDW Concept on NUMA

This is purely a question and not a proposal, as I know how Microsoft is so very thrilled every time I have a suggestion. Could the Parallel Data Warehouse concept work on a single NUMA system?

Technically, all multi-socket systems today are NUMA because the processors now have integrated memory controllers. The path to memory is different between local access and memory attached to another processor socket. But I am more interested in 8-socket, and potentially 16+ socket, systems. The single-system definition here is that of a single operating system image. Some large systems allow partitioning to support multiple OS images, but that configuration is not discussed here.

Both Windows Server and SQL Server are NUMA aware in that there are distinct memory nodes, each closer to its local processor cores than to remote cores. But there is no workable mechanism to fully benefit from knowledge of the underlying hardware organization. Say, for example, we have a large table resident in the buffer cache, evenly distributed across the nodes. We would like a query against this table, issued over a connection to SQL Server, to be farmed out to one thread on each node, with each thread accessing only the rows stored locally.

But this is not what happens. More likely, starting from an empty buffer cache, a query accesses the table and the table is initially loaded into memory on the same node as the connection, perhaps spilling over to other nodes? I have no idea what actually happens, so it would be nice if someone would investigate this. Now another connection that happens to be assigned to a thread on a different node would have to cross to remote memory for this same query? So the expectation is that there would be apparently inconsistent variations in query time, depending on both the node distribution of the participating buffer cache pages and the node affinity of the execution thread.
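One way to investigate this would be to check how the cached pages of a table are spread across memory nodes using sys.dm_os_buffer_descriptors, which reports a numa_node_id per buffer. A sketch (the table name is hypothetical, and the allocation unit join is the usual pattern, not something verified for this exact scenario):

  -- distribution of cached pages by NUMA memory node for one table
  SELECT bd.numa_node_id, COUNT(*) AS cached_pages
  FROM sys.dm_os_buffer_descriptors bd
  JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
  JOIN sys.partitions p ON p.hobt_id = au.container_id
  WHERE bd.database_id = DB_ID()
    AND p.object_id = OBJECT_ID('dbo.LINEITEM')  -- hypothetical table
  GROUP BY bd.numa_node_id
  ORDER BY bd.numa_node_id;

Running this after a cold-cache scan, from connections on different nodes, would show whether the pages land on one node or spread out.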

In the PDW environment, there are multiple compute nodes, each a 2-socket system running its own Windows operating system and a (modified?) SQL Server instance/process, with fact tables partitioned across nodes and dimension tables replicated to all nodes. A front end sends a query to each node, and each node processes the data it holds. The purpose of this architecture is to scale the computational resources (CPU, memory, and IO) beyond the limits of a single system. But it also happens to achieve a degree of locality optimization.

The limitation of the PDW concept is that any single query is a more involved process. Furthermore, query execution plan elements cannot cross nodes without a remote-server mechanism. Even with the InfiniBand protocol, communication between nodes is expensive and high latency relative to the processor cycle time scale. This is why PDW is Parallel Data Warehouse and not Parallel Database.

So now what? My question is: can we apply the PDW concept to a NUMA system, one big 8-socket NUMA system running a single Windows operating system? Instead of PDW with a SQL Server instance on each system (PDW compute node), there is now a SQL Server instance localized to each node of the NUMA system. Communication between processes on one system has lower latency than communication between systems over InfiniBand.

Can we go further than this? How about bringing the Parallel concept inside a single process? Now each “node” is a collection of threads affinitized to a node on the NUMA system. Communication between threads is even faster than between processes. And now, potentially, execution plan elements can cross “nodes”?

We might ask why bother. The nature of server systems is that one 8-socket system is more than twice as expensive as four 2-socket systems. In addition, the 8-socket system may have processors one generation behind the 2-socket systems. But by the time we assemble either a collection of systems or one big system, the full hardware cost is irrelevant to a truly important mission. The real question is which approach meets the objectives better.

 

Below is a completely separate discussion, but one also on the topic of NUMA systems.

A Better NUMA System Memory Strategy?

As far as I know, Windows or SQL OS creates two memory nodes from processor node 0, then one additional memory node on each of the other nodes. One of the memory nodes on node 0 is for general allocations. All other memory nodes, including the second one on node 0, are for NUMA-aware allocations? The immediate problem is that the memory nodes are asymmetric in size.
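The layout SQL OS actually builds can be inspected from the DMVs; a quick sketch (column sets vary a little by version, and memory node 64 is the dedicated admin connection node, not a hardware node):

  -- memory nodes as SQL OS sees them
  SELECT memory_node_id, virtual_address_space_reserved_kb, virtual_address_space_committed_kb
  FROM sys.dm_os_memory_nodes;

  -- scheduler (CPU) nodes and their mapping to memory nodes
  SELECT node_id, node_state_desc, memory_node_id, cpu_affinity_mask, online_scheduler_count
  FROM sys.dm_os_nodes;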

There is one possibility to consider for a system not at maximum memory: put maximum memory in node 0 (good luck figuring out which one that is) and normal memory in the other nodes. But of course, the nature of component pricing today is that it makes more sense to fill the DIMM sockets with the second-largest-capacity memory module, today 16GB. Today, Crucial shows 16GB ECC DIMMs at $200 each and 32GB DIMMs at $800 each. Sometime in the next year or so, expect the 32GB DIMM to trend towards $400 and a new 64GB DIMM to be introduced at several thousand dollars initially.

HP-UX, on the other hand, has a very clever memory strategy, which is to configure both interleaved and node-local memory, or LORA in their terminology. Say, for example, that there are 8 nodes and 1TB of memory total, 128GB per node. Interleaved memory totaling 128GB is allocated from all nodes at 16GB per node. The remaining 112GB on each node is formed into node-local memory. The addressing scheme might be something like this: for the interleaved portion in the first 128GB, memory is interleaved in 256-byte stripes across the nodes. The first 256 bytes are from node 0, the next 256 bytes from node 1, and so on, before wrapping back to node 0. Above this are separate 112GB contiguous chunks, one per node.
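To make the mapping concrete (my own arithmetic for the hypothetical layout above, not anything from HP documentation), the owning node for a physical address A would be:

  node(A) = floor(A / 256) mod 8           for A < 128GB (interleaved region)
  node(A) = floor((A - 128GB) / 112GB)     for A >= 128GB (node-local region)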

This arrangement recognizes the fact that not all consumers want local node memory. Now the NUMA local memory nodes are uniform in size. I happen to think this is an excellent idea. I do not know if there is an institutional bias at MS against ideas not invented in happy valley.

 

Edit based on Lonny Niederstadt's comment

On further thought, I think it might be better if the OS did not interleave a single memory page (4KB on x86) across NUMA nodes, perhaps even for the 2MB large page mode, though I am not sure about the 1GB page.

The HP-UX scheme allows the sysadmin to adjust the interleaved/LORA ratio, with 1/8:7/8 being recommended. This would allow NUMA to be turned off.

The SQL OS could also override the Windows scheme, simply by choosing how it allocates pages.

For NUMA controls, the default should be off, so people do not encounter anomalies unless they make the decision to go for it.

I am thinking that on a transaction server, the scheme should be to prefer allocating buffer cache pages from the local node memory (local to the thread making the call). The app servers would connect over different port values, and SQL Server would have affinity set by port. Each group would then be likely to have its data on the local node.
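Something close to this can already be approximated with the existing TCP-port-to-NUMA-node mapping: in SQL Server Configuration Manager, the TCP port entry can carry a node affinity mask in brackets, so connections arriving on a given port are handled by a given node. A sketch of the port property value (the port numbers are made up):

  1450[0x1],1451[0x2],1452[0x4],1453[0x8]

Each app server group would then connect on its own port and land on its own node; what is missing today is the buffer-page allocation preference described above.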

For DW, the strategy should be to round-robin extents (or large pages?) across NUMA nodes. Even better would be if the clustered index could be on a hash key. Then two tables hashed on the same key, ORDERS and LINEITEM for example, would have the same node as their preferred buffer location.

Taking it a step further, if storage were also distributed by node, and the filegroup had one or more files on each node, then the preferred file location would match. And an index rebuild would reset the layout to match this scheme.

There should be some parallelism options beyond MAXDOP. One would be Local Node, so parallel execution would use local node cores up to the socket limit. Another would be Global: allocate one thread per node, then round-robin. An independent option would be Hyper-Thread (HT), in which both (or all) logical processors of one core are allocated before moving to the next core. The default of course would be off, but this could be a powerful tool for investigating HT issues.

Perhaps the Local Node parallelism option should be the default for the OLTP scheme, and Global the default for the DW scheme.

I am thinking something needs to be done about the processor affinity bitmask. There should be three values, which could be done with two sets of bitmasks: one for preferred processor affinity, another for secondary affinity, and finally excluded processors.

This might help manage multi-instance deployments: each instance could get a separate preferred set, while the secondary set would allow activity on other nodes as well.
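For contrast, the single affinity setting that exists today can at least be split per instance; a sketch of the current syntax (not the three-level preferred/secondary/excluded proposal):

  -- run in instance A
  ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 1;
  -- run in instance B
  ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 2 TO 3;

With only one mask, instance A can never borrow idle cores on nodes 2-3, which is exactly what the secondary affinity would allow.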

Since you mention CPU per node, I will plug my ExecStats tool,
http://www.qdpma.com/ExecStatsZip.html

which in PerfMon mode displays individual CPUs, annotating the node boundaries. Note to the 3rd-party tool vendors: learn what a real server is, not the developer's desktop!

The high CPU on node zero might be related to this:
http://support.microsoft.com/kb/2813214
FIX: CPU spike when there is no load on a server after you install SQL Server 2012 on the server (fixed in SP1 CU3).

Now, having thought about it some, I think this should be a feature request; we could call it SQL Server Data Center Edition.

Published Friday, September 20, 2013 3:55 PM by jchang


Comments

 

Lonny Niederstadt said:

For some workloads, people should consider disabling NUMA support in SQL Server in the first place. :)  Current task distribution ends up stacking many tasks within a NUMA node in lots of cases.  Since that is done to provide affinity between tasks and memory, the buffer pool portion for that NUMA node can churn as well.  So in many cases, if you watch per-CPU metrics and per-buffer-node metrics in SQL Server, you can catch long periods of time where the CPUs in one node are stressed and that node's bpool churns, while the CPUs in other nodes and their bpool loaf.
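(A quick way to watch the per-buffer-node metrics mentioned here is the Buffer Node counter set; a sketch, assuming the standard counter names:

  SELECT object_name, counter_name, instance_name AS node, cntr_value
  FROM sys.dm_os_performance_counters
  WHERE object_name LIKE '%Buffer Node%'
    AND counter_name IN ('Database pages', 'Page life expectancy');
)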

It's not for nuthin' that Oracle disables their database-level NUMA support by default in both Linux and AIX packages.

For computation-heavy OLTP databases, maximizing memory affinity makes sense.  For data- and memory-hungry stuff that spans NUMA nodes, like most data warehouses, IBM's finding generally holds that trying to prioritize memory affinity can lead to performance degradation, either due to the expense of memory allocation or to waits for affinitized processor task dispatch.

And, in Windows, unless there's been a change, there is still the hurdle that SQL Server's memory allocations can't specify which node is to satisfy them - rather the allocation is satisfied and then the node it came from can be learned.  

September 20, 2013 4:04 PM
 

jchang said:

Contents moved into main post based on Lonny Niederstadt's comment

September 20, 2013 6:02 PM
 

Lonny Niederstadt said:

Ahhh - HP-UX.  I used to know that, way back when :)  IBM Power systems running AIX have all kinds of memory and task affinity knobs to turn also, and that's what I've been watching the last few years.  So many details it's forced just about everything I knew about HP-UX and Solaris right out of my head :)

But for DW/DSS loads heavy enough to cross the socket/MCM/CEC boundary, it's often best to disable the affinity on AIX, too.  For one thing, paging to disk becomes less predictable.  The other item is one I haven't proven out yet, but with enough data coming into the shared memory segment from disk IO, it seems like disabling many of the AIX affinity features simplifies the instruction set enough to make a noticeable performance difference.

October 27, 2013 1:06 AM
 

Thomas Kejser said:

Joe, what you describe is something I have also been lobbying for, especially for DW workloads.

However, I question the value of partitioning the buffer pool in OLTP. In transaction-heavy workloads, L2 cache dirt (specifically on the tlog, locking structures, and B-tree root pages) is the largest problem. SQL Server already does pretty well here (smarter than Oracle) by partitioning the lock manager, pinning the log writer (typically to node 1), and super-latching. There are a few more things you can do to get the final benefits for OLTP:

1) Partition the network interfaces/ports so each app server always talks to the same CPU/NUMA node. If users are affinitized to app servers (as they should be), you get affinity "all the way through".

2) Remove SQL Server from the CPU core that runs the log manager (even if this means sacrificing a few cores). This keeps the L2 for the tlog clean and ready for the expensive flushes. (One could consider dedicated hardware for this.)

Both of the above can be achieved today, though 2) is a pain to get right.

November 29, 2013 2:37 AM
 

jchang said:

Hey Thomas, how do you determine the affinity of the log manager? Is it different for each db in an instance? Also, I expect to get another bottle of Sazerac soon.

December 2, 2013 2:41 PM
 

Thomas Kejser said:

The log thread can be seen in sys.dm_exec_sessions. You can join to the schedulers to find the core it runs on.
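A sketch of such a query (going through sys.dm_exec_requests for the command name and scheduler, which is my assumption about the most direct route):

  -- find the scheduler / CPU hosting the log writer
  SELECT r.session_id, r.command, r.scheduler_id, s.cpu_id, s.parent_node_id
  FROM sys.dm_exec_requests r
  JOIN sys.dm_os_schedulers s ON s.scheduler_id = r.scheduler_id
  WHERE r.command = 'LOG WRITER';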

December 8, 2013 7:36 AM
 

Thomas Kejser said:

Joe, based on your knowledge of big boxes - what would you say the lower limit for latency on a crossbar is once you go above 16 sockets? I would assert that it can be proven that a pre-partitioned system is the only feasible scaling strategy.

February 15, 2014 3:43 AM
 

jchang said:

There is very little information on this topic. My understanding is that the connection/transmission delay is on the order of 25ns, and this might be for a chip on the PCB, with a socketed device having higher capacitive (inductive?) delays.

My understanding is that locally attached memory has a latency of 50-60ns; memory is socketed and there is latency in the DRAM itself, so perhaps the 25ns-per-hop figure is old (1997?).

It is also my understanding that in a 2-socket or larger system, a memory access that goes processor-to-processor-to-memory and back is on the order of 100ns.

This is about the same as the old processor-to-memory-controller-to-memory access time (100ns was best; some systems were 140ns?).

On the HP ProLiant DL980, a remote node access goes processor to node controller to node controller to processor and back. I do not know if the node controllers are socketed to the PCB or not, but the processor unit has a pin connector to the node controller, and the same for each node controller? This is likely to introduce a more substantial delay?

Remote memory access might be on the order of 280ns?

There are also cache coherency related delays; there was an explanation of this for the DL980, perhaps it is in their white paper.

Finally, on the Intel Xeon processors, the QPI sub-unit is on the far side of the processor from the memory controller, so traffic must cross the ring.

February 15, 2014 11:30 AM
 

jchang said:

On the matter of scaling, if the PDW concept does scale, then QPI connections are lower latency than InfiniBand. So the difference would be between 1) a single OS image, with one or more SQL instances and single-process or LPC communication between nodes, versus 2) multiple OS images.

I was presuming that each "PDW-NUMA" node restricts itself to local resources as much as possible. So perhaps the scaling limitation is that it is not possible to "turn off" unnecessary cache coherency protocols?

What if the system were intended strictly for PDW-style operation, using QPI only for inter-node communication?

February 15, 2014 11:39 AM


About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
