THE SQL Server Blog Spot on the Web


Linchi Shea

Checking out SQL Server via empirical data points

Performance impact: diminishing marginal return on the degree of parallelism

In commenting on my previous post, Greg Linwood and GrumpyOldDBA raised questions about various implications of parallelism. In this post, I’ll look at the impact of different degrees of parallelism on query performance. I’ll limit my examination to the same query that uses a Cartesian product temp table, as discussed in the previous two posts ([1] and [2]). The details of the test setup can be found in [1]. The SQL Server instance is SQL Server 2008 R2 (10.50.1600) Enterprise X64 Edition. I’ll look at the performance impact of concurrent executions in a future post.

The law of diminishing marginal return dictates that as you increase the number of processors in a parallel plan, sooner or later you’ll reach a point after which you’ll see a decrease in the performance gain from adding another processor to the parallel plan. Note that it is the decrease in the performance gain, not the decrease in the performance itself.
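This behavior is commonly formalized as Amdahl's law: if a fraction s of the work is inherently serial, the best possible speedup on n processors is 1 / (s + (1 - s)/n). Here is a minimal Python sketch; the 5% serial fraction is just an assumed illustrative value, not something measured from SQL Server:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the
# serial fraction of the work and n is the number of processors.
def speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

# Even a 5% serial fraction makes the marginal gain per added
# processor shrink rapidly; the speedup can never exceed 1/s = 20.
for n in (1, 8, 24, 48, 64):
    print(f"n={n:2d}  speedup={speedup(0.05, n):.2f}")
```

Note that going from 48 to 64 processors buys far less than going from 1 to 8 did, which is exactly the diminishing marginal return described above.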

The question is, for a specific query, does SQL Server know where that inflection point is and therefore make the right decision in picking the optimal degree of parallelism? SQL Server Books Online thinks so. It says:

“When SQL Server runs on a computer with more than one microprocessor or CPU, it detects the best degree of parallelism, that is, the number of processors employed to run a single statement, for each parallel plan execution.”

But how does that work out in practice? This is a valid question to ask because determining the optimal degree of parallelism is an incredibly difficult problem to solve. We obviously can’t answer this question in generality, nor should we even attempt to.

So let’s take a look at the following query (reproduced from the previous post; see that post for more details):

select id_server_config,
       COUNT(*) as cnt
  from tran_stats t1 join #tmp t2
       on t1.server_config = t2.server_config and
          t1.thread_name = t2.thread_name and
          t1.tran_type = t2.tran_type and
          t1.tran_status = t2.tran_status and
          t1.users = t2.users and
          t1.cpus = t2.cpus
 group by id_server_config

Let’s run this query on a server with 80 processors with the max degree of parallelism (maxdop) set to different values, and see how it performs. (Note that the tran_stats table has 200,000,000 rows, and the query runs with all the data already cached in memory.) The following chart summarizes the result:
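One way to run such a sweep, assuming sysadmin rights on a dedicated test server, is to change the instance-wide setting between runs; a per-query OPTION (MAXDOP n) hint would work just as well without touching the server configuration:

```sql
-- 'max degree of parallelism' is an advanced option, so expose it first.
exec sp_configure 'show advanced options', 1;
reconfigure;

-- Set maxdop for the next run; repeat with 1, 2, 4, 8, 16, 24, ..., and 0.
exec sp_configure 'max degree of parallelism', 8;
reconfigure;
```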


A couple of things stand out on this chart.

First of all, the diminishing marginal return on parallelism is unmistakable. After maxdop = 8, the additional performance gain from increasing the degree of parallelism is rather small. And after maxdop = 24, the gain is virtually non-existent. Arguably, the optimal degree of parallelism for this query on this server is somewhere between 8 and 24.

The second salient point is that when maxdop is set to 0, SQL Server sets the actual degree of parallelism to 64 when running the query. As per Books Online, 64 is the maximum value SQL Server would set maxdop to.
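The runtime degree of parallelism can be confirmed from the actual execution plan; with the plan captured as XML, the QueryPlan element carries a DegreeOfParallelism attribute:

```sql
-- Capture the actual plan XML along with the results, then inspect
-- the DegreeOfParallelism attribute of the QueryPlan element.
set statistics xml on;
-- ... run the query under test here ...
set statistics xml off;
```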

Why is setting maxdop to 64 not optimal for this query? Let’s look at the following chart that shows the % total CPU usage on the server (with 80 processors) at various maxdop settings:


For this query, as we increase the maxdop setting, the total CPU usage steadily goes up. Even though we gain practically no performance after maxdop = 24, the total CPU usage still climbs. We just end up wasting a tremendous amount of CPU cycles for nothing.

Obviously we need to gain more experience to be sure, but it doesn’t look like a good idea to leave maxdop at 0 on a box with a lot of processors.

Published Friday, December 16, 2011 1:24 PM by Linchi Shea




Chris Wood said:


I find the Total CPU Usage % vs. maxdop graph a little disturbing. My biggest servers have 16 CPUs, so I would be interested to see this graph on servers with, say, 16 to 64 processors and see if this is true in those situations too. If it is, it kind of backs up the maxdop = 8 suggested by Microsoft for OLTP.



December 18, 2011 10:10 AM

Adam Machanic said:

Chris, where did you see a MAXDOP = 8 suggestion for OLTP? That sounds to me like a suggestion for OLAP, not OLTP.


December 18, 2011 1:56 PM

Adam Machanic said:

Linchi, how did you measure the performance of the queries? Did you render the results in SSMS? Insert into a temp table? Something else entirely?

Did you look at the actual query plans to make sure that work was properly balanced across all of the threads? And did you take a look at wait stats?

December 18, 2011 1:59 PM

Linchi Shea said:


Obviously, we don't want to extrapolate too much. Not all workloads will just end up wasting a huge amount of CPU cycles as the degree of parallelism goes up. I have a follow-up post that will explore this a bit more and should be relevant to your question.


Since the row count of the resultset is only about 1800, whether or not to send the results to SSMS does not change the story. In my tests, I actually wrap another SELECT COUNT(*) outside to minimize the server-to-client traffic. No change in the actual plan in terms of what we are interested in.

The actual execution plan as captured with SET STATISTICS XML ON suggests that the work is more or less balanced across the threads. For example, with dop=48, each thread is expected to handle about 420,000 rows, and that was more or less the case. A small number of threads handled as few as 250,000 rows, and a small number handled 500,000 rows, but most fell around 420,000. I think the problem here is just a result of Amdahl's law. In other words, after a certain point, the serial aspect may start to dominate, and further dividing the rows from the large table among more threads doesn't help the overall speedup. Also in this example, even with the table having 200,000,000 rows, when all rows are cached, a parallel plan with 24 or more threads finished in less than 20 seconds, and my guess is that SQL Server may need much of that time just to set things up for parallel execution.

December 18, 2011 5:15 PM

Adam Machanic said:

Hi Linchi,

It's interesting to bring up Amdahl's Law here, because you'd need a significantly large serial zone for it to apply. I don't see the parallel plan in your prior post on the topic, but based on the query I would expect virtually everything to be processed in a parallel zone: parallel scans followed by parallel hash match followed by aggregation done in parallel. Is that not the case? Can you post the query plan?

One thing that's interesting to note is that you have a somewhat small number of partitions (1800) for the number of rows you're processing. (Or put another way, a fairly large number of rows per partition.) I suspect that your results would be quite different if things were sliced up a bit differently.

December 18, 2011 10:23 PM

Linchi Shea said:


No doubt with a different example you may see radically different CPU usage behavior. The point here is to show a specific example of the diminishing marginal return with the degree of parallelism. A different example may see the inflection point at a different place. But as servers having 80 or more processors become common, it'll be easy to see this inflection point, and SQL Server picking a dop somewhere not near the inflection point.

December 19, 2011 12:35 AM

Adam Machanic said:

"as servers having 80 or more processors become common, it'll be easy to see this inflection point, and SQL Server picking a dop somewhere not near the inflection point."

This is no doubt correct. But a large number of queries ARE able to be tuned to use more CPUs -- including, I bet, yours. I just want to make sure that readers understand that generalization is quite difficult in this area.

BTW, I'll take a look at the plan; thanks for posting it. Maybe a file attachment next time? :-)

December 19, 2011 9:37 AM

Linchi Shea said:

"But a large number of queries ARE able to be tuned to use more CPUs -- including, I bet, yours." Oh yeah, one way is to use maxdop hint :-)

December 19, 2011 9:51 AM

Chris Wood said:


The original tip came from Isaac Asimov at PASS 2008 or 2009. Probably not many servers with 64 cores then. His reasoning was that gathering the information from more than 8 threads took longer than the gains from splitting the workload up.


December 19, 2011 10:52 AM

Adam Machanic said:

Wow, I thought Isaac Asimov died in 1992 :-)

... but you probably meant Isaac Kunen. Anyway I don't know the context of this recommendation, but it sounds like it matters a lot -- I wonder if he was speaking about a VERY specific kind of query, e.g. spatial (since that's what he works on).

December 19, 2011 11:11 AM

Roger Wolter said:

I haven't seen this mentioned in the discussion yet, but the rapid drop-off in performance gain with a MAXDOP of more than 8 is probably because 8 is the size of the NUMA node in this architecture.  All the 64-core systems I have worked on have a NUMA node size of 8 (although I assume this isn't universally true).  Once you go past the size of a NUMA node, distributing and gathering parallel streams involves cross-node memory access which can rapidly decrease the performance benefit of parallelism.  My guess is that if you used a non-NUMA machine the decrease in performance benefit would be more linear.

December 19, 2011 12:19 PM

Chris Wood said:

Bob Ward had also mentioned this to me at this year's PASS, and reading Roger's comment I can see why now.


December 19, 2011 2:45 PM

Linchi Shea said:

Okay, posting the execution plan as a comment is not a good idea. I removed it and attached it as a file.

December 19, 2011 2:50 PM

Linchi Shea said:

"Once you go past the size of a NUMA node, distributing and gathering parallel streams involves cross-node memory access which can rapidly decrease the performance benefit of parallelism."


If that's the case, shouldn't the optimizer take that into consideration and optimize accordingly?

December 19, 2011 2:51 PM

Adam Machanic said:

I do not agree that cross-node access is necessarily going to have much of an impact; it depends on how many times gathering or repartitioning is taking place. In Linchi's plan we can see four repartitions and a gather, so perhaps that is a major contributor. But how can we measure it?

December 19, 2011 3:57 PM

Steve Jones said:

So, set MAXDOP = NUMA node size as a gross generalization? Override MAXDOP for queries that might benefit from more/less?

December 19, 2011 4:23 PM

Greg Linwood said:

Setting MAXDOP relative to NUMA is what you might do if you only ever plan to run a single query on the system.

If you're planning to run a normal OLTP workload, you need to consider the impact of concurrency which, far more often than not, means you should set MAXDOP to 1 at the system level & then only configure it on specific queries where you know MAXDOP is necessary.

December 19, 2011 4:59 PM

Roger Wolter said:

Should the optimizer limit parallelism to NUMA node size?  I'm guessing this is one of those times that the optimizer is going to assume you know what you are doing and go with the MAXDOP you give it.  Your numbers show that performance improves with MAXDOP larger than the NUMA node size - just not as much improvement per node.  I'm guessing you could find queries where a MAXDOP of 64 is the right value - maybe a scan on a clustered index with a couple hundred partitions.  If the optimizer limited that to 8, the performance could be much worse than it needs to be.

I usually set MAXDOP to the NUMA node size for NUMA machines and also bump up the cost threshold for parallelism to something like 15 or 20.  This is for a system with a mixed OLTP and reporting workload.  If you have a huge datawarehouse that routinely scans billion row tables, you probably would benefit from a much higher MAXDOP.
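Roger's two settings, sketched in T-SQL. The values 8 and 15 are his examples for a NUMA node size of 8 and a mixed workload, not universal recommendations:

```sql
exec sp_configure 'show advanced options', 1;
reconfigure;

-- Cap parallel plans at the NUMA node size (8 in this example)...
exec sp_configure 'max degree of parallelism', 8;
reconfigure;

-- ...and raise the cost threshold so only expensive queries go parallel.
exec sp_configure 'cost threshold for parallelism', 15;
reconfigure;
```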

December 19, 2011 6:12 PM

Linchi Shea said:

Hi Roger;

Just to clarify, I'm not referring to cases where maxdop is explicitly set, but rather to the case when maxdop is set to 0 on a machine with 64 or more processors. Supposedly, when maxdop is set to 0, we are allowing SQL Server to make the right decision in picking a degree of parallelism. In that case, if it is true that in many cases going over the NUMA node size won't get much of a marginal return, the optimizer ought to take that into consideration. True, there may be cases where 64 degrees of parallelism may be desirable. But since the optimizer cannot always get it right, it seems we'll have to play a probability game here, and the NUMA node size should perhaps have some weight in that probability calculation.

December 19, 2011 6:43 PM

Paul White said:

Hi Linchi,

This post seems to be mostly about how to interpret the word 'best' in the BOL phrase "it detects the best degree of parallelism".

The word 'best' can be used in the sense 'greatest possible'.  This refers to the idea that SQL Server may reduce DOP at execution time if resources are short (for example if sufficient worker threads cannot be reserved).

There do appear to have been minor changes to the infrastructure and model over the years (tweaking costs, better semi-join reduction, making attempts to choose an 'ideal' NUMA node and so on), but by and large, SQL Server's approach to parallelism is pretty basic.

There are certainly many things SQL Server might try to take into consideration with parallel query plans.  Perhaps future versions will, judging from some of the new hardware-specific information in query plans in SQL Server 2012.  How far it is wise to take that in a general-purpose product like SQL Server is perhaps less clear.


December 19, 2011 10:23 PM

Jeremy Lowell said:

It's been my experience that on larger servers with 8 NUMA nodes, setting maxdop between 2 and 4 for OLTP, and never more than 8 for an OLAP type of workload, has provided the best performance.

Spanning NUMA nodes with a single query (i.e. maxdop = 24 on a box with 8 NUMA nodes) can create a significant amount of remote page lookups and increased c3 contention.  There are exceptions to every rule, of course, but from what I've seen the best "average" performance is between 2 and 4 for OLTP and 8 for OLAP (with servers that have 8 NUMA nodes).

This is a very, very interesting topic and one that needs to be further understood by many of us, as these servers will soon become commonplace and data volumes continue to grow.

How many of you have used soft numa and routed it via ip ranges to better "balance" the cpu usage?  This is something that I'm considering but the thought of the testing cycles required to find the right balance has kept me wary of pursuing this as an option.

December 20, 2011 6:38 PM

Linchi Shea said:


This post is not about how to interpret what BOL may or may not say, but rather an attempt to contribute some data points to the community to help understand how the darn thing actually works in practice on various machines with various workloads. As Adam mentioned, it's dangerous to generalize in this area. But I want to report what I have seen, hopefully with full disclosure (including all the scripts and data generation), and others can report their findings and experience so that interpreting what BOL says becomes less of an issue.

December 20, 2011 7:43 PM

Greg Linwood said:

This discussion is generally only relevant for systems that only run individual queries, one at a time, which might benefit from parallelism.

In the huge majority of real world systems, gains from parallelism on individual queries are generally wiped out by the increased context switching across a concurrent workload.

I really hope you do get to post something on the trade off between query parallelism & concurrent workload Linchi, as this discussion borders on being misleading for the lay reader without inclusion of this crucial dimension in the analysis.

December 21, 2011 5:30 AM

Adam Machanic said:


Sounds like the huge majority of systems you work on have either MAXDOP set too high and/or underpowered CPU subsystems. The databases I work on tend to have lots of queries that benefit from some level of parallelism. (I should qualify this by mentioning that these days I only work on data warehouses.) Sure, we need to properly configure things to ensure a good balance when concurrency becomes an issue, but in normal situations even without configuration I wouldn't ever say that the benefit is totally "wiped out."

If it were truly that easy to eliminate any benefit at all of parallelism, SQL Server would be unable to handle DW workloads -- unless of course you happened to be working with very, very patient end-users. Take Linchi's query as an example: Users can wait 5 minutes for a serial query to return the data, or 15 seconds. The goal is, obviously, to get every query as close to that 15 second mark as possible -- maybe not quite there, but in the ballpark. That's what DOP configuration tradeoffs help us to ensure.


December 21, 2011 10:14 AM

Greg Linwood said:

I specifically mentioned OLTP workloads in my first post, under which there is nearly always a high degree of concurrent multi-user activity (the nature of OLTP)

You've brought up Data Warehousing which is the workload domain which generally benefits from DOP (as there are generally fewer queries running concurrently on DW systems & therefore less context switching)

The major fault of this discussion up to this point is that it has been conducted in a clinical manner, without reference to the impact of concurrent workload - a crucial consideration when assessing the impact of DOP.

You referred to Linchi's example of users waiting 5 minutes for a serial query vs 15 seconds, but without reference to the degree of concurrent workload & whether the system was OLTP / DW, these numbers are probably misleading - it's very likely to be the other way around if there is a high degree of concurrent workload, due to the inherent overhead associated with context switching.

Discussing performance improvements from individual queries with DOP without reference to whether the workload is DW or OLTP & also without reference to whether the workload is serial or concurrent is potentially worse than not discussing it at all as less experienced DBAs will draw generalised conclusions which are generally wrong.

OLTPs generally benefit most from DOP=1 & only configuring specific queries to use DOP where needed (as OLTPs generally have multi CPUs to support multiple users, not parallelism within queries). DW systems usually benefit from DOP but even this isn't universal & Data Warehouse systems are also the vast minority of installed SQL Server systems.

December 21, 2011 8:28 PM