
Page Free Space


Forcing a Parallel Query Execution Plan

This post is for SQL Server developers who have experienced the special kind of frustration that only comes from spending hours trying to convince the query optimizer to generate a parallel execution plan.  This situation often occurs when making an apparently innocuous change to the text of a moderately complex query; a change which somehow manages to turn a parallel plan that executes in ten seconds into a five-minute serially-executing monster.

SQL Server provides a number of query and table hints that allow the experienced practitioner to take greater control over the final form of a query plan.  These hints are usually seen as a tool of last resort, because they can make code harder to maintain, introduce extra dependencies, and may prevent the optimizer reacting to future changes in indexing or data distribution.  One such query hint is the (over) popular OPTION (MAXDOP 1), which prevents the optimizer from considering plans that use parallelism.  Sadly, there is currently no corresponding hint to force the optimizer to choose a parallel plan.
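
For example, the MAXDOP query hint caps the degree of parallelism for a single statement; a value of 1 forces a serial plan for that query (the query below is a minimal illustration against the AdventureWorks sample):

-- This statement may not use a parallel plan, whatever the
-- server-wide max degree of parallelism setting says:
SELECT p.Name
FROM Production.Product AS p
ORDER BY p.Name
OPTION (MAXDOP 1);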

The result of all this is a great deal of wasted time trying increasingly obscure query syntax, until eventually the desired parallel plan is obtained, or the developer gives up in despair.  Even where success is achieved, the price is often fragile code that risks reverting to serial execution any time indexing or statistics change.  In any case, the resulting SQL is usually hard to read, and scary to maintain.

Why Expensive Queries Produce Serial Plans

Whenever the query optimizer produces a serial plan instead of the ‘obviously better’ parallel plan, there is always a reason.  Leaving aside the more obvious causes, such as the configuration setting max degree of parallelism being set to one, running under a Resource Governor workload group with MAX_DOP = 1, or having only one logical processor available to SQL Server, the usual causes of a serial plan are parallelism-inhibiting operations, cardinality estimation errors, costing model limitations, and code path issues.

Parallelism-Inhibiting Components

There are many things that prevent parallelism, either because they make no sense in a parallel plan, or because the product just does not support them yet.  Some of these force the whole plan to run serially, others require a ‘serial zone’ – a part of the plan that runs serially, even though other parts may use multiple threads concurrently.

That list changes from version to version, but for example these things make the whole plan serial on SQL Server 2008 R2 SP1:

  • Modifying the contents of a table variable (reading is fine)
  • Any T-SQL scalar function (which are evil anyway)
  • CLR scalar functions marked as performing data access (normal ones are fine)
  • Random intrinsic functions including OBJECT_NAME, ENCRYPTBYCERT, and IDENT_CURRENT
  • System table access (e.g. sys.tables)

Inconveniently, the list of intrinsic functions is quite long and does not seem to follow a pattern.  ERROR_NUMBER and @@TRANCOUNT also force a serial plan, @@ERROR and @@NESTLEVEL do not.  The T-SQL scalar function restriction is also a bit sneaky.  Any reference to a table (or view) with a computed column that uses a T-SQL scalar function will result in a serial plan, even if the problematic column is not referenced in the query.
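
A minimal sketch of that trap (the object names here are invented for illustration):

-- A trivial T-SQL scalar function:
CREATE FUNCTION dbo.Nothing (@i integer)
RETURNS integer
AS BEGIN RETURN @i; END;
GO
CREATE TABLE dbo.Trap
(
    col1 integer NOT NULL,
    col2 AS dbo.Nothing(col1)  -- computed column using the function
);
GO
-- col2 is never referenced, but the mere presence of the computed
-- column forces this query to compile to a serial plan:
SELECT t.col1 FROM dbo.Trap AS t;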

These query features are examples that require a serial zone in the plan:

  • TOP
  • Sequence project (e.g. ROW_NUMBER, RANK)
  • Multi-statement T-SQL table-valued functions
  • Backward range scans (forward is fine)
  • Global scalar aggregates
  • Common sub-expression spools
  • Recursive CTEs
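
For example, a query like this one (against the AdventureWorks sample) may still read and sort rows using multiple threads, but the Top operator itself runs in a serial zone, after a Gather Streams exchange; whether parallelism appears at all still depends on costing:

SELECT TOP (100)
    th.TransactionID,
    th.ActualCost
FROM Production.TransactionHistory AS th
ORDER BY th.ActualCost DESC;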

The information presented above is based on the original list published by Craig Freedman, and updated for 2008 R2.

One way to check that a query does not have any parallelism-inhibiting components is to test the query using a CPU cost multiplier.  This should only be done on a private test system where you are able to flush the whole plan cache after testing.  The idea is to use an undocumented and unsupported DBCC command to temporarily increase the CPU cost of the query plan operators.  It is not a fool-proof test (some rare parallelizable queries will not generate a parallel plan with this technique) but it is quite reliable:

-- Query to test (the select list was lost from the original
-- post; this one is reconstructed for illustration)
SELECT p.Name, th.TransactionID
FROM Production.Product AS p 
LEFT JOIN Production.TransactionHistory AS th ON 
    p.ProductID = th.ProductID;

The final commands to reset the CPU weighting factor and flush the plan cache are very important.
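
Based on the description above, the wrapper around the test query looks something like this (a sketch, assuming the undocumented command in question is DBCC SETCPUWEIGHT; again, private test systems only):

DBCC SETCPUWEIGHT(1000);  -- inflate estimated CPU costs by a factor of 1000
GO
-- ...request an estimated plan for the query under test here...
GO
DBCC SETCPUWEIGHT(1);     -- reset the CPU weighting factor to the default
DBCC FREEPROCCACHE;       -- flush plans compiled with the inflated weighting
GO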

If you get a parallel estimated plan for a particular test query, it shows that a parallel plan is at least possible.  Varying the value passed to the DBCC command adjusts the multiplier applied to normal CPU costs, so you will likely see different plans for different values.  The illustrated factor of a thousand is often enough to produce a parallel estimated plan, but you may need to experiment with higher values.  It is not recommended to use estimated plans obtained using this technique directly in USE PLAN hints or plan guides because these are not plans the optimizer would ever produce naturally.  To be clear, direct use of the plans would likely render a production system unsupported and the person responsible might be fired, shot, or possibly both.

Cardinality Estimation Errors

If there is nothing that absolutely prevents parallelism in the target query, the optimizer may still choose a serial alternative if it has a lower estimated cost.  For that reason, there are a couple of things we can do to promote the parallel option here, all based on the very sound notion of giving the optimizer accurate information to base its estimates on.  The considerations here go well beyond just ensuring statistics are up-to-date, or building them with the FULLSCAN option.  For example, depending on the nature of the query, you may need to provide all or some of the following:

  • Multi-column statistics (for correlations)
  • Filtered statistics or indexes (for more histogram steps)
  • Computed columns on filtering expressions in the query (avoid cardinality guesses)
  • Good constraint information (foreign keys and check constraints)
  • Materialize parts of the query in temporary tables (more accurate statistics in deep plans)
  • Regular hints such as OPTIMIZE FOR

In general, anything you can do to ensure that estimated row counts are close to the runtime values will help the optimizer cost the serial and parallel alternatives more accurately.  Many failures to choose a parallel plan are caused by inaccurate row counts.
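
As a sketch of the computed-column technique (dbo.Sales, OrderDate, and the statistics name are all hypothetical):

-- Filtering on an expression normally produces a guessed estimate:
SELECT COUNT_BIG(*) FROM dbo.Sales AS s WHERE YEAR(s.OrderDate) = 2011;

-- A (possibly non-persisted) computed column on the same expression
-- allows statistics to be built on it:
ALTER TABLE dbo.Sales ADD OrderYear AS YEAR(OrderDate);
CREATE STATISTICS stat_OrderYear ON dbo.Sales (OrderYear);

-- The optimizer can now match YEAR(OrderDate) to OrderYear and use
-- the histogram, improving the cardinality estimate:
SELECT COUNT_BIG(*) FROM dbo.Sales AS s WHERE YEAR(s.OrderDate) = 2011;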

Model Limitations

SQL Server uses a model to estimate the runtime cost of each operator in a query plan.  The exact calculations vary between operators, but most are based on a minimum cost, with an additional per-row component.  None of these estimates of expected CPU and I/O cost take into account the specific hardware SQL Server finds itself running on.  The advantage of this is that plans from one machine can be readily reproduced and compared on another machine running the same version of the software, without having to worry about hardware differences.

Not all operators can be costed reasonably, and things like functions are particularly problematic because the optimizer has no clue how many rows might be produced or what the distribution of values might look like.  Even very normal-looking operators can pose problems.  Consider the task of estimating the number of rows and distribution of values resulting from a join or a complex GROUP BY clause.  Even where reasonable estimates can be made, the derived statistics that propagate up the query tree (from the persistent statistics at the leaves) tend to become quickly less reliable.  The optimizer includes many heuristics that aim to prevent these inaccuracies getting out of control, so it might resort to complete guesses after only a few operators as the compounding effect of deriving new statistics takes hold.

There are many other assumptions and limitations of the model that will not fit into a blog post; the interested reader can find more detailed information in Chapter 8 of the indispensable SQL Server 2008 Internals book.

Costing Limitations

When SQL Server costs a parallel plan, it generally reduces the CPU cost of a parallel iterator by a factor equal to the expected runtime DOP.  For example, the previous query can produce the following serial and parallel plans:



Taking the Merge Join operator as an example, the parallel version has its CPU cost reduced by a factor of 4 when the expected runtime DOP is four (serial plan on the left, parallel on the right):


On the other hand, the Index Scans show no reduction in I/O cost, though the CPU cost is again reduced by a factor of four:


As mentioned earlier, different operators cost themselves differently (for example, a many-to-many merge join also has an I/O cost component, which happens to be reduced by a factor of four as well).  These details also vary somewhat between releases, so the presentation here aims to give you an appreciation of the general approach rather than to dwell too much on the specifics.

Looking again at the serial and parallel plans, it is clear that which of the two plans is cheaper depends on whether the parallel plan saves enough, by reducing CPU and I/O costs in the various operators, to pay for the extra operators in the plan.  In this case, the extra operators are three exchange (parallelism) operators – two Repartition Streams to redistribute rows for correct results when joining, and one Gather Streams to merge the threads back to a single final result.

The way the numbers work means that it is often a tight race between the best parallel and serial plan alternatives.  In many real-world cases, the difference between the two can be extremely small – making it even more frustrating when the serial version turns out to take fifty times as long as the parallel version to execute.  One other point worth mentioning again here is that the DOP estimate is limited to the number of logical processors that SQL Server sees, divided by two.  My test machine has eight cores, all available to SQL Server, but the DOP estimate used for costing calculations is limited to four.  This has obvious consequences for costing, where CPU and I/O costs are typically divided by the estimated DOP of four, rather than eight.
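
As a rough sketch of the arithmetic (the operator cost is invented purely for illustration):

-- 8 logical processors visible to SQL Server
-- costing DOP                   = 8 / 2 = 4
-- serial Merge Join CPU cost    = 0.40 (invented)
-- parallel Merge Join CPU cost  = 0.40 / 4 = 0.10
-- Index Scan I/O cost           : unchanged in the parallel plan
-- The parallel plan wins only if savings like these exceed the
-- cost of the added exchange (parallelism) operators.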

A Note about Parallel Nested Loops

Plans with Nested Loops joins can be a particular problem, because the inner side almost always runs multiple threads serially.  The parallelism icons are still present, but they indicate that there are DOP-many independent serial threads.  The distinction is perhaps a subtle one, but it (a) explains why operators that normally force a serial zone can run ‘in parallel’ on the inner side of a loops join; and (b) explains why the optimizer does not reduce the CPU costs on the inner side by the estimated runtime DOP.  This puts Nested Loops at an unfair disadvantage when it comes to parallelism costing, compared with Hash and Merge Joins.

Code Path Issues

This last category concerns the fact that the optimizer may not get as far as evaluating a parallel plan at all.  One way this can occur is if a final plan is found during the Trivial Plan stage.  If a Trivial Plan is possible, and the resulting cost is less than the configured cost threshold for parallelism, the full optimization stages are skipped and a serial plan is returned immediately.

Trivial Plan

The following query has an estimated serial plan cost of around 85 units, but with the parallelism threshold set to 100 a Trivial Plan is produced (as shown in the plan property ‘Optimization Level’, or by checking the changes in sys.dm_exec_query_optimizer_info as shown below):

-- Counters before compiling the test query (select lists were lost
-- from the original post and are reconstructed here)
SELECT [counter], occurrence, value
FROM sys.dm_exec_query_optimizer_info AS deqoi 
WHERE [counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2');
-- Query under test (select list elided in the original)
SELECT COUNT_BIG(*) FROM dbo.bigTransactionHistory AS bth;
-- Counters after; compare the occurrence deltas
SELECT [counter], occurrence, value
FROM sys.dm_exec_query_optimizer_info AS deqoi 
WHERE [counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2');


When the cost threshold is reduced to 84 we get a parallel plan…
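
For reference, the server-wide threshold can be changed like this; the option is documented, but it affects every query on the instance, so experiment on test systems only:

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'cost threshold for parallelism', 84;
RECONFIGURE;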


A deeper analysis shows that the query still qualified for Trivial Plan (and the stage was run) but the final cost exceeded the parallelism threshold so optimization continued.  This query does not qualify for ‘search 0’ (TP or Transaction Processing) because a minimum of three tables are required there.

So, optimization moves on to ‘search 1’ (Quick Plan) which runs twice.  It runs once considering only serial plans, and comes out with a best cost of 84.6181.  Since this exceeds the threshold of 84, Quick Plan is re-run with the parallel plan option enabled.  The result is a parallel plan at cost 44.7854.  The plan does not meet the entry conditions for ‘search 2’ (Full Optimization) so the finished plan is copied out.

Good Enough Plan & Time Out

Returning to code path reasons that prevent a parallel plan, the last category covers queries that enter the Quick Plan stage, but that stage terminates early, either with a Good Enough Plan Found message, or a Time Out.  Both of these are heuristics to prevent the optimizer spending more time optimizing than it stands to gain by reducing estimated execution time (cost).  Good Enough Plan results when the current lowest cost plan is so cheap that further optimization effort is no longer justified.

Time Out is a related phenomenon: at the start of a stage, the optimizer sets itself a ‘budget’ of a number of rule applications it estimates it can perform in the time justified by the initial cost of the plan.  This means that query trees that start with a higher cost get a correspondingly larger allowance of rule applications (roughly comparable to the number of moves a chess program thinks ahead).  If the optimizer explores the allowed number of rules before the natural end of the optimization stage, it returns the best complete plan at that point with a Time Out message.  This may well occur during the first run of ‘search 1’, preventing us reaching the second run that adds parallelism.

One interesting consequence of the rule concerning Trivial Plan and the cost threshold for parallelism is that a system configured with a threshold of zero can never produce a Trivial Plan.  Bearing this in mind, we can generate a surprising Time Out with this query:

SELECT * FROM Production.Product AS p

As you would expect, this query is normally optimized using Trivial Plan (there are no cost-based plan choices here):


…but when the cost threshold is set to zero, we get Full Optimization with a Time Out…the optimizer timed out working out how to do SELECT * from a single table!


In this particular case, the optimizer ‘timed out’ after 15 tasks (it normally runs through many thousands).  A Time Out result can sometimes also be an indicator that the input query is over-complex, but the interpretation is far from straightforward.

The Solution

We need a robust query plan hint, analogous to MAXDOP, that we can specify as a last resort when all other techniques still result in a serial plan, and where the parallel alternative is, of course, much to be preferred.  I really want to emphasise that very many cases of unwanted serial plans are due to designers and developers not giving the optimizer good quality information.  I see very few systems with things like proper multi-column statistics, filtered indexes/statistics, and adequate constraints.  Even less frequently do I see (perhaps non-persisted) computed columns created on query filter expressions to aid cardinality estimation.  On the other hand, non-relational database designs with poor indexing, and decidedly non-relational queries, are extremely common.  (As are database developers complaining about the poor decisions the optimizer makes sometimes!)

There’s always a Trace Flag

In the meantime, there is a workaround.  It’s not perfect (and most certainly a choice of very last resort) but there is an undocumented (and unsupported) trace flag that effectively lowers the cost threshold to zero for a particular query.  It actually goes a little further than that; for example, the following query will not generate a parallel plan even with a zero cost threshold:

-- (select list reconstructed; the text below describes the first
--  row from a scan joined to a single row from a seek)
SELECT TOP (1) p.Name, th.TransactionID
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID;


This is a completely trivial query plan of course – the first row from the scan is joined to a single row from the seek.  The total estimated cost of the serial plan is 0.0065893.  Returning the cost threshold for parallelism to the default of 5 just for completeness, we can obtain a parallel plan (purely for demonstration purposes) using the trace flag:

-- Same query, with the undocumented and unsupported trace flag:
SELECT TOP (1) p.Name, th.TransactionID
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
OPTION (QUERYTRACEON 8649);


The parallel alternative is returned, despite the fact that its estimated cost is much higher, at 0.0349929 (5.3 times the cost of the serial plan).  In my testing, this trace flag has proved invaluable in certain particularly tricky cases where a parallel plan is essential, but there is no reasonable way to get it from the standard optimizer.


Even experts with decades of SQL Server experience and detailed internal knowledge will want to be careful with this trace flag.  I cannot recommend you use it directly in production unless advised by Microsoft, but you might like to use it on a test system as an extreme last resort, perhaps to generate a plan guide or USE PLAN hint for use in production (after careful review).

This is an arguably lower risk strategy, but bear in mind that the parallel plans produced under this trace flag are not guaranteed to be ones the optimizer would normally consider.  If you can improve the quality of information provided to the optimizer instead to get a parallel plan, go that way :)

If you would prefer to see a fully supported T-SQL OPTION (MINDOP) or OPTION (PARALLEL_PLAN) hint, please vote here:

Update: For SQL Server 2016 SP1 CU2 and later, there is a new undocumented hint that does not require administrator privileges:


See this post by Erik Darling and this one by Dmitry Pilugin.

© 2011 Paul White
Twitter: @SQL_Kiwi

Published Friday, December 23, 2011 10:09 AM by Paul White




TheSQLGuru said:

Great information as usual Paul!  One nugget I wasn't aware of was this "The T-SQL scalar function restriction is also a bit sneaky.  Any reference to a table with a computed column that uses such a function will result in a serial plan, even if the problematic column is not referenced in the query.".  Adds to my "Death by UDF" beliefs! :-)  

I also like your possible use of this undocumented/unsupported TF in a non-production environment as a way to get a parallel plan on which you can base a plan guide!

December 22, 2011 4:49 PM

Alejandro Mesa said:


Great post, as always.

The timing is also perfect, and I included it in my letter to Santa (I already voted, one up). I hope we can get this gift, but as you clearly stated, we developers have to be good with the optimizer.



December 22, 2011 5:41 PM

Paul White said:

Hi Kevin,

Thanks!  And you're welcome for the nugget :)


December 22, 2011 5:59 PM

Paul White said:

Hey Alejandro,

I just *knew* Santa was real all along :)


December 22, 2011 6:16 PM

Greg Linwood said:

Paul, this is another well presented article and I agree there should be better hints for forcing parallelism in the relatively few cases when it's needed.

However, your post will read to many inexperienced readers as if parallelism is always good, propagating the SQL myth that it should always be left switched on (as, unfortunately, it is by default).

Many SQL OLTP systems have suffered severe performance degradation from uncontrolled query parallelism & it is unfortunate that Microsoft leaves this feature switched on by default. Note that Oracle does the opposite - you have to configure Oracle specifically to parallelise queries, which is generally done for Data Warehouse systems, which are the vast minority of installed systems both on Oracle & SQL. Oracle has this the right way around imo as more systems benefit from the default configuration & those that genuinely need parallelism can switch it on as needed.

Your post is insightful but many will read into it that parallelism is universally good & should be used widely when the opposite is in fact that case. Most SQL systems are OLTP, not Data Warehouse & OLTP systems are generally configured with multi CPU cores to support concurrent queries, not parallelism within queries.

Your post also leaves out the crucial context of impact from concurrent workload. As with other recent posts on query parallelism, your performance measurements are taken in a clinical single query scenario. If you ran multiple instances of these queries & compared the serial vs parallel performance your results would highly likely be very different & reflect better what SQL Server generally does in real world workloads - processing many queries concurrently. Even Data Warehouses often need to support concurrent query processing so hard-wiring parallelism into query hints is questionable when the developer doesn't know for sure up front whether the query will always have all CPUs available to it.

December 22, 2011 6:47 PM

Paul White said:

Hi Greg,

Thanks for taking the time to post such a detailed commentary, and I appreciate the kind words regarding presentation and insight!

This is a very tightly focussed post aimed at any advanced developer that has ever struggled to convince the query optimizer that a parallel plan really is the best way to go.  Implicit in that is the idea that parallelism, in the specific case considered, is not only desirable but pretty much essential.

OLTP-type workloads are not the target of this post, as these are already well-served with configuration options (workload groups, server MAXDOP) as well as the query-level MAXDOP(1) hint as noted at the top of this entry.

Certainly, true high-concurrency OLTP-type workloads do not often require parallelism to keep schedulers busy doing useful work.  For those systems, very selectively overriding server MAXDOP using a hint (or using a different workload group for Resource Governor) can be the way to go, in case there are off-peak queries that would benefit and where sufficient processor resource is likely to be available.

Nevertheless, regardless of the system, there will always be times where the occasional query fails to achieve desirable parallelism due to the causes noted in this entry and in my Connect item.  In those limited cases, I would hope the details presented here still add value.

My posts are somewhat aimed at advanced practitioners, so I do rather assume the reader knows not to over-drive parallelism, as well as being aware of concurrency implications.

Thanks again for the comment.



December 22, 2011 7:14 PM

Greg Linwood said:

I follow your intent & agree that it might be good to have more options to force parallelism in specific Data Warehousing scenarios but I can also see how this would very likely be a mis-used feature in OLTP workloads

My concern is that parallelism is already a widespread problem with OLTP workloads & articles such as this are easily mis-interpreted to mean that it should be used even more widely. You might be targeting an expert audience but a wide variety of readers will interpret this information as meaning it applies to their systems & given that most SQL users are OLTP users, it's important to point out the difference between applicability to OLTP vs DW systems.

December 22, 2011 7:47 PM

GrumpyOldDBA said:

Very interesting post and helps explain why I often get that "why isn't it using a parallel plan" moment.

However as one who has been pushing parallelism for many years I still have to take up this OLTP vs DSS "thing" I generally support operational systems, they are not data warehouses so I guess they could be OLTP but I've never believed in this black and white view of database applications. I've struggled over the years with decisions to disable parallelism on systems which have clearly needed parallelism to scale and in fact with falling clock speeds putting your app on 80 cores with parallelism turned off is never going to achieve very much ( in my view ). However in my "own world" I've been tuning queries by often having to force parallelism for many years - in fact I think I've commented on this subject on your blog before.

As ever I'm grateful for your work which is awesome and enhances my knowledge of SQL Server constantly - Have a good christmas.

December 23, 2011 8:18 AM

Paul White said:

Hi Colin,

Very high quality comment as always.  Have a great Christmas too!


December 23, 2011 8:33 AM

Greg Linwood said:

Hi Grumpy, if query parallelism is good in operational / OLTP DBs why does Microsoft switch it off during officially audited TPC benchmarks?

To check this yourself:

Pull down the Supporting files, extract files, open ./Clause2/Utility/Database_Options_2.sql

You'll see they leave MAXDOP set to 0 during the data load phase prior to the benchmark setup but then explicitly set it to 1 during runtime for the benchmark test.

They have been doing this for years with all OLTP Benchmark Kits & they don't certify benchmarks if the hardware vendor who runs the test alters the configuration.

Leaving query parallelism in the default configuration of zero is simply wrong for OLTP workloads. It only makes sense even vaguely in Data Warehouse workloads.

December 23, 2011 4:33 PM

a.m. said:


Is it possible that Colin's mixed-workload OLTP systems don't conform to your idea of OLTP? Seems quite likely; based on your comments, it's become clear your idea of a data warehouse workload doesn't conform with mine.

Apparently -- surprise, surprise -- there are a huge number of different types of workloads out there, and it's tough to make generic recommendations that work for all of them. Hopefully our readers are smart enough to figure that out on their own, so I think your concerns are rather unfounded.


December 23, 2011 10:07 PM

Greg Linwood said:

I'm referring to the Transaction Processing Council's definition of OLTP, which includes Microsoft's agreement with Oracle & IBM on how OLTP should be defined for performance measurement purposes:

How could it be any clearer that my definition agrees with Microsoft's?

December 23, 2011 10:44 PM

Paul White said:

I tend to agree with the views of Colin and Adam, though there may well be workloads out there that are pure OLTP (whatever that means these days), where parallelism is rarely required.  I say rarely, because in my personal experience most mixed-workload systems have queries that absolutely do benefit from parallelism.

To me, 'pure OLTP' implies a system with a very high volume of concurrent navigational queries that affect very few rows, performing many joins and typically using nested loops to do so.  These are exactly the sorts of queries that SQL Server will generally not invoke parallelism for anyway: this is exactly what the trivial plan and transaction processing stages of optimization are for.

As discussed in the main text, neither of these optimizer stages are capable of producing a parallel plan, so any queries that get as far as the second run of Quick Plan optimization must at least appear to the optimizer as benefiting from parallel execution.  In that case, something must be wrong with the query, database design, or indexing - things that are probably not best 'solved' simply by enforcing MAXDOP 1 globally.

It is also perfectly possible for a parallel query to use less CPU while executing faster than the equivalent serial query.  This is a point I explored in a previous post:

I don't personally have much of a problem with the default configuration being MAXDOP 0 with a cost threshold of 5, but it seems to me that any case for change would be better sited on Connect than in the comments section of a blog post like this one.


December 24, 2011 12:39 AM

John Alan said:

Paul thanks for another excellent insight into the QO.

These flags are certainly not to be used as startup options ;-) but would help troubleshoot certain 'heavy' queries that insist on running serially despite the lack of parallelism-inhibiting components.

This computed column/scalar function combo is something worth a look. I've got one on a critical table where parallelism is a distant dream.

Good work!

December 25, 2011 3:56 AM

Alexander Kuznetsov said:

Merry Christmas, Paul!

As a geek, I did enjoy reading this, and thank you for writing this up.

However, putting on my business hat, I have very mixed feelings.

First, when we are developing solutions, the following business requirements usually influence the choices we make when we implement them:

1. Is this a one time task, or is this going to run all the time from multiple connections, so we need the best, the fastest possible algorithm.

2. What performance is good enough, like "it should return under 30 seconds, and use up as little resources as possible", or "executive priority, return ASAP, let everyone else get out of its way and wait".

3. Which resources (memory, CPU) should be used in moderation.

If we had a clear and direct way to communicate these requirements to the optimizer, SQL Server would be a more useful tool. Maybe I am missing something, but good developers just do not implement solutions until they get these requirements straight, until they know what the customers want. Maybe, just maybe, the optimizer should let its customers communicate what they want more directly. Please correct me if I am wrong.

Second, when query tuning get this complicated, other alternatives get more profitable. It may be cheaper to add hardware, and/or develop more in other languages, and let SQL Server only do simpler tasks.

For example, we recently had a problem when the optimizer was not giving up a parallel plan. After verifying that all needed statistics were present and up to date, we just removed several conditions from WHERE clause, moving some filtering to the client (about 5% rows were filtered on the client). As a result, we ended up with a simpler query that was getting a parallel plan for all combinations of parameters, delivering good and predictable performance.

Moving some filtering to the client was a very simple solution, it did not need a highly skilled expert and/or time-consuming research. Also I think it resulted in more predictable performance.

Because the SQL Server optimizer is a closed source thing, we cannot just debug through the process of choosing the plan and see what is happening. As a result, in some cases we are getting the feeling that optimizing a complex query cannot be done - we are likely to get back to it later on, losing time and money every time we have to re-optimize, and eroding customers' confidence.

From my perspective, this seems to be yet another situation when complexity sucks, and a simpler optimizer might be a better fit for my customers' requirements.

What do you think?

December 26, 2011 9:57 PM

Paul White said:

Hi Alex,

I agree with your views to a very great extent.  Complexity does suck, by and large, and there are many variables to be juggled in the real world.

On the other hand, a good part of your argument seems to go in favour of extra hints like OPTION (PARALLEL_PLAN): this gives us a direct way to communicate what we want to the optimizer, and is likely to result in predictable performance without e.g. moving filtering to the client.

I would very much like to be able to construct query plans directly, much in the way SSIS allows us to build data flows, but I doubt that level of control will ever be afforded to us.

It's a very interesting area overall.  Merry Christmas to you also, and I'm glad your inner geek enjoyed this post :)


December 26, 2011 11:21 PM

GrumpyOldDBA said:

I generally work in an isolated environment - or, if you like, I don't work with anyone with in-depth SQL knowledge - so blogs are an important source for me; Paul and Linchi especially put out material of great interest to me (I also read many others). I'd probably say that in general I support systems which have challenges with performance; over the years these have been in banking, pharmaceuticals, travel and "other" financial systems - many international. Performance tuning is vital, and I don't really know how you'd classify these databases - they're not a data warehouse (I have been there too, and used lots of parallel stuff there). Sadly, I admit to reading the full disclosure of many benchmarking tests, but do I think these have much relevance to what I do? Probably not! Do I find the many blogs relevant? Absolutely, and every so often there's something which shines a light on something I'm working with. Sitting outside the OLTP vs DSS thing, it just makes sense to realise that parallel processing must be the way to go. I've enjoyed the discussion on this one Paul, Alex, Greg, Adam.

December 28, 2011 7:41 AM

Alexander Kuznetsov said:


I definitely do want to be in control, as this seems to be more profitable - we lose time and money when we spend too much time trying to figure out how a closed-source, black-box optimizer works.

I am not sure "more hints" is the right term, because most hints may be silently ignored. A stronger word is needed, so that when I execute a query and require - not hint - "and BTW your budget is 1 CPU and 4GB", the query either uses no more than 4GB, or throws an exception, just as when I misspell a column name. Does that make sense?

I frequently write my own execution plans in C#; it is one of my best/easiest options whenever the optimizer is struggling. Because much of my data is temporal, an RDBMS is not a perfect tool for it anyway, so I would not blame the optimizer for struggling with time series and such. Anyway, in the best case a self-join developed in C# outperformed an "already optimized" set-based query by twenty thousand times, with little effort on either the SQL or the C# side. Because T-SQL is quite verbose and C# can be very succinct, the C# code was shorter and more readable than the T-SQL.

In addition, it is so very much easier to unit test C# code than to unit test T-SQL, and NUnit tests using mocks run much faster.

So, I am writing less and less complex T-SQL. This is especially true since I saw the pricing on Enterprise Edition. I am currently running 2008 R2 Enterprise Edition, but I will definitely not upgrade to 2012 Enterprise Edition. This is yet another incentive to move complex processing off SQL Server, especially for frequently running queries. What do you think?

December 28, 2011 10:03 PM

Paul White said:


Most hints (e.g. INDEX, LOOP JOIN) are already directives, though some (e.g. ROWLOCK) are not, that is true.  I was thinking more of directives, to be clear about it.  As far as CPU and memory budgets are concerned, isn't that what Resource Governor gives you?  Admittedly, this does require Enterprise Edition.
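A minimal sketch of the Resource Governor budgets mentioned above (Enterprise Edition only); the pool and group names are made up for illustration, and the classifier function that routes sessions to the group is omitted:

```sql
-- Cap CPU and memory for a class of sessions (names are illustrative):
CREATE RESOURCE POOL BudgetPool
    WITH (MAX_CPU_PERCENT = 25, MAX_MEMORY_PERCENT = 25);

CREATE WORKLOAD GROUP BudgetGroup
    WITH (MAX_DOP = 1)          -- the RG DOP limit is absolute
    USING BudgetPool;

ALTER RESOURCE GOVERNOR RECONFIGURE;
```

Note that this throttles rather than throws, so it is not quite the hard "budget or exception" semantic Alexander describes.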

I agree that SQL Server is in desperate need of native temporal support, though CLR functions and aggregates are a good workaround in many cases while we are waiting for that to arrive.  In general, yes, I do agree that simple SQL is best - for all sorts of reasons, not just optimizer-related issues.


January 2, 2012 10:02 PM

Jungsun Kim said:

Thank you for the good article, Paul.


January 3, 2012 10:04 AM

Nick Beagley said:

Great article, thanks

I was under the impression that when you limit MAXDOP through Resource Governor, it doesn't act like the instance-wide maxdop setting, or even the MAXDOP option within a query; rather, it affects the number of cores the query will run on, not the number of cores for which the plan is generated.

For example, if you limit MAXDOP to 1 with Resource Governor on an instance that has been configured for a higher MAXDOP, a parallel plan will be generated, but it will be run serially - different behaviour from that seen when the instance has been set to a global MAXDOP of 1.

January 19, 2012 11:25 PM

Paul White said:

Hi Nick,

There are a number of differences between RG-enforced DOP and the max degree of parallelism option, for example the latter can be overridden per query with an OPTION (MAXDOP n) hint, but the RG limit is absolute.
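The difference Paul describes can be sketched as follows; the table name is a placeholder:

```sql
-- Instance-wide default: serial plans only.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;

-- A query hint CAN override the sp_configure setting...
SELECT COUNT_BIG(*) FROM dbo.BigTable OPTION (MAXDOP 4);

-- ...but it CANNOT exceed a MAX_DOP limit set on the session's
-- Resource Governor workload group: that cap is absolute.
```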

It may be possible to generate and cache a parallel plan that is then restricted to running serially via RG, but this isn't a scenario I have ever tested for myself.  I did write before about how SQL Server can execute a cached parallel plan serially by removing exchanges at run time:

Thanks for the question!


January 22, 2012 8:33 AM

SomewhereSomehow said:

Thanks for the article.

By the way, talking about not using a trivial plan: maybe it would be interesting to you that there is a special trace flag that allows you to skip the trivial plan phase - it's 8757; you may try it.

Also, I've noticed that you are strongly interested in the optimizer, so this may interest you: I have discovered a set of undocumented (or even never mentioned anywhere) flags that allow you to:

April 25, 2012 12:46 PM

Paul White said:


Thanks but I 'discovered' those a long time ago :)

I've been presenting on this topic at user groups and SQL Saturdays over the last month or so.  You can download the slides and demos at e.g.


April 25, 2012 2:08 PM

SomewhereSomehow said:


Well, ok then, thanks!

April 25, 2012 2:47 PM

SomeName said:

Great article!

I've published some interesting test results to

Maybe you will find them interesting too, and can explain to the community what happens?

May 22, 2012 10:54 AM

Paul White said:

Hi SomeName (aka M2M telematics),

Thank you.  The tipping point for parallelism for me on that query (also on 2012) is 2257.  The estimated cost for the parallel 2257 plan is 11.2158; for the serial 2258 plan it is 11.2162.  It is often possible to find the 'edge' where the costing component produces very close costs for serial and parallel options.

In this case, the cost savings from performing a parallel scan are offset by the increased estimated cost of the conversion in the Compute Scalar.  The cost of that conversion is related to the size of the data type, so it makes sense that varying that parameter affects the balance between serial and parallel options.


May 22, 2012 11:21 AM

SomeName (aka M2M telematics) said:

Hello again )

It's a good and reasonable explanation, but it concerns only 2) and 3).

The key point of this test is query 1) vs 2) and 3).

Query 1) doesn't contain a conversion at all, and the Text field contains far fewer than 695 characters in every row, yet the plan doesn't contain parallelism.

Have I missed something?

What do you think?

May 23, 2012 2:47 AM

Paul White said:

For query 1, the serial plan's estimated cost is 10.6047; forcing a parallel plan with TF 8649 gives an estimated cost of 12.0579. So the optimizer chooses the serial alternative - the lower cost of the parallel scan is not enough to pay for the Gather Streams operator. If your cost threshold for parallelism is set to 11 or higher, the plan will be a TRIVIAL one, and parallelism will not even be considered.
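A sketch of the comparison just described - the query shape and names are placeholders, the quoted costs come from the discussion above, and TF 8649 is undocumented:

```sql
-- Chosen serial plan: estimated cost 10.6047
SELECT SomeColumn FROM dbo.SomeTable;

-- Forced parallel alternative: estimated cost 12.0579, so it loses -
-- the parallel scan saving does not pay for the Gather Streams exchange
SELECT SomeColumn FROM dbo.SomeTable
OPTION (QUERYTRACEON 8649);

-- With the threshold at 11 or higher, the 10.6047 plan qualifies as
-- TRIVIAL and parallelism is never even costed:
EXEC sp_configure 'cost threshold for parallelism', 11;
RECONFIGURE;
```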


May 23, 2012 3:07 AM

Alex Bransky said:

This post helped me a lot, except for one thing:  You have to be a sysadmin to use QUERYTRACEON.  I have reports in SSRS called by a website using a SQL user for authentication and I'd like that user to be able to force parallelism since the SQL engine doesn't want to (which is baffling).  But it seems like a big security risk to make that user a sysadmin.

December 5, 2012 7:02 PM

Paul White said:

Hi Alex,

Yes, that is a big restriction, and another good reason not to use this trace flag in production.  We really need a query hint - so I hope you voted and commented on the Connect item :)


December 5, 2012 11:38 PM

Alex Bransky said:

Okay now I'm really confused.  I have a login without sysadmin rights that is able to execute a stored proc that contains QUERYTRACEON.  When that login executes DBCC TRACEON it gets the message below.

Msg 2571, Level 14, State 3, Line 1

User 'Website' does not have permission to run DBCC TRACEON.

How does this make any sense?

December 10, 2012 2:26 PM

Paul White said:


Ownership chaining allows OPTION (QUERYTRACEON 8649) in a procedure where separate DBCC TRACEON (8649) and DBCC TRACEOFF (8649) statements would fail permission checks.  This may or may not be fully by-design behaviour, but it is the way it works.
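A sketch of the pattern under discussion - the procedure and table names are made up, 'Website' is the login from Alex's comment, and QUERYTRACEON 8649 remains undocumented and unsupported:

```sql
CREATE PROCEDURE dbo.ForceParallelCount
AS
BEGIN
    SELECT COUNT_BIG(*)
    FROM dbo.BigTable
    OPTION (QUERYTRACEON 8649);  -- succeeds for callers via ownership chaining
END;
GO

GRANT EXECUTE ON dbo.ForceParallelCount TO Website;

-- Executing the procedure as 'Website' works, while a direct
-- DBCC TRACEON (8649) from that login fails with Msg 2571.
```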

See the links below for information on ownership chaining (if necessary):


December 10, 2012 5:23 PM

Alex Bransky said:

Thanks!  I figured it had to be something like that, but I didn't realize ownership chaining went so far as effectively to allow sysadmin rights (based on the contents of the sproc).  I will see The Hobbit this weekend in your honor.

December 10, 2012 6:18 PM

TheSQLGuru said:

This trace flag just saved my bacon Paul!!  Had a very complex/ugly nested loop, outer apply, spatial mess of a query that just wouldn't parallelize despite 4-figure query cost.  I thought the trace flag for SQL 2012 TOP-slow-query would do it for me (there was a top in the query) but it didn't.  8649 got me a parallel plan and cut run-times by an order of magnitude+.  And for something I had to iterate almost 400 times (and schedule for once a day going forward) that REALLY made me happy!! :-)

February 8, 2013 3:31 PM

crokusek said:

Going a step further, it would be nice to be able to direct a parallel plan for a specific operation, say a union or join.  Same with mindop/maxdop.

I have a case where each side of a "union all" takes 0.5 and 0.3, and when run with MAXDOP 1 it shows an expected 0.8.  When I encourage a parallel plan using the hint you suggest, the parallelization occurs not at the union but at multiple levels.  When using MAXDOP 2, the optimizer did not place the parallel operation at the union point, which in my case at least would likely result in about a 40% improvement (assuming each side of the union does not interact, which is the case).

Thanks for the article!

March 25, 2013 12:37 AM

Paul White said:

Hi crokusek,

Not sure I follow all that completely (it's hard to describe plans in words sometimes!) but I do agree it would be nicer to have finer control over parallelism sometimes. Having a documented way to generate a good parallel plan would be a good start...

April 17, 2013 7:34 PM

Paul White: Page Free Space said:

The query optimizer does not always choose an optimal strategy when joining partitioned tables. This

June 16, 2013 11:22 AM

Adam Machanic said:

" Even experts with decades of SQL Server experience and detailed internal knowledge will want to be

July 11, 2013 10:33 AM

Eugene Karpovich said:

Hi Paul, thanks again for this article.

Actually, it is not the first time I read it, but this time I just wanted to share with you one of my observations that helped me force (or I should probably say just "get") a parallel plan in one of my particular scenarios.

The main thing is this: it seems the SQL optimizer makes a correct estimate of the number of rows in a table variable when that table variable 1) is queried inside dynamic SQL, and 2) is passed there as a parameter (which essentially means it is of a user-defined type).

It is probably a well-known fact, but at least I was not aware of it before.

The root cause of being stuck with a serial plan in my particular case (or at least what seemed to be the root cause) was incorrect cardinality estimation, which in turn was caused by the fact that I was using table variables inside one of my functions.

For myself, I found the following explanation: since the optimizer under "normal" conditions estimates the output of a table variable as a single row, it greatly under-estimated the effort needed to execute my query, which led to a compile-time decision that a serial plan would be good enough.

When I wrapped a call to my function in a dynamic SQL statement (I did it for reasons irrelevant to my main point here), I suddenly got a perfect parallel execution plan.

It was when I started to dig further into this that I discovered the main thing that I am sharing here.


P.S. I have a script that proves my point in case you are interested - not sure how to attach it here though..

P.P.S. There is a small chance you may actually remember me - I am the guy who approached you in the intermission of your first session at PASS 2012; my question was related to the main subject here - forcing a parallel execution plan using dynamic SQL..

July 14, 2013 8:13 PM

Paul White said:

Hi Eugene,

Yes I do remember your question at the Summit - it's good to hear the details. It is true that the cardinality of a table variable is most often estimated at a single row, but the optimizer can see the true cardinality if a statement-level recompile occurs. The easiest way to demonstrate that is to add OPTION (RECOMPILE) to the statement (though there are other causes of statement-level recompilation of course).


DECLARE @T table (c integer NOT NULL);

INSERT @T VALUES (1), (2), (3);

SELECT * FROM @T;                      -- estimated rows: 1
SELECT * FROM @T OPTION (RECOMPILE);   -- estimated rows: 3



The post-execution (actual) execution plan shows an estimate of 1 row for the first SELECT, and 3 rows for the second SELECT. Table variables don't support statistics though (not even those normally associated with indexes) so the optimizer still doesn't have all the information it could have, but this technique does increase your chances of a parallel plan as you have seen.

I realize you weren't using OPTION (RECOMPILE) in your example, but what you saw was the result of a statement-level recompilation.


July 14, 2013 9:46 PM

Eugene Karpovich said:

Thank you for the information, Paul

July 16, 2013 11:08 AM

maruf said:

very nice article ...

November 5, 2013 9:34 AM

Alan Burstein said:

I just ran across this -- amazing work as usual Paul! I always learn so much from your articles.

I have a question and a comment:

Can you point me to any articles that explain why mTVFs never produce a parallel plan? I have seen this discussed but have never found it documented.

Turning mTVFs into iTVFs has become one of my favorite low-hanging-fruit performance optimizations; it's something I have been using lately with great success whenever possible. My experience has been that many developers create mTVFs because they don't know any better. I see lots of this:


RETURNS @t TABLE (c1 int, c2 int)
AS
BEGIN
    INSERT INTO @t SELECT blah FROM blahblah;
    RETURN;
END

and have had huge success changing them to



RETURNS TABLE
AS
RETURN
    SELECT blah FROM blahblah;

December 11, 2013 2:10 PM

Paul White said:

Hi Alan,

Multi-statement table-valued functions force a serial branch in the calling plan, but parallelism can exist in other branches of the same plan.

It is also possible to see parallel execution within the TVF, though naturally not on the statement that modifies the table variable.

Replacing TVFs with in-line functions ("parameterized views") is very often a very good idea, I agree.

I'm afraid I don't know why the serial-branch restriction on TVFs exists. It may be because the function might contain its own parallelism, and it would be difficult to ensure the DOP concurrent thread limit is respected in case a parallel plan calls a parallel TVF, but that's just a guess.

Thanks for the kind words about the blog, I appreciate it :)


December 12, 2013 6:00 AM

Paul Vaughan said:

I found this post to be well written and informative. I would note however that as with most hints, a hint to force parallelism might be rather brittle. I would recommend instead a Cardinality hint, such as the one implemented in Oracle. Instead of directing the planner about what to do, it allows the developer to give it a better clue about the cardinality of intermediate result sets. In my personal battles with failing to parallelize, the problem has always stemmed from mis-estimating the cardinality involved.

May 27, 2014 3:01 PM

Mandeep said:

Hi Paul,

Thanks for this wonderful article, it helped me  with my current issue.


September 4, 2014 1:05 AM

Mark said:


Interesting tidbit on the pseudo-parallelism on the inner side of a nested loops join. But what is the point of having multiple threads running serially compared to having a dedicated worker processing the inner side? Wouldn't the performance be about the same, but also fewer context switches in the latter case?

November 1, 2014 5:30 PM

Esmael said:

Thank you!

November 4, 2014 9:28 AM

Paul White said:

Mark, there are indeed DOP dedicated threads on the inner side that pick up the next correlation value on each rebind. The effect is to have DOP threads running serially, so I don't think there's a difference. Please feel free to restate the point if I misunderstood you.

November 6, 2014 7:51 PM

Mark said:


Sorry, what I meant to ask was, what advantage does a parallelized 'loops join' offer over a non-parallelized one in terms of speed, given that the inner side runs serially in both cases?

November 8, 2014 4:13 PM

Paul White said:

A parallel nested loops join at DOP 4 has 4 serial threads running the inner side independently (with different correlated parameters) *simultaneously*.

There is nothing complicated about the inner side, there are no extra threads or exchanges (in general). At e.g. DOP 4, the four threads running the inner side simultaneously are the same four threads running the outer side.

See also my article on (though it isn't specifically about nested loops, the concepts of threads and branches are very relevant):

November 10, 2014 1:45 AM

Mark said:

Got it. Thanks.

November 10, 2014 8:00 AM

Blaž Dakskobler said:

Hi Paul,

Great article!

I was wondering, is the list of "Parallelism-Inhibiting Components" valid for SQL Server 2012 and 2014 as well?

November 24, 2014 8:24 AM

Paul White said:

Hi Blaž,

Off the top of my head, the only one that might be out of date is the common spool one, I'll check it out and update the post for 2012 and 2014 when I get time.


November 24, 2014 8:33 PM

Lubor said:

There were trace flags enforcing parallel plans even on single-CPU boxes. We needed them when shipping the first release of SQL Server with parallel plans (7.0), since there were only a few multi-CPU machines in our labs (no multi-cores then), and that was the only way for testers to see parallel plans on their single-CPU boxes. I'm not sure if the trace flags still work; they were never made public, so they don't need to be supported. Note: I'm not working at Microsoft any more, so I don't have access to the code; I was the PM for QO and QP from 7.0 until 2005 shipped.

December 15, 2014 11:23 PM

Paul White said:

Thanks for the insight, Lubor! Very interesting!

December 16, 2014 12:52 AM

Kendra Little said:

Hi Paul,

I'm wondering if accessing an in-memory (aka memory-optimized, aka Hekaton) table should go in one of the lists of causes of serial plans.

The Books Online page about memory-optimized table variables contains the statement:

"Like memory-optimized tables, memory-optimized table variables ... do not support parallel plans."

I haven't done much testing around this yet, though, and was wondering if you happen to know more.



January 7, 2015 5:58 PM

Paul White said:

Hi Kendra,

The list was never meant to be exhaustive, but perhaps it would be useful to make it so. I'll give it some thought, and possibly update the post again to include all the scenarios I can think of.

January 13, 2015 8:26 AM

Mike West said:

"I say rarely, because in my personal experience most mixed-workload systems have queries that absolutely do benefit from parallelism."

This is so important for most real-world applications. I'm still looking for a server that's not a SharePoint box that would operate more efficiently at 1 than at 8.

Now, to be brutally honest, I can't tune to the degree Paul can, but I honestly haven't seen an OLTP box that would benefit from setting it to 1.

I think the shared-tenant model exacerbates this even further. How many real-world boxes have just one really well-performing database on them?

One last thought: I'd like to thank Paul and Adam for corresponding with me on this topic. I used to think I knew what parallelism was, but didn't, and both Paul and Adam helped me out via articles and through personal communication.

April 28, 2015 8:10 AM

Tim Cartwright said:

Paul, above you stated:

"Inconveniently, the list of intrinsic functions is quite long and does not seem to follow a pattern.  ERROR_NUMBER and @@TRANCOUNT also force a serial plan, @@ERROR and @@NESTLEVEL do not."

Does that mean that if we use ERROR_NUMBER() or @@TRANCOUNT in a CATCH block, it will cause our plan to become serial? That is perturbing, as my standard error handler for 2008 and earlier is below. I could remove the @@TRANCOUNT check, though it technically should be there to roll back; but removing the call to ERROR_NUMBER would break the logic of my CATCH block. With 2012 and higher I just call THROW;. Can you confirm whether my pattern is invalid, or do I need to change it?







BEGIN CATCH
   /* server 2008 rethrow :( */
   IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

   DECLARE
      @ErrorMessage nvarchar(4000) = ERROR_MESSAGE(),
      @ErrorNumber int = ERROR_NUMBER(),
      @ErrorSeverity int = ERROR_SEVERITY(),
      @ErrorState int = ERROR_STATE(),
      @ErrorLine int = ERROR_LINE(),
      @ErrorProcedure nvarchar(200) = ISNULL(ERROR_PROCEDURE(), '-');

   SELECT @ErrorMessage = N'Error %d, Level %d, State %d, Procedure %s, Line %d, Message: ' + @ErrorMessage;

   RAISERROR (@ErrorMessage, @ErrorSeverity, 1, @ErrorNumber, @ErrorSeverity, @ErrorState, @ErrorProcedure, @ErrorLine);
END CATCH;


May 20, 2015 9:17 AM

Paul White said:

Hi Tim, no - those intrinsics only affect the *statement* in which they appear. Other statements in the batch may still use parallelism.
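A sketch of the per-statement scope Paul describes; the table and column names are placeholders:

```sql
-- This statement references @@TRANCOUNT, so it alone is forced serial:
SELECT @@TRANCOUNT AS OpenTrans, COUNT_BIG(*) AS Rows FROM dbo.BigTable;

-- The next statement in the same batch is still free to use a parallel plan:
SELECT COUNT_BIG(*)
FROM dbo.BigTable AS a
JOIN dbo.BigTable AS b ON a.id = b.id;
```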

May 20, 2015 9:48 AM

Balaji Ram said:

HI Paul,

Thanks for the great post. I work on a DW application in the finance domain. We have massive workload batches, and achieving the desired performance is always a challenge.

The queries sometimes involve joins of tens of tables. Our data model is such that almost all of the tables are huge (10+ million rows), and there are no clearly small dimension tables that I can use as a "driver" table in a parallel apply pattern, etc.

Hence almost all of the joins are between multiple similarly sized huge tables (10+ million rows each). Also, almost all the tables have a primary key on something called an ApplicationRecordId. This is a system-generated, ever-increasing key, generated separately, and it is used across all tables in the star schema; all of our star schema tables are joined on this common ApplicationRecordId field.

Some of the complicated queries take forever to run. Checking dm_os_waiting_tasks.resource_description revealed that the wait bottlenecks are a handful of nodes, all of them Repartition Streams operators.

Checking the query plan revealed that:

1. No matter whether a parallel hash or merge join is used, and

2. No matter that all the tables have primary keys and are joined on the common ApplicationRecordId column,

each table passes through a Repartition Streams operator before the join (hash or merge), which massively slows down the query. In fact, the serial version of this query runs faster.

Can you please help me understand:

1. In a parallel operation, whether a Repartition Streams operator is necessary before a join between two equally sized big tables (10+ million rows), and

2. Any tips on how to attack and resolve this performance issue.

May 21, 2015 2:27 PM

Grant Fritchey said:

Hey Paul,

As always, I bow, scrape, etc. This is really useful.

What's not clear to Microsoft, I think, is just how useful this is in the case of columnstore indexes. You only get batch mode processing when the plan goes parallel. BUT tiny little plans, with estimates you wouldn't think would benefit in any way from parallel execution, still see huge benefits when a columnstore index goes to batch mode. I suspect when word gets out on this, there's going to be a ton of use of this thing.

May 26, 2015 12:37 PM

Paul White said:

Hey Grant, sorry I've only just noticed you commented. My apologies. Yes, I agree with your concerns, even given the 2016 announcements about serial batch mode and so on. We still need a way to ensure we get parallelism when we know, more likely than not, that good performance will require it.

August 22, 2015 8:20 AM

David Rimshnick said:

Thanks, great article!

Do CAST(), ABS(), or CHECKSUM() inhibit parallelism?

December 30, 2016 1:04 PM

Steve said:

Hi Paul,

I do wonder whether using persisted computed columns would help to mitigate the performance impact caused by a computed column, even if it still forces serial execution. Any thoughts?



January 27, 2017 10:55 AM

Erik Darling said:

At long last, I have finally become a footnote in one of your posts.

Thanks Paul!

May 14, 2017 11:28 AM
