
Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

Forcing a Parallel Query Execution Plan

This post is for SQL Server developers who have experienced the special kind of frustration that only comes from spending hours trying to convince the query optimizer to generate a parallel execution plan.  This situation often arises when making an apparently innocuous change to the text of a moderately complex query; a change which somehow manages to turn a parallel plan that executes in ten seconds into a five-minute serially-executing monster.

SQL Server provides a number of query and table hints that allow the experienced practitioner to take greater control over the final form of a query plan.  These hints are usually seen as a tool of last resort, because they can make code harder to maintain, introduce extra dependencies, and may prevent the optimizer from reacting to future changes in indexing or data distribution.  One such query hint is the (over) popular OPTION (MAXDOP 1), which prevents the optimizer from considering plans that use parallelism.  Sadly, there is currently no corresponding hint to force the optimizer to choose a parallel plan.

The result of all this is a great deal of wasted time trying increasingly obscure query syntax, until eventually the desired parallel plan is obtained, or the developer gives up in despair.  Even where success is achieved, the price is often fragile code that risks reverting to serial execution any time indexing or statistics change.  In any case, the resulting SQL is usually hard to read, and scary to maintain.

Why Expensive Queries Produce Serial Plans

Whenever the query optimizer produces a serial plan instead of the ‘obviously better’ parallel plan, there is always a reason.  Leaving aside the more obvious causes, such as the configuration setting max degree of parallelism being set to one, running under a Resource Governor workload group with MAX_DOP = 1, or having only one logical processor available to SQL Server, the usual causes of a serial plan are parallelism-inhibiting operations, cardinality estimation errors, costing model limitations, and code path issues.

Parallelism-Inhibiting Components

There are many things that prevent parallelism, either because they make no sense in a parallel plan, or because the product just does not support them yet.  Some of these force the whole plan to run serially; others require a ‘serial zone’ – a part of the plan that runs serially, even though other parts may use multiple threads concurrently.

That list changes from version to version, but for example these things make the whole plan serial on SQL Server 2008 R2 SP1:

  • Modifying the contents of a table variable (reading is fine)
  • Any T-SQL scalar function (these are evil anyway)
  • CLR scalar functions marked as performing data access (normal ones are fine)
  • Random intrinsic functions including OBJECT_NAME, ENCRYPTBYCERT, and IDENT_CURRENT
  • System table access (e.g. sys.tables)

Inconveniently, the list of intrinsic functions is quite long and does not seem to follow a pattern: ERROR_NUMBER and @@TRANCOUNT force a serial plan, while @@ERROR and @@NESTLEVEL do not.  The T-SQL scalar function restriction is also a bit sneaky.  Any reference to a table (or view) with a computed column that uses a T-SQL scalar function will result in a serial plan, even if the problematic column is not referenced in the query.
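To make the computed column trap concrete, here is a minimal repro sketch (the function, table, and column names are all hypothetical; run it only in a scratch database):

CREATE FUNCTION dbo.TaxAmount (@Amount money)
RETURNS money
AS
BEGIN
    RETURN @Amount * 0.15;
END;
GO
CREATE TABLE dbo.Orders
(
    OrderID integer IDENTITY NOT NULL PRIMARY KEY,
    Amount money NOT NULL,
    Tax AS dbo.TaxAmount(Amount)    -- computed column using a T-SQL scalar function
);
GO
-- This query never touches the Tax column, but because the table
-- definition references a T-SQL scalar function, the plan is serial.
SELECT
    o.OrderID,
    o.Amount
FROM dbo.Orders AS o;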

These query features are examples that require a serial zone in the plan:

  • TOP
  • Sequence project (e.g. ROW_NUMBER, RANK)
  • Multi-statement T-SQL table-valued functions
  • Backward range scans (forward is fine)
  • Global scalar aggregates
  • Common sub-expression spools
  • Recursive CTEs

The information presented above is based on the original list published by Craig Freedman, and updated for 2008 R2.

One way to check that a query does not have any parallelism-inhibiting components is to test the query using a CPU cost multiplier.  This should only be done on a private test system where you are able to flush the whole plan cache after testing.  The idea is to use an undocumented and unsupported DBCC command to temporarily increase the CPU cost of the query plan operators.  It is not a fool-proof test (some rare parallelizable queries will not generate a parallel plan with this technique) but it is quite reliable:

DBCC FREEPROCCACHE          -- clear the plan cache (private test system only)
DBCC SETCPUWEIGHT(1000)     -- multiply estimated CPU costs by 1000
GO
-- Query to test
SELECT
    COUNT_BIG(*) 
FROM Production.Product AS p 
LEFT JOIN Production.TransactionHistory AS th ON 
    p.ProductID = th.ProductID
GO
DBCC SETCPUWEIGHT(1)        -- reset the CPU weighting factor
DBCC FREEPROCCACHE          -- remove plans compiled with inflated costs

The final commands to reset the CPU weighting factor and flush the plan cache are very important.

If you get a parallel estimated plan for a particular test query, it shows that a parallel plan is at least possible.  Varying the value passed to the DBCC command adjusts the multiplier applied to normal CPU costs, so you will likely see different plans for different values.  The illustrated factor of a thousand is often enough to produce a parallel estimated plan, but you may need to experiment with higher values.  It is not recommended to use estimated plans obtained with this technique directly in USE PLAN hints or plan guides, because these are not plans the optimizer would ever produce naturally.  To be clear, direct use of such plans would likely render a production system unsupported, and the person responsible might be fired, shot, or possibly both.

Cardinality Estimation Errors

If there is nothing that absolutely prevents parallelism in the target query, the optimizer may still choose a serial alternative if it has a lower estimated cost.  For that reason, there are a number of things we can do to promote the parallel option, all based on the very sound notion of giving the optimizer accurate information on which to base its estimates.  The considerations here go well beyond just ensuring statistics are up-to-date, or building them with the FULLSCAN option.  For example, depending on the nature of the query, you may need to provide all or some of the following:

  • Multi-column statistics (for correlations)
  • Filtered statistics or indexes (for more histogram steps)
  • Computed columns on filtering expressions in the query (avoid cardinality guesses)
  • Good constraint information (foreign keys and check constraints)
  • Materialize parts of the query in temporary tables (more accurate statistics in deep plans)
  • Regular hints such as OPTIMIZE FOR

In general, anything you can do to ensure that estimated row counts are close to the runtime values will help the optimizer cost the serial and parallel alternatives more accurately.  Many failures to choose a parallel plan are caused by inaccurate row counts.
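For concreteness, here are hedged sketches of the first few items in the list above (the statistics and computed column names are illustrative, and the tables assume the AdventureWorks sample database):

-- Multi-column statistics, to capture correlation between columns
CREATE STATISTICS s_Product_Color_Size
ON Production.Product (Color, Size)
WITH FULLSCAN;

-- Filtered statistics, devoting a full histogram to the interesting range
CREATE STATISTICS s_TransactionHistory_Recent
ON Production.TransactionHistory (TransactionDate)
WHERE TransactionDate >= '20110101'
WITH FULLSCAN;

-- A (possibly non-persisted) computed column matching a filtering
-- expression used by the query, so the predicate can be estimated
-- from statistics rather than guessed
ALTER TABLE Production.TransactionHistory
ADD TotalCost AS Quantity * ActualCost;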

Model Limitations

SQL Server uses a model to estimate the runtime cost of each operator in a query plan.  The exact calculations vary between operators, but most are based on a minimum cost, with an additional per-row component.  None of these estimates of expected CPU and I/O cost take into account the specific hardware SQL Server finds itself running on.  The advantage of this is that plans from one machine can be readily reproduced and compared on another machine running the same version of the software, without having to worry about hardware differences.

Not all operators can be costed reasonably, and things like functions are particularly problematic because the optimizer has no clue how many rows might be produced or what the distribution of values might look like.  Even very normal-looking operators can pose problems.  Consider the task of estimating the number of rows and distribution of values resulting from a join or a complex GROUP BY clause.  Even where reasonable estimates can be made, the derived statistics that propagate up the query tree (from the persistent statistics at the leaves) quickly become less reliable.  The optimizer includes many heuristics that aim to prevent these inaccuracies getting out of control, so it might resort to complete guesses after only a few operators as the compounding effect of deriving new statistics takes hold.

There are many other assumptions and limitations of the model that will not fit into a blog post; the interested reader can find more detailed information in Chapter 8 of the indispensable SQL Server 2008 Internals book.

Costing Limitations

When SQL Server costs a parallel plan, it generally reduces the CPU cost of a parallel operator by a factor equal to the expected runtime DOP.  For example, the previous query can produce the following serial and parallel plans:

[Image: estimated serial execution plan]

[Image: estimated parallel execution plan]

Taking the Merge Join operator as an example, the parallel version has its CPU cost reduced by a factor of 4 when the expected runtime DOP is four (serial plan on the left, parallel on the right):

[Image: Merge Join operator costs, serial vs parallel]

On the other hand, the Index Scans show no reduction in I/O cost, though the CPU cost is again reduced by a factor of four:

[Image: Index Scan operator costs, serial vs parallel]

As mentioned earlier, different operators cost themselves differently (for example, a many-to-many merge join has an additional I/O cost component that also happens to be reduced by a factor of four).  These details also vary somewhat between releases, so the presentation here aims to give you an appreciation of the general approach, rather than dwelling too much on the specifics.

Looking again at the serial and parallel plans, it is clear that which of the two is cheaper depends on whether the parallel plan saves enough CPU and I/O cost in the various operators to pay for the extra operators in the plan.  In this case, the extra operators are three exchange (parallelism) operators – two Repartition Streams to redistribute rows for correct results when joining, and one Gather Streams to merge the threads back to a single final result.

The way the numbers work means that it is often a tight race between the best parallel and serial plan alternatives.  In many real-world cases, the difference between the two can be extremely small – making it even more frustrating when the serial version turns out to take fifty times as long as the parallel version to execute.  One other point worth mentioning again here is that the DOP estimate is limited to the number of logical processors that SQL Server sees, divided by two.  My test machine has eight cores, all available to SQL Server, but the DOP estimate used for costing calculations is limited to four.  This has obvious consequences for costing, where CPU and I/O costs are typically divided by the estimated DOP of four, rather than eight.

A Note about Parallel Nested Loops

Plans with Nested Loops joins can be a particular problem, because the inner side almost always runs serially on each of multiple threads.  The parallelism icons are still present, but they indicate DOP independent serial threads.  The distinction is perhaps a subtle one, but it explains (a) why operators that normally force a serial zone can run ‘in parallel’ on the inner side of a loops join; and (b) why the optimizer does not reduce the CPU costs on the inner side by the estimated runtime DOP.  This puts nested loops at an unfair disadvantage in parallelism costing, compared with Hash and Merge Joins.

Code Path Issues

This last category concerns the fact that the optimizer may not get as far as evaluating a parallel plan at all.  One way this can occur is if a final plan is found during the Trivial Plan stage.  If a Trivial Plan is possible, and the resulting cost is less than the configured cost threshold for parallelism, the full optimization stages are skipped and a serial plan is returned immediately.
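The threshold itself is a server-wide setting.  As a sketch, this is how the value could be raised to 100 for the demonstration that follows (advanced option, private test instances only):

EXECUTE sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXECUTE sys.sp_configure 'cost threshold for parallelism', 100;
RECONFIGURE;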

Trivial Plan

The following query has an estimated serial plan cost of around 85 units, but with the parallelism threshold set to 100, a Trivial Plan is produced (as shown in the plan property ‘Optimization Level’, or by checking the changes in sys.dm_exec_query_optimizer_info as shown below):

-- Snapshot the optimizer stage counters
SELECT
    deqoi.[counter],
    deqoi.occurrence
FROM sys.dm_exec_query_optimizer_info AS deqoi 
WHERE
    deqoi.[counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2')
GO
SET SHOWPLAN_XML ON     -- compile only; the test query is not executed
GO
SELECT
    COUNT_BIG(*) 
FROM dbo.bigTransactionHistory AS bth
OPTION (RECOMPILE)
GO
SET SHOWPLAN_XML OFF
GO
-- Snapshot the counters again; the deltas show which stages ran
SELECT
    deqoi.[counter],
    deqoi.occurrence
FROM sys.dm_exec_query_optimizer_info AS deqoi 
WHERE
    deqoi.[counter] IN ('trivial plan', 'search 0', 'search 1', 'search 2')

[Image: optimizer counters showing one ‘trivial plan’ compilation]

When the cost threshold is reduced to 84 we get a parallel plan…

[Image: parallel execution plan]

A deeper analysis shows that the query still qualified for Trivial Plan (and that stage was run) but the final cost exceeded the parallelism threshold, so optimization continued.  This query does not qualify for ‘search 0’ (TP, or Transaction Processing) because that stage requires a minimum of three tables.

So, optimization moves on to ‘search 1’ (Quick Plan) which runs twice.  It runs once considering only serial plans, and comes out with a best cost of 84.6181.  Since this exceeds the threshold of 84, Quick Plan is re-run with the parallel plan option enabled.  The result is a parallel plan at cost 44.7854.  The plan does not meet the entry conditions for ‘search 2’ (Full Optimization) so the finished plan is copied out.

Good Enough Plan & Time Out

Returning to code path reasons that prevent a parallel plan, the last category covers queries that enter the Quick Plan stage, but that stage terminates early, either with a Good Enough Plan Found message, or a Time Out.  Both of these are heuristics to prevent the optimizer spending more time optimizing than it stands to gain by reducing estimated execution time (cost).  Good Enough Plan results when the current lowest cost plan is so cheap that further optimization effort is no longer justified.

Time Out is a related phenomenon: at the start of a stage, the optimizer sets itself a ‘budget’ of a number of rule applications it estimates it can perform in the time justified by the initial cost of the plan.  This means that query trees that start with a higher cost get a correspondingly larger allowance of rule applications (roughly comparable to the number of moves a chess program thinks ahead).  If the optimizer explores the allowed number of rules before the natural end of the optimization stage, it returns the best complete plan at that point with a Time Out message.  This may well occur during the first run of ‘search 1’, preventing us reaching the second run that adds parallelism.
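The same DMV used in the Trivial Plan demo also exposes a ‘timeout’ counter, so a before-and-after snapshot (on an otherwise idle test instance) will show whether a compilation gave up early:

SELECT
    deqoi.[counter],
    deqoi.occurrence
FROM sys.dm_exec_query_optimizer_info AS deqoi
WHERE
    deqoi.[counter] IN ('timeout', 'trivial plan', 'search 1');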

One interesting consequence of the rule concerning Trivial Plan and the cost threshold for parallelism is that a system configured with a threshold of zero can never produce a Trivial Plan.  Bearing this in mind, we can generate a surprising Time Out with this query:

SELECT * FROM Production.Product AS p

As you would expect, this query is normally optimized using Trivial Plan (there are no cost-based plan choices here):

[Image: plan properties showing Trivial Plan optimization]

…but when the cost threshold is set to zero, we get Full Optimization with a Time Out…the optimizer timed out working out how to do SELECT * from a single table!

[Image: plan properties showing Full Optimization with reason Time Out]

In this particular case, the optimizer ‘timed out’ after 15 tasks (it normally runs through many thousands).  A Time Out result can sometimes also be an indicator that the input query is over-complex, but the interpretation is not at all straightforward.

The Solution

We need a robust query plan hint, analogous to MAXDOP, that we can specify as a last resort when all other techniques still result in a serial plan – and where the parallel alternative is very much to be preferred, of course.  I really want to emphasise that very many cases of unwanted serial plans are due to designers and developers not giving the optimizer good quality information.  I see very few systems with things like proper multi-column statistics, filtered indexes/statistics, and adequate constraints.  Even less frequently do I see (perhaps non-persisted) computed columns created on query filter expressions to aid cardinality estimation.  On the other hand, non-relational database designs with poor indexing, and decidedly non-relational queries, are extremely common.  (As are database developers complaining about the poor decisions the optimizer sometimes makes!)

There’s always a Trace Flag

In the meantime, there is a workaround.  It’s not perfect (and most certainly a choice of very last resort) but there is an undocumented (and unsupported) trace flag that effectively lowers the cost threshold to zero for a particular query.  It actually goes a little further than that; for example, the following query will not generate a parallel plan even with a zero cost threshold:

SELECT TOP (1)
    p.Name
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
ORDER BY
    p.Name

[Image: serial execution plan for the TOP (1) query]

This is a completely trivial query plan, of course – the first row from the scan is joined to a single row from the seek.  The total estimated cost of the serial plan is 0.0065893.  Returning the cost threshold for parallelism to the default of 5, just for completeness, we can obtain a parallel plan (purely for demonstration purposes) using the trace flag:

SELECT TOP (1)
    p.Name
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
ORDER BY
    p.Name
OPTION (RECOMPILE, QUERYTRACEON 8649)

[Image: parallel execution plan produced with trace flag 8649]

The parallel alternative is returned, despite its much higher cost of 0.0349929 (5.3 times the cost of the serial plan).  In my testing, this trace flag has proved invaluable in certain particularly tricky cases, where a parallel plan is essential but there is no reasonable way to obtain one from the standard optimizer.
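As an aside, the same flag can also be enabled at the session level with DBCC TRACEON rather than per query – again, strictly a sketch for private test systems:

DBCC TRACEON (8649);
-- ...compile and test the query here...
DBCC TRACEOFF (8649);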

Conclusion

Even experts with decades of SQL Server experience and detailed internal knowledge will want to be careful with this trace flag.  I cannot recommend you use it directly in production unless advised by Microsoft, but you might like to use it on a test system as an extreme last resort, perhaps to generate a plan guide or USE PLAN hint for use in production (after careful review).

This is an arguably lower-risk strategy, but bear in mind that the parallel plans produced under this trace flag are not guaranteed to be ones the optimizer would normally consider.  If you can improve the quality of information provided to the optimizer to get a parallel plan instead, go that way :)
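For completeness, here is a sketch of the plan guide route mentioned above (the statement text and plan XML below are placeholders for values captured and reviewed on the test system):

EXECUTE sys.sp_create_plan_guide
    @name = N'PG_ForceParallel_Example',
    @stmt = N'...exact statement text here...',
    @type = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints = N'OPTION (USE PLAN N''...captured ShowPlan XML here...'')';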

If you would prefer to see a fully supported T-SQL OPTION (MINDOP) or OPTION (PARALLEL_PLAN) hint, please vote here:

https://connect.microsoft.com/SQLServer/feedback/details/714968/provide-a-hint-to-force-generation-of-a-parallel-plan

© 2011 Paul White
Twitter: @SQL_Kiwi
Email: SQLkiwi@gmail.com

Published Friday, December 23, 2011 10:09 AM by Paul White


Comments

 

TheSQLGuru said:

Great information as usual Paul!  One nugget I wasn't aware of was this "The T-SQL scalar function restriction is also a bit sneaky.  Any reference to a table with a computed column that uses such a function will result in a serial plan, even if the problematic column is not referenced in the query.".  Adds to my "Death by UDF" beliefs! :-)  

I also like your possible use of this undocumented/unsupported TF in a non-production environment as a way to get a parallel plan on which you can base a plan guide!

December 22, 2011 4:49 PM
 

Alejandro Mesa said:

Paul,

Great post, as always.

The timing is also perfect, and I included it in my letter to Santa (I already voted, one up). I hope we can get this gift, but as you clearly stated, we developers have to be good with the optimizer.

Cheers,

AMB

December 22, 2011 5:41 PM
 

Paul White said:

Hi Kevin,

Thanks!  And you're welcome for the nugget :)

Paul

December 22, 2011 5:59 PM
 

Paul White said:

Hey Alejandro,

I just *knew* Santa was real all along :)

Paul

December 22, 2011 6:16 PM
 

Greg Linwood said:

Paul, this is another well presented article and I agree there should be better hints for forcing parallelism in the relatively few cases when it's needed.

However, your post will read to many inexperienced readers as if parallelism is always good, propagating the SQL myth that it should always be left switched on (as, unfortunately, it is by default).

Many SQL OLTP systems have suffered severe performance degradation from uncontrolled query parallelism & it is unfortunate that Microsoft leaves this feature switched on by default. Note that Oracle does the opposite - you have to configure Oracle specifically to parallelise queries, which is generally done for Data Warehouse systems, which are the vast minority of installed systems both on Oracle & SQL. Oracle has this the right way around imo as more systems benefit from the default configuration & those that genuinely need parallelism can switch it on as needed.

Your post is insightful but many will read into it that parallelism is universally good & should be used widely when the opposite is in fact that case. Most SQL systems are OLTP, not Data Warehouse & OLTP systems are generally configured with multi CPU cores to support concurrent queries, not parallelism within queries.

Your post also leaves out the crucial context of impact from concurrent workload. As with other recent posts on query parallelism, your performance measurements are taken in a clinical single query scenario. If you ran multiple instances of these queries & compared the serial vs parallel performance your results would highly likely be very different & reflect better what SQL Server generally does in real world workloads - processing many queries concurrently. Even Data Warehouses often need to support concurrent query processing so hard-wiring parallelism into query hints is questionable when the developer doesn't know for sure up front whether the query will always have all CPUs available to it..

December 22, 2011 6:47 PM
 

Paul White said:

Hi Greg,

Thanks for taking the time to post such a detailed commentary, and I appreciate the kind words regarding presentation and insight!

This is a very tightly focussed post aimed at any advanced developer that has ever struggled to convince the query optimizer that a parallel plan really is the best way to go.  Implicit in that is the idea that parallelism, in the specific case considered, is not only desirable but pretty much essential.

OLTP-type workloads are not the target of this post, as these are already well-served with configuration options (workload groups, server MAXDOP) as well as the query-level MAXDOP(1) hint as noted at the top of this entry.

Certainly, true high-concurrency OLTP-type workloads do not often require parallelism to keep schedulers busy doing useful work.  For those systems, very selectively overriding server MAXDOP using a hint (or using a different workload group for Resource Governor) can be the way to go, in case there are off-peak queries that would benefit and where sufficient processor resource is likely to be available.

Nevertheless, regardless of the system, there will always be times where the occasional query fails to achieve desirable parallelism due to the causes noted in this entry and in my Connect item.  In those limited cases, I would hope the details presented here still add value.

My posts are somewhat aimed at advanced practitioners, so I do rather assume the reader knows not to over-drive parallelism, as well as being aware of concurrency implications.

Thanks again for the comment.

Regards,

Paul

December 22, 2011 7:14 PM
 

Greg Linwood said:

I follow your intent & agree that it might be good to have more options to force parallelism in specific Data Warehousing scenarios but I can also see how this would very likely be a mis-used feature in OLTP workloads

My concern is that parallelism is already a widespread problem with OLTP workloads & articles such as this are easily mis-interpreted to mean that it should be used even more widely. You might be targeting an expert audience but a wide variety of readers will interpret this information as meaning it applies to their systems & given that most SQL users are OLTP users, it's important to point out the difference between applicability to OLTP vs DW systems.

December 22, 2011 7:47 PM
 

GrumpyOldDBA said:

Very interesting post and helps explain why I often get that "why isn't it using a parallel plan" moment.

However, as one who has been pushing parallelism for many years, I still have to take up this OLTP vs DSS "thing". I generally support operational systems; they are not data warehouses, so I guess they could be OLTP, but I've never believed in this black and white view of database applications. I've struggled over the years with decisions to disable parallelism on systems which have clearly needed parallelism to scale, and in fact, with falling clock speeds, putting your app on 80 cores with parallelism turned off is never going to achieve very much (in my view). However, in my "own world" I've been tuning queries by often having to force parallelism for many years - in fact I think I've commented on this subject on your blog before.

As ever I'm grateful for your work which is awesome and enhances my knowledge of SQL Server constantly - Have a good christmas.

December 23, 2011 8:18 AM
 

Paul White said:

Hi Colin,

Very high quality comment as always.  Have a great Christmas too!

Paul

December 23, 2011 8:33 AM
 

Greg Linwood said:

Hi Grumpy, if query parallelism is good in operational / OLTP DBs why does Microsoft switch it off during officially audited TPC benchmarks?

To check this yourself:

http://www.tpc.org/tpce/results/tpce_result_detail.asp?id=111082601

Pull down the Supporting files, extract files, open ./Clause2/Utility/Database_Options_2.sql

You'll see they leave MAXDOP set to 0 during the data load phase prior to the benchmark setup, but then explicitly set it to 1 at runtime for the benchmark test.

They have been doing this for years with all OLTP Benchmark Kits & they don't certify benchmarks if the hardware vendor who runs the test alters the configuration.

Leaving query parallelism in the default configuration of zero is simply wrong for OLTP workloads. It only makes sense even vaguely in Data Warehouse workloads.

December 23, 2011 4:33 PM
 

Adam Machanic said:

Greg,

Is it possible that Colin's mixed-workload OLTP systems don't conform to your idea of OLTP? Seems quite likely; based on your comments, it's become clear your idea of a data warehouse workload doesn't conform with mine.

Apparently -- surprise, surprise -- there are a huge number of different types of workloads out there, and it's tough to make generic recommendations that work for all of them. Hopefully our readers are smart enough to figure that out on their own, so I think your concerns are rather unfounded.

--Adam

December 23, 2011 10:07 PM
 

Greg Linwood said:

I'm referring to the Transaction Processing Council's definition of OLTP, which includes Microsoft's agreement with Oracle & IBM on how OLTP should be defined for performance measurement purposes: www.TPC.org.

How could it be any clearer that my definition agrees with Microsoft's?

December 23, 2011 10:44 PM
 

Paul White said:

I tend to agree with the views of Colin and Adam, though there may well be workloads out there that are pure OLTP (whatever that means these days), where parallelism is rarely required.  I say rarely, because in my personal experience most mixed-workload systems have queries that absolutely do benefit from parallelism.

To me, 'pure OLTP' implies a system with a very high volume of concurrent navigational queries that affect very few rows, performing many joins and typically using nested loops to do so.  These are exactly the sorts of queries that SQL Server will generally not invoke parallelism for anyway: this is exactly what the trivial plan and transaction processing stages of optimization are for.

As discussed in the main text, neither of these optimizer stages are capable of producing a parallel plan, so any queries that get as far as the second run of Quick Plan optimization must at least appear to the optimizer as benefiting from parallel execution.  In that case, something must be wrong with the query, database design, or indexing - things that are probably not best 'solved' simply by enforcing MAXDOP 1 globally.

It is also perfectly possible for a parallel query to use less CPU while executing faster than the equivalent serial query.  This is a point I explored in a previous post:

http://sqlblog.com/blogs/paul_white/archive/2011/07/07/bitmap-magic.aspx

I don't personally have much of a problem with the default configuration being MAXDOP 0 with a cost threshold of 5, but it seems to me that any case for change would be better sited on Connect than in the comments section of a blog post like this one.

Paul

December 24, 2011 12:39 AM
 

John Alan said:

Paul thanks for another excellent insight into the QO.

These flags are certainly not to be used as startup options ;-) but would help troubleshoot certain 'heavy' queries that insist on running serially despite the lack of parallelism-inhibiting components.

This computed column/scalar function combo is something worth a look. I've got one on a critical table where parallelism is a distant dream.

Good work!

December 25, 2011 3:56 AM
 

Alexander Kuznetsov said:

Merry Christmas, Paul!

As a geek, I did enjoy reading this, and thank you for writing this up.

However, putting on my business hat, I have very mixed feelings.

First, when we are developing solutions, the following business requirements usually influence the choices we make when we implement them:

1. Is this a one time task, or is this going to run all the time from multiple connections, so we need the best, the fastest possible algorithm.

2. What performance is good enough, like "it should return under 30 seconds, and use up as little resources as possible", or "executive priority, return ASAP, let everyone else get out of its way and wait".

3. Which resources (memory, CPU) should be used in moderation.

If we had a clear and direct way to communicate these requirements to the optimizer, SQL Server would be a more useful tool. Maybe I am missing something, but good developers just do not implement solutions until they get these requirements straight, until they know what the customers want. Maybe, just maybe, the optimizer should let its customers communicate what they want more directly. Please correct me if I am wrong.

Second, when query tuning get this complicated, other alternatives get more profitable. It may be cheaper to add hardware, and/or develop more in other languages, and let SQL Server only do simpler tasks.

For example, we recently had a problem where the optimizer was not giving us a parallel plan. After verifying that all needed statistics were present and up to date, we just removed several conditions from the WHERE clause, moving some filtering to the client (about 5% of rows were filtered on the client). As a result, we ended up with a simpler query that was getting a parallel plan for all combinations of parameters, delivering good and predictable performance.

Moving some filtering to the client was a very simple solution, it did not need a highly skilled expert and/or time-consuming research. Also I think it resulted in more predictable performance.

Because the SQL Server optimizer is a closed-source thing, we cannot just debug through the process of choosing the plan and see what is happening. As a result, in some cases we get the feeling that optimizing a complex query cannot be done - we are likely to get back to it later on, losing time and money every time we have to re-optimize, and eroding customers' confidence.

From my perspective, this seems to be yet another situation when complexity sucks, and a simpler optimizer might be a better fit for my customers' requirements.

What do you think?

December 26, 2011 9:57 PM
 

Paul White said:

Hi Alex,

I agree with your views to a very great extent.  Complexity does suck, by and large, and there are many variables to be juggled in the real world.

On the other hand, a good part of your argument seems to go in favour of extra hints like OPTION (PARALLEL_PLAN): this gives us a direct way to communicate what we want to the optimizer, and is likely to result in predictable performance without e.g. moving filtering to the client.

I would very much like to be able to construct query plans directly, much in the way SSIS allows us to build data flows, but I doubt that level of control will ever be afforded to us.

It's a very interesting area overall.  Merry Christmas to you also, and I'm glad your inner geek enjoyed this post :)

Paul

December 26, 2011 11:21 PM
 

GrumpyOldDBA said:

As I generally work in an isolated environment or if you like I don't work with anyone with an in depth sql knowledge so blogs are an important source for me, Paul and Linchi especially put out stuff which is of great interest to me ( I also read many others ). I'd probably say that in general I support systems which have challenges with performance, over the years these have been in banking, pharmaceuticals, travel and "other" financial systems - many have been international. Performance Tuning is vital and I don't really know what you'd define these databases as - they're not a Datawarehouse ( I have been there too and used lots of parallel stuff there ) Sadly I admit to reading the full disclosure of many benchamrking tests but do I think these have much relevance to what In do? Probably not! Do I find the many blogs relevent ? Absolutely and every so often there's something which shines a light on something I'm working with. Sitting outside the OLTP vs DSS thing it just makes sense to realise that parallel processing must be the way to go. I've enjoyed the discussion on this one Paul, Alex, Greg, Adam.

December 28, 2011 7:41 AM
 

Alexander Kuznetsov said:

Paul,

I definitely do want to be in control, as this seems to be more profitable - we are losing time and money when we spend too much time trying to figure out how a closed-source black box optimizer works.

I am not sure "more hints" is the right word, because most hints may be silently ignored. A stronger word is needed, so that when I execute a query and require, not hint, "and BTW your budget is 1 CPU and 4GB", then the query either uses no more than 4 Gb, or throws an exception, just like when I misspell a column name. Does it make sense?

I frequently write my own execution plans in C#; it is one of my best/easiest options whenever the optimizer is struggling. Because much of my data is temporal, an RDBMS is not a perfect tool for it anyway, so I would not blame the optimizer for struggling with time series and such. Anyway, in the best case a self join developed in C# outperformed an "already optimized" set based query by twenty thousand times, and with little effort on both the SQL and C# sides. Because T-SQL is quite verbose and C# may be very succinct, the C# code was shorter and more readable than the T-SQL one.

In addition, it is so very much easier to unit test C# code than to unit test T-SQL, and NUnit tests using mocks run much faster.

So, I am writing less and less complex T-SQL. This is especially true since I saw the pricing on Enterprise Edition. I am currently running 2008 R2 Enterprise Edition, but I will definitely not upgrade to 2012 Enterprise Edition. This is yet another incentive to move complex processing off SQL Server, especially for frequently running queries. What do you think?

December 28, 2011 10:03 PM
 

Paul White said:

Alex,

Most hints (e.g. INDEX, LOOP JOIN) are already directives, though some (e.g. ROWLOCK) are not, that is true.  I was thinking more of directives, to be clear about it.  As far as CPU and memory budgets are concerned, isn't that what Resource Governor gives you?  Admittedly, this does require Enterprise Edition.

I agree that SQL Server is in desperate need of native temporal support, though CLR functions and aggregates are a good workaround in many cases while we are waiting for that to arrive.  In general, yes, I do agree that simple SQL is best - for all sorts of reasons, not just optimizer-related issues.

Paul

January 2, 2012 10:02 PM
 

Jungsun Kim said:

Thank you for the good article, Paul.

Jungsun

January 3, 2012 10:04 AM
 

Nick Beagley said:

Great article, thanks

I was under the impression that when you limit MAXDOP through Resource Governor it doesn't act like the instance-wide maxdop setting, or even the maxdop option within a query, but rather affects the number of cores the query will be run on, rather than the number of cores for which the plan will be generated.

For example, if you limit maxdop to 1 with Resource Governor on an instance that has been configured for a higher maxdop, it will generate a parallel plan but this parallel plan will be run serially, providing different behaviour to that which occurs when the instance has been set with global maxdop 1.

January 19, 2012 11:25 PM
 

Paul White said:

Hi Nick,

There are a number of differences between RG-enforced DOP and the max degree of parallelism option, for example the latter can be overridden per query with an OPTION (MAXDOP n) hint, but the RG limit is absolute.

It may be possible to generate and cache a parallel plan that is then restricted to running serially via RG, but this isn't a scenario I have ever tested for myself.  I did write before about how SQL Server can execute a cached parallel plan serially by removing exchanges at run time: http://sqlblog.com/blogs/paul_white/archive/2010/11/04/myth-sql-server-caches-a-serial-plan-with-every-parallel-plan.aspx

Thanks for the question!

Paul

January 22, 2012 8:33 AM
 

SomewhereSomehow said:

Thanks for the article.

By the way, talking about skipping the Trivial Plan: maybe it would interest you that there is a special trace flag that allows you to skip the trivial plan phase - it's 8757, you may try it.

Also, I've noticed that you are strongly interested in the optimizer, so this may interest you too. I have discovered a set of undocumented flags, not mentioned anywhere, that allow you to:

April 25, 2012 12:46 PM
 

Paul White said:

@SomewhereSomehow,

Thanks but I 'discovered' those a long time ago :)

I've been presenting on this topic at user groups and SQL Saturdays over the last month or so.  You can download the slides and demos at e.g. http://www.sqlsaturday.com/viewsession.aspx?sat=139&sessionid=8412

Paul

April 25, 2012 2:08 PM
 

SomewhereSomehow said:

=))

Well, ok then, thanks!

April 25, 2012 2:47 PM
 

SomeName said:

Great article!

I've published an interesting test result to connect.microsoft.com.

Maybe you will find it interesting too, and can explain to the community what happens?

https://connect.microsoft.com/SQLServer/feedback/details/743136/obscure-optimizers-behaviour

May 22, 2012 10:54 AM
 

Paul White said:

Hi SomeName (aka M2M telematics),

Thank you.  The tipping point for parallelism for me on that query (also on 2012) is 2257.  The estimated cost for the parallel 2257 plan is 11.2158; for the serial 2258 plan it is 11.2162.  It is often possible to find the 'edge' where the costing component produces very close costs for serial and parallel options.

In this case, the cost savings from performing a parallel scan are offset by the increased estimated cost of the conversion in the Compute Scalar.  The cost of that conversion is related to the size of the data type, so it makes sense that varying that parameter affects the balance between serial and parallel options.

Paul

May 22, 2012 11:21 AM
 

SomeName (aka M2M telematics) said:

Hello again )

It's a good and reasonable explanation, but it concerns only 2) and 3).

The key point of this test is query 1) vs 2) and 3).

Query 1) doesn't contain a conversion at all, and the Text field contains far fewer than 695 characters in every row, but the plan doesn't contain parallelism.

Have I missed something?

What do you think?

May 23, 2012 2:47 AM
 

Paul White said:

For query 1, the serial plan estimated cost is 10.6047; getting a parallel plan with TF 8649, the estimated cost is 12.0579. So, the optimizer chooses the serial plan alternative - the lower cost of the parallel scan is not enough to pay for the Gather Streams operator. If your cost threshold for parallelism is set to 11 or higher, the plan will be a TRIVIAL one, and parallelism will not even be considered.

Paul 

May 23, 2012 3:07 AM
 

Alex Bransky said:

This post helped me a lot, except for one thing:  You have to be a sysadmin to use QUERYTRACEON.  I have reports in SSRS called by a website using a SQL user for authentication and I'd like that user to be able to force parallelism since the SQL engine doesn't want to (which is baffling).  But it seems like a big security risk to make that user a sysadmin.

December 5, 2012 7:02 PM
 

Paul White said:

Hi Alex,

Yes, that is a big restriction, and another good reason not to use this trace flag in production.  We really need a query hint - so I hope you voted and commented on the Connect item :)

Paul

December 5, 2012 11:38 PM
 

Alex Bransky said:

Okay now I'm really confused.  I have a login without sysadmin rights that is able to execute a stored proc that contains QUERYTRACEON.  When that login executes DBCC TRACEON it gets the message below.

Msg 2571, Level 14, State 3, Line 1

User 'Website' does not have permission to run DBCC TRACEON.

How does this make any sense?

December 10, 2012 2:26 PM
 

Paul White said:

Alex,

Ownership chaining allows OPTION (QUERYTRACEON 8649) in a procedure where separate DBCC TRACEON (8649) and DBCC TRACEOFF (8649) statements would fail permission checks.  This may or may not be fully by-design behaviour, but it is the way it works.

See the links below for information on ownership chaining (if necessary):

http://msdn.microsoft.com/en-us/library/ms188676(v=sql.105).aspx

http://weblogs.sqlteam.com/dang/archive/2008/02/09/Security-with-Ownership-Chains.aspx

Paul

December 10, 2012 5:23 PM
 

Alex Bransky said:

Thanks!  I figured it had to be something like that but I didn't realize ownership chaining went so far as to effectively allow sysadmin rights (based on the contents of the sproc).  I will see The Hobbit this weekend in your honor.

December 10, 2012 6:18 PM
 

TheSQLGuru said:

This trace flag just saved my bacon Paul!!  Had a very complex/ugly nested loop, outer apply, spatial mess of a query that just wouldn't parallelize despite 4-figure query cost.  I thought the trace flag for SQL 2012 TOP-slow-query would do it for me (there was a top in the query) but it didn't.  8649 got me a parallel plan and cut run-times by an order of magnitude+.  And for something I had to iterate almost 400 times (and schedule for once a day going forward) that REALLY made me happy!! :-)

February 8, 2013 3:31 PM
 

crokusek said:

Going a step further... it would be nice to be able to direct a parallel plan for a specific operation, say a union or join.  Same with mindop/maxdop.

I have a case where each side of a "union all" takes .5 and .3, and when run with maxdop(1) it shows an expected .8.  When I encourage a parallel plan using the hint you suggest, the parallelization occurs not at the union but rather at multiple levels.  When using maxdop(2), the optimizer did not place the parallel operation at the union point, which for my case at least would likely result in about a 40% improvement (assuming each side of the union does not interact, which is the case).

Thanks for the article!

March 25, 2013 12:37 AM
 

Paul White said:

Hi crokusek,

Not sure I follow all that completely (it's hard to describe plans in words sometimes!) but I do agree it would be nicer to have finer control over parallelism sometimes. Having a documented way to generate a good parallel plan would be a good start...

April 17, 2013 7:34 PM
 

Paul White: Page Free Space said:

The query optimizer does not always choose an optimal strategy when joining partitioned tables. This

June 16, 2013 11:22 AM
 

Adam Machanic said:

" Even experts with decades of SQL Server experience and detailed internal knowledge will want to be

July 11, 2013 10:33 AM
 

Eugene Karpovich said:

Hi Paul, thanks again for this article.

Actually, it is not the first time I read it, but this time I just wanted to share with you one of my observations that helped me force (or I should probably say just "get") a parallel plan in one of my particular scenarios.

The main thing is this - it seems the SQL optimizer makes a correct estimation of the number of rows in a table variable when that table variable 1) is queried inside dynamic SQL, and 2) is passed there as a parameter (which essentially means it is of a user-defined type).

It is probably a well-known fact, but at least I was not aware of it before.

The root cause for being stuck with a serial plan in my particular case (or, at least, what seemed to be a root cause) was incorrect cardinality estimation. Which, in its turn, was caused by the fact that I was using table variables inside one of my functions.

For myself, I found the following explanation for this: since the optimizer under "normal" conditions estimates the output of a table variable as one single row, it greatly under-estimated the effort needed to execute my query, which led to a compile-time decision that a serial plan would be good enough.

When I wrapped a call to my function in a dynamic SQL statement (I did it for considerations which are irrelevant to my main point here), I suddenly got a perfect parallel execution plan.

It is when I started to dig further into this that I discovered the main thing that I am sharing here.

Regards

P.S. I have a script that proves my point in case you are interested - not sure how to attach it here though..

P.P.S. There is a small chance you may actually remember me - I am the guy who approached you in the intermission of your first session at PASS 2012; my question was related to the main subject here - forcing a parallel execution plan using dynamic SQL.

July 14, 2013 8:13 PM
 

Paul White said:

Hi Eugene,

Yes I do remember your question at the Summit - it's good to hear the details. It is true that the cardinality of a table variable is most often estimated at a single row, but the optimizer can see the true cardinality if a statement-level recompile occurs. The easiest way to demonstrate that is to add OPTION (RECOMPILE) to the statement (though there are other causes of statement-level recompilation of course).

DECLARE @T AS TABLE (col1 int NULL);
INSERT @T VALUES (1), (2), (3);
SELECT * FROM @T AS T;
SELECT * FROM @T AS T OPTION (RECOMPILE);

The post-execution (actual) execution plan shows an estimate of 1 row for the first SELECT, and 3 rows for the second SELECT. Table variables don't support statistics though (not even those normally associated with indexes) so the optimizer still doesn't have all the information it could have, but this technique does increase your chances of a parallel plan as you have seen.

I realize you weren't using OPTION (RECOMPILE) in your example, but what you saw was the result of a statement-level recompilation.

Paul

July 14, 2013 9:46 PM
 

Eugene Karpovich said:

Thank you for the information, Paul

July 16, 2013 11:08 AM
 

maruf said:

very nice article ...

November 5, 2013 9:34 AM
 

Alan Burstein said:

I just ran across this -- amazing work as usual Paul! I always learn so much from your articles.

I have a question and a comment:

Can you point me to any articles that explain why mTVFs never produce a parallel plan? I have seen this discussed but have never found it documented.

Turning mTVFs into iTVFs has become one of my favorite low-hanging-fruit performance optimizations; I have been using it lately with great success whenever possible. My experience has been that many developers create mTVFs because they don't know any better. I see lots of this:

CREATE FUNCTION dbo.f (@p int)
RETURNS @t TABLE (c1 int, c2 int)
AS
BEGIN
    INSERT @t SELECT blah FROM blahblah;
    RETURN;
END;

and have had huge success changing them to

CREATE FUNCTION dbo.f (@p int)
RETURNS TABLE
AS
RETURN
    SELECT blah FROM blahblah;

December 11, 2013 2:10 PM
 

Paul White said:

Hi Alan,

Multi-statement table-valued functions force a serial branch in the calling plan, but parallelism can exist in other branches of the same plan.

It is also possible to see parallel execution within the TVF, though naturally not on the statement that modifies the table variable.

Replacing TVFs with in-line functions ("parameterized views") is very often a very good idea, I agree.

I'm afraid I don't know why the serial-branch restriction on TVFs exists. It may be because the function might contain its own parallelism, and it would be difficult to ensure the DOP concurrent thread limit is respected in case a parallel plan calls a parallel TVF, but that's just a guess.

Thanks for the kind words about the blog, I appreciate it :)

Paul

December 12, 2013 6:00 AM
