THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Page Free Space

See also my SQL Server technical articles on

Improving Partitioned Table Join Performance

Published Monday, June 17, 2013 5:22 AM by Paul White

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS



Jason Horner said:

Great post Paul. Wondering if we used a local partition view instead of the built-in table partitioning if we would still be susceptible to the extra parallel scans, and eliminate the need for dynamic sql.

June 16, 2013 12:09 PM

Ash said:

Awesome!!!Always tons of good stuff in your posts that will have me bookmarking it for re-reads....

June 16, 2013 12:47 PM

Paul White said:

Hi Jason,

Yes it is possible to do something similar with local partitioned views, though care is needed to get the best from it.


June 17, 2013 6:39 AM

cdp said:

Hi Paul,

I was following your example on my own but the data/content for your numbers table seems to be missing. Can you let me know table design and content?



June 18, 2013 11:27 AM

Paul White said:


It's a fairly standard numbers table, I'll update the post to include the code for it.


June 18, 2013 12:18 PM

Chris Adkin said:

Hi Paul,

Again another excellent post, I would really like to see you deliver some training / a pre-con in Europe. Going off at a slight tangent, and I appreciate you data does not fit this scenario, but is it possible to force a parallel few outer rows join ?.


June 20, 2013 8:14 AM

Paul White said:

Hi Chris,

If the opportunity presents itself one day, I would certainly consider it - thanks for saying that.

As far as forcing the round-robin repartition for the 'few outer rows optimization' is concerned, I'm not aware of anything (like a specific trace flag) to force it, so we're left with tricks like TOP (@n) with OPTIMIZE FOR to make it look as if the number of rows is in the range where this optimization is applied. It's a hack of course, and vulnerable to future changes in optimizer behaviour, but it's the best we have today, as far as I know. Other T-SQL tricks that don't change the semantics but result in an in-range cardinality estimate are possible too, naturally.


June 20, 2013 9:05 AM

Anonymous said:


Any clues on why the QO would push the partition elimination predicate into a filter AFTER the scan, instead of using it as a seek predicate on the scan? And how to avoid such issues? (Preferably without any query hints -- I'm trying to pack this into a view.)

The result of the QO's decision in this case means that instead of getting great performance I'm getting one scan of the entire table, per partition. Pretty abysmal :-(



July 3, 2013 10:02 AM

Paul White said:

Hi Adam,

I have seen the optimizer choose a separate filter, but not in exactly the circumstances you seem to be describing. Aside from a FORCESEEK hint, it's hard to know what to suggest without seeing the execution plan. Perhaps you are driving the collocated join using partition values instead of ids? Anyway, if you can send me the plan, I'll be happy to take a look.



July 3, 2013 7:30 PM

Tom said:

Hi Paul,

I'm working through your example. I noticed that when I run the test query, it uses a merge join by default rather than the hash match with parallelism repartition streams. Any idea why this would happen? I have double-checked the data is setup the same, similar number of rows in partitions, etc. I have tried this on 2008R2 and 2012 DBs and it's the same behaviour. I've checked settings such as maxdop and it's defaulted to 0.



July 8, 2013 6:56 AM

Paul White said:

Hi Tom,

There is an element of chance with the data distribution, and the choice between parallel hash and serial merge plans in this case is also quite finely balanced (according to the cost model anyway).

To see the parallel plan, you have a couple of choices. 'Force' parallelism using an OPTION (QUERYTRACEON 8649) clause on the query, or pretend you have more logical processing units available using DBCC OPTIMIZER_WHATIF(CPUs, 16). More logical processors often tilts plan choice toward parallelism due to its effect on costing.

Either of those should produce the parallel hash plan. If you choose to use the DBCC command, reset it back to the default setting with DBCC OPTIMIZER_WHATIF(CPUs, 0) after you have finished testing.


July 8, 2013 7:27 AM

JRStern said:

You're a wizard, Harry ... er, Paul!

I'm not sure what's scarier here, the undocumented dbcc's, or optimizing a partitioned query by referencing partition ids explicitly.  Wish I'd ever noticed your description of the SETIOWEIGHT before, could have used that recently, but I expect I'll find more uses for it soon enough.

Of course what's really scary is that SQL Server needs so much coercion to get any of this right, one might have thought this was all going to be baked in when the partitioned tables were introduced ... but one would then have been wrong.

Well I'm off tomorrow to talk to a place using partitioned tables bigtime on about two terabytes (uncompressed) with a heavy daily (monthly?) ETL updated.  Apparently they don't trust TSQL to get it at all right, so they (re)process the partitions individually then roll them back in.  Maybe some of these tricks would help - but their current strategy is apparently to let SSIS do the hard work.  Just making conversation, what do you think of that, using SSIS merge and SCD tasks rather than TSQL, especially in and around partitioned tables?


July 26, 2013 2:11 AM

Paul White said:

Hi JRStern,

I agree there is plenty more that could be done regarding query optimization, parallelism, and query plans for partitioned tables in general.

The idea of hiding the complexity of partitioned tables using the query processor is an interesting design choice; it definitely needs more engineering investment to achieve its full potential.

The balance of history is against us here though; features rarely get the sort of incremental improvement in later versions that we might expect. Often, backward compatibility and concern over sudden plan changes are cited reasons. Meh.

It's hard to generalize about a system I haven't seen, but I have come across good implementations using T-SQL and using SSIS (and bad ones for both, naturally). The default SCD component is notoriously inefficient, however. I would be surprised if there weren't scope for you to improve their current process. Processing single partitions and using SWITCH could work very well, the details matter though.

July 26, 2013 2:56 AM

marek said:


really interesting post

what do you mean by '...the general technique is not limited to partitioned tables...'

I can't think about any other example

August 9, 2013 3:10 AM

Paul White said:

Hi Marek,

The same technique can be used where a 'section' of each table can be joined one at a time. The basic idea is to use an APPLY driven by a table that contains section identifiers. So, instead of joining a partition at a time, we join a section at a time.

For example, the driving table might be a list of customer IDs, which is then applied to a join of two or more tables that contain the customer ID as part of the join criteria.


August 12, 2013 10:16 PM

marek said:

Thank you Paul,

I thought that. Anyway I've seen Adam Machanic Manhandling Paralellism

lecture from PASS Summit and I tried to use it, all techniques that were mentioned and the result is: simple hash match join for me still works better and was almost always choosed by QO, I dont know why.

Only way I was able to force QO to use nested loop was cross apply with TOP, distinct and 0%[primary_key]+[primary_key] on driving table

and top on inner side. It was slower. Yeah and I also thought till I saw it, that cross apply always forces nested loop

August 14, 2013 4:00 AM

Paul White said:

Yes, the costing model does discriminate against parallel nested loops join. It can be quite an exercise to write the query in a way that produces the desired plan with *all* the necessary features.

Cross apply is a logical operation that can be transformed to a join (and then implemented by hash or sort-merge) by the optimizer. It so happens that outer apply is much less easily transformed, but it is possible, and that state of affairs could change over time.

If you have a particular query plan you are trying to optimize for parallel execution, I would suggest uploading it for analysis using the SQL Sentry Plan Explorer tool. Post-execution, non-anonymized plans are much easier to analyse.


August 14, 2013 5:54 AM

marek said:

Well, finally I got it (what can you do without proper indexes).

But funny thing is (and again I don't know why) :

thread 0: 0 rows (that's ok)

threads 1-31 nice distribution around 70k  (+/- 1k)

thread 32: 250k

No matter what I do, one thread is still high, on inner side.

Outer side is nicely distributed.

I tried update statistics, changed maxdop, no help.

August 18, 2013 11:42 AM

marek said:

waaauu and it propably has a lot to do with uniqueness on inner side join column

August 18, 2013 1:20 PM

Paul White said:


It might be unavoidable depending on the way outer side rows are being distributed. It sounds like there's a round-robin exchange above the join, and one (or more) correlated values results in more matches on the inner side than for other rows. Impossible to be certain without seeing the plan, but the distribution you have seems 'good enough'.

August 18, 2013 8:21 PM

Leave a Comment

Privacy Statement