There are a number of interesting questions in parallel execution performance
that can be investigated using the TPC-H kit, which has a data generator.
There are also supplemental files that accompany published results.
Important questions in parallel execution are:
scaling versus degree of parallelism (DOP) on physical cores,
Hyper-Threading, Compression and Columnstore.
(Do memory-optimized tables support parallel plans in 2016? Perhaps a topic for a later post.)
To be clear, the TPC-H data generator (dbgen) and queries are used here,
but no attempt is made to adhere to the TPC-H benchmark requirements
(there are some outright departures).
The material reported here is only for the purpose of studying
parallel execution characteristics.
In brief, the findings are as follows.
The simple (arithmetic) average performance gain from Hyper-Threading
over the 22 TPC-H queries is 23%, in line with characteristics observed in
SQL Server version 2012.
(Official TPC-H results use a geometric mean.)
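The difference between the two averages can be sketched quickly (the per-query speedup values below are illustrative, not the measured results):

```python
from math import prod

# Hypothetical per-query speedups from enabling HT (illustrative only):
# most queries gain modestly, one gains a lot.
speedups = [1.1, 1.15, 1.2, 1.05, 1.8, 1.25]

arithmetic_mean = sum(speedups) / len(speedups)
geometric_mean = prod(speedups) ** (1 / len(speedups))

# The geometric mean (used in official TPC-H results) is pulled less
# by outliers than the simple average reported here.
print(f"arithmetic: {arithmetic_mean:.3f}, geometric: {geometric_mean:.3f}")
```

For speedup ratios, the geometric mean is always at or below the arithmetic mean, so the 23% figure here is not directly comparable to a published TPC-H metric.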
There used to be an anomalous behavior in SQL Server 2012 in which
performance was better with HT enabled and the second logical processor idle
than with HT disabled, even for single-threaded execution, as reported previously.
Compression has an average overhead of 20% across all DOP levels.
For SQL Server 2008, I had reported at SF10
that compression overhead was 40% between DOP 1-8,
then decreased to less than 10% at DOP 32.
The test circumstances were different, and may have been related
to the NUMA effects of an 8-way system.
Columnstore performance is excellent and should be considered wherever applicable. In summary:
- Hyper-Threading performance boost 23% in parallel DW queries
- anomalous HT performance gain at DOP 1 in version 2012 is gone
- Page compression overhead 20%, down from 40% in 2008
- Columnstore should run one thread per physical core; HT can be left enabled
The test system is a single socket Intel Xeon E5-2630 v4 10-core 2.2GHz processor
with 64GB memory. Storage is 4 Intel P750 PCI-E NVMe SSDs for data and temp,
and one SATA SSD for logs.
The software stack is Windows Server 2016 and SQL Server 2016 SP1.
SQL Server was limited to 32GB for results reported here.
The data is generated using dbgen from the TPC-H kit at scale factor 100.
This used to mean the Lineitem table is 100GB, but with a 4-byte date replacing
the former 8-byte datetime in three columns, it is now 80GB in row-store without compression
(still 600M rows).
Overall database size is 140GB.
With page compression, the Lineitem table is 46GB data and overall database size is 101GB.
In columnstore, total database size is 44GB.
Elsewhere, I have argued that performance characteristics are very different
between scaling with multiple cores on a single socket and scaling with cores
over multiple sockets.
Multi-socket scaling cannot be investigated here.
Conclusions should not be extrapolated beyond what is tested.
Baseline, no compression, row store
The figure below is speedup (elapsed time ratio) at various degrees of parallelism
relative to DOP 1
for the baseline row-store, no compression database.
Note that there are 10 physical cores.
The SQL Server engine does seem to run parallel plans on separate physical cores,
except when DOP exceeds the number of cores.
Queries 5 and 15 are worse at DOP 2 than DOP 1, at 0.675 and 0.802 respectively,
and Q3 is slightly slower at 0.948 relative to DOP 1.
In queries 3 and 5, the negative scaling from DOP 1 to 2
is caused by a change in the execution plan structure,
excluding parallelism operators.
Below is a focused look at scaling from DOP 1 to 2.
As noted, three queries are worse at DOP 2 than 1.
Several queries are almost exactly 2X better, as expected in the
ideal case for a parallel execution plan.
And then there are also queries that are more than 2X better at DOP 2 than 1.
Q17 is 4.3X faster at DOP 2 than 1.
Four others are more than 3.5X faster.
Usually, these have the bitmap operator in the parallel plan,
which is not used in a non-parallel plan.
The bitmap operator has some startup cost,
but can process large data sets more efficiently.
The reasonable assumption is that large queries have a parallel plan,
so there is no need to employ the bitmap in non-parallel plans.
There are other reasons for super-scaling.
One is the set points for hash and sort operations spilling to tempdb.
The setting is fixed as a fraction of system memory for each thread.
At higher DOP, more memory is allocated across all threads of the parallel plan.
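A rough sketch of this effect, using a hypothetical per-thread memory fraction and input size (the actual SQL Server memory grant rules are more involved than a single fixed fraction):

```python
# Sketch of why higher DOP can avoid hash/sort spills to tempdb:
# if the spill set point is a fixed fraction of memory per thread
# (an assumption here for illustration), total workspace grows with DOP.
def total_workspace_kb(system_memory_kb, per_thread_fraction, dop):
    """Workspace available to a parallel hash/sort across all threads."""
    return system_memory_kb * per_thread_fraction * dop

mem_kb = 32 * 1024 * 1024   # 32GB, as in the test configuration
fraction = 0.01             # hypothetical per-thread fraction
build_side_kb = 500_000     # hypothetical hash build input size

for dop in (1, 2, 4, 8):
    ws = total_workspace_kb(mem_kb, fraction, dop)
    spills = build_side_kb > ws
    print(f"DOP {dop}: workspace {ws / 1024:.0f}MB, spills={spills}")
```

With these assumed numbers, the hash build spills at DOP 1 but fits in memory at DOP 2 and above, which would show up as a super-linear gain from parallelism.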
Below is the scaling from DOP 2 to 10, which removes the
wide fluctuation that occurs between DOP 1 to 2.
In the ideal parallelism case, scaling should be close to 5.
Overall, the DOP 2 to 10 scaling is more uniform,
except for Q15, in which the poor scaling from 1 to 2 is recovered
at DOP 4 and higher.
Q15 scaling from 1 to 4 is 3.346, which is not bad, good even.
Performance is largely comparable between HT enabled and disabled
for DOP levels up to the number of physical cores, 1-10 in this case.
It was reported in Hyper Threading
that queries ran 24% faster at DOP 1 with HT enabled than with it disabled.
This might seem paradoxical.
A speculative explanation is that the SQL Server engine tries to work
in batches that fit inside L2 cache.
There is an API for the L2 cache size, but how much of this should be assumed
to be available for use by a particular operation?
If the assumed value is too high, and the HT-enabled case reduces
it by half, then DOP 1 with HT enabled could indeed be faster.
Below is the effect of enabling Hyper-Threading (HT) with DOP 20, over
disabling HT in UEFI (formerly BIOS) for DOP 10.
Overall, the benefit of HT is strong at 1.23X.
Q2 shows a 0.9X ratio (a slowdown), runtime going from 0.61 to 0.67 sec,
but this is probably just the effect of running a small query
at very high DOP.
It would be interesting to know which logical processors are employed
when DOP exceeds the number of physical cores on one socket,
in a multi-socket system with HT enabled.
Scaling over cores in different sockets is not as good
compared to all cores in one socket, so perhaps the 23% gain from HT
might be acceptable?
In queries 3 and 5, there is a good plan at DOP 1,
but less efficient plans for DOP 2 and higher.
The result is negative scaling to DOP 2,
and poor scaling at higher DOP relative to DOP 1.
The query optimizer model believes that two or more plans
have similar costs.
There are differences in how the cost for these plans vary with
the degree of parallelism, and hence the winning plan
changes with DOP.
Both Q3 and 5 can be rewritten to force a particular plan.
Note that the special feature Date Correlation Optimization
was designed specifically for TPC-H,
and probably no production database uses it.
When a query plan is forced, the date correlation optimization
is also turned off.
Since we are interested in scaling for a particular plan,
the full set of implied date range SARGs are explicitly specified.
The figure below shows Q3, 5, and 7 scaling relative to DOP 1
for natural and forced execution plans.
In Q3 and Q5, the natural plan from DOP 1 is forced in the parallel plans.
This results in good scaling at all DOP levels.
Below is the execution plan for Query 3 at DOP 1 without other hints,
having cost 1349 and execution time 28.38 sec.
Below is the plan at DOP 2 and higher, without hints, having cost 1057 at DOP 2.
Execution time is 29.94 sec, slightly longer than at DOP 1.
On forcing the parallel plan to the same join order of the non-parallel
plan, the cost is 1248.
In actuality, the above plan, having the join order of the non-parallel plan, executes
faster at 8.564 sec versus 29.94 sec for the unforced parallel plan.
Below is the natural plan for query 5 at DOP 1, having cost 4445, and execution time 72.70sec.
The natural parallel plan for Q5 has the structure below.
In this case, the query was written to force the join order of the parallel
plan, and then set to DOP 1.
This is done so that the plan structure is easier to discern without excessive
complexity of the parallelism operators.
The plan cost at DOP 2 is 3243.31 and execution time 107.62 sec,
significantly longer than for DOP 1.
If the plan at DOP 2 is forced to the same join order as at DOP 1,
the plan cost is 3244.02, or 0.022% higher.
In fact, the forced plan is much quicker at 23.469 sec than both the DOP 1 plan
and the natural DOP 2 plan.
Q15 is the other query that has negative scaling from DOP 1 to 2,
but miraculously recovers at DOP 4 and higher relative to DOP 1.
This is a strange case.
The execution plans are similar at the various DOP levels,
but there is a slight difference at DOP 2 not present at the other levels.
It is not always the case that the non-parallel plan structure
is better than the parallel plan structure.
Q7 has a super-scaling gain of 2.969X from DOP 1 to 2,
an outstanding gain of 1.908 from DOP 2 to 4,
then another super-scaling gain of 3.072 from DOP 4 to 8,
before trailing off, all with HT enabled.
However, Q7 performance is lower at DOP 2-10 with HT disabled
than with HT enabled, when the results should be similar.
This results in Q7 showing an incredible 1.8X gain from HT disabled at DOP 10
to HT enabled at DOP 20.
With HT enabled in UEFI, the gain from DOP 10 to 20 is 1.25X,
which is very good but not irrationally high.
Below is the natural plan at DOP 1, cost 4780 and execution time 114.96sec.
It changes to another plan at DOP 2, and then a third plan at DOP 4,
for which the structure is shown below, but forced to DOP 1, cost 5840,
execution time 102.074sec.
The forced DOP 2 plan cost is 4614, execution time 31.895 sec,
whereas the natural DOP 2 plan has cost 3710.8, execution time 38.72 sec.
In this case, it is the parallel plan at DOP 4 and higher
that runs faster at all DOP levels instead of the DOP 1 or 2 plans.
Forcing the plan to the DOP 4+ structure produces normal, good scaling.
Query Optimizer Cost Model
(this should be a separate article)
The root cause of most of these anomalies is that the cost formulas
for execution operations are based on a very simple, fixed model.
The model is adequate for transaction processing in which
queries have highly selective search arguments.
But the model is poor in assessing the index seek plus key lookup
versus scan cross-over point.
The SQL Server query optimizer employs a fixed model for IO cost:
320 IOPS for operations
assumed to be random (key lookups and loop join inner source),
and 1350 pages/sec × 8KB/page = 10,800KB/s for
sequential IO operations (scans and range seeks).
The absolute values of random and sequential IOPS are not important;
it is the ratio that affects the execution plan.
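The significance of the ratio can be sketched numerically. Using the model's stated rates, the crossover between an index seek plus key lookup plan and a scan falls at a fixed fraction of the table's page count (a simplification that ignores the CPU terms in real plan costs):

```python
# Cost-model IO rates stated above: 320 IOPS random, 1350 pages/sec sequential.
RANDOM_IO_SEC = 1 / 320    # model cost per key lookup (random IO)
SEQ_IO_SEC = 1 / 1350      # model cost per page scanned (sequential IO)

def lookup_cost(rows):
    """One random IO per looked-up row (simplified)."""
    return rows * RANDOM_IO_SEC

def scan_cost(table_pages):
    """One sequential IO per page of the table (simplified)."""
    return table_pages * SEQ_IO_SEC

# Crossover: lookups cost more than scanning the whole table once
# rows exceed table_pages * 320/1350, i.e. about 24% of the page count.
table_pages = 100_000
crossover_rows = table_pages * 320 / 1350
print(f"scan is cheaper beyond ~{crossover_rows:.0f} looked-up rows "
      f"(~{crossover_rows / table_pages:.0%} of page count)")
```

Only the 320:1350 ratio determines the crossover; doubling both rates would leave the plan choice unchanged, which is why the absolute values matter less than the ratio.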
Furthermore, the cost model assumes that IO is saturated at DOP 1
for accesses to source tables and indexes.
A parallel plan does not reduce the IO cost in source accesses.
IO cost is reduced in accordance with DOP in intermediate hash and sort operations.
For storage on a small number of hard disks, IO is saturated at DOP 1.
A very large HDD array is capable of scaling sequential IO with parallelism.
A proper SSD array is capable of that and also scaling random IO.
Why does the cost model assume saturated IO for source accesses
but scalable IO for intermediate non-source operations?
It is as if one developer designed the model for source accesses, another
did the model for intermediate operations, but they did not talk to each
other to present a consistent picture.
The two developers may sit in adjacent cubicles.
A second issue is that the cost model underestimates row access cost,
and essentially does not model column and logic costs.
This is evident in the plan cost for a query doing a row count
versus aggregating one or more columns.
There is no difference between a count and any number of column aggregations.
There is a small cost assessed for the first logic operation
(example multiply two columns),
but not for additional columns or logic.
In actuality, column and logic might be the larger part of an aggregation query's cost.
The combined effect of the query optimizer cost model limitations
is that it has no meaningful capability to predict the
degree of scaling to be expected with parallelism.
There are only two controls: the cost threshold for parallelism,
and the max degree of parallelism.
The query optimizer employs parallelism
when the parallel plan has a model cost of just slightly lower
than the non-parallel plan and if the non-parallel plan cost is
over the threshold.
This was not a huge issue back when big systems had a total of
four processor cores.
A modern system can have up to 24 cores per socket in the current generation,
and typically has multiple sockets.
The all or nothing strategy is seriously inadequate.
It is now important to have the ability to assess what degree
of parallelism to employ.
This should be determined based on the query cost and expected gain.
A query that executes in a few seconds might employ
a low DOP, perhaps 2 or 4, but only if the gain from parallelism is sufficient.
For example, 40% could be considered sufficient, but 10% is not.
An intermediate-large query might employ up to all the cores in one socket,
again, so long as there is scaling with increasing DOP.
Very large queries might employ cores in all sockets,
but only if it can be predicted that the execution plan
scales well across multiple sockets.
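One possible form of such a policy, sketched with a hypothetical scaling curve and the 40% figure from above taken as the per-step threshold (this is not how SQL Server chooses DOP; it is an illustration of what a gain-aware policy could look like):

```python
# Pick the highest DOP whose incremental gain over the previous level
# stays above a threshold, rather than the all-or-nothing max DOP.
def choose_dop(speedup_by_dop, min_step_gain=1.4):
    """speedup_by_dop: {dop: speedup relative to DOP 1}, ascending DOPs."""
    chosen, prev = 1, 1.0
    for dop in sorted(speedup_by_dop):
        gain = speedup_by_dop[dop] / prev   # incremental gain for this step
        if gain < min_step_gain:
            break                           # parallelism no longer pays off
        chosen, prev = dop, speedup_by_dop[dop]
    return chosen

# Hypothetical scaling curve: good to DOP 8, then flattening out.
curve = {2: 1.9, 4: 3.5, 8: 5.0, 16: 5.5}
print(choose_dop(curve))
```

With this curve, the policy stops at DOP 8: the step to DOP 16 gains only 10%, below the 40% threshold, so the extra cores would be better left for other work.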
Page Compression, row store
Below is the compression overhead, at various DOP levels
relative to no compression.
The earlier SQL Server 2008 SP1 comparison was on an 8-way quad-core Opteron
at SF10, with all data in memory.
Q5 has the good execution plan at DOP 1 and 2, before shifting to the
poor plan at DOP 4 and higher.
However, at DOP 2 the comparison is between a good plan on the database
with compressed tables and a poor plan on the database having tables without
compression, leading to the anomaly.
Overall, the compression overhead is a fairly uniform 1.2X at all DOP levels,
but can range from being nearly free to a high of 1.4X.
The individual values over 1.4X are probably anomalies rather than representative.
This is different than in 2008R2, in which compression overhead was 1.4X at DOP 1,
but then gradually decreased to being essentially free at higher DOP.
The test system back then was an 8-way Opteron quad-core, in which the effects
of NUMA might have been significant.
The hypothesized explanation was that without compression, there was contention
between threads as DOP increased, decreasing scaling.
The extra overhead of compression just happened to alleviate the contention.
Presumably the explanation now is that the SQL Server engine team did a good
job of removing contention where possible, so that there is better scaling,
and the overhead of compression is no longer masked.
Or it could be because of the difference in the system, NUMA and not.
Columnstore
Below is columnstore performance at various DOP, relative to row store
at the same DOP.
In some queries, it can be almost 30X better.
Overall, columnstore is more than 4X better than row store.
Note that columnstore at DOP 20 with HT enabled is very poor.
This is the expected behavior: columnstore marches through memory
sequentially, so HT is expected to interfere with columnstore code.
The figure below shows the large negative impact of DOP 20 relative to DOP 10,
although four queries showed modest gain.
There was little difference between DOP 10 with HT disabled or enabled.
Enabling HT in UEFI is not harmful for columnstore,
so long as max DOP is limited to the number of physical cores
in columnstore queries.
The more proper comparison is between columnstore at DOP 10
and row store at DOP 20 with Hyper-Threading,
as shown below.
The overall gain for columnstore is now 3.45X.
Only Q13 shows no gain for columnstore, but no query is worse than row store.
Q13 was one of the few columnstore queries that did show gain with HT,
so at matching DOP and HT, columnstore is still better than row store.
The figure below shows columnstore scaling versus DOP.
Columnstore scaling is uniformly good from DOP 1 to 4, but becomes erratic
at DOP 8 and 10.
Parallelism is a complicated topic.
The main guidance points were highlighted at the beginning.
In addition, make few or no assumptions and always investigate.
Evaluate performance characteristics at different levels
of parallelism, DOP 1, 2, 4 and higher.
Watch for scaling anomalies, and changes in execution plan structure.
Be especially careful in comparing scaling past the number
of cores in a single socket.
Hyper-Threading does have benefit in parallel plans,
but mostly for row store.
It is probably advisable to enable HT in the UEFI/BIOS,
but to limit columnstore to only the number of physical cores.
It would be better if Microsoft did not employ HT for
columnstore unless they know exactly when HT might benefit.
Compression does have overhead, but might still be worthwhile.
The larger issue is that SQL Server seriously needs to overhaul
the query optimizer cost model.
Much of the SQL Server engine is very sophisticated, and yet
the cost model has not changed substantially since version 7 in 1998.
As is, the execution plan cost bears only weakly on reality,
and it has no ability to predict scaling with parallelism,
issues in scaling beyond a single socket, or when scaling to HT works.
Without a proper cost model, the benefit of a sophisticated query
optimizer is lost.
For the time being, developers and DBAs can (re-)learn the
method of writing SQL to force a particular join order.