THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Joe Chang

Parallel Execution on SQL Server 2016

There are a number of interesting questions in parallel execution performance that can be investigated using the TPC-H kit, which has a data generator. There are also supplemental files that accompany published results. Important questions in parallel execution are: scaling versus degree of parallelism (DOP) on physical cores, Hyper-Threading, Compression and Columnstore. (Memory optimized tables support parallel plans in 2016? perhaps in a later post.) To be clear, the TPC-H data generator (dbgen) and queries are used here, but no attempt is made to adhere to the TPC-H benchmark requirements (there are some outright departures). The material reported here are only for purpose of studying parallel execution characteristics.

In brief, the findings are as follows. The sum average performance gain from Hyper-Threading over the 22 TPC-H queries is 23%, in line with characteristics observed in SQL Server version 2012. (Official TPC-H results use a geometric mean.) There used to be an anomalous behavior on SQL Server 2012 in which performance was better with HT enabled and the second logical processor idle over HT disabled, even for single-threaded execution as reported in Hyper Threading. Compression has an average overhead of 20% across all DOP levels. For SQL Server 2008, I had reported in SF 10 that compression overhead was 40% between DOP 1-8, then decreases to less than 10% at DOP 32. The test circumstances were different, and may have been related to the NUMA effects of an 8-way system. Columnstore performance is excellent and should be considered whenever practical.

  • Hyper-Threading performance boost 23% in parallel DW queries
  • HT anomalous performance gain in version 2012 at DOP 1 no more
  • Page compression overhead 20%, down from 40% in 2008
  • Columnstore should run one per physical core, HT can be enabled

Test Environment

The test system is a single socket Intel Xeon E5-2630 v4 10-core 2.2GHz processor with 64GB memory. Storage is 4 Intel P750 PCI-E NVMe SSDs for data and temp, and one SATA SSD for logs. The software stack is Windows Server 2016 and SQL Server 2016 SP1. SQL Server was limited to 32GB for results reported here.

The data is generated using dbgen from the TPC-H kit at scale factor 100. This used to mean the Lineitem table is 100GB, but with a 4-byte date for three columns that were formerly 8-byte datetime, is now 80GB in row-store without compression (still 600M rows). Overall database size is 140GB. With page compression, the Lineitem table is 46GB data and overall database size is 101GB. In columnstore, total database size is 44GB.

Elsewhere, I have argued that performance characteristics are very different between scaling with multiple cores on a single socket and scaling with cores over multiple sockets. Multi-socket scaling cannot be investigated here. Conclusions should not be extrapolated beyond what is tested.

Baseline, no compression, row store

The figure below is speedup (elapsed time ratio) at various degrees of parallelism relative to DOP 1 for the baseline row-store, no compression database. Note that there are 10 physical cores. The SQL Server engine did seem to run parallel plans on separate cores except when DOP exceeds the number of cores.

Test System

Queries 5 and 15 are worse at DOP 2 than DOP 1, at 0.675 and 0.802 respectively, and Q3 is slightly slower at 0.948 relative to DOP 1. In query 3 and 5, the negative scaling from DOP 1 to 2 is caused by a change in the execution plan structure, excluding parallelism operators.

Below is focused look at scaling from DOP 1 to 2. As noted, three queries are worse at DOP 2 than 1. Several queries are almost exactly 2X better, as expected in the idea case for a parallel execution plan.

Test System

And then there are also queries that are more than 2X better at DOP 2 than 1. Q17 is 4.3X faster at DOP 2 than 1. Four others at more than 3.5X. Usually, these have the bitmap operator in the parallel plan, which is not used in a non-parallel plan. The bitmap operator has some startup cost, but can process large data sets more efficiently.

The reasonable assumption is that large queries have a parallel plan, so there is no need to employ the bitmap in non-parallel plans. There are other reasons for super-scaling. One is the set points for hash and sort operations spilling to tempdb. The setting is fixed as a fraction of system memory for each thread. At higher DOP, more memory is allocated across all threads of the parallel plan.

Below is the scaling from DOP 2 to 10, which removes the wide fluctuation that occurs between DOP 1 to 2. In the ideal parallelism case, scaling should be close to 5.

Overall, the DOP 2 to 10 scaling is more uniform, except for Q15, in which the poor scaling from 1 to 2 is recovered at DOP 4 and higher. Q15 scaling from 1 to 4 is 3.346, which is not bad, good even.

Performance largely comparable between HT enabled and disabled for DOP levels up to the number of physical cores, 1-10 in this case. It was reported in Hyper Threading, that DOP 1 ran 24% faster at DOP 1 with HT enabled over disabled. This might seem paradoxical. A speculative explanation is that the SQL Server engine tries to work in batches that fit inside L2 cache. There is an API for the L2 cache size, but how much of this should be assumed to be available for use by a particular operation? If the assumed value is too high, but the HT enabled case reduces this by half, then it is possible that DOP 1 with HT enabled could be faster?

Below is the effect of enabling Hyper-Threading (HT) with DOP 20, over disabling HT in UEFI (formerly BIOS) for DOP 10.

Test System

Overall, then benefit of HT is strong at 1.23X. Q2 shows as 0.9 reduction, going from 0.61 to 0.67sec runtime, but this is probably just the effect of running a small query at very high DOP. It would be interesting to know which logical processors are employed when DOP exceeds the number of physical cores on one socket, in a multi-socket system with HT enabled. Scaling over cores in different sockets is not as good compared to all cores in one socket, so perhaps the 23% gain from HT might be acceptable?

Special cases

In queries 3 and 5, there is a good plan at DOP 1, but less efficient plans for DOP 2 and higher. The result is negative scaling to DOP 2, and poor scaling at higher DOP relative to DOP 1. The query optimizer model believes that two or more plans have similar costs. There are differences in how the cost for these plans vary with the degree of parallelism, and hence the winning plan changes with DOP.

Both Q3 and 5 can be rewritten to force a particular plan. Note that the special feature Date Correlation Optimization was designed specifically for TPC-H, and probably no production database uses it. When a query plan is forced, the date correlation optimization is also turned off. Since we are interested in scaling for a particular plan, the full set of implied date range SARGs are explicitly specified.

The figure below shows Q3, 5, and 7 scaling relative to DOP 1 for natural and forced execution plans.

Force

In Q3 and Q5, the natural plan from DOP 1 is forced in the parallel plans. This results in good scaling at all DOP levels. Below is the execution plan for Query 3 at DOP 1 without other hints having cost 1349, execution time 28.38sec.

Q3 DOP1

Below is the plan at DOP 2 and higher, without hints, having cost 1057 at DOP2. Execution time is 29.94 sec, slightly longer than DOP 1.

Q3 n

On forcing the parallel plan to the same join order of the non-parallel plan, the cost is 1248.

Q3 n

In actuality, above plan having the join order of the non-parallel plan executes faster at 8.564sec versus 29.94sec for the unforced parallel plan.

Below is the natural plan for query 5 at DOP 1, having cost 4445, and execution time 72.70sec.

Q5 a

The natural parallel plan for Q5 has the structure below. In this case, the query was written to force the join order of the parallel plan, and then set to DOP 1. This is done so that the plan structure is easier to discern without excessive complexity of the parallelism operators. The plan cost at DOP 2 is 3243.31 and execution time 107.62, significantly longer than for DOP 1.

Q5 a

If the plan at DOP2 had been forced to the same join order as at DOP 1, the plan cost is 3244.02, or 0.022% higher. In fact, the forced plan is much quicker at 23.469 than both the DOP 1 plan and the natural DOP 2 plan.

Q15 is the other query that has negative scaling from DOP 1 to 2, miraculously recovers at DOP 4 and higher relative to DOP 1. This is a strange case. The execution plans are similar at various DOP levels, but there is a slight difference at DOP2. but not the other levels.

It is not always the case that the non-parallel plan structure is better than the parallel plan structure. Q7 has a super-scaling gain of 2.969X from DOP 1 to 2, an outstanding gain of 1.908 from DOP 2 to 4, then another super-scaling gain of 3.072 from DOP 4 to 8, before trailing off, all with HT enabled.

However, Q7 performance is less at DOP 2-10 with HT disabled than with HT enabled, when the results should be similar. This results in Q7 show an incredible 1.8X gain from HT disabled DOP 10 to HT enabled DOP 20. With HT enabled in UEFI, the gain from DOP 10 to 20 is 1.25X, which is very good but not irrationally high.

Below is the natural plan at DOP 1, cost 4780 and execution time 114.96sec.

Q7 a

It changes to another plan at DOP 2, and then a third plan at DOP 4, for which the structure is shown below, but forced to DOP 1, cost 5840, execution time 102.074sec. The forced DOP 2 plan cost is 4614, execution time 31.895, whereas the natural DOP 2 plan cost 3710.8, execution time 38.72.

Q7 a

In this case, it is the parallel plan at DOP 4 and higher that runs faster at all DOP levels instead of the DOP 1 or 2 plans. In forcing the plan to the DOP 4+ plan, normal good scaling results are observed.

Query Optimizer Cost Model

(this should be a separate article)
The root cause of most of these anomalies is that the cost formulas for execution operations are based on very simple, fix model. The model is adequate for transaction processing in which queries have highly selective search arguments. But the model is poor in assessing the index seek plus key lookup versus scan cross-over point.

The SQL Server query optimizer employs a fixed model for IO cost. The model is 320 IOPS for operations assumed to be random (key lookups and loop join inner source) 1350 pages/sec × 8KB/page = 10,800KB/s for sequential IO operations (scans and range seeks). The absolute values of random and sequential IOPS are not important, as it is the ratio that effects the execution plan.

Furthermore, the cost model assumes that IO is saturated at DOP 1 for accesses to source tables and indexes. A parallel plan does not reduce the IO cost in source accesses. IO cost is reduced in accordance with DOP in intermediate hash and sort operations. For storage on a small number of hard disks, IO is saturated at DOP 1. A very large HDD array is capable of scaling sequential IO with parallelism. A proper SSD array is capable of that and also scaling random IO.

Why does the cost model assume saturated IO for source accesses but scalable IO for intermediate non-source operations? It is as if one developer designed the model for source, another did the model for intermediate, but they did not talk to each other to present a consist picture. The two developers may sit in adjacent cubicles.

A second issue is that the cost model underestimates row access cost, and essentially does not model column and logic costs. This is evident in the plan cost of for a query doing a row count versus aggregating one or more columns. There is no difference between a count or any number of column aggregations. There is a small cost assessed for the first logic operation (example multiply two columns), but not for additional columns or logic. In actuality, column and logic might be the large part of an aggregation query.

The combined effect of the query optimizer cost model limitations is that it has no meaningful capability to predict the degree of scaling to be expected with parallelism. There are only two controls: the cost threshold for parallelism, and the max degree of parallelism. The query optimizer employs parallelism when the parallel plan has a model cost of just slightly lower than the non-parallel plan and if the non-parallel plan cost is over the threshold.

This was not a huge issue back when big systems had a total of four processor cores. A modern system up to 24 cores per socket in the current generation as well as typically having multiple sockets. The all or nothing strategy is seriously inadequate. It is now important to have the ability to assess what degree of parallelism to employ. This should be determined based on the query cost and expected scaling characteristics.

A query that is executes in a few seconds might employ a low DOP, perhaps 2 or 4, but only if the gain from parallelism is sufficient. For example, 40% could be considered sufficient, but 10% is not. An intermediate-large query might employ up to all the cores in one socket, again, so long as there is scaling with increasing DOP. Very large queries might employ cores in all sockets, but only if it can be predicted that the execution plan scales well across multiple sockets.

Page Compression, row store

Below is the compression overhead, at various DOP levels relative to no compression. In SQL Server 2008 sp1, on an 8-way quad-core Opteron at SF10, all data in memory

Test System

Q5 has the good execution plan at DOP 1 and 2, before shifting to the poor plan at DOP 4 and higher. However, at DOP 2 the comparison is between a good plan on the database with compressed tables and a poor plan on the database having tables without compression, leading to the anomaly. Overall, the compression overhead is a fairly uniform 1.2X at all DOP levels, but can range from being nearly free to a high of 1.4X. The individual values over 1.4X are probably anomalies rather than representative.

This is different than in 2008R2, in which compression overhead was 1.4X at DOP 1, but then gradually decreased to being essentially free at higher DOP. The test system back then was an 8-way Opteron quad-core, in which the effects of NUMA might have been significant. The hypothesized explanation was that without compression, there was contention between threads as DOP increased, decreasing scaling. The extra overhead of compression just happened to alleviate the contention.

Presumably the explanation now is that the SQL Server engine team did a good job of removing contention where possible, so that there is better scaling without compression. Then the overhead of compression is not masked. Or it could be because of the difference in the system, NUMA and not.

Columnstore

Below is columnstore performance at various DOP, relative to row store at the same DOP. In some queries, it can be almost 30X better. Overall, columnstore is more than 4X better than row store.

Test System

Note that columnstore at DOP 20 with HT enabled is very poor. This is the expected behavior, because columnstore marches through memory sequentially, so HT is expected to interfere with columnstore code.

The figure below shows the large negative impact of DOP 20 relative to DOP 10, although four queries showed modest gain.

Test System

There was little difference between DOP 10 with HT disabled or enabled. Enabling HT in UEFI is not harmful for columnstore, so long as max DOP is limited to the number of physical cores in columnstore queries.

The more proper comparison is between columnstore at DOP 10 and row store at DOP 20 with Hyper-Threading, as shown below.

Test System

The overall gain for columnstore is now 3.45. Only Q13 shows no gain for columnstore, but no query is worse than row store. Q13 was one of the few columnstore queries that did show gain with HT, so columnstore is better than row store at matching DOP and HT.

The figure below shows columnstore scaling versus DOP.

Test System

Columnstore scaling is uniformly good from DOP 1 to 4, but becomes erratic at DOP 8 and 10.

Summary

Parallelism is a complicated topic. The main guidance points were highlighted at the beginning. In addition, make few or no assumptions and always investigate. Evaluate performance characteristics at different levels of parallelism, DOP 1, 2, 4 and higher. Watch for scaling anomalies, and changes in execution plan structure. Be especially careful in comparing scaling past the number of cores in a single socket.

Hyper-Threading does have benefit in parallel plans, but mostly for row store. It is probably advisable to enable HT in the UEFI/BIOS, but to limit columnstore to only the number of physical cores. It would be better if Microsoft not employ HT for columnstore unless they know exactly when HT might benefit. Compression does have overhead, but might still be worthwhile.

The larger issue is that SQL Server seriously needs to overhaul the query optimizer cost model. Much of the SQL Server engine is very sophisticated, and yet the cost model has not changed substantially since version 7 in 1998. As is, the execution plan cost has weak bearing to reality and that is has no ability to predict scaling with parallelism, issues in scaling beyond a single socket, and when scaling to HT works. Without a proper cost model, the benefit of a sophisticated query optimizer is lost. For the time being, developers and DBAs can (re-)learn the method for writing SQL to a particular join order.

Published Wednesday, February 01, 2017 11:33 AM by jchang

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

George Walkey said:

Keep beating the drum Joe

out of the 1.2million items fixable on Connect, this one is the most important

definately more than wasting dev cycles on Azure or R Code....

February 4, 2017 3:23 PM
 

Kevin Boles said:

I would love to see similar benchmarking done with two or more different workloads being run concurrently. At first blush I wonder if Hyperthreading would not lose out then due various data/code cache misses happening much more frequently than they due with a single query running against a single set of data.

February 5, 2017 11:16 AM
 

jchang said:

In principle, Azure, AWS etc. are good options to have. Very few companies have the resources to hyper-optimize IT infrastructure administration. Some  have fallen into the trap of gold-bricking, using "enterprise" SANs, i.e. about $6K per HDD. Other have become obstructors of IT.

The problem is that every time something good comes out, the high level executives who have not done real work for years over-promote it. There was a news item that Microsoft IT wants to move 90% to Azure, which seems to be a very good number to me. Meaning only the most critical systems are directly managed, but they are directly managed.

Based on my limited work with Azure, I would not want to manage a large Azure application set given the current Web-based administration. The web is great for information gathering, it sucks as a control system. How long before it gets through their thick skulls that a real application is needed for administration? (I make lots of friends with my commentary?) Just how hard is it to do this.

Also, for Azure to be really successful, there needs to be datacenters in closer proximity to the major cities. I think MS has 4 US locations? excluding government and DOD? The speed of light is 300km or 186 miles per milli-sec. Actual transmission time is longer with all the router hops, store and forward delays. Round-trip network latency really needs to be lower for people to be effective.

February 5, 2017 1:47 PM
 

jchang said:

The key to Hyper-Threading is in there being significant round-trip memory access no-op cycles. For transactions, this is very true. In other posts, I have suggested that this might be around 5% of instructions, meaning that perhaps only 12% of cycles are actually used by a thread. Hence scaling with HT is almost linear in the absence of contention, such a inserting into a table with an identity column.

I am inclined to guess that reduction of effective cache is a less effect. Still, if DW on row-organized tables scales by 25% with HT, this suggests that perhaps only 30% of cpu-cycles are no-op. Columnstore in theory can fully consume the CPU, so don't bother with HT on a columnstore db.

So what is the right strategy in mixing TP with row-DW? Create two soft-NUMA nodes, one for TP and another for DW. but should it be both logical processors in a group of physical cores for each, or should each group get one logical processor per physical across all cores?

We are expecting that TP will have about the same performance regardless of whether it is 10 LP on 5 Phys or 10 LP, one per phys. DW will prefer one LP per phys core. If DW does in fact consume 70% of the cycles, then perhaps a TP running concurrently on the other LP of each phys core might have only mild degradation? If I have time, I might look into this.

February 5, 2017 2:06 PM

Leave a Comment

(required) 
(required) 
Submit

About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine

This Blog

Syndication

Privacy Statement