I had discussed SQL Server parallelism in Oct 2010, with my thoughts on the
best settings for: Cost Threshold for Parallelism (CTP) and Max Degrees of Parallelism (MAXDOP) in
Parallelism Strategy and Comments.
At the time, I had intended to follow up with detailed measurements.
So now a mere 2 years later, here it is.
The general thought was that CTP should be raised from the default value of 5,
and MAXDOP should be changed from unrestricted, on modern systems with very many cores,
and most especially on systems with Hyper-Threading.
However reasonable each persons ideas/suggestions are, nothing is better than hard data.
The interest is in the smaller queries that can have a parallel execution plan.
With the default Cost Threshold of Parallelism setting,
we are trying to find queries with the lowest plan cost and lowest actual cost (which point to different queries)
that have a parallel execution plan.
(I will details on index seek later).
The test system is now a 2-socket Intel Xeon 5670 six-core 2.93GHz (32nm Westmere-EP) with Hyper-Threading enabled
(a total of 24 logical processors).
Some references are made to test results on an earlier system
with 2 x Xeon E5430 quad-core 2.66GHz (45nm Core 2) without Hyper-Threading.
The test data is built using the TPC-H data generator, initially employing derivatives of the Part table.
At scale factor 10, the Part table is 2M rows, about 285MB, 55 rows per page or 149 bytes per row.
The derived tables have 2 additional 4-byte integer columns, size 301MB, 52 rows per page or 157 bytes per row.
Parallel Hash Join
The smallest hash join on the modified Part table that results in a parallel execution plan occurs at around 100,000 rows.
Lets first examine the non-parallel plan.
The three major components of this plan are the outer and inner source index seeks, and the hash match.
The Stream Aggregate is small to the major components.
The IO cost of a 1.42 corresponds roughly to a range scan/seek of 15MB (IO cost 1 is 10.5MB).
The actual logical IO is 1931, very close to the values of (1.42446 - 0.003125) * 1350,
and includes a few upper level index LIOs.
Below are the Hash Match and Stream Aggregate details.
Note that the Hash Match has the largest cost of all operations in this execution plan,
and the entire cost is in the CPU element.
The 2 index seeks operations have their cost mostly in the IO element, with only 7.1% in the CPU.
Below is the parallel Hash Join execution plan at DOP 2.
The Outer and Inner Source index seeks have identical costs so only the first is shown.
As covered in Cost Based Optimizer Parallelism I
the CPU cost is reduced by the degree of parallelism,
but the IO cost does not change, which I call a saturated IO model.
The Hash Match details above and the Stream Aggregate details below at DOP 2.
Both operations have cost reduced by the degree of parallelism as both elements
have their costs entirely in the CPU portion.
The Parallelism operation, as its name implies, only occurs in parallel execution plans
adding cost 0.0285.
The cost structure of this operation is proportional to the number of rows,
which implies that high row count queries are less likely to have a parallel execution plan
as the cost of reconstituting threads may outweigh the formula benefit of parallel operations.
Parallel Loop Join
The second test query is a loop join.
A parallel execution plan occurs at around 8000 rows for high degrees of parallelism.
At DOP 2, this occurs around 10000 rows, which will be used for test purposes.
The non-parallel loop join plan is shown below.
The outer and inner source operation details are shown below.
Note the inner source subtree cost of 29.0264, based on number of executions 10000.
The CPU component should be 10000 * 0.0001581 = 1.581.
Then the IO component is 29.0264 - 1.581 = 27.4454,
which is approximately equal to 8782 * 0.003125.
This is an estimate of the number of physical IO for 10000 executes.
The assumption is that some of the pages would have been previously loaded into memory
during the execution of this query, but not yet evicted.
There are 38465 leaf level pages in both tables.
The alternate plan for this query is a hash join with a scan on the inner source table.
In this plan, the table scan would have IO component 28.5, CPU 2.2
and the hash match would contribute another 9.
So it would take another 35% more rows before the loop join plan would naturally
shift to a hash join at DOP 1. At higher DOP, the hash join cost is reduced proportionate to DOP.
The Nested Loops and Stream Aggregate details are shown below.
The Nested Loops is only 0.14% of the overall plan cost, and the Stream Aggregate even less.
Below is the parallel loop join query plan at DOP 2.
The parallel outer and inner details are below. Note the very minor reduction
in outer source CPU from 0.155756 to 0.150178.
There is no change in the inner source operation, not even the element attributed to CPU.
The Nested Loops and Parallelism details are below.
In the parallel plan, there is a CPU reduction of 0.005 from the outer source,
0.0209 from the Nest Loops, and 0.003 from the Stream Aggregate,
to just ever so slightly overcome the Parallelism cost of 0.0285.
The total plan cost for the non-parallel loop join is 29.23 and the parallel plan cost is 29.229.
If this model was even remotely accurate, we should ask why bother to switch to a parallel plan
for such a minute gain.
It turns out that the SQL Server parallel execution plan cost estimate is nothing close to the true
cost structure, which raises a completely different set of questions,
starting with: how can SQL Server be expected to generate good parallel (or not) execution plans
if the cost model is completely different than the true cost structure?
Parallel Query Performance and Actual Cost - Hash Joins
Below is the hash join performance in rows per sec (left vertical scale),
and (worker or CPU) cost in micro-sec per row (right vertical scale), both versus degree of parallelism.
Hash Join 100K rows versus Degree of Parallelism
At DOP 1, the performance is just over 1M rows/sec, scaling well to DOP 4,
levels off for DOP 6-12 at just over 4M rows/s, then jumps again at DOP 14.
The peak gain from parallelism is 6.646 speed up at DOP 20 over DOP 1.
The cost is just under 1 us per row at DOP 1, rising as high as 2.6 us per row at high
DOP, more so if HT is involved?
At DOP 1, the query execution time for 100K rows is 95ms.
It appears that at DOP 2 and 4, the individual threads are running on logical processors
in different physical cores, hence the excellent scaling.
At DOP 6-12, some of the logical cores are on the same phyical cores.
Hyper-threading does improve performance to a moderate degree in SQL Server
depending on the operation.
(I worked on a custom b-tree search engine with no locking protection.
The scaling was essentially linear regardless of physical or logical cores from 1 to 24 threads.
This may seem fantastic, but it does make sense because a b-tree search is just serialized memory accesses.
Fetch a memory location, which determine the next memory location to fetch.
The first memory access must be complete before the next can be issued. The core clock rate for 3.3GHz is 0.3ns. Memory access latency is 50ns for local node, 100ns for 1 hop remote node, corresponding to 150 and 300 CPU-cycles respectively.)
If this assessment is correct, then it would suggest the proper strategy for the SQL Server
engine in parallel execution plans is to allocate 1 worker thread from a logical processor
on each physical core, perhaps allocating from the cores on the same socket before
proceeding to the next socket.
Only if the desired degree of parallelism is greater than number of cores should two logical
processor be allocated from any single physical core, and this should only occur in
unusual circumstances.
The next level of sophistication would be to match thread assign with memory alignment, so that the majority of memory accesses are to the local node. This might require partitioned tables (with hash key partitioning?).
This effort would be hugely complicated, and only applicable to special circumstances,
but a true fanatic would not be deterred.
Below is the performance and cost structure for a hash join on the full table of 300MB and 2M rows.
The cost per row is higher 1.495 us per row at DOP 1, rising to 3us at high DOP.
However scaling is better with a peak speedup over DOP 1 of 11.35 and the peak rows per sec
is also higher than the previous query.
Hash Join - 2M rows versus Degree of Parallelism
I am not sure why the cost per row in this query is higher for the full table over the limited range of 100K rows.
There was no tempdb activity in either case.
One speculation is that the hash join intermediate results remain L3 cache (15M), and results in lower access
time than off-die memory accesses.
A test with small and large merge joins, also exhibits the same behavior.
Is it possible this is due to local and remote node memory locations?
Life is not simple on NUMA systems with HT.
Just to be sure, the same full table scan query was tested on a 3GB 20M row table, as shown below.
The DOP 1 cost and performance was about the same, slightly lower actually at 1.376 us per row.
The scaling was better with peak speedup at 14.95 and peak performance at nearly 11M rows/sec.
Hash Join - 20M rows versus Degree of Parallelism
Both full table hash joins (300MB and 3GB) demonstrate that parallel execution plan scaling
is better for larger queries.
Parallel Query Performance and Actual Cost - Loop Joins
Below is the Loop Join performance in rows per sec and cost in usec per row
for a query of 10000 rows on the 300MB tables.
Notice that performance peaks at DOP 6 and degrades at higher DOP.
The peak speedup over DOP 1 was 4.4.
Loop Join - 10K rows versus Degree of Parallelism
The cost at DOP 1 is 1.827 usec per row.
For 10000 rows, the execution time is a mere 1.8 ms (correction) 18ms.
That this query benefit from parallel execution at all is amazing.
Of course it also raises the serious question as
why SQL Server should go to the effort of a parallel execution plan
for a query that runs in 1.8ms 18ms.
This is due to the antiquated IO based cost model and the default CTP setting of 5.
Based on the 100K row hash join hash with plan cost just over 5 and actual query time of 95ms,
we might consider Cost Threshold for Parallelism around 25,
making the theshold at around 0.5 sec.
However, based on the 10K rows loop join with plan cost 29 and actual query time of 1.8ms,
CTP of 25 would point to a 50K rows loop join with actual execution tim 10ms. (OK, it was late)
However the 10K row loop join at plan cost 29 runs in 1.8ms!
The disparity between Loop and Hash join model and actual cost does allow a good strategy
for the Cost Threshold for Parallelism setting without a complete overhaul of the SQL Server
cost model. And this is not something the SQL Server team seems to be will to tackle.
Another possiblilty is for the Cost Threshold for Parallelism setting to only
consider the CPU element of the plan cost.
Too bad we do not have access to source code to investigate this.
The outer and inner source tables were populated
and indexed in a manner such that for this loop join, each row from the outer source joins to a row
in a different page of the inner source table.
When the rows join to consecutive rows in the inner source table,
the cost per row is even lower at 1.33 usec per row.
Below is the Loop Join performance and cost for 100K rows on the 3GB tables.
Performance continues to increase for higher DOP.
The peak speedup over DOP 1 was 8.26.
Loop Join - 100K rows versus Degree of Parallelism
In examining the CPU usage during the tests, it was evident that in some cases,
only one logical processor of a physical core was used.
In other cases, both logical processors on some cores were used.
This leads to the uneven scaling versus DOP.
Parallel Execution Plan Throughput Tests
Another point interest is throughput with parallel execution plans
is supporting concurrent sessions.
The figure below shows performance in rows per sec at various DOP settings
versus number of concurrent sessions.
The results based on the number of queries completed in a 10-sec windows.
Since the single session DOP 1 query run-time is 100ms, this should be reasonably accurate.
Hash Join Performance (rows/s) by DOP versus # of Concurrent Sessions
SQL Server parallel execution plans are great for a single session,
even for queries that run in so short a time that parallelism should have never been considered.
However there are definitely issues with parallel execution in throughput tests.
Loop Join Performance (rows/s) by DOP versus # of Concurrent Sessions
It is stressed that the queries used in the tests above run in 95ms and 1.8ms 18ms respectively
at DOP 1. So it is not surprising that there are significant throughput issues
in attempting concurrent parallel execution with such minuscule queries.
I present these as examples because they are among the smallest queries for which
SQL Server engages parallel execution with the default settings.
This is because the default setting are seriously obsolete.
The original setting that targeted a 5 sec threshold for parallel execution on a Pentium Pro 166MHz
(I think this a 0.8um or 800nm design)
does not work well on modern microprocessor at 3GHz and 32nm design.
That said, SQL Server does have issues with concurrent parallel execution throughput
on properly sized queries. So some effort here would be helpful.
But the priority is below index on hash key, and parallel write operations.
The initial release neglected the simple index seek test
Parallel Query Performance and Actual Cost - Index Seek
Below is the performance and cost characteristics for a simple index seek
aggregating 1 float type column for 350K rows.
The cost per row is 0.1734 usec at DOP 1.
Considering that the hash join test above does 2 index seeks and summed 2 float columns,
we could speculated that the cost of the bare hash join is
0.952 - 2x0.1734 = 0.605 usec per row.
Below is the characteristics for a simple index with only COUNT(*).
The cost is now only 0.0783 usec per row.
This implies that the float type column sum costs 0.095 usec per row.
The bare index seek cost of 0.0783usec per row also amortizes the cost of the page access.
At 52 rows per page, the full page cost including rows is 4.07usec.
Previously, I had assessed that the bare page access cost was 1us on the Core 2 platform.
The Westmere processor is more powerful and mostly importantly,
has significantly lower memory round-trip time.
So I am expecting the bare page cost on Westmere to be less than 1usec.
It might seem that we a quibbling over fractions of 1usec.
But we need to consider that 1us on a modern microprocessor in 2-3,000 CPU-cycles,
and this on a processor core that can execute multiple intructions per cycle.
Those of you with very good friends on the Microsoft SQL Server engine team
might persuade them to show you the source code and compiled binary/assembly.
Even without looking source code, it is clear that there are not nearly that many
instructions to touch a page, row and column.
The CPU-cycles are spent in the lock mechanism and the memory-round sequences.
This is why there significant differences between a traditional database engine
with all data in-memory and the "in-memory" database engine that have
completely different data organization.
It might better to refer to traditional database engines as page-row structures
and the new stuff column-oriented structures.
See my blog
SIMD Extensions for the Database Storage Engine
for my thoughts on an alternative solution.
Parallel Execution Setting Summary
It is definitely clear that the default settings for Cost Threshold for Parallelism (5)
and Max Degree of Parallelism (0 - unrestricted) are seriously obsolete
and should be changed as standard practice on SQL Server installation.
It is not clear there is a single universal best value for Cost Threshold for Parallelism.
I think the more important criteria is that there should not be a high volume of
concurrently running queries with a parallel execution plans.
Whatever the plan cost of the high volume queries are, the CTP should be set above it.
Of course, the other important objective is to reduce the plan cost
of these high volume queries, assuming this correlates to improved actual query execution time.
The strategy for Max Degree of Parallelism is also unclear.
Before the advent of multi-core processors, a good strategy was to disable parallelism
on transaction processing systems.
It was also critical in the old days to ruthlessly enforce the restriction of only pure transactions too.
Today, we have 32-40 powerful cores on transaction systems,
and almost everyone runs complex queries on it.
So parallelism is good.
Ideally I would like to restrict MaxDOP to the number of physical cores
on a Data Warehouse system, and less than that on a transaction system.
I would also like to get great scaling with parallelism.
But because SQL Server does not balance the threads to one logical processor
per phyical core, this strategy does not work.
So this is something the SQL Server team needs to address.
I will make continued updates on this on my website www.qdpma.com in the topic
Parallelism.