The model for the cost of operations used by the SQL Server query optimizer
to produce an execution plan
is rather simple.
It has been largely unchanged since SQL Server version 7, RTM in 1998.
Back then, SQL Server had not yet achieved tier 1 status with commensurate resources.
Some adjustments were made in version 2005.
One might question how something so old can possibly be relevant today,
considering how much hardware has changed over the intervening years.
The cost model is still adequate for pure transaction processing
scenarios in which queries have highly selective search arguments.
Its limitations become pronounced in parallel execution plans.
The salient aspect of modern hardware is that compute capability is
distributed over very many processor cores,
possibly with irregularities due to non-uniform memory access (NUMA).
The decision on parallelism cannot be just on/off with a
cost threshold, plus one-size-fits-all with max degree of parallelism.
It is necessary to have a degree of sophistication
in the employment of resources for parallel processing.
For some queries, it might be best to utilize a low to moderate level of
parallelism.
In others, perhaps all cores in one socket.
And in exceptional circumstances, the whole set of cores.
For this, the existing SQL Server cost model does not have sufficient
depth to be the foundation from which an effective strategy for
deploying resources can be formulated.
It should not be too difficult to build a suitable model,
considering the wealth of knowledge that has (or should have been)
accumulated over the years.
There will be some resistance to changing the cost model
given that people have become accustomed to the plans generated
from the current cost model.
But that could be accommodated as a legacy mode,
so as to not stand in the way of progress.
The SQL Server Plan Cost Model Overview
The SQL Server cost model
is explored in more detail elsewhere.
Other authors have written on the query optimizer itself.
Only a brief description of the SQL Server cost model is given here.
Buried in Books Online, in the section on the
query governor cost limit option, is the statement:
"Query cost refers to the estimated elapsed time, in seconds,
required to complete a query on a specific hardware configuration."
The reference system existed when SQL Server version 7 was in development,
so this could have been a Pentium Pro system, but some people have said
that it was a Pentium.
My investigation into this topic began around the year 2000,
though I never went back to look at whether SQL Server version 6
and 6.5 used the same model.
Each operator in an execution plan can have two components, CPU and IO.
The IO component represents the time to complete the IO operation.
There is in fact also CPU involved to perform an IO, and it is unclear
whether this is part of the IO or represented in the CPU component.
The assumption is that leaf level pages are not in memory at the start of query
execution, but may be retained if the execution touches a page more than once.
The complete cost of an operator factors in the CPU, IO, number of rows,
and number of executions.
The CPU and IO may represent the cost for only a single execution,
while the operator cost incorporates the entire operation
as a function of the number of executions.
The IO cost is based on a model of the disparate nature of hard disk performance
in random and sequential access at fixed queue depth.
SQL Server versions 7 and 2000 had two baselines, but from version 2005 on,
the IO cost model is that random IO performance is 320 IOPS, and sequential
IO is 1350 pages per sec (10.8MB/s).
As cost is in seconds, the random IO from key lookup and loop join
operations or the first page of a scan is 1/320 = 0.003125.
Each successive page in a scan or range seek is 1/1350 =
0.000740740 (the 740 sequence repeats).
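These per-page values can be combined into a simple sketch of how the model prices a scan versus a series of key lookups. This is my reconstruction from the constants described above, not Microsoft's actual code:

```python
# Reconstruction of the SQL Server IO cost constants described above.
RANDOM_IO_COST = 1 / 320        # 0.003125: key lookup, loop join, or first page of a scan
SEQUENTIAL_IO_COST = 1 / 1350   # 0.000740740...: each successive page of a scan or range seek

def scan_io_cost(pages):
    # First page is costed as a random IO, the remainder as sequential.
    return RANDOM_IO_COST + (pages - 1) * SEQUENTIAL_IO_COST

def key_lookup_io_cost(rows):
    # Each lookup is costed as one random IO.
    return rows * RANDOM_IO_COST

# A 10,000-page scan vs. 1,200 key lookups:
print(scan_io_cost(10_000))       # ~7.41
print(key_lookup_io_cost(1_200))  # 3.75
```

Note that under these constants, roughly 24 key lookups cost the same as scanning 100 pages, which is how the cross-over between seek-plus-lookup and scan plans arises.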
The plan cost does not model the cost of logic beyond the basic operation.
For example, an aggregation of one column has the same cost as two or more.
The interpretation of this is that the purpose of the plan cost model is to find
the best index and table access sequence and method.
Ancillary logic that must be executed in all cases does not affect the choice of plan.
Hard Disks 1995-2005
It might seem that, given the pace of hardware change,
such an old model cannot possibly be valid, resulting in horrible execution plans.
Around 1995 or so, the high-performance HDD was 7200RPM with a sequential
bandwidth of 4-5MB/s.
The mean rotational latency for 7200RPM is 4ms.
An average seek time of 8.5ms seems reasonable, though I have not kept
any documentation from this period.
This would correspond to 80 IOPS per disk at queue depth 1 per HDD.
So, it seems curious that the SQL Server cost model is based on the random
IOPS of 4 disks, but the sequential IO of 2 HDDs.
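The figures behind that observation, using the HDD characteristics cited above:

```python
# Mid-1990s 7200RPM HDD: ~4ms rotational latency + ~8.5ms average seek.
hdd_iops = 1000 / (4 + 8.5)   # = 80 IOPS at queue depth 1

# Compare against the cost model's 320 IOPS random and 10.8MB/s sequential.
print(320 / hdd_iops)         # 4.0 -> the model's random IO equals 4 such disks
print(10.8 / 4.5)             # ~2.4 -> its sequential IO equals roughly 2 disks at 4-5MB/s
```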
Performance HDDs progressed first to 10K RPM around 1996,
and then to 15K around 2000, with corresponding rotational
latencies of 3 and 2ms respectively.
Average seek time was reduced over time
to 3ms with developments
in powerful rare-earth magnets.
The 10K HDD could support 125 IOPS, and the 15K HDD 200 IOPS.
But no further progress was made on HDD rotational speed.
In the same time period, hard disk sequential IO improved phenomenally,
quickly exceeding 100MB/s in the fourth-generation 15K HDDs around 2005.
In other words, the SQL Server cost model is based on a 1350/320 = 4.2
ratio of sequential to random IO.
But 15K HDDs in the mid-2000's were 100MB/s × 128 pages/MB = 12,800 pages/sec
sequential to 200 IOPS random, for a ratio of 64:1.
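To make the mismatch concrete, here is the arithmetic from the figures quoted above:

```python
# Optimizer's implied sequential:random ratio vs. a mid-2000s 15K HDD.
model_ratio = 1350 / 320           # sequential pages/sec over random IOPS
hdd_seq_pages = 100 * 128          # 100MB/s at 128 (8KB) pages per MB
hdd_ratio = hdd_seq_pages / 200    # over 200 random IOPS for a 15K HDD

print(round(model_ratio, 1))       # 4.2
print(hdd_ratio)                   # 64.0
```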
It turns out that achieving nearly the sequential IO capability of HDDs
from SQL Server required a special layout strategy,
as outlined in the Fast Track Data Warehouse Reference Architecture
papers, which few people followed.
This was due to the fixed, inflexible IO pattern of SQL Server, which
required the disk configuration to match that of SQL Server,
instead of SQL Server being able to adjust its IO pattern to match that of the storage.
In a SAN system, where the vendor was exclusively focused on random
IO, the typical configuration supported sequential performance of about 150 IOPS at 64KB,
for 9.6MB/s or 1200 pages per sec.
Quite by accident, this was approximately in line with the SQL Server query optimizer.
In any case, regardless of the actual random and sequential values,
the important criterion is the sequential to random IO ratio,
on the assumption that leaf level pages are not in-memory.
And this was usually not true by the 2005 timeframe, when hot data
was likely to be in memory for a well-tuned transaction processing system
on a 4-way system at maximum memory configuration.
All the numbers cited above for HDD random IOPS performance were for queue depth 1.
Of course, SCSI HDDs support elevator seeking, in which IO is issued at high queue depth,
and the disk reorders the sequence of operations for higher IOPS.
The SQL Server engine facilitated this by issuing IO at high queue depth when the
estimated number of executions of a key lookup or loop join was 25 or more.
Below this, IO was issued at queue depth 1.
But the plan cost did not have a model for HDD random IO at different queue depths.
It should be pointed out that individual query performance optimization is
not always the prime objective.
For a transaction processing system that also supports reporting queries,
the first priority is usually transaction latency.
In this case, the preferred strategy is to sacrifice reporting query performance
by not issuing IO at excessively high queue depth, keeping transaction latency low.
SSDs started to pick up momentum around 2010.
Early use was for extreme requirements.
Now SSDs are the more practical choice over HDDs for almost all
mainline line-of-business systems.
This comes at a time when the system may also happen
to have enough memory for almost the entire database.
With data on SSD, the access times for sequential and random might
reflect a lower ratio than the optimizer model
as opposed to a much higher ratio for HDDs in the FTDW RA.
Still, the CPU cost for loading 64KB or more into memory with a single IO is lower
than evicting pages and loading multiple individual pages.
The scan should have a lower cost per page.
There is a less appreciated aspect of SSD storage in being able to achieve
massive IOPS capability far in excess of what is really necessary.
In an HDD system, it was possible to support sufficiently high random IOPS
at less than excessive latency for transactions.
It was also possible to achieve high bandwidth for large block IO
in support of DW.
What was not entirely practical was to simultaneously support low latency IOPS
for transactions and high bandwidth for DW.
This is now possible with an SSD storage system
and this should be the configuration objective.
It would be helpful for SQL Server to implement the correct IO
queue depth strategy for this capability.
Key Lookup - Scan Cross-over
For pure transactions, queries have a highly selective search argument.
In such cases, the plan cost model is not hugely important,
regardless of wild differences between the plan cost model and various
implementations of real hardware.
The cost model is more important for queries involving an intermediate
number of rows,
in which the true cross-over point
from an index seek with key lookup to scan operation is of greater interest.
But even then, we may prefer one of the two for other reasons.
Parallel Cost Model Inconsistency
There is a curious inconsistency in the cost model for parallel execution plans.
The CPU component of operator cost is scaled down in proportion to the degree
of parallelism, except for the last doubling of the number of processors (cores).
At DOP 2, the CPU cost is half of that at DOP 1.
On a system with 20 logical processors and unrestricted parallelism,
the CPU cost is reduced by a factor of 10,
regardless of whether it is 10 physical cores with Hyper-Threading or
20 cores without HT.
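The scaling rule can be sketched as follows. This is a reconstruction from the observed behavior above, not documented internals; the effective divisor appears to be the DOP, capped at half the number of logical processors:

```python
def parallel_cpu_cost(cpu_cost_dop1, dop, logical_processors):
    # The divisor is the DOP, except the last doubling of processors
    # is not credited: cap at half the logical processor count.
    divisor = max(1, min(dop, logical_processors // 2))
    return cpu_cost_dop1 / divisor

print(parallel_cpu_cost(10.0, 2, 20))   # 5.0  -- half the DOP 1 cost
print(parallel_cpu_cost(10.0, 20, 20))  # 1.0  -- factor of 10, not 20
```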
The inconsistency occurs in IO cost.
There is one model for source access, that is
seeks and scans to table or indexes.
And there is a different model for intermediate operations,
examples being Hash Match and Sort operators.
Parallel Cost - Table and Index IO
The IO component in table and index access operations,
including index seek, scan, and key lookup,
does not change with parallelism.
The interpretation of this is that the IO system is saturated by a single thread.
Hence, a parallel plan does not reduce IO time in accesses to tables and indexes.
This is a very reasonable assumption for the original reference system having
2 or 4 HDDs.
It is not an accurate representation for scan operations on a system
configured to the FTDW RA, capable of multi-GB/s IO bandwidth.
The saturated IO model might be somewhat applicable for random IO
because a single processor core from about 2006 on could
drive 50K IOPS depending on the situation
and only a very large array of perhaps 200 15K HDDs could support
that volume at queue depth 1 per HDD.
On a modern system with properly configured SSD storage,
the expectation should be that both sequential and random
IO scales with parallelism, fully up to the number of physical cores.
An example of a properly configured system is the test system from
Parallel Execution in SQL Server 2016,
one Xeon E5 2640 v4 10-core processor with 4 Intel P750 PCI-E SSDs.
Parallel Cost - Hash and Sort Operations
Hash and Sort operations for a small data set only have a CPU component.
When the number of rows times the row-size exceeds a value,
there will be an IO component as well.
The set point is some fraction of memory, either
max server memory, or system memory if the former is unlimited.
This appears to be just over 5 MB per GB of memory.
This value is also per thread.
At higher DOP, each thread can accommodate up to the set point.
Beyond the set point, the IO cost grows somewhat faster than linearly,
at perhaps 25 for 10MB, and 300 for 100MB.
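The set-point behavior can be sketched as below. The ~5MB-per-GB constant is an approximation read from observation, not a documented value:

```python
def hash_sort_has_io_component(total_mb, memory_gb, dop=1):
    # Set point: roughly 5MB of intermediate data per GB of memory, per thread.
    set_point_mb = 5 * memory_gb
    per_thread_mb = total_mb / dop
    return per_thread_mb > set_point_mb   # True => plan includes an IO component

# With 4GB of memory (set point ~20MB per thread), a 100MB hash or sort:
print(hash_sort_has_io_component(100, 4, dop=1))  # True: 100MB per thread
print(hash_sort_has_io_component(100, 4, dop=8))  # False: 12.5MB per thread
```

This also illustrates why higher DOP can remove the IO component entirely: the same data divided over more threads may fit under the per-thread set point.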
The IO cost in hash and sort operations does scale with parallelism,
unlike the IO cost in index seek and scan operations.
This might seem to be a matter for the min memory per query configuration setting
and the Resource Governor max memory setting at actual runtime,
but it might be that the plan cost model only factors
in the max server memory setting.
Plan Cost Scaling - Parallelism
Below is the ratio of plan cost at DOP 1 to higher DOP between 2 and 10
for the TPC-H SF 100 database on the Xeon E5-2630 v4 system with 64GB,
but SQL Server limited to 32GB.
In essence, this is the predicted speedup from parallelism
by the plan cost model.
In several queries, there is very little reduction in plan cost at higher DOP.
This occurs when the bulk of the cost is in the table and index access
operations, for which parallelism reduces only the CPU cost, not the IO cost.
In a few queries, examples being 8, 16 and 18, there is significant scaling in plan cost.
This occurs primarily when much of the cost is in the hash match or sort operations,
in which case, both CPU and IO are reduced with parallelism.
In any case, actual scaling is always better than plan scaling,
as shown below. See
Parallel Execution in SQL Server 2016
for a more complete discussion.
The SQL Server plan cost model has very little predictive value
on the matter of scaling.
The plan cost model only predicts scaling when a significant portion
of the cost is in the IO component of hash and sort operations.
When the cost is largely in the IO of table and index scans and range seeks,
the plan predicts weak scaling.
In fact, many operations seem to scale well
when the IO subsystem is not saturated from a single thread.
True Cost versus Plan Cost
As mentioned earlier, the plan cost model was based on a reference
system that was probably a Pentium processor from the mid-1990's.
In that era, each successive processor generation increased performance rapidly.
It would not make sense to recalibrate the cost model
to newer generation processors.
And furthermore, it does not really matter what the processor is.
Plan cost tends to be dominated by the IO component of table and index
accesses, based on 320 IOPS random and 10.8MB/s sequential,
or from the IO cost of hash and sort operations.
Still, one might be curious as to how plan cost correlates to actual cost.
The sum of the plan cost for the 22 TPC-H SF100 queries at DOP 1 is 200,839.65.
Note that many of the execution plans have large hash operations for which the IO cost
is dependent on system memory.
The sum of the execution times for the same 22 queries also at DOP 1 is 3,515.15sec,
although it would be less if some queries were manually tuned,
resulting in higher plan cost, but lower runtime.
The ratio of the sum of the plan costs to actual runtime is 57.135.
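The arithmetic behind that ratio:

```python
plan_cost_sum = 200_839.65   # sum of plan costs, 22 TPC-H SF100 queries at DOP 1
runtime_sum = 3_515.15       # sum of execution times in seconds, same queries

print(round(plan_cost_sum / runtime_sum, 3))   # 57.135
```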
Curiously, this is the approximate order of magnitude difference
between a Pentium and modern processor core.
There was no particular expectation of this because the plan cost model represents
mostly IO wait while the test system has sufficiently powerful IO
that the ratio of worker time to elapsed time was 98%.
The figure below shows actual query runtime divided by plan cost
scaled by 57.1, for the 22 TPC-H queries.
The range in variation between actual execution time
relative to plan cost scaled by 57 is between 0.32 and 2.53.
This is actually not a bad range, because the query optimizer plan cost
does not attempt to model the true cost of a query,
only the methods of table and index access.
Two of the outliers on either end are Q2 at 0.32 and Q13 at 2.53.
The critical part of the plan for Q2 is shown below.
A large portion of the plan cost is for the loop join to PARTSUPP.
As pointed out earlier and elsewhere, the plan cost for random IO
is fairly high.
Most of the operations in the TPC-H queries are scans or range seeks.
The bulk of weight in the measured 57X scaling relative to plan cost
is driven by the plan cost of sequential IO.
Also, disregarding the plan cost model, the SQL Server engine actually
issues IO at high queue depth when the estimated rows is more than 25.
The expectation is that a storage system with 4 NVMe PCI-E SSDs
blows through the loop join faster than the 57X average scaling.
The other extreme is Q13, for which the core plan component is below.
There is nothing unusual in the plan.
However, the query search argument includes the following:
AND CUSTOMER not like
This expression is expected to be expensive to evaluate, and is not
modeled in the plan cost, being treated as just any other predicate.
With compression, the plan cost for the 22 TPC-H SF100 queries at DOP 1
decreases by 32% from 200,839 to 136,792,
reflecting lower IO cost from compressed tables.
The actual execution time increases by 21% from 3,515 to 4,275sec,
reflecting the overhead of compression significantly outweighing
the reduced IO.
Cost Threshold for Parallelism
The original scale of plan cost was that 1 represented a simple estimate
of execution time in seconds.
This has long lost its original meaning.
For all practical purposes,
from a few years after its arrival,
we treated cost as some arbitrary unit of measure.
Except for the fact that plan cost was used in the cost threshold for parallelism.
In the original model,
the intent was that the query optimizer does not consider parallelism
until the estimated execution time of the non-parallel plan exceeds 5sec,
which was once a very reasonable value.
Today, plan cost 5 might be a query that executes in less than 0.1 sec.
This is most definitely well below the point at which parallelism should be considered.
The overhead of coordinating multiple threads
is heavy in relation to the actual work itself.
But we should also not wait until plan cost 285, corresponding to 57 × 5,
for the original 5 sec duration.
Part of this is because people today are far less patient than
in the previous era.
The other part is that today we have so many processor cores at our
disposal that we can now be more liberal in their use than before.
Having discussed both the cost model and, to some degree, actual scaling,
it is now the time to lay the ground work for a new model of plan
cost and parallelism.
A parallel plan involves splitting the work to be done among multiple threads.
This involves additional work not in a non-parallel plan,
comprising one or more of: distribute,
repartition or gather streams.
The effectiveness of a parallel plan depends both
on the ability to divide the work with some degree of
uniformity, and on the extra work necessary to
coordinate multiple threads not being too large.
There is also some effort necessary to start a thread
or to acquire threads from a pool.
The Bitmap Operator
There is also another factor in parallelism.
As discussed elsewhere, there are a number of queries in which the parallel
execution plan runs with lower CPU time than a non-parallel plan, which should
not be possible.
It turns out that the parallel plan has a bitmap operator that is not in
the non-parallel plan.
The following is from Microsoft TechNet on
Bitmap Showplan Operator:
SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see
Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
As it turns out, the bitmap operator might have the effect of improving plan efficiency
by up to 30% depending on the circumstances.
Microsoft could allow the bitmap operator to occur in plans above a certain cost
threshold even when the plan is not parallel, for whatever reason.
That is, a separate cost threshold for the bitmap operator,
distinct from the cost threshold for parallelism.
I would not hold my breath on this.
Processor and System Architecture
The foundation of the parallelism strategy
must encompass the nature of the modern processor and system architecture.
The largest Intel Broadwell EP/EX of 2016 has 24 cores,
the full set of which is available in the Xeon E7-v4,
but only 22 in the E5-v4.
While we may have been conditioned from past history into viewing
a server system as having multiple processor sockets,
the fact is that a system having a single processor socket
today can be incredibly powerful.
There are significant benefits for a single socket system.
One of which is that all memory is local.
Another substantial benefit is that the effort to
coordinate between multiple threads is low.
This leads to excellent scaling characteristics even
when there appears to be significant inter-thread coordination.
For those who cannot let go of the multi-socket mind-lock,
below are representations of a 2-socket E5
and a 4-socket Xeon E7.
The inescapable fact is that the cost of coordinating between threads running
on cores in different sockets is very substantial.
And so, there is a very substantial degradation in scaling
when threads are on different sockets.
For this reason, the parallelism strategy must respect
the nature of the system architecture.
Low to moderate parallelism
must use cores on a common socket.
There should probably be one strategy in employing parallelism
up to the number of cores in one socket
and a separate strategy for high-parallelism spanning
cores on multiple sockets.
A Modern Parallelism Strategy
With this knowledge,
the parallelism strategy needs to blend
the method for partitioning of work
with the nature of the modern server processor and system.
There are very many cores, i.e., compute resources that can be allocated
to query plans, in either a single or multi-socket system.
For this, we need something more than the very basic, legacy configuration controls:
1) Cost Threshold for Parallelism and 2) Max Degree of Parallelism.
To a Connect request on this matter,
Microsoft replied that the newer controls in Resource Governor were sufficient.
In essence, the SQL Server team is abrogating its responsibility
to provide an intelligent mechanism to properly utilize the capabilities
of a modern system.
The first step might be to rescale the plan cost so that the cost
threshold for parallelism is in more understandable terms.
This is not absolutely essential, and
we could just state that plan cost 60 might correspond roughly to 1 second.
It did not make sense to rescale plan cost in the days when
processor core performance changed substantially with every generation.
But performance at the core level over the last several years
has been more stable, improving at a slow or moderate pace,
and may continue to do so barring a radical event.
We might take this opportunity to reset the plan cost.
Second, the cost model needs to be more comprehensive.
The cost of touching each row in a scan or other sequential access
is larger than the current model represents
and should be corrected.
For this, the impact of lock-level might be considered.
The cost of touching each column is both non-trivial
and not accounted for at all in the current model.
The cost of ancillary logic also needs to be modeled.
And of course, the cost of access to compressed pages should be accounted for.
The above items may not impact the structure of the execution plan,
which is why it was not considered back in version 7 or 2008 for compression.
But it is relevant in the cost threshold for parallelism assessment,
and also impacts scaling projections.
Third, the IO cost model needs to be fixed.
This might involve modeling the actual IO subsystem performance characteristics.
Or it might just assume that IO performance scales with parallelism.
Also important is that parallelism scaling does not stop at
half the number of cores, instead going up to the full number of physical cores.
It would help if the optimizer knew the difference between
physical and logical cores.
The IO cost in hash and sort operations already scales with parallelism,
so fixing the IO cost for source accesses would make the model complete.
As a rough guess, (current) plan cost 30, corresponding to roughly one-half
of one second execution time might be a good default cost threshold for parallelism,
settable of course.
For a query with plan cost 30 expected to execute in 0.5 CPU-sec (worker time),
a DOP of 2 and not higher might be a good idea.
One suggestion for the DOP strategy might be nonlinear.
Employ DOP 2 at the base value of the cost threshold.
Then double the DOP for every 4× over of the base cost.
If base is 30, then plan cost 120 (4×30) is the threshold for DOP 4
and 480 (16×30) is threshold for DOP 8.
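This escalation rule can be sketched as follows. The base threshold of 30 and the double-per-4× rule are the article's suggestion, not SQL Server behavior:

```python
def suggested_dop(plan_cost, base_threshold=30, max_dop=8):
    # Below the cost threshold: stay serial.
    if plan_cost < base_threshold:
        return 1
    # DOP 2 at the base threshold, doubling for every 4x increase in cost:
    # cost 30 -> DOP 2, 120 -> DOP 4, 480 -> DOP 8, ...
    dop, threshold = 2, base_threshold
    while plan_cost >= threshold * 4 and dop < max_dop:
        dop *= 2
        threshold *= 4
    return dop

print(suggested_dop(29))    # 1
print(suggested_dop(30))    # 2
print(suggested_dop(120))   # 4
print(suggested_dop(480))   # 8
```

Here max_dop would be set to the number of physical cores in one socket, per the strategy above, with a separate decision for spanning sockets.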
This could be the strategy for DOP up to the number of cores in one socket
with one threshold for the base value and another for escalating the DOP.
The expectation is that scaling is not so simple or excellent
for parallelism on cores in different sockets.
It would be good if the plan cost could model this.
In any case, there might be an even higher threshold for employing
cores over multiple sockets.
We should not normally set DOP to be somewhat over the number of cores in one socket.
It should be up to the number of cores in one socket
for most circumstances,
or employ most of the cores of multiple sockets for exceptional circumstances.
Furthermore, if the desired number of cores in one socket are not available,
then reduce the DOP rather than employ cores on different sockets.
Both the cost model and the simple controls for parallelism
have become seriously obsolete for modern high core count systems.
An improved cost model, encompassing cost structure
that was not necessary for optimizing single-thread table access,
is now needed for the parallelism strategy.
The cost model needs to correctly model parallelism
on modern hardware, which means fixing the IO cost model
that currently assumes a single thread saturates the IO.
A proper parallelism strategy also needs more
complexity than the current on/off decision based on the cost threshold
for parallelism, and the all-or-nothing
control of a single max degree of parallelism setting.
A moderately more complex model has been suggested
but alternatives are possible.