Three TPC-H benchmark results were published in April of this year at SQL Server 2014 launch,
where the new updateable columnstore feature was used.
SQL Server 2012 had non-updateable columnstore that required the base table to exist in rowstore form.
This was not used in the one published TPC-H benchmark result on SQL Server 2012,
which includes two refresh stored procedures, one inserting rows, the second deleting rows.
It is possible that the TPC-H rules do not allow a view to union two tables?
and perhaps a delete via the partitioning feature?
(meaning the delete range must match the partition boundaries).
Another possibility is that SQL Server 2012 columnstore was considered to be a multi-column index
which is also prohibited to reflect the principle of being ad-hoc queries.
SQL Server Columnstore
First a few quick words on SQL Server columnstore.
Columnstore is not actually an index of the b-tree index form.
The MSDN Columnstore Indexes Described
states that columnstore index is "a technology for storing, retrieving and managing data by using a columnar data format,
called a columnstore."
In SQL Server 2012, it is called a nonclustered columnstore index not because it is nonclustered or an index,
but because the base table must exist in traditional rowstore form.
In SQL Server 2014, there is a clustered columnstore index not because data is stored in order of the index key,
as there is no key, but rather that there is no rowstore table, just the columnstore.
TPC-H details Date Columns and Query SARGs
The full details of the TPC-H Decision Support benchmark are described on the
There are a few details of relevance to the use columnstore.
The largest table in TPC-H is LINEITEM, which has 3 date columns, ShipDate, ReceiptDate and CommitDate,
and is clustered on ShipDate.
The second largest table is ORDERS, clustered on the one date column OrderDate.
These two tables are joined on OrderKey.
There is correlation between values in the date columns in these two tables,
some natural, and others based on reasonable business conditions.
ShipDate must be greater than OrderDate obviously, and is also no more than 121 days greater
than OrderDate per benchmark specification built into the data generator.
CommitDate is between -89 and 91 days of ShipDate.
ReceiptDate is between 1 to 30 days after ShipDate.
The date values ranges from Jan 1992 to Dec 1998.
There are 22 Select queries in the TPC-H benchmark, along with the 2 refresh stored procedures.
Many of the Select queries specify a date range on one of the date columns
or a lower bound one date column and an upper bound on a different column.
Ideally, for queries that target rows in a limited date range,
we would like to have upper and lower bounds on for the cluster keys
on both the ORDERS and LINEITEM tables, OrderDate and ShipDate.
However the TPC-H rules do not permit re-writing the query SARGs
based on inferable knowledge.
That said, apparently the rules do not preclude the query optimizer
from discovering such knowledge.
One of the other RDBMSs was probably first to do this,
and Microsoft followed suit in order to be competitive in the TPC-H benchmark
with the Date Correlation Optimization
feature in 2005.
Personally, I am not aware of any production server using this feature.
Realistically, any organization that was having query performance issues
related to date range bounds would probably have directed the
analyst to apply appropriate date bounds on the cluster key.
This is most probably a benchmark specific optimization feature.
The date correlation optimization statistics do not exist when using clustered columnstore,
because there is no underlying rowstore table with foreign key relations?
The date correlation statistics do exist when using rowstore tables with foreign keys
and are used by nonclustered columnstore indexes?
TPC-H on SQL Server 2014 with Columnstore
That said, let us now look at the 3 new SQL Server 2014 TPC-H results published
making use of the new clustered columnstore indexes.
One is from IBM at Scale Factor 1000 (1TB)
and two from HP at 3TB and 10TB respectively.
The new results are compared to prior results with traditional rowstore
on previous versions of SQL Server and previous generation server systems.
Because Columnstore is not really an index in the b-tree sense,
given that queries frequently involve date ranges,
it is presumed to be important to use partitioning with Columnstore.
The three new TPC-H reports on SQL Server 2014 partition both ORDERS and LINEITEM
by OrderDate and ShipDate respectively (the cluster key in previous versions)
with a partition interval of 1 week (7 years x 52 weeks per year = 364 partitions).
Perhaps of interest, the scripts show that a rowstore partitioned index is first build
before building the partitioned clustered columnstore index.
TPC-H at SF 1000 (1TB)
The new TPC-H 1TB result on SQL Server 2014 using columnstore
is compared with 3 previous results on SQL Server 2008 R2.
There is a difference in memory configurations between the four systems below.
For SQL Server 2014 with clustered columnstore indexes,
the TPC-H SF1000 total database size is just under 430GB, so with 1TB memory,
the benchmark is running entirely in memory after the initial data load,
with the exception of hash operation spills to tempdb.
For rowstore, the TPC-H SF1000 total database size is 1420GB,
so the two systems with 2TB memory are mostly running with data in memory,
again except for the initial load and spills to tempdb.
There is definitely disk IO for data in the Cisco system at 1TB physical memory.
The performance impact is noticeable, but probably not as severe as
one might think based on how people talk of memory.
The reason is that all of these systems make correct use of massively parallel
IO channels to storage capable of 10GB/s plus table scans,
and many of these also use SSD storage capable of more random IOPS than
SQL Server can consume even at very high parallelism.
The new SQL Server 2014 result is 2.36X higher on composite score (QphH)
and 3X higher in the Power test than previous versions with conventional rowstore.
The 22 individual query run times from the Power test at 1TB are shown below.
The SQL Server 2014 result is the left item in the legend label 4x60 (sockets/cores) and for the succeeding charts as well.
Query 1, a single table aggregation, is more than 10 times faster on SQL Server 2014 using
columnstore on 60 cores (Ivy-Bridge, 2.8GHz)
than 2008R2 using rowstore, on 80 cores(Westmere-EX).
Per TPC-H benchmark procedure, the test is run immediately after data load and index creation?
The second largest speed-up is Query 16, joining 3 tables, at 6.4X.
Query 10 is 40% slower with columnstore, and Query 4 about the same between columnstore and
This query is listed near the end of this section on TPC-H.
Notice that in the 3 SQL Server 2008R2 results, Query 2 becomes slower
as the degree of parallelism increases from 40 cores (threads unspecified) to 80 cores/80 threads
and then to 80 cores/160 threads.
Elsewhere I had commented that SQL Server really needs a graduated approach to parallelism
instead of the all or nothing approach.
TPC-H at SF 3000 (3TB)
The new TPC-H 3TB result on SQL Server 2014 is compared with a previous result on SQL Server 2008R2.
Here, the difference in memory is a significant contributor.
The SQL Server 2014 system has more memory than the columnstore database,
while 2008R2 systems has much less memory than the 3TB rowstore database (4.5TB).
I am supposing that the reason the HP 2010 report only configured 512GB
(128 x 4GB priced $29,440 in 2010)
was that there would not be a significant performance improvement for the
TPC-H 3TB result at either 1TB or even 2TB memory in relation to the higher price
(128 x 16GB priced $115K in 2011).
Several of the TPC-H queries involve nearly full table scans of the large tables.
If there is not sufficient memory for the entire database,
then the next objective is to have sufficient memory for reducing the spill to disk in hash operations?
HP may have elected for the better price-performance?
Or perhaps someone just wanted to make a point.
The point being that it is important for the SQL Server engine to function correctly
when heavy IO is required.
In the SQL Server 2014 result, the system has 3TB memory (96 x 32GB priced $96K in 2014)
which is sufficient to hold the entire data set for TPC-H 3TB in columnstore.
The overall composite score is 2.8X higher with columnstore and 3.4X higher on the Power test.
The 22 individual query run times from the Power test at 3TB are shown below.
The largest gain with column-store is Query 19 at 19.7X.
Query 4 and 10 show degradation,
similar to the case at SF1000.
TPC-H at SF 10000 (10TB)
The new TPC-H 10TB result on SQL Server 2014 is compared with a previous result on SQL Server 2012.
Strangely, supporting documentation for the HP 2013 report on SQL Server 2012
is missing so there is no indication as to whether nonclustered columnstore is
I am guessing that columnstore was not used because the results are
in line with expectations on rowstore.
The full data size in columnstore at SF 10000 should be 5TB,
and 14TB in rowstore,
so there should have been heavy disk IO in both results.
The overall composite score is 2.55X higher with column-store and 3.1X higher on the Power test.
The 22 individual query run times from the Power test at 10TB are shown below.
The largest gain with column-store is Query 6 at 23.2X.
Query 4 and 10 show degradation
as in the two previous cases.
TPC-H Query 10
Below is Query 10. This query is consistently slower in columnstore relative
/* TPC_H Query 10 - Returned Item Reporting */
SELECT TOP 20 C_CUSTKEY,
AS REVENUE, C_ACCTBAL,
N_NAME, C_ADDRESS, C_PHONE, C_COMMENT
FROM CUSTOMER, ORDERS,
WHERE C_CUSTKEY = O_CUSTKEY AND
L_ORDERKEY = O_ORDERKEY AND
O_ORDERDATE>= '1993-10-01' AND
L_RETURNFLAG = 'R'
AND C_NATIONKEY = N_NATIONKEY
GROUP BY C_CUSTKEY,
ORDER BY REVENUE DESC
The execution plan for rowstore at SF 10 is shown below.
(Click for full-size)
The execution plan for columnstore at SF 10 is shown below.
(Click for full-size)
Below are the details on ORDERS and LINEITEM from the rowstore plan.
Notice that there are seek predicates on LINEITEM
for 1993-09-20 to 1994-06-17.
The actual range should be 1993-10-02 to 1994-05-02, for
1 day after the OrderDate lower bound and 121 days after the
OrderDate upper bound.
Below are the details on ORDERS and LINEITEM from the columnstore plan.
In columnstore, every operation is a scan.
There is a predicate for the ORDERS table
but not on the LINEITEM table.
Presumably storage engine must scan entire set of LINEITEM partitions
while only scanning the ORDERS partitions encompassing the SARG date range
I am thinking the reason is that with date correlation in conventional row-storage,
the SQL Server query optimizer knows that the data range in LINEITEM ShipDate is also restricted
by the lower OrderDate and the upper OrderDate plus 121 days,
corresponding to 1 day after the lower bound on OrderDate
to 121 days after the upper bound on OrderDate.
TPC-H Query 4
TPC-H Query 4 below is slower than row storage in the 3 and 10TB results. I am thinking that the reason is the same?
/* TPC_H Query 4 - Order Priority Checking */
COUNT(*) AS ORDER_COUNT
WHERE O_ORDERDATE >= '1993-07-01'
AND O_ORDERDATE <
WHERE L_ORDERKEY = O_ORDERKEY
AND L_COMMITDATE < L_RECEIPTDATE)
GROUP BY O_ORDERPRIORITY
TPCH Query Plans
for the TPC-H reference queries and execution plans
at SF1 on SQL Server 2005.
The parent page TPCH Interim
has links for the SF1000 query plans with and without parallelism.
TPC-H Columnstore Summary
As with every other new feature, Columnstore is a really interesting new technology.
But think hard about what is really happening,
experiment, and remember to get good execution statistics and plans
prior to making changes,
then get the new execution statistics and plans after the change.
One reason I like to look at official TPC-H benchmark results over "real-world"
is that the benchmark system is properly configured for both before and after results.
There is a significant difference in the data size involved for each query between
rowstore and columnstore.
If the reference system has a poor storage system
(and how often have we seen this? this is guaranteed when the SAN vendor assisted in configuration),
then it is possible to produce almost any performance ratio.
The charts below show the progression of performance over time for the selected TPC-E results
spanning Core 2, Nehalem, Sandy Bridge and Ivy processors at 2, 4 and 8 sockets.
For the 2-socket systems, West-1 is from the first set of TPC-E results reported for Westmere X5680 with HDD storage
and West-2 is the later X5690 report with SSD storage. Both are 6-core Westmere-EP processors.
The West-3 is the E7-2870 10-core (Westmere-EX) on SSD storage.
For the 4-socket systems, West-1 is on HDD storage, and West-2 on SSD, both 2K8R2 and 1TB memory.
The West-3 is on Win/SQL 2012, 2TB memory and SSD storage.
The same data is shown below with reverse organization
showing scaling with sockets for each of the processor architectures.
2-socket 4-core, 2.93GHz (11.72 core x GHz),
4 & 8-socket 8-core 2.26GHz (18.08 core x GHz)
2-socket 6-core, 3.46GHz (20.76 core x GHz/socket),
4 & 8-socket is 10-core 2.4GHz (24 core x GHz)
2-socket 8-core, 2.9GHz (23.2 core x GHz/socket),
4-socket is 8-core 2.7GHz (21.6 core x GHz)
2-socket 12-core, 2.7GHz (32.4 core x GHz/socket),
4 & 8-socket is 15-core 2.8GHz (42 core x GHz)
To interpret performance counters and execution statistics correctly, it is necessary to know state of Hyper-Threading (on or off). In principle, at low overall CPU utilization, for non-parallel execution plans, it should not matter whether HT is enabled or not. Of course, DBA life is never that simple (see my other blogs on HT). The state of HT does matter at high overall utilization and in parallel execution plans depending on the DOP. SQL Server does seem to try to allocate threads on distinct physical cores at intermediate DOP (DOP less than or equal to the number of physical cores).
Suppose for example, that maximum throughput on 10 physical cores is 10000 call/s with HT off and 14000 with HT on (overall CPU near 100%). Then the average CPU (worker time) per call is 1ms with HT off, and 1.43 ms with HT on as there are twice as much available worker time with HT on.
In a very well tuned OLTP system, we might have very steady average CPU per call as call volume increases from low overall CPU utilization to near saturation with HT off.
With HT on, at low overall CPU, the average CPU per call is the same as with HT off, but average CPU per call increases at some point when there is sharing of physical cores between concurrently running queries. But this is still good because system wide throughput capability has increased.
In a not well tuned database application, there could be contention between concurrent queries that causes average CPU per call to increase as overall system CPU load increases.
Without knowing the state of HT, it is hard to make the assessment as to which situation has occurred.
If we have direct sysadmin access to the OS, we could make calls (via WMI) to determine the processor model number, the total number of sockets, the total number of logical processors, then determine with a lookup table matching processor model to the number of physical cores.
(again, a pain)
(per LondonDBA, WMI Win32_Processor does report the number of sockets, physical cores, and logical, I must have been thinking of an older API that was not HT aware or even multi-core aware)
But we do not always have sysadmin access to the host OS, as many organization believe separation of DBA, infrastructure (not to mention storage) is a good thing, and even better when these groups do not communicate with each other, (let alone) working together with a common mission.
SQL Server version 2005 was helpful in the DMV sys.dm_os_sys_info
which had two columns: cpu_count and hyperthread_ratio
defined in version 2005 as:
1) "Number of logical CPUs on the system."
2) "Ratio of the number of logical and physical processors."
In version 2008, RTM and R2, the definition of hyperthread_ratio was changed to:
"Ratio of the number of logical or physical cores that are exposed by one physical processor package."
in 2012 and 2014, slightly different wording but same meaning:
"Specifies the ratio of the number of logical or physical cores that are exposed by one physical processor package."
why the change in definition?
There are actually 3 pieces of information we are interested in:
a) the number of sockets
b) the number of physical cores per socket
c) the state of HT (or the logical processors per socket)
In 2005, we have information to determine the product of A and B, and the value of C,
but not atomic values of A and B,
In the 2008 and later version, we can determine A (using the ratio) and the composite product of A x B x C, but not atomic values of B and C.
Ok, then I noticed that in SQL Server version 2012 and 14, there would be a line in the log of the form:
"SQL Server detected 4 sockets with 10 cores per socket and 10 logical processors per socket, 40 total logical processors; using 40 logical processors based on SQL Server licensing."
So from this, I have A, B and C, even though I must parse the error log for this info.
In version 2008R, the info is only: "Detected 40 CPUs."
which has no additional information to what is in the DMV.
It would be helpful if this information were available directly from the DMV, but then life might be easier?
EXEC sys.xp_readerrorlog 0, 1, "detected", "socket"
Three years ago, I conducted an extensive investigation on a SQL Server system
running kCura's Relativity document e-discovery application.
It was fascinating to see the broad range of problematic queries all from one application.
This provided good material for my presentation Modern Performance which
focuses on the more spectacular problems that can occur with a cost based query optimizer.
(I am not sure why the images are not showing up. See the alternate link Top Clause
It should be pointed that unlike a typical application in which the key queries
can be rigorously tuned, Relativity must generate the SQL from options in the UI.
The objective of Relativity is to support complicated searches,
with heavy emphasis on queries of the form: IN set A but NOT IN set B.
This causes problems in row estimation because there is no generally valid logic
to assess whether there is overlap between Set A and Set B even if row estimates
on the individual sets are possible.
If this were not difficult enough, there can also be nested
SQL Server has difficulty generating good execution plans even for a single AND/OR
It is unfortunate that the most direct conversion of natural (user-oriented) logic
to a SQL expression (that Relativity employs) just happens to be the form
for which the SQL Server query optimizer has difficulties in producing a good execution plans.
This article is based on a brief observation of Relativity version 8.1.
The first Relativity article was
originally based on version 7.3-7.5, and later updated with observations of
The architecture of Relativity first employs a SELECT COUNT query to determine the number
of rows that meet a specific search query.
A second call is then made in the form of a SELECT TOP 1000 query to retrieve just the document identity
column (ArtifactID) values.
There should be a third query that retrieves all the desired columns
for specific ArtifactID values, but this query is very low cost and does not warrant discussion here.
If more than 1000 rows are needed, then the next query will be TOP 6000.
One observation is that even when the COUNT query indicates that there are fewer than 1000 rows,
the ArtifactID query is still issued with the TOP 1000 clause.
In earlier versions of Relativity (7.x), the TOP 1000 query was issued even when the COUNT query
indicated zero rows.
Perhaps an oversight by the developers
in not realizing zero rows are returned when COUNT is zero.
One aspect of Relativity architecture that is correct is the not enabling plan reuse.
The expectation is that individual searches probably have high CPU for execution than for compile
and that each parameter set could have skewed distribution.
However, in issuing explicit SQL, there is also no opportunity to fix problematic SQL.
Hence it is important the Relativity anticipates as many problems as possible
and employ good strategies for know issues.
We could wish for such, and we would still be wishing.
Problematic SQL in Relativity 8.1
There are several difficulties in Relativity queries, three of which are covered here.
One is due to the SELECT COUNT followed by TOP 1000 architecture.
The second is due to the use of the direct translation of natural (user) logic
to SQL rather than a form of SQL for which the SQL Server query optimizer happens
to produce very good execution plans.
The third is in parallelism strategy.
The data in Relativity is expected to have heavily skewed distribution
so is necessary to watch for situations where this produces ineffective parallel execution plans.
Relativity Query Example
Below is the COUNT form of an example search query to be studied in greater detail.
LEFT JOIN eddsdbo.[Custodian]
AS [o1000022_f2154932] (NOLOCK)
WHERE [Document].[AccessControlListID_D] IN
(22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...)
LEFT JOIN eddsdbo.[Custodian]
AND [Custodian].[ArtifactID] IS NOT NULL
(22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...))
The actual number of values in the AccessControlListID_D IN clause is 113.
The actual number of values in each of the Custodian.ArtifactID IN clauses is somewhat over 100,
with both sets being identical.
Below is subsequent TOP 1000 query that is issued after the COUNT query.
In the 7.x versions, the TOP 1000 clause is present regardless of the number of rows
indicated by the COUNT query, and is in fact issued even if COUNT reports zero rows.
It is also very possible that either one of the COUNT and TOP queries or both are very expensive
regardless of the actual number of rows.
SELECT TOP 1000
eddsdbo.[Custodian] AS [o1000022_f2154932]
WHERE [Document].[AccessControlListID_D] IN
(22552945,22552935, 22552925, 22552915, 22552905, 14768277,..)
IS NOT NULL
(22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...))
ORDER BY [Document].[ArtifactID]
For the above search with the actual data set, there are in fact 232 rows.
Three alternatives queries are examined here.
One is the above query without the TOP clause.
The second alternative is the above query (including the TOP clause) but with an index hint applied.
The third alternative tested is the form below, with the OR clause replaced by a UNION.
FROM eddsdbo.[Document] WITH(NOLOCK)
LEFT JOIN eddsdbo.[Custodian] AS [o1000022_f2154932] (NOLOCK)
ON [o1000022_f2154932].[ArtifactID] = [Document].[Custodian]
WHERE [Document].[AccessControlListID_D] IN (1,1000062,1000063)
(22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...)
eddsdbo.[Custodian] AS [o1000022_f2154932]
LEFT JOIN eddsdbo.[Custodian]
IS NOT NULL
AND ([Custodian].[ArtifactID] IN
(22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...))
For this particular case, the UNION will produce exactly the same set of rows
as the original query because only the Document table ArtifactID column is in the select list
and this column is the primary key of the Document table.
In the more general case with columns from more than one table in the SELECT list,
the UNION form could have a different row set than the original OR form.
The architecture of Relativity is such that the OR forms can be converted to UNION while maintaining correct results.
Earlier, it was mentioned that the SQL Server query optimizer can have problems in producing
a good execution plan for queries with a combination of AND and OR clauses.
The form shown above is set A OR set B, with A having an AND clause.
The other form of this is set A AND set B, with either A or B having an OR clause.
The second form could probably be handled by converting the AND to an INNER JOIN,
which would require both conditions to be true, but a full study has not been done on this.
Relativity Query Execution Plans
Now consider the estimated execution plans for the 5 queries:
1) COUNT, 2) TOP, 3) no TOP, 4) index hint, and 5) UNION.
The SELECT COUNT estimated execution plan is shown below
SELECT TOP 1000 estimated execution plan
Note top right operation in the above execution plan for the TOP 1000 query.
The operation is an Index Scan on index EV_ArtifactID.
The index lead key is ArtifactID, and has all the columns required for this query.
Below left is the Index Scan detail.
Notice that the Estimated I/O cost is 85,
corresponding to approximately 892MB (I/O cost 1 = 1350 pages or 10.5MB).
Below right is the COUNT query Nested Loops operation at the top left of the plan.
Below are the left most operator showing total plan cost for the COUNT and TOP 1000 queries respectively.
The TOP 1000 plan cost is 0.31269 even though the Index Scan on Document table index EV_ArtifactID
has an I/O of 85.38.
This is because the execution plan is expecting the first 1111 rows from Documents
to be sufficient to meet the total required 1000 rows.
The plan cost for the COUNT query is much higher at 519
as all rows must be evaluated to test the match to the search arguments.
The Document Index Scan in the COUNT query uses the IX_AccessControlListID_D index
because this index is narrow with cost of 35.46 (272MB)
Below is the estimated plan for the SELECT query without the TOP clause
The estimated row count is 11,569,800, exactly the same as in the COUNT query
at the Nested Loops operation just before the rows are aggregated into a COUNT value.
There are 12.8M rows in the Document table,
so the query optimizer believes that about 90% of row meet the search argument (with OR clause).
The execution plan is the same as the COUNT query except that the
return list requires a sort operation
contributing to the higher overall plan cost of 720.
Below is the estimated plan for the TOP query but with an index hint applied
The use of the index hint results in the query optimizer not changing the join order.
Hence when using hints, it is also necessary to write the query in the form with the desired join order.
The hinted index lead key does not match in the ORDER BY clause,
so the entire index is scanned and then sorted.
The plan cost 197 is more than the original TOP query, but less than the no TOP query.
The reason is that the query optimizer believes it can exit the execution plan once the TOP 1000 rows have been found.
Below is the estimated execution plan for the UNION query.
The Estimated Number of Rows is 3.2M and the plan cost is 92.8.
Curiously, the number of rows from each of the two sub-queries is much less at 23,764 and
One would think that the UNION of the two result sets should at least equal to the larger
and could be as high as the sum.
Relativity Query Execution Times
Below are the execution time in milli-seconds for both CPU and elapsed times of each of the 5 queries.
The first is the COUNT query, followed by the Relativity standard TOP 1000 query.
Next are the 3 variations considered: without the TOP clause, with an Index Hint
and finally a UNION structure instead of OR in the WHERE clause.
|Query||CPU time||elapsed time|
The standard Relativity TOP query does not have a parallel execution plan
as the plan cost (0.31269) is below the cost threshold for parallelism.
Even though some of the operations in the plan show a cost greater than 0.3,
the Top clause believes that the query can terminate after few rows
from Documents (estimate 1111) have been examined.
In this example, the WHERE clause arguments are highly selective,
and fewer than 1000 rows meet the conditions.
So what happens is that the full set of 12.8M rows from the right most loop join outer source
must be processed, as the Top clause "exit" criteria is never reached.
Furthermore, the execution in not parallel, because it is believed to be a low cost plan.
Note the fat arrows in the
SELECT TOP 1000 actual execution plan
(full plan) below.
There is not a significant difference in CPU times
between the COUNT, regular TOP, no TOP, and TOP + index hint query plans.
The differences are mostly in the elapsed time.
The regular TOP query has the longest elapsed time being a non-parallel plan as explained above.
The COUNT and no TOP queries have elapsed times approximately one-half of the CPU time,
corresponding to a 2X speed-up with parallelism even though the actual degree of parallelism
was 8 (on separate physical cores).
Parallel Execution Plans
There are two critical factors/challenges in parallel execution plans.
One is to divide the work evenly among multiple threads (each running on different cores).
The second is to have sufficiently large granularity of work on each thread before
some form of synchronization is required.
My recollection was that SQL Server 2000 employed a method that ensured even distribution
by having each thread process one page at a time(?) This was fine in the Pentium II/III 500MHz generation,
but was far too small by the NetBurst and Core2 architecture processors.
SQL Server 2005 employed a different methods with the strategy of allowing larger granularity
and reduced need for synchronization between threads(?) but could also result
in having highly uneven distribution of work between threads.
Some of this was reported to have been fixed in SQL Server 2008 sp1,
see Using Star Join and Few-Outer-Row Optimizations to Improve Data Warehousing Queries
but apparently this is still an issue in SQL Server execution plans?
In the COUNT and no TOP queries,
the Constant Scan operation acts as the outer source in a loop join to Document.
This is the artificial rowset from the IN clause on AccessControlListID_D.
It is known to SQL Server from statistics that this
column has highly skewed distribution resulting in the row-thread split shown below.
The distribution on the other joins are also skewed but not as strongly?
The query with the TOP clause and an index hint just happens to prevent
the SQL Server query optimizer from using a Constant Scan as a loop join outer source.
The index hint does nothing to improve the CPU efficiency of this query.
In fact it is more than 10% less efficient in terms of CPU.
The positive effect is that work is evenly distributed among threads as shown below
resulting in nearly linear scaling with parallelism (7.6 to 1).
Technically the problem here is not all few outer rows,
but just uneven few outer rows?
Microsoft seems to be aware of the problem and has fixed some aspects?
but apparently now all.
Perhaps the SQL Server engine could implement a flag or hint to avoid
execution plans that are sensitive uneven distributions?
For now, the only work-around is to look for this situation and take appropriate action.
UNION in place of AND/OR Combination
The execution plan that has outstanding results in terms of CPU efficiency
is the UNION replacing the OR clause.
It was previously documented that the SQL Server query optimizer has difficulty
in generating a good execution plan for queries with a combination of AND/OR conditions,
but has no problems when an alternative structure is employed.
There are other forms for which the SQL Server query optimizer does not produce
good execution plans.
Once these can be catalogued with appropriate alternative SQL expression strategies,
there are probably not any searches that cannot be handled.
COUNT + TOP Architecture
This actually encompasses several sub-topics.
Presumably the purpose of this architecture is to know the exact number
of documents that match a search, but also so as to not over-whelm the application server
But we should consider that 1) the query with a rowset only has an integer column,
2) even a large case should have no more than tens of millions of documents
(a small number in the modern world) and 3) the application server today has many gigabytes
So this is not absolutely necessary?
There is also a client/application-side work around for this.
Simply issue the query to return the identity column for all rows in the search,
load the first X into an array, then continue to read but not store the remaining rows.
Consider the COUNT + TOP architecture.
There are 4 possibilities. One is that both queries are inexpensive in which case
this is not important.
Second is that the COUNT query is expensive but not the TOP.
This could happen when many rows match the search, and the TOP clause allows
the query to exit quickly.
Third is that the COUNT is not expensive but the TOP is.
This happens when the query optimizer estimates many rows but in fact there are few rows.
In this case, an appropriate high-volume parallel plan is employed for the COUNT query,
but a non-parallel plan is used for the TOP query relying on the expectation of exiting quickly.
The exit condition is never met, and the non-parallel plan must process the full
set of rows with only a single thread.
Consider also that this non-parallel plan was formulated based on low start-up overhead
(loop joins) rather than volume efficiency (hash joins).
The fourth possibilities is that both the COUNT and TOP queries are inherently expensive.
In this case, we now have to execute two expensive queries all so that the developer
can avoid a few lines of code on the application side?
(Many SQL/database disasters have been traced to lazy/incompetent coders.)
As I said in the first Relativity article,
all of the database problems appear to be solvable,
but most require action in the application code where the SQL is generated,
including the architectural strategy of COUNT + TOP.
Some problems from Relativity 7.x appear to have been resolved, such as
the data type mismatch
between a varchar column and nvarchar search parameter.
This by itself was a nuisance, but when combined with the TOP clause had significant negative consequence.
The full set of details along with a test database and queries to both reproduce and fix
the problems in Relativity 7.x were sent to kCura.
Most of the advice seems to have been disregarded.
It would seem that kCura was aware that there were problems.
In version 7.x, there was a CodeArtifact table (3 columns CodeTypeID, CodeArtifactID, AssociatedArtifactID)
that was frequently involve in search queries that would take forever,
as in 30min (or whatever the web page time-out is) to 30 months (estimated) to complete.
It is possible that the long running read queries and write activity also resulted in blocking and deadlocks
despite prolific use of NOLOCK (without the WITH keyword?).
For version 8.x kCura went to great effort to split this table into multiple tables
with names of the form ZCodeArtifact_xxx, one for each of the group values (xxx)?
The new tables have columns CodeArtifactID and AssociatedArtifactID,
so perhaps there is one table for each CodeTypeID in the version 7.x table?
(This topic is covered in
ZCodeArtifact & Statistics,
as there were a series of performance problems in queries to these tables
related to statistics.)
The problem was that blocking and deadlocks were the symptoms of Relativity problems,
not the cause. The causes were the topics discussed here: 1) injudicious use of the TOP clause
in situations in which the query optimizer makes a serious error in the estimated number of rows
when the actual row count is already known from the COUNT query,
2) generating a complicated SQL expression with combination AND/OR clauses instead of JOIN and UNION,
and 3) ineffective parallel execution plans due of skewed distribution.
One more point, kCura also went to the effort of changing the SQL a from single expression form
to one using CTEs. This may contribute to the clarity of the expression,
but does not impact the execution plan problems.
Another problem seen in 8.1 is a search involving both conventional search arguments
and Full-Text elements. The execution plan had the Full-Text Search (FTS) operation
as a loop join inner source
probably because the estimate number of rows from the outer source
showed 1 (which could mean 0).
This might have been because the ZCodeArtifact_xxx table was newly generated and statistics were not updated?
The query produced a good execution plan after statistics were updated and had quick actual execution time as well.
(232 row(s) affected) SQL Server Execution Times: CPU time = 152148 ms, elapsed time = 152298 ms. TOP 1000
(232 row(s) affected) SQL Server Execution Times: CPU time = 143973 ms, elapsed time = 71857 ms. no TOP
(232 row(s) affected) SQL Server Execution Times: CPU time = 167671 ms, elapsed time = 22016 ms. Index hint
SQL Server Execution Times: CPU time = 1665 ms, elapsed time = 324 ms. UNION
SQL Server Execution Times: CPU time = 138279 ms, elapsed time = 69109 ms. COUNT
The full list of Document.AccessControlListID_D and o1000022_f2154932.ArtifactID values are below
Note: a Test database was created with just these tables and limited columns for in-depth investigation.
A great session by Adam Machanic at SQL Saturday Boston the previous weekend
on methods to influence the query optimizer while still letting it do its task.
The gist of this is that while SQL Server has what are called Query Hints,
there are adverse consequences.
The Join Hints (Loop, Hash and Merge)
"specify that the query optimizer enforce a join strategy between two tables,"
but also results in the query optimizer not bothering to investigate
the different join orders, even though only the join type was specified.
Hence the general advice is that one should not use the SQL Server Query/Join Hints
unless one is prepared to completely override the query optimizer,
which is essentially to say, one should almost never use join hints.
Microsoft's advice is:
"we recommend that hints, including , be used only as a last resort by experienced developers and database administrators."
Adams' session investigated an alternative method of providing advice to the query optimizer without causing it to otherwise shutdown.
Now that we have said that the Loop, Hash and Merge Join Hints should almost never be used, and without recommending the use of hints, consider the question of how to use hints in the case of a last resort situation.
Given the fact that the query optimizer disables join order optimization when hints are applied, the task is to reconstruct a good join order.
It is explained elsewhere the general preference regarding join order.
See either my articles on
(mostly I just examine the formulas, without bothering on the explanation),
Here will only examine the technique of join ordering.
In a two table join, there is only one shape, one table as the outer source in the upper right of the execution plan and the second table as the inner source in the lower left of the execution plan
as in the diagram below.
We can reverse the order of the join, or change the type of join,
but there is only one shape.
In a three table join, there are two possible shapes.
One is linear: the first table is the outer source, joins to the second table (inner source),
and then the output of this is the outer source for the final join with the third table as the inner source.
The second possible shape is that one table is the outer source in one join
to another table. The output of this is now the inner source in the other join
with the third table as the outer source.
From these two basic shapes, we can assemble almost any possible execution plan
(sorry, but I do not have examples with the spool operation, if any one would like to comment on these).
Until a few years ago, I had always been under the impression that it was
necessary to write out the full sub-query expression in order to force a bushy join,
JOIN C ON xx
) ON xx
The both join shape and order are forced with either a join hint or the
OPTION (FORCE ORDER) clause.
In a complex query with a long SELECT list, this style of expresssion quickly becomes cumbersome.
Then one day, I needed to relax, so I read one of
Itzik Ben-Gan's books
and saw a style of SQL expression on joins that I had never seen before.
B JOIN C ON xx
) ON xx
There is no SELECT in the sub-expression!
My heart skipped a beat.
What would be the execution plan join shape be if there were a join hint
or force order hint on this expression?
Below is an SQL query example from Adam's session.
The execution plan for this query is below. Note that the join order is different than in the SQL.
If we forced a hash join, we would get the linear plan below.
Note that the join order is the same as in the SQL.
We could write the SQL in the form below.
But without a hint,
the execution plan is the same as the original (natural) plan.
Now if we were to force the join order in the new SQL, as below
we do indeed get the bush shape with the join type.
We now have the basic techniques for writing SQL with the objective of forcing a particular
join shape and order, to which we could apply join hints that also override
much of the query optimizer.
Again, this is not an endorsement of using join hints.
Do not use join hints without understanding that it has the effect of overriding the query optimizer
on join ordering, and the implications.
I do not accept any consequences on the use of join hints unless I was the consultant engaged.
OK, so I just gave you a loaded gun while saying don't blame me for its improper use.
Search Microsoft Technet for the terms
Advanced Query Tuning Concepts,
Understanding Nested Loops Joins,
Understanding Merge Joins, and
Understanding Hash Joins.
I seem to have forgotten that role reversal was a feature in hash joins?
Note that Adam's session is the "Gentle Art ..."
Join hints and force order is definitely the bulldozer and burn approach
I used to build white box servers because there were usually enough spare parts left over from upgrade projects. (management did not see the need for non-production systems, so I arranged for there to be spare parts). But since 2002 or so, I have been buying Dell PowerEdge servers for my own test environment.
This was in part because of the hassle of troubleshooting connections to multiple SCSI HDDs, was it the cable or connector?
In the Nehalem/Westmere time frame 2009-10, I decided to step down from a 2-socket system of the previous generation (PowerEdge T2900) to a single socket system, the T110 II. In principle, this was because single socket systems had become powerful enough for me to demonstrate important characteristics I need for my papers, such as generating 2.4GB/s in IO bandwidth. In practice, it was also because I sit in the same room as the servers, and the T2900 had noisy fans while the T110 II was whisper quiet.
Processors - Intel Xeon E3-1200 series v3, Haswell
For the current generation processor, the Intel Xeon E3 v3 based on Haswell,
Dell decided to focus on pre-built ready to ship systems rather than build to order systems. The only E3 processor option in the Dell T20 is the E3-1225 3.2GHz nominal and 3.6GHz Max Turbo. This system has 1 PCI-E x16 gen3 slot and 1 x4 G2 slot.
Graphics is not normally important in a server, as it usually resides in a server room and is accessed via remote desktop or even completely remote administration.
The previous generation T110 II used an old Matrox G200eW 8MB graphics (based on a 1998 design?) that only supports normal video resolutions (1280x1024?, ok I am getting 1600x1200 on the T100II).
The new T20 with E3-1225 has the Intel P4600 graphics.
For some strange reason my T20 would only power up with 1 DIMM socket populated.
I opened a case with Dell Technical Support, but they seem to have lost track of the ticket. I wonder if the people are still there. Or have they been outsourced?
So I thought that I would give building my own server another try.
I got the Intel Xeon
E3-1275 v3 3.5GHz nominal and 3.9GHz Max Turbo ($350 versus $224 for the 1225, but less than the $552 price tag of the 1285). The 1225 to 1275 processors have the P4600 graphics, which support 3 displays.
Supermicro X10SAE Motherboard
My motherboard of choice is the Supermicro
X10SAE with PCI-E 1 x16 and 1 x8 gen3 slots.
The E3 v3 only has 16 PCI-E gen3 lanes.
The Supermicro motherboard has an ASMedia Switch (ASM1480) that redirects 8 lanes from one slot to another slot so that all 16 lanes connect to a x16 slot
if that slot has a x16 adapter and the x8 slot is unpopulated?
Otherwise, both slots are x8?
If the ASM chip is a PCI-E expander, then in principle, both the x16 and x8 slots have all lanes always connected, its just that half of the x16 lanes are shared with the x8?
The ASMedia website describes the
ASM 1480 as a multiplexer/demultiplexer.
But there is not a detailed document.
I would hope that in the situation of simultaneous traffic, priority is given to the x8 slot, as the x16 slot should direct traffic to its x8 dedicated lanes? but there is no protocol to support this mode?
What I like about Supermicro is their deep lineup of server class motherboards with almost every conceivable slot arrangement.
I recall that when Intel spent a huge amount of money to focus on one motherboard for an entire processor class, not optimal for any particular purpose.
Display - Dell P2815Q 3840x2160
I also got the new Dell P2815Q monitor currently $699. It had priced higher, but Dell offered a second monitor for a discount, so I bought two.
This has a 28in diagonal, and maximum resolution of 3840x2160 at 30Hz.
The low refresh rate at maximum resolution would not be suitable from gaming.
Neither does the P2815Q have the glossy display popular in home entertainment.
But it is perfect for viewing SQL Server execution plans.
At standard zoom (80%) I can see 17 execution plan operators horizontally across the 3840 pixels.
Connecting two of the monitors would display 34 operators?
Of course, it might be more important to have dual monitors in portrait mode,
but I do not know where to get the stands.
(per Dave, the P2815Q does rotate)
I might give the UP2414Q at $1149 a try. The UP3214Q at $2799 is too expensive for my needs. The other large screen with high-resolution is the U2713H at $999 supporting 2560x1440. I have two XPS 27in AIO with apparently the same 27in display?
Storage - LSI 9361 PCI-E gen3 12Gbps SAS
My preference would be to plug in 2 PCI-E gen 3 SSDs capable of the full (or nearly) x8 slot bandwidth of 6.4GB/s, at least on the large block read side.
This is to avoid a jungle of SATA power splitters and cables inside the system. However, for some reason, there are no PCI-E gen3 SSDs?
There are PCI-E gen 3 RAID controllers with either 2 x4 12Gb/s SAS interfaces and also some with 24 (x1) 6Gbps ports. There are no 12Gbps SSDs so if I used the standard 2 x4 ports at 12Gbps, I would have to find some enclosure with 12Gbps capable expanders, which will of course escalate the cost.
All of this is rather unfortunate for building cost optimized high bandwidth storage system. NAND chips currently operate at up to 333MHz. This means 20 channels could saturate the full PCI-E gen 3 x8 bandwidth, even though we would probably use 24 for RAIN and general downstream over-configuration.
At 32GB per channel (4 x 64Gbit die) and 24 channels, the raw capacity is 768GB would be a very inexpensive storage yet capable of 6GB/s read?
Previous generation PCI-E NAND controllers supported 16 and 32 channels.
The standard SATA-NAND controller has 8 channels. This was a good choice when NAND was 100MHz. Now this means we have too much (but unusable) downstream bandwidth.
The new NVMe NAND controllers might offer the option of connecting to either 6Gbps SATA or x2 PCI-E gen 3, which would be 1.6GB/s, but I am not sure when we would have supporting infrastructure.
The upcoming (now) LSI SandForce SF3700
can interface to either PCI-E gen2 at x2 in the 3719 & 3729 models or x4 (3739 & 3759) and SATA at 6 Gb/s (all models)
There are 9 channels on the NAND side.
SQL Server 2014
I just installed SQL Server 2014 RTM on this system. I notice that SQL Server 2014 does not show the Parallelism (Repartition Streams) operator, same with the Bitmap.
The Parallelism (Distribute Streams) and (Gather Streams) operations are still displayed.
Below is part of a SQL Server 2012 execution plan with both the Parallelism (Repartition Streams) and Bitmap operations.
In SQL Server 2014, the execution plan for the same query does not show these two operations.
I imagine that the parallelism and bitmap operations are still there,
just no displayed because they do not contribute to understanding the execution plan, while wasting valuable screen real estate.
Of course, having the option to reduce the spacing between operations without reducing the display font would be very valuable.
I do not think the Program Manager for SSMS looks at complex query plans to understand why this would be a valuable feature?
I would like to find someone willing to build a system with the Supermicro X9DRX+-F 2-socket motherboard
with 10 PCI-E g3 x8 slots, filling most of these with storage controllers.
This would be massive overkill as I am not sure SQL Server can consume 20GB/s from 4 controllers,
let alone 40GB/s from 8 controllers.
Interpret this as I do not want to pay for 2 12/15-core processors, 16-32x16GB DIMMs, 8 controllers, and 64 SSDs out of my own pocket.
I have ordered a LSI 9361-8i PCI-E gen3 - 12Gpbs RAID controller that I will use in the Supermicro system w/the Xeon E3 v3 (Haswell), although I have no means of using the 12Gbps SAS signaling rate. If anyone has a 12Gbs expander board, I would appreciate it (there is not a pressing need for SSDs to support 12Gbps, we would just like to connect 12 or SSDs at 6Gbps to the 2 x4 SAS 12Gbps ports.
I also ordered OCZ Vector 150 SSDs. I will probably mix these with the original Vectors that I already have.
In my previous generation system, the Dell PowerEdge T110 II, I had the LSI 9260 controller initially with a mix of OCZ Vertex 3 and Crucial m4 SSDs. The Crucial m4's would occasionally show as offline on the LSI RAID controller, but there was nothing wrong with the m4 when attached to a SATA port. Eventually I replaced the m4 with OCZ Vectors, and since then all 8 SSDs have worked fine with the LSI 9260.
The recently announced SanDisk CloudSpeed SSDs are also of interest, but I suspect these will be OEM only products.
Plextor has a PCI-E gen2 SSD for a x2 slot (x4 connector?), rated for 770MB/s.
Tom's Hardware says its a M.2 SSD on a PCI-E board.
If is the case, then I think the correct SSD product for now are PCI-E boards on which we can plug in 1-4 or perhaps even 8 M.2 SSDs.
The M.2 form factor supports x2 PCI-E lanes. A simple board could wire up to 4 directly the lanes in the PCI-E slot. A more flexible mode would have a PCI-E expander, so that the number SSDs (each PCI-E x2) can exceed the slot width (x4, x8 or even x16).
I am seeing just under 4GB/s from the LSI9361 with a mix of 4 OCZ Vector 150 and 4 older OCZ Vertex 3 Max IOPS SSDs.
Technically the SSDs are capable of over 500MB/s each, but in an array (2 actually, the 4 Vectors in one, and the 4 Vertex 3 in the other) with SQL Server driving IO, that's pretty good.
I got 2.4GB/s with the previous generation LSI 9260. Presumably the controller could not drive the full 3.2GB/s PCI-E gen 2 x8 limit?
The Dell P2815Q connected to my Supermicro X10SAE motherboard's display port connector does operate at the full 3840x2160 resolution, but not when connected to the HDMI connector. I do not know if it is possible to have 2 displays at 3840x2160 with just the Supermicro motherboard, or if I need to get a separate video card?
It is rather curious that the
two TPC-E benchmarks results
published for SQL Server 2014 did not employ the memory optimized tables and natively compiled procedures, given that Hekaton is the hallmark feature of 2014.
Potential for 30X gain in transaction processing have been cited in presentations and whitepapers.
No explanation has been given so I will speculate.
Why guess when I do not know?
Well hopefully someone will come out with the real reason.
The TPC benchmarks are meant to be a common standard that allows people to compare results of different DBMS,
operating systems and processor or storage hardware.
Many years ago, the TPC-C benchmark was very important so that one could be sure that database engines
would not have any bottlenecks in processing that workload,
which only comprised a small number of SQL operations.
An actual database application might use many different database features and there would be no assurance on
whether one or more operations not in the benchmarks had scaling limitations.
Several years ago, all of the major database players, software and hardware, contributed to TPC-E,
which was supposed to reduce the cost of the benchmark system,
increase schema complexity and be more representative (higher read-to-write ratio?).
But after the benchmark was approved, Oracle decided not to publish results,
even though every indication is that they are perfectly capable of producing very good single system results.
Oracle RAC scales very well in the TPC-C benchmark, which has a high degree of locality by Warehouse.
TPC-E does not have locality and is presumed to be more difficult to scale in an active-active cluster architecture.
At the time, RAC scaling was very important to Oracle.
Since then, Oracle has favored single system benchmarks,
especially on their Sun platform with SPARC processors.
Recent SPARC processor have 8 simultaneous multi-threading per processor core
(SMT, equivalent to HT for Intel processors).
Microsoft decided to quickly shift over from TPC-C to TPC-E.
The time frame of TPC-E roughly corresponded to the SQL Server versions 2005 and 2008 boundary.
Microsoft did not allow TPC-C results to be publish on SQL Server 2008,
and the few TPC-E results that were published after TPC-E launch employed SQL Server 2005.
One reason was that log stream compression was added into SQL Server 2008.
This is to improve database mirroring functionality.
Log compression consumes CPU cycles and is constraint to a single thread,
but reduces network traffic especially important over a WAN.
Many people use DB mirroring, and few people are up against the single core log compression throughput limit,
so perhaps this was for the greater good?
See Paul Randal blog
SQL Server 2008: Performance boost for Database Mirroring,
11 Oct 2007.
A SQLCAT article, http://sqlcat.com/sqlcat/b/ technical notes/ archive/2007/09/17/ database-mirroring-log-compression-in-sql-server-2008-improves-throughput.aspx,
on database mirroring dated 17 Sep 2007 is cited, but the link is no longer valid,
Trace flag 1462 is cited as Disable Mirroring Log Compression (Paul says this flag is active).
The TPC-C benchmark consists of 5 stored procedures, New-Order (45%), Payment (43%)
and three at 4% each: Order-Status, Delivery and Stock-Level.
The New-Order is the principle call, getting the next order id, inserts one row for the order,
an average of 10 rows for order line items,
and 10 rows updated in the Stock table.
The Stock table has 2 integer columns forming the key,
4 quantity related columns and 11 fixed length char columns totaling 290 bytes,
(308 bytes per row excluding overhead for the table).
Only the four quantity columns are actually updated.
However it is necessary to write both the original and new (?) rows in its entirety to the log?
Paul Randall says only the bytes that changed need to be written to the log,
but a key column update has the effect of a insert/delete.
It is also possible the exact nature of the Stock update statement
requires the char columns to be logged?
This would imply that the raw New Order transaction log write is on the order of 6.4KB to encompass 10 original
and new rows of Stock plus 10 rows of order line insert?
The TPC-C reports cite 6.4KB per transaction presumably including the payment transaction?
The last TPC-C results on SQL Server (2005) were in 2011,
one for a system with 4 Opteron processors and one for 2 x Xeon X5690 (Westmere) 6-core processors at 1,024,380 tpm-C,
or 17,073 New Order transactions per sec. On the assumption of 6.4KB per transaction log write per New-Order,
the raw (uncompressed) log IO would be 109MB/s.
More recently, Cisco posted a TPC-C with Oracle 11/Linux on a 2 x Xeon E5-2690 (Sandy Bridge) 8-core processors
at 1,609,186 tpm-C, for 26,819 New Order transactions/s.
There is also an Oracle/Linux TPC-C result on 8 x E7-8870 (Westmere-EX) 10-core processors at 5,055,888 tpm-C
or 84,264 New Order transactions/s.
Even if the 2 x 8-core transaction log volume were within the single thread compression capability,
it is likely that higher volume from systems with 4-socket 8-cores or even the newer 2-socket E5 v2 12-core processors
would exceed the log stream compression capability of a single core?
The more recent TPC-E benchmark is comprised of 10 transactions, of which Trade-Order,
representing the benchmark score, is 10% of volume.
Some of the individual transactions are comprised of multiple frames,
so the total number of stored procedures calls (involving a network round-trip) per scored transaction is 24.
If we were to work out the RPC volume between TPC-C and TPC-E,
we see that the average transaction cost is not substantially different between the two benchmarks.
So one of the cited objectives for TPC-E, being more representative of current workloads, may be true,
but the database engine does not really care.
Perhaps the more important matter is that the TRADE table is 139 bytes per row, excluding overhead.
The Trade-Order involves 1 row, and is only 10% of the transaction volume (4% of call volume?).
Several of the TPC-E transactions involves write operations,
and the TPC-E reports cited a log write of 6.7KB/trade, perhaps amortizing the other transactions.
Some TPC-E stats with raw log write estimates
|TPC-C||2 x X5690||6||1,024,380 tpm-C
|TPC-E||2 x X5690||6||1284 tps-E ||5.6/6.0TB||98(178?)GB|
|TPC-E||2 x E5-2690||8||1881.76 tps-E
|TPC-E||2 x E5-2697||12||2590.93 tps-E ||10.7/11.3TB||523GB|
So perhaps the net difference between TPC-E and TPC-C is a 10X reduction in log write?
So what does all this have to do with Hekaton? Let suppose that Hekaton is capable of providing a 10-30X increase in transaction performance.
The current 2-socket with Xeon E5 v2 processor has 12 cores per socket,
15 cores per socket in the 4-socket E7 model.
This would generate 17 and 34 MB/s raw transaction log volume without Hekaton.
At a 10X gain from Hekaton,
we would be beyond the log compression capability of a single core in the 4-socket.
At 30X gain, beyond the compression capability for the 2-socket system?
All of this is just guess work. Would someone from Microsoft like to comment?
See the SQLCAT slide deck “Designing High Scale OLTP Systems” by Thomas Kejser and Ewan Fairweather present at SQL Bits 2010.
It cites log write throughput at 80MB/s. It is not stated whether this was a limitation or just that particular case.
It does appear to be in the correct range from what one Nehalem-EX core can generate in compressed output.
The Hekaton performance gain math calculation is worth examining.
An example was cited of the performance in a test scenario.
In employing just the memory optimized table, the performance gain was 3X. With the natively compiled stored procedures (NCP),
another 10X gain was realized for a combined gain of 30X.
Let’s assume that baseline test query requires 100 units of work.
A 3X gain via memory optimized tables means that work is now 33 units, so the difference between tradition concurrency and the new system is 67 units.
The 10X gain with natively compiled procedures over the first step means that the 33 units with just memory optimized tables is further reduced to 3.3,
for a net reduction of 29.7.
So if we were not careful with the underlying math,
we might have thought that the 10X with natively compiled procedures was more important than the memory optimized tables.
In principle, NCP could have been implemented without memory optimized tables.
This example demonstrates why it is important to first tackle the big resource consumer first,
but the gains are magnified when secondary items can also be improved.
TPC-C New Order Stock update SQL Statement
-- update stock values
s_ytd = s_ytd
@s_quantity = s_quantity
= s_quantity -
THEN 91 ELSE 0
s_order_cnt + 1,
THEN 0 ELSE 1 END
WHEN 2 THEN s_dist_02
WHEN 3 THEN s_dist_03
WHEN 4 THEN s_dist_04
WHEN 5 THEN s_dist_05
WHEN 6 THEN s_dist_06
WHEN 7 THEN s_dist_07
WHEN 8 THEN s_dist_08
WHEN 9 THEN s_dist_09
WHEN 10 THEN s_dist_10
s_i_id = @li_id
Regarding the comment below on max memory opt table of 256GB,
I do not see any reason that even a modern transaction processing
database needs even half of this for in-memory data.
But it is very reasonable to have tens of TB in the DB for generally beneficial functionality.
This being the case, I think it would be nice to have a feature for rolling in-memory table data to a regular table.
My thought is that at the end of each day or other period,
the entire set transactions should be converted to a regular table,
that will then become a partition in the archive/history table.
SQL Server performance has the interesting nature in that no matter how sound and logical an idea is
on how it might behave in one scenario compared to another,
what actually happens could be very different.
Presumably the reason is that what happens inside the SQL Server engine is very complex
with very many elements to support even a basic query, including steps that we are not even aware of.
On top of this, the modern microprocessor is also very complicated,
with radically different characteristics depending on whether an address is in (the processor L2) cache,
or a memory round-trip necessary, and then whether it is a local or remote node memory access,
not to mention the implications of cache-coherency checks.
With this in mind, some aspects of the Decimal/Numeric data type are of interest.
There have been previous discussions on the fact that the Decimal data type is more expensive than integer or float,
with impact that could be on the order of 2-3X,
enough to become a serious consideration in queries that aggregate very many rows and columns with the Decimal data type.
It is easy enough to understand the integer and floating point data types can be executed directly by the microprocessor
while decimal must be handled in software, which should mean that the overhead is much higher than 2-4X.
The explanation is that the even simple matter of accessing a column in the page-row storage organization
of traditional database engines involves a series of address offset calculations for which the
modern microprocessor cannot pre-fetch from memory sufficiently far in advance to keep its execution pipeline filled.
If this is indeed the case, then one would expect that the difference between integer and float compared to
decimal would have far larger impact in column store indexes introduced in SQL Server 2012 for nonclustered
and clustered in the upcoming 2014.
The whole point of column store is to access memory sequentially to fully utilize the capability of modern microprocessors
emphasizing bandwidth oriented over serialized round-trip memory accesses.
In any performance investigation, it is always very helpful first to build baseline with non-parallel execution plans.
This is because parallel execution introduces a whole new set of variability's that can complicate the assessment procedure.
Of course, with such sound and logical reasoning, the outcome is inevitably the unexpected, hence the opening paragraph.
It would seem that the SQL Server engine follows completely different code path on operations to column store indexes
depending on whether the execution plan is non-parallel or parallel. But it turns out that is occurs in SQL Server 2014 CTP2,
and not SQL Server 2012 SP1, so it is possible the unexpected behavior will not occur in the 2014 release version?
The test system is a Dell PowerEdge T110II with 1 Xeon E3-1240 (Sandy Bridge) 4-core,
3.3GHz nominal (3.7GHz in Turbo) processor with Hyper-Threading enabled,
32GB memory, and storage on 8 SATA SSDs in RAID 0 attached to a LSI 9260 RAID controller.
The operating system is Windows Server 2012 R2, and SQL Server version 2014 CTP 2.
The database was populated using the TPC-H data generator (dbgen) to produce a SF10 data set.
This puts 59.986M rows in the Lineitem table which would have been 10GB using the 8-byte datetime data type
but is 8GB with the 4-byte date data type.
The index keys are different from the TPC-H kit,
but the test here are not represented as conforming to TPC rules for official results.
Four Lineitem tables were created, all at SF 10.
Two use the conventional page/row storage, and the other two use Clustered Columnstore indexes.
The conventional tables were not compressed, while Columnstore indexes are compressed without option.
For each type of storage, one table has 4 columns of type float (8 byte),
and the other has 4 columns declared as decimal(18,6) at 9 bytes, NOT NULL in both cases.
The conventional Lineitem table average 139 bytes per row or 59 row per page with 8 byte float
and 143 bytes per row, 57.25 rows per page for the 9 byte decimal.
The table with clustered column store index averaged 44.1 and 45.67 bytes per row for float and decimal respectively.
The columnstore indexes where about 2.5GB versus 8GB for the conventional tables.
Previous test have shown that there is no difference between int/bigint, money and float,
as all are natively supported on the processor hardware.
From the table definition, the four float/decimal columns are adjacent, and should be within 64 bytes of the row header?
Meaning all values are in the same cache line?
One additional note is that this report is a quick response to a question concerning decimal overhead.
I did not have time to setup rigorous measurements averaged over 100 executions and follow-up
with an examination of anomalies. All measurements here are based on a few runs.
Page/Row and ColumnStore Performance with Parallelism
The basic test case a simple aggregate of 1-4 of the float or numeric columns along with a count,
in reference to a count only query.
For the page/row table, a clustered index (table) scan is forced.
There is no point to forcing a scan on columnstore index, due to the nature of column storage.
Performance data is collected from sys.dm_exec_query_stats.
An attempt was made to ensure data in memory prior to each measurement,
but some columnstore accesses generated a small amount of disk IO.
Below is the CPU in nanoseconds per row for the four cases at DOP 1.
The DMV reports worker time in micro-seconds, so that value was multiplied by 1000 to get nanoseconds.
The CPU nominal frequency is 3.3GHz but for single thread operations, could very be running at the turbo frequency of 3.7GHz,
somewhat more than 3 cycles per ns.
The cost of the Count only operation, forcing a scan on the entire 8GB table
(but does not touch either the float or decimal columns)
is about the same for both conventional tables at 58.8 and 60.15 ns per row respectively,
probably reflecting the slight difference in table size (3%).
The true cost structure of a SQL Server table (clustered index) scan consists of a cost for the page access,
each row within a page, and each column, typically with the first column access having high cost than the subsequent
columns, and perhaps higher cost if a subsequent columns is not on a previously accessed cache line,
and perhaps higher for non-fixed length columns that involve a more complicated address calculation.
In previous reports, I have cited the page access cost as in the 650-750 CPU-ns range for Sandy Bridge generation processors.
So about 10ns of the average row cost cited above is amortizing the page access cost (for just under 60 rows per page).
Below are the same test data, but showing incremental cost of each additional column accessed and aggregated.
The Count value is the same as above because it is the baseline operation.
Notice that the incremental cost for the first column aggregated (1SUM) is higher than the subsequent columns.
It is strongly evident that decimal aggregation is much more expensive than the float type
(and other tests show float to be the same as int and money).
The reason that we cannot put a specific value on the difference is because of the cost structure
of complete operation has page, row and columns components of which the int/float/decimal difference only involves
the last component. In addition, the number of columns of each type also impacts the differential.
Below is the cost per row in CPU-ns of the count query,
with the two conventional tables on the left
and the two columnstore indexes on the right at DOP from 1 to 8.
The system under test has 4 physical cores with HT enabled.
SQL Server correctly places threads on separate physical cores when DOP allows,
but the DOP 8 test forces both logical processors on each core to be used.
It is also clear in the Columnstore tests that there is something very peculiar.
CPU put unit work is not supposed to decrease from DOP 1 to 2.
There are certain cases when this does happen,
example being a hash join in where the parallel plan has a bitmap filter,
which is not employed (per rule) in non-parallel plans.
This not the case here, and a test on SQL Server 2012 shows the expected performance advantage
for columnstore at all DOP levels.
Below is the rows per second for the Count query.
This is the inverse of elapsed time and is better for demonstrating scaling with DOP.
The vertical axis is log scale in base 2 to better distinguish 2X.
The scaling with DOP is not particularly good in the DOP 1-4 range,
with each thread on separate physical cores.
This is believed to be the case with as CPU/row only rises moderately with parallelism to DOP 4.
This query does almost no work other than page access,
so it is possible there is contention somewhere in the buffer pool management?
Perfect scaling would be doubling performance for each doubling of DOP
(each thread on separate physical cores),
an example being from 16 to 32 rows/µs on the vertical scale from DOP 1 to 2.
An indicator of quality of the measurement the ratio of worker time to elapsed time.
In a perfect situation, this would be equal to the DOP.
At DOP 4, the ratio is unexpectedly low at 2.8.
Very good scaling is normally expected when parallelism is over separate physical cores.
Here the scaling in that case is poor,
but appears to be great when both logical processors on each core are active at DOP 8.
The sharp rise in CPU per row from DOP 4 to 8 is indicative of this aspect of HT.
Had the DOP 4 measurement indicated the correct worker/elapsed ratio closer to 4,
there would have been only a moderate increase in performance from DOP 4 to 8.
Below is the single column SUM query cost per row (in ns) for the two conventional tables on the left
and the two columnstore tables on the right.
The cost difference between float and decimal in the conventional tables is now evident though not large.
It is much more significant in the columnstore tables, as expected.
Below is the rows per second for the single column SUM query.
Below is the two column SUM query cost per row for the two conventional tables on the left
and the two columnstore tables on the right.
There is a larger difference between float and decimal in the conventional tables
compared to the single column query.
This is expected as there is more work is in column operations relative to the page and row access.
The difference on the columnstore table is than in the single column and this was not expected.
Below is the rows per second for the two column SUM query.
Below is the 3 column SUM query cost per row for the two conventional tables on the left
and the two columnstore tables on the right.
Below is the rows per second for the 3 SUM test.
The graphs below show the query cost per row for Columnstore access with 1, 2 and 3 columns,
the same as before, except with the DOP 1 value truncated.
Aspects of the Decimal data type cost relative to float have been examined,
noting the float has elsewhere been observed to be the same as int and money.
The overhead can be as large as 2-3X in conventional page/row tables,
depending on the ratio of work between page and row access, relative to column aggregation.
In queries that access small to even moderately large number of decimal values (rows*columns)
the higher cost of decimal should not be significant.
In large data warehouse queries that access ten million values, it will be noticeable
and probably start to get painful in the hundred million scale.
It was expected that the Decimal data type cost would be much high in columnstore indexes,
as the row access costs are greatly reduced, magnifying the column contribution.
First, the problem in SQL Server 2014 CTP2 with columnstore clustered index for
non-parallel execution plans was observed, and it is hoped that this problem will be resolved in RTM.
The expected large impact was observed on queries aggregating a single decimal type columns,
but the trend decreased unexpectedly for 2 and 3 columns aggregated.
SQL Server performance characteristics are not always in-line with expectations,
no matter how sound and reasonable the logic
Intel Xeon E7 v2 processors (Ivy Bridge-EX) officially launched today. The E5 v2 processors (Ivy Bridge-EP) and E3 v3 (Haswell) came out last fall. The previous generation E7 was based on Westmere, so the Sandy Bridge generation was skipped for the EX. This makes sense because big systems desire 2-year product stability versus the annual refresh for 2-socket systems.
The new E7 v2 tops out at 15 cores 2.8/3.4GHz nominal and turbo compared to the Westmere E7 with 10 cores 2.4/2.8GHz. The Xeon E5 v2 top model has 12 cores at 2.7/3.5GHz versus the Sandy Bridge E5 at 8 cores 2.7/3.5GHz.
When the Ivy Bridge EP separate dies with 10 and 12 cores was announced, it seemed rather an unusual choice.
Later, when the 15-core EX model was brought, it then became clear the 12-core E5 v2 actually shares a 15-core die with the E7 v2. (the diagram below is from Anandtech, see links at bottom, this reference was inadvertently left out in the original edit)
Below is my rendering of the 3 Ivy Bridge EP dies.
Below are the 10 and 15-core Ivy Bridge EP/X die
I will try to scale these in relation to Sandy Bridge and others when time permits.
Note that the L2 cache is on the other side of the core from the L3 or rather last level cache (LLC).
The Dell website shows the new PowerEdge R920 (but not yet taking orders), featuring 96 DIMM sockets which could support 6TB memory, but 1.5TB is most economical for now, with 3TB for “moderately” extreme situations.
The HP ProLiant DL580 G8 lists support for 32GB DIMMs, so it will probably be sometime before 64GB DIMM support can be verified.
It is not clear if 8-socket system support will be available.
In the period up to SQL Server 2008 R2, with licensing determined only by socket count, the obvious strategy was to pick a system with the desired number of sockets, and the most powerful processor for that type of socket. There was no point to analyzing memory requirements because it was both simple and cheap to fill the DIMM slots with the second largest available memory module (currently 16GB).
From SQL Server 2012 on, the new with core licensing dictates that we should now base our sizing strategy on the appropriate number of cores, and then determine between the E7 versus E5 platforms if applicable.
Intel Xeon E7 v2 Processors
Intel Xeon E5 v2 Processors
The benefit in stepping down the total number of cores (in addition to reduced licensing cost $6.7-10K per core?) is the possibility of higher core frequency in the lower core count processors. Also consider that write (and certain other) operations are not parallelizable, so single thread operations may be running at the turbo mode frequency.
When desired number of cores can be achieved with a 2-socket system,
consider that the E7 supports 24 DIMM slots per socket compared with the E5 at 12 per socket.
Even though we have been conditioned by DBA indoctrination that more memory is better,
this "rule" originated from the days when the maximum memory configuration may have been 64MB to 1GB.
In those circumstances, every MB of memory helped reduce disk IO. By blowing out the budget on memory and months of hard work in performance tuning, and with luck, it might be possible to bring disk IO
within the capability of a storage without so many components that disk drives fail on a weekly basis.
Given the maximum memory supported today,
very few situations really call for 1TB+ memory configuration.
It is extremely likely that a 2-socket Xeon E5 system with DIMM slots filled with 16GB DIMMs (24x16GB = 384GB, $200 each, $4.8K total) is already more than enough by a factor of 4 if not ten.
More than likely, disk IO (excluding log writes) is only a sporadic occurrence.
And if we needed disk IO, we could configure more IOPS capability from an SSD storage system (that is more economical than 1GB of memory was 20 years ago) than we could actually use.
(Yet people still finds ways to run into problems!)
Unless your SAN admin dictated the storage configuration, in which case maybe go for broke on memory.
Socket Options 1 or 2 and 2 or 4
Another possible option for less than maximum core count situations is whether fill the sockets with low core count processors or only populate half the sockets with the high core count processors.
Example 1: 1 x 12 core or 2 x 6 core processors.
Example 2: 2 x 15 core or 4 x 8 core processors.
Filling the processor sockets enables maximum memory bandwidth (and memory capacity, but in this situation we most probably do not need it).
The decision criteria might be based on the parallelism strategy.
If our expectation is to run critical queries at higher degree of parallelism (8, 10, 12 or 15), one would expect that having all cores on one socket would benefit from having better (true) performance in Parallelism Repartition Streams operations, as the latency between cores is low, favoring fewer sockets of the high core count processors?
Do not bother looking at the plan cost for this, it is strictly based on a model that does not take into account the processor/system architecture.
On the other hand, if we expect to restrict max degree of parallelism lower, say 4, 6 or maybe 8, then more sockets populated with lower core count processors
would benefit in having greater memory bandwidth?
I have not tested these two scenarios side-by-side in otherwise equivalent configurations, so I ask readers to alert me if data to support this assessment should become available.
It is possible that having the fewest sockets is the better solution because of less complicated cache coherency, despite the lower memory bandwidth and capacity.
It is uncertain whether there will be a Xeon E5-4600 v2 series, as this seems unnecessary?
There is also the Xeon E5-2400 v2 series with 3 memory channels instead of 4 for slightly lower platform cost structure. We can also consider the single socket E3 v3 (Haswell) at 4 cores 3.6/4.0GHz with 32GB and 2 x8 PCI-E gen 3 slots. It might seem beneath our dignity to run on a server similar to our desktop or laptop, but the fact is this 4-core system with 32GB is far more powerful than the 4-socket systems of 10 years ago.
I bought a Dell PowerEdge T20 to test out the E3 v3 Haswell processor. Unfortunately the system would only power with 1 DIMM slot populated, not 2 or 4. Dell support has not responded. I may buy a Supermicro motherboard and chassis.
New TPC-E benchmarks were announce for the E7 v2 from IBM and NEC. Below are the recent IBM TPC-E results spanning the Westmere-EX, Sandy Bridge, and Ivy Bridge processors.
|Sockets||Processor||Freq||cores||threads||Memory||SQL Server Version||tpsE|
|8||E7-8870||2.4GHz||80||160||4TB||SQL Server 2012||5,457.20|
|4||E7-4870||2.4GHz||40||80||2TB||SQL Server 2012||3,218.46|
|2||E5-2690||2.9GHz||16||32||512GB||SQL Server 2012||1,863.23|
|2||E5-2697 v2||2.7GHz||24||48||512GB||SQL Server 2012||2,590.93|
|4||E7-4890 v2||2.8GHz||60||120||2TB||SQL Server 2014||5,576.26|
At first I thought that the 4-socket E7 v2 performance gain over the 4 and 8 socket Westmere E7 also involved the new Hekaton feature in SQL Server 2014.
But then I realized that the 2-socket E5 v2 performance on SQL Server 2012 was inline with this being the traditional table structure?
The E7 v2 benchmark details have not been released?
Is there a reason Hekaton was or was not enabled?
IBM System x3850 X6 with 4 x E7-4890 (60 cores, 120 threads) 2TB (64x32GB) memory, 5 RAID controllers connected to 208 SAS SSDs and 1 RAID controller for logs. The Server, processors, memory and miscellaneous items totaled $151K, storage was $681K, and the 30 SQL Server Enterprise Edition 2-core licenses at $13,742 totaled $404K.
The complete list price was $1.25M with a discount of $212K (17% of the complete price) but this might actually be a 25% discount on the hardware or just the storage. The price on the 200GB SSD (SLC?) is $3079 which should easily support a 30% discount.
I would like to know what discount levels people are actually getting on SQL Server EE? The price with Software Assurance is about $10K per core, so this might be the proper budgeting value. Oh yeah, IBM include 1 Microsoft Problem Resolution Services incident as part of the 3-year cost of ownership.
On storage IO side, PCI-E gen 3 has been available in server systems (Sandy Bridge EP) for almost 2 years. PCI RAID controllers came sometime after that. There now also RAID controllers with both PCI-E gen 3 and SAS 12Gb/s on the downstream side.
Much of the storage infrastructure (especially HDDs and SSDs) are expected to remain 6Gbps for some time.
It would be helpful if there were either or both disk enclosures support 12Gb/s SAS and RAID controllers that have 4 x4 SAS ports to better leverage the bandwidth of PCI-E gen 3 x8 on the upstream side when the downstream side is still 6Gb/s.
There is still bandwidth mismatch but such is life.
Internally, disk enclosures have (2) chips, one per controller, with each having sufficient x1 SAS ports for each bay and 2 x4 ports for upstream and downstream traffic. The two controllers supporting dual-path - with dual-port SAS devices.
We would like to be able to have the x4 ports operate at 12Gb/s per lane, while connecting to either 6 or 12 Gb/s, allowing continued used of 6Gbps storage devices. There might be 24 bays communicating at 6Gbps, more than enough to load the x4 port on each of the two controllers.
I am curious as to the lack of SSDs with PCI-E gen3 interface. Dell says their PCI-E SSDs are now on the new NVMe standard.
I suppose the effort to work this in,
along with the combo PCI-E or SATA interface has run into longer than expected debugging effort.
If so, then we will wait patiently.
In server world, it is important for new storage technology to be thoroughly tested.
Intel's Xeon E5-2600 V2: 12-core Ivy Bridge EP for Servers
by Johan De Gelas September 17 2013, and
Intel Readying 15-core Xeon E7 v2,
A technical look at Intel's new Ivy Bridge-EX,
and Toms Hardware
Intel Xeon E5-2600 v2: More Cores, Cache, And Better Efficiency.
The TPC-E supporting files are now available for the two new results on the Xeon E7 v2 and SQL Server version 2014.
In the IBM report, the SQL does use either Hekaton table or compiled SQL.
I will look over the NEC report later.
2014 Mar 13
Anandtech article on
Testing SATA Express ...
2014 Apr 5 - Memory prices
RDIMM 4/8/16GB $85/142/243
LRDIMM 32/64GB $823/4510
Database backup compression is incredibly useful and valuable. This became popular with then Imceda (later Quest and now Dell) LiteSpeed. SQL Server version 2008 added backup compression for Enterprise Edition only. The SQL Server EE native backup feature only allows a single compression algorithm, one that elects for CPU efficiency over the degree of compression achieved. In the long ago past, this strategy was essential. But today the benefits are irrelevant while the lower compression is becoming a significant liability. All the third party backup compression products offer multiple levels of compression via open source algorithms and it is time for SQL Server to follow.
As always, we shall start with the historical context. LiteSpeed was introduced sometime around 2000 when the server system would have had 4 Pentium II or III processors at around 400-500MHz. This was before there were multi-processors so 4 sockets means four cores. Today, the individual processor core (Sandy Bridge to Haswell generations) is about 20 times more powerful, 6X via frequency and another 3X on instructions per cycle efficiency. And there are be 8 cores per processor socket for Intel Xeon E5 (Sandy Bridge) and 12 in the E5 v2 (Ivy Bridge) processors. Even on dropping from 4 sockets in 2000 to 2 sockets today, the system has perhaps 80 times more compute power.
In addition, the compression code also benefit from Hyper-Threading more so than SQL Server, as there are no lock contention issues. Even the Pentium 4 first generation HT, the performance gain was 50% for LiteSpeed database backup compression.
Over the same period of time, IO bandwidth has improved as well, but not by the same degree. Fourteen years ago, 350MB/s from the IO system was good (basically 2 x 2Gbit/s FC ports). Today it is possible for an IO system to deliver 10-20GB/s, except that some brain fart IO strategy in recent versions of SQL Server effectively caps practical IO bandwidth to disk storage in the 4-5GB/s.
So the net change is that today there is far more CPU available to support compression relative to the practical IO bandwidth. To better use this, we would like to use one of the algorithm that can achieve high compression with more CPU per unit of raw data. This is valuable when the data has to be sent over a wide-area network with bandwidth limits or even a local network because the infrastructure team would not put in 10GbE for the mission critical database server and has no concept of parallel network connections.
High compression can also be important when local storage space limited and extra database backups are necessary for migration operations. Of course the occurrence of constrained storage space is a separate matter of total organizational stupidity. Hard disk capacity is cheap, even for the more reliable enterprise models ($100 per TB for 7.2K HDDs and $450/TB for 10K). Yet in a SAN that is billed as helping to reduce cost (via increased utilization), somehow storage becomes precious, doled out in driblets from the SAN admin only after a series of time wasting meetings, forms and justifications. OK I am side-tracking to my long standing feud with SAN vendors bent on crippling database performance. The bottom line is that we have CPU resources on the SQL Server system, and we can put it to good use.
Compressibility of Data
Normally I like to use the TPC-H database for performance investigations because the data generator is openly available (with source code) to facilitate independent verification and because there a good library of published reports for comparison. Of course the TPC-H database is populated with randomly generated data which has different compression characteristics than normal data (less).
This is because in a typical production database, there is prolific use of sequentially increasing 8-byte data types or otherwise having a limited range of values, i.e. 63 out of 64 bits are the same from one row to the next. Another cause is lack of normalization, meaning repeating values that are easily reduced with dictionary based compression. The exception of course is databases with extensive use of (non-sequential) unique identifiers, which have almost no compressibility.
That said below are some results with the TPC-H data set.
|Type|| ||DB size ||Compressed B/U ||7Z|
|Uncompressed tables|| ||14.17||5.66||3.14|
|Compressed tables|| ||9.55||5.54||3.29|
|Clustered Column stored|| ||4.55||3.50||2.34|
With no compression at the table and index level, the SQL Server backup can achieve a 2.5 compression ratio. 7z can achieve a 4.5 compression ratio on the uncompressed backup, albeit at far high CPU consumption. With compressed tables and indexes, the database size is reduced to 67% of the original size. The SQL Server backup can achieve further compression as the block size is larger allowing for more effective dictionary use. It is perhaps no accident that both the normal database and one with compressed tables and indexes backup to about the same size. Even the 7z compressions on the uncompressed backup files are about the same size.
The clustered column store has even efficiency both in the database storage and the backup. I am thinking that the storage organization makes compression more effective.
Note that the SQL Server compressed database backup is not further reduced with 7z or other common file compression utility.
This is a simpleton’s solution. It is important to differentiate between the objective and the method. Backup compression consumes CPU. When LiteSpeed first came on the market, people asked for a means to throttle CPU. The reason was to maintain some degree of responsiveness to user transactions during the backup. Full database backups are normally done at night or during off-hours if the two are different. But there might be some user activity during the backup. So it might seem reasonable to throttle the CPU used by backup compression.
However, this is why I stress the need to differentiate between the objective and method. State the true objective as accurately as possible so that intelligent people can come up with the best solution rather than constraining the developer to artificial requirements. The reason a system is unresponsive is not because the CPU is fully utilized, but because of the wait for a CPU to become available.
In the Windows operating system from the very beginning or early days, there is the quantum. On multi-processor servers, there is a time interval of 15.625 milliseconds (1/64 seconds) and the quantum is 4 times this. A processor core can compress a 1MB block (the standard backup transfer size) in well under the full quantum. Simply yield the thread after each block. If there are any user transactions, there will be minimum wait time. And any spare CPU cycles can still be used for compression with only moderate loss of efficiency versus running for the full quantum.
Of course, a good strategy might be to impose some kind of IO bandwidth control. The key element is to key disk latency low, meaning the IO queue depth should be low. It should be possible to control this with the buffer count setting. If there are 1 or 2 buffers per thread, then the IO queue should not be over saturated with IO generated by the backup process.
My observations are that IO to lob pages (and row overflow pages as well?) is restricted to synchronous IO, which can result in serious problems when these reside on disk drive storage. Even if the storage system is comprised of hundreds of HDDs, the realizable IO performance to lob pages is that of a single disk, with some improvement in parallel execution plans.
The reason for this appears to be that each thread must work its way through a page to find the lob pointer information, and then generates a synchronous IO if the page is not already in memory, continuing to the next row only when the IO is complete.
I believe this issue could be addressed if we could build a non-clustered index where the index row header contains a pointer to the lob page instead of back to the cluster key (or heap page). Presumably this structure would be able to use the asynchronous IO capability of the SQL Server engine used for normal in-row pages. Per Mark Rasmussen’s PASS 2013 session, the index should perhaps point to all lob pages for very large structures requiring multi-level lob pointers.
Another thought is for the nonclustered index included columns could have the lob page pointer, so that we can jump directly to it instead of first going back to the table.
In the last two years, I have worked on databases that made extensive use of large text fields, prominently kCura (see
Relativity), the legal document discovery application. Typically the fields are declared as varchar(max) but it could be any of the data types that are stored outside of the row, i.e., in either the lob_data pages or row_overflow_data as opposed to the normal in_row_data pages.
It is quickly apparent that accesses to the lob fields are horribly slow when the data is not in the buffer cache. This is on a storage system comprised of over one hundred 10K HDDs distributed over four RAID controllers.
SQL Server has an outstanding storage engine for driving disk IO involving normal in-row data (of course, I had previously complained on the lack of direct IO queue depth controls.) In the key lookup and loop join operations (that generate pseudo random IO), both high volume and single large query, SQL Server can issue IO asynchronously and at high queue depth. For a database distributed over many disk volumes, many IO channels, and very many HDDs, SQL Server can simultaneously drive IO to all disks, leveraging the full IO capability of the entire storage system, in this case 25-25K IOPS at low queue depth.
Apparently, this is not true for lob (and row-overflow) pages. A quick investigation shows that IO for lob pages is issued synchronously at queue depth one per thread (or DOP). This means each thread issues one IO. When the IO completes, it does the work on the data, then proceeds to the next IO. Given the data was distributed over limited portion of the disk, the actual access times was in the 2.5-3ms range corresponding to 300-400 IOPS. This is slightly less than 10K HDD theoretical random access comprised of 3ms rotational latency plus 3.5ms average seek for data distributed over the entire disk for 150 IOPS.
SET STATISTICS IO shows “lob physical reads 14083, lob read-ahead reads 39509”, indicating read-ahead reads, but this might be for the addition pages of a field larger than 8KB? There is some scaling with degree of parallelism in the IO access to LOB pages, as it appears that the synchronous IO is per thread. On this particular system, the decision was made to set MAXDOP at 8 out of 20 physical cores, 40 logical to better support multiple concurrent search queries. A query specifying LOB data can generate on order of 1500-2000 IOPS at DOP 8, perhaps 4000-6000 at DOP 20.
It was noticed that the time to generate a Full-Text index on a table with 20M rows and 500GB of LOB data took several hours. This would be consistent with a single thread running at 400 IOPS (not all rows have LOB pages). I do not recall any single CPU being pegged high during this period, but a proper observation is necessary. It could be that Full Text indexes creation will be much faster with better IO to the LOB pages (and parallelism?).
It so happens that kCura Relativity seems to run with most normal in-row pages in memory, generating little disk activity. In this case, one possible solution is to use the TEXTIMAGE_ON specification to place LOB in a file group on an SSD array. In principle the in-row data is more important, but these will mostly like be in-memory, hence not need the fast disk IO. The LOB pages that cannot benefit from proper asynchronous IO at high queue depth operation is placed on the low latency SSD. This is the reverse of putting important data on the more expensive storage media, but it suits the actual mode operation?
Even better is for Microsoft to fix this problem. My suggestion (and I know how the SQL Server team loves to hear my advice) is to use the index Included Columns feature to support pointers to LOB pages, instead of the column value itself. Naturally it does not make sense to include the LOB column because it would be stored out of page, as in the original table.
The reason disk IO to lob pages is synchronous is that each thread must work its way into the row to find the file-page pointer? Nowhere is there list of pointers similar to the nonclustered index header. The advantages would: 1) having the list of pointers available in a compact structure that can be scan quickly, and 2) support asynchronous IO. The index row header would point to the lob pages, instead of back to the table? In order of the cluster key? Or some other column? Naturally it is necessary to understand how to SQL Server engine handle LOB IO to come up with the right solution.
At PASS 2013, there were two sessions on row and page organization: DBA-501 Understanding Data Files at the Byte Level by Mark S. Rasmussen, of iPaper A/S (improve.dk)
and DBA-502 Data Internals Deep Dive by Bradley Ball, .
The Rasmussen session also details LOB organization.
There will probably be consequences for delving into such a charged topic. But this has never stopped me before and I am too old to change my ways. So here goes. Many of us have of the problems with the online HealthCare.gov even though technical details are not yet available.
First let me say that I am not singling out this particular project, as severe problems are common in government IT projects, and I suspect the incidence rate may be no different in private industry.
Whereas other projects may hide failures with schedule slips or otherwise do not attract broad news coverage.
This project involves the general public, is a politically charged topic and had a hard deadline, so there is no escaping public scrutiny.
Some special aspects of this project are that it involves interaction with third-party health care insurers (and providers?) and the fact that each state in the US has their own health care laws and licensing? It has been reported that the complete system was not tested as a whole until one week before the public launch deadline of Oct 1. And that despite flaws found in late testing, the agency in charge decided that is was necessary to launch on the prescribed date regardless of known and possible issues instead of announcing a delay.
There are serious issues with the overall manner in which the government handles IT projects that seem to make either serious deficiency in meeting the objectives or outright failure the expected outcome. For now, I will comment on the patterns that I have seen in other government projects. A serious problem in all government projects I have been involved with, is the lack of understanding or appreciation of the different grades of technical talent.
The government has no problems in spending (I cannot call this investment as that would imply a reasonable expectation of positive return) hundreds of millions or in the billions of dollars on projects. (Note: the cost of these projects is not the subject of contention.)
I assume (but I do not know) that the top managers are compensated accordingly for large project responsibility. It is in the high technical positions where it seems that the government is unwilling to offer a proper pay scale.
A large project needs a large staff of people, be it permanent employees or contractors.
(per comment by Ryan below, large IT projects have the budget for a large staff, be it necessary or not. I was at a company years ago that did have large projects. It was acknowledged that the project could have been done better with fewer people if there was technology to clone certain people. But as such, it was done by a large team with a few seriously top people.)
The general expectation is that much of this staff should have several years of experience. It is not a requirement to have people with little experience. But if there is not, then there will be a shortage of experience personnel for future projects. The key is in having a number of very talented and experienced people to provide technical leadership, to ensure that the work of the staff is built on a solid foundation.
Of course top technical talent costs top dollar. Every time I get called on a government project, I hear that they cannot pay at a rate above that suitable for a senior developer. It is possible that there are government rules on how many high salary slots are allowed in a project and that these have already been assigned. It might also be possible that the government imposed such rules as a guard against fraud.
But without top talent, it is very likely that a high cost project created by a large team of middle to senior developers will have serious problems due to being built on questionable foundations. So it seems that our government would rather stick to their project management rules even though it means critical projects end up as very public failures.
Addendum 23 Oct
Consider the example of the military procurement system. There are so many messed up rules and other red tape meant to prevent fraud? or serve some forgotten purpose?
In the end, a project costs tens of billions of dollars, and by the time it is complete 10 to 20 years later, some combination of 1) it does not do something useful, 2) it is too expensive for what it does and 3) so much time has elapsed that the original requirements are no longer valid.
In the two majors interventions of the last decade plus, when it was brought up that the troops needed equipment not previously in the system, and that could be developed, Congress has the good sense to set aside most of the rules, saying just do it. The necessary stuff got done and done quickly. True some money was wasted, but it is also important the rules that are meant to prevent waste do not also defeat the ultimate objective.
Competitive bidding is a good idea for known entities.
I regard it as nearly totally stupid for IT development projects.
It might be better to set an annual budget for development, which should determine the size of the team, and a timeline.
In new projects, a prototype with moderate functionality should be up within 12 months if not sooner. In any case, something of value needs to reach production within 2 years or there is a risk of a project out of control, lacking cohesion.
I do believe that our federal government has an alternative,
as there should be several concurrent projects in development among the many agencies.
These should be judged against each other,
with successful contractors getting assigned more work.
Addendum 24 Oct
It is generally known that the high executives can regurgitate IT catch phrases on demand.
Something amusing quotes reported on Fox News covering the congressional inquiry:
Senior Vice President Cheryl Campbell also said in her prepared remarks that "no amount of testing" could have prevented the site's problem-plagued start.
... a system "this complex with so many concurrent users, it is not unusual to discover problems that need to be addressed once the software goes into a live production environment."
"No amount of testing within reasonable time limits can adequately replicate a live environment of this nature,"
In fact testing did find errors that were not fixed prior to release (of course testing was just before release.
So lesson for those aspiring to be executives: learn to make statements that are generally accepted to be true, preferably irrelevant to the actual root cause(s).
Optum/QSSI blamed in part a "late decision" to require customers to register before browsing for insurance, which could have helped overwhelm the registration system.
"This may have driven higher simultaneous usage of the registration system that wouldn't have occurred if consumers could window-shop anonymously," said Andy Slav
This is true. Still a modern server system can support thousands of connections.
It is important to run many web server instance/processes, so that each can be recycled without affecting too many users.
My own limited experience with web servers is that the session state system is more likely to be the source of a problem - requiring process restart. So if we did not implement session state, it is less likely we would need a restart in the first place.
Addendum 25 Oct
Centers for Medicare and Medicaid Services (CMS) - under the Department of Health and Human Services is the government agency handling the HealthCare.gov web site/app. In this case, they acted as the project lead, as no contractor had project lead responsibility (and authority), so they are responsible for the top level decisions. CGI Federal handled most of the work. QSSI handled the registration element with identity management? They owned up for their part of the project problems.
CGI cited inadequate testing, started only 1-2 weeks before Oct 1 go-live, that it should have been months of testing. (Of course months of testing between code-feature complete and go-live is totally stupid. But it is correct to conduct months of testing during development.) And CGI did not mention this in September.
I do not find that the testimony given by the contractor executives at the congressional hearings provides meaningful insight or get to the key elements. In practice, the only way to get the whole truth and nothing but is under a no-blame inquiry.
What I would like know is what the original project plan schedule was. When was the date for code completion? the original testing period, and was there an allowance for bug fixes and retesting?
I think the test, bug fix, re-testing cycle should be about 4 weeks. Of course there will still be bugs remaining that are fixed after go-live, which is why the CGI testimony of months of pre-launch testing is full of sh!t.
It is perfectly reasonably to be fixing bugs found in testing as well as making minor performance fixes.
But if they had no idea of what the performance characteristics was going to be until post-code freeze testing, then that means the architects were incompetent, blindly building a project on stupid principles with no connection to performance, see my blog
What almost always happens is that code-feature completion is late, and the project leads decide to compress the test-bug fix cycle to preserve the launch date. This never works because the test cycle was already tight to be with because they thought only 1 bug-fix retest cycle was necessary. Two might be workable.
Addendum 31 Oct
CBS News cites the cost of the website development at $118M and $56M for IT support.
Various states have collectively spent more than $1B on their own health care/insurance websites?
Health and Human Services (HHS) which oversees CMS, secretary Kathleen Sebelius testified that the contractors never asked to delay the sites launch, while CGI testified that there should have been months of testing. Marilyn Tavenner, head of CMS, testified that she had no idea there were problems prior to Oct 1?
even though internal testing prior to launch showed severe problems.
Lesson for those aspiring to be (government) executives: don't listen to the people who work for you, they can't have anything important to say.
Ms. Sebelius said to "Hold me accountable for the debacle". I am not a fan of firing someone every time a project fails. I have seen this before, and the next project still fails. Even Stalin eventually learned that shooting generals does not inspire the next general to victory.
Also HHS oversees several agencies, so I think the head of CMS should be accountable.
CBS News sebelius-blames-contractors-for-healthcare.gov-problems
Addendum 1 Nov
The Healthcare.gov people cited big numbers (4.7M) for the number of visitors in the first few days, but declined to state exactly how many people actually purchased health insurance. Ms Sebelius stated that the numbers were unreliable, and it will be mid-Nov before they are certain.
It now comes out that 6 people completed enrollment the first day and 248 by the end of the second day.
So I presume that HHS/CMS did in fact know exactly how many enrollments there were on each day but did not want this to become public knowledge, and lied about it.
Just yesterday I said that firing people for failure does not prevent the next failure.
However lying pretty much ensure the next project will be a failure.
It is being reported the tech "big guns" are being brought in to help fix the problem.
There is a Google person for reliability, even though Google did not provide software for this project.
Oracle is also sending people, and presumably this means that there is an Oracle back-end.
Let me say that I do not have technical issue with the core Oracle engine. It is indeed very sophisticated. There are also some extremely talent Oracle DBAs out there. Back about 12 years ago I noted that the best of the tech writings on Oracle were of greater depth than for SQL Server. (Since then, there has been good quantitative technical analysis on SQL Server.)
On the SQL Server side, there are many "accidental" DBAs. This might be a person foolishly inquiring as to who was the DBA for a particular SQL Server back-end. The boss thinks for a moment, and then says you are! That said, the accidental SQL Server DBAs are not afraid to admit they know very little about being a DBA and need to learn.
On the Oracle side, I have already said there is serious talent.
But there are also very many light weight DBAs.
Without making a generalization,
some of these think that because they are using a very sophisticated product,
then they are sophisticated too.
At the last Oracle World, six or seven years ago, one the top Oracle performance experts was explaining to the audience that an application making excessive hard parses (SQL compile) can scale. It is essential to have plan reuse.
On more than one occasion, I have seen an Oracle DBA who seem to be superficially versed in the latest Oracle big gun features, but absolutely clueless on fundamentals.
I was asked to assist in helping major hotel management company do an Oracle & SQL Server comparison for their new reservation system (being ported from Informix).
When I got there, everything indicated that the project team wanted Oracle and only wanted to show that Oracle was superior.
So they were using advanced Oracle features in the belief that this would give a decisive advantage. However their SQL coding was total crap. They used PL/SQL which may actually implement a hidden temp table with inserts, when in fact the query could have been written with straight SQL.
I had been told that this was supposed to be a like for like comparison, but since it was obvious this was not the case, I felt no obligation to point out their crappy SQL, while implementing correct SQL on my side (I also tightened up the data types to reduced table size).
They were shocked when SQL Server obliterated Oracle in performance tests.
They spent 2-3 weeks checking the individual query results to be sure I was returning identical rows. Only then did they finally bother to look at the SQL. In the end, with the Oracle system using correct SQL, the results were about the same, and they recommended Oracle.
Addendum 22 Nov
Apparently outside consultants were hired in early 2013 to assess the HealthCare.Gov project.
It was assessed that there were serious issues?
but none of this was reported to Congress.
It might be accepted that the project team would put on a public face of all is well,
but failing to disclose the truth to a Congress should be a more serious matter?
I have complained about lack of
queue depth control in SQL Server IO
before and I am complaining again. I fully understand the desire to restrict the number of knobs, which used incorrectly can cause more harm than good.
The main point about IO queue depth is that there is not a single right answer for all circumstances but rather depends on both the hardware environment and the purpose.
One example is a transaction processing server that is also used for reports, or perhaps queries that behave as a report.
The transaction may involve a series of queries each for a single or few rows. These would be issued synchronously at queue depth 1.
A report-type query such that the estimated involves 25 or more rows would issue IO asynchronously at high queue depth. In principle, this would increase the storage subsystem throughput by allowing the hard disk to employ elevator seeking. But it also increases latency.
The report query completes faster despite the higher latency because it issues multiple IO at once. The transaction query issues 1 (sequence of) IO, and waits for it to complete before issues the next.
So why complain now? I am testing a storage system for a data warehouse environment.
I noticed that for a table scan, SQL Server was not issuing IO at sufficient high queue depth when the data files are distributed over 16 or more volumes (RAID groups).
SQL Server issues 512K IO at high queue depth for 8 or fewer volumes.
The onset of high queue depth IO seems to occur when degree of parallelism (DOP) is equal or higher than the number of files.
The throughput on each volume can be up to 600MB/s, the volume being comprised of six 10K HDDs in one case and six 7.2K HDDs in another.
With 8 volumes spread over 4 RAID controllers, it is possible to for a table scan to generate IO at approximately 4GB/s.
Each RAID controller, HP P812 (this is an older model) in this case, can support 2GB/s IO. This is rather low considering the PCI-E slot is gen 2 and x8 wide, supporting 3.2GB/s.
This might be because the memory on the RAID controller is 64-bits (8-byte) wide and operates in 800MHz for a nominal bandwidth of 6.4GB/s.
That's a lot right?
Well not if a disk read is written to memory first, then read back to be sent to the server (where the memory write-read sequence happens again).
SDRAM, including DDR derivatives, write at half the read rate.
So the effective write-read throughput is one-third of the memory bandwidth.
LSI controllers have similar (or less) bandwidth to external memory, but can drive higher bandwidth. There may be some other effect, perhaps the controller's internal memory?
Anyways, with 4 controllers each capable of 2GB/s, the whole system should support 8GB/s?
Why 4GB/s? It turns out there is a setting in Resource Governor named REQUEST_MAX_MEMORY_GRANT_PERCENT which can "improve I/O throughput" at high degree parallelism.
(oh yeah, the applicable DMV is dm_resource_governor_workload_groups)
Well simply issuing IO at sufficient queue depth so that all volumes are working will do this. There is no reason to be cryptic.
In summary. there is no a priori single correct setting on queue depth and IO size for any or all circumstances and certainly not for different hardware configurations.
To favor any specific purpose, SQL Server must know how many hard disk drives (or SSDs) that each volume is comprised of, and whether these are dedicated or part of a shared pool.
The operating system cannot see beyond what the RAID controller presents, so we must provide this information manually.
Next, SQL Server must know what the mission is. Is the mission to keep latency low for transactions? Or is the mission to flood the queues for maximum throughput in batch processing? Or something in between?
It is also disappointing that SQL Server has only a single table scan IO size, such that any RAID volume not at a specific number of disks x RAID stripe size is sub-optimal.
Below is the disk IO generated by the TPC-H LINEITEM table scan after DBCC DROPCLEANBUFFERS, SQL Server already has memory allocated from the OS (see previous post on Lock Pages in Memory)
this is not working for some reason? OK - missing slash (2014-03-08)
table scan disk IO
For several years I have made my (free) SQL Server performance tool
ExecStats publicly available
I have just recently improved existing or added new features (build 2013-09-23 or later)
that could be useful so I am asking people to give it a try
and please do send feedback.
I starting working on ExecStats when SQL Server version 2005, for the first time,
made sufficient information available via the
Dynamic Management Views
(DMV) and functions to do a pretty good job of performance analysis.
The new features have been added to the DMV's over the more recent versions
have further improved the ability to assess SQL Server health.
In SQL Server 2000, we had to use Profiler to start a Trace and collect data for a sufficient
period of time. The information from various DMVs are available at any given point in time,
and depending on the degree to which execution plans are retained in the procedure cache,
could provide a reasonably accurate assessment of the workload on SQL Server.
Of course Profiler and Trace are still useful in the exception cases
(and Microsoft wants to move to
Extended Events going forward).
Two starting points in performance tuning with DMV's are
which keep execution statistics for entries in
the plan cache and
which keeps index usage statistics.
These avenues can be pursued independently, but working both together is best.
The dm_exec_query_stats view has sql and plan handles that links to DMFs for the SQL
and the execution plan
Inside the XML plan is information such as which indexes are used by each SQL statement
and the access method (seek, scan, lookup, update, etc.).
It should quickly evident that while SQL is a great language for data access,
it is not suitable for step-by-step processing, for which there are procedural programming languages
that are designed for this purpose.
This is why ExecStats is a C# program and a not a massively foreboding morass of dynamically generated SQL.
ExecStats parses top execution plans to build a cross-reference of
index usage by SQL statement.
Some other tools and scripts may attempt to parse 5 or 10 execution plans.
ExecStats default is 1000, and can parse the entire contents of the plan cache if desired.
In working from the index usage stats alone, it is possible to identify unused indexes.
It is also reasonable to guess that indexes on the tables with the same leading might be consolidated.
However it is not certain and no infrequently good assessment can be made with infrequently used indexes.
By building the full index usage to SQL statement cross-reference map via the execution plans,
it is possible to determine where each index is used.
This allows a reliable determination for the minimum set of good indexes.
Now that the capability to parse execution plans has been established,
the scope can be expanded because there are other sources for execution plans.
One option is to simply get a list of all the stored procedures in the database
to generate the estimated execution plans (for NULL or default parameter values).
I did this once for a client, and from the trapped error messages,
there were nearly one hundred procedures that referenced tables which no longer existed.
These were obsolete procedures that no one had identified for removal even though
the underlying tables had been dropped.
Another option is to maintain a list of SQL,
which could be stored procedures with parameter values
either for generating the estimated execution plan only
or be executed for the actual execution plan which has additional information.
In addition, certain stored procedure could be tested multiple times to expose the impact of compile
parameters or different code paths from to Control-of-Flow keywords.
ExecStats automates the data collection for all of this.
Results can be collected and archive on a regular basis and especially before and after code or other changes.
It is especially important to consider that execution plans can change with both compile parameters
and data distribution statistics.
When there are complaints of poor performance, one could investigate for top resource consuming SQL,
but it is immensely helpful to have previous characteristics for comparison.
This includes execution statistics, the execution plan, compile parameters, indexes,
and even the data distribution statistics last update.
In 2012, I integrated a previously separate program for performance monitoring into ExecStats.
The key point in ExecStats performance monitoring is that important system and storage architecture
details are captured and incorporated into the display.
Too many of the third party tools were obviously developed on a desktop with 1 processor and 1 disk.
It is clear that the tool was built by someone who got a list of counters with "best practice" parameters
with very little real understanding of what any of it actually means.
It is certainly not someone who actually solves SQL Server performance problems on a day-to-day basis,
or has even done it at all.
There are two recent additions in 2013-Sep.
One pertains to index fragmentation, and the other to fn_virtualfilestats.
Technically we should use the DMF
which replaces DBCC SHOWCONTIG, to assess fragmentation.
However the effort to run dm_db_index_physical_stats on large tables is substantial.
There are several pieces of information that can be used to provide a clue on the state of
fragmentation, but only for nonclustered indexes without included columns.
The DMV dm_db_partition_stats tells us the number of rows and the size of the index,
from which we can calculate the average bytes per row.
Next, DBCC SHOW_STATISTICS tells use the average key size,
which includes both the nonclustered index key portion and the clustered index key portion.
Keeping in mind that the row overhead is 10-12 bytes,
we can compare the value calculated from dm_db_partition_stats
to the key length from statistics plus row overhead to determine if this is reasonable.
See Paul Randall at SQL Skills
for more on the record overhead, which also applies to index records.
The previous versions of ExecStats display index row count and size in the Index Usage tab,
and the Average Key Length in the Dist Stats tab.
The newer version make a copy of this in 3rd right most column in Index Usage as AvKyL.
For now, compute the 1024*Used KB/Rows to compare with AvKyL.
The function fn_virtualfilestats provide useful
file IO statistics at the SQL Server file level.
In a complex environment, there are many files for the key databases spread over many volumes
on the storage system (each seen as a physical disk by the Windows operating system).
It is a simple matter to rollup the file IO statistics by database or filegroup.
It was not simple to rollup the file IO statistics on a volume basis
because complex storage systems are typically implemented with mount points.
The new (as of SQL Server 2008 R2) DMF
provides the mount point.
So now ExecStats rolls up file IO by filegroup, database and volume.
Below is the previous version of Databases, showing database size information.
The new file IO rollup by database is at the far right, as below.
The new file IO rollup by OS Volume is in the new Volumes tab.
So feel free to give ExecStats a try and please send feedback.
If any one would like to volunteer to write proper documentation, that would be especially appreciated.
This is purely a question and not a proposal, as I know how Microsoft is so very thrilled every time I have a suggestion. Could the Parallel Data Warehouse concept work on a single NUMA system?
Technically all multi-socket systems today are NUMA because the processors now have integrated memory controllers. The path to memory is different between local access and memory attached to a
differentanother processor socket. But I am more interested in 8-socket, and potential 16+ socket systems. The single system definition here is that of a single operating system image. Some large systems allow partitioning to support multiple OS images to be running, but that configuration is not discussed here.
Both Windows Server and SQL Server are NUMA aware in that there are distinct memory nodes (that are closer to the local processor cores than remote processor cores). But there is no workable mechanism to fully benefit from knowledge of the underlying hardware organization. Say for example we have a large table resident in the buffer cache evenly distributed across the nodes. We would like for a connection to SQL Server issue a query to this table to be farmed out to one thread on each node with that thread accessing only rows stored locally.
But this is not what happens. More likely, starting from an empty buffer cache, a query accesses the table. The table is initially loaded into memory on the same node as the connection, then switching to other nodes? I have no idea what actually happens so it would be nice if someone would like to investigate this. Now another connection that happens to be assigned to a thread on the different node would have to cross nodes to remote memory for this same query? So the expectation is that there would be apparent inconsistent variations in query time depending on both the node distribution of the participating buffer cache pages and the node affinity of the execution thread.
In the PDW environment, there are multiple compute nodes, each a 2-socket system running its own Windows operating system, and (modified?) SQL Server instance\process with fact tables partitioned by node and dimension tables replicated to all nodes. A front-end sends a query to each node which processes data on its node. The purpose of this architecture is to allow scaling the computational resources (CPU, memory and IO) beyond the limits of a single system. But it also happens to achieve a degree of locality optimization.
The limitation of the PDW concept is that any single query is a more involved process. Furthermore, the query execution plan elements cannot cross nodes without a remote server mechanism. Even with the InfiniBand protocol, communications between nodes are expensive and long latency relative to the processor cycle time scale. This is why PDW is Parallel Data Warehouse and not Parallel Database.
So now what? My question is can we apply the PDW concept to a NUMA system, one big 8-socket NUMA system running a single Windows operating system. Instead of PDW with a SQL Server instance on each system (PDW compute node), there is now a SQL Server instance localized to each node of the NUMA system. Communications between processes on a system is lower latency than across systems over InfiniBand.
Can we go further than this? How about bringing the Parallel concept inside to a single process? Now each “node” is a collection of threads affinitized to a node on the NUMA system. Communication between threads is even faster than between processes. And now, potentially execution plan elements can cross “nodes”?
We might ask why bother. The nature of server system is that the 8-socket system is more than twice as expensive as four 2-socket system. In addition, the 8-socket system may have processors one generation behind the 2-socket system. The fact is that by the time we assemble a collection of systems or one big system, the full costs of hardware are irrelevant to a really important mission. The real question is which meets the objectives better.
Below is a completely separate discussion but is on the topic of NUMA systems.
A Better NUMA System Memory Strategy?
As far as I know, Windows or SQL OS creates 2 memory nodes from processor node 0, then 1 additional memory node on each of the other nodes. One of the memory nodes on node 0 is for generally allocations. All other memory nodes, including the second one on node 0, are for NUMA aware allocations? The immediate problem is that the memory nodes are asymmetric in size.
There is one possibility to consider for system not at maximum memory. This is to put maximum memory in node 0 (good luck figuring out which one this is) and the normal memory in the other nodes. But of course, the nature of component pricing today is that is makes more sense to fill the DIMM sockets with the second largest capacity memory module, today 16GB. Today, Crucial shows 16GB ECC DIMMs at $200 each and 32GB DIMMs at $800 each. By sometime in the next year or so, expect the 32GB DIMM to trends towards $400 and a new 64GB DIMM to introduce at several thousand dollars initially.
HP-UX on the other hand has a very clever memory strategy, which is to configure both interleaved and node memory, or LORA in their terminology. Say for example that there are 8 nodes and 1TB memory total, 128GB per node. Interleaved memory totaling 128GB is allocated from all nodes at 16GB per node. The remaining 114GB on each node are formed into node local memory. The addressing scheme might be something like this. For interleaved portion in the first 128GB, the memory is interleaved in 256 bytes stripes across the nodes. The first 256 bytes is from node 0, the next 256 bytes is from node 1, and so on before wrapping back to node 0. Above this are separate 114GB contiguous chunks on each node.
This arrangement recognizes the fact that not all consumers want local node memory. Now the NUMA local memory nodes are uniform in size. I happen to think this is an excellent idea. I do not know if there is an institutional bias at MS against ideas not invented in happy valley.
Edit based on Lonny Niederstadt's comment
on further thought, I think it might be better if the OS did not interleave a single memory page (4KB on X86) across NUMA nodes, perhaps even for the 2MB large page mode, but not sure on the 1GB page.
The HP-UX scheme allows the sys admin to adjust the interleaved / LORA ratio, with 1/8:7/8 being recommended. This would allow NUMA to be turned off.
The SQL OS could also override the Windows scheme, simply by choosing how it allocates pages.
For NUMA controls, the default should be off, so people do not encounter anomalies unless they make the decision to go for it.
I am thinking on a transaction server, the scheme should be to prefer allocating buffer cache pages from the local node memory (of the thread making the call). The app server would connect over different port values. SQL Server would have affinity set by port. So each group would be likely to have their data in the local node.
For DW, the strategy should be to round-robin extents (or large pages?) by NUMA node. Even better would be if the clustered index could be on a hash key. Then two tables on the hash key, ORDERS and LINEITEM for example, would have the same node as preferred buffer location.
Taking a step further, if storage were also distribute by node, and the FG has on or more file on each node, then the preferred file location matches. And the index rebuild would reset to match this scheme.
There should be some parallelism options beyond MAXDOP. One would be Local node. So parallel execution should use local node cores up to the socket limit. Another would be Global, so allocate 1 thread per node, then round robin. An independent option would be Hyper-Thread or HT, in which both or all logical processors of 1 core would be allocated before moving to the next core. Default of course would be off, but this could be a powerful tool for investigating HT issues.
Perhaps to local node parallelism option should be default for the OLTP scheme, and the Global be default for the DW scheme
I am thinking something needs to be done about processor affinity bit mask. There should be 3 values, which could be done with 2 sets of bitmasks. One is preferred processor affinity. The other is secondary affinity, and finally excluded.
This might help manage multi-instance, each instance could get separate preferred, but the secondary would allow activity on other nodes as well.
Since you mention CPU per node, I will plug my ExecStats tool,
which is PerfMon mode display individual CPU, annotating the node boundaries. Note the 3rd party tools vendors - learn what a real server is, not the developer's desktop!
The high CPU on node zero might be related to this
FIX: CPU spike when there is no load on a server after you install SQL Server 2012 on the server, sp1 cu3.
Now having thought about it some, I think this should be a request,
we could call it SQL Server Data Center Edition
The most important setting for SQL Server today is probably the Lock Pages in Memory option set in Local Security Policy, User Rights Assignment. Enabling this option on recent versions for the SQL Server user account in Enterprise Edition on systems with 8GB+ more memory also activates Large Pages. It used to be that Large Page support was activated by Trace flag 834, which required the Lock Pages in Memory right. As typical, Microsoft officially provides no guidance or assistance in trying to quantify the benefit of this setting.
I recall a presentation by Thomas Kejser at SQL Bits that mentioned this setting doubled performance in one particular situation. Based on the description of Lock pages in memory and large page allocation, it seems perfectly reasonable that these two setting should have meaning performance impact. So naturally I attempted to investigate this in my test environment, and found zero measurable different. I was certain that the principles in this were correct, but had to accept that I was missing something and left this matter alone.
Just now, I am doing an assessment of the storage performance for a system already in production. This system has several instances of SQL Server, and hundreds of other processes running, stopping and running. In the storage performance sequential IO tests, I wanted to look at a table scan reading from disk, both loading into an empty buffer cache and a full buffer cache forcing the eviction of other data. During the testing, I changed to the SQL Server max server memory several times between 10GB and 70GB, with the test query being a 60GB table scan.
The most pronounced behavior was observed when the SQL Server instance maximum server memory was raised from 10 to 70GB (with 10GB memory already allocated from the OS before the query start). The table scan query would cause SQL Server to start allocating additional memory from the OS up to the 70GB limit. The query executed at a horrifically slow rate, averaging 16MB/s. When SQL Server was not allocating memory from the OS, the table scan generated disk IO at 2.0 to 2.3GB/sec. This is for all combinations of loading into an empty or full buffer cache with and without eviction. The consumption rate with data in memory was just over 1GB/s per core showing excellent scaling from degree of parallelism 1 to 10.
None of the SQL Server instances had the Lock Pages in Memory right assigned, and hence all SQL Server processes were using the conventional memory manager. The operating system had been running for over three weeks with several SQL Server instances and hundreds of processes that started and stopped for various reasons. Overall system physical memory in use probably never exceeded 85% (of 1TB).
My interpretation of what happened was that the physical memory range had become heavily fragmented over the period with so many processes allocating and de-allocating memory pages (4KB). I would presume that the operating system can allocate 4KB pages from the free list much faster 16MB/s. When processes de-allocate memory pages, the OS must first zero out the page before returning it to the free list (Demand Zero Fault/s). There must be something that makes memory allocation very expensive, considering that the OS must track 256M x 4K pages = 1TB.
This could why my test system showed no difference between Large Page Allocation and the conventional memory manager. The test system had only a single instance of SQL Server, few other running processes, and tests were usually conducted from a clean boot. Even if I had changed SQL Server memory up and down several times, nothing would have fragmented the physical memory? On top of the, the nature of the queries could have consumed more cycles than the underlying memory address translation for either 4K or large pages?
There are some implications in using lock pages in memory which automatically enables large pages allocations if the conditions are met. First is that memory is allocated on SQL Server startup. Second, the operating system cannot page locked memory. So the assumption is that the remaining memory is sufficient for the OS and other user processes. A side effect is that Task Manager only reports working set, and not the private memory of the locked pages.
I am speculating that it might be possible that if a SQL Server instance with lock pages were stopped, an attempt to restart later could fail if the OS cannot allocate sufficient contiguous 2M blocks. So this would require booting the system. So the locked pages option should be only used for servers dedicated to SQL Server. Careful planning on memory setting is needed for multi-instance.
Below is a good starting point on this topic