Joe Chang

TPC-H Benchmarks on SQL Server 2014 with Columnstore

Three TPC-H benchmark results were published in April of this year at the SQL Server 2014 launch, all using the new updateable columnstore feature. SQL Server 2012 had a non-updateable columnstore that required the base table to also exist in rowstore form. That feature was not used in the one published TPC-H result on SQL Server 2012. The benchmark includes two refresh stored procedures, one inserting rows and the second deleting rows. It is possible that the TPC-H rules do not allow a view to union two tables, and perhaps not a delete via the partitioning feature either (meaning the delete range must match the partition boundaries)? Another possibility is that SQL Server 2012 columnstore was considered a multi-column index, which is also prohibited to reflect the principle that the queries are ad hoc.

 

SQL Server Columnstore

First, a few quick words on SQL Server columnstore. Columnstore is not actually an index in the b-tree sense. The MSDN article Columnstore Indexes Described states that a columnstore index is "a technology for storing, retrieving and managing data by using a columnar data format, called a columnstore." In SQL Server 2012, it is called a nonclustered columnstore index, not because it is nonclustered or an index, but because the base table must also exist in traditional rowstore form. In SQL Server 2014, there is a clustered columnstore index, not because data is stored in order of an index key (there is no key), but because there is no rowstore table, just the columnstore.
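To make the distinction concrete, here is a sketch of the two forms; the table and index names are hypothetical:

```sql
-- SQL Server 2012: nonclustered columnstore index over selected columns.
-- The rowstore base table remains, and becomes read-only while the index exists.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_lineitem
ON dbo.LINEITEM (L_ORDERKEY, L_SHIPDATE, L_RETURNFLAG, L_EXTENDEDPRICE, L_DISCOUNT);

-- SQL Server 2014: clustered columnstore index. No column list and no key;
-- the columnstore replaces the rowstore as the table's storage, and is updateable.
CREATE CLUSTERED COLUMNSTORE INDEX cci_lineitem
ON dbo.LINEITEM;
```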

 

TPC-H Details: Date Columns and Query SARGs

The full details of the TPC-H Decision Support benchmark are described on the TPC.org website. There are a few details of relevance to the use of columnstore. The largest table in TPC-H is LINEITEM, which has 3 date columns, ShipDate, ReceiptDate and CommitDate, and is clustered on ShipDate. The second largest table is ORDERS, clustered on its one date column, OrderDate. These two tables are joined on OrderKey. There is correlation between values in the date columns of these two tables, some natural, and others based on reasonable business conditions. ShipDate must obviously be greater than OrderDate, and is also no more than 121 days greater than OrderDate per the benchmark specification built into the data generator. CommitDate is within -89 to +91 days of ShipDate. ReceiptDate is 1 to 30 days after ShipDate. The date values range from Jan 1992 to Dec 1998.

There are 22 Select queries in the TPC-H benchmark, along with the 2 refresh stored procedures. Many of the Select queries specify a date range on one of the date columns, or a lower bound on one date column and an upper bound on a different column. Ideally, for queries that target rows in a limited date range, we would like upper and lower bounds on the cluster keys of both the ORDERS and LINEITEM tables, OrderDate and ShipDate. However, the TPC-H rules do not permit re-writing the query SARGs based on inferable knowledge.

That said, apparently the rules do not preclude the query optimizer from discovering such knowledge. One of the other RDBMSs was probably first to do this, and Microsoft followed suit in order to be competitive in the TPC-H benchmark with the Date Correlation Optimization feature in SQL Server 2005. Personally, I am not aware of any production system using this feature. Realistically, any organization that was having query performance issues related to date range bounds would probably have directed the analyst to apply appropriate date bounds on the cluster key. This is most probably a benchmark-specific optimization feature.
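For reference, the feature is a database-level setting; a minimal sketch, assuming a database named tpch:

```sql
-- Enable date correlation statistics; SQL Server maintains them between
-- pairs of tables linked by a foreign key where both have datetime columns.
ALTER DATABASE tpch SET DATE_CORRELATION_OPTIMIZATION ON;

-- Check the current setting:
SELECT name, is_date_correlation_on FROM sys.databases WHERE name = 'tpch';
```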

The date correlation optimization statistics presumably do not exist when using clustered columnstore, because there is no underlying rowstore table with foreign key relations? Do the date correlation statistics exist, and get used, when the rowstore tables with foreign keys carry a nonclustered columnstore index?

 

TPC-H on SQL Server 2014 with Columnstore

That said, let us now look at the 3 new SQL Server 2014 TPC-H results published using the new clustered columnstore indexes. One is from IBM at Scale Factor 1000 (1TB) and two are from HP, at 3TB and 10TB respectively. The new results are compared with prior results using traditional rowstore on previous versions of SQL Server and previous-generation server systems.

Because columnstore is not really an index in the b-tree sense, and given that the queries frequently involve date ranges, it is presumably important to use partitioning with columnstore. The three new TPC-H reports on SQL Server 2014 partition both ORDERS and LINEITEM, by OrderDate and ShipDate respectively (the cluster keys in previous versions), with a partition interval of 1 week (7 years x 52 weeks per year = 364 partitions). Perhaps of interest, the scripts show that a partitioned rowstore index is first built before building the partitioned clustered columnstore index.
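One plausible reading of those scripts, sketched below with hypothetical names (the real scripts enumerate all 364 weekly boundaries):

```sql
-- Weekly partition function and scheme for LINEITEM on L_SHIPDATE.
CREATE PARTITION FUNCTION pf_week (date)
AS RANGE RIGHT FOR VALUES ('1992-01-08', '1992-01-15', '1992-01-22' /* ... through 1998 */);

CREATE PARTITION SCHEME ps_week AS PARTITION pf_week ALL TO ([PRIMARY]);

-- Build the partitioned rowstore clustered index first...
CREATE CLUSTERED INDEX ci_lineitem ON dbo.LINEITEM (L_SHIPDATE) ON ps_week (L_SHIPDATE);

-- ...then replace it in place with the partitioned clustered columnstore index.
CREATE CLUSTERED COLUMNSTORE INDEX ci_lineitem ON dbo.LINEITEM
WITH (DROP_EXISTING = ON) ON ps_week (L_SHIPDATE);
```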

 

TPC-H at SF 1000 (1TB)

The new TPC-H 1TB result on SQL Server 2014 using columnstore is compared with 3 previous results on SQL Server 2008 R2. There is a difference in memory configurations between the four systems below. For SQL Server 2014 with clustered columnstore indexes, the TPC-H SF1000 total database size is just under 430GB, so with 1TB memory, the benchmark is running entirely in memory after the initial data load, with the exception of hash operation spills to tempdb.

Vendor  System    Processor   GHz  Sockets  Cores/Threads  Mem(GB)  SQL     QphH     Power    Throughput  Date
IBM     x3850 X6  E7-4890 v2  2.8  4        60/120         1536     2014    519,976  695,445  388,779     4/16/14
HP      DL980 G7  E7-4870     2.4  8        80/160         2048     2008R2  219,888  233,119  207,407     8/30/11
IBM     x3850 X5  E7-8870     2.4  8        80/80          2048     2008R2  173,961  200,889  150,635     5/20/11
Cisco   UCS C460  E7-4870     2.4  4        40/?           1024     2008R2  134,117  156,157  115,188     12/7/11

For rowstore, the TPC-H SF1000 total database size is 1420GB, so the two systems with 2TB memory are mostly running with data in memory, again except for the initial load and spills to tempdb. There is definitely disk IO for data on the Cisco system at 1TB physical memory. The performance impact is noticeable, but probably not as severe as one might think based on how people talk about memory. The reason is that all of these systems make correct use of massively parallel IO channels to storage, capable of table scans at 10GB/s plus, and many also use SSD storage capable of more random IOPS than SQL Server can consume even at very high parallelism.

The new SQL Server 2014 result is 2.36X higher on composite score (QphH) and 3X higher in the Power test than previous versions with conventional rowstore.
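As a sanity check on the table above: QphH is defined in the TPC-H specification as the geometric mean of the Power and Throughput metrics, which can be verified directly:

```sql
-- QphH = sqrt(Power x Throughput); for the SQL Server 2014 row above:
SELECT SQRT(695445.0 * 388779.0) AS QphH_check;
-- approximately 519,975, matching the reported 519,976 after rounding
```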

The 22 individual query run times from the Power test at 1TB are shown below. The SQL Server 2014 result is the left item in the legend, labeled 4x60 (sockets x cores), here and in the succeeding charts as well.

[Chart: individual query run times, Power test at SF 1000]

Query 1, a single-table aggregation, is more than 10 times faster on SQL Server 2014 using columnstore on 60 cores (Ivy Bridge, 2.8GHz) than on 2008R2 using rowstore on 80 cores (Westmere-EX). Per TPC-H benchmark procedure, the test is run immediately after data load and index creation? The second largest speed-up is Query 16, joining 3 tables, at 6.4X.
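For reference, Query 1 (abbreviated here; the full text is in the TPC-H specification) is a single-pass aggregation over nearly all of LINEITEM, which is exactly the access pattern columnstore handles best:

```sql
/* TPC_H Query 1 - Pricing Summary Report (abbreviated) */
SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS SUM_QTY,
SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS SUM_DISC_PRICE, COUNT(*) AS COUNT_ORDER
FROM LINEITEM
WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as date))
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS
```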

Query 10 is 40% slower with columnstore, and Query 4 is about the same between columnstore and conventional rowstore. Both queries are listed near the end of the TPC-H portion of this article.

Notice that in the 3 SQL Server 2008R2 results, Query 2 becomes slower as the degree of parallelism increases from 40 cores (threads unspecified) to 80 cores/80 threads, and then to 80 cores/160 threads. Elsewhere I have commented that SQL Server really needs a graduated approach to parallelism instead of the current all-or-nothing approach.
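Short of a graduated approach in the engine itself, the degree of parallelism can at least be capped per query or per instance; a sketch:

```sql
-- Per-query cap via hint:
SELECT COUNT(*) FROM dbo.LINEITEM OPTION (MAXDOP 40);

-- Instance-wide cap:
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 40; RECONFIGURE;
```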

TPC-H at SF 3000 (3TB)

The new TPC-H 3TB result on SQL Server 2014 is compared with a previous result on SQL Server 2008R2. Here, the difference in memory is a significant contributor. The SQL Server 2014 system has more memory than the columnstore database, while the 2008R2 system has much less memory than the SF 3000 rowstore database (4.5TB).

Vendor  System    Processor   GHz   Sockets  Cores/Threads  Mem(GB)  SQL     QphH     Power    Throughput  Date
HP      DL580 G8  E7-4890 v2  2.8   4        60/120         3072     2014    461,837  631,309  337,859     4/15/14
HP      DL980 G7  X7560       2.27  8        64/128         512      2008R2  162,601  185,297  142,685     6/21/10

I am supposing that the reason the HP 2010 report configured only 512GB (128 x 4GB, priced $29,440 in 2010) was that there would not be a performance improvement at 1TB or even 2TB memory for the TPC-H 3TB result significant enough to justify the higher price (128 x 16GB priced $115K in 2011).

Several of the TPC-H queries involve nearly full scans of the large tables. If there is not sufficient memory for the entire database, then the next objective is to have sufficient memory to reduce the spill to disk in hash operations? HP may have elected for the better price-performance? Or perhaps someone just wanted to make a point: that it is important for the SQL Server engine to function correctly when heavy IO is required.

In the SQL Server 2014 result, the system has 3TB memory (96 x 32GB priced $96K in 2014) which is sufficient to hold the entire data set for TPC-H 3TB in columnstore.

The overall composite score is 2.8X higher with columnstore and 3.4X higher on the Power test.

The 22 individual query run times from the Power test at 3TB are shown below.

[Chart: individual query run times, Power test at SF 3000]

The largest gain with columnstore is Query 19, at 19.7X. Queries 4 and 10 show degradation, similar to the case at SF1000.

TPC-H at SF 10000 (10TB)

The new TPC-H 10TB result on SQL Server 2014 is compared with a previous result on SQL Server 2012. Strangely, supporting documentation for the HP 2013 report on SQL Server 2012 is missing, so there is no indication as to whether the nonclustered columnstore was used. I am guessing that columnstore was not used, because the results are in line with expectations for rowstore.

Vendor  System    Processor   GHz  Sockets  Cores/Threads  Mem(GB)  SQL   QphH     Power    Throughput  Date
HP      DL580 G8  E7-4890 v2  2.8  4        60/120         3072     2014  404,005  631,309  337,859     4/15/14
HP      DL980 G7  E7-4870     2.4  8        80/160         4096     2012  158,108  185,297  142,685     6/21/11

The full data size in columnstore at SF 10000 should be 5TB, and 14TB in rowstore, so there should have been heavy disk IO in both results.
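The rowstore versus columnstore size difference can be measured directly on one's own database; a sketch, assuming the LINEITEM table (sys.column_store_row_groups is the SQL Server 2014 catalog view for columnstore storage):

```sql
-- Total reserved/used space for the table:
EXEC sp_spaceused 'dbo.LINEITEM';

-- Compressed columnstore size per partition:
SELECT partition_number, SUM(size_in_bytes)/1048576 AS size_MB, SUM(total_rows) AS total_rows
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.LINEITEM')
GROUP BY partition_number
ORDER BY partition_number;
```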

The overall composite score is 2.55X higher with columnstore and 3.1X higher on the Power test.

The 22 individual query run times from the Power test at 10TB are shown below.

[Chart: individual query run times, Power test at SF 10000]

The largest gain with columnstore is Query 6, at 23.2X. Queries 4 and 10 show degradation, as in the two previous cases.

 

TPC-H Query 10

Below is Query 10. This query is consistently slower in columnstore relative to rowstore.

 

/* TPC_H Query 10 - Returned Item Reporting */

SELECT TOP 20 C_CUSTKEY, C_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE, C_ACCTBAL,
N_NAME, C_ADDRESS, C_PHONE, C_COMMENT
FROM CUSTOMER, ORDERS, LINEITEM, NATION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE>= '1993-10-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
L_RETURNFLAG = 'R' AND C_NATIONKEY = N_NATIONKEY
GROUP BY C_CUSTKEY, C_NAME, C_ACCTBAL, C_PHONE, N_NAME, C_ADDRESS, C_COMMENT
ORDER BY REVENUE DESC

The execution plan for rowstore at SF 10 is shown below.

[Execution plan: rowstore at SF 10]

The execution plan for columnstore at SF 10 is shown below.

[Execution plan: columnstore at SF 10]

Below are the details on ORDERS and LINEITEM from the rowstore plan.

[Plan details: ORDERS and LINEITEM operators, rowstore]

Notice that there are seek predicates on LINEITEM for 1993-09-20 to 1994-06-17. The actual range implied by the data generator is 1993-10-02 to 1994-05-02, that is, 1 day after the OrderDate lower bound to 121 days after the OrderDate upper bound.

Below are the details on ORDERS and LINEITEM from the columnstore plan.

[Plan details: ORDERS and LINEITEM operators, columnstore]

In columnstore, every operation is a scan. There is a predicate on the ORDERS table but not on the LINEITEM table. Presumably the storage engine must scan the entire set of LINEITEM partitions, while only scanning the ORDERS partitions encompassing the SARG date range.

I am thinking the reason is that, with date correlation on conventional row storage, the SQL Server query optimizer knows that the ShipDate range in LINEITEM is also restricted: from 1 day after the lower bound on OrderDate to 121 days after the upper bound on OrderDate.
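Outside of benchmark rules, the inferable bounds could simply be written into the query by hand. Below is a sketch of Q10 with the derived ShipDate SARG added; the extra predicate is logically redundant given the generator's rules, but it would enable partition (or segment) elimination on LINEITEM:

```sql
SELECT TOP 20 C_CUSTKEY, C_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE, C_ACCTBAL,
N_NAME, C_ADDRESS, C_PHONE, C_COMMENT
FROM CUSTOMER, ORDERS, LINEITEM, NATION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE >= '1993-10-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
L_RETURNFLAG = 'R' AND C_NATIONKEY = N_NATIONKEY
-- added: ShipDate is 1 to 121 days after OrderDate per the data generator
AND L_SHIPDATE >= dateadd(dd, 1, cast('1993-10-01' as date))
AND L_SHIPDATE < dateadd(dd, 121, dateadd(mm, 3, cast('1993-10-01' as date)))
GROUP BY C_CUSTKEY, C_NAME, C_ACCTBAL, C_PHONE, N_NAME, C_ADDRESS, C_COMMENT
ORDER BY REVENUE DESC
```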

TPC-H Query 4

TPC-H Query 4, below, is slower with columnstore than row storage in the 3TB and 10TB results. I am thinking that the reason is the same?

 

/* TPC_H Query 4 - Order Priority Checking */

SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT FROM ORDERS
WHERE O_ORDERDATE >= '1993-07-01' AND O_ORDERDATE < dateadd(mm,3, cast('1993-07-01' as date))
AND EXISTS (SELECT * FROM LINEITEM WHERE L_ORDERKEY = O_ORDERKEY AND L_COMMITDATE < L_RECEIPTDATE)
GROUP BY O_ORDERPRIORITY

 

See TPCH Query Plans for the TPC-H reference queries and execution plans at SF1 on SQL Server 2005. The parent page TPCH Interim has links for the SF1000 query plans with and without parallelism.

TPC-H Columnstore Summary

As with every other new feature, columnstore is a really interesting new technology. But think hard about what is really happening, experiment, and remember to collect good execution statistics and plans prior to making changes, then collect the new execution statistics and plans after the change.
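A minimal way to capture the before-and-after execution statistics mentioned above:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the query of interest here, once against rowstore and once against
-- columnstore, and compare logical reads, CPU time and elapsed time;
-- also capture the actual execution plan for each run.
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```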

One reason I prefer official TPC-H benchmark results over "real-world" comparisons is that the benchmark system is properly configured for both the before and after results. There is a significant difference between rowstore and columnstore in the data size involved for each query. If the reference system has a poor storage system (and how often have we seen this? It is practically guaranteed when the SAN vendor assisted in the configuration), then it is possible to produce almost any performance ratio.

 

TPC-E

The charts below show the progression of performance over time for the selected TPC-E results spanning Core 2, Nehalem, Westmere, Sandy Bridge and Ivy Bridge processors at 2, 4 and 8 sockets.

[Chart: TPC-E performance over time, 2/4/8 sockets]

For the 2-socket systems, West-1 is from the first set of TPC-E results reported for Westmere X5680 with HDD storage and West-2 is the later X5690 report with SSD storage. Both are 6-core Westmere-EP processors. The West-3 is the E7-2870 10-core (Westmere-EX) on SSD storage.

For the 4-socket systems, West-1 is on HDD storage, and West-2 on SSD, both 2K8R2 and 1TB memory. The West-3 is on Win/SQL 2012, 2TB memory and SSD storage.

The same data is shown below with reverse organization showing scaling with sockets for each of the processor architectures.

[Chart: TPC-E scaling with sockets, by processor architecture]

Notes (core x GHz per socket):
Nehalem: 2-socket 4-core, 2.93GHz (11.72); 4 & 8-socket 8-core, 2.26GHz (18.08)
Westmere: 2-socket 6-core, 3.46GHz (20.76); 4 & 8-socket 10-core, 2.4GHz (24)
Sandy Bridge: 2-socket 8-core, 2.9GHz (23.2); 4-socket 8-core, 2.7GHz (21.6)
Ivy Bridge: 2-socket 12-core, 2.7GHz (32.4); 4 & 8-socket 15-core, 2.8GHz (42)

Published Monday, June 09, 2014 9:39 AM by jchang


Comments

 

Scott Whigham said:

Great writeup, Joe. I tried to follow the links to view the execution plans but they 404. I think these are the 2008R2 versions:

http://www.qdpma.com/tpch/TPCH100_Query_plans.html

And a few of the missing ones might be these?

http://www.qdpma.com/tpch/tpch_interim.html

Some of the links on the interim page return a blank page though.

June 9, 2014 9:26 AM
 

jchang said:

On the tpch_interim link, in the section TPC-H SF 100 Execution Plans, you have to save target as. The links have the *.html extension but are actually sqlplan files. For example, note the name tpch_sf100_nonparallel_nc.sqlplan.html; save this as tpch_sf100_nonparallel_nc.sqlplan.

June 9, 2014 9:57 AM
 

tobi said:

It is an unfortunate limitation that CS indexes tend to be scanned entirely.

Segment elimination is hard to make work reliably.

Also, we can only have one CS index per table. We can't segment once on one dimension and a second time on another by just having two CS indexes that are roughly sorted in different ways (e.g. once on a date column and once on a status code column).

Basic requirements are easy to handle. For example, partition by date and you have reliable elimination. But what if there are two or more common search arguments?! CS lets you down.

June 9, 2014 9:58 AM
 

SQL Server 2014 News | phoebix said:

June 11, 2014 10:54 AM
 

Columnstore: Gooooooal!!! | Garrett Edmondson said:

June 22, 2014 7:00 PM
 

Thomas Kejser said:

Hi Joe

I am not sure your theory is correct in the case of Q10. It is noteworthy that the column store requires that the join with CUSTOMER is performed before the sort on revenue. The row store, on the other hand, can do a loop join (so can the column store, but that is not the plan you get, it seems).

This must mean that the sort buffer is significantly larger for the column store (as reflected in the plan estimates), which in turn can cause rather significant memory consumption. It is also noteworthy that the column store does not seem to push the return flag predicate into the storage engine.

With column storage segments storing the min/max of all values contained in each column, it is unclear if the date correlation provides any benefit that isn't already gained from the segment header.

Another odd thing about the column store plan of Q10 is that the join of LINEITEM/ORDERS is hashed, while the probe happens on CUSTOMER. Unless the predicate on RETURNFLAG is very selective (I don't recall), this is the wrong way around and may cause further spilling.

June 28, 2014 2:58 PM
 

Joe said:

Good input, we could test this theory by removing the non-batch mode operations from the query.

SELECT TOP 20 O_CUSTKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM ORDERS, LINEITEM
WHERE L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE >= '1993-10-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
L_RETURNFLAG = 'R'
GROUP BY O_CUSTKEY
ORDER BY REVENUE DESC

we could also try

SELECT C_CUSTKEY, C_NAME, REVENUE, C_ACCTBAL, N_NAME, C_ADDRESS, C_PHONE, C_COMMENT
FROM (
SELECT TOP 20 O_CUSTKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM ORDERS, LINEITEM
WHERE L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE >= '1993-10-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
L_RETURNFLAG = 'R'
GROUP BY O_CUSTKEY
ORDER BY REVENUE DESC
) x, CUSTOMER, NATION
WHERE C_CUSTKEY = O_CUSTKEY AND C_NATIONKEY = N_NATIONKEY

let me also look into keeping Customer as rowstore.

I am on the road, will do when I get back

I might visit London this summer, hope you are around

June 30, 2014 12:46 PM
 

Thomas Kejser said:

Will look forward to the results Joe. I plan to be around London the next 3-4 months at least - so just look me up when you are around.

July 1, 2014 4:53 PM

About jchang

Reverse engineering the SQL Server Cost Based Optimizer (Query Optimizer), NUMA System Architecture, performance tools developer - SQL ExecStats, mucking with the data distribution statistics histogram - decoding STATS_STREAM, Parallel Execution plans, microprocessors, SSD, HDD, SAN, storage performance, performance modeling and prediction, database architecture, SQL Server engine
