I had been meaning to do a somewhat comprehensive review of SQL Server performance from versions 2000 to 2008 for both 32 and 64-bit on Data Warehouse type queries, with in depth examination of scaling in parallel execution plans.
For now, I can provide a short summary. The test platform is a Dell PowerEdge 2900 with 2 quad-core Xeon E5330 2.66GHz processors, and 24GB memory. The operating system is Windows Server 2008 64-bit for both 32 and 64-bit SQL Server versions. Technically SQL Server 2000 is not supported, but this is just a performance comparison, not a production environment. The database is generated using the TPC-H dbgen kit for scale factor 10, meaning the Lineitem table is approximately 10GB, and the entire database is approximately 17GB, which fits entirely in memory. There was some tempdb activity, which is spread across 10 15K drives.
All tests are run twice to load data into memory and pre-compile the execution plan for the second run. All results shown are for the second run. For SQL Server 2008, the tables use the new Date data type in place of Date Time, and queries are modified to avoid conversion anomalies as noted below. Below is the total (sum) CPU time in milli-seconds to execute the 22 queries in sequence for max degree of parallelism: 1, 2, 4, and 8.
DOP 1 DOP 2 DOP4 DOP 8
2000 RTM 534,912 663,848 656,232 697,794
2000 bld 2187 514,881 589,245 657,543 770,272
2005 RTM 32 463,526 444,479 456,567 498,623
2005 SP2 32 464,478 403,668 413,685 452,134
2005 RTM 64 379,363 377,570 394,962 474,200
2005 SP2 64 370,206 327,149 345,155 436,491
2008 RTM 375,136 324,264 343,250 410,220
Duration in milli-seconds to run 22 queries by max DOP.
DOP 1 DOP 2 DOP4 DOP 8
2000 RTM 553,900 293,411 191,552 149,568
2000 bld 2187 566,333 276,085 188,497 164,677
2005 RTM 32 480,839 237,933 134,644 84,721
2005 SP2 32 483,842 214,804 119,525 72,515
2005 RTM 64 379,563 194,199 107,409 65,094
2005 SP2 64 370,374 166,579 94,844 59,388
2008 RTM 375,135 171,390 94,028 56,795
On SQL Server 2000 build 2187, notice that CPU increases from 514.8 to 589.2 seconds going from Max DOP 1 to 2 and so on to Max DOP 8. This is expected because there is overhead to employing a parallel execution plan, and the overhead increases with the number of threads involved. Between SQL Server 2000 RTM and build 2187, there was a sharp jump in the CPU required at Max DOP 8. I will disregard this as there were significant changes and code fixes between the two builds concerning correctness of parallel execution plan results. Still, there is an overall performance gain from Max DOP 4 to 8. Several years ago, I mentioned that SQL Server 2000 performance is very problematic beyond Max DOP 4. That was before multi-core processors, and there were at most, 4 cores per NUMA node. So the more correct interpretation is that SQL Server 2000 is very problematic on NUMA systems. An earlier look at SQL Server 2005 RTM showed no such problems on NUMA.
In SQL Server 2005, and 2008, there is actually a decrease in CPU going from Max DOP 1 to 2. This is mostly attributed to the bitmap filter in hash operations. Some queries show a significant drop in CPU from DOP 1 to 2, others no change, and some an increase. From DOP 2 to 4 there is a slight increase in CPU and more significant in going from DOP 4 to 8. This might indicate that DOP 2 and 4 are very good for overall efficiency, benefitting from bitmap filters in hash join operations, yet without incurring excessive parallelism overhead. (This is unrelated to the recommendation of Max DOP 4 on Itanium systems based on cores per NUMA node). Unrestricted parallelism on the 8 core system yields the best single stream completion times, although this should really be tested on 16 or more cores before setting any rules.
In the transition from SQL Server 2000 to 2005 RTM, both 32-bit, the duration performance gain is a modest 15% for non-parallel plans and a very substantial 49% at Max DOP 8. From SQL Server 2005 32-bit to 64-bit, both RTM builds, the performance gain was a solid 20%. The CPU efficiency improvement was a little less, so the tempdb configuration affects the results. Even though the entire data fits in memory, a large query with intermediate results is more likely to spool to tempdb in 32-bit than 64-bit. From SQL Server 2005 64-bit RTM to Service Pack 2, an additional 10% was realized at DOP 2 and higher.
SQL Server 2008 RTM is marginally better than SQL 2005 SP2. There is significant variation from query to query, so improvements should be expected over time hopefully to correct the query plans that are slower while maintaining the performance advantage of plans that are better. One of the big disasters in 2008 parallel execution plans occurs on Query 5, Local Supplier Volume. The query is:
SELECT N_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY
AND L_SUPPKEY = S_SUPPKEY AND C_NATIONKEY = S_NATIONKEY
AND S_NATIONKEY = N_NATIONKEY AND N_REGIONKEY = R_REGIONKEY
AND R_NAME = 'ASIA'
AND O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY, 1, '1994-01-01'))
GROUP BY N_NAME
ORDER BY REVENUE DESC
The MaxDOP 1 plan is essentially:
SELECT N_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM SUPPLIER
INNER JOIN (
SELECT N_NATIONKEY, N_NAME, L_EXTENDEDPRICE, L_DISCOUNT, L_SUPPKEY
FROM NATION
INNER JOIN REGION ON N_REGIONKEY = R_REGIONKEY
INNER JOIN CUSTOMER ON C_NATIONKEY = N_NATIONKEY
INNER JOIN ORDERS ON C_CUSTKEY = O_CUSTKEY
INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
WHERE R_NAME = 'ASIA'
AND O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY, 1, '1994-01-01'))
) x ON L_SUPPKEY = S_SUPPKEY AND S_NATIONKEY = N_NATIONKEY
GROUP BY N_NAME ORDER BY REVENUE DESC
OPTION (FORCE ORDER)
At MaxDOP 1, the actual CPU is 21,965 ms for the original query,
the MaxDOP 2, the CPU is 28,721ms for the original.
MaxDOP 2 CPU for the forced query is 13,323.
So this one query added 15.4 CPU-sec to the total 22 query 324.3 CPU-sec,
close to 5%, and about 8.0sec duration.
Query 8 was also bad news on the parallel plans, with about 5 CPU-sec lost on the MaxDOP 2 parallel plan compared with a forced parallel plan modeled on the non-parallel plan. One might think that MS should have caught these anomalies. I think the reason they do not is that MS does not look at SF1-30 TPC-H results. The minimum for publication is 100GB, and that will probably increase to 300GB soon, because 30GB is not a real data warehouse. I do think MS should look very carefully at SF1-30. The queries are at the onset of eligibility for parallelism. The really big queries in SF100 and higher are less likely to encounter plan problems. While not strictly a data warehouse, most transactional databases I have seen do not remotely resemble TPC-C or E. I would say most have TPC-H SF1-10 sized queries mixed in with smaller transactions. So a bad execution plan can be really bad news.
I am sufficiently satisfied that SQL Server 2008 has a very powerful engine, and a decent optimizer. However, I have complained in the past about the rigid assumptions that all query costs factor in IO time, the use a fixed random to sequential IO performance model (320 IOPS to 10.5MB/sec) and an out of balance IO-CPU ratios. If a proper calibration of the true cost formulas were to be done, there would probably be fewer silly mistakes resulting in goofy execution plans. Given that many people do not know how to diagnose this type of problem, a simple test of 2000 or 2005 and 2008 can encounter this matter, leading to a decision to stay with 2000/2005, when a few simple adjustments would have corrected the 2008 results.
SQL Server Settings
Generally I follow the HP TPC-H publications on optimization settings, particularly -E and -T834. Neither changed results by more than 1% either way. I had also looked at -T2301 in the past finding no apparent differences. I really would like MS to provide more details on T2301. Are there set points below which it has no effect?
SQL Server 2008 new Date data type changes
The 3 datatime columns in the LineItem table from 2005 become Date columns, for an apparent savings of 12 bytes. The 2005 tpch SF10 database was 13.77GB (rather million KB) data and 3.68G indexes for a total of 17.46G. In 2008, using the Date data type in place of datetime, the size is 12.77 data and 2.96G index for a total of 15.74G. The average bytes per row of LineItem drops from 169 to 153, because one of the DateTime/Date columns was the cluster key.
Nornally, a simple reduction in size on column width, not row count, does not improve performance unless it impacts fit in memory. I always try to exclude this factor because one can generate any difference in performance by adjusting the amount of disk IO.
The original TPC-H queries may have SARG of the form
AND O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < DATEADD(YY, 1, '1994-01-01'))
Even before SQL 2008, the date functions would return a datetime or smalldatetime result as appropriate. In SQL 2008, the nature extension is to return a date type when the comparison is a date column. I made this request in connect and was told to bugger off. So SQL 2008 will convert the column to date time to equate with the function, losing the benefit of a proper SARG. Anyone upgrading to SQL 2008 with the date type and not changing code as below may get a nasty suprise.
AND O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY, 1, '1994-01-01'))
Little things like this can cause people to refuse to budge from SQL 2000, which really needs to be retired.
Duration for SQL 2008 64-bit
|
1P D |
2P D |
4P D |
8P D |
| Q1 |
50,013 |
26,317 |
12,591 |
7,159 |
| Q2 |
504 |
268 |
150 |
107 |
| Q3 |
16,296 |
5,186 |
3,158 |
1,902 |
| Q4 |
19,232 |
5,288 |
3,452 |
2,340 |
| Q5 |
21,648 |
16,150 |
8,371 |
5,120 |
| Q6 |
1,845 |
929 |
496 |
312 |
| Q7 |
17,397 |
4,369 |
2,388 |
1,376 |
| Q8 |
5,734 |
6,765 |
3,628 |
1,849 |
| Q9 |
48,361 |
22,034 |
11,335 |
6,372 |
| Q10 |
15,281 |
5,822 |
3,595 |
2,425 |
| Q11 |
4,423 |
1,238 |
657 |
600 |
| Q12 |
9,363 |
4,828 |
4,356 |
2,365 |
| Q13 |
21,699 |
11,310 |
5,751 |
2,967 |
| Q14 |
2,146 |
1,033 |
547 |
334 |
| Q15 |
1,368 |
970 |
521 |
249 |
| Q16 |
6,599 |
3,615 |
2,018 |
1,848 |
| Q17 |
1,243 |
521 |
294 |
213 |
| Q18 |
50,909 |
27,945 |
15,439 |
9,365 |
| Q19 |
2,096 |
1,093 |
607 |
378 |
| Q20 |
841 |
430 |
255 |
165 |
| Q21 |
69,191 |
22,064 |
12,826 |
8,337 |
| Q22 |
8,946 |
3,213 |
1,592 |
1,010 |
|
375,135 |
171,390 |
94,028 |
56,795 |
my apologies, Linchi post SQL 2005 64-bit results, so my duration results for SQL 2005 64-bit, SP2 (no cu) below
|
1P D |
2P D |
4P D |
8P D |
| Q1 |
64,761 |
32,553 |
16,428 |
8,344 |
| Q2 |
504 |
295 |
158 |
106 |
| Q3 |
14,733 |
4,782 |
3,003 |
1,989 |
| Q4 |
17,506 |
5,338 |
3,747 |
2,519 |
| Q5 |
19,716 |
7,376 |
4,654 |
3,159 |
| Q6 |
1,609 |
893 |
471 |
309 |
| Q7 |
15,855 |
5,472 |
3,306 |
2,403 |
| Q8 |
5,225 |
2,391 |
1,333 |
2,147 |
| Q9 |
44,611 |
23,291 |
12,213 |
7,222 |
| Q10 |
13,989 |
6,384 |
3,934 |
2,754 |
| Q11 |
4,093 |
1,192 |
669 |
495 |
| Q12 |
8,166 |
4,497 |
4,022 |
1,714 |
| Q13 |
25,830 |
13,566 |
7,521 |
4,260 |
| Q14 |
2,060 |
1,020 |
526 |
352 |
| Q15 |
1,358 |
1,931 |
1,139 |
235 |
| Q16 |
6,476 |
3,476 |
2,429 |
1,215 |
| Q17 |
1,012 |
524 |
291 |
199 |
| Q18 |
46,954 |
26,156 |
13,896 |
9,209 |
| Q19 |
2,133 |
1,172 |
623 |
450 |
| Q20 |
830 |
446 |
253 |
172 |
| Q21 |
64,231 |
20,850 |
12,536 |
9,087 |
| Q22 |
8,722 |
2,972 |
1,692 |
1,049 |
|
370,374 |
166,579 |
94,844 |
59,388 |