I had been meaning to do a somewhat comprehensive review of SQL Server performance from versions 2000 to 2008 for both 32 and 64-bit on Data Warehouse type queries, with in depth examination of scaling in parallel execution plans.
For now, I can provide a short summary. The test platform is a Dell PowerEdge 2900 with 2 quad-core Xeon E5330 2.66GHz processors, and 24GB memory. The operating system is Windows Server 2008 64-bit for both 32 and 64-bit SQL Server versions. Technically SQL Server 2000 is not supported, but this is just a performance comparison, not a production environment. The database is generated using the TPC-H dbgen kit for scale factor 10, meaning the Lineitem table is approximately 10GB, and the entire database is approximately 17GB, which fits entirely in memory. There was some tempdb activity, which is spread across 10 15K drives.
All tests are run twice: the first run loads data into memory and compiles the execution plan, and all results shown are for the second run. For SQL Server 2008, the tables use the new date data type in place of datetime, and queries are modified to avoid conversion anomalies as noted below. Below is the total (sum) CPU time in milliseconds to execute the 22 queries in sequence for max degree of parallelism 1, 2, 4, and 8.
| Version | DOP 1 | DOP 2 | DOP 4 | DOP 8 |
| 2000 RTM | 534,912 | 663,848 | 656,232 | 697,794 |
| 2000 bld 2187 | 514,881 | 589,245 | 657,543 | 770,272 |
| 2005 RTM 32-bit | 463,526 | 444,479 | 456,567 | 498,623 |
| 2005 SP2 32-bit | 464,478 | 403,668 | 413,685 | 452,134 |
| 2005 RTM 64-bit | 379,363 | 377,570 | 394,962 | 474,200 |
| 2005 SP2 64-bit | 370,206 | 327,149 | 345,155 | 436,491 |
| 2008 RTM | 375,136 | 324,264 | 343,250 | 410,220 |
Duration in milliseconds to run the 22 queries, by max DOP.
| Version | DOP 1 | DOP 2 | DOP 4 | DOP 8 |
| 2000 RTM | 553,900 | 293,411 | 191,552 | 149,568 |
| 2000 bld 2187 | 566,333 | 276,085 | 188,497 | 164,677 |
| 2005 RTM 32-bit | 480,839 | 237,933 | 134,644 | 84,721 |
| 2005 SP2 32-bit | 483,842 | 214,804 | 119,525 | 72,515 |
| 2005 RTM 64-bit | 379,563 | 194,199 | 107,409 | 65,094 |
| 2005 SP2 64-bit | 370,374 | 166,579 | 94,844 | 59,388 |
| 2008 RTM | 375,135 | 171,390 | 94,028 | 56,795 |
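The measurement procedure described above (fixed max DOP, a warm-up run, then a timed second run) can be sketched in T-SQL as follows. This is a minimal sketch, not the actual test harness: it assumes the server-wide max degree of parallelism option was used (a per-query MAXDOP hint would serve equally well), and it uses a simple stand-in aggregate on LINEITEM rather than the 22 TPC-H queries.

```sql
-- Assumption: max DOP is set server-wide; the original tests may have used a query hint instead.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 2;  -- repeat the test with 1, 2, 4 and 8
RECONFIGURE;
GO

-- Report CPU time and elapsed time (in ms) for each statement.
SET STATISTICS TIME ON;
GO

-- First run: loads data into memory and compiles the plan. Discard this timing.
SELECT COUNT(*) AS LINE_COUNT, SUM(L_EXTENDEDPRICE) AS TOTAL_PRICE FROM LINEITEM;
GO

-- Second run: record the CPU time and elapsed time reported for this execution.
SELECT COUNT(*) AS LINE_COUNT, SUM(L_EXTENDEDPRICE) AS TOTAL_PRICE FROM LINEITEM;
GO

SET STATISTICS TIME OFF;
GO
```

Summing the reported CPU time and elapsed time over all 22 queries at each setting would give one row of each table above.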
On SQL Server 2000 build 2187, notice that CPU increases from 514.9 to 589.2 seconds going from Max DOP 1 to 2, and so on up to Max DOP 8. This is expected because there is overhead to employing a parallel execution plan, and the overhead increases with the number of threads involved. Between SQL Server 2000 RTM and build 2187, there was a sharp jump in the CPU required at Max DOP 8. I will disregard this as there were significant changes and code fixes between the two builds concerning correctness of parallel execution plan results. Still, there is an overall performance gain from Max DOP 4 to 8. Several years ago, I mentioned that SQL Server 2000 performance is very problematic beyond Max DOP 4. That was before multi-core processors, when there were at most 4 cores per NUMA node. So the more correct interpretation is that SQL Server 2000 is very problematic on NUMA systems. An earlier look at SQL Server 2005 RTM showed no such problems on NUMA.
In SQL Server 2005, and 2008, there is actually a decrease in CPU going from Max DOP 1 to 2. This is mostly attributed to the bitmap filter in hash operations. Some queries show a significant drop in CPU from DOP 1 to 2, others no change, and some an increase. From DOP 2 to 4 there is a slight increase in CPU and more significant in going from DOP 4 to 8. This might indicate that DOP 2 and 4 are very good for overall efficiency, benefitting from bitmap filters in hash join operations, yet without incurring excessive parallelism overhead. (This is unrelated to the recommendation of Max DOP 4 on Itanium systems based on cores per NUMA node). Unrestricted parallelism on the 8 core system yields the best single stream completion times, although this should really be tested on 16 or more cores before setting any rules.
In the transition from SQL Server 2000 to 2005 RTM, both 32-bit, the duration performance gain is a modest 15% for non-parallel plans and a very substantial 49% at Max DOP 8. From SQL Server 2005 32-bit to 64-bit, both RTM builds, the performance gain was a solid 20%. The CPU efficiency improvement was a little less, so the tempdb configuration affects the results. Even though the entire data fits in memory, a large query with intermediate results is more likely to spool to tempdb in 32-bit than 64-bit. From SQL Server 2005 64-bit RTM to Service Pack 2, an additional 10% was realized at DOP 2 and higher.
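The percentages quoted above can be checked against the duration table. The worked figures below assume the build 2187 row is the SQL Server 2000 baseline, since that is the row that reproduces the quoted 15% and 49%.

```latex
\begin{align*}
\text{gain} &= 1 - \frac{t_{\text{new}}}{t_{\text{old}}} \\[4pt]
\text{2000 bld 2187} \to \text{2005 RTM 32-bit, DOP 1:} \quad 1 - \frac{480{,}839}{566{,}333} &\approx 0.15 \\
\text{2000 bld 2187} \to \text{2005 RTM 32-bit, DOP 8:} \quad 1 - \frac{84{,}721}{164{,}677} &\approx 0.49 \\
\text{2005 RTM 32-bit} \to \text{64-bit, DOP 1:} \quad 1 - \frac{379{,}563}{480{,}839} &\approx 0.21 \\
\text{2005 64-bit RTM} \to \text{SP2, DOP 2:} \quad 1 - \frac{166{,}579}{194{,}199} &\approx 0.14 \\
\text{2005 64-bit RTM} \to \text{SP2, DOP 8:} \quad 1 - \frac{59{,}388}{65{,}094} &\approx 0.09
\end{align*}
```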
SQL Server 2008 RTM is marginally better than SQL 2005 SP2. There is significant variation from query to query, so improvements should be expected over time, hopefully correcting the query plans that are slower while maintaining the performance advantage of plans that are better. One of the big disasters in 2008 parallel execution plans occurs on Query 5, Local Supplier Volume. The query is:
SELECT N_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY
AND L_SUPPKEY = S_SUPPKEY AND C_NATIONKEY = S_NATIONKEY
AND S_NATIONKEY = N_NATIONKEY AND N_REGIONKEY = R_REGIONKEY
AND R_NAME = 'ASIA'
AND O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY, 1, '1994-01-01'))
GROUP BY N_NAME
ORDER BY REVENUE DESC
The MaxDOP 1 plan is essentially:
SELECT N_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM SUPPLIER INNER JOIN (
  SELECT N_NATIONKEY, N_NAME, L_EXTENDEDPRICE, L_DISCOUNT, L_SUPPKEY
  FROM NATION
  INNER JOIN REGION ON N_REGIONKEY = R_REGIONKEY
  INNER JOIN CUSTOMER ON C_NATIONKEY = N_NATIONKEY
  INNER JOIN ORDERS ON C_CUSTKEY = O_CUSTKEY
  INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
  WHERE R_NAME = 'ASIA'
  AND O_ORDERDATE >= '1994-01-01'
  AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY, 1, '1994-01-01'))
) x ON L_SUPPKEY = S_SUPPKEY AND S_NATIONKEY = N_NATIONKEY
GROUP BY N_NAME
ORDER BY REVENUE DESC
OPTION (FORCE ORDER)
At MaxDOP 1, the actual CPU is 21,965 ms for the original query. At MaxDOP 2, the CPU is 28,721 ms for the original query and 13,323 ms for the forced query. So this one query added 15.4 CPU-sec to the 22-query total of 324.3 CPU-sec, close to 5%, and about 8.0 sec of duration.
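One way to reproduce this kind of comparison is to run both forms of a query at the same DOP using query hints and compare the reported CPU times. The sketch below shows only the hint syntax on a small stand-in join of NATION and REGION. MAXDOP and FORCE ORDER are standard T-SQL hints, but running the comparison this way is an assumption, not a record of the original test script.

```sql
SET STATISTICS TIME ON;
GO

-- Optimizer chooses the join order; parallelism is capped at 2 for this statement.
SELECT N_NAME
FROM NATION, REGION
WHERE N_REGIONKEY = R_REGIONKEY
AND R_NAME = 'ASIA'
OPTION (MAXDOP 2);
GO

-- Join order is fixed to the order written in the FROM clause; parallelism still capped at 2.
SELECT N_NAME
FROM NATION INNER JOIN REGION ON N_REGIONKEY = R_REGIONKEY
WHERE R_NAME = 'ASIA'
OPTION (FORCE ORDER, MAXDOP 2);
GO
```

Tables this small will not actually get a parallel plan; the same two OPTION clauses would be appended to the original and rewritten forms of Query 5.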
Query 8 was also bad news on the parallel plans, with about 5 CPU-sec lost on the MaxDOP 2 parallel plan compared with a forced parallel plan modeled on the non-parallel plan. One might think that MS should have caught these anomalies. I think the reason they do not is that MS does not look at SF1-30 TPC-H results. The minimum for publication is 100GB, and that will probably increase to 300GB soon, because 30GB is not a real data warehouse. I do think MS should look very carefully at SF1-30. The queries are at the onset of eligibility for parallelism. The really big queries in SF100 and higher are less likely to encounter plan problems. While not strictly a data warehouse, most transactional databases I have seen do not remotely resemble TPC-C or E. I would say most have TPC-H SF1-10 sized queries mixed in with smaller transactions. So a bad execution plan can be really bad news.
I am sufficiently satisfied that SQL Server 2008 has a very powerful engine and a decent optimizer. However, I have complained in the past about the rigid assumption that all query costs factor in IO time, the use of a fixed random to sequential IO performance model (320 IOPS to 10.5MB/sec), and out-of-balance IO-CPU ratios. If a proper calibration of the true cost formulas were to be done, there would probably be fewer silly mistakes resulting in goofy execution plans. Given that many people do not know how to diagnose this type of problem, a simple test of 2000 or 2005 against 2008 can run into this matter, leading to a decision to stay with 2000/2005, when a few simple adjustments would have corrected the 2008 results.
SQL Server Settings
Generally I follow the HP TPC-H publications on optimization settings, particularly -E and -T834. Neither changed results by more than 1% either way. I had also looked at -T2301 in the past finding no apparent differences. I really would like MS to provide more details on T2301. Are there set points below which it has no effect?
SQL Server 2008 new Date data type changes
The 3 datetime columns in the LineItem table from 2005 become date columns, for an apparent savings of 12 bytes per row. The 2005 TPC-H SF10 database was 13.77GB (or rather, million KB) of data and 3.68GB of indexes, for a total of 17.46GB. In 2008, using the date data type in place of datetime, the size is 12.77GB of data and 2.96GB of indexes, for a total of 15.74GB. The average bytes per row of LineItem drops from 169 to 153, because one of the datetime/date columns is the cluster key.
Normally, a simple reduction in size from column width, not row count, does not improve performance unless it affects whether the data fits in memory. I always try to exclude this factor because one can generate any difference in performance by adjusting the amount of disk IO.
The original TPC-H queries may have SARG of the form
AND O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < DATEADD(YY, 1, '1994-01-01')
Even before SQL 2008, the date functions would return a datetime or smalldatetime result as appropriate. In SQL 2008, the natural extension is to return a date type when the comparison is to a date column. I made this request on Connect and was told to bugger off. So SQL 2008 will convert the column to datetime to match the function result, losing the benefit of a proper SARG. Anyone upgrading to SQL 2008 with the date type and not changing code as below may get a nasty surprise.
AND O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY, 1, '1994-01-01'))
Little things like this can cause people to refuse to budge from SQL 2000, which really needs to be retired.
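An alternative to converting the DATEADD result is to make the input a date before the function is applied. In SQL Server 2008, DATEADD returns the data type of its date argument, and returns datetime only when that argument is a string literal. A minimal sketch, assuming the ORDERS table with a date-typed O_ORDERDATE column as described above. The COUNT(*) is only a stand-in query, and the resulting plan should still be checked to confirm the predicate is used as a SARG.

```sql
-- Requires SQL Server 2008 or later (date data type).
-- The literal is converted to date first, so DATEADD returns date and
-- O_ORDERDATE is compared date-to-date with no conversion of the column.
SELECT COUNT(*) AS ORDER_COUNT
FROM ORDERS
WHERE O_ORDERDATE >= '1994-01-01'
AND O_ORDERDATE < DATEADD(YY, 1, CONVERT(DATE, '1994-01-01'));
```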
Duration in milliseconds for SQL 2008 64-bit, by query and max DOP
| Query | DOP 1 | DOP 2 | DOP 4 | DOP 8 |
| Q1 | 50,013 | 26,317 | 12,591 | 7,159 |
| Q2 | 504 | 268 | 150 | 107 |
| Q3 | 16,296 | 5,186 | 3,158 | 1,902 |
| Q4 | 19,232 | 5,288 | 3,452 | 2,340 |
| Q5 | 21,648 | 16,150 | 8,371 | 5,120 |
| Q6 | 1,845 | 929 | 496 | 312 |
| Q7 | 17,397 | 4,369 | 2,388 | 1,376 |
| Q8 | 5,734 | 6,765 | 3,628 | 1,849 |
| Q9 | 48,361 | 22,034 | 11,335 | 6,372 |
| Q10 | 15,281 | 5,822 | 3,595 | 2,425 |
| Q11 | 4,423 | 1,238 | 657 | 600 |
| Q12 | 9,363 | 4,828 | 4,356 | 2,365 |
| Q13 | 21,699 | 11,310 | 5,751 | 2,967 |
| Q14 | 2,146 | 1,033 | 547 | 334 |
| Q15 | 1,368 | 970 | 521 | 249 |
| Q16 | 6,599 | 3,615 | 2,018 | 1,848 |
| Q17 | 1,243 | 521 | 294 | 213 |
| Q18 | 50,909 | 27,945 | 15,439 | 9,365 |
| Q19 | 2,096 | 1,093 | 607 | 378 |
| Q20 | 841 | 430 | 255 | 165 |
| Q21 | 69,191 | 22,064 | 12,826 | 8,337 |
| Q22 | 8,946 | 3,213 | 1,592 | 1,010 |
| Total | 375,135 | 171,390 | 94,028 | 56,795 |
My apologies: Linchi posted SQL 2005 64-bit results, so my duration results (in milliseconds) for SQL 2005 64-bit SP2 (no CU) are below.
| Query | DOP 1 | DOP 2 | DOP 4 | DOP 8 |
| Q1 | 64,761 | 32,553 | 16,428 | 8,344 |
| Q2 | 504 | 295 | 158 | 106 |
| Q3 | 14,733 | 4,782 | 3,003 | 1,989 |
| Q4 | 17,506 | 5,338 | 3,747 | 2,519 |
| Q5 | 19,716 | 7,376 | 4,654 | 3,159 |
| Q6 | 1,609 | 893 | 471 | 309 |
| Q7 | 15,855 | 5,472 | 3,306 | 2,403 |
| Q8 | 5,225 | 2,391 | 1,333 | 2,147 |
| Q9 | 44,611 | 23,291 | 12,213 | 7,222 |
| Q10 | 13,989 | 6,384 | 3,934 | 2,754 |
| Q11 | 4,093 | 1,192 | 669 | 495 |
| Q12 | 8,166 | 4,497 | 4,022 | 1,714 |
| Q13 | 25,830 | 13,566 | 7,521 | 4,260 |
| Q14 | 2,060 | 1,020 | 526 | 352 |
| Q15 | 1,358 | 1,931 | 1,139 | 235 |
| Q16 | 6,476 | 3,476 | 2,429 | 1,215 |
| Q17 | 1,012 | 524 | 291 | 199 |
| Q18 | 46,954 | 26,156 | 13,896 | 9,209 |
| Q19 | 2,133 | 1,172 | 623 | 450 |
| Q20 | 830 | 446 | 253 | 172 |
| Q21 | 64,231 | 20,850 | 12,536 | 9,087 |
| Q22 | 8,722 | 2,972 | 1,692 | 1,049 |
| Total | 370,374 | 166,579 | 94,844 | 59,388 |
Intel Dunnington
Yet another update with the publication of TPC results for the Intel X7460 six core (Dunnington)
X7460 2.67GHz 3x3M L2, 16M L3
Dell, HP lists availability Sep 15, 2008, IBM lists availability on the x3950M2 as Dec 10,
Dell R900 with 4 x X7460, 2.67GHz, 6 core, 16M L3, $17,195
HP DL580G5 with 4 x X7460, 2.67GHz 6 core, 16M L3 $19,151
I think the IBM x3950M2 with 4 x X7460 is $41K (understanding this system can be expanded to 16 sockets, and hence has higher cost structure)
Tukwila Quad Core Itanium due in Q1 2009?
I am looking over the Intel IDF slides on Tukwila. It is a quad-core Itanium, with probably just minor improvements in the core (HT is specifically mentioned) but with an integrated memory controller and the QuickPath Interconnect (QPI) replacing the FSB. Tukwila will be 65nm when the first 45nm procs are 1 year old, meaning it is really 2 years late. (What I mean by this is that a 65nm quad-core Itanium could have been built in late 2006/early 2007, if the prep work had started early with clear objectives, and its performance would have been earth-shattering relative to x86/64.) Frequency improvements are mentioned over the 90nm dual-core, which is running at about the same frequency as the 130nm single-core Madison. Still, Tukwila has a large cache, 6M L3 per core, massive bandwidth via QPI for good scaling characteristics (4 full-width + 2 half-width, allowing glueless 8-way), 4 DDR memory channels per socket, and HT, which is good for high call volume database apps. Intel mentioned about 2X performance, so they are probably targeting 740K tpm-C.
This means we will have the choice of 1) the six-core Dunnington, with the most powerful CPU core on earth (prior to Nehalem) on a weak chipset with 4 memory channels supporting 4 sockets and 24 cores, 2) the new quad-core Itanium with outstanding scaling characteristics plus HT for added throughput, but a weak core, 3) AMD Barcelona, also with very good scaling (8-way glueless), no HT, and a slightly better than weak core, 4) Nehalem, with what will be the most powerful core, the new QPI for good scaling, 3 memory channels per socket, HT, but for the first year, only 2 sockets. Decisions, decisions.
Nehalem/Beckton
Due out in Q4 2008, the initial Nehalem will support 2-way, with 3 memory channels and 2 QPI. About a year later, Beckton, the MP server (4 sockets and up) version, comes out with 4 memory channels per socket and 4 QPI.
Performance
TPC-C (Windows Server 2003, SQL Server 2005SP2)
4 x Intel X7460 six core 2.67GHz, 634,825 tpm-C
4 x AMD 8360 quad-core 2.5GHz 471,883 tpm-C
4 x Intel X7350 quad-core 2.93GHz 407,079 tpm-C
TPC-E (W2K8, S2K8, Dell PowerEdge R900 results)
4 x Intel X7460 six core 2.67GHz, 671.35 tps-E
4 x Intel X7350 quad core 2.93GHz, 451.29 tps-E
4 x AMD 8360 quad core ??
TPC-H (W2K3, S2K5, SF 100)
4 x Intel X7350 quad core 2.93GHz, 46,034QphH
4 x Intel X7460 six core ??
4 x AMD 8360 quad core ??
TPC-H 300GB
8 x Intel X7350 QC 46,034 QphH (IBM x3950M2, W2K3, S2K5 SP2)
8 x AMD 8360 QC 52,860 QphH (HP DL785, W2K8, S2K8)
On TPC-C, the X7460 six core generated a 34% edge over the quad core AMD and a 56% advantage over the quad core X7350. Even with the large cache, this is higher than expected. At the time, I suspected HP did not pursue optimization with the 407K result.
On TPC-E, the six core showed a 49% edge over the older quad core.
This could indicate the 7300 chipset with 4 memory channels cannot properly scale the 4 QC 2x4M L2 processors, but can scale the new six core 16M L3 procs.
What's missing are comparable TPC-H numbers, especially at 100GB. The big cache on X7460 helps high call volume apps like TPC-C and E, but not TPC-H. Can the 7300 chipset drive the extra 8 cores (in the X7460 over 16 core in the X7350) in DW queries?
There is an 8xOpteron QC and 8xX7350 at 300GB, but the Opteron is on S2K8 while the X7350 is on S2K5, which has different characteristics.
The X7460 (Dunnington) is a clear winner at 4-way for the high call volume apps. There are not sufficient results in DW to make a call. AMD does have a small opening in the fairly low-priced 8-way (compared with hard-NUMA systems).
As much as I would like to buy one of these for my own use in researching SQL Server performance characteristics, I am holding my 2008 budget for a Nehalem system as soon as it comes out, and an SSD array. As soon as I can confirm an SSD can do 10K IOPS on random 8K reads (I see the IDF announcements that the new Intel SSD due early 2009 will do 30K IOPS at 4K), I will get a dozen to see what is involved in reaching 100K IOPS from SQL queries. A few years ago, a quick test on a TMS SSD SAN showed 45K, limited by the SQL Server side CPU. On Nehalem, the big question is whether the Hyper-Threading issues of NetBurst have been fixed.
______________________________________________________________________
This is an update to the original post on Server Sizing for SQL Server to reflect the new quad-core Opteron systems. The recommended server systems, as of Q3 2008, for line-of-business database applications are:
2-way Intel Xeon: HP ProLiant ML370G5 and Dell PowerEdge 2900 III
4-way Intel: Dell PowerEdge R900 and HP ProLiant DL580G5
2-way AMD Opteron: Dell PowerEdge R805
4-way AMD Opteron: Dell PowerEdge R905 and HP ProLiant DL585G5
8-way AMD Opteron: HP ProLiant DL 785G5
Processors
For 2-socket Xeon, the top processors include the X5460 3.16GHz, and the E5440 2.83GHz, both 2x6M cache.
For 4-socket Xeon, the X7350 2.93GHz 2x4M.
For 4 & 8 socket Opteron, the top processor is the 8360SE at 2.5GHz.
Processor Notes
I really do not want to get heavily into Xeon versus Opteron. It is too emotional a subject for many people and too infested with FUD driven by marketing people. This frequently involves valid technical points taken out of context. What it comes down to is the Core 2 architecture has by far the highest SPEC CPU integer (not rate) scores, and will generate the best results in certain categories of performance tests. This is most evident in single large query tests.
At the 4 & 8-socket level, AMD Opteron has the better memory architecture, with 2 DDR2 memory channels per socket, 8 total in a 4-socket system and 16 memory channels in an 8-socket system, compared with 4 in the 4-socket Xeon with the 7300 chipset. This may yield an advantage in full saturation tests, which are more difficult to run. So at the 4-socket level, the difference is that Xeon has more compute power in the processor cores, while Opteron can turn memory faster. What is better: a 400HP engine with a transmission with 70% efficiency or a 310HP engine with a 90% transmission efficiency?
At the 8-socket level, Opteron is the best choice for most situations. The 8-socket Opteron (Barcelona) system has what is considered to be a soft NUMA architecture, meaning that the memory latency difference between local and remote nodes is low or inconsequential (i.e. do not set the NUMA flag). The IBM and Unisys big iron systems are considered hard NUMA, meaning that the memory latency difference between local and remote nodes is high. Hard NUMA systems can scale, but would most likely require specialized performance analysis skills which are not easily found.
Additional comments
I rate the HP ProLiant ML370G5 over the PowerEdge 2900 III on technical grounds: more memory sockets, and more PCI-E sockets. On the same grounds, I rate the Dell PowerEdge R805 over the ProLiant DL385G5 because the R805 has 16 DIMM sockets over 8 for the DL385G5.
At the 4-socket level, for Intel platforms, the Dell and HP systems are sufficiently comparable.
Note that Dual-Core Opteron processors are an option in the 2 and 4-socket systems, but not the 8 socket DL785. The original and dual core Opteron processors have up to 3 full (16-bit) HT links, of which 2 connect to other processors, and 1 connects to an IO hub. In a 4-socket system, the processors are at the corners of a square, with each processor connected to processors on the two adjacent corners. Hence there is a far processor two hops away.
The Barcelona quad-core has up to 4 full width HT links, each of which can be split as two half-width (8-bit) HT links. In a 4-socket system, each processor can connect directly to all of the other three sockets with a full HT link, leaving one for IO. In an 8-socket system, each processor can connect directly to all seven other sockets with one half-wide HT link, leaving one half-wide link for IO. The HP 4-socket DL585G5 and 8-socket DL785 only support quad-core Opteron, not dual core, which may indicate the use of three full HT links to processors. The Dell R905 supports both dual and quad-core Opteron, which may indicate an older 2-hop to the far processor.
Finally, until quad-core on a current generation manufacturing process is available, Itanium has the very high memory capacity (>256-512GB) and IO bandwidth (>10GB/sec) niche. It could be pointed out that whatever the criticism of Itanium since its launch, at the time it was conceived in the 1990s, it was a foregone conclusion that RISC would overwhelm x86, which would not be able to benefit from advanced design concepts. Intel was not content to be a Johnny-come-lately to the RISC party, and with HP, came up with a better idea than RISC. And yet, what processor today has the best SPEC CPU int?
IBM Xeon Systems
I have said before that I do not have recent experience with IBM systems. Just from looking at the IBM redbook on the x3850 M2, it looks very impressive. For this and the x3950 M2, IBM does their own chipset, which supports a NUMA architecture of up to 4 nodes of 4 sockets. I do not know if 8 nodes are still supported. What I like about the x3850 M2 memory controller is the 8 DDR2 memory channels. I really think the Intel 7300 with 4 memory channels is too weak to support 4 quad cores, and now 4 six core procs. Intel always was afraid of the high-end chipset, obsessively looking at the entry point price, which drags down the high-end configuration. The IBM x3850 M2 did post a TPC-E of 479.51 tps-E over Dell's 451.29. The IBM system has 128GB memory compared with 64GB for Dell, so it is not clear if the 8 memory channels contributed.
Has any one encountered SQL backup failures that were ultimately diagnosed to a MemToLeave issue? This should only occur on 32-bit SQL Server versions, including 2000 and 2005. A number of other operations could generate a MemToLeave issue, but I am only addressing backup failures here. (the diagnosis should have led to messages in the SQL Logs like: reserve contiguous memory of Size=65536,131072, or some power of 2 bytes failed).
A search on the MemToLeave topic might point to using the -g switch on sqlservr startup, possibly setting this to 384 or 512, but hopefully not a higher value. If the only MemToLeave problem is with backups, then there is an alternative to jacking up the -g MemToLeave setting.
The default SQL Server backup allocates (or tries to) a certain number of buffers of a certain size (the BUFFERCOUNT and MAXTRANSFERSIZE settings). The default MAXTRANSFERSIZE is 1M (1,048,576 bytes for the digitally deficient). I am not sure what the default BUFFERCOUNT is; it might be 10 or 20. So in a 32-bit SQL Server instance doing various funky things, there very well might not be 10 or 20 contiguous 1M chunks of virtual address space (VAS) in the MemToLeave area (which is completely unrelated to how much physical memory your system has or how much physical memory is available, so do not even mention this, and just shoot anyone who does). When the allocation fails, the backup command will progressively try smaller MaxTransferSize settings, which can take a while and may ultimately fail.
Of course jacking up the MemToLeave addresses this issue, but alternatively one could just decrease the BufferCount or MaxTransferSize. The default settings usually yield decent backup performance on ordinary storage systems (i.e., not the brute force configurations I discussed in an earlier post). It is quite possible that decreasing the BufferCount to 4 and MaxTransferSize to 262144 will not cause much of a drop in backup performance, perhaps 10-20%. If this seems severe, consider the alternative of failed backups with no action or daytime impact with the -g512 setting. Which is more acceptable?
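As a concrete illustration of the reduced settings, here is a minimal sketch of the backup command. The database name and file path are hypothetical placeholders, and the syntax shown is the documented SQL Server 2005 form.

```sql
-- Hypothetical database name and backup path; substitute your own.
-- 4 buffers x 256KB asks for far less contiguous virtual address space
-- than the default count of 1MB buffers.
BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backup\MyDatabase.bak'
WITH BUFFERCOUNT = 4,
     MAXTRANSFERSIZE = 262144;  -- bytes; must be a multiple of 65536 (64KB), up to 4MB
```

Timing one backup with the defaults and one with these options on the same storage would show whether the drop is within the 10-20% guessed at above.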
If you use third party database backup compression software that is multi-threaded, the default may be X buffers per thread. If your system has 16 cores, and the default number of threads is one-half of the number of cores, then that might be 80 x 1MB buffers by default.
In case you are the curious type, the BufferCount and MaxTransferSize settings are described in SQL Server 2005 BOL, not in the SQL 2000 BOL, but are mentioned in the SDK optional part of the SQL Server 2000 installation. Everybody installs the SDK part on their personal systems, right? It has lots of code samples.
From comments below: At one time LiteSpeed defaulted to n-1 threads, so if you have 4 cores, 3 threads was the default. The Core2 2.6GHz can compress 150-200MB/sec depending on the data, so if you have 16 cores, 15 threads can do 2.2GB/sec if you can feed it. If you cannot feed the beast, what's the point of all the threads? If your disks can read just 600MB/s, 4 threads will work fine.
Last I saw, 4.6 was supposed to be 4 buffers per thread on 32-bit SQL. For LiteSpeed, I would recommend just dialing in the right thread count and buffercount: set threads to no more than what your disks can supply, and set buffercount to 4 if it's not the default. Only dial down the maxtransfersize if you still have failures or see the error message in the log.
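The thread-count arithmetic in the comment above works out as follows, taking the low end of the quoted 150-200MB/sec per-thread compression rate:

```latex
\begin{align*}
15\ \text{threads} \times 150\ \text{MB/s per thread} &= 2{,}250\ \text{MB/s} \approx 2.2\ \text{GB/s} \\[4pt]
\frac{600\ \text{MB/s (disk read rate)}}{150\ \text{MB/s per thread}} &= 4\ \text{threads}
\end{align*}
```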
I was just about ready to unleash a long accumulating stream of rants against SAN vendors for pushing seriously obsolete computers as powerful storage systems. Of course, in a final check of product specs, I saw that EMC just announced the new Clariion CX4 line. The previous CX3 line was built around the Intel E7520 chipset, which was a 2H 2004 product for the old NetBurst processors. A SAN does not need top line CPU power, but I wanted the Intel 5000P or even better, the 5400 chipset, with 2 x 1066 or 1333MHz FSB and 20GB/s memory bandwidth, not the single 800MHz FSB and 6.4GB/sec memory bandwidth on the E7520. A low voltage Core2 Xeon would be a good match to keep power down (never mind, 50 or 65W does not matter). A SAN should not need a full blown 3GHz Quad-Core. Dual core is fine.
The new CX4 line uses current generation Core2 architecture processors. The low-end CX4-120 has a single 1.2GHz dual-core, the CX4-240 has 1 x 1.6GHz DC, the CX4-480 1 x 2.2GHz DC and the top of the line CX4-960 has 2 x 2.33GHz quad core. I said above that a SAN does not need top line CPU power, but where did the 1.2GHz come from? The lowest Xeon bottoms out at 1.6GHz. The very good E5205 1.86GHz has an Intel list price of $177. Not only that, on single socket systems, the E3110 3GHz is listed at $167. The desktop E7200 2.53GHz is $113. Considering that the CX4-480 is not a chump change system, it should have 2 dual core procs (1333MHz FSB) to be able to drive the full bandwidth of 4 FB-DIMM 667MHz memory channels. Consider what happens on a disk read. The data block read is not sent straight from disk to host; it is first written to memory regardless of cache settings(?), then read from memory, and finally sent to host. For this, I want both FSB processor sockets populated. A quad-core on one socket will only have half the FSB bandwidth of 2 dual-cores, one attached to each of the two FSBs on the 5000/5400 chipset.
The 960 does have two processor sockets populated to utilize the full memory bandwidth. I am not really sure why the CX4-960 needs quad-core. Does the Clariion line have some capability to do compression? It would be nice to off-load this from the host server. I did work on the LiteSpeed compression engine. I always thought it would be good to build a fully multi-threaded version of winzip, with control of intermediate file placement, the buffering flags, and special capability for super fast network transfers.
I was going to really gripe hard on the memory in the CX3, given that a pair of 1GB DDR2 ECC DIMMs now costs $99, and a pair of 2GB $240. Back when the CX3 was launched, memory was more expensive, but they could have done a product refresh, especially considering what vendors typically charge for SANs. It's really stupid that the CX3-10 only had 1GB per SP, and the CX3-20 only 2GB per SP, especially considering that of that, only 310MB and 1053MB respectively are available for cache; the rest is required for the SAN OS and other software. The new high-end CX4-960 is listed at 32GB, meaning 16GB in each SP. The CX4-480 has 8GB per SP, the CX4-240 has 4GB per SP and the CX4-120 3GB per SP.
The SAN is a computer system in itself, with an operating system etc., that needs memory. The memory actually available for cache, from high to low, is 10.7GB, 4.5GB, 1.2GB and 600MB. Now I have said before that read cache is worthless except for a small amount for read-ahead in sequential ops. I do like write caching, to handle T-Log backups, checkpoints and tempdb surges. So allocating cache by LUN is important. Given that 4x2GB FB-DIMM today is $440, the low-end should have started at 4GB per SP, with 8GB in the 240, 16 in the 480 and 32-64GB per SP on the high-end. At least a proper performance analysis should be done.
On backend FC ports, the 120 has 2 total, 1 per SP, the 240 has 2 per SP, the 480 4 per SP, and the 960 supports a maximum of 8 per SP (not