THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Joe Chang

  • Why have we not seen TPC-C and TPC-E benchmarks using SSD storage?

    Dell recently published a TPC-H report for the PowerEdge T610, 2 x Xeon 5570, with 4 FusionIO 80GB SSD storage devices at 100GB scale factor. So why have we not seen TPC-C or TPC-E OLTP benchmark results published?

     

    Now it is much more feasible to run the TPC-H data warehouse benchmark on SSD because the Scale Factor 100 size is still allowed, for which the Line item table is 100GB for data only, not indexes or other tables. The full SF 100 tpch database is about 170GB for all tables and indexes. Additional space is required for tempdb.

     

    The TPC-C and TPC-E benchmarks require the database size to be scaled with performance target ranges. Consider the Fujitsu TPC-E published result for the Primergy RX300 S5 with 2 Xeon 5570. The dual-socket Xeon 5570 system scored 800 tps-E, for which the required initial database size is about 3TB. The space actually allocated for the data files is approximately 4.5TB, plus another 85GB for log space.

     

    System

    Fujitsu Primergy RX300 S5

    Processors

    2 x Intel Xeon X5570

    Memory

    96GB

    RAID controllers

    5+1

    Disk enclosures

    30

    HDD

    360 (192 73GB 15K + 168 146GB 15K)

    Storage cost

    $148K + $49K for 3 year maintenance

    Raw capacity

    35GB

    RAID 10 Capacity

    18TB

    Estimated IOPS

    360*200 = 72K

     

    For the 360 15K disk drives, based on 200 IOPS per disk, the small block random IOPS capability of this storage system is 72K, excluding RAID 10 overhead. If the actual load is 10,000 IOPS (at the operating system) with a 50/50 read/write mix, then the raw IOPS to disk is 5K reads and 2x5K writes for a total of 15K IOPS to disk. So a 75K IOPS system can actually handle 50K IOPS at 50/50 read/write mix in RAID 10.

     

    If we consider that the active database resides on only 15% of the disk space (3TB of 18TB after RAID 10 overhead), then there is some benefit from the short-stroke effect. If the average disk queue depth per disk were higher than 1, then command queuing capability would result in even high IOPS per disk. The actual IOPS per disk might be anywhere from 200-300 depending on whether the emphasis was on pure performance or balanced price/performance.

     

    Below is a proposed SSD (+ HDD for archival space) configuration.

     

    SSD + HDD configuration

     

    SSD Capacity

    4.5TB

    60-day space

    13TB

    SSD drives

    155 @ 32GB

    62 @ 80GB

    Cost for Intel SSD

    $520 for X25E, 32GB

    $340 for X25M, 80GB

     

    $80K

    $21K

    HDD drives for 60-day space

    20 x 1TB SATA $3200

    or 42 x 450GB SAS

     

    In additional to the above, we need disk enclosures. Ideally I would like to place no more than 4-5 SSD devices on each x4 SAS port. A x4 3Gbps SAS port can support 1GB/s, but if a HBA/RAID controller with 2 x4 SAS ports is plugged into a x8 PCI-E gen 1 slot, we can only expect 1.6GB/s total (single direction) throughput. The Intel X25 SSDs are rated at 250MB/s sequential read, 170MB/s write for the E, and 70MB/s write for the M (all sequential).

     

    The X25-E random 4K IO characteristics are 35K IOPS read, 3.3K write. In the absence of data, let assume the 8K random read is 15.7K IOPS (probably higher), so under 8K random IO, the bandwidth requirement is only 140MB/s or much less for read/write mixes.

     

    The data sheet also say 8K 2:1 R/W 7K IOPS for the X25-E. No random IO data is listed for the X25-M in the datasheet, so it is not clear the X25-M can meet the TPC-C/E random IO requirements for mixed R/W.

     

    A 1U enclosure with 2 SAS ports (daisy chained enclosures not expected) and 8-10 2.5in bays seem appropriate. The 2U enclosures with 24 bays should have 6 independent SAS ports. The 20 or so 3.5in SATA drives for the 60-day space requirement could be accommodated between the internal bays and 1 or 2 external enclosures. The next generation systems and components should be PCI-E gen 2 (5Gbps per lane) and  6Gbps SAS, but we expect higher SSD bandwidths as well.

     

    So the SSD cost structure does seem to support the TPC-E benchmark. It would probably also support the TPC-C benchmark as well, based on the HP DL370G6 result for a

     

    The main issue above is that I have not included RAID overhead. It is my m opinion that the SSD is not fundamentally a single component device, like a disk drive with a single motor. If the SSD were built with dual controllers, and chip-kill ECC on the NAND, then the SSD would be inherently single component failure tolerant. Of course, this is not the case yet. I am just looking forward to when we can do without RAID in SSD. I am not convinced RAID controllers are going to be able to keep up with an SSD arrays anyways.

     

    Since the IOPS capability of the X25E with SLC NAND (not sure for the X25M with MLC NAND), RAID 5 with the higher small block random write overhead is not an issue. So 190 of 32GB or 77 of the 80GB SSDs in RAID 5 would be required.

     

    I should briefly touch on expected performance benefits of SSD over HDD. The Dell TPC-H result did seem to indicate some benefits from SSD, even though there was not a otherwise similar HDD result to compare with. The TPC-H data warehouse queries may generate many table scans for which HDD is fine, there are still loop joins and key lookups, which generate pseudo-random IO. Several TPC-H queries also dump intermediate results to tempdb.

     

    I am expecting TPC-C and E to show reasonable benefits from SSD over HDD. Consider the main TPC-C new order transaction. A typical TPC-C published result might show an average response time 0.3-0.4sec. This procedures processing an order for upto 15 items (average of 10?) which means one update of the Stock table for each item, one insert in to the Order Line table, and one insert to the New Order table, plus a few a others. Since the TPC-C database is very large, each of the above steps might require a disk IO. On a perfectly configured disk system (for OLTP), the average latency could be as low as 5ms even when the entire system drives 200K IOPS.

     

    Still, if you look at the New Order procedure, it is clear each item must be processed serially. The SQL Server engine might use the Scatter-Gather IO API to consolidate IO calls from multiple concurrent users, but in each step in the new order is issued sequentially, after the previous step completes. Since there are over 20 steps, if each step take 5ms, then we can see why the average duration is well over 100ms.

     

    With SSD, the IO latency should drop to 0.08 milli-sec (80us), meaning 20 steps should in the range of 2ms. Because there are fewer transactions "in-flight" at any given point in time, the expectation is that the SQL Server engine has less to keep track of.

     

    Consider a system supporting 600,000 tpm-C. Thats 10,000 new order transactions per second. If each new order procedure averages 0.3sec, then there are 3,300 new order transactions in-flight at any point in time (plus others).

     

    TPC-C also has performance/size scaling requirements. A 600K transactions per minute result requires approx 50,000 warehouses, each of which requires approx 84MB, for a database size of 4.2TB. The recent 600K tpm-C results required 1000+ disk drives (no RAID requirement) meaning the IOPS load is probably 200-300K, possibly a R/W mix close to 50/50.

     

    Since Wes say the FusionIO 640GB devices are out, lets consider what kind of system would be required. The FusionIO is built with a PCI-E interface, that is, it plugs into the PCI-E slot directly, so it probably comes with its own driver. The second generation FusionIO matches up nicely with either PCI-E gen 1 x8 or PCI-E gen 2 x4 in terms of bandwidth.

     

    For 4TB we need 7-8 of the 640GB drives. So, ideally a system should be configured with 9 PCI-E gen 2 x4 slots, plus with embedded devices (the extra slot or two is for additional network or SATA drives). The new Intel 5500 IOH has 36 PCI-E gen 2 lanes, plus the x4 gen 1 off the ESI. So a single IOH would support 9 x4 slots, plus GE and SAS off the ESI. The HP ML/DL370G6 actually uses 2 IOHs for a mix of x16, x8 and x4 slots.

     

    Per Grumpy below, at this point in time, SSD devices have very different characteristics, particulary with regard to writes. Writes to NAND need to be in large blocks. Depending on how the SSD controller is implemented, expect some issues. So it may not be time yet to deploy transaction processing to SSD. DW might be worth considering. Still, we should see OLTP benchmarks plus accompaning details to better understand SSD characteristics. Where are Bashful, Doc, Dopey, Happy, Sleepy and Sneezy DBAs?


  • Automating dm_exec_query_stats and dm_db_index_usage_stats analysis

    Everyone should know by now how really useful the DMVs dm_exec_query_stats, and dm_db_index_usage_stats and the associated DMFs for obtaining the SQL and XML plan: dm_exec_sql_text, dm_exec_query_plan, and dm_exec_text_query_plan. Of course it has been explained that dm_exec_query_stats is not a replacement for SQL Server Profiler and SQL Trace. The DMV is only reliable if execution plans are not frequently evicted from the procedure cache and if SQL is not frequently recompiled.

    Still the DMV dm_exec_query_stats is popular because the results are available with a simple query. The complexity of setting up a trace, making sure it is not adversely impacting server performance, making sure space is available, and most of all, the effort of parsing the trace are all avoided.

    So what do we do next? From the top query statistics we can start looking at the SQL and the execution plans. One minor annoyance is that if we use CROSS APPLY to dm_exec_query_plan, each XML plan has to be saved individually. This can be annoying if one has to remote into to a PC in the server room and this connection is slow.
    There is much information in the XML plan that would be helpful if we could see it in a grid along with the execution statistics.

    From the index usage statistics, we can eliminate unused indexes. There might also be indexes that are infrequently used. Sometimes it is obvious that certain indexes can be consolidated. Other times, it is necessary to examine the execution plan to determine if another index is sufficient. The XML plan analysis cross-references in which plans each index is used. All of this is tedious work which can be automated. So this is the purpose of the SQLExecStats tool.

    A few people have tried it on production systems. There are still bugs, and feature requests, but hopefully I can turn this into something useful over time.

    lets use this url for the latest build rather than changing the url for each build

    http://www.qdpma.com/tools/ExecStats.zip

    The above build is now 20090628. The (poor excuse for) documention is also online http://www.qdpma.com/SQLExecStats.htm

    Feedback is welcome, in particular, send me the output files so I can try to figure out if this is collecting the right information. Support will be based on availability.

    Note: SQL Server 2008 sp1 fix now correctly reports parallel execution plan total_worker_time (CPU). It is not fixed on SQL Server 2005 sp3, latest cu?

    Limitations:

    European Localization The XML plan is stored as a string, with numbers in US-en format. My program extracts plan cost and estimated row numbers from the XML plan. The previous version did not specify any localization, but I think the default is the host environment. Hence the code double.Format(value) is interpreted as whatever the local environment is instead of US-en. The builds since 20090623 specifies the US-en format in most places.

    If some one with European number formats could run this program and send me the output, I would appreciate it.

    Bug fixes, features etc

    2009-06-22: The password now should display *, I don't do security work, and it shows. Some formatting was changed to right align, comma thousands separator, Very large numbers in Excel cannot be changed, 4 bytes integers are represented as numbers, larger numbers are represented as text.
    I will do more formatting changes later

    2009-06-28
    The error described by Zen occurs with case sensitive collations. Apparently one SQL query I issued from the executable was not a case correct with the columns I declared on the receiving side, and this is sufficient to cause the data table load to fail.
    The current build is really an intermediate build. I am adding the ability to collect table and index information for multiple databases and not everything is consistent yet.
    Be careful on trying to collect for too many databases, and too many execution plans, there are checks for over size yet.

  • HP Oracle 10X Extreme Performance Data Warehouse with Exadata Storage Server

    For several months, we have seen ads for the joint HP/Oracle RAC and Exadata storage combination talking about extreme performance (10X faster) for large data warehouses. One thing I like about Oracle is that they have courage to pursue technology with deep hardware design implications, even if it takes several iterations to iron out the major issues. I just got around to looking through the Oracle papers on this. Like OPS/RAC, the Exadata technology has implications for how hardware is built. Hardware vendors can be squeamish on designing silicon for special requirements if there is not a viable installed base. This leads to the chicken and egg, which comes first situation that many other vendors cannot successfully initiate. Oracle dares to do this, and amazingly get customers to shell out big chunks of money on the first iteration. This in turn provides justification for gutless risk averse hardware vendors to do their part.

     

    I will start by saying that the SAN systems out there today are designed for transaction processing, not data warehousing. Most SAN systems are designed with a certain number of FC ports, with the intent of supporting 1-4 disk enclosures (typically 15 disk drives each) per FC port. The SAN controller (or service processor) is a significant portion of the overall cost. Configuring one enclosure per FC port leads to a higher amortized cost per disk than having multiple enclosures per FC port, but sequential bandwidth is still limited by the number of FC ports. Depending on the architecture, it is possible to sustain between 330MB/sec (loop architecture) and 390MB/sec (star architecture) per 4Gbit/sec FC port. So it can require 3 FC ports and 45 disks to support 1GB/sec, even though each individual 15K disk drive can sustain 125-160MB/sec. The amortized cost of a SAN system might be $2,000 per disk, so each 1GB/sec through-put costs around $90K.

     

    This is why I have advocated direct attach storage for data warehouses, where each 15 disk enclosure can sustain 800-1000MB/sec at an amortized cost of about $500 per disk. But most people do not like inexpensive high-performance storage solutions. And none of the expensive SAN systems provide sufficient bandwidth for really high-end data warehouse systems.

     

    In the Exadata system, the interconnect between host and storage is InfiniBand, which signals at 5Gbit/s using a x4 wide connector (like SAS) for a net bandwidth (after 8B/10B encoding) of 16Gbit/s or 2GB/sec. The Exadata Storage Server (or cell) is an HP DL180G5 with 2 Xeon E5430 quad-core 2.66GHz processors, 8GB memory, a P400 RAID controller, 12 450GB 15K SAS or 1TB 7200 RPM SATA disks and dual-port Infiniband HCA. Curiously, of the 5.4TB raw storage with 12 450GB drives, only 1.5TB is available. With RAID 10 overhead, there is 2.7TB. Some space is required for internal use, but 1.2TB seems to be rather large.

     

    A complete pre-configured HP/Oracle Database Machine full rack comprises 8 HP DL360 servers with two Xeon E5430 quad-core processors, and 32GB memory for running Oracle RAC, 14 Exadata storage cells, 4 InfiniBand switches (and 1 Gigabit Ethernet switch for auxiliary communications). A half-rack has half of the above components. Each storage cell is listed as supporting 1,000MB/sec with SAS drives and 750MB/sec with SATA drives. The listed bandwidth for the pre-configured full-rack is 14GB/sec. It is stated that Exadata bandwidth scales linearly with the number of racks, but without explicit performance numbers.

     

    Compare the Exadata cell with an EMC CLARiiON CX4-960 mid-range SAN. The CX4-960 comprises 2 SPs, each with two quad-core processors, 16GB memory per SP, for which the minimum meaningful configuration is 16 disk enclosures (240 disks) over 16 FC ports. So the resource allocation per SP is 2 quad-core processors, 16GB memory and 120 disk, with probable sequential bandwidth of 3GB/sec (380MB/sec per FC port). The Exadata cell provides approximately the same compute power, 8GB memory, for 12 disks targeting 1GB/sec sequential bandwidth.

     

    The purpose of the massive computer power per disk in the Exadata cell relative to a standard SAN is to offload compute functions from the main database engine. Concentrating capability, be it compute power or IO bandwidth, in a single system is always difficult, so distributing work can be useful if it can be done effectively. One candidate is compression. (SQL Server 2008 can store tables and indexes with row or page level compression, which take CPU resources. Oracle probably has comparable capability as well.) [Exadata is for Oracle systems only?] Offloading this to the storage element might be desirable. Finally, the Exadata cell is not just a storage engine, but can also handle database protocols.

     

    In addition to a command such as fetch this block, and decompress, the Exadata can also handle SELECT * FROM Table WHERE col = ‘SARG’ (Smart Scan Offload Processing). In a data warehouse, that expectation is that queries are ad-hoc, for which indexes have not been built. So a data warehouse must be able to power through very large table scans. This requires both IO bandwidth and CPU resources, as a database table scan is not a simple IO operation (see my other posts on this matter).

     

    The very recent TPC-H report for the HP BladeSystem of June 3, 2009 uses the Oracle Exadata Storage Servers (more on this below) and has price information. The full cost for 6 Exadata storage cells and supporting components is $536,516. (Of this, each Exadata Storage Server is $24,000, and the cost of the Exadata software is $360,000. Interesting, the price of a similarly configure DL180G5 is $14,000.) Three-year support is another $479,846 for a total 3-year cost of approximately $1M.  It is unclear how much of the discount applies to the Exadata versus the Oracle database software. The amortized list price per cell is $166K or 14K per disk. That Oracle sells this clearly sends the message that people do not look at cheap hardware for the data center.

     

    Now when HP published a 1000GB TPC-H report for a HP Superdome with 32 Itanium 2 9140 sockets, and 64-cores on April 29, 2009, I was very puzzled. What was the purpose of this publication? HP had already published a 10TB result on a Superdome with 64 sockets (128 cores) of the same Itanium 2 9140 processors back in March 2008 with no real disparity (acknowledging that results at different sizes are not directly comparable). Then about one month later, HP published the 64-node Oracle RAC with Exadata storage result of June 2009 (note that the RAC servers are BL460 with 2 Xeon quad-core 3GHz processors, different from the pre-configured database machine).

     

    System

    SuperDome

    BL460 Cluster

    Database

    Oracle 11g + Partitioning

    Oracle 11gR2, RAC

    QphH@1000GB

    123,323

    1,166,976

    TPC-H Power

    118,577

    782,608

    TPC-H Throughput

    128,259

    1,740,122

    Total System Cost

    $2,532,527

    $6,320,001

    Processors

    32 Itanium 9140 1.6GHz

    128 X5450 3GHz

    Cores

    64

    512

    Memory

    384GB

    2080GB

    Disks

    768

    6 Exadata Cells

    HBA

    64 Dual Port FC 4Gb/s

    64 x 2 Infini-band

     

    Was the intent that people would see the Oracle RAC with Exadata storage and draw conclusions based on the approximately 10X difference in performance with another recent 1000GB result?

     

    What conclusions can we draw from each of the two results? The first matter to understand is that the TPC-H scale factor 1000GB means the LineItem table, data only, is approximately 1000GB in size. The full database with all tables and indexes is approximately 1700GB. So the entire database fits in the 512-core RAC system with 2080GB and not in the 64 core system with 384GB.

     

    Next the 64 core Itanium system has 64 dual-port FC adapters, meaning 128 x 4Gbit/sec ports, which could support 42GB/sec based on 330MB/sec per 4Gbit/s FC port. But it is unlikely that 768 disk drives can sustain this volume (55MB/sec per disk) in a SAN. It is also interesting that system was configured with the EVA 4400 while other HP SuperDome Unix results employ the MSA1000 storage. (It is nice to have 32 EVA SAN systems or 256 MSA 1000s available for performance testing).

     

    Note that TPC-H query 1 (a table scan of most of the LineItem table) takes 169.8 seconds on the 64 core Itanium, and 10.3 seconds on the 512 core RAC system. This means that if the data had to be read from disk, then the disk system would have to support 6GB/sec on the 64 core and 97MB/sec on the 512 core system. The 64 core Itanium system definitely has to read data from disk and the configured disk can easily support 6GB/sec (only 8MB/sec per disk) while the 512 core system has 6 Exadata storage systems which can support exactly 6GB/sec per specifications, nowhere near 100MB/sec. But all the data fits in memory so disk reads for data does not occur given the TPC-H sequence where the test run occurs after the database load and index build.

     

    The data load time on the 64-core Itanium was 1:07:12 and 2:22:57 on the 512-core RAC, which also may indicate relative storage performance

     

    Another point that can be noted is that there is an 8X difference in cores for a 9.5X difference in performance. However, the Core 2 architecture Xeon 3.0GHz cores (45nm) are much more powerful than the Itanium 2 1.6GHz cores (90nm). The gain in Power is 6.6X and 13.6X in through-put. TPC-H scored is based on a geometric mean of 22 queries, some of which are small and other large. The geometric mean has the effect that a 2X speedup in a small query has the small benefit as 2X in a large query. The issue is the getting 10X gain in a small query is very difficult so scaling on Power is attenuated.

     

    In summary, the two performance reports are definitely sufficient to assert that Oracle RAC can scale, but having single node and 8 node performance reports for the BL460 would be confirm this. The two TPC-H reports say nothing of Exadata storage system performance, either in the sustainable sequential bandwidth or the value of the Smart Scan Offload processing. If the 64 node, 2080GB memory Exadata storage result had been reported at 10TB, then we might have an idea of its capabilities. Based on the 100GB/sec table scan estimate above, it would require 100 Exadata cells, which might beyond its actual scaling capabilities.

     

    The performance data cited in Oracle’s Exadata whitepaper lack details to attribute the source of the performance gain. Given that most SAN systems are horribly configured for Data Warehouse performance, it is quite probable that dropping in the preconfigure full rack Exadata with 14GB/sec sustained table scans can easily generate the quoted numbers.

     

    Notes on the Exadata Storage Server

    At the time this product came out, the choice of the DL180G5 was reasonable. However, this system, based on the Intel 5100 chipset, has 1 x8 and 2 x4 PCI-E Gen 1 slots.

    I am guessing that the Infini-band dual port HCA occupies the PCI-E x8 slot, the P400 RAID controller occupies one of the x4 slots and that the 12 internal disks are connected to one of two x4 SAS ports on the P400.

    Technically, the each Infiniband DDR (5Gbit/s) x4 channel is 20Gbit/s, which after 8B/10B encoding is 16Gbit/sec (2GBytes/s) which could fully consume a x8 PCI-E Gen 1 channel. There are 2 IB channels for path redundancy, not bandwidth aggregation, as the x8 PCI-E bandwidth is limited to 2GB/sec.

    Since the disk drives are probably on a single x4 SAS port, this bandwidth is limited to 1GB/sec, even though each 15K disk drive can do 160MB/sec (not accounting for RAID 10 implications). So while there is 4GB/s combined bandwidth on the Infiniband links, only 2GB/sec can be sent over the PCI-E port to the IB HCA, and only 1GB/s can be drive from the RAID controller.

     

    Now that the DL180G6 is available, with 1 x16 and 2 x8 PCI-E Gen 2 channels, I would retain the IB dual port x4 (if it supports PCI-E gen 2), go to the new P410 RAID Controller, the 25-bay SFF drive bay (assuming the drive bays can be split across two x4 SAS channels), and 24 146GB 15K or 300GB 10K SFF drives (There is no point offering a 250GB SATA option). This unit should support 2GB/sec.

    I might even be tempted to bug HP to split the drive bays 4 channels, 6 bays per channel, x4 SAS on each channel. There would be 2 P410 controllers, each driving 2 channels. The 6 disks in each channel might not drive 1GB/s, but 4 channels might support 3GB/s? I priced this around $19K.

    To bad we cannot have a generic Infiniband SAN, and skip the Exadata software ($5K licensing per disk?).


  • Microsoft's-Thinking Oracle RAC Think Again Whitepaper

    Now that I have actually kind of read this paper, I will comment on it. I say kind of because I cannot focus on non-technical matters, the same way I get really frustrated trying to explain something to a person that just cannot understand the difference between an argument substantiated by hard analysis and an argument that seems logical but is not the predominant underlying effect.

    First, Microsoft competes with Oracle, and MS gets hit with this all the time so of course they need to collect in one place their best arguments. On a side note, many Microsoft whitepapers list the author(s). When an artist paints a masterpiece, he/she will put a signature on it because he takes pride in his work. Not all talented artists can paint what they want and must accept commercial work (having food to eat is not overrated?). In such cases, he does not want his peers to know that he had to stoop to do such work and prefers it be anonymous. There is no listed author for this paper.

     

    The first argument in the paper is valid. Only a small percentage of customers have actually deployed RAC. Back in the OPS days, one prominent expert said that he had never seen for himself an OPS deployment that actually achieved positive scaling, but he did hear from someone he considered competent on this matter that one customer did. RAC is much better than OPS (to get something right, you do have to screw it up once or twice or trice). The Oracle people I talk (more hurling insults/ridicule back and forth than talking actually) with do say that the technical skills to deploy RAC are not common among Oracle DBAs.

     

    But the fact that Oracle RAC is not really required by most people does not stop sales and marketing from making a big deal out of it. Funny how CIOs are influenced by the scaling argument (men are most susceptible the size thing) even though it is of no consequence to their specific environment.

     

    Of course RAC is expensive. The alternative is buying big iron which is also expensive. If a solution is not really painfully expensive, is it any good? If one project manager turns what should have been a $1M project into a $10M project and another manager does his project for $1M, who will know it should have been a $1M project? Who will get the higher job ranking come evaluation time? That is: the next promotion and big raise?

     

    When a project is so hideously expensive, the CIO must go to the CEO, who in turn may need to go the board for approval. Well if this project becomes really messed up, the big bosses are not inclined to declare it a failure because they will look bad as well, that is, they had endorsed the project. When a small project runs into difficulty, no matter who is at fault, it is not hard for the CEO/CIO to pin the blame on a lowly project manager, i.e., fire him.

     

    If what you need is SQL Server cluster style fail-over redundancy, then the expense of RAC licensing does not make sense. So RAC only makes sense if scaling performance is needed. The paper talks about scale-out OLTP [by which I mean with distributed partitioned views]. Scale out on any DBMS is not simple and not just for the reasons described. If you understand in detail how the [SQL Server] cost base optimizer (CBO) works with respect to local and remote data, you will understand the horrible implications. [In the execution plan with remote data, the row count estimates are mostly 1,000 or 10,000 rows. When two sources of 1,000 rows each are joined, the output is 1M rows. When this is joined to a large local table, the plan is frequently a table scan.][Technically, RAC scales out on hardware, but does not have a DPV database design.]

     

    If one wanted to point out an issue with Oracle RAC for OLTP, the most apparent is the near complete absence of published benchmark results. There is one RAC TPC-C (there are no Oracle TPC-E period) and it was done long ago (12/2003). The TPC-C RAC pub was 16 HP Integrity rx5670 each with 4 Itanium 2 1.5GHz for a total of 64 cores, a score of 1,184,893tpm-C at total cost $6.5M, versus the contemporary result for a HP Superdome, same 64 Itanium 2 processors, scored 1,008,144 tpm-C at $8.4M cost. The RAC had 17% better performance at 22% lower cost. Looking closely at the price detail, the cost of memory for the rx5670 was $1.4M versus $5M on the Superdome. It is a little difficult to compare pricing because the Superdome discount was about 45% versus 25% for the clustered rx5670. The major software licensing difference was $640K for RAC and $320K for partitioning. So a big chunk of the price advantage is because of a memory pricing anomaly.

    Nodes

    1

    16

    System

    Superdome

    rx5670

    Database

    Oracle 10g

    Oracle 10g+RAC+Part

    Report Date

    11/04/2003

    12/08/2003

    Tpm-C

    1,008,144

    1,184,893

    Total System Cost

    $8,397,262

    $6,541,770

    Price/Performance

    $8.33 per tpm-C

    $5.52/tpm-C

    Processors

    64 Itanium 2 1.5GHz 6M

    16x4 Itanium 2 1.5GHz 6M

    Memory

    1024GB

    768GB (16x48)

    Disks

    2100+120

    672+1344+224

    HBA

    28 FC 2Gb/s

    64 x FC

    Costs

     

     

    Processors

    $1,280,000

    ($40K ea)

    $528,000

    ($8.25K ea)

    Memory

    $4,998,400

    $39K for 8GB

    $1,440,000

    $7500 for 4GB

    Server Subtotal

    $7,085,433

     

    Storage

    $5,032,188

     

    Server+Storage

     

    $4,694,618

    Oracle 10g

    $1,280,000

    $1,280,000

    RAC+Partitioning

     

    $960,000

    Discounts

    $7,000,000

    $1,900,000

    Since then, there have been no Oracle RAC TPC-C publications. This usually means there is no good news. If MS wants to criticize RAC for OLTP, I am ok with it. In other posts, I argued that going forward, the new Intel QPI and existing AMD HT interconnects should allow building big-iron (scale-up) systems with better scaling than RAC, on account of the higher bandwidth and lower latency that can be achieved versus RAC going over Infiniband. This is a theoretical argument that needs actual measurements to assess its validity.

     

    On the Data Warehouse side, there are many Oracle RAC TPC-H publications. From the results, I think RAC has decent scalability, and I am really happy that I can bring a better balance of processor power, memory and storage than I can with big iron, or rather, a max'ed out server system (i.e., expensive big capacity DIMMs, and high priced storage). So apparently my arguments above on interconnect bandwidth and latency are not as important on DW. MS does mention they will soon have their own MPP solution, so that good, because I am too old to learn Oracle, and I am relatively happy (meaning I am bitching a lot) doing big SQL Server projects.

     

    I will expand on what I mean by balance. Lets compare 2 recent TPC-H 1000GB results. A 64-node, BL460c with 2 QC Xeon 5450 3.0GHz, 32GB memory per node, (128 sockets, 512 cores, 2TB memory) scored 1,166,976 QphH at total cost $6.3M compared to a 32-socket, 64 core Superdome, 384GB  memory score of 123,323 QphH at cost $2.5M. Put aside for this discussion the fact that Itanium is still dual core on a 90nm process while Xeon is quad core on 45nm. For 8X the number of cores, the performance gain is 9.5X. Keeping on mind that the Xeon 5450 core is about 50% faster than the Itanium 1.6GHz based on SPEC CPU int 2006, this is about right. Also the RAC system used the Exadata storage which offloads some processing, but probably not too much.

     

    The cost breakdown for the Itanium is $736K for processors, $800K for memory, $1.3M for storage (EVA SAN), $470K for Oracle (+54% discount on all). In the RAC-Xeon system, it is $160K for processors, $85K for memory, $200K for Infiniband, $540K for storage, $6M for Oracle (30% discount on Oracle only). OK, this is not what I was really getting at. If you max out a system, it means buying the 8GB DIMMs which cost 4X more than the 4GB DIMMs but does not contribute proportionately higher performance. The big gain in the RAC system is that its possible to configure enough memory to fit the database and make do with a less powerful storage system. This argument goes out the window if the RAC memory cannot encompass the entire DB plus working space. Still, it is good to have this avenue.

     

    System

    Superdome

    BL460 Cluster

    Database

    Oracle 11g + Partitioning

    Oracle 11gR2, RAC

    QphH@1000GB

    123,323

    1,166,976

    TPC-H Power

    118,577

    782,608

    TPC-H Throughput

    128,259

    1,740,122

    Total System Cost

    $2,532,527

    $6,320,001

    Processors

    32 Itanium 9140 1.6GHz

    128 X5450 3GHz

    Cores

    64

    512

    Memory

    384GB

    2080GB

    Disks

    768

    128+6x12

    HBA

    64 Dual Port FC 4Gb/s

    64 x 2 Infiniband

    Costs

     

     

    Processors

    $736,000

    ($23K ea)

    $159,872

    ($1.2K ea)

    Memory

    $850,000

    $35,611 per 16GB

    $92,000

    $339 per 8GB

    HBA

    $233,680 ($3.5K ea)

    $199,000 (HCA+switch)

    Server Subtotal

    $2,397,614

    $696,419

    Storage

    $1,390,861

    $536,516

    Software

    $624,245

    $6,425,600

    List Price (ex maint.)

    $4,412,720

    $

    Discounts

    $2,210,000

    $1,965,000

    Maintenance (net)

    $330,067

    $626,054


  • First Nehalem TPC-H

    Earlier I talked about the first TPC-C and TPC-E results for 2-way Nehalem, ie, the Intel Xeon 5500 series. The results were spectacular relative to the previous generation Xeon 5400 series, (2.5X gain on the Intel slide deck for database OLTP) and were pretty much hitting the same range as 4-way Xeon 7460.

    I pointed out that while these were legitimate results, the TPC-C and TPC-E benchmarks generate high call volume, about 1000 RPC stored procedure calls per second per core. Meaning each call averages around 1 CPU-ms. This type of usage benefits from the Intel Hyper-Threading feature. It was around 10-20% back in the NetBurst days. I am inclined to think it is now much larger with Nehalem, possibly 30-40%. An application like TPC-H would not benefit from HT. Nehalem should still show a moderate performance gain over Core2 on the basis on micro-architecture improvements alone (plus the integrated memory controller).

    Well, my thanks to Dell for publishing a TPC-H for Nehalem. Notice to other vendors: get going, slackers! Below is the 2-way Xeon 5500 versus Xeon 5400 or 5300, and 4-way Xeon 7460 or 7350.

    System             Configuration                                                    TPC-H@100GB

    T610                 2 Xeon 5570 Quad-Core 2.93GHz, 8M L3, 48GB               28,773

    ML370G5          2 Xeon 5355 Quad-Core 2.66GHz, 2x4M L2, 64GB           17,687

    DL580G5           4 Xeon 7350 Quad-Core 2.93GHz, 2x4M L2, 128GB         34,990

     

    System             Configuration                                                      TPC-C

    DL370G6           2 Xeon 5570 Quad-core 2.93GHz, 8M L3, 144GB      631,766 (Oracle/Linux)

    ML370G5          2 Xeon 5460 Quad-core 3.16GHz, 2x6ML2, 64GB      275,149

    DL580G5           4 Xeon 7460 Six-core 2.66GHz 16M L3, 256GB         634,825

     

    System             Configuration                                                      TPC-E

    Fujitsu RX300    2 Xeon X5570 Quad-core 2.93GHz, 8M L3, 96GB     800.00

    TX300 S4          2 Xeon X5460 Quad-core 3.16GHz, 8M L2, 64GB      317.45

    Dell R900          4 Dunnington Six-core 2.66GHz, 16M L3, 64GB       671.35

    It is unfortunate that the last TPC-H results on Intel were the 65nm Xeon 5300 and 7300 series (except for the Unisys 10TB 16-way). Lets just suppose that a Xeon 5470 3.33GHz would score 20% higher than the Xeon 5355 2.66GHz. The 20% frequency difference might contribute 10%, and the micro-architecture improvements going from the 65nm to 45nm Core2 contribute the rest (The larger cache is not expected to benefit TPC-H high row count queries). This would make the Xeon 5500 series 35% faster on TPC-H than the 5400, which is more than I expected just from the Core 2 to Nehalem architecture improvements.

    Of course, this Dell result uses the Fusion-IO SSD drives that plug directly into the PCI-E slots, instead of going through a RAID controller, then the SAS interface. I am looking through the individual TPC-H queries, comparing against both the 2-way 5355 and 4-way 7350 results. I think there is reason to believe that the SSD storage improves performance over a large array of disk drives. A large disk array can deliver sufficient sequential bandwidth, but some SQL operations will generate small block IO, and writes to tempdb should be much faster. The 4-way 16-core Core 2 (7350) has better overall performance than 2-way 8-core Nehalem, but on some individual queries, the Nehalem system scores better.

    I am inclined to think the 2-way Opteron six-core (Istanbul) could be close to 2-way Nehalem quad-core on TPC-H, despite the large advantage in TPC-C/E. Intel has a myopic view that the big-dog processor should be reserved for the 4-way+ systems. (It may not make sense to put 8-core Nehalem EX into a 2-way system if the 6-core 32nm Westmere core will be available soon.)

    To reiterate, it is very important that all key benchmarks are published so we can get a good idea of what to expect under each circumstance. No one (among reasonable people) expects miracles and magic. It is a complete picture that is important. Knowing that you should expect 20% is better than a misguided belief or hope for 2X.

    Some people think I am paranoid and deeply cynical. So I will now resemble this accusation. My thinking is Intel had the full set of benchmark results months ago. It was pointed out that some benchmarks, mostly the ones the benefit from HT, showed huge gains, while others just show good gains. Every organization has worthless marketing types that feel the need to justify their salary. So it was decided to withhold the DW results, just so the worthless crap marketing slides could show the big numbers instead of a complete picture. The complete picture is important, we are happy that Nehalem has arrived. We can work its actual performance characteristics; spectacular gain on some, good gain on others. So stop tinkering with the slide deck!

    There is the truth, the whole truth, and nothing but the whole truth. Some people can handle the item 1, but know to stay well clear of 2 and 3,

    The use of the FusionIO SSD is interesting. As mentioned above, it interfaces directly to PCI-E. The first generation was PCI-E gen 1 x4, and can do 750MB/s (32K) read, 500MB/s write, 116-119K IOPS (4K), in capacities of 80 and 160GB for SLC, 320GB for MLC. The second generation can do 1.5GB/s read, 1GB/s write, 200K IOPS (4K), in capacities 160/320 SLC, 640GB MLC. The interface is PCI-E x8 gen 1 or x4 Gen 2.

    The Intel 5520 chipset has 36 PCI-E gen 2 lanes plus the ESI. A 2-way Nehalem system can be built with 1 or 2 5520 IOHs. The Dell T610 has 1 IOH for 2 x8 and 3x4 slots available (x4 for the internal SAS?). The Dell TPC-H config has 4 Fusion-IO drives, which is fine for this test. An actual production system might want to configure more SSDs. The HP ML370G6 with 2 IOHs has 10 slots (2x16, 2x8, 6x4, one for NICs). The x16 slots are useless for database servers because no network or storage IO adapters can really use x16 bandwidth, and definitely cannot make good use of the unbalanced slots. The x16 slots might be useful for HPC or something. Hopefully Dell or HP will make a system with something like 7 x8 and 4 x4 slots. Now to make maximum use of the (current) Fusion IO SSDs, we would have 18 PCI-E x4 gen 2 slots, but I think Fusion IO could be persuaded to do a double wide SSD instead.

    SuperMicro does have a dual 5520 IOH motherboard with 7 x8 PCI-E slots. The onboard SAS occupies a x8, and the dual GbE NIC takes another x4. It looks like one x4 is not wired. For a server, I would have used a x4 for the onboard SAS because that may only connect the boot drives, and I would stick the GbE NIC off the south bridge ICH. The x4 Gen2 should be made available for 10GbE. I used to buy SuperMicro systems because their wide motherboard selection allowed me to get the one with the best IO arrangement for database servers. But when SAS came out, I had a hard time getting the right connectors. I may give them another try if Dell or HP does not do a 7 x8 PCI-E Gen 2 system.


  • Big Iron Revival, Intel Nehalem EX and AMD Magny-Cours

    Yesterday Intel held a product announcement press event for the upcoming Nehalem EX, which will succeed the current Xeon 7400 series based on the Core 2 micro-architecture for "expandable system", i.e., 4-way and higher, in late 2009 or early 2010. The current Xeon 5500 series (also Nehalem architecture) has 4 cores, 8M shared L3, 2 QPI links, and 3 DDR3 memory channels. Nehalem EX has 8 cores, 24M shared L3 cache, 4 QPI links and 4 FBD memory channels (there is now a Scalable Memory Buffer between the memory interface and memory, did Intel just move the AMB from the DIMM to the motherboard?).

    AMD has also recently discussed their plans. The current quad-core Shanghai gets a frequency bump from 2.7GHz to 3.1GHz, and a six core Istanbul should be released very soon (June, announced at 2.6GHz). See the Johan de Gelas Anandtech article on Istanbul. It describes HT assist, (essentially a snoop filter for HT) as using 1M of the L3 cache. The HP ProLiant DL585G6 for Istanbul also appears to be HT version 3.0 or HT3, upping the HT transfer rate from 2GT/s to 4.4GT/s.

    Later on, there will be Magny-Cours, which would be 2 Istanbul die in one package. Istanbul has six cores, 3 Hyper Transport links and 2 memory channels. In Magny-Cours, the two six core chips are linked by one HT link, so the external package will have 12 cores, 4 HT links and 4 memory channels. After this, a new improved micro-architecture would arrive?

    Now there have been big iron Windows systems for many years. The HP Superdome supports up to 64 Itanium 2 sockets. The problem has been that Intel has not kept pace with Itanium. The current Itanium 9100 series, Montvale, is a 90nm dual core, while the Xeon line is at 45nm and six+ cores. Tukwila, the 65nm quad-core Itanium that should have been launched in 2008, was recently delayed until 2010. Supposedly Itanium should finally be caught up on process technology in 2011 with the 32nm Poulson. Unisys (ES7000 7600R), NEC (Express5800/A1160) and IBM (x3950M2) all have had 16-socket capable Xeon systems for a while. HP has the 8-way ProLiant DL785G5 for Opteron processors (I really would like to get the architectural diagram for how HP connects the 8 sockets). I have not followed Sun since I focus on Windows/SQL Server. (Sun has the 8-way x4600 for Opteron. see http://www.sun.com/servers/x64/x4600/arch-wp.pdf for an architectural diagram on how 8 Opterons are connected in a twisted ladder)

    Still, I consider this to be a revival or perhaps true arrival of big iron because of the issues in the past on scaling beyond 4-sockets, both in terms of performance and price-performance.

    Previously, there were technical challenges in scaling the Intel Xeon beyond 4 sockets, both for the system vendors in designing such a system, and the DBA/developer in getting their application to scale beyond 4-sockets. For an OEM to build an 8-way+ system, it required the effort to built custom chips, the market volume was low, and Intel kept changing the FSB. All of this meant there was a big step up in price per socket going from a 4-socket system to 8, 16 or 32.

    This was the rational for Oracle RAC. Instead of buying really expensive big-iron hardware, one can buy lower cost high volume hardware and really expensive software licenses. Think about it. Scaling up on big iron or a RAC-type technology depends on interconnect bandwidth and latency. For either the Intel QPI or AMD HT, it should be possible to achieve far better bandwidth and latency in big-iron than a RAC-type solution. The best Infini-band can do now in a x4 link is 40Gbit/s (5GB/s) at approx 1us latency.

    Now that there is prospect of stability in the Intel processor interconnect, my expectation is that we should now see 8-way+ systems at a less severe price premium over 4-way systems. (there will always be a premium because validating and supporting big systems requires deeper technical skills). On AMD Opteron, having the 4 HT ports from one package enables 8-way glue-less systems (with fewer hops) and helps in building 8-way+ (with glue?).

    In the Intel announcement was that 8 OEMS have 15 or so 8-way+ (including 16 and 32-way) Nehalem EX systems in the works. IBM, NEC and Unisys are obviously 3 of the OEMs, given their recent commitment to big-iron Xeon. Fujitsu and Hitachi might be another 2, as the Japanese players love big-iron. Sun should be one for 6 of the 8 OEMs. I am guessing this means HP and Dell are the two remaining OEMs. HP is no surprise. They already have the 8-way Opteron. Their commitment to Itanium means that HP would have built a chipset around QPI for the next generation, which is the same processor interconnect on Nehalem. Dell is the question. Their attitude might be that they do not expect to sell many big-iron systems, considering the technical difficulties they had in the past on this. To sell big iron, it is absolutely necessary to have top technical expertise to go into customer shops to find out if it is the right solution and what changes need to be made to deploy successfully. (OEM reps are invited to drop hints, even if its still a company secret, we will keep it just between us)

    Up to Windows Server 2008 RTM, the OS does not support more than 64 cores, physical or logical. This limit will be lifted with Windows Server 2008 R2, accompanied by SQL Server 2008 R2(?). Both the Unisys 7600R and NEC A1160 posted TPC-E benchmark results for 16-sockets, but only 4 of the 6 cores in the Intel X7460 processor enabled, to stay under the current 64-core limit. Scaling was decent, but not spectacular, going from 721tps-E@4-sockets/24 cores, to 1156 tps-E@8S/48c, to 1400tps-E@12S/64c and 1568tps-E@16S/64c.

    Note that scaling large/(hard) NUMA systems require proper use of port affinity settings, and how interrupts are handled. Windows 2008 R2 supposedly has a much improved disk I/O handling on NUMA systems.

    The Intel announcement mentioned that 4-way Nehalem EX will have 2.5X+ performance over 4-way Xeon 7460, based on a very recent internal measurement using OLTP workload, i.e., TPC-C or TPC-E. This is also inline with the huge TPC-C & E gains posted by 2-way Xeon 5500 over Xeon 5400. Previously I discussed this matter. Each Nehalem core should have moderately better performance than a Core 2 micro-architecture core. Nehalem systems have more memory channels to better support multi-core scaling. The Nehalem EX 4-way system has 16 memory channels supporting 32 cores, versus the Xeon 7400 (7300 MCH) 4 memory channels supporting 24 cores. Nehalem EX will have 8 physical cores compared with 6 on Xeon 7460. Finally, both TPC-C and TPC-E benefit from Hyper-Threading, a feature from the Pentium 4 (NetBurst) micro-architecture (designed in Oregon), but not implemented by Core 2 (designed in Israel). Anyways, 2.5X over X7460 means 1.6M tpm-C or 1700 tps-E.

    Now both TPC-C and TPC-E are OLTP benchmarks (workloads). The interpretation should not be that HT (and large cache) benefit OLTP workloads as in any one else's OLTP workload. Each TPC-C transaction involves on average 2.25 or so RPC calls (network roundtrip) and each TPC-E transaction involves approximately 22.3 RPCs. By looking at the recent results on Xeon 7460 or Opteron Quad-core, one can figure out that the average cost per RPC in both TPC-C and E is on the order of 1 CPU-millisecond (the duration of the complete RPC might be longer, say 80-400ms)

    The correct interpretation should be that HT and large cache benefits high call volume applications, transaction processing or not. HT benefits mostly in the network round-trip. This was based on tests done on the previous version of HT, i.e., Pentium 4 architecture. I did not find one SQL operation that benefited from HT except in handling just the RPC overhead. The Quest LiteSpeed compression engine did show huge gains with HT, 40%. This indicates the theory behind HT is valid. One just needs to figure what in the SQL Server engine does not like HT. It is possible that the HT in Nehalem now works better with SQL Server. 

    The large cache reduces the (fixed) startup cost of an SQL operation, but not the incremental cost per additional rows. So if someone else's OLTP application average 10 CPU-ms per call, then it might not show as much gain going from Core 2 to Nehalem.

    I suspect this is the reason Intel has not posted any TPC-H benchmark results. It should show some gain over Core 2, just not the spectacular gains in C & E. I am inclined to think that the 4-way Xeon 7460 is memory bandwidth constraint in TPC-H, and that is alleviated in Nehalem, but there are no published TPC-H results to substantiate this matter.

    Dunnington and Nehalem EX  are both 45nm. Dunnington has 1.9 billion transistors, 6 cores, there is a 3M L2 cache shared by each pair of cores, and a 16M L3 cache shared by all cores for a L2+L3 total of 25M. Nehalem EX has 2.3B transistors, 256K L2 cache dedicated for each core and 24M L3 cache for 26M L2+L3 cache. Granted there is a big increase in latency from L2 to L3. I would interest to see the supporting data (estimates made before the design work) for the big L2 caches in Dunnington.

    Even with all of the improvements over time, on the hardware with Nehalem, integrated memory controllers, QPI, on the software stack, w2k8r2 and s2k8r8, scaling on NUMA systems is not trivial. What SQL execution plan operations scale?, what does not?, what might have negative scaling? etc, what problems can be fixed with code changes etc. All of this should be done with proper expertise. (Not to be construed as an advertisement or solicitation for services, this will not be cheap either)

    PS -

    I am neither advocating nor criticizing big-iron systems. The important point is that new systems coming every year are approximately 40% more powerful at comparable price ranges. That means the value of compute power depreciates at 30% per year (1/1.4 = 0.71). So it does not make sense to buy now for what you do not expect to need for 2+ years. Buy what need for the next year, and buy a new system after that, rotating the existing system to a less important task. Of course, if you work for an inflexible government agency that mandates replacement at 5 year intervals, or if buying the $1M system makes you more important than the other group that runs on a $30K system, well then go for it! On the flip side, one should not argue for the minimum system that meets requirements, but rather think about how massive compute power can be used to generate value.

    I used have many complaints about Intel, particularly on the chipsets. Most have been addressed. The remaining complaint is that Intel has a twisted view that 4-way systems are special, ie, compared to 2-way systems. This is why the 6-core Dunnington is only used in the Xeon 7400 series and not the 5400 series, even though there is no reason it cannot be used in the 5400. The same applies to the upcoming 8-core Nehalem EX being only positioned in the 7000 line and not the 5000 line. AMD has no issues offering 6-core Istanbul in a 2-way. Hopefully, hardware vendors will have a better picture of customer interests, and offer a 2-way for Nehalem EX. Sure I know it is not cheap, this is why the different between men a boys is the size and price of their toys.

    HP/Oracle just published a RAC TPC-H result with 64 BL640c blade servers at1000GB. This system comprised 128 quad core Xeon 5450 processors (512 cores), 32GB memory per node (64GB on one node). The total memory was 2080GB. The full database size should be around 1700GB. The 1000GB description is for just the LineItem table, not including the two non-clustered indexes and the other tables.

    CPU                            memory            Power              Throughput       QphH

    32 Itanium2 DC            256GB             90,909             53,899             69,999

    128 Core2 QC             2080GB           782,609           1,740,122        1,166,977

    Based on the published Oracle RAC results, I should point out that RAC scaling on TPC-H does look good. The almost total lack of TPC-C (1 published?) may indicate an issue in scaling high-call volume applications. In the above mentioned Oracle RAC result, the blade server hardware costs were about $700K, $500K for storage, $3M for Oracle, $1.5M for RAC, $700K for partitioning, $700K for compression, $400K for support ($100K for unbreakable Linux support, if its unbreakable, why the support?) for about $6M in software, minus $1.8M in Oracle discounts. If I could charge that much, I would get myself a 400ft yacht. Never mind, Larry already did.


  • Early Intel Nehalem Xeon 5500 performance results

    SAP numbers were available a few months ago, and now, TPC-C and TPC-E results are available for 2-socket Intel Xeon 5570 systems with the Nehalem quad-core processor. The TPC-C result is on Oracle/Linux, no SQL Server on Windows yet. The TPC-E results are all SQL Server 2008.

    I mentioned before that my expectations was that Nehalem would generate substantial performance gain on high call volume applications (ie, the transaction benchmarks), partially attributed to the processor core, but mostly to the return of Hyper-Threading (HT). When HT was first proposed for the Pentium 4 generation, the expectation was that substantial performance gain was possible because certain applications could would many dead cycles waiting for memory access, etc. As with most first generation concepts, the actual performance gains in TPC-C were in the range of 7-10%, and a number of other operations could actually experience performance anomalies, that is, degradation, so HT had to be used with care, or even disabled. I did observe a very large gain (30-50%) with HT in Quest LiteSpeed compression tests on the first Prescott generation NetBurst Xeons. So the expectation was that once the tricky issues with HT in complex code like SQL Server, large gains should also be possible.

    System             Configuration                                                    TPC-C

    DL370G6           2 Xeon 5570 Quad-core 2.93GHz, 8M L3, 144GB   631,766 (Oracle/Linux)

    DL580G5           4 Xeon 7460 Six-core 2.66GHz 16M L3, 256GB     634,825

     

                                                                                                      TPC-E 

    Fujitsu RX300    2 Xeon X5570 Quad-core 2.93GHz, 8M L3, 96GB   800.00

    x3650M2           2 Xeon X5570 Quad-core 2.93GHz, 8M L3, 96GB   798.00

    Dell T610           2 Xeon X5570 Quad-core 2.93GHz, 8M L3, 96GB   766.47

    TX300 S4          2 Xeon X5460 Quad-core 3.16GHz, 8M L2, 64GB    317.45

    Dell R900          4 Dunnington Six-core 2.66GHz, 16M L3, 64GB      671.35

    My expectation is that Xeon 5500 series should show moderate gains in TPC-H style large queries. No results are currently available. It is very dissappointing that vendors only want to show a partial picture of what to expect. SPEC CPU integer results show a reasonable gain of 31.5 for Xeon 5570 versus 25.3 Xeon 5460. The individual components range from a few very large gains, several good gains, a number of modest gains and a slight decrease for the bzip2 component, so expect variations in narrow testing. 

    Anyways, I will get a Xeon 5500 system for my own testing as soon as I can. If anyone wants to loan a Opteron Shanghai system, I can do current generation comparison tests. But I do not really have my own budget right now for the Opteron system.


  • Table Scan - Aggregate Cost Structure

    In the previous post, I mention that in an earlier post on IO cost structure, that the cost of a table scan was approximately 1 CPU-micro sec per page, plus approximately 0.05 us per row for just a SELECT COUNT(*)

    This was calibrated on a 2-way Xeon 5430 Quad-Core 2.66GHz. I should also add that the test tables have a clustered primary key, with between 3-18 columns, all fixed length not null, varying row density per page from 1 to 323. Without hints, the default locking level should be TABLOCK (?) and Max Degree of Parallelism set to 1.

    I just looked a table scan query to the TPC-H LineItem table at SF10 and saw significant differences. At SF10, using the SQL Server 2008 table, ie, DATE type instead of DATETIME, there are 59,986,052 rows, and 1,127,206 leaf level pages, averaging 53.2 rows per page. By my formula, I would expect the SELECT COUNT(*) FROM LineItem with a forced clustered index scan would cost around 3.5 micro-sec per page, instead I observed 5.8 micro-sec per page.

    So why is the scan on LineItem much more expensive than the calibration from my tests? Is it because the LineItem table has 1 varchar(44) column? or because the LineItem clustered index is not unique? I need to find out, as you heard what I said about people who speculate, without investigation.

    So additional information on aggregation: SUM(float) costs the same as SUM(int), there is nearly no difference between SELECT COUNT(*), SUM(float) and SELECT SUM(float), so if you aggregate any column, the count is free. Anyways, the first SUM is about 0.20 micro-sec per row on top on the COUNT(*). A SELECT MAX() is just slightly less expensive than a SUM(), around 0185 micro-sec per row. Adding a second, third or fourth SUM()  or AVG() contributes about 0.05 micro-sec per col aggregated. However, SELECT SUM(Col1), AVG(Col1) does not incur the extra column penalty, SELECT SUM(Col1), MAX(Col1) does incur the additional col cost.

     

    thanks to Mario (http://sqlinternals.com/) for looking at what the SQL Server engine actually does in SELECT COUNT(*) FROM Table, and to Kevin who is big into what code paths should be,
    and likes to count CPU-cycles (might be an indication of a lack of social life,
    of course if chicks like to count cpu-cycles, I might get out more myself)
    3 Billion cycles per second, and he still wants to nickel and dime the 133 per row.
    http://kevinclosson.wordpress.com/

    From Mario
    So that's why I fired up my little debugger and did some tracing (on my AMD Athlon PC, win x32, running SS2005 SP2).

    And I was surprised to find you are actually right.

    SqlServer *does* touch every row, in order to optimize this process it does use hardware prefetching, and it scans the rows using the row lenghts starting from the bottom of the page.

    This is for counting.

    For a max or avg it drags in all the column data using memcpys, and applies max or avg functions, the actually referencing the data makes the story of course much more expensive. Here it depends if prefetching does what it should do, how long the memory latencies are..are they local (in NUMA cases), and how costly the cache coherency process behind this is.

    So the cost of accessing data depends also (much) on the architecture of your box and the amount of sockets.  And not only what you measure using on cpu. But since you say that you are damn careful and this is not your first day doing cpu measurements, i guess you are aware of all that..

    ________________________________________

    the parts that I highlighted above imply that scaling is not linear, ie, a test on one processor, or with OPTION (MAXDOP 1) does not scale with the number of processors, either using a single query with a parallel execution plan, or from multiple concurrent non-parallel execution plans.


  • SQL Server Fast Track Data Warehouse

    This came out on the Microsoft website. I have not had time to look over the hardware configuration carefully. Of course the recommended storage is a SAN, vendors love to sell very high margin products when there are perfectly good merely high margin alternatives. Well atleast this one looked at sequential disk IO bandwidth, unlike past configs that were totally silly. HP config for the DL385G5p does 1.5GB/s, the DL585 3GB/s and the DL785 6GB/s. The Dell configs for PE2950 does 1.6GB/s and the R900 4.8GB/s. Gut feeling is the 2-socket systems are underpowered storage wise, but that what happens when you config expensive storage solutions.

    http://www.microsoft.com/sqlserver/2008/en/us/fasttrack.aspx

    Ok, I withdraw my initial complaint on the MSA 2000, depending on which model. I love direct-attach storage in non-cluster environments for brute force capability. So I like on the HP side, the MSA60 with LFF drives or the MSA70 with SFF drives (because of the high drive count, 25, on a single x4 SAS, this is really for 8K random IO, not large block or sequential). The MSA 60 base unit is $2,999. The MSA 2012sa single controller base unit is $4,499. The MSA 60 requires a RAID controller in the server. The MSA 2012sa has a built-in RAID controller, requiring an SAS HBA on the server. So the MSA 2000 SAS variant is only slightly more expensive than the pure DA solution. The MSA2000 FC variants are far more expensive and (while suitable for cluster environments which this is not) do not contribute to performance.

    Intel vs. AMD

    The Dell reference systems are on Intel Core2 architecture processors and the HP systems are on AMD Shanghai. For many people, the choice of Intel vs. AMD is highly emotional. Some feel they are Luke Skywalker battling the evil empire. Others feel compelled to quash rebellion in the ranks. I am of the Han Solo thinking ("I ain't in it for your rebellion honey, I'm in it for the money"?). From the technical perspective, the Core2 has the more powerful processor core for CPU intensive operations. Opteron has the integrated memory controller for faster serialized round-trip memory operations. At the 4-socket level, the AMD system has 8 memory channels versus 4 for systems based on the Intel 7300 chipset. The AMD 4-way systems have more IO bandwidth at the 4-way Intel 7300 chipset. One reason HP may have referenced the AMD Opteron line is to have a uniform line from 2-way to 8-way. Unisys now has an ES7000 for the Core 2 architecture. I would like to see the detailed system architecture for that, as well as run some IO bandwidth calibration tests. The HP Integrity line has outstanding IO capability, but is saddled with Itanium 2 on 90nm (soon to be 65nm?).

     

    SAN versus Direct-Attach

    I strongly prefer DA over SAN for DW because DA can achieve very high bandwidth at a low cost. A pair of Dell PowerVault MD1000 can support 1.6GB/s for about $12K. I am inclined to think the EMC Clarion CX4-240 with 2 dual-port 4Gbit/s FC HBAs, and 2 DAE will cost in the range of $30K. Yes, the SAN has lots of features, useful for clustered and transactional environments, but not really essential for DW. However, if one were intent on using SAN for DW, I would agree with this approach of achieving very high bandwidth using multiple entry/mid-level SANs rather a single high-end SAN.

     

    Data Consumption Calculations

    The reference configurations seem to be built around the calculated table scan data consumption rate of 200MB/sec per core. It is commendable that configurations be built around calculations. Also, most people cannot handle complexity, so calculations must be kept as simple as possible. However, it is grossly irresponsible to make the statement that table scan consumption rate is linear with the number of cores. It is highly dubious that parallel execution plan consumption rate was tested at all given the cited SELECT * OPTION (MAXDOP 1) method. An all rows returned SELECT * query is very unlikely to generate a parallel execution plan. This has to do with the formulas used by the internal SQL Server cost based optimizer (or execution plan cost formulas), as IO costs are not reduced, only CPU. The extra cost of the “Parallelism (Gather Streams)” operation inhibits the parallel plan.

    Also, SELECT * FROM Table is not a proper test of data consumption, it is as much as a test of the ability of the client to receive data.

     

    Anyways, I had previously reported precise measurements of the cost structure of SQL Server table scans. A rough model is that the base table scan (SELECT COUNT(*) or equivalent) depends on the number of pages and rows involved, and whether disk access is required. The cost per page (referenced a 2-way system with the Intel Xeon 5430, Core 2 architecture at 2.66GHz) is around 1 CPU-microsecond for in-memory. The test tables columns were all fixed length not null.

    The cost per row per page is roughly 0.05 CPU-microseconds. So the in-memory table scan for a table with 20 rows per page works out to approximately 2 CPU-microseconds. This applies for a unhinted SQL query, for which the default for a table this size should be table lock. The cost of a table scan using row lock is much higher, possibly 1000 cpu-cycles per row (I reported on this long ago, in SQL Server 2000 day on the Pentium III, 4 and Itanium architectures).

     

    Let me emphasize that these are measure values, and are known to be consistent over a wide range of conditions. It does not matter what one imagines the code to execute the above looks like. 

     

    For large block disk access, the cost additional per page is approximately 4.3 CPU-micro-seconds. So the base table scan at 20 rows per page will run in the range of 4GB/sec in memory and 1.2GB/sec to disk on a single core. Note that I said for large block disk access. So I think this amortizes the disk IO cost over many pages (32-64?) and of course includes the cost of evicting a page from the buffer cache and entering the new page.

     

    Notice I said base table scan, meaning more or less SELECT COUNT(*) or equivalent so that the execution plan is a table scan or clustered index scan, not a nonclustered index scan, but the formulas are approximately valid. It turns out that the cost of logic, i.e. AVG, SUM, and other calculations can be quite expensive. A single aggregate on an integer column appears to cost about 0.20 CPU-microseconds per row, meaning the 20 rows per page in-memory table scan cost is now 6 CPU-microseconds versus 2 for the base. The TPC-H query 1 has 7 aggregates, 3 multiplies and 3 add (or subtracts), yielding a net data consumption rate around 140MB/sec on a single core.

     

    Storage Configuration

    So should each application be tested to the actual data consumption rate for the most important aggregate query? I think this is too complicated and narrow. Even if the main query is complicated, it still helps to have brute force capability in the storage system. This is why my recommendation is to simply fill the available PCI-E slots with controllers, which are relatively inexpensive, and distribute many disks across the controllers, with consideration for the cost of the disk enclosures. This means avoiding the big capacity drives, i.e. skip the 300GB 15K drives. Last year, I recommended 73GB 15 drives. Today, there is little price difference between the 73G and 146G 15K drives, so go for the 146G 15K drives.

    Memory

    On memory, forget the stupid 4GB per core rule. I see no justification for it. Memory is cheap. Fill the DIMM sockets with the largest capacity memory module where the cost per GB is essentially linear. Two years ago, 2GB ECC DIMMs were around $200, and 4GB ECC were around $1000, so 2GB DIMMs were the recommendation then. I just checked on Crucial, today 4GB ECC is less than $200 each, while 8GB is around $850, so 4GB DIMMs are the recommendation for now.

    Processors 

    Finally, a note on recommended processors: Large data warehouse queries do benefit from high clock rate (only comparable within a micro-architecture, do not equate Opteron to Core 2 frequency). DW queries do not benefit from cache size. So if the same frequency is available with different cache sizes, at different prices, just be aware that processor cache size does not impact DW. Large processor cache does significant benefit high call volume transactional queries, so the higher price is more than justified there.

     

    Reference Configurations with Direct Attach Storage

    Below are my reference configurations on the assumption the data warehouse is not clustered. I have included prices obtained from the Dell and HP websites on 26 February 2009. If you can get Dell or HP to quote the price of SAN based storage as in the Microsoft/Dell/HP reference configurations

     

    Dell PowerEdge 2950 III, 2x3.33GHz Xeon X5470, 32GB memory

    2 PERC6/E controllers, $850 each

    2-4 MD 1000 w/ 15 x 146GB 15K SAS drives, $7K each

    Total cost: $22K with 30 disks, $36K with 60 disks

     

    PowerEdge R900 4 x Six Core Intel Xeon X7460 2.67GHz, 16M, 128GB, $21,366

    6 PERC6/E controllers, $5,100,

    6 MD 1000 w/15x146G 15K, $43K

    Total cost: $70K.

     

    While I do agree in principle with the concept of balanced configuration, I ask the question: which is better? Each configuration is approximately $36K.

    1 PE 2950 w 2x3.33GHz X5470, 32GB memory, 2 PERC6, 60 x 146G 15K disks

    or 1 PE R900, 4x2.67GHz X7460, 128GB memory, 2 PERC6, 30 x 146G 15K disks

     

    HP ProLiant configurations with Opteron (Shanghai, 6M L3) processors.

    DL385G5p, 2 x Opteron 2384 2.7GHz, 32GB, 2 x 72GB 15K HDD, $7,282

    DL385G5p, 2 x Opteron 2384 2.7GHz, 32GB, 16x72GB 15K HDD $12,500

    DL 585G5, 4 x Opteron 8384 2.7GHz, 128GB, 2 x 72GB 15K, $21,500

    DL 785G5, 8 x Opteron 8384 2.7GHz, 256GB, 2 x 72GB 15K, $60,600

    MSA60, 12 x 146K 15K $7K ($5765 bundle price?)

    Smart Array P800 $949

    Smart Array P411 $649

     

    The lowest cost DL385G5p DW configuration is to populate all 16 internal SFF drive bays, then attach 2 MSA 60, each with 12 15K disks. The preferred DL585G5 configuration is to populate 6 RAID controller, with 2 MSA 60 (directly connected, not daisy chained) on each controller in the 3 x8 PCI-E slots, and 1 MSA 60 for each controller in the x4 PCI-E slots.

    On the DL785G5, place RAID controllers in each of the 3 x16 and 3 x8 PCI-E slots, each connecting 2 MSA 60, and connect a single MSA 60 for RAID controllers in the x4 PCI-E slots. Defer to Gunter Zink’s team for any more specific slot to disk configuration.


  • Nehalem, Shanghai, and Dunnington performance notes

    Back in March 2008, I discussed the SAP SD 2 Tier results for some AMD Opteron and Intel Xeon systems. Here are some more recent results. All Opteron and Xeon processors below are on the 45nm process except for the Opteron 8360, which is a 65nm product. For some reason, HP has not posted a 4-socket result for the 45nm Opteron 8384, where the bigger 6M L3 cache in known to improve network round-trip performance. The Itanium is a dual-core 90nm product, which is at serious disadvantage (to be replaced Tukwila, a 65nm quad-core). Note that the Xeon 7460 is a six core processor. My expectation is that the 4-way Opteron 8384 should be around 22-23K which would be very competitive for a quad-core going against a six-core.

     

    System             Processors                                           Users                SAPS

    DL785G5           8 x 2.7GHz        Opteron 8384                7,101                 35,400 (O10,Lin)

    DL580G5           4 x 2.66GHz      Xeon X7460                   5,155                 25,830 (S2K5)

    DL585G5           4 x 2.5GHz        Opteron 8360                3,801                 19,020 (S2K5)

     

    DL380G6           2 x 2.93GHz      Xeon X5570                   4,995                 25,000 (S2K5)

    DL385G5           2 x 2.7GHz        Opteron 2384                2,752                 13,780 (S2K5)

    DL380G5           2 x 3.33GHz      Xeon 5470                     2,518                 12,600 (S2K5)

    BL860C             2 x 1.66GHz      Itanium 9140M                  501                  5,850

     

    Of special note is the exceptional result for the 2-socket Xeon X5570, based on the new Nehalem processor, expected to be available some time in the first half of 2009. This is not entirely unexpected. First, the integrated memory controller probably helps in network round-trip intensive operations. Second is the return of Hyper-Threading, as Nehalem was designed in Oregon, while the Core 2 architecture is an Israeli design (each design team has their own opinion of various micro-architecture features). On the last Oregon design, the Pentium 4 NetBurst core, I noted that Hyper-Threading improved network round-trip performance by 15-20%, but did not actually improve any SQL operation outside of the network round-trip. Later, I measured 40-50% performance gains for HT on LiteSpeed backup compression tests. I have heard that while the theory behind HT is sound, certain operations such as acquiring locks (at the C/C++ level) can cause problems. The compression algorithm has no such issues, and probably represents the upper bound on what could be achieved with HT. Supposedly the Itanium HT, which was introduced with Montecito after the last NetBurst, had some improvements over NetBurst. Now that the Oregon team has had 8+ years to investigate HT characteristics, we should expect a much improved HT with Nehalem.

     

    My expectation is HT has the biggest benefit in SAP type environments, i.e. stored procedure calls that retrieve a single row (50-100 CPU-micro-sec), moderate benefit in TPC-C and E type environments (on the order of 1-3 CPU-milli-sec), and less or no benefit in large TPC-H type queries. When I get the Nehalem Xeon 55xx system, I will look into this.

     

    Unisys to focus on Xeon over Itanium?

    On a side note, see the article regarding Unisys.

    http://news.cnet.com/8301-13924_3-10167332-64.html?part=rss&subj=news&tag=2547-1_3-0-20

    Unisys just posted a 10TB TPC-H result for a 16-socket Xeon X7460. While the X7460 has six cores, Unisys only enabled four cores per socket, in part, because the current version of Windows and SQL Server only support up to 64 cores. The result is 26% higher than for a 32-socket 64 core Itanium 2 system. The presumption is that later this year, Intel will release the quad-core Itanium, codename Tukwila, so this result might be representative of 16-socket systems in late 2009. Even if Tukwila can achieve 2.0GHz, it will probably just be comparable to the X7460. After Windows Server 2008 R2 releases, the 16-way X7460 will have all 72 cores available for the through-put portion of TPC-H.

     

    64-core TPC-H 10TB

    Xeon X7460 Six-core 2.66GHz (Dunnington, 4 cores used) 80,172.7 (ES7600R)

    Itanium 9140N Dual-core 1.6GHz (Montecito) 63,650.9 (Superdome)

     

    Dell TPC-E results for Shanghai versus Dunnington

    Both systems 4 sockets, 64GB memory, 

    4 x six-core Dunnington 2.66GHz, 16M L3   671.35 tpsE

    4 x quad-core Shanghai 2.7GHz, 6M L3      635.43 tpsE

    So Shanghai quad-core competes very well against Dunnington with six cores. The large cache relative to Barcelona (2M L3) really helped 

     

    SSD Test Platform

    Anyway, I am all set to buy the new 2-socket Xeon 5500 series as soon as one becomes available. I will look into Nehalem performance relative to Core 2, with and without HT. I will try to configure this system with 2 PCI-E SAS RAID controllers and 8 SSD (2 per x4 SAS port, 4 devices per controller) initially, and then expand to 4 RAID controllers with 24-32 SSD as budget allows (probably 3 SSD and 1 HDD per x4 SAS port). I should also get a Shanghai platform as well, as my last good numbers for Opteron are now very old, but this is my own money. Business is down with the economy, and too many consultants are dropping their rates to get business. I am not inclined to do so. So I should have time to bring my past performance papers, many of which pertain to SQL Server 2000, up to date. I will also try to re-release some of my performance tools like SQL Clone on www.qdpma.com. I should also be able to release new tools, one for Profiler Trace analysis and another for performance tuning using dm_db_index_usage_stats and dm_exec_query_stats.


  • Run-time Execution Plan Options

    What are the top core SQL Server engine performance issues today, after all the improvements that have gone into 2005 and 2008? (I am excluding matters beyond the power of Microsoft, like eliminating bad developers.)

    SQL Server already has statement level recompile for multi-statement stored procedures. But there is no option to change a plan for a single statement in the course of execution. SQL Server also it has the OPTIMIZE FOR hint and Plan Guide options. This allows for a single plan for a particular query. What if there needs to be multiple plans for a single query?

    Runtime Plan Change.

    Consider a single SQL query, with multiple table access steps in the execution plan. It is not always possible to estimate row count at each step in the execution plan. This is why it is called an estimate. (I know some complain when the estimate is off by 10%, Lubor once said he is interested in estimates off by 2X, but I consider catastrophic errors tend to be orders of magnitude, 10-1000X or more). Suppose the SQL Server engine starts the first steps of execution plan. The estimated row count was low, so the subsequent steps employ index seeks with nested loops joins to complete the query. In actuality, the row count is very large, so table scans with hash or merge joins are much better. The opposite scenario is a very high estimate row count that turns out to be low. The remaining steps of the execution plan involve table scans with hash or merge joins, when index seeks with nested loops joins would be far more efficient. Now there are two options. One is to retain the current partial results, but change subsequent steps, to account for the new row count. In some cases, it will be much better to discard the current progress and generate a completely new plan based on the new row count. If this could be implemented, my thoughts are that this should only be enabled explicitly for known problem queries instead of a system wide on/off switch.

    Multiple Execution Plans for a single SQL

    Now consider a single SQL query, involving multiple search arguments. Depending on the input parameters, the row count involved for each argument varies drastically. So the Optimize For and Plan Guide options can only optimize for one particular scenario. I have handled this by making two or more copies of the original stored procedure, then having the original procedure check actual data distribution, and then call one of the copies. The sub-procedures are identical, but copy A is only called when parameter 1 is high, parameter 2 is low, and Copy 2 is called when parameter 1 is low and 2 is high. It is not necessary to have one copy for each possibility, only if that possibility requires a significantly different execution plan.

    My thought on handling this automatically is to extend the OPTIMIZE FOR hint to from specifying a single parameter value to multiple.

    Example: instead of

    OPTION (OPTIMIZE FOR @P1 = 1, @P2 = ‘A’, …)

    New:

    OPTION (OPTIMIZE FOR @P1 = 1 or 2, @P2 = ‘A’ or ‘B’, …)

     

    The presumption above is that @P1 values 1 and 2 have drastically different distributions that really need separate execution plans. So the SQL Server engine would generate a plan for each option. At run time, depending on whether the input parameter has distribution closer option 1 or 2, the appropriate plan is selected. Of course, this is a rather sophisticated hint that should not be used by anyone without detailed understanding of the execution plan. Also consider that as data distribution changes over time (or on statistics recomputed) the Optimize For values might be rendered ineffective. The SQL Server optimizer could also check whether each Optimizer For option actually leads to a significantly different plan, and discard irrelevant options.

    Microsoft might take the position that statement level recompile is the right solution for this. My argument only stands if the cost of recompile is high relative to the cost of execute for the desired plan. So maybe an alternative is I want to use the OPTIMIZE FOR in the low row count scenario, but a RECOMPILE if the actual parameter value has high row count.

     

    Correlated Column Statistics

    I posted this a comment in another blog, but am moving here. I do not consider a top issue, but desirable. The reason SQL Server needs correlated column statistics (it is in at the vector level, but not at the histogram).

    Consider the query:

    SELECT * FROM Women

    WHERE Attractiveness > 9 AND MonthlyShoppingExpenses < $3000

    Or:

    SELECT * FROM Men

    WHERE FinancialStability > 9 AND Fidelity > 9 AND IsSingle > 50%

    The individual column statistics may indicate 10% on each column condition, leading one to think that the combined AND conditions yields 1%, which will still have a respectable row count output. But in fact, if correlation is taken into account, a drastically lower row count is the result. Yes, I know: get a life, and I have been called MCP before the MS had the MCP program.

    Additional Items

    Index Seek with Key Lookup versus Table/Clustered Index Scan

    In previous blogs, I also complained about the static approximate 4:1 (1350:320) ratio in table scan versus index seek followed key lookup. That is, the cross-over point from an index seek with key lookup to table scan occurs when the number of rows requiring key lookups reaches approximately one-fourth the number of pages. SQL Server 2008 has the hint OPTION (TABLE HINT (table,FORCE SEEK)), which pushes the index seek with key lookup regardless of the number of rows.

    I commented that this was rather like giving us a sledge hammer when a chisel was preferred. My preference is to have a hint that changes the cross-over ratio from 4:1 to 1:1.

    Hash and Sort spill to tempdb

    Back when SQL Server 2005 was in late beta, I looked at SQL Server 2000 and 2005 execution plans with Hash (Join and Match) and Sort operations. I noted that the point at which hash and sort operations incurred an IO cost was function of the size of the intermediate table and system memory. Naturally AWE memory in 32-bit system does not count, i.e. only directly addressable memory counts. I did not verify that this was the actual point at which a hash or sort spooled to tempdb, and I did not notice any performance issues on my systems with adequate storage performance. Of course, many people have really crappy storage performance. Should there be a tunable parameter for handling this? Especially now that full 64-bit systems with huge (128GB) memory (and crappy storage performance) are common? Or does this fall under matters beyond the power of Microsoft?

    Default Settings for Cost Threshold for Parallelism and Max Degree of Parallelism

    Oh yeah, probably a tier 2 item. It is high time Microsoft change the default settings for these. A default of 0 or all available is potentially silly and severely adverse on systems with multi-core processors. The current cost threshold for parallelism of 5 is also really silly. I recall seeing queries that run in 50ms (non-parallel) get a parallel plan. This is way too low considering the overhead of setting up parallel execution. At minimum, the cost threshold should be recalibrated so that parallelism is not initiated until true CPU (on Core 2 or later architecture) is around 300-500 milli-sec. This is almost getting like how Oracle left the default buffer cache setting to 8-16MB, when system memory typically reached GBs, and the "DBA" neglected to change the default.

    Multi-level Cost Threshold and Max Degree of Parallelism

    I am thinking this should really be a bi-level or even multi-level setting. That is, a plan cost of 50-1000 allows MAXDOP 2-4, and plan cost 1000-10000 targets MAXDOP 8-16, and so on. (I am winging these numbers. If considered, then I can a more serious study on proper values.)

    New Parallel Execution Plan Formulas

    I have previously griped about the formulas used in parallel execution plans (OK, so you name it, and I have complained about it!). Anyways, the current execution plan cost model is preponderantly IO weighted, and IO costs are not reduced in parallel execution plans (except for large hash and sort operations). This has the effect of inhibiting parallel execution when in fact parallel execution would be beneficial, especially when almost all hot data is memory, so the IO argument does not hold. One more complaint/request. The bitmap filter operation in large parallel hash operations really help. In some cases, the parallel operation must be inhibited, but it would still be nice to employ the bitmap filter, even though large hash operations should be parallel.


  • Dedicated Network Adapter(s) for transferring SQL backups to the tape archival system

    One thing that really surprises me is how few people configure dedicated network adapters/ports (and preferably multiple adapters) for copying the SQL Server backup to the server with the tape archival system. The common reason cited by a really naïve system admin is that the percent network utilization as shown in task manager or perfmon never goes above 40-50%. Basically, this is a really worthless counter highly subject to misinterpretation with serious consequences in transaction processing environments. What it comes down to is that the network traffic generated by the backup file copy over the same network used by transaction processing can be highly disruptive even at 20-30% network utilization.

     

    Think about this. On a gigabit Ethernet link, which has been standard for the last few years, a single file copy (or SQL Server backup to a network location) can generate 40-50MB/sec network traffic, assuming good disk system on both the SQL Server and destination and a clean network. Almost everyone is on the default Ethernet frame size of 1500 bytes. This means there can be approximately 30,000 packets per sec sent from SQL Server, and 15,000 packets per sec received. Running perfmon on the SQL Server might show only 4,000-5,000 packets per sec sent. This is because the Windows Operating system presents a very large packet size, which is what the performance counters record, that is broken into 1500 byte packets by the network adapter.

     

    Even though this is not 100% network utilization on Gigabit Ethernet, it is highly disruptive to other activity on the network, especially transaction processing which requires a highly responsive SQL Server.

     

    The Windows default is: for every 2 packets sent, it must wait for an acknowledgement before the next packet is sent. If the network round-trip is 67 micro-sec, then 15K pairs can be transmitted and acknowledged per sec. For a two-transmit, 1 acknowledgement policy, and 1500 bytes per packet, it would require a 30 micro-sec round-trip to achieve the full 100% (or 100MB/sec) network utilization. In the past I tried setting the TCP/IP registry settings for 4-8 window size, but had trouble getting this to behave properly. Large Internet Packet (9000 bytes per frame) does help achieve higher network utilization. It helps to have matching network adapters at both ends, and a gigabit switch that also supports LIP. Configuring LIP might cause issues with the Web/App servers, so I strongly recommend configuring LIP only on the NICs dedicated for backups.

     

    There is no reason today to buy a single port Gigabit NIC with x4 PCI-E slot. So always add dual-port Gigabit NICs. Technically a quad-port NIC would be a good match for a x4 PCI-E slot. Per my storage configuration discussions, every x8 PCI-E slot in a server system should be populated with a disk IO controller (dual x4 SAS or dual port 4Gbit/s FC), and possibly 1 additional controller in a x4 slot. The remaining x4 PCI-E slots should then be populated with dual-port NICs. Now it is not necessary to have each gigabit port be connected to a different dedicated switch. Most gigabit Ethernet switches should have sufficient backplane bandwidth the handle the full saturation traffic of several Gigabit ports. But this should be tested.

     

    It is necessary to assign each port a different network part of IP address. See Thomas Grohser website on this: http://www.sqlserver-hwguide.com/ I will also add that important environments should really be handled by a system/network admin familiar with configuring multiple networks ports on a server.


  • Early TPC-C and H reports for Shanghai relative to Barcelona and Dunnington

    Below are the first performance numbers for the new AMD 45nm Opteron (Shanghai) relative to other recent AMD and Intel Xeon results.

    4-way TPC-C

    Opteron 8360 Quad-core 2.5GHz (Barcelona) 2M L3, 471,883 (DL585G2)

    Opteron 8384 Quad-core 2.7GHz (Shanghai) 6M L3, 579,814 (DL585G2)

    Xeon 7350 Quad-core 2.93GHz (Tigerton) 2x4M L2, 407,079 (DL580G5)

    Xeon 7460 Six-core 2.66GHz (Dunnington) 16M L3, 634,825 (DL580G5)

    Xeon 7460 Six-core 2.66GHz (Dunnington) 16M L3, 684,508 (x3850M2)

    (Note: Intel uses inclusive cache, so I just quote the size of the last level. Opteron is exclusive, so technically I should quote L2+L3, the L2 being 4x512K.) 

    Both Opteron TPC-C systems have 256GB memory, 700 disk drives, so the Shanghai is running at lower memory per tpm-C and lower disk per tpm-C. Still, the performance gain from Barcelona (65nm) to Shanghai (45nm) is 23% for 8% higher frequency. For reference the 4-way dual-core Opteron result was 262,989 for the 8220 2.8GHz.

     

    8-way TPC-H 300GB

    Opteron 8360 Quad-core 2.5GHz (Barcelona) 2M L2, 52,860 QphH (DL785, S2K8 w/compression)

    Opteron 8374 Quad-core 2.7GHz (Shanghai) 6M L2, 57,865 (DL785, S2K8 w/compression)

    Xeon 7350 Quad-core 2.93GHz (Tigerton) 2x4M L2, 46,034 (x3950M2, S2K5)

     

    Both Opteron TPC-H systems have 256GB, 206 disks, and 7 disk controllers. The Xeon system is on S2K5, and the Opteron results are S2K8 using the date data type in place of datetime, saving 12 bytes per row on LineItem. Page compression is also enabled, so the entire 300GB Lineitem + indexes and other tables essentially fits in memory, so the Opteron and Xeon results are not properly comparable.

    The performance gain from Barcelona to Shanghai is 9% on the combined QphH, 11.7% on power, and 6.6% on throughput.

     

    The gains can be divided between increased cache size, frequency and core improvements.

    TPC-C is known to benefit from large cache, so the majority of the gain from Barcelona to Shanghai is probably due cache. The Xeon 7460 has the best performance due to the combination of additional cores and the very large L3 cache.

    TPC-H is known to be essentially cache size independent. So the fact that TPC-H improves by more than clock frequency says some gain is due to core improvements. For an 8% frequency increase, an expected performance of 5-6% is reasonable. So 3-4% can probably be attributed to core improvements (see below). There is no published TPC-H result for the Intel six-core Dunnington. This is probably an indication that the Intel FSB architecture cannot properly feed 24 cores with a single MCH. The Intel Core 2 significantly out performs Opteron at this single core level (non-parallel large queries), but Opteron catches up at high degrees of parallelism with better memory bandwidth (more memory channels).

     

    There is still no SPEC CPU Integer 2006 Base for Shanghai (there is for SPEC CPU rate, but I still like looking at the non-rate). Core i7 results look decent, though not spectacular. A 10% gain from QX9770 3.2GHz to i7 3.2GHz. The goal of each tick and tock is 40%, but this is very difficult to do on SPEC INT. Hopefully we will see Core i7 (45nm) at 3.5-3.8GHz in 2009.

    Prior to Shanghai launch, AMD made noise that Shanghai would have 20-30% better performance than Barcelona at the same clock. In a complex processor architecture, it would not be unusual that a minor design mistake, when corrected, yield a significant performance gain on a specific operation, but may yield only a modest gain on a complete suite of operations. AMD was vague as to whether the 20-30% was specific or broad. Anyways, I just talked to the person who did the AMD results. In the past he had made subtle contributions to achieving best in class performance. Between the Barcelona and Shanghai results, there was a minor change that contributed 2%. So this would reduce the contribution from any core improvements in Shanghai.

     

    On Scotts comments, there are only TPC-E results for Intel Xeon and Itanium, one major vendor say they have done TPC-E for Opteron but has elected to no publish. Let me just say benchmarking is a viscious world where second place is for losers, even if its by just 1%. Oracle and DB2 refuses to publish. Anyways:

    4 x X7460 Six-core (24 cores) 2.66GHz, 128GB  729.65 tpsE (IBM x3850 M2)

    12 x X7460 Six-core (64 of 72 cores used) 2.66GHz, 384GB, 1,400 tpsE (NEC 5800)

     

    I am glad NEC decided to enter a Xeon for big iron. There is a role for big boxes even though they require special tuning skills. They had an Itanium, but Intel is far behind the ball in getting Itanium architecture and manufacturing process current. There is no way a 90nm processor (Montvale) can compete with 45nm X86/64. Even the upcoming 65nm Tukwila will probably be short, more so once the big Nehalem 8-core server variant arrives. I am here a PASS now, but have not had a chance to bug NEC on details of their 5800 architecture.

     

    The SPEC CPU mentioned in the comment below are SPEC int_rate, which is a through-put test. I want the SPEC CPU Integer base (no special complie flags, substituting hand coded assembly for compiler generated code). For reference (I will check if the Xeon 5460 is on a server or workstation platform)

    Dell has now published Shanghai SPEC CPU integer results

    X5460 3.16GHz 2x6M, 25.3

    X7460 2.66GHz, 16M L3, 21.7

    Opteron 8360 2.5GHz 14.4

    Opteron 8384 2.7GHz 16.9

    Core i7 965 3.2GHz 30.2

    The reason I say SPEC CPU integer base is important is that is a reasonable predictor of standard non-parallel execution plans for a wide range of SQL operations. Many people have a favorite query they like to run to evaluate new systems. So even though the Opteron can achieve very good results on the 8-way for well tuned parallel plans (and spec int rate), single threaded operations fall far short of the Core 2 based Xeon. So it is important to test single thread, through-put and parallel plan performance separately.


  • Statistics Blunders

    One of the major blunders in SQL Server 2000 is in the calculation of data distribution statistics. In theory, it should be possible to take just a small true random sample to get a reasonable estimate of the true distribution. Because of the page organization of storage in SQL Server (and other relational database engines), it is much easier sample all rows in a set of randomly selected pages, than to randomly samples rows which encompasses a much larger set of pages. A simple test shows that SQL Server 2000 in fact does sample entire pages for column statistics. So if there is any correlation between a column value and storage location, a very common occurence in tables with a clustered index, then the distribution statitistics is grossly distorted.

    When SQL Server 2005 came out, it did seem that this very serious error was corrected. But just recently I was looking a SQL Server 2005 (build 3186) database. Looking into unexpected behavior, the index statistics with the default sampling were significantly off. Statistics at higher sampling percentages were progressively more accurate. Now index and column statistics should different strategies.

    For columns statistics, supposed one sampled 1% of the rows, find M occurences of N distinct values. Should the interpretation be that in the full distribution, there are 100xM occurences of the N distinct values or M occurences of 100xN distinct values or something in between? 

    For index statistics, there is an alternative strategy. Start a number random points including the first and last. At each point, go backward and forward to find the number of occurences of the value at the starting point. The number of occurrences is exactly the same for the full distribution.,So the reasonable assumption is that there are proportionaly more distinct values with the average sampled occurence.

    Still one should always check for key indexes that the estimated row count from the initial index access with an equality SARG, specifically for one of the Range Hi Keys, is reasonably accurate (within 2X or so).


  • Dell PERC6 RAID Controller Performance

    Last month, Scott pointed out the really bad performance characteristics of the Dell PERC6 in RAID0 sequential write, particularly compared with RAID-5. Granted, this is not necessarily a red flag because few people use RAID 0 in production. Still, if one can't write code or test correctly, one should not be in the hardware/firmware business. Dell recently released firmware 6.1.1, the previous was 6.0.3. The driver was updated from 2.20 to 2.23.

    My results for 2 x 4 disk RAID0 arrays on PERC6, writing to another PERC6 with same disks.

    Large block Read test: 690-713 MB/sec

    SQL Server backup test 

    Old 6.0.3 firmware:         81MB/sec Write-Back    129MB/sec Write-Thru

    New 6.1.1 firmware:        354MB/s Write-Back      631MB/sec Write-Thru

    The old driver with new firmware has the same results as new driver with new firmware, so this was really a firmware issue. The old PERC5 did not have as bad RAID 0 write performance as the PERC6 with old firmware. Random write IO testing shows mixed results between write-thru and write back, some favoring WT, some WB but not by a large amount.

    A curious note. The Dell TPC-E report for the R900 with 4 x Xeon X7460 used the LSI MegaRAID SAS 8888ELP controllers, not the PERC6E (there was a PERC6i for internal drives). Both PERC6 and LSI 8888 use LSI components (and common drivers?)


More Posts Next page »
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement