SQLblog.com - The SQL Server Blog Spot on the Web

Joe Chang

  • SSD-HDD price parity

    It is hard to believe, but we are essentially at SSD-HDD price parity. Of course, I am comparing enterprise class 10K/15K HDDs to consumer grade SSDs. Below are the prices I am seeing:

    600GB 15K 3.5in HDD $370
    3TB 7.2K 3.5in HDD $400

    300GB 15K 2.5in HDD $370
    900GB 10K 2.5in HDD $600
    1TB 7.2K 2.5in HDD $230 (less for consumer HDDs)

    512GB SATA SSD $400-600
    Intel SSD DC S3700 400GB $940

    The 512GB SATA SSDs are consumer grade, MLC NAND, with only 7% over-provisioning.
    That is 512GB (1GB = 2^30) of NAND with 512GB (1GB = 10^9) of user capacity.
    Intel just announced the SSD DC S3700, which appears to be a reasonable enterprise product in having 32% over-provisioning. I am inclined to think that DW permanent data does not need more over-provisioning than the consumer grade SSDs provide. Otherwise your db is probably not a DW.
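    As a quick sanity check on the 7% figure, the arithmetic with the binary versus decimal GB definitions above (my numbers, not from any vendor spec):

        -- 512 binary GB (2^30 bytes per GB) of NAND vs 512 decimal GB (10^9 bytes per GB) of user capacity
        SELECT 512 * POWER(CAST(2 AS float), 30) / (512 * 1e9) - 1 AS over_provisioning;
        -- returns ~0.0737, i.e. roughly 7%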

    Potentially tempdb might need more write endurance than the consumer SSDs provide. So the question is whether a large array of consumer SSDs supporting mostly static data, plus a smaller write-intensive tempdb, is a good match.
    Most SAN vendors are peddling grossly over-priced enterprise grade SLC SSDs. That is because they want SSD to be used for caching or tiered storage, which results in heavy write activity. Apparently SAN vendors have no concept of DW.

    Now if we could only get system vendors to provide storage bays for 9.3mm SSDs instead of 15mm 10K/15K HDDs to achieve higher density. Better yet, arrange with SSD vendors to ditch the case, providing the SSD on just the PCB. The SSD storage bays should also be more appropriately balanced, say two x4 SAS ports per 8 bays.

  • Supermicro motherboards and systems

    I used to buy SuperMicro exclusively for my own lab. SuperMicro always had a deep lineup of motherboards with almost every conceivable variation. In particular, they had the maximum memory and IO configuration that is desired for database servers. But from around 2006, I became too lazy to source the additional components necessary to complete the system, and switched to Dell PowerEdge Tower servers.

    Now, I may reconsider, as neither Dell nor HP are offering the right combination of PCI-E slots, nor do their chassis support the capability I am looking for. The two Supermicro motherboards of interest are the X9DRX+-F for 2-way Xeon E5-2600 series processors, and the X9QR7-TF-JBOD for 4-way Xeon E5-4600 series processors.

    Below is a comparison of the Dell, HP and Supermicro 2-way Xeon E5-2600 systems (or motherboards). Both the Dell and HP have PCI-E x16 slots. Unfortunately this is not particularly useful as the only PCI-E SSD capable of using the full x16 bandwidth is the Fusion-IO ioDrive Octal at over $90K.

                       Dell T620   HP ML350p G8   SuperMicro X9DRX+-F
    DIMM sockets           24           24              16
    PCI-E 3.0 x16           4            3               0
    PCI-E 3.0 x8            2            1              10
    PCI-E 3.0 x4            1            4               0
    PCI-E 2.0 x4            0            1               1

    Below are the Dell and HP 4-way systems for the Xeon E5-4600, the HP 4-way Xeon E7 (Westmere-EX), and the Supermicro 4-way E5-4600 motherboard. It is apparent that neither the Dell nor the HP E5-4600 systems are meant to fully replace the previous generation 4-way E7 (Westmere-EX) systems, as both implement only half of the full set of PCI-E lanes.

                      Dell R820   HP DL560 Gen8   HP DL580 Gen7   SuperMicro X9QR7-TF-JBOD
    DIMM sockets          48            48              64               24
    PCI-E 3.0 x16          2             2               0                7
    PCI-E 3.0 x8           5*            3           6 (g2)               1
    PCI-E 3.0 x4           0             0               0                0
    PCI-E 2.0 x4           0             1               5                0

    The Xeon E5-2600 and 4600 series processors each have 40 PCI-E gen 3 lanes, plus DMI, which is equivalent to x4 PCI-E gen 2. One processor needs to have the south-bridge on DMI, but the others could implement a x4 gen 2 port. Of course the full set of 160 gen 3 lanes is only available if all four processors are populated, but the same concept applies to the memory sockets. These systems are probably more suitable as VM consolidation servers. Hopefully there is a true database server in the works?

    Today, I am interested in maximum IO bandwidth with a uniform set of PCI-E slots. Maximum memory bandwidth is required to support this, but it is not absolutely essential to have maximum memory capacity.

    The IO bandwidth plan is built around SSDs, because a 15K HDD starts around $200 providing 200MB/s on the outer tracks while a 128GB SSD can deliver 500MB/s for around $100. It would actually be easier to build a high bandwidth system with PCI-E SSDs.

    The Intel 910 400GB model is rated at 1GB/s for just over $2000, and the 800GB model does 2GB/s at the same IO bandwidth per dollar. The Fusion-io ioDrive2 Duo 2.4TB can do 3GB/s but costs $25K (or is it $38K?). The Micron P320h can also do 3GB/s but is probably expensive being based on SLC.

    The other option is 8 x 128GB SATA SSDs on a PCI-E SAS RAID controller. The LSI SAS 9260-8i can support 2.4GB/s with 8 SSDs. In theory 6 SSDs could support this, but I have not validated that. So one 9260-8i ($500) and 8 x 128GB SATA SSDs ($100 each) means I can get 2.4GB/s for $1300, possibly less. I understand that the LSI SAS 9265-8i ($665) can support 2.8GB/s (or better?), but the LSI rep did not send me one when he said he would. LSI now has PCI-E 3.0 SAS controllers, the 9271 and 9286, but I do not have any yet.

    Bandwidth versus cost options
    Fusion-IO ioDrive2 365GB     910MB/s      $6K
    Intel 910 400GB             1000MB/s      $2K
    LSI + 8 SATA SSD            2400MB/s?     $1.3K?

    To implement this strategy, the chassis should support many SATA/SAS devices organized as 8 bays per 8 SAS lanes. Both the Dell T620 and HP ML350p support 32 2.5in SAS devices, but organized as 16 per dual SAS ports (x4 each?). So for my purposes, these systems are adequate to house SSDs for 2 adapters. It could also be pointed out that the 2.5in SAS bays are designed for enterprise class 10K/15K HDDs, which are 15mm thick. SATA SSDs, on the other hand, are 9.3mm thick, designed to fit laptop HDD dimensions. They could be even thinner without the case.

    I should point out that the Intel 910 400GB PCI-E SSD has 768GB of actual NAND, so about 50% of the capacity is reserved, and it should have very good write endurance for MLC. This is typical of enterprise oriented SSDs. The typical consumer SSD has about a 7% reserve. For example, a device with 128GB (binary, 1GB = 1024^3) of NAND has 128GB (decimal, 1GB = 10^9) of user capacity. So for production use, stay with the enterprise oriented products.

    RichB
    First, this is a lab, not a production server, and I am paying for this myself, so $1-2K matters to me.
    Let's start with a 2-way Xeon E5, the Dell T620 for example.
    A reasonable IO target for this system is 5GB/s, based on 300MB/s per core. I can get this with 2 PCI-E cards that can do 2.5GB/s each, but the cheap card can only do 1GB/s, so I need 5 cards. Plus I might like 2 RAID controllers connected to HDDs so I can do really fast local backups. Next I might like to have 2x10GbE or even an InfiniBand HBA. By this time I am out of slots. I might also like to run extreme IO tests, so perhaps targeting 8GB/s.
    So the x16 slots are wasting PCI-E lanes that I need for extra x8 slots. And I cannot afford the Fusion Octal, and Fusion will not lend one to me long-term.
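    The slot arithmetic works out roughly as follows (my own numbers, using the 300MB/s per core target above):

        -- 16 cores x 300MB/s per core, and how many 1GB/s cards are needed to feed that
        SELECT 16 * 300 / 1000.0          AS io_target_gb_per_sec,     -- 4.8, call it 5GB/s
               CEILING(16 * 300 / 1000.0) AS cheap_1gbps_cards_needed; -- 5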

    Next, the Intel 910 400GB is $2100, while the Fusion ioDrive2 365GB is $5-6K (sorry, how much is £1 these days?); both are about the same in bandwidth. The Fusion is about 50% better in random read, and 10X+ in random write. Both cite 65us latency. If I had only a single card, I am sure I could see the difference between the Intel and the Fusion. But if I were to fill the empty slots with PCI-E SSDs, I am inclined to think that I would have exceeded the ability of SQL Server to drive random IO.

    I tested this once with OCZ RevoDrives, but OCZ cannot provide server system support and OCZ uses the Marvell PCI-E to SATA controller, so I stopped using OCZ PCI-E cards. I still use OCZ SATA SSDs, just connected to LSI SAS HBAs. Intel uses the LSI controller, which has much better interrupt handling. While Fusion may be better at the individual card level, I am not willing to spend the extra money on a lab system. And I am using SATA SSDs because they are even cheaper than the Intel 910.

    Realistically, I need to replace my lab equipment every 1-2 years to stay current, so I treat it as disposable, not an investment.

  • Server Systems for SQL Server 2012 per core licensing

    Until recently, the SQL Server Enterprise Edition per processor (socket) licensing model resulted in only 2 or 3 server system configurations being the preferred choice. Determine the number of sockets: 2, 4 or 8. Then select the processor with the most compute capability at that socket count level. Finally, fill the DIMM sockets with the largest capacity ECC memory module at reasonable cost per GB. Currently this is the 16GB DIMM with a price of $365 on the Dell website, and $240 from Crucial. The 32GB from Dell is currently (2012-Oct) at $1399 each, down significantly from $2499 in early 2012? Perhaps next year the 32GB DIMM might be under $800?

    SQL Server 2012 Enterprise Edition options

    Now with SQL Server 2012 per core licensing, there are a broader range of possibilities based on the number of cores. The table below shows Dell PowerEdge system examples for the Intel Xeon E5 processors from 8 to 32 cores. I would cite HP ProLiant configurations as well, but their website has become so painful to use that I have given up.

    System  Sockets  Processor  GHz     Cores/socket  Total cores  Max DIMMs  System price  SQL 2012 EE license?
    T620    2        E5-2643    3.3GHz        4            8           24        $10K          $48K
    T620    2        E5-2667    2.9GHz        6           12           24        $12K          $72K
    T620    2        E5-2690    2.9GHz        8           16           24        $13K          $96K
    R820    4        E5-4603    2.0GHz        4           16           48        $14K          $96K
    R820    4        E5-4617    2.9GHz        6           24           48        $20K         $144K
    R820    4        E5-4650    2.7GHz        8           32           48        $29K         $192K

    Pricing for the Dell PowerEdge T620 systems above is with 16x16GB memory and 1 boot drive. The prices for the Dell PowerEdge R820 are also with 16x16GB memory. Each additional set of 16x16GB DIMMs costs $5,840.

    The SQL Server 2012 Enterprise Edition licensing shown above is based on a discounted price of $6K per core. The list price is $6,736 per core. The Fujitsu RX300 S7 TPC-E full disclosure report of 2012 Jul 5 shows a full environment (system + storage + software) discount of 20%. If evenly applied, this would put the SQL Server license at about $5,400 per core. I would like to hear what discounts people are getting with respect to volume. My understanding prior to 2012 was that the Microsoft sales rep does not love you unless you buy 16 EE processor licenses, which would translate to 32 core licenses in 2012. Is this still the threshold?
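    For reference, the license column in the table above is simply total cores times the $6K discounted figure; the 20% discount case works out as follows (my arithmetic):

        -- 16-core example from the table, and the per-core price at a 20% discount off list
        SELECT 16 * 6000   AS ee_license_16_cores,         -- $96,000, matching the 16-core rows above
               6736 * 0.80 AS per_core_at_20pct_discount;  -- $5,388.80, roughly the $5,400 cited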

    It is unfortunate that Intel does not offer a high frequency 4-core in the E5-4600 line as they do in the E5-2600 series. I am of the opinion that a 4-way system with Xeon E5 3GHz+ quad-core processor and 48 DIMM sockets would be a very interesting platform. The Intel list price for the E5-4650 8-core is $3616, the 4617 6-core for $1611, and the 4603 4-core at $551.

    Considering that the SQL Server Enterprise Edition licensing component dwarfs the system and processor costs, it would be a good idea for Intel to offer an all-purpose E5-4600 at the high end that can be configured to 4, 6 or 8 cores in the microcode. It would be simpler for large organizations to purchase 4-way systems with the all-purpose processor as a standard configuration. Then each individual system could have the number of cores dialed down to the desired level.

    The 4-way E5-4603 2.0GHz is probably not as useful compared to the 2-way E5-2690 2.9GHz, both at 16-cores total. The 4-way has twice the memory bandwidth and capacity but probably also much more than necessary to support the 16 x 2GHz cores. The 2-way has nearly 50% more compute capability with balanced memory bandwidth because the complete processor was designed to be in balance for the high-end configuration. There are only a small number of situations that would favor the larger memory capacity of the 4-way E5-4600.

    The recent generation Intel processor cores are so powerful that 4 or 6 cores is probably good enough for most medium size businesses. I would prefer a 2-socket system for the extra memory bandwidth and capacity, but the minimum SQL Server 2012 license is for 4-cores per socket, negating the feasibility of a 2-way dual-core system.

    Standard Edition

    The limit for SQL Server 2012 Standard Edition is the lesser of 16 cores or 4 sockets, and 64GB memory. In addition, many important features are not available, like compression, partitioning, and advanced security. I recall that there was a limit to parallel query execution, and that it was less than 16? Standard Edition does not have parallel index operations - i.e., index creation? Perhaps all this means that 16 cores is far more than can be used in a Standard Edition environment. The 64GB memory limit also provides guidance on when to use Standard Edition.

    Personally, I do not have many side-by-side comparisons of Standard versus Enterprise Edition. I would like to hear from people what the key technical considerations are in determining when SE is suitable.

    It would seem that a single socket system with 4-6 cores and 64GB+ memory is most suitable for Standard Edition. The 64GB memory limit applies to SQL Server; it might be a good idea to configure the server with more than 64GB, perhaps as much as 96GB, so as to leave more than sufficient memory for the operating system and other processes.
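    A minimal configuration sketch for capping SQL Server at the edition limit on such a box (the 65536 value is just the 64GB ceiling expressed in MB; adjust to taste):

        -- cap the SQL Server memory target at 64GB, leaving the rest of a 96GB box to the OS and other processes
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max server memory (MB)', 65536;
        RECONFIGURE;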

    Below are some Dell system examples that might be suitable for SQL Server Standard Edition. It appears that Dell is discontinuing the T320 and T420 in favor of the Rx20 systems. While the R-models are great for higher density environments such as web servers, the T-models are better for small business database servers.

    System  Sockets  Processor  Frequency  Cores  Memory   PCI-E                 System price
    T110    1        E3-1270v2  3.5GHz     4      4x8GB    1x16, 1x8, 1x4        $2,080
    T320*   1        E5-1410    2.8GHz     4      6x8GB    1x16, 2x4g3, 1x4g2    $1,944
    T320*   1        E5-2407    2.2GHz     4      6x8GB    1x16, 2x4g3, 1x4g2    $2,019
    T320    1        E5-2440    2.4GHz     6      6x8GB    1x16, 2x4g3, 1x4g2    $2,799
    R320    1        E5-2470    2.3GHz     8      6x8GB    1x16, 1x4g2           $3,529
    R520    2        E5-2407    2.2GHz     2x4    12x8GB   1x16, 3x8             $3,503
    T620    2        E5-2643    3.3GHz     2x4    8x8GB    4x16, 2x8, 1x4        $4,292

    Notes:
    * The T320 and T420 are no longer available? Only the R320 and R420?
    Adding 4x8GB to the T110 II costs $1,251 from Dell; the price from Crucial is $440.
    The T320 memory price from Dell is $160 for 8GB and $365 for 16GB. Crucial is $85 for 8GB and $240 for 16GB.

    Technically, the systems for the E5 processors are better than the E3, with more memory bandwidth (3 channels versus 2) and larger memory capacity. On the downside is a large drop in processor frequency. The 2-socket quad-core is probably a better option than the single socket 8-core processors.

    The SQL Server 2012 per core licensing may be a shock over 2008 R2 licensing at the 8-core per socket level, effectively doubling SQL Server licensing costs. However, based on direct observations of many environments, I am of the opinion that most businesses would have more than adequate performance with a properly tuned 2-way quad-core system with 8 cores total. This system has more than 4X the compute capability of 4-way systems from the period before multi-core processors. So in fact, SQL Server licensing costs have gone down, we just need to be judicious in the choice of configuration.

  • Decoding STATS_STREAM

    Data distribution statistics are one of the foundations of the cost-based query optimizer in all modern database engines, including SQL Server. From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field accessible with the STATS_STREAM clause. Back in SQL Server 2000, it was possible to modify system tables directly, including the sysindexes stat_blob field. At the time, I described a decode of the stat_blob field with the purpose of influencing the execution plan, presumably on a development system and not a production system.

    Starting with SQL Server 2005, it was no longer possible to directly modify system tables. An API was provided to access data distribution statistics to allow cloning the statistics from one database to another. The presumed usage is to clone statistics from a large production database to a small development database. In other database engines, I had heard of the idea of updating statistics on a backup system to be applied to the production system. While it was still possible to decode most of the 2005 stats_stream binary, it appears that a checksum was added, so it was not possible to apply an externally generated statistics binary unless the "checksum" value could be correctly calculated.
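    For anyone who has not used the cloning path, a minimal sketch of the round trip (this relies on undocumented syntax as I understand it, and the table and index names are hypothetical):

        -- capture the stats stream on the source (the WITH STATS_STREAM result set is, as I recall, Stats_Stream, Rows, Data_Pages)
        CREATE TABLE #src (stats_stream varbinary(max), rows bigint, data_pages bigint);
        INSERT INTO #src
        EXEC ('DBCC SHOW_STATISTICS (''dbo.Document'', ''IX_Document_DocType'') WITH STATS_STREAM, NO_INFOMSGS');

        -- apply it on the target; STATS_STREAM takes a literal, hence the dynamic SQL
        DECLARE @sql nvarchar(max);
        SELECT @sql = N'UPDATE STATISTICS dbo.Document (IX_Document_DocType) WITH STATS_STREAM = '
                    + CONVERT(nvarchar(max), stats_stream, 1) + N';'
        FROM #src;
        EXEC (@sql);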

    Around this time, I was working on other SQL Server tools, most prominently SQL System for performance monitoring, Exec Stats for execution plan analysis and TraceAnalysis for trace processing. Work on the SQL Server data distribution cloning tool was discontinued, and I could not continue further research into the decoding of SQL Server data distribution statistics.

    Since several people have asked about the data distribution statistics decode, I am making what I know about stats_stream available. It would be helpful if other people would contribute the missing pieces.

    Note that the organization of stats_stream changed from SQL Server version 2000 (then the sysindexes stat_blob) to 2005, and again in 2008? It is quite possible there are also changes in version 2012? Most of what I discuss here applies to version 2008 R2.

    Decoding Stats Stream for SQL Server 2008R2

    Here I am using a 1-based reference: byte index 1 is the first byte.
    C# and most other programming languages use a zero-based index.
    Position    Length      Value/Type    Purpose
    1           4?          1             unknown
    5           4?                        number of vectors
    9           4           0             zero
    13          4           0             zero
    17          4                         checksum
    21          4           0             zero
    25          4                         stats stream length
    29          4           0             zero
    33          4                         stats stream length minus the vector variable length
    The difference [25]-[33] is 64 for 1 vector (defined as off1).
    Each additional vector adds 24 bytes starting at byte position 41.
    37          4           0             zero

    Start of vector information, 24 bytes per vector:
    41          1                         system type id
    42          1                         unknown
    43          2                         unknown
    45          4                         user type id
    49          2                         length
    51          1                         Prec
    52          1                         Scale
    53          4                         unknown
    57          4                         unknown
    61          2                         unknown
    63          2                         unknown
    Some of the unknown fields should be for nullable, collation, etc.
    Additional vectors follow here if present.

    off1+1*     9                         Updated?, 9-byte datetime2?
    off1+10     3                         unknown
    off1+13     8                         Rows
    off1+21     8                         Rows sampled
    off1+29     4           4-byte real   Density - header
    off1+33     4x33=132    4-byte real   Density - vector, up to 33 values
    off1+165    4           4-byte int    Steps (first copy)
    off1+169    4           4-byte int    Steps (second copy)
    off1+173    4           4-byte int    number of vectors
    off1+177    4           4-byte int    Step size (in bytes)
    off1+181    4           4-byte real   Average key length - header
    off1+185    4           4-byte real   Unfiltered rows
    off1+189    4           4-byte int    unknown
    off1+193    4x33=132    4-byte real   Average key length - vector
    Some fields may represent the string index (bool) or a filter expression.
    off1+325    8           8-byte int    unknown; values 0x11, 0x13 and 0x19 observed,
                                          may determine the number of post-histogram 8-byte values starting at off1+341?
    off1+333    8           0             8-byte 0?
    off1+341    8           0             offset for the value after the histogram?
    off1+349    8           0             another offset
    off1+357    8           0             another offset if the value of [off1+325] is 19 or more?
    More offsets follow if the value of [off1+325] is 25 or more?
    Eventually, this sequence appears: 0x10001100 followed by three 4-byte reals,
    a value in the native type of the stat, and then ending with 0x040000.
    off2**      2           0x10 = 16     length of core columns; determines the organization of the histogram structures?
    off2+2      2           17 or higher  size of step, excluding the 3-byte trailer
    off2+4      4           4-byte real   Eq Rows
    off2+8      4           4-byte real   Range Rows
    off2+12     4           4-byte real   Avg Range Rows
    off2+16     native length  native type  Range Hi Key
    off2+16+x   3           byte          0x040000 step terminator?, where x is the size of the type
    off3***     ?           ?             additional info

    * off1 = value of the 4(8)-byte int at position [25] minus the value at [33]
    ** off2 = off1 + 341 + 16 if the value of [off1+325] is 0x11, or + 24 if 0x13 or 0x19
    *** off3 = off1 + 341 + the value of the 4(8)-byte int at [off1+341]

    So far, for SQL Server 2008 R2, I have only looked at fixed-length, not-nullable statistics. Variable-length statistics have a different organization, particularly in the histogram part. String statistics may have extended information after the histogram, per the new feature of SQL Server 2008?
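    For anyone who wants to poke at the layout above, here is a minimal sketch for pulling the binary and reading a few of the integer fields (the helper function name is mine, the table and index names are hypothetical, the byte positions are the 1-based ones from the table, and I am assuming the fields are stored little-endian):

        -- helper: read a 4-byte little-endian unsigned integer at a 1-based position
        CREATE FUNCTION dbo.fn_stats_le_int (@b varbinary(max), @pos int)
        RETURNS bigint
        AS
        BEGIN
            RETURN CAST(SUBSTRING(@b, @pos,     1) AS int)
                 + CAST(SUBSTRING(@b, @pos + 1, 1) AS int) * 256
                 + CAST(SUBSTRING(@b, @pos + 2, 1) AS int) * 65536
                 + CAST(SUBSTRING(@b, @pos + 3, 1) AS bigint) * 16777216;
        END;
        GO

        DECLARE @s varbinary(max);
        CREATE TABLE #stats (stats_stream varbinary(max), rows bigint, data_pages bigint);
        INSERT INTO #stats
        EXEC ('DBCC SHOW_STATISTICS (''dbo.Document'', ''IX_Document_DocType'') WITH STATS_STREAM, NO_INFOMSGS');
        SELECT TOP (1) @s = stats_stream FROM #stats;

        SELECT dbo.fn_stats_le_int(@s,  5) AS number_of_vectors,
               dbo.fn_stats_le_int(@s, 25) AS stats_stream_length,
               dbo.fn_stats_le_int(@s, 25) - dbo.fn_stats_le_int(@s, 33) AS off1;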

    Umachandar provides a SQL function for converting 4-byte binary to real or 8-byte binary to float, and vice versa.

    Supporting SQL functions and procedures:

    The updated tools now have a stored procedure that accepts table and index (or column statistic) names as input parameters, in addition to the original procedure that takes the stats stream binary.

    Updates on QDPMA Stats Stream - Updated

    decoding stats stream - Updated

    ps
    An interesting fact is that it is not necessary for statistics to be highly accurate to be effective. Normally we are interested in distribution differences that shift the execution plan from one to another. The boundaries for this can be very wide. False statistics in certain circumstances might guard against catastrophically bad execution plans, for example in out-of-bounds situations. Another is skewed distributions, but this should be handled by other means, to ensure the high and low distributions get different execution plans.

  • Intel Xeon E5 (Sandy Bridge-EP) and SQL Server 2012 Benchmarks

    Intel officially announced the Xeon E5-2600 series processor based on the Sandy Bridge-EP variant, with up to 8 cores and 20MB LLC per socket. Only one TPC benchmark accompanied the product launch; a summary is below.

    Processors          Cores/socket  Frequency  Memory            SQL      Vendor  TPC-E
    2 x Xeon E5-2690          8       2.9GHz     512GB (16x32GB)   2012     IBM     1,863.23
    2 x Xeon E7-2870         10       2.4GHz     512GB (32x16GB)   2008R2   IBM     1,560.70
    2 x Xeon X5690            6       3.46GHz    192GB (12x16GB)   2008R2   HP      1,284.14

    Note: the HP report lists SQL Server 2008 R2 Enterprise Edition licenses at $23,370 per socket.
    The first IBM report lists SQL Server 2012 Enterprise Edition licenses at $13,473 per pair of cores(?), or $53,892 per socket. All results used SSD storage. The IBM E7 result used eMLC SSDs; the IBM E5 result showed more expensive SSDs but did not explicitly say SLC.

    The Xeon E5 supersedes 2-socket systems based on both the Xeon 5600 (Westmere-EP) and the Xeon E7 (Westmere-EX). It is evident that Sandy Bridge improves performance over Westmere at both the socket and core levels, and also on a per-GHz basis.

    Architecture     Total cores   Frequency  Core-GHz  TPC-E     tps-E per core-GHz
    Sandy Bridge-EP  2 x 8 = 16    2.9GHz     46.4      1,863.23  40.16
    Westmere-EX      2 x 10 = 20   2.4GHz     48.0      1,560.70  32.51
    Westmere-EP      2 x 6 = 12    3.46GHz    41.52     1,284.14  30.93
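    The last column is just TPC-E throughput divided by total core-GHz; the arithmetic:

        SELECT 1863.23 / (16 * 2.9)  AS sandy_bridge_ep,  -- ~40.2
               1560.70 / (20 * 2.4)  AS westmere_ex,      -- ~32.5
               1284.14 / (12 * 3.46) AS westmere_ep;      -- ~30.9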

    One advantage of the Xeon E7 (Westmere-EX) system is that the memory expanders support 4 DIMMs per channel, or 16 DIMMs per socket (4 memory channels). However, a two-socket Sandy Bridge-EP system supports 256GB with 16 (8 per socket) of the lower priced (per GB) 16GB DIMMs. And really, 256GB is more than enough for most situations, so it is quite reasonable not to burden the large majority with outlier configuration requirements.

    A later version of the Xeon E5 will support 4-socket systems. There is no explanation as to whether glue-less 8-socket systems will be supported in the future. It was previously discussed that there would be an EN variant of Sandy Bridge with 3 memory channels and fewer PCI-E lanes.

    Hardware Strategy for SQL Server 2012 per core licensing
    Top frequency on the 6-core E5-2667 is 2.9GHz, the same as the 8-core (excluding the 8-core 2687W model at 3.1GHz). Top frequencies for the 4-core E5-2643 and 2-core E5-2637 are 3.3 and 3.0GHz respectively. The desktop i7-2830 is 3.6GHz with 4 cores, so Intel is deliberately constraining the top frequency on the 2- and 4-core versions of the server parts, apparently to favor interest in the 8-core part.

    Given the SQL Server 2012 per core licensing, there should be interest in a system with fewer cores per socket running at higher frequency, while taking advantage of the high memory and IO bandwidth of the E5 system. Consider also that SQL Server write operations (Insert, Update, Delete, and the final stage of index builds) and even certain SELECT operations are not parallel (the Sequence Project operator that supports the ROW_NUMBER function).

    I think it would also make sense for Intel to allow cores to be disabled in the BIOS (now UEFI) on the top-of-line E5-2690, like the desktop extreme edition unlocked processors. Large corporate customers could buy a batch of identical systems, disabling cores that are not needed on individual systems.

    It would also be of value to engage a (no longer quite so, relative to core licenses) exorbitantly priced consultant to tune SQL Server to run on fewer cores. (Not to be construed as a solicitation for services.)

  • Query Optimizer Gone Wild - Full-Text

    Overall, SQL Server has become a very capable and mature product, with a very powerful engine and sophisticated query optimizer. Still, every now and then, a certain query structure throws the optimizer for a loop, resulting in an execution plan that will take forever. The key to identifying this type of problem begins with the execution plan. First, the plan cost does not tell the whole story. It is necessary to know which execution plan operations run well on modern server systems and which do not. Solving the problem can be a simple matter of rewriting the SQL to get a different execution plan, one that uses good execution components.

    Of course, when working with 3rd party applications that do not use stored procedures, it is necessary to convince the ISV, often by first talking to someone who does not write code, let alone someone with any understanding of the SQL Server query optimizer.

    Anyways, the topic here is Full-Text Search, in particular CONTAINS and CONTAINSTABLE. CONTAINS is "a predicate used in a WHERE clause" per the Microsoft documentation, while CONTAINSTABLE acts as a table.

    Consider the two queries below; the first is an example of CONTAINS and the second an example of CONTAINSTABLE.

    [image: the CONTAINS and CONTAINSTABLE example queries]
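    Since the screenshot does not reproduce well here, a generic sketch of the two forms (hypothetical table and column names, not the actual queries):

        -- CONTAINS as a WHERE clause predicate
        SELECT COUNT(*)
        FROM dbo.Document
        WHERE CONTAINS(DocumentContent, N'"PowerPoint"');

        -- CONTAINSTABLE as a row source
        SELECT COUNT(*)
        FROM CONTAINSTABLE(dbo.Document, DocumentContent, N'"PowerPoint"') AS ct;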

    We might intuitively think that there should be no difference between the two, which is why in SQL Server, we should never even bother with intuition and instead always, repeat always, focus on the execution plan.

    [images: execution plans for the two queries above]

    Both queries perform a Full-Text search, but the CONTAINS query must also scan an index on the source table to get the count. The CONTAINSTABLE function, on the other hand, being a row source, can be summed directly. In this example, the Document table is on the order of 60GB excluding LOB structures stored out of the table, the index in question is 150MB, and there are 16M rows in the table. Both queries run in about 2+ sec elapsed. The first consumes 6 CPU-sec running in 2.2 sec, while the second consumes 2.6 CPU-sec in 2.6 sec, as it is not a parallel plan. OK, so the first query runs slightly faster with parallel execution on the Stream Aggregate, while the second is single-threaded. But the Full-Text function itself is not multi-threaded, and probably accounts for the bulk of the 2.2 sec of the first query. So why is the CONTAINS operation beneficial?

    Before jumping to the title topic - Query Optimizer Gone Wild - let's look at another query, shown below.

    [image: administrative query with predicates on two non-indexed columns]

    Below is the query plan. Note that neither column in the search argument is indexed, because this is an administrative query that the executive director runs once every month as a Key Performance Indicator, which is also probably related to why I am not an executive. So the execution plan is a table scan.

    [image: table scan execution plan]

    The IO portion of the full table (Clustered Index) scan is 5677 (1350 pages or 10.5MB has an IO cost of 1 in a scan operation). For this particular example, the Fulltext Match Table Valued Function is assessed a plan cost of 1.6. When combined with the other components, Stream Aggregate and Filter, the total plan cost of this Full-Text search is 4.14.

    On this particular system, a Xeon E7-48xx with max degree of parallelism set to 8, the table scan query consumes 25 CPU-sec running in 3.8 sec when data is in memory. At MAXDOP 20, the query consumes 37 CPU-sec running in 2.1 sec. This is why I emphasized earlier that plan cost is not hugely relevant.

    (In case you were curious, the 60GB, 16M row table scan consumes 23 CPU-sec at DOP 1; 24.5 CPU-sec, 12.3 sec elapsed at DOP 2; the same 24.5 CPU-sec, 6.6 sec elapsed at DOP 4; i.e., excellent scaling to DOP 8, and good continued scaling to DOP 20. This is an amazing 2.6GB/s per core, and 700,000 rows per sec per core. Of course, this is a wide table with 175 columns averaging 3750 bytes per row.)

    The Wild Plan

    The actual query we are interested in is not one of those discussed above. Due to the nature of the application, PowerPoint documents can be indicated by an expression on any of three columns, one of which is part of a Full-Text catalog, as expressed in the query below.

    [image: composite query combining the column predicates with the Full-Text search]

    (It actually turns out that this query is not entirely correct from the technical perspective, but it is correct by executive direction; also, part of the reason why I will never be an executive.)

    Given that this is a relatively simple SQL expression, and that the two elements of this query are known to run quickly, we might intuitively expect this composite query to also run quickly. But as I said earlier, do not even bother with intuition, and always always focus on the execution plan as shown below.

    [image: execution plan for the composite query]

    Can you say: "We are sooo screwed!"

    What is wrong with this plan? Let us compare this plan with the table scan plan from above. Both plans have approximately equal cost, as the 60GB table scan dominates. The extra Table Valued function contributes very little, as shown below.

    [image: plan cost detail for the Full-Text Table Valued Function]

    The problem with this execution plan is that there is a fat arrow (indicating very many rows, 16M in fact) coming from the outer source (top) with the Full-Text search in the inner source (bottom). For each row from the outer source, the inner source is evaluated.

    This is why I said to not pay much attention to the plan cost, including components with high cost relative to other components.

    Instead, it is important to focus on the SQL operations and the number of rows and pages involved, along with our knowledge of how each operation behaves in non-parallel and parallel execution plans, and the difference between data in memory versus on hard drive, and now also on SSD storage.

    This execution plan will attempt to perform 16M Full-Text searches. We have already established that this particular Full-Text search takes about 2 sec. The full query might take 32M sec. There are 86,400 seconds per day. We should expect this query to complete in 370 days, assuming there is not a need to reboot the OS after a critical security patch. And oh by the way, we need to run this query next month too, every month as a matter of fact.

    Note, in the first Query Optimizer Gone Wild, the topic was a loop join with a table scan on the inner source. So this is another loop join example.

    The Tamed Plan

    Now that we have identified the problem, and we know exactly what to look for in the execution plan, it is time to solve it. Because we have been working with SQL Server and other DBMS engines built around a cost-based optimizer for many years, we know exactly what to do. The solution is to rewrite the SQL to get a good execution plan, one in which the two base operations, whose run times we know to be reasonable, are each executed only once.

    The query below meets this objective.

    [image: the rewritten query]
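    As the screenshot does not reproduce here, one way such a rewrite can be expressed (a sketch with hypothetical column names, not the exact query used here) is to run each base operation once and then combine the keys:

        -- evaluate the column predicates once and the Full-Text search once, then de-duplicate the keys
        SELECT COUNT(*)
        FROM (
            SELECT DocumentID
            FROM dbo.Document
            WHERE FileExtension = N'ppt' OR DocType = N'PowerPoint'
            UNION
            SELECT [KEY]
            FROM CONTAINSTABLE(dbo.Document, DocumentContent, N'"PowerPoint"')
        ) AS d;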

    The execution plan is below.

    [image: execution plan for the rewritten query, now a hash join]

    This query consumes 37 CPU-sec and 6.9 sec elapsed. Given that the two component elements of this query combined for 27 CPU-sec and 6.4 sec elapsed, the hash join and the two parallelism (Repartition Streams) components increased the true cost by 10 CPU-sec, but a minuscule 0.5 sec of elapsed time.

    I suppose that I should file this on Connect, but Microsoft locked my account out, and does not want to send me the unlock code. So I am posting this here.

  • What breaks with GPT disk partitions greater than 2TB?

    In one of the recent Windows OS versions, GUID Partition Table (GPT) became an option in addition to Master Boot Record (MBR) for creating disk partitions, with GPT supporting volumes larger than 2TB. In MBR, a 32-bit unsigned integer addresses 512-byte sectors (yeah, there is a push to adopt 4K sectors), so the disk partition limit was 2TB (2.2 x 10^12 bytes).
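    The arithmetic behind that limit:

        -- 2^32 addressable sectors x 512 bytes per sector
        SELECT CAST(POWER(CAST(2 AS float), 32) AS bigint) * 512 AS mbr_partition_limit_bytes;
        -- 2,199,023,255,552 bytes, i.e. about 2.2 x 10^12, or 2 TiB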

    OK, then fine. The Windows Server OS supports GPT, and SQL Server has been tested to support >2TB partitions. But to what extent has this been tested? I am sure Microsoft has many SANs with 10-100TB storage capacity, and someone tested 2TB plus. But anyone who works with big complex systems and storage systems has probably gotten tired of clicking the GUI repeatedly (no joke: one colleague had to go on 6-week disability after doing too many PowerPoint slides), so we do everything from SQL scripts and have probably forgotten how to use SSMS. (Me, I really liked Query Analyzer, especially how quickly it launched.)

    I am sure Microsoft has QA people who must test every single feature of each GUI tool, SSMS, CVT, etc., but how many tests are on 2TB-plus disks? And then 2TB+ files? So what can break? Even though the core OS and the SQL Server engine core work, there are many utility tools out there that make file IO API calls. How many work with >2TB partitions or files, and how many still use a 32-bit unsigned integer to represent the sector offset, or otherwise think a partition/file must be less than 2 billion KB?

    Now I am sure most people out there listen to every word I say as the word of @#$. In which case your storage system is comprised of a great many 15K 146GB disks distributed over many IO channels, which further implies that each RAID group is probably comprised of 4-8 disks (Fast Track originally recommended 2-disk RAID groups, which results in too many LUNs).

    In which case, 8 disks at 146GB (decimal 146 x 10^9 = binary 136 x 2^30) in RAID 10 make for a 543GB LUN. Even if it were 8 x 300GB disks in RAID 5, the 1955GB LUN is still under 2TB. So you would never have encountered any >2TB issues. But there are a few who do not seem to follow my advice, and instead choose to trust the technical expertise of their SAN vendor.

  • Intel Server Strategy Shift with Sandy Bridge EN & EP

    The arrival of the Sandy Bridge EN and EP processors, expected in early 2012, will mark the completion of a significant shift in Intel server strategy. For the longest time (1995-2009), the strategy had been to focus on producing a premium processor designed for 4-way systems that might also be used in 8-way systems and higher. The objective for 2-way systems was to use the desktop processor, which later had a separate brand and a different package & socket, to leverage the low cost structure in driving volume. The implication was that components would be constrained by desktop cost requirements.

    The Sandy Bridge collection will be comprised of one group for single processor systems designed for low cost, and one premium processor. The premium processor will support both the EN and EP product lines, the EN limited to 2-way, and the EP for both 2-way and 4-way systems, with more than adequate memory and IO in each category. The cost structure of both 2-way and 4-way increased from Core 2 to Nehalem, along with a significant boost in CPU, memory and IO capability. With quad-core available in 1P, the more price sensitive environments should move down to single processor systems. This allows 2 & 4-way systems to be built with balanced compute, memory and IO unconstrained by desktop cost requirements.

    In other blogs, I had commented that the default system choice for a database server, which for a long time had been a 4-way system, should now be a 2-way since the introduction of Nehalem in mid-2009. Default choice means in the absence of detailed technical analysis, basically a rough guess. The Sandy Bridge EP, with 8 cores, 4 memory channels and 40 PCI-E lanes per socket (80 lanes in a 2-way system), provides even stronger support for this strategy.

    The glue-less 8-way capability of the Nehalem and Westmere EX line is not continued. One possibility is that 8-way systems do not need to be glue-less. The other is that 8-way systems are being abandoned, but I am inclined to think this is not the case.

    The Master Plan

    The foundation of the premium processor strategy, even though it may have been forgotten in the mists of time, not to mention personnel turnover, was that a large cache improves scaling at the 4-way multi-processor level for the shared bus SMP system architectures of the Intel Pentium to Xeon MP period. The 4-way server systems were typically deployed with important applications that could easily justify a far higher cost structure than that of desktop components, but required critical capabilities not necessary in personal computers. Often systems in this category were fully configured with top-line components whether needed or not.

    Hence the Intel large cache strategy was an ideal match between premium processors and high budget systems for important applications. One aspect that people with an overly technical point of view have difficulty fathoming is that non-technical VPs don't want their mission critical applications running on a cheap box. In fact, more expensive means that it must be better, and the most expensive is the best, right? From the Intel perspective, a large premium is necessary to amortize the substantial effort of producing even a derivative processor in volumes small relative to desktop processors.

    The low cost 2-way strategy was to explore demand for multi-processor systems in the desktop market. Servers were expected to be a natural fit for 2-way systems. Demand for 2-way servers exploded to such an extent that it was thought, for a brief moment, that there would be no further interest in single processor servers. Eventually, the situation sorted itself out, in part with the increasing power of processors. Server unit volume settled to a 30/60/10 split between single, dual and quad processors (this is old data, I am not sure what the split is today). The 8-way and higher unit volume is low, but potentially of importance in having a complete system lineup.

    AMD followed a different strategy based on the characteristics of their platform. The Hyper-Transport (HT) interconnect and integrated memory controller architecture did not have a hard requirement for a large cache to support 4-way and above. So AMD elected to pursue a premium product strategy based on the number of HT links. Single processor systems require one HT link to connect the IO hub. Two HT links are required in a 2-way system, one connecting to IO and another to the second processor. Three HT links could support 4-way and higher with various connection arrangements. The pricing structure is based on the number of HT links enabled, on the theory that the processor has higher value in big systems than in small systems.

    What Actually Happened

    Even with the low cost structure Intel enabled in 2-way, desktop systems remained and actually became defined as single processor. Instead, the 2-way systems at the desks of users became the workstation category. This might have been because the RISC/UNIX system vendors sold workstations. The Intel workstations quickly obliterated RISC workstations, and there have been no RISC workstations for some time? Only two RISC architectures are present today, having retreated to the very high-end server space, where Intel does not venture.

    Itanium was supposed to participate in this space, but the surviving RISC vendors optimized at 8-way and higher. Intel would not let go of the 4-way system volume and Itanium was squeezed by Xeon at 4-way and below, yet could not match IBM Power in high SMP scaling. To do so would incur a high price burden on 4-way systems. One other aspect of Intel server strategy of the time was the narrow minded focus on optimizing for a single platform.

    Most of the time, this was the 4-way server. There was so much emphasis on 4-way that there were actually 2 reference platforms, almost to the exclusion of all else. For a brief period in 1998 or so, there was an incident of group hysteria that 8-way would become the standard high volume server. But this phase wore off eventually. The SPARC was perhaps the weakest of the RISC processors at the processor level. Yet the Sun strategy to design for a broad range of platforms from 2-way to 30-way (then with luck 64-way, via acquisition of one of the Cray spin-offs) was successful until their processor fell too far behind.

    After the initial implementation of the high volume 2-way strategy, desktop systems became intensely price sensitive. The 2-way workstations and server systems were in fact not price sensitive, even though it was thought they were. It became clear that desktops could not incur any burden to support 2-way capability. The desktop processor for 2-way systems was put into a different package and socket, and was given the Xeon brand.

    Other cost reduction techniques were implemented over the next several generations as practical on timing and having the right level of maturity. The main avenue is integration of components to reduce part count. This freed 2-way system from desktop cost constraints, but as with desktops, it would take several generations to evolve into a properly balanced architecture.

    The 4-way capable processors remained on a premium derivative, given the Xeon MP brand in the early Pentium 4 architecture (NetBurst) period. To provide job security for marketing people, the 2-way processors then became the Xeon 5000 series, and the 4-way the Xeon 7000 series, in the late NetBurst to 2010 period. In 2011, the new branding scheme is E3 for 1P servers, E5 for 2-way and E7 for 4-way and higher. Presumably each branding adjustment requires changes to thousands of slide decks.

    At first, Intel thought both 2-way and 4-way systems had high demand-versus-cost elasticity: if cost could be reduced, there would be substantially higher volume. Chipsets (MCH and IOH) had overly aggressive cost objectives that limited memory and IO capability. In fact, 4-way systems had probably already fallen below the boundary of demand elasticity.

    The same may have been true for 2-way systems, as people began to realize that single processor systems were just fine for entry server requirements. For Pentium II and III 2-way systems, Intel only had a desktop chipset. In 2005-6, Intel was finally able to produce a viable chipset for 2-way systems (E7500? or 5000P) that provided memory and IO capability beyond desktop systems. Previously, the major vendors elected for chipsets from ServerWorks.

    It was also thought at the time that there was not a requirement for premium processors in 2-way server systems. The more correct interpretation was that the large (and initially faster) cache of premium processors did not contribute sufficient value for 2-way systems. A large cache does improve performance in 2-way systems, but not to the degree that it does at the 4-way level. So the better strategy by far on performance above the baseline 2-way system with standard desktop processors was to step up to a 4-way system with the low-end premium processors instead of a 2-way system with the bigger cache premium processors.

    And as events turned out, the 4-way premium processors lagged desktop processors in transitions to new microarchitectures and manufacturing processes by 1 full year or more. The 2-way server on the newer technology of the latest desktop processors was better than a large cache processor of the previous generation, especially one that carried a large price premium. So the repackaged desktop processor was the better option for 2-way systems.

    The advent of multi-core enabled premium processors to be a viable concept for 2-way systems. A dual-core processor has much more compute capability than a single core, and the same holds for a quad-core over a dual-core, in any system, not just 4-way, provided that there is not too much difference in frequency. The power versus frequency characteristics of microprocessors clearly favor multiple cores for code that scales with threads, as in any properly architected server application.

    However, multi-core at the dual and quad-core level was employed for desktop processors. So the processors for 2-way servers did not have a significant premium in capability relative to desktops. The Intel server strategy remained big cache processors. There was the exception of Tigerton, when two standard desktop dual-core processor dies in the Xeon MP socket were employed for the 4-way system, until the next generation Dunnington processor incorporated a large cache. This also happened with Paxville and Tulsa.

    System Architecture Evolution from Core 2 to Sandy Bridge

    The figure below shows 4-way and 2-way server architecture evolution relative to single processor desktops (and servers too) from 45nm Core 2 to Nehalem & Westmere and then to Sandy Bridge. Nehalem systems are not shown for space considerations, but are discussed below.


    [image: System architecture from Penryn to Westmere to Sandy Bridge (Nehalem not shown)]

    The Core 2 architecture was the last Intel processor to use the shared bus, which allows multiple devices, processors and bridge chips, to share a bus with a protocol to arbitrate for control of the bus. It was called the front-side bus (FSB) because there was once a back-side bus for cache. When cache was brought on-die more than 10 years ago, the BSB was no more. By the Core 2 period, to support higher bus frequency, the number of devices was reduced to 2, but the shared bus protocol was not changed. The FSB was only pushed to 1066MHz for Xeon MP, 1333MHz for 2-way servers, and 1600MHz for 2-way workstations.

    Nehalem was the first Intel processor with a true point-to-point protocol, Quick Path Interconnect (QPI), at a 6.4GT/s transfer rate, achieving much higher bandwidth-per-pin efficiency than is possible over a shared bus. Intel had previously employed a point-to-point protocol for connecting nodes of an Itanium system back in 2002. (AMD implemented point-to-point with HT for Opteron in 2003? at an initial signaling rate of 1.6GHz?) A shared bus also has bus arbitration overhead in addition to a lower frequency of operation. The other limitation of Intel processors up to Core 2 was the concentration of signals on the memory controller hub (also known as the North Bridge) for processors, memory and PCI-E. The 7300 MCH for the 4-way Core 2 has 2013 pins, which is at the practical limit, and yet the memory and IO bandwidth is somewhat inadequate.

    Nehalem and Westmere implement a massive increase in memory and PCI-E bandwidth (number of channels or ports) for the 2-way and 4-way systems compared to their Core 2 counterparts. Both Nehalem 2-way and 4-way systems have a significantly higher cost structure than Core 2. Previously, Intel had been mindlessly obsessed with reducing system cost to the detriment of balanced memory and IO. This shows Intel recognized that their multi-processor systems were already below the price-demand elasticity point, and it was time to rebalance memory and IO bandwidth, now possible with the point-to-point interconnect and the integrated memory controller.

    QPI in Nehalem required an extra chip to bridge the processor to PCI-E. This was not an issue for multi-processor systems, but was undesirable for the hyper-sensitive cost structure of desktop systems. The lead quad-core 45nm Nehalem processor, with 3 memory channels and 2 QPI ports in an LGA 1366 socket, was followed by a quad-core, 2-memory-channel derivative (Lynnfield) with 16 PCI-E lanes plus DMI replacing QPI in an LGA 1156 socket. The previously planned dual-core Nehalem on 45nm was cancelled. Nehalem with QPI was employed in the desktop extreme line, while the quad-core without QPI was employed in the high-end of the regular desktop line.

    The lead 32nm Westmere was a dual-core with the same LGA 1156 socket (memory and IO) as Lynnfield. Per the desktop and mobile objective, cost structure was reduced with integration: 1 processor die and potentially a graphics die in the same package, and just one other component, the PCH.

    The follow-on Westmere derivative was a six-core using the same LGA 1366 socket as Nehalem, i.e., 3 memory channels and 2 QPI. This began the process of separating desktop and other single processor systems from multi-processor server and workstation systems. Extreme desktops employ the higher tier components designed for 2-way, but are still single-socket systems. I suppose that a 2-way extreme system is a workstation. Gamers will have to settle for the mundane look of a typical workstation chassis.

    With the full set of Sandy Bridge derivatives, the server strategy transition will be complete. Multi-processor products, even for 2-way, are completely separated from desktops without the requirement to meet desktop cost structure constraints. With desktops interested only in dual and quad-core, a premium product strategy can be built for 2-way and above around both the number of cores and QPI links.

    The Sandy Bridge premium processor has 8 cores, 4 memory channels, 2 QPI links, 40 PCI-E lanes and DMI (which can function as x4 PCI-E). The high-end EP line in an LGA 2011 socket will have the full memory, QPI and PCI-E capability. The EN line in an LGA 1356 socket will have 3 memory channels, 1 QPI link and 24 PCI-E lanes plus DMI, supports up to 2-way systems, and will be suitable for lower priced systems. Extreme desktops will use the LGA 2011 socket, but without QPI.

    What is interesting is that the 4-way capable Sandy Bridge EP line is targeted at both 2-way and 4-way systems. This is a departure from the old Intel strategy of premium processors for 4-way and up. Since the basis of the old strategy is no longer valid, of course a new strategy should be formulated. But too often, people only remember the rules of the strategy, not the basis. And hence blindly follow the old strategy even when it is no longer valid (does this sound familiar?)

    This element of a premium 2-way system actually started with the Xeon 6500 line based on Nehalem-EX. Nehalem-EX was designed for 4-way and higher, with eight cores, 4 memory channels supporting 16 DIMMs per processor, and 4 QPI links. A 2-way Nehalem-EX with 8 cores and 16 DIMMs per socket might be viable versus Nehalem at 4 cores and 9 DIMMs per socket, even though the EX top frequency was 2.26GHz versus 2.93GHz and higher for Nehalem. The more consequential hindrance was that Nehalem-EX did not enter production until Westmere-EP was also in production, with 6 cores per socket at 3.33GHz. So the Sandy Bridge EP line will provide a better indicator for premium 2-way systems.

    The Future of 8-way and the EX line

    There is no EX line with Sandy Bridge. Given the relatively low volume of 8-way systems, it is better not to burden the processor used by 4-way systems with glue-less 8-way capability. Glue-less means that the processors can be directly connected without the need for additional bridge chips. This both lowers cost and standardizes multi-processor system architecture, which is probably one of the cornerstones for the success Intel achieved in MP systems. I am expecting that 8-way systems are not being abandoned, but rather a system architecture with "glue" will be employed.

    Since 8-way systems are a specialized very high-end category, this would suggest a glued system architecture is more practical in terms of effort than a subsequent 22nm Ivy Bridge EX. Below are two of my suggestions for 8-way Sandy Bridge, or perhaps Ivy Bridge, depending on when components could be available. The first has two 4-port QPI switches (cross-bars or routers) connecting four nodes with 2 processors per node.

    The second system below has two 8-port QPI switches connecting single processor nodes.

    The 2 processor node architecture would be economical, but I am inclined to recommend building the 8-port QPI switch. Should the 2 processor node prove to be workable, then a 16-way system would be possible. Both are purely speculative as Intel does not solicit my advice on server system architecture and strategy, not even back in 1997-99.

    In looking at the HP DL980 diagram, I am thinking that the HP node controllers would support Sandy Bridge EP in an 8-way system.

    [image: HP DL980 system diagram]

    There are cache coherency implications (directory based versus snoop) that are beyond the scope of this database server oriented topic. There was an IBM or Sun discussion of transactional memory. I would really like to see some innovation on handling locks. This is critical to database performance and scaling. For example, the database engine ensures exclusive access to a row, i.e., memory, before allowing access. Then why does the system architecture need to do a complex cache coherency check when the application has already done so? I had also previously discussed SIMD instructions to improve handling of page and row based storage, SIMD Extensions for the Database Storage Engine (same here).

    If that were not enough, I had also called for splitting the memory system. Over the period of Intel multi-processor systems, 1995 to 2011, practical system memory has increased from 2GB to 2TB. Most of the new memory capacity is used for data buffers. The exceptionally large capacity of the memory system also means that it cannot be brought very close to the processor, such as into the same package/socket.

    So the memory architecture should be split into a small segment that needs super low latency byte addressability. The huge data buffer portion could be changed to block access. If so, then perhaps the database page organization should also be changed to make the metadata access more efficient in terms of modern processor architecture to reduce the impact of off-die memory access by making full use of cache line organization. The NAND people are also arguing for Storage Class Memory, something along the lines of NAND used as memory.

    More on QDPMA System Architecture and Sandy Bridge.

  • New SQL Server 2012 per core licensing – Thank you Microsoft

    Many of us have probably seen the new SQL Server 2012 per core licensing, with Enterprise Edition at $6,874 per core superseding the $27,495 per socket of SQL Server 2008 R2 (discounted to $19,188 for 4-way and $23,370 for 2-way in TPC benchmark reports) with Software Assurance at $6,874 per processor? Datacenter was $57,498 per processor, so the new per-core licensing puts 2012 EE on par with 2008 R2 DC at 8 cores per socket.

    This is a significant increase for EE licensing on Intel Xeon 5600 6-core systems (6 x $6,874 = $41,244 per socket) and a huge increase for Xeon E7 10-core systems, now $68,740 per socket. I do not intend to discuss the justification for the new model. I will say that SQL Server licensing had gotten out of balance with the growing performance capability of server systems over time. So perhaps the more correct perspective is that SQL Server had become underpriced in recent years. (Consider that there was a 30%+ increase in the hardware cost structure in the transition from Core 2 architecture systems to Nehalem systems, for both 2-way and 4-way, to accommodate the vastly increased memory and IO channels.)

    Previously, I had discussed that the default choice for SQL Server used to be a 4-way system. In the really old days, server sizing and capacity planning was an important job category. From 1995/6 on, the better strategy for most people was to buy the 4-way Intel standard high-volume platform rather than risk the temperamental nature of big-iron NUMA systems (and even worse, the consultant needed to get SQL Server to run correctly by steering the execution plan around operations that were broken on NUMA). With the compute, memory and IO capabilities of the Intel Xeon 5500 (Nehalem-EP), the 2-way became the better default system choice from mid-2009 on.

    By “default choice”, I mean in the absence of detailed technical sizing analysis. I am not suggesting that ignorance is good policy (in addition to bliss), but rather the cost of knowledge was typically more than the value of said knowledge. Recall that in the past, there were companies that made load testing tools. I think they are mostly gone now. An unrestricted license for the load test product might be $100K. The effort to build scripts might equal or exceed that. All to find out whether a $25K or $50K server is the correct choice?

    So now there will also be a huge incentive on software licensing to step down from a 4-way 10-core system with 40 cores total to a 2-way system with perhaps 8-12 cores total (going forward, this cost structure essentially kills the new AMD Bulldozer 16-core processor, which had just recently achieved price performance competitiveness with the Intel 6-core Westmere-EP in 2-way systems).

    In the world of database performance consulting, for several years I had been advocating a careful balance between performance tuning effort (billed at consultant rates) and hardware. The price difference between a fully configured 2-way and 4-way system might be $25,000. For a two-node cluster, this is a $50K difference in hardware, with perhaps another $50K in SQL Server licensing cost. The consideration is that blindly stepping up to bigger hardware does not necessarily improve the critical aspect of performance proportionately, sometimes not at all, and may even have a negative impact.

    With performance tuning, it is frequently possible to achieve significant performance gains in the first few weeks. But after that, additional gains become either progressively smaller, limited in scope, or involve major re-architecture. In the long ago past, when hardware was very expensive, not to mention the hard upper limits on performance, it was not uncommon for a consultant to get a long-term contract to do performance work exclusively.

    More recently, performance consulting work tended to be shorter-term: just clean up the low hanging fruit, and crush moderate inefficiencies with cheap powerful hardware. While this is perfectly viable work, it also precludes the justification for the deep skills necessary to resolve complex problems, which in turn calls into question the need to endure an intolerably arrogant, exorbitantly expensive consultant.

    It had gotten to the point that I had given thought to retiring and going fishing in some remote corner of the world. But now, with the new SQL Server per core licensing, Microsoft has restored the indispensable (though still intolerable) status of the arrogant, exorbitantly expensive performance consultant. So, thank you Microsoft.

    Edit 16 Dec 2011
    VR-Zone mentions a Windows 7/Server 2008 R2 hot-fix that treats the 8-core AMD Bulldozer die as 4 cores with HT, as opposed to AMD's positioning of it as 8 cores. AMD should hope that this is Microsoft's position for SQL Server 2012, or no one should consider AMD in light of the per core licensing, given that Intel physical cores are much more powerful than the Bulldozer "core".

    Edit 20 Feb 2012
    I might add that the new per core licensing would be well worth the extra money if SQL Server would give us:
    1) Parallel Execution plans for Insert, Update and Delete
    2) Improve Loop Join parallel scaling - I believe today there is contention between threads in latching the inner source index root
    3) Fix parallel merge join - If the parallel merge join code is broken, why can we not use the parallel hash join code with the existing index? (A join hint workaround is sketched below.)

    The basis for this is that if we are going to pay for the cores, then SQL Server should not let the cores sit idle in time-consuming operations.
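
    On point 3, a per-query workaround already exists in the form of join hints, which is part of why the broken parallel merge join is so frustrating. The sketch below is illustrative only; the table and column names are hypothetical, not from any system discussed here.

        -- Force a hash join (instead of a merge join) for a query where the parallel
        -- merge join misbehaves. The join hint also fixes the join order for the query.
        SELECT   o.OrderDate, SUM(d.Quantity) AS TotalQty
        FROM     dbo.Orders AS o
        INNER HASH JOIN dbo.OrderDetails AS d
                 ON d.OrderID = o.OrderID
        GROUP BY o.OrderDate
        OPTION   (MAXDOP 8);   -- allow a parallel plan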

  • TPC-H Benchmarks - Westmere-EX versus RISC

    There has been relatively little activity in TPC benchmarks recently, with the exception of the raft of Dell TPC-H results with Exa Solutions. It could be that systems today are so powerful that few people feel the need for benchmarks. IBM published an 8-way Xeon E7 (Westmere-EX) TPC-E result of 4593 in August, slightly higher than the Fujitsu result of 4555, published in May 2011. Both systems have 2TB memory. IBM prices 16GB DIMMs at $899 each, $115K for 2TB or $57.5K per TB. (I think a 16MB DIMM was $600+ back in 1995!) The Fujitsu system has 384 SSDs of the 60GB SLC variety at $1,014 each, and IBM employed 143 SSDs of the 200GB eMLC variety at $1,800 each, for roughly 23TB and 29TB raw capacity respectively. Except for unusually write intensive situations, eMLC or even regular MLC is probably good enough for most environments.

    HP published a TPC-H 1TB result of 219,887 QphH for their 8-way ProLiant DL980 G7 with the Xeon E7-4870, 26% higher in the overall composite score than the IBM x3850 X5 with the Xeon E7-8870 (essentially the same processor). The HP system scores 16% higher in power and 37.7% higher in throughput. Both throughput tests were run with 7 streams. The HP system had Hyper-Threading enabled (80 physical cores, 160 logical) while the IBM system did not. Both systems had 2TB memory, more than sufficient to hold the entire database, data and indexes, in memory. The IBM system had 7 PCI-E SSDs, and the HP system had 416 HDDs over 26 D2700 disk enclosures, with 10 LSI SAS RAID controllers, 3 P411 controllers and 1 dual-port 8Gbps FC controller.

    Also of interest are TPC-H 1TB reports published for the 16-way SPARC M8000 (June 2011) with SPARC64 VII+ processors and the 4-way SPARC T4-4 (Sep 2011). The table below shows configuration information for recent TPC-H 1000GB results.

    TPC-H 1000GB   | IBM x3850 X5 | HP ProLiant DL980 G7 | IBM Power 780      | SPARC M8000     | SPARC T4-4
    DBMS           | SQL 2K8R2 EE | SQL 2K8R2 EE         | Sybase IQ ASE 15.2 | Oracle 11g R2   | Oracle 11g R2
    Processors     | 8 Xeon E7    | 8 Xeon E7            | 8 POWER7           | 16 SPARC64 VII+ | 4 SPARC T4
    Cores-Threads  | 80-80        | 80-160               | 32-128             | 64-128          | 32-256
    Memory         | 2048GB       | 2048GB               | 512GB              | 512GB           | 512GB
    IO Controllers | 7            | 13                   | 12                 | 4 Arrays        | 4 Arrays
    HDD/SSD        | 7 SSD        | 416 HDD              | 52 SSD             | 4x80 SSD        | 4x80 SSD

    The figure below shows TPC-H 1000GB power, throughput and QphH composite scores for the 4 x Xeon 7560 (32 cores, 64 threads), the two 8 x Xeon E7 systems (80 cores, 80 and 160 threads), the 8 x POWER7 (32 cores, 128 threads), the 16 x SPARC64 VII+ (64 cores, 128 threads) and the 4 x SPARC T4 (32 cores, 256 threads).

    TPC-H SF 1000 Results

    The HP 8-way Xeon and both Oracle/Sun systems, one with 16 sockets and the newest with 4 SPARC T4 processors, are comparable, within 10%.

    An important point is that both the Oracle/Sun and the IBM Power systems are configured with 512GB memory, versus 2TB for the 8-way Xeon E7 systems, which is enough to keep all data and indexes in memory. There is still disk IO for the initial data load and for tempdb intermediate results. This is a good indication that Oracle and Sybase have been reasonably optimized on IO, in particular on when to use an index and when not to. I had previously raised the issue that the SQL Server query optimizer should consider the different characteristics of in-memory, DW optimized HDD storage (100MB/s per disk sequential) and SSD.

    Sun clearly made tremendous improvements from the SPARC64 VII+ to the T4, with the new 4-way system essentially matching the previous 16-way. Of course, Sun had been lagging at the individual processor socket level until now. The most interesting aspect is that the SPARC T4 has 8 threads per core. The expectation is that server applications have a great deal of pointer chasing code, that is, fetches from memory that determine the next address to fetch, with inherently poor locality.

    A modern microprocessor with a core frequency of 3GHz corresponds to a 0.33 nano-second clock cycle. Local node memory access time might be 50ns, or 150 CPU-clocks. Remote node memory access time might be 100ns for a neighboring node to over 250ns for multi-hop nodes after cache coherency is taken into account. So depending on how many instructions are required for each non-cached memory access, we can expect each thread or logical core to have many dead cycles, possibly enough to justify 8 threads per core. What is surprising is that Oracle published a TPC-H benchmark with their new T4-4 and not a TPC-C/E, which is more likely than DW to emphasize the pointer chasing code.

    Below are the 22 individual query times for the above systems in the power test (1 stream).

    TPC-H SF 1000 Queries 1-22

    Below are the 22 individual query power times for just the two 8-way Xeon E7 systems. Overall, the HP system (with HT enabled) has a 16% higher TPC-H power score, but the IBM system without HT is faster or comparable in 9 of the 22 queries. Setting aside the differences in system architecture, the net difference might be attributed to HT.

    TPC-H SF 1000, IBM and HP 8-way Xeon E7

    Below are the 22 individual query power times for the HP 8 Xeon E7 and Oracle SPARC T4-4 systems.

    TPC-H SF 1000, 8-way HP Xeon E7 and 4-way SPARC T4

  • New Fusion ioDrive2 and ioDrive2 Duo

    Fusion-IO just announced the new ioDrive2 and ioDrive2 Duo in October 2011 (at some conference of no importance). The MLC models will be available in late November and the SLC models afterwards. See the Fusion-IO press release for more info.

    Below are the Fusion-IO ioDrive2 and ioDrive2 Duo specifications. The general idea seems to be for the ioDrive2 to match the realizable bandwidth of a PCI-E gen2 x4 slot (1.6GB/s) and for the ioDrive2 Duo to match the bandwidth of a PCI-E gen2 x8 slot (3.2GB/s). I assume that there is a good explanation why most models have specifications slightly below the corresponding PCI-E limits.

    The exception is the 365GB model, at about 50% of the PCI-E g2 x4 limit. Suppose that the 785GB model implements parallelism with 16 channels and 4 die per channel. Rather than building the 365GB model with the same 16 channels but a different NAND package with 2 die each, they may have just implemented 8 channels using the same 4-die package. Let's see if Fusion explains this detail.

    Fusion-IO ioDrive2

    ioDrive2 Capacity      | 400GB    | 600GB    | 365GB    | 785GB    | 1.2TB
    NAND Type              | SLC      | SLC      | MLC      | MLC      | MLC
    Read Bandwidth (64kB)  | 1.4 GB/s | 1.5 GB/s | 710 MB/s | 1.2 GB/s | 1.3 GB/s
    Write Bandwidth (64kB) | 1.3 GB/s | 1.3 GB/s | 560 MB/s | 1.0 GB/s | 1.2 GB/s
    Read IOPS (512 Byte)   | 351,000  | 352,000  | 84,000   | 87,000   | 92,000
    Write IOPS (512 Byte)  | 511,000  | 514,000  | 502,000  | 509,000  | 512,000
    Read Access Latency    | 47 µs    | 47 µs    | 68 µs    | 68 µs    | 68 µs
    Write Access Latency   | 15 µs    | 15 µs    | 15 µs    | 15 µs    | 15 µs
    Bus Interface          | PCI-E Gen 2 x4 (all models)
    Price                  | $?       | ?        | $5,950?  | $?       | ?
    (SLC = Single Level Cell, MLC = Multi Level Cell)

    Fusion-IO ioDrive2 Duo

    ioDrive2 Duo Capacity  | 1.2TB                   | 2.4TB
    NAND Type              | SLC (Single Level Cell) | MLC (Multi Level Cell)
    Read Bandwidth (64kB)  | 3.0 GB/s                | 2.6 GB/s
    Write Bandwidth (64kB) | 2.6 GB/s                | 2.4 GB/s
    Read IOPS (512 Byte)   | 702,000                 | 179,000
    Write IOPS (512 Byte)  | 937,000                 | 922,000
    Read Access Latency    | 47 µs                   | 68 µs
    Write Access Latency   | 15 µs                   | 15 µs
    Bus Interface          | PCI-E Gen 2 x8 (both models)
    Price                  | $?                      | ?

    SLC versus MLC NAND
    Between the SLC and MLC models, the SLC models have much better 512-byte read IOPS than the MLC models, with only moderately better bandwidth and read latency. Not mentioned, but common knowledge, is that SLC NAND has much greater write-cycle endurance than MLC NAND.

    It is my opinion that most databases, both transaction processing and DW, can accommodate MLC NAND characteristics and limitations in return for the lower cost per TB. I would consider budgeting a replacement set of SSDs if analysis shows that the MLC life-cycle does not match the expected system life-cycle. Of course, I am also an advocate of replacing the main production database server on a 2-3 year cycle instead of the traditional (bean-counter) 5-year practice.

    The difference in read IOPS at 512B is probably not important. If the ioDrive2 MLC models can drive 70K+ read IOPS at 8KB, then it does not matter what the 512B IOPS is.

    Post-RAID?
    One point from the press release: "new intelligent self-healing feature called Adaptive FlashBack provides complete chip level fault tolerance, which enables ioMemory to repair itself after a single chip or a multi chip failure without interrupting business continuity." For DW systems, I would like to completely do away with RAID when using SSDs, instead having two systems, each without RAID on the SSD units. By this, I mean fault-tolerance should be pushed into the SSD at the unit level. Depending on the failure rate of the controller, perhaps there could be two controllers on each SSD unit.

    For a critical transaction processing system, it would be nice if Fusion could provide failure statistics for units that have been in production for more than 30 days (or whatever the infant mortality period is) on the assumption that most environments will spend a certain amount of time to spin up a new production system. If the failure rate for a system with 2-10 SSDs is less than 1 per year, then perhaps even a transaction processing system using mirroring for high-availability can also do without RAID on the SSD?

    ioDrive2 and ioDrive2 Duo
    I do think that it is a great idea for Fusion to offer both the ioDrive2 and ioDrive2 Duo product lines, matched to PCI-E gen2 x4 and x8 bandwidths respectively. The reason is that server systems typically have a mix of PCI-E x4 and x8 slots with no clear explanation of the reasoning for the exact mix, other than perhaps it being whatever the loudest-complaining customer demanded.

    By having both the ioDrive2 and the Duo, it is possible to fully utilize the bandwidth from all available slots, balanced correctly. It would have been an even better idea if the Duo were actually a daughter card that plugs onto the ioDrive2 base unit, so the base model could be converted to a Duo, but Fusion apparently neglected to solicit my advice on this matter.

    I am also inclined to think that there should be an ioDrive2 Duo MLC model at 1.2TB, on the assumption that its performance would be similar to the 2.4TB model, as the ioDrive2 785GB and 1.2TB models have similar performance specifications. The reason is that a database server should be configured with serious brute force IO capability, that is, all open PCI-E gen 2 slots should be populated. But not every system will need the x8 slots populated with the 2.4TB MLC model, hence the viability of a 1.2TB model as well.

    PS
    If Fusion should be interested in precise quantitative analysis of SQL Server performance, instead of the rubbish whitepapers put out by typical system vendors, well, I can turn around a good performance report very quickly. Of course I would need to keep the cards a while for continuing analysis...

  • Consumer SSDs with SQL Server

    Over the last two years, I have stood up several proof-of-concept (POC) database server systems with consumer grade SSD storage at a cost of $2-4K per TB. Of course production servers are on enterprise class SSD, Fusion-IO and others, typically $25K+ per TB. (There are some special situations where it is viable to deploy a pair of data warehouse servers with non-enterprise SSD.)

    PCI-E SSDs - OCZ RevoDrive, RevoDrive X2, & RevoDrive 3 X2
    The first POC system was a Dell T710 with 2 Xeon 5670 processors, 96GB (12x8GB) memory, 16 x 10K SAS HDDs and 6 OCZ RevoDrive (original version) PCI-E SSDs supporting table scans at nearly 3GB/s. The most difficult query repeatedly hashed a large set of rows (as in, there were multiple large intermediate result sets), generating extremely heavy tempdb IO. With tempdb on 12 10K HDDs, the query time was 1 hour. With tempdb on the 6 OCZ RevoDrives, the query time was reduced to 20 minutes.
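
    For anyone who wants to see where that kind of tempdb traffic comes from, below is a rough sketch of the check I run while such a query executes; hash and sort spills show up as internal object allocations.

        -- Which sessions are allocating tempdb pages for intermediate results?
        -- (Counts are 8KB pages; divide by 128 to get MB.)
        SELECT   session_id,
                 SUM(internal_objects_alloc_page_count) / 128 AS internal_alloc_MB,
                 SUM(user_objects_alloc_page_count)     / 128 AS user_alloc_MB
        FROM     sys.dm_db_task_space_usage
        GROUP BY session_id
        ORDER BY internal_alloc_MB DESC;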

    Before SSDs became viable, I would normally have configured a 2-socket system with 48 (2x24) 15K HDDs, with one RAID controller for each 24-disk enclosure. This setup costs about $11K per enclosure with 24 x 146GB 15K SAS drives and can be expected to deliver 4GB/s sequential bandwidth, 10K IOPS at low queue depth and low latency (200 IOPS per 15K disk), and in the range of 15-20K IOPS at high queue depth and high latency. As it was my intent to deploy on SSD, I only configured 16 HDDs in the internal disk bays and did not direct the purchase of external HDDs.

    The 6 OCZ RevoDrive 120GB PCI-E SSDs in the POC system cost about $400 each at the time (now $280?). I recall that the tempdb IO traffic was something like 40K IOPS (64KB), around 2.5GB/s bandwidth. This was consistent with the manufacturer's specifications of 540MB/s read and 480MB/s write at 128K IO, considering that there will be some degradation in aggregating performance over 6 devices. The IO latency was somewhere in the range of 40-60ms (note that the SQL Server engine issues tempdb IO at high queue depth). OK, so the real purpose of the POC exercise was to tell the SAN admin in no uncertain terms that the 350MB/s from his $200K iSCSI storage system (4 x 1GbE) was pathetic, and even the 700MB/s on 2 x 4Gbps FC ports does not cut the mustard in DW.

    The next set of systems was ordered with 4 OCZ RevoDrive X2 160GB units (<$500 each). There was some discussion on whether to employ the OCZ enterprise class Z-Drive R3, but this product was cancelled and the OCZ substitute, the VeloDrive (4 SandForce 1565 controllers, rated for ~1GB/s), was not yet available. I was expecting somewhat better performance from 4 RevoDrive X2 units (4 SandForce 1222 controllers each, rated for 700MB/s) than from 6 of the original RevoDrives (2 SandForce controllers each). The tempdb IO intensive query that took 20 minutes with the 6 RevoDrives now ran in 15 minutes with the 4 RevoDrive X2s. In addition, IO latency was under 10ms.

    I was hoping to test the new OCZ RevoDrive 3 X2 with 4 SandForce 2281 controllers, rated for 1500MB/s read and 1200MB/s write. Unfortunately there is an incompatibility between the Dell T110 II with the E3-1240 (Sandy Bridge) processor, which has new UEFI firmware replacing the BIOS, and the RevoDrive 3. OCZ does not provide server system support on their workstation/enthusiast products. Hopefully Dell will eventually resolve this.

    SATA SSDs - OCZ Vertex 2, Vertex 3 & Vertex 3 Max IOPS, Crucial C300 & m4
    My preference is to employ PCI-E rather than SATA/SAS SSD devices. This is mostly driven by the fact that disk enclosures reflect the IO capability of HDDs, with 24 bays on 4 SAS lanes. An SSD oriented design should have 4 SSDs on each x4 SAS port. Of course, 4 SSDs and 4-8 HDDs on each x4 SAS port is also a good idea.

    So I have also looked at SATA SSDs. Earlier this year, I started with the OCZ Vertex 2 and Crucial C300 SSDs. After encountering the issue with the RevoDrive 3 on the new Dell server, I acquired OCZ Vertex 3, Vertex 3 Max IOPS, and Crucial m4 SATA SSDs. The OCZ Vertex 2 has a 3Gbps interface; the Vertex 3 and both the Crucial C300 and m4 support a 6Gbps SATA interface.

    The OCZ Vertex SSDs use SandForce controllers; the Vertex 2 uses the previous generation SandForce 1222 and the Vertex 3 uses the current generation 2281 controller. The Crucial SSDs use Marvell controllers (both, I believe). Perhaps the most significant difference between the OCZ Vertex and Crucial SSDs is that the SandForce controllers implement compression. The OCZ Vertex SSDs have far better write performance with compressible data, but are comparable for incompressible data. It does appear that SQL Server tempdb IO is compressible and benefits from the compression feature.

    Another difference is that OCZ offers 60, 120, 240 and 480GB capacities while Crucial offers 64, 128, 256 and 512GB capacities. All capacities are in decimal, that is, 1GB = 10^9 bytes. Both the OCZ 60GB and the Crucial 64GB presumably have 64GB of NAND flash, the 64GB being binary (1GB = 1024^3 bytes), or 7.37% more than 1GB decimal. Basically, OCZ has more over-provisioning than Crucial, which in theory should also contribute to better write performance. (Earlier Vertex drives had 50 and 100GB capacities, but there are so many varieties of the Vertex 2 that I cannot keep track.)

    In brief, the performance difference between SSD generations, both from the OCZ Vertex 2 to Vertex 3 and from the Crucial C300 to m4, is substantial, so I will focus mostly on the newer Vertex 3 and m4 drives. The performance observed in SQL Server operations seemed to be consistent with manufacturer specifications for both generations of OCZ and Crucial SSDs. I did not determine whether pages from SQL Server tables already using data compression were further compressible by the SandForce controller. This may be because it is difficult to achieve the high write rates necessary to stress modern SSDs in SQL Server with transactional integrity features.

    Test System - Dell T110 II, Xeon E3 quad-core Sandy Bridge processor
    The test system is a Dell PowerEdge T110 II with a Xeon E3-1240 3.30GHz quad-core processor (Sandy Bridge) and 16GB memory. This system has 2 PCI-E Gen2 x8 slots and 1 Gen 1 x4 slot. All SSDs were attached to an LSI MegaRAID SAS 8260 controller (PCI-E Gen2 x8, 8 x 6Gbps SAS ports). I did some testing with 2 SSDs on the SATA ports (3Gbps) but did not make detailed observations.

    Incidentally, the cost of this system, processor, memory and 1 SATA HDD was about $1,078. The 128GB SSDs were about $220 ($268 for the Max IOPS). So a very capable system with 2 SSDs could be built for $1,300-1,500 (64GB or 128GB SSDs). A better configuration with 2 SATA HDDs, 4 SSDs and a SAS controller would push this to $2,500. But if Dell and OCZ could resolve the RevoDrive 3/UEFI issue, then I would recommend the T110 II, E3 processor, 16GB memory, 2 SATA HDDs and 1 RevoDrive 3 X2.

    One unfortunate aspect of this system is that the SATA ports are all 3Gbps per the Intel C202 PCH, even though the ever so slightly more expensive C204 supports 6Gbps SATA on 2 ports. Basically, this system has characteristics similar to the database laptop with super IO that I proposed earlier, except that the laptop would run at 2.5GHz to keep power reasonable.

    Performance tests with the TPC-H SF 100 database
    With 8 SSDs (2 Vertex 3, 2 Vertex 3 Max IOPS, 2 m4, and 2 C300) I was able to generate 2.4GB/s in a table scan aggregation query, possibly gated by the 355MB/s rating of the C300s. A configuration consisting of the 4 Vertex 3s and 2 m4s would have been gated by the 415MB/s rating of the m4. If I can get a total of 8 Vertex 3s, which are rated at 550/500MB/s for compressible and incompressible data, then I would be limited either by the adapter or by the PCI-E Gen2 x8 limit of 3.2GB/s. There is an LSI SAS8265 adapter with a dual-core controller that has even higher IOPS capability, but it is not known whether this is necessary for large block IO.

    The tests consisted of running the TPC-H queries, single stream (but not per official benchmark requirements). The figure below shows the time to run the 22 queries (excluding statistics, parse and compile) for 2, 4 and 6 SSDs, with no data compression (raw) and with page mode data compression.
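
    For reference, one way to build the page compressed variant is an ordinary rebuild; a minimal sketch is below (the table name is from the TPC-H schema, the rest is illustrative).

        -- Rebuild a table with page compression, then check the space saved.
        ALTER TABLE dbo.LINEITEM
            REBUILD WITH (DATA_COMPRESSION = PAGE);

        EXEC sp_spaceused 'dbo.LINEITEM';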

    TPC-H SF 100 query times

    Run time on 2 OCZ Vertex 3 (regular) SSDs was 815 sec with compression and 936 sec raw (w/o compression). On 4 OCZ Vertex 3 SSDs (2 regular, 2 MaxIOPS) total query times were reduced to 658 sec with compression and 622 sec raw. On 6 SSDs, 4 OCZ and 2 Crucial m4, total query times are 633 sec with compression and 586 sec raw.

    The figure below shows tempdb IO write latency for 2, 4 and 6 SSDs, with raw and compressed tables.

    TPC-H SF 100 tempdb IO write latency

    On the 2 OCZ SSDs, IO latency from the virtual file stats function (averaged over the entire run) was 30ms (tempdb write) and 90ms (data read) with compression, and 60ms (tempdb write) and 130ms (data read) without compression. The performance with 2 Crucial m4 drives was lower, showing much higher write latencies. On 4 OCZ Vertex 3 SSDs, IO latency was 14ms tempdb write and 30ms data read with compression, and 18-60ms without compression. The IO latencies on the Max IOPS models were lower than on the regular Vertex 3 models. With 6 SSDs, IO latencies came down to the 15ms range, with somewhat higher latency on the Crucial m4 SSDs.
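
    For those who want to reproduce this kind of measurement, a minimal sketch follows (not the exact query used here). sys.dm_io_virtual_file_stats is the DMV form of fn_virtualfilestats; the counters are cumulative since instance start, so take a snapshot before and after a run.

        -- Average read and write latency per database file.
        SELECT   DB_NAME(vfs.database_id) AS database_name,
                 mf.physical_name,
                 vfs.num_of_reads,
                 vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
                 vfs.num_of_writes,
                 vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
        FROM     sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN     sys.master_files AS mf
                 ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
        ORDER BY database_name, mf.physical_name;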

    With data and tempdb on 2 OCZ Vertex 3 SSDs, performance was decent but IO constrained. Performance was 15% better with data compression (page) than without, even though CPU was 23% higher. Performance with 4 OCZ Vertex 3 SSDs (2 regular, 2 Max IOPS) was 20% better for compression on and 34% better without compression, relative to performance with 2 SSDs. The performance without compression was now 6% better than with compression. At 6 SSDs (4 Vertex 3, 2 m4), there was another 5% performance improvement relative to 4 SSDs, for both compressed and not compressed.

    In the above tests, each SSD was kept as a standalone disk, i.e., I did not use RAID. There was 1 data and 1 tempdb file on each SSD. I noticed that the uncompressed (raw) database tended to generate 64K or 128K IO, while the compressed database tended to have 256K IO. Two queries, 17 and 19(?) generated 8KB IO, and would have much better performance with data in memory. There was also wide variation from query to query in whether performance was better with or without compression.
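
    The file layout above amounts to striping at the file level rather than with RAID; a hypothetical sketch is below (drive letters, sizes and file names are illustrative, not from the test system).

        -- One data file per SSD volume for the user database...
        CREATE DATABASE TPCH100
        ON PRIMARY
            (NAME = tpch_1, FILENAME = 'E:\data\tpch_1.mdf', SIZE = 40GB),
            (NAME = tpch_2, FILENAME = 'F:\data\tpch_2.ndf', SIZE = 40GB),
            (NAME = tpch_3, FILENAME = 'G:\data\tpch_3.ndf', SIZE = 40GB),
            (NAME = tpch_4, FILENAME = 'H:\data\tpch_4.ndf', SIZE = 40GB)
        LOG ON
            (NAME = tpch_log, FILENAME = 'C:\log\tpch_log.ldf', SIZE = 8GB);

        -- ...and one tempdb data file per SSD volume as well (repeat per volume).
        ALTER DATABASE tempdb ADD FILE
            (NAME = tempdev_f, FILENAME = 'F:\tempdb\tempdb_f.ndf', SIZE = 8GB);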


  • Laptop for database performance consultants

    Today, it is actually possible to build a highly capable database system in a laptop form factor. There is no point in running a production database on a laptop. The purpose of this is so that consultants (i.e., me) can investigate database performance issues without direct access to a full sized server. It is only necessary to have the characteristics of a proper database server, rather than an exact replica.

    Unfortunately, the commercially available laptops do not support the desired configuration, so I am making an open appeal to laptops vendors. What I would like is:

    1) Quad-core processor with hyper-threading (8 logical processors),
    2) 8-16GB memory (4 SODIMM so we do not need really expensive 8GB single rank DIMMs) 
    3) 8x64GB (raw capacity) SSDs on a PCI-E Gen 2 x8 interface (for the main database, not the OS)
    - alternatively, 2-4 x4 externally accessible PCI-E ports for external SSDs
    - or 2 x4 SAS 6Gbps ports for external SATA SSDs 
    4) 2-3 SATA ports for HDD/SSD/DVD etc for OS boot etc
    5) 1-2 e-SATA
    6) 2 1GbE

    Below is a representation of the system, if this helps clarify.
    Proposed Sandy Bridge laptop system diagram

    The Sandy-Bridge integrated graphics should be sufficient, but high-resolution 1920x1200 graphics and dual-display are desired. (I could live with 1920x1080).
    There should also be a SATA hard disk for the OS (or SATA SSD without the 2.5in HDD form factor if space constrained) as the primary SSD array should be dedicated to the database.
    Other desirable elements would be 1 or 2 e-SATA ports, to support backups and restores without consuming the valuable main SSD array,
    and 2 x 1GbE ports (so I can test code for parallel network transfers).

    The multiple processor cores allow parallel execution plans. Due to a quirk of the SQL Server query optimizer, 8 or more logical processors are more likely to generate a parallel execution plan in some cases.
    Ideally, the main SSD array is comprised of 2 devices, one on each PCI-E x4 channel.
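    On the parallel plan point, the relevant knobs are the instance-level settings and the per-query hint sketched below; the values shown are illustrative, not recommendations.

        -- Instance-level settings that influence parallel plan generation.
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max degree of parallelism', 8;
        EXEC sp_configure 'cost threshold for parallelism', 5;
        RECONFIGURE;

        -- Or cap/allow parallelism per query.
        SELECT COUNT_BIG(*) FROM dbo.LINEITEM OPTION (MAXDOP 8);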

    The point of the storage system is to demonstrate 2GB/sec+ bandwidth and 100-200K IOPS. One of the sad facts is that even today storage vendors promote $100K+ storage systems that end up delivering less than 400-700MB/s bandwidth and less than 10K IOPS. So it is important to demonstrate what a proper database storage system should be capable of.
    Note that it is not necessary to have massive memory. A system with sufficient memory and a powerful storage system can run any query, while a system with very large memory but weak storage can only run read queries that fit in memory. And even if the data fits in memory, performance could still fall off a cliff on tempdb IO.
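    A minimal sketch of the kind of demonstration I have in mind, assuming a large TPC-H style table is loaded: time a scan-heavy aggregate and compare the elapsed time against the reported reads to estimate sequential bandwidth.

        -- Scan-and-aggregate test; STATISTICS TIME/IO report elapsed time and reads.
        SET STATISTICS TIME ON;
        SET STATISTICS IO ON;

        SELECT COUNT_BIG(*), SUM(L_EXTENDEDPRICE)
        FROM   dbo.LINEITEM;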

    Based on component costs, the laptop without PCI-E SSD should be less than $2000, and the SSD array should be less than $1000 per PCI-E x4 unit (4x64GB).
    It would really help if the PCI-E SSD could be powered off from software, i.e., without having to remove it. This is why I want to boot off the SATA port, be it HDD or SSD.

    NAND notes
    Per the system notes below, 2 SSDs on SATA ports do not cut the mustard.
    The spec above calls for 8 SSDs. Each SSD is comprised of 8 NAND packages, and each package is comprised of 8 die. So there are 64 die in one SSD, and IO is distributed over 8 SSDs, or a total of 512 individual die.
    The performance of a single NAND die is nothing special, and even pathetic on writes. However, a single NAND die is really small and really cheap. That is why it is essential to employ high parallelism at the SSD unit level, and then employ parallelism over multiple SSD units.
    An alternative solution is for the laptop to expose 2-4 PCI-E x4 ports (2 Gen 2 or 4 Gen 1) to connect to something like the OCZ IBIS, or an SAS controller with 2 x4 external SAS ports.

    System notes
    The laptop will have 1 Intel quad-core Sandy Bridge processor, which has 2 memory channels supporting 16GB of dual-rank DDR3 memory. The processor has 16 PCI-E gen 2 lanes, DMI g2 (essentially 4 PCI-E g2 lanes) and integrated graphics. There must be a 6-series (or C20x) PCH, which connects upstream on the DMI. Downstream, there are 6 SATA ports (2 of which can be 6Gbps), 1 GbE port, and 8 PCI-E g2 lanes. So on the PCH, we can attach 2 HDDs or SSDs at 6Gbps, plus support 2 eSATA connections. There is only a single 1GbE port, so if we want 2, we have to employ a separate GbE chip.

    While the total PCH downstream bandwidth exceeds the upstream, it is OK for our purposes to support 2 internal SATA SSDs at 6Gbps, 2 eSATA ports and 2 GbE, plus USB etc. The key is how the 16 PCI-E gen 2 lanes are employed. In the available high-end laptops, most vendors attach a high-end graphics chip (to all 16 lanes?). We absolutely need 8 PCI-E lanes for our high performance SSD storage array. I would be happy with the integrated graphics, but if the other 8 PCI-E lanes were attached to graphics, I could live with it.

    The final comment (for now) is that even though it is possible to attach more than 2 SSDs off the PCH, we need the bandwidth of the main set of PCI-E lanes; it is not acceptable for all storage to be clogging the DMI and PCH.

    Thunderbolt
    Thunderbolt is 2 x 2 PCI-E g2 lanes, so technically that's almost what I need (8 lanes preferred, but 6 acceptable).
    What is missing from the documentation is where Thunderbolt attaches.
    If it attaches directly to the Sandy Bridge processor (with a bridge chip for external connections?), then that's OK;
    if off the PCH, then that is not good enough, for the reasons I outlined above.

    Also, we need serious SSDs to attach over Thunderbolt; does the Apple SSD cut the mustard?

    The diagram below shows the Thunderbolt controller connected to the PCH, but also states that other configurations are possible. The problem is that most high-end laptops are designed with high-end graphics, which we do not want squandering all 16 PCI-E lanes.

    Thunderbolt controller block diagram

    A Thunderbolt controller attached to the PCH is capable of supporting x4 PCI-E gen 2, but cannot also simultaneously support saturation volume traffic from internal storage (SATA ports), and network (not to mention eSATA). I should add that I intend to place the log on the SATA port HDD/SSD, along with the OS, hence I do not want the main SSD array generating traffic over the DMI-PCH connection.

    A Thunderbolt SDK is supposed to be released very soon, so we can find out more. I am inclined to think that Thunderbolt is really a docking station connector, able to route both video and IO over a single connector. If we only need to route IO traffic, then there are already 2 very suitable protocols for this, i.e., eSATA for consumers and SAS for servers, each with a decent base of products. Of course, I might like a 4-bay disk enclosure for 2.5in SSDs on 1 x4 SAS port, or an 8-bay enclosure split over 2 x4 ports. Most of the existing disk enclosures carry over from the hard disk environment, with either 12-15 3.5in bays or 24-25 2.5in bays.

  • Oracle Index Skip Scan

    There is a feature called index skip scan that has been in Oracle since version 9i. When I came across this, it seemed like a very clever trick, but not a critical capability. More recently, I have been advocating DW on SSD in appropriate situations, and I now think this is a valuable feature for keeping the number of nonclustered indexes to a minimum.

    Briefly, suppose we have an index with key columns: Col1, Col2, in that order. Obviously, a query with a search argument (SARG) on Col1 can use this index, assuming the data distribution is favorable. However, a query with the SARG on Col2 but not Col1 cannot use the index in a seek operation.

    Now suppose that the cardinality of Col1 (the number of distinct values of Col1) is relatively low. The database engine could seek on each distinct value of Col1 combined with the specified SARG on Col2. Microsoft SQL Server currently does not have the Oracle index skip-scan feature, but the capability can be achieved with a workaround.

    In this example, the LINEITEM table has a clustered index key on columns L_SHIPDATE, L_ORDERKEY, but does not have an index leading with L_ORDERKEY. Our query is to find a specific order key in the LINEITEM table. If there is a table with the distinct date values, DimDate, we can force a loop join from the DimDate table to LINEITEM (even though only columns from LINEITEM are required) to get the execution plan below.
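
    A sketch of the workaround query is below; the DimDate column name and the literal order key are assumptions for illustration.

        -- DimDate supplies the distinct ship dates; the forced loop join turns the
        -- query into one clustered index seek per date on (L_SHIPDATE, L_ORDERKEY)
        -- instead of a full scan of LINEITEM.
        SELECT   li.*
        FROM     dbo.DimDate AS d
        INNER LOOP JOIN dbo.LINEITEM AS li
                 ON li.L_SHIPDATE = d.CalendarDate
        WHERE    li.L_ORDERKEY = 123456
        OPTION   (FORCE ORDER);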

    Skip scan workaround execution plan

    The question is now: how effective is this technique? The most efficient execution plan is of course, to have an index leading with the Order Key column. But the situation calls for keeping the number of nonclustered indexes to an absolute minimum. So how does the above execution plan compare with a table scan?

    A table scan for this type of query, where only a few rows meet an easy-to-evaluate SARG, might run at about 1GB/s per core. Note this is far higher than the 200MB/sec cited in the Microsoft Fast Track Data Warehouse documents. This is because the FTDW baseline is a table scan that aggregates several columns of every row, and on top of that, a Hash Match is required to group the results. Basically, a needle-in-haystack table scan runs much faster than the more complex aggregate and group scan. At most, a 1GB table scan might be acceptable for a non-parallel execution plan, and even a 50GB table scan could be tolerable on a powerful 32-core system with an unrestricted parallel execution plan.

    A loop join can run somewhere in the range of 100,000-200,000 seeks/sec to the inner source. Realistically, a data warehouse with 10 years of data has 3,652 distinct days (depending on the leap year situation). A loop join with 3,650 rows from the outer source should run somewhere around 36ms. Even if the DW had 20 years of data, this is still acceptable, on the assumption that searches on the non-leading Order Key column are in the minority, with the plus being that one less index on the big table is required. If the query can be bounded to a single year or quarter, then we approach the efficiency of having the extra nonclustered index.

  • Intel Xeon E7 (Westmere-EX) and Sandy Bridge comments

    Last week Intel announced the 10-core Xeon E7-x8xx series (Westmere-EX), superseding the Xeon 6500 and 7500 series (Nehalem-EX). The E7 group consists of the E7-8800 series for 8-way systems, the E7-4800 series for 4-way systems and the E7-2800 series for 2-way systems. Intel also announced the E3-12xx series (Sandy Bridge) for 1-socket servers, superseding the Xeon 3000 series (Nehalem and Westmere). This week at Intel Developer Forum Beijing, Intel has a slide deck on Sandy Bridge-EP, an 8-core die that will presumably become the Xeon E5-xxxx series superseding the Xeon 5600 series (Westmere-EP), scheduled for 2H 2011.

            | 2010-11                    | 2011-12
    High    | Xeon 6500/7500, 4-8 cores  | E7-8/4/2800, 6-10 cores
    Mid     | Xeon 5600, 4-6 cores       | E5-xx00, up to 8 cores
    Entry   | Xeon 3x00, 2-6 cores       | E3-1200, 2-4 cores

    The top-of-the-line Xeon E7-8870 is 10-core, 2.4GHz (max turbo 2.8GHz) with 30M last level cache, compared with the Xeon X7560: 8-core, 2.26GHz (turbo 2.67GHz) with 24M LLC. HP ProLiant DL580 G7 TPC-E results for the 4-way Xeon E7-8870 and X7560 are 2,454.51 and 2,001.12 respectively. This is a 22% gain from 25% more cores and 6% higher frequency, in line with expectations.

    IBM System x3850 X5 TPC-H results at scale factor 1TB for the 4-way Xeon X7560 and 8-way Xeon E7-8870 are below.

                    | Power     | Throughput | QphH
    4-way Xeon 7560 | 127,676.1 | 81,039.6   | 101,719.3
    8-way E7-8870   | 200,899.9 | 150,635.8  | 173,961.8

    It is unfortunate that a direct comparison (with the same number of processors and at the same SF) between the Xeon E7-8870 and X7560 is not available. The presumption is that the Xeon E7-8870 would show only moderate improvement over the X7560. This is because TPC-H is scored on a geometric mean of the 22 queries, of which only some benefit from a very high degree of parallelism.

    The more modest performance gain from Nehalem-EX to Westmere-EX, compared to the previous 40% per year objective, is probably an indication of the future trend in the pace of performance progression. The pace of single core performance progression slowed several years ago. Now, the number of cores per processor socket also cannot be increased at a rapid pace.

    Fortunately, the compute power available in reasonably priced systems is already so outstanding that the only excuse for poor performance is incompetence on the software side. My expectation is that transaction processing performance can still be boosted significantly with more threads per core; the IBM POWER7 implements 4 threads per core and the Oracle/Sun SPARC T3 implements 8. It is unclear whether Intel intends to pursue this avenue. Data warehouse performance could be increased with columnar storage, already in Ingres VectorWise and coming in the next version of SQL Server. Scale out is now available in PDW (EXA SOL has TPC-H results with 60 nodes).

    I am also of the opinion that SIMD instruction set extensions for the row and column offset calculations could improve database engine performance. The objective is not just to reduce the number of instructions, but more importantly to make the memory access sequence more transparent, i.e., to allow for effective prefetching.

    At the system level, the processor interconnect technology (AMD HyperTransport and Intel QPI) should also allow scale-up systems. HP has mentioned that a 16-way Xeon is possible. HP already has the crossbar technology from their Itanium based Superdome and the sx3000 chipset. It is probably just a matter of gauging the market volume of the 8-way ProLiant DL980 to assess whether there is also a viable market for 16-way Xeon systems.

    Another observation is the price structure of 2-way and 4-way systems. It used to be that there was very little price difference between 1-way and 2-way systems with otherwise comparable features, so the default system choice frequently started with a 2-way system and went up from there. On the downside, the older 2-way systems also did not have substantially better memory or IO capability. With the 2-way Xeon 5500 and 5600 systems, there is a more significant price gap between 1-way and 2-way systems. However, the 2-way Xeon 5500 systems also have serious memory and IO capability. So the low-end entry choice now needs to revert to a single-socket system.

    The price gap with 4-way systems has also grown along with capabilities, particularly in memory capacity and reliability. The default system upgrade choice should be to replace older 4-way systems with new generation 2-way systems. The new 4-way systems should target very-high reliability requirements.
