THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Joe Chang

  • Hardware rant 2015

    It has been a while so I suppose it is time for another rant on hardware. There are two systems I would like:
    One is a laptop.
    The second is a server capable of demonstrating extreme IO performance, with the secondary objective of being small enough to bring along to customer sites.

    On the laptop I am looking for
    1) quad-core with HT, i.e. 8 logical processors for better parallel execution plans.
    2) PCIe SSD, I would prefer 3GB/s+, so PCIe gen3 x4, or 2 x M.2 PCIe x2 is also an option.
    3) drive 1, but preferably 2 external 4K monitors (so I can look at complex execution plans)

    On this matter, it is time to bitch at the MS SQL Server team that there should be an option to contract the white space in execution plans. The existing zoom capability is worthless.
    Yes I know SQL Sentry Plan Explorer can do this, but really MS, is it so hard? or have you outsourced the entire SSMS some team that does not know that there is such a thing as complex queries?
    The reason I want to drive 2 external 4K displays is that at the 4K resolution, I need more than a 28 in monitor to use the resolution.

    A couple of days ago, Dell announce the new XPS 15 with Core i7-6700 processors (Sky Lake) which I immediately ordered , but unfortunately it shows a shipping date of Nov 16

    it does have 4K display, and 1 external port which may or may not support 4K. I thought I ordered the docking station, but I do not know if this would support dual external 4K monitors.
    I currently have the Dell 28in 4K monitors, which is great for looking at execution plans, but at the normal text size setting, is difficult read.
    I am thinking that the much more expensive Dell 32in 4K monitor will be better, but maybe not enough. Should I get a 55in 4k TV instead? these all have just the HDMI connector, so I need to make sure there are proper 4K adapters.

    The new XPS 15 data sheet says it has HDD bay (SATA interface) and one M.2 bay (uncertain if PCIe x2 or x4). I would have been nice if 2 M.2 x2 bays were available instead of the HDD bay. I ordered the XPS 15 with the PCIe SSD. I do not know if it is good one (Samsung SM951 cite 2150MB/s) if not, it will throw the Dell SSD out, and get a good one.

    One more thing, ECC memory
    Intel desktop and mobile processors all do not have ECC (or parity) memory capability. ECC memory has been built into Intel processors for some time now, it is just disabled in the Core product lines, enabled on in the server Xeon line.
    So the Sky Lake equivalent is the Xeon E3 v5. Intel released the v5 under the mobile group, with a 45W rating.
    Unfortunately I cannot find a laptop for sale that uses the Xeon E3 v5.

    Perhaps Dell or someone could offer a Xeon E3 mobile system?

    Extreme IO Performance demonstrator
    First, why do I need such a thing?
    when my clients have multi-million dollar SAN storage systems?
    Because SAN people are complete idiots on the matter of IO performance, being locked into irrelevant matters (to enterprise DB) like thin provisioning etc.
    Invariably, the SAN people (vendor sales engineer, the SAN admin etc) confuse that Fiber Channel is specified in Gigabits/sec (Gb/s) while all other IO bandwidth is specified in GigaBytes/sec (GB/s).
    So we have a multi-million dollar storage system (full of add-on software that have no purpose in an enterprise DB) connected to a powerful server (60+ cores and paying for SQL Server 2012 EE per core licensing) over 2 x 8Gbit/s FC links.
    Is this stupid or is this exceptionally stupid?

    Yes I know it is extremely rude of me to call other people stupid, and that being stupid is not crime, but when you are the vendor for multi-million dollar equipment, there is a reasonable expectation that you are not stupid.

    So onto the system.
    For this, I am sure I need more than 4 cores, so it needs to the Xeon E5. Perhaps 8 cores (single socket) is sufficient.
    The new Intel SSD DC P3608 has great specs, but I am not sure when it is actually available?
    I would put 2 of these in the system to demonstrate 10GB/s. Ideally this would all go into box that fits carry on luggage, which is unfortunately not one of the standard PC or server form factors.

    Another option is a 2 x 12 core system to demonstrate 20GB/s on 4 x P3608.

    I would prefer to get a laptop without high performance graphics, the NVidia GTX 960M in this case.
    The current Intel graphics is sufficient for high resolution rendering, but I do not need high frame rate. All the Intel Core i7 6th gen processors have graphics, I wonder if I can remove the GTX (for power savings)?
    Apparently Dell will have a new docking station, the Thunderbolt Dock TB15 next year, that will support 2 x 4K monitors?

    I did already rant on PC laptops only being available with 16x9 displays?
    How stupid is this? It is one thing for consumer laptops to have a 16x9 display, on the assumption that the home users just watch movies.
    but on what justification does this apply to business and workstation laptops?

    Concurrent with the Intel Skylake Xeon E3 v5 regular announcement, Supermicro announced motherboards for the E3 v5.
    There is a micro-ATX (X11SAE-M) but with just 1 x16 and 1 x4 PCIe g3 slots.
    where as the ATX (X11SAT) has 3 slot with 16/8/8 as an option. This would let me put 2 P3608? for 10GB/s?

  • GHz and MHz-BIOS updates and Amdahl Revisited

    Last week, a routine performance test ran about twice as long as expected. A check of dm_exec_query_stats showed that CPU bound statements (worker time roughly equal to elapsed time) were approximately 3 times higher than previous tests for matching SQL statements. Almost all of the SQL involved single or few row index seeks, usually Select, some Insert and Update. The server system is a 2-socket Xeon E5-2680 (8 cores, 2.7GHz nominal, Sandy Bridge) in a managed data center. The data center had sent out notice that there would be system restarts the previous weekend, which could mean either OS patches or BIOS/UEFI updates. So naturally the next thing to do is check the Processor Information object for the Processor Frequency and % of Maximum Frequency counters (or equivalent). This showed 135, as in 135MHz, and 5% of maximum. Another system of the same model also rebooted showed 1188 and 44%.

    This issue has occurred previously in this environment and in other HP systems that I am aware of. The BIOS (or UEFI) update puts the system into one of the energy efficient configurations. It could also be an operating system setting, but most that I have seen are BIOS settings? One can imagine a firmware engineer being so committed to green activism that this was made the default on BIOS updates without discussion with other parties. Perhaps there is a facility (in Houston?) with inadequate air conditioning for the number systems, that this setting was put in to prevent the lab from overheating. Then no one remembered to exclude the step in the production code? Not that I have ever done such a thing (and no further questions on this should be asked).

    Another question might be why the data center monitoring team did not check for this, as it has happened before. The whole argument for going to managed data center instead of a simple hosted data center was that the managed data center could provide the broad range of expertise that is not economical for a mid-size IT department. Obviously this managed data center did not monitor for the performance/power configuration.

    This matter is of serious concern to production DBAs and IT staff in handling operations. As the Processor Information performance object with extended information was only introduced in Windows Server 2008 R2, many software monitoring tools may not alert on changes of Processor Frequency, especially after reboot. Imagine the IT staff or DBA encountering this for the first time on the production servers, with users complaining, your boss watching over your shoulder, and his/her boss hovering over your boss, offering their helpful insights in the non-judgemental manner as bosses do.

    Performance Insight

    However, I am more interested in a different aspect of this incident. When there are two sets of data, one for the processors cores at 2.7GHz and another at presumably 135MHz, we can extrapolate parameters of interest. Does it seem stunning that the drop from 2.7GHz to 135MHz, a factor of 20, only decreases CPU efficiency (increase CPU-sec, or worker time) by 3X? Perhaps, but this actually should have been expected.

    The salient aspect of modern computer system architecture is the difference between CPU clock cycle and memory access time. A young person might not know, but old timers would know. Up to about 20 years ago, the primary memory performance specification was access time, with 80, 70 and 60 ns being common in fast page mode and extended data out. Then with the switch to synchronous dram (SDRAM), the key specification changed to data rate. In the Xeon E5 (v1) generation, DDR at 1333MHz was common. This means a memory channel can deliver one line every 0.75ns, or 1.333 billion times per sec, with a line being 64-bits (excluding ECC bits). The Xeon E5 26xx series has four DDR3 channels. The Intel processor internally is shown as having 2 memory controllers, each controller driving 2 DDR channels, so channel can have different meanings depending on the context).

    What is less commonly cited is the round trip latency, from a processor issuing a request to memory, the internal memory access within the DRAM chip and finally the transmit time back to the processor. (The L1, L2 and L3 cache sequence is also involve in memory access timing.) For local memory (attached directly to the processor) this is around 50ns. For memory on an adjacent processor, the round trip time might be 95ns or so.

    On a 2.7GHz processor, the CPU cycle time is 0.37 ns, so 50ns for local memory round trip access is 135 CPU cycles. This particular system has 2 sockets, so one might expect that half of memory accesses are local at 50ns round-trip latency, and half at 95ns latency.

    This is a well understood issue. Two methods of addressing the disparity between CPU cycle time and memory access are 1) large cache on the processor, and 2) pre-fetching memory. Current Intel processors have dedicated 32KB I+D L1 and 256K L2, both per core, and an additional shared L3 cache sized at 2.5MB per core. From Pentium 4 one, the processor pre-fetches 64-bytes (the cache line size) with an option to prefetch the adjacent cache line. Prefetching is exposed in the instruction set architecture (can someone provide a reference please) and there should also be a BIOS/UEFI for hardware prefetch.

    Simple Model

    Now lets visualize the (simplified) code sequence in a relational database engine with traditional page-row data structures. There is a memory access for the index root level page. Read to the page to find the pointer for the second level page. Memory access, and repeat. It is a sequence of serialized memory accesses with poor locality (so cache can only help so much) and the next location is not known until the current memory request is completed, so pre-fetching is not possible.

    Modern processor performance characteristics are very complicated, but we will attempt to build a very simple model focusing on the impact of round-trip memory access latency. Start with an imaginary processor with a cycle time equal to the full round-trip memory access time. In this scenario, one instruction completes every cycle, be it an arithmetic or logic or memory access instruction.

    Such a system may have never existed so now consider a system where the round trip memory access latency is some multiple of the CPU cycle time. The average time to complete an instruction where time is in units of the memory access latency (50ns or 20MHz for local node), “a” is the fraction of instructions that involve (non-local, non-prefetch-able) memory access and “n” is the processor frequency.

      (1-a)/n + a

    The term (1-a) is the fraction of instructions that are either not memory access, or memory access to cache (from previous access or pre-fetched). “1/n” is the processor cycle time (in units where memory access time is 1).

    Performance (instructions per unit time), the inverse of average time per instruction is:

      P = 1 / ( (1-a)/n + a )

         = n / (1 +(n-1)*a )

    We can see the the above equation has characteristics that as processor frequency increases, the upper bound on performance is:

      n -> infinity, P = 1/a

    Also, if the fraction on instructions that require memory access, “a,” is zero, then P = n.

    Does the above look familiar? It is just Amdahl’s Law, which formulated in the old days to demonstrate the limits of vectorization in supercomputers. I have just recast it to express the limits of increasing processor frequency relative to round-trip memory access time.

    If someone would like to check my math, please do so. It has been a long time. Trying tricking your teenage son/daughter into doing this as a practical math exercise?

    OK, anybody still reading is obviously not deterred by math, or knows the trick of skipping such things. What am I really after? In the above equation, what is known is processor frequency relative to memory access latency. While we know the performance or worker time of certain queries, we do not know it terms of instructions per CPU-cycle. And the second item we do not know is the fraction of instructions that introduce a round-trip memory access latency that cannot be hidden with cache or pre-fetching.

    But, we have data points at 2 frequencies, 2.7GHz and reportedly 135MHz. Express the relative performance between the two points as a ratio.

      P2/P1 = R

    Then from the two equations

      P1 = 1 / ( (1-a)/n1 + a )

      P2 = 1 / ( (1-a)/n2 + a )

    we can solve for a in terms of the know values n1, n2 and R.

      a = (n2 – n1*R) / ( n1*n2*(R-1) + n2-n1*R )

    Assuming memory access latency of 50ns, the base frequency is 20MHz corresponds to memory access in 1 cycle. Plugging in the values n1 = 135MHz / 20MHz = 6.75, n2 = 2700/20 = 135 and R = 3. We get a = 0.059, or 5.9% of instructions incurring a non-cached, non-prefetch round-trip memory access latency would result in a 3:1 performance ratio between 135MHz and 2700MHz. (Perhaps it would be more correct in estimating round-trip memory access latency as the average between the local and 1-hop remote node at 75ns?)

    So while it might seem astonishing that the difference between 135MHz and 2700MHz translates to only 3X performance, the database transaction processing workload is an extreme (but important) case. There are many workloads which exhibit better locality or have memory access patterns that are amenable to prefetch and have performance scaling better with frequency.


    Earlier, two methods of hiding round-trip memory access latency were mentioned. There is another, Hyper-threading. The processor core to appears as two (or more) logical processors to the operating system. Presumably, there is an extra set of program counters, and resources to determine which physical registers (different from the registers specified in the instruction set architecture) are assigned to each logical processor.

    In the earlier example, say that the round-trip memory access time is 135 CPU-cycles and the fraction of instructions that incurs the full round-trip latency is 6%. Then for 100 instructions, 94 are executed in 1-cycle each (excluding consideration for superscalar) as either not involving memory access or data is already in cache, and the 6 the incurs the round-trip memory latency of 135 cycles. Then the total time in terms of CPU-cycles is 94*1 + 6*135 = 904. In other words, only 100 cycles out of 904 are used, the rest are no-ops.

    The Intel Xeon processors from Nehalem on implement Hyper-Threading with 2 logical processors on each physical core. (This can be disabled in BIOS/UEFI. Some models have HT disabled. The earlier Intel Pentium 4 based processors implemented a more complex form of HT.)

    In considering the nature of the database transaction processing workload, being a memory access to determine the next memory access in nature, it is perhaps time for Intel to increase the degree of HT, especially considering that the server oriented Xeon E5 and E7 models are already 1 full year or more behind the smaller desktop/mobile processor variants. I seem to recall IBM POWER as having 4 logical processors per physical core, one of the SPARC processor lines as having 8. It would also be necessary to have a good strategy for using HT based on workload. The option to enable or disable HT in the BIOS/UEFI is not I what mean. HT should be visible to the OS. But the application itself should detect the presence and degree of HT, and make its own decision on whether HT should be used and how it should be used.

    Xeon Phi, Many Integrated Core

    Another item worth mentioning here is the Intel many integrated core (MIC) architecture, codename Knights something, now Xeon Phi. The processor puts many smaller processor cores, 61 in the 22nm Knights Corner, versus 12-18 in the 22nm mainline Xeon processors. The theory behind many smaller cores stems from one of the two main elements of Moore's Law. Doubling the number logic transistors/complexity in a single core should translate to about 40% performance gain.

    (This was the case up to several generations ago. Since then, Intel no longer tries to double the logic from one process to the next. There might be 10-20% performance gain in general instructions. Some effort is given to expanding the functionality of the special/vector instructions. And most effort has been in increasing the number of cores.)

    One manifestation of this (more logic transistors) could increased frequency (which Intel stopped pursing years ago). Another might be more execution ports (8 in Haswell) or other areas to improves instructions per cycle (IPC). Following the rule of 2X transistor per 1.4X (square root of 2) backwards, the expectation is that a processor if 1/4th the size would have 1/2 the performance. But potentially there could be 4X as many cores, depending on interconnect and power limitations. So in workloads that are amenable to vectorization, or otherwise can be parallelized, the more smaller cores could be a better strategy.

    The Xeon Phi is targeted to HPC workloads, as reflected in the 512-bit SIMD instructions. If we were thinking about a transaction processing database engine on the MIC architecture, we would probably consider a very basic ALU without SIMD, (not sure on FP). I am thinking that an asymmetric processor architecture might be the objective. Perhaps two powerful cores from the current main line, and many simpler cores (without SIMD) perhaps even simpler than Atom? (The Intel Xeon Phi line implements Hyper-Threading with 4 logical processors per physical core.)


    As said earlier, the nature of database page storage along rows make serialized memory access (also called pointer chasing code?) its hallmark. This is why there is interest in column storage architecture. Now all of the sudden, for certain database workloads, the next memory access is 4 bytes over, already in cache. The work a little further down touches memory in the next 64 byte line or two away. Both the software and hardware knows this, and either is capable of issuing a pre-fetch. It does not matter that columnstore must touch more data. Processor can stream huge volumes of data, much more effectively than pointer chasing only the necessary rows.


    I should probably say something here.


    As I said earlier, modern microprocessors are very complex. Pipelined execution was introduced (in Intel processors) with the 80486 (1989) and superscalar execution with Pentium (1993). Pipelined means that while the processor can complete an instruction in each cycle, the actual start to finish time of a specific instruction occurs over several cycles. Intel does not talk about pipeline stages any more, but there are occasional references to Core2 and later processors having a 14+ stage pipeline. (Wikipedia say Core2 is 12-14 stage pipeline. Nehalem and later 20-24?, Sandy Bridge as 14-19.)

    Superscalar means that there are more than one execution unit, with the goal of completing more than one instruction per cycle. Haswell has 8 execution ports. Several processors generation prior were 6-port on superscalar. We could apply the principle of Amdahl’s on scaling performance to any and all of pipe-lining, superscalar, and round-trip memory latency, and probably other things too.

    Rethinking Computer System Architecture

    I have said this else where. It is long past due to do a clean sheet system architecture with matching change to OS architecture. Current system architecture stems from the 1970's of processor with physical memory (8MB was big) and a page file on disk. Why do we still have a page file on disk? In the old days, there was not enough physical memory such that it was tolerable to have a page file on disk to support a larger virtual address space.

    Today, more the 1TB of physical is possible and affordable (compare to the cost of SQL Server per core licensing). But the key factor is in how memory is used. Back then, it was mostly for executable code and internal data structures. The assumption was that very few database data pages would actually be in memory at any given point in time. Today, a very large percentage of memory is used for caching data pages. Of the memory used for executable code and internal data structures, most is junk.

    The CPU-cycle time to memory access time discrepancy dictates that the more urgent strategy is to get memory closer to the processor even if it means drastically reducing the size of true memory, to perhaps a few GB per socket. Given that DRAM is so cheap, we would still have system with multi-TB DRAM capacity, except that this would now be the page file. Of course the operating system (and applications) would have to be designed around this new architecture. Given how well the Intel Itanium software coordination went, I guess this might be too much to expect.

  • Transaction IO Performance on Violin

    Back in Feb, I went on a diatribe-rant against the standard SAN Vendor configuration practice. The Problem with Standard SAN Configuration IO Performance, article and accompanying post, showed IO performance metrics for a batch driven transaction processing workload on a SAN managed by a data center/cloud company. The only option offered by the service provider was to request volumes for storage. There was no consideration for special IO characteristics of transaction processing or other database workloads. No discussion. This practice is doctrine pontificated by SAN vendors, calling SAN admins on a mission to implement the "storage as a service" concept while remaining completely blind to the requirements of mission critical databases.

    Ok, I am venting again. Now I have performance metrics for the same workload, except that storage is on a Violin system. The system is different in having 24 physical cores, no HT, and 256GB memory versus previous system with 16 physical cores, HT (32 logical) and 384GB.

    Below are the IO characteristics. The horizontal axis time scale is 5 min per major division for 1 hour across the entire chart. Each tick is 1 minute. Data points are every 15 sec. Note that the Feb (HDD) charts were 1 min per major division, 15 min total, with data point every 5 sec.


    Transactions/sec (red)

    IOPS - read (green), write (blue), log write (red)

    ms/Rd or Wr

    The obvious difference between the IO characteristics on Violin and the previous HDD-based storage is that checkpoints now have almost no visible impact on performance. Both CPU and transactions/sec are very steady, with slightly noticeable blips, versus the severe drops before. It is evident that checkpoint writes now have almost no impact on data reads or log write IOPS. The same is true of IO latency, in milli-seconds per read or write.

    If the storage had been on HDD storage but with logs on a separate physical disks, we would expect that the checkpoint would drive up data read latency, and hence depress data read performance. But it would have no impact on log write latency, and hence no impact on log write performance. The lower data reads should have only moderately depress performance, not severely.

    The difference in system processor, 24 physical cores no-HT versus 16 cores plus HT is not a factor in the IO characteristics. The difference in physical memory, 256 GB on the system Violin storage and 384 GB in the system with HDD storage is evident in the data read IOPS, starting at 7-8K IOPS then drifting down to 2-3K IOPS on the system with less memory, compare with mostly 1K IOPS on the system with more memory. Both storage systems can easily handle 20K IOPS.

    The main argument here is not that SSD/Flash storage is a requirement for transaction processing databases, even though there are clear benefits. (NAND flash based SSD have both maturity and cost-structure to be very attractive for any new storage system purchases.) The point is that there is a severe problem with the SAN vendor doctrine of one common pool for all volumes.

    This very severe problem can mostly and easily be mitigated simply by having separate physical disks for the log volume. So the point could and should be demonstrated by showing the IO performance on an HDD SAN with separate physical disks for logs. But this violates the SAN doctrine of ignoring user requirements, and would not be considered or allowed by the service provider under any circumstance. So the only real solution is the keep performance critical databases off a storage system administered by a team on a different mission than supporting the database.

    Below are excerpts and the graphs from the Feb article.

    Standard SAN Configuration IO Performance 2015-02

    The chart below is CPU. The horizontal axis is time. One major division marked by the vertical line is 1 minute, and the small tick is 12 sec. The data points are 5 sec. There are 12 steps between each major division. The vertical axis is overall (system) CPU utilization in percent. Each of the stacked green lines represents an individual processor. There are 16 physical cores and 32 logical. A single logical core at 100% utilization would show a separation of 3.125% to the line below.


    On the second chart, the red line is the performance monitor object: SQL Server:Databases, counter: Transactions/sec. Note that the vertical axis is log-scale, base 10. One major division is a factor of 10. Each minor tick is an integer. The first small tick up from a major tick is 2, the next is 3 and so on to 9 for the last small tick.

    time scale

    The third chart is IOPS. Green is data reads, blue is data writes, and red is log writes. The vertical axis is log scale.


    The fourth chart is IO latency, milli-sec per IO. The same color codes applies as for IOPS. Again the vertical axis is log scale.

    IO latency ms
    ms/Rd or Wr

  • SAN Configuration Performance Problems

    The charts provided here illustrates my complaints against SAN vendor doctrine, obstinately adhering to the concept of one large pool of disks from which all volumes are created for any purpose (data, log, and junk non-DB stuff). There is no consideration for the radically different characteristics of hard disks in random versus sequential IO (low vs. high queue depth IO behavior should also be an element of IO strategy). The architecture of all traditional relational database engines are built on the premise that high volume log writes are possible at very low latency (using dedicated disks) in order to provide durability of transactions. And yet SAN vendors blithely disregard this (because it is at odds with the doctrine derived from principles invented) to justify their mission to sell inexpensive commodity hardware components at extraordinary prices.

    Transaction Performance Data

    The chart below is CPU. The horizontal axis is time. One major division marked by the vertical line is 1 minute, and the small tick is 12 sec. The data points are 5 sec. There are 12 steps between each major division. The vertical axis is overall (system) CPU utilization in percent. Each of the stacked green lines represents an individual processor. There are 16 physical cores and 32 logical. A single logical core at 100% utilization would show a separation of 3.125% to the line below.


    On the second chart, the red line is the performance monitor object: SQL Server:Databases, counter: Transactions/sec. Note that the vertical axis is log-scale, base 10. One major division is a factor of 10. Each minor tick is an integer. The first small tick up from a major tick is 2, the next is 3 and so on to 9 for the last small tick.

    time scale

    The third chart is IOPS. Green is data reads, blue is data writes, and red is log writes. The vertical axis is log scale.


    The fourth chart is IO latency, milli-sec per IO. The same color codes applies as for IOPS. Again the vertical axis is log scale.

    IO latency ms

    The SQL

    The SQL activity is batch driven transaction processing. There are 14 or so threads concurrently looping through a set of items to be processed. Each item involves about 20 rows of insert or update activity, hence 1000 log writes per sec corresponds to approximately 20,000 transaction/sec on the SQL counter.

    Most of the active data is in memory. There are probably 30-40 SELECT rows for each transaction or twice as many reads as writes. The data IO reads necessary to support the 20,000 inserts and updates/sec is about 2,000/sec, which the storage system is capable of supporting at about 4ms latency. This is because the data resides a small part of each disk. The actual latency for random IO is less than the expected value of 7.5 ms for data randomly accessed over an entire (10K) disk at queue depth 1.

    For approximately 20 seconds out of each minute, the transaction rate drops from the peak value of 20,000 all the way down to about 8,000 per sec (noting the log scale). These are the check points when the data write IO surges to 20-50K IOPS, (which demonstrates that the storage system is actually pretty decent) though write latency is driven up to 50-90ms.

    The checkpoint surge also pushes log write latency up from 1ms to 20-30ms. This dataset occurred during the day, when presumably there was activity for other hosts on different volumes but on the same SAN. At night, log write latency away from checkpoints could be under 0.3ms even at high volume.

    The Storage System

    I had not previously discussed the storage configuration in detail. The storage system consists of 240 x 10K HDDs only, with the standard system level caching. The SQL Server host is connected to the SAN over 4 FC ports (8Gb from host to switch, 4Gb from switch to SAN, and presumably 4Gb on the SAN backend?). The data is distributed over 8 volumes. The log is on a separate volume as seen by the host OS.

    The Problem

    The problem is that on the SAN, all disks are aggregated into a single pool, from which volumes are created. This was done per standard SAN vendor doctrine. Their magically great and powerful "value-add" intelligence would solve all your performance problems. We cannot ask for dedicated physical disks for the log because the SAN was already configured, with the SAN admin getting assistance from the SAN vendor's field engineer who followed the SAN vendor's doctrine.

    Input from the DBA team was not solicited and would have ignored in any case. Besides, there are no unallocated disks. And no, the SAN team will buy more disks because there are no empty bays in the disk enclosures. And there is no room for more enclosures in the storage rack. So the DBA request is denied.

    Even if we put up the money to get an extra cabinet for one more disk enclosure, the SAN admin will still refuse to configure dedicated physical disks for the log volume because the SAN vendor said that their great and powerful SAN will solve all performance problems. Any problems must be with their application and not the SAN. As can be seen from the charts above, this is a load of crap.

    The SAN Vendor Solution

    As I said above, this particular SAN is comprised of 240 or so 10K HDDs.

    Naturally the SAN vendor's proposed solution is that we should buy more of their value-add products in the form of auto-tiering SSD-HDD, and perhaps additional SSDs as flash-cache. The particular SAN with base features probably has an amortized cost of $4,000 per HDD. So SAN with 240 disks would cost just under $1M (while still failing to provide desired database performance characteristics). A mid-range SAN might have amortized cost per disk of $1,500-2K. Enterprise SAN could be $4-6K per disk.

    The additional value-add features would substantially increase the already high cost, while providing only minor improvement, because the checkpoint IO surge will still drive up log write latency.

    The sad thing is that the SAN group might buy into this totally stupid idea, and refuse to acknowledge that the correct solution is to simply have dedicated physical disks for the logs. If there were dedicated physical disks, the checkpoint data IO surge goes to completely different physical disks than the log disks.

    In the specific example, it is not necessary to have separate FC ports for the logs. The 50K IOPS at 8K per IO generates 400MB/sec, which is only 25-33% of the realizable IO bandwidth of 4 x 4Gbit/s FC ports. The checkpoint IO surge would increase latency on data reads, but the primary reason for the performance drop is the increase (degradation) in log write latency.

    Another angle is changing the checkpoint parameters in SQL Server, but the real problem is because we are prevented from leveraging the pure sequential IO characteristics of HDDs by allocating data and log volumes from a common pool.

    One more item. In the old days before we had immense memory, typical transactional database data read/write mix was 50/50. This is because a read forces a dirty page to be written to disk. In this situation, a data write IO surge would also depress the data reads necessary to support transactions. So the standard practice those days was to set the checkpoint interval to infinity to prevent data write IO surges. In our case, very little data reads are necessary to support transactions, so the checkpoint surge might depress data reads should have lesser impact on transactions. It is the increase in log write latency that is depressing transaction volume.

    Solutions that work

    A solution that would work is simply to have separate dedicated physical disks for the log volume. It is that simple! And yet this is not possible because the SAN people would refuse to do this, as it is not in their agenda.

    It is unfortunate that the only practical solution is to get the critical database off the corporate SAN. I have done this by going to all flash in the form of PCI-E SSDs. That is, SSDs installed internal to SQL Server system. This is not because the exceptional performance characteristics of SSDs were required.

    It was because I needed to get away from the SAN admin and his dogmatic adherence of SAN vendor doctrine. The IO performance requirements could have been meet with a direct-attach HDD array (or on a SAN). But anything with HDD enclosures would have been under the authority of the SAN admin, who would have nixed any storage components that was not a SAN, and then configured it according the SAN vendor principles.

    I have used the excuse that PCI-E SSD "accelerators" are needed for tempdb, which are not really "storage", hence there is no IT department mandate that it be on the SAN, under the absolute control of the SAN admin. In fact there were no special requirements for tempdb different from that of data. Then for unrelated reasons, there was enough capacity to put the entire DB on local SSD. Perhaps a file group with non-critical objects could reside on the SAN to carry the pretense of the local SSD not really being storage.


    Note that I have not naively suggested engaging in constructive dialog with the SAN team. They are on a holy mission that has no alignment with their company/organization's mission. Anything that contradicts SAN scripture is heresy.

    Oracle database machine has been described as hardware optimized for Oracle database. In fact the true mission is to take the SAN admin out of the loop. I think HP offered appliance oriented systems for SQL Server in 2012?, but they only system option was the DL980? which is severely narrow market segment. There needs to be DL380 and 580 options as well.


    A SAN is simply a computer system that resides between storage elements and hosts (storage clients) providing volumes (similar to any other client-server application). One practical feature is that the SAN can provide protection against single component failure. I have no objection to this concept, nor to the fact that the SAN vendor wants to sell hardware at a list price of 10X markup to cost.

    Strangely, it seems that people want to buy expensive stuff (just look around outside of IT). Consider that an organization might spend anywhere from tens of millions to a few billion dollars to develop the main database application. It does not seem right (to a high level executive) to put such an application on a $100K server plus $200K storage when there are $1M servers and $6M storage systems available. Never mind whatever might be the consequential differences between them. Or the fact that there are supreme technical challenges in scaling on (hard) NUMA, and oh yeah, the chief architects have never heard of NUMA.

    The point here is that people will buy very expensive SAN systems without technical understanding or justification. There is a perfectly sound business justification. What the client needs is a system supported by deep experts. Not some rookie field engineer who incorrectly chooses to replace the memory board with no uncorrected errors before then board with uncorrected errors.

    My observation is that the SAN Vendors feels a need to have technical justification for selling a storage system with extraordinarily high marks. So the vendors creates reasons. (Technical justification created for marketing requirements tend to have serious logical flaws, but a creative marketing professional is not easily deterred).

    It follows on pretending that these technical justifications are valid, then the "best" practices for employing a SAN should be blah, blah, as derived from the underlying reason. Do I need to explain the consequences of implementing practices built on a lie?

    I will add that there is an absolutely critical value-add that the storage system vendor must provide that alone justifies very expensive pricing. This is integration testing, verifying that the complex system with very many components work well together. The hard disk vendors are generally good at validating that their products works to specification in desktop and workstation configuration with 1 or even a few drives. Ensuring a storage system with several controllers, thousands of HDD, dual-controllers etc., is vastly more complicated. This is a valid reason. Building practices on a valid reason has benefits.


    In a large environment, there might be hundreds or thousands of servers, database instances, SQL Server or other. Managing very many systems and databases economically requires a high degree of automation, usually implemented with standard configurations.

    However, there are some databases that are critical either in being required for day to day operations and possibly providing a competitive advantage over other options. One indicator of this is that there are DBAs and developers dedicated to a specific application. In this case, the argument that customization of storage configuration is not feasible because of the other responsibilities on the SAN team is total BS.

  • ExecStats update - automating execution plan analysis

    It has been over 7 years now that I have made my ExecStats
    (current version Exec Stats 2015-02-18)
    tool publicly available (late 2007), with prototype versions going back to 2006. The two distinguishing elements of ExecStats is 1) the emphasis on cross-referencing a) query execution statistics to b) index usage via the c) execution plans, and 2) saving information locally so that it can be sent to a remote expert.

    The more recent versions now have the capability to simultaneous group stored procedure statements together while consolidating SQL by query hash. On the performance (counter) monitoring, charts are now displayed on logarithmic scale, allow insight over several decades of dynamic range.

    Too many of the commercial SQL Server performance tool emphasis generating pretty charts and reports on query execution statistics, thinking that by giving the top SQL to the originating developer is all that is expected of the DBA. In principle the DBA should have greater expertise on SQL Server performance than a developer, who is expert on the development platform (Visual Studio or other) and perhaps the business logic.

    In any case, the query execution statistics by itself is not very helpful, without the execution plan, and possibly the difference between estimated and actual rows. We should first appreciate that SQL Server is of some complexity, with several factors that could have adverse impact on performance.

    Examples are 1) does the table architecture support the business logic, 2) is the SQL written in manner that the query optimizer can interpret in the desired manner, 3) having a few good indexes, 4) no more indexes than necessary, 5) a good statistics strategy, not just frequency of rebuilds but also whether full scan samples are necessary, and possibly explicit statistics update in between certain SQL statements typically for ETL, 6) compile parameter issues, 7) row estimation issues at the source, 8) serious row estimate errors after intermediate operations (not at the source) and of course 9) system level issues.

    Some very advanced tools have the capability of generating alternative SQL in an attempt to solve a performance problem. The question is have other factors been considered before jumping into re-writing the SQL?

    I have had good success with ExecStats in onsite engagements in greatly simplifying data collection and to a degree the analysis as well. On many occasions I have been able solve simple or medium complexity issues remotely with just the information collected by ExecStats. It should be able to work against Azure as well, but this is only tested intermittently (and on request). People are invite to use and send feedback.

    ExecStats documentation

  • extensions for sp_helpindex

    The system stored procedures in SQL Server from the very beginning(?) provide useful information. However they have not been updated in a substantial manner for the new features in later versions, nor have they been extended with additional details that are now available in the DMVs for more sophisticated DBA's. Presumably, this is for backward compatibility. As an alternative, there is a provision for creating custom procedures that behave as system procedure with sp_ms_marksystemobject.

    In the case of sp_helpindex new features of are included columns, filtered indexes, compression and partitioning. Other information of interest might be size, index usage and statistics. Part of the reason for not changing could be for backward compatibility, which is fine, but let's then make new system procedures with extended information.

    In the text of sp_helpindex for SQL Server 2012, the only difference from earlier versions is that the description field has a provision for columnstore. The SQL Server 2014 version adds hash index and memory optimized in the description field.


    Below is my new procedure for extended index information. I have retained the same error checking code from the original. The cursor loop to assemble the index key columns has been replaced with a code sequence using the STUFF function and FOR XML PATH. A similar structure reports on the included columns. This procedure does not replicate the index description field of the original, but rather has a limited description and a separate field for the type code.


    USE master


    CREATE procedure [dbo].[sp_helpindex3]

     @objname nvarchar(776)


    DECLARE @objid int

     , @dbname sysname

     -- Check to see that the object names are local to the current database.

    select @dbname = parsename(@objname,3)

    if @dbname is null

      select @dbname = db_name()

    else if @dbname <> db_name()



      return (1)


    -- Check to see the the table exists and initialize @objid.

     select @objid = object_id(@objname) 

     if @objid is NULL 



      return (1) 



    ;WITH b AS (

      SELECT d.object_id, d.index_id, part = COUNT(*)

      , reserved = 8*SUM(d.reserved_page_count)

      , used = 8*SUM(d.used_page_count )

      , in_row_data = 8*SUM(d.in_row_data_page_count)

      , lob_used = 8*SUM(d.lob_used_page_count)

      , overflow = 8*SUM( d.row_overflow_used_page_count)

      , row_count = SUM(row_count)

      , notcompressed = SUM(CASE data_compression WHEN 0 THEN 1 ELSE 0 END)

      , compressed = SUM(CASE data_compression WHEN 0 THEN 0 ELSE 1 END) -- change to 0 for SQL Server 2005

      FROM sys.dm_db_partition_stats d WITH(NOLOCK)

      INNER JOIN sys.partitions r WITH(NOLOCK) ON r.partition_id = d.partition_id

      GROUP BY d.object_id, d.index_id

    ), j AS (

      SELECT j.object_id, j.index_id, j.key_ordinal, c.column_id,,is_descending_key

      FROM sys.index_columns j

      INNER JOIN sys.columns c ON c.object_id = j.object_id AND c.column_id = j.column_id


    SELECT ISNULL(, '') [index]

    , ISNULL(STUFF(( SELECT ', ' + name + CASE is_descending_key WHEN 1 THEN '-' ELSE '' END

       FROM j WHERE j.object_id = i.object_id AND j.index_id = i.index_id AND j.key_ordinal >0

       ORDER BY j.key_ordinal FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'') ,'') as Keys

    , ISNULL(STUFF(( SELECT ', ' + name

       FROM j WHERE j.object_id = i.object_id AND j.index_id = i.index_id AND j.key_ordinal = 0 

       ORDER BY j.column_id FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'') ,'') as Incl

    , i.index_id

    , CASE WHEN i.is_primary_key = 1 THEN 'PK'

       WHEN i.is_unique_constraint = 1 THEN 'UC'

       WHEN i.is_unique = 1 THEN 'U'

       WHEN i.type = 0 THEN 'heap'

       WHEN i.type = 3 THEN 'X'

       WHEN i.type = 4 THEN 'S'

       ELSE CONVERT(char, i.type) END typ

    , i.data_space_id dsi

    , b.in_row_data in_row , b.overflow ovf , b.lob_used lob

    , b.reserved - b.in_row_data - b.overflow -b.lob_used unu

    , 'ABR' = CASE row_count WHEN 0 THEN 0 ELSE 1024*used/row_count END

    , y.user_seeks, y.user_scans u_scan, y.user_lookups u_look, y.user_updates u_upd

    , b.notcompressed ncm , b.compressed cmp , b.row_count

    , s.rows, s.rows_sampled, s.unfiltered_rows, s.modification_counter mod_ctr, s.steps

    , CONVERT(varchar, s.last_updated,120) updated

    , i.is_disabled dis, i.is_hypothetical hyp, ISNULL(i.filter_definition, '') filt

    FROM sys.objects o

    JOIN sys.indexes i ON i.object_id = o.object_id

    LEFT JOIN b ON b.object_id = i.object_id AND b.index_id = i.index_id

    LEFT JOIN sys.dm_db_index_usage_stats y ON y.object_id = i.object_id AND y.index_id = i.index_id

    AND y.database_id = DB_ID()

    OUTER APPLY sys.dm_db_stats_properties(i.object_id , i.index_id) s

    WHERE o.type IN('U','V')

    AND i.object_id = @objid



    -- Then mark the procedure as a system procedure.

    EXEC sp_ms_marksystemobject 'sp_helpindex3'





    Information from my extended version of index help are space, index usage, and statistics. The DMV/F function dm_db_stats_properties was added in SQL Server (SQL Server 2008 R2 Service Pack 2, SQL Server 2012 Service Pack 1. The function STATS_DATE was added in SQL Server 2008 can be used if dm_db_stats_properties is not supported.

    We could also join to dm_db_index_operational_stats, dm_db_index_physical_stats or dm_db_xtp_index_stats for additional information, but I do not think the extra information is necessary for routine use.

    The above query might be more useful as a view capable of reporting on indexes for all tables and indexed views, but is there a system view option?

    It is too bad more statistics information in DBCC SHOW_STATISTICS is not available in query form.

    I am not sure why color coding is not display, see the same page on my web site sp_helpindex3 for
    We could also do an option to show indexes for all tables, leaving off the object_id =.

  • Join Row Estimation in Query Optimization

    This topic is titled to specifically consider only row estimation after joins, precluding discussion of row estimates at the source table, which has already been addressed in papers covering the new Cardinality Estimator in SQL Server 2014 and other statistics papers for previous versions of SQL Server.

    There are certain situations in which the query compile time can be excessively long even for queries of moderate complexity. This could be true when the plan cost is very high, so that the query optimizer expends more effort to find a better plan before reaching the time out, that is, the time out setting appear to be a function of the plan cost. Even then, the query optimizer could still make row estimation errors in the operations after the source table (for which data distribution statistics are kept on columns and indexes) of sufficient consequence that renders the remainder of the plan un-executable in a practical time frame.

    The new cardinality estimator in SQL Server 2014 is helpful in resolving known issues, but has little to improve row estimates after the initial access at the data source beyond fixed rules which may be more generally true than the rule used before. That said, the query optimizer only attempts to estimate rows (and other cost factors) using a combination of the information it has, and rules for situations where there are no grounds for making a knowledge based estimate.

    So why be bound by this rule of estimating only? The other company (only sometimes in the news concerning databases) has (recently introduced?) adaptive query optimization that can make run-time adjustments to the execution plan. Sure that’s nice, but I am thinking something more sophisticated could done in forming the plan but less complex than the ability to change course in the midst of execution.

    Obviously, if a query has an obviously correct execution plan that can be determined by existing techniques, and the plan cost is fairly low, then no change is necessary. Further special technique should only be considered for expensive queries, particularly in which the row estimates at the intermediate steps are difficult to assess.

    Consider query below
    SELECT bunch of columns
    FROM Nation n
    JOIN Customers c ON c.NationId = n.NationId
    JOIN Orders o ON o.CustumerId = c.CustomerId
    WHERE n.Country LIKE ‘Z%’

    From statistics, we know approximately how many countries have a name beginning with Z. SQL Server also has a histogram for the NationId column in the Customers table. If we had specified the list of NationId values (with equality on to both tables), SQL Server could use the more detailed information from the histogram to make a row estimate. But because we specified the Country name column that is only in the Nation table, we must use the average number of customers per county (in the density vector) multiplied by the number of countries to estimate rows after the join to customers.

    And next of course, all customers are not alike. There are customers who place a small, medium or large number of orders. So why expend a great deal of effort to find a plan from all the possible join orders and index combination based on only estimates of rows when it is known that data distribution in each succeeding table is heavily skewed. Why not pre-execute the tables for which a SARG was specified to get the column values used in the join to next table so that the more detailed histogram distribution information. This technique could be pushed to multiple levels depending on the initially assessed plan cost, and perhaps controlled by a query hint.

    For example, in the above query, suppose NationId’s 19, 37, 42, and 59 meet the criteria Country beginning with Z. The optimizer would next look at the Customers table histogram on NationId for these values to estimate rows after the join. If the situation warrants, the next level could be examined as well.

    It could be argued that the query optimizer should not execute the query to determine the plan, but why follow that principle if the cost of query optimization is excessively high (several seconds) in relation to relatively minor effort to make a more extensive reconnaissance (of tens or hundreds of milli-seconds)? Especially considering that the reliability of row estimates becomes progressively worse after each join or other operation beyond the original source?

    This technique should probably be used when there are tables with search arguments joining to tables on columns with highly skewed distribution. The first implementation might be activated only be a query hint until some maturity is achieved, followed by greater use.

    Presumably there might be a cost threshold as well. I would prefer not to tie it with parallelism. Of course, given the nature of modern systems, it really is time for the cost threshold for parallelism and max degree of parallelism to have graduated controls, instead of the single setting on-off.

    Side note 1

    OK, now forget what I said at the beginning and I will gripe about SQL Server default statistics. It has been discussed else where that SQL Server uses random page samples and not random row samples, as this is a much less expensive way to collect data. It does use an index for which the column is not a lead key if available, to improve randomness. Still, I have notice a severe sensitivity to sampling percentage in cases where the column value is correlated with (page) storage location.

    So I suggest that as the page sampling is in progress, a count of new values found in each successive page sampled versus existing values be kept. If the number of new values found falls off sharply, then most distinct values have probably been found, and the interpretation of the existing sample is that its distribution should be scaled by ratio of total rows to the rows sampled. If almost of the values in the last few pages sampled are new (not previously found), then the interpretation should be that the number of distinct values should be scaled by the total to sample ratio. And some blend when there is an intermediate number of new values versus previously found values in each successive page.


    The query optimization/plan compile process is single threaded. The modern microprocessor might be 3GHz, so a 10 sec compile is 30 Billion cpu-cycles. And I have seen compiles run more than 100 sec? One query even broke SQL Server, of course that was a set of deeply nested, repeating CTE's that should have been PIVOT/UNPIVOT. So why the principle of optimizing based on unreliable estimates when an index seek is a mere 10 micro-sec? Presumably the key column statistics have already been decoded.

    It would be nice to have a more powerful query optimizer, but there is a method for writing SQL to get a specific execution plan, Bushy Joins. Of course the other element in this is known what the correct execution plan is. This involves not what the query optimizer uses for cost model, but what the true cost model is.

  • TPC-H and Columnstore (Update)

    Earlier I had commented on the TPC-H results published in April of this year for SQL Server 2014 using clustered column store storage, noting that two of the 22 TPC-H queries did not perform well in column store. I had speculated on the reason without investigation (I should have learned by now not to do this), that perhaps the cause was that the row store result benefited from date correlation optimization. Thomas suggested otherwise (see below) pointing out that column store has an alternative mechanism of greater general usefulness in the keeping min/max on each columns, along with citing the join to Customers as a more likely explanation, evident in the query plan (which is why one should always provide the plan).

    Thomas Kejser Comments
    I am not sure your theory is correct in the case of Q10. It is noteworthy that the column store requires that the join with CUSTOMER is performance before the sort on revenue. The row store on the other hand can do a loop join (so can the column store, but that is not the plan you get it seem).

    This must mean that the sort buffer is significantly larger for the column store (as reflected in the plan estimates) - which in turn can cause a rather significant memory consumption. It is also noteworthy that the column store does not seem to push the return flag predicate into the storage engine.

    With column storage segments storing the min/max of all values contain in each column, it is unclear if the date correlation provides any benefit that isn't already gain from the segment header.

    Another odd thing about the column store plan of Q10 is that the join of LINEITEM/ORDER is hashed, while the probe happens on CUSTOMER. Unless the predicate on RETURNFLAG is very selective (I don't recall) this is the wrong way around and may cause further spilling

    This is easy enough to test. First strategy is to remove the CUSTOMER and NATION tables from the query, making it a pure Below is the test version of Query 10.

    O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND

    The execution plan for the test query is below with row-store. (Top operation to the left not shown for compactness).


    The execution plan for the test query is below with column-store. (Top operation also not shown).


    The two plans are essentially the same, with the difference being that the column-store plans applies the ReturnFlag as a filter operation instead of as a predicate in the LINEITEM access. I suppose this is because each column is stored separately, or perhaps this is just the way the column-store plan is shown. About 25% of rows in LINEITEM meet the ReturnFlag = R condition.

    On my 1 socket 4 core, HT enabled test system, at Scale Factor 10 (SF10), the SQL Server query execution statistics for the original version of Q10 are:
    Row Store CPU time =  5876 ms, elapsed time =  814ms
    Col Store  CPU time = 10826 ms, elapsed time = 1758ms

    This is somewhat in-line with the 3 official TPC-H reports at SF 1000, 3000 and 10000 (1, 3 and 10TB) compared against different systems and SQL Server 2012 or earlier.

    For just the core ORDER - LINEITEM query
    Row Store CPU time = 5248 ms, elapsed time = 769ms
    Col Store   CPU time = 4030 ms, elapsed time = 565ms

    So it is clear that TK had the correct explanation for the poor column store performance relative to row store in the case of Q10. The counter test for my original suggestion, is to explicitly apply the date range on LINEITEM discovered by the date correlation optimization in row-store, L_SHIPDATE between 1993-09-20 and 1994-06-17. As pointed out earlier, the actual ship date range 1993-10-02 and 1994-05-02. This further improved the Columnstore result to
    Col Store   CPU time = 2686 ms, elapsed time = 403ms

    This is a small improvement over the existing Columnstore min/max feature. My thinking is that the row store date correlation feature is not particularly useful in real world databases with highly irregular date correlation, and that if such date correlation did exist, the analyst should spell it out rather than depend on a database engine feature. I am tempted to speculate that it might be better to partition on join columns instead of date range, but perhaps I should not do so without investigation? unless of course, this prompts someone else to do the investigation.

    Now that we know were the problem occurred in Q10, we can attempt to rewrite the query to avoid the error, as shown below.

    FROM (
     O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
    ) x

    The alternate query improved both the row and column-store query plans in pushing out the join to Customers and Nation to after the Top clause. The row-store plans is:

    tpcE tpcE

    The column-store plan is below.

    tpcE tpcE

    The impact is minimal in the row-store plan because the reduced number of index seeks for Customers from 115 to 20 is small in the overall query. For the column store plan, the performance retains most of the gains achieved in the Order-Lineitem only test.
    Col Store   CPU time = 4218 ms, elapsed time = 598ms

    In my test system, I have Q4 as 3 times faster with column-store over row-store, so I do not know why the published reports have it as comparable or slower.

    columnstore is a very powerful. but the query optimizer is as mature as row store. So pay attention to the query plan.
    this is an update to my previous post on this topic, not about updateable columnstore, which is updateable.

  • TPC-H Benchmarks on SQL Server 2014 with Columnstore

    Three TPC-H benchmark results were published in April of this year at SQL Server 2014 launch, where the new updateable columnstore feature was used. SQL Server 2012 had non-updateable columnstore that required the base table to exist in rowstore form. This was not used in the one published TPC-H benchmark result on SQL Server 2012, which includes two refresh stored procedures, one inserting rows, the second deleting rows. It is possible that the TPC-H rules do not allow a view to union two tables? and perhaps a delete via the partitioning feature? (meaning the delete range must match the partition boundaries). Another possibility is that SQL Server 2012 columnstore was considered to be a multi-column index which is also prohibited to reflect the principle of being ad-hoc queries.


    SQL Server Columnstore

    First a few quick words on SQL Server columnstore. Columnstore is not actually an index of the b-tree index form. The MSDN Columnstore Indexes Described states that columnstore index is "a technology for storing, retrieving and managing data by using a columnar data format, called a columnstore." In SQL Server 2012, it is called a nonclustered columnstore index not because it is nonclustered or an index, but because the base table must exist in traditional rowstore form. In SQL Server 2014, there is a clustered columnstore index not because data is stored in order of the index key, as there is no key, but rather that there is no rowstore table, just the columnstore.


    TPC-H details Date Columns and Query SARGs

    The full details of the TPC-H Decision Support benchmark are described on the website. There are a few details of relevance to the use columnstore. The largest table in TPC-H is LINEITEM, which has 3 date columns, ShipDate, ReceiptDate and CommitDate, and is clustered on ShipDate. The second largest table is ORDERS, clustered on the one date column OrderDate. These two tables are joined on OrderKey. There is correlation between values in the date columns in these two tables, some natural, and others based on reasonable business conditions. ShipDate must be greater than OrderDate obviously, and is also no more than 121 days greater than OrderDate per benchmark specification built into the data generator. CommitDate is between -89 and 91 days of ShipDate. ReceiptDate is between 1 to 30 days after ShipDate. The date values ranges from Jan 1992 to Dec 1998.

    There are 22 Select queries in the TPC-H benchmark, along with the 2 refresh stored procedures. Many of the Select queries specify a date range on one of the date columns or a lower bound one date column and an upper bound on a different column. Ideally, for queries that target rows in a limited date range, we would like to have upper and lower bounds on for the cluster keys on both the ORDERS and LINEITEM tables, OrderDate and ShipDate. However the TPC-H rules do not permit re-writing the query SARGs based on inferable knowledge.

    That said, apparently the rules do not preclude the query optimizer from discovering such knowledge. One of the other RDBMSs was probably first to do this, and Microsoft followed suit in order to be competitive in the TPC-H benchmark with the Date Correlation Optimization feature in 2005. Personally, I am not aware of any production server using this feature. Realistically, any organization that was having query performance issues related to date range bounds would probably have directed the analyst to apply appropriate date bounds on the cluster key. This is most probably a benchmark specific optimization feature.

    The date correlation optimization statistics do not exist when using clustered columnstore, because there is no underlying rowstore table with foreign key relations? The date correlation statistics do exist when using rowstore tables with foreign keys and are used by nonclustered columnstore indexes?


    TPC-H on SQL Server 2014 with Columnstore

    That said, let us now look at the 3 new SQL Server 2014 TPC-H results published making use of the new clustered columnstore indexes. One is from IBM at Scale Factor 1000 (1TB) and two from HP at 3TB and 10TB respectively. The new results are compared to prior results with traditional rowstore on previous versions of SQL Server and previous generation server systems.

    Because Columnstore is not really an index in the b-tree sense, given that queries frequently involve date ranges, it is presumed to be important to use partitioning with Columnstore. The three new TPC-H reports on SQL Server 2014 partition both ORDERS and LINEITEM by OrderDate and ShipDate respectively (the cluster key in previous versions) with a partition interval of 1 week (7 years x 52 weeks per year = 364 partitions). Perhaps of interest, the scripts show that a rowstore partitioned index is first build before building the partitioned clustered columnstore index.


    TPC-H at SF 1000 (1TB)

    The new TPC-H 1TB result on SQL Server 2014 using columnstore is compared with 3 previous results on SQL Server 2008 R2. There is a difference in memory configurations between the four systems below. For SQL Server 2014 with clustered columnstore indexes, the TPC-H SF1000 total database size is just under 430GB, so with 1TB memory, the benchmark is running entirely in memory after the initial data load, with the exception of hash operation spills to tempdb.

    x3850 X6
    E7-4890 v2
    DL980 G7
    x3850 X5
    UCS C460

    For rowstore, the TPC-H SF1000 total database size is 1420GB, so the two systems with 2TB memory are mostly running with data in memory, again except for the initial load and spills to tempdb. There is definitely disk IO for data in the Cisco system at 1TB physical memory. The performance impact is noticeable, but probably not as severe as one might think based on how people talk of memory. The reason is that all of these systems make correct use of massively parallel IO channels to storage capable of 10GB/s plus table scans, and many of these also use SSD storage capable of more random IOPS than SQL Server can consume even at very high parallelism.

    The new SQL Server 2014 result is 2.36X higher on composite score (QphH) and 3X higher in the Power test than previous versions with conventional rowstore.

    The 22 individual query run times from the Power test at 1TB are shown below. The SQL Server 2014 result is the left item in the legend label 4x60 (sockets/cores) and for the succeeding charts as well.


    Query 1, a single table aggregation, is more than 10 times faster on SQL Server 2014 using columnstore on 60 cores (Ivy-Bridge, 2.8GHz) than 2008R2 using rowstore, on 80 cores(Westmere-EX). Per TPC-H benchmark procedure, the test is run immediately after data load and index creation? The second largest speed-up is Query 16, joining 3 tables, at 6.4X.

    Query 10 is 40% slower with columnstore, and Query 4 about the same between columnstore and conventional. This query is listed near the end of this section on TPC-H.

    Notice that in the 3 SQL Server 2008R2 results, Query 2 becomes slower as the degree of parallelism increases from 40 cores (threads unspecified) to 80 cores/80 threads and then to 80 cores/160 threads. Elsewhere I had commented that SQL Server really needs a graduated approach to parallelism instead of the all or nothing approach.

    TPC-H at SF 3000 (3TB)

    The new TPC-H 3TB result on SQL Server 2014 is compared with a previous result on SQL Server 2008R2. Here, the difference in memory is a significant contributor. The SQL Server 2014 system has more memory than the columnstore database, while 2008R2 systems has much less memory than the 3TB rowstore database (4.5TB).

    DL580 G8
    E7-4890 v2
    DL980 G7

    I am supposing that the reason the HP 2010 report only configured 512GB (128 x 4GB priced $29,440 in 2010) was that there would not be a significant performance improvement for the TPC-H 3TB result at either 1TB or even 2TB memory in relation to the higher price (128 x 16GB priced $115K in 2011).

    Several of the TPC-H queries involve nearly full table scans of the large tables. If there is not sufficient memory for the entire database, then the next objective is to have sufficient memory for reducing the spill to disk in hash operations? HP may have elected for the better price-performance? Or perhaps someone just wanted to make a point. The point being that it is important for the SQL Server engine to function correctly when heavy IO is required.

    In the SQL Server 2014 result, the system has 3TB memory (96 x 32GB priced $96K in 2014) which is sufficient to hold the entire data set for TPC-H 3TB in columnstore.

    The overall composite score is 2.8X higher with columnstore and 3.4X higher on the Power test.

    The 22 individual query run times from the Power test at 3TB are shown below.


    The largest gain with column-store is Query 19 at 19.7X. Query 4 and 10 show degradation, similar to the case at SF1000.

    TPC-H at SF 10000 (10TB)

    The new TPC-H 10TB result on SQL Server 2014 is compared with a previous result on SQL Server 2012. Strangely, supporting documentation for the HP 2013 report on SQL Server 2012 is missing so there is no indication as to whether nonclustered columnstore is used? I am guessing that columnstore was not used because the results are in line with expectations on rowstore.

    DL580 G8
    E7-4890 v2
    3072 2014404,005631,309337,8594/15/14
    DL980 G7
    4096 2012158,108185,297142,6856/21/11

    The full data size in columnstore at SF 10000 should be 5TB, and 14TB in rowstore, so there should have been heavy disk IO in both results.

    The overall composite score is 2.55X higher with column-store and 3.1X higher on the Power test.

    The 22 individual query run times from the Power test at 10TB are shown below.


    The largest gain with column-store is Query 6 at 23.2X. Query 4 and 10 show degradation as in the two previous cases.


    TPC-H Query 10

    Below is Query 10. This query is consistently slower in columnstore relative to rowstore.


    /* TPC_H Query 10 - Returned Item Reporting */

    O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND

    The execution plan for rowstore at SF 10 is shown below.

    (Click for full-size)

    The execution plan for columnstore at SF 10 is shown below.

    (Click for full-size)

    Below are the details on ORDERS and LINEITEM from the rowstore plan.

    tpcE tpcE

    Notice that there are seek predicates on LINEITEM for 1993-09-20 to 1994-06-17. The actual range should be 1993-10-02 to 1994-05-02, for 1 day after the OrderDate lower bound and 121 days after the OrderDate upper bound.

    Below are the details on ORDERS and LINEITEM from the columnstore plan.

    tpcE tpcE

    In columnstore, every operation is a scan. There is a predicate for the ORDERS table but not on the LINEITEM table. Presumably storage engine must scan entire set of LINEITEM partitions while only scanning the ORDERS partitions encompassing the SARG date range

    I am thinking the reason is that with date correlation in conventional row-storage, the SQL Server query optimizer knows that the data range in LINEITEM ShipDate is also restricted by the lower OrderDate and the upper OrderDate plus 121 days, corresponding to 1 day after the lower bound on OrderDate to 121 days after the upper bound on OrderDate.

    TPC-H Query 4

    TPC-H Query 4 below is slower than row storage in the 3 and 10TB results. I am thinking that the reason is the same?


    /* TPC_H Query 4 - Order Priority Checking */

    WHERE O_ORDERDATE >= '1993-07-01' AND O_ORDERDATE < dateadd(mm,3, cast('1993-07-01' as date))


    See TPCH Query Plans for the TPC-H reference queries and execution plans at SF1 on SQL Server 2005. The parent page TPCH Interim has links for the SF1000 query plans with and without parallelism.

    TPC-H Columnstore Summary

    As with every other new feature, Columnstore is a really interesting new technology. But think hard about what is really happening, experiment, and remember to get good execution statistics and plans prior to making changes, then get the new execution statistics and plans after the change.

    One reason I like to look at official TPC-H benchmark results over "real-world" is that the benchmark system is properly configured for both before and after results. There is a significant difference in the data size involved for each query between rowstore and columnstore. If the reference system has a poor storage system (and how often have we seen this? this is guaranteed when the SAN vendor assisted in configuration), then it is possible to produce almost any performance ratio.



    The charts below show the progression of performance over time for the selected TPC-E results spanning Core 2, Nehalem, Sandy Bridge and Ivy processors at 2, 4 and 8 sockets.


    For the 2-socket systems, West-1 is from the first set of TPC-E results reported for Westmere X5680 with HDD storage and West-2 is the later X5690 report with SSD storage. Both are 6-core Westmere-EP processors. The West-3 is the E7-2870 10-core (Westmere-EX) on SSD storage.

    For the 4-socket systems, West-1 is on HDD storage, and West-2 on SSD, both 2K8R2 and 1TB memory. The West-3 is on Win/SQL 2012, 2TB memory and SSD storage.

    The same data is shown below with reverse organization showing scaling with sockets for each of the processor architectures.


    Notes: Nehalem
    2-socket 4-core, 2.93GHz (11.72 core x GHz), 4 & 8-socket 8-core 2.26GHz (18.08 core x GHz)
    2-socket 6-core, 3.46GHz (20.76 core x GHz/socket), 4 & 8-socket is 10-core 2.4GHz (24 core x GHz)
    Sandy Bridge
    2-socket 8-core, 2.9GHz (23.2 core x GHz/socket), 4-socket is 8-core 2.7GHz (21.6 core x GHz)
    Ivy Bridge
    2-socket 12-core, 2.7GHz (32.4 core x GHz/socket), 4 & 8-socket is 15-core 2.8GHz (42 core x GHz)

  • Detecting Hyper-Threading state

    To interpret performance counters and execution statistics correctly, it is necessary to know state of Hyper-Threading (on or off). In principle, at low overall CPU utilization, for non-parallel execution plans, it should not matter whether HT is enabled or not. Of course, DBA life is never that simple (see my other blogs on HT). The state of HT does matter at high overall utilization and in parallel execution plans depending on the DOP. SQL Server does seem to try to allocate threads on distinct physical cores at intermediate DOP (DOP less than or equal to the number of physical cores).

    Suppose for example, that maximum throughput on 10 physical cores is 10000 call/s with HT off and 14000 with HT on (overall CPU near 100%). Then the average CPU (worker time) per call is 1ms with HT off, and 1.43 ms with HT on as there are twice as much available worker time with HT on.

    In a very well tuned OLTP system, we might have very steady average CPU per call as call volume increases from low overall CPU utilization to near saturation with HT off. With HT on, at low overall CPU, the average CPU per call is the same as with HT off, but average CPU per call increases at some point when there is sharing of physical cores between concurrently running queries. But this is still good because system wide throughput capability has increased.

    In a not well tuned database application, there could be contention between concurrent queries that causes average CPU per call to increase as overall system CPU load increases. Without knowing the state of HT, it is hard to make the assessment as to which situation has occurred.

    If we have direct sysadmin access to the OS, we could make calls (via WMI) to determine the processor model number, the total number of sockets, the total number of logical processors, then determine with a lookup table matching processor model to the number of physical cores. (again, a pain)
    (per LondonDBA, WMI Win32_Processor does report the number of sockets, physical cores, and logical, I must have been thinking of an older API that was not HT aware or even multi-core aware)
    But we do not always have sysadmin access to the host OS, as many organization believe separation of DBA, infrastructure (not to mention storage) is a good thing, and even better when these groups do not communicate with each other, (let alone) working together with a common mission.

    SQL Server version 2005 was helpful in the DMV sys.dm_os_sys_info
    which had two columns: cpu_count and hyperthread_ratio
    defined in version 2005 as:
    1) "Number of logical CPUs on the system."
    2) "Ratio of the number of logical and physical processors."

    In version 2008, RTM and R2, the definition of hyperthread_ratio was changed to:
    "Ratio of the number of logical or physical cores that are exposed by one physical processor package."

    in 2012 and 2014, slightly different wording but same meaning:
    "Specifies the ratio of the number of logical or physical cores that are exposed by one physical processor package."

    why the change in definition?
    There are actually 3 pieces of information we are interested in:
    a) the number of sockets
    b) the number of physical cores per socket
    c) the state of HT (or the logical processors per socket)

    In 2005, we have information to determine the product of A and B, and the value of C, but not atomic values of A and B,
    In the 2008 and later version, we can determine A (using the ratio) and the composite product of A x B x C, but not atomic values of B and C.

    Ok, then I noticed that in SQL Server version 2012 and 14, there would be a line in the log of the form:
    "SQL Server detected 4 sockets with 10 cores per socket and 10 logical processors per socket, 40 total logical processors; using 40 logical processors based on SQL Server licensing."

    So from this, I have A, B and C, even though I must parse the error log for this info. In version 2008R, the info is only: "Detected 40 CPUs."
    which has no additional information to what is in the DMV. It would be helpful if this information were available directly from the DMV, but then life might be easier?


    EXEC sys.xp_readerrorlog 0, 1, "detected", "socket"

  • Top Clause and Other Factors in Problematic Execution Plans

    Three years ago, I conducted an extensive investigation on a SQL Server system running kCura's Relativity document e-discovery application. It was fascinating to see the broad range of problematic queries all from one application. This provided good material for my presentation Modern Performance which focuses on the more spectacular problems that can occur with a cost based query optimizer.

    (I am not sure why the images are not showing up. See the alternate link Top Clause)

    It should be pointed that unlike a typical application in which the key queries can be rigorously tuned, Relativity must generate the SQL from options in the UI. The objective of Relativity is to support complicated searches, with heavy emphasis on queries of the form: IN set A but NOT IN set B. This causes problems in row estimation because there is no generally valid logic to assess whether there is overlap between Set A and Set B even if row estimates on the individual sets are possible.

    If this were not difficult enough, there can also be nested AND/OR combinations. SQL Server has difficulty generating good execution plans even for a single AND/OR combination. It is unfortunate that the most direct conversion of natural (user-oriented) logic to a SQL expression (that Relativity employs) just happens to be the form for which the SQL Server query optimizer has difficulties in producing a good execution plans.

    Relativity 8.1

    This article is based on a brief observation of Relativity version 8.1. The first Relativity article was originally based on version 7.3-7.5, and later updated with observations of version 8.0.

    Relativity Architecture

    The architecture of Relativity first employs a SELECT COUNT query to determine the number of rows that meet a specific search query. A second call is then made in the form of a SELECT TOP 1000 query to retrieve just the document identity column (ArtifactID) values. There should be a third query that retrieves all the desired columns for specific ArtifactID values, but this query is very low cost and does not warrant discussion here. If more than 1000 rows are needed, then the next query will be TOP 6000.

    One observation is that even when the COUNT query indicates that there are fewer than 1000 rows, the ArtifactID query is still issued with the TOP 1000 clause. In earlier versions of Relativity (7.x), the TOP 1000 query was issued even when the COUNT query indicated zero rows. Perhaps an oversight by the developers in not realizing zero rows are returned when COUNT is zero.

    One aspect of Relativity architecture that is correct is the not enabling plan reuse. The expectation is that individual searches probably have high CPU for execution than for compile and that each parameter set could have skewed distribution. However, in issuing explicit SQL, there is also no opportunity to fix problematic SQL. Hence it is important the Relativity anticipates as many problems as possible and employ good strategies for know issues. We could wish for such, and we would still be wishing.

    Problematic SQL in Relativity 8.1

    There are several difficulties in Relativity queries, three of which are covered here. One is due to the SELECT COUNT followed by TOP 1000 architecture. The second is due to the use of the direct translation of natural (user) logic to SQL rather than a form of SQL for which the SQL Server query optimizer happens to produce very good execution plans. The third is in parallelism strategy. The data in Relativity is expected to have heavily skewed distribution so is necessary to watch for situations where this produces ineffective parallel execution plans.

    Relativity Query Example

    Below is the COUNT form of an example search query to be studied in greater detail.

    SELECT COUNT( [Document].[ArtifactID])
    FROM eddsdbo.[Document] WITH(NOLOCK)
    LEFT JOIN eddsdbo.[Custodian] AS [o1000022_f2154932] (NOLOCK)
    ON [o1000022_f2154932].[ArtifactID] = [Document].[Custodian]
    WHERE [Document].[AccessControlListID_D] IN (1,1000062,1000063,...)
      [o1000022_f2154932].[ArtifactID] IN (22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...)
     OR (EXISTS(
      SELECT [f2154930f2154931].[f2154931ArtifactID]
      FROM eddsdbo.[f2154930f2154931] (NOLOCK)
      LEFT JOIN eddsdbo.[Custodian] (NOLOCK)
      ON [Custodian].[ArtifactID] = [f2154930f2154931].[f2154931ArtifactID] 
      WHERE [f2154930f2154931].[f2154930ArtifactID] = [Document].[ArtifactID]
      AND [Custodian].[ArtifactID] IS NOT NULL
      AND ([Custodian].[ArtifactID] IN (22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...))

    The actual number of values in the AccessControlListID_D IN clause is 113. The actual number of values in each of the Custodian.ArtifactID IN clauses is somewhat over 100, with both sets being identical.

    Below is subsequent TOP 1000 query that is issued after the COUNT query. In the 7.x versions, the TOP 1000 clause is present regardless of the number of rows indicated by the COUNT query, and is in fact issued even if COUNT reports zero rows. It is also very possible that either one of the COUNT and TOP queries or both are very expensive regardless of the actual number of rows.

    SELECT TOP 1000 [Document].[ArtifactID]
    FROM eddsdbo.[Document] WITH(NOLOCK)
    LEFT JOIN eddsdbo.[Custodian] AS [o1000022_f2154932] (NOLOCK)
    ON [o1000022_f2154932].[ArtifactID] = [Document].[Custodian]
    WHERE [Document].[AccessControlListID_D] IN (1,1000062,1000063,...)
      [o1000022_f2154932].[ArtifactID] IN (22552945,22552935, 22552925, 22552915, 22552905, 14768277,..)
     OR (EXISTS(
      SELECT [f2154930f2154931].[f2154931ArtifactID]
      FROM eddsdbo.[f2154930f2154931] (NOLOCK)
      LEFT JOIN eddsdbo.[Custodian] (NOLOCK)
      ON [Custodian].[ArtifactID] = [f2154930f2154931].[f2154931ArtifactID] 
      WHERE [f2154930f2154931].[f2154930ArtifactID] = [Document].[ArtifactID]
      AND  [Custodian].[ArtifactID] IS NOT NULL
      AND ([Custodian].[ArtifactID] IN (22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...))
    ORDER BY [Document].[ArtifactID]

    Alternative Queries

    For the above search with the actual data set, there are in fact 232 rows. Three alternatives queries are examined here. One is the above query without the TOP clause. The second alternative is the above query (including the TOP clause) but with an index hint applied. The third alternative tested is the form below, with the OR clause replaced by a UNION.

    SELECT [Document].[ArtifactID]
    FROM eddsdbo.[Document] WITH(NOLOCK)
    LEFT JOIN eddsdbo.[Custodian] AS [o1000022_f2154932] (NOLOCK)
    ON [o1000022_f2154932].[ArtifactID] = [Document].[Custodian]
    WHERE [Document].[AccessControlListID_D] IN (1,1000062,1000063)
      [o1000022_f2154932].[ArtifactID] IN (22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...)
    SELECT  [Document].[ArtifactID]
    FROM eddsdbo.[Document] WITH(NOLOCK)
    LEFT JOIN eddsdbo.[Custodian] AS [o1000022_f2154932] (NOLOCK)
    ON [o1000022_f2154932].[ArtifactID] = [Document].[Custodian]
    WHERE [Document].[AccessControlListID_D] IN (1,1000062,1000063,...)
      SELECT [f2154930f2154931].[f2154931ArtifactID]
      FROM eddsdbo.[f2154930f2154931] (NOLOCK)
      LEFT JOIN eddsdbo.[Custodian] (NOLOCK)
      ON [Custodian].[ArtifactID] = [f2154930f2154931].[f2154931ArtifactID] 
      WHERE [f2154930f2154931].[f2154930ArtifactID] = [Document].[ArtifactID]
      AND  [Custodian].[ArtifactID] IS NOT NULL
      AND ([Custodian].[ArtifactID] IN (22552945, 22552935, 22552925, 22552915, 22552905, 14768277,...))
    ORDER BY [Document].[ArtifactID]

    For this particular case, the UNION will produce exactly the same set of rows as the original query because only the Document table ArtifactID column is in the select list and this column is the primary key of the Document table. In the more general case with columns from more than one table in the SELECT list, the UNION form could have a different row set than the original OR form. The architecture of Relativity is such that the OR forms can be converted to UNION while maintaining correct results.

    Earlier, it was mentioned that the SQL Server query optimizer can have problems in producing a good execution plan for queries with a combination of AND and OR clauses. The form shown above is set A OR set B, with A having an AND clause. The other form of this is set A AND set B, with either A or B having an OR clause. The second form could probably be handled by converting the AND to an INNER JOIN, which would require both conditions to be true, but a full study has not been done on this.

    Relativity Query Execution Plans

    Now consider the estimated execution plans for the 5 queries: 1) COUNT, 2) TOP, 3) no TOP, 4) index hint, and 5) UNION.

    The SELECT COUNT estimated execution plan is shown below (full plan).


    The SELECT TOP 1000 estimated execution plan (full plan).

    Note top right operation in the above execution plan for the TOP 1000 query. The operation is an Index Scan on index EV_ArtifactID. The index lead key is ArtifactID, and has all the columns required for this query.

    Below left is the Index Scan detail. Notice that the Estimated I/O cost is 85, corresponding to approximately 892MB (I/O cost 1 = 1350 pages or 10.5MB). Below right is the COUNT query Nested Loops operation at the top left of the plan.

    Below are the left most operator showing total plan cost for the COUNT and TOP 1000 queries respectively.


    The TOP 1000 plan cost is 0.31269 even though the Index Scan on Document table index EV_ArtifactID has an I/O of 85.38. This is because the execution plan is expecting the first 1111 rows from Documents to be sufficient to meet the total required 1000 rows. The plan cost for the COUNT query is much higher at 519 as all rows must be evaluated to test the match to the search arguments. The Document Index Scan in the COUNT query uses the IX_AccessControlListID_D index because this index is narrow with cost of 35.46 (272MB)

    Below is the estimated plan for the SELECT query without the TOP clause (full plan)


    The estimated row count is 11,569,800, exactly the same as in the COUNT query at the Nested Loops operation just before the rows are aggregated into a COUNT value.

    There are 12.8M rows in the Document table, so the query optimizer believes that about 90% of row meet the search argument (with OR clause). The execution plan is the same as the COUNT query except that the return list requires a sort operation contributing to the higher overall plan cost of 720.

    Below is the estimated plan for the TOP query but with an index hint applied (full plan)

    The use of the index hint results in the query optimizer not changing the join order. Hence when using hints, it is also necessary to write the query in the form with the desired join order. The hinted index lead key does not match in the ORDER BY clause, so the entire index is scanned and then sorted. The plan cost 197 is more than the original TOP query, but less than the no TOP query. The reason is that the query optimizer believes it can exit the execution plan once the TOP 1000 rows have been found.

    Below is the estimated execution plan for the UNION query. (full plan)

    The Estimated Number of Rows is 3.2M and the plan cost is 92.8.

    Curiously, the number of rows from each of the two sub-queries is much less at 23,764 and 923,036.

    One would think that the UNION of the two result sets should at least equal to the larger and could be as high as the sum.

    Relativity Query Execution Times

    Below are the execution time in milli-seconds for both CPU and elapsed times of each of the 5 queries. The first is the COUNT query, followed by the Relativity standard TOP 1000 query. Next are the 3 variations considered: without the TOP clause, with an Index Hint and finally a UNION structure instead of OR in the WHERE clause.

    QueryCPU timeelapsed time
    TOP 1000152148152298
    no TOP14397371857
    index hint16767122016

    The standard Relativity TOP query does not have a parallel execution plan as the plan cost (0.31269) is below the cost threshold for parallelism. Even though some of the operations in the plan show a cost greater than 0.3, the Top clause believes that the query can terminate after few rows from Documents (estimate 1111) have been examined.

    In this example, the WHERE clause arguments are highly selective, and fewer than 1000 rows meet the conditions. So what happens is that the full set of 12.8M rows from the right most loop join outer source must be processed, as the Top clause "exit" criteria is never reached. Furthermore, the execution in not parallel, because it is believed to be a low cost plan.

    Note the fat arrows in the SELECT TOP 1000 actual execution plan (full plan) below.

    There is not a significant difference in CPU times between the COUNT, regular TOP, no TOP, and TOP + index hint query plans. The differences are mostly in the elapsed time. The regular TOP query has the longest elapsed time being a non-parallel plan as explained above. The COUNT and no TOP queries have elapsed times approximately one-half of the CPU time, corresponding to a 2X speed-up with parallelism even though the actual degree of parallelism was 8 (on separate physical cores).

    Parallel Execution Plans

    There are two critical factors/challenges in parallel execution plans. One is to divide the work evenly among multiple threads (each running on different cores). The second is to have sufficiently large granularity of work on each thread before some form of synchronization is required. My recollection was that SQL Server 2000 employed a method that ensured even distribution by having each thread process one page at a time(?) This was fine in the Pentium II/III 500MHz generation, but was far too small by the NetBurst and Core2 architecture processors.

    SQL Server 2005 employed a different methods with the strategy of allowing larger granularity and reduced need for synchronization between threads(?) but could also result in having highly uneven distribution of work between threads. Some of this was reported to have been fixed in SQL Server 2008 sp1, see Using Star Join and Few-Outer-Row Optimizations to Improve Data Warehousing Queries but apparently this is still an issue in SQL Server execution plans?

    In the COUNT and no TOP queries, the Constant Scan operation acts as the outer source in a loop join to Document. This is the artificial rowset from the IN clause on AccessControlListID_D. It is known to SQL Server from statistics that this column has highly skewed distribution resulting in the row-thread split shown below.

    The distribution on the other joins are also skewed but not as strongly?

    The query with the TOP clause and an index hint just happens to prevent the SQL Server query optimizer from using a Constant Scan as a loop join outer source. The index hint does nothing to improve the CPU efficiency of this query. In fact it is more than 10% less efficient in terms of CPU. The positive effect is that work is evenly distributed among threads as shown below resulting in nearly linear scaling with parallelism (7.6 to 1).

    Technically the problem here is not all few outer rows, but just uneven few outer rows? Microsoft seems to be aware of the problem and has fixed some aspects? but apparently now all. Perhaps the SQL Server engine could implement a flag or hint to avoid execution plans that are sensitive uneven distributions? For now, the only work-around is to look for this situation and take appropriate action.

    UNION in place of AND/OR Combination

    The execution plan that has outstanding results in terms of CPU efficiency is the UNION replacing the OR clause. It was previously documented that the SQL Server query optimizer has difficulty in generating a good execution plan for queries with a combination of AND/OR conditions, but has no problems when an alternative structure is employed.

    There are other forms for which the SQL Server query optimizer does not produce good execution plans. Once these can be catalogued with appropriate alternative SQL expression strategies, there are probably not any searches that cannot be handled.

    COUNT + TOP Architecture

    This actually encompasses several sub-topics. Presumably the purpose of this architecture is to know the exact number of documents that match a search, but also so as to not over-whelm the application server with data. But we should consider that 1) the query with a rowset only has an integer column, 2) even a large case should have no more than tens of millions of documents (a small number in the modern world) and 3) the application server today has many gigabytes of memory. So this is not absolutely necessary?

    There is also a client/application-side work around for this. Simply issue the query to return the identity column for all rows in the search, load the first X into an array, then continue to read but not store the remaining rows.

    Consider the COUNT + TOP architecture. There are 4 possibilities. One is that both queries are inexpensive in which case this is not important. Second is that the COUNT query is expensive but not the TOP. This could happen when many rows match the search, and the TOP clause allows the query to exit quickly.

    Third is that the COUNT is not expensive but the TOP is. This happens when the query optimizer estimates many rows but in fact there are few rows. In this case, an appropriate high-volume parallel plan is employed for the COUNT query, but a non-parallel plan is used for the TOP query relying on the expectation of exiting quickly. The exit condition is never met, and the non-parallel plan must process the full set of rows with only a single thread. Consider also that this non-parallel plan was formulated based on low start-up overhead (loop joins) rather than volume efficiency (hash joins).

    The fourth possibilities is that both the COUNT and TOP queries are inherently expensive. In this case, we now have to execute two expensive queries all so that the developer can avoid a few lines of code on the application side? (Many SQL/database disasters have been traced to lazy/incompetent coders.)


    As I said in the first Relativity article, all of the database problems appear to be solvable, but most require action in the application code where the SQL is generated, including the architectural strategy of COUNT + TOP. Some problems from Relativity 7.x appear to have been resolved, such as the data type mismatch between a varchar column and nvarchar search parameter. This by itself was a nuisance, but when combined with the TOP clause had significant negative consequence.

    The full set of details along with a test database and queries to both reproduce and fix the problems in Relativity 7.x were sent to kCura. Most of the advice seems to have been disregarded.

    It would seem that kCura was aware that there were problems. In version 7.x, there was a CodeArtifact table (3 columns CodeTypeID, CodeArtifactID, AssociatedArtifactID) that was frequently involve in search queries that would take forever, as in 30min (or whatever the web page time-out is) to 30 months (estimated) to complete. It is possible that the long running read queries and write activity also resulted in blocking and deadlocks despite prolific use of NOLOCK (without the WITH keyword?).

    For version 8.x kCura went to great effort to split this table into multiple tables with names of the form ZCodeArtifact_xxx, one for each of the group values (xxx)? The new tables have columns CodeArtifactID and AssociatedArtifactID, so perhaps there is one table for each CodeTypeID in the version 7.x table? (This topic is covered in ZCodeArtifact & Statistics, as there were a series of performance problems in queries to these tables related to statistics.)

    The problem was that blocking and deadlocks were the symptoms of Relativity problems, not the cause. The causes were the topics discussed here: 1) injudicious use of the TOP clause in situations in which the query optimizer makes a serious error in the estimated number of rows when the actual row count is already known from the COUNT query, 2) generating a complicated SQL expression with combination AND/OR clauses instead of JOIN and UNION, and 3) ineffective parallel execution plans due of skewed distribution. One more point, kCura also went to the effort of changing the SQL a from single expression form to one using CTEs. This may contribute to the clarity of the expression, but does not impact the execution plan problems.

    Another problem seen in 8.1 is a search involving both conventional search arguments and Full-Text elements. The execution plan had the Full-Text Search (FTS) operation as a loop join inner source (see FTS), probably because the estimate number of rows from the outer source showed 1 (which could mean 0). This might have been because the ZCodeArtifact_xxx table was newly generated and statistics were not updated? The query produced a good execution plan after statistics were updated and had quick actual execution time as well.


    (232 row(s) affected) SQL Server Execution Times: CPU time = 152148 ms, elapsed time = 152298 ms. TOP 1000
    (232 row(s) affected) SQL Server Execution Times: CPU time = 143973 ms, elapsed time = 71857 ms. no TOP
    (232 row(s) affected) SQL Server Execution Times: CPU time = 167671 ms, elapsed time = 22016 ms. Index hint
    SQL Server Execution Times: CPU time = 1665 ms, elapsed time = 324 ms. UNION
    SQL Server Execution Times: CPU time = 138279 ms, elapsed time = 69109 ms. COUNT

    The full list of Document.AccessControlListID_D and o1000022_f2154932.ArtifactID values are below
    /* [AccessControlListID_D]
    1,1000062,1000063,1000064,1000065,1000066,1000067,1000099,1000100,1000101,1000102,1000103,1000104,1000105, 1000106,1000107,1000108,1000109,1000110,1000111,1000112,1000113,1000114,1000115,1000116,1000117,1000118, 1000119,1000120,1000121,1000122,1000123,1000124,1000125,1000126,1000127,1000128,1000129,1000130,1000131, 1000132,1000133,1000134,1000135,1000136,1000137,1000138,1000139,1000140,1000141,1000142,1000143,1000144, 1000145,1000146,1000147,1000148,1000149,1000150,1000151,1000152,1000153,1000154,1000155,1000156,1000157, 1000158,1000159,1000160,1000161,1000162,1000163,1000164,1000165,1000166,1000167,1000168,1000169,1000170, 1000171,1000172,1000173,1000174,1000175,1000176,1000177,1000178,1000179,1000180,1000181,1000182,1000183, 1000184,1000185,1000186,1000187,1000193,1000206,1000240,1000244,1000245,1000246,1000259,1000288,1000315, 1000483,1000535,1000536,1000550,1000599,1000617,1000763,1000770

    22552945,22552935,22552925,22552915,22552905,14768277,14768265,5375743,5375742,5375741,5375740,22552947, 22552937,22552927,22552917,22552907,22552894,22552884,22552933,22552923,22552913,22552903,22552890, 22552880,14768269,5375739,5375738,22552951,22552941,22552931,22552921,22552911,22552901,22552896, 22552886,14768275,22552892,22552882,22552939,22552929,22552919,22552909,14768273,14768267,22552949, 28315630,5549264,22552953,22552898,22552888,22552878,22552954,22552944,22552934,22552924,22552914, 22552904,22552950,22552940,22552930,22552920,22552910,22552900,14768271,22552948,22552938,22552928, 22552918,22552908,22552952,22552942,22552946,22552936,22552926,22552916,22552906,22552895,22552885, 22552932,22552922,22552912,22552902,22552891,22552881,14768268,22552943,22552899,22552889,22552879, 14768270,5549202,22552897,22552887,14768274,14768272,22552893,22552883,14768266,14768276,14768264

    Note: a Test database was created with just these tables and limited columns for in-depth investigation.

  • Bushy Joins

    A great session by Adam Machanic at SQL Saturday Boston the previous weekend on methods to influence the query optimizer while still letting it do its task. The gist of this is that while SQL Server has what are called Query Hints, there are adverse consequences. The Join Hints (Loop, Hash and Merge) "specify that the query optimizer enforce a join strategy between two tables," but also results in the query optimizer not bothering to investigate the different join orders, even though only the join type was specified. Hence the general advice is that one should not use the SQL Server Query/Join Hints unless one is prepared to completely override the query optimizer, which is essentially to say, one should almost never use join hints. Microsoft's advice is: "we recommend that hints, including , be used only as a last resort by experienced developers and database administrators." Adams' session investigated an alternative method of providing advice to the query optimizer without causing it to otherwise shutdown.

    Now that we have said that the Loop, Hash and Merge Join Hints should almost never be used, and without recommending the use of hints, consider the question of how to use hints in the case of a last resort situation. Given the fact that the query optimizer disables join order optimization when hints are applied, the task is to reconstruct a good join order. It is explained elsewhere the general preference regarding join order. See either my articles on on the Query Optimizer (mostly I just examine the formulas, without bothering on the explanation), articles by Paul White, Benjamin Nevarez and others. Here will only examine the technique of join ordering.

    In a two table join, there is only one shape, one table as the outer source in the upper right of the execution plan and the second table as the inner source in the lower left of the execution plan as in the diagram below.


    We can reverse the order of the join, or change the type of join, but there is only one shape.

    In a three table join, there are two possible shapes. One is linear: the first table is the outer source, joins to the second table (inner source), and then the output of this is the outer source for the final join with the third table as the inner source.


    The second possible shape is that one table is the outer source in one join to another table. The output of this is now the inner source in the other join with the third table as the outer source.


    From these two basic shapes, we can assemble almost any possible execution plan (sorry, but I do not have examples with the spool operation, if any one would like to comment on these).

    Until a few years ago, I had always been under the impression that it was necessary to write out the full sub-query expression in order to force a bushy join, example below.

    SELECT xx
    FROM A
    JOIN (
      SELECT xx
      FROM B
      JOIN C ON xx
    ) ON xx

    The both join shape and order are forced with either a join hint or the OPTION (FORCE ORDER) clause. In a complex query with a long SELECT list, this style of expresssion quickly becomes cumbersome. Then one day, I needed to relax, so I read one of Itzik Ben-Gan's books and saw a style of SQL expression on joins that I had never seen before.

    SELECT xx
    FROM A
    JOIN (
      B JOIN C ON xx
    ) ON xx

    There is no SELECT in the sub-expression!

    My heart skipped a beat.

    What would be the execution plan join shape be if there were a join hint or force order hint on this expression?

    Below is an SQL query example from Adam's session.


    The execution plan for this query is below. Note that the join order is different than in the SQL.


    If we forced a hash join, we would get the linear plan below.


    Note that the join order is the same as in the SQL.

    We could write the SQL in the form below.


    But without a hint, the execution plan is the same as the original (natural) plan.

    Now if we were to force the join order in the new SQL, as below


    we do indeed get the bush shape with the join type.


    We now have the basic techniques for writing SQL with the objective of forcing a particular join shape and order, to which we could apply join hints that also override much of the query optimizer.

    Again, this is not an endorsement of using join hints. Do not use join hints without understanding that it has the effect of overriding the query optimizer on join ordering, and the implications. I do not accept any consequences on the use of join hints unless I was the consultant engaged. OK, so I just gave you a loaded gun while saying don't blame me for its improper use.

    Search Microsoft Technet for the terms Advanced Query Tuning Concepts, Understanding Nested Loops Joins, Understanding Merge Joins, and Understanding Hash Joins. I seem to have forgotten that role reversal was a feature in hash joins?


    Note that Adam's session is the "Gentle Art ..."
    Join hints and force order is definitely the bulldozer and burn approach

  • Build your own server with Supermicro motherboards

    I used to build white box servers because there were usually enough spare parts left over from upgrade projects. (management did not see the need for non-production systems, so I arranged for there to be spare parts). But since 2002 or so, I have been buying Dell PowerEdge servers for my own test environment. This was in part because of the hassle of troubleshooting connections to multiple SCSI HDDs, was it the cable or connector?

    In the Nehalem/Westmere time frame 2009-10, I decided to step down from a 2-socket system of the previous generation (PowerEdge T2900) to a single socket system, the T110 II. In principle, this was because single socket systems had become powerful enough for me to demonstrate important characteristics I need for my papers, such as generating 2.4GB/s in IO bandwidth. In practice, it was also because I sit in the same room as the servers, and the T2900 had noisy fans while the T110 II was whisper quiet.

    Processors - Intel Xeon E3-1200 series v3, Haswell

    For the current generation processor, the Intel Xeon E3 v3 based on Haswell, Dell decided to focus on pre-built ready to ship systems rather than build to order systems. The only E3 processor option in the Dell T20 is the E3-1225 3.2GHz nominal and 3.6GHz Max Turbo. This system has 1 PCI-E x16 gen3 slot and 1 x4 G2 slot.

    Graphics is not normally important in a server, as it usually resides in a server room and is accessed via remote desktop or even completely remote administration. The previous generation T110 II used an old Matrox G200eW 8MB graphics (based on a 1998 design?) that only supports normal video resolutions (1280x1024?, ok I am getting 1600x1200 on the T100II). The new T20 with E3-1225 has the Intel P4600 graphics.

    For some strange reason my T20 would only power up with 1 DIMM socket populated. I opened a case with Dell Technical Support, but they seem to have lost track of the ticket. I wonder if the people are still there. Or have they been outsourced?

    So I thought that I would give building my own server another try. I got the Intel Xeon E3-1275 v3 3.5GHz nominal and 3.9GHz Max Turbo ($350 versus $224 for the 1225, but less than the $552 price tag of the 1285). The 1225 to 1275 processors have the P4600 graphics, which support 3 displays.

    Supermicro X10SAE Motherboard

    My motherboard of choice is the Supermicro X10SAE with PCI-E 1 x16 and 1 x8 gen3 slots. The E3 v3 only has 16 PCI-E gen3 lanes. The Supermicro motherboard has an ASMedia Switch (ASM1480) that redirects 8 lanes from one slot to another slot so that all 16 lanes connect to a x16 slot if that slot has a x16 adapter and the x8 slot is unpopulated? Otherwise, both slots are x8?

    If the ASM chip is a PCI-E expander, then in principle, both the x16 and x8 slots have all lanes always connected, its just that half of the x16 lanes are shared with the x8? The ASMedia website describes the ASM 1480 as a multiplexer/demultiplexer. But there is not a detailed document. I would hope that in the situation of simultaneous traffic, priority is given to the x8 slot, as the x16 slot should direct traffic to its x8 dedicated lanes? but there is no protocol to support this mode?

    What I like about Supermicro is their deep lineup of server class motherboards with almost every conceivable slot arrangement. I recall that when Intel spent a huge amount of money to focus on one motherboard for an entire processor class, not optimal for any particular purpose.

    Display - Dell P2815Q 3840x2160

    I also got the new Dell P2815Q monitor currently $699. It had priced higher, but Dell offered a second monitor for a discount, so I bought two. This has a 28in diagonal, and maximum resolution of 3840x2160 at 30Hz. The low refresh rate at maximum resolution would not be suitable from gaming. Neither does the P2815Q have the glossy display popular in home entertainment.

    But it is perfect for viewing SQL Server execution plans. At standard zoom (80%) I can see 17 execution plan operators horizontally across the 3840 pixels. Connecting two of the monitors would display 34 operators? Of course, it might be more important to have dual monitors in portrait mode, but I do not know where to get the stands. (per Dave, the P2815Q does rotate)

    I might give the UP2414Q at $1149 a try. The UP3214Q at $2799 is too expensive for my needs. The other large screen with high-resolution is the U2713H at $999 supporting 2560x1440. I have two XPS 27in AIO with apparently the same 27in display?

    Storage - LSI 9361 PCI-E gen3 12Gbps SAS

    My preference would be to plug in 2 PCI-E gen 3 SSDs capable of the full (or nearly) x8 slot bandwidth of 6.4GB/s, at least on the large block read side. This is to avoid a jungle of SATA power splitters and cables inside the system. However, for some reason, there are no PCI-E gen3 SSDs?

    There are PCI-E gen 3 RAID controllers with either 2 x4 12Gb/s SAS interfaces and also some with 24 (x1) 6Gbps ports. There are no 12Gbps SSDs so if I used the standard 2 x4 ports at 12Gbps, I would have to find some enclosure with 12Gbps capable expanders, which will of course escalate the cost.

    All of this is rather unfortunate for building cost optimized high bandwidth storage system. NAND chips currently operate at up to 333MHz. This means 20 channels could saturate the full PCI-E gen 3 x8 bandwidth, even though we would probably use 24 for RAIN and general downstream over-configuration. At 32GB per channel (4 x 64Gbit die) and 24 channels, the raw capacity is 768GB would be a very inexpensive storage yet capable of 6GB/s read? Previous generation PCI-E NAND controllers supported 16 and 32 channels.

    The standard SATA-NAND controller has 8 channels. This was a good choice when NAND was 100MHz. Now this means we have too much (but unusable) downstream bandwidth.

    The new NVMe NAND controllers might offer the option of connecting to either 6Gbps SATA or x2 PCI-E gen 3, which would be 1.6GB/s, but I am not sure when we would have supporting infrastructure.

    The upcoming (now) LSI SandForce SF3700 can interface to either PCI-E gen2 at x2 in the 3719 & 3729 models or x4 (3739 & 3759) and SATA at 6 Gb/s (all models) (SF3700 datasheet). There are 9 channels on the NAND side.

    SQL Server 2014

    I just installed SQL Server 2014 RTM on this system. I notice that SQL Server 2014 does not show the Parallelism (Repartition Streams) operator, same with the Bitmap. The Parallelism (Distribute Streams) and (Gather Streams) operations are still displayed.

    Below is part of a SQL Server 2012 execution plan with both the Parallelism (Repartition Streams) and Bitmap operations.


    In SQL Server 2014, the execution plan for the same query does not show these two operations.


    I imagine that the parallelism and bitmap operations are still there, just no displayed because they do not contribute to understanding the execution plan, while wasting valuable screen real estate.

    Of course, having the option to reduce the spacing between operations without reducing the display font would be very valuable. I do not think the Program Manager for SSMS looks at complex query plans to understand why this would be a valuable feature?


    I would like to find someone willing to build a system with the Supermicro X9DRX+-F 2-socket motherboard with 10 PCI-E g3 x8 slots, filling most of these with storage controllers. This would be massive overkill as I am not sure SQL Server can consume 20GB/s from 4 controllers, let alone 40GB/s from 8 controllers.

    Interpret this as I do not want to pay for 2 12/15-core processors, 16-32x16GB DIMMs, 8 controllers, and 64 SSDs out of my own pocket.


    I have ordered a LSI 9361-8i PCI-E gen3 - 12Gpbs RAID controller that I will use in the Supermicro system w/the Xeon E3 v3 (Haswell), although I have no means of using the 12Gbps SAS signaling rate. If anyone has a 12Gbs expander board, I would appreciate it (there is not a pressing need for SSDs to support 12Gbps, we would just like to connect 12 or SSDs at 6Gbps to the 2 x4 SAS 12Gbps ports.

    I also ordered OCZ Vector 150 SSDs. I will probably mix these with the original Vectors that I already have. In my previous generation system, the Dell PowerEdge T110 II, I had the LSI 9260 controller initially with a mix of OCZ Vertex 3 and Crucial m4 SSDs. The Crucial m4's would occasionally show as offline on the LSI RAID controller, but there was nothing wrong with the m4 when attached to a SATA port. Eventually I replaced the m4 with OCZ Vectors, and since then all 8 SSDs have worked fine with the LSI 9260.

    The recently announced SanDisk CloudSpeed SSDs are also of interest, but I suspect these will be OEM only products.

    Plextor has a PCI-E gen2 SSD for a x2 slot (x4 connector?), rated for 770MB/s. Tom's Hardware says its a M.2 SSD on a PCI-E board. If is the case, then I think the correct SSD product for now are PCI-E boards on which we can plug in 1-4 or perhaps even 8 M.2 SSDs.

    The M.2 form factor supports x2 PCI-E lanes. A simple board could wire up to 4 directly the lanes in the PCI-E slot. A more flexible mode would have a PCI-E expander, so that the number SSDs (each PCI-E x2) can exceed the slot width (x4, x8 or even x16).


    I am seeing just under 4GB/s from the LSI9361 with a mix of 4 OCZ Vector 150 and 4 older OCZ Vertex 3 Max IOPS SSDs. Technically the SSDs are capable of over 500MB/s each, but in an array (2 actually, the 4 Vectors in one, and the 4 Vertex 3 in the other) with SQL Server driving IO, that's pretty good. I got 2.4GB/s with the previous generation LSI 9260. Presumably the controller could not drive the full 3.2GB/s PCI-E gen 2 x8 limit?

    The Dell P2815Q connected to my Supermicro X10SAE motherboard's display port connector does operate at the full 3840x2160 resolution, but not when connected to the HDMI connector. I do not know if it is possible to have 2 displays at 3840x2160 with just the Supermicro motherboard, or if I need to get a separate video card?

  • Hekaton and Benchmarks?

    It is rather curious that the two TPC-E benchmarks results published for SQL Server 2014 did not employ the memory optimized tables and natively compiled procedures, given that Hekaton is the hallmark feature of 2014. Potential for 30X gain in transaction processing have been cited in presentations and whitepapers. No explanation has been given so I will speculate. Why guess when I do not know? Well hopefully someone will come out with the real reason.

    TPC Benchmarks

    The TPC benchmarks are meant to be a common standard that allows people to compare results of different DBMS, operating systems and processor or storage hardware. Many years ago, the TPC-C benchmark was very important so that one could be sure that database engines would not have any bottlenecks in processing that workload, which only comprised a small number of SQL operations. An actual database application might use many different database features and there would be no assurance on whether one or more operations not in the benchmarks had scaling limitations.

    Several years ago, all of the major database players, software and hardware, contributed to TPC-E, which was supposed to reduce the cost of the benchmark system, increase schema complexity and be more representative (higher read-to-write ratio?). But after the benchmark was approved, Oracle decided not to publish results, even though every indication is that they are perfectly capable of producing very good single system results.

    Oracle RAC scales very well in the TPC-C benchmark, which has a high degree of locality by Warehouse. TPC-E does not have locality and is presumed to be more difficult to scale in an active-active cluster architecture. At the time, RAC scaling was very important to Oracle. Since then, Oracle has favored single system benchmarks, especially on their Sun platform with SPARC processors. Recent SPARC processor have 8 simultaneous multi-threading per processor core (SMT, equivalent to HT for Intel processors).

    Microsoft decided to quickly shift over from TPC-C to TPC-E. The time frame of TPC-E roughly corresponded to the SQL Server versions 2005 and 2008 boundary. Microsoft did not allow TPC-C results to be publish on SQL Server 2008, and the few TPC-E results that were published after TPC-E launch employed SQL Server 2005.

    One reason was that log stream compression was added into SQL Server 2008. This is to improve database mirroring functionality. Log compression consumes CPU cycles and is constraint to a single thread, but reduces network traffic especially important over a WAN. Many people use DB mirroring, and few people are up against the single core log compression throughput limit, so perhaps this was for the greater good?

    See Paul Randal blog SQL Server 2008: Performance boost for Database Mirroring, 11 Oct 2007. A SQLCAT article, technical notes/ archive/2007/09/17/ database-mirroring-log-compression-in-sql-server-2008-improves-throughput.aspx, on database mirroring dated 17 Sep 2007 is cited, but the link is no longer valid, Trace flag 1462 is cited as Disable Mirroring Log Compression (Paul says this flag is active).


    The TPC-C benchmark consists of 5 stored procedures, New-Order (45%), Payment (43%) and three at 4% each: Order-Status, Delivery and Stock-Level. The New-Order is the principle call, getting the next order id, inserts one row for the order, an average of 10 rows for order line items, and 10 rows updated in the Stock table. The Stock table has 2 integer columns forming the key, 4 quantity related columns and 11 fixed length char columns totaling 290 bytes, (308 bytes per row excluding overhead for the table). Only the four quantity columns are actually updated. However it is necessary to write both the original and new (?) rows in its entirety to the log?

    Paul Randall says only the bytes that changed need to be written to the log, but a key column update has the effect of a insert/delete.

    It is also possible the exact nature of the Stock update statement requires the char columns to be logged?

    This would imply that the raw New Order transaction log write is on the order of 6.4KB to encompass 10 original and new rows of Stock plus 10 rows of order line insert? The TPC-C reports cite 6.4KB per transaction presumably including the payment transaction?

    The last TPC-C results on SQL Server (2005) were in 2011, one for a system with 4 Opteron processors and one for 2 x Xeon X5690 (Westmere) 6-core processors at 1,024,380 tpm-C, or 17,073 New Order transactions per sec. On the assumption of 6.4KB per transaction log write per New-Order, the raw (uncompressed) log IO would be 109MB/s.

    More recently, Cisco posted a TPC-C with Oracle 11/Linux on a 2 x Xeon E5-2690 (Sandy Bridge) 8-core processors at 1,609,186 tpm-C, for 26,819 New Order transactions/s. There is also an Oracle/Linux TPC-C result on 8 x E7-8870 (Westmere-EX) 10-core processors at 5,055,888 tpm-C or 84,264 New Order transactions/s.

    Even if the 2 x 8-core transaction log volume were within the single thread compression capability, it is likely that higher volume from systems with 4-socket 8-cores or even the newer 2-socket E5 v2 12-core processors would exceed the log stream compression capability of a single core?


    The more recent TPC-E benchmark is comprised of 10 transactions, of which Trade-Order, representing the benchmark score, is 10% of volume. Some of the individual transactions are comprised of multiple frames, so the total number of stored procedures calls (involving a network round-trip) per scored transaction is 24. If we were to work out the RPC volume between TPC-C and TPC-E, we see that the average transaction cost is not substantially different between the two benchmarks. So one of the cited objectives for TPC-E, being more representative of current workloads, may be true, but the database engine does not really care.

    Perhaps the more important matter is that the TRADE table is 139 bytes per row, excluding overhead. The Trade-Order involves 1 row, and is only 10% of the transaction volume (4% of call volume?). Several of the TPC-E transactions involves write operations, and the TPC-E reports cited a log write of 6.7KB/trade, perhaps amortizing the other transactions.

    Some TPC-E stats with raw log write estimates

    data size
    log space
    TPC-C2 x X569061,024,380 tpm-C 7TB (83MB/WH)3000GB?
    TPC-E2 x X569061284 tps-E 5.6/6.0TB98(178?)GB
    TPC-E2 x E5-269081881.76 tps-E 7.7/8.4TB378GB
    TPC-E2 x E5-2697122590.93 tps-E 10.7/11.3TB523GB

    So perhaps the net difference between TPC-E and TPC-C is a 10X reduction in log write?


    So what does all this have to do with Hekaton? Let suppose that Hekaton is capable of providing a 10-30X increase in transaction performance. The current 2-socket with Xeon E5 v2 processor has 12 cores per socket, 15 cores per socket in the 4-socket E7 model. This would generate 17 and 34 MB/s raw transaction log volume without Hekaton. At a 10X gain from Hekaton, we would be beyond the log compression capability of a single core in the 4-socket. At 30X gain, beyond the compression capability for the 2-socket system? All of this is just guess work. Would someone from Microsoft like to comment?


    See the SQLCAT slide deck “Designing High Scale OLTP Systems” by Thomas Kejser and Ewan Fairweather present at SQL Bits 2010. It cites log write throughput at 80MB/s. It is not stated whether this was a limitation or just that particular case. It does appear to be in the correct range from what one Nehalem-EX core can generate in compressed output.

    Hekaton Math

    The Hekaton performance gain math calculation is worth examining. An example was cited of the performance in a test scenario. In employing just the memory optimized table, the performance gain was 3X. With the natively compiled stored procedures (NCP), another 10X gain was realized for a combined gain of 30X.

    Let’s assume that baseline test query requires 100 units of work. A 3X gain via memory optimized tables means that work is now 33 units, so the difference between tradition concurrency and the new system is 67 units.

    The 10X gain with natively compiled procedures over the first step means that the 33 units with just memory optimized tables is further reduced to 3.3, for a net reduction of 29.7.

    So if we were not careful with the underlying math, we might have thought that the 10X with natively compiled procedures was more important than the memory optimized tables. In principle, NCP could have been implemented without memory optimized tables.

    This example demonstrates why it is important to first tackle the big resource consumer first, but the gains are magnified when secondary items can also be improved.

    TPC-C New Order Stock update SQL Statement

    -- update stock values
    UPDATE stock SET s_ytd = s_ytd + @li_qty
    , @s_quantity = s_quantity = s_quantity - @li_qty
    + CASE WHEN(s_quantity - @li_qty < 10) THEN 91 ELSE 0 END

    , s_order_cnt = s_order_cnt + 1, s_remote_cnt = s_remote_cnt
    + CASE WHEN(@li_s_w_id = @w_id) THEN 0 ELSE 1 END
    , @s_data = s_data

    , @s_dist = CASE @d_id
    WHEN 1 THEN s_dist_01 WHEN 2 THEN s_dist_02
    WHEN 3 THEN s_dist_03 WHEN 4 THEN s_dist_04
    WHEN 5 THEN s_dist_05 WHEN 6 THEN s_dist_06
    WHEN 7 THEN s_dist_07 WHEN 8 THEN s_dist_08
    WHEN 9 THEN s_dist_09 WHEN 10 THEN s_dist_10

    WHERE s_i_id = @li_id AND s_w_id = @li_s_w_id



    Regarding the comment below on max memory opt table of 256GB, I do not see any reason that even a modern transaction processing database needs even half of this for in-memory data. But it is very reasonable to have tens of TB in the DB for generally beneficial functionality. This being the case, I think it would be nice to have a feature for rolling in-memory table data to a regular table. My thought is that at the end of each day or other period, the entire set transactions should be converted to a regular table, that will then become a partition in the archive/history table.

  • Column Store, Parallelism and Decimal

    SQL Server performance has the interesting nature in that no matter how sound and logical an idea is on how it might behave in one scenario compared to another, what actually happens could be very different. Presumably the reason is that what happens inside the SQL Server engine is very complex with very many elements to support even a basic query, including steps that we are not even aware of. On top of this, the modern microprocessor is also very complicated, with radically different characteristics depending on whether an address is in (the processor L2) cache, or a memory round-trip necessary, and then whether it is a local or remote node memory access, not to mention the implications of cache-coherency checks.

    With this in mind, some aspects of the Decimal/Numeric data type are of interest. There have been previous discussions on the fact that the Decimal data type is more expensive than integer or float, with impact that could be on the order of 2-3X, enough to become a serious consideration in queries that aggregate very many rows and columns with the Decimal data type. It is easy enough to understand the integer and floating point data types can be executed directly by the microprocessor while decimal must be handled in software, which should mean that the overhead is much higher than 2-4X. The explanation is that the even simple matter of accessing a column in the page-row storage organization of traditional database engines involves a series of address offset calculations for which the modern microprocessor cannot pre-fetch from memory sufficiently far in advance to keep its execution pipeline filled.

    If this is indeed the case, then one would expect that the difference between integer and float compared to decimal would have far larger impact in column store indexes introduced in SQL Server 2012 for nonclustered and clustered in the upcoming 2014. The whole point of column store is to access memory sequentially to fully utilize the capability of modern microprocessors emphasizing bandwidth oriented over serialized round-trip memory accesses. In any performance investigation, it is always very helpful first to build baseline with non-parallel execution plans. This is because parallel execution introduces a whole new set of variability's that can complicate the assessment procedure. Of course, with such sound and logical reasoning, the outcome is inevitably the unexpected, hence the opening paragraph.

    It would seem that the SQL Server engine follows completely different code path on operations to column store indexes depending on whether the execution plan is non-parallel or parallel. But it turns out that is occurs in SQL Server 2014 CTP2, and not SQL Server 2012 SP1, so it is possible the unexpected behavior will not occur in the 2014 release version?

    Test System

    The test system is a Dell PowerEdge T110II with 1 Xeon E3-1240 (Sandy Bridge) 4-core, 3.3GHz nominal (3.7GHz in Turbo) processor with Hyper-Threading enabled, 32GB memory, and storage on 8 SATA SSDs in RAID 0 attached to a LSI 9260 RAID controller. The operating system is Windows Server 2012 R2, and SQL Server version 2014 CTP 2.

    The database was populated using the TPC-H data generator (dbgen) to produce a SF10 data set. This puts 59.986M rows in the Lineitem table which would have been 10GB using the 8-byte datetime data type but is 8GB with the 4-byte date data type. The index keys are different from the TPC-H kit, but the test here are not represented as conforming to TPC rules for official results.

    Four Lineitem tables were created, all at SF 10. Two use the conventional page/row storage, and the other two use Clustered Columnstore indexes. The conventional tables were not compressed, while Columnstore indexes are compressed without option. For each type of storage, one table has 4 columns of type float (8 byte), and the other has 4 columns declared as decimal(18,6) at 9 bytes, NOT NULL in both cases.

    The conventional Lineitem table average 139 bytes per row or 59 row per page with 8 byte float and 143 bytes per row, 57.25 rows per page for the 9 byte decimal. The table with clustered column store index averaged 44.1 and 45.67 bytes per row for float and decimal respectively. The columnstore indexes where about 2.5GB versus 8GB for the conventional tables.

    Previous test have shown that there is no difference between int/bigint, money and float, as all are natively supported on the processor hardware. From the table definition, the four float/decimal columns are adjacent, and should be within 64 bytes of the row header? Meaning all values are in the same cache line?

    One additional note is that this report is a quick response to a question concerning decimal overhead. I did not have time to setup rigorous measurements averaged over 100 executions and follow-up with an examination of anomalies. All measurements here are based on a few runs.

    Page/Row and ColumnStore Performance with Parallelism

    The basic test case a simple aggregate of 1-4 of the float or numeric columns along with a count, in reference to a count only query. For the page/row table, a clustered index (table) scan is forced. There is no point to forcing a scan on columnstore index, due to the nature of column storage. Performance data is collected from sys.dm_exec_query_stats. An attempt was made to ensure data in memory prior to each measurement, but some columnstore accesses generated a small amount of disk IO.

    Below is the CPU in nanoseconds per row for the four cases at DOP 1. The DMV reports worker time in micro-seconds, so that value was multiplied by 1000 to get nanoseconds.

    The CPU nominal frequency is 3.3GHz but for single thread operations, could very be running at the turbo frequency of 3.7GHz, somewhat more than 3 cycles per ns. The cost of the Count only operation, forcing a scan on the entire 8GB table (but does not touch either the float or decimal columns) is about the same for both conventional tables at 58.8 and 60.15 ns per row respectively, probably reflecting the slight difference in table size (3%).

    The true cost structure of a SQL Server table (clustered index) scan consists of a cost for the page access, each row within a page, and each column, typically with the first column access having high cost than the subsequent columns, and perhaps higher cost if a subsequent columns is not on a previously accessed cache line, and perhaps higher for non-fixed length columns that involve a more complicated address calculation.

    In previous reports, I have cited the page access cost as in the 650-750 CPU-ns range for Sandy Bridge generation processors. So about 10ns of the average row cost cited above is amortizing the page access cost (for just under 60 rows per page).

    Below are the same test data, but showing incremental cost of each additional column accessed and aggregated. The Count value is the same as above because it is the baseline operation.

    Notice that the incremental cost for the first column aggregated (1SUM) is higher than the subsequent columns. It is strongly evident that decimal aggregation is much more expensive than the float type (and other tests show float to be the same as int and money).

    The reason that we cannot put a specific value on the difference is because of the cost structure of complete operation has page, row and columns components of which the int/float/decimal difference only involves the last component. In addition, the number of columns of each type also impacts the differential.

    Below is the cost per row in CPU-ns of the count query, with the two conventional tables on the left and the two columnstore indexes on the right at DOP from 1 to 8. The system under test has 4 physical cores with HT enabled. SQL Server correctly places threads on separate physical cores when DOP allows, but the DOP 8 test forces both logical processors on each core to be used. It is also clear in the Columnstore tests that there is something very peculiar. CPU put unit work is not supposed to decrease from DOP 1 to 2. There are certain cases when this does happen, example being a hash join in where the parallel plan has a bitmap filter, which is not employed (per rule) in non-parallel plans. This not the case here, and a test on SQL Server 2012 shows the expected performance advantage for columnstore at all DOP levels.

    Below is the rows per second for the Count query. This is the inverse of elapsed time and is better for demonstrating scaling with DOP. The vertical axis is log scale in base 2 to better distinguish 2X. The scaling with DOP is not particularly good in the DOP 1-4 range, with each thread on separate physical cores. This is believed to be the case with as CPU/row only rises moderately with parallelism to DOP 4. This query does almost no work other than page access, so it is possible there is contention somewhere in the buffer pool management?

    Perfect scaling would be doubling performance for each doubling of DOP (each thread on separate physical cores), an example being from 16 to 32 rows/µs on the vertical scale from DOP 1 to 2. An indicator of quality of the measurement the ratio of worker time to elapsed time. In a perfect situation, this would be equal to the DOP. At DOP 4, the ratio is unexpectedly low at 2.8. Very good scaling is normally expected when parallelism is over separate physical cores. Here the scaling in that case is poor, but appears to be great when both logical processors on each core are active at DOP 8. The sharp rise in CPU per row from DOP 4 to 8 is indicative of this aspect of HT. Had the DOP 4 measurement indicated the correct worker/elapsed ratio closer to 4, there would have been only a moderate increase in performance from DOP 4 to 8.

    Below is the single column SUM query cost per row (in ns) for the two conventional tables on the left and the two columnstore tables on the right. The cost difference between float and decimal in the conventional tables is now evident though not large. It is much more significant in the columnstore tables, as expected.

    Below is the rows per second for the single column SUM query.

    Below is the two column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right. There is a larger difference between float and decimal in the conventional tables compared to the single column query. This is expected as there is more work is in column operations relative to the page and row access. The difference on the columnstore table is than in the single column and this was not expected.

    Below is the rows per second for the two column SUM query.

    Below is the 3 column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right.

    Below is the rows per second for the 3 SUM test.

    The graphs below show the query cost per row for Columnstore access with 1, 2 and 3 columns, the same as before, except with the DOP 1 value truncated.


    Aspects of the Decimal data type cost relative to float have been examined, noting the float has elsewhere been observed to be the same as int and money. The overhead can be as large as 2-3X in conventional page/row tables, depending on the ratio of work between page and row access, relative to column aggregation. In queries that access small to even moderately large number of decimal values (rows*columns) the higher cost of decimal should not be significant. In large data warehouse queries that access ten million values, it will be noticeable and probably start to get painful in the hundred million scale.

    It was expected that the Decimal data type cost would be much high in columnstore indexes, as the row access costs are greatly reduced, magnifying the column contribution. First, the problem in SQL Server 2014 CTP2 with columnstore clustered index for non-parallel execution plans was observed, and it is hoped that this problem will be resolved in RTM. The expected large impact was observed on queries aggregating a single decimal type columns, but the trend decreased unexpectedly for 2 and 3 columns aggregated. SQL Server performance characteristics are not always in-line with expectations, no matter how sound and reasonable the logic al is.

This Blog


Privacy Statement