THE SQL Server Blog Spot on the Web


Joe Chang

  • Hekaton and Benchmarks?

    It is rather curious that the two TPC-E benchmark results published for SQL Server 2014 did not employ memory optimized tables and natively compiled procedures, given that Hekaton is the hallmark feature of 2014. A potential 30X gain in transaction processing has been cited in presentations and whitepapers. No explanation has been given, so I will speculate. Why guess when I do not know? Well, hopefully someone will come out with the real reason.

    TPC Benchmarks

    The TPC benchmarks are meant to be a common standard that allows people to compare results across different DBMS, operating systems, and processor or storage hardware. Many years ago, the TPC-C benchmark was very important in ensuring that database engines had no bottlenecks in processing that workload, which comprised only a small number of SQL operations. An actual database application might use many different database features, and there was no assurance that operations outside the benchmark were free of scaling limitations.

    Several years ago, all of the major database players, software and hardware, contributed to TPC-E, which was supposed to reduce the cost of the benchmark system, increase schema complexity and be more representative (higher read-to-write ratio?). But after the benchmark was approved, Oracle decided not to publish results, even though every indication is that they are perfectly capable of producing very good single system results.

    Oracle RAC scales very well in the TPC-C benchmark, which has a high degree of locality by Warehouse. TPC-E does not have locality and is presumed to be more difficult to scale in an active-active cluster architecture. At the time, RAC scaling was very important to Oracle. Since then, Oracle has favored single system benchmarks, especially on their Sun platform with SPARC processors. Recent SPARC processors have 8-way simultaneous multi-threading per core (SMT, equivalent to HT for Intel processors).

    Microsoft decided to quickly shift over from TPC-C to TPC-E. The time frame of TPC-E roughly corresponded to the boundary between SQL Server versions 2005 and 2008. Microsoft did not allow TPC-C results to be published on SQL Server 2008, and the few TPC-C results that were published after the TPC-E launch employed SQL Server 2005.

    One reason was that log stream compression was added in SQL Server 2008, to improve database mirroring functionality. Log compression consumes CPU cycles and is constrained to a single thread, but reduces network traffic, which is especially important over a WAN. Many people use DB mirroring, and few people are up against the single-core log compression throughput limit, so perhaps this was for the greater good?

    See Paul Randal's blog post SQL Server 2008: Performance boost for Database Mirroring, 11 Oct 2007. A SQLCAT article, Database Mirroring Log Compression in SQL Server 2008 Improves Throughput, dated 17 Sep 2007, is cited, but the link is no longer valid. Trace flag 1462 is cited as Disable Mirroring Log Compression (Paul says this flag is active).


    The TPC-C benchmark consists of 5 stored procedures: New-Order (45%), Payment (43%) and three at 4% each: Order-Status, Delivery and Stock-Level. New-Order is the principal call: it gets the next order id, inserts one row for the order and an average of 10 rows for order line items, and updates 10 rows in the Stock table. The Stock table has 2 integer columns forming the key, 4 quantity-related columns and 11 fixed-length char columns totaling 290 bytes (308 bytes per row excluding table overhead). Only the four quantity columns are actually updated. However, is it necessary to write both the original and new (?) rows in their entirety to the log?

    Paul Randal says only the bytes that changed need to be written to the log, but a key column update has the effect of an insert/delete.

    It is also possible the exact nature of the Stock update statement requires the char columns to be logged?

    This would imply that the raw New Order transaction log write is on the order of 6.4KB, encompassing 10 original and new rows of Stock plus 10 order line inserts? The TPC-C reports cite 6.4KB per transaction, presumably including the Payment transaction?

    The last TPC-C results on SQL Server (2005) were in 2011, one for a system with 4 Opteron processors and one for 2 x Xeon X5690 (Westmere) 6-core processors at 1,024,380 tpm-C, or 17,073 New Order transactions per sec. On the assumption of 6.4KB per transaction log write per New-Order, the raw (uncompressed) log IO would be 109MB/s.

    More recently, Cisco posted a TPC-C with Oracle 11/Linux on a 2 x Xeon E5-2690 (Sandy Bridge) 8-core processors at 1,609,186 tpm-C, for 26,819 New Order transactions/s. There is also an Oracle/Linux TPC-C result on 8 x E7-8870 (Westmere-EX) 10-core processors at 5,055,888 tpm-C or 84,264 New Order transactions/s.
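    The tpm-C to New Order transactions-per-second conversion and the raw log bandwidth estimate above are simple arithmetic; a quick sketch (the 6.4KB of log per New Order is the assumption from the TPC-C reports discussed earlier):

```python
# Convert a tpm-C score to New Order transactions/sec and estimate the raw
# (uncompressed) log write rate, assuming ~6.4KB of log per New Order.

def new_order_per_sec(tpm_c):
    # tpm-C is scored in New Order transactions per minute
    return tpm_c / 60.0

def raw_log_mb_per_sec(tpm_c, kb_per_txn=6.4):
    return new_order_per_sec(tpm_c) * kb_per_txn / 1000.0  # decimal MB

results = [("2 x X5690, SQL Server 2005", 1024380),
           ("2 x E5-2690, Oracle/Linux",  1609186),
           ("8 x E7-8870, Oracle/Linux",  5055888)]

for label, tpm_c in results:
    print(f"{label}: {new_order_per_sec(tpm_c):,.0f} NO/s, "
          f"~{raw_log_mb_per_sec(tpm_c):.0f} MB/s raw log")
```

    The 2 x X5690 result works out to roughly 109MB/s of raw log, matching the figure above.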

    Even if the 2 x 8-core transaction log volume were within the single thread compression capability, it is likely that higher volume from systems with 4-socket 8-cores or even the newer 2-socket E5 v2 12-core processors would exceed the log stream compression capability of a single core?


    The more recent TPC-E benchmark is comprised of 10 transactions, of which Trade-Order, representing the benchmark score, is 10% of volume. Some of the individual transactions are comprised of multiple frames, so the total number of stored procedures calls (involving a network round-trip) per scored transaction is 24. If we were to work out the RPC volume between TPC-C and TPC-E, we see that the average transaction cost is not substantially different between the two benchmarks. So one of the cited objectives for TPC-E, being more representative of current workloads, may be true, but the database engine does not really care.

    Perhaps the more important matter is that the TRADE table row is 139 bytes, excluding overhead. The Trade-Order involves 1 row, and is only 10% of the transaction volume (4% of call volume?). Several of the TPC-E transactions involve write operations, and the TPC-E reports cite a log write of 6.7KB per trade, perhaps amortizing the other transactions.

    Some TPC-E stats with raw log write estimates:

    Benchmark  Processors   Cores  Score            Data size      Log space
    TPC-C      2 x X5690    6      1,024,380 tpm-C  7TB (83MB/WH)  3000GB?
    TPC-E      2 x X5690    6      1284 tps-E       5.6/6.0TB      98 (178?) GB
    TPC-E      2 x E5-2690  8      1881.76 tps-E    7.7/8.4TB      378GB
    TPC-E      2 x E5-2697  12     2590.93 tps-E    10.7/11.3TB    523GB

    So perhaps the net difference between TPC-E and TPC-C is a 10X reduction in log write?


    So what does all this have to do with Hekaton? Let's suppose that Hekaton is capable of providing a 10-30X increase in transaction performance. The current 2-socket Xeon E5 v2 has 12 cores per socket, and the 4-socket E7 model has 15 cores per socket. These would generate 17 and 34 MB/s raw transaction log volume respectively without Hekaton. At a 10X gain from Hekaton, we would be beyond the log compression capability of a single core in the 4-socket. At a 30X gain, beyond the compression capability for the 2-socket system? All of this is just guesswork. Would someone from Microsoft like to comment?
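    A minimal sketch of this speculation, using my 17 and 34 MB/s baseline estimates; the 80MB/s single-core compression throughput is the assumed reference point from the SQLCAT deck:

```python
# Scale the estimated raw log rates by the cited Hekaton gains, to compare
# against a single-core log compression throughput of ~80MB/s (assumed).

SINGLE_CORE_COMPRESSED_MBPS = 80  # assumed single-core compression output limit

baselines = {"2-socket E5 v2 (24 cores)": 17,
             "4-socket E7 v2 (60 cores)": 34}

for system, raw_mbps in baselines.items():
    for gain in (10, 30):
        print(f"{system} at {gain}X Hekaton gain: "
              f"{raw_mbps * gain} MB/s raw log")
```

    Whether a given raw rate actually exceeds the single-core limit depends on the compression ratio achieved, which is why this remains guesswork.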


    See the SQLCAT slide deck “Designing High Scale OLTP Systems” by Thomas Kejser and Ewan Fairweather, presented at SQL Bits 2010. It cites log write throughput at 80MB/s. It is not stated whether this was a hard limitation or just that particular case. It does appear to be in the correct range for what one Nehalem-EX core can generate in compressed output.

    Hekaton Math

    The Hekaton performance gain math calculation is worth examining. An example was cited of the performance in a test scenario. In employing just the memory optimized table, the performance gain was 3X. With the natively compiled stored procedures (NCP), another 10X gain was realized for a combined gain of 30X.

    Let's assume the baseline test query requires 100 units of work. A 3X gain via memory optimized tables means that work is now 33 units, so the difference between the traditional concurrency mechanism and the new system is 67 units.

    The 10X gain with natively compiled procedures over the first step means that the 33 units with just memory optimized tables is further reduced to 3.3, for a net reduction of 29.7.

    So if we were not careful with the underlying math, we might have thought that the 10X with natively compiled procedures was more important than the memory optimized tables. In principle, NCP could have been implemented without memory optimized tables.

    This example demonstrates why it is important to tackle the big resource consumer first; the gains are then magnified when secondary items can also be improved.
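    The arithmetic above can be checked directly (the 100-unit baseline is just an assumed normalization):

```python
# Work units for the baseline query, then after each Hekaton improvement.

baseline = 100.0
after_tables = baseline / 3    # 3X from memory optimized tables: ~33.3 units
after_ncp = after_tables / 10  # further 10X from natively compiled procedures

print(f"memory optimized tables save {baseline - after_tables:.1f} units")
print(f"NCP then saves a further {after_tables - after_ncp:.1f} units")
print(f"combined gain: {baseline / after_ncp:.0f}X")
```

    The memory optimized tables step removes roughly 67 units of work, while the 10X NCP step removes only about 30 more, which is the point of the example.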

    TPC-C New Order Stock update SQL Statement

    -- update stock values
    UPDATE stock SET s_ytd = s_ytd + @li_qty
    , @s_quantity = s_quantity = s_quantity - @li_qty
    + CASE WHEN(s_quantity - @li_qty < 10) THEN 91 ELSE 0 END

    , s_order_cnt = s_order_cnt + 1, s_remote_cnt = s_remote_cnt
    + CASE WHEN(@li_s_w_id = @w_id) THEN 0 ELSE 1 END
    , @s_data = s_data

    , @s_dist = CASE @d_id
    WHEN 1 THEN s_dist_01 WHEN 2 THEN s_dist_02
    WHEN 3 THEN s_dist_03 WHEN 4 THEN s_dist_04
    WHEN 5 THEN s_dist_05 WHEN 6 THEN s_dist_06
    WHEN 7 THEN s_dist_07 WHEN 8 THEN s_dist_08
    WHEN 9 THEN s_dist_09 WHEN 10 THEN s_dist_10
    END

    WHERE s_i_id = @li_id AND s_w_id = @li_s_w_id



    Regarding the comment below on the 256GB maximum for memory optimized tables: I do not see any reason that even a modern transaction processing database needs even half of this for in-memory data. But it is very reasonable to have tens of TB in the DB for generally beneficial functionality. This being the case, I think it would be nice to have a feature for rolling in-memory table data into a regular table. My thought is that at the end of each day or other period, the entire set of transactions should be converted to a regular table, which would then become a partition in the archive/history table.

  • Column Store, Parallelism and Decimal

    SQL Server performance has the interesting property that no matter how sound and logical an idea is on how it might behave in one scenario compared to another, what actually happens can be very different. Presumably the reason is that what happens inside the SQL Server engine is very complex, with very many elements needed to support even a basic query, including steps that we are not even aware of. On top of this, the modern microprocessor is also very complicated, with radically different characteristics depending on whether an address is in (the processor L2) cache or a memory round-trip is necessary, and then whether it is a local or remote node memory access, not to mention the implications of cache-coherency checks.

    With this in mind, some aspects of the Decimal/Numeric data type are of interest. There have been previous discussions on the fact that the Decimal data type is more expensive than integer or float, with an impact on the order of 2-3X, enough to become a serious consideration in queries that aggregate very many rows and columns of the Decimal data type. It is easy enough to understand that the integer and floating point data types can be executed directly by the microprocessor while decimal must be handled in software, which should mean the overhead is much higher than 2-4X. The explanation is that even the simple matter of accessing a column in the page-row storage organization of traditional database engines involves a series of address offset calculations, for which the modern microprocessor cannot pre-fetch from memory sufficiently far in advance to keep its execution pipeline filled.

    If this is indeed the case, then one would expect the difference between integer and float compared to decimal to have a far larger impact in column store indexes, introduced in SQL Server 2012 for nonclustered and in the upcoming 2014 for clustered indexes. The whole point of column store is to access memory sequentially, to fully utilize the capability of modern microprocessors, which emphasize bandwidth over serialized round-trip memory accesses. In any performance investigation, it is always very helpful to first build a baseline with non-parallel execution plans. This is because parallel execution introduces a whole new set of variabilities that can complicate the assessment procedure. Of course, with such sound and logical reasoning, the outcome is inevitably the unexpected, hence the opening paragraph.

    It would seem that the SQL Server engine follows a completely different code path for operations on column store indexes depending on whether the execution plan is non-parallel or parallel. But it turns out that this occurs in SQL Server 2014 CTP2 and not SQL Server 2012 SP1, so it is possible the unexpected behavior will not occur in the 2014 release version?

    Test System

    The test system is a Dell PowerEdge T110II with 1 Xeon E3-1240 (Sandy Bridge) 4-core, 3.3GHz nominal (3.7GHz in Turbo) processor with Hyper-Threading enabled, 32GB memory, and storage on 8 SATA SSDs in RAID 0 attached to a LSI 9260 RAID controller. The operating system is Windows Server 2012 R2, and SQL Server version 2014 CTP 2.

    The database was populated using the TPC-H data generator (dbgen) to produce a SF10 data set. This puts 59.986M rows in the Lineitem table, which would have been 10GB using the 8-byte datetime data type but is 8GB with the 4-byte date data type. The index keys are different from the TPC-H kit, but the tests here are not represented as conforming to TPC rules for official results.

    Four Lineitem tables were created, all at SF 10. Two use the conventional page/row storage, and the other two use clustered columnstore indexes. The conventional tables were not compressed, while columnstore indexes are always compressed (there is no option). For each type of storage, one table has 4 columns of type float (8 bytes), and the other has the 4 columns declared as decimal(18,6) at 9 bytes, NOT NULL in both cases.

    The conventional Lineitem table averages 139 bytes per row (59 rows per page) with the 8-byte float, and 143 bytes per row (57.25 rows per page) with the 9-byte decimal. The tables with clustered columnstore index averaged 44.1 and 45.67 bytes per row for float and decimal respectively. The columnstore indexes were about 2.5GB versus 8GB for the conventional tables.

    Previous tests have shown that there is no difference between int/bigint, money and float, as all are natively supported in the processor hardware. From the table definition, the four float/decimal columns are adjacent, and should be within 64 bytes of the row header? Meaning all values are in the same cache line?

    One additional note is that this report is a quick response to a question concerning decimal overhead. I did not have time to setup rigorous measurements averaged over 100 executions and follow-up with an examination of anomalies. All measurements here are based on a few runs.

    Page/Row and ColumnStore Performance with Parallelism

    The basic test case is a simple aggregate of 1-4 of the float or decimal columns along with a count, compared against a count-only query. For the page/row table, a clustered index (table) scan is forced. There is no point in forcing a scan on a columnstore index, due to the nature of column storage. Performance data is collected from sys.dm_exec_query_stats. An attempt was made to ensure data was in memory prior to each measurement, but some columnstore accesses generated a small amount of disk IO.

    Below is the CPU in nanoseconds per row for the four cases at DOP 1. The DMV reports worker time in micro-seconds, so that value was multiplied by 1000 to get nanoseconds.

    The CPU nominal frequency is 3.3GHz, but for single thread operations it could very well be running at the turbo frequency of 3.7GHz, somewhat more than 3 cycles per ns. The cost of the Count-only operation, forcing a scan of the entire 8GB table (but not touching either the float or decimal columns), is about the same for both conventional tables at 58.8 and 60.15 ns per row respectively, probably reflecting the slight difference in table size (3%).

    The true cost structure of a SQL Server table (clustered index) scan consists of a cost for the page access, each row within a page, and each column, typically with the first column access having higher cost than the subsequent columns, perhaps higher cost if a subsequent column is not on a previously accessed cache line, and perhaps higher for non-fixed-length columns that involve a more complicated address calculation.

    In previous reports, I have cited the page access cost as in the 650-750 CPU-ns range for Sandy Bridge generation processors. So roughly 11-13ns of the average row cost cited above is amortizing the page access cost (for just under 60 rows per page).
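    As a sketch, the amortization works out as follows; the 650-750ns page cost and 59 rows per page are the estimates from the text, and the split of the remainder between row and column handling is not measured directly:

```python
# Amortize an assumed per-page access cost over the rows on the page and
# subtract it from the measured per-row cost of the Count-only scan.

rows_per_page = 59          # ~8KB page / ~139 bytes per row
measured_ns_per_row = 58.8  # Count query, float table, DOP 1

for page_cost_ns in (650.0, 750.0):
    page_component = page_cost_ns / rows_per_page
    rest = measured_ns_per_row - page_component
    print(f"page cost {page_cost_ns:.0f}ns: {page_component:.1f} ns/row "
          f"amortized, {rest:.1f} ns/row for row/column handling")
```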

    Below are the same test data, but showing incremental cost of each additional column accessed and aggregated. The Count value is the same as above because it is the baseline operation.

    Notice that the incremental cost for the first column aggregated (1 SUM) is higher than for the subsequent columns. It is strongly evident that decimal aggregation is much more expensive than the float type (and other tests show float to be the same as int and money).

    The reason we cannot put a specific value on the difference is that the cost structure of the complete operation has page, row and column components, of which the int/float/decimal difference only involves the last. In addition, the number of columns of each type also impacts the differential.

    Below is the cost per row in CPU-ns of the Count query, with the two conventional tables on the left and the two columnstore indexes on the right, at DOP from 1 to 8. The system under test has 4 physical cores with HT enabled. SQL Server correctly places threads on separate physical cores when DOP allows, but the DOP 8 test forces both logical processors on each core to be used. It is also clear in the columnstore tests that there is something very peculiar. CPU per unit of work is not supposed to decrease from DOP 1 to 2. There are certain cases where this does happen, an example being a hash join where the parallel plan has a bitmap filter, which is not employed (per rule) in non-parallel plans. That is not the case here, and a test on SQL Server 2012 shows the expected performance advantage for columnstore at all DOP levels.

    Below is the rows per second for the Count query. This is the inverse of elapsed time and is better for demonstrating scaling with DOP. The vertical axis is log scale, base 2, to better distinguish 2X steps. The scaling with DOP is not particularly good in the DOP 1-4 range, with each thread on a separate physical core. This is believed to be genuine, as CPU/row rises only moderately with parallelism to DOP 4. This query does almost no work other than page access, so it is possible there is contention somewhere in the buffer pool management?

    Perfect scaling would be a doubling of performance for each doubling of DOP (each thread on a separate physical core), an example being going from 16 to 32 rows/µs on the vertical scale from DOP 1 to 2. An indicator of the quality of the measurement is the ratio of worker time to elapsed time. In a perfect situation, this would equal the DOP. At DOP 4, the ratio is unexpectedly low at 2.8. Very good scaling is normally expected when parallelism is over separate physical cores. Here the scaling in that case is poor, but appears to be great when both logical processors on each core are active at DOP 8. The sharp rise in CPU per row from DOP 4 to 8 is indicative of this aspect of HT. Had the DOP 4 measurement indicated the correct worker/elapsed ratio closer to 4, there would have been only a moderate increase in performance from DOP 4 to 8.
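    The worker-to-elapsed ratio check described above is simple to compute; the numbers here are hypothetical, for illustration only:

```python
# Effective parallelism: worker (CPU) time divided by elapsed time.
# In an ideal parallel scan this ratio approaches the DOP.

def effective_dop(worker_ms, elapsed_ms):
    return worker_ms / elapsed_ms

print(effective_dop(2800, 1000))  # 2.8 - the low ratio seen at DOP 4
print(effective_dop(3900, 1000))  # 3.9 - what near-ideal DOP 4 would show
```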

    Below is the single column SUM query cost per row (in ns) for the two conventional tables on the left and the two columnstore tables on the right. The cost difference between float and decimal in the conventional tables is now evident though not large. It is much more significant in the columnstore tables, as expected.

    Below is the rows per second for the single column SUM query.

    Below is the two column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right. There is a larger difference between float and decimal in the conventional tables compared to the single column query. This is expected, as more of the work is in column operations relative to the page and row access. The difference on the columnstore table is smaller than in the single column case, and this was not expected.

    Below is the rows per second for the two column SUM query.

    Below is the 3 column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right.

    Below is the rows per second for the 3 SUM test.

    The graphs below show the query cost per row for Columnstore access with 1, 2 and 3 columns, the same as before, except with the DOP 1 value truncated.


    Aspects of the Decimal data type cost relative to float have been examined, noting that float has elsewhere been observed to be the same as int and money. The overhead can be as large as 2-3X in conventional page/row tables, depending on the ratio of work between page and row access relative to column aggregation. In queries that access a small to moderately large number of decimal values (rows × columns), the higher cost of decimal should not be significant. In large data warehouse queries that access ten million values, it will be noticeable, and it probably starts to get painful at the hundred million scale.

    It was expected that the Decimal data type cost would be much higher in columnstore indexes, as the row access costs are greatly reduced, magnifying the column contribution. First, the problem in SQL Server 2014 CTP2 with columnstore clustered indexes on non-parallel execution plans was observed, and it is hoped that this will be resolved in RTM. The expected large impact was observed on queries aggregating a single decimal column, but the trend decreased unexpectedly for 2 and 3 columns aggregated. SQL Server performance characteristics are not always in line with expectations, no matter how sound and reasonable the logic is.

  • Hardware Update 2014 and Server Sizing

    Intel Xeon E7 v2 processors (Ivy Bridge-EX) officially launched today. The E5 v2 processors (Ivy Bridge-EP) and E3 v3 (Haswell) came out last fall. The previous generation E7 was based on Westmere, so the Sandy Bridge generation was skipped for the EX. This makes sense because big systems desire 2-year product stability versus the annual refresh for 2-socket systems. The new E7 v2 tops out at 15 cores 2.8/3.4GHz nominal and turbo compared to the Westmere E7 with 10 cores 2.4/2.8GHz. The Xeon E5 v2 top model has 12 cores at 2.7/3.5GHz versus the Sandy Bridge E5 at 8 cores 2.7/3.5GHz.

    When the separate Ivy Bridge EP dies with 10 and 12 cores were announced, it seemed a rather unusual choice. Later, when the 15-core EX model was brought out, it became clear that the 12-core E5 v2 actually shares the 15-core die with the E7 v2. (The diagram below is from Anandtech, see links at bottom; this reference was inadvertently left out in the original edit.)


    Below is my rendering of the 3 Ivy Bridge EP dies.


    Below are the 10 and 15-core Ivy Bridge EP/X die
    IvyBridge-10c IvyBridge-15c

    I will try to scale these in relation to Sandy Bridge and others when time permits. Note that the L2 cache is on the other side of the core from the L3 or rather last level cache (LLC).

    The Dell website shows the new PowerEdge R920 (but not yet taking orders), featuring 96 DIMM sockets which could support 6TB memory, but 1.5TB is most economical for now, with 3TB for “moderately” extreme situations. The HP ProLiant DL580 G8 lists support for 32GB DIMMs, so it will probably be sometime before 64GB DIMM support can be verified. It is not clear if 8-socket system support will be available.

    Server Sizing

    In the period up to SQL Server 2008 R2, with licensing determined only by socket count, the obvious strategy was to pick a system with the desired number of sockets, and the most powerful processor for that type of socket. There was no point in analyzing memory requirements because it was both simple and cheap to fill the DIMM slots with the second largest available memory module (currently 16GB).

    From SQL Server 2012 on, the new core-based licensing dictates that we should now base our sizing strategy on the appropriate number of cores, and then choose between the E7 and E5 platforms as applicable.

    Intel Xeon E7 v2 Processors

    Model       Cores  Base    Turbo
    E7-x890 v2  15     2.8GHz  3.4GHz
    E7-x857 v2  12     3.0GHz  3.6GHz
    E7-x891 v2  10     3.2GHz  3.7GHz
    E7-x893 v2  6      3.4GHz  3.7GHz

    Intel Xeon E5 v2 Processors

    Model       Cores  Base    Turbo
    E5-2697 v2  12     2.7GHz  3.5GHz
    E5-2690 v2  10     3.0GHz  3.6GHz
    E5-2667 v2  8      3.3GHz  4.0GHz
    E5-2643 v2  6      3.5GHz  3.8GHz
    E5-2637 v2  4      3.5GHz  3.8GHz

    The benefit in stepping down the total number of cores (in addition to reduced licensing cost $6.7-10K per core?) is the possibility of higher core frequency in the lower core count processors. Also consider that write (and certain other) operations are not parallelizable, so single thread operations may be running at the turbo mode frequency.

    When the desired number of cores can be achieved with a 2-socket system, consider that the E7 supports 24 DIMM slots per socket compared with 12 per socket for the E5. Even though we have been conditioned by DBA indoctrination that more memory is better, this "rule" originated from the days when the maximum memory configuration may have been 64MB to 1GB. In those circumstances, every MB of memory helped reduce disk IO. By blowing out the budget on memory, plus months of hard work in performance tuning, and with luck, it might be possible to bring disk IO within the capability of a storage system that did not have so many components that disk drives failed on a weekly basis.


    Given the maximum memory supported today, very few situations really call for a 1TB+ memory configuration. It is extremely likely that a 2-socket Xeon E5 system with DIMM slots filled with 16GB DIMMs (24 x 16GB = 384GB, $200 each, $4.8K total) is already more than enough by a factor of 4 if not ten. More than likely, disk IO (excluding log writes) is only a sporadic occurrence. And if we do need disk IO, we can configure more IOPS capability from an SSD storage system (now more economical than 1GB of memory was 20 years ago) than we could actually use. (Yet people still find ways to run into problems!)
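    The memory cost figure is simple arithmetic; the ~$200 per 16GB DIMM is the street-price assumption used above:

```python
# 2-socket Xeon E5: 24 DIMM slots filled with 16GB RDIMMs at ~$200 each.
slots, dimm_gb, dimm_price = 24, 16, 200
print(f"{slots * dimm_gb} GB total memory")   # 384 GB
print(f"${slots * dimm_price:,} total cost")  # $4,800
```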

    Unless your SAN admin dictated the storage configuration, in which case maybe go for broke on memory.

    Socket Options 1 or 2 and 2 or 4

    Another possible option, for less than maximum core count situations, is whether to fill the sockets with low core count processors or to populate only half the sockets with high core count processors.

    Example 1: 1 x 12 core or 2 x 6 core processors.

    Example 2: 2 x 15 core or 4 x 8 core processors.

    Filling the processor sockets enables maximum memory bandwidth (and memory capacity, but in this situation we most probably do not need it).

    The decision criteria might be based on the parallelism strategy. If our expectation is to run critical queries at a higher degree of parallelism (8, 10, 12 or 15), one would expect that having all cores on one socket would give better (true) performance in Parallelism Repartition Streams operations, as the latency between cores is low, favoring fewer sockets of the high core count processors? Do not bother looking at the plan cost for this; it is strictly based on a model that does not take the processor/system architecture into account.

    On the other hand, if we expect to restrict the max degree of parallelism to something lower, say 4, 6 or maybe 8, then more sockets populated with lower core count processors would benefit from greater memory bandwidth?

    I have not tested these two scenarios side-by-side in otherwise equivalent configurations, so I ask readers to alert me if data to support this assessment should become available. It is possible that having the fewest sockets is the better solution because of less complicated cache coherency, despite the lower memory bandwidth and capacity.

    It is uncertain whether there will be a Xeon E5-4600 v2 series, as this seems unnecessary? There is also the Xeon E5-2400 v2 series with 3 memory channels instead of 4, for a slightly lower platform cost structure. We can also consider the single socket E3 v3 (Haswell) at 4 cores, 3.6/4.0GHz, with 32GB memory and 2 x8 PCI-E gen 3 slots. It might seem beneath our dignity to run on a server similar to our desktop or laptop, but the fact is this 4-core system with 32GB is far more powerful than the 4-socket systems of 10 years ago.

    I bought a Dell PowerEdge T20 to test out the E3 v3 Haswell processor. Unfortunately the system would only power on with 1 DIMM slot populated, not 2 or 4. Dell support has not responded. I may buy a Supermicro motherboard and chassis.


    New TPC-E benchmarks were announced for the E7 v2 by IBM and NEC. Below are the recent IBM TPC-E results spanning the Westmere-EX, Sandy Bridge, and Ivy Bridge processors.

    Sockets  Processor   Freq    Cores  Threads  Memory  SQL Server version  tps-E
    8        E7-8870     2.4GHz  80     160      4TB     SQL Server 2012     5,457.20
    4        E7-4870     2.4GHz  40     80       2TB     SQL Server 2012     3,218.46
    2        E5-2690     2.9GHz  16     32       512GB   SQL Server 2012     1,863.23
    2        E5-2697 v2  2.7GHz  24     48       512GB   SQL Server 2012     2,590.93
    4        E7-4890 v2  2.8GHz  60     120      2TB     SQL Server 2014     5,576.26

    At first I thought that the 4-socket E7 v2 performance gain over the 4 and 8 socket Westmere E7 also involved the new Hekaton feature in SQL Server 2014. But then I realized that the 2-socket E5 v2 performance on SQL Server 2012 was in line with this being the traditional table structure? The E7 v2 benchmark details have not been released? Is there a reason Hekaton was or was not enabled?

    IBM System x3850 X6 with 4 x E7-4890 v2 (60 cores, 120 threads), 2TB (64 x 32GB) memory, 5 RAID controllers connected to 208 SAS SSDs and 1 RAID controller for logs. The server, processors, memory and miscellaneous items totaled $151K, storage was $681K, and the 30 SQL Server Enterprise Edition 2-core licenses at $13,742 each totaled $404K.

    The complete list price was $1.25M with a discount of $212K (17% of the complete price) but this might actually be a 25% discount on the hardware or just the storage. The price on the 200GB SSD (SLC?) is $3079 which should easily support a 30% discount.

    I would like to know what discount levels people are actually getting on SQL Server EE. The price with Software Assurance is about $10K per core, so this might be the proper budgeting value. Oh yeah, IBM includes 1 Microsoft Problem Resolution Services incident as part of the 3-year cost of ownership.


    On the storage IO side, PCI-E gen 3 has been available in server systems (Sandy Bridge EP) for almost 2 years. PCI-E RAID controllers came sometime after that. There are now also RAID controllers with both PCI-E gen 3 and SAS 12Gb/s on the downstream side. Much of the storage infrastructure (especially HDDs and SSDs) is expected to remain at 6Gbps for some time. It would be helpful to have either or both of disk enclosures supporting 12Gb/s SAS and RAID controllers with 4 x4 SAS ports, to better leverage the bandwidth of PCI-E gen 3 x8 on the upstream side when the downstream side is still 6Gb/s. There is still a bandwidth mismatch, but such is life.

    Internally, disk enclosures have (2) chips, one per controller, each having sufficient x1 SAS ports for each bay and 2 x4 ports for upstream and downstream traffic. The two controllers support dual-path operation with dual-port SAS devices. We would like the x4 ports to operate at 12Gb/s per lane while connecting to either 6 or 12 Gb/s devices, allowing continued use of 6Gbps storage devices. There might be 24 bays communicating at 6Gbps, more than enough to load the x4 port on each of the two controllers.

    I am curious as to the lack of SSDs with a PCI-E gen 3 interface. Dell says their PCI-E SSDs are now on the new NVMe standard. I suppose the effort to work this in, along with the combo PCI-E/SATA interface, has run into a longer than expected debugging effort. If so, then we will wait patiently. In the server world, it is important for new storage technology to be thoroughly tested.


    See Anandtech Intel's Xeon E5-2600 V2: 12-core Ivy Bridge EP for Servers by Johan De Gelas September 17 2013, and Intel Readying 15-core Xeon E7 v2,
    SemiAccurate A technical look at Intel's new Ivy Bridge-EX,
    and Toms Hardware Intel Xeon E5-2600 v2: More Cores, Cache, And Better Efficiency.


    The TPC-E supporting files are now available for the two new results on the Xeon E7 v2 and SQL Server 2014. In the IBM report, the SQL does not use either Hekaton tables or natively compiled procedures. I will look over the NEC report later.

    2014 Mar 13

    Anandtech article on Testing SATA Express ...

    2014 Apr 5 - Memory prices

    RDIMM 4/8/16GB $85/142/243
    LRDIMM 32/64GB $823/4510

    4/8/16/32 $57/100/180/700

  • Backup Compression - time for an overhaul

    Database backup compression is incredibly useful and valuable. It became popular with LiteSpeed, from the then Imceda (later Quest and now Dell). SQL Server 2008 added backup compression, for Enterprise Edition only. The SQL Server EE native backup feature allows only a single compression algorithm, one that favors CPU efficiency over the degree of compression achieved. In the long ago past, this strategy was essential. But today its benefits are irrelevant, while the lower compression is becoming a significant liability. All the third party backup compression products offer multiple levels of compression via open source algorithms, and it is time for SQL Server to follow.

    As always, we shall start with the historical context. LiteSpeed was introduced sometime around 2000, when a server system would have had 4 Pentium II or III processors at around 400-500MHz. This was before there were multi-core processors, so 4 sockets meant four cores. Today, the individual processor core (Sandy Bridge to Haswell generations) is about 20 times more powerful: 6X via frequency and another 3X on instructions per cycle efficiency. And there are 8 cores per processor socket for the Intel Xeon E5 (Sandy Bridge) and 12 in the E5 v2 (Ivy Bridge) processors. Even after dropping from 4 sockets in 2000 to 2 sockets today, the system has perhaps 80 times more compute power.
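The arithmetic above can be sketched out (all the factors are my rough approximations, not measurements):

```python
# Rough compute-growth estimate, circa 2000 vs. circa 2014.
freq_gain = 6            # ~500MHz Pentium II/III -> ~3GHz modern core
ipc_gain = 3             # instructions-per-cycle improvement
per_core = freq_gain * ipc_gain          # ~18-20x per core
cores_2000 = 4 * 1                       # 4 sockets, single core each
cores_today = 2 * 8                      # 2 sockets, 8 cores (Xeon E5)
system_gain = per_core * cores_today / cores_2000
print(per_core, system_gain)  # 18 72.0 -- same ballpark as "perhaps 80x"
```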

    In addition, compression code also benefits from Hyper-Threading, more so than SQL Server does, as there are no lock contention issues. Even on the Pentium 4's first generation HT, the performance gain was 50% for LiteSpeed database backup compression.

    Over the same period of time, IO bandwidth has improved as well, but not by the same degree. Fourteen years ago, 350MB/s from the IO system was good (basically 2 x 2Gbit/s FC ports). Today it is possible for an IO system to deliver 10-20GB/s, except that some brain fart IO strategy in recent versions of SQL Server effectively caps practical IO bandwidth to disk storage in the 4-5GB/s range.

    So the net change is that today there is far more CPU available to support compression relative to the practical IO bandwidth. To make better use of this, we would like to use one of the algorithms that can achieve high compression with more CPU per unit of raw data. This is valuable when the data has to be sent over a wide-area network with bandwidth limits, or even a local network, because the infrastructure team would not put in 10GbE for the mission critical database server and has no concept of parallel network connections.
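To illustrate the trade-off, here is a hypothetical sketch using the open source zlib (standing in for whichever algorithms a backup product might offer), on repetitive data standing in for typical database pages. Higher levels spend more CPU per byte for a better ratio:

```python
import time, zlib

# Repetitive text as a stand-in for compressible database pages (~7MB).
data = b"orderid,custkey,orderdate,totalprice\n" * 200000

for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: ratio {len(data)/len(out):6.1f}, {dt*1000:7.1f} ms")
```

The exact numbers depend on the data and machine, but the pattern is the point: multiple levels let the administrator pick the CPU-vs-size trade-off, which is exactly what the native EE backup does not offer.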

    High compression can also be important when local storage space is limited and extra database backups are necessary for migration operations. Of course, the occurrence of constrained storage space is a separate matter of total organizational stupidity. Hard disk capacity is cheap, even for the more reliable enterprise models ($100 per TB for 7.2K HDDs and $450/TB for 10K). Yet on a SAN that is billed as helping to reduce cost (via increased utilization), somehow storage becomes precious, doled out in driblets by the SAN admin only after a series of time wasting meetings, forms and justifications. OK, I am side-tracking into my long standing feud with SAN vendors bent on crippling database performance. The bottom line is that we have CPU resources on the SQL Server system, and we can put them to good use.

    Compressibility of Data

    Normally I like to use the TPC-H database for performance investigations, because the data generator is openly available (with source code) to facilitate independent verification, and because there is a good library of published reports for comparison. Of course, the TPC-H database is populated with randomly generated data, which has different compression characteristics than normal data (less compressible). This is because a typical production database makes prolific use of sequentially increasing 8-byte data types, or values otherwise having a limited range, i.e., 63 out of 64 bits are the same from one row to the next. Another cause is lack of normalization, meaning repeating values that are easily reduced with dictionary based compression. The exception of course is databases with extensive use of (non-sequential) unique identifiers, which have almost no compressibility.
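A quick sketch of this point, with hypothetical data shapes and zlib standing in for the backup compressor: sequentially increasing 8-byte keys compress very well, since most bytes repeat from one row to the next, while random 16-byte identifiers barely compress at all.

```python
import os, struct, zlib

seq = b"".join(struct.pack(">q", i) for i in range(100000))  # ~800KB sequential int64
rnd = os.urandom(16 * 50000)                                 # ~800KB of GUID-like bytes

ratio_seq = len(seq) / len(zlib.compress(seq, 6))
ratio_rnd = len(rnd) / len(zlib.compress(rnd, 6))
print(f"sequential int64: {ratio_seq:.2f}x, random ids: {ratio_rnd:.2f}x")
```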

    That said below are some results with the TPC-H data set.

    Type                   DB size  Compressed B/U  7z
    Uncompressed tables    14.17    5.66            3.14
    Compressed tables      9.55     5.54            3.29
    Clustered columnstore  4.55     3.50            2.34

    With no compression at the table and index level, the SQL Server backup achieves a 2.5 compression ratio. 7z achieves a 4.5 ratio on the uncompressed backup, albeit at far higher CPU consumption. With compressed tables and indexes, the database size is reduced to 67% of the original size. The SQL Server backup can achieve further compression, as the block size is larger, allowing for more effective dictionary use. It is perhaps no accident that both the normal database and the one with compressed tables and indexes back up to about the same size. Even the 7z compressions of the backup files are about the same size.
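The ratios quoted above fall straight out of the table (sizes in the same unit as the table):

```python
# DB size, native compressed backup size, 7z size, per the table above.
sizes = {
    "uncompressed tables":   (14.17, 5.66, 3.14),
    "compressed tables":     (9.55, 5.54, 3.29),
    "clustered columnstore": (4.55, 3.50, 2.34),
}
for name, (db, backup, sevenzip) in sizes.items():
    print(f"{name:22s} native backup {db/backup:.2f}x, 7z {db/sevenzip:.2f}x")
```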

    The clustered columnstore has even better efficiency, both in the database storage and in the backup. I am thinking that the storage organization makes compression more effective. Note that the SQL Server compressed database backup is not further reduced by 7z or other common file compression utilities.


    This is a simpleton’s solution. It is important to differentiate between the objective and the method. Backup compression consumes CPU. When LiteSpeed first came on the market, people asked for a means to throttle CPU. The reason was to maintain some degree of responsiveness to user transactions during the backup. Full database backups are normally done at night or during off-hours if the two are different. But there might be some user activity during the backup. So it might seem reasonable to throttle the CPU used by backup compression.

    However, this is why I stress the need to differentiate between the objective and method. State the true objective as accurately as possible so that intelligent people can come up with the best solution rather than constraining the developer to artificial requirements. The reason a system is unresponsive is not because the CPU is fully utilized, but because of the wait for a CPU to become available.

    In the Windows operating system, from the very beginning or at least the early days, there is the quantum. On multi-processor servers, the clock interval is 15.625 milliseconds (1/64 second) and the quantum is 4 times this. A processor core can compress a 1MB block (the standard backup transfer size) in well under a full quantum. Simply yield the thread after each block. If there are any user transactions, there will be minimal wait time. And any spare CPU cycles can still be used for compression, with only moderate loss of efficiency versus running for the full quantum.
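The yield-after-each-block idea can be sketched as follows (hypothetical code, not SQL Server internals; zlib stands in for the backup compressor):

```python
import io, time, zlib

BLOCK = 1 << 20  # 1MB, the standard backup transfer size

def compress_stream(src, dst, level=1):
    """Compress src to dst one 1MB block at a time, yielding the CPU
    after each block so any runnable user transaction gets scheduled
    with minimal wait."""
    comp = zlib.compressobj(level)
    while True:
        block = src.read(BLOCK)
        if not block:
            break
        dst.write(comp.compress(block))
        time.sleep(0)  # voluntarily give up the rest of the quantum
    dst.write(comp.flush())

src, dst = io.BytesIO(b"x" * (3 * BLOCK)), io.BytesIO()
compress_stream(src, dst)
print(len(dst.getvalue()))  # far smaller than the 3MB input
```

Since compressing one block takes well under a quantum, the yield costs little throughput when the system is otherwise idle, yet keeps transaction wait times near zero.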

    Of course, a good strategy might be to impose some kind of IO bandwidth control. The key element is to keep disk latency low, meaning the IO queue depth should be low. It should be possible to control this with the buffer count setting. If there are only 1 or 2 buffers per thread, then the IO queue should not be over saturated with IO generated by the backup process.

  • LOB Pointer Indexing Proposal

    My observations are that IO to lob pages (and row overflow pages as well?) is restricted to synchronous IO, which can result in serious problems when these reside on disk drive storage. Even if the storage system is comprised of hundreds of HDDs, the realizable IO performance to lob pages is that of a single disk, with some improvement in parallel execution plans.

    The reason for this appears to be that each thread must work its way through a page to find the lob pointer information, and then generates a synchronous IO if the page is not already in memory, continuing to the next row only when the IO is complete. I believe this issue could be addressed if we could build a non-clustered index where the index row header contains a pointer to the lob page instead of back to the cluster key (or heap page). Presumably this structure would be able to use the asynchronous IO capability of the SQL Server engine used for normal in-row pages. Per Mark Rasmussen’s PASS 2013 session, the index should perhaps point to all lob pages for very large structures requiring multi-level lob pointers.

    Another thought is that the nonclustered index included columns could hold the lob page pointer, so that we can jump directly to it instead of first going back to the table.

    In the last two years, I have worked on databases that made extensive use of large text fields, prominently kCura (see Relativity), the legal document discovery application. Typically the fields are declared as varchar(max) but it could be any of the data types that are stored outside of the row, i.e., in either the lob_data pages or row_overflow_data as opposed to the normal in_row_data pages.

    It is quickly apparent that accesses to the lob fields are horribly slow when the data is not in the buffer cache. This is on a storage system comprised of over one hundred 10K HDDs distributed over four RAID controllers.

    SQL Server has an outstanding storage engine for driving disk IO involving normal in-row data (of course, I have previously complained about the lack of direct IO queue depth controls). In the key lookup and loop join operations (which generate pseudo random IO), for both high volume and single large queries, SQL Server can issue IO asynchronously and at high queue depth. For a database distributed over many disk volumes, many IO channels, and very many HDDs, SQL Server can simultaneously drive IO to all disks, leveraging the full IO capability of the entire storage system, in this case 25-25K IOPS at low queue depth.

    Apparently, this is not true for lob (and row-overflow) pages. A quick investigation shows that IO for lob pages is issued synchronously at queue depth one per thread (or per DOP). That is, each thread issues one IO, and when the IO completes, it does the work on the data, then proceeds to the next IO. Because the data was distributed over a limited portion of the disk, the actual access times were in the 2.5-3ms range, corresponding to 300-400 IOPS per thread. Compare this with the theoretical random access of a 10K HDD, 3ms rotational latency plus 3.5ms average seek for data distributed over the entire disk, which works out to about 150 IOPS.
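The per-thread numbers here are just the reciprocal of the access latency:

```python
# Per-thread synchronous IOPS is simply 1000 / latency_ms.
def iops(latency_ms):
    return 1000.0 / latency_ms

print(round(iops(6.5)))                    # ~154: full-stroke 10K HDD (3 + 3.5 ms)
print(round(iops(3.0)), round(iops(2.5)))  # 333-400: short-stroked data region
print(round(8 * iops(3.0)))                # ~2667: DOP 8 upper bound, 1 sync IO per thread
```

The DOP 8 figure is an upper bound; the observed 1500-2000 IOPS quoted below falls somewhat under it, as expected with compute time interleaved between the IOs.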

    SET STATISTICS IO shows “lob physical reads 14083, lob read-ahead reads 39509”, indicating read-ahead reads, but this might be for the additional pages of a field larger than 8KB? There is some scaling with degree of parallelism in the IO access to LOB pages, as it appears that the synchronous IO is per thread. On this particular system, the decision was made to set MAXDOP at 8 (out of 20 physical cores, 40 logical) to better support multiple concurrent search queries. A query touching LOB data can generate on the order of 1500-2000 IOPS at DOP 8, and perhaps 4000-6000 at DOP 20.

    It was noticed that generating a Full-Text index on a table with 20M rows and 500GB of LOB data took several hours. This would be consistent with a single thread running at 400 IOPS (not all rows have LOB pages). I do not recall any single CPU being pegged high during this period, but a proper observation is necessary. It could be that Full-Text index creation would be much faster with better IO to the LOB pages (and parallelism?).

    It so happens that kCura Relativity seems to run with most normal in-row pages in memory, generating little disk activity. In this case, one possible solution is to use the TEXTIMAGE_ON specification to place LOB data in a file group on an SSD array. In principle the in-row data is more important, but those pages will most likely be in memory, hence not in need of fast disk IO. The LOB pages, which cannot benefit from proper asynchronous IO at high queue depth, are placed on the low latency SSD. This is the reverse of putting the important data on the more expensive storage media, but it suits the actual mode of operation?

    Even better is for Microsoft to fix this problem. My suggestion (and I know how the SQL Server team loves to hear my advice) is to use the index Included Columns feature to support pointers to LOB pages, instead of the column value itself. Naturally it does not make sense to include the LOB column itself, because it would be stored out of page, as in the original table. The reason disk IO to lob pages is synchronous is that each thread must work its way into the row to find the file-page pointer? Nowhere is there a list of pointers similar to the nonclustered index header. The advantages would be: 1) having the list of pointers available in a compact structure that can be scanned quickly, and 2) supporting asynchronous IO. The index row header would point to the lob pages, instead of back to the table? In order of the cluster key? Or some other column? Naturally it is necessary to understand how the SQL Server engine handles LOB IO to come up with the right solution.

    At PASS 2013, there were two sessions on row and page organization: DBA-501 Understanding Data Files at the Byte Level by Mark S. Rasmussen of iPaper A/S, and DBA-502 Data Internals Deep Dive by Bradley Ball. The Rasmussen session also details LOB organization.

  • HealthCare.Gov Fiasco

    There will probably be consequences for delving into such a charged topic. But this has never stopped me before, and I am too old to change my ways. So here goes. Many of us have heard of the problems with the online exchange, even though technical details are not yet available.

    First let me say that I am not singling out this particular project, as severe problems are common in government IT projects, and I suspect the incidence rate may be no different in private industry. Other projects, however, may hide failures with schedule slips, or otherwise not attract broad news coverage. This project involves the general public, is a politically charged topic, and had a hard deadline, so there is no escaping public scrutiny.

    Some special aspects of this project are that it involves interaction with third-party health care insurers (and providers?) and the fact that each state in the US has its own health care laws and licensing? It has been reported that the complete system was not tested as a whole until one week before the public launch deadline of Oct 1. And despite flaws found in late testing, the agency in charge decided that it was necessary to launch on the prescribed date regardless of known and possible issues, instead of announcing a delay.

    There are serious issues with the overall manner in which the government handles IT projects, issues that seem to make either serious deficiency in meeting the objectives or outright failure the expected outcome. For now, I will comment on the patterns that I have seen in other government projects. A serious problem in all government projects I have been involved with is the lack of understanding or appreciation of the different grades of technical talent.

    The government has no problems in spending (I cannot call this investment as that would imply a reasonable expectation of positive return) hundreds of millions or in the billions of dollars on projects. (Note: the cost of these projects is not the subject of contention.) I assume (but I do not know) that the top managers are compensated accordingly for large project responsibility. It is in the high technical positions where it seems that the government is unwilling to offer a proper pay scale.

    A large project needs a large staff of people, be it permanent employees or contractors. (per comment by Ryan below, large IT projects have the budget for a large staff, be it necessary or not. I was at a company years ago that did have large projects. It was acknowledged that the project could have been done better with fewer people if there was technology to clone certain people. But as such, it was done by a large team with a few seriously top people.)

    The general expectation is that much of this staff should have several years of experience. It is not a requirement to have people with little experience, but if there are none, then there will be a shortage of experienced personnel for future projects. The key is in having a number of very talented and experienced people to provide technical leadership, to ensure that the work of the staff is built on a solid foundation.

    Of course top technical talent costs top dollar. Every time I get called on a government project, I hear that they cannot pay at a rate above that suitable for a senior developer. It is possible that there are government rules on how many high salary slots are allowed in a project and that these have already been assigned. It might also be possible that the government imposed such rules as a guard against fraud.

    But without top talent, it is very likely that a high cost project created by a large team of middle to senior developers will have serious problems due to being built on questionable foundations. So it seems that our government would rather stick to their project management rules even though it means critical projects end up as very public failures.


    Addendum 23 Oct
    Consider the example of the military procurement system. There are so many messed up rules and other red tape meant to prevent fraud? or serve some forgotten purpose? In the end, a project costs tens of billions of dollars, and by the time it is complete 10 to 20 years later, some combination of 1) it does not do something useful, 2) it is too expensive for what it does and 3) so much time has elapsed that the original requirements are no longer valid.

    In the two major interventions of the last decade plus, when it was brought up that the troops needed equipment not previously in the system, and that could be developed, Congress had the good sense to set aside most of the rules, saying just do it. The necessary stuff got done, and done quickly. True, some money was wasted, but it is also important that the rules meant to prevent waste do not defeat the ultimate objective.

    Competitive bidding is a good idea for known entities. I regard it as nearly totally stupid for IT development projects. It might be better to set an annual budget for development, which should determine the size of the team, and a timeline. In new projects, a prototype with moderate functionality should be up within 12 months if not sooner. In any case, something of value needs to reach production within 2 years or there is a risk of a project out of control, lacking cohesion.

    I do believe that our federal government has an alternative, as there should be several concurrent projects in development among the many agencies. These should be judged against each other, with successful contractors getting assigned more work.


    Addendum 24 Oct
    It is generally known that high executives can regurgitate IT catch phrases on demand. Some amusing quotes reported by Fox News covering the congressional inquiry:

    Senior Vice President Cheryl Campbell also said in her prepared remarks that "no amount of testing" could have prevented the site's problem-plagued start.

    ... a system "this complex with so many concurrent users, it is not unusual to discover problems that need to be addressed once the software goes into a live production environment."

    "No amount of testing within reasonable time limits can adequately replicate a live environment of this nature,"

    In fact, testing did find errors that were not fixed prior to release (of course, testing happened just before release). So, a lesson for those aspiring to be executives: learn to make statements that are generally accepted to be true, preferably irrelevant to the actual root cause(s).

    Optum/QSSI blamed in part a "late decision" to require customers to register before browsing for insurance, which could have helped overwhelm the registration system.

    "This may have driven higher simultaneous usage of the registration system that wouldn't have occurred if consumers could window-shop anonymously," said Andy Slav

    This is true. Still, a modern server system can support thousands of connections. It is important to run many web server instances/processes, so that each can be recycled without affecting too many users. My own limited experience with web servers is that the session state system is more likely to be the source of a problem requiring a process restart. So if session state had not been implemented, it is less likely a restart would be needed in the first place.

    Foxnews crash-and-burn-lawmakers-grill-obamacare-website-contractors


    Addendum 25 Oct

    Centers for Medicare and Medicaid Services (CMS), under the Department of Health and Human Services, is the government agency handling the web site/app. In this case, they acted as the project lead, as no contractor had project lead responsibility (and authority), so they are responsible for the top level decisions. CGI Federal handled most of the work. QSSI handled the registration element with identity management? They owned up to their part of the project problems. CGI cited inadequate testing, started only 1-2 weeks before the Oct 1 go-live, when it should have been months of testing. (Of course, months of testing between code-feature complete and go-live is totally stupid. But it is correct to conduct months of testing during development.) And CGI did not mention this in September.

    I do not find that the testimony given by the contractor executives at the congressional hearings provides meaningful insight or get to the key elements. In practice, the only way to get the whole truth and nothing but is under a no-blame inquiry.

    What I would like to know is what the original project plan schedule was. When was the date for code completion? What was the original testing period, and was there an allowance for bug fixes and retesting?

    I think the test, bug fix, re-test cycle should be about 4 weeks. Of course there will still be bugs remaining to be fixed after go-live, which is why the CGI testimony calling for months of pre-launch testing is full of sh!t. It is perfectly reasonable to be fixing bugs found in testing, as well as making minor performance fixes. But if they had no idea of what the performance characteristics were going to be until post-code-freeze testing, then that means the architects were incompetent, blindly building a project on stupid principles with no connection to performance; see my blog load-test-manifesto.

    What almost always happens is that code-feature completion is late, and the project leads decide to compress the test-bug fix cycle to preserve the launch date. This never works, because the test cycle was already tight to begin with, on the assumption that only 1 bug-fix retest cycle was necessary. Two might be workable.


    Addendum 31 Oct

    CBS News cites the cost of the website development at $118M and $56M for IT support. Various states have collectively spent more than $1B on their own health care/insurance websites?

    Kathleen Sebelius, secretary of Health and Human Services (HHS), which oversees CMS, testified that the contractors never asked to delay the site's launch, while CGI testified that there should have been months of testing. Marilyn Tavenner, head of CMS, testified that she had no idea there were problems prior to Oct 1? even though internal testing prior to launch showed severe problems. Lesson for those aspiring to be (government) executives: don't listen to the people who work for you, they can't have anything important to say.

    Ms. Sebelius said to "hold me accountable for the debacle". I am not a fan of firing someone every time a project fails. I have seen this before, and the next project still fails. Even Stalin eventually learned that shooting generals does not inspire the next general to victory. Also, HHS oversees several agencies, so I think the head of CMS should be the one accountable.

    CBS News


    Addendum 1 Nov

    Officials cited big numbers (4.7M) for the number of visitors in the first few days, but declined to state exactly how many people actually purchased health insurance. Ms. Sebelius stated that the numbers were unreliable, and that it will be mid-Nov before they are certain. It now comes out that 6 people completed enrollment the first day, and 248 by the end of the second day.

    So I presume that HHS/CMS did in fact know exactly how many enrollments there were on each day, but did not want this to become public knowledge, and lied about it. Just yesterday I said that firing people for failure does not prevent the next failure. However, lying pretty much ensures the next project will be a failure.

    It is being reported that the tech "big guns" are being brought in to help fix the problem. There is a Google person for reliability, even though Google did not provide software for this project. Oracle is also sending people, and presumably this means that there is an Oracle back-end. Let me say that I do not have technical issues with the core Oracle engine. It is indeed very sophisticated. There are also some extremely talented Oracle DBAs out there. Back about 12 years ago, I noted that the best of the technical writings on Oracle were of greater depth than those for SQL Server. (Since then, there has been good quantitative technical analysis on SQL Server.)

    On the SQL Server side, there are many "accidental" DBAs. This might be a person foolishly inquiring as to who is the DBA for a particular SQL Server back-end. The boss thinks for a moment, and then says: you are! That said, the accidental SQL Server DBAs are not afraid to admit they know very little about being a DBA and need to learn. On the Oracle side, I have already said there is serious talent. But there are also very many lightweight DBAs.

    Without making a generalization, some of these think that because they are using a very sophisticated product, they are sophisticated too. At the last Oracle World, six or seven years ago, one of the top Oracle performance experts was explaining to the audience that an application making excessive hard parses (SQL compiles) cannot scale. It is essential to have plan reuse.

    On more than one occasion, I have seen an Oracle DBA who seemed to be superficially versed in the latest Oracle big gun features, but absolutely clueless on fundamentals. I was asked to assist a major hotel management company in an Oracle & SQL Server comparison for their new reservation system (being ported from Informix). When I got there, everything indicated that the project team wanted Oracle and only wanted to show that Oracle was superior. So they were using advanced Oracle features in the belief that this would give a decisive advantage. However, their SQL coding was total crap. They used PL/SQL, which may actually implement a hidden temp table with inserts, when in fact the query could have been written with straight SQL. I had been told that this was supposed to be a like-for-like comparison, but since it was obvious this was not the case, I felt no obligation to point out their crappy SQL, while implementing correct SQL on my side (I also tightened up the data types to reduce table size). They were shocked when SQL Server obliterated Oracle in the performance tests. They spent 2-3 weeks checking the individual query results to be sure I was returning identical rows. Only then did they finally bother to look at the SQL. In the end, with the Oracle system using correct SQL, the results were about the same, and they recommended Oracle.



    Addendum 22 Nov

    Apparently outside consultants were hired in early 2013 to assess the HealthCare.Gov project. It was assessed that there were serious issues? but none of this was reported to Congress. It might be accepted that the project team would put on a public face of all is well, but failing to disclose the truth to Congress should be a more serious matter?

  • Need for Queue Depth Control in SQL Server IO

    I have complained about the lack of queue depth control in SQL Server IO before, and I am complaining again. I fully understand the desire to restrict the number of knobs, which, used incorrectly, can cause more harm than good. The main point about IO queue depth is that there is not a single right answer for all circumstances; rather, it depends on both the hardware environment and the purpose.

    One example is a transaction processing server that is also used for reports, or perhaps queries that behave as a report. The transaction may involve a series of queries each for a single or few rows. These would be issued synchronously at queue depth 1.

    A report-type query, one estimated to involve 25 or more rows, would issue IO asynchronously at high queue depth. In principle, this increases storage subsystem throughput by allowing the hard disks to employ elevator seeking. But it also increases latency.

    The report query completes faster despite the higher latency because it issues multiple IOs at once. The transaction query issues 1 (sequence of) IO, and waits for it to complete before issuing the next.
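A crude toy model makes the trade-off concrete (the 0.5 elevator-seek gain is an assumed parameter, not a measurement):

```python
# Wall-clock time to complete n random IOs against one HDD-like device.
def fetch_time_ms(n_ios, latency_ms, qd, reorder_gain=0.5):
    if qd <= 1:
        return n_ios * latency_ms             # strictly one IO in flight
    # qd IOs in flight; drive reordering trims per-IO service time
    return n_ios * latency_ms * reorder_gain / qd

print(fetch_time_ms(1000, 6.5, 1))    # 6500.0 ms: transaction-style QD1
print(fetch_time_ms(1000, 6.5, 32))   # ~102 ms: report-style high QD
```

Note that each individual IO at queue depth 32 waits behind the others, so its latency is worse than at QD1; the report wins on total elapsed time, not per-IO latency, which is exactly why no single queue depth setting suits both workloads.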

    So why complain now? I am testing a storage system for a data warehouse environment. I noticed that for a table scan, SQL Server was not issuing IO at sufficiently high queue depth when the data files are distributed over 16 or more volumes (RAID groups). SQL Server issues 512K IO at high queue depth for 8 or fewer volumes. The onset of high queue depth IO seems to occur when the degree of parallelism (DOP) is equal to or higher than the number of files.

    The throughput on each volume can be up to 600MB/s, the volume being comprised of six 10K HDDs in one case and six 7.2K HDDs in another. With 8 volumes spread over 4 RAID controllers, it is possible for a table scan to generate IO at approximately 4GB/s.

    Each RAID controller, an HP P812 (an older model) in this case, can support 2GB/s of IO. This is rather low considering the PCI-E slot is gen 2 and x8 wide, supporting 3.2GB/s. This might be because the memory on the RAID controller is 64 bits (8 bytes) wide and operates at 800MHz, for a nominal bandwidth of 6.4GB/s. That's a lot, right? Well, not if a disk read is written to memory first, then read back to be sent to the server (where the memory write-read sequence happens again).

    SDRAM, including the DDR derivatives, writes at half the read rate. So the effective write-read throughput is one-third of the memory bandwidth. LSI controllers have similar (or less) bandwidth to external memory, but can drive higher bandwidth. There may be some other effect, perhaps the controller's internal memory?
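The arithmetic: if writes run at half the read rate, a byte that is written to controller DRAM and then read back costs 2 + 1 = 3 read-equivalent units of the interface.

```python
nominal = 8 * 800e6 / 1e9      # 64-bit wide @ 800MHz = 6.4 GB/s nominal
effective = nominal / 3        # write (2 units) + read (1 unit) per byte
print(nominal, round(effective, 2))  # 6.4 and ~2.13, close to the observed 2 GB/s
```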

    Anyways, with 4 controllers each capable of 2GB/s, the whole system should support 8GB/s? Why only 4GB/s? It turns out there is a setting in Resource Governor named REQUEST_MAX_MEMORY_GRANT_PERCENT which can "improve I/O throughput" at high degree of parallelism (oh yeah, the applicable DMV is dm_resource_governor_workload_groups). Well, simply issuing IO at sufficient queue depth so that all volumes are working would do this. There is no reason to be cryptic.

    In summary, there is no a priori single correct setting for queue depth and IO size for any or all circumstances, and certainly not for different hardware configurations. To favor any specific purpose, SQL Server must know how many hard disk drives (or SSDs) each volume is comprised of, and whether these are dedicated or part of a shared pool. The operating system cannot see beyond what the RAID controller presents, so we must provide this information manually.

    Next, SQL Server must know what the mission is. Is the mission to keep latency low for transactions? Or is the mission to flood the queues for maximum throughput in batch processing? Or something in between? It is also disappointing that SQL Server has only a single table scan IO size, such that any RAID volume not at a specific number of disks x RAID stripe size is sub-optimal.
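    How deep the queue needs to be for the throughput mission can be estimated with Little's law. The 5ms service latency below is an assumption for six-disk RAID volumes; the 600MB/s and 512KB figures come from the text.

```python
# Little's law sketch: in-flight IOs = arrival rate x latency.
# Estimates the outstanding IOs per volume needed to sustain a target rate.

def queue_depth_needed(target_mb_s, io_size_kb, latency_ms):
    iops = target_mb_s * 1024.0 / io_size_kb     # IOs per second required
    return iops * latency_ms / 1000.0            # IOs that must be in flight

per_volume = queue_depth_needed(600, 512, 5.0)   # 6 IOs in flight per volume
total = per_volume * 16                          # ~96 outstanding across 16 volumes
```

    This is why low queue depth on a 16-volume layout caps throughput well below what the hardware can deliver: with only a handful of IOs outstanding in total, most volumes sit idle at any instant.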

    Below is the disk IO generated by the TPC-H LINEITEM table scan after DBCC DROPCLEANBUFFERS, with SQL Server already having memory allocated from the OS (see previous post on Lock Pages in Memory).


    table scan disk IO

  • Automating Performance with ExecStats

    For several years I have made my (free) SQL Server performance tool ExecStats publicly available (download at ExecStats). I have recently improved existing features and added new ones (build 2013-09-23 or later) that could be useful, so I am asking people to give it a try and please do send feedback.

    I started working on ExecStats when SQL Server 2005, for the first time, made sufficient information available via the Dynamic Management Views (DMV) and functions to do a pretty good job of performance analysis. The new features added to the DMVs over the more recent versions have further improved the ability to assess SQL Server health.

    In SQL Server 2000, we had to use Profiler to start a Trace and collect data for a sufficient period of time. The information from the various DMVs is available at any given point in time, and depending on the degree to which execution plans are retained in the procedure cache, can provide a reasonably accurate assessment of the workload on SQL Server. Of course Profiler and Trace are still useful in the exception cases (and Microsoft wants to move to Extended Events going forward).

    Two starting points in performance tuning with DMVs are dm_exec_query_stats, which keeps execution statistics for entries in the plan cache, and dm_index_usage_stats, which keeps index usage statistics. These avenues can be pursued independently, but working both together is best. The dm_exec_query_stats view has sql and plan handles that link to DMFs for the SQL text (dm_exec_sql_text) and the execution plan (dm_exec_text_query_plan). Inside the XML plan is information such as which indexes are used by each SQL statement and the access method (seek, scan, lookup, update, etc.).

    It should quickly be evident that while SQL is a great language for data access, it is not suitable for step-by-step processing, for which there are procedural programming languages designed for this purpose. This is why ExecStats is a C# program and not a massively foreboding morass of dynamically generated SQL. ExecStats parses the top execution plans to build a cross-reference of index usage by SQL statement. Some other tools and scripts may attempt to parse 5 or 10 execution plans. The ExecStats default is 1000, and it can parse the entire contents of the plan cache if desired.
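    The core of the cross-referencing idea can be sketched in a few lines. The showplan namespace below is the real schema namespace, but the plan fragment is a hand-made stub (the table and index names are invented), not actual dm_exec_text_query_plan output, and ExecStats itself handles far more plan detail than this.

```python
import xml.etree.ElementTree as ET

# Showplan XML schema namespace used by SQL Server execution plans
NS = "{http://schemas.microsoft.com/sqlserver/2004/07/showplan}"

# Hand-made stub of a plan fragment, for illustration only
plan_xml = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <RelOp PhysicalOp="Index Seek">
    <Object Table="[Orders]" Index="[IX_Orders_CustomerID]"/>
  </RelOp>
  <RelOp PhysicalOp="Clustered Index Scan">
    <Object Table="[LineItem]" Index="[PK_LineItem]"/>
  </RelOp>
</ShowPlanXML>"""

def index_usage(xml_text):
    """Extract (table, index, access method) tuples from a showplan."""
    root = ET.fromstring(xml_text)
    usage = []
    for relop in root.iter(NS + "RelOp"):
        for obj in relop.iter(NS + "Object"):
            usage.append((obj.get("Table"), obj.get("Index"),
                          relop.get("PhysicalOp")))
    return usage

usage = index_usage(plan_xml)
```

    Run over every plan in the cache, tuples like these roll up into the index-to-statement map described above.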

    Working from the index usage stats alone, it is possible to identify unused indexes. It is also reasonable to guess that indexes on the same table with the same leading key might be consolidated. However, this is not certain, and frequently no good assessment can be made for infrequently used indexes. By building the full index usage to SQL statement cross-reference map via the execution plans, it is possible to determine where each index is used. This allows a reliable determination of the minimum set of good indexes.

    Now that the capability to parse execution plans has been established, the scope can be expanded because there are other sources for execution plans. One option is to simply get a list of all the stored procedures in the database via sys.procedures to generate the estimated execution plans (for NULL or default parameter values). I did this once for a client, and from the trapped error messages, there were nearly one hundred procedures that referenced tables which no longer existed. These were obsolete procedures that no one had identified for removal even though the underlying tables had been dropped.

    Another option is to maintain a list of SQL, which could be stored procedures with parameter values, either for generating the estimated execution plan only or to be executed for the actual execution plan, which has additional information. In addition, certain stored procedures could be tested multiple times to expose the impact of compile parameters or different code paths from Control-of-Flow keywords.

    ExecStats automates the data collection for all of this. Results can be collected and archived on a regular basis, and especially before and after code or other changes. It is especially important to consider that execution plans can change with both compile parameters and data distribution statistics. When there are complaints of poor performance, one could investigate the top resource consuming SQL, but it is immensely helpful to have previous characteristics for comparison. This includes execution statistics, the execution plan, compile parameters, indexes, and even the last data distribution statistics update.

    In 2012, I integrated a previously separate program for performance monitoring into ExecStats. The key point in ExecStats performance monitoring is that important system and storage architecture details are captured and incorporated into the display. Too many of the third party tools were obviously developed on a desktop with 1 processor and 1 disk. It is clear when a tool was built by someone who got a list of counters with "best practice" parameters and very little real understanding of what any of it actually means; certainly not by someone who solves SQL Server performance problems on a day-to-day basis, or has ever done it at all.

    There are two recent additions in 2013-Sep. One pertains to index fragmentation, and the other to fn_virtualfilestats. Technically we should use the DMF dm_db_index_physical_stats, which replaces DBCC SHOWCONTIG, to assess fragmentation. However, the effort to run dm_db_index_physical_stats on large tables is substantial. There are several pieces of information that can be used to provide a clue on the state of fragmentation, but only for nonclustered indexes without included columns. The DMV dm_db_partition_stats tells us the number of rows and the size of the index, from which we can calculate the average bytes per row. Next, DBCC SHOW_STATISTICS tells us the average key size, which includes both the nonclustered index key portion and the clustered index key portion. Keeping in mind that the row overhead is 10-12 bytes, we can compare the value calculated from dm_db_partition_stats to the key length from statistics plus row overhead to determine if this is reasonable. See Paul Randal at SQL Skills, anatomy-of-a-record, for more on the record overhead, which also applies to index records.
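    The heuristic comparison can be sketched as below. The numbers and the 1.5x slack factor are made-up illustrations; only the actual-versus-expected comparison itself comes from the text.

```python
# Sketch of the fragmentation clue for a nonclustered index without
# included columns: compare actual bytes/row against key length plus
# row overhead. The slack factor is an arbitrary illustrative threshold.

def looks_fragmented(used_kb, rows, avg_key_len, row_overhead=12, slack=1.5):
    actual_bytes_per_row = used_kb * 1024.0 / rows   # from dm_db_partition_stats
    expected = avg_key_len + row_overhead            # from DBCC SHOW_STATISTICS
    return actual_bytes_per_row > expected * slack, actual_bytes_per_row

# 1M rows, 60,000 KB used, 20-byte average key length (invented values)
flag, actual = looks_fragmented(60000, 1_000_000, 20.0)
# ~61 bytes/row against ~32 expected: worth a closer look
```

    When the actual figure is far above the expected one, either the index is fragmented or pages are sparsely filled; either way it is a candidate for a real dm_db_index_physical_stats check or a rebuild.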

    edit The previous versions of ExecStats display the index row count and size in the Index Usage tab, and the Average Key Length in the Dist Stats tab. The newer version makes a copy of this in the 3rd rightmost column of Index Usage as AvKyL. For now, compute 1024*Used KB/Rows to compare with AvKyL.

    The function fn_virtualfilestats provides useful file IO statistics at the SQL Server file level. In a complex environment, there are many files for the key databases spread over many volumes on the storage system (each seen as a physical disk by the Windows operating system). It is a simple matter to roll up the file IO statistics by database or filegroup. It was not simple to roll up the file IO statistics on a volume basis because complex storage systems are typically implemented with mount points. The new (as of SQL Server 2008 R2) DMF dm_os_volume_stats provides the mount point. So now ExecStats rolls up file IO by filegroup, database and volume.
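    The rollup itself is just a group-and-sum once each file row carries its mount point. The rows below mimic a join of fn_virtualfilestats with dm_os_volume_stats; the database names, paths and byte counts are invented.

```python
from collections import defaultdict

# Invented rows standing in for fn_virtualfilestats joined to
# dm_os_volume_stats: (database, volume mount point, bytes read)
file_stats = [
    ("Sales", "C:\\Mount\\Vol1\\", 500),
    ("Sales", "C:\\Mount\\Vol2\\", 300),
    ("Stage", "C:\\Mount\\Vol1\\", 200),
]

def rollup(rows, key_index):
    """Sum the IO figure (last element) grouped by the chosen key column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

by_db = rollup(file_stats, 0)      # per-database totals
by_volume = rollup(file_stats, 1)  # per-volume totals across databases
```

    The per-volume view is the one that was previously awkward: without the mount point from dm_os_volume_stats, files on mounted volumes could not be attributed to the right physical disk group.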

    edit Below is the previous version of Databases, showing database size information.


    The new file IO rollup by database is at the far right, as below.


    The new file IO rollup by OS Volume is in the new Volumes tab.


    So feel free to give ExecStats a try and please send feedback. If anyone would like to volunteer to write proper documentation, that would be especially appreciated.

  • PDW Concept on NUMA

    This is purely a question and not a proposal, as I know how Microsoft is so very thrilled every time I have a suggestion. Could the Parallel Data Warehouse concept work on a single NUMA system?

    Technically all multi-socket systems today are NUMA because the processors now have integrated memory controllers. The path to memory is different between local access and memory attached to a different processor socket. But I am more interested in 8-socket, and potential 16+ socket systems. The single system definition here is that of a single operating system image. Some large systems allow partitioning to support multiple OS images running, but that configuration is not discussed here.

    Both Windows Server and SQL Server are NUMA aware in that there are distinct memory nodes (that are closer to the local processor cores than to remote processor cores). But there is no workable mechanism to fully benefit from knowledge of the underlying hardware organization. Say for example we have a large table resident in the buffer cache, evenly distributed across the nodes. We would like a query against this table to be farmed out to one thread on each node, with each thread accessing only rows stored locally.

    But this is not what happens. More likely, starting from an empty buffer cache, a query accesses the table. The table is initially loaded into memory on the same node as the connection, then switching to other nodes? I have no idea what actually happens, so it would be nice if someone would like to investigate this. Now another connection that happens to be assigned to a thread on a different node would have to cross nodes to remote memory for this same query? So the expectation is that there would be apparently inconsistent variations in query time depending on both the node distribution of the participating buffer cache pages and the node affinity of the execution thread.

    In the PDW environment, there are multiple compute nodes, each a 2-socket system running its own Windows operating system, and (modified?) SQL Server instance\process with fact tables partitioned by node and dimension tables replicated to all nodes. A front-end sends a query to each node which processes data on its node. The purpose of this architecture is to allow scaling the computational resources (CPU, memory and IO) beyond the limits of a single system. But it also happens to achieve a degree of locality optimization.

    The limitation of the PDW concept is that any single query is a more involved process. Furthermore, the query execution plan elements cannot cross nodes without a remote server mechanism. Even with the InfiniBand protocol, communications between nodes are expensive and long latency relative to the processor cycle time scale. This is why PDW is Parallel Data Warehouse and not Parallel Database.

    So now what? My question is whether we can apply the PDW concept to a NUMA system: one big 8-socket NUMA system running a single Windows operating system. Instead of PDW with a SQL Server instance on each system (PDW compute node), there is now a SQL Server instance localized to each node of the NUMA system. Communication between processes on a single system has lower latency than between systems over InfiniBand.

    Can we go further than this? How about bringing the Parallel concept inside to a single process? Now each “node” is a collection of threads affinitized to a node on the NUMA system. Communication between threads is even faster than between processes. And now, potentially execution plan elements can cross “nodes”?

    We might ask why bother. The nature of server systems is that the 8-socket system is more than twice as expensive as four 2-socket systems. In addition, the 8-socket system may have processors one generation behind the 2-socket system. The fact is that by the time we assemble a collection of systems or one big system, the full cost of hardware is irrelevant to a really important mission. The real question is which meets the objectives better.


    Below is a completely separate discussion but is on the topic of NUMA systems.

    A Better NUMA System Memory Strategy?

    As far as I know, Windows or SQL OS creates 2 memory nodes from processor node 0, then 1 additional memory node on each of the other nodes. One of the memory nodes on node 0 is for general allocations. All other memory nodes, including the second one on node 0, are for NUMA aware allocations? The immediate problem is that the memory nodes are asymmetric in size.

    There is one possibility to consider for systems not at maximum memory. This is to put maximum memory in node 0 (good luck figuring out which one this is) and normal memory in the other nodes. But of course, the nature of component pricing today is that it makes more sense to fill the DIMM sockets with the second largest capacity memory module, today 16GB. Today, Crucial shows 16GB ECC DIMMs at $200 each and 32GB DIMMs at $800 each. Sometime in the next year or so, expect the 32GB DIMM to trend toward $400 and a new 64GB DIMM to be introduced at several thousand dollars initially.

    HP-UX, on the other hand, has a very clever memory strategy, which is to configure both interleaved and node memory, or LORA in their terminology. Say for example that there are 8 nodes and 1TB memory total, 128GB per node. Interleaved memory totaling 128GB is allocated from all nodes at 16GB per node. The remaining 112GB on each node is formed into node local memory. The addressing scheme might be something like this. For the interleaved portion in the first 128GB, the memory is interleaved in 256-byte stripes across the nodes. The first 256 bytes come from node 0, the next 256 bytes from node 1, and so on before wrapping back to node 0. Above this are separate 112GB contiguous chunks on each node.
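    That address map is easy to state precisely. This is a sketch of my reading of the scheme, not HP documentation: with 16GB per node contributed to the interleave, 128GB - 16GB = 112GB per node remains local.

```python
# Sketch of an HP-UX LORA-style address map: first 128GB interleaved
# across 8 nodes in 256-byte stripes, then one contiguous local chunk
# per node. Constants follow the example in the text.
GB = 1 << 30
NODES = 8
STRIPE = 256
INTERLEAVED = 128 * GB
LOCAL_CHUNK = 112 * GB   # 128GB per node minus 16GB given to the interleave

def home_node(addr):
    """Return the NUMA node that physically backs a given address."""
    if addr < INTERLEAVED:
        return (addr // STRIPE) % NODES      # round-robin in 256B stripes
    return (addr - INTERLEAVED) // LOCAL_CHUNK  # contiguous per-node chunks
```

    The appeal is that consumers wanting locality get uniform 112GB local regions, while everything else lands in the interleave and spreads its bandwidth demand across all nodes.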

    This arrangement recognizes the fact that not all consumers want local node memory. Now the NUMA local memory nodes are uniform in size. I happen to think this is an excellent idea. I do not know if there is an institutional bias at MS against ideas not invented in happy valley.


    Edit based on Lonny Niederstadt's comment

    On further thought, I think it might be better if the OS did not interleave a single memory page (4KB on x86) across NUMA nodes, perhaps not even for the 2MB large page mode, but I am not sure about the 1GB page.

    The HP-UX scheme allows the sys admin to adjust the interleaved / LORA ratio, with 1/8:7/8 being recommended. This would allow NUMA to be turned off.

    The SQL OS could also override the Windows scheme, simply by choosing how it allocates pages.

    For NUMA controls, the default should be off, so people do not encounter anomalies unless they make the decision to go for it.

    I am thinking that on a transaction server, the scheme should be to prefer allocating buffer cache pages from the local node memory (of the thread making the call). The app servers would connect over different port values, and SQL Server would have affinity set by port. So each group would be likely to have its data in the local node.

    For DW, the strategy should be to round-robin extents (or large pages?) by NUMA node. Even better would be if the clustered index could be on a hash key. Then two tables on the same hash key, ORDERS and LINEITEM for example, would have the same node as their preferred buffer location.

    Taking this a step further, if storage were also distributed by node, and the filegroup had one or more files on each node, then the preferred file location would match. And an index rebuild would reset to match this scheme.

    There should be some parallelism options beyond MAXDOP. One would be Local node: parallel execution would use local node cores up to the socket limit. Another would be Global: allocate 1 thread per node, then round robin. An independent option would be Hyper-Thread (HT), in which both or all logical processors of 1 core would be allocated before moving to the next core. The default of course would be off, but this could be a powerful tool for investigating HT issues.
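    The placement policies proposed above can be sketched as orderings over logical CPUs. Everything here is hypothetical; none of these options exist in SQL Server, and the topology is a made-up 2-node, 4-core example.

```python
# Sketch of the hypothetical placement policies: a logical CPU is a
# (node, core, thread) tuple on an invented 2-node x 4-core x 2-thread box.
NODES, CORES, THREADS = 2, 4, 2

def pick(dop, policy="local", hyper_thread=False):
    """Return the first `dop` logical CPUs under the given policy."""
    threads = range(THREADS) if hyper_thread else [0]
    if policy == "local":
        # Fill the cores of node 0 first, then spill to the next node
        order = [(n, c, t) for n in range(NODES)
                           for c in range(CORES) for t in threads]
    else:  # "global": one core per node, then round-robin back
        order = [(n, c, t) for c in range(CORES)
                           for n in range(NODES) for t in threads]
    # With hyper_thread on, both logical processors of a core are taken
    # before moving to the next core, per the HT option described above.
    return order[:dop]
```

    At DOP 4, "local" lands entirely on node 0 while "global" alternates nodes, which is exactly the OLTP-versus-DW distinction the two defaults are meant to capture.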

    Perhaps the local node parallelism option should be the default for the OLTP scheme, and Global the default for the DW scheme.

    I am thinking something needs to be done about the processor affinity bitmask. There should be 3 values, which could be done with 2 sets of bitmasks. One is preferred processor affinity, the other is secondary affinity, and the remainder excluded.

    This might help manage multi-instance: each instance could get a separate preferred set, but the secondary would allow activity on other nodes as well.

    Since you mention CPU per node, I will plug my ExecStats tool, which in PerfMon mode displays individual CPUs, annotating the node boundaries. Note to the 3rd party tool vendors: learn what a real server is, not the developer's desktop!

    The high CPU on node zero might be related to this fix: CPU spike when there is no load on a server after you install SQL Server 2012 on the server (SP1 CU3).

    Now, having thought about it some, I think this should be a feature request; we could call it SQL Server Data Center Edition.

  • Lock Pages in Memory and Large Page Extensions

    The most important setting for SQL Server today is probably the Lock Pages in Memory option set in Local Security Policy, User Rights Assignment. On recent versions, enabling this option for the SQL Server user account in Enterprise Edition on systems with 8GB or more memory also activates Large Pages. It used to be that Large Page support was activated by trace flag 834, which requires the Lock Pages in Memory right. As is typical, Microsoft officially provides no guidance or assistance in trying to quantify the benefit of this setting.

    I recall a presentation by Thomas Kejser at SQLBits that mentioned this setting doubled performance in one particular situation. Based on the description of lock pages in memory and large page allocation, it seems perfectly reasonable that these two settings should have a meaningful performance impact. So naturally I attempted to investigate this in my test environment, and found zero measurable difference. I was certain that the principles in this were correct, but had to accept that I was missing something and left the matter alone.

    Just now, I am doing an assessment of storage performance for a system already in production. This system has several instances of SQL Server, and hundreds of other processes running, stopping and starting. In the storage performance sequential IO tests, I wanted to look at a table scan reading from disk, both loading into an empty buffer cache and into a full buffer cache forcing the eviction of other data. During the testing, I changed the SQL Server max server memory several times between 10GB and 70GB, with the test query being a 60GB table scan.

    The most pronounced behavior was observed when the SQL Server instance maximum server memory was raised from 10 to 70GB (with 10GB memory already allocated from the OS before the query start). The table scan query would cause SQL Server to start allocating additional memory from the OS up to the 70GB limit. The query executed at a horrifically slow rate, averaging 16MB/s. When SQL Server was not allocating memory from the OS, the table scan generated disk IO at 2.0 to 2.3GB/sec. This is for all combinations of loading into an empty or full buffer cache with and without eviction. The consumption rate with data in memory was just over 1GB/s per core showing excellent scaling from degree of parallelism 1 to 10.

    None of the SQL Server instances had the Lock Pages in Memory right assigned, and hence all SQL Server processes were using the conventional memory manager. The operating system had been running for over three weeks with several SQL Server instances and hundreds of processes that started and stopped for various reasons. Overall system physical memory in use probably never exceeded 85% (of 1TB).

    My interpretation of what happened is that the physical memory range had become heavily fragmented over this period, with so many processes allocating and de-allocating memory pages (4KB). I would presume that the operating system can allocate 4KB pages from the free list much faster than 16MB/s. When processes de-allocate memory pages, the OS must first zero out each page before returning it to the free list (Demand Zero Faults/sec). There must be something that makes memory allocation very expensive, considering that the OS must track 256M x 4K pages = 1TB.
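    To put the observed rates in perspective, here is the back-of-envelope arithmetic; the 2GB/s figure is the steady-state scan rate reported above, the rest follows from 4KB pages.

```python
# Back-of-envelope sketch: what the observed 16 MB/s rate implies about
# the cost of growing the buffer pool one 4KB page at a time.
PAGE_KB = 4

def pages_per_second(mb_per_s):
    return mb_per_s * 1024 // PAGE_KB

slow = pages_per_second(16)      # 4,096 pages/s while allocating from the OS
fast = pages_per_second(2000)    # ~512,000 pages/s once memory is allocated

# Growing the buffer pool by 60GB at the slow rate:
seconds = (60 * 1024 * 1024 // PAGE_KB) / slow   # 3,840 s, i.e. just over an hour
```

    At roughly 4,000 page allocations per second, something in the OS page supply path is costing on the order of a quarter millisecond per 4KB page, which is consistent with the fragmentation/zeroing suspicion rather than any disk limit.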

    This could be why my test system showed no difference between Large Page allocation and the conventional memory manager. The test system had only a single instance of SQL Server, few other running processes, and tests were usually conducted from a clean boot. Even if I had changed SQL Server memory up and down several times, nothing would have fragmented the physical memory? On top of that, the nature of the queries could have consumed more cycles than the underlying memory address translation for either 4K or large pages?

    There are some implications of using lock pages in memory, which automatically enables large page allocations if the conditions are met. First, memory is allocated on SQL Server startup. Second, the operating system cannot page locked memory. So the assumption is that the remaining memory is sufficient for the OS and other user processes. A side effect is that Task Manager only reports working set, and not the private memory of the locked pages.

    I am speculating that if a SQL Server instance with locked pages were stopped, an attempt to restart it later could fail if the OS cannot allocate sufficient contiguous 2M blocks, which would then require rebooting the system. So the locked pages option should only be used for servers dedicated to SQL Server. Careful planning of memory settings is needed for multi-instance.

    Below is a good starting point on this topic


  • SSD and Internal SQL Server Tuning Strategies

    Microsoft does not like to expose meaningful tuning parameters in SQL Server. In part, I understand this because in the past I have seen really bad settings used in other products that do expose more tuning parameters, based on seriously mistaken understanding of the inner workings of the database engine. My own investigations have also shown that static settings do not have adequately broad general validity, hence more complex dynamic strategies are necessary to be effective.

    With SSD at sufficient level of maturity and cost viability, it is possible to build immensely powerful storage within an acceptable price for high-value database systems. (Whether storage system vendors will allow this is an entirely different matter) My preference is to make judicious use of the characteristics of NAND, specifically the lower cost of MLC versus SLC, and accepting the lower write endurance of MLC over SLC. (SLC NAND also has much better write performance over MLC) In certain situations, the database will contain data for an extended period of time, which is another way of saying the write load relative to capacity is low, hence MLC write endurance is perfectly adequate.

    What could potentially exceed MLC write endurance is tempdb activity. Internally, SQL Server employs rules on how much memory to allocate for hash and sort operations prior to spilling to tempdb. In SQL Server 2008 or so, the estimated execution plan would show IO cost beyond a certain point determined by system memory and degree of parallelism, following a formula based on the excess size. However, I did not ascertain whether the actual execution followed the spill rules evident in the estimated plan.

    For a database on an MLC SSD storage, the preferred tuning strategy might then be to favor discarding cache, prioritizing memory for temporary data, to avoid a write that will just be discarded afterwards. Reloading permanent data for subsequent queries is fast for both random and sequential IO. Of course, this strategy is specific to SSD MLC storage. Expensive SLC can support either strategy. Some other future nonvolatile storage might not have restrictive endurance limits. On hard disk storage, the strategy should be to discard pages taken from a contiguous sequence, i.e., a scan. Priority on retention being given to index upper levels, and randomly loaded pages.

    None of these should be normal exposed tuning parameters. However, the SQL Server engine should understand the nature of different types of storage media. It may be possible via some future mechanism for the OS to know whether storage media is HDD or SSD. But it may not know if it is MLC or SLC, so we might have file location directives on whether HDD or MLC performance strategies should be employed.

    There is one other option. Samsung described a trick for MLC, allowing certain internal operations to use a portion as SLC! Recall that the NAND storage cells are just that, suitable for either SLC or MLC. Something in the logic portion of the NAND chip must understand SLC or MLC voltage levels. Apparently there is little overhead to add SLC logic to an MLC chip? This could be incredibly useful if it were also exposed to the OS? Perhaps we could create two partitions on an MLC/SLC capable SSD, one partition as MLC and the second as SLC? When the device is near the MLC write endurance limit, the entire device could then be set to SLC-only mode?

  • Hardware update Aug 2013

    Intel Xeon E3 12xx v3 - Haswell 22nm
    Intel Xeon E3 12xx v3 processors based on Haswell 22nm came out in Q2 2013. Dell does not offer this in the PowerEdge T110, holding to the Ivy Bridge 22nm E3-12xx v2 processors and below. The HP ProLiant ML310e Gen8 v2 does offer the Intel E3-12xx v3 processor.

    Is there a difference in performance between Sandy Bridge (32nm), Ivy Bridge (22nm) and Haswell (22nm)?
    Ideally, as far as SQL Server is concerned, we would like to see TPC-E and TPC-H benchmarks, but very few of these are published, and almost never for single socket systems. The other benchmark is SPEC CPU integer, but we must be very careful to account for the compiler. If possible, use the same compiler version, but there are usually compiler advances between processor generations. In general, as far as SQL Server is concerned, discard the libquantum result and look only at the others. It is possible to find Sandy Bridge and Ivy Bridge Xeon E3-1220 3.1GHz (original and v2) results on a matching compiler, which seem to show about a 5% improvement. The only result for v3 is on the next Intel C++ compiler version, showing about a 10% gain, so we do not know what can be attributed to the processor architecture versus the compiler.

    In any case, it would be nice if Dell would ditch the external graphics, using the Intel integrated graphics in v3. I know this is a server, but I use it as a desktop because it has ECC memory.

    Intel Xeon E5 26xx and 46xx v2 - Ivy Bridge 22nm - in Sep 2013
    Intel Xeon E5 26xx and 46xx v2 processors based on Ivy Bridge 22nm with up to 12 cores, supporting 2 and 4 socket systems respectively, should come out soon (September), superseding the original Xeon E5 (Sandy Bridge 32nm). The 2600 series will have a 12-core 2.7GHz, a 10-core 3GHz and an 8-core 3.3GHz at 130W. The general pattern is that E5 processors will follow E3 and desktop by 12-18 months?

    Intel Xeon E7 v2? - Ivy Bridge 22nm - in Q1 2014
    There will be an E7 Ivy Bridge with up to 15 cores in Q1 2014 for 8 socket systems, replacing Westmere-EX. I am not sure if it will be glue-less. The current strategy is that there will be an E7 processor every other generation?

    Storage Systems

    EMC VNX2 in Sep 2013?
    VNX2 was mentioned as early as Q3 2012. I thought it would come out at EMC World 2013 (May). Getting 1M IOPS out of an array of SSDs is not an issue, as 8(NAND)-channel SATA SSDs can do 90K IOPS. Similarly, revving the hardware from 1-socket Westmere-EP to 2-socket Sandy Bridge EP poses no problems. Perhaps, however, changing the software stack to support 1M IOPS was an issue?
    EMC Clariion used Windows XP as the underlying OS. One might presume VNX would be Windows 7 or Server? Or would EMC have been inclined to unify the VMAX and VNX OSs?
    In any case, the old IO stack intended for HDD arrays would probably be replaced with NVMe, with much deeper queues, designed for SSD. It would not be unexpected that several iterations were required to work out the bugs for a complex SAN storage system?

    IBM FlashSystem 720 and 820: 5/10TB SLC, 10/20TB eMLC (raw capacity 50% greater, with or without RAID), 4x8Gbps FC or 4x40Gbps QDR InfiniBand interfaces.

    HP MSA 2040 with four 16Gbps or 8Gbps FC ports.

    I still prefer SAS in direct attach storage, or if it must be a SAN, then InfiniBand.
    FC even at 16Gbps is just inconvenient in not properly supporting multi-lane operation.

    Storage Components

    Crossbar made a news splash with the announcement of Resistive RAM (RRAM or ReRAM) nonvolatile memory, with working samples from a production fab partner. Products should be forthcoming. Since this is very different from NAND, it would require a distinct PCI-E or SATA interface to a RRAM controller, analogous to the Flash controllers for NAND.

    see Crossbar-RRAM-Technology-Whitepaper-080413

    Current thought is that NAND Flash technology may be near its effective scaling limits (increasing bit density). Any further increase leads to higher error rates and lower endurance. My view is that for server products, 25nm or even the previous generation is a good balance between cost and endurance/reliability. The 20nm technology should be the province of consumer products. Some companies are pursuing Phase-change Memory (PCM). Crossbar is claiming better performance, endurance and power characteristics for RRAM over NAND.

    Seagate lists 1200GB and 900GB 10K 2.5in HDDs, along with an enterprise version of the 7200 RPM HDD at 4TB in the 3.5in form factor. HP lists these as options on their ProLiant servers. Dell too.
    I would think that a 2TB 2.5in 7.2K disk should be possible?

    Dell HDD pricing:
    7.2K 3.5in SATA 1/2/4TB $269, 459, 749
    7.2K 3.5in SAS 1/2/3/4TB $369, 579, 749, 939
    10K 2.5in SAS 300/600/900/1200GB $299, 519, 729, 839
    6Gbps MLC SAS 800GB/1.6TB $3499, 6599

    Samsung described the idea of using a small portion of an MLC NAND as SLC to improve write performance in certain situations. So apparently a NAND designed as MLC can also be used as both SLC and MLC, perhaps on a page or block basis. I am thinking this feature is worth exposing?

    The Samsung 2013 Global SSD Summit was in July. Video is on YouTube; I cannot find a pdf. PCI-E interface in the 2.5in form factor, i.e. NVMe. Tom's HWG seems to have the best coverage.

    Supermicro is advertising 12Gbps SAS in their products, presumably the next generation of servers will have it.


    There is a company with an SSD product attaching via the memory interface. There is a huge disparity in characteristics between DRAM and NAND, such that I would have serious concerns. The Intel Xeon E5 2600/4600 processors have 40 PCI-E gen 3 lanes, capable of supporting 32GB/s of IO bandwidth, so I do not see the need to put NAND on the memory channel.

  • Load Test Manifesto

    Load testing used to be a standard part of software development, but not anymore. Now people express a preference for assessing performance on the production system. There is a lack of confidence that a load test reflects what will actually happen in production. In essence, it has become accepted that the value of load testing is not worth the cost and time, and perhaps that there is no value at all.

    The main problem is the load test plan criteria: excessive focus on the perceived importance of “real behavior” instead of the factors that actually matter. Real behavior is frequently believed to mean the random intervals between user actions, and sometimes raw database size. What actually matters is the execution plan produced by the query optimizer, which is determined by three factors:
      1) the formulas used for the cost of component SQL operations,
      2) the data distribution statistics, and
      3) the procedure/SQL compile parameters.

    A second deficiency is treating the load test as a formality to validate that the application designed to “such and such” principles will meet arbitrary performance expectations instead of using the load test harness to investigate performance.

    A third problem is neglecting to conduct a performance unit test, i.e., single queries without other load.

    With an understanding of the database storage engine, including the query optimizer, a load test can correctly reflect actual production system performance characteristics, and more importantly identify potential serious issues.

    The software development discipline has established a number of principles (significantly under the umbrella term Agile) that are very important for many reasons, almost none of which have more than a nebulous connection to database performance.

    Finally, while the actual business requirement might be for a specific load capability, the purpose of load testing is also to investigate. A load test may indicate that performance does not meet the objective. To assess the nature of the problem, it is necessary to determine whether the problem is at the component level, or in the ability to achieve throughput due to concurrency issues.

    Query Optimizer - Cost Based Optimization

    Almost all (?) modern database engines use cost based optimization, as opposed to rule based optimization. The details of the SQL Server query optimizer formulas are discussed in QDPMA Query Optimizer and by others elsewhere (Paul White on SQLBlog). The most significant element in the SQL Server query optimizer formulas is the relation between the key lookup cost (following an index seek) and the table (or index) scan cost.

    The cost of a key lookup is dominated by the IO component of 0.003125 (=1/320). The incremental cost of a scan operation is also mostly in the IO component at 0.00074 (=1/1350) per page. The SQL Server query optimizer uses a model of the IO system based on a capability of 320 IOPS for non-sequential page access and 1,350 pages/sec (=10,800KB/sec) for scan operations. Of course, the absolute numbers are not important, only the ratio between IOPS and bandwidth. The query optimizer always assumes there is an IO element in the plan cost regardless of the amount of memory or whether the object is already in memory. There is no consideration for the actual random and sequential IO capability of the storage system.

    The decision between using a nonclustered index seek followed by key lookup versus a scan operation is determined by the ratio of the number of pages in the scan to the number of rows in the key lookup. For a non-parallel execution plan, when the CPU components of the plan are included, the effective pages-to-rows ratio may be about 3.5:1. In a parallel plan at high DOP, the CPU components are reduced and the cost is determined almost entirely by the IO components, so the pages-to-rows ratio approaches 4.22 (1350/320).
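    To make the arithmetic concrete, here is a small sketch (my own illustration using the cost-model constants quoted above, not any official Microsoft code) of the IO-only seek-versus-scan crossover:

    ```python
    # Cost-model constants described above (SQL Server plan cost units).
    KEY_LOOKUP_IO = 1 / 320      # 0.003125 per row: the random-IO model (320 IOPS)
    SCAN_IO_PER_PAGE = 1 / 1350  # ~0.00074 per page: the sequential model (1,350 pages/sec)

    def io_cost_key_lookup(rows):
        # IO component of an index seek + key lookup plan
        return rows * KEY_LOOKUP_IO

    def io_cost_scan(pages):
        # IO component of a table (or index) scan
        return pages * SCAN_IO_PER_PAGE

    # IO-only crossover: the scan becomes cheaper when pages < rows * 1350/320,
    # i.e. a pages-to-rows ratio of about 4.22.
    rows = 1000
    crossover_pages = rows * (1350 / 320)
    print(crossover_pages)  # 4218.75
    ```

    At DOP 1 the CPU components of the plan pull the effective ratio down to roughly 3.5:1, as noted above.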

    Others have speculated that the conditions for using an index could be based on the percentage of rows selected, or perhaps on comparing the logical IO count between the key lookup and the scan. Even a cursory examination of the SQL Server query optimizer shows that neither is true. It is only a coincidence that for a b-tree depth of 3 or 4, the true cross-over from index + key lookup to scan occurs near the logical IO cross-over point. A simple test on a table with b-tree depth 5 or higher shows that logical IO is not a consideration. It can also be shown that the key lookup to scan cross-over occurs at the same point for both clustered index and heap tables, even though the key lookup to a heap is only 1 logical IO per row.

    It also so happens that the loop join operation is essentially the same as a key lookup such that the SQL Server query optimizer uses the same cost formula and may use the same (loop join) symbol depending on the version. At some point, the execution plan shifts from a loop join with index seek on the inner source to a hash join with a scan. This is governed by the same set of formulas as for key lookup versus scan, with the difference being the hash join operation.

    The main implication of this is that cardinality is important, along with the average row size of key tables in the test database. Both meanings of cardinality apply: the uniqueness of a column within a table, and the number of rows in the relation between tables.

    The second element is the average row size. This is to ensure that the table has a reasonably correct number of pages relative to the number of rows.

    The absolute size of the database or an individual table is not particularly important in a transaction processing system where the expectation is that the critical queries are properly indexed. The important criteria are the cardinality and the page to row ratio.

    Data Distribution Statistics

    The previous reference to cardinality did not elaborate on distribution. Obviously we do not expect uniform cardinality in a database, e.g. every customer having exactly 10 orders, and each order having exactly 10 line items. Not only does the actual data distribution matter, but also the data structure used by SQL Server to represent data distribution statistics. Other important aspects are the technical details of random page sampling, and when statistics are resampled.

    The data structure used to represent data distribution is shown below. There are three parts: a header with scalar values, the density vector for overall distribution with a successively increasing number of keys, and finally the histogram at the end. There can be up to 200 steps for keys for which there is an equal-rows value. The distribution between keys is handled with distinct and total range-row values, from which the average range rows can be calculated.

    The purpose of the histogram is to attempt to handle skewed data distribution. Presumably one would try to use the step key values for the skewed values that are not successive. Obviously there are limitations to how much skew this can handle, but a fixed-size data structure can only do so much.

    If the impact of data distribution were understood before the database became populated, then there are strategies that could be adopted to ensure good statistics are possible and help in ensuring consistently good execution plans. Otherwise, we can only rely on the inherent capabilities of SQL Server.

    Next, the other aspects: sampling and percentage. From sampling theory, we know that with a true random sample, the accuracy is expected to be the square root of the number of samples at a given value, with relative accuracy the inverse of the square root. There are some indications that SQL Server follows this principle in determining the sampling percentage.

    The difference is that SQL Server samples all rows from randomly selected pages (including the first and last?). The reason for this is obvious: it is much less expensive. A large chunk of the work is getting the page, so there is only a little more work to sample the column value for all the rows in the page. To reduce the impact of correlation between page and value, it tries to use a nonclustered index for which the column is not the lead key; see Elisabeth Redei.
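    The sampling-theory rule of thumb above can be sketched as follows (my own illustration; SQL Server's actual sampling-rate logic is not publicly documented):

    ```python
    import math

    def relative_sampling_error(samples_at_value):
        # Rule of thumb: absolute error ~ sqrt(n), so relative error ~ 1/sqrt(n)
        return 1 / math.sqrt(samples_at_value)

    # 10,000 sampled rows at a given value -> ~1% relative error;
    # 100 sampled rows -> ~10% relative error.
    print(relative_sampling_error(10_000))  # 0.01
    print(relative_sampling_error(100))     # 0.1
    ```

    Note that because SQL Server samples whole pages rather than individual rows, correlation between page placement and column value can make the effective error worse than this idealized estimate.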

    When an index is built or rebuilt, the Rows Sampled is always 100%. On an UPDATE STATISTICS call without specifying FULLSCAN, whether automatic or manual, the sampling can be seriously off on the high side. Below is the fullscan statistics for the TPC-H Lineitem table, nonclustered index on Partkey.

    Below is the distribution after UPDATE STATISTICS at default (re-)sampling.

    The full scan has correct values for EQ_Rows, while the partial sample is too high. Interestingly, in both cases, the Avg Range Rows is correct. This occurs in both SQL Server 2008R2 and 2012. Earlier versions did not? Below is the execution plan detail with both estimated and actual row counts, based on fullscan statistics and the default partial sample, for one of the equal rows.

    A significantly incorrect row estimate may not have a negative impact on the execution plan operation at the source, because it would take a much larger error to change the plan from an index seek to a scan. However, it is more likely to have negative consequences further on in the execution plan of a more complex query, as the row estimate error propagates.

    The final aspect of statistics is resampling. In a brand new database (with no statistics), statistics are automatically created when a column is referenced in a search or join condition. Indexes have statistics, and there can also be statistics on columns. Thereafter, statistics are marked for recompute when certain thresholds are exceeded. The recompute occurs on the next query referencing the impacted columns. The statistics recompute thresholds used to be at the first 6 and 500 rows modified, and every 20% thereafter. After a recompute, any affected execution plans are also marked for recompile. (I am not sure if these thresholds have since changed, or of the exact recompute and recompile sequence, so refer to other documents on this matter.)
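    The classic threshold rule as I understand it can be sketched as below (illustrative only; the exact rules are version-dependent, as noted above):

    ```python
    def stats_recompute_threshold(table_rows):
        # Classic rule of thumb: statistics are invalidated after ~500
        # modifications for small tables, or 500 plus 20% of the row count
        # for larger tables.
        if table_rows <= 500:
            return 500
        return 500 + 0.20 * table_rows

    print(stats_recompute_threshold(100))        # 500
    print(stats_recompute_threshold(1_000_000))  # 200500.0
    ```

    The practical point: on a million-row table, around 200,000 rows must change before statistics (and dependent plans) refresh, which is why the below-threshold scenario discussed later can persist for some time.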

    As a side note, permanent tables and temp tables are subject to statistics create and recompute. Table variables do not have statistics, and the execution plan assumes 1 row and 1 page for the table variable. In SQL Server 7 and 2000, the execution plan recompile following a statistics recompute could be more expensive than the actual execute for complex queries involving few rows.

    This may have been the reason table variables were introduced. But the explanation given for table variables was convoluted and probably resulted in incorrect strategies being adopted on whether to use temp tables or table variables. From SQL Server 2008 on, execution plan generation (query optimization) seems to be far more efficient, so that this is less of a problem. I am of the opinion that the 6-row statistics recompute and plan recompile never helps, but the 500-row and subsequent threshold points are useful.

    The statistics recompute points introduce interesting effects on SQL Server performance. Consider the following scenario. A particular column has few distinct values (200 or less, the step limit of the statistics data structure), perhaps all 0. Statistics are currently accurate. A query modifies a number of rows to a new value that did not previously exist. The next query specifies this new value on the column as a search argument with joins to other tables.

    If a large number of rows are modified in the first query, then statistics are recomputed and a good execution plan is produced for the second query. If a smaller number of rows are modified (below the recompute threshold), then the second query generates an execution plan based on statistics saying with certainty that zero rows meet the search criteria. The execution plan will then start with a table access that applies the search criteria. The subsequent operations are then based on the assumption that zero rows come from the initial operation.

    The execution plan always shows 1 row when statistics say 0 rows. Say that 1 row from the first table joins to many rows in the next table. At this point the execution plan will still show an estimate of 1 row. So when the estimate is 0, 1 is always shown, but the propagated estimates are based on 0. This may or may not lead to a poor plan, but it could lead to a disastrous plan.

    The intuitive reasoning may have been that more rows modified in the first query requires more work in the second, but the internal working of SQL Server renders such intuitive reasoning completely irrelevant. There is nothing fundamentally wrong with the fact that SQL Server has statistics recompute thresholds, or what the threshold values are. The point is that we must consider this in our database.

    There are many more interesting effects (if there were not a problem, it would not be interesting) due to the nature of statistics. This is only an overview of the points most related to load testing. The full details of statistics in SQL Server are important for many reasons. Search for the Technical Article “Statistics Used by the Query Optimizer in Microsoft SQL Server 2008” by Eric Hanson and Yavor Angelov, with Lubor Kollar as contributor. Other SQL Server statistics material is available from SQL CAT and Benjamin Nevarez.

    Compile Parameters and Variables

    The final link in the chain, after the query optimizer and data distribution statistics, is the execution plan compile parameters. In a stored procedure, the actual parameter values are used when the stored procedure is compiled. Towards the end of the sqlplan file, there is a node for ParameterList. A stored procedure may also have internally declared variables; these values are assumed to be unknown. The estimate for an unknown is based on the density value in the density vector section. The estimate for known parameter values is taken from the histogram, either the equal value or range rows as appropriate.

    This has deep implications. Just because the test database was populated with a good choice of cardinality, page to row ratio and even good data distribution skew, this does not mean a load test will generate self-consistent performance throughput results or results consistent with what will happen in production, which may also be inconsistent.

    Obviously we expect that there could be differences in the execution plans for procedures compiled with parameters of known low estimated row count compared to parameters of known high estimated row count. As stated earlier, the consequence may not be in the immediate operation, but rather further on in the execution plan, as a result of differences between estimated and actual row counts.

    There are some measures for handling this in SQL Server. One is the WITH RECOMPILE, which directs a recompile on execute. This means the current parameters are used for optimization, which leads to as good a plan as the query optimizer can produce given its capability, at the expense of requiring compile for each execute. Another measure is to direct the optimization based on a specified parameter value instead of the actual parameter value. A variation on this is to direct optimization for unknown, which should be the same as declaring a variable, and using the variable in the query, set equal to the parameter.

    The Load Test Plan

    With knowledge of the significant factors that affect database performance, it becomes clear how the load test plan should be constructed. First, cardinality on columns within a table and between tables should be defined. Next, a reasonable estimate is needed for the average row size in bytes. In a well-designed database, there should be few nullable columns in the critical tables, so the main variable is the average number of characters in the variable-length string (or binary) columns. With this, the populated test tables should have the correct page-to-row ratio, which influences the execution plan choice between key lookup and scan, or loop versus hash join.

    In the old days, there might have been valid reasons why it was impractical to populate the test database to the full anticipated production size. Today we could put two 1TB SSDs in a large laptop if we wanted to. If the developer complains about space, then he/she probably has too much non-project content (hint).

    Still, if it were desired to anticipate future problems (most of which will be due to an execution plan change) without a full-size database, this could be accomplished by scaling cardinality in relation to size. For example, if it is anticipated that on the production server a particular index would at some point select 100,000 rows, and the table size would be 2.8GB (350K pages), the execution plan could be modeled by scaling the selectivity to 10K rows and the table to 280MB (35K pages). For good measure, we could scale the buffer cache size with the database size as well. There might still be discrepancies due to differences in the number of logical processors between test and production, unless MAXDOP is restricted.
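    The scaling idea above can be sketched as follows (my own arithmetic; `scale_model` is a hypothetical helper for illustration, not a SQL Server feature):

    ```python
    def scale_model(rows_selected, table_pages, factor):
        # Shrink selectivity and table size by the same factor so the
        # pages-to-rows ratio, and hence the seek-vs-scan decision, is preserved.
        return rows_selected // factor, table_pages // factor

    prod_rows, prod_pages = 100_000, 350_000   # anticipated production (2.8GB table)
    test_rows, test_pages = scale_model(prod_rows, prod_pages, 10)
    print(test_rows, test_pages)               # 10000 35000

    # Same pages-to-rows ratio (3.5) in both, so the same plan choice is expected.
    print(prod_pages / prod_rows == test_pages / test_rows)  # True
    ```

    The key design point is that both selectivity and table size are scaled by the same factor; scaling only one of them would change the pages-to-rows ratio and could flip the plan.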

    After defining cardinality and the table page-row ratio, the next step is data distribution. This point is interesting because of the data structure that SQL Server maintains to model data distribution, described earlier with header, vector and histogram portions. Just because two sets of data have the same distribution profile does not mean that SQL Server will generate the same distribution statistics for both.

    The critical aspect is whether skewed data distribution is accounted for in the same manner between test and production. If the skewed key values are isolated, then the SQL Server statistics could try to account for up to 200 values with the equal steps. If the skewed values are adjacent, SQL Server could try to lump these into the range rows. In any case, the SQL Server statistics data structure can only do so much to accurately model arbitrarily populated, heavily skewed data distributions.

    We could go to a great deal of effort so that both the test and production databases have not just identical data distribution, but also identical data distribution statistics as in the output of DBCC SHOW_STATISTICS. Alternatively, with our understanding of the SQL Server data distribution statistics data structure, we could come up with a strategy so that the SQL Server statistics will be mostly correct in both systems.

    One strategy is to directly manage the assignment of key values. For example, assign large customers an id from a low range, medium customers an id from a middle range, and small customers an id from a high range; then the statistics data structure can capture this reasonably well. Other strategies are possible, such as using filtered indexes, along with writing the SQL so that the optimizer knows it can use the filtered index, which happens to provide more accurate statistics. It is even possible to falsify statistics, but that is another discussion.

    Finally, the load test plan must account for differences in execution plans due to compile parameters, along with variations in data skew on the actual parameters. The first question is how many different execution plans are possible depending on the compile parameters. If there is more than one possible plan, then how many execution plans are necessary? If only one plan is necessary, then we can just force whatever compile parameter is necessary to generate the one good plan.

    Let us assume that with proper database design, thoughtfully constructed SQL and indexes, the query optimizer produces a good execution plan when the compile parameter has a low estimated row count and the actual is in fact low, and also produces a good plan when the compile parameter has a high estimated row count and the actual is in fact high. If either the plan for low estimated rows is not suitable for high actual rows, or the plan for high estimated rows is not suitable for low actual rows, then more than one plan is necessary. If the cost of compile is low relative to the cost of execution, then an acceptable solution is to apply the WITH RECOMPILE hint. Otherwise, we need to devise a strategy so that the necessary distinct execution plans exist and can be reused, with the correct plan used in the appropriate situation.


    A final point: no data is provided here demonstrating that the typical delays in user actions, along with random variation, do not affect system performance. In several tests, it has been shown that the impact of reasonable user delays between calls is simply more concurrent open connections. This may impact application server requirements such as memory, but the impact on the database server is believed to be minor. So there has been little interest in investigating this matter further.

    Unit Testing for Performance

    Not many software development projects specify a unit test for performance. The requirement is only for throughput and response time under load. Of course, this is a rather trivial exercise, so one could argue that it is not necessary to list such minor items. If the presumption is that the application coming from development as-is will meet performance objectives without rework, then the load test is a mere formality and a unit test for performance would have no purpose.

    If on the other hand the load test shows that there are issues, then what is the cause? Are all the individual elements good and the problem only a matter of scaling throughput with multiple concurrent streams of traffic? A comparison of the unit test with the load test would provide this information transparently. The performance unit test can be accomplished simply by running the complete sequence of calls using the load test harness set to a single stream.


    The point of interest in this discussion is not simply an explanation of the key factors that cause apparently inexplicable, widely differing results in load tests as well as in the production environment. Instead of just attempting to reproduce what might happen in production, we see that it is actually possible, with the right load test strategy, to avert problems in production before they happen.

    This is unlikely to be accomplished by attempting to build the test plan on general intuitive principles that would appear to reflect the real world. The SQL Server query optimizer is not built on an intuitive perception of the real world, but rather on cost based optimization with data distribution statistics. By targeting the critical aspects of the query optimizer, we can replicate any possible execution plan, which in turn implies almost any possible problem can be replicated as well.

    But it is also possible to work strategies for averting problem execution plans into the SQL code as well. This might be controlling the stored procedure compile parameters and identifying critical (index) statistics that need non-default (probably FullScan) update maintenance. More sophisticated strategies might involve explicitly directing primary key assignment or possibly even using false statistics (can a lie correct a wrong?).



    Some additional items.
    This applies to transactional databases in which much of the active data is not in memory, and is particularly important when residing on hard disk storage. When the estimated row count for a key lookup or loop join inner source access is less than 25 rows, SQL Server issues synchronous disk IO. For over 25 rows (I am not sure if the cutoff is at or over 25 rows) the disk IO is asynchronous. Asynchronous IO at higher queue depth will yield better performance on hard disk storage. Flash/SSD storage should have sufficiently low access latency that this effect is not as critical.

    It used to be that updates using nonclustered indexes generated synchronous IO instead of asynchronous regardless of the estimated row count. I am not sure if this has been fixed in more recent versions?

    In tables with varchar(max) fields that are stored outside of normal pages, a pointer to the actual page is stored in the row instead (could someone provide a reference for this?). Disk IO for LOB pages is necessarily synchronous, because the in-row page and row must be accessed first to determine the pointer? (It is too bad SQL Server does not have a feature for an included column on just the pointer and not the full column.)

  • Enterprise Storage Systems - EMC VMAX

    I generally do not get involved in high-end SAN systems. It is almost impossible to find meaningful information on the hardware architecture from the vendor. And it is just as impossible to get configuration information from the SAN admin. The high-end SAN is usually a corporate resource managed in a different department from the database.

    The SAN admin is generally hard set on implementing the SAN vendor doctrine of "Storage as a Service" and does not care to hear input on special considerations from the database team. In addition to unpredictable or poor performance, and sometimes both, it is often necessary to fight for every GB of storage space via email requests, filling out forms or endless meetings. This is especially ridiculous because storage capacity at the component level is really cheap. It only becomes a precious resource in a SAN.

    Still, I am expected to answer questions on what is wrong with SQL Server when there are performance problems against a magically all-powerful enterprise SAN, so this is my best understanding. The example I am using is the EMC Symmetrix line, but the concepts here could be applied to other systems if details were available.

    The EMC Symmetrix VMAX was introduced in 2009 using Intel Core2 architecture processors (45nm Penryn) with RapidIO fabric. A second generation came out in 2012, with the VMAX 10K, 20K and 40K models using Intel Xeon 5600 (32nm Westmere) processors. The predecessor to the VMAX was the Symmetrix DMX-4, which used PPC processors and a cross-bar architecture connecting front-end, memory and back-end units.

    The basic information here is from the EMC documents. Because the details on the internal architecture of the VMAX are not found in a single authoritative source, much of it has to be pieced together. Some of the assessments here are speculation, so anyone with hard knowledge is invited to provide corrections.

    VMAX (2009)

    The original VMAX architecture is comprised of up to 8 engines. Each engine is comprised of a pair of directors. Each director is a 2-way quad-core Intel Xeon 5400 system with up to 64GB memory (compared with 16GB for the CLARiiON CX4-960).


    Each director has 8 back-end 4Gb/s FC ports (comprised of quad-port HBAs?) and various options for the front-end including 8 x 4Gb/s FC ports.

    The engine with 2 directors has 16 back-end FC ports (2 ports making 1 loop) and can have 16 ports on the front-end in the FC configuration. Assuming 375MB/s net realizable throughput with 4Gbps FC, each director could support an aggregate of 3.0GB/s on both the front and back-end ports.

    In the full VMAX system of 8 engines (16 directors) with FC front-end configuration there are 128 x 4Gb/s FC ports on the front and back ends. Then in theory, the combined front-end and back-end bandwidth of the full system is 16 x 3.0GB/s (or 128 x 375MB/s) = 48 GB/s.
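    The arithmetic behind the 48 GB/s figure can be laid out as below (a theoretical aggregate under the assumed 375MB/s net per 4Gbps FC port, not a measured result):

    ```python
    # Full VMAX system bandwidth estimate, per the assumptions stated above.
    engines = 8
    directors_per_engine = 2
    fc_ports_per_director = 8   # back-end (or front-end) 4Gbps FC ports
    net_mb_per_port = 375       # assumed net realizable per 4Gbps FC port

    per_director_gb = fc_ports_per_director * net_mb_per_port / 1000
    system_gb = engines * directors_per_engine * per_director_gb
    print(per_director_gb, system_gb)  # 3.0 48.0
    ```

    As noted below, EMC documentation mentions around 10GB/s on 2 engines, which is in the same ballpark as this estimate's 12GB/s for 2 engines.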

    Of course, there is no documentation on the actual sequential (or large block) IO capability of the VMAX system. There is an EMC VMAX Oracle document mentioning 10GB/s on 2 engines (it is not clear whether this is the 2009 VMAX or the second generation VMAX).

    To support the above director, I would guess that the system architecture should have 6 x8 PCI-E slots. Based on a quad-port FC HBA, the 8 back-end ports require 2 x8 slots, and there are also 2 x8 slots for the front-end for any supported interface.

    Without discussing the nature of the interconnect between directors in an engine, and the Virtual Matrix Interface, I am supposing that each requires one x8 slot. The above diagram does show a connection between the two directors in one engine.

    So there should be 2 back-end, 2 front-end, 1 VM and 1 director-director x8 PCI-E slots in all. It could also be presumed that the slots are not connected through an expander, as this would result in an arrangement with unbalanced bandwidth.

    At this point I would like to digress to review the Intel Core2 system architecture. The original memory controller hub (MCH or chipset) for the 2-socket Core2 system was the 5000P in 2006, 1432-pins. The 5000P has 24 PCI-E lanes and the ESI, which is equivalent to 4 lanes. So this is clearly inadequate to support the VMAX director.

    5000P MCH

    In late 2007-early 2008 or so, late in the product life of the Core2 architecture processors, Intel produced the 5400 MCH chipset, codename Seaburg, with 1520-pins supporting 36 PCI-E lanes plus the ESI, equivalent to 4 PCI-E lanes.


    This MCH chipset was not used by any server system vendor, so why did Intel make it if there were no apparent customers? It is possible the 5400 MCH was built specifically to the requirements of the high-end storage system vendors. I mentioned this briefly in System Architecture 2011 Q3.

    The 5400 MCH can support 5 x8 PCI-E slots. I think this is done by using the ESI plus 1 x4 on the upstream side of the Enterprise South Bridge to support x8 on the downstream side. So there is something wrong with my estimate of the PCI-E slot count required for the VMAX engine.


    When the original EMC VMAX came out in 2009, I could find no documentation on the Virtual Matrix interface. I had assumed it was InfiniBand, as FC would not have been suitable on bandwidth or protocol support. Later I came across a slide deck illustrating VMI implemented with an ASIC connecting x8 PCI-E to RapidIO. The second generation VMAX specification sheets explicitly list RapidIO as the interconnect fabric.

    RapidIO is an open-standard switched fabric. In short, RapidIO has protocols for additional functionality that is not necessary in PCI-E, a point-to-point protocol. (Some of these may have been added to PCI-E in later versions?) RapidIO can "seamlessly encapsulate PCI-E". The other aspect of RapidIO is that the packet overhead is far lower than Ethernet layer 2, and even more so than Ethernet layer 2 plus layer 3 (IP) plus layer 4 (TCP), as there is no requirement to handle world-wide networks. The RapidIO protocol overhead is also slightly lower than that of PCI-E.

    The first version of serial RapidIO supported 1.25, 2.5 and 3.125 Gbaud, and x1 and x4 links. Version 2 added 5 and 6.25 Gbaud and x2, x8 and x16 links.

    The diagram below is for the original VMAX using two Xeon L5410 processors. I neglected to note the source, so some help on this would be appreciated.

    VMax Director

    In the diagram above, the VMI ASIC is connected by x8 PCI-E to the director system, and by 2 x4 RapidIO links to the interconnect. The RapidIO encoded line rate is 3.125GHz. The unencoded (8b/10b) data rate is 2.5Gb/s per lane, or 1.25GB/s bandwidth for the x4 connection in each direction. The bandwidth per connection cited at 2.5GB/s full duplex is the combined bandwidth of both directions on the RapidIO side.

    The bandwidth on the PCI-E side is 2.5Gbit/s raw per lane, or 2Gbps unencoded data (8b/10b), for 2.0GB/s on the x8 slot. This is the nominal bandwidth of the full PCI-E packet, including header and payload. The PCI-E packet overhead is 22 bytes.

    The net bandwidth that I have seen for disk IO on x8 PCI-E gen 1 is 1.6GB/s. I am not sure what the average payload size was for this. It could have been 512 bytes, the disk sector size commonly used. In any case, the packet overhead is much less than 20%, so there is a difference between the net achievable bandwidth and the net bandwidth after PCI-E packet overhead.
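The gap between nominal and achieved bandwidth can be estimated. A minimal sketch, using the 22-byte packet overhead and the 512-byte payload suggested above (both figures from the surrounding discussion; the true overhead varies with TLP configuration):

```python
def pcie_net_GBps(lanes=8, raw_gbps=2.5, payload=512, overhead=22):
    """Net PCI-E bandwidth after 8b/10b encoding and per-packet overhead."""
    unencoded = raw_gbps * 8 / 10 * lanes / 8      # 2.0 GB/s for gen 1 x8
    efficiency = payload / (payload + overhead)    # header amortization per TLP
    return unencoded * efficiency

print(round(pcie_net_GBps(), 2))  # ~1.92 GB/s from packet overhead alone
```

Packet overhead alone only brings 2.0GB/s down to about 1.92GB/s, so the 1.6GB/s observed in practice must be limited by something else (flow control, DMA scheduling, or the devices themselves).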

    The VMAX diagram above shows one x8 PCI-E for VMI and 4 x8 PCI-E for Disks (Back-end) and front-end channels (HBAs). The 4 BE and FE slots are labeled at 1.5GB/s each and 6.0GB/s for the set of four. Presumably this is the 4 x 375MB/s FC bandwidth, and not the PCI-E x8 bandwidth of 2.0 GB/s including packet overhead.

    A dedicated interconnect between the two directors in one engine is not shown. So this would represent a valid configuration for 5400 MCH, except that 4 x8 PCI-E should be to the MCH, and only 1 x8 on the ICH (ICH was the desktop I/O controller hub, ESB was the server version).

    The main observation here is that EMC decided it is a waste of time and money to continue building custom architecture in silicon when there are alternatives. It is better to use Intel Xeon (or AMD Opteron) components along with an open-standard fabric. There are ASIC and FPGA vendors that provide a base PCI-E to RapidIO interface design that can be customized. I am presuming the EMC VMI ASIC is built on this.

    Below is EMC's representation of the VMAX system, showing 8 engines (16 directors) interconnected via the Virtual Matrix.

    VMax Matrix

    The diagram is pretty, but conveys very little understanding of what it is. Knowing that the Virtual Matrix interface is RapidIO is all that we need to know. The Virtual Matrix is a RapidIO switch, or rather a set of RapidIO switches.

    Each of 16 directors is connected to the VM with 2 x4 RapidIO ports. A single switch with 128 (16x2x4) RapidIO lanes could connect the full VMAX system. A second possibility is two switches with 64 (16x4) RapidIO lanes. Each switch connects one x4 port on each director. Other possibilities with fewer than 64 lanes include 8 switches of 16 lanes, or some arrangement involving more than 1 switch between directors.

    IDT makes RapidIO switches and PCI-E to RapidIO bridges (not to mention PCI-E switches). There are other vendors that make RapidIO switches and I do not know the source for the EMC VMAX. The RapidIO switches are available with up to 48 lanes as shown below.

    I am not sure if there are any with 64 lanes? There is an IDT PCIe switch with 64 lanes in a 1156-pin BGA. IDT describes their 48-port RapidIO switch, capable of operating at 6.25Gbaud, as having 240Gb/s Throughput. So they felt it was more correct to cite the unencoded bandwidth, single direction, not the full duplex, and not the encoded data rate.

    The diagram below shows the full VMax system comprising 11 racks with the maximum disk configuration!

    VMax Full Config

    The center rack is for the VMax engines, the other 10 are storage bays. Each storage bay can hold up to 240 drives. There are 160 disk array enclosures, 64 directly connected, and 96 daisy chained. There are 8 VMax engines, with the disk enclosures in matching color.

    The 2009 VMAX only supported 3.5in drives initially? (I misplaced or did not keep the original VMAX documentation, oops) The back-end interface on both the original and second generation (!@#$%^&) VMAX is 4Gbps FC. The 3.5in disk drives are also FC. The 2.5in disk drives for the second generation VMAX are listed as SAS, so presumably the disk enclosure converts the external FC interface to SAS internally. There are flash drive options for both 3.5in and 2.5in, the 3.5in being FC and the 2.5in SAS?

    The mid-range VNX moved off FC disks in 2011. Perhaps the size of the VMAX with all 11 racks is beyond the cable limits of SAS? But why 4Gb/s FC and not 8Gbps? Is this to maintain compatibility with the previous generation DMX? I am inclined to think it is not a good idea to saddle a new generation with the baggage from the older generation. Perhaps in the next generation FC on the back-end would be replaced by SAS?

    VMAX Second Generation (2012)

    The second generation EMC VMAX employs the Intel Xeon 5600 series (Westmere-EP) processors with up to six cores. There are three series, the VMAX 10K, 20K and 40K. The complete system is comprised of one or more engines. There can be up to 8 engines in the 20K and 40K and up to 4 engines in the 10K.

    Each engine is comprised of 2 directors. A director is a computer system. The 10K director originally had a single quad-core processor; later versions have a single six-core processor. The 20K director has two quad-core processors. The 40K director has two six-core processors. Both the 10K and 20K (directors) have dual Virtual Matrix Interface (VMI or just VM?). The 40K (director) has quad-VM.

    It is very hard to find useful detailed SAN system architecture information. I came across the following from an EMC VMAX 40K Oracle presentation, which appears to be just an update of the original VMAX engine diagram to the second generation VMAX 40K.


    But notice that the interconnect between the two directors (2-socket servers) is labeled as CMI-II. CMI is of course the acronym for CLARiiON Messaging Interface, which in turn was once Configuration Manager Interface (prior to marketing intervention?). This makes sense. There is no reason to develop different technologies to perform the same function in the two product lines. So the separation between VNX and VMAX is that the latter has VMI to cluster multiple engines together.

    Along the same lines, does there need to be a difference in the chips to perform the CMI and VMI functions? It does not matter if the software stacks are different.

    To support the VMAX 40K director, there should be 2 x8 PCI-E slots each for both the front-end and back-end ports as before in the original VMAX. I am also assuming a single x8 PCI-E slot for the CMI-II. The difference is that the 40K director needs 2 x8 PCI-E slots to support 4 VM connections, each x4 RapidIO. This makes a total of 7 x8 PCI-E slots.

    The 2-socket Xeon 5600 system architecture is shown below with two 5520 IOH devices each supporting 36 PCI-E gen2 lanes for 72 lanes total, not counting the ESI (equivalent to PCI-E gen 1 x4).


    The full Xeon 5600 system can support 8 PCI-E gen2 x8 slots, plus 2 gen2 x4 (because the extra x4 on each IOH cannot be combined into a single x8?). So this time there are more PCI-E slots than necessary. Note also that all of these are PCI-E gen2 slots. The back-end FC on the 2nd generation VMAX is still 4Gb/s FC. The front-end FC can be 8Gbps FC. It could be that all FC HBAs in the second generation can support 8Gbps, just that the back-end ports operate at 4Gbps?

    Virtual Matrix and RapidIO

    The original VMAX used RapidIO at 3.125 Gbaud. After 8b/10b encoding, the unencoded data rate is 2.5Gbps per lane. In a x4 link, the combined data rate is 10 Gbit/s or 1.25 GByte/s per direction. As with modern serial protocols, data transmission is simultaneously bi-directional, so the bandwidth in both directions combined is 2.5GB/s full duplex.

    In a server system, citing full duplex bandwidth for storage is not meaningful because IO is almost always heavily in one direction (except for backups directly to disk). However, it should be pointed out that the bi-directional capability is immensely valuable because the primary stream is not disrupted by minor traffic in the opposite direction (including acknowledgement packets). Just do not confuse this with the full duplex bandwidth being a useful value.

    In a storage system, it could be legitimate to cite the full duplex bandwidth for the engine, because each engine could be simultaneously processing data in-bound from and out-bound to other engines. So the engine must be able to handle the full duplex bandwidth.

    Now considering the complete storage system, any traffic that leaves one engine must arrive at another engine. The total traffic is the sum of the bandwidth a single direction. So it is misleading to cite the sum total full duplex bandwidth. But marketing people can be relied upon to mislead, and we can trust marketing material to be misleading.

    The VMI ASIC bridges 8 PCI-E lanes to 8 RapidIO lanes. In the original VMAX, this was PCI-E gen 1 to RapidIO at 3.125 Gbaud. In the second generation VMAX with Westmere-EP processors, the PCI-E is gen 2 and the RapidIO is now presumed to be 6.25 Gbaud. PCI-E gen 1 is 2.5Gbps per lane and gen 2 is 5Gbps.

    I suppose there is a good reason RapidIO was defined at 3.125 Gbaud at a time when PCI-E was 2.5Gbps. Consider sending data from one system to another. In the first system, data is first transmitted over PCI-E (hop 1). A device converts the data to be transmitted over RapidIO (hop 2). At the other end, a device converts back for transmission over PCI-E (hop 3) to the final destination.

    It would seem reasonable that if all interfaces had equal data rates, there would be some loss of efficiency due to the multiple hops. So for lack of hard analysis I am just speculating that there was a deliberate reason in the RapidIO specification.

    Another speculation is that it was known that RapidIO would be interconnecting systems with PCI-E, and the extra bandwidth would allow encapsulated PCI-E packets on RapidIO with the upstream and downstream PCI-E ports to be running at full bandwidth?

    The point of the above discussion is that the bandwidth on the RapidIO of the VMI ASIC is less material to the storage professional. The bandwidth on the PCI-E side is closer to net storage IO bandwidth.


    In the table below, I am trying to make sense of the Virtual Matrix bandwidth of the original VMAX, and the second generation VMAX 10K, 20K and 40K. The original VMAX 2009 had 3.125 Gbaud RapidIO, so each x4 link had 1.25GB/s unencoded bandwidth per direction. Each director has dual Virtual Matrix, so the combined full duplex bandwidth of 4 VM for the engine is 10GB/s unencoded. The combined full duplex bandwidth on the PCI-E side is 8GB/s per engine.

                        VMAX (2009)   VMAX 10K     VMAX 20K   VMAX 40K
    Processor           Core2         Westmere     Westmere   Westmere
    Sockets             2             1            2          2
    Cores               4             4-6          4          6
    VMI per director    2             2            2          4
    VMI per engine      4             2            4          8
    RapidIO rate        3.125 Gbaud   6.25 Gbaud   ?          6.25 Gbaud
    Unencoded (8b/10b)  2.5 Gbaud     5 Gbaud      ?          5 Gbaud
    x4 link             1.25 GB/s     2.5 GB/s     ?          2.5 GB/s
    x4 link bi-dir      2.5 GB/s      5 GB/s       ?          5 GB/s
    Engine VM BW        10 GB/s       50 GB/s?     24 GB/s    50 GB/s
    System VM BW        80 GB/s?      200 GB/s?    192 GB/s   400 GB/s

    The second generation VMAX should be on RapidIO at 6.25 Gbaud and PCI-E gen 2 at 5Gbps. The VMAX 40K specification sheet cites Virtual Matrix bandwidth of 50GB/s for the engine and the full system with 8 engines VM at 400GB/s. The VMAX 20K specification sheet cites VM bandwidth of 24GB/s for the engine and the full system with 8 engines VM at 192GB/s. The VMAX 10K specification sheet cites the full system (4 engines) VM bandwidth at 200GB/s, implying a single engine at VM bandwidth of 50GB/s.

    Given that the VMAX 40K has twice as many Virtual Matrix interfaces and double the signaling rate, the cited VM value of 50GB/s can only mean the bi-directional encoded rate of 6.25 Gbaud over 8 x4 links on the RapidIO side. The VMAX 20K value of 24GB/s is perplexing. Why is it not the full duplex encoded rate of 25GB/s for 6.25 Gbaud over its 4 x4 links?

    The VMAX 10K system value of 200GB/s is also perplexing. There are only 4 engines maximum, meaning each engine would be 50GB/s. The other documents or slide decks indicate the VMAX 10K director is dual VMI? So the VM bandwidth should be 25GB/s full duplex encoded?

    On the assumption that the VMAX 40K engine has 50GB/s Virtual Matrix encoded full duplex bandwidth, then the unencoded bi-directional bandwidth is 40GB/s on the RapidIO side, and the unencoded bi-directional bandwidth is 32GB/s on the PCI-E side, corresponding to 4 x 8 PCI-E gen 2 lanes. So the useful bandwidth for the engine VM is 16GB/s single direction.
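The chain of conversions in the paragraph above can be worked through explicitly. A sketch for the VMAX 40K engine, assuming 8 x4 RapidIO links at 6.25 Gbaud bridged to 4 x8 PCI-E gen 2 slots, as the discussion above presumes:

```python
links_per_engine = 8          # 4 VMI per director x 2 directors (VMAX 40K)
gbaud = 6.25                  # presumed RapidIO signaling rate per lane

encoded_one_dir = gbaud * 4 / 8 * links_per_engine    # GB/s over the x4 links
encoded_full_duplex = 2 * encoded_one_dir             # the cited 50 GB/s figure
unencoded_full_duplex = encoded_full_duplex * 8 / 10  # after 8b/10b encoding

pcie_lanes = 32               # 4 x8 PCI-E gen 2 slots
pcie_full_duplex = 2 * pcie_lanes * (5 * 8 / 10) / 8  # gen 2: 5 Gb/s raw/lane
single_direction = pcie_full_duplex / 2               # the useful number

print(encoded_full_duplex, unencoded_full_duplex,
      pcie_full_duplex, single_direction)             # 50.0 40.0 32.0 16.0
```

Each step strips away one layer of the marketing figure: 50GB/s encoded full duplex becomes 40GB/s of data full duplex on the RapidIO side, 32GB/s full duplex on the PCI-E side, and 16GB/s of single-direction bandwidth.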



    Bandwidth Calculation and Speculation

    For lack of hard data on what the VMAX IO bandwidth capability actually is, I will speculate. The original VMAX director could have 8 x 4Gbps FC ports on both front-end and back-end. As discussed above, based on 375MB/s for each 4Gbps FC, the director FE and BE bandwidth is then 3.0 GB/s.

    I will further assume that the sole purpose of the CMI-II between the two directors in each engine is to maintain a duplicate of the write cache for fault tolerance. This means all other traffic between directors must go through the VMI.

    In the circumstance that every I/O request coming to a particular port on one of the directors accesses data only on RAID groups directly attached to that director, then we would have 100% locality and there would be nearly zero traffic over the VM. Not only is this highly improbable and extremely difficult to contrive, it also goes against one of the key principles of the central SAN argument. The idea is to pool a very large number of disks into one system such that every volume from all hosts can access the aggregate IOPS capability of the complete storage system.

    A RAID group must be built only from the disks directly attached to the director. So the aggregate concept is achieved by pooling all RAID groups together. Volumes are created by taking a (small) slice of each RAID group across all directors. Each volume now has access to the IOPS capability of the entire set of disks. This is why the SAN shared storage concept is valid for transaction processing systems but not for DW systems that would benefit from sequential large block IO.

    In this scenario, the presumption is that IO requests arriving at any director are evenly distributed to all directors. In the full system of 8 engines, 16 directors, 6.25% (1/16) of IO are on local disks accessed via the back-end ports and 93.75% (15/16) must come through the VM from the other directors.

    The SAN system bandwidth is then constrained by the most limiting of the front-end channels, the back-end channels, and the adjusted single-direction (not full duplex) Virtual Matrix bandwidth. The adjustment accounts for the percentage of traffic that must come through the VM: if the VM must handle 15/16 of the total traffic, then the upper limit is 16/15 times the VM bandwidth. On the VM, it so happens the PCI-E side is more limiting than the RapidIO side, so quoting bi-directional bandwidth is misleading, as is quoting the RapidIO-side bandwidth instead of the PCI-E bandwidth.
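The limiting-bandwidth argument above reduces to a min() over the three candidates. A sketch, using the original VMAX director figures discussed in this post (3.0GB/s FE and BE, roughly 2.0GB/s on the PCI-E side of the VM):

```python
def director_limit_GBps(fe, be, vm_one_dir, directors):
    """Per-director bandwidth ceiling with uniform data placement:
    only 1/directors of IO is local, so the VM must carry the rest."""
    remote_fraction = (directors - 1) / directors
    vm_ceiling = vm_one_dir / remote_fraction   # e.g. 16/15 x VM bandwidth
    return min(fe, be, vm_ceiling)

# Full system (16 directors) down to a single engine pair (4 directors)
for directors in (16, 8, 4):
    print(directors, round(director_limit_GBps(3.0, 3.0, 2.0, directors), 2))
```

With these assumed numbers the VM is always the constraint, and the ceiling rises only modestly as fewer directors mean more local traffic.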

    The PCI-E bandwidth to VM in the original VMAX is 2.0 GB/s (x8 gen 1) including PCI-E protocol overhead. The actual net bandwidth is less than 2GB/s but possibly more than 1.6GB/s cited earlier as the maximum that I have seen in direct attach IO. This is more limiting than the 3GB/s on the 8 x 4Gbps FC front-end or backend ports.

    The second generation VMAX allows 8 x 8Gbps FC ports on the front-end for an aggregate bandwidth of 6GB/s based on 750MB/s per 8Gbps FC port. However, the back-end ports are still 4Gbps FC, for the same 3GB/s aggregate as in the original VMAX. The 40K VMAX engine is described as 50GB/s VM bandwidth, without mentioning that this is the full-duplex value encoded on the RapidIO side. The single direction encoded data rate on a single director is 12.5GB/s. The unencoded rate is 10GB/s on the RapidIO side. The single direction unencoded rate on the PCI-E side is 8GB/s (16 PCI-E gen 2 lanes). Still, this is much more than either the FE or BE ports.

    Note that with fewer engines and corresponding directors, more of the traffic is local. With 4 engines and 8 directors, the local traffic is 12.5% and 87.5% remote. With 2 engines and 4 directors, the local traffic is 25% and 75% remote.

    All of the above is for read traffic, and does not consider if there are other more limiting elements. Another consideration is memory bandwidth. A read from "disk" could be first written to memory, then read from memory. (the latency due to the CPU-cycles involved is not considered). An 8-byte wide DDR DRAM channel at 1333MHz has 10.7GB/s bandwidth, but this is only for memory read.

    The memory write bandwidth to SDR/DDR is one-half the nominal bandwidth. In the really old days, a disk access involving a memory write followed by a memory read would be constrained by one-third of the nominal memory bandwidth. Intel server systems from 2006 or so on used memory with a buffer chip that is described as supporting simultaneous read at the nominal bandwidth and write at one-half the nominal bandwidth.
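The "one-third of nominal" figure follows from simple arithmetic: staging a disk read through memory costs one memory write (at half the nominal rate) plus one memory read (at the full rate) per byte. A sketch of that calculation:

```python
def staged_throughput(nominal_GBps):
    """Effective rate when every byte is written to memory at half the
    nominal rate, then read back at the full rate."""
    write_rate = nominal_GBps / 2
    read_rate = nominal_GBps
    # time per byte is the sum of the two; throughput is the inverse
    return 1 / (1 / write_rate + 1 / read_rate)

print(round(staged_throughput(10.7), 2))  # 3.57 GB/s, one-third of nominal
```

With the buffered memory described above (full-rate read concurrent with half-rate write), the constraint would instead be the half-rate write path alone.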

    In writes to storage, the write IO is first sent to memory on the local director, then copied across the CMI-II(?) to the other director in the same engine? So the net bandwidth across the CMI is also limiting.

    Now that SQL Server 2012 allows clustering (AlwaysOn?) with tempdb on local (preferably SSD) storage, I recommend this to avoid burdening the SAN with writes. Or a SAN vendor can bother to understand the nature of tempdb and allow write cache mirroring to be selectively disabled?

    Even with all this, there is not a definitive statement from EMC on the actual bandwidth capability of the VMAX, original or extra-crispy second generation. Some slides mention a 3X increase in bandwidth. Was that a particular element, or the realizable bandwidth? Is it possible that the original VMAX could do only 1/3 the back-end aggregate of 48GB/s, and that the second generation can do the full back-end limit?


    Regardless of the SAN, focus on the IOPS and bandwidth that can be realized by actual SQL Server queries. SAN vendor big meaningless numbers are not helpful. The standard SAN vendor configuration should be able to produce reasonable IOPS, but will probably be sadly deficient on bandwidth realizable by SQL Server. I do like SSD. I do not like auto-tiering, flash-cache or 7200RPM drives for the main line-of-business database. It should be the database administrator's responsibility to isolate hot data with filegroups and partitioning.

    Considering that a 10K 900GB drive lists for $600, why bother with the 7200RPM (3TB) drive in an enterprise system, unless it is because the markup is so ridiculously high? Or perhaps data that needs to be on 7200RPM drives for cost reasons should not be on a high-cost enterprise storage system? If there are SSDs, these should be made available as pure SSD.

    (Edit 2013-08-02)
    Also extremely important is how SQL Server IO works. First, FC 8Gbps is 8 Giga-bits/sec, not Giga-Bytes/sec. After encoding overhead, the net single-direction bandwidth is 700-750MB/s. On modern server hardware, SQL Server can easily consume data at 5-10GB/s (single direction), so 8-16 x 8Gbps FC ports are recommended for a multi-ten-TB DW. Furthermore, use serious configuration documents, not ones from the stupid idiots who say 1 LUN each for data, temp, log. There should be 1 volume per FC path, shared between data and temp. For OLTP, log should be on a separate volume with dedicated physical HDDs (that is, not a volume carved from a shared pool) and possibly even its own FC path. For DW, log can share a path with the data volumes. Each critical filegroup has one file on each of the data volumes. And don't forget the -E startup parameter.
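The port-count recommendation above is just arithmetic on the net per-port rate. A quick check, assuming 750MB/s net per 8Gbps FC port (the upper end of the range cited):

```python
import math

def fc_ports_needed(target_GBps, net_MBps_per_port=750):
    """Minimum 8Gbps FC ports to sustain a target single-direction rate."""
    return math.ceil(target_GBps * 1000 / net_MBps_per_port)

print(fc_ports_needed(5), fc_ports_needed(10))   # 7 and 14 ports
```

The strict minimum for 5-10GB/s is 7-14 ports; the 8-16 recommendation leaves headroom and fits the even port counts of typical dual-port HBAs.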



    Symmetrix DMX (pre-2009)

    The old DMX-4 architecture is shown below. The front-end and back-end units used PPC processors(?) connected with a matrix to memory controllers?


    A history of the EMC Symmetrix product line can be found on Storage Nerve.

    There can be up to 8 front-end units. Each FE unit can have 8 FC ports for a total of 64 FE ports? Assuming that this was designed for 4Gbps FC, with a realizable bandwidth of 375MB/s on each 4Gbps FC port, then each FE unit of 8 ports would in theory have a maximum bandwidth of 3.0GB/sec. The combined 8 FE units with 64 ports total would have a theoretical upper bound of 24GB/s. Now it is possible that the DMX was originally designed for 2Gbps FC, for an upper bound design of 12GB/s.
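The two DMX front-end bounds above can be checked in one place. The 187.5MB/s figure for 2Gbps FC is my extrapolation (half of the 375MB/s cited for 4Gbps):

```python
ports_per_fe_unit = 8
fe_units = 8

def dmx_fe_GBps(net_MBps_per_port):
    """Per-FE-unit and total front-end bandwidth for the DMX-4."""
    per_unit = ports_per_fe_unit * net_MBps_per_port / 1000
    return per_unit, fe_units * per_unit

print(dmx_fe_GBps(375.0))   # (3.0, 24.0) for a 4 Gbps FC design
print(dmx_fe_GBps(187.5))   # (1.5, 12.0) for a 2 Gbps FC design
```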

    Various EMC documents mention the interconnect bandwidth as a sum total of the individual component bandwidths. But nowhere in EMC documentation is there a mention of the actual DMX bandwidth capability. I have heard that due to some internal architecture aspect, the actual bandwidth capability of the DMX is in fact 3GB/s.

    Lonny Niederstadt provided the following link Agilysys Audit.

  • Relativity e-discovery on SQL Server

    Back in late 2011 to early 2012, I was asked to look into issues for a SQL Server system supporting kCura Relativity. Relativity is an e-discovery platform, that is, a document search and management system frequently used for document discovery in legal cases. So it also has auditing to prove that a search was done.

    Normally I would post items both here and on my own website. Now it is my nature to say things that others find to be not entirely polite (see my previous post) and I am too old to change. So to avoid hurting the feelings of too many people, I am not posting the details of my findings on Relativity on SQL Server here.

    The link is on my site, which does not get much traffic, in the Optimizer section.

    Here is the direct link


    Still, I would like to hear from SQL people who support Relativity. Of course other people have different conclusions. So it is always important to fully examine the information to make an assessment on technical merit. Avoid judgment based solely on the reputation of the author.
