THE SQL Server Blog Spot on the Web


Joe Chang

  • Low Latency Memory

    For the last twenty years, the standard practice has been to scale server performance with multi-processor (MP) systems. What is not widely understood is that after processors integrated the memory controller, the initial step from single processor to 2-way has only moderate scaling due to the introduction of non-uniform memory access (NUMA) in the MP system architecture. Scaling versus cores within a single processor however, can be excellent. The new practice should now be to employ single processor systems for the very large majority of situations, with multi-processor systems relegated to extreme boundary cases.

    Once we accept that the single processor system is the correct strategy, the next conclusion is that low latency DRAM should be pursued. Over the last twenty years, the data transfer rate for memory has increased from 100-133MHz in the SDRAM generation to over 2666MT/s in DDR4. But the full access latency (or row cycle time) has barely changed, from perhaps 60ns to 45ns. This is not because some fundamental limit of DRAM has been reached. Rather, it is a deliberate decision to prioritize manufacturing cost. It was presumed that low latency would not have sufficient incremental market value to justify the higher cost.

    A significant portion of the memory latency in multi-processor servers occurs in the system elements outside of DRAM. A reduction in latency on the DRAM chip may not impact total memory latency in the MP system sufficiently to outweigh the cost. However, in a single processor system, the impact is sufficient to justify even a very large increase in DRAM cost. The first opportunity is for low latency DRAM that would be compatible with the existing memory interface of current or near-term next generation processors, be it DDR4 now or DDR5 in a couple of years.

    The next step is for memory with a new optimized interface, which must be implemented in conjunction with the processor. The most obvious change is to de-multiplex the address bus, basically RL-DRAM, but optimized for server systems. Memory latency is so important that it is likely even SRAM cost structure is viable, but the discussion here focuses on DRAM.

    Memory Latency in Single and Multi-Processor Systems

    There are three die options in the Skylake based Intel Xeon SP: LCC, HCC and XCC. The figure below shows a representation of the Xeon SP XCC model. Memory access latency on an L2 miss consists of L3 latency plus the DRAM access. L3 latency may be less than 19ns on the LCC and HCC die.


    Intel cites 19.5ns for the XCC die. 7-cpu reports memory access for Skylake X as L3 + 50ns = 69ns for DDR4-3400 16-18-18-36. On more conservative ECC memory at 2666MT/s 18-18-18 timing, the full memory access might be 76ns, or perhaps slightly higher.

    The figure below, from Anandtech (sourced from an Intel slide?), shows memory latency for a 2-way system with Xeon SP 8180 28-core processors. Also see Sizing Up Servers....


    Note that local node memory latency is shown as 89ns, significantly higher than latency on the single processor system. Remote node memory latency is 139ns. I am assuming that this difference is due to remote node cache coherency, but processor architects and engineers are welcome to elaborate on this. There is a slide in Notes on NUMA Architecture, Intel Software Conference 2014 Brazil stating that in local memory access, the memory request is sent concurrently with the remote node snoop, implying minimal penalty, and yet there is a difference of 13ns?

    Be aware that different values are cited by various sources. The figure below is from


    Even if a database or other application had been architected to achieve a high degree of memory locality on a NUMA system, there is still a memory latency penalty on the multi-processor system relative to the single processor system.

    Almost no real-world databases have been architected to achieve memory locality on a NUMA system. I have heard that a handful of Oracle environments went to the effort of re-architecting for RAC, which is the same type of architecture necessary to achieve memory locality on NUMA. Most environments should see average memory latency of about 114ns on a 2-way Xeon SP system, based on a 50/50 local-remote node mix. Average memory latency on a 2-way Xeon E5 v4 system might be 120ns.

    Memory Latency and Performance

    The main line Intel processor cores manufactured on the original 14nm process can operate at base frequencies in the high 3GHz range. Cores manufactured on the 14nm+ process can operate in the low 4GHz range at base. In the high core count Xeon models, this is typically de-rated to a 2-2.5GHz base frequency with turbo boost at about 3GHz.

    At 2.5GHz, the CPU clock cycle is 0.4ns. Single processor memory latency of 76ns corresponds to 190 CPU cycles. Local node memory access in a multi-processor system at 89ns is 222.5 cycles, and remote node latency of 139ns is 347.5 cycles. Average 2-way memory access based on a 50/50 local-remote node split is 114ns, or 285 cycles.

    If even a very small percentage of code involves a round-trip memory access to determine the next operation, then most CPU cycles are spent as no-ops. In this case, performance is essentially the inverse of memory latency. A 30% reduction in memory latency corresponds to 1.4X performance, and 50% to 2X. See Memory Latency for more on this topic.
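    The arithmetic above can be sketched in a few lines. This is a quick check, not a measurement; all of the latency figures are the ones quoted in the text.

    ```python
    # Convert memory latency in ns to CPU clock cycles at a 2.5GHz clock,
    # and estimate speedup when performance is the inverse of memory latency.
    # All latency figures are the ones quoted in the text above.

    def cycles(latency_ns, clock_ghz=2.5):
        """Memory latency expressed in CPU clock cycles."""
        return latency_ns * clock_ghz

    single = 76.0    # single processor system, ns
    local = 89.0     # 2-way, local node, ns
    remote = 139.0   # 2-way, remote node, ns
    avg_2way = 0.5 * local + 0.5 * remote   # 50/50 local-remote mix

    print(cycles(single))              # 190.0 cycles
    print(cycles(local))               # 222.5 cycles
    print(cycles(remote))              # 347.5 cycles
    print(avg_2way, cycles(avg_2way))  # 114.0 ns, 285.0 cycles

    def speedup(latency_reduction):
        """If performance ~ 1/latency: 1.4X at 30% reduction, 2X at 50%."""
        return 1.0 / (1.0 - latency_reduction)

    print(round(speedup(0.30), 2))  # 1.43
    print(round(speedup(0.50), 2))  # 2.0
    ```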

    The DIMM

    The Dual In-line Memory Module has been a standard for many years now. In the figure below, a memory module is shown with 9 DRAM chip packages forming a 72-bit wide word or line.


    Of this, 64 bits are data and the 8 extra bits are for ECC. There is an extra chip for registered DIMMs. Desktop systems use unbuffered DIMMs and do not have ECC.

    A module can be double-sided. When there are chip packages for two separate sets, the DIMM is said to have 2 ranks. In the case above, there is one set on each side of the DIMM, hence dual.

    The figure below shows two separate sets on each side, for a total of four sets.


    In this case, the DIMM has 4 ranks. The DIMM datasheet will have the actual configuration, as there are exceptions. Below is a DIMM that is probably rank 4.


    Each package can have stacked DRAM die. The image below is probably NAND die, as it appears to be stacked 16 high. DRAM packages with stacked die are usually 2 or 4 high?



    DRAM - preliminary

    The DRAM chip is an enormously complicated entity for which only some aspects are briefly summarized here. See Onur Mutlu and other sources for more.

    Below is a simplified rendering of the Micron 8Gb DDR4 in the 2G x 4 organization. See the Micron datasheet for a more complete and fully connected layout (DDR4 8Gb 2Gx4).


    The die is sub-divided into 4 bank groups. Each group has 4 banks for a total of 16 banks. Within a bank, there are rows and columns. The row is a word-line and the column is a bit-line.

    The 2G x 4 organization means there are 2G words of 4 bits. The address bus has 21 signals. Two bits address the bank groups, and two bits address the banks within a group. The remaining 17 signals are multiplexed for the row and column addresses. All 17 bits are used for the row address (128 x 1024 = 131,072 rows). Only 10 bits are used for the column address. The total address is then 4 + 17 + 10 = 31 bits, sufficient for 2G words.
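    The address-bit accounting above can be verified with a short sketch, using only the organization figures just described:

    ```python
    # Address-bit accounting for the Micron 8Gb DDR4, 2G x 4 organization:
    # 2 bank-group bits + 2 bank bits, 17 row bits, 10 column bits,
    # with row and column multiplexed over the same address signals.

    bank_group_bits, bank_bits = 2, 2
    row_bits, col_bits = 17, 10

    rows_per_bank = 2 ** row_bits   # 128 x 1024 = 131,072 rows
    total_bits = bank_group_bits + bank_bits + row_bits + col_bits

    print(rows_per_bank)                   # 131072
    print(total_bits)                      # 31
    print(2 ** total_bits == 2 * 1024**3)  # True: 2G addressable 4-bit words

    # The multiplexed bus only needs max(row, col) plus the bank bits:
    address_signals = bank_group_bits + bank_bits + max(row_bits, col_bits)
    print(address_signals)  # 21
    ```

    Multiplexing is what keeps the pin count at 21 instead of the 31 a fully demultiplexed address would need.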

    Notice that only 7 bits of the column address go into the column decoder. Eight words are fetched from the bank on each access.

    The diagrams below are from Bounding Worst-Case DRAM Performance on Multicore Processors, on Oak Central. The original source is "Memory Systems: Cache, DRAM, Disk" by Bruce Jacob et al. (2008). The first image shows "Command and data movement for an individual DRAM device access on a generic DRAM device."


    The next figure is "A cycle of a DRAM device access to read data."


    The three DDR timing parameters commonly cited are CL, tRCD, and tRP. A fourth number that may be cited after the first three is tRAS. Wikipedia Memory timings has definitions for these terms. "The time to read the first bit of memory from a DRAM with the wrong row open is tRP + tRCD + CL." The tRCD, CL and tRP elements are often identical and in the 13-14ns range. In DDR4, the data burst phase transfers 8 words. For DDR4-2666, the command-address clock is 0.75ns and the data transfer rate is one word every 0.375ns, so 8 transfers take 3ns.
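    As a sketch, the wrong-row access time and burst duration work out as follows for DDR4-2666 at 18-18-18 timings (the conservative ECC timings cited earlier):

    ```python
    # Wrong-row access time per the definition quoted above: tRP + tRCD + CL,
    # in command/address clocks. For DDR4-2666 the command clock is ~0.75ns
    # and data transfers at 0.375ns per word, two words per command clock.

    cmd_clock_ns = 2 / 2.666     # ~0.75 ns per command/address clock
    tRP, tRCD, CL = 18, 18, 18   # in command clocks

    first_word_ns = (tRP + tRCD + CL) * cmd_clock_ns
    burst_ns = 8 * cmd_clock_ns / 2   # 8-word burst, 2 words per clock

    print(round(first_word_ns, 1))  # 40.5 ns
    print(round(burst_ns, 1))       # 3.0 ns
    ```

    Note this 40.5ns is the DRAM-side component only; the ~76ns figure cited earlier includes the L3 miss and the rest of the path through the memory controller.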

    The data restore phase begins shortly after tCAS (CL) and overlaps with tBurst. There is a gap between the column read and the (array) precharge periods. As mentioned earlier, DRAM is very complicated. For database transaction processing and other pointer-chasing code, the parameter of interest is tRC. This value is sometimes not cited, and it cannot be determined from the three commonly cited timing parameters.

    Micron SDRAM through DDR4

    The table below shows some organization details for selected Micron DRAM products from SDRAM through DDR4. The year column is the year of the datasheet, which might be a revision published after initial product launch.

    Type   Density  Org     Banks  Row bits  Col bits  Rows x Cols  Year  MT/s  CL/tRCD/tRP  tRC
    SDRAM  256M     64Mx4   4      13        11        8K x 2K      1999  133   15ns         60ns
    DDR    512M     128Mx4  4      13        12        8K x 4K      2000  400   15ns         55ns
    DDR2   2G       512Mx4  8      15        11        32K x 2K     2006  1066  13.13ns      54ns
    DDR3   1G       256Mx4  8      14        11        16K x 2K     2008  1866  13.91ns      48ns
    DDR3   4G       1Gx4    8      16        11        64K x 2K     2017  2133  13.09ns      47ns
    DDR4   8G       2Gx4    16     17        10        128K x 1K    2017  3200  13.75ns      46ns

    Most of the parameters above are in the DRAM chip datasheet, but tRC is usually found in the DIMM datasheet, and sometimes buried in the back pages. The timing column is the CL-tRCD-tRP amalgamated into a single value and converted from (command/address) clock cycles to nanoseconds.

    The CL-tRCD-tRP values decreased only slightly from SDRAM to DDR2. From DDR2 to current DDR4, they have not changed for the registered ECC DIMMs used in server systems. There are unbuffered DIMMs made with specially screened DRAM without ECC having lower (better) values. These products also have heat sinks for improved cooling. Presumably, it would not be feasible to rely on this approach for server systems.

    The 256Mb SDRAM datasheet is dated 1999. This could be a 250 or 180nm manufacturing process. The 8Gb DDR4 is probably a 2016 or 2017 product, possibly on a 22nm or 2x-nm class process. From 256Mb to 8Gb there are 5 doublings of density. Between 180nm and 22nm are five manufacturing processes: 130, 90, 65, 45 and 32nm.

    The manufacturing process used to mean the transistor gate length. However, in the last 15 or so years, it is just an artificial placeholder representing a doubling of transistor density between generations. In the Intel 22 to 14nm process transition, the density increased by 2.7X (Intel's 10nm).

    It would seem that one doubling of DRAM density between 256Mb at 180nm and 8Gb at 22nm is missing, but this could be the upcoming 16Gb single die. There is a convention in which the DRAM bit cell size is specified as xF2, where F is the process linear dimension.

    DRAM Die Images

    In Under the Hood: The race is on for 50-nm DRAM, the Samsung 1Gb DDR2 is mentioned as 50nm-class in 2009, with a 6F2-based cell design; the die image is shown below. In this case, 50nm-class means 58nm.


    Also in Under the Hood: "On the other hand, Hynix's 8F2 cell design showed a 16.5 percent larger cell than Samsung's. It should be noted, that despite the larger cell size, Hynix's 1-Gbit DDR2 SDRAM achieved an impressive chip size of 45.1 mm2, only 2.7 percent larger than Samsung's 1-Gbit DDR2 SDRAM."

    Semiconductor Manufacturing & Design has a die image of the Samsung 1Gb DDR3 in How to Get 5 Gbps Out of a Samsung Graphics DRAM, shown below.


    Embedded has a discussion on Micron SDRAM and DDR in Analysis: Micron cell innovation changes DRAM economics.

    DRAM Latency

    Two items are mentioned as having significant impact on DRAM latency. One is the bank size. Another is the multiplexing of row and column addresses. In principle, it is a simple matter to sub-divide the DRAM chip into more banks, as the design of each bank and its control logic is the same. This would however increase the die size as there would be more of the control logic associated with each bank. Earlier we said that DRAM manufacturers understood that the market prioritized cost over latency.

    In the 256Mbit SDRAM, the bank size is 8K x 2K = 16M words, with 4 banks at a x4 word. (I am using the term word to mean the data path out. This could apply to a DRAM chip at 4, 8 or 16 bits, or a DIMM at 64 or 72 bits.) For the 8Gbit DDR4, the bank size is 128K x 1K = 128M words, with 16 banks at a x4 word size.

    There is some improvement in latency between the manufacturing process for SDRAM in 1999 and DDR4 in 2017. Long ago, the expectation was a 30% reduction in transistor switching time per generation, corresponding to a 40% increase in frequency for logic. From about the 90nm process onwards, transistor performance improved at a much diminished rate.

    In DRAM, there are both logic sub-units made of transistors and the bit cell array made of transistors and capacitors. I do not recall mention of capacitor performance versus process generation. Presumably the improvement is minimal? Still, this allowed the main timing parameters to remain at 13-14ns each as bank size increased from 16M to 128M words at 4-bit data width.

    A single bank can only transfer one 8-word burst every tRC. The burst takes 4 clocks, as DDR transfers 2 data words every clock. The tRC is somewhat longer than tRCD + CL + tRP. For DDR4-2666, these components are 19 clocks each, for a total of 57 clocks at 0.75ns per clock. The cited tRC at 3200 MT/s is 45.75ns, and at 2933 MT/s is 46.32ns. Presumably then tRC at 2666 MT/s is either 61 clocks for 45.75ns or 62 clocks for 46.5ns. (61 clocks for tRC is 57 + 4?)
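    The clock accounting in this paragraph can be sketched as:

    ```python
    # tRC accounting at DDR4-2666 with 19-19-19 timings. The sum
    # tRCD + CL + tRP is 57 command clocks; the cited tRC values of
    # 45.75ns and 46.5ns correspond to 61 and 62 clocks at 0.75ns each.

    clock_ns = 0.75
    tRCD = CL = tRP = 19

    sum_clocks = tRCD + CL + tRP
    print(sum_clocks, sum_clocks * clock_ns)     # 57 clocks, 42.75 ns

    for tRC_ns in (45.75, 46.5):
        print(tRC_ns, round(tRC_ns / clock_ns))  # 61 and 62 clocks

    # A single bank bursts data for only 4 of every ~61 clocks
    # (8 words, 2 per clock), hence the pressure to add banks.
    print(round(4 / 61, 3))  # ~0.066 busy fraction
    ```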

    A single bank can transfer data for only 4 out of every 61 or 62 clock cycles. DDR4 currently has 16 banks. So, it would seem that sustained memory bandwidth was important enough to warrant an increase in the number of banks, but latency was not deemed sufficiently important to increase the bank count further.

    Reduced Latency DRAM

    Micron has a RLDRAM 3 product at 1.125Gb. One organization is 64M x 18 with 16 banks. The datasheet is from 2015, so this might be on the same manufacturing process as the 2Gb DDR3. The RLDRAM package is 13.5mm x 13.5mm versus 8mm x 14mm for the 2Gb in x16 form. This might be due to the difference in pins, 168 versus 96. The two products might have the same die size, or the 1.125Gb RLDRAM could have a larger die than the 2Gb DDR3, and they may or may not share a common manufacturing process.

    The 1.125Gb RL-DRAM has 16 banks. Each bank is 16K rows by 256 columns = 4M words, or 72M bits.

    The appropriate comparison for this is the 2Gbit DDR3 in 128M x 16 organization. The comparable DDR3 chip has 8 banks. Each bank is 16K x 1K = 16M words at a 16-bit word = 256M bits. So, roughly, the RLDRAM bank is 3.55 times smaller in terms of bits per bank.
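    The bits-per-bank comparison works out as follows, using the two datasheet organizations described above:

    ```python
    # Bits per bank: RLDRAM3 1.125Gb (16K rows x 256 cols, x18 data)
    # versus DDR3 2Gb in 128M x 16 organization (16K x 1K, x16 data).

    rldram_bank_bits = 16 * 1024 * 256 * 18
    ddr3_bank_bits = 16 * 1024 * 1024 * 16

    print(rldram_bank_bits // 2**20)  # 72 Mbit per bank
    print(ddr3_bank_bits // 2**20)    # 256 Mbit per bank
    print(round(ddr3_bank_bits / rldram_bank_bits, 2))  # ~3.56x smaller
    ```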

    The second aspect of RLDRAM is that the row and column addresses are not multiplexed. For the 64M x 18 organization, there are 25 address bits, even though 26 bits would be required to address 64M words individually. There are 4 bits for the bank address. The row address is 14 bits. The column address is 7 bits, but only 5 bits go into the column decoder. The bank is shown as 16K x 32 x 8 (x18 data). The 5 bits going into the decoder correspond to 32 columns. I am presuming that the address has 2-word granularity?

    Between the smaller bank size and the non-multiplexed address, Micron cites their RL-DRAM as having a tRC minimum value of 6.67ns (8 cycles at 2400 MT/s, or 0.833ns per data transfer). But why cite only the minimum value? We are mostly interested in the average value and possibly the min-max range, excluding refresh, which all DRAM must have.

    My guess would be that with extra banks, the tRP period can be hidden if accesses are randomly distributed between banks. If so, then the smaller banks reduce the 13-14ns timings to 6.67ns?

    It is presumed that both the smaller bank size and the non-multiplexed address contribute to the significantly lower latency. Some details on how much each aspect contributes would be helpful.

    As we are accustomed to multi-GHz processors, it might seem strange that the multiplexed row and column address carries much of a penalty, with tRCD and CAS latency each 13-14ns in mainstream DRAM. In this regard, we should recall that processors have been at 10 or more pipeline stages since the Pentium Pro in 1995. The post-Pentium 4 Intel processors probably have 14-16 pipeline stages, though Intel no longer shows the number of stages in their micro-architecture diagrams.

    In this regard, the full start-to-finish time to execute an instruction is on the order of 5ns. Then factor in that DRAM uses a very different manufacturing process than logic, one not optimized for performance. It is presumed that the logic on DRAM is not pipelined except for the data transfer sub-units? (The DDR logic core runs at 1/8th of the data transfer rate, or 1/4 of the command/address clock. On DDR4-2666, the core runs at 333MHz.)

    Low Latency DRAM

    As mentioned earlier, we are interested in two separate approaches to low latency DRAM for server systems handling transaction processing type workloads. The near-term approach is to retain the existing mainstream DDR4 interface, moving to DDR5 when appropriate. Ideally, the new special memory would be compatible with existing or near-term processors designed for conventional memory. But some divergence may be necessary and is acceptable.

    This would preclude a non-multiplexed address bus. The number of banks would be increased from 16 in the 8Gb DDR4 presumably to 64 or more. The memory controller would have to know that there are more bits for bank addressing, which is why this new memory may or may not work (at full capability?) in existing processors. (The memory controller tries to schedule requests so that accesses do not go to a given bank within a tRC interval.)

    But it is possible the next generation processor could work with both conventional and the new high bank count DRAM memory. (Before conventional memory was made with multiple banks, there was a Multibank DRAM product from MoSys.)

    The low latency DRAM die would be larger than a similar density conventional DRAM chip. Either the new memory would have lower DIMM capacity options, or the DIMM form factor would have to allow for a taller module. A 2X reduction in capacity would not be a serious issue. If the difference in areal size were 4X, then the larger form factor would be preferable.

    In the longer term strategy, more avenues for latency reduction are desired. The RL-DRAM approach of non-multiplexed row and column addresses is the right choice. It is mentioned somewhere that the RL-DRAM interface is not very different from SRAM. This could be an additional option, which I looked at in SRAM as Main Memory.


    We need to wake up to the fact that scaling performance with multi-processor systems should no longer be the standard default approach. Most situations would benefit greatly from a single processor system. On a hardware-cost-only evaluation, the cost-performance value could go either way. Two 20-core processors cost less than one 28-core, but one 24-core is about the same as two 16-core processors. When software per-core licensing is factored in, the single processor system advantage is huge. Once we accept the single processor system approach, it is easy to realize that a new low latency memory provides huge additional value.

    This needs to be communicated to both DRAM and processor manufacturers. A cost multiplier of 2 or even 4X is well worth it if system level memory latency could be reduced by 20ns. A near-term future product with a backward compatibility option would mitigate risk. The longer term approach is a clean break from the past to do what is right for the future.


    There is manufacturing variation in DRAM latency. Most mainstream DRAM is set at a conservative rating for high yield. Micron does have parts with slightly better latency, usually denoted with an E in the speed grade. Some companies (e.g. G.Skill) offer memory made with specially screened parts. One product is DDR4-4266 (2133MHz clock) at 17-18-18 timings. CL 17 at 2133MHz is 7.97ns. tRCD and tRP at 18 clocks are 8.44ns. This strategy works for extreme gaming, allowing a few deep-pocketed players to get an advantage. For servers, it is probably worthwhile to offer two or three binned parts. If two, then perhaps the top one-third as the premium part and the rest as ordinary.
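    Converting those screened-part timings from clocks to nanoseconds (the command clock is half the 4266 MT/s transfer rate):

    ```python
    # DDR4-4266 runs a 2133MHz command clock. Convert the 17-18-18
    # timings quoted above from command clocks to nanoseconds.

    clock_ns = 1000 / 2133   # ~0.469 ns per command clock

    print(round(17 * clock_ns, 2))  # CL 17        -> 7.97 ns
    print(round(18 * clock_ns, 2))  # tRCD, tRP 18 -> 8.44 ns
    ```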

    The other factor that affects latency is temperature. Lower temperature allows the charge state of the capacitor to be measured more quickly. Many DRAM parts list a 64ms, 8192-cycle refresh for up to 85C, and a 32ms refresh for 85C to 95C. Onur's slides also say that lower temperature allows lower timing values. This is one reason why the special DIMMs for gaming have a heat sink. So, a question is whether we want this or even more elaborate cooling for our low latency strategy. Perhaps enhanced cooling should be used for the specially binned parts, and normal cooling for the bulk parts.

  • sp_helpindex2

    This is my extended version of sp_helpindex.

    Let's face it, sp_helpindex is old and has not been updated for the many features added to SQL Server over the years.

    sp_helpindex provides a list of indexes with information on: the type (clustered, unique, primary key, statistics no recompute), the file group, and the key columns.

    What additional information would we like?

    The included column list is one, along with whether it is a filtered index and, if so, the filter definition.

    Size information is also good: rows, and size broken out as in-row pages, overflow, LOB and unused. Average bytes per row is also useful. And whether an index is compressed or partitioned; if partitioned, how many partitions there are and how many are populated.

    New sources are: sys.dm_db_index_usage_stats and sys.dm_db_stats_properties
    index usage - seeks, scans, lookups, updates
    compressed? # of partitions, populated partitions
    difference between current rows and rows when statistics were sampled
    rows sampled, modification counter, when statistics were last updated

    We might also like to know some information from DBCC SHOW_STATISTICS, but this is not available from a query.

    USE master
    GO
    IF EXISTS (
     SELECT * FROM sys.procedures WHERE object_id = OBJECT_ID('sp_helpindex2')
    ) DROP PROCEDURE [dbo].sp_helpindex2
    GO

    CREATE PROCEDURE [dbo].[sp_helpindex2]
     @objname nvarchar(776)
    AS
    DECLARE @objid int

     , @dbname sysname

     -- Check to see that the object names are local to the current database.

    select @dbname = parsename(@objname,3)

    if @dbname is null

    select @dbname = db_name()

    else if @dbname <> db_name()
    begin
     raiserror(15250, -1, -1)
     return (1)
    end


    -- Check to see the the table exists and initialize @objid.

     select @objid = object_id(@objname)

    if @objid is NULL
    begin
     raiserror(15009, -1, -1, @objname, @dbname)
     return (1)
    end



    ;WITH b AS (

    SELECT d.object_id, d.index_id, part = COUNT(*) , pop = SUM(CASE row_count WHEN 0 THEN 0 ELSE 1 END)

    , reserved = 8*SUM(d.reserved_page_count)

    , used = 8*SUM(d.used_page_count )

    , in_row_data = 8*SUM(d.in_row_data_page_count)

    , lob_used = 8*SUM(d.lob_used_page_count)

    , overflow = 8*SUM( d.row_overflow_used_page_count)

    , row_count = SUM(row_count)

    , notcompressed = SUM(CASE data_compression WHEN 0 THEN 1 ELSE 0 END)

    , compressed = SUM(CASE data_compression WHEN 0 THEN 0 ELSE 1 END) -- change to 0 for SQL Server 2005

    FROM sys.dm_db_partition_stats d WITH(NOLOCK)

    INNER JOIN sys.partitions r WITH(NOLOCK) ON r.partition_id = d.partition_id

    GROUP BY d.object_id, d.index_id

    ), j AS (

    SELECT j.object_id, j.index_id, j.key_ordinal, c.column_id,, j.is_descending_key, j.is_included_column, j.partition_ordinal

    FROM sys.index_columns j

    INNER JOIN sys.columns c ON c.object_id = j.object_id AND c.column_id = j.column_id
    )

    SELECT ISNULL(, '') [index]

    , ISNULL(STUFF(( SELECT ', ' + name + CASE is_descending_key WHEN 1 THEN '-' ELSE '' END + CASE partition_ordinal WHEN 1 THEN '*' ELSE '' END

       FROM j WHERE j.object_id = i.object_id AND j.index_id = i.index_id AND j.key_ordinal > 0

       ORDER BY j.key_ordinal FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'') ,'') as Keys

    , ISNULL(STUFF(( SELECT ', ' + name + CASE partition_ordinal WHEN 1 THEN '*' ELSE '' END

       FROM j WHERE j.object_id = i.object_id AND j.index_id = i.index_id AND (j.is_included_column = 1 OR (j.key_ordinal = 0 AND partition_ordinal = 1) )

       ORDER BY j.column_id FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'') ,'') as Incl

    --, AS ptky

    , i.index_id

    , CASE WHEN i.is_primary_key = 1 THEN 'PK'

       WHEN i.is_unique_constraint = 1 THEN 'UC'

       WHEN i.is_unique = 1 THEN 'U'

       WHEN i.type = 0 THEN 'heap'

       WHEN i.type = 3 THEN 'X'

       WHEN i.type = 4 THEN 'S'

       ELSE CONVERT(char, i.type) END typ

    , i.data_space_id dsi

    , b.row_count

    , b.in_row_data in_row , b.overflow ovf , b.lob_used lob

    , b.reserved - b.in_row_data - b.overflow -b.lob_used unu

    , 'ABR' = CASE row_count WHEN 0 THEN 0 ELSE 1024*used/row_count END

    , y.user_seeks, y.user_scans u_scan, y.user_lookups u_look, y.user_updates u_upd

    , b.notcompressed ncm , b.compressed cmp , b.pop, b.part

    , rw_delta = b.row_count - s.rows, s.rows_sampled --, s.unfiltered_rows

    , s.modification_counter mod_ctr, s.steps

    , CONVERT(varchar, s.last_updated,120) updated

    , i.is_disabled dis, i.is_hypothetical hyp, ISNULL(i.filter_definition, '') filt

    , t.no_recompute no_rcp

    FROM sys.objects o

    JOIN sys.indexes i ON i.object_id = o.object_id

    LEFT JOIN sys.stats t ON t.object_id = o.object_id AND t.stats_id = i.index_id

    LEFT JOIN b ON b.object_id = i.object_id AND b.index_id = i.index_id

    LEFT JOIN sys.dm_db_index_usage_stats y ON y.object_id = i.object_id AND y.index_id = i.index_id

    AND y.database_id = DB_ID()

    OUTER APPLY sys.dm_db_stats_properties(i.object_id, i.index_id) s

    --LEFT JOIN j ON j.object_id = i.object_id AND j.index_id = i.index_id AND j.partition_ordinal = 1

    WHERE i.object_id = @objid


    GO

    -- Then mark the procedure as a system procedure.
    EXEC sys.sp_MS_marksystemobject 'sp_helpindex2'




  • sp_spaceused2

    This is my version of sp_spaceused. Instead of just overall space usage, all tables are shown with extended information.

    USE master -- skip this for Azure
    GO
    IF EXISTS (
      SELECT * FROM sys.procedures WHERE object_id = OBJECT_ID('sp_spaceused2')
    ) DROP PROCEDURE [dbo].sp_spaceused2
    GO

    CREATE PROCEDURE [dbo].[sp_spaceused2] @objname nvarchar(776) = NULL, @minrow int = 0
    AS
    DECLARE @objid int, @dbname sysname

    -- Check to see that the object names are local to the current database.
    select @dbname = parsename(@objname,3)
    if @dbname is null
      select @dbname = db_name()
    else if @dbname <> db_name()
      return (1)
    -- initialize @objid.
    select @objid = object_id(@objname)  


    ;WITH a AS (

    SELECT CASE WHEN o.schema_id = 4 THEN CASE WHEN o.type = 'S' THEN 1 WHEN o.type = 'IT' THEN 2 ELSE 3 END ELSE o.object_id END AS object_id

    , o.type AS otype, d.index_id, i.data_space_id , d.reserved_page_count, d.used_page_count

    , d.in_row_data_page_count, d.lob_used_page_count, d.row_overflow_used_page_count, d.row_count

    , r.data_compression, r.partition_number, i.type itype, i.is_unique

    , i.fill_factor , i.is_disabled, i.is_hypothetical, i.has_filter

    FROM sys.objects o WITH(NOLOCK) INNER JOIN sys.indexes i WITH(NOLOCK) ON i.object_id = o.object_id

    LEFT JOIN sys.partitions r WITH(NOLOCK) ON r.object_id = i.object_id AND r.index_id = i.index_id

    LEFT JOIN sys.dm_db_partition_stats d WITH(NOLOCK) ON d.partition_id = r.partition_id --AND r.object_id = d.object_id AND r.index_id = d.index_id AND r.partition_number = d.partition_number

    WHERE o.type <> 'TF' AND o.type <> 'IT'

    AND (@objid IS NULL OR o.object_id = @objid)

    AND (@minrow = 0 OR row_count > @minrow)

    -- AND row_count > 0 -- optional


    ), b AS (

    SELECT object_id, index_id, otype, itype, data_space_id  -- MAX(CASE WHEN index_id <= 1 THEN data_space_id ELSE 0 END) data_space_id

    , CASE WHEN COUNT(*) > 1 THEN 1 ELSE 0 END Part, COUNT(*) AS Cnt

    , reserved = 8*SUM(reserved_page_count) , used = 8*SUM(used_page_count) , in_row_data = 8*SUM(in_row_data_page_count) , lob_used = 8*SUM(lob_used_page_count)

    , row_overflow_used = 8*SUM(row_overflow_used_page_count) , row_count = SUM(row_count) , compressed = SUM(data_compression) -- change to 0 for SQL Server 2005

    , Pop = SUM(CASE WHEN row_count = 0 OR index_id > 1 THEN 0 ELSE 1 END)

    , Ppz = SUM(CASE WHEN row_count = 0 AND index_id <= 1 THEN 1 ELSE 0 END)

    , Clus = MAX(CASE a.index_id WHEN 1 THEN 1 ELSE 0 END) , IxCt = MAX(CASE itype WHEN 2 THEN 1 ELSE 0 END)

    , XmlC = MAX(CASE itype WHEN 3 THEN 1 ELSE 0 END)             , SpaC = MAX(CASE itype WHEN 4 THEN 1 ELSE 0 END)

    , CoSC = MAX(CASE itype WHEN 5 THEN 1 ELSE 0 END)

    , ncs  = MAX(CASE itype WHEN 6 THEN -1 ELSE 0 END)

    , MO   = MAX(CASE itype WHEN 7 THEN 1 ELSE 0 END)     

    , Uniq = MAX(CASE is_unique WHEN 1 THEN 1 ELSE 0 END)

    , disa = MAX(CASE is_disabled WHEN 1 THEN 1 ELSE 0 END) , hypo = MAX(CASE is_hypothetical WHEN 1 THEN 1 ELSE 0 END) , filt = MAX(CASE has_filter WHEN 1 THEN 1 ELSE 0 END)

    FROM a GROUP BY object_id, index_id, otype, itype , data_space_id

    ), c AS (

    SELECT CASE WHEN otype IS NULL THEN 'A' ELSE otype END otype

    , CASE WHEN b.object_id IS NULL THEN 0 ELSE b.object_id END AS object_id

    , CASE WHEN b.object_id IS NULL THEN 0 WHEN b.object_id IN (1,2) THEN b.object_id ELSE 3 END Ord --, data_space_id

    , MAX(CASE WHEN index_id <= 1 THEN data_space_id ELSE 0 END) data_space_id

    , [Rows] = SUM(CASE WHEN b.index_id < 2 THEN b.row_count ELSE 0 END ) , Reserved = SUM(b.reserved), Used = SUM(b.used)

    , Data = SUM(CASE WHEN (b.index_id < 2) THEN (b.in_row_data + b.lob_used + b.row_overflow_used)  ELSE b.lob_used + b.row_overflow_used END )

    , index2 = SUM(CASE WHEN b.index_id > 1 AND itype=2 THEN (b.in_row_data) ELSE 0 END)

    , index3 = SUM(CASE WHEN b.index_id > 1 AND itype>2 THEN (b.used) ELSE 0 END)

    , in_row_data = SUM(in_row_data), lob = SUM(lob_used), ovrflw = SUM(row_overflow_used)

    , SUM(CASE compressed WHEN 0 THEN 0 ELSE 1 END) Cmpr , SUM(CASE WHEN b.object_id > 10 AND Part > 0 THEN 1 ELSE 0 END) AS Part

    , SUM(Pop) Pop, SUM(Ppz) Ppz

    , MAX(CASE WHEN b.object_id < 10 AND disa = 0 THEN Cnt ELSE 0 END) AS Cnt

    , SUM(Clus) Clus, SUM(IxCt) IxCt, SUM(XmlC) XmlC, SUM(SpaC) SpaC, SUM(CoSC) CoSC, SUM(ncs) ncs, SUM(Uniq) Uniq, SUM(disa) disa, SUM(hypo) hypo, SUM(filt) filt --, SUM(MO) MO

    FROM b GROUP BY b.object_id, otype -- , data_space_id

    WITH ROLLUP HAVING (b.object_id IS NOT NULL AND otype IS NOT NULL /*AND data_space_id IS NOT NULL*/ ) OR b.object_id IS NULL

    ) , j AS (

    SELECT j.object_id, j.index_id, j.key_ordinal, c.column_id,, is_descending_key

      FROM sys.index_columns j INNER JOIN sys.columns c ON c.object_id = j.object_id AND c.column_id = j.column_id )


    SELECT otype , CASE WHEN t.schema_id IS NULL THEN '' ELSE END [Schema] , CASE c.object_id WHEN 0 THEN '_Total' WHEN 1 THEN '_sys' WHEN 2 THEN '_IT' ELSE END [Table]

    , ISNULL(STUFF(( SELECT ', ' + name + CASE is_descending_key WHEN 1 THEN '-' ELSE '' END

       FROM  j WHERE j.object_id = c.object_id AND j.index_id = 1 AND j.key_ordinal >0

       ORDER BY j.key_ordinal FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,1,'') ,'') as ClKey

    , /*CASE is_memory_optimized WHEN 1 THEN x2.rows_returned ELSE [Rows] END*/ [Rows]

    , /*CASE is_memory_optimized WHEN 1 THEN memory_allocated_for_table_kb  ELSE Reserved END*/ Reserved

    , /*CASE is_memory_optimized WHEN 1 THEN memory_used_by_table_kb ELSE [Data] END*/ [Data]

    , lob --, ovrflw

    , /*CASE is_memory_optimized WHEN 1 THEN memory_used_by_indexes_kb ELSE*/ index2 /*END*/ [Index]  --, newIx = index3

    , /*CASE is_memory_optimized WHEN 1 THEN memory_allocated_for_table_kb+memory_allocated_for_indexes_kb-memory_used_by_table_kb -memory_used_by_indexes_kb ELSE*/ Reserved - Used /*END*/ Unused


    , AvBR = CASE [Rows] WHEN 0 THEN 0 ELSE 1024*[Data]/ [Rows] END

    , CASE WHEN c.object_id IN (1,2,3) THEN Cnt ELSE Clus END Clus , IxCt, Uniq , XmlC Xm, SpaC Sp, CoSC + ncs cs

    , /*CASE is_memory_optimized WHEN 1 THEN 1 ELSE 0 END*/ 0 MO

    , Stct, kct, Cmpr , Part, Pop, Ppz -- , Cnt

    , CIxSk, IxSk, Scans, lkup, upd , cols, guids, ngu

    , c.data_space_id dsid , CASE y.lob_data_space_id WHEN 0 THEN NULL ELSE y.lob_data_space_id END lobds  --, fif.ftct, fif.ftsz

    , rkey, fkey, def, trg --, cols

    , disa -- , hypo

    , filt , o.create_date

    FROM c 

    LEFT JOIN sys.objects o WITH(NOLOCK) ON o.object_id = c.object_id

    LEFT JOIN sys.tables y WITH(NOLOCK) ON y.object_id = c.object_id

    LEFT JOIN sys.schemas t WITH(NOLOCK) ON t.schema_id = o.schema_id

    --LEFT JOIN sys.dm_db_xtp_table_memory_stats x ON x.object_id = y.object_id

    --LEFT JOIN sys.dm_db_xtp_index_stats x2 ON x2.object_id = y.object_id AND x2.index_id = 0


    LEFT JOIN ( SELECT CASE WHEN object_id IS NULL THEN 0 ELSE object_id END object_id, COUNT(*) Stct FROM sys.stats WITH(NOLOCK) WHERE object_id > 3 /* skip low values */ GROUP BY object_id WITH ROLLUP HAVING object_id IS NOT NULL OR object_id IS NULL ) s ON s.object_id = c.object_id


    LEFT JOIN ( SELECT table_id, SUM(data_size)/1024 ftsz , COUNT(*) ftct FROM sys.fulltext_index_fragments WITH(NOLOCK) WHERE [status] = 4 GROUP BY table_id  ) fif ON fif.table_id = c.object_id


    LEFT JOIN ( SELECT object_id, COUNT(*) kct FROM sys.index_columns WITH(NOLOCK) WHERE index_id = 1  GROUP BY object_id  ) k ON k.object_id = c.object_id


    LEFT JOIN ( SELECT CASE WHEN object_id IS NULL THEN 0 ELSE object_id END object_id, COUNT(*) cols  , SUM(CASE system_type_id WHEN 36 THEN 1 ELSE 0 END) guids , SUM(CASE WHEN system_type_id = 36 AND is_nullable = 1 THEN 1 ELSE 0 END) ngu

    FROM sys.columns WITH(NOLOCK)  GROUP BY object_id /*WITH ROLLUP HAVING object_id IS NOT NULL OR object_id IS NULL*/ ) e  ON e.object_id = c.object_id

    LEFT JOIN ( SELECT CASE WHEN referenced_object_id IS NULL

    THEN 0 ELSE referenced_object_id END referenced_object_id, COUNT(*) rkey

    FROM sys.foreign_keys WITH(NOLOCK) GROUP BY referenced_object_id

    /*WITH ROLLUP HAVING referenced_object_id IS NOT NULL OR referenced_object_id IS NULL*/ ) r ON r.referenced_object_id = c.object_id

    LEFT JOIN ( SELECT CASE WHEN parent_object_id IS NULL THEN 0 ELSE parent_object_id END parent_object_id, COUNT(*) fkey

    FROM sys.foreign_keys WITH(NOLOCK) GROUP BY parent_object_id

    /* WITH ROLLUP HAVING parent_object_id IS NOT NULL OR parent_object_id IS NULL */

    ) f ON f.parent_object_id = c.object_id

    LEFT JOIN ( SELECT CASE WHEN parent_object_id IS NULL THEN 0 ELSE parent_object_id END parent_object_id, COUNT(*) def

    FROM sys.default_constraints WITH(NOLOCK) GROUP BY parent_object_id

    /*WITH ROLLUP HAVING parent_object_id IS NOT NULL OR parent_object_id IS NULL*/

    ) d ON d.parent_object_id = c.object_id

    LEFT JOIN ( SELECT CASE WHEN parent_id IS NULL THEN 0 ELSE parent_id END parent_id, COUNT(*) trg

    FROM sys.triggers WITH(NOLOCK)

    WHERE parent_id > 0 GROUP BY parent_id

    /*WITH ROLLUP HAVING parent_id IS NOT NULL OR parent_id IS NULL*/

    ) g ON g.parent_id = c.object_id

    LEFT JOIN ( SELECT object_id , SUM( CASE index_id WHEN 1 THEN user_seeks ELSE 0 END) CIxSk

    , SUM( CASE WHEN index_id < 2 THEN 0 ELSE user_seeks END) IxSk

    , SUM( CASE WHEN index_id < 2 THEN user_scans ELSE 0 END) Scans

    , SUM( CASE WHEN index_id < 2 THEN 0 ELSE user_lookups END) lkup

    , SUM( CASE WHEN index_id < 2 THEN 0 ELSE user_updates END) upd

    , SUM( CASE WHEN index_id > 1 AND user_seeks = 0 THEN 1 ELSE user_updates END) ZrIx

    FROM sys.dm_db_index_usage_stats WITH(NOLOCK) WHERE database_id = DB_ID() GROUP BY object_id

    ) l ON l.object_id = c.object_id

    /*LEFT JOIN ( SELECT object_id , SUM( CASE WHEN s.avg_user_impact > 90.0 THEN 1 ELSE 0 END) mix

    FROM sys.dm_db_missing_index_details d WITH(NOLOCK)

    INNER JOIN sys.dm_db_missing_index_groups g WITH(NOLOCK) ON g.index_handle = d.index_handle

    INNER JOIN sys.dm_db_missing_index_group_stats s WITH(NOLOCK) ON s.group_handle = g.index_group_handle

    WHERE d.database_id = DB_ID() GROUP BY object_id

    ) m ON m.object_id = c.object_id */


    WHERE --o.type IN ('U','V') AND

    (c.object_id IS NOT NULL /*OR x.object_id IS NOT NULL*/)

    --WHERE ( t.name <> 'dbo' AND o.name NOT LIKE 'Trace%') OR t.name IS NULL

    ORDER BY Ord, Reserved DESC --, Reserved DESC



    -- Then mark the procedure as a system procedure.

    EXEC sys.sp_MS_marksystemobject 'sp_spaceused2' -- skip this for Azure


  • Asymmetric Processor Cores

    Most Intel processors use one of two core designs, the main-line core and the Atom core. Desktop processors might use the newer Kaby Lake or Coffee Lake micro-architecture cores, while the Xeon SP line uses the previous generation Skylake micro-architecture; all of these are part of the main core line. The Atom core is used in Xeon Phi, Atom, and some Pentium or Celeron processors. In simple terms, the main-line is a big core and the Atom is a small core.

    The obvious question is which is better: very many small cores or not quite as many large cores? The answer, as always, is: it depends. Workloads that are highly amenable to multi-threading do better on a greater number of smaller cores; this has to do with the second aspect of Moore's Law. Workloads that do not scale well with multi-threading may do better on big cores. It would seem that all we have to do is match the workload to the processor with the desired core type. For example, database transaction processing should scale well with cores and threads, and if so, it is a match for the processor with very many small cores.

    Except that life and databases are not simple. The main workload might be transaction processing, but there are also many other important activities that occur on the system. Some do not scale well on many cores and heavily favor the big core. Is the answer to this conundrum a mix of small and big cores?

    A processor could have an asymmetric arrangement of small and large cores. On a large die, there could be both very many small cores and a small number of big cores, all in one processor. Many more small cores than big cores can be placed on a single die. A workload that does not scale well to high thread counts still has access to the large cores, and because such a workload does not scale, there is no point in having many of the large cores.

    Skylake Xeon SP and Knights Landing Xeon Phi

    Below left is the Intel Skylake-based Xeon SP XCC die, and below right is the Knights Landing-based Xeon Phi x200. Both die are just under 700mm2.

    Xeon SP (Skylake), 694mm2: 5 rows, 6 columns, 28 cores max.
    Xeon Phi x200 (Knights Landing), 683mm2: 7 rows, 6 columns, 2 cores/tile.

    Below is a layout representation of the Skylake XCC die. The Skylake XCC die has 5 rows by 6 columns minus 2 spots for the memory controllers, and excluding the top row for PCI-E and UPI elements, for a maximum of 28 big cores.


    My rendering

    Below is the layout of Knights Landing (KNL), organized as 7 rows by 6 columns of tiles, excluding the top and bottom rows for the MCDRAM controllers.


    Two tile spots are used by the DDR memory controller, and two are used for PCI-E and DMI. This leaves 38 tiles. There are 2 Atom cores in each tile. The Xeon Phi x200 series has a maximum of 72 cores. Two tiles are spares used to improve manufacturing yield.
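The tile arithmetic above can be checked directly; a minimal sketch in Python, using only the counts cited in this post:

```python
# Core-count arithmetic for the KNL tile layout described above.
tile_spots = 7 * 6            # 7 rows x 6 columns of tile spots
tiles = tile_spots - 2 - 2    # minus DDR memory controller and PCI-E/DMI spots
active_tiles = tiles - 2      # two spare tiles reserved to improve yield
cores = active_tiles * 2      # 2 Atom cores per tile
print(tiles, cores)           # 38 tiles, 72-core maximum
```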

    Below is a functional representation of the KNL tile. The sub-units may not be to scale. There are 2 cores, two double-VPU units, 1MB L2 cache shared by the two cores, and the agent which handles connections to other tiles and functional units.


    As far as I am aware, the SQL Server core engine code does not use the vector (SSE/AVX) instructions. It would be a shame to waste so much silicon real estate on unused units. Either new instructions should be added that are useful in b-tree navigation or page-row-column offset calculation, or the VPU should be removed from the mixed-core product. I gave a brief discussion of this in Rethink System Architecture.

    Below is a simpler representation of the big core Xeon SP and little core Xeon Phi processors.

    Xeon SP (Skylake), 694mm2 | Xeon Phi x200 (Knights Landing), 683mm2

    Based on an eyeball estimate of the die images, a mixed-core processor could have four large cores and 30 tiles of paired small cores in a slightly smaller die of about 670mm2, or 4 large cores and 36 tiles in a larger 750mm2 die. A representation is shown below.

    Mix1, ~670mm2: 4 big cores, 30 tiles.
    Mix2, ~750mm2: 4 big cores, 36 tiles.

    The smaller option might be better, as both the Skylake and KNL processors are thermally limited. Other mix options are possible. The assumption here is that the big cores are for workloads that do not scale well at high thread count, hence the need for only a limited number of large cores. It is further assumed that some mechanism will be implemented to determine which type of core handles each particular function or workload.

    One open question with very many small cores is how cores are allocated. Polling might be a better mechanism than interrupts for handling extreme IO. One or more small cores could be dedicated (and customized?) to kernel tasks such as polling.


    Matt Gillespie paper, "Preparing for the Second Stage of Multi-Core Hardware: Asymmetric (Heterogeneous) Cores", 2008


    Intel also has the Quark core: 32-bit only, no MMX/SSE/AVX, based on the original Pentium.

  • Rethink Server Sizing II

    It is almost universally accepted without question that the server is a multi-processor system. Long ago, there were valid reasons for this. Over time, the reasons changed but were still valid. When the memory controller became integrated into the processor, the very low memory latency of a single processor system was an interesting but non-actionable tidbit; the greater compute and memory capacity of the multi-processor was then more important. In recent years, the high core count and memory capacity of a single modern processor are more than capable of handling most transaction processing workloads. The lower memory latency of the single socket system yields 40-50% better performance per core or thread than a multi-processor system. It is time to rethink our concept of what a server system is. As attractive as cloud computing is for a number of reasons, none of the major cloud providers offer single processor systems, forgoing the price-performance advantage of the 1S system.


    The huge disparity between round-trip access to DRAM memory and the CPU clock cycle has been a serious problem for more than ten years. It is simple to demonstrate that even a small amount of pointer chasing code, in which the memory value being accessed determines the next memory to access, results in performance being almost entirely gated by memory latency. The most important application with this characteristic is database transaction processing. The nature of B-tree index navigation is largely an exercise in pointer-chasing.
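The pointer-chasing pattern described above can be sketched in a few lines: each read returns the index of the next element, so no access can begin until the previous one completes. Absolute timings in Python are dominated by interpreter overhead; the point is the dependency chain, not the measured latency.

```python
import random
import time

def make_chain(n):
    """Random cyclic permutation: nxt[i] is the element to visit after i."""
    order = list(range(1, n))
    random.shuffle(order)
    nxt = [0] * n
    cur = 0
    for i in order:
        nxt[cur] = i
        cur = i
    nxt[cur] = 0              # close the cycle back to element 0
    return nxt

def chase(nxt, steps):
    """Dependent loads: the value just read is the next index to read."""
    p = 0
    for _ in range(steps):
        p = nxt[p]
    return p

chain = make_chain(1 << 20)   # array larger than cache, traversed randomly
t0 = time.perf_counter()
chase(chain, 1_000_000)
print("ns per dependent access:", (time.perf_counter() - t0) * 1e9 / 1_000_000)
```

The same structure, written in C over an array much larger than L3, is how memory latency is typically measured; b-tree index navigation has the same dependent-load shape.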

    The largest part of memory latency is incurred on the DRAM chip itself, but the L3 cache, the memory controller, and signal transmission time are also significant contributors. Curiously, there are specialty memory products for desktop systems, without ECC, featuring "hand-screened" ICs and elaborate cooling, that have latency timings of around 10ns in each of tRCD, CL, and tRP. For server systems, registered ECC DIMMs are only available with the standard timings of about 14ns for each of the three.
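DIMM vendors quote these timings in clock cycles; the nanosecond figures follow from the transfer rate, since DDR moves data twice per clock. A sketch of the conversion, assuming JEDEC-typical DDR4-2666 CL19 for registered ECC parts and DDR4-3200 CL14 as an example of the hand-screened class (both assumed example parts, not specific products):

```python
def timing_ns(cycles, transfer_rate_mt_s):
    """Convert a DRAM timing in clock cycles to nanoseconds.
    DDR transfers twice per clock, so clock (MHz) = MT/s / 2."""
    clock_mhz = transfer_rate_mt_s / 2.0
    return cycles * 1000.0 / clock_mhz

# Registered ECC DDR4-2666 at an assumed JEDEC-typical CL19:
print(round(timing_ns(19, 2666), 1), "ns")   # ~14 ns each of tRCD, CL, tRP
# Non-ECC enthusiast DDR4-3200 at CL14 (assumed example):
print(round(timing_ns(14, 3200), 2), "ns")   # the ~10 ns class
```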

    Single and Multi-Processor Systems

    There is a mechanism by which we can significantly influence memory latency in a multi-processor (socket) server system, that being memory locality. But few applications actually make use of the NUMA APIs in this regard. Some hypervisors like VMware allow VMs to be created with cores and memory from the same node. What may not be appreciated, however, is that even local node memory on a multi-processor system has significantly higher latency than memory access on a (physical) single-socket system.

    That the single processor system has low memory latency was an interesting but non-actionable bit of knowledge until recently. The widespread practice in the IT world was to have the 2-way system as the baseline standard, with single socket systems relegated to small business and turnkey solutions. From long ago until a few years ago, there was a valid basis for this, though the reasons changed over the years. When multi-core processors began to appear, the 2-way became much more powerful than necessary for many secondary applications. But this was also the time virtualization became popular, which gave new reason to continue the 2-way as baseline practice.

    Intel Xeon E5 v2, Ivy Bridge to Xeon SP, Skylake

    Prior to 2012, the Intel product strategy was a desktop-type processor for 2-way systems and a big processor for 4-way+ systems. There were variations of this depending on the processor generation for various reasons. In 2012, the big processor was brought to both 2-way and 4-way systems in the Xeon E5 line. The next year, Intel employed 3 die layouts to fill their Xeon E5 v2 lineup, with the 6-core LCC, 10-core MCC and 15-core HCC.


    It was possible and practical for Intel to put 15 cores on one die in their 22nm process for a high-end product. But there was also significant demand for low- and medium-priced products, which were better suited to the cost structure of the 6-core and 10-core die, respectively. Hence the 3-die layout strategy. The E5 v2 was publicly available only with up to 12 cores; the full 15-core die was publicly available in the E7 v2 (and in the E5 to special customers?).

    In the current generation Xeon SP processors, the 3 die layout options are the LCC at 10 cores, HCC at 18 cores and XCC at 28 cores.


    The inherent nature of semiconductor cost structure is that a large die is disproportionately more expensive than a small die. Some of this is the unusable real estate due to the number of whole die that fit on a wafer. In very large die, cost is dominated by declining yield at a given defect density (see also Motley Fool). The cost structure of Intel Xeon E5 and Xeon SP processors largely fits this pattern with some market driven adjustments.
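The disproportionate cost of a large die can be illustrated with a simple Poisson yield model. The defect density, wafer cost, and die areas below are assumptions for illustration, not Intel figures:

```python
import math

def poisson_yield(area_mm2, d0_per_cm2=0.1):
    """Fraction of defect-free die under a Poisson defect model (assumed D0)."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

def gross_dies(area_mm2, wafer_diam_mm=300):
    """Rough whole-die count per wafer (a standard approximation)."""
    r = wafer_diam_mm / 2.0
    return int(math.pi * r * r / area_mm2
               - math.pi * wafer_diam_mm / math.sqrt(2 * area_mm2))

def cost_per_good_die(area_mm2, wafer_cost=8000.0):
    """Wafer cost spread over good die only; wafer cost is an assumption."""
    return wafer_cost / (gross_dies(area_mm2) * poisson_yield(area_mm2))

# Approximate LCC / HCC / XCC class die areas (assumed for illustration)
for area in (325, 485, 694):
    print(area, "mm2 ->", round(cost_per_good_die(area), 2))
```

Doubling the die area more than doubles cost per good die: fewer whole die fit on the wafer, and a larger fraction of them carry a defect.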

    The chart below shows the cost per core versus cores for the range of Xeon SP models (excluding certain specialty SKUs).


    In the main sequence, the cost per core increases as core count increases. As expected, two small die processors cost less than one large die processor having the same total number of cores as the two small ones. An example is the Xeon 6132 14-core at $2,111 versus the Xeon 8180 28-core at $10,009.
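Working out the per-core arithmetic from the two list prices cited above:

```python
# List prices and core counts cited in this post
price_6132, cores_6132 = 2111, 14     # Xeon Gold 6132
price_8180, cores_8180 = 10009, 28    # Xeon Platinum 8180

two_small_total = 2 * price_6132      # two 14-core parts = 28 cores total
print("2x 6132:", two_small_total, "for", 2 * cores_6132, "cores")
print("per core:", round(price_6132 / cores_6132), "vs",
      round(price_8180 / cores_8180))
```

Two mid-range die deliver the same 28 cores for less than half the price of one XCC die, which is the semiconductor cost structure at work.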

    At the system level, there is not much difference between the cost of motherboards that can accommodate one or two processors. It would seem then that there is logic in continuing the 2-way system as standard practice, now having 2 or 3 core count and frequency options to fit the broad range of data center requirements. This concept is valid assuming cores are equal between single and multi-processor systems.

    Memory Latency, 1-Socket versus 2-Socket

    Except that cores are not equal between single processor and multi-processor systems, most significantly in terms of memory latency. And memory latency is the key determinant of performance for a most important class of applications.

    The figure below shows a representation of the single socket Xeon E5 v4 system.


    On a single socket Xeon E5-2630 v4 10-core processor, L3 latency was 17ns and memory latency was L3 + 58ns = 75ns as measured with the 7z benchmark. Intel cites L3 latency for the E5-2699 v4 22-core processor as 18ns. Hence single socket memory latency is assumed to be 76ns for an HCC die.

    In the LCC die, there is a single ring connecting the cores and other sub-components. The HCC die has two interconnect rings, in turn, connected with a pair of switches. One might think that there would be more than 1ns difference between a 10-core LCC and the 22/24-core HCC, and it would be nice if Intel would provide an explanation.

    The figure below shows a representation of the 2-way system with Xeon E5 v4 processors.


    The 7-cpu website shows L3 at 18ns for the E5-2699 v4, consistent with Intel. Local node memory latency is L3 + 75ns = 93ns.

    The 7z benchmark does not test remote node memory latency, and Intel rarely mentioned memory latency in the Xeon E5/7 v1-v4 generations. One source cites remote node memory at 1.7X local node; another cites remote node at 50-60ns higher than local node. The value of 148ns is a placeholder used in later calculations.

    The difference between single socket memory latency and 2-way local node memory latency is attributed to a remote node L3 check to ensure cache coherency. The difference between 76 and 93ns is 22%. As memory latency is the major determinant in database transaction processing performance, 22% is huge.

    The local node latency applies for applications that can achieve memory locality on a NUMA system. The database engine has NUMA capabilities. But if the complete application and database environment was not architected to work with the NUMA features, memory locality will not be achieved.

    Xeon SP

    In 2017, Intel decided that it was important to compare memory latency for their Xeon SP against that of AMD EPYC. And hence memory latency information was in their presentations. The figure below shows 89ns for local node and 139ns for remote node memory latency on the Xeon SP 8180 28-core processor.


    In Xeon SP, L2 cache latency increased from 12 to 14 cycles on account of the increase in size from 256KB to 1MB. L3 latency increased from 18 to 19.5ns in the large die models as measured on the Intel Memory Latency Checker. It seems that the Skylake Xeon SP has lower memory latency than Broadwell Xeon E5 v4. This could be true or it could be different measurement techniques.

    The figure below is from an Intel presentation. Be aware that different values are cited by various sources.


    See the Intel Xeon Scalable Architecture Deep Dive (slide deck) and Intel Xeon Processor Scalable Family Technical Overview for more information.

    The important point of this section is that memory latency in a single socket system is significantly lower than local node memory in a multi-processor system. This is attributed to the remote node cache coherency check. The difference between 76ns for single socket and 93ns for 2-way local node memory is 22 percent. This is not trivial. A VM created with cores and memory all from the same node in a 2-socket system would need 24 cores to match 20 cores in a single socket system.
    Equivalent Transaction Performance 1S to 2S

    On a full 2-socket system, a database not architected for NUMA is expected to have 50/50 local-remote node memory access. Then average memory latency is 114ns (average of 89 and 139) for Xeon SP and 120.5ns (average of 93 and 148) for Xeon E5 v4.

    Average memory latency on the 2-way is 50-59% higher than on a single socket system. The expectation is then that performance per core on the 2-way system is 0.63-0.67 times that of the single socket system. The throughput performance of the 2-way, with twice as many cores, is 1.26 to 1.34 times higher.
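The arithmetic behind these figures, assuming performance per core is inversely proportional to average memory latency:

```python
single = 76.0                        # ns, single socket baseline (measured above)
avg_2s = {
    "Xeon SP":    (89 + 139) / 2,    # 50/50 local/remote = 114 ns
    "Xeon E5 v4": (93 + 148) / 2,    # 50/50 local/remote = 120.5 ns
}
for name, avg in avg_2s.items():
    per_core = single / avg          # per-core performance relative to 1S
    print(name, round(avg, 1), round(per_core, 2), round(2 * per_core, 2))
```

The per-core ratios come out to roughly 0.63-0.67, and doubling the core count gives 2-way throughput of about 1.26-1.33X the single socket, matching the range cited above after rounding.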

    Based on scaling of 1.35X from 1S to 2S, a 2-way system with 14-core processors is approximately equal in throughput performance to a one-socket system with 19 cores (actual product options are 18 or 20 cores).


    The single socket 28-core processor is about equivalent to a 2-socket system with 20-22 core processors.


    The single socket system also has better thread level performance in proportion to the difference in memory latency, 50-59%. This has significant value of its own. The single socket system should also not exhibit the NUMA anomalies that can occur on multi-processor systems; one example is contention between threads running on cores in different sockets.

    The cost difference for the processors may or may not be regarded as large. However, software is generally licensed on a per core basis. The licensing fee may be on the order of thousands of dollars per core versus the processor cost on the order of hundreds of dollars per core. Performance efficiency per core then becomes the more important consideration.


    The argument for the single socket as the baseline standard system is compelling. In the past, there were valid reasons for both the greater compute capability and memory capacity of multi-processor systems. In recent years, with storage on SSD, massive memory overkill is no longer necessary, even though it may still be an involuntary muscle-reflex action for DBAs.

    The reality is that, with proper design, most transaction workloads can be handled by the 16-28 cores available in a single processor, with Hyper-Threading doubling the logical processors (threads). Scaling up to a multi-processor system provides only moderate throughput gain and may create new problems, some with serious negative effects. A NUMA-aware VM recovers about half of the efficiency difference between a multi-socket and a true single socket system. This is helpful, but still falls short of the full potential by the 22% latency penalty plus the VM's own overhead.



    For anyone who remembers that single to multi-processor scaling was better in the old days: it was. My recollection is that in the Pentium II to III Xeon period, it was something like the following.



    Scaling from 4-way to 8-way was perhaps 1.5X. The reason that scaling from 1 to 2 processors was better back then was that there was little change in memory latency in going from 1S to 2S, and similarly from 2S to 4S. However, these comparisons should be made on the same processor and memory controller. The regular Pentium II and III had a slower L2 cache but a simpler memory controller. The Xeon versions had a faster and bigger L2 cache, but a more complex memory controller for much larger capacity, which also entails higher latency.

  • Quick performance check with latest patches

    There is some degree of press hysteria, lawsuits, etc. concerning the kernel vulnerabilities of Intel processors with out-of-order execution (which originated with the Pentium Pro), so I did a quick check.

    Earlier, I posted on testing with the TPC-H database and queries at scale factor 100 (in conditions that do not meet official TPC-H standards). The test system is a single socket Xeon E5-2630 v4, 10 cores, 64GB memory, SQL Server limited to 32GB, and 4 Intel P750 NVMe SSDs for storage. The baseline test was SQL Server 2016 SP1 (probably prior to SP1 CU1). The current test has the latest Windows Server 2016 hotfixes and SQL Server 2016 SP1 CU7, build 4466.

    Sum run time of the 22 TPC-H queries on the SF100 database, row-store, no compression, was 0.7% longer than before; CPU was 0.2% higher. The total database size is 126GB, so there was considerable IO, mostly scans, but some key lookups.

    Note that TPC-H scoring uses a geometric mean, which means small queries need to scale as well.

    I use straight sum because I am more concerned about big queries, as I do not expect small queries to scale well at high DOP, if at all.
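The difference between the two scoring methods is easy to see with hypothetical run times (the numbers below are illustrative, not measurements):

```python
import math

def straight_sum(times):
    return sum(times)

def geomean(times):
    return math.exp(sum(math.log(t) for t in times) / len(times))

base     = [0.7, 2.0, 5.0, 50.0]   # hypothetical query run times, sec
small_2x = [1.4, 2.0, 5.0, 50.0]   # smallest query twice as slow
big_2x   = [0.7, 2.0, 5.0, 100.0]  # biggest query twice as slow

for label, t in (("base", base), ("small 2x", small_2x), ("big 2x", big_2x)):
    print(label, round(straight_sum(t), 1), round(geomean(t), 2))
```

Doubling any one query multiplies the geometric mean by the same factor, so a regression in a sub-second query is punished exactly as hard as one in a 50-second query; the straight sum barely notices the former and is dominated by the latter.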

    Individual query variation ranged from: Q2 43% longer, from 0.67 to 0.963 sec; there is a loop join, which would do 8K IO?

    Q18 17.8% faster, from 51.5 to 42.3 sec; might be a different execution plan.

    There were 4 queries between 20-28% slower, but because Q18 was one of the big queries, its improvement washed out the high percentage impact of the smaller queries.

    From the description given of the fixes, my expectation is that the performance impact will be seen in operations that involve the kernel, which include disk and network IO. I would expect loop joins and key lookups to be more impacted than scan IO because of the number of IOs, and also network round-trip intensive applications, i.e., the ones that people said not to do.

    Edit (Jan 11)
    Never mind the above.
    I had recently set my test system to power save,
    so in the proper configuration of power - performance (Edit Jan 12: this result was with turbo boost enabled, the original was disabled):

    Linear sum, SQL Server 2016 SP1 CU7 build 4466 and OS patches vs SP1 base:
    9% faster, 12% more CPU efficient;
    individual queries range from 24% faster to 0.3% slower.
    There probably is a penalty in the recent fixes, but fixes since SP1 also made improvements?

    Edit Jan 12
    The original test was done with turbo boost disabled; the above test was with turbo boost enabled, probably running at 2.4GHz versus the 2.2GHz base.
    A new test in (my best estimate) matching conditions except for the hotfixes:
    latest SQL Server 2016 SP1 + CU about 1.7% faster, 7.6% lower CPU;
    some queries 5-9% slower, a few much faster - probably with a better execution plan, hence not related to the recent security issue.

    It does look like the ability to drive network round-trips is severely impacted. I seem to recall that a stored proc doing a 1-row index seek over 40GbE, on 1 thread, used to drive 11K round-trips per sec, now 7,900;
    peak throughput used to be 270K/s, now 146K/s?
    I need to find my old records to see if I can clear this up.

  • Rethink Server Sizing

    Standardizing on 2- and 4-socket systems for servers has been an established practice going back to 1996. There were very good reasons for this, but it was so long ago that they have been forgotten. Yet the practice continues unquestioned, almost as a reflex action ingrained in our subconscious. Back then, 2-way was the baseline, and a 4-way was for databases. Now, the large majority of standard systems are 2-way (2S) with 4S for exceptional requirements.

    It might seem that the 2-socket system continues to be a good choice, as two processors with an intermediate number of cores is less expensive than one processor with twice as many cores. An example is the Xeon Gold 6132 14-core versus the Xeon Platinum 8180 28-core processors. In addition, the two-socket system has twice the memory capacity and nominally twice as much memory bandwidth.



    So, end of argument, right? Well, no. This is because any multi-socket system, including the two-socket, has non-uniform memory access (NUMA). It so happens that database transaction processing performance is largely governed by round-trip memory access latency. Memory latency is lower on the one-socket system than on a multi-socket system unless the application has been architected to achieve a high degree of memory locality on a NUMA system. Almost zero real world databases have been architected for NUMA system architecture. So, memory access on a 2-way is most probably 50/50 to the local and remote nodes.

    The implication of the memory latency difference is that single thread performance decreases by about 25% going from 1S to 2S, with throughput scaling from 1S to 2S being around 1.55X. In this regard, two 14-core processors with 28 cores total have comparable throughput performance to a single 20-22 core processor. Given the cost of Enterprise Edition licensing, the difference in processor cost per core does not really matter. The system that can deliver the required performance with fewer cores will win on cost. The single socket system will also have fewer unusual (interpret this as bad or very bad) characteristics than a multi-socket system.

    Because of the memory latency effect, high processor frequency does not help much in transaction processing because most cycles are spent waiting. There is usually more than one frequency option at a given core count, usually with a difference in price. Finally, the expectation is that Hyper-Threading is highly effective for transaction processing, so this feature should be enabled. And in addition, Intel really needs to increase the degree of HT from 2 to 4 logical processors per core.

    Single Socket System Recap (edit 12-27)

    1. 16-28 cores with HT can handle most transaction workloads
    2. Memory capacity massive overkill not needed with storage on flash/SSD
    3. Single socket has best performance per core versus multi-socket
    4. Makes the most effective use of database per-core licensing
    5. Avoids problematic NUMA issues,
    6. Example: high volume inserts into table with identity key (or comparable)


    Update: Rethink Server Sizing II
    Full article on: Rethink Server Sizing 2017
    Related: SRAM as Main Memory
    and also on Linkedin Rethink Server Sizing 2017
    and SRAM as Main Memory

  • Memory Allocated To Plan Caching

    I could complain about Entity Framework, but there is probably nothing that has not already been aired. So instead, I will complain about the lack of tuning options in SQL Server's Memory Allocated To Plan Caching.

    As of SQL Server versions 2005 SP2 and 2008, the formula is 75% of the first 0-4GB, 10% of 4-64GB, and 5% of the remainder over 64GB, all referring to visible target memory. There was a different formula in version 2005 RTM and SP1 that allowed even more memory for the plan cache. Does it not seem that both of these, even the reduced newer version, allow for an excessively large plan cache? How in the world was this arrived at?
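The formula can be written out directly; a sketch with visible target memory in GB going in and the plan cache cap in GB coming out:

```python
def plan_cache_cap_gb(visible_target_gb):
    """Plan cache cap per the SQL Server 2005 SP2 / 2008 formula:
    75% of the first 4GB, 10% of 4-64GB, 5% of everything above 64GB."""
    first = 0.75 * min(visible_target_gb, 4)
    middle = 0.10 * max(min(visible_target_gb, 64) - 4, 0)
    top = 0.05 * max(visible_target_gb - 64, 0)
    return first + middle + top

for gb in (16, 64, 256, 1024):
    print(gb, "GB visible ->", round(plan_cache_cap_gb(gb), 1), "GB cap")
```

At 64GB the cap is the 9GB figure cited later in this post; on a 1TB server the formula still allows a 57GB plan cache, which illustrates the scale of the concern.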

    My suspicion is that the first 2005 formula derived from testing with the Pace database. Way back in the early days, before AdventureWorks, before NorthWind, someone created a database for testing. Instead of scaling up the data in this database to test ever larger environments, it was decided to scale up the number of databases.

    What this meant was that the desired plan cache size scaled linearly with data size, which is probably not true in more common situations. At some point, there was the case of a very large environment using ad-hoc queries that resulted in an excessively large amount of memory allocated to the plan cache. And so the current formula was conceded.

    It could be argued that a responsible database architect would have relied heavily on stored procedures and parameterized SQL achieving a sufficient degree of plan reuse so that the plan cache never becomes excessively large, perhaps using DBCC FREESYSTEMCACHE ('SQL Plans') on occasion, or even using a script to flush out single use Adhoc and prepared plans. But if only responsible people were involved, an RDBMS would probably have infinitesimal market share.

    In any case, SQL Server is used in an unimaginably broad range of scenarios for which no single plan cache formula is workable. Any given formula will be significantly problematic for some number of scenarios.

    It is time to admit that there is justification for Plan Cache memory to be a tunable setting. We could keep the current formula for the first 64GB (0.75x4 + 0.1x60 = 9GB) and make only the percentage of visible target memory over 64GB tunable. Additional fine grain tuning options are also necessary.

    There should be the ability to limit the percentage of plan cache allowed for adhoc and prepared statements. This could be a single limit on both, or individual controls on each. The purpose of this is to limit the damage that one crappy element of an application can cause on the whole system.

    An additional related tuning option concerns memory allocations outside of the new (version 2012) "any-size" page allocator. CLR, thread stacks and other functions (network packet size larger than 8060 bytes) make direct allocations from Windows. In 32-bit versions of SQL Server, there was the memory_to_reserve, or more correctly, virtual address space (VAS) to reserve, and the corresponding -g startup option, also known as MemToLeave.

    Edit 2017-08-28
    The default network packet size was 8192 bytes up to some point, either .NET Framework version 3.0 or 4.0. In more recent versions, this was changed to 8000 to avoid bad things from happening.

    We might think that this is no longer an issue in 64-bit versions of SQL Server. However, the essentially unlimited 64-bit VAS (actually only 44 or 48-bit VAS is implemented, depending on the version) is just that, address space. Would it be a good idea to be able to reserve physical memory for CLR and the other consumers of VAS outside of the page allocator?

    The specific situation that I encountered was one in which some elements were built on Entity Framework. One of the EF queries produced a de-normalized row set. The parent table had 13 child-descendant combinations, so the query implemented a UNION ALL over each component. The combined number of columns was 449. In the EF style, each section used the full table name as the alias for each column, which was then re-aliased, and each query had sub-queries with the same extremely verbose text.

    This resulted in a query of 348KB size. Even though this was a parameterized query, there were variations such that SQL Server had very many similar queries in the plan cache. Other elements of the environment included substantial large-string manipulations. (In a different post, I will suggest implementing the stringbuilder as a variable option in SQL Server.)
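    A query along these lines (a sketch against the standard plan cache DMVs) shows where this kind of single-use bloat is coming from:

```sql
-- Largest single-use adhoc and prepared entries in the plan cache
SELECT TOP (20)
       cp.objtype, cp.usecounts,
       cp.size_in_bytes / 1024 AS size_kb,
       LEFT(st.text, 100) AS query_start   -- first 100 chars of the statement
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE cp.usecounts = 1 AND cp.objtype IN ('Adhoc', 'Prepared')
ORDER BY cp.size_in_bytes DESC;
```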

    In this environment, there was a continuing stream of VAS allocations. System memory was 512GB. After a few days of operation, VAS reserved was 1TB, and 3TB after three weeks. VAS committed was, as expected, never more than 500GB, and there was always about 8GB of free physical memory. What appears to have happened is that SQL Server forced out most of the plan cache on VAS pressure.
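    The VAS numbers above can be watched from sys.dm_os_process_memory, for example:

```sql
-- Reserved vs committed VAS for the SQL Server process
SELECT virtual_address_space_reserved_kb  / 1048576.0 AS vas_reserved_gb,
       virtual_address_space_committed_kb / 1048576.0 AS vas_committed_gb,
       physical_memory_in_use_kb          / 1048576.0 AS physical_in_use_gb
FROM sys.dm_os_process_memory;
```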

    Almost all of the compiled plans for stored procedures were forced out, and to a lesser degree the plans for parameterized SQL. It did not seem as though any of the buffer cache for data pages was forced out. Data read IO activity was generally zero.

    As the storage system was all-flash, the preferred strategy in this situation would have been to reserve about 3-4GB of the plan cache for stored procedure plans, and maybe less than 2GB for parameterized SQL. The SQL Server engine should instead force out data pages to support direct allocations to Windows. To be more sophisticated, data pages could be forced out until there is an observable increase in data reads, or even until data read IO latency reaches some set point, as appropriate for either AFA/SSD or HDD storage as the case may be.

    In other RDBMSs, there are a great many tuning options not exposed in SQL Server. In the hands of a very knowledgeable and talented database expert, these tools can be put to good use. In the hands of someone who read a "best practice" blog without bothering to understand the rationale behind each item, tuning can result in serious degradation. So there needs to be some care in which tuning options to expose and how best to provide safeguards.

    Settable Tuning Options Needed

    1. percentage of memory above 64GB available to the plan cache
    2a. percentage of the plan cache that can be used by ad-hoc SQL
    2b. same as above for parameterized SQL, or a single combined setting
    3. settable physical memory to reserve for allocations outside of the page allocator
    4. priority option on what should be paged out on memory pressure: plan cache, data, or temp
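    To be clear, none of these options exist today. Purely as an illustration, if they were exposed through sp_configure, usage might look like the following (every option name below is hypothetical):

```sql
-- Hypothetical option names, for illustration only; none of these exist in SQL Server
EXEC sp_configure 'plan cache pct above 64GB', 2;        -- item 1
EXEC sp_configure 'plan cache adhoc pct', 20;            -- item 2a
EXEC sp_configure 'plan cache prepared pct', 20;         -- item 2b
EXEC sp_configure 'non-buffer pool reserve (MB)', 8192;  -- item 3
EXEC sp_configure 'memory pressure victim', 1;           -- item 4: prefer data pages
RECONFIGURE;
```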

    I probably said this elsewhere. The use of NAND flash for storage has become pervasive, particularly in new purchases, even though HDD-based systems may retain a large share of the installed base. We should factor in that MLC NAND has limited write life and rethink database caching strategy accordingly. Consider the Hash Match operator, which has a set point to spill to tempdb, and tempdb objects in general. Instead of writing short-lived pages to tempdb, it might be better to dump clean data pages while keeping more of tempdb in memory, up to a point of course.

    Addendum 2017-08-31
    Having a sufficient OS page file seems to alleviate the problem of SQL Server forcing the plan cache down to a dangerously low level, and adding injury to insult by blocking compiles while doing so, when data pages could have been forced out without any adverse effect.
    I am seeing Performance Counter -> Memory -> pages/sec of 6,000-10,000, but no disk activity. Does paging occur outside of physical disk perf counters?
    dm_exec_connections has a column for network packet size; I have confirmed that large packets are reported correctly.
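    For reference, the check is simply:

```sql
-- Sessions that negotiated a packet size above the 8000-byte default
SELECT session_id, net_transport, net_packet_size
FROM sys.dm_exec_connections
WHERE net_packet_size > 8000;
```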

    Plan Caching in SQL Server 2008 by Greg Low
    Plan cache and optimizing for adhoc workloads by Kimberly Tripp
    Memory configuration and sizing considerations in SQL Server 2012 and later versions
    Plan Cache Internals
    SqlConnection.PacketSize Property
    According to the above link, the default packet size was lowered from 8192 to 8000 in .NET version 3.5, but if anyone knows for sure that it was not 4.5, please advise.

    Edit: 2017-12-02:
    Social msdn

    There are 2 nodes in the availability group, identical HW, identical config (512GB, 524288MB mem), SQL Server max mem (494GB = 506000MB). LPIM in effect.

    The cache flush problem is more pronounced on one node than the other. System 1 shows 8,573MB available memory, but system_low_memory_signal_state is True. System 2 typically shows 9,818MB free, with the low memory signal state False. Given that there is almost no disk IO for data reads, I recommended reducing SQL Server max memory to 490,000MB, and the plan cache immediately recovered (from a situation where the plan cache tries to grow to several GB, then gets flushed to below 1GB intermittently).

    So apparently several GB of OS free memory is not a reliable indicator; use the dm_os_sys_memory value of system_low_memory_signal_state instead. Roughly, SQL Server max memory at 96% of system memory is too high, but 93.5% seems to be ok. It is suspected that something running on the OS periodically consumes large memory, possibly even a file transfer that gets cached by the OS, more so on one node than the other. I think this might be it, because system_cache is just a hair higher than available physical memory.
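    The relevant check, rather than OS free memory, is:

```sql
-- Memory state as SQL Server sees it, including the low memory signal
SELECT total_physical_memory_kb     / 1024 AS total_mb,
       available_physical_memory_kb / 1024 AS available_mb,
       system_cache_kb              / 1024 AS system_cache_mb,
       system_low_memory_signal_state
FROM sys.dm_os_sys_memory;
```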

    In this particular case, flushing the plan cache in the low memory state was the wrong thing to do, because there is very little IO for data pages, and storage is all-flash. SQL Server + Windows should have a way of knowing this, and hence feel free to ditch data buffer pages, not the plan cache, except perhaps the single-use parameterized SQL (from Entity Framework). Most certainly the OS should know not to use too much free memory for file caching, and never trigger the low memory signal state.

    This is my best guess.
    SQL Server backs up locally. This does not use the OS/system cache, because SQL Server backups use unbuffered IO.

    There is a robocopy job that later copies the backups to a network location. This copy goes through the system cache, because the default is buffered IO, and the Windows Server 2016 OS is not smart enough to know that for a multi-GB file transfer, buffered IO is stupid, more so when SQL Server is running and using 494GB on a system with 512GB memory. So Windows fills the remaining memory with system cache for a file copy that will never be touched afterwards (the only "benefit" of caching), then triggers the system low memory signal state.

    Then to compound this, SQL Server sees the low memory signal and decides to flush the plan cache: not just the very many single-use parameterized plans (generated unnecessarily by Entity Framework), but also the critical, frequently used procedure plans. This causes a significant jump in CPU. In some cases, there also appears to be a lock on the plan cache (could this be taken for a large dump of the plan cache?), which then causes any new calls that require a compile to be blocked.

    What is sad here is that even if SQL Server decided to cooperate with the OS in releasing memory, it could have dumped data buffers instead. Much of the data pages were infrequently used, and the storage is on flash: it could do 100,000's of IOPS if needed. There may have been some dump of data buffers, as PLE drops from 90,000 to 8,500 sec, but even then, IOPS was noise compared to what the storage could have handled. Hey dude, what's up?

    The plan is to try robocopy with the /J flag, which is supposed to do unbuffered IO; hopefully this will stop the use of the system cache. Also, I see that there is a call: %windir%\system32\rundll32.exe advapi32.dll,ProcessIdleTasks that is supposed to flush the system cache. Will try both to see if the OS free memory is truly free and not system cache (of unnecessary stuff).

    Right now, the best candidate root cause is calls to sp_xml_preparedocument,
    most of which neglect to call sp_xml_removedocument after the OPENXML operation.
    "A parsed document is stored in the internal cache of SQL Server. The MSXML parser (Msxmlsql.dll) uses one-eighth the total memory available for SQL Server. To avoid running out of memory, run sp_xml_removedocument to free up the memory."
    see openxml
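    The correct pattern is to pair every prepare with a remove, as in this minimal example (the XML content here is made up for illustration):

```sql
DECLARE @hdoc int,
        @xml nvarchar(max) = N'<orders><order id="1"/><order id="2"/></orders>';
EXEC sp_xml_preparedocument @hdoc OUTPUT, @xml;   -- parsed document cached in SQL Server
SELECT id
FROM OPENXML(@hdoc, '/orders/order', 1)           -- 1 = attribute-centric mapping
WITH (id int '@id');
EXEC sp_xml_removedocument @hdoc;                 -- always release the parsed document
```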

  • Azure database_id inconsistencies

    I am working on improvements to my SQL ExecStats tool, including Azure support. Previously, I had put try-catch blocks around everything that bombs in Azure. Now I am putting IF EXISTS tests around SQL referencing system views and procedures not in Azure, among other changes. One of the issues I encountered is that there is an inconsistency between DB_ID() and sys.databases.

    SELECT database_id FROM sys.databases WHERE name = DB_NAME()

    On my Azure database, I have just one user database. We would normally expect that the first user database has database_id = 5.

    And indeed SELECT DB_ID() does produce 5, but sys.databases shows database_id 1 for master, and 4 for my user database.

    The reason this is annoying is that ExecStats attempts to correlate index usage to execution plans, so I need consistent values to be reported by each of many queries. Most of the ExecStats code has already replaced DB_ID() with a query to sys.databases WHERE name = DB_NAME().

    But sys.dm_db_index_usage_stats reports the user database as having database_id = 5.

    So for now, I am just subtracting 1 from the database_id in sys.dm_db_index_usage_stats but I have not tested this in an Azure instance with more than 1 user DB.
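    The workaround currently looks like the sketch below; the -1 offset is an observation from a single-database Azure instance, not documented behavior:

```sql
-- Align dm_db_index_usage_stats ids with sys.databases on this Azure database
-- by shifting the database_id down by 1 (untested with more than 1 user DB)
SELECT d.name, us.object_id, us.index_id, us.user_seeks, us.user_scans
FROM sys.dm_db_index_usage_stats AS us
JOIN sys.databases AS d ON d.database_id = us.database_id - 1
WHERE d.name = DB_NAME();
```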

    By the way, the latest ExecStats is today, 2017-05-23 2017-05-25

    Edit, note date change above

    Edit 2017-06-04

    The same is true of dm_exec_sql_text, dm_exec_text_query_plan, and probably more.

  • Fixing the Query Optimizer Cost Model

    The model for the cost of operations used by the SQL Server query optimizer to produce an execution plan is rather simple. It has been largely unchanged since SQL Server version 7, RTM in 1998. Back then, SQL Server had not yet achieved tier 1 status with commensurate development resources. Some adjustments were made in version 2005. One might question how something so old can possibly be relevant today, considering how much hardware has changed over the intervening years. The cost model is still adequate for pure transaction processing scenarios in which queries have highly selective search arguments. Its limitations become pronounced in parallel execution plans.

    The salient aspect of modern hardware is that compute capability is distributed over very many processor cores, possibly with irregularities due to non-uniform memory access (NUMA). As such, the decision on parallelism cannot be just on/off with a cost threshold, and one-size-fits-all with max degree of parallelism. It is necessary to have a degree of sophistication in the employment of resources for parallel processing. For some queries, it might be best to utilize a low to moderate level of parallelism; for others, perhaps all cores in one socket; and in exceptional circumstances, the whole set of cores.

    For this, the existing SQL Server cost model does not have sufficient depth to be the foundation from which an effective strategy for deploying resources can be formulated. It should not be too difficult to build a suitable model, considering the wealth of knowledge that has (or should have been) accumulated over the years. There will be some resistance to changing the cost model given that people have become accustomed to the plans generated from the current cost model. But that could be accommodated as a legacy mode, so as to not stand in the way of progress.

    The SQL Server Plan Cost Model Overview

    The cost model of the SQL Server Query Optimizer is explored in more detail elsewhere. Other authors have written on the query optimizer itself. Only a brief description of the SQL Server cost model is given here.

    Buried in Books Online in the section on the query governor cost limit option is the statement: "Query cost refers to the estimated elapsed time, in seconds, required to complete a query on a specific hardware configuration." The reference system existed when SQL Server version 7 was in development, so this could have been a Pentium Pro system, but some people have said that it was a Pentium. My investigation of this topic began around year 2000, though I never went back to look at whether SQL Server versions 6 and 6.5 used the same model.

    Each operator in an execution plan can have two cost components, CPU and IO. The IO component represents the time to complete the IO operation. There is in fact also CPU involved in performing an IO, and it is unclear whether this is part of the IO cost or represented in the CPU component. The assumption is that leaf-level pages are not in memory at the start of query execution, but may be retained if the execution touches a page more than once. The complete cost of an operator factors in the CPU, IO, number of rows and number of executions. The CPU and IO may represent the cost of only a single execution, while the operator cost incorporates the entire operation as a function of the number of executions.

    The IO cost is based on a model of the disparate nature of hard disk performance in random and sequential access at fixed queue depth. SQL Server version 7 and 2000 had two baselines, but from version 2005 on, the IO cost model is that random IO performance is 320 IOPS, and sequential IO is 1350 pages per sec (10.8MB/s). As cost is in seconds, the random IO from key lookup and loop join operations or the first page of a scan is 1/320 = 0.003125. Each successive page in a scan or range seek is 1/1350 = 0.000740740 (the 740 sequence repeats).
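    In plan cost terms, a scan of N pages then costs one random IO plus N-1 sequential page costs, which can be checked with simple arithmetic:

```sql
-- Plan IO cost (in "seconds" on the reference system) for a scan of N pages:
-- first page at the random rate, remaining pages at the sequential rate
DECLARE @pages bigint = 1350000;  -- roughly a 10.3GB table at 8KB per page
SELECT random_io_cost  = 1 / 320.0,                       -- 0.003125
       seq_page_cost   = 1 / 1350.0,                      -- 0.000740740...
       scan_io_cost    = 1 / 320.0 + (@pages - 1) / 1350.0;
```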

    The plan cost does not model the cost of logic beyond the basic operation. For example, an aggregation of one column has the same cost as one of two or more. The interpretation of this is that the purpose of the plan cost model is to find the best index and table access sequence and method. Ancillary logic that must be executed in all cases does not affect the choice of plan.

    Hard Disks 1995-2005

    It might seem that given the pace of hardware change, such an old model cannot possibly be valid, resulting in horrible execution plans. Around 1995 or so, the high-performance HDD was 7200RPM with a sequential bandwidth of 4-5MB/s. The mean rotational latency at 7200RPM is 4ms. An average seek time of 8.5ms seems reasonable, though I have not kept any documentation from this period. This would correspond to 80 IOPS per disk at queue depth 1. So it seems curious that the SQL Server cost model is based on the random IOPS of 4 disks, but the sequential IO of 2 HDDs.

    Performance HDDs progressed first to 10K RPM around 1996, and then to 15K around 2000, with corresponding rotational latencies of 3ms and 2ms respectively. Average seek time was reduced over time to 3ms with developments in powerful rare-earth magnets. The 10K HDD could support 125 IOPS, and the 15K HDD 200 IOPS. But no further progress was made on HDD rotational speed. In the same period, hard disk sequential IO improved phenomenally, quickly exceeding 100MB/s in the fourth generation 15K HDD around 2005.

    In other words, the SQL Server cost model is based on a 1350/320 = 4.2 sequential-to-random ratio. But 15K HDDs in the mid-2000's were 100MB/s × 128 pages/MB = 12,800 pages/sec sequential versus 200 IOPS random, for a ratio of 64:1. It turns out that achieving nearly the sequential IO capability of HDDs from SQL Server required a special layout strategy, as outlined in the Fast Track Data Warehouse Reference Architecture papers, which few people followed. This was due to the fixed, inflexible IO pattern of SQL Server, which required the disk configuration to match that of SQL Server instead of being able to adjust the IO pattern to match that of the storage configuration.

    In a SAN system, where the vendor was exclusively focused on random IO, the typical configuration supported sequential performance of about 150 IOPS at 64KB, for 9.6MB/s or about 1,200 pages per sec. Quite by accident, this was approximately in line with the SQL Server query optimizer model. In any case, regardless of the actual random and sequential values, the important criterion is the sequential-to-random IO ratio, on the assumption that leaf-level pages are not in memory. And this was usually not true by the 2005 timeframe, when hot data was likely to be in memory for a well-tuned transaction processing system on a maximum-memory 4-way system.

    Queue Depth

    All the numbers cited above for HDD random IOPS were for queue depth 1 per disk. Of course, SCSI HDDs support elevator seeking, in which IO is issued at high queue depth and the disk reorders the sequence of operations for higher IOPS. The SQL Server engine facilitates this by issuing IO at high queue depth when the estimated number of executions of a key lookup or loop join is 25 or more. Below this, IO is issued at queue depth 1. But the plan cost does not have a model for HDD random IO at different queue depths.

    It should be pointed out that individual query performance optimization is not always the prime objective. For a transaction processing system that also supports reporting queries, the first priority is usually transaction latency. In this case the preferred strategy is to sacrifice reporting query performance by not issuing IO at excessively high-queue depth to keep latency low.


    SSDs started to pick up momentum around 2010. Early use was for extreme requirements; now they are the more practical choice over HDDs for almost all mainline line-of-business systems. This comes at a time when the system may also happen to have enough memory for almost the entire database. With data on SSD, the access times for sequential and random IO might reflect a lower ratio than the optimizer model, as opposed to the much higher ratio of HDDs in the FTDW RA. Still, the CPU cost of loading 64KB or more into memory with a single IO is lower than that of evicting pages and loading multiple individual pages, so the scan should have a lower cost per page.

    There is a less appreciated aspect of SSD storage in being able to achieve massive IOPS capability far in excess of what is really necessary. In an HDD system, it was possible to support sufficiently high random IOPS at less than excessive latency for transactions. It was also possible to achieve high bandwidth for large block IO in support of DW. What was not entirely practical was to simultaneously support low latency IOPS for transactions and high bandwidth for DW. This is now possible with an SSD storage system and this should be the configuration objective. It would be helpful for SQL Server to implement the correct IO queue depth strategy for this capability.

    Key Lookup - Scan Cross-over

    For pure transactions, queries have a highly selective search argument. In such cases, the plan cost model is not hugely important, regardless of wild differences between the plan cost model and various implementations of real hardware. The cost model is more important for queries involving an intermediate number of rows, in which the true cross-over point from an index seek with key lookup to scan operation is of greater interest. But even then, we may prefer one of the two for other reasons.

    Parallel Cost Model Inconsistency

    There is a curious inconsistency in the cost model for parallel execution plans. The CPU component of operator cost is scaled down in proportion to the degree of parallelism, except for the last doubling of the number of processors (cores). At DOP 2, the CPU cost is half of that at DOP 1. On a system with 20 logical processors and unrestricted parallelism, the CPU cost is reduced by a factor of 10, regardless of whether there are 10 physical cores with Hyper-Threading or 20 cores without HT. The inconsistency occurs in the IO cost: there is one model for source access, that is, seeks and scans on tables or indexes, and a different model for intermediate operations, examples being Hash Match and Sort operators.

    Parallel Cost - Table and Index IO

    The IO component in table and index access operations, including index seek, scan, and key lookup, does not change with parallelism. The interpretation of this is that the IO system is saturated by a single thread, and hence a parallel plan does not reduce IO time in accesses to tables and indexes. This was a very reasonable assumption for the original reference system having 2 or 4 HDDs.

    It is not an accurate representation for scan operations on a system configured to the FTDW RA, capable of multi-GB/s IO bandwidth. The saturated IO model might be somewhat applicable for random IO, because a single processor core from about 2006 on could drive 50K IOPS, depending on the situation, and only a very large array of perhaps 200 15K HDDs could support that volume at queue depth 1 per HDD.

    On a modern system with properly configured SSD storage, the expectation should be that both sequential and random IO scale with parallelism, fully up to the number of physical cores. An example of a properly configured system is the test system from Parallel Execution in SQL Server 2016: one Xeon E5 2640 v4 10-core processor with 4 Intel P750 PCI-E SSDs.

    Parallel Cost - Hash and Sort Operations

    Hash and Sort operations on a small data set have only a CPU component. When the number of rows times the row size exceeds a threshold, there is an IO component as well. The threshold is some fraction of memory, either max server memory or system memory if the former is unlimited; it appears to be just over 5MB per GB of memory. This value is also per thread: at higher DOP, each thread can accommodate up to the threshold. Beyond the threshold, the IO cost grows somewhat faster than linearly, at perhaps IO 25 for 10MB and 300 for 100MB. However, the IO cost in hash and sort operations does scale with parallelism, unlike the IO cost in index seek and scan operations.

    This might seem to correspond to the min memory per query configuration setting and the Resource Governor max memory setting in actual runtime behavior, but it might be that the plan cost model only factors in the max server memory setting.

    Plan Cost Scaling - Parallelism

    Below is the ratio of plan cost at DOP 1 to higher DOP between 2 and 10 for the TPC-H SF 100 database on the Xeon E5-2630 v4 system with 64GB, but SQL Server limited to 32GB. In essence, this is the predicted speedup from parallelism by the plan cost model.

    (chart: plan cost ratio, DOP 1 versus DOP 2-10, TPC-H SF100 on the test system)

    In several queries, there is very little reduction in plan cost at higher DOP. This occurs when the bulk of the costs are from the table and index access operations, in which parallelism does not reduce IO cost, only the CPU cost is reduced.

    In a few queries, examples being 8, 16 and 18, there is significant scaling in plan cost. This occurs primarily when much of the cost is in the hash match or sort operations, in which case, both CPU and IO are reduced with parallelism.

    In any case, actual scaling is always better than plan scaling, as shown below. See Parallel Execution in SQL Server 2016 for a more complete discussion.

    (chart: actual versus plan-predicted scaling on the test system)

    The SQL Server plan cost model has very little predictive value on the matter of scaling. It only predicts scaling when a significant portion of the cost is in the IO component of hash and sort operations; when the cost is largely in the IO of table and index scans and range seeks, the plan predicts weak scaling. In fact, many operations scale well when the IO subsystem is not saturated by a single thread.

    True Cost versus Plan Cost

    As mentioned earlier, the plan cost model was based on a reference system that was probably a Pentium processor from the mid-1990's. In that era, each successive generation of processor increased rapidly in performance, so it would not have made sense to recalibrate the cost model to each newer generation. Furthermore, it does not really matter what the processor is: plan cost tends to be dominated by the IO component of table and index accesses, based on 320 IOPS random and 10.8MB/s sequential, or by the IO cost of hash and sort operations. Still, one might be curious as to how plan cost correlates to actual cost.

    The sum of the plan costs for the 22 TPC-H SF100 queries at DOP 1 is 200,839.65. Note that many of the execution plans have large hash operations, for which the IO cost is dependent on system memory. The sum of the execution times for the same 22 queries, also at DOP 1, is 3,515.15 sec, although it would be less if some queries were manually tuned, resulting in higher plan cost but lower runtime. The ratio of the sum of the plan costs to actual runtime is 57.135.

    Curiously, this is the approximate order-of-magnitude difference between a Pentium and a modern processor core. There was no particular expectation of this, because the plan cost model represents mostly IO wait, while the test system has sufficiently powerful IO that the ratio of worker time to elapsed time was 98%.

    The figure below shows actual query runtime divided by plan cost scaled by 57.1, for the 22 TPC-H queries.

    (chart: actual runtime divided by plan cost scaled by 57.1, 22 TPC-H queries on the test system)

    The variation in actual execution time relative to plan cost scaled by 57 ranges between 0.32 and 2.53. This is actually not a bad range, because the query optimizer plan cost does not attempt to model the true cost of a query, only the methods of table and index access.

    The two outliers at either end are Q2 at 0.32 and Q13 at 2.53. The critical part of the plan for Q2 is shown below.


    The largest portion of the plan cost is the loop join to PARTSUPP. As pointed out earlier and elsewhere, the plan cost for random IO is fairly high, while most of the operations in the TPC-H queries are scans or range seeks. The bulk of the weight in the measured 57X scaling relative to plan cost is therefore driven by the plan cost of sequential IO.

    Also, disregarding the plan cost model, the SQL Server engine actually issues IO at high queue depth when the estimated number of rows is more than 25. The expectation is that a storage system with 4 NVMe PCI-E SSDs blows through the loop join faster than the 57X average scaling.

    The other extreme is Q13, for which the core plan component is below. There is nothing unusual in the plan.


    However, in the query search argument is the following:

      AND o_comment NOT LIKE '%special%requests%'

    This expression is expected to be expensive to evaluate, but it is not modeled in the plan cost, being treated like any other predicate.


    With compression, the plan cost for the 22 TPC-H SF100 queries at DOP 1 decreases by 32% from 200,839 to 136,792, reflecting lower IO cost from compressed tables. The actual execution time increases by 21% from 3,515 to 4,275sec, reflecting the overhead of compression significantly outweighing the reduced IO.

    Cost Threshold for Parallelism

    The original scale of plan cost was that 1 represented a simple estimate of execution time in seconds. This has long since lost its original meaning; for all practical purposes, from a few years after its arrival, we treated cost as some arbitrary unit of measure, except for the fact that plan cost is used in the cost threshold for parallelism. In the original model, the intent was that the query optimizer does not consider parallelism until the estimated execution time of the non-parallel plan exceeds 5 sec, which was once a very reasonable value.

    Today, plan cost 5 might be a query that executes in less than 0.1 sec without parallelism. This is most definitely well below the point at which parallelism should be engaged. The overhead of coordinating multiple threads is heavy in relation to the actual work itself. But we should also not wait until plan cost 285, corresponding to 57 × 5, for the original 5 sec duration. Part of this is because people today are far less patient than in the previous era. The other part is that today we have so many processor cores at our disposal that we can now be more liberal in their use than before.
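    The cost threshold itself is at least settable; a common starting point (a judgment call, not an official recommendation) is to raise it well above the default of 5:

```sql
-- Raise cost threshold for parallelism from the default of 5;
-- the value 50 here is a judgment call for a modern system
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
```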

    Parallelism Operators

    Having discussed both the cost model and, to some degree, actual scaling, it is now time to lay the groundwork for a new model of plan cost and parallelism. A parallel plan involves splitting the work among multiple threads. This requires additional work not present in a non-parallel plan, comprising one or more of: distribute, repartition or gather streams.

    The effectiveness of a parallel plan depends both on the ability to divide the work with some degree of uniformity, and on the extra work necessary to coordinate multiple threads not being too large. There is also some effort necessary to start a thread or to acquire threads from a pool.

      The Bitmap Operator

    There is also another factor in parallelism. As discussed elsewhere, there are a number of queries in which the parallel execution plan runs with lower CPU time than a non-parallel plan, which should not happen. It turns out that the parallel plan has a bitmap operator that is not in the non-parallel plan.

    The following is from Microsoft TechNet on Bitmap Showplan Operator: " SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering. "

    As it turns out, the bitmap operator might improve plan efficiency by up to 30% depending on the circumstances. Microsoft could allow the bitmap operator to occur in non-parallel plans above a certain cost threshold, that is, a separate cost threshold for the bitmap operator from the cost threshold for parallelism. I would not hold my breath on this.

    Processor and System Architecture

    The foundation of the parallelism strategy must encompass the nature of the modern processor and system. The largest Intel Broadwell EP/EX of 2016 has 24 cores, the full set of which is available in the Xeon E7-v4, but only 22 in the E5-v4.

    While we may have been conditioned by past history into viewing a server system as having multiple processor sockets, the fact is that a system having a single processor socket today can be incredibly powerful. There are significant benefits to a single socket system. One is that all memory is local. Another substantial benefit is that the effort to coordinate between multiple threads is low. This leads to excellent scaling characteristics even when there appears to be significant inter-thread redistribution activity.

    For those who cannot let go of the multi-socket mind-lock, below are representations of a 2-socket Xeon E5 and a 4-socket Xeon E7.

    The inescapable fact is that the cost of coordinating between threads running on cores in different sockets is very substantial. And so, there is a very substantial degradation in scaling when threads are on different sockets. For this reason, the parallelism strategy must respect the nature of the system architecture. Low to moderate parallelism must use cores on a common socket. There should probably be one strategy for employing parallelism up to the number of cores in one socket and a separate strategy for high parallelism spanning cores on multiple sockets.

    A Modern Parallelism Strategy

    With this knowledge, the parallelism strategy needs to blend the method for partitioning the work with the nature of the modern server processor and system. There are very many cores, i.e., compute resources that can be allocated to query plans, in either a single or multi-socket system. For this, we need something more than the very basic, legacy configuration controls: 1) Cost Threshold for Parallelism and 2) Max Degree of Parallelism. To a Connect request on this matter, Microsoft replied that the newer additional controls in Resource Governor are sufficient. In essence, the SQL Server team is abdicating its responsibility to provide an intelligent mechanism that properly utilizes the capabilities of a modern system.

    The first step might be to rescale the plan cost so that the cost threshold for parallelism is in more understandable terms. This is not absolutely essential, and we could just state that plan cost 60 might correspond roughly to 1 second. It did not make sense to rescale plan cost in the days when processor core performance changed substantially with every generation. But performance at the core level over the last several years has been more stable, improving at a slow or moderate pace, and may continue to do so barring a radical event. We might take this opportunity to reset the plan cost.

    Second, the cost model needs to be more comprehensive. The cost of touching each row in a scan or other sequential access is larger than the current model represents and should be corrected. For this, the impact of lock level might be considered. The cost of touching each column is both non-trivial and not accounted for at all in the current model. The cost of ancillary logic also needs to be modeled. And of course, the cost of access to compressed pages should be accounted for. The above items may not impact the structure of the execution plan, which is why they were not considered back in version 7, or in 2008 for compression. But they are relevant in the cost threshold for parallelism assessment, and also impact scaling projections.

    Third, the IO cost model needs to be fixed. This might involve modeling the actual IO subsystem performance characteristics. Or it might just assume that IO performance scales with parallelism. Also important is that parallelism scaling should not stop at half the number of cores, instead going up to the full number of physical cores. It would help if the optimizer knew the difference between physical and logical cores. The IO cost in hash and sort operations already scales with parallelism, so fixing the IO cost for source accesses would make the complete model consistent.

    As a rough guess, (current) plan cost 30, corresponding to roughly one-half of one second execution time, might be a good default cost threshold for parallelism, settable of course. For a query with plan cost 30 expected to execute in 0.5 CPU-sec (worker time), a DOP of 2 and not higher might be a good idea. One suggestion for the DOP strategy might be nonlinear: employ DOP 2 at the base value of the cost threshold, then double the DOP for every 4× increase over the base cost. If the base is 30, then plan cost 120 (4×30) is the threshold for DOP 4 and 480 (16×30) is the threshold for DOP 8. This could be the strategy for DOP up to the number of cores in one socket, with one threshold for the base value and another for escalating the DOP.
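The suggested escalation rule can be sketched as follows; the base threshold, the 4× escalation factor and the cap at the per-socket core count are the assumptions from the text, not anything SQL Server implements.

```python
# Sketch of the suggested nonlinear DOP rule: DOP 2 at the base cost
# threshold, then double the DOP for every 4x increase over the base,
# capped at the number of cores in one socket. Illustrative only.

def suggested_dop(plan_cost, base=30, max_dop=10):
    """Map an optimizer plan cost to a degree of parallelism."""
    if plan_cost < base:
        return 1                      # below the threshold: serial plan
    dop, threshold = 2, base
    while plan_cost >= threshold * 4 and dop * 2 <= max_dop:
        dop, threshold = dop * 2, threshold * 4
    return dop

# Base 30: cost 120 (4x30) reaches DOP 4, cost 480 (16x30) reaches DOP 8.
```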

    The expectation is that scaling is not so simple or excellent for parallelism on cores in different sockets. It would be good if the plan cost could model this. In any case, there might be an even higher threshold for employing cores over multiple sockets. We should not normally set DOP to be slightly over the number of cores in one socket. It should be up to the number of cores in one socket for most circumstances, or most of the cores of multiple sockets for exceptional circumstances. Furthermore, if the desired number of cores in one socket is not available, then reduce the DOP rather than employ cores on different sockets.


    Both the cost model and the simple controls for parallelism have become seriously obsolete for modern high core count processors. An improved cost model, encompassing cost structure that was not necessary for an optimal single thread table access strategy, is now needed for the parallelism strategy. The cost model needs to correctly model parallelism on modern hardware, which means fixing the IO cost model that currently assumes a single thread saturates the IO. A proper parallelism strategy also needs more complexity than the current on/off decision based on the cost threshold for parallelism, and the all-or-nothing of a single MAX DOP setting. A moderately more complex model has been suggested here, but alternatives are possible.

  • Parallel Execution on SQL Server 2016

    There are a number of interesting questions in parallel execution performance that can be investigated using the TPC-H kit, which has a data generator. There are also supplemental files that accompany published results. Important questions in parallel execution are: scaling versus degree of parallelism (DOP) on physical cores, Hyper-Threading, compression and columnstore. (Do memory-optimized tables support parallel plans in 2016? Perhaps in a later post.) To be clear, the TPC-H data generator (dbgen) and queries are used here, but no attempt is made to adhere to the TPC-H benchmark requirements (there are some outright departures). The material reported here is only for the purpose of studying parallel execution characteristics.

    In brief, the findings are as follows. The average performance gain from Hyper-Threading over the 22 TPC-H queries is 23%, in line with characteristics observed in SQL Server version 2012. (Official TPC-H results use a geometric mean.) There used to be an anomalous behavior in SQL Server 2012 in which performance was better with HT enabled and the second logical processor idle than with HT disabled, even for single-threaded execution, as reported in Hyper Threading. Compression has an average overhead of 20% across all DOP levels. For SQL Server 2008, I had reported at SF 10 that compression overhead was 40% between DOP 1-8, then decreased to less than 10% at DOP 32. The test circumstances were different, and may have been related to the NUMA effects of an 8-way system. Columnstore performance is excellent and should be considered whenever practical.

    • Hyper-Threading performance boost 23% in parallel DW queries
    • HT anomalous performance gain in version 2012 at DOP 1 no more
    • Page compression overhead 20%, down from 40% in 2008
    • Columnstore should run one per physical core, HT can be enabled
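On the difference between the two averages mentioned above: the geometric mean used in official TPC-H reporting damps outliers relative to the simple arithmetic average used here. A small illustration, with made-up per-query speedup values:

```python
# Arithmetic vs geometric mean of per-query speedups.
# The speedup values are invented for illustration, not measured data.
import math

speedups = [1.05, 1.10, 1.25, 1.80]   # hypothetical per-query HT speedups

arith = sum(speedups) / len(speedups)
geo = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# The geometric mean is always <= the arithmetic mean, so a single
# outlier query (1.80 here) inflates the arithmetic figure more.
```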

    Test Environment

    The test system is a single socket Intel Xeon E5-2630 v4 10-core 2.2GHz processor with 64GB memory. Storage is 4 Intel P750 PCI-E NVMe SSDs for data and temp, and one SATA SSD for logs. The software stack is Windows Server 2016 and SQL Server 2016 SP1. SQL Server was limited to 32GB for results reported here.

    The data is generated using dbgen from the TPC-H kit at scale factor 100. This used to mean the Lineitem table is 100GB, but with a 4-byte date for three columns that were formerly 8-byte datetime, is now 80GB in row-store without compression (still 600M rows). Overall database size is 140GB. With page compression, the Lineitem table is 46GB data and overall database size is 101GB. In columnstore, total database size is 44GB.

    Elsewhere, I have argued that performance characteristics are very different between scaling with multiple cores on a single socket and scaling with cores over multiple sockets. Multi-socket scaling cannot be investigated here. Conclusions should not be extrapolated beyond what is tested.

    Baseline, no compression, row store

    The figure below is speedup (elapsed time ratio) at various degrees of parallelism relative to DOP 1 for the baseline row-store, no compression database. Note that there are 10 physical cores. The SQL Server engine did seem to run parallel plans on separate cores except when DOP exceeds the number of cores.


    Queries 5 and 15 are worse at DOP 2 than DOP 1, at 0.675 and 0.802 respectively, and Q3 is slightly slower at 0.948 relative to DOP 1. In queries 3 and 5, the negative scaling from DOP 1 to 2 is caused by a change in the execution plan structure, excluding parallelism operators.

    Below is a focused look at scaling from DOP 1 to 2. As noted, three queries are worse at DOP 2 than 1. Several queries are almost exactly 2X better, as expected in the ideal case for a parallel execution plan.


    And then there are also queries that are more than 2X better at DOP 2 than 1. Q17 is 4.3X faster at DOP 2 than 1. Four others are more than 3.5X faster. Usually, these have the bitmap operator in the parallel plan, which is not used in a non-parallel plan. The bitmap operator has some startup cost, but can process large data sets more efficiently.

    The reasonable assumption is that large queries have a parallel plan, so there is no need to employ the bitmap in non-parallel plans. There are other reasons for super-scaling. One is the set points for hash and sort operations spilling to tempdb. The setting is fixed as a fraction of system memory for each thread. At higher DOP, more memory is allocated across all threads of the parallel plan.
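The spill set-point effect can be sketched as follows; the per-thread grant figure is a made-up illustration of the mechanism, not SQL Server's actual memory grant algorithm.

```python
# Sketch of the spill set-point effect: if each thread of a parallel plan
# gets a fixed memory fraction, the total workspace grows with DOP, so a
# hash or sort that spills to tempdb at low DOP may fit entirely in memory
# at higher DOP. All numbers are illustrative assumptions.

def spills(input_mb, dop, per_thread_grant_mb=256):
    """True if the operator's input exceeds the total granted workspace."""
    total_grant = per_thread_grant_mb * dop
    return input_mb > total_grant

work = 700   # MB the hash join must hold (hypothetical)
```

Avoiding a spill eliminates the tempdb writes and reads entirely, which is one way a query can be more than 2X faster at DOP 2 than DOP 1.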

    Below is the scaling from DOP 2 to 10, which removes the wide fluctuation that occurs between DOP 1 to 2. In the ideal parallelism case, scaling should be close to 5.

    Overall, the DOP 2 to 10 scaling is more uniform, except for Q15, in which the poor scaling from 1 to 2 is recovered at DOP 4 and higher. Q15 scaling from 1 to 4 is 3.346, which is not bad, good even.

    Performance is largely comparable between HT enabled and disabled for DOP levels up to the number of physical cores, 1-10 in this case. It was reported in Hyper Threading that single-threaded execution ran 24% faster with HT enabled over disabled. This might seem paradoxical. A speculative explanation is that the SQL Server engine tries to work in batches that fit inside the L2 cache. There is an API for the L2 cache size, but how much of it should be assumed available to a particular operation? If the assumed value is too high, but the HT-enabled case halves it, then it is possible that DOP 1 with HT enabled could be faster.

    Below is the effect of enabling Hyper-Threading (HT) with DOP 20, over disabling HT in UEFI (formerly BIOS) for DOP 10.


    Overall, the benefit of HT is strong at 1.23X. Q2 shows a 0.9X reduction, going from 0.61 to 0.67sec runtime, but this is probably just the effect of running a small query at very high DOP. It would be interesting to know which logical processors are employed when DOP exceeds the number of physical cores on one socket, in a multi-socket system with HT enabled. Scaling over cores in different sockets is not as good as over all cores in one socket, so perhaps the 23% gain from HT might be acceptable?

    Special cases

    In queries 3 and 5, there is a good plan at DOP 1, but less efficient plans for DOP 2 and higher. The result is negative scaling to DOP 2, and poor scaling at higher DOP relative to DOP 1. The query optimizer model believes that two or more plans have similar costs. There are differences in how the cost for these plans vary with the degree of parallelism, and hence the winning plan changes with DOP.

    Both Q3 and 5 can be rewritten to force a particular plan. Note that the special feature Date Correlation Optimization was designed specifically for TPC-H, and probably no production database uses it. When a query plan is forced, the date correlation optimization is also turned off. Since we are interested in scaling for a particular plan, the full set of implied date range SARGs are explicitly specified.

    The figure below shows Q3, 5, and 7 scaling relative to DOP 1 for natural and forced execution plans.


    In Q3 and Q5, the natural plan from DOP 1 is forced in the parallel plans. This results in good scaling at all DOP levels. Below is the execution plan for Query 3 at DOP 1 without other hints having cost 1349, execution time 28.38sec.

    Q3 DOP1

    Below is the plan at DOP 2 and higher, without hints, having cost 1057 at DOP2. Execution time is 29.94 sec, slightly longer than DOP 1.

    Q3 n

    On forcing the parallel plan to the same join order of the non-parallel plan, the cost is 1248.

    Q3 n

    In actuality, the above plan having the join order of the non-parallel plan executes faster, at 8.564sec versus 29.94sec for the unforced parallel plan.

    Below is the natural plan for query 5 at DOP 1, having cost 4445, and execution time 72.70sec.

    Q5 a

    The natural parallel plan for Q5 has the structure below. In this case, the query was written to force the join order of the parallel plan, and then set to DOP 1. This is done so that the plan structure is easier to discern without excessive complexity of the parallelism operators. The plan cost at DOP 2 is 3243.31 and execution time 107.62, significantly longer than for DOP 1.

    Q5 a

    If the plan at DOP 2 had been forced to the same join order as at DOP 1, the plan cost would be 3244.02, or 0.022% higher. In fact, the forced plan is much quicker at 23.469sec than both the DOP 1 plan and the natural DOP 2 plan.

    Q15 is the other query that has negative scaling from DOP 1 to 2, then miraculously recovers at DOP 4 and higher relative to DOP 1. This is a strange case. The execution plans are similar at the various DOP levels, but there is a slight difference at DOP 2 that is not present at the other levels.

    It is not always the case that the non-parallel plan structure is better than the parallel plan structure. Q7 has a super-scaling gain of 2.969X from DOP 1 to 2, an outstanding gain of 1.908 from DOP 2 to 4, then another super-scaling gain of 3.072 from DOP 4 to 8, before trailing off, all with HT enabled.

    However, Q7 performance is lower at DOP 2-10 with HT disabled than with HT enabled, when the results should be similar. This results in Q7 showing an incredible 1.8X gain from HT disabled DOP 10 to HT enabled DOP 20. With HT enabled in UEFI, the gain from DOP 10 to 20 is 1.25X, which is very good but not irrationally high.

    Below is the natural plan at DOP 1, cost 4780 and execution time 114.96sec.

    Q7 a

    It changes to another plan at DOP 2, and then a third plan at DOP 4, for which the structure is shown below, but forced to DOP 1: cost 5840, execution time 102.074sec. The forced DOP 2 plan cost is 4614, execution time 31.895sec, whereas the natural DOP 2 plan cost is 3710.8, execution time 38.72sec.

    Q7 a

    In this case, it is the parallel plan at DOP 4 and higher that runs faster at all DOP levels instead of the DOP 1 or 2 plans. In forcing the plan to the DOP 4+ plan, normal good scaling results are observed.

    Query Optimizer Cost Model

    (this should be a separate article)
    The root cause of most of these anomalies is that the cost formulas for execution operations are based on a very simple, fixed model. The model is adequate for transaction processing in which queries have highly selective search arguments. But the model is poor at assessing the cross-over point between index seek plus key lookup and a scan.

    The SQL Server query optimizer employs a fixed model for IO cost. The model is 320 IOPS for operations assumed to be random (key lookups and loop join inner source) and 1350 pages/sec × 8KB/page = 10,800KB/s for sequential IO operations (scans and range seeks). The absolute values of random and sequential IOPS are not important, as it is the ratio that affects the execution plan.
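The implied constants and the resulting seek-versus-scan cross-over can be worked out directly. The function below is an IO-only simplification; the actual plan cost also includes CPU components.

```python
# The optimizer's fixed IO cost constants cited above: 320 random IOPS
# and 1350 sequential pages/sec. Only their ratio matters for plan choice.

RANDOM_IO_COST = 1 / 320    # plan-cost units per key lookup / loop-join probe
SEQ_PAGE_COST = 1 / 1350    # plan-cost units per sequentially scanned page

def lookup_beats_scan(estimated_rows, table_pages):
    """IO-only cross-over: seek plus key lookups vs. full scan."""
    return estimated_rows * RANDOM_IO_COST < table_pages * SEQ_PAGE_COST

# One random IO "costs" the same as 1350/320 = ~4.2 sequential pages,
ratio = RANDOM_IO_COST / SEQ_PAGE_COST
# so lookups win only when estimated rows are under ~23.7% of the page count.
```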

    Furthermore, the cost model assumes that IO is saturated at DOP 1 for accesses to source tables and indexes. A parallel plan does not reduce the IO cost in source accesses. IO cost is reduced in accordance with DOP in intermediate hash and sort operations. For storage on a small number of hard disks, IO is saturated at DOP 1. A very large HDD array is capable of scaling sequential IO with parallelism. A proper SSD array is capable of that and also scaling random IO.
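The asymmetry shows up clearly in a toy cost formula; the component costs below are illustrative numbers, and only the source IO term fails to scale with DOP, as in the actual model.

```python
# Toy version of the asymmetry described above: CPU cost and intermediate
# hash/sort IO scale down with DOP in the optimizer's model, but source
# access IO does not (it is assumed saturated at DOP 1).
# The component costs are illustrative, not real plan cost values.

def plan_cost(cpu, source_io, intermediate_io, dop):
    return cpu / dop + source_io + intermediate_io / dop

serial = plan_cost(cpu=80, source_io=40, intermediate_io=20, dop=1)
par_8 = plan_cost(cpu=80, source_io=40, intermediate_io=20, dop=8)

# An IO-heavy scan query is modeled as barely benefiting from parallelism,
# even though a proper SSD array would scale the source IO as well.
```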

    Why does the cost model assume saturated IO for source accesses but scalable IO for intermediate non-source operations? It is as if one developer designed the model for source accesses and another the model for intermediate operations, but they did not talk to each other to present a consistent picture. The two developers may sit in adjacent cubicles.

    A second issue is that the cost model underestimates row access cost, and essentially does not model column and logic costs. This is evident in the plan cost for a query doing a row count versus aggregating one or more columns. There is no difference between a count and any number of column aggregations. There is a small cost assessed for the first logic operation (for example, multiplying two columns), but not for additional columns or logic. In actuality, column and logic costs might be the larger part of an aggregation query.

    The combined effect of the query optimizer cost model limitations is that it has no meaningful capability to predict the degree of scaling to be expected with parallelism. There are only two controls: the cost threshold for parallelism, and the max degree of parallelism. The query optimizer employs parallelism when the parallel plan has a model cost of just slightly lower than the non-parallel plan and if the non-parallel plan cost is over the threshold.

    This was not a huge issue back when big systems had a total of four processor cores. A modern system has up to 24 cores per socket in the current generation, and typically multiple sockets. The all-or-nothing strategy is seriously inadequate. It is now important to have the ability to assess what degree of parallelism to employ. This should be determined based on the query cost and expected scaling characteristics.

    A query that executes in a few seconds might employ a low DOP, perhaps 2 or 4, but only if the gain from parallelism is sufficient. For example, 40% could be considered sufficient, but 10% is not. An intermediate-large query might employ up to all the cores in one socket, again so long as there is scaling with increasing DOP. Very large queries might employ cores in all sockets, but only if it can be predicted that the execution plan scales well across multiple sockets.
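One way to express this rule: keep escalating DOP only while the predicted marginal gain from the last doubling stays above a threshold. The scaling model below is an Amdahl-style assumption for illustration, not an actual optimizer prediction.

```python
# Sketch of DOP selection by predicted marginal gain: keep doubling DOP
# only while the last doubling is predicted to give at least a 1.4x
# (40%) improvement. The 40% threshold and the scaling model are the
# illustrative assumptions from the text.

def pick_dop(predicted_time, max_dop=10, min_gain=1.4):
    """predicted_time(dop) -> seconds; returns the chosen DOP."""
    dop = 1
    while dop * 2 <= max_dop:
        gain = predicted_time(dop) / predicted_time(dop * 2)
        if gain < min_gain:
            break        # the next doubling is not worth the resources
        dop *= 2
    return dop

# Amdahl-style model: 80% of the work parallelizes, 20% does not.
model = lambda dop: 0.8 / dop + 0.2
```

With this model the DOP 4 to 8 step only gains 1.33X, below the 40% bar, so the query settles at DOP 4 even though more cores are available.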

    Page Compression, row store

    Below is the compression overhead at various DOP levels, relative to no compression. (The earlier comparison cited was SQL Server 2008 SP1 on an 8-way quad-core Opteron at SF 10, all data in memory.)


    Q5 has the good execution plan at DOP 1 and 2, before shifting to the poor plan at DOP 4 and higher. However, at DOP 2 the comparison is between a good plan on the database with compressed tables and a poor plan on the database having tables without compression, leading to the anomaly. Overall, the compression overhead is a fairly uniform 1.2X at all DOP levels, but can range from being nearly free to a high of 1.4X. The individual values over 1.4X are probably anomalies rather than representative.

    This is different than in 2008R2, in which compression overhead was 1.4X at DOP 1, but then gradually decreased to being essentially free at higher DOP. The test system back then was an 8-way Opteron quad-core, in which the effects of NUMA might have been significant. The hypothesized explanation was that without compression, there was contention between threads as DOP increased, decreasing scaling. The extra overhead of compression just happened to alleviate the contention.

    Presumably the explanation now is that the SQL Server engine team did a good job of removing contention where possible, so that there is better scaling without compression. Then the overhead of compression is not masked. Or it could be because of the difference in the system, NUMA and not.


    Below is columnstore performance at various DOP, relative to row store at the same DOP. In some queries, it can be almost 30X better. Overall, columnstore is more than 4X better than row store.


    Note that columnstore at DOP 20 with HT enabled is very poor. This is expected behavior: columnstore marches through memory sequentially, so HT is expected to interfere with columnstore code.

    The figure below shows the large negative impact of DOP 20 relative to DOP 10, although four queries showed modest gain.


    There was little difference between DOP 10 with HT disabled or enabled. Enabling HT in UEFI is not harmful for columnstore, so long as max DOP is limited to the number of physical cores in columnstore queries.

    The more proper comparison is between columnstore at DOP 10 and row store at DOP 20 with Hyper-Threading, as shown below.


    The overall gain for columnstore is now 3.45. Only Q13 shows no gain for columnstore, but no query is worse than row store. Q13 was one of the few columnstore queries that did show gain with HT, so columnstore is better than row store at matching DOP and HT.

    The figure below shows columnstore scaling versus DOP.


    Columnstore scaling is uniformly good from DOP 1 to 4, but becomes erratic at DOP 8 and 10.


    Parallelism is a complicated topic. The main guidance points were highlighted at the beginning. In addition, make few or no assumptions and always investigate. Evaluate performance characteristics at different levels of parallelism, DOP 1, 2, 4 and higher. Watch for scaling anomalies, and changes in execution plan structure. Be especially careful in comparing scaling past the number of cores in a single socket.

    Hyper-Threading does have benefit in parallel plans, but mostly for row store. It is probably advisable to enable HT in the UEFI/BIOS, but to limit columnstore queries to only the number of physical cores. It would be better if Microsoft did not employ HT for columnstore unless they know exactly when HT might benefit. Compression does have overhead, but might still be worthwhile.

    The larger issue is that SQL Server seriously needs to overhaul the query optimizer cost model. Much of the SQL Server engine is very sophisticated, and yet the cost model has not changed substantially since version 7 in 1998. As is, the execution plan cost has weak bearing on reality, and it has no ability to predict scaling with parallelism, issues in scaling beyond a single socket, or when scaling to HT works. Without a proper cost model, the benefit of a sophisticated query optimizer is lost. For the time being, developers and DBAs can (re-)learn the method of writing SQL to a particular join order.

  • Rethinking System Architecture

    Server system memory capacities have grown to ridiculously large levels far beyond what is necessary now that solid-state storage is practical. Why is this a problem? Because the requirement that memory capacity trumps other criteria has driven system architecture to be focused exclusively on low cost DRAM. DDR DRAM, currently in its fourth version, has low cost, acceptable bandwidth, but poor latency. Round-trip memory access latency is now far more important for database transaction processing. The inverse of latency is practically a direct measure of performance.

    In the hard disk era, the importance of memory in keeping IO volume manageable did trump all other criteria. In theory, any level of IO performance could be achieved by aggregating enough HDDs. But large arrays have proportionately more frequent component failures. Although RAID provides fault tolerance, the failed drive rebuild process causes operational problems. The practical upper bound for HDD storage is on the order of 1000 disks, or 200K IOPS. If it was necessary to purchase 1TB of memory to accomplish this, then it was money well spent because the storage was even more expensive.

    Since then, storage with performance requirements has been, or should be, transitioning to all-flash, and not some tiered mix of flash and HDD. Modern solid-state arrays can handle 1M IOPS, and far more, easily, when on a full NVMe stack. Now is the time to re-evaluate just how much memory is really needed when storage is solid-state.

    There are existing memory technologies, RLDRAM and SRAM for example, with different degrees of lower latency, higher cost and lower density at both the chip and system level. There is potential to reduce memory latency by a factor of two or more. The performance impact for database transaction processing is expected to be equal. A single-socket system with RLDRAM or SRAM memory could replace a 4-socket system with DDR4. Practically any IO volume, even several million IOPS, can be handled by the storage system. The key is that the CPU expended by the database engine on IO be kept to a reasonable level.
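The expected gain can be framed with a simple Amdahl-style model; the memory-bound fraction and baseline latency below are assumptions for illustration, not measurements.

```python
# Rough model behind the claim above: if a transaction's critical path is
# dominated by serialized (pointer-chasing) memory accesses, throughput is
# roughly inversely proportional to round-trip memory latency.
# The 80% memory-stall fraction and 70ns baseline are assumptions.

def relative_throughput(mem_latency_ns, mem_bound_fraction=0.8,
                        base_latency_ns=70):
    """Throughput relative to the DDR4 baseline, Amdahl-style."""
    compute = 1 - mem_bound_fraction
    new_time = compute + mem_bound_fraction * (mem_latency_ns / base_latency_ns)
    return 1.0 / new_time

# Halving latency (70ns -> 35ns) with 80% of time in memory stalls:
gain = relative_throughput(35)
```

With the memory-bound fraction at 100%, halving latency would double throughput, which is the limiting case the "inverse of latency" argument appeals to.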

    An argument is made for a radical overhaul of current system architecture, replacing DDR4 DRAM as main memory with a different technology that has much lower latency, possibly sacrificing an order of magnitude in capacity. As the memory controller is integrated into the processor, this also calls for a processor architecture change. A proper assessment for such a proposal must examine not only its own merits, but also other technologies with the potential for order of magnitude impact.

    The discussion here mostly pertains to Intel processors and Microsoft SQL Server. However, the concepts are valid on any processor and database engines built around page organization, row-store with b-tree indexes. The argument for low latency memory is valid even in the presence of Hekaton memory-optimized tables, or other MVCC implementations because it is a drop-in hardware solution instead of a database re-architecture project. Of course, the heavy lifting is on Intel or other to re-architect the system via processor and memory.


    For more than forty years, DRAM has been the standard for main memory in almost every computer system architecture. A very long time ago, having sufficient memory to avoid paging was a major accomplishment. And so, the driving force in memory was the semiconductor technology with low cost. This led to DRAM, and DRAM was optimized to 1 transistor plus 1 capacitor (1T1C). DRAM addressing was multiplexed because even the package and signal count impact on cost mattered.

    The page file has since become a relic, which for some reason has yet to be removed. Over the years, the mainstream form of DRAM, from SDRAM to DDR4, did evolve to keep pace with progress on bandwidth. The criterion that has changed little is latency, and it is round-trip latency that has become critical in applications characterized by pointer-chasing code resulting in serialized memory accesses.

    Even as memory capacity reached enormous proportions, the database community routinely continued to configure systems with the memory slots filled, often with the maximum capacity DIMMs, despite the premium over the next lower capacity. There was a valid reason for this. Extravagant memory could mask (serious) deficiencies in storage system performance.

    Take the above in conjunction with SAN vendors providing helpful advice, such as that log volumes do not need dedicated physical disks, and that the SAN's 32GB cache will solve all IO performance problems. Then there is the doctrine to implement their vision of storage as a service. But now all-flash is the better technology for IO intensive storage, and solid-state storage on an NVMe stack is better still. Massive memory for the sole purpose of driving IO down to noise levels is no longer necessary.

    Now free of the need to cover up for weakness in storage, it is time to rethink system architecture, particularly the memory strategy. Not all applications desire low latency memory at lower capacity and higher cost. Some work just fine with the characteristics of DDR4. Others prefer a different direction entirely. The HPC community is going in the direction of extreme memory bandwidth, implemented with MCDRAM in the latest Xeon Phi.

    In the past, when Moore's Law was in full effect, Intel operated on: "the Process is the Business Model". It was more important to push the manufacturing process, which meant having a new architecture every two years with twice the complexity of the previous generation and then be ready to shrink the new architecture to a new process the next year. Specialty products divert resources from the main priority and were difficult to justify. But now process technology has slowed to a three-year cycle and perhaps four years in the next cycle. It is time to address the major profitable product outside of general purpose computing.

    The principal topic here is the impact of memory latency on database transaction performance, and the directions forward for significant advancement. As the matter has complex entanglements, a number of different aspects come into play. One is scaling on non-uniform memory access (NUMA) system architecture, in which the impact of memory access can be examined. Memory technologies with lower latency are mentioned, but those topics are left to experts in their respective fields. Hyper-Threading is a work-around to the issue of memory latency, so it is mentioned as well.

    The general strategy for server system performance in recent years is multi-core. But it is also important to consider which is the right core as the foundation. Another question to answer is whether a hardware and/or software solution should be employed. On software, Multi-version Concurrency Control (MVCC) is the technology employed by memory-optimized databases (the term in-memory, has misleading connotations) that can promise large performance gains, and even more extreme gain when coupled with natively compiled procedures. Architecting the database for scaling on NUMA architecture is also a good idea.

    A mixed item is SIMD, the SSE/AVX registers and instructions introduced over the last 18 years. From the database transaction processing point of view, this is not something that would have been pursued. But it has come to occupy a significant chunk of real estate on the processor core. So, find a way to use it! Or give it back.

    The list is as follows:

    • Scale up on NUMA versus single socket
    • RLDRAM/SRAM versus DDR4+3D XPoint
    • Hyper-Threading to 4-way
    • Core versus Atom and all comers?
    • Hekaton, memory-optimized, MVCC
    • Database architected for NUMA
    • SSE/AVX Vector Instructions

    The above topics will be addressed, perhaps only partially, often out of order, and sometimes mixed as appropriate. Factors to consider are potential impact, difficulty of implementation and overall business justification.

    L3 and Memory Latency

    The 7-cpu benchmark website lists L3 and memory latency for some recent processors, shown below.

    Processor                 Base Freq  L3            Local Memory  Remote Memory
    Westmere EP (Xeon X5650)  2.67GHz    40-42 cycles  L3+67ns       L3+105ns
    Ivy Bridge (i7-3770)      3.4GHz     30 cycles     L3+53ns       -
    Haswell (i7-4770)         3.4GHz     36 cycles     L3+57ns       -
    Haswell (E5-2603 v3)      1.6GHz     42-43 cycles  -             -
    Skylake (i7-6700)         4.0GHz     42 cycles     L3+51ns       -

    There are a number of questions. Is L3 latency determined by absolute time or cycles? In the Intel Xeon processors from Ivy Bridge EP/EX on, this is particularly complicated because there are 3 die layout models: LCC, MCC, and HCC, each with different structure and/or composition. The Intel material says that L3 latency is impacted by cache coherency snoop modes, ring structure and composition. See the Intel Dev Conference 2015 Processor Architecture Update, and the similar Intel HPC Software Workshop 2016 Barcelona material. For simplicity, L3 is assumed to be 15ns here without consideration for other details.

    DRAM, SQL Server 8KB Page

    The Crucial web site and Micron datasheet cite DDR4-2133 latency at 15-15-15, working out to 14ns each for CL, RCD and RP, so random row latency is 42ns at the DRAM interface. The Intel web page for their Memory Latency Checker utility shows an example having local node latency as 67.5 or 68.5 ns and remote node as 125.2 or 126.5 ns. L3 and memory latency on modern processors is a complicated matter. The Intel values above include the L3 latency. So, by implication, the 10ns difference is the transmission time from memory controller to DRAM and back?
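    The 15-15-15 arithmetic cited above can be checked with a quick calculation (an illustrative back-of-envelope, using only the timing convention that CL, tRCD and tRP are each counted in memory clock cycles):

```python
# DDR4-2133: 2133 mega-transfers/s, two transfers per memory clock.
MT_S = 2133e6
clock_hz = MT_S / 2
t_ck_ns = 1e9 / clock_hz       # one memory clock in ns

cl_ns = 15 * t_ck_ns           # CAS latency
rcd_ns = 15 * t_ck_ns          # row-to-column delay
rp_ns = 15 * t_ck_ns           # row precharge

random_row_ns = cl_ns + rcd_ns + rp_ns
print(round(cl_ns, 1), round(random_row_ns, 1))   # 14.1 42.2
```

So "14ns each" and "42ns at the DRAM interface" are rounded values of 14.1ns and 42.2ns.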

    One of the modes of DRAM operation is to open a row, then access different columns, all on the same row, with only the CL time between columns. I do not recall this being in the memory access API? Is it set in the memory controller? Systems designed for database servers might force close a row immediately?

    In accessing a database page, first, the header is read. Next, the row offsets at the end of the 8KB page. Then the contents of a row, all or specific columns. This should constitute a sequence of successive reads all to the same DRAM row?
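    That access sequence can be sketched in code. This is a toy model, not a SQL Server internal: it assumes only the documented page layout of a 96-byte header and a 2-byte-per-slot row offset array growing backward from the end of the 8KB page, and builds a synthetic page to parse:

```python
import struct

PAGE_SIZE = 8192
HEADER_SIZE = 96   # SQL Server page header is 96 bytes

def row_offsets(page: bytes, slot_count: int):
    """Read the row offset array that grows backward from the end of
    the page; slot 0 occupies the last two bytes."""
    offsets = []
    for slot in range(slot_count):
        pos = PAGE_SIZE - 2 * (slot + 1)
        (off,) = struct.unpack_from('<H', page, pos)
        offsets.append(off)
    return offsets

# Toy page: two rows placed immediately after the header (offsets invented).
page = bytearray(PAGE_SIZE)
page[PAGE_SIZE-2:PAGE_SIZE] = struct.pack('<H', 96)     # slot 0 -> offset 96
page[PAGE_SIZE-4:PAGE_SIZE-2] = struct.pack('<H', 150)  # slot 1 -> offset 150

print(row_offsets(bytes(page), 2))   # [96, 150]
```

The header read, the slot array read, and the row read all land within the same 8KB region, which is what raises the same-DRAM-row question above.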


    Memory Architecture - Xeon E5 and E7

    The most recent Core i3/5/7 processors run in the mid-4GHz range. There are Xeon EP and EX processors with base frequency in the 3GHz+ range. But for the high core count models, 2.2GHz is the common base frequency. Below is a representation of the Xeon E5 memory subsystem.


    In the Xeon E7, the memory controller connects to a scalable memory buffer (SMB) or Memory Extension Buffer (MXB), depending on which document, and the interface between MC and SMB is SMI. The SMB doubles the number of DIMMs that can connect to each memory channel.


    There is no mention of the extra latency for the SMB. It cannot be free, "All magic comes with a price". The value of 15ns is used here for illustrative purposes.

    Anyone with access to both 2-way Xeon E5 and 4-way E7 systems of the same generation and core count model (LCC, MCC or HCC) is requested to run the 7-cpu benchmark or Intel Memory Latency Checker utility, and make the results known. In principle, if Intel were feeling helpful, they would do this.

    System Architecture

    Below is a single socket Xeon E5 v4. The high core count (HCC) die has 24 cores. But the Xeon E5 v4 series does not offer a 24-core model. Two cores are shown as disabled as indicated, though it could be any two? There is only one memory node, and all memory accesses are local.



    Below is a 2-socket system, representing the Xeon E5 either HCC or MCC models in having two double rings, and the interconnect between rings. The details shown are to illustrate the nature of NUMA architecture.



    NUMA is a complicated topic, with heavy discussion on cache coherency, the details of which impact L3 and memory latency. The discussion here will only consider a very simple model, looking only at the physical distance impact on memory latency. See the NUMA Deep Dive Series by Frank Denneman.

    Each socket is its own memory node (not sure what happens in Cluster-on-Die mode). To a core in socket 0, memory in node 0 is local, memory in node 1 is remote, and vice versa to a core in the other socket (more diagrams at System Architecture Review 2016).

    Below is a representation of a 4-socket system based on the Xeon E7 v4. To a core in socket 0, memory in node 0 is local, and memory in nodes 1, 2, and 3 is remote. And repeat for the other nodes.



    It is assumed that memory accesses are 15ns longer than on the E5 due to the extra hop through the SMB, outbound and inbound. This applies to both local and remote node memory accesses.

    On system and database startup, threads should allocate memory on the local node. After the buffer-cache has been warmed up, there is no guarantee that threads running in socket 0 will primarily access pages located in memory node 0, or threads on socket 1 to memory node 1. That is, unless the database has been purposefully architected with a plan in mind to achieve higher than random memory locality.

    The application also needs to be built to the same memory locality plan. A connection to SQL Server will use a specific TCP/IP port number based on the key value range. SQL Server will have TCP/IP ports mapped to specified NUMA nodes. Assuming the application does the initial buffer-cache warm up, and not some other source that is not aware of the NUMA tuning, there should be alignment of threads to pages on the local node. (It might be better if an explicit statement specifies the preferred NUMA node by key value range?)
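    A sketch of the application-side half of such a plan follows. Everything here is hypothetical (port numbers, key range size, partitioning function); the SQL Server side would map each port to a NUMA node through its own configuration:

```python
# Hypothetical scheme: SQL Server is configured so that port 1501 is
# affinitized to NUMA node 0, port 1502 to node 1, and so on. The
# application routes each connection by key range, so work lands on
# the node that (hopefully) holds those pages in its buffer cache.
NODE_PORTS = [1501, 1502, 1503, 1504]   # one port per NUMA node (assumed)
KEYS_PER_NODE = 250_000                 # key-range partition size (assumed)

def port_for_key(key: int) -> int:
    """Pick the connection port for a given key value."""
    node = (key // KEYS_PER_NODE) % len(NODE_PORTS)
    return NODE_PORTS[node]

print(port_for_key(100))       # 1501 (node 0)
print(port_for_key(300_000))   # 1502 (node 1)
```

The point is that the routing decision lives in the application, so the application and the database must share the same key-to-node plan.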

    An example of how this is done is in the TPC-C and TPC-E benchmark full disclosures and supplemental files. Better yet is to get the actual kit from the database vendor. For SQL Server, the POC is JR. Curiously, among the TPC-E full disclosure and supplemental files' thousands of pages of inane details, there is nothing that says the NUMA tuning is of pivotal importance. It is assumed in the examples here that threads access random pages, and memory accesses are evenly distributed over the memory nodes.

    Simple Model for Memory Latency

    In Memory Latency, NUMA and HT, a highly simplified model for the role of memory latency in database transaction processing is used to demonstrate the differences between a single-socket system having uniform memory and multi-socket systems having NUMA architecture.

    Using the example of a hypothetical transaction that could execute in 10M cycles with "single-cycle" memory, the "actual" time is modeled for a 2.2GHz core in which 5% of instructions involve a non-open page memory access. The model assumes no IO, but IO could be factored in if desired. The Xeon E5 memory model is used for 1 and 2 sockets, and the E7 for the 4-socket system.

    GHz | L3+mem | remote | skt | avg. mem | mem cycles | fraction | tot cycles | tps/core | tot tx/sec

    If there were such a thing as single-cycle memory, the performance would be 220 transactions per second, based on 2,200M CPU-cycles per second and 10M cycles per transaction.

    Based on 67ns round-trip memory access, accounting for a full CL+tRCD+tRP, the transmission time between processor and DRAM, and L3 latency, incurred in 0.5M of the 10M "instructions", the transaction now completes in 83.2M cycles.

    The balance of 73.2M cycles are spent waiting for memory accesses to complete. This circumstance arises primarily in point-chasing code, where the contents of one memory access determines the next action. Until the access completes, there is nothing else for the thread to do. The general advice is to avoid this type of coding, except that this is what happens in searching a b-tree index.
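    The arithmetic of this simple model can be reproduced directly (same assumptions as above: 2.2GHz, 10M base cycles, 5% of instructions incurring a 67ns round-trip that replaces what would have been a single-cycle access):

```python
GHZ = 2.2
base_cycles = 10e6      # transaction cost with "single-cycle" memory
mem_fraction = 0.05     # 5% of instructions miss to memory
mem_ns = 67             # L3 + DRAM round-trip, local node

mem_cycles = mem_ns * GHZ                # ~147.4 cycles per access
accesses = base_cycles * mem_fraction    # 0.5M memory accesses
total_cycles = base_cycles + accesses * (mem_cycles - 1)

tps_per_core = GHZ * 1e9 / total_cycles
print(round(total_cycles / 1e6, 1), round(tps_per_core, 1))   # 83.2 26.4
```

So the transaction stretches from 10M to 83.2M cycles, and throughput drops from 220 to about 26 transactions per second per core.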

    If the impact of NUMA on database transaction processing performance were understood and clearly communicated, databases could have been architected from the beginning to work with the SQL Server NUMA and TCP/IP port mapping features. Then threads running on a given node primarily access pages local to that node.

    If this forethought had been neglected, then one option is to re-architect both the database and application, which will probably involve changing the primary key of the core tables. Otherwise, accept that scaling on multi-socket systems is not going to be what might have been expected.

    Furthermore, the Xeon E7 processor, commonly used in 4-socket systems, has the SMB feature for doubling memory capacity. As mentioned earlier, this must incur some penalty in memory latency. In the model above, scaling is:

      1P -> 2P = 1.45X,   2P -> 4P = 1.56X and
      1P -> 4P = 2.26X

    The estimate here is that the SMB has an 11% performance penalty. If the doubling of memory capacity (or other functionality) was not needed, then it might have been better to leave off the SMB. There is a 4-way Xeon E5 4600-series, but one processor is 2 hops away, which introduces its own issues.

    There is a paucity of comparable benchmark results to support meaningful quantitative analysis. In fact, it would seem that the few benchmarks available employ configuration variations with the intent to prevent insight. Below are TPC-E results from Lenovo on the Xeon E5 and E7 v4, at 2 and 4 sockets respectively.

    Processor   Sockets  Cores  Threads  Memory         Data Storage  tpsE
    E5-2699 v4  2        44     88       512GB (16x32)  3x17 R5       4,938.14
    E7-8890 v4  4        96     192      4TB (64x64)    5x16 R5       9,068.00

    It would seem that scaling from 2P to 4P is outstanding at 1.836X. But there is a 9% increase in cores per socket, from 22 to 24. Factoring this in, the scaling is about 1.68X, although scaling versus core count should be moderately less than linear. Then there is the difference in memory, from 256GB per socket to 1TB per socket. Impressive, but how much did it actually contribute? Or did it just make up for the SMB extra latency? Note that TPC-E does have an intermediate level of memory locality, though to a lesser extent than TPC-C.


    The details of the current Intel Hyper-Threading implementation are discussed elsewhere. The purpose of Hyper-Threading is to make use of the dead cycles that occur during memory access or other long-latency operations. Hyper-Threading is an alternative solution to the memory latency problem. Workarounds are good, but there are times when it is necessary to attack the problem directly. Single-thread performance is important, perhaps second after overall system throughput.

    Given that a memory round-trip access takes more than 150 CPU clock cycles, it is surprising that Intel only implements 2-way HT. The generic term is simultaneous multi-threading (SMT). IBM POWER8 is at 8-way, up from 4-way in POWER7. SPARC has been 8-way for a few generations?

    There is a paper on one of the two RISC processors stating that there were technical challenges in SMT at 8-way. So, 4-way HT is reasonable. This could nearly double transaction performance. The effort to increase HT from 2-way to 4-way should not be particularly difficult. Given the already impossibly complex complexion of the processor, "difficult should be a walk in the park".

    It might help if there were API directives from the operating system to the processor indicating code that runs well with HT and code that does not.


    One other implication of the memory latency effect is that scaling versus frequency is poor. The origin of the memory latency investigation was an incident (Amdahl Revisited) in which a system UEFI/BIOS update reset processors to power-save mode, changing the base frequency from 2.7GHz to 135MHz. There was a 3X increase in worker (CPU) time on key SQL statements: a 20X change in frequency for a 3X change in performance.
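    Under a simple model in which transaction time is compute time (scaling with frequency) plus fixed memory-stall time, those two observations pin down the memory-stall fraction. This is an inference from the incident's numbers, not a measured value:

```python
# Model: t(f) = C/f + M, where C is compute cycles and M is fixed
# memory-stall time. Observed: dropping f by 20X raised worker time 3X.
f_ratio = 20.0   # 2.7GHz -> 135MHz
t_ratio = 3.0    # 3X increase in worker time

# With c = C/f_high (compute time at full frequency):
#   f_ratio*c + M = t_ratio*(c + M)  =>  M/c = (f_ratio - t_ratio)/(t_ratio - 1)
m_over_c = (f_ratio - t_ratio) / (t_ratio - 1)   # 8.5
stall_fraction = m_over_c / (1 + m_over_c)       # share of time at 2.7GHz
print(round(stall_fraction, 3))   # 0.895
```

That is, at the full 2.7GHz, roughly 90% of the worker time on those statements would be memory stalls, which is why frequency scaling is so poor.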

    In other words, do not worry about the lower frequency of the high core count processors. They work just fine. But check the processor SKUs carefully, certain models do not have Hyper-Threading which is important. It also appears that turbo-boost might be more of a problem than benefit. It might be better to lock processors to the base frequency.

    Hekaton - MVCC

    Contrary to popular sentiment, putting the entire database into memory on a traditional engine having page organization and row-store does not make much of a direct contribution to performance over a far more modest buffer cache size. This is why database vendors have separate engines for memory-optimized operation, Hekaton in the case of Microsoft SQL Server. To achieve order of magnitude performance gain, it was necessary to completely rethink the architecture of the database engine.

    A database entirely in memory can experience substantial performance improvement when the storage system is seriously inadequate to meet the IOPS needed. When people talk about in-memory being "ten times" faster, what is meant is that if the database engine had been designed around all data residing in memory, it would be built in a very different way than the page and row structure implemented by INGRES in the 1970s.

    Now that the memory-optimized tables feature, aka Hekaton for Microsoft SQL Server, is available, are there still performance requirements that have not been met? Memory-optimized tables and their accompanying natively compiled procedures are capable of unbelievable performance levels. All we have to do is re-architect the database to use memory-optimized tables, and then rewrite the stored procedures for native compilation. This can be done! And it should be done, when practical.

    In many organizations, the original architects have long retired, departed, or gone the way of Dilbert's Wally (1999 Y2K episode). The current staff developers know that if they touch something, it breaks, they own it. So, if there were a way to achieve significant performance gain, with no code changes, just by throwing money at the problem, then there would be interest, and money.

    There is not a TPC-E result for Hekaton, and this will not happen without a rule change. The TPC-E requirement is that database size scales with the performance reported. The 4-way Xeon E7 v4 result of 9,068 tpsE corresponds to a database size of 37,362GB. The minimum expectation for Hekaton is 3X, pointing to a 100TB-plus database. A rule change should be proposed: allow a "memory-optimized" option to run with a database smaller than the memory capacity of the system.

    About as difficult as MVCC, with less upside? Potentially the benefits of NUMA scaling and Hekaton could be combined if Microsoft exposed a mechanism for how key values map to a NUMA node. It would be necessary for a collection of tables with a compound primary key to have the same lead column, and for the hash to use the lead column in determining the NUMA node?


    The major DRAM companies are producing DDR4 at the 4Gb die level. Samsung has an 8Gb die. Micron has a catalog entry for 2×4Gb die in one package as an 8Gb product. There can be up to 36 packages on a double-sided DIMM, 18 on each side. The multiple chips/packages form a 64-bit word plus 8 bits for ECC, capable of 1-bit error correction and 2-bit error detection. A memory controller might aggregate multiple channels into a larger word and combine the ECC bits to allow for a more sophisticated error correction and detection scheme.

    A 16GB non-ECC DDR4 DIMM sells for $100, or $6.25 per GB. The DIMM is comprised of 32×4Gb die, be it 32 single-die packages or 16 two-die packages. The 16GB ECC U- or RDIMM, consisting of 36 ×4Gb die, is $128, for $8/GB net (data+ECC) or $7.11/GB raw. There is a slight premium for ECC parts, but much less than in the past, especially compared with fully buffered DIMMs that had an AMB chip on the module. The 4Gb die + package sells for less than $3.
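    The net versus raw price-per-GB arithmetic for the ECC module works out as follows (using only the figures quoted above):

```python
# 16GB ECC DIMM: 36 packages x 4Gb die = 144Gb = 18GB raw, 16GB usable.
price = 128.0
net_gb = 16
raw_gb = 36 * 4 / 8          # 18GB including the ECC bits

print(price / net_gb)              # 8.0  $/GB net (data+ECC)
print(round(price / raw_gb, 2))    # 7.11 $/GB raw
```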

    Jim Handy, the Memory Guy, estimates that a 4Gbit DRAM die needs to be 70mm2 to support a price in this range. Mainstream DRAM is a ruthlessly competitive environment. The 8Gbit DDR4 package with 2×4Gb die allows 32GB ECC DDR4 to sell for $250, no premium over the 16GB part.

    The 64GB ECC RDIMM (72GB raw) is priced around $1,000. This might indicate that it is difficult to put 4×4Gb die in one package, or that the 8Gb die sells for $12 compared to $3 for the 4Gb die. Regardless, it is possible to charge a substantial premium in the big-capacity realm.

    One consequence of the price competitiveness in the mainstream DRAM market is that cost cutting is an imperative. Multiplexed row and column address lines originated in 1973, allowing for lower cost packages and modules. Twenty years ago, there was discussion of going back to a full-width address, but no one was willing to pull the trigger on this.

    The only concession for performance in mainstream DRAM was increasing bandwidth by employing multiple sequential word accesses, starting with DDR to the present DDR4.


    Reduced latency DRAM appeared in 1999 for applications that needed lower latency than mainstream DRAM, but at lower cost than SRAM. One application is in high-speed network switches. The address lines on RLDRAM are not multiplexed; the entire address is sent in one group. RLDRAM allows low latency access to a particular bank. That bank cannot be accessed again for the normal DRAM bank cycle period? But the other banks can, so the strategy is to access banks in round-robin fashion if possible.
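    The round-robin idea can be illustrated with a toy scheduler. The timings here are purely illustrative (8ns between issue slots, 15ns bank-turnaround), not RLDRAM datasheet values:

```python
# Toy model: each access occupies its bank for T_RC ns, but a new access
# can start every T_ACCESS ns as long as it targets a different bank.
T_ACCESS = 8.0   # ns between issue slots (illustrative)
T_RC = 15.0      # ns before the same bank can be reused (illustrative)

def schedule(banks):
    """Return the time to issue a sequence of accesses by bank number."""
    t = 0.0
    bank_free = {}
    for b in banks:
        t = max(t, bank_free.get(b, 0.0))   # wait for the bank to turn around
        bank_free[b] = t + T_RC
        t += T_ACCESS
    return t

same_bank = schedule([0, 0, 0, 0])    # serialized by bank turnaround
round_robin = schedule([0, 1, 2, 3])  # fully pipelined across banks
print(same_bank, round_robin)         # 53.0 32.0
```

Rotating across banks hides the turnaround time; hammering one bank exposes it on every access.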

    A 2012 paper, lead author Nilandrish Chatterjee (micro12 and Micro 45), has a discussion of RLDRAM. The Chatterjee paper mentions that RLDRAM employs many small arrays, sacrificing density for latency. Bank-turnaround time (tRC) is 10-15ns compared to 50ns for DDR3. The first version of RLDRAM had 8 banks, while the contemporary DDR (just DDR then) had 2 banks. Both RLDRAM3 and DDR4 are currently 16 banks, but the banks are organized differently?

    Micron currently has a 1.125Gb RLDRAM3 product in x18 and x36 organization. Presumably the extra bits are for ECC: 4 x18 or 2 x36 devices forming a 72-bit path to support 64-bit data plus 8-bit ECC. The mainstream DDR4 8Gbit 2-die package from Micron comes in a 78-ball package for x4 and x8 organization, and 96-ball for x16. The RLDRAM comes in a 168-ball package for both x18 and x36. By comparison, GDDR5 8Gb at x32 comes in a 170-ball BGA, yet has a multiplexed address? The package pin count factors into cost, and also into die size, because each signal needs to be boosted before it can go off chip?

    Digi-Key lists a Micron 576Mb RLDRAM3 part at $34.62, or $554/GB w/ECC, compared with DDR4 at $8 to $14/GB also with ECC, depending on the module capacity. At this level, RLDRAM is 40-70 times more expensive than DDR4 by capacity. A large part of this is probably because RLDRAM is quoted as a specialty low-volume product at high margins, while DDR4 is quoted on razor-thin margins. The top RLDRAM at 1.125Gb capacity might reflect the size needed for high-speed network switches, or it might have comparable die area to a 4Gb DDR4?


    There are different types of SRAM. High-performance SRAM has 6 transistors (6T). Intel may use 8T (Intel Labs at ISSCC 2012) or even 10T for low power? (see Real World Tech NTV). It would seem that SRAM should be six to eight times less dense than DRAM, depending on the number of transistors in the SRAM cell and the size of the capacitor in DRAM.

    There is a Micron slide in the Micro 48 Keynote III that says SRAM does not scale on manufacturing process as well as DRAM. Instead of 6:1 (0.67Gbit of SRAM in the same die size as 4Gbit of DRAM), it might be 40:1, implying 100Mbit in equal area? Another source says 100:1 might be appropriate.

    Eye-balling the Intel Broadwell 10-core (LCC) die, the L3 cache is 50mm2, listed as 25MB. It includes tags and ECC on both data and tags? There could be 240Mb or more in the 25MB L3. Then 1Gb could fit in a 250mm2 die, plus area for the signals going off-die.

    Digi-Key lists Cypress QDR-IV 144Mb (8M×18, 361 pins) in the $235-276 range. This is $15K per GB w/ECC. It is reasonable to assume that prices for both RLDRAM and QDR SRAM are much lower when purchased in volume?

    The lowest price for an Intel processor on the Broadwell LCC die of 246mm2 is $213, in a 2011-pin package. This would suggest SRAM at south of $1,800 per GB. While the ultra-high margins in high-end processors are desirable, it is just as important to fill the fab to capacity, so SRAM at 50% margin is justified. We could also estimate SRAM at 40X the cost of DRAM, per the Micron assertion of relative density, pointing to $160-320 per GB.
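    Both SRAM price estimates follow from the figures above (every input here is one of the rough numbers already stated, not vendor data):

```python
# Estimate 1: price a GB of SRAM as processor-class silicon.
die_price = 213.0        # cheapest Broadwell LCC processor, 246mm2 die
gbit_per_die = 1.0       # ~1Gbit SRAM in ~250mm2 (estimated above)
die_per_gb = 8 / gbit_per_die
print(die_price * die_per_gb)    # 1704.0, i.e. south of $1,800/GB

# Estimate 2: scale DRAM price by the Micron 40:1 density assertion.
ddr4_per_gb = (4.0, 8.0)         # rough $/GB range for DDR4
sram_ratio = 40
print(tuple(p * sram_ratio for p in ddr4_per_gb))   # (160.0, 320.0)
```

The two approaches bracket SRAM somewhere between $160 and $1,800 per GB, which is the wide range the argument has to work with.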

    Graphics and High-Bandwidth Memory

    Many years ago, graphics processors diverged from mainstream DRAM. Their requirement was for very high bandwidth at a smaller capacity than main memory, plus other features to support the memory access patterns in graphics. GDDR is currently on version 5, at densities up to 8Gbit, with a x32-wide path (170-ball package) versus x4, x8 and x16 for mainstream DDR4. More recently, High Bandwidth Memory (HBM) is promoted by AMD, and the Hybrid Memory Cube by Micron.

    High bandwidth memory is not pertinent to databases, but it does provide perspective on when there is a need to go a separate road from mainstream memory. Databases on a page-row type engine do not come close to testing the limits of DDR4 bandwidth. This is true for both transaction processing and DW large table scans. For that matter, neither does column-store, probably because of the CPU cycles spent on decompression.

    I may have to take this back. DDR4-2133 bandwidth is 17GB/s per channel, and 68GB/s over 4 channels. (GB is always decimal by default for rates, but normally binary for size.) What is the rate of a table scan with simple aggregation from memory now? It was 200MB/s per core in Core 2 days, 350MB/s in Westmere. Is it 500 or 800MB/s per core now? It is probably more likely to be 500, but let's assume 800MB/s here.

    Then 24 cores (Xeon E7 v4 only, not E5) consume 19.2GB/s (HT does not contribute in table scans). This is still well inside the Xeon E5/7 memory bandwidth. But what if this were read from storage? A table scan from disk is a write to memory, followed by a read. DDR writes to memory at the clock rate, i.e., one-half the MT/s rate, so a write costs double in bandwidth terms. The realized table scan rate then effectively consumes 3X its value in memory bandwidth, which is 57.6GB/s.
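    The bandwidth arithmetic, with the 800MB/s-per-core assumption made explicit:

```python
cores = 24
scan_rate_gb = 0.8   # assumed 800MB/s per core

from_memory = cores * scan_rate_gb   # scan served from the buffer cache
# From storage, the data is first written to memory, then read back.
# A DDR write at half the transfer rate costs 2X in bandwidth terms,
# so effective consumption is read (1X) + write (2X) = 3X.
from_storage = from_memory * 3
print(round(from_memory, 1), round(from_storage, 1))   # 19.2 57.6
```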

    Memory Summary

    To pursue the path of low latency memory, it is necessary to justify the cost and capacity structure of alternative technologies. It is also necessary that the opportunity be worthwhile enough to justify building one more specialty processor with a different memory controller. And it may be necessary to work with operating system and database engine vendors so that all are aligned in doing what is necessary.

    The Chatterjee et al. paper for Micro 45 shows RLDRAM3 improving throughput by 30% averaged across 27 of the 29 components of the SPEC CPU 2006 suite, integer and fp. MCF shows the greatest gain, at 2.2X. Navigating a B-tree index should show very high gain as well.

    The cost can be justified as follows. An all-in 4-way server has the following processor and memory cost.

    Component   Detail         Unit Cost   Total
    Processor   4×E7-8890 v4   $7,174 ea.  $28,700
    Memory      4TB, 64×64GB   $1,000 ea.  $64,000

    If the above seems excessive, recall that there was a time when some organizations were not afraid to spend $1M on the processor and memory complex, sometimes just for the 60+ processors (sales of 1000 systems per year?). That was in the hope of having amazing performance. Except that vendors neglected to stress the importance of NUMA implications. SAN vendors continued to sell multi-million-dollar storage without stressing the importance of dedicated disks for logs.

    If a more expensive low latency memory were to be implemented, the transmission time between the memory controller and DIMM, estimated at 10ns earlier, should be revisited. An RLDRAM system might still have the DIMM slot arrangement currently in use, but other options should be considered.

    An SRAM main memory should probably be in an MCM module, or some other in-package interconnect (TSMC Hot Chips 28). This is if enough SRAM can be stuffed into a module or package. It would also require that the processor and memory be sold as a single unit, instead of memory being configured later. In the case of SRAM, the nature of the processor L3 probably needs to be re-examined.

    HDD Storage

    Before SSDs, the high-end 15K HDDs were popular with storage performance experts who understood that IOPS was more important than capacity. In a "short-path to bare metal" disk array, the 15K HDD could support 200 IOPS at queue depth 1 per HDD with low latency (5ms). It should be possible to assemble a very large array of 1,000 disks, capable of 200,000 IOPS.

    It is necessary to consider the mean time between failures (MTBF), typically cited as over 1M hours. There are 8,760 hours in a 365-day year. At 1M-hr MTBF, the individual disk annual failure rate is 0.876%. An array of 1,000 disks is expected to see about 9 failures per year. Hard disks in RAID groups will continue to operate with a single, or sometimes multiple, disk failure. However, rebuilding a RAID group after a failed drive could take several hours, and performance is degraded in this period. It is not operationally practical to run a very large disk array. The recommendation was to fill the memory slots with big DIMMs, and damn the cost.
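    The failure-rate arithmetic is a one-liner (treating the 1M-hour MTBF as a constant failure rate):

```python
mtbf_hours = 1_000_000
hours_per_year = 8760
disks = 1000

annual_failure_rate = hours_per_year / mtbf_hours   # per-disk, per year
expected_failures = disks * annual_failure_rate
print(round(annual_failure_rate, 5), round(expected_failures, 2))   # 0.00876 8.76
```

Roughly nine drive replacements and nine degraded-mode rebuild windows per year for a 1,000-disk array.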

    SSD (tech)

    The common convention used to be that a NAND controller with SATA on the upstream side would have 8 channels on the NAND side. A PCI-E controller would have 16 or 32 NAND channels for x4 and x8 respectively. On the downstream side, a NAND channel could have 1 or 2 packages. There could be up to 8 chips in a package. A NAND chip may be divided into 2 planes, and each plane is functionally an independent entity. An SSD with 8 packages could have 64 NAND chips comprised of 128 planes. The random IOPS performance at the plane level is better than a 15K HDD, so even a modest collection of 24 SSDs amounts to a very large array (3,072) of base units.
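    Counting the independent units in that configuration (the per-SSD figures are the ones assumed above):

```python
packages_per_ssd = 8
chips_per_package = 8
planes_per_chip = 2
ssds = 24

planes_per_ssd = packages_per_ssd * chips_per_package * planes_per_chip
array_planes = ssds * planes_per_ssd
print(planes_per_ssd, array_planes)   # 128 3072
```

Compare 3,072 independently addressable planes against the 1,000 spindles of the heroic HDD array above.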

    At the component level, having sufficient units for 1M IOPS is not difficult. Achieving 1M IOPS at the system level is more involved. NVMe builds a new stack, software and hardware, for driving the extraordinarily high IOPS possible with a large array of SSDs, while making more efficient use of CPU than SAS. PCI-E NVMe SSDs have been around since 2014, so it is possible to build a direct-attach SSD array with the full NVMe stack. NVMe over Fabrics was recently finalized, so SAN products might not be too far off.

    From the host operating system, it is possible to drive 1M IOPS on the NVMe stack without consuming too much CPU. At the SQL Server level, there are additional steps, such as determining which page to evict from the buffer cache. Microsoft has reworked IO code to support the bandwidth made practical with SSD for DW usage. But given the enormous memory configuration of typical transaction processing systems, there may not have been much call for the ability to do random IOPS with a full buffer cache. But if the need arose, it could probably be done.

    All Flash Array

    When SSDs were still very expensive as components, storage system vendors promoted the idea of a SSD cache and/or a tiering structure of SSD, 10K and 7.2K HDDs. In the last few years, new upstarts are promoting all flash. HDD storage should not go away, but its role is backup and anything not random IO intensive.

    Rethinking System Architecture

    The justification for rethinking system architecture toward low latency memory at far higher cost is shown below. The scaling achieved in a 4-socket system is less than exceptional, except for the very few NUMA-architected databases, which are probably just the TPC-C and TPC-E databases. It might be 2.2X better than a single socket.



    At lower latency, 40ns L3+memory, the single socket system could match the performance of a 2-socket system with DDR4 DRAM. If 25ns were possible, then it could even match up with the 4-socket system. The massive memory made possible in the 4-way to reduce IO is then no longer a mandatory requirement. The fact that a single-socket system with RLDRAM or SRAM could match a 4-socket with massive memory allows very wide latitude in cost.



    RLDRAM may reside inside or outside of the processor. If outside, thought should be given on how to reduce the transmission delay. SRAM should most probably be placed inside the processor package, so the challenge is how much could be done. Should there still be an L3? Any latency from processor core to memory must be minimized as much as is warranted by the cost of SRAM.

    Below are the memory latency simple model calculations for a single socket with L3+memory latency of 43ns and 25ns. These are the values necessary for the single-socket system to match 2- and 4-socket systems respectively.

    GHz | L3+mem | remote | skt | avg. mem | mem cycles | fraction | tot cycles | tps/core | tot tx/sec

    In the examples above, Hyper-Threading should still have good scaling to 4-way at 43ns, and some scaling to 4-way at 25ns memory latency.

    The new memory architecture does not mean that DDR4 DRAM becomes obsolete. It is an established and moderately inexpensive technology. There could still be DRAM memory channels. Whether this is a two-class memory system or perhaps DDR memory is accessed like a memory-mapped file can be debated elsewhere. Xeon Phi has off-package DDR4 as memory node 0 and on-package MCDRAM as memory node 1, all to the same processor.

    It is acknowledged that the proposed system architecture is not a new idea. The Cray-1 used SRAM as memory, and DRAM as storage? For those on a budget, the Cray-1M had MOS memory. Circumstances of the intervening years favored processors with SRAM cache and DRAM as main memory. But the time has come to revisit this thinking.

    While working on this, I came across the slide below in J. Pawlowski, Micron, Memory as We Approach a New Horizon. The outline of the Pawlowski paper includes high bandwidth and persistent memory. Deeper in, RL3 row cycle time (tRC) is 6.67 to 8ns, versus 45-50ns for DDR4.


    I am guessing that the large number of double-ended arrows between processor and near memory means high bandwidth. And even bandwidth to DIMMs is substantial.

    The devices to the right seem to be storage. Does ASIC mean logic? Instead of just accessing blocks, it would be useful to say: read the pointer at address A, then fetch the memory that A points to.

    Intel Vision

    Below is roughly Intel's vision of next-generation system architecture, featuring 3D XPoint. The new Intel and Micron joint non-volatile technology is promoted as having performance characteristics almost as good as DRAM, higher density than DRAM, and cost somewhere in between DRAM and NAND.



    The full potential of 3D XPoint cannot be realized as PCI-E attached storage. The idea is then to have 3D XPoint DIMM devices on the memory interface along with DRAM. The argument is that memory configurations in recent years have become ridiculously enormous, and that much of it is used to cache tepid or even cool data. In this case, DRAM is overkill. 3D XPoint is almost as good, costs less, consumes less power, is persistent, and will allow even larger capacity.

    In essence, the Intel vision acknowledges the fact that much of main memory is being used for less than hot data. The function of storing not so hot data can be accomplished with 3D XPoint at lower cost. But this also implies that the most critical functions of memory require far less capacity than that of recent generation systems.

    In the system architecture with a small SRAM or RLDRAM main memory, there will be more IO. To a degree, IO at 100µs to NAND is not bad, but the potential for 10µs or less IO to 3D XPoint further validates the concept and is too good to pass up.

    Below is my less fancy representation of the Micron system concept.



    The Right Core

    The latest Intel mainline Core i processors have incredibly powerful cores, with 8-wide superscalar execution. A desktop version of Kaby Lake, 7th generation, has a 4.2GHz base frequency and 4.5GHz turbo. This means an individual core can run at 4.5GHz, if not significantly higher, but must throttle down to 4.2GHz so that four cores plus graphics and the system agent stay under 91W. A reasonable guess might be that power consumption is 20W per core at 4.2GHz? Skylake's top frequency was 4.0GHz base and 4.2GHz turbo. Broadwell was probably 3.5GHz base and 3.8GHz turbo, but Intel did not deploy this product as widely as normal.

    In transaction processing, this blazing frequency is squandered on memory latency. The server strategy is high core count: at 24 cores, frequency is throttled down to 2.2GHz to stay under 165W. The Broadwell HCC products do allow a turbo mode in which a few cores can run at up to 3.5 or 3.6GHz.

    Every student of processor architecture knows the foundations of Moore's Law. One element is that on doubling the silicon real estate at a fixed process, the expectation is a 1.4X increase in performance. (The other element is that a new process with a 0.71X linear shrink yields roughly 40% higher frequency, also providing 1.4X performance.) The Intel mainline core is two times more powerful than the performance that can be utilized in a high core count processor.

    In theory, it should be possible to design a processor core with performance equivalent to the mainline core at 2.2GHz (see SQL on Xeon Phi). In theory, this new light core should be one-quarter the size of the mainline core. (Double the light core complexity for 1.4X performance. Double again for another 1.4X, for a cumulative 2X over baseline.)
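    The doubling arithmetic above follows Pollack's-rule-style scaling (performance roughly proportional to the square root of core area). The sketch below is an illustrative model, not Intel data:

```python
# Pollack's-rule sketch of the light-core argument: performance scales
# roughly as the square root of core complexity (area), so half the
# performance should need only about a quarter of the area.
import math

def relative_performance(relative_area: float) -> float:
    """Performance relative to the baseline core, given relative area."""
    return math.sqrt(relative_area)

print(relative_performance(2.0))    # 2X area  -> ~1.41X performance
print(relative_performance(0.25))   # 1/4 area -> 0.5X performance
```

    A core with half the mainline performance at one-quarter the area allows roughly four times as many cores in the same silicon budget, caches aside.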

    The new light core would be running at maximum design frequency to match the 2.2GHz mainline, whatever frequency that might be. We can pretend it is 2.2GHz if that helps. This new core would have no turbo capability. What is the power consumption of this core? Perhaps one-quarter of the mainline, because it is one-quarter the size? Or more because it is running at a slightly higher voltage? (this is covered elsewhere).

    It might be possible to fit four times as many of the light cores on the same die size, assuming cache sizes are reduced. But maybe only 3 times as many cores can be supported to stay within power limits? This is a much more powerful multi-core processor for multi-threaded server workloads. How valuable is the turbo capability?

    The turbo boost has value because not everything can be made heavily multi-threaded. Single or low-threaded code might not be pointer chasing code, and would then be fully capable of benefitting from the full power of the Intel mainline core. A major reason that Intel is in such a strong position is that they have had the most powerful core for several years running, and much of the time prior to the interlude period. (AMD does have a new core coming out and some people think highly of it.)

    Intel has two versions of each manufacturing process, one for high performance, and another for low power. Could the mainline core be built on the lower power process? In principle this should reduce power to a greater degree than scaling voltage down. Would it also make the core more compact?

    Knights Landing

    We could speculate on theory, applying the general principles of Moore's Law. But there is a real product along these lines, just targeted towards a different function. The Xeon Phi 200, aka Knights Landing, has 72 Atom cores, albeit at 245W (260W with fabric). The current Phi is based on the Airmont Atom core (the latest Atom is actually Goldmont).

    The recent Atom cores have a 14-stage pipeline versus 14-19 for Core? Airmont is 3-wide superscalar with out-of-order execution, but does not have a µop cache? The true top frequency for Airmont is unclear; some products based on Airmont are listed at 2.6GHz in turbo.

    On Xeon Phi, the frequency is 1.5GHz base, 1.7GHz turbo. It might be that low core count processors will always be able to operate at a higher voltage for maximum frequency, while high core count products set a lower voltage resulting in lower frequency, regardless of intent.

    Below is a diagram of Knights Landing, or Xeon Phi 200 from Intel's Hot Chips 27 (2015). The processor and 8 MCDRAM devices are on a single multi-chip-module (package) SVLC LGA 3647.



    The MCDRAM is a version of Hybrid Memory Cube? (Intel must have their own private acronyms.) Each device is a stack of die, 2GB each, for a total of 16GB with over 400GB/s bandwidth. There are also 3 memory controllers driving a total of 6 DDR4 memory channels for another 90GB/s of bandwidth (at 2133MT/s). Only 1 DIMM per channel is supported; presumably the applications are fine with 384GB but want extreme bandwidth.

    The Xeon Phi is designed for HPC. As is, it might be able to deliver impressive transaction processing performance, though perhaps not without tuning at many levels. The question is: how would Knights Landing perform on transaction processing had the memory been designed for latency instead of bandwidth?

    I suppose this could be tested simply by comparing an Airmont Atom against a Broadwell or Sky Lake? The theory is that the memory round-trip latency dominates, so the 8-wide superscalar of Haswell and later has little benefit. Even if there is some code that can use the wide superscalar, the benefit is drowned out by code that waits for memory accesses.

    Even in transaction processing databases, not everything is transactions, i.e., amenable to wide parallelism. Some code, with important functions, does benefit from the full capability of the mainline Intel core. Perhaps the long-term solution is asymmetric multi-core: two or four high-end cores, and very many mini-cores.

    SSE/AVX Vector Unit (SIMD)

    The vector (SSE/AVX) unit is a large portion of the core area. These units are not used in transaction processing but are used in the more recent column-store engine. Microsoft once evaluated the use of the Intel SSE registers, but did not find a compelling case. It might have been on the assumption of the then-existing SSE instructions?

    Perhaps what is needed is to redesign the page structure so that the vector registers can be used effectively. The SQL Server 8KB page, 8,192 bytes, has a page header of 96 bytes, leaving 8,096 bytes. Row offsets (a slot array of 2-byte values) are filled in from the end of the page. See Paul Randal, SQLskills, Anatomy of a Record.
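    As a rough illustration of the layout just described (the 96-byte page header and the 2-byte slot array growing backward from the end of the page), here is a minimal sketch; the page contents are fabricated:

```python
# SQL Server 8KB page layout sketch: 96-byte page header at the front,
# slot array of 2-byte row offsets filled in backward from the page end.
PAGE_SIZE = 8192
HEADER_SIZE = 96

def slot_offsets(page: bytes, slot_count: int):
    """Read the slot array: slot i is stored 2*(i+1) bytes from the page end."""
    assert len(page) == PAGE_SIZE
    offsets = []
    for i in range(slot_count):
        pos = PAGE_SIZE - 2 * (i + 1)
        offsets.append(int.from_bytes(page[pos:pos + 2], "little"))
    return offsets

# A fabricated page with two rows, at byte offsets 96 and 200:
page = bytearray(PAGE_SIZE)
page[8190:8192] = (96).to_bytes(2, "little")    # slot 0
page[8188:8190] = (200).to_bytes(2, "little")   # slot 1
print(slot_offsets(bytes(page), 2))             # [96, 200]
```

    The point of the proposed redesign is that a contiguous, front-loaded slot array could be gathered into vector registers in a few wide loads rather than walked backward two bytes at a time.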

    Within a page, each row has a header (16 bytes?) with several values. The goal of redesigning the page architecture is so that the slot and header arrays can be loaded into the vector registers in an efficient manner. This might mean moving the slot array and other headers up front. SQL Server would continue to recognize the old page structure. On index rebuild, the new page structure would be employed.

    The necessary instructions to do row-column byte offset calculations directly from the vector register would have to be devised. This needs to be worked out between Intel and various database vendors. Perhaps the load into the vector registers bypasses L1 and/or L2? It would be in L3 for cache coherency?

    The current Xeon E5/7 processors, with the latest on the Broadwell core, have 16 vector registers of 256-bits (32 bytes) totaling 512 bytes. The Skylake has 32 registers of 512-bits, 2KB of registers. This is too much to waste. If they cannot be used, then the processor with special memory controller should discard the vector unit.


    The main purpose of this article was to argue for a new system architecture, having low latency memory, implying a processor architecture change, with a major focus on the feasibility for transaction processing databases, and mostly as pertinent to Intel processors. However, all avenues for significant transaction performance improvement are considered.

    • Hyper-Threading to 4-way
    • Memory-optimized tables with natively compiled procedures
    • Database architected for NUMA
    • Multi-core - the right size core?
    • 3D XPoint
    • SSE/AVX Vector Instructions

    Increasing HT from 2-way to 4-way has the potential to nearly double transaction processing performance. Other options have greater upside, but this is a drop-in option.

    Memory-optimized tables and natively compiled procedures combined has the greatest upside potential.

    People do not want to hear that the database should be re-architected for NUMA scaling. If it runs fine on single-socket or Hekaton, then fine. But Intel mentions that scaling to the Phi core count levels requires NUMA architecture even on one socket.

    Higher core count using a smaller core will have best transaction throughput, but an asymmetric model might be more practical.

    The capabilities of the modern processor core are being squandered in long latency for capacity that is not needed. Figure out what is the right low latency memory.

    There is definitely potential for 3D XPoint. But it goes beyond displacing some DDR DRAM and NAND. The true potential is to enable a smaller, lower latency true main memory, then have DDR DRAM and 3D XPoint as something in-between memory and IO.

    SSE/AVX: use it or lose it.



    The growing gap between processor clock cycle time to memory latency is not a new topic. There have been many other papers on the advantage of various memory technologies with lower latency. Most of these originate either from universities or semiconductor companies. Everyone acknowledged that cost relative to mainstream DRAM was a serious obstacle. A number of strategies were conceived to narrow the gap.

    Here, the topic is approached from the point of view of database transaction processing. TP is one of several database applications. Database is one of many computer applications. However, transaction processing is a significant portion of the market for high-end processors, and systems with maximum memory configuration.

    The database community regularly spends $100K on a processor-memory complex for performance levels that could be achieved with a single socket, if it were matched with the right memory. There is valid justification from the database side to pursue the memory strategy. There is justification to the processor-memory vendor that this one market has the dollar volume to make this effort worthwhile. And the extremely large memory capacity requirement is shown to now be a red herring.

    In all, there are several worthwhile actions for the next generation of server system architecture. Probably none are mutually exclusive. There are trade-offs between impact, cost, and who does the heavy lifting. No single factor wins in all cases, so a multi-pronged attack is the more correct approach.



    I will discuss the figure below in more detail later.

    The diagram below might be helpful in discussion. The elapsed time for a transaction is the weighted sum of operations that incur a wait at each level. Single thread performance is the inverse of elapsed time. For throughput, memory access latency can be partially hidden with hyper-threading. IO latency is hidden by asynchronous IO, but there is overhead to that too.



    Suppose D is the latency for DDR memory, and S is the latency of some low latency memory, both inclusive of transmission time and possibly L3. There is no read IO in the DDR system. Suppose x is the fraction of memory accesses that fall outside of the fast memory, and I is the latency of those accesses. The term I might represent accesses to DDR or 3D XPoint on the memory interface via a memory access protocol, so it is not really IO. Or it could be 3D XPoint or NAND attached to PCI-E via an IO protocol.

    The criterion for the small fast memory to be an advantage in elapsed time, exclusive of operations that occur inside L3, is as follows.

    (1-x)×S + x×I < D


    x < (D-S)/(I-S)

    The objective is to achieve a large gain in performance via elapsed time. Only so much can be gained in the numerator D-S, so much depends on the latency of I. If the secondary device were DDR or 3D XPoint on the memory interface, then a very high fraction of accesses (x) could be allowed while still achieving a good performance gain. If it were on PCI-E, then 3D XPoint might have a strong advantage over NAND.
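    The break-even inequality is easy to explore numerically. The sketch below uses the 67ns DDR figure cited later in this article; the 20ns fast-memory latency and the secondary-device latencies are illustrative assumptions, not measurements:

```python
# Break-even miss fraction from (1-x)*S + x*I < D  =>  x < (D-S)/(I-S).
def break_even_fraction(D, S, I):
    """D: DDR latency (ns), S: fast-memory latency (ns), I: secondary-device latency (ns)."""
    return (D - S) / (I - S)

D = 67.0   # DDR local-node latency (Intel Memory Latency Checker example)
S = 20.0   # hypothetical low-latency main memory
for name, I in [("DDR as secondary", 67.0),
                ("3D XPoint on memory bus", 250.0),
                ("3D XPoint on PCI-E", 10_000.0),
                ("NAND on PCI-E", 100_000.0)]:
    x = break_even_fraction(D, S, I)
    print(f"{name:24s} I={I:>9.0f}ns  break-even x = {x:.4f}")
```

    With DDR as the secondary device the break-even x is 1.0 (the fast memory always wins), while a 100µs NAND IO tolerates only a tiny miss fraction, which is the argument for 3D XPoint.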


    In the discussion on Knights Landing, I suggested that the Atom core might not be bad for transaction processing. The cheapest Xeon Phi is the 7210 at $2438. About $4700 in a system. What is the difference between Atom C2750 and 2758? Both are Silvermont 8-cores, no HT. Use ECC SODIMM.

    Atom has changed since its original inception, which eschewed out-of-order execution for simplicity and power-efficiency. Silvermont added OOO; not sure about Goldmont. Is Atom to become a slimmed-down Core, with 3-wide superscalar and manufactured on the SoC version of the process?


    I will try to sort out the material and redistribute over several articles as appropriate.

  • Intel Processor Architecture 2020

    There is an article on WCCFtech claiming that a new Intel processor architecture to succeed the lake processors (Sky, Cannon, Ice and Tiger) will be "faster and leaner" and, more interestingly, might not be entirely compatible with older software. The original source is not identified. I suppose it is curious that the Lake processors form a double tick-tock, or now process-architecture-optimization (PAO), but skipped Kaby and Cannon. Both the bridge (Sandy and Ivy) and well processors (Has and Broad) each had only one tick-tock pair.

    Naturally, I cannot resist commenting on this. About time!

    For perspective, in the really old days, processor architecture and instruction set architecture (ISA) were somewhat the same thing. The processor implemented the instruction set, so that was the architecture. I am excluding the virtual-architecture concept, where a lower cost version would not implement the complete instruction set in hardware.

    The Intel Pentium Pro was a significant step away from this, with micro-architecture and instruction set architecture now largely different topics. The Pentium Pro has its own internal instructions, called micro-operations. The processor dynamically decodes X86 instructions to the "native" micro-operations. This was one of the main concepts that allowed Intel to borrow many of the important technologies from RISC.

    The Pentium 4 processor, codename Willamette, had a trace cache, which was a cache for decoded instructions. This may not have been in the Core 2 architecture that followed Pentium 4.

    My recollection is that Pentium Pro had 36 physical registers of which only 8 are visible to the X86 ISA. The processor would rename the ISA registers as necessary to support out-of-order execution. Pentium 4 increased this to 128 registers.

    Also see MIT 6.838 and NJIT rlopes

    The Nehalem micro-architecture diagrams do not mention a µop cache (the acronym is DSB, for Decoded Stream Buffer), but Sandy Bridge and subsequent processors do. This is curious because both Willamette and Nehalem are Oregon designs, while Core 2 and Sandy Bridge are Haifa designs.

    The other stream that comes into this topic involves the Intel Itanium adventure. The original plan for Itanium was to have a hardware (silicon) X86 unit. Naturally, this would not be comparable to the then-contemporary X86 processors, which for Merced would have been the Pentium III, codename Coppermine, at 900MHz. So by implication, X86 execution would probably be comparable to something several years old, a Pentium II 266MHz with luck, and Itanium was not lucky.

    By the time of Itanium 2, the sophistication of software CPU emulation was sufficiently advanced that the hardware X86 unit was discarded. In its place was IA-32 Execution Layer. Also see the IEEE Micro paper on this topic. My recollection was the Execution Layer emulation was not great but not bad either.

    The two relevant technologies are: one, the processor having native µops instead of the visible X86 instructions, and two, the Execution Layer for non-native code. With this, why is the compiler still generating X86 (OK, Intel wants to call these IA-32 and Intel 64 instructions) binaries?

    Why not make the native processor µops visible to the compiler? When the processor detects a binary with native micro-instructions, it can bypass the decoder. Also make the full set of physical registers visible to the compiler? If Hyper-Threading is enabled, then the compiler should know to use only the appropriate fraction of registers.

    I am inclined to also say that the more the compiler knows about the underlying hardware, the better it can generate binaries to fully utilize available resources, with less reliance on the processor doing dynamic scheduling for parallelism. But of course, that was what Itanium was, and we would need to understand why Itanium did not succeed. My opinion was that EPIC was really better suited to scientific computing and not logic heavy server applications.

    Have one or two generations of overlap, for Microsoft and the Linux players to build a native micro-op operating system. Then ditch the hardware decoders for X86. Any old code would then run on the Execution Layer, which may not be 100% compatible. But we need a clean break from old baggage or it will sink us.

    Off topic, but who thinks legacy baggage is sinking the Windows operating system?


    Of course, I still think that one major issue is that Intel is stretching their mainline processor core over too broad a spectrum. The Core is used in both high-performance and high-efficiency modes. For high performance, it is capable of well over 4GHz, probably limited more by power than by transistor switching speed. For power efficiency, the core is throttled to 2 or even 1GHz.

    If Intel wants to do this in a mobile processor, it is probably not that big a deal. However, in the big server chips, with 24 cores in Xeon v4 and possibly 32 cores in the next generation (v5), it becomes a significant matter.

    The theory is that if a given core is designed to operate at a certain level, then doubling the logic should achieve a 40% increase in performance. So if Intel is deliberately de-rating the core in the Xeon HCC die, then they could build a different core specifically for one-half the original performance at perhaps one-quarter the complexity.

    So it should be possible to have 100 cores with half the performance of the Broadwell 4GHz capable core, i.e., equivalent to Broadwell at 2GHz? If this supposed core were very power efficient, then perhaps we could even support the thermal envelope of 100 mini-cores?

    Of course, not every application is suitable for wide parallelism. I would like to see Intel do a processor with mixed cores. Perhaps 2 or 4 high performance cores and 80 or so mini-cores?

    A really neat trick would be if the GPU were programmable, but don't graphics vendors already have things along this line?

  • Memory Latency and NUMA

    It should be intuitively obvious that round-trip memory access latency is one of the most important factors in modern server system architecture for transaction processing databases. Yet this is a topic that no one talks about. Vendors do not want to discuss this because no near or long-term actions are planned. Outsiders cannot write a meaningful article because too much important information is missing. In some respects, there is nothing practical we can do about latency in terms of memory components that we elect to employ. However, we can influence latency because Intel processors since Nehalem in 2009/10 have integrated memory controllers. Hence all multi-socket systems since then have non-uniform memory access (NUMA) and this is one mechanism that determines latency.

    We will start by looking at modern Intel processor architecture and the time scales involved. Then continue by examining system architecture with respect to memory latency. From there, we can do basic calculations on how we expect memory latency to impact transaction processing performance. This basic model still needs to be backed up by real experimental measurements, but there is enough to provide the basis for further investigations. Pending such, one of the important conclusions is that it is time to re-examine fundamental assumptions on server sizing.

    Broadwell, Xeon E3 v4

    The diagram below is a representation of a modern Intel mainline (desktop and mobile) processor. The proportions are from Broadwell, the 5th generation Intel Core i3, i5 and i7 and Xeon E3 v4, as Sky Lake has a different layout arrangement. The four cores, system agent, and graphics communicate over the interconnect ring. The memory controller and IO are in the system agent, but the signals that go off-chip are in the interface shown as a separate unit.



    Never mind the above, I don't think that's how Broadwell 4c is laid out. See below left. Sky Lake is below right.


    Intel and system vendors like to talk about specifications such as processor frequency and memory transfer rate. Recent generation Intel processor cores are capable of operating in the high 3 to low 4GHz range. It is quite possible that they could run even higher, but are power constrained.

    DDR4 Interlude

    The memory interface is DDR4 at 1866 and 2133MT/s. The term used is mega-transfers per second instead of MHz, because the memory clock is cited in MHz and the data transfer rate is eight times that clock. Address and writes should be one-half the data rate. Using MT/s for the data transfer rate is so much clearer.
    (edit 2016-12-26)
    In DDR4, there is a memory clock, 266.67MHz for example, an I/O bus clock that is 4 times higher, 1066MHz, and the data transfer rate at double the I/O clock, 2133MT/s.
    DDR memory timings are specified at the memory interface and consist of 4 values: CL, RCD, RP and RAS, the last of which is frequently not cited; sometimes only the CAS Latency is cited. The values are in terms of I/O clock cycles.

    In database transaction processing, the memory access pattern is largely unpredictable, amounting to a random memory row access, so the latency is RP + RCD + CL, for Row Precharge, Row-to-Column Delay, and CAS Latency. For Registered DDR4-2133, all three values are 15? (Does this include the extra cycle for registered memory?) The 2133MT/s corresponds to a 1066MHz I/O clock, so each cycle is 0.938ns, and 15 cycles is 14ns. The total random access latency at the memory interface is then 42ns?
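    A quick check of this arithmetic, using the 15-15-15 timings assumed above:

```python
# Random-row access latency at the DDR4 interface: RP + RCD + CL,
# each counted in I/O clock cycles (I/O clock = half the data transfer rate).
def ddr_random_latency_ns(data_rate_mt_s, rp, rcd, cl):
    io_clock_mhz = data_rate_mt_s / 2        # e.g. 2133 MT/s -> 1066.5 MHz
    cycle_ns = 1000.0 / io_clock_mhz         # ~0.938 ns per I/O clock
    return (rp + rcd + cl) * cycle_ns

# DDR4-2133 with 15-15-15 timings, as assumed in the text:
print(ddr_random_latency_ns(2133, 15, 15, 15))  # ~42 ns
```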

    Cache and Memory Latency

    There are applications in which processor frequency and memory bandwidth matter a great deal. But neither is particularly important for transaction processing databases. The diagram below calls out some details between the processor core, various levels of cache, and memory that are more relevant to databases.



    At 3GHz, the processor core cycle time is 0.33ns, the inverse of the frequency. The L1 cache is cited as having 4 cycle latency. The L1 is part of the processor core execution pipeline, so to some degree, L1 latency is hidden. L2 cache latency is cited as 12 cycles. It is not certain as to whether this is fixed in cycles, or is actually something like 4ns.

    L3 cache probably depends on the number of cores and other circumstances. It is shown here as 40+ cycles. If L3 latency is actually time based, 15ns for example, then number of cycles would depend on the core clock rate. I am not sure if the L3 latency incorporates L2 latency. Memory latency is probably somewhat over 50ns, plus the L3 latency. How much of this is at the DRAM chip versus the transmission delay between the processor to DRAM and back?

    Igor Pavlov provides both the 7z benchmark and results for many processors at 7-cpu, source code included. 7-cpu lists Haswell L3 at 36 cycles, and memory at L3 + 57ns. Sky Lake L3 is 42 cycles and memory at L3 + 51ns.
    (edit 2016-12-26)
    (This seems to imply that the transmission delay from processor to memory and back is about 10ns?)
    Intel has their own Memory Latency Checker utility. One of the Intel pages shows an example with local node memory latency as 67ns, and remote node at 125ns, which will be used in the examples below.

    Broadwell EP and EX

    The above diagrams were for the desktop and Xeon E3 processors. We are really more interested in the Intel EP/EX processors used in the Xeon E5 and E7 product lines. The latest Xeon E5 and E7 processors are v4, based on the Broadwell core. There are 3 layouts, HCC, MCC and LCC with 24, 15 and 10 cores respectively. Intel provides functional layouts for all three, but the actual die layout is provided only for the LCC model.

    The Broadwell HCC representation is shown below. There are 6 rows and 4 columns of cores, 24 cores total. The 24-core model is only available in the E7. The top E5 has only 22 cores enabled. Two columns of cores communicate over the interconnect dual ring (counter-rotating). The two double rings are connected by a switch. The Intel functional diagram actually shows both QPI and PCI-E on the left side.



    Below are my representations of the MCC model on the left, with 3 columns of 5 rows for 15 cores, and the LCC model on the right, with 2 columns of 5 rows for 10 cores. In the LCC model, there is no ring switch. The PCI-E and memory interfaces actually do overhang the cores, meaning that the space to the right of the cores is blank? As even the LCC is a high-margin product, an exact and efficient fit is not necessary?


    The MCC arrangement of both QPI and PCI-E on the left side of the switch connected to the left ring, and a memory controller on each side of the ring switches, matches the Intel functional layout, but I do not know if there is overhang. Regardless of the actual placement of the controllers, the interface signals for QPI and PCI-E probably do run along the length of the upper edge, and the interface for the memory signals probably runs along most of the lower edge.

    I am inclined to believe that L3 latency is higher in the E5/E7 processors, as the path is longer and more complicated. On the LCC die, there are 10 cores plus the memory, QPI and PCI-E controllers on one ring. However, if the desktop and E3 processors have only one ring (single direction), then it is possible that the bidirectional ring in the E5/E7 processors helps keep L3 latency low? Presumably latencies on the MCC and HCC dies are longer than on the LCC because both rings must be checked?

    Edit 2017-01-4
    Search: Haswell Cluster on Die (COD) Mode, but filter out Call of Duty. An Intel slide deck on this suggests that memory latency is higher when crossing the coherency bus switches.

    Xeon E5 v4

    Below is a representation of a 2-socket Xeon E5 system with the HCC die. Two of the 24 cores are marked out, as the E5 has a maximum of 22 cores. The E5 has 2 full QPI links, and both are used to connect to the other processor. For a core in the left socket, the memory attached to that socket is local memory, and memory attached to the other socket is remote memory.



    It should not be difficult to appreciate that there is a large difference in memory access time between local and remote memory nodes. The Intel Memory Latency Checker example has 67ns for local and 125ns for remote memory, which will be the values we use in a following example. I am not certain if these values are for unbuffered or registered memory. Unbuffered memory should have lower latency, but registered memory is available in larger capacities, 64GB versus 16GB.

    Xeon E7 v4

    Below is a representation of the 4-socket Xeon E7 system. The E7 has 3 full QPI links, one connecting directly to each of the three other processors. All remote processors are then said to be 1-hop away. The difference of significance here between the E5 and E7 systems is that the E5 memory channel connects directly to memory. The E7 connects to a scalable memory buffer (SMB, other names have been used too), that splits into two memory channels on the downstream side. Because there are so few acronyms in effect, the interface from processor to SMB is SMI. The SMB doubles the number of DIMM sites, and in effect, doubles the memory capacity per socket.



    This difference in the memory arrangement between processors designed for 2 and 4-socket systems has been a recurring pattern in Intel system architecture for a long time, though it was not present in all systems, and there were variations. In the current generation, there is a four-socket version of the E5, which does not have the SMB in the path to memory, but each processor only has two QPI links, so one of the remote sockets is two-hops away.

    Long ago, maximum memory capacity was very valuable for database servers in reducing IO from impossible to feasible levels. The extra latency incurred by the SMB chip was worth the price. Since then, memory configurations have increased to stupendous levels, and data read IO has been reduced to negligible. NAND flash has also become very economical, allowing storage to be capable of formerly impossibly high IOPS. Of course, this occurred after it was no longer absolutely essential.

    In more recent years, with maximum memory configuration sometimes more than double what is already extravagant, we might not want to incur the extra latency of the SMB?

    Highly Simplified Example

    Before proceeding, I will take the opportunity to say that the modern microprocessor is an enormously complex entity in so many ways that it defies characterization in anything less than an exceedingly complicated model. The processor has multiple cores. Each core has pipelined superscalar and out-of-order execution. Each core has dedicated L1 I+D cache and dedicated L2 unified cache. And there is decoded micro-op cache as well. The processor has L3 cache shared among the cores.

    That said, I will now provide an example based on an extremely simplified model of the core-processor memory complex.

    Suppose we have a fictitious 3GHz processor, cycle time 0.33ns, with a very fast cache, executing one instruction per cycle when everything is in cache. Suppose further that memory latency is 67ns, and long-latency L3 cache effects are not considered.

    Suppose we have a similarly fictitious transaction of 10M instructions. If everything is in cache, the transaction completes in 3.33 million nanoseconds, or 3.33 milliseconds, and our performance is 300 transactions per second per core.

    Now suppose that 5% (1/20) of instructions require a round-trip memory access before proceeding to the next step. The 0.95 fraction of instructions that have a cache hit consume 9.5M cycles. The 0.05 fraction of 10M is 0.5M instructions that miss cache. Each of these requires a round-trip memory access of 67ns, or 201 cycles, for 100.5M cycles. The total time to complete the transaction is now 110M cycles. Performance is now 27.27 transactions per second per core instead of 300 tps.
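    This simplified model can be written out directly; the numbers below match the worked example (67ns memory, 5% miss fraction):

```python
# Simplified transaction model: n_instr instructions, a fraction `miss` of
# which stall for a full round-trip memory access before the next step.
def tps_per_core(ghz, mem_ns, miss, n_instr=10_000_000):
    cycle_ns = 1.0 / ghz
    mem_cycles = mem_ns / cycle_ns                  # 67 ns at 3 GHz -> 201 cycles
    total_cycles = n_instr * (1 - miss) + n_instr * miss * mem_cycles
    return ghz * 1e9 / total_cycles                 # transactions per second

print(tps_per_core(3.0, 67.0, 0.0))    # all in cache: 300 tps
print(tps_per_core(3.0, 67.0, 0.05))   # 5% misses: ~27.27 tps
```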

    GHz  L3+mem  remote  skt  avg. mem  mem cycle  fraction  tot cycles  tps/core  tot tx/sec
    3    67      125     1    67        201        0.05      110M        27.27     27.27

    Now suppose that we have a 2-socket system, meaning two memory nodes, and that we have not specially architected our database in a manner to achieve higher memory locality than expected from random access patterns. Any memory access is equally likely to be in either node. The local node memory continues to be 67ns and remote node memory access is 125ns. Average memory access is now (67+125)/2 = 96ns, or 288 cycles

    GHz  L3+mem  remote  skt  avg. mem  mem cycle  fraction  tot cycles  tps/core  tot tx/sec
    3    67      125     2    96        288        0.05      153.5M      19.54     39.09

    Without a database architected to achieve memory locality, we have lost 28% performance per core (1 - 19.54/27.27). Of course, we did double the number of cores, so throughput has increased by 43% (2*19.54/27.27). Alternatively, the performance per core in the single socket system is 39.5% better than in the 2-socket system (27.27/19.54). This magnitude is important. Did your vendors and infrastructure experts forget to mention this?

    Now suppose we are on a 4-socket Xeon E7-type system with the same database, so memory access is equally probable to any of the four memory nodes. Local memory access is 25%, and 75% is remote to one of the three other sockets. All sockets are directly connected, so all remote nodes are one-hop away. Now, recall that the Xeon E7 has a memory buffer in the path between the processor (memory controller) and memory.

    GHz  L3+mem  remote  skt  avg. mem  mem cycle  fraction  tot cycles  tps/core  tot tx/sec
    3    82      140     4    125.5     376.5      0.05      197.75M     15.17     60.68

    Let's suppose that the SMB adds 15ns additional latency. (I do not know what the number really is. It is not free. The magic of doubling memory capacity comes at a price.) Local node memory access is now 82ns and remote node is 140ns. Average memory access is (82+3*140)/4 = 125.5ns, or 377 cycles.

    We have now just lost another 22% performance per core going from a 2-socket E5 to the 4-socket E7 type systems (1 - 15.17/19.54). Total throughput is 55% better than the two-socket (2*15.17/19.54). Performance per core is 28.8% better on the 2-socket than on the 4-socket.

    The performance per core on the 1-socket is 80% better than on the 4-socket (27.27/15.17). The 4-socket has 2.22X better throughput than the 1-socket (4*15.17/27.27).

    Scaling - NUMA

    Below are all three of the above cases in a single table.

    [Table: all three cases; GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec]
    Scaling - Frequency

    We can also do the same calculations based on a similarly fictitious 2GHz processor.

    [Table: GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec]

    Notice that we did not lose much performance in stepping down from 3 to 2GHz. We could even further step down to 1GHz and still be at 23.26, 34.78 and 55.36 tot tps for 1, 2, and 4 sockets respectively. It is important to stress that this is based on the assumption of a transaction processing workload having the characteristic of serialized memory accesses.
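    The whole family of cases can be generated from one small function. The latencies are the ones assumed in the text: 67ns local, 125ns remote for the 2-socket E5 case, and an assumed extra 15ns for the E7 SMB (82ns local, 140ns remote), with random page placement:

```python
# Extend the serialized-miss model across socket counts and clock speeds.
# Latency assumptions are the ones used in the text, not measurements.
def tps_per_core(ghz, avg_mem_ns, instr=10_000_000, miss=0.05):
    # Cycles = compute cycles on hits + stall cycles on misses.
    cycles = instr * (1 - miss) + instr * miss * avg_mem_ns * ghz
    return ghz * 1e9 / cycles

cases = {
    1: 67.0,             # 1-socket: all accesses local
    2: (67 + 125) / 2,   # 2-socket: 96ns average, random placement
    4: (82 + 3 * 140) / 4,  # 4-socket E7 with SMB: 125.5ns average
}
for ghz in (3.0, 2.0, 1.0):
    for sockets, avg_ns in cases.items():
        t = tps_per_core(ghz, avg_ns)
        print(f"{ghz}GHz {sockets}-socket: {t:.2f} tps/core, {sockets * t:.2f} total")
```

    At 3GHz this reproduces the 27.27, 19.54, and 15.17 tps/core figures from the earlier sections, and at 1GHz the 23.26, 34.78, and 55.36 total tps figures quoted above.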


    All of the above is based on a highly simplified model. Real and rigorous testing should be done before drawing final conclusions. Regardless, there is no way anyone can claim that the difference between local and remote node memory access latency is unimportant, unless the database has been architected to achieve a high degree of memory locality.

    Front-Side Bus Systems

    In the days before Intel integrated the memory controller, four processors connected to a memory controller in a system with uniform memory access. The Pentium Pro arrangement of 4P on one bus is represented on the left, although the diagram is actually closer to the 450NX or later.


    The system on the right represents the 4-way Xeon 7300, each quad-core processor on its own bus connected to the 7300 MCH. Intel had already committed to the Pentium 4 quad-pumped bus in 2000/01. Vendors were expecting a long stable infrastructure, so Intel delayed the switch-over from bus to point-to-point until 2009.

    Pre-Nehalem NUMA Systems

    A system with 16 or more processors could be built with a custom memory + node controller connecting four processors, memory and also a crossbar. The crossbar in turn connects multiple node controllers to form a system having non-uniform memory access (NUMA).



    In the old NUMA systems, some SQL Server operations ran fine, and other operations had horrible characteristics, far worse than would be suggested by the remote-to-local node memory latency ratio. So it is possible that there are other NUMA effects with far greater negative impact. Some may have since been resolved, while others may still be present but not as pronounced in modern NUMA systems.

    Hyper-Threading (HT)

    An important fact to notice is that a very high fraction of CPU cycles are no-ops, where the processor core does nothing while waiting for a round-trip memory access. This is why Hyper-Threading is highly effective. While one logical processor is waiting for a memory access to complete, the other thread can run. Scaling on the logical processors can be nearly linear for a transaction processing workload.
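    The no-op point can be quantified with the same simplified model: a single thread keeps the execution units busy only a small fraction of the time, so a second thread's compute demand fits easily into the first thread's stalls. A rough sketch, using the 3GHz single-socket numbers from earlier:

```python
# Why 2-way HT scales well in this model: the core is mostly idle
# waiting on memory, so two threads barely contend for execution cycles.
compute_cycles = 9.5e6    # cycles actually executing instructions
stall_cycles = 100.5e6    # cycles waiting on memory round trips
busy_fraction = compute_cycles / (compute_cycles + stall_cycles)
print(round(busy_fraction * 100, 1))      # ~8.6% of cycles doing work
# Two threads together demand ~17.3% of the core's cycles, so their
# stalls overlap almost perfectly and throughput is close to 2x.
print(round(2 * busy_fraction * 100, 1))
```

    This is only a back-of-envelope argument (it ignores cache contention between the two threads), but it explains why scaling on logical processors can be nearly linear for this workload.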

    Note, the first generation of Intel Hyper-Threading was on the Intel Pentium 4 (NetBurst) processors. In that generation, the implementation had two threads running simultaneously, both trying to fill the superscalar execution units in each clock cycle. The first generation of HT was problematic. It could have been that executing two threads simultaneously was too aggressive, or it could have been simply that the Windows operating system and SQL Server engine at the time did not know how to properly use HT. The next generation of Intel processor architecture, Core 2, did not have HT.

    Then in Nehalem, HT returned, except that this time it was a time-slice implementation. Only one thread executes on any given cycle. When the executing thread encounters a memory access or other long-latency operation, the processor core switches to the other thread. If anyone has doubts about HT based on experience or hearsay from the Pentium 4 generation, forget them. The Nehalem and later HT is highly effective for transaction processing workloads. There used to be several SKUs with HT disabled in the Xeon E5/7 v1-3 generations. Pay close attention, and pass on the no-HT SKUs.

    The question to ask is why Intel does not increase the degree of HT. The generic term is Simultaneous Multi-Threading (SMT). Both IBM POWER and Oracle SPARC processors are or have been at 8-way SMT. Granted, one of the two mentioned that scaling to 8-way SMT was tricky. It is high time for Intel to increase HT to 4-way.

    Database Architecture

    In the above examples, the simple model suggests that scaling to multiple sockets is poor on the assumption of a transaction processing database without a means of achieving memory locality. (There is supposed to be an HPE whitepaper demonstrating the importance of the SQL NUMA tuning techniques in a properly designed database.) Just what does a database architected for NUMA mean? Naturally, this will have to be expounded in a separate article.

    But for now, take a close look at both the TPC-C and TPC-E databases. The TPC-C database has all tables leading with a common key, Warehouse Id, that provides a natural organizational structure. The TPC-E database has 5 transaction tables with a common key value, but it does not use the identity property. Instead it uses a function that must read, then update, a table to determine the next value.

    The Case for Single Socket

    Naturally, the database and application should be architected together with the SQL Server NUMA tuning options to support good scaling on multi-socket NUMA systems. If we neglected this in the original design, I am sure many DBA-developers know how well such a suggestion would be received by management.

    Is there another option? Well yes. Get rid of the NUMA latency issue with a non-NUMA system. Such a system has a single processor socket, hence one memory node. Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.

    Still, a single quad-core processor today is 40 times more powerful than a 4-way system from twenty years ago (100,000 tpm-C per core is probably possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996). The Xeon E3 could probably support many medium sized organizations. Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each). My recollection is that many IO problems went away at the 32-64GB level. And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4's.

    But I am really talking about a single-socket Xeon E5. In the v4 generation, we could have up to 22 cores, though we should start by looking at the 10-core E5-2630 v4 at $667, stepping up to the 16-core 2683 v4 at $1846 before going to the 20-22 core models at $3226 and $4938.



    The Xeon E5 has 40 PCI-E gen 3 lanes. It might be convenient if there were a motherboard with 1 PCI-E x8 and 8 x4 slots, because NVMe PCI-E SSDs are more common and economical with the x4 interface. Supermicro does have a UP Xeon E5 motherboard (X10SRL-F) with 4 x8 and 2 x4 gen 3 slots plus 1 x4 gen 2. It only has 8 DIMM sites out of the 12 possible with the E5, but that is probably good enough.


    A strong explanation was provided showing why round-trip memory latency is very important in transaction processing. One implication of this is that scaling to multiple sockets is poor due to the NUMA effect. A remedy is to architect the database and application together, working with the SQL Server NUMA tuning options, to achieve locality. Alternatively, give serious consideration to a single-socket, yet still very powerful, system. A second implication is that processor frequency is less important for transaction processing, though it might be important for other aspects. The memory latency effect also supports the argument that Hyper-Threading is highly effective and that Intel really needs to increase the degree of HT.


    OK, I didn't show why database transaction processing incurs the round-trip memory latency. It has to do with the b-tree index, in which we read through a page to find the right pointer to the next level. We access the memory for that pointer, then read through to find the next pointer. I will try to do a diagram of this later. But if someone can dig through an open source database engine, please send a code example.
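    In lieu of digging through a real engine, here is a minimal sketch (not taken from any actual database) of a b-tree index seek. The point is that each level's search depends on the page fetched by the level above, so the memory accesses are serialized: the next round trip cannot begin until the previous pointer has come back from memory.

```python
# Minimal b-tree seek sketch. Pages are modeled as dicts; in a real
# engine each child-pointer dereference is a potential cache miss, and
# the misses cannot overlap because each depends on the previous one.
import bisect

def btree_seek(root, key):
    page = root
    while not page["leaf"]:
        # Dependent load: the child pointer is only known after this
        # page's key array has been read and searched.
        i = bisect.bisect_right(page["keys"], key)
        page = page["children"][i]   # round-trip to memory if not cached
    i = bisect.bisect_left(page["keys"], key)
    if i < len(page["keys"]) and page["keys"][i] == key:
        return page["rows"][i]
    return None

# A tiny two-level tree for illustration.
leaf1 = {"leaf": True, "keys": [5, 10], "rows": ["r5", "r10"]}
leaf2 = {"leaf": True, "keys": [20, 30], "rows": ["r20", "r30"]}
root = {"leaf": False, "keys": [15], "children": [leaf1, leaf2]}
print(btree_seek(root, 20))  # -> r20
```

    A 3-4 level index therefore costs roughly 3-4 serialized memory round trips per seek when the pages are not in cache, which is exactly the access pattern the simplified model above assumes.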


    It would help if Intel provided L3 latencies for the Xeon E5 v4 LCC, MCC and HCC models. What are the memory latencies for local and remote node in E5 v4? How much latency does the SMB in the Xeon E7 add?

    Note that Xeon E3 and client-side processors use unbuffered memory. While Xeon E5 can use unbuffered memory, it is currently limited to 16GB DIMMs, while registered memory is available in capacities up to 64GB.

    The Xeon D is for specialized embedded applications and not suited for the single-socket database server. It has only 4 DIMM sites?

    Supermicro has an Ultra product line targeting specialized applications. One of its features is called Hyper-Speed. They claim that with very high quality design and components, it is possible to reduce (memory?) latency via low jitter. I would like to know more about this. But the only option is dual-socket Xeon E5, and I am more interested in single-socket. The emphasis seems to be on RHEL, and high-frequency trading? There are examples for determining which processor socket the NIC is attached to, and whether a thread is running on a core in that socket. These hardware organization detection tools really should be incorporated into Windows as well. I have tried to use the WMI API from C#, but some things require coding in C or possibly assembly?

    It was stressed that round-trip memory latency impacts transaction processing databases. Column-store DW avoids this problem by emphasis on marching through memory sequentially. The main intent of Hekaton memory-optimized tables was to eliminate the need for locks. But the other part was the use of a hash index, which happens to reduce the number of memory round-trip operations.
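    The hash-index point can be put in rough numbers. A b-tree seek costs about one dependent memory access per level, while an ideal hash lookup costs about one (the bucket), plus any chain hops. The fanout figure below is an assumption for illustration, not a SQL Server specification:

```python
# Rough comparison of serialized memory round trips per index lookup.
# Assumed: 100M rows, ~300 keys per page (illustrative fanout).
import math

rows = 100_000_000
fanout = 300
btree_levels = math.ceil(math.log(rows, fanout))  # dependent accesses per seek
hash_accesses = 1                                 # ideal: direct bucket hit
print(btree_levels, hash_accesses)                # 4 vs 1 round trips
```

    Under the serialized-miss model earlier in the article, cutting four dependent round trips to one per lookup is a large win, independent of the lock-elimination benefit of Hekaton.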

    Additional references
    Intel Core i7 / Xeon 5500 Series
    Data Source Latency (approximate)
    L1 CACHE hit, ~4 cycles
    L2 CACHE hit, ~10 cycles
    L3 CACHE hit, line unshared ~40 cycles
    L3 CACHE hit, shared line in another core ~65 cycles
    L3 CACHE hit, modified in another core ~75 cycles
    remote L3 CACHE ~100-300 cycles
    Local Dram ~60 ns
    Remote Dram ~100 ns
    Intel paper by D. Levinthal (source of the latency figures above)

    Ulrich Drepper paper, What Every Programmer Should Know About Memory (freebsd mirror)

  • 40 Gigabit Ethernet

    A couple of months ago, I got a pair of Intel XL710QDA2 Ethernet Converged Network Adapters, which is the fancy way of saying 40 Gigabit Ethernet that does other stuff too. I also bought the Intel QSFP+ twin-axial cable. In hindsight, I might have opted for the newer Intel Omni-Path 100Gbps (56Gbps on PCI-E x4), which came out in Q4 2015, over the XL710, which came out in Q3 2014. But Intel Omni-Path only supports Linux at this time, with a heavy emphasis on HPC applications, so it is unclear whether they will support Windows in the future. Another option was InfiniBand, which other people have written about in conjunction with SMB 3.0 in Windows Server 2012.

    The XL710QDA2 (dual-port) adapters were $500 each, and the cable $130. The single port XL710QDA1 was $420, so I thought having the second port was worth an extra $80 even though I am not using it yet. A 40GbE switch seems to cost about $500 per port, which I think is reasonable. There is the Cisco Nexus 9332PQ for $17K. I thought there was another vendor, Edge-Core? with a 20-port, but it is difficult searching on 40 GbE because of all the junk results. In any case, I do not believe that there is a small-office (8-12-port) 40 gigabit Ethernet switch. So, my current usage is connecting two systems directly with the QSFP+ twin-axial cable, and using the embedded 1GbE for internet connectivity.

    The more readily apparent reason for high-speed networks in a SQL Server environment is to move backups from one machine to another. For my perfunctory dig at storage system vendors, the SAN vendor will say: our million-dollar product has software so you don’t need to move large backups over the network.
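    Some back-of-envelope numbers make the backup case concrete. Assuming the link is the bottleneck and about 80% of line rate is achievable in practice (an assumption, not a benchmark), a 1TB backup moves as follows:

```python
# Transfer time for a 1TB backup at various link speeds, assuming
# ~80% of line rate is achievable (illustrative, not measured).
backup_bytes = 1e12
for gbps in (1, 10, 40):
    usable = gbps * 1e9 / 8 * 0.8            # usable bytes/sec
    minutes = backup_bytes / usable / 60
    print(f"{gbps} GbE: {minutes:.1f} minutes")
```

    Roughly 167 minutes at 1GbE versus about 4 minutes at 40GbE, which is the difference between a backup copy that fits in a maintenance window and one that does not.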

    A less obvious reason is to achieve low latency in support of serialized high-volume calls. Transaction processing might not need high network data rates, and can probably generate high call volume on 1 GbE from multiple concurrent connections. However, there could be situations that require the highest possible call volume on a single thread and connection. Of course, in this case, the first course of action is to employ a back-to-back connection, bypassing any switches altogether. And if a direct connection is not possible, definitely do not have more than one switch in the path from application to database.
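    For serialized calls, each call must complete before the next is issued, so single-connection throughput is bounded by 1/round-trip-time. The round-trip figures below are illustrative assumptions, not measurements:

```python
# Single-connection serialized call rate is 1 / round-trip time.
# Latency values are assumptions chosen for illustration.
for label, rtt_us in (("back-to-back", 30), ("one switch", 50), ("two switches", 70)):
    print(f"{label}: {round(1e6 / rtt_us)} calls/sec")
```

    This is why every switch hop removed from the path matters: shaving even tens of microseconds off the round trip directly raises the ceiling on calls per second for a single thread.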

    Potentially, we might think that high volume ETL could benefit from high data rate networks, but I believe that SQL Server single thread write operations are more limiting than 1 gigabit Ethernet. But perhaps more recent versions of SSIS might have better tools for parallelism.

    Trying out the Intel 40GbE adapter
    Back in August, when I got the XL710QDA, my test systems were running Windows Server 2016 TP5, for which Intel had a beta driver. The adapter was recognized by the operating system and the beta driver installed with no issues. The Intel documentation on firmware updates was very complicated, so I did not apply any. In this setup, even though the driver installed, I could not ping across the 40GbE network. Intel support was of the opinion that the back-to-back connection should work. No diagnostic tools were provided for examining the software stack for the XL710.

    Then in early October, there was a substantial hotfix for Windows Server 2016 TP5, after which SQL Server connections were going over the 40GbE. A ping test worked one way but not the other. In any case, Windows Server 2016 RTM was out about this time, so I installed both machines to RTM, applied OS hotfixes, and updated to the release driver. Now everything seems to work.

    In bulk copies, the network transfer rate was high, and serialized network calls were also faster than on the 1GbE network (which was via a switch, not back to back). When I get a chance, I will look at how much round-trip latency reduction I would have gotten on 1GbE with a direct back-to-back cable. I did this about 15 years ago when gigabit adapters came down to $300 but switches were still very expensive. If anyone is interested, I investigated whether a cross-over cable was needed for gigabit (see Gigabit Ethernet Direct Connect); no special cross-over cable is needed.

    10 gigabit Ethernet
    For people not ready to jump directly to 40 Gigabit Ethernet, consider 10 Gigabit. Adapters are now in the $300-400 range, and switches are in the $100 per port territory. The main consideration is which interface to standardize on. Two options for short distances are 10GBase-T and SFP+ direct-attach. It appears CX4 is not getting traction? One might be tempted to default to twisted pair, CAT6a or CAT7, for 10GBase-T. Cables for CAT6a and CAT7 are really cheap, $3-10 for short pre-made CAT6a and $7-17 for CAT7, while SFP+ cables are $50-100.

    However, various sources mention that 10GBase-T latency is higher than for the other options. And in certain situations, serialized network calls for example, latency is the primary criterion. In large file transfers, latency is not as important.

    Netgear actually offers small office 10 Gigabit Ethernet switches. The XS7xx models are 8, 12, 16, and 24 10GBase-T ports, with some models having additional SFP+ uplink ports. The M4300 models include one with 8 10GBase-T + 8 SFP+ and 12+12. I did not see any that were all or mostly SFP+, but the X+X should be acceptable. It is possible to use the Intel XL710 with a QSFP to 4-port SFP breakout cable.

    Is the Intel 40GbE a dead-end? Intel is going its own way at 100Gbps with Omni-Path, which currently does not support Windows.

    Mellanox has a new 100GbE that also does NVMe over fabric?

    Mellanox is offering a rebate for a one-time purchase of their 40/100GbE switches, the SN2100 16-port? The price might be $6600-7200? with a 50% rebate. This puts it at $200+ per port at the rebate rate, and perhaps $450 per port normally?
