THE SQL Server Blog Spot on the Web


Joe Chang

  • 40 Gigabit Ethernet

    A couple of months ago, I got a pair of Intel XL710QDA2 Ethernet Converged Network Adapters, basically the fancy way of saying 40 Gigabit Ethernet, along with a QSFP+ twin-axial cable. At that point in time, I might have preferred the newer Intel Omni-Path 100Gbps (56Gbps on PCI-E x4), which came out in Q4 2015, over the XL710, which came out in Q3 2014. But Intel Omni-Path only supports Linux at this time, with a heavy emphasis on HPC applications, so it is unclear whether they will support Windows in the future. Another option was InfiniBand, which other people have written about in conjunction with SMB 3.0 in Windows Server 2012.

    The XL710QDA2 (dual-port) adapters were $500 each, and the cable $130. The single port XL710QDA1 was $420, so I thought having the second port was worth an extra $80 even though I am not using it yet. A 40GbE switch seems to cost about $500 per port, which I think is reasonable. There is the Cisco Nexus 9332PQ for $17K. I thought there was another vendor, Edge-Core? with a 20-port, but it is difficult searching on 40 GbE because of all the junk results. In any case, I do not believe that there is a small-office (8-12-port) 40 gigabit Ethernet switch. So, my current usage is connecting two systems directly with the QSFP+ twin-axial cable, and using the embedded 1GbE for internet connectivity.

    The more readily apparent reason for high-speed networks in a SQL Server environment is to move backups from one machine to another. For my perfunctory dig at storage system vendors, the SAN vendor will say: our million-dollar product has software so you don’t need to move large backups over the network.

    A less obvious reason is to achieve low latency in support of serialized high-volume calls. Transaction processing might not need high network data rates, and can probably generate high call volume on 1 GbE from multiple concurrent connections. However, there could be situations that require the highest possible call volume on a single thread and connection. Of course, in this case, the first course of action is to employ a back-to-back connection, bypassing any switches whatsoever. And definitely do not have more than one switch in the path from application to database if a direct connection is not possible.
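
    To make that concrete, the unit of work in such a test is a minimal stored procedure, with one EXEC per network round trip from a single client thread. A minimal sketch, assuming a hypothetical Orders table and lookup procedure (not any particular application's actual code):

    CREATE PROCEDURE dbo.GetOrderStatus
        @OrderID bigint
    AS
    SET NOCOUNT ON;  -- suppress the extra rows-affected message on each call
    SELECT OrderStatus
    FROM dbo.Orders
    WHERE OrderID = @OrderID;

    The client drives this in a tight single-threaded loop over each network path (1GbE via switch, 40GbE back to back); the achievable calls per second is then roughly the inverse of the network round-trip latency plus the server-side cost per call.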

    Potentially, we might think that high volume ETL could benefit from high data rate networks, but I believe that SQL Server single thread write operations are more limiting than 1 gigabit Ethernet. But perhaps more recent versions of SSIS might have better tools for parallelism.

    Back in August, when I got the XL710QDA, my test systems were running Windows Server 2016 TP5, for which Intel had a beta driver. The adapter was recognized by the operating system and the beta driver installed with no issues. The Intel documentation on firmware updates was very complicated, so I did not apply any. In this setup, even though the driver installed, I could not ping across the 40GbE network. Intel support was of the opinion that the back-to-back connection should work. No diagnostic tools were provided for examining the software stack for the XL710.

    Then in early October, there was a substantial hotfix for Windows Server 2016 TP5, after which SQL Server connections were going over the 40GbE. A ping test worked one way but not the other. In any case, Windows Server 2016 RTM was out about this time, so I installed both machines to RTM, applied OS hotfixes, and updated to the release driver. Now everything seems to work.

    In bulk copies, network transfer rate was high and serialized network calls were also faster than on the 1GbE network (which was via a switch, not back to back). When I get a chance, I will look at how much round-trip latency reduction I would have gotten on 1GbE with a direct back-to-back cable. I did this about 15 years ago when gigabit adapters came down to $300 but switches were still very expensive. If anyone is interested, I investigated whether a cross-over cable was needed for gigabit (see Gigabit Ethernet Direct Connect); no special cross-over cable is needed.

    For people not ready to jump directly to 40 Gigabit Ethernet, consider 10 Gigabit. Adapters are now in the $300-400 range, and switches are in the $100 per port territory. The main consideration is which interface to standardize on. Two options for short distances are 10GBase-T and SFP+ Direct-Attach. It appears CX4 is not getting traction? One might be tempted to default to twisted pair, CAT6a or CAT7, for 10GBase-T. Cables for CAT6a and 7 are really cheap, $3-10 for short length pre-made CAT6a and $7-17 for CAT7, while SFP+ cables are $50-100.

    However, various sources mention that 10GBase-T latency is higher than for the other options. And in certain situations, serialized network calls for example, latency is the primary criterion. In large file transfers, latency is not as important.

    Netgear actually offers small office 10 Gigabit Ethernet switches. The XS7xx models have 8, 12, 16, or 24 10GBase-T ports, with some models having additional SFP+ uplink ports. The M4300 models include one with 8 10GBase-T + 8 SFP+, and another with 12 + 12. I did not see any that were all or mostly SFP+, but the mixed 10GBase-T plus SFP+ models should be acceptable. It is possible to use the Intel XL710 with a QSFP+ to 4-port SFP+ breakout cable.

  • Samsung SSD 960 PRO - 3500MB/s

    Last month, Samsung announced that their 960 Pro (PCI-E x4 gen3) M.2 NVMe SSD would be available in October at 512GB, 1TB and 2TB capacities. This is a high-end PC and workstation product. There is a similar but separate product line (SM961?) for OEM customers. The 960 Pro had already been mentioned earlier this year. This week, all the hardware sites published product reviews, including Anandtech, Tomshardware, TheSsdReview and others as well.

    All the previous PCI-E x4 gen3 NVMe SSDs were rated between 2,000-2,500MB/s in large block read. The 960 Pro is rated for 3,500MB/s read. This is pretty much the maximum possible bandwidth for PCI-E x4 gen3. Each PCI-E gen3 lane is 8Gbit/s, but the realizable bandwidth is less. In earlier generation products, an upper bound of 800MB/s realizable per 8Gbit/s nominal signaling rate was typical.

    Presumably there was a reason why every PCI-E x4 SSD was in the 2,000-2,500MB/s bandwidth range. It could be that these were 8-channel controllers and the NAND interface was 333-400MB/s. Even though 8 x 400MB/s = 3,200MB/s, it is expected that excess bandwidth is necessary on the downstream side. There could be other reasons as well, perhaps the DRAM for caching NAND meta-data. Intel had an 18-channel controller, which produced 2,400MB/s in the 750 line, and 2,800MB/s in the P3x00 line.

    It is very annoying that the more vertical SSD OEMs decline to disclose meaningful details of the controller in their SSDs. The 960 PRO uses the new Polaris controller, for which few details beyond the Read/Write IOPS and bandwidth specifications are published. On TheSsdReview, the 960 PRO is shown to have the Polaris controller and DRAM in a single package. There are 4 NAND packages, each with 16 stacked 512Gbit die for the 2TB product.

    I am guessing that the Polaris controller has 8-channels, at 8-bit per channel with each package connecting to 2 channels? But there are other possibilities, such as 4 x 16-bit channels.

    The reason is that the previous generation NAND were 333 or 400MB/s. The most recent NAND from Micron is listed at 533MB/s. In order to support 3,500MB/s on the upstream interface, there needs to be more than that amount on the downstream side combined. While Samsung is the leader in V-NAND, it seems more likely that the 960 has 8 channels 8-bit (or 4 channels, 16-bit) at 533MT/s and not 4 channels, 8-bit 1066MT/s. Between the two likely scenarios, 8-channel 8-bit should support higher IOPS than 4 channel 16bit?

    Both the 2TB and 1TB models have read specs of 440,000 IOPS while the 512GB model is 330,000 IOPS. It could be that the 1TB is 4 packages of 16 stacked chips with 256Gb chips, and the 512GB model has 8 chips stacked. It is also possible that the 1TB has packages stacked 8 high with the same 512Gb die, just with no reduction in IOPS, while the 512GB has 4-chip stacks?

    I would think that 2 x 960 PROs in a single socket Xeon E3 quad-core would be a great match. My current test system is a single socket Xeon E5 10-core with four Intel 750 SSDs. I will probably not outfit another test system until the Xeon E5 v5 comes out next year, in which I may opt for the 960 PRO.

    Oh yeah, the 960 PRO is not an enterprise product, so it should probably not be used in critical production environments, but should be OK for test.
    It is possible to buy a M.2 to PCI-E adapter, but I would like an adapter that can accommodate 2 M.2 devices in one PCI-E x8 slot, without the use of a PCI-E switch if possible.
    HP has an adapter for 4 M.2 devices in an x16 slot that only works in HP systems.

    Last year I bought a Dell XPS9550 with the then new Skylake processor. It came with a 512GB Samsung PM951, which was an older model rated for 1GB/s sequential, while the slightly more expensive SM951 was rated for 2GB/s. This drive died after 3 months. Dell sent a replacement, but I elected to go out and buy the 950 PRO myself, rated at the same 2GB/s as the SM951, though it may or may not be the same product underneath. The failure of one device is not sufficient to make an assessment, but I have dozens of OCZ Vertex and Vectors, and several Intel SSDs of various flavors, all running for several years now without failure.

  • Insert Performance Limitations with Sequentially Increasing Index

    Earlier this year, HPE and Microsoft sponsored an article, The Importance of Benchmarking, in SQL Server Pro. While true, it is also apparent that there has been little interest in benchmarks within the DBA/IT community over the last several years. There are many reasons for this, one of which is the emphasis on the benchmark result as the important takeaway.

    Today most people expect that systems are very powerful, and probably know that performance at the processor core level has improved only incrementally in recent years. The main avenue is the increasing number of cores with each generation, currently at 22 in Xeon E5 v4 and 24 in E7 v4.

    The TPC benchmarks are generally well designed, with rules meant to prevent the publication of a distorted result. But every application has its own characteristics, different table organizations, and transactions with SQL logic bearing no resemblance to a benchmark or another application. As such, a benchmark result does not translate directly to a different set of circumstances.

    Of great importance are the problems that were encountered in the benchmark, and the measures that were taken to address each problem. The topic of this article is Inserts to a table with a sequentially increasing index, commonly in the form of an identity column, but other implementations are possible as well.

    At a New York City SQL Server Users Group meeting, Thomas Grohser, author of Expert SQL Server Performance Engineering, mentioned that multiple clients (connections) doing single row inserts to a table with a clustered index on an identity column had extremely poor performance running on a 4-socket system. In that case, this was about 6,000 rows per second. For inserts to a table with a clustered index on a uniqueidentifier, or with a compound key not leading with the identity column, performance was 95,000 (calls and) rows per second. Memory-optimized tables could achieve 1.7 million rows per second. But it might be a good idea to consult with Thomas on this.

    The assessment was that this was a contention issue due to the high latency between processor sockets. All modern systems with more than one processor socket have Non-Uniform Memory Access (NUMA) architecture. For Intel processors, this goes back to Xeon 5500 (2009) and 7500 (2010) and 2003/4 for AMD Opteron.

    An Insert to a table with an index on an identity column, regardless of whether the index is clustered or nonclustered, involves acquiring an exclusive lock on the last row of the last page, as well as accessing the memory for that page and row.

    When there are multiple concurrent threads with affinity on cores in different sockets of a multi-socket system, the memory involved bounces between the L3 caches of each socket. While it is inefficient for memory to bounce across sockets, the magnitude of the impact on Inserts is stunning.

    Testing Insert Performance

    The test environment here is a single socket Xeon E3 v3, quad-core, hyper-threading enabled. Turbo-boost is disabled for consistency. The software stack is Windows Server 2016 TP5, and SQL Server 2016 cu2 (build 2164). Some tests were conducted on a single socket Xeon E5 v4 with 10 cores, but most are on the E3 system. In the past, I used to maintain two-socket systems for investigating issues, but only up to the Core2 processor, which were not NUMA.

    The test table has 8 fixed-length, not null columns: 4 bigint, 2 guids, 1 int, and a 3-byte date. This adds up to about 70 bytes. With row header and row-offset overhead, this works out to about 100 rows per page at 100% fill-factor.

    Both heap and clustered index organized tables were tested. The indexes tested were 1) single column key sequentially increasing and 2) two column key leading with a grouping value followed by a sequentially increasing value. The grouping value was chosen so that inserts go to many different pages.
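
    A sketch of what such a test table and the two index variants might look like; the table and column names here are hypothetical, not the actual test schema:

    CREATE TABLE dbo.TestInsert (
        ID      bigint IDENTITY NOT NULL,  -- sequentially increasing value
        GroupID bigint NOT NULL,           -- grouping value for the two-column key
        Col3    bigint NOT NULL,
        Col4    bigint NOT NULL,
        Guid1   uniqueidentifier NOT NULL,
        Guid2   uniqueidentifier NOT NULL,
        Col7    int NOT NULL,
        Col8    date NOT NULL              -- 3-byte date
    );
    -- variant 1: single column, sequentially increasing key (the contended case)
    -- CREATE UNIQUE CLUSTERED INDEX ci_TestInsert ON dbo.TestInsert (ID);
    -- variant 2: two-column key leading with the grouping value, spreading inserts over many pages
    -- CREATE UNIQUE CLUSTERED INDEX ci_TestInsert ON dbo.TestInsert (GroupID, ID);

    The nonclustered variants are the same two keys created with CREATE UNIQUE NONCLUSTERED INDEX on the heap organized table.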

    The test was for a client to insert a single row per call. Note that the recommended practice is to consolidate multiple SQL statements into a single RPC, aka network roundtrip, and if appropriate, bracket multiple Insert, Update and Delete statements with a BEGIN and COMMIT TRAN. This test was contrived to determine the worst case insert scenario.
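
    The per-call unit in the test is then a procedure along these lines, one EXEC (RPC) per row; again the names are hypothetical:

    CREATE PROCEDURE dbo.InsertOneRow
        @GroupID bigint, @Col3 bigint, @Col4 bigint,
        @Guid1 uniqueidentifier, @Guid2 uniqueidentifier,
        @Col7 int, @Col8 date
    AS
    SET NOCOUNT ON;
    INSERT dbo.TestInsert (GroupID, Col3, Col4, Guid1, Guid2, Col7, Col8)
    VALUES (@GroupID, @Col3, @Col4, @Guid1, @Guid2, @Col7, @Col8);

    The recommended practice, in contrast, would put several such statements into one RPC bracketed with BEGIN TRAN and COMMIT TRAN, which also means one log flush per batch instead of one per autocommit statement.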

    On the single socket Xeon E3 quad-core system, the heap table with no indexes, as well as both the heap with one nonclustered index and the clustered index table with the two-column key, could support an insert call rate of 70-72,000 per sec, counting both rows inserted and network roundtrips.

    The heap table with one nonclustered index and the table with only a clustered index, both with the index key being a single sequentially increasing column such that inserts from all threads/connections go to the last page, could support an insert call rate of about 40,000 rows per sec.


    The issue here is multiple concurrent connections making calls to Insert into a single table with a sequentially increasing index, clustered or nonclustered, resulting in contention between threads for the last page and row.

    There is a substantial 42% performance degradation even on a single socket system, with memory in a common L3 cache, because the relevant memory locations are moved between the L2 caches of different cores after exclusive access is acquired. While the impact is large, even this may not be a crippling effect depending on the circumstance.

    In a multi-socket system, memory must now also be moved between the L3 caches of different processor sockets, which has higher latency. The impact is expected to be more severe as the number of sockets increases.

    The presumption here is that the limitation is in the ability to acquire exclusive locks between threads on different cores, potentially in different sockets. If each call inserted more than one row, the call rate might decrease only slightly, resulting in a higher row insert rate. Hence the explicit emphasis on the call rate as well as the row insert rate.
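
    One way to get more than one row per call without changing the call pattern much is a table-valued parameter (a multi-row VALUES list works too); a minimal sketch against the same hypothetical table:

    CREATE TYPE dbo.TestInsertRows AS TABLE (
        GroupID bigint, Col3 bigint, Col4 bigint,
        Guid1 uniqueidentifier, Guid2 uniqueidentifier,
        Col7 int, Col8 date
    );
    GO
    CREATE PROCEDURE dbo.InsertManyRows
        @Rows dbo.TestInsertRows READONLY
    AS
    SET NOCOUNT ON;
    INSERT dbo.TestInsert (GroupID, Col3, Col4, Guid1, Guid2, Col7, Col8)
    SELECT GroupID, Col3, Col4, Guid1, Guid2, Col7, Col8
    FROM @Rows;

    Each EXEC is still one network round trip and, with a sequential key, the last-page contention is the same, but many rows go in per call.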

    Note also that we should probably be able to insert into multiple tables, each at more or less the same rate as inserts to a single table, until the limitation becomes the log write thread or overall CPU.

    This test is not the same as the test by Thomas Grohser in terms of table structure, client application, hardware, etc. Still, the expectation is that results would not be dramatically different. My tests on this matter are incomplete, and more work is necessary. As soon as I can get access to two and four socket systems, I will try to run this same test (hint to Intel or someone).

    In preliminary tests on the single socket 10-core system, the Insert call rate with rows going to different pages was over 100,000 per sec for both heap and clustered, and around 30,000 per sec with a sequential index, nonclustered or clustered. We might infer that this is a contention issue in which performance degrades with an increasing number of cores, for both single and multi-socket systems. In the case of multiple sockets, there might be more severe degradation?

    This issue might be more complex than a simple matter of the higher latency between sockets, but other people might have better tools to conduct such an investigation.

    Work-arounds

    There are work-arounds for this issue. One is to implement a multi-column clustered index key such that inserts are distributed over many pages. It is also necessary to not have even a nonclustered index on the identity column, which may have implications.
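
    For an existing table clustered on the identity column, the work-around amounts to rebuilding the clustered index on a compound key that leads with a value that spreads the inserts; a hedged sketch with made-up names, assuming a suitable grouping column already exists:

    -- was: a unique clustered index on (EventID) alone, EventID being the identity column
    CREATE UNIQUE CLUSTERED INDEX ci_OrderEvents
        ON dbo.OrderEvents (CustomerID, EventID)
        WITH (DROP_EXISTING = ON);
    -- if the existing index backs a primary key constraint, drop and recreate the constraint instead
    -- and do not add a separate nonclustered index on EventID alone,
    -- or the last-page contention simply moves to that index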

    Another work-around is simply to deploy on a single socket system. This is actually good advice for perhaps a majority of situations. The Xeon E3 with 4 cores is perhaps twenty times more powerful than a 4-socket Pentium II Xeon from 1998. If the quad-core E3 were not enough, more likely in memory or PCI-E capacity than in CPU, then a single socket Xeon E5 with up to 22 cores should definitely be considered before reflexively defaulting to a two-socket system without quantitative substantiation of any performance assumptions.

    One of the problems today is that infrastructure people have bought into the vendor-driven arguments for virtual machines and/or cloud, but apply them pervasively, even to mission-critical systems, while completely disregarding any special requirements. The typical platform is a 2-socket system.

    There are other options too. We could try affinitizing calls that insert into the critical table, see Map TCP IP Ports to NUMA Nodes.
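
    As a sketch of that approach (the port numbers here are made up): in SQL Server Configuration Manager, under the TCP/IP protocol properties, SQL Server can listen on additional ports with a NUMA node affinity mask in brackets after the port number, and the critical application is then pointed at the port bound to a single node.

    TCP Port (IPAll): 1433[0x3],1500[0x1]

    Here 1433 accepts connections on NUMA nodes 0 and 1 (mask 0x3) while 1500 binds to node 0 only (mask 0x1), and the application doing the critical inserts connects with something like Server=tcp:myserver,1500 in its connection string. See the Map TCP IP Ports to NUMA Nodes documentation for the exact syntax.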

    Benchmarks and Inserts

    Before closing this topic, given that it is so important, we might ask whether the benchmarks shed light on this matter. The TPC-C benchmark has the highest insert volume into a single table. However, TPC-C does not have this problem because all tables are organized (clustered index) by warehouse followed by district. This reflects a brick and mortar business, where customers place orders mostly in their own warehouse and district.

    If we were curious anyhow, TPC-C benchmark results are in transactions per minute (tpm-C). The highest SQL Server result is 1.8M tpm-C, corresponding to 30K new orders per second. Technically, the order line table has more rows inserted. Each transaction inserts one row to orders and on average 10 rows to the order line table with a single call, so the Insert statement call volume is the same for both tables.

    No TPC-C benchmark results were published for SQL Server version 2008 or later, even though a handful of results continued to be published for version 2005 after 2008 RTM. The reason for this is that SQL Server 2008 compressed log file writes to better support database mirroring, a very important feature across the broad base of users.

    Log writes were single-threaded until version 2016. This meant that SQL Server write performance could be limited by the ability of a single core running the log writer thread to compress log entries. Presumably there was some survey to suggest that this would be acceptable as there is no option to disable log write compression?

    The TPC-C benchmark has a curious statement when updating the stock table, with an output clause that touches but does not change several fat columns. This part serves no meaningful business purpose, but has the effect of grossly inflating the log write size far beyond what should be necessary.

    Presumably other RDBMSs do not have mandatory log compression. And so the highest TPC-C result on a system with Intel processors is 5M tpm-C (8 sockets, 80 cores total, 160 threads) corresponding to 84000 inserts to orders per second.

    The TPC-E benchmark does have 3 tables that are unique on the (big)integer Trade Id column, and two more tables that have a two column key leading with Trade Id. Is there a magical tuning trick that can evade the insert to a table with a sequentially increasing index problem? Apparently not.

    TPC-E works around this by not using an identity column in the tables where one might think the primary key column should be an identity. In fact, it uses a scalar function and a table to assign trade Id values.

    The TradeOrderFrame4 procedure first calls the function Get_Next_T_ID with an output variable to retrieve a Trade Id. The function uses the table TID_RANGES to store blocks of ID values. Each block, represented by a row in TID_RANGES, is assigned to a specific session id (SPID, accessed with @@SPID). Every time the function is called, a column holding the current Id value for the caller's spid is returned and the row is updated to the next Id value.

    This is probably an acceptable implementation in the TPC-E benchmark, where all connections are running transactions. In a real-world environment, there might be very many open connections, not all of which process transactions, but in a common connection pool. I might suggest using spid modulo the total number of logical processors. Or perhaps spid divided by 2 first, then modulo the number of cores when HT is active.
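
    A rough sketch of how such an ID-range scheme works; this is a simplified stand-in, not the actual TPC-E kit code, and the real implementation also has to handle block boundaries:

    CREATE TABLE dbo.TID_RANGES (
        SpId       int NOT NULL PRIMARY KEY,  -- one row (one block of IDs) per session
        CurrentTID bigint NOT NULL
    );
    GO
    CREATE PROCEDURE dbo.Get_Next_T_ID
        @TradeID bigint OUTPUT
    AS
    SET NOCOUNT ON;
    UPDATE dbo.TID_RANGES
    SET @TradeID = CurrentTID = CurrentTID + 1
    WHERE SpId = @@SPID;

    Because each connection updates only its own row, every insert is not contending for the same last page and row; the cost is the extra update and the need to maintain the assignment of blocks to sessions.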

    All of this is in the details as required by TPC full disclosure rules. But it is buried in the supplemental files among thousands of extraneous files. There is no explanation given as to why the trade Id column, the clustered primary key of three tables, is not an identity, and instead uses a function that requires an update to maintain.

    It is interesting this method was in TPC-E from the beginning. The first TPC-E result was published in 2007 for a 16-socket Unisys ES7000 with Xeon (7140M?) dual-core processors. In that generation, there were probably 4 processors in each NUMA node. The performance was 660 tpsE, meaning 660 inserts per second to each of 3 tables unique on trade Id plus inserts to two more tables. The current 4-socket Xeon E7 v4 with 96 cores total is 9,068 tpsE.

    Was the insert with index on a sequentially increasing column problem known back then? If it occurred at such a low insert rate, then how many people suffered this problem without knowing why? The conclusion to draw is that, yes, benchmarks are important.

    However, what is important is not the result that vendors like to brag about, but rather the many problems that were solved in producing a competitive result. While all the methods are disclosed in the supporting files, there is no accompanying explanation.

    Some methods are simple to apply, like lock pages in memory. Others must be worked into the database architecture, which has deep ramifications if done after it is already in production, or must be part of the client side application, such as the TCP port affinity.

    Some methods might have critically significant impact. Others might be of just minor impact that is desired for benchmarks, but have no important consequence in a production system. I do not suppose anyone would be willing to apply the -x startup parameter in a production environment, but it might be good to know the impact? Good luck in deciphering which details are critically important to a specific situation and which might just be used to get an extra 1% edge over a competitor.

    Addendum 2016-10-14
    The test systems, client-load generator and server, each had an Intel I210 1GbE connected to a common switch. A single thread connection could drive about 4K single row select RPC/sec, and 2K single row insert RPC/sec. A high level of saturation was achieved at 80 concurrent connections. About 12 years ago, I had tested Xeon (Pentium 4) systems with 1GbE connected back to back (no switch), and this could drive 5K single row select RPC/s. I did not test the current 1GbE direct connect network round-trip rate.

    Both of these systems also had an Intel XL710 40GbE with a direct QSFP+ connection. At first this did not seem to work in Windows Server 2016 TP5 with the Intel beta driver. Intel support was of the opinion that this should work but did not say definitively. Then on Tuesday, with the latest Windows TP5 hotfixes, all of a sudden the load generator was connecting via the 40GbE instead of the 1GbE. There were some quirks in the behavior.

    In any case, the 40GbE direct connect could drive 11K select RPC/s, and 6K insert RPC/sec, on a single thread. Only 20 concurrent threads were necessary to achieve a high level of load, at which the 10-core Xeon E5 could support perhaps 40K single row inserts with a sequential clustered index. I am currently in the process of changing over to Windows Server 2016 RTM + hotfixes, which also seems to upgrade the XL710 driver from 1.3.115 to 1.5.59, so more to come.

    Addendum 2016-10-19
    Henrik provided the link, SQL Server Latch Contention, which is why he mentioned that this is latch contention. I would have thought it would be lock contention, but apparently it is latch contention according to MS.

    I only took a quick look; MS identifies the problem with the degree of concurrency, more severe at higher concurrency. Between Thomas Grohser's report and my data, it should be emphasized that there is most probably a strong hardware NUMA versus non-NUMA impact. But it would be preferable if one person ran the same test on 1-socket, 2-socket, and 4-socket systems.

  • SQL Server on Xeon Phi?

    Can the Intel Xeon Phi x200, aka Knights Landing, run SQL Server? It does run Windows Server 2016, so is there anything in SQL Server 2016 that would stop it from installing? Xeon Phi is designed for HPC, so it would not have been tested with SQL Server, but that does not confirm whether it will or will not work. If so, then this could be used to prove out some important theories.

    The main line Intel processors used in the Core i7/5/3 and Xeon product lines, with recent generation codenames Haswell, Broadwell, Sky Lake and soon Kaby Lake, are heavily overbuilt at the core level. The Broadwell processor core is designed to operate at 4GHz if not higher.

    The Haswell 22nm processor was rated for up to 3.5GHz base, 3.9GHz turbo at launch in Q2’13. In Q2’14, a new model was rated 4/4.4GHz base/turbo at 88W TDP. Both are C-0 step, so the higher frequency was achieved by maturity of the manufacturing process or cherry picking? The Broadwell 14nm processor had a top frequency of 3.3/3.8GHz base/turbo 65W for desktop, but perhaps this is because it was more focused at mobile than desktop? (Curiously there is also a Xeon E3 v4 at 3.5/3.8GHz 95W and Iris Pro graphics). The top Sky Lake 14nm processor was 4.0/4.2 GHz base/turbo at 91W.

    With a single core under load, the processor is probably running at the turbo boost frequency. When all four cores are under load, it should be able to maintain the rated base frequency while staying within design thermal specifications, and it might be able to run at a boosted frequency depending on which execution units are active.

    The latest Intel Xeons (regular, not Phi) are the E5 and E7 v4, based on the Broadwell core. There are 3 die versions, LCC, MCC, and HCC, with 10(?), 15, and 24 cores respectively. All of these should be able to operate at the same frequency as the desktop Broadwell or better, considering that the Xeon E5/7 v4 Broadwells came out one year after the desktop processors. But Xeons need to be more conservative in their ratings, so a lower frequency is understandable.

    The top Xeon 4-core model, E5-1630 v4, using the LCC die is 3.7/4GHz at 140W TDP. The top 8-core is 3.4/4.0GHz, E5-1680 v4, also at 140W TDP.

    The top 14-core (MCC die) is 2.2/2.8GHz 105W. The top 24-core (HCC die) is 2.2/3.4GHz 140W. So the Xeon E5 and E7 v4 processors are built using cores designed to operate electrically at over 4GHz, but are constrained by heat dissipation when all cores are active to a much lower value, as low as one-half the design frequency in the high core count parts.

    The transistor density half of Moore’s law is that doubling the number of transistors on the same manufacturing process should enable a 40% increase in general purpose performance. The implication here is that if a particular Intel processor (full-size) core is designed with a transistor budget to operate at 4GHz, then in theory, a processor with one-quarter of that transistor budget should be comparable to the full-size core operated at one-half the design frequency, whatever the actual operating frequency of the quarter size core is. (Doubling the quarter size core to half-size yields a 1.4X gain. Double again to full-size yields another 1.4X for approximately 2X performance going from quarter to full size.)

    So the theory is that it might be possible to have 100 cores of one-quarter the complexity of the Broadwell core on a die of comparable size as Broadwell EX (456mm2), with adjustments for L2/L3 variations, and differences in the memory and PCI elements.

    This is just what Xeon Phi, aka Knights Landing, appears to be. There are 72 cores in 36 tiles, operating at 1.3-1.5 GHz base, 1.7GHz turbo. The Xeon Phi x200 is based on the Silvermont Atom, but at 14nm. A tile is composed of 2 Atom cores, with 4-way simultaneous multi-threading (SMT) and 1M L2 cache shared between the 2 cores. (There is no shared L3? How is cache coherency handled?) The Xeon Phi has 16GB of MCDRAM and 6 memory channels capable of 115GB/s and 384GB max capacity (6x64GB). The MCDRAM can be used in one of three modes: Cache, Flat, or Hybrid.

    There is no mention of the MCDRAM latency, only the phenomenal combined bandwidth of 400-500GB/s. My expectation is that the processor to off-die memory round-trip latency should be lower when the memory is in the same package as the processor, compared to the common arrangement where memory is outside the processor package. This is because it should be possible to use really narrow wires to connect the processor to memory in a common package, so there should be fewer buffering circuits to amplify the signal current? (Can some circuit designer speak to this please?)

    This higher core count, with more threads per core on SMT, is more or less comparable to IBM POWER, SPARC and even AMD Zen. Transactional queries are essentially pointer chasing code: fetch a memory location, use its value to determine the next location to fetch. This should run fine on a simpler core than the 6/8-port superscalar Broadwell, and it has many dead cycles during the round-trip memory access latency, implying SMT will work well (beyond the two threads per core in the main line Intel cores).

    However, this may not be the best general purpose computing solution, in that there are important single threaded tasks, or tasks that are not massively parallelizable, for which the existing powerful Intel core is the best. My thinking is that a mix of a few powerful cores and many smaller cores is the right solution. And that there should be a few smaller cores dedicated to special OS functions (interrupt handling and polling), in a blended asymmetric-symmetric arrangement.

  • Server Sizing Comments: the case for 1-socket in OLTP

    Long ago, almost 20 years, there was legitimate concern on whether servers had sufficient capability to drive transaction processing requirements for large businesses. Then for a long period, perhaps from the late 1990’s to 2012, the better choice for most mid to large businesses was to simply buy a 4-way (socket) system. Larger systems were based on a non-uniform memory access (NUMA) architecture that had unusual characteristics and vendors provided almost no information on how to work around issues that could cripple such systems.

    Towards the end of the second era of server sizing, processors (at the socket level) became ridiculously powerful such that properly tuned systems ran at very low overall CPU utilization. However, 2-socket systems in this period did not have adequate memory or IO capability, unless it was a Xeon 7500 or later E7 series system with 2 sockets populated. And even then, the structure of SQL Server per processor licensing meant that many people opted for the 4-way system regardless of requirements.

    In 2012, Intel finally put powerful processors (Sandy Bridge, up to 8-cores) with larger memory and IO capability into the Xeon E5 line (there were both 2 and 4 socket versions). This coincided with SQL Server 2012 per core licensing. Now there was a stronger technical justification and financial incentive to not automatically opt for a 4-socket system with the most number of cores.

    It would seem reasonable to make some effort to first determine the total number of processor cores that will meet requirements with some headroom. Then determine the number of processor sockets, as more processor sockets means more memory sockets (also considering the difference in DIMM sockets per processor between Intel Xeon E5 and E7 processors). It is unclear whether anyone actually engaged in the technical analysis versus just making the decision strictly on budget goals.

    Today, it is time to consider the astonishing next step, that a single socket system is the best choice for a transaction processing system. First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for a very large majority of requirements.

    We should also factor in that the second generation hyper-threading (two logical processors per physical core) from Nehalem on has almost linear scaling in transactional workloads (heavy on index seeks involving few rows). This is very different from the first generation HT in Willamette and Northwood which was problematic, and the improved first generation in Prescott which was somewhat better in the positive aspects, and had fewer negatives.

    A single socket Intel Xeon E5 processor supports 12 DIMM sockets for 384GB with 32GB DIMMs, or 768GB with 64GB DIMMs which is now at a moderate (50%) premium per GB over the 32GB DIMM. In the old days, DBA’s were taught that memory was precious in reducing disk IO to a tolerable level. That was back when a total system memory of 1GB was considered very large.

    I noticed that after common server memory configuration reached 32GB memory, disk IO could be brought to a very manageable level, well under the IOPS possible with 100-200 15K HDDs.

    Today, all-flash storage is practical. Definitely in direct-attach storage, and maybe in SAN based storage depending on the degree of price gouging from your SAN vendor. So while it might be possible to show that TB-plus memory reduces IOPS from 10,000 to nil, the reality is a modern storage system correctly configured for database applications can easily support 100,000 IOPS if not 1M (which would require NVMe to be practical). In this case, we should find that memory requirements are far lower than the memory capacity of recent generation 2 and 4-socket systems.

    Why not continue to spectacularly over-configure the server system, when even SQL Server per core licensing is only a small portion of the costs in deploying a critical line-of-business application? The big bosses may not question the costs of a 2-socket system or perhaps a 4-socket system, even factoring in that there may be several environments between production, DR, QA and development. But if performance is inadequate, it may cost you your job.

    The answer is performance. Today, transaction processing performance is heavily dependent on serialized round-trip memory access latency. The term is pointer chasing in that a CPU instruction accesses memory to determine the next memory location to access. No amount of L2/L3 cache can hide this, nor is it possible to pre-fetch memory. (What Intel needs to do is increase Hyper-threading beyond two logical processors per physical core. IBM Power is 4 and SPARC is 8?).

    For a processor with integrated memory controllers (AMD from Opteron on, Intel from Nehalem on), local node memory access round-trip latency is probably 50ns. For memory on an adjacent processor, a 1-hop remote node, it is perhaps 100ns. This information is absolutely crucial in understanding system level performance for transactional databases, and yet vendors try to not mention it, instead stressing metrics that are mostly irrelevant.

    In a single socket system, 100% of memory accesses are local node, because there is only 1 node. In a two socket system, the expectation is that memory accesses are split 50-50 between local and remote node. In principle, it is possible to devise some strategy such that a higher percentage is local, but SQL Server does not make it practical to implement such a scheme. In the Intel Xeon E7 4-socket systems, 1 out of 4 memory nodes is local, and there are 3 memory nodes 1-hop away. In an 8-socket system, 4 nodes are 2-hop away.

    An argument in favor of multi-socket systems is that the number of memory channels scales with the number of sockets. However, in transaction processing, it is memory round-trip latency that dominates, not memory bandwidth. (It is a factor that more memory channels means there can be more memory transactions, but it is unclear whether this offsets the increase in memory latency.)

    If someone has reasonably comparable systems with 1, 2 and 4 processor sockets, all of the same processor generation, the impact of the number of sockets can be tested.

    I am also of the opinion that SQL Server cannot make meaningful use of all the processor cores in a top-of-the-line 2-socket (36 cores for Xeon E5 v3), let alone a 4-socket (72 cores), due to most write operations, data and log, being single threaded. But I would like to hear opinions on this. Is this addressed in SQL Server 2016 to a degree?

    The above applies mostly to transaction processing applications.
    It is reasonable to suppose that decision support applications could use both the greater number of processor cores and the greater number of memory channels in a multi-socket system. However, it is unclear whether SQL Server is effective in implementing parallel execution plans when the degree of parallelism is greater than the number of cores in one socket.

    I would think that the strategy of partitioning and alignment of compute to local (system) node memory from (formerly) PDW, applied to a single SQL Server instance, treating each processor socket as though it were a node, which it is, could dramatically improve very high DOP parallel execution plans. But it does not seem that Microsoft is interested in this approach.

    I am presuming that column store is better capable of utilizing multi-socket, as it is definitely not encumbered by serialized round-trip memory latency.

    Addendum 2
    The Dell PowerEdge R730 can be purchased with 1 or 2 of 2 sockets populated. The R930 must have 2 or 4 of 4 sockets populated. It appears that the premium from 32GB to 64GB memory modules might be because the 32GB is RDIMM and the 64GB is LRDIMM. The 2400MT/s 32GB RDIMM is $558, and the 64GB LRDIMM is $1483. For this differential, I would stay single socket, opting for the more expensive 64GB (if necessary) over 2 sockets using the less expensive 32GB, even though I am giving up the 4 extra memory channels.

    If the plan is to implement memory-optimized tables in 2016 (which supports more than 256GB? and has fewer other restrictions) and more than 768GB system memory is needed, then go ahead with 2 sockets. But I would try to get my transaction processing DB under the single socket limit. In the old days, we moved the old data out to the DW system.

  • Join and Query hints – serious DBAs need to learn this

    For a long time, I have refrained from endorsing the use of Join and Query Hints. The reason is that there is no such thing as query hints in SQL Server except for a few simple items. The key options listed under Microsoft SQL Server Hints (Transact-SQL) that affect join order, join type, and index usage are not really hints.

    They are in fact directives such that the specified option is employed if possible. But join order optimization is essentially disabled, and the execution plan joins tables in the order as written, the exception being EXISTS and NOT EXISTS clauses. It is as if the Query Optimizer is hyper-sensitive and its feelings are hurt when someone impudently suggests a hint. So if a hint is specified, good luck to you. You are on your own.
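
    For example, with made-up tables, a single join hint like the one below also pins the join order for the whole query to the written order, so the query must already be written in the order you want:

    SELECT c.CustomerName, o.OrderDate
    FROM dbo.Customer c
    INNER LOOP JOIN dbo.Orders o ON o.CustomerID = c.CustomerID
    WHERE c.Region = N'Northwest';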

    Microsoft documentation is obviously very thorough on this matter with the following:
    “Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend only using hints as a last resort for experienced developers and database administrators.”

    But the fact is that there are complicated queries such that it is difficult to estimate the number of rows, especially past the source table into the intermediate steps. The query optimizer does not produce a good plan in these cases, because the row estimate is grossly erroneous. This is aggravated by the fact that the plan costs employed by the query optimizer reflect a fixed IOPS model based around four 1995-era 7.2K hard disks, with no consideration for the actual capability of the storage system or whether the data is entirely in memory.

    So there are times when it is necessary to override the query optimizer. I will not go into the details of query optimization here. Back around 2001, I reverse engineered the execution plan cost formulas for many of the fundamental operations without bothering to explain the operations or other aspects of query optimization. Since then, Paul White, Benjamin Nevarez, and others have written very detailed papers on query optimization. Roughly, the strategy is a balance of 1) table access efficiency, 2) join efficiency, and 3) reducing rows as quickly as possible.

    Because join order optimization is disabled when a join or query hint is specified, the key is to write the query to produce a specific join order. There is a method of doing so. See Inside Microsoft SQL Server 2008: T-SQL Querying by Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass on the topic of bushy plans. Once the correct join sequence is found, most of the join types might already be good. Perhaps a few join hints might be necessary. Join and Query hints do not cause the query optimizer to disable index optimization, so index hints should not be needed as frequently.
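
    As an illustration of the style (the tables and predicates are made up, not from the book), parenthesized joins let a bushy branch be built up as a unit, and OPTION (FORCE ORDER) then pins the written order without disabling index selection:

    SELECT c.CustomerName, p.ProductName, ol.Quantity
    FROM (dbo.Orders o
          INNER JOIN dbo.Customer c ON c.CustomerID = o.CustomerID)
    INNER JOIN
         (dbo.OrderLine ol
          INNER JOIN dbo.Product p ON p.ProductID = ol.ProductID)
        ON ol.OrderID = o.OrderID
    WHERE c.Region = N'Northwest'
    OPTION (FORCE ORDER);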

    This is a much deeper topic than I can discuss at this point in time. So why am I bringing it up? Today there was a news report on the cause of the Air Asia flight 8501 crash last December. For details, see the CNN Air Asia crash report or Wikipedia.

    In brief, there was an equipment malfunction that put the airplane outside of the normal flight envelope. It should have been possible for the flight crew to recover, but they did not do so.

    It further turns out: “because the manual provided by the plane's manufacturer said the aircraft, an Airbus 320, was designed to prevent it from becoming upset and therefore upset recovery training was unnecessary.”

    In our world, we are told the query optimizer “typically” is best, hence we should not have to do manual query optimization, hence we do not need to learn this. So when we do need to do manual query tuning, we do not know how because we did not learn this.

    This is BULLSHIT! I don’t give a damn how good the query optimizer is “typically.” A serious principal DBA needs to learn how to do query optimization so that he/she can do so when the situation calls for it. The starting point is the bushy join style in the Itzik Ben-Gan book.

    Here is the link to my previous post on bushy joins; notice that at the time I put caveats on it: Bushy Joins
    I am not saying go for broke on join/query hints, just that we need to learn to shape the join sequence. Rob Farley had a great note in the comments. I had seen that query style before, in TPC benchmarks?, without understanding at the time what the writer was smoking.

    Per Claus' comment linking to Dynamic Search Conditions in T-SQL, I especially like the strategy below, with hard-coded SQL when good parameters are specified, then something else to handle the unusual cases, be it OPTION(RECOMPILE) or dynamically built parameterized SQL:

    IF @orderid IS NOT NULL
        SELECT ...
        WHERE O.OrderID = @orderid
          AND -- Conditions on Order Details here.
        -- No OPTION (RECOMPILE) here!
    ELSE
        SELECT ...
        WHERE -- same conditions as before
        OPTION (RECOMPILE) -- or dynamically built parameterized SQL

  • SSD Form Factor and interface

    There is a curious quiet from the enterprise storage community on form factor and interface direction for solid state storage, be it NAND Flash, Intel 3D XPoint or other. On the client side, personal computing, both desktop and mobile, shows a clear direction in favoring both the M.2 form factor and PCI-E as the preferred interface for SSD storage. There is a backward compatibility option in M.2 to work with either a SATA or PCI-E interface, but I do not think this will be widely used. SATA and hard disks will not go away; only the primary SSD will be M.2 form factor, PCI-E interface and NVMe host protocol.

    On the enterprise side, there is a great deal of deployed infrastructure built around the SAS interface (a super-set of SATA), and the small form factor (SFF) for 2.5in HDDs at 15mm height or thickness. The bean counter types would like to wish that SSDs (NAND flash for those who do not like the acronym SSD) would use existing infrastructure, and not just as an interim measure. They are probably still unhappy that Fiber Channel on the back-end had to be abandoned several years ago, being not competitive and a cost burden relative to SAS.

    Preserving the value of investment in existing infrastructure is important because people are unhappy when equipment purchased at painfully high cost becomes obsolete. Of course, enterprise storage is only extremely expensive because storage vendors invented justifications for selling inexpensive components at very high markup. There is also a failure to consider that hardware has an effective depreciation of 30-40% per year due to the pace of progress, which renders the term investment in hardware foolish, or if I were less polite, then completely stupid. So ultimately this is circular logic based on an untenable premise.

    That said, it would be possible to build a viable enterprise storage system around either the PCI-E or SAS interface, because both support multiplexing lanes, and there are switch chips for both PCI-E and SAS interfaces. The reason PCs are moving from SATA to PCI-E is that the NAND interface bandwidth is advancing at a faster pace than any single lane connection can support and SATA does not support multiplexing. (In websites catering to desktop hardware, some say that PCI-E is superior to SATA. This is rubbish by writers with poor technical knowledge. The only important fact is whether the interface supports multiplexing.)

    The reason existing enterprise infrastructure should be abandoned is not because of any deficiency in SAS, but rather that it is built around four lane (x4) uplink and downlink ports. SAS at 12 Gbits/s would only support 4.4GB/s net bandwidth. This might seem to be high because enterprise storage vendors sell crappy systems with pathetic bandwidth capability. The other reason is that most existing infrastructure are either the 24-25 bay SFF in 2U or 15-bay LFF in 3U 19-inch wide rack mount enclosures designed for hard disks. Both the number of bays and physical volume are completely wrong for current generation SSDs going forward.

    My opinion is that the correct uplink and downlink for solid state (be it NAND flash, Intel 3D XPoint or other) storage enclosures (not individual devices) should be 16 lanes wide or x16. Both PCI-E and SAS have adequate bandwidth and protocols. For PCI-E gen 3 at 8Gbit/s per lane, this would support a net bandwidth of 12.8GB/s. The existing x4 SAS is just too low for an SSD (formerly disk) array enclosure.

    The core of this argument is based around the PC standard of a Flash controller with 8 channels on the NAND side, and PCI-E on the uplink side. Today the NAND interface is 333MB/s, so an 8-channel controller could support 2.6GB/s. There may have been some thought that the upstream side should be PCI-E gen 3 with 2 lanes (x2, capable of 1.6GB/s), as it is common to have excess bandwidth capability on the downstream side. But in the PC world, single device benchmark performance is important, so the trend seems to be PCI-E x4 on the controller, with the option to connect only x2 (or even x1?).

    In the time of hard disks, client side PCs used 7200 RPM HDDs or less, for lower cost and higher capacity. Enterprise storage was primarily 10K or 15K RPM for greater IOPS performance, although 7200 RPM was adopted for tier 2 storage. (Storage capacity should have been too cheap to meter even for 10K HDDs, but because vendors sold a ridiculously high prices, this created a demand for 7.2K in enterprise storage systems.)

    In the first phase of SSD adoption, enterprise systems preferred single level cell (SLC) NAND with greater write endurance while client side was mostly 2-bit MLC and later some low cost devices being 3-bit TLC. Today NAND flash technology is sufficiently mature that MLC has sufficient write endurance for many enterprise needs. Fundamentally, the performance oriented PC and enterprise could use the same SSD, just with different over-provisioning and other firmware settings. It would be foolish for enterprise systems not to leverage components developed for client side systems, given the huge volume and low cost structure.

    While the standard desktop SSD element is M.2 form factor with an 8-channel controller and capable of x4 on the upstream side, the enterprise strategy should be to connect x2 on the upstream side. In enterprise, it is the performance of the complete array of storage elements that is important, not the single component. The standard storage array enclosure should probably have 16 bays, each connected x2 to the PCI-E switch, and x16 for each of the uplink port and downlink expansion port. The PCI-E switch would have 64 ports, 16 for uplink, 16 for downlink expansion, and 16 x2 for the M.2 SSDs. The enclosure should work with either 1 or 2 controllers. Existing DAEs have a single (x1) SAS connection to each bay.

    The physical volume for 16 M.2 devices would occupy only one-quarter of a 1U rack. Existing enterprise storage infrastructure is x4 uplink/downlink ports, 2U full rack with 24-25 bays connected x1. This is wrong for SSDs on multiple points. Uplink and downlink ports should be x16. The volume of the enclosure should be shrunk by a factor of 8. Connections to each bay should be x2, but 16 bays connected at x1 is weakly viable. Given that existing infrastructure is unsuitable for SSDs going forward, there is no excuse to not adopt the client-side components with M.2 form factor and PCI-E in a new properly designed infrastructure.

    for some reason I cannot respond to comments

    Good question. Sorry about the long winded answer, but life is complicated. I do agree with the shared storage concept in the HDD days, having a common pool of HDDs so that each host can access the aggregate IOPS capability when needed. This and managing the complex storage system alone would have justified a good profit margin. But storage vendors felt the need to justify extraordinary margins, hence started to invent reasons, which led to doctrine based on the invented justifications. Any time that happens, it is a fuck-up of the first magnitude. And storage vendors do not seem to understand what bandwidth is, or about log write latency.

    Next, blade systems are a non-starter for databases because they give up DIMM slots and PCI-E slots. So we should stick with rack systems with the full boat of DIMM and PCI-E slots. Today a 4TB PCI-E SSD is do-able. What's missing is some way to match PCI-E SSDs to the available PCI-E lanes. System vendors have a mix of PCI-E slots, including several x16. Only workstations and HPC have x16 cards, servers do not. So we want to connect four PCI-E x4 SSDs to x16 slots. HP workstations have some card for this, but we need a server version. I can see a 4-socket server with 64-128 PCI-E lanes dedicated to storage, that is 16-32 x4 PCI-E SSDs, so 64-128TB in one system. All this can be internal: SSDs do not take much space, aren't too heavy and don't consume much power.

    Storage vendors still want to sell horribly expensive AFA storage with features that we don't need, and cannot deliver anywhere close to the bandwidth that is possible. So it really is a fight between the DBA, who wants cheap SSD at high bandwidth, and the SAN admin, who wants to spend a shit load of money, have control over all storage, and make you fill out forms to justify why you need each precious GB of space, all so he/she can deny your request as not sufficient in his/her judgment.

    Edit 2016-Jan-08
    Given that the NVMe controllers seem to be PCI-E x4, perhaps the strategy should be x16 uplink and downlink, with 8 x4 bays. There is still a 2:1 mismatch between downstream and upstream. The point being bandwidth max is reached with 4 devices, but there is space for 4 more. Above, I suggested 16 x2 bays.

  • Computers without ECC memory are crap - no exceptions

    In my previous post, Hardware rant 2015, some readers reacted to my suggestion that vendors start offering the Intel Xeon E3 v5 in laptop and desktop PCs as if this were an outlandish or impractical idea.

    First, doing so requires almost no additional work. Simply substitute 1) the Xeon E3 v5 for the Core i7 gen 6, 2) the server PCH (C236) in place of the desktop PCH (Z170), which are really the same thing, as are the two processors, and 3) ECC memory for non-ECC, which has 8 extra bits over the normal 64 bits. The cost of this might be one hundred dollars, mostly driven by the premium Intel charges, and only slightly from the 12% higher cost of memory. (It is the Xeon E5 line that would be seriously impractical in a laptop that an old person could easily carry. A young fit person might claim to not feel the difference between 4 and 15lbs, or 2 and 6kg.)

    Second, I should explain why ECC memory is so important, far outweighing the extra cost. This is true for user systems, not just servers with 24/7 requirements. As the title states, a PC without ECC protected memory is total crap, no exceptions, unless what you do on the PC is totally worthless, which could be the case for a significant market segment.

    Basically, without any check on memory integrity, we may have no idea when and where a soft error has occurred. Perhaps the only hint is the OS or application crashing for no traceable reason, or serious data corruption that has already occurred. Let it be clear that soft errors do occur unless you are deep underground.

    Up until the early 1990’s, many if not most PC’s sold as desktops and laptops had parity protected memory. Then in the time frame of Windows 3.x, (almost?) all PC vendors switched to memory with no data integrity protection for their entire lineup of desktop and mobile PCs (with perhaps the exception of dual-processor systems based on Pentium Pro and later, that were subsequently classified as workstations). This was done to reduce cost, eliminating 1/9th of the memory for parity.

    All server systems retained parity, and later switched to ECC memory, even though entry level servers use the same processor as desktops (either with the same product name, or different). The implementation of memory protection is done in the memory controller, which was in the north-bridge in the past, and more recently is integrated into the processor itself (starting with Opteron on the AMD side, and Nehalem on the Intel side).

    I recall that the pathetic (but valid?) excuse given to justify abandoning parity memory protection was that DOS and Windows were so unreliable as to be responsible for more system crashes than an unprotected memory system. However, since 2003 or so, new PCs were sold with the operating system shifted to the Windows NT code base, imaginatively called Windows XP.

    (In theory) Windows NT is supposed to be a hugely more reliable operating system than Windows 3.1/95, depending on the actual third-party kernel mode drivers used. (Let's not sidetrack on this item, and pretend what I just said is really true.) By this time, the cost of sufficient DRAM, unprotected or ECC, was no longer as serious a matter, even though the base memory configuration had grown from 4MB for Windows 3.1 to 512MB if not 1GB for Windows XP or later. And yet, there was not a peep from PC system vendors on restoring memory protection, with ECC now being standard. (I did hear IBM engineers* propose this, but nothing from PC vendors without real engineers. We don't need to discuss what the gutless wonders in product management thought.)

    Presumably soft errors are now the most common source of faults in systems from Windows NT/XP on. Apple Mac OS (from some version?) and Linux are also protected mode operating systems, so this is pretty much the vast majority of systems in use today. It is possible that bugs remain in third-party drivers that have not been tested under the vast range of possible system configurations (more so for performance oriented graphics drivers?). Still, the fact that vendors do not regard correcting the most serious source of errors in PCs today as worthwhile is an indication that they consider the work we do on PCs to be worthless crap, which is the same regard we should have for their products.

    Let me stress again that putting out PCs with ECC memory does not require any technical innovation. ECC capability has been in entry server systems built from identical or comparable components all along. By this time, Intel memory controllers had ECC capability which could be (factory) enabled or disabled depending on the targeted market segment. (Intel does have dumbed-down chipsets for the low-end PCs, but it is unclear if ECC was actually removed from the silicon.)

    Additional notes:
    A. The Wikipedia article ECC memory cites references that mention actual soft-error rates. There is a wide range of values cited, so I suggest not getting hung up on the exact rate, and treating this as order-of-magnitude(s). There is a separate entry, soft errors, for anyone interested in the underlying physics. Of course there are other Wiki entries on the implementation of ECC.

    Briefly, the prevalent source of soft errors today originates with cosmic rays striking the upper atmosphere, creating a shower of secondary particles, of which neutrons can reach down to the habitable areas of Earth. Unless the environment is a cave deep underground, there will be soft errors caused by background radiation. The probability of errors also depends on the surface area of memory silicon, so a system with a single DIMM will experience fewer soft errors than a system with many DIMMs.

    B. Early memory modules were organized as 8 bits of data plus 1 bit for parity in a 30-pin x9 SIMM. Sometime in the early 1990's, around the 80486 to Pentium time, 72-pin x36 SIMMs (32 bits data, 4 bits parity) were popular. The implementation was 1 parity bit protecting 8 bits of data for both the x9 and x36 modules. Parity protected memory had the ability to detect, but not correct, single bit errors in an 8 bit "line".

    A few high-end servers in this era had ECC memory, which may have been implemented with 2 x36 memory modules forming a 64 bit line with 8 bits for parity, or perhaps a custom memory module? Later on, memory modules progressed to DIMMs, having 64 bits of data with allowance for 8 additional bits for ECC. The base implementation of ECC is to have a 72-bit line with 64 bits for data and 8 bits for ECC. This allows the ability to detect and correct single-bit errors and detect but not correct 2-bit errors (SECDED). More than 2 bits in error could potentially constitute an undetected error (dependent on the actual ECC implementation). There are also other ECC strategies, such as grouping 4 x72 DIMMs into a line, allowing the ability to detect and correct the failure of an entire x4 (or x8?) DRAM chip, when each DIMM is comprised of 18 x4 chips, each chip providing 4 bits of data.

    C. At the hardware level, if an error is detected and corrected, the operating system and applications continue to function. The event can be logged at the system level. For a detected but uncorrected error, the hardware should cause a blue screen OS crash.

    An undetected error is just that: it is undetected. The system continues running with incorrect memory content. Depending on the nature of the memory corruption, anything can happen. It could be executable code, in which case the instruction changes. It could be critical operating system data, causing a subsequent memory access to read or write to the wrong location, which could have serious corruption consequences. It could also be end data, whether number or char or control, which may or may not be critical.

    * It is probably more correct to say that soft errors are the province of scientists/physicists, not engineers. Sun had perfectly good engineers, but in the late 1990's, they had an UltraSPARC II processor with 4M L2 cache in their high-end enterprise system. I believe the L2 data had ECC - SECDED, but the tags were only parity protected - SED. Some of the systems started to experience mysterious failures (the ones located at high altitude?). This was ultimately traced to soft errors. It was not a simple thing to change the L2 cache tags from parity to ECC (logic in the processor itself?), so the temporary solution was to mirror the memory used for tags? (If someone knows the details, please step forward.)

    Edit 2015-11-10
    The Wikipedia topic ECC Memory states "ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing."
    It is more correct to say ECC is used when it is necessary to correct the more common (typically single bit) errors, and to detect certain errors involving more than 1 bit, which cannot be corrected. However, it is possible that some multi-bit errors cannot even be detected.

    Edit 2015-12-10
    Donaldvc pointed to this new article on IEEE Spectrum drams-damning-defects-and-how-they-cripple-computers
    Much of my knowledge is very old, from back in the days when memory chips were 1-4 bits wide. Back then, a soft error might affect many memory cells, but it would only be one bit in any given word. Then as memory became more dense, a soft error could affect multiple bits in a word? So processors did ECC on a bank of 4 DIMMs = 256 bits of data, 288 bits of memory, which allowed more sophisticated algorithms. I am not sure what the Xeon E3 or E5 has. The Xeon E7 is supposed to be very sophisticated. If someone has free time, please look into this.

  • Hardware rant 2015

    It has been a while so I suppose it is time for another rant on hardware. There are two systems I would like:
    One is a laptop.
    The second is a server capable of demonstrating extreme IO performance, with the secondary objective of being small enough to bring along to customer sites.

    On the laptop I am looking for
    1) quad-core with HT, i.e. 8 logical processors for better parallel execution plans.
    2) PCIe SSD, I would prefer 3GB/s+, so PCIe gen3 x4, or 2 x M.2 PCIe x2 is also an option.
    3) the ability to drive 1, but preferably 2, external 4K monitors (so I can look at complex execution plans)

    On this matter, it is time to bitch at the MS SQL Server team that there should be an option to contract the white space in execution plans. The existing zoom capability is worthless.
    Yes I know SQL Sentry Plan Explorer can do this, but really MS, is it so hard? Or have you outsourced the entire SSMS to some team that does not know that there is such a thing as a complex query?
    The reason I want to drive 2 external 4K displays is that at the 4K resolution, I need more than a 28 in monitor to use the resolution.

    A couple of days ago, Dell announced the new XPS 15 with Core i7-6700 processors (Sky Lake), which I immediately ordered, but unfortunately it shows a shipping date of Nov 16.

    It does have a 4K display, and 1 external port which may or may not support 4K. I thought I ordered the docking station, but I do not know if this would support dual external 4K monitors.
    I currently have the Dell 28in 4K monitor, which is great for looking at execution plans, but at the normal text size setting it is difficult to read.
    I am thinking that the much more expensive Dell 32in 4K monitor will be better, but maybe not enough. Should I get a 55in 4K TV instead? These all have just the HDMI connector, so I need to make sure there are proper 4K adapters.

    The new XPS 15 data sheet says it has an HDD bay (SATA interface) and one M.2 bay (uncertain if PCIe x2 or x4). It would have been nice if 2 M.2 x2 bays were available instead of the HDD bay. I ordered the XPS 15 with the PCIe SSD. I do not know if it is a good one (the Samsung SM951 is cited at 2150MB/s); if not, I will throw the Dell SSD out and get a good one.

    One more thing, ECC memory
    Intel desktop and mobile processors all lack ECC (or parity) memory capability. ECC support has been built into Intel processors for some time now; it is just disabled in the Core product lines, and enabled only in the server Xeon line.
    So the Sky Lake equivalent is the Xeon E3 v5. Intel released the v5 under the mobile group, with a 45W rating.
    Unfortunately I cannot find a laptop for sale that uses the Xeon E3 v5.

    Perhaps Dell or someone could offer a Xeon E3 mobile system?

    Extreme IO Performance demonstrator
    First, why do I need such a thing?
    when my clients have multi-million dollar SAN storage systems?
    Because SAN people are complete idiots on the matter of IO performance, being locked into irrelevant matters (to enterprise DB) like thin provisioning etc.
    Invariably, the SAN people (vendor sales engineer, the SAN admin, etc.) confuse the fact that Fiber Channel is specified in Gigabits/sec (Gb/s) while all other IO bandwidth is specified in GigaBytes/sec (GB/s).
    So we have a multi-million dollar storage system (full of add-on software that have no purpose in an enterprise DB) connected to a powerful server (60+ cores and paying for SQL Server 2012 EE per core licensing) over 2 x 8Gbit/s FC links.
    Is this stupid or is this exceptionally stupid?

    Yes I know it is extremely rude of me to call other people stupid, and that being stupid is not crime, but when you are the vendor for multi-million dollar equipment, there is a reasonable expectation that you are not stupid.

    So onto the system.
    For this, I am sure I need more than 4 cores, so it needs to be the Xeon E5. Perhaps 8 cores (single socket) is sufficient.
    The new Intel SSD DC P3608 has great specs, but I am not sure when it is actually available?
    I would put 2 of these in the system to demonstrate 10GB/s. Ideally this would all go into a box that fits in carry-on luggage, which is unfortunately not one of the standard PC or server form factors.

    Another option is a 2 x 12 core system to demonstrate 20GB/s on 4 x P3608.

    I would prefer to get a laptop without high performance graphics, the NVidia GTX 960M in this case.
    The current Intel graphics is sufficient for high resolution rendering, but I do not need high frame rate. All the Intel Core i7 6th gen processors have graphics, I wonder if I can remove the GTX (for power savings)?
    Apparently Dell will have a new docking station, the Thunderbolt Dock TB15 next year, that will support 2 x 4K monitors?

    I did already rant on PC laptops only being available with 16x9 displays?
    How stupid is this? It is one thing for consumer laptops to have a 16x9 display, on the assumption that the home users just watch movies.
    but on what justification does this apply to business and workstation laptops?

    Concurrent with the Intel Skylake Xeon E3 v5 announcement, Supermicro announced motherboards for the E3 v5.
    There is a micro-ATX (X11SAE-M), but with just 1 x16 and 1 x4 PCIe gen3 slot,
    whereas the ATX (X11SAT) has 3 slots with x16/x8/x8 as an option. This would let me put in 2 P3608s? for 10GB/s?

  • GHz and MHz-BIOS updates and Amdahl Revisited

    Last week, a routine performance test ran about twice as long as expected. A check of dm_exec_query_stats showed that CPU bound statements (worker time roughly equal to elapsed time) were approximately 3 times higher than previous tests for matching SQL statements. Almost all of the SQL involved single or few row index seeks, usually Select, some Insert and Update. The server system is a 2-socket Xeon E5-2680 (8 cores, 2.7GHz nominal, Sandy Bridge) in a managed data center. The data center had sent out notice that there would be system restarts the previous weekend, which could mean either OS patches or BIOS/UEFI updates. So naturally the next thing to do is check the Processor Information object for the Processor Frequency and % of Maximum Frequency counters (or equivalent). This showed 135, as in 135MHz, and 5% of maximum. Another system of the same model also rebooted showed 1188 and 44%.
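    For reference, a minimal sketch of the kind of check involved (not the exact query used; the TOP count and ordering are arbitrary), pulling per-statement CPU versus elapsed time from sys.dm_exec_query_stats:

    SELECT TOP (20) qs.execution_count
      , qs.total_worker_time  / qs.execution_count AS avg_worker_time_us  -- CPU per execution, microseconds
      , qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time_us -- duration per execution, microseconds
      , SUBSTRING(st.text, qs.statement_start_offset/2 + 1
         , (CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset END - qs.statement_start_offset)/2 + 1) AS stmt_text
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
    ORDER BY qs.total_worker_time DESC

    Statements where the average worker time is roughly equal to the average elapsed time are the CPU bound ones of interest here.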

    This issue has occurred previously in this environment and in other HP systems that I am aware of. The BIOS (or UEFI) update puts the system into one of the energy efficient configurations. It could also be an operating system setting, but most that I have seen are BIOS settings? One can imagine a firmware engineer being so committed to green activism that this was made the default on BIOS updates without discussion with other parties. Perhaps there is a facility (in Houston?) with inadequate air conditioning for the number of systems, and this setting was put in to prevent the lab from overheating. Then no one remembered to exclude the step in the production code? Not that I have ever done such a thing (and no further questions on this should be asked).

    Another question might be why the data center monitoring team did not check for this, as it has happened before. The whole argument for going to managed data center instead of a simple hosted data center was that the managed data center could provide the broad range of expertise that is not economical for a mid-size IT department. Obviously this managed data center did not monitor for the performance/power configuration.

    This matter is of serious concern to production DBAs and IT staff in handling operations. As the Processor Information performance object with extended information was only introduced in Windows Server 2008 R2, many software monitoring tools may not alert on changes of Processor Frequency, especially after reboot. Imagine the IT staff or DBA encountering this for the first time on the production servers, with users complaining, your boss watching over your shoulder, and his/her boss hovering over your boss, offering their helpful insights in the non-judgemental manner as bosses do.

    Performance Insight

    However, I am more interested in a different aspect of this incident. When there are two sets of data, one for the processors cores at 2.7GHz and another at presumably 135MHz, we can extrapolate parameters of interest. Does it seem stunning that the drop from 2.7GHz to 135MHz, a factor of 20, only decreases CPU efficiency (increase CPU-sec, or worker time) by 3X? Perhaps, but this actually should have been expected.

    The salient aspect of modern computer system architecture is the difference between CPU clock cycle and memory access time. A young person might not know this, but old timers would. Up to about 20 years ago, the primary memory performance specification was access time, with 80, 70 and 60 ns being common in fast page mode and extended data out. Then with the switch to synchronous DRAM (SDRAM), the key specification changed to data rate. In the Xeon E5 (v1) generation, DDR3 at 1333MHz was common. This means a memory channel can deliver one line every 0.75ns, or 1.333 billion times per sec, with a line being 64 bits (excluding ECC bits). The Xeon E5 26xx series has four DDR3 channels. (The Intel processor internally is shown as having 2 memory controllers, each controller driving 2 DDR channels, so channel can have different meanings depending on the context.)

    What is less commonly cited is the round trip latency, from a processor issuing a request to memory, the internal memory access within the DRAM chip, and finally the transmit time back to the processor. (The L1, L2 and L3 cache sequence is also involved in memory access timing.) For local memory (attached directly to the processor) this is around 50ns. For memory on an adjacent processor, the round trip time might be 95ns or so.

    On a 2.7GHz processor, the CPU cycle time is 0.37 ns, so 50ns for local memory round trip access is 135 CPU cycles. This particular system has 2 sockets, so one might expect that half of memory accesses are local at 50ns round-trip latency, and half at 95ns latency.

    This is a well understood issue. Two methods of addressing the disparity between CPU cycle time and memory access are 1) large cache on the processor, and 2) pre-fetching memory. Current Intel processors have dedicated 32KB I+D L1 and 256K L2, both per core, and an additional shared L3 cache sized at 2.5MB per core. From the Pentium 4 on, the processor pre-fetches 64 bytes (the cache line size), with an option to prefetch the adjacent cache line. Prefetching is exposed in the instruction set architecture (can someone provide a reference please), and there should also be a BIOS/UEFI setting for hardware prefetch.

    Simple Model

    Now let's visualize the (simplified) code sequence in a relational database engine with traditional page-row data structures. There is a memory access for the index root level page. Read the page to find the pointer to the second level page. Memory access, and repeat. It is a sequence of serialized memory accesses with poor locality (so cache can only help so much), and the next location is not known until the current memory request is completed, so pre-fetching is not possible.

    Modern processor performance characteristics are very complicated, but we will attempt to build a very simple model focusing on the impact of round-trip memory access latency. Start with an imaginary processor with a cycle time equal to the full round-trip memory access time. In this scenario, one instruction completes every cycle, be it an arithmetic or logic or memory access instruction.

    Such a system may have never existed, so now consider a system where the round trip memory access latency is some multiple of the CPU cycle time. The average time to complete an instruction is then as below, where time is in units of the memory access latency (50ns, or 20MHz, for the local node), "a" is the fraction of instructions that involve a (non-local, non-prefetch-able) memory access, and "n" is the processor frequency.

      (1-a)/n + a

    The term (1-a) is the fraction of instructions that are either not memory access, or memory access to cache (from previous access or pre-fetched). “1/n” is the processor cycle time (in units where memory access time is 1).

    Performance (instructions per unit time), the inverse of average time per instruction is:

      P = 1 / ( (1-a)/n + a )

         = n / (1 +(n-1)*a )

    We can see that the above equation has the characteristic that as processor frequency increases, the upper bound on performance is:

      n -> infinity, P = 1/a

    Also, if the fraction of instructions that require memory access, "a," is zero, then P = n.

    Does the above look familiar? It is just Amdahl's Law, which was formulated in the old days to demonstrate the limits of vectorization in supercomputers. I have just recast it to express the limits of increasing processor frequency relative to round-trip memory access time.

    If someone would like to check my math, please do so. It has been a long time. Try tricking your teenage son/daughter into doing this as a practical math exercise?

    OK, anybody still reading is obviously not deterred by math, or knows the trick of skipping such things. What am I really after? In the above equation, what is known is the processor frequency relative to memory access latency. While we know the performance or worker time of certain queries, we do not know it in terms of instructions per CPU-cycle. And the second item we do not know is the fraction of instructions that incur a round-trip memory access latency that cannot be hidden with cache or pre-fetching.

    But, we have data points at 2 frequencies, 2.7GHz and reportedly 135MHz. Express the relative performance between the two points as a ratio.

      P2/P1 = R

    Then from the two equations

      P1 = 1 / ( (1-a)/n1 + a )

      P2 = 1 / ( (1-a)/n2 + a )

    we can solve for a in terms of the known values n1, n2 and R.

      a = (n2 – n1*R) / ( n1*n2*(R-1) + n2-n1*R )

    Assuming a memory access latency of 50ns, the base frequency of 20MHz corresponds to memory access in 1 cycle. Plugging in the values n1 = 135MHz / 20MHz = 6.75, n2 = 2700/20 = 135 and R = 3, we get a = 0.059. That is, 5.9% of instructions incurring a non-cached, non-prefetched round-trip memory access would result in a 3:1 performance ratio between 135MHz and 2700MHz. (Perhaps it would be more correct to estimate the round-trip memory access latency as the average between the local and 1-hop remote node, at 75ns?)
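    For anyone who would rather not do the arithmetic by hand, a minimal T-SQL sketch of the same calculation, plugging in the assumed 50ns latency (20MHz base), the two frequencies, and the observed ratio R = 3:

    -- a = (n2 - n1*R) / ( n1*n2*(R-1) + n2 - n1*R ), frequencies in units of the 20MHz base
    DECLARE @n1 float = 135.0/20     -- 135MHz  -> 6.75
          , @n2 float = 2700.0/20    -- 2700MHz -> 135
          , @R  float = 3            -- observed ratio of worker times
    SELECT (@n2 - @n1*@R) / ( @n1*@n2*(@R-1) + @n2 - @n1*@R ) AS a   -- approximately 0.059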

    So while it might seem astonishing that the difference between 135MHz and 2700MHz translates to only 3X performance, the database transaction processing workload is an extreme (but important) case. There are many workloads which exhibit better locality or have memory access patterns that are amenable to prefetch and have performance scaling better with frequency.


    Earlier, two methods of hiding round-trip memory access latency were mentioned. There is another: Hyper-Threading. The processor core appears as two (or more) logical processors to the operating system. Presumably, there is an extra set of program counters, and resources to determine which physical registers (different from the registers specified in the instruction set architecture) are assigned to each logical processor.

    In the earlier example, say that the round-trip memory access time is 135 CPU-cycles and the fraction of instructions that incurs the full round-trip latency is 6%. Then for 100 instructions, 94 are executed in 1 cycle each (excluding consideration for superscalar), as either not involving memory access or having data already in cache, and the 6 that remain incur the round-trip memory latency of 135 cycles. Then the total time in terms of CPU-cycles is 94*1 + 6*135 = 904. In other words, only 100 cycles out of 904 are used; the rest are no-ops.
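    The same toy arithmetic, as a sketch (the simple model only, ignoring superscalar and everything else), also shows how much idle time a second logical processor has to work with:

    -- per 100 instructions: 94 single-cycle instructions plus 6 full memory round trips of 135 cycles
    DECLARE @busy int  = 94 + 6           -- cycles in which the core is actually executing
          , @total int = 94*1 + 6*135     -- total cycles including memory stalls = 904
    SELECT @total AS total_cycles, @busy AS busy_cycles
      , CONVERT(decimal(5,1), 100.0*@busy/@total) AS pct_cycles_used   -- about 11%; the rest is stall

    In this idealized picture, a second (or even fourth) logical processor needing its own 100 busy cycles could be overlapped into the stall cycles without the core itself becoming the bottleneck.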

    The Intel Xeon processors from Nehalem on implement Hyper-Threading with 2 logical processors on each physical core. (This can be disabled in BIOS/UEFI. Some models have HT disabled. The earlier Intel Pentium 4 based processors implemented a more complex form of HT.)

    In considering the nature of the database transaction processing workload, in which one memory access determines the next memory access, it is perhaps time for Intel to increase the degree of HT, especially considering that the server oriented Xeon E5 and E7 models are already 1 full year or more behind the smaller desktop/mobile processor variants. I seem to recall IBM POWER as having 4 logical processors per physical core, and one of the SPARC processor lines as having 8. It would also be necessary to have a good strategy for using HT based on workload. The option to enable or disable HT in the BIOS/UEFI is not what I mean. HT should be visible to the OS. But the application itself should detect the presence and degree of HT, and make its own decision on whether HT should be used and how it should be used.

    Xeon Phi, Many Integrated Core

    Another item worth mentioning here is the Intel many integrated core (MIC) architecture, codename Knights something, now Xeon Phi. The processor puts many smaller cores on one die: 61 in the 22nm Knights Corner, versus 12-18 in the 22nm mainline Xeon processors. The theory behind many smaller cores stems from one of the two main elements of Moore's Law: doubling the number of logic transistors/complexity in a single core should translate to about a 40% performance gain.

    (This was the case up to several generations ago. Since then, Intel no longer tries to double the logic from one process to the next. There might be 10-20% performance gain in general instructions. Some effort is given to expanding the functionality of the special/vector instructions. And most effort has been in increasing the number of cores.)

    One manifestation of this (more logic transistors) could be increased frequency (which Intel stopped pursuing years ago). Another might be more execution ports (8 in Haswell) or other areas that improve instructions per cycle (IPC). Following the rule of 2X transistors per 1.4X (square root of 2) performance backwards, the expectation is that a processor 1/4th the size would have 1/2 the performance. But potentially there could be 4X as many cores, depending on interconnect and power limitations. So in workloads that are amenable to vectorization, or can otherwise be parallelized, more smaller cores could be the better strategy.

    The Xeon Phi is targeted to HPC workloads, as reflected in the 512-bit SIMD instructions. If we were thinking about a transaction processing database engine on the MIC architecture, we would probably consider a very basic ALU without SIMD, (not sure on FP). I am thinking that an asymmetric processor architecture might be the objective. Perhaps two powerful cores from the current main line, and many simpler cores (without SIMD) perhaps even simpler than Atom? (The Intel Xeon Phi line implements Hyper-Threading with 4 logical processors per physical core.)


    As said earlier, the nature of database page storage along rows makes serialized memory access (also called pointer chasing code?) its hallmark. This is why there is interest in column storage architecture. Now, all of a sudden, for certain database workloads, the next memory access is 4 bytes over, already in cache. The work a little further down touches memory in the next 64 byte line or two away. Both the software and the hardware know this, and either is capable of issuing a pre-fetch. It does not matter that columnstore must touch more data. The processor can stream huge volumes of data much more effectively than pointer chasing only the necessary rows.


    I should probably say something here.


    As I said earlier, modern microprocessors are very complex. Pipelined execution was introduced (in Intel processors) with the 80486 (1989) and superscalar execution with the Pentium (1993). Pipelined means that while the processor can complete an instruction in each cycle, the actual start to finish time of a specific instruction occurs over several cycles. Intel does not talk about pipeline stages any more, but there are occasional references to Core2 and later processors having a 14+ stage pipeline. (Wikipedia says Core2 is a 12-14 stage pipeline, Nehalem and later 20-24?, and Sandy Bridge 14-19.)

    Superscalar means that there is more than one execution unit, with the goal of completing more than one instruction per cycle. Haswell has 8 execution ports. The processor generations several prior were 6-port superscalar. We could apply the Amdahl's Law principle of scaling performance to any and all of pipelining, superscalar, and round-trip memory latency, and probably other things too.

    Rethinking Computer System Architecture

    I have said this elsewhere. It is long past due to do a clean sheet system architecture with a matching change to OS architecture. Current system architecture stems from the 1970's, from processors with physical memory (8MB was big) and a page file on disk. Why do we still have a page file on disk? In the old days, there was not enough physical memory, so it was tolerable to have a page file on disk to support a larger virtual address space.

    Today, more than 1TB of physical memory is possible and affordable (compared to the cost of SQL Server per core licensing). But the key factor is in how memory is used. Back then, it was mostly for executable code and internal data structures. The assumption was that very few database data pages would actually be in memory at any given point in time. Today, a very large percentage of memory is used for caching data pages. Of the memory used for executable code and internal data structures, most is junk.

    The CPU-cycle time to memory access time discrepancy dictates that the more urgent strategy is to get memory closer to the processor even if it means drastically reducing the size of true memory, to perhaps a few GB per socket. Given that DRAM is so cheap, we would still have system with multi-TB DRAM capacity, except that this would now be the page file. Of course the operating system (and applications) would have to be designed around this new architecture. Given how well the Intel Itanium software coordination went, I guess this might be too much to expect.

  • Transaction IO Performance on Violin

    Back in Feb, I went on a diatribe-rant against the standard SAN vendor configuration practice. The Problem with Standard SAN Configuration IO Performance, article and accompanying post, showed IO performance metrics for a batch driven transaction processing workload on a SAN managed by a data center/cloud company. The only option offered by the service provider was to request volumes for storage. There was no consideration for the special IO characteristics of transaction processing or other database workloads. No discussion. This practice is doctrine pontificated by SAN vendors, who call SAN admins to a mission of implementing the "storage as a service" concept while remaining completely blind to the requirements of mission critical databases.

    Ok, I am venting again. Now I have performance metrics for the same workload, except that storage is on a Violin system. The system is different in having 24 physical cores, no HT, and 256GB memory versus previous system with 16 physical cores, HT (32 logical) and 384GB.

    Below are the IO characteristics. The horizontal axis time scale is 5 min per major division for 1 hour across the entire chart. Each tick is 1 minute. Data points are every 15 sec. Note that the Feb (HDD) charts were 1 min per major division, 15 min total, with data point every 5 sec.


    [Chart: Transactions/sec (red)]

    [Chart: IOPS - read (green), write (blue), log write (red)]

    [Chart: IO latency, ms/Rd or Wr]

    The obvious difference between the IO characteristics on Violin and the previous HDD-based storage is that checkpoints now have almost no visible impact on performance. Both CPU and transactions/sec are very steady, with slightly noticeable blips, versus the severe drops before. It is evident that checkpoint writes now have almost no impact on data reads or log write IOPS. The same is true of IO latency, in milli-seconds per read or write.

    If the storage had been on HDD storage but with logs on separate physical disks, we would expect that the checkpoint would drive up data read latency, and hence depress data read performance, but it would have no impact on log write latency, and hence no impact on log write performance. The lower data read performance should only moderately depress transaction performance, not severely.

    The difference in system processor, 24 physical cores no-HT versus 16 cores plus HT, is not a factor in the IO characteristics. The difference in physical memory, 256 GB on the system with Violin storage and 384 GB on the system with HDD storage, is evident in the data read IOPS: starting at 7-8K IOPS then drifting down to 2-3K IOPS on the system with less memory, compared with mostly 1K IOPS on the system with more memory. Both storage systems can easily handle 20K IOPS.

    The main argument here is not that SSD/Flash storage is a requirement for transaction processing databases, even though there are clear benefits. (NAND flash based SSDs have both the maturity and the cost structure to be very attractive for any new storage system purchase.) The point is that there is a severe problem with the SAN vendor doctrine of one common pool for all volumes.

    This very severe problem can mostly and easily be mitigated simply by having separate physical disks for the log volume. So the point could and should be demonstrated by showing the IO performance on an HDD SAN with separate physical disks for logs. But this violates the SAN doctrine of ignoring user requirements, and would not be considered or allowed by the service provider under any circumstance. So the only real solution is to keep performance critical databases off a storage system administered by a team on a different mission than supporting the database.

    Below are excerpts and the graphs from the Feb article.

    Standard SAN Configuration IO Performance 2015-02

    The chart below is CPU. The horizontal axis is time. One major division marked by the vertical line is 1 minute, and the small tick is 12 sec. The data points are 5 sec. There are 12 steps between each major division. The vertical axis is overall (system) CPU utilization in percent. Each of the stacked green lines represents an individual processor. There are 16 physical cores and 32 logical. A single logical core at 100% utilization would show a separation of 3.125% to the line below.


    On the second chart, the red line is the performance monitor object: SQL Server:Databases, counter: Transactions/sec. Note that the vertical axis is log-scale, base 10. One major division is a factor of 10. Each minor tick is an integer. The first small tick up from a major tick is 2, the next is 3 and so on to 9 for the last small tick.

    [Chart: Transactions/sec over time, log scale]

    The third chart is IOPS. Green is data reads, blue is data writes, and red is log writes. The vertical axis is log scale.


    The fourth chart is IO latency, milli-sec per IO. The same color codes apply as for IOPS. Again the vertical axis is log scale.

    [Chart: IO latency, ms/Rd or Wr, log scale]

  • SAN Configuration Performance Problems

    The charts provided here illustrate my complaints against SAN vendor doctrine, obstinately adhering to the concept of one large pool of disks from which all volumes are created for any purpose (data, log, and junk non-DB stuff). There is no consideration for the radically different characteristics of hard disks in random versus sequential IO (low vs. high queue depth IO behavior should also be an element of IO strategy). The architecture of all traditional relational database engines is built on the premise that high volume log writes are possible at very low latency (using dedicated disks) in order to provide durability of transactions. And yet SAN vendors blithely disregard this (because it is at odds with the doctrine derived from invented principles) to justify their mission to sell inexpensive commodity hardware components at extraordinary prices.

    Transaction Performance Data

    The chart below is CPU. The horizontal axis is time. One major division marked by the vertical line is 1 minute, and the small tick is 12 sec. The data points are 5 sec. There are 12 steps between each major division. The vertical axis is overall (system) CPU utilization in percent. Each of the stacked green lines represents an individual processor. There are 16 physical cores and 32 logical. A single logical core at 100% utilization would show a separation of 3.125% to the line below.


    On the second chart, the red line is the performance monitor object: SQL Server:Databases, counter: Transactions/sec. Note that the vertical axis is log-scale, base 10. One major division is a factor of 10. Each minor tick is an integer. The first small tick up from a major tick is 2, the next is 3 and so on to 9 for the last small tick.

    [Chart: Transactions/sec over time, log scale]

    The third chart is IOPS. Green is data reads, blue is data writes, and red is log writes. The vertical axis is log scale.


    The fourth chart is IO latency, milli-sec per IO. The same color codes apply as for IOPS. Again the vertical axis is log scale.

    [Chart: IO latency, ms/Rd or Wr, log scale]

    The SQL

    The SQL activity is batch driven transaction processing. There are 14 or so threads concurrently looping through a set of items to be processed. Each item involves about 20 rows of insert or update activity, hence 1000 log writes per sec corresponds to approximately 20,000 transaction/sec on the SQL counter.

    Most of the active data is in memory. There are probably 30-40 SELECT rows for each transaction, or twice as many reads as writes. The data read IO necessary to support the 20,000 inserts and updates/sec is about 2,000/sec, which the storage system is capable of supporting at about 4ms latency. This is because the data resides on a small part of each disk; the actual latency for random IO is less than the expected value of 7.5 ms for data randomly accessed over an entire (10K) disk at queue depth 1.

    For approximately 20 seconds out of each minute, the transaction rate drops from the peak value of 20,000 all the way down to about 8,000 per sec (noting the log scale). These are the check points when the data write IO surges to 20-50K IOPS, (which demonstrates that the storage system is actually pretty decent) though write latency is driven up to 50-90ms.

    The checkpoint surge also pushes log write latency up from 1ms to 20-30ms. This dataset occurred during the day, when presumably there was activity for other hosts on different volumes but on the same SAN. At night, log write latency away from checkpoints could be under 0.3ms even at high volume.
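    The charts here come from the Windows performance counters, but a similar separation of data versus log latency can be approximated from inside SQL Server with the virtual file stats DMV (the numbers are cumulative since instance startup, so take two samples and difference them); a minimal sketch:

    SELECT DB_NAME(vfs.database_id) AS db, mf.type_desc, mf.physical_name
      , vfs.num_of_reads,  vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms
      , vfs.num_of_writes, vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
    JOIN sys.master_files mf ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    ORDER BY db, mf.type_desc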

    The Storage System

    I had not previously discussed the storage configuration in detail. The storage system consists of 240 x 10K HDDs only, with the standard system level caching. The SQL Server host is connected to the SAN over 4 FC ports (8Gb from host to switch, 4Gb from switch to SAN, and presumably 4Gb on the SAN backend?). The data is distributed over 8 volumes. The log is on a separate volume as seen by the host OS.

    The Problem

    The problem is that on the SAN, all disks are aggregated into a single pool, from which volumes are created. This was done per standard SAN vendor doctrine. Their magically great and powerful "value-add" intelligence would solve all your performance problems. We cannot ask for dedicated physical disks for the log because the SAN was already configured, with the SAN admin getting assistance from the SAN vendor's field engineer who followed the SAN vendor's doctrine.

    Input from the DBA team was not solicited and would have been ignored in any case. Besides, there are no unallocated disks. And no, the SAN team will not buy more disks, because there are no empty bays in the disk enclosures. And there is no room for more enclosures in the storage rack. So the DBA request is denied.

    Even if we put up the money to get an extra cabinet for one more disk enclosure, the SAN admin will still refuse to configure dedicated physical disks for the log volume because the SAN vendor said that their great and powerful SAN will solve all performance problems. Any problems must be with their application and not the SAN. As can be seen from the charts above, this is a load of crap.

    The SAN Vendor Solution

    As I said above, this particular SAN is comprised of 240 or so 10K HDDs.

    Naturally the SAN vendor's proposed solution is that we should buy more of their value-add products in the form of auto-tiering SSD-HDD, and perhaps additional SSDs as flash-cache. The particular SAN with base features probably has an amortized cost of $4,000 per HDD. So SAN with 240 disks would cost just under $1M (while still failing to provide desired database performance characteristics). A mid-range SAN might have amortized cost per disk of $1,500-2K. Enterprise SAN could be $4-6K per disk.

    The additional value-add features would substantially increase the already high cost, while providing only minor improvement, because the checkpoint IO surge will still drive up log write latency.

    The sad thing is that the SAN group might buy into this totally stupid idea, and refuse to acknowledge that the correct solution is to simply have dedicated physical disks for the logs. If there were dedicated physical disks, the checkpoint data IO surge goes to completely different physical disks than the log disks.

    In the specific example, it is not necessary to have separate FC ports for the logs. The 50K IOPS at 8K per IO generates 400MB/sec, which is only 25-33% of the realizable IO bandwidth of 4 x 4Gbit/s FC ports. The checkpoint IO surge would increase latency on data reads, but the primary reason for the performance drop is the increase (degradation) in log write latency.

    Another angle is changing the checkpoint parameters in SQL Server, but the real problem is that we are prevented from leveraging the pure sequential IO characteristics of HDDs by allocating data and log volumes from a common pool.
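    For completeness, the checkpoint knobs in question are the instance-level recovery interval and, from SQL Server 2012 on, the per-database target recovery time (indirect checkpoint); a sketch with illustrative values and a made-up database name:

    -- instance-wide automatic checkpoint target, in minutes (0 = default behavior)
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE
    EXEC sp_configure 'recovery interval (min)', 5
    RECONFIGURE

    -- per-database indirect checkpoint (SQL Server 2012 and later), which smooths out the write surge
    ALTER DATABASE [MyDatabase] SET TARGET_RECOVERY_TIME = 60 SECONDS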

    One more item. In the old days before we had immense memory, the typical transactional database data read/write mix was 50/50. This is because a read forces a dirty page to be written to disk. In this situation, a data write IO surge would also depress the data reads necessary to support transactions. So the standard practice in those days was to set the checkpoint interval to infinity to prevent data write IO surges. In our case, very little data read activity is necessary to support transactions, so while the checkpoint surge might depress data reads, this should have a lesser impact on transactions. It is the increase in log write latency that is depressing transaction volume.

    Solutions that work

    A solution that would work is simply to have separate dedicated physical disks for the log volume. It is that simple! And yet this is not possible because the SAN people would refuse to do this, as it is not in their agenda.

    It is unfortunate that the only practical solution is to get the critical database off the corporate SAN. I have done this by going to all flash in the form of PCI-E SSDs, that is, SSDs installed internal to the SQL Server system. This is not because the exceptional performance characteristics of SSDs were required.

    It was because I needed to get away from the SAN admin and his dogmatic adherence to SAN vendor doctrine. The IO performance requirements could have been met with a direct-attach HDD array (or on a SAN). But anything with HDD enclosures would have been under the authority of the SAN admin, who would have nixed any storage component that was not a SAN, and then configured it according to the SAN vendor principles.

    I have used the excuse that PCI-E SSD "accelerators" are needed for tempdb, which are not really "storage", hence there is no IT department mandate that it be on the SAN, under the absolute control of the SAN admin. In fact there were no special requirements for tempdb different from that of data. Then for unrelated reasons, there was enough capacity to put the entire DB on local SSD. Perhaps a file group with non-critical objects could reside on the SAN to carry the pretense of the local SSD not really being storage.
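    As an aside, pointing tempdb (or any database files) at the local PCI-E SSD is just a file relocation; a minimal sketch, where the logical names are the usual defaults and the drive path is made up:

    -- new locations take effect the next time the instance restarts
    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'S:\tempdb\tempdb.mdf')
    ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'S:\tempdb\templog.ldf')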


    Note that I have not naively suggested engaging in constructive dialog with the SAN team. They are on a holy mission that has no alignment with their company/organization's mission. Anything that contradicts SAN scripture is heresy.

    The Oracle database machine has been described as hardware optimized for the Oracle database. In fact the true mission is to take the SAN admin out of the loop. I think HP offered appliance oriented systems for SQL Server in 2012?, but the only system option was the DL980?, which is a severely narrow market segment. There need to be DL380 and 580 options as well.


    A SAN is simply a computer system that resides between storage elements and hosts (storage clients) providing volumes (similar to any other client-server application). One practical feature is that the SAN can provide protection against single component failure. I have no objection to this concept, nor to the fact that the SAN vendor wants to sell hardware at a list price of 10X markup to cost.

    Strangely, it seems that people want to buy expensive stuff (just look around outside of IT). Consider that an organization might spend anywhere from tens of millions to a few billion dollars to develop the main database application. It does not seem right (to a high level executive) to put such an application on a $100K server plus $200K storage when there are $1M servers and $6M storage systems available. Never mind whatever might be the consequential differences between them. Or the fact that there are supreme technical challenges in scaling on (hard) NUMA, and oh yeah, the chief architects have never heard of NUMA.

    The point here is that people will buy very expensive SAN systems without technical understanding or justification. There is a perfectly sound business justification. What the client needs is a system supported by deep experts, not some rookie field engineer who incorrectly chooses to replace the memory board with no uncorrected errors before the board with uncorrected errors.

    My observation is that the SAN vendors feel a need to have technical justification for selling a storage system with extraordinarily high markup. So the vendors create reasons. (Technical justifications created for marketing requirements tend to have serious logical flaws, but a creative marketing professional is not easily deterred.)

    It follows that, pretending these technical justifications are valid, the "best" practices for employing a SAN should be blah, blah, as derived from the underlying reason. Do I need to explain the consequences of implementing practices built on a lie?

    I will add that there is an absolutely critical value-add that the storage system vendor must provide that alone justifies very expensive pricing. This is integration testing, verifying that a complex system with very many components works well together. The hard disk vendors are generally good at validating that their products work to specification in desktop and workstation configurations with 1 or even a few drives. Ensuring that a storage system with several controllers, thousands of HDDs, dual controllers, etc., works correctly is vastly more complicated. This is a valid reason. Building practices on a valid reason has benefits.


    In a large environment, there might be hundreds or thousands of servers, database instances, SQL Server or other. Managing very many systems and databases economically requires a high degree of automation, usually implemented with standard configurations.

    However, there are some databases that are critical, either in being required for day to day operations or in possibly providing a competitive advantage over other options. One indicator of this is that there are DBAs and developers dedicated to a specific application. In this case, the argument that customization of the storage configuration is not feasible because of the other responsibilities on the SAN team is total BS.

  • ExecStats update - automating execution plan analysis

    It has been over 7 years now that I have made my ExecStats
    (current version Exec Stats 2015-02-18)
    tool publicly available (late 2007), with prototype versions going back to 2006. The two distinguishing elements of ExecStats are 1) the emphasis on cross-referencing a) query execution statistics to b) index usage via the c) execution plans, and 2) saving information locally so that it can be sent to a remote expert.

    The more recent versions now have the capability to simultaneously group stored procedure statements together while consolidating SQL by query hash. On the performance (counter) monitoring, charts are now displayed on a logarithmic scale, allowing insight over several decades of dynamic range.

    Too many of the commercial SQL Server performance tools emphasize generating pretty charts and reports on query execution statistics, thinking that handing the top SQL to the originating developer is all that is expected of the DBA. In principle the DBA should have greater expertise on SQL Server performance than a developer, who is expert on the development platform (Visual Studio or other) and perhaps the business logic.

    In any case, the query execution statistics by itself is not very helpful, without the execution plan, and possibly the difference between estimated and actual rows. We should first appreciate that SQL Server is of some complexity, with several factors that could have adverse impact on performance.

    Examples are 1) does the table architecture support the business logic, 2) is the SQL written in manner that the query optimizer can interpret in the desired manner, 3) having a few good indexes, 4) no more indexes than necessary, 5) a good statistics strategy, not just frequency of rebuilds but also whether full scan samples are necessary, and possibly explicit statistics update in between certain SQL statements typically for ETL, 6) compile parameter issues, 7) row estimation issues at the source, 8) serious row estimate errors after intermediate operations (not at the source) and of course 9) system level issues.
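    On item 5, a minimal sketch of an explicit statistics update wedged between ETL steps (the table name is made up):

    -- after the bulk load, refresh statistics before the next join-heavy step compiles
    UPDATE STATISTICS dbo.StageOrders WITH FULLSCAN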

    Some very advanced tools have the capability of generating alternative SQL in an attempt to solve a performance problem. The question is have other factors been considered before jumping into re-writing the SQL?

    I have had good success with ExecStats in onsite engagements, in greatly simplifying data collection and to a degree the analysis as well. On many occasions I have been able to solve simple or medium complexity issues remotely with just the information collected by ExecStats. It should work against Azure as well, but this is only tested intermittently (and on request). People are invited to use it and send feedback.

    ExecStats documentation

  • extensions for sp_helpindex

    The system stored procedures in SQL Server have, from the very beginning(?), provided useful information. However, they have not been updated in a substantial manner for the new features in later versions, nor have they been extended with the additional details that are now available in the DMVs for more sophisticated DBAs. Presumably, this is for backward compatibility. As an alternative, there is a provision for creating custom procedures that behave as system procedures with sp_ms_marksystemobject.

    In the case of sp_helpindex, the new features of interest are included columns, filtered indexes, compression and partitioning. Other information of interest might be size, index usage and statistics. Part of the reason for not changing it could be backward compatibility, which is fine, but let's then make new system procedures with the extended information.

    In the text of sp_helpindex for SQL Server 2012, the only difference from earlier versions is that the description field has a provision for columnstore. The SQL Server 2014 version adds hash index and memory optimized in the description field.


    Below is my new procedure for extended index information. I have retained the same error checking code from the original. The cursor loop to assemble the index key columns has been replaced with a code sequence using the STUFF function and FOR XML PATH. A similar structure reports on the included columns. This procedure does not replicate the index description field of the original, but rather has a limited description and a separate field for the type code.


    USE master
    GO

    CREATE PROCEDURE [dbo].[sp_helpindex3]
      @objname nvarchar(776)
    AS

    DECLARE @objid int
      , @dbname sysname

    -- Check to see that the object names are local to the current database.
    select @dbname = parsename(@objname,3)
    if @dbname is null
      select @dbname = db_name()
    else if @dbname <> db_name()
    begin
      raiserror(15250,-1,-1)   -- the qualifier must be the current database
      return (1)
    end

    -- Check to see that the table exists and initialize @objid.
    select @objid = object_id(@objname)
    if @objid is NULL
    begin
      raiserror(15009,-1,-1,@objname,@dbname)   -- the object does not exist
      return (1)
    end

    ;WITH b AS (
      SELECT d.object_id, d.index_id, part = COUNT(*)
      , reserved = 8*SUM(d.reserved_page_count)
      , used = 8*SUM(d.used_page_count)
      , in_row_data = 8*SUM(d.in_row_data_page_count)
      , lob_used = 8*SUM(d.lob_used_page_count)
      , overflow = 8*SUM(d.row_overflow_used_page_count)
      , row_count = SUM(row_count)
      , notcompressed = SUM(CASE data_compression WHEN 0 THEN 1 ELSE 0 END)
      , compressed = SUM(CASE data_compression WHEN 0 THEN 0 ELSE 1 END) -- change to 0 for SQL Server 2005
      FROM sys.dm_db_partition_stats d WITH(NOLOCK)
      INNER JOIN sys.partitions r WITH(NOLOCK) ON r.partition_id = d.partition_id
      GROUP BY d.object_id, d.index_id
    ), j AS (
      SELECT j.object_id, j.index_id, j.key_ordinal, c.column_id, c.name, j.is_descending_key
      FROM sys.index_columns j
      INNER JOIN sys.columns c ON c.object_id = j.object_id AND c.column_id = j.column_id
    )
    SELECT ISNULL(i.name, '') [index]
    , ISNULL(STUFF(( SELECT ', ' + name + CASE is_descending_key WHEN 1 THEN '-' ELSE '' END
       FROM j WHERE j.object_id = i.object_id AND j.index_id = i.index_id AND j.key_ordinal > 0
       ORDER BY j.key_ordinal FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,2,'') ,'') as Keys
    , ISNULL(STUFF(( SELECT ', ' + name
       FROM j WHERE j.object_id = i.object_id AND j.index_id = i.index_id AND j.key_ordinal = 0
       ORDER BY j.column_id FOR XML PATH(''), TYPE, ROOT).value('root[1]','nvarchar(max)'),1,2,'') ,'') as Incl
    , i.index_id
    , CASE WHEN i.is_primary_key = 1 THEN 'PK'
       WHEN i.is_unique_constraint = 1 THEN 'UC'
       WHEN i.is_unique = 1 THEN 'U'
       WHEN i.type = 0 THEN 'heap'
       WHEN i.type = 3 THEN 'X'
       WHEN i.type = 4 THEN 'S'
       ELSE CONVERT(char, i.type) END typ
    , i.data_space_id dsi
    , b.in_row_data in_row , b.overflow ovf , b.lob_used lob
    , b.reserved - b.in_row_data - b.overflow - b.lob_used unu
    , 'ABR' = CASE row_count WHEN 0 THEN 0 ELSE 1024*used/row_count END
    , y.user_seeks, y.user_scans u_scan, y.user_lookups u_look, y.user_updates u_upd
    , b.notcompressed ncm , b.compressed cmp , b.row_count
    , s.rows, s.rows_sampled, s.unfiltered_rows, s.modification_counter mod_ctr, s.steps
    , CONVERT(varchar, s.last_updated,120) updated
    , i.is_disabled dis, i.is_hypothetical hyp, ISNULL(i.filter_definition, '') filt
    FROM sys.objects o
    JOIN sys.indexes i ON i.object_id = o.object_id
    LEFT JOIN b ON b.object_id = i.object_id AND b.index_id = i.index_id
    LEFT JOIN sys.dm_db_index_usage_stats y ON y.object_id = i.object_id AND y.index_id = i.index_id
     AND y.database_id = DB_ID()
    OUTER APPLY sys.dm_db_stats_properties(i.object_id , i.index_id) s
    WHERE o.type IN('U','V')
    AND i.object_id = @objid
    GO

    -- Then mark the procedure as a system procedure.
    EXEC sp_ms_marksystemobject 'sp_helpindex3'
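    Usage is the same as with the original procedure; for example (the database and table here are just the usual sample objects):

    USE AdventureWorks
    GO
    EXEC sp_helpindex3 N'Sales.SalesOrderDetail'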





    The information added in my extended version of index help is space, index usage, and statistics. The DMV/function dm_db_stats_properties was added in SQL Server 2008 R2 Service Pack 2 and SQL Server 2012 Service Pack 1. The function STATS_DATE was added in SQL Server 2008 and can be used if dm_db_stats_properties is not supported.
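    For the older versions, a sketch of the STATS_DATE fallback (table name illustrative):

    -- last statistics update per statistics object on one table, without dm_db_stats_properties
    SELECT s.name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats s
    WHERE s.object_id = OBJECT_ID(N'dbo.SomeTable')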

    We could also join to dm_db_index_operational_stats, dm_db_index_physical_stats or dm_db_xtp_index_stats for additional information, but I do not think the extra information is necessary for routine use.

    The above query might be more useful as a view capable of reporting on indexes for all tables and indexed views, but is there a system view option?

    It is too bad more statistics information in DBCC SHOW_STATISTICS is not available in query form.

    I am not sure why the color coding is not displayed here; see the same page on my web site, sp_helpindex3.
    We could also do an option to show indexes for all tables, leaving off the object_id =.

  • Join Row Estimation in Query Optimization

    This topic is titled to specifically consider only row estimation after joins, precluding discussion of row estimates at the source table, which has already been addressed in papers covering the new Cardinality Estimator in SQL Server 2014 and other statistics papers for previous versions of SQL Server.

    There are certain situations in which the query compile time can be excessively long, even for queries of moderate complexity. This could be true when the plan cost is very high, so that the query optimizer expends more effort to find a better plan before reaching the time out; that is, the time out setting appears to be a function of the plan cost. Even then, the query optimizer could still make row estimation errors in the operations after the source table (for which data distribution statistics are kept on columns and indexes) of sufficient consequence to render the remainder of the plan un-executable in a practical time frame.

    The new cardinality estimator in SQL Server 2014 is helpful in resolving known issues, but does little to improve row estimates after the initial access at the data source, beyond fixed rules which may be more generally true than the rules used before. That said, the query optimizer only attempts to estimate rows (and other cost factors) using a combination of the information it has, and rules for situations where there are no grounds for making a knowledge based estimate.

    So why be bound by this rule of estimating only? The other company (only sometimes in the news concerning databases) has (recently?) introduced adaptive query optimization, which can make run-time adjustments to the execution plan. Sure, that's nice, but I am thinking something more sophisticated could be done in forming the plan that is still less complex than the ability to change course in the midst of execution.

    Obviously, if a query has a clearly correct execution plan that can be determined by existing techniques, and the plan cost is fairly low, then no change is necessary. Any further special technique should only be considered for expensive queries, particularly those in which the row estimates at the intermediate steps are difficult to assess.

    Consider the query below:
    SELECT bunch of columns
    FROM Nation n
    JOIN Customers c ON c.NationId = n.NationId
    JOIN Orders o ON o.CustomerId = c.CustomerId
    WHERE n.Country LIKE 'Z%'

    From statistics, we know approximately how many countries have a name beginning with Z. SQL Server also has a histogram for the NationId column in the Customers table. If we had specified the list of NationId values (with equality predicates applying to both tables), SQL Server could use the more detailed information from the histogram to make a row estimate. But because we specified the Country name column, which exists only in the Nation table, we must use the average number of customers per country (from the density vector) multiplied by the number of qualifying countries to estimate rows after the join to Customers.
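
    The two pieces of information in play here can be inspected directly with DBCC SHOW_STATISTICS; the statistics name below is hypothetical.

    -- density vector: average customers per NationId, used when the join values are unknown
    DBCC SHOW_STATISTICS ('dbo.Customers', 'IX_Customers_NationId') WITH DENSITY_VECTOR
    -- histogram: per-value row counts, usable only when the NationId values are known
    DBCC SHOW_STATISTICS ('dbo.Customers', 'IX_Customers_NationId') WITH HISTOGRAM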

    And next, of course, all customers are not alike: some place a small, some a medium, and some a large number of orders. So why expend a great deal of effort to find a plan from all the possible join orders and index combinations based only on row estimates, when it is known that the data distribution in each succeeding table is heavily skewed? Why not pre-execute the tables for which a SARG was specified to get the column values used in the join to the next table, so that the more detailed histogram distribution information can be used? This technique could be pushed to multiple levels depending on the initially assessed plan cost, and perhaps controlled by a query hint.

    For example, in the above query, suppose NationIds 19, 37, 42, and 59 meet the criterion of Country beginning with 'Z'. The optimizer would next look at the Customers table histogram on NationId for these values to estimate rows after the join. If the situation warrants, the next level could be examined as well.
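
    Until something like this exists in the optimizer, the reconnaissance can be approximated by hand against the hypothetical schema above: pre-execute the SARG table, then restate the results as literals (for example via dynamic SQL) so the Customers histogram on NationId can be used.

    -- step 1: pre-execute the table carrying the search argument
    SELECT NationId FROM Nation WHERE Country LIKE 'Z%'  -- 19, 37, 42, 59 in the example

    -- step 2: restate those values as literals so the optimizer estimates the join
    -- from the histogram on Customers.NationId rather than from the density vector
    SELECT c.CustomerId, o.OrderId  -- bunch of columns
    FROM Customers c
    JOIN Orders o ON o.CustomerId = c.CustomerId
    WHERE c.NationId IN (19, 37, 42, 59)

    The IN list is exactly the information the proposed reconnaissance would hand to the optimizer automatically.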

    It could be argued that the query optimizer should not execute the query in order to determine the plan, but why follow that principle when the cost of query optimization is excessively high (several seconds) relative to the comparatively minor effort of a more extensive reconnaissance (tens or hundreds of milliseconds)? Especially considering that the reliability of row estimates becomes progressively worse after each join or other operation beyond the original source.

    This technique should probably be used when there are tables with search arguments joining to tables on columns with highly skewed distributions. The first implementation might be activated only by a query hint until some maturity is achieved, followed by broader use.

    Presumably there might be a cost threshold as well. I would prefer not to tie it to parallelism. Of course, given the nature of modern systems, it really is time for the cost threshold for parallelism and max degree of parallelism to have graduated controls instead of a single on/off setting.

    Side note 1

    OK, now forget what I said at the beginning, and I will gripe about SQL Server default statistics. It has been discussed elsewhere that SQL Server uses random page samples rather than random row samples, as this is a much less expensive way to collect data. It does use an index in which the column is not the lead key, if one is available, to improve randomness. Still, I have noticed a severe sensitivity to sampling percentage in cases where the column value is correlated with (page) storage location.
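
    One way to observe this sensitivity is to rebuild a statistics object at a low sample rate and again with a full scan, then compare the headers and density vectors; the table and statistics names here are placeholders, with a date column chosen because it tends to correlate with page location.

    -- sampled versus fullscan statistics on a column correlated with storage location
    UPDATE STATISTICS dbo.Orders (IX_Orders_OrderDate) WITH SAMPLE 5 PERCENT
    DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate') WITH STAT_HEADER, DENSITY_VECTOR

    UPDATE STATISTICS dbo.Orders (IX_Orders_OrderDate) WITH FULLSCAN
    DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate') WITH STAT_HEADER, DENSITY_VECTOR

    Large swings in the density (the reciprocal of the distinct value count) between the two runs are the symptom described above.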

    So I suggest that as the page sampling is in progress, a count of new values found in each successive page sampled versus existing values be kept. If the number of new values found falls off sharply, then most distinct values have probably been found, and the interpretation of the existing sample is that its distribution should be scaled by the ratio of total rows to rows sampled. If almost all of the values in the last few pages sampled are new (not previously found), then the interpretation should be that the number of distinct values should be scaled by the total-to-sample ratio. And some blend of the two when there is an intermediate number of new values versus previously found values in each successive page.


    The query optimization/plan compile process is single threaded. A modern microprocessor might run at 3 GHz, so a 10-second compile is 30 billion CPU cycles, and I have seen compiles run more than 100 seconds. One query even broke SQL Server, though that was a set of deeply nested, repeating CTEs that should have been a PIVOT/UNPIVOT. So why hold to the principle of optimizing only from unreliable estimates when an index seek is a mere 10 microseconds? Presumably the key column statistics have already been decoded.

    It would be nice to have a more powerful query optimizer, but there is already a method for writing SQL to get a specific execution plan: bushy joins. Of course, the other element in this is knowing what the correct execution plan is, which involves not the cost model the query optimizer uses, but the true cost model.
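
    For reference, a sketch of one bushy-join pattern (tables and columns here are hypothetical): derived tables plus a FORCE ORDER hint shape the join tree rather than leaving the shape entirely to the optimizer.

    SELECT ab.Col1, cd.Col2
    FROM (
     SELECT a.JoinKey, a.Col1
     FROM TableA a
     JOIN TableB b ON b.KeyAB = a.KeyAB
    ) ab
    JOIN (
     SELECT c.JoinKey, c.Col2
     FROM TableC c
     JOIN TableD d ON d.KeyCD = c.KeyCD
    ) cd ON cd.JoinKey = ab.JoinKey
    OPTION (FORCE ORDER)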
