Server system memory capacities have grown to ridiculously large levels
far beyond what is necessary now that solid-state storage is practical.
Why is this a problem?
Because the requirement that memory capacity trumps other criteria has
driven system architecture to be focused exclusively on low cost DRAM.
DDR DRAM, currently in its fourth version,
has low cost, acceptable bandwidth,
but poor latency.
Round-trip memory access latency is now far more important
for database transaction processing.
The inverse of latency is practically a direct measure of transaction throughput.
In the hard disk era, the importance of memory in keeping IO
volume manageable did trump all other criteria.
In theory, any level of IO performance could be achieved
by aggregating enough HDDs.
But large arrays have proportionately more frequent disk failures.
Although RAID provides fault tolerance,
the failed drive rebuild process causes degraded performance for hours.
The practical upper bound for HDD storage
is on the order of 1000 disks, or 200K IOPS.
If it was necessary to purchase 1TB of memory to accomplish this,
then it was money well spent because the storage was
even more expensive.
Since then, storage with performance requirements
has been, or should be, transitioning to all-flash,
and not some tiered mix of flash and HDD.
Modern solid-state arrays can handle up to 1M IOPS,
and far more, easily, when on a full NVMe stack.
Now is the time to re-evaluate just how much memory
is really needed when storage is solid-state.
There are existing memory technologies, RLDRAM and SRAM for example,
with different degrees of lower latency, higher cost
and lower density at both the chip and system level.
There is potential to reduce memory latency by a factor of two or more.
The performance impact for database transaction processing
is expected to be nearly proportional.
A single-socket system with RLDRAM or SRAM memory could replace
a 4-socket system with DDR4.
Practically any IO volume, even several million IOPS,
can be handled by the storage system.
The key is that the CPU expended by the database engine
on IO be kept to a reasonable level.
An argument is made for a radical overhaul of current system
architecture, replacing DDR4 DRAM as main memory
with a different technology that has much lower latency,
possibly sacrificing an order of magnitude in capacity.
As the memory controller is integrated into the processor,
this also calls for a processor architecture change.
A proper assessment for such a proposal
must examine not only its own merits,
but also other technologies with the potential for order
of magnitude impact.
The discussion here mostly pertains to Intel processors
and Microsoft SQL Server.
However, the concepts are valid on any processor and database engines
built around page organization, row-store with b-tree indexes.
The argument for low latency memory is valid even in the
presence of Hekaton memory-optimized tables, or other MVCC implementations
because it is a drop-in hardware solution instead of a database re-architecture.
Of course, the heavy lifting is on Intel or other processor vendors
to re-architect the system via processor and memory.
For more than forty years, DRAM has been the standard
for main memory in almost every computer system architecture.
A very long time ago, having sufficient memory to avoid paging
was a major accomplishment.
And so, the driving force in memory was the semiconductor technology
with low cost.
This led to DRAM. And DRAM was optimized to 1 transistor plus 1 capacitor (1T1C).
DRAM addressing was multiplexed because even the package
and signal count impact on cost mattered.
The page file has since become a relic, that for some reason
has yet to be removed.
Over the years, the mainstream form of DRAM, from SDRAM to DDR4,
did evolve to keep pace with progress on bandwidth.
The criterion that has changed little is latency,
and it is round-trip latency that has become critical in applications
characterized by pointer-chasing code resulting
in serialized memory accesses.
Even as memory capacity reached enormous proportions,
the database community routinely continued to configure systems
with the memory slots filled, often with the maximum
capacity DIMMs, despite the premium over the next lower capacity.
There was a valid reason for this.
Extravagant memory could mask (serious) deficiencies
in storage system performance.
Take the above in conjunction with SAN vendors providing helpful advice
such as log volumes do not need dedicated physical disks,
and the SAN has 32GB cache that will solve all IO performance problems.
Then there is the doctrine to implement their vision of storage as a service.
But now all-flash is the better technology for IO intensive storage.
Solid-state storage on an NVMe stack is even better.
Massive memory for the sole purpose of driving IO down to noise levels
is no longer necessary.
Now free of the need to cover-up for weakness in storage,
it is time to rethink system architecture,
particularly the memory strategy.
Not all applications desire low latency memory at lower capacity
and higher cost.
Some work just fine with the characteristics of DDR4.
Others prefer a different direction entirely.
The HPC community is going in the direction of extreme
memory bandwidth, implemented with MCDRAM in the latest Xeon Phi.
In the past, when Moore's Law was in full effect,
Intel operated on: "the Process is the Business Model".
It was more important to push the manufacturing process,
which meant having a new architecture every two years
with twice the complexity of the previous generation
and then be ready to shrink the new architecture to
a new process the next year.
Specialty products diverted resources from the main priority
and were difficult to justify.
But now process technology has slowed to a three-year cycle
and perhaps four years in the next cycle.
It is time to address major
product opportunities outside of general purpose computing.
The principal topic here is the impact of memory latency
on database transaction performance, and the
directions forward for significant advancement.
As the matter has complex entanglements, a number of different
aspects come into play.
One is scaling on non-uniform memory access (NUMA) system
architecture, in which the impact of memory access can be examined.
Memory technologies with lower latency are mentioned,
but these topics are left to experts in their respective fields.
Hyper-Threading is a work-around to the issue of memory latency,
so it is mentioned.
The general strategy for server system performance in recent years has been more cores.
But it is also important to consider which is
the right core as the foundation.
Another question to answer is whether a hardware and/or software solution
should be employed.
On software, Multi-version Concurrency Control (MVCC) is the technology
employed by memory-optimized databases
(the term in-memory
has misleading connotations)
that can promise large performance gains,
and even more extreme gains when coupled with natively compiled procedures.
Architecting the database for scaling on NUMA architecture
is also a good idea.
A mixed item is SIMD, the SSE/AVX registers and instructions
introduced over the last 18 years.
From the database transaction processing point of view,
this is not something that would have been pursued.
But it has come to occupy a significant chunk of real estate
on the processor core.
So, find a way to use it! Or give it back.
The list is as follows:
- Scale up on NUMA versus single socket
- RLDRAM/SRAM versus DDR4+3D XPoint
- Hyper-Threading to 4-way
- Core versus Atom and all comers?
- Hekaton, memory-optimized, MVCC
- Database architected for NUMA
- SSE/AVX Vector Instructions
The above topics will be addressed, perhaps only partially,
often out of order, and sometimes mixed as appropriate.
Factors to consider are potential impact,
difficulty of implementation
and overall business justification.
L3 and Memory Latency
The 7-cpu benchmark website lists L3 and memory latency for some recent processors, shown below.
|Processor|Base Freq|L3|Local Memory|Remote Memory|
|---|---|---|---|---|
|Westmere EP (Xeon X5650)|2.67GHz|40-42 cycles|L3+67ns|L3+105ns|
|Ivy Bridge (i7-3770)|3.4GHz|30 cycles|L3+53ns| |
|Haswell (i7-4770)|3.4GHz|36 cycles|L3+57ns| |
|Haswell (E5-2603 v3)|1.6GHz|42-43 cycles| | |
|Skylake (i7-6700)|4.0GHz|42 cycles|L3+51ns| |
There are a number of questions.
Is L3 latency determined by absolute time or cycles?
In the Intel Xeon processors from Ivy Bridge EP/EX on,
this is particularly complicated because there are 3 die layouts,
LCC, MCC, and HCC,
each with different structure and/or composition.
The Intel material says that L3 latency is impacted
by cache coherency snoop modes and ring structure.
See the Intel Dev Conference 2015
Processor Architecture Update,
and similar Intel
HPC Software Workshop 2016 Barcelona
For simplicity, L3 is assumed to be 15ns here without consideration for these complications.
DRAM, SQL Server 8KB Page
The Crucial web site and Micron datasheet cite DDR4-2133 latency at 15-15-15,
working out to 14ns each for CL, tRCD and tRP, so random row latency is 42ns
at the DRAM interface.
The Intel web page for their
Memory Latency Checker
utility shows an example having local node latency as 67.5 or 68.5 ns
and remote node as 125.2 or 126.5 ns.
L3 and memory latency on modern processors is a complicated matter.
The Intel values above include the L3 latency.
So, by implication, the 10ns difference is the transmission time from memory controller
to DRAM and back?
One of the modes of DRAM operation is to open a row,
then access different columns, all on the same row,
with only the CL time between columns.
I do not recall this being in the memory access API?
Is it set in the memory controller?
Systems designed for database servers might force close a row immediately?
In accessing a database page, first, the header is read.
Next, the row offsets at the end of the 8KB page.
Then the contents of a row, all or specific columns.
This should constitute a sequence of successive reads
all to the same DRAM row?
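If the page access sequence above does stay within one open DRAM row, the savings can be sketched with the timings cited earlier (14ns each for CL, tRCD and tRP at DDR4-2133). The three-access pattern (header, row offsets, row contents) is an assumption for illustration:

```python
# Sketch: time for the header / row-offset / row-content reads of an 8KB page,
# comparing same-row (open page) column accesses against fully closed-row
# accesses. Timings are the DDR4-2133 15-15-15 figures cited above (~14ns each).

CL = tRCD = tRP = 14  # ns

def closed_row(accesses):
    # every access pays precharge + row activate + column read
    return accesses * (tRP + tRCD + CL)

def open_row(accesses):
    # first access opens the row; the rest pay only the column (CL) time
    return (tRP + tRCD + CL) + (accesses - 1) * CL

print(closed_row(3))  # 126 ns
print(open_row(3))    # 70 ns
```

If the memory controller force-closes rows, the page access costs the full 42ns per reference rather than 14ns for the follow-on column reads.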
Memory Architecture - Xeon E5 and E7
The most recent Core i3/5/7 processors run in the mid-4GHz range.
There are Xeon EP and EX processors with base frequency in the 3GHz+ range.
But for the high core count models, 2.2GHz is the common base frequency.
Below is a representation of the Xeon E5 memory subsystem.
In the Xeon E7, the memory controller connects to a scalable memory buffer (SMB)
or Memory Extension Buffer (MXB), depending on which document,
and the interface between MC and SMB is SMI.
The SMB doubles the number of DIMMs that can connect to each memory channel.
There is no mention of the extra latency for the SMB.
It cannot be free,
"All magic comes with a price".
The value of 15ns is used here for illustrative purposes.
Anyone with access to both 2-way Xeon E5 and 4-way E7 systems
of the same generation and core count model (LCC, MCC or HCC)
is requested to run the 7-cpu benchmark
or Intel Memory Latency Checker
utility, and make the results known.
In principle, if Intel were feeling helpful, they would do this.
Below is a single socket Xeon E5 v4.
The high core count (HCC) model has 24 cores.
But the Xeon E5 v4 series does not offer a 24-core model.
Two cores are shown as disabled as indicated, though it could be any two?
There is only one memory node, and all memory accesses are local.
Below is a 2-socket system, representing the Xeon E5
either HCC or MCC models in having two double rings,
and the interconnect between rings.
The details shown are to illustrate the nature of NUMA
NUMA is a complicated topic, with heavy discussion on cache coherency,
the details of which impact L3 and memory latency.
The discussion here will only consider a very simple model,
looking only at the physical distance impact on memory latency.
See the NUMA Deep Dive Series
by Frank Denneman.
Each socket is its own memory node
(not sure what happens in Cluster-on-Die mode).
To a core in socket 0, memory in node 0 is local, memory in node 1 is remote,
and vice versa to a core in the other socket
(more diagrams at System Architecture Review 2016).
Below is a representation of a 4-socket system based on the Xeon E7 v4 memory architecture.
To a core in socket 0, memory in node 0 is local,
memory in nodes 1, 2, and 3 are remote.
And repeat for other nodes.
It is assumed that memory accesses are 15ns longer than
on the E5 due to the extra hop through the SMB, outbound and inbound.
This applies to both local and remote node memory accesses.
On system and database startup, threads should allocate memory on the local node.
After the buffer-cache has been warmed up,
there is no guarantee that threads running in socket 0
will primarily access pages located in memory node 0,
or threads on socket 1 to memory node 1.
That is, unless the database has been purposefully architected with a plan
in mind to achieve higher than random memory locality.
The application also needs to be built to the same memory locality plan.
A connection to SQL Server will use a specific TCP/IP port number
based on the key value range.
SQL Server will have TCP/IP ports mapped to specified NUMA nodes.
Assuming the application does the initial buffer-cache warm up,
and not some other source that is not aware of the NUMA tuning,
there should be alignment of threads to pages on the local node.
(It might be better if an explicit statement specifies the preferred NUMA node
by key value range?)
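The client side of this arrangement can be sketched as below. The port numbers, key ranges, and node count are invented for illustration; SQL Server's port-to-NUMA-node mapping is configured separately on the server:

```python
# Hypothetical sketch of NUMA-aligned connections: the application chooses the
# SQL Server TCP/IP port from the key value range, and the server maps each
# port to a NUMA node. All specific values here are made-up examples.

NODE_PORTS = {0: 1433, 1: 1434, 2: 1435, 3: 1436}  # one port per NUMA node
KEYS_PER_NODE = 250_000  # e.g., 1M keys spread over 4 nodes

def port_for_key(key_value):
    node = (key_value // KEYS_PER_NODE) % len(NODE_PORTS)
    return NODE_PORTS[node]

print(port_for_key(100_000))   # 1433 -> node 0
print(port_for_key(600_000))   # 1435 -> node 2
```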
An example of how this is done can be found in the TPC-C and TPC-E benchmark full disclosures
and supplemental files.
Better yet is to get the actual kit from the database vendor.
For SQL Server, the POC is JR
Curiously, among the TPC-E full disclosure and supplemental files'
thousands of pages of inane details, there is nothing
that says that the NUMA tuning is of pivotal importance.
It is assumed in the examples here that threads access random pages,
and memory accesses are evenly distributed over the memory nodes.
Simple Model for Memory Latency
In Memory Latency, NUMA and HT,
a highly simplified model for the role of memory latency in
database transaction processing is used to demonstrate the
differences between a single-socket system having uniform memory
and multi-socket systems having NUMA architecture.
Using the example of a hypothetical transaction that could execute in 10M cycles
with "single-cycle" memory,
the "actual" time is modeled for a 2.2GHz core in which 5% of instructions
involve a non-open page memory access.
The model assumes no IO, but IO could be factored in if desired.
The Xeon E5 memory model is used for 1 and 2 sockets,
and the E7 for the 4-socket system.
(Table: GHz, L3+mem, remote, sockets, avg. memory latency, memory cycles, fraction, total cycles, tps/core, total tx/sec per configuration.)
If there were such a thing as single-cycle memory,
the performance would be 220 transactions per second
based on 2,200M CPU-cycles per second and 10M cycles per transaction.
Based on 67ns round-trip memory access,
accounting for a full CL+tRCD+tRP,
the transmission time between processor and DRAM,
and L3 latency,
incurred in 0.5M of the 10M "instructions",
the transaction now completes in 83.2M cycles.
The balance of 73.2M cycles are spent waiting for memory accesses to complete.
This circumstance arises primarily in pointer-chasing code,
where the contents of one memory access determines the next action.
Until the access completes, there is nothing else for the thread to do.
The general advice is to avoid this type of coding,
except that this is what happens in searching a b-tree index.
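The model's arithmetic can be reproduced in a few lines; the 2.2GHz clock, 5% memory-access fraction, and 67ns round trip are the assumptions stated above:

```python
# The simplified transaction model above: a 10M-cycle transaction on a 2.2GHz
# core, where 5% of "instructions" incur a 67ns round-trip memory access.

GHZ = 2.2
BASE_CYCLES = 10e6          # cycles with "single-cycle" memory
MEM_FRACTION = 0.05
MEM_NS = 67                 # L3 + DRAM + transmission, round trip

mem_cycles = MEM_NS * GHZ                            # ~147 cycles per access
accesses = BASE_CYCLES * MEM_FRACTION                # 0.5M per transaction
total = BASE_CYCLES + accesses * (mem_cycles - 1)    # ~83.2M cycles
tps = GHZ * 1e9 / total

print(round(total / 1e6, 1))   # ~83.2 (M cycles per transaction)
print(round(tps, 1))           # ~26.4 transactions/sec per core
```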
If the impact of NUMA on database transaction processing performance
were understood and clearly communicated,
databases could have been architected from the beginning to work with
the SQL Server NUMA and TCP/IP port mapping features.
Then threads running on a given node primarily access pages local to that node.
If this forethought had been neglected, then one option is to re-architect
both the database and application,
which will probably involve changing the primary key
of the core tables.
Otherwise, accept that scaling on multi-socket systems is
not going to be what might have been expected.
Furthermore, the Xeon E7 processor, commonly used in 4-socket systems,
has the SMB feature for doubling memory capacity.
As mentioned earlier, this must incur some penalty in memory latency.
In the model above, scaling is:
1P -> 2P = 1.45X, 2P -> 4P = 1.56X and
1P -> 4P = 2.26X
The estimate here is that the SMB has an 11% performance penalty.
If the doubling of memory capacity (or other functionality)
was not needed, then it might have been better to leave off the SMB.
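The scaling figures and the SMB estimate follow from the same model, under the stated assumption that memory accesses are evenly distributed over the nodes and that the E7's SMB adds 15ns to both local and remote accesses:

```python
# Extending the latency model to NUMA: average latency over evenly distributed
# node accesses, E5 at 67ns local / 125ns remote, E7 with +15ns for the SMB.

GHZ, BASE, FRAC = 2.2, 10e6, 0.05

def tps_per_core(avg_ns):
    cycles = BASE + BASE * FRAC * (avg_ns * GHZ - 1)
    return GHZ * 1e9 / cycles

def avg_ns(local, remote, sockets):
    # accesses spread evenly: 1/sockets local, the rest remote
    return (local + (sockets - 1) * remote) / sockets

one = 1 * tps_per_core(67)                        # 1P E5
two = 2 * tps_per_core(avg_ns(67, 125, 2))        # 2P E5
four = 4 * tps_per_core(avg_ns(82, 140, 4))       # 4P E7 (SMB adds 15ns)
four_no_smb = 4 * tps_per_core(avg_ns(67, 125, 4))

print(round(two / one, 2))               # 1.45 (1P -> 2P)
print(round(four / two, 2))              # 1.56 (2P -> 4P)
print(round(four / one, 2))              # 2.26 (1P -> 4P)
print(round(1 - four / four_no_smb, 2))  # ~0.11, the SMB penalty
```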
There is a 4-way Xeon E5 4600-series, but one processor is
2-hops away, which introduces its own issues.
There is a paucity of comparable benchmark results to support meaningful quantitative analysis.
In fact, it would seem that the few benchmarks available employ configuration variations
with the intent to prevent insight.
Below are TPC-E results from Lenovo on Xeon E5 and E7 v4, at 2 and 4 sockets respectively.
|Processor|Sockets|Cores|Threads|Memory|Data Storage|tpsE|
|---|---|---|---|---|---|---|
|E5-2699 v4|2|44|88|512GB (16x32)|3x17 R5|4,938.14|
|E7-8890 v4|4|96|192|4TB (64x64)|5x16 R5|9,068.00|
It would seem that scaling from 2P to 4P is outstanding at 1.84X.
But there is a 9% increase in cores per socket from 22 to 24.
Factoring this in, the scaling is 1.68X, although scaling versus core count
should be moderately less than linear.
Then there is the difference in memory, from 256GB per socket to 1TB per socket.
Impressive, but how much did it actually contribute?
Or did it just make up for the SMB extra latency?
Note that TPC-E does have an intermediate level of memory locality,
to a lesser extent than TPC-C.
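The scaling arithmetic from the table can be checked directly:

```python
# Scaling from the TPC-E results above: raw 2P -> 4P ratio, then adjusted
# for the increase in cores per socket (22 -> 24).

tps_2s, cores_2s = 4938.14, 44
tps_4s, cores_4s = 9068.00, 96

raw = tps_4s / tps_2s
# normalize to equal cores per socket
per_core_adj = raw * ((cores_2s / 2) / (cores_4s / 4))

print(round(raw, 2))           # ~1.84
print(round(per_core_adj, 2))  # ~1.68
```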
The details of the current Intel Hyper-Threading implementation are not examined here.
The purpose of Hyper-Threading is to make use of the dead cycles
during memory access or other long latency operations.
Hyper-threading is an alternative solution to
the memory latency problem.
Work arounds are good, but there are times when it is necessary
to attack the problem directly.
Single thread performance is important, perhaps second after overall system throughput.
Given that the CPU clock is more than 150 times that of memory round-trip access
time, it is surprising that Intel only implements 2-way HT.
The generic term is simultaneous multi-threading (SMT).
IBM POWER8 is at 8-way, up from 4-way in POWER7.
SPARC has been 8-way for a few generations?
There is a paper on one of the two RISC processors
stating that there were technical challenges in SMT at 8-way.
So, a 4-way HT is reasonable.
This can nearly double transaction performance.
The effort to increase HT from 2-way to 4-way should not be overwhelming.
Given the already impossibly complex
complexion of the processor,
"difficult should be a walk in the park".
It might help if there were API directives from the operating
system to the processor on code that runs well with HT and code that does not.
One other implication of the memory latency effect is that
scaling versus frequency is poor.
The origin of the memory latency investigation
was an incident
in which a system UEFI/BIOS update reset processors
to power-save mode, changing base frequency from 2.7GHz to 135MHz.
There was a 3X increase in worker (CPU) time on key SQL statements.
A 20X change in frequency for 3X performance.
In other words, do not worry about the lower frequency of the
high core count processors.
They work just fine.
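The latency model also explains why a 20X frequency drop cost only about 3X: memory time is fixed in nanoseconds, so it shrinks in cycle terms as the clock slows. The model parameters (10M cycles, 5% miss fraction, 67ns) are carried over from the earlier assumptions:

```python
# Frequency scaling under the latency model: transaction time at 2.7GHz
# versus the 135MHz power-save clock from the incident described above.

BASE, FRAC, MEM_NS = 10e6, 0.05, 67

def tx_time_ms(ghz):
    # cycles grow with clock speed because the fixed 67ns costs more cycles
    cycles = BASE + BASE * FRAC * (MEM_NS * ghz - 1)
    return cycles / (ghz * 1e9) * 1e3

slow, fast = tx_time_ms(0.135), tx_time_ms(2.7)
print(round(2.7 / 0.135))     # 20X frequency difference
print(round(slow / fast, 1))  # ~2.8X difference in CPU time
```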
But check the processor SKUs carefully; certain models do not
have Hyper-Threading, which is important.
It also appears that turbo-boost might be more of a problem than a benefit.
It might be better to lock processors to the base frequency.
Hekaton - MVCC
Contrary to popular sentiment,
putting the entire database into memory on a traditional engine
having page organization and row-store does not make much of
a direct contribution in performance
over a far more modest buffer cache size.
This is why database vendors have separate engines
for memory-optimized operation,
Hekaton in the case of Microsoft SQL Server.
To achieve order of magnitude performance gain, it was necessary to
completely rethink the architecture of the database engine.
A database entirely in memory can experience substantial performance
improvement when the storage system is seriously inadequate to meet
the IOPS needed.
When people talk about in-memory being "ten times" faster,
what is meant is that if the database engine had been designed around
all data residing in memory, it would be built in a very different way
than the page and row structure implemented by INGRES in the 1970s.
Now that the memory-optimized tables feature, aka Hekaton for Microsoft
SQL Server, is available,
are there still performance requirements that have not been met?
Memory-optimized tables, and their accompanying natively compiled
procedures, are capable of unbelievable performance levels.
All we have to do is re-architect the database to use memory-optimized tables.
And then rewrite the stored procedures for native compilation.
This can be done!
And it should be done, when practical.
In many organizations, the original architects have long retired,
departed, or gone the way of Dilbert's Wally (1999 Y2K episode).
The current staff developers know that if they touch something
and it breaks, they own it.
So, if there were a way to achieve significant performance gain,
with no code changes, just by throwing money at the problem,
then there would be interest, and money.
There is not a TPC-E result for Hekaton.
This will not happen without a rule change.
The TPC-E requirement is that database size scales with the reported throughput.
The 4-way Xeon E7 v4 result of 9,068 tpsE corresponds to a database
size of 37,362GB.
The minimum expectation from Hekaton is 3X, pointing to a 100TB database.
A rule change for this should be proposed.
Allow a "memory-optimized" option to run with a database
smaller than memory capacity of the system.
About as difficult as MVCC, less upside?
Potentially the benefits of NUMA scaling and Hekaton
could be combined if Microsoft exposed a mechanism
for how key values map to a NUMA node.
It would be necessary for a collection of tables
with compound primary key to have the same lead column
and that the hash use the lead column in determining NUMA node?
The major DRAM companies are producing DDR4 at the 4Gb die level.
Samsung has an 8Gb die.
Micron has a catalog entry for 2×4Gb die in one package as an 8Gb product.
There can be up to 36 packages on a double-sided DIMM, 18 on each side.
The multiple chips/packages form a 64-bit word plus 8-bits for ECC,
capable of 1-bit error correction and 2-bit error detection.
A memory controller might aggregate multiple-channels
into a larger word and combine the ECC bits to allow for more
sophisticated error correction and detection scheme.
A 16GB non-ECC DDR4 DIMM sells for $100 or $6.25 per GB.
The DIMM is comprised of 32×4Gb die, be it 32 single-die packages
or 16 two-die packages.
The 16GB ECC U or RDIMM consisting of 36, ×4Gb die is $128,
for $8/GB net (data+ECC) or $7.11/GB raw.
There is a slight premium for ECC parts, but much less than it was in the past,
especially with fully buffered DIMMs that had an XMB chip on the module.
The 4Gb die + package sells for less than $3.
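The $/GB figures quoted above are straightforward unit arithmetic:

```python
# Per-GB pricing for the DIMMs discussed above: 16GB non-ECC at $100, and the
# 16GB ECC module (18GB raw including the 8 ECC bits per 64) at $128.

non_ecc = 100 / 16       # $/GB, 16GB non-ECC DIMM
ecc_net = 128 / 16       # $/GB over data capacity
ecc_raw = 128 / 18       # $/GB over data+ECC (72-bit wide) capacity

print(non_ecc)           # 6.25
print(ecc_net)           # 8.0
print(round(ecc_raw, 2)) # 7.11
```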
The Memory Guy
estimates that 4Gbit DRAM needs to be 70mm2 to support
a price in this range.
The mainstream DRAM is a ruthlessly competitive environment.
The 8Gbit DDR4 package with 2×4Gb die allows 32GB ECC DDR4 to sell for $250,
no premium over the 16GB part.
The 64GB ECC RDIMM (72GB raw) is priced around $1000.
This might indicate that it is difficult to put 4×4Gb die in one package,
or that the 8Gb die sells for $12 compared to $3 for the 4Gb die.
Regardless, it is possible to charge a substantial premium in
the big capacity realm.
One consequence of the price competitiveness in the mainstream DRAM
market is that cost cutting is an imperative.
Multiplexed row and column address lines originated in 1973,
allowing for lower cost package and module product.
Twenty years ago, there was discussion on going back to a full width address,
but no one was willing to pull the trigger on this.
The only concession for performance in mainstream DRAM was increasing
bandwidth by employing multiple sequential word accesses,
starting with DDR to the present DDR4.
Reduced latency DRAM appeared in 1999
for applications that needed lower latency than mainstream DRAM,
but at lower cost than SRAM.
One application is in high-speed network switches.
The address lines on RLDRAM are not multiplexed.
The entire address is sent in one group.
RLDRAM allows low latency access to a particular bank.
That bank cannot be accessed again for the normal DRAM period?
But the other banks can, so the strategy is to access banks in a round-robin fashion if possible.
A 2012 paper, lead author Nilandrish Chatterjee
has a discussion on RLDRAM.
The Chatterjee paper mentions:
RLDRAM employs many small arrays that sacrifice density for latency.
Bank-turnaround time (tRC) is 10-15ns compared to 50ns for DDR3.
The first version of RLDRAM had 8 banks,
while the contemporary DDR (just DDR then) had 2 banks.
Both RLDRAM3 and DDR4 are currently 16 banks,
but the banks are organized differently?
Micron currently has a 1.125Gb RLDRAM 3 product in x18 and x36.
Presumably the extra bits are for ECC, 4 x18 or 2 x36 forming a 72-bit path
to support 64-bit data plus 8-bit ECC.
The mainstream DDR4 8Gbit 2-die package from Micron comes in
a 78-ball package for x4 and x8 organization, and 96-ball for x16.
The RLDRAM comes in a 168-ball package for both x18 and x36.
By comparison, GDDR5 8Gb at 32-wide comes in a 170-ball BGA,
yet has multiplexed address?
The package pin count factors into cost,
and also in the die size because each signal needs to be boosted before
it can go off chip?
Digi-Key lists a Micron 576M RLDRAM3 part at $34.62, or $554/GB w/ECC,
compared with DDR4 at $8 or $14/GB also with ECC, depending on the module.
At this level, RLDRAM is 40-70 times more expensive than DDR4.
A large part of this is probably because the RLDRAM is quoted
as a specialty low volume product at high margins,
while DDR4 is quoted on razor thin margins.
The top RLDRAM at 1.125Gb capacity might reflect the size needed
for high-speed network switches
or it might have comparable die area to a 4Gb DDR?
There are different types of SRAM.
High-performance SRAM has 6 transistors, 6T.
Intel may use 8T
Intel Labs at ISSCC 2012
or even 10T for low power?
(see real world tech NTV).
It would seem that SRAM should be six or eight times less dense than DRAM,
depending on the number of transistors in SRAM, and the size of the capacitor in DRAM.
There is a Micron slide in
Micro 48 Keynote III
that says SRAM does not scale
on manufacturing process as well as DRAM.
Instead of 6:1, or 0.67Gbit SRAM at the same die size
as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area?
Another source says 100:1 might be appropriate.
Eye-balling the Intel Broadwell 10-core (LCC) die,
the L3 cache is 50mm2,
listed as 25MB.
It includes tags and ECC on both data and tags?
There could be 240Mb or more in the 25MB L3?
Then 1G could fit in a 250mm2 die,
plus area for the signals going off-die.
Digi-Key lists Cypress QDR IV 144M (8M×18, 361 pins) in the $235-276 range.
This is $15K per GB w/ECC.
It is reasonable to assume that prices for both RLDRAM and QDR SRAM
are much lower when purchased in volume?
The lowest price for an Intel processor on the Broadwell LCC die of 246mm2 is $213
in a 2011-pin package.
This would suggest SRAM at south of $1800 per GB.
While the ultra-high margins in high-end processors are desirable,
it is just as important to fill the fab to capacity.
So, SRAM at 50% margin is justified.
We could also estimate SRAM at 40X that of DRAM, per the Micron
assertion of relative density, pointing to $160-320 per GB.
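The three SRAM price estimates above come from different angles but the same kind of unit conversion. All inputs are the figures quoted in the text; the 8-die-per-GB assumption follows from ~1Gb fitting a ~250mm2 die:

```python
# Rough $/GB estimates for SRAM from the three approaches discussed above.

# 1. Digi-Key QDR IV: 144Mb part (18MB raw, 16MB data) at ~$235
qdr = 235 / (16 / 1024)            # $/GB with ECC

# 2. Cheapest Broadwell LCC (246mm2) sells for $213; assume similar
#    economics for a ~250mm2 die holding ~1Gb (128MB) of SRAM
sram_die = 213 * 8                 # 8 die per GB

# 3. Micron's ~40:1 density claim versus DRAM at roughly $4-8/GB
density_low, density_high = 40 * 4, 40 * 8

print(round(qdr))                  # 15040, i.e., ~$15K/GB
print(sram_die)                    # 1704, south of $1800/GB
print(density_low, density_high)   # 160 320
```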
Graphics and High-Bandwidth Memory
Many years ago, graphics processors diverged from mainstream DRAM.
Their requirement was for very high bandwidth at a smaller capacity
than main memory, plus other features to support the memory
access patterns in graphics.
GDDR is currently on version 5, at density up to 8Gbit, with a x32
wide path (170-ball package) versus x4, x8 and x16 for mainstream DDR4.
More recently, High Bandwidth Memory (HBM) is promoted by AMD,
and Hybrid Memory Cube by Micron.
High bandwidth memory is not pertinent to databases,
but it does provide scope on when there is need to
go a separate road from mainstream memory.
Databases on a page-row type engine do not come close to testing the limits
of DDR4 bandwidth.
This is true for both transaction processing and DW large table scans.
For that matter, neither does column-store, probably because of the CPU-cycles
expended between memory accesses.
I may have to take this back.
DDR4-2133 bandwidth is 17GB/s per channel, and 68GB/s over 4 channels.
(GB is always decimal by default for rates, but normally binary for size.)
A table scan with simple aggregation from memory is what now?
It was 200MB/s per core in Core 2 days, 350MB/s in Westmere.
Is it 500 or 800MB/s per core now?
It is probably more likely to be 500,
but let's assume 800MB/s here.
Then 24 cores (Xeon E7 v4 only, not E5) consume 19.2GB/s
(HT does not contribute in table scans).
This is still well inside the Xeon E5/7 memory bandwidth.
But what if this were read from storage?
A table scan from disk is a write to memory, followed by a read.
DDR writes to memory at the clock rate, i.e., one-half the MT/s rate.
So the realized table scan rate effectively consumes 3X its nominal bandwidth,
which here is 57.6GB/s.
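The bandwidth arithmetic above, with the 800MB/s per-core scan rate carried as the stated assumption:

```python
# Table-scan bandwidth arithmetic: DDR4-2133 channel bandwidth, the in-memory
# scan consumption for 24 cores at an assumed 800MB/s per core, and the 3X
# effective traffic when scanning from storage (write at half rate, then read).

per_channel = 2133e6 * 8 / 1e9    # GB/s, 64-bit channel at 2133 MT/s
four_channels = 4 * per_channel   # GB/s across 4 channels

cores, scan_mb_per_core = 24, 800
scan = cores * scan_mb_per_core / 1000     # GB/s consumed from memory

from_storage = scan * 3           # write (2X, at half the MT/s rate) + read
print(round(four_channels, 1))    # ~68.3
print(round(scan, 1))             # 19.2
print(round(from_storage, 1))     # 57.6
```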
To pursue the path of low latency memory,
it is necessary to justify the cost and capacity
structure of alternative technologies.
It is also necessary that the opportunity be worthwhile
to justify building one more specialty processor
with a different memory controller.
And it may be necessary to work with operating system
and database engine vendors to all be aligned
in doing what is necessary.
The Chatterjee et al
paper for Micro45 (microarch.org) shows RLDRAM3 improving throughput
by 30% averaged across 27 of the 29 components of the SPEC CPU 2006 suite, integer and fp.
MCF shows greatest gain at 2.2X.
Navigating the B-tree index should show very high gain as well.
The cost can be justified as follows.
An all-in 4-way server has the following processor and memory cost.
|Processor|4×E7-8890 v4|$7,174 ea.|$28,700|
|Memory|4TB, 64×64GB|$1,000 ea.|$64,000|
If the above seems excessive, recall that there was a time
when some organizations were not afraid to spend $1M on the
processor and memory complex,
or sometimes just that, for 60+ processor systems
(sales of 1000 systems per year?).
That was in the hope of having amazing performance.
Except that vendors neglected to stress what mattered:
SAN vendors continued to sell multi-million-dollar storage
without mention of the importance of dedicated disks for logs.
If a more expensive low latency memory were to be implemented,
the transmission time between the memory controller and DIMM,
estimated to be 10ns earlier,
should be revisited.
A RLDRAM system might still have the DIMM slot arrangement currently
in use, but other options should be considered.
An SRAM main memory should probably be in an MCM module,
or some other advanced packaging
(TSMC Hot Chips 28).
This is if enough SRAM can be stuffed into a module or package.
It would also require that the processor and memory be sold as a single
unit, instead of memory being configured later.
In the case of SRAM, the nature of the processor L3 probably needs to be reconsidered.
Before SSDs, the high-end 15K HDDs were popular with storage
performance experts who understood that IOPS was more important than capacity.
In a "short-path to bare metal" disk array,
the 15K HDD could support 200 IOPS at queue depth 1 per HDD
with low latency (5ms).
It should be possible to assemble a very large array of 1,000
disks, capable of 200,000 IOPS.
It is necessary to consider the mean-time-between-failure (MTBF),
typically cited as over 1M-hours.
There are 8,760 hours in a 365-day year.
At 1M-hr MTBF, the individual disk failure rate is 0.876%.
An array of 1000 disks is expected to see 9 failures per year.
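The failure-rate arithmetic works out as follows:

```python
# Annualized failure rate from the cited 1M-hour MTBF, and the expected
# failures per year in a 1,000-disk array.

MTBF_HOURS = 1_000_000
HOURS_PER_YEAR = 8760
DISKS = 1000

afr = HOURS_PER_YEAR / MTBF_HOURS   # per-disk annualized failure rate
failures_per_year = DISKS * afr

print(round(afr * 100, 3))          # 0.876 (%)
print(round(failures_per_year, 1))  # ~8.8, i.e., ~9 failures per year
```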
Hard disks in RAID groups will continue to operate
with a single or sometimes multiple disk failures.
However, rebuilding a RAID group after a drive failure
could take several hours, and performance is degraded
during this period.
It is not operationally practical to run a very
large disk array.
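The failure arithmetic above can be sketched in a few lines (a hedged illustration using the figures cited in the text):

```python
HOURS_PER_YEAR = 365 * 24      # 8,760 hours
MTBF_HOURS = 1_000_000         # commonly cited drive MTBF
DISKS = 1000                   # size of the hypothetical array

# Annualized failure rate per drive, and expected failures for the array
annual_failure_rate = HOURS_PER_YEAR / MTBF_HOURS    # 0.876%
expected_failures = DISKS * annual_failure_rate      # 8.76 per year

print(f"{annual_failure_rate:.3%} per drive, ~{expected_failures:.0f} failures/year")
```

With 1000 drives, roughly one failure per month, each triggering an hours-long degraded rebuild.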
The recommendation was to fill the memory slots with big DIMMs,
and damn the cost.
The common convention used to be that a NAND controller with
SATA on the upstream side would have 8 channels on the NAND side.
A PCI-E controller would have 16 or 32 NAND channels for x4 and x8 respectively.
On the downstream side, a NAND channel could have
1 or 2 packages.
There could be up to 8 chips in a package.
A NAND chip may be divided into 2 planes,
and each plane is functionally an independent entity.
An SSD with 8 packages could have 64 NAND chips comprising 128 planes.
The random IOPS performance at the plane level is better
than a 15K HDD, so even a modest collection of 24 SSDs
forms a very large array (3,072) of base units.
At the component level, having sufficient units for 1M IOPS is not difficult.
Achieving 1M IOPS at the system level is more involved.
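The component counts above multiply out as follows (an illustrative topology, per the conventions described):

```python
# One SSD: 8 channels × 1 package/channel × 8 chips/package × 2 planes/chip
channels = 8
packages_per_channel = 1
chips_per_package = 8
planes_per_chip = 2

planes_per_ssd = channels * packages_per_channel * chips_per_package * planes_per_chip
array_planes = 24 * planes_per_ssd    # a modest 24-SSD collection

print(planes_per_ssd, array_planes)   # 128 3072
```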
NVMe builds a new stack, software and hardware, for driving
the extraordinarily high IOPS possible with a large array of SSDs,
while making more efficient use of CPU than the SAS stack.
PCI-E NVMe SSDs have been around since 2014,
so it is possible to build a direct-attach SSD array
with the full NVMe stack.
NVMe over fabric was recently finalized, so SAN products
might not be too far off.
From the host operating system, it is possible to drive
1M IOPS on the NVMe stack without consuming too much CPU.
At the SQL Server level, there are additional steps,
such as determining which page to evict from the buffer cache.
Microsoft has reworked IO code to support the bandwidth
made practical with SSD for DW usage.
But given the enormous memory configuration of typical
transaction processing systems,
there may not have been much call for the ability
to do random IOPS with a full buffer cache.
But if the need arose, it could probably be done.
All Flash Array
When SSDs were still very expensive as components,
storage system vendors promoted the idea of an SSD
cache and/or a tiering structure of SSD, 10K and 7.2K HDDs.
In the last few years, upstarts have been promoting all flash.
HDD storage should not go away, but its role is backup
and anything not random IO intensive.
Rethinking System Architecture
The justification for rethinking system architecture
around low latency memory at far higher cost
is shown below.
The scaling achieved in a 4-socket system is less than exceptional
except for the very few NUMA-architected databases,
which are probably just the TPC-C and TPC-E databases.
It might be 2.2X better than single socket.
At lower latency, 40ns L3+memory, the single socket system could
match the performance of a 2-socket system with DDR4 DRAM.
If 25ns were possible, then it could even match up with the 4-socket system.
The mission of the massive memory made possible in the 4-way,
to reduce IO, is no longer a mandatory requirement.
The fact that a single-socket system with RLDRAM or SRAM
could match a 4-socket with massive memory
allows very wide latitude in cost.
RLDRAM may reside inside or outside of the processor.
If outside, thought should be given on how to reduce
the transmission delay.
SRAM should most probably be placed inside the processor package,
so the challenge is how much could be done.
Should there still be an L3?
Any latency from processor core to memory must be minimized
as much as possible as warranted by the cost of SRAM.
Below are the simple-model memory latency calculations
for a single socket with L3+memory latency of 43 and 25ns.
These are the values necessary for the single-socket
system to match 2 and 4-socket systems respectively.
The table columns are: GHz, L3+mem, remote, sockets, avg. mem, mem cycles, fraction, total cycles, tps/core, and total tx/sec.
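A hedged sketch of the kind of simple model the table represents (the parameter values below, 100K compute cycles and 2,000 memory accesses per transaction, are illustrative assumptions, not the author's exact figures):

```python
def avg_mem_ns(local_ns: float, remote_ns: float, sockets: int) -> float:
    """Average latency if accesses spread uniformly across memory nodes."""
    remote_fraction = (sockets - 1) / sockets
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns

def tx_per_sec_per_core(ghz: float, mem_ns: float,
                        compute_cycles: float = 100_000,
                        mem_accesses: float = 2_000) -> float:
    """Transaction cost = fixed compute cycles + memory-stall cycles."""
    mem_cycles = mem_accesses * mem_ns * ghz     # ns × GHz = core cycles
    return ghz * 1e9 / (compute_cycles + mem_cycles)

# Single socket at 43ns vs 2-socket DDR4 (67ns local, 125ns remote)
single = tx_per_sec_per_core(3.0, 43)
dual = tx_per_sec_per_core(3.0, avg_mem_ns(67, 125, 2))   # 96ns average
print(single / dual)   # per-core advantage of the low latency socket
```

With these assumed values the 43ns single-socket core is nearly 1.9× faster per core, which is how it can approach a 2-socket system whose scaling is well under 2X.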
In the examples above, Hyper-Threading should still have good scaling to 4-way
at 43ns, and some scaling to 4-way at 25ns memory latency.
The new memory architecture does not mean that DDR4 DRAM
is excluded.
It is an established and moderately inexpensive technology.
There could still be DRAM memory channels.
Whether this is a two-class memory system or
perhaps DDR memory is accessed like a memory-mapped file
can be debated elsewhere.
Xeon Phi has off-package DDR4 as memory node 0 and on-package MCDRAM
as memory node 1, all to the same processor.
It is acknowledged that the proposed system architecture is not
a new idea.
The Cray-1 used SRAM as memory, and DRAM as storage?
For those on a budget, the Cray-1M has MOS memory.
Circumstances of the intervening years favored processors with SRAM cache
and DRAM as main memory.
But the time has come to revisit this thinking.
While working on this, I came across the slide below in
J Pawlowski, Micron,
Memory as We Approach a New Horizon.
The outline of the Pawlowski paper includes high bandwidth, and persistent memory.
Deeper in, RL3 Row Cycle Time (tRC) is 6.67 to 8ns, versus 45-50ns for DDR4.
I am guessing that the large number of double-ended arrows between processor and
near memory means high bandwidth.
And even bandwidth to DIMMs is substantial.
The devices to the right seem to be storage.
Does ASIC mean logic?
Instead of just accessing blocks, it would be useful to say:
read the pointer at address A, then fetch the memory that A points to.
Below is roughly Intel's vision of next-generation system architecture,
featuring 3D XPoint.
The new Intel and Micron joint non-volatile technology is promoted
as having performance characteristics
almost as good as DRAM, higher density than DRAM,
and cost somewhere in between DRAM and NAND.
The full potential of 3D XPoint cannot be realized as a PCI-E device.
The idea is then to have 3D XPoint DIMM devices on the memory interface
along with DRAM.
The argument is that memory configurations in recent years
have become ridiculously enormous.
That much of it is used to cache tepid or even cool data.
In this case, DRAM is overkill.
3D XPoint is almost as good:
it costs less, consumes less power, is persistent,
and will allow even larger capacity.
In essence, the Intel vision acknowledges the fact
that much of main memory is being used for less than hot data.
The function of storing not so hot data can be accomplished
with 3D XPoint at lower cost.
But this also implies that the most critical functions
of memory require far less capacity than that of recent generation
systems.
In the system architecture with a small SRAM or RLDRAM
main memory, there will be more IO.
To a degree, IO at 100µs to NAND is not bad,
but the potential for 10µs or less IO to 3D XPoint
further validates the concept and
is too good to pass up.
Below is my less fancy representation of the Micron system concept.
The Right Core
The latest Intel mainline Core-i processor has incredibly powerful
cores, with 8-wide superscalar execution.
A desktop version of Kaby Lake, 7th generation,
has 4.2GHz base frequency and 4.5GHz turbo.
This means an individual core can run at 4.5GHz, if not significantly
higher, but must throttle down to 4.2GHz
so that four cores plus graphics and the system agent stay under 91W.
A reasonable guess might be that power consumption is 20W per core at 4.2GHz?
Sky Lake top frequency was 4.0GHz base and 4.2GHz turbo.
Broadwell is probably 3.5GHz base and 3.8GHz turbo, but Intel
did not deploy this product as widely as normal.
In transaction processing, this blazing frequency is squandered
on memory latency.
The server strategy is high core count.
At 24 cores, frequency is throttled down to 2.2GHz to stay under 165W.
The Broadwell HCC products do allow turbo mode in which a few cores
can run at up to 3.5 or 3.6GHz.
Every student of processor architecture knows the foundations of Moore's Law.
One of the elements is that on doubling the silicon real estate
at a fixed process, our expectation is to achieve 1.4X increase in performance.
(The other element is a new process with 0.71X linear shrink yields 50% frequency,
also providing 1.4X performance.)
The Intel mainline core is two times more powerful than what
can be utilized in a high core count processor.
In theory, it should be possible to design a processor core
with performance equivalent to the mainline core at 2.2GHz
(see SQL on Xeon Phi).
In theory, this new light core should be one-quarter the size of the
mainline core.
(Double the light core complexity for 1.4X performance; double again for another 1.4X,
for a cumulative 2X over baseline.)
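This scaling rule (sometimes called Pollack's Rule) amounts to performance varying with the square root of area, from which the quarter-size claim follows directly:

```python
import math

# Performance ~ sqrt(area): doubling area gives ~1.4X performance.
def relative_perf(area_ratio: float) -> float:
    return math.sqrt(area_ratio)

print(round(relative_perf(2.0), 2))   # 1.41, the 1.4X from doubling area
print(relative_perf(4.0))             # 2.0, double-double for 2X
# Inverting: a core with half the performance needs (0.5)^2 = 1/4 the area.
print(0.5 ** 2)                       # 0.25
```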
The new light core would be running at maximum design frequency to match
the 2.2GHz mainline, whatever frequency that might be.
We can pretend it is 2.2GHz if that helps.
This new core would have no turbo capability.
What is the power consumption of this core?
Perhaps one-quarter of the mainline, because it is one-quarter the size?
Or more because it is running at a slightly higher voltage? (this is covered elsewhere).
It might be possible to fit four times as many of the light cores on the same die
size, assuming cache sizes are reduced.
But maybe only 3 times as many cores can be supported to stay within power limits?
This is a much more powerful multi-core processor
for multi-threaded server workloads.
How valuable is the turbo capability?
The turbo boost has value because not everything can be made heavily multi-threaded.
Single or low-threaded code might not be pointer chasing code,
and would then be fully capable of benefitting from the full power of the Intel mainline core.
A major reason that Intel is in such a strong position is that they have had
the most powerful core for several years running,
and much of the time prior to the interlude period.
(AMD does have a new core coming out and some people think highly of it.)
Intel has two versions of each manufacturing process, one for high performance,
and another for low power.
Could the mainline core be built on the lower power process?
In principle this should reduce power to a greater degree than scaling voltage down.
Would it also make the core more compact?
We could speculate on theory, applying the general principles of Moore's Law.
But there is a real product along these lines, just targeted towards a different
market. The Xeon Phi 200, aka Knights Landing,
has 72 Atom cores, albeit at 245W (260W with fabric).
The current Phi is based on the Airmont Atom core.
(the latest Atom is actually Goldmont).
The recent Atom cores have a 14-stage pipeline versus 14-19 for Core?
Airmont is 3-wide superscalar, with out-of-order execution, but does not have a µOP cache?
The true top frequency for Airmont is unclear; some products based on
Airmont are listed as 2.6GHz in turbo.
On Xeon Phi, the frequency is 1.5GHz base, 1.7GHz turbo.
It might be that low core count processors will always be able to operate
at a higher voltage for maximum frequency,
while high core count products set a lower voltage resulting in lower frequency,
regardless of intent.
Below is a diagram of Knights Landing, or Xeon Phi 200
from Intel's Hot Chips 27 (2015).
The processor and 8 MCDRAM devices are on a single multi-chip-module
(package) SVLC LGA 3647.
The MCDRAM is a version of Hybrid Memory Cube?
(Intel must have their own private acronyms.)
Each device is a stack of die, 2GB, for a total of 16GB
with over 400GB/s bandwidth.
There are also 3 memory controllers driving a total
of 6 DDR4 memory channels for another 90GB/s bandwidth (at 2133 MT/s).
Only 1 DIMM per channel is supported;
presumably the applications are fine with 384GB but want extreme bandwidth.
The Xeon Phi is designed for HPC.
As is, it might be able to deliver impressive
transaction processing performance,
though perhaps not without tuning at many levels.
The question is, how would Knights Landing perform on transaction processing
had the memory been designed for latency
instead of bandwidth?
I suppose this could be tested simply by comparing an Airmont Atom against
a Broadwell or Skylake?
The theory is that the memory round-trip latency dominates,
so the 8-wide superscalar of Haswell and later has little benefit.
Even if there is some code that can use the wide superscalar,
the benefit is drowned out by code that waits for memory accesses.
Even in transaction processing databases,
not everything is transactions, i.e., amenable to wide parallelism.
Some code, with important functions, does benefit from the full capability
of the mainline Intel core.
Perhaps the long-term solution is asymmetric multi-core,
two or four high-end cores, and very many mini-cores.
SSE/AVX Vector Unit (SIMD)
The vector (SSE/AVX) unit is a large portion of the core area.
These are not used in transaction processing
but are used in the more recent column-store engine.
Microsoft once evaluated the use of the Intel SSE registers,
but did not find a compelling case.
It might have been on the assumption of the existing page structure.
Perhaps what is needed is to redesign the page structure
so that the vector registers can be used effectively.
The SQL Server 8KB page, 8,192 bytes, has a row header of 96 bytes,
leaving 8096 bytes.
Row offsets (slot array of 2 byte values) are filled in from the end of the page.
See Paul Randal, SQLskills,
Anatomy of a Record.
Within a page, each row has a header (16-bytes?) with several values.
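As a rough sketch of the layout just described (the sizes follow the text, but this is illustrative, not the actual SQL Server on-disk definitions), reading a row offset from the trailing slot array might look like:

```python
PAGE_SIZE = 8192     # SQL Server page size
HEADER_BYTES = 96    # page header, leaving 8,096 bytes for rows and slots

def slot_offset(page: bytes, slot: int) -> int:
    """Slot array entries are 2-byte row offsets filled in from the page end."""
    pos = PAGE_SIZE - 2 * (slot + 1)
    return int.from_bytes(page[pos:pos + 2], "little")

# Tiny demo: a blank page with two slot entries written at the tail
page = bytearray(PAGE_SIZE)
page[PAGE_SIZE - 2:] = (96).to_bytes(2, "little")                # slot 0
page[PAGE_SIZE - 4:PAGE_SIZE - 2] = (200).to_bytes(2, "little")  # slot 1
print(slot_offset(page, 0), slot_offset(page, 1))                # 96 200
```

The point of the proposed redesign is that these scattered 2-byte reads could instead be one contiguous vector-register load.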
The goal of redesigning the page architecture is so that the slot
and header arrays can be loaded into the vector registers
in an efficient manner.
This might mean moving the slot array and other headers up front.
SQL Server would continue to recognize the old page structure.
On index rebuild, the new page structure would be employed.
The necessary instructions to do row-column byte offset calculations
directly from the vector register would have to be devised.
This needs to be worked out between Intel and various database vendors.
Perhaps the load into the vector registers bypasses L1 and/or L2?
It would be in L3 for cache coherency?
The current Xeon E5/7 processors, with the latest on the Broadwell
core, have 16 vector registers of 256-bits (32 bytes) totaling 512 bytes.
The Skylake has 32 registers of 512-bits, 2KB of registers.
This is too much to waste.
If they cannot be used, then the processor
with special memory controller
should discard the vector unit.
The main purpose of this article was to argue
for a new system architecture, having low latency memory,
implying a processor architecture change,
with a major focus on the feasibility for transaction processing
databases, and mostly as pertinent to Intel processors.
However, all avenues for significant transaction performance improvement
are worth considering:
- Hyper-Threading to 4-way
- Memory-optimized tables with natively compiled procedures
- Database architected for NUMA
- Multi-core - the right size core?
- 3D XPoint
- SSE/AVX Vector Instructions
Increasing HT from 2-way to 4-way has the potential to nearly
double transaction processing performance.
Other options have greater upside, but this is a drop-in option.
Memory-optimized tables and natively compiled procedures combined
have the greatest upside potential.
People do not want to hear that the database should be
re-architected for NUMA scaling.
If it runs fine on single-socket or Hekaton, then fine.
But Intel mentions that scaling to the Phi core count levels
requires NUMA architecture even on one socket.
Higher core count using a smaller core will have the best
transaction throughput, but an asymmetric model might be
better.
The capabilities of the modern processor core are being squandered
in long latency for capacity that is not needed.
Figure out what is the right low latency memory.
There is definitely potential for 3D XPoint.
But it goes beyond displacing some DDR DRAM and NAND.
The true potential is to enable a smaller lower latency
true main memory, then have DDR DRAM and 3D XPoint as
something in-between memory and IO.
SSE/AVX: use it or lose it.
The growing gap between processor clock cycle time to memory latency
is not a new topic.
There have been many other papers on the advantage of various
memory technologies with lower latency.
Most of these originate either from universities or semiconductor
companies.
Everyone acknowledged that cost relative to mainstream DRAM
was a serious obstacle.
A number of strategies were conceived to narrow the gap.
Here, the topic is approached from the point of view of
database transaction processing.
TP is one of several database applications.
Database is one of many computer applications.
However, transaction processing is a significant portion
of the market for high-end processors, and
systems with maximum memory configuration.
The database community regularly spends $100K on
a processor-memory complex for performance levels
that could be achieved with a single socket,
if it were matched with the right memory.
There is valid justification from the database side
to pursue the memory strategy.
There is justification to the processor-memory vendor
that this one market has the dollar volume to make it
worthwhile.
And the extremely large memory capacity requirement is shown to now be a red herring.
In all, there are several worthwhile actions for the next generation
of server system architecture.
Probably none are mutually exclusive.
There are trade-offs between impact, cost and who does the heavy lifting.
No single factor wins in all cases, so a multi-prong attack is the more
promising approach.
The diagram below might be helpful in the discussion.
The elapsed time for a transaction is the weighted sum of operations
that incur a wait at each level.
Single thread performance is the inverse of elapsed time.
For throughput, memory access latency can be partially hidden with hyper-threading.
IO latency is hidden by asynchronous IO, but there is an overhead to that too.
Suppose D is the latency for DDR memory,
and S is the latency of some low latency memory,
both inclusive of transmission time and possibly L3.
There is no read IO in the DDR system.
Suppose x is the fraction of memory accesses that
are outside of the fast memory, and I is the latency of those accesses.
The term I might represent accesses to DDR or 3D XPoint
on the memory interface via a memory access protocol
so it is really not IO.
Or it could be to 3D XPoint or NAND attached to PCI-E
via an IO protocol.
The criterion for the small faster memory being an advantage in elapsed time,
exclusive of operations that occur inside L3, is as follows.
(1-x)×S + x×I < D
x < (D-S)/(I-S)
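Plugging in illustrative numbers (hedged assumptions: D = 67ns, the DDR local latency from the Intel example, S = 25ns for the fast memory, and two candidate values for I):

```python
def break_even_x(D: float, S: float, I: float) -> float:
    """Largest miss fraction x for which (1-x)*S + x*I < D holds."""
    return (D - S) / (I - S)

# I on the memory interface (125ns, remote-node class) vs PCI-E class (10µs)
print(round(break_even_x(67, 25, 125), 3))      # 0.42: up to ~42% of accesses
print(round(break_even_x(67, 25, 10_000), 4))   # 0.0042: well under 1%
```

This is why the latency of I dominates: a memory-interface secondary tier tolerates two orders of magnitude more misses than a PCI-E attached one.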
The objective is to achieve a large gain in performance via elapsed time.
Only so much can be gained on the numerator D-S, so much depends on the
latency of I.
If the secondary device were DDR or 3D XPoint on the memory interface,
then a very high value of accesses (x) could be allowed while
still achieving good performance gain.
If it were on PCI-E, then 3D XPoint might have a strong advantage over NAND.
In the discussion on Knights Landing, I suggested that the Atom core might not be bad
for transaction processing.
The cheapest Xeon Phi is the 7210 at $2438. About $4700 in a system.
What is the difference between Atom C2750 and 2758? Both are Silvermont 8-cores, no HT.
Use ECC SODIMM.
Atom has changed since its original inception,
which did not use out-of-order execution, for simplicity and power-efficiency.
Silvermont added OOO; not sure about Goldmont.
Is Atom to become a slimmed-down Core, with 3-wide superscalar,
manufactured on the SoC version of the process?
I will try to sort out the material and redistribute
over several articles as appropriate.
There is an article saying
that a new Intel processor architecture to succeed the lake processors
(Sky, Cannon, Ice and Tiger)
will be "faster and leaner" and, more interestingly, might not be entirely
compatible with older software.
The original source is
I suppose it is curious that the Lake processors form a double tick-tock, or now process-architecture-optimization (PAO), but skipped Kaby and Cannon.
Both the bridge (Sandy and Ivy) and well processors (Has and Broad) each had only
one tick-tock pair.
Naturally, I cannot resist commenting on this.
In the really old days, processor architecture and instruction set architecture (ISA) were somewhat
the same thing.
The processor implemented the instruction set, so that was the architecture.
I am excluding the virtual-architecture concept, where lower cost versions
would not implement the complete instruction set in hardware.
The Pentium Pro was a significant step away from this,
with micro-architecture and instruction set architecture now
largely different topics.
Pentium Pro has its own internal instructions, called µops.
The processor dynamically decodes X86 instructions to the "native" µops.
This was one of the main concepts that allowed Intel to borrow
many of the important technologies from RISC.
The Pentium 4 processor, codename Willamette, had a trace cache
that was a cache for decoded instructions.
This may not have been in the Core 2 architecture that
followed Pentium 4.
My recollection is that Pentium Pro had 36 physical registers,
of which only 8 are visible to the X86 ISA.
The processor would rename the ISA registers as necessary to support
out-of-order execution.
Pentium 4 increased this to 128 registers.
Nehalem micro-architecture diagrams
do not mention a µop cache
(somehow the acronym is DSB), but Sandy Bridge
and subsequent processors do.
This is curious because both Willamette and Nehalem
are Oregon designs, while Core 2 and Sandy Bridge are Haifa designs.
The other stream that comes into this topic involves
the Intel Itanium adventure.
The original plan for Itanium was to have a hardware (silicon) X86 unit.
Naturally, this would not be comparable to the then contemporary
X86 processors, which would have been Pentium III, codename Coppermine
at 900MHz, for Merced.
So by implication, X86 execution would probably be comparable to
something several years old, a Pentium II 266MHz with luck,
and Itanium was not lucky.
By the time of Itanium 2, the sophistication of software CPU emulation
was sufficiently advanced that the hardware X86 unit was discarded.
In its place was the
IA-32 Execution Layer.
Also see the
IEEE Micro paper on this topic.
My recollection was the Execution Layer emulation was not great but not bad either.
The two relevant technologies are: one, the processor having native µops
instead of the visible X86 instructions,
and two, the Execution Layer for non-native code.
With this, why is the compiler generating X86
(ok, Intel wants to call these IA-32 and Intel 64 instructions) binaries?
Why not make the native processor µops visible to the compiler?
When the processor detects a binary with native micro-instructions,
it can bypass the decoder?
Also make the full set of physical registers visible to the compiler?
If Hyper-threading is enabled, then the compiler should know to only
use the correct fraction of registers.
I am inclined to also say that the more the compiler knows about the underlying
hardware, the better it can generate binaries to fully utilize available resources,
with less reliance on the processor doing dynamic scheduling for parallelism.
But of course, that was what Itanium was,
and we would need to understand why Itanium did not succeed.
My opinion was that EPIC was really better suited to scientific computing
and not logic heavy server applications.
Have one or two generations of overlap,
in which Microsoft and the Linux players make a native micro-op operating system.
Then ditch the hardware decoders for X86.
Any old code would then run on the Execution Layer,
which may not be 100% compatible.
But we need a clean break from old baggage
or it will sink us.
Off topic, but who thinks legacy baggage is sinking the Windows operating system?
Of course, I still think that one major issue is that Intel is stretching
their mainline processor core over too broad a spectrum.
The Core is used in both high-performance and high-efficiency modes.
For high performance, it is capable of well over 4GHz,
probably more limited by power than transistor switching speed.
For power efficiency, the core is throttled to 2 or even 1 GHz.
If Intel wants to do this in a mobile processor, it is probably not
that big a deal.
However, in the big server chips, with 24 cores in Xeon v4
and possibly 32 cores in the next generation (v5),
it becomes a significant matter.
The theory is that if a given core is designed to operate at
a certain level, then doubling the logic should achieve a 40%
increase in performance.
So if Intel is deliberately de-rating the core in the Xeon HCC die,
then they could build a different core targeting one half
the original performance at perhaps one quarter the complexity.
So it should be possible to have 100 cores with half the performance
of the Broadwell 4GHz capable core,
i.e., equivalent to Broadwell at 2GHz?
If this supposed core were very power efficient, then perhaps we could
even support the thermal envelope of 100 mini-cores?
Of course, not every application is suitable for wide parallelism.
I would like to see Intel do a processor with mixed cores.
Perhaps 2 or 4 high performance cores and 80 or so mini-cores?
A really neat trick would be if the GPU were programmable,
but graphics vendors have things along this line?
It should be intuitively obvious that round-trip memory access latency
is one of the most important factors in modern server system architecture
for transaction processing databases.
Yet this is a topic that no one talks about.
Vendors do not want to discuss this because no near or long-term actions are planned.
Outsiders cannot write a meaningful article because too much important information is missing.
In some respects, there is nothing practical we can do about latency
in terms of memory components that we elect to employ.
However, we can influence latency because Intel processors since Nehalem in 2009/10 have integrated memory controllers.
Hence all multi-socket systems since then have non-uniform
memory access (NUMA) and this is one mechanism that determines latency.
We will start by looking at modern Intel processor architecture
and the time scales involved.
Then continue by examining system architecture with respect to
memory latency.
From there, we can do basic calculations on how we expect
memory latency to impact transaction processing performance.
This basic model still needs to be backed up by real experimental measurements,
but there is enough to provide the basis for further investigations.
Pending such, one of the important conclusions is that it is time to re-examine
fundamental assumptions on server sizing.
Broadwell, Xeon E3 v4
The diagram below is a representation of a modern Intel mainline
(desktop and mobile) processor.
The proportions are from Broadwell, the 5th generation Intel Core i3, i5 and i7 and Xeon E3 v4,
as Sky Lake has a different layout arrangement.
The four cores, system agent, and graphics communicate over a ring.
The memory controller and IO are in the system agent,
but the signals that go off-chip are in the interface shown as a separate unit.
Never mind the above; I don't think that's how Broadwell 4c is laid out.
See below left; Sky Lake is below right.
Intel and system vendors like to talk about specifications
such as processor frequency and memory transfer rate.
Recent generation Intel processor cores are capable of operating in
the high 3 to low 4GHz range.
It is quite possible that they could run even higher,
but are power constrained.
The memory interface is DDR4 at 1866 and 2133MT/s.
The term used is mega-transfers per second instead of MHz.
This is because the clock is cited in MHz, and the data transfer rate
is eight times the memory clock rate.
Address and writes should be one-half the data rate.
Using MT/s for the data transfer rate is much clearer:
there is a memory clock, 266.67MHz for example,
an I/O bus clock that is 4 times higher, 1066MHz,
and the data transfer rate at double the I/O clock, 2133MT/s.
DDR memory timings
are specified at the memory interface and
consist of 4 values: CL, RCD, RP and RAS,
the last of which is frequently not cited, and sometimes only the
CAS Latency is cited.
The value is in terms of the I/O clock cycles.
In database transaction processing, the memory access pattern
is largely unpredictable, amounting to a random memory row access,
so the latency is RP + RCD + CL, for Row Pre-charge, Row to Column Delay,
and CAS Latency.
For Registered DDR4-2133, all three values are 15?
(Does this include the extra cycle for registered memory?)
The 2133 MT/s corresponds to 1066MHz, so each cycle is 0.938ns,
and 15 cycles is 14ns.
The total memory latency at the memory interface is then 42ns?
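The timing arithmetic above works out as follows (using the text's figures):

```python
data_rate = 2133                    # MT/s
io_clock_mhz = data_rate / 2        # 1066.5 MHz I/O clock
cycle_ns = 1000 / io_clock_mhz      # ~0.938 ns per I/O clock

CL = RCD = RP = 15                  # DDR4-2133 timings cited above
random_row_latency = (RP + RCD + CL) * cycle_ns   # random row access
print(round(cycle_ns, 3), round(random_row_latency))   # 0.938 42
```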
Cache and Memory Latency
There are applications in which processor frequency and memory bandwidth
matter a great deal.
But neither are particularly important for transaction processing databases.
The diagram below calls out some details between the processor core,
various levels of cache, and memory that are more relevant to databases.
At 3GHz, the processor core cycle time is 0.33ns, the inverse of the frequency.
The L1 cache is cited as having 4 cycle latency.
The L1 is part of the processor core execution pipeline,
so to some degree, L1 latency is hidden.
L2 cache latency is cited as 12 cycles.
It is not certain as to whether this is fixed in cycles,
or is actually something like 4ns.
L3 cache probably depends on the number of cores and other circumstances.
It is shown here as 40+ cycles.
If L3 latency is actually time based, 15ns for example,
then number of cycles would depend on the core clock rate.
I am not sure if the L3 latency incorporates L2 latency.
Memory latency is probably somewhat over 50ns, plus the L3 latency.
How much of this is at the DRAM chip versus the transmission delay
between the processor to DRAM and back?
Igor Pavlov provides both the 7z benchmark and results for many processors
at 7-cpu, source code included.
7-cpu lists Haswell L3 at 36 cycles, and memory at L3 + 57ns.
Sky Lake L3 is 42 cycles and memory at L3 + 51ns.
(This seems to imply that the transmission delay from processor
to memory and back is about 10ns?)
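Converting the 7-cpu cycle counts to time at an assumed 3GHz core clock (an assumption for illustration; the transmission estimate follows from subtracting the 42ns DRAM interface latency computed earlier):

```python
ghz = 3.0                          # assumed core clock for the conversion
skylake_l3_ns = 42 / ghz           # 42 cycles -> 14 ns
skylake_mem_ns = skylake_l3_ns + 51    # ~65 ns total round trip

dram_interface_ns = 42             # RP + RCD + CL at the DIMM, from earlier
transmission_ns = 51 - dram_interface_ns   # ~9-10 ns estimate
print(round(skylake_mem_ns), transmission_ns)   # 65 9
```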
Intel has their own
Memory Latency Checker tool.
One of the Intel pages shows an example with local node memory latency as 67ns,
and remote node at 125ns,
which will be used in the examples below.
Broadwell EP and EX
The above diagrams were for the desktop and Xeon E3 processors.
We are really more interested in the Intel EP/EX processors used
in the Xeon E5 and E7 product lines.
The latest Xeon E5 and E7 processors are v4, based on the Broadwell core.
There are 3 layouts, HCC, MCC and LCC with 24, 15 and 10 cores respectively.
Intel provides functional layouts for all three,
but the actual die layout is provided only for the LCC model.
The Broadwell HCC representation is shown below.
There are 6 rows and 4 columns of cores, 24 cores total.
The 24-core model is only available in the E7.
The top E5 has only 22 cores enabled.
Two columns of cores communicate over the interconnect dual ring (counter-rotating).
The two double rings are connected by a switch.
The Intel functional diagram actually shows both QPI and PCI-E on the left side.
Below are my representations of the MCC model on the left with 3 columns of 5 rows
for 15 cores
and the LCC model on the right with 2 columns of 5 rows for 10 cores.
In the LCC model, there is no ring switch.
The PCI-E and memory actually do overhang the core,
meaning that the space to the right of the cores is not fully utilized.
As even the LCC is a high margin product, an exact and efficient fit is not necessary?
The MCC arrangement of both QPI and PCI-E on the left side
of the switch connected to the left ring
and a memory controller on each side
of the ring switches matches the Intel functional layout,
but I do not know if there is overhang.
Regardless of the actual placement of the controllers,
the interface signals for QPI and PCI-E probably run along the length
of the upper edge,
and the interface for the memory signals probably runs along most of the lower edge.
I am inclined to believe that L3 latency is higher in the E5/E7 processors
as the path is longer and more complicated.
On the LCC die, there are 10 cores, the memory, QPI and PCI-E controllers
on one ring.
However, if the desktop and E3 processors have only one ring (single direction),
then it is possible that the bidirectional ring in the E5/E7 processors
helps keep L3 latency low?
Presumably latencies on the MCC and HCC die are longer than on the LCC
because both rings must be checked?
Search: Haswell Cluster on Die (COD) Mode, but filter out Call of Duty.
An Intel slidedeck on this suggests that memory latency is higher
when crossing the coherency bus switches.
Xeon E5 v4
Below is a representation of a 2-socket Xeon E5 system with the HCC die.
Two of the 24 cores are marked out, as the E5 has a maximum
of 22 cores.
The E5 has 2 full QPI links, and both are used to connect to the other processor.
For a core in the left socket, the memory attached to that socket is local
memory, and memory attached to the other socket is remote memory.
It should not be difficult to appreciate that there is a large difference
in memory access time between local and remote memory nodes.
The Intel Memory Latency Checker example
has 67ns for local and 125ns for remote memory,
which will be the values we use in a following example.
I am not certain if these values are for unbuffered or registered memory.
Unbuffered memory should have lower latency,
but registered memory is available in larger capacities, 64GB versus 16GB.
Xeon E7 v4
Below is a representation of the 4-socket Xeon E7 system.
The E7 has 3 full QPI links, one connecting directly to each of
the three other processors.
All remote processors are then said to be 1-hop away.
The difference of significance here between the E5 and E7 systems is that
the E5 memory channel connects directly to memory.
The E7 connects to a scalable memory buffer (SMB, other names have been used too),
that splits into two memory channels on the downstream side.
Because there are so few acronyms in effect, the interface from processor
to SMB is SMI (Scalable Memory Interconnect).
The SMB doubles the number of DIMM sites, and in effect,
doubles the memory capacity per socket.
This difference in the memory arrangement between processors
designed for 2 and 4-socket systems
has been a recurring pattern in Intel system architecture for a long time,
though it was not present in all systems, and there were variations.
In the current generation, there is a four-socket version of the E5,
which does not have the SMB in the path to memory,
but each processor only has two QPI links,
so one of the remote sockets is two-hops away.
Long ago, maximum memory capacity was very valuable for database servers
in reducing IO to levels that the storage system could actually deliver.
The extra latency incurred from the SMB chip was worth the price.
Since then, memory configurations have increased to stupendous levels,
and data read IO has been reduced to negligible levels.
NAND flash has also become very economical,
allowing storage now to be capable of formerly impossibly high IOPS.
Of course, this occurred after it was no longer absolutely essential.
In more recent years, with maximum memory configuration sometimes more than
double what is already extravagant,
we might not want to incur the extra latency of the SMB?
Highly Simplified Example
Before proceeding, I will take the opportunity to say that the modern
microprocessor is an enormously complex entity in so many ways
that it defies characterization in anything less than an exceedingly complicated model.
The processor has multiple cores.
Each core has pipelined superscalar and out-of-order execution.
Each core has dedicated L1 I+D cache and dedicated L2 unified cache.
And there is decoded micro-op cache as well.
The processor has L3 cache shared among the cores.
That said, I will now provide an example based on an extremely simplified
model of the core-processor memory complex.
Suppose we have a fictitious 3GHz processor, cycle time 0.33ns,
with a very fast cache, that executes one instruction
per cycle when all memory accesses hit cache.
Suppose further that memory latency is 67ns,
and the long latency L3 cache effects are not considered.
Suppose we have a similarly fictitious transaction of 10M instructions.
If everything is in cache, the transaction completes in 3.33 million nano-seconds,
or 3.33 milli-sec, and our performance is 300 transactions per second per core.
Now suppose that 5% (1/20) of instructions require a round-trip memory access
before proceeding to the next step.
The 0.95 fraction of instructions that have a cache hit consume 9.5M cycles.
The 0.05 fraction of 10M is 0.5M instructions that miss cache.
Each of these requires a round-trip memory access of 67ns or 201 cycles,
for 100.5M cycles in total. The total time to complete the transaction
is now 110M cycles.
Performance is now 27.27 transactions per sec per core
instead of 300 tps.
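The arithmetic above can be sketched as a small script (a sketch of the simplified model only, not a measurement; the function name and structure are mine):

```python
def tps_per_core(ghz, mem_ns, miss_fraction, instructions=10_000_000):
    """Toy model: cache hits cost 1 cycle; each miss stalls for a full
    round-trip memory access before the next instruction can proceed."""
    mem_cycles = mem_ns * ghz  # round-trip latency expressed in core cycles
    total_cycles = (instructions * (1 - miss_fraction)
                    + instructions * miss_fraction * mem_cycles)
    return ghz * 1e9 / total_cycles  # transactions per second per core

# All-in-cache: 300 tps; with 5% misses at 67ns: ~27.27 tps
print(round(tps_per_core(3, 67, 0.0)))      # 300
print(round(tps_per_core(3, 67, 0.05), 2))  # 27.27
```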
GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3 | 67ns | - | 1 | 67ns | 201 | 0.05 | 110M | 27.27 | 27.27
Now suppose that we have a 2-socket system,
meaning two memory nodes,
and that we have not specially architected our database in a manner
to achieve higher memory locality than expected from random access patterns.
Any memory access is equally likely to be in either node.
The local node memory continues to be 67ns
and remote node memory access is 125ns.
Average memory access is now (67+125)/2 = 96ns, or 288 cycles.
GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3 | 67ns | 125ns | 2 | 96ns | 288 | 0.05 | 153.5M | 19.54 | 39.08
Without a database architected to achieve memory locality,
we have lost 28% performance per core (1 - 19.54/27.27).
Of course, we did double the number of cores, so throughput has increased by 43% (2*19.54/27.27).
Alternatively, the performance per core in the single socket system is 39.5% better
than in the 2-socket system (27.27/19.54).
This magnitude is important.
Did your vendors and infrastructure experts forget to mention this?
Now suppose we are on a 4-socket Xeon E7-type system with the same database,
so memory access is equally probable to any of the four nodes.
Local memory access is 25%, and 75% is remote to one of the three other sockets.
All sockets are directly connected, so all remote nodes are one-hop away.
Now, recall that Xeon E7 has a memory buffer in the path between
the processor (memory controller) and memory.
Let's suppose that the SMB adds 15ns additional latency.
(I do not know what the number really is. It is not free.
The magic of doubling memory capacity comes at a price.)
Local node memory access is now 82ns and remote node is 140ns.
Average memory access is (82+3*140)/4 = 125.5ns, or 376.5 cycles.

GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3 | 82ns | 140ns | 4 | 125.5ns | 376.5 | 0.05 | 197.75M | 15.17 | 60.68
We have now just lost another 22% performance per core going from a 2-socket
E5 to the 4-socket E7 type systems (1 - 15.17/19.54).
Total throughput is 55% better than the two-socket (2*15.17/19.54).
Performance per core is 28.8% better on the 2-socket than on the 4-socket.
The performance per core on the 1-socket is 80% better than on the 4-socket (27.27/15.17).
The 4-socket has 2.22X better throughput than the 1-socket (4*15.17/27.27).
Scaling - NUMA
Below are all three of the above cases in a single table.

GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
3 | 67ns | - | 1 | 67ns | 201 | 0.05 | 110M | 27.27 | 27.27
3 | 67ns | 125ns | 2 | 96ns | 288 | 0.05 | 153.5M | 19.54 | 39.08
3 | 82ns | 140ns | 4 | 125.5ns | 376.5 | 0.05 | 197.75M | 15.17 | 60.68
Scaling - Frequency
We can also do the same calculations based on a similarly fictitious 2GHz processor.
GHz | L3+mem | remote | skt | avg. mem | mem cycle | fraction | tot cycles | tps/core | tot tx/sec
2 | 67ns | - | 1 | 67ns | 134 | 0.05 | 76.5M | 26.14 | 26.14
2 | 67ns | 125ns | 2 | 96ns | 192 | 0.05 | 105.5M | 18.96 | 37.91
2 | 82ns | 140ns | 4 | 125.5ns | 251 | 0.05 | 135M | 14.81 | 59.26
Notice that we did not lose much performance in stepping down from 3 to 2GHz.
We could even further step down to 1GHz and still be at 23.26, 34.78 and 55.36 tot tps
for 1, 2, and 4 sockets respectively.
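The whole set of figures, including the frequency scaling rows, can be generated from the same toy model (the 15ns SMB adder and the uniform random access across nodes follow the assumptions stated above; everything else is illustrative):

```python
def avg_mem_ns(local_ns, remote_ns, sockets):
    """Random access across nodes: 1/sockets chance local, rest remote."""
    if sockets == 1:
        return local_ns
    return (local_ns + (sockets - 1) * remote_ns) / sockets

def tps(ghz, mem_ns, miss=0.05, instr=10_000_000):
    """Toy model: hits cost 1 cycle, each miss stalls for a full round trip."""
    cycles = instr * (1 - miss) + instr * miss * mem_ns * ghz
    return ghz * 1e9 / cycles

# (sockets, local ns, remote ns); the E7 row includes the assumed 15ns SMB adder
configs = [(1, 67, None), (2, 67, 125), (4, 82, 140)]
for ghz in (3, 2, 1):
    for skt, loc, rem in configs:
        per_core = tps(ghz, avg_mem_ns(loc, rem, skt))
        print(f"{ghz}GHz {skt}-socket: {per_core:6.2f} tps/core, "
              f"{skt * per_core:6.2f} total")
```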
It is important to stress that this is based on the assumption of a transaction
processing workload having the characteristic of serialized memory accesses.
All of the above is based on a highly simplified model.
Real and rigorous testing should be done before drawing firm conclusions.
Regardless, there is no way anyone can claim that the difference
between local and remote node memory access latency is not important,
unless the database has been architected to achieve a high degree of memory locality.
Front-Side Bus Systems
In the days before Intel integrated the memory controller,
four processors connected to a memory controller in a system with uniform memory access.
The Pentium Pro arrangement of 4P on one bus is represented on the left,
although the diagram is actually closer to the 450NX or later.
The system on the right represents the 4-way Xeon 7300,
each quad-core processor on its own bus connected to the 7300 MCH.
Intel had already committed to the Pentium 4 quad-pumped bus in 2000/01.
Vendors were expecting a long stable infrastructure,
so Intel delayed the switch-over from bus to point-to-point until 2009.
Pre-Nehalem NUMA Systems
A system with 16 or more processors could be built
with a custom memory + node controller
connecting four processors, memory and also a crossbar.
The crossbar in turn connects multiple node controllers
to form a system having non-uniform memory access (NUMA).
In the old NUMA systems, some SQL Server operations ran fine,
and other operations had horrible characteristics,
far worse than would be suggested by the
remote to local node memory latency ratio.
So, it is possible that there are other NUMA effects with far greater impact.
Some may have since been resolved, while others may still be present
but not as pronounced in modern NUMA systems.
An important fact to notice is that a very high fraction of CPU cycles are no-ops,
where the processor core does nothing while waiting for a round-trip memory access.
This is why Hyper-Threading is highly effective.
While one logical processor is waiting for a memory access to complete,
the other thread can run.
Scaling on the logical processors can be nearly linear for a transaction processing workload.
Note, the first-generation of Intel Hyper-Threading was on the Intel Pentium 4
(NetBurst) processors.
In that generation, the implementation was two threads running simultaneously,
as in each clock cycle, trying to fill the super-scalar execution units.
The first generation of HT was problematic.
It could have been because it was too aggressive to try to execute two
threads simultaneously, or it could have been simply that the Windows
operating system and SQL Server engine at the time did not know how to properly use HT.
The next generation of Intel processor architecture, Core 2, did not have HT.
Then in Nehalem, HT returned, except that this time,
it was a time slice implementation.
Only one thread executes on any given cycle.
When the executing thread encounters a memory or other long latency operation,
the processor core switches to the other thread.
If anyone has doubts on HT based on experience or hearsay
from the Pentium 4 generation, forget it.
The Nehalem and later HT is highly effective for transaction processing workloads.
There used to be several SKUs with HT disabled in the Xeon E5/7 v1-3 generations.
Pay close attention and pass on the no-HT SKUs.
The question to ask is why Intel does not increase the degree of HT?
The generic term is Simultaneous Multi-Threading (SMT).
Both IBM POWER and Oracle SPARC processors are or have been at 8-way SMT.
Granted, one of the two mentioned that scaling to 8-way SMT was tricky.
It is high time for Intel to increase HT to 4-way.
In the above examples, the simple model suggests that scaling to
multiple sockets is poor on the assumption of a transaction processing database
without a means of achieving memory locality.
(There is supposed to be an HPE whitepaper demonstrating the importance
of the SQL NUMA tuning techniques in a properly designed database.)
Just what does a database architected for NUMA mean?
Naturally, this will have to be expounded in a separate article.
But for now, take a close look at both the TPC-C and TPC-E databases.
The TPC-C database has all tables leading with a common key, Warehouse Id,
that provides a natural organizational structure.
The TPC-E database has 5 transaction tables with a common key value
but does not use the identity property.
Instead it uses a function that must read, then update a table
to determine the next value.
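That read-then-update pattern for generating the next key value can be sketched as follows (a toy in-memory illustration, not TPC-E code; in the real database it is a table read and update inside the transaction), showing the serialization point it creates:

```python
import threading

# Toy sketch of a "get next id" pattern: a single counter row that every
# transaction must read and then update. The lock stands in for the
# row-level exclusive lock that serializes all callers.
counter = {"next_id": 1}
lock = threading.Lock()

def get_next_value():
    with lock:  # every caller serializes on this one location
        value = counter["next_id"]
        counter["next_id"] = value + 1
    return value

ids = [get_next_value() for _ in range(5)]
print(ids)  # [1, 2, 3, 4, 5]
```

With many concurrent callers, the contended location bounces between caches, which is exactly the latency-sensitive behavior discussed above.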
The Case for Single Socket
Naturally, the database and application
should be architected together with the SQL Server NUMA tuning options
to support good scaling
on multi-socket NUMA systems.
If we neglected this in the original design,
I am sure many DBA-developers know how well the suggestion to re-architect would be received.
Is there another option? Well, yes.
Get rid of the NUMA latency issue with a non-NUMA system.
Such a system has a single processor socket, hence one memory node.
Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.
Still, a single quad-core processor today is 40 times more powerful
than a 4-way system from twenty years ago (100,000 tpm-C per core is probably
possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996).
The Xeon E3 could probably support many medium sized organizations.
Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each).
My recollection is that many IO problems went away at the 32-64GB level.
And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4's.
But I am really talking about a single-socket Xeon E5.
In the v4 generation, we could have up to 22 cores,
though we should start by looking at the 10-core E5-2630 v4 at $667,
stepping up to the 16-core 2683 v4 at $1846
before going to the 20-22 core models at $3226 and $4938.
The Xeon E5 has 40 PCI-E gen 3 lanes.
It might be convenient if there were a motherboard with 1 PCI-E x8 and 8 x4,
because NVMe PCI-E SSDs are more common and economical with the x4 interface.
Supermicro does have a UP Xeon E5 motherboard (X10SRL-F) with 4 x8 and 2 x4 gen3 slots, plus 1 x4 gen2.
It only has 8 DIMM sites out of 12 possible with the E5, but that is probably good enough.
A strong explanation was provided showing why round-trip memory latency
is very important in transaction processing.
One implication of this is that scaling to multiple sockets is poor
due to the NUMA effect.
A remedy is to architect the database and application together in working
with the SQL Server NUMA tuning options to achieve locality.
Alternatively, give serious consideration to a single-socket, yet still very powerful system.
A second implication is that processor frequency is less important
for transaction processing, though it might be important for other aspects.
The memory latency effect also supports the argument that Hyper-Threading
is highly effective and Intel really needs to increase the degree of HT.
OK, I didn't show why database transaction processing incurs the round-trip memory accesses.
It has to do with the b-tree index, in which we read through a page
to find the right pointer to the next level.
We access the memory for that pointer, then read through to find the next pointer.
I will try to do a diagram of this later.
But if someone can dig through an open source database engine,
please send a code example.
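In the meantime, here is a toy sketch (not from any real engine) of why the b-tree search serializes memory accesses: each level reads a page, scans it for the right pointer, and only then can the next page be fetched, so every level is a dependent round-trip that cannot be overlapped.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, rows=None):
        self.keys = keys          # separating key values on this page
        self.children = children  # child page "pointers" (None at leaf level)
        self.rows = rows          # row data at the leaf level

def btree_search(node, key):
    while node.children is not None:             # non-leaf levels
        i = bisect.bisect_right(node.keys, key)  # scan page for the pointer
        node = node.children[i]                  # dependent access: next page
    i = bisect.bisect_left(node.keys, key)       # leaf page
    if i < len(node.keys) and node.keys[i] == key:
        return node.rows[i]
    return None

# two-level example
leaf1 = Node([1, 2, 3], rows=["a", "b", "c"])
leaf2 = Node([5, 8, 9], rows=["d", "e", "f"])
root = Node([5], children=[leaf1, leaf2])
print(btree_search(root, 8))  # e
```

The address of each page depends on data read from the previous page, so the core stalls for one full memory round-trip per index level.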
Several of the images were updated.
It would be helpful if Intel were to provide L3 latencies
for the Xeon E5 v4 LCC, MCC and HCC models.
What are the memory latencies for local and remote nodes in E5 v4?
How much latency does the SMB in the Xeon E7 add?
Note that Xeon E3 and client side processors use unbuffered memory.
While the Xeon E5 can use unbuffered memory, these are currently limited
to 16GB DIMMs while registered memory is available in capacities to 64GB.
The Xeon D is for specialized embedded applications and not suited for the
single-socket database server. It has only 4 DIMM sites?
Supermicro has an Ultra product line targeting specialized applications.
One of the features they tout:
they claim that with very high quality design and components,
it is possible to reduce (memory?) latency via low jitter.
I would like to know more about this.
But the only option is Xeon E5 dual-socket, and I am more interested in single-socket.
The emphasis seems to be on RHEL, and high frequency trading?
There are examples for determining which processor socket the NIC is attached to,
and whether a thread is running on a core in that socket.
These hardware organization detection tools really should be incorporated into Windows.
I have tried to use the WMI API from C#, but some things require coding in C
or possibly assembly?
It was stressed that round-trip memory latency impacts transaction processing databases.
Column-store DW avoids this problem by emphasizing marching through memory sequentially.
The main intent of Hekaton memory-optimized tables was to eliminate the need for locks.
But the other part was the use of a hash index, which happens to reduce
the number of memory round-trip operations.
Core i7 Xeon 5500 Series
Data Source | Latency (approximate)
L1 cache hit | ~4 cycles
L2 cache hit | ~10 cycles
L3 cache hit, line unshared | ~40 cycles
L3 cache hit, shared line in another core | ~65 cycles
L3 cache hit, modified in another core | ~75 cycles
remote L3 cache | ~100-300 cycles
local DRAM | ~60 ns
remote DRAM | ~100 ns
Sources: D. Levinthal (Intel) paper; Ulrich Drepper paper (hosted at freebsd.org).
A couple of months ago, I got a pair of Intel XL710QDA2 Ethernet Converged Network Adapters, basically the fancy way of saying 40 Gigabit Ethernet that does other stuff too. I also bought the Intel QSFP+ twin-axial cable. At that point in time, I might like to have opted for the newer Intel Omni-Path 100Gbps (56Gbps on PCI-E x4), which came out in Q4 2015, over the XL710, which came out in Q3 2014. But Intel Omni-Path only supports Linux at this time, with a heavy emphasis on HPC applications, so it is unclear whether they will support Windows in the future. Another option was InfiniBand, which other people have written about in conjunction with SMB 3.0 in Windows Server 2012.
The XL710QDA2 (dual-port) adapters were $500 each, and the cable $130. The single port XL710QDA1 was $420, so I thought having the second port was worth an extra $80 even though I am not using it yet. A 40GbE switch seems to cost about $500 per port, which I think is reasonable. There is the Cisco Nexus 9332PQ for $17K. I thought there was another vendor, Edge-Core? with a 20-port, but it is difficult searching on 40 GbE because of all the junk results. In any case, I do not believe that there is a small-office (8-12-port) 40 gigabit Ethernet switch. So, my current usage is connecting two systems directly with the QSFP+ twin-axial cable, and using the embedded 1GbE for internet connectivity.
The more readily apparent reason for high-speed networks in a SQL Server environment is to move backups from one machine to another. For my perfunctory dig at storage system vendors, the SAN vendor will say: our million-dollar product has software so you don’t need to move large backups over the network.
A less obvious reason is to achieve low latency in support of serialize high-volume calls. Transaction processing might not need high network data rates, and can probably generate high call volume on 1 GbE from multiple concurrent connections. However, there could be situations that require the highest possible call volume on a single thread and connection. Of course, in this case, the first course of action is to employ a back-to-back connection, bypassing any switches whatsoever. And definitely not have more than one switch in the path from application to database if a direct connection is not possible.
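A rough way to see the serialized call behavior is a loopback ping-pong (a sketch only; loopback latency is far lower than any NIC, and a real measurement needs two hosts):

```python
import socket
import threading
import time

def echo_server(sock):
    """Accept one connection and echo everything back until it closes."""
    conn, _ = sock.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(data)

srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # ephemeral port
srv.listen(1)
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(srv.getsockname())
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # no Nagle delay on small calls

n = 1000
start = time.perf_counter()
for _ in range(n):          # serialized: each call waits for the previous reply
    cli.sendall(b"ping")
    reply = cli.recv(64)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:,.0f} serialized round-trips/sec ({1e6 * elapsed / n:.1f} us each)")
cli.close()
srv.close()
```

The same loop against a remote host shows how much switch hops and adapter latency cost per call.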
Potentially, we might think that high volume ETL could benefit from high data rate networks, but I believe that SQL Server single thread write operations are more limiting than 1 gigabit Ethernet. But perhaps more recent versions of SSIS might have better tools for parallelism.
Trying out the Intel 40GbE adapter
Back in August, when I got the XL710QDA, my test systems were running Windows Server 2016 TP5, for which Intel had a beta driver. The adapter was recognized by the operating system and the beta driver installed with no issues. The Intel documentation on firmware updates was very complicated, so I did not apply any. In this setup, even though the driver installed, I could not ping across the 40GbE network. Intel support was of the opinion that the back-to-back connection should work. No diagnostic tools were provided for examining the software stack for the XL710.
Then in early October, there was a substantial hotfix for Windows Server 2016 TP5, after which SQL Server connections were going over the 40GbE. A ping test worked one way but not the other. In any case, Windows Server 2016 RTM was out about this time, so I installed both machines to RTM, applied OS hotfixes, and updated to the release driver. Now everything seems to work.
In bulk copies, network transfer rate was high and serialized network calls were also faster than on the 1GbE network (which was via a switch, not back to back). When I get a chance, I will look at how much round-trip latency reduction I would have gotten on 1GbE with a direct back-to-back cable. I did this about 15 years ago when gigabit adapters came down to $300 but switches were still very expensive. If anyone is interested, I investigated whether a special cross-over cable was needed for gigabit:
Gigabit Ethernet Direct Connect, i.e. no special cross-over cable needed.
10 gigabit Ethernet
For people not ready to jump directly to 40 Gigabit Ethernet, consider 10 Gigabit. Adapters are now in the $300-400 range, and switches are in the $100 per port territory. The main consideration is which interface to standardize on. Two options for short distances are 10GBase-T and SFP+ Direct-Attach. It appears CX4 is not getting traction? One might be tempted to default to twisted pair, CAT6a or CAT7 for 10GBase-T. Cables for CAT6a and 7 are really cheap, $3-10 for short length pre-made CAT6a and $7-17 for CAT7, while SFP+ cables are $50-100.
However, various sources mention that 10GBase-T latency is higher than for the other options. And in certain situations, serialized network calls for example, latency is the primary criterion. In large file transfers, latency is not as important.
Netgear actually offers small office 10 Gigabit Ethernet switches. The XS7xx models are 8, 12, 16, and 24 10GBase-T ports, with some models having additional SFP+ uplink ports. The M4300 models include one with 8 10GBase-T + 8 SFP+ and 12+12. I did not see any that were all or mostly SFP+, but the X+X should be acceptable. It is possible to use the Intel XL710 with a QSFP to 4-port SFP breakout cable.
Is the Intel 40GbE a dead-end? Intel is going its own way at 100Gbps with Omni-Path, which currently does not support Windows.
Mellanox has a new 100GbE that also does NVMe over fabric?
Mellanox is offering a rebate on a one-time purchase of their 40/100GbE switches, the SN2100 16-port?
Price might be $6,600-7,200? with a 50% rebate.
This puts it at $200+ per port at the rebate rate, and perhaps $450 per port normally?
Last month, Samsung announced
that their 960 Pro,
(PCI-E x4 gen3) M.2 NVMe SSD would be available in October
at 512GB, 1TB and 2TB capacities.
This is a high-end PC and workstation product.
There is a similar but separate product line (SM961?) for OEM customers.
The 960 Pro had already been mentioned earlier this year.
This week, all the hardware sites, and others as well, published product reviews.
All the previous PCI-E x4 gen3 NVMe SSDs were rated between 2,000-2,500MB/s in large block read.
The 960 Pro is rated for 3,500MB/s read.
This is pretty much the maximum possible bandwidth for PCI-E x4 gen3.
Each PCI-E gen3 lane is 8Gbit/s, but the realizable bandwidth is less.
In earlier generation products,
an upper bound of 800MB/s realizable per 8Gbit/s nominal signaling rate was typical.
Presumably there was a reason why every earlier PCI-E x4 product was
in the 2,000-2,500MB/s bandwidth range.
It could be that these were 8-channel controllers and the NAND interface was 333-400MB/s.
Even though 8 x 400MB/s = 3,200MB/s, it is expected that excess bandwidth is necessary
on the downstream side.
There could be other reasons as well, perhaps the DRAM for caching NAND meta-data.
Intel had an 18-channel controller, which produced 2,400MB/s in the P750 line, and 2,800MB/s in the P3x00 line.
It is very annoying that the more vertically integrated SSD OEMs decline to disclose meaningful details
of the controllers in their SSDs.
The 960 PRO uses the new Polaris controller, for which few details beyond the Read/Write IOPS and bandwidth
specifications are published.
The 960 PRO is shown to have the Polaris controller and DRAM in a single package.
There are 4 NAND packages, each with 16 stacked 512Gbit die for the 2TB product.
I am guessing that the Polaris controller has 8-channels,
at 8-bit per channel with each package connecting to 2 channels?
But there are other possibilities, such as 4 x 16-bit channels.
The reason is that the previous generation NAND were 333 or 400MB/s.
The most recent NAND from Micron is listed at 533MB/s.
In order to support 3,500MB/s on the upstream interface,
there needs to be more than that amount on the downstream side combined.
While Samsung is the leader in V-NAND, it seems more likely that the 960 has 8 channels 8-bit
(or 4 channels, 16-bit) at 533MT/s
and not 4 channels, 8-bit 1066MT/s.
Between the two likely scenarios, 8-channel 8-bit should support higher IOPS than 4 channel 16bit?
Both the 2TB and 1TB models have read specs of 440,000 IOPS while the 512GB model is 330,000 IOPS.
It could be that the 1TB is 4 packages, 16 chips stacked with a 256Gb chips, and the 512GB model has 8 chips stacked. It is also possible that the 1TB has packages stacked 8 high with the same 512Gb, just that there is no reduction in IOPS, while the 512GB has 4 chip stacks?
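The downstream bandwidth arithmetic behind these guesses can be checked directly (the channel counts and widths are the assumptions from above, not disclosed specifications):

```python
def downstream_mb_s(channels, mt_s, bits=8):
    """Aggregate NAND-side bandwidth: channels x transfer rate x bus width in bytes."""
    return channels * mt_s * bits // 8

# previous generation: 8 channels at 400MT/s -> 3,200MB/s,
# not enough headroom for a 3,500MB/s upstream interface
print(downstream_mb_s(8, 400))  # 3200

# 8 channels at 533MT/s -> 4,264MB/s, comfortably above 3,500MB/s
print(downstream_mb_s(8, 533))  # 4264

# 4 channels x 16-bit at 533MT/s gives the same aggregate
print(downstream_mb_s(4, 533, bits=16))  # 4264
```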
I would think that 2 x 960 PROs in a single socket Xeon E3 quad-core would be a great match.
My current test system is a single socket Xeon E5 10-core with four Intel 750 SSDs.
I will probably not outfit another test system until the Xeon E5 v5 comes out next year,
in which I may opt for the 960 PRO.
Oh yeah, the 960 PRO is not an enterprise product, so it should probably not be used in critical production environments, but should be OK for test.
It is possible to buy a M.2 to PCI-E adapter, but I would like an adapter that can accommodate 2 M.2 devices in one PCI-E x8 slot, without the use of a PCI-E switch if possible.
HP has an adapter for 4 M.2 devices in a x16 slot that only works in HP systems.
Last year I bought a Dell XPS9550 with the then-new Skylake processor.
It came with a 512GB Samsung PM951, which was an older model rated for 1GB/s sequential, while the slightly more expensive SM951 was rated for 2GB/s.
This drive died after 3 months. Dell sent a replacement, but I elected to go out and buy the 950 PRO myself, rated at the same 2GB/s as the SM951, though perhaps not the same product?
The failure of one device is not sufficient to make an assessment, but I have dozens of OCZ Vertex and Vectors, and several Intel SSDs of various flavors, all running for several years now without failure.
Earlier this year, HPE and Microsoft sponsored an article,
The Importance of Benchmarking,
in SQL Server Pro.
While true, it is also apparent that there has been little interest in benchmarks
within the DBA/IT community over the last several years.
There are many reasons for this,
one of which is the emphasis on the benchmark result as the important takeaway.
Today most people expect that systems are very powerful,
and probably know that performance at the processor core level
has improved only incrementally in the recent years.
The main avenue is the increasing number of cores with each generation,
currently at 22 in the Xeon E5 v4 and 24 in the E7 v4.
The TPC benchmarks are generally well designed,
with rules meant to prevent the publication of a distorted result.
But every application has its own characteristics,
different table organizations,
and transactions with SQL logic
bearing no resemblance to a benchmark or another application.
As such, a benchmark result does not translate directly to a different set of circumstances.
What is of great importance are the problems that were encountered in the benchmark,
and the measures that were taken to address each problem.
The topic of this article is Inserts to a table with a sequentially increasing index,
commonly in the form of an identity column,
but other implementations are possible as well.
At a New York City SQL Server Users Group meeting, Thomas Grohser,
Expert SQL Server Performance Engineering,
mentioned that multiple client (connections)
doing single row inserts to a table
with clustered indexed on an identity column
had extremely poor performance running on a 4-socket system.
In that case, this was about 6,000 rows per second.
For inserts to a table with a clustered index on a uniqueidentifier,
or with a compound key not leading with the identity column,
performance was 95,000 (calls and) rows per second.
Memory-optimized tables could achieve 1.7M rows per second.
But it might be a good idea to consult with Thomas on this.
The assessment was that this was a contention issue due to the high latency
between processor sockets.
All modern systems with more than one processor socket have
Non-Uniform Memory Access (NUMA) architecture.
For Intel processors, this goes back to Xeon 5500 (2009) and 7500 (2010) and 2003/4 for AMD Opteron.
An Insert to a table with an index on an identity column,
regardless of whether the index is clustered or nonclustered,
involves acquiring an exclusive lock on the last row of the last page,
as well as accessing the memory for that page and row.
When there are multiple concurrent threads with affinity on cores
in different sockets of a multi-socket system,
the memory involved bounces between the L3 caches of each socket.
While it is inefficient for memory to bounce across sockets,
the magnitude of the impact on Inserts is stunning.
Testing Insert Performance
The test environment here is a single socket Xeon E3 v3, quad-core, hyper-threading enabled.
Turbo-boost is disabled for consistency.
The software stack is Windows Server 2016 TP5, and SQL Server 2016 cu2 (build 2164).
Some tests were conducted on a single socket Xeon E5 v4 with 10 cores,
but most are on the E3 system.
In the past, I used to maintain two-socket systems for investigating issues,
but only up to the Core2 processor, which were not NUMA.
The test table has 8 fixed length not null columns,
4 bigint, 2 guids, 1 int, and a 3-byte date.
This adds up to 70 bytes.
With page and row overhead, this works out to 100 rows per page at 100% fill-factor.
Both heap and clustered index organized tables were tested.
The indexes tested were 1) single column key sequentially increasing
and 2) two column key leading with a grouping value followed by a sequentially increasing value.
The grouping value was chosen so that inserts go to many different pages.
The test was for a client to insert a single row per call.
Note that the recommended practice is to consolidate
multiple SQL statements into a single RPC, aka network roundtrip,
and if appropriate, bracket multiple Insert, Update and Delete statements
with a BEGIN and COMMIT TRAN.
This test was contrived to determine the worst case insert scenario.
On the single socket Xeon E3 quad-core system, the heap table with no indexes
and both the heap with one nonclustered index and clustered index
with the two column key
could support an insert call rate of 70-72,000 per sec,
both rows inserted and number of network roundtrips.
The heap table with one nonclustered index and table with only a clustered index,
both with the index key being a single sequentially increasing column
such that inserts from all threads/connections go to the last page
could support an insert call rate of about 40,000 rows per sec.
The issue here is multiple concurrent connections
making calls to Insert into a single table
with a sequentially increasing index, clustered or nonclustered,
resulting in contention between threads for the last page and row.
There is a substantial 42% performance degradation on a single socket system,
even with memory in a common L3 cache: the relevant memory locations
are moved between the L2 caches of different cores each time exclusive access is acquired.
While the impact is large,
even this may not be a crippling effect depending on the circumstance.
In a multi-socket system, memory must now also be moved between L3 cache of
different processor sockets, which has higher latency.
The impact is expected to be more severe as the number of sockets increases.
The presumption here is that the limitation is in the ability to acquire exclusive
locks between threads on different cores, potentially in different sockets.
If each call inserted more than one row, the call rate might decrease only slightly,
resulting in a higher row insert rate.
Hence the explicit emphasis on the call rate as well as the row insert rate.
Note also that we should probably be able to insert into multiple tables
each at more or less the same rate as inserts to a single table
until the limitation becomes the log write thread or overall CPU.
This test is not the same as the test by Thomas Grohser
in terms of table structure, client application, and hardware etc.
Still, the expectation is that results would not be dramatically different.
My tests on this matter are incomplete, and more work is necessary.
As soon as I can get access to two and four socket systems, I will try to run this same test
(hint to Intel or someone).
In preliminary tests on the single socket 10-core system, the Insert call rate
with rows going to different pages was over 100,000 per sec for both heap and clustered,
and around 30,000 per sec with a sequential index, nonclustered or clustered.
We might infer that this is a contention issue in which performance degrades
with an increasing number of cores, for both single and multi-socket systems.
In the case of multiple sockets, the degradation might be more severe.
This issue might be more complex than a simple matter of the higher latency between sockets,
but other people might have better tools to conduct such an investigation.
There are work-arounds for this issue.
One is to implement a multi-column clustered index key such that inserts are distributed over many pages.
It is also necessary not to have even a nonclustered index on the identity column,
which may have implications.
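One way to implement this work-around (a sketch only; the table, the column names and the bucket count of 16 are hypothetical) is to lead the clustered key with a small hash of the sequential value, so that concurrent inserts spread over that many page chains:

```sql
CREATE TABLE dbo.OrderEvents (
    OrderID bigint IDENTITY NOT NULL,
    -- Spreads inserts over 16 page chains instead of one hot last page.
    Bucket  AS CAST(OrderID % 16 AS tinyint) PERSISTED NOT NULL,
    Payload varchar(100) NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX CI ON dbo.OrderEvents (Bucket, OrderID);
-- Caveat (the "implications" above): a lookup on OrderID alone
-- must now probe all 16 buckets unless Bucket is also supplied.
```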
Another work-around is simply to deploy on a single socket system.
This is actually good advice for perhaps a majority of situations.
The Xeon E3 with 4 cores is perhaps twenty times more powerful than
a 4-socket Pentium II Xeon from 1998.
If the quad-core E3 were not enough, probably in memory or PCI-E,
then a single socket Xeon E5 with up to 22 cores should definitely be considered
before reflexively defaulting to a two-socket system
without quantitative substantiation of any performance assumptions.
One of the problems today is that infrastructure people have bought into
the vendor driven arguments for virtual machines and/or cloud,
but apply it pervasively even into the mission-critical systems,
while completely disregarding any special requirements.
The typical platform is a 2-socket system.
There are other options too.
We could try affinitizing calls that insert into the critical table,
using the Map TCP IP Ports to NUMA Nodes feature.
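Per the SQL Server network configuration documentation, the mechanism is to append a NUMA node affinity mask in brackets after the TCP port in Configuration Manager's TCP/IP properties. The port numbers below are arbitrary examples:

```
TCP Port: 1433[0x1],1500[0x2]
-- connections to port 1433 are handled by NUMA node 0 (mask 0x1)
-- connections to port 1500 are handled by NUMA node 1 (mask 0x2)
```

The application would then direct connections that insert into the hot table to one port, keeping the contended memory on a single node.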
Benchmarks and Inserts
Before closing this topic, given that it is so important,
we might ask whether the benchmarks shed light on this matter.
The TPC-C benchmark has the highest insert volume into a single table.
However, TPC-C does not have this problem because all tables are organized
(clustered index) by warehouse followed by district.
This reflects a brick and mortar business,
where customers place orders mostly in their own warehouse and district.
If we were curious anyhow, TPC-C benchmark results are in transactions per minute (tpm-C).
The highest SQL Server result is 1.8M tpm-C, corresponding to 30K new orders per second.
Technically, the order line table has more rows inserted.
Each transaction inserts one row to orders
and on average 10 rows to the order line table with a single call,
so the Insert statement call volume is the same for both tables.
No TPC-C benchmark results were published for SQL Server version 2008 or later,
even though a handful of results continued to be published for version 2005
after 2008 RTM.
The reason for this is that SQL Server 2008 compressed log file writes
to better support database mirroring,
a very important feature across the broad base of users.
Log writes were single-threaded until version 2016.
This meant that SQL Server write performance could be limited by the ability of a
single core running the log writer thread to compress log entries.
Presumably there was some survey to suggest that this would be acceptable
as there is no option to disable log write compression?
The TPC-C benchmark has a curious statement when updating the stock table
in the output clause that touches but does not change several fat columns.
This part serves no meaningful business purpose,
but has the effect of grossly inflating the log write size far beyond
what should be necessary.
Presumably other RDBMSs do not have mandatory log compression.
And so the highest TPC-C result on a system with Intel processors
is 5M tpm-C (8 sockets, 80 cores total, 160 threads),
corresponding to about 84,000 inserts to orders per second.
The TPC-E benchmark does have 3 tables that are unique on the (big)integer Trade Id column,
and two more tables that have a two column key leading with Trade Id.
Is there a magical tuning trick that can evade the insert to a table
with a sequentially increasing index problem?
TPC-E works around this by not using an identity column in the tables
where one might think the primary key column should be an identity.
In fact, it uses a scalar function and a table to assign trade Id values.
The TradeOrderFrame4 procedure first calls the function Get_Next_T_ID
with an output variable to retrieve a Trade Id.
The function uses the table TID_RANGES to store blocks of ID values.
Each block, represented by a row in TID_RANGES, is assigned
to a specific session id (SPID, accessed with @@SPID).
Every time the function is called,
the column holding the current Id value for the caller's spid is returned
and the row is updated to the next Id value.
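The mechanism can be sketched as below. This is my reconstruction of the idea, not the actual TPC-E kit code; the column names are invented, and since a T-SQL scalar function cannot update a table, the sketch uses a procedure with an output parameter:

```sql
CREATE TABLE dbo.TID_RANGES (
    spid       int NOT NULL PRIMARY KEY,  -- one row (block of Ids) per session
    current_id bigint NOT NULL,
    max_id     bigint NOT NULL            -- end of this session's block
);
GO
CREATE PROCEDURE dbo.Get_Next_T_ID @trade_id bigint OUTPUT
AS
-- Each session touches only its own row, so there is no cross-session
-- contention on a single hot row or page.
UPDATE dbo.TID_RANGES
SET @trade_id = current_id,
    current_id = current_id + 1
WHERE spid = @@SPID;
GO
```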
This is probably an acceptable implementation in the TPC-E benchmark,
where all connections are running transactions.
In a real-world environment, such as a common connection pool, there might be
very many open connections, not all of which process transactions.
I might suggest using the spid modulo the total number of logical processors.
Or perhaps the spid divided by 2 first, then modulo the number of cores, when HT is active.
All of this is in the details required by TPC full disclosure rules,
but it is buried in the supplemental files among thousands of extraneous files.
There is no explanation given as to why, in the three tables with a primary key
clustered on the Trade Id column, that column is not an identity,
instead using a function that requires an update to maintain.
It is interesting this method was in TPC-E from the beginning.
The first TPC-E result was published in 2007
for a 16-socket system
with Xeon (7140M?) dual-core processors.
In that generation, there were probably 4 processors in each NUMA node.
The performance was 660 tpsE,
meaning 660 inserts per second to each of 3 tables unique on trade Id
plus inserts to two more tables.
The current 4-socket Xeon E7 v4, with 96 cores total, is enormously more powerful.
Was the insert to an index on a sequentially increasing column problem known back then?
If it occurred at such a low insert rate,
then how many people suffered this problem without knowing why?
The conclusion to draw is that, yes benchmarks are important.
However, what is important is not the result that vendors like to brag about,
but rather the many problems that were solved in producing a competitive result.
While all the methods are disclosed in the supporting files,
there is no accompanying explanation.
Some methods are simple to apply, like lock pages in memory.
Others must be worked into the database architecture,
which has deep ramifications if done after the system is already in production,
or must be part of the client-side application, such as the TCP port affinity.
Some methods might have critically significant impact.
Others might be of just minor impact that is desired for benchmarks,
but have no important consequence in a production system.
I do not suppose anyone would be willing to apply the -x startup parameter
in a production environment, but it might be good to know the impact?
Good luck in deciphering which details are critically important to a specific situation
and which might just be used to get an extra 1% edge over a competitor.
The test systems, client load generator and server, each had an Intel I210 1GbE connected to a common switch.
A single-threaded connection could drive about 4K single-row select RPC/sec, and 2K single-row insert RPC/sec. A high level of saturation was achieved at 80 concurrent connections.
About 12 years ago, I had tested Xeon (Pentium 4) systems with 1GbE connected back to back (no switch), and this could drive 5K single row select RPC/s. I did not test the current 1GbE direct connect network round-trip rate.
Both of these systems also had an Intel XL710 40GbE with direct QSFP+ connection. At first this did not seem to work in Windows Server 2016 TP5 with the Intel beta driver.
Intel support was of the opinion that this should work but did not say definitively.
Then on Tuesday, with the latest Windows TP5 hotfixes, all of a sudden the load generator was connecting via the 40GbE instead of the 1GbE. There were some quirks in the behavior.
In any case, the 40GbE direct connect could drive 11K select RPC/s, and 6K insert RPC/sec, for a single connection. Only 20 concurrent threads were necessary to achieve a high level of load, at which the 10-core Xeon E5 could support perhaps 40K single-row inserts with a sequential clustered index.
I am currently in the process of changing over to Windows Server 2016 RTM + hotfixes,
which also seems to upgrade the XL710 driver from 1.3.115 to 1.5.59, so more to come.
Henrik provided the link,
SQL Server Latch Contention, which is why he mentioned that this is latch contention.
I would have thought it would be lock contention, but apparently it is latch contention according to MS.
I only took a quick look; MS identifies the problem with the degree of concurrency, more severe at higher concurrency. Between Thomas Grohser's report and my data, it should be emphasized that there is most probably a strong hardware NUMA or non-NUMA impact. But it would be preferable if one person ran the same test on 1-socket, 2-socket and 4-socket systems.
Can the Intel Xeon Phi x200, aka Knights Landing, run SQL Server? It does run Windows Server 2016, so is there anything in SQL Server 2016 that would stop it from installing? Xeon Phi is designed for HPC, so it would not have been tested with SQL Server, but that does not confirm whether it will or will not work. If so, then this could be used to prove out some important theories.
The main line Intel processors used in the Core i7/5/3 and Xeon product lines, with recent generation codenames Haswell, Broadwell, Sky Lake and soon Kaby Lake, are heavily overbuilt at the core level. The Broadwell processor core is designed to operate at 4GHz if not higher.
The Haswell 22nm processor was rated for up to 3.5GHz base, 3.9GHz turbo at launch in Q2’13. In Q2’14, a new model was rated 4/4.4GHz base/turbo at 88W TDP. Both are C-0 step, so the higher frequency was achieved by maturity of the manufacturing process or cherry picking?
The Broadwell 14nm processor had a top frequency of 3.3/3.8GHz base/turbo 65W for desktop,
but perhaps this is because it was more focused at mobile than desktop?
(Curiously there is also a Xeon E3 v4 at 3.5/3.8GHz 95W and Iris Pro graphics).
The top Sky Lake 14nm processor was 4.0/4.2 GHz base/turbo at 91W.
With a single core under load, the processor is probably running at the turbo boost frequency.
When all four cores are under load, it should be able to maintain the rated base frequency while staying within design thermal specifications, and it might be able to run at a boosted frequency depending on which execution units are active.
The latest Intel Xeons (regular, not phi) are the E5 and E7 v4, based on the Broadwell core.
There are 3 die versions, LCC, MCC, and HCC, with
10, 15, and 24 cores respectively.
All of these should be able to operate at the same frequency as the desktop Broadwell or better, considering that the Xeon E5/7 v4 Broadwells came out one year after the desktop processors.
But Xeons need to be more conservative in their ratings, so a lower frequency is understandable.
The top Xeon 4-core model, E5-1630 v4, using the LCC die is 3.7/4GHz at 140W TDP.
The top 8-core is 3.4/4.0GHz, E5-1680 v4, also at 140W TDP.
The top 14-core (MCC die) is 2.2/2.8GHz 105W. The top 24-core (HCC die) is 2.2/3.4GHz 140W. So the Xeon E5 and E7 v4 processors are built using cores designed to operate electrically at over 4GHz, but are constrained by heat dissipation when all cores are active to a much lower value, as low as one-half the design frequency in the high core count parts.
The transistor density half of Moore's law (essentially Pollack's rule) is that doubling the number of transistors on the same manufacturing process should enable roughly a 40% increase in general purpose performance.
The implication here is that if a particular Intel processor (full-size) core is designed with a transistor budget to operate at 4GHz, then in theory, a processor with one-quarter of that transistor budget should be comparable to the full-size core operated at one-half the design frequency, whatever the actual operating frequency of the quarter-size core is.
(Doubling the quarter size core to half-size yields 1.4X gain. Double again to full-size yields another 1.4X for approximately 2X performance going from quarter to full size).
So the theory is that it might be possible to have 100 cores of one-quarter the complexity of the Broadwell core on a die of comparable size as Broadwell EX (456mm2), with adjustments for L2/L3 variations, and differences in the memory and PCI elements.
This is just what Xeon Phi, aka Knights Landing, appears to be. There are 72 cores in 36 tiles, operating at 1.3-1.5GHz base, 1.7GHz turbo.
The Xeon Phi x200 is based on the Silvermont Atom, but at 14nm.
A tile is composed of 2 Atom cores, with 4-way simultaneous multi-threading (SMT)
and 1M L2 cache shared between 2 cores.
(There is no shared L3? how is cache coherency handled?)
The Xeon Phi has 16GB of MCDRAM plus 6 memory channels capable of 115GB/s and 384GB max capacity (6x64GB). The MCDRAM can be used in one of three modes: Cache, Flat, or Hybrid.
There is no mention of the MCDRAM latency, only the phenomenal combined bandwidth of 400-500GB/s.
My expectation is that the processor-to-memory roundtrip latency should be lower when the memory is in the same package as the processor, compared to the common arrangement with memory outside the processor package.
This is because it should be possible to use very narrow wires to connect the processor to memory in a common package, so fewer buffering circuits should be needed to amplify the signal current?
(Can some circuit designer speak to this please?)
This higher core count, higher threads on SMT is more or less comparable to IBM POWER, SPARC and even AMD Zen.
Transactional queries are essentially pointer chasing code: fetch a memory location, use its value to determine the next location to fetch.
This should run fine on a simpler core than 6/8-port superscalar Broadwell.
Such code also has many dead cycles during the round-trip memory access latency,
implying SMT will work well (beyond the two threads per core in the main line Intel cores).
However, this may not be the best general purpose computing solution,
in that there are important single threaded tasks,
or tasks that are not massively parallelizable for which the existing powerful Intel core is the best.
My thinking is that a mix of a few powerful cores and many smaller cores is the right solution.
And that there should be a few smaller cores dedicated to special OS functions (interrupt handling and polling), in a blended asymmetric-symmetric arrangement.
Long ago, almost 20 years, there was legitimate concern on whether servers had sufficient capability to drive transaction processing requirements for large businesses.
Then for a long period, perhaps from the late 1990’s to 2012, the better choice for most mid to large businesses was to simply buy a 4-way (socket) system.
Larger systems were based on a non-uniform memory access (NUMA) architecture that had unusual characteristics and vendors provided almost no information on how to work around issues that could cripple such systems.
Towards the end of the second era of server sizing, processors (at the socket level) became ridiculously powerful such that properly tuned systems ran at very low overall CPU utilization.
However, 2-socket systems in this period did not have adequate memory or IO capability, unless it was a Xeon 7500 or later E7 series system with 2 sockets populated.
And even then, the structure of SQL Server per processor licensing meant that many people opted for the 4-way system regardless of requirements.
In 2012, Intel finally put powerful processors (Sandy Bridge, up to 8-cores) with larger memory and IO capability into the Xeon E5 line (there were both 2 and 4 socket versions). This coincided with SQL Server 2012 per core licensing. Now there was a stronger technical justification and financial incentive to not automatically opt for a 4-socket system with the most number of cores.
It would seem reasonable to make some effort to first determine the total number of processor cores that will meet requirements with some headroom.
Then determine the number of processor sockets, as more processor sockets means more memory sockets (also considering the difference in DIMM sockets per processor between the Intel Xeon E5 and E7 processors).
It is unclear whether anyone actually engaged in the technical analysis versus just making the decision strictly on budget goals.
Today, it is time to consider the astonishing next step: that a single socket system is the best choice for a transaction processing system.
First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for a very large majority of requirements.
We should also factor in that the second generation hyper-threading (two logical processors per physical core) from Nehalem on has almost linear scaling in transactional workloads (heavy on index seeks involving few rows). This is very different from the first generation HT in Willamette and Northwood which was problematic, and the improved first generation in Prescott which was somewhat better in the positive aspects, and had fewer negatives.
A single socket Intel Xeon E5 processor supports 12 DIMM sockets for 384GB with 32GB DIMMs, or 768GB with 64GB DIMMs which is now at a moderate (50%) premium per GB over the 32GB DIMM. In the old days, DBA’s were taught that memory was precious in reducing disk IO to a tolerable level. That was back when a total system memory of 1GB was considered very large.
I noticed that after common server memory configuration reached 32GB memory, disk IO could be brought to a very manageable level, well under the IOPS possible with 100-200 15K HDDs.
Today, all-flash storage is practical.
Definitely in direct-attach storage, and maybe in SAN based storage
depending on the degree of price gouging from your SAN vendor.
So while it might be possible to show that TB-plus memory reduces IOPS from 10,000 to nil, the reality is a modern storage system correctly configured for database applications can easily support 100,000 IOPS if not 1M (which would require NVMe to be practical).
In this case, we should find that memory requirements are far lower than the memory capacity of recent generation 2 and 4-socket systems.
Why not continue to spectacularly over-configure the server system, when even SQL Server per-core licensing is only a small portion of the cost of deploying a critical line-of-business application? The big bosses may not question the costs of a 2-socket or perhaps 4-socket system, even factoring in that there may be several environments between production, DR, QA and development. But if performance is inadequate, it may cost you your job.
The answer is performance. Today, transaction processing performance is heavily dependent on serialized round-trip memory access latency. The term is pointer chasing in that a CPU instruction accesses memory to determine the next memory location to access. No amount of L2/L3 cache can hide this, nor is it possible to pre-fetch memory. (What Intel needs to do is increase Hyper-threading beyond two logical processors per physical core. IBM Power is 4 and SPARC is 8?).
For a processor with integrated memory controllers (AMD from Opteron on, Intel from Nehalem on), local node memory access round-trip latency is probably 50ns. For memory on an adjacent processor, a 1-hop remote node, it is perhaps 100ns. This information is absolutely crucial in understanding system level performance for transactional databases, and yet vendors try to not mention it, instead stressing metrics that are mostly irrelevant.
In a single socket system, 100% of memory accesses are local node, because there is only 1 node. In a two socket system, the expectation is that memory accesses are split 50-50 between local and remote node. In principle, it is possible to devise some strategy such that a higher percentage is local, but SQL Server does not make it practical to implement such a scheme. In the Intel Xeon E7 4-socket systems, 1 out of 4 memory nodes is local, and there are 3 memory nodes 1-hop away. In an 8-socket system, 4 nodes are 2-hop away.
An argument in favor of multi-socket systems is that the number of memory channels scales with the number of sockets. However, in transaction processing, it is memory round-trip latency that dominates, not memory bandwidth. (It is a factor that more memory channels means more memory transactions can be in flight, but it is unclear whether this offsets the increase in memory latency.)
If someone has reasonably comparable systems with 1, 2 and 4 processor sockets, all of the same processor generation, the impact of the number of sockets can be tested.
I am also of the opinion that SQL Server cannot make meaningful use of all the processor cores in a top-of-the-line 2-socket (36 cores for Xeon E5 v3), let alone a 4-socket (72 cores), due to most write operations (data and log) being single-threaded. But I would like to hear opinions on this. Is this addressed in SQL Server 2016 to a degree?
The above applies mostly to transaction processing applications.
It is reasonable to suppose that decision support applications could use both the greater number of processor cores and the greater number of memory channels in a multi-socket system. However, it is unclear whether SQL Server is effective in implementing parallel execution plans when the degree of parallelism is greater than the number of cores in one socket.
I would think that the (formerly) PDW strategy of partitioning and aligning compute to local (system) node memory, applied to a single SQL Server instance, treating each processor socket as though it were a node, which it is, could dramatically improve high-DOP parallel execution plans. But it does not seem that Microsoft is interested in this approach.
I am presuming that column store is better capable of utilizing multi-socket, as it is definitely not encumbered by serialized round-trip memory latency.
The Dell PowerEdge R730 can be purchased with 1 or 2 of 2 sockets populated. The R930 must have 2 or 4 of 4 sockets populated. It appears that the premium from 32GB to 64GB memory modules might be because the 32GB is RDIMM and the 64GB is LRDIMM. The 2400MT/s 32GB RDIMM is $558, and the 64GB LRDIMM is $1483.
For this differential, I would stay single socket, opting for the more expensive 64GB (if necessary) over 2 sockets using the less expensive 32GB, even though I am giving up the 4 extra memory channels.
If the plan is to implement memory-optimized tables in 2016 (which supports more than 256GB? and has fewer other restrictions) and more than 768GB system memory is needed, then go ahead with 2-sockets.
But I would try to get my transaction processing DB under the single socket limit. In the old days, we moved the old data out to the DW system.
For a long time, I have refrained from endorsing the use of Join and Query Hints. The reason is that there is no such thing as query hints in SQL Server except for a few simple items. The key options listed under Microsoft SQL Server Hints (Transact-SQL) that affect join order, join type, and index usage are not really hints.
They are in fact directives, such that the specified option is employed if possible. But join order optimization is essentially disabled, and the execution plan joins tables in the order as written, the exception being EXISTS and NOT EXISTS clauses. It is as if the Query Optimizer is hyper-sensitive and its feelings are hurt when someone impudently suggests a hint. So if a hint is specified, good luck to you. You are on your own.
Microsoft documentation is obviously very thorough on this matter with the following:
“Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend only using hints as a last resort for experienced developers and database administrators.”
But the fact is that there are complicated queries such that it is difficult to estimate the number of rows,
especially past the source table into the intermediate steps.
The query optimizer does not produce a good plan, because the row estimate is grossly erroneous.
This is aggravated by the fact that the plan costs employed by the query optimizer reflect a fixed IOPS model based around four 1995-era 7.2K hard disks, with no consideration for the actual capability of the storage system or whether data is entirely in memory.
So there are times when it is necessary to override the query optimizer. I will not go into the details of query optimization here. Back around 2001, I reverse engineered the execution plan cost formulas for many of the fundamental operations, without bothering to explain the operations or other aspects of query optimization. Since then, Paul White, Benjamin Nevarez, and others have written very detailed papers on query optimization. Roughly, the strategy is a balance of 1) table access efficiency, 2) join efficiency and 3) reducing rows as quickly as possible.
Because join order optimization is disabled when a join or query hint is specified, the key is to write the query to produce a specific join order. There is a method of doing so. See Inside Microsoft SQL Server 2008 T-SQL Querying: T-SQL Querying By Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass on the topic of bushy plans.
Once the correct join sequence is found, most of the join types might already be good. Perhaps a few join hints might be necessary. Join and Query hints do not cause the query optimizer to disable index optimization, so index hints should not be needed as frequently.
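The bushy style can be sketched as follows; the tables and key columns are hypothetical. Parenthesizing the joins forms (B join C) first, before joining that result to A, and FORCE ORDER preserves the written sequence:

```sql
-- Bushy join sketch: TableB is joined to TableC first,
-- then the intermediate result is joined to TableA.
SELECT a.col1, b.col2, c.col3
FROM TableA a
JOIN (TableB b
      JOIN TableC c ON c.b_id = b.b_id)
  ON b.a_id = a.a_id
OPTION (FORCE ORDER);  -- joins execute in the order written
```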
This is a much deeper topic than I can discuss at this point in time. So why am I bringing it up? Today there was a news report on the cause of the Air Asia flight 8501 crash last December. For details, see the
CNN Air Asia crash report.
In brief, there was an equipment malfunction that put the airplane outside of the normal flight envelope. It should have been possible for the flight crew to recover, but they did not do so.
It further turns out: “because the manual provided by the plane's manufacturer said the aircraft, an Airbus 320, was designed to prevent it from becoming upset and therefore upset recovery training was unnecessary.”
In our world, we are told the query optimizer “typically” is best, hence we should not have to do manual query optimization, hence we do not need to learn this. So when we do need to do manual query tuning, we do not know how because we did not learn this.
This is BULLSHIT! I don’t give a damn how good the query optimizer is “typically.” A serious principal DBA needs to learn how to do query optimization so that he/she can do so when the situation calls for it. The starting point is the bushy join style in the Itzik Ben-Gan book.
Here is the link to my previous post on bushy joins; notice that at the time I put caveats on it.
I am not saying go for broke on join/query hints, just that we need to learn to shape the join sequence.
Rob Farley had a great note in the comments. I had seen that query style before, in TPC benchmarks? without understanding at the time what the writer was smoking.
Per Claus's comment linking to
Dynamic Search Conditions in T‑SQL,
I especially like the strategy below, with hard-coded SQL when good parameters are specified,
then something else to handle the unusual, be it OPTION(RECOMPILE) or dynamically built parameterized SQL:
IF @orderid IS NOT NULL
    SELECT ... -- query for the common, well-parameterized case
    WHERE O.OrderID = @orderid
      AND ... -- Conditions on Order Details here.
    -- No OPTION (RECOMPILE) here!
ELSE
    SELECT ... -- same query for the unusual case
    WHERE ... -- same conditions as before
    OPTION (RECOMPILE)
There is a curious quiet from the enterprise storage community on form factor and interface direction for solid state storage, be it NAND flash, Intel 3D XPoint or other. On the client side, personal computing, both desktop and mobile, shows a clear direction favoring the M.2 form factor and PCI-E as the preferred interface for SSD storage. There is a backward compatibility option in M.2 to work with either the SATA or PCI-E interface, but I do not think this will be widely used. SATA and hard disks will not go away, but the primary SSD will be M.2 form factor, PCI-E interface and NVMe host protocol.
On the enterprise side, there is a great deal of deployed infrastructure built around the SAS interface (a superset of SATA), and the small form factor (SFF) for 2.5in HDDs at 15mm height or thickness. The bean counter types would like to wish that SSDs (NAND flash for those who do not like the acronym SSD) would use existing infrastructure, and not just as an interim measure. They are probably still unhappy that Fibre Channel on the back-end had to be abandoned several years ago, being not competitive and a cost burden relative to SAS.
Preserving the value of investment in existing infrastructure is important because people are unhappy when equipment purchased at painfully high cost becomes obsolete. Of course, enterprise storage is only extremely expensive because storage vendors invented justifications for selling inexpensive components at very high markup.
There is also a failure to consider that hardware has an effective depreciation of 30-40% per year due to the pace of progress, which renders the term investment in hardware foolish, or if I were less polite, then completely stupid. So ultimately this is circular logic based on an untenable premise.
That said, it would be possible to build a viable enterprise storage system around either the PCI-E or SAS interface, because both support multiplexing lanes, and there are switch chips for both PCI-E and SAS interfaces. The reason PCs are moving from SATA to PCI-E is that the NAND interface bandwidth is advancing at a faster pace than any single lane connection can support and SATA does not support multiplexing. (In websites catering to desktop hardware, some say that PCI-E is superior to SATA. This is rubbish by writers with poor technical knowledge. The only important fact is whether the interface supports multiplexing.)
The reason existing enterprise infrastructure should be abandoned is not any deficiency in SAS, but rather that it is built around four-lane (x4) uplink and downlink ports. SAS at 12Gbit/s over x4 would only support 4.4GB/s net bandwidth. This might seem high, because enterprise storage vendors sell crappy systems with pathetic bandwidth capability. The other reason is that most existing infrastructure is either the 24-25 bay SFF in 2U or the 15-bay LFF in 3U 19-inch-wide rack mount enclosures designed for hard disks. Both the number of bays and the physical volume are completely wrong for current generation SSDs going forward.
My opinion is that the correct uplink and downlink for solid state (be it NAND flash, Intel 3D XPoint or other) storage enclosures (not individual devices) should be 16 lanes wide or x16. Both PCI-E and SAS have adequate bandwidth and protocols. For PCI-E gen 3 at 8Gbit/s per lane, this would support a net bandwidth of 12.8GB/s. The existing x4 SAS is just too low for an SSD (formerly disk) array enclosure.
The core of this argument is based on the PC standard of a flash controller with 8 channels on the NAND side and PCI-E on the uplink side. Today the NAND interface is 333MB/s, so an 8-channel controller could support 2.6GB/s. There may have been some thought that the upstream side should be PCI-E gen 3 x2 (capable of 1.6GB/s), as it is common to have excess bandwidth capability on the downstream side. But in the PC world, single device benchmark performance is important, so the trend seems to be PCI-E x4 on the controller, with the option to connect only x2 (or even x1?).
In the time of hard disks, client side PCs used 7200 RPM HDDs or less, for lower cost and higher capacity. Enterprise storage was primarily 10K or 15K RPM for greater IOPS performance, although 7200 RPM was adopted for tier 2 storage. (Storage capacity should have been too cheap to meter even on 10K HDDs, but because vendors sold at ridiculously high prices, this created demand for 7.2K drives in enterprise storage systems.)
In the first phase of SSD adoption, enterprise systems preferred single level cell (SLC) NAND with greater write endurance, while the client side was mostly 2-bit MLC, and later some low cost devices were 3-bit TLC. Today NAND flash technology is sufficiently mature that MLC has sufficient write endurance for many enterprise needs. Fundamentally, the performance oriented PC and the enterprise could use the same SSD, just with different over-provisioning and other firmware settings. It would be foolish for enterprise systems not to leverage components developed for client side systems, given the huge volume and low cost structure.
While the standard desktop SSD element is the M.2 form factor with an 8-channel controller capable of x4 on the upstream side, the enterprise strategy should be to connect x2 on the upstream side. In enterprise, it is the performance of the complete array of storage elements that matters, not the single component. The standard storage array enclosure should probably have 16 bays, each connected x2 to the PCI-E switch, and x16 for each of the uplink port and downlink expansion port. The PCI-E switch would have 64 lanes: x16 for uplink, x16 for downlink expansion, and 16 x2 for the M.2 SSDs. The enclosure should work with either 1 or 2 controllers. Existing DAEs have a single (x1) SAS connection to each bay.
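The arithmetic behind the proposed enclosure can be sketched quickly. This is a sanity check, not a spec; the 0.8GB/s per-lane net figure is the one implied by the x16 = 12.8GB/s number above, and the exact net bandwidth depends on encoding and protocol overhead.

```python
# Rough bandwidth arithmetic for the proposed 16-bay M.2 enclosure.
NET_PER_LANE = 0.8                     # GB/s net per PCI-E gen 3 lane (assumed)

uplink = 16 * NET_PER_LANE             # x16 uplink: 12.8 GB/s
downstream = 16 * 2 * NET_PER_LANE     # 16 bays at x2 each: 25.6 GB/s

# one 8-channel NAND controller at 333 MB/s per channel
controller = 8 * 0.333                 # ~2.66 GB/s, more than an x2 link carries

print(uplink, downstream, round(controller, 2))
```

The 2:1 oversubscription between the downstream bays and the x16 uplink is intentional: the array reaches full uplink bandwidth with half the bays active.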
The physical volume for 16 M.2 devices would occupy only one-quarter of a 1U rack. Existing enterprise storage infrastructure has x4 uplink/downlink ports and 2U full-rack enclosures with 24-25 bays connected x1. This is wrong for SSDs on multiple points. Uplink and downlink ports should be x16. The volume of the enclosure should be shrunk by a factor of 8. Connections to each bay should be x2, though 16 bays connected at x1 is weakly viable. Given that existing infrastructure is unsuitable for SSDs going forward, there is no excuse not to adopt the client-side components with M.2 form factor and PCI-E in a new, properly designed infrastructure.
For some reason I cannot respond to comments.
Good question. Sorry about the long winded answer, but life is complicated. I do agree with the shared storage concept in the HDD days, having a common pool of HDDs so that each host can access the aggregate IOPS capability when needed. This, and managing the complex storage system, alone would have justified a good profit margin. But storage vendors felt the need to justify extraordinary margins, hence started to invent reasons, which led to doctrine based on the invented justifications. Any time that happens, it is a fuck-up of the first magnitude. And storage vendors do not seem to understand what bandwidth is, or care about log write latency.
Next, blade systems are a non-starter for databases because they give up DIMM slots and PCI-E slots. So we should stick with rack systems with the full boat of DIMM and PCI-E slots. Today a 4TB PCI-E SSD is do-able. What's missing is some way to match PCI-E SSDs to the available PCI-E lanes. System vendors offer a mix of PCI-E slots, including several x16. Only workstations and HPC systems have x16 cards; servers do not. So we want to connect four PCI-E x4 SSDs to an x16 slot. HP workstations have a card for this, but we need a server version. I can see a 4-socket server with 64-128 PCI-E lanes dedicated to storage, that is, 16-32 x4 PCI-E SSDs, so 64-128TB in one system. All this can be internal: SSDs do not take much space, aren't too heavy, and don't consume much power.
Storage vendors still want to sell horribly expensive AFA storage with features that we don't need, which cannot deliver anywhere close to the bandwidth that is possible. So it really is a fight between the DBA, who wants cheap SSD at high bandwidth, and the SAN admin, who wants to spend a shit load of money, have control over all storage, and make you fill out forms to justify why you need each precious GB of space, all so he/she can deny your request as not sufficient in his/her judgment.
Given that NVMe controllers seem to be PCI-E x4, perhaps the strategy should be x16 uplink and downlink, with 8 x4 bays. There is still a 2:1 mismatch between downstream and upstream. The point being that maximum bandwidth is reached with 4 devices, but there is space for 4 more. Above, I suggested 16 x2 bays.
In my previous post, Hardware rant 2015, some readers reacted to my suggestion that vendors start offering the Intel Xeon E3 v5 in laptop and desktop PCs as if this were an outlandish or impractical idea.
First, doing so requires almost no additional work. Simply substitute 1) the Xeon E3 v5 for the Core i7 gen 6, 2) the server PCH (C236) for the desktop PCH (Z170), which are really the same thing, as are the two processors, and 3) ECC memory for non-ECC, which has 8 extra bits over the normal 64 bits. The cost of this might be one hundred dollars, mostly driven by the premium Intel charges, and only slightly by the 12% higher cost of memory. (It is the Xeon E5 line that would be seriously impractical in a laptop that an old person could easily carry. A young fit person might claim not to feel the difference between 4 and 15lbs, or 2 and 6kg.)
Second, I should explain why ECC memory is so important, far outweighing the extra cost. This is true for user systems, not just servers with 24/7 requirements. As the title states, a PC without ECC protected memory is total crap, no exceptions, unless what you do on the PC is totally worthless, which could be the case for a significant market segment.
Basically, without any check on memory integrity, we may have no idea when and where a soft error has occurred. Perhaps the only hint is the OS or application crashing for no traceable reason, or serious data corruption that has already occurred. Let it be clear that soft errors do occur unless you are deep underground.
Up until the early 1990’s, many if not most PCs sold as desktops and laptops had parity protected memory. Then in the time frame of Windows 3.x, (almost?) all PC vendors switched to memory with no data integrity protection for their entire lineup of desktop and mobile PCs (with perhaps the exception of dual-processor systems based on the Pentium Pro and later, which were subsequently classified as workstations). This was done to reduce cost, eliminating the 1/9th of memory used for parity.
All server systems retained parity, and later switched to ECC memory, even though entry level servers use the same processors as desktops (either with the same product name, or different). The implementation of memory protection is done in the memory controller, which was in the north-bridge in the past and more recently is integrated into the processor itself (starting with Opteron on the AMD side, and Nehalem on the Intel side).
I recall that the pathetic (but valid?) excuse given to justify abandoning parity memory protection was that DOS and Windows were so unreliable as to be responsible for more system crashes than an unprotected memory system. However, since 2003 or so, new PCs have been sold with the operating system shifted to the Windows NT code base, imaginatively called Windows XP.
(In theory) Windows NT is supposed to be a hugely more reliable operating system than Windows 3.1/95, depending on the actual third-party kernel mode drivers used. (Let's not sidetrack on this item; pretend what I just said is really true.) By this time, the cost of sufficient DRAM, unprotected or ECC, was no longer as serious a matter, even though the base memory configuration had grown from 4MB for Windows 3.1 to 512MB if not 1GB for Windows XP or later. And yet there was not a peep from PC system vendors on restoring memory protection, with ECC now being the standard mechanism. (I did hear IBM engineers* propose this, but nothing from PC vendors without real engineers. We don’t need to discuss what the gutless wonders in product management thought.)
Presumably soft errors are now the most common source of faults in systems from Windows NT/XP on. Apple Mac OS (from some version?) and Linux are also protected mode operating systems. So this covers the vast majority of systems in use today. It is also possible that bugs remain in third-party drivers that have not been tested under the vast range of possible system configurations (more so for performance oriented graphics drivers?). Still, the fact that vendors do not regard correcting the most serious source of errors in PCs today as worthwhile is an indication that they consider the work we do on PCs to be worthless crap, which is the same regard we should have for their products.
Let me stress again that putting out PCs with ECC memory does not require any technical innovation. ECC capability has been in entry server systems built from identical or comparable components all along. By this time, Intel memory controllers had ECC capability which could be (factory) enabled or disabled depending on the targeted market segment. (Intel does have dumbed-down chipsets for the low-end PCs, but it is unclear if ECC was actually removed from the silicon.)
A. The Wikipedia article
cites references that mention actual soft-error rates.
There are a wide range of values cited, so I suggest not getting hung up on the exact rate, and treat this as order-of-magnitude(s).
There is a separate entry
for anyone interested in the underlying physics.
Of course there are other Wiki entries on the implementation of ECC.
Briefly, the prevalent source of soft errors today originates with cosmic rays striking the upper atmosphere, creating a shower of secondary particles, of which neutrons can reach down to habitable areas of Earth. Unless the environment is a cave deep underground, there will be soft errors caused by background radiation. The probability of errors also depends on the surface area of memory silicon, so a system with a single DIMM will experience fewer soft errors than a system with many DIMMs.
B. Early memory modules were organized as 8 bits of data plus 1 bit for parity in a 30-pin x9 SIMM. Sometime in the early 1990’s, around the 80486 to Pentium time, 72-pin x36 SIMMs (32 bits data, 4 bits parity) were popular. The implementation was 1 parity bit protecting 8 bits of data for both the x9 and x36 modules. Parity protected memory had the ability to detect, but not correct, single bit errors in an 8-bit “line”.
A few high-end servers in this era had ECC memory, which may have been implemented with 2 x36 memory modules forming a 64-bit line with 8 bits for ECC, or perhaps a custom memory module? Later on, memory modules progressed to DIMMs, having 64 bits of data with allowance for 8 additional bits for ECC. The base implementation of ECC is a 72-bit line with 64 bits for data and 8 bits for ECC. This allows the ability to detect and correct single-bit errors, and detect but not correct 2-bit errors (SECDED). More than 2 bits in error could potentially constitute an undetected error (depending on the actual ECC implementation). There are also other ECC strategies, such as grouping 4 x72 DIMMs into a line, allowing the ability to detect and correct the failure of an entire x4 (or x8?) DRAM chip, when each DIMM is comprised of 18 x4 chips, each chip providing 4 bits of data.
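To make SECDED concrete, here is a toy extended Hamming code over 8 data bits in Python. This is a sketch of the principle only: real ECC DIMMs apply the same idea to a 64-bit line with 8 check bits, and the exact code used by any given memory controller is implementation specific.

```python
# Toy SECDED (single error correct, double error detect) Hamming code
# over 8 data bits: 4 check bits plus an overall parity bit, 13 bits total.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # non-power-of-2 bit positions
CHECK_POS = [1, 2, 4, 8]                 # power-of-2 bit positions

def encode(data):
    """Encode 8 data bits into a 13-bit SECDED word (bit 0 = overall parity)."""
    word = 0
    for i, p in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << p
    for c in CHECK_POS:
        # check bit c covers every position whose index has bit c set
        par = 0
        for p in range(1, 13):
            if (p & c) and (word >> p) & 1:
                par ^= 1
        word |= par << c
    word |= bin(word).count("1") & 1     # overall parity extends SEC to SECDED
    return word

def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'double-error'."""
    syndrome = 0
    for c in CHECK_POS:
        par = 0
        for p in range(1, 13):
            if (p & c) and (word >> p) & 1:
                par ^= 1
        if par:
            syndrome |= c                # failing checks sum to error position
    overall = bin(word).count("1") & 1
    if syndrome and overall:             # single-bit error: correct it
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome:                       # syndrome set but parity even: 2 flips
        return "double-error", None
    elif overall:                        # the overall parity bit itself flipped
        status = "corrected"
    else:
        status = "ok"
    data = 0
    for i, p in enumerate(DATA_POS):
        if (word >> p) & 1:
            data |= 1 << i
    return status, data
```

Flipping any one of the 13 bits is corrected; flipping two is detected but not correctable, which is exactly the SECDED behavior described above.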
C. At the hardware level, if an error is detected and corrected, the operating system and applications continue to function. The event can be logged at the system level. For a detected but uncorrectable error, the hardware should cause a blue screen OS crash.
An undetected error is just that. It is undetected. The system continues running with incorrect memory content.
Depending on the nature of the memory corruption, anything can happen.
It could be executable code, in which case the instruction changes.
It could be critical operating system data, causing subsequent memory accesses to read or write the wrong location, which could have serious corruption consequences. It could also be end data (number, char or control), which may or may not be critical.
* It is probably more correct to say that soft errors are the province of scientists/physicists, not engineers. Sun had perfectly good engineers, but in the late 1990's they had an UltraSPARC II processor with 4M L2 cache in their high-end enterprise system. I believe the L2 data had ECC (SECDED), but the tags were only parity protected (SED). Some of the systems started to experience mysterious failures (the ones located at high-altitude sites?). This was ultimately traced to soft errors. It was not a simple thing to change the L2 cache tags from parity to ECC (logic in the processor itself?), so the temporary solution was to mirror the memory used for tags? (If someone knows the details, please step forward.)
The Wikipedia topic ECC Memory states
"ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing."
It is more correct to say that ECC is used when it is necessary to correct the more common (typically single bit) errors, and to detect certain errors involving more than 1 bit, which cannot be corrected. However, it is possible that some multi-bit errors cannot even be detected.
Donaldvc pointed to this new article on IEEE Spectrum
Much of my knowledge is very old, from back in the days when memory chips were 1-4 bits wide.
Back then, a soft error might affect many memory cells, but only one bit in any given word.
Then as memory became more dense, a soft error could affect multiple bits in a word?
So processors did ECC on a bank of 4 DIMMs = 256 bits of data, 288 bits of memory, which allowed more sophisticated algorithms.
I am not sure what Xeon E3 or E5 has. Xeon E7 is supposed to be very sophisticated.
If someone has free time, please look into this.
It has been a while so I suppose it is time for another rant on hardware.
There are two systems I would like:
One is a laptop.
The second is a server capable of demonstrating extreme IO performance,
with the secondary objective of being small enough to bring along to customer sites.
On the laptop I am looking for
1) quad-core with HT, i.e. 8 logical processors for better parallel execution plans.
2) PCIe SSD, I would prefer 3GB/s+, so PCIe gen3 x4, or 2 x M.2 PCIe x2 is also an option.
3) drive 1, but preferably 2 external 4K monitors (so I can look at complex execution plans)
On this matter, it is time to bitch at the MS SQL Server team that there should be an option to contract the white space in execution plans.
The existing zoom capability is worthless.
Yes I know SQL Sentry Plan Explorer can do this,
but really MS, is it so hard? Or have you outsourced the entire SSMS to some team that does not know that there is such a thing as a complex query?
The reason I want to drive 2 external 4K displays is that at 4K resolution, I need more than a 28in monitor to make use of the resolution.
A couple of days ago, Dell announced the new XPS 15 with Core i7-6700 processors (Skylake),
which I immediately ordered, but unfortunately it shows a shipping date of Nov 16.
It does have a 4K display, and 1 external port which may or may not support 4K.
I thought I ordered the docking station, but I do not know if this would support dual external 4K monitors.
I currently have the Dell 28in 4K monitor, which is great for looking at execution plans,
but at the normal text size setting, it is difficult to read.
I am thinking that the much more expensive Dell 32in 4K monitor will be better, but maybe not enough.
Should I get a 55in 4K TV instead? These all have just an HDMI connector,
so I need to make sure there are proper 4K adapters.
The new XPS 15 data sheet says it has an HDD bay (SATA interface) and one M.2 bay (uncertain if PCIe x2 or x4). It would have been nice if 2 M.2 x2 bays were available instead of the HDD bay.
I ordered the XPS 15 with the PCIe SSD. I do not know if it is a good one (the Samsung SM951 cites 2150MB/s);
if not, I will throw the Dell SSD out and get a good one.
One more thing: ECC memory.
Intel desktop and mobile processors all lack ECC (or parity) memory capability.
ECC support has been built into Intel processors for some time now;
it is just disabled in the Core product lines, and enabled only in the server Xeon line.
So the Skylake equivalent is the Xeon E3 v5.
Intel released the v5 under the mobile group, with a 45W rating.
Unfortunately I cannot find a laptop for sale that uses the Xeon E3 v5.
Perhaps Dell or someone could offer a Xeon E3 mobile system?
Extreme IO Performance demonstrator
First, why do I need such a thing?
when my clients have multi-million dollar SAN storage systems?
Because SAN people are complete idiots on the matter of IO performance,
being locked into matters irrelevant to enterprise DB, like thin provisioning etc.
Invariably, the SAN people (vendor sales engineer, SAN admin etc.) are confused by the fact that Fibre Channel is specified in Gigabits/sec (Gb/s) while all other IO bandwidth is specified in GigaBytes/sec (GB/s).
So we have a multi-million dollar storage system (full of add-on software that has no purpose in an enterprise DB) connected to a powerful server (60+ cores, paying for SQL Server 2012 EE per-core licensing)
over 2 x 8Gbit/s FC links.
Is this stupid or is this exceptionally stupid?
Yes, I know it is extremely rude of me to call other people stupid, and that being stupid is not a crime,
but when you are the vendor for multi-million dollar equipment,
there is a reasonable expectation that you are not stupid.
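The Gb versus GB confusion is worth making concrete. A minimal sketch, assuming the approximate net figures of 0.8GB/s per 8Gbit/s FC link (after 8b/10b encoding) and 0.8GB/s per PCI-E gen 3 lane:

```python
# Net bandwidth of the FC connection vs a single PCIe slot on the server.
fc_links = 2
fc_net_per_link = 0.8                # GB/s net per 8 Gbit/s FC link (approx)
san_bw = fc_links * fc_net_per_link  # 1.6 GB/s total to the SAN

# one PCIe gen 3 x8 slot on the server, ~0.8 GB/s net per lane (assumed)
pcie_x8 = 8 * 0.8                    # 6.4 GB/s from a single slot

print(san_bw, pcie_x8)
```

A 60+ core server has several such slots, so the 2 x 8Gbit/s FC links are an order of magnitude short of what the host can consume.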
So onto the system.
For this, I am sure I need more than 4 cores, so it needs to be the Xeon E5.
Perhaps 8 cores (single socket) is sufficient.
The new Intel SSD DC P3608 has great specs, but I am not sure when it is actually available?
I would put 2 of these in the system to demonstrate 10GB/s.
Ideally this would all go into a box that fits in carry-on luggage, which is unfortunately not one of the standard PC or server form factors.
Another option is a 2 x 12 core system to demonstrate 20GB/s on 4 x P3608.
I would prefer to get a laptop without high performance graphics, the NVidia GTX 960M in this case.
The current Intel graphics is sufficient for high resolution rendering, but I do not need high frame rate.
All the Intel Core i7 6th gen processors have integrated graphics; I wonder if I can remove the GTX (for power savings)?
Apparently Dell will have a new docking station, the Thunderbolt Dock TB15 next year, that will support 2 x 4K monitors?
Did I already rant on PC laptops only being available with 16x9 displays?
How stupid is this? It is one thing for consumer laptops to have a 16x9 display,
on the assumption that the home users just watch movies.
but on what justification does this apply to business and workstation laptops?
Concurrent with the regular Intel Skylake Xeon E3 v5 announcement,
Supermicro announced motherboards for the E3 v5.
There is a micro-ATX (X11SAE-M), but with just 1 x16 and 1 x4 PCIe g3 slot,
whereas the ATX (X11SAT) has 3 slots with 16/8/8 as an option.
This would let me put in 2 P3608s? for 10GB/s?
Last week, a routine performance test ran about twice as long as expected.
A check of dm_exec_query_stats showed that CPU bound statements (worker time roughly equal to elapsed time)
were approximately 3 times higher than previous tests for matching SQL statements.
Almost all of the SQL involved single or few row index seeks, usually Select, some Insert and Update.
The server system is a 2-socket Xeon E5-2680 (8 cores, 2.7GHz nominal, Sandy Bridge) in a managed data center.
The data center had sent out notice that there would be system restarts the previous weekend,
which could mean either OS patches or BIOS/UEFI updates.
So naturally the next thing to do is check the Processor Information object for the Processor Frequency
and % of Maximum Frequency counters (or equivalent).
This showed 135, as in 135MHz, and 5% of maximum.
Another system of the same model also rebooted showed 1188 and 44%.
This issue has occurred previously in this environment and in other HP systems that I am aware of.
The BIOS (or UEFI) update puts the system into one of the energy efficient configurations.
It could also be an operating system setting, but most that I have seen are BIOS settings?
One can imagine a firmware engineer being so committed to green activism
that this was made the default on BIOS updates without discussion with other parties.
Perhaps there is a facility (in Houston?) with inadequate air conditioning for the number of systems,
that this setting was put in to prevent the lab from overheating.
Then no one remembered to exclude the step in the production code?
Not that I have ever done such a thing (and no further questions on this should be asked).
Another question might be why the data center monitoring team did not check for this, as it has happened before.
The whole argument for going to a managed data center instead of a simple hosted data center was
that the managed data center could provide the broad range of expertise that is not economical for a mid-size IT department.
Obviously this managed data center did not monitor for the performance/power configuration.
This matter is of serious concern to production DBAs and IT staff in handling operations.
As the Processor Information performance object with extended information was only introduced in Windows Server 2008 R2,
many software monitoring tools may not alert on changes of Processor Frequency,
especially after a reboot.
Imagine the IT staff or DBA encountering this for the first time on the production servers,
with users complaining, your boss watching over your shoulder, and his/her boss hovering over your boss,
offering their helpful insights in the non-judgemental manner as bosses do.
However, I am more interested in a different aspect of this incident.
When there are two sets of data, one for the processor cores at 2.7GHz and another at presumably 135MHz,
we can extrapolate parameters of interest.
Does it seem stunning that the drop from 2.7GHz to 135MHz, a factor of 20,
only decreases CPU efficiency (increase CPU-sec, or worker time) by 3X?
Perhaps, but this actually should have been expected.
The salient aspect of modern computer system architecture is the difference between CPU clock cycle and memory access time.
A young person might not know, but old timers would know.
Up to about 20 years ago, the primary memory performance specification was access time,
with 80, 70 and 60 ns being common in fast page mode and extended data out.
Then with the switch to synchronous DRAM (SDRAM),
the key specification changed to data rate. In the Xeon E5 (v1) generation, DDR3 at 1333MHz was common.
This means a memory channel can deliver one line every 0.75ns, or 1.333 billion times per sec,
with a line being 64-bits (excluding ECC bits).
The Xeon E5 26xx series has four DDR3 channels.
(The Intel processor internally is shown as having 2 memory controllers,
each controller driving 2 DDR channels, so channel can have different meanings depending on the context.)
What is less commonly cited is the round trip latency, from a processor issuing a request to memory,
the internal memory access within the DRAM chip and finally the transmit time back to the processor.
(The L1, L2 and L3 cache sequence is also involved in memory access timing.)
For local memory (attached directly to the processor) this is around 50ns.
For memory on an adjacent processor, the round trip time might be 95ns or so.
On a 2.7GHz processor, the CPU cycle time is 0.37 ns, so 50ns for local memory round trip access is 135 CPU cycles.
This particular system has 2 sockets, so one might expect that half of memory accesses are local with 50ns round-trip latency,
and half are remote at 95ns latency.
This is a well understood issue.
Two methods of addressing the disparity between CPU cycle time and memory access are 1) large cache on the processor,
and 2) pre-fetching memory.
Current Intel processors have dedicated 32KB I+D L1 and 256K L2, both per core,
and an additional shared L3 cache sized at 2.5MB per core.
From Pentium 4 on, the processor fetches 64 bytes (the cache line size) with an option to prefetch the adjacent cache line.
Prefetching is exposed in the instruction set architecture (the PREFETCH hint instructions, from SSE on),
and there should also be a BIOS/UEFI setting for hardware prefetch.
Now let's visualize the (simplified) code sequence in a relational database engine with traditional page-row data structures.
There is a memory access for the index root level page.
Read the page to find the pointer to the second level page.
Memory access, and repeat. It is a sequence of serialized memory accesses with poor locality (so cache can only help so much),
and the next location is not known until the current memory request completes, so prefetching is not possible.
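The serialized chain can be sketched as follows, with each dictionary lookup standing in for one round-trip memory access; the page names and contents are hypothetical, a 3-level index for illustration:

```python
# Each "page" read yields the pointer to the next page, so the next
# access cannot begin (or be prefetched) until the current one completes.
pages = {
    "root":  {"keys": [100, 200], "children": ["p1", "p2", "p3"]},
    "p2":    {"keys": [150],      "children": ["leaf1", "leaf2"]},
    "leaf2": {"rows": {150: "row-150", 175: "row-175"}},
}

def index_seek(key):
    accesses = 0
    name = "root"
    while True:
        page = pages[name]   # one dependent round trip (~50ns on real hardware)
        accesses += 1
        if "rows" in page:
            return page["rows"].get(key), accesses
        # choose the child slot covering this key range
        slot = sum(1 for k in page["keys"] if k <= key)
        name = page["children"][slot]

print(index_seek(175))   # 3 levels deep: 3 serialized round trips
```

The point is structural: a 3-level B-tree seek is 3 dependent latencies in a row, and no amount of bandwidth or prefetching shortens the chain.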
Modern processor performance characteristics are very complicated,
but we will attempt to build a very simple model focusing on the impact of round-trip memory access latency.
Start with an imaginary processor with a cycle time equal to the full round-trip memory access time.
In this scenario, one instruction completes every cycle, be it an arithmetic or logic or memory access instruction.
Such a system may have never existed so now consider a system where the round trip memory access latency
is some multiple of the CPU cycle time.
The average time to complete an instruction, where time is in units of the memory access latency
(50ns, or 20MHz, for the local node),
“a” is the fraction of instructions that involve a (non-cached, non-prefetchable) memory access,
and “n” is the processor frequency, is:
(1-a)/n + a
The term (1-a) is the fraction of instructions that are either not memory access, or memory access to cache
(from previous access or pre-fetched).
“1/n” is the processor cycle time (in units where memory access time is 1).
Performance (instructions per unit time), the inverse of average time per instruction is:
P = 1 / ( (1-a)/n + a )
= n / (1 +(n-1)*a )
We can see that the above equation has the characteristic that as processor frequency increases, the upper bound on performance is:
n -> infinity, P = 1/a
Also, if the fraction of instructions that require memory access, “a”, is zero, then P = n.
Does the above look familiar? It is just Amdahl’s Law, which was formulated in the old days to demonstrate the limits of
vectorization in supercomputers.
I have just recast it to express the limits of increasing processor frequency relative to round-trip memory access time.
If someone would like to check my math, please do so. It has been a long time.
Try tricking your teenage son/daughter into doing this as a practical math exercise?
OK, anybody still reading is obviously not deterred by math, or knows the trick of skipping such things.
What am I really after?
In the above equation, what is known is processor frequency relative to memory access latency.
While we know the performance or worker time of certain queries,
we do not know it in terms of instructions per CPU-cycle.
And the second item we do not know is the fraction of instructions that introduce a round-trip memory access latency
that cannot be hidden with cache or pre-fetching.
But, we have data points at 2 frequencies, 2.7GHz and reportedly 135MHz.
Express the relative performance between the two points as a ratio.
P2/P1 = R
Then from the two equations
P1 = 1 / ( (1-a)/n1 + a )
P2 = 1 / ( (1-a)/n2 + a )
we can solve for a in terms of the known values n1, n2 and R.
a = (n2 – n1*R) / ( n1*n2*(R-1) + n2-n1*R )
Assuming a memory access latency of 50ns, the base frequency of 20MHz corresponds to memory access in 1 cycle.
Plugging in the values n1 = 135MHz / 20MHz = 6.75, n2 = 2700/20 = 135 and R = 3.
We get a = 0.059, meaning 5.9% of instructions incurring a non-cached, non-prefetched
round-trip memory access latency would result in a 3:1 performance ratio between 135MHz and 2700MHz.
(Perhaps it would be more correct to estimate the round-trip memory access latency as the average
of the local and 1-hop remote node, about 75ns?)
So while it might seem astonishing that the difference between 135MHz and 2700MHz
translates to only 3X performance,
the database transaction processing workload is an extreme (but important) case.
There are many workloads which exhibit better locality
or have memory access patterns that are amenable to prefetch
and have performance scaling better with frequency.
Earlier, two methods of hiding round-trip memory access latency were mentioned.
There is another: Hyper-Threading. The processor core appears as two (or more)
logical processors to the operating system.
Presumably, there is an extra set of program counters, and resources to determine which physical registers
(different from the registers specified in the instruction set architecture)
are assigned to each logical processor.
In the earlier example, say that the round-trip memory access time is 135 CPU-cycles
and the fraction of instructions that incurs the full round-trip latency is 6%.
Then for 100 instructions, 94 execute in 1 cycle each (excluding consideration of superscalar execution),
as either not involving memory access or having data already in cache,
and the 6 that incur the round-trip memory latency take 135 cycles each.
Then the total time in terms of CPU-cycles is 94*1 + 6*135 = 904.
In other words, only 100 cycles out of 904 are used, the rest are no-ops.
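The cycle accounting in code form, nothing beyond the arithmetic just stated:

```python
# 94 of 100 instructions complete in 1 cycle; 6 stall for the full
# 135-cycle memory round trip (50ns at 2.7GHz).
instructions = 100
mem_latency = 135                      # cycles

cycles = 94 * 1 + 6 * mem_latency      # 904 total cycles
utilization = instructions / cycles    # fraction of cycles doing useful work

print(cycles, round(utilization, 3))
```

About 11% utilization. The other ~89% of cycles are idle waits, which is exactly the headroom that additional Hyper-Threading logical processors could fill.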
The Intel Xeon processors from Nehalem on implement Hyper-Threading with 2 logical processors
on each physical core. (This can be disabled in BIOS/UEFI. Some models have HT disabled.
The earlier Intel Pentium 4 based processors implemented a more complex form of HT.)
In considering the nature of the database transaction processing workload,
being a chain of memory accesses in which each access determines the next,
it is perhaps time for Intel to increase the degree of HT,
especially considering that the server oriented Xeon E5 and E7 models are already 1 full year or more behind the
smaller desktop/mobile processor variants.
I seem to recall IBM POWER as having 4 logical processors per physical core,
and one of the SPARC processor lines as having 8.
It would also be necessary to have a good strategy for using HT based on workload.
The option to enable or disable HT in the BIOS/UEFI is not what I mean.
HT should be visible to the OS. But the application itself should detect the presence and degree of HT,
and make its own decision on whether HT should be used and how it should be used.
Xeon Phi, Many Integrated Core
Another item worth mentioning here is the Intel many integrated core (MIC) architecture,
codename Knights something, now Xeon Phi.
The processor puts in many smaller processor cores, 61 in the 22nm Knights Corner,
versus 12-18 in the 22nm mainline Xeon processors.
The theory behind many smaller cores stems from one of the two main elements of Moore's Law.
Doubling the number of logic transistors/complexity in a single core should translate to about a 40% performance gain.
(This was the case up to several generations ago.
Since then, Intel no longer tries to double the logic from one process to the next.
There might be 10-20% performance gain in general instructions.
Some effort is given to expanding the functionality of the special/vector instructions.
And most effort has been in increasing the number of cores.)
One manifestation of this (more logic transistors) could be increased frequency (which Intel stopped pursuing years ago).
Another might be more execution ports (8 in Haswell) or other changes that improve instructions per cycle (IPC).
Following the rule of 2X transistors per 1.4X (square root of 2) performance backwards,
the expectation is that a processor 1/4th the size would have 1/2 the performance.
But potentially there could be 4X as many cores, depending on interconnect and power limitations.
So in workloads that are amenable to vectorization, or that can otherwise be parallelized,
many smaller cores could be the better strategy.
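The transistor-to-performance rule above (often cited as Pollack's rule) fits in a few lines; the square-root relationship is the assumption here, not a measured result.

```python
# Pollack's rule sketch: single-core performance ~ sqrt(transistor budget).

def core_performance(relative_transistors):
    return relative_transistors ** 0.5

small = core_performance(0.25)   # a core 1/4th the size...
print(small)                     # ...has 1/2 the performance: 0.5
print(4 * small)                 # but 4 such cores give 2X aggregate: 2.0
```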
The Xeon Phi is targeted to HPC workloads, as reflected in the 512-bit SIMD instructions.
If we were thinking about a transaction processing database engine on the MIC architecture,
we would probably consider a very basic ALU without SIMD (not sure on FP).
I am thinking that an asymmetric processor architecture might be the objective.
Perhaps two powerful cores from the current main line,
and many simpler cores (without SIMD) perhaps even simpler than Atom?
(The Intel Xeon Phi line implements Hyper-Threading with 4 logical processors per physical core.)
As said earlier, the nature of database page storage along rows makes serialized memory access
(also called pointer-chasing code) its hallmark.
This is why there is interest in column storage architecture.
Now all of a sudden, for certain database workloads, the next memory access is 4 bytes over,
already in cache.
The work a little further down touches memory in the next 64-byte line or two away.
Both the software and the hardware know this, and either is capable of issuing a prefetch.
It does not matter that columnstore must touch more data.
The processor can stream huge volumes of data much more effectively
than it can pointer-chase through only the necessary rows.
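The contrast can be sketched in code. The linked structure below stands in for row-store pointer chasing (each load depends on the previous one), the plain list for a contiguous column; this is an illustration of the access pattern, not a database implementation.

```python
class RowNode:
    """One 'row'; nxt is the pointer that must be chased."""
    def __init__(self, value, nxt=None):
        self.value = value
        self.nxt = nxt

def sum_by_pointer_chasing(head):
    total = 0
    node = head
    while node is not None:   # serialized: must load this node to find the next
        total += node.value
        node = node.nxt
    return total

def sum_column(column):
    return sum(column)        # sequential scan; next value is adjacent, prefetchable

values = list(range(1000))
head = None
for v in reversed(values):    # build the linked rows 0..999
    head = RowNode(v, head)

print(sum_by_pointer_chasing(head) == sum_column(values))   # True
```

Both compute the same aggregate; the hardware-visible difference is that the column version has no load-to-load dependency chain.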
I should probably provide some background here.
As I said earlier, modern microprocessors are very complex.
Pipelined execution was introduced (in Intel processors) with the 80486 (1989) and
superscalar execution with Pentium (1993).
Pipelined means that while the processor can complete an instruction in each cycle,
the actual start to finish time of a specific instruction occurs over several cycles.
Intel does not talk about pipeline stages any more, but there are occasional references to Core2
and later processors having a 14+ stage pipeline.
(Wikipedia says Core2 has a 12-14 stage pipeline, Nehalem and later 20-24, and Sandy Bridge 14-19.)
Superscalar means that there is more than one execution unit, with the goal of completing more than one instruction per cycle.
Haswell has 8 execution ports; processor generations several prior had 6 ports.
We could apply the principle of Amdahl's Law on scaling performance to any and all of pipelining, superscalar execution,
and round-trip memory latency, and probably other things too.
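Applied to the memory-latency case, Amdahl's Law makes the point sharply. Using the earlier 904-cycle example (810 cycles stalled on memory, 94 computing):

```python
def amdahl_speedup(improved_fraction, factor):
    """Overall speedup when improved_fraction of the time is sped up by factor."""
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / factor)

# An infinitely fast core only helps the 94/904 compute fraction:
print(round(amdahl_speedup(94 / 904, 1e9), 3))   # ~1.116
# Halving memory latency improves the 810/904 stalled fraction by 2X:
print(round(amdahl_speedup(810 / 904, 2.0), 3))  # ~1.812
```

Which is why the argument here is for attacking round-trip latency rather than, say, adding more execution ports.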
Rethinking Computer System Architecture
I have said this elsewhere. It is long past due to do a clean-sheet system architecture with matching changes to OS architecture.
Current system architecture stems from the 1970s: a processor with physical memory (8MB was big) and a page file on disk.
Why do we still have a page file on disk?
In the old days, there was not enough physical memory such that it was tolerable to have a page file
on disk to support a larger virtual address space.
Today, more than 1TB of physical memory is possible and affordable (compared to the cost of SQL Server per-core licensing).
But the key factor is in how memory is used.
Back then, it was mostly for executable code and internal data structures.
The assumption was that very few database data pages would actually be in memory at any given point in time.
Today, a very large percentage of memory is used for caching data pages.
Of the memory used for executable code and internal data structures, most is junk.
The CPU-cycle time to memory access time discrepancy dictates that the more urgent strategy is to get memory closer
to the processor even if it means drastically reducing the size of true memory, to perhaps a few GB per socket.
Given that DRAM is so cheap, we would still have systems with multi-TB DRAM capacity, except that this would now be the page file.
Of course the operating system (and applications) would have to be designed around this new architecture.
Given how well the Intel Itanium software coordination went, I guess this might be too much to expect.
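To put rough numbers on the trade (all latencies and the hit rate here are illustrative assumptions, not measurements): a few GB of low-latency memory per socket, backed by a multi-TB DRAM "page file", wins whenever the fast memory captures most accesses.

```python
def avg_access_cycles(hit_rate, fast_cycles, slow_cycles):
    """Average memory access time for a two-level arrangement."""
    return hit_rate * fast_cycles + (1.0 - hit_rate) * slow_cycles

flat_ddr4 = avg_access_cycles(1.0, 135, 135)   # everything in DDR4: 135 cycles
near_mem = avg_access_cycles(0.95, 60, 300)    # assumed 95% hits in fast near-memory,
                                               # misses serviced from the DRAM "page file"
print(flat_ddr4, near_mem)                     # roughly 135 vs 72 cycles
```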
Back in Feb, I went on a diatribe-rant against the standard SAN Vendor configuration practice.
Problem with Standard SAN Configuration IO Performance,
article and accompanying post,
showed IO performance metrics for a batch driven transaction processing workload
on a SAN managed by a data center/cloud company.
The only option offered by the service provider was to request volumes for storage.
There was no consideration for special IO characteristics of transaction processing or other database workloads.
This practice is doctrine pontificated by SAN vendors,
sending SAN admins on a mission to implement the "storage as a service" concept
while remaining completely blind to the requirements of mission-critical databases.
Ok, I am venting again.
Now I have performance metrics for the same workload,
except that storage is on a Violin system.
The system is different in having 24 physical cores, no HT, and 256GB memory,
versus the previous system with 16 physical cores, HT (32 logical), and 384GB.
Below are the IO characteristics.
The horizontal axis time scale is 5 min per major division for 1 hour across the entire chart.
Each tick is 1 minute. Data points are every 15 sec.
Note that the Feb (HDD) charts were 1 min per major division, 15 min total, with data point every 5 sec.
[Charts: IOPS, data read (green) and log write (red); IO latency in ms per read or write]
The obvious difference between the IO characteristics on Violin and the previous HDD-based storage
is that checkpoints now have almost no visible impact on performance.
Both CPU and transactions/sec are very steady, with slightly noticeable blips,
versus the severe drops before.
It is evident that checkpoint writes now have almost no impact on data reads or log write IOPS.
The same is true of IO latency, in milliseconds per read or write.
If the storage had been HDD-based but with logs on separate physical disks,
we would expect the checkpoint to drive up data read latency, and hence depress data read performance.
But it would have no impact on log write latency, and hence no impact on log write performance.
The lower data read rate should have only moderately depressed performance, not severely.
The difference in system processor, 24 physical cores without HT
versus 16 cores plus HT, is not a factor in the IO characteristics.
The difference in physical memory, 256GB on the system with Violin storage and 384GB on the system with HDD storage,
is evident in the data read IOPS: starting at 7-8K IOPS then drifting down to 2-3K IOPS on the system with less memory,
compared with mostly 1K IOPS on the system with more memory.
Both storage systems can easily handle 20K IOPS.
The main argument here is not that SSD/Flash storage is a requirement for transaction processing databases,
even though there are clear benefits.
(NAND flash based SSD have both maturity and cost-structure to be very attractive for any new storage system purchases.)
The point is that there is a severe problem with the SAN vendor doctrine of one common pool for all volumes.
This very severe problem can mostly and easily be mitigated
simply by having separate physical disks for the log volume.
So the point could and should be demonstrated by showing the IO performance on an HDD SAN with separate physical disks for logs.
But this violates the SAN doctrine of ignoring user requirements,
and would not be considered or allowed by the service provider under any circumstance.
So the only real solution is to keep performance-critical databases off a storage system
administered by a team on a different mission than supporting the database.
Below are excerpts and the graphs from the Feb article.
Standard SAN Configuration IO Performance 2015-02
The chart below is CPU. The horizontal axis is time.
One major division marked by the vertical line is 1 minute, and the small tick is 12 sec.
The data points are 5 sec. There are 12 steps between each major division.
The vertical axis is overall (system) CPU utilization in percent.
Each of the stacked green lines represents an individual processor.
There are 16 physical cores and 32 logical.
A single logical core at 100% utilization would show a separation of 3.125% to the line below.
On the second chart, the red line is the performance monitor object: SQL Server:Databases, counter: Transactions/sec.
Note that the vertical axis is log-scale, base 10. One major division is a factor of 10.
Each minor tick is an integer.
The first small tick up from a major tick is 2, the next is 3 and so on to 9 for the last small tick.
The third chart is IOPS. Green is data reads, blue is data writes, and red is log writes.
The vertical axis is log scale.
The fourth chart is IO latency, milli-sec per IO.
The same color coding applies as for IOPS.
Again the vertical axis is log scale.
[Chart: IO latency, ms per read or write]
The charts provided here illustrate my complaints against SAN vendor doctrine:
obstinately adhering to the concept of one large pool of disks
from which all volumes are created for any purpose (data, log, and
junk non-DB stuff).
There is no consideration for the radically different characteristics of hard disks
in random versus sequential IO
(low vs. high queue depth IO behavior should also be an element of IO strategy).
The architecture of all traditional relational database engines are built on the premise
that high volume log writes are possible at very low latency (using dedicated disks)
in order to provide durability of transactions.
And yet SAN vendors blithely disregard this,
because it is at odds with a doctrine derived from principles invented
to justify their mission: selling inexpensive commodity hardware components at extraordinary prices.
Transaction Performance Data
The SQL activity is batch driven transaction processing.
There are 14 or so threads concurrently looping through a set of items to be processed.
Each item involves about 20 rows of insert or update activity,
hence 1000 log writes per sec corresponds to approximately 20,000 transactions/sec on the SQL counter.
Most of the active data is in memory.
There are probably 30-40 SELECT rows for each transaction
or twice as many reads as writes.
The data IO reads necessary to support the 20,000 inserts and updates/sec is about 2,000/sec,
which the storage system is capable of supporting at about 4ms latency.
This is because the data resides on a small part of each disk.
The actual latency for random IO is less than the expected value of 7.5 ms
for data randomly accessed over an entire (10K) disk at queue depth 1.
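The arithmetic behind these figures, for reference; the 4.5ms average seek is an assumed value chosen to reach the 7.5ms expectation.

```python
# 10K RPM rotational latency: half a revolution on average.
rpm = 10_000
half_rev_ms = 60_000 / rpm / 2       # 3.0 ms
assumed_avg_seek_ms = 4.5            # assumption for full-stroke random access
print(half_rev_ms + assumed_avg_seek_ms)   # 7.5 ms expected at queue depth 1

# Log-write to row-rate relation stated earlier:
log_writes_per_sec = 1_000
rows_per_item = 20
print(log_writes_per_sec * rows_per_item)  # 20000 insert/update rows per sec
```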
For approximately 20 seconds out of each minute,
the transaction rate drops from the peak value of 20,000
all the way down to about 8,000 per sec
(noting the log scale).
These are the checkpoints, when the data write IO surges to 20-50K IOPS,
(which demonstrates that the storage system is actually pretty decent)
though write latency is driven up to 50-90ms.
The checkpoint surge also pushes log write latency up from 1ms to 20-30ms.
This dataset occurred during the day,
when presumably there was activity for other hosts on different volumes
but on the same SAN.
At night, log write latency away from checkpoints could be under 0.3ms even at high volume.
The Storage System
I had not previously discussed the storage configuration in detail.
The storage system consists of 240 x 10K HDDs only, with the standard system level caching.
The SQL Server host is connected to the SAN over 4 FC ports
(8Gb from host to switch, 4Gb from switch to SAN, and presumably 4Gb on the SAN backend?).
The data is distributed over 8 volumes.
The log is on a separate volume as seen by the host OS.
The problem is that on the SAN, all disks are aggregated into a single pool, from which volumes are created.
This was done per standard SAN vendor doctrine.
Their magically great and powerful "value-add" intelligence would solve all your performance problems.
We cannot ask for dedicated physical disks for the log because the SAN was already configured,
with the SAN admin getting assistance from the SAN vendor's field engineer
who followed the SAN vendor's doctrine.
Input from the DBA team was not solicited,
and would have been ignored in any case.
Besides, there are no unallocated disks.
And no, the SAN team will not buy more disks, because there are no empty bays in the disk enclosures.
And there is no room for more enclosures in the storage rack.
So the DBA request is denied.
Even if we put up the money to get an extra cabinet for one more disk enclosure,
the SAN admin will still refuse to configure dedicated physical disks for the log volume
because the SAN vendor said that their great and powerful SAN will solve all performance problems.
Any problems must be with their application and not the SAN.
As can be seen from the charts above, this is a load of crap.
The SAN Vendor Solution
As I said above, this particular SAN is comprised of 240 or so 10K HDDs.
Naturally the SAN vendor's proposed solution is that we should buy more of their value-add products
in the form of auto-tiering SSD-HDD, and perhaps additional SSDs as flash-cache.
The particular SAN with base features probably has an amortized cost of $4,000 per HDD.
So a SAN with 240 disks would cost just under $1M (while still failing to provide the desired database performance characteristics).
A mid-range SAN might have an amortized cost of $1,500-2K per disk; an enterprise SAN could be $4-6K per disk.
The additional value-add features would substantially increase the already high cost,
while providing only minor improvement, because the checkpoint IO surge will still drive up log write latency.
The sad thing is that the SAN group might buy into this totally stupid idea,
and refuse to acknowledge that the correct solution is to simply have dedicated physical disks for the logs.
If there were dedicated physical disks, the checkpoint data IO surge goes to completely different physical disks
than the log disks.
In the specific example, it is not necessary to have separate FC ports for the logs.
The 50K IOPS at 8K per IO generates 400MB/sec, which is only 25-33% of the realizable IO bandwidth of 4 x 4Gbit/s FC ports.
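Checking that bandwidth claim (the ~400MB/s realizable payload per 4Gbit/s port is my assumption, accounting for 8b/10b encoding and protocol overhead):

```python
iops = 50_000
io_size_kib = 8
surge_mb_per_sec = iops * io_size_kib / 1024       # ~390 MB/s checkpoint surge
ports = 4
port_mb_per_sec = 400                              # assumed realizable per 4Gb/s port
utilization = surge_mb_per_sec / (ports * port_mb_per_sec)
print(round(utilization, 2))                       # 0.24: about a quarter of the pipes
```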
The checkpoint IO surge would increase latency on data reads,
but the primary reason for the performance drop is the increase (degradation) in log write latency.
Another angle is changing the checkpoint parameters in SQL Server,
but the real problem is that we are prevented from leveraging the pure sequential IO characteristics of HDDs
by allocating data and log volumes from a common pool.
One more item. In the old days before we had immense memory,
the typical transactional database data read/write mix was 50/50.
This is because a read miss forces a dirty page to be evicted and written to disk.
In this situation, a data write IO surge would also depress the data reads necessary to support transactions.
So the standard practice in those days was to set the checkpoint interval
to infinity to prevent data write IO surges.
In our case, very few data reads are necessary to support transactions,
so even though the checkpoint surge might depress data reads, it should have a lesser impact on transactions.
It is the increase in log write latency that is depressing transaction volume.
Solutions that work
A solution that would work is simply to have separate dedicated physical disks for the log volume.
It is that simple!
And yet this is not possible because the SAN people would refuse to do this, as it is not in their agenda.
It is unfortunate that the only practical solution is to get the critical database off the corporate SAN.
I have done this by going to all flash in the form of PCI-E SSDs.
That is, SSDs installed internal to SQL Server system.
This is not because the exceptional performance characteristics of SSDs were required.
It was because I needed to get away from the SAN admin and his dogmatic adherence to SAN vendor doctrine.
The IO performance requirements could have been met with a direct-attach HDD array (or on a SAN).
But anything with HDD enclosures would have been under the authority of the SAN admin,
who would have nixed any storage component that was not a SAN,
and then configured it according to SAN vendor principles.
I have used the excuse that PCI-E SSD "accelerators" are needed for tempdb,
which are not really "storage",
hence there is no IT department mandate that they be on the SAN,
under the absolute control of the SAN admin.
In fact there were no special requirements for tempdb different from that of data.
Then for unrelated reasons, there was enough capacity to put the entire DB on local SSD.
Perhaps a file group with non-critical objects could reside on the SAN
to carry the pretense of the local SSD not really being storage.
Note that I have not naively suggested engaging in constructive dialog with the SAN team.
They are on a holy mission that has no alignment with their company/organization's mission.
Anything that contradicts SAN scripture is heresy.
Oracle database machine has been described as hardware optimized for Oracle database.
In fact the true mission is to take the SAN admin out of the loop.
I think HP offered appliance-oriented systems for SQL Server in 2012(?),
but the only system option was the DL980(?), which addresses a severely narrow market segment.
There need to be DL380 and DL580 options as well.
A SAN is simply a computer system that resides between storage elements and hosts (storage clients)
providing volumes (similar to any other client-server application).
One practical feature is that the SAN can provide protection against single component failure.
I have no objection to this concept, nor to the fact that the SAN vendor
wants to sell hardware at a list price of 10X markup to cost.
Strangely, it seems that people want to buy expensive stuff (just look around outside of IT).
Consider that an organization might spend anywhere from tens of millions to a few billion dollars to develop the main database application.
It does not seem right (to a high level executive) to put such an application on a $100K server plus $200K storage when there are $1M servers and $6M storage systems available.
Never mind whatever might be the consequential differences between them.
Or the fact that there are supreme technical challenges in scaling on (hard) NUMA,
and oh yeah, the chief architects have never heard of NUMA.
The point here is that people will buy very expensive SAN systems without technical understanding or justification.
There is a perfectly sound business justification.
What the client needs is a system supported by deep experts.
Not some rookie field engineer who incorrectly chooses to replace the memory board with no uncorrected errors instead of the board with uncorrected errors.
My observation is that SAN vendors feel a need to have technical justification for selling a storage system with extraordinarily high markups.
So the vendors create reasons.
(Technical justifications created for marketing requirements tend to have serious logical flaws, but a creative marketing professional is not easily deterred.)
It follows that, if we pretend these technical justifications are valid,
then the "best" practices for employing a SAN should be blah, blah, as derived from the underlying reason.
Do I need to explain the consequences of implementing practices built on a lie?
I will add that there is an absolutely critical value-add that the storage system vendor must provide
that alone justifies very expensive pricing.
This is integration testing, verifying that the complex system with very many components work well together.
The hard disk vendors are generally good at validating that their products work to specification in desktop and workstation configurations with one or even a few drives.
Ensuring that a storage system with several controllers, thousands of HDDs, dual controllers, etc. works correctly is vastly more complicated.
This is a valid reason.
Building practices on a valid reason has benefits.
In a large environment, there might be hundreds or thousands of servers, database instances, SQL Server or other.
Managing very many systems and databases economically requires a high degree of automation,
usually implemented with standard configurations.
However, there are some databases that are critical, either in being required for day-to-day operations or in providing a competitive advantage over other options.
One indicator of this is that there are DBAs and developers dedicated to a specific application.
In this case, the argument that customizing the storage configuration is not feasible given the SAN team's other responsibilities is total BS.