Server system memory capacities have grown to ridiculously large levels
far beyond what is necessary now that solid-state storage is practical.
Why is this a problem?
Because the requirement that memory capacity trumps other criteria has
driven system architecture to be focused exclusively on low cost DRAM.
DDR DRAM, currently in its fourth version,
has low cost, acceptable bandwidth,
but poor latency.
Round-trip memory access latency is now far more important
for database transaction processing.
The inverse of latency is practically a direct measure of transaction processing performance.
In the hard disk era, the importance of memory in keeping IO
volume manageable did trump all other criteria.
In theory, any level of IO performance could be achieved
by aggregating enough HDDs.
But large arrays have proportionately more frequent disk failures.
Although RAID provides fault tolerance,
the failed drive rebuild process causes degraded performance for an extended period.
The practical upper bound for HDD storage
is on the order of 1000 disks, or 200K IOPS.
If it was necessary to purchase 1TB of memory to accomplish this,
then it was money well spent because the storage was
even more expensive.
Since then, storage with performance requirements has been, or should be, transitioning to all-flash,
and not some tiered mix of flash and HDD.
Modern solid-state arrays can handle up to 1M IOPS,
and far more, easily, when built on a full NVMe stack.
Now is the time to re-evaluate just how much memory
is really needed when storage is solid-state.
There are existing memory technologies, RLDRAM and SRAM for example,
with different degrees of lower latency, higher cost
and lower density at both the chip and system level.
There is potential to reduce memory latency by a factor of two or more.
The performance impact for database transaction processing
is expected to be comparable to the latency reduction.
A single-socket system with RLDRAM or SRAM memory could replace
a 4-socket system with DDR4.
Practically any IO volume, even several million IOPS,
can be handled by the storage system.
The key is that the CPU expended by the database engine
on IO be kept to a reasonable level.
An argument is made for a radical overhaul of current system
architecture, replacing DDR4 DRAM as main memory
with a different technology that has much lower latency,
possibly sacrificing an order of magnitude in capacity.
As the memory controller is integrated into the processor,
this also calls for a processor architecture change.
A proper assessment for such a proposal
must examine not only its own merits,
but also other technologies with the potential for order
of magnitude impact.
The discussion here mostly pertains to Intel processors
and Microsoft SQL Server.
However, the concepts are valid for any processor and for database engines
built around page organization and row-store with b-tree indexes.
The argument for low latency memory is valid even in the
presence of Hekaton memory-optimized tables, or other MVCC implementations,
because it is a drop-in hardware solution instead of a database re-architecture effort.
Of course, the heavy lifting is on Intel or other processor vendors
to re-architect the system at the processor and memory level.
For more than forty years, DRAM has been the standard
for main memory in almost every computer system architecture.
A very long time ago, having sufficient memory to avoid paging
was a major accomplishment.
And so, the driving force in memory was the semiconductor technology
with low cost.
This led to DRAM, which was optimized to 1 transistor plus 1 capacitor (1T1C).
DRAM addressing was multiplexed because even the package
and signal count impact on cost mattered.
The page file has since become a relic that, for some reason,
has yet to be removed.
Over the years, the mainstream form of DRAM, from SDRAM to DDR4,
did evolve to keep pace with progress on bandwidth.
The criterion that has changed little is latency,
and it is round-trip latency that has become critical in applications
characterized by pointer-chasing code resulting
in serialized memory accesses.
Even as memory capacity reached enormous proportions,
the database community routinely continued to configure systems
with the memory slots filled, often with the maximum
capacity DIMMs, despite the premium over the next lower capacity.
There was a valid reason for this.
Extravagant memory could mask (serious) deficiencies
in storage system performance.
Take the above in conjunction with SAN vendors providing helpful advice
such as log volumes do not need dedicated physical disks,
and the SAN has 32GB cache that will solve all IO performance problems.
Then there is the doctrine to implement their vision of storage as a service.
But now all-flash is the better technology for IO intensive storage.
Solid-state storage on an NVMe stack is even better.
Massive memory for the sole purpose of driving IO down to noise levels
is no longer necessary.
Now free of the need to cover-up for weakness in storage,
it is time to rethink system architecture,
particularly the memory strategy.
Not all applications desire low latency memory at lower capacity
and higher cost.
Some work just fine with the characteristics of DDR4.
Others prefer a different direction entirely.
The HPC community is going in the direction of extreme
memory bandwidth, implemented with MCDRAM in the latest Xeon Phi.
In the past, when Moore's Law was in full effect,
Intel operated on: "the Process is the Business Model".
It was more important to push the manufacturing process,
which meant having a new architecture every two years
with twice the complexity of the previous generation
and then be ready to shrink the new architecture to
a new process the next year.
Specialty products diverted resources from the main priority
and were difficult to justify.
But now process technology has slowed to a three-year cycle
and perhaps four years in the next cycle.
It is time to address major products
outside of general purpose computing.
The principal topic here is the impact of memory latency
on database transaction performance, and the
directions forward for significant advancement.
As the matter has complex entanglements, a number of different
aspects come into play.
One is scaling on non-uniform memory access (NUMA) system
architecture, in which the impact of memory access can be examined.
Memory technologies with lower latency are mentioned,
but these topics are left to experts in their respective fields.
Hyper-Threading is a work-around to the issue of memory latency,
so it is mentioned.
The general strategy for server system performance in recent years
has been to increase the core count.
But it is also important to consider which is
the right core as the foundation.
Another question to answer is whether a hardware and/or software solution
should be employed.
On software, Multi-version Concurrency Control (MVCC) is the technology
employed by memory-optimized databases
(the term in-memory has misleading connotations)
that can promise large performance gains,
and even more extreme gains when coupled with natively compiled procedures.
Architecting the database for scaling on NUMA architecture
is also a good idea.
A mixed item is SIMD, the SSE/AVX registers and instructions
introduced over the last 18 years.
From the database transaction processing point of view,
this is not something that would have been pursued.
But it has come to occupy a significant chunk of real estate
on the processor core.
So, find a way to use it! Or give it back.
The list is as follows:
- Scale up on NUMA versus single socket
- RLDRAM/SRAM versus DDR4+3D XPoint
- Hyper-Threading to 4-way
- Core versus Atom and all comers?
- Hekaton, memory-optimized, MVCC
- Database architected for NUMA
- SSE/AVX Vector Instructions
The above topics will be addressed, perhaps only partially,
often out of order, and sometimes mixed as appropriate.
Factors to consider are potential impact,
difficulty of implementation
and overall business justification.
L3 and Memory Latency
The 7-cpu benchmark website lists L3 and memory latency for some recent processors, shown below.
| Processor | Base Freq | L3 | Local Memory | Remote Memory |
| --- | --- | --- | --- | --- |
| Westmere EP (Xeon X5650) | 2.67GHz | 40-42 cycles | L3 + 67ns | L3 + 105ns |
| Ivy Bridge (i7-3770) | 3.4GHz | 30 cycles | L3 + 53ns | |
| Haswell (i7-4770) | 3.4GHz | 36 cycles | L3 + 57ns | |
| Haswell (E5-2603 v3) | 1.6GHz | 42-43 cycles | | |
| Skylake (i7-6700) | 4.0GHz | 42 cycles | L3 + 51ns | |
There are a number of questions.
Is L3 latency determined by absolute time or cycles?
In the Intel Xeon processors from Ivy Bridge EP/EX on,
this is particularly complicated because there are three die layouts,
LCC, MCC, and HCC,
each with a different structure and/or composition.
The Intel material says that L3 latency is impacted
by cache coherency snoop modes and ring structure;
see the Intel Dev Conference 2015
Processor Architecture Update,
and the similar Intel
HPC Software Workshop 2016 Barcelona material.
For simplicity, L3 is assumed to be 15ns here, without consideration for these variations.
DRAM, SQL Server 8KB Page
The Crucial web site and Micron datasheet cite DDR4-2133 latency at 15-15-15,
working out to 14ns each for CL, tRCD and tRP, so random row latency is 42ns
at the DRAM interface.
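As a quick sanity check (my own arithmetic, not from the datasheet), the timing works out as follows:

```python
# DDR4-2133 timing arithmetic (illustrative sketch based on the 15-15-15 figures above)
mt_per_sec = 2133e6             # transfers per second
clock_hz = mt_per_sec / 2       # DDR transfers twice per clock, so ~1066 MHz
t_ck_ns = 1e9 / clock_hz        # ~0.94 ns per clock

cl = trcd = trp = 15            # CAS latency, RAS-to-CAS delay, row precharge, in clocks
print(round(cl * t_ck_ns, 1))                 # ~14.1 ns each for CL, tRCD, tRP
print(round((cl + trcd + trp) * t_ck_ns, 1))  # ~42.2 ns for a random access needing precharge + activate + read
```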
The Intel web page for their
Memory Latency Checker
utility shows an example having local node latency as 67.5 or 68.5 ns
and remote node as 125.2 or 126.5 ns.
L3 and memory latency on modern processors is a complicated matter.
The Intel values above include the L3 latency.
So, by implication, the 10ns difference is the transmission time from memory controller
to DRAM and back?
One of the modes of DRAM operation is to open a row,
then access different columns, all on the same row,
with only the CL time between columns.
I do not recall this being in the memory access API?
Is it set in the memory controller?
Systems designed for database servers might force close a row immediately?
In accessing a database page, first, the header is read.
Next, the row offsets at the end of the 8KB page.
Then the contents of a row, all or specific columns.
This should constitute a sequence of successive reads
all to the same DRAM row?
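As a rough sketch of the addresses touched (my own illustration; the 96-byte header and 2-byte slot entries are described further down, in the SSE/AVX section), all three accesses land inside the same 8KB span:

```python
PAGE_SIZE = 8192     # SQL Server page size in bytes
HEADER_SIZE = 96     # page header at the front of the page

def page_read_addresses(page_base: int, slot_no: int):
    """Addresses touched when locating row `slot_no` on a page (illustrative only)."""
    header_addr = page_base                                # 1. read the page header
    slot_addr = page_base + PAGE_SIZE - 2 * (slot_no + 1)  # 2. read the 2-byte row offset;
                                                           #    the slot array grows from the page end
    # 3. the row itself sits at page_base + offset (the value read in step 2),
    #    somewhere between HEADER_SIZE and the slot array, still within the same 8KB
    return header_addr, slot_addr

print(page_read_addresses(0x10000, 3))   # (65536, 73720): both within one 8KB region
```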
Memory Architecture - Xeon E5 and E7
The most recent Core i3/5/7 processors run in the mid-4GHz range.
There are Xeon EP and EX processors with base frequency in the 3GHz+ range.
But for the high core count models, 2.2GHz is the common base frequency.
Below is a representation of the Xeon E5 memory subsystem.
In the Xeon E7, the memory controller connects to a scalable memory buffer (SMB)
or Memory Extension Buffer (MXB), depending on which document,
and the interface between MC and SMB is SMI.
The SMB doubles the number of DIMMs that can connect to each memory channel.
There is no mention of the extra latency for the SMB.
It cannot be free,
"All magic comes with a price".
The value of 15ns is used here for illustrative purposes.
Anyone with access to both 2-way Xeon E5 and 4-way E7 systems
of the same generation and core count model (LCC, MCC or HCC)
is requested to run the 7-cpu
or Intel Memory Latency Checker
utility, and make the results known.
In principle, if Intel were feeling helpful, they would do this.
Below is a single socket Xeon E5 v4.
The high core count (HCC) model has 24 cores.
But the Xeon E5 v4 series does not offer a 24-core model.
Two cores are shown as disabled as indicated, though it could be any two?
There is only one memory node, and all memory accesses are local.
Below is a 2-socket system, representing the Xeon E5
either HCC or MCC models in having two double rings,
and the interconnect between rings.
The details shown are to illustrate the nature of NUMA.
NUMA is a complicated topic, with heavy discussion of cache coherency,
the details of which impact L3 and memory latency.
The discussion here will only consider a very simple model,
looking only at the physical distance impact on memory latency.
For more depth, see the NUMA Deep Dive Series
by Frank Denneman.
Each socket is its own memory node
(not sure what happens in Cluster-on-Die mode).
To a core in socket 0, memory in node 0 is local, memory in node 1 is remote,
and vice versa to a core in the other socket
(more diagrams at System Architecture Review 2016).
Below is a representation of a 4-socket system based on the Xeon E7 v4 memory architecture.
To a core in socket 0, memory in node 0 is local,
memory in nodes 1, 2, and 3 are remote.
And repeat for other nodes.
It is assumed that memory accesses are 15ns longer than
on the E5 due to the extra hop through the SMB, outbound and inbound.
This applies to both local and remote node memory accesses.
On system and database startup, threads should allocate memory on the local node.
After the buffer-cache has been warmed up,
there is no guarantee that threads running in socket 0
will primarily access pages located in memory node 0,
or threads on socket 1 to memory node 1.
That is, unless the database has been purposefully architected with a plan
in mind to achieve higher than random memory locality.
The application also needs to be built to the same memory locality plan.
A connection to SQL Server will use a specific TCP/IP port number
based on the key value range.
SQL Server will have TCP/IP ports mapped to specified NUMA nodes.
Assuming the application does the initial buffer-cache warm up,
and not some other source that is not aware of the NUMA tuning,
there should be alignment of threads to pages on the local node.
(It might be better if an explicit statement specifies the preferred NUMA node
by key value range?)
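A sketch of how the pieces could line up (the bracketed node-affinity syntax is from Microsoft's documented "Map TCP IP Ports to NUMA Nodes" feature; the specific ports and key ranges below are hypothetical):

```python
# SQL Server side, set in Configuration Manager (TCP/IP Properties -> TCP Port):
#   "1500[0x1],1501[0x2]"  -> port 1500 affinitized to NUMA node 0, port 1501 to node 1
# Application side: route each connection to the port owning its key range, so the
# warm-up and later access of that key range stay on the same memory node.
NODE_PORTS = {0: 1500, 1: 1501}                          # hypothetical port-per-node mapping
KEY_RANGES = [(0, 5_000_000), (5_000_000, 10_000_000)]   # hypothetical key range per node

def port_for_key(key: int) -> int:
    for node, (lo, hi) in enumerate(KEY_RANGES):
        if lo <= key < hi:
            return NODE_PORTS[node]
    raise ValueError("key outside configured ranges")

conn_str = f"Server=dbhost,{port_for_key(1_234_567)};Database=oltp;..."  # lands on node 0
print(conn_str)
```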
An example of how this is done can be found in the TPC-C and TPC-E benchmark full disclosures
and supplemental files.
Better yet is to get the actual kit from the database vendor.
For SQL Server, the POC is JR
Curiously, among the TPC-E full disclosure and supplemental files'
thousands of pages of inane details, there is nothing
that says that the NUMA tuning is of pivotal importance.
It is assumed in the examples here that threads access random pages,
and memory accesses are evenly distributed over the memory nodes.
Simple Model for Memory Latency
In Memory Latency, NUMA and HT,
a highly simplified model for the role of memory latency in
database transaction processing was developed; it is used here to demonstrate the
differences between a single-socket system having uniform memory
and multi-socket systems having NUMA architecture.
Using the example of a hypothetical transaction that could execute in 10M cycles
with "single-cycle" memory,
the "actual" time is modeled for a 2.2GHz core in which 5% of instructions
involve a non-open page memory access.
The model assumes no IO, but IO could be factored in if desired.
The Xeon E5 memory model is used for 1 and 2 sockets,
and the E7 for the 4-socket system.
(Table columns: GHz, L3+mem, remote, skt, avg. mem, mem cycles, fraction, tot cycles, tps/core, tot tx/sec.)
If there were such a thing as single-cycle memory,
the performance would be 220 transactions per second
based on 2,200M CPU-cycles per second and 10M cycles per transaction.
Based on 67ns round-trip memory access,
accounting for a full CL+tRCD+tRP,
the transmission time between processor and DRAM,
and L3 latency,
incurred in 0.5M of the 10M "instructions",
the transaction now completes in 83.2M cycles.
The balance of 73.2M cycles are spent waiting for memory accesses to complete.
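The arithmetic can be reproduced with a few lines (a sketch of my own, using 67ns local and 125ns remote from the figures above, plus 15ns per access for the E7 SMB; results land within rounding of the numbers quoted here and the scaling ratios given further down):

```python
# Simple memory-latency model: 10M base cycles per transaction, 5% of "instructions"
# incur a full (non-open-page) memory access, core at 2.2GHz.
GHZ, BASE_CYCLES, MEM_FRACTION = 2.2, 10e6, 0.05

def tps_per_core(avg_mem_ns: float) -> float:
    accesses = MEM_FRACTION * BASE_CYCLES                    # 0.5M memory accesses
    total_cycles = BASE_CYCLES + accesses * avg_mem_ns * GHZ
    return GHZ * 1e9 / total_cycles                          # transactions per second per core

local, remote, smb = 67.0, 125.0, 15.0
one_socket  = tps_per_core(local)                                      # all accesses local
two_socket  = tps_per_core((local + remote) / 2)                       # half the accesses are remote
four_socket = tps_per_core((local + smb + 3 * (remote + smb)) / 4)     # E7: 3 of 4 remote, plus SMB

print(round(one_socket, 1))                    # ~26 tps/core, vs 220 with "single-cycle" memory
print(round(2 * two_socket / one_socket, 2))   # ~1.45X scaling 1P -> 2P
print(round(4 * four_socket / one_socket, 2))  # ~2.26X scaling 1P -> 4P
```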
This circumstance arises primarily in pointer-chasing code,
where the contents of one memory access determines the next action.
Until the access completes, there is nothing else for the thread to do.
The general advice is to avoid this type of coding,
except that this is what happens in searching a b-tree index.
If the impact of NUMA on database transaction processing performance
were understood and clearly communicated,
databases could have been architected from the beginning to work with
the SQL Server NUMA and TCP/IP port mapping features.
Then threads running on a given node primarily access pages local to that node.
If this forethought had been neglected, then one option is to re-architect
both the database and application,
which will probably involve changing the primary key
of the core tables.
Otherwise, accept that scaling on multi-socket systems is
not going to be what might have been expected.
Furthermore, the Xeon E7 processor, commonly used in 4-socket systems,
has the SMB feature for doubling memory capacity.
As mentioned earlier, this must incur some penalty in memory latency.
In the model above, scaling is:
1P -> 2P = 1.45X, 2P -> 4P = 1.56X and
1P -> 4P = 2.26X
The estimate here is that the SMB has an 11% performance penalty.
If the doubling of memory capacity (or other functionality)
was not needed, then it might have been better to leave off the SMB.
There is a 4-way Xeon E5 4600-series, but one processor is
2 hops away, which introduces its own issues.
There is a paucity of comparable benchmark results to support meaningful quantitative analysis.
In fact, it would seem that the few benchmarks available employ configuration variations
as if with the intent to prevent insight.
Below are TPC-E results from Lenovo on Xeon E5 and E7 v4, at 2 and 4 sockets respectively.
| Processor | Sockets | Cores | Threads | Memory | Data Storage | tpsE |
| --- | --- | --- | --- | --- | --- | --- |
| E5-2699 v4 | 2 | 44 | 88 | 512GB (16×32) | 3×17 R5 | 4,938.14 |
| E7-8890 v4 | 4 | 96 | 192 | 4TB (64×64) | 5×16 R5 | 9,068.00 |
It would seem that scaling from 2P to 4P is outstanding at 1.915X.
But there is a 9% increase in cores per socket from 22 to 24.
Factoring this in, the scaling is 1.756X, although scaling versus core count
should be moderately less than linear.
Then there is the difference in memory, from 256GB per socket to 1TB per socket.
Impressive, but how much did it actually contribute?
Or did it just make up for the SMB extra latency?
Note that TPC-E does have an intermediate level of memory locality,
to a lesser extent than TPC-C.
The details of the current Intel Hyper-Threading implementation are not examined here.
The purpose of Hyper-Threading is to make use of the dead cycles
during memory access or other long latency operations.
Hyper-threading is an alternative solution to
the memory latency problem.
Workarounds are good, but there are times when it is necessary
to attack the problem directly.
Single thread performance is important, perhaps second after overall system throughput.
Given that the memory round-trip access time is more than 150 CPU clock cycles,
it is surprising that Intel only implements 2-way HT.
The generic term is simultaneous multi-threading (SMT).
IBM POWER8 is at 8-way, up from 4-way in POWER7.
SPARC has been 8-way for a few generations?
There is a paper on one of the two RISC processors
stating that there were technical challenges in SMT at 8-way.
So, a 4-way HT is reasonable.
This can nearly double transaction performance.
The effort to increase HT from 2-way to 4-way should not be overwhelming.
Given the already impossibly complex
complexion of the processor,
this "difficult" should be a walk in the park.
It might help if there were API directives from the operating
system to the processor indicating which code runs well with HT and which does not.
One other implication of the memory latency effect is that
scaling versus frequency is poor.
The origin of the memory latency investigation
was an incident
in which a system UEFI/BIOS update reset processors
to power-save mode, changing base frequency from 2.7GHz to 135MHz.
There was a 3X increase in worker (CPU) time on key SQL statements.
A 20X change in frequency for 3X performance.
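A back-of-envelope interpretation (my own, assuming the compute portion scales with frequency while the memory wait does not): solving (20c + m) / (c + m) = 3 says memory wait dominated at full speed.

```python
# t(f) = c * (2.7GHz / f) + m; a 20X lower frequency giving only 3X more time:
# 20c + m = 3(c + m)  ->  17c = 2m  ->  m = 8.5c
c = 1.0
m = 17.0 / 2.0 * c
print(round(m / (c + m), 2))   # ~0.89: roughly 90% of elapsed time was memory wait at 2.7GHz
```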
In other words, do not worry about the lower frequency of the
high core count processors.
They work just fine.
But check the processor SKUs carefully, certain models do not
have Hyper-Threading which is important.
It also appears that turbo-boost might be more of a problem than a benefit here.
It might be better to lock processors to the base frequency.
Hekaton - MVCC
Contrary to popular sentiment,
putting the entire database into memory on a traditional engine
having page organization and row-store does not make much of
a direct contribution to performance
over a far more modest buffer cache size.
This is why database vendors have separate engines
for memory-optimized operation,
Hekaton in the case of Microsoft SQL Server.
To achieve order of magnitude performance gain, it was necessary to
completely rethink the architecture of the database engine.
A database entirely in memory can experience substantial performance
improvement when the storage system is seriously inadequate to meet
the IOPS needed.
When people talk about in-memory being "ten times" faster,
what is meant is that if the database engine had been designed around
all data residing in memory, it would be built in a very different way
than the page and row structure implemented by INGRES in the 1970s.
Now that the memory-optimized tables feature, aka Hekaton for Microsoft
SQL Server, is available,
are there still performance requirements that have not been met?
Memory-optimized tables and their accompanying natively compiled
procedures are capable of unbelievable performance levels.
All we have to do is re-architect the database to use memory-optimized tables,
and then rewrite the stored procedures for native compilation.
This can be done!
And it should be done, when practical.
In many organizations, the original architects have long retired,
departed, or gone the way of Dilbert's Wally (1999 Y2K episode).
The current staff developers know that if they touch something
and it breaks, they own it.
So, if there were a way to achieve significant performance gain,
with no code changes, just by throwing money at the problem,
then there would be interest, and money.
There is not a TPC-E result for Hekaton.
This will not happen without a rule change.
The TPC-E requirement is that database size scales with the reported throughput.
The 4-way Xeon E7 v4 result of 9,068 tpsE corresponds to a database
size of 37,362GB.
The minimum expectation from Hekaton is 3X, pointing to a 100TB database.
A rule change for this should be proposed.
Allow a "memory-optimized" option to run with a database
smaller than memory capacity of the system.
About as difficult as MVCC, less upside?
Potentially the benefits of NUMA scaling and Hekaton
could be combined if Microsoft exposed a mechanism
for how key values map to a NUMA node.
It would be necessary for a collection of tables
with compound primary key to have the same lead column
and that the hash use the lead column in determining NUMA node?
The major DRAM companies are producing DDR4 at the 4Gb die level.
Samsung has an 8Gbit die.
Micron has a catalog entry for 2×4Gb die in one package as an 8Gb product.
There can be up to 36 packages on a double-sided DIMM, 18 on each side.
The multiple chips/packages form a 64-bit word plus 8-bits for ECC,
capable of 1-bit error correction and 2-bit error detection.
A memory controller might aggregate multiple channels
into a larger word and combine the ECC bits to allow for a more
sophisticated error correction and detection scheme.
A 16GB non-ECC DDR4 DIMM sells for $100 or $6.25 per GB.
The DIMM is comprised of 32×4Gb die, be it 32 single-die packages
or 16 two-die packages.
The 16GB ECC U or RDIMM consisting of 36, ×4Gb die is $128,
for $8/GB net (data+ECC) or $7.11/GB raw.
There is a slight premium for ECC parts, but much less than it was in the past,
especially with fully buffered DIMMs that had an XMB chip on the module.
The 4Gb die plus package sells for less than $3.
The Memory Guy
estimates that a 4Gbit DRAM die needs to be about 70mm2 to support
a price in this range.
The mainstream DRAM is a ruthlessly competitive environment.
The 8Gbit DDR4 package with 2×4Gb die allows 32GB ECC DDR4 to sell for $250,
no premium over the 16GB part.
The 64GB ECC RDIMM (72GB raw) is priced around $1000.
This might indicate that it is difficult to put 4×4Gb die in one package,
or that the 8Gb die sells for $12 compared to $3 for the 4Gb die.
Regardless, it is possible to charge a substantial premium in
the big capacity realm.
One consequence of the price competitiveness in the mainstream DRAM
market is that cost cutting is an imperative.
Multiplexed row and column address lines originated in 1973,
allowing for lower cost package and module product.
Twenty years ago, there was discussion on going back to a full width address,
but no one was willing to pull the trigger on this.
The only concession for performance in mainstream DRAM was increasing
bandwidth by employing multiple sequential word accesses,
starting with DDR to the present DDR4.
Reduced latency DRAM appeared in 1999
for applications that needed lower latency than mainstream DRAM,
but at lower cost than SRAM.
One application is in high-speed network switches.
The address lines on RLDRAM are not multiplexed.
The entire address is sent in one group.
RLDRAM allows low latency access to a particular bank.
That bank cannot be accessed again for the normal DRAM period?
But the other banks can, so the strategy is to access banks in a round-robin fashion if possible.
A 2012 paper, lead author Niladrish Chatterjee,
has a discussion of RLDRAM.
The Chatterjee paper mentions that
RLDRAM employs many small arrays, sacrificing density for latency.
Bank-turnaround time (tRC) is 10-15ns compared to 50ns for DDR3.
The first version of RLDRAM had 8 banks,
while the contemporary DDR (just DDR then) had 2 banks.
Both RLDRAM3 and DDR4 are currently 16 banks,
but the banks are organized differently?
Micron currently has a 1.125Gb RLDRAM 3 product in x18 and x36.
Presumably the extra bits are for ECC, 4 x18 or 2 x36 forming a 72-bit path
to support 64-bit data plus 8-bit ECC.
The mainstream DDR4 8Gbit 2-die package from Micron comes in
a 78-ball package for x4 and x8 organization, and 96-ball for x16.
The RLDRAM comes in a 168-ball package for both x18 and x36.
By comparison, GDDR5 8Gb at 32-wide comes in a 170-ball BGA,
yet has multiplexed address?
The package pin count factors into cost,
and also into the die size, because each signal needs to be boosted before
it can go off chip?
Digi-Key lists a Micron 576Mb RLDRAM3 part at $34.62, or $554/GB w/ECC,
compared with DDR4 at $8 or $14/GB, also with ECC, depending on the module.
At this level, RLDRAM is 40-70 times more expensive than DDR4.
A large part of this is probably because the RLDRAM is quoted
as a specialty low-volume product at high margins,
while DDR4 is quoted on razor-thin margins.
The top RLDRAM at 1.125Gb capacity might reflect the size needed
for high-speed network switches
or it might have comparable die area to a 4Gb DDR?
There are different types of SRAM.
High-performance SRAM has 6 transistors, 6T.
Intel may use 8T
(Intel Labs at ISSCC 2012)
or even 10T for low power?
(see Real World Tech on NTV).
It would seem that SRAM should be six or eight times less dense than DRAM,
depending on the number of transistors in SRAM, and the size of the capacitor in DRAM.
There is a Micron slide in
Micro 48 Keynote III
that says SRAM does not scale
on manufacturing process as well as DRAM.
Instead of 6:1, or 0.67Gbit SRAM at the same die size
as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area?
Another source says 100:1 might be appropriate.
Eye-balling the Intel Broadwell 10-core (LCC) die,
the L3 cache is 50mm2,
listed as 25MB.
It includes tags and ECC on both data and tags?
There could be 240Mb or more in the 25MB L3?
Then 1Gbit could fit in a 250mm2 die,
plus area for the signals going off-die.
Digi-Key lists Cypress QDR IV 144M (8M×18, 361 pins) in the $235-276 range.
This is about $15K per GB w/ECC.
It is reasonable to assume that prices for both RLDRAM and QDR SRAM
are much lower when purchased in volume?
The lowest price for an Intel processor on the Broadwell LCC die of 246mm2 is $213
in a 2011-pin package.
This would suggest SRAM at south of $1800 per GB.
While the ultra-high margins in high-end processors are desirable,
it is just as important to fill the fab to capacity.
So, SRAM at 50% margin is justified.
We could also estimate SRAM at 40X that of DRAM, per the Micron
assertion of relative density, pointing to $160-320 per GB.
Graphics and High-Bandwidth Memory
Many years ago, graphics processors diverged from mainstream DRAM.
Their requirement was for very high bandwidth at a smaller capacity
than main memory, plus other features to support the memory
access patterns in graphics.
GDDR is currently on version 5, at density up to 8Gbit, with a x32
wide path (170-ball package) versus x4, x8 and x16 for mainstream DDR4.
More recently, High Bandwidth Memory (HBM) is promoted by AMD,
and Hybrid Memory Cube by Micron.
High bandwidth memory is not pertinent to databases,
but it does provide scope on when there is need to
go a separate road from mainstream memory.
Databases on a page-row type engine do not come close to testing the limits
of DDR4 bandwidth.
This is true for both transaction processing and DW large table scans.
For that matter, neither does column-store, probably because of the CPU-cycles consumed in processing.
I may have to take this back.
DDR4-2133 bandwidth is 17GB/s per channel, and 68GB/s over 4 channels.
(GB is always decimal by default for rates, but normally binary for size.)
A table scan with simple aggregation from memory is what now?
It was 200MB/s per core in Core 2 days, 350MB/s in Westmere.
Is it 500 or 800MB/s per core now?
It is probably more likely to be 500,
but let's assume 800MB/s here.
Then 24 cores (Xeon E7 v4 only, not E5) consume 19.2GB/s
(HT does not contribute in table scans).
This is still well inside the Xeon E5/7 memory bandwidth.
But what if this were read from storage?
A table scan from disk is a write to memory, followed by a read.
DDR writes to memory at the clock rate, i.e., one-half the MT/s rate.
So the realized table scan rate effectively consumes 3X its value in memory interface bandwidth,
which is 57.6GB/s here.
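The accounting, following the assumptions above (the 800MB/s scan per core is an assumed figure, and the write counting double against the transfer budget follows the half-rate write reasoning):

```python
cores, scan_per_core_gbs = 24, 0.8
channels, per_channel_gbs = 4, 17.0

scan_rate = cores * scan_per_core_gbs          # 19.2 GB/s aggregate scan rate from memory
capacity = channels * per_channel_gbs          # 68 GB/s for DDR4-2133 over 4 channels

# Scan from storage: DMA write into memory (counts 2x against the transfer budget,
# per the half-rate write assumption) plus the read back by the cores (1x).
from_storage_equivalent = scan_rate * 3        # 57.6 GB/s-equivalent of the 68 GB/s budget
print(scan_rate, from_storage_equivalent, capacity)
```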
To pursue the path of low latency memory,
it is necessary to justify the cost and capacity
structure of alternative technologies.
It is also necessary that the opportunity be worthwhile
to justify building one more specialty processor
with a different memory controller.
And it may be necessary to work with operating system
and database engine vendors to all be aligned
in doing what is necessary.
The Chatterjee et al.
paper for Micro45 (microarch.org) shows RLDRAM3 improving throughput
by 30% averaged across 27 of the 29 components of the SPEC CPU 2006 suite, integer and fp.
MCF shows greatest gain at 2.2X.
Navigating the B-tree index should show very high gain as well.
The cost can be justified as follows.
An all-in 4-way server has the following processor and memory cost.
| Component | Configuration | Unit Price | Total |
| --- | --- | --- | --- |
| Processor | 4 × E7-8890 v4 | $7,174 ea. | $28,700 |
| Memory | 4TB, 64 × 64GB | $1,000 ea. | $64,000 |
If the above seems excessive, recall that there was a time
when some organizations were not afraid to spend $1M on the
processor and memory complex,
or sometimes just for systems with 60+ processors
(sales of 1000 systems per year?).
That was in the hope of having amazing performance.
Except that vendors neglected to stress the importance of storage configuration.
SAN vendors continued to sell multi-million-dollar storage
without stressing the importance of dedicated disks for logs.
If a more expensive low latency memory were to be implemented,
the transmission time between the memory controller and DIMM,
estimated to be 10ns earlier,
should be revisited.
An RLDRAM system might still have the DIMM slot arrangement currently
in use, but other options should be considered.
An SRAM main memory should probably be in an MCM,
or some other advanced packaging option
(TSMC Hot Chips 28).
This is if enough SRAM can be stuffed into a module or package.
It would also require that the processor and memory be sold as a single
unit, instead of memory being configured separately later.
In the case of SRAM, the nature of the processor L3 probably needs to be reconsidered.
Before SSDs, the high-end 15K HDDs were popular with storage
performance experts who understood that IOPS was more important
than capacity alone.
the 15K HDD could support 200 IOPS at queue depth 1 per HDD
with low latency (5ms).
It should be possible to assemble a very large array of 1,000
disks, capable of 200,000 IOPS.
It is necessary to consider the mean-time-between-failure (MTBF),
typically cited as over 1M-hours.
There are 8,760 hours in a 365-day year.
At 1M-hr MTBF, the individual disk failure rate is 0.876%.
An array of 1000 disks is expected to see 9 failures per year.
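The failure arithmetic, as cited:

```python
mtbf_hours = 1_000_000
hours_per_year = 8_760
annual_failure_rate = hours_per_year / mtbf_hours    # ~0.876% per disk per year
disks = 1_000
print(round(disks * annual_failure_rate, 1))         # ~8.8 expected failures per year
```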
Hard disks in RAID groups will continue to operate
with a single or sometimes multiple disk failures.
However, rebuilding a RAID group from the failed drive
could take several hours, and performance is degraded
in this period.
It is not operationally practical to run on a very
large disk array.
The recommendation was to fill the memory slots with big DIMMs,
and damn the cost.
The common convention used to be that a NAND controller with
SATA on the upstream side would have 8 channels on the NAND side.
A PCI-E controller would have 16 or 32 NAND channels for x4 and x8 respectively.
On the downstream side, a NAND channel could have
1 or 2 packages.
There could be up to 8 chips in a package.
A NAND chip may be divided into 2 planes,
and each plane is functionally an independent entity.
An SSD with 8 packages could have 64 NAND chips comprised of 128 planes.
The random IOPS performance at the plane level is better
than a 15K HDD, so even a modest collection of 24 SSDs
could have a very large array (3,072) of base units.
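Counting the base units described above:

```python
# One SSD, per the description: 8 packages, up to 8 NAND chips per package, 2 planes per chip.
packages_per_ssd, chips_per_package, planes_per_chip = 8, 8, 2
planes_per_ssd = packages_per_ssd * chips_per_package * planes_per_chip   # 128 planes
ssds = 24
print(planes_per_ssd, ssds * planes_per_ssd)   # 128 planes per SSD, 3072 planes across 24 SSDs
```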
At the component level, having sufficient units for 1M IOPS is not difficult.
Achieving 1M IOPS at the system level is more involved.
NVMe builds a new stack, software and hardware, for driving the
extraordinarily high IOPS possible with a large array of SSDs,
while making more efficient use of CPU than the SAS stack.
PCI-E NVMe SSDs have been around since 2014,
so it is possible to build a direct-attach SSD array
with the full NVMe stack.
NVMe over Fabrics was recently finalized, so SAN products
might not be too far off.
From the host operating system, it is possible to drive
1M IOPS on the NVMe stack without consuming too much CPU.
At the SQL Server level, there are additional steps,
such as determining which page to evict from the buffer cache.
Microsoft has reworked IO code to support the bandwidth
made practical with SSD for DW usage.
But given the enormous memory configuration of typical
transaction processing systems,
there may not have been much call for the ability
to do random IOPS with a full buffer cache.
But if the need arose, it could probably be done.
All Flash Array
When SSDs were still very expensive as components,
storage system vendors promoted the idea of an SSD
cache and/or a tiering structure of SSD, 10K and 7.2K HDDs.
In the last few years, new upstarts have been promoting all-flash.
HDD storage should not go away, but its role is backup
and anything not random IO intensive.
Rethinking System Architecture
The justification for rethinking system architecture
to low latency memory at far higher cost
is shown below.
The scaling achieved in a 4-socket system is less than exceptional,
except for the very few NUMA-architected databases,
which are probably just the TPC-C and TPC-E benchmark databases.
It might be 2.2X better than single socket.
At lower latency, 40ns L3+memory, the single socket system could
match the performance of a 2-socket system with DDR4 DRAM.
If 25ns were possible, then it could even match up with the 4-socket system.
The mission of the massive memory capacity made possible in the 4-way,
to reduce IO, is no longer a mandatory requirement.
The fact that a single-socket system with RLDRAM or SRAM
could match a 4-socket with massive memory
allows very wide latitude in cost.
RLDRAM may reside inside or outside of the processor.
If outside, thought should be given on how to reduce
the transmission delay.
SRAM should most probably be placed inside the processor package,
so the challenge is how much could be done.
Should there still be an L3?
Any latency from processor core to memory must be minimized
as much as possible as warranted by the cost of SRAM.
Below are the memory latency simple model calculations
for a single socket with L3+memory latency of 43 and 25ns.
These are the values necessary for the single-socket
system to match 2 and 4-socket systems respectively.
(Table columns: GHz, L3+mem, remote, skt, avg. mem, mem cycles, fraction, tot cycles, tps/core, tot tx/sec.)
In the examples above, Hyper-Threading should still have good scaling to 4-way
at 43ns, and some scaling to 4-way at 25ns memory latency.
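Running the same simple model for a single socket at those two latencies (repeated in full so the snippet stands alone; my own arithmetic):

```python
GHZ, BASE_CYCLES, MEM_FRACTION = 2.2, 10e6, 0.05

def tps_per_core(avg_mem_ns: float) -> float:
    accesses = MEM_FRACTION * BASE_CYCLES
    return GHZ * 1e9 / (BASE_CYCLES + accesses * avg_mem_ns * GHZ)

print(round(tps_per_core(43.0), 1))  # ~38 tps/core: one socket matches the 2 x ~19 tps/core of the 2P DDR4 system
print(round(tps_per_core(25.0), 1))  # ~59 tps/core: one socket matches the 4 x ~14.9 tps/core of the 4P E7 system
```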
The new memory architecture does not mean that DDR4 DRAM goes away.
It is an established and moderately inexpensive technology.
There could still be DRAM memory channels.
Whether this is a two-class memory system or
perhaps DDR memory is accessed like a memory-mapped file
can be debated elsewhere.
Xeon Phi has off-package DDR4 as memory node 0 and on-package MCDRAM
as memory node 1, all to the same processor.
It is acknowledged that the proposed system architecture is not
a new idea.
The Cray-1 used SRAM as memory, and DRAM as storage?
For those on a budget, the Cray-1M has MOS memory.
Circumstances of the intervening years favored processors with SRAM cache
and DRAM as main memory.
But the time has come to revisit this thinking.
While working on this, I came across the slide below in
J Pawlowski, Micron,
Memory as We Approach a New Horizon.
The outline of the Pawlowski paper includes high bandwidth, and persistent memory.
Deeper in, RL3 Row Cycle Time (tRC) is 6.67 to 8ns, versus 45-50ns for DDR4.
I am guessing that the large number of double-ended arrows between processor and
near memory means high bandwidth.
And even bandwidth to DIMMs is substantial.
The devices to the right seem to be storage.
Does ASIC mean logic?
Instead of just accessing blocks, it would be useful to say:
read the pointer at address A, then fetch the memory that A points to.
Below is roughly Intel's vision of next-generation system architecture,
featuring 3D XPoint.
The new Intel and Micron joint non-volatile technology is promoted
as having performance characteristics
almost as good as DRAM, higher density than DRAM,
and cost somewhere in between DRAM and NAND.
The full potential of 3D XPoint cannot be realized as a PCI-E attached device.
The idea is then to have 3D XPoint DIMM devices on the memory interface
along with DRAM.
The argument is that memory configurations in recent years
have become ridiculously enormous.
That much of it is used to cache tepid or even cool data.
In this case, DRAM is overkill.
The use of 3D XPoint is almost as good,
it costs less, consumes less power, is persistent,
and will allow even larger capacity.
In essence, the Intel vision acknowledges the fact
that much of main memory is being used for less than hot data.
The function of storing not so hot data can be accomplished
with 3D XPoint at lower cost.
But this also implies that the most critical functions
of memory require far less capacity than that of recent generation systems.
In the system architecture with a small SRAM or RLDRAM
main memory, there will be more IO.
To a degree, IO at 100µs to NAND is not bad,
but the potential for 10µs or less IO to 3D XPoint
further validates the concept and
is too good to pass up.
Below is my less fancy representation of the Micron system concept.
The Right Core
The latest Intel mainline Core-i processor has incredibly powerful
cores, with 8-wide superscalar execution.
A desktop version of Kaby Lake, 7th generation,
has 4.2GHz base frequency and 4.5GHz turbo.
This means the individual core can run at 4.5GHz if not significantly
higher, but must throttle down to 4.2GHz
so that four cores plus graphics and the system agent stay under 91W.
A reasonable guess might be that the power consumption is 20W per core at 4.2GHz?
Skylake's top frequency was 4.0GHz base and 4.2GHz turbo.
Broadwell is probably 3.5GHz base and 3.8GHz turbo, but Intel
did not deploy this product as widely as normal.
In transaction processing, this blazing frequency is squandered
on memory latency.
The server strategy is in high core count.
At 24 cores, frequency is throttled down to 2.2GHz to stay under 165W.
The Broadwell HCC products do allow turbo mode in which a few cores
can run at up to 3.5 or 3.6GHz.
Every student of processor architecture knows the foundations of Moore's Law.
One of the elements is that on doubling the silicon real estate
at a fixed process, our expectation is to achieve 1.4X increase in performance.
(The other element is a new process with 0.71X linear shrink yields 50% frequency,
also providing 1.4X performance.)
The Intel mainline core is two times more powerful than the
performance that can be utilized in a high core count processor.
In theory, it should be possibly to design a processor core
with performance equivalent to the mainline core at 2.2GHz
(see SQL on Xeon Phi).
In theory, this new light core should be one-quarter the size of the mainline core.
(Double the light core complexity for 1.4X performance. Double again for another 1.4X,
for a cumulative 2X over baseline.)
The new light core would be running at maximum design frequency to match
the 2.2GHz mainline, whatever frequency that might be.
We can pretend it is 2.2GHz if that helps.
This new core would have no turbo capability.
What is the power consumption of this core?
Perhaps one-quarter of the mainline, because it is one-quarter the size?
Or more because it is running at a slightly higher voltage? (this is covered elsewhere).
It might be possible to fit four times as many of the light cores on the same die
size, assuming cache sizes are reduced.
But maybe only 3 times as many cores can be supported to stay within power limits?
This is a much more powerful multi-core processor
for multi-threaded server workloads.
How valuable is the turbo capability?
The turbo boost has value because not everything can be made heavily multi-threaded.
Single or low-threaded code might not be pointer chasing code,
and would then be fully capable of benefitting from the full power of the Intel mainline core.
A major reason that Intel is in such a strong position is that they have had
the most powerful core for several years running,
and much of the time prior to the interlude period.
(AMD does have a new core coming out and some people think highly of it.)
Intel has two versions of each manufacturing process, one for high performance,
and another for low power.
Could the mainline core be built on the lower power process?
In principle this should reduce power to a greater degree than scaling voltage down.
Would it also make the core more compact?
We could speculate on theory, applying the general principles of Moore's Law.
But there is a real product along these lines, just targeted towards a different market.
The Xeon Phi 200, aka Knights Landing,
has 72 Atom cores, albeit at 245W (260W with fabric).
The current Phi is based on the Airmont Atom core.
(the latest Atom is actually Goldmont).
The recent Atom cores have a 14-stage pipeline versus 14-19 for Core?
Airmont is 3-wide superscalar, with out-of-order, but does not have a µOP cache?
The true top frequency for Airmont is unclear; some products based
on Airmont are listed at 2.6GHz in turbo.
On Xeon Phi, the frequency is 1.5GHz base, 1.7GHz turbo.
It might be that the low core count processors will always be able to operate at a higher voltage for maximum frequency
while high core count products set a lower voltage resulting in lower frequency,
regardless of intent.
Below is a diagram of Knights Landing, or Xeon Phi 200
from Intel's Hot Chips 27 (2015).
The processor and 8 MCDRAM devices are on a single multi-chip-module
(package) SVLC LGA 3647.
The MCDRAM is a version of Hybrid Memory Cube?
(Intel must have their own private acronyms.)
Each device is a stack of die, 2GB, for a total of 16GB
with over 400GB/s bandwidth.
There are also 3 memory controllers driving a total
of 6 DDR4 memory channels for another 90GB/s bandwidth (at 2133 MT/s).
Only 1 DIMM per channel is supported;
presumably the target applications are fine with 384GB but want extreme bandwidth.
The Xeon Phi is designed for HPC.
As is, it might be able to deliver impressive
transaction processing performance.
But perhaps not without tuning at many levels.
The question is, how would Knights Landing perform on transaction processing
had the memory been designed for latency
instead of bandwidth?
I suppose this could be tested simply by comparing an Airmont Atom against
a Broadwell or Skylake?
The theory is that the memory round-trip latency dominates,
so the 8-wide superscalar of Haswell and later has little benefit.
Even if there is some code that can use wide superscalar execution,
the benefit is drowned out by code that waits for memory accesses.
Even in transaction processing databases,
not everything is transactions, i.e., amenable to wide parallelism.
Some code, with important functions, does benefit from the full capability
of the mainline Intel core.
Perhaps the long-term solution is asymmetric multi-core,
two or four high-end cores, and very many mini-cores.
SSE/AVX Vector Unit (SIMD)
The vector (SSE/AVX) unit is a large portion of the core area.
These are not used in transaction processing
but are used in the more recent column-store engines.
Microsoft once evaluated the use of the Intel SSE registers,
but did not find a compelling case.
It might have been on the assumption of the existing page structure.
Perhaps what is needed is to redesign the page structure
so that the vector registers can be used effectively.
The SQL Server 8KB page, 8,192 bytes, has a page header of 96 bytes,
leaving 8,096 bytes.
Row offsets (slot array of 2 byte values) are filled in from the end of the page.
See Paul Randal, SQL Skills,
Anatomy of a Record.
Within a page, each row has a header (16-bytes?) with several values.
The goal of redesigning the page architecture is so that the slot
and header arrays can be loaded into the vector registers
in an efficient manner.
This might mean moving the slot array and other headers up front.
SQL Server would continue to recognize the old page structure.
On index rebuild, the new page structure would be employed.
The necessary instructions to do row-column byte offset calculations
directly from the vector register would have to be devised.
This needs to be worked out between Intel and various database vendors.
Perhaps the load into the vector registers bypasses L1 and/or L2?
It would be in L3 for cache coherency?
The current Xeon E5/7 processors, with the latest on the Broadwell
core, have 16 vector registers of 256-bits (32 bytes) totaling 512 bytes.
The Skylake has 32 registers of 512-bits, 2KB of registers.
This is too much to waste.
If they cannot be used, then the processor
with special memory controller
should discard the vector unit.
The main purpose of this article was to argue
for a new system architecture, having low latency memory,
implying a processor architecture change,
with a major focus on the feasibility for transaction processing
databases, and mostly as pertinent to Intel processors.
However, all avenues for significant transaction performance improvement were considered:
- Hyper-Threading to 4-way
- Memory-optimized tables with natively compiled procedures
- Database architected for NUMA
- Multi-core - the right size core?
- 3D XPoint
- SSE/AVX Vector Instructions
Increasing HT from 2-way to 4-way has the potential to nearly
double transaction processing performance.
Other options have greater upside, but this is a drop-in option.
Memory-optimized tables and natively compiled procedures combined
has the greatest upside potential.
People do not want to hear that the database should be
re-architected for NUMA scaling.
If it runs fine on single-socket or Hekaton, then fine.
But Intel mentions that scaling to the Phi core count levels
requires NUMA architecture even on one socket.
Higher core count using a smaller core will have the best
transaction throughput, but an asymmetric model might be preferable.
The capabilities of the modern processor core are being squandered
on long memory latency, in exchange for capacity that is not needed.
Figure out what is the right low latency memory.
There is definitely potential for 3D XPoint.
But it goes beyond displacing some DDR DRAM and NAND.
The true potential is to enable a smaller lower latency
true main memory, then have DDR DRAM and 3D XPoint as
something in-between memory and IO.
SSE/AVX: use it or lose it.
The growing gap between processor clock cycle time to memory latency
is not a new topic.
There have been many other papers on the advantage of various
memory technologies with lower latency.
Most of these originate either from universities or semiconductor companies.
Everyone acknowledged that cost relative to mainstream DRAM
was a serious obstacle.
A number of strategies were conceived to narrow the gap.
Here, the topic is approached from the point of view of
database transaction processing.
TP is one of several database applications.
Database is one of many computer applications.
However, transaction processing is a significant portion
of the market for high-end processors, and
systems with maximum memory configuration.
The database community regularly spends $100K on
a processor-memory complex for performance levels
that could be achieved with a single socket,
if it were matched with the right memory.
There is valid justification from the database side
to pursue the memory strategy.
There is justification to the processor-memory vendor
that this one market has the dollar volume to make this worthwhile.
And the extremely large memory capacity requirement is shown to now be a red herring.
In all, there are several worthwhile actions for the next generation
of server system architecture.
Probably none are mutually exclusive.
There are trade-offs between impact, cost and who does the heavy lifting.
No single factor wins in all cases, so a multi-pronged attack is the more sensible approach.
A fuller discussion of the figure below will come later;
for now, the diagram might be helpful in the discussion.
The elapsed time for a transaction is the weighted sum of operations
that incur a wait at each level.
Single thread performance is the inverse of elapsed time.
For throughput, memory access latency can be partially hidden with hyper-threading.
IO latency is hidden by asynchronous IO, but there is an overhead to that too.
Suppose D is the latency for DDR memory,
and S is the latency of some low latency memory,
both inclusive of transmission time and possibly L3.
There is no read IO in the DDR system.
Suppose x is the fraction of memory accesses that
are outside of the fast memory, and I is the latency of those accesses.
The term I might represent accesses to DDR or 3D XPoint
on the memory interface via a memory access protocol
so it is really not IO.
Or it could be to 3D XPoint or NAND attached to PCI-E
via an IO protocol.
The criterion for the small, faster memory being an advantage in elapsed time,
exclusive of operations that occur inside L3, is as follows.
(1-x)×S + x×I < D
x < (D-S)/(I-S)
The objective is to achieve a large gain in performance via elapsed time.
Only so much can be gained on the numerator D-S, so much depends on the
latency of I.
If the secondary device were DDR or 3D XPoint on the memory interface,
then a very high value of accesses (x) could be allowed while
still achieving good performance gain.
If it were on PCI-E, then 3D XPoint might have a strong advantage over NAND.
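As a hedged numeric illustration using round numbers from this article (D = 67ns for DDR, S = 25ns for the fast memory, I = 100µs for NAND or 10µs for 3D XPoint over PCI-E; the 80ns figure for DDR accessed as a secondary tier is my own assumption):

```python
def max_slow_fraction(d_ns: float, s_ns: float, i_ns: float) -> float:
    """Largest fraction x of accesses allowed to the slow tier: x < (D - S) / (I - S)."""
    return (d_ns - s_ns) / (i_ns - s_ns)

D, S = 67.0, 25.0
print(max_slow_fraction(D, S, 100_000.0))  # NAND on PCI-E:        x < ~0.04% of accesses
print(max_slow_fraction(D, S, 10_000.0))   # 3D XPoint on PCI-E:   x < ~0.4% of accesses
print(max_slow_fraction(D, S, 80.0))       # DDR on the memory bus: x < ~76%, a very loose bound
```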
In the discussion on Knights Landing, I suggested that the Atom core might not be bad
for transaction processing.
The cheapest Xeon Phi is the 7210 at $2438. About $4700 in a system.
What is the difference between Atom C2750 and 2758? Both are Silvermont 8-cores, no HT.
Use ECC SODIMM.
Atom has changed since its original inception,
which did not use out-of-order execution, for simplicity and power-efficiency.
Silvermont added OOO. Not sure about Goldmont.
Is Atom to be a slimmed-down Core, with 3-wide superscalar execution,
manufactured on the SoC version of the process?
I will try to sort out the material and redistribute
over several articles as appropriate.