There are situations where guaranteed correctness is essential, which is why we have transactional databases meeting the ACID properties. And then there are situations where we just need an approximate number. An example is the size of the SQL Agent job history log.
I bring up this topic because on more than one occasion, after fixing performance problems at a client site, the remaining top waits are two SQL statements belonging to stored procedures in msdb: one in sp_sqlagent_log_jobhistory and one in sp_jobhistory_row_limiter.
The average wait times for the day of Feb 28 were 207ms and 201ms respectively. The average elapsed times over several weeks are 171ms and 170ms respectively. Neither of these shows meaningful CPU consumption, at 0.716ms and 0.124ms respectively.
A third statement, also in sp_jobhistory_row_limiter, actually has much higher CPU consumption. This statement contributes 33.5ms CPU and 39.4ms elapsed time per call over the long period.
In terms of overall load on the system, these msdb
procedures are negligible. A modern 2-socket system has 12-16 physical cores,
so there are 288-384 CPU-hours available per day. The CPU-intensive query consumes about 40 minutes per day (33.5ms/call x 71,509 calls per day), a very small portion of the available CPU.
The wait time of the 2 longer running queries may be
significant at 4 hours per day each. The calls are from SQL Agent, not a live
person waiting for the response, so the wait time is not a problem by itself.
The high wait statement in sp_sqlagent_log_jobhistory is:
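(paraphrased from msdb; the exact column list varies by SQL Server version and several columns are omitted here) essentially the insert of the completed step's outcome into sysjobhistory:

INSERT INTO msdb.dbo.sysjobhistory
        (job_id, step_id, step_name, message, run_status,
         run_date, run_time, run_duration, retries_attempted, server)
VALUES  (@job_id, @step_id, @step_name, @message, @run_status,
         @run_date, @run_time, @run_duration, @retries_attempted, @server)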
The high wait statement in sp_jobhistory_row_limiter is:
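(again paraphrased; exact text varies by version) the per-job row count taken with an exclusive table lock hint inside the row limiter's transaction:

SELECT @current_rows_per_job = COUNT(*)
FROM msdb.dbo.sysjobhistory WITH (TABLOCKX)
WHERE (job_id = @job_id)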
The maximum elapsed times of sp_sqlagent_log_jobhistory and sp_jobhistory_row_limiter over the longer period are 11 and 41 seconds respectively. Potentially this might cause problems in situations that require precise timing. Of course, it is probably not a good idea to rely on SQL Agent if precise timing is required on a heavily loaded database server processing a high volume of jobs.
One might think that at just less than 1 call per second (71,509 calls per day, 86,400 sec per day), the chance of a collision between the two statements at 0.2 sec each is small. However, these are not random calls; they fire on schedule. So if 100 jobs fire off every 2 minutes on the minute, perhaps the chance of collision is much higher?
The sp_sqlagent_log_jobhistory procedure inserts into jobhistory, and then at the end calls sp_jobhistory_row_limiter.
There will not be contention between the two procedures for a given job and session.
So the issue is whether there will be collisions between different jobs. (I am
not sure under what circumstances there would be multiple sessions of 1 job.
Does this occur if the job does not complete within its periodicity or can one
job be controlled by multiple schedulers?)
The typical sequence of statements in sp_jobhistory_row_limiter, excluding conditionals, is as follows; three statements are of note: the high wait statement, the high CPU statement, and one that is rarely called.
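A paraphrased sketch of the sequence (the exact text varies by SQL Server version; @per_job_cutoff and @total_cutoff here stand in for the procedure's cutoff calculations):

-- two xp_instance_regread calls read the job history limits, then:
BEGIN TRANSACTION

-- the per-job COUNT(*) ... WITH (TABLOCKX) shown above   /* the high wait statement */

DELETE FROM msdb.dbo.sysjobhistory             /* trim this job's rows over the per-job limit */
WHERE (job_id = @job_id) AND (instance_id <= @per_job_cutoff)

SELECT @current_rows = COUNT(*)
FROM msdb.dbo.sysjobhistory                    /* the high CPU statement */

DELETE FROM msdb.dbo.sysjobhistory             /* trim the oldest rows over the total limit */
WHERE (instance_id <= @total_cutoff)           /* this is rarely called? */

COMMIT TRANSACTION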
The combination of the SELECT with (TABLOCKX) and the BEGIN/COMMIT TRAN will block inserts to jobhistory for the duration of the transaction. So it is quite possible there are collisions with the inserts from other jobs completing at about the same time.
Below is the detail for the
xp_instance_regread call. For some reason the execution statistics are not available. If this is mainly
called from the sp_jobhistory_row_limiter procedure, with 2 calls per execution, then there would be 143,000 calls per day with an average wait of 12.6ms per call.
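For reference, a sketch of the call form, assuming the values read are the SQL Agent history limits (JobHistoryMaxRows and JobHistoryMaxRowsPerJob):

DECLARE @max_total_rows int
EXECUTE master.dbo.xp_instance_regread
        N'HKEY_LOCAL_MACHINE',
        N'SOFTWARE\Microsoft\MSSQLServer\SQLServerAgent',
        N'JobHistoryMaxRows',
        @max_total_rows OUTPUT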
I am thinking that the reason for the BEGIN/COMMIT TRAN block in the row limiter procedure is to prevent the action of one session from deleting rows in the process of being inserted by another session?
Having limits on both the total number of rows for job history and individual jobs is a good idea. But this is an example of a situation where we are not concerned about guaranteed accuracy. In this case there are 150 jobs and 260,000 rows in jobhistory, requiring 460MB of storage. If we set total jobhistory at 500,000 rows and individual job history at 1000 rows, perhaps our intention is simply to keep the history table to a reasonable size, not to enforce those limits exactly.
We definitely do not need to check and enforce (both) jobhistory limits 71,500 times per day. One way to reduce the frequency of enforcement is to delete more rows than the bare minimum each time. I suggest deleting at minimum 10 rows per job every time the per-job threshold is exceeded.
I think it is silly to do a full count of the number of rows in sysjobhistory on every job execution. The SQL Server SELECT COUNT(*) statement does not cheat. It reports the actual number of rows in the table by touching the row header of each and every row. So perhaps we should use an alternative, good enough row count, from the row_count column of sys.dm_db_partition_stats for example?
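A minimal sketch of that approach, assuming the row_count column in sys.dm_db_partition_stats is accurate enough for this purpose:

SELECT @current_rows = SUM(row_count)
FROM msdb.sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('msdb.dbo.sysjobhistory')
AND index_id < 2    -- heap (0) or clustered index (1) only, so rows are not double counted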
I would also like to do away with the BEGIN/COMMIT TRAN block along with the TABLOCKX. Perhaps we can delete only job history rows more than 1 day old?
/* be sure to make this index change before altering the procedure */
CREATE INDEX nc1 ON msdb.dbo.sysjobhistory (run_date, job_id)   -- key columns assumed; supports deleting history by day
So my proposal for the sequence of statements in jobhistory_row_limiter is as follows:

SELECT @today = CONVERT(int, CONVERT(char(8), GETDATE(), 112))   -- run_date is stored as a yyyymmdd int

-- per-job delete restricted to rows with run_date < @today,
-- with no explicit transaction and no with (TABLOCKX) hint
/* if we only delete previous days, then this will only be active once per day */

/* do this first: the good-enough total row count from dm_db_partition_stats (index_id < 2), as above */

/* only if the above is above the threshold: delete a larger block of the oldest rows */
-- ALTER INDEX ALL ON msdb.dbo.sysjobhistory REBUILD?
Even though the overall impact of the jobhistory limits procedures is insignificant, it is somewhat annoying to have them show up at the top of the list of a third party performance monitoring tool. With a few simple precautions, these can be eliminated. If anyone is willing to alter the jobhistory row limiter procedure, please let us know the changes made along with the before and after execution statistics.
I should also add that the sysjobhistory table will get heavily fragmented over time, especially since job_id is a uniqueidentifier, so perhaps periodic index rebuilds are in order. Perhaps our strategy should be: when the max total rows is exceeded, delete a larger percentage, say 20% for example, then rebuild the index (outside the transaction, if one is present).
pps - I may have inadvertently deleted a comment? - oops! - sorry
Update 2013-10: StorageReview on EMC Next Generation VNX
Update 2013-08: News reports that VNX2 will come out in Sep 2013
While going through the Flash Memory Summit 2012 slide decks, I came across the session Flash Implications in Enterprise Storage Designs by Denis Vilfort of EMC, which provided information on performance of the CLARiiON, VNX, a VNX2 and VNX Future.
A common problem with SAN vendors is that it is almost impossible to find meaningful performance information on their storage systems. The typical practice is to cite some meaningless numbers, like IOPS to cache or the combined IO bandwidth of the FC ports, conveying the impression of massive IO bandwidth while actually guaranteeing nothing.
The original VNX was introduced in early 2011?
The use of the new Intel Xeon 5600 (Westmere-EP) processors was progressive.
The decision to employ only a single socket was not.
EMC did provide the table below on their VNX mid-range systems in the document "VNX: Storage Technology High Bandwidth Application" (h8929) showing the maximum number of front-end FC and back-end SAS channels along with the IO bandwidths for several categories.
It is actually unusual for a SAN storage vendor to provide such information, so good for EMC.
Unfortunately, there is no detailed explanation of the IO patterns for each category.
Now obviously the maximum IO bandwidth can be reached in the maximum configuration,
that is with all IO channels and all drive bays populated.
There is also no question that maximum IO bandwidth requires all back-end IO ports populated
and a sufficient number of front-end ports populated.
(The VNX systems may support more front-end ports than necessary for configuration flexibility?)
However, it should not be necessary to employ the full set of hard disks to reach maximum IO bandwidth. This is because SAN systems are designed for capacity and IOPS.
There are Microsoft Fast Track Data Warehouse version 3.0 and 4.0 documents for the EMC VNX
5300 or 5500 system. Unfortunately Microsoft has backed away from bare table scan tests of disk rates in favor of a composite metric. But it does seem to indicate that 30-50MB/s per disk is possible in the VNX.
What is needed is a document specifying the configuration strategy for high bandwidth specific to SQL Server. This includes the number and type of front-end ports, the number of back-end SAS buses, the number of disk array enclosures (DAE) on each SAS bus, the number of disks in each RAID group and other details for each significant VNX model.
It is also necessary to configure the SQL Server database file layout to match the storage system structure, but that should be our responsibility as DBAs.
It is of interest to note that the VNX FTDW reference architectures do not employ Fast Cache (flash caching) or (auto) tiered storage. Both of these are an outright waste of money on DW systems and actually impede performance.
It does make good sense to employ a mix of 10K/15K HDD and SSD in the DW storage system,
but we should use the SQL Server storage engine features (filegroups and partitioning)
to place data accordingly.
A properly configured OLTP system should also employ separate HDD and SSD volumes, again using filegroups and partitioning to place data correctly. The reason is that the database engine itself is a giant data cache, with perhaps as much as 1000GB of memory.
What do we really expect to be in the 16-48GB SAN cache that is not in the 1TB database buffer cache? The IO from the database server is likely to be very misleading in terms of what data is important and whether it should be on SSD or HDD.
CLARiiON, VNX, VNX2, VNX Future Performance
Below are performance characteristics of EMC mid-range for CLARiiON, VNX, VNX2 and VNX Future.
This is why I found the following diagrams highly interesting and noteworthy.
Here, the CLARiiON bandwidth is cited as 3GB/s and the current VNX as 12GB/s
(versus 10GB/s in the table above).
I am puzzled that the VNX is only rated at 200K IOPS. That would correspond to 200 IOPS per disk and 1000 15K HDDs at low queue depth. I would expect there to be some capability to support short-stroke and high-queue depth to achieve greater than 200 IOPS per 15K disk.
The CLARiiON CX4-960 supported 960 HDD. Yet the IOPS cited corresponds to the queue depth 1 performance of 200 IOPS x 200 HDD = 40K. Was there some internal issue in the CLARiiON?
I do recall a CX3-40 generating 30K IOPS over 180 x 15K HDD.
A modern SAS controller can support 80K IOPS, so the VNX 7500 with 8 back-end SAS buses
should handle more than 200K IOPS (HDD or SSD), perhaps as high as 640K? So is there some limitation in the VNX storage processor (SP), perhaps the inter-SP communication? or a limitation of write-cache which requires write to memory in both SP?
Below (I suppose) is the architecture of the new VNX2. (Perhaps VNX2 will come out in May with EMC World?) In addition to transitioning from Intel Xeon 5600 (Westmere) to E5-2600 series (Sandy Bridge EP), the diagram indicates that the new VNX2 will be dual-processor (socket) instead of single socket on the entire line of the original VNX. Considering that the 5500 and up are not entry systems, this was disappointing.
VNX2 provides 5X increase in IOPS to 1M and 2.3X in IO bandwidth to 28GB/s. LSI mentions a FastPath option that dramatically increases IOPS capability of their RAID controllers from 80K to 140-150K IOPS. My understanding is that this is done by completely disabling the cache on the RAID controller. The resources to implement caching for large array of HDDs can actually impede IOPS performance, hence caching is even more degrading on an array of SSDs.
The bandwidth objective is also interesting. The 12GB/s IO bandwidth of the original VNX would require 15-16 FC ports at 8Gbps (700-800MBps per port) on the front-end. The VNX 7500 has a maximum of 32 FC ports, implying 8 quad-port FC HBAs, 4 per SP.
The 8 back-end SAS buses imply 4 dual-port SAS HBAs per SP, as each SAS bus requires 1 SAS port to each SP? This implies 8 HBAs per SP? The Intel Xeon 5600 processor connects over QPI to a 5520 IOH with 36 PCI-E gen 2 lanes, supporting 4 x8 and 1 x4 slots, plus a x4 Gen1 link for other functions.
In addition, a link is needed for inter-SP communication. If one x8 PCI-E gen2 slot is used for this, then write bandwidth would be limited to 3.2GB/s (per SP?).
A single socket should only be able to drive 1 IOH even though it is possible to connect 2.
Perhaps the VNX 7500 is dual-socket?
An increase to 28GB/s could require 40 x8Gbps FC ports (if 700MB/s is the practical limit of 1 port). A 2-socket Xeon E5-2600 should be able to handle this easily, with 4 memory channels and
5 x8 PCI-E gen3 slots per socket.
The future VNX is cited as 5M IOPS and 112GB/s.
I assume this might involve the new NVM-express driver architecture
supporting distributed queues and high parallelism.
Perhaps the reason both VNX2 and VNX Future are described is that the basic platform is ready, but not all the components to support the full bandwidth are?
The 5M IOPS should be no problem with an array of SSDs, and the new NVM express architecture of course. But the 112GB/s bandwidth is curious. The number of FC ports, even at a future 16Gbit/s, is too large to be practical. When the expensive storage systems are finally able to do serious IO bandwidth, it will also be time to ditch FC and FCoE. Perhaps the VNX Future will support InfiniBand?
The purpose of having extreme IO bandwidth capability is to be able to deliver all of it to a single database server on demand, not a little driblet here and there. If not, then the database server should have its own storage system.
The bandwidth is also too high for even a dual-socket E5-2600. Each Xeon E5-2600 has 40 PCI-E gen3 lanes, enough for 5 x8 slots. The nominal bandwidth per PCIe G3 lane is 1GB/s, but the realizable bandwidth might be only 800MB/s per lane, or 6.4GB/s per x8 slot. A 2-socket system in theory could drive 64GB/s. The storage system is comprised of 2 SPs, each SP being a 2-socket E5-2600 system.
To support 112GB/s each SP must be able to simultaneously move 56GB/s on storage and 56GB/s on the host-side ports for a total of 112GB/s per SP.
In addition, suppose the 112GB/s bandwidth for read, and that the write bandwidth is 56GB/s.
Then it is also necessary to support 56GB/s over the inter-SP link to guarantee write-cache
coherency (unless it has been decided that write caching flash on the SP is stupid).
Is it possible the VNX Future has more than 2 SPs? Perhaps each SP is a 2-socket E5-4600 system, but the 2 SPs are linked via QPI? Basically this would be a 4-socket system, but running as 2 separate nodes, each node having its own OS image. Or might it simply be a 4-socket system?
Later this year, Intel should be releasing an Ivy Bridge-EX, which might have more bandwidth?
Personally I am inclined to prefer a multi-SP system over a 4-socket SP.
Never mind, I think Haswell-EP will have 64 PCIe gen4 lanes at 16GT/s. That is 2GB/s per lane raw, and 1.6GB/s per lane net, 12.8GB/s per x8 slot and 100GB/s per socket.
I still think it would be a good trick if one SP could communicate with the other over QPI, instead of PCIe. Write caching SSD at the SP level is probably stupid if the flash controller is already doing this? Perhaps the SP memory should be used for SSD metadata?
In any case, there should be coordination between what each component does.
It is good to know that EMC is finally getting serious about IO bandwidth.
I was of the opinion that the reason Oracle got into the storage business was that they were tired of hearing complaints from customers resulting from bad IO performance on multi-million dollar SAN storage systems.
My concern is that the SAN vendor field engineers have been so thoroughly indoctrinated in the SaaS concept that only capacity matters, while having zero knowledge of bandwidth, that they will not be able to properly implement the IO bandwidth capability of the existing VNX, not to mention the even higher bandwidth in VNX2 and Future.
Updates will be kept on
The term in-memory database can be subject to misinterpretation.
The term was originally used to describe a storage engine designed for the memory access characteristics of modern microprocessors, not simply a database stored in memory. Today it is common for a database to reside almost entirely in the buffer cache, i.e., the memory of a traditional relational DBMS, but this is very different from an in-memory database as just defined.
As Microsoft recently announced that the next version of SQL Server will incorporate in-memory database technology under the Hekaton codename, it is worthwhile now to revisit in more detail the difference between the original disk storage oriented and in-memory databases,
along with the differences in computer system architecture between then and now that drove the change in the database storage engine.
The First Relational Databases
Relational databases originated from the papers of Edgar Codd published from 1970 on.
Oracle may have had the first commercial product.
A group at UC Berkeley (Stonebraker and Wong) built INGRES, from which Sybase and later SQL Server descended.
Ingres was developed on a DEC PDP-11, which was a popular mini-computer system at the time (16-bit integer/register).
The Design and Implementation of INGRES
paper by Stonebraker, Wong and Held, ACM 1976 mentions support for UNIX on the PDP-11/40 45 and 70 models.
The 11/40 could support a process address space of 64K, and 128K on the 11/45 and 11/70 models.
The main element was for a database engine to make best use of limited memory to complete a query with minimal disk IO.
Computer System Architecture Evolution
The DEC PDP-11 came out in 1970 at a relatively low price-point such that it was a very
popular system in university environments.
The Spring Joint Computer Conference 1970 paper
A new architecture for mini-computers - The DEC PDP-11 cites a $5000-10K price target.
This may have been why one happened to be available for the original Ingres development project.
PDP 11 Handbook
lists the PDP-11/10 as having 1,024 words of 16-bit read-only memory and 128 words of read-write memory. The PDP-11/20 model has 4,096 words of 16-bit read-write (magnetic) core memory.
The max data transfer rate on the Unibus was one word every 750ns.
Core memory had a 1.2 µs cycle time and 500 ns access time.
Wikipedia lists the history of the PDP-11 Unibus models as:
- PDP-11/20 and 11/15: original with core memory, non-microprogrammed
- PDP-11/35 and 11/40: with microprogramming (1972?)
- PDP-11/45, 50 and 55: up to 256KB semiconductor memory (1971?)
- PDP-11/70: up to 4MB memory and 2KB cache (1975)
Microsoft Research now seems to be the repository of DEC material under the Gordon Bell section, including
The timeline information between Wikipedia and Microsoft Research does not appear to be entirely consistent.
Either it is difficult to interpret surviving documents or people's recollections
of this era are fading.
DEC VAX 11/780
There is more information on the next generation DEC VAX 11/780, the first 32-bit mini-computer.
This system came out in 1977.
VAX-11/780 Hardware User's Guide
VAX Product Sales Guide
for details. Also search for the VAX-11/780 Architecture Handbook from
Carnegie Mellon ECE.
The CPU was built with TTL, had a 200ns clock and 8KB cache.
No transistor count is cited?
The VAX-11/780 pricing was between $210K and 350K?
The system was described as 1MIPS,
but that was because the performance was roughly comparable to an IBM system (370/158-3?) that was
accepted as 1MIPS.
It turned out the VAX 11/780 executed 0.5M native instructions per sec to deliver performance equivalent to the IBM 1MIPS system.
John McCallum's jcmit site cites the IBM 370/158-3 as 0.73MIPS and the VAX-11/780 as 1MIPS.
The CPUs of this time were very limited in the number of transistors, and could directly implement only basic instructions. It would not have been feasible for compiled binaries to be built on basic instructions alone. The native VAX (or PDP-11) instruction set was comprised of complex instructions, which are translated by a set of microprogrammed instructions (microcode) into the basic instructions?
The presumption based on 0.5 VAX MIPS and the 5MHz clock cycle
is then that the average VAX instruction decomposes into 10 basic instructions
or rather clock cycles, accounting for memory access time?
The memory system contains one or two memory controllers.
Each controller can handle 1 to 16 arrays.
The memory array has a cycle time of 600ns.
A memory controller buffers one command while processing another.
The memory controllers can be interleaved.
Cache access time is 200ns, basically 1-cycle access.
Memory cycle time is 600ns. Read access time at the processor is 1800ns.
Effective average operand access time is 290ns.
The first systems used 4Kbit DRAM supporting a maximum system memory of 2MB, in increments of 128K.
Later systems used 16Kbit DRAM, supporting up to 8MB memory, in 256K increments.
Minimum memory was cited as 128K and 256K in the 1977 and 1979 handbooks,
but later documents cited minimum memory of 1M?
If we do that math, we can work out that excluding overhead for ECC, 4096 chips are required for 8MB at 16Kbit per DRAM.
The VAX 11/780 has a 72-bit memory path comprised of a 64-bit word with 8-bits for ECC.
By comparison, a modern server system supports 1TB memory over 64 DIMM sockets with 16GB DIMMs.
There are 36 chips on each 16GB DIMM (32 for data, 4 for ECC) at 4Gbit per chip.
The DRAM package could be single or double die package (DDP).
So the system could have up to 2048 chips plus 256 for ECC.
Over the course of time, computer systems transitioned to single chip microprocessors.
The low-end systems transitioned first to realize the cost benefits of lower part count.
Eventually high-end systems transitioned to microprocessors as well,
due to the chip to chip signal delays not scaling with improving transistor performance
within a chip.
The next step in microprocessor architecture was pipelined execution.
A complete single instruction is comprised of a sequence of many operations.
By dividing the sequence into many steps, the clock rate for completing a single
step can be higher than for the whole instruction.
By allowing a sequence of instructions to overlap,
one instruction could be completed each clock cycle with pipelining.
Microprocessor Design/Pipelined Processors
has excellent illustrations of pipelining.
The time to execute a single full instruction is several clock cycles.
The Intel 80486 (1989) has a 5-stage pipeline: fetch, decode1, decode2 (effective address), execute, and write-back.
The Pentium (1993) pipeline stages are: Prefetch, Fetch (MMX only?), D1 Instruction Decode, D2 Address Generate, Execute, Writeback.
So that makes 5 stages for the original Pentium and 6 for the MMX?
Intel is curiously vague on the exact number of pipeline stages for the Pentium Pro to Pentium III,
collectively known as the P6 line.
The later Pentium M could be an improved P6, but is also called a new design.
It might be because the actual number of stages varies with the instruction?
The Pentium III has been cited as 10 stages, and the Pentium Pro (P6) could be the same.
The later Pentium III processors may have added a (prefetch) stage purely to account for the time to access L1 cache
as core frequency increased with process generation and maturity.
The first Pentium 4 processors (Willamette and Northwood) are 20 stages, the second generation Prescott is 31 stages.
The diagram below is from "The Microarchitecture of the Pentium 4 Processor", Intel Technology Journal Q1 2001
showing 10 stages for a basic Pentium III, and 20 stages for the 180nm and 130nm Pentium 4s,
Willamette and Northwood.
In other documents, I have P6 as:
IFU1, IFU2, IFU3, Dec1, Dec2, RAT, ROB, Dis, EX, Ret1, Ret2.
The Core 2 (Conroe 65nm, Penryn 45nm; there were other codenames for server and mobile) has 14 stages.
The Core 2 brand name was later changed to Core, even though pre-Core 2 processors had already been sold
with Core (Duo and Solo) as brand name.
The difficult decisions that marketing pukes must make.
The next significant innovation was super-scalar execution,
where a microprocessor could complete several instructions in parallel each clock cycle.
The Intel Pentium has limited 2-wide super-scalar.
The Pentium Pro had a more broadly usable 3-wide.
The super-scalar execution units typically have special uses,
so it is not always possible to complete an instruction on all units in each cycle.
The Intel Pentium 4 is shown with 4 ports, 2 for execution, 1 Load and 1 Store port.
I recall the Pentium 4 as 3-wide, which might be the maximum throughput of the ports.
The Intel Core microarchitecture (Conroe/Penryn) is described as 4 instructions per clock cycle
versus 3 in previous architectures.
The diagram shows 5 units, 3 for different aspects of ALU, FP and vector, 1 Load and 1 Store.
Also mentions 14-stage pipeline.
The Intel Nehalem is shown in IDF 2008 with 6 execution units, 3 for integer, FP and vector,
1 Load, 1 Store Address and 1 Store Data.
The Intel Sandy-Bridge is shown with 6 execution units, 3 for integer, FP and vector,
2 for Load/Store Address and 1 Store Data.
The Intel IDF Fall 2012 presentation on Haswell shows 8 units: 4 integer, of which 3 can do vector and 2 can do FP, plus 2 Load/Store Address, 1 Store Data, and 1 Store Address.
Million Instructions Per Second MIPS
Technically, an instruction on one system architecture has no inherent correlation to an instruction
on a different system architecture.
So there should be no correlation between MIPS on one system to another.
But people need or want to compare systems, and MIPS had already become popular,
so the MIPS was standardized, first as Whetstone (contains floating-point),
and then later Dhrystone (no floating-point).
One DMIPS is the performance of the VAX-11/780, rather than 1 million specific actual IPS.
There are minor inconsistencies between MIPS from various sources.
The table below is mostly from the Wikipedia article Instructions per second.
The last two items are multi-core processors and the MIPS rating is for all cores, but the Dhrystone IPS/clock is per core.
Another broad compilation is jcmit.
(Table: MIPS ratings and Dhrystone IPS/clock for the IBM 370 158-3 — 1 MIPS at 8.69MHz, 0.33 IPS/clock (not Dhrystone) — and the Intel 80486 DX2, Pentium Pro, Pentium III, Pentium 4EE, Core 2 (2c), and Core i7 920 (4c).)
Notice the sharp rise in IPS per clock between the Intel 80386 (non-pipelined) and the 80486DX2 (pipelined)
to nearly 1 per clock.
Presumably the main contributor is the 8K (unified) on-die cache for the 80486 and the 8K data + 8K instruction caches for the Pentium.
The high-end 486 and Pentium systems of this period also had off-die L2 cache as well.
I do not recall if off-die cache was common for 386 systems.
Thereafter, IPS/clock is greater than 1 with the advent of super-scalar execution.
Both the Pentium Pro and Pentium III are 3-wide, so the increase in IPC might be due to the SIMD capability of the Pentium III.
The Pentium 4 gave up a degree of IPC on the very deep pipeline to achieve extraordinarily high clock rates.
The Core 2 was 5-wide?
The Core i7 is 5-wide but also has hyperthreading.
The latest Sandy-Bridge is 6 wide?
Intel provides MIPS ratings of their earlier processors up to the Pentium in
List of Intel microprocessors
(Table: MIPS ratings for the Intel Pentium (P5), Pentium (P54), and Pentium (P54CS).)
A complete DRAM history is more difficult to trace, with the primary manufacturers changing over time.
Wikipedia is generally a good starting point.
Dynamic random access memory,
DRAM Design Overview
from Stanford University by Junji Ogawa.
DRAM timing is particularly difficult to understand,
more so with the radical change from asynchronous (FPM and EDO) DRAM
to synchronous SDRAM, and DDR timings.
One of these references provides the diagram below.
Other references are AnandTech
and Ulrich Drepper's
What every programmer should know about memory.
The aspect of focus here is memory access latency.
This element was generally quoted for asynchronous DRAM products.
After the change to synchronous DRAM, the industry emphasis changed
to bandwidth timings.
The last of the FPM and EDO DRAM products were available with 50ns access times,
but 60ns products were more common.
Perhaps the 50ns access time required cherry picking from a normal production run?
Today, the best DDR3 may have an access time of 25-30ns at the DRAM chip.
Local memory access time at the processors (with integrated memory controller) is
on the order of 50ns?
The difference is due to signal transmission from processor to memory and back. On server systems using registered memory, there may be a buffer chip between processor and DRAM? On multi-processor (socket) systems, access to remote node memory may be over 95ns to an adjacent node and 150ns+ for 2-hop distances?
DDR transfers data on both edges of the clock, i.e., at double the clock rate.
Internally, DRAM is now organized into multiple banks in order to sustain data transfers at a very high rate.
The entire discussion above pertains to mainstream DRAM, which emphasizes cost relative to capacity first, followed by bandwidth, with the expectation that a computer system will be comprised of many DRAM chips. For example, a recent generation personal computer will have 2 memory channels, each 64-bits wide. The DRAM components are organized as x8, providing an 8-bit data path, so 8 chips form a 64-bit channel, and the minimum system memory has 16 chips.
There are specialty DRAM products designed around different requirements.
Graphics DRAM is designed for high bandwidth on the assumption that the memory system
will be comprised of few chips.
Consider a graphics subsystem that needs only 1GB comprised of 1Gbit chips.
The desired bandwidth might require a 256-bit path. So GDDR DRAM are often organized wider, x32 being popular.
Another specialty is reduced latency DRAM for network systems.
These systems do not require monstrous system memory capacity,
but do need super low latency to support fast turn-around time for high-speed networking,
in the 10-40Gbit/s range.
A Micron RLDRAM document mentions tRC of 8ns versus 46-52ns for DDR3?
It has been my opinion that server system memory has long since become out of balance with the original concept of system main memory.
The latency has become too long for memory.
Today most memory is used for caching of one form or another, including the database buffer cache.
The time is right to split computer system memory.
There should be a smaller memory subsystem emphasizing very low latency, not just with specialty DRAM, but also with physical proximity, perhaps in the same package as the processor.
A separate larger subsystem can continue to implement bulk DRAM, tolerating longer latency.
It has long been known that memory access latency cannot keep up with the microprocessor.
Of course, the Intel server microprocessor clocks rates have settled into the 3GHz range,
with further progress emphasizing the number of superscalar execution ports,
and the number of cores on a single die (or socket).
For a 3GHz processor and 50ns local node access, memory latency is now 150 CPU clock cycles away,
and 300+ for remote node memory access.
Micron and other memory companies have formed the
Hybrid Memory Cube consortium, proposing a radical re-architecture of the memory system.
See Hot Chips HC23
Hybrid Memory Cube (HMC).
by J. Thomas Pawlowski, Micron
High-Performance Memories for Packet Processing.
On the VAX-11/780, the CPU clock was 5MHz or 200ns cycle time,
but a complete instruction averaged 10 cycles.
DRAM access time was 250ns, 600ns to the memory controller and 1800ns to the processor.
This was before the advent of SIMM and DIMM technology.
The processor, memory controller and memory were all on separate boards, with long signal delays.
So essentially, memory access time was comparable to the time to complete one VAX (complex?) instruction.
A single Intel Sandy-Bridge core can complete 6 instructions per clock cycle if there are no memory stalls.
The key to modern microprocessor performance is an effective cache strategy to hide memory latency.
This can be successful if there is locality or if memory can be prefetched, ideally 150+ cycles before it is needed.
An alternative strategy is sequential memory access to make use of the high memory bandwidth of modern systems.
(Table: CPU clock, effective ns per instruction, and memory access time — e.g. 100ns for a 1-hop remote node access.)
Summarizing, the CPU clock was faster than memory access even back in 1978.
However, the CPU was also a very simple device that required 10 full cycles to complete
a (microprogrammed) instruction. So the net result was that instruction time was comparable to memory access.
Today, a single core is capable of completing 6 instructions per clock.
This is on top of the 150-1 ratio between local memory access latency to CPU clock.
The decisions made thirty years ago for good reasons no longer hold today.
The current nature of computer system architecture points to a completely different strategy
given the long latency for memory access.
The modern microprocessor is designed to operate with pipelining and superscalar execution.
There should be multiple independent instructions that can be executed on each clock.
Furthermore, instructions executed in one clock should not have intractable dependencies
on instructions in the immediately preceding clock cycles.
The most difficult code for modern microprocessors is pointer chasing.
This is where a memory access retrieves the next memory location to be accessed.
If the memory address is not in cache, then the access time to DRAM is over 150 cpu-cycles,
during which the processor core has nothing to do.
Once the memory is accessed, this provides the address of the next memory fetch.
Unfortunately, this code sequence just happens to describe a b-tree search.
Modern Computer Architecture and Memory
Page and Row Storage Organization
Having covered the computer system architecture transitions from 1970 to 2012, including the processor core and the memory system, it is appropriate to return to the original relational database implementation.
The following diagram is from
The Design and Implementation of INGRES
by Stonebraker, Wong and Held, ACM 1976.
The page and row storage organization from Ingres in the 1970s is still in use today.
The diagrams below are from Microsoft MSDN
Understanding Pages and Extents
Inside the Storage Engine: Anatomy of a page
Now examine the sequence of operations to access rows and columns with page-row storage,
with consideration for whether memory access operations are in cache,
or can be prefetched.
Assume that we have already found the sequence of rows required by a query from an index.
The information we have for each row is the file_id, page_id, and row number.
1) Check if the page is in the SQL Server buffer cache.
Also the OS must check if the page is in memory (unless lock pages in memory is in effect)
2) Acquire a shared lock or latch on the page (table and row locks)
3) Read the page header
4) Read the 2-byte row offset at the end of the page
5) Read the row/record header
6a) Fixed column loads: Address = row offset + column offset
6b) Nullable columns: Load NULL bitmap, calculate offset, load?
6c) Variable length: follow the chain of column offset to the desired column?
1) The cost of the page-in-cache check could be as high as 1000 cpu-cycles? This is based on a series of table scan tests I did for a varying number of rows per page, with the lock pages in memory permission on.
The OS check could be equally expensive. One of Thomas Kejser's slides from SQLBits
mentions that lock pages in memory performance impact could be significant.
Note to storage vendors: this is why the claim that caching solves IO performance problems is totally stupid.
2) It is necessary to place some kind of lock or latch on the page
even if nolock or tablock is applied on the query.
This is so the SQL Server storage engine knows that the page cannot be evicted from the buffer cache while being read.
4) The reason that the row offset and the actual row data are filled in from opposite directions of the page is to improve storage efficiency.
With nullable or variable length data, it is not known how many rows will fit in any given page.
This requires non-sequential memory access patterns.
5) One might think in a SELECT COUNT(*) query that we could just read the m_slotCnt value in
the page header, or read the number of 2-byte row offset values at the end of page,
but apparently SQL Server actually reads the row header for each row.
6) Fixed length non-nullable columns are the least effort because the column offset is known ahead of time and the same for each row.
One of the recent SQL Server versions improved the handling of nullable columns by having a bitmask for all columns in each row, which simplifies the process of determining the offset?
Variable length columns are then difficult?
I think we have to go to the first variable length column, read the length to get the next length value and so on until we find the desired column. It would be nice to see the source code for this.
Perhaps someone would be helpful in examining the code of one of the open source databases?
There are also cycles expended to handle conversion from raw bytes to the correct data type
and special SQL Server rules.
A couple of years ago, I proposed extensions to the Intel vector instructions in
SIMD Extensions for the Database Storage Engine
in order to facilitate database page-row-column address calculation.
This would require working out the details on the new instructions,
and getting Intel to implement this in the next processor still early in the design stage.
I suspect that it would also be necessary to change the way metadata is stored to facilitate loading into the vector registers.
It would take 2-3 years for the new processor to enter production.
There would be another 2-3 years before the new technology is broadly deployed.
Of course all of this should have been started ten years ago when CPU frequency went over 1GHz.
I ran a test on a series of tables with a range of rows per page from 1 to 476. The queries consisted of table scans, first getting just a count of rows and then aggregating successive columns.
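A sketch of the form these test queries might take (table and column names are hypothetical):

SELECT COUNT(*) FROM dbo.ScanTest
SELECT COUNT(*), SUM(col1) FROM dbo.ScanTest
SELECT COUNT(*), SUM(col1), SUM(col2) FROM dbo.ScanTest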
The first three systems are 2-socket, running SQL Server 2008R2 on Windows Server 2008R2.
The last system is single socket running SQL Server 2012 on Windows Server 2012.
The table below shows, in CPU-nanoseconds, the cost per page (including 1 row) of the count query, the cost for each additional row, and then the additional cost for the first and each additional column aggregated. Given that all processor cores were around 3GHz, the CPU-cycles for each page is in the range of 2000 cycles, with each row and column contributing another 150 or so cycles.
When the first row is accessed, i.e., reading the last 2 bytes of an 8KB page, the entire 64-byte cache line comprising 32 row offset values would be read into cache. The approximately 150ns cost per row corresponds to a single memory access for the row header, with the row offset most likely already in cache.
The tests compared column accesses in sequence. The single column aggregate is on the second column. The two column aggregate is on the second and third columns, which should be stored in adjacent bytes. There is some indication that pairs of columns cost marginally more than a single column, but the cost of 100+ cycles per successive column seems high. Is this the type conversion, or due to the interpreted code?
My standard SQL Server configuration is with the lock pages in memory right assigned,
as this is required for Trace flag 834: use large-page allocations for the buffer pool.
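Trace flag 834 is enabled as a startup parameter (-T834); a quick way to check whether it is active:

DBCC TRACESTATUS(834, -1);   -- -1 reports the global trace flag state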
I was not aware of Thomas Kejser's report that the lock pages in memory by itself
would have significant performance impact.
If possible, I will re-run the above tests with and without lock pages in memory.
Scaling and Locks
Another major topic in database performance is scaling to a very high number of cores.
This is both scaling over the cores in a single processor socket
and scaling over all the cores of a multi-socket system.
Apparently the locking mechanism is a serious obstacle to scaling.
A few years ago, I did a study of a non-transactional b-tree search engine, i.e., without locking. Not only did it scale perfectly over the physical cores,
it also scaled perfectly over the Hyper-Threading logical cores.
This was possible because the b-tree search is a series of pointer chasing memory accesses,
resulting in many no-ops cycles within a single thread.
With no lock contention, the scaling was perfect.
I also looked at compression and parallelism. At DOP 1, queries to compressed tables consumed
about 40% more CPU than to uncompressed tables, depending on the operation.
The uncompressed tables would scale with increasing the degree of parallelism up to a point, before scaling falls off and the performance is saturated. The compressed tables scaled perfectly until the performance was equal to the uncompressed tables.
The interpretation was that contention for locks was limiting scaling with parallelism.
The compression added enough CPU on each thread to relieve the contention.
At high degrees of parallelism, 16-32 in some examples, the compression essentially becomes free.
Transactional memory is currently a topic of discussion. See the Intel Developer Forum 2012
session ARCS0004, Intel Transaction Synchronization Extensions.
The objective is a lower overhead alternative to locking that can be used in most cases?
The Microsoft paper also discusses lockless or lock free memory strategies?
As soon as it was evident that CPU-cycles and memory access latency were on diverging
paths, perhaps around 1990, it was realized that the page row storage system with pointer chasing code to retrieve scattered metadata would not realize the full capability of modern processors. Hence the term in-memory database for describing a storage engine optimized for
processor - memory access characteristics.
Another option is columnar data storage.
The sequential data access could then take advantage of the memory bandwidth of systems,
which was improving at an adequate pace.
Furthermore, the data type within each column would be known, except for the (hopefully) rarely used variant.
By the time of the later Intel 486 or early Pentium processors, the cpu cycle time to memory access latency
ratio had exceeded ten. So there was talk of 10X or greater performance
with in-memory and columnar database technology.
At that time, system memory had not become ridiculously huge as it is today,
so in-memory databases were not really practical to achieve broad adoption.
Today server memory capacity is both huge and cheap, with 16GB DIMM pricing below $400.
Of course the mainstream database systems have progressed far beyond their original base, with a deep infrastructure of tools and features, such that migrating to a different DBMS would involve huge effort and risk. The natural solution is to incorporate in-memory database technology into an existing DBMS.
Microsoft SQL Server has already incorporated columnar storage in version 2012.
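In SQL Server 2012 this takes the form of the nonclustered columnstore index (the table is read-only while the index exists); a minimal example with hypothetical table and column names:

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
ON dbo.FactSales (DateKey, ProductKey, Quantity, SalesAmount);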
Breakthrough performance with in-memory technologies (Nov 8, 2012) on the Microsoft TechNet SQL Server Blog by Dave Campbell, and his earlier post The coming in-memory database tipping point (Apr 9, 2012), describe the rationale behind Hekaton.
The first Nov 8 post cites Per-Ake Larson et al
High-Performance Concurrency Control Mechanisms for Main-Memory Databases
which describes methods to reduce locking and other concurrency overhead.
Other references: Oracle TimesTen In-Memory Database Architectural Overview, the IBM solidDB redbook, and the Wikipedia articles In-memory database and Column-oriented DBMS.
The diagram below is from Oracle TimesTen In-Memory Database Architectural Overview
I am still working on this, to fill in missing data, correct mistakes, etc. I will try to make updates here, but if not, the permanent copy is here.
I had discussed SQL Server parallelism in Oct 2010, with my thoughts on the
best settings for: Cost Threshold for Parallelism (CTP) and Max Degrees of Parallelism (MAXDOP) in
Parallelism Strategy and Comments.
At the time, I had intended to follow up with detailed measurements.
So now a mere 2 years later, here it is.
The general thought was that CTP should be raised from the default value of 5,
and MAXDOP should be changed from unrestricted, on modern systems with very many cores,
and most especially on systems with Hyper-Threading.
However reasonable each person's ideas/suggestions are, nothing is better than hard data.
The interest is in the smaller queries that can have a parallel execution plan.
With the default Cost Threshold for Parallelism setting, we are trying to find queries with the lowest plan cost and lowest actual cost (which point to different queries) that have a parallel execution plan. (I will give details on index seeks later.)
The test system is now a 2-socket Intel Xeon 5670 six-core 2.93GHz (32nm Westmere-EP) with Hyper-Threading enabled
(a total of 24 logical processors).
Some references are made to test results on an earlier system
with 2 x Xeon E5430 quad-core 2.66GHz (45nm Core 2) without Hyper-Threading.
The test data is built using the TPC-H data generator, initially employing derivatives of the Part table.
At scale factor 10, the Part table is 2M rows, about 285MB, 55 rows per page or 149 bytes per row.
The derived tables have 2 additional 4-byte integer columns, size 301MB, 52 rows per page or 157 bytes per row.
Parallel Hash Join
The smallest hash join on the modified Part table that results in a parallel execution plan occurs at around 100,000 rows.
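A sketch of the form such a test query might take (table, column and key names are hypothetical; the later loop join test uses the same form over a smaller row range):

SELECT SUM(a.float_col1), SUM(b.float_col1)
FROM dbo.PartO a
INNER JOIN dbo.PartI b ON b.int_key = a.int_key
WHERE a.PartKey BETWEEN 1 AND 100000
OPTION (MAXDOP 2)        -- degree of parallelism varied per test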
Let's first examine the non-parallel plan.
The three major components of this plan are the outer and inner source index seeks, and the hash match.
The Stream Aggregate is small compared to the major components. The IO cost of 1.42 corresponds roughly to a range scan/seek of 15MB (IO cost 1 is 10.5MB). The actual logical IO is 1931, very close to the value of (1.42446 - 0.003125) * 1350, and includes a few upper level index LIOs.
Below are the Hash Match and Stream Aggregate details.
Note that the Hash Match has the largest cost of all operations in this execution plan,
and the entire cost is in the CPU element.
The 2 index seek operations have their cost mostly in the IO element, with only 7.1% in the CPU.
Below is the parallel Hash Join execution plan at DOP 2.
The Outer and Inner Source index seeks have identical costs so only the first is shown.
As covered in Cost Based Optimizer Parallelism I
the CPU cost is reduced by the degree of parallelism,
but the IO cost does not change, which I call a saturated IO model.
The Hash Match details above and the Stream Aggregate details below at DOP 2.
Both operations have cost reduced by the degree of parallelism as both elements
have their costs entirely in the CPU portion.
The Parallelism operation, as its name implies, only occurs in parallel execution plans
adding cost 0.0285.
The cost structure of this operation is proportional to the number of rows,
which implies that high row count queries are less likely to have a parallel execution plan
as the cost of reconstituting threads may outweigh the formula benefit of parallel operations.
Parallel Loop Join
The second test query is a loop join.
A parallel execution plan occurs at around 8000 rows for high degrees of parallelism.
At DOP 2, this occurs around 10000 rows, which will be used for test purposes.
The non-parallel loop join plan is shown below.
The outer and inner source operation details are shown below.
Note the inner source subtree cost of 29.0264, based on number of executions 10000.
The CPU component should be 10000 * 0.0001581 = 1.581.
Then the IO component is 29.0264 - 1.581 = 27.4454,
which is approximately equal to 8782 * 0.003125.
This is an estimate of the number of physical IO for 10000 executes.
The assumption is that some of the pages would have been previously loaded into memory
during the execution of this query, but not yet evicted.
There are 38465 leaf level pages in both tables.
The alternate plan for this query is a hash join with a scan on the inner source table.
In this plan, the table scan would have IO component 28.5, CPU 2.2
and the hash match would contribute another 9.
So it would take another 35% more rows before the loop join plan would naturally
shift to a hash join at DOP 1. At higher DOP, the hash join cost is reduced proportionate to DOP.
The Nested Loops and Stream Aggregate details are shown below.
The Nested Loops is only 0.14% of the overall plan cost, and the Stream Aggregate even less.
Below is the parallel loop join query plan at DOP 2.
The parallel outer and inner details are below. Note the very minor reduction
in outer source CPU from 0.155756 to 0.150178.
There is no change in the inner source operation, not even the element attributed to CPU.
The Nested Loops and Parallelism details are below.
In the parallel plan, there is a CPU reduction of 0.005 from the outer source,
0.0209 from the Nested Loops, and 0.003 from the Stream Aggregate,
to just ever so slightly overcome the Parallelism cost of 0.0285.
The total plan cost for the non-parallel loop join is 29.23 and the parallel plan cost is 29.229.
If this model was even remotely accurate, we should ask why bother to switch to a parallel plan
for such a minute gain.
It turns out that the SQL Server parallel execution plan cost estimate is nothing close to the true
cost structure, which raises a completely different set of questions,
starting with: how can SQL Server be expected to generate good parallel (or not) execution plans
if the cost model is completely different than the true cost structure?
Parallel Query Performance and Actual Cost - Hash Joins
Below is the hash join performance in rows per sec (left vertical scale),
and (worker or CPU) cost in micro-sec per row (right vertical scale), both versus degree of parallelism.
Hash Join 100K rows versus Degree of Parallelism
At DOP 1, the performance is just over 1M rows/sec, scaling well to DOP 4,
levels off for DOP 6-12 at just over 4M rows/s, then jumps again at DOP 14.
The peak gain from parallelism is 6.646 speed up at DOP 20 over DOP 1.
The cost is just under 1 us per row at DOP 1, rising as high as 2.6 us per row at high
DOP, more so if HT is involved?
At DOP 1, the query execution time for 100K rows is 95ms.
It appears that at DOP 2 and 4, the individual threads are running on logical processors
in different physical cores, hence the excellent scaling.
At DOP 6-12, some of the logical processors are on the same physical cores.
Hyper-threading does improve performance to a moderate degree in SQL Server
depending on the operation.
(I worked on a custom b-tree search engine with no locking protection.
The scaling was essentially linear regardless of physical or logical cores from 1 to 24 threads.
This may seem fantastic, but it does make sense because a b-tree search is just serialized memory accesses.
Fetch a memory location, which determine the next memory location to fetch.
The first memory access must be complete before the next can be issued. The core clock rate for 3.3GHz is 0.3ns. Memory access latency is 50ns for local node, 100ns for 1 hop remote node, corresponding to 150 and 300 CPU-cycles respectively.)
If this assessment is correct, then it would suggest the proper strategy for the SQL Server
engine in parallel execution plans is to allocate 1 worker thread from a logical processor
on each physical core, perhaps allocating from the cores on the same socket before
proceeding to the next socket.
Only if the desired degree of parallelism is greater than the number of physical cores should two logical processors be allocated from any single physical core, and this should only occur in exceptional circumstances.
The next level of sophistication would be to match thread assignment with memory alignment, so that the majority of memory accesses are to the local node. This might require partitioned tables (with hash key partitioning?).
This effort would be hugely complicated, and only applicable to special circumstances,
but a true fanatic would not be deterred.
Below is the performance and cost structure for a hash join on the full table of 300MB and 2M rows.
The cost per row is higher, 1.495 us per row at DOP 1, rising to 3us at high DOP.
However scaling is better with a peak speedup over DOP 1 of 11.35 and the peak rows per sec
is also higher than the previous query.
Hash Join - 2M rows versus Degree of Parallelism
I am not sure why the cost per row in this query is higher for the full table than over the limited range of 100K rows. There was no tempdb activity in either case. One speculation is that for the smaller query the hash join intermediate results remain in L3 cache (15M), resulting in lower access time than off-die memory accesses.
A test with small and large merge joins, also exhibits the same behavior.
Is it possible this is due to local and remote node memory locations?
Life is not simple on NUMA systems with HT.
Just to be sure, the same full table scan query was tested on a 3GB 20M row table, as shown below.
The DOP 1 cost and performance was about the same, slightly lower actually at 1.376 us per row.
The scaling was better with peak speedup at 14.95 and peak performance at nearly 11M rows/sec.
Hash Join - 20M rows versus Degree of Parallelism
Both full table hash joins (300MB and 3GB) demonstrate that parallel execution plan scaling
is better for larger queries.
Parallel Query Performance and Actual Cost - Loop Joins
Below is the Loop Join performance in rows per sec and cost in usec per row
for a query of 10000 rows on the 300MB tables.
Notice that performance peaks at DOP 6 and degrades at higher DOP.
The peak speedup over DOP 1 was 4.4.
Loop Join - 10K rows versus Degree of Parallelism
The cost at DOP 1 is 1.827 usec per row. For 10000 rows, the execution time is a mere 18ms (corrected from 1.8ms).
That this query benefits from parallel execution at all is amazing.
Of course it also raises the serious question as to why SQL Server should go to the effort of a parallel execution plan for a query that runs in 18ms.
This is due to the antiquated IO based cost model and the default CTP setting of 5.
Based on the 100K row hash join with plan cost just over 5 and actual query time of 95ms, we might consider a Cost Threshold for Parallelism around 25, making the threshold around 0.5 sec. However, based on the 10K row loop join with plan cost 29 and actual query time of 1.8ms, a CTP of 25 would point to a 50K row loop join with actual execution time of 10ms. (OK, it was late.) However, the 10K row loop join at plan cost 29 runs in 1.8ms!
The disparity between the Loop and Hash join plan costs and their actual costs does not allow a good strategy for the Cost Threshold for Parallelism setting without a complete overhaul of the SQL Server cost model. And this is not something the SQL Server team seems to be willing to tackle.
Another possibility is for the Cost Threshold for Parallelism setting to only
consider the CPU element of the plan cost.
Too bad we do not have access to source code to investigate this.
The outer and inner source tables were populated
and indexed in a manner such that for this loop join, each row from the outer source joins to a row
in a different page of the inner source table.
When the rows join to consecutive rows in the inner source table,
the cost per row is even lower at 1.33 usec per row.
Below is the Loop Join performance and cost for 100K rows on the 3GB tables.
Performance continues to increase for higher DOP.
The peak speedup over DOP 1 was 8.26.
Loop Join - 100K rows versus Degree of Parallelism
In examining the CPU usage during the tests, it was evident that in some cases,
only one logical processor of a physical core was used.
In other cases, both logical processors on some cores were used.
This leads to the uneven scaling versus DOP.
Parallel Execution Plan Throughput Tests
Another point of interest with parallel execution plans is throughput when supporting concurrent sessions.
The figure below shows performance in rows per sec at various DOP settings
versus number of concurrent sessions.
The results are based on the number of queries completed in a 10-sec window.
Since the single session DOP 1 query run-time is 100ms, this should be reasonably accurate.
Hash Join Performance (rows/s) by DOP versus # of Concurrent Sessions
SQL Server parallel execution plans are great for a single session,
even for queries that run in so short a time that parallelism should have never been considered.
However there are definitely issues with parallel execution in throughput tests.
Loop Join Performance (rows/s) by DOP versus # of Concurrent Sessions
It is stressed that the queries used in the tests above run in 95ms and 18ms respectively
at DOP 1. So it is not surprising that there are significant throughput issues
in attempting concurrent parallel execution with such minuscule queries.
I present these as examples because they are among the smallest queries for which
SQL Server engages parallel execution with the default settings.
This is because the default settings are seriously obsolete.
The original setting targeted a 5 sec threshold for parallel execution on a Pentium Pro 166MHz
(I think this was a 0.8um, i.e. 800nm, design)
and does not work well on modern microprocessors at 3GHz on a 32nm process.
That said, SQL Server does have issues with concurrent parallel execution throughput
on properly sized queries. So some effort here would be helpful.
But the priority here ranks below indexes on hash keys and parallel write operations.
(The initial release of this article neglected the simple index seek test; it is added below.)
Parallel Query Performance and Actual Cost - Index Seek
Below are the performance and cost characteristics for a simple index seek
aggregating 1 float type column over 350K rows.
The cost per row is 0.1734 usec at DOP 1.
Considering that the hash join test above does 2 index seeks and sums 2 float columns,
we could speculate that the cost of the bare hash join is
0.952 - 2x0.1734 = 0.605 usec per row.
Below are the characteristics for a simple index seek with only COUNT(*).
The cost is now only 0.0783 usec per row.
This implies that the float type column sum costs 0.095 usec per row.
The bare index seek cost of 0.0783 usec per row also amortizes the cost of the page access.
At 52 rows per page, the full page cost including rows is 4.07 usec.
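For reference, below is a minimal sketch of the three test query shapes described above. The table and column names (TestOuter, TestInner, GroupID, SeqID, Value1) are hypothetical stand-ins; the actual test tables and data generation scripts are not shown here.

  -- simple index seek aggregating 1 float type column over ~350K rows
  SELECT SUM(o.Value1)
  FROM dbo.TestOuter o
  WHERE o.GroupID = 1;

  -- the same index seek with only COUNT(*), removing the float column cost
  SELECT COUNT(*)
  FROM dbo.TestOuter o
  WHERE o.GroupID = 1;

  -- hash join shape: 2 index seeks, summing 2 float columns
  SELECT SUM(o.Value1), SUM(i.Value1)
  FROM dbo.TestOuter o
  JOIN dbo.TestInner i ON i.SeqID = o.SeqID
  WHERE o.GroupID = 1 AND i.GroupID = 1
  OPTION (HASH JOIN);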
Previously, I had assessed that the bare page access cost was 1us on the Core 2 platform.
The Westmere processor is more powerful and, most importantly,
has significantly lower memory round-trip time.
So I am expecting the bare page cost on Westmere to be less than 1usec.
It might seem that we are quibbling over fractions of 1 usec.
But we need to consider that 1 usec on a modern microprocessor is 2-3,000 CPU-cycles,
and this on a processor core that can execute multiple instructions per cycle.
Those of you with very good friends on the Microsoft SQL Server engine team
might persuade them to show you the source code and compiled binary/assembly.
Even without looking at the source code, it is clear that there are not nearly that many
instructions needed to touch a page, row and column.
The CPU-cycles are spent in the lock mechanism and the memory round-trip sequences.
This is why there are significant differences between a traditional database engine
with all data in memory and the "in-memory" database engines that have a
completely different data organization.
It might be better to refer to traditional database engines as page-row structures
and the new engines as column-oriented structures.
See my blog
SIMD Extensions for the Database Storage Engine
for my thoughts on an alternative solution.
Parallel Execution Setting Summary
It is definitely clear that the default settings for Cost Threshold for Parallelism (5)
and Max Degree of Parallelism (0 - unrestricted) are seriously obsolete
and should be changed as standard practice on SQL Server installations.
It is not clear there is a single universal best value for Cost Threshold for Parallelism.
I think the more important criterion is that there should not be a high volume of
concurrently running queries with parallel execution plans.
Whatever the plan cost of the high volume queries is, the CTP should be set above it.
Of course, the other important objective is to reduce the plan cost
of these high volume queries, assuming this correlates to improved actual query execution time.
The strategy for Max Degree of Parallelism is also unclear.
Before the advent of multi-core processors, a good strategy was to disable parallelism
on transaction processing systems.
It was also critical in the old days to ruthlessly enforce the restriction to pure transactions.
Today, we have 32-40 powerful cores on transaction systems,
and almost everyone runs complex queries on them.
So parallelism is good.
Ideally I would like to restrict MaxDOP to the number of physical cores
on a Data Warehouse system, and less than that on a transaction system.
I would also like to get great scaling with parallelism.
But because SQL Server does not balance the threads to one logical processor
per physical core, this strategy does not work.
So this is something the SQL Server team needs to address.
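As a concrete illustration (not a universal recommendation - the right values depend on the plan costs of your own high volume queries), both settings are server-level options changed with sp_configure; the values below are examples only:

  -- both are advanced options
  EXEC sp_configure 'show advanced options', 1;
  RECONFIGURE;

  -- example only: raise the cost threshold for parallelism from the default of 5
  EXEC sp_configure 'cost threshold for parallelism', 25;

  -- example only: restrict max degree of parallelism; 8 is an arbitrary example value
  EXEC sp_configure 'max degree of parallelism', 8;
  RECONFIGURE;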
I will make continued updates on this on my website www.qdpma.com in the topic
There were 7 companies exhibiting SSD products at PASS this year, and one with a product to support SSD storage. This is not counting Dell, EMC and HP, who have SSD products but were at PASS for other reasons. This shows that many have aspirations, with the hope that market leadership is not yet firmly set.
The products fall into the following:
PCI-E SSDs (Fusion-io, LSI, Virident)
SSD SAN (Violin, TMS, Whiptail)
SSD caching (LSI, GridIron)
I am inclined to be of the opinion that the SSD market is not yet hard set. One reason is the upcoming form factor changes. The Dell PowerEdge T620 has the option of front accessible PCI-E lanes to a 2.5in HDD form factor device. Intel is planning a transition to hybrid PCI-E, SAS, SATA connectors (SFF-8639 and SATA Express) for flexibility in supporting devices. Whoever has the best solution for the new form factors has an opportunity to shake up the existing order.
The form factor shift could be an opportunity to address the issue of expansion. In a SAS storage system, it is easy to design for both high IO bandwidth in the initial configuration while maintaining the ability to expand capacity over time. We can fill the PCI-E slots with RAID controllers or HBAs, with sufficient disks on each to match the PCI-E slot bandwidth. And it is still possible to add disks to each SAS channel while maintaining bandwidth.
For PCI-E SSDs, if we fill the PCI-E slots with lower capacity devices initially (for maximum bandwidth), then capacity can only be expanded by replacing the existing cards. The new form factor is an opportunity to allow SAS-style expansion.
Second is that there will be a role for all of SLC, HET MLC (also known as eMLC?) and even ordinary MLC in the server market, the same way there was a role for 15K, 10K and 7200RPM hard disks. Some vendors are focused on the enterprise side with SLC and HET MLC configured with a high degree of over-provisioning (30% plus), resulting in a cost structure of $10K per TB for eMLC and $20K for SLC. Others have low cost consumer oriented MLC with a low degree of over-provisioning (7-13%), netting a cost structure of $1K per TB. There is a role for all three, and any vendor without a full lineup is going to leave an opening for other vendors to enter an account.
It is hard to believe that we are essentially at SSD-HDD price parity? Of course I am comparing enterprise class 10K/15K HDDs to consumer grade SSDs. Below are prices I am seeing
600GB 15K 3.5in HDD $370
3TB 7.2K 3.5in HDD $400
300GB 15K 2.5in HDD $370
900GB 10K 2.5in HDD $600
1TB 7.2K 2.5in HDD $230 (less for consumer HDDs)
512GB SATA SSD $400-600
Intel SSD DC S3700 400GB $940
The 512GB SATA SSDs are consumer grade, MLC NAND, with only 7% over provisioning.
That is 512GB (1GB = 2^30) of NAND, with 512GB (1GB =10^9) of user capacity.
Intel just announced the SSD DC S3700, which appears to be a reasonable enterprise product, in having 32% over-provisioning. I am inclined to think that DW permanent data does not need more over-provisioning than in the consumer grade SSDs. Otherwise your db is probably not a DW.
Potentially tempdb might need more write endurance than in the consumer SSDs. So the question is whether a large array of consumer SSDs supporting mostly static data and a smaller write intensive tempdb is a good match.
Most SAN vendors are peddling grossly over-priced enterprise grade SLC SSDs. That is because they want SSD to be used for caching or tiered storage, which results in heavy write activity. Apparently SAN vendors have no concept of DW.
Now if we could only get system vendors to provide storage bays for 9.3mm SSDs instead of 15mm 10K/15K HDDs to achieve higher density. Better yet, arrange with SSD vendors to ditch the case, providing the SSD on just the PCB. The SSD storage bays should also be more appropriately balanced, say 2 x4 SAS ports to 8-bays.
I used to buy SuperMicro exclusively for my own lab.
SuperMicro always had a deep lineup of motherboards with almost every conceivable variation.
In particular, they had the maximum memory and IO configuration that is desired for database servers.
But from around 2006, I became too lazy to source the additional components necessary to complete the system, and switched
to Dell PowerEdge Tower servers.
Now, I may reconsider, as neither Dell nor HP is offering the right combination of PCI-E slots.
Nor do the chassis support the capability I am looking for.
The two Supermicro motherboards of interest are the
for 2-way Xeon E5-2600 series processors, and the
for 4-way Xeon E5-4600 series processors.
Below is a comparison of the Dell, HP and Supermicro 2-way Xeon E5-2600 systems (or motherboards).
Both the Dell and HP have PCI-E x16 slots. Unfortunately this is not particularly useful
as the only PCI-E SSD capable of using the full x16 bandwidth is the Fusion-IO ioDrive Octal at over $90K.
| ||Dell||HP||Supermicro|
|PCI-E 3.0 x16||4||3||0|
|PCI-E 3.0 x8||2||1||10|
|PCI-E 3.0 x4||1||4||0|
|PCI-E 2.0 x4||0||1||1|
Below are the Dell and HP 4-way systems for Xeon E5-4600, the HP 4-way Xeon E7 (Westmere-EX)
and the Supermicro 4-way E5-4600 motherboard.
It is apparent that neither the Dell nor the HP E5-4600 systems are meant to fully replace
the previous generation 4-way E7 (Westmere-EX) systems, as both implement only half of the full
set of PCI-E lanes.
| ||Dell R820||HP DL560 Gen8||HP DL580 Gen7||SuperMicro|
|PCI-E 3 x16 ||2||2||0||7|
|PCI-E 3.0 x8||5*||3||6 (g2)||1|
|PCI-E 3.0 x4||0||0||0||0|
|PCI-E 2.0 x4||0||1||5||0|
The Xeon E5-2600 and 4600 series processors have 40 PCI-E gen 3 lanes, plus DMI,
which is equivalent to x4 PCI-E gen 2. One processor needs to have the south-bridge using DMI,
but the others could implement a x4 g2 port. Of course the full set of 160 gen 3 lanes
are only available if all processors are populated,
but the same concept applies for the memory sockets.
These systems are probably more suitable for VM consolidation servers.
Hopefully there will be a true database server in the works?
Today, I am interested in maximum IO bandwidth with a uniform set of PCI-E slots.
Maximum memory bandwidth is required to support this,
but it is not absolutely essential to have maximum memory capacity.
The IO bandwidth plan is built around SSDs, because a 15K HDD starts around $200 providing 200MB/s on the outer tracks
while a 128GB SSD can deliver 500MB/s for around $100.
It would actually be easier to build a high bandwidth system with PCI-E SSDs.
The Intel 910 400GB model is rated at 1GB/s for just over $2000, and the 800GB model does 2GB/s
at the same IO bandwidth per dollar.
The Fusion-io ioDrive2 Duo 2.4TB can do 3GB/s but costs $25K (or is it $38K?).
The Micron P320h can also do 3GB/s but is probably expensive being based on SLC.
The other option is 8 x 128GB SATA SSDs on a PCI-E SAS RAID Controller.
The LSI SAS 9260-8i can support 2.4GB/s with 8 SSDs. In theory 6 SSDs could support this but I have not validated this.
So one 9260-8i ($500) and 8x128GB SATA SSDs ($100 each) means I can get 2.4GB/s for $1300, possibly less.
I understand that the LSI SAS 9265-8i ($665) can support 2.8GB/s (or better?), but the LSI rep did not send me one when he said he would.
LSI now has PCI-E 3.0 SAS controllers, the 9271 and 9286, but I do not have any yet.
Bandwidth versus cost options
Fusion-IO ioDrive2 365GB 910MB/s $6K
Intel 910 400GB 1000MB/s $2K
LSI+8 SATA SSD 2400MB/s? $1.3K?
To implement this strategy, the chassis should support many SATA/SAS devices organized 8 bays per 8 SAS lanes.
Both the Dell T620 and HP ML350p support 32 2.5in SAS devices, but organized as 16 per dual SAS port (x4 each?).
So for my purposes, these systems are adequate to house SSDs for 2 adapters.
It could also be pointed out that the 2.5in SAS bays are designed for the enterprise class 10K/15K HDDs
which are 15mm thick.
SATA SSDs on the other hand are 9.3mm thick, designed to fit laptop HDD dimensions.
It could be even thinner without the case.
I should point out that the Intel 910 400GB PCI-E SSD has 768GB of actual NAND, so about 50% of capacity is reserved; this should give very good write endurance for MLC. This is typical for enterprise oriented SSDs. The typical consumer SSD has about 7% reserve. For example, a device with 128GB (binary 1GB=1024^3) of NAND has 128GB decimal (1GB=10^9) user capacity. So for production use, stay with the enterprise oriented products.
First, this is a lab, not a production server, and I am paying for this myself, so $1-2K matters to me.
Let's start with a 2-way Xeon E5, the Dell T620 for example.
A reasonable IO target for this system is 5GB/s based on 300MB/s per core. I can get this with 2 PCI-E cards that can do 2.5GB/s, but the cheap card can only do 1GB/s so I need 5 cards. Plus I might like 2 RAID controllers to HDD so I can do really fast local backups. Next I might like to have 2x10GbE or even an Infiniband HBA. So by this time I am out of slots. I might like to run extreme IO tests, so perhaps targeting 8GB/s.
So the x16 slots are wasting PCI-E lanes that I need for extra x8 slots. And I cannot afford the Fusion Octal, and Fusion will not lend one to me long-term.
Next, the Intel 910 400GB is $2100, while the Fusion ioDrive2 365GB is $5-6K (sorry, how much is £1 these days?);
both are about the same in bandwidth. The Fusion is about 50% better in random read, and 10X+ in random write. Both cite 65us latency. If I had only a single card, I am sure I could see the difference between the Intel and Fusion. But if I were to fill the empty slots with PCI-E SSDs, I am inclined to think that I would have exceeded the ability of SQL Server to drive random IO.
I tested this once with OCZ RevoDrives, but OCZ cannot provide server system support and OCZ uses the Marvell PCI-E to SATA controller, so I stopped using OCZ PCI-E cards. I still use OCZ SATA SSDs, just connected to LSI SAS HBAs. Intel uses the LSI controller, which has much better interrupt handling. While Fusion may be better at the individual card level, I am not willing to spend the extra money on a lab system. And I am using SATA SSDs because they are even cheaper than the Intel 910.
Realistically, I need to replace my lab equipment every 1-2 years to be current, so I treat this as disposables, not investments.
Until recently, the SQL Server Enterprise Edition per processor (socket) licensing model
resulted in only 2 or 3 server system configurations being the preferred choice.
Determine the number of sockets: 2, 4 or 8.
Then select the processor with the most compute capability at that socket count level.
Finally, fill the DIMM sockets with the largest capacity ECC memory module at reasonable cost per GB.
Currently this is the 16GB DIMM with a price of $365 on the Dell website, and $240 from Crucial.
The 32GB from Dell is currently (2012-Oct) at $1399 each, down significantly from $2499 in early 2012?
Perhaps next year the 32GB DIMM might be under $800?
SQL Server 2012 Enterprise Edition options
Now with SQL Server 2012 per core licensing, there are a broader range of possibilities
based on the number of cores.
The table below shows Dell PowerEdge system examples for the Intel Xeon E5 processors
from 8 to 32 cores.
I would cite HP ProLiant configurations as well,
but their website has become so painful to use that I have given up.
Pricing for the Dell PowerEdge T620 system above with 16x16GB memory and 1 boot drive.
The prices for the Dell PowerEdge R820 are also with 16x16GB memory.
Each additional 16x16GB DIMMs cost $5840.
The SQL Server 2012 Enterprise Edition licensing shown above is based on a discounted price of $6K per core.
The list price is $6,736 per core.
The Fujitsu RX300 S7 TPC-E full disclosure report of 2012 Jul 5 shows a full environment: system + storage + software
discount of 20%. If evenly applied, this would mean SQL Server license at $5,400 per core.
I would like to hear what discounts people are getting with respect to volume.
My understanding prior to 2012 was that the Microsoft sales rep does not love you unless you buy 16 EE processor licenses,
which would translate to 32 core licenses in 2012. Is this still the threshold?
It is unfortunate that Intel does not offer a high frequency 4-core in the E5-4600 line as they do in the E5-2600 series.
I am of the opinion that a 4-way system with Xeon E5 3GHz+ quad-core processor and 48 DIMM sockets would be a very interesting platform.
The Intel list price for the E5-4650 8-core is $3616, the 4617 6-core for $1611, and the 4603 4-core at $551.
Considering that the SQL Server Enterprise Edition licensing component dwarfs the system and processor costs,
it would be a good idea for Intel to offer an all-purpose E5-4600 at the high-end that can be configured to 4, 6 or 8 cores in the microcode.
It would be simpler for large organizations to purchase 4-way systems with the all-purpose processor as a standard configuration.
Then each individual system could have the number of cores dialed down to the desired level.
The 4-way E5-4603 2.0GHz is probably not as useful compared to the 2-way E5-2690 2.9GHz, both at 16-cores total.
The 4-way has twice the memory bandwidth and capacity but probably also much more than necessary to support the 16 x 2GHz cores.
The 2-way has nearly 50% more compute capability with balanced memory bandwidth because the complete processor was designed to
be in balance for the high-end configuration.
There are only a small number of situations that would favor the larger memory capacity of the 4-way E5-4600.
The recent generation Intel processor cores are so powerful that 4 or 6 cores is probably good
enough for most medium size businesses.
I would prefer a 2-socket system for the extra memory bandwidth and capacity,
but the minimum SQL Server 2012 license is for 4-cores per socket,
negating the feasibility of a 2-way dual-core system.
The limit for SQL Server 2012 Standard Edition is the lesser of 16 cores or 4 sockets, and 64GB memory.
In addition, many important features are not available like compression, partitioning, and advanced security.
I recall that there was a limit to parallel query execution, and that it was less than 16?
Standard Edition does not have parallel index operations - i.e., index creation?
Perhaps all this means that 16-core is far more than can be used in a Standard Edition environment.
The 64GB memory also provides guidance on when to use standard edition.
Personally, I do not have much side-by-side comparisons of Standard versus Enterprise edition.
I would like to hear from people what the key technical considerations are in determining when SE is suitable.
It would seem that a single socket system with 4-6 cores and 64GB+ memory is most suitable for Standard Edition.
Since the 64GB memory limit applies to SQL Server, it might be a good idea to configure
the server with more than 64GB, perhaps as much as 96GB, so as to leave more than sufficient memory
for the operating system and other processes.
Below are some Dell system examples that might be suitable for SQL Server Standard Edition.
It appears that Dell is discontinuing the T320 and T420, in favor of the Rx20 systems.
While the R-models are great for higher density environments such as web servers,
the T-models are best for small business database servers.
||1x16, 1x8, 1x4
||1x16, 2x4g3, 1x4g2
||1x16, 2x4g3, 1x4g2
||1x16, 2x4g3, 1x4g2
||4x16, 2x8, 1x4
* The T320 and T420 are no longer available? Only the R320 and R420?
The T110 II with 4x8GB added is $1,251 from Dell; the price from Crucial is $440.
The T320 memory price from Dell is $160 for 8GB, and $365 for 16GB. Crucial is $85 for 8GB and $240 for 16GB.
Technically, the systems for the E5 processors are better than the E3,
with more memory bandwidth (3 channels versus 2) and larger memory capacity.
On the downside is a large drop in processor frequency.
The 2-socket quad-core is probably a better option than the single socket 8-core processors.
The SQL Server 2012 per core licensing may be a shock over 2008 R2 licensing at the 8-core per socket level,
effectively doubling SQL Server licensing costs.
However, based on direct observations of many environments,
I am of the opinion that most businesses would have more than adequate performance with a properly tuned 2-way quad-core system with 8 cores total.
This system has more than 4X the compute capability of 4-way systems from the period before multi-core processors.
So in fact, SQL Server licensing costs have gone down; we just need to be judicious in the choice of configuration.
Data distribution statistics is one of the foundations of the cost-based query optimizer in all modern database engines
including SQL Server.
From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field
accessible with the STATS_STREAM clause.
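For example, the undocumented STATS_STREAM option of DBCC SHOW_STATISTICS returns that binary; the table and statistics names below are placeholders:

  -- returns one row containing the stats stream binary, plus rows and data pages
  DBCC SHOW_STATISTICS ('dbo.MyTable', 'IX_MyTable_Col1') WITH STATS_STREAM;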
Back in SQL Server 2000, it was possible to modify system tables directly,
including the sysindexes stat_blob field.
At the time, I described a decode of the stat_blob field with the purpose of influencing the execution plan,
presumably on a development system and not a production system.
Starting with SQL Server 2005, it was no longer possible to directly modify system tables.
An API was provided to access data distribution statistics to allow cloning the statistics from one database to another.
The presumed usage is to clone statistics from a large production database to a small development database.
In other database engines, I had heard of the idea of updating statistics on a backup system
to be applied to the production system.
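Below is a hedged sketch of the cloning sequence. UPDATE STATISTICS ... WITH STATS_STREAM (along with ROWCOUNT and PAGECOUNT) is undocumented and unsupported, and the object names and binary value here are placeholders:

  -- 1. on the source (production) database, capture the stats stream binary
  DBCC SHOW_STATISTICS ('dbo.MyTable', 'IX_MyTable_Col1') WITH STATS_STREAM;

  -- 2. on the target (development) database, apply the captured binary
  --    0x0100 is a placeholder; paste the full binary returned in step 1
  UPDATE STATISTICS dbo.MyTable (IX_MyTable_Col1)
  WITH STATS_STREAM = 0x0100, ROWCOUNT = 1000000, PAGECOUNT = 10000;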
While it was still possible to decode most of the 2005 stats_stream binary,
it appears that a checksum was added so it was not possible to
apply an externally generated statistics binary unless the "checksum" value could be correctly calculated.
Around this time, I was working on other SQL Server tools,
most prominently SQL System for performance monitoring,
Exec Stats for execution plan analysis and TraceAnalysis for trace processing.
Work on the SQL Server data distribution cloning tool was discontinued,
and I could not continue further research into the decoding of SQL Server data distribution statistics.
Since several people have asked about the data distribution statistics decode,
I am making what I know about stats_stream available.
It would be helpful if other people would contribute the missing pieces.
Note that the organization of stats_stream changed from SQL Server version 2000 (then sysindexes stat_blob) to 2005 and again to 2008?
It is quite possible there are also changes in version 2012?
Most of what I discuss here applies to version 2008 R2.
Decoding Stats Stream for SQL Server 2008R2
Here I am using a 1-based reference: byte index 1 is the first byte.
C# and most other programming languages use a zero-based index.
||number of vectors
||stats stream length
||stats stream length - minus vector variable length
The difference - is 64 for 1 vector (defined as off1).
Each additional vector adds 24 bytes starting at byte position 41
|Start of vector information, 24-bytes per vector
||system type id
||user type id
||Some of the unknown fields should be for nullable, collation, etc
|Additional vectors if present
||Updated?, 9 byte datetime2?
||4 byte real
||Density - Header
||4 byte real
||Density - vector, up to 33 values
||4 byte int
||Steps (first copy)
||4 byte int
||Steps (second copy)
||4 byte int
||number of vectors
||4 byte int
||Step size (in bytes)
||4 byte real
||Average Key length - header
||4 byte real
||4 byte int
||4 byte real
||Average key length - vector
||Some fields may represent string index (bool), or filter expression
||8 byte int
||unknown, values 0x11, 0x13 and 0x19 observed,
may determine the # of post-histogram 8-byte values starting at off1+341?
||8 byte 0?
||offset for value after histogram?
||another offset if value of [off1+25] is 19 or more?
|more offsets if value of [off1+25] is 25 or more?
|Eventually, this sequence appears: 0x10001100 followed by three 4-byte real,
a value in native type of the stat, and then
ending with 0x040000
||0x10 - 16
||length of core columns
determines the organization of histogram structures?
||17 or higher
||size of step, excluding 3 byte trailer
||4 byte real
||4 byte real
||4 byte real
||Avg Range Rows
||Range Hi Key
||step terminator?, x is the size of the type
|*off1 = value of 4(8) byte int at position  - value of 
|**off2 = off1 + 341 + 16 if value of [off1+325] is 0x11, or 24 if 0x13 or 0x19
|***off3 = off1 + 341 + value of 4(8) byte int at [off1+341]
So far, for SQL Server 2008 R2, I have only looked at fixed length, not nullable statistics.
Variable length statistics have a different organization, particularly in the histogram part.
String statistics may have extended information after the histogram, per a new feature of SQL Server 2008?
Umachandar provides a SQL function for converting 4-byte binary to real or 8-byte binary float,
and vice versa.
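Umachandar's function is not reproduced here, but a minimal sketch of the 4-byte direction, decoding the IEEE 754 single precision fields directly, might look like the following (normalized values only; denormals, infinities and NaN are not handled):

  DECLARE @b binary(4) = 0x3F800000;   -- IEEE 754 single precision bit pattern for 1.0
  DECLARE @i int = CAST(@b AS int);    -- reinterpret the 4 bytes as a 32-bit integer

  DECLARE @sign float = CASE WHEN @i < 0 THEN -1.0 ELSE 1.0 END;
  DECLARE @exponent int = (@i & 2139095040) / 8388608;   -- bits 30-23 (mask 0x7F800000)
  DECLARE @mantissa float = @i & 8388607;                -- bits 22-0 (mask 0x007FFFFF)

  SELECT CASE WHEN @exponent = 0 THEN 0.0   -- zero (denormals ignored in this sketch)
              ELSE @sign * (1.0 + @mantissa / 8388608.0) * POWER(CAST(2.0 AS float), @exponent - 127)
         END AS real_value;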
Supporting SQL functions and procedures:
The updated tool now has a stored procedure that accepts table and index (or column stat) as input parameters,
in addition to the original procedure that has the stats stream binary.
QDPMA Stats Stream - Updated
decoding stats stream - Updated
An interesting fact is that it is not necessary for statistics to be highly accurate to be effective. Normally we are interested in distribution differences that shift the execution plan from one to another. The boundaries for this can be very wide. False statistics in certain circumstances might guard against catastrophically bad execution plans, for example in out-of-bounds situations. Another is skewed distributions, but this should be handled by other means, to ensure the high and low distributions get different execution plans.
Intel officially announced the Xeon E5-2600 series processor, based on the Sandy Bridge-EP variant with
up to 8 cores and 20MB LLC per socket.
Only one TPC benchmark accompanied the product launch; a summary is below.
|Processors||Cores per socket||Frequency||Memory||SQL Server||System vendor||tps-E|
|2 x Xeon E5-2690||8||2.9GHz||512GB (16x32GB)||2012||IBM||1,863.23|
|2 x Xeon E7-2870||10||2.4GHz||512GB (32x16GB)||2008R2||IBM||1,560.70|
|2 x Xeon X5690||6||3.46GHz||192GB (12x16GB)||2008R2||HP||1,284.14|
Note: the HP report lists SQL Server 2008 R2 Enterprise Edition licenses at $23,370 per socket.
The first IBM report lists SQL Server 2012 Enterprise Edition licenses at $13,473 per pair of cores(?) or $53,892 per socket. All results used SSD storage. The IBM E7 result used eMLC SSDs; the IBM E5 result showed more expensive SSDs, but did not explicitly say SLC.
The Xeon E5 supersedes 2-socket systems based on both the Xeon 5600 (Westmere-EP) and Xeon E7 (Westmere-EX).
It is evident that Sandy Bridge improves performance over Westmere at both the socket and core levels and also on a GHz basis.
|Architecture||Total Cores||Frequency||Core-GHz||TPC-E||tps-E per core-GHz|
|Sandy Bridge-EP||2 x 8 = 16||2.9GHz||46.4||1,863.23||40.16|
|Westmere-EX||2 x 10 = 20||2.4GHz||48.0||1,560.70||32.51|
|Westmere-EP||2 x 6 = 12||3.46GHz||41.52||1,284.14||30.93|
One advantage of the Xeon E7 (Westmere-EX) system is that the memory expanders support 4 DIMMs per channel or 16 DIMMs per socket (4 memory channels). However, a two-socket Sandy Bridge-EP system supports 256GB with 16 (8 per socket) of the lower priced (per GB) 16GB DIMMs. And really, 256GB is more than enough for most situations, so it is quite reasonable not to burden the large majority with outlier configuration requirements.
A later version of the Xeon E5 will support 4-socket systems.
There is no explanation as to whether glue-less 8-socket systems will be supported in the future.
It was previously discussed that there would be an EN variant of Sandy Bridge with 3 memory channels and fewer PCI-E lanes.
Hardware Strategy for SQL Server 2012 per core licensing
Top frequency on the 6 core E5-2667 is 2.9GHz, the same as the 8 core (excluding the 8 core 2687W model at 3.1GHz). Top frequency for the 4 core E5-2643 and 2 core E5-2637 are 3.3 and 3.0GHz respectively. The desktop i7-2830 is 3.6GHz with 4 cores, so Intel is deliberately constraining the top frequency on 2 & 4 core version for the server parts, apparently to favor interest in the 8 core part.
Given the SQL Server 2012 per core licensing, there should be interest in a system with fewer cores per socket running at higher frequency, while taking advantage of the high memory and IO bandwidth of the E5 system. Consider also that SQL Server write operations (Insert, Update, Delete, the final stage of index builds) and even certain SELECT operations are not parallel (for example, the Sequence Project operator that supports the ROW_NUMBER function).
I think it would also make sense for Intel to allow cores to be disabled in the BIOS (now UEFI) on the top of the line E5-2690, like the desktop extreme edition unlocked processors. Large corporate customers could buy a batch of identical systems, disabling cores that are not needed on individual systems.
It would also be of value to engage a (no longer quite so, relative to core licenses) exorbitantly priced consultant to tune SQL Server to run on fewer cores. (Not to be construed as a solicitation for services.)
Overall, SQL Server has become a very capable and mature product, with a very powerful engine
and sophisticated query optimizer.
Still, every now and then, a certain query structure throws the optimizer for a loop resulting
in an execution plan that will take forever.
The key to identifying this type of problem begins with the execution plan.
First, the plan cost does not tell the whole story.
It is necessary to know which execution plan operations can run well on modern server systems
and which do not.
Solving the problem can be a simple matter of rewriting the SQL to a different execution plan,
one that uses good execution components.
Of course, when working with 3rd party applications that do not use stored procedures,
it is necessary to convince the ISV, often first talking to someone who does not write code,
let alone someone with any understanding of the SQL Server query optimizer.
Anyways, the topic here is Full-Text Search, in particular CONTAINS and CONTAINSTABLE.
CONTAINS is "a predicate used in a WHERE clause" per Microsoft documentation,
while CONTAINSTABLE acts as a table.
Consider the two queries below; the first is an example of CONTAINS and the second an example of CONTAINSTABLE.
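The original queries are not reproduced here; a hedged reconstruction of the two shapes, with DocTitle standing in as a hypothetical full-text indexed column, would be along these lines:

  -- CONTAINS: a predicate in the WHERE clause
  SELECT COUNT(*)
  FROM dbo.Document
  WHERE CONTAINS(DocTitle, 'PowerPoint');

  -- CONTAINSTABLE: a row source that can be aggregated directly
  SELECT COUNT(*)
  FROM CONTAINSTABLE(dbo.Document, DocTitle, 'PowerPoint') ct;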
We might intuitively think that there should be no difference between the two,
which is why in SQL Server, we should never even bother with intuition
and instead always, repeat always, focus on the execution plan.
Both queries perform a Full-Text search, but the CONTAINS query must also scan an index
on the source table to get the count.
The CONTAINSTABLE function on the other hand, being a row source, can be summed directly.
In this example, the Document table is on the order of 60GB excluding lob structures stored out of the table,
the index in question is 150MB, and there are 16M rows in the table.
Both queries run in about 2+ sec elapsed. The first consumes 6 CPU-sec running in 2.2 sec,
while the second query consumes 2.6 CPU-sec in 2.6 sec, as it is not a parallel plan.
OK, so the first query runs slightly faster with parallel execution on the Stream Aggregate,
while the second is single-threaded. But the Full-Text function itself is not multi-threaded,
and probably accounts for the bulk of the 2.2 sec of the first query.
So why is the CONTAINS operation beneficial?
Before jumping to the title topic - Query Optimizer Gone Wild - let's look at another query, shown below.
Below is the query plan. Note that neither column in the search argument is indexed,
because this is an administrative query that the executive director runs once every month
as a Key Performance Indicator, which is also probably related to why I am not an executive.
So the execution plan is a table scan.
The IO portion of the full table (Clustered Index) scan is 5677
(1350 pages or 10.5MB has an IO cost of 1 in a scan operation).
For this particular example, the Fulltext Match Table Valued Function is assessed a plan cost of 1.6.
When combined with the other components, Stream Aggregate and Filter, the total plan cost of
this Full-Text search is 4.14.
On this particular system, a Xeon E7-48xx, with max degree of parallelism set to 8,
the table scan query consumes 25 CPU-sec running 3.8 sec when data is in memory.
At MAXDOP 20, the query consumes 37 CPU-sec running in 2.1sec.
This is why I emphasized earlier that plan cost is not hugely relevant.
(In case you were curious, the 60GB, 16M row table scan consumes 23 CPU-sec at DOP 1,
24.5 CPU-sec, 12.3 sec elapsed at DOP 2, the same 24.5 CPU-sec, 6.6 sec elapsed at DOP 4,
i.e., excellent scaling to DOP 8, and good continued scaling to DOP 20.
This is an amazing 2.6GB per sec per core, and 700,000 rows per sec per core.
Of course, this is a wide table with 175 columns averaging 3750 bytes per row.)
The Wild Plan
The actual query we are interested in is not the ones discussed above.
Due to the nature of the application, PowerPoint documents can be indicated
by the expression shown in any of three columns, one of which is part of a Full-Text catalog,
as expressed by the query below.
(It actually turns out that this query is not entirely correct from the technical perspective,
but it is correct by executive direction; also, part of the reason why I will never be an executive.)
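The actual query is not shown; a hypothetical rendition, with made-up column names for the two conditions that are not full-text indexed, might be:

  SELECT COUNT(*)
  FROM dbo.Document
  WHERE DocType = 'PPT'                       -- hypothetical column
     OR FileExtension IN ('ppt', 'pptx')      -- hypothetical column
     OR CONTAINS(DocTitle, 'PowerPoint');     -- hypothetical full-text indexed column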
Given that this is a relatively simple SQL expression, and that the two elements of this
query are known to run quickly,
we might intuitively expect this composite query to also run quickly.
But as I said earlier, do not even bother with intuition,
and always always focus on the execution plan
as shown below.
Can you say: "We are sooo screwed!"
What is wrong with this plan? Let us compare this plan with the table scan plan from above.
Both plans have approximately equal cost, as the 60GB table scan dominates.
The extra Table Valued function contributes very little, as shown below.
The problem with this execution plan is that there is a fat arrow (indicating very many rows, 16M in fact)
coming from the outer source (top) with the Full-Text search in the inner source (bottom).
For each row from the outer source, the inner source is evaluated.
This is why I said to not pay much attention to the plan cost, including components with high cost
relative to other components.
Instead, it is important to focus on the SQL operations, the number of rows and pages involved,
along with our knowledge of how each operation behaves non-parallel and parallel execution plans
and the difference between data in memory versus on hard drive, and now also SSD storage.
This execution plan will attempt to perform 16M Full-Text searches.
We have already established that this particular Full-Text search takes about 2 sec.
The full query might take 32M sec. There are 86,400 seconds per day.
We should expect this query to complete in 370 days, assuming there is not a need to reboot the OS after
a critical security patch.
And oh by the way, we need to run this query next month too, every month as a matter of fact.
Note, in the first Query Optimizer Gone Wild, the topic was a loop join with a table scan on
the inner source. So this is another loop join example.
The Tamed Plan
Now that we have identified a problem, and we know exactly what to look for in the execution plan,
it is time to solve the problem.
Because we have been working with SQL Server and other DBMS engines built around a cost base optimizer
for many years, we know exactly what to do.
The solution is to rewrite the SQL to get a good execution plan in which the two base operations,
for which we know the run time is reasonable, are each executed only once.
The query below meets this objective.
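The original rewrite is not reproduced here; a hedged sketch, reusing the hypothetical column names from above, is to pull the full-text result out as a row source (CONTAINSTABLE) so it is evaluated once and joined to a single pass over the table:

  SELECT COUNT(*)
  FROM dbo.Document d
  LEFT JOIN CONTAINSTABLE(dbo.Document, DocTitle, 'PowerPoint') ct
    ON ct.[KEY] = d.DocumentID                -- hypothetical full-text key column
  WHERE d.DocType = 'PPT'
     OR d.FileExtension IN ('ppt', 'pptx')
     OR ct.[KEY] IS NOT NULL;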
The execution plan is below.
This query consumes 37 CPU-sec, and 6.9 sec elapsed.
Given that the two component elements of this query combined for 27 CPU-sec and 6.4 sec elapsed,
the hash join and 2 parallelism repartition streams components increased the true cost by 10 CPU-sec,
but a minuscule 0.5 sec of elapsed time.
I suppose that I should file this on Connect, but Microsoft locked my account out,
and does not want to send me the unlock code.
So I am posting this here.
In one of the recent Windows OS versions, GUID Partition
Table (GPT) became an option in addition to Master Boot Record (MBR) for
creating disk partitions, with GPT supporting volumes larger than 2TB. In MBR,
a 32-bit unsigned integer addresses 512-byte sectors (yeah, there is a push to
adopt 4K sectors), so the disk partition limit was 2TB (2^32 x 512 bytes = 2.2x10^12 bytes).
OK, then fine. The Windows Server OS supports GPT and SQL
Server has been tested to support >2TB partitions. But to what extent has
this been tested? I am sure Microsoft has many SANs with 10-100TB storage
capacity, and someone tested 2TB plus. But anyone that works with big complex
systems and storage systems has probably got tired of clicking the GUI
repeatedly (no joke: one colleague had to go on 6-week disability after doing
too many PowerPoint slides), so we do everything from SQL scripts and probably
forgot how to use SSMS. (Me, I really liked Query Analyzer, especially how
quickly it launches.)
I am sure Microsoft has QA people who must test every single
feature of each GUI tool, SSMS, CVT etc., but how many tests are on 2TB plus
disks? and then 2TB+ files? So what can break? Even though the core OS and the SQL Server engine
core works, there are many utility tools out there that make file IO API
calls. How many work with >2TB partitions or files, and how many still use a 32-bit
unsigned integer to represent the sector offset? Or otherwise thinks a
partition/file must be less than 2 billion KB?
Now I am sure most people out there listen to every word I say as
the word of @#$. In which case your storage system is comprised of a great
many 15K 146GB disks distributed over many IO channels, which further implies
that each RAID group is probably comprised of 4-8 disks (Fast Track originally
recommended 2 disk RAID groups, which results in too many LUNs).
In which case, 8 disks at 146GB (decimal 146x10^9
= binary 136x2^30) in RAID 10 makes for a 543GB LUN. Even if it was 8
disks 300GB in RAID 5, the 1955GB LUN is still under 2TB. So you would have
never encountered any >2TB issues. But there are a few who do not seem to
follow my advice, and instead choose to trust the technical expertise of your
The arrival of the Sandy Bridge EN and EP processors, expected in early 2012, will mark
the completion of a significant shift in Intel server strategy.
For the longest time 1995-2009, the strategy had been to focus on producing a premium processor
designed for 4-way systems that might also be used in 8-way systems and higher.
The objective for 2-way systems was to use the desktop processor, which later had a separate brand
and different package & socket, to leverage the low cost structure in driving volume.
The implication was that components would be constrained by desktop cost requirements.
The Sandy Bridge collection will be comprised of one group for single processor systems
designed for low cost, and one premium processor.
The premium processor will support both the EN and EP product lines,
the EN limited to 2-way, and the EP for both 2-way and 4-way systems,
with more than adequate memory and IO in each category.
The cost structure of both 2-way and 4-way increased from Core 2 to Nehalem, along with a significant boost in CPU, memory and IO capability.
With quad-core available in 1P, the more price sensitive environments should move
down to single processor systems.
This allows 2 & 4-way systems to be built with balanced compute, memory and IO unconstrained by desktop cost requirements.
In other blogs, I had commented that the default system choice for a database server,
which for a long time had been a 4-way system, should now be a 2-way since the introduction of Nehalem in mid-2009.
Default choice means in the absence of detailed technical analysis, basically a rough guess.
The Sandy Bridge EP, with 8 cores and 4 memory channels per socket, and 80 PCI-E lanes
in a 2-way system, provides even stronger support for this strategy.
The glue-less 8-way capability of the Nehalem and Westmere EX line is not continued.
One possibility is that 8-way systems do not need to be glue-less.
The other is that 8-way systems are being abandoned,
but I am inclined to think this is not the case.
The Master Plan
The foundation of the premium processor strategy, even though it may have been forgotten in the mists of time,
not to mention personnel turnover, was that a large cache improves scaling at the 4-way multi-processor level
for the shared bus SMP system architectures of the Intel Pentium to Xeon MP period.
The 4-way server systems typically deployed with important applications that could easily
justify a far higher cost structure than that of desktop components, but required critical capabilities
not necessary in personal computers.
Often systems in this category were fully configured with top line components whether needed or not.
Hence the Intel large cache strategy was an ideal match between premium processors and high budget
systems for important applications.
One aspect that people with an overly technical point of view have difficulty fathoming
is that the non-technical VP's don't want their mission critical applications running on a cheap box.
In fact, more expensive means that it must be better, and the most expensive is the best, right?
From the Intel perspective, a large premium is necessary to amortize the substantial effort necessary to
produce even a derivative processor in volumes small relative to desktop processors.
The low cost 2-way strategy was to explore demand for multi-processor systems in the desktop market.
Servers were expected to be a natural fit for 2-way systems.
Demand for 2-way servers exploded to such an extent
that it was thought for a brief moment there would be no further interest for single processor servers.
Eventually, the situation sorted itself out, in part with the increasing power of processors.
Server unit volume settled to a 30/60/10 split between single, dual and quad processors
(this is old data, I am not sure what the split is today).
The 8-way and higher unit volume is low,
but potentially of importance in having a complete system lineup.
AMD followed a different strategy based on the characteristics of their platform.
The Hyper-Transport (HT) interconnect and integrated memory controller architecture
did not have a hard requirement for large cache to support 4-way and above.
So AMD elected to pursue a premium product strategy on the number of HT links.
Single processor systems require one HT to connect the IO hub.
Two HT is required in a 2-way system, one HT connecting to IO, and another to the second processor.
Three HT could support 4-way and higher with various connection arrangements.
The pricing structure is based on the number of HT links enabled,
on the theory that the processor has higher value in big systems than in small systems.
What Actually Happened
Even with the low cost structure Intel enabled in 2-way, desktop systems remained
and actually became defined as single processor.
Instead, the 2-way systems at the desk of users became the workstation category.
This might have been because the RISC/UNIX system vendors sold workstations.
The Intel workstations quickly obliterated RISC workstations,
and there have been no RISC workstations for some time?
Only two RISC architectures are present today, having retreated to the very high-end server space,
where Intel does not venture.
Itanium was supposed to participate in this space, but the surviving RISC vendors optimized at 8-way and higher.
Intel would not let go of the 4-way system volume and Itanium was squeezed by Xeon at 4-way and below,
yet could not match IBM Power in high SMP scaling.
To do so would incur a high price burden on 4-way systems.
One other aspect of Intel server strategy of the time was the narrow minded focus on optimizing for a single platform.
Most of the time, this was the 4-way server.
There was so much emphasis on 4-way that there were actually 2 reference platforms, almost to the exclusion of all else.
For a brief period in 1998 or so, there was an incident of group hysteria that 8-way
would become the standard high volume server.
But this phase wore off eventually.
The SPARC was perhaps the weakest of the RISC at the processor level.
Yet the Sun strategy to design for a broad range of platforms from 2-way to 30-way,
(then with luck 64-way via acquisition of one of the Cray spin-offs) was successful
until their processor fell too far behind.
After the initial implementation of the high volume 2-way strategy,
desktop systems became intensely price sensitive.
The 2-way workstations and server system were in fact not price sensitive even though it was thought they were.
It became clear that desktops could not incur any burden to support 2-way capability.
The desktop processor for 2-way systems was put into a different package and socket,
and was given the Xeon brand.
Other cost reduction techniques were implemented over the next several generations
as practical on timing and having the right level of maturity.
The main avenue is integration of components to reduce part count.
This freed 2-way systems from desktop cost constraints, but as with desktops,
it would take several generations to evolve into a properly balanced architecture.
The 4-way capable processors remained on a premium derivative,
given the Xeon MP brand in the early Pentium 4 architecture (or NetBurst) period.
To provide job security for marketing people, 2-way processors then became the Xeon 5000 series,
and 4-way the Xeon 7000 series in the late NetBurst to 2010 period.
In 2011, the new branding scheme is E3 for 1P servers, E5 for 2-way and E7 for 4-way and higher.
Presumably each branding adjustment requires changes to thousands of slidedecks.
At first, Intel thought both 2-way and 4-way systems had high demand versus cost elasticity.
If cost could be reduced, there would be substantially higher volume.
Chipsets (MCH and IOH) had overly aggressive cost objectives that limited memory and IO capability.
In fact, 4-way systems had probably already fallen below the boundary of demand elasticity.
The same may have been true for 2-way systems, as people began to realize that single processor
systems were just fine for entry server requirements.
For Pentium II and III 2-way systems, Intel only had a desktop chipset.
In 2005-6, Intel was finally able to produce a viable chipset for 2-way systems (E7500? or 5000P) that provided
memory and IO capability beyond desktop systems.
Previously, the major vendors elected for chipsets from ServerWorks.
It was also thought at the time that there was not a requirement for premium processors in 2-way server systems.
The more correct interpretation was that the large (and initially faster) cache of premium processors
did not contribute sufficient value for 2-way systems.
A large cache does improve performance in 2-way systems, but not to the degree that it does at the 4-way level.
So the better strategy by far on performance above the baseline 2-way system with standard desktop
processors was to step up to a 4-way system with the low-end premium processors instead of a 2-way system with
the bigger cache premium processors.
And as events turned out, the 4-way premium processors lagged desktop processors in transitions to
new microarchitectures and manufacturing processes by 1 full year or more.
The 2-way server on the newer technology of the latest desktop processors
was better than a large cache processor of the previous generation,
especially one that carried a large price premium.
So the repackaged desktop processor was the better option for 2-way systems.
The advent of multi-core enabled premium processors to be a viable concept for 2-way systems.
A dual-core processor has much more compute capability than a single core, and the same goes for a quad-core over a dual-core,
in any system, not just 4-way,
provided that there is not too much difference in frequency.
The power versus frequency characteristics of microprocessors clearly favors multiple cores for code
that scale with threads, as in any properly architected server application.
However, multi-core at the dual and quad-core level was employed for desktop processors.
So the processors for 2-way servers did not have a significant premium in capability relative to desktops.
The Intel server strategy remained big cache processors. There was the exception of Tigerton,
when two standard desktop dual-core processor dies in the Xeon MP socket were employed for the 4-way system,
until the next generation Dunnington processor incorporated a large cache.
This also happened with Paxville and Tulsa.
System Architecture Evolution from Core 2 to Sandy Bridge
The figure below shows 4-way and 2-way server architecture evolution relative to single processor desktops (and servers too)
from 45nm Core 2 to Nehalem & Westmere and then to Sandy Bridge. Nehalem systems are not shown for space considerations,
but are discussed below.
System architecture from Penryn to Westmere to Sandy Bridge, (Nehalem not shown)
The Core 2 architecture was the last Intel processor to use the shared bus, which allows
multiple devices, processors and bridge chips, to share a bus with a protocol to arbitrate
for control of the bus. It was called the front-side bus (FSB) because there was once a back-side bus for cache.
When cache was brought on-die more than 10 years ago, the BSB was no more.
By the Core 2 period, to support higher bus frequency, the number of devices was reduced to 2,
but the shared bus protocol was not changed.
The FSB was only pushed to 1066MHz for Xeon MP, 1333MHz for 2-way servers, and 1600MHz for 2-way workstations.
Nehalem was the first Intel processor with a true point-to-point protocol, Quick Path Interconnect (QPI),
at 6.4GHz transfer rate, achieving much higher bandwidth to pin efficiency than possible over shared bus.
Intel had previously employed a point-to-point protocol for connecting nodes of an Itanium system back in 2002.
(AMD implemented point-to-point with HT for Opteron in 2003? at an initial signaling rate of 1.6GHz?)
Shared bus also has bus arbitration overhead in addition to lower frequency of operation.
The other limitation of Intel processors up to Core 2, was the concentration of signals on the
memory controller hub (also known as North Bridge) for processors, memory and PCI-E.
The 7300 MCH for the 4-way Core 2 has 2013-pins, which is at the practical limit,
and yet the memory and IO bandwidth is somewhat inadequate.
Nehalem and Westmere implement a massive increase in memory and PCI-E bandwidth (number of channels or ports) for
the 2-way and 4-way systems compared to their Core 2 counterparts.
Both Nehalem 2-way and 4-way systems have significantly higher cost structure than Core 2.
Previously, Intel had been mindlessly obsessed with reducing system cost to the detriment of balanced memory and IO.
This shows Intel recognized that their multi-processor systems were already below the price-demand elasticity point,
and it was time to rebalance memory and IO bandwidth, now possible with point to point interconnect
and the integrated memory controller.
QPI in Nehalem required an extra chip to bridge the processor to PCI-E.
This was not an issue for multi-processor systems,
but was undesirable for the hyper sensitive cost structure of desktop systems.
The lead quad-core 45nm Nehalem processor with 3 memory channels and 2 QPI ports in a LGA 1366 socket
was followed by a quad-core, 2-memory channel derivative (Lynnfield) with 16 PCI-E plus DMI replacing QPI in a LGA 1156 socket.
The previously planned dual-core Nehalem on 45nm was cancelled.
Nehalem with QPI was employed in the desktop extreme line,
while the quad-core without QPI was employed in the high-end of the regular desktop line.
The lead 32nm Westmere was a dual-core with the same LGA 1156 socket (memory and IO) as Lynnfield.
Per the desktop and mobile objective, cost structure was reduced with integration,
with 1 processor die and potentially a graphics die in the same package,
and just 1 other component the PCH.
The follow-on Westmere derivative was a six-core using the same LGA 1366 socket as Nehalem,
i.e., 3 memory channels and 2 QPI.
This began the separation process of desktop and other single processor systems from
multi-processor server and workstation systems.
Extreme desktops employ the higher tier components designed for 2-way, but are still single-socket systems.
I suppose that a 2-way extreme system is a workstation.
Gamers will have to settle for the mundane look of a typical workstation chassis.
With the full set of Sandy Bridge derivatives, the server strategy transition will be complete.
Multi-processor products, even for 2-way, are completely separated from desktops without the requirement
to meet desktop cost structure constraints.
With desktops interested only in dual and quad-core,
a premium product strategy can be built for 2-way and above around both the number of cores and QPI links.
The Sandy Bridge premium processor has 8 cores, 4 memory channels, 2 QPI, 40 PCI-E lanes and DMI
(that can function as x4 PCI-E).
The high-end EP line in a LGA 2011 socket will have full memory, QPI and PCI-E capability.
The EN line in the LGA 1356 socket will have 3 memory channels, 1 QPI and 24 PCI-E lanes plus DMI
to support up to 2-way systems, and will be suitable for lower priced systems.
Extreme desktops will use the LGA 2011 socket, but without QPI.
What is interesting is that the 4-way capable Sandy Bridge EP line is targeted at both 2-way and 4-way systems.
This is a departure from the old Intel strategy of premium processors for 4-way and up.
Since the basis of the old strategy is no longer valid, of course a new strategy should be formulated.
But too often, people only remember the rules of the strategy, not the basis.
And hence blindly follow the old strategy even when it is no longer valid (does this sound familiar?)
This element of a premium 2-way system actually started with the Xeon 6500 line based on Nehalem-EX.
Nehalem-EX was designed for 4-way and higher with eight-cores,
4 memory channels supporting 16 DIMMs per processor and 4 QPI links.
A 2-way Nehalem-EX with 8 cores, 16 DIMMs per socket might be viable versus Nehalem at 4 cores, 9 DIMMs per socket,
even though the EX top frequency 2.26GHz versus 2.93GHz and higher in Nehalem.
The more consequential hindrance was that Nehalem-EX did not enter production until Westmere-EP was also in production,
with 6 cores per socket at 3.33GHz.
So the Sandy-Bridge EP line will provide a better indicator for premium 2-way systems.
The Future of 8-way and the EX line
There is no EX line with Sandy Bridge.
Given the relatively low volume of 8-way systems, it is better not to burden the processor used by 4-way systems
with glue-less 8-way capability.
Glue-less means that the processors can be directly connected without the need for additional bridge chips.
This both lowers cost and standardizes multi-processor system architecture,
which is probably one of the cornerstones for the success Intel achieved in MP systems.
I am expecting that 8-way systems are not being abandoned,
but rather a system architecture with "glue" will be employed.
Since 8-way systems are a specialized very high-end category,
this would suggest a glued system architecture is more practical in terms of effort than a subsequent 22nm Ivy Bridge EX.
Below are two of my suggestions for 8-way Sandy Bridge or perhaps Ivy Bridges depending on when components could be available.
The first has two 4-port QPI switches (or cross-bars, or routers) connecting four nodes with 2 processors per node.
The second system below has two 8-port QPI switches connecting single processor nodes.
The 2 processor node architecture would be economical, but I am inclined to recommend building the 8-port QPI switch.
Should the 2 processor node prove to be workable,
then a 16-way system would be possible.
Both are purely speculative as Intel does not solicit my advice on server system architecture and strategy,
not even back in 1997-99.
In looking at the HP DL980 diagram, I am thinking that the HP node controllers
would support Sandy Bridge EP in an 8-way system.
There are cache coherency implications (Directory based versus Snoop) that are beyond the scope for database server oriented topic.
There was an IBM or Sun discussion of transactional memory.
I would really like to see some innovation on handling locks.
This is critical to database performance and scaling.
For example, the database engine ensures exclusive access to a row, i.e., a region of memory, before modifying it.
Then why does the system architecture need to do a complex cache coherency check when the application has already done so?
I had also previously discussed SIMD instructions to improve handling of page and row based storage,
in SIMD Extensions for the Database Storage Engine.
If that were not enough, I had also called for splitting the memory system.
Over the period of Intel multi-processor systems 1995 to 2011,
practical system memory has increased from 2GB to 2TB.
Most of the new memory capacity is used for data buffers.
The exceptionally large capacity of the memory system also means that it cannot be
brought very close to the processor, as in integrated into the same package/socket.
So the memory architecture should be split into a small segment
that needs very low latency and byte addressability,
while the huge data buffer portion could be changed to block access.
If so, then perhaps the database page organization should also be changed to make metadata
access more efficient on modern processor architecture, reducing
the impact of off-die memory access by making
full use of cache line organization.
The NAND people are also arguing for Storage Class Memory, something along the lines
of NAND used as memory.
More on QDMPA System Architecture and Sandy Bridge.
Many of us have probably seen the new SQL Server 2012 per
core licensing, with Enterprise Edition at $6,874 per core superseding the $27,495
per socket of SQL Server 2008 R2 (discounted to $19,188 for 4-way and $23,370
for 2-way in TPC benchmark reports), which carried Software Assurance at $6,874 per
processor. Datacenter was $57,498 per processor, so the new per-core licensing
puts 2012 EE on par with 2008 R2 DC at 8 cores per socket.
This is a significant increase for EE licensing on Intel Xeon 5600 6-core systems
(6 x $6,874 = $41,244 per socket) and a huge increase for Xeon E7 10-core systems, now $68,740 per socket.
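As a quick check of the per-socket arithmetic, here is a minimal sketch in Python, using only the list prices quoted above; it is purely illustrative arithmetic, not licensing guidance.

# Per-socket cost of SQL Server EE under 2012 per-core licensing vs 2008 R2 per-socket.
# List prices are those cited above; everything else is straight multiplication.
PER_SOCKET_2008R2_EE = 27495   # SQL Server 2008 R2 EE, per processor (socket)
PER_CORE_2012_EE = 6874        # SQL Server 2012 EE, per core
DATACENTER_2008R2 = 57498      # SQL Server 2008 R2 Datacenter, per processor

for cores in (4, 6, 8, 10):
    cost_2012 = cores * PER_CORE_2012_EE
    print(f"{cores:2d} cores/socket: 2012 EE ${cost_2012:,} "
          f"vs 2008 R2 EE ${PER_SOCKET_2008R2_EE:,} per socket")
print(f"2008 R2 Datacenter, for reference: ${DATACENTER_2008R2:,} per processor")
# 6 cores -> $41,244 and 10 cores -> $68,740 per socket, as noted above;
# 8 cores -> $54,992, roughly on par with 2008 R2 Datacenter at $57,498.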
I do not intend to discuss justification of the new model. I will say that SQL
Server licensing had gotten out of balance with the growing performance capability
of server systems over time. So perhaps the more correct perspective is that
SQL Server had become underpriced in recent years.
(Consider that there was a 30%+ increase in the hardware cost structure in the transition
from Core 2 architecture systems to Nehalem systems, for both 2-way and 4-way, to accommodate the vastly increased memory and IO channels.)
Previously, I had discussed that the default choice for SQL Server used to be a 4-way system.
In the really old days, server sizing and capacity planning was an important job category.
From 1995/6 on, the better strategy for most people was to buy the 4-way Intel standard
high-volume platform rather than risk the temperamental
nature of big-iron NUMA systems (and, even worse, the cost of the consultant needed to get SQL Server to run correctly by steering execution plans around operations that were broken on NUMA).
With the compute, memory and IO capabilities of the Intel Xeon 5500 (Nehalem-EP),
the 2-way became the better default system choice from mid-2009 on.
By “default choice”,
I mean in the absence of detailed technical sizing analysis.
I am not suggesting that ignorance is good policy (in addition to bliss),
but rather that the cost of knowledge was typically more than the value of said knowledge.
Recall that in the past, there were companies that made load testing tools. I think they are mostly gone now. An unrestricted license for the load test product might be $100K. The effort to build scripts might equal or exceed that. All to find out whether a $25K or $50K server is the correct choice?
So now there will also be a huge software licensing incentive to step down from a 4-way 10-core system with 40 cores total to a 2-way system with perhaps 8-12 cores total
(going forward, this cost structure essentially kills the new AMD Bulldozer 16-core processor, which had just recently achieved price performance competitiveness with the Intel 6-core Westmere-EP in 2-way systems).
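To put rough numbers on that incentive, here is a small sketch using the same per-core list price as above; the core counts are simply the examples mentioned, not a sizing recommendation.

# EE licensing for a 4-way 10-core system vs a 2-way system at the low end
# of the 8-12 core range discussed above.
PER_CORE_2012_EE = 6874

cores_4way = 4 * 10   # 4 sockets x 10-core Xeon E7
cores_2way = 2 * 6    # 2 sockets x 6 cores

lic_4way = cores_4way * PER_CORE_2012_EE
lic_2way = cores_2way * PER_CORE_2012_EE
print(f"4-way, 40 cores: ${lic_4way:,}")                  # $274,960
print(f"2-way, 12 cores: ${lic_2way:,}")                  # $82,488
print(f"licensing difference: ${lic_4way - lic_2way:,}")  # $192,472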
In the world of database performance consulting, for several
years I had been advocating a careful balance between performance tuning effort
(billed at consultant rates) and hardware. The price difference between a
fully configured 2-way and 4-way system might be $25,000. For a two-node
cluster, this is a $50K difference in hardware, with perhaps another $50K in SQL
Server licensing cost. The consideration is that blindly stepping up to bigger
hardware does not necessarily improve the critical aspects of performance
proportionately, sometimes not at all, and may even have a negative impact.
With performance tuning, it is frequently possible to
achieve significant performance gains in the first few weeks. But after that,
additional gains become either progressively smaller, limited in scope, or involve major
re-architecture. In the long-ago past, when hardware was so very expensive, not to
mention the hard upper limits on performance, it was not uncommon for a consultant to get a
long-term contract to do performance work exclusively.
More recently, performance consulting work has tended to be shorter-term:
just clean up the low-hanging fruit, and crush moderate
inefficiencies with cheap, powerful hardware. While this is perfectly viable
work, it also precludes the justification for the deep skills necessary to resolve complex
problems, which also calls into question the need to endure an intolerably arrogant, exorbitantly expensive performance consultant.
It had gotten to the point that I had given thought to retiring
and going fishing in some remote corner of the world.
But now, with the new SQL Server per-core licensing,
Microsoft has restored the indispensable (though still intolerable) status of the arrogant, exorbitantly expensive performance consultant.
So, thank you Microsoft.
Edit 16 Dec 2011
VR-Zone mentions a Windows 7/Server 2008 R2 update
that treats the 8-core AMD Bulldozer die as 4 cores with HT, as opposed to AMD's positioning of it as 8 cores. AMD should hope that this is also Microsoft's position for SQL Server 2012; otherwise no one should consider AMD in light of the per-core licensing, given that Intel physical cores are much more powerful than the Bulldozer "core".
Edit 20 Feb 2012
I might add that the new per core licensing would be well worth the extra money if SQL Server would give us:
1) Parallel Execution plans for Insert, Update and Delete
2) Improved Loop Join parallel scaling - I believe today there is contention between threads in latching the inner source index root
3) Fix parallel merge join - If the parallel merge join code is broken, why can we not use the parallel hash join code with the existing index?
The basis for this is that if we are going to pay by the core, then SQL Server should not let cores sit idle in time-consuming operations.
There has been relatively little activity in TPC benchmarks recently, with the exception of the raft of Dell TPC-H results with Exa Solutions.
It could be that systems today are so powerful that few people feel the need for benchmarks.
IBM published an 8-way Xeon E7 (Westmere-EX) TPC-E result of 4593 in August, slightly higher
than the Fujitsu result of 4555, published in May 2011.
Both systems have 2TB memory. IBM prices 16GB DIMMs at $899 each, $115K for 2TB or $57.5K per TB. (I think a 16MB DIMM was $600+ back in 1995!)
The Fujitsu system has 384 SSDs of the 60GB SLC variety at $1,014 each,
and IBM employed 143 SSDs of the 200GB eMLC variety at $1,800 each, for roughly 23TB and 28.6TB raw capacity respectively.
Except for unusually write intensive situations, eMLC or even regular MLC is probably
good enough for most environments.
HP published a TPC-H 1TB result of 219,887 QphH
for their 8-way ProLiant DL980 G7 with the Xeon E7-4870,
26% higher in the overall composite score than the IBM x3850 X5 with the Xeon E7-8870 (essentially the same processor).
The HP scores 16% higher in power and 37.7% higher in throughput.
Both throughput tests were with 7 streams.
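Since the QphH composite is the geometric mean of the power and throughput metrics, the 26% composite advantage can be reconciled with the two component figures above; a minimal check in Python:

# QphH = sqrt(Power x Throughput), so the component deltas combine multiplicatively.
import math

power_ratio = 1.16        # HP 16% higher in power
throughput_ratio = 1.377  # HP 37.7% higher in throughput
composite_ratio = math.sqrt(power_ratio * throughput_ratio)
print(f"expected composite advantage: {100 * (composite_ratio - 1):.1f}%")  # ~26.4%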
The HP system had Hyper-Threading enabled (80 physical cores, 160 logical)
while the IBM system did not.
Both systems had 2TB memory, more than sufficient to hold the entire database, data and indexes in memory.
The IBM system had 7 PCI-E SSDs, while
the HP system had 416 HDDs over 26 D2700 disk enclosures, with 10 LSI SAS RAID controllers,
3 P411s and 1 dual-port 8Gbps FC controller.
Also of interest are TPC-H 1TB reports published for the 16-way SPARC M8000 (June 2011)
with SPARC64 VII+ processors and the 4-way SPARC T4-4 (Sep 2011).
The table below shows configuration information for recent TPC-H 1000GB results.
|TPC-H 1000GB|IBM x3850 X5|HP ProLiant DL980 G7|IBM Power 780|SPARC M8000|SPARC T4-4|
|DBMS|SQL 2K8R2 EE|SQL 2K8R2 EE|Sybase IQ ASE 15.2|Oracle 11g R2|Oracle 11g R2|
|Processors|8 Xeon E7|8 Xeon E7|8 POWER7|16 SPARC64 VII+|4 SPARC T4|
|Cores-Threads|80-80|80-160|32-128|64-128|32-256|
|IO Controllers|7|13|12|4 Arrays|4 Arrays|
|HDD/SSD|7 SSD|416 HDD|52 SSD|4x80 SSD|4x80 SSD|
The figure below shows TPC-H 1000GB power, throughput and QphH composite scores for 4 x Xeon 7560 (32 cores, 64 threads),
the two 8 x Xeon E7 (80 cores, with 80 and 160 threads) systems, 8 x POWER7 (32 cores, 128 threads),
16 SPARC64 VII+ (64 cores, 128 threads) and 4 SPARC T4 (32 cores, 256 threads).
TPC-H SF 1000 Results
The HP 8-way Xeon and both Oracle/Sun systems, one with 16 sockets
and the newest with 4 SPARC T4 processors, are comparable, within 10%.
An important point is that both the Oracle/Sun and the IBM Power systems are configured with 512GB memory
versus 2TB for the 8-way Xeon E7 systems, which is enough to keep all data and indexes in memory.
There is still disk IO for the initial data load and tempdb intermediate results.
This is a good indication that Oracle and Sybase have been reasonably optimized on IO, in particular
on when to use an index and when not to.
I had previously raised the issue that the SQL Server query optimizer should consider
the different characteristics of in-memory, DW-optimized HDD storage (100MB/s per disk sequential) and SSD storage.
Sun clearly made tremendous improvements from the SPARC64 VII+ to the T4,
with the new 4-way system essentially matching the previous 16-way.
Of course, Sun had been lagging at the individual processor socket level until now.
The most interesting aspect is that the SPARC T4 has 8 threads per core.
The expectation is that server applications have a great deal of pointer-chasing code,
that is, code in which a fetch from memory determines the next address to fetch, with inherently poor locality.
A modern microprocessor with a core frequency of 3GHz corresponds to a 0.33 nanosecond clock cycle.
Local node memory access time might be 50ns, or 150 CPU clocks.
Remote node memory access time might be 100ns for a neighboring node to over 250ns for multi-hop nodes,
after cache coherency is taken into account.
So depending on how many instructions are required for each non-cached memory access,
we can expect each thread or logical core to have many dead cycles, possibly enough to justify 8 threads per core.
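A back-of-envelope sketch of that trade-off follows; the latencies are the illustrative figures above, and the 20 cycles of useful work between cache-missing loads is an assumed value, not a measurement.

# Threads per core needed to hide memory latency in pointer-chasing code.
CLOCK_GHZ = 3.0
cycle_ns = 1.0 / CLOCK_GHZ   # ~0.33 ns per clock

WORK_CYCLES_PER_MISS = 20    # assumed useful work between non-cached accesses

for name, latency_ns in (("local node", 50), ("neighbor node", 100), ("multi-hop", 250)):
    stall_cycles = latency_ns / cycle_ns
    threads_needed = stall_cycles / WORK_CYCLES_PER_MISS
    print(f"{name:13s}: ~{stall_cycles:.0f} stalled cycles, "
          f"~{threads_needed:.0f} threads/core to keep the core busy")
# local node: ~150 stalled cycles -> ~8 threads/core, consistent with the T4 design.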
What is surprising is that Oracle published a TPC-H benchmark with their new T4-4
and not a TPC-C or TPC-E, which would be more likely to emphasize pointer-chasing code than a DW workload.
Below are the 22 individual query times for the above systems in the power test (1 stream).
TPC-H SF 1000 Queries 1-22
Below are the 22 individual query power times for just the two 8-way Xeon E7 systems.
Overall, the HP system (with HT enabled) has a 16% higher TPC-H power score, but the IBM system without HT
is faster or comparable in 9 of the 22 queries.
Not considering the differences in system architecture, the net might be attributed to HT.
TPC-H SF 1000 IBM and HP 8-way Xeon E7
Below are the 22 individual query power times for the HP 8 Xeon E7 and Oracle SPARC T4-4 systems.
TPC-H SF 1000 8-way HP Xeon E7 and 4-way SPARC T4