Overall, SQL Server has become a very capable and mature product, with a very powerful engine
and sophisticated query optimizer.
Still, every now and then, a certain query structure throws the optimizer for a loop resulting
in an execution plan that will take forever.
The key to identifying this type of problem begins with the exeuction plan.
First, the plan cost does not tell the whole story.
It is necessary to know which execution plan operations can run well on modern server systems
and which do not.
Solving the problem can be a simple matter of rewriting the SQL to a different execution plan,
one that uses good execution components.
Of course, when working with 3rd party applications that do not use stored procedures,
it is necessary to convince the ISV, often first talking to someone who does not write code,
not mention someone with any understanding of the SQL Server query optimizer.
Anyways, the topic here is Full-Text Search, in particular CONTAINS and CONSTAINSTABLE.
CONTAINS is "a predicte used in a WHERE clause" per Microsoft documentation,
while CONTAINSTABLE acts as a table.
Consider the two queries below, the first is an example of CONTAINS and the second an example of CONTAINSTABLE.
We might intuitively think that there should be no difference between the two,
which is why in SQL Server, we should never even bother with intuition
and instead always, repeat always, focus on the execution plan.
Both queries perform a Full-Text search, both the CONTAINS function must also scan a index
on the source table to get the count.
The CONTAINSTABLE function on the other hand, being a row source, can be summed directly.
In this example, the Document table is on the order of 60GB excluding lob structures stored out of the table,
the index in question is 150MB, and there are 16M rows in the table.
Both queries run in about 2+ sec elapsed. The first consumes 6 CPU-sec running in 2.2 sec,
while the second query consumes 2.6 CPU-sec in 2.6 sec as it the not a parallel plan.
OK, so the first query runs slightly faster with parallel execution on the Stream Aggregate,
while the second is single-threaded. But the Full-Text function itself is not multi-threaded,
and probably the bulk of the 2.2 sec of the first query.
So why is the CONTAINS operation beneficial?
Before jumping to the title topic - Query Optimizer Gone Wild, lets look at another query, shown below.
Below is the query plan. Note that neither column in the search argument is indexed,
because this is an administrative query that the executive director runs once every month
as a Key Performance Indicator, which is also probably related to why I am not an executive.
So the execution plan is a table scan.
The IO portion of the full table (Clustered Index) scan is 5677
(1350 pages or 10.5MB has an IO cost of 1 in a scan operation).
For this particular example, the Fulltext Match Table Valued Function is assessed a plan cost of 1.6.
When combined with the other components, Stream Aggregate and Filter, the total plan cost of
this Full-Text search is 4.14.
On this particular system, a Xeon E7-48xx, with max degree of parallelism set to 8,
the table scan query consumes 25 CPU-sec running 3.8 sec when data is in memory.
At MAXDOP 20, the query consumes 37 CPU-sec running in 2.1sec.
This is why I emphasized earlier that plan cost is not hugely relavent.
(In case you were curious, the 60GB, 16M row table scan consumes 23 CPU-sec at DOP 1,
24.5 CPU-sec, 12.3 sec elapsed at DOP 2, the same 24.5 CPU-sec, 6.6 sec elapsed at DOP 4,
i.e., excellent scaling to DOP 8, and good continued scaling to DOP 20.
This is an amazing 2.6GB/s per sec per core, and 700,000 rows per sec per core.
Of course, this is a wide table with 175 columns averaging 3750 bytes per row.)
The Wild Plan
The actual query we are interested in is not the ones discussed above.
Due to the nature of the application, PowerPoint documents can be indicated
by the expression shown in any of three columns, one of which is part of a Full-Text catalog,
as expressed by the query below.
(It actually turns out that this query is not entirely correct from the technical perspective,
but it is correct by executive direction; also, part of the reason why I will never be an executive.)
Given that this is a relatively simple SQL expression, and that the two elements of this
query are known to run quickly,
we might intuitively expect this composite query to also run quickly.
But as I said earlier, do not even bother with intuition,
and always always focus on the execution plan
as shown below.
Can you say: "We are sooo screwed!"
What is wrong with this plan? Let us compare this plan with the table scan plan from above.
Both plans have approximately equal cost, as the 60GB table scan dominates.
The extra Table Valued function contributes very little, as shown below.
The problem with this execution plan is that there is a fat arrow (indicating very many rows, 16M in fact)
coming from the outer source (top) with the Full-Text search in the inner source (bottom).
For each row from the outer source, the inner source is evaluated.
This is why I said to not pay much attention to the plan cost, including components with high cost
relative to other components.
Instead, it is important to focus on the SQL operations, the number of rows and pages involved,
along with our knowledge of how each operation behaves non-parallel and parallel execution plans
and the difference between data in memory versus on hard drive, and now also SSD storage.
This execution plan will attempt to performance 16M Full-Text searches.
We have already established that this particular Full-Text search takes about 2 sec.
The full query might take 32M sec. There are 86,400 seconds per day.
We should expect this query to complete in 370 days, assuming there is not a need to reboot the OS after
a critical security patch.
And oh by the way, we need to run this query next month too, every month as a matter of fact.
Note, in the first Query Optimizer Gone Wild, the topic was a loop join with a table scan on
the inner source. So this is another loop join example.
The Tamed Plan
Now that we have identified a problem, and we know exactly what to look for in the execution plan,
it is time to solve the problem.
Because we have been working with SQL Server and other DBMS engines built around a cost base optimizer
for many years, we know exactly what to do.
The solution is to rewrite the SQL to get a good execution plan where the two base operations forle which we know
that have run time is reasonable are only executed only once each.
The query below meets this objective.
The execution plan is below.
This query consumes 37 CPU-sec, and 6.9 sec elapsed.
Given that the two component elements of this query combined for 27 CPU-sec and 6.4 sec elapsed,
the hash join and 2 parallelism repartition streams component increased true cost by 10 CPU-sec,
but almost a minuscue 0.5 sec elapsed time.
I suppose that I should file this on Connect, but Microsoft locked my account out,
and does not want to send me the unlock code.
So I am posting this here.
In one of the recent Windows OS versions, GUID Partition
Table (GPT) became an option in addition to Master Boot Record (MBR) for
creating disk partitions, with GPT supporting volumes larger than 2TB. In MBR,
a 32-bit unsigned integer addresses 512-byte sectors (yeah, there is a push to
adapt 4K sectors), so the disk partition limit was 2TB (2.19x1012).
OK, then fine. The Windows Server OS supports GPT and SQL
Server has been tested to support >2TB partitions. But to what extent has
this been tested? I am sure Microsoft has many SANs with 10-100TB storage
capacity, and someone tested 2TB plus. But anyone that works with big complex
systems and storage systems has probably got tired of clicking the GUI
repeatedly (no joke: one colleague had to go on 6-week disability after doing
too many PowerPoint slides), so we do everything from SQL scripts and probably
forgot how to use SSMS. (Me, I really liked Query Analyzer, especially how
quickly it launches.)
I am sure Microsoft has QA people who must test every single
feature of each GUI tool, SSMS, CVT etc., but how many tests are on 2TB plus
disks? and then 2TB+ files? So what can break? Even though the core OS and the SQL Server engine
core works, there are many utility tools out there that makes file IO API
calls. How many work with >2TB partitions or files, and how many still use a 32-bit
unsigned integer to represent the sector offset? Or otherwise thinks a
partition/file must be less than 2 billion KB?
Now I am sure most people out listen to every word I say as
the word of @#$. In which case your storage system is comprised of a great very
many 15K 146GB disks distributed over many IO channels, which further implies
that each RAID group is probably comprised of 4-8 disks (Fast Track originally
recommended 2 disk RAID groups, which results in too many LUNs).
In which case, 8 disks at 146GB (decimal 146x1012
= binary 136x230) in RAID 10 makes for a 543GB LUN. Even if it was 8
disks 300GB in RAID 5, the 1955GB LUN is still under 2TB. So you would have
never encountered any >2TB issues. But there are a few who do not seem to
follow my advice, and instead choose to trust the technical expertise of your
The arrival of the Sandy Bridge EN and EP processors, expected in early 2012, will mark
the completion of a significant shift in Intel server strategy.
For the longest time 1995-2009, the strategy had been to focus on producing a premium processor
designed for 4-way systems that might also be used in 8-way systems and higher.
The objective for 2-way systems was use the desktop processor that later had a separate brand
and different package & socket to leverage the low cost structure in driving volume.
The implication was that components would be constrained by desktop cost requirements.
The Sandy Bridge collection will be comprised of one group for single processor systems
designed for low cost, and one premium processor.
The premium processor will support both the EN and EP product lines,
the EN limited to 2-way, and the EP for both 2-way and 4-way systems,
with more than adequate memory and IO in each category.
The cost structure of both 2-way and 4-way increased from Core 2 to Nehalem, along with a significant boost in CPU, memory and IO capability.
With quad-core available in 1P, the more price sensitive environments should move
down to single processor systems.
This allows 2 & 4-way systems to be built with balanced compute, memory and IO unconstrained by desktop cost requirements.
In other blogs, I had commented that the default system choice for database server
for a long time had been a 4-way system should now be a 2-way since the introduction of Nehalem in mid-2009.
Default choice means in the absence of detailed technical analysis, basically a rough guess.
The Sandy Bridge EP, with 8 cores, 4 memory channels and 80 PCI-E lanes per socket
in a 2-way system provides even stronger support for this strategy.
The glue-less 8-way capability of the Nehalem and Westmere EX line is not continued.
One possibility is that 8-way systems do not need to be glue-less.
The other is that 8-way systems are being abandoned,
but I am inclined to think this is not the case.
The Master Plan
The foundation of the premium processor strategy, even though it may have been forgotten in the mists of time,
not to mention personnel turnover, was that a large cache improves scaling at the 4-way multi-processor level
for the shared bus SMP system architectures of the Intel Pentium to Xeon MP period.
The 4-way server systems typically deployed with important applications that could easily
justify a far higher cost structure than that of desktop components, but required critical capabilities
not necessary in personal computers.
Often systems in this category were fully configured with top line components whether needed or not.
Hence the Intel large cache strategy was an ideal match between premium processors and high budget
systems for important applications.
One aspect that people with an overly technical point of view have difficulty fathoming
is that the non-technical VP's don't want their mission critical applications running on a cheap box.
In fact, more expensive means that is must be better, and the most expensive is the best, right?
From the Intel perspective, a large premium is necessary to amortize the substantial effort necessary to
produce even a derivative processor in volumes small relative to desktop processors.
The low cost 2-way strategy was to explore demand for multi-processor systems in the desktop market.
Servers were expected to be a natural fit for 2-way systems.
Demand for 2-way servers exploded to such an extent
that it was thought for a brief moment there would be no further interest for single processor servers.
Eventually, the situation sorted itself out, in part with the increasing power of processors.
Server unit volume settled to a 30/60/10 split between single, dual and quad processors
(this is old data, I am not sure what the split is today).
The 8-way and higher unit volume is low,
but potentially of importance in having a complete system lineup.
AMD followed a different strategy based on the characteristics of thier platform.
The Hyper-Transport (HT) interconnect and integrated memory controller architecture
did not have a hard requirement for large cache to support 4-way and above.
So AMD elected to pursue a premium product strategy on the number of HT links.
Single processor systems require one HT to connect the IO hub.
Two HT is required in a 2-way system, one HT connecting to IO, and another to the second processor.
Three HT could support 4-way and higher with various connection arrangements.
The pricing structure is based on the number of HT links enabled,
on the theory that the processor has higher value in big systems than in small systems.
What Actually Happened
Even with the low cost structure Intel enabled in 2-way, desktop systems remained
and actually became defined as single processor.
Instead, the 2-way systems at the desk of users became the workstation category.
This might have been because the RISC/UNIX system vendors sold workstations.
The Intel workstations quickly obliterated RISC workstations,
and there have been no RISC workstations for sometime?
Only two RISC architectures are present today, having retreated to the very high-end server space,
where Intel does not venture.
Itanium was supposed to participate in this space, but the surviving RISC vendors optimized at 8-way and higher.
Intel would not let go of the 4-way system volume and Itanium was squeezed by Xeon at 4-way and below,
yet could not match IBM Power in high SMP scaling.
To do so would incur a high price burden on 4-way systems.
One other aspect of Intel server strategy of the time was the narrow minded focus on optimizing for a single platform.
Most of the time, this was the 4-way server.
There was so much emphasis on 4-way that there actually 2 reference platforms, almost to the exclusion of all else.
For a brief period in 1998 or so, there was an incident of group hysteria that 8-way
would become the standard high volume server.
But this phase wore off eventually.
The SPARC was perhaps the weakest of the RISC at the processor level.
Yet the Sun strategy to design for a broad range of platforms from 2-way to 30-way,
(then with luck 64-way via acquisition of one of the Cray spin-offs) was successful
until their processor fell too far behind.
After the initial implementation of the high volume 2-way strategy,
desktop systems became intensely price sensitive.
The 2-way workstations and server system were in fact not price sensitive even though it was thought they were.
It became clear that desktops could not incur any burden to support 2-way capability.
The desktop processor for 2-way systems was put into a different package and socket,
and was given the Xeon brand.
Other cost reduction techniques were implemented over the next several generations
as practical on timing and having the right level of maturity.
The main avenue is integration of components to reduce part count.
This freed 2-way system from desktop cost constraints, but as with desktops,
it would take several generations to evolve into a properly balanced architecture.
The 4-way capable processors remained on a premium derivative,
given the Xeon MP brand in the early Pentium 4 architecture (or NetBurst) period.
To provide job security for marketing people, 2-way processors were then became the Xeon 5000 series,
and 4-way the Xeon 7000 series in the late NetBurst to 2010 period.
In 2011, the new branding scheme is E3 for 1P servers, E5 for 2-way and E7 for 4-way and higher.
Presumably each branding adjustment requires changes to thousands of slidedecks.
At first, Intel thought both 2-way and 4-way systems had high demand versus cost elasticity.
If cost could be reduced, there would be substantially higher volume.
Chipsets (MCH and IOH) had overly aggressive cost objectives that limited in memory and IO capability.
In fact, 4-way systems had probably already fallen below the boundary of demand elasticity.
The same may have been true for 2-way systems, as people began to realize that single processor
systems were just fine for entry server requirements.
For Pentium II and III 2-way systems, Intel only had a desktop chipset.
In 2005-6, Intel was finally able to produce a viable chipset for 2-way systems (E7500? or 5000P) that provided
memory and IO capability beyond desktop systems.
Previously, the major vendors elected for chipsets from ServerWorks.
It was also thought at the time that there was not a requirement for premium processors in 2-way server systems.
The more correct interpretation was that the large (and initially faster) cache of premium processors
did not contribute sufficient value for 2-way systems.
A large cache does improve performance in 2-way systems, but not to the degree that it does at the 4-way level.
So the better strategy by far on performance above the baseline 2-way system with standard desktop
processors was to step up to a 4-way system with the low-end premium processors instead of a 2-way system with
the bigger cache premium processors.
And as events turned out, the 4-way premium processors lagged desktop processors in transitions to
new microarchitectures and manufacturing processes by 1 full year or more.
The 2-way server on the newer technology of the latest desktop processors
was better than a large cache processor of the previous generation,
especially one that carried a large price premium.
So the repackaged desktop processor was the better option for 2-way systems
The advent of multi-core enabled premium processors to be a viable concept for 2-way systems.
A dual-core processor has much more compute capability than a single core and the same for a quad-core over dual-core
in any system, not just 4-way,
provided that there not too much difference in frequency.
The power versus frequency characteristics of microprocessors clearly favors multiple cores for code
that scale with threads, as in any properly architected server application.
However, multi-core at the dual and quad-core level was employed for desktop processors.
So the processors for 2-way servers did not have a significant premium in capability relative to desktops.
The Intel server strategy remained big cache processors. There was the exception of Tigerton,
when two standard desktop dual-core processor die in the Xeon MP socket was employed for the 4-way system,
until a large cache variant was readied in the next generation Dunnington processor incorporated a large cache.
This also happened for the Paxville and Tulsa.
System Architecture Evolution from Core 2 to Sandy Bridge
The figure below shows 4-way and 2-way server architecture evolution relative to single processor desktops (and servers too)
from 45nm Core 2 to Nehalem & Westmere and then to Sandy Bridge. Nehalem systems are not shown for space considerations,
but are discussed below.
System architecture from Penryn to Westmere to Sandy Bridge, (Nehalem not shown)
The Core 2 architecture was the last Intel processor to use the shared bus, which allows
multiple devices, processors and bridge chips, to share a bus with a protocol to arbitrate
for control of the bus. It was called the front-side bus (FSB) because there was once a back-side bus for cache.
When cache was brought on-die more than 10 years ago, the BSB was no more.
By the Core 2 period, to support higher bus frequency, the number of devices was reduced to 2,
but the shared bus protocol was not changed.
The FSB was only pushed to 1066MHz for Xeon MP, 1333MHz for 2-way servers, and 1600MHz for 2-way workstations.
Nehalem was the first Intel processor with a true point-to-protocol, Quick Path Interconnect (QPI),
at 6.4GHz transfer rate, achieving much higher bandwidth to pin efficiency than possible over shared bus.
Intel had previously employed a point-to-point protocol for connecting nodes of an Itanium system back in 2002.
(AMD implemented point-to-point with HT for Opteron in 2003? at an initial signaling rate of 1.6GHz?)
Shared bus also has bus arbitration overhead in addition to lower frequency of operation.
The other limitation of Intel processors up to Core 2, was the concentration of signals on the
memory controller hub (also known as North Bridge) for processors, memory and PCI-E.
The 7300 MCH for the 4-way Core 2 has 2013-pins, which is at the practical limit,
and yet the memory and IO bandwidth is somewhat inadequate.
Nehalem and Westmere implement a massive increase in memory and PCI-E bandwidth (number of channels or ports) for
the 2-way and 4-way systems compared to their Core 2 counterparts.
Both Nehalem 2-way and 4-way systems have significantly higher cost structure than Core 2.
Previously, Intel had been mindlessly obsessed with reducing system to the detriment of balanced memory and IO.
This shows Intel recognized that their multi-processor systems were already below the price-demand elasticity point,
and it was time to rebalance memory and IO bandwidth, now possible with point to point interconnect
and the integrated memory controller.
QPI in Nehalem required an extra chip to bridge the processor to PCI-E.
This was not an issue for multi-processor systems,
but was undesirable for the hyper sensitive cost structure of desktop systems.
The lead quad-core 45nm Nehalem processor with 3 memory channels and 2 QPI ports in a LGA 1366 socket
was followed by a quad-core, 2-memory channel derivative (Lynnfield) with 16 PCI-E plus DMI replacing QPI in a LGA 1156 socket.
The previously planned dual-core Nehalem on 45nm was cancelled.
Nehalem with QPI was employed in the desktop extreme line,
while the quad-core without QPI was employed in the high-end of the regular desktop line.
The lead 32nm Westmere was a dual-core with the same LGA 1156 socket (memory and IO) as Lynnfield.
Per the desktop and mobile objective, cost structure was reduced with integration,
with 1 processor die and potentially a graphics die in the same package,
and just 1 other component the PCH.
The follow-on Westmere derivative was a six-core using the same LGA 1366 socket as Nehalem,
i.e., 3 memory channels and 2 QPI.
This began the separation process of desktop and other single processor systems from
multi-processor server and workstation systems.
Extreme desktops employ the higher tier components designed for 2-way, but are still single-socket systems.
I suppose that a 2-way extreme system is a workstation.
Gamers will have settle for the mundane look of a typical workstation chassis.
With the full set of Sandy Bridge derivatives, the server strategy transition will be complete.
Multi-processor products, even for 2-way, are completely separated from desktops without the requirement
to meet desktop cost structure constraints.
With desktops interested only in dual and quad-core,
a premium product strategy can be built for 2-way and above around both the number of cores and QPI links.
The Sandy Bridge premium processor has 8 cores, 4 memory channels, 2 QPI, 40 PCI-E lanes and DMI
(that can function as x4 PCI-E).
The high-end EP line in a LGA 2011 socket will have full memory, QPI and PCI-E capability.
The EN line in LGA 1356 socket will have 3 memory channels, 1 QPI and 24 PCI-E lanes plus DMI
to supports up to 2-way systems, and will be suitable for lower priced systems.
Extreme desktops will use the LGA 2011 socket, but without QPI.
What is interesting is that the 4-way capable Sandy Bridge EP line is targeted at both 2-way and 4-way systems.
This is a departure from the old Intel strategy of premium processors for 4-way and up.
Since the basis of the old strategy is no longer valid, of course a new strategy should be formulated.
But too often, people only remember the rules of the strategy, not the basis.
And hence blindly follow the old strategy even when it is no longer valid (does this sound familiar?)
This element of a premium 2-way system actually started with the Xeon 6500 line based on Nehalem-EX.
Nehalem-EX was designed for 4-way and higher with eight-cores,
4 memory channels supporting 16 DIMMs per processor and 4 QPI links.
A 2-way Nehalem-EX with 8 cores, 16 DIMMs per socket might be viable versus Nehalem at 4 cores, 9 DIMMs per socket,
even though the EX top frequency 2.26GHz versus 2.93GHz and higher in Nehalem.
The more consequential hindrance was that Nehalem-EX did not enter production until Westmere-EP was also in production,
with 6 cores per socket at 3.33GHz.
So the Sandy-Bridge EP line will provide a better indicator for premium 2-way systems.
The Future of 8-way and the EX line
There is no EX line with Sandy Bridge.
Given the relatively low volume of 8-way systems, it is better not to burden the processor used by 4-way systems
with glue-less 8-way capability.
Glue-less means that the processors can be directly connected without the need for additional bridge chips.
This both lowers cost and standardizes multi-processor system architecture,
which is probably one of the cornerstones for the success Intel achieved in MP systems.
I am expecting that 8-way systems are not being abandoned,
but rather a system architecture with "glue" will be employed.
Since 8-way systems are a specialized very high-end category,
this would suggest a glued system architecture is more practical in terms of effort than a subsequent 22nm Ivy Bridge EX.
Below are two of my suggestions for 8-way Sandy Bridge or perhaps Ivy Bridges depending on when components could be available.
The first has two 4-port QPI switch, or cross-bar or routers connecting four nodes with 2 processors per node.
The second system below has two 8-port QPI switches connecting single processor nodes.
The 2 processor node architecture would be economical, but I am inclined to recommend building the 8-port QPI switch.
Should the 2 processor node prove to be workable,
then a 16-way system would be possible.
Both are purely speculative as Intel does not solicit my advice on server system architecture and strategy,
not even back in 1997-99.
In looking at the HP DL980 diagram, I am thinking that the HP node controllers
would support Sandy Bridge EP in an 8-way system.
There are cache coherency implications (Directory based versus Snoop) that are beyond the scope for database server oriented topic.
There was an IBM or Sun discussion transactional memory.
I would really like to see some innovation on handling locks.
This is critical to database performance and scaling.
For example, the database engine ensures exclusive access to a row, i.e., memory, before allowing access.
Then why does the system architecture need to do a complex cache coherency check when the application has already done so?
I had also previously discussed SIMD instructions to improve handling of page and row base storage,
SIMD Extensions for the Database Storage Engine
If that were not enough, I had also called for splitting the memory system.
Over the period of Intel multi-processor systems 1995 to 2011,
practical system memory has increased from 2GB to 2TB.
Most of the new memory capacity is used for data buffers.
The exceptionally large capacity of the memory system also means that it cannot be
brought very close to the processor, as into to the same package/socket.
So the memory architecture should be split into a small segment
that needs super low latency byte addressability.
The huge data buffer portion could be changed to block access.
If so, then perhaps the database page organization should also be changed to make the metadata
access more efficient in terms of modern processor architecture to reduce
the impact of off-die memory access by making
full use of cache line organization.
The NAND people are also arguing for Storage Class Memory, something along the lines
of NAND used as memory.
More on QDMPA System Architecture.
and Sandy Bridge.
Many of us have probably seen the new SQL Server 2012 per
core licensing, with Enterprise Edition at $6,874 per core super ceding the $27,495
per socket of SQL Server 2008 R2 (discounted to $19,188 for 4-way and $23,370
for 2-way in TPC benchmark reports) with Software Assurance at $6,874 per
processor? Datacenter was $57,498 per processor, so the new per-core licensing
puts 2012 EE on par with 2008R2 DC, at 8-cores per socket.
This is a significant increase for EE licensing on Intel Xeon 5600 6-core systems
(6x$6,874 = $41,244 per socket) and a huge increase for Xeon E7 10-cores systems, now $68,740 per socket.
I do not intend to discuss justification of the new model. I will say that SQL
Server licensing had gotten out of balance with the growing performance capability
of server systems over time. So perhaps the more correct perspective is that
SQL Server had become underpriced in recent years.
(Consider that there was a 30%+ increase in the hardware cost structure in the transition
from Core 2 architectures systems to Nehalem systems for both 2-way and 4-way to accommodate the vastly increased memory and IO channels.)
Previously, I had discussed that the default choice for SQL Server used to be a 4-way system.
In the really old days, server sizing and capacity planning was an important job category.
From 1995/6 on, the better strategy for most people was to buy the 4-way Intel standard
high-volume platform rather than risk the temperamental
nature of big-iron NUMA systems (and even worse, the consultant to get SQL Server to run correctly by steering the execution plan around operations that were broken on NUMA). With the compute, memory and IO capabilities of Intel Xeon 5500 (Nehalem-EP),
the 2-way became the better default system choice from mid-2009 on.
By “default choice”,
I mean in the absence of detailed technical sizing analysis.
I am not suggesting that ignorance is good policy (in addition to bliss),
but rather the cost of knowledge was typically more than the value of said knowledge.
Recall that in the past, there were companies that made load testing tools. I think they are mostly gone now. An unrestricted license for the load test product might be $100K. The effort to build scripts might equal or exceed that. All to find out whether a $25K or $50K server is the correct choice?
So now there will also be a huge incentive on software licensing to step down from a 4-way 10-core system with 40 cores total to a 2-way system with perhaps 8-12 cores total
(going forward, this cost structure essentially kills the new AMD Bulldozer 16-core processor, which had just recently achieved price performance competitiveness with the Intel 6-core Westmere-EP in 2-way systems).
In the world of database performance consulting, for several
years I had been advocating a careful balance between performance tuning effort
(billed at consultant rates) with hardware. The price difference between a
fully configured 2-way and 4-way system might be $25,000. For a two-node
cluster, this is $50K difference in hardware, with perhaps another $50K in SQL
Server licensing cost, with consideration that blindly stepping up to bigger
hardware does not necessarily improve the critical aspect of performance
proportionately, sometimes not at all, and may even have negative impact.
With performance tuning, it is frequently possible to
achieve significant performance gains in the first few weeks. But after that,
additional gains become either progressively smaller, limited in scope, or involve major
re-architecture. In the long ago past, when hardware was so very expensive, not
mention the hard upper limits on performance, it was not uncommon for a consultant to get a
long term contract to do performance work exclusively.
More recently, performance consulting work tended to be shorter-term.
Just clean up the long hanging fruit, and crush moderate
inefficiencies with cheap powerful hardware. While this is perfectly viable
work, it also precludes the justification for the deep skills necessary to resolve complex
problems, which also calls into question the need to endure an intolerably arrogant, exorbitantly
It had gotten to the point that I had given thought to retiring,
and go fishing in some remote corner of the world.
But now with the new SQL Server per core licensing,
Microsoft has restored the indispensable (though still intolerable) status to arrogant, exorbitantly expensive, performance consultant.
So, thank you Microsoft.
Edit 16 Dec 2011
VR-Zone mentions a Windows 7/Server 2008 R2
that treats the 8-core AMD Bulldozer die as 4 cores with HT, as opposed to AMD's positioning as 8-cores. AMD should hope that this is Microsoft's position for SQL Server 2012 or no one should consider the AMD in light of the per core licensing, given that Intel physical cores are much more powerful than the Bulldozer "core"
Edit 20 Feb 2012
I might add that the new per core licensing would be well worth the extra money if SQL Server would give us:
1) Parallel Execution plans for Insert, Update and Delete
2) Improve Loop Join parallel scaling - I believe today there is content between thread in latching the inner source index root
3) Fix parallel merge join - If the parallel merge join code is broken, why can we not use the parallel hash join code with the existing index?
The basis for this if we going to pay the cores, then SQL Server should not let the core sit idle in time consuming operations.
There has been relatively litle activity in TPC Benchmarks recently with the exception of the raft of Dell TPC-H results with Exa Solutions.
It could be that systems today are so powerful that few people feel the need for benchmarks.
IBM published an 8-way Xeon E7 (Westmere-EX) TPC-E result of 4593 in August, slightly higher
than the Fujitsu result of 4555, published in May 2011.
Both systems have 2TB memory. IBM prices 16GB DIMMs at $899 each, $115K for 2TB or $57.5K per TB. (I think a 16MB DIMM was $600+ back in 1995!)
The Fujistu system has 384 SSDs of the 60GB SLC variety, $1014 each,
and IBM employed 143 SSDs of the 200GB eMLC variety, $1800 each for 24-28TB raw capacity respectively.
Except for unusually write intensive situations, eMLC or even regular MLC is probably
good enough for most environments.
HP published a TPC-H 1TB of 219,887.p QphH
for their 8-way ProLiant DL980 G7 with the Xeon E7-4870,
26% higher in the overall composite score than the IBM x3580 with the Xeon E7-8870 (essentially the same processor).
The HP scores 16% higher in power and 37.7% higher in throughput.
Both throughput tests were with 7 streams.
The HP system had Hyper-Threading enabled (80 physical cores, 160 logical)
while the IBM system did not.
Both systems had 2TB memory, more than sufficient to hold the entire database, data and indexes in memory.
The IBM system had 7 PCI-E SSDs and
the HP system has 416 HDDs over 26 D2700 disk enclosures, 10 LSI SAS RAID controllers,
3 P411 and 1 dual-port 8Gbps FC controller.
Also of interest are TPC-H 1TB reports published for the 16-way SPARC M8000 (June 2011)
with SPARC64 VII+ processors and the 4-way SPARC T4-4 (Sep 2011).
The table below shows configuration information for recent TPC-H 1000GB results.
|TPC-H 1000GB||IBM x3850 X5||HP ProLiant DL980 G7||IBM Power 780||SPARC M8000||SPARC T4-4|
|DBMS ||SQL 2K8R2 EE||SQL 2K8R2 EE||Sybase IQ ASE 15.2||Oracle 11g R2||Oracle 11g R2|
|Processors||8 Xeon E7||8 Xeon E7||8 POWER7||16 SPARC64 VII+||4 SPARC T4|
|Cores Threads ||80-80||80-160||32-128||64-128||32-256|
|IO Controllers ||7||13||12||4 Arrays||4 Arrays|
|HDD/SSD||7 SSD||416 HDD||52 SSD||4x80 SSD||4x80 SSD|
The figure below shows TPC-H 1000GB power, throughput and QphH composite scores for 4 x Xeon 7560 (32 cores, 64 threads),
two 8 x Xeon E7 (80 cores, 80 and 160 threads) systems, 8 x POWER7 (32 cores, 128 threads)
16 SPARC64 VII+ (64 cores, 128 threads) and the 4 SPARC T4 (32 cores, 256 threads).
TPC-H SF 1000 Results
The HP 8-way Xeon and both Oracle/Sun systems, one with 16 sockets
and the newest with 4 SPARC T4 processors, are comparable, within 10%.
An important point is that both Oracle/Sun and the IBM Power systems are configured with 512GB memory
versus 2TB for the 8-way Xeon E7 systems, which enough to keep all data and indexes in memory.
There is still disk IO for the initial data load and tempdb intermediate results.
This good indication that Oracle and Sybase have been reasonably optimized on IO, in particular,
when to use an index and when not to.
I had previously raised the issue that the SQL Server query optimizer should consider
the different characteristics of in-memory, DW optimized HDD storage (100MB/s per disk sequential)
Sun clearly made tremendous improvements from the SPARC 64 VII+ to the T4,
with the 4-way new system essentially matching the previous 16-way.
Of course, the Sun had been lagging at the individual processor socket level until now.
The most interesting aspect is that the SPARC T4 has 8 threads per core.
The expectation is that server applications have a great deal of pointer chasing code,
that is: fetch memory which determines next address to fetch with inherently poor locality.
A modern microprocessor with core frequency 3GHz corresponds to a 0.33 nano-second clock cycle.
Local node memory access time might be 50ns, or 150 CPU-clocks.
Remote node memory acess time might be 100ns for a neighboring node to over 250ns for multi-hop nodes
after cache-coherency is taken into account.
So depending on how many instructions are required for each non-cached memory access,
we can expect each thread or logical core to have many dead cycles, possibly enough to justify 8 threads per core.
What is surprising is that Oracle published a TPC-H benchmark with their new T4-4
and not a TPC-C/E which is more likely to emphasize the pointer chasing code than DW.
Below are the 22 individual query times for the above systems in the power test (1 stream).
TPC-H SF 1000 Queries 1-22
Below are the 22 individual query power times for just the two 8 Xeon E7 systems.
Overall, the HP system (with HT enabled) has 16% TPC-H power score, but the IBM system without HT
is faster or comparable in 9 of the 22 queries.
Not considering the difference in system architecture, the net might be attributed to HT?
TPC-H SF 1000 IBM and HP 8-way Xeon E7
Below are the 22 individual query power times for the HP 8 Xeon E7 and Oracle SPARC T4-4 systems.
TPC-H SF 1000 8-way HP Xeon E7 and 4-way SPARC T4
Fusion-iO just announced the new ioDrive2 and ioDrive2 Duo on Oct 2011 (at some conference of no importance).
The MLC models will be available late November and the SLC models afterwards.
See the Fusion-iO
press release for more info.
Below are the Fusion-IO ioDrive2 and ioDrive2 Duo specifications.
The general idea seems to be for the ioDrive2 to match the realizable bandwidth of a PCI-E gen2 x4 slot (1.6GB/s)
and for the ioDrive2 Duo to match the bandwidth of a PCI-E gen2 x8 slot (3.2GB/s).
I assume that there is a good explanation why most models have specifications slightly below the corresponding PCI-E limits.
The exception is that 365GB model at about 50% of the PCI-E g2 x4 limit.
Suppose that the 785GB model implement parallelism with 16 channels and 4 die per channel.
Rather than building the 365GB model with the same 16 channels,
but a different NAND package with 2 die each, they just implemented 8 channels using the same 4 die per package.
Lets see if Fusion explains this detail.
||SLC (Single Level Cell)
||MLC (Multi Level Cell)
|Read Bandwidth (64kB)
|Write Bandwidth (64kB)
|Read IOPS (512 Byte)
|Write IOPS (512 Byte)
|Read Access Latency
|Write Access Latency
||PCI-E Gen 2 x4
Fusion-IO ioDrive2 Duo
||SLC (Single Level Cell)
||MLC (Multi Level Cell)
|Read Bandwidth (64kB)
|Write Bandwidth (64kB)
|Read IOPS (512 Byte)
|Write IOPS (512 Byte)
|Read Access Latency
|Write Access Latency
||PCI-E Gen 2 x8
SLC verus MLC NAND
Between the SLC and MLC models, the SLC models have much better 512-byte reads IOPS than the MLC models,
with only moderately better bandwidth and read latency.
Not mentioned, but common knowledge is that SLC NAND has much greater write-cycle endure than MLC NAND.
It is my opinion that most database, transaction processing and DW, can accommodate MLC NAND
characteristics and limitations in return for the lower cost per TB.
I would consider budgeting a replacement set of SSDs if analysis shows that the MLC life-cycle does not match
the expected system life-cycle.
Of course, I am also an advocate of replacing the main production database server on a 2-3 year cycle
instead of the traditional (bean-counter) 5-year practice.
The difference in read IOPS at 512B is probably not important. If the ioDrive2 MLC models can drive 70K+ read IOPS at 8KB, then it does not matter what the 512B IOPS is.
One point from the press release:
"new intelligent self-healing feature called Adaptive FlashBack provides complete chip level fault tolerance,
which enables ioMemory to repair itself after a single chip or a multi chip failure without interrupting business continuity."
For DW systems, I would like to completely do away with RAID when using SSDs,
instead having two system without RAID on SSD units.
By this, I mean fault-tolerance should be pushed into the SSD at the unit level. Depending the failure rate of the controller, perhaps there could be two controllers on each SSD unit.
For a critical transaction processing system, it would be nice if Fusion could provide failure statistics
for units that have been in production for more than 30 days
(or whatever the infant mortality period is) on the assumption that most environments will spend a certain amount
of time to spin up a new production system.
If the failure rate for a system with 2-10 SSDs is less than 1 per year,
then perhaps even a transaction processing system using mirroring for high-availability can also do
without RAID on the SSD?
ioDrive2 and ioDrive2 Duo
I do think that it is great idea for Fusion to offer both the ioDrive2 and ioDrive2 Duo product lines
matched to PCI-E gen2 x4 and x8 bandwidths respectively.
The reason is that server systems typically have a mix of PCI-E x4 and x8 slots
with no clear explanation of the reasoning for the exact mix,
other than perhaps that being demanded by the customer complaining the loudest.
By have both the ioDrive2 and Duo, it is possible to fully utilize the bandwidth from all available slots
It would have been an even better idea if the Duo is actually a daughter card the plugs onto the
ioDrive2 base unit, so the base model can be converted to a Duo,
but Fusion apparently neglected to solicit my advice on this matter.
I am also inclined to think that there should also be an ioDrive2 Duo MLC model
at 1.2TB, on the assumption that the performance will be similar to the 2.4TB model,
as the ioDrive2 765GB and 1.2TB models have similar performance specifications.
The reason is that a database server should be configuration with serious brute force IO capability,
that is, all open PCI-E gen 2 slots should be populated.
But not every system will need the x8 slots populated with the 2.4TB MLC model,
hence the viability of a 1.2TB model as well.
if Fusion should be interested in precise quantitative analysis for SQL Server performance, instead of the rubish whitepapers put out by typical system vendors,
well I can turn a good performance report very quickly. Of course I would need to keep the cards a while for continuing analysis...
Over the last two years, I have stood up several proof-of-concept
(POC) database server systems with consumer grade SSD storage at cost $2-4K per TB.
Of course production servers are on enterprise class SSD, Fusion-IO and others, typically $25K+ per TB. (There are some special
situations where it is viable to deploy a pair of data warehouse servers with
PCI-E SSDs - OCZ RevoDrive, RevoDrive X2, & RevoDrive 3 X2
The first POC system was a Dell T710 with 2 Xeon 5670
processors 96GB (12x8GB) memory, 16x10K SAS HDDs and 6 OCZ RevoDrive (original version)
PCI-E SSDs supporting table scans at nearly 3GB/s. The most difficult query repeatedly
hashed a large set of rows (as in there were multiple large intermediate result sets) generating
extremely heavy tempdb IO. With tempdb on 12 10K HDDs, the query time was 1
hour. With tempdb on the 6 OCZ RevoDrives, the query time was reduced to 20min.
Before SSDs became viable, I would normally have configured
a 2-socket system with 48 (2x24) 15K HDDs, with one RAID controller for each
24-disk enclosure. This setup costs about $11K per enclosure with 24x146GB 15K
SAS drive and can be expected to deliver 4GB/s sequential bandwidth, 10K IOPS at
low queue, low latency (200 IOPS per 15K disk) and in the range of 15-20K IOPS
at high queue, high latency. As it was my intent to deploy on SSD, I only
configured 16 HDDs in the internal disk bays and did not direct the purchase of external HDDs.
The 6 OCZ RevoDrive 120GB PCI-E SSDs in the POC system cost about
$400 each at the time (now $280?). I recall that the tempdb IO traffic was
something like 40K IOPS (64KB), around 2.5GB/s bandwidth. This was consistent with
the manufacturers specifications of 540MB/s read and 480MB/s write at 128K IO,
and considering that there will be some degradation in aggregating performance
over 6 devices. The IO latency was somewhere in the range of 40-60ms (note that the
SQL Server engine issues tempdb IO at high queue depth). OK, so the real purpose of
the POC exercise was to tell the SAN admin in no uncertain terms that the
350MB/s from his $200K iSCSI storage system (4x1GbE) was pathetic, and even the
700MB/s on 2x4Gbps FC ports does not cut mustard in DW.
The next set of systems was ordered with 4 OCZ RevoDrive X2,
160GB (<$500 each). There was some discussion on whether to employ the OCZ enterprise
class ZDrive R3, but this product was cancelled and the OCZ substitute, the
VeloDrive (4 SandForce 1565 controllers, rated for ~1GB/s), was not yet
available. I was expecting somewhat better performance for 4 RevoDrive X2 (4 SandForce
1222 controllers each, rated for 700MB/s) over 6 of the original RevoDrives (2 SandForce
controllers each). The tempdb IO intensive query
that took 20min with the 6 RevoDrives now ran in 15min with the 4 RevoDrive X2s.
In additional, IO latency was under 10ms.
I was hoping to test the new OCZ RevoDrive 3 X2 with 4
SandForce 2281 controllers, rated for 1500MB/s read and 1200MB/s write.
Unfortunately there is an incompatibility with the Dell T110 II with the E3-1240
(Sandy Bridge) processor which has a new UEFI replacing the BIOS. OCZ does not
provide server system support on their workstation/enthusiast products.
Hopefully Dell will eventually resolve this.
SATA SSDs - OCZ Vertex 2, Vertex 3 & Vertex 3 Max IOPS, Crucial C300 & m4
My preference is to employ PCI-E rather than SATA/SAS SSD
devices. This is mostly driven by the fact the disk enclosures reflect the IO
capability of HDDs, with 24 bays on 4 SAS lanes. An SSD oriented design should
have 4 SSDs on each x4 SAS port. Of course, 4 SSDs and 4-8 HDDs on each x4 SAS
port is also a good idea.
So I have also looked at SATA SSDs. Earlier this year, I
started with the OCZ Vertex 2 and Crucial C300 SSDs. After encountering the
issue with the RevoDrive 3 on the new Dell server, I acquired OCZ Vertex 3,
Vertex 3 Max IOPS, and Crucial m4 SATA SSDs. The OCZ Vertex 2 has a 3Gbps
interface, the Vertex 3 and both Crucial C300 and m4 all support 6Gbps SATA
The OCZ Vertex SSDs use SandForce controllers; the Vertex 2 uses
the previous generation SandForce 1222 and the Vertex 3 uses the current generation
2281 controller. The Crucial SSDs use Marvel controllers (both?). Perhaps the
significant difference between the OCZ Vertex and Crucial SSDs are that the
SandForce controllers implement compression. The OCZ Vertex SSDs have far
better write performance with compressible data, but is comparable for
incompressible data. It does appear that SQL Server tempdb IO is compressible and
benefits from the compression feature.
Another difference is that OCZ offers 60, 120, 240 and 480GB
capacities while Crucial offers 64, 128, 256 and 512GB capacities. All
capacities are in decimal, that is, 1GB = 10^9 bytes. Both OCZ 60GB and Crucial
64GB presumably have 64GB NAND flash, the 64GB being binary, meaning 1GB =
1024^3, or 7.37% more than 1GB decimal. Basically, OCZ has more over-provisioning
than the Crucial, which in theory should also contribute to better write
performance. (Earlier Vertex drives had 50 and 100GB capacities. But there are so
many varieties of the Vertex 2 that I cannot keep track.)
In brief, the performance difference between SSD
generations, both from the OCZ Vertex 2 to Vertex 3 and from the Crucial C300
to m4, is substantial, so I will focus mostly on the newer Vertex 3 and m4
drives. The performance observed in SQL Server operations seemed to be consistent
with manufacturer specifications for both generations of OCZ and Crucial SSDs. It
was not noted whether writing compressed SQL Server database tables were
further compressible by the SandForce controller. This may be because it is
difficult to achieve the high write performance necessary to stress modern SSDs
in SQL Server with transactional integrity features.
Test System - Dell T110 II, Xeon E3 quad-core Sandy Bridge processor
The test system is a Dell PowerEdge T110 II, Xeon E3-1240
3.30GHz quad-core processor (Sandy Bridge) with 16GB memory. This system has 2
PCI-E Gen2 x8 and 1 Gen 1 x4 slot. All SSDs were attached to a LSI MegaRAID SAS
8260 controller (PCI-E Gen2 x8, 8 6Gbps SAS ports). I did not some testing with
2 SSDs on the SATA ports (3Gbps) but did make detailed observations.
Incidentally, the cost of this system, processor, memory and 1 SATA HD was $1078? The 128GB SSDs were about $220 ($268 for the Max IOPS). So a very capable system with 2 SSDs could be built for $1300-1500 (64GB or 128GB SSDs). A better configuration with 2 SATA HDDs, 4 SSDs and SAS controller would push this to $2500. But if Dell and OCZ could resolve this RevoDrive 3 -UEFI issue, then I would recommend the T110 II, E3 processor, 16GB memory, 2 SATA HDDs and 1 RevoDrive 3 X2.
One unfortunate aspect of this system is that
the SATA ports are all 3Gbps per the Intel C202 PCH, even though the ever so
slightly more expensive C204 supports 6Gpbs SATA on 2 ports. Basically, this
has similar characteristics as the database laptop with super IO that I
proposed earlier, except that the laptop would be 2.5GHz to keep power
Performance tests with the TPC-H SF 100 database
With 8 SSDs (2 Vertex 3, 2 Vertex 3 MaxIOPS, 2 m4, and 2 C300)
I was able to generate 2.4GB/s in table scan aggregation query, possibly
gated by the 355MB/s rating of the C300s. A configuration consisting of the 4
Vertex 3 and 2 m4’s would have been gated by the 415MB/s rating of the m4. If
can get a total 8 Vertex 3s, which are rated at 550/500MB/s for compressible
and incompressible data, then I would either be limited by the adapter or the
PCI-E Gen2 x8 limit of 3.2GB/s. There is an LSI SAS8265 adapter with dual-cores
that has even higher IOPS capability, but it is not known whether this is
necessary for large block IO.
The tests consisted of running the TPC-H queries, single
stream (but not per official benchmark requirements).
The figure below show the time to run the the 22 queries (excluding statistics, parse and compile) for 2, 4 and 6 SSDs with no data compression (raw) and with page mode data compression.
Run time on 2 OCZ Vertex 3 (regular) SSDs was 815 sec with compression and 936 sec raw (w/o compression).
On 4 OCZ Vertex 3 SSDs (2 regular, 2 MaxIOPS)
total query times were reduced to 658 sec with compression and 622 sec raw.
On 6 SSDs, 4 OCZ and 2 Crucial m4, total query times are 633 sec with compression and 586 sec raw.
The figure below shows tempdb IO write latency for 2, 4 and 6 SSDs, with raw and compressed tables.
On the 2 OCZ SSDs, IO latency from fn virtual file stats (for the entire run)
averaged 30ms (temp write) and 90ms (data read) with compression
and 60ms (temp write) - 130ms (data read) without compression.
The performance with 2 Crucial m4 drives was less, showing much higher write latencies.
On 4 OCZ Vertex 3 SSDs, IO latency was 14 temp write and 30ms data read with compression
and 18-60ms without compression. The IO latencies on the Max IOPS models were lower than on
the regular Vertex 3 models.
For 6 SSDs, IO latencies are now down to the 15ms range, with somewhat higher
latency on the Crucial m4 SSDs.
With data and tempdb on 2 OCZ Vertex 3 SSDs, performance was decent but IO constrained. Performance was 15% better with data compression (page) than without, even though CPU was 23% higher.
Performance with 4 OCZ Vertex 3 SSDs (2 regular, 2 Max IOPS)
was 20% better for compression on and 34% better without compression, relative
to performance with 2 SSDs. The performance without compression was now 6% better than
At 6 SSDs (4 Vertex 3, 2 m4), there was another 5%
performance improvement relative to 4 SSDs, for both compressed and not
In the above tests, each SSD was kept as a standalone disk, i.e., I did not use RAID.
There was 1 data and 1 tempdb file on each SSD.
I noticed that the uncompressed (raw) database tended to generate 64K or 128K IO,
while the compressed database tended to have 256K IO.
Two queries, 17 and 19(?) generated 8KB IO,
and would have much better performance with data in memory.
There was also wide variation from query to query in whether performance was better with or without compression.
Today, it is actually possible to build a highly capable database
system in a laptop form factor. There is no point to running a production database
on a laptop. The purpose of this is so that consultants (i.e., me), can
investigate database performance issues without direct access to a full sized
server. It is only necessary to have the characteristics of a proper
database server, rather than be an exact replica.
Unfortunately, the commercially available laptops do not
support the desired configuration, so I am making an open appeal to laptops
vendors. What I would like is:
1) Quad-core processor with hyper-threading (8 logical processors),
2) 8-16GB memory (4 SODIMM so we do not need really expensive 8GB single rank DIMMs)
3) 8x64GB (raw capacity) SSDs on a PCI-E Gen 2 x8
interface (for the main database, not the OS)
- alternatively, 2-4 x4 externally accessible PCI-E ports for external SSDs
- or 2 x4 SAS 6Gbps ports for external SATA SSDs
4) 2-3 SATA ports for HDD/SSD/DVD etc for OS boot etc
5) 1-2 e-SATA
6) 2 1GbE
Below is a representation of the system, if this helps clarify.
The Sandy-Bridge integrated graphics should be sufficient, but
high-resolution 1920x1200 graphics and dual-display are desired. (I could live
There should also be a SATA hard disk for the OS (or SATA SSD
without the 2.5in HDD form factor if space constrained) as the primary SSD array should be dedicated to the database.
Other desirable elements would be 1 or 2 e-SATA port, to support backup and restores with consuming the valuable main SSD array,
ports (so I can test code for parallel network transfers.
The multiple processor cores allow parallel execution plans.
Due to a quirk of the SQL Server query optimizer, 8 or logical processors are more
likely to generate a parallel execution plan in some cases.
Ideally, the main
SSD array is comprised of 2 devices, one on each PCI-E x4 channel.
The point of the storage system is to demonstrate 2GB/sec+
bandwidth, and 100-200K IOPS. One of the sad fact is even today storage
vendors promote $100K+ storage systems that end up delivering less
than 400-700MB/s bandwidth and less than 10K IOPS.
So it is important to demonstrate what a proper database storage system should be capable of.
Note that is it not necessary to have massive memory.
A system with sufficient memory and a powerful storage system can run any query, while a system with very large memory but weak storage can only run read queries that fit in memory. And even if data fits in memory, the performance could still fall off a cliff on tempdb IO.
Based on component costs, the
laptop without PCI-E SSD should be less than $2000, and the SSD array should be
less than $1000 per PCI-E x4 unit (4x64GB).
It would really help if the PCI-E SSD could be powered off from SW, i.e., without having to remove it. This why I want to boot off the SATA port, be it HDD or SSD.
per below, 2 SSDs on SATA ports do not cut the mustard,
The spec above call for 8 SSDs. Each SSD is comprised of 8 NAND packages, and each
package is comprised of 8 die. So there are 64 die in one SSD, and IO is
distributed over 8 SSDs, or a total of 512 individual die.
The performance of a single NAND die is nothing special and even pathetic on
writes. However, a single NAND die is really small and really cheap. That is
why it is essential to employ high parallelism at the SSD unit level. And then,
employ parallelism over multiple SSD units.
An alternative solution is for the laptop to expose 2-4 PCI-E x4 ports (2 Gen 2
or 4 Gen 1) to connect to something like the OCZ IBIS, or an SAS controller
with 2 x4 external SAS ports.
The laptop will have 1 Intel quad-core Sandy-Bridge processor, which has 2 memory channels supporting 16GB dual-rank DDR3 memory. The processor has 16 PCI-E gen 2, DMI g2 (essentially 4 PCI-E g2 lanes) and integrated graphics. There must be a 6-series (or C20x) PCH, which connects upstream on the DMI. Downstream, there are 6 SATA ports (2 of which can be 6Gbps), 1 GbE port, and 8 PCI-E g2 lanes. So on the PCH, we can attach 2 HDD or SSD at 6Gbps, plus support 2 eSATA connections. There is only a single 1GbE port, so if we want 2, we have to employ a separate GbE chip.
While the total PCH down stream ports exceeds the upstream, it ok for our purposes to support 2 internal SATA SSDs at 6Gbps, 2 eSATA ports and 2 GbE, plus USB etc. The key is how the 16 PCI-E gen 2 lanes are employed. In the available high-end laptops, most vendors attach a high-end graphics chip (to all 16 lanes?). We absolutely need 8 PCI-E lanes for our high performance SDD storage array. I would be happy with the integrated graphics, but if the other 8 PCI-E lanes were attached to graphics, I could live with it.
The final comment (for now) is that even though it is possible to attach more than 2 SSD off the PCH, we need then bandwidth on the main set of PCI-E ports. It is insufficient for all storage to be clogging the DMI and PCH.
Thunderbolt is 2x2 PCI-E g2 lanes, so technically thats almost what I need (8 preferred, but 6 acceptable).
What is missing from the documentation is were Thunderbolt attaches.
If directly to the SandyBridge processor (with bridge chip for external?), then that's OK,
if off the PCH, then that is not good enough for the reasons I outlined above.
Also, we need serious SSDs to attach off TB, does the Apple SSD cut mustard?
The diagram below shows the Thunderbolt controller connected to the PCH,
but also states that other configurations are possible.
The problem is that most high-end laptops are designed with high-end graphics,
which we do not want squandering all 16 PCI-E lanes.
A Thunderbolt controller attached to the PCH is capable of supporting x4 PCI-E gen 2, but cannot also simultaneously support saturation volume traffic from internal storage (SATA ports), and network (not to mention eSATA). I should add that I intend to place the log on the SATA port HDD/SSD, along with the OS, hence I do not want the main SSD array generating traffic over the DMI-PCH connection.
A Thunderbolt SDK is supposed to released very soon, so we can find out more.
I am inclined to think that Thunderbolt is really a docking station connector, being able to route both video and IO over a single connector. If we only need to route IO traffic, then there are already 2 very suitable protocols for this, i.e., eSATA for consumer, and SAS for servers, each with a decent base of products. Of course, I might like a 4 bay disk enclosure for 2.5in SSDs on 1x4 SAS, or an 8-bay split over 2 x4 ports. Most of the existing disk enclosures carry over from hard disk environment, with either 12-15 3.5in bays or 24-25 2.5in bays.
There is a feature, called index skip scan that has been in Oracle since version 9i.
When I across this, it seemed like a very clever trick, but not a critical capability.
More recently, I have been advocating DW on SSD in approrpiate situations,
and I am thinking this is now a valuable feature in keeping the number of nonclustered
indexes to a minimum.
Briefly, suppose we have an index with key columns: Col1,
Col2, in that order.
Obviously, a query with a search argument (SARG) on Col1 can use this index,
assuming the data distribution is favorable.
However, a query with the SARG on Col2
but not Col1 cannot use the index in a seek operation.
Now suppose that the cardinality of Col1,
(the number of distinct values of Col1), is relatively low.
The database engine could seek each distinct first value of Col1
and the specified SARG on Col2.
Microsoft SQL Server currently does not have the Oracle Index Skip-Scan feature,
but the capability can be achieved with a work-around.
In this example, the LINEITEM table has a cluster key on columns
but does not have an index leading with L_ORDERKEY.
Our query is to find a specific Order Key in the LineItem table.
If there is a table with the distinct date values, DimDate,
we could force a loop join from the DimDate table to LineItem
(even though only columns from LineItem are required)
to get the execution plan below.
The question is now: how effective is this technique?
The most efficient execution plan is of course, to have an index leading with the Order Key column.
But the situation calls for keeping the number of nonclustered indexes to an absolute minimum.
So how does the above execution plan compare with a table scan?
A table scan, in this type of query, such that only few rows meet an easy to evaluare SARG,
might run at about 1GB/s per core.
Note this is far higher than the 200MB/sec cited in the Microsoft Fast Track Data Warehouse
documents. This is because the FTDW baseline is a table scan that aggregates several columns
of every row. And not only that, a Hash Match is also required to group the results.
Basically, a needle in haystack table scan runs much faster than the more complex
aggregate and group scan.
At most, a 1GB table scan might acceptable for a non-parallel execution plan
and even a 50GB table scan could be tolerable on a powerful 32-core system
with an unrestricted parallel execution plan.
A loop join can run somewhere in range of 100,000-200,000 seeks/sec to the inner source.
Realistically, a Data Warehouse with 10 years data has distinct 3652 days (depending on the leap year situation).
A loop join with 3650 rows from the outer source should run some where around 36ms.
Even if the DW had 20 years data, this is still acceptable, on the assumption that
the non-lead column Order Key search is in the minority,
with the plus being one less index on the big table is required.
If the query could be bounded to with a single year or quarter-year,
then we are approaching the efficiency of having the extra nonclustered index.
Last week Intel announced the 10-core Xeon E7-x8xx series (Westmere-EX), superceding the Xeon 6500 and 7500 series (Nehalem-EX).
The E7 group consists of the E7-8800 series for 8-way systems, the E7-4800 series for 4-way systems
and the E7-2800 series for 2-way systems. Also, the E3-12xx series (Sandy Bridge) for 1-socket servers,
superceding the Xeon 3000 series (Nehalem and Westmere).
This week at Intel Developer Forum Bejing, Intel has a slidedeck on Sandy Bridge-EP,
an 8-core die that will presumably be the Xeon E5-xxxx series superceding the Xeon 5600 series (Westmere-EP) scheduled for 2H 2011.
||Xeon 6500/7500, 4-8 cores
||E7-8/4/2800, 6-10 cores
||Xeon 5600, 4-6 cores
||E5-xx00, upto 8 cores
||Xeon 3x00, 2-6 cores
||E3-1200, 2-4 cores
The top-of-the-line Xeon E7-8870 is 10-core, 2.4GHz (max turbo 2.8GHz) and 30M last level cache,
compared with Xeon X7560 8-core, 2.26GHz (turbo 2.67GHz) and 24M LLC.
HP ProLiant DL580 G7 TPC-E results for 4-way Xeon E7-8870 and 7560 are 2454.51 and 2,001.12 respectively.
This is a 22% gain from 25% more cores, and 6% higher frequency, inline with expectations.
IBM System x3850 X5 TPC-H results at scale factor 1TB for the 4-way Xeon X7560 and 8-way Xeon E7-8870 are below.
|4-way Xeon 7560
It is unfortunate that a direct comparison (with same number of processors and at the same SF)
between the Xeon E7-8870 and X7560 is not available.
The presumption is that the Xeon E7-8870 would show only moderate improvement over the X7560.
This because the TPC-H is scored on a geometric mean of the 22 queries,
of which only some benefit from very high degree-of-parallelism.
The more modest performance gain from Nehalem-EX to Westmere-EX, compared to the previous 40% per year objective, is probably an indication
of the future trend in the pace of performance progression.
The pace of single core performance progression slowed several years ago.
Now, the number of cores per processor socket also cannot be increased at a rapid pace.
Fortunately, the compute power available in reasonably priced systems is already so outstanding
that the only excuse for poor performance is incompetence on the software side.
My expectation is that transaction processing performance can still be boosted significantly
with more threads per core. The IBM POWER 7 and Oracle/Sun SPARC T3 implement 8 threads per core. It is unclear if Intel intends to pursue this avenue.
Data Warehouse performance could be increased with columnar storage, already in Ingres VectorWise
and coming in the next version of SQL Server.
Scale out is now available in PDW (EXA SOL has TPC-H results with 60 nodes).
I am also of the opinion that SIMD instruction set extensions for the row & column offset calculation
could improve database engine performance. The object is not just the reduce the number of instructions,
but more importantly to make the memory access sequence more transparent, ie, allow for effective prefetching.
At the system level, the processor interconnect technology (AMD Hyper-Transport and Intel QPI)
should also allow scale-up systems. HP has mentioned that 16-way Xeon is possible.
HP already has the crossbar technology from their Itanium based Superdome and the sx3000 chipset.
It is probably just a matter of gauging the market volume of the 8-way ProLiant DL980 to assess
whether their is also a viable market for 16-way Xeon systems.
Another observation is the price structure of 2-way and 4-way systems.
It used to be that there was very little price difference between 1-way and 2-way systems with otherwise comparable features.
So the default system choice frequently started with a 2-way system and higher.
On the downside, the older 2-way systems also did not have substantially better memory or IO capability.
With the 2-way Xeon 5500 and 5600 systems, there is a more significant price gap between 1-way and 2-way systems.
However, the 2-way Xeon 5500 systems also have serious memory and IO capability.
So low-end entry system now needs to revert to a single socket system.
The price gap with 4-way systems has also grown along with capabilities, particularly in memory capacity and reliability.
The default system upgrade choice should be to replace older 4-way systems with new generation 2-way systems.
The new 4-way systems should target very-high reliability requirements.
Seagate today announced a near complete overhaul of their enterprise product line.
This include second generation SSD now with either SAS and SATA interfaces.
The first generation Pulsar SSD only supported SATA interface.
The new 2.5in 15K and 10K hard drive models have higher capacity.
The 2.5in 7.2K hard drive was upgraded to 1TB last month?
The 7.2K 3.5in is now available upto 3TB.
All models support 6Gbps.
Pulsar SSD (SAS/SATA interface, 2.5in FF)
The new second generation Seagate Pulsar SSD comprises two product lines.
The Pulsar XT.2 is based on SLC NAND at 100, 200 and 400GB capacities with SAS interface only.
The Pulsar.2 is based on MLC NAND at 100, 200, 400 and 800GB capacities available in both SAS and SATA interfaces.
Performance specifications cited for the Pulsar XT.2 are 360MB/s read and 300MB/sec write
at 128KB, equivalent to sequential IO in hard disks.
Random 4KB IO rates are 48K IOPS read and 22K IOPS write.
Performance specifications were not cited for the Pulsar.2.
Based on other vendors with both SLC and MLC product lines,
the expectation is that the MLC model should have comparable read performance.
Write performance might be less than the SLC model, but still adequate to almost all requirements.
Savvio 15K 2.5in HDD
The Savvio 15K.3 supercedes the 15K.2 product line with 300GB and 146GB capacities
replacing the 146GB and 73GB models in the 15K.2 line.
Sequential transfer rate is 202MB/s on the outer tracks and 151 on the inner tracks,
up from 160 to 122 in the 15K.2
Average read and write seek time in reduced to 2.6/3.1ms, down from 2.9/3.3ms,
resulting is slightly improved random IO performance.
Savvio 10K 2.5in HDD
The Savvio 10K.5 product line features 300, 450, 600 and 900GB capacities (300MB per platter),
up from 450 and 600GB in the 10K.4.
Sequential transfer rate is 168 to 93MB/s, up from 141 to 75MB/s.
Average seek time is 3.4/3.8ms (3.7/4.1 for the 900GB model), down from 3.9/4.5 ms in the earlier model.
Constellation 7.2K 3.5in HDD
The Constellation ES.2 enterprise grade 7200RPM 3.5in 3TB drive is available in both SAS and SATA interfaces.
Sequential transfer rate is 155MB/s. The previous generation ranged from 500GB to 2TB.
Constellation 7.2K 2.5in HDD
The Constellation.2 enterprise grade 7200RPM 2.5in product line features 250GB, 500GB and 1TB capacities,
in both SATA and SAS interfaces (SATA only for 250GB).
The 3.5in 15K and 10K product lines have not been refreshed. It is unclear whether there will be future models for these product lines.
Intel released the SSD 510 this month (March 2011), shows as in-stock on some web-stores.
SATA 6Gbps interface, 120 and 250GB capacities.
Sequential read/write 500/315MB/s. Random read/write 20K/8K IOPS (this seems low).
The OCZ Vertex 3 regular and Pro series with SATA 6Gbps interface and 2.5in form factor.
have been in the news,
but there is no information on the OCZ website.Correction: The Vertex 3 is not listed under the products section, but is described in the press release section
Toms Hardware lists the Vertex 3 with the SandForce 2281 controller, 550/525MB/s sequential, 60K IOPS 4K random and $499 price for the 240GB model. The Vertex 3 Pro with the SF-2582 controller, 550/500MB/s sequential, 70K IOPS and the 200GB model priced at $775.
The OCZ Z-Drive R3 with PCI-E gen 2 interface has been announced, available probably in Apr or May.
Sequential Read/Write at 1000/900MB/s.
Random at 135K IOPS.
No pricing is available on the new Seagate drives. The bare drive pricing (not from system vendors) on the current Seagate 2.5in 15K 146GB is about $200, and the 2.5in 10K 600K about $400. Given that the consumer grade OCZ Vertex 2 is $430 for the 240GB and $215 for the 120GB, my thinking thinking is that the 15K 146GB drives are nolonger viable, evening considering the higher price of the Vertex 3 Pro. The 10K 600GB is only barely viable from the capacity point of view. So the new higher capacity drives really need to come in at comparable price points to the current models.
All the Seagate 2.5 devices, SSD and HDD are 15mm height, standard on server storage bays. The OCZ SSD follow the common notebook 9.3mm height form factor. Going forward, perhaps there will be a 2U storage enclosure that can accommodate 36 x 9.3mm bays in place of the current 24 x 15mm bays. But an even better idea is to have a new DIMM size form factor with the connection at the short side so a that a 1U storage enclosure can accommodate 36 of these devices.
Seagate first generation Pulsar
Some one brought up the point that Pulsar, the first generation Seagate SSD, was vaporware. My guess would be that Seagate was late to market without any performance or price advantages over products already available, did not get any design wins with majors system and storage player, and hence decided not to launch. So there is concern that the second generation Pulsar might also be vaporware. Stayed tuned.
The Register reported a spat between HP and Oracle on the TPC-C benchmark. Per above, HP submitted a TPC-C result of 3,388,535 tpm-C for their ProLiant DL980 G7 (8 Xeon X7560 processors), with a cost of $0.63 per tpm-C. Oracle has refused permission to publish.
Late last year (2010) Oracle published a result of 30M tpm-C for a 108 processors (sockets) SPARC cluster ($30M complete system cost). Oracle is now comparing this to the HP Superdome result from 2007 of 4M tpm-C at $2.93 per tpm-C, calling the HP solution a clunker. The SPARC cluster is comprised of 27 nodes. Each node is 4 socket, 16-core (64 cores total) with 8 threads per core (512 logical processors total) and 512GB memory. The complete cluster is comprised of 1728 cores and 11,040 x 24GB SSD drives.
Microsoft wants system vendors to move the newer TPC-E benchmark. A ProLiant DL980 or any 8-way Xeon 7500 TPC-E results other system vendors would be good to see. Oracle has not published any TPC-E benchmark result. The rumor mill has it that Oracle on a single system runs TPC-E just fine. The problem is believed to be that Oracle RAC cannot scale on TPC-E to degree desired by marketing (even if it is ok by technical standards).
The Microsoft Parallel Data Warehouse diagram was somewhat difficult to understand in terms of the functionality of each subsystem in relation to the configuration of its components. So now that HP has provided a detailed list of the PDW components, the diagram below shows the PDW subsystems with component configuration (InfiniBand, FC, and network connections not shown).
Observe that there are three different ProLiant server models, the DL360 G7, DL370 G6 and the DL380 G7, in five different configurations as suitable for the requirements of each subsystem. There are also up to three different configurations of the P2000 G3 storage system for the Control node cluster, the compute nodes, and the backup node.
The Control nodes are ProLiant DL380 G7 servers with 2 Xeon X5680 six-core 3.3GHz 130W processors, 96GB memory, 14 internal 300GB 10K disks, and an external P2000 G3 with 5x450G 15K disks. The Control nodes parse incoming queries to be reissued to the compute nodes, and also reassemble the results from each node to the client as a single set. This would explain the use of powerful processors and a heavy memory configuration.
The purpose of the 14 internal disks is unclear as one might expect that result sorting takes place on the shared storage, unless this is done outside of SQL Server and also outside of the cluster shared resources. Now that I think about it, this is reasonable. On a cluster failover, there is no need to recover the intermediate results of queries in progress, as they will have to be reissued?
The general idea is to distribute as much query processing to the compute nodes as possible. There are situations that require intermediate data to be brought back to the control node for finally processing. Once there are more environments on PDW, there may be an evaluation as to whether a more power control node would be suitable? depending on the specific case.
The management nodes are ProLiant DL360 G7 servers with a single Xeon E5620 quad-core 2.4GHz 80W processor, 36GB memory and 2 disks. If the management nodes have light compute and memory requirements, I am inclined to think that this functionality could be consolidated with the other subsystems. But this is not necessarily an important point.
The Landing Zone is a ProLiant DL370 G6 with a single Xeon E5620. (The HP spec sheet also mentions W5580, which is a quad-core 3.2GHz 130W 45nm Nehalem-EP core.) The memory configuration cited is peculiar, with 6x2GB and 6x4GB DIMMs. A single processor Xeon 5600 system should have a limit of 9 DIMMs, three per memory channel. So why not employ 9x4GB DIMMs?
The DL370 chassis accommodates up to 14 LFF (3.5in) disks, eliminating the need for an external storage unit. The Landing Zone actually employs 10x1TB 7.2K and 2x160GB 7.2K HDDs. It is unclear why the LZ system was not configured with 14 disks. It could have also been configured with all 1TB disks, with the OS using a small slice. There are now 2GB and 3GB disks in the 3.5in 7.2K form factor. Seagate has a 2TB enterprise rated 3.5in 7200RPM drive. But it is unclear when these ultra-large capacity disks will available for server systems or even if there is a need for additional capacity in this function.
The Backup node is a ProLiant DL380G7 with 2 Xeon E5620 processors, 24GB memory. Presumably there are 2 internal disks to boot the OS. There is also an external P2000 G3 storage system with sufficient 3.5in drive capacity to support backups assuming 4:1 compression.
The max operational capacity of the compute nodes is 500TB This would imply that the Backup node could have 125TB net capacity..
The cited max capacity of PDW (with 40 nodes) is 500TB. This is based on the uncompressed data? So the actual net storage is 133TB? A reasonable assumption is that a 133TB database with page compresssion applied might yield another 2X reduction on a backup with compression?
The maximum configuration for the P2000 G3 is 96 LFF (3.5in) disks, with 7 additional disk enclosures. The P2000 G3 does support the 2TB 7.2K drive, so the P2000 with 5 additional disk enclosures totaling 72x2TB disks would meet the backup capacity requirement.
The Compute nodes are ProLiant DL360 G7 server with 2 Xeon X5670 six-core 2.93GHz 95W processors, 96GB memory, 8 internal 300GB 10K HDD and one external P2000 G3 with the option of 11x300GB 15K, 11x1TB 7.2TB or 24x300GB 10K drives. The external storage unit is for permanent data storage. The internal disks are for tempdb.
There are now 600GB 3.5in 15K and 600GB 2.5in 10K drives, so it is possible that these will replace two of the current options in the near future. A single (42U) rack must support 10 compute nodes (plus 1 spare), 10 storage nodes, and the associated switches (2 InfiniBand, 2 Ethernet, 2 Fiber Channel). This precludes a 2U form factor for the compute node. The 1U DL360 G7 cannot support the 130W thermal envelope for each of 2 Xeon X5680 processors, so the 95W Xeon X5670 processors are employed instead.
If tempdb is actually on the 8 internal drives, then I wonder why the P2000 storage units employs RAID-10? Write to permanent data is expected to be infrequent, negating the need for small block random write IO performance (the most serious liability of RAID-5). Only tempdb activity is expected to generate non-sequential IO.
Without more than very brief hands on time with PDW, I have two comments at this time. One is that the P2000 G3 supports 1,572MB/s bandwidth. Ideally, for 24 disks per storage unit, we would like to target 2GB/s, possibly somewhat more. Hopefully the next generation HP entry level storage system with employ the Intel C5500 processor (or successor) or some comparable processor with adequate memory and IO bandwidth. I have heard that the desire is also to move the storage interface from FC to Infini-band.
The second comment is that SSD could be considered for tempdb. The 8 internal 300GB 10K drives might cost $2000 (or whatever OEM volume pricing is). The cost of a PCI-E 1TB consumer grade SSD is approaching $2-3K. An enterprise grade 1TB SSD is higher depending on the vendor.
The maximum PDW configuration is 40 compute nodes. With the 24-disk storage units (300GB decimal, 278GB binary), there are 960 disks, excluding the capacity of disks locally attached on the compute node internal bays. Net RAID-10 capacity is then 133,440GB binary, which could correspond to 500TB uncompressed capacity. The 40 compute node limit may be a connectivity limit. At some point, the 600GB 2.5in 10K drive should become available for the PDW system, doubling capacity. I am also wondering what if an actual multi-unit PDW customer asked MS for additional external storage units to be daisy chained.
on further consideration, if I can mix HDD and SSD on a SAS RAID controller, on the compute node, I would go with 2 HDD for the OS and 4-6 SATA/SAS SSDs, plus 1 PCI-E SSD in the open PCI-E slot.
Lets figure $10K each for the heavy config servers, $15K for the storage units, and $10K for each of the IB and FC switches, lets figure $500K HW cost for the control rack plus one 10 node compute rack. There are 20 SQL Server processor licenses plus other components, so probably another $500K here. It will probably also involve another $500K in consulting for the deployment, so maybe $2M all told. I am thinking gold electro-plating on the rack might cost $10K.
Compare this with an 8-way ProLiant DL980 environment. Lets just suppose the PDW with 20 Xeon 5600 sockets has 2.5X the performance of 8 Xeon 7500 sockets. The DL980 with memory should cost around $150K, a direct-attach storage (8x24 disks) costs $12.5K each for $100K. Throw in another $50K of hardware to resemble the PDW. SQL EE licenses for 8 sockets is $240K. Suppose the consulting services to deployment on the SQL Server that we are already familar with is $100K. This brings the total to $700K. So $2M for 2.5X of the $700K system seems reasonable, considering the heavily super non-linear cost structure of scaling.
Also, figure an Oracle Database Machine with 8 RAC nodes and 14 Exadata Storage nodes will also cost around $2M (somebody please look this up).
IBM just published a TPC-H SF 1000 result for their
4-way Xeon 7560 system featuring a special MAX5 memory expansion board to support 1.5TB memory. In Dec 2010, IBM also published a TPC-H SF1000 for their Power 780 system, 8-way, quad-core, (4 logical processors per physical core).
The figure table below shows TPC-H SF 1000 results for the 8-way 6-core Opteron 8439 on SQL Server and Sybase, the 16-way quad-core Itanium 9350 on Oracle, the 4-way Xeon 7560 on SQL Server and the 8-way POWER7 on Sybase. On TPC-H Power (single stream), the 4-way Xeon on SQL Server is competitively placed relative to the 16-way Itanium and 8-way POWER7 systems. In other words, an 8-way Xeon might be comparable to the 8-way POWER7. If there is a weak point in SQL Server, it is in the throughput test (multiple concurrent query streams). This aspect is probably something that could be corrected. Unfortunately, it is probably not a priority for the SQL Server team at this time.
TPC-H SF 1000 Results for HP DL785 and Integrity Superdome servers
|HP DL785 G6
|HP DL785 G6
|IBM x3850 X5
|IBM Power 780
Additional details are below.
The two IBM results employ SSD storage. The older results are on HDD storage.
In addition, the IBM x3850 X5 (with Xeon 7560) system is configured with 1.5TB memory.
The total size of the TPC-H SF 1000 database, all tables and indexes, should be 1.4TB.
A storage system (7 SSDs) capable of very high IO rates is still required to handle the intense tempdb activity.
|System||DL785 G6||DL785 G6
||Superdome 2||x3850 X5||Power 780|
||Oracle 11g R2
||8 x 6 = 48
||8 x 6 = 48
||16 x 4 = 64
||4 x 8 = 32
||8 x 4 = 32
||8 x 8Gbps
|7 PCI-E SSD
||12 PCI-E SAS
||4 EXP 12
||52 SAS SSD
||3x50, 25, 30, 35
||1 per 24
||1 per 24
||3 per 6
||2008 R2 EE
||2008 R2 EE
||Oracle 11g R2
Below are the individual query run times.
TPC-H SF 1000 individual query execution times
Note the wide variation in each query between different systems and database engines.
This could reflect differences in any of:
1) processor and system architecture,
2) memory versus disk, HDD and SSD
3) execution plans
4) the efficiency between component operations (scan, index seek, hash, sort, etc)
and probably other factors as well.
It would be interesting to compare the execution plans between different database engines,
even to force the SQL Server execution plan to one as close as possible to the plans
employed on the other database engines.
The main point of interest is not moderate differences in the overall (geometric mean) performance, but rather the very large differences in certain queries.
The long run time for Q18 should probably be investigated.
Another view, the 4-way Xeon 7560 SF 1TB with 1.5TB memory + SSD versus 4-way Xeon SF 3TB with 0.5TB memory & HDD.
The number of processors is doubled, but the database is 3 times larger.
On this alone, we might expect a 50% difference in query time, with the caveat that there are complications in projecting TPC-H performance at different scale factor.
There are also significant differences in the memory-to-data ratio,
and storage performance characteristics.
TPC-H individual query execution times for 4-way 1TB and 8-way 3TB
On the 8-way system at SF 3TB, Q18 actually runs faster than on the 4-way system at SF 1TB.
But the other larger queries, Q1, 9, and 21, show the expected pattern.
Overall, it does appear that the 3TB query run times are on the order of 50% higher.
Microsoft just release Fast Track Data Warehouse 3.0 Reference Guide version. The new changes are increased memory recommendation and the disks per RAID group change from 2-disk RAID 1 to 4-Disk RAID 10.
The earlier FTDW reference architecture cited 4GB memory per core. There was no rational behind this, but it was felt some rule was better than no rule. The new FTDW RG correctly cites the rational that more memory helps keep hash join intermediate results and sort operations in memory.
4-Disk RAID 10
FTDW versions 1 & 2 specified 2-disk RAID groups. The reason was the SAN storage system could not deliver good sequential performance IO with fat RAID groups (comprised of many drives). For some reason, a thin 2-disk RAID group could deliver 100MB/sec per disk. An EMC whitepaper added that a 3-disk RAID 5 group could also deliver good sequential performance. But RAID 5 does not have the necessary performance characteristics to support tempdb IO requirements. FTDW 3.0 now specifies 4-disk RAID 10 groups. The presumption is that 100MB/s sequential IO is still the objective. No explanation was given why a 4-disk RAID group is now specified. It is possible the firmware in SAN systems can now support good sequential performance with larger RAID groups. But an explanation is still desired.
Apparently the reason for the 4-disk RAID 10 group is to match the PDW configuration. No data was provided as to whether this could still meet the old 100MB/s per disk objective.
The reasons cited for the new memory recommendation in FTDW 3.0 are correct. My suggestion however, is that rather than cite specific memory configurations which just happen to be reasonable today, apply the underlying principal. The cost of server class ECC memory modules follow a consistent pattern. An 8GB DIMM today is $300 and a 16GB is $850 (from Crucial). The 32GB memory modules are just becoming available, and probably very expensive, say for example: $5000. The premium for the 16GB memory module is small relative to 2x8 GB modules.
The low cost of memory relative to the overall environment and the fact that memory has positive benefits, but no specific hard requirements drives the strategy. Fill the available memory sockets with either 8GB or 16GB memory modules (depending on system constraints). When the premium for 16GB modules disappears, then discontinue the 8GB module from consideration. When the premium for 32GB memory modules becomes less, then both 16 & 32GB modules can be considered.
My preference is still to employ direct-attach storage systems with RAID controllers in the server system. The latest PCI-E SAS RAID controllers cannot support 2.8GB/s, just under the net bandwidth of a PCI-E gen 2 x8 slot. Each 6Gbps SAS x4 channel can support 2GB/s. Modern 15K disk drives can deliver 150MB/s on the outer tracks. This infrastructure has an amortized cost structure of less than $700 per disk. The cost structure of low-end SAN systems specified by FTDW is typically 2X higher per disk. I also prefer to employ more 2.5in 15K HDDs over more channels than specified by FTDW over using 300GB+ 3.5in 15K HDDs.
Index Light Strategy
FTDW endorses the index light strategy. The presumption is that the data warehouse is very large, in the multi-TB range, too large to fit in system memory. The storage strategy employs hard disk drives because an HDD storage system properly configured with IO distributed over multiple channels and many disk drives can deliver high sequential IO bandwidth economically.
The SQL Server query optimizer cost model is built on the model that table scans run at 10.8MB/s (or 1350 pages/sec) and random IO (from loop joins and key lookups) run at 320 IOPS. The absolute value of each in relation to the actual server is not important. The important point is 1350 pages/s sequential to 320 random IOPS ratio.
The FTDW reference architecture is designed to achieve sequential IO of 100MB/sec per disk, or 12,800 pages/sec. A 3.5in 15K disk can support 175 IOPS (225 for 2.5in 15K) for random IO distributed over the entire drive at queue depth 1. At higher queue depth, 15K disks can support 320 IOPS. So the FTDW reference architecture IO performance characteristic has the ratio 12,800 pages sequential to 320 random IOPS.
There is no provision to adjust the SQL Server query optimizer sequential-to-random ratio. For this reason, FTDW favors avoiding non-clustered indexes so that key lookups and loop joins are not an execution plan option. A good partitioning scheme to help contain the range of scan operations is an essential element of the index light strategy.
Tempdb on SSD
SQL Server issues IO to tempdb at high queue depth, typically 64K in size, to store the intermediate results of large hash joins and sort operations. The pattern is a write burst eventually followed by reads. A very large disk array can support this, but a small array of SSDs can support extraordinarily high IO rates at lower latency than HDDs.
The next consideration is that if only tempdb is on SSD, there is not a hard requirement to employ enterprise grade SSDs. Depending on the situation, some environment can consider consumer grade SSDs. The difference in cost between consumer and enterprise SSD is more than 10X ($30K per TB for enterprise, less than $3K for consumer, all without RAID). For comparison, an HDD storage designed for performance might cost $10K per TB in RAID 10.
Some HD storage systems can support SSD with SATA/SAS, but not all can support the very high IO rates possible with SSD, so it might be a good idea to employ PCI-E SSDs. This configuration does not support clustering, but clustering is not necessarily a true requirement for data warehouses. Going forward, expect storage systems to finally abandon the under-powerful obsolete processor/controllers, and switch to powerful Intel Xeon 5600 or later processors. Then intermixing SSD and HDD should not be a problem.
Index Heavy Strategy
There appears to be a tendency to treat data warehouses as contest of whose is bigger. The data warehouse exists to serve the organization. With good design, the active portion of a data warehouse may fit in the system memory of modern server systems. Current 2-socket systems support 192-256GB, 4-socket server supports 1TB and 8-socket systems support 2TB memory. The cost of 1TB memory, 64 x 16GB ECC DIMMs should be less than $64K.
Another possible strategy is to employ SSD for the main database itself, or the most critical portions of the data warehouse. In an SSD storage system, there is very little difference between large block “sequential” IO and small (8K) block “random” IO. The difference in throughput between 1 page of a (sequential, large block) table scan and a “random” 1 page loop join/lookup IO might be none or 2:1. It also turns out that for data in memory, the actual CPU cost of a table scan per page is approximately the same as 1 key lookup or loop join.
In this case, the strategy should be to employ (sufficiently) covered indexes as necessary. It may even be advantageous to force index in some cases because the optimizer assumes the cost ratio is 4 pages scan equal 1 random IO. Note that the index heavy strategy will result execution plans with loop joins and key lookups. These operations do not scale to high degrees of parallelism, where as the hash join has excellent scaling at very high degrees of parallelism. (It is suspected the reason is each row latches the root level, resulting in contention at DOP 8 and higher. I do not understand why there cannot be just a single latch for the entire query or even one per thread.)