Fusion-io just announced the new ioDrive2 and ioDrive2 Duo in October 2011 (at some conference of no importance).
The MLC models will be available late November and the SLC models afterwards.
See the Fusion-iO
press release for more info.
Below are the Fusion-IO ioDrive2 and ioDrive2 Duo specifications.
The general idea seems to be for the ioDrive2 to match the realizable bandwidth of a PCI-E gen2 x4 slot (1.6GB/s)
and for the ioDrive2 Duo to match the bandwidth of a PCI-E gen2 x8 slot (3.2GB/s).
I assume that there is a good explanation why most models have specifications slightly below the corresponding PCI-E limits.
The exception is the 365GB model at about 50% of the PCI-E g2 x4 limit.
Suppose that the 785GB model implements parallelism with 16 channels and 4 die per channel.
Rather than building the 365GB model with the same 16 channels
but a different NAND package with 2 die each, they just implemented 8 channels using the same 4-die package.
Let's see if Fusion-io explains this detail.
[ioDrive2 specification table: SLC and MLC models; read and write bandwidth (64kB), read and write IOPS (512 byte), read and write access latency, compared against the PCI-E Gen 2 x4 limit.]
Fusion-IO ioDrive2 Duo
[ioDrive2 Duo specification table: SLC and MLC models; read and write bandwidth (64kB), read and write IOPS (512 byte), read and write access latency, compared against the PCI-E Gen 2 x8 limit.]
SLC versus MLC NAND
Comparing the SLC and MLC models, the SLC models have much better 512-byte read IOPS than the MLC models,
with only moderately better bandwidth and read latency.
Not mentioned, but common knowledge, is that SLC NAND has much greater write-cycle endurance than MLC NAND.
It is my opinion that most databases, both transaction processing and DW, can accommodate the MLC NAND
characteristics and limitations in return for the lower cost per TB.
I would consider budgeting a replacement set of SSDs if analysis shows that the MLC life-cycle does not match
the expected system life-cycle.
Of course, I am also an advocate of replacing the main production database server on a 2-3 year cycle
instead of the traditional (bean-counter) 5-year practice.
The difference in read IOPS at 512B is probably not important. If the ioDrive2 MLC models can drive 70K+ read IOPS at 8KB, then it does not matter what the 512B IOPS is.
One point from the press release:
"new intelligent self-healing feature called Adaptive FlashBack provides complete chip level fault tolerance,
which enables ioMemory to repair itself after a single chip or a multi chip failure without interrupting business continuity."
For DW systems, I would like to completely do away with RAID when using SSDs,
instead having two systems without RAID on the SSD units.
By this, I mean fault-tolerance should be pushed into the SSD at the unit level. Depending on the failure rate of the controller, perhaps there could be two controllers on each SSD unit.
For a critical transaction processing system, it would be nice if Fusion could provide failure statistics
for units that have been in production for more than 30 days
(or whatever the infant mortality period is) on the assumption that most environments will spend a certain amount
of time to spin up a new production system.
If the failure rate for a system with 2-10 SSDs is less than 1 per year,
then perhaps even a transaction processing system using mirroring for high-availability can also do
without RAID on the SSD?
ioDrive2 and ioDrive2 Duo
I do think that it is great idea for Fusion to offer both the ioDrive2 and ioDrive2 Duo product lines
matched to PCI-E gen2 x4 and x8 bandwidths respectively.
The reason is that server systems typically have a mix of PCI-E x4 and x8 slots
with no clear explanation of the reasoning for the exact mix,
other than perhaps the mix demanded by the customer complaining the loudest.
By having both the ioDrive2 and the Duo, it is possible to fully utilize the bandwidth from all available slots.
It would have been an even better idea if the Duo were actually a daughter card that plugs onto the
ioDrive2 base unit, so that the base model could be converted to a Duo,
but Fusion-io apparently neglected to solicit my advice on this matter.
I am also inclined to think that there should be an ioDrive2 Duo MLC model
at 1.2TB, on the assumption that its performance would be similar to the 2.4TB model,
just as the ioDrive2 785GB and 1.2TB models have similar performance specifications.
The reason is that a database server should be configured with serious brute force IO capability,
that is, all open PCI-E gen 2 slots should be populated.
But not every system will need the x8 slots populated with the 2.4TB MLC model,
hence the viability of a 1.2TB model as well.
If Fusion-io should be interested in precise quantitative analysis of SQL Server performance, instead of the rubbish whitepapers put out by typical system vendors,
well, I can turn around a good performance report very quickly. Of course I would need to keep the cards a while for continuing analysis...
Over the last two years, I have stood up several proof-of-concept
(POC) database server systems with consumer grade SSD storage at a cost of $2-4K per TB.
Of course production servers are on enterprise class SSD, Fusion-io and others, typically $25K+ per TB. (There are some special
situations where it is viable to deploy a pair of data warehouse servers with consumer grade PCI-E SSDs.)
PCI-E SSDs - OCZ RevoDrive, RevoDrive X2, & RevoDrive 3 X2
The first POC system was a Dell T710 with 2 Xeon 5670
processors, 96GB (12x8GB) memory, 16x10K SAS HDDs and 6 OCZ RevoDrive (original version)
PCI-E SSDs supporting table scans at nearly 3GB/s. The most difficult query repeatedly
hashed a large set of rows (as in there were multiple large intermediate result sets) generating
extremely heavy tempdb IO. With tempdb on 12 10K HDDs, the query time was 1
hour. With tempdb on the 6 OCZ RevoDrives, the query time was reduced to 20min.
Before SSDs became viable, I would normally have configured
a 2-socket system with 48 (2x24) 15K HDDs, with one RAID controller for each
24-disk enclosure. This setup costs about $11K per enclosure with 24x146GB 15K
SAS drives and can be expected to deliver 4GB/s sequential bandwidth, 10K IOPS at
low queue, low latency (200 IOPS per 15K disk) and in the range of 15-20K IOPS
at high queue, high latency. As it was my intent to deploy on SSD, I only
configured 16 HDDs in the internal disk bays and did not direct the purchase of external HDDs.
The 6 OCZ RevoDrive 120GB PCI-E SSDs in the POC system cost about
$400 each at the time (now $280?). I recall that the tempdb IO traffic was
something like 40K IOPS (64KB), around 2.5GB/s bandwidth. This was consistent with
the manufacturer's specifications of 540MB/s read and 480MB/s write at 128K IO,
and considering that there will be some degradation in aggregating performance
over 6 devices. The IO latency was somewhere in the range of 40-60ms (note that the
SQL Server engine issues tempdb IO at high queue depth). OK, so the real purpose of
the POC exercise was to tell the SAN admin in no uncertain terms that the
350MB/s from his $200K iSCSI storage system (4x1GbE) was pathetic, and even the
700MB/s on 2x4Gbps FC ports does not cut the mustard in DW.
The next set of systems was ordered with 4 OCZ RevoDrive X2,
160GB (<$500 each). There was some discussion on whether to employ the OCZ enterprise
class ZDrive R3, but this product was cancelled and the OCZ substitute, the
VeloDrive (4 SandForce 1565 controllers, rated for ~1GB/s), was not yet
available. I was expecting somewhat better performance for 4 RevoDrive X2 (4 SandForce
1222 controllers each, rated for 700MB/s) over 6 of the original RevoDrives (2 SandForce
controllers each). The tempdb IO intensive query
that took 20min with the 6 RevoDrives now ran in 15min with the 4 RevoDrive X2s.
In addition, IO latency was under 10ms.
I was hoping to test the new OCZ RevoDrive 3 X2 with 4
SandForce 2281 controllers, rated for 1500MB/s read and 1200MB/s write.
Unfortunately there is an incompatibility with the Dell T110 II with the E3-1240
(Sandy Bridge) processor which has a new UEFI replacing the BIOS. OCZ does not
provide server system support on their workstation/enthusiast products.
Hopefully Dell will eventually resolve this.
SATA SSDs - OCZ Vertex 2, Vertex 3 & Vertex 3 Max IOPS, Crucial C300 & m4
My preference is to employ PCI-E rather than SATA/SAS SSD
devices. This is mostly driven by the fact that disk enclosures reflect the IO
capability of HDDs, with 24 bays on 4 SAS lanes. An SSD oriented design should
have 4 SSDs on each x4 SAS port. Of course, 4 SSDs and 4-8 HDDs on each x4 SAS
port is also a good idea.
So I have also looked at SATA SSDs. Earlier this year, I
started with the OCZ Vertex 2 and Crucial C300 SSDs. After encountering the
issue with the RevoDrive 3 on the new Dell server, I acquired OCZ Vertex 3,
Vertex 3 Max IOPS, and Crucial m4 SATA SSDs. The OCZ Vertex 2 has a 3Gbps
interface; the Vertex 3 and both the Crucial C300 and m4 support 6Gbps SATA.
The OCZ Vertex SSDs use SandForce controllers; the Vertex 2 uses
the previous generation SandForce 1222 and the Vertex 3 uses the current generation
2281 controller. The Crucial SSDs use Marvell controllers (both?). Perhaps the
significant difference between the OCZ Vertex and Crucial SSDs is that the
SandForce controllers implement compression. The OCZ Vertex SSDs have far
better write performance with compressible data, but are comparable for
incompressible data. It does appear that SQL Server tempdb IO is compressible and
benefits from the compression feature.
Another difference is that OCZ offers 60, 120, 240 and 480GB
capacities while Crucial offers 64, 128, 256 and 512GB capacities. All
capacities are in decimal, that is, 1GB = 10^9 bytes. Both OCZ 60GB and Crucial
64GB presumably have 64GB of NAND flash, the 64GB being binary, meaning 1GB =
1024^3 bytes, or 7.37% more than 1GB decimal. Basically, OCZ has more over-provisioning
than the Crucial, which in theory should also contribute to better write
performance. (Earlier Vertex drives had 50 and 100GB capacities. But there are so
many varieties of the Vertex 2 that I cannot keep track.)
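Roughly, assuming both the OCZ 60GB and the Crucial 64GB carry 64GiB (binary) of raw NAND:
64 x 1024^3 bytes = 68.7GB (decimal)
OCZ 60GB: (68.7 - 60) / 60 = about 14.5% spare capacity
Crucial 64GB: (68.7 - 64) / 64 = about 7.4% spare capacity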
In brief, the performance difference between SSD
generations, both from the OCZ Vertex 2 to Vertex 3 and from the Crucial C300
to m4, is substantial, so I will focus mostly on the newer Vertex 3 and m4
drives. The performance observed in SQL Server operations seemed to be consistent
with manufacturer specifications for both generations of OCZ and Crucial SSDs. I
did not note whether compressed SQL Server database tables were further
compressible by the SandForce controller when written. This may be because it is
difficult to achieve the high write performance necessary to stress modern SSDs
in SQL Server with transactional integrity features.
Test System - Dell T110 II, Xeon E3 quad-core Sandy Bridge processor
The test system is a Dell PowerEdge T110 II, Xeon E3-1240
3.30GHz quad-core processor (Sandy Bridge) with 16GB memory. This system has 2
PCI-E Gen2 x8 and 1 Gen 1 x4 slot. All SSDs were attached to a LSI MegaRAID SAS
8260 controller (PCI-E Gen2 x8, 8 6Gbps SAS ports). I did some testing with
2 SSDs on the SATA ports (3Gbps) but did not make detailed observations.
Incidentally, the cost of this system (processor, memory and 1 SATA HD) was $1078(?). The 128GB SSDs were about $220 ($268 for the Max IOPS). So a very capable system with 2 SSDs could be built for $1300-1500 (64GB or 128GB SSDs). A better configuration with 2 SATA HDDs, 4 SSDs and a SAS controller would push this to $2500. But if Dell and OCZ could resolve the RevoDrive 3 - UEFI issue, then I would recommend the T110 II, E3 processor, 16GB memory, 2 SATA HDDs and 1 RevoDrive 3 X2.
One unfortunate aspect of this system is that
the SATA ports are all 3Gbps per the Intel C202 PCH, even though the ever so
slightly more expensive C204 supports 6Gbps SATA on 2 ports. Basically, this
has similar characteristics as the database laptop with super IO that I
proposed earlier, except that the laptop would be 2.5GHz to keep power consumption in check.
Performance tests with the TPC-H SF 100 database
With 8 SSDs (2 Vertex 3, 2 Vertex 3 MaxIOPS, 2 m4, and 2 C300)
I was able to generate 2.4GB/s in a table scan aggregation query, possibly
gated by the 355MB/s rating of the C300s. A configuration consisting of the 4
Vertex 3 and 2 m4's would have been gated by the 415MB/s rating of the m4. If I
can get a total of 8 Vertex 3s, which are rated at 550/500MB/s for compressible
and incompressible data respectively, then I would be limited either by the adapter or by the
PCI-E Gen2 x8 limit of 3.2GB/s. There is an LSI SAS8265 adapter with a dual-core controller
that has even higher IOPS capability, but it is not known whether this is
necessary for large block IO.
The tests consisted of running the TPC-H queries, single
stream (but not per official benchmark requirements).
The figure below shows the time to run the 22 queries (excluding statistics, parse and compile) for 2, 4 and 6 SSDs with no data compression (raw) and with page mode data compression.
Run time on 2 OCZ Vertex 3 (regular) SSDs was 815 sec with compression and 936 sec raw (w/o compression).
On 4 OCZ Vertex 3 SSDs (2 regular, 2 MaxIOPS)
total query times were reduced to 658 sec with compression and 622 sec raw.
On 6 SSDs, 4 OCZ and 2 Crucial m4, total query times were 633 sec with compression and 586 sec raw.
The figure below shows tempdb IO write latency for 2, 4 and 6 SSDs, with raw and compressed tables.
On the 2 OCZ SSDs, IO latency from fn virtual file stats (for the entire run)
averaged 30ms (temp write) and 90ms (data read) with compression
and 60ms (temp write) - 130ms (data read) without compression.
The performance with 2 Crucial m4 drives was less, showing much higher write latencies.
On 4 OCZ Vertex 3 SSDs, IO latency was 14ms temp write and 30ms data read with compression
and 18-60ms without compression. The IO latencies on the Max IOPS models were lower than on
the regular Vertex 3 models.
For 6 SSDs, IO latencies are now down to the 15ms range, with somewhat higher
latency on the Crucial m4 SSDs.
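For reference, these per-file latencies came from the virtual file stats function; a minimal sketch of the kind of query involved, written against the DMV form (sys.dm_io_virtual_file_stats), is below.

-- cumulative averages since instance start; snapshot before and after a run
-- and difference the counters to isolate a single test
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS file_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY database_name, file_name;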
With data and tempdb on 2 OCZ Vertex 3 SSDs, performance was decent but IO constrained. Performance was 15% better with data compression (page) than without, even though CPU was 23% higher.
Performance with 4 OCZ Vertex 3 SSDs (2 regular, 2 Max IOPS)
was 20% better for compression on and 34% better without compression, relative
to performance with 2 SSDs. The performance without compression was now 6% better than with compression.
At 6 SSDs (4 Vertex 3, 2 m4), there was another 5%
performance improvement relative to 4 SSDs, for both compressed and uncompressed data.
In the above tests, each SSD was kept as a standalone disk, i.e., I did not use RAID.
There was 1 data and 1 tempdb file on each SSD.
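As an illustration of that layout (the drive letters, file names and sizes below are hypothetical, not the exact files used in these tests): one data file per SSD volume in a single filegroup, plus one tempdb data file per SSD volume.

-- one data file per SSD volume
CREATE DATABASE TPCH100
ON PRIMARY
  (NAME = tpch_1, FILENAME = 'E:\tpch_1.mdf', SIZE = 30GB),
  (NAME = tpch_2, FILENAME = 'F:\tpch_2.ndf', SIZE = 30GB),
  (NAME = tpch_3, FILENAME = 'G:\tpch_3.ndf', SIZE = 30GB),
  (NAME = tpch_4, FILENAME = 'H:\tpch_4.ndf', SIZE = 30GB)
LOG ON
  (NAME = tpch_log, FILENAME = 'C:\tpch_log.ldf', SIZE = 10GB);

-- one tempdb data file per SSD volume (in addition to the default tempdev)
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'F:\tempdb2.ndf', SIZE = 10GB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'G:\tempdb3.ndf', SIZE = 10GB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev4, FILENAME = 'H:\tempdb4.ndf', SIZE = 10GB);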
I noticed that the uncompressed (raw) database tended to generate 64K or 128K IO,
while the compressed database tended to have 256K IO.
Two queries, 17 and 19(?) generated 8KB IO,
and would have much better performance with data in memory.
There was also wide variation from query to query in whether performance was better with or without compression.
Today, it is actually possible to build a highly capable database
system in a laptop form factor. There is no point to running a production database
on a laptop. The purpose of this is so that consultants (i.e., me) can
investigate database performance issues without direct access to a full sized
server. It is only necessary to have the characteristics of a proper
database server, rather than be an exact replica.
Unfortunately, the commercially available laptops do not
support the desired configuration, so I am making an open appeal to laptop
vendors. What I would like is:
1) Quad-core processor with hyper-threading (8 logical processors),
2) 8-16GB memory (4 SODIMM so we do not need really expensive 8GB single rank DIMMs)
3) 8x64GB (raw capacity) SSDs on a PCI-E Gen 2 x8
interface (for the main database, not the OS)
- alternatively, 2-4 x4 externally accessible PCI-E ports for external SSDs
- or 2 x4 SAS 6Gbps ports for external SATA SSDs
4) 2-3 SATA ports for HDD/SSD/DVD etc for OS boot etc
5) 1-2 eSATA ports
6) 2 x 1GbE ports
Below is a representation of the system, if this helps clarify.
The Sandy-Bridge integrated graphics should be sufficient, but
high-resolution 1920x1200 graphics and dual-display support are desired. (I could live with less.)
There should also be a SATA hard disk for the OS (or SATA SSD
without the 2.5in HDD form factor if space constrained) as the primary SSD array should be dedicated to the database.
Other desirable elements would be 1 or 2 eSATA ports, to support backups and restores without consuming the valuable main SSD array,
and 2 GbE ports (so I can test code for parallel network transfers).
The multiple processor cores allow parallel execution plans.
Due to a quirk of the SQL Server query optimizer, 8 or more logical processors are more
likely to generate a parallel execution plan in some cases.
Ideally, the main
SSD array is comprised of 2 devices, one on each PCI-E x4 channel.
The point of the storage system is to demonstrate 2GB/sec+
bandwidth, and 100-200K IOPS. One of the sad facts is that even today storage
vendors promote $100K+ storage systems that end up delivering less
than 400-700MB/s bandwidth and less than 10K IOPS.
So it is important to demonstrate what a proper database storage system should be capable of.
Note that it is not necessary to have massive memory.
A system with sufficient memory and a powerful storage system can run any query, while a system with very large memory but weak storage can only run read queries that fit in memory. And even if data fits in memory, the performance could still fall off a cliff on tempdb IO.
Based on component costs, the
laptop without PCI-E SSD should be less than $2000, and the SSD array should be
less than $1000 per PCI-E x4 unit (4x64GB).
It would really help if the PCI-E SSD could be powered off from software, i.e., without having to remove it. This is why I want to boot off the SATA port, be it HDD or SSD.
As explained below, 2 SSDs on SATA ports do not cut the mustard.
The spec above calls for 8 SSDs. Each SSD is comprised of 8 NAND packages, and each
package is comprised of 8 die. So there are 64 die in one SSD, and IO is
distributed over 8 SSDs, or a total of 512 individual die.
The performance of a single NAND die is nothing special and even pathetic on
writes. However, a single NAND die is really small and really cheap. That is
why it is essential to employ high parallelism at the SSD unit level. And then,
employ parallelism over multiple SSD units.
An alternative solution is for the laptop to expose 2-4 PCI-E x4 ports (2 Gen 2
or 4 Gen 1) to connect to something like the OCZ IBIS, or an SAS controller
with 2 x4 external SAS ports.
The laptop will have 1 Intel quad-core Sandy-Bridge processor, which has 2 memory channels supporting 16GB of dual-rank DDR3 memory. The processor has 16 PCI-E gen 2 lanes, DMI g2 (essentially 4 PCI-E g2 lanes) and integrated graphics. There must be a 6-series (or C20x) PCH, which connects upstream on the DMI. Downstream, there are 6 SATA ports (2 of which can be 6Gbps), 1 GbE port, and 8 PCI-E g2 lanes. So on the PCH, we can attach 2 HDDs or SSDs at 6Gbps, plus support 2 eSATA connections. There is only a single 1GbE port, so if we want 2, we have to employ a separate GbE chip.
While the total PCH downstream port bandwidth exceeds the upstream, it is ok for our purposes to support 2 internal SATA SSDs at 6Gbps, 2 eSATA ports and 2 GbE, plus USB etc. The key is how the 16 PCI-E gen 2 lanes are employed. In the available high-end laptops, most vendors attach a high-end graphics chip (to all 16 lanes?). We absolutely need 8 PCI-E lanes for our high performance SSD storage array. I would be happy with the integrated graphics, but if the other 8 PCI-E lanes were attached to graphics, I could live with it.
The final comment (for now) is that even though it is possible to attach more than 2 SSDs off the PCH, we need the bandwidth on the main set of PCI-E ports. It is insufficient for all storage to be clogging the DMI and PCH.
Thunderbolt is 2x2 PCI-E g2 lanes, so technically that's almost what I need (8 preferred, but 6 acceptable).
What is missing from the documentation is where Thunderbolt attaches.
If directly to the Sandy Bridge processor (with a bridge chip for external?), then that's OK;
if off the PCH, then that is not good enough for the reasons I outlined above.
Also, we need serious SSDs to attach off TB; does the Apple SSD cut the mustard?
The diagram below shows the Thunderbolt controller connected to the PCH,
but also states that other configurations are possible.
The problem is that most high-end laptops are designed with high-end graphics,
which we do not want squandering all 16 PCI-E lanes.
A Thunderbolt controller attached to the PCH is capable of supporting x4 PCI-E gen 2, but cannot also simultaneously support saturation volume traffic from internal storage (SATA ports), and network (not to mention eSATA). I should add that I intend to place the log on the SATA port HDD/SSD, along with the OS, hence I do not want the main SSD array generating traffic over the DMI-PCH connection.
A Thunderbolt SDK is supposed to be released very soon, so we can find out more.
I am inclined to think that Thunderbolt is really a docking station connector, being able to route both video and IO over a single connector. If we only need to route IO traffic, then there are already 2 very suitable protocols for this, i.e., eSATA for consumer, and SAS for servers, each with a decent base of products. Of course, I might like a 4 bay disk enclosure for 2.5in SSDs on 1x4 SAS, or an 8-bay split over 2 x4 ports. Most of the existing disk enclosures carry over from hard disk environment, with either 12-15 3.5in bays or 24-25 2.5in bays.
There is a feature called index skip scan that has been in Oracle since version 9i.
When I came across this, it seemed like a very clever trick, but not a critical capability.
More recently, I have been advocating DW on SSD in appropriate situations,
and I am thinking this is now a valuable feature in keeping the number of nonclustered
indexes to a minimum.
Briefly, suppose we have an index with key columns: Col1,
Col2, in that order.
Obviously, a query with a search argument (SARG) on Col1 can use this index,
assuming the data distribution is favorable.
However, a query with the SARG on Col2
but not Col1 cannot use the index in a seek operation.
Now suppose that the cardinality of Col1,
(the number of distinct values of Col1), is relatively low.
The database engine could seek on each distinct value of Col1
combined with the specified SARG on Col2.
Microsoft SQL Server currently does not have the Oracle Index Skip-Scan feature,
but the capability can be achieved with a work-around.
In this example, the LINEITEM table has a cluster key leading with a date column
but does not have an index leading with L_ORDERKEY.
Our query is to find a specific Order Key in the LineItem table.
If there is a table with the distinct date values, DimDate,
we could force a loop join from the DimDate table to LineItem
(even though only columns from LineItem are required)
to get the execution plan below.
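A minimal sketch of the work-around, assuming LINEITEM is clustered with the ship date as the leading key and a DimDate table with one row per distinct date (the DimDate table and its DateValue column are illustrative, not part of the standard TPC-H schema):

-- force the join order and a loop join so that LINEITEM is seeked once per date
SELECT li.L_ORDERKEY, li.L_LINENUMBER, li.L_QUANTITY, li.L_EXTENDEDPRICE
FROM dbo.DimDate AS d
INNER LOOP JOIN dbo.LINEITEM AS li
  ON li.L_SHIPDATE = d.DateValue       -- one seek per outer (date) row
WHERE li.L_ORDERKEY = 12345678         -- the SARG on the non-leading column
OPTION (FORCE ORDER);
-- each outer row produces a clustered index seek on the date value, with the
-- order key applied in the same seek when it is the second key of the clustered
-- index, emulating Oracle's index skip-scan without an extra nonclustered index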
The question is now: how effective is this technique?
The most efficient execution plan is of course, to have an index leading with the Order Key column.
But the situation calls for keeping the number of nonclustered indexes to an absolute minimum.
So how does the above execution plan compare with a table scan?
A table scan, in this type of query, where only a few rows meet an easy to evaluate SARG,
might run at about 1GB/s per core.
Note this is far higher than the 200MB/sec cited in the Microsoft Fast Track Data Warehouse
documents. This is because the FTDW baseline is a table scan that aggregates several columns
of every row. And not only that, a Hash Match is also required to group the results.
Basically, a needle in haystack table scan runs much faster than the more complex
aggregate and group scan.
At most, a 1GB table scan might be acceptable for a non-parallel execution plan
and even a 50GB table scan could be tolerable on a powerful 32-core system
with an unrestricted parallel execution plan.
A loop join can run somewhere in the range of 100,000-200,000 seeks/sec to the inner source.
Realistically, a Data Warehouse with 10 years of data has 3,652 distinct days (depending on the leap year situation).
A loop join with 3,650 rows from the outer source should run somewhere around 36ms.
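(Roughly: 3,650 seeks / 100,000 seeks per sec = 36.5ms, or closer to 18ms at 200,000 seeks per sec.)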
Even if the DW had 20 years of data, this is still acceptable, on the assumption that
searches on the non-lead column Order Key are in the minority,
with the plus being that one less index on the big table is required.
If the query could be bounded to a single year or quarter,
then we are approaching the efficiency of having the extra nonclustered index.
Last week Intel announced the 10-core Xeon E7-x8xx series (Westmere-EX), superseding the Xeon 6500 and 7500 series (Nehalem-EX).
The E7 group consists of the E7-8800 series for 8-way systems, the E7-4800 series for 4-way systems
and the E7-2800 series for 2-way systems. Intel also announced the E3-12xx series (Sandy Bridge) for 1-socket servers,
superseding the Xeon 3000 series (Nehalem and Westmere).
This week at Intel Developer Forum Beijing, Intel has a slide deck on Sandy Bridge-EP,
an 8-core die that will presumably be the Xeon E5-xxxx series superseding the Xeon 5600 series (Westmere-EP), scheduled for 2H 2011.
|Previous generation||New generation|
|Xeon 6500/7500, 4-8 cores||E7-8/4/2800, 6-10 cores|
|Xeon 5600, 4-6 cores||E5-xx00, up to 8 cores|
|Xeon 3x00, 2-6 cores||E3-1200, 2-4 cores|
The top-of-the-line Xeon E7-8870 is 10-core, 2.4GHz (max turbo 2.8GHz) and 30M last level cache,
compared with Xeon X7560 8-core, 2.26GHz (turbo 2.67GHz) and 24M LLC.
HP ProLiant DL580 G7 TPC-E results for the 4-way Xeon E7-8870 and X7560 are 2,454.51 and 2,001.12 respectively.
This is a 22% gain from 25% more cores and 6% higher frequency, in line with expectations.
IBM System x3850 X5 TPC-H results at scale factor 1TB for the 4-way Xeon X7560 and 8-way Xeon E7-8870 are below.
It is unfortunate that a direct comparison (with same number of processors and at the same SF)
between the Xeon E7-8870 and X7560 is not available.
The presumption is that the Xeon E7-8870 would show only moderate improvement over the X7560.
This is because the TPC-H is scored on a geometric mean of the 22 queries,
of which only some benefit from very high degree-of-parallelism.
The more modest performance gain from Nehalem-EX to Westmere-EX, compared to the previous 40% per year objective, is probably an indication
of the future trend in the pace of performance progression.
The pace of single core performance progression slowed several years ago.
Now, the number of cores per processor socket also cannot be increased at a rapid pace.
Fortunately, the compute power available in reasonably priced systems is already so outstanding
that the only excuse for poor performance is incompetence on the software side.
My expectation is that transaction processing performance can still be boosted significantly
with more threads per core. The IBM POWER7 implements 4 threads per core and the Oracle/Sun SPARC T3 implements 8 threads per core. It is unclear if Intel intends to pursue this avenue.
Data Warehouse performance could be increased with columnar storage, already in Ingres VectorWise
and coming in the next version of SQL Server.
Scale out is now available in PDW (EXASOL has TPC-H results with 60 nodes).
I am also of the opinion that SIMD instruction set extensions for the row & column offset calculation
could improve database engine performance. The objective is not just to reduce the number of instructions,
but more importantly to make the memory access sequence more transparent, i.e., allow for effective prefetching.
At the system level, the processor interconnect technology (AMD Hyper-Transport and Intel QPI)
should also allow scale-up systems. HP has mentioned that 16-way Xeon is possible.
HP already has the crossbar technology from their Itanium based Superdome and the sx3000 chipset.
It is probably just a matter of gauging the market volume of the 8-way ProLiant DL980 to assess
whether there is also a viable market for 16-way Xeon systems.
Another observation is the price structure of 2-way and 4-way systems.
It used to be that there was very little price difference between 1-way and 2-way systems with otherwise comparable features.
So the default system choice frequently started with a 2-way system and higher.
On the downside, the older 2-way systems also did not have substantially better memory or IO capability.
With the 2-way Xeon 5500 and 5600 systems, there is a more significant price gap between 1-way and 2-way systems.
However, the 2-way Xeon 5500 systems also have serious memory and IO capability.
So the low-end entry choice now needs to revert to a single socket system.
The price gap with 4-way systems has also grown along with capabilities, particularly in memory capacity and reliability.
The default system upgrade choice should be to replace older 4-way systems with new generation 2-way systems.
The new 4-way systems should target very-high reliability requirements.
Seagate today announced a near complete overhaul of their enterprise product line.
This includes the second generation SSD, now with either SAS or SATA interfaces.
The first generation Pulsar SSD only supported the SATA interface.
The new 2.5in 15K and 10K hard drive models have higher capacity.
The 2.5in 7.2K hard drive was upgraded to 1TB last month(?).
The 7.2K 3.5in is now available up to 3TB.
All models support 6Gbps.
Pulsar SSD (SAS/SATA interface, 2.5in FF)
The new second generation Seagate Pulsar SSD comprises two product lines.
The Pulsar XT.2 is based on SLC NAND at 100, 200 and 400GB capacities with SAS interface only.
The Pulsar.2 is based on MLC NAND at 100, 200, 400 and 800GB capacities available in both SAS and SATA interfaces.
Performance specifications cited for the Pulsar XT.2 are 360MB/s read and 300MB/sec write
at 128KB, equivalent to sequential IO in hard disks.
Random 4KB IO rates are 48K IOPS read and 22K IOPS write.
Performance specifications were not cited for the Pulsar.2.
Based on other vendors with both SLC and MLC product lines,
the expectation is that the MLC model should have comparable read performance.
Write performance might be less than the SLC model, but still adequate for almost all requirements.
Savvio 15K 2.5in HDD
The Savvio 15K.3 supersedes the 15K.2 product line, with 300GB and 146GB capacities
replacing the 146GB and 73GB models in the 15K.2 line.
Sequential transfer rate is 202MB/s on the outer tracks and 151MB/s on the inner tracks,
up from 160 and 122MB/s in the 15K.2.
Average read and write seek times are reduced to 2.6/3.1ms, down from 2.9/3.3ms,
resulting in slightly improved random IO performance.
Savvio 10K 2.5in HDD
The Savvio 10K.5 product line features 300, 450, 600 and 900GB capacities (300GB per platter),
up from 450 and 600GB in the 10K.4.
Sequential transfer rate is 168 to 93MB/s, up from 141 to 75MB/s.
Average seek time is 3.4/3.8ms (3.7/4.1 for the 900GB model), down from 3.9/4.5 ms in the earlier model.
Constellation 7.2K 3.5in HDD
The Constellation ES.2 enterprise grade 7200RPM 3.5in 3TB drive is available in both SAS and SATA interfaces.
Sequential transfer rate is 155MB/s. The previous generation ranged from 500GB to 2TB.
Constellation 7.2K 2.5in HDD
The Constellation.2 enterprise grade 7200RPM 2.5in product line features 250GB, 500GB and 1TB capacities,
in both SATA and SAS interfaces (SATA only for 250GB).
The 3.5in 15K and 10K product lines have not been refreshed. It is unclear whether there will be future models for these product lines.
Intel released the SSD 510 this month (March 2011), shown as in stock at some web stores.
SATA 6Gbps interface, 120 and 250GB capacities.
Sequential read/write 500/315MB/s. Random read/write 20K/8K IOPS (this seems low).
The OCZ Vertex 3 regular and Pro series, with SATA 6Gbps interface and 2.5in form factor,
have been in the news,
but there is no information on the OCZ website. Correction: the Vertex 3 is not listed under the products section, but it is described in the press release section.
Toms Hardware lists the Vertex 3 with the SandForce 2281 controller, 550/525MB/s sequential, 60K IOPS 4K random and $499 price for the 240GB model. The Vertex 3 Pro with the SF-2582 controller, 550/500MB/s sequential, 70K IOPS and the 200GB model priced at $775.
The OCZ Z-Drive R3 with PCI-E gen 2 interface has been announced, available probably in Apr or May.
Sequential Read/Write at 1000/900MB/s.
Random at 135K IOPS.
No pricing is available on the new Seagate drives. The bare drive pricing (not from system vendors) on the current Seagate 2.5in 15K 146GB is about $200, and the 2.5in 10K 600GB about $400. Given that the consumer grade OCZ Vertex 2 is $430 for the 240GB and $215 for the 120GB, my thinking is that the 15K 146GB drives are no longer viable, even considering the higher price of the Vertex 3 Pro. The 10K 600GB is only barely viable from the capacity point of view. So the new higher capacity drives really need to come in at comparable price points to the current models.
All the Seagate 2.5in devices, SSD and HDD, are 15mm height, standard in server storage bays. The OCZ SSDs follow the common notebook 9.3mm height form factor. Going forward, perhaps there will be a 2U storage enclosure that can accommodate 36 x 9.3mm bays in place of the current 24 x 15mm bays. But an even better idea is to have a new DIMM-sized form factor with the connection on the short side so that a 1U storage enclosure can accommodate 36 of these devices.
Seagate first generation Pulsar
Someone brought up the point that Pulsar, the first generation Seagate SSD, was vaporware. My guess would be that Seagate was late to market without any performance or price advantages over products already available, did not get any design wins with major system and storage players, and hence decided not to launch. So there is concern that the second generation Pulsar might also be vaporware. Stay tuned.
The Register reported a spat between HP and Oracle on the TPC-C benchmark. Per the report, HP submitted a TPC-C result of 3,388,535 tpm-C for their ProLiant DL980 G7 (8 Xeon X7560 processors), with a cost of $0.63 per tpm-C. Oracle has refused permission to publish.
Late last year (2010) Oracle published a result of 30M tpm-C for a 108-processor (socket) SPARC cluster ($30M complete system cost). Oracle is now comparing this to the HP Superdome result from 2007 of 4M tpm-C at $2.93 per tpm-C, calling the HP solution a clunker. The SPARC cluster is comprised of 27 nodes. Each node is 4-socket with 16 cores per socket (64 cores total) and 8 threads per core (512 logical processors total), with 512GB memory. The complete cluster is comprised of 1,728 cores and 11,040 x 24GB SSD drives.
Microsoft wants system vendors to move to the newer TPC-E benchmark. TPC-E results for the ProLiant DL980 or any 8-way Xeon 7500 system from other vendors would be good to see. Oracle has not published any TPC-E benchmark result. The rumor mill has it that Oracle on a single system runs TPC-E just fine. The problem is believed to be that Oracle RAC cannot scale on TPC-E to the degree desired by marketing (even if it is ok by technical standards).
The Microsoft Parallel Data Warehouse diagram was somewhat difficult to understand in terms of the functionality of each subsystem in relation to the configuration of its components. So now that HP has provided a detailed list of the PDW components, the diagram below shows the PDW subsystems with component configuration (InfiniBand, FC, and network connections not shown).
Observe that there are three different ProLiant server models, the DL360 G7, DL370 G6 and the DL380 G7, in five different configurations as suitable for the requirements of each subsystem. There are also up to three different configurations of the P2000 G3 storage system for the Control node cluster, the compute nodes, and the backup node.
The Control nodes are ProLiant DL380 G7 servers with 2 Xeon X5680 six-core 3.3GHz 130W processors, 96GB memory, 14 internal 300GB 10K disks, and an external P2000 G3 with 5x450G 15K disks. The Control nodes parse incoming queries to be reissued to the compute nodes, and also reassemble the results from each node to the client as a single set. This would explain the use of powerful processors and a heavy memory configuration.
The purpose of the 14 internal disks is unclear as one might expect that result sorting takes place on the shared storage, unless this is done outside of SQL Server and also outside of the cluster shared resources. Now that I think about it, this is reasonable. On a cluster failover, there is no need to recover the intermediate results of queries in progress, as they will have to be reissued?
The general idea is to distribute as much query processing to the compute nodes as possible. There are situations that require intermediate data to be brought back to the control node for final processing. Once there are more environments on PDW, there may be an evaluation as to whether a more powerful control node would be suitable, depending on the specific case.
The management nodes are ProLiant DL360 G7 servers with a single Xeon E5620 quad-core 2.4GHz 80W processor, 36GB memory and 2 disks. If the management nodes have light compute and memory requirements, I am inclined to think that this functionality could be consolidated with the other subsystems. But this is not necessarily an important point.
The Landing Zone is a ProLiant DL370 G6 with a single Xeon E5620. (The HP spec sheet also mentions W5580, which is a quad-core 3.2GHz 130W 45nm Nehalem-EP core.) The memory configuration cited is peculiar, with 6x2GB and 6x4GB DIMMs. A single processor Xeon 5600 system should have a limit of 9 DIMMs, three per memory channel. So why not employ 9x4GB DIMMs?
The DL370 chassis accommodates up to 14 LFF (3.5in) disks, eliminating the need for an external storage unit. The Landing Zone actually employs 10x1TB 7.2K and 2x160GB 7.2K HDDs. It is unclear why the LZ system was not configured with 14 disks. It could have also been configured with all 1TB disks, with the OS using a small slice. There are now 2TB and 3TB disks in the 3.5in 7.2K form factor; Seagate has a 2TB enterprise rated 3.5in 7200RPM drive. But it is unclear when these ultra-large capacity disks will be available for server systems, or even whether there is a need for additional capacity in this function.
The Backup node is a ProLiant DL380G7 with 2 Xeon E5620 processors, 24GB memory. Presumably there are 2 internal disks to boot the OS. There is also an external P2000 G3 storage system with sufficient 3.5in drive capacity to support backups assuming 4:1 compression.
The max operational capacity of the compute nodes is 500TB. This would imply that the Backup node could have 125TB net capacity.
The cited max capacity of PDW (with 40 nodes) is 500TB. This is based on the uncompressed data? So the actual net storage is 133TB? A reasonable assumption is that a 133TB database with page compression applied might yield another 2X reduction on a backup with compression?
The maximum configuration for the P2000 G3 is 96 LFF (3.5in) disks, with 7 additional disk enclosures. The P2000 G3 does support the 2TB 7.2K drive, so the P2000 with 5 additional disk enclosures totaling 72x2TB disks would meet the backup capacity requirement.
The Compute nodes are ProLiant DL360 G7 servers with 2 Xeon X5670 six-core 2.93GHz 95W processors, 96GB memory, 8 internal 300GB 10K HDDs and one external P2000 G3 with the option of 11x300GB 15K, 11x1TB 7.2K or 24x300GB 10K drives. The external storage unit is for permanent data storage. The internal disks are for tempdb.
There are now 600GB 3.5in 15K and 600GB 2.5in 10K drives, so it is possible that these will replace two of the current options in the near future. A single (42U) rack must support 10 compute nodes (plus 1 spare), 10 storage nodes, and the associated switches (2 InfiniBand, 2 Ethernet, 2 Fiber Channel). This precludes a 2U form factor for the compute node. The 1U DL360 G7 cannot support the 130W thermal envelope for each of 2 Xeon X5680 processors, so the 95W Xeon X5670 processors are employed instead.
If tempdb is actually on the 8 internal drives, then I wonder why the P2000 storage units employ RAID-10. Writes to permanent data are expected to be infrequent, negating the need for small block random write IO performance (the most serious liability of RAID-5). Only tempdb activity is expected to generate non-sequential IO.
Without more than very brief hands-on time with PDW, I have two comments at this time. One is that the P2000 G3 supports 1,572MB/s bandwidth. Ideally, for 24 disks per storage unit, we would like to target 2GB/s, possibly somewhat more. Hopefully the next generation HP entry level storage system will employ the Intel C5500 processor (or successor) or some comparable processor with adequate memory and IO bandwidth. I have heard that the desire is also to move the storage interface from FC to InfiniBand.
The second comment is that SSD could be considered for tempdb. The 8 internal 300GB 10K drives might cost $2000 (or whatever OEM volume pricing is). The cost of a PCI-E 1TB consumer grade SSD is approaching $2-3K. An enterprise grade 1TB SSD is higher depending on the vendor.
The maximum PDW configuration is 40 compute nodes. With the 24-disk storage units (300GB decimal, 278GB binary), there are 960 disks, excluding the capacity of disks locally attached in the compute node internal bays. Net RAID-10 capacity is then 133,440GB binary, which could correspond to 500TB uncompressed capacity. The 40 compute node limit may be a connectivity limit. At some point, the 600GB 2.5in 10K drive should become available for the PDW system, doubling capacity. I am also wondering what would happen if an actual multi-unit PDW customer asked MS for additional external storage units to be daisy chained.
On further consideration, if I can mix HDD and SSD on a SAS RAID controller on the compute node, I would go with 2 HDDs for the OS and 4-6 SATA/SAS SSDs, plus 1 PCI-E SSD in the open PCI-E slot.
Let's figure $10K each for the heavy config servers, $15K for the storage units, and $10K for each of the IB and FC switches; call it $500K HW cost for the control rack plus one 10-node compute rack. There are 20 SQL Server processor licenses plus other components, so probably another $500K here. It will probably also involve another $500K in consulting for the deployment, so maybe $2M all told. I am thinking gold electro-plating on the rack might cost $10K.
Compare this with an 8-way ProLiant DL980 environment. Let's just suppose the PDW with 20 Xeon 5600 sockets has 2.5X the performance of 8 Xeon 7500 sockets. The DL980 with memory should cost around $150K; direct-attach storage (8 enclosures of 24 disks) at $12.5K each comes to $100K. Throw in another $50K of hardware to resemble the PDW. SQL EE licenses for 8 sockets are $240K. Suppose the consulting services to deploy on the SQL Server we are already familiar with are $100K. This brings the total to $700K. So $2M for 2.5X the performance of the $700K system seems reasonable, considering the heavily super-linear cost structure of scaling.
Also, figure an Oracle Database Machine with 8 RAC nodes and 14 Exadata Storage nodes will also cost around $2M (somebody please look this up).
IBM just published a TPC-H SF 1000 result for their
4-way Xeon 7560 system featuring a special MAX5 memory expansion board to support 1.5TB memory. In Dec 2010, IBM also published a TPC-H SF 1000 result for their Power 780 system, 8-way, quad-core (4 logical processors per physical core).
The table below shows TPC-H SF 1000 results for the 8-way 6-core Opteron 8439 on SQL Server and Sybase, the 16-way quad-core Itanium 9350 on Oracle, the 4-way Xeon 7560 on SQL Server and the 8-way POWER7 on Sybase. On TPC-H Power (single stream), the 4-way Xeon on SQL Server is competitively placed relative to the 16-way Itanium and 8-way POWER7 systems. In other words, an 8-way Xeon might be comparable to the 8-way POWER7. If there is a weak point in SQL Server, it is in the throughput test (multiple concurrent query streams). This aspect is probably something that could be corrected. Unfortunately, it is probably not a priority for the SQL Server team at this time.
TPC-H SF 1000 Results for HP DL785 and Integrity Superdome servers
[Chart: TPC-H SF 1000 results for the HP DL785 G6 (two results), HP Integrity Superdome 2, IBM x3850 X5 and IBM Power 780.]
Additional details are below.
The two IBM results employ SSD storage. The older results are on HDD storage.
In addition, the IBM x3850 X5 (with Xeon 7560) system is configured with 1.5TB memory.
The total size of the TPC-H SF 1000 database, all tables and indexes, should be 1.4TB.
A storage system (7 SSDs) capable of very high IO rates is still required to handle the intense tempdb activity.
[Configuration details: DL785 G6 (8 x 6 = 48 cores), DL785 G6 (8 x 6 = 48 cores), Superdome 2 (16 x 4 = 64 cores), x3850 X5 (4 x 8 = 32 cores), Power 780 (8 x 4 = 32 cores); databases: SQL Server 2008 R2 EE, Sybase, Oracle 11g R2; storage entries include 8 x 8Gbps FC, 7 PCI-E SSD, 12 PCI-E SAS, 4 EXP12 enclosures, 52 SAS SSD.]
Below are the individual query run times.
TPC-H SF 1000 individual query execution times
Note the wide variation in each query between different systems and database engines.
This could reflect differences in any of:
1) processor and system architecture,
2) memory versus disk, HDD and SSD
3) execution plans
4) the efficiency between component operations (scan, index seek, hash, sort, etc)
and probably other factors as well.
It would be interesting to compare the execution plans between different database engines,
even to force the SQL Server execution plan to one as close as possible to the plans
employed on the other database engines.
The main point of interest is not moderate differences in the overall (geometric mean) performance, but rather the very large differences in certain queries.
The long run time for Q18 should probably be investigated.
Another view: the 4-way Xeon 7560 at SF 1TB with 1.5TB memory + SSD versus the 8-way Xeon at SF 3TB with 0.5TB memory and HDD.
The number of processors is doubled, but the database is 3 times larger.
On this alone, we might expect a 50% difference in query time, with the caveat that there are complications in projecting TPC-H performance at different scale factors.
There are also significant differences in the memory-to-data ratio,
and storage performance characteristics.
TPC-H individual query execution times for 4-way 1TB and 8-way 3TB
On the 8-way system at SF 3TB, Q18 actually runs faster than on the 4-way system at SF 1TB.
But the other larger queries, Q1, 9, and 21, show the expected pattern.
Overall, it does appear that the 3TB query run times are on the order of 50% higher.
Microsoft just released the Fast Track Data Warehouse 3.0 Reference Guide. The new changes are an increased memory recommendation and a change in the disks per RAID group from 2-disk RAID 1 to 4-disk RAID 10.
The earlier FTDW reference architecture cited 4GB memory per core. There was no rationale behind this, but it was felt some rule was better than no rule. The new FTDW RG correctly cites the rationale that more memory helps keep hash join intermediate results and sort operations in memory.
4-Disk RAID 10
FTDW versions 1 & 2 specified 2-disk RAID groups. The reason was that the SAN storage systems could not deliver good sequential IO performance with fat RAID groups (comprised of many drives). For some reason, a thin 2-disk RAID group could deliver 100MB/sec per disk. An EMC whitepaper added that a 3-disk RAID 5 group could also deliver good sequential performance. But RAID 5 does not have the necessary performance characteristics to support tempdb IO requirements. FTDW 3.0 now specifies 4-disk RAID 10 groups. The presumption is that 100MB/s sequential IO per disk is still the objective. No explanation was given why a 4-disk RAID group is now specified. It is possible the firmware in SAN systems can now support good sequential performance with larger RAID groups. But an explanation is still desired.
Apparently the reason for the 4-disk RAID 10 group is to match the PDW configuration. No data was provided as to whether this could still meet the old 100MB/s per disk objective.
The reasons cited for the new memory recommendation in FTDW 3.0 are correct. My suggestion, however, is that rather than cite specific memory configurations which just happen to be reasonable today, apply the underlying principle. The cost of server class ECC memory modules follows a consistent pattern. An 8GB DIMM today is $300 and a 16GB is $850 (from Crucial). The 32GB memory modules are just becoming available, and probably very expensive, say for example $5,000. The premium for the 16GB memory module is small relative to 2x8GB modules.
The low cost of memory relative to the overall environment, and the fact that memory has positive benefits but no specific hard requirement, drives the strategy. Fill the available memory sockets with either 8GB or 16GB memory modules (depending on system constraints). When the premium for 16GB modules disappears, then discontinue the 8GB module from consideration. When the premium for 32GB memory modules becomes less, then both 16 & 32GB modules can be considered.
My preference is still to employ direct-attach storage systems with RAID controllers in the server system. The latest PCI-E SAS RAID controllers can now support 2.8GB/s, just under the net bandwidth of a PCI-E gen 2 x8 slot. Each 6Gbps SAS x4 channel can support 2GB/s. Modern 15K disk drives can deliver 150MB/s on the outer tracks. This infrastructure has an amortized cost structure of less than $700 per disk. The cost structure of the low-end SAN systems specified by FTDW is typically 2X higher per disk. I also prefer to employ more 2.5in 15K HDDs spread over more channels than FTDW specifies, rather than 300GB+ 3.5in 15K HDDs.
Index Light Strategy
FTDW endorses the index light strategy. The presumption is that the data warehouse is very large, in the multi-TB range, too large to fit in system memory. The storage strategy employs hard disk drives because an HDD storage system properly configured with IO distributed over multiple channels and many disk drives can deliver high sequential IO bandwidth economically.
The SQL Server query optimizer cost model is built on the model that table scans run at 10.8MB/s (or 1350 pages/sec) and random IO (from loop joins and key lookups) run at 320 IOPS. The absolute value of each in relation to the actual server is not important. The important point is 1350 pages/s sequential to 320 random IOPS ratio.
The FTDW reference architecture is designed to achieve sequential IO of 100MB/sec per disk, or 12,800 pages/sec. A 3.5in 15K disk can support 175 IOPS (225 for 2.5in 15K) for random IO distributed over the entire drive at queue depth 1. At higher queue depth, 15K disks can support 320 IOPS. So the FTDW reference architecture IO performance characteristic has the ratio 12,800 pages sequential to 320 random IOPS.
There is no provision to adjust the SQL Server query optimizer sequential-to-random ratio. For this reason, FTDW favors avoiding non-clustered indexes so that key lookups and loop joins are not an execution plan option. A good partitioning scheme to help contain the range of scan operations is an essential element of the index light strategy.
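To put numbers on the mismatch, using the figures above:
Optimizer model: 1,350 pages/sec sequential / 320 random IOPS = about 4.2 sequential pages per random IO.
FTDW reference hardware: 12,800 pages/sec sequential / 320 random IOPS = 40 sequential pages per random IO.
So on FTDW hardware, a scan is roughly 10 times cheaper relative to random IO than the optimizer assumes, which is why plans built on key lookups and loop joins are best kept off the table.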
Tempdb on SSD
SQL Server issues IO to tempdb at high queue depth, typically 64K in size, to store the intermediate results of large hash joins and sort operations. The pattern is a write burst eventually followed by reads. A very large disk array can support this, but a small array of SSDs can support extraordinarily high IO rates at lower latency than HDDs.
The next consideration is that if only tempdb is on SSD, there is not a hard requirement to employ enterprise grade SSDs. Depending on the situation, some environment can consider consumer grade SSDs. The difference in cost between consumer and enterprise SSD is more than 10X ($30K per TB for enterprise, less than $3K for consumer, all without RAID). For comparison, an HDD storage designed for performance might cost $10K per TB in RAID 10.
Some HDD storage systems can support SSDs with SATA/SAS interfaces, but not all can support the very high IO rates possible with SSD, so it might be a good idea to employ PCI-E SSDs. This configuration does not support clustering, but clustering is not necessarily a true requirement for data warehouses. Going forward, expect storage systems to finally abandon the under-powered obsolete processor/controllers and switch to powerful Intel Xeon 5600 or later processors. Then intermixing SSD and HDD should not be a problem.
Index Heavy Strategy
There appears to be a tendency to treat data warehouses as a contest of whose is bigger. The data warehouse exists to serve the organization. With good design, the active portion of a data warehouse may fit in the system memory of modern server systems. Current 2-socket systems support 192-256GB, 4-socket servers support 1TB and 8-socket systems support 2TB memory. The cost of 1TB memory, 64 x 16GB ECC DIMMs, should be less than $64K.
Another possible strategy is to employ SSD for the main database itself, or the most critical portions of the data warehouse. In an SSD storage system, there is very little difference between large block “sequential” IO and small (8K) block “random” IO. The difference in throughput between 1 page of a (sequential, large block) table scan and a “random” 1 page loop join/lookup IO might be none or 2:1. It also turns out that for data in memory, the actual CPU cost of a table scan per page is approximately the same as 1 key lookup or loop join.
In this case, the strategy should be to employ (sufficiently) covering indexes as necessary. It may even be advantageous to force an index in some cases, because the optimizer assumes the cost ratio is 4 pages scanned equal 1 random IO. Note that the index heavy strategy will result in execution plans with loop joins and key lookups. These operations do not scale to high degrees of parallelism, whereas the hash join has excellent scaling at very high degrees of parallelism. (It is suspected the reason is that each row latches the root level, resulting in contention at DOP 8 and higher. I do not understand why there cannot be just a single latch for the entire query, or even one per thread.)
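As a minimal illustration of forcing an index (the nonclustered index name here is hypothetical; TPC-H ORDERS columns are used for familiarity):

-- assumes a nonclustered index IX_ORDERS_CUSTKEY on O_CUSTKEY exists;
-- the hint overrides the optimizer's roughly 4:1 sequential-to-random page cost assumption
SELECT o.O_ORDERKEY, o.O_ORDERDATE, o.O_TOTALPRICE
FROM dbo.ORDERS AS o WITH (INDEX(IX_ORDERS_CUSTKEY))
WHERE o.O_CUSTKEY = 123456;
-- FORCESEEK is an alternative hint when the index choice is fine but the access method is not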
Ingres just published a TPC-H benchmark for VectorWise, an analytic database technology employing
1) SIMD processing (Intel SSE 4.2),
2) better memory optimizations to leverage on-chip cache,
3) Column-based storage.
Ingres originated as a research project at UC Berkeley (see Wikipedia) in the 1970s, and has since become a commercially supported, open source database system. Apparently, Ingres project people later founded Sybase. So Ingres, in a sense, is the grandfather (or perhaps grand-step-uncle?) of SQL Server. According to the Ingres whitepaper, VectorWise spun off from CWI in Amsterdam, with some roots in MonetDB? For the time being, Ingres VectorWise is not open-source. The interesting aspect of VectorWise among other column store database products is its use of SIMD, as in utilizing the special instructions and registers built into recent generation microprocessors. (In a previous blog, I proposed special SIMD instructions to improve the efficiency of accessing pages in traditional row-storage database engines.)
Below are TPC-H results at scale factor 100GB on HP ProLiant DL380 G7 with two Xeon 5680 six-core 3.33GHz processors for SQL Server 2008 R2 and Ingres VectorWise 1.5.
Overall, the VectorWise columnar database has about 2.5X higher score on Power, and 5X on Throughput. The glaring weak point of SQL Server is that the throughput is (almost 2X) lower than the power. The SQL Server parallel execution engine is very powerful running a single stream with no other concurrent queries competing for CPU. It does not have a good strategy for sharing compute resources with other concurrent long running queries.
The general expectation set by all the new technology, columnar storage, memory optimization and vector processing was 10X, not 3X. However, we should consider that the TPC-H benchmark score is a composite geometric mean of 22 individual queries. The individual query performance differences between SQL Server and VectorWise are shown below, on a logarithmic scale (base 2, not 10, a cool feature in Excel 2007?).
The full order of magnitude gain is definitely achieved in certain queries (1, 4, 6, & 12), with good gains in additional queries. Many other queries are comparable with row based storage.
Columnar storage is expected in SQL Server 2011. Search for the SQL Server Technical Article: "Columnstore Indexes for Fast Data Warehouse Query Processing in SQL Server 11.0" by Eric Hanson. The Columnstore indexes cannot be updated, so a partitioning strategy to separate active and static data is necessary.
Since we all love calculations: in the SQL Server TPC-H report, Q1 ran in 21.0 sec, presumably with all 12 cores near 100%. There are 6M rows per SF 1GB, of which 98.4% are touched in Q1, so we can work out the following:
| ||Q1 sec||CPU-cycles per row|
|SQL Server 2008 R2||21.0||~1,420|
(21.0 sec x 12 cores x 3.33GHz, divided by about 590M rows touched, works out to roughly 1,420 CPU-cycles per row.)
On my laptop, with 2 Core 2 architecture cores at 2.2GHz, I measured 2944 CPU-cycles per row on Q1. Without the Group By, the cost is 1982 CPU-cycles per row. Just to touch each row (in SELECT COUNT(*)) is 420 CPU-cycles.
The TPC-H SF100 LineItem table, with the 3-byte DATE data type, is 87.5GB (without compression), so Q1 in 21 sec means a 4.17GB/s data consumption rate. The SQL Server TPC-H result was configured with 4 SATA SSDs. I believe these are the 3Gbit/s SATA interface (6Gbit/s models coming soon?), meaning 300MB/sec per drive tops, or 1.2GB/s total for 4 drives. So to achieve Q1 in 21 sec, the data must be in memory. However, with 16 (3Gbit/s) SATA SSDs, or 8 6Gbit/s SATA SSDs, distributed over 2 controllers, 4GB/s is easily achievable.
For VectorWise, given that I have not actually used it myself, I will have to make "reasonable estimations". Let's disregard compression for the moment. Assuming VectorWise employs 8-byte values to store Quantity, ExtendedPrice, Discount, and Tax, 4 bytes for Date, and 1 byte each for the 2 flag values, then VectorWise must process 38 bytes per row, or about 24GB for TPC-H Q1. To support Q1 in 2 seconds, the disk system would need to deliver 12GB/sec. With compression, the rate could probably be brought down to 6GB/s. The actual storage system employed, 16 15K SAS disks over 2 controllers, can deliver 16 x 150MB/s = 2.4GB/s. The 2 x Xeon 5680 (2 IOH) system can easily absorb 6GB/s (and probably 12GB/s if this is really necessary). So to access data on disk at that rate, the storage system should be 40 disk drives over 2 controllers (probably capped at 2.8GB/sec per controller today).
For the last 15 years, Intel and AMD have been progressively adding special-purpose extensions to their processor architectures. The extensions mostly pertain to vector operations following the Single Instruction, Multiple Data (SIMD) concept. The reasoning was that achieving significant performance improvement over each successive generation for the general-purpose elements had become extraordinarily difficult. On the other hand, SIMD performance could be significantly improved with special-purpose registers and instructions. The very large performance gains enabled new applications (mainly in multimedia, graphics and video processing) to become viable over the span of a single processor generation, instead of the typical several generations.
All of this is interesting, but here we are interested in the database engine. What can SIMD and special instruction set architecture extensions do for database performance, in particular the storage engine? The fact that there have been no database related extensions would seem to indicate that SIMD cannot be used in this area. The core of database storage is the page organization of data. A cursory examination of accessing rows in a page would show a series of scalar instructions. There does not seem to be a fit for SIMD processing. If there were special instructions to enable SIMD style coding along with use of the vector registers for database functions, the benefits might not be obvious without a detailed examination of where CPU clock cycles are expended.
Microprocessors since the Intel Pentium Pro era in the 1990s are essentially factory assembly lines. The throughput, in terms of instructions completed, is phenomenal. The time span from an instruction entering the pipeline to its completion, on the other hand, is much longer. Access to memory is longer still, hundreds of CPU cycles away. The key to performance on modern microprocessors is keeping the pipeline running and issuing requests for memory long in advance. A very poor coding practice is a sequence of instructions where each depends on the previous instruction to complete before the next can start. Even worse is a sequence that requires a fetch from memory to determine the next memory access.
Intel started with Multi-Media Extensions (MMX) in the later Pentium processors and the Pentium II, for vector integer processing using 8 MMX registers aliased onto the existing floating-point registers. The next extension was Streaming SIMD Extensions (SSE) in the Pentium III, with 8 dedicated 128-bit registers and instructions operating on 4 packed single-precision floating-point values. The Pentium 4 brought SSE2, which extended the special instructions to double-precision floating point and 1-8 byte integers in the 128-bit SSE registers. Additional instructions were added for specific applications in SSE3 and Supplemental SSE3. Later, SSE4 and AES-NI added string and text processing, and encryption. The latest Intel microprocessor, Sandy Bridge, adds 256-bit AVX registers.
SIMD extensions can improve performance in two aspects. One is a reduction in the number of instructions. The second is the addition of special-purpose registers. Consider adding two vectors. Not only are multiple additions consolidated into a single instruction; the loads preceding the add instruction and the store operations following it are now single vector instructions instead of multiple scalar instructions. A single load instruction fills the vector register with multiple elements. By using special registers for vector operands, contention for the very limited number of general-purpose registers (in the X86 architecture, there are 8 in 32-bit and 16 in 64-bit mode) is reduced. The expectation is that there are fewer temporary memory copies to free up a register. The memory access pattern should be more predictable as well.
SQL Server Table Scan
Consider TPC-H Query 1, which aggregates several columns with a group by clause. (The group by clause, in light of the existing indexes, necessitates a Hash Aggregate operation that is more expensive than the simpler Stream Aggregate.) On the Xeon 5680 processor, the performance is approximately 350MB/s per core, corresponding to 44,800 pages per second. The LineItem table is 138 bytes per row (with the 3-byte date data type), so the row rate is 2,659,431 per second. This corresponds to 1,240 CPU-cycles per row (though the TPC-H calculation above suggests 1,400 CPU-cycles per row is the better figure). The cost is much lower for an aggregate without the group by; even then, the cost per row aggregated is still several hundred CPU-cycles.
In a 2008 blog, I reported a cost of approximately 1 CPU micro-second for the page handling overhead, based on a Core 2 processor at 2.66GHz (or 2,670 CPU-cycles), using a table with 1 row per page and the entire table in memory. The cost for a SELECT COUNT(*) was approximately 0.05 micro-seconds for each additional row per page, or about 133 cycles. Aggregating columns incurs additional cost (I will provide this later).
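A sketch of that style of measurement (my own reconstruction, not the original 2008 test; adjust the row count so the table fits in memory):

    -- Pad the row so only one fits per 8KB page, isolating per-page overhead.
    CREATE TABLE dbo.OneRowPerPage (
        ID INT NOT NULL PRIMARY KEY,
        Padding CHAR(7500) NOT NULL DEFAULT ''
    );
    INSERT INTO dbo.OneRowPerPage (ID)
    SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b;

    SET STATISTICS TIME ON;
    -- Run twice; the second run is entirely from the buffer cache.
    SELECT COUNT(*) FROM dbo.OneRowPerPage;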
Hypothetical Page Access Instruction Sequence
Consider what the code sequence might look like after the page is in memory, appropriate locks and latches have been taken, and the page header has been read. (See Paul Randal and the SQL Server Storage blog for information on SQL Server page structure.) It is assumed that the column offsets, sizes and types have been read from the system tables, and that we are only accessing not-null, fixed-length columns.
1) Load (2 byte) offset for first row at end of page
2) Load column offset
3) Compute column address (page address + row offset + column offset)
4) Load column from memory location
5) Consume column (plus data validity tests), write results
Repeat 2, 3, 4, & 5 for each column access.
Repeat entire sequence for each row. (Assuming row locks are not required)
This is of course a gross simplification. In response to questions raised on one of my earlier blogs, Mario of sqlinternals.com provided a description of the actual instructions using the debugger. If anyone would be so kind as to do so again, I would appreciate it.
In addition, a SQL query is not compiled to binary code as with a C/C++ program, but is interpreted. It is presumed that SQL Server and other RDBMS have efficient interpreters. (It might help if Microsoft would allow us to look at source code, including the assembly output for this).
It is expected that some memory accesses will be found in L1 or L2 cache, especially for accesses to adjacent column values. The L2 cache latency on recent Intel processors is around 10 CPU clocks, relatively close. Any accesses to L3 incur 36 cycle latency (Westmere). Local memory access should be around 50ns, and remote node memory access could be 100ns+. Some documents indicate that even though local memory access is quick, a remote node snoop is required for cache coherency.
In any case, there is a wide gap between a reasonable estimate of the CPU instructions necessary to process each row and column in a page and the actual total CPU clock cycles observed. It is suspected that many of the clock cycles are lost to memory latency. Due to the serial nature of the instruction sequence, the compiler and processor cannot determine far in advance which memory addresses need to be pre-fetched.
SIMD Page Operation
The question is: what special-purpose extensions could significantly improve database page access performance? Consider the previous table scan scenario with the following special-purpose register strategy. The row offsets are loaded into one special-purpose register; a 128-bit, 16-byte register can hold 8 two-byte row offsets, and a 256-bit register 16. The column offsets are loaded into another register: only the columns to be accessed, excluding columns not accessed. A third register holds the size of each column, corresponding to the column offsets in the previous register. If more than 8 columns are accessed, additional register pairs (for the offsets and sizes) could be employed.
XMM1: Row offsets: 1, 2, 3, 4, 5, 6, 7, 8
XMM2: Column offsets: 1, 2, 3, 4, 5, 6, 7, 8
XMM3: Column sizes: 1, 2, 3, 4, 5, 6, 7, 8
The special instructions for database operations could fetch row x, column y, with the address calculated as page base address + row offset + column offset. It is not necessary to access multiple columns or multiple rows in a single instruction, because this might entail more memory accesses than the processor has load/store units for. Simply achieving a throughput of one column per clock (or even one per several clocks) is already a very large performance gain.
Note this only works for fixed-length, not-null columns. A more complicated method could be contrived to handle columns that allow null values, or variable-length columns. (Or one could hire a competent database architect.) Potentially, a data structure could have been allocated for the columns accessed, and the instruction copies each column in succession into the data structure. I am assuming that the processor is able to distinguish when more than one column falls in a single cache line, and not issue unnecessary memory accesses.
This SIMD strategy is not just about reducing the number of instructions necessary to access the required rows and columns in a page. The more important aspect is minimizing memory latency stalls. Using vector registers for multiple offset values eliminates the waste of using a large 8-byte (64-bit) general-purpose register for a single 2-byte offset. The memory access pattern should now be sufficiently predictable to substantially reduce memory latency delays.
There is no pretense that the rambling (does JK have a trademark on this term?) here is sufficient to make the case for SIMD extensions in the CPU architecture and for retro-fitting the SQL Server storage engine. But given the importance of the database in the server world, should there not be a thorough investigation (with source code access)? An alternative question could be: would database SIMD extensions be more valuable than flirtations with columnar database storage?
There are some additional questions with regard to micro-architecture optimizations. Should the row start positions be 4, 8, 16, 32 or 64-byte aligned, as appropriate for the table based on average row size? Should the leading fixed-length columns be the 4 and 8-byte values? Alternatively, pairs of 2-byte values, or quads of 1-byte values? Does the SQL Server 3-byte Date data type cause an alignment performance penalty?
Below is a Microsoft Research / Columbia University paper on database SIMD, though our strategy appears to be different.
Per John in the comments:
From what I gather on John's blog, VectorWise is its own company, partnering with Ingres for really fast data analytics. My discussion focused on retaining the row organization in a general-purpose, transactional DB. In the analytics world, there is no problem with implementing column storage, and of course, using SIMD instructions effectively is the mark of a serious programmer. (John: did you think I was the department head of statistics? The two of us had been fighting for top search ranking until that movie star from Taiwan blew us both away.)
Since this topic is being discussed, I will plug my own tools: SQL Exec Stats, and a second (a little dated) one. The main capability is cross-referencing index usage with specific execution plans. Another feature is generating execution plans for all stored procedures in a database, along with the index usage cross-reference.
There are several sources of execution plans or plan handles: a live trace, a previously saved trace, previously saved sqlplan files, the DMVs (dm_exec_cached_plans, dm_exec_query_stats, etc.), or SQL itself via SET SHOWPLAN_XML. All are useful, so we need tools that can make use of execution plans from all of these sources; the DMV route is sketched below.
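For example, pulling plans and handles from the plan cache is standard DMV usage (not tied to my tools):

    -- Top cached plans by total worker time, with the XML plan for each.
    SELECT TOP (20)
        qs.execution_count, qs.total_worker_time, qp.query_plan
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
    ORDER BY qs.total_worker_time DESC;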
I could also mention my own performance monitoring tool, which needs to be updated or incorporated into a commercial tool. The point of note is that it shows disk IO from the performance counters and from fn_virtualfilestats, with the data, log and tempdb reads/writes all displayed simultaneously.
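The file-stats half of that is plain T-SQL; a minimal sketch using the documented function (SQL Server 2005+ naming):

    -- Cumulative reads/writes per database file since instance startup.
    SELECT DB_NAME(DbId) AS database_name, FileId,
           NumberReads, BytesRead, NumberWrites, BytesWritten, IoStallMS
    FROM sys.fn_virtualfilestats(NULL, NULL);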
Finally, there is SQL Clone, which started in SQL Server 2000; in SQL Server 2005 and later, we have the API to get and update the data distribution statistics.
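The underlying plumbing is the STATS_STREAM option, which is undocumented, so treat this purely as an illustrative sketch (object names are hypothetical):

    -- Export the binary statistics blob from the source database:
    DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_CustomerID') WITH STATS_STREAM;
    -- The returned stream can then be applied on the clone with:
    -- UPDATE STATISTICS dbo.Orders IX_Orders_CustomerID WITH STATS_STREAM = <stream>;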
Every now and then, the question is asked “When will servers be powerful enough that performance tuning will not be necessary.” Sometimes I get the impression this was posed not on technical grounds, but rather that ISVs and CIOs don’t like performance specialists. Fortunately (for me) it does not appear that this will ever happen for two principal reasons: 1) hardware and tuning contribute in such completely different aspects that neither can be ignored, and 2) multi-core processors actually introduce new tuning requirements and strategies.
On the hardware side, performance improves over time with 1) processor compute capability, in terms of both single core and system aggregate performance, 2) memory capacity and 3) storage IO performance. Tuning largely involves the three categories: 1) improving the execution plans of top resource consuming queries, 2) balancing avoidable compiles and recompiles of high volume queries with having the right execution plan, 3) managing deadlocks, blocking and various resource bottlenecks.
In the past, processor performance increased roughly 40% per year at the single-core level. Since the 2005 time frame, processor performance has increased mostly through increasing the number of cores per socket rather than at the single-core level. Over the last six years, single-core performance has roughly doubled. However, the total aggregate performance of recent-generation microprocessors is truly impressive. In the past, mainstay database servers had 2-4 processor cores. Today, processors with six or eight cores are standard, and the server system itself might have between 12 and 32 cores. The current AMD Opteron processors have 12 cores on 2 die in one socket, but the aggregate performance at the socket level seems to be comparable to the Intel Xeon with 6 cores.
Database performance will depend on whether an application can effectively use all of the available processor cores in a scalable manner. If so, then it might be possible to handle performance issues with brute force. More than likely, only a few elements of a complete application will naturally scale exceptionally well. Many elements may have moderate or marginal scaling. Key operations, particularly writes (Insert, Update and Delete) may be restricted to a single core. A complete database application with a rich functionality set cannot be completely handled with brute force compute power comprised of many processor cores.
Memory can dramatically reduce disk IO to manageable levels. This usually requires proper tuning to bring the active database into a limited set of pages that fit in the memory available for the buffer cache. Once upon a time, memory was extremely critical. When system memory was below 1GB, a major effort was required to bring disk IO to manageable levels by ruthlessly restricting queries to basic transactions only; a report run concurrently with transactions could seriously overload the disk system. By the year 2000 or so, system memory capacity with reasonably priced memory modules (usually half the maximum supported memory module capacity) crossed the 3-4GB range. Around this time frame, the memory capacity was sufficient to support both transactions and some reporting activity.
This was followed by the tortured path to full 64-bit for unrestricted access to all system memory. Itanium, the first Intel microprocessor designed for 64-bit architecture, was repeatedly delayed and seriously disappointing on performance when it finally did arrive (2001?). The Intel Pentium Pro and later processors did have a band-aid capability in the Physical Address Extension (PAE) feature, which let a 32-bit OS manage more than 4GB physical memory. Microsoft reluctantly (perhaps it was a cold day in hell) added Address Windowing Extensions so that applications like SQL Server could map memory outside the normal address range, with functionality sufficient for the buffer cache.
The Intel Itanium II finally provided good performance on a full 64-bit operating system supporting a 64-bit version of SQL Server 2000. But acceptance of the Itanium platform was poor. A 64-bit OS did not become pervasive until after the AMD Opteron, and later Intel 64-bit Xeons (based on the Prescott core) in 2004. SQL Server with 64-bit on the more popular X86-64 architecture did not arrive until late 2005.
By this time system memory capacity was in the 32-64GB range, with 16-32GB being very affordable. This seemed to be sufficient to handle properly tuned databases even with rich functionality. Today, server systems can reach 1TB memory at a reasonable price. A 16GB memory module with ECC is around $800. The price for 64 modules (1TB) is just over $50,000. This is relatively low compared with the overall cost of an important database application. Few applications should actually require so much memory. The reality is that most environments have already reached the memory brute force realm.
Interestingly, IO performance, while weak at the single disk level, was the one component that could be scaled with massive numbers of disk drives. The key to implementing this strategy is low cost per disk amortized over the complete storage system. However, storage vendors were interested in selling extremely high value-add features, i.e., extraordinarily expensive disks, which directly conflicts with the performance strategy.
Now with solid-state storage technology sufficiently mature and competitive in price, IO performance has also reached the brute-force realm. Of course, this came only after system memory configuration was so massive that a well designed transaction processing database should not generate heavy IO anymore. Still, it is good to have capability to handle IO surges.
Tuning for Execution Plan Efficiency
On the tuning side, a combination of rewriting the SQL (or using plan hints when this is not possible) and building the right set of indexes allows the query optimizer to find efficient execution plans for the most important queries. This effort targets the top SQL statements, and it may be possible to achieve a very large performance gain on a handful of queries, with essentially no change on the majority of queries.
Note that the range of performance on the hardware side between contemporary mid-range (2 sockets, 12 cores total) and high-end systems (4-8 sockets, 32-64 cores total) is not more than 2-4X. In the old days, there were systems with 32-64 processor sockets, but scaling was limited and could only be achieved with a complete re-architecture of the application. The difference between a good execution plan and a poor one might be one, two or even three orders of magnitude, which cannot be spanned with hardware. Only tuning can close this range, even if this is feasible for just a handful of SQL.
Tuning for Parallelism
As discussed earlier, server systems today can have 12, 32 or even 64 processor cores. Achieving good scaling across all processor cores is not a simple matter. The Microsoft SQL CAT team has published strategies for reducing contention for key metadata structures and index root-level access. Examples are creating multiple identical indexes on small tables that are frequently updated, to avoid contention for the index root-level page, or partitioning to achieve the same effect (a sketch follows below).
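A sketch of the identical-index trick as I understand the SQLCAT recommendation (object names are hypothetical; see their material for the full treatment):

    -- Several identical nonclustered indexes on a small, hot table give
    -- concurrent sessions alternate access paths, spreading root-page
    -- latch contention across the index copies.
    CREATE INDEX IX_HotLookup_1 ON dbo.HotLookup (LookupKey) INCLUDE (LookupValue);
    CREATE INDEX IX_HotLookup_2 ON dbo.HotLookup (LookupKey) INCLUDE (LookupValue);
    CREATE INDEX IX_HotLookup_3 ON dbo.HotLookup (LookupKey) INCLUDE (LookupValue);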
However powerful modern server systems are in terms of compute, memory and IO, brute-force capability is helpful, but it does not replace good database architecture and performance tuning. The impact characteristics of each are very different. Hardware alone cannot solve some performance issues, nor does endless performance tuning compensate for obsolete hardware. It is always difficult to tell when something is an unbiased assessment versus a self-serving opinion. If only life were easy!
Every now and then, one encounters a legacy application that uses API Server Cursors (see http://msdn.microsoft.com/en-us/library/ms187088.aspx) or Cursor Stored Procedures (http://msdn.microsoft.com/en-us/library/ms187801.aspx).
In Profiler or Trace, these calls involve sp_cursoropen and sp_cursorfetch, or sp_prepare/sp_prepexec and sp_execute. (This is not the same as cursors declared in SQL statements). API Server Cursors are sometimes blamed for performance problems. This could be generic poor performance or a nearly complete SQL Server system seizure with memory leak symptoms.
The general poor performance is usually due to the way API Server Cursors are used. In most situations, I have seen sp_cursorfetch calls that get 1 row at a time, with hundreds or thousands of calls to get all rows. To understand this, I chased down the documentation on the cursor parameter values (see the SSWUG article). I was surprised to find that the API Server Cursors are fully capable of fetching multiple rows with a single call, as sketched below. The problem was that the client side elected to fetch only one row at a time.
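The fetch buffer size is simply the last parameter to sp_cursorfetch; run from T-SQL for illustration (the query is just an example):

    DECLARE @cursor INT, @rowcount INT;
    -- Open a forward-only (4), read-only (1) API cursor.
    EXEC sp_cursoropen @cursor OUTPUT,
         N'SELECT name FROM sys.objects', 4, 1, @rowcount OUTPUT;
    -- Fetchtype 2 = NEXT; the final parameter requests 100 rows per call,
    -- which is what the one-row-at-a-time clients fail to do.
    EXEC sp_cursorfetch @cursor, 2, 0, 100;
    EXEC sp_cursorclose @cursor;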
The other possible problem associated with API Server Cursors is excessive network round-trips. When the query returns only a single row, the API cursors may generate 3 calls: sp_cursoropen to set up the cursor, sp_cursorfetch to get the row, and sp_cursorclose to release resources used by the cursor. (The ODBC call that does this should also release resources on the client.)
The cost of the network round-trip overhead from client to SQL Server can be 6-8 times higher than a single-row index seek. So generating 3 network calls when 1 was sufficient is a seriously bad technique. Generating thousands of calls when only 1 (with many network packets) was necessary is even worse. Furthermore, scaling network round-trips on big server systems, especially NUMA, requires special handling, with possible serious negative consequences if not done correctly.
The number of network round-trips required can be reduced from 3 to 2 with sp_prepexec and sp_unprepare. The first procedure both prepares a statement and executes it with the first set of parameter values; the handle can then be used for additional executes with different parameter values, as sketched below.
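Illustrated from T-SQL (the real usage is via the client RPC layer; the table and values are hypothetical):

    DECLARE @handle INT;
    -- One round-trip: prepare the statement and execute with the first value.
    EXEC sp_prepexec @handle OUTPUT, N'@P1 INT',
         N'SELECT OrderDate FROM dbo.Orders WHERE OrderID = @P1', 1001;
    EXEC sp_execute @handle, 1002;  -- re-execute with a new parameter value
    EXEC sp_unprepare @handle;      -- release the server-side resources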
The second, more serious problem occurs when the client neglects to close the cursor or prepared statement. If the Profiler trace shows calls to sp_cursoropen and sp_cursorfetch without a call to sp_cursorclose, or calls to sp_prepare/sp_prepexec and sp_execute without a call to sp_unprepare, then the client-side code never called SQLFreeStmt (the newer syntax being SQLCloseCursor or SQLFreeHandle?).
In this case, the cursor or prepared statement is still accessible, meaning SQL Server will not release the resources necessary to access it, even though the client will never access it again. This is essentially a memory leak. On 32-bit SQL Server, I have seen SQL Server lock up with several million open cursors (the number probably depends on the resources consumed by each cursor or prepared statement). Presumably 64-bit SQL Server can go quite a while before exhausting the virtual address space.
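On SQL Server 2005 and later, leaked cursors can at least be spotted with the sys.dm_exec_cursors DMF:

    -- Open server cursors across all sessions (0 = all), oldest first; a
    -- large, growing count with old creation times suggests the leak above.
    SELECT session_id, cursor_id, name, properties, creation_time
    FROM sys.dm_exec_cursors(0)
    ORDER BY creation_time;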
In the past, on SQL Server 2000, I have also seen this happen when one cursor is left open, followed by a very long stream (millions) of cursors opened and closed. This may sound silly, but consider a Visual Basic program with a nested while loop. The outer loop opens a cursor, executes it, but leaves it open. The inner loop opens a cursor and executes it. The code does not close the cursor explicitly (SqlCmd=nothing?). As the code goes to the next iteration of the inner loop, a new cursor is opened for the current iteration, then the VB garbage collector kicks in, releasing the resources created in the previous iteration, which are no longer accessible.
At PASS last week, I asked one of the Microsoft SQL Programmability reps about this subject. He was under the impression that it was possible to close API Server cursors from a different session. I am not sure how this is done.
If anyone still has application that issues API Server Cursors, and experiences these problems, first I would strongly suggest rewriting the entire application to not use the API Server Cursors. Second choice would be to fix the code. In the case that this was a third-party product, and they are unresponsive, perhaps the Microsoft SQL Server team would be so kind as to provide a means to close open cursors in other sessions. Perhaps sp_reset_connection could be slipped in somehow?
PS: It is common practice in performance tuning to run a Profiler or server-side trace with filters to avoid overwhelming the system. The sp_cursorclose and sp_unprepare calls almost always have zero CPU and duration (not sure about logical reads), so a filtered trace will not show them. If there are API Server Cursor calls, it could be helpful to run a trace filtered to just the RPC:Completed event, or to a specific spid, as sketched below.
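A sketch of such a server-side trace, filtered by spid rather than by duration (event and column IDs per the sp_trace documentation; the file path and spid value are examples):

    DECLARE @TraceID INT, @maxsize BIGINT;
    SET @maxsize = 100;  -- max trace file size in MB
    EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\traces\cursor_trace', @maxsize;
    EXEC sp_trace_setevent @TraceID, 10, 1, 1;       -- RPC:Completed, TextData
    EXEC sp_trace_setevent @TraceID, 10, 12, 1;      -- RPC:Completed, SPID
    EXEC sp_trace_setfilter @TraceID, 12, 0, 0, 52;  -- SPID equal to 52 (example)
    EXEC sp_trace_setstatus @TraceID, 1;             -- start the trace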