Ingres just published a TPC-H benchmark for VectorWise, an analytic database technology employing
1) SIMD processing (Intel SSE 4.2),
2) better memory optimizations to leverage on-chip cache,
3) compression,
4) Column-based storage.
Ingres originated as a research project at UC Berkeley (see Wikipedia) in the 1970s, and has since become a commercially supported, open source database system. Apparently, Ingres project people later founded Sybase. So Ingres in a sense, is the grandfather (or perhap grand-step-uncle?) to SQL Server. According to the Ingres whitepaper, VectorWise spun off from CWI in Amstersdam, with some roots in MonetDB? For the time being, Ingres VectorWise is not open-source. The interesting aspect of VectorWise among other column store database products is use of the SIMD, as in utilizing the special instructions and registers built into recent generation microprocessors. (In a previous blog, I proposed special SIMD instructions to improve the efficiency on accessing pages in traditional row-storage database engines.)
Below are TPC-H results at scale factor 100GB on HP ProLiant DL380 G7 with two Xeon 5680 six-core 3.33GHz processors for SQL Server 2008 R2 and Ingres VectorWise 1.5.

| |
Power |
Throughput |
QphH |
| SQL Server |
99,626.3 |
55,038.2 |
73,974.6 |
| VectorWise |
257,142.9 |
246,101.7 |
251,561.7 |
Overall, the VectorWise columnar database has about 2.5X higher score on Power, and 5X on Throughput. The glaring weak point of SQL Server is that the throughput is (almost 2X) lower than the power. The SQL Server parallel execution engine is very powerful running a single stream with no other concurrent queries competing for CPU. It does not have a good strategy for sharing compute resources with other concurrent long running queries.
The general expectation set by all the new technology, columnar storage, memory optimization and vector processing was 10X, not 3X. However, we should consider that the TPC-H benchmark score is a composite geometric mean of 22 individual queries. The individual query performance differences between SQL Server and VectorWise are shown below, on a logarithmic scale (base 2, not 10, a cool feature in Excel 2007?).

The full order of magnitude gain is definitely achieved in certain queries (1, 4, 6, & 12), with good gains in additional queries. Many other queries are comparable with row based storage.
Columnar storage is expected in SQL Server 2011. Search for the SQL Server Technical Article: "Columnstore Indexes for Fast Data Warehouse Query Processing in SQL Server 11.0" by Eric Hanson. The Columnstore indexes cannot be updated, so a partitioning strategy to separate activate and static data is necessary.
ps
since we all love calculations, in the SQL Server TPC-H report, Q1 ran in 21.0 sec (presumably with all 12 cores near 100%, and there are 6M rows per SF 1GB, of which 98.4% are touched in Q1, we can work out the following
| | Q1 sec | CPU-cycles per row | |
| SQL Server |
21.0 |
1398 |
|
| VectorWise |
1.9 |
126 |
|
On my laptop, with 2 Core 2 architecture cores at 2.2GHz, I measured 2944 CPU-cycles per row on Q1. Without the Group By, the cost is 1982 CPU-cycles per row. Just to touch each row (in SELECT COUNT(*)) is 420 CPU-cycles.
The TPC-H SF100 LineItem table with the 3 byte DATE data type is 87.5GB (w/o compression). So Q1 21 sec means 4.17GB/s data consumption rate. The SQL Server TPC-H result was configured with 4 SATA SSD drives. I believe these are the 3Gbit/s SATA interface (6Gb/s models coming soon?) meaning 300MB/sec per drive tops, or 1.2GB/s total for 4 drives. So to achieve Q1 in 21 sec, the data must be in memory. However, with 16 (3Gb/s) SATA SSDs or 8 6Gb/s SATA SSDs distributed over 2 controllers, 4GB/s is easily achievable.
For VectorWise, given that I have not actually used it myself, I will have to make "reasonable estimations". Lets disregard compression for the moment. Assuming VectorWise employs 8 byte values to store Quantity, ExtendedPrice, Discount, and Tax, 4 bytes for Date, and 1 byte each for the 2 flag values, then VectorWise must process 38 bytes per row, or about 24GB for TPC-H Q1. To support Q1 in 2 seconds, the disk system would need to deliver 12GB/sec. With compression, the rate could probably be brought down to 6GB/s. The actual storage system employed, 16 15K SAS disks over 2 controllers, can deliver 16 x 150MB/s = 2.4GB/s. The 2 x Xeon 5680 (2 IOH) system can easily deliver 6GB/s (and probably 12GB/s if this is really necessary). So to access data on disk at that rate, the storage system should be 40 disk drives over 2 controllers (probably capped at 2.8GB/sec per controller today).