THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Paul Nielsen

www.SQLServerBible.com

Columnar Databases

Josh Jones, all around good guy and author of an upcoming book on database design, presented a rousing talk on Columnar Databases at last night’s Colorado Springs’ SQL Server User Group. There are several commercial and open source columnar databases available; SQL Server is not one of them.

SQL Server, like most relational databases, is a row-based data store. Data is written in the pages as rows,

     1, Paul, Nielsen; 2, Josh, Jones; 3, Kalen, Delaney.

A columnar database writes data to the page as columns,

     1, 2, 3; Paul, Josh, Kalen; Nielsen, Jones, Delaney.

A storage unit (page, or file) is assigned to a each column. This makes writing or reading every column of a single row from the middle of 50 million rows difficult and slow, so columnar database are contraindicated (cool big word) for OLTP databases. 

OTOH, reading one of two columns from every row is screamlingly fast using a columnar database, so this storage style is perfect for OLAP databases. Some columnar databases claim they are 50x faster than a row-based database at these types of queries.  

The difference between row-based and columnar databases is only in the storage engine layer. The query engine layer can still use straight SQL and then pass the optimized query to the storage engine, which writes to / reads from either row-based or columnar based tables.

Maybe you can see where I’m going with this thread. Currently, Microsoft SQL Server only supports row-based storage. But, it’s very conceivable that a future storage engine could write tables as either row-based or columnar storage structures. This would grant a huge performance advantage to SQL Server in OLAP applications. Consider this my public vote for a CREATE TABLE COLUMNAR (col1…) option.

For more info on columnar databases, there's a good page on Wikipedia. Thanks again Josh for your great presentation.

Published Thursday, November 15, 2007 9:35 AM by Paul Nielsen

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Denis Gobo said:

I am evaluating Sybase IQ right now, Sybase IQ is columnar based and also claims 50X improvement

I will let you know the results after I test it out against a billion rows

November 15, 2007 11:02 AM
 

Jamie Thomson said:

Wow, interesting idea Paul. I worked with a columnar DB (we termed it columnbase but whatever) called Broadbase back in 2000-2004. Broadbase were a silicon valley startup from Menlo Park that thought they were onto something but eventually got swallowed up into Kana (http://www.kana.com/). I'm not sure where it is now - probably nowhere.

-Jamie

November 15, 2007 12:15 PM
 

Denis Gobo said:

Paul,

would you mind adding this to the connect site as a SQL Server feature we want (you will get my vote)

If not I can add it, since it was your idea I did not feel like taking credit for it

Denis

November 16, 2007 8:11 AM
 

Greg Linwood said:

I haven't had a chance to read up much on columnar dbs but can someone explain what the difference is between columnar storage & regular indexes? Indexes store any column or combination of columns together on storage so I'm wondering where the advantage in columnar storage is?

November 16, 2007 9:04 AM
 

Paul Nielsen said:

Hi Greg,

That a great question. Denis, in your testing how does Sybase IQ compare against SQL Server with a covering index? I'd love to know.

The columnar folks certainly present the notion that columnar databases outperform row-based queries in selecting a single column for all rows by a factor of x50. But is there more to that than marketing?

Assuming that covering indexes perform as well as a columnar table (I would guess they should), I suppose the question boils down to would you rather add a covering index to every column of an OLAP fact table with 50 dimensions, or just build a columnar type table?

November 16, 2007 11:02 AM
 

Adam Machanic said:

I think Paul's note is dead on.  In the column-based database I assume they keep some kind of header record so they know, e.g., "column 1's values start at sector 0001 and end at sector 0010", "column 2's values start at sector 0011 and end at sector 0100", etc.  So if you ask for column 2, it can do a single scan of exactly the right sectors from the disk, and there you are.

In a row-based DB, if you have more than one column in a given index, you need to read all of the data from every column in the index, then processes it to pick out only the data you actually need.

I guess it all comes down to what you're trying to do, and especially how often you actually need a large subset of the rows.  I assume a more sophisticated column-based setup would keep a row ID with each column value, so if you you had Name, Age, and City it would look like:

1.John 2.Steve 3.Bob, 1.34 2.20 3.53, 1.Denver 2.Boston 3.Melbourne

Then the data could be pre-sorted (not done here) to take advantage of range-based scans.  I can see how that would be quite advantageous if pulling down large chunks of data for OLAP queries... It also might help enable range scans for queries on composite indexes, something that SQL Server is certainly not good at doing today... Maybe a using a multidimensional tree to build the row-based indexes would better satisfy that requirement, though.

November 16, 2007 4:28 PM
 

Paul Nielsen said:

Hi Adam,

Exactly. The columnar database must have some way to fetch specific row data within the column. Much of the discussion (led by Josh) at the Springs SQL UG centered on this very question.

SQL Server can simulate column based searches using a non-clustered index, but for heavy OLAP, a columnar table option would be welcome.

November 17, 2007 11:44 AM
 

Joe Celko said:

Another factor is that a column is always one data type.  You then design compression algorithms for them and work with the compressed format instead of the original data.

November 18, 2007 2:48 PM
 

John Sichi said:

To follow up on Jamie's comment above, the Broadbase server actually lives on in the open-source LucidDB project.

January 26, 2008 9:18 PM
 

Sam Bendayan said:

Very interesting.  One question....how would ETL jobs perform on such a database?  Since we are writing entire rows, wouldn't the ETL be incredibly slow?

Thanks,

SB

May 13, 2009 1:37 PM
 

Sam Bendayan said:

Another question:  what about building an index that has this columnar structure over a standard row-based table, instead of converting the table to column-based storage?  Is this possible?  Would this be a better idea?

Thanks,

SB

May 13, 2009 1:43 PM
 

Phil Runciman said:

Any given DBMS could support both row and column based physical storage. It would be up to the DBA to decide which physical organisation is appropriate.

Data inversion was another technique for achieving a similar effect.

The existing logical access methods would apply.

We do not need new products just improvements to existing ones.

July 21, 2009 7:08 PM
 

Stephan Schoenberger said:

@Phil Runciman; I like your ideas. Have you ever read about the customized PostgreSQL from Yahoo? Yahoo took PostgreSQL made some considerable changes and uses this as a columnar db now (http://www.computerworld.com/s/article/9087918/Size_matters_Yahoo_claims_2_petabyte_database_is_world_s_biggest_busiest?taxonomyId=53&intsrc=kc_feat&taxonomyName=databases). It would be a matter of Yahoo making their code public adding some development to port the changes into the core of the Postgresql development and we would be 3 steps closer to your vision.

July 31, 2009 4:00 AM
 

Geoff said:

I currently work with Vertica, a columnar database. It is indeed blazingly fast.

This type of database isn't for every job, but for what it does it does really well. A hybrid option would be very interesting. But that would almost require a new product.

November 17, 2009 2:27 PM
 

Radhika said:

Hello Paul,

This seems to be a good working idea. My query is, How do we query this type of columnr database from Mysql. What differenrt settings would be requierd in the normal select query

June 16, 2010 1:08 AM
 

Jim O'Donnell said:

This is an interesting article.  I would disagree that columnar databases are slow at reading all columns of a particular row.  Columnar databases allow, among other things, each value to be stored in a columnar index once, so initial searches on the where clause are considerably faster.  They also waste no space on null fields, so compression is a considerable attraction, especially when you'd prefer to have a significant portion of the base in memory.  The performance differential for OLTP is significant because it takes time to index a row into several columnar indices.  In general, columns are optimized for reads, tables are optimized for writes.

September 21, 2011 3:32 AM
 

Jim O'Donnell said:

@Sam Bendayan

Great comments and questions.  Mature columnar implementations include bulkloaders which enable very quick writes.  Single transaction writes are relatively expensive, however some columnar DBMS employ a temporary row store which is visible to the SQL interface and is periodically inserted into the columnar store via a backend process.  I think you'll find that, if you play with some columnar engines, the write performance is decent.  I suspect that the general direction will be OLTP on traditional normalized table-based implementations, agile marts and warehouses increasingly on columnar databases with a ton of cache.  I can get more data into cache with a columnar store than I can in a row store, so retrieval of high value data from my mart is nigh instantaneous.

September 21, 2011 3:41 AM

Leave a Comment

(required) 
(required) 
Submit

About Paul Nielsen

Paul Nielsen believes SQL is the romance language of data. As such he’s a hands-on database developer, Microsoft SQL Server MVP, trainer, and author of SQL Server Bible series (Wiley). As a data architect, he developed the concepts of Smart Database Design and Nordic – an open source O/R dbms for SQL Server. He lives in Colorado Springs.

This Blog

Syndication

News

news item test
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement