THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Michael Zilberstein

Updateable columnstore index gotchas

SQL Server 2014 introduces updateable nonclustered and clustered columnstore indexes. When new data is inserted, it goes first to deltastore where background process divides it into row groups, indexes and compresses. Indexing progress can be observed via sys.column_store_row_groups DMV. According to BOL, row groups in deltastore can be in one of 3 possible statuses: OPEN (a read/write row group that is accepting new records. An open row group is still in rowstore format and has not been compressed to columnstore format), CLOSED (a row group that has been filled, but not yet compressed by the tuple mover process) and COMPRESSED (a row group that has filled and compressed). COMPRESSED actually means that data is already part of the index. Row groups that are still in OPEN or CLOSED state are yet to be added to the index and currently reside somewhere in deltastore (internal object inside the database). You won’t find it in sys.objects or sys.partitions. But it is there – you can observe it using “half-documented” sys.system_internals_partitions view.

Anyway, performance difference between fully updated columnstore index and situation when part of your data that is still in OPEN or CLOSED state in deltastore is huge. So we would like the background process that indexes and compresses data to be as fast as possible. Unfortunately it is painfully slow. I’ve loaded 45 billion rows into clustered columnstore index. When loading had finished, I had only 13 billion rows in the index – less than 30%. 12 hours passed since then – I continue all the time to observe indexing progress using following query:

   1: SELECT 
   2:     state_description, 
   3:     SUM(total_rows) as TotalRows, 
   4:     SUM(CONVERT(BIGINT, size_in_bytes))/(1024*1024) as SizeMB
   5: FROM sys.column_store_row_groups
   6: GROUP BY state_description;

Results are very disappointing – less than 1 billion rows in 12 hours. Meaning indexing takes much longer than loading itself. Actually, more than 10 times longer. And querying data which is still in the deltastore is painful even more. Here is estimated plan – couldn’t wait for an actual, stopped the query after 15 minutes (when indexed with fully updated columnstore index, runs under 2 seconds):


Look at the estimated IO and CPU costs – they’re huge. I guess, it means scanning entire deltastore. Which is reasonable – there is no traditional b-tree index on this data (for sure not while in deltastore), columnstore index isn’t ready yet, so scan is inevitable. And even more – I couldn’t find any query that would run fast. That makes sense too – even if required data resides fully inside columnstore index, we can’t be sure about that. What if there is another part that is still in deltastore? It is unsorted and un-indexed, remember? So for any query SQL Server should check deltastore – in case something there is not in “COMPRESSED” status. In my case it means scanning 45B – 14B = 31B rows.

What does it mean? It means that updateable clustered/nonclustered columnstore indexes should be used with extreme caution. If your data is constantly updated (loaded) or if you have huge single ETL but not enough time for a background process to calculate entire index before users start to query it – result can be extremely painful. Don’t test your system only when data is ready and queries run blazingly fast. Test it when deltastore contains some data. I hope, Microsoft would improve performance of the process that indexes and compresses data – in my environment it seems extremely lazy: nothing else runs, no user sessions at all and still CPU is on 1-2% and indexing progress according to sys.column_store_row_groups is very very slow.

P.S. As I’ve blogged earlier, sys.partitions sees only rows which are in “COMPRESSED” status. So sp_spaceused, for example, currently indicates that my table contains only 14B rows.

Update: by request from Kevin Boles in comments below, I’m providing environment metrics.

It wasn’t intended to be a stress test, just turned out that way. The original intention was to compare my client’s application performance in a current version where we store some data in long compressed (proprietary + zip) blobs vs same data in clustered columnstore index. Updateability is a must, so columnstore index in SQL 2012 wasn’t an option. Data loading had been executed by application that was parsing the blob and calling stored procedure that receives TVP as an input parameter. Every call contained ~5K rows. 3 instances of this “migration tool” were running concurrently. The important part here is that it wasn’t bulk load. Data loading took ~3 days, mainly because of CPU bottleneck in migration application – database could do faster. As I’ve mentioned above, by the end of the loading, when ~45B rows had been inserted into the table, SQL Server had indexes and compressed 13B rows.

Environment is Virtual Machine running Windows 2012 Server, Intel Xeon CPU E5-2670 (2.6GHz) with 4 virtual cores; 6GB of RAM; IO – don’t know what is below the disk that I see but anyway, there is no IO bottleneck.

SQL Server used ~5GB of memory. MAXDOP setting remained default (0). Delayed Transaction Durability was set to “forced”.

I took IO and wait types snapshots several times during the day. IO latencies were ~12ms/read and ~0.5 ms/write. CPU was between 1% and 5%. During 5 minutes between snapshots SQL Server wrote ~144MB and read ~240 MB. Not much. Looks like server is almost idle. Tempdb was idle either - ~1% of total IO activity.

There were only 2 noticeable wait types – between subsequent 5 minutes snapshots they both had 299.98 second wait time (i.e. entire interval): DIRTY_PAGE_POLL and HADR_FILESTREAM_IOMGR_IOCOMPLETION.

Average number of rows per rowstore is ~1.05M. Average row group size is ~9.3MB. All the data resides in a single partition – as it is single partition in the current system (although with long blobs it is only 10M rows instead of 45B).

Regarding dictionary usage, sys.column_store_dictionaries contains ~141K rows with entry_count in a range from 1 to 27.5K.

Regarding query: here it is (without table and column names):

   1: SELECT AVG(Col5)
   2: FROM dbo.Tbl
   3: WHERE 
   4:     Col1 = @Int1 AND
   5:     Col2 = @Int2;

I’ve written aside some statistical data during data migration, so I know for sure that this WHERE clause filters 27.5M rows. Same query on a fully updated columnstore index (same number of rows to query but index itself much smaller) on the same machine took under 2 seconds with cold cache.

Published Sunday, December 1, 2013 4:53 PM by Michael Zilberstein



Joey D'Antoni said:

Just curious--what does the available memory on the host look like. My understanding of the tuple mover process, is that it halts under memory pressure. This is definitely a hole in the process though

December 1, 2013 10:25 AM

Kevin Boles said:

Can you provide some details about exactly how you loaded the 45B rows?  What mechanism did you use and what batch commit size did you use per transaction?  Did you use maxdop on the load to ensure availability to the compression engine as the bulk loading was being compressed on the fly (which obviously assumes you did some form of bulk loading). :-)

Also, what is the average number of rows in your row groups and the total number of row groups?  

Is the table partitioned or not?  If so, how is the breakdown of CLOSED and OPEN per partition?

And one more question: how is the usage of your global dictionaries looking?  I wonder if some scenarios wouldn't benefit from "pre-loading" a smallish sampling of data (5-10% maybe) as a heap and then creating a CCSI from that to allow the creation of good global dictionaries for all columns.  Then subsequent bulk insertion in 1M-row batches might get much better compression from the globals.  Hmm, thinking about that, I wonder if compressing all those CLOSED row groups is leading to adjustments of the global dictionaries, which could be expensive, or poor use of them which could also be bad.  

Get on as a TAP with Microsoft maybe, or get with their CAT team for a review?  The white paper about CCSI (Enhancements to SQL Server Column Stores) stated 600GB/hour loading on a 16 core server, although I note it did NOT say what the state of the row groups was at the completion of said test runs.

December 1, 2013 10:31 AM

Michael Zilberstein said:

Joey, I didn't do a stress test on a strong server - it is VM with 4 virtual cores and 6GB of memory (stress test wasn't an intention - it just turned out that way).

December 1, 2013 10:46 AM

Kevin Boles said:

>>Joey, I didn't do a stress test on a strong server - it is VM with 4 virtual cores and 6GB of memory (stress test wasn't an intention - it just turned out that way).

Michael you REALLY should have posted details about your environment and methods so people could have discounted your "findings" here.  This is why vendors get very upset with (and often legally ban) unverified benchmarking.

I would still like to know details I asked for to see if there aren't reasons OTHER than your suboptimal environment as to why things are so slow.

BTW, have you done a waitstats or fileIOstall analysis during this time?

December 1, 2013 11:28 AM

Michael Zilberstein said:

Kevin, another thing: in the estimated plan, can you explain why in order to find estimated 9620 rows, required CPU is 2481 and required IO is 7437? Is it indeed scan of deltastore as I thought?

December 1, 2013 11:44 AM

Michael Zilberstein said:


I've done wait stats and IO stats, although after writing this post. I don't have an IO bottleneck - will post a numbers later today along with answer to your previous comment, have urgent work to do. Anyway what disturbs me isn't absolute numbers but a relation between population time and time that takes SQL Server to index and compress data. It is same environment, I had same IO capacity during loading.

P.S. Just for an IO - latency numbers are: 12ms/read and 0.5 ms/write

December 1, 2013 11:52 AM

Kevin Boles said:

1) Do you have MAXDOP set to one on the server by chance? Maybe throttling there?

2) How bad is tempdb activity on the box?

3) How busy is the actual tuple mover?  sp_whoisactive using the delta-time functionality here could be useful.

4) Were memory grant waits noticeable?

5) It still could be simply some throttling (or suboptimality or even bug(s)) in the tuple mover causing slowness.  That may or not be Microsoft's fault because I doubt they envisioned someone paying $28000 dollars for an enterprise edition of SQL Server and then using it on a server with 4 vCPUs and just 6GB of RAM for the amount of data you are throwing at it. :-)  I wouldn't blame them at all if they simply discounted your scenario and all your problems are related to that decision. :-)

6) As for your query cost:

a) I couldn't see the query so that limits my ability to interpret/comment

b) Given that you have soooooo much data in the delta store you get ONE thread for it (per partition per store), and yes, that could certainly be a limit with umpteen bajillion rows of data to scan through.  Recall that with no NC indexes you cannot do any NCI SEEKs with this arrangment.

c) I would love to see the ability to do a BATCH MODE plan for the compressed row groups, and I wonder if that is happening under the covers here but not displayed.  If not that is could be a HUGE gap in functionality!

December 1, 2013 1:23 PM

Michael Zilberstein said:

Kevin, see update to the post - answered majority of your questions.

December 1, 2013 1:39 PM

Kevin Boles said:

I would expect SIGNIFICANTLY better picture if you use BULK loading mechanism, or even if you simply batched up 10K rows per transactin on loading.

I wonder if the varioables in the query are preventing good segment elimnation.  Try it with hard coded Col values and see if it makes a difference, and also with OPTION (RECOMPILE).

December 1, 2013 2:14 PM

Remus Rusanu said:

Loading a CCI in 5k batches: you created a horrible bad use case, misinterpreted the results and then draw the wrong conclusions.

Tuple Mover is not a replacement for index build. Use bulk insert tools for initial population. For populating a CCI directly each batch must provide at least 1 million rows per partition per thread and use bulk insert API, and the machine must have enough RAM to contain the uncompressed + compressed segment in memory for each thread (minimum 3-4 GB per thread, preferable +12GB per thread). Otherwise, populate a heap and use CREATE INDEX to create the initial CCI (same memory requirements). Then let tuple mover keep up with your day-to-day activity, unless day-to-day implies large ETL dumps, then you must make sure you use bulk insert and create directly compressed segments.

December 1, 2013 3:47 PM

Michael Zilberstein said:

Kevin, same with variables and with values hardcoded.

I'm sure that with bulk loading it could be better because columnstore is optimized for that. So you can call it bad practice and be 100% right but still I would expect that when server is idle, the tuple mover would be more aggressive.

December 1, 2013 4:00 PM

Michael Zilberstein said:


I don't argue that the use case is bad. Call it bad practice, call it even the worst practice possible. It wasn't the point. Sometimes you just can't load data into heap and create index later. I can't execute CREATE CLUSTERED INDEX on a 45B rows heap! Transaction log will crack. And I don't have downtime for that too. In production environment it will be partitioned table with hundreds billions rows - those 45B will be just one partition out of tens. So I have to load data into clustered columnstore index. The only thing we could do different is migrate data using bigger chunks. But my scenario is also possible - small logs that arrive from different sources. I don't see where I misinterpreted the results. I've provided segment sizes - single segment of 1.05M rows (in my case) perfectly fits into memory, both compressed and uncompressed. Tempdb doesn't work hard - almost doesn't work at all (1% of total SQL Server IO) - so spills can be ruled out. Entire instance performs ~7K IO operations in 5 minutes. 7500 RPM SATA drive is capable of more. And what are the wrong conclusions? My main point is that regardless of the way data had been loaded, the process that compresses data in deltastore should be much more aggressive - at least it should detect idle conditions and open more threads. My server is totally idle! It is extremely illogical that loading took 3 days and indexing - if continues with current speed - would take 15 days.

Anyway - as the one who knows the internals - is there any place where I can found information about Tuple Mover and deltastore - how it works, whether data is somehow indexed or not etc? The idea that 7.4K IO on a simple query are due to deltastore scan is just a theory. Is it indeed so?

December 1, 2013 5:59 PM

Michael Zilberstein said:

Remus, actually I see the answer to my question in your great article (read it in August - enough time to forget :-)

-- When handling deltastores the columnar storage advantages are dimmed. Deltastores are row mode storage so the entire row has to be read, not only the column(s) of interest. Segment elimination does not occur for deltastores since the deltastores do not have metadata about min and max values contained inside each column. Parallel scans will distribute the deltastores among the threads so that multiple deltastores are scanned in parallel, but there is no attempt for parallelism inside a single deltastores. At maximum size of 1 million rows they’re simply too small to justify the engineering complications of handling parallelism inside the deltastores.

Parallel scans assign each delta store to a single thread of execution. A single delta store is too small to justify scanning in parallel but multiple delta stores can be scanned in parallel. Scanning delta stores is slower than scanning data in columnar format because complete records have to be read and not just the columns needed by the query.

December 1, 2013 6:09 PM

tobi said:

Note, that there is a command to force the delta store to be copied to the column store. I don't remember it now, but it exists.

December 2, 2013 4:57 AM

Michael Zilberstein said:

tobi, it sounds strange - data in deltastore is kept in a usual rowstore format. Segments should be closed first and then compressed - only then data is in the index - automatically. So I don't see how you can just copy from deltastore to columnstore - those are 2 different formats.

December 2, 2013 5:03 AM

tobi said:

I found it:

"Reorganize a Clustered Columnstore Index"

December 2, 2013 5:34 AM

Michael Zilberstein said:

tobi, thanks! That's what I needed. BOL says:

-- Reorganizing is not required for moving CLOSED rowgroups into the columnstore. The tuple-mover process will eventually find all CLOSED rowgroups and move them. However, the tuple-mover is single-threaded and might not move rowgroups fast enough for your workload.

That was my point all along - why tuple mover is single-threaded? Why not add more threads when server is idle? But Microsoft guys somehow avoided this point focusing on the way I've loaded data.

Executed REORGANIZE - number of processed rows during 5 minutes interval jumped from 14M to 50M! Taking into account 4 cores, I would say that it is very efficient. CPU jumped to 25-35%. Number of IOs also jumped almost 4 times: from ~7K during 5 minutes interval to 27.5K.

December 2, 2013 5:50 AM

Remus Rusanu said:

December 2, 2013 8:15 AM

Michael Zilberstein said:

Thanks! I see that you've addressed all the issues in great detail. Glad to be an inspiration, even in such circumstances :-). At least now I can say that I have basic understanding of the flow - had no idea that bulk insert of 100K+ rows can bypass deltastore and go straight to the index. Same about dictionary-related information. And that's crucial for application design.

December 2, 2013 8:39 AM

AlexK said:

We need to be very careful with vendors that "get very upset with (and often legally ban) unverified benchmarking." There are lots of problems out here that absolutely do exist, yet are very difficult to reproduce.

Suppose, for example, that your Microsoft Zune did not work on 12/31/2008, and you are legally banned from even telling that on 1/1/2009?

December 2, 2013 9:40 AM

Roger Van Unen said:

Hi Michael

All depends on the number of occurrence you have in a column or even in multiple columns. If for example contains a unique ID or unique text fields the CCI needs to add a lot dictionaries and blobs that need to be compressed to store the data.

The server you mentioned is particularly too small: CCI needs memory. I did the same kind of test on a 40 core fast track with a 39 billion rows table (ints and big ints no text fields) and the CCI was a little more faster than adding rows to a page compressed table with a clustered PK index.

All were ~ 3 times slower than loading into a non compressed HEAP.

June 20, 2014 6:04 AM

Michael said:

Hi Michael!  It has been over a year since this blog post.  Do you know if anything has been improved in terms of how SQL Server handles updates to columnstore indexes since you ran your tests?

February 25, 2015 7:35 PM

Michael Zilberstein said:

Hi Michael,

actually the missing part was an information regarding the proper way to load data into clustered columnstore index. You can read a great article publisher the day after my post (and probably inspired by it) by Remus Rusanu here:

There're also many other resources, like huge series of posts about clustered columnstore index written by Niko Neugebauer:

In short, It appears that the best way of loading data into cci is bulk loading of at least 100K rows at a time (1M is even better). And loading by simple INSERT command and in smaller than 100K rows chunks is probably the worst way.

February 25, 2015 7:43 PM
New Comments to this post are disabled
Privacy Statement