
Tibor Karaszi

Are inserts quicker to heap or clustered tables?

This blog has moved! You can find this content at the following new location:

http://sqlblog.karaszi.com/are-inserts-quicker-to-heap-or-clustered-tables/

Published Thursday, August 14, 2008 3:05 PM by TiborKaraszi


Comments

 

Scott R. said:

Tibor,

Many thanks for taking time to prepare these tests and findings, and to provide the tests for others to repeat and confirm.  The feedback on concurrent inserts was especially useful.

Your findings are the second empirical report I have seen that suggest better performance for clustered tables versus heap tables (the other being a Microsoft best practice article from 05/2007: “Comparing Tables Organized with Clustered Indexes versus Heaps” – see link http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/clusivsh.mspx).

Clustered versus heap tables is a hotly-debated topic – almost a religion for some folks!  There are those that swear by clustered indexes for all tables, those that prefer heaps (only using non-clustered indexes as needed) for “performance”, and everything in between.  Personally, I have practiced using clustered indexes on all tables unless there is a motivating (and proven) reason for using a heap.

In the past, common “best practices” suggested that heaps have less overhead and thus might perform better in some situations (especially high-volume scenarios).  I don’t know whether those “practices” were once true and aren’t anymore (due to product evolution – DBMS, hardware, etc.), or were never true and propagated by IT industry “urban legend”.  I believe it is most helpful to get past anecdotal and hearsay “evidence” with empirical, repeatable tests and evidence, as you have done.

For a complete perspective, one should also consider the operational impacts of the two table types, in addition to the observed performance impacts.  The following issues are important considerations, especially for large sized DBs:

-  DB space reclamation after row deletes (automatic / immediate versus requires a manual data reorg / unload-reload)

-  Fragmentation (data and indexes)

-  How frequent DB reorganization is needed to resolve such issues

-  Operational time required for DB maintenance functions

-  Impact of reorganizing data on non-clustered indexes:

   *  Clustered table: no impact – cluster key is the non-clustered index row ID, and isn’t changed by reorg process

   *  Heap table: non-clustered indexes must be rebuilt if the heap table is reorganized – the physical RID is the non-clustered index row ID, which is changed by the reorg process (see the sketch after this list)

-  etc.
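
As an illustration of the reorg point above, here is a minimal sketch of the maintenance commands involved on SQL Server 2008 (table and index names are hypothetical): defragmenting a clustered table means rebuilding or reorganizing the clustered index, which leaves the non-clustered indexes alone, while defragmenting a heap means rebuilding the table, which also rebuilds its non-clustered indexes because the RIDs change.

    -- Clustered table: defragment the clustered index only;
    -- non-clustered indexes keep their cluster-key row locators.
    ALTER INDEX CIX_Orders ON dbo.OrdersClustered REORGANIZE;  -- or REBUILD

    -- Heap (SQL Server 2008 and later): rebuilding the table compacts the heap,
    -- but the physical RIDs change, so its non-clustered indexes are rebuilt too.
    ALTER TABLE dbo.OrdersHeap REBUILD;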

These impacts may outweigh or substantiate the performance-only impacts.

I found it interesting that the Microsoft best practice paper also mentions that DB space reclamation is more immediate and automatic with clustered tables than heaps (at least in SQL 2005 – I have heard that this may change with SQL 2008 where heap tables also get the same benefit, but I haven’t tried this myself to confirm if this feature made it in SQL 2008 RTM).

I don’t claim to be an expert on these issues, and welcome feedback from others for perspectives – confirming or differing.

Thanks again for your great efforts.

Scott R.

August 14, 2008 11:37 AM
 

Saggi Neumann said:

Hi Tibor,

Did you do the tests on 2005 or 2008?

Inserts into heaps on 2008 can be minimally logged (like SELECT INTO), so I suspect they can be much faster than on 2005.

Inserts into B-trees (clustered indexes) were also supposed to be minimally logged, but that feature was dropped at some stage.

For reference see here:

http://blogs.msdn.com/sqlserverstorageengine/archive/2008/08/11/update-on-minimal-logging-for-sql-server-2008.aspx
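
For the heap case, a minimal sketch of what that looks like on 2008 (table names hypothetical): the INSERT ... SELECT needs the bulk-logged or simple recovery model plus a table lock on the target heap to be minimally logged.

    -- Assumes the database is in BULK_LOGGED (or SIMPLE) recovery and that
    -- dbo.StagingHeap is a heap with no non-clustered indexes (hypothetical names).
    INSERT INTO dbo.StagingHeap WITH (TABLOCK)
    SELECT col1, col2
    FROM   dbo.SourceTable;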

August 14, 2008 12:32 PM
 

TiborKaraszi said:

Hi Scott,

Thanks for the feedback and information. Yes, there are many, many aspects here, apart from sheer insert performance. I purposely dedicated the article to sheer insert performance, and once you "get into it" when testing and writing you don't want to be disturbed by thinking too broadly.

I think that if there is such a "heaps are better for inserts" notion, it is probably mostly an urban legend. It shouldn't come from older versions, since the old architecture (pre-7.0) was even worse: for heaps, SQL Server always added the rows to the end of a linked list (meaning little possibility for space reclaim, for instance - no PFS or IAM pages).

Thanks for the tip about that white paper, I haven't read it. I will look into it when I have a few moments. I do not know what such space reclaim improvements might be in 2008 (if any); I guess time will tell.

August 14, 2008 12:47 PM
 

TiborKaraszi said:

Hi Saggi,

The tests were done on 2008. I thought about testing on older versions as well to see if there might be some difference here. I doubt it, though, and I (like all of us) have limited time on my hands. Anyhow, I would be surprised if I got minimally logged inserts, since I had the database in full recovery and also because of the timing results.

August 14, 2008 12:50 PM
 

AaronBertrand said:

I agree with Tibor; the conditions you need for minimally logged inserts almost certainly don't justify a heap, even if it does perform a bit better in that case (which I doubt).  This would have to assume that the speed of the insert is the most important part of your operation, and even more important than the data itself.

If you are comfortable with running in simple recovery mode because that's the way you have to justify a heap, then by all means, go nuts.  I don't think I will be recommending that approach to any of my clients when they ask me, why do I need a clustered index?

August 14, 2008 1:22 PM
 

jchang said:

Yes, a heap can have unusual or inconsistent insert characteristics as discussed above, but the cluster-versus-heap decision should ultimately hinge on what is to be done with the data afterwards. On rare occasions, I have seen archive tables that are inserted into once and never used again, so this decision can depend entirely on insert performance. Otherwise, SELECT performance from the tables will be important. Let us assume that nonclustered indexes will be required in either case. If the queries that use the nonclustered index all happen to be covered (in the index key or in the include list), then there is probably not much difference.

For data residing in memory, if a key lookup (2005 terminology, bookmark lookup in 2000 terminology) is required, the key lookup into a clustered table is about 15-20% more expensive than a lookup into a heap (actual CPU cost; this is not reflected in the plan cost). This is because the key lookup to a clustered table involves another index seek on the cluster key, while a lookup to a heap is just a direct file-page-row lookup. Now with the cluster, accesses on the cluster key do not require a key lookup at all. The implication of this for SELECT performance is that when the percentage of index seeks that require a key lookup is sufficiently high, the heap is better; otherwise the cluster is more efficient. Based on the 15-20% higher cost for the cluster key lookup, this can only occur in a table with 6 or more indexes (assuming the cluster key was chosen correctly; all bets are off when you pick the wrong cluster key), and only if the index seeks that require a key lookup are relatively uniformly distributed over the various indexes. Examples that I can think of include a product table that has several description columns that might be searched on with little anticipated preference.
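
A small sketch of where this shows up in practice (table, index, and column names hypothetical): the same non-covered query, when the optimizer uses the non-clustered index, produces a Key Lookup operator against a clustered table and a RID Lookup operator against a heap.

    -- Non-clustered index on CustomerId only; the query also needs OrderDate,
    -- so every matching row requires a lookup into the base table.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);

    SELECT CustomerId, OrderDate
    FROM   dbo.Orders
    WHERE  CustomerId = 42;
    -- Clustered table: Index Seek + Key Lookup (a seek on the cluster key per row)
    -- Heap:            Index Seek + RID Lookup (direct file:page:slot access per row)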

Now if the data is on disk, the cluster key lookups for a group of rows could already be in sorted order, because the actual nonclustered index key is the explicit key plus the cluster key. So if we were searching for a specific value of the nonclustered key, we then have a list of the cluster keys in sorted order (assuming the extra covering columns are in the include list and not the nonclustered key, or you deliberately built the nonclustered index as nonclustered key, cluster key, then followed by covering columns). This would make the disk access (for 30+ rows, on which SQL Server will issue multiple IOs) much faster than random IO, because the sequence is something like: seek, skip a few tracks, read, skip a few, etc. I can get 800+ IOPS per disk this way. In the key lookup to a heap, the keys are not explicitly guaranteed to be in sorted order, regardless of the actual order on disk. In this case, the maximum high-queue-depth, short-stroke, pseudo-random performance per disk might be 500-600 IOPS. This is for direct-attach storage. All bets are off in a SAN.

If people are interested in this level of detail (and don’t have other work to do), I can discuss the implications of index intersections, where the cluster-heap organization affects the join type, hash or merge, which has additional implications in non-parallel or parallel plans.

Oh, BTW, the INSERT overhead for each nonclustered index is about 20% of the base insert cost, regardless of heap or cluster. So by the time you have 5 nonclustered indexes, you have approximately doubled the cost of your insert (gross ballpark estimate for planning purposes only). See my presentation from the 2003 Connections conference on the subject of quantitative insert and update performance analysis.
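
A rough way to see this for yourself (a sketch only; table names are hypothetical and the numbers will vary with schema and hardware): time the same batch of inserts into two otherwise identical tables, one without non-clustered indexes and one with several, and compare the elapsed times.

    SET STATISTICS TIME ON;

    -- dbo.Target_NoNCIX and dbo.Target_5NCIX are assumed to be identical tables,
    -- except that the second one has five non-clustered indexes.
    INSERT INTO dbo.Target_NoNCIX (Id, Col1, Col2)
    SELECT Id, Col1, Col2 FROM dbo.SourceRows;

    INSERT INTO dbo.Target_5NCIX (Id, Col1, Col2)
    SELECT Id, Col1, Col2 FROM dbo.SourceRows;

    SET STATISTICS TIME OFF;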

August 14, 2008 1:47 PM
 

Saggi Neumann said:

Tibor and Aaron - You can obviously use the bulk-logged recovery model and still have log backups (though no point-in-time restores).

Perhaps in ETL scenarios, where you would insert into a heap/CIX (which you'll switch into a partition) and only after you're done build the NCIX indexes, minimally logged inserts may be beneficial. It is unfortunate that minimally logged inserts into a B-tree were left out of RTM.

jchang - What about record forwarding? You might read a lot more pages with a RID lookup if there are a lot of forwarded records. In addition, as Scott mentioned, in order to eliminate those you must rebuild the heap.
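
For anyone who wants to check this on their own heaps, a sketch (table name hypothetical): sys.dm_db_index_physical_stats reports the forwarded record count for a heap (index_id 0) when run in SAMPLED or DETAILED mode.

    SELECT  index_id,
            forwarded_record_count,
            avg_fragmentation_in_percent
    FROM    sys.dm_db_index_physical_stats(
                DB_ID(), OBJECT_ID('dbo.OrdersHeap'), 0, NULL, 'DETAILED');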

August 15, 2008 2:31 AM
 

AaronBertrand said:

Of course in ETL scenarios there are many advantages to using a heap initially and creating indexes after the load is done... but this is usually only useful if the table starts out empty each time (and in a staging database that is somewhere in the middle of an ETL pipeline, you shouldn't need more than simple recovery anyway, since you should always be able to get back to that state easily).

But a permanent table that is part of your OLTP application should not stay a heap for this purpose, IMHO, unless there are other very good reasons to do so.  I have yet to find a case for leaving a table as a heap in any of the applications I've developed or worked with.  YMMV.

August 15, 2008 11:23 AM
 

Uri Dimant said:

Well, I think the good advice would be: "test it before using it".

August 17, 2008 1:47 AM
 

Max Mulawa said:

Hi Tibor,

A few days ago I ran into a problem with large fragmentation of a table (3 million rows, approx. 4 GB in size). First, I tried loading a heap with the 3M rows and then building only a clustered index on it. It was slow, but the problem with fragmentation was resolved. Another scenario was inserting data ordered by the clustered index keys into a clustered table; this performed faster than "heap + index build" and there was almost no fragmentation of the table.
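
For anyone who wants to repeat the comparison, a sketch of the two load strategies (table, index, and column names hypothetical):

    -- Approach 1: load into a heap, then build the clustered index afterwards.
    INSERT INTO dbo.BigTable (Id, Payload)
    SELECT Id, Payload FROM dbo.SourceData;
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable ON dbo.BigTable (Id);

    -- Approach 2 (alternative): create the clustered index first
    -- and insert the data in key order.
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable ON dbo.BigTable (Id);
    INSERT INTO dbo.BigTable (Id, Payload)
    SELECT Id, Payload FROM dbo.SourceData ORDER BY Id;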

August 20, 2008 10:44 AM
 

Denis Gobo said:

Wow, it has already been a year since I wrote A Year in Review, The 21 + 1 Best Blog Posts on SQLBlog

December 31, 2008 10:38 AM
 

Amit Performance Begineers said:

What if the clustered index is not on an ever-increasing key?

September 9, 2015 1:23 PM
 

TiborKaraszi said:

Then we can be pretty sure that it will be much slower compared to heaps and a clustered index on an ever-increasing key.
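
To illustrate the difference (a sketch only; table names hypothetical): an ever-increasing key appends new rows at the end of the clustered index, while a random key such as NEWID() scatters inserts across the B-tree and causes page splits, which is what makes it much slower.

    -- Ever-increasing cluster key: new rows always go to the last page.
    CREATE TABLE dbo.SeqKey  (Id INT IDENTITY PRIMARY KEY CLUSTERED, Payload CHAR(200));

    -- Random cluster key: new rows land on arbitrary pages, splitting them as they fill up.
    CREATE TABLE dbo.RandKey (Id UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY CLUSTERED, Payload CHAR(200));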

September 9, 2015 4:14 PM
