
Alberto Ferrari

Making Fast Load really fast on clustered indexed tables with SSIS

I have been so used to using int identity fields as the primary key for any table, and to believing they are the fastest way to define a clustered primary key, that I never thought they could cause serious performance problems for SSIS. Then I tested it and discovered that int identity primary keys are among the slowest ways to insert a huge amount of data with SSIS. In this post I'll describe the technique that – from my tests – is the fastest way to insert data into tables with a clustered index using SSIS.

As a test case I used a very simple table with only two fields:

CREATE TABLE [dbo].[FastLoadTests](
      [Id] [int] IDENTITY(1,1) NOT NULL,
      [TestString] [varchar](2048) NOT NULL
) ON [PRIMARY]

 

I filled it with a very simple task that generates one million monotonically increasing Ids and TestStrings of exactly 512 bytes each, sending them to an OleDb destination adapter that uses Fast Load to fill the table.
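
Just to give an idea of the shape of the data, this is a rough T-SQL equivalent of what the SSIS source produces (only a sketch for illustration – in the actual test the rows are generated inside the package, not by a query):

-- Approximate T-SQL equivalent of the SSIS data-generation task:
-- one million monotonically increasing Ids, each paired with a 512-byte string.
;WITH Numbers AS (
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Id
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
SELECT
    Id,
    CAST(REPLICATE('X', 512) AS varchar(2048)) AS TestString
FROM Numbers
ORDER BY Id;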

If the table has no index at all, the package runs in 18.5 seconds – pretty nice. Then I created an index on the table like this:

CREATE UNIQUE CLUSTERED INDEX [FastLoadTests_ClusteredIndex] ON [dbo].[FastLoadTests] ([Id] ASC) ON [PRIMARY]

Then I ran the package once more. The execution time is now 1.08.5 (one minute and eight seconds), more or less four times slower. Since SQL Server has to sort one million rows, I was expecting poor performance. My first attempt was to reduce the number of rows to sort at a time.

Looking at the various options of the OleDb destination adapter, it is easy to find a parameter (Maximum Insert Commit Size) that defaults to 0.

[Screenshot: OleDb destination editor – Maximum Insert Commit Size property]

Setting it to 10,000 reduced the execution time to 44.5 seconds, a better result but still more than twice as slow as the table with no clustered index. Is this the best result? Not at all! Let's try something different.

Reading (carefully) the MS documentation about the OleDB destination adapter (http://msdn2.microsoft.com/en-us/library/ms141237.aspx), you'll discover that you can gain performance if the input data is sorted according to the clustered index on the table, by specifying the ORDER option in the advanced editor.

After some trials I managed to set it with the advanced editor:

[Screenshot: OleDb destination advanced editor – ORDER fast load option]
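
For reference, the setting ends up in the FastLoadOptions property of the destination in the advanced editor; the value is something along these lines (shown only as an illustration – adjust the column name to your own clustered key, and note that other fast load options such as TABLOCK can be combined with it):

FastLoadOptions:   TABLOCK, ORDER(Id ASC)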

Surprisingly, when I ran the package, I got this error:

[Test Table [44]] Error: SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80004005. An OLE DB record is available.  Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Could not bulk load. The sorted column 'Id' is not valid. The ORDER hint is ignored.".

So, it seems that setting this option on an INT IDENTITY field causes Fast Load to fail. Really surprising, because apparently SQL Server does not know that the keys it generates itself will be sorted! Removing the Identity setting from the table and using a package-generated Id resulted in an execution time of 46.23 seconds: no performance gain at all. Still, as it seemed an interesting avenue to explore, I kept trying.
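
For clarity, from this point on the table is the same as before but without the IDENTITY property, since the Id is now generated by the SSIS package (the exact DDL is not shown above, so take this as a sketch):

CREATE TABLE [dbo].[FastLoadTests](
      [Id] [int] NOT NULL,                  -- Id is now generated by the SSIS package
      [TestString] [varchar](2048) NOT NULL
) ON [PRIMARY]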

The last and decisive trial was REMOVING the Maximum Insert Commit Size parameter from the destination adapter (i.e. setting it back to its default of 0). Running the package again, I got an execution time of 19.07 seconds, more or less the same as inserting the data with no clustered index at all. Got it! Now I have a really fast way to insert data.

The results are summarized in this table:

                              No Index    Index         Index
                              0 MICS      10,000 MICS   0 MICS
  Int Identity key            18.85       41.15         1.08.57
  SSIS generated key          18.59       44.54         1.10.03
  SSIS key and ORDER option   18.85       46.23         19.07

*MICS = Maximum Insert Commit Size

You can easily see that the ORDER option gives a dramatic speed improvement when you have a clustered index, but ONLY if you are able to generate the sort key in the SSIS package and DO NOT USE the MICS parameter.

I think the documentation for this behavior should be far more accessible, and the ORDER parameter is so important that it should appear on the first page of the standard editor for the OleDb destination. I hope this will be done in the next release of SQL Server.

Moreover, since using the MICS parameter leads to a dramatic performance degradation in this scenario, I think the OleDb destination adapter should issue a warning when both the ORDER and MICS parameters are set, so as to make the programmer think twice before using them together.

Published Sunday, April 08, 2007 9:48 AM by AlbertoFerrari

Comments

 

Alberto Ferrari said:

In a previous post I spoke about the advantages of having sorted flows in SSIS to greatly speed up data

July 23, 2007 3:26 AM
 

Alberto Ferrari said:

In a previous post I showed how to make bulk insert faster using SSIS. During Sql Server Conference 2007

July 23, 2007 3:31 AM
 

Bill said:

I plan on testing this, but I have one concern. 1 million rows really isn't that much. It's fine for testing, but I have a real-world case of dealing with 350 million rows. I will not leave the max commit size set to 0 because 350 million rows is simply too much for the DB's transaction log to handle.

April 25, 2008 4:47 PM
 

AlbertoFerrari said:

Bill,

AFAIK, with MICS set to 0 and sorted input, the whole operation is minimally logged, and the pressure on the log is really low.
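
If you want to see it for yourself, a quick (and admittedly rough) check is to look at log space usage before and after the load:

-- Reports transaction log size and percent used for every database;
-- run it before and after the package to see how much log the load generated.
DBCC SQLPERF(LOGSPACE);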

I would try different values for MICS, including 0, giving it a chance. :)

Moreover, I'd like to know the results of your tests.

April 25, 2008 6:34 PM
 

Amol said:

This solution is neat and works! I had to load 25 million rows from Table A to Table B; the primary key of the table is a GUID. I specified ORDER BY ([PKGUID]) ASC and the loads were a lot faster than without using the ORDER BY in the Advanced properties.

Thanks,

Amol

January 14, 2009 3:17 PM
 

prakash said:

When I execute a DTS package on SQL 2008, I get this error - Description: The Maximum insert commit size property of the OLE DB destination "OLE DB Destination" is set to 0. This property setting can cause the running package to stop responding. For more information, see the F1 Help topic for OLE DB Destination.

Where is the OLE DB destination in Management Studio, so I can change the Maximum insert commit size property from 0 to some other value?

Please help

Thanks

Prakash Gupta

prakashg@techlink.in

February 9, 2009 2:57 AM
 

Steve said:

Alberto,

Good work.  With the max insert commit size set to zero, SSIS attempts to perform the entire commit from memory; so, if the size of your data fits entirely in memory, then the zero commit size plus the sort key "hint" probably did the trick for you.

I think the row size, number of rows per buffer, any copied columns, and machine word boundary are all the factors that would contribute to the memory used by your dataset.

Any idea what would have happened with considerably more data (maybe 45 million rows) and a zero commit size batch?

I'm not sure if SSIS would attempt to pick an ideal commit size instead of spooling to disk. Perhaps a trace would tell us the whole story :)

December 12, 2009 5:32 PM
 

drsql said:

Wow, can't thank you enough, this has really improved my data movement SSIS packages I am working on. Being a non-tool guy and usually working in SQL alone, I miss a lot of these nuances when I am forced to code :)

March 29, 2010 1:57 PM
 

Mark White said:

Alberto - you're a God send.  I've been battling for months (literally!) trying to improve the load times of a monster data warehouse, and I totally agree that this information about MICS vs ORDER on the OLEDB/SQL destinations needs to be properly, prominently documented.

Thank you so much for sharing your findings with the rest of us!

June 28, 2010 1:10 PM
 

Mighty said:

I don't know why, but this is increasing the load time for me.

mirzafahad1409@hotmail.com

September 2, 2011 12:20 PM
 

AlbertoFerrari said:

@Mighty,

You are using the simple recovery model for the databases, aren't you?

September 2, 2011 4:11 PM
 

SQLBI - Marco Russo said:

Alberto Ferrari posted some test we made working on performance optimization loading data in very large

November 25, 2011 12:07 PM
 

Mighty said:

Alberto,

Sorry for replying after 3 months; I did not visit the site again.

No, we are using the Full recovery model for our databases.

December 9, 2011 6:08 PM
 

AlbertoFerrari said:

@Mighty,

You need the simple or bulk-logged recovery model to speed up the inserts; take a look here: http://msdn.microsoft.com/en-us/library/ms190422.aspx.

In full recovery, bulk operations are fully logged, and you don't get any speed boost.
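
Something along these lines (the database name and backup path are just placeholders; remember to take a log backup after switching back to full recovery):

-- Switch to bulk-logged for the load, then back to full recovery afterwards.
ALTER DATABASE [MyDatabase] SET RECOVERY BULK_LOGGED;
-- ... run the SSIS package here ...
ALTER DATABASE [MyDatabase] SET RECOVERY FULL;
-- Take a log backup to restart the log backup chain.
BACKUP LOG [MyDatabase] TO DISK = N'C:\Backup\MyDatabase_log.trn';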

December 9, 2011 6:11 PM
 

Mighty said:

Thanks Alberto,

That's true. Yes, we are planning to go for the Simple recovery model for some of our not-so-important publishing DBs. Hope things move faster now. Other stuff that's making my data load slow is that I have 3 xml data type columns in my table, moving around 1 million rows (10 GB) of data.

December 9, 2011 6:35 PM
 

Anil said:

I have a table with 350 million records and a composite primary key of 4 columns. I have an SSIS package inserting an average of 3 to 4 million records into this table every day. I've used an "OLEDB Destination" with Data Access Mode: Table or View – fast load, and the package is taking 4 to 5 hours to execute this task. I have tried different options, as below:

• Uncheck “Check Constraints”

• Set “Maximum insert commit size” to 0

• Given “ORDER” (with 4 Primary Key column) in FastLoadOptions

I have also tried the "SQL Server Destination" instead of the "OLEDB Destination", but without much performance gain. My server has a good configuration (Windows Server 2003 R2 Enterprise x64 Edition SP2 with 32 GB RAM).

Is there any way I can gain performance?

June 22, 2012 1:57 PM
 

AlbertoFerrari said:

@Anil,

While performance depends on really many factors, taking more than one minute for one million rows is normally an indicator of something wrong in the design. Taking hours is really problematic.

I cannot give you an answer based on the little information I have, but I strongly suggest you ask a good consultant for an assessment of the solution. I think there is a lot of room for improvement, but it needs more than this text box to perform a decent analysis.

June 22, 2012 2:31 PM
 

Anil said:

Actually there are no extra steps in the package: an "OLEDB Source" which executes a stored procedure and passes the rows into a "Row Count" transformation and then to the "OLEDB Destination". I checked the SP and it is not taking much time; when I remove the "OLEDB Destination", the package executes quickly up to the "Row Count".

June 22, 2012 3:36 PM
 

Anil said:

Actually the package is very simple: it executes an SP and gets the result through an OLEDB Source, then goes to the "Row Count" and then to the "OLEDB Destination". The SP executes fast, and I can see this when I remove the "OLEDB Destination" from the Data Flow.

June 22, 2012 5:57 PM
 

AlbertoFerrari said:

@Anil,

As I said, any assessment requires time and patience; the problem can be anywhere, and finding it is not an easy task. But taking more than a few minutes for a few million rows is something that can probably be fixed.

This is not the right place to talk about it; if you need specific assistance, please drop me an email, so we can discuss it privately.

June 24, 2012 5:12 PM
 

Ed Allison said:

This method has been extremely useful to me, reducing a regular incremental load job by many hours.  Thank you very much, Alberto.

June 5, 2013 10:45 AM
 

Anna said:

Hi,

I've tried to apply the suggested method for 1M records loaded from a flat file and inserted into a table after some simple validations. My row size is 12,716 bytes (each!).

First of all I had to remove the error redirect (the red arrow), since it cannot coexist with MICS=0.

Then, when I ran the package, it processed all rows in 2 minutes. However, the commit took another 20 minutes, since it commits all rows at once because of MICS=0.

I am not sure what the difference is compared to your example, except that my rows are much heavier.

Do you have an idea what can be changed here?

When I use no PK, the whole thing takes 7 minutes, but I do need the index for a later stage.

Thanks in advance.

November 19, 2013 8:57 AM
 

AlbertoFerrari said:

Anna, do you have the recovery model of the database set to SIMPLE? If not, logging will happen anyway and you will never be able to get good performance.
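
A quick way to check it (the database name is just a placeholder):

-- Shows the current recovery model of the database.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'YourDatabase';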

November 19, 2013 10:06 AM
 

Preston said:

Alberto, I'm currently attempting a load of >600 million rows into a fact table with a clustered index (non-clustered primary key with a bigint identity). I've built an adjustment fact with the following clustered key: snapshotdate (the package StartTime variable), rowsource (a char(2) indicating the load package), and the dimdatekey for the business event date. Every time a historical row changes, I reload all rows related to that key as an insert op (I have another fact table that I join to to get the "current" view). I make sure to sort all of my data by the clustered key in my OLE DB source, keep the sort through the data flow, and have added the hints as you noted here. My question: is there any way to parallelize the write operation, or run a trace on the system to determine that it is performing optimally? With the above settings, my ETL server (which has sizeable resources: 8 CPUs, 32 GB RAM) sits at 9% CPU. I'm testing other ways to do things now; loading parallel tables I can get the ETL server's CPU maxed, more than halving the load time, but that requires expensive post-processing SQL ops to finish the job. Any hints?

March 20, 2014 2:14 AM


About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. His interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker at international conferences such as the European PASS Conference and the PASS Summit.