THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Alberto Ferrari

Bulk Insert of sorted data

In a previous post I showed how to make bulk insert faster using SSIS. During Sql Server Conference 2007 I had the great opportunity to double check it with Stefano Stefani that analyzed the problem in greater detail and finally came up with both an explanation of the problem and a simple workaround when facing the same problem with T-SQL (my solution is still good with SSIS but Stefano’s one is very effective with T-SQL).

The problem is related to bad performances when inserting rows into a table that has a clustered index (typically a primary key) based on an INT IDENTITY field and is easily shown in this example:

Create Table Destination (
ID_Test Int Identity (1,1) Primary Key,
Cod_Test VarChar (20),
Test VarChar (100))
Go

Create Table RowsToInsert (
Cod_Test VarChar (20),
Test VarChar (100))
Go

 

Now, suppose you have 1.000.000 of rows in RowsToInsert, if you analyze the execution plan of this simple insert:

INSERT 
    Destination
SELECT 
    Cod_Test, 
    Test
FROM
    RowsToInsert
OPTION (RECOMPILE)

 

You will get this: BulkInsertWithSort

As you can see, the data is sorted before being inserted into the table even if the IDENTITY specification of the primary key guarantees to SQL Server that the data is already sorted. Of course, sorting one million rows takes a long time.

The problem is that SQL Server can insert rows in parallel mode into the table, thus there is no guarantee that the IDENTITY fields will be generated sequentially. This leads to very poor performances when inserting a huge amount of data, as is normally the case in DWH tasks.

A very effective workaround, when facing the problem in a T-SQL task, is that of sorting the data based on a dummy column, as in this example:

INSERT 
    Destination
SELECT 
    Cod_Test, 
    Test
FROM
    (SELECT 1 A, * FROM RowsToInsert) Q
ORDER BY A
OPTION (RECOMPILE)

 

The execution plan is now: BulkInsertWithoutSort

The sort task disappeared, as now SQL Server knows that the IDENTITY fields will be generated sequentially because the input is sorted.

The final consideration is that you can have a dramatic speed gain in huge INSERT operations with SQL Server into clustered indexed tables if you provide SORTED SELECTS to the INSERT operation. When the clustered index is based on an INT IDENTITY field you can get the same speed by sorting on a dummy field, instructing SQL Server that the IDENTITY columns will be computed sequentially and there is no need to sort the data.

Published Monday, July 02, 2007 9:48 AM by AlbertoFerrari
Filed under: ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

rg said:

Hi Thanks

Here's a way to bulk insert from a dataset to an sql database :

http://www.dataportweb.com/post/Sql-bulk-insert-and-paramaters.aspx

September 18, 2008 1:41 PM
 

SQLBI - Marco Russo said:

A few months ago I wrote about a SSIS setting you have to use just to get better performance when you

November 25, 2011 12:11 PM

Leave a Comment

(required) 
(required) 
Submit

About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. He his interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities are with SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker in international conferences like European PASS Conference and PASS Summit.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement