
Alberto Ferrari

SqlBulkCopy Performance Analysis

In the endless search for optimal performance in bulk loading data, I have written a paper that describes how to reach top performance with the SqlBulkCopy class in .NET, trying to find the best combination of its parameter settings to load heaps and clustered tables, with and without indexes, performing all the loads in a highly parallelized environment.

In order to produce the tests, I have written a simple solution that implements a producer/consumer pattern, where many producers and consumers can run in parallel, and that can be used to perform detailed tests by simply switching some parameters on and off. It is not a user-friendly program, just a tool that you can use to test different bulk insert operations. Nevertheless, I found it very useful in understanding how bulk insert works and how to optimize it.
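
For readers who have never used the class, the shape of a single SqlBulkCopy load is roughly the following. This is only a minimal sketch, assuming a placeholder connection string and a destination table named dbo.Destination, but it shows the main knobs the paper experiments with:

    using System.Data;
    using System.Data.SqlClient;

    static class BulkLoader
    {
        // Minimal single-threaded load; a test tool would run many of these in parallel.
        public static void Load(IDataReader source, string connectionString)
        {
            using (SqlBulkCopy bulk = new SqlBulkCopy(
                connectionString,
                SqlBulkCopyOptions.TableLock))                 // one of the options under test
            {
                bulk.DestinationTableName = "dbo.Destination"; // placeholder table name
                bulk.BatchSize = 100000;                       // rows per batch, another tunable setting
                bulk.BulkCopyTimeout = 0;                      // disable the timeout for long loads
                bulk.WriteToServer(source);                    // streams rows from any IDataReader
            }
        }
    }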

The paper does not go as deep as Microsoft's Data Loading Performance Guide in describing the internals of SQL Server and bulk loading. On the other hand, the Microsoft paper describes some esoteric hardware configurations that are pretty difficult to find in the real world, while my paper is much closer to a typical user experience. I think that reading both might help any ETL developer better understand how to boost the performance of their code. Mine is easier, Microsoft's is far more complete.

If you are interested in the topic, please follow the link to download the paper.

I am very interested in feedback and, if you find the time to run tests on your specific hardware, in any results that differ in some way from mine. So feel free to contact me with both, so that I can add different results to the paper and make the whitepaper more complete.

Published Monday, November 30, 2009 4:24 PM by AlbertoFerrari

Comments

Adam Machanic said:

Hi Alberto,

Just did a quick initial read through of your article. Looks like you did some really great work!

One surprising thing for me is your analysis of batch size. I have found in some cases that setting too large a batch size does impact performance negatively, and in these cases I saw a large amount of corresponding tempdb activity, I assume for an intermediate sort because I was loading into empty tables with a clustered index. Did you happen to do those tests only on a heap? You might also consider trying a much larger load--the full 60 million rows in one shot, instead of only 6 million--in order to really stress things and see the impact of that setting.
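
A sketch of the experiment suggested here, assuming a hypothetical testSource factory that yields the test rows and a hypothetical dbo.ClusteredTarget table; each iteration times the same load with a different BatchSize:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Diagnostics;

    static class BatchSizeTest
    {
        // Times the same load with several batch sizes; BatchSize = 0 sends
        // all rows as one single batch. The target should be truncated between
        // iterations for a fair comparison (omitted here).
        public static void Run(string connectionString, Func<IDataReader> testSource)
        {
            foreach (int batchSize in new int[] { 0, 10000, 100000, 1000000 })
            {
                Stopwatch watch = Stopwatch.StartNew();
                using (IDataReader source = testSource())
                using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
                {
                    bulk.DestinationTableName = "dbo.ClusteredTarget"; // empty table with a clustered index
                    bulk.BatchSize = batchSize;
                    bulk.WriteToServer(source);
                }
                watch.Stop();
                Console.WriteLine("BatchSize {0}: {1}", batchSize, watch.Elapsed);
            }
        }
    }

Watching tempdb activity while the loop runs may show whether a given batch size triggers the intermediate sorts mentioned above.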

November 30, 2009 9:53 AM

SDC said:

much more closer -> much closer

tx for the info!

November 30, 2009 10:01 AM

AlbertoFerrari said:

Adam,

Yep, there are a lot of tests that can be done. I focused on a couple of gigabytes of data, since that is what I normally find in the daily load of a small DWH, like the ones I am currently working on. Loading more data creates the need to run both producer and consumer at the same time, since I cannot keep all the rows in memory before sending them to SqlBulkCopy.

That said...  I'd love to see some more tests, if you have some spare time to do them :)
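
The streaming arrangement Alberto describes might look like the following sketch, assuming .NET 4's BlockingCollection and a hypothetical two-column destination table; the bounded queue makes the producer block whenever the consumer falls behind, so the full row set never sits in memory at once:

    using System;
    using System.Collections.Concurrent;
    using System.Data;
    using System.Data.SqlClient;
    using System.Threading.Tasks;

    static class StreamingLoad
    {
        public static void Run(string connectionString)
        {
            // Bounded queue: Add blocks when 100,000 rows are waiting, which
            // caps memory no matter how fast the producer runs.
            BlockingCollection<object[]> queue = new BlockingCollection<object[]>(100000);

            Task producer = Task.Factory.StartNew(() =>
            {
                for (int i = 0; i < 60000000; i++)             // e.g. a full 60M-row test
                    queue.Add(new object[] { i, "Row " + i });
                queue.CompleteAdding();                        // signals the consumer to stop
            });

            using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
            {
                bulk.DestinationTableName = "dbo.Destination"; // placeholder two-column table
                bulk.BatchSize = 100000;

                DataTable chunk = new DataTable();
                chunk.Columns.Add("Id", typeof(int));
                chunk.Columns.Add("Name", typeof(string));

                foreach (object[] row in queue.GetConsumingEnumerable())
                {
                    chunk.Rows.Add(row);
                    if (chunk.Rows.Count == 100000)            // flush one bounded chunk
                    {
                        bulk.WriteToServer(chunk);
                        chunk.Clear();
                    }
                }
                if (chunk.Rows.Count > 0)                      // flush the remainder
                    bulk.WriteToServer(chunk);
            }
            producer.Wait();
        }
    }

WriteToServer is called once per chunk here; an IDataReader wrapper around the queue would let a single call stream everything, at the cost of more code.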

November 30, 2009 11:22 AM

AlbertoFerrari said:

SDC... thanks a lot, corrected that. My spaghetti English is always alive and kicking! :))

November 30, 2009 11:23 AM

Tar said:

404 :(

November 4, 2011 10:21 AM

AlbertoFerrari said:

@Tar, sorry, during the site review the content was not published correctly; we are working on that. Drop me an email at alberto.ferrari@sqlbi.com if you need it. The link will be up again in a few days.

Alberto

November 4, 2011 11:15 AM

AlbertoFerrari said:

Thanks for the comment. I have updated the links so that they now point to the right place on the new site; they should work now.

Alberto

November 7, 2011 3:50 AM

Orlando Colamatteo said:

The performance test relies on the SqlBulkCopy.WriteToServer(IDataReader) method. This isolates the raw throughput of the object as compared to other methods; however, the technique will not easily translate to using SqlBulkCopy to process flat files. The ready example for loading flat files with SqlBulkCopy is to load the entire file into a DataTable before calling WriteToServer(DataTable), which implies the entire file must be loaded into memory, and that obviously becomes a limiting factor. In a real-world scenario where we want data in a flat file loaded into a database but cannot rely on being able to load the entire file into memory, I think we would need to use an ODBC text driver to produce an IDataReader, or possibly write our own parsing and buffering methods using something like TextStream.ReadLine so we can control how much of the file is loaded into memory at one time. In this scenario I would probably employ bcp, BULK INSERT or SSIS, since these tools have already implemented the necessary producer and consumer components.
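
To make the IDataReader route concrete, a minimal reader over a delimited flat file might look like the sketch below. It implements only the members SqlBulkCopy is believed to need for an ordinal (position-based) load, namely Read, FieldCount and GetValue, and stubs out the rest; the pipe delimiter and the fixed field count are assumptions:

    using System;
    using System.Data;
    using System.IO;

    sealed class FlatFileDataReader : IDataReader
    {
        private readonly StreamReader _file;
        private readonly int _fieldCount;
        private string[] _current;

        public FlatFileDataReader(string path, int fieldCount)
        {
            _file = new StreamReader(path);   // only one line is buffered at a time
            _fieldCount = fieldCount;
        }

        public bool Read()
        {
            string line = _file.ReadLine();
            if (line == null) return false;
            _current = line.Split('|');       // assumed pipe-delimited format
            return true;
        }

        public int FieldCount { get { return _fieldCount; } }
        public object GetValue(int i) { return _current[i]; }

        public void Close() { _file.Dispose(); }
        public void Dispose() { Close(); }
        public bool IsClosed { get { return false; } }
        public int Depth { get { return 0; } }
        public int RecordsAffected { get { return -1; } }
        public bool NextResult() { return false; }

        // Remaining members are not needed for an ordinal load; stubbed out.
        public DataTable GetSchemaTable() { throw new NotSupportedException(); }
        public object this[int i] { get { return GetValue(i); } }
        public object this[string name] { get { throw new NotSupportedException(); } }
        public string GetName(int i) { return "Column" + i; }
        public int GetOrdinal(string name) { throw new NotSupportedException(); }
        public string GetDataTypeName(int i) { throw new NotSupportedException(); }
        public Type GetFieldType(int i) { return typeof(string); }
        public int GetValues(object[] values) { throw new NotSupportedException(); }
        public bool GetBoolean(int i) { throw new NotSupportedException(); }
        public byte GetByte(int i) { throw new NotSupportedException(); }
        public long GetBytes(int i, long fo, byte[] buf, int off, int len) { throw new NotSupportedException(); }
        public char GetChar(int i) { throw new NotSupportedException(); }
        public long GetChars(int i, long fo, char[] buf, int off, int len) { throw new NotSupportedException(); }
        public Guid GetGuid(int i) { throw new NotSupportedException(); }
        public short GetInt16(int i) { throw new NotSupportedException(); }
        public int GetInt32(int i) { throw new NotSupportedException(); }
        public long GetInt64(int i) { throw new NotSupportedException(); }
        public float GetFloat(int i) { throw new NotSupportedException(); }
        public double GetDouble(int i) { throw new NotSupportedException(); }
        public string GetString(int i) { return (string)GetValue(i); }
        public decimal GetDecimal(int i) { throw new NotSupportedException(); }
        public DateTime GetDateTime(int i) { throw new NotSupportedException(); }
        public IDataReader GetData(int i) { throw new NotSupportedException(); }
        public bool IsDBNull(int i) { return _current[i].Length == 0; }
    }

Passing new FlatFileDataReader(path, 3) to WriteToServer then streams the file one line at a time instead of materializing it in a DataTable.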

Regarding the SSIS test, I noticed you did not mention the DefaultBufferMaxRows or DefaultBufferSize properties of the Data Flow Task, and they are set to their default values in the code download. These can help SSIS performance for large sets of data.

It would depend on a lot of factors (as usual), but when loading flat files I would be hard-pressed to find a reason to use SqlBulkCopy over other off-the-shelf tools for anything but a corner case. I can see using SqlBulkCopy in a wholly memory-resident process and have used it that way in my own applications. I doubt it was your aim to address flat-file processing, but despite the performance findings I would question the practicality of using SqlBulkCopy for loading flat files into SQL Server.

Nice writeup!

May 31, 2012 4:10 PM


About AlbertoFerrari

Alberto Ferrari is a Business Intelligence consultant. His interests lie in two main areas: BI development lifecycle methodologies and performance tuning of ETL and SQL code. His main activities involve SSIS and SSAS for the banking, manufacturing and statistical sectors. He is also a speaker at international conferences such as the European PASS Conference and the PASS Summit.