
Linchi Shea

Checking out SQL Server via empirical data points

Revisiting the transaction batch size

It's a well-known good practice to control the batch size when you perform large data changes. For instance, you may need to purge a large amount of data monthly, and if you delete it all in one shot, you may blow up your transaction log. It's therefore wise to chop the total amount of data to be deleted into smaller chunks and delete each chunk in a loop as a separate transaction.
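
As a rough sketch of that pattern (the table dbo.Foo, the PurgeDate predicate, and the 100,000-row batch size here are all hypothetical placeholders, not the actual archiving code), the loop typically looks something like this:

-- Minimal sketch of a batched purge; table, column, and batch size are hypothetical.
DECLARE @BatchSize int;
SET @BatchSize = 100000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.Foo
    WHERE PurgeDate < DATEADD(MONTH, -12, GETDATE());

    -- Stop once a pass deletes fewer rows than a full batch.
    IF @@ROWCOUNT < @BatchSize BREAK;
END;

Each DELETE commits as its own transaction, so the log can be truncated between iterations (by a checkpoint under the simple recovery model, or by log backups under full).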

 

The size of that chopped-up data chunk may go by different names, such as the transaction commit size. In this post, I refer to it as the batch size.

 

There is no problem with this approach in general. However, exactly what constitutes the optimal batch size may change over time.

 

Recently, I worked on a data archiving project whose batch process had been in production for a long time. To contain the transaction log size, many steps of this data archiving batch process employed the general idea above, performing smaller deletes in loops. In most cases, the batch size was controlled in terms of the number of rows deleted, and it was generally set to around 200,000 rows.

 

That might have been a good choice at the time (several years ago).

 

Today, disk drives are generally much larger and servers are much more powerful, with much more physical memory. Is setting the delete batch size to 200,000 rows still optimal for this batch process? Well, first of all, it should be noted that going by the row count alone is often not a best practice, because ultimately the concern is the amount of transaction log space that a transaction may take up; given a very narrow table, deleting 1,000,000 rows in one transaction may not consume much log space.
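
If the real constraint is log space, one way to pick a batch size is to measure how much log a representative batch actually generates, rather than guessing from row counts. A minimal sketch (the table and predicate are hypothetical; DBCC SQLPERF(LOGSPACE) and the sys.dm_tran_* DMVs are standard instrumentation):

-- Measure the log impact of one representative batch before settling on a batch size.
DBCC SQLPERF(LOGSPACE);   -- overall log size and percent used, per database

BEGIN TRANSACTION;

DELETE TOP (100000)
FROM dbo.Foo
WHERE PurgeDate < DATEADD(MONTH, -12, GETDATE());

-- Log bytes used/reserved by this session's open transaction (requires VIEW SERVER STATE).
SELECT d.database_transaction_log_bytes_used,
       d.database_transaction_log_bytes_reserved
FROM sys.dm_tran_database_transactions AS d
JOIN sys.dm_tran_session_transactions AS s
    ON d.transaction_id = s.transaction_id
WHERE s.session_id = @@SPID
  AND d.database_id = DB_ID();

COMMIT TRANSACTION;

Scaling that per-batch figure up gives a rough idea of how large a batch the log drive can tolerate, regardless of how wide or narrow the rows are.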

 

The other factor worth noting is that the smaller the batch size, the more loop iterations you need, and therefore the more table scans, resulting in a longer processing time. So you don't want the batch size to be so large that it blows up the transaction log, but at the same time you don't want it so small that the batch job takes forever to finish. In other words, you need to throttle the batch size to achieve the best overall result.

 

So back to the data archiving project I was working on some time ago. In that case, I ended up eliminating the loops altogether. This increased the log usage to a maximum of 40~50GB, or maybe slightly more, depending on the volume of the month. With a 100GB drive dedicated to the transaction log and additional overflow space elsewhere, using up 40~50GB did not seem to be a big deal. But it ended up reducing the overall data purging time from four hours to ~40 minutes. And that is a big deal!
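
Conceptually, eliminating the loops just means issuing the purge as a single set-based statement and letting that one transaction take whatever log it needs; a minimal sketch with the same hypothetical names as above (the exact statement used in the project isn't shown here):

-- One transaction for the whole month's purge; table and predicate are hypothetical.
DELETE FROM dbo.Foo
WHERE PurgeDate < DATEADD(MONTH, -12, GETDATE());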

 

At least in this one particular case, it paid to revisit the transaction batch size.

Published Sunday, August 01, 2010 12:05 AM by Linchi Shea


Comments

 

Uri Dimant said:

Hi Linchi

You said that you eliminated the loop, so what is your T-SQL statement?

WHILE 1 = 1
BEGIN
  DELETE TOP(100000)
  FROM Foo
  WHERE <range on the cikey >
  IF @@ROWCOUNT < 2000 BREAK;
END

August 1, 2010 1:26 AM
 

Chris Howarth said:

Don't forget that log files can't be instantly initialised during creation, so if there is a requirement to keep database restore time (and therefore downtime) to a minimum, then during the restore process a 50GB log file will obviously take a lot longer to create than, say, a 5GB one.

Just one thing to consider before taking this approach.

Chris

August 1, 2010 2:25 AM
 

Adam Machanic said:

Hi Linchi,

I think more testing is in order. Even with plenty of disk space I've had great success in using smaller batch sizes to significantly speed up large operations... I'm not sure what all of the determining factors are--it would be very interesting to figure out a set of best practices for this kind of thing.

August 2, 2010 4:55 PM

