It’s a well-known good practice to control the batch size when you perform large data changes. For instance, you may need to purge a large amount of data monthly, and if you delete it all in one shot, you may blow up your transaction log. Therefore, it's wise to chop up the total amount of data to be deleted into smaller chunks and delete each chunk in a loop as a separate transaction.
The size of that chopped-up data chunk may go by different names, such as the transaction commit size. In this post, I refer to it as the batch size.
There is no problem with this approach in general. However, exactly what constitutes the optimal batch size may change over time.
Recently, I worked on a data archiving project whose batch process had been in production for a long time. To contain the transaction log size, many steps of this data archiving batch process employed the above general idea to perform smaller deletes in loops. In most cases, the batch size was controlled in terms of the number of rows deleted, and it was generally set to around 200,000 rows.
That might have been a good choice at the time (several years ago).
Today, disk drives are generally much larger and servers are much more powerful with much more physical memory. Is setting the delete batch size to 200,000 rows still optimal for this batch process? Well, first of all, it should be noted that going by row count alone is often not a best practice, because ultimately the concern is the amount of transaction log space that a transaction may take up, and given a very narrow table, deleting 1,000,000 rows in one transaction may not consume much log space.
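Rather than guessing from row counts, you can measure actual log consumption directly. The sketch below checks log space used before and after a trial delete; `sys.dm_db_log_space_usage` is available in SQL Server 2012 and later, and on older versions `DBCC SQLPERF(LOGSPACE)` reports similar figures.

```sql
-- Gauge how much transaction log space the current database is using.
-- Run this before and after a representative delete to estimate the
-- log cost per batch, instead of relying on row count alone.
SELECT
    total_log_size_in_bytes / 1048576.0  AS total_log_mb,
    used_log_space_in_bytes / 1048576.0  AS used_log_mb,
    used_log_space_in_percent
FROM sys.dm_db_log_space_usage;
```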
The other factor worth noting is that the smaller the batch size, the more loops one needs to go through, and therefore the more table scans, resulting in a longer processing time. So you don’t want the batch size to be so large that it blows up the transaction log, but at the same time you don’t want it to be so small that the batch job takes a long time to finish. That is, you need to throttle the batch size to achieve the optimal overall result.
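The looped delete pattern described above can be sketched in T-SQL as follows. This is a minimal illustration, not the project's actual code: the table `dbo.AuditLog`, the `EventDate` column, and the twelve-month cutoff are all hypothetical.

```sql
-- Delete old rows in fixed-size batches, each batch its own transaction,
-- so the log can be truncated (or backed up) between iterations.
DECLARE @BatchSize   int = 200000;  -- the throttle discussed above
DECLARE @RowsDeleted int = 1;

WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.AuditLog
    WHERE EventDate < DATEADD(MONTH, -12, GETDATE());

    -- @@ROWCOUNT is 0 once no qualifying rows remain, ending the loop.
    SET @RowsDeleted = @@ROWCOUNT;
END;
```

Each pass through the loop commits independently, which is what keeps the log bounded; the trade-off, as noted above, is that each pass may rescan the table.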
So back to the data archiving project I was working on some time ago. In that case, I ended up eliminating the loops altogether. This increased the log usage to a maximum of 40~50GB or maybe slightly more, depending on the volume of the month. With a 100GB drive dedicated to the transaction log and additional overflow space elsewhere, using up 40~50GB did not seem to be a big deal. But it ended up reducing the overall data purging time from four hours to ~40 minutes. And that is a big deal!
At least in this one particular case, it paid to revisit the transaction batch size.