THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Aaron Bertrand

Bad habits to kick : using loops to populate large tables

In my last post in this series, I talked about using ORDER BY <ordinal position>.  Today I'd like to talk about using primitive loops to generate a large number of rows.

Okay, I will admit, in most cases you are doing this for a demo or proof of concept, and so speed / performance is not all that important.  But using a loop to populate a table with 1 million rows, one row at a time, is not fun at all, no matter how much thumb-twiddling you have to do.  I often see people building up test cases like this:

SET NOCOUNT ON;

CREATE TABLE dbo.foo(bar INT);

DECLARE
@i INT,
@UpperLimit INT;

SELECT
@i 1,
@UpperLimit 1000000;

WHILE @i <= @UpperLimit
BEGIN
  INSERT 
dbo.foo(barVALUES(@i);
  
SET @i @i 1;
END

For 500 or 50,000 rows, this probably wouldn't be a big deal.  But when you get into the bigger numbers, you are going to be spending a long time waiting for the inserts to finish. So how do you get around it?

I have long been an advocate of an auxiliary numbers table.  It makes quick work of generating dummy data, does not require a lot of storage space, and is easy to join against for various reporting-type functions.  Let's say you create a numbers table with 1,000,000 rows, then you can change the above code to:

SET NOCOUNT ON;

CREATE TABLE dbo.foo(bar INT);

DECLARE @UpperLimit INT;
SET @UpperLimit 1000000;

INSERT dbo.foo(bar)
 
SELECT [Number]
  
FROM dbo.Numbers
  
WHERE [Number] BETWEEN AND @UpperLimit

However, there are many people who fundamentally object to adding a table to their schema, even if they put it in some utility database.  [As an aside, I see the exact same objections from people who don't think it's worth it to store a calendar table in their systems, even though it extremely simplifies date calculations such as business days between two dates, taking into account things like Easter and company holidays.]

Well, thankfully, there are some clever workarounds to this that let you construct a numbers table on the fly (and in fact you need to use something like this to populate the numbers table in the first place, without having to use a loop).  Here is one that I have used in previous posts:

SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit 1000000;

WITH AS
(
    
SELECT
       
x ROW_NUMBER() OVER
        
(ORDER BY s1.[object_id])
    
FROM  sys.objects AS s1
    
CROSS JOIN sys.objects AS s2
    
CROSS JOIN sys.objects AS s3
)
SELECT [Number] = x
INTO dbo.Numbers
FROM n
WHERE x BETWEEN 1 AND @UpperLimit;

GO

CREATE UNIQUE CLUSTERED INDEX 
ON dbo.Numbers([Number]);

And here is one using a set of CTEs and no tables, based on code I stole from Itzik Ben-Gan:

SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit 1000000;

WITH
   
n5  (xAS (SELECT UNION SELECT 0),
  
n4  (xAS (SELECT FROM n5 CROSS JOIN n5 AS x),
  
n3  (xAS (SELECT FROM n4 CROSS JOIN n4 AS x),
  
n2  (xAS (SELECT FROM n3 CROSS JOIN n3 AS x),
  
n1  (xAS (SELECT FROM n2 CROSS JOIN n2 AS x),
   
n0  (xAS (SELECT FROM n1 CROSS JOIN n1 AS x),
   
Nbrs (xAS
   
(
       
SELECT
          
ROW_NUMBER() OVER
           
(ORDER BY x)
       
FROM n0
   
)
SELECT [Number] = x
  
INTO dbo.Numbers
  
FROM Nbrs
  
WHERE BETWEEN AND @UpperLimit;

GO

CREATE UNIQUE CLUSTERED INDEX ON dbo.Numbers([Number]);

So, you could use either of these examples either (a) to populate your own numbers table, which you can later use instead of loops; or (b) directly, every time you need to simulate a loop.  I prefer (a) because, if you use it enough, the numbers table will reside in memory, have statistics, etc.  For full disclosure (and not to say my method is better), I compared the actual execution plans of both methods when using 1,000,000 rows as my upper limit.  The comparison is as follows:

  CROSS JOIN CTE
Query cost (relative)
10% 90%
Cached plan size (bytes)
88
176
CompileCPU (ms) 28 50
CompileMemory (KB) 2336 3376
CompileTime (ms) 28 50
Estimated Subtree Cost
0.0206647 0.193489

And the most visible difference is the query plan (click to enlarge).  In case it's unclear, I'll give you a hint: the cross join of sys.objects is at the top.

 

Again, these aren't things you need to be concerned about if you are only populating the numbers table once.  But if you are using these methods on the fly, the difference in performance could be significant.

If you require <= 32,767 rows, you can use this alternative syntax, employing a recursive CTE.  It is much more efficient, but has a cap of 32,767 due to the limit on the MAXRECURSION option:

SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit = 32767;

WITH AS
(
   
SELECT x 1
  
UNION ALL
   
SELECT x = x + 1
FROM n
     
WHERE x @UpperLimit
)
SELECT [Number] = x
  
INTO dbo.Numbers
  
FROM n
  
OPTION (MAXRECURSION 32767);
 
GO

CREATE UNIQUE CLUSTERED INDEX ON dbo.Numbers([Number]);

I'll admit, this isn't the easiest syntax to memorize, so you will likely want to keep a copy of it in your toolbox.  If you are going to continue using a loop, at the very least, please get in the habit of adding SET NOCOUNT ON statements to your code.  This way, at least, that big loop won't be wasting all that effort passing "1 row(s) affected" to the client n times, and the client won't be wasting all that effort printing it out to you n times.

Now don't get me wrong, I am not saying that loops are a bad thing.  In fact for a lot of things they are the right answer, and for some things they are even the only answer.  But in cases where things don't have to happen one at a time, they shouldn't happen one at a time.

I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code.  Up next: using old-style joins.


Published Wednesday, October 07, 2009 7:17 PM by AaronBertrand

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Linchi Shea said:

Aaron;

If we are talking about performance and simplicity, I'd actually argue that your very first method with a loop is actually the best (with some minor change).

October 7, 2009 9:06 PM
 

Scott Whigham said:

I like the breakdown - nice post. I don't think that, even if you showed people that performance of a loop was 500 times worse than using a recursive CTE, that it would change more than 2% of the population's technique of doing this. Why? As you pointed out, it's not a technique that people will generally be able to memorize. My "benchmark" on stuff like this is to imagine that I've just learned this technique on December 1 and then I take the whole month off and go to the Bahamas. When I come back to work in mid-January, will I remember this technique? If the answer is, "No", then I do one of two things: I either (a) do as you said and add a script to my "toolkit", or (b) I just forget about it. Your final example has been in my toolkit for several years but when it comes time to write a loop on the fly in a demo, I reach for the declarative technique (that you showed initially).

If you have a numbers table that covers your min/max values, the simplest technique, for me, is to just use SELECT INTO - no need for more fancy options IMO. I don't even need the SET NOCOUNT ON with such a query:

 SELECT [Number]

 INTO Foo

 FROM dbo.Numbers

 WHERE [Number] BETWEEN 1 AND 1000000

 GO

 ALTER TABLE foo ADD CONSTRAINT PK_Foo PRIMARY KEY (Number)

 GO

This is simple, easy to remember, and it's fast. Of course, it depends on your having a numbers table that covers your min/max values.

October 8, 2009 7:57 AM
 

AaronBertrand said:

Linchi, I agree with you about simplicity.  One of the problems with bad habits is that often they are borne out of simplicity.  In a lot of cases the wait time is irrelevant, so the simple option is best.  But perhaps this gives people a better way when they're really in a rush.

As for performance, I disagree completely (though I don't know what "simple change" you imply).  I ran an insert of 1,000,000 rows into a new table using three methods: (a) a loop with a counter, using SET NOCOUNT ON, (b) a loop with a counter, using SET NOCOUNT OFF, and (c) the sys.objects cross join from above.  I only observed elapsed time, and the results were:

(a) 8 minutes, 18 seconds

(b) 8 minutes, 32 seconds

(c) 3 seconds

October 8, 2009 10:07 AM
 

AaronBertrand said:

Scott, good comments.  I'll address a few.

1) I realize that I am not going to change a whole lot of habits.  But if I made one person realize a better way to accomplish a common task, mission accomplished, IMHO.

2) I agree that if you already have a numbers table, then SELECT INTO is best, without any of the CTE complications.  However, you still have to get a numbers table populated in the first place; surely you should do this in the most efficient manner possible, no?  And again, I explained that some people are opposed to such a table on principle - so hopefully this post gives them some alternatives as well.

3) As for SET NOCOUNT ON, I use it on all my queries, regardless of their complexity.  And like anyone trying to demonstrate and reinforce good habits, repetition is key.  When you are performing one operation with one rowcount output, I agree, it is not necessary and does not buy you anything measurable.  But what harm does it do to add it?  You can even make it the first line in your template for all new queries, then the agonizing keystrokes can't even be a good excuse anymore.  :-)

October 8, 2009 10:07 AM
 

Linchi Shea said:

Aaron;

Your (a) and (c) are not strictly apple to apple comparison. If you can retrieve data from some other table, sure, go ahead. It would be silly to use a loop in that case.

When you use a loop, you most likely are generating new data, and the bottleneck is almost always on the INSERTs, not the loop itself. By 'minor change', I primarily meant to change the transaction commit behavior of the loop. Instead of committing each and every single row insert, which is very expensive, you may want to commit every 100,000 rows or some other batch size (big enough to make transaction commits efficient). That would make it dramatically faster than 8 minutes.

Change it to someting like the following and it should finish in seconds:

SET NOCOUNT ON;

CREATE TABLE dbo.foo(bar INT);

DECLARE   @i INT,  @UpperLimit INT;

SELECT  @i = 1,  @UpperLimit = 1000000;

begin tran

WHILE @i <= @UpperLimit

BEGIN  

   if @i % 100000 = 0

   begin

      commit tran

      begin tran

   end

   INSERT dbo.foo(bar) VALUES(@i);  

   SET @i = @i + 1;

END

if @@trancount > 0

  commit tran

Here are some more data points:

http://sqlblog.com/blogs/linchi_shea/archive/2007/08/23/performance-impact-finding-the-most-optimal-batch-size.aspx

October 8, 2009 11:31 AM
 

AaronBertrand said:

Fair enough Linchi, but do you know anybody who is going to code their "simple loops" that way?

October 8, 2009 11:36 AM
 

Linchi Shea said:

Aaron;

I agree that if they literally use the first approach, and inserting and committing one row at a time, they deserve the poor performance. And it would be a bad habit.

October 8, 2009 1:02 PM
 

Adam Machanic said:

Aaron has recently been doing an absolutely fantastic series of posts detailing various "bad habits"

October 8, 2009 10:57 PM
 

AaronBertrand said:

Linchi, I think a lot of people don't realize the implications of these things, such as the costs of commit overhead.  So, saying they "deserve" something that isn't really explained well anywhere is kind of harsh, IMHO.  I still think the inclination for most people is to use a set-based approach rather than a one-by-one loop, even if there are cumbersome ways to defer or limit the costs associated with the latter.

October 9, 2009 11:40 AM
 

SqlServerKudos said:

Kudos for a great Sql Server article - Trackback from SqlServerKudos

October 9, 2009 2:57 PM
 

Helping people kick bad SQL Server habits « OTO One to One Interactive said:

October 10, 2009 2:28 PM
 

Uri Dimant said:

Aaron , great posts

I often use to make numbers table as

CREATE TABLE Num(number INT NOT NULL IDENTITY(1,1))

INSERT INTO Num DEFAULT VALUES

GO 10000

October 11, 2009 2:04 AM
 

AaronBertrand said:

Uri, that's still a loop, even though you don't actually use BEGIN / END or WHILE -- while it is simple to code, it still suffers the same performance problem I pointed out in my post, and isn't subject to the enhancement Linchi mentioned about reducing the number of commits.  Also, without a GO between CREATE TABLE and INSERT INTO, the script you posted will return 9,999 error messages that the table Num already exists.  So, even the simplest types of loops are still subject to memorization / transposition issues.  And where is the schema prefix?  For shame!  :-)

October 11, 2009 12:01 PM
 

Uri Dimant said:

:-)Hey,I posted this comment before your last one,

You know, I have not written anymore an old JOIN style :-))

October 11, 2009 2:57 PM
 

AaronBertrand said:

Cool, so at least one person has kicked a bad habit based on my complaints.  :-)

October 11, 2009 3:28 PM
 

OllyA said:

Actually, you can use the recursive CTE for large numbers of rows.  Don't think it's as fast as a numbers table, but a lot quicker than the loop.

Just use OPTION (MAXRECURSION 0).

Oh, and make sure your CTE has the WHERE clause... or you'll be trying to go for an infinite numbers table ;-)

October 12, 2009 4:51 AM
 

Brian Tkatch said:

Aarton, this is a great series. Or rather, these are great posts. The series, however should be bifurcated (finally, got to use it is a "real" sentence :P ). Loops aren't a bad habit, as has been pointed out. There is just a better way. Perhaps you ought to seaprate between bad habits and best practices.

October 12, 2009 10:48 AM
 

AaronBertrand said:

Thanks Brian, in all fairness, I didn't say loops were a bad habit.  Just that they were a bad habit when the goal is to generate millions of rows.

October 12, 2009 10:52 AM
 

Brian Tkatch said:

>I didn't say loops were a bad habit.  Just that they were a bad habit

>when the goal is to generate millions of rows.

oh, ok. Heh. I was just focusing on the title as a subject identifier and ran with it. Regardless, it is a good point to read, and i'll read it whatever you call it.

October 12, 2009 12:18 PM
 

Ron Krauter said:

Aaron,

If you had a million customers in a table and you wanted to email each one of them, wouldn't you use the loop example you provided?

WHILE ...

 BEGIN

   --send dbmail

 END

Thanks.

November 5, 2009 1:00 AM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About AaronBertrand

...about me...

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement