THE SQL Server Blog Spot on the Web


Aaron Bertrand

Aaron is a Product Evangelist for SentryOne, makers of Plan Explorer and a performance monitoring and event management platform for the Microsoft Data Platform and VMware. He has been blogging here at sqlblog.com since 2006, focusing on manageability, performance, and new features, and also blogs at blogs.sentryone.com and SQLPerformance.com; has been a Microsoft MVP since 1997; tweets as @AaronBertrand; and speaks frequently at major conferences, user group meetings, and SQL Saturday events worldwide.

Bad habits to kick : using loops to populate large tables

In my last post in this series, I talked about using ORDER BY <ordinal position>.  Today I'd like to talk about using primitive loops to generate a large number of rows.

Okay, I will admit, in most cases you are doing this for a demo or proof of concept, so speed / performance is not all that important.  But using a loop to populate a table with 1 million rows, one row at a time, is no fun at all, no matter how much you enjoy twiddling your thumbs.  I often see people building up test cases like this:

SET NOCOUNT ON;

CREATE TABLE dbo.foo(bar INT);

DECLARE
    @i INT,
    @UpperLimit INT;

SELECT
    @i = 1,
    @UpperLimit = 1000000;

WHILE @i <= @UpperLimit
BEGIN
    INSERT dbo.foo(bar) VALUES(@i);
    SET @i = @i + 1;
END

For 500 or 50,000 rows, this probably wouldn't be a big deal.  But when you get into the bigger numbers, you are going to be spending a long time waiting for the inserts to finish. So how do you get around it?

I have long been an advocate of an auxiliary numbers table.  It makes quick work of generating dummy data, does not require a lot of storage space, and is easy to join against for various reporting-type functions.  Let's say you create a numbers table with 1,000,000 rows, then you can change the above code to:

SET NOCOUNT ON;

CREATE TABLE dbo.foo(bar INT);

DECLARE @UpperLimit INT;
SET @UpperLimit = 1000000;

INSERT dbo.foo(bar)
    SELECT [Number]
    FROM dbo.Numbers
    WHERE [Number] BETWEEN 1 AND @UpperLimit;
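As a quick illustration of the "reporting-type" joins a numbers table makes easy, here is a hedged sketch that uses it to find gaps in a key; the dbo.Orders table and its OrderID column are hypothetical stand-ins:

```sql
-- Hypothetical example: list the OrderID values missing from dbo.Orders.
-- Assumes dbo.Numbers is already populated and that dbo.Orders exists.
SELECT n.[Number] AS MissingOrderID
  FROM dbo.Numbers AS n
  LEFT OUTER JOIN dbo.Orders AS o
    ON o.OrderID = n.[Number]
 WHERE n.[Number] <= (SELECT MAX(OrderID) FROM dbo.Orders)
   AND o.OrderID IS NULL;
```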

However, there are many people who fundamentally object to adding a table to their schema, even if they put it in some utility database.  [As an aside, I see the exact same objections from people who don't think it's worth it to store a calendar table in their systems, even though it greatly simplifies date calculations such as business days between two dates, taking into account things like Easter and company holidays.]
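To make the calendar-table aside concrete, here is a minimal sketch; the dbo.Calendar table, its [Date] column, and the IsWorkDay flag are assumptions about how such a table might be maintained, not something defined in this post:

```sql
-- Count business days between two dates, assuming a dbo.Calendar table
-- with one row per date and an IsWorkDay bit that already accounts for
-- weekends, Easter, and company holidays.
SELECT COUNT(*) AS BusinessDays
  FROM dbo.Calendar
 WHERE [Date] >  '20090101'
   AND [Date] <= '20091007'
   AND IsWorkDay = 1;
```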

Well, thankfully, there are some clever workarounds to this that let you construct a numbers table on the fly (and in fact you need to use something like this to populate the numbers table in the first place, without having to use a loop).  Here is one that I have used in previous posts:

SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit = 1000000;

WITH n AS
(
    SELECT x = ROW_NUMBER() OVER
        (ORDER BY s1.[object_id])
    FROM sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
    CROSS JOIN sys.all_objects AS s3
)
SELECT [Number] = x
INTO dbo.Numbers
FROM n
WHERE x BETWEEN 1 AND @UpperLimit;
GO

CREATE UNIQUE CLUSTERED INDEX n
ON dbo.Numbers([Number]);

And here is one using a set of CTEs and no tables, based on code I stole from Itzik Ben-Gan:

SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit = 1000000;

WITH
    n5   (x) AS (SELECT 1 UNION SELECT 0),
    n4   (x) AS (SELECT 1 FROM n5 CROSS JOIN n5 AS x),
    n3   (x) AS (SELECT 1 FROM n4 CROSS JOIN n4 AS x),
    n2   (x) AS (SELECT 1 FROM n3 CROSS JOIN n3 AS x),
    n1   (x) AS (SELECT 1 FROM n2 CROSS JOIN n2 AS x),
    n0   (x) AS (SELECT 1 FROM n1 CROSS JOIN n1 AS x),
    Nbrs (x) AS
    (
        SELECT ROW_NUMBER() OVER
            (ORDER BY x)
        FROM n0
    )
SELECT [Number] = x
INTO dbo.Numbers
FROM Nbrs
WHERE x BETWEEN 1 AND @UpperLimit;
GO

CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);

So, you could use either of these examples (a) to populate your own numbers table, which you can later use instead of loops; or (b) directly, every time you need to simulate a loop.  I prefer (a) because, if you use it enough, the numbers table will reside in memory, have statistics, etc.  For full disclosure (and not to say my method is better), I compared the actual execution plans of both methods when using 1,000,000 rows as my upper limit.  The comparison is as follows:

                            CROSS JOIN   CTE
Query cost (relative)       10%          90%
Cached plan size (bytes)    88           176
CompileCPU (ms)             28           50
CompileMemory (KB)          2336         3376
CompileTime (ms)            28           50
Estimated Subtree Cost      0.0206647    0.193489

And the most visible difference is the query plan.  In case it's unclear, I'll give you a hint: the cross join of sys.all_objects is at the top.

Again, these aren't things you need to be concerned about if you are only populating the numbers table once.  But if you are using these methods on the fly, the difference in performance could be significant.

If you require <= 32,767 rows, you can use this alternative syntax, employing a recursive CTE.  It is much more efficient, but has a cap of 32,767 due to the limit on the MAXRECURSION option:

SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit = 32767;

WITH n AS
(
    SELECT x = 1
    UNION ALL
    SELECT x = x + 1
    FROM n
    WHERE x < @UpperLimit
)
SELECT [Number] = x
INTO dbo.Numbers
FROM n
OPTION (MAXRECURSION 32767);
GO

CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);

I'll admit, this isn't the easiest syntax to memorize, so you will likely want to keep a copy of it in your toolbox.  If you are going to continue using a loop, at the very least, please get in the habit of adding SET NOCOUNT ON statements to your code.  This way, at least, that big loop won't be wasting all that effort passing "1 row(s) affected" to the client n times, and the client won't be wasting all that effort printing it out to you n times.

Now don't get me wrong, I am not saying that loops are a bad thing.  In fact for a lot of things they are the right answer, and for some things they are even the only answer.  But in cases where things don't have to happen one at a time, they shouldn't happen one at a time.

I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code.  Up next: using old-style joins.

Published Wednesday, October 7, 2009 7:17 PM by AaronBertrand

Comments

 

Linchi Shea said:

Aaron;

If we are talking about performance and simplicity, I'd actually argue that your very first method with a loop is actually the best (with some minor change).

October 7, 2009 9:06 PM
 

Scott Whigham said:

I like the breakdown - nice post. I don't think that, even if you showed people that the performance of a loop was 500 times worse than using a recursive CTE, it would change more than 2% of the population's technique of doing this. Why? As you pointed out, it's not a technique that people will generally be able to memorize. My "benchmark" on stuff like this is to imagine that I've just learned this technique on December 1 and then I take the whole month off and go to the Bahamas. When I come back to work in mid-January, will I remember this technique? If the answer is "No", then I do one of two things: I either (a) do as you said and add a script to my "toolkit", or (b) I just forget about it. Your final example has been in my toolkit for several years, but when it comes time to write a loop on the fly in a demo, I reach for the declarative technique (that you showed initially).

If you have a numbers table that covers your min/max values, the simplest technique, for me, is to just use SELECT INTO - no need for more fancy options IMO. I don't even need the SET NOCOUNT ON with such a query:

SELECT [Number]
INTO Foo
FROM dbo.Numbers
WHERE [Number] BETWEEN 1 AND 1000000
GO

ALTER TABLE foo ADD CONSTRAINT PK_Foo PRIMARY KEY (Number)
GO

This is simple, easy to remember, and it's fast. Of course, it depends on your having a numbers table that covers your min/max values.

October 8, 2009 7:57 AM
 

AaronBertrand said:

Linchi, I agree with you about simplicity.  One of the problems with bad habits is that often they are borne out of simplicity.  In a lot of cases the wait time is irrelevant, so the simple option is best.  But perhaps this gives people a better way when they're really in a rush.

As for performance, I disagree completely (though I don't know what "simple change" you imply).  I ran an insert of 1,000,000 rows into a new table using three methods: (a) a loop with a counter, using SET NOCOUNT ON, (b) a loop with a counter, using SET NOCOUNT OFF, and (c) the sys.objects cross join from above.  I only observed elapsed time, and the results were:

(a) 8 minutes, 18 seconds

(b) 8 minutes, 32 seconds

(c) 3 seconds

October 8, 2009 10:07 AM
 

AaronBertrand said:

Scott, good comments.  I'll address a few.

1) I realize that I am not going to change a whole lot of habits.  But if I made one person realize a better way to accomplish a common task, mission accomplished, IMHO.

2) I agree that if you already have a numbers table, then SELECT INTO is best, without any of the CTE complications.  However, you still have to get a numbers table populated in the first place; surely you should do this in the most efficient manner possible, no?  And again, I explained that some people are opposed to such a table on principle - so hopefully this post gives them some alternatives as well.

3) As for SET NOCOUNT ON, I use it on all my queries, regardless of their complexity.  And like anyone trying to demonstrate and reinforce good habits, repetition is key.  When you are performing one operation with one rowcount output, I agree, it is not necessary and does not buy you anything measurable.  But what harm does it do to add it?  You can even make it the first line in your template for all new queries, then the agonizing keystrokes can't even be a good excuse anymore.  :-)

October 8, 2009 10:07 AM
 

Linchi Shea said:

Aaron;

Your (a) and (c) are not strictly apple to apple comparison. If you can retrieve data from some other table, sure, go ahead. It would be silly to use a loop in that case.

When you use a loop, you most likely are generating new data, and the bottleneck is almost always on the INSERTs, not the loop itself. By 'minor change', I primarily meant to change the transaction commit behavior of the loop. Instead of committing each and every single row insert, which is very expensive, you may want to commit every 100,000 rows or some other batch size (big enough to make transaction commits efficient). That would make it dramatically faster than 8 minutes.

Change it to something like the following and it should finish in seconds:

SET NOCOUNT ON;

CREATE TABLE dbo.foo(bar INT);

DECLARE @i INT, @UpperLimit INT;

SELECT @i = 1, @UpperLimit = 1000000;

BEGIN TRAN;

WHILE @i <= @UpperLimit
BEGIN
   IF @i % 100000 = 0
   BEGIN
      COMMIT TRAN;
      BEGIN TRAN;
   END

   INSERT dbo.foo(bar) VALUES(@i);
   SET @i = @i + 1;
END

IF @@TRANCOUNT > 0
   COMMIT TRAN;

Here are some more data points:

http://sqlblog.com/blogs/linchi_shea/archive/2007/08/23/performance-impact-finding-the-most-optimal-batch-size.aspx

October 8, 2009 11:31 AM
 

AaronBertrand said:

Fair enough Linchi, but do you know anybody who is going to code their "simple loops" that way?

October 8, 2009 11:36 AM
 

Linchi Shea said:

Aaron;

I agree that if they literally use the first approach, and inserting and committing one row at a time, they deserve the poor performance. And it would be a bad habit.

October 8, 2009 1:02 PM
 

Adam Machanic said:

Aaron has recently been doing an absolutely fantastic series of posts detailing various "bad habits"

October 8, 2009 10:57 PM
 

AaronBertrand said:

Linchi, I think a lot of people don't realize the implications of these things, such as the costs of commit overhead.  So, saying they "deserve" something that isn't really explained well anywhere is kind of harsh, IMHO.  I still think the inclination for most people is to use a set-based approach rather than a one-by-one loop, even if there are cumbersome ways to defer or limit the costs associated with the latter.

October 9, 2009 11:40 AM
 

Uri Dimant said:

Aaron , great posts

I often use to make numbers table as

CREATE TABLE Num(number INT NOT NULL IDENTITY(1,1))

INSERT INTO Num DEFAULT VALUES

GO 10000

October 11, 2009 2:04 AM
 

AaronBertrand said:

Uri, that's still a loop, even though you don't actually use BEGIN / END or WHILE -- while it is simple to code, it still suffers the same performance problem I pointed out in my post, and isn't subject to the enhancement Linchi mentioned about reducing the number of commits.  Also, without a GO between CREATE TABLE and INSERT INTO, the script you posted will return 9,999 error messages that the table Num already exists.  So, even the simplest types of loops are still subject to memorization / transposition issues.  And where is the schema prefix?  For shame!  :-)
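For reference, a sketch of Uri's approach with the missing batch separator (and a schema prefix) added; note that it still loops under the covers, so it keeps the per-row commit cost described above:

```sql
-- GO n re-runs the preceding batch n times; without the first GO,
-- CREATE TABLE would be re-executed on every iteration and fail.
CREATE TABLE dbo.Num(number INT NOT NULL IDENTITY(1,1));
GO
INSERT dbo.Num DEFAULT VALUES;
GO 10000
```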

October 11, 2009 12:01 PM
 

Uri Dimant said:

:-) Hey, I posted this comment before your last one.

You know, I haven't written an old-style JOIN in ages :-))

October 11, 2009 2:57 PM
 

AaronBertrand said:

Cool, so at least one person has kicked a bad habit based on my complaints.  :-)

October 11, 2009 3:28 PM
 

OllyA said:

Actually, you can use the recursive CTE for large numbers of rows.  Don't think it's as fast as a numbers table, but a lot quicker than the loop.

Just use OPTION (MAXRECURSION 0).

Oh, and make sure your CTE has the WHERE clause... or you'll be trying to go for an infinite numbers table ;-)
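A sketch of OllyA's suggestion, adapting the recursive CTE from the post; MAXRECURSION 0 removes the 32,767 cap, which means the WHERE clause in the recursive member is the only thing that stops the recursion:

```sql
SET NOCOUNT ON;
DECLARE @UpperLimit INT;
SET @UpperLimit = 1000000;

WITH n AS
(
    SELECT x = 1
    UNION ALL
    SELECT x + 1
    FROM n
    WHERE x < @UpperLimit   -- without this, the CTE never terminates
)
SELECT [Number] = x
INTO dbo.Numbers
FROM n
OPTION (MAXRECURSION 0);    -- 0 = no recursion limit
```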

October 12, 2009 4:51 AM
 

Brian Tkatch said:

Aaron, this is a great series. Or rather, these are great posts. The series, however, should be bifurcated (finally, got to use that in a "real" sentence :P ). Loops aren't a bad habit, as has been pointed out. There is just a better way. Perhaps you ought to separate bad habits from best practices.

October 12, 2009 10:48 AM
 

AaronBertrand said:

Thanks Brian, in all fairness, I didn't say loops were a bad habit.  Just that they were a bad habit when the goal is to generate millions of rows.

October 12, 2009 10:52 AM
 

Brian Tkatch said:

>I didn't say loops were a bad habit.  Just that they were a bad habit

>when the goal is to generate millions of rows.

oh, ok. Heh. I was just focusing on the title as a subject identifier and ran with it. Regardless, it is a good point to read, and i'll read it whatever you call it.

October 12, 2009 12:18 PM
 

Ron Krauter said:

Aaron,

If you had a million customers in a table and you wanted to email each one of them, wouldn't you use the loop example you provided?

WHILE ...

 BEGIN

   --send dbmail

 END

Thanks.

November 5, 2009 1:00 AM
 

Jeff Moden said:

Ron Krauter said:  If you had a million customers in a table and you wanted to email each one of them, wouldn't you use the loop example you provided?

Absolutely (although as of SQL Server 2005, there are even ways around that).  In the case you cite, the While Loop won't be the performance problem and is necessary because the proc can only do 1 email at a time (IIRC).

The problem is that a lot of people write procs that can only handle 1 row or record at a time instead of writing a set based solution.  Heh...I guess they do it out of "habit".

--Jeff Moden

March 6, 2010 4:01 PM
 

Jeff Moden said:

Aaron said: Thanks Brian, in all fairness, I didn't say loops were a bad habit.

Heh... I will!  ;-)

I know this is a relatively old blog but I just had to leave you a note about this.  I'm really glad to see someone else trying to keep people from using the addictive drug known as "While Loops".  I'm also pressing to keep them from using other forms of unnecessary RBAR such as recursive CTEs to do such simple things as counting, because they're sometimes worse than a While Loop.

One of my favorite peeves is when someone wants to demonstrate some hot new setbased technique and they go to build a test table to prove their code... and they use a While Loop or recursive CTE to do it.  It just doesn't make sense to me and it tends to destroy the "set based" credibility of the author in my eyes.

A couple of years ago and again recently, SSC had an article on the ol' "FizzBuzz" problem that's sometimes used during interviews.  It's amazing to me that so many people submit solutions that contain a While Loop and try to justify its use on "short time allowed" and "it's only for 100 rows".  They just don't understand that when I'm doing interviews, I'm looking for people who...

1.  Don't have the bad habit of using "While Loops" for something so simple.

2.  Don't have the bad habit of justifying a total lack of future performance and scalability because they don't understand that scalability requirements can change quickly and the code should be able to handle it without rework.

I took a lot of the same heat on those two threads for the same reason you took some heat on this thread.  I'm thrilled that someone else is standing their ground in a fight against RBAR and non scalable code.  Very well done and my hat's off to you!

--Jeff Moden  

BTW... sys.All_Columns in the Master database will always have more than 4k rows for a full install.
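Building on Jeff's note, here is a hedged sketch: if sys.all_columns has more than 4,000 rows, a single self cross join yields over 16 million candidate rows, which comfortably covers a 1,000,000-row numbers table without stacking three joins:

```sql
-- One self cross join of sys.all_columns; row counts vary by install,
-- so verify the catalog view is large enough before relying on this.
SELECT TOP (1000000)
       [Number] = ROW_NUMBER() OVER (ORDER BY c1.[object_id])
  INTO dbo.Numbers
  FROM sys.all_columns AS c1
 CROSS JOIN sys.all_columns AS c2;
```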

March 6, 2010 4:16 PM
 

Razvan Socol said:

Hello Aaron,

You wrote that the recursive CTE method is "much more efficient" than Itzik's method (at least that's what I understand from that sentence). However, in my tests, it takes about 70ms to generate 10000 rows using Itzik's method and 563ms to generate 20000 rows with the recursive CTE. The recursive CTE also seems slower than the sysobjects CROSS JOIN, which provides the 20000 rows in about 80ms.

I guess that you meant to say that the recursive CTE method is faster than the iterative method, but the sentence seems to imply something else.

Razvan

May 11, 2010 9:48 AM
 

Aaron Bertrand said:

One complaint about SEQUENCE is that there is no simple construct such as NEXT (@n) VALUES FOR so that

December 13, 2010 7:58 AM
 

 

Abhijit said:

Guys, fantastic. I kicked myself (better late than never).

Having said that, any better solution for my problem below.

I need to generate following output where R1..Rn are rows and C1..Cn are columns (max limit known)

e.g.

R1 => C1 | C2 | C3 | C4 .......Cn

R2 => C1 | C2 | C3 | C4 .......Cn

R3 => C1 | C2 | C3 | C4 .......Cn

......

For this I generate a series of numbers for Rows and Columns.

Is using left join good (assuming there are 10000 rows and 5000 columns)

I have tried to depict my problem in a very simple mechanism but let me know if the description is adequate.

April 4, 2014 10:22 AM
 

Dean said:

Hi Aaron.  I like the look of your preferred option but would you be able to explain it a little?  What exactly does it do, and why?  I have just run it for the first time and I only get 970,299 rows, not 1,000,000.  I just think it may be useful to people to explain what the code is doing (especially to give people confidence to edit it to their own needs). Thanks!

June 8, 2016 9:12 AM
 

AaronBertrand said:

Hi Dean, sorry, the code should be using sys.all_objects instead of sys.objects. I ran my tests in AdventureWorks but in relatively empty databases I could understand that you might fall short of 1,000,000 rows as a result.

I'll think about a separate blog post to explain the cross joins, but in the meantime maybe this series will help:

http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1

http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-2

http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-3

June 8, 2016 12:58 PM
 

Dean said:

Thanks Aaron, I understand cross joins, I just meant from a functional point of view, I was interested in why you chose that particular approach.  I think I can pretty much imply it now from your reply - you just want access to a table with sufficient rows that a cross join to itself will create a large number of results, without having to create and populate it yourself, right?  If so hopefully my experience will be useful to people that if they have a very particular and large number in mind, they may want to be careful about relying on any system tables.  (Even my sys.all_objects table only has 2,267 rows, so would only create 5 million results when cross joined.)

The general point of your post has really helped me though, I had a requirement to fill a table with 50 million rows of random sales data, and the way I was previously doing it (with a 1 to 50,000,000 WHILE loop) took 50 minutes.  Following your approach, but using a smaller loop to populate a table variable with 7,072 rows, I now have it down to 5 minutes.  I've posted it here in case a) it helps anyone or b) anyone has any further suggestions for improvement!  (Once again, thanks for the post, it was a life saver!)

SET NOCOUNT ON;

DECLARE @first_tab TABLE (col1 INT);
DECLARE @i INT = 1;

WHILE @i <= 7072
BEGIN
    INSERT INTO @first_tab VALUES(@i);
    SET @i = @i + 1;
END

-- Sales figures random between 315 and 1140
-- Dates random number of minutes between 1 and 2629440 after 1/1/07
-- (makes a range of approx 5 years)

IF OBJECT_ID('sales_figures') IS NOT NULL DROP TABLE dbo.sales_figures;

SELECT TOP 50000000
       CONVERT(VARCHAR(30), DATEADD(minute, ABS(CHECKSUM(NEWID())) % 2629440, '01-JAN-2007'), 100) AS sales_date,
       315 + (ABS(CHECKSUM(NEWID())) % 825) AS sales_figure
INTO dbo.sales_figures
FROM @first_tab t1 CROSS JOIN @first_tab t2;

June 9, 2016 9:04 AM
 

Balaji Ram said:

Linchi/Aaron,

Regarding using optimal transaction size (100000 in the example above). This seems to matter only for permanent tables in user databases. Tempdb's #tables do not suffer much of a performance hit even when no transaction size is used and left to default(auto commit) with no explicit transaction.

Thanks to Paul Randal, I know tempdb data pages are NOT flushed to disk unless there is memory pressure. What about the tempdb LOG BLOCKs? Is it also NOT flushed to tempdb log file unless there is memory pressure?

Any Insights/thoughts? Thanks.

August 15, 2016 12:56 PM