
Linchi Shea

Checking out SQL Server via empirical data points

Performance impact: The cost of doing small lookups in a large batch update

Lookup tables are widely used in database applications for good reasons. Usually, a lookup table has a small number of rows and looking it up with a join is fast, especially when the table is already cached.

Recently, I needed to update every row in many relatively large tables, each of which was identically structured, had ~25 million rows, and was ~30GB in size. The tables were denormalized to include both a lookup index column (i.e. CategoryID, which was an integer) and the corresponding lookup value column (i.e. CategoryName, which was a char(50)). The batch update I was performing was to ensure that the CategoryName column of these tables had the correct matching value. The CategoryID-to-CategoryName mapping was defined in a small lookup table, CategoryLookup, with 10 rows.

Question

What would be the most efficient method to perform this batch update?

Three lookup methods

For the batch update scenario described above, you have three alternatives for looking up the CategoryName values (assume that the table to be updated is called Transactions):

  • Method 1—The inline CASE method, which performs an inline lookup with a CASE expression in the UPDATE statement. Okay, this is not really a lookup. But this method provides a baseline for comparison.

UPDATE Transactions
   SET CategoryName = CASE
             when CategoryID = 1  then 'abc1'
             when CategoryID = 2  then 'abc2'
             when CategoryID = 3  then 'abc3'
             when CategoryID = 4  then 'abc4'
             when CategoryID = 5  then 'abc5'
             when CategoryID = 6  then 'abc6'
             when CategoryID = 7  then 'abc7'
             when CategoryID = 8  then 'abc8'
             when CategoryID = 9  then 'abc9'
             when CategoryID = 10 then 'abc10'
           END

 

  • Method 2—The JOIN method, which relies on joining the Transactions table and the CategoryLookup table to do the lookup.

  UPDATE t1
     SET t1.CategoryName = t2.CategoryName
    FROM Transactions t1 JOIN CategoryLookup t2
          ON t1.CategoryID = t2.CategoryID

 

  • Method 3—The subquery method, which does the lookup with a correlated subquery. In effect, there is still a join hidden in the subquery.

  UPDATE Transactions
     SET CategoryName =
          (SELECT CategoryLookup.CategoryName
             FROM CategoryLookup
            WHERE CategoryLookup.CategoryID = Transactions.CategoryID)

 

You can also do the lookup with a scalar function, but that is so horrifically inefficient that you should not seriously consider it, and it's not interesting to include in this discussion. In addition, you could do the lookup with an inline table-valued function, which has a performance profile similar to that of the inline CASE method.
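
For reference, here is a minimal sketch of the inline table-valued function variant. This is not from the original post; the function name dbo.fn_CategoryName and the CROSS APPLY usage are illustrative assumptions:

-- Sketch: wrap the CASE lookup in an inline TVF so it can be reused.
-- The optimizer expands an inline TVF, so the plan resembles the
-- inline CASE method rather than a table lookup.
create function dbo.fn_CategoryName (@CategoryID int)
returns table
as
return
(
    select case
              when @CategoryID = 1  then 'abc1'
              when @CategoryID = 2  then 'abc2'
              when @CategoryID = 3  then 'abc3'
              when @CategoryID = 4  then 'abc4'
              when @CategoryID = 5  then 'abc5'
              when @CategoryID = 6  then 'abc6'
              when @CategoryID = 7  then 'abc7'
              when @CategoryID = 8  then 'abc8'
              when @CategoryID = 9  then 'abc9'
              when @CategoryID = 10 then 'abc10'
           end as CategoryName
)
go

update t
   set t.CategoryName = f.CategoryName
  from Transactions t
       cross apply dbo.fn_CategoryName(t.CategoryID) f
go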

It should be highlighted that method 2 (the JOIN method) and method 3 (the Subquery method) are not semantically identical. For instance, if the Transactions table has a CategoryID value that is not present in the CategoryLookup table, the Subquery method will set the CategoryName column to NULL if NULL is permitted, or the update will fail if it is not, whereas the JOIN method will leave the CategoryName value unchanged. For the scenario we are interested in, the results of these two methods are identical: all the CategoryID values in the Transactions table are also in the CategoryLookup table, and the mapping from CategoryID to CategoryName in the CategoryLookup table is one to one.
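
If you want to confirm that assumption in your own data before running the update, a quick existence check along these lines (a sketch, not part of the original scripts) will flag any unmatched CategoryID values:

-- Sketch: rows whose CategoryID has no match in CategoryLookup.
-- A non-zero count means the JOIN and Subquery methods would not
-- produce the same result.
select count(*) as OrphanRows
  from Transactions t
 where not exists (select 1
                     from CategoryLookup c
                    where c.CategoryID = t.CategoryID)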

I ran a series of controlled tests that mimicked the update scenario described previously. To keep the tests more manageable, I used a smaller and artificially created Transactions table that had 5,000,000 rows and was ~5GB in size. You can find the DDLs and the test script at the bottom of this post.

Test results and practical implications

I made sure that the results shown below (charted in the attached LookupCost.GIF) were stable in that (1) they were taken from 50 repeated tests with a small number of outliers thrown out, and (2) the remaining results were inspected to confirm that the variances among them were relatively small and that the values exhibited a consistent pattern.

Clearly, if you do a massive number of lookups (like what I did in this test), the cumulative cost can be quite visible. In fact, in this test, using an inline CASE expression was more than twice as fast as lookups using either a subquery or a straight join. As the number of rows increases, you can expect this difference (or the cost of doing lookups) to become more pronounced. So, if you are doing a very large batch update, it's definitely worth replacing the table lookups with an inline CASE expression for better performance.
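
If hardcoding the mapping feels too brittle, one compromise is to generate the CASE expression from the lookup table and run the update through dynamic SQL. The sketch below is my illustration (not from the original post) and assumes the CategoryLookup table is small and static while the update runs:

-- Sketch: build the CASE expression from CategoryLookup, then execute
-- the update via dynamic SQL so no lookup values are hardcoded.
declare @case nvarchar(max) = N''

select @case = @case
     + N' when CategoryID = ' + cast(CategoryID as nvarchar(10))
     + N' then ''' + replace(rtrim(CategoryName), '''', '''''') + N''''
  from CategoryLookup   -- branch order does not matter for a 1:1 mapping

declare @sql nvarchar(max) =
    N'update Transactions set CategoryName = case' + @case + N' end'

exec sp_executesql @sql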

The difference between the CASE method and the table lookups (either the Subquery method or the JOIN method) remained stable across different test environments. But the difference between the Subquery method and the JOIN method was more subtle. In fact, if you run the same test in a different environment, you may see different relative performance between them. In some environments, the Subquery method can perform significantly better than the JOIN method.
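
To see which of the two table-lookup methods wins in your own environment, the simplest check is to run each update with the standard statistics switches on and compare the actual execution plans. A generic sketch (not part of the original test harness):

-- Sketch: compare the Subquery and JOIN methods; also enable
-- "Include Actual Execution Plan" in SSMS to inspect the plans.
set statistics time on
set statistics io on

update Transactions
   set CategoryName =
       (select c.CategoryName
          from CategoryLookup c
         where c.CategoryID = Transactions.CategoryID)

update t
   set t.CategoryName = c.CategoryName
  from Transactions t
       join CategoryLookup c on t.CategoryID = c.CategoryID

set statistics time off
set statistics io off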

Although there was a significant performance penalty when using Subquery or JOIN lookups in a massive update, this does not mean you should jettison lookups in your individual transactions. Because the marginal cost of an individual lookup is infinitesimally small compared to many other performance-related factors, you'd lose much more in terms of code reuse, flexibility, and so on if you started to embed 'lookups' inline. To emphasize the point: the difference between the CASE method and the Subquery method in the test was ~34 seconds. Dividing 34 seconds by the 5,000,000 lookups the update did gives ~6.8 microseconds as the marginal cost of an individual lookup.

It is no surprise that avoiding a massive number of table lookups can give you better performance, but it's still good to be able to appreciate it with some concrete numbers. My update of all those 25-million-row tables mentioned at the beginning of this post took more than 10 hours to complete using the Subquery method. Had I had the results reported here, I could have finished the same update process in about five hours. That would have been a very nice saving!

Test setup

The lookup DDL and data:

drop table CategoryLookup
go

create table CategoryLookup(CategoryID int, CategoryName char(20))
go

with tmp(a, b) as (
   select 1, 'abc' + cast(1 as varchar(5))
   union all
   select a+1, 'abc' + cast(a+1 as varchar(5))
     from tmp
    where a < 10
)
insert CategoryLookup
select * from tmp
go

create clustered index cix_CategoryLookup on CategoryLookup(CategoryID)
go

The Transactions test table DDL and data:

drop table Transactions
go

create table Transactions(CategoryID int,
                  CategoryName char(50),
                  filler char(1000))
go

set nocount on
go

declare @i int
set @i = 1

begin tran
while @i <= 5000000
begin
     insert Transactions
     select @i % 10 + 1, 'abc', 'filler'

     if @i % 100000 = 0
     begin
         commit tran
         begin tran
     end

     set @i = @i + 1
end
if @@trancount > 0
   commit tran
go

sp_spaceused Transactions
go

create clustered index cix_Transactions on Transactions(CategoryID)
go

  

drop table test_log -- this table is used to log the test times

go

create table test_log (
    Name        varchar(50),
    Num         int,
    StartTime   datetime,
    EndTime     datetime NULL
)

go

 

The test script:

set nocount on
go

declare @dt datetime,
        @i int

set @i = 1

while @i <= 20 -- run the test 20 times

begin

   set @dt = getdate()
   insert test_log select 'CASE method', 10, @dt, NULL

   update Transactions
     set CategoryName = case
               when CategoryID = 1  then 'abc1'
               when CategoryID = 2  then 'abc2'
               when CategoryID = 3  then 'abc3'
               when CategoryID = 4  then 'abc4'
               when CategoryID = 5  then 'abc5'
               when CategoryID = 6  then 'abc6'
               when CategoryID = 7  then 'abc7'
               when CategoryID = 8  then 'abc8'
               when CategoryID = 9  then 'abc9'
               when CategoryID = 10 then 'abc10'
             end

   update test_log
      set EndTime = getdate()
    where StartTime = @dt

   set @dt = getdate()
   insert test_log select 'Subquery method', 10, @dt, NULL

   update Transactions
      set CategoryName =
         (select CategoryLookup.CategoryName
            from CategoryLookup
           where CategoryLookup.CategoryID = Transactions.CategoryID)

   update test_log
      set EndTime = getdate()
    where StartTime = @dt

   set @dt = getdate()
   insert test_log select 'JOIN method', 10, @dt, NULL

   update t1
      set t1.CategoryName = t2.CategoryName
     from Transactions t1
          join CategoryLookup t2 on t1.CategoryID = t2.CategoryID

   update test_log
      set EndTime = getdate()
    where StartTime = @dt

   set @i = @i + 1
end
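
Once the runs complete, the timings can be pulled out of test_log with a simple summary query (this query is mine, not part of the original post):

-- Sketch: average elapsed seconds per lookup method across all runs.
select Name,
       count(*)                                        as Runs,
       avg(datediff(second, StartTime, EndTime) * 1.0) as AvgSeconds
  from test_log
 where EndTime is not null
 group by Name
 order by AvgSeconds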

The reported results were obtained on a DL585 G1 with 64GB of RAM and eight 2.6GHz cores, running Windows Server 2003 Enterprise Edition and SQL Server 2008 SP2 Enterprise x64 Edition. 50GB of memory was allocated to the SQL Server instance.

 

Published Monday, April 04, 2011 3:34 PM by Linchi Shea

Attachment(s): LookupCost.GIF


Comments

 

Abhijit said:

Hey! Nice article. Especially the graphical presentation of the sub-query, join and case methods.

April 7, 2011 10:55 AM
 

piers7 said:

The interesting bit is when you look at the query plan. Basically the CASE statement, whether directly or in a TVF, gets converted into a Compute Scalar operation, which seems to be really cheap, whereas the join typically becomes a hash join which, even though it's pretty efficient, is still relatively much, much more expensive.

Factors here are (presumably) the constraint that the CASE statement will always result in 1:1 input:output row counts, whereas the JOIN could fan out or fan in, depending on the cardinality of the tables, and that probing the hashtable requires hash computation in addition to equality checking and list-probing (in hash-collided buckets).

Even more interesting: I had a case last week where changing to a CASE lookup substantially *reduced* performance. Still scratching my head over this one, but basically the knock-on effect was to change the ordering of the rows 'further up' in the query plan, and force an expensive sort before a particular aggregation could be performed.

Moral: test each case individually (and baseline, and monitor)

April 10, 2011 9:05 PM
 

Michael Rybicki said:

This post scares the crap out of me. I use Scalar functions in A LOT of my code. Thanks for the info but now what?

I actually took your scripts, modified the # of rows to 100,000 and # of passes to two. I also removed the Join and Subquery ones and replaced them with an update using a Scalar Function. Profiler Results were really dramatic:

                    CPU    READS   WRITES  DURATION

UPDATE w/CASE        358    14382    14295       613

UPDATE w/FUNCTION   1779   571755      942      1801

SELECT w/CASE         47    14335        0       927

SELECT w/FUNCTION    671    14341        0      1137

So can you explain why the function during an update gives me so many less Writes at the cost of higher CPU and READS? Also taking 3 times longer.

Also - with the SELECT statement - CPU is more than 10 times but fortunately the DURATION and READS are very close.

And - what are the alternatives if you need code re-use? Just accept it?

Thanks.

April 14, 2011 8:30 PM
