Lookup tables are widely used in database applications for good reasons. Usually, a lookup table has a small number of rows and looking it up with a join is fast, especially when the table is already cached.
Recently, I needed to update every row in many relatively large tables, each of which was identically structured, had ~25 million rows, and was ~30GB in size. The tables were denormalized to include both a lookup index column (i.e. CategoryID, which was an integer) and the corresponding lookup value column (i.e. CategoryName, which was a char(50)). The batch update I was performing was to ensure that the CategoryName column of these tables had the correct matching the value. The CategoryID to CategoryName mapping was defined in a small lookup table, CategoryLookup, with 10 rows.
Question
What would be the most efficient method to perform this batch update?
Three lookup methods
For the batch update scenario described above, you have three alternatives to lookup the CategpryName values (assume that the table to be updated is called Transactions):
- Method 1—The inline CASE method, which performs an inline lookup with a CASE expression in the UPDATE statement. Okay, this is not really a lookup. But this method provides a baseline for comparison.
UPDATE Transactions
SET CategoryName = CASE
when CategoryID = 1 then 'abc1'
when CategoryID = 2 then 'abc2'
when CategoryID = 3 then 'abc3'
when CategoryID = 4 then 'abc4'
when CategoryID = 5 then 'abc5'
when CategoryID = 6 then 'abc6'
when CategoryID = 7 then 'abc7'
when CategoryID = 8 then 'abc8'
when CategoryID = 9 then 'abc9'
when CategoryID = 10 then 'abc10'
END
- Method 2—The JOIN method, which relies on joining the Transactions table and the CategoryLookup table to do the lookup.
UPDATE t1
SET t1.CategoryName = t2.CategoryName
FROM Transactions t1 JOIN CategoryLookup t2
ON t1.CategoryID = t2.CategoryID
- Method 3—The subquery method, which does a lookup with a subquery. Clearly, there is a join in the subquery.
UPDATE Transactions
SET CategoryName =
(SELECT CategoryLookup.CategoryName
FROM CategoryLookup
WHERE CategoryLookup.CategoryID=Transactions.CategoryID )
You can also do the lookup with a scalar function. But it’s so horrifically inefficient that you should not seriously consider it. It’s not interesting to include in this discussion. In addition, you could do the lookup with an inline table valued function, which has a similar performance profile as that of the inline CASE method.
It should be highlighted that method 2 (the JOIN method) and method 3 (the Subquery method) are not semantically identical. For instance, if the Transactions table has a CategoryID value that is not present in the CategoryLookup table, the Subquery method will, if permitted, set the CategoryName column to NULL, or the update will fail if NULL is not permitted, whereas the JOIN method will leave the CategoryName value unchanged. For the scenario we are interested in, the results of these two methods are identical. All the CategorID values in the Transactions table are also in the CategoryLookup table and the mapping from CategoryID to CategoryName in the CategoryLookup table is one to one.
I ran a series of controlled tests that mimicked the update scenario described previously. To keep the tests more manageable, I used a smaller and artificially created Transactions table that had 5,000,000 rows and was ~5GB in size. You can find the DDLs and the test script at the bottom of this post.
Test results and practical implications
I made sure that the results shown below were stable in that (1) they were taken from 50 repeated tests with a small number of outliers thrown out, and (2) the remaining results were inspected and made sure that the variances were relatively small among them and the values exhibited a consistent pattern.
Clearly, if you do a massive number of lookups (like what I did in this test), the cumulative cost can be quite visible. In fact, in this test using an inline CASE expression was more than twice as fast as lookups using either a subquery or a straight join. As the number of rows increases, you can expect to see this difference (or the cost of doing lookups) grow more prominent. So, if you are doing a very large batch update, it’s definitely worth replacing the table lookups with an inline CASE expression for better performance.
The difference between the CASE method and the table lookups (either the Subquery method or the JOIN method) remained stable across different test environments. But the difference between the Subquery method and the JOIN method was more subtle. In fact, if you run the same test in a different environment, you may see different relative performance between them. In some environments, the Subquery method can perform significantly better than the JOIN method.
Although there was a significant performance penalty when using Subquery or JOIN lookups in a massive update, this does not mean you should jettison using lookups in your individual transactions. Because the marginal cost of doing an individual lookup is infinitesimally small compared to many other performance-related factors, you’d lose much more in terms of code reuse, flexibility, and so on if you start to embed ‘lookups’ inline. To emphasize, note that the difference between the CASE method and the Subquery method in the test was ~34 seconds. Divide 34 seconds by the 5,000,000 lookups the update did, we get 6.8 microseconds as the marginal cost of an individual lookup.
There is no surprise that avoiding a massive number of table lookups could give you better performance. But it’s still good to be able to appreciate it with some concrete numbers. My update of all those 25-million-row tables mentioned at the beginning of this post took more than 10 hours to complete and I used the subquery method. Had I had the results reported here, I could have finished the same update process in five hours. That would have been a very nice saving!
Test setup
The lookup DDL and data:
drop table CategoryLookup
go
create table CategoryLookup(CategoryID int, CategoryName char(20))
go
with tmp(a, b) as (
select 1, 'abc' + cast(1 as varchar(5))
union all
select a+1, 'abc' + cast(a+1 as varchar(5))
from tmp
where a < 10
)
insert CategoryLookup
select * from tmp
go
create clustered index cix_CategoryLookup on CategoryLookup(CategoryID)
go
The Transactions test table DDL and data:
drop table Transactions
go
create table Transactions(CategoryID int,
CategoryName char(50),
filler char(1000))
go
set nocount on
go
declare @i int
set @i = 1
begin tran
while @i <= 5000000
begin
insert Transactions
select @i % 10 + 1, 'abc', 'filler'
if @i % 100000 = 0
begin
commit tran
begin tran
end
set @i = @i + 1
end
if @@trancount > 0
commit tran
go
sp_spaceused Transactions
go
create clustered index cix_Transactions on Transactions(CategoryID)
go
drop table test_log – this tale is used to log the test times
go
create table test_log (
Name varchar(50),
Num int,
StartTime datetime,
EndTime datetime NULL
)
go
The test script:
set nocount on
go
declare @dt datetime,
@i int
set @i = 1
while @i < 20 -- run the test 20 times
begin
set @dt = getdate()
insert test_log select 'CASE method', 10, @dt, NULL
update Transactions
set CategoryName = case
when CategoryID = 1 then 'abc1'
when CategoryID = 2 then 'abc2'
when CategoryID = 3 then 'abc3'
when CategoryID = 4 then 'abc4'
when CategoryID = 5 then 'abc5'
when CategoryID = 6 then 'abc6'
when CategoryID = 7 then 'abc7'
when CategoryID = 8 then 'abc8'
when CategoryID = 9 then 'abc9'
when CategoryID = 10 then 'abc10'
end
update test_log
set EndTime = getdate()
where StartTime = @dt
set @dt = getdate()
insert test_log select 'Subquery method', 10, @dt, NULL
update Transactions
set CategoryName =
(select CategoryLookup.CategoryName
from CategoryLookup
where CategoryLookup.CategoryID= Transactions.CategoryID )
update test_log
set EndTime = getdate()
where StartTime = @dt
set @dt = getdate()
insert test_log select 'JOIN method', 10, @dt, NULL
update t1
set t1.CategoryName = t2.CategoryName
from Transactions t1
join CategoryLookup t2 on t1.CategoryID = t2.CategoryID
update test_log
set EndTime = getdate()
where StartTime = @dt
set @i = @i +1
end
The reported results were obtained on a DL585 G1 with 64GB of RAM and eight 2.6GHz cores, running Windows Server Enterprise 2003 and SQL Server 2008 SP2 Enterprise x64 Edition. 50GB was allocated to the SQL Server instance.