It would be fair to say that there aren't a huge number of
programmability features added to SQL Server 2008 that will have a
great impact on most of our day-to-day lives as SQL Server developers.
The release was clearly manageability themed rather than
programmability themed, and I can count the most interesting and
far-reaching developer enhancements on a few fingers: the DATE data
type is a great time-saver, TVPs are fun (but not nearly the game
changer they could have been), and Change Tracking will certainly help
with caching scenarios.
MERGE is a feature that initially made this list for me. I was
really looking forward to using it and thought it would make life
easier for many ETL tasks requiring "upsert" functionality. My first
impression of the feature was that people wouldn't love the syntax, and
even now that I'm used to it I still think it leaves a bit to be
desired. My second impression came when working on a large data
load--hundreds of millions of rows--and I discovered that compared with
the temp table approach I'd been using in the SQL Server 2005 version
of the procedure, the MERGE performance was dismal. My first attempt
showed a 20x decrease in performance (that's twenty times slower,
not
percent). While spending a day and a half messing with the thing and
experimenting with various combinations I saw as much as a 100x
decrease in performance at one point, but after losing much blood and
sweat, and having shed many tears, I finally
managed to tune it to the point where it was only twice as slow as the
previous methodology.
And that's when MERGE left my list of interesting features.
Open-minded kind of guy that I am, I didn't completely dismiss
MERGE, and it's a good thing; it provides us with at least two
features, independent of its ability to do "upsert", that not only put
it back on my list of interesting features, but right at the top--as the
number one most important developer enhancement in SQL Server 2008.
The first feature is the ability to do joined updates, without
encountering the potential data quality issues that the UPDATE FROM
syntax can cause. Hugo Kornelis has already done a great job of describing that ability, so I won't discuss it here. But please, read Hugo's post before writing another UPDATE FROM. You'll be happy you did.
The second enhancement is equally important, but I haven't seen
coverage of it in any of the blogs or other sources I read (aside from Ward Pond mentioning it in passing). The MERGE
statement's OUTPUT clause can do a special trick that other OUTPUT
clauses cannot: it has the ability to output data not only from the
"inserted" and "deleted" virtual tables, but also from either the
source or destination table. This means that MERGE, even without its
"upsert" capabilities, is surprisingly more powerful than the
INSERT, UPDATE, or DELETE statements and can effect better logging, auditing,
and more precise ETL processes. A true win-win.
But before we jump into MERGE's capabilities in this area, a bit of
background is in order for those who might not be well-versed in all of
the T-SQL enhancements that have been added in the past two versions of
SQL Server.
Background #1: The OUTPUT Clause
Added in SQL Server 2005, the OUTPUT clause gives DML
constructs--INSERT, UPDATE, DELETE, and now MERGE--the ability to
redirect the results of the DML either back to the caller as a rowset
or to a table. This is accomplished by leveraging the same idea that we
have in triggers: "inserted" and "deleted" virtual tables. When
inserting data you have access to a virtual table called "inserted",
when deleting you have access to "deleted", and when updating you have
access to both tables. This is great for getting back IDENTITY values,
among other things:
DECLARE @x TABLE
(
x INT IDENTITY(1,1) NOT NULL,
y INT NOT NULL
);
INSERT @x
(
y
)
OUTPUT
inserted.x,
inserted.y
SELECT 2 AS y
UNION ALL
SELECT 1;
UPDATE @x
SET
y = y + 1
OUTPUT
deleted.y AS old_y,
inserted.y AS new_y;
DELETE @x
OUTPUT
deleted.x
WHERE
y = 3;
In addition to simply getting back values, we can, as mentioned, also redirect them to a table:
DECLARE @x TABLE
(
x INT IDENTITY(1,1) NOT NULL,
y INT NOT NULL
);
DECLARE @y TABLE
(
x INT NOT NULL,
y INT NOT NULL
);
INSERT @x
(
y
)
OUTPUT
inserted.x,
inserted.y
INTO @y
SELECT 2 AS y
UNION ALL
SELECT 1;
SELECT
*
FROM @y;
As an aside, what I would really love to be able to do is reroute
the output into one or more scalar variables when I know that only a
single row will be affected by the operation. If you agree, please vote
on the Connect item I've filed asking for that capability.
I suspect that if you are a SQL Server developer and can not readily
find--or have not already found--uses for the OUTPUT clause then you're
not getting challenged much in your day to day work, and it's time to
go look for a new job. On the projects I've done in the past few years
I have found it to be one of the top few enhancements; it has really
made all sorts of tasks much easier to do in SQL Server 2005 and 2008
than they were in SQL Server 2000. And that brings us to our next stop,
the MERGE statement...
Background #2: Basic MERGE Support for the OUTPUT Clause
Prior to the release of SQL Server 2008, Itzik Ben-Gan wrote a massive white paper
covering the various T-SQL enhancements we could expect with the new
version. Included therein is a short section about MERGE, including
some information on the OUTPUT clause. Itzik's coverage is centered
around the fact that MERGE can do something that could never be done
before with a SQL Server DML statement: cause rows in a table to be
inserted, updated, and deleted atomically, in a single statement.
To make life easier for those who want to write generic code, a
special function was introduced that can be used in the OUTPUT clause
with MERGE. This function, $action, outputs one of the
following values to let you know which action affected tho row: INSERT,
UPDATE, or DELETE. You can, of course, get this same functionality by
looking at columns from both the inserted and deleted virtual tables
and writing a CASE experssion, but it's much easier to simply use a
built-in function. The following example shows how the $action function
works in practice:
DECLARE @x TABLE
(
x INT IDENTITY(1,1) NOT NULL,
y INT NOT NULL
);
INSERT @x
(
y
)
SELECT 2 AS y
UNION ALL
SELECT 1;
MERGE INTO @x AS tgt
USING
(
SELECT 2 AS y
UNION ALL
SELECT 3
) AS src ON
src.y = tgt.y
WHEN MATCHED THEN
UPDATE
SET tgt.y = tgt.y + 1
WHEN NOT MATCHED THEN
INSERT
(
y
)
VALUES
(
src.y
)
WHEN NOT MATCHED BY SOURCE THEN
DELETE
OUTPUT
$action AS dml_action,
inserted.x AS new_x,
deleted.x AS old_x,
inserted.y AS new_y,
deleted.y AS old_y;
I should also mention here that SQL Server 2008 includes better support
for the OUTPUT clause via a feature called "composable DML." This is
not a MERGE enhancement--it works with any DML statement--but it's
tangentially related and certainly of interest to people who might be
reading this post.
The
composable DML feature allows you to do something very strange indeed:
treat DML statements that use the OUTPUT clause as derived tables, and
use the result of the OUTPUT in an outer INSERT-SELECT. This is helpful
when you need a bit more precision; for example, perhaps you want to
populate a history table, but you're only interested in archiving the
data that's been updated or deleted--the newly inserted data is already
in the live table. To filter the results down using the composable DML
feature you would do something similar to what is shown in the
following example:
DECLARE @x TABLE
(
x INT IDENTITY(1,1) NOT NULL,
y INT NOT NULL
);
INSERT @x
(
y
)
SELECT 2 AS y
UNION ALL
SELECT 1;
DECLARE @history TABLE
(
x INT,
y INT NOT NULL,
dml_action VARCHAR(6) NOT NULL,
action_date_time DATETIME2(7) NOT NULL
)
INSERT @history
(
x,
y,
dml_action,
action_date_time
)
SELECT
x.old_x,
x.old_y,
x.dml_action,
SYSDATETIME()
FROM
(
MERGE INTO @x AS tgt
USING
(
SELECT 2 AS y
UNION ALL
SELECT 3
) AS src ON
src.y = tgt.y
WHEN MATCHED THEN
UPDATE
SET tgt.y = tgt.y + 1
WHEN NOT MATCHED THEN
INSERT
(
y
)
VALUES
(
src.y
)
WHEN NOT MATCHED BY SOURCE THEN
DELETE
OUTPUT
$action AS dml_action,
deleted.x AS old_x,
deleted.y AS old_y
) AS x
WHERE
x.dml_action IN ('UPDATE', 'DELETE');
SELECT *
FROM @history;
Note that as of SQL Server 2008 the results of the outer operation must
be an INSERT-SELECT. That's a rather frustrating limitation, and I
really hope the SQL Server team loosens the restrictions in the next
version of the product.
Payload: MERGE, OUTPUT, and Access to Source and Destination Columns
Now that the groundwork has been laid--MERGE, OUTPUT, OUTPUT with
MERGE (or is it the other way around?), and composable DML--we can
finally get to the impetus behind this post, which is based on two
factors. First of all, MERGE adds more power to the OUTPUT clause than
just the $action function; it also allows us to access source and
destination columns that we can't get to using INSERT, UPDATE, or
DELETE. And second, MERGE can be used for standalone inserts, updates,
or deletes. This means that it doesn't matter if we can't get to the
data using the other three statements; we have MERGE and it can be
those other statements for us when we need it to.
So what's so great about the ability to access source columns that
you're not using as part of the data modification? Let's take a look at
a simple example. Here's the scenario: You've been asked to load a flat
file into the database. Each line of this flat file contains
information about a person: a name, current salary, a list of up to
three companies of employment, and a list of up to three childrens'
first names, all in comma-delimited format. Here are the tables we want
to load the data into:
USE tempdb
GO
CREATE TABLE people
(
name VARCHAR(100) NOT NULL,
current_salary INT NOT NULL,
person_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY
)
GO
CREATE TABLE previous_companies
(
company_name VARCHAR(100) NOT NULL,
person_id INT NOT NULL REFERENCES people (person_id),
PRIMARY KEY (company_name, person_id)
)
GO
CREATE TABLE childrens_names
(
child_name VARCHAR(100) NOT NULL,
person_id INT NOT NULL REFERENCES people (person_id),
PRIMARY KEY (child_name, person_id)
)
GO
And here is some sample data:
Joe Smith,110000,Microsoft,,,John,Sue,Ed
Jane Baker,50000,Oracle,Teradata,,George,,
Tom Jones,0,,,,Anne,Cathy,
Joe Smith,100000,Microsoft,,,Steve,,
Kate Davis,150000,Google,Oracle,Microsoft,Don,Amy,
After taking a quick glance at this data I hope you can see right
away that we have some issues. Looking at the first and third lines you
can see that we have duplicate names--Joe Smith--and therefore the name
is not a good candidate for a primary key. Salary, with this small set,
will work fine if combined with name, but of course it's well within
the realm of possibility that two people can have the same salary,
especially when working for the same company as both of our Joe Smiths
do. We could add childrens' names to the mix, but of course there is
even no guarantee that two people with the same name, working for the
same company and making the same salary, won't have a child with the
same name. I'm sure it's happened before and will happen again.
We've
been reassured that each line really does represent a unique person, so
we have to make it work. No problem once we're in the database--we can
deal with the issue using the surrogate key on the "people" table--but
how should we handle the data on the way in?
Assuming that we're doing all of this processing in T-SQL, the
logical process is straightforward: we're going to have to insert the
names and salaries into the "people" table, get back one person_id per
name, and then do the inserts first into previous_companies and then
into childrens_names. Again, this wouldn't be a big deal with the small
sample data set here--we could, after inserting the names and salaries,
join those back to the input data set (assume that it's in a temp table
at this point) in order to map back the surrogate keys. And that would
work well enough; but as more columns are added this will get trickier
and trickier, and more prone to the introduction of a nasty bug that
will cause the keys to be mismapped. There must be a better way--and
thanks to MERGE, there is.
The solution to this problem requires a couple of steps. First
create a surrogate key on the incoming data, separate from the
surrogate key that will eventually be assigned when rows are inserted
into the "people" table:
CREATE TABLE #data
(
name VARCHAR(100),
salary INT,
company1 VARCHAR(100),
company2 VARCHAR(100),
company3 VARCHAR(100),
child1 VARCHAR(100),
child2 VARCHAR(100),
child3 VARCHAR(100),
input_surrogate INT IDENTITY(1,1) NOT NULL PRIMARY KEY
);
INSERT #data
VALUES
('Joe Smith',110000,'Microsoft',NULL,NULL,'John','Sue','Ed'),
('Jane Baker',50000,'Oracle','Teradata',NULL,'George',NULL,NULL),
('Tom Jones',0,NULL,NULL,NULL,'Anne','Cathy',NULL),
('Joe Smith',100000,'Microsoft',NULL,NULL,'Steve',NULL,NULL),
('Kate Davis',150000,'Google','Oracle','Microsoft','Don','Amy',NULL);
Once
we have this input_surrogate populated, we need to be able to access
the values in order to match them to the surrogate generated during the
insert to the "people" table. And this is where the power of MERGE
kicks in:
CREATE TABLE #surrogate_map
(
input_surrogate INT NOT NULL,
person_id INT NOT NULL
);
MERGE INTO people AS tgt
USING #data AS src ON
1=0 --Never match
WHEN NOT MATCHED THEN
INSERT
(
name,
current_salary
)
VALUES
(
src.name,
src.salary
)
OUTPUT
src.input_surrogate,
inserted.person_id
INTO #surrogate_map;
In
this example I've forced the MERGE statement to behave as an INSERT
statement by using a predicate of 1=0; since this will never match on
anything, the WHEN NOT MATCHED clause will fire for every row, thereby
causing that row to be inserted.
The key element on the OUTPUT
side of the equation is the seemingly-innocent third-to-last line. I'm
using "src.input_surrogate"--a column that is not used as part of the
INSERT--as part of the OUTPUT clause. This little feature enables us to
reliably map the person_id surrogate to the input_surrogate without
playing any games or worrying about complex logic or coding bugs.
Simply stated, it just works. And the rest of the ETL process, from
here on out, is simple enough, thanks to this mapping, that I won't
have to bore you with it.
This OUTPUT feature of MERGE is not only useful for INSERTs and
mapping of keys. Any time you're doing DML based on joins between
tables, you might consider using this feature to help with logging. For
example, you might want to delete some rows from one table by matching
the rows with another table based on a range predicate. Want to know
which rows matched and caused deletions to occur? No problem:
MERGE INTO people AS tgt
USING
(
SELECT 100000 AS salary
UNION ALL
SELECT 50000 AS salary
) AS src ON
tgt.current_salary BETWEEN src.salary AND src.salary * 1.10
WHEN MATCHED THEN
DELETE
OUTPUT
src.salary AS source_salary,
deleted.person_id,
deleted.current_salary;
Prior
to MERGE, properly doing this kind of thing would have required much
more complex code: Most likely you would have to do the work in at
least two steps, first finding potential matches and storing them in a
temporary table, then going back to do the DELETE. MERGE and OUTPUT
enables all of this work to be done in a single statement, and while
this may or may not be more efficient (see my notes at the beginning of
this post) it is definitely more elegant--and in most cases
maintainability is more important when performance differences aren't
extreme.
There are a number of takeaways here. First of all,
MERGE used as an "upsert" tool may or may not be the best choice. In my
experience performance differences will be extreme enough to warrant
not using it for that purpose. But I encourage you to give it a try and
not rely on my experiences as a guide; and after testing, I would
definitely like to hear what kinds of results you're getting,
especially if you're migrating legacy code and testing both methods.
Second, MERGE is not only an "upsert" tool. It can be used for INSERTs
and DELETEs, and as Hugo showed in the post I linked to, it is the
number one choice for UPDATEs when you need to do the update based on a
join. Third, thanks to the power of the OUTPUT clause, MERGE is
incredibly useful in the areas of ETL and logging--and probably many
others that I haven't considered yet.
I'll conclude by saying
that a lot of times the true power of a new feature can be subtle and
difficult to identify without spending some time thinking and
experimenting. As end-users we're often quick to dismiss something we
don't immediately understand and while that's probably not entirely
unfair in a lot of cases, in this one it was. I initially cast MERGE
aside but thanks to being somewhat open-minded I now realize that it is
actually quite powerful when used in the correct context. I have
already used it several times in situations similar to those described
in this post and look forward to a long relationship with this new DML
construct.