A few days ago I was asked a question how to remove duplicates from a table. As usually in such cases, the duplicate values were related only to several columns, not all of them. So classical approach, to copy distinct data into temp table, truncate the table with duplicates and then copy the data back wouldn’t work. In such cases I find one of the windowing functions, row_number() to be the easiest to use. You just define criteria of numbering of rows and then you remove all rows are not first in their partition. Let’s have a look how it works.
First, let’s create and populate sample table:
- if object_id('tempdb..#t') is not null
- drop table #t
-
- create table #t(Id char(3), randomData uniqueidentifier default(newid()))
-
- insert #t(Id)
- values
- ('123'), ('124'),
- ('133'), ('133'), ('133'),
- ('141'), ('141'),
- ('121'), ('121'),
- ('145'), ('145'),
- ('152')
-
- select * from #t
When you run the query, you should see result similar to the following illustration:

Deleting duplicates is pretty straightforward. You define partitioned row numbering and then you delete all rows that have row number bigger than 1:
- ;with dupes as
- (
- select *, row_number() over(partition by Id order by Id) as rn
- from #t
- )
- delete dupes
- where rn > 1
-
- select * from #t
As you see, the duplicates were removed.

The execution plan for the above statement is pretty straightforward:

The most interesting operator in this plan is the Sequence Project operator. It adds the calculated row number to the output, based on the criteria that you pass to the row_number() OVER clause. Here we partitioned the input by column Id which causes the row number to be restarted from 1 whenever new Id is encountered in the input. The row number is returned as result of internal function row_number, which you can see in the following picture:

On the left to the Sequence Project operator you can see Filter operator. This operator filters the input based on the certain predicate. You can see in the picture below, that Filter operator allows to pass though only rows that have Expr1004 > 1.

You can check that in this particular case only five rows were let through and subsequently deleted. Two with Id 133, one 141, one 121 and one 145.

What if you want to look at the rows that are deleted? Let’s repopulate the table and run the delete statement, this time with the OUTPUT clause:
- truncate table #t
-
- insert #t(Id)
- values
- ('123'), ('124'),
- ('133'), ('133'), ('133'),
- ('141'), ('141'),
- ('121'), ('121'),
- ('145'), ('145'),
- ('152')
-
-
- ;with dupes as
- (
- select *, row_number() over(partition by Id order by Id) as rn
- from #t
- )
- delete dupes
- output deleted.*
- where rn > 1
The OUTPUT clause allows you to return rows from pseudo-tables inserted and deleted, the same that you have access to in triggers. You can return these rows to the client or to a temporary table or table variable for further processing, or audit. Let’s have a look at the execution plan.

What’s this?! The Sort, Segment and Sequence Project operators where added second time to the plan! When you look at their details you will see that they have exactly the same predicates and conditions as the set on the right-hand side of the Filter operator. The difference is that they are applied to the data stream returned by the output clause. You can verify this looking at the results returned by the statement:

Have you noticed something? the rn column contains values 1 and 2, while it should contain only values 2 and 3, because this is what the Filter operator allowed to flow through. This is the evidence that the new operators work on the output data stream. You can see that they work on 5 rows, the ones that were deleted:

It is obvious that if you use output clause on queries that otherwise perform very well, you may run into problems. When you have many rows, the Sort operator can become a bottleneck when it is executed twice. The example also shows that the row number identifier of the row is not reliable within boundaries of single query. This may become an issue for you, if you want to identify rows based on the output from an update or delete statement.
From purely performance point of view, what can you do to avoid these additional operators? The answer is simple – don’t return the row_number() column in the OUTPUT clause.
- truncate table #t
-
- insert #t(Id)
- values
- ('123'), ('124'),
- ('133'), ('133'), ('133'),
- ('141'), ('141'),
- ('121'), ('121'),
- ('145'), ('145'),
- ('152')
-
-
- ;with dupes as
- (
- select *, row_number() over(partition by Id order by Id) as rn
- from #t
- )
- delete dupes
- output deleted.Id, deleted.randomData
- where rn > 1
As you see in the snippet above, I explicitly return columns from the deleted table and I don’t return the rn column. The optimizer realized that the column is not needed and removed the additional operators from the plan, so looks exactly the same as the first plan we saw in this post.

Well, if someone asked me what I think about this behavior, I would say it is a bug in the SQL Server engine. The windowing functions shouldn’t be applied to the results of the OUTPUT clause because they affect performance and potentially affect the logic of the application.
This is the link to the test script, so you can run your own investigations.