THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Aaron Bertrand

Aaron is a Product Evangelist for SentryOne, makers of Plan Explorer and a performance monitoring and event management platform for the Microsoft Data Platform and VMware. He has been blogging here at sqlblog.com since 2006, focusing on manageability, performance, and new features, and also blogs at blogs.sentryone.com and SQLPerformance.com; has been a Microsoft MVP since 1997; tweets as @AaronBertrand; and speaks frequently at major conferences, user group meetings, and SQL Saturday events worldwide.

Bad habits to kick : making assumptions about IDENTITY

In my last post in this series, I talked about inconsistent naming conventions.  This time I want to talk about a few of the assumptions people make when using IDENTITY columns, and what kind of trouble they can cause.

IDENTITY will prevent gaps

A lot of people choose IDENTITY as their "primary key" and expect it to leave no gaps - ever.

I'll leave the choice of IDENTITY as a primary key (over natural keys or GUIDs) for other discussions.  Personally, I do use IDENTITY as the PRIMARY KEY in some scenarios, for various reasons.  Usually there are unique constraints as well, and the primary key is not always clustered, since that is not always an optimal scenario.  Again, these are conversations I will leave for elsewhere.  (Back in 2003, I compared IDENTITY and GUID; much more recently, Mladen Prajdic talked about why why he prefers surrogates).

The way people use IDENTITY, seem to stem from the belief that the value assigned by the database should mean something.  In fact, the value of an IDENTITY column should basically be unknown and meaningless to you except in the case where you are debugging actual data.  The value should never be exposed to the user, or at least not exposed in a meaningful way.

But let's get back to the basics.  The purpose of an IDENTITY column is first and foremost to provide UNIQUE values, not CONSECUTIVE values.  Unfortunately, you cannot rely on IDENTITY values to be consecutive with no gaps.  There are several ways that gaps can be created in an IDENTITY column's sequence, including rolling back a transaction and manually changing the seed value:
CREATE TABLE dbo.foo
(
   
bar INT IDENTITY(1,1)
);
GO

INSERT dbo.foo DEFAULT VALUES;

BEGIN TRANSACTION
;
INSERT dbo.foo DEFAULT VALUES;
ROLLBACK TRANSACTION;

INSERT 
dbo.foo DEFAULT VALUES;

SELECT 
bar FROM dbo.foo;

Results:

bar
--------
       1
       3

Then if you later do this:

DBCC CHECKIDENT('dbo.foo', RESEED, 5);
INSERT dbo.foo DEFAULT VALUES;

SELECT 
bar FROM dbo.foo;
Results:
bar
--------
       1
       3
       6

So, two very simple demonstrations of how there can be gaps in your IDENTITY columns.  What you can do to solve this "problem" depends on your goal.  If you just need "row number" values that have no gaps but that can change (e.g. for a report), you can just add a ROW_NUMBER() to the query and not worry about the actual IDENTITY values:

SELECT
    
bar,
    
rownum = ROW_NUMBER() OVER (ORDER BY bar)
FROM
    
dbo.foo
ORDER BY
   
rownum;

Results:

bar      rownum
-------- --------
       1        1
       3        2
       6        3
There are some situations where you have no choice in the matter; gaps are simply not allowed.  For example, business rules in some jurisdictions require invoice numbers to be sequential with no gaps.  In this case, you shouldn't be using IDENTITY at all, but rather you can use a different technique that generates the next value, typically referred to as "rolling your own IDENTITY." 
CREATE TABLE dbo.mort
(
   
bar INT PRIMARY KEY
);
GO

CREATE PROCEDURE 
dbo.mort_create
AS
BEGIN
   SET NOCOUNT
ON;
  
   
DECLARE @bar INT;
  
   
BEGIN TRANSACTION;

  
SELECT @bar = COALESCE(MAX(bar), 0) + 1
      
FROM dbo.mort WITH (UPDLOCK);
          
   
INSERT dbo.mort SELECT @bar;
  
   
COMMIT TRANSACTION;
END
GO

Now you are able to sufficiently guarantee that there will be no gaps (assume for a moment that there is actual error handling in the procedure, which I omitted for readability).  The downside of this solution is that concurrency can be drastically affected.  Or at least, that is the myth.  Let's see how many rows we can insert using this solution vs. an IDENTITY column.  I created five query windows for each method, dropped and re-created the table dbo.foo from above (this time with a primary key constraint on the IDENTITY column, to make the comparison fair), and then tried to insert 5,000 from each connection:

Method 1 : IDENTITY:

SET NOCOUNT ON;
GO
INSERT 
dbo.foo DEFAULT VALUES;
GO 5000

Method 2 : "Roll your own": 

EXEC dbo.mort_create;
GO 5000

Results: both sets of tests performed in exactly 75 seconds.  I truncated the tables and ran the tests multiple times, to ensure it wasn't an anomaly, that one of the scripts wasn't inadvertently delayed by an autogrow event, etc.  Nope, exactly 75 seconds, every time.

So in the past, those of you who have a business requirement for no gaps and a requirement for high concurrency certainly, have faced an interesting challenge.  You often had to trade off potential performance to meet your business requirement, or the opposite.  And often because people told you that the MAX+1 solution wouldn't scale as well.  I will admit that I have been guilty of telling people that from time to time, because it seemed pretty logical to me.  Now that I have performed tests, I will get out of the bad habit of telling people which solution will scale better.

IDENTITY will prevent duplicates

While the IDENTITY property attempts to provide you with unique values, it cannot guarantee them.  The element that prevents duplicates in a column is not the IDENTITY property itself, but rather the constraint(s) on that column.  Here is very simple proof, a table with an IDENTITY column but the primary key was "forgotten":

CREATE TABLE dbo.splunge
(
   
bar INT IDENTITY(1,1)
);

INSERT dbo.splunge DEFAULT VALUES;

DBCC CHECKIDENT('dbo.splunge', RESEED0);

INSERT dbo.splunge DEFAULT VALUES;

SELECT bar FROM dbo.splunge;

DROP TABLE dbo.splunge;

Results:

bar
--------
       1
       1

So clearly, if you try hard enough, or are particularly accident-prone, you can create duplicate values (unless you also explicitly define the column as unique).

INT will be big enough

Some people apply an IDENTITY column to a table that doesn't need one.  For example, a log table that holds high volumes of transaction data; what purpose does an IDENTITY column serve here?  What does it mean exactly?  If it is to ensure rows are always added to the "end" of the table, and you are also recording date and time as part of the data, then it is just wasting room on the page.  Presumably the data is coming into the log table in (roughly, or exactly) chronological order.  In which case, you could just as easily cluster (and even partition) on the datetime column.  This will keep the same behavior (inserting to the "end" of the table) and will allow much more accurate purging of the data or partition switching.  If you need to identify a set of rows later, this is typically going to be by date range.  Getting the IDENTITY values represented by that date range is just an extra and unnecessary step.

Anyway, I brought up this example because, if you choose an INT, and you start at 1, you only have ~2 billion values to use up before you will have to make a change to the table.  A knee-jerk reaction by some people is to say, "well, we should have started the seed at -2<billion>, then we wouldn't have this problem!"  Well, you would, it would just come a little later; if 2 billion isn't enough, it's unlikely that 4 billion will be enough, either. 

Ideally, you could switch to BIGINT and not look back.  However, changing data types can be very disruptive, since you could have many stored procedures, schema-bound views and other objects that are using the INT type to represent the data that is in this column.  While the data in the column is still inside the INT upper bound, you could change the column without changing all of these other elements immediately.  Implicit conversion won't kill you, but if you're approaching the upper limit, you could wake up tomorrow and suddenly have overflow problems.

For a temporary solution while you plan your conversion, you could set the identity value to -2,147,483,648.  Note that this significantly changes your index usage if they include the IDENTITY column.  If you still want the rows to sort in the order of insert when you select them, you could use a CreatedDate column with a default of CURRENT_TIMESTAMP, and order by that.  Or, if you don't have such a column, you could use something like this, but it is pretty ugly:

ORDER BY 
CASE
WHEN
IdentityColumn > THEN '1'
ELSE '2'
END
,
CASE
WHEN
IdentityColumn < 0 THEN -IdentityColumn
END DESC;

Hopefully you can see that if you anticipate this problem and can plan ahead, and you really do want the table to have an IDENTITY column, it is better to go with BIGINT from the start, rather than just double your pool of values by starting at -2,147,483,648 instead of 1.  Never mind the debugging problems on initial implementation of that table.  Another issue that will happen is if you start at 1, whether you use INT or BIGINT, you can achieve fantastic compression in SQL Server 2008 for the first many thousands of rows.  If you start at -2 billion, forget about it.

@@IDENTITY and IDENT_CURRENT() will be reliable

I strongly recommend staying away from both @@IDENTITY and IDENT_CURRENT(). 

Many people used @@IDENTITY before SCOPE_IDENTITY() arrived on the scene.  One problem with @@IDENTITY is that if the table from the first insert had a trigger, and that trigger also inserted into a table with an IDENTITY column (e.g. an audit table), the @@IDENTITY value returned is the one from the audit table, not the original table.

I also see the occasional person thinking that using IDENT_CURRENT('tablename') is a reliable way to retrieve the IDENTITY value that you just generated.  This is not true, because it is not scope-specific; another user may have also inserted into the same table after you did, and you could be getting *their* IDENTITY value.

SCOPE_IDENTITY() avoids both of these problems by returning the IDENTITY value you generated.  A couple of issues with SCOPE_IDENTITY() that you should keep in mind, however: you cannot expect SCOPE_IDENTITY() to retrieve the value that some other session inserted, and there are a few outstanding issues with SCOPE_IDENTITY() in SQL Server 2005 and 2008 (see Connect items #328811 and #476577).

Summary

Hopefully this has shed some light on a few of the assumptions people make about IDENTITY columns, and it helps prevent you from falling into the same traps.


I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code.  Up next: choosing the wrong data type.

Published Monday, October 12, 2009 12:23 AM by AaronBertrand

Comments

 

dd said:

"Do you have suggestions for the series? "

You could tag all of them with "Bad habits" so we don't have to search around.

October 12, 2009 8:47 AM
 

AaronBertrand said:

October 12, 2009 8:52 AM
 

jamiet said:

Aaron,

"Do you have suggestions for the series?"

I'm a big fan of naming constraints explicitly. So, rather than:

create table t1

(

 foo int not null primary key

, bar nvarchar(10) not null

)

I would write:

create table t1

(

 foo int not null

, bar nvarchar(10) not null

, constraint t1_PK PRIMARY KEY (foo)

)

-Jamie

October 12, 2009 10:42 AM
 

AaronBertrand said:

Jamie, true, but I don't bother doing that for sample / throw-away code.

October 12, 2009 10:51 AM
 

jamiet said:

Aaron,

Nope, me neither. But for deployed code I absolutely would - I figured this series was more aimed at code that was to be deployed than sample/throw-away code, hence why I made the suggestion.

Just to clarify, I wasn't writing that comment in response to your sample code above (where, I have just noticed, you used the inline syntax), I meant it might be a good subject for another post in this series.

cheers

Jamie

October 12, 2009 11:02 AM
 

AaronBertrand said:

Fair point Jamie.  I agree that naming constraints is better, but I honestly don't do it very often.  Why?  The benefits of doing so just don't seem worth the extra effort.  Also when I create my scripts to save away, I like to have the constraint definitions right there with the column as opposed to defined afterward.  Other than being able to guess the name, what benefit(s) do you observe from naming the constraints?  Maybe you can convince me to kick a bad habit.  :-)

October 12, 2009 11:07 AM
 

jamiet said:

Well, I'm very particular (my colleagues would say "anal" :) about naming conventions and hence I'm never content to let the engine name my constraints for me - I'd rather be in total control. That's the main reason.

Aside from that, at least you're ensuring that your constraints always have the same name on whichever environment (dev, test, QA) you happen to be working on. I have seen code that references constraints by name (usually in a data warehouse implementation to disable/enable them before/after a big insertion) so knowing the name in that situation is a pre-requisite.

Have I convinced you? :)

I do have some DDL code I always run that renames my PKs/FKs/constraints anyway accordig to my preferred naming convention, but better to be proactive about it I reckon!

-Jamie

October 12, 2009 11:16 AM
 

AaronBertrand said:

Not really, I don't want my colleagues to start calling me anal!  :-)

I agree that having the constraints use the same name in each environment is important.  However, I don't create objects in post-dev environments by hand; they are pushed by a build tool, and the script is generated based on the existing object (including any system-assigned properties).  So I don't have to worry about a mismatch between environments, but I understand why it could be important for some...

October 12, 2009 11:27 AM
 

jamiet said:

Right, might blog about it myself then. Gives me some material at least - i've been a bit light on the blogging front lately!

cheers

Jamie

October 12, 2009 11:35 AM
 

Glenn Berry said:

Maybe you could write a post about using SET NOCOUNT ON in stored procedures?  Great series, Aaron!

October 12, 2009 1:20 PM
 

Bruce W Cassidy said:

I would think the bad habit to kick in this discussion is the use of a surrogate key that has business requirements attached.  Either it's a surrogate and the value is meaningless, or it's not.

Where I use identity() to provide surrogate keys, the value is fairly meaningless.  The fact that it is ascending means it works well with a clustered index, but who cares about gaps in the value range?

Muddling the waters and relying on identity() to supply a gap-free value range seems to me rather pointless.  I do like that you supplied an alternative for just such a case, so well done there.

October 18, 2009 3:39 PM
 

Thomas said:

A key tactic to ensuring success with surrogate keys is to never show them directly to the user. You want the flexibility of changing the surrogate key value if need be. In addition, it is important to have a "business key" on your tables. By that I mean, a unique constraint, other than the surrogate which defines uniqueness. Otherwise, you can end up with identical rows and no means to differentiate them. A "gap free" value implies a user consumed value which means not the surrogate.

There are some circumstances where naming constraints comes in handy. Perhaps the biggest is if you database updates to the production system are all scripted. It is easier to know whether you have already created said constraint by name than by its type. For primary keys, it probably doesn't matter, but if you have say two or three unique constraints, knowing whether you have already created it is difficult without constraint names.

October 18, 2009 4:51 PM
 

AaronBertrand said:

Bruce, it's funny you mention that, because when I originally started the topic, it had much more of a "stop expecting IDENTITY values to mean something" flavor.  You can see this by the page name of this article, which adopted a URL-friendly version of, "Expecting identity to mean something."  While I agree with the sentiment, there are cases where system-generated surrogate keys *do* have to mean something.  Invoice and Customer numbers *are* important - there isn't another good way to identify an invoice (at least without using a complex, composite key), and customer names can change.  But users will still need to identify these things quickly and easily.

Thomas, I do not recall advocating creating tables where the *ONLY* means of uniquely identifying a row is through the surrogate key.  Of course other unique constraints should be in place.  However it is not always best to make the "natural key" the primary key for various reasons.

October 18, 2009 5:05 PM
 

Thomas said:

-- However it is not

-- always best to make the

--"natural key" the primary

-- key for various reasons.

Agreed. In fact, I rarely if ever use natural keys. States, currencies and only recently languages are the only ones on which I use natural keys as the PK. One problem with natural key is that they are not always known at the time of record creation. For example, employees. You might want to enter employee data even if you do not yet know their SSN. There might be other business rules which make the employee inactive until you have a Tax ID, but it shouldn't stop all entry from occurring until that happens.

October 18, 2009 5:31 PM
 

Philip Helmer said:

On the subject of naming constraints, it did come into play in a database that supports one of our products (we bought it, we didn't design/build it). I don't recall the exact task as this was a while ago, but it was a structure update that was to be included in a version upgrade for our customers.

The tables typically use NOT NULL columns with DEFAULT constraints defined. There are a few other contraints on the tables (a mix of PK and UNIQUE). So, all in all, each table had quite a few constraints defined on it.

None of these constraints are explicity named. We had to make a pretty broad change to the db that required either re-creating or dropping some of the constraints. Since there weren't explict names (much less a naming convention) in place, we had to identify the constraints via their system catalogs attributes rather than their names.

As you can imagine, if the constraints were named less like "DF__TEMP__orde__disc__77B1497A" and more like "DEFAULT_order_discount", that extra coding and execution overhead could have been avoided.

Hopefully, this now puts the non-explicit naming of constraints on your list of bad habits to break.

October 21, 2009 9:22 PM
 

AaronBertrand said:

Thanks Philip, in spite of my earlier comment, I have actually run into a couple of similar issues due to system-generated names.  I think it will be a hard habit to break for people who don't face these issues (since it is so easy to write something silly like CustomerID INT NOT NULL PRIMARY KEY DEFAULT 1), but nonetheless I have added it to the list of future topics I'll blog about (if Jamie doesn't beat me to it).

October 21, 2009 10:51 PM
 

Tom Thomson said:

Aaron, you give "Also when I create my scripts to save away, I like to have the constraint definitions right there with the column as opposed to defined afterward" as a reason for not explicitly naming constraints, but what is wrong with a script like

Create Table Customer (

   CustomerID INT

       Constraint PK_Customer primary key Constraint,

       Constraint DF_Customer_CustomerID default (1),

   <other columns and named column constraints>,

   <named table constraints>

   )

?

That's the way I usually do my scripts, and it gives me the best of both worlds: full control over constraint names AND having the constraint definitions embedded in the table/column definitions so that the information is all in one place..

October 22, 2009 12:16 PM
 

AaronBertrand said:

Tom, nothing is wrong with that script (well, syntactically there is a mistake, but I get your point), but it is a lot more typing and thinking than what I am used to... so like a lot of the other habits I have pointed out in this series, you just have to determine the pros and cons for you.  For me, the benefits of named constraints in most cases do not justify the time and effort it would take for me to un-learn the syntax I use now.  For a lot of people that answer is probably different, just like I will have a hard time convincing anyone that switching to my convention will provide any tangible benefits (the "less typing" argument aside, obviously) -- especially if they are already thinking about their create table scripts the way you've demonstrated.

October 28, 2009 6:31 PM
 

Jeff Bowman said:

Hi Aaron

I couldn't help but notice in your 2003 column "What should I choose for my primary key?" that you appear to laboring under the assumption that all Americans are required to have a SSN simply because we're Americans.

I would like to take this opportunity to inform you that this is most certainly not the case.

Just as there is no law requiring an American to pay an income tax on his private domestic income, there also is no law requiring an American to have a SSN.

Thanks,

Jeff Bowman

November 3, 2009 2:44 PM
 

Aaron Bertrand said:

Thanks Jeff, in your estimation, what % of US residents bother the trouble and potential risk of fighting the taxation system?  The benefit there is obvious.  What is the benefit of not having an SSN.  I would say that you lose a lot more than you gain, even in the most extreme cases of paranoia.  You need it to drive, to get paid, to get credit, to own a home...  Hell, I have an SSN, and I'm not even an American.

November 3, 2009 3:29 PM
 

Jeff Bowman said:

Hi Aaron

Actually, it's not about money at all. We can always make more money. And it's certainly not about a short term trade-off of what's easiest for the moment.

It's about whether we're going to let our government get away with lawlessness and fraud, and about what kind of life we're creating for our children.

If we have no law we have no society. All it takes for evil to prevail is for good men to do nothing.

Thanks,

Jeff Bowman

November 3, 2009 3:44 PM
 

Ganesh Hegde said:

Even if we do use a BIGINT type, there's an upper limit to the size of a table in SQL Server which would limit us from adding more rows to a table. For example, there's an upper limit of 4 GB for the size of a table in SQL Server 2005. If we have a column of int type in a table, that column alone will take up 4 GB if we start with 1 and go up to the last int (which equals the value of 2 GB) because the size of an int is 2 bytes.

October 18, 2011 6:21 AM
 

Aaron Bertrand : Bad habits to kick : choosing the wrong data type said:

April 25, 2012 8:45 AM
 

Brady Kelly said:

I think it comes down to not using IDENTITY for business columns at all.

The gaps in invoice numbers vs gaps in your primary key should be two separate matters. I would shudder to think of using my PK as my invoice number. Business logic (albeit small enough to fob off to the DB) should take care of that.

That brings me to my second point. In MSSQL only one column can be identity, and for me that is always the PK, which implicitly prohibits duplicates. So, I never have a duplicate IDENTITY column value.

May 18, 2013 12:29 PM
New Comments to this post are disabled

About AaronBertrand

...about me...

This Blog

Syndication

Archives

Privacy Statement