THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Aaron Bertrand

Aaron is a Product Evangelist for SentryOne, makers of Plan Explorer and a performance monitoring and event management platform for the Microsoft Data Platform and VMware. He has been blogging here at sqlblog.com since 2006, focusing on manageability, performance, and new features, and also blogs at blogs.sentryone.com and SQLPerformance.com; has been a Microsoft MVP since 1997; tweets as @AaronBertrand; and speaks frequently at major conferences, user group meetings, and SQL Saturday events worldwide.

Bad habits to kick : putting an IDENTITY column on every table

Back in October, I started a series of blog posts called "Bad Habits to Kick," and thought I would revive the theme.

I've worked with developers that dabble in SQL, and they tend to have a few common traits.  I'm not sure where they come from, but one that I find rather distracting is the tendency to place an IDENTITY column on every single table.  Usually this is done because this is an "easy" way to add a column to the table that allows you to identify a single row.

Now don't get me wrong, I am a firm believer that IDENTITY columns have their place, and I use them quite liberally.  When you are storing information about customers, products or companies, where there are going to be multiple references to those entities in related tables, it makes little sense to copy all of the "real" attributes that identify that entity into all of the related tables.  In some cases this would make your storage requirements surge dramatically, never mind that the performance of JOINs will suffer.  It also can make for a real mess when you are repeating information that is likely to change - such as surname, e-mail address, phone number, ISBN.  Both of these issues are technical and not logical limitations - SQL Server doesn't support a "behind-the-scenes" foreign key construct that would make repeating the information unnecessary, and there are cases where cascading updates do not exactly work as advertised.

A simple example is where you have an Orders table and an OrderDetails table.  (And as a heads up, whiteboarding your perception of Barnes & Noble's database schema is a typical part of my interviewing process - how detailed we get is up to you :-).)

CREATE TABLE dbo.Orders
(
  
OrderID INT IDENTITY(1,1PRIMARY KEY,
  
...
);

CREATE TABLE dbo.OrderDetails
(
 
OrderDetailID INT IDENTITY(1,1PRIMARY KEY,
  
OrderID INT NOT NULL FOREIGN KEY REFERENCES dbo.Orders(OrderID),
  ProductID INT NOT NULL FOREIGN KEY REFERENCES dbo.Products(ProductID),
Quantity INT NOT NULL,
  
...
);

Now tell me, what business purpose does the IDENTITY column on the OrderDetails table satisfy?  When are you ever going to need to reference rows in this table by the OrderDetailID, and not by the OrderID and/or ProductID?  And why is *that* the column that should uniquely identify the row?  This seems to imply that you would allow multiple rows for the same order and product.  Even if the PK is elsewhere and these questions disappear, the value of this column is questionable at best - for no discernible gain whatsoever, you seem to be taking up an extra 4 bytes (plus space in any indexes that include the column, and then the additional I/O involved in maintaining those indexes).

Another example is a logging table.  Let's say you have a table that logs exceptions, or API calls, or other events over time.  You insert rows monotonically either in real time or from log files.  I often see this designed as:

CREATE TABLE dbo.ActivityLog
(
 
LogID INT IDENTITY(1,1PRIMARY KEY,
  
EventDate DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  
...
);

I think part of the knee-jerk reaction to add an IDENTITY column and make it the PK is that events can happen at the same time, so it makes it hard to define a real natural key.  I'll concede this point, as I have several log tables that do not have a PK or a unique constraint - but they certainly have a clustered index, and they do not have an IDENTITY column.  The purpose of this type of table is usually to maintain a history of activity and not to do a lot of searching for individual rows (and when you are searching for individual rows, you are not looking for a specific LogID, you are looking for details in other columns - say, part of an error message).

An interesting thing that came up in one of these cases was the need to purge older data.  So, the developer intended to have a 30-day running window, and once a day a job would wake up and would delete any rows from the log that were more than 30 days old.  Since the clustered index was on the IDENTITY column, their code basically did this:

  - determined the highest IDENTITY value more than 30 days old
  - assigned that to a variable
  - deleted from the table where IDENTITY < variable

With the clustered index moved to the DATETIME column, and the IDENTITY column disposed of, his code became a lot easier (simply a DELETE with a WHERE clause), and more efficient too, since it didn't have to perform a lookup first.

Now let's talk about a very similar case of a logging table where I think an IDENTITY column *is* useful, such as a logging table that keeps track of backup or index activity.

CREATE TABLE dbo.ActivityLog
(
  
LogID INT IDENTITY(1,1UNIQUE NONCLUSTERED,
  
EventType TINYINT-- assume some kind of FK
  
StartDate DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  
EndDate DATETIME NULL,
  
...
);

Now why is an IDENTITY column useful here?  Well, rather than logging the start and end events in different rows, keeping the data on a single row makes it much easier to consolidate the start and end of a specific event.  So the code would do something like:

DECLARE @LogID INT;

INSERT dbo.ActivityLog(EventTypeSELECT 1;

SELECT @LogID SCOPE_IDENTITY();

-- do other work

UPDATE dbo.ActivityLog
SET EndDate CURRENT_TIMESTAMP
WHERE LogID @LogID;

The unique constraint makes it slightly easier for the engine to locate the row to be updated, but the clustered index could still be placed on the StartDate and/or EndDate columns to provide useful searches on date ranges (with the same monotonically increasing property as the IDENTITY column), and a primary key could still be defined on other columns.  I am sure Celko is rolling his eyes right now, though he has probably not even read this far.*

So back to the basics, am I telling you to not use IDENTITY columns?  Absolutely (and hopefully obviously) not.  Just showing that before blindly applying an IDENTITY column to a table, think about what it means and why you are using it.

* P.S. I don't want to debate the merits of using an IDENTITY column or otherwise surrogate key (such as NEWID() or NEWSEQUENTIALID()), vs. using a natural primary key.  This comes up with Celko a lot - he is completely against any kind of system-generated identifier that can't be physically verified.  When asked a pointed question such as, "what do you think Barnes & Noble uses as the primary key to identify a customer or an order?" he disappears, never to appear in the thread again.  So while he is passionate about his convictions, he has no desire to defend them.
Published Monday, February 08, 2010 1:33 AM by AaronBertrand

Comments

 

Greg Low said:

Hi Aaron,

"When are you ever going to need to reference rows in this table by the OrderDetailID, and not by the OrderID and/or ProductID?"

Easy example: if I tell you that if you buy 12 items, I'll give you two more at a discounted price, how are you going to show that with just an OrderID and a ProductID? This sort of thing happens all the time in real systems.

Regards,

Greg

February 8, 2010 12:41 AM
 

AaronBertrand said:

Greg, there is a row for each OrderID / ProductID combination (as well as a quantity column, etc.).  So unless I misunderstood, the query to answer your question is simply:

SELECT COUNT(*) FROM dbo.OrderDetails WHERE OrderID = <x>;

What additional information would a system-assigned OrderDetailID provide that makes solving the problem you mention easier?  Surely you aren't suggesting that you subtract the difference between the highest OrderDetailID and lowest OrderDetailID and expect to get a reliable answer that indicates the count?

February 8, 2010 12:54 AM
 

AaronBertrand said:

(Or, if you're saying 12 items, not 12 *different* items):

SELECT SUM(Quantity) FROM dbo.OrderDetails WHERE OrderID = <x>;

February 8, 2010 12:59 AM
 

Uri Dimant said:

Hi Aaron

I met some developers who do not know/ doubt what columns should be PK and simple create an IDENTITY property to 'meet' business requirements. When I ask them what is the purpose of they say "to prevent page spilts". (like reading books --keep CI as small a possible and create CI on monotonous increasing column)

I agree with you that we do not have to create blindly an IDENTITY on every table , it really depends on business requirements /work load and many other things...

February 8, 2010 1:30 AM
 

Uri Dimant said:

Hi Aaron

I met some developers who do not know/ doubt what columns should be PK and simple create an IDENTITY property to 'meet' business requirements. When I ask them what is the purpose of they say "to prevent page spilts". (like reading books --keep CI as small a possible and create CI on monotonous increasing column)

I agree with you that we do not have to create blindly an IDENTITY on every table , it really depends on business requirements /work load and many other things...

February 8, 2010 1:30 AM
 

Uri Dimant said:

Hi Aaron

I met some developers who do not know/ doubt what columns should be PK and simple create an IDENTITY property to 'meet' business requirements. When I ask them what is the purpose of they say "to prevent page spilts". (like reading books --keep CI as small a possible and create CI on monotonous increasing column)

I agree with you that we do not have to create blindly an IDENTITY on every table , it really depends on business requirements /work load and many other things...

February 8, 2010 1:30 AM
 

RamiReddy said:

OrderDetails

(

OrderId int,

ProductId int,

Quantity int

)

1. Suppose, I wants to make Payments Per the Ordered Product instead of entire Order..  I might made the payments in installments alos..

in that case, i need a seperate table to store the payment details of that product.. so my billpayments table need a reference to the Orderdetails

table..

BillPayments  Table

(

OrderId int,

ProductId int

)

2. Suppose, assument that I ordered a book in an order with the quantity 50. Now assume,we might not able to send all  the books ordered

in one go due to some stock availablitity.. and we might send in installments again.. so to capture this info we need another table

   ProductsDistribution

   (

OrderId int,

ProductId int,

QuantitySent int

   )

 So, in this 2 situations, i created 4 columns in the 2 tables... if the orderdetails table has an identity column, i could have been reduce columns to 2..

 and my select queries also requires, less join conditions...

 We never sure about our future needs... if someday, some new requirement came, we should be in a position to extend that simply....

 I always go for identity column atleast in business related tables....

February 8, 2010 1:35 AM
 

AaronBertrand said:

Rami, in both of these cases I find it highly unlikely that I could justify another IDENTITY column to save 4 bytes.  (Note that you are only saving *one* column, even if *both* of these situations affect you, because you forgot to count the IDENTITY column that you added to the OrderDetails table, that doesn't exist in my version.  So you've saved 4 bytes in two tables, that are going to be <= number in the details table, and added 4 bytes in the other.)  I think I would *want* both the OrderID and ProductID information in the other tables, since that combination is not going to change, and I am just going to have to do a lookup to figure out the OrderID and/or ProductID anyway.

As for throwing an IDENTITY column on a table to future-proof, you can add an IDENTITY column at any time, so I don't think it's prudent to put one there "just in case."  Why not also put a GUID, CreatedDate, ModifiedDate, CreatedbySPID, ModifiedBySPID columns on every table, just in case business requirements change someday?  Further to that, I can't think of any cases where going back to an existing table and adding an IDENTITY column would suddenly solve some new business problem or requirement.

February 8, 2010 1:54 AM
 

AaronBertrand said:

Uri, I think your submit button got stuck again.

February 8, 2010 1:58 AM
 

Marzena said:

I agree that there are situations, where IDENTITY column is definitely not needed. But the example of Order and OrderDetails tables is IMHO not appropriate. May be this is more a philospical than a technical issue. I have contributed to two ERP systems - first of them did not allow multiple multiple rows per one product in a single order and it was definitely painful for the users.

The other one allowed it and both for the users (as was mentioned - different price, different tax etc.) and and the programmers benefitted from such architecture. Having a single identity column simplifies 90% queries and DB objects related to that table. My experience tells me, that discussion about correct PK to such table is just a waste of time - of course, you could probably find a better architecture from the storage size point of view. But the performance gain is unlikely to be as significant. Simplier SQLs (and moreover standardized approach accross the whole system, where all or almost all tables have the same PK architecture and the programmer does not need to waste time searching for correct ID combination of fields) is the benefit that I prefer. Sparing four or eight bytes of data in a row that total consumes at least 100 bytes is IMHO not a good reason to change the standard.

Also I do not agree that an architect can assume, that there will never be need for such column - our customers always find a way to surprise us by their requests ;-).

On the other hand there is at least one pattern, where special IDENTITY column is definitely not needed but very often is present - M:N relationship binding table with no other information columns.

CREATE TABLE Suppliers (Id IDENTITY...)

CREATE TABLE Products (Id IDENTITY...)

CREATE TABLE SuppliedProducts (

  Id IDENTITY - why???,

  SupplierId,

  ProductId)

February 8, 2010 4:41 AM
 

AaronBertrand said:

All right, so I guess my choice in examples was poor - I consider the IDENTITY column on OrderDetails as useful as the IDENTITY column on Marzena's SuppliedProducts table, but clearly others have found a use for it.  Hopefully the point of my post is not lost due to this detail (perhaps I should re-write it with a better example).

February 8, 2010 7:10 AM
 

Greg Linwood said:

Nice article & I agree, too many db designer throw identities around willy nilly, often without reason.

Why did you choose NONCLUSTERED in your ActivityLog example? I'd have thought that would have been a good candidate for clustering?

February 8, 2010 7:25 AM
 

AaronBertrand said:

Greg, I explained after the example:

"...but the clustered index could still be placed on the StartDate and/or EndDate columns to provide useful searches on date ranges..."

Now, if all you were doing was inserting and updating the rows, and never running any queries to find errors that happened, say, yesterday, or last Tuesday between 3 and 4 PM, and never cleaning up older data based on a point in time, then I would probably agree that the IDENTITY column could be clustered.

I've found that what you lose by choosing a non-clustered index for the lookup for single-row updates, you gain several times over when doing date range queries.  Of course it depends on how you are using the data, but for this use I've found that putting the clustered index on the datetime column is more useful in the long run, and since both are monotonically increasing, it doesn't change the impact of table growth, page contention etc.

February 8, 2010 8:13 AM
 

Brian Tkatch said:

"I have several log tables that do not have a PK or a unique constraint - but they certainly have a clustered index, and they do not have an IDENTITY column."

Well said!

I like this post, it illustrates a good point. I so have an issue with this post, however. The post comes from a position that:

1) IDENTITY should be used unless there is no point in using it.

2) IDENTITYs usefulness is defined by it's use.

I approach it differently. I separate TABLEs (mostly) into two groups:

1) Object

2) Relationship

Object TABLE define objects, such as Customer, Order, or Book. Each record is an instance of that object. Hence, the TABLE is named in the singular--it is not collection of its records--it defines what each record is. This requires a unique way of identifying each individual record. Because each individual record is another object.

The question is, how is this uniqueness defined. The answer is, by it being another record. That is, two records can have exactly the same information, and yet we have two records--two objects. They are unique only because they are separate, not because they are different. In a sense, the unique "id" is the internal identifier for that particular record.

However, the internal id is not exposed. So, another identifier is required. In comes IDENTITY. It is an exposed unique identifier, we not have a unique COLUMN.

That is all true if the data can be the same in two different records. What if they cannot, by data rule definition, be the same. Then there is already a unique identifier in the field. And there would be no reason to expose a new method via IDENTITY. That would be redundant, and mask the true uniqueness of the object.

The other type of TABLE is a relationship TABLE. That is, a TABLE the related two objects in the system, such as Order_Product that related Order and Product. In that case, usually only one instance of any particular join itself is allowed (the user can order each item only once in any given Order, otherwise it's just quantity that goes up), making a natural PK. If this uniqueness is not guaranteed, that TABLE is not a join of those objects, and likely not a relationship TABLE.

There are also logs and lists. But those aren't TABLEs, per se, they just happen to be stored in TABLEs for convenience, so the method used is indeed more based on use.

I'd suggest then, IDENTITY is to be used as a unique identifier of an instance of an object, when no other data rule guarantees uniqueness.

February 8, 2010 9:29 AM
 

mjswart said:

Very cool article Aaron, this query gives a list of unreferenced identity columns:

select t.name, c.name

from sys.tables t

join sys.columns c on c.object_id = t.object_id

left join sys.foreign_key_columns fkc

on fkc.referenced_object_id = c.object_id

and fkc.referenced_column_id = c.column_id

where fkc.constraint_object_id is null

and c.is_identity = 1

order by t.name, c.name

All that's left is to determine whether the application uses (or should use) these columns.

February 8, 2010 9:31 AM
 

Nick Canale said:

I really liked this article and the comments on this article. I'm one of those guys who adds an identity column on every table for the reason that Marzena described, so that developers know that that is the key and doesn't have to waste time searching. I haven't ran into any performance issues by doing this, even if it doesn't always make sense sometimes. :)

IMO you can still have a problem with this approach if 2 rows are exactly the same except their identity primary key. I don't think this should ever realistically happen.

February 8, 2010 10:21 AM
 

AaronBertrand said:

Nick, that's right, I think another thing that tends to get lost is when people put a PK constraint on a surrogate key they forget to put a unique constraint on the column(s) that make up the natural key.  (Well, forget, or just don't think it's necessary.)

February 8, 2010 10:41 AM
 

Paul Nielsen said:

Another pro for using a surrogate keys is taht a database-wide consistent PK enables building database-wide systems (such as audit, temporal updates, data policy checks, etc). I’ve found this to be very powerful when combined with code-gen techniques.

February 8, 2010 1:10 PM
 

AaronBertrand said:

So Paul, you would advocate using an IDENTITY column for code generation purposes even on the relational table that Marzena described in a comment earlier today?

February 8, 2010 1:16 PM
 

Paul Nielsen said:

Hi Aaron,

Yes, for the purposes of adding common utilites/system to the tables. But where there is suitable natural key I would also add a unique constraint/index on that natural key. I think the greatest con to identity columns is that developers assume that the identity col satifies the need for a PK and they fail to add a correct unique constraint whcih allows duplicate rows (different ID but all other data identical.)

to summerize: It's almost as if the surrogate key is there for internal system purposes (FKs and Utilties) and the unique constraint is there for relational table design (entity integrity) purposes.

and, conposite primary keys is in my list of top ten most evil database design and development practices.

February 8, 2010 1:55 PM
 

Aaron Bertrand said:

In my last post in this series , I talked about the common habit of creating an IDENTITY column on every

February 8, 2010 10:02 PM
 

AmosFiveSix said:

We use a large system (NOT designed by our company ;-) That is setup the way your describe with an OrderDetails table with an OrderDetailID. There are different types of products and for each type we need to collect different information when an order is placed for that product. Like guest information for a hotel registration. Those bits are in separate tables so we only add a record when someone orders that type of product, and those are linked by the OrderDetailID.

February 9, 2010 9:39 AM
 

Stephen Horne (Bluedog67) said:

I lean towards using IDENTITY columns as primary keys but not in all cases. Some examples where I think an IDENTITY column as PK is excessive include the following: ZIP Code table, State table (the USPS 2 character code is sufficent for PK), a Fiscal Year table (if like some companies you have a 52 week FY - CISCO), anywhere you have a 1:0/1 relationship (even if main table has IDENTITY PK the supplementary table only needs a non-IDENTITY int column as a PK and is FK back to main table - automatically get the 1:0 and 1:1 behavior. An IDENTITY column on 2nd table is excessive and unnecessary).

I believe strongly that everywhere a developer uses an IDENTITY column as the PK in a table, the Natural Key must be determined and a unique constraint defined on the NK column(s). In all likelihood a NK exists - make it unique either as PK or with unique constraint. If an NK cannot be found then this probably indicates a problem in the design or this fact needs to be well documented. Even simple lookup tables that are just an IDENTITY PK column and a Name/Description - the Name/Description should be uniquely constrained - if not your database is incomplete.

February 9, 2010 9:22 PM
 

Lee Everest, M.S. said:

"When are you ever going to need to reference rows in this table by the OrderDetailID, and not by the OrderID and/or ProductID?"

A: Only when you need to use the order of entry

"Easy example: if I tell you that if you buy 12 items, I'll give you two more at a discounted price, how are you going to show that with just an OrderID and a ProductID?"

A: Good point.  Requires additional information in the record.  Should add a column to note the special order type. Then you add use orderdate and ordertype as the key.

Great topic for discussion, Aaron.  Made a few comments of my own on your great entry... http://www.texastoo.com/post/2010/02/09/IDENTITY-property-and-Primary-Keys.aspx

Thanks,

Lee

February 10, 2010 10:10 AM
 

Mark Stacey said:

I always use A) an ID column that is an Identity and B) an Active column that is an int(not bit? LONG set of reasons. Sometimes it's bit. But by default int)

And on the tablet tables, I have a [Key] column as well.

The reason being, I build Occasionally connected applications : ensuring that I consistently ALWAYS have a column with the same name to uniquely identify every row in every table makes coding my sync framework (yes, I use MS Sync as a backbone) much easier, and in many cases faster as well.

Order details? Well, yes. There too.

As for having multiple products of the same product for the same order....

There is a particular data model that is common, and can lead to ISSUES.

The most common way to implement this is with a quantity.

So here we have an operational system where if someone returns one of a pair of products because it is faulty, we cannot uniquely identify which one it is.  Can be coded round, but it's an instance that needs upfront thought.

March 1, 2010 7:14 AM
 

telcogod said:

I would use identity column on the orderdetail table as the primary key clustered index because it should never become fragmented.  this table should never have deletes and so should always be sequential.  A composite key would not be, and would need lots of maintenance.  And with the online option available only for clustered indexes on tables that do not have a lob, rebuilding the clustered index on the table if it had a varchar(max), etc., would impact 24*7 uptime.

March 1, 2010 7:19 AM
 

stan lewis said:

I think you have missed the most important point here; the use of an identity key removes the proper primary key identity which itself prevents insertion of duplicate key values and naturally organises the table data. of course you can add the real key columns as secondary uniquing key but this is generally not done even by IT professionals and thus duplicate data creeps into these mis analysed tables. then databases get a bad name because if they do not prevent duplicate data then what use are they? In the case given above, an order is naturally a generated row and therefore to use a rising sequential number is easily the best method of creating a unique id. This example is a red herring.

March 1, 2010 10:50 AM
 

Aaron Bertrand said:

Mark, if two chocolate bars can be uniquely identified (and should be), then I would argue that they should each be represented on their own line.  If one has been returned then that seems like information that is stored elsewhere anyway... the original order details should remain intact somewhere.

If two chocolate bars can't be uniquely identified, and one gets returned, then why isn't Quantity -= 1 sufficient?

March 1, 2010 5:35 PM
 

Paul K said:

I often purchase the same book (same order and product) but want to send them to different addresses. Wouldn't supporting that require an OrderDetailID? Maybe not.

I note what Brian said about not needing an Identity column on a many-to-many relationship table. That's a case where I tend to use the Identity and probably don't need it.

Another (too obvious) case where Identity is not needed and we avoid it is when the incoming data has a field that is essentially an Identity column anyway. But what if the developers working on the source system read this column and decide to get rid of the Identity column? Ouch.

Another case where we manage to not use Identity columns is when the table itself is so small, say <= 3 columns. That's when the "extra 4 bytes" starts to sound like a big problem to me.

But our log table's LogID is used in other tables (holds generated SQL strings and it is not 1 to 1) and I tend to search or order by LogID all the time. Can't order by Date because messages might have the same date and I want to be sure of the ordering.

March 1, 2010 5:49 PM
 

Aaron Bertrand said:

Telcogod, if the OrderID is already sequential, then you're not gaining much by adding an identity column.  Now, I've already admitted that my example wasn't the best one.  What about a many-to-many join table?  You'd put a clustered identity on that one too?  Why?  

March 1, 2010 6:13 PM
 

Aaron Bertrand said:

Stan, I agree that is a problem as well.  I am trying to publish this series in small, distinguishable units.  You can treat slapping an identity column on every table separately from picking the wrong pk.  I could treat every single bad habit in one post but it wouldn't be very consumable... and it would take so long to write that by the time I got to the end my opinions about the beginning may have changed.  :-/

March 1, 2010 6:18 PM
 

Brian Tkatch said:

"I often purchase the same book (same order and product) but want to send them to different addresses. Wouldn't supporting that require an OrderDetailID? Maybe not."

Paul, good one! I love it when someone shows reality is not as we think or want. :)

It doesn't matter which is shipped to which address. As long as one goes here and one goes there. So, add ship-to-address to the order detail-table.

Ship-to: Id (PK), Address...

Order: Id (PK), Customer

Order_Detail: Order, Product, Quantity, Ship-To

PK(Order, Product)

To enforce the ship-to matches the customer, just pull the customer_id down:

Ship-to: Id (PK), Address...

Order: Id (PK), Customer

Order_Detail: Order(PK), Customer(PK), Product, Quantity, Ship-To

PK(Order, Product)

FK(Order, Customer) to Order

FK(Customer, Ship-To) to Ship-to.

March 2, 2010 8:51 AM
 

AjarnMark said:

Aaron, I particularly LIKE your example with the Order Detail table because it made me think.  I probably would have put the IDENTITY on there in the past, but now I'll have to give it more thought.  One thing that might influence my decision is whether the ProductID really uniquely identifies a particular physical product, or is there more?  For example, is a sweater that comes in blue and red two different product IDs or is it one ID with additional defining information (e.g. color). Then you have to consider the multiple columns to define uniqueness issue and make an informed decision on what approach to take.  And therefore your post was successful because it made me think instead of respond out of habit.

Regarding your first Log table example, again it makes me think.  I probably would have put the IDENTITY on there, but you make a convincing argument that it is not necessary.  Although I have to admit to a somewhat visceral reaction to the idea of not having a PK on a table, even though you still have a CI.

March 5, 2010 1:20 PM
 

AaronBertrand said:

>>

Although I have to admit to a somewhat visceral reaction to the idea of not having a PK on a table, even though you still have a CI.

>>

I am sure in some cases there would be a good candidate key, though I have seen several cases where there is none.  Consider an ad serving platform where an ad can be served from multiple servers at the same time, and in a particular log there is nothing to uniquely identify them.  You don't have the ability to pre-aggregate the count (since it is coming from multiple servers) so you record each event individually.  I don't think that the requirement of a PK is so overwhelming that throwing an IDENTITY PK solves anything.  It is basically having a PK for the sake of being able to say "this table has a PK."  Now how do you answer the question, "what do you *use* the PK for?"  YMMV.

March 5, 2010 1:30 PM
 

AjarnMark said:

I agree Aaron...another area for me to do a little more thinking and introspection.  It's good to occasionally ask ourselves WHY we believe the things we believe.  Thanks for the prodding.

March 5, 2010 6:23 PM
 

Aaron Bertrand said:

Okay, I'm following the lead of Joe Webb ( blog | twitter ), who recently posted " My Most Popular Posts

January 28, 2011 4:09 PM
 

Hiram Green said:

I hope I'm not beating a dead horse, but...

The order detail line is, as you admit, probably not the best example for a couple of reasons:

1) It's not uncommon for a sophisticated ERP system to have another table with a one-to-many relationship to the order lines. This is because there may be more than one ship location, request date, etc. for the item.

2) The order of the lines in the order is information that should be kept. Having dealt with systems that subsequently display orders in an order that differs from the entry order (I really didn't mean to make a pun), I can tell you about some unhappy users.

That being said, I still wouldn't use a surrogate key, but then, I'm old-fashioned. I would have a LineOrder column, probably smallint, and the only candidate key would be (OrderID, LineOrder).  Yeah, the OrderLineSchedule table will have two more bytes per row, and you'll have to do a bit more work than just using the IDENTITY property, but you'll also have data that is more transparent.

I recently worked with a student system with IDENTITY integer columns in EVERY table. (And, of course, they were PRIMARY KEYs and the natural keys had no UNIQUE constraints...) My jaw dropped when I saw all the many-to-many junction tables with three int columns rather than two.

March 3, 2011 10:29 AM
 

Craig said:

Sorry for the VERY late comment :)  Very interesting article!

I tend to use identity columns on all tables.  /slap wrists

The main reason I do so, is that from a front-end point of view on an intranet MIS it allows me to store the created OrderdetailsID as a data-attribute against a field for use in AJAX/JSON requests.  I  then use that id (or absence of) for the CRUD statements, and passing the one ID means I can modularise the code a bit better.

Could I do this with storing/passing all the individual keys?  Of course, but the single-key does make life easier and as these are not  systems with millions of records the overhead seems minimal.

That and, rightly or wrongly, I am not a huge fan of composite PKs although I appreciate they have their time and place.

April 13, 2012 10:51 AM
 

DF said:

This ultimately amounts to a religious debate.  You have your opinion.  DB design is an art and not a science.  Spent way too much time debating this. Let's talk about which paint covers the best next.  It will be just as useful.

September 6, 2012 11:16 AM
 

AaronBertrand said:

@DF that's what a blog is about, opinions, right?

September 6, 2012 3:06 PM
 

Manish said:

Sometimes for replication purpose you need to have a unique key when natural isn't possible then you can use identity column.

October 8, 2012 1:09 AM
 

Vlad said:

I thought even if this is an old thread, I put in my 2 cents worth :-).

I mostly agree with the rules and conditions for IDENTITY column discussed here.

However (and this is a big one IMHO)

if you ever plan or thinking to use ORM or MicroORM for your projects

such as Dapper, Massive and PetaPoco or other like that (and this is becoming a trend I think to do so).

it is absolutely PARAMOUNT for your DB to have an IDENTITY column for every table you will access (and that is essentially every table)

as most Micro ORMs require them or at very least work better when you have it. (at least I know PetaPoco does require IDENTITY on every table you plan to use SAVE/UPDATE/DELETE on for sure.)

February 1, 2013 9:24 AM
 

AaronBertrand said:

@Vlad I find that difficult to believe. You *have* to have an IDENTITY column, or you *have* to have a primary key?

February 1, 2013 9:39 AM
 

Vlad said:

@AaronBertrand  kind of both, if you want to use some of the automatic futures of PetaPoco.

the ORM provides properties

IsNew() -- to check if the entety you are lookeing at is newly created(not in DB yet) or not (this checks the IDENTITY PK property of the for null/Zero(0) value)

AND Save() -- uses the IsNew() when deciding to insert or update the ntity.

to all POCOs it creates.

so in your code you would just ITERATE through a list of objects and call MyObject.Save() and  PetaPoco would figure out if it needs to insert or update

however it assumes that PrimaryKey is populated by DB(as in IDENTITY auto increment/generated)

it does provide INSERT()/UPDATE()/DELETE()  as well

so you can use non-identity PK but than you can not use the above

functionality abstraction methods and instead you need to structure your code in such a way that if you create new record in the application you do INSERT() right there and then.

I have a couple of apps that kind of rely on above abstraction to make it more flexible and to delay the DB calls to the very last minute.

but it's just me :-)

if you need to update you call UPDATE()

February 1, 2013 10:52 AM
 

Steve F said:

Just read the first comment by Greg and he's right on point and I believe you may have misunderstood him. An orderDetails table like:

OrderDetailID , OrderID, Product , Qty , PricePer , LineItemTotal

1 , 1 , Apple , 12 , $1.00 , $12.00

2 , 1 , Apple , 2 , $0.00 , $0.00

Seems pretty common sense, need to check if your order got it's free apples - easy to do. Removing the IDENTITY column in this instance obviously makes a PK of OrderID,Product impossible. I think the gist is that not immediately seeing the usefulness of an IDENTITY column doesn't mean it isn't useful. What about plenty of developer UI constructs that have constructs like DisplayMember / DataMember (think listbox as an example) that are looking for a single value that uniquely identifies something? Adding levels of headaches and annoyances doesn't seem like a good enough reason to save 4 bytes in many cases.

June 12, 2013 12:58 PM
 

Steve F said:

Good blog post though. If your general point is that people should occasionally think about what is considered standard practice and evaluate if it's still relevant - sure.

June 12, 2013 1:17 PM
 

AaronBertrand said:

Steve that was exactly the point. Note the title says "putting an IDENTITY column on every table" not "any table" :-) I use IDENTITY columns all the time, but that doesn't mean they're appropriate for every single table. OrderDetails was perhaps a bad (counter-)example, depending on the actual data model and business rules. A better, simpler example would be Customers, Addresses, and CustomerAddresses(CustomerID, AddressID). Assuming that an address can only ever be associated with the same customer once, there is very little reason to add an additional column just to serve as a surrogate key.

My point was that some people just put an IDENTITY column on every single table as a matter of course, whether it needs one or not, and also always make it the clustered primary key. Four bytes isn't much in the grand scheme of things, but as your data grows, that impact can be significant - especially when you're maintaining that value that you don't need in every single index, affecting every single DML statement...

June 12, 2013 1:56 PM
 

Steve F said:

I think the heart of the problem is your last sentence in paragraph 1 - 'assuming that an address can only ever be associated with the same customer once...'. Assumptions make sense, until they don't. Just like the assumption that a product shouldn't be in an order more than once makes plenty of sense, until some business reason/factor/etc... changes that assumption.

A logical extension of that CustomerAddresses could well be:

CustomerID, AddressID, StartDate, EndDate

1, A, Winter, Spring

1, B, Summer, Fall

To help determine what address to use at what time of year (seems like a good use of a CustomerAddress table). But then that person travels more than expected to become:

1, A, Winter, Winter

1, B, Spring, Spring

1, A, Summer, Summer

1, B, Fall, Fall

and we've violated the PK (Customer,Address). So you could argue, well that should be put into a different table. Sure, makes plenty of sense, but now you're wasting tons of space because you have essentially a multi-column foreign key. You could have saved a ton of space doing:

CustomerAddress (CustAddID (PK), Customer, Address)

1, 1, A

2, 1, B

CustomerAddressSeason (CustAddSeasonID (PK), CustAddID, StartDate, EndDate)

1, 1, Winter, Winter

2, 2, Spring, Spring

3, 1, Summer, Summer

4, 2, Fall, Fall

So by not having an IDENTITY column originally on CustomerAddress, you're really assuming that it'll never be used as a foreign key somewhere which is just one of those things that makes sense until it doesn't.

June 12, 2013 3:12 PM
 

AaronBertrand said:

Steve, I meant *assuming that is your well-defined business rule* not * assuming you might think you know what your business rules are.* Just because you can envision some possible extension to a data model years in the future does not mean you need to put an IDENTITY column on every single table now, just in case. I'm not going to argue with you about this because I think you're bypassing the thrust of my argument and filibustering me with technicalities. If you want to put IDENTITY columns on every single table, go ahead. I'll do it when it makes sense, and I'll call it out when it doesn't.

June 12, 2013 3:18 PM
 

Steve F said:

Perhaps another way to put it is, "If there is a chance anyone in the future may ever want to use this as a foreign key in any way, it would be good to use an IDENTITY column". Because we can't reliably predict the future, it *OFTEN* makes plenty of sense to use something like an IDENTITY column. I definitely agree with you - think about your columns and data structures - it's a great habit!

June 12, 2013 3:19 PM
 

Steve F said:

It really is a great blog and post, I'm not trying to just troll or attack. I just can't see the (seemingly) minor benefits outweighing liabilities. Like you said, edge cases either way always exist.

June 12, 2013 3:22 PM
 

Steve F said:

"I'm a Vulcan, we embrace technicality" :-)

June 12, 2013 3:44 PM
 

Bill S. said:

I found this blog post as I was pondering this very issue in a database I am responsible for today.  I am working on an updated schema with many changes and therefore data migration from the old database to the new database is something I'm concerned about.  Since I used Identity columns as the PK for most of the tables it is making the migration difficult.  My thought is that instead of using Identity columns why not just use INTs or BIGIINTs and just "assign" the rows for my lookup tables their own values.  I still have the natural keys but since these tables are used heavily in other tables I want to use the assigned ID's as foreign keys.

As an added bonus, since I know what the ID of any given row is for my lookup tables I can easily compare data across databases to ensure all the supporting attributes are equal.

Anyway, great article and something that deserved a good thorough discussion, thanks again!

August 6, 2013 2:50 PM
 

Darshana said:

Lets say we have following tables

X (X_ID PK, ...)

Y (Y_ID PK, ...)

for table XY (joint table of X,Y)

option 1 : XY (X_ID, Y_ID)

option 2 : XY (XY_ID, X_ID, Y_ID)

if the join table(XY which joins table X and Y) is having a relationship with another table(Z). then do we still need this Z table to contain both X_ID, Y_ID (with option 1)?

November 26, 2013 3:23 AM
 

Tedicated Server said:

This is the strangest Blog I've ever read.

To blindly suggest that not every table needs an Identity column is truly bizarre.

Unless I'm being fed a key form another process, table, or application, that I am to use to insert a record that already contains a GUID, or an Identity key and there should not be duplicate entries in my table for that record, then I will give that table an Identity column.

Especially in the scenario you described above. I've written many applications with a structure as described above. From mail order catalog Order Call Center/Inventory management/Shipping application to a full featured CLEC phone provisioning/slash billing/customer service solution for prepaid telephone company.

A line item transaction MUST always have a unique Identity field.

You get 8 million invoices over the course of ten years, you can Imagine how many records are in the Invoice Line Item details table. While for reporting directly on the line item data for aggregates and sales analyst where individual record IDs would be irrelevant of the data being performed. One should always be capable of producing a record, with a proper Identifying Identification of that record with a concrete reference is a must.

From the UI, especially in an MVC deign, and well even before that got popular, but especially so for MVC. When you load a detail screen and you have edit, delete and other link buttons. You want to reference that record by an ID column.

Also in programming creating a class to represent your Invoice domain object, you might need to create a generic list of your Invoice, for processing orders, or generating an online invoice using GDI and returning a bitmap stream or any other image format, my point isn't to argue proper web streaming of dynamic image content.

Also in order processing, lets say there's some process that needs to find a certain criteria in the line item table, I'd store that in a table that has (guess what?) an identity column(, it could be some physical process table, but I'll refer to it as the temp table), and store the lineitemID of my line item table as well.

Then join that temp table on the line item table by the lineitemID in the temp table.

Why the identity again? What if I need to sort, or display the results from the joined query results back to a user or into another process, that may determine, that one of those line items placed into that temp queue should be exempt from the next process.

Again delete from my temp table where the ID = the value returned from the process. This is the beauty of stored procedures after all.

Every time I've worked in a Shop and some know it all SQL extremist, asks me... "What do you need an Identity column on this table for?"

and wins the argument, because they are the ones who ultimately get the last say on these things.

Always gives me great pleasure when I get to eventually tell them...

"That's why I wanted the Identity column for."

Saying not every table needs an identity column, and then using the example you provided is just silly. You are right, but the way it was presented would be just as ludicrous as me saying.

Every table MUST have an Identity column.

Certainly not, as long as there is column that can uniquely identify a record with out having two fields in every criteria you search and join on a certain table.

I even put an identity on my Type and Status tables, that may only have a few records like... "Open, Closed" and "Residential, Customer". Because it's just more efficient to store and search on numbers rather than text.

Foreign constraints on every Identity column, might not be needed. In fact they may not even join to any physical database in the current database at all. But that doesn't mean another process, or report wont have a real need for that rows identity column.

December 19, 2013 12:40 PM
 

AaronBertrand said:

Blah, blah, blah.

You didn't like my specific example? Ok. You're not capable of translating that to any other case where people blindly add an IDENTITY column for no reason? Ok. *Shrug*

I guess you missed the part where, back in February of 2010, I left a comment starting with "All right, so I guess my choice in examples was poor" or the obvious implication that you need to think about these things at a higher level. Not everybody is dealing with OrderDetails, and not everybody that deals with OrderDetails have the same types of requirements as those you mention and those that were mentioned before you came along. 

How about Users <-> Permissions? If a table has PK = UserID, and another table has a PK of PermissionID, does the table that shows the relationship really need an IDENTITY column in addition to the PK of (UserID, PermissionID)? Why? Pretend I used that example, and stop focusing on the one you don't like. I'm not going to go back and re-write the article to use an example you can relate to better, and I don't think it's necessary, since most people get my point without having to write me a thesis on how bizarre my idea is...

December 19, 2013 1:02 PM
 

AaronBertrand said:

Darshana, can you give more concrete example? What type of thing would have a relationship with a relationship?

December 19, 2013 1:24 PM
 

AaronBertrand said:

Steve, when you come across a scenario where you later need a foreign key to point to that relationship (as with Darshana, I'd love to understand specifics here; abstractly this concept doesn't make much relational sense to me), you can always add the IDENTITY column then. Until you need it, all it is doing is taking up space and providing 0 benefit. In fact it is probably causing harm if all identity columns are created with de facto clustered indexes if the data is not used in such a way that benefits from being ordered in the order of row creation.

December 19, 2013 1:26 PM
 

Tedicated Server said:

First off, let me back track now and say nice Blog. It's refreshing to see that someone is actually monitoring the comment entries.

And you're right, "blah blah blah" indeed.

Hey I've found there's as many ways to program or design a database table structure as there is people to program them. Regardless what I disagreed on what you said. I can guarantee you that I will question every Identity column going forward that I normally wouldn't have gave it a second thought.

"How about Users <-> Permissions? If a table has PK = UserID, and another table has a PK of PermissionID"

I put an Identity column on ever table that I will use to build a matrix mapping identity columns, especially on the grandchildren tables.

Try to build an effective Rubric or Syllabus with them.

"I guess you missed the part where, back in February of 2010, I left a comment starting with "All right, so I guess my choice in examples was poor" or the obvious implication that you need to think about these things at a higher level."

Yes I did miss that post, and plan on giving your posts a thorough reading. I may not have to agree with everything you say, but I can still respect what you say, and might even read something.

Sorry if I came across harsh with my opening comment. I can assure I've read more asinine stuff than "Identity columns are not necessary on all tables."

December 19, 2013 8:16 PM
 

Tedicated Server said:

"Try to build an effective Rubric or Syllabus with them. "

Without them, I mean.

December 19, 2013 8:21 PM
 

ToddP said:

Aaron, nice article, and I agree with the point you are making that everyone should think before creating a knee-jerk IDENTITY column -- but I do think they are more useful than not.

That said, you made a small mention of "there are cases where cascading updates do not exactly work as advertised", that has intrigued me.  I am running into a problem with 2008 and a Foreign Key with Cascade Update set for it, and it's failing.  The parent table's PK is a natural key (two character-based columns), and everything appears to be configured properly in the FK constraint, but when one of the parent record's column values is changed (one of the two PK columns), I am getting an error that the FK constraint is being violated in the child table, and the whole thing gets rolled back.

Can you elaborate where you've seen cascading updates not work as you'd expect?

Thanks!

August 25, 2014 2:49 PM
 

AaronBertrand said:

Sorry Todd, I didn't see that first-hand, I either read it somewhere or had it explained to me at a session (I've always preferred my own logic over the cascade operations, which have too many limitations). Unfortunately I did not take down the specifics. Could it be this issue?

http://connect.microsoft.com/SQLServer/feedback/details/304934/

August 25, 2014 3:06 PM
 

ToddP said:

Interesting, but no, this isn't the issue I'm having. Thanks for following up :-)

August 25, 2014 3:41 PM
 

AaronBertrand said:

@Todd, I'd be happy to look into the problem (as my time permits) if you can provide a repro (maybe on http://sqlfiddle.com/)?

August 25, 2014 3:50 PM
 

Paul said:

I couldn't disagree with you more. Put an identity field in every table. There are few reasons not to have it and that comes down to weaknesses in the programming language available. It is there to IDENTIFY a UNIQUE record. Data can change. This field WILL NOT CHANGE. It doesn't require an index but often you should have one. The most awesome thing about having your items done this way. Is you change things. When you lose your data integrity, you only have known unique numbers to fall back on.

Hmmm why would accounting ever want to change your customers names? Or all their order numbers? Or why would someone put in a new update and not fully test it? If you have an identity, you can backup your table. Then if your data gets hosed, you know where it all went. Put them in every column and while you're at it, add creationdate, lastupdatetime, creationuserid, lastupdateuserid. Then when you troubleshoot things, it will be much easier. If you are worried about size of table, get bigger drives. I've spent way too much time troubleshooting databases that were written by people that can't program nor troubleshoot. Make templates of your table creation and try to keep it consistent. The same with your procs. Put in a security proc and all of them call it (if that is your design), put try catch statements in all your procs. And have your own error handling procs to call. Now I will say this, you shouldn't always have a primary key on your identity column. Nor should it always be involved in a clustered index. However, you can include the identity column in your clustered index to keep it unique, without having to depend on other columns that might not be unique.

October 25, 2014 3:18 PM
 

Rollin Shultz said:

I gather by your article example on orders, that subforms using the parent form's ID would be a good practice to follow, and as a novice studying the principles, I think that is a good practice to adopt. Besides every expert article I have read on the subject so far declares the data should be represented with as few columns as possible and I submit that advice includes ID columns. So I guess a good motto would be: "If you need it (any column) use it, otherwise don't include it". I will make this my design motto.

Regarding the example given about discount for quantity, I would submit that this should not be handled in the data tier at all. I prefer to use the DB for only necessary data and allow the front end software handle such special and likely temporary things. Though one might make a case that there may be several front ends to adjust with these special requests in a large company,I think that case is weak since those special conditions may be locality specific. Besides which, that is why we have install and update scripts for front ends.

Thanks for your article and sharing your advice.

October 28, 2014 2:48 PM
 

ComputerRoger said:

This is so fundamental, I must respond...

Why should you put an Identity column on every table?  

1.A.Beginner.Fundamental  It separates structure from content!

2.A.Beginner.Fundamental  It provides future separation.  Who of us can predict future needs with 100% accuracy?  I am humble enough that I can say my future predictions are not that accurate.

3.A.Beginner.Fundamental  Application layer handling can identify child records for Updates easily.

4.A.Beginner.Conclusion  Not adding the identity column is incompetent and lazy.

November 12, 2014 7:40 PM
 

AaronBertrand said:

Wow, what compelling arguments. You might lose data integrity and need to fall back on a "known" identity value. Not having an identity column is incompetent and lazy. *sigh*

November 20, 2014 4:13 PM
 

passerby said:

Kind of hoping people read this content read my comment :)

What you're talking about isn't bad practice or anything close to that either it is putting index or not. it was simply unnecessary performance concern, it is like signed variable vs unsigned variable sure you get double number if you don't ever use sign on that variable but it always easier just use bigger number like int64 instead of uint it was so small that it doesn't justify future error that may occur if that variable calculated against signed variable

so put the damn identity it's god damn cheap, putting it latter always harder.

January 19, 2015 2:28 AM
 

JoeyBalls said:

people seem to keep commenting on this blog, so maybe you should finally consider to update your example with a real good argument not to have an IDENTITY. :)

February 11, 2015 4:01 PM
 

AaronBertrand said:

@Joey I'm kind of over it. Just because some people don't agree that OrderDetails doesn't need an IDENTITY column does not mean that it does. You can construct dozens of examples where an IDENTITY column adds no value, and it doesn't change my argument and doesn't require me to invest time in re-writing the post. Consider a very simple schema:

   CREATE TABLE dbo.DormRooms(DormRoomID INT PK, ...);

   CREATE TABLE dbo.Students(StudentID INT PK, ...);

Now you need a junction table of some kind to indicate which students live in which dorm room, right? So you have something like this:

   CREATE TABLE dbo.DormRoomStudents(DormRoomID INT FK, StudentID FK, ..., PRIMARY KEY (DormRoomID, StudentID));

Explain to me what an IDENTITY column on that table helps you do.

Another example: a calendar table.

   CREATE TABLE dbo.Calendar([Date] DATE PRIMARY KEY, ...);

What benefit would an incrementing integer add here, especially since this is the type of table you just build once?

I could go on and on, but the specific examples are not really all that relevant. My advice is pretty simple: don't throw IDENTITY columns on every single table. Use them only where they are useful.

February 11, 2015 4:18 PM
 

Simone said:

Hi Aaron, we've a db with around 800 tables and 4000 between procedures and functions with a large use of composite keys. The question is: Would you change, piece after piece, tables and all the procedures/functions to use surrogate keys instead of composite keys dropping "old" columns? Thanks for any suggestions.

The reason of this are: slowness in retrieving data, slowness in rebuilding indexes. But i'm not sure if these are a good reasons...

March 13, 2015 12:46 PM
 

AaronBertrand said:

@Simone I really don't know. Have you proven that slowness is due to composite keys?

March 25, 2015 3:48 PM
 

S.Stone said:

Very good insight. This thread is still going strong after 5 long years!

Given all the knowns, knowns unknowns and unknown unknowns and timelines given to all kinds of different projects and all of the continuous re-scoping going on nowadays due to the every-so-popular "Agile" methodology, there are times when it's appropriate to create identity columns and there are times when it's not appropriate. Personally, I choose to default to primary keys then select the natural keys as requirements are solidified (hard to do nowadays with all of this Agile going on) and as programming languages are chosen.

Using datetime columns as clustered indexes is not the best idea, as you have no way to determine that the values are always going to be unique. For example, the ep could choose not to re-evaluate a getdate(), even though in t-sql you may be providing two separate write statements. Also, adjacent statements could execute so fast that getdate() returns the exact same time to the millisecond.

Again, very good insight and I enjoy re-reading the article every once and a while to re-enforce the notion to others that identity columns are not needed in every situation.

April 22, 2015 9:58 AM
 

AaronBertrand said:

S.Stone note that choosing a datetime column as a clustered index is *NOT* the same as choosing a datetime column as the primary key (though a lot of people do correlate the two because SQL Server's default behavior is to implement a primary key as a clustered index).

A primary key on a datetime column makes no sense for exactly the reason you specified: there can be collisions. And there is no reason to try and prevent collisions, because you should never be identifying an entity by the time it was created - we have concepts like alternate/candidate keys, and the ability to have multiple unique constraints separate from the primary key, so that row identification does not have to be jammed together with physical implementation.

However, a clustered index on a datetime column can be quite useful if data is always coming in in sequential order over time and especially if most subsequent queries use date ranges to limit the result or identify rows to affect. In these cases paying for the uniqueifier bytes to cover the occasional collision is well worth it IMHO.

April 22, 2015 10:23 AM
 

ConsistencyIsKey said:

This is why every table should have an ID, it should be called ID and it should be unique in the table. Databases don't exist in a vacuum. They are part of larger systems with user interfaces, API's, and business logic.

The value of consistently communicating that to retrieve an individual row from a table requires passing the ID of the row is much greater than the cost of the 4 bytes to store it.

When developers have to repeatedly look at the database definition to know which field -- or fields (gasp) -- uniquely identify a row and what the names of those fields are so they can create their URLs, parameter names, or SQL statements, costs go up exponentially.

Therefore, every table should have an ID field that is unique. If you need additional uniqueness, create a unique constraint on the fields that need be, like in your example of the order details with a product id and an order id. That's what unique constraints are for and it lets the database maintain the business rule there while still simplifying life for everyone who isn't a database administrator.

It takes more than DBAs to build a quality system for everyone who will use it -- developers, front end designers, and users alike.

June 16, 2015 2:47 PM
 

AaronBertrand said:

ConsistencyIsKey, an Identity ID is not the only way you should be able to uniquely identify a row. Let's say you have a table called Students and a table called Classes. A junction table would serve to represent which students are taking which classes. Are you insisting that in order to identify the classes that student 25 is taking, I need to know all of the ID values that the system arbitrarily assigned to those 8 classes? Why? If I need to retrieve the row that tells me the grade they got in Econ 101, I say WHERE StudentID = 25 AND ClassID = 15. Not WHERE StudentClassID = 4087.

June 22, 2015 2:19 PM
 

Darek Danielewski said:

Interesting perspective ... One reason to almost always add an identity column might be the use Entity Framework as consumer. For tracking purposes, every table is required to have one, I think.

June 26, 2015 11:36 AM
 

AaronBertrand said:

@Darek - every table needs an IDENTITY column, or every table needs a primary key? While often used together, they are not the same concept.

June 26, 2015 11:51 AM
 

Ryan Cerney said:

@Darek - EF doesn't care about identities at all, just PK's.  In the case where you are using Views in your data models, you have to use other techniques to get around the PK requirement such as isnull and nullif.  

July 14, 2015 11:35 AM
 

ConsistencyIsKey said:

"Why? If I need to retrieve the row that tells me the grade they got in Econ 101, I say WHERE StudentID = 25 AND ClassID = 15. Not WHERE StudentClassID = 4087."

Because not everyone uses SSMS and SQL statements to access the data.

JSON and XML don't say what the unique primary keys are for the data in their package, but if everyone using your system knows that every package of XML or JSON or anything else that doesn't include documentation of the data model, does include an ID field and that it will be unique for all the items in the package, or at least for each item of the same type, then life becomes simpler and more manageable for everyone. Especially compared to having to study the data model to learn the exception to that rule. Customized links to download grades, update grades vs other objects that have an ID, etc.

In the end, you get more rapid development, more clear communication, and an overall simpler product design -- pretty good price for 4 bytes a row.

This from a full stack developer, not just a dba who read the orange database theory book from the 1970's. I've done the whole process multiple times and seen how inferior systems that use compound primary keys are to those with a unique ID field on every table.

You're talking THEORY and I'm talking everyday practicality.

July 14, 2015 7:07 PM
 

AaronBertrand said:

I am not talking theory, I never read any orange database theory books from the 1970s, and I am speaking from both perspectives.

Your "unique ID field" - which is a meaningless, arbitrary, surrogate value, does nothing to validate that the row you pulled is the row you actually wanted, nor does it make your XML or JSON gobbledygook any more cohesive.

Note that I did not say don't use IDENTITY columns or other surrogates on ANY table, I said don't just blindly slap them on EVERY table. The benefits of having them, where they make sense, make sense. You don't need to convince anybody of that. What you haven't convinced me of, for one, is that any of this hand-waving about XML and JSON really does have any grounds in everyday practicality. Show me how your XML / JSON looks for a StudentClasses table where it requires an IDENTITY column and I'll show you how it doesn't.

July 14, 2015 7:36 PM
 

ConsistencyIsKey said:

I think it's better if I stop suffering fools...

July 16, 2015 5:35 PM
 

AaronBertrand said:

Yes, indeed, I don't agree with you, therefore I'm a fool. However one who resorts to name-calling when their arguments aren't convincing enough...

July 17, 2015 8:55 AM
 

Zeeshan said:

I am Agree with @AaronBertrand

August 10, 2015 9:46 AM
 

Chad said:

I find it much easier to work with a single field PK. Compound keys make things like building a rest API (or adding features later on) much more verbose and messy.

I've struggled with this problem for a while, but eventually I just found it simpler to create an Identity every time and worry about storage or performance if I know something might have issues.

September 14, 2015 12:26 PM
 

AaronBertrand said:

But Chad, not every table *needs* a single-column PK at all. Just because you *like* them doesn't make them necessary.

September 14, 2015 12:30 PM
 

Qt said:

Very interesting article and you make a lot of sense ... in DB world. You save space,  create unique and/or clustered index on "natural" key etc. Life is god. But when you think about how to write API for your application and interact with it. It becomes tricky. Say you need to remove one product from order, but you can do that only with two keys OrderId and ProductId. At the same time some ORM tool work best with single key, REST API pattern expects single key, your default routing scheme {controller}/{id} does not cut it, some javascript library you use to display table expects single record id key etc. Also if your natural primary key is datetime you can't easily send it with javascript to both C# and Java backends as text in JSON, because default parsers expects different datetime text representations, so now you need to send your datetime object as unix timestamp, but display to user as text. All of this headache is shifted to backend/frontend code, just to keep DB clean and pure. In the end answer to any question is always "it depends". But you need to think very hard before wondering far away from official or unofficial standards, they appeared for a reason.

September 18, 2015 3:39 AM
 

Fabricio Lima said:

Great theme...

I was searching for this kind of discussion...

One thing that I also consider important to analise is the number of nonclustered index of the table.

If I use a datetime colum, the space is 8 bytes for every index ...

If I use a composite column as clustered index, this columns get in the nonclustered index.

With the identity colum, in some cases, we can reduce the size of the table. Less insert and delete overhead.

What do you think about that?

September 30, 2015 9:05 PM
 

Pete From Southampton said:

Hiya. Recently acquired reader of your blog. Going to add my pennies here.

As much as you are right that your original example was not the best example, the thought process behind what you say is sound. I can think of many examples where IDENTITY columns are a waste of space (such as addresses tied to customers) where a composite PK and appropriate CI can be used. So I'll be taking what you said here on board.

I guess as a footnote, I will say this to the people who are all parroting the same things...

"If great minds think alike, then some minds are not thinking."

October 22, 2015 10:13 AM
 

Ray said:

Aaron - I think most of the cases we need an identity column for object tables. We may NOT need an identity column for in RELATIONSHIP tables (the table only stores foreign key pairs).

On my opinion I will still suggest adding an identity column in RELATIONSHIP table for easier front-end implementation. Imagine a normal practise that you show a list of relationship and you would like to delete one of them. Using one identity key is much more easy for implementation.

December 21, 2015 12:27 AM
 

AaronBertrand said:

@Ray I'm not sure what's difficult about locating a row based on two individual key values. So instead you want to add an additional column, and have both a primary key (surrogate) and a candidate key / unique constraint on the two *real* columns, just to prevent having to add an AND to the WHERE clause?

December 21, 2015 4:10 PM
 

Tenho said:

Hi Arron,

I think I need to give some thoughts:

1. Both natural key (business key) and identity/surrogate keys can co-exist in a table. Noone can say which one is better until he knows:

- How the developer is using that in their system/What index he's going to create: How could you tell performance of JOINs are affected if you haven't known what index is being used? Also remember for loading, I believe "where OrderDetailID=1" can always be made faster than "where OrderID=1 and ProductID=1"

- The tools he uses: I had no choice but use a DAL library which only supports tables with an identity column so the library needs 1 line to load an object by identity while it needs 3 long lines of code to load an object by its natural key. Of course each person may have their own concern. As a developer I care about the difference between 1 short line and 3 long lines of code, while as a "consultant" you are worried about a few extra bytes of storage being used for each row (this is not a real issue for me)

- The business requirement: some times you don't want to delete the data but rather mark them as deleted, or mark them as being valid for a period of time, in this case you can't simply use the combined key of [OrderID,ProductID]. It can now be [OrderID,ProductID,EffectiveFrom] or [OrderID,ProductID,EffectiveFrom,DeletedTime]. Imagine if you want to store orders for many customers in the same table then key would be [CustomerID,OrderID,ProductID,IsDeleted] (where OrderID and ProductID can be duplicated). In these situations, an identity column will save your life

2. Identity is natively supported by all major relational database systems, you have native functions to use with identity column, you can manage items to the row level as each row now has its own ID and is not updatable in a normal way. Have a look at these 2 rows, they are the same if you don't use an identity column:

[OrderDetailID=1, Order=1, ProductID=1], and

[OrderDetailID=2, Order=1, ProductID=1]

Using identity in this case I can tell if someone has deleted a row and then inserted it back again with the same values.

January 1, 2016 12:09 AM
 

AaronBertrand said:

@Tenho Just curious, why would it ever matter if someone has deleted a row and then inserted it back again with the same values? This is actually how an update works under the covers, and I suspect it's how updates can work in some ORMs and other applications as well.

January 2, 2016 8:15 PM
 

David Marrero said:

I teach DBMS and tell students to ALWAYS use an identity key on EVERY table. Why not? What is the down side? It makes a more consistent design and every table is handled the same way. That allows a much higher degree of automation and the consistency carries through the XML and JSON when exchanging data with other layers. Why do I want to think about how to structure a primary key when there is already a simple, consistent, reliable, easy-to-understand  method to do it?

To introduce composite keys means if I want to reference a single record on the UI side, I have to know which columns are included in the key, etc. Why? This complicates all aspects of retrieving that record. there is not benefit. Always being able to use a single column (first column, named with ID suffix) makes the entire process easy. I can always reference any record in the system with a single value.

If there is a special benefit to using a composite key, through all layers of the application, it escapes me.

February 5, 2016 9:02 AM
 

AaronBertrand said:

@David That totally reminds me of the driving instructors in Massachusetts teaching their kids to always drive in the middle lane. ALWAYS. That never leads to any problems...

I don't think there is any benefit to a special key, and I don't think there needs to be one, either. You don't need automation and XML and JSON and key references for every table, either. What special benefit does an identity column to a junction table between Students and Classes? It's a simple, two-column table, one column points at Students, and the other points at Classes. Adding an IDENTITY column just introduces more storage space and has zero value because most of those tables - when reconfiguring which classes a student is attending - you wipe all of the student's rows and replace them. Meaning new IDENTITY values. Where's your consistency?

February 5, 2016 9:26 AM
 

Peter Schott said:

Thanks for the article, Aaron. For some reason I'm just coming across it for the first time. While I am torn about your original example (I can see both sides), I appreciate the thoughts and the discussion along the way. I agree that not every table needs an Identity column. Even tables where I use an integer key don't always get Identity columns. For example, if we use a table to match an enumeration in code, I want to control those values easily. Using an IDENTITY in those cases means all inserts need to use SET IDENTITY_INSERT ON. I can't always do that without the right permissions.

One thing I found useful recently was in re-keying the Identity/PK (surrogate key) values for an entire system when copying data from one system to another. I was able to use a "SourceID" type column to copy and reference data from the old system to the new. Without that ability, it is a lot harder to tell which data is which and tie it together properly when entering the new data.  I still don't need that for the many-many tables as I can look up the appropriate old and new values, but it's really helpful for moving data like that and ensuring that it's all tied together. (Not essential, but helpful.)

For new tables, we'll consider the appropriate structure for the purpose. Identity is frequently used, but not always and for a lot of the reasons you've already mentioned.

April 5, 2016 7:13 PM
 

Matt G said:

I'm usually a proponent of using composite primary keys and omitting the identity column.  As has already been mentioned, it doesn't usually add much value.  However, I came across a situation in which it actually made sense to have that third column.  Let me explain.

We had an audit table in our application that was generalized to hold all audit data for that app's database.  The columns in this table are audit_header_id, schema_name, table_name, event_date, primarykey_id, operation, user_db, user_app.  You can probably see the problem: there's one generalized column for the primary key.  By generalizing our audit table to hold data relating to all database tables, we had made it incapable of gracefully handling composite primary keys.

Instead of any major code rewrites, the easiest fix was to simply introduce that identity column to the offending tables.  In this case, it did add value - it allows us to reference a row in the table by a single identifier.

September 29, 2016 12:58 PM
New Comments to this post are disabled

About AaronBertrand

...about me...

This Blog

Syndication

Archives

Privacy Statement