THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Louis Davidson

Can you over-normalize?

Now, don’t get too excited and grab your pitchforks and torches. Clearly, it is extremely possible to overdo something in the design, but very often normalization takes the rap as being the culprit. In my “Database Design Fundamentals” presentation, one of my favorite things to do is ask “What is the most important normal form?” 9 out of 10 times, someone answers “Third”. When I ask what they have against fourth, the usually say that it makes the database work too slow. But when they find out that most Third Normal Form databases are already in Fifth Normal Form, well, this fact has never actually slowed down any database that I know of.

Under-normalizing is a very common problem, which is pretty much self explanatory. Most people don’t really understand it, or even really care about it. The basics of it are very straightforward, but the finer points (and why they are so important) are lost on many people who happen to be cobbling together some form of storage for the state of their objects (what we typically like to think of as a relational database.)

But I am not even sure that by definition you could over-normalize. I will submit to you that almost every so-called over normalized is more precisely “over designed” and is actually just an architect taking a set of requirements and expanding them to meet what they felt the requirements should be. As an architect, I have a very large bag of tricks that I like to pull from to create the “perfect” solution. The problem is, users rarely want or even need “perfect”.  Yet we always get really excited to start adding to what the user asks for. Normalization can only be done in the context of the requirements. If you defined every relationship in your requirements as 1-1, you could implement a fully normalized database in one table.. Every customer makes one order for one product. Done. Clearly reality is rarely so clean, and if the requirements stated this, you would not be doing your job unless you said “Whoa, are you..sure?”

A great example of how databases get out of hand came during Audrey Hammond’s (@datachix2) presentation at SQL Saturday in Atlanta. Her example was of a personal movie review system that her (hopefully theoretical, slightly sadistic and certainly inappropriate time using) boss had given her to do. She came to a final example design that seemed to match the requirements that she had presented in one of her early slides.  Her question, how would you change the model. Several of the people in the room had suggestions to improve the model, but most of the people were expanding the model to include stuff that they wanted to see in the model (I will admit, I wanted to do the same thing :).  If we had kept going, we would have ended up with a data model that could be used to replace the guts of the Internet Movie Database (imdb.com, one of my favorite sites.)

The problem was, while the ideas that had been good ones if we were trying to brainstorm items for the requirements, they were beyond what had been agreed upon for the implementation. The goal to create a personal movie review system is not exactly the same as the needs of the boss who want to keep up with a list of movies he has watched and a list of the important people in the movie that he wants to keep up with. 

Under the name of normalization, this sort of thing happens often. We add more and more tables to our model to take the design to larger and larger proportions implementing more and more things that seem like a great idea, but aren’t what the client wants. This is when the cries of over normalization begin to go up and cursing the name of Codd as if he was the person who over-engineered your database.

I clearly don’t want to make it sound like as an architect you shouldn’t push to provide a solution that gets it right when the requirements are, shall we say, weak.  Requirements are very often not thought through, and it is your job to recognize this and be certain that the requirements are changed to to reflect this, so you can create systems that implement what the user needs. Taking the user’s requirements as the final word for what they want is another poor practice.  Users know what they do, not how to create databases (presumably that would be your job.) My rule of thumb is that:

1. The requirements dictate the database design

2. The requirements and the relational engine dictate the implementation

So unless you understand the requirements, you can’t design the optimal database, and if you don’t understand SQL Server, you are not going to end up with an optimal implementation.  Over-normalize? No.  Over-engineer? Definitely.

Published Thursday, April 29, 2010 12:53 AM by drsql

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

DM Unseen said:

Can you over normalize?

Yes, you can. It is quite easy, but not a lot of people use this: use 6NF.

6NF is normally only relevant in creating complex fully temporalized datamodels, or data models whose schemas should be extremely stable. 6NF does not always perform well, so you should be very carefull ;)

6NF is the ultimate normal form (ie you cannot normalize any further than 6NF).

April 29, 2010 4:21 AM
 

Linchi Shea said:

> But I am not even sure that by definition you could over-normalize.

> Over-normalize? No.

So Louis, why do people denormalize tables all the time?

Linchi

April 29, 2010 9:14 AM
 

drsql said:

DM Unseen, you make my point when you say "6NF is normally only relevant" If it is relavent, then it isn't too much. If it is not relevant, then applying it would be incorrect because it adds unneeded information... If it is needed by the requirements, then it would not be too much.

The point being that when we do more than is needed to fulfill the requirements of the system, then we make life harder, not easier by going too far.

April 29, 2010 9:14 AM
 

Adam Machanic said:

A more interesting question is whether you can focus on normalization too much, ignoring other database design issues that have nothing to do with normalization or lack thereof. Or confusing other database design issues with normalization...

April 29, 2010 10:15 AM
 

James Luetkehoelter said:

I think Adam has an interest point - from a logical perspective no, you can't over normalize. But as Linchi points out and most of know all too well, people denormalize like crazy. And as we all know, starting with a database design (physical) in that manner usually leads to apocolyptic endings...

The difficult thing I find out in the wild is trying to get that balance behind application design and data design. If pure development is driving how the data is designed, denormalization is inevitable (you want to structure the data so that it best serves your application). If pure data design drives things, development can be quite cumbersome.

I find just getting people to think about how they're going to want to see the data that's coming in through their system ahead of time starts to give the nudge towards more normalization. Just my 2 cents.

April 29, 2010 11:17 AM
 

drsql said:

Linchi, I think most denormalization is done for the wrong reasons, at least it is in my experience. As an example, every time someone needs to know the number of child rows for a table, they immediately jump to adding a count column. "Why not query?" I ask, and they start to squirm and indicate that they hadn't thought of it, or it would be too slow.  

Are there valid reasons to denormalize? Absolutely, but usually it is only when the read/write ratio is extremely high (or even undefined (as with zero writes) that this becomes a valid requirement for a OLTP database. Good design and solid indexing makes use of that wonderful SQL Server relational engine and answers are overall faster. Obviously there are reasons like inventory or bank balances that you just couldn't keep up with, perhaps due to being read very often, and others for just distribution purposes, but i always say, try it first, ask on the forums, then denormalize.

And let's be honest, most DEnormalization is just people's excuse for why they didn't get it right when pressed on the subject, right?

April 29, 2010 12:26 PM
 

drsql said:

Adam and James:

You guys make another completely valid point. Normalization only covers so much, for example, naming standards.  You could have the perfect design with all of your tables named A100___, and it would still stink.  That was the focus of my posts on the 7 pillars of a great design. Normalized is important, but in some respects I would prefer to have an imperfectly normalized system with names that I could follow such that I understood what they heck the designer as trying to convey.

FROM A1000212 as a

      JOIN A1000214 as a2

         ON a.aid = a2.bid

Sure most names aren't that bad, but at least those names aren't misleading :)

April 29, 2010 12:31 PM
 

Linchi Shea said:

Louis;

Fine if we are talking pure theories. But the real world is quite different. Whatever the reason, denormalization is a good tool in many many cases, and for what it tries to accomplish, it basically says that the tables are over-normalized for practical purposes.

April 29, 2010 1:17 PM
 

Alexander Kuznetsov said:

<putting on my business hat>

We invest time and effort in normalization because we expect to save more on data integrity and other things. So, when we decide whether to normalize, we need to justify the investment. In some cases we are not going to benefit from normalization enough to justify it. In such cases we clearly save money by not normalizing.

In startup environments normalizing too early may be a serious mistake - if the project does not get traction and is canceled, then all the money spent on normalization is lost.

</putting on my business hat>

April 29, 2010 3:45 PM
 

drsql said:

Alex, I understand in essense what you mean, but in practice not so much. Normalization is generally about getting the right storage to meet the requirements. If the requirements say that you need to be able to store multiple widgets per wadget, then you have to go there.  If it states one widget per wadget, then it is a different implementation.  

Requirements can be trimmed down to eliminate some of the requirements that normalization is there to help with, but not providing places to put data that the user needs is how users figure out their own way to store data, regardless of your design... And then data loss occurs, and then angry customers, and then the startup is no longer started.

April 29, 2010 3:50 PM
 

Adam Machanic said:

I agree with Alex. There is a point you can reach after which further normalization yields a diminishing return. By not normalizing to the Nth degree you may potentially be open to certain data issues, but a designer needs to balance the likelihood and potential impact of such issues against the cost of a more complex design. Not only during the design phase, but also when actually working with the data.

A design that requires 17 joins instead of 3 to get some core data might make development a lot more expensive. Writing queries that perform well may be much more difficult, and views may not solve the problem in a good way. And if that design is only in place to prevent some data issues that are highly unlikely to actually occur, that's a waste of time and money.

It's wonderful to shoot for perfection, but in most shops it's more of an ideal than a goal. The real goal is to get working product out to the customer on time and under budget, even if that product does have a bug or two.

April 30, 2010 11:39 AM
 

Linchi Shea said:

> I agree with Alex. There is a point you can reach after which further normalization yields a diminishing return.

I'd go even further to say that there is a point after which further normalization yields negative return.

April 30, 2010 12:34 PM
 

James Luetkehoelter said:

OK, I may be opening a can of worms Louis, but something in your last comment caught my eye - "Normalization is generally [this may be the out to my rebuttal] about getting the right storage to meet the requirements".

I'm not sure I agree with that. What normalization is, first and foremost about properly logically separating entities and their attributes. How you physically represent that normalization becomes a physical design decision. That's where Adam's point about a database that requires 17 joins comes into play. It may be best (from both an application and later analysis perspective) is to repeat your widget in the same table as your corresponding wadgets.

I absolutely believe that logical normalization is a necessity, starting at least 3NF to start with on a physical level, and *WITH CAREFUL CONSIDERATION* denormalization in a real world environment a inoonvenient truth.

I have a feeling we're all sort of talking about the same thing...

April 30, 2010 4:42 PM
 

drsql said:

So, here is the big question.  If the requirement say "We have customers, for each customerr we simply need to have a single address, a single phone number, and I only really need a single name field (cause requirements always say field). We never search for last name, or anything, just need it for our orders.  We only sell a single product, so all orders can be for > 1 quantity, but only the same product.  The price of our product has been the same for 100 years, and will be the same for the next 100 years."  

Unlikely but possible.  As normalizers, our instinct is to say, they really want a first name, last name, and middle initial, at a minimum. And will they really only want one address? Nah, let's give them unlimited.  And what happens if they come up with new products, let's create a product table with a product type, and allow multiple order items. And so on. This is the point of matching requirements to structure. You can go nuts and make this 15 tables, but the proper solution looks like their model, with only the tables/columns and cardinality of relationships they need.

>>What normalization is, first and foremost about properly logically separating entities and their attributes. <<

This si just stating the same thing. Entities and attributes are determined solely by requirements of the solution, not to make us feel superior :) At this point, in the name of normalization, we often over engineer what they require. Should we ask the user if they want this, and warn them of the problems that lie ahead, sure. But in the end, it is the requirements that drive what we normalize. I try to stress this when I teach about these things. Normalization is there to get the relationships and cardinality right between columns, tables, etc. But you can't define what is right without full knowledge of what is being designed. Most situations follow a common pattern, but not always.

I usually stress that 4NF is essential, but only because it deals heavily with cardinality between things. Example, a book to author relationship.  Model that as 1-many and you will fail in either direction, even if it is correct in 3NF.  If the business rules state 1 author per book, then 1-many is fine. It feels wrong to us smarty pants that are telling the people what they need, but we are modeling their world, not conforming their world to ours.

As to the 17 joins versus 3 argument. I agree completely, but there are two persectives. If you can satisfactorily meet the requirements with 4 tales rather than 18, 18 is over engineered. Paul Nielsen talks heavily about generalization in this regard and I concur.  But, if it takes 18 table to represent what the user wants and you use 4, the complexities of their universe are going to be implemented somehow.  Careful denormalization can certainly be the solution, but this implies you know the normalized solution that matches their needs.

Just be careful not to leap to a solution without considering the properly normalized alternative.

Honestly, part of the issue here is that everyone who has  replied here are level headed Top of the Class folks that I completely would trust with a CREATE TABLE statement, as well as any other DDL. In reality, we are all saying the same thing, I feel. Normalization for normalization sake is a bad thing. When you reach the point that a user has the right number of places to enter each individual bit of information they need to store, and you never have to parse out values, this is generally the perfect stopping point.

And in my experience, it is far more likely that the designer says, nah, we don't need > 1 of those, nah, didn't hear the user say that, does a design, leaves it out, and then the users start using the middle name colummn as a comma delimited list of the users' favorite movie genre's because the genre was created with cardinality 1 to the person and is possibly enforced with a constraint, but they really wanted to allow > 1... Some people say users are evil... I say they are crafty people who skirt the poor planning of designers by getting it done!

April 30, 2010 5:09 PM
 

James Luetkehoelter said:

Well said Louis - I would say that developers can sometimes be evil as well, crafting database design to server their application rather than represent what is happening in the business....

April 30, 2010 11:01 PM
 

Mark Freeman said:

I think there is a distinction to be made between normalization and scope creep. Normalizing is refining the model that meets the stated requirements. Some of this discussion is about trying to anticipate future needs in the model. I think these are two distinct subjects.

Having direct communication with end users or a good BA who is also a subject matter expert is critical in being able to determine the different between reasonable planning for the future (even if the feature isn't going to me implemented in the first release), and creating an overly complex design that will never be fully utilized. The former saves money in refactoring and data migration down the road, and the latter is a waste of time and money. In reality, it can be difficult to find where the line between the two really is, and it may involve some negotiation.

Given our experience with data modeling, we may be able to point out things like "books sometimes have more than one author" or "people may want to indicate more than one favorite genre" to the BA that they may have missed. But if they insist that they don't and won't need to handle these things, I'd get them to put it in writing, give them what they asked for, and move on. If I feel really strongly about it, I might mention it to the product owner.

I try to add value, but in the end I agree with you that my responsibility is to meet the stated requirements as quickly and inexpensively as possible, not build something to handle my assumptions about what they might really need.

--

My experience has been that normalizing a design doesn't take significantly longer and doesn't hurt performance unless taken to extremes. Providing read-only denormalized reporting views can help simplify things for ad-hoc reporting and export/integration tasks while still retaining the benefits of a reasonably normalized underlying data model for use by an application.

June 28, 2010 1:13 AM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Links to my other sites

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement