<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://sqlblog.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Search results matching tags 'Database Design' and 'Normalization'</title><link>http://sqlblog.com/search/SearchResults.aspx?o=DateDescending&amp;tag=Database+Design,Normalization&amp;orTags=0</link><description>Search results matching tags 'Database Design' and 'Normalization'</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP2 (Build: 61129.1)</generator><item><title>Normalization and How to Know When You Are Done… The short version…</title><link>http://sqlblog.com/blogs/louis_davidson/archive/2011/05/29/normalization-and-how-to-know-when-you-are-done-the-short-version.aspx</link><pubDate>Sun, 29 May 2011 20:54:15 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:35960</guid><dc:creator>drsql</dc:creator><description>&lt;p&gt;A while back, I was working on a short article about Normalization for a book that never got published (admittedly I wasn’t getting paid for the article, and it wasn’t for charity, so I wasn’t &lt;em&gt;&lt;strong&gt;that&lt;/strong&gt;&lt;/em&gt; broken up over it.)&amp;#160; The task at hand was to, in 2 pages or less, describe the process of normalization and help you to know when you have finished. In my upcoming book Pro SQL Server 2000 + N (where N &amp;gt; 10) Relational Database Design and Implementation, it takes about 45 pages. So it wasn’t really a realistic task, especially considering I have spent about a full paragraph letting you know how hard the task is going to be. The most important thing that is missing from this short introduction is examples, which I include in the book in truck loads.&lt;/p&gt;  &lt;p&gt;There are two distinct ways that Normalization is approached. In a very formal manner, there are a progressive set of “rules” that specify “forms” that you are working to achieve. There is nothing wrong with that definition, but progressing through the forms in a stepwise manner is certainly not how any seasoned data architect is likely to approach the problem of designing data storage. Instead, you design with the principles of normalization in mind, and use the normal forms as a test to your design. &lt;/p&gt;  &lt;p&gt;The problem with getting a great database design is compounded with how natural the process seems. The first database that the past uneducated version of me built had 10+ tables, all of obvious ones like customer, orders, etc. set up so the user interface could be produced to satisfy the client. However, tables for address and even order items were left as part of the main tables, making it a beast to work with for queries, and as my employer wanted more and more out of the system, the design became more and more taxed. The basics were there, but the internals were all wrong and the design could have used about 50 or so tables to flesh out the correct solution. Soon after (at my next company, sorry Terry), I gained a real education in the basics of database design, and the little 1000 watt halogen light bulb went off… &lt;/p&gt;  &lt;p&gt;That light bulb was there because what had looked like a more complicated in the college database class that no normal person would have created (bet you can’t guess what my grade was in &lt;strong&gt;&lt;em&gt;that &lt;/em&gt;&lt;/strong&gt;class!) was really there to help my design fit in with the tools that I was using. Turns out that the people who create relational database engines use the same concepts of normalization to help guide how the engine is created that I needed to for a database to work well. So if the relational engine vendors are using a set of concepts to guide how they create the engine, it turns out to be actually quite helpful if you follow along.&lt;/p&gt;  &lt;p&gt;First, lets look at the “formal” rules. The normalization rules are stated in terms of “forms”, starting at First Normal Form, and including several others some of which are numbered, some are named for the creators of the rule. (Note that in the strictest terms, to be in a greater form, you ought to also conform to the lesser form. So you can’t be in third normal form and not give in to the definition of the First). To be honest, it is rare that a data architect will actually refer to the normal forms&amp;#160; in conversation specifically unless they are having a nerd argument with a developer that is trying to design an entire customer relationship management system in a single table, but understanding the basics of normalization is essential to understanding why it is needed. What follows is a very quick restatement of the normal forms:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;b&gt;First Normal form/Definition of a Table&lt;/b&gt; – Attribute and row “shape”       &lt;ul&gt;       &lt;li&gt;All columns must be atomic—one value per column &lt;/li&gt;        &lt;li&gt;All rows of a table must contain the same number of values – no arrays &lt;/li&gt;        &lt;li&gt;Each row should be different from all other rows in the table – unique rows          &lt;br /&gt;&lt;/li&gt;     &lt;/ul&gt;   &lt;/li&gt;    &lt;li&gt;&lt;b&gt;Boyce-Codd Normal Form – &lt;/b&gt;Every&lt;b&gt; &lt;/b&gt;candidate key is identified, and all attributes are fully dependent on a key, and all columns must identify a fact about a key and nothing but a key.       &lt;ul&gt;       &lt;li&gt;Encompasses:          &lt;ul&gt;           &lt;li&gt;Second Normal Form - All attributes must be a fact about the entire primary key and not a subset of the primary key &lt;/li&gt;            &lt;li&gt;Third Normal Form - All attributes must be a fact about the primary key and nothing but the primary key              &lt;br /&gt;&lt;/li&gt;         &lt;/ul&gt;       &lt;/li&gt;     &lt;/ul&gt;   &lt;/li&gt;    &lt;li&gt;&lt;b&gt;Fourth Normal Form&lt;/b&gt; - There must not be more than one multivalued dependency represented in the entity. That is to say that every attribute relates to the key with a cardinality of one. Not a common rule to violate, but it definitely does occur.       &lt;br /&gt;&lt;/li&gt;    &lt;li&gt;&lt;b&gt;Fifth Normal Form&lt;/b&gt; - All relationships are broken down to binary relationships when the decomposition is lossless. Very rarely violated in typical designs. &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;There are other, more theoretical forms that I won’t mention, but they are rare to even encounter the definition. In the reality of the development cycle of life, the stated rules are not hard and fast rules, but merely guiding principles that can be useful to help you avoid certain pitfalls. In practice, we end up with denormalization, (meaning purposely violating a normalization principle for a stated, understood purpose, not ignoring the rules to get done faster) mostly to satisfy some programming or performance need from the consumer of the data (programmers/queriers/etc)&lt;/p&gt;  &lt;p&gt;Once you deeply “get” the concepts of normalization, you really will find that you build a database like a well thought out Lego creation, desiring how each piece will fit in to the creation before putting pieces together, because disassembling 1000 Lego bricks to make a small change makes Legos more like work than fun. Some rebuilding based on keeping agile can be needed, but the more you plan ahead, the less data you will have to reshuffle. &lt;/p&gt;  &lt;p&gt;In actual practice, the formal definition of the rules aren’t thought of at all, but instead the guiding principles that they encompass are.&amp;#160; In my mind, I use the following four concepts in the back of my mind to guide the database I am building, falling back to the more specific rules for the really annoying/complex problem I am trying to solve:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;b&gt;Columns&lt;/b&gt; - One column, one value &lt;/li&gt;    &lt;li&gt;&lt;strong&gt;Table/row uniqueness&lt;/strong&gt; – Tables have independent meaning, rows are distinct from one another. &lt;/li&gt;    &lt;li&gt;&lt;b&gt;Proper relationships between columns&lt;/b&gt; – Columns either are a key or describe something about the row identified by the key. &lt;/li&gt;    &lt;li&gt;&lt;b&gt;Scrutinize dependencies &lt;/b&gt;- Make sure relationships between three values or tables are correct. Reduce all relationships to binary relationships if possible. &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;The question in the title still has yet to be conquered. “How to&amp;#160; know when you are done?” What I left out of the description of Normalization was the granularity you go with. The word “atomic” is a common way to describe a table or column that is normalized enough. Atomic would tend to indicate something that is broken down to its absolute lowest form. But unless you are not a nerd (and would you really be reading this if you weren’t?) we know that there are lots of particles smaller than an atom. When you try to mess with particles smaller than the atom, you get a mushroom cloud that even Timothy Leary would not have approved of.&lt;/p&gt;  &lt;p&gt;It is the same way with databases. Tables and columns split to their atomic level have one and only one meaning. Deal with them at a higher level, and you will suffer with lots of substrings, switching attributes that you use to find out what a table means in a situation. But break things down too far, and you will suffer even more. My best example of this is a column that holds a large quantity of text. If you never need to us part of the data using SQL, a single column is perfect (a set of notes that the user uses on a screen is a good example.) You wouldn’t want a paragraph, sentence, and character table to store this information. On the other hand, that same character column is abused when the users start putting coded information (because users WILL find a way to work if your software fails them). Then and you have to search for, you will need to begin working with the less comfortable string manipulation functions in SQL… And just try to index a part of a large text column. Possible? Sometimes. Best way to go? Never. &lt;/p&gt;  &lt;p&gt;The key to knowing what is normalization and what is an academic exercise for a nerd is to understand the needs of the users (commonly referred to as requirements, as in “Why don’t we ever have good requirements before we code!?!”). If it is clear that the user is planning on maintaining a list of values and will need to update them programmatically, then it is your job to make each value a row in a table. But if there is no requirement to ever search on a value in that list or programmatically access part of the value, then it might be overkill to do anything other than leave the value alone.&amp;#160; It is often best to err on the side of caution, but the ideal relational storage for a document would be minimally at the word/punctuation level. If you are read this far and are convinced that would be the proper solution, then you need to get a complete book or take a class on the subject before you start creating a relational database.&lt;/p&gt;  &lt;p&gt;The reasonable answer to when you are done normalization is when users have exactly the right number of places to store the data they need and you can query/use the data without parsing it… Easy enough until the user changes their mind, huh?&lt;/p&gt;</description></item><item><title>Denormalize with consciousness (aka Constraint are not an option)</title><link>http://sqlblog.com/blogs/davide_mauri/archive/2010/08/02/denormalize-with-consciousness-aka-constraint-are-not-an-option.aspx</link><pubDate>Mon, 02 Aug 2010 16:08:05 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:27513</guid><dc:creator>manowar</dc:creator><description>&lt;p&gt;Today I had to help a customer to make their upgraded invoicing system to work correctly…and as quite often, the source of the problem was a bad denormalization.&lt;/p&gt;  &lt;p&gt;From now on, each time I can do it without revelaing too much of the customer, I’ll try to write a post so to it will be clear to anyone (developers in particular) what are the dangers of denormalizing without taking care of consistency.&lt;/p&gt;  &lt;p&gt;Denormalization is not bad “&lt;a href="http://en.wikipedia.org/wiki/A_priori_and_a_posteriori"&gt;a-priori&lt;/a&gt;”, so there are situations in which it can help performances and so it make sense in these cases. Plese note that I’m talking of &lt;em&gt;denormalization, &lt;/em&gt;which is the conscious decision to denormalize after having normalized the database schema &lt;em&gt;because actual system cannot give use the needed performances. &lt;/em&gt;I’m absolutely &lt;u&gt;not&lt;/u&gt; talking of &lt;em&gt;un-normalized&lt;/em&gt; situations. Unnormalizated database are simply bad, dot. They suffer of a pletora of problems among which, of course, the consistency of data is one of the biggest. So this post is useful in both situations, but please keep in mind that they are two completely different things!&amp;#160; &lt;img style="border-bottom-style:none;border-right-style:none;border-top-style:none;border-left-style:none;" class="wlEmoticon wlEmoticon-smile" alt="Smile" src="http://sqlblog.com/blogs/davide_mauri/wlEmoticonsmile_5D990729.png" /&gt;&lt;/p&gt;  &lt;p&gt;Now, if you decide to denormalize, you simply cannot avoid to implement a constratint that enforces the integrity of the information you’re going to duplicate. And it cannot be an application level constraint, but it &lt;em&gt;must&lt;/em&gt; be a database-level constraint….otherwise you can find yourself in the following situation: you have a column “InvoiceType” that holds the information about the type of the entity; If it’s an invoce it holds “I” and if it’s a pre-invoce (a sort of notice that an invoice will be issued) it holds “P”. Also a column named “InvoiceNumber” exists, and it stores the – guess? – invoce number. But of course pre-invoices doesn’t have any number so for them this column will always be zero.&lt;/p&gt;  &lt;p&gt;Now – you’re probably seeing where I’m going – what happen if you have a row with InvoiceType = “P” and InvoiceNumber = 1234? You’re in trouble. Well, actualy &lt;em&gt;the company &lt;/em&gt;is in trouble, since no-one can tell if this is an Invoice or a Pre-Invoce. And of course this affects the Gross Margin and you can bet the Boss won’t be happy to know this!&lt;/p&gt;  &lt;p&gt;This should have never happened….but despite everything it happened. A human error, a bug in the (ugly) application….it happened. But it would have never ever&amp;#160; happened if a simple CHECK CONSTRAINT would have been used (as you can see I’m not discussing that the table is not normalized).&lt;/p&gt;  &lt;p&gt;The situation created by such omission results in a more complex ETL Phase for the BI solution we’ve built, plus some additional work by a person who job is simply to check and solve the identified anomalies.&lt;/p&gt;  &lt;p&gt;The conclusion is that, if denormalization is the choosen way, &lt;em&gt;constraint to preserve information integrity must be put in place&lt;/em&gt;. Put it in another way: &lt;strong&gt;if you decide to denormalize you’re taking the responsibility to keep information consistent.&lt;em&gt; &lt;/em&gt;&lt;/strong&gt;And this is not an option. Apply some basic logic, the result is that &lt;em&gt;constrains are not an option!&lt;/em&gt; (You think they will slow you down during insert, update and delete? They may do…but that’s what you decided when you started to denormalize!)&lt;/p&gt;</description></item><item><title>The second pillar - Normal</title><link>http://sqlblog.com/blogs/louis_davidson/archive/2009/03/01/the-second-pillar-normal.aspx</link><pubDate>Sun, 01 Mar 2009 20:14:33 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:12368</guid><dc:creator>drsql</dc:creator><description>&lt;p&gt;The first pillar was easy, since no reasonable person is going to argue that having a design that is not coherent is desirable. No matter what the type of system, any design that isn’t easy to understand is likely to be a bad design (obvious caveats are that it must be understandable to other people of a given level of intelligence in the given subject/tool/language).&amp;#160; The second one is where things get a bit more interesting, or even “religious.”&amp;#160; &lt;/p&gt;  &lt;p&gt;In the initial post, I defined this pillar as: normalized as much as possible without harming usability/performance (based on testing).&lt;/p&gt;  &lt;p&gt;Sounds easy enough, right? It is, but the problem with this is that there is a real lack of understanding what normalized really means, and to be honest, why you would want to do it in the first place.&amp;#160; To fully cover the question of what and why of normalization would take a lot more than one post ( I do an hour long presentation where I have to talk like a chipmunk at Starbucks to cover an overview, and in my book there are 60+ pages dedicated to the subject.) &lt;/p&gt;  &lt;p&gt;Briefly, the basic gist of normalization is that you don’t duplicate data. Every single piece of data in a relational database has a very specific job and it is the ONLY point of data that does that job.&amp;#160; It is neither the source of another piece of data, nor sourced from some other piece (or pieces) of data.&lt;/p&gt;  &lt;p&gt;This single point of data should also be as simple as possible, but not simpler (to totally steal an Einstein quote).&amp;#160; If you ever need to break down a value in a SQL statement to use it (like using substring) then you are probably not normalized enough. There are plenty of reasonable anti-examples where this breaks down, such as searching in a large string object on occasion, but even that can depend on the purpose of the large string.&amp;#160; Having data at this (so called) “atomic” (cannot be reasonable broken down further) level means less need for code to manage duplicate data whether copied or summarized for performance. (It will also be important to you when you want to index some data to make a search go faster.)&lt;/p&gt;  &lt;p&gt;The reason that this is so important is that normalized data follows the same pattern of usage that the relational engine was built to handle.&amp;#160; Most of us would never use a hammer to screw in a nail, and it would be even less likely to drive a nail with a screwdriver. Why?&amp;#160; Because when we were starting to learn to use tools we learned that these were not good ideas. As young children it would not have been surprising to see a kid smacking a nail with the hard end of a hammer, but not an adult.) Sometimes you might do something less than perfect, (like using a wrench as a hammer) but as a rule, you know that the “best” practice is to have a bag full of tools that you carry around your house to get jobs done.&lt;/p&gt;  &lt;p&gt;So why do people expect SQL Server to do a good job as a data manipulation tool when they won’t learn how it supposed to be used.&amp;#160; And ever wondered why functional developers and dbas and&amp;#160; data architects clash? Because the ratio of data architects and dba’s that understand how SQL Server works is inversely proportional to that of functional developers.&amp;#160; It is not a rare occasion for me to see a messy data solution to a problem and be able to formulate a SQL solution that works in a much easier fashion, and still have the ability to follow good relational data patterns.&amp;#160; To the person who doesn’t “get” relational, very iterative, one row = one object instance solutions are normal, leading to lots of singleton retrieves from the database to do data manipulation. But to the relational programmer, we know that taking the normalized structures and letting the SQL engine take our set based queries and do all of the ugly work for us (almost always) provides a solution that works and works fast.&lt;/p&gt;  &lt;p&gt;But it all starts with shaping the data architecture in the correct manner (unlike the data architect, which can be almost any shape at all).&lt;/p&gt;</description></item></channel></rss>