I have been struggling to find a way to describe the fundamentals of first normal form for my PASS presentation, and this came to mind last night. SQL works in a very constructive way: if you have base values (commonly referred to as atomic, or scalar, values), then you can build up any view of the data that you want. However, if a single value contains many values, SQL falls apart. This is true not only at the column level, but also at the row level.
Indexes work naturally on scalars. (And yes, I am aware that some other RDBMS vendors have function-based indexes, but all they do is support denormalized data by making it faster to do something that is fundamentally wrong; plus, you can do something like this using indexed computed columns.) Think about it for a minute: what operations work well with indexes? Equality, inequality, range checks. Everything that deals with the entire column.
What doesn't work well? Partial equality is especially heinous, particularly when you want to find one value inside another value. Like looking for all strings where 'the' is found somewhere in the string. Or where the value 32 (binary 0100000) is found via an AND operator in the value 52 (binary 0110100). You cannot index the sixth bit of the integer (without building a computed column for it, but if you did that, what would be the purpose of packing the bits in the first place?). It would, however, be easy, if you had seven separate columns, to compute the packed integer value for the external code that needs it, AND still have the indexable, searchable columns available.
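To make the bit-flag case concrete, here is a minimal sketch (the table and column names are hypothetical) of the seven-column approach, with a computed column that reassembles the packed integer for any external code that still expects it:

```sql
-- hypothetical table: seven searchable flag columns, plus a computed
-- column that packs them back into the single integer value that
-- external code expects
CREATE TABLE userOption
(
    userOptionId int IDENTITY NOT NULL PRIMARY KEY,
    flag1 bit NOT NULL,  -- value 1
    flag2 bit NOT NULL,  -- value 2
    flag3 bit NOT NULL,  -- value 4
    flag4 bit NOT NULL,  -- value 8
    flag5 bit NOT NULL,  -- value 16
    flag6 bit NOT NULL,  -- value 32
    flag7 bit NOT NULL,  -- value 64
    packedValue AS (flag1 * 1 + flag2 * 2 + flag3 * 4 + flag4 * 8
                  + flag5 * 16 + flag6 * 32 + flag7 * 64)
);

-- each flag is now directly indexable and searchable:
--    WHERE flag6 = 1
-- instead of the unindexable:
--    WHERE packedValue & 32 = 32
```

You get the best of both worlds: the external code still sees its packed integer, and your queries get plain, index-friendly equality predicates.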
Indexability of a search argument is extremely important. An index seek is almost always incredibly fast, even if you have millions of rows, since you only have to touch index pages instead of scanning all the pages of the table. And when you join two tables together and have to do (essentially) one index seek for every row in the other table, the costs can pile up quickly!
Also, I should probably state the obvious. It is easier to write (just pulling a query out of the air):
WHERE productionYear = '2006'
and manufacturer = 'Ford'
and color = 'blue'
Than to have to pick out parts of some string value:
WHERE substring(vehicleIdNumber, 3, 1) = 'F'
and substring(vehicleIdNumber, 10, 4) = '2006'
and substring(vehicleIdNumber, 16, 2) = 'BL'
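If you are stuck with a composite value like this, the indexed computed column trick mentioned earlier is the usual band-aid in SQL Server. A sketch, assuming the same hypothetical VIN layout and a hypothetical vehicle table:

```sql
-- band-aid for a composite column: expose the embedded value as a
-- persisted computed column, which can then be indexed normally
ALTER TABLE vehicle
    ADD modelYear AS substring(vehicleIdNumber, 10, 4) PERSISTED;

CREATE INDEX XvehicleModelYear ON vehicle(modelYear);

-- the predicate now becomes a simple, indexable equality:
--    WHERE modelYear = '2006'
```

It works, but as noted above, it only makes it faster to live with a design that should have been separate columns to begin with.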
The constructive argument becomes even more compelling when you start to work at the table level. You could build all related tables into a single monolithic table if you really wanted to (many people really do try!): customer columns, invoice columns, sales columns, salesperson columns, products, everything, in one table. (Sounds stupid, of course, right? But most people do this on some smaller scale all of the time when they don't normalize...)
Clearly you wouldn't want to do this, but you might have a reason to display this one "table", right? Certainly... and, starting from normalized tables, it would be a very easy task:
SELECT ...
FROM customer
     JOIN invoice ON ...
     JOIN product ON ...
     JOIN sales ON ...
     JOIN salesPerson ON ...
But if you had built everything into one monolithic table and wanted to look at just the products:
SELECT distinct product
Easy enough (this is a typical query you might write when converting data from a legacy system). But now you have to consider how each of these values corresponds to each of the other values in the row (which is what makes data conversions so much fun), with tons of repeating data that you will have to spend time working through, probably with cursors doing lots of checking to see whether values have changed from row to row, and so on.
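As a sketch of that deconstruction work (all names here are hypothetical), even the simple first step of splitting products out into their own table looks like this, and that is before you reconnect each product to the invoice and customer values it appeared alongside:

```sql
-- hypothetical first step of the conversion: extract the distinct
-- products from the monolithic table into a normalized table
CREATE TABLE product
(
    productId int IDENTITY NOT NULL PRIMARY KEY,
    productName varchar(100) NOT NULL UNIQUE
);

INSERT INTO product (productName)
SELECT DISTINCT product
FROM monolithicTable;

-- still to do: update every monolithic row with its new productId,
-- then repeat the whole exercise for customers, invoices, sales...
```

Each extracted table is cheap on its own; the expensive part is re-establishing the relationships that normalization would have recorded explicitly from the start.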
So at the column level, the row level, and the table level, SQL is very much suited to constructing values from lower-order values, but it is generally unpleasant to use for deconstructing values. But then again, this is why we normalize, isn't it?
Crossposted to: http://drsql.spaces.live.com