THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

  • Bin packing part 4: The set-based disaster

    Almost a year ago now, I started a series of blog post on the bin packing problem. But after the first three posts, various reasons caused the research I still had to do for the fourth part to be massively delayed. It’s only now that I have finally found the time to finish my research and write up the fourth installment.

     

    After a nine-month delay, I can hardly expect you to remember what I covered in the first posts, so you may want to follow these links to re-read that:

    ·        The first post includes an explanation of the bin-packing solution and establishes a baseline that all other solutions have to be compared against – for both performance (lowest speed) and efficiency (lowest number of sessions, or bins).

    ·        The second post introduces several techniques to increase packing efficiency, though at the cost of reduced performance.

    ·        In the third post, I investigate several ways to improve performance, and find out exactly how much they decrease packing efficiency.

     

    Changed version, changed performance

     

    All performance numbers quoted in the previous posts were based on tests on my laptop, which was at that time running SQL Server 2005, SP2. But not anymore. In the meantime, I have upgraded my laptop to SQL Server 2008 RTM, so it’s highly probable that performance of all queries tested so far has changed – hopefully for the better. Since I don’t want to revive SQL Server 2005 on my laptop, I had no choice but to repeat all tests that I ran earlier in the course of researching and writing the first three parts of this series. I won’t bore you with all the details; instead I’ll just show an updated version of the table that I posted in the conclusion of the third part, listing the performance and efficiency of the baseline and of all versions of the algorithm that show an interesting trade-off between efficiency and performance.

     

    Version

    Execution time (ms)

    Number of sessions

    Baseline

    76,928

    19,457

    FillThenNext

    37,968

    23,630

    FillThenSearch (with index)

    42,353

    19,749

    FillThenSearchDesc (with index)

    46,977

    18,925

    Order50FirstB (with index)

    63,924

    18,923

    OrderDesc (with index)

    73,824

    18,923

     

    If you compare the table above to the table I posted in part 3, you’ll notice a few interesting changes:

    ·        There are no really big performance changes – queries got just a couple of percents faster or slower, but not the big speedups one would hope for when moving to a new version.

    ·        Two of the algorithms, the baseline and OrderDesc (with index) are now slower than before. The others all got a bit faster.

    ·        There’s also a new algorithm in the table: Order50FirstB (with index). Without the index, this 18,932 sessions – but with the index, this improved to 18,923. Unfortunately, when I originally tested the effects of the index I was too focused on the performance to note the increased packing efficiency. So, at the presumed 18,932 sessions I concluded that another algorithm (FillThenSearchDesc) was both faster and more efficient, and thus excluded this one from the table – whereas in reality, Order50FirstB should have been in and OrderDesc should have been out (for it’s now exactly as efficient but slower). In fact, the only reason why I still included  OrderDesc (with index) in the table above is because it was so that you could compare it to the table in the previous part.
    And in case you are wondering how it’s possible that adding an index affected packing efficiency as well as speed, here’s why: the Order50FirstB procedure uses a SELECT TOP 1 query without ORDER BY, which is known to produce undetermined results; in this case, adding the INDEX obviously affected what sessions were returned whenever this query was executed, and this had its effects on the packing efficiency. Which only goes to prove, once more, that one should never use the TOP clause without ORDER BY. (Okay, in this case I’m pretty satisfied with the unexpected change, but it could just as well have been the other way around!)

     

    Less sample data

     

    As promised in the previous part, I will now focus on developing a purely set-based solution for the bin packing problem. But before I can go there, I have to produce a new set of test data that is drastically reduced in size as compared to the original test data. There are various reasons for this. One reason is that the set-based solution requires an advanced knowledge of the number of sessions required, with the size of the query expanding if the theoretic number of sessions increases. Another reason is that the size of the query also increases as the session size increases, so I wanted to test with a smaller session size first. And finally, the performance of this tiny test-set turned out to be already pretty bad; I’ll try to make an estimation of the total execution time for a set-based solution for the original problem near the end of this post – after which you’ll understand why I never actually built or tested this version.

     

    In the attached ZIP file, you’ll find the file “Tiny Testset (Random).sql”. This is a new version of the random test data generator presented previously in the first part of this series, with the following modifications:

    ·        The maximum number of candidates per registration that was hardcoded as 100 in the original script has been replaced by a variable, defaulting to 10.

    ·        All quarters now use the same, simple expression to get an even distribution of registration sizes between 1 and the maximum size. The switches to include or exclude each quarter have been removed. I felt that the extra complexity of different distributions was not required in this case.

    ·        Since I had to code the query for a fixed maximum number of possible solutions, I added an extra variable for the maximum number of candidates in all sessions for each quarter; if the random distribution of registration sizes causes this to be exceeded, the script will simply remove the largest registrations until the total size of all remaining sessions in each quarter no longer exceeds the maximum.

     

    For my first round of testing, I decided to limit the sessions to a maximum of 20 candidates, and to assume a maximum of four sessions per quarter. I limited the amount of subjects (sessions per quarter) to 10, and set the maximum registration size to 10 as well. With these numbers, I decided to set the maximum number of candidates per quarter to 70. In theory, this still allows for impossible data combinations – for instance, there is no way that you can ever fit 10 registrations for 7 candidates each in less than 5 sessions –, but none of the tests I ran ever actually ran into this problem. But if you are going to run your own test with a different seed for the randomizer, or with different settings for other variables, you need to be aware of this possibility. Since the set-based algorithm will simply not return any results for a quarter that needs more sessions than the query caters for, this problem is easy to spot – you just need to check that the query returns results for as many quarters as your test data spans (which can be done by simply checking the amount of rows returned – easily visible when returning results in grid view in SQL Server Management Studio).

     

    The set-based solution

     

    Credit where credit is due. The set-based solution I present here is not mine. I have first seen this solution in a newsgroup posting by John Gilson, dating back to early 2005. His code did contain some minor errors that I corrected, and I have also added some optimizations to his code – but the logic used in this solution is entirely his.

     

    The set-based solution presented here is not implemented in a single query but as a series of views, building on top of each other. This has several advantages. One of them is that the logic is easier to understand. Another advantage is that it prevents endless code repetition: every view can reference the previously defined views by name instead of repeating the entire definition of the view as a subquery. This does not automatically lead to performance gains though – SQL Server will replace each view by its definition before handing the result over to the query optimizer, so the resulting plan will be exactly the same as if the entire query were written out referencing only base tables.

     

    As in the previous parts, the code is too long to reproduce here. In the attached ZIP file, you’ll find the code for this solution in the script “PureSetbasedViews.sql”, complete with extensive comments. This code also contains (commented out) statements to review the contents of each of the intermediate views, so if you wish you can uncomment them and see how the steps build upon each other to massage the data into the final solution.

     

    Step 1: Aggregating by size

     

    The first step of the algorithm is pretty simple. It builds on the basic idea that in any give solution, it’s not important which subjects are included in each session, but only how many of each size. And for the purpose of the bin packing algorithm, subjects with the same number of candidates are completely interchangeable. As an example, let’s say that there are five registrations for a given quarter, three for 6 candidates each and two for 9 candidates each. This will be represented by five rows in the dbo.Registrations table, but it can just as well be represented by two rows – one row representing 3 registrations for 6 candidates, and a second row representing 2 registrations for 9 candidates. This is implemented in the attached code in the dbo.RegsBySize view.

     

    The great advantage of this is that it vastly reduces the number of potential (but not really different) solutions. For instance, there may be solutions where a session is built by combining one of the registrations for 6 candidates is with one of the registrations of 9 candidates. When building straight from the dbo.Registrations table, any possible set-based solution would find 6 variations on this session (combining each of the three 6-candidate registrations with each of the two 9-candidate ones); when building from dbo.RegsBySize, there will be just one row representing this possible combination of two registrations.

     

    Step 2: Building sessions

     

    Using the aggregated registrations, the second step is to combine one or more registrations into possible sessions of all sizes from one up to the maximum session size. Combinations of registrations that would exceed the maximum session size are excluded, but combinations that are smaller than the maximum are not. The view dbo.PossibleSessions describes a completely denormalized result set, holding up to five (Size, Count) pairs from the dbo.RegsBySize view. Like in the previous step, only one row is created for each possible permutation – so based on the example above, there will be one row for “1 registration size 6 and 1 registration size 9 (total size 15)”, even though two such sessions could be formed from the total registrations available. Other possible sessions constructed from the above sample data would be “1 registration size 6 (total size 6)”, “1 registration size 9 (total size 9)”, “2 registrations size 6 (total size 12)”, “3 registrations size 6 (total size 18)”, and “2 registrations size 9 (total size 18)”.

     

    As you can see, this view also includes a column SessionSize (holding the total number of candidates in all the session’s registrations) and columns for the total number of sessions for each size available. The former can easily be computed from the Size and Cnt columns, and the latter can easily be fetched by joining to dbo.RegsBySize – but redundantly including these columns here vastly reduces the amount of code required in the next step.

     

    The final column in this view, RowNum, uses the ROW_NUMBER() function to assign each possible session a unique number within its quarter. The numbers are assigned by ordering the possible sessions by session size and then by size and count of each of the (size, count) pairs; this could in fact have easily been replaced by any other ordering – the only goal is to have some method to assign unique row-numbers within each quarter. These numbers are used in the next step , when solutions are made by combining multiple possible sessions, to prevent duplicate solutions – for instance, if one solution combines possible sessions 1 and 2, we don’t want another solution formed by combining sessions 2 and 1. If you want to run this code on SQL Server 2000, or on any other database platform that does not support the ROW_NUMBER() function, check John Gilson’s original posting (that was written before SQL Server 2005 was released) to see how he achieved the same goal without the RowNum column – by adding lots of complicated comparisons in the next logical step).

     

    You’ll probably note that the maximum session size of 20 is hardcoded in this view, since parameters are not supported in views. I could have used an inline table-values user-defined function instead (sacrificing support for SQL Server 2000 and before), but that would not really have helped, since the number of columns this view requires is also dependant on the maximum session size. The five (size, count) pairs supported in this version are not chosen arbitrarily; it is the minimum number required to describe all possible permutations of registrations whose total size does not exceed 20. The smallest possible session size that would require a sixth (size, count) pair would be 21 (combining one registration each of the sizes 1, 2, 3, 4, 5, and 6). Similarly, a seventh (size, count) pair (and accompanying avail column) would be required when the maximum session size is 28 or more, and so on. For the maximum session size of 100 that I used in the original test data, no less than 13 (size, count) pairs are required – so if you want to extend the set-based solution to this original size, prepare to expand the view to a total of 43 columns, and no less than a whopping 307 lines of code (if I didn’t mess up my estimations…). I’m sure this makes you start appreciating why I chose to limit myself to a much smaller set of test data for this part of the series!

     

    Step 3: Finding solutions

     

    Now that we know all possibilities to assemble sessions of all allowed sizes, it’s time to combine these possible sessions into complete solutions. Of course, not any combination is a solution – in order to qualify as a solution, each registration has to be used exactly once. This is implemented with a clever trick – the code checks to see that no registration is used more than once (or rather, that no registration size is used more often than the number of registrations of that size), and that the total number of candidates in all sessions combined equals the total number of candidates in all registrations. Since the only way to get at a total equal to the total registration size without using registrations twice is to use them all, this requirement is equivalent with the requirement to use each registration exactly once, but much easier to test.

     

    When assembling solutions from the possible sessions view, it’s important to remember that the possible sessions stores permutations – so it’s allowed to use multiple instances of the same possible session as long as the total number of available registrations of each size is not exceeded. So going back to the sample data presented earlier, one possible solution would include two occurrences of the “1 registration size 6 and 1 registration size 9 (total size 15)” session, plus one occurrence of “1 registration size 6 (total size 6)” to top it off. Another solution would include one occurrence of “3 registrations size 6 (total size 18)”, and one occurrence of “2 registrations size 9 (total size 18)”.

     

    If you look at the dbo.Solutions view in the code, you’ll see a design that is even more extremely denormalized than the dbo.PossibleSessions view already was. It includes the five (size, count) pairs already familiar from the latter, but repeats them four times – thus allowing for a maximum of four possible sessions to be combined to make up a solution. This means that the solution fails when it’s given data that requires five or more sessions to accommodate all registrations – it will simply not return any rows for that quarter. To solve that, all you have to do is add more columns for a fifth session, and add an extra LEFT JOIN to a fifth occurrence of the dbo.PossibleSessions view. You may have noted that the join condition for each join references all “previously” joined occurrences of dbo.PossibleSessions, thus increasing in size with each additional occurrence.

     

    Apart from the columns for the denormalized description of all sessions in a solution, the dbo.Solutions view includes four more columns. Apart from the unavoidable Year and Quarter columns, there is a NumSessions column holding the number of sessions used, based on a simple CASE expression (only added for easier understanding of the output; it’s not really required). The final column calculates a ranking for each solution by using the ROW_NUMBER() function, partitioning by quarter and ordering by number of sessions in the solution (least sessions used first) using a clever trick based on the fact that NULL sorts before other values (I could also have repeated the CASE expression used for the NumSessions column, but that would have been more typing). Note that the order of solutions with the same number of sessions is irrelevant for the approach used here. This Ranking column will be used in the next step.

     

    If you have already checked the query, you’ll probably understand that it’s impossible to scale this to the dimensions required to solve the original problem. The maximum session size of 100 candidates already forces us to use 13 (size, count) pairs in the dbo.PossibleSessions view, that all return here; the inclusion of one copy of that view per session in the solution moves us into some really insane figures. For instance, if you want to allow up to 200 sessions per quarter, you’d need to define 5,204 columns in the view (which is more than SQL Server allows), and the CREATE VIEW statement would have more than four million lines, well beyond the maximum batch size SQL Server allows. And that would still not be enough for the original test data – the tests conducted earlier show that some quarters need slightly more than 700 sessions. Maybe this query will find a way to pack them into less than 700, but then we’d first need to find a database that allows views with 18,204 columns and a definition of 51,155,633 (!) lines…

     

    Step 4: The best solution …

     

    After all this hardcore query magic, the fourth and final step is almost disappointingly simple. Finding the best solution is a simple matter of returning the solution with the lowest score in the Ranking column – i.e. a ranking equal to 1. And if you prefer to see all possible solutions with the lowest possible number of sessions, all it takes is to replace ROW_NUMBER() with RANK() in the definition of the dbo.Solutions view – this will assign the number 1 to all solutions with the lowest number of sessions, and higher numbers to all solutions that use more sessions.

     

    Once more, you can check John Gilson’s posting to see that without ROW_NUMBER() the dbo.Solutions view has to be joined to (a derived table based upon) itself. In practice this means that after replacing views by their definitions, the resulting query passed internally to the query optimizer doubles in size – I don’t even want to begin to imagine what effect this will have on this solution’s performance. Especially considering that performance is already pretty bad with using the ROW_NUMBER() function.

     

    Maybe this is also a good place to point out that the results you get when you query the dbo.BestSolutions view are not actually yet the final results. Okay, you will get a result set showing that, for some quarter, you have a session consisting of three registrations for five candidates plus one registration for three candidates, but you still have to select the four actual registrations to assign to this session. Since this blog post is already longer than any I’ve done before (and still growing), I decided to leave that part as an exercise to the reader. Just one word of warning – if you come up with a solution that uses multiple copies of the dbo.BestSolutions view, replace it with a temporary table holding the contents of the view. Otherwise, SQL Server will happily produce a query execution plan where dbo.BestSolutions is evaluated as often as you use it in your query – and trust me, once you’ve seen how it performs, you don’t want that to happen!

     

    … for some definition of “best”

     

    I you have already checked out the definitions of the views used, you probably don’t expect a great performance anymore. And after reading my ominous hints above, you’re really bracing yourself – and with good right. With the amount of test data reduced to just ten (or less) registrations per quarter, querying the dbo.BestSolutions view to find the best solutions for all quarters (on an empty cache) turns out to take an average execution time of 94,535 milliseconds – that’s more than one and a half minute!

     

    I have already indicated that the original code that I based this solution on was written before SQL Server 2005 was available. That is probably the reason why separate views were used instead of a single query with CTE’s. But now that I’m already running SQL Server 2008, there is of course no reason not to try a CTE-based version. After all, I also started to make use of the ranking functions that have also been introduced in SQL Server 2005. So I changed the view-based code to the CTE-based but further equivalent code in PureSetbasedCTEs.sql (also in the attached ZIP file), and executed it a couple of times to see if this would affect performance. In theory, it shouldn’t – but in practice it did: the average execution time for the CTE-based version was 90,984 ms, almost 4% faster. Even more intriguing was my observation that the CTE-based version performed more constant, with all measurements between 89 and 94 seconds. The view-based version ranged from 80 to over 100 seconds! I have not been able to explain either the performance difference or the difference in variation.

     

    I also tested a third version. Not as “purely relational” as the views or the CTE’s, but much better performing: by materializing intermediate results in temporary tables instead of keeping the virtual in views or CTE’s, I worked around the annoying weakness in SQL Server’s query optimizer, namely that it does not realise that for a query that includes multiple copies of the same logic, that logic needs not be duplicated in the execution plan. Even in cases such as these, where the duplication is very easy to spot because it’s just multiple copies of the same view or CTE name, the optimizer will still spit out a plan that happily repeats the same steps over and over again. So I replaced the views with temporary tables, forcing SQL Server to materialize and reuse intermediate results instead of recalculating them over and over again. This code, that you can find in PureSetbasedTemp.sql in the attached ZIP file, did indeed perform lots better, though still a far cry from the cursor-based solution I presented in the previous posts: 32,879 milliseconds for the tiny amount of test data this solution handles.

     

    It gets worse

     

    Even though I already know that the set-based solution will never scale to the size required to handle the original problem, I still wanted to have some idea of how it scales as the maximum session size, the maximum number of sessions, and/or the amount of registrations increase. So I started with the “best” version so far, the version with temp tables, and created two variations on it: one that allows for one more session in the solution, bringing the total to a maximum of five sessions per quarter; and another one that does the same but also allows the session size to go up to a maximum of 35 candidates by using seven (size, amount) pairs for the possible sessions instead of five. These versions are also in the attached ZIP file as PureSetbasedTempMore.sql and PureSetbasedTempBigger.sql.

     

    I did not test these versions with data that actually required bigger or more sessions, but I did test all the versions with various amounts of data by changing the amount of registrations per quarter. In the table below, you will find the number of registrations I chose, the actual average number of registrations per quarter after trimming the data down to at most 70 candidates per quarter, and the running time of each of the three set-based versions with temporary tables. Note that in this case, I executed the longer running queries only once instead of my usual practice of averaging the execution time of five executions. And in case you intend to test these queries yourself, make sure to size your databases appropriately – to repeat the tests I executed, you’ll need at least 200 MB in the data file and 400 MB in the log file. Allocate less, and autogrow will probably kick in, reducing performance even further.

     

    #Sessions

    Execution time (ms)

    requested

    actual

    Base

    More

    Bigger

    6

    6

    847

    1,015

    1,600

    7

    7

    1,510

    1,934

    3,612

    8

    8

    4,546

    6,968

    14,687

    9

    8.975

    11,980

    23,071

    57,347

    10

    9.875

    32,789

    80,656

    247,186

    11

    10.675

    70,220

    202,123

    726,233

     

    After seeing these numbers, I didn’t even try to extrapolate them to the original problem (that needs 13 (size, amount) pairs for possible sessions up to 100 candidates, and up to 700 sessions per quarter with the full amount of test data). It would probably be several years, maybe even centuries – though of course I have already demonstrated that the query will never run on any current database anyway, because it’s too long and returns too many columns.

     

    Back to the cursor?

     

    By now, you might have concluded that this post shows that for the bin-packing problem, using a cursor really is the only way to go. But allow me to disagree. While there is no doubt that a truly set-based single query solution is absolutely not viable here, there are still other possibilities. As I already indicated in the first part of the series, it is possible to combine set-based queries with iteration for a solution that packs about as well as the better cursor-based variants, but runs much faster. This solution will be the subject of my next post in this series – so stay tuned!


  • Speaking at PASS

    Are you planning to attend this year’s PASS Community Summit? There’s only three weeks left before the pre-conference seminars kick off, so if you’re not registered yet, now is the time to act!

     

    The session schedule shows two days with seven full-day pre-conference seminars each, followed by three days packed full of high-quality sessions – with four time slots per day, over three days, and twelve sessions in parallel at each time slot, you problem will not be finding sessions that are of interest to you, but rather choosing between them.

     

    And you will be able to see and hear me speaking at two of those sessions.

    • On Wednesday, I will be one of the participants in Much Ado: A Panel Discussion About “Nothing”, a debate between several MVPs and SQLBloggers about the always heated subject of NULL.
    • On Thursday, I’ll present “Cursors or setbased? Why not both?”. In this session, I’ll show that using cursors or a set-based query is not an either or decision; but that there are ways to combine the strengths of both approaches into a (sometimes!) winning strategy.

    (Please note that I’m not sure if the session schedule is final or still subject to change)

     

    So if you are at PASS, I hope that you’ll decide to drop in on either or both of those sessions. But even if you don’t, I still look forward to meeting you – so if you see me passing by and want to shake hands or exchange some opinions, don’t hesitate to catch me!


  • Data modeling: art or science?

    When I started blogging here on sqlblog.com, I intended to write about stuff like T-SQL, performance, and such; but also about data modeling and database design. In reality, the latter has hardly happened so far – but I will try to change that in the future. Starting off with this post, in which I will pose (and attempt to answer) the rather philosophical question of the title: is data modeling an art or a science?

     

    Before I can answer the question, or at least tell you how I think about the subject, we need to get the terms straight. So that if we disagree, we’ll at least disagree over the same things instead of actually agreeing but not noticing. In the next two paragraphs, I’ll try to define what data modeling is and how it relates to database design, and what sets art apart from science in our jobs.

     

    Data modeling and database design

     

    When you have to create a database, you will ideally go through two stages before you start to type your first CREATE TABLE statement. The first stage is data modeling. In this stage, the information that has to be stored is inventoried and the structure of that information is determined. This results in a logical data model. The logical data model should focus on correctness and completeness; it should be completely implementation agnostic.

     

    The second stage is a transformation of the logical data model to a database design. This stage usually starts with a mechanical conversion from the logical data model to a first draft of the implementation model (for instance, if the data model is represented as an ERM diagram but the data has to be stored in an RDBMS, all entity and all many-to-many relationships become tables and all attributes and 1-to-many relationships become columns). After that, optimization starts. Some of the optimizations will not affect the layout of tables and columns (examples are choosing indexes, or implementing partitioning), but some other optimizations will do just that (such as adding a surrogate key, denormalizing tables, or building indexed views to pre-aggregate some data). The transformation from logical data model to database design should focus on performance for an implementation on a specific platform. As long as care is taken that none of the changes made during this phase affect the actual meaning of the model, the resultant database design will be just as correct and complete (or incorrect and incomplete) as the logical data model that the database design is based on.

     

    Many people prefer to take a shortcut by combining these two stages. They produce a data model that is already geared towards implementation in a specific database. For instance by adding surrogate keys right into the first draft of the data model, because they already know (or think) that they will eventually be added anyway. I consider this to be bad practice for the following reasons:

    1. It increases the chance of errors. When you have to focus on correctness and performance at the same time, you can more easily lose track.
    2. It blurs the line. If a part of a data model can never be implemented at acceptable performance, an explicit decision has to be made (and hopefully documented) to either accept crappy performance or change the data model. If you model for performance, chances are you’ll choose a better performing alternative straight away and never document properly why you made a small digression from the original requirements.
    3. It might have negative impact on future performance. The next release of your DMBS, or maybe even the next service pack, may cause today’s performance winner to be tomorrows performance drainer. By separating the logical data model from the actual database design, you make it very easy to periodically review the choices you made for performance and assess whether they are still valid.
    4. It reduces portability and maintainability. If, one day, your boss informs you that you need to port an existing application to another RDBMS, you’ll bless the day you decided to separate the logical data model from the physical database design. Because you now only need to pull out the (still completely correct) logical data model, transform again, but this time apply optimization tricks for the new RDBMS. And also, as requirements change, it is (in my experience) easier to identify required changes in an implementation-independent logical data model and then move the changes over to the physical design, than to do that all at once if only the design is available.
    5. It may lead to improper choices. More often that I’d like, I have seen good modelers fall victim to bad habits. Such as, for instance, adding a surrogate key to every table (or entity or whatever) in the data model. But just because surrogate keys are often good for performance (on SQL Server that is – I wouldn’t know about other DBMS’s) doesn’t mean they should always be used. And the next step (that I’ve witnessed too!) is forgetting to identify the real key because there already is a key (albeit a surrogate key).

     

    Art and science

     

    For the sake of this discussion, “art” (work created by an artist) is the result of some creative process, usually completely new and unique in some way. Most artists apply learned skills, though not always in the regular way. Artists usually need some kind of inspiration. There is no way to say whether a work of art is “good” or “bad”, as that is often in the eye of the beholder – and even if all beholders agree that the work sucks, you still can’t pinpoint what exactly the artist has done wrong. Examples of artists include painters, composers, architects, etc. But some people not usually associated with art also fit the above definition, such as politicians, blog authors, or scientists (when breaking new grounds, such as Codd did when he invented the relational model). Or a chef in a fancy restaurant who is experimenting with ingredients and cooking processes to find great new recipes to include on the menu.

     

    “Science” does NOT refer to the work of scientists, but to work carried out by professionals, for which not creativity but predictability is the first criterion. Faced with the same task, a professional should consistently arrive at correct results. That doesn’t imply that he or she always will get correct results, but if he or she doesn’t, then you can be sure that an error is made and that careful examination of the steps taken will reveal exactly what that error was. Examples of professionals include bakers, masons, pilots, etc. All people that you trust to deliver work of a consistent quality – you want your bread to taste the same each day, you want to be sure your home won’t collapse, and you expect to arrive safely whenever you embark a plane. And a regular restaurant cook is also supposed to cook the new meal the chef put on the menu exactly as the chef instructed.

     

    Data modeling: art or science?

     

    Now that all the terms have been defined, it’s time to take a look at the question I started this blog post with – is data modeling an art or a science? And should it be?

     

    To start with the latter, I think it should be a science. If a customer pays a hefty sum of money to a professional data modeler to deliver a data model, then, assuming the customer did everything one can reasonably expect1 to answer questions from the data modeler, the customer can expect the data model to be correct2.

     

    1          I consider it reasonable to expect that the customer ensures that all relevant questions asked by the data modeler are answered, providing the data modeler asks these questions in a language and a jargon familiar to the customer and/or the employees he interviews. I do not consider it reasonable to expect the customer (or his employees) to learn language, jargon, or diagramming style used by data modelers.

    2          A data model is correct if it allows any data collection that is allowed according to the business rules, and disallows any data collection that would violate any business rule.

     

    Unfortunately, my ideas of what data modeling should be like appear not to be on par with the current state of reality. One of the key factors I named for “science” is predictability. And to have a predictable outcome, a process must have a solid specification. As in, “if situation X arises, you need to ask the customer question Y; in case of answer Z, add this and remove that in the data model”. Unfortunately, such exactly specified process steps are absent in most (and in all commonly used!) data modeling methods. However, barring those rules, you have to rely on the inspiration of the data modeler – will he or she realize that question Y has to be asked? And if answer Z is given, will the modeler realize that this has to be added and that has to be removed? And if that doesn’t happen, then who’s to blame? The modeler, for doing everything the (incomplete) rules that do exist prescribe, but lacking the inspiration to see what was required here? Or should we blame ourselves, our industry, for allowing ourselves to have data modeling as art for several decades already, and still accepting this as “the way it is”?