THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

Bin packing part 4: The set-based disaster

Almost a year ago now, I started a series of blog post on the bin packing problem. But after the first three posts, various reasons caused the research I still had to do for the fourth part to be massively delayed. It’s only now that I have finally found the time to finish my research and write up the fourth installment.

 

After a nine-month delay, I can hardly expect you to remember what I covered in the first posts, so you may want to follow these links to re-read that:

·        The first post includes an explanation of the bin-packing solution and establishes a baseline that all other solutions have to be compared against – for both performance (lowest speed) and efficiency (lowest number of sessions, or bins).

·        The second post introduces several techniques to increase packing efficiency, though at the cost of reduced performance.

·        In the third post, I investigate several ways to improve performance, and find out exactly how much they decrease packing efficiency.

 

Changed version, changed performance

 

All performance numbers quoted in the previous posts were based on tests on my laptop, which was at that time running SQL Server 2005, SP2. But not anymore. In the meantime, I have upgraded my laptop to SQL Server 2008 RTM, so it’s highly probable that performance of all queries tested so far has changed – hopefully for the better. Since I don’t want to revive SQL Server 2005 on my laptop, I had no choice but to repeat all tests that I ran earlier in the course of researching and writing the first three parts of this series. I won’t bore you with all the details; instead I’ll just show an updated version of the table that I posted in the conclusion of the third part, listing the performance and efficiency of the baseline and of all versions of the algorithm that show an interesting trade-off between efficiency and performance.

 

Version

Execution time (ms)

Number of sessions

Baseline

76,928

19,457

FillThenNext

37,968

23,630

FillThenSearch (with index)

42,353

19,749

FillThenSearchDesc (with index)

46,977

18,925

Order50FirstB (with index)

63,924

18,923

OrderDesc (with index)

73,824

18,923

 

If you compare the table above to the table I posted in part 3, you’ll notice a few interesting changes:

·        There are no really big performance changes – queries got just a couple of percents faster or slower, but not the big speedups one would hope for when moving to a new version.

·        Two of the algorithms, the baseline and OrderDesc (with index) are now slower than before. The others all got a bit faster.

·        There’s also a new algorithm in the table: Order50FirstB (with index). Without the index, this 18,932 sessions – but with the index, this improved to 18,923. Unfortunately, when I originally tested the effects of the index I was too focused on the performance to note the increased packing efficiency. So, at the presumed 18,932 sessions I concluded that another algorithm (FillThenSearchDesc) was both faster and more efficient, and thus excluded this one from the table – whereas in reality, Order50FirstB should have been in and OrderDesc should have been out (for it’s now exactly as efficient but slower). In fact, the only reason why I still included  OrderDesc (with index) in the table above is because it was so that you could compare it to the table in the previous part.
And in case you are wondering how it’s possible that adding an index affected packing efficiency as well as speed, here’s why: the Order50FirstB procedure uses a SELECT TOP 1 query without ORDER BY, which is known to produce undetermined results; in this case, adding the INDEX obviously affected what sessions were returned whenever this query was executed, and this had its effects on the packing efficiency. Which only goes to prove, once more, that one should never use the TOP clause without ORDER BY. (Okay, in this case I’m pretty satisfied with the unexpected change, but it could just as well have been the other way around!)

 

Less sample data

 

As promised in the previous part, I will now focus on developing a purely set-based solution for the bin packing problem. But before I can go there, I have to produce a new set of test data that is drastically reduced in size as compared to the original test data. There are various reasons for this. One reason is that the set-based solution requires an advanced knowledge of the number of sessions required, with the size of the query expanding if the theoretic number of sessions increases. Another reason is that the size of the query also increases as the session size increases, so I wanted to test with a smaller session size first. And finally, the performance of this tiny test-set turned out to be already pretty bad; I’ll try to make an estimation of the total execution time for a set-based solution for the original problem near the end of this post – after which you’ll understand why I never actually built or tested this version.

 

In the attached ZIP file, you’ll find the file “Tiny Testset (Random).sql”. This is a new version of the random test data generator presented previously in the first part of this series, with the following modifications:

·        The maximum number of candidates per registration that was hardcoded as 100 in the original script has been replaced by a variable, defaulting to 10.

·        All quarters now use the same, simple expression to get an even distribution of registration sizes between 1 and the maximum size. The switches to include or exclude each quarter have been removed. I felt that the extra complexity of different distributions was not required in this case.

·        Since I had to code the query for a fixed maximum number of possible solutions, I added an extra variable for the maximum number of candidates in all sessions for each quarter; if the random distribution of registration sizes causes this to be exceeded, the script will simply remove the largest registrations until the total size of all remaining sessions in each quarter no longer exceeds the maximum.

 

For my first round of testing, I decided to limit the sessions to a maximum of 20 candidates, and to assume a maximum of four sessions per quarter. I limited the amount of subjects (sessions per quarter) to 10, and set the maximum registration size to 10 as well. With these numbers, I decided to set the maximum number of candidates per quarter to 70. In theory, this still allows for impossible data combinations – for instance, there is no way that you can ever fit 10 registrations for 7 candidates each in less than 5 sessions –, but none of the tests I ran ever actually ran into this problem. But if you are going to run your own test with a different seed for the randomizer, or with different settings for other variables, you need to be aware of this possibility. Since the set-based algorithm will simply not return any results for a quarter that needs more sessions than the query caters for, this problem is easy to spot – you just need to check that the query returns results for as many quarters as your test data spans (which can be done by simply checking the amount of rows returned – easily visible when returning results in grid view in SQL Server Management Studio).

 

The set-based solution

 

Credit where credit is due. The set-based solution I present here is not mine. I have first seen this solution in a newsgroup posting by John Gilson, dating back to early 2005. His code did contain some minor errors that I corrected, and I have also added some optimizations to his code – but the logic used in this solution is entirely his.

 

The set-based solution presented here is not implemented in a single query but as a series of views, building on top of each other. This has several advantages. One of them is that the logic is easier to understand. Another advantage is that it prevents endless code repetition: every view can reference the previously defined views by name instead of repeating the entire definition of the view as a subquery. This does not automatically lead to performance gains though – SQL Server will replace each view by its definition before handing the result over to the query optimizer, so the resulting plan will be exactly the same as if the entire query were written out referencing only base tables.

 

As in the previous parts, the code is too long to reproduce here. In the attached ZIP file, you’ll find the code for this solution in the script “PureSetbasedViews.sql”, complete with extensive comments. This code also contains (commented out) statements to review the contents of each of the intermediate views, so if you wish you can uncomment them and see how the steps build upon each other to massage the data into the final solution.

 

Step 1: Aggregating by size

 

The first step of the algorithm is pretty simple. It builds on the basic idea that in any give solution, it’s not important which subjects are included in each session, but only how many of each size. And for the purpose of the bin packing algorithm, subjects with the same number of candidates are completely interchangeable. As an example, let’s say that there are five registrations for a given quarter, three for 6 candidates each and two for 9 candidates each. This will be represented by five rows in the dbo.Registrations table, but it can just as well be represented by two rows – one row representing 3 registrations for 6 candidates, and a second row representing 2 registrations for 9 candidates. This is implemented in the attached code in the dbo.RegsBySize view.

 

The great advantage of this is that it vastly reduces the number of potential (but not really different) solutions. For instance, there may be solutions where a session is built by combining one of the registrations for 6 candidates is with one of the registrations of 9 candidates. When building straight from the dbo.Registrations table, any possible set-based solution would find 6 variations on this session (combining each of the three 6-candidate registrations with each of the two 9-candidate ones); when building from dbo.RegsBySize, there will be just one row representing this possible combination of two registrations.

 

Step 2: Building sessions

 

Using the aggregated registrations, the second step is to combine one or more registrations into possible sessions of all sizes from one up to the maximum session size. Combinations of registrations that would exceed the maximum session size are excluded, but combinations that are smaller than the maximum are not. The view dbo.PossibleSessions describes a completely denormalized result set, holding up to five (Size, Count) pairs from the dbo.RegsBySize view. Like in the previous step, only one row is created for each possible permutation – so based on the example above, there will be one row for “1 registration size 6 and 1 registration size 9 (total size 15)”, even though two such sessions could be formed from the total registrations available. Other possible sessions constructed from the above sample data would be “1 registration size 6 (total size 6)”, “1 registration size 9 (total size 9)”, “2 registrations size 6 (total size 12)”, “3 registrations size 6 (total size 18)”, and “2 registrations size 9 (total size 18)”.

 

As you can see, this view also includes a column SessionSize (holding the total number of candidates in all the session’s registrations) and columns for the total number of sessions for each size available. The former can easily be computed from the Size and Cnt columns, and the latter can easily be fetched by joining to dbo.RegsBySize – but redundantly including these columns here vastly reduces the amount of code required in the next step.

 

The final column in this view, RowNum, uses the ROW_NUMBER() function to assign each possible session a unique number within its quarter. The numbers are assigned by ordering the possible sessions by session size and then by size and count of each of the (size, count) pairs; this could in fact have easily been replaced by any other ordering – the only goal is to have some method to assign unique row-numbers within each quarter. These numbers are used in the next step , when solutions are made by combining multiple possible sessions, to prevent duplicate solutions – for instance, if one solution combines possible sessions 1 and 2, we don’t want another solution formed by combining sessions 2 and 1. If you want to run this code on SQL Server 2000, or on any other database platform that does not support the ROW_NUMBER() function, check John Gilson’s original posting (that was written before SQL Server 2005 was released) to see how he achieved the same goal without the RowNum column – by adding lots of complicated comparisons in the next logical step).

 

You’ll probably note that the maximum session size of 20 is hardcoded in this view, since parameters are not supported in views. I could have used an inline table-values user-defined function instead (sacrificing support for SQL Server 2000 and before), but that would not really have helped, since the number of columns this view requires is also dependant on the maximum session size. The five (size, count) pairs supported in this version are not chosen arbitrarily; it is the minimum number required to describe all possible permutations of registrations whose total size does not exceed 20. The smallest possible session size that would require a sixth (size, count) pair would be 21 (combining one registration each of the sizes 1, 2, 3, 4, 5, and 6). Similarly, a seventh (size, count) pair (and accompanying avail column) would be required when the maximum session size is 28 or more, and so on. For the maximum session size of 100 that I used in the original test data, no less than 13 (size, count) pairs are required – so if you want to extend the set-based solution to this original size, prepare to expand the view to a total of 43 columns, and no less than a whopping 307 lines of code (if I didn’t mess up my estimations…). I’m sure this makes you start appreciating why I chose to limit myself to a much smaller set of test data for this part of the series!

 

Step 3: Finding solutions

 

Now that we know all possibilities to assemble sessions of all allowed sizes, it’s time to combine these possible sessions into complete solutions. Of course, not any combination is a solution – in order to qualify as a solution, each registration has to be used exactly once. This is implemented with a clever trick – the code checks to see that no registration is used more than once (or rather, that no registration size is used more often than the number of registrations of that size), and that the total number of candidates in all sessions combined equals the total number of candidates in all registrations. Since the only way to get at a total equal to the total registration size without using registrations twice is to use them all, this requirement is equivalent with the requirement to use each registration exactly once, but much easier to test.

 

When assembling solutions from the possible sessions view, it’s important to remember that the possible sessions stores permutations – so it’s allowed to use multiple instances of the same possible session as long as the total number of available registrations of each size is not exceeded. So going back to the sample data presented earlier, one possible solution would include two occurrences of the “1 registration size 6 and 1 registration size 9 (total size 15)” session, plus one occurrence of “1 registration size 6 (total size 6)” to top it off. Another solution would include one occurrence of “3 registrations size 6 (total size 18)”, and one occurrence of “2 registrations size 9 (total size 18)”.

 

If you look at the dbo.Solutions view in the code, you’ll see a design that is even more extremely denormalized than the dbo.PossibleSessions view already was. It includes the five (size, count) pairs already familiar from the latter, but repeats them four times – thus allowing for a maximum of four possible sessions to be combined to make up a solution. This means that the solution fails when it’s given data that requires five or more sessions to accommodate all registrations – it will simply not return any rows for that quarter. To solve that, all you have to do is add more columns for a fifth session, and add an extra LEFT JOIN to a fifth occurrence of the dbo.PossibleSessions view. You may have noted that the join condition for each join references all “previously” joined occurrences of dbo.PossibleSessions, thus increasing in size with each additional occurrence.

 

Apart from the columns for the denormalized description of all sessions in a solution, the dbo.Solutions view includes four more columns. Apart from the unavoidable Year and Quarter columns, there is a NumSessions column holding the number of sessions used, based on a simple CASE expression (only added for easier understanding of the output; it’s not really required). The final column calculates a ranking for each solution by using the ROW_NUMBER() function, partitioning by quarter and ordering by number of sessions in the solution (least sessions used first) using a clever trick based on the fact that NULL sorts before other values (I could also have repeated the CASE expression used for the NumSessions column, but that would have been more typing). Note that the order of solutions with the same number of sessions is irrelevant for the approach used here. This Ranking column will be used in the next step.

 

If you have already checked the query, you’ll probably understand that it’s impossible to scale this to the dimensions required to solve the original problem. The maximum session size of 100 candidates already forces us to use 13 (size, count) pairs in the dbo.PossibleSessions view, that all return here; the inclusion of one copy of that view per session in the solution moves us into some really insane figures. For instance, if you want to allow up to 200 sessions per quarter, you’d need to define 5,204 columns in the view (which is more than SQL Server allows), and the CREATE VIEW statement would have more than four million lines, well beyond the maximum batch size SQL Server allows. And that would still not be enough for the original test data – the tests conducted earlier show that some quarters need slightly more than 700 sessions. Maybe this query will find a way to pack them into less than 700, but then we’d first need to find a database that allows views with 18,204 columns and a definition of 51,155,633 (!) lines…

 

Step 4: The best solution …

 

After all this hardcore query magic, the fourth and final step is almost disappointingly simple. Finding the best solution is a simple matter of returning the solution with the lowest score in the Ranking column – i.e. a ranking equal to 1. And if you prefer to see all possible solutions with the lowest possible number of sessions, all it takes is to replace ROW_NUMBER() with RANK() in the definition of the dbo.Solutions view – this will assign the number 1 to all solutions with the lowest number of sessions, and higher numbers to all solutions that use more sessions.

 

Once more, you can check John Gilson’s posting to see that without ROW_NUMBER() the dbo.Solutions view has to be joined to (a derived table based upon) itself. In practice this means that after replacing views by their definitions, the resulting query passed internally to the query optimizer doubles in size – I don’t even want to begin to imagine what effect this will have on this solution’s performance. Especially considering that performance is already pretty bad with using the ROW_NUMBER() function.

 

Maybe this is also a good place to point out that the results you get when you query the dbo.BestSolutions view are not actually yet the final results. Okay, you will get a result set showing that, for some quarter, you have a session consisting of three registrations for five candidates plus one registration for three candidates, but you still have to select the four actual registrations to assign to this session. Since this blog post is already longer than any I’ve done before (and still growing), I decided to leave that part as an exercise to the reader. Just one word of warning – if you come up with a solution that uses multiple copies of the dbo.BestSolutions view, replace it with a temporary table holding the contents of the view. Otherwise, SQL Server will happily produce a query execution plan where dbo.BestSolutions is evaluated as often as you use it in your query – and trust me, once you’ve seen how it performs, you don’t want that to happen!

 

… for some definition of “best”

 

I you have already checked out the definitions of the views used, you probably don’t expect a great performance anymore. And after reading my ominous hints above, you’re really bracing yourself – and with good right. With the amount of test data reduced to just ten (or less) registrations per quarter, querying the dbo.BestSolutions view to find the best solutions for all quarters (on an empty cache) turns out to take an average execution time of 94,535 milliseconds – that’s more than one and a half minute!

 

I have already indicated that the original code that I based this solution on was written before SQL Server 2005 was available. That is probably the reason why separate views were used instead of a single query with CTE’s. But now that I’m already running SQL Server 2008, there is of course no reason not to try a CTE-based version. After all, I also started to make use of the ranking functions that have also been introduced in SQL Server 2005. So I changed the view-based code to the CTE-based but further equivalent code in PureSetbasedCTEs.sql (also in the attached ZIP file), and executed it a couple of times to see if this would affect performance. In theory, it shouldn’t – but in practice it did: the average execution time for the CTE-based version was 90,984 ms, almost 4% faster. Even more intriguing was my observation that the CTE-based version performed more constant, with all measurements between 89 and 94 seconds. The view-based version ranged from 80 to over 100 seconds! I have not been able to explain either the performance difference or the difference in variation.

 

I also tested a third version. Not as “purely relational” as the views or the CTE’s, but much better performing: by materializing intermediate results in temporary tables instead of keeping the virtual in views or CTE’s, I worked around the annoying weakness in SQL Server’s query optimizer, namely that it does not realise that for a query that includes multiple copies of the same logic, that logic needs not be duplicated in the execution plan. Even in cases such as these, where the duplication is very easy to spot because it’s just multiple copies of the same view or CTE name, the optimizer will still spit out a plan that happily repeats the same steps over and over again. So I replaced the views with temporary tables, forcing SQL Server to materialize and reuse intermediate results instead of recalculating them over and over again. This code, that you can find in PureSetbasedTemp.sql in the attached ZIP file, did indeed perform lots better, though still a far cry from the cursor-based solution I presented in the previous posts: 32,879 milliseconds for the tiny amount of test data this solution handles.

 

It gets worse

 

Even though I already know that the set-based solution will never scale to the size required to handle the original problem, I still wanted to have some idea of how it scales as the maximum session size, the maximum number of sessions, and/or the amount of registrations increase. So I started with the “best” version so far, the version with temp tables, and created two variations on it: one that allows for one more session in the solution, bringing the total to a maximum of five sessions per quarter; and another one that does the same but also allows the session size to go up to a maximum of 35 candidates by using seven (size, amount) pairs for the possible sessions instead of five. These versions are also in the attached ZIP file as PureSetbasedTempMore.sql and PureSetbasedTempBigger.sql.

 

I did not test these versions with data that actually required bigger or more sessions, but I did test all the versions with various amounts of data by changing the amount of registrations per quarter. In the table below, you will find the number of registrations I chose, the actual average number of registrations per quarter after trimming the data down to at most 70 candidates per quarter, and the running time of each of the three set-based versions with temporary tables. Note that in this case, I executed the longer running queries only once instead of my usual practice of averaging the execution time of five executions. And in case you intend to test these queries yourself, make sure to size your databases appropriately – to repeat the tests I executed, you’ll need at least 200 MB in the data file and 400 MB in the log file. Allocate less, and autogrow will probably kick in, reducing performance even further.

 

#Sessions

Execution time (ms)

requested

actual

Base

More

Bigger

6

6

847

1,015

1,600

7

7

1,510

1,934

3,612

8

8

4,546

6,968

14,687

9

8.975

11,980

23,071

57,347

10

9.875

32,789

80,656

247,186

11

10.675

70,220

202,123

726,233

 

After seeing these numbers, I didn’t even try to extrapolate them to the original problem (that needs 13 (size, amount) pairs for possible sessions up to 100 candidates, and up to 700 sessions per quarter with the full amount of test data). It would probably be several years, maybe even centuries – though of course I have already demonstrated that the query will never run on any current database anyway, because it’s too long and returns too many columns.

 

Back to the cursor?

 

By now, you might have concluded that this post shows that for the bin-packing problem, using a cursor really is the only way to go. But allow me to disagree. While there is no doubt that a truly set-based single query solution is absolutely not viable here, there are still other possibilities. As I already indicated in the first part of the series, it is possible to combine set-based queries with iteration for a solution that packs about as well as the better cursor-based variants, but runs much faster. This solution will be the subject of my next post in this series – so stay tuned!

Published Monday, October 27, 2008 2:28 AM by Hugo Kornelis

Attachment(s): ImEx4.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Peso said:

October 27, 2008 3:45 AM
 

Hugo Kornelis said:

Hi Peso,

Apologies for the delayed response - and thanks for the link.

I didn't yet have time to cimpletely dig to the bottom of that query. I do see that it's for a related, but not the same problem. This query will find a combination of packages for a single bin, and find the one that comes as close to the size as possible (either above or below it). To use it for bin packing, one would have to change it so that it will not return combinations that exceed the specified size, and then add logic to remove the packages used from the list and then repeat for the next bin.

I plan to take a more detailed look at the approach taken later, when I have more time available. It looks interesting enough to warrant further study, so once more - thanks for the link!

Best, Hugo

October 31, 2008 7:58 AM
 

RBarryYoung said:

I am relieved to see that you have not fallen into the fallacy of binary thinking seen in so many Cursor vs. non-Cursor arguments.  Cursors and pure Set-based SQL are hardly the only two choices, particularly on SQL Server.  Both recursive CTE's and pseudocursors can solve a huge range of problems that are not tractable to set-based solutions and usually significantly faster than Cursors (though admittedly, the later has supportability questions).

But the obvious thing to try for a restricted scheduling type problem like this one is a more general purpose programming language instead of relying solely on a data management and modification language like SQL.  Using .Net code, whether client-based or SQLCLR based seems like the best bet.

November 3, 2008 9:49 AM
 

Peso said:

Yes. The algorithm was originally written for fetching exact sum only, but has been widened to get closest of undersum or oversum, if exact sum is not found.

The algorithm in the link is only a part of the complete problem, but it handles the core, to find the sum and which amounts that supports to the solution.

It will be a trivial task to update inventory and remove items found in the algorithm, and run the algorithm again.

I am looking forward to your investigations.

November 7, 2008 1:35 PM
 

Peso said:

November 22, 2008 5:20 AM
 

Peso said:

November 23, 2008 9:34 AM
 

Peso said:

If you change the last SELECT to this instead

SELECT theSum,

COUNT(*) AS theCount

FROM (

SELECT SUM(d.faceValue * ((v.number * d.maxItems / d.permCount) % d.maxItems)) AS theSum

FROM @Data AS d

INNER JOIN TallyNumbers AS v ON v.number <= @comb

GROUP BY v.number

HAVING SUM(d.faceValue * ((v.number * d.maxItems / d.permCount) % d.maxItems)) > 0

) AS d

GROUP BY theSum

ORDER BY COUNT(*) DESC,

theSum

You can see which sum you can combine the most times with original data.

In my case, I can see that the sum of 1550 can be combined for 800 different combinations.

November 23, 2008 9:44 AM
 

fred said:

when are you going to update your blog?

October 27, 2009 1:38 PM
 

Adam Machanic said:

Going to have to second Fred's question. I just noticed that it's been a year to the day. Come on Hugo, surely you can come up with some topic to post about?

October 27, 2009 3:04 PM
 

Hugo Kornelis said:

And I'll third Fred's question.

The problem is not lack of topics - hell, I still have to finish this very series about bin packing, and the next part will be the most fun of them all!

The real problem is three-fold: A combination of (a) lack of spare time; (b) other interests that nibble away at what little spare time is left; and (c) a certain amount of blog-tiredness. I have not only not been wposting for a year, I've also not been reading any blogs for about half a year now.

The good news is that I have been kicking by own butt lately to resume blogging. And I have already done some prep work for the next post (though it's still far from finished). With still very little time to spare, I can't make any promises on an ETA for the next part, but I can assure you that it's still in the pipeline, and that I will finish it. One day.

Thanks for maintaining interest in my blog, even after such a long period of neglect from me.

October 27, 2009 5:26 PM
 

fred said:

No worries dude.....maybe you can hire a jr dba as an assistant. keep up the good work!

October 29, 2009 10:18 PM
 

Joey said:

Here ia another approach which is relatively fast

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=80857

and also http://www.rapidpig.com

August 10, 2010 8:25 AM
 

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server' said:

One of the most common techniques authors use to keep their readers interested is to leave them with

October 17, 2011 6:11 PM

Leave a Comment

(required) 
(required) 
Submit

About Hugo Kornelis

Hugo is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to visit the SQL Server newsgroups, in order to share and enhance his knowledge of SQL Server.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement