THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

Bin packing part 2: Packing it tighter

In my previous post, I explained the bin packing problem, explained an example scenario, and established a baseline for both speed and efficiency of bin packing algorithms by writing a rather crude cursor-based procedure. In this part, I will look at some small modifications that can be made to this code to make it better at packing bins as full as possible, so that less bins will be needed.

 

Reordering the cursor

 

The Baseline procedure didn’t process the registrations in any particular order. Granted, there is an ORDER BY Year, Quarter in the cursor declaration, but that is only needed to enable me to generate consecutive session numbers within each quarter without having to do a SELECT MAX(SessionNo) query. Had I elected to use an IDENTITY column for the Sessions table, I could even have omitted the ORDER BY completely.

 

Since there is no order specified for the registrations within the quarter, we can assume that they will be processed in some unspecified order. But maybe we can get a more efficient distribution of registrations if we order them before processing? I am, of course, not referring to ordering by subject code, as the actual subjects are irrelevant for the algorithm – I am referring to order by the number of candidates in a registration.

 

This is very easy to test, of course. All it takes is adding one extra column to the ORDER BY clause of the cursor definition. So I created the procedures dbo.OrderAsc and dbo.OrderDesc to test the effects of ordering by ascending or descending number of candidates (see OrderAsc.sql and OrderDesc.sql in the attached ZIP file).

 

Ordering by ascending number of candidates turned out to be a pretty lousy idea. Well, to be fair, I didn’t expect otherwise – after all, if you save all the biggest registrations for the last, you’ll have no smaller registrations left to fill up the gaps. In fact, all registrations for 51 or more candidates will get a session of their own, and will not be combined with any other registration. So it’s not surprising at all to see that this method results in a huge amount of extra sessions as compared to the baseline version – an increase of no less than 19.9%!

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       6435        78.116860        14081.800000

2       3419        81.741444        6242.600000

3       7796        77.624166        17444.200000

4       5686        73.602004        15009.900000

ALL     23336       77.383227        13194.625000

 

Equally unsurprising is the fact that changing the ORDER BY clause to sort by descending number of candidates results in more successful packing of the registrations in less sessions. This version saves 2.7% as compared to the baseline, as shown in this breakdown:

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       5085        98.855850        581.800000

2       2799        99.847802        42.600000

3       6771        89.374981        7194.200000

4       4268        98.055529        829.900000

ALL     18923       95.429635        2162.125000

 

Even though this part mainly focuses on efficiency achieved, the time taken still remains an important factor. So I ran these two procedures five times in a row, recorded execution times, and calculated the average.  For dbo.OrderAsc, the average execution time was 78,531 ms, whereas dbo.OrderDesc clocked in at 77,934 ms. As you see, both versions are slightly slower than the baseline version. For the OrderDesc version, this is caused by the rapid growth of the number of sessions as the first, biggest registrations are processed first. This means that searching a session with enough empty space for a registration soon becomes a more time-consuming task. For OrderAsc, the reverse is true – since the smallest registrations are processed first, there will at first be only a few sessions. This means that this algorithm will be a lot faster at first – but once the bigger registrations are processed and the total number of sessions rapidly increases to be way more that that in the baseline version, this advantage is lost, and the time required to search for sessions with enough empty space soon gets so high that the advantage this algorithm had as first then turns into a disadvantage.

 

Sorting the registrations by ascending number of candidates within a quarter before processing them hurts both speed and packing efficiency of the algorithm; we can henceforth forget about this option. On the other hand, sorting by descending number of candidates increases the packing efficiency by 2.7%, though this comes at the price of a 5.7% increase in execution time. If I had to choose between these two, my pick would depend on the needs of the organization I’m working for – if the cost per session is high and plenty of time is available for the computation, I’d go with the ordered version, but if speed is more  important than saving those few extra sessions, I’d use the unsorted one.

 

Less obvious orderings

 

But why choose between these two only? I have thus far only considered the obvious sort orders, ascending and descending. Why not try a few more variations?

 

Thinking about the even distribution of data generated for the first quarter of each year in my test set, I considered that, were I tasked with manually combining the registrations as efficient as possible, I’d probably start by making sessions by combining two registrations for 50 candidates, than combining a registration for 51 candidates with one for 49, and so on. All of these sessions would total 100 candidates, and because of the even distribution of data, I expect roughly the same number of registrations for each size so I’d have only a few spare sessions left in the end.

 

That technique can’t be exactly mimicked by changing the sort order of the cursor. There are other ways to mimic it, though – but I’ll leave those for a future post J. But we can simulate this effect by ordering the registrations so that those with 50 candidates come first, then those with 49 and 51 candidates, and so on. This is done by changing the ORDER BY clause in the cursor definition to order by the “distance” between the number of candidates and the magic number 50, being half the maximum session size:

 

  ORDER BY Year, Quarter,

           ABS(NumCandidates - (@MaxCandidatesPerSession / 2.0)) ASC;

 

I didn’t hardcode the number 50, because I wanted my stored procedures to be fit for any maximum number of candidates. I divide by 2.0 instead of just 2 so that for an odd maximum session size (e.g. 25), the fraction is retained and registrations for 12 and 13 candidates are kept together because they are the same distance from half the maximum size (12.5).

 

It is of course also possible to use DESC instead of ASC to start with registrations for 100 candidates, then those for 99 or 1, and so on, saving the 50-candidate registration for the last. Both these versions are included in the attached ZIP file, in the files Order50First.sql and Order50Last.sql.

 

These versions both took slightly more time than the baseline version when I tested them on my laptop: 76,807 ms for dbo.Order50First, and 77,322 ms for dbo.Order50Last. The packing efficiency of dbo.Order50First is better than the baseline, but not as good as that of dbo.OrderDesc:

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       5096        98.642464        691.800000

2       2805        99.634224        102.600000

3       6780        89.256342        7284.200000

4       4269        98.032560        839.900000

ALL     18950       95.293667        2229.625000

 

For dbo.Order50Last, the resulting number of sessions is even more than we had in the baseline!

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       5444        92.336884        4171.800000

2       3032        92.174802        2372.600000

3       6928        87.349595        8764.200000

4       4617        90.643491        4319.900000

ALL     20021       90.196044        4907.125000

 

The reason for the disappointing efficiency of the dbo.Order50First procedure is that there is no control over the order of the registrations that have the same distance to 50. So it is quite possible, for instance, to start with a bunch of registrations for 49 candidates that will promptly be combined to sessions for 98 candidates each – so that, when the 51-sized registrations start coming in, they have to get sessions of their own. In an attempt to fix that, I tweaked the sort order some more, making sure that the for registrations with the same distance from 50, the “bigger” registrations come before the “smaller” ones.

 

  ORDER BY Year, Quarter,

           ABS(NumCandidates - (@MaxCandidatesPerSession / 2.0)) ASC,

           NumCandidates DESC;

 

With this ORDER BY clause, I can be certain that all 51-candidate registrations are processed first, each getting its own session. After that, the 49-candidate registrations will exactly fill out all those sessions. This version (enclosed in Order50FirstB.sql) had a slightly better packing ration than dbo.Order50First – but still not as good as dbo.OrderDesc. Here are the results, which took 75,547 ms (on average) to achieve:

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       5089        98.778148        621.800000

2       2804        99.669757        92.600000

3       6771        89.374981        7194.200000

4       4268        98.055529        829.900000

ALL     18932       95.384270        2184.625000

 

After these tests, there was still one thing left I wanted to try. Starting with registrations for 50 places was based on an idea for evenly distributed data. For other distributions, this might turn out to be a much worse idea (though the results don’t show as much). But what if, instead of starting at half the maximum session size, we start at the average registration size? For evenly distributed data, this should work out approximately the same. But maybe this order achieves a better packing ratio for other distributions? Let’s find out.

 

Ordering by distance from the average session size for a quarter can be accomplished by using a correlated subquery in the ORDER BY clause (compatible with all versions of SQL Server), or by using an AVG function with the OVER clause (only SQL Server 2005 and up):

 

  ORDER BY a.Year, a.Quarter,

           ABS(a.NumCandidates - (SELECT AVG(b.NumCandidates * 1.0)

                                  FROM   dbo.Registrations AS b

                                  WHERE  b.Year = a.Year

                                  AND    b.Quarter = a.Quarter)) ASC;

or

  ORDER BY a.Year, a.Quarter,

           ABS(a.NumCandidates - AVG(a.NumCandidates * 1.0)

                   OVER (PARTITION BY a.Year, a.Quarter)) ASC;

 

Surprisingly, when dry-testing the query by itself, the correlated subquery turned out to be faster than the one using the OVER clause, so I didn’t have to sacrifice speed for backward compatibility. I used the correlated subquery, both with the ASC and the DESC sort option (see OrderHalfFirst.sql and OrderHalfLast.sql in the attachment), to test the two possible variations of this option. Both versions turned out to be quite inefficient packers, since they both took more sessions than the baseline. Here are the results of dbo.OrderHalfFirst, acquired in 75,056 ms:

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       5120        98.180078        931.800000

2       3264        85.623161        4692.600000

3       6771        89.374981        7194.200000

4       5044        82.970063        8589.900000

ALL     20199       89.401207        5352.125000

 

And OrderHalfLast, after running 79,294 ms, produced these results:

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       6036        83.280649        10091.800000

2       2941        95.026861        1462.600000

3       7796        77.624166        17444.200000

4       4951        84.528580        7659.900000

ALL     21724       83.125345        9164.625000

 

Conclusion

 

I’ve investigated many options to increase packing efficiency. It turns out that, at least with the test data I used, just starting with the biggest registration and working down to the smallest yields the best results. This is not the fastest option, though. The baseline version discussed in the previous episode of this series is still fastest. So the choice would appear to depend on the requirements of your application – if you have plenty of time and computer capacity but need to use as little sessions as possible, go for the dbo.OrderDesc version. If execution time is of utmost importance and a few extra sessions are no big deal, then stick to the baseline (for now).

 

If you are in search of a solution that offers both speed and efficient packing, then the dbo.Order50FirstB version seems to be the right choice. It is only 0.05% less efficient than the best packer (dbo.OrderDesc), but over 3% faster. In the next episode I’ll be looking at ways to make the algorithm go faster. I’ll be making huge performance improvements – but packing efficiency will suffer. How much? As soon as I have completed all my tests and written the accompanying text, you’ll read it. Right here on sqlblog.com.

Published Monday, December 24, 2007 12:33 AM by Hugo Kornelis

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server' said:

In the first post of this series, I explained the bin-packing problem and established a baseline solution.

January 11, 2008 5:24 PM
 

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server' said:

Almost a year ago now, I started a series of blog post on the bin packing problem . But after the first

October 26, 2008 7:30 PM
 

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server' said:

One of the most common techniques authors use to keep their readers interested is to leave them with

October 17, 2011 6:11 PM

Leave a Comment

(required) 
(required) 
Submit

About Hugo Kornelis

Hugo is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to visit the SQL Server newsgroups, in order to share and enhance his knowledge of SQL Server.
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement