THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server'

Bin packing part 3: Need for speed

In the first post of this series, I explained the bin-packing problem and established a baseline solution. The second post investigated ways to increase the packing efficiency. In none of these posts did I pay particular attention to performance – and frankly, it shows. Performance of all solutions presented thus far sucks eggs. Time to see what can be done about that.

 

If you look in detail at the most efficient solution so far (dbo.OrderDesc, as described in the second post of the series), you’ll see that all 40,000 registrations are processed one by one, with a cursor. No surprise here, as I’ve already promised to postpone the set-based solution to a later moment. For each of these 40,000 registrations, the following actions are executed:

 

·        The dbo.Sessions table is queried to find a session for the correct year and quarter that still has enough space left for the current registration.

·        When needed, a new session is added to the dbo.Sessions table.

·        The registration is updated in the dbo.Registrations table. Well, updating all 40,000 registrations is pretty hard to avoid as the goal is to assign each registration to a session, and as long as the processing is cursor-based, the updates will inevitably come in one by one.

·        The chosen session is updated in the dbo.Sessions table to reduce the space left by the size of registration that was just assigned to the session.

 

Improve indexes

 

Indexing is one of the most common answers to performance problems. In this case, based on the above summary of actions taken in the loop, it’s clear that adding or changing an index on dbo.Registrations won’t do much. The update of this table is currently based on the clustered index key, which is the fastest possible way to perform an update. The cursor requires a table scan (to read all rows) and a sort (to order them by year, quarter, and descending number of candidates); the sort can be avoided if the clustered index is on these three columns, but at the price of 40,000 bookmark lookups for the updates – I don’t need to run a test to see that this is a bad idea!

 

The dbo.Sessions table is less clear-cut. The clustered index is on (Year, Quarter, SessionNo), so searching for a session with enough room for a registration currently seeks the clustered index to process only the sessions of a single quarter, but it still has to scan through sessions in the quarter until it finds one with enough space. A nonclustered index on (Year, Quarter, SpaceLeft) will speed up this process, especially since it is covering for the query (the query uses the SessionNo column as well, but that column is included in the nonclustered index as part of the reference to the clustered index key). The downside to this index is that it has to be updated each time the space left in a session changes and each time a session is added. So, the question to answer is whether gained when searching for a session to add a registration to outweighs the performance lost during updates. To find out, I added this index before repeating the tests of dbo.OrderDesc:

 

CREATE NONCLUSTERED INDEX ix_Sessions

ON dbo.Sessions (Year, Quarter, SpaceLeft);

 

The packing efficiency didn’t change. The execution time did change, though – with the index in place, the average execution time of five test runs of dbo.OrderDesc was down to 68,300 ms, a 12.4% improvement over the execution time without the index. Clearly, the extra overhead incurred on the UPDATE statements is less than the savings on the SELECT statements.

 

Note that I create the index before starting the test. I assume that the index can be added permanently to the database – if the database frequently updates the dbo.Sessions table at other moments, when the bin packing procedure is not executing, it might make more sense to create it at the start of this procedure and remove it when done – and in that case, the time taken for creating and dropping the index (less than 100 ms) should be added in.

 

For completeness, I also tested the dbo.Order50FirstB version, that proved to be a good trade-off between performance and efficiency in the previous post. This version should of course see a similar performance benefit from the additional index, and indeed it does – the average execution time for dbo.Order50FirstB was down to 68,144 ms after adding the index, a saving of 9.8%.

 

If you’re duplicating the tests on your own server, then don’t forget to remove the index – we won’t need it in the following step, and the overhead of maintaining it would just waste precious time.

 

DROP INDEX dbo.Sessions.ix_Sessions;

 

Do less work …

 

Saving 12.4% execution time is great – but (of course) not enough to satisfy me! So it’s time to take another approach: let’s see if I can’t find any way to reduce the number of times the dbo.Sessions table is searched and updated. How, you might ask? Well, let’s again check how a human would operate. If I have to stow packages in boxes, I’d probably keep adding packages to the same box until I get a package that won’t fit. Only when I get a package that doesn’t fit the box anymore would I put the box away and open an new box. I have coded the T-SQL equivalent of this method, and called it dbo.FillThenNext (see FillThenNext.sql in the attached ZIP file).

 

The average execution time for this simple algorithm turned out to be just 39,110 ms, so this saves 42.6% over dbo.OrderDesc with index, or 47.0% over the baseline. But the packing efficiency turns out to be really down the drain with this one:

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       6670        75.364617        16431.800000

2       3548        78.769447        7532.600000

3       8066        75.025787        20144.200000

4       5346        78.283015        11609.900000

ALL     23630       76.420440        13929.625000

 

 

A total of 23,630 sessions exceeds the number of sessions required by the baseline by 21.4%, and that of dbo.OrderDesc by 24.9%. What a high price to pay for a speed advantage!

 

… but do it smart

 

The reason for this enormous efficiency loss is that I became very wasteful. Suppose that the first registration processed is for 20 persons, and the second for 85. The second can not be combined with the first in a single session, so the algorithm opens a new session – and never again looks at the first session, even though there still are 80 seats left! That is of course a HUGE waste of resources. So I modified the algorithm. I still have a “current” session that I keep adding registrations to as long as I can, but if I find a registration that won’t fit I now first search the previous sessions for one with enough empty seats before closing the current session and opening a new one. The code for this version (dbo.FillThenSearch (is in FillThenSearch.sql in the attached ZIP file. And the test results are really awesome! The average execution time is now 45,506 ms (with the nonclustered index back in place – without it, execution time is 50,874 ms), which is of course slightly slower than my previous attempt but still 33.4% faster than dbo.OrderDesc with index, and 38.3% faster than the baseline. But the packing is much better:

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       5399        93.106501        3721.800000

2       2863        97.615787        682.600000

3       6945        87.135781        8934.200000

4       4542        92.140246        3569.900000

ALL     19749       91.438300        4227.125000

 

This version still doesn’t pack as efficient as dbo.OrderDesc (4.4% more sessions) or the baseline (1.5% more sessions) – but saving 33.4% execution time at the price of only 4.4% more sessions sounds like a proposition that deserves serious consideration, unlike the previous attempt!

 

Revive an old trick

 

If you have checked the source code, you may have noticed that I have once more removed the extra ordering that I added in the previous installment. I did that on purpose, because this extra ordering

a)      decreased performance – I want to increase performance in this part, and

b)      is not guaranteed to have the same effects in a different packing algorithm.

 

But I did not want to end this post without at least testing the effect of adding back in the ordering that proved most efficient in the previous episode, so I re-introduced the sorting by descending number of candidates in dbo.FillThenSearchDesc (FillThenSearchDesc.sql in the ZIP file), and I discovered that this might be the best tradeoff so far – only 2 sessions more than dbo.OrderDesc, at only 48,626 ms (28.6% less than dbo.OrderDesc – with the nonclustered index still in place, of course).

 

Quarter NumSessions AvgSessionSize   AvgEmptySeats

------- ----------- ---------------- ----------------

1       5087        98.816984        601.800000

2       2799        99.847802        42.600000

3       6771        89.374981        7194.200000

4       4268        98.055529        829.900000

ALL     18925       95.419550        2167.125000

 

Best options so far

 

After having investigated so many different options, it’s all too easy to lose track. The table below lists the versions most worth remembering – except for the baseline, I did not include any version for which there was another version that produces less sessions in less time. The remaining sessions are the one you’ll have to choose from, making your own trade-off between saving sessions or saving execution time.

 

Version

Execution time (ms)

Number of sessions

Baseline

73,759

19,457

FillThenNext

39,110

23,630

FillThenSearch (with index)

45,506

19,749

FillThenSearchDesc (with index)

48,626

18,925

OrderDesc (with index)

68,300

18,923

 

Note that I ordered these versions, except the baseline, fastest to slowest (or least efficient to most efficient packer).

 

This concludes the third part of this series. Though I still have some ideas that might improve the performance of my current cursor-based approach, I’ll postpone them and switch gears – next episode, I’ll start investigating if these numbers can be beaten by a set-based approach.

Published Saturday, January 12, 2008 12:24 AM by Hugo Kornelis

Attachment(s): ImEx3.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Joe Celko said:

Did you ever see some of the solutions for Codd's t-join problem?  It is a "best fit" situation  -- given classrooms wtih various numbers of seat5s and classes with various numbers of students, how do you assign rooms to classes with the least wasted space?  

January 13, 2008 12:41 PM
 

Hugo Kornelis said:

Hi Joe,

You mean those covered in your "SQL for smarties"? I checked those out when I tried to answer the usenet post that ultimately resulted in this series of blog post, but they didn't fit the problem - the solutions given in your book all relied on each classroom having a different size, and each class having a different number of students. Allso, none of those solution considered combining two classes in a single class room. (Which makes sense if you are teaching, but changes the problem from bin packing to best fit - a completely different problem, requiring a completely different solution).

In part 5 of this series, I'll present an updated version of the code I originally posted in answer to said usenet post, which is MUCH faster than any of the other alternatives, and yields very acceptable packing efficiency. This code does utiilize a best fit strategy as part of the algorithm. You must be familiar with it, since I once sent you a copy when you asked for material for the third (?) edition of "Smarties", but you ultimately chose not to include it. That's why I'm now writing this series.

Best, Hugo

January 14, 2008 3:26 AM
 

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server' said:

Almost a year ago now, I started a series of blog post on the bin packing problem . But after the first

October 26, 2008 7:30 PM
 

SELECT Hints, Tips, Tricks FROM Hugo Kornelis WHERE RDBMS = 'SQL Server' said:

One of the most common techniques authors use to keep their readers interested is to leave them with

October 17, 2011 6:11 PM

Leave a Comment

(required) 
(required) 
Submit

About Hugo Kornelis

Hugo is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to visit the SQL Server newsgroups, in order to share and enhance his knowledge of SQL Server.

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement