
Paul White: Page Free Space

A technical SQL Server blog from New Zealand.

Bitmap Magic (or… how SQL Server uses bitmap filters)

Question: Can a parallel query use less CPU than the same serial query, while executing faster?

The answer is yes; and to demonstrate, I'll use the following two (heap) tables, each containing a single column typed as INTEGER:

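The table definitions, reproduced from the full test script at the end of this post:

CREATE TABLE #BuildInt
(
    col1    INTEGER NOT NULL
);
GO
CREATE TABLE #Probe
(
    col1    INTEGER NOT NULL
);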

Let’s load the first table with 5,000 random integer values (I’m using RAND with a seed and a WHILE loop so you will get the same random numbers if you run this script yourself):

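The loading code, again taken from the test script below:

SET NOCOUNT ON;

DECLARE @I INTEGER = 1;

-- The first row uses a fixed seed so the sequence is repeatable
INSERT #BuildInt
    (col1)
VALUES
    (CONVERT(INTEGER, RAND(1) * 2147483647));

WHILE @I < 5000
BEGIN
    INSERT #BuildInt
        (col1)
    VALUES
        (RAND() * 2147483647);
    SET @I += 1;
END;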

Now load the second table with 5,000,000 random integers:

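The corresponding code for the probe table (also part of the test script):

SET NOCOUNT ON;

DECLARE @I INTEGER = 1;

INSERT #Probe
    (col1)
VALUES
    (CONVERT(INTEGER, RAND(2) * 2147483647));

-- Explicit transactions stop the five million single-row inserts
-- from flushing the log on every statement
BEGIN TRANSACTION;

WHILE @I < 5000000
BEGIN
    INSERT #Probe
        (col1)
    VALUES
        (CONVERT(INTEGER, RAND() * 2147483647));

    SET @I += 1;

    IF @I % 25 = 0
    BEGIN
        COMMIT TRANSACTION;
        BEGIN TRANSACTION;
    END;
END;

COMMIT TRANSACTION;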

Now let’s write a query to count the number of matches, using a MAXDOP 1 query hint to ensure that the execution plan uses only a single processor.  The illustration below shows the query, execution plan, and runtime statistics:

[Image: serial query, execution plan, and runtime statistics]
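
The query itself, as it appears in the test script:

SELECT
    COUNT_BIG(*)
FROM #BuildInt AS bi
JOIN #Probe AS p ON
    p.col1 = bi.col1
OPTION (MAXDOP 1);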

It turns out there are 13 matches, and this query uses 890ms of CPU time and runs for 891ms.  Now let's run the same query, but with a MAXDOP 2 hint:

[Image: parallel (MAXDOP 2) query, execution plan, and runtime statistics]
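
The only change is the hint:

SELECT
    COUNT_BIG(*)
FROM #BuildInt AS bi
JOIN #Probe AS p ON
    p.col1 = bi.col1
OPTION (MAXDOP 2);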

The query now completes in 221ms using 436ms of CPU time.  That’s a four-times speed-up, while using half the CPU!

Bitmap Magic

The reason the parallel query is so much more efficient is down to the Bitmap operator.  To more clearly see the effect it has, take a look at the parallel query plan with runtime statistics included (a so-called actual execution plan):

[Image: actual execution plan for the parallel query, showing the separate Bitmap operator]

Compare that to the serial plan:

[Image: actual execution plan for the serial query]

The way these bitmap filters work is reasonably well documented, so I’ll just give an outline here and provide links to some existing documentation at the end of this post.

Hash Join

A hash join proceeds in two phases.  First, it reads all the rows from its build input and constructs a hash table containing the join keys.  Second, it reads a row at a time from the probe input, uses the same hash function as before to compute a hash value for the probe row’s join keys, and uses this hash value to check a single hash table bucket for matches.  Naturally, the possibility of hash collisions means that the join generally still has to compare the real join key values to ensure a true match.

Bitmaps in Serial Plans

Most people don’t know that a hash match join always creates a bitmap – even in a serial plan.  You can’t see the bitmap in the serial plan, because it is part of the internal implementation of the Hash Match iterator.  While processing build rows and creating its hash table, the hash join also sets one (or more) bits in a compact bitmap structure.  When the build phase completes, this bitmap provides an efficient way to check for potential hash matches without the cost of probing the hash table.

In the case of a serial-plan hash match, incoming probe rows to the hash join are hashed on the join keys and the value is used to check the bitmap.  If the corresponding bits in the bitmap are all set, there might be a match in the hash table, so the process goes on to check the hash table.  Conversely, if even one of the bits corresponding to the hashed join key value is not set, we can be certain that there is no match in the hash table, and we can discard the current probe row immediately.  The small cost of building the bitmap is offset by the time saved not checking rows that cannot match in the hash table.  Because checking a bitmap is very much faster than probing a hash table, this optimization is often effective.

Bitmaps in Parallel Plans

In a parallel plan, a bitmap is exposed as a separate plan operator.  When the hash join transitions from its build phase to the probe phase, the bitmap is passed to an iterator on the probe side of the hash join.  At a minimum, the bitmap is pushed down the probe side as far as the exchange (Parallelism) operator immediately below the join.  At this location, the bitmap is able to eliminate rows that cannot join before the rows are passed between threads inside the exchange.  There are no exchange operators in serial plans, of course, so moving the bitmap just outside the hash join in this way would confer no extra advantage compared to the ‘built in’ bitmap inside the hash match iterator.

In certain circumstances (though only in a parallel plan!) the query processor can push the bitmap even further down the plan on the probe side of the join.  The idea here is that eliminating rows earlier saves the cost of moving rows between iterators unnecessarily, and perhaps even eliminates some operations completely.  As an aside, the optimizer generally tries to push ordinary filters toward the leaves of a plan for similar reasons – eliminating rows as early as possible is usually worthwhile.  I should mention though, that the type of bitmap we are dealing with here is added after optimization has completed.  Whether this type of bitmap is added to a post-optimizer plan or not is a decision made based on the expected selectivity of the filter (so accurate statistics are essential).

Pushing the Bitmap Filter

Anyway, back to the concept of pushing the bitmap filter further down the probe side of the join than the exchange iterator immediately below it.  In many cases, the bitmap filter can be pushed all the way down to a scan or seek.  When this happens, the bitmap filter check appears as a residual predicate like this:

[Image: table scan with the bitmap filter shown as a residual predicate]

As a residual, it is applied to all rows that pass any seek predicates (for an index seek), or to all rows in the case of an index or table scan.  The diagram above shows a bitmap filter being applied to a heap table scan, for example.
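
In the plan's predicate text, the check shows up as a PROBE() function in the residual, something like this (the bitmap name, numbering, and column reference vary from plan to plan):

PROBE([Bitmap1005], [p].[col1])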

Deeper Still…

If the bitmap filter is built on a single column or expression of the INTEGER or BIGINT types, and if the bitmap is to be applied to a single column of INTEGER or BIGINT type, it might be pushed down the plan even further than the seek or scan operator.  The predicate is still shown in the seek or scan as above, but it is annotated with the INROW attribute – meaning that the filter is pushed into the Storage Engine, and applied to rows as they are being read.
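
When INROW applies, the same PROBE() residual carries an extra marker, along these lines (again, the exact names vary):

PROBE([Bitmap1005], [p].[col1], N'[IN ROW]')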

When this optimization occurs, rows are eliminated before the Query Processor sees them at all – only rows that might match the hash join are passed up from the Storage Engine.  The exact circumstances in which this optimization is applied vary a little between SQL Server releases – for example, in SQL Server 2005, the probed column has to be defined as NOT NULL, in addition to the conditions noted previously.  This restriction was relaxed from SQL Server 2008 onward.

You might be wondering how much difference the INROW optimization makes.  Surely pushing the filter as far down as the seek or scan must be nearly as good as pushing it into the Storage Engine?  I’ll cover that interesting question in a future post.  For now, I want to round off this entry by showing how merge join and nested loops join compare for this query.

Other Join Options

With no indexes, a query using the nested loops physical join type is a complete non-starter: we would have to scan one table fully for each row in the other table – a total of 25 billion comparisons.  That query would likely run for a very long time.

Merge Join

This type of physical join requires sorted inputs, so forcing a merge join results in a plan that fully sorts both inputs before joining.  The serial plan looks like this:

[Image: serial merge join plan, with a full sort on each input]
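
This plan was produced by adding a join hint to the test query:

SELECT
    COUNT_BIG(*)
FROM #BuildInt AS bi
JOIN #Probe AS p ON
    p.col1 = bi.col1
OPTION (MAXDOP 1, MERGE JOIN);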

The query now uses 3105ms of CPU and overall execution time is 5632ms.  The extra non-CPU time there is due to one of the sort operations spilling to tempdb (despite SQL Server having more than sufficient memory available for the sort).  The spill occurs because the default memory grant algorithm happens to not reserve enough memory in advance.  Leaving that to one side, it is clear that the query would never complete in less than 3105ms anyway, so we will move on.

Continuing to force a merge join plan, but allowing parallelism (MAXDOP 2) as before:

[Image: parallel merge join plan with the bitmap pushed down to the table scan]
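
The hinted query for this run (the same query, varying the hints as the comment in the test script suggests):

SELECT
    COUNT_BIG(*)
FROM #BuildInt AS bi
JOIN #Probe AS p ON
    p.col1 = bi.col1
OPTION (MAXDOP 2, MERGE JOIN);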

As in the parallel hash join plan seen earlier, the Bitmap filter is pushed down the other side of the merge join, all the way to the Table Scan, and applied using the INROW optimization.  At 468ms of CPU and 240ms elapsed time, the merge plan with the extra sorts is very nearly as fast as the parallel hash (436 ms/221 ms).  There is one downside to the parallel merge join plan: it reserves 330KB of workspace memory, based on the expected number of rows to sort.  Since this type of bitmap is considered after cost-based optimization is complete, no adjustment is made to the estimate even though only 2488 rows flow through the lower sort.

A bitmap can only appear in a merge join plan if the bitmap is followed by a blocking operator (like a sort).  A blocking operator has to consume its entire input before producing its first row, which guarantees that the bitmap is fully populated before any rows from the other input are read and checked against it.  It is not necessary to have a blocking operator on the other side of the merge join, nor does it matter which side the bitmap appears on.  (Another topic for a future post.)

With Indexes

The situation is different if suitable indexes are available.  The distribution of the ‘random’ data is such that we can create a unique index on the build table, but the probe table contains duplicates so we have to make do with a non-unique index:

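The index definitions (also in the test script):

CREATE UNIQUE CLUSTERED INDEX cuq ON #BuildInt (col1);
CREATE CLUSTERED INDEX cx ON #Probe (col1);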

The hash join (both serial and parallel versions) is largely unaffected by the change.  It cannot take advantage of the indexes, so the plans and performance are the same as seen previously.

Merge Join

The merge join no longer has to perform a many-to-many join operation, and no longer requires a sort on either input.  The lack of a blocking sort operator means that the bitmap can no longer be used (and remember, this type of bitmap is only considered after optimization, so the optimizer has no say in the matter; it rejects the sorting plan early on as obviously more expensive).  The effect is that a serial plan is produced whatever MAXDOP setting is specified, and performance is worse than the parallel plan before the indexes were added: 702ms CPU and 704ms elapsed time:

[Image: serial merge join plan using the new indexes]

This does represent a marked improvement over the original serial merge join plan, however (3105ms/5632ms).  This is due to the elimination of the sorts and the better performance of the one-to-many join.

Nested Loops Join

The nested loops join benefits enormously, as you would expect.  As with the merge join, the optimizer does not consider a parallel plan:

[Image: nested loops join plan using the new indexes]
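
To produce this plan shape, swap the MERGE JOIN hint for a LOOP JOIN hint on the same test query, for example:

SELECT
    COUNT_BIG(*)
FROM #BuildInt AS bi
JOIN #Probe AS p ON
    p.col1 = bi.col1
OPTION (LOOP JOIN);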

This is by far the best-performing query plan so far – just 16ms of CPU and 16ms elapsed time.  Of course, this assumes that the data required to satisfy the query is already in memory.  Each seek into the probe table would otherwise generate essentially random I/O, so if you still store your data on spinning magnetic media, cold-cache performance might be considerably worse.

On my laptop, a cold-cache nested loops run resulted in 78ms of CPU and 2152ms elapsed time.  Under the same circumstances, the merge join plan used 686ms CPU and 1471ms elapsed; the hash join plan used 391ms CPU and 905ms elapsed.  Merge join and hash join both benefit from the larger, possibly sequential I/O issued by the read-ahead mechanism.
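
If you want to repeat the cold-cache comparison yourself, one way (on a private test instance only, please) is to flush the buffer pool between runs before re-running the hinted queries and comparing the STATISTICS TIME output:

CHECKPOINT;             -- write dirty pages in the current database to disk
DBCC DROPCLEANBUFFERS;  -- then remove all clean pages from the buffer pool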

© 2011 Paul White
email: SQLkiwi@gmail.com
twitter: @SQL_Kiwi

Further reading:

Parallel Hash Join – Craig Freedman
Query Execution Bitmaps – The SQL Server Query Processing Team
Bitmaps in Microsoft SQL Server 2000 – MSDN
Interpreting Execution Plans Containing Bitmap Filters – Books Online
Understanding Hash Joins – Books Online

Test script:

USE tempdb;
GO
CREATE TABLE #BuildInt
(
    col1    INTEGER NOT NULL
);
GO
CREATE TABLE #Probe
(
    col1    INTEGER NOT NULL
);
GO
-- Decimal versions of the build and probe tables
CREATE TABLE #BuildDec
(
    col1    DECIMAL(10) NOT NULL
);
GO
CREATE TABLE #ProbeDec
(
    col1    DECIMAL(10) NOT NULL
);
GO
-- Load 5,000 rows into the build table
SET NOCOUNT ON;
SET STATISTICS XML OFF;
 
DECLARE @I INTEGER = 1;
INSERT #BuildInt
    (col1) 
VALUES 
    (CONVERT(INTEGER, RAND(1) * 2147483647));
 
WHILE @I < 5000
BEGIN
    INSERT #BuildInt
        (col1)
    VALUES 
        (RAND() * 2147483647);
    SET @I += 1;
END;
 
-- Copy the build values to the decimal table
INSERT #BuildDec
    (col1)
SELECT 
    CONVERT(DECIMAL(10), bi.col1)
FROM #BuildInt AS bi;
GO
-- Load 5,000,000 rows into the probe table
SET NOCOUNT ON;
SET STATISTICS XML OFF;
 
DECLARE @I INTEGER = 1;
 
INSERT #Probe
    (col1) 
VALUES 
    (CONVERT(INTEGER, RAND(2) * 2147483647));
 
-- Use explicit transactions so the log is not flushed on every insert
BEGIN TRANSACTION;
 
WHILE @I < 5000000
BEGIN
    INSERT #Probe
        (col1) 
    VALUES 
        (CONVERT(INTEGER, RAND() * 2147483647));
        
    SET @I += 1;
    
    IF @I % 25 = 0
    BEGIN
        COMMIT TRANSACTION;
        BEGIN TRANSACTION;
    END;
END;
 
COMMIT TRANSACTION;
GO
-- Demos
SET STATISTICS XML OFF;
SET STATISTICS IO, TIME ON;
 
-- Serial
SELECT 
    COUNT_BIG(*) 
FROM #BuildInt AS bi 
JOIN #Probe AS p ON 
    p.col1 = bi.col1 
OPTION (MAXDOP 1);
 
-- Parallel
SELECT 
    COUNT_BIG(*) 
FROM #BuildInt AS bi 
JOIN #Probe AS p ON 
    p.col1 = bi.col1 
OPTION (MAXDOP 2);
 
SET STATISTICS IO, TIME OFF;
 
-- Indexes
CREATE UNIQUE CLUSTERED INDEX cuq ON #BuildInt (col1);
CREATE CLUSTERED INDEX cx ON #Probe (col1);
 
-- Vary the query hints to explore plan shapes
SELECT 
    COUNT_BIG(*) 
FROM #BuildInt AS bi 
JOIN #Probe AS p ON 
    p.col1 = bi.col1 
OPTION (MAXDOP 1, MERGE JOIN);
GO
DROP TABLE #BuildInt, #Probe, #BuildDec, #ProbeDec;
Published Thursday, July 07, 2011 5:44 AM by Paul White


Comments

 

tobi said:

I wonder if it would be beneficial to generate parallel plans even on single-core machines to leverage the sub-linear scaling.

In fact, sub-linear scaling can never exist in any algorithm because you could just run the algorithm on N _simulated_ cores and get the sub-linear speedup even on a single physical core.

July 6, 2011 7:40 PM
 

Fabricio Catae said:

Wow, the estimated cost for the parallel version is lower than the serial plan. Simply unbelievable! I had some fun playing with the query, and tried to make it faster with MAXDOP 1. Gave up after 2 hours. This Bitmap operator does improve the Hash Join performance. I am still learning. Great article Paul.

July 6, 2011 10:27 PM
 

Paul White said:

Hi Tobi,

It's unsupported, but you could try starting SQL Server on a single-core test machine with the -P parameter to start more than one scheduler.  This will allow a 'parallel' plan to be generated, giving the benefit of bitmap elimination.

Paul

July 7, 2011 2:31 AM
 

Paul White said:

Fabricio,

Thank you - I love this bitmap thing and have been meaning to blog about it for a long time.  It's such a counter-intuitive result in some ways.  Perhaps one day SQL Server will be extended to allow a serial hash join to also push a bitmap?  Probably not - I guess if that were on the cards, they would have done it in SQL 2008, at the same time they introduced 'optimized' bitmap filtering...

Paul

July 7, 2011 2:34 AM
 

tobi said:

These detailed posts of internals are a joy to read.

July 7, 2011 6:30 AM
 

Fabiano Amorim said:

Holy shit, I didn't understand anything :-)... Loved the article... I'll read it again and again and again...

Thanks for that Paul.

July 15, 2011 4:32 PM
 

Paul White: Page Free Space said:

You probably already know that it’s important to be aware of data types when writing queries, and that

July 18, 2011 7:17 AM
 

Thomas LeBlanc said:

Over the top. I have been wondering lately why some large Data Warehouse queries have been showing Estimated rows different from the Actual rows at the joins. Now I understand; I thought there was a problem with Statistics.

Thanks Paul!!!

January 26, 2012 5:18 PM
 

Paul White said:

Thank you Thomas, I'm so glad you found this useful :)

Paul

January 26, 2012 5:40 PM
 

vikas said:

Hi experts

It's a really fantastic article. It would be even better if the writer gave some examples of the bitmap architecture and explained them with pictures; then it would be easier to understand.

Regards

vikas

February 28, 2012 2:46 AM
 

Paul White: Page Free Space said:

This is a post for T-SQL Tuesday #43 hosted by my good friend Rob Farley. The topic this month is Plan

June 11, 2013 5:00 AM
