THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Michael Coles: Sergeant SQL

SQL Server development, news and information from the front lines

Bit-Twiddling in SQL

Someone posted a question to the SQL Server forum the other day asking how to count runs of zero bits in an integer using SQL.  Basically the poster wanted to know how to efficiently determine the longest contiguous string of zero-bits (known as a run of bits) in any given 32-bit integer.  Here are a couple of examples to demonstrate the idea:

 

Decimal = Binary = Zero Run
999,999,999 decimal = 00111011 10011010 11001001 11111111 binary = 2 contiguous zero bits
666,666,666 decimal = 00100111 10111100 10000110 10101010 binary = 4 contiguous zero bits

 

My first reaction was that SQL is not my first choice of a go-to language for bit twiddling hacks.  These types of calculations are generally most efficient in C-style compiled procedural languages with plenty of bit manipulation instructions that map almost directly one-for-one to low-level machine language instructions.  In all fairness, SQL does have some bit-level operators (&, |, etc. operators), but the performance isn’t as optimized as a language like C#.

 

At any rate, a few different ideas were tossed around, like the simplistic loop-and-count procedural method.  Using this method you just keep a running total of the longest zero-bit run and keep looping over the bits, adjusting your running total, comparing to your largest run of zero bits, and shifting your integer one bit right each time.  But this being SQL, I decided to attack the problem from a set-based perspective.  To start with I built a 1,000,000 row table to hold random integers:

 

CREATE TABLE TempNum
(
    Num BIGINT NOT NULL
);
GO

WITH
GenerateRandom
AS
(
    SELECT 1 AS n,
        ABS(CONVERT(BIGINT, CONVERT(BINARY, CHECKSUM(NEWID()))) % 4294967296) AS Random

    UNION
ALL

    SELECT n + 1,
        ABS(CONVERT(BIGINT, CONVERT(BINARY, CHECKSUM(NEWID()))) % 4294967296)
    FROM GenerateRandom
    WHERE n < 1000000
)
INSERT INTO TempNum
(
    Num
)
SELECT Random
FROM GenerateRandom
OPTION (MAXRECURSION 0);
GO

 

Note that I used SQL Server’s BIGINT data type (64-bit integer) since I wanted to deal only in unsigned 32-bit integers.  For my initial crack at a solution I split the 32-bit integer up into individual bits and treated the whole thing like a classic Gaps and Islands problem.  Essentially the 1 bits are islands, the 0's are gaps, and the length of the longest gap is the correct answer. 

 

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO

WITH
Powers
AS
(
    SELECT 0 AS id,
        CAST(1 AS BIGINT) AS pwr

    UNION ALL

    SELECT p.id + 1,
        POWER(CAST(2 AS BIGINT), p.id + 1)
    FROM Powers p
    WHERE p.id < 33
),
Islands
AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY tn.Num ORDER BY id) AS RowNum,
        tn.Num,
        p.id,
        p.pwr
    FROM Powers p
    CROSS JOIN TempNum tn
    WHERE p.pwr & ((tn.Num * 2 ) | CAST(8589934593 AS BIGINT)) <> 0
)
SELECT c.Num AS Num,
    MAX(n.id - c.id - 1) AS ZeroBitRun
INTO #Temp
FROM Islands AS c
INNER JOIN Islands AS n
    ON n.RowNum = c.RowNum + 1
        AND n.Num = c.Num
WHERE n.id - c.id > 1
GROUP BY c.Num,
    n.Num;
GO

 

One interesting aside on this solution - to ensure that leading and trailing zeroes were counted for every 32-bit number I had to shift the number left one bit (multiply by 2) and do a bitwise OR (| operator) with 8589934593, setting both the lowest bit (bit 0) and the highest bit (bit 33) to 1.  Basically this means you’re now dealing with a 34-bit integer, with the highest and lowest bits counted as islands.  This ensures that for the 32 bits in between leading and trailing gaps are correctly identified.  On my PC this query took an average of just over 360 seconds (6 minutes) to identify the largest run of zero bits in 1,000,000 random numbers.  Another solution posted to the group used logarithms combined with bit-shifting in a recursive CTE.  This one completed 1,000,000 iterations in about 17.5 minutes.

 

I couldn't help but think that when you get down to it, a solution to this problem should play to SQL’s strengths.  I decided that what I really need for an efficient SQL solution to this problem is a lookup table.  Granted, a lookup table of 4+ billion rows (one row for each and every 32 bit number) would take a long time to build and probably wouldn't lend itself to IO efficiencies in SQL Server.  So I opted for a scaled down version and built a lookup table of 65,536 rows with one row representing every possible 16 bit number.  There are a lot of clever ways to grab bit-level pattern information (Google "de Bruijn sequence", for instance), but to be honest this lookup table is a one-time build and static population so I decided to keep it simple.

 

CREATE TABLE #Nybbles
(
    Num int primary key not null,
    String varchar(4) not null,
    Leading tinyint not null,
    Trailing tinyint not null
);
GO

INSERT
INTO #Nybbles
(
    Num,
    String,
    Leading,
    Trailing
)
VALUES (0, '0000', 4, 4), (1, '0001', 3, 0), (2, '0010', 2, 1), (3, '0011', 2, 0),
    (
4, '0100', 1, 2), (5, '0101', 1, 0), (6, '0110', 1, 1), (7, '0111', 1, 0),
    (
8, '1000', 0, 3), (9, '1001', 0, 0), (10, '1010', 0, 1), (11, '1011', 0, 0),
    (
12, '1100', 0, 2), (13, '1101', 0, 0), (14, '1110', 0, 1), (15, '1111', 0, 0);
GO

CREATE
TABLE #BitMask
(
    Length tinyint,
    Mask varchar(16)
);
GO

WITH
GetMasks
AS
(
    SELECT 0 AS Length,
        CAST('' AS varchar(16)) AS Mask

    UNION
ALL

    SELECT Length + 1,
        CAST(REPLICATE('0', Length + 1) AS varchar(16))
    FROM GetMasks
    WHERE Length < 16
)
INSERT INTO #BitMask
(
    Length,
    Mask
)
SELECT Length,
    Mask
FROM GetMasks;
GO

CREATE
TABLE BitPattern
(
    Num int not null primary key,
    Trailing tinyint,
    Leading tinyint,
    Seq tinyint
);
GO

WITH
GetBin
AS
(
    SELECT n1.Num * 4096 + n2.Num * 256 + n3.Num * 16 + n4.Num AS Num,
        n1.String + n2.String + n3.String + n4.String AS String
    FROM #Nybbles n1
    CROSS JOIN #Nybbles n2
    CROSS JOIN #Nybbles n3
    CROSS JOIN #Nybbles n4
)
INSERT INTO BitPattern
(
    Num,
    Trailing,
    Leading,
    Seq
)
SELECT g.Num,
    (
        SELECT MAX(b1.Length) 
        FROM #BitMask b1
        WHERE RIGHT(g.String, b1.Length) = b1.Mask
    ) AS Trailing,
    (
        SELECT MAX(b2.Length) 
        FROM #BitMask b2
        WHERE LEFT(g.String, b2.Length) = b2.Mask
    ) AS Leading,
    COALESCE
    (
        (
            SELECT MAX(b3.Length)
            FROM #BitMask b3
            WHERE CHARINDEX(b3.Mask, g.String) > 0
        ), 
    0) AS Seq
FROM GetBin g
ORDER BY Num;
GO

DROP
TABLE #BitMask;
DROP TABLE #Nybbles;
GO 

 

The first part of this script puts all 16 combinations of 4-bit nybbles (nybble = half a byte) and their equivalent binary formatted strings (0 = '0000', 14 = '1110') into a temp table called #Nybbles.

 

There’s also a #BitMask temp table with bitmasks representing zero-bit runs.  The bitmasks are just strings of consecutive '0' characters of the necessary length (length 1 = '0', length 5 = '00000').

 

The BitPatterns table is the actual 16-bit number lookup table.  This table is populated by combining every 16-bit combination of nybbles from the #Nybbles temp table.  This table has 4 columns:

 

  • Num is the 16-bit number
  • Trailing is the number of zero bits trailing (on the right-hand side) in the number
  • Leading is the number of zero bits leading (on the left-hand side) in the number
  • Seq is the longest sequence of zero bits within the number

 

The total run time to build this lookup table was around 30 seconds on my computer.  Keep in mind that’s a one-time cost since you never have to build (or modify) the table again.

 

With the information in this lookup table the query that locates the longest run of zero bits in any given 32-bit number is relatively simple:

 

DBCC FREEPROCCACHE;

DBCC DROPCLEANBUFFERS;

GO

 

WITH CTE

AS

(

    SELECT tn.Num,

        (

            SELECT

                CASE WHEN l.Seq > h.Seq THEN

                (

                    CASE WHEN l.Seq > h.Trailing + l.Leading THEN l.Seq

                        ELSE h.Trailing + l.Leading

                        END

                )

                ELSE

                (

                    CASE WHEN h.Seq > h.Trailing + l.Leading THEN h.Seq

                        ELSE h.Trailing + l.Leading

                        END

                ) 

                END

        ) AS Seq,

        CASE WHEN l.Trailing = 16 THEN h.Trailing + l.Trailing

            ELSE l.Trailing

            END AS Trailing,

        CASE WHEN h.Leading = 16 THEN h.Leading + l.Leading

            ELSE h.Leading

            END AS Leading

    FROM TempNum tn WITH (NOLOCK)

    INNER JOIN BitPattern l WITH (NOLOCK)

        ON l.Num = (tn.Num & 65535)

    INNER JOIN BitPattern h WITH (NOLOCK)

        ON h.Num = (tn.Num / 65536)

)

SELECT Num,

    Seq,

    Leading,

    Trailing

INTO #Temp

FROM CTE;

GO

 

Basically you join the BitPattern lookup table on the high 16 bits and again on the low 16 bits.  The first CASE expression in the Seq subquery performs a 3-way maximum calculation.  It returns the largest of:

 

  • the low word zero-bit run (l.Seq),
  • the high word zero-bit run (h.Seq), or
  • the count of high word trailing zero-bits + the count of low word leading zero-bits (h.Trailing + l.Leading)

 

The other two CASE expressions return the total number of trailing and leading zero-bits in the 32-bit number.  The CASE expression is needed to handle the case when the high or low word is all zero-bits.

 

This particular solution took an average of about 6 seconds to calculate the longest zero-bit run for 1,000,000 numbers.

 

To my earlier point, I created a C# solution (which itself could have been optimized) that performed the exact same calculation for 1,000,000 random 32 bit integers in 2 seconds flat.  So I guess there are two main points here: (1) Make sure you choose the right tool/language for the job, and (2) Whatever tool/language you choose try to play to its strengths.

Published Sunday, April 04, 2010 12:40 AM by Mike C

Comments

 

Erik said:

Very interesting.  But why?

April 4, 2010 8:23 AM
 

Adam Machanic said:

Is someone doing some denormalization with bitmaps or something?

April 4, 2010 10:22 AM
 

James Luetkehoelter said:

I agree - very interesting approach to the question, but what was the business driver behind needing to find sequences of bit zeros in an integer?

April 4, 2010 11:32 AM
 

Mike C said:

This is from the original poster:

"This is for a large frame relay network where customers can purchase

timeslots from 1 to 32 on a line. If we are moving a customer from one

location to another then we need to know that they will fit."

In the poster's design, 0 bits represent open timeslots and 1 bits represent closed timeslots.  One 32-bit integer is used to represent the 32 timeslots mentioned above.

Could this have been modeled more effectively (properly normalized, for instance) for SQL?  Probably.  Could it have been done in another language more efficiently?  Absolutely (I got better performance with an unoptimized solution in C#).

The only real question here is whether it's worth arguing a database model with the poster who doesn't want to share details of his model, or if it's better to just assume a business and technical decision was made which makes sense for the poster's purposes.  With no knowledge of the poster's systems other than the contents and purpose of a single integer column I chose to assume the poster had made design decisions that made sense for him.

April 4, 2010 1:34 PM
New Comments to this post are disabled

This Blog

Syndication

News

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement