THE SQL Server Blog Spot on the Web


Adam Machanic

Adam Machanic, Boston-based SQL Server developer, shares his experiences with programming, monitoring, and performance tuning SQL Server. And the occasional battle with the query optimizer.

T-SQL Tuesday #001: Exploring "Fuzzy" Interval Islands Using SQLCLR

When working with time intervals, we often want to ask a couple of basic questions:

  • Which time periods are not covered by our intervals? These are known as "gaps".
  • What are the time ranges that we are fully covering? These are known as "islands".

If you're unfamiliar with "gaps" and "islands" I highly recommend reading some of Itzik Ben-Gan's recent work in SQL Server Magazine. He's had a great series going on the topic. But one thing that he hasn't found a good T-SQL solution for is a problem that I call "fuzzy islands."

When is an island fuzzy? When it doesn't necessarily have a fixed end time. For an example of this, consider a store, selling a number of products. Management might want to see a report showing when each product was selling. This is, effectively, an island question. The goal is to find all of the covered ranges during which sales occurred. But running such a report, you might find that way too much data is returned. A given product may have sold units on Monday, Tuesday, and Thursday, but for some reason no one bought one on Wednesday. Creating a new island every time there is a small gap will create a 300-page report where a 1-page dashboard might suffice--not a good user experience, nor a good way of representing the data. The solution? Introduce a bit of fuzziness--a rule that says, for instance, that a gap is only a gap if it's longer than 7 days.

Answering the fuzzy islands question is not a very difficult thing to do in T-SQL. The basic algorithm follows:

  1. Find all of the "start" dates or times. These are simply those dates or times for which no previous date or time exists within the fuzzy interval. So if a row is dated 2009-12-08 and our fuzzy granularity is 7 days, we know we have a start date if there is no other covered date between 2009-12-01 and 2009-12-07.
  2. For each start date identified, find the end date: the earliest covered date on or after the start date that is not followed by another covered date within the fuzzy interval. This is done by looking ahead rather than behind, so if our date is 2009-12-08 and we have a granularity of 7 days, we'll look forward through December 15th.
  3. Optionally, add the fuzzy factor to the end date. This is something that I think is a good idea, as it introduces the concept of an "active" interval--a period over which, for example, a product is considered to have been selling. The interval shouldn't necessarily terminate the day that the last sale occurred. But of course this depends on the situation in question.
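The three steps above can be sketched directly. This is a minimal illustration in Python rather than T-SQL, and `fuzzy_islands` and its inputs are names I've made up for this sketch, not part of the attached code:

```python
from datetime import date, timedelta

def fuzzy_islands(dates, active_interval=7):
    """Find fuzzy islands in a set of covered dates, per steps 1-3."""
    fuzz = timedelta(days=active_interval)
    covered = set(dates)
    window = [timedelta(days=k) for k in range(1, active_interval + 1)]

    # Step 1: a start date has no covered date in the preceding interval.
    starts = sorted(d for d in covered
                    if not any(d - w in covered for w in window))

    islands = []
    for start in starts:
        # Step 2: walk forward while another covered date falls
        # within the fuzzy interval of the current end.
        end = start
        while True:
            nxt = [end + w for w in window if end + w in covered]
            if not nxt:
                break
            end = min(nxt)
        # Step 3: extend the end date by the fuzzy ("active") factor.
        islands.append((start, end + fuzz))
    return islands
```

With sales on December 7, 8, and 10, plus one on December 25, and a 7-day interval, this yields two islands: one starting December 7 and one starting December 25, each extended 7 days past its last sale.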

The following query produces a fuzzy islands report for each product in the AdventureWorks (or AdventureWorks2008) Production.TransactionHistory table. You can modify the @active_interval variable to tweak the fuzziness and change the output.

--Find all "active" product time ranges, meaning that the product
--has sold within the previous 7 days
DECLARE @active_interval INT = 7

SELECT DISTINCT
    t_s.ProductID,
    t_s.TransactionDate AS StartDate,
    DATEADD
    (
        dd,
        @active_interval,
        (
            SELECT
                MIN(t_e.TransactionDate)
            FROM Production.TransactionHistory AS t_e
            WHERE
                t_e.ProductID = t_s.ProductID
                AND t_e.TransactionDate >= t_s.TransactionDate
                AND NOT EXISTS
                (
                    SELECT *
                    FROM Production.TransactionHistory AS t_ae
                    WHERE
                        t_ae.ProductID = t_s.ProductID
                        AND t_ae.TransactionDate BETWEEN
                            DATEADD(dd, 1, t_e.TransactionDate)
                            AND DATEADD(dd, @active_interval, t_e.TransactionDate)
                )
        )
    ) AS EndDate
FROM
(
    SELECT DISTINCT
        ProductID,
        TransactionDate
    FROM Production.TransactionHistory
) AS t_s
WHERE
    NOT EXISTS
    (
        SELECT *
        FROM Production.TransactionHistory AS t_ps
        WHERE
            t_ps.ProductID = t_s.ProductID
            AND t_ps.TransactionDate BETWEEN
                DATEADD(dd, -@active_interval, t_s.TransactionDate)
                AND DATEADD(dd, -1, t_s.TransactionDate)
    )
ORDER BY
    ProductID,
    StartDate
GO

Running this query, you'll find that it works... But the results are returned more slowly than we might like: 15 to 16 seconds on my end. Looking at the query plan, the reason becomes quite obvious: lots and lots of table scans. How can we eliminate all of that overhead?

SQLCLR to the rescue. The best way to solve this problem--at least until the SQL Server team adds proper OVER clause support (LAG and LEAD, specifically)--is to use a cursor algorithm. We could do this in a T-SQL cursor, but why bother? Cursor logic in SQLCLR is much, much faster.

To solve the problem, I implemented an enumerator, called active_products_enumerator. The enumerator is initialized using a SqlDataReader and an "active interval" -- the number of days we're allowing for fuzziness. The DataReader is expected to return rows ordered by ProductID and TransactionDate. The enumeration process uses the following algorithm:

  1. If the current ProductID is not the same as the previous ProductID, return an end date for the previous interval and start a new one
  2. If the current period date is greater than the previous period date plus the active interval, return an end date for the previous interval and start a new one
  3. Otherwise, continue
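The same single ordered pass can be sketched as a Python generator, purely for illustration; `active_products` and its inputs are hypothetical names, and the real implementation is the C# enumerator that follows:

```python
from datetime import date, timedelta

def active_products(rows, active_interval=7):
    """Stream (product_id, start, end) islands from rows sorted by
    (product_id, transaction_date), per the three rules above."""
    fuzz = timedelta(days=active_interval)
    product_id = start = period_plus_interval = None

    for new_id, new_date in rows:
        if new_id != product_id:
            # Rule 1: product changed -- emit the previous interval.
            if product_id is not None:
                yield (product_id, start, period_plus_interval)
            product_id, start = new_id, new_date
        elif new_date > period_plus_interval:
            # Rule 2: gap longer than the active interval --
            # emit the current interval and start a new one.
            yield (product_id, start, period_plus_interval)
            start = new_date
        # Rule 3: otherwise, just extend the running end date.
        period_plus_interval = new_date + fuzz

    # Emit the final interval for the last product seen.
    if product_id is not None:
        yield (product_id, start, period_plus_interval)
```

Because it keeps only one row of state, this pass does a single ordered scan of the input, which is exactly why the cursor-style approach beats the scan-heavy T-SQL query.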

Following is the MoveNext method for the enumerator (the complete code is attached to this post so that you can run it on your end without my bombarding you with a gigantic code-filled post):

public bool MoveNext()
{
    try
    {
        current_results = null;

        while (r.Read())
        {
            new_ProductID = r.GetSqlInt32(0);
            new_period_date = r.GetDateTime(1);

            //the product changed: emit the previous product's final
            //interval (if any) and start a new one
            if (new_ProductID != ProductID)
            {
                if (ProductID != 0)
                {
                    current_results = new results(
                        ProductID,
                        StartDate,
                        period_plus_interval);
                }

                ProductID = new_ProductID;
                StartDate = new_period_date;
                period_plus_interval = new_period_date.AddDays(activeInterval);

                if (current_results != null)
                    return (true);
            }
            else
            {
                //the gap since the last sale exceeds the active
                //interval: emit the current island and start a new one
                if (period_plus_interval < new_period_date)
                {
                    current_results = new results(
                        ProductID,
                        StartDate,
                        period_plus_interval);

                    StartDate = new_period_date;
                }

                period_plus_interval = new_period_date.AddDays(activeInterval);

                if (current_results != null)
                    return (true);
            }
        }

        //return the last row of data
        if (ProductID != 0)
        {
            current_results = new results(
                ProductID,
                StartDate,
                period_plus_interval);

            //set this to 0 so we don't return another row
            ProductID = 0;
        }

        if (current_results != null)
            return (true);
        else
        {
            r.Dispose();
            return (false);
        }
    }
    catch
    {
        r.Dispose();
        throw;
    }
}

Put into a table-valued function, as the attached code does, this algorithm will return the same data as the T-SQL query in under a third of a second on my end--around 45 times faster than the T-SQL version.

Note that I've played some games with a loopback connection to get this whole thing to work. That's a topic for a future blog post, so stay tuned. In the meantime, please realize that you'll have to catalog the assembly with EXTERNAL_ACCESS permission to make this happen.

This post was created for T-SQL Tuesday, the revolving SQL Server blog party, hosted this month by... me. Enjoy!

Published Tuesday, December 08, 2009 3:13 PM by Adam Machanic

Attachment(s): AdamMachanic_TSQLTuesday_ActiveProducts.zip

Comments

Adam Machanic said:

Wow! The response to the first T-SQL Tuesday was truly amazing. We ended up with 20 great posts, from

December 9, 2009 8:19 AM

Peter Larsson said:

In response to Adam's new series of T-SQL Tuesday, I wanted to write that there are faster ways to get

December 9, 2009 5:30 PM


About Adam Machanic

Adam Machanic is a Boston-based SQL Server developer, writer, and speaker. He focuses on large-scale data warehouse performance and development, and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "SQL Server 2008 Internals" (Microsoft Press, 2009) and "Expert SQL Server 2005 Development" (Apress, 2007). Adam regularly speaks at conferences and training events on a variety of SQL Server topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and an alumnus of the INETA North American Speakers Bureau.
