THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | Join | Help
in Search

Adam Machanic

Adam Machanic, Boston-based independent database consultant, writer, and speaker, shares his experiences with programming, performance tuning, and optimizing SQL Server 2000, 2005, and 2008, in conjunction with related technologies such as .NET.

T-SQL Tuesday #001: Exploring "Fuzzy" Interval Islands Using SQLCLR

When working with time intervals, we often want to ask a couple of basic questions:

  • Which time periods are not covered by our intervals? These are known as "gaps".
  • What are the time ranges that we are fully covering? These are known as "islands".

If you're unfamiliar with "gaps" and "islands" I highly recommend reading some of Itzik Ben-Gan's recent work in SQL Server Magazine. He's had a great series going on the topic. But one thing that he hasn't found a good T-SQL solution for is a problem that I call "fuzzy islands."

When is an island fuzzy? When it doesn't necessarily have a fixed end time. For an example of this, consider a store, selling a number of products. Management might want to see a report showing when each product was selling. This is, effectively, an island question. The goal is to find all of the covered ranges during which sales occurred. But running such a report, you might find that way too much data is returned. A given product may have sold units on Monday, Tuesday, and Thursday, but for some reason no one bought one on Wednesday. Creating a new island every time there is a small gap will create a 300-page report where a 1-page dashboard might suffice--not a good user experience, nor a good way of representing the data. The solution? Introduce a bit of fuzziness--a rule that says, for instance, that a gap is only a gap if it's longer than 7 days.

Answering the fuzzy islands question is not a very difficult thing to do in T-SQL. The basic algorithm follows:

  1. Find all of the "start" dates or times. These are simply those dates or times for which a previous date or time in the fuzzy interval does not exist. So if a row is dated 2009-12-08 and our fuzzy granularity is 7 days, we know we have a start date if there is no other covered data from the end of November.
  2. For each start date identified, find the minimum date greater than the start date. This is done by looking ahead rather than behind, so if our date is 2009-12-08 and we have a granularity of 7 days, we'll look forward until December 15th.
  3. Optionally, add the fuzzy factor to the end date. This is something that I think is a good idea, as it introduces the concept of an "active" interval--a period over which, for example, a product is considered to have been selling. The interval shouldn't necessarily terminate the day that the last sale occurred. But of course this depends on the situation in question.

The following query produces a fuzzy islands report for each product in the AdventureWorks (or AdventureWorks2008) Production.TransactionHistory table. You can modify the @active_interval variable to tweak the fuzziness and change the output.

--Find all "active" product time ranges, meaning that the product
--has sold within the previous 7 days
DECLARE @active_interval INT = 7

SELECT DISTINCT
t_s.ProductID,
t_s.TransactionDate AS StartDate,
DATEADD
(
dd,
@active_interval,
(
SELECT
MIN(t_e.TransactionDate)
FROM Production.TransactionHistory AS t_e
WHERE
t_e.ProductID = t_s.ProductID
AND t_e.TransactionDate >= t_s.TransactionDate
AND NOT EXISTS
(
SELECT *
FROM Production.TransactionHistory AS t_ae
WHERE
t_ae.ProductID = t_s.ProductID
AND t_ae.TransactionDate BETWEEN
DATEADD(dd, 1, t_e.TransactionDate)
AND DATEADD(dd, @active_interval, t_e.TransactionDate)
)
)
) AS EndDate
FROM
(
SELECT DISTINCT
ProductID,
TransactionDate
FROM Production.TransactionHistory
) AS t_s
WHERE
NOT EXISTS
(
SELECT *
FROM Production.TransactionHistory AS t_ps
WHERE
t_ps.ProductID = t_s.ProductID
AND t_ps.TransactionDate BETWEEN
DATEADD(dd, -@active_interval, t_s.TransactionDate)
AND DATEADD(dd, -1, t_s.TransactionDate)
)
ORDER BY
ProductID,
StartDate
GO

Running this query you'll find that it works... But the results are returned a bit more slowly than we might desire--15 to 16 seconds on my end. Looking at the query plan, the reason for this becomes quite obvious: Lots and lots of table scans. How can we eliminate all of the overhead?

SQLCLR to the rescue. The best way to solve this problem--at least until the SQL Server team adds proper OVER clause support (LAG and LEAD, specifically)--is to use a cursor algorithm. We could do this in a T-SQL cursor, but why bother? Cursor logic in SQLCLR is much, much faster.

To solve the problem, I implemented an enumerator, called active_products_enumerator. The enumerator is initialized using a SqlDataReader and an "active interval" -- the number of days we're allowing for fuzziness. The DataReader is expected to return rows ordered by ProductID and TransactionDate. The enumeration process uses the following algorithm:

  1. If the current ProductID is not the same as the previous ProductID, return an end date for the previous interval and start a new one
  2. If the current period date is greater than the previous period date plus the active interval, return an end date for the previous interval and start a new one
  3. Otherwise, continue

Following is the MoveNext method for the enumerator (the complete code is attached to this post so that you can run it on your end without my bombarding you with a gigantic code-filled post):

public bool MoveNext()
{
    try
    {
        current_results = null;

        while (r.Read())
        {
            new_ProductID = r.GetSqlInt32(0);
            new_period_date = r.GetDateTime(1);

            if (new_ProductID != ProductID)
            {
                if (ProductID != 0)
                {
                    current_results = new results(
                        ProductID,
                        StartDate,
                        period_plus_interval);
                }

                ProductID = new_ProductID;
                StartDate = new_period_date;
                period_plus_interval = new_period_date.AddDays(activeInterval);

                if (current_results != null)
                    return (true);
            }
            else
            {
                if (period_plus_interval < new_period_date)
                {
                    current_results = new results(
                        ProductID,
                        StartDate,
                        period_plus_interval);

                    StartDate = new_period_date;
                }

                period_plus_interval = new_period_date.AddDays(activeInterval);

                if (current_results != null)
                    return (true);
            }
        }

        //return the last row of data
        if (ProductID != 0)
        {
            current_results = new results(
                ProductID,
                StartDate,
                period_plus_interval);

            //set this to 0 so we don't return another row
            ProductID = 0;
        }

        if (current_results != null)
            return (true);
        else
        {
            r.Dispose();
            return (false);
        }
    }
    catch
    {
        r.Dispose();
        throw;
    }
}

Put into a table-valued function, as the attached code does, this algorithm will return the same data as the T-SQL query in under a third of a second on my end--around 45 times faster than the T-SQL version.

Note that I've played some games with a loopback connection to get this whole thing to work. That's a topic for a future blog post, so stay tuned. In the meantime, please realize that you'll have to catalog the assembly with EXTERNAL_ACCESS permission to make this happen.

This post was created for T-SQL Tuesday, the revolving SQL Server blog party, hosted this month by... me. Enjoy!

Published Tuesday, December 08, 2009 3:13 PM by Adam Machanic

Attachment(s): AdamMachanic_TSQLTuesday_ActiveProducts.zip

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Twitter Trackbacks for Adam Machanic : T-SQL Tuesday #001: Exploring "Fuzzy" Interval Islands Using SQLCLR [sqlblog.com] on Topsy.com said:

December 8, 2009 3:45 PM
 

uberVU - social comments said:

This post was mentioned on Twitter by AdamMachanic: Blogged: Exploring "Fuzzy" Interval Islands Using SQLCLR - http://is.gd/5gdRE - #TSQL2sday

December 8, 2009 4:01 PM
 

Adam Machanic said:

Wow! The response to the first T-SQL Tuesday was truly amazing. We ended up with 20 great posts, from

December 9, 2009 8:19 AM
 

Peter Larsson said:

In response to Adam's new series of T-SQL Tuesday, I wanted to write that there are faster ways to get

December 9, 2009 5:30 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About Adam Machanic

Adam Machanic is a Boston-based independent database consultant, writer, and speaker. He has been involved in dozens of SQL Server implementations for both high-availability OLTP and large-scale data warehouse applications, and has optimized data access layer performance for several data-intensive applications. Adam has written for numerous web sites and magazines, including SQLblog, Simple Talk, Search SQL Server, SQL Server Professional, CoDe, and VSJ. He has also contributed to several books on SQL Server, including "Expert SQL Server 2005 Development" (Apress, 2007) and "Inside SQL Server 2005: Query Tuning and Optimization" (Microsoft Press, 2007). Adam regularly speaks at user groups, community events, and conferences on a variety of SQL Server and .NET-related topics. He is a Microsoft Most Valuable Professional (MVP) for SQL Server, a Microsoft Certified IT Professional (MCITP), and a member of the INETA North American Speakers Bureau.

This Blog

Syndication

News

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement