THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Peter Larsson

Thinking outside the box

T-SQL Tuesday #001: Exploring "Fuzzy" Interval Islands Without Using SQLCLR

In response to Adam's new series of T-SQL Tuesday, I wanted to write that there are faster ways to get the data in a set-based manner without resorting to SQLCLR.
http://sqlblog.com/blogs/adam_machanic/archive/2009/12/08/t-sql-tuesday-001-exploring-fuzzy-interval-islands-using-sqlclr.aspx

Many companies today still think of SQLCLR's as security risks, so I wanted to rewrite Adam's set-based code to a more efficient algorithm.
On my computer, Adam's code run in about 25 seconds. These two new algorithms below, runs in less than a second each. Which one to choose is how you interpret the definition Adam wrote. Also, this is not a blog post to make someone feel bad, it is a blog post to show there are other ways to write code, with better performance. You just have to think a little different. If someone is interested in how this algorithm works, please ask nice and I'll probably write a more in-depth blog post about the algorithm.
Complaining in the comments doesn't work.


Take 1:

DECLARE
 @Interval INT = 7

;WITH cteSource
AS (
    SELECT  ProductID,
            TransactionDate,
            ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY TransactionDate) AS recID
    FROM    Production.TransactionHistory
), cteMatch
AS (
    SELECT      s1.ProductID,
                s1.TransactionDate AS FromDate,
                s2.TransactionDate AS ToDate
    FROM        cteSource AS s1
    INNER JOIN  cteSource AS s2 ON s2.ProductID = s1.ProductID
    WHERE       s1.recID = s2.recID - 1
), cteYak
AS (
        SELECT      ProductID,
                    FromDate,
                    ToDate,
                    y - ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY FromDate) AS grp
        FROM        (
                        SELECT  ProductID,
                                FromDate,
                                ToDate,             
                                CASE
                                    WHEN DATEDIFF(DAY, FromDate, ToDate) <= @Interval
                                        THEN ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY FromDate)
                                    ELSE NULL
                                END AS y
                        FROM    cteMatch
                    ) AS d
        WHERE       y IS NOT NULL
)
SELECT      ProductID,
            MIN(FromDate) AS FromDate,
            MAX(ToDate) AS ToDate
FROM        cteYak
WHERE       DATEDIFF(DAY, FromDate, ToDate) <= @Interval
GROUP BY    ProductID,
            grp
ORDER BY    ProductID,
            MIN(FromDate)


This runs in less than a second.

Take2:

DECLARE @Interval INT = 7

;WITH cteSource
AS (
    SELECT  ProductID,
            TransactionDate,
            ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY TransactionDate) AS recID
    FROM    Production.TransactionHistory
), cteMatch
AS (
    SELECT      s1.ProductID,
                s1.TransactionDate AS FromDate,
                s2.TransactionDate AS ToDate
    FROM        cteSource AS s1
    INNER JOIN  cteSource AS s2 ON s2.ProductID = s1.ProductID
    WHERE       s1.recID = s2.recID - 1
), cteYak
AS (
    SELECT      ProductID,
                FromDate,
                ToDate,             
                CASE
                    WHEN DATEDIFF(DAY, FromDate, ToDate) <= @Interval THEN 1
                    ELSE -1
                END * ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY FromDate) AS recID
    FROM        cteMatch
), cteFinal
AS (
    SELECT      ProductID,
                MIN(FromDate) AS StartDate,
                DATEADD(DAY, @Interval, MAX(ToDate)) AS EndDate,
                1 + MAX(recID) AS recID
    FROM        (
                    SELECT  ProductID,
                            FromDate,
                            ToDate,
                            recID - ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY FromDate) AS grp,
                            recID
                    FROM    cteYak
                    WHERE   recID > 0
                ) AS d
    GROUP BY    ProductID,
                grp
 
    UNION ALL
 
    SELECT      ProductID,
                FromDate AS StartDate,
                DATEADD(DAY, @Interval, FromDate) AS EndDate,
                ABS(recID) AS recID
    FROM        cteYak
    WHERE       recID < 0
)

SELECT      ProductID,
            MIN(StartDate) AS StartDate,
            MAX(EndDate) AS EndDate
FROM        cteFinal
GROUP BY    ProductID,
            recID


This also runs in less than a second. And produces the exact same result as Adam's code.

//Peter

Published Wednesday, December 09, 2009 12:20 AM by Peso

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

james said:

I guess you were in such a hurry to try to prove someone else wrong that you could not bother with formatting or commentary on how this works? At least Adam explained his code.

December 8, 2009 9:59 PM
 

Adam Machanic said:

Hi Peter,

Not sure why James didn't like your formatting, but I have to agree that some explanation would have been helpful. I just stared at your query for 10 minutes and I'm not exactly sure what you were shooting for.

What I do know is that on my end it's only marginally faster than my T-SQL version, if I create a supporting index on (ProductID, TransactionDate), and it doesn't return the same results--my version returns 4020 rows; yours only 2015. Did you bother testing this?  I'll stick with SQLCLR for now--it's even faster with the indexes (an order of magnitude), and has the added benefit of correctness, which is just a little bit important when you're using these things in the real world. :-)

December 8, 2009 10:40 PM
 

Peso said:

This was NOT to prove someone wrong. If you read the first sentences, I wrote that there are other ways to skin this cat.

Adam's T-SQL runs in 25 seconds on my machine and the suggestion above runs in less than a seconds. If something was proven, it should be that there almost always are other ways to write code.

As the SQLCLR way, not many companies allow them as they see them as a security risk.

James, did you even digest the query? Why did you stare at the whole query for 10 minutes? Run the individual cte's and you'd see what they do.

Yes, I tested ths version and it produces the correct result, as far as I understood the requirements. As long as there are less then 7 days between two sequential dates, they belong to the same group. I am using a version of AdventureWorks where there are 45 records for Product 1. Maybe we are using different versions of AdventureWorks?

As for Adam's code, I get these first two lines with his code

1 2003-09-05 00:00:00.000 2003-09-23 00:00:00.000

1 2003-09-26 00:00:00.000 2003-10-24 00:00:00.000

The end dates doesn't even exists in the database, in the version I have.

December 9, 2009 12:59 AM
 

Peso said:

Here is how I interpreted the requirements

ProdID  TranDate    grp

1       2003-09-05  1

1       2003-09-12  1

1       2003-09-16  1

1       2003-09-26  2

1       2003-10-03  2

1       2003-10-10  2

1       2003-10-17  2

1       2003-11-07  3

As long as there are 7 days or less between two sequential dates, they belong to the same group.

December 9, 2009 1:02 AM
 

Adam Machanic said:

Peter,

Why did I stare at your query? Because I didn't feel like dissecting it and running all of the CTEs. No one does, ever. Not on a blog post, and not in real production code. That's why we write comments in our code, and that's why on blogs we add text to explain what's happening. How is anyone supposed to know what in the world a "cteYak" is?

If you read my post you would understand why the end dates are later than actual dates that exist. Those are EFFECTIVE dates. But even lacking that, your query doesn't return correct results. According to your query, for example, the period between 2004-04-04 and 2004-05-08 is not covered for ProductID 1. But if you look at the source data, you'll see that there are a few rows in that period:

2004-04-12 00:00:00.000

2004-04-20 00:00:00.000

2004-04-28 00:00:00.000

December 9, 2009 9:59 AM
 

Peso said:

But those are 8 days apart?

December 9, 2009 10:07 AM
 

Adam Machanic said:

Peter,

Did you read the original post? I outlined the requirements, as well as the algorithm, in quite a bit of detail. Perhaps a re-read is in order.

By the way, why is this post titled the same as mine, when yours is about NOT using SQLCLR?

December 9, 2009 11:00 AM
 

Adam Machanic said:

Wow! The response to the first T-SQL Tuesday was truly amazing. We ended up with 20 great posts, from

December 9, 2009 11:10 AM
 

Adam Machanic said:

Hi Peter,

Alas, your new version is still not returning the correct results on my end. It returns 8620 rows instead of the expected 4020. I took a quick peek through the results and immediately noticed ProductID 999, which has the following rows (among others):

2003-09-01 00:00:00.000 2004-06-30 00:00:00.000

2003-09-01 00:00:00.000 2004-07-01 00:00:00.000

2003-09-01 00:00:00.000 2004-07-08 00:00:00.000

2003-09-01 00:00:00.000 2004-07-07 00:00:00.000

... so apparently it's not finding the start dates in quite the right way.

December 9, 2009 10:35 PM
 

Adam Machanic said:

Okay, you can fix it by changing the body of cteSource to the following (but the SQLCLR version is still much faster, just for the record):

---

   SELECT DISTINCT

ProductID,

       TransactionDate,

       DENSE_RANK() OVER (PARTITION BY ProductID ORDER BY TransactionDate) AS recID

   FROM Production.TransactionHistory

   UNION ALL

   SELECT

ProductID,

MAX(TransactionDate)

COUNT(DISTINCT TransactionDate) + 1

FROM Production.TransactionHistory

GROUP BY

ProductID

---

December 9, 2009 10:52 PM
 

Adam Machanic said:

Hi Peter,

FYI, I just tried your technique in my real project, against a set of data with 240 million rows. Interestingly, once an index is created, the "basic" set-based method in my post is significantly faster than the ROW_NUMBER method against this particular data set.

I'm working on tuning the ROW_NUMBER method to see if I can get better performance, and have already gotten it much better by doing using a MERGE hint in the "match" CTE and ordering the outer ROW_NUMBERs by the first recId produced, rather than by the date, which eliminates several intermediate sorts. But it's still nowhere near as fast as the simple/naive solution.

January 7, 2010 4:21 PM
 

Peso said:

What about this rewrite?

DECLARE @Interval INT = 7

;WITH cteSingle(ProductID, TransactionDate, recID)

AS (

SELECT ProductID,

TransactionDate,

ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY TransactionDate) AS recID

FROM Production.TransactionHistory

-- WHERE ProductID = 1

), cteLower(ProductID, StartDate, recID)

AS (

SELECT s.ProductID,

s.TransactionDate AS EndDate,

ROW_NUMBER() OVER (PARTITION BY s.ProductID ORDER BY s.TransactionDate) AS recID

FROM cteSingle AS s

LEFT JOIN cteSingle AS t ON t.ProductID = s.ProductID

AND t.recID = s.recID - 1

WHERE DATEADD(DAY, @Interval, t.TransactionDate) < s.TransactionDate

OR t.ProductID IS NULL

), cteUpper(ProductID, EndDate, recID)

AS (

SELECT s.ProductID,

DATEADD(DAY, @Interval, s.TransactionDate) AS EndDate,

ROW_NUMBER() OVER (PARTITION BY s.ProductID ORDER BY s.TransactionDate) AS recID

FROM cteSingle AS s

LEFT JOIN cteSingle AS t ON t.ProductID = s.ProductID

AND t.recID = s.recID + 1

WHERE DATEADD(DAY, @Interval, s.TransactionDate) < t.TransactionDate

OR t.ProductID IS NULL

)

SELECT l.ProductID,

l.StartDate,

u.EndDate

FROM cteLower AS l

INNER JOIN cteUpper AS u ON u.ProductID = l.ProductID

WHERE l.recID = u.recID

ORDER BY l.ProductID,

l.recID

January 7, 2010 6:45 PM
 

Peso said:

Here are my measurements for the complete table Production.TransactionHistory in AdventureWorks.

RowCount – 4020

CPU – 1560

Duration – 1856

Reads – 3367

And for comparison, here are the measurements for Adam’s original query

RowCount – 4020

CPU – 35303

Duration – 24216

Reads – 1976780

January 7, 2010 6:52 PM
 

Paul White said:

This may be the least clear blog post I have ever seen.  I'm sure the posted T-SQL code is extremely clever, but it's completely opaque as presented.  The code to text ratio is quite scary.

Not complaining in the comments as such, just giving feedback.

Oh actually I am going to complain about one thing: the whole 'resorting to SQLCLR' emphasis and hand-wavy 'security risks' thing...you can do better ;c)

Paul

February 3, 2010 7:18 AM
 

Shawn T. said:

My respect for a couple of MVPs just fell a few notches. Seems to me, first of all, there might be some unfortunate confusion related to the imperfect English of a non-native speaker. But worse is the pathetic crying about the formatting and comments.

I've had to do a log of reverse engineering in my software career. It really isn't as difficult when you know you can trust the intelligence of the original writer and I've seen enough of Peter's blog to trust that he has plenty.

I'd take a lot more of these incomplete posts over the larger portion of idiocy I'm constantly wading through out on the internet forums.

February 6, 2011 3:11 PM

Leave a Comment

(required) 
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement