THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

Peter Larsson

Thinking outside the box

Relational division

I came across an interesting post on Microsoft SQL Server forum this afternoon. It was a question about Relational algebra and the poster wanted to have an efficient query to solve his problem. The original title was "Select parent ids that have exact match on child values (no more, no less)".

The problem could be solved with relational division, but there is no such operator in SQL Server. Maybe there will be some day.

But for now there is no such operator, so we as developers have to find our own ways.
First prepare and populate some sample data

-- Prepare sample data
            ParentID INT NOT NULL,
            Keyword VARCHAR(25) NOT NULL,
            UNIQUE (ParentID, Keyword)

-- Populate sample data
INSERT  @Sample
VALUES  (1, 'one'),
        (1, 'two'),
        (1, 'three'),
        (1, 'four'),
        (2, 'one'),
        (2, 'two'),
        (2, 'three'),
        (3, 'one'),
        (3, 'two')

People had already been active and posted some solutions, of which this common query was present.

SELECT      s.ParentID
FROM        @Sample AS s
WHERE       s.Keyword IN ('one', 'two', 'three')
GROUP BY    s.ParentID
HAVING      COUNT(DISTINCT s.Keyword) = 3
            AND COUNT(DISTINCT s.Keyword) = (SELECT COUNT(*) FROM @Sample AS x WHERE x.ParentID = s.ParentID)

and this type of query

SELECT      s.ParentID
FROM        @Sample AS s
WHERE       s.Keyword IN ('one', 'two', 'three') 
            AND NOT EXISTS (
                            SELECT  *
                            FROM    @Sample AS x
                            WHERE   x.ParentID = s.ParentID
                                    AND x.Keyword NOT IN ('one', 'two', 'three')
GROUP BY    s.ParentID
HAVING      COUNT(DISTINCT s.Keyword) = 3

The good thing is that both produce the same wanted result but the bad thing is the inefficient execution plans.
Then one poster did his homework and read about Mr Celko and translated his algorithm to the current problem, and then the query looked like this

SELECT      ParentID
FROM        (
                SELECT  ParentID,
                        COUNT(*) OVER (PARTITION BY ParentID) AS cnt
                FROM    @Sample
            ) AS w
WHERE       Keyword IN ('one', 'two', 'three')
GROUP BY    ParentID
HAVING      MIN(cnt) = 3

Well, let's just say the execution plan is the worst of them all. The query does produce the correct result.
With a little different angle, you get a better plan for Mr Celkos query.

FROM        (
                SELECT      ParentID,
                            COUNT(*) AS cnt
                FROM        @Sample
                GROUP BY    ParentID
            ) AS w
INNER JOIN  @Sample AS s ON s.ParentID = w.ParentID
WHERE       s.Keyword IN ('one', 'two', 'three')
GROUP BY    s.ParentID
HAVING      MIN(w.cnt) = 3

With these queries in mind, I thought about the problem and realized the problem did in fact have a much simpler solution.
The query I came up with is the simplest of them all, and just does one pass of the source table. Yes, only one pass just as the first Celko query for relational division, but without the internal worktable.
This is the query I came up with

-- Peso
SELECT      ParentID
FROM        @Sample
GROUP BY    ParentID
HAVING      MIN(CASE WHEN Keyword IN ('one', 'two', 'three') THEN 1 ELSE 0 END) = 1
            AND SUM(CASE WHEN Keyword IN ('one', 'two', 'three') THEN 1 ELSE 0 END) = 3

How does the query work? The second aggregation filtering just makes sure all three keywords are present.
But the first aggregation filter? What does it do? To simplify, I just write that it takes care of the modulo part of the relational division. There cannot be a "fractional" part of the relational division, because it means that particular ParentID has more keywords than wanted.

Simple as that.


PS. These are the textual execution plans for the four types of queries and then mine.

  |--Filter(WHERE:([Expr1003]=CASE WHEN [Expr1007] IS NULL THEN (0) ELSE [Expr1007] END))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([s].[ParentID]))
            |    |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1014],0)))
            |         |--Stream Aggregate(GROUP BY:([s].[ParentID]) DEFINE:([Expr1014]=Count(*)))
            |              |--Index Scan(OBJECT:(@Sample AS [s]),  WHERE:(@Sample.[Keyword] as [s].[Keyword]='one' OR @Sample.[Keyword] as [s].[Keyword]='three' OR @Sample.[Keyword] as [s].[Keyword]='two') ORDERED FORWARD)
            |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1015],0)))
                 |--Stream Aggregate(DEFINE:([Expr1015]=Count(*)))
                      |--Index Seek(OBJECT:(@Sample AS [x]), SEEK:([x].[ParentID]=@Sample.[ParentID] as [s].[ParentID]) ORDERED FORWARD)

       |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1010],0)))
            |--Stream Aggregate(GROUP BY:([s].[ParentID]) DEFINE:([Expr1010]=Count(*)))
                 |--Nested Loops(Left Anti Semi Join, OUTER REFERENCES:([s].[ParentID]))
                      |--Index Scan(OBJECT:(@Sample AS [s]),  WHERE:(@Sample.[Keyword] as [s].[Keyword]='one' OR @Sample.[Keyword] as [s].[Keyword]='three' OR @Sample.[Keyword] as [s].[Keyword]='two') ORDERED FORWARD)
                      |--Index Seek(OBJECT:(@Sample AS [x]), SEEK:([x].[ParentID]=@Sample.[ParentID] as [s].[ParentID]),  WHERE:(@Sample.[Keyword] as [x].[Keyword]<>'one' AND @Sample.[Keyword] as [x].[Keyword]<>'three' AND @Sample.[Keyword] as [x].[Keyword]<>'two') ORDERED FORWARD)

       |--Stream Aggregate(GROUP BY:([ParentID]) DEFINE:([Expr1005]=MIN([Expr1004])))
            |--Filter(WHERE:([Keyword]='one' OR [Keyword]='three' OR [Keyword]='two'))
                 |--Nested Loops(Inner Join)
                      |--Table Spool
                      |    |--Segment
                      |         |--Index Scan(OBJECT:(@Sample), ORDERED FORWARD)
                      |--Nested Loops(Inner Join, WHERE:((1)))
                           |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1008],0)))
                           |    |--Stream Aggregate(DEFINE:([Expr1008]=Count(*)))
                           |         |--Table Spool
                           |--Table Spool

       |--Stream Aggregate(GROUP BY:([s].[ParentID]) DEFINE:([Expr1008]=MIN([Expr1004])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([ParentID]))
                 |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1012],0)))
                 |    |--Stream Aggregate(GROUP BY:([ParentID]) DEFINE:([Expr1012]=Count(*)))
                 |         |--Index Scan(OBJECT:(@Sample), ORDERED FORWARD)
                 |--Index Seek(OBJECT:(@Sample AS [s]), SEEK:([s].[ParentID]=[ParentID] AND [s].[Keyword]='one' OR [s].[ParentID]=[ParentID] AND [s].[Keyword]='three' OR [s].[ParentID]=[ParentID] AND [s].[Keyword]='two') ORDERED FORWARD)

  |--Filter(WHERE:([Expr1004]=(1) AND [Expr1005]=(3)))
       |--Stream Aggregate(GROUP BY:([ParentID]) DEFINE:([Expr1004]=MIN([Expr1006]), [Expr1005]=SUM([Expr1006])))
            |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Keyword]='three' OR [Keyword]='two' OR [Keyword]='one' THEN (1) ELSE (0) END))
                 |--Index Scan(OBJECT:(@Sample), ORDERED FORWARD)

Published Wednesday, June 30, 2010 10:03 PM by Peso



Adam Machanic said:

I think you forgot to mention what the problem is?

June 30, 2010 4:11 PM

Peso said:

Oh, I thought the title said it all.

The problem is to find an efficient query to deal with relational algebra.

Mr Celko has a wonderful description of Relational Division and some examples here

The problem is that they are not very efficient queries. It's all about to find group of connected records which is evenly divided by a number of wanted records.

For the example above, to find the ParentID(s) which have exactly ONE, TWO and THREE as keywords. Not less and certainly not more.

June 30, 2010 4:51 PM

Adam Machanic said:

A more interesting challenge--and much closer to what Celko has done--is to make this work with TWO tables. The one you already have, plus:



           KeyID INT NOT NULL,

           Keyword VARCHAR(25) NOT NULL,

           UNIQUE (KeyID, Keyword)


INSERT @Sample2

VALUES  (6, 'one'),

       (6, 'two'),

       (6, 'three'),

       (6, 'four'),

       (7, 'one'),

       (7, 'two'),

       (7, 'three'),

       (8, 'one'),

       (8, 'two')

Then the output of the division query would be:

ParentID -- KeyID

1           6

2           7

3           8

July 2, 2010 10:22 AM

Adam Machanic said:

Actually there should probably be some invalid stuff in there too:

INSERT @Sample2

VALUES  (6, 'one'),

      (6, 'two'),

      (6, 'three'),

      (6, 'four'),

      (7, 'one'),

      (7, 'two'),

      (7, 'three'),

      (8, 'one'),

      (8, 'two'),

      (9, 'one'),

      (9, 'two'),

      (9, 'four')

July 2, 2010 10:28 AM

Peso said:

July 2, 2010 11:17 AM

Adam Machanic said:

Why do you only cross post certain things and then not the followup? Seems counter productive--and it's impossible for readers to keep track.

July 2, 2010 11:42 AM

Naomi said:

Agree with the last comment - post the whole solution with as many needed cross references in one blog.

August 9, 2010 12:18 PM
New Comments to this post are disabled
Privacy Statement