## Introduction

Back in January 2014, I wrote an article for SQLperformance.com describing the cardinality estimation process for queries with multiple predicates, from the point of view of the old and new cardinality estimators. The article describes the various behaviours and formulas involved, along with the usual sprinkling of documented and undocumented trace flags.

I was able to describe the formula SQL Server 2014 uses to calculate a cardinality estimate for multiple predicates connected by `AND`

(conjunctive predicates), which was already relatively well-known. Despite some fairly energetic research, and basic-to-intermediate skills with Excel, I was unable to deduce a similar formula for the disjunctive case, where predicates are connected by `OR`

. The trace flag output I describe in the other article clearly showed that exponential backoff was used in the new 2014 cardinality estimator, but the precise formula eluded me.

I gave up on it until a few weeks ago, when I was invited to contribute to the a technical review of a White Paper Joe Sack was writing with assistance from Microsoft. One of the small contributions I was able to make was to point out that exponential backoff was used for disjunctions as well as conjunctions. The paper has now been released so I can now write about the detail behind the statement that appears there concerning cardinality estimation for disjunctive predicates (thank you Joe for pointing this out to me):

## A Bit Of Boolean

The key bit of new information is the second sentence. As soon as I read it, a vague memory from my patchy education came back to me, as well as a cunning query rewrite Itzik Ben-Gan shows in one of his books. Both relate to a set of rules for transforming Boolean algebra, known as DeMorgan’s Laws. Wikipedia has this to say:

This gives us a way to turn a disjunction into a conjunction. It is a short step from there to understand that *exponential backoff* for disjunctions in the SQL Server 2014 cardinality estimator works by applying DeMorgan’s laws. Converting the troublesome `OR`

to an `AND`

(with the appropriate negations) allows us to reuse the `AND`

*exponential backoff* calculation for `OR`

queries!

## The Theory

Borrowing from the Wikipedia article again, we have DeMorgan’s Laws stated as:

The interesting transformation from disjunction to conjunction can be expressed as:

“**not (A or B)** is the same as **(not A) and (not B)**”

This does not quite give us a direct translation from `OR`

to `AND`

. We need an extra step of observing:

“**(A or B)** is the same as **not (not (A or B))**”

This trivial double-negative means we can now directly substitute the **not (A or B)** above, to get:

“**(A or B)** is the same as **not ((not A) and (not B))**”

This gives us `OR`

expressed in terms of `NOT`

and `AND`

.

The very last bit of theory is to remember that A and B are probabilities between 0 and 1, so:

**(not X)** = **(1 – X)**

## Worked Example

Now we have all we need to show how the disjunctive exponential backoff calculation works in SQL Server 2014:

SELECT

COUNT_BIG(*) AS NumRows

FROM Production.TransactionHistory AS TH

WHERE

TH.TransactionID BETWEEN 100000 AND 168412

OR TH.TransactionDate BETWEEN '20070901' AND '20080313';

There are two predicates in this query, connected by `OR.`

The selectivity of the predicate on `Transaction ID`

can be derived directly from the statistics histogram:

DBCC SHOW_STATISTICS

(

'Production.TransactionHistory',

'PK_TransactionHistory_TransactionID'

);

This very compact histogram gives us all the information we need to say that the range of **68,413** `Transaction ID`

values can be expected to produce **68,413** rows.

Expressed as a fraction of the **113,443** rows in the table, the selectivity of this predicate is **68413 / 113443** = **0.603061**. We will call this value **selectivity A**.

The second predicate (on `TransactionDate`

) is estimated from the histogram in a similar way (except the histogram has more steps than is convenient to show here). The histogram calculation results in a cardinality estimate for this predicate that is also **68,413** (because I deliberately chose the predicates to qualify exactly the same physical rows). The selectivity is likewise **0.603061**. We will call this value **selectivity B**.

## Enter DeMorgan

We know from our work with DeMorgan that:

“**(A or B)** is the same as **not ((not A) and (not B))**”

So we can rewrite our **(A or B)** predicate immediately as **not ((not A) and (not B)),** enabling us to avoid worrying about the OR.

Note this transform is done purely for cardinality estimation purposes. The optimizer tree is not affected by this work.

## Doing the Math

Now, given that selectivities **A** and **B** are both equal to **0.603061** in this contrived example, our expanded expression becomes:

**not ((not ****0.603061**) and (not **0.603061**))

We also know:

**(not X)** = **(1 – X)**

Expanding the inner **not**s means the computation becomes:

**not ((1 - ****0.603061**) and (1 - **0.603061**))

We can perform the numeric math inside the parentheses to get:

**not (0.396939 and 0.396939)**

Now we use *exponential backoff* to compute the **and** selectivity:

**not (0.396939 * SQRT(0.396939))**

And replace the outer **not** just as we did before:

**1 - (0.396939 * SQRT(0.396939))**

Now we have just numbers. The final selectivity is **0.749916**.

## The Result

The very final step is to multiply this selectivity by the cardinality of the base table to get our overall estimate:

**0.749916 * 113443 = 85073 rounded up**

The 2014 execution plan for this query, when using the new cardinality estimator, is:

The cardinality estimate is **85,073** rows, just as we predicted!

Sadly, the *actual* number of rows returned by this query is **68,413 rows** (because both predicates qualify exactly the same rows). Nothing is perfect, and cardinality estimation doubly so.

© 2014 Paul White

email: SQLkiwi@gmail.com

twitter: @SQL_Kiwi (note the underscore!)

## Comments

## RichB said:

So.... do double notted ands get better results than old ors?

## Paul White said:

Hi RichB,

The pre-2014 formula was (S1 + S2) - (S1 * S2); see the linked article for details).The exponential backoff approach in 2014 might be "better" or "worse" for any particular query, it's hard to generalize.

## Rick Willemain said:

Thank you ! Fascinating topic and well written !!

## SQLScotsman said:

Paul,

Great article. In your SQLPerformance article that you've linked, you've stated the the new CE formula for conjuctive predictes can be expressed as:

Estimate = C * S1 * SQRT(S2) * SQRT(SQRT(S3)) * SQRT(SQRT(SQRT(S4)))

So, from the article above, are you saying that the new CE formula for disjunctive predicates can be expressed as:

Estimate = C * (1 - (S1 * SQRT(S2)))