Pages

Friday, May 1, 2020

How SQL Server Optimizer calculates the Estimated Number of Rows

Have you ever wondered how SQL Server Optimizer
calculates the "Estimated Number of Rows"
when you have multiple predicates/filters & joins?

Well, you can find a brief summary below.
I assume that you are already familiar with the Statistics
(Density Vector, Histogram).
The algorithms depend if you use local variables,
parameters, multiple predicates
or a single predicate legacy CE, new CE, and more.



It is important to note when the SQL Server optimizer uses Density Vector
and when histogram.

Density is the ratio of unique values within the given column or a set of columns.
Density  = 1 / (Number of distinct values for a column or set of column)

The Density vector is used when the optimizer cannot sniff the predicate value,
for example:

   1) When you use local variables as predicates/filters in a stored procedure
       or in a batch query

  2) When you use OPTION (OPTIMIZE FOR UNKNOWN)

  3) When the parameter sniffing feature is disabled

The Histogram is used when the optimizer is able to sniff the predicate value,
for example:

1) When you use input parameters as predicates/filters in a stored procedure

2) When you use sp_executesql with parameters

3) When you use local variables with OPTION(RECOMPILE)

So let`s review common use cases:

1) The table below refers to the use case when you use one filter
    as a local variable and the statistics exist on the filtered column/index.
    This also applies when you use OPTION (OPTIMIZE FOR UNKNOWN)
    or when the parameter sniffing feature is disabled.

    DECLARE @CustomerId INT =123

    SELECT  *
    FROM     dbo.Customers
    WHERE  CustomerId = @CustomerId

Local Variables
                  FilterNew CEOld/Legacy CE
Column = @ValueDensity Vector x Number of Rows In Table
Col >, >=, <, <= @Value30% of Number of Rows In Table
Between0.3 * SQRT(0.3)TableCardinality * 0.09
LikeTableCardinality * 0.09TableCardinality*(0.09 / 6 * LOG(D))

Table Cardinality - Total Number of Rows In Table
Density = 1 / ( # distinct values for column(s) )
D - Average length rounded down to integer (Average data length is taken from statistics density vector)



2) For use case, when you use one parameterized filter or literal,
    The estimated number of rows will be selected from the histogram.
    For example: Input parameters in a stored procedure or sp_executesql

    SELECT *
    FROM    dbo.Customers
    WHERE  CustomerId = 123


3) When you have a query that uses more than one column predicate/filter:

    SELECT  *
    FROM     dbo.Customers
    WHERE  CustomerId = 123
                 AND Name ='XYZ'


FilterNew CEOld/Legacy CE
ANDTable Cardinality * S1 * SQRT(S2) * SQRT(SQRT(S3)) * SQRT(SQRT(SQRT(S4)))Table Cardinality * (S1 * S2 * S3)
ORTable Cardinality * ( (S1 + S2) – (S1 * S2) )

Table Cardinality - Total Number of Rows In Table
Selectivity(S) = (# rows that pass the predicate) / (# rows in the table)
S1 means selectivity of the first filter, S2 - selectivity of the second filter and etc

Let`s see what was has changed in the new cardinality estimator compared
to the legacy estimator:

New CELegacy CE
Correlation for multi-predicate
How the estimates calculated when
you have multiple predicates on a table
Assumes there is some correlation between the predicates:
Table Cardinality * S1 * SQRT(S2) * SQRT(SQRT(S3)) * SQRT(SQRT(SQRT(S4)))
Selectivity of multiple predicates computed as the multiplication of individual sensitivitiesTable Cardinality * (S1 * S2 * S3)
Out-of-range value estimation
When the filtered value is
not in range in the histogram
Table cardinality * DensityEstimation is 1 row
Join Estimates
How to estimate the JOIN result
Base containment
Calculate the JOIN selectivity
by two histograms
without applying the predicates,
Then multiple the JOIN selectivity with Predicate sensitivities.
Meaning filter predicates are independent of the join selectivity
Simple containment
After applying filters
The two histograms,
are merged together,
assuming containment.

Meaning JOIN selectivity
is dependent on the filters

According to Microsoft, regrading the JOIN Estimates:

The Legacy CE model assumes that users always query for data that exists.
This means that, for a join predicate that involves an equijoin operation for two tables,
the joined columns exist on both sides of the join.
In the presence of additional non-join filter predicates against the join table,
the Legacy CE assumes some level of correlation for the join predicates
and non-join filter predicates. This implied correlation is called Simple Containment.

Alternatively, the New CE uses Base Containment as the correlation.
The New CE model assumes that users might query for data that does not exist.
This means that the filter predicates on separate tables may not be correlated with each other.
Therefore, we use a probabilistic approach

No comments:

Post a Comment