Made By SQL: How SQL Server Optimizer calculates the Estimated Number of Rows

Have you ever wondered how SQL Server Optimizer
calculates the "Estimated Number of Rows"
when you have multiple predicates/filters & joins?

Well, you can find a brief summary below.
I assume that you are already familiar with the Statistics
(Density Vector, Histogram).
The algorithms depend if you use local variables,
parameters, multiple predicates
or a single predicate legacy CE, new CE, and more.

It is important to note when the SQL Server optimizer uses Density Vector
and when histogram.

Density is the ratio of unique values within the given column or a set of columns.
Density = 1 / (Number of distinct values for a column or set of column)

The Density vector is used when the optimizer cannot sniff the predicate value,
for example:

1) When you use local variables as predicates/filters in a stored procedure
or in a batch query

2) When you use OPTION (OPTIMIZE FOR UNKNOWN)

3) When the parameter sniffing feature is disabled

The Histogram is used when the optimizer is able to sniff the predicate value,
for example:

1) When you use input parameters as predicates/filters in a stored procedure

2) When you use sp_executesql with parameters

3) When you use local variables with OPTION(RECOMPILE)

So let`s review common use cases:

1) The table below refers to the use case when you use one filter
as a local variable and the statistics exist on the filtered column/index.
This also applies when you use OPTION (OPTIMIZE FOR UNKNOWN)
or when the parameter sniffing feature is disabled.

DECLARE @CustomerId INT =123

SELECT *
FROM dbo.Customers
WHERE CustomerId = @CustomerId

	Local Variables
Filter	New CE	Old/Legacy CE
Column = @Value	Density Vector x Number of Rows In Table
Col >, >=, <, <= @Value	30% of Number of Rows In Table
Between	0.3 * SQRT(0.3)	TableCardinality * 0.09
Like	TableCardinality * 0.09	TableCardinality(0.09 / 6 LOG(D))

Table Cardinality - Total Number of Rows In Table
Density = 1 / ( # distinct values for column(s) )
D - Average length rounded down to integer (Average data length is taken from statistics density vector)

2) For use case, when you use one parameterized filter or literal,
The estimated number of rows will be selected from the histogram.
For example: Input parameters in a stored procedure or sp_executesql

SELECT *
FROM dbo.Customers
WHERE  CustomerId = 123

3) When you have a query that uses more than one column predicate/filter:

SELECT  *
FROM dbo.Customers
WHERE  CustomerId = 123
AND Name ='XYZ'

Filter	New CE	Old/Legacy CE
AND	Table Cardinality * S1 * SQRT(S2) * SQRT(SQRT(S3)) * SQRT(SQRT(SQRT(S4)))	Table Cardinality * (S1 * S2 * S3)
OR		Table Cardinality * ( (S1 + S2) – (S1 * S2) )

Table Cardinality - Total Number of Rows In Table

Selectivity(S) = (# rows that pass the predicate) / (# rows in the table)

S1 means selectivity of the first filter, S2 - selectivity of the second filter and etc

Let`s see what was has changed in the new cardinality estimator compared
to the legacy estimator:

	New CE	Legacy CE
Correlation for multi-predicate How the estimates calculated when you have multiple predicates on a table	Assumes there is some correlation between the predicates: Table Cardinality * S1 * SQRT(S2) * SQRT(SQRT(S3)) * SQRT(SQRT(SQRT(S4)))	Selectivity of multiple predicates computed as the multiplication of individual sensitivitiesTable Cardinality * (S1 * S2 * S3)
Out-of-range value estimation When the filtered value is not in range in the histogram	Table cardinality * Density	Estimation is 1 row
Join Estimates How to estimate the JOIN result	Base containment Calculate the JOIN selectivity by two histograms without applying the predicates, Then multiple the JOIN selectivity with Predicate sensitivities. Meaning filter predicates are independent of the join selectivity	Simple containment After applying filters The two histograms, are merged together, assuming containment. Meaning JOIN selectivity is dependent on the filters

According to Microsoft, regrading the JOIN Estimates:

The Legacy CE model assumes that users always query for data that exists.
This means that, for a join predicate that involves an equijoin operation for two tables,
the joined columns exist on both sides of the join.
In the presence of additional non-join filter predicates against the join table,
the Legacy CE assumes some level of correlation for the join predicates
and non-join filter predicates. This implied correlation is called Simple Containment.

Alternatively, the New CE uses Base Containment as the correlation.
The New CE model assumes that users might query for data that does not exist.
This means that the filter predicates on separate tables may not be correlated with each other.
Therefore, we use a probabilistic approach

Made By SQL

Pages

Friday, May 1, 2020

How SQL Server Optimizer calculates the Estimated Number of Rows

No comments:

Post a Comment