staging-v23.1.20: opt/memo: improve zigzag join cost and selectivity estimation with multi-column stats #123152
Backport 4/4 commits from #120805.
/cc @cockroachdb/release
opt: update seek and distribution cost of zigzag join to match scan
Prior to this commit, the optimizer could prefer a zigzag join over a
scan even if they produced the same number of rows. This was because scans
always included the cost of at least one seek (involving random I/O) and
some distribution cost, while zigzag joins did not. This commit updates
the cost of zigzag joins to include seek and distribution costs so they
will never be chosen over scans unless they produce fewer rows.
This change is behind the setting optimizer_use_improved_zigzag_join_costing.
Release note (performance improvement): Added a new setting
optimizer_use_improved_zigzag_join_costing. When enabled, the cost of zigzag
joins is updated so they will never be chosen over scans unless they
produce fewer rows. This change only matters if the setting enable_zigzag_join
is also true.
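
The effect of the costing change described above can be illustrated with a minimal sketch. This is not the optimizer's actual coster: the constants, the function names (scanCost, zigzagCost, preferZigzag), and the linear per-row cost model are all hypothetical and chosen only to show why a missing seek and distribution cost lets a zigzag join win a tie on row count.

```go
package main

import "fmt"

// All constants below are hypothetical values chosen for illustration;
// they are not the optimizer's real cost factors.
const (
	seekCost         = 4.0 // assumed cost of one seek (random I/O)
	distributionCost = 1.0 // assumed fixed cost of distributing results
	perRowCost       = 0.1 // assumed cost of producing one row
)

// scanCost always includes at least one seek and some distribution cost.
func scanCost(rows float64) float64 {
	return seekCost + distributionCost + rows*perRowCost
}

// zigzagCost models the zigzag join cost. With improved=false it omits
// the seek and distribution costs; with improved=true it includes them,
// mirroring the behavior gated by the new setting.
func zigzagCost(rows float64, improved bool) float64 {
	cost := rows * perRowCost
	if improved {
		cost += seekCost + distributionCost
	}
	return cost
}

// preferZigzag reports whether the zigzag join is strictly cheaper.
func preferZigzag(scanRows, zigzagRows float64, improved bool) bool {
	return zigzagCost(zigzagRows, improved) < scanCost(scanRows)
}

func main() {
	// Same row count, old costing: the zigzag join wins only because
	// it is missing the fixed costs.
	fmt.Println(preferZigzag(100, 100, false))
	// Same row count, improved costing: the costs tie, so the zigzag
	// join is no longer preferred.
	fmt.Println(preferZigzag(100, 100, true))
	// Fewer rows, improved costing: the zigzag join can still win.
	fmt.Println(preferZigzag(100, 10, true))
}
```

Under these assumed numbers the zigzag join is preferred at equal row counts only when the fixed costs are omitted, which is the situation the commit removes.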
opt/memo: improve selectivity estimation with multi-column stats
This commit updates correlationFromMultiColDistinctCounts in statisticsBuilder
to use a tighter lower bound for the multi-column selectivity. This avoids
cases where we significantly over-estimate the selectivity of a multi-column
predicate.
Fixes #121397
Release note (performance improvement): Improved the selectivity estimation of
multi-column filters when the multi-column distinct count is high. This avoids
cases where we significantly over-estimate the selectivity of a multi-column
predicate and as a result can prevent the optimizer from choosing a bad query
plan.
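
To make the role of the multi-column distinct count concrete, here is a minimal sketch of the general idea for an equality filter on two columns. It is not the logic in statisticsBuilder: the function names, the two-column restriction, and the clamping approach are assumptions for illustration. The sketch treats the independence assumption as a lower bound on selectivity, full correlation as an upper bound, and uses the multi-column distinct count to place the estimate between them.

```go
package main

import (
	"fmt"
	"math"
)

// independentSelectivity assumes the two columns are uncorrelated, so
// the selectivities of the single-column equality filters multiply.
func independentSelectivity(distinctA, distinctB float64) float64 {
	return 1 / (distinctA * distinctB)
}

// correlatedSelectivity assumes the columns are fully correlated, so
// the most selective single-column filter determines the result.
func correlatedSelectivity(distinctA, distinctB float64) float64 {
	return math.Min(1/distinctA, 1/distinctB)
}

// estimateSelectivity (hypothetical) uses the multi-column distinct
// count and clamps the result to the two bounds above.
func estimateSelectivity(distinctA, distinctB, distinctAB float64) float64 {
	lower := independentSelectivity(distinctA, distinctB)
	upper := correlatedSelectivity(distinctA, distinctB)
	return math.Max(lower, math.Min(upper, 1/distinctAB))
}

func main() {
	// Illustrative statistics: 100 distinct values per column and
	// 5000 distinct (a, b) pairs.
	est := estimateSelectivity(100, 100, 5000)
	fmt.Println(est > independentSelectivity(100, 100))
	fmt.Println(est < correlatedSelectivity(100, 100))
}
```

When the multi-column distinct count is high, an estimate of this kind sits close to the independence bound, well below the fully correlated bound, which is the case the release note describes as previously over-estimated.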
sql: add setting optimizer_use_improved_multi_column_selectivity_estimate
Informs #121397
Release note (sql change): Added a setting
optimizer_use_improved_multi_column_selectivity_estimate, which, if enabled,
causes the optimizer to use an improved selectivity estimate for multi-column
predicates. This setting will default to true on versions 24.2+, and false
on prior versions.
opt: improve variable names in selectivityFromMultiColDistinctCounts
This commit improves the variable names in selectivityFromMultiColDistinctCounts
in statisticsBuilder to be more self-documenting.
Release note: None
Release justification: low-risk, high-benefit change to existing functionality to unblock a customer