Infer the count of maximum distinct values from min/max #3837
Which issue does this PR close?
Part of #3813.
Rationale for this change
This was a point that came up during the initial join cardinality computation PR (link), where the logic only produced an estimate when the distinct count was available directly in the statistics. This was effective for cases where the distinct count had already been calculated (e.g. statistics propagated from operators such as aggregates), but for statistics that originate from initial user input, having `distinct_count` is very unlikely (e.g. there is no way to save a distinct count when exporting a Parquet file from pandas; none of the official backends [pyarrow/fastparquet] even support such a thing in their write APIs). So one thing we can do is use the `min`/`max` values (which are nearly universal at this point) to calculate the maximum possible distinct count, which is actually what we need for selectivity.
What changes are included in this PR?
A fallback option for inferring the maximum distinct count when the actual distinct count information is not available. It only works with numeric values (more specifically, integers) at this point. We could technically determine the range for timestamps or floats as well, but neither feels accurate: it would essentially assume every representable value within the precision boundaries is present, which is very unlikely in real-world data. This is open for discussion.
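To illustrate the idea (this is a hedged sketch with illustrative names, not the actual function added in this PR): for an integer column, every possible value lies in the closed range `[min, max]`, so the distinct count can be at most `max - min + 1`.

```rust
/// Hypothetical helper (name and signature are illustrative, not
/// DataFusion's actual API): infer the maximum possible number of
/// distinct values from integer min/max column statistics.
fn max_distinct_from_range(min: i64, max: i64) -> Option<u64> {
    if min > max {
        // Inconsistent statistics; no estimate can be made.
        return None;
    }
    // Widen to i128 to avoid overflow when the range spans
    // most of the i64 domain, then count the values in [min, max].
    Some((max as i128 - min as i128 + 1) as u64)
}

fn main() {
    // A column with min = 1 and max = 100 can hold at most
    // 100 distinct values.
    assert_eq!(max_distinct_from_range(1, 100), Some(100));
    // A single-valued range has exactly one possible value.
    assert_eq!(max_distinct_from_range(5, 5), Some(1));
    // min > max means the statistics are inconsistent.
    assert_eq!(max_distinct_from_range(10, 1), None);
}
```

The same bound does not transfer cleanly to floats or timestamps: counting every representable value inside the precision boundaries would wildly overestimate plausible cardinality, which is why the fallback is restricted to integers here.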
Are there any user-facing changes?
No backwards-incompatible changes.