Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug-fix in Filter and Limit statistics #8094

Merged
merged 5 commits into from
Nov 9, 2023
Merged

Bug-fix in Filter and Limit statistics #8094

merged 5 commits into from
Nov 9, 2023

Conversation

berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented Nov 8, 2023

Which issue does this PR close?

Related to #8078.

Rationale for this change

After enum Precision is introduced in DF, some bugs are discovered. This PR resolves 3 bugs:

  1. singleton check for ColumnStatistics,
  2. limit in datasource with num rows statistics,
  3. min - max exactness of columns with constant value.

What changes are included in this PR?

1st fix: A column is labeled as singleton only if its min and max values are exact and equal.
2nd fix: To stop processing, only exact count of rows is regarded. Otherwise, we should continue to process until range estimation of precision implemented.
3rd fix: During the analysis in filter statistics, if a column is filtered with a constant value (e.g. c=1), we set its min and max values as exact.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Nov 8, 2023
@berkaysynnada berkaysynnada marked this pull request as draft November 8, 2023 16:15
@berkaysynnada berkaysynnada marked this pull request as ready for review November 8, 2023 17:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the subtlety that may be involved here I think we should have test for these changes. Given that no existing test breaks, that suggests to me that it isn't sufficiently covered 🤔

@berkaysynnada
Copy link
Contributor Author

Given the subtlety that may be involved here I think we should have test for these changes. Given that no existing test breaks, that suggests to me that it isn't sufficiently covered 🤔

I have added tests covering the 1st and 3rd cases, but can't find a test suite for the 2nd case that I can demonstrate the need for the fix. Do you have any advice for me to do that easily?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada -- I think this is better for sure. I hope to keep improving test coverage as part of #8078

@@ -70,7 +70,11 @@ pub async fn get_statistics_with_limit(
// files. This only applies when we know the number of rows. It also
// currently ignores tables that have no statistics regarding the
// number of rows.
if num_rows.get_value().unwrap_or(&usize::MIN) <= &limit.unwrap_or(usize::MAX) {
let conservative_num_rows = match num_rows {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit e54894c into apache:main Nov 9, 2023
22 checks passed
@berkaysynnada berkaysynnada deleted the bugfix/statistics branch November 10, 2023 11:26
@alamb alamb mentioned this pull request Nov 15, 2023
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants