-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43712: [C++][Parquet] Dataset: Handle num-nulls in Parquet correctly when !HasNullCount() #43726
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay @mapleFU .
// If there are no values and no nulls, it might be empty or contains | ||
// only null. | ||
return std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit weird (though it's certainly safe to return std::nullopt
).
If num_values() == 0
, there can be only nulls, so we can just return is_null(std::move(field_expr))
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. This might means "no-values", like an empty-page. I'm not sure should an empty page return is_null
, it might be ok but a bit-weird for me( is_null for null or empty data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkietz What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's simpler and more consistent to return is_null(std::move(field_expr))
if we can; in general I think it usually makes the most sense to construct the most explicit/precise guarantees which are easily available (and therefore to avoid making this special case which will result in less specific guarantees).
} | ||
{ | ||
// Special case: when num_value is 0, if has_null, it would return | ||
// "is_null", otherwise it cannot gurantees anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why it cannot guarantee anything. It knows that there are no non-null values, so is_null
is true for all values (even if there are no values at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replied in https://github.com/apache/arrow/pull/43726/files#r1742087940 . I can also change to is_null
. I've no trending here
Co-authored-by: Antoine Pitrou <[email protected]>
// If `statistics.HasNullCount()`, it means the all the values are nulls. | ||
// | ||
// If there are no values and no nulls, it might be empty or all values | ||
// are nulls. In this case, we also return a null expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it might be mistake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If there are no values and `!statistics.HasNullCount()`, it might be
// empty or all values are nulls. In this case, we also return a null
// expression.
Change to this
Thanks for the fix and tests @mapleFU ! |
@kou Have you meet this problem when merging? |
I haven't seen it. Could you try it again? Or can I try it? |
Is your GitHub Personal Access Token valid? |
I would check this |
Aha merged, sorry for distrubing |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ab0a40e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…orrectly when !HasNullCount() (apache#43726) ### Rationale for this change See issue. When `!HasNullCount`, we cannot gurantee null exists ### What changes are included in this PR? Handle HasNullCount in dataset expr ### Are these changes tested? Yes ### Are there any user-facing changes? Merely * GitHub Issue: apache#43712 Lead-authored-by: mwish <[email protected]> Co-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: mwish <[email protected]>
Rationale for this change
See issue. When
!HasNullCount
, we cannot gurantee null existsWhat changes are included in this PR?
Handle HasNullCount in dataset expr
Are these changes tested?
Yes
Are there any user-facing changes?
Merely
ParquetFileFragment::EvaluateStatisticsAsExpression
should better checksStatistics::HasNullCount
#43712