GH-43797: [C++] Attach `arrow::ArrayStatistics` to `arrow::ArrayData` #43801
@pitrou @bkietz @felipecrv What do you think about this approach?
I think making `statistics` a "lazy" member of `ArrayData` is the right approach. It should probably be wrapped in a pointer, though: this will ensure that new members can be added to `ArrayStatistics` without impacting the size of `ArrayData`.
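To illustrate the size argument, here is a minimal sketch. `StatsV1`, `StatsV2`, `EmbeddedData`, and `PointerData` are hypothetical stand-ins invented for this example, not the real Arrow definitions; the only point is how the owner's size reacts when the statistics struct grows.

```cpp
#include <cstdint>
#include <memory>
#include <optional>

// Hypothetical stand-in for a first revision of the statistics struct.
struct StatsV1 {
  std::optional<int64_t> null_count;
};

// Hypothetical later revision that gains two members.
struct StatsV2 {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<double> min;
};

// Embedding: the owner's size tracks the statistics struct.
template <typename Stats>
struct EmbeddedData {
  int64_t length = 0;
  Stats statistics;
};

// Pointer: the owner only stores a shared_ptr, which stays null
// until someone attaches statistics (the "lazy" part).
template <typename Stats>
struct PointerData {
  int64_t length = 0;
  std::shared_ptr<Stats> statistics;
};

// Growing the statistics struct grows the embedding owner...
static_assert(sizeof(EmbeddedData<StatsV2>) > sizeof(EmbeddedData<StatsV1>),
              "embedded statistics affect the owner's size");
// ...but leaves the pointer-holding owner unchanged.
static_assert(sizeof(PointerData<StatsV2>) == sizeof(PointerData<StatsV1>),
              "pointer-wrapped statistics do not affect the owner's size");
```

The trade-off is one extra indirection, and an allocation only for arrays that actually carry statistics.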
8ee9b69 to 4d5b234
OK. I've changed it to a pointer instead of embedding
What just raised my curiosity is that
Most statistics would be invalidated by slicing, such as the distinct and null counts. The minimum and maximum could be preserved, but would have to be demoted to inexact until recomputed.
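A minimal sketch of that demotion rule follows. `StatsSketch` and `StatsForSlice` are hypothetical names made up for this example; the field names only loosely mirror `arrow::ArrayStatistics`, and nothing here is Arrow's actual slicing code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-in for arrow::ArrayStatistics.
struct StatsSketch {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<int64_t> min;
  bool is_min_exact = false;
  std::optional<int64_t> max;
  bool is_max_exact = false;
};

// Hypothetical helper: derive statistics that are still true for any
// slice of the array described by `parent`.
StatsSketch StatsForSlice(const StatsSketch& parent) {
  // Sanity check on the input: a minimum must not exceed the maximum.
  assert(!parent.min || !parent.max || *parent.min <= *parent.max);

  StatsSketch sliced;
  // Counts describe the whole array; a slice can hold fewer nulls and
  // fewer distinct values, so both are dropped (left unset).
  // The parent's min and max still bound every value in the slice,
  // but the slice may not contain them, so they are kept as inexact.
  sliced.min = parent.min;
  sliced.is_min_exact = false;
  sliced.max = parent.max;
  sliced.is_max_exact = false;
  return sliced;
}
```

For example, a parent with exact `min = 1` and `max = 9` yields a slice whose bounds are still 1 and 9, but flagged inexact until someone recomputes them over the slice.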
…yData` If we can attach associated statistics to an array via `ArrayData`, we can use it in later processes such as query planning.
4d5b234 to 942f757
Good catch! I forgot the I noticed that
General LGTM but I'm not familiar with details in this
If nobody objects to this, I'll merge it next week.
LGTM
…yData` (apache#43801) ### Rationale for this change If we can attach associated statistics to an array via `ArrayData`, we can use it in later processes such as query planning. If `ArrayData` not `Array` has statistics, we can use statistics in computing kernels. There was a concern that associated `arrow::ArrayStatistics` may be outdated if `arrow::ArrayData` is mutated after attaching `arrow::ArrayStatistics`. But `arrow::ArrayData` isn't mutable after the first population. So `arrow::ArrayStatistics` will not be outdated. We can require mutators to take responsibility for statistics. ### What changes are included in this PR? * Add `arrow::ArrayData::statistics` * Add `arrow::Array::statistics()` to get statistics attached in `arrow::ArrayData` This doesn't provide a new `arrow::ArrayData` constructor (`arrow::ArrayData::Make()`) that accepts `arrow::ArrayStatistics`. We can change `arrow::ArrayData::statistics` after we create `arrow::ArrayData`. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. `arrow::Array::statistics()` is a new public API. * GitHub Issue: apache#43797 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 4ed5a14. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 29 possible false positives for unstable benchmarks that are known to sometimes produce them.
/// object which backs this Array.
///
/// \return const ArrayStatistics&
std::shared_ptr<ArrayStatistics> statistics() const { return data_->statistics; }
The return type should be `std::shared_ptr<ArrayStatistics>&` and we should probably add `const ArrayStatistics` to the `shared_ptr` so that callers can't mutate the statistics through the shared pointer.
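A sketch of the accessor shape suggested here, using hypothetical stand-in types (`StatsSketch`, `DataSketch`, `ArraySketch`) rather than the real Arrow classes. It assumes the member itself is declared as `std::shared_ptr<const StatsSketch>`: returning a const reference from a `std::shared_ptr<StatsSketch>` member would first convert to a temporary, and the reference would dangle.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical stand-in for arrow::ArrayStatistics.
struct StatsSketch {
  std::optional<int64_t> null_count;
};

// Hypothetical stand-in for arrow::ArrayData.
// Assumption: the pointee is const so it can be handed out by reference.
struct DataSketch {
  std::shared_ptr<const StatsSketch> statistics;
};

// Hypothetical stand-in for arrow::Array.
class ArraySketch {
 public:
  explicit ArraySketch(std::shared_ptr<DataSketch> data) : data_(std::move(data)) {
    assert(data_ != nullptr);
  }

  // Returning a const reference avoids the reference-count bump of a
  // by-value return; the const pointee blocks mutation by callers.
  const std::shared_ptr<const StatsSketch>& statistics() const {
    return data_->statistics;
  }

 private:
  std::shared_ptr<DataSketch> data_;
};
```

With this shape, `array.statistics()->null_count = 0;` fails to compile, while reading `array.statistics()->null_count` works as before. Callers must still check for a null pointer, since statistics are optional.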
I agree with you. I don't know why I missed `const` and `&` here...
Let's add them: GH-44590
### Rationale for this change

If we can attach associated statistics to an array via `ArrayData`, we can use them in later processes such as query planning.

If `ArrayData`, rather than `Array`, holds the statistics, we can also use them in compute kernels.

There was a concern that the associated `arrow::ArrayStatistics` may become outdated if `arrow::ArrayData` is mutated after `arrow::ArrayStatistics` is attached. But `arrow::ArrayData` isn't mutable after the first population, so `arrow::ArrayStatistics` will not become outdated. We can require mutators to take responsibility for statistics.

### What changes are included in this PR?

* Add `arrow::ArrayData::statistics`
* Add `arrow::Array::statistics()` to get statistics attached in `arrow::ArrayData`

This doesn't provide a new `arrow::ArrayData` constructor (`arrow::ArrayData::Make()`) that accepts `arrow::ArrayStatistics`. We can change `arrow::ArrayData::statistics` after we create `arrow::ArrayData`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes. `arrow::Array::statistics()` is a new public API.

* GitHub Issue: #43797