Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43797: [C++] Attach arrow::ArrayStatistics to arrow::ArrayData #43801

Merged
merged 5 commits into from
Sep 3, 2024

Conversation

kou
Copy link
Member

@kou kou commented Aug 23, 2024

Rationale for this change

If we can attach associated statistics to an array via ArrayData, we can use it in later processes such as query planning.

If ArrayData not Array has statistics, we can use statistics in computing kernels.

There was a concern that associated arrow::ArrayStatistics may be outdated if arrow::ArrayData is mutated after attaching arrow::ArrayStatistics. But arrow::ArrayData isn't mutable after the first population. So arrow::ArrayStatistics will not be outdated. We can require mutators to take responsibility for statistics.

What changes are included in this PR?

  • Add arrow::ArrayData::statistics
  • Add arrow::Array::statistics() to get statistics attached in arrow::ArrayData

This doesn't provide a new arrow::ArrayData constructor (arrow::ArrayData::Make()) that accepts arrow::ArrayStatistics. We can change arrow::ArrayData::statistics after we create arrow::ArrayData.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

arrow::Array::statistics() is a new public API.

Copy link

⚠️ GitHub issue #43797 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member Author

kou commented Aug 27, 2024

@pitrou @bkietz @felipecrv What do you think about this approach?
See also the discussion at #43705 (comment) and comments followed by it.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making statistics a "lazy" member of ArrayData is the right approach. It should probably be wrapped in a pointer, though: this will ensure that new members can be added to ArrayStatistics without impacting the size of ArrayData

cpp/src/arrow/array/data.h Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 27, 2024
@kou kou force-pushed the cpp-array-statistics-attach-datum branch from 8ee9b69 to 4d5b234 Compare August 28, 2024 00:39
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 28, 2024
@kou
Copy link
Member Author

kou commented Aug 28, 2024

OK. I've changed to a pointer instead of embedding ArrayStatistics to ArrayData.
But I used std::shared_ptr not std::unique_ptr because we want to export ArrayStatistics from arrow::Array. If we use std::unique_ptr, arrow::Array::statistics() also needs to return std::unique_ptr. It will be difficult to use. So I used std::shared_ptr.

@mapleFU
Copy link
Member

mapleFU commented Aug 29, 2024

What just raise my curiousity is that subslice for array should discard the statistics?

@bkietz
Copy link
Member

bkietz commented Aug 29, 2024

Most statistics would be invalidated by slicing, such as the distinct and null counts. The minimum and maximum could be preserved, but would have to be demoted to inexact until recomputed.

kou added 4 commits August 30, 2024 13:37
…yData`

If we can attach associated statistics to an array via `ArrayData`, we
can use it in later processes such as query planning.
@kou kou force-pushed the cpp-array-statistics-attach-datum branch from 4d5b234 to 942f757 Compare August 30, 2024 04:37
@kou
Copy link
Member Author

kou commented Aug 30, 2024

Good catch! I forgot the Slice() case...
I've discarded statistics on Slice(). If an user wants to re-use (min/max) statistics of the original ArrayData, the user is responsible it.

I noticed that ArrayData::CopyTo() needs to copy statistics. I've done it too.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM but I'm not familiar with details in this

cpp/src/arrow/array/data.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 31, 2024
@kou
Copy link
Member Author

kou commented Aug 31, 2024

If nobody objects this, I'll merge this in the next week.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 31, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kou kou merged commit 4ed5a14 into apache:main Sep 3, 2024
37 of 40 checks passed
@kou kou deleted the cpp-array-statistics-attach-datum branch September 3, 2024 01:13
@kou kou removed the awaiting changes Awaiting changes label Sep 3, 2024
mapleFU pushed a commit to mapleFU/arrow that referenced this pull request Sep 3, 2024
…yData` (apache#43801)

### Rationale for this change

If we can attach associated statistics to an array via `ArrayData`, we can use it in later processes such as query planning.

If `ArrayData` not `Array` has statistics, we can use statistics in computing kernels.

There was a concern that associated `arrow::ArrayStatistics` may be outdated if `arrow::ArrayData` is mutated after attaching `arrow::ArrayStatistics`. But `arrow::ArrayData` isn't mutable after the first population. So `arrow::ArrayStatistics` will not be outdated. We can require mutators to take responsibility for statistics.

### What changes are included in this PR?

* Add `arrow::ArrayData::statistics`
* Add `arrow::Array::statistics()` to get statistics attached in `arrow::ArrayData`

This doesn't provide a new `arrow::ArrayData` constructor (`arrow::ArrayData::Make()`) that accepts `arrow::ArrayStatistics`. We can change `arrow::ArrayData::statistics` after we create `arrow::ArrayData`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

`arrow::Array::statistics()` is a new public API.
* GitHub Issue: apache#43797

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 4ed5a14.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 29 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024
…yData` (apache#43801)

### Rationale for this change

If we can attach associated statistics to an array via `ArrayData`, we can use it in later processes such as query planning.

If `ArrayData` not `Array` has statistics, we can use statistics in computing kernels.

There was a concern that associated `arrow::ArrayStatistics` may be outdated if `arrow::ArrayData` is mutated after attaching `arrow::ArrayStatistics`. But `arrow::ArrayData` isn't mutable after the first population. So `arrow::ArrayStatistics` will not be outdated. We can require mutators to take responsibility for statistics.

### What changes are included in this PR?

* Add `arrow::ArrayData::statistics`
* Add `arrow::Array::statistics()` to get statistics attached in `arrow::ArrayData`

This doesn't provide a new `arrow::ArrayData` constructor (`arrow::ArrayData::Make()`) that accepts `arrow::ArrayStatistics`. We can change `arrow::ArrayData::statistics` after we create `arrow::ArrayData`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

`arrow::Array::statistics()` is a new public API.
* GitHub Issue: apache#43797

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…yData` (apache#43801)

### Rationale for this change

If we can attach associated statistics to an array via `ArrayData`, we can use it in later processes such as query planning.

If `ArrayData` not `Array` has statistics, we can use statistics in computing kernels.

There was a concern that associated `arrow::ArrayStatistics` may be outdated if `arrow::ArrayData` is mutated after attaching `arrow::ArrayStatistics`. But `arrow::ArrayData` isn't mutable after the first population. So `arrow::ArrayStatistics` will not be outdated. We can require mutators to take responsibility for statistics.

### What changes are included in this PR?

* Add `arrow::ArrayData::statistics`
* Add `arrow::Array::statistics()` to get statistics attached in `arrow::ArrayData`

This doesn't provide a new `arrow::ArrayData` constructor (`arrow::ArrayData::Make()`) that accepts `arrow::ArrayStatistics`. We can change `arrow::ArrayData::statistics` after we create `arrow::ArrayData`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

`arrow::Array::statistics()` is a new public API.
* GitHub Issue: apache#43797

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
/// object which backs this Array.
///
/// \return const ArrayStatistics&
std::shared_ptr<ArrayStatistics> statistics() const { return data_->statistics; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type should be std::shared_ptr<ArrayStatistics>& and we should probably add const ArrayStatistics to the shared_ptr so that callers can't mutate the statistics through the shared pointer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. I don't know why I missed const and & here...

Let's add them: GH-44590

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants