Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43666: [C++] Attach arrow::ArrayStatistics to arrow::Array #43705

Closed
wants to merge 1 commit into from

Conversation

kou
Copy link
Member

@kou kou commented Aug 15, 2024

Rationale for this change

If we can attach associated statistics to an array, we can use it in later processes such as query planning.

What changes are included in this PR?

New public APIs:

  • arrow::Array::statistics(): It returns the associated statistics. It can't be changed after an array is created. A sliced array doesn't have parent array's statistics because parent array's statistics isn't valid for sliced array.
  • Add new optional arrow::ArrayStatistics argument to all arrow::*Array(ArrayData) constructors: arrow::*Array(ArrayData, ArrayStatistics = NULLPTR)

New internal APIs:

  • arrow::Array::Init(): All array constructors must call this to attach arrow::ArrayData and arrow::ArrayStatistics. Note that calling this via parent's constructor isn't allowed. Array constructors don't need to call arrow::Array::SetData() directly. It's called in arrow::Array::Init().
  • arrow::Array::SetStatitics(): It attaches arrow::ArrayStatistics to arrow::Array. In general, this is not called directly. This is called from arrow::Array::Init() internally.

Changed internal APIs:

  • arrow::Array::SetData(): It becomes a virtual method. So arrow::Array::Init() must be called by each array's constructor.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

This PR includes breaking changes to public APIs.

APIs are compatible but ABIs are incompatible.

New public APIs:

* `arrow::Array::statistics()`: It returns the associated statistics.
  It can't be changed after an array is created. A sliced array
  doesn't have parent array's statistics because parent array's
  statistics isn't valid for sliced array.
* Add new optional `arrow::ArrayStatistics` argument to all
  `arrow::*Array(ArrayData)` constructors: `arrow::*Array(ArrayData,
  ArrayStatistics = NULLPTR)`

New internal APIs:

* `arrow::Array::Init()`: All array constructors must call this to
  attach `arrow::ArrayData` and `arrow::ArrayStatistics`. Note that
  calling this via parent's constructor isn't
  allowed. Array constructors don't need to call
  `arrow::Array::SetData()` directly. It's called in
  `arrow::Array::Init()`.
* `arrow::Array::SetStatitics()`: It attaches `arrow::ArrayStatistics`
  to `arrow::Array`. In general, this is not called directly. This is
  called from `arrow::Array::Init()` internally.

Changed internal APIs:

* `arrow::Array::SetData()`: It becomes a virtual method. So
  `arrow::Array::Init()` must be called by each array's constructor.
Copy link

⚠️ GitHub issue #43666 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member Author

kou commented Aug 17, 2024

@pitrou @felipecrv What do you think about this approach?

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought is that statistics should be attached to ArrayData rather than Array. Otherwise Datum (which stores only ArrayData) will not have access to statistics.


/// Protected method for constructors. Don't call this method
/// directly. This should be called from Init().
virtual void SetData(const std::shared_ptr<ArrayData>& data) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't be virtual since it's called from constructors, which adds nuances. Specifically, if BaseArray::BaseArray() calls Array::Init() then that will not call DerivedArray::SetData().

Is there a reason to leave this virtual?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't be virtual since it's called from constructors, which adds nuances. Specifically, if BaseArray::BaseArray() calls Array::Init() then that will not call DerivedArray::SetData().

You're right. So this implementation ensures calling Array::Init() from DerivedArray::DerivedArray() not BaseArray::BaseArray(). It calls DerivedArray::SetData() but this is a tricky limitation. I'll consider another approach.

@kou
Copy link
Member Author

kou commented Aug 20, 2024

My first thought is that statistics should be attached to ArrayData rather than Array. Otherwise Datum (which stores only ArrayData) will not have access to statistics.

I thought so and the first implementation uses the approach. But @felipecrv pointed out that it indicates that ArrowStatistics is mutable. But we don't want to maintain ArrowStatistics when ArrayData is changed.
See also this discussion: #42133 (comment)

We must call ArrayData::GetMutableValues() when we change ArrayData, right? If so, how about always resetting ArrayData::statistics when ArrayData::GetMutableValues() like the following?

diff --git a/cpp/src/arrow/array/data.h b/cpp/src/arrow/array/data.h
index e0508fe698..60409d1367 100644
--- a/cpp/src/arrow/array/data.h
+++ b/cpp/src/arrow/array/data.h
@@ -261,6 +261,7 @@ struct ARROW_EXPORT ArrayData {
   // Access a buffer's data as a typed C pointer
   template <typename T>
   inline T* GetMutableValues(int i, int64_t absolute_offset) {
+    statistics = NULLPTR;
     if (buffers[i]) {
       return reinterpret_cast<T*>(buffers[i]->mutable_data()) + absolute_offset;
     } else {

If it can prevent maintaining ArrayStatistics when ArrayData is changed, we may be able to attach ArrayStatistics to ArrayData.

@bkietz
Copy link
Member

bkietz commented Aug 20, 2024

how about always resetting ArrayData::statistics when ArrayData::GetMutableValues()

That's a good start, but many places in the code access the ArrayData's buffers directly instead of calling GetMutableValues(). Maybe instead we should ensure all buffers are immutable when we attach statistics to the ArrayData (at least in debug)?

@bkietz
Copy link
Member

bkietz commented Aug 20, 2024

Alternatively if we want to keep statistics out of ArrayData, Datum will need to be modified to include statistics as well. That might be less work than ensuring ArrayData is never mutated and left with invalid statistics.

... we also lose nested statistics, though. A struct array's statistics would contain no information about any children if it were only attached to the StructArray. However attaching statistics to ArrayData would allow the stats for children to be present.

I'm not sure that "statistics are null in arrays which have been mutated" is a contract we can enforce automatically and still maintain the usefulness of ArrayStatistics. As with null_count, we might need to just say that it is the responsibility of a mutater to ensure stats are reset :/

@kou
Copy link
Member Author

kou commented Aug 21, 2024

Alternatively if we want to keep statistics out of ArrayData, Datum will need to be modified to include statistics as well. That might be less work than ensuring ArrayData is never mutated and left with invalid statistics.

I didn't think about the approach. I'll consider the approach.

A struct array's statistics would contain no information about any children if it were only attached to the StructArray.

Ah, you're right. I missed it.

StructArray::field() creates a child field from ArrowData lazy:

std::shared_ptr<Array> result = MakeArray(field_data);

So we can't attach ArrowStatistics to children of StructArray.
(If we want to do it, we need to attach children's ArrowStatistics to StructArray.)

@pitrou
Copy link
Member

pitrou commented Aug 21, 2024

There are various points being made here:

  1. ArrayData is mutable: no, it isn't, except when first populating it.
  2. Mutating a ArrayData should reset its statistics: I don't get why. For example, it doesn't reset its null_count.
  3. Statistics would be better on Datum: but statistics pertain to an Array. If the Datum contains e.g. a RecordBatch, we would probably like to have per-column statistics (one per Array), rather than none at all (the null_count or max_value of a RecordBatch is ill-defined).

Another question is the cost of adding a statistics structure to either Array or ArrayData. Currently, the ArrayStatistics struct is very light-weight (with the potential exception of string min/max values). But what if some day we add more details statistics such as quantiles?

@bkietz
Copy link
Member

bkietz commented Aug 21, 2024

ArrayData is mutable: no, it isn't, except when first populating it.

This is correct; ArrayData isn't really mutable. It would be really rare for anything to set up statistics before an ArrayData was finalized. So it is probably acceptable to give mutators responsibility for statistics.

Another question is the cost of adding a statistics structure

So long as it is encapsulated in a pointer, the cost of constructing or copying an ArrayData (especially in the common case where no statistics are attached) should be minimal? Or maybe you're referring to a different cost?

@pitrou
Copy link
Member

pitrou commented Aug 22, 2024

So long as it is encapsulated in a pointer, the cost of constructing or copying an ArrayData (especially in the common case where no statistics are attached) should be minimal?

Ah, that is true. I had overlooked that this uses a shared_ptr indirection. Some people don't like the cost of shared_ptr, that said :-)

@kou
Copy link
Member Author

kou commented Aug 23, 2024

ArrayData is mutable: no, it isn't, except when first populating it.

This is correct; ArrayData isn't really mutable. It would be really rare for anything to set up statistics before an ArrayData was finalized.

Oh, I misunderstood it. I thought ArrayData is still mutable after the first population.

So it is probably acceptable to give mutators responsibility for statistics.

I think so too. Let's attach to ArrayData not Array: GH-43797

@kou
Copy link
Member Author

kou commented Aug 23, 2024

Implementation: GH-43801

@mapleFU
Copy link
Member

mapleFU commented Sep 3, 2024

@kou should this being closed?

@kou
Copy link
Member Author

kou commented Sep 3, 2024

Yes. I close this.

@kou kou closed this Sep 3, 2024
@kou kou deleted the cpp-array-statistics-attach branch September 3, 2024 05:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants