GH-43666: [C++] Attach arrow::ArrayStatistics to arrow::Array #43705
New public APIs:
* `arrow::Array::statistics()`: It returns the associated statistics. They can't be changed after an array is created. A sliced array doesn't have its parent array's statistics because the parent array's statistics aren't valid for the sliced array.
* Add a new optional `arrow::ArrayStatistics` argument to all `arrow::*Array(ArrayData)` constructors: `arrow::*Array(ArrayData, ArrayStatistics = NULLPTR)`
New internal APIs:
* `arrow::Array::Init()`: All array constructors must call this to attach `arrow::ArrayData` and `arrow::ArrayStatistics`. Note that calling this via the parent's constructor isn't allowed. Array constructors don't need to call `arrow::Array::SetData()` directly. It's called in `arrow::Array::Init()`.
* `arrow::Array::SetStatitics()`: It attaches `arrow::ArrayStatistics` to `arrow::Array`. In general, this is not called directly. It is called from `arrow::Array::Init()` internally.
Changed internal APIs:
* `arrow::Array::SetData()`: It becomes a virtual method. So `arrow::Array::Init()` must be called by each array's constructor.
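To make the slicing rule concrete, here is a minimal sketch that uses simplified stand-in types rather than the real Arrow classes. The `Array` class, the single `max` field of `ArrayStatistics`, and the vector-backed storage are all assumptions for illustration. Only the idea comes from the description above: statistics are fixed at construction, and a slice does not inherit them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified statistics: only a maximum value.
struct ArrayStatistics {
  int64_t max;
};

// Hypothetical stand-in for arrow::Array, backed by a plain vector.
class Array {
 public:
  explicit Array(std::vector<int64_t> values,
                 std::shared_ptr<const ArrayStatistics> statistics = nullptr)
      : values_(std::move(values)), statistics_(std::move(statistics)) {}

  // Read-only accessor: there is no setter, so the statistics cannot be
  // replaced after construction.
  const std::shared_ptr<const ArrayStatistics>& statistics() const {
    return statistics_;
  }

  // A slice is built without statistics, because the parent's values
  // (for example its max) may not describe the sliced range.
  Array Slice(std::size_t offset, std::size_t length) const {
    auto first = values_.begin() + static_cast<std::ptrdiff_t>(offset);
    auto last = first + static_cast<std::ptrdiff_t>(length);
    return Array(std::vector<int64_t>(first, last));
  }

 private:
  std::vector<int64_t> values_;
  std::shared_ptr<const ArrayStatistics> statistics_;
};

void DemonstrateSliceDropsStatistics() {
  std::vector<int64_t> values{1, 9, 3};
  Array parent(values, std::make_shared<ArrayStatistics>(ArrayStatistics{9}));
  assert(parent.statistics()->max == 9);

  // The slice holds only {1}; a max of 9 would be wrong for it.
  Array slice = parent.Slice(0, 1);
  assert(slice.statistics() == nullptr);
}
```

In this sketch the parent's maximum of 9 is simply not true of the one-element slice, which is why the slice starts with no statistics instead of a copy.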
@pitrou @felipecrv What do you think about this approach?
My first thought is that statistics should be attached to ArrayData rather than Array. Otherwise Datum (which stores only ArrayData) will not have access to statistics.
/// Protected method for constructors. Don't call this method
/// directly. This should be called from Init().
virtual void SetData(const std::shared_ptr<ArrayData>& data) {
This probably shouldn't be virtual since it's called from constructors, which adds nuances. Specifically, if `BaseArray::BaseArray()` calls `Array::Init()` then that will not call `DerivedArray::SetData()`.
Is there a reason to leave this virtual?
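The concern can be reproduced with a small standalone sketch. The classes below are hypothetical stand-ins that only borrow the names from the comment; they are not the real Arrow classes, and `last_set_data()` exists purely to record which override ran. While a base-class constructor is executing, a virtual call resolves to that base class's override, not to the most-derived one.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for arrow::Array.
class Array {
 public:
  virtual ~Array() = default;
  const std::string& last_set_data() const { return last_set_data_; }

 protected:
  // Mirrors the proposed Init(): it forwards to the virtual SetData().
  void Init() { SetData(); }
  virtual void SetData() { last_set_data_ = "Array::SetData"; }
  std::string last_set_data_;
};

class BaseArray : public Array {
 public:
  // While this constructor runs, the object's dynamic type is BaseArray.
  BaseArray() { Init(); }

 protected:
  void SetData() override { last_set_data_ = "BaseArray::SetData"; }
};

class DerivedArray : public BaseArray {
 public:
  // Relies on the parent's constructor to call Init().
  DerivedArray() : BaseArray() {}

 protected:
  void SetData() override { last_set_data_ = "DerivedArray::SetData"; }
};

// Returns the name of the override that ran while building a DerivedArray.
std::string OverrideUsedForDerived() {
  DerivedArray array;
  return array.last_set_data();
}

void DemonstrateConstructorDispatch() {
  // DerivedArray::SetData() is never reached from BaseArray's constructor.
  assert(OverrideUsedForDerived() == "BaseArray::SetData");
}
```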
> This probably shouldn't be virtual since it's called from constructors, which adds nuances. Specifically, if `BaseArray::BaseArray()` calls `Array::Init()` then that will not call `DerivedArray::SetData()`.
You're right. So this implementation ensures calling `Array::Init()` from `DerivedArray::DerivedArray()`, not `BaseArray::BaseArray()`. It calls `DerivedArray::SetData()` but this is a tricky limitation. I'll consider another approach.
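A rough sketch of the limitation described in this reply, again with hypothetical stand-in classes (`ForgetfulArray` and `initialized_by()` are invented for illustration): the scheme only works if every concrete class remembers to call `Init()` in its own constructor, and nothing in the language enforces that.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for arrow::Array.
class Array {
 public:
  virtual ~Array() = default;
  const std::string& initialized_by() const { return initialized_by_; }

 protected:
  void Init() { SetData(); }
  virtual void SetData() { initialized_by_ = "Array"; }
  std::string initialized_by_ = "nobody";
};

// Follows the convention: the most-derived constructor calls Init(), so the
// virtual call reaches DerivedArray::SetData().
class DerivedArray : public Array {
 public:
  DerivedArray() { Init(); }

 protected:
  void SetData() override { initialized_by_ = "DerivedArray"; }
};

// Breaks the convention: nothing stops a subclass from skipping Init(),
// which leaves the object without its data attached.
class ForgetfulArray : public Array {
 public:
  ForgetfulArray() {}

 protected:
  void SetData() override { initialized_by_ = "ForgetfulArray"; }
};

void DemonstrateInitConvention() {
  DerivedArray derived;
  assert(derived.initialized_by() == "DerivedArray");

  ForgetfulArray forgetful;
  assert(forgetful.initialized_by() == "nobody");
}
```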
I thought so and the first implementation uses the approach. But @felipecrv pointed out that it indicates that we must call
diff --git a/cpp/src/arrow/array/data.h b/cpp/src/arrow/array/data.h
index e0508fe698..60409d1367 100644
--- a/cpp/src/arrow/array/data.h
+++ b/cpp/src/arrow/array/data.h
@@ -261,6 +261,7 @@ struct ARROW_EXPORT ArrayData {
   // Access a buffer's data as a typed C pointer
   template <typename T>
   inline T* GetMutableValues(int i, int64_t absolute_offset) {
+    statistics = NULLPTR;
     if (buffers[i]) {
       return reinterpret_cast<T*>(buffers[i]->mutable_data()) + absolute_offset;
     } else {
If it can prevent maintaining
That's a good start, but many places in the code access the ArrayData's buffers directly instead of calling GetMutableValues(). Maybe instead we should ensure all buffers are immutable when we attach statistics to the ArrayData (at least in debug)?
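The gap pointed out here can be sketched as follows. This is a simplified model, not the real `arrow::ArrayData`: the single `buffer` vector stands in for `buffers[i]`, and the `max` field is an assumed statistic. Resetting statistics inside `GetMutableValues()` only helps callers that go through it; a direct write to the buffer leaves stale statistics in place.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified statistics: only a maximum value.
struct ArrayStatistics {
  int64_t max;
};

// Hypothetical stand-in for arrow::ArrayData with a single value buffer.
struct ArrayData {
  std::vector<int64_t> buffer;
  std::shared_ptr<ArrayStatistics> statistics;

  // As in the diff above: handing out mutable access drops the statistics.
  int64_t* GetMutableValues() {
    statistics = nullptr;
    return buffer.data();
  }
};

ArrayData MakeDataWithStatistics() {
  return ArrayData{{1, 2, 3}, std::make_shared<ArrayStatistics>(ArrayStatistics{3})};
}

// Mutation through the accessor: statistics are reset.
bool StatisticsSurviveGetMutableValues() {
  ArrayData data = MakeDataWithStatistics();
  data.GetMutableValues()[0] = 100;
  return data.statistics != nullptr;
}

// Mutation that bypasses the accessor: statistics stay attached, and the
// recorded max (3) no longer matches the buffer, which now holds 100.
bool StatisticsSurviveDirectWrite() {
  ArrayData data = MakeDataWithStatistics();
  data.buffer[0] = 100;
  return data.statistics != nullptr;
}

void DemonstrateStaleStatistics() {
  assert(!StatisticsSurviveGetMutableValues());
  assert(StatisticsSurviveDirectWrite());
}
```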
Alternatively if we want to keep statistics out of ArrayData, Datum will need to be modified to include statistics as well. That might be less work than ensuring ArrayData is never mutated and left with invalid statistics.
... we also lose nested statistics, though. A struct array's statistics would contain no information about any children if it were only attached to the StructArray. However attaching statistics to ArrayData would allow the stats for children to be present.
I'm not sure that "statistics are null in arrays which have been mutated" is a contract we can enforce automatically and still maintain the usefulness of ArrayStatistics. As with null_count, we might need to just say that it is the responsibility of a mutator to ensure stats are reset :/
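The point about nested statistics can be illustrated with a small sketch. The types are simplified stand-ins: `child_data` follows the member name in `arrow::ArrayData`, while the `statistics` member and the `null_count`-only `ArrayStatistics` are assumptions about what attaching statistics to `ArrayData` might look like. Because each child is itself an `ArrayData`, each child can carry its own statistics.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified statistics: only a null count.
struct ArrayStatistics {
  int64_t null_count;
};

// Hypothetical stand-in for arrow::ArrayData.
struct ArrayData {
  std::shared_ptr<ArrayStatistics> statistics;         // may be null
  std::vector<std::shared_ptr<ArrayData>> child_data;  // struct fields
};

// Counts how many nodes in the tree carry statistics.
int CountNodesWithStatistics(const ArrayData& data) {
  int count = data.statistics ? 1 : 0;
  for (const auto& child : data.child_data) {
    count += CountNodesWithStatistics(*child);
  }
  return count;
}

// Builds a struct-like parent with no statistics of its own and two
// children that each carry statistics.
std::shared_ptr<ArrayData> MakeStructWithChildStatistics() {
  auto child_a = std::make_shared<ArrayData>();
  child_a->statistics = std::make_shared<ArrayStatistics>(ArrayStatistics{0});
  auto child_b = std::make_shared<ArrayData>();
  child_b->statistics = std::make_shared<ArrayStatistics>(ArrayStatistics{2});

  auto parent = std::make_shared<ArrayData>();
  parent->child_data = {child_a, child_b};
  return parent;
}

void DemonstrateNestedStatistics() {
  auto parent = MakeStructWithChildStatistics();
  assert(parent->statistics == nullptr);
  assert(CountNodesWithStatistics(*parent) == 2);
}
```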
I didn't think about the approach. I'll consider the approach.
Ah, you're right. I missed it.
arrow/cpp/src/arrow/array/array_nested.cc Line 1088 in 9fc0301
So we can't attach `ArrayStatistics` to children of `StructArray`. (If we want to do it, we need to attach children's `ArrayStatistics` to `StructArray`.)
There are various points being made here:
Another question is the cost of adding a statistics structure to either
This is correct; ArrayData isn't really mutable. It would be really rare for anything to set up statistics before an ArrayData was finalized. So it is probably acceptable to give mutators responsibility for statistics.
So long as it is encapsulated in a pointer, the cost of constructing or copying an ArrayData (especially in the common case where no statistics are attached) should be minimal? Or maybe you're referring to a different cost?
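A minimal sketch of the cost argument, assuming the statistics are held through a `std::shared_ptr` (the struct layouts and field names below are illustrative, not Arrow's): copying the data object copies only the pointer, and in the common case the pointer is simply null.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical statistics payload with several fields.
struct ArrayStatistics {
  int64_t null_count;
  int64_t distinct_count;
  int64_t min;
  int64_t max;
};

// Hypothetical stand-in for arrow::ArrayData.
struct ArrayData {
  int64_t length;
  std::shared_ptr<ArrayStatistics> statistics;  // null in the common case
};

// Copying shares the same statistics object instead of duplicating it.
bool CopySharesStatistics() {
  ArrayData original{3, std::make_shared<ArrayStatistics>(ArrayStatistics{0, 3, 1, 9})};
  ArrayData copy = original;
  return copy.statistics.get() == original.statistics.get() &&
         original.statistics.use_count() == 2;
}

// Without statistics, the copy carries only a null pointer.
bool CopyWithoutStatisticsStaysNull() {
  ArrayData original{3, nullptr};
  ArrayData copy = original;
  return copy.statistics == nullptr;
}

void DemonstratePointerCost() {
  assert(CopySharesStatistics());
  assert(CopyWithoutStatisticsStaysNull());
}
```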
Ah, that is true. I had overlooked that this uses a
Oh, I misunderstood it. I thought
I think so too. Let's attach to
Implementation: GH-43801
@kou should this be closed?
Yes. I close this.
Rationale for this change
If we can attach associated statistics to an array, we can use it in later processes such as query planning.
What changes are included in this PR?
New public APIs:
* `arrow::Array::statistics()`: It returns the associated statistics. They can't be changed after an array is created. A sliced array doesn't have its parent array's statistics because the parent array's statistics aren't valid for the sliced array.
* Add a new optional `arrow::ArrayStatistics` argument to all `arrow::*Array(ArrayData)` constructors: `arrow::*Array(ArrayData, ArrayStatistics = NULLPTR)`
New internal APIs:
* `arrow::Array::Init()`: All array constructors must call this to attach `arrow::ArrayData` and `arrow::ArrayStatistics`. Note that calling this via the parent's constructor isn't allowed. Array constructors don't need to call `arrow::Array::SetData()` directly. It's called in `arrow::Array::Init()`.
* `arrow::Array::SetStatitics()`: It attaches `arrow::ArrayStatistics` to `arrow::Array`. In general, this is not called directly. It is called from `arrow::Array::Init()` internally.
Changed internal APIs:
* `arrow::Array::SetData()`: It becomes a virtual method. So `arrow::Array::Init()` must be called by each array's constructor.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
This PR includes breaking changes to public APIs.
APIs are compatible but ABIs are incompatible.
`arrow::ArrayStatistics` to `arrow::Array` #43666