-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41909: [C++] PoC: Add arrow::ArrayStatistics with Parquet statistics integration #42133
Conversation
|
std::optional<int64_t> null_count = std::nullopt; | ||
|
||
/// \brief The number of distinct values, may not be set | ||
std::optional<int64_t> distinct_count = std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really useful? AFAIK, Parquet does not populate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. This is not useful for now because Parquet C++ doesn't populate it.
I just add this because parquet::Statistics
has this.
/// data source can be read unified API via this class. | ||
struct ARROW_EXPORT ArrayStatistics { | ||
public: | ||
using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from var-len type, how do we support timestamp, internal and other primitive types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, should we add an optional buffer to store values for string/binary types? Thus we can use std::string_view to represent them. If the min/max values are exact values, we can point them to the values in the arrow array and do not store any value in the optional buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from var-len type, how do we support timestamp, internal and other primitive types?
Ah, we need arrow::DataType
for them. We can add it to ArrayStatistics
or users can use it in associated ArrayData
. The latter is a bit difficult to use but it may be preferred because it can reduce needed memory a bit.
should we add an optional buffer to store values for string/binary types? Thus we can use std::string_view to represent them. If the min/max values are exact values, we can point them to the values in the arrow array and do not store any value in the optional buffer.
This is a discussion point in the PR description. In general, I think that std::string_view
is better but there is a problem where we should store referred data as you mentioned. It may be costly to find the min/max values in the Arrow array because statistics provided by data source will not provide the value position.
std::optional<int64_t> distinct_count = std::nullopt; | ||
|
||
/// \brief The current minimum value buffer, may not be set | ||
std::optional<ElementBufferType> min_buffer = std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they exact min/max values, or they can be lower/upper bounds? Should we add a flag to indicate this case? FYI, parquet has a flag to indicate whether the value is exact or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I didn't know that Parquet has exact or not for min/max values:
arrow/cpp/src/parquet/parquet.thrift
Lines 265 to 282 in d078d5c
/** | |
* Lower and upper bound values for the column, determined by its ColumnOrder. | |
* | |
* These may be the actual minimum and maximum values found on a page or column | |
* chunk, but can also be (more compact) values that do not exist on a page or | |
* column chunk. For example, instead of storing "Blart Versenwald III", a writer | |
* may set min_value="B", max_value="C". Such more compact values must still be | |
* valid values within the column's logical type. | |
* | |
* Values are encoded using PLAIN encoding, except that variable-length byte | |
* arrays do not include a length prefix. | |
*/ | |
5: optional binary max_value; | |
6: optional binary min_value; | |
/** If true, max_value is the actual maximum value for a column */ | |
7: optional bool is_max_value_exact; | |
/** If true, min_value is the actual minimum value for a column */ | |
8: optional bool is_min_value_exact; |
Let's add is_min_exact
/is_max_exact
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_max_value_exact
and is_min_value_exact
are added recently and we are not using them at least in parquet-cpp and parquet-java. FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added ArrayStatistics::is_min_exact
and ArrayStatistics::is_max_exact
.
} | ||
*out = std::make_shared<ArrayType<ArrowType>>(std::move(array_data)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, the current parquet reader will return a chunked array containing one or multiple arrays (may or may not from different row groups). Here the stats do not contain the exact min/max values because they are from the row group level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the information. I didn't know it.
It seems that arrays in a chunked array correspond to column chunks in Parquet, right?
If so, ColumnChunkMetaData
for each column chunk has statistics for the column chunk, right? Or does it have statistics for the row group of the column chunk not the column chunk itself?
(Does you mean that we can't associate ColumnChunkMetaData::statistics()
information with arrow::Array
because it has the statistics for row group not column chunk?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The column statistics in the ColumnChunkMetaData
is for each column chunk in that row group. However, if I remember correctly, the parquet reader may return arrow (chunked) arrays spanning more than one row group in a single RecordBatch
or Table
. I'm not sure if we can represent correct statistics in this case.
cc @mapleFU to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an extreme case, users might read a parquet file containing two row groups in a single arrow::Table
, we have to merge the column statistics, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to merge. Because each arrow::Array
in a arrow::Table
can have statistics.
We may add arrow::Table
/arrow::RecordBatch
(row group) level statistics later but it's out of scope of this proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Then this is not an issue if stats is per array.
cpp/src/arrow/array/array_base.h
Outdated
/// object which backs this Array. | ||
/// | ||
/// \return const ArrayStatistics& | ||
const ArrayStatistics& statistics() const { return data_->statistics; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should statistics be stored in memory together with every ArrayData
instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another problem with this is that statistics are derived data and ArraData
is mutable when manipulated directly, so any mutation of ArrayData
will have to consider the consequences to the derived statistics.
Lazily-computed null_count_
is a source of bugs and complexity for this reason. IMO statistics should be (1) computed or (2) carried from a file readers (like Parquet's) as something on the side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should statistics be stored in memory together with every
ArrayData
instance?
If it's not desired, we can avoid it by using std::shared_ptr<ArrayStatistics>
or something.
Another problem with this is that statistics are derived data and
ArraData
is mutable when manipulated directly, so any mutation ofArrayData
will have to consider the consequences to the derived statistics.Lazily-computed
null_count_
is a source of bugs and complexity for this reason. IMO statistics should be (1) computed or (2) carried from a file readers (like Parquet's) as something on the side.
How about attaching the statistics read by a file reader to Array
(not ArrayData
) directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about attaching the statistics read by a file reader to Array (not ArrayData) directly?
Makes more sense. Perhaps even attach it only to the typed array classes and not Array
itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added Array::SetStatistics()
for this.
I wanted to pass statistics by constructor to prevent changing statistics after construction but I didn't do it. Because our constructors already require many arguments. For example, PrimiriveArray
has 6 arguments:
arrow/cpp/src/arrow/array/array_base.h
Lines 272 to 275 in 797ca30
PrimitiveArray(const std::shared_ptr<DataType>& type, int64_t length, | |
const std::shared_ptr<Buffer>& data, | |
const std::shared_ptr<Buffer>& null_bitmap = NULLPTR, | |
int64_t null_count = kUnknownNullCount, int64_t offset = 0); |
SetStatistics()
isn't thread safe because std::shared_ptr::operator=()
isn't thread safe. So users must set statistics before parallel processing.
What do you think about this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stats shared_ptr
should become an additional (but optional) parameter to MakeArray
std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
std::shared_ptr<Array> out;
ArrayDataWrapper wrapper_visitor(data, &out);
DCHECK_OK(VisitTypeInline(*data->type, &wrapper_visitor));
DCHECK(out);
return out;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll choose the changing only ArrayData
constructors approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done:
- I've added
Array::Array(const std::shared_ptr<ArrayData>& data, const std::shared_ptr<ArrayStatistics>& statistics)
that callsSetData()
andSetStatistics()
and use it instead of directSetData()
call (as much as possible) in sub arrays. - I've added
ValidateData()
and it's called bySetData()
to useArray::Array(data, statistics)
in sub arrays. - I've unified
const std::shared_ptr<ArrayData>& data
argument andstd::shared_ptr<ArrayData> data
argument in sub array's constructors toconst &
because related codes useconst &
.
There are many small diffs for the change. Review may be a bit difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes that affect the overall design of Array
should probably extracted to a separate PR so they can be reviewed more carefully. It seems very risky to decouple ValidateData
from SetData
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense.
I've opened #43273 that only include arrow::ArrayStatistics
.
We can keep using this PR for discussing arrow::ArrayStatitics
related APIs because this PR still includes not only arrow::ArrayStatistics
but also Apache Parquet integration example.
f362ccc
to
c2ba4ed
Compare
cpp/src/arrow/type_fwd.h
Outdated
template <typename TypeClass> | ||
class TypedArrayStatistics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new typed class hierarchy for statistics seems like an overkill. Think of statistics as a dynamic object like ArrayData
and the classes like BooleanArray
, StringArray
have typed accessors to the underlying untyped statistics object just like they have typed accessors to the underlying ArrayData
for the array buffers of that type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And compute kernels that deal directly with untyped ArrayData would deal directly with untyped ArrayStats
objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the typed classes are convenient but they are required in the first version. I'll remove them. If we think that they are convenient, we can revisit this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
30193df
to
c905bfc
Compare
eb1bd05
to
5bf9935
Compare
See apacheGH-42133 how to use this for Apache Parquet statistics.
See apacheGH-42133 how to use this for Apache Parquet statistics.
### Rationale for this change We're discussion API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and GH-41909. If we have `arrow::ArrayStatistics`, we can attach statistics read from Apache Parquet to `arrow::Array`s. This only includes `arrow::ArrayStatistics`. See GH-42133 how to use `arrow::ArrayStatitics` for Apache Parquet's statistics. ### What changes are included in this PR? This only adds `arrow::ArrayStatistics` and its tests. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #41909 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Rationale for this change
We're discussion API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and GH-41909.
If we have
arrow::ArrayStatistics
, we can attach statistics read from Apache Parquet toarrow::Array
s.What changes are included in this PR?
This adds an
arrow::ArrayStatistics
argument toarrow::Array
family constructors that usearrow::ArrayData
as their argument.This supports associating statistics read from Apache Parquet data to
arrow::BooleanArray
/arrow::Int*Array
/arrow::UInt*Array
. It's for demonstrating how to usearrow::ArrayStatistics
.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.