Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41909: [C++] PoC: Add arrow::ArrayStatistics with Parquet statistics integration #42133

Closed
wants to merge 10 commits into from

Conversation

kou
Copy link
Member

@kou kou commented Jun 13, 2024

Rationale for this change

We're discussion API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and GH-41909.

If we have arrow::ArrayStatistics, we can attach statistics read from Apache Parquet to arrow::Arrays.

What changes are included in this PR?

This adds an arrow::ArrayStatistics argument to arrow::Array family constructors that use arrow::ArrayData as their argument.

This supports associating statistics read from Apache Parquet data to arrow::BooleanArray/arrow::Int*Array/arrow::UInt*Array. It's for demonstrating how to use arrow::ArrayStatistics.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou kou requested a review from wgtmac as a code owner June 13, 2024 08:15
Copy link

⚠️ GitHub issue #41909 has been automatically assigned in GitHub to PR creator.

std::optional<int64_t> null_count = std::nullopt;

/// \brief The number of distinct values, may not be set
std::optional<int64_t> distinct_count = std::nullopt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really useful? AFAIK, Parquet does not populate this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. This is not useful for now because Parquet C++ doesn't populate it.
I just add this because parquet::Statistics has this.

/// data source can be read unified API via this class.
struct ARROW_EXPORT ArrayStatistics {
public:
using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, uint16_t,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from var-len type, how do we support timestamp, internal and other primitive types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, should we add an optional buffer to store values for string/binary types? Thus we can use std::string_view to represent them. If the min/max values are exact values, we can point them to the values in the arrow array and do not store any value in the optional buffer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from var-len type, how do we support timestamp, internal and other primitive types?

Ah, we need arrow::DataType for them. We can add it to ArrayStatistics or users can use it in associated ArrayData. The latter is a bit difficult to use but it may be preferred because it can reduce needed memory a bit.

should we add an optional buffer to store values for string/binary types? Thus we can use std::string_view to represent them. If the min/max values are exact values, we can point them to the values in the arrow array and do not store any value in the optional buffer.

This is a discussion point in the PR description. In general, I think that std::string_view is better but there is a problem where we should store referred data as you mentioned. It may be costly to find the min/max values in the Arrow array because statistics provided by data source will not provide the value position.

std::optional<int64_t> distinct_count = std::nullopt;

/// \brief The current minimum value buffer, may not be set
std::optional<ElementBufferType> min_buffer = std::nullopt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they exact min/max values, or they can be lower/upper bounds? Should we add a flag to indicate this case? FYI, parquet has a flag to indicate whether the value is exact or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I didn't know that Parquet has exact or not for min/max values:

/**
* Lower and upper bound values for the column, determined by its ColumnOrder.
*
* These may be the actual minimum and maximum values found on a page or column
* chunk, but can also be (more compact) values that do not exist on a page or
* column chunk. For example, instead of storing "Blart Versenwald III", a writer
* may set min_value="B", max_value="C". Such more compact values must still be
* valid values within the column's logical type.
*
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
*/
5: optional binary max_value;
6: optional binary min_value;
/** If true, max_value is the actual maximum value for a column */
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;

Let's add is_min_exact/is_max_exact.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_max_value_exact and is_min_value_exact are added recently and we are not using them at least in parquet-cpp and parquet-java. FYI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added ArrayStatistics::is_min_exact and ArrayStatistics::is_max_exact.

}
*out = std::make_shared<ArrayType<ArrowType>>(std::move(array_data));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the current parquet reader will return a chunked array containing one or multiple arrays (may or may not from different row groups). Here the stats do not contain the exact min/max values because they are from the row group level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the information. I didn't know it.

It seems that arrays in a chunked array correspond to column chunks in Parquet, right?
If so, ColumnChunkMetaData for each column chunk has statistics for the column chunk, right? Or does it have statistics for the row group of the column chunk not the column chunk itself?

(Does you mean that we can't associate ColumnChunkMetaData::statistics() information with arrow::Array because it has the statistics for row group not column chunk?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column statistics in the ColumnChunkMetaData is for each column chunk in that row group. However, if I remember correctly, the parquet reader may return arrow (chunked) arrays spanning more than one row group in a single RecordBatch or Table. I'm not sure if we can represent correct statistics in this case.

cc @mapleFU to confirm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an extreme case, users might read a parquet file containing two row groups in a single arrow::Table, we have to merge the column statistics, right?

Copy link
Member Author

@kou kou Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to merge. Because each arrow::Array in a arrow::Table can have statistics.
We may add arrow::Table/arrow::RecordBatch (row group) level statistics later but it's out of scope of this proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Then this is not an issue if stats is per array.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 14, 2024
/// object which backs this Array.
///
/// \return const ArrayStatistics&
const ArrayStatistics& statistics() const { return data_->statistics; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should statistics be stored in memory together with every ArrayData instance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another problem with this is that statistics are derived data and ArraData is mutable when manipulated directly, so any mutation of ArrayData will have to consider the consequences to the derived statistics.

Lazily-computed null_count_ is a source of bugs and complexity for this reason. IMO statistics should be (1) computed or (2) carried from a file readers (like Parquet's) as something on the side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should statistics be stored in memory together with every ArrayData instance?

If it's not desired, we can avoid it by using std::shared_ptr<ArrayStatistics> or something.

Another problem with this is that statistics are derived data and ArraData is mutable when manipulated directly, so any mutation of ArrayData will have to consider the consequences to the derived statistics.

Lazily-computed null_count_ is a source of bugs and complexity for this reason. IMO statistics should be (1) computed or (2) carried from a file readers (like Parquet's) as something on the side.

How about attaching the statistics read by a file reader to Array (not ArrayData) directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about attaching the statistics read by a file reader to Array (not ArrayData) directly?

Makes more sense. Perhaps even attach it only to the typed array classes and not Array itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added Array::SetStatistics() for this.
I wanted to pass statistics by constructor to prevent changing statistics after construction but I didn't do it. Because our constructors already require many arguments. For example, PrimiriveArray has 6 arguments:

PrimitiveArray(const std::shared_ptr<DataType>& type, int64_t length,
const std::shared_ptr<Buffer>& data,
const std::shared_ptr<Buffer>& null_bitmap = NULLPTR,
int64_t null_count = kUnknownNullCount, int64_t offset = 0);

SetStatistics() isn't thread safe because std::shared_ptr::operator=() isn't thread safe. So users must set statistics before parallel processing.

What do you think about this API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stats shared_ptr should become an additional (but optional) parameter to MakeArray

std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
  std::shared_ptr<Array> out;
  ArrayDataWrapper wrapper_visitor(data, &out);
  DCHECK_OK(VisitTypeInline(*data->type, &wrapper_visitor));
  DCHECK(out);
  return out;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll choose the changing only ArrayData constructors approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

  • I've added Array::Array(const std::shared_ptr<ArrayData>& data, const std::shared_ptr<ArrayStatistics>& statistics) that calls SetData() and SetStatistics() and use it instead of direct SetData() call (as much as possible) in sub arrays.
  • I've added ValidateData() and it's called by SetData() to use Array::Array(data, statistics) in sub arrays.
  • I've unified const std::shared_ptr<ArrayData>& data argument and std::shared_ptr<ArrayData> data argument in sub array's constructors to const & because related codes use const &.

There are many small diffs for the change. Review may be a bit difficult.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes that affect the overall design of Array should probably extracted to a separate PR so they can be reviewed more carefully. It seems very risky to decouple ValidateData from SetData.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense.
I've opened #43273 that only include arrow::ArrayStatistics.

We can keep using this PR for discussing arrow::ArrayStatitics related APIs because this PR still includes not only arrow::ArrayStatistics but also Apache Parquet integration example.

@kou kou force-pushed the cpp-array-statistics branch from f362ccc to c2ba4ed Compare July 11, 2024 06:35
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jul 11, 2024
Comment on lines 92 to 93
template <typename TypeClass>
class TypedArrayStatistics;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new typed class hierarchy for statistics seems like an overkill. Think of statistics as a dynamic object like ArrayData and the classes like BooleanArray, StringArray have typed accessors to the underlying untyped statistics object just like they have typed accessors to the underlying ArrayData for the array buffers of that type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And compute kernels that deal directly with untyped ArrayData would deal directly with untyped ArrayStats objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the typed classes are convenient but they are required in the first version. I'll remove them. If we think that they are convenient, we can revisit this later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@kou kou force-pushed the cpp-array-statistics branch from 30193df to c905bfc Compare July 12, 2024 04:53
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jul 12, 2024
@kou kou force-pushed the cpp-array-statistics branch from eb1bd05 to 5bf9935 Compare July 13, 2024 01:33
@kou kou changed the title GH-41909: PoC: [C++] Add arrow::ArrayStatistics GH-41909: [C++] Add arrow::ArrayStatistics Jul 15, 2024
@felipecrv felipecrv requested a review from pitrou July 15, 2024 15:32
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 15, 2024
@kou kou changed the title GH-41909: [C++] Add arrow::ArrayStatistics GH-41909: [C++] PoC: Add arrow::ArrayStatistics with Parquet statistics integration Jul 16, 2024
kou added a commit to kou/arrow that referenced this pull request Jul 16, 2024
See apacheGH-42133 how to use this for Apache Parquet statistics.
kou added a commit to kou/arrow that referenced this pull request Aug 2, 2024
See apacheGH-42133 how to use this for Apache Parquet statistics.
kou added a commit that referenced this pull request Aug 4, 2024
### Rationale for this change

We're discussion API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and GH-41909.

If we have `arrow::ArrayStatistics`, we can attach statistics read from Apache Parquet to `arrow::Array`s.

This only includes `arrow::ArrayStatistics`. See GH-42133 how to use `arrow::ArrayStatitics` for Apache Parquet's statistics.

### What changes are included in this PR?

This only adds `arrow::ArrayStatistics` and its tests.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #41909

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou closed this Sep 9, 2024
@kou kou deleted the cpp-array-statistics branch September 9, 2024 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants