GH-41909: [C++] PoC: Add arrow::ArrayStatistics with Parquet statistics integration #42133

kou · 2024-06-13T08:15:16Z

Rationale for this change

We're discussing the API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and in GH-41909.

If we have arrow::ArrayStatistics, we can attach statistics read from Apache Parquet to arrow::Arrays.

What changes are included in this PR?

This adds an arrow::ArrayStatistics argument to arrow::Array family constructors that use arrow::ArrayData as their argument.

This supports associating statistics read from Apache Parquet data to arrow::BooleanArray/arrow::Int*Array/arrow::UInt*Array. It's for demonstrating how to use arrow::ArrayStatistics.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

GitHub Issue: [C++] Add arrow::ArrayStatistics #41909

github-actions · 2024-06-13T08:15:43Z

⚠️ GitHub issue #41909 has been automatically assigned in GitHub to PR creator.

wgtmac · 2024-06-13T14:23:42Z

cpp/src/arrow/array/statistics.h

+  std::optional<int64_t> null_count = std::nullopt;
+
+  /// \brief The number of distinct values, may not be set
+  std::optional<int64_t> distinct_count = std::nullopt;


Is this really useful? AFAIK, Parquet does not populate this.

You're right. This is not useful for now because Parquet C++ doesn't populate it.
I just added this because parquet::Statistics has it.

wgtmac · 2024-06-13T14:26:38Z

cpp/src/arrow/array/statistics.h

+/// data source can be read unified API via this class.
+struct ARROW_EXPORT ArrayStatistics {
+ public:
+  using ElementBufferType = std::variant<bool, int8_t, uint8_t, int16_t, uint16_t,


Apart from var-len types, how do we support timestamp, interval and other primitive types?

BTW, should we add an optional buffer to store values for string/binary types? Thus we can use std::string_view to represent them. If the min/max values are exact values, we can point them to the values in the arrow array and do not store any value in the optional buffer.

Apart from var-len types, how do we support timestamp, interval and other primitive types?

Ah, we need arrow::DataType for them. We can add it to ArrayStatistics, or users can use the one in the associated ArrayData. The latter is a bit more difficult to use, but it may be preferred because it slightly reduces the memory needed.

should we add an optional buffer to store values for string/binary types? Thus we can use std::string_view to represent them. If the min/max values are exact values, we can point them to the values in the arrow array and do not store any value in the optional buffer.

This is a discussion point in the PR description. In general, I think that std::string_view is better, but the problem is where to store the referenced data, as you mentioned. It may be costly to find the min/max values in the Arrow array because statistics provided by a data source will not provide the value position.
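To make the trade-off above concrete, here is a minimal sketch of a statistics struct whose min/max values are held in a std::variant, with an owning std::string for string/binary values so that nothing has to point into the array's buffers. The names SketchStatistics, ValueType and ToString are hypothetical and exist only for this illustration; this is not the definition proposed in the PR.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical stand-in for the statistics struct discussed above.
// It is NOT the actual arrow::ArrayStatistics definition.
struct SketchStatistics {
  // An owning std::string holds string/binary values, so the statistics
  // never refer to memory inside the array's own buffers.
  using ValueType = std::variant<bool, int64_t, uint64_t, double, std::string>;

  std::optional<int64_t> null_count;      // may not be set
  std::optional<int64_t> distinct_count;  // may not be set
  std::optional<ValueType> min;           // may not be set
  std::optional<ValueType> max;           // may not be set
};

// Render a statistics value for debugging by dispatching on the
// alternative that the variant currently holds.
std::string ToString(const SketchStatistics::ValueType& value) {
  return std::visit(
      [](const auto& v) -> std::string {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, std::string>) {
          return v;
        } else if constexpr (std::is_same_v<T, bool>) {
          return v ? "true" : "false";
        } else {
          return std::to_string(v);
        }
      },
      value);
}
```

The variant alone does not say whether an int64_t is a plain integer or, say, a timestamp, which is why the thread notes that an arrow::DataType is still needed to interpret the value.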

wgtmac · 2024-06-13T14:27:45Z

cpp/src/arrow/array/statistics.h

+  std::optional<int64_t> distinct_count = std::nullopt;
+
+  /// \brief The current minimum value buffer, may not be set
+  std::optional<ElementBufferType> min_buffer = std::nullopt;


Are they exact min/max values, or they can be lower/upper bounds? Should we add a flag to indicate this case? FYI, parquet has a flag to indicate whether the value is exact or not.

Thanks. I didn't know that Parquet has flags that indicate whether the min/max values are exact:

arrow/cpp/src/parquet/parquet.thrift

Lines 265 to 282 in d078d5c

/**
 * Lower and upper bound values for the column, determined by its ColumnOrder.
 *
 * These may be the actual minimum and maximum values found on a page or column
 * chunk, but can also be (more compact) values that do not exist on a page or
 * column chunk. For example, instead of storing "Blart Versenwald III", a writer
 * may set min_value="B", max_value="C". Such more compact values must still be
 * valid values within the column's logical type.
 *
 * Values are encoded using PLAIN encoding, except that variable-length byte
 * arrays do not include a length prefix.
 */
5: optional binary max_value;
6: optional binary min_value;
/** If true, max_value is the actual maximum value for a column */
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;

Let's add is_min_exact/is_max_exact.

FYI, is_max_value_exact and is_min_value_exact were added recently, and they are not used yet, at least in parquet-cpp and parquet-java.

I've added ArrayStatistics::is_min_exact and ArrayStatistics::is_max_exact.
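As a rough illustration of why the exactness flags matter, the sketch below separates two uses of a maximum: pruning, where any valid upper bound is enough, and answering "what is the maximum?", where only an exact value is safe. Int64Stats, CanSkipGreaterThan and ExactMax are hypothetical names for this example, and plain bool flags are an assumption rather than the PR's actual field types.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified statistics for an int64 array. The flag names
// follow the is_min_exact/is_max_exact naming mentioned above.
struct Int64Stats {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
  bool is_min_exact = false;
  bool is_max_exact = false;
};

// Returns true when no value in the array can satisfy `value > threshold`.
// An upper bound is sufficient here, so the exactness flag is not consulted.
bool CanSkipGreaterThan(const Int64Stats& stats, int64_t threshold) {
  return stats.max.has_value() && *stats.max <= threshold;
}

// Returns the maximum only when it is known to be an actual value in the
// array; a mere upper bound must not be reported as the maximum.
std::optional<int64_t> ExactMax(const Int64Stats& stats) {
  if (stats.max.has_value() && stats.is_max_exact) return stats.max;
  return std::nullopt;
}
```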

wgtmac · 2024-06-13T14:37:47Z

cpp/src/parquet/arrow/reader_internal.cc

  }
+  *out = std::make_shared<ArrayType<ArrowType>>(std::move(array_data));


IIRC, the current parquet reader will return a chunked array containing one or multiple arrays (which may or may not come from different row groups). Here the stats do not contain the exact min/max values because they are from the row group level.

Thanks for the information. I didn't know it.

It seems that arrays in a chunked array correspond to column chunks in Parquet, right?
If so, ColumnChunkMetaData for each column chunk has statistics for the column chunk, right? Or does it have statistics for the row group of the column chunk not the column chunk itself?

(Do you mean that we can't associate ColumnChunkMetaData::statistics() information with arrow::Array because it has the statistics for the row group, not the column chunk?)

The column statistics in the ColumnChunkMetaData are for each column chunk in that row group. However, if I remember correctly, the parquet reader may return arrow (chunked) arrays spanning more than one row group in a single RecordBatch or Table. I'm not sure if we can represent correct statistics in this case.

cc @mapleFU to confirm.

In an extreme case, users might read a parquet file containing two row groups into a single arrow::Table. Then we have to merge the column statistics, right?

We don't need to merge them because each arrow::Array in an arrow::Table can have its own statistics.
We may add arrow::Table/arrow::RecordBatch (row group) level statistics later but it's out of scope of this proposal.

Sounds good. Then this is not an issue if stats is per array.
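The conclusion above is that statistics stay per array, so the reader never has to merge them. If a consumer wanted column-wide bounds anyway, it could combine the per-chunk values on its own side. The sketch below shows one way to do that under the assumption that a missing bound in any chunk makes the combined bound unknown. ChunkStats and CombineBounds are hypothetical names and are not part of the proposal.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-array statistics, one instance per chunk.
struct ChunkStats {
  std::optional<int64_t> min;
  std::optional<int64_t> max;
};

// Combine per-chunk bounds into bounds for the whole column.
// If any chunk lacks a bound, the combined bound is reported as unknown.
ChunkStats CombineBounds(const std::vector<ChunkStats>& chunks) {
  ChunkStats out;
  if (chunks.empty()) return out;
  out = chunks.front();
  for (std::size_t i = 1; i < chunks.size(); ++i) {
    const ChunkStats& c = chunks[i];
    if (out.min && c.min) {
      out.min = std::min(*out.min, *c.min);
    } else {
      out.min.reset();
    }
    if (out.max && c.max) {
      out.max = std::max(*out.max, *c.max);
    } else {
      out.max.reset();
    }
  }
  return out;
}
```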

felipecrv · 2024-06-14T17:58:14Z

cpp/src/arrow/array/array_base.h

+  /// object which backs this Array.
+  ///
+  /// \return const ArrayStatistics&
+  const ArrayStatistics& statistics() const { return data_->statistics; }


Should statistics be stored in memory together with every ArrayData instance?

Another problem with this is that statistics are derived data and ArrayData is mutable when manipulated directly, so any mutation of ArrayData will have to consider the consequences to the derived statistics.

Lazily-computed null_count_ is a source of bugs and complexity for this reason. IMO statistics should be (1) computed or (2) carried from a file reader (like Parquet's) as something on the side.

Should statistics be stored in memory together with every ArrayData instance?

If it's not desired, we can avoid it by using std::shared_ptr<ArrayStatistics> or something.

Another problem with this is that statistics are derived data and ArrayData is mutable when manipulated directly, so any mutation of ArrayData will have to consider the consequences to the derived statistics.

Lazily-computed null_count_ is a source of bugs and complexity for this reason. IMO statistics should be (1) computed or (2) carried from a file reader (like Parquet's) as something on the side.

How about attaching the statistics read by a file reader to Array (not ArrayData) directly?

How about attaching the statistics read by a file reader to Array (not ArrayData) directly?

Makes more sense. Perhaps even attach it only to the typed array classes and not Array itself.

I've added Array::SetStatistics() for this.
I wanted to pass statistics via the constructor to prevent changing them after construction, but I didn't do it because our constructors already require many arguments. For example, PrimitiveArray has 6 arguments:

arrow/cpp/src/arrow/array/array_base.h

Lines 272 to 275 in 797ca30

PrimitiveArray(const std::shared_ptr<DataType>& type, int64_t length,
               const std::shared_ptr<Buffer>& data,
               const std::shared_ptr<Buffer>& null_bitmap = NULLPTR,
               int64_t null_count = kUnknownNullCount, int64_t offset = 0);

SetStatistics() isn't thread safe because std::shared_ptr::operator=() isn't thread safe. So users must set statistics before parallel processing.

What do you think about this API?

The stats shared_ptr should become an additional (but optional) parameter to MakeArray:

std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
  std::shared_ptr<Array> out;
  ArrayDataWrapper wrapper_visitor(data, &out);
  DCHECK_OK(VisitTypeInline(*data->type, &wrapper_visitor));
  DCHECK(out);
  return out;
}
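The suggestion above can be sketched with toy types: a factory takes the statistics as a defaulted trailing parameter and the array stores them as an immutable pointer fixed at construction, which also sidesteps the thread-safety concern about a later setter. ToyStatistics, ToyData, ToyArray and MakeToyArray are hypothetical stand-ins invented for this example. They are not Arrow classes, and the real MakeArray signature is only what the thread quotes.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical stand-ins for ArrayStatistics and ArrayData.
struct ToyStatistics {
  std::optional<int64_t> null_count;
};

struct ToyData {
  int64_t length = 0;
};

class ToyArray {
 public:
  // Statistics are fixed at construction, so no setter is needed and
  // readers on several threads never observe a pointer being reassigned.
  ToyArray(std::shared_ptr<ToyData> data,
           std::shared_ptr<const ToyStatistics> statistics)
      : data_(std::move(data)), statistics_(std::move(statistics)) {}

  int64_t length() const { return data_->length; }

  // May be null when the data source provided no statistics.
  const std::shared_ptr<const ToyStatistics>& statistics() const {
    return statistics_;
  }

 private:
  std::shared_ptr<ToyData> data_;
  const std::shared_ptr<const ToyStatistics> statistics_;
};

// Factory with statistics as an additional but optional parameter, so
// existing call sites that pass only the data keep compiling.
std::shared_ptr<ToyArray> MakeToyArray(
    std::shared_ptr<ToyData> data,
    std::shared_ptr<const ToyStatistics> statistics = nullptr) {
  return std::make_shared<ToyArray>(std::move(data), std::move(statistics));
}
```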

OK. I'll take the approach of changing only the ArrayData constructors.

Done:

I've added Array::Array(const std::shared_ptr<ArrayData>& data, const std::shared_ptr<ArrayStatistics>& statistics), which calls SetData() and SetStatistics(), and used it instead of calling SetData() directly (as much as possible) in sub arrays.

I've added ValidateData() and it's called by SetData() to use Array::Array(data, statistics) in sub arrays.

I've unified the const std::shared_ptr<ArrayData>& data argument and the std::shared_ptr<ArrayData> data argument in sub arrays' constructors to const & because the related code uses const &.

There are many small diffs for the change. Review may be a bit difficult.

Changes that affect the overall design of Array should probably be extracted to a separate PR so they can be reviewed more carefully. It seems very risky to decouple ValidateData from SetData.

It makes sense.
I've opened #43273 that only includes arrow::ArrayStatistics.

We can keep using this PR for discussing arrow::ArrayStatistics related APIs because this PR still includes not only arrow::ArrayStatistics but also the Apache Parquet integration example.

felipecrv · 2024-07-11T20:10:20Z

cpp/src/arrow/type_fwd.h

+template <typename TypeClass>
+class TypedArrayStatistics;


A new typed class hierarchy for statistics seems like overkill. Think of statistics as a dynamic object like ArrayData: classes like BooleanArray and StringArray have typed accessors to the underlying untyped statistics object, just like they have typed accessors to the underlying ArrayData for the array buffers of that type.

And compute kernels that deal directly with untyped ArrayData would deal directly with untyped ArrayStats objects.

I think that the typed classes are convenient but they are not required in the first version. I'll remove them. If we find that we need them, we can revisit this later.
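The reviewer's idea of typed accessors over one untyped statistics object can be sketched as follows. The statistics hold a variant, and a typed array class exposes a view that returns a value only when the stored alternative matches its element type. UntypedStatistics and ToyInt64Array are hypothetical names for this illustration and do not mirror the Arrow class layout.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <variant>

// Hypothetical untyped statistics object, shared by all array types.
struct UntypedStatistics {
  using ValueType = std::variant<bool, int64_t, double, std::string>;
  std::optional<ValueType> min;
  std::optional<ValueType> max;
};

// Hypothetical typed array that offers typed accessors to the untyped
// statistics, instead of relying on a parallel typed statistics hierarchy.
class ToyInt64Array {
 public:
  explicit ToyInt64Array(std::shared_ptr<const UntypedStatistics> statistics)
      : statistics_(std::move(statistics)) {}

  // Each accessor yields a value only if statistics are attached, the
  // field is set, and the stored alternative is int64_t.
  std::optional<int64_t> min() const {
    return statistics_ ? Extract(statistics_->min) : std::nullopt;
  }
  std::optional<int64_t> max() const {
    return statistics_ ? Extract(statistics_->max) : std::nullopt;
  }

 private:
  static std::optional<int64_t> Extract(
      const std::optional<UntypedStatistics::ValueType>& value) {
    if (!value) return std::nullopt;
    if (const int64_t* v = std::get_if<int64_t>(&*value)) return *v;
    return std::nullopt;
  }

  std::shared_ptr<const UntypedStatistics> statistics_;
};
```

Code that works on untyped data can keep passing UntypedStatistics around unchanged, while typed callers get int64_t values without touching the variant.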

See apacheGH-42133 how to use this for Apache Parquet statistics.

### Rationale for this change

We're discussing the API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and in GH-41909.

If we have `arrow::ArrayStatistics`, we can attach statistics read from Apache Parquet to `arrow::Array`s.

This only includes `arrow::ArrayStatistics`. See GH-42133 for how to use `arrow::ArrayStatistics` for Apache Parquet's statistics.

### What changes are included in this PR?

This only adds `arrow::ArrayStatistics` and its tests.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #41909

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

kou requested a review from wgtmac as a code owner June 13, 2024 08:15

github-actions bot added the Component: Parquet, Component: C++, and awaiting committer review labels Jun 13, 2024

wgtmac reviewed Jun 13, 2024

github-actions bot added the awaiting changes label and removed the awaiting committer review label Jun 14, 2024

felipecrv reviewed Jun 14, 2024

kou force-pushed the cpp-array-statistics branch from f362ccc to c2ba4ed Compare July 11, 2024 06:35

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Jul 11, 2024

felipecrv reviewed Jul 11, 2024

kou force-pushed the cpp-array-statistics branch from 30193df to c905bfc Compare July 12, 2024 04:53

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Jul 12, 2024

kou added 4 commits July 13, 2024 06:44

apacheGH-41909: PoC: [C++] Add arrow::ArrayStatistics

d5013c9

Move statistics to Array from ArrayData

355172f

Add is_min_exact/is_max_exact

49e7b03

Use BooleanArrayStatistics

2bed0ae

kou added 6 commits July 13, 2024 06:44

Use ChunkedArray to keep statistics

0ef48e1

Use Array's ArrayData constructor to assign statistics

fd7af21

Add a missing &

4f609bb

Remove TypedArrayStatistics

23844db

Fix style

e58ce08

Don't use virtual functions in constructor directly

5bf9935

kou force-pushed the cpp-array-statistics branch from eb1bd05 to 5bf9935 Compare July 13, 2024 01:33

kou changed the title ~~GH-41909: PoC: [C++] Add arrow::ArrayStatistics~~ GH-41909: [C++] Add arrow::ArrayStatistics Jul 15, 2024

felipecrv requested a review from pitrou July 15, 2024 15:32

github-actions bot added the awaiting changes label and removed the awaiting change review label Jul 15, 2024

kou changed the title ~~GH-41909: [C++] Add arrow::ArrayStatistics~~ GH-41909: [C++] PoC: Add arrow::ArrayStatistics with Parquet statistics integration Jul 16, 2024

kou added a commit to kou/arrow that referenced this pull request Jul 16, 2024

apacheGH-41909: [C++] Add arrow::ArrayStatistics

f9136fb

See apacheGH-42133 how to use this for Apache Parquet statistics.

kou mentioned this pull request Jul 16, 2024

GH-41909: [C++] Add arrow::ArrayStatistics #43273

Merged

kou added a commit to kou/arrow that referenced this pull request Aug 2, 2024

apacheGH-41909: [C++] Add arrow::ArrayStatistics

bed8e3d

See apacheGH-42133 how to use this for Apache Parquet statistics.

This was referenced Aug 13, 2024

[C++] Attach arrow::ArrayStatistics to arrow::Array #43666

Closed

GH-43666: [C++] Attach arrow::ArrayStatistics to arrow::Array #43705

Closed

kou closed this Sep 9, 2024

kou deleted the cpp-array-statistics branch September 9, 2024 01:57
