GH-35627: [C++][Format][Integration] Add string view to the arrow format #35628

bkietz · 2023-05-16T21:33:20Z

String view (and the equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings, this PR uses a pair of 32-bit integers as a
buffer index and offset, which

- makes explicit the guarantee that the lifetime of all character data is equal
  to that of the array which views it, which is critical for confident
  consumption across an interface boundary
- makes the arrays meaningfully serializable and
  venue agnostic; directly usable in shared memory without modification
- allows easy validation

Changes outside the C++ implementation:

- New types added to Schema.fbs
- Message.fbs amended to support variable buffer counts between string view chunks
- datagen.py extended to produce integration JSON for string view arrays
- Columnar.rst amended with a description of the string view format

Changes to the C++ implementation:

- The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
- The values of string view arrays can be visited as std::string_view as with StringArray
- String view arrays can be round tripped through IPC, parquet, and integration JSON
- A variant of the string view type utf8_view(/*has_raw_pointers=*/true) is supported
  which uses raw pointer views. This enables zero copy interop with code which uses
  raw pointer views.
- Conversions are provided between index/offset view arrays, raw pointer view arrays, and
  regular string arrays.
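
For readers new to the representation, here is a minimal sketch of how an index/offset view might be built and then resolved to a `std::string_view`. It is not the Arrow C++ API: `View`, `MakeView`, `Resolve`, and the use of `std::vector<std::string>` as the data buffers are all hypothetical names chosen for illustration, and fields are read with native endianness.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Strings of at most 12 bytes are stored inline (per the layout discussed in this PR).
constexpr int32_t kInlineSize = 12;

// Hypothetical 16-byte view: 4 bytes of length followed by 12 payload bytes.
// Payload is the inline characters for short strings, otherwise
// prefix (4 bytes) + buffer index (int32) + offset (int32).
struct View {
  int32_t length;
  char rest[12];
};
static_assert(sizeof(View) == 16, "each view is 16 bytes");

inline int32_t ReadInt32(const char* p) {
  int32_t v;
  std::memcpy(&v, p, sizeof(v));  // avoids unaligned or aliasing reads
  return v;
}

// Builds a view of `s`. For long strings the caller states where the bytes
// live: data buffer `buffer_index`, starting at byte `offset`.
inline View MakeView(std::string_view s, int32_t buffer_index, int32_t offset) {
  View v{};  // zero-initialized, so unused payload bytes are zero
  v.length = static_cast<int32_t>(s.size());
  if (v.length <= kInlineSize) {
    if (!s.empty()) std::memcpy(v.rest, s.data(), s.size());
  } else {
    std::memcpy(v.rest, s.data(), 4);  // prefix
    std::memcpy(v.rest + 4, &buffer_index, sizeof(buffer_index));
    std::memcpy(v.rest + 8, &offset, sizeof(offset));
  }
  return v;
}

// Resolves a view to its characters. No raw pointer is stored in the view;
// the characters are found through the array's own data buffers.
inline std::string_view Resolve(const View& v,
                                const std::vector<std::string>& data_buffers) {
  if (v.length <= kInlineSize) {
    return std::string_view(v.rest, static_cast<size_t>(v.length));
  }
  const std::string& buffer =
      data_buffers[static_cast<size_t>(ReadInt32(v.rest + 4))];
  return std::string_view(buffer).substr(
      static_cast<size_t>(ReadInt32(v.rest + 8)),
      static_cast<size_t>(v.length));
}
```

Because the view only names a buffer and an offset, the same 16 bytes stay meaningful after the buffers are copied or mapped elsewhere, which is the serializability point made above.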

Closes: [Format] Add string view type to the arrow format #35627

westonpace

Just some questions from a scan of the text portion. I'll try and look at the rest in more detail later.

pitrou · 2023-06-19T16:07:51Z

Plasma was removed from the repo, so the src/plasma changes shouldn't appear here :-)

pitrou · 2023-06-19T16:18:20Z

cpp/src/arrow/util/string_header.h

+/// Long string    |----|----|--------|
+///                 ^    ^      ^
+///                 |    |      |
+///                 size prefix raw pointer to out-of-line portion


Can raw pointers be a separate PR?

Ping @bkietz . Either raw pointers are a separate PR, or at least they should not use the same payload type (StringViewHeader vs. something else).

Note that using raw pointers is a property of the array's DataType, not a per-value property inside the array.

mapleFU · 2023-06-28T15:30:11Z

docs/source/format/Columnar.rst

+      |------------|------------|------------|-------------|
+      | length     | prefix     | buf. index | offset      |
+
+In both the long and short string cases, the first four bytes encode the


Hi, I have a question about this format.
For StructArray[1] or FixedListArray[2], when the parent is not valid, the corresponding child slot is left "undefined". When the child's own validity marks that slot as valid, would its view point to an undefined address?

tustvold

FYI I started a Rust implementation here, I will leave comments as I encounter things.

One thing this PR doesn't appear to define is the FFI schema mapping, in particular how to encode StringView and BinaryView in the textual schema representation.

tustvold · 2023-07-30T08:07:43Z

docs/source/format/Columnar.rst

+stored inline in the prefix, after the length. This prefix enables a
+profitable fast path for string comparisons, which are frequently determined
+within the first four bytes.
+


Suggested change

All views must be well defined, even for null slots. In particular, if the length is greater than 12, the prefix, buffer index and offset must refer to valid data.

This is a very important property for the Rust implementation to be able to provide safe value access without needing to inspect the null mask. This in turn is important because it allows more sophisticated strategies to handle / iterate the null mask.
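
As an aside on the hunk quoted in this thread, the comparison fast path it describes could be sketched roughly as below. This is only an illustration under stated assumptions: `ViewHead`, `MakeHead`, and `Equal` are hypothetical names, only the first 8 of the 16 view bytes are modeled, and the prefix is assumed to be zero-padded when the string is shorter than four bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical reduced view: just the length and the four-byte prefix.
struct ViewHead {
  int32_t length;
  char prefix[4];  // first four bytes of the string, zero-padded (assumption)
};

inline ViewHead MakeHead(std::string_view s) {
  ViewHead h{};  // zero-initialized so short prefixes are padded with zeros
  h.length = static_cast<int32_t>(s.size());
  const size_t n = s.size() < 4 ? s.size() : 4;
  if (n != 0) std::memcpy(h.prefix, s.data(), n);
  return h;
}

// Equality that consults the full character data only when the first
// 8 bytes of both views agree. `full_compares` counts those slow-path hits.
inline bool Equal(std::string_view a, const ViewHead& ha,
                  std::string_view b, const ViewHead& hb,
                  int* full_compares) {
  if (ha.length != hb.length ||
      std::memcmp(ha.prefix, hb.prefix, sizeof(ha.prefix)) != 0) {
    return false;  // decided from length and prefix alone
  }
  if (ha.length <= 4) return true;  // the prefix already held every byte
  ++*full_compares;
  return a == b;
}
```

Strings that differ in length or in their first four bytes never touch the out-of-line data, which is where the quoted text expects most comparisons to be decided.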

tustvold · 2023-07-30T08:11:47Z

docs/source/format/Columnar.rst

+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:


The endianness of this data structure wasn't immediately apparent to me. I interpreted the view as being a single 128-bit integer with the native endianness, which I believe is consistent with intervals.

pitrou

More comments on this. I took the liberty to ping on some previous comments.

pitrou · 2023-08-24T13:26:12Z

cpp/src/parquet/statistics.cc

+  } else if (values.type_id() == ::arrow::Type::BINARY_VIEW ||
+             values.type_id() == ::arrow::Type::STRING_VIEW) {
+    ::arrow::VisitArraySpanInline<::arrow::BinaryViewType>(


Perhaps define an is_binary_view_like predicate?

Refer to this graph to visualize how this would fit among the existing predicates.

https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db
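
For concreteness, the suggested predicate could be as small as the sketch below. `TypeId` is a hypothetical stand-in for `::arrow::Type::type` with only a few enumerators, so this shows the shape of the helper rather than code that exists in Arrow.

```cpp
#include <cassert>

// Hypothetical stand-in for ::arrow::Type::type, reduced for illustration.
enum class TypeId { INT32, STRING, BINARY, STRING_VIEW, BINARY_VIEW };

// True for the two view-based variable length types, false otherwise.
constexpr bool is_binary_view_like(TypeId id) {
  return id == TypeId::BINARY_VIEW || id == TypeId::STRING_VIEW;
}

static_assert(is_binary_view_like(TypeId::STRING_VIEW), "string view is view-like");
static_assert(!is_binary_view_like(TypeId::STRING), "plain string is not view-like");
```

With such a helper, the two-way `type_id()` comparison in the quoted hunk would collapse to a single call.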

pitrou · 2023-08-24T14:05:19Z

docs/source/format/Columnar.rst

+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:


Not really. Each packed struct's field should have its endianness handled independently.
For example, short strings would be:

- in little-endian mode:
  - length: bytes 0-3 (little-endian int32)
  - data: bytes 4-15 (no endianness)
- in big-endian mode:
  - length: bytes 0-3 (big-endian int32)
  - data: bytes 4-15 (no endianness)
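
A rough sketch of what per-field handling means when decoding a short view from raw bytes follows. `Endian`, `LoadInt32`, `ShortView`, and `DecodeShort` are hypothetical helpers written for this comment, and the sketch assumes the decoded length is between 0 and 12.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

enum class Endian { Little, Big };

// Assembles an int32 from four bytes in the stated byte order,
// independent of the host's native endianness.
inline int32_t LoadInt32(const unsigned char* p, Endian e) {
  const uint32_t b0 = p[0], b1 = p[1], b2 = p[2], b3 = p[3];
  const uint32_t v = (e == Endian::Little)
                         ? (b0 | (b1 << 8) | (b2 << 16) | (b3 << 24))
                         : (b3 | (b2 << 8) | (b1 << 16) | (b0 << 24));
  return static_cast<int32_t>(v);
}

struct ShortView {
  int32_t length;
  std::string_view data;
};

// Decodes a 16-byte short view: only the length field (bytes 0-3) is
// endianness-sensitive; the character bytes 4-15 are taken as-is.
inline ShortView DecodeShort(const unsigned char* view16, Endian e) {
  const int32_t length = LoadInt32(view16, e);
  return {length,
          std::string_view(reinterpret_cast<const char*>(view16 + 4),
                           static_cast<size_t>(length))};
}
```

Treating the whole view as one 128-bit integer would instead reverse the character bytes on a byte swap, which is why the fields are handled one at a time.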

pitrou · 2023-08-24T14:08:34Z

docs/source/format/Columnar.rst

+stored inline in the prefix, after the length. This prefix enables a
+profitable fast path for string comparisons, which are frequently determined
+within the first four bytes.
+


Ping @bkietz . This doesn't seem to match the current C++ binary view tests, which state that:

Invalid string views which are masked by a null bit do not cause validation to fail
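
To make the two positions concrete, here is a minimal sketch of a validator with a switch between them. Nothing here is Arrow's actual validation code: `View`, `ViewIsValid`, `ValidateViews`, and the `check_null_slots` flag are hypothetical, data buffers are represented only by their sizes, and fields are read with native endianness.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 16-byte view: length, then 12 payload bytes
// (inline characters, or prefix + buffer index + offset).
struct View {
  int32_t length;
  char rest[12];
};

// Checks one view against the sizes of the array's data buffers.
inline bool ViewIsValid(const View& v,
                        const std::vector<int64_t>& data_buffer_sizes) {
  if (v.length < 0) return false;
  if (v.length <= 12) return true;  // inline: nothing to dereference
  int32_t index, offset;
  std::memcpy(&index, v.rest + 4, sizeof(index));
  std::memcpy(&offset, v.rest + 8, sizeof(offset));
  if (index < 0 || static_cast<size_t>(index) >= data_buffer_sizes.size()) {
    return false;
  }
  if (offset < 0) return false;
  return static_cast<int64_t>(offset) + v.length <=
         data_buffer_sizes[static_cast<size_t>(index)];
}

// check_null_slots == true: every view must be well defined, even under a
// null bit (the rule proposed in the suggested change).
// check_null_slots == false: views masked by a null bit are skipped (the
// behavior described by the quoted C++ test).
inline bool ValidateViews(const std::vector<View>& views,
                          const std::vector<bool>& is_valid,
                          const std::vector<int64_t>& data_buffer_sizes,
                          bool check_null_slots) {
  for (size_t i = 0; i < views.size(); ++i) {
    if (!is_valid[i] && !check_null_slots) continue;
    if (!ViewIsValid(views[i], data_buffer_sizes)) return false;
  }
  return true;
}
```

Under the stricter rule a consumer can read any slot without first consulting the null mask; under the looser one it cannot.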

String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which

- makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary
- makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification
- allows easy validation

This PR is extracted from #35628 to unblock independent PRs now that the vote has passed, including:

- New types added to Schema.fbs
- Message.fbs amended to support variable buffer counts between string view chunks
- datagen.py extended to produce integration JSON for string view arrays
- Columnar.rst amended with a description of the string view format

* Closes: #35627

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>

bkietz · 2023-09-21T16:22:10Z

Superseded by #37792

### Rationale for this change

After the PR changing the spec and schema (#37526) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr (#35628) for ease of review.

### What changes are included in this PR?

- The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
- The values of string view arrays can be visited as `std::string_view` as with StringArray
- String view arrays can be round tripped through IPC and integration JSON

* Closes: #37710

Relevant mailing list discussions:

* https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
* https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>

…mplementation (#35769)

### Rationale for this change

See #35628 for the rationale and description of the StringView/BinaryView array types. This change is adding Go as a second implementation of it.

### What changes are included in this PR?

Add Array Types for `StringView` and `BinaryView` along with `StringViewType` and `BinaryViewType` and necessary enums and builders. These arrays can be round tripped through JSON and IPC.

### Are these changes tested?

Yes, unit tests have been added and integration tests run

* Closes: #38718

Lead-authored-by: Matt Topol <[email protected]>
Co-authored-by: Alex Shcherbakov <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>

github-actions bot added the Component: C++, Component: Documentation, Component: Parquet, Component: Python, and awaiting committer review labels May 16, 2023

zeroshade reviewed May 23, 2023

format/Message.fbs

github-actions bot added the awaiting changes and Component: Gandiva labels and removed the awaiting committer review label May 23, 2023

bkietz force-pushed the string-view/indices-offsets branch from 5f1ad87 to c7622da Compare May 24, 2023 17:01

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels May 24, 2023

westonpace requested changes May 24, 2023

docs/source/format/Columnar.rst

docs/source/format/Columnar.rst (outdated)

docs/source/format/Columnar.rst (outdated)

github-actions bot added the awaiting review, awaiting changes, Component: R, and awaiting change review labels and removed the awaiting change review, awaiting review, and awaiting changes labels May 24, 2023

zeroshade mentioned this pull request May 25, 2023

GH-38718: [Go][Format][Integration] Add StringView/BinaryView to Go implementation #35769

Merged

bkietz marked this pull request as ready for review May 26, 2023 00:13