-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35627: [C++][Format][Integration] Add string view to the arrow format #35628
Conversation
|
5f1ad87
to
c7622da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some questions from a scan of the text portion. I'll try and look at the rest in more detail later.
Plasma was removed from the repo, so the |
/// Long string |----|----|--------| | ||
/// ^ ^ ^ | ||
/// | | | | ||
/// size prefix raw pointer to out-of-line portion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can raw pointers be a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @bkietz . Either raw pointers are a separate PR, or at least they should not use the same payload type (StringViewHeader
vs. something else).
Note that using raw pointers is a property of the array's DataType, not a per-value property inside the array.
|------------|------------|------------|-------------| | ||
| length | prefix | buf. index | offset | | ||
|
||
In both the long and short string cases, the first four bytes encode the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I've a question about this format.
For StructArray[1] or FixedListArray[2], when parent is not valid, the correspond child leaves "undefined". When a child validity is valid, would it point to a undefined address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I started a Rust implementation here, I will leave comments as I encounter things.
One thing this PR doesn't appear to define is the FFI schema mapping, in particular how to encode StringView and BinaryView in the textual schema representation
stored inline in the prefix, after the length. This prefix enables a | ||
profitable fast path for string comparisons, which are frequently determined | ||
within the first four bytes. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All views must be well defined, even for null slots, in particular if the length is greater than 12, the prefix, buffer index and offset must refer to valid data. | |
This is a very important property for the Rust implementation to be able to provide safe value access without needing to inspect the null mask. This in turn is important because it allows more sophisticated strategies to handle / iterate the null mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @bkietz . This doesn't seem to match the current C++ binary view tests, which state that:
Invalid string views which are masked by a null bit do not cause validation to fail
of potentially several **data** buffers or may contain the characters | ||
inline. | ||
|
||
The views buffer contains `length` view structures with the following layout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The endianness of this data structure wasn't immediately apparent to me, I interpreted the view as being a single 128-bit integer with the native endianness. I believe this is consistent with intervals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. Each packed struct's field should have its endianness handled independently.
For example, short strings would be:
- in little-endian mode:
- bytes 0-3 (little-endian int32)
- data: bytes 4-15 (no endianness)
- in big-endian mode:
- bytes 0-3 (big-endian int32)
- data: bytes 4-15 (no endianness)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More comments on this. I took the liberty to ping on some previous comments.
} else if (values.type_id() == ::arrow::Type::BINARY_VIEW || | ||
values.type_id() == ::arrow::Type::STRING_VIEW) { | ||
::arrow::VisitArraySpanInline<::arrow::BinaryViewType>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps define a is_binary_view_like
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to this graph to visualize how this would fit among the existing predicates.
https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db
of potentially several **data** buffers or may contain the characters | ||
inline. | ||
|
||
The views buffer contains `length` view structures with the following layout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. Each packed struct's field should have its endianness handled independently.
For example, short strings would be:
- in little-endian mode:
- bytes 0-3 (little-endian int32)
- data: bytes 4-15 (no endianness)
- in big-endian mode:
- bytes 0-3 (big-endian int32)
- data: bytes 4-15 (no endianness)
stored inline in the prefix, after the length. This prefix enables a | ||
profitable fast path for string comparisons, which are frequently determined | ||
within the first four bytes. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @bkietz . This doesn't seem to match the current C++ binary view tests, which state that:
Invalid string views which are masked by a null bit do not cause validation to fail
String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from #35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: #35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
Superceded by #37792 |
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
### Rationale for this change After the PR changing the spec and schema ( #37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( #35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: #37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…pache#37792) ### Rationale for this change After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: apache#37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…mplementation (#35769) ### Rationale for this change See #35628 for the rationale and description of the StringView/BinaryView array types. This change is adding Go as a second implementation of it. ### What changes are included in this PR? Add Array Types for `StringView` and `BinaryView` along with `StringViewType` and `BinaryViewType` and necessary enums and builders. These arrays can be round tripped through JSON and IPC. ### Are these changes tested? Yes, unit tests have been added and integration tests run * Closes: [#38718](#38718) * Closes: #38718 Lead-authored-by: Matt Topol <[email protected]> Co-authored-by: Alex Shcherbakov <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…pache#37792) ### Rationale for this change After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: apache#37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…o Go implementation (apache#35769) ### Rationale for this change See apache#35628 for the rationale and description of the StringView/BinaryView array types. This change is adding Go as a second implementation of it. ### What changes are included in this PR? Add Array Types for `StringView` and `BinaryView` along with `StringViewType` and `BinaryViewType` and necessary enums and builders. These arrays can be round tripped through JSON and IPC. ### Are these changes tested? Yes, unit tests have been added and integration tests run * Closes: [apache#38718](apache#38718) * Closes: apache#38718 Lead-authored-by: Matt Topol <[email protected]> Co-authored-by: Alex Shcherbakov <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which
to that of the array which views it, which is critical for confident
consumption across an interface boundary
venue agnostic; directly usable in shared memory without modification
Changes outside the C++ implementation:
Schema.fbs
Message.fbs
amended to support variable buffer counts between string view chunksdatagen.py
extended to produce integration JSON for string view arraysColumnar.rst
amended with a description of the string view formatChanges to the C++ implementation:
std::string_view
as with StringArrayutf8_view(/*has_raw_pointers=*/true)
is supportedwhich uses raw pointer views. This enables zero copy interop with code which uses
raw pointer views.
regular string arrays.