-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35627: [Format][Integration] Add string-view to arrow format #37526
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
docs/source/format/Columnar.rst
Outdated
|
||
Views must be aligned to an 8-byte boundary. This restriction enables more | ||
efficient interoperation with systems where the index and offset are replaced | ||
by a raw pointer. All integers (length, buffer index, and offset) are unsigned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let's please not make a special case for binary-view here. We use signed offsets and lengths everywhere else.
by a raw pointer. All integers (length, buffer index, and offset) are unsigned | |
by a raw pointer. All integers (length, buffer index, and offset) are signed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should discuss this on the ML, I'll raise this there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the standard seems to have adopted signed integers for sizes in every other case. It seems very strange to impose that only for new types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a requirement to be imposed on sizes in all cases going forward. It's only relevant to Utf8View because existing implementations of string view already use unsigned integers here, so the concern is whether to keep compatibility with those or maintain the convention of the arrow format. If you'd like to +1 signed integers, please say so on the ML!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the mailing list, I've updated this to signed integers.
933b5d6
to
7084f71
Compare
7084f71
to
b01b801
Compare
b01b801
to
4a24b36
Compare
+1, thanks all! |
🎉 |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 9d6d501. There were 4 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
### Rationale for this change After the PR changing the spec and schema ( #37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( #35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: #37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…pache#37792) ### Rationale for this change After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: apache#37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…apache#37526) String view (and equivalent non-utf8 binary view) is an alternative representation for variable length strings which offers greater efficiency for several common operations. This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a buffer index and offset, which - makes explicit the guarantee that lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary - makes the arrays meaningfully serializable and venue agnostic; directly usable in shared memory without modification - allows easy validation This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including: - New types added to Schema.fbs - Message.fbs amended to support variable buffer counts between string view chunks - datagen.py extended to produce integration JSON for string view arrays - Columnar.rst amended with a description of the string view format * Closes: apache#35627 Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…pache#37792) ### Rationale for this change After the PR changing the spec and schema ( apache#37526 ) is accepted, this PR will be undrafted. It adds the minimal addition of a C++ implementation and was extracted from the original C++ Utf8View pr ( apache#35628 ) for ease of review. ### What changes are included in this PR? - The new types are available with new subclasses of DataType, Array, ArrayBuilder, ... - The values of string view arrays can be visited as `std::string_view` as with StringArray - String view arrays can be round tripped through IPC and integration JSON * Closes: apache#37710 Relevant mailing list discussions: * https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy * https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k Authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which
to that of the array which views it, which is critical for confident
consumption across an interface boundary
venue agnostic; directly usable in shared memory without modification
This PR is extracted from #35628 to unblock independent PRs now that the vote has passed, including: