GH-35627: [C++][Format][Integration] Add string view to the arrow format #35628

Closed
wants to merge 38 commits into from

@bkietz (Member) commented May 16, 2023

String view (and the equivalent non-utf8 binary view) is an alternative representation for
variable-length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings, this PR uses a pair of 32-bit integers as a
buffer index and offset, which

  • makes explicit the guarantee that the lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
  • makes the arrays meaningfully serializable and
    venue agnostic; directly usable in shared memory without modification
  • allows easy validation
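
For readers new to the layout, here is a rough sketch of a 16-byte view that uses a buffer index and offset in place of a raw pointer. The names `View`, `MakeView`, `BufferIndex`, and `Offset` are made up for illustration and are not the types added by this PR; the 12-byte inline threshold and the field order (length, prefix, buffer index, offset) follow the format text quoted later in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Illustrative names only; this is not Arrow's StringViewHeader.
constexpr int32_t kInlineSize = 12;

struct View {
  int32_t length;    // byte length of the string
  char payload[12];  // short string: the bytes themselves
                     // long string: 4-byte prefix, int32 buffer index, int32 offset
};
static_assert(sizeof(View) == 16, "each view occupies 16 bytes");

inline bool IsInline(const View& v) { return v.length <= kInlineSize; }

// Build a view of `s`. For a long string the caller asserts that `s` is
// stored at `offset` within data buffer number `buffer_index`.
inline View MakeView(std::string_view s, int32_t buffer_index, int32_t offset) {
  View v{};  // zero-initialize so unused payload bytes are deterministic
  v.length = static_cast<int32_t>(s.size());
  if (IsInline(v)) {
    if (!s.empty()) std::memcpy(v.payload, s.data(), s.size());
  } else {
    std::memcpy(v.payload, s.data(), 4);  // prefix
    std::memcpy(v.payload + 4, &buffer_index, sizeof(buffer_index));
    std::memcpy(v.payload + 8, &offset, sizeof(offset));
  }
  return v;
}

// Read back the buffer index of a long (out-of-line) view.
inline int32_t BufferIndex(const View& v) {
  assert(!IsInline(v));
  int32_t index;
  std::memcpy(&index, v.payload + 4, sizeof(index));
  return index;
}

// Read back the offset of a long (out-of-line) view.
inline int32_t Offset(const View& v) {
  assert(!IsInline(v));
  int32_t offset;
  std::memcpy(&offset, v.payload + 8, sizeof(offset));
  return offset;
}
```

Because the view stores only integers, copying it into another process or a file keeps it meaningful as long as the data buffers travel with it, which is the serializability point made in the list above.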

Changes outside the C++ implementation:

  • New types added to Schema.fbs
  • Message.fbs amended to support variable buffer counts between string view chunks
  • datagen.py extended to produce integration JSON for string view arrays
  • Columnar.rst amended with a description of the string view format

Changes to the C++ implementation:

  • The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
  • The values of string view arrays can be visited as std::string_view as with StringArray
  • String view arrays can be round tripped through IPC, parquet, and integration JSON
  • A variant of the string view type utf8_view(/*has_raw_pointers=*/true) is supported
    which uses raw pointer views. This enables zero copy interop with code which uses
    raw pointer views.
  • Conversions are provided between index/offset view arrays, raw pointer view arrays, and
    regular string arrays.
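
To illustrate how a value might be surfaced as `std::string_view`, the following is a minimal sketch that resolves one index/offset view against a set of data buffers. `View` and `Resolve` are hypothetical names, and the data buffers are modeled as `std::string` purely for convenience; none of this is the API added by the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical 16-byte view: length, then 12 bytes of payload.
struct View {
  int32_t length;
  char payload[12];
};

// Resolve a view to the bytes it denotes.
// Short strings (length <= 12) live in the view itself, so the result
// points into `v` and must not outlive it. Long strings are found via
// (buffer index, offset) in `data_buffers`.
inline std::string_view Resolve(const View& v,
                                const std::vector<std::string>& data_buffers) {
  if (v.length <= 12) {
    return std::string_view(v.payload, static_cast<std::size_t>(v.length));
  }
  int32_t buffer_index;
  int32_t offset;
  std::memcpy(&buffer_index, v.payload + 4, sizeof(buffer_index));
  std::memcpy(&offset, v.payload + 8, sizeof(offset));
  const std::string& buffer = data_buffers[static_cast<std::size_t>(buffer_index)];
  return std::string_view(buffer.data() + offset,
                          static_cast<std::size_t>(v.length));
}
```

The same lookup, stopping at `buffer.data() + offset`, is essentially what a conversion from an index/offset view to a raw pointer view would need to compute for each long string.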

⚠️ GitHub issue #35627 has been automatically assigned in GitHub to PR creator.

@github-actions bot added awaiting changes, Component: Gandiva and removed awaiting committer review labels May 23, 2023
@bkietz bkietz force-pushed the string-view/indices-offsets branch from 5f1ad87 to c7622da Compare May 24, 2023 17:01
@github-actions bot added awaiting change review, awaiting changes and removed awaiting changes, awaiting change review labels May 24, 2023
@westonpace (Member) left a comment

Just some questions from a scan of the text portion. I'll try and look at the rest in more detail later.

@github-actions bot added awaiting review, awaiting changes, Component: R, awaiting change review and removed awaiting change review, awaiting review, awaiting changes labels May 24, 2023
@bkietz bkietz marked this pull request as ready for review May 26, 2023 00:13
@pitrou (Member) commented Jun 19, 2023

Plasma was removed from the repo, so the src/plasma changes shouldn't appear here :-)

/// Long string   |----|----|--------|
///                ^    ^      ^
///                |    |      |
///              size prefix raw pointer to out-of-line portion
Member

Can raw pointers be a separate PR?

Member

Ping @bkietz . Either raw pointers are a separate PR, or at least they should not use the same payload type (StringViewHeader vs. something else).

Note that using raw pointers is a property of the array's DataType, not a per-value property inside the array.

@github-actions bot added awaiting changes and removed awaiting change review labels Jun 20, 2023
|------------|------------|------------|-------------|
| length     | prefix     | buf. index | offset      |

In both the long and short string cases, the first four bytes encode the
Member

Hi, I've a question about this format.
For StructArray[1] or FixedListArray[2], when the parent slot is not valid, the corresponding child slot is left "undefined". If the child's validity bit is set for that slot, could its view point to an undefined address?

@tustvold (Contributor) left a comment

FYI, I started a Rust implementation here; I will leave comments as I encounter things.

One thing this PR doesn't appear to define is the FFI schema mapping, in particular how to encode StringView and BinaryView in the textual schema representation.

stored inline in the prefix, after the length. This prefix enables a
profitable fast path for string comparisons, which are frequently determined
within the first four bytes.
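
A rough sketch of the fast path this passage describes: an equality check that settles most comparisons from the length and the first four bytes, and only dereferences the data buffers when both agree. `View`, `Append`, `OutOfLine`, and `Equals` are hypothetical helpers written for this illustration, not code from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

struct View {
  int32_t length;
  char payload[12];  // inline bytes, or prefix + buffer index + offset
};

// Hypothetical helper: store `s` as a view, spilling long strings to buffers[0].
inline View Append(std::string_view s, std::vector<std::string>& buffers) {
  View v{};
  v.length = static_cast<int32_t>(s.size());
  if (s.size() <= 12) {
    if (!s.empty()) std::memcpy(v.payload, s.data(), s.size());
    return v;
  }
  if (buffers.empty()) buffers.emplace_back();
  const int32_t buffer_index = 0;
  const int32_t offset = static_cast<int32_t>(buffers[0].size());
  buffers[0].append(s.data(), s.size());
  std::memcpy(v.payload, s.data(), 4);
  std::memcpy(v.payload + 4, &buffer_index, sizeof(buffer_index));
  std::memcpy(v.payload + 8, &offset, sizeof(offset));
  return v;
}

// Fetch the out-of-line bytes of a long view.
inline std::string_view OutOfLine(const View& v,
                                  const std::vector<std::string>& buffers) {
  int32_t buffer_index;
  int32_t offset;
  std::memcpy(&buffer_index, v.payload + 4, sizeof(buffer_index));
  std::memcpy(&offset, v.payload + 8, sizeof(offset));
  return std::string_view(
      buffers[static_cast<std::size_t>(buffer_index)].data() + offset,
      static_cast<std::size_t>(v.length));
}

inline bool Equals(const View& a, const View& b,
                   const std::vector<std::string>& buffers) {
  // Fast path: unequal lengths or unequal leading bytes decide the result
  // without touching any data buffer.
  if (a.length != b.length) return false;
  const std::size_t n = static_cast<std::size_t>(a.length);
  if (std::memcmp(a.payload, b.payload, std::min<std::size_t>(n, 4)) != 0) {
    return false;
  }
  // Short strings are fully contained in the views.
  if (n <= 12) return std::memcmp(a.payload, b.payload, n) == 0;
  // Slow path: compare the out-of-line bytes.
  return OutOfLine(a, buffers) == OutOfLine(b, buffers);
}
```

This sketch compares only `min(length, 4)` prefix bytes so that it does not depend on how unused inline bytes are padded.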

Contributor

Suggested change
All views must be well defined, even for null slots. In particular, if the length is greater than 12, the prefix, buffer index, and offset must refer to valid data.

This is a very important property for the Rust implementation to be able to provide safe value access without needing to inspect the null mask. This in turn is important because it allows more sophisticated strategies to handle / iterate the null mask.
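
To make the suggested rule concrete, here is a minimal sketch of a per-view check that could be applied to every slot regardless of the null mask. `View` and `IsWellDefined` are hypothetical names, the data buffers are modeled as `std::string`, and treating a negative length as invalid is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct View {
  int32_t length;
  char payload[12];  // inline bytes, or prefix + buffer index + offset
};

// Returns true if `v` can be dereferenced safely against `data_buffers`.
inline bool IsWellDefined(const View& v,
                          const std::vector<std::string>& data_buffers) {
  if (v.length < 0) return false;
  if (v.length <= 12) return true;  // inline: nothing to dereference

  int32_t buffer_index;
  int32_t offset;
  std::memcpy(&buffer_index, v.payload + 4, sizeof(buffer_index));
  std::memcpy(&offset, v.payload + 8, sizeof(offset));

  // The buffer index must name an existing data buffer.
  if (buffer_index < 0 ||
      static_cast<std::size_t>(buffer_index) >= data_buffers.size()) {
    return false;
  }
  const std::string& buffer = data_buffers[static_cast<std::size_t>(buffer_index)];

  // The viewed range must lie entirely inside that buffer.
  if (offset < 0) return false;
  if (static_cast<int64_t>(offset) + v.length >
      static_cast<int64_t>(buffer.size())) {
    return false;
  }
  // The stored prefix must match the first four out-of-line bytes.
  return std::memcmp(v.payload, buffer.data() + offset, 4) == 0;
}
```

With a guarantee like this holding for all slots, a reader could fetch values without first consulting validity bits, which is the property the comment above asks for.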

Member

Ping @bkietz . This doesn't seem to match the current C++ binary view tests, which state that:

Invalid string views which are masked by a null bit do not cause validation to fail

of potentially several **data** buffers or may contain the characters
inline.

The views buffer contains `length` view structures with the following layout:
Contributor

The endianness of this data structure wasn't immediately apparent to me. I interpreted the view as a single 128-bit integer with native endianness, which I believe is consistent with intervals.

Member

Not really. Each packed struct's field should have its endianness handled independently.
For example, short strings would be:

  • in little-endian mode:
    • length: bytes 0-3 (little-endian int32)
    • data: bytes 4-15 (no endianness)
  • in big-endian mode:
    • length: bytes 0-3 (big-endian int32)
    • data: bytes 4-15 (no endianness)
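
A small sketch of the per-field handling for a short string: only the four length bytes are byte-order sensitive, while the character bytes are read as-is. The function names are invented for this example, and it assumes a raw 16-byte view whose length is at most 12.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode a 32-bit integer stored least-significant byte first.
inline int32_t LoadInt32LE(const unsigned char* p) {
  const uint32_t u = static_cast<uint32_t>(p[0]) |
                     (static_cast<uint32_t>(p[1]) << 8) |
                     (static_cast<uint32_t>(p[2]) << 16) |
                     (static_cast<uint32_t>(p[3]) << 24);
  return static_cast<int32_t>(u);
}

// Decode a 32-bit integer stored most-significant byte first.
inline int32_t LoadInt32BE(const unsigned char* p) {
  const uint32_t u = (static_cast<uint32_t>(p[0]) << 24) |
                     (static_cast<uint32_t>(p[1]) << 16) |
                     (static_cast<uint32_t>(p[2]) << 8) |
                     static_cast<uint32_t>(p[3]);
  return static_cast<int32_t>(u);
}

// Decode a short (inline) string from a raw 16-byte view.
// Bytes 0-3 hold the length in the stream's byte order; bytes 4-15 hold
// the characters, which have no endianness.
inline std::string DecodeShortString(const unsigned char* view,
                                     bool little_endian) {
  const int32_t length = little_endian ? LoadInt32LE(view) : LoadInt32BE(view);
  assert(length >= 0 && length <= 12);
  return std::string(reinterpret_cast<const char*>(view + 4),
                     static_cast<std::size_t>(length));
}
```

Treating the whole view as one 128-bit integer and byte-swapping it would reverse the character bytes as well, which is why the fields are handled one at a time here.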

@pitrou (Member) left a comment

More comments on this. I took the liberty to ping on some previous comments.

Comment on lines +440 to +442
} else if (values.type_id() == ::arrow::Type::BINARY_VIEW ||
values.type_id() == ::arrow::Type::STRING_VIEW) {
::arrow::VisitArraySpanInline<::arrow::BinaryViewType>(
Member

Perhaps define a is_binary_view_like?

Contributor

Refer to this graph to visualize how this would fit among the existing predicates.

https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db
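
One possible shape for such a predicate, sketched against a stand-in enum so the snippet compiles on its own. `sketch::TypeId` is a placeholder for `::arrow::Type::type`, and the function body is an assumption about what the suggestion has in mind, not code from this PR.

```cpp
#include <cassert>

namespace sketch {

// Stand-in for ::arrow::Type::type, reduced to a few members.
enum class TypeId { INT32, BINARY, STRING, BINARY_VIEW, STRING_VIEW };

// True for the two view-based variable-length types.
constexpr bool is_binary_view_like(TypeId id) {
  switch (id) {
    case TypeId::BINARY_VIEW:
    case TypeId::STRING_VIEW:
      return true;
    default:
      return false;
  }
}

static_assert(is_binary_view_like(TypeId::STRING_VIEW), "views match");
static_assert(!is_binary_view_like(TypeId::STRING), "offset strings do not");

}  // namespace sketch
```

With a helper like this, the two-way `type_id()` comparison in the quoted lines would collapse to a single call.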

bkietz added a commit that referenced this pull request Sep 21, 2023
String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which

-   makes explicit the guarantee that lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
-   makes the arrays meaningfully serializable and
    venue agnostic; directly usable in shared memory without modification
-   allows easy validation

This PR is extracted from #35628 to unblock independent PRs now that the vote has passed, including:

-   New types added to Schema.fbs
-   Message.fbs amended to support variable buffer counts between string view chunks
-   datagen.py extended to produce integration JSON for string view arrays
-   Columnar.rst amended with a description of the string view format

* Closes: #35627

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz (Member, Author) commented Sep 21, 2023

Superseded by #37792

@bkietz bkietz closed this Sep 21, 2023
etseidl pushed a commit to etseidl/arrow that referenced this pull request Sep 28, 2023
…apache#37526)

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…apache#37526)

bkietz added a commit that referenced this pull request Oct 26, 2023
### Rationale for this change

After the PR changing the spec and schema (#37526) is accepted, this PR will be undrafted. It adds a minimal C++ implementation and was extracted from the original C++ Utf8View PR (#35628) for ease of review.

### What changes are included in this PR?

- The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
- The values of string view arrays can be visited as `std::string_view` as with StringArray
- String view arrays can be round tripped through IPC and integration JSON

* Closes: #37710

Relevant mailing list discussions: 
* https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
* https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…apache#37526)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#37792)

bkietz pushed a commit that referenced this pull request Nov 14, 2023
…mplementation (#35769)

### Rationale for this change
See #35628 for the rationale and description of the StringView/BinaryView array types.

This change is adding Go as a second implementation of it.

### What changes are included in this PR?

Add Array Types for `StringView` and `BinaryView` along with `StringViewType` and `BinaryViewType` and necessary enums and builders. These arrays can be round tripped through JSON and IPC.

### Are these changes tested?
Yes, unit tests have been added and integration tests run

* Closes: #38718

Lead-authored-by: Matt Topol <[email protected]>
Co-authored-by: Alex Shcherbakov <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…apache#37526)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#37792)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…o Go implementation (apache#35769)

Development

Successfully merging this pull request may close these issues.

[Format] Add string view type to the arrow format
9 participants