GH-35627: [Format][Integration] Add string-view to arrow format #37526

Merged: 1 commit merged into apache:main from the 35627-string-view-format-only branch on Sep 21, 2023

@bkietz (Member) commented Sep 1, 2023

String view (and the equivalent non-UTF-8 binary view) is an alternative representation for
variable-length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings, this PR uses a pair of 32-bit integers as a
buffer index and offset, which

  • makes explicit the guarantee that the lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
  • makes the arrays meaningfully serializable and
    venue-agnostic; directly usable in shared memory without modification
  • allows easy validation
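
To make the buffer index and offset pair concrete, here is a minimal Python sketch of one 16-byte view. It assumes the layout as I understand it from the Arrow columnar format: a 32-bit length, then either up to 12 bytes of inline data or a 4-byte prefix followed by a 32-bit buffer index and a 32-bit offset. The helper names `make_view` and `read_view` are hypothetical and are not part of any Arrow library.

```python
import struct

INLINE_MAX = 12  # assumed threshold: values up to 12 bytes live inside the view


def make_view(value: bytes, buffer_index: int, offset: int) -> bytes:
    """Encode one 16-byte view (hypothetical helper, little-endian)."""
    n = len(value)
    if n <= INLINE_MAX:
        # length + inline bytes; "12s" zero-pads short values to 12 bytes
        return struct.pack("<i12s", n, value)
    # length + 4-byte prefix + buffer index + offset into that buffer
    return struct.pack("<i4sii", n, value[:4], buffer_index, offset)


def read_view(view: bytes, buffers: list) -> bytes:
    """Resolve a view back to its bytes using the array's data buffers."""
    (n,) = struct.unpack_from("<i", view, 0)
    if n <= INLINE_MAX:
        return view[4:4 + n]
    buffer_index, offset = struct.unpack_from("<ii", view, 8)
    return buffers[buffer_index][offset:offset + n]


buffers = [b"hello, string view world"]
short_view = make_view(b"short", 0, 0)
long_view = make_view(buffers[0][7:], 0, 7)
```

Because the out-of-line case names a buffer by index rather than by address, the same 16 bytes stay meaningful after the array is copied, serialized, or mapped into another process.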

This PR is extracted from #35628 to unblock independent PRs now that the vote has passed. It includes:

  • New types added to Schema.fbs
  • Message.fbs amended to support variable buffer counts between string view chunks
  • datagen.py extended to produce integration JSON for string view arrays
  • Columnar.rst amended with a description of the string view format
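
The Message.fbs change matters because a string view column no longer has a fixed number of buffers. The sketch below shows, under stated assumptions, how a reader might use per-column counts to carve a flat buffer list back into columns. It assumes each view column contributes a validity buffer, a views buffer, and then a variable number of data buffers, and that the counts arrive as one integer per view column (I believe the field is called `variadicBufferCounts`, but treat that name as an assumption). `split_buffers` and `FIXED_BUFFER_COUNTS` are hypothetical names.

```python
# Assumed buffer counts for a few fixed-layout types (illustrative only).
FIXED_BUFFER_COUNTS = {"int32": 2, "utf8": 3}


def split_buffers(flat_buffers, column_kinds, variadic_counts):
    """Group a flat buffer list into per-column lists (hypothetical helper)."""
    counts = iter(variadic_counts)
    columns = []
    pos = 0
    for kind in column_kinds:
        if kind == "view":
            # validity + views + however many data buffers this chunk carries
            n = 2 + next(counts)
        else:
            n = FIXED_BUFFER_COUNTS[kind]
        columns.append(flat_buffers[pos:pos + n])
        pos += n
    if pos != len(flat_buffers):
        raise ValueError("buffer count mismatch")
    return columns


flat = ["v0", "d0", "val1", "views1", "data1a", "data1b", "val2", "views2"]
cols = split_buffers(flat, ["int32", "view", "view"], [2, 0])
```

Without the counts, a reader could not tell where one view column's data buffers end and the next column's validity buffer begins.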

@zeroshade (Member) left a comment

LGTM

@github-actions bot added the "awaiting merge" label and removed the "awaiting committer review" label on Sep 5, 2023

Views must be aligned to an 8-byte boundary. This restriction enables more
efficient interoperation with systems where the index and offset are replaced
by a raw pointer. All integers (length, buffer index, and offset) are unsigned
Member

Let's please not make a special case for binary-view here. We use signed offsets and lengths everywhere else.

Suggested change
- by a raw pointer. All integers (length, buffer index, and offset) are unsigned
+ by a raw pointer. All integers (length, buffer index, and offset) are signed

Member Author

We should discuss this on the ML, I'll raise this there

Member

I agree that the standard seems to have adopted signed integers for sizes in every other case. It seems very strange to impose that only for new types?

Member Author

It's not a requirement to be imposed on sizes in all cases going forward. It's only relevant to Utf8View because existing implementations of string view already use unsigned integers here, so the concern is whether to keep compatibility with those or maintain the convention of the arrow format. If you'd like to +1 signed integers, please say so on the ML!

Member Author

Per the mailing list, I've updated this to signed integers.
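
With signed 32-bit integers, the "easy validation" mentioned in the PR description amounts to a few range checks per view. The following is a minimal sketch, not the validation any Arrow implementation actually performs. It assumes a 16-byte little-endian view with a 32-bit length, inline data up to 12 bytes, and otherwise a 4-byte prefix, buffer index, and offset. `validate_view` is a hypothetical name.

```python
import struct


def validate_view(view: bytes, buffers: list) -> None:
    """Raise ValueError if a single view is malformed (hypothetical helper)."""
    if len(view) != 16:
        raise ValueError("a view must be exactly 16 bytes")
    (length,) = struct.unpack_from("<i", view, 0)  # signed, so negatives are detectable
    if length < 0:
        raise ValueError("negative length")
    if length <= 12:
        return  # inline value: nothing refers outside the view itself
    prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    if not 0 <= buffer_index < len(buffers):
        raise ValueError("buffer index out of range")
    data = buffers[buffer_index]
    if offset < 0 or offset + length > len(data):
        raise ValueError("view extends past the end of its buffer")
    # Assumed rule: the stored prefix repeats the first 4 bytes of the value.
    if data[offset:offset + 4] != prefix:
        raise ValueError("prefix does not match the referenced data")


buffers = [b"hello, string view world"]
good = struct.pack("<i4sii", 17, b"stri", 0, 7)
validate_view(good, buffers)  # passes silently
```

A raw pointer offers no comparable check: a consumer cannot tell from the pointer alone whether it lands inside memory owned by the array.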

@bkietz force-pushed the 35627-string-view-format-only branch from 933b5d6 to 7084f71 on September 19, 2023 17:39
@github-actions bot added the "awaiting changes" label and removed the "awaiting merge" label on Sep 19, 2023
@bkietz force-pushed the 35627-string-view-format-only branch from 7084f71 to b01b801 on September 20, 2023 15:14
@github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Sep 20, 2023
bkietz commented Sep 20, 2023

CI failures 1, 2 are unrelated, see #37803

@bkietz force-pushed the 35627-string-view-format-only branch from b01b801 to 4a24b36 on September 21, 2023 00:10
@github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Sep 21, 2023
@github-actions bot added the "awaiting merge" label and removed the "awaiting changes" label on Sep 21, 2023
bkietz commented Sep 21, 2023

+1, thanks all!

@bkietz merged commit 9d6d501 into apache:main on Sep 21, 2023
7 of 9 checks passed
@bkietz removed the "awaiting merge" label on Sep 21, 2023
@bkietz deleted the 35627-string-view-format-only branch on September 21, 2023 12:01
alamb commented Sep 21, 2023

🎉

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 9d6d501.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

etseidl pushed a commit to etseidl/arrow that referenced this pull request Sep 28, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
bkietz added a commit that referenced this pull request Oct 26, 2023
### Rationale for this change

After the PR changing the spec and schema (#37526) is accepted, this PR will be undrafted. It adds a minimal C++ implementation and was extracted from the original C++ Utf8View PR (#35628) for ease of review.

### What changes are included in this PR?

- The new types are available as new subclasses of DataType, Array, ArrayBuilder, ...
- The values of string view arrays can be visited as `std::string_view`, as with StringArray
- String view arrays can be round-tripped through IPC and integration JSON

* Closes: #37710

Relevant mailing list discussions: 
* https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
* https://lists.apache.org/thread/3qhkomvvc69v3gkotbwldyko7yk9cs9k

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024

Successfully merging this pull request may close these issues.

[Format] Add string view type to the arrow format
9 participants