Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Format] Add string view type to the arrow format #35627

Closed
bkietz opened this issue May 16, 2023 · 1 comment · Fixed by #37526
Closed

[Format] Add string view type to the arrow format #35627

bkietz opened this issue May 16, 2023 · 1 comment · Fixed by #37526

Comments

@bkietz
Copy link
Member

bkietz commented May 16, 2023

Describe the enhancement requested

Adding an issue to GH to track adding string view to the arrow format, as discussed on this ML thread https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

Current ML discussion at https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt

Component(s)

C++, Format, Integration

@bkietz
Copy link
Member Author

bkietz commented Sep 11, 2023

@kou kou changed the title Add string view type to the arrow format [Format] Add string view type to the arrow format Sep 12, 2023
bkietz added a commit that referenced this issue Sep 21, 2023
String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which

-   makes explicit the guarantee that lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
-   makes the arrays meaningfully serializable and
    venue agnostic; directly usable in shared memory without modification
-   allows easy validation

This PR is extracted from #35628 to unblock independent PRs now that the vote has passed, including:

-   New types added to Schema.fbs
-   Message.fbs amended to support variable buffer counts between string view chunks
-   datagen.py extended to produce integration JSON for string view arrays
-   Columnar.rst amended with a description of the string view format

* Closes: #35627

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz bkietz added this to the 14.0.0 milestone Sep 21, 2023
etseidl pushed a commit to etseidl/arrow that referenced this issue Sep 28, 2023
…apache#37526)

String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which

-   makes explicit the guarantee that lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
-   makes the arrays meaningfully serializable and
    venue agnostic; directly usable in shared memory without modification
-   allows easy validation

This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including:

-   New types added to Schema.fbs
-   Message.fbs amended to support variable buffer counts between string view chunks
-   datagen.py extended to produce integration JSON for string view arrays
-   Columnar.rst amended with a description of the string view format

* Closes: apache#35627

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…apache#37526)

String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which

-   makes explicit the guarantee that lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
-   makes the arrays meaningfully serializable and
    venue agnostic; directly usable in shared memory without modification
-   allows easy validation

This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including:

-   New types added to Schema.fbs
-   Message.fbs amended to support variable buffer counts between string view chunks
-   datagen.py extended to produce integration JSON for string view arrays
-   Columnar.rst amended with a description of the string view format

* Closes: apache#35627

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…apache#37526)

String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which

-   makes explicit the guarantee that lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
-   makes the arrays meaningfully serializable and
    venue agnostic; directly usable in shared memory without modification
-   allows easy validation

This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including:

-   New types added to Schema.fbs
-   Message.fbs amended to support variable buffer counts between string view chunks
-   datagen.py extended to produce integration JSON for string view arrays
-   Columnar.rst amended with a description of the string view format

* Closes: apache#35627

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…apache#37526)

String view (and equivalent non-utf8 binary view) is an alternative representation for
variable length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use
a raw pointer to out-of-line strings this PR uses a pair of 32 bit integers as a
buffer index and offset, which

-   makes explicit the guarantee that lifetime of all character data is equal
    to that of the array which views it, which is critical for confident
    consumption across an interface boundary
-   makes the arrays meaningfully serializable and
    venue agnostic; directly usable in shared memory without modification
-   allows easy validation

This PR is extracted from apache#35628 to unblock independent PRs now that the vote has passed, including:

-   New types added to Schema.fbs
-   Message.fbs amended to support variable buffer counts between string view chunks
-   datagen.py extended to produce integration JSON for string view arrays
-   Columnar.rst amended with a description of the string view format

* Closes: apache#35627

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment