Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Implement StringViewArray and BinaryViewArray #5374

Closed
28 of 31 tasks
alamb opened this issue Feb 8, 2024 · 17 comments
Closed
28 of 31 tasks

[EPIC] Implement StringViewArray and BinaryViewArray #5374

alamb opened this issue Feb 8, 2024 · 17 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Feb 8, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently two new types were added to the Arrow format that make it more suitable for certain types of operations on strings

Specifically when doing filtering / take with string data, creating a new Utf8Array requires copying the strings to a new, packed binary buffer. The "VariableSizeBinaryView" was designed to solve this limitation and recently added to the Arrow spec.

Describe the solution you'd like
I would like to implement StringViewArray and BinaryViewArray following the spec:
The spec: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
https://github.com/apache/arrow/blob/3fe598ae4dfd7805ab05452dd5ed4b0d6c97d8d5/format/Schema.fbs#L187-L205

Initially, I would suggest we get the basic types in place:

Then as follow on PRs, add support for key features

Potential Follow ons / additional optimizations

Describe alternatives you've considered
I think a good plan would be to dust off the prototype on #4585 from @tustvold (linked from #4253).

Initially, the idea would be to dust off the PR and split it into a few smaller PRs with tests and docs.

Additional context
Polars implemented it recently in rust so that can serve as a motivation
Blog Post https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790

Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/

Related PRs:
pola-rs/polars#13748
pola-rs/polars#13839
pola-rs/polars#13489

@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2024

@tustvold pointed out the previous PR he made was #4585

We can probably dust that off and get it ready to merge as part of this project

@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2024

@kallisti-dev I believe may do some work on this.

I suggested breaking it down into a sequence of PRs (keeping notes of what is not yet implemented along the way)

Specifically, I suggest the first PR should have:

  1. The new DataType
  2. The new StringViewArray and impl Array
  3. A basic constructor for creating and validating StringViewArray
  4. Any feature (e.g. IPC support, parquet support, etc) should return a Arrow::NotYetImplemented error (rather than panic)

That will likely be a sizeable PR on its own.

@alamb
Copy link
Contributor Author

alamb commented Feb 21, 2024

@alamb
Copy link
Contributor Author

alamb commented Feb 28, 2024

Moving an update from @sundy-li and @ariesdevil from #4585 (comment) to here for visibility

I believe @ariesdevil is considering working on this relatively shortly

Hi @alamb, I'm willing to work on this feature after datafuselabs/databend#14662 is finished.

@XiangpengHao, who will be working with us at InfluxData this summer (starting in May) is also interested in this project, so when we get closer to May we'll engage more fully and perhaps we can figure out how to split up some of the work.

@ariesdevil
Copy link
Contributor

Moving an update from @sundy-li and @ariesdevil from #4585 (comment) to here for visibility↳

I believe @ariesdevil is considering working on this relatively shortly↳

Hi @alamb, I'm willing to work on this feature after datafuselabs/databend#14662 is finished.↳

@XiangpengHao, who will be working with us at InfluxData this summer (starting in May) is also interested in this project, so when we get closer to May we'll engage more fully and perhaps we can figure out how to split up some of the work.

Do we have a group similar to Discord or Slack?

@alamb
Copy link
Contributor Author

alamb commented Feb 28, 2024

Do we have a group similar to Discord or Slack?

@ariesdevil There is both discord and slack -- links are here https://github.com/apache/arrow-rs?tab=readme-ov-file#arrow-rust-community

The discord server doesn't require an invite and seems to be a bit more active recently

@alamb
Copy link
Contributor Author

alamb commented Mar 4, 2024

@ariesdevil says they have time to begin work on this so I will begin filing subtasks

Update: I filed these two tasks to get us started.

@alamb
Copy link
Contributor Author

alamb commented Mar 13, 2024

Update here is that @ariesdevil has a PR #5481 that we plan to merge tomorrow with the initial array implementations and we'll continue iterating from there

Once that is merged, I'll plan to write some more tickets up

@alamb
Copy link
Contributor Author

alamb commented Mar 18, 2024

I filed a bunch of follow on tickets, in case anyone cares

#5506
#5507
#5508
#5509
#5510
#5511
#5513

@cgbur
Copy link

cgbur commented Mar 18, 2024

Is there an issue to make sure that parquet support for the new arrow view types are supported. Currently can not use with the parquet writer when they are specified in the schema.

I am actually more familiar with Arrow than I am Parquet so excuse the question: is there an official Parquet spec or implementation plan for the view types?

@ariesdevil
Copy link
Contributor

I'm working on read/write parquet for view types.

@alamb
Copy link
Contributor Author

alamb commented Jun 17, 2024

Update here: thanks to @ariesdevil and @XiangpengHao and @RinChanNOWWW we have pretty good basic support for StringView in arrow-rs -- we have basic construction, cast and soon comparison functionality complete. Also, we can now read directly from parquet into StringView arrays #5530

We have also begun the work to integrate this into DataFusion, which is tracked in apache/datafusion#10918

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2024

Update here is that most of the basic StringView work for basic scanning / filtering is done. There will likely be an large tail of places to implement special StringView support as follw on.

Over the next week or two I plan to file a new ticket for the remaining known work and close this ticket down

@alamb
Copy link
Contributor Author

alamb commented Jul 22, 2024

I think we are close to finishing enough of this ticket to claim initial support is done -- I expect we can claim initial support is complete in the 52.2.0 release #5998, due to be released in a few weeks time. Once that is done, I'll open a epic to track remaining / additional work

@alamb
Copy link
Contributor Author

alamb commented Jul 31, 2024

Given we have now released all the basic support in arrow 52.0.0 am going to claim we are done with the initial implementation and close this ticket 🎉

Huge kudos to everyone who helped including @RinChanNOWWW @ariesdevil @tustvold @XiangpengHao and @a10y (sorry if I forgot anyone)

I am collecting additional potential work items in #6163

@alamb alamb closed this as completed Jul 31, 2024
kou pushed a commit to apache/arrow that referenced this issue Sep 20, 2024
…#44175)

### Rationale for this change

BinaryView and Utf8View are now supported in arrow-rs [as of the 52.0.0 release](apache/arrow-rs#5374 (comment)).

### What changes are included in this PR?

Add two checkmarks for the Data Type support.

### Are these changes tested?

N/A

### Are there any user-facing changes?

N/A

Authored-by: wiedld <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants