feat: Support for Binaryview and StringView types #367
NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO
NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO,
NANOARROW_TYPE_BINARY_VIEW,
NANOARROW_TYPE_STRING_VIEW,
Is this ArrowType enum meant to be stable? (i.e. can I only add new types at the end, or can I put them at a more logical place in the list?)
I think it's probably best to put it at the end to keep the existing values (we can always rearrange them in places where they're more likely to be seen).
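A minimal sketch of why appending is the safe choice, using two hypothetical, shortened enums (`ExampleTypeV1` and `ExampleTypeV2` are stand-ins, not the real nanoarrow type list). Because C assigns enumerator values in declaration order, adding members at the end leaves every existing value unchanged, whereas inserting in the middle would shift everything after the insertion point.

```c
#include <assert.h>

/* Hypothetical "before" enum: a shortened stand-in for an existing type list. */
enum ExampleTypeV1 {
  EXAMPLE_V1_TYPE_UNINITIALIZED = 0,
  EXAMPLE_V1_TYPE_INT32,
  EXAMPLE_V1_TYPE_STRING
};

/* Hypothetical "after" enum: the new members are appended at the end. */
enum ExampleTypeV2 {
  EXAMPLE_V2_TYPE_UNINITIALIZED = 0,
  EXAMPLE_V2_TYPE_INT32,
  EXAMPLE_V2_TYPE_STRING,
  EXAMPLE_V2_TYPE_BINARY_VIEW,
  EXAMPLE_V2_TYPE_STRING_VIEW
};

/* Existing members keep their numeric values, so code compiled against the
 * old header still agrees with the new one about what each value means. */
static_assert((int)EXAMPLE_V1_TYPE_INT32 == (int)EXAMPLE_V2_TYPE_INT32,
              "appending must not renumber existing members");
static_assert((int)EXAMPLE_V1_TYPE_STRING == (int)EXAMPLE_V2_TYPE_STRING,
              "appending must not renumber existing members");

/* Returns 1 for the two appended view types, 0 otherwise. */
int ExampleIsViewType(enum ExampleTypeV2 type) {
  return type == EXAMPLE_V2_TYPE_BINARY_VIEW || type == EXAMPLE_V2_TYPE_STRING_VIEW;
}
```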
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #367 +/- ##
==========================================
- Coverage 88.38% 88.02% -0.36%
==========================================
Files 75 75
Lines 12677 12731 +54
==========================================
+ Hits 11205 11207 +2
- Misses 1472 1524 +52
@paleolimbot is the reason that the Python binding for the ArrayView currently checks the buffer types dynamically to determine the number of buffers so that this works on an uninitialized ArrayView (just based on the layout, without an Array already being created)? See here: arrow-nanoarrow/python/src/nanoarrow/_lib.pyx, lines 685 to 690 in b3c952a
i.e. why not just return Because if we don't consider the variadic buffers as part of the ArrayView's buffers, then accessing like the above in Python doesn't work for those |
I've tried to remove all references to
That's true. We could rewrite |
if (array_view->storage_type == NANOARROW_TYPE_STRING_VIEW) {
array_view->n_variadic_buffers = array->n_buffers - 3;
array_view->variadic_buffer_sizes = array->buffers[array->n_buffers - 1];
// array_view->variadic_buffers = ...
// array_view->variadic_buffers = ...
array_view->variadic_buffers = array->buffers + 2;
...I think?
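A standalone sketch of the buffer arithmetic discussed above, assuming the layout the snippet implies: buffer 0 is validity, buffer 1 holds the views, the last buffer holds the int64 sizes of the variadic data buffers, and everything in between is variadic data. `ExampleVariadicInfo` and `ExampleSplitViewBuffers` are hypothetical names for illustration, not nanoarrow API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the variadic part of a string/binary view array. */
struct ExampleVariadicInfo {
  int64_t n_variadic_buffers;
  const void** variadic_buffers;        /* points into the original buffers array */
  const int64_t* variadic_buffer_sizes; /* one int64 size per variadic buffer */
};

/* Splits a view-type buffers array into its variadic parts.
 * Returns 0 on success and -1 if there are too few buffers. */
int ExampleSplitViewBuffers(const void** buffers, int64_t n_buffers,
                            struct ExampleVariadicInfo* out) {
  assert(out != NULL);
  /* Validity + views + sizes are always present, so 3 is the minimum. */
  if (buffers == NULL || n_buffers < 3) {
    return -1;
  }

  out->n_variadic_buffers = n_buffers - 3;
  /* Variadic data buffers start right after validity (0) and views (1). */
  out->variadic_buffers = buffers + 2;
  /* The final buffer stores the size of each variadic data buffer. */
  out->variadic_buffer_sizes = (const int64_t*)buffers[n_buffers - 1];
  return 0;
}
```

With four buffers (validity, views, one data buffer, sizes), this yields exactly one variadic buffer, which matches the `n_buffers - 3` and `buffers + 2` expressions in the suggestion.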
Sorry for hijacking this thread, but I've become aware of the introduction of those new Arrow data types because GDAL compilation broke against Arrow 15.0 (due to a switch() not handling the new cases). This will be addressed in OSGeo/gdal#9116 in a minimalistic way by erroring out on those types. Do you know if those types will actually be found in serialized formats, namely Parquet and Feather? And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()... Their value proposition compared to regular string or binary is unclear to me. The only thing I see is that they might be a way to reduce memory usage by pointing to the same offset in case of duplicated strings, but wasn't that the purpose of dictionaries? (Actually, the same question would hold for the RUN_END_ENCODED stuff added in libarrow 12.) As a relative newcomer to the Arrow ecosystem, I should point out that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations.
I don't believe it's possible to get a stringview or binaryview from Parquet; however, in theory you could get one from Feather or Arrow IPC today with Arrow 15. I think that until pyarrow implements a basic level of support it's unlikely to actually end up being used. Some day the C++ Parquet scanner might be able to return those types but there are very few people working on that part of the code base and I think it's unlikely to be implemented any time soon.
I think the gist of it is that regular string and binary are slow to sort, which is why Meta's Velox, DuckDB, and (very recently) Polars have adopted string views as their primary representation. Arrow added it primarily for interchange with those systems (e.g., a user-defined function based on the C Data interface), although I agree that it came at the expense of unnecessary complexity for 99.9% of Arrow users.
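To make the sorting argument concrete, here is a rough sketch of the 16-byte view element as the Arrow columnar format describes it: a 4-byte length, then either up to 12 inlined bytes (zero-padded) or a 4-byte prefix plus a buffer index and offset. The names `ExampleBinaryView`, `ExampleMakeView`, and `ExampleViewsDefinitelyDiffer` are hypothetical, and the quick comparison is only an illustration of the idea, not how any particular engine implements it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of one 16-byte view element. */
union ExampleBinaryView {
  struct {
    int32_t size;
    uint8_t data[12]; /* whole value, zero-padded, when size <= 12 */
  } inlined;
  struct {
    int32_t size;
    uint8_t prefix[4];    /* first four bytes of the value */
    int32_t buffer_index; /* which variadic data buffer holds the value */
    int32_t offset;       /* byte offset within that buffer */
  } ref;
};

static_assert(sizeof(union ExampleBinaryView) == 16, "a view is 16 bytes");

/* Builds a view for a NUL-terminated string. For long strings the caller
 * says where the bytes live; this sketch does not copy them anywhere. */
union ExampleBinaryView ExampleMakeView(const char* s, int32_t buffer_index,
                                        int32_t offset) {
  union ExampleBinaryView view;
  size_t len = strlen(s);
  memset(&view, 0, sizeof(view)); /* guarantees zero padding */
  view.inlined.size = (int32_t)len;
  if (len <= 12) {
    memcpy(view.inlined.data, s, len);
  } else {
    memcpy(view.ref.prefix, s, 4);
    view.ref.buffer_index = buffer_index;
    view.ref.offset = offset;
  }
  return view;
}

/* Length and the first four bytes occupy the first 8 bytes in both layouts,
 * so a mismatch there proves inequality without touching any data buffer. */
int ExampleViewsDefinitelyDiffer(const union ExampleBinaryView* a,
                                 const union ExampleBinaryView* b) {
  return memcmp(a, b, 8) != 0;
}
```

Many comparisons can be settled by this fixed-size check alone, whereas the offsets-based string layout always needs a lookup into the data buffer.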
I agree. In theory this is what nanoarrow is designed to help with (when the types are supported...so not yet). The focus of the version about to be released is testing and stability...0.5.0 is more likely to include features that could be used in a fallback sort of way (i.e., maybe
In Arrow C++, this is most likely to be supported via an option in the readers. For Parquet In Python, the
For now I think you will be hard-pressed to find a producer that actually produces REE or View types (ListView is also on the horizon if it's not already implemented in Arrow C++). It's on nanoarrow's roadmap to support all of them (but Python bindings are on the roadmap first, which is no small feat!)...perhaps it can help! |
OK, understood! I see the C doc comment actually mentions this: "use it to represent a hypothetical ArrowArray that does not exist yet, or use it to validate the buffers of a future ArrowArray.". |
FYI, note that Python Polars can create Arrow files with the Utf8View type written to them by using the experimental option:
>>> import polars as pl
>>> import pyarrow.feather as feather
>>> pl.DataFrame({"a": ["foo"]}).write_ipc("test.arrow", future=True)
>>> feather.read_table("test.arrow")
pyarrow.Table
a: string_view
----
a: [["foo"]]
Since Rust Polars started using the Utf8View type, downstream Rust projects will use it.
Nice! We're currently focused on a few other short-term nanoarrow things (e.g., Python bindings) but Joris did all the hard work here and we should be able to get this merged in the next few months with support in R and Python. |
Hey, the R polars package's Series/DataFrame to/from nanoarrow_array_stream conversion has been rewritten to use only the C Stream interface (pola-rs/r-polars#1075, pola-rs/r-polars#1076, pola-rs/r-polars#1078).
polars::pl$Series(values = "foo") |>
nanoarrow::as_nanoarrow_array_stream(future = TRUE) |>
polars::as_polars_series()
#> polars Series: shape: (1,)
#> Series: '' [str]
#> [
#> "foo"
#> ]
Created on 2024-05-06 with reprex v2.1.0
@eitsupi Sorry I missed this comment!
nanoarrow (for R or Python) can definitely transport any type (even invalid ones!), even though conversion to/from R won't work. I think you might also run into trouble printing these (although I believe I have some tests for this...you'd probably be able to see the format and buffer addresses). |
Hi @jorisvandenbossche @paleolimbot We are trying to support interop between |
I'll take a stab at this in the next few days to see what is required beyond this PR. You probably just need a "build by buffer" and "consume by buffer" level of support (as opposed to building by element or consuming by element, which is harder). |
See also for more reading: |
Closing since this was implemented in #596 🚀 |
Very much a draft PR; I'm just putting what I experimented with here publicly.
This is currently a minimal addition that just allows inspecting an array / type: