Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support for Binaryview and StringView types #367

Closed

Conversation

jorisvandenbossche
Copy link
Member

Very draft PR, just putting here publicly what I experimented.

This is currently a minimal addition that just allows to inspect an array / type:

In [1]: builder = pa.lib.StringViewBuilder()
   ...: builder.append("test")
   ...: builder.append("very long string that is not inlined")
   ...: builder.append(None)
   ...: builder.append("test")
   ...: arr = builder.finish()

In [2]: import nanoarrow as na

In [3]: na.c_schema(arr.type)
Out[3]: 
<nanoarrow.c_lib.CSchema string_view>
- format: 'vu'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:

In [4]: na.c_schema_view(arr.type)
Out[4]: 
<nanoarrow.c_lib.CSchemaView>
- type: 'string_view'
- storage_type: 'string_view'

In [5]: na.c_array(arr)
Out[5]: 
<nanoarrow.c_lib.CArray string_view>
- length: 4
- offset: 0
- null_count: 1
- buffers: (140250311598080, 140250311598144, 140250311598208, 94079215855408)
- dictionary: NULL
- children[0]:

In [4]: na.c_array_view(arr)
Out[4]: 
<nanoarrow.c_lib.CArrayView>
- storage_type: 'string_view'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - <bool validity[1 b] 11010000>
  - <string_view data[0 b] b''>
- dictionary: NULL
- children[0]:

NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO
NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO,
NANOARROW_TYPE_BINARY_VIEW,
NANOARROW_TYPE_STRING_VIEW,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ArrowType enum meant to be stable? (i.e. can I only add new types at the end, or can I put them at a more logical place in the list?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably best to put it at the end to keep the existing values (we can always rearrange them in places where they're more likely to be seen).

@codecov-commenter
Copy link

Codecov Report

Attention: 26 lines in your changes are missing coverage. Please review.

Comparison is base (b3c952a) 88.38% compared to head (b3a9894) 88.02%.

Files Patch % Lines
src/nanoarrow/schema.c 0.00% 19 Missing ⚠️
src/nanoarrow/nanoarrow_types.h 0.00% 4 Missing ⚠️
src/nanoarrow/array.c 25.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #367      +/-   ##
==========================================
- Coverage   88.38%   88.02%   -0.36%     
==========================================
  Files          75       75              
  Lines       12677    12731      +54     
==========================================
+ Hits        11205    11207       +2     
- Misses       1472     1524      +52     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jorisvandenbossche
Copy link
Member Author

@paleolimbot the reason that the Python binding for the ArrayView currently checks the buffer types dynamically to determine the number of buffers, is that so this works on an uninitialized ArrayView (just based on the layout, without an Array already being created)? See here:

@property
def n_buffers(self):
for i in range(3):
if self._ptr.layout.buffer_type[i] == NANOARROW_BUFFER_TYPE_NONE:
return i
return 3

i.e. why not just return self._ptr.array.n_buffers here?

Because if we don't consider the variadic buffers as part of the ArrayView's buffers, then accessing like the above in Python doesn't work for those

@paleolimbot
Copy link
Member

why not just return self._ptr.array.n_buffers here?

I've tried to remove all references to ArrayView.array from nanoarrow (C, Python, R, and otherwise), although I haven't removed the member from the ArrayView yet because I think some previous code relies on it. Basically, the ArrowArrayView doesn't care how the memory it's pointing to got there (i.e., there's no requirement that an ArrowArray ever existed). This is mostly helpful for reading IPC: the decoder first fills in an ArrowArrayView and only allocates the ArrowArray if requested.

Because if we don't consider the variadic buffers as part of the ArrayView's buffers, then accessing like the above in Python doesn't work for those

That's true. We could rewrite n_buffers() to += _ptr.n_variadic_buffers and rewrite buffer() to create a buffer view from the variaidic buffers if i is large enough (or error, since accessing variadic buffers directly isn't strictly needed to decode the content).

if (array_view->storage_type == NANOARROW_TYPE_STRING_VIEW) {
array_view->n_varidic_buffers = array->n_buffers - 3;
array_view->variadic_buffer_sizes = array->buffers[array->n_buffers - 1];
// array_view->variadic_buffers = ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// array_view->variadic_buffers = ...
array_view->variadic_buffers = array-> buffers + 2

...I think?

@rouault
Copy link

rouault commented Jan 21, 2024

Sorry for hickjacking this thread but I've become aware of the introductions of those new arrow data types because GDAL compilation broke against arrow 15.0 (due to a switch() not handling the new cases). Will be addressed in OSGeo/gdal#9116 in a minimalistic way by erroring out on those types. Do you know if those types will be actually found in serialized formats, namely Parquet and Feather ? And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()... Their value proposition compared to regular string or binary is unclear to me. The only thing I see is that they might be a way to reduce memory usage by pointing to the same offset in case of duplicated strings? but wasn't that the purpose of dictionaries? (actually the same question would hold for the RUN_END_ENCODED stuff added in libarrow 12). As a relatively new comer to the Arrow ecosystem, I should point that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations.
Perhaps there should be some "data type negociation" mechanism where a consumer could tell to the producer something like "I just understand Int32, Float64, String. Do your best to present me only those data types by possibly morphing content to that"

@paleolimbot
Copy link
Member

Do you know if those types will be actually found in serialized formats, namely Parquet and Feather ?

I don't believe it's possible to get a stringview or binaryview from Parquet; however, in theory you could get one from Feather or Arrow IPC today with Arrow 15. I think that until pyarrow implements a basic level of support it's unlikely to actually end up being used. Some day the C++ Parquet scanner might be able to return those types but there are very few people working on that part of the code base and I think it's unlikely to be implemented any time soon.

Their value proposition compared to regular string or binary is unclear to me.

I think the gist of it is that regular string and binary are slow to sort, which is why Meta's Velox, DuckDB, and (very recently) Polars have adopted it as their primary representation. Arrow added it primarily for interchange with those systems (e.g., a user-defined function based on the C Data interface) although I agree that it came at the expense unnecessary complexity for 99.9% of Arrow users.

I should point that the proliferation of basic data types is going to be a serious obstacle to adoption by new implementations.

I agree. In theory this is what nanoarrow is designed to help with (when the types are supported...so not yet). The focus of the version about to be released is testing and stability...0.5.0 is more likely to include features that could be used in a fallback sort of way (i.e., maybe ArrowArrayViewGetLogicalXXX() that would handle the dictionary, run-end-encoded, view, or normal cases at the expense of an extra switch()).

Perhaps there should be some "data type negociation" mechanism

In Arrow C++, this is most likely to be supported via an option in the readers. For Parquet In Python, the __arrow_c_schema__() and __arrow_c_stream__(requested_schema=None) methods can in theory handle this (i.e., you query __arrow_c_schema__() and if it contains types you don't understand, you pass a different schema to __arrow_c_stream__()'s requested_schema. That said, pyarrow doesn't implement requested_schema yet, but it also doesn't really implement the view types either.

And I hope that I won't have to add support for them in OGRLayer::WriteArrowBatch()

For now I think you will be hard-pressed to find a producer that actually produces REE or View types (ListView is also on the horizon if it's not already implemented in Arrow C++). It's on nanoarrow's roadmap to support all of them (but Python bindings are on the roadmap first, which is no small feat!)...perhaps it can help!

@jorisvandenbossche
Copy link
Member Author

Basically, the ArrowArrayView doesn't care how the memory it's pointing to got there (i.e., there's no requirement that an ArrowArray ever existed). This is mostly helpful for reading IPC: the decoder first fills in an ArrowArrayView and only allocates the ArrowArray if requested.

OK, understood! I see the C doc comment actually mentions this: "use it to represent a hypothetical ArrowArray that does not exist yet, or use it to validate the buffers of a future ArrowArray.".

@eitsupi
Copy link
Contributor

eitsupi commented Feb 6, 2024

FYI, note that Python Polars can create Arrow files with the Utf8View type written to them by using the experimental option;

>>> import polars as pl
>>> import pyarrow.feather as feather
>>> pl.DataFrame({"a": ["foo"]}).write_ipc("test.arrow", future=True)
>>> feather.read_table("test.arrow")
pyarrow.Table
a: string_view
----
a: [["foo"]]

Since Rust Polars started using the Utf8View type, downstream Rust projects will use it.

@paleolimbot
Copy link
Member

Nice! We're currently focused on a few other short-term nanoarrow things (e.g., Python bindings) but Joris did all the hard work here and we should be able to get this merged in the next few months with support in R and Python.

@eitsupi
Copy link
Contributor

eitsupi commented May 6, 2024

Hey, R polars package's Series/DataFrame to/from nanoarrow_array_stream conversion have been rewritten to only using the C Stream interface (pola-rs/r-polars#1075, pola-rs/r-polars#1076, pola-rs/r-polars#1078).
Even though nanoarrow 0.4.0 does not support StringView, It seems to work fine even when StringView is included.

polars::pl$Series(values = "foo") |>
  nanoarrow::as_nanoarrow_array_stream(future = TRUE) |>
  polars::as_polars_series()
#> polars Series: shape: (1,)
#> Series: '' [str]
#> [
#>  "foo"
#> ]

Created on 2024-05-06 with reprex v2.1.0

@paleolimbot
Copy link
Member

@eitsupi Sorry I missed this comment!

It seems to work fine even when StringView is included.

nanoarrow (for R or Python) can definitely transport any type (even invalid ones!), even though conversion to/from R won't work. I think you might also run into trouble printing these (although I believe I have some tests for this...you'd probably be able to see the format and buffer addresses).

@JayjeetAtGithub
Copy link

Hi @jorisvandenbossche @paleolimbot We are trying to support interop between arrow::StringViewArray / arrow::BinaryViewArray and cudf::column in libcudf. But since, libcudf uses nanoarrow which does not have string view or binary view types yet, we are blocked. If nanoarrow could support these two types (along with the mappings into ArrowArray / ArrowDeviceArray / ArrowSchema) implementation for these types, we could implement the interop to cudf::column.

@paleolimbot
Copy link
Member

I'll take a stab at this in the next few days to see what is required beyond this PR. You probably just need a "build by buffer" and "consume by buffer" level of support (as opposed to building by element or consuming by element, which is harder).

@kylebarron
Copy link

Their value proposition compared to regular string or binary is unclear to me

See also for more reading:

@paleolimbot
Copy link
Member

Closing since this was implemented in #596 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants