Dictionary IDs Arrow IPC #1206
Comments
Another alternative is to remove This appears to be what @jorgecarleitao has done in arrow2 https://github.com/jorgecarleitao/arrow2/search?q=dict_id
It seems like a serde concern to me, and I don't see any value in having it in the schema. I'd be in favor of removing it from
We're currently having a lot of issues with
That is all to say I'd like to work on making
+1, it has been a major footgun for us.
That seems reasonable to me from what I can tell. cc @tustvold Also, I wonder if @jhorstmann has any thoughts on this matter. I believe he was / has used
Dict id being present as a first-party field inside the schema has always felt a bit odd to me; I fully support making it a serde detail. If users want a potentially broader notion of dictionary IDs at the schema level, there is nothing to prevent them from using field metadata to do this.
AFAIK we are not using the (A bit later we switched to just hydrating the dictionary arrays, because the dictionaries were often larger than the actual filtered or aggregated data.)
I'm hopeful that we can get not preserving dict IDs to be the default for the next major release: #6788. Then for the next one we can remove the
Which part is this question about
The Field data structure contains a dict_id member that stores an i64. It appears the intention of this is that different dictionaries will have different IDs; however, this currently appears to be respected only by the IPC format and isn't widely utilised by arrow-rs.
Describe your question
Most of arrow-rs is completely agnostic to dict_id, with compute kernels completely ignoring it, even those that recompute dictionaries such as concat.
The only parts of the stack that appear to use the dict_ids are the IPC interfaces, which will error if they encounter the same dict_id multiple times. I think this inconsistency is a tad confusing, and I think we should do one of the following:
Of these, the first would definitely be simpler to implement, but I'm not familiar enough with the purpose of dict_id to be certain there isn't some use-case this would preclude.
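To make the IPC behaviour described above concrete, here is a minimal sketch of a writer-side check that rejects a repeated dict_id. DictIdTracker and register are hypothetical names invented for illustration; this is not the arrow-rs IPC implementation, only a model of the "error on duplicate ID" rule.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the bookkeeping an IPC writer might do.
/// Not the arrow-rs API; it only models "each dict_id may appear once".
#[derive(Default)]
struct DictIdTracker {
    seen: HashSet<i64>,
}

impl DictIdTracker {
    /// Records a dictionary ID, returning an error if it was already seen.
    fn register(&mut self, dict_id: i64) -> Result<(), String> {
        // HashSet::insert returns false when the value was already present.
        if self.seen.insert(dict_id) {
            Ok(())
        } else {
            Err(format!("dict_id {} encountered more than once", dict_id))
        }
    }
}
```

Under this model, two dictionary columns that both carry the default ID would fail on the second call to register, even though compute kernels never look at the ID at all.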
Additional context
As Field is part of the Schema, RecordBatch with different dict_id will appear to have different schema. This may have downstream implications for things like DataFusion, which have strong assumptions on schema consistency within a plan.
This cropped up in apache/datafusion#1596 as it is using the arrow IPC format to spill buffers to disk.
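The schema-equality concern can be sketched with simplified stand-in types. The Field and Schema structs below are hypothetical and deliberately minimal (they are not the arrow-rs definitions), and the helper names are assumptions made for illustration. The sketch assumes equality is derived over all members, including dict_id.

```rust
/// Hypothetical, simplified stand-in for a schema field.
/// Equality is derived, so dict_id takes part in comparisons.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
    dict_id: i64,
}

/// Hypothetical, simplified stand-in for a schema.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

/// Builds a nullable dictionary-encoded field with the given ID.
fn dict_field(name: &str, dict_id: i64) -> Field {
    Field { name: name.to_string(), nullable: true, dict_id }
}

/// Strict comparison: any difference in dict_id makes the schemas unequal.
fn schemas_match(a: &Schema, b: &Schema) -> bool {
    a == b
}

/// Comparison that treats dict_id as a serialization detail and skips it.
fn schemas_match_ignoring_dict_id(a: &Schema, b: &Schema) -> bool {
    a.fields.len() == b.fields.len()
        && a.fields
            .iter()
            .zip(&b.fields)
            .all(|(x, y)| x.name == y.name && x.nullable == y.nullable)
}
```

With two schemas whose only field is "city" but whose dict_id values are 0 and 1, schemas_match returns false while schemas_match_ignoring_dict_id returns true. That gap is what a consumer with strict schema-consistency checks would trip over.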