Dictionary IDs Arrow IPC #1206

tustvold · 2022-01-19T11:03:49Z

Which part is this question about

The Field data structure contains a dict_id member, that stores an i64. It appears the intention of this is that different dictionaries will have different IDs, however, this currently appears to only be respected by the IPC format and isn't widely utilised by arrow-rs.

Describe your question

Most of arrow-rs is completely agnostic to dict_id, with compute kernels completely ignoring it, even those that recompute dictionaries such as concat.

The only parts of the stack that appear to use the dict_ids are the IPC interfaces, which will error if they encounter the same dict_id multiple times. I think this is inconsistency is a tad confusing, I think we should do one of the following:

Keep the current agnosticism within arrow-rs and assign IDs in the writers (potentially using Arc::ptr_eq on the values array)
Make arrow-rs respect dict_ids

Of these the first would definitely be simpler to implement, but I'm not familiar enough with the purpose of dict_id to be certain there isn't some use-case this would preclude?

Additional context

As Field is part of the Schema, RecordBatch with different dict_id will appear to have different schema. This may have downstream implications for things like DataFusion which have strong assumptions on schema consistency within a plan.

This cropped up in apache/datafusion#1596 as it is using the arrow IPC format to spill buffers to disk.

The text was updated successfully, but these errors were encountered:

alamb · 2022-01-19T14:06:54Z

Another alternative is to remove dict_id entirely from Field and only use it while doing IPC serialization

This appears to be what @jorgecarleitao has done in arrow2

https://github.com/jorgecarleitao/arrow2/search?q=dict_id
https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/field.rs#L12-L21

avantgardnerio · 2023-12-07T17:17:48Z

It seems like a serde concern to me, and I don't see any value of having it in the schema. I'd be in favor of removing it from Field entirely and assigning it only during serde.

thinkharderdev · 2024-03-12T17:19:28Z

Another alternative is to remove dict_id entirely from Field and only use it while doing IPC serialization

This appears to be what @jorgecarleitao has done in arrow2

https://github.com/jorgecarleitao/arrow2/search?q=dict_id https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/field.rs#L12-L21

We're currently having a lot of issues with dict_id in arrow flight. There are a couple of different issues which I think @alamb suggestion above would alleviate:

Ensuring that you assign a unique dict_id all the time ends up creating a lot of boilerplate since you need some way to track the current dict_ids and ensure that you always preserve the dict_id once it is assigned.
FlightDataEncoder is inconsistent in where it gets dictionary ids from. When it sends the schema it will of course send the dict_id defined in the schema field. But when serializing the array it will send the dict_id from the array's datatype. Theoretically these should always be the same but since PartialEq on Field ignores the dict_id this is hard to enforce in practice.
To the last point, it is not even always possible to explicitly assign a dict_id when building an array with nested dictionary fields. Until very recently, using GenericListBuilder would just default the dict_id on it's item field.

That is all to say I'd like to work on making dict_id an internal implementation detail of the serialization and completely ignore the field's dict_id for that purpose. Then we can deprecate Field::dict_id and then eventually remove the dict_id field from Field entirely.

avantgardnerio · 2024-03-12T18:33:29Z

+1 it has been a major foot gun for us.

alamb · 2024-03-12T21:16:58Z

That is all to say I'd like to work on making dict_id an internal implementation detail of the serialization and completely ignore the field's dict_id for that purpose. Then we can deprecate Field::dict_id and then eventually remove the dict_id field from Field entirely.

That seems reasonable to me from what I can tell.

cc @tustvold

Also I wonder if @jhorstmann has any thoughts in this matter. I believe he was / has used dict_id in the past so maybe has some more perspective to share

tustvold · 2024-03-12T21:28:04Z

Dict id being present as a first-party field inside the schema has always felt a bit odd to me, I fully support making it a serde detail.

If users want a potentially broader notion of dictionary IDs at the schema level, there is nothing to prevent them using field metadata to do this

jhorstmann · 2024-03-13T07:49:53Z

AFAIK we are not using the dict_id property. We make heavy use of dictionary arrays, and one of the invariants in our system is that columns use the same dictionary across batches, but this is not reflected in the schema or fields. In fact, when we tried to use the ipc format to persist data we were post-processing the schema and assigning dictionary ids based on the position of the field in the schema. That post-processing would not have been necessary if the dict id was only part of the serialization api.

(A bit later we switched to just hydrating the dictionary arrays, because the dictionary were often larger than the actual filtered or aggregated data.)

brancz · 2024-11-26T16:55:02Z

I'm hopeful that we can get not preserving dict IDs to be the default for the next major release: #6788

Then for the next one we can remove the dict_id field once and for all.

tustvold added the question Further information is requested label Jan 19, 2022

tustvold mentioned this issue Jan 19, 2022

Consolidate sort and external_sort apache/datafusion#1596

Merged

tustvold mentioned this issue Jan 21, 2022

Dictionary Uniqueness and Sortedness #1217

Closed

alamb mentioned this issue Dec 23, 2022

Support DictionaryArrays in Arrow Flight #3389

Open

avantgardnerio mentioned this issue Dec 7, 2023

Preserve dict_id on Field during serde roundtrip apache/datafusion#8457

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dictionary IDs Arrow IPC #1206

Dictionary IDs Arrow IPC #1206

tustvold commented Jan 19, 2022 •

edited

Loading

alamb commented Jan 19, 2022 •

edited

Loading

avantgardnerio commented Dec 7, 2023

thinkharderdev commented Mar 12, 2024

avantgardnerio commented Mar 12, 2024

alamb commented Mar 12, 2024

tustvold commented Mar 12, 2024 •

edited

Loading

jhorstmann commented Mar 13, 2024

brancz commented Nov 26, 2024

Dictionary IDs Arrow IPC #1206

Dictionary IDs Arrow IPC #1206

Comments

tustvold commented Jan 19, 2022 • edited Loading

alamb commented Jan 19, 2022 • edited Loading

avantgardnerio commented Dec 7, 2023

thinkharderdev commented Mar 12, 2024

avantgardnerio commented Mar 12, 2024

alamb commented Mar 12, 2024

tustvold commented Mar 12, 2024 • edited Loading

jhorstmann commented Mar 13, 2024

brancz commented Nov 26, 2024

tustvold commented Jan 19, 2022 •

edited

Loading

alamb commented Jan 19, 2022 •

edited

Loading

tustvold commented Mar 12, 2024 •

edited

Loading