Describe the bug
Not preserving the dict ID does not work with regular IPC files (streaming works fine). In the file writer the schema is serialized twice, but the current implementation reuses the same dictionary tracker for both passes. That is not a problem when dictionary IDs are preserved, because the tracker then just passes through whatever is in the dict_id field of each Field. When the dict ID is not preserved, however, the tracker keeps assigning fresh dict IDs on the second pass, and the footer ends up serialized with dict IDs that don't match the ones used for the data.
To Reproduce
Write any record batch that contains at least one dictionary to a file writer configured not to preserve dict IDs:
// Reproducer, adapted to use the public arrow-array / arrow-schema / arrow-ipc
// crates (the original lives inside arrow-ipc and uses crate-relative paths).
use std::sync::Arc;

use arrow_array::{types::Int32Type, ArrayRef, DictionaryArray, RecordBatch, StructArray};
use arrow_ipc::reader::FileReader;
use arrow_ipc::writer::{FileWriter, IpcWriteOptions};
use arrow_schema::{Field, Schema};

// Build a struct column containing a dictionary-encoded child.
let inner: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
let array = Arc::new(inner) as ArrayRef;
let dctfield = Arc::new(Field::new("dict", array.data_type().clone(), false));
let s = StructArray::from(vec![(dctfield, array)]);
let struct_array = Arc::new(s) as ArrayRef;
let schema = Arc::new(Schema::new(vec![Field::new(
    "struct",
    struct_array.data_type().clone(),
    false,
)]));
let batch = RecordBatch::try_new(schema, vec![struct_array]).unwrap();

// Write the batch to an IPC file with preserve_dict_id disabled.
let mut buf = Vec::new();
let mut writer = FileWriter::try_new_with_options(
    &mut buf,
    batch.schema_ref(),
    IpcWriteOptions::default().with_preserve_dict_id(false),
)
.unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();
drop(writer);

// Reading it back fails because the footer schema carries dict IDs that
// don't match the dictionary batches that were written.
let mut reader = FileReader::try_new(std::io::Cursor::new(buf), None).unwrap();
assert_eq!(batch, reader.next().unwrap().unwrap());
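For contrast, the same batch round-trips cleanly through the streaming writer with the same option, which matches the claim above that only the file format is affected. This is a sketch that reuses the batch built above and assumes the public StreamWriter/StreamReader APIs from arrow-ipc:

use arrow_ipc::reader::StreamReader;
use arrow_ipc::writer::StreamWriter;

// Same batch and same option, but through the stream format: there is no
// footer, so the schema is only serialized once and the dict IDs stay consistent.
let mut buf = Vec::new();
let mut writer = StreamWriter::try_new_with_options(
    &mut buf,
    batch.schema_ref(),
    IpcWriteOptions::default().with_preserve_dict_id(false),
)
.unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();
drop(writer);
let mut reader = StreamReader::try_new(std::io::Cursor::new(buf), None).unwrap();
assert_eq!(batch, reader.next().unwrap().unwrap());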
Expected behavior
Writing a record batch that contains a dictionary to an IPC file with dict ID preservation disabled works, and the file can be read back.
Additional context
I haven't studied the spec in detail, but it does seem odd to me that the schema is written twice to the IPC file (once as the first message and once again in the footer). As it stands, though, this can't be changed, because the dict IDs need to be assigned before the first record batch is written; it can only be revisited once the preserve-dict-ID setting is removed and dict IDs are never preserved.
The fix is very simple: when the schema is written for the second time, create a new dictionary tracker with the same configuration as the first one. It's a 3-line fix that I already have, but I wanted to open this issue for tracking purposes.
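To illustrate the mechanism, here is a sketch of how I understand the tracker behaves. It assumes the public DictionaryTracker::new_with_preserve_dict_id and DictionaryTracker::set_dict_id APIs and Field::new_dict, so treat it as an approximation rather than the exact writer code:

use arrow_ipc::writer::DictionaryTracker;
use arrow_schema::{DataType, Field};

// A dictionary field as the writer sees it (the dict_id stored in the Field is 0).
let field = Field::new_dict(
    "dict",
    DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
    false,
    0,
    false,
);

// Current behavior: one tracker is reused for both schema passes.
let mut tracker = DictionaryTracker::new_with_preserve_dict_id(true, false);
assert_eq!(tracker.set_dict_id(&field), 0); // header pass assigns ID 0
assert_eq!(tracker.set_dict_id(&field), 1); // footer pass assigns ID 1 -> footer no longer matches the data

// The fix: give the footer pass a fresh tracker with the same configuration,
// so it reproduces the same assignment as the header pass.
let mut footer_tracker = DictionaryTracker::new_with_preserve_dict_id(true, false);
assert_eq!(footer_tracker.set_dict_id(&field), 0); // consistent with the header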
@tustvold @alamb