Describe the bug
Not preserving the dict ID does not work with regular IPC files (streaming works fine). In the file writer the schema is serialized twice, but the current implementation reuses the same dictionary tracker for both passes. That is not a problem when dictionary IDs are preserved, because the tracker then just passes through whatever is in the dict_id field of each Field. When the dict ID is not preserved, however, the tracker keeps assigning fresh dict IDs on the second pass, and the footer ends up serialized with dict IDs that don't match the ones used for the data.
To Reproduce
Write any record batch that contains at least one dictionary to a file writer configured not to preserve dict IDs:
// Reproducer, adapted to use the public arrow-array / arrow-schema / arrow-ipc
// crates (the original lives inside arrow-ipc and uses crate-relative paths).
use std::sync::Arc;

use arrow_array::{types::Int32Type, ArrayRef, DictionaryArray, RecordBatch, StructArray};
use arrow_ipc::reader::FileReader;
use arrow_ipc::writer::{FileWriter, IpcWriteOptions};
use arrow_schema::{Field, Schema};

// Build a struct column containing a dictionary-encoded child.
let inner: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
let array = Arc::new(inner) as ArrayRef;
let dctfield = Arc::new(Field::new("dict", array.data_type().clone(), false));
let s = StructArray::from(vec![(dctfield, array)]);
let struct_array = Arc::new(s) as ArrayRef;
let schema = Arc::new(Schema::new(vec![Field::new(
    "struct",
    struct_array.data_type().clone(),
    false,
)]));
let batch = RecordBatch::try_new(schema, vec![struct_array]).unwrap();

// Write the batch to an IPC file with preserve_dict_id disabled.
let mut buf = Vec::new();
let mut writer = FileWriter::try_new_with_options(
    &mut buf,
    batch.schema_ref(),
    IpcWriteOptions::default().with_preserve_dict_id(false),
)
.unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();
drop(writer);

// Reading it back fails because the footer schema carries dict IDs that
// don't match the dictionary batches that were written.
let mut reader = FileReader::try_new(std::io::Cursor::new(buf), None).unwrap();
assert_eq!(batch, reader.next().unwrap().unwrap());
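For contrast, the same batch round-trips cleanly through the streaming writer with the same option, which matches the claim above that only the file format is affected. This is a sketch that reuses the batch built above and assumes the public StreamWriter/StreamReader APIs from arrow-ipc:

use arrow_ipc::reader::StreamReader;
use arrow_ipc::writer::StreamWriter;

// Same batch and same option, but through the stream format: there is no
// footer, so the schema is only serialized once and the dict IDs stay consistent.
let mut buf = Vec::new();
let mut writer = StreamWriter::try_new_with_options(
    &mut buf,
    batch.schema_ref(),
    IpcWriteOptions::default().with_preserve_dict_id(false),
)
.unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();
drop(writer);
let mut reader = StreamReader::try_new(std::io::Cursor::new(buf), None).unwrap();
assert_eq!(batch, reader.next().unwrap().unwrap());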
Expected behavior
Writing a record batch that contains a dictionary to an IPC file with dict ID preservation disabled works, and the file can be read back.
Additional context
I haven't studied the spec in detail, but it does seem odd to me that the schema is written twice to the IPC file (once as the first message and once again in the footer). As it stands, though, this can't be changed, because the dict IDs need to be assigned before the first record batch is written; it can only be revisited once the preserve-dict-ID setting is removed and dict IDs are never preserved.
The fix is very simple: when the schema is written for the second time, create a new dictionary tracker with the same configuration as the first one. It's a 3-line fix that I already have, but I wanted to open this issue for tracking purposes.
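To illustrate the mechanism, here is a sketch of how I understand the tracker behaves. It assumes the public DictionaryTracker::new_with_preserve_dict_id and DictionaryTracker::set_dict_id APIs and Field::new_dict, so treat it as an approximation rather than the exact writer code:

use arrow_ipc::writer::DictionaryTracker;
use arrow_schema::{DataType, Field};

// A dictionary field as the writer sees it (the dict_id stored in the Field is 0).
let field = Field::new_dict(
    "dict",
    DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
    false,
    0,
    false,
);

// Current behavior: one tracker is reused for both schema passes.
let mut tracker = DictionaryTracker::new_with_preserve_dict_id(true, false);
assert_eq!(tracker.set_dict_id(&field), 0); // header pass assigns ID 0
assert_eq!(tracker.set_dict_id(&field), 1); // footer pass assigns ID 1 -> footer no longer matches the data

// The fix: give the footer pass a fresh tracker with the same configuration,
// so it reproduces the same assignment as the header pass.
let mut footer_tracker = DictionaryTracker::new_with_preserve_dict_id(true, false);
assert_eq!(footer_tracker.set_dict_id(&field), 0); // consistent with the header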
@tustvold @alamb