IPC file writer produces incorrect footer when not preserving dict ID #6710

Closed
brancz opened this issue Nov 10, 2024 · 1 comment · Fixed by #6711
Labels
arrow Changes to the arrow crate bug

Comments


brancz commented Nov 10, 2024

Describe the bug

Not preserving the dict ID does not work with regular IPC files (streaming works fine). This is because the file writer serializes the schema twice, but the current implementation uses the same dictionary tracker for both passes. This is not an issue when dictionary IDs are preserved, because in that case the dictionary tracker just passes through whatever is in the dict_id field of the Field. When dict IDs are not preserved, however, the tracker continues to assign new dict IDs on the second pass. Those IDs don't make any sense, and the footer is serialized with incorrect dict IDs as a result.
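To illustrate the mechanism, here is a minimal toy model of the problem. It is only a sketch: `ToyTracker` and `serialize_schema` are hypothetical stand-ins invented for this example, not the actual arrow-rs types, and the real tracker does considerably more. The point is just that a tracker which hands out sequential IDs yields different IDs on the second pass if it is reused.

```rust
// Toy model only: `ToyTracker` and `serialize_schema` are hypothetical
// stand-ins for illustration, not the arrow-rs implementation.
struct ToyTracker {
    next_id: i64,
}

impl ToyTracker {
    fn new() -> Self {
        Self { next_id: 0 }
    }

    // Mimics "not preserving dict IDs": every call hands out the next ID.
    fn assign(&mut self) -> i64 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }
}

// "Serializes" a schema by returning the ID assigned to each dictionary field.
fn serialize_schema(dict_fields: &[&str], tracker: &mut ToyTracker) -> Vec<i64> {
    dict_fields.iter().map(|_| tracker.assign()).collect()
}

fn main() {
    let fields = ["dict"];
    let mut tracker = ToyTracker::new();

    // First pass: the schema message at the start of the file.
    let header_ids = serialize_schema(&fields, &mut tracker);
    // Second pass: the schema in the footer, reusing the same tracker.
    let footer_ids = serialize_schema(&fields, &mut tracker);

    // The same field ends up with two different IDs.
    assert_ne!(header_ids, footer_ids);
    println!("header: {:?}, footer: {:?}", header_ids, footer_ids);
}
```

In this model the header and footer disagree about the ID of the same field, which corresponds to the mismatch described above.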

To Reproduce

Write any record that contains at least one dictionary to a file writer that is configured to not preserve dict IDs.

        let inner: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();

        let array = Arc::new(inner) as ArrayRef;

        let dctfield = Arc::new(Field::new("dict", array.data_type().clone(), false));

        let s = StructArray::from(vec![(dctfield, array)]);
        let struct_array = Arc::new(s) as ArrayRef;

        let schema = Arc::new(Schema::new(vec![Field::new(
            "struct",
            struct_array.data_type().clone(),
            false,
        )]));

        let batch = RecordBatch::try_new(schema, vec![struct_array]).unwrap();

        let mut buf = Vec::new();
        let mut writer = crate::writer::FileWriter::try_new_with_options(
            &mut buf,
            batch.schema_ref(),
            IpcWriteOptions::default().with_preserve_dict_id(false),
        )
        .unwrap();
        writer.write(&batch).unwrap();
        writer.finish().unwrap();
        drop(writer);

        let mut reader = FileReader::try_new(std::io::Cursor::new(buf), None).unwrap();

        assert_eq!(batch, reader.next().unwrap().unwrap());

Expected behavior

Writing a record batch to an IPC file that contains a dict and not preserving dict IDs works.

Additional context

I haven't studied the spec in detail, but it does seem odd to me that the schema is written twice to the IPC file (once as the first message, and once in the footer). At least the way it stands, though, this can't be changed, because the dict IDs need to be assigned before writing the first record batch. It can only be changed once the preserve dict ID setting is removed because dict IDs are never preserved.

The fix is very simple: create a new dictionary tracker, with the same configuration as the first one, when the schema is written for the second time. It's a 3-line fix that I already have, but I wanted to open this issue for tracking purposes.
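As a rough sketch of the idea (not the actual patch), the toy model below uses a fresh tracker with the same configuration for the footer pass. All names here (`ToyField`, `ToyTracker`, `serialize_schema`, `write_file`) are hypothetical and exist only for this illustration. The real change lives in the arrow-ipc file writer.

```rust
// Toy model only: every name here is hypothetical, not the arrow-rs API.
struct ToyField {
    dict_id: i64, // ID carried by the field itself
}

struct ToyTracker {
    preserve_dict_id: bool,
    next_id: i64,
}

impl ToyTracker {
    fn new(preserve_dict_id: bool) -> Self {
        Self { preserve_dict_id, next_id: 0 }
    }

    fn id_for(&mut self, field: &ToyField) -> i64 {
        if self.preserve_dict_id {
            // Pass through whatever the field carries.
            field.dict_id
        } else {
            // Assign the next sequential ID.
            let id = self.next_id;
            self.next_id += 1;
            id
        }
    }
}

fn serialize_schema(fields: &[ToyField], tracker: &mut ToyTracker) -> Vec<i64> {
    fields.iter().map(|f| tracker.id_for(f)).collect()
}

// Returns (header IDs, footer IDs), using a fresh tracker for the footer pass.
fn write_file(fields: &[ToyField], preserve_dict_id: bool) -> (Vec<i64>, Vec<i64>) {
    let mut header_tracker = ToyTracker::new(preserve_dict_id);
    let header = serialize_schema(fields, &mut header_tracker);

    // A new tracker with the same configuration restarts ID assignment.
    let mut footer_tracker = ToyTracker::new(preserve_dict_id);
    let footer = serialize_schema(fields, &mut footer_tracker);

    (header, footer)
}

fn main() {
    let fields = [ToyField { dict_id: 7 }, ToyField { dict_id: 9 }];

    // Header and footer agree whether or not IDs are preserved.
    let (header, footer) = write_file(&fields, false);
    assert_eq!(header, footer);
    let (header, footer) = write_file(&fields, true);
    assert_eq!(header, footer);
}
```

Because the second tracker starts from the same initial state, both passes assign identical IDs in this model, regardless of the preserve setting.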

@tustvold @alamb

@brancz brancz added the bug label Nov 10, 2024
@alamb alamb added the arrow Changes to the arrow crate label Nov 16, 2024

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'arrow'} from #6711
