Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} #38169

Closed
wants to merge 1 commit into from

Conversation

manolama
Copy link
Contributor

@manolama manolama commented Oct 9, 2023

Rationale for this change

Previously, writing a file using the ArrowFileWriter class would only flush the initial state of dictionary vectors to the file so that subsequent updates were ignored. Likewise the ArrowFileReader only read the first dictionary block and re-used it for all subsequent data blocks.

What changes are included in this PR?

The ArrowFileWriter now flushes dictionary vectors on each call to writeBatch(). The ArrowFileReader will now load the dictionaries for each block on loadNextBatch() or loadRecordBatch(1).

Are these changes tested?

Yes

Are there any user-facing changes?

If users relied on previous behavior to encode a single dictionary for use across multiple batches without the isDelta flag set, this change may break that behavior. However per the documentation at https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages, the previous behavior was in error.

@github-actions
Copy link

github-actions bot commented Oct 9, 2023

⚠️ GitHub issue #38168 has been automatically assigned in GitHub to PR creator.

@manolama manolama changed the title GH-38168: [Java] Fix multi-batch Dictionary in Arrow{Reader|Writer} GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} Oct 9, 2023
…|Writer}

When manually writing dictionary vectors and writing multiple batches in a single `ArrowFileWriter`,
only the first dictionary batch was written and subsequent batches were ignored.
On reading, the `ArrowFileReader` would load only the first batch and use that batch for decoding
subsequent batches, resulting in errors or incorrect decodings.
This patch will now flush the dictionaries on each batch write and load the batches for the
dictionaries on read.
Following the docs at https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages.

Note that this does not address the delta dictionary encoding issue as the writer does not
currently havea means of setting the delta flag. Neither does it allow for streaming writes
of dictionaries (though the unit tests show a work-around).

Fix for apache#38168
@lidavidm
Copy link
Member

@vibhatha @davisusanibar can one of you review the PR? The main thing I'd make sure of is that this writes delta dictionaries and not replacement dictionaries

@vibhatha
Copy link
Collaborator

I will check @lidavidm

@manolama
Copy link
Contributor Author

manolama commented Oct 11, 2023

The main thing I'd make sure of is that this writes delta dictionaries and not replacement dictionaries

@lidavidm I mentioned in the commit comment that this is not a fix for delta writing (though it may help if we add an API to set the delta flag. It's always false.) Instead I need replacement dictionaries for parallel processing of the blocks.

I can see if the repo has a delta encoded dictionary somewhere to test decoding and maybe try and test delta encoding somehow.

@lidavidm
Copy link
Member

The IPC file format explicitly disallows replacement dictionaries.

@manolama
Copy link
Contributor Author

That's insanely confusing:

Alternatively, if isDelta is set to false, then the dictionary replaces the existing dictionary for the same ID.

Did the spec change to disallow replacements or is the doc correct and the spec now allows replacement?

@lidavidm
Copy link
Member

lidavidm commented Oct 11, 2023

Replacement is allowed in streams, but not files. But they use the same data structures.

@manolama
Copy link
Contributor Author

Oof, ok thanks for that info. I'll tweak this for the dictionary then and see if I can get the folks to allow replacements in the file format as well.

@lidavidm
Copy link
Member

Er, sorry, I should say deltas are allowed in streams, but not files.

@manolama
Copy link
Contributor Author

Could you point me to the spec that talks about files vs streams please?

@lidavidm
Copy link
Member

It's linked above. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

Sorry, I don't know what's with me today. My original statement was right. "Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported)." -> deltas are allowed in files, just not replacements

@manolama
Copy link
Contributor Author

manolama commented Oct 11, 2023

No worries. So nothing more formal defining the spec than that doc? Guess I'll go through Jiras to find where that came from. Files should support replacement if you ask me.

Pinging the mailing list: https://lists.apache.org/thread/mvhmsk5mg6y3nkr6yo9hojo0x3wo7zf7

Comment on lines -132 to -137
if (dictionariesWritten) {
return;
}
dictionariesWritten = true;
// Write out all dictionaries required.
// Replacement dictionaries are not supported in the IPC file format.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my clarification:

Referring to Apache Arrow Columnar specification.

Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported).

Here the relevant part of the code to provide that functionality has been removed. Does this change that functionality? Does it need a documentation update?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 13, 2023
@manolama
Copy link
Contributor Author

Heard back from Micah and we'd need a format change first. Will take a look at that later if we get a bit more involved with Arrow. Closing this for now, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] Multi-batch dictionary bug in ArrowFile{Reader|Writer}
3 participants