-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} #38169
Conversation
|
…|Writer} When manually writing dictionary vectors and writing multiple batches in a single `ArrowFileWriter`, only the first dictionary batch was written and subsequent batches were ignored. On reading, the `ArrowFileReader` would load only the first batch and use that batch for decoding subsequent batches, resulting in errors or incorrect decodings. This patch will now flush the dictionaries on each batch write and load the batches for the dictionaries on read. Following the docs at https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages. Note that this does not address the delta dictionary encoding issue as the writer does not currently havea means of setting the delta flag. Neither does it allow for streaming writes of dictionaries (though the unit tests show a work-around). Fix for apache#38168
@vibhatha @davisusanibar can one of you review the PR? The main thing I'd make sure of is that this writes delta dictionaries and not replacement dictionaries |
I will check @lidavidm |
@lidavidm I mentioned in the commit comment that this is not a fix for delta writing (though it may help if we add an API to set the delta flag. It's always false.) Instead I need replacement dictionaries for parallel processing of the blocks. I can see if the repo has a delta encoded dictionary somewhere to test decoding and maybe try and test delta encoding somehow. |
The IPC file format explicitly disallows replacement dictionaries. |
That's insanely confusing:
Did the spec change to disallow replacements or is the doc correct and the spec now allows replacement? |
Replacement is allowed in streams, but not files. But they use the same data structures. |
Oof, ok thanks for that info. I'll tweak this for the dictionary then and see if I can get the folks to allow replacements in the file format as well. |
Er, sorry, I should say deltas are allowed in streams, but not files. |
Could you point me to the spec that talks about files vs streams please? |
It's linked above. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format Sorry, I don't know what's with me today. My original statement was right. "Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported)." -> deltas are allowed in files, just not replacements |
No worries. So nothing more formal defining the spec than that doc? Guess I'll go through Jiras to find where that came from. Files should support replacement if you ask me. Pinging the mailing list: https://lists.apache.org/thread/mvhmsk5mg6y3nkr6yo9hojo0x3wo7zf7 |
if (dictionariesWritten) { | ||
return; | ||
} | ||
dictionariesWritten = true; | ||
// Write out all dictionaries required. | ||
// Replacement dictionaries are not supported in the IPC file format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my clarification:
Referring to Apache Arrow Columnar specification.
Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported).
Here the relevant part of the code to provide that functionality has been removed. Does this change that functionality? Does it need a documentation update?
Heard back from Micah and we'd need a format change first. Will take a look at that later if we get a bit more involved with Arrow. Closing this for now, thanks! |
Rationale for this change
Previously, writing a file using the
ArrowFileWriter
class would only flush the initial state of dictionary vectors to the file so that subsequent updates were ignored. Likewise theArrowFileReader
only read the first dictionary block and re-used it for all subsequent data blocks.What changes are included in this PR?
The
ArrowFileWriter
now flushes dictionary vectors on each call towriteBatch()
. TheArrowFileReader
will now load the dictionaries for each block onloadNextBatch()
orloadRecordBatch(1)
.Are these changes tested?
Yes
Are there any user-facing changes?
If users relied on previous behavior to encode a single dictionary for use across multiple batches without the
isDelta
flag set, this change may break that behavior. However per the documentation at https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages, the previous behavior was in error.