GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} #38169

manolama · 2023-10-09T21:56:33Z

Rationale for this change

Previously, the ArrowFileWriter class flushed only the initial state of the dictionary vectors to the file, so subsequent updates were ignored. Likewise, the ArrowFileReader read only the first dictionary block and reused it for all subsequent data blocks.
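To make the failure mode concrete, here is a minimal sketch of what reusing a stale dictionary does to later batches. It is a toy model only: plain Java lists stand in for Arrow dictionary and index vectors, and the class and method names (`StaleDictionaryDemo`, `decode`) are hypothetical, not part of the Arrow API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: plain lists stand in for Arrow dictionary vectors and index vectors.
class StaleDictionaryDemo {

    // Decode dictionary indices into values. An index outside the dictionary is a decoding error.
    public static List<String> decode(List<Integer> indices, List<String> dictionary) {
        List<String> out = new ArrayList<>();
        for (int index : indices) {
            if (index < 0 || index >= dictionary.size()) {
                throw new IllegalStateException("index " + index + " is not in the dictionary");
            }
            out.add(dictionary.get(index));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> firstDictionary = List.of("a", "b");
        List<String> secondDictionary = List.of("c", "d", "e");
        // Indices of the second batch, encoded against secondDictionary.
        List<Integer> secondBatchIndices = List.of(0, 1);

        // Reusing the first dictionary decodes without error but yields the wrong values.
        System.out.println(decode(secondBatchIndices, firstDictionary));  // prints [a, b]
        // Decoding with the dictionary the batch was encoded against gives the intended values.
        System.out.println(decode(secondBatchIndices, secondDictionary)); // prints [c, d]
    }
}
```

An index of 2 in the second batch would instead throw with the stale two-entry dictionary, which corresponds to the "errors" case rather than the "incorrect decodings" case.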

What changes are included in this PR?

The ArrowFileWriter now flushes dictionary vectors on each call to writeBatch(). The ArrowFileReader will now load the dictionaries for each block on loadNextBatch() or loadRecordBatch(1).

Are these changes tested?

Yes

Are there any user-facing changes?

If users relied on the previous behavior to encode a single dictionary for use across multiple batches without the isDelta flag set, this change may break that behavior. However, per the documentation at https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages, the previous behavior was in error.

Closes: [Java] Multi-batch dictionary bug in ArrowFile{Reader|Writer} #38168

github-actions · 2023-10-09T21:57:00Z

⚠️ GitHub issue #38168 has been automatically assigned in GitHub to PR creator.

…|Writer} When manually writing dictionary vectors and writing multiple batches in a single `ArrowFileWriter`, only the first dictionary batch was written and subsequent batches were ignored. On reading, the `ArrowFileReader` would load only the first batch and use that batch for decoding subsequent batches, resulting in errors or incorrect decodings. This patch will now flush the dictionaries on each batch write and load the batches for the dictionaries on read, following the docs at https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages. Note that this does not address the delta dictionary encoding issue as the writer does not currently have a means of setting the delta flag. Neither does it allow for streaming writes of dictionaries (though the unit tests show a work-around). Fix for apache#38168

lidavidm · 2023-10-11T13:46:31Z

@vibhatha @davisusanibar can one of you review the PR? The main thing I'd make sure of is that this writes delta dictionaries and not replacement dictionaries

vibhatha · 2023-10-11T14:05:12Z

I will check @lidavidm

manolama · 2023-10-11T17:39:44Z

The main thing I'd make sure of is that this writes delta dictionaries and not replacement dictionaries

@lidavidm I mentioned in the commit comment that this is not a fix for delta writing (though it may help if we add an API to set the delta flag; it's currently always false). Instead, I need replacement dictionaries for parallel processing of the blocks.

I can see if the repo has a delta encoded dictionary somewhere to test decoding and maybe try and test delta encoding somehow.

lidavidm · 2023-10-11T17:42:42Z

The IPC file format explicitly disallows replacement dictionaries.

lidavidm · 2023-10-11T17:43:11Z

Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported).

manolama · 2023-10-11T17:59:01Z

That's insanely confusing:

Alternatively, if isDelta is set to false, then the dictionary replaces the existing dictionary for the same ID.

Did the spec change to disallow replacements or is the doc correct and the spec now allows replacement?
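For readers following the isDelta discussion, the two behaviors described in the quoted documentation can be sketched as a small state update: a delta batch appends to the existing dictionary for that ID, and a non-delta batch replaces it. This is a toy model with hypothetical names (`DictionaryState`, `apply`, `get`), not the Arrow Java implementation. Treating a delta for an unknown ID as an error is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of reader-side dictionary state, keyed by dictionary ID.
class DictionaryState {
    private final Map<Long, List<String>> dictionaries = new HashMap<>();

    // Apply one dictionary batch to the state.
    public void apply(long id, boolean isDelta, List<String> values) {
        if (isDelta) {
            List<String> existing = dictionaries.get(id);
            if (existing == null) {
                // Assumption of this sketch: a delta needs an existing dictionary to extend.
                throw new IllegalStateException("delta for unknown dictionary id " + id);
            }
            existing.addAll(values); // delta: append the new entries
        } else {
            dictionaries.put(id, new ArrayList<>(values)); // non-delta: replace the dictionary
        }
    }

    // Return a read-only snapshot of the dictionary, or an empty list if the ID is unknown.
    public List<String> get(long id) {
        List<String> values = dictionaries.get(id);
        return values == null ? List.of() : List.copyOf(values);
    }

    public static void main(String[] args) {
        DictionaryState state = new DictionaryState();
        state.apply(0L, false, List.of("a", "b"));
        state.apply(0L, true, List.of("c"));   // dictionary 0 now holds a, b, c
        state.apply(0L, false, List.of("x"));  // dictionary 0 now holds only x
        System.out.println(state.get(0L));
    }
}
```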

lidavidm · 2023-10-11T18:04:19Z

Replacement is allowed in streams, but not files. But they use the same data structures.

manolama · 2023-10-11T18:06:17Z

Oof, ok thanks for that info. I'll tweak this for the dictionary then and see if I can get the folks to allow replacements in the file format as well.

lidavidm · 2023-10-11T18:18:21Z

Er, sorry, I should say deltas are allowed in streams, but not files.

manolama · 2023-10-11T18:51:04Z

Could you point me to the spec that talks about files vs streams please?

lidavidm · 2023-10-11T18:55:01Z

It's linked above. https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

Sorry, I don't know what's with me today. My original statement was right. "Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported)." -> deltas are allowed in files, just not replacements
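The file-format rule quoted above (at most one non-delta dictionary batch per dictionary ID, deltas allowed) can be sketched as a simple check over the sequence of dictionary batches in a file. This is an illustrative model only: `FileDictionaryRule`, `DictionaryBatch`, and `isValidForFile` are hypothetical names, and the check covers only this one rule, not the rest of the IPC file format.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy check for the rule: at most one non-delta dictionary batch per dictionary ID in a file.
class FileDictionaryRule {

    // Minimal stand-in for a dictionary batch header: just the ID and the isDelta flag.
    public static final class DictionaryBatch {
        final long id;
        final boolean isDelta;

        public DictionaryBatch(long id, boolean isDelta) {
            this.id = id;
            this.isDelta = isDelta;
        }
    }

    // Returns false as soon as a second non-delta batch appears for the same ID.
    public static boolean isValidForFile(List<DictionaryBatch> batches) {
        Set<Long> idsWithNonDelta = new HashSet<>();
        for (DictionaryBatch batch : batches) {
            // Set.add returns false when the ID was already present, i.e. a replacement.
            if (!batch.isDelta && !idsWithNonDelta.add(batch.id)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One initial dictionary followed by a delta: allowed in a file.
        System.out.println(isValidForFile(List.of(
                new DictionaryBatch(0L, false), new DictionaryBatch(0L, true))));
        // Two non-delta batches for the same ID: a replacement, not allowed in a file.
        System.out.println(isValidForFile(List.of(
                new DictionaryBatch(0L, false), new DictionaryBatch(0L, false))));
    }
}
```

Under this rule, a writer that emits a non-delta dictionary on every batch, as this PR does, produces the second case whenever a file holds more than one batch.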

manolama · 2023-10-11T20:00:26Z

No worries. So nothing more formal defining the spec than that doc? Guess I'll go through Jiras to find where that came from. Files should support replacement if you ask me.

Pinging the mailing list: https://lists.apache.org/thread/mvhmsk5mg6y3nkr6yo9hojo0x3wo7zf7

vibhatha · 2023-10-13T01:41:50Z

java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java

-    if (dictionariesWritten) {
-      return;
-    }
-    dictionariesWritten = true;
-    // Write out all dictionaries required.
-    // Replacement dictionaries are not supported in the IPC file format.


Just for my clarification:

Referring to the Apache Arrow Columnar specification:

Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported).

Here the relevant part of the code to provide that functionality has been removed. Does this change that functionality? Does it need a documentation update?

manolama · 2023-10-16T21:56:22Z

Heard back from Micah and we'd need a format change first. Will take a look at that later if we get a bit more involved with Arrow. Closing this for now, thanks!

manolama requested a review from lidavidm as a code owner October 9, 2023 21:56

github-actions bot added the Component: Java and awaiting review labels Oct 9, 2023

manolama changed the title ~~GH-38168: [Java] Fix multi-batch Dictionary in Arrow{Reader|Writer}~~ GH-38168: [Java] Fix multi-batch Dictionary in ArrowFile{Reader|Writer} Oct 9, 2023

manolama force-pushed the tweaks branch from 92e2c73 to e11025b Compare October 9, 2023 21:58

vibhatha reviewed Oct 13, 2023

github-actions bot added the awaiting committer review label and removed the awaiting review label Oct 13, 2023

manolama closed this Oct 16, 2023

vibhatha mentioned this pull request Oct 24, 2023

GH-38414 [Java][Vector] Add Delta dictionary support #38423

Open
