
Add IPC FileDecoder #5249

Merged: 3 commits into apache:master on Dec 30, 2023
Conversation

@tustvold (Contributor) commented Dec 27, 2023

Which issue does this PR close?

Closes #5153
Closes #5165
Closes #5252
Part of #1207

Rationale for this change

In a similar vein to arrow-csv and arrow-json, this extracts a low-level, push-based interface for decoding IPC files. This lends itself to use cases that need more control over how IO is performed, for example (see the sketch after this list):

  • Async IO
  • Memory-mapped Buffer
  • Parallel decode

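As a rough sketch (not part of this PR's diff), the push-based API can be driven end to end over a `Buffer` the caller produced however it likes. The flow below mirrors the doc example quoted later in this thread; the exact re-export paths (`arrow_ipc::reader::{read_footer_length, FileDecoder}`, `arrow_ipc::root_as_footer`, `arrow_ipc::convert::fb_to_schema`) are taken from that example and should be treated as assumptions:

```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use arrow_buffer::Buffer;
use arrow_ipc::convert::fb_to_schema;
use arrow_ipc::reader::{read_footer_length, FileDecoder};
use arrow_ipc::root_as_footer;

/// Decode every record batch from a complete Arrow IPC file held in `buffer`.
/// How `buffer` is produced is up to the caller: a plain read, a memory map,
/// or bytes fetched asynchronously.
fn decode_ipc_file(buffer: Buffer) -> Vec<RecordBatch> {
    // The file ends with <footer><footer length: i32><magic: "ARROW1">
    let trailer_start = buffer.len() - 10;
    let footer_len = read_footer_length(buffer[trailer_start..].try_into().unwrap()).unwrap();
    let footer = root_as_footer(&buffer[trailer_start - footer_len..trailer_start]).unwrap();

    let schema = Arc::new(fb_to_schema(footer.schema().unwrap()));
    let mut decoder = FileDecoder::new(schema, footer.version());

    // Dictionaries must be pushed before the record batches that reference them
    for block in footer.dictionaries().iter().flatten() {
        let block_len = block.bodyLength() as usize + block.metaDataLength() as usize;
        let data = buffer.slice_with_length(block.offset() as _, block_len);
        decoder.read_dictionary(&block, &data).unwrap();
    }

    // Record batch blocks are independent of one another, so a caller could
    // also fetch or decode them in parallel
    let batches = footer.recordBatches().unwrap();
    (0..batches.len())
        .filter_map(|i| {
            let block = batches.get(i);
            let block_len = block.bodyLength() as usize + block.metaDataLength() as usize;
            let data = buffer.slice_with_length(block.offset() as _, block_len);
            decoder.read_record_batch(block, &data).unwrap()
        })
        .collect()
}
```
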
What changes are included in this PR?

Are there any user-facing changes?

The github-actions bot added the arrow (Changes to the arrow crate) label on Dec 27, 2023
@tustvold (Contributor, Author) commented:

@comath perhaps you might take a look; whilst perhaps not as simple as #5165, I think this is a much more flexible and therefore future-proof mechanism to achieve this

Comment on lines +575 to +598
/// let trailer_start = buffer.len() - 10;
/// let footer_len = read_footer_length(buffer[trailer_start..].try_into().unwrap()).unwrap();
/// let footer = root_as_footer(&buffer[trailer_start - footer_len..trailer_start]).unwrap();
///
/// let back = fb_to_schema(footer.schema().unwrap());
/// assert_eq!(&back, schema.as_ref());
///
/// let mut decoder = FileDecoder::new(schema, footer.version());
///
/// // Read dictionaries
/// for block in footer.dictionaries().iter().flatten() {
///     let block_len = block.bodyLength() as usize + block.metaDataLength() as usize;
///     let data = buffer.slice_with_length(block.offset() as _, block_len);
///     decoder.read_dictionary(&block, &data).unwrap();
/// }
///
/// // Read record batch
/// let batches = footer.recordBatches().unwrap();
/// assert_eq!(batches.len(), 1); // Only wrote a single batch
///
/// let block = batches.get(0);
/// let block_len = block.bodyLength() as usize + block.metaDataLength() as usize;
/// let data = buffer.slice_with_length(block.offset() as _, block_len);
/// let back = decoder.read_record_batch(block, &data).unwrap().unwrap();
@tustvold (Contributor, Author) commented:

This is still somewhat verbose, but I couldn't see an easy way to reduce this further that didn't end up in knots with self-referential structs (as flatbuffers borrow data)

pub fn read_record_batch(
    &self,
    block: &Block,
    buf: &Buffer,
@tustvold (Contributor, Author) commented:

It is perhaps worth noting that this interface won't allow pushing down column projection to IO; I think this is a bridge we can cross when we add support for it.
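
A small aside, not from the PR: until projection can influence which bytes are fetched, a caller can still trim columns after decoding, at the cost of having read and decoded the whole block. A minimal sketch using the existing RecordBatch::project:

```rust
use arrow_array::RecordBatch;
use arrow_schema::ArrowError;

/// Keep only the columns at `indices` from a batch returned by
/// `FileDecoder::read_record_batch`; the full block was still read and decoded.
fn project_after_decode(batch: &RecordBatch, indices: &[usize]) -> Result<RecordBatch, ArrowError> {
    batch.project(indices)
}
```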


/// Read the dictionary with the given block and data buffer
pub fn read_dictionary(&mut self, block: &Block, buf: &Buffer) -> Result<(), ArrowError> {
    let message = self.read_message(buf)?;
A reviewer (Member) commented:

Not a big deal, but it seems we didn't check the metadata version for DictionaryBatch before. Maybe it was simply missed.
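
For context, a minimal sketch of the kind of check now applied to both record batches and dictionary batches through the shared read_message path; the helper name and error wording here are illustrative, not the PR's exact code:

```rust
use arrow_ipc::MetadataVersion;
use arrow_schema::ArrowError;

/// Reject messages whose metadata version differs from the one the decoder
/// was constructed with (illustrative sketch of the centralised check).
fn check_metadata_version(
    expected: MetadataVersion,
    found: MetadataVersion,
) -> Result<(), ArrowError> {
    if expected != found {
        return Err(ArrowError::IpcError(format!(
            "incorrect metadata version, expected {expected:?} got {found:?}"
        )));
    }
    Ok(())
}
```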

arrow-ipc/src/reader.rs (outdated review comment, resolved)
Co-authored-by: Liang-Chi Hsieh <[email protected]>
@tustvold merged commit 9863486 into apache:master on Dec 30, 2023
25 checks passed
@alamb (Contributor) commented Dec 30, 2023

Thank you for the reviews @viirya and the code @tustvold

@comath (Contributor) commented Dec 30, 2023

> @comath perhaps you might take a look; whilst perhaps not as simple as #5165, I think this is a much more flexible and therefore future-proof mechanism to achieve this

I'm a bit late, but I like it. It exposes the mechanisms for people to do their own crazy things in the future.

@alamb (Contributor) commented Dec 31, 2023

> It exposes the mechanisms for people to do their own crazy things in the future.

😆 -- love it!

@tustvold tustvold mentioned this pull request Mar 1, 2024
@tustvold tustvold mentioned this pull request Mar 19, 2024
Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues:

  • Support get offsets or blocks info from arrow file
  • Request to Memmap Arrow IPC files on disk
4 participants