Blockwise IO in IPC FileReader (#5153) #5179

Merged
merged 4 commits into from
Dec 8, 2023

tustvold
Contributor

@tustvold tustvold commented Dec 6, 2023

Which issue does this PR close?

Part of #5153

Rationale for this change

This is the first step toward extracting a BufferRead trait that allows zero-copy deserialization of IPC files loaded into a Buffer.

What changes are included in this PR?

Uses large blockwise reads to avoid the need for BufReader and to reduce the number of IO requests necessary to read a file.

Are there any user-facing changes?

@tustvold tustvold requested a review from viirya December 6, 2023 23:20
@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 6, 2023
reader.seek(SeekFrom::End(-6))?;
reader.read_exact(&mut magic_buffer)?;
if magic_buffer != super::ARROW_MAGIC {
/// Returns errors if the file does not meet the Arrow Format footer requirements
Contributor Author

It should be sufficient to just check the footer, and this provides a more predictable IO pattern
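
A minimal sketch of a footer-only check, for illustration. The Arrow file format ends with the footer, a little-endian int32 footer length, and the magic `ARROW1`, so one seek and one 10-byte read can validate the magic and recover the footer length together. The helper name `read_footer_len` and the use of `io::Error` are assumptions for this sketch, not necessarily what the PR does.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Magic bytes that terminate an Arrow IPC file.
const ARROW_MAGIC: [u8; 6] = *b"ARROW1";

/// Hypothetical helper: validates the trailing magic and returns the footer
/// length, using one seek and one read of the last 10 bytes.
fn read_footer_len<R: Read + Seek>(reader: &mut R) -> io::Result<usize> {
    // Trailer layout: <i32 footer length, little-endian> <"ARROW1">
    let mut trailer = [0u8; 10];
    reader.seek(SeekFrom::End(-10))?;
    reader.read_exact(&mut trailer)?;

    if trailer[4..] != ARROW_MAGIC {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing Arrow magic in footer",
        ));
    }

    let len = i32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if len < 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "negative footer length",
        ));
    }
    Ok(len as usize)
}
```

A file shorter than 10 bytes fails at the seek or the read, so truncated input surfaces as an error rather than a panic.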

@@ -498,10 +498,34 @@ pub fn read_dictionary(
Ok(())
}

/// Read the data for a given block
fn read_block<R: Read + Seek>(mut reader: R, block: &Block) -> Result<Buffer, ArrowError> {
Contributor Author

I am making an assumption here that blocks should be large enough to be an appropriate IO size

Contributor Author

Perhaps @pitrou might know if this is a bad idea?

Member

Given that a block typically points to an entire record batch, yes, a block is certainly large enough.
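
A rough sketch of the blockwise read discussed in this thread, using only the standard library. The `Block` struct here is a hypothetical stand-in for the footer's block entry (offset, metadata length, body length), and the function returns a `Vec<u8>` rather than the `Buffer` in the signature above, so this illustrates the IO pattern and is not the PR's implementation.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a footer block entry: where a message starts
/// and how long its metadata and body are.
struct Block {
    offset: u64,
    metadata_len: usize,
    body_len: usize,
}

/// Reads the metadata and body of one block with a single seek and a
/// single `read_exact`, instead of one read per part.
fn read_block<R: Read + Seek>(reader: &mut R, block: &Block) -> io::Result<Vec<u8>> {
    let total = block
        .metadata_len
        .checked_add(block.body_len)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "block length overflows"))?;

    let mut buf = vec![0u8; total];
    reader.seek(SeekFrom::Start(block.offset))?;
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

Because a block usually covers a whole record batch, the single read is typically large enough that no extra buffering layer is needed in front of the reader.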

///
/// <https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format>
fn parse_message(buf: &[u8]) -> Result<Message, ArrowError> {
let buf = match buf[..4] == CONTINUATION_MARKER {
Contributor Author

This does highlight a somewhat peculiar quirk of the IPC file format: it doesn't actually care about the size prefixes. I guess this is just a historical artifact of the fact that IPC streams came first.
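
To illustrate the quirk, here is a small sketch of skipping the prefix. In the encapsulated message format a message starts with the 4-byte continuation marker `0xFFFFFFFF` followed by a 4-byte length, while the legacy layout has only the 4-byte length. Since the footer's block already gives the extent of the message, the length value itself can be skipped without being read. The helper name `strip_message_prefix` and the `Option` return are assumptions made for this sketch.

```rust
/// Continuation marker that precedes the length prefix in the current
/// encapsulated message format.
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: returns the bytes that follow the size prefix, or
/// `None` if the buffer is too short to contain the prefix.
fn strip_message_prefix(buf: &[u8]) -> Option<&[u8]> {
    let head = buf.get(..4)?;
    // Marker present: skip marker + length (8 bytes).
    // Marker absent (legacy layout): skip the length only (4 bytes).
    let prefix_len = if *head == CONTINUATION_MARKER { 8 } else { 4 };
    buf.get(prefix_len..)
}
```

Using `get` instead of direct indexing means a truncated buffer yields `None` rather than a panic.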

Contributor

@alamb alamb left a comment

Thank you for this change @tustvold -- this PR makes sense to me.

arrow-ipc/src/reader.rs
let message = crate::root_as_message(&block_data[..]).map_err(|err| {
ArrowError::ParseError(format!("Unable to get root as message: {err:?}"))
})?;
let buf = read_block(&mut reader, block)?;
Contributor

Reducing the number of copies of the code that does the read seems like a good improvement to me.

It seems like previously the code did at least 2 IOs per block, one to read the header and one to read the data. With this PR it will only do 1 IO per block (given there is no continuation).
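
One way to make the difference in IO counts concrete is a small counting wrapper. This is only a sketch: `CountingReader`, `read_two_step`, and `read_one_step` are hypothetical names, and the count is the number of `Read::read` calls against an in-memory reader, which is a proxy for IO requests rather than a measurement of them.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper that counts calls to `read` on the inner reader.
struct CountingReader<R> {
    inner: R,
    reads: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.reads += 1;
        self.inner.read(buf)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        self.inner.seek(pos)
    }
}

/// Two-step pattern: one read for the metadata, a second for the body.
fn read_two_step<R: Read + Seek>(
    r: &mut R,
    offset: u64,
    meta: usize,
    body: usize,
) -> io::Result<Vec<u8>> {
    r.seek(SeekFrom::Start(offset))?;
    let mut out = vec![0u8; meta + body];
    let (meta_buf, body_buf) = out.split_at_mut(meta);
    r.read_exact(meta_buf)?;
    r.read_exact(body_buf)?;
    Ok(out)
}

/// Blockwise pattern: one read covering metadata and body together.
fn read_one_step<R: Read + Seek>(
    r: &mut R,
    offset: u64,
    meta: usize,
    body: usize,
) -> io::Result<Vec<u8>> {
    r.seek(SeekFrom::Start(offset))?;
    let mut out = vec![0u8; meta + body];
    r.read_exact(&mut out)?;
    Ok(out)
}
```

Wrapping a `std::io::Cursor` in `CountingReader` and calling each function with non-empty metadata and body lengths should show two `read` calls for the two-step version and one for the blockwise version, with identical bytes returned. A real file or network reader may split a large read into several calls.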

@tustvold tustvold merged commit 9630aaf into apache:master Dec 8, 2023
25 checks passed