Blockwise IO in IPC FileReader (#5153) #5179
reader.seek(SeekFrom::End(-6))?;
reader.read_exact(&mut magic_buffer)?;
if magic_buffer != super::ARROW_MAGIC {
/// Returns errors if the file does not meet the Arrow Format footer requirements
It should be sufficient to just check the footer, and this provides a more predictable IO pattern
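For illustration, a footer-only check along these lines could look like the sketch below. It is not the code from this PR: `has_arrow_footer_magic` is a hypothetical helper name, and it only assumes that the file ends with the six magic bytes `ARROW1`, as the diff above suggests.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Magic bytes that close an Arrow IPC file.
const ARROW_MAGIC: [u8; 6] = *b"ARROW1";

/// Hypothetical helper (not part of arrow-ipc): returns `Ok(true)` when the
/// last six bytes of the reader are the Arrow magic.
fn has_arrow_footer_magic<R: Read + Seek>(reader: &mut R) -> io::Result<bool> {
    let mut magic = [0u8; 6];
    // One seek relative to the end, then one read; the start of the file is
    // never touched, so the IO pattern is the same for every file.
    reader.seek(SeekFrom::End(-6))?;
    reader.read_exact(&mut magic)?;
    Ok(magic == ARROW_MAGIC)
}
```

With an in-memory `std::io::Cursor`, a buffer ending in `ARROW1` yields `Ok(true)`, and a buffer shorter than six bytes yields an error from the seek.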
@@ -498,10 +498,34 @@ pub fn read_dictionary(
    Ok(())
}

/// Read the data for a given block
fn read_block<R: Read + Seek>(mut reader: R, block: &Block) -> Result<Buffer, ArrowError> {
I am making an assumption here that blocks should be large enough to be an appropriate IO size
Perhaps @pitrou might know if this is a bad idea?
Given that a block typically points to an entire record batch, yes, a block is certainly large enough.
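To make the "one read per block" idea concrete, here is a rough sketch. It is an approximation rather than the PR's `read_block`: `BlockRef` is a hypothetical stand-in for the flatbuffer `Block` footer entry (offset, metadata length, body length), and a plain `Vec<u8>` is used instead of Arrow's `Buffer`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a footer `Block` entry: where the message
/// starts and how long its metadata and body are.
struct BlockRef {
    offset: u64,
    metadata_len: usize,
    body_len: usize,
}

/// Reads metadata and body together with a single seek and a single
/// `read_exact`, instead of fetching the header and the body separately.
fn read_block_bytes<R: Read + Seek>(reader: &mut R, block: &BlockRef) -> io::Result<Vec<u8>> {
    // Guard against a corrupt footer whose lengths overflow when summed.
    let total = block
        .metadata_len
        .checked_add(block.body_len)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "block length overflow"))?;
    reader.seek(SeekFrom::Start(block.offset))?;
    let mut buf = vec![0u8; total];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

Since a block typically covers a whole record batch, the buffer allocated here is roughly the size of one batch, which is the IO size being discussed in this thread.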
///
/// <https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format>
fn parse_message(buf: &[u8]) -> Result<Message, ArrowError> {
    let buf = match buf[..4] == CONTINUATION_MARKER {
This does highlight a somewhat peculiar quirk of the IPC file format: it doesn't actually care about the size prefixes. I guess this is just a historical artifact of the fact that IPC streams came first.
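As a sketch of what ignoring the size prefix means in practice: the encapsulated message format allows an optional `0xFFFFFFFF` continuation marker followed by a 4-byte metadata length, and a file reader that already knows the block length from the footer can skip both. The helper below is hypothetical (`message_payload` is not an arrow-ipc function) and returns `None` for truncated input instead of panicking.

```rust
/// Marker that may precede the metadata length in an encapsulated message.
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: skips the optional continuation marker and the
/// 4-byte length prefix, returning the bytes where the flatbuffer message
/// starts. The length prefix itself is never decoded.
fn message_payload(buf: &[u8]) -> Option<&[u8]> {
    // With a marker there are 8 bytes of prefix, otherwise only 4.
    let prefix_len = if buf.get(..4)? == CONTINUATION_MARKER.as_slice() {
        8
    } else {
        4
    };
    buf.get(prefix_len..)
}
```

The length bytes are skipped rather than validated, which is the quirk noted above: the footer's block entry is the only source of size information the file reader relies on.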
Thank you for this change @tustvold -- this PR makes sense to me.
let message = crate::root_as_message(&block_data[..]).map_err(|err| {
    ArrowError::ParseError(format!("Unable to get root as message: {err:?}"))
})?;
let buf = read_block(&mut reader, block)?;
Reducing the number of copies of the code that does the read seems like a good improvement to me.
It seems like previously the code did 2 IOs per block (at least) to read the header and then the data. With this PR it will only do 1 IO a block (given there is no continuation)
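The difference in request count can be illustrated with a small counting wrapper. This is only a toy model under stated assumptions: `CountingReader`, `reads_for_split`, and `reads_for_blockwise` are hypothetical names, and the in-memory `Cursor` satisfies every `read` in full, whereas a real file or network reader may need more calls per `read_exact`.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical wrapper that counts calls to `read` on the inner reader.
struct CountingReader<R> {
    inner: R,
    reads: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.reads += 1;
        self.inner.read(buf)
    }
}

/// Header and body fetched separately, as in the older pattern described above.
fn reads_for_split(data: &[u8], header_len: usize) -> io::Result<usize> {
    let mut reader = CountingReader { inner: Cursor::new(data), reads: 0 };
    let mut header = vec![0u8; header_len];
    reader.read_exact(&mut header)?;
    let mut body = vec![0u8; data.len() - header_len];
    reader.read_exact(&mut body)?;
    Ok(reader.reads)
}

/// Whole block fetched with a single request.
fn reads_for_blockwise(data: &[u8]) -> io::Result<usize> {
    let mut reader = CountingReader { inner: Cursor::new(data), reads: 0 };
    let mut block = vec![0u8; data.len()];
    reader.read_exact(&mut block)?;
    Ok(reader.reads)
}
```

For a 16-byte block with a 4-byte header, the split version issues two reads against the cursor and the blockwise version issues one.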
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Part of #5153
Rationale for this change
This is the first part of being able to extract a BufferRead trait, allowing for zero-copy deserialization of IPC files loaded into Buffer.
What changes are included in this PR?
Uses large blockwise reads to avoid the need for BufReader and to reduce the number of IO requests necessary to read a file.
Are there any user-facing changes?