Blockwise IO in IPC FileReader (#5153) #5179

tustvold · 2023-12-06T23:20:27Z

Which issue does this PR close?

Part of #5153

Rationale for this change

This is the first step toward extracting a BufferRead trait that allows zero-copy deserialization of IPC files loaded into a Buffer.

What changes are included in this PR?

Uses large blockwise reads to avoid the need for BufReader and to reduce the number of IO requests necessary to read a file.

Are there any user-facing changes?

tustvold · 2023-12-06T23:22:04Z

arrow-ipc/src/reader.rs

-        reader.seek(SeekFrom::End(-6))?;
-        reader.read_exact(&mut magic_buffer)?;
-        if magic_buffer != super::ARROW_MAGIC {
+    /// Returns errors if the file does not meet the Arrow Format footer requirements


It should be sufficient to check just the footer, and this gives a more predictable IO pattern.
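For illustration, a footer-only check could look roughly like the sketch below. It relies on the file layout from the Arrow format docs, where a file ends with a 4-byte little-endian footer length followed by the `ARROW1` magic. The function name `read_footer_len` and the use of `std::io::Error` are assumptions made for this sketch; this is not the code in the PR.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Magic bytes that terminate an Arrow IPC file.
const ARROW_MAGIC: [u8; 6] = *b"ARROW1";

/// Hypothetical helper: reads the 10-byte trailer (4-byte little-endian
/// footer length followed by the 6-byte magic) with one seek and one read,
/// and returns the footer length.
fn read_footer_len<R: Read + Seek>(mut reader: R) -> io::Result<usize> {
    let mut trailer = [0u8; 10];
    reader.seek(SeekFrom::End(-10))?;
    reader.read_exact(&mut trailer)?;

    // The magic occupies the last 6 bytes of the file.
    if trailer[4..] != ARROW_MAGIC {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "file does not end with the ARROW1 magic",
        ));
    }

    // The 4 bytes before the magic hold the footer length.
    let len = i32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    usize::try_from(len)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "negative footer length"))
}
```

Because the trailer is fetched in a single read from a known position, the reader never has to go back to the start of the file just to validate it.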

tustvold · 2023-12-06T23:24:38Z

arrow-ipc/src/reader.rs

@@ -498,10 +498,34 @@ pub fn read_dictionary(
    Ok(())
 }

+/// Read the data for a given block
+fn read_block<R: Read + Seek>(mut reader: R, block: &Block) -> Result<Buffer, ArrowError> {


I am assuming here that blocks are large enough to be an appropriate IO size.

Perhaps @pitrou might know if this is a bad idea?

pitrou · 2023-12-07

Given that a block typically points to an entire record batch, yes, a block is certainly large enough.
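To make the idea concrete, here is a minimal std-only sketch of reading a whole block in one request. `BlockRef` is a hypothetical stand-in for the footer's `Block` entry (offset, metadata length, body length), and `read_block_bytes` returns a plain `Vec<u8>` rather than an Arrow `Buffer`. Both names are assumptions for the example, not the signatures used in `arrow-ipc`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a footer `Block` entry.
struct BlockRef {
    /// Byte offset of the block from the start of the file.
    offset: u64,
    /// Length of the message metadata, including any size prefix.
    metadata_len: usize,
    /// Length of the message body.
    body_len: usize,
}

/// Reads metadata and body together with one seek and one `read_exact`.
fn read_block_bytes<R: Read + Seek>(mut reader: R, block: &BlockRef) -> io::Result<Vec<u8>> {
    let total = block
        .metadata_len
        .checked_add(block.body_len)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "block length overflow"))?;

    let mut buf = vec![0u8; total];
    reader.seek(SeekFrom::Start(block.offset))?;
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

Since a block usually covers a whole record batch, the single read is typically large, which is what makes dropping `BufReader` reasonable.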

tustvold · 2023-12-06T23:26:51Z

arrow-ipc/src/reader.rs

+///
+/// <https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format>
+fn parse_message(buf: &[u8]) -> Result<Message, ArrowError> {
+    let buf = match buf[..4] == CONTINUATION_MARKER {


This does highlight a somewhat peculiar quirk of the IPC file format: it doesn't actually care about the size prefixes, because the footer already records where each block starts and how long it is. I guess this is just a historical artifact of the fact that IPC streams came first.
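As a rough illustration of what skipping the prefix involves: per the encapsulated message format, a message starts either with the `0xFFFFFFFF` continuation marker followed by a 4-byte length, or (in the legacy layout) with the 4-byte length alone. The helper name `skip_size_prefix` below is hypothetical, and the sketch returns `Option` instead of an `ArrowError` so it stays std-only.

```rust
/// Continuation marker that precedes the length in the current format.
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: returns the bytes after the size prefix, or `None`
/// if the buffer is too short to contain the prefix.
fn skip_size_prefix(buf: &[u8]) -> Option<&[u8]> {
    // Marker + length is 8 bytes; a legacy length-only prefix is 4 bytes.
    let prefix_len = if buf.get(..4)? == &CONTINUATION_MARKER[..] {
        8
    } else {
        4
    };
    buf.get(prefix_len..)
}
```

Note that the length value itself is never consulted here, which is the quirk described above: in a file, the block entry already bounds the message.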

alamb

Thank you for this change @tustvold -- this PR makes sense to me.

alamb · 2023-12-08T17:58:31Z

arrow-ipc/src/reader.rs

-                let message = crate::root_as_message(&block_data[..]).map_err(|err| {
-                    ArrowError::ParseError(format!("Unable to get root as message: {err:?}"))
-                })?;
+                let buf = read_block(&mut reader, block)?;


Reducing the number of copies of the code that does the read seems like a good improvement to me.

It seems like the code previously did at least 2 IOs per block, one to read the header and one to read the data. With this PR it will do only 1 IO per block (given there is no continuation).
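The difference in request count can be demonstrated with a small counting wrapper. Everything here is a simplified assumption for illustration (`CountingReader`, `reads_for_two_step`, and `reads_for_blockwise` are made-up names, and the "two-step" path only models a length prefix followed by its payload), so treat it as a sketch of the pattern rather than a reproduction of the old reader.

```rust
use std::io::{self, Cursor, Read};

/// Wraps a reader and counts calls to `read`, as a rough proxy for IO requests.
struct CountingReader<R> {
    inner: R,
    reads: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.reads += 1;
        self.inner.read(buf)
    }
}

/// Two-step pattern: read the 4-byte length prefix, then the bytes it describes.
fn reads_for_two_step(data: &[u8]) -> io::Result<usize> {
    let mut reader = CountingReader { inner: Cursor::new(data), reads: 0 };
    let mut prefix = [0u8; 4];
    reader.read_exact(&mut prefix)?;
    let len = u32::from_le_bytes(prefix) as usize;
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(reader.reads)
}

/// Blockwise pattern: the block length is known up front, so read it whole.
fn reads_for_blockwise(data: &[u8], block_len: usize) -> io::Result<usize> {
    let mut reader = CountingReader { inner: Cursor::new(data), reads: 0 };
    let mut block = vec![0u8; block_len];
    reader.read_exact(&mut block)?;
    Ok(reader.reads)
}
```

On an in-memory `Cursor` holding a 4-byte prefix and a 3-byte payload, the two-step function reports two reads and the blockwise one reports a single read. A real file or network reader may split a request further, so the counts are a lower bound.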

Blockwise IO in IPC FileReader (apache#5153)

8e2c15a

tustvold requested a review from viirya December 6, 2023 23:20

github-actions bot added the arrow Changes to the arrow crate label Dec 6, 2023

Docs

4b989de

Clippy

eae3ff5

tustvold mentioned this pull request Dec 7, 2023

Easy way to zero-copy IPC buffers. #5165

Closed

alamb mentioned this pull request Dec 8, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 4, 2023 apache/datafusion#8420

Closed

7 tasks

alamb approved these changes Dec 8, 2023

Update arrow-ipc/src/reader.rs

014ba5f

Co-authored-by: Andrew Lamb <[email protected]>

tustvold merged commit 9630aaf into apache:master Dec 8, 2023
25 checks passed
