Support get offsets or blocks info from arrow file. #5252

Closed
my-vegetable-has-exploded opened this issue Dec 28, 2023 · 7 comments · Fixed by #5249
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@my-vegetable-has-exploded
Contributor

my-vegetable-has-exploded commented Dec 28, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It seems that there is no public function that exposes the offsets or block info of an Arrow file.

Describe the solution you'd like

In arrow-ipc/src/reader.rs,

impl<R: Read + Seek> FileReader<R> {
    /// Returns the blocks recorded in the file footer.
    pub fn blocks(&self) -> &[Block] {
        &self.blocks
    }
    // OR
    /// Returns the file offset of each block.
    pub fn offsets(&self) -> Vec<i64> {
        self.blocks.iter().map(Block::offset).collect()
    }
}

Describe alternatives you've considered

Additional context

related to apache/datafusion#8503

@my-vegetable-has-exploded my-vegetable-has-exploded added the enhancement Any new improvement worthy of a entry in the changelog label Dec 28, 2023
@tustvold
Contributor

See #5249

@my-vegetable-has-exploded
Contributor Author

> See #5249

Sorry, I didn't find what I was looking for in this project. Could you give me more hints?

@tustvold
Contributor

The FileDecoder provides a mechanism to control how the various parts of a file are decoded and processed, as per the linked DF ticket.

Getting the blocks and offsets from FileReader isn't very useful as there is no way to actually control the IO that it performs.

@my-vegetable-has-exploded
Contributor Author

> The FileDecoder provides a mechanism to control how the various parts of a file are decoded and processed, as per the linked DF ticket.

I'm not sure I understand this correctly. According to the FileDecoder doctest, we first read the footer, then use FileDecoder to read the individual record batches.
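
For readers following along, the "read the footer first" step can be sketched with the standard library alone. This is a minimal illustration, not the arrow-ipc API: `footer_range` is a hypothetical helper name, and it relies only on the Arrow IPC file layout, in which the file ends with the footer, a 4-byte little-endian footer length, and the magic bytes `ARROW1`.

```rust
use std::ops::Range;

/// Magic bytes that terminate an Arrow IPC file.
const ARROW_MAGIC: [u8; 6] = *b"ARROW1";

/// Hypothetical helper (not part of arrow-ipc): locate the footer bytes
/// inside a complete, in-memory Arrow IPC file.
///
/// The file ends with `<footer> <footer length: i32 LE> <"ARROW1">`,
/// so the last 10 bytes say where the footer starts.
pub fn footer_range(file: &[u8]) -> Option<Range<usize>> {
    // The 10-byte trailer: 4 bytes of length followed by 6 bytes of magic.
    let trailer_start = file.len().checked_sub(10)?;
    let trailer = &file[trailer_start..];
    if trailer[4..] != ARROW_MAGIC {
        return None;
    }
    let len = i32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if len < 0 {
        return None;
    }
    // The footer sits immediately before the trailer.
    let footer_start = trailer_start.checked_sub(len as usize)?;
    Some(footer_start..trailer_start)
}
```

The bytes in the returned range are what would then be parsed as the footer, which in turn lists the dictionary and record batch blocks.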

@tustvold
Contributor

Correct, which gives you the ability to decode said RecordBatch in parallel in much the same way as we do for parquet row groups
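
A rough sketch of that idea, using only the standard library: once the footer yields each block's offset and lengths, every block maps to an independent byte range that can be handed to its own worker. `BlockInfo` and `decode_blocks_in_parallel` are hypothetical names invented for this illustration, standing in for the footer's block entries and for whatever decode step the caller supplies (with arrow-ipc that would presumably be a `FileDecoder` call). One thread per block is the simplest possible scheduling, not a recommendation.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical stand-in for a footer block entry: where the block starts
/// and how long its metadata and body are.
#[derive(Debug, Clone, Copy)]
pub struct BlockInfo {
    pub offset: i64,
    pub metadata_len: i32,
    pub body_len: i64,
}

impl BlockInfo {
    /// Byte range covered by this block (metadata followed by body).
    pub fn byte_range(&self) -> Range<usize> {
        let start = self.offset as usize;
        start..start + self.metadata_len as usize + self.body_len as usize
    }
}

/// Run `decode` on every block's bytes, one scoped thread per block.
/// Results come back in the same order as `blocks`.
pub fn decode_blocks_in_parallel<T, F>(file: &[u8], blocks: &[BlockInfo], decode: F) -> Vec<T>
where
    T: Send,
    F: Fn(&[u8]) -> T + Sync,
{
    thread::scope(|s| {
        let handles: Vec<_> = blocks
            .iter()
            .map(|b| {
                let bytes = &file[b.byte_range()];
                let decode = &decode;
                s.spawn(move || decode(bytes))
            })
            .collect();
        // Joining in spawn order keeps the output aligned with `blocks`.
        handles
            .into_iter()
            .map(|h| h.join().expect("decode thread panicked"))
            .collect()
    })
}
```

Because the caller chooses how each byte range is fetched and decoded, the same shape works whether the bytes come from memory, a local file, or an object store.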

@my-vegetable-has-exploded
Contributor Author

> Correct, which gives you the ability to decode said RecordBatch in parallel in much the same way as we do for parquet row groups

Got it! Thanks.

@tustvold tustvold added the arrow Changes to the arrow crate label Jan 5, 2024
@tustvold
Contributor

tustvold commented Jan 5, 2024

label_issue.py automatically added labels {'arrow'} from #5249
