Plan for async ChunkReader? #924

Closed
neverchanje opened this issue Nov 8, 2021 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

neverchanje commented Nov 8, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The ChunkReader trait (https://docs.rs/parquet/6.1.0/parquet/file/reader/trait.ChunkReader.html) is designed to hide the details of the underlying storage. That storage, however, may expose an async API, as rust-s3 does:

pub async fn get_object_range

https://durch.github.io/rust-s3/s3/bucket/struct.Bucket.html#method.get_object_range

That means every get_read call has to be implemented by blocking on an async function.

impl parquet::file::reader::ChunkReader for ParquetChunkReader {
  type T = SliceableCursor;

  fn get_read(&self, start: u64, length: usize) -> parquet::errors::Result<Self::T> {
    // Block the calling thread until the async S3 range request completes.
    futures::executor::block_on(async move {
      bucket.get_object_range(...).await
    })
  }
}

The non-async ChunkReader trait causes problems when the query execution engine is itself async. With tokio specifically, the engine has to wrap every parquet read in tokio::task::spawn_blocking.
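To make the cost of that workaround concrete, here is a std-only sketch of what offloading a blocking read amounts to: the read runs on a separate thread and the caller picks up the result from a channel. This is only an analogue of spawn_blocking, not tokio's implementation, and `blocking_get_read` and `read_off_thread` are hypothetical names that exist only for this illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a synchronous `get_read`: copies `length` bytes
/// starting at `start`, as a blocking storage call would return them.
fn blocking_get_read(data: &[u8], start: u64, length: usize) -> Result<Vec<u8>, String> {
    let begin = start as usize;
    let end = begin.checked_add(length).ok_or("range overflow")?;
    data.get(begin..end)
        .map(|s| s.to_vec())
        .ok_or_else(|| format!("range {begin}..{end} out of bounds"))
}

/// Runs the blocking read on its own thread so the caller's thread stays free.
/// The result arrives on the returned channel (assumption: one thread per read,
/// whereas a real runtime would reuse a pool of blocking threads).
fn read_off_thread(
    data: Vec<u8>,
    start: u64,
    length: usize,
) -> mpsc::Receiver<Result<Vec<u8>, String>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error: it only means the caller dropped the receiver.
        let _ = tx.send(blocking_get_read(&data, start, length));
    });
    rx
}
```

Every read pays for a thread hand-off and a channel round trip, which is the overhead an async get_read would avoid.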

Describe the solution you'd like

We can add async to ChunkReader::get_read.

#[async_trait::async_trait]
pub trait ChunkReader: Length {
    type T: Read;

    async fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
}
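For reference, a trait like this can also be written by hand without the async_trait macro, by returning a boxed future from get_read. The sketch below uses only std. `AsyncChunkReader`, `InMemory`, and `poll_once` are hypothetical names for illustration and are not part of the parquet crate, and the error type is std::io::Error rather than the crate's own.

```rust
use std::future::Future;
use std::io::{Cursor, Error, ErrorKind, Read};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical async counterpart of `ChunkReader` (not the parquet crate's API).
trait AsyncChunkReader {
    type T: Read;
    fn get_read(&self, start: u64, length: usize) -> BoxFuture<'_, std::io::Result<Self::T>>;
}

/// In-memory storage, used only to exercise the trait.
struct InMemory(Vec<u8>);

impl AsyncChunkReader for InMemory {
    type T = Cursor<Vec<u8>>;

    fn get_read(&self, start: u64, length: usize) -> BoxFuture<'_, std::io::Result<Self::T>> {
        Box::pin(async move {
            let begin = start as usize;
            let end = begin.saturating_add(length);
            match self.0.get(begin..end) {
                Some(slice) => Ok(Cursor::new(slice.to_vec())),
                None => Err(Error::new(ErrorKind::UnexpectedEof, "range out of bounds")),
            }
        })
    }
}

/// Waker that does nothing; sufficient because the future above never suspends.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future exactly once. A real caller would `.await` inside a runtime.
fn poll_once<F: Future + ?Sized>(fut: Pin<&mut F>) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    fut.poll(&mut cx)
}
```

An S3-backed implementation would await its range request inside the boxed future instead of blocking a thread.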

Many of the other APIs would also need to become async. So far I have no clear plan for migrating them to async versions while keeping backward compatibility.

One possible solution is to provide a fully async API and have the non-async API wrap it via async_std::task::block_on.

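#[async_trait::async_trait]
pub trait RowGroupReaderAsync {
  async fn get_column_reader(&self, i: usize) -> Result<ColumnReader> { ... }
}

impl RowGroupReader {
  // The synchronous API blocks on its async counterpart.
  fn get_column_reader(&self, i: usize) -> Result<ColumnReader> {
    async_std::task::block_on(self.get_column_reader_async(i))
  }
}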
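A compilable sketch of the same layering, using only std so it does not depend on a particular runtime. The `block_on` here is a minimal thread-parking helper standing in for the runtime's own. The types are hypothetical: columns are plain `i64` vectors rather than the crate's ColumnReader, and the method bodies are placeholders.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Unparks the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal `block_on`: polls the future, parking the thread while it is pending.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async-first row group reader; column data is faked as i64 values.
struct RowGroupReaderAsync {
    columns: Vec<Vec<i64>>,
}

impl RowGroupReaderAsync {
    async fn get_column_reader(&self, i: usize) -> Result<&[i64], String> {
        self.columns
            .get(i)
            .map(|c| c.as_slice())
            .ok_or_else(|| format!("column {i} out of range"))
    }
}

/// Synchronous facade that keeps the old call shape by blocking on the async API.
struct RowGroupReader {
    inner: RowGroupReaderAsync,
}

impl RowGroupReader {
    fn get_column_reader(&self, i: usize) -> Result<&[i64], String> {
        block_on(self.inner.get_column_reader(i))
    }
}
```

One caveat with this design: calling the synchronous facade from inside an async task still blocks that executor thread, so async callers should use the async type directly.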


@neverchanje neverchanje added the enhancement Any new improvement worthy of a entry in the changelog label Nov 8, 2021
@alamb alamb added the parquet Changes to the parquet crate label Nov 18, 2021
alamb (Contributor) commented Nov 18, 2021

Hi @neverchanje -- sorry for the late follow up

There are some thoughts on approaches in this issue: #111 (comment) -- it may also be worth taking a look at arrow2 to see if that does what you need

alamb (Contributor) commented Jan 13, 2022

@tustvold has a proposal / POC in this PR: #1154

alamb (Contributor) commented Feb 4, 2022

This was added in #1154, in arrow 9.0.0, slated for release in ~ 3 days

@alamb alamb closed this as completed Feb 4, 2022