Provide an async ParquetReader for arrow #111

Closed
alamb opened this issue Apr 26, 2021 · 3 comments · Fixed by #1154
Labels
parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10307

The aim of this issue is to discuss, and try to implement, async support for the read traits in the Parquet crate.

It focuses on the read part to limit the complexity and impact of the changes. The design choices should also make sense for the write part.

Related issues:
ARROW-9275 is a more generic and abstract discussion about async; this issue focuses on Parquet reads.

ARROW-9464 focuses on threading in DataFusion but overlaps with this issue when DataFusion reads from Parquet.

@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@jorgecarleitao jorgecarleitao added parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels Apr 29, 2021
@alamb
Contributor Author

alamb commented Sep 12, 2021

There is a PR in arrow2 for such functionality: jorgecarleitao/arrow2#260 which may serve as an inspiration

@alamb
Contributor Author

alamb commented Sep 12, 2021

The approach that @jorgecarleitao took in jorgecarleitao/arrow2#260 is quite clever. Rather than a single struct that can read parquet files both synchronously and asynchronously, I think he effectively added a second API that reads the required portions of the file into memory buffers and then reuses the encoding/decoding logic shared with the serialized reader.

Thus, one idea for adding async support to the parquet crate might be to follow this example and create a new reader like AsyncFileReader (alongside the existing SerializedFileReader) that handles the I/O to fetch the required parts (e.g. the bytes that contain metadata, or encoded pages) and then calls into the existing encoder/decoder logic.

Something like:

               ┌────────────────────────────┐                
               │ Existing common encoding + │                
               │decoding logic that operates│                
               │     on bytes in memory     │                
               └────────────────────────────┘                
                              ▲                              
                 ┌────────────┴──────────┐                   
                 │                       │                   
                 │                       │                   
            .─────────.             .─────────.              
         ,─'           '─.       ,─'           '─.           
        ;    Logic to     :     ;  new logic to   :          
        :   read bytes    ;     :   read bytes    ;          
         ╲ synchronously ╱       ╲asynchronously ╱           
          '─.         ,─'         '─.         ,─'            
             `───────'               `───────'               
                 ▲                       ▲                   
            ┌────┘                       └──────┐            
            │                                   │            
            │                                   │            
┌───────────────────────┐           ┌───────────────────────┐
│ SerializedFileReader  │           │    AsyncFileReader    │
└───────────────────────┘           └───────────────────────┘
                                                             
    existing parquet                           new           
          crate                       entrypoint for async   
                                             reader          

Here is the current read API:
https://docs.rs/parquet/5.3.0/parquet/file/reader/index.html
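
To make the split in the diagram concrete, here is a minimal sketch using only the Rust standard library. It is not the parquet crate's API: `AsyncRangeFetch`, `InMemory`, `decode_metadata_len`, `metadata_len_sync`, `metadata_len_async`, `block_on` and `demo` are all hypothetical names invented for illustration. The one format fact it relies on is that a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`. The decoding function only sees bytes in memory, and the sync and async paths differ only in how those bytes are fetched.

```rust
use std::future::Future;
use std::io::{self, Cursor, Read, Seek, SeekFrom};
use std::ops::Range;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A Parquet file ends with a 4-byte little-endian metadata length + "PAR1".
const FOOTER_LEN: u64 = 8;
const MAGIC: &[u8; 4] = b"PAR1";

/// Shared decoding logic: operates purely on bytes already in memory.
fn decode_metadata_len(footer: &[u8]) -> io::Result<u32> {
    if footer.len() != FOOTER_LEN as usize || footer[4..] != MAGIC[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a Parquet footer"));
    }
    Ok(u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]))
}

/// Synchronous path: blocking seek + read, then the shared decoder.
fn metadata_len_sync<R: Read + Seek>(reader: &mut R) -> io::Result<u32> {
    reader.seek(SeekFrom::End(-(FOOTER_LEN as i64)))?;
    let mut buf = [0u8; FOOTER_LEN as usize];
    reader.read_exact(&mut buf)?;
    decode_metadata_len(&buf)
}

type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical trait: anything that can asynchronously fetch a byte range
/// (for example an object store client).
trait AsyncRangeFetch {
    fn fetch(&self, range: Range<u64>) -> BoxFuture<'_, io::Result<Vec<u8>>>;
}

/// Asynchronous path: await the bytes, then call the same shared decoder.
async fn metadata_len_async<F: AsyncRangeFetch>(source: &F, file_len: u64) -> io::Result<u32> {
    let start = file_len
        .checked_sub(FOOTER_LEN)
        .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "file too small"))?;
    let footer = source.fetch(start..file_len).await?;
    decode_metadata_len(&footer)
}

/// Toy source backed by a Vec so the sketch runs without any real I/O.
struct InMemory(Vec<u8>);

impl AsyncRangeFetch for InMemory {
    fn fetch(&self, range: Range<u64>) -> BoxFuture<'_, io::Result<Vec<u8>>> {
        Box::pin(async move {
            self.0
                .get(range.start as usize..range.end as usize)
                .map(|bytes| bytes.to_vec())
                .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))
        })
    }
}

/// Tiny busy-polling executor, adequate only for this demo; real code would
/// use a runtime such as tokio.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Builds a fake "file" (10 payload bytes, metadata length 42, magic) and
/// reads the metadata length through both paths.
fn demo() -> io::Result<(u32, u32)> {
    let mut file = vec![0u8; 10];
    file.extend_from_slice(&42u32.to_le_bytes());
    file.extend_from_slice(MAGIC);

    let sync_len = metadata_len_sync(&mut Cursor::new(file.clone()))?;
    let source = InMemory(file);
    let async_len = block_on(metadata_len_async(&source, source.0.len() as u64))?;
    Ok((sync_len, async_len))
}
```

Calling `demo()` yields the same metadata length from both paths, which is the point of the design: only the byte-fetching layer is duplicated, while the decoding logic stays shared.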

cc @yjshen

@alamb
Contributor Author

alamb commented Jan 11, 2022

FYI @yjshen @neverchanje: @tustvold has created a proof of concept of an async parquet reader in #1154

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 19, 2022
Add Sync + Send bounds to parquet crate
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 28, 2022
Add Sync + Send bounds to parquet crate
alamb pushed a commit to alamb/arrow-rs that referenced this issue Feb 1, 2022
Add Sync + Send bounds to parquet crate
alamb pushed a commit that referenced this issue Feb 2, 2022
* Async parquet reader (#111)

Add Sync + Send bounds to parquet crate

* Remove Sync from DataType

* Review feedback

* Add basic test

* Fix lints

* Review feedback

* Tweak CI