Asynchronous geoparquet reader #493
Conversation
Adapted mostly from the synchronous read_geoparquet function, but with async/await statements.
src/io/parquet/reader.rs (Outdated):

```rust
let stream: ParquetRecordBatchStream<T> = builder.build().unwrap();
let batches: Vec<RecordBatch> = stream.try_collect::<Vec<_>>().await.unwrap();
```
Getting this error about unsatisfied trait bounds:

```
error[E0599]: the method `try_collect` exists for struct `ParquetRecordBatchStream<T>`, but its trait bounds were not satisfied
   --> src/io/parquet/reader.rs:158:44
    |
158 |     let batches: Vec<RecordBatch> = stream.try_collect::<Vec<_>>().await.unwrap();
    |                                            ^^^^^^^^^^^ method cannot be called on `ParquetRecordBatchStream<T>` due to unsatisfied trait bounds
    |
   ::: ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-50.0.0/src/arrow/async_reader/mod.rs:554:1
```
A little unsure about how to handle this: do I set the bounds on L157's `ParquetRecordBatchStream<T>`, or somewhere else?
I agree it can be a confusing error message that is hard to debug. In this case, I went to the docs of `ParquetRecordBatchStream` and looked at its implementation of `Stream`. In that docstring, the bound on `T` is `T: AsyncFileReader + Unpin + Send + 'static`, so the inference is that the bound on `T` in this function needs to additionally have `Unpin`, which indeed fixed the compilation issue.
(I don't actually know what Unpin does, since I don't do a lot of async. If you'd like to learn more, I'd suggest this deep dive: https://fasterthanli.me/articles/pin-and-suffering)
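For concreteness, a minimal sketch of what the fixed bound looks like (the `collect_batches` function name is just for illustration, not the PR's actual code):

```rust
use arrow_array::RecordBatch;
use futures::TryStreamExt; // provides `try_collect` on streams
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStream};

// Without `Unpin` in the bound, `ParquetRecordBatchStream<T>` does not
// implement `Stream`, so `try_collect` is unavailable (the E0599 above).
async fn collect_batches<T: AsyncFileReader + Unpin + Send + 'static>(
    stream: ParquetRecordBatchStream<T>,
) -> Vec<RecordBatch> {
    stream.try_collect::<Vec<_>>().await.unwrap()
}
```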
Ah cool, thanks for the explanation! I was banging my head trying to figure out what was missing. Now I know more about trait bounds (and how to read the docs a bit better to find the traits) 🧑‍🎓
Thanks! I pushed a commit with the following changes:
Create a dedicated reader_async.rs file to put read_geoparquet_async, and move the parse_geoparquet_metadata function from reader.rs to geoparquet_metadata.rs.
> Oh good, my IDE was giving me squiggly red lines under the …

Yep, I've got rust-analyzer set up, but my IDE (Pulsar) only shows those inferred types when I hover over them. OK to remove them if you prefer it that way.
Thanks for setting up that new feature flag! I was wondering yesterday too if async should be enabled by default with the …
```diff
 pub struct GeoParquetReaderOptions {
-    batch_size: usize,
-    coord_type: CoordType,
+    pub batch_size: usize,
+    pub coord_type: CoordType,
 }

 impl GeoParquetReaderOptions {
```
Should this `GeoParquetReaderOptions` struct go in a common place like src/io/parquet/structs.rs, since it will be used in both the sync/async GeoParquet readers?
Added a common function to get the arrow_schema, geometry_column_index, and target_geo_data_type out of the GeoParquet file for both the async/sync readers. Also turned parse_geoparquet_metadata back into a private func.
Needed to use tokio::fs::File, which implements the AsyncRead trait; otherwise the unit test is similar to the synchronous version in reader.rs.
> Bring in object-store crate to read from URL (if it gets complicated, maybe split it into a separate PR)

Given how 'big' this PR already is, I'm tempted to bring in object-store in a separate PR. I see that you're already doing some stuff with FlatGeobuf x object-store in #494.
```rust
let Ok((geometry_column_index, target_geo_data_type)) =
    parse_geoparquet_metadata(parquet_meta.file_metadata(), &arrow_schema, *coord_type)
else {
    panic!("Cannot parse geoparquet metadata");
};
```
Probably should return a GeoArrowError here instead of panic? A little unsure if this let-else syntax is the idiomatic way of doing things...
You should change the return type of this function to `-> Result<(Arc<Schema>, usize, Option<GeoDataType>)>`. Note that this `Result` points to `crate::error::Result`, which shadows the standard library's `Result`; see line 73 in eb94df3:

```rust
pub type Result<T> = std::result::Result<T, GeoArrowError>;
```
Now that this function returns a `Result`, you can use the `?` operator to propagate errors instead of panicking. So this can change to:
```diff
-let Ok((geometry_column_index, target_geo_data_type)) =
-    parse_geoparquet_metadata(parquet_meta.file_metadata(), &arrow_schema, *coord_type)
-else {
-    panic!("Cannot parse geoparquet metadata");
-};
+let (geometry_column_index, target_geo_data_type) =
+    parse_geoparquet_metadata(parquet_meta.file_metadata(), &arrow_schema, *coord_type)?;
+Ok((arrow_schema, geometry_column_index, target_geo_data_type))
```
That is, we call the first line, unwrapping the result if it's an `Ok` or propagating the error upwards if it's an `Err`, and then we return an `Ok` with the data this function produces.
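For a self-contained illustration of the same let-else-with-panic to `?` refactor, here's a minimal runnable sketch (hypothetical `GeoArrowError` and parsing logic, not the crate's actual code):

```rust
#[derive(Debug)]
struct GeoArrowError(String);

type Result<T> = std::result::Result<T, GeoArrowError>;

// Stand-in for parse_geoparquet_metadata: fails on malformed input.
fn parse_metadata(raw: &str) -> Result<(usize, String)> {
    let (idx, name) = raw
        .split_once(':')
        .ok_or_else(|| GeoArrowError("Cannot parse geoparquet metadata".into()))?;
    let idx = idx
        .parse()
        .map_err(|_| GeoArrowError("Invalid column index".into()))?;
    Ok((idx, name.to_string()))
}

// The caller uses `?` to propagate the Err upward instead of panicking.
fn build_schema(raw: &str) -> Result<(usize, String)> {
    let (geometry_column_index, name) = parse_metadata(raw)?;
    Ok((geometry_column_index, name))
}

fn main() {
    assert!(build_schema("0:geometry").is_ok());
    assert!(build_schema("no-colon").is_err());
}
```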
Then anywhere that calls `build_arrow_schema` can use `?` to propagate the error.
Ah yes, returning `Result` makes more sense. Done in 97b5911.
Follow what's used by the synchronous read_geoparquet function.
The nice thing is that object store integration can be as little as just documentation. This is because `ParquetObjectReader` implements `AsyncFileReader`, so following this example a user can just do:

```rust
let storage_container = Arc::new(MicrosoftAzureBuilder::from_env().build().unwrap());
let location = Path::from("path/to/blob.parquet");
let meta = storage_container.head(&location).await.unwrap();
println!("Found Blob with {}B at {}", meta.size, meta.location);
let reader = ParquetObjectReader::new(storage_container, meta);
let table = read_geoparquet_async(reader, options).await?;
```
No need to use the let-else statement with a panic! anymore, and this allows using the `?` operator to propagate errors when the function is called.
Add module-level documentation showing how to use read_geoparquet and read_geoparquet_async to read GeoParquet files into a GeoTable struct.
> Given how 'big' this PR already is, I'm tempted to bring in object-store in a separate PR. I see that you're already doing some stuff with FlatGeobuf x object-store in #494.
>
> The nice thing is that object store integration can be as little as just documentation. This is because `ParquetObjectReader` implements `AsyncFileReader`.
Ok, this PR should be ready for review! I've added some module-level example docs for `read_geoparquet_async` and `read_geoparquet`, but stopped short of mentioning object-store just yet. My idea with bringing in object-store was more on applying it on the Pyo3/Python side around here:
geoarrow-rs/python/core/src/io/parquet.rs, lines 20 to 26 in eb94df3:

```rust
pub fn read_parquet(path: String, batch_size: usize) -> PyGeoArrowResult<GeoTable> {
    let file = File::open(path).map_err(|err| PyFileNotFoundError::new_err(err.to_string()))?;
    let options = GeoParquetReaderOptions::new(batch_size, Default::default());
    let table = _read_geoparquet(file, options)?;
    Ok(GeoTable(table))
}
```
Specifically, making it possible for a user to pass in either a local path or a remote path like s3://bucket/example.geoparquet directly, and there would be a `match` statement to handle different object storage services (s3/az/gs/etc). There's this `object_store::parse_url` function that can be used to build the ObjectStore. But let's discuss the implementation in #492 first; I'm not entirely sure if it's better to have this parsing logic in geoarrow-rs, or leave it to the user to handle the object-store part themselves.
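For what it's worth, a rough sketch of that kind of path/URL dispatch (the `resolve` helper is hypothetical; `object_store::parse_url` picks a backend from the URL scheme, and remote backends need the corresponding object-store crate features enabled):

```rust
use object_store::{local::LocalFileSystem, parse_url, path::Path, ObjectStore};
use url::Url;

// Hypothetical helper: turn either a local path or a remote URL
// (s3://, az://, gs://, http(s)://, file://) into an ObjectStore + Path.
fn resolve(location: &str) -> object_store::Result<(Box<dyn ObjectStore>, Path)> {
    match Url::parse(location) {
        // Parses as a URL: let object_store choose the backend from the scheme.
        Ok(url) => parse_url(&url),
        // Otherwise treat it as a plain local filesystem path.
        // (Real code would likely prefer Path::from_filesystem_path here.)
        Err(_) => Ok((
            Box::new(LocalFileSystem::new()) as Box<dyn ObjectStore>,
            Path::from(location),
        )),
    }
}
```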
```rust
use tokio::fs::File;

#[tokio::test]
async fn nybb() {
    let file = File::open("fixtures/geoparquet/nybb.parquet")
        .await
        .unwrap();
    let options = GeoParquetReaderOptions::new(65536, Default::default());
    let _output_geotable = read_geoparquet_async(file, options).await.unwrap();
}
```
> ... This is because `ParquetObjectReader` implements `AsyncFileReader`. So following this example a user can just do:
>
> ```rust
> let storage_container = Arc::new(MicrosoftAzureBuilder::from_env().build().unwrap());
> let location = Path::from("path/to/blob.parquet");
> let meta = storage_container.head(&location).await.unwrap();
> println!("Found Blob with {}B at {}", meta.size, meta.location);
> let reader = ParquetObjectReader::new(storage_container, meta);
> let table = read_geoparquet_async(reader, options).await?;
> ```

Oh cool, that's real handy! One thing I'm a little confused about is how we can pass in a tokio::fs::File here, since it only implements Sync + Unpin, but read_geoparquet_async has the trait bound `<R: AsyncFileReader + Unpin + Send + 'static>`. Is the AsyncFileReader impl not absolutely necessary, or does ParquetRecordBatchStreamBuilder::new do some special handling?
`AsyncFileReader` is defined by the parquet crate. But also note this "blanket" implementation:

```rust
impl<T: AsyncRead + AsyncSeek + Unpin + Send> AsyncFileReader for T
```

That means that `AsyncFileReader` is automatically implemented for any type that already implements `AsyncRead` and `AsyncSeek`. A tokio `File` implements AsyncRead and implements AsyncSeek. Therefore the `File` (and anything else built for tokio's ecosystem) should automatically work with Parquet.
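The same mechanism in miniature (made-up traits, purely to show how a blanket impl propagates; nothing here is from the parquet crate):

```rust
// Hypothetical traits illustrating the blanket-impl pattern.
trait Read2 { fn read2(&self) -> u8; }
trait Seek2 { fn seek2(&self) -> u64; }
trait FileReader2 { fn describe(&self) -> String; }

// Blanket impl: anything that is Read2 + Seek2 gets FileReader2 for free.
impl<T: Read2 + Seek2> FileReader2 for T {
    fn describe(&self) -> String {
        format!("byte {} at offset {}", self.read2(), self.seek2())
    }
}

struct MyFile;
impl Read2 for MyFile { fn read2(&self) -> u8 { 42 } }
impl Seek2 for MyFile { fn seek2(&self) -> u64 { 0 } }

fn main() {
    // MyFile never implemented FileReader2 directly, yet `describe` works.
    println!("{}", MyFile.describe());
}
```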
Thanks for the explanation, learned something new again! Also starting to 'get' Rust traits a lot more 😀
Yeah, let's work out the Python bits in a different PR. In particular, I think it makes sense to have both an … I'm also not exactly sure what public Python API is ideal, so we should come back to that.
Any piece of code that you put in a docstring is automatically a doctest. So it needs to compile and run (and pass) unless you annotate it with …
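For reference, a sketch of the standard rustdoc doctest annotations being referred to (the placeholder function is hypothetical, not this repo's actual docs):

```rust
/// Reads a GeoParquet file asynchronously.
///
/// ```no_run
/// // `no_run` tells rustdoc to compile but not execute this example,
/// // which is useful when it needs local files or network access.
/// // (`ignore` would skip compiling it entirely.)
/// # async fn demo() {
/// // ... call read_geoparquet_async here ...
/// # }
/// ```
pub fn placeholder() {}
```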
thanks!
Implementing an asynchronous GeoParquet file reader using `ParquetRecordBatchStream`.

TODO:
- src/io/parquet/reader.rs: have the read_geoparquet and read_geoparquet_async functions parse the GeoParquet metadata using the same function
- Bring in object-store crate to read from URL (if it gets complicated, maybe split it into a separate PR)

Addresses #492

P.S. This is my first ever Rust PR, so take it easy 🙈