Implement support for RFC 86: Column-oriented read API for vector layers #367
That C header seems like a recipe for disaster. The structs are defined in |
Before my change, the generated gdal/gdal-sys/prebuilt-bindings/gdal_3.6.rs Lines 5927 to 5932 in bd8f877
Is there a better way to include those definitions?
Can you explain this more? Given that the definitions in that header are specified by the Arrow project to remain the same, seems fine to include the header this way. I copied that header in manually because the similar pyogrio PR (vectorized Python GDAL wrapper) did the same |
you also need to add |
Thank you! That was going to be my next question 😄 |
See the doc-comment. If I add that header to
#[doc = " Data type for a Arrow C stream Include ogr_recordbatch.h to get the\n definition."]
#[repr(C)]
#[derive(Debug, Copy, Clone)]
pub struct ArrowArrayStream {
pub get_schema: ::std::option::Option<
unsafe extern "C" fn(arg1: *mut ArrowArrayStream, out: *mut ArrowSchema) -> libc::c_int,
>,
pub get_next: ::std::option::Option<
unsafe extern "C" fn(arg1: *mut ArrowArrayStream, out: *mut ArrowArray) -> libc::c_int,
>,
pub get_last_error: ::std::option::Option<
unsafe extern "C" fn(arg1: *mut ArrowArrayStream) -> *const libc::c_char,
>,
pub release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut ArrowArrayStream)>,
pub private_data: *mut libc::c_void,
}
I understand that they're unlikely to change, but I'd be more comfortable using the same definition GDAL is using. It's conceptually correct, and we ship less code, so why not do it that way? |
The issue was that I'm pretty inexperienced with C and didn't know how to add |
Yeah, the I still have some doubts about the API -- I just skimmed the code, but is there a reason why we can't pass a But I guess some |
That's what I was trying to do before this commit
My conclusion was that the first I changed the API based on examples I could find using the C Data Interface. For example, Polars in its Arrow-based FFI with Python and pyarrow does a similar process of creating an empty pointer, passing those pointers to the arrow producer (in this example |
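To illustrate the pattern described above (the consumer provisions an empty struct, hands a pointer to the producer, and the producer fills in the callbacks), here is a minimal self-contained sketch. `MockStream`, `mock_get_stream`, and the integer "batches" are hypothetical stand-ins invented for this example; the real `ArrowArrayStream` also carries `get_schema` and `get_last_error`, and the real producer would be `OGR_L_GetArrowStream`.

```rust
use std::os::raw::{c_int, c_void};

/// Simplified, hypothetical stand-in for `ArrowArrayStream`.
#[repr(C)]
pub struct MockStream {
    pub get_next: Option<unsafe extern "C" fn(*mut MockStream, *mut c_int) -> c_int>,
    pub release: Option<unsafe extern "C" fn(*mut MockStream)>,
    pub private_data: *mut c_void,
}

/// Producer callback: writes the next value, or -1 once the stream is exhausted.
unsafe extern "C" fn mock_get_next(stream: *mut MockStream, out: *mut c_int) -> c_int {
    unsafe {
        let remaining = (*stream).private_data as *mut c_int;
        if *remaining == 0 {
            *out = -1; // sentinel chosen for this sketch: end of stream
        } else {
            *remaining -= 1;
            *out = *remaining;
        }
    }
    0 // 0 means "no error" in this sketch
}

/// Producer callback: frees the private state and marks the struct as released.
unsafe extern "C" fn mock_release(stream: *mut MockStream) {
    unsafe {
        drop(Box::from_raw((*stream).private_data as *mut c_int));
        (*stream).private_data = std::ptr::null_mut();
        (*stream).release = None;
    }
}

/// Hypothetical producer entry point, playing the role of `OGR_L_GetArrowStream`.
unsafe fn mock_get_stream(out: *mut MockStream, batches: c_int) -> bool {
    unsafe {
        (*out).get_next = Some(mock_get_next);
        (*out).release = Some(mock_release);
        (*out).private_data = Box::into_raw(Box::new(batches)) as *mut c_void;
    }
    true
}

/// Consumer side: provision an empty struct, let the producer fill it, drain it, release it.
pub fn drain_mock_stream(batches: c_int) -> Vec<c_int> {
    let mut stream = MockStream {
        get_next: None,
        release: None,
        private_data: std::ptr::null_mut(),
    };
    let mut seen = Vec::new();
    unsafe {
        assert!(mock_get_stream(&mut stream, batches));
        let get_next = stream.get_next.expect("producer must set get_next");
        loop {
            let mut value: c_int = 0;
            assert_eq!(get_next(&mut stream, &mut value), 0);
            if value < 0 {
                break;
            }
            seen.push(value);
        }
        // The consumer is responsible for calling `release` exactly once.
        if let Some(release) = stream.release {
            release(&mut stream);
        }
    }
    seen
}
```

The consumer owns the memory of the struct itself, while the producer owns whatever hides behind `private_data`; that split is why the struct can be allocated from Rust without depending on any Arrow crate.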
I think this discussion in arrow2 is useful background reading for the Arrow FFI API |
One additional thing to note here is that I suppose we can expose a safe API if we depend on |
Yeah, that was my idea. Having some features to allow using types from those two crates. |
My main hesitation with that is that it makes the |
// Newtype over the raw FFI struct; construction is unsafe because the caller vouches for the pointers.
struct ArrowArrayStream(pub Box<ArrowArrayStreamFfi>);
impl ArrowArrayStream {
    unsafe fn new(inner: Box<ArrowArrayStreamFfi>) -> Self { Self(inner) }
}
trait LayerAccess: Sized {
    fn read_arrow_stream(&mut self, out_stream: &mut ArrowArrayStream) {
        // ...
    }
}
diff --git i/examples/read_ogr_arrow.rs w/examples/read_ogr_arrow.rs
index a0a6fcf..fb63cd4 100644
--- i/examples/read_ogr_arrow.rs
+++ w/examples/read_ogr_arrow.rs
@@ -37,14 +37,15 @@ fn run() -> Result<()> {
let gdal_pointer: *mut gdal::ArrowArrayStream = output_stream_ptr.cast();
// Read the layer's data into our provisioned pointer
- unsafe { layer_a.read_arrow_stream(gdal_pointer) }
+ let output_stream = unsafe { layer_a.read_arrow_stream() };
// The rest of this example is arrow2-specific.
// `arrow2` has a helper class `ArrowArrayStreamReader` to assist with iterating over the raw
// batches
- let mut arrow_stream_reader =
- unsafe { arrow2::ffi::ArrowArrayStreamReader::try_new(output_stream).unwrap() };
+ let mut arrow_stream_reader = unsafe {
+ arrow2::ffi::ArrowArrayStreamReader::try_new(std::mem::transmute(output_stream)).unwrap()
+ };
// Iterate over the stream until it's finished
// arrow_stream_reader.next() will return None when the stream has no more data
@@ -84,6 +85,7 @@ fn run() -> Result<()> {
// Access the first row as WKB
let _wkb_buffer = binary_array.value(0);
+ dbg!(_wkb_buffer);
}
Ok(())
diff --git i/src/vector/layer.rs w/src/vector/layer.rs
index 99f0c83..ceb6c93 100644
--- i/src/vector/layer.rs
+++ w/src/vector/layer.rs
@@ -481,7 +481,7 @@ pub trait LayerAccess: Sized {
/// This uses the Arrow C Data Interface to operate on raw pointers provisioned from Rust.
/// These pointers must be valid and provisioned according to the ArrowArrayStream spec
#[cfg(all(major_is_3, minor_ge_6))]
- unsafe fn read_arrow_stream(&mut self, out_stream: *mut gdal_sys::ArrowArrayStream) {
+ fn read_arrow_stream(&mut self) -> Box<gdal_sys::ArrowArrayStream> {
unsafe {
let version_check = gdal_sys::GDALCheckVersion(3, 6, std::ptr::null_mut());
if version_check == 0 {
@@ -491,6 +491,16 @@ pub trait LayerAccess: Sized {
self.reset_feature_reading();
+ let mut out_stream = Box::new(gdal_sys::ArrowArrayStream {
+ get_schema: None,
+ get_next: None,
+ get_last_error: None,
+ release: None,
+ private_data: std::ptr::null_mut(),
+ });
+
+ let out_stream_ptr = &mut *out_stream as *mut gdal_sys::ArrowArrayStream;
+
let options = std::ptr::null_mut();
unsafe {
@@ -500,11 +510,13 @@ pub trait LayerAccess: Sized {
// "INCLUDE_FID".as_ptr() as *const libc::c_char,
// "NO".as_ptr() as *const libc::c_char,
// );
- let success = gdal_sys::OGR_L_GetArrowStream(self.c_layer(), out_stream, options);
+ let success = gdal_sys::OGR_L_GetArrowStream(self.c_layer(), out_stream_ptr, options);
if !success {
panic!("Failed to read arrow data");
}
}
+
+ out_stream
}
} |
I don't think it has to be boxed, but we don't know ahead of time how big the arrays are that will be produced. So if it's not boxed you might run the risk of a stack overflow?
Yeah... maybe it's useful to mark the function as
If we have our own
I'm still learning and don't really know how to use |
I think the reader takes a Box. I don't think the size of the type is a problem, the struct only has a couple of pointers.
Not just useful, but required. The user might fill in some bogus pointers (that's safe!), we would pass them on to GDAL, and the program would crash. As long as we don't construct the stream (which would mean depending on a specific Arrow crate), it needs to be unsafe.
We could add some conversion functions, but yes.
Taking an |
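The point about bogus pointers can be sketched in isolation. Building a struct around a dangling pointer needs no `unsafe`, so any function that forwards such a struct to code that dereferences it has to be `unsafe` itself, unless it constructs the struct on its own. `RawHandle`, `read_private`, and `read_owned` are hypothetical names for this illustration and are not part of the crate.

```rust
use std::os::raw::c_void;

/// Hypothetical FFI-style struct holding an opaque pointer.
#[repr(C)]
pub struct RawHandle {
    pub private_data: *mut c_void,
}

/// Entirely safe: creating a struct around a bogus pointer needs no `unsafe`.
pub fn bogus_handle() -> RawHandle {
    RawHandle {
        private_data: 0xdead_beef_usize as *mut c_void,
    }
}

/// Must be `unsafe`: it dereferences whatever the caller supplied.
///
/// # Safety
/// `handle.private_data` must point to a live, aligned `u32`.
pub unsafe fn read_private(handle: &RawHandle) -> u32 {
    unsafe { *(handle.private_data as *const u32) }
}

/// Safe alternative: the function builds the handle itself, so the
/// invariant holds by construction and callers cannot break it.
pub fn read_owned(value: u32) -> u32 {
    let mut slot = value;
    let handle = RawHandle {
        private_data: &mut slot as *mut u32 as *mut c_void,
    };
    // SAFETY: `private_data` points at `slot`, which is alive for this call.
    unsafe { read_private(&handle) }
}
```

`read_owned` mirrors the variant in the diff above where the function allocates the stream itself; the trade-off discussed in the thread is that a fully safe API on top of that would also need to own the consuming side.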
I guess this is really difficult to decide. I don't want to depend on Arrow or Arrow2. If we add a dependency, it should be Arrow since that's the official one. However, Arrow versions are really messy to handle. So I would suggest adding the unsafe method and finding a safe solution later? Maybe arrow/arrow2 will converge at some point... |
Agreed, let's take a pointer and we can do a feature gate if we end up with a nicer API for arrow2. Maybe the only question is whether we want to offer a wrapper like in my comment above. |
The Rust Arrow ecosystem seems pretty evenly divided between |
src/vector/layer.rs
Outdated
let options = std::ptr::null_mut();

unsafe {
// Note to self: example of how to operate on options
There's a nicer way to do this: a68aba7#diff-735b4e8f56e7d6ad6abe40c99808df64a2b2e05b2b13f976303cefe2dca42affR174.
You might want to remove or update the commented-out code.
It looks like there are areas where `CslStringList` is passed in through the public API, so I'll update this to have the user pass that in
Yeah, that's a good idea.
I changed it to Option<&CslStringList>
Let's drop the `Option`, `Geometry::make_valid` ended up not using one. An empty `CslStringList` gets passed as a null pointer, so there shouldn't be any downsides.
And can you squash the commits if it's not too much of a hassle?
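A small sketch of why an `Option` wrapper is redundant when an empty list already maps to a null pointer. `OptionList` below is a hypothetical stand-in written for this example, not the crate's `CslStringList`; it only models the NULL-terminated `key=value` layout that `papszOptions` arguments expect.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical stand-in for an options list (not the crate's `CslStringList`).
pub struct OptionList {
    owned: Vec<CString>,
    ptrs: Vec<*mut c_char>, // NULL-terminated view over `owned`
}

impl OptionList {
    pub fn new(pairs: &[(&str, &str)]) -> Self {
        let owned: Vec<CString> = pairs
            .iter()
            .map(|(k, v)| CString::new(format!("{k}={v}")).expect("no interior NUL"))
            .collect();
        // The CString heap buffers do not move when `owned` is moved into the struct.
        let mut ptrs: Vec<*mut c_char> =
            owned.iter().map(|s| s.as_ptr() as *mut c_char).collect();
        ptrs.push(std::ptr::null_mut());
        Self { owned, ptrs }
    }

    /// Pointer for a `papszOptions`-style argument; NULL when there are no options.
    pub fn as_ptr(&self) -> *mut *mut c_char {
        if self.owned.is_empty() {
            std::ptr::null_mut()
        } else {
            self.ptrs.as_ptr() as *mut *mut c_char
        }
    }

    /// Walk the array the way a C consumer would; a NULL list means "no options".
    pub fn entries(&self) -> Vec<String> {
        let mut out = Vec::new();
        let mut cursor = self.as_ptr();
        while !cursor.is_null() && unsafe { !(*cursor).is_null() } {
            out.push(unsafe { CStr::from_ptr(*cursor) }.to_string_lossy().into_owned());
            cursor = unsafe { cursor.add(1) };
        }
        out
    }
}
```

With this shape, `OptionList::new(&[])` and "no options at all" look identical to the C side, which is the reason a plain reference is enough in the signature.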
I took out the `Option` and squashed
// "NO".as_ptr() as *const libc::c_char,
// );
let success = gdal_sys::OGR_L_GetArrowStream(self.c_layer(), out_stream, options);
if !success {
I wonder why some of these functions return `bool` while others return `OGRErr` 😕.
🤷‍♂️ it is indeed a little surprising
LGTM, with a nit about a comment
Do you know the best way to fix this CI? I don't know how to have the example only exist for gdal 3.6+ |
4e0c725 to 3901bf0
The example seems fine now, it's just Clippy complaining about an unused import. |
That's my issue... On GDAL 3.6 the example runs fine, but I need to have the example not run on <3.6 and not get checked by clippy on <3.6, or else clippy will think the imports aren't being used |
It's complaining about |
I see; a few of the previous commits had clippy errors on the example |
bors r+ Thanks for the patience and for working on this. We might have Arrow support before GDAL 3.7 comes out! 😃 |
367: Implement support for RFC 86: Column-oriented read API for vector layers r=lnicola a=kylebarron

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/gdal/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.

---

### Description

This is a pretty low-level/advanced function, but is very useful for performance when reading (and maybe in the future writing) from OGR into columnar memory.

This function operates on an `ArrowArrayStream` struct that needs to be passed in. Most of the time, users will be using a helper library for this, like [`arrow-rs`](https://github.com/apache/arrow-rs) or [`arrow2`](https://github.com/jorgecarleitao/arrow2). The nice part about this API is that this crate does _not_ need to declare those as dependencies.

The [OGR guide](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface) is very helpful reading. Would love someone to double-check this PR in context of this paragraph:

> There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing). Furthermore, unless otherwise specified by a particular driver implementation, only one ArrowArrayStream can be active at a time on a given layer (that is the last active one must be explicitly released before a next one is asked). Changing filter state, ignored columns, modifying the schema or using ResetReading()/GetNextFeature() while using a ArrowArrayStream is strongly discouraged and may lead to unexpected results.
>
> As a rule of thumb, no OGRLayer methods that affect the state of a layer should be called on a layer, while an ArrowArrayStream on it is active.

### Change list

- Copy in `arrow_bridge.h` with the Arrow C Data Interface headers.
- Add `arrow_bridge.h` to the bindgen script so that `gdal_3.6.rs` includes a definition for `ArrowArrayStream`. I re-ran this locally; I'm not sure why there's such a big diff. Maybe I need to run this from `3.6.0` instead of `3.6.2`?
- Implement `read_arrow_stream`
- Add example of reading arrow data to [`arrow2`](https://docs.rs/arrow2)

### Todo

- Pass in options to `OGR_L_GetArrowStream`? According to the guide:

  > The `papszOptions` that may be provided is a NULL terminated list of key=value strings, that may be driver specific.

  So maybe we should have an `options: Option<Vec<(String, String)>>` argument? Pyogrio [uses this](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1090-L1091) to turn off generating an `fid` for every row.
- Have an option to skip reading some columns. Pyogrio does this with [calls to](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1081-L1088) `OGR_L_SetIgnoredFields`.

### References

- [OGR Guide for using the C Data Interface](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface)

Closes #280

Co-authored-by: Kyle Barron <[email protected]>
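The precaution quoted from the OGR guide (no state-changing layer methods while a stream is active, and only one stream per layer at a time) is something Rust's borrow checker could in principle enforce. The sketch below is a hypothetical design and not what this PR implements: `MockLayer` and `ActiveStream` are invented types, and the "stream" just iterates over integers instead of Arrow batches.

```rust
/// Hypothetical stand-in for an OGR layer (not the `gdal` crate's type).
pub struct MockLayer {
    cursor: usize,
    rows: Vec<u32>,
}

/// Holds a mutable borrow of the layer for as long as the stream is alive,
/// so no other layer method can be called in the meantime.
pub struct ActiveStream<'a> {
    layer: &'a mut MockLayer,
}

impl MockLayer {
    pub fn new(rows: Vec<u32>) -> Self {
        Self { cursor: 0, rows }
    }

    /// Stands in for a state-changing call such as ResetReading().
    pub fn reset_reading(&mut self) {
        self.cursor = 0;
    }

    /// Opening a stream takes `&mut self`, so at most one can exist per layer.
    pub fn open_stream(&mut self) -> ActiveStream<'_> {
        self.reset_reading();
        ActiveStream { layer: self }
    }
}

impl Iterator for ActiveStream<'_> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let row = self.layer.rows.get(self.layer.cursor).copied();
        if row.is_some() {
            self.layer.cursor += 1;
        }
        row
    }
}

pub fn read_all(layer: &mut MockLayer) -> Vec<u32> {
    let stream = layer.open_stream();
    // layer.reset_reading(); // would not compile here: `layer` is still mutably borrowed
    let rows: Vec<u32> = stream.collect();
    layer.reset_reading(); // fine: the stream has been consumed and dropped
    rows
}
```

A raw-pointer API like the one merged here cannot carry such a lifetime, which is part of why the method stays `unsafe` and the burden of following the guide's rules falls on the caller.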
bors r- |
Canceled. |
Ok indeed so I fixed that clippy error in |
For the example, you can move the imports inside the function and make another version of it with the same feature gate, but negated. Then you can drop the check. |
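The suggestion above can be sketched as follows. The cfg flags `major_is_3` and `minor_ge_6` are the ones used in `src/vector/layer.rs`; everything else (the body of `run`, the `std::fmt::Write` import standing in for the Arrow-specific imports, and the name `example_main`) is a placeholder chosen for this illustration.

```rust
/// Built only when the build script reports GDAL 3.6 or newer.
#[cfg(all(major_is_3, minor_ge_6))]
fn run() -> Result<String, String> {
    // Imports used only by the gated body live inside it, so Clippy never
    // sees them as unused when building against an older GDAL.
    use std::fmt::Write as _; // placeholder for the Arrow-specific imports

    let mut report = String::new();
    write!(report, "read Arrow stream on GDAL >= 3.6").map_err(|e| e.to_string())?;
    Ok(report)
}

/// Fallback with the same signature, compiled when the gate above is off.
#[cfg(not(all(major_is_3, minor_ge_6)))]
fn run() -> Result<String, String> {
    Ok("skipped: this example needs GDAL 3.6 or newer".to_string())
}

/// Shared entry point; in the real example this would be `fn main()`.
pub fn example_main() -> i32 {
    match run() {
        Ok(message) => {
            println!("{message}");
            0
        }
        Err(error) => {
            eprintln!("{error}");
            1
        }
    }
}
```

Because exactly one `run` exists under every configuration, the caller needs no cfg attribute of its own, and the example target can stay in the manifest for all GDAL versions.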
bors d+ (And please don't forget to squash at the end) |
✌️ kylebarron can now approve this pull request. To approve and merge a pull request, simply reply with |
c3c2cca to 2e8214b
Ok I think I followed your instructions correctly (hard to test because I have gdal 3.6 locally). And I squashed again (sorry, I'm used to just merging with a squash commit so I forget). I've never used |
bors r+ |