Python GeoArrow Module Proposal #38
Comments
This is awesome! All comments I have are just details.
It definitely should! Opening an issue would be helpful (I can take a stab at implementing).
Agreed! It may be nice to (optionally) expose something like that. In addition to output type, something describing the input/output shape relationship would be nice (scalar function, vector function, aggregate function, window function).
I think in practice we might want to consider anything "geoarrowish" (i.e., it will work when passed to any Python function in geoarrow land) as long as it implements the relevant protocol method. I'd add that a Type representation is another first-class data structure (implementing `__arrow_c_schema__`).
In addition to what you listed, I think other things that use pyarrow features (e.g., the parquet reader or datasets) are in scope. If the proportion of code dedicated to those features starts to overwhelm the code base I can see that changing, but at the moment it seems like a lot of work to move (e.g.) the GeoParquet reader somewhere else when it is really a very thin wrapper around PyArrow's reader.

I'd love to remove the hard dependency on geoarrow.c, although I also don't have to do that right now (happy to review a PR!).

The compute functionality (i.e., the kernels) …
Pretty much just to register the pandas extension types and accessor. This is a place where you could make a more pandas-friendly API.
Probably this can just live in Shapely?
Probably this can just live in pyogrio and/or gdal.ogr?
Pretty much the same scope as it has now, except that now it can leverage the dunder methods to maybe export some functions or a Kernel/Operation/Algorithm.
You'll have to lead the way here (by adding it in yourself or letting me know in a PR review where I should have added useful type annotations)... Because of the mostly functional approach and the garbage autocomplete of pyarrow, this is hard to do at the core level. At the pandas or Rust or polars level this is more feasible?

...I probably forgot some things! Apologies if these comments missed something about what you are proposing!
Opened apache/arrow#38717!
Do you have pseudocode of this? It's a little hard to wrap my head around. By output type do you mean the geometry type (point, line, etc.) or array/chunked array/table?

```py
from geoarrow.python.c import Area

operation = Area()
array = PolygonArray(...)
float_array = operation(array)
```
One thing I've been doing in Rust is enabling some simple broadcasting, so you should be able to, for example, operate on an array and a single scalar geometry together.
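Roughly, the idea is what this plain-Python sketch shows (this is just an illustration of the broadcasting semantics, not the geoarrow-rs implementation; shapely is used only to have concrete geometries to test with, and `intersects_broadcast` is a made-up name):

```py
# Illustration only: a binary operation that broadcasts a single scalar
# geometry against an array of geometries (and otherwise works element-wise).
import shapely

def intersects_broadcast(left, right):
    left_is_scalar = isinstance(left, shapely.Geometry)
    right_is_scalar = isinstance(right, shapely.Geometry)
    if left_is_scalar and right_is_scalar:
        left, right = [left], [right]
    elif left_is_scalar:
        left = [left] * len(right)   # reuse the scalar for every element
    elif right_is_scalar:
        right = [right] * len(left)
    return [bool(shapely.intersects(l, r)) for l, r in zip(left, right)]

polygons = list(shapely.from_wkt([
    "POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))",
    "POLYGON ((5 5, 6 5, 6 6, 5 6, 5 5))",
]))
point = shapely.from_wkt("POINT (1 1)")

print(intersects_broadcast(polygons, polygons))  # array vs. array: [True, True]
print(intersects_broadcast(polygons, point))     # array vs. scalar: [True, False]
```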
👍
Ok, this is good to know. I agree not every implementation needs those concrete classes; it's helpful for me to think of those classes in a logical sense, and on the Rust side I think I'll end up implementing all of them as classes.
👍
👍 I agree, I just forgot that there were other elements in there already besides the extension types and arrays.
Hmm, I'm not sure what the best approach is here. I was originally thinking of … I will mention that we've had somewhat different priorities in geoarrow-c and geoarrow-rs so far, where it seems like you've been more intentional about supporting the full spec, whereas I only support 2D geometries right now.
👍 That's cool! I also don't see myself spending time on this in the short term, but an interesting API to think about.
I don't see shapely having any interest in a pyarrow dependency. Maybe shapely would consider an optional dependency on something like nanoarrow or geoarrow-c?
Here you're right; given that pyogrio already depends on pyarrow, maybe it would be easiest to add a geoarrow-python dependency to pyogrio and ensure it loads data as geoarrow extension types?
I agree it's hard to do with pyarrow (once upon a time I spent a few hours trying to make type stubs for pyarrow!) but from Rust it should be pretty feasible!
I have actual code!

```py
import geoarrow.pyarrow as ga

op = ga.Kernel.as_wkb(ga.wkt())
op._type_out
#> WkbType(geoarrow.wkb)
```
No need for a pyarrow dependency?

```py
import geoarrow.c.lib as lib
import numpy as np

type = lib.CVectorType.Make(lib.GeometryType.POINT, lib.Dimensions.XY, lib.CoordType.SEPARATE)
schema = type.to_schema()
builder = lib.CBuilder(schema)
builder.set_buffer_double(1, np.array([1.0, 2.0, 3.0]))
builder.set_buffer_double(2, np.array([3.0, 4.0, 4.0]))
array = builder.finish()
array
#> <geoarrow.c._lib.ArrayHolder at 0x10c0e3690>

# Just for demo:
import pyarrow as pa
import geoarrow.pyarrow as _
pa.Array._import_from_c(array._addr(), schema._addr())
#> PointArray:PointType(geoarrow.point)[3]
#> <POINT (1 3)>
#> <POINT (2 3)>
#> <POINT (3 4)>
```

...or a lazy geoarrow.rs import when something similar is supported there.
Thanks for starting this discussion, Kyle! Don't have time to respond thoroughly right now, so just two quick things:
I think that's indeed a reasonable proposal for pyogrio.
I think in the short term, it might also make sense to have a version of this in shapely itself.
That would be really cool, especially if it could vendor nanoarrow (?) geoarrow-c (?) and not need to add a complex pyarrow dependency that I imagine some shapely maintainers might balk at. I would love this to exist but can't say I have the C/Cython knowledge and motivation to make it happen myself 🙂
I was thinking an optional runtime dependency on geoarrow-c (which can create a C-data interface array from pybuffers). I'm pretty game to give that a shot (has overlap with R things I want to do) but it may take a bit (and need some changes to geoarrow-c to be future-stable).
How about moving them to …? It seems like some things I should prioritize here are implementing the stream protocol for ChunkedArrays and implementing the protocol in objects returned by geoarrow.c?
I'd agree with that. One of my priorities is implementing the core PyCapsule protocol in geoarrow-rs (I have to re-implement it from the arrow-rs version because arrow-rs doesn't maintain extension type information).
With geoarrow/geoarrow-rs#293 I now have a working prototype (of an `area` function).
This is just a proof of concept, only for area. It's annoying how verbose the implementation is, so I want to ideate on that a bit more. It works!

```py
import pyarrow as pa
import shapely
import pandas as pd
import geopandas as gpd
from geoarrow.rust.core import WKBArray

gdf = gpd.read_file(gpd.datasets.get_path('nybb'))
large_gdf = pd.concat([gdf] * 1000)
wkb = shapely.to_wkb(large_gdf['geometry'])
pa_wkb_arr = pa.array(wkb)
rust_wkb_arr = WKBArray.from_arrow(pa_wkb_arr)
rust_multi_polygon_arr = rust_wkb_arr.to_multi_polygon_array()
area1 = rust_multi_polygon_arr.area()

from geoarrow.rust.core import area
area2 = area(rust_multi_polygon_arr)
```

<img width="350" alt="image" src="https://github.com/geoarrow/geoarrow-rs/assets/15164633/e0ba4dc2-b441-4f7a-a4f3-b10dea43f1a6">

Closes #266. For geoarrow/geoarrow-python#38
Python GeoArrow Module Proposal

The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around `geoarrow-python` fit together really well. (It's possible this should be written RFC-style as a PR to a docs folder in this repo?)

Goals:

- Build on `geoarrow-c` and/or `geoarrow-rust` and largely reuse their Python bindings without having to create ones from scratch.
- `convex_hull` should always return a `PolygonArray` instead of a generic `GeometryArray` that the user can't "see into".

This proposal is based around the new Arrow PyCapsule Interface, which allows libraries to safely interchange data without memory leaks and without going through pyarrow. This is implemented in pyarrow as of v14+, work is underway to add it to arrow-rs, and presumably nanoarrow support is not too hard to implement.
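For concreteness, this is what the producer side of the protocol looks like in pyarrow 14+ (the only assumption here is the pyarrow version; the dunder methods hand back PyCapsule objects wrapping the C `ArrowSchema`/`ArrowArray` structs, which another library can import without depending on pyarrow itself):

```py
# Producer side of the Arrow PyCapsule interface as exposed by pyarrow >= 14.
import pyarrow as pa

arr = pa.array([1.0, 2.0, 3.0])

# A capsule wrapping a C ArrowSchema describing the type:
schema_capsule = arr.type.__arrow_c_schema__()

# A pair of capsules wrapping a C ArrowSchema and a C ArrowArray:
schema_capsule2, array_capsule = arr.__arrow_c_array__()

print(schema_capsule)  # <capsule object "arrow_schema" at 0x...>
print(array_capsule)   # <capsule object "arrow_array" at 0x...>
```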
Primarily Functional API
A functional API makes it easy to take in data without knowing its provenance. Implementations may choose to also implement methods on classes if desired to improve the API usability, but nothing should be implemented solely as a method.
Data Structures
These are the data structure concepts that I think need to be first-class. Each core implementation will implement classes that conform to one of these concepts.
GeometryArray
This is a logical array of contiguous memory that conforms to the GeoArrow spec. I envision there being `PointArray`, `LineStringArray`, etc. classes that are all subclasses of this.

This object should have an `__arrow_c_array__` member that conforms to the PyCapsule interface. The exported `ArrowSchema` must include extension type information (an extension name of `geoarrow.*` and optionally extension metadata).

Whether the array uses small or large list offsets internally does not matter, but the implementation should respect the `requested_schema` parameter of the PyCapsule interface when exporting.
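As a sketch of what that contract could look like (class and attribute names here are illustrative, not from any existing geoarrow package; the export simply delegates to an underlying pyarrow array, which already implements the protocol in pyarrow 14+):

```py
# Minimal sketch of a GeometryArray-style wrapper exporting itself via the
# PyCapsule interface; `requested_schema` is forwarded so a consumer can ask
# for a different (e.g. large-offset) layout.
import pyarrow as pa

class PointArraySketch:
    def __init__(self, storage: pa.Array):
        # A real implementation would hold GeoArrow point storage with
        # extension metadata attached; this sketch holds a plain pyarrow array.
        self._storage = storage

    def __arrow_c_array__(self, requested_schema=None):
        # Returns (schema_capsule, array_capsule) per the PyCapsule interface.
        return self._storage.__arrow_c_array__(requested_schema=requested_schema)

wrapped = PointArraySketch(pa.array([1.0, 2.0, 3.0]))
schema_capsule, array_capsule = wrapped.__arrow_c_array__()
```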
GeometryStorageArray?

In geoarrow-rs I've tried to make a distinction between "storage" types (i.e. WKB and WKT) and "analysis" types (i.e. anything zero-copy). This is partially to nudge users not to store data as WKB and operate directly on the WKB repeatedly. Do we want to make any spec-level distinction between storage and analysis arrays? Should every operation accept storage types? I think it should be fine for a function to declare it'll accept only non-storage types, and direct a user to call, say, `parse_wkb`.
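A function that accepts only non-storage input could enforce that with a check along these lines (illustrative sketch only; the extension names come from the GeoArrow spec, while `parse_wkb`/`parse_wkt` are placeholder names for whatever parsing function an implementation provides):

```py
# Reject "storage" encodings (WKB/WKT) and point the user at a parsing step.
import pyarrow as pa

STORAGE_EXTENSION_NAMES = {"geoarrow.wkb", "geoarrow.wkt"}

def require_analysis_type(array: pa.Array) -> pa.Array:
    ext_name = (
        array.type.extension_name
        if isinstance(array.type, pa.ExtensionType)
        else None
    )
    if ext_name in STORAGE_EXTENSION_NAMES:
        raise TypeError(
            f"{ext_name} is a storage encoding; call parse_wkb()/parse_wkt() "
            "before using this operation"
        )
    return array

require_analysis_type(pa.array([1.0, 2.0]))  # non-storage input passes through
```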
ChunkedGeometryArray

I believe that chunked arrays need to be a first-class data concept. Chunking is core to the Arrow and Parquet ecosystems, and handling something like `unary_union`, which requires the entire column as input to a single kernel, requires understanding some type of chunked input. I envision there being `ChunkedPointArray`, `ChunkedLineStringArray`, etc. classes that are all subclasses of this.

This should have an `__arrow_c_stream__` member. The `ArrowSchema` must represent a valid GeoArrow geometry type and must include extension type information (at least a name of `geoarrow.*` and optionally extension metadata).

This stream should be compatible with Dewey's existing kernel structure that allows for pushing a sequence of arrays into the kernel.

(It looks like pyarrow doesn't implement `__arrow_c_stream__` for a `ChunkedArray`? To me it seems natural for it to exist on a ChunkedArray... I'd be happy to open an issue.)

GeometryTable
For operations like joins, kernels need to be aware not only of geometries but also of attribute columns.
This should have an `__arrow_c_stream__` member. The `ArrowSchema` must be a struct type that includes all fields in the table. The `ArrowArray` must be a struct array that includes all arrays in the table. At least one child of the `ArrowSchema` must have GeoArrow extension type information (an extension name of `geoarrow.*` and optionally extension metadata).
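A consumer of such a stream needs a way to find the geometry child among arbitrary attribute columns; one way to sketch that with pyarrow is to look for the GeoArrow extension name in the field metadata (illustrative only; a real reader would more likely work through registered extension types):

```py
# Locate geometry columns in a schema by their GeoArrow extension name.
import pyarrow as pa

def geometry_field_indices(schema: pa.Schema) -> list[int]:
    indices = []
    for i, field in enumerate(schema):
        metadata = field.metadata or {}
        ext_name = metadata.get(b"ARROW:extension:name", b"")
        if ext_name.startswith(b"geoarrow."):
            indices.append(i)
    return indices

schema = pa.schema([
    pa.field("name", pa.string()),
    pa.field("geometry", pa.binary(),
             metadata={b"ARROW:extension:name": b"geoarrow.wkb"}),
])
print(geometry_field_indices(schema))  # [1]
```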
Future proofing

Spatial indexes can be serialized within a table or geometry array by having a struct containing the geometry column and a binary-typed run end array holding the bytes of the index (pending geoarrow discussion).
Not sure what other future proofing to consider.
Module hierarchy
General things to consider:
- Should the shapely integration live in `geoarrow.pyarrow` or have a separate library `geoarrow.shapely` that's very minimal? (Personally, I could go either way, but if `geoarrow.shapely` isn't likely to change often, I might lean towards a separate module...?)

geoarrow.pyarrow

geoarrow.pandas

geoarrow.shapely

Contains two main functions for I/O between geoarrow geometry arrays and shapely using the shapely to/from ragged array implementation. `from_shapely` returns pyarrow-based extension arrays. Longer term it also takes a parameter for the max chunk size. Depends on geoarrow-pyarrow and shapely.

geoarrow.gdal

Wrapper around pyogrio?

geoarrow.c

I'll let Dewey give his thoughts here.

geoarrow.rust.core

Should functions return a bare `Float64Array` and assume the user will call `pyarrow.array()` on the result? Or should I depend on pyarrow in the short term?

geoarrow.rust.proj, geoarrow.rust.geos

Depends on `geoarrow.rust.core`; wraps `geoarrow-rs` but adds no Python dependencies.

Downsides
- Is the module split confusing (or will `geoarrow.*` handle making it simple for end users)?
- Each implementation has its own classes: `geoarrow.pyarrow.PointArray`, `geoarrow.c.PointArray`, `geoarrow.rust.core.PointArray`. This is, in some ways, unfortunate. But it allows users to control dependencies closely. And it's unavoidable unless functions returned bare PyCapsule objects?

Static Typing
A full proposal for static typing is out of the scope of this proposal (and some operations just won't be possible to type accurately).
A few methods will be amenable to generics, as shown below. But ideally every function can be given a return type that matches one of the Arrow PyCapsule protocols. At least in the Rust implementation, I'd like to have type stubs with accurate return types (though sadly I'll still have to write the `.pyi` type stubs by hand).
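For instance, a sketch of such a generic annotation (class and function names here are illustrative, not from any existing package):

```py
# Stub-style sketch: a TypeVar bound to the GeometryArray base lets
# type-preserving operations return the same concrete subclass they received,
# while type-changing operations can name their output class directly.
from typing import TypeVar

class GeometryArray: ...
class PointArray(GeometryArray): ...
class PolygonArray(GeometryArray): ...

ArrayT = TypeVar("ArrayT", bound=GeometryArray)

def affine_transform(array: ArrayT, matrix: list[float]) -> ArrayT:
    """Type-preserving: a PointArray in means a PointArray out."""
    ...

def convex_hull(array: GeometryArray) -> PolygonArray:
    """Type-changing: always a PolygonArray, matching the goal stated above."""
    ...
```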