Proposal for a "geo-arrow" format #1
Comments
Sorry for the slow response on this, thanks for creating the issue. Great to meet you @jorisvandenbossche - it's awesome work you're doing. I definitely think geo-arrow-spec makes sense for inclusion in cdw-geo. Though I think eventually this repository may drop away in favor of actual specs. It does seem like it'd make sense to eventually have one 'geoparquet' place, where it explains both formats, if that's what ultimately ends up making sense. And I agree on keeping development of geo-arrow-spec in the geopandas repository. I'm definitely interested to see the evolution of the geo-arrow-spec and the arrow-native layout. This effort I think does make sense to start with WKB, since most vendors already use it, but I can imagine some might be interested in a more efficient format if they start to try to read data in place. |
Picking this up again. My plan right now is to do the "easy" geoparquet format first (i.e. not geo-arrow) that
It'd be helpful to have sample parquet files of what these systems currently export. Looking at https://github.com/opengeospatial/cdw-geo#current-support that's just Snowflake right now. @cholmes do you know who from Snowflake to ask for a sample file? |
Awesome.
Yes, not sure of his github user name, but I'll introduce you on email. |
I took a quick pass at this in #2. |
Nice to meet you as well @cholmes, and sorry for the slow reply. And thanks Tom for getting this moving again! For the WKB vs "arrow native" (i.e. nested list arrays) question: we might want to keep both options long-term. The metadata we use in GeoPandas currently includes a |
Looking at #2, I have a general question: currently it defines a metadata standard that is quite close to what we did in GeoPandas / geo-arrow-spec, but not exactly. So the existing geo-parquet files written by geopandas (and R's sfarrow) are not actually compatible with this. I think ideally we only want a single metadata specification for Parquet in the long term? So that means we need to discuss the differences and see how the two can be aligned? |
Ah cool, this makes sense to me. Though lately I've been wondering if there are any advantages to WKB? Is it just that there are more parsers today that understand WKB? Is it that hard to parse/understand the 'arrow native' one? Or are there other downsides?
Yes, fully agree we only want a single metadata specification for geo in parquet. What are the differences? We're basically starting from scratch, so happy to look to you for the start, and help promote what you've done and get others reading/writing the same format, and being open to evolving it if their requirements are a bit different. Where can I read up more on what you've done so far? |
I think it was mostly because 1) WKB is indeed ubiquitous and almost everybody should be able to understand it, and 2) we wanted to get started with something (like Tom did here as well), so we could actually add parquet IO functionality to GeoPandas. The "arrow native" one shouldn't necessarily be hard to understand, since it is "just numbers", but it's not a typical format. E.g. for python / shapely, there is not a single function that can parse it at the moment (you need some custom code; in pygeos, doing it efficiently currently looks like pygeos/pygeos#93 (comment)). But it has been on my to-do list for a long time to actually implement those functions to make this easier. For the differences in metadata between what we have in geopandas vs the first draft here: one is the question of where to put column-level metadata (grouped together in the file metadata, or separately in each column's metadata, cf. #2 (comment)). Another is the way to store CRS information (#3, we are currently using WKT), and a third is the top-level key ("geo" vs "geoparquet"). We can maybe create specific issues for the different questions. |
I'm happy to move the column metadata to the file level. I didn't realize geo-arrow-spec put it at the file level until after my first write up (@jorisvandenbossche do you have a sense for whether using both file and column-level metadata is common in parquet?). I picked "geoparquet" as the top-level key to avoid clashing with geo-arrow-spec for now, but I agree that long-term we want just one standard. If we can get everyone on the same page, then just using "geo" as the key seems reasonable. Maybe we talk through CRS / epsg stuff in #3? Adding in those additional fields like Oh, and adding a required "encoding" field seems sensible, since presumably all the other metadata will be standard between "WKB" and "arrow". |
Thanks for the explanation. Yeah, if we start with just 'encoding' and two options that sounds good. But I do think we should keep an eye on adoption of the two in the early days, and see if we can just get gdal/geos/geotools/js libraries to just all parse the 'new' one, and then have a 1.0.0 that just has one way of doing it. All of the uniting points sound great. We can discuss CRS stuff, but I'm pretty happy with just using WKT CRS, with a very clear default. It's the most comprehensive way to do it. |
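To make the metadata points above concrete, here is a minimal sketch of what file-level metadata under a single "geo" key could look like, with a per-column "encoding" and a WKT CRS. The helper name `build_geo_metadata`, the `primary_column` and `columns` field names, and the CRS placeholder string are assumptions for illustration only, not the agreed spec.

```python
import json


def build_geo_metadata(columns, primary_column):
    """Build an illustrative file-level metadata dict.

    `columns` maps a column name to an (encoding, crs_wkt) pair.
    All field names here are hypothetical, chosen for the example.
    """
    return {
        "primary_column": primary_column,
        "columns": {
            name: {"encoding": encoding, "crs": crs_wkt}
            for name, (encoding, crs_wkt) in columns.items()
        },
    }


# Placeholder only: a real file would carry a complete WKT CRS definition.
CRS_PLACEHOLDER = "<WKT CRS string goes here>"

geo_meta = build_geo_metadata(
    {"geometry": ("WKB", CRS_PLACEHOLDER)},
    primary_column="geometry",
)

# Parquet key-value metadata holds strings, so the dict is serialized
# as JSON and stored under one top-level key ("geo" in this sketch).
file_metadata = {"geo": json.dumps(geo_meta)}

# A reader would decode the JSON and dispatch on the "encoding" field.
decoded = json.loads(file_metadata["geo"])
```

Grouping everything under one file-level key means a reader only has to look in one place, and the "encoding" field is the single switch between the WKB and arrow-native paths.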
And BTW, thanks for getting things rolling again! |
A note on this front of a "geo-arrow" encoding, there is a PR that starts to more formally describe this format at geoarrow/geoarrow#12 |
I was pointed to a similar idea implemented in GeoMesa Parquet Filesystem DataStore. Code links: Parquet Writer Test and WriteSupport Writing MultiPolygon produces following schema:
This looks like a pretty big win. In particular, because this splits x/y into two columns, parquet min/max stats can be used for spatial filtering. One thing I'm not clear on is whether the same parquet schema can be used for all geometry types, or whether one would be forced back to WKB encoding as soon as Points and Lines share the same column. |
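The min/max idea can be sketched roughly as follows: if each row group exposes min/max statistics for the x and y columns, a reader can skip any row group whose bounding range cannot intersect the query box. This is plain Python with a hypothetical `RowGroupStats` container standing in for the statistics; it does not use any real parquet reader API.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for per-row-group min/max stats on x and y."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def may_intersect(stats, bbox):
    """Return False only when the row group cannot contain a matching point.

    `bbox` is (x_min, y_min, x_max, y_max) of the query window.
    """
    qx_min, qy_min, qx_max, qy_max = bbox
    return not (
        stats.x_max < qx_min
        or stats.x_min > qx_max
        or stats.y_max < qy_min
        or stats.y_min > qy_max
    )


def candidate_row_groups(all_stats, bbox):
    """Indices of row groups that still need to be read and filtered exactly."""
    return [i for i, s in enumerate(all_stats) if may_intersect(s, bbox)]


stats = [
    RowGroupStats(0, 10, 0, 10),
    RowGroupStats(20, 30, 20, 30),
    RowGroupStats(5, 25, 5, 25),
]
selected = candidate_row_groups(stats, (12, 12, 18, 18))
```

Note that this is only a coarse pre-filter: a row group that passes the check may still contain no matching coordinates, so an exact test on the decoded values is still needed.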
Yes, GeoMesa uses something similar to the "arrow native" format being described above. You can see the arrow field definitions here, and the parquet ones here. From working with it, the pros and cons to this approach are as follows, in no particular order: Pros:
Cons:
|
I'm excited to see this spec developing! |
Now that #191 has been merged, I'm going to close this. Also refer to the geoarrow spec at https://github.com/geoarrow/geoarrow |
Various members of the Python and R geospatial communities are working on a geo-arrow-spec: a way to store geospatial data in Apache Arrow (and Apache Parquet) format. This issue is to introduce the cdw-geo and geo-arrow-spec groups, and hash out a plan for how to proceed, since there's some overlap between the two groups' goals.
In addition to intros, I wanted to address Why another parquet format?
The geoparquet format will likely store geometries using something like WKB. The geo-arrow-spec hasn't settled on a representation for geometries (see geoarrow/geoarrow#4 and geoarrow/geoarrow#3), but it will likely move away from using WKB to an "arrow native" memory layout. geoarrow/geoarrow#4 (comment) has more information on why, but the short version is that the arrow-native layout a.) doesn't require decoding WKB to use the geometries, b.) keeps coordinates contiguous in memory, and c.) provides random access to geometries without having to parse unnecessary data.
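As a rough illustration of the difference (not the layout the geo-arrow-spec will necessarily adopt), the sketch below encodes a point as WKB with the standard library, and stores linestrings as flat coordinate lists plus an offsets list, in the spirit of nested list arrays. The function names and the choice of separate `xs` / `ys` lists are assumptions for the example.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB.

    Layout: 1 byte byte-order flag (1 = little endian),
    4 byte geometry type (1 = Point), then x and y as float64.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(buf):
    """Decode the WKB produced above; every access pays this parsing step."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian WKB points are handled here")
    return x, y


def lines_to_native(lines):
    """Flatten linestrings into contiguous coordinates plus offsets."""
    xs, ys, offsets = [], [], [0]
    for line in lines:
        for x, y in line:
            xs.append(x)
            ys.append(y)
        offsets.append(len(xs))  # end of this line = start of the next
    return xs, ys, offsets


def get_line(xs, ys, offsets, i):
    """Random access to linestring i by slicing, with no decoding step."""
    start, end = offsets[i], offsets[i + 1]
    return list(zip(xs[start:end], ys[start:end]))


lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
]
xs, ys, offsets = lines_to_native(lines)
second_line = get_line(xs, ys, offsets, 1)
```

With the flat layout, fetching the second linestring is two slices driven by the offsets, whereas the WKB form has to be parsed before its coordinates can be used.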
Some logistical questions (with my recommendations):
cc @jorisvandenbossche from the geo-arrow-spec side.