Proposal for a "geo-arrow" format #1

Closed
TomAugspurger opened this issue Aug 16, 2021 · 16 comments

Comments

@TomAugspurger
Collaborator

TomAugspurger commented Aug 16, 2021

Various members of the Python and R geospatial communities are working on a geo-arrow-spec: a way to store geospatial data in Apache Arrow (and Apache Parquet) format. This issue is to introduce the cdw-geo and geo-arrow-spec groups, and hash out a plan for how to proceed, since there's some overlap between the two groups' goals.

In addition to intros, I wanted to address the question "Why another parquet format?"

The geoparquet format will likely store geometries using something like WKB. The geo-arrow-spec hasn't settled on a representation for geometries (see geoarrow/geoarrow#4 and geoarrow/geoarrow#3), but it will likely move away from using WKB to an "arrow native" memory layout. geoarrow/geoarrow#4 (comment) has more information on why, but the short version is that the arrow-native layout a) doesn't require decoding WKB to use the geometries, b) keeps coordinates contiguous in memory, and c) provides random access to geometries without having to parse unnecessary data.
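
To make those three points concrete, here is a minimal sketch in plain Python (standard library only, not Arrow itself). It assumes linestrings, interleaved x/y values, and a list of vertex offsets; the real geo-arrow memory layout was still being discussed in the linked issues, so the names `to_wkb`, `from_wkb`, `to_offsets_layout`, and `get_line` are purely illustrative.

```python
import struct

# Three made-up linestrings with 2, 3 and 2 vertices.
lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
    [(5.0, 5.0), (6.0, 6.0)],
]


def to_wkb(line):
    """Encode one linestring as little-endian WKB."""
    # Header: byte order (1 = little endian), type (2 = LineString), vertex count.
    out = struct.pack("<BII", 1, 2, len(line))
    for x, y in line:
        out += struct.pack("<dd", x, y)
    return out


def from_wkb(buf):
    """Decode a little-endian WKB linestring; every read needs this parse step."""
    byte_order, geom_type, count = struct.unpack_from("<BII", buf, 0)
    if byte_order != 1 or geom_type != 2:
        raise ValueError("only little-endian LineString is handled in this sketch")
    flat = struct.unpack_from("<%dd" % (2 * count), buf, 9)
    return list(zip(flat[0::2], flat[1::2]))


def to_offsets_layout(lines):
    """Flatten all vertices into one coordinate list plus per-geometry offsets."""
    offsets, coords = [0], []
    for line in lines:
        for x, y in line:
            coords.extend((x, y))
        offsets.append(offsets[-1] + len(line))
    return offsets, coords


def get_line(offsets, coords, i):
    """Random access to geometry i: slice the shared buffer, no decoding."""
    start, stop = offsets[i], offsets[i + 1]
    xy = coords[2 * start:2 * stop]
    return list(zip(xy[0::2], xy[1::2]))


wkb_column = [to_wkb(line) for line in lines]
offsets, coords = to_offsets_layout(lines)
```

In the WKB column each value is an opaque byte string that has to be parsed before use, whereas in the offsets layout all coordinates sit in one contiguous list and `get_line(offsets, coords, 1)` reaches the second geometry by slicing alone.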


Some logistical questions (with my recommendations):

  1. Would a version of the geo-arrow-spec be appropriate for inclusion in cdw-geo? (I think so, as long as we think having two parquet formats won't confuse people)
  2. geo-arrow-spec still needs to work out some details, including the memory layout of geometries. Where should development on geo-arrow-spec happen? (I think we should keep it in geo-arrow-spec, since not everyone in cdw-geo will care. Once it's more stable, we can consider moving it out of geopandas/geo-arrow-spec and into this repository, if both groups agree).

cc @jorisvandenbossche from the geo-arrow-spec side.

@cholmes
Member

cholmes commented Sep 14, 2021

Sorry for the slow response on this, thanks for creating the issue. Great to meet you @jorisvandenbossche - it's awesome work you're doing.

I definitely think geo-arrow-spec makes sense for inclusion in cdw-geo. Though I think eventually this repository may drop away in favor of actual specs. It does seem like it'd make sense to eventually have one 'geoparquet' place, where it explains both formats, if that's what ultimately ends up making sense.

And I agree on keeping development of geo-arrow-spec in the geopandas repository.

I'm definitely interested to see the evolution of the geo-arrow-spec and the arrow-native layout. I think it does make sense for this effort to start with WKB, since most vendors already use it, but I can imagine some being interested in a more efficient format once they start trying to read data in place.

@TomAugspurger
Collaborator Author

Picking this up again. My plan right now is to do the "easy" geoparquet format first (i.e. not geo-arrow) that

  1. Stores geometries in WKB
  2. Defines where to put additional geometadata (which columns are geometries, the CRS, etc.)
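
As a rough sketch of the second item: Parquet file-level key-value metadata holds string values, so one option is a JSON blob stored under a single key. The key name `geoparquet` and the `columns` / `crs` fields below are assumptions for illustration only; the actual layout was still being drafted.

```python
import json

# Hypothetical top-level key and field names, not taken from a published spec.
GEO_KEY = "geoparquet"


def build_file_metadata(geometry_columns):
    """Return key-value metadata describing which columns hold WKB geometries.

    geometry_columns maps a column name to its CRS string.
    """
    payload = {
        "columns": {
            name: {"crs": crs} for name, crs in geometry_columns.items()
        }
    }
    # Serialize to one JSON string so it fits in a single key-value entry.
    return {GEO_KEY: json.dumps(payload, sort_keys=True)}


def geometry_column_names(file_metadata):
    """List the geometry columns recorded in the metadata, or [] if absent."""
    raw = file_metadata.get(GEO_KEY)
    if raw is None:
        return []
    return sorted(json.loads(raw)["columns"])


file_metadata = build_file_metadata({"geometry": "EPSG:4326"})
```

A reader that finds no such key can treat the file as an ordinary Parquet file, which is why `geometry_column_names({})` returns an empty list instead of raising.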

It'd be helpful to have sample parquet files of what these systems currently export. Looking at https://github.com/opengeospatial/cdw-geo#current-support that's just Snowflake right now. @cholmes do you know who from Snowflake to ask for a sample file?

@cholmes
Member

cholmes commented Oct 13, 2021

Awesome.

> @cholmes do you know who from Snowflake to ask for a sample file?

Yes, not sure of his github user name, but I'll introduce you on email.

@TomAugspurger
Collaborator Author

I took a quick pass at this in #2.

@jorisvandenbossche
Collaborator

Nice to meet you as well @cholmes, and sorry for the slow reply. And thanks Tom for getting this moving again!

For the WKB vs "arrow native" (i.e. nested list arrays) question: we might want to keep both options long-term. The metadata we use in GeoPandas currently includes an "encoding": "WKB" entry per column, which allows specifying different ways in which the geometries are actually stored in the geometry column.
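
A reader could dispatch on that per-column field roughly as follows. Only the "encoding": "WKB" entry comes from the GeoPandas metadata; the `native` encoding name, the two decoders, and the restriction to points are assumptions made to keep the sketch short.

```python
import struct


def decode_wkb_point(value):
    """Decode a little-endian WKB Point into an (x, y) tuple."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", value)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian Point is handled in this sketch")
    return (x, y)


def decode_native_point(value):
    """A 'native' point is already a pair of numbers; nothing to parse."""
    x, y = value
    return (float(x), float(y))


# Map each encoding name to its decoder. "WKB" is the value GeoPandas writes;
# "native" is a placeholder name for a future Arrow-native encoding.
DECODERS = {"WKB": decode_wkb_point, "native": decode_native_point}


def decode_column(values, column_metadata):
    """Decode a geometry column according to its declared encoding."""
    encoding = column_metadata["encoding"]
    try:
        decoder = DECODERS[encoding]
    except KeyError:
        raise ValueError("unsupported geometry encoding: %r" % encoding)
    return [decoder(v) for v in values]


wkb_values = [struct.pack("<BIdd", 1, 1, 4.9, 52.4)]
native_values = [[4.9, 52.4]]
```

Both `decode_column(wkb_values, {"encoding": "WKB"})` and `decode_column(native_values, {"encoding": "native"})` yield the same point, so code downstream of the reader does not need to know how the column was stored.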

@jorisvandenbossche
Collaborator

Looking at #2, I have a general question: currently it defines a metadata standard that is quite close to what we did in GeoPandas / geo-arrow-spec, but not exactly. So the existing geo-parquet files written by geopandas (and R's sfarrow) are not actually compatible with this.
(I know the geo-arrow-spec is also about in-memory Arrow data, but in practice it's currently mostly used for Parquet files).

I think ideally we only want a single metadata specification for Parquet in the long term? So that means we need to discuss the differences and see how the two can be aligned?

@cholmes
Member

cholmes commented Oct 15, 2021

> For the WKB vs "arrow native" (i.e. nested list arrays) question: we might want to keep both options long-term. The metadata we use in GeoPandas currently includes an "encoding": "WKB" entry per column, which allows specifying different ways in which the geometries are actually stored in the geometry column.

Ah cool, this makes sense to me. Though lately I've been wondering if there are any advantages to the WKB? Is it just that there's more parsers today that understand WKB? Is it that hard to parse/understand the 'arrow native' one? Or are there other downsides?

> I think ideally we only want a single metadata specification for Parquet in the long term? So that means we need to discuss the differences and see how the two can be aligned?

Yes, fully agree we only want a single metadata specification for geo in parquet. What are the differences? We're basically starting from scratch, so happy to look to you for the start, and help promote what you've done and get others reading/writing the same format, and being open to evolving it if their requirements are a bit different. Where can I read up more on what you've done so far?

@jorisvandenbossche
Collaborator

> I've been wondering if there are any advantages to the WKB? Is it just that there's more parsers today that understand WKB? Is it that hard to parse/understand the 'arrow native' one? Or are there other downsides?

I think it was mostly because 1) WKB is indeed ubiquitous and almost everybody should be able to understand it, and 2) to get started with something (like Tom did here as well), so we could actually add parquet IO functionality to GeoPandas.

The "arrow native" one shouldn't necessarily be hard to understand, since it is "just numbers", but it's not a typical format. E.g. for Python / shapely, there is not a single function that can parse it at the moment (you need some custom code to do it; e.g. in pygeos it currently looks like this (pygeos/pygeos#93 (comment)) to do it efficiently). But it has been on my to-do list for a long time to actually implement those to make this easier.
In addition to that, there are also a few open questions to decide on (eg include Z dimension with XY or keep separate), in geoarrow/geoarrow#3 and geoarrow/geoarrow#4. That's mainly blocked on someone taking the time to move this forward (I hope to find that time the coming month).

For the differences in metadata between what we have in geopandas vs the first draft here: one thing is the question of where to put column-level metadata (grouped together in the file metadata, or separately in each column's metadata, cf. #2 (comment)). Another is the way to store CRS information (#3, we are currently using WKT), and the top-level key ("geo" vs "geoparquet"). We can maybe create specific issues for the different questions.
In the current geo-arrow-spec version, we also store some additional information (optional bbox, library name+version that created it).

@TomAugspurger
Collaborator Author

TomAugspurger commented Oct 15, 2021

I'm happy to move the column metadata to the file level. I didn't realize geo-arrow-spec put it at the file level until after my first write up (@jorisvandenbossche do you have a sense for whether using both file and column-level metadata is common in parquet?).

I picked "geoparquet" as the top-level key to avoid clashing with geo-arrow-spec for now, but I agree that long-term we want just one standard. If we can get everyone on the same page, then just using "geo" as the key seems reasonable.

Maybe we talk through CRS / epsg stuff in #3?

Adding in those additional fields like bbox seems reasonable. I kept the very first draft here small to just get things rolling.

Oh, and adding a required "encoding" field seems sensible, since presumably all the other metadata will be standard between "WKB" and "arrow".

@cholmes
Member

cholmes commented Oct 15, 2021

> I think it was mostly because 1) WKB is indeed ubiquitous and almost everybody should be able to understand it, and 2) to get started with something (like Tom did here as well), so we could actually add parquet IO functionality to GeoPandas.

Thanks for the explanation. Yeah, if we start with just 'encoding' and two options that sounds good. But I do think we should keep an eye on adoption of the two in the early days, and see if we can just get gdal/geos/geotools/js libraries to just all parse the 'new' one, and then have a 1.0.0 that just has one way of doing it.

All of the uniting points sound great. We can discuss CRS stuff, but I'm pretty happy with just using WKT CRS, with a very clear default. It's the most comprehensive way to do it.

@jorisvandenbossche
Collaborator

> I kept the very first draft here small to just get things rolling.

And BTW, thanks for getting things rolling again!

@jorisvandenbossche
Collaborator

A note on this front of a "geo-arrow" encoding, there is a PR that starts to more formally describe this format at geoarrow/geoarrow#12

@echeipesh
Collaborator

echeipesh commented Apr 6, 2022

I was pointed to a similar idea implemented in GeoMesa Parquet Filesystem DataStore.

Code links: Parquet Writer Test and WriteSupport
cc: @jnh5y @elahrvivaz

Writing a MultiPolygon produces the following schema:

❯ pqrs schema geomesa.parquet
Metadata for file: geomesa.parquet

version: 1
num of rows: 45962
created by: parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
metadata:
  geomesa.fs.sft.name: level2
  geomesa.parquet.version: 1
  writer.model.name: SimpleFeatureWriteSupport
  geomesa.fs.sft.spec: *geom:MultiPolygon:org.geotools.jdbc.nativeTypeName=MULTIPOLYGON:org.geotools.jdbc.nativeType=12:hasGeopkgSpatialIndex=true:nativeSRID=4326:COORDINATE_DIMENSION=2,GID_0:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NAME_0:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,GID_1:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NAME_1:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NL_NAME_1:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,GID_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NAME_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,VARNAME_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NL_NAME_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,TYPE_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,ENGTYPE_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,CC_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,HASC_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT
message level2 {
  OPTIONAL group geom {
    REQUIRED group x (LIST) {
      REPEATED group list {
        REQUIRED group element (LIST) {
          REPEATED group list {
            REPEATED DOUBLE element;
          }
        }
      }
    }
    REQUIRED group y (LIST) {
      REPEATED group list {
        REQUIRED group element (LIST) {
          REPEATED group list {
            REPEATED DOUBLE element;
          }
        }
      }
    }
  }
}

This looks like a pretty big win. In particular, because this splits x/y into two columns, parquet min/max stats can be used for spatial filtering.
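
The filtering idea can be sketched without any Parquet library. The example below models row groups as plain dictionaries and computes the min/max statistics by hand; it uses points instead of nested MultiPolygon coordinates, and all function names are hypothetical.

```python
def column_stats(values):
    """Min/max of one column chunk, like the statistics kept per row group."""
    return {"min": min(values), "max": max(values)}


def build_row_groups(groups):
    """Store x and y as separate columns and record statistics for each."""
    out = []
    for points in groups:
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        out.append({"x": xs, "y": ys,
                    "stats": {"x": column_stats(xs), "y": column_stats(ys)}})
    return out


def may_intersect(stats, bbox):
    """True if the row group's extent overlaps the box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    return (stats["x"]["min"] <= xmax and stats["x"]["max"] >= xmin and
            stats["y"]["min"] <= ymax and stats["y"]["max"] >= ymin)


def query(row_groups, bbox):
    """Skip row groups whose statistics rule them out, then filter the rest."""
    xmin, ymin, xmax, ymax = bbox
    hits, scanned = [], 0
    for group in row_groups:
        if not may_intersect(group["stats"], bbox):
            continue  # pruned without touching the coordinate data
        scanned += 1
        for x, y in zip(group["x"], group["y"]):
            if xmin <= x <= xmax and ymin <= y <= ymax:
                hits.append((x, y))
    return hits, scanned


row_groups = build_row_groups([
    [(0.0, 0.0), (1.0, 2.0)],      # extent x: 0..1, y: 0..2
    [(10.0, 10.0), (12.0, 11.0)],  # extent x: 10..12, y: 10..11
])
hits, scanned = query(row_groups, (9.0, 9.0, 11.0, 11.0))
```

Here only the second row group is scanned, since the first one's x maximum lies below the query box. With WKB in a single binary column there are no comparable per-axis statistics to prune on.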

One thing I'm not clear on is whether the same parquet schema can be used for all geometry types, or whether one would be forced back to WKB encoding as soon as Points and Lines share the same column.

@elahrvivaz

Yes, GeoMesa uses something similar to the "arrow native" format being described above. You can see the arrow field definitions here, and the parquet ones here.

From working with it, the pros and cons to this approach are as follows, in no particular order:

Pros:

  • Don't need WKB parsing code to read the files
  • Can take advantage of native column encoding/compression/chunk skipping
  • Can push down predicates into native (parquet) filters

Cons:

  • Have to fall back to WKB when dealing with different geometry types in a single column
  • Doesn't interop well with Spark's parquet writing, which doesn't handle nested/repeated fields (last I checked)
  • Doesn't account for Z/M values (although could be extended to do so)
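
The first con amounts to a writer-side decision, which might look like the sketch below. The function name and the `native` label are hypothetical; this is not GeoMesa's actual logic, only an illustration of the fallback rule described above.

```python
def choose_encoding(geometry_types):
    """Pick a column encoding from the geometry types present in the column.

    A single nested-list schema fits one geometry type, so a column with
    mixed types falls back to WKB.
    """
    distinct = {t for t in geometry_types if t is not None}  # ignore nulls
    if len(distinct) <= 1:
        return "native"
    return "WKB"
```

For example, `choose_encoding(["MultiPolygon", "MultiPolygon"])` keeps the nested-list layout, while `choose_encoding(["Point", "LineString"])` falls back to WKB.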

@thomcom

thomcom commented May 12, 2022

I'm excited to see this spec developing!

@kylebarron
Collaborator

Now that #191 has been merged, I'm going to close this. Also refer to the geoarrow spec at https://github.com/geoarrow/geoarrow
