Proposal for a "geo-arrow" format #1

TomAugspurger · 2021-08-16T11:49:56Z

Various members of the Python and R geospatial communities are working on a geo-arrow-spec: a way to store geospatial data in Apache Arrow (and Apache Parquet) format. This issue is to introduce the cdw-geo and geo-arrow-spec groups, and hash out a plan for how to proceed, since there's some overlap between the two groups' goals.

In addition to intros, I wanted to address the question: why another parquet format?

The geoparquet format will likely store geometries using something like WKB. The geo-arrow-spec hasn't settled on a representation for geometries (see geoarrow/geoarrow#4 and geoarrow/geoarrow#3), but it will likely move away from using WKB to an "arrow native" memory layout. geoarrow/geoarrow#4 (comment) has more information on why, but the short version is that the arrow-native layout a.) doesn't require decoding WKB to use the geometries, b.) keeps coordinates contiguous in memory, and c.) provides random access to geometries without having to parse unnecessary data.
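To make the difference concrete, here is a minimal Python sketch (standard library only) comparing the two approaches for linestrings. The helper names (`to_wkb_linestring`, `from_wkb_linestring`, `to_offsets_layout`, `get_geometry`) are made up for illustration, and the offsets-plus-flat-coordinates layout is only one plausible shape for an "arrow native" encoding; the actual layout is still being worked out in geoarrow/geoarrow#4.

```python
import struct
from array import array

# Three linestrings as lists of (x, y) tuples.
lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
    [(5.0, 5.0), (6.0, 6.0)],
]


def to_wkb_linestring(coords):
    # Little-endian WKB: byte-order flag (1), geometry type 2 (LineString),
    # number of points, then one pair of doubles per point.
    out = struct.pack("<BII", 1, 2, len(coords))
    for x, y in coords:
        out += struct.pack("<dd", x, y)
    return out


def from_wkb_linestring(buf):
    # Every read has to parse the 9-byte header before touching coordinates.
    byte_order, geom_type, n_points = struct.unpack_from("<BII", buf, 0)
    if byte_order != 1 or geom_type != 2:
        raise ValueError("sketch only handles little-endian LineString WKB")
    return [struct.unpack_from("<dd", buf, 9 + 16 * i) for i in range(n_points)]


def to_offsets_layout(geoms):
    # Assumed "native" layout: one flat, contiguous buffer of doubles
    # (x0, y0, x1, y1, ...) plus an offsets buffer counted in points.
    coords = array("d")
    offsets = array("q", [0])
    for geom in geoms:
        for x, y in geom:
            coords.extend((x, y))
        offsets.append(offsets[-1] + len(geom))
    return coords, offsets


def get_geometry(coords, offsets, i):
    # Random access: two offset lookups, then a slice. No parsing step.
    start, stop = offsets[i], offsets[i + 1]
    flat = coords[2 * start:2 * stop]
    return list(zip(flat[0::2], flat[1::2]))


wkb_column = [to_wkb_linestring(g) for g in lines]
coords, offsets = to_offsets_layout(lines)
```

With the WKB column, reading geometry 1 means decoding `wkb_column[1]`; with the offsets layout, `get_geometry(coords, offsets, 1)` slices directly into the shared coordinate buffer.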

Some logistical questions (with my recommendations):

Would a version of the geo-arrow-spec be appropriate for inclusion in cdw-geo? (I think so, as long as we think having two parquet formats won't confuse people)
geo-arrow-spec still needs to work out some details, including the memory layout of geometries. Where should development on geo-arrow-spec happen? (I think keep it in geo-arrow-spec, since not everyone in cdw-geo will care. Once it's more stable, we can consider moving it out of geopandas/geo-arrow-spec and into this repository, if both groups agree).

cc @jorisvandenbossche from the geo-arrow-spec side.

cholmes · 2021-09-14T22:18:21Z

Sorry for the slow response on this, thanks for creating the issue. Great to meet you @jorisvandenbossche - it's awesome work you're doing.

I definitely think geo-arrow-spec makes sense for inclusion in cdw-geo. Though I think eventually this repository may drop away in favor of actual specs. It does seem like it'd make sense to eventually have one 'geoparquet' place, where it explains both formats, if that's what ultimately ends up making sense.

And I agree on keeping development of geo-arrow-spec in the geopandas repository.

I'm definitely interested to see the evolution of the geo-arrow-spec and the arrow-native layout. I think it does make sense for this effort to start with WKB, since most vendors already use it, but I can imagine some might be interested in a more efficient format if they start to try to read data in place.

TomAugspurger · 2021-10-13T13:26:20Z

Picking this up again. My plan right now is to do the "easy" geoparquet format first (i.e. not geo-arrow) that

Stores geometries in WKB
Defines where to put additional geometadata (which columns are geometries, the CRS, etc.)

It'd be helpful to have sample parquet files of what these systems currently export. Looking at https://github.com/opengeospatial/cdw-geo#current-support that's just Snowflake right now. @cholmes do you know who from Snowflake to ask for a sample file?

cholmes · 2021-10-13T14:05:57Z

Awesome.

@cholmes do you know who from Snowflake to ask for a sample file?

Yes, not sure of his github user name, but I'll introduce you on email.

TomAugspurger · 2021-10-13T14:17:30Z

I took a quick pass at this in #2.

jorisvandenbossche · 2021-10-14T23:09:42Z

Nice to meet you as well @cholmes, and sorry for the slow reply. And thanks Tom for getting this moving again!

For the WKB vs "arrow native" (i.e. nested list arrays) question: we might want to keep both options long-term. The metadata we use in GeoPandas currently includes an "encoding": "WKB" entry per column, which allows specifying different ways that the geometries are actually stored in the geometry column.
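As a rough illustration of how a per-column "encoding" entry lets a reader support more than one storage format, here is a small Python sketch. Only the `"encoding": "WKB"` pair comes from the GeoPandas metadata mentioned above; the `"native-point"` label, the decoder functions and `decode_column` are hypothetical names used for the example, and only points are handled.

```python
import struct


def decode_wkb_point(value):
    # Little-endian WKB point: byte-order flag, geometry type 1, then x and y.
    byte_order, geom_type, x, y = struct.unpack("<BIdd", value)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("sketch only handles little-endian WKB points")
    return (x, y)


def decode_native_point(value):
    # Hypothetical native encoding: the value is already an (x, y) pair.
    x, y = value
    return (float(x), float(y))


# Map from the metadata's "encoding" value to a decoder.
# "native-point" is an invented label, not something defined by any spec.
DECODERS = {
    "WKB": decode_wkb_point,
    "native-point": decode_native_point,
}


def decode_column(values, column_metadata):
    # The reader looks at the column's metadata to decide how to interpret it.
    encoding = column_metadata["encoding"]
    try:
        decoder = DECODERS[encoding]
    except KeyError:
        raise ValueError(f"unsupported geometry encoding: {encoding!r}") from None
    return [decoder(v) for v in values]


wkb_values = [struct.pack("<BIdd", 1, 1, 1.5, 2.5)]
points_from_wkb = decode_column(wkb_values, {"encoding": "WKB"})
points_from_native = decode_column([[3, 4]], {"encoding": "native-point"})
```

A reader that does not recognise the encoding fails loudly instead of misinterpreting the bytes, which is the main benefit of carrying the field explicitly.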

jorisvandenbossche · 2021-10-14T23:29:53Z

Looking at #2, I have a general question: currently it defines a metadata standard that is quite close to what we did in GeoPandas / geo-arrow-spec, but not exactly. So the existing geo-parquet files written by geopandas (and R's sfarrow) are not actually compatible with this.
(I know the geo-arrow-spec is also about in-memory Arrow data, but in practice it's currently mostly used for Parquet files).

I think ideally we only want a single metadata specification for Parquet on the long term? So that means we need to discuss the differences / see how the two can be aligned?

cholmes · 2021-10-15T04:16:03Z

For the WKB vs "arrow native" (i.e. nested list arrays) question: we might want to keep both options long-term. The metadata we use in GeoPandas currently includes an "encoding": "WKB" entry per column, which allows specifying different ways that the geometries are actually stored in the geometry column.

Ah cool, this makes sense to me. Though lately I've been wondering if there are any advantages to the WKB? Is it just that there's more parsers today that understand WKB? Is it that hard to parse/understand the 'arrow native' one? Or are there other downsides?

I think ideally we only want a single metadata specification for Parquet on the long term? So that means we need to discuss the differences / see how the two can be aligned?

Yes, fully agree we only want a single metadata specification for geo in parquet. What are the differences? We're basically starting from scratch, so happy to look to you for the start, and help promote what you've done and get others reading/writing the same format, and being open to evolving it if their requirements are a bit different. Where can I read up more on what you've done so far?

jorisvandenbossche · 2021-10-15T12:04:58Z

I've been wondering if there are any advantages to the WKB? Is it just that there's more parsers today that understand WKB? Is it that hard to parse/understand the 'arrow native' one? Or are there other downsides?

I think it was mostly because 1) WKB is indeed ubiquitous and almost everybody should be able to understand it, and 2) to get started with something (like Tom did here as well), so we could actually add parquet IO functionality to GeoPandas.

The "arrow native" one shouldn't necessarily be hard to understand, since it is "just numbers", but it's not a typical format. E.g. for Python / shapely, there is not a single function that can parse it at the moment (you need some custom code to do it; e.g. in pygeos it currently looks like this (pygeos/pygeos#93 (comment)) to do it efficiently). But it has been on my to-do list for a long time to actually implement those to make this easier.
In addition to that, there are also a few open questions to decide on (eg include Z dimension with XY or keep separate), in geoarrow/geoarrow#3 and geoarrow/geoarrow#4. That's mainly blocked on someone taking the time to move this forward (I hope to find that time the coming month).

For the differences in metadata between what we have in geopandas vs the first draft here: one thing is the question of where to put column-level metadata (grouped together in the file metadata, or separate in each column's metadata, cf. #2 (comment)). Another is the way to store CRS information (#3, we are currently using WKT), and the top-level key ("geo" vs "geoparquet"). We can maybe create specific issues for the different questions.
In the current geo-arrow-spec version, we also store some additional information (optional bbox, library name+version that created it).
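For readers who haven't seen this kind of metadata, the following Python sketch shows roughly what "column-level metadata grouped together in the file metadata" could look like. The ingredients (a top-level "geo" key, per-column encoding, WKT CRS, optional bbox, creating library name and version) are the ones listed above; the exact key names (`primary_column`, `columns`, `creator`, `library`, `version`), the placeholder values and the helper functions are assumptions made for the example, not a statement of what either spec requires.

```python
import json


def build_geo_metadata(columns, primary_column, creator):
    # All geometry-column metadata is grouped in one file-level document.
    return {
        "primary_column": primary_column,
        "columns": columns,
        "creator": creator,
    }


metadata = build_geo_metadata(
    columns={
        "geometry": {
            "encoding": "WKB",
            # Placeholder: a real file would carry the full WKT string here.
            "crs": "<WKT definition of the CRS>",
            # Optional bounding box: [xmin, ymin, xmax, ymax].
            "bbox": [-180.0, -90.0, 180.0, 90.0],
        }
    },
    primary_column="geometry",
    creator={"library": "example-writer", "version": "0.0.0"},
)

# File-level key/value metadata holds byte strings, so the whole document
# is serialized as JSON under a single top-level key.
file_metadata = {b"geo": json.dumps(metadata).encode("utf-8")}


def geometry_columns(file_metadata, key=b"geo"):
    # A reader only needs to know the top-level key to find every geometry column.
    return sorted(json.loads(file_metadata[key])["columns"])
```

Switching the top-level key between "geo" and "geoparquet" only changes the `key` argument here, which is why agreeing on one name matters more than which name it is.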

TomAugspurger · 2021-10-15T14:00:02Z

I'm happy to move the column metadata to the file level. I didn't realize geo-arrow-spec put it at the file level until after my first write up (@jorisvandenbossche do you have a sense for whether using both file and column-level metadata is common in parquet?).

I picked "geoparquet" as the top-level key to avoid clashing with geo-arrow-spec for now, but I agree that long-term we want just one standard. If we can get everyone on the same page, then just using "geo" as the key seems reasonable.

Maybe we talk through CRS / epsg stuff in #3?

Adding in those additional fields like bbox seems reasonable. I kept the very first draft here small to just get things rolling.

Oh, and adding a required "encoding" field seems sensible, since presumably all the other metadata will be standard between "WKB" and "arrow".

cholmes · 2021-10-15T14:19:40Z

I think it was mostly because 1) WKB is indeed ubiquitous and almost everybody should be able to understand it, and 2) to get started with something (like Tom did here as well), so we could actually add parquet IO functionality to GeoPandas.

Thanks for the explanation. Yeah, if we start with just 'encoding' and two options that sounds good. But I do think we should keep an eye on adoption of the two in the early days, and see if we can just get gdal/geos/geotools/js libraries to just all parse the 'new' one, and then have a 1.0.0 that just has one way of doing it.

All of the uniting points sound great. We can discuss CRS stuff, but I'm pretty happy with just using WKT CRS, with a very clear default. It's the most comprehensive way to do it.

jorisvandenbossche · 2021-10-21T20:05:41Z

I kept the very first draft here small to just get things rolling.

And BTW, thanks for getting things rolling again!

jorisvandenbossche · 2022-02-22T15:15:40Z

A note on this front of a "geo-arrow" encoding, there is a PR that starts to more formally describe this format at geoarrow/geoarrow#12

echeipesh · 2022-04-06T19:26:39Z

I was pointed to a similar idea implemented in the GeoMesa Parquet Filesystem DataStore.

Code links: Parquet Writer Test and WriteSupport
cc: @jnh5y @elahrvivaz

Writing a MultiPolygon produces the following schema:

❯ pqrs schema geomesa.parquet
Metadata for file: geomesa.parquet

version: 1
num of rows: 45962
created by: parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
metadata:
  geomesa.fs.sft.name: level2
  geomesa.parquet.version: 1
  writer.model.name: SimpleFeatureWriteSupport
  geomesa.fs.sft.spec: *geom:MultiPolygon:org.geotools.jdbc.nativeTypeName=MULTIPOLYGON:org.geotools.jdbc.nativeType=12:hasGeopkgSpatialIndex=true:nativeSRID=4326:COORDINATE_DIMENSION=2,GID_0:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NAME_0:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,GID_1:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NAME_1:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NL_NAME_1:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,GID_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NAME_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,VARNAME_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,NL_NAME_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,TYPE_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,ENGTYPE_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,CC_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT,HASC_2:String:org.geotools.jdbc.nativeType=12:org.geotools.jdbc.nativeTypeName=TEXT
message level2 {
  OPTIONAL group geom {
    REQUIRED group x (LIST) {
      REPEATED group list {
        REQUIRED group element (LIST) {
          REPEATED group list {
            REPEATED DOUBLE element;
          }
        }
      }
    }
    REQUIRED group y (LIST) {
      REPEATED group list {
        REQUIRED group element (LIST) {
          REPEATED group list {
            REPEATED DOUBLE element;
          }
        }
      }
    }
  }
}

This looks like a pretty big win. In particular, because this splits x/y into two columns, parquet min/max stats can be used for spatial filtering.

One thing I'm not clear on is whether the same parquet schema can be used for all geometry types, or whether one would be forced back to WKB encoding as soon as Points and Lines share the same column.
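A minimal Python sketch of the idea, assuming each chunk (row group) exposes min/max values for its x and y columns. `ChunkStats` and `chunks_to_read` are hypothetical names, and this is plain Python rather than any real parquet reader API.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    # Per-chunk min/max statistics for the separate x and y columns.
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def chunks_to_read(stats, query_bbox):
    # query_bbox is (xmin, ymin, xmax, ymax).
    qx_min, qy_min, qx_max, qy_max = query_bbox
    keep = []
    for i, s in enumerate(stats):
        overlaps_x = s.x_min <= qx_max and s.x_max >= qx_min
        overlaps_y = s.y_min <= qy_max and s.y_max >= qy_min
        # A chunk can be skipped only if its extent misses the query box.
        if overlaps_x and overlaps_y:
            keep.append(i)
    return keep


stats = [
    ChunkStats(0.0, 10.0, 0.0, 10.0),
    ChunkStats(20.0, 30.0, 0.0, 10.0),
    ChunkStats(5.0, 25.0, 5.0, 25.0),
]
selected = chunks_to_read(stats, (8.0, 8.0, 12.0, 12.0))
```

The statistics are conservative: a selected chunk may still contain no matching geometry, so an exact spatial filter is needed on whatever gets read.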

elahrvivaz · 2022-04-06T19:44:36Z

Yes, GeoMesa uses something similar to the "arrow native" format being described above. You can see the arrow field definitions here, and the parquet ones here.

From working with it, the pros and cons to this approach are as follows, in no particular order:

Pros:

Don't need WKB parsing code to read the files
Can take advantage of native column encoding/compression/chunk skipping
Can push down predicates into native (parquet) filters

Cons:

Have to fall back to WKB when dealing with different geometry types in a single column
Doesn't interop well with Spark's parquet writing, which doesn't handle nested/repeated fields (last I checked)
Doesn't account for Z/M values (although could be extended to do so)
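The first con can be expressed as a simple writer-side rule. This is a hedged Python sketch, not GeoMesa's actual logic: `choose_encoding` and the `"native"` label are invented for illustration, and nulls are assumed not to count as a geometry type.

```python
def choose_encoding(geometry_types):
    # geometry_types: one type name per row, e.g. "Point" or "MultiPolygon";
    # None stands for a null geometry and is ignored.
    distinct = {t for t in geometry_types if t is not None}
    if len(distinct) <= 1:
        # A single geometry type fits one fixed nested schema.
        return "native"
    # Mixed types in one column: fall back to WKB.
    return "WKB"


homogeneous = choose_encoding(["MultiPolygon", None, "MultiPolygon"])
mixed = choose_encoding(["Point", "LineString"])
```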

thomcom · 2022-05-12T14:56:01Z

I'm excited to see this spec developing!

kylebarron · 2024-03-25T18:01:36Z

Now that #191 has been merged, I'm going to close this. Also refer to the geoarrow spec at https://github.com/geoarrow/geoarrow

alasarr mentioned this issue Mar 27, 2022

Define polygon orientation rules #46

Closed

kylebarron mentioned this issue May 12, 2022

Exploration of native parquet predicate pushdown support with arrow-native geometries geoarrow/geoarrow#20

Closed

ammojamo mentioned this issue Mar 15, 2023

Does GeoPandas actually support this spec anymore? geoarrow/geoarrow#36

Closed

paleolimbot mentioned this issue Sep 14, 2023

Metadata encoding options for GeoArrow-encoded columns in GeoParquet metadata #185

Closed

cholmes mentioned this issue Dec 4, 2023

Introduce bounding box column definition #191

Merged

kylebarron closed this as completed Mar 25, 2024
