Update geoparquet spec #2

Merged
3 commits merged into opengeospatial:main on Oct 14, 2021

Conversation

TomAugspurger (Collaborator)

First pass at a spec for geoparquet format. Lots of open questions, but the basic idea is to encode geometry columns as WKB, and set a standard place for some additional metadata (like the CRS).
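For illustration, here is a minimal sketch of that idea with shapely and pyarrow. Where exactly the extra metadata lives (file-level key vs. per-column metadata) is one of the open questions, so the `geo` key and its layout below are just assumptions, not part of the draft:

```python
# Illustrative sketch only: encode geometries as WKB and stash extra metadata
# (e.g. the CRS) under an assumed file-level "geo" key.
import json

import pyarrow as pa
import pyarrow.parquet as pq
from shapely.geometry import Point

geometries = [Point(0, 0), Point(1, 1)]
wkb_column = [geom.wkb for geom in geometries]  # WKB-encoded geometry column

table = pa.table({"id": [1, 2], "geometry": wkb_column})
geo_metadata = {"columns": {"geometry": {"encoding": "WKB", "epsg": 4326}}}
table = table.replace_schema_metadata({b"geo": json.dumps(geo_metadata).encode("utf-8")})

pq.write_table(table, "example.parquet")
```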

format-specs/geoparquet.md

| Field Name | Type | Description |
| ---------- | ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| epsg | integer\| null | **REQUIRED.** [EPSG code](http://www.epsg-registry.org/) of the datasource |
Collaborator Author

Mostly copied from https://github.com/stac-extensions/projection.

Do we want this here? Or do we keep the core spec lean and leave this to extensions?

Member

I've just thought about this for 5-10 minutes, changing my opinion several times. It's probably worth putting this in its own issue for discussion.

I think my current take is to put a field in core that defaults to WGS-84 / lat,long / 4326, and extensions can be used to add more options.

I think the STAC approach is overkill: it's meant to give people all the options, and most would only use one. And the JSON one, I think, makes less sense inside a non-JSON format. For the immediate term I think we just use epsg, and say it's required and must be set to 4326, and other options may be defined in extensions. And not include projjson or wkt2.
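(As a rough illustration of that proposal, the per-geometry-column metadata could then be as minimal as the following; the exact layout is still open.)

```json
{
  "epsg": 4326
}
```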

Collaborator Author

Sounds good.

> For the immediate term I think we just use epsg, and say it's required and must be set to 4326, and other options may be defined in extensions.

I vaguely recall discussions about non-geospatial geometries, i.e. no EPSG code / CRS. I don't recall if that was on the opening call or in another context, but should we allow epsg: null to allow for this use case?

Member

I personally don't like allowing non-spatial geometries. I say start without it, and if a real user has a case for it then we can consider it. But my take is they are welcome to use the same ideas in the spec and make their own file; I'm happy for some 'geometry parquet' standard that is very close to ours to exist. But I think we should start with locating things with a CRS. In STAC I think we said you could only use 'null' if you used one of the other proj constructs that provide the CRS info.


## Column metadata

Each geometry column should include additional metadata
Collaborator Author

I'm just noticing that geoarrow/geoarrow#2 puts the per-column metadata in the file metadata under geo.columns, which is a mapping from column name to additional metadata.

@brendan-ward do you know why that uses file metadata, rather than putting it in the column metadata for each column? I suppose it's a bit easier for tools to parse when it's in the file metadata, since they don't have to check the metadata for each column?
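(For reference, a rough sketch of what reading that file-level layout could look like with pyarrow; the `geo` key and `columns` mapping follow geoarrow/geoarrow#2 and aren't settled here.)

```python
# Rough sketch: read per-column geometry metadata from the file-level
# key/value metadata, assuming a "geo" key holding a "columns" mapping.
import json

import pyarrow.parquet as pq

schema = pq.read_schema("examples/geoparquet/example.parquet")
geo = json.loads(schema.metadata[b"geo"])
print(geo["columns"]["geometry"])
```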

Contributor

If I recall correctly, we were trying to parallel what we saw pandas doing when creating pandas-specific metadata for a parquet table (including data types of each column), which also does this at the file level. I don't see an easy way to interact with column-level metadata using pyarrow, which is what we used in geopandas to implement geoparquet support.

That said, suggestions on geo-arrow-spec for better ways to handle column-level metadata are certainly welcome!

Collaborator Author

Thanks for the context. I'm tempted to just follow the lead of geopandas and pandas here.

FWIW, pyarrow fortunately does read in the column metadata, and it can be accessed from the Schema.field(column) object:

```python
In [17]: import pyarrow.parquet as pq

In [18]: schema = pq.read_schema("examples/geoparquet/example.parquet")

In [19]: schema.field("geometry").metadata
Out[19]: {b'epsg': b'4326'}
```

Collaborator

Yes, column instead of file (table) metadata could work as well. I am not fully sure, but I think the "easier to parse if everything is together" argument was the reason to do it this way (in addition to the fact that the pandas metadata does it this way as well).

The column/field metadata is accessible as well through the pyarrow interface, as you show, but I think it's less ergonomic to have to check the metadata of each column to see if it is a geometry column.

Now, specifically for Arrow, we might want to define standardized extension types for the geometry columns in the longer term, and that is actually done by using column/field metadata (https://arrow.apache.org/docs/format/Columnar.html#extension-types). That's of course Arrow-specific and not Parquet-specific, but it would translate to column metadata in Parquet as well.
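(A quick sketch of field-level metadata in pyarrow, which is also the mechanism extension types build on; the `geoarrow.wkb` name is only a placeholder, no such type is defined yet.)

```python
# Sketch: attach field-level metadata in Arrow; extension types use the same
# mechanism. "geoarrow.wkb" is a placeholder name, not an existing type.
import pyarrow as pa

field = pa.field(
    "geometry",
    pa.binary(),
    metadata={"ARROW:extension:name": "geoarrow.wkb"},
)
schema = pa.schema([field])
print(schema.field("geometry").metadata)
```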

Collaborator

Additional note on this (I only realized this last week): a good reason we used file-level metadata and not field-level metadata is that Arrow doesn't actually support reading/writing field-level metadata for Parquet.

The Parquet spec has this, so it could be added to the Arrow implementation. But even then, this might be less ideal: in the Parquet spec this lives in the ColumnMetaData of a ColumnChunk of a RowGroup (see https://parquet.apache.org/documentation/latest/#metadata). So the metadata would be repeated for each RowGroup, and a single Parquet file can consist of many row groups.
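(To make that concrete, this is roughly how that per-row-group column metadata surfaces through pyarrow; illustrative only.)

```python
# Sketch: Parquet column metadata hangs off each row group, so anything stored
# there would be repeated once per row group.
import pyarrow.parquet as pq

metadata = pq.ParquetFile("examples/geoparquet/example.parquet").metadata
print(metadata.num_row_groups)
print(metadata.row_group(0).column(0))  # ColumnChunkMetaData for the first column
```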


| Field Name | Type | Description |
| ------------------ | ------ | -------------------------------------------------------------------- |
| geoparquet_version | string | **REQUIRED.** The version of the metadata standard used when writing. |
Member

I meant to ask earlier if you had a version for this, but it looks like you did pick 0.1.0 in your example. We should put some mention at the top about the 'current version', noting that it's an unreleased 0.1.0.
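(For illustration, combining the fields discussed in this thread, the file-level metadata might then look something like the following; the layout is illustrative and not settled.)

```json
{
  "geoparquet_version": "0.1.0",
  "columns": {
    "geometry": {
      "epsg": 4326
    }
  }
}
```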

@cholmes merged commit f1c14d1 into opengeospatial:main on Oct 14, 2021