Update geoparquet spec #2

Merged
3 commits merged into opengeospatial:main on Oct 14, 2021

Conversation

TomAugspurger (Collaborator)

First pass at a spec for geoparquet format. Lots of open questions, but the basic idea is to encode geometry columns as WKB, and set a standard place for some additional metadata (like the CRS).
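For illustration, here is a minimal sketch of that idea with shapely and pyarrow. Where exactly the extra metadata lives (file-level key vs. per-column metadata) is one of the open questions, so the `geo` key and its layout below are just assumptions, not part of the draft:

```python
# Illustrative sketch only: encode geometries as WKB and stash extra metadata
# (e.g. the CRS) under an assumed file-level "geo" key.
import json

import pyarrow as pa
import pyarrow.parquet as pq
from shapely.geometry import Point

geometries = [Point(0, 0), Point(1, 1)]
wkb_column = [geom.wkb for geom in geometries]  # WKB-encoded geometry column

table = pa.table({"id": [1, 2], "geometry": wkb_column})
geo_metadata = {"columns": {"geometry": {"encoding": "WKB", "epsg": 4326}}}
table = table.replace_schema_metadata({b"geo": json.dumps(geo_metadata).encode("utf-8")})

pq.write_table(table, "example.parquet")
```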

format-specs/geoparquet.md

| Field Name | Type | Description |
| ---------- | ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| epsg | integer\| null | **REQUIRED.** [EPSG code](http://www.epsg-registry.org/) of the datasource |
Collaborator Author

Mostly copied from https://github.com/stac-extensions/projection.

Do we want this here? Or do we keep the core spec lean and leave this to extensions?

Member

I've just thought about this for 5-10 minutes, changing my opinion several times. It's probably worth putting this in its own issue for discussion.

I think my current take is to put a field in core that defaults to WGS-84 / lat,long / 4326, and extensions can be used to add more options.

I think the STAC approach is overkill: it's meant to give people all the options, and most would only use one. And the JSON one, I think, makes less sense inside a non-JSON format. For the immediate term I think we just use epsg, and say it's required and must be set to 4326, and other options may be defined in extensions. And not include projjson or wkt2.
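(As a rough illustration of that proposal, the per-geometry-column metadata could then be as minimal as the following; the exact layout is still open.)

```json
{
  "epsg": 4326
}
```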

Collaborator Author

Sounds good.

> For the immediate term I think we just use epsg, and say it's required and must be set to 4326, and other options may be defined in extensions.

I vaguely recall discussions about non-geospatial geometries, i.e. no EPSG code / CRS. I don't recall if that was on the opening call or in another context, but should we allow epsg: null to allow for this use case?

Member

I personally don't like allowing non-spatial geometries. I say start without it, and if a real user has a case for it then we can consider it. But my take is they are welcome to use the same ideas in the spec and make their own file; I'm happy for some 'geometry parquet' standard that is very close to ours to exist. But I think we should start with locating things with a CRS. In STAC I think we said you could only use 'null' if you used one of the other proj constructs that provide the CRS info.


## Column metadata

Each geometry column should include additional metadata
Collaborator Author

I'm just noticing that geoarrow/geoarrow#2 puts the per-column metadata in the file metadata under geo.columns, which is a mapping from column name to additional metadata.

@brendan-ward do you know why that uses file metadata, rather than putting it in the column metadata for each column? I suppose it's a bit easier for tools to parse when it's in the file metadata, since they don't have to check the metadata for each column?
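(For reference, a rough sketch of what reading that file-level layout could look like with pyarrow; the `geo` key and `columns` mapping follow geoarrow/geoarrow#2 and aren't settled here.)

```python
# Rough sketch: read per-column geometry metadata from the file-level
# key/value metadata, assuming a "geo" key holding a "columns" mapping.
import json

import pyarrow.parquet as pq

schema = pq.read_schema("examples/geoparquet/example.parquet")
geo = json.loads(schema.metadata[b"geo"])
print(geo["columns"]["geometry"])
```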

Contributor

If I recall correctly, we were trying to parallel what we saw pandas doing when creating pandas-specific metadata for a parquet table (including data types of each column), which also does this at the file level. I don't see an easy way to interact with column-level metadata using pyarrow, which is what we used in geopandas to implement geoparquet support.

That said, suggestions on geo-arrow-spec for better ways to handle column-level metadata are certainly welcome!

Collaborator Author

Thanks for the context. I'm tempted to just follow the lead of geopandas and pandas here.

FWIW, pyarrow fortunately does read in the column metadata, and it can be accessed from the Schema.field(column) object:

```python
In [17]: import pyarrow.parquet as pq

In [18]: schema = pq.read_schema("examples/geoparquet/example.parquet")

In [19]: schema.field("geometry").metadata
Out[19]: {b'epsg': b'4326'}
```

Collaborator

Yes, column instead of file (table) metadata could work as well. I am not fully sure, but I think the "easier to parse if everything is together" argument was the reason to do it this way (in addition to the fact that the pandas metadata does it this way as well).

The column/field metadata is accessible as well through the pyarrow interface, as you show, but I think it's less ergonomic to have to check the metadata of each column to see if it is a geometry column.

Now, specifically for Arrow, we might want to define standardized extension types for the geometry columns in the longer term, and that is actually done by using column/field metadata (https://arrow.apache.org/docs/format/Columnar.html#extension-types). That's of course Arrow-specific and not Parquet-specific, but it would translate to column metadata in Parquet as well.
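(A quick sketch of field-level metadata in pyarrow, which is also the mechanism extension types build on; the `geoarrow.wkb` name is only a placeholder, no such type is defined yet.)

```python
# Sketch: attach field-level metadata in Arrow; extension types use the same
# mechanism. "geoarrow.wkb" is a placeholder name, not an existing type.
import pyarrow as pa

field = pa.field(
    "geometry",
    pa.binary(),
    metadata={"ARROW:extension:name": "geoarrow.wkb"},
)
schema = pa.schema([field])
print(schema.field("geometry").metadata)
```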

Collaborator

Additional note on this (I only realized this last week): a good reason we used file-level metadata and not field-level metadata is that Arrow doesn't actually support reading/writing field-level metadata for Parquet.

The Parquet spec has this, so it could be added to the Arrow implementation. But even then, this might be less ideal: in the Parquet spec this lives in the ColumnMetaData of a ColumnChunk of a RowGroup (see https://parquet.apache.org/documentation/latest/#metadata). So the metadata would be repeated for each RowGroup, and a single Parquet file can consist of many row groups.
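(To make that concrete, this is roughly how that per-row-group column metadata surfaces through pyarrow; illustrative only.)

```python
# Sketch: Parquet column metadata hangs off each row group, so anything stored
# there would be repeated once per row group.
import pyarrow.parquet as pq

metadata = pq.ParquetFile("examples/geoparquet/example.parquet").metadata
print(metadata.num_row_groups)
print(metadata.row_group(0).column(0))  # ColumnChunkMetaData for the first column
```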


| Field Name | Type | Description |
| ------------------ | ------ | -------------------------------------------------------------------- |
| geoparquet_version | string | **REQUIRED.** The version of the metadata standard used when writing. |
Member

I meant to ask earlier if you had a version for this, but it looks like you did pick 0.1.0 in your example. We should put some mention at the top about the 'current version', noting that it's an unreleased 0.1.0.
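(For illustration, combining the fields discussed in this thread, the file-level metadata might then look something like the following; the layout is illustrative and not settled.)

```json
{
  "geoparquet_version": "0.1.0",
  "columns": {
    "geometry": {
      "epsg": 4326
    }
  }
}
```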

@cholmes merged commit f1c14d1 into opengeospatial:main on Oct 14, 2021