Update geoparquet spec #2
Conversation
format-specs/geoparquet.md (Outdated)
| Field Name | Type            | Description                                                                 |
| ---------- | --------------- | --------------------------------------------------------------------------- |
| epsg       | integer \| null | **REQUIRED.** [EPSG code](http://www.epsg-registry.org/) of the datasource |
Mostly copied from https://github.com/stac-extensions/projection.
Do we want this here? Or do we keep the core spec lean and leave this to extensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just thought about this for 5-10 minutes, changing my opinion several times. Probably worth putting this in its own issue for discussion.

I think my current take is to put a field in core that defaults to WGS 84 / lat,long / EPSG:4326, and extensions can be used to add more options.

I think the STAC approach is overkill: it's meant to give people all the options, and most would only use one. And the JSON one (projjson) I think makes less sense inside a non-JSON format. For the immediate term I think we just use epsg, say it's required and must be set to 4326, and note that other options may be defined in extensions. And not include projjson or wkt2.
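A rough sketch of what that rule would mean on the reader side; the helper name and metadata shape here are hypothetical, not anything the spec defines:

```python
def validate_core_crs(column_metadata: dict) -> None:
    # Hypothetical reader-side check for the rule proposed above: in core,
    # epsg is required and must be 4326; extensions could relax this.
    epsg = column_metadata.get("epsg")
    if epsg != 4326:
        raise ValueError(f"core geoparquet requires epsg=4326, got {epsg!r}")
```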
Sounds good.

> For the immediate term I think we just use epsg, and say it's required and must be set to 4326, and other options may be defined in extensions.

I vaguely recall discussions about non-geospatial geometries, i.e. no EPSG / CRS. I don't recall if that was on the opening call or in another context, but should we allow `epsg: null` to allow for this use case?
I personally don't like allowing non-spatial geometries. I say start without it, and if a real user has a case for it then we can consider it. But my take is they are welcome to use the same ideas in the spec and make their own file; I'm happy for some 'geometry parquet' standard to exist that is very close to ours. But I think we should start with locating things with a CRS. In STAC I think we said you could only use 'null' if you used one of the other proj constructs that provide the CRS info.
## Column metadata
Each geometry column should include additional metadata.
I'm just noticing that geoarrow/geoarrow#2 puts the per-column metadata in the file metadata under `geo.columns`, which is a mapping from column name to additional metadata.

@brendan-ward do you know why that uses file metadata, rather than putting it in the column metadata for each column? I suppose it's a bit easier for tools to parse when it's in the file metadata, since they don't have to check the metadata for each column?
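For illustration, reading the file-level layout that geoarrow/geoarrow#2 describes could look like this with pyarrow; the `geo` key, its JSON structure, and the file path are the proposal under discussion, not a released spec:

```python
import json
import pyarrow.parquet as pq

# Sketch: the geoarrow/geoarrow#2 layout keeps everything under one
# file-level "geo" key, mapping column names to their metadata.
schema = pq.read_schema("example.parquet")  # path is hypothetical
geo = json.loads(schema.metadata[b"geo"])
for name, column_meta in geo["columns"].items():
    print(name, column_meta)
```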
If I recall correctly, we were trying to parallel what we saw pandas doing when creating pandas-specific metadata for a Parquet table (including data types of each column), which also does this at the file level. I don't see an easy way to interact with column-level metadata using `pyarrow`, which is what we used in `geopandas` to implement geoparquet support.

That said, suggestions on geo-arrow-spec for better ways to handle column-level metadata are certainly welcome!
Thanks for the context. I'm tempted to just follow the lead of geopandas and pandas here.
FWIW, pyarrow fortunately does read in the column metadata, and it can be accessed from the `Schema.field(column)` object:

```python
In [17]: import pyarrow.parquet as pq

In [18]: schema = pq.read_schema("examples/geoparquet/example.parquet")

In [19]: schema.field("geometry").metadata
Out[19]: {b'epsg': b'4326'}
```
Yes, column instead of file (table) metadata could work as well. I am not fully sure, but I think the "easier to parse if everything is together" argument was the reason to do it this way (in addition to the fact that the pandas metadata does it this way as well).

The column/field metadata is accessible as well through the pyarrow interface, as you show, but I think it's less ergonomic to have to check the metadata of each column to see if it is a geometry column.

Now, specifically for Arrow, we might want to define standardized extension types for the geometry columns in the longer term, and that is actually done by using column/field metadata (https://arrow.apache.org/docs/format/Columnar.html#extension-types). That's of course Arrow- and not Parquet-specific, but it would translate to column metadata in Parquet as well.
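To illustrate, here is roughly what such an extension type could look like in pyarrow. The type name `geoarrow.wkb` and the serialized metadata are assumptions for the sketch, not anything the spec defines:

```python
import pyarrow as pa

# Sketch only: an Arrow extension type carrying geometry metadata on the
# field itself rather than at the file level.
class WkbGeometryType(pa.ExtensionType):
    def __init__(self):
        # Storage is plain binary holding WKB-encoded geometries.
        pa.ExtensionType.__init__(self, pa.binary(), "geoarrow.wkb")

    def __arrow_ext_serialize__(self):
        # Metadata travels with the field, not the file.
        return b'{"epsg": 4326}'

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(WkbGeometryType())
```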
Additional note on this (I only realized it last week): a good reason we used file-level metadata and not field-level metadata is that Arrow doesn't actually support reading/writing field-level metadata for Parquet.

The Parquet spec has this, so it could be added to the Arrow implementation. But even then, it might be less ideal: in the Parquet spec this lives in the ColumnMetaData of a ColumnChunk of a RowGroup (see https://parquet.apache.org/documentation/latest/#metadata). That would mean the metadata is repeated for each RowGroup, and a single Parquet file can consist of many row groups.
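To make the trade-off concrete, a quick pyarrow sketch (file path hypothetical): the file-level key-value metadata is stored once, while anything in ColumnMetaData would be repeated per row group:

```python
import pyarrow.parquet as pq

# Hypothetical file; illustrates why per-ColumnChunk metadata repeats.
pf = pq.ParquetFile("example.parquet")
print(pf.metadata.num_row_groups)  # can be large for big files
print(pf.metadata.metadata)        # file-level key-value metadata, stored once
```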
| Field Name         | Type   | Description                                                            |
| ------------------ | ------ | ---------------------------------------------------------------------- |
| geoparquet_version | string | **REQUIRED.** The version of the metadata standard used when writing. |
Meant to ask earlier if you had a version for this, but it looks like you did pick 0.1.0 in your example. We should put some mention at the top about the 'current version', noting that it's an unreleased 0.1.0...
First pass at a spec for geoparquet format. Lots of open questions, but the basic idea is to encode geometry columns as WKB, and set a standard place for some additional metadata (like the CRS).
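A minimal end-to-end sketch of that idea with pyarrow, assuming the metadata keys discussed in this thread (`geo`, `columns`, `epsg`, `geoparquet_version`), none of which are finalized:

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Sketch only: one geometry column encoded as WKB, plus file-level "geo"
# metadata. Key names follow the discussion above but are not finalized.
wkb_point = b"\x01\x01\x00\x00\x00" + b"\x00" * 16  # WKB for POINT (0 0)

table = pa.table({
    "name": ["origin"],
    "geometry": pa.array([wkb_point], type=pa.binary()),
})

geo = {
    "geoparquet_version": "0.1.0",
    "columns": {"geometry": {"epsg": 4326, "encoding": "WKB"}},
}
existing = table.schema.metadata or {}
table = table.replace_schema_metadata({**existing, b"geo": json.dumps(geo).encode()})
pq.write_table(table, "example.parquet")
```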