Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update geoparquet spec #2
Update geoparquet spec #2
Changes from 1 commit
bc586cd
285d5ca
c3f72a0
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meant to ask earlier if you had a version for this, but looks like you did pick a 0.1.0 in your example. We should put some mention at the top about 'current version'. That it's an unreleased 0.1.0...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just noticing that geoarrow/geoarrow#2 puts the per-column metadata in the file metadata under
geo.columns
, which is a mapping from column name to additional metadata.@brendan-ward do you know why that uses file metadata, rather than putting it in the column metadata for each column? I suppose it's a bit easier for tools to parse when it's in the file metadata, since they don't have to check the metadata for each column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I recall correctly, we were trying to parallel what we saw pandas doing when creating pandas-specific metadata for a parquet table (including data types of each column), which also does this at the file level. I don't see an easy way to interact with column-level metadata using
pyarrow
, which is what we used ingeopandas
to implement geoparquet support.That said, suggestions on geo-arrow-spec suggesting better ways to handle column-level metadata are certainly welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context. I'm tempted to just follow the lead of geopandas and pandas here.
FWIW, pyarrow fortunately does read in the column metadata, and it can be accessed from the Schema.field(column) object:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, column instead of file (table) metadata could work as well. I am not fully sure, but I think the "easier to parse if everything is together" was the reason to do it this way (in addition that the pandas metadata does it this way as well).
The column/field metadata is accessible as well through the pyarrow interface, as you show, but I think it's less ergonomic to have to check the metadata of each column to see if it is a geometry column.
Now, specifically for Arrow, we might want to define standardized extension types for the geometry columns on the longer term, and that actually is done by using column/field metadata (https://arrow.apache.org/docs/format/Columnar.html#extension-types). That's of course Arrow and not Parquet specific, but it would translate to column metadata in Parquet as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional note on this (I only realized this last week): a good reason we used file-level metadata and not field-level metadata, is that Arrow doesn't actually support reading/writing field-level metadata for Parquet.
The Parquet spec has this, so this could be added to the Arrow implementation. But even then, this might be less ideal. In the Parquet spec this lives in the ColumnMetaData of a ColumnChunk of a RowGroup (see https://parquet.apache.org/documentation/latest/#metadata). So that would mean that the metadata is repeated for each RowGroup, and a single Parquet file can consist of many row groups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly copied from https://github.com/stac-extensions/projection.
Do we want this here? Or do we keep the core spec lean and leave this to extensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just have thought about this for 5-10 minutes, changing my opinion several times. Probably worth putting this in its own issue for discussion.
I think my current take is put a field in core that defaults to WGS-84/lat,long/4326. And extensions can be used to add more options.
I think the STAC thing is overkill, it's meant to give all options to people, and most would only use one. And the json one I think makes less sense inside a non-json format. For the immediate term I think we just use epsg, and say it's required and must be set to 4326, and other options may be defined in extensions. And not include projjson or wkt2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
I vaguely recall discussions about non-geospatial geometries, i.e. no epsg / CRS. I don't recall if that was on the opening-call or another context, but should we allow
espg: null
to allow for this use case?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally don't like allowing non-spatial geometries. I say start without it, and if a real user has a case for it then can consider it. But my take is they are welcome to use the same ideas in the spec and make their own file. Like I'm happy for some 'geometry parquet' standard that is very close to ours. But I think we should start with locating things with a CRS. In STAC I think we said you could only use 'null' if you used one of the other proj constructs that provide the crs info.