Consider externalizability of metadata #118

mojodna · 2022-09-19T03:16:17Z

When [Geo]Parquet files/sources are used within systems that treat them as tables (like Spark, Trino/Presto, Athena, etc.), basic Parquet metadata is tracked in a "catalog" (e.g., a Hive-compatible catalog like AWS Glue Catalog). The engine being used for querying uses metadata to limit the parts of files (and files themselves) that are scanned, but they only expose the columnar content that's present, not the metadata. In some cases, metadata can be queried from the catalogs (e.g., from Athena, but the catalogs need additional work to support the metadata specified by GeoParquet (and this largely hasn't been done yet).

In the meantime, I'm curious if it makes sense to take the same metadata that's contained in the footer and externalize it into an alternate file (which could be serialized as Parquet, Avro, JSON, etc.). This would allow the query engines to register the metadata as a separate "table" (query-able as a standard source vs. requiring catalog support) and surface/take advantage of "table"-level information like CRS at query-time. At the moment, the CRS of a geometry column is something that needs to be determined out of band.

This is somewhat similar to #79, in that it doesn't look at GeoParquet sources as "files" ("tables" are often backed by many files), and could be seen as another reason to (de-)duplicate data from file footers into something that covers the whole set.

/cc @jorisvandenbossche and @kylebarron, since we talked a bit about this at FOSS4G.

jatorre · 2022-09-22T13:49:25Z

So two topics here. It sounds like some systems like Presto or Athena (which are the same right?), rely on the catalogs to read the metadata and those systems will need to support geoparquet. I think that is a fair assumption and I dont think we should override that, we should try to get those systems to recognise geoparaquet.

Meanwhile a geoparquet read by Presto/Athena will look like having a binary column if the catalog has not been updated to support geoparquet. Thats fine, thats the same behaviour I would expect on other products.

So I think it is the same like with BigQuery, Redshfit, Snowflake, etc... is juts that on those system there 2 pieces of software that need to be updated and in the others is only 1 since they combine the catalog with the engine.

Now, a different story is with the multi-file datasets... I assume the recommendation is for every part to contain the same metadata and if there is a global metadata file to all of them we can include also, as a recommendation, the geoparque metadata?

mojodna · 2022-09-22T17:00:33Z

For purposes of this issue, I'm only expressing a desire for the GeoParquet-specific metadata (bbox, CRS, geometry type(?)) to be duplicated from the Parquet footers into a file that can be directly addressed and read independently of the files containing data.

"is a global metadata file to all of them we can include also, as a recommendation, the geoparquet metadata?"

Effectively, yeah.

To @jatorre's other comments:

Athena is related to Presto, but they're not the same; Athena v2's functions are currently based on Presto 0.217's, Athena includes support for user-defined functions implemented with AWS Lambda, and Athena has diverged with its support for Hudi and Iceberg. GeoParquet files can be registered with Glue Catalog (as vanilla Parquet) and queried using Athena; there's no external "geometry" type, so WKB columns appear as byte arrays and can be converted to internal geometries for use w/ geospatial functions. Neither the catalog nor the engine (Athena) understand GeoParquet-specific metadata, so bbox-related optimizations aren't possible and there's no way to programmatically know the CRS of a GeoParquet source (hence the desire for the metadata to be directly addressable/join-able).

Glue Catalog is responsible for storing and tracking metadata about the objects that make up tables (and surfaces this metadata to a variety of services, software, and customer tools that relate to Hadoop, so not just Athena). Glue Catalog is probably the starting point for AWS to fully support GeoParquet (esp. for engine optimizations), but there are ecosystem-wide considerations around geometry types (optionally including CRS), since the "Hive Metastore" (Catalog) has become a de facto standard, and with it comes assumptions about the universe of Hive data types.

I view Athena and Glue Catalog as separate considerations (and will do what I can to raise them within AWS, though I don't have much visibility into those teams).

jatorre · 2022-09-26T08:17:02Z

I am going to ask some people from Apache on support here to provide best practice on how to handle this. It does not sound like a geo specific thing, but more of how do you treat advanced custom data types on this situation.

cholmes · 2022-10-24T23:30:01Z

We should discuss whether to move this to the 'future' milestone, in line with the latest discussions where we're focusing on 'interoperability' in 1.0.0, and after that we'll dig into use cases of using GeoParquet as a direct source / streaming from it.

But I could see addressing the specific request in conjunction with #79

cholmes modified the milestones: future, 1.0.0-beta.1 Oct 24, 2022

cholmes modified the milestones: 1.0.0-beta.1, future Nov 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider externalizability of metadata #118

Consider externalizability of metadata #118

mojodna commented Sep 19, 2022

jatorre commented Sep 22, 2022

mojodna commented Sep 22, 2022

jatorre commented Sep 26, 2022

cholmes commented Oct 24, 2022

Consider externalizability of metadata #118

Consider externalizability of metadata #118

Comments

mojodna commented Sep 19, 2022

jatorre commented Sep 22, 2022

mojodna commented Sep 22, 2022

jatorre commented Sep 26, 2022

cholmes commented Oct 24, 2022