Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce bounding box column definition #191

Merged
merged 15 commits into from
Mar 11, 2024

Conversation

jwass
Copy link
Collaborator

@jwass jwass commented Nov 27, 2023

Introduces the per-row bounding box definition following the discussion from #188.

This initial proposal goes with a definition that looks like:

"geometry": {
  "encoding": "WKB",
  "geometry_types": [],
  "bbox": [-71.1, 42.3, -70.9, 42.5],
  "geometry_bbox": {"column": "bbox"}
}
  • Add documentation to the top-level GeoParquet description and definition.
  • Add the geometry_bbox definition to the json schema
  • Add a few tests. Verify with pytest test_json_schema.py
  • Updates the examples and metadata schema poetry run python generate_example.py and poetry run python update_example_schemas.py

* Add documentation to the top-level GeoParquet description and definition.
* Add the geometry_bbox definition to the json schema
* Add a few tests. Verify with `pytest test_json_schema.py`
@jwass
Copy link
Collaborator Author

jwass commented Nov 27, 2023

How do people feel about the bbox column field names xmin, xmax, ymin, ymax?

When putting this together I noticed that geopandas gdf.bounds uses minx, maxx, miny, maxy. @paleolimbot made a good argument that we use the xmin style in other places in GeoParquet so should stick with that, but is worth calling out here.

Copy link
Collaborator

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing this up! Just a few comments to get the discussion going.

Sorry for missing your earlier comment before you got to the PR, but I do think that nesting the "covering" or "simplifed_geometry" or "auxiliary_columns" or "something_better" is a more future-proof option: this proposal is for a single struct column that represents the minimum bounding rectangle, but we might need other encodings (or multiple columns for one encoding) and I think it will better communicate that those concepts are related if they are grouped in the same object.

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

I think a struct is the best way to go here, but it is worth noting that another option would be to refer to top-level columns. The only reason to do this is that some Parquet scanners can't leverage column statistics for struct columns (although this might not matter for Parquet scanners that will be using this column). This should get revealed in our testing of the bbox column concept!

For three dimensions the additional fields zmin and zmax MUST be present.

In GeoPackage the dimensions of the envelope are independent of the geometry...while some indexes care about 3D, most indexes only care about 2D and it might be an attractive option to only ever write an XY bounding box here.

@jorisvandenbossche
Copy link
Collaborator

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

Parquet indeed does not speak about "struct" columns, and this is unfortunately one of those areas that is completely underdocumented. I think the relevant term is "group", but in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types that is only used to explain LIST and MAP logical type.

When translating an Arrow struct column, I get the following Parquet schema:

import pyarrow as pa
struct_arr = pa.StructArray.from_arrays([[1, 2]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"bbox": struct_arr})

import pyarrow.parquet as pq
pq.write_table(table, "test_bbox_schema.parquet")
pq.read_metadata("test_bbox_schema.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f02f01dac40>
required group field_id=-1 schema {
  optional group field_id=-1 bbox {
    optional int64 field_id=-1 xmin;
    optional int64 field_id=-1 xmax;
    optional int64 field_id=-1 ymin;
    optional int64 field_id=-1 ymax;
  }
}

So something like "a group field with 4 child fieds with names ..."?

Change the geometry_bbox to the broader "covering" section. Update
tests and examples.

Made some documentation updates:
* Parquet schema -> group
* Do not require zmin/zmax if geometries have 3 dimensions
@jwass
Copy link
Collaborator Author

jwass commented Nov 28, 2023

I do think that nesting the "covering" or "simplifed_geometry" or "auxiliary_columns" or "something_better" is a more future-proof option: this proposal is for a single struct column that represents the minimum bounding rectangle, but we might need other encodings (or multiple columns for one encoding) and I think it will better communicate that those concepts are related if they are grouped in the same object.

Sounds good. I just made the updates to go to that original proposal with the "covering" section. I admit to not having a better word so let's stick with it for now.

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

Good call. I went with @jorisvandenbossche's language here. And I agree, the Parquet docs/website are really lacking on this.

I think a struct is the best way to go here, but it is worth noting that another option would be to refer to top-level columns. The only reason to do this is that some Parquet scanners can't leverage column statistics for struct columns (although this might not matter for Parquet scanners that will be using this column). This should get revealed in our testing of the bbox column concept!

Agreed. We can change as necessary depending on the outcome of that testing.

For three dimensions the additional fields zmin and zmax MUST be present.

In GeoPackage the dimensions of the envelope are independent of the geometry...while some indexes care about 3D, most indexes only care about 2D and it might be an attractive option to only ever write an XY bounding box here.

Thanks. I changed the language to read zmin and zmax MAY be present but not required.

@jwass
Copy link
Collaborator Author

jwass commented Nov 28, 2023

So something like "a group field with 4 child fieds with names ..."?

@jorisvandenbossche Thanks! I made some updates and went with this wording.

@rouault
Copy link
Contributor

rouault commented Nov 28, 2023

I'm just thinking that the use of a struct/group with child fields could potentially be a complication for some implementations. I mean the "flat" model is quite a common one among geospatial formats. I struggled a bit to deal with Parquet nested construct when mapping that to the OGR data model. I've bitten that bullet now, so the current proposal is perfectly fine to me. Just wanted to raise that potential issue for other "simple" hypothetical implementations. But perhaps dealing with Parquet means that dealing with nested constructs is an implicit requirement.

@kylebarron
Copy link
Collaborator

kylebarron commented Nov 28, 2023

What about specifically listing the names of the paths in the schema:

"covering": {
  "box": {
    "xmin": "bbox.minx",
    "ymin": "bbox.miny",
    "xmax": "bbox.maxx",
    "ymax": "bbox.maxy",
  }
}

Then it can be straightforward to either use dot-access notation as above, or to refer to columns at the top-level:

"covering": {
  "box": {
    "xmin": "minx",
    "ymin": "miny",
    "xmax": "maxx",
    "ymax": "maxy",
  }
}

This also means that a struct point column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.x",
    "ymin": "geometry.y",
    "xmax": "geometry.x",
    "ymax": "geometry.y",
  }
}

And a geoarrow linestring column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.list.element.list.element.x",
    "ymin": "geometry.list.element.list.element.y",
    "xmax": "geometry.list.element.list.element.x",
    "ymax": "geometry.list.element.list.element.y",
  }
}

(The exact syntax is up for future discussion; that's what pyarrow.parquet shows for the
value of path_in_schema for parquet metadata of a geoarrow linestring file, tested with this file)

<pyarrow._parquet.ColumnChunkMetaData object at 0x114d2a1b0>
  file_offset: 509677
  file_path:
  physical_type: DOUBLE
  num_values: 49494
  path_in_schema: geometry.list.element.list.element.x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x106814270>
      has_min_max: True
      min: 215869.21669693943
      max: 781793.0855915633
      null_count: 0
      distinct_count: None
      num_values: 49494
      physical_type: DOUBLE
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 16382
  data_page_offset: 409875
  total_compressed_size: 493295
  total_uncompressed_size: 493604
<pyarrow._parquet.ColumnChunkMetaData object at 0x114cfe7a0>
  file_offset: 1002855
  file_path:
  physical_type: DOUBLE
  num_values: 49494
  path_in_schema: geometry.list.element.list.element.y
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x114d2b1f0>
      has_min_max: True
      min: 4790520.432107779
      max: 5237313.334385211
      null_count: 0
      distinct_count: None
      num_values: 49494
      physical_type: DOUBLE
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 509812
  data_page_offset: 903053
  total_compressed_size: 493043
  total_uncompressed_size: 493596

@jorisvandenbossche
Copy link
Collaborator

And a geoarrow linestring column could be defined using something like

Do we want to enable this? Because that gives different abilities. I would expect this covering to be actual min/max values, while for geoarrow that would be a list of values. While for row group statistics this will work the same, that is not the case for plain row filtering.

@kylebarron
Copy link
Collaborator

kylebarron commented Nov 28, 2023

And a geoarrow linestring column could be defined using something like

Do we want to enable this? Because that gives different abilities. I would expect this covering to be actual min/max values, while for geoarrow that would be a list of values. While for row group statistics this will work the same, that is not the case for plain row filtering.

I think the question is whether covering: box is intended to be used for more than just predicate pushdown. I had assumed that covering: box was only intended to enable statistics in the row group metadata, but that readers should not assume how to use the actual data within the column.

Given that creating bounding boxes from a geometry column (once in memory) should be very fast, I think the main goal should be on solidifying a flexible way to describe the MBR of the chunk in the row group metadata, without defining what the values of that column must be. I like the above because it's flexible among struct and top-level columns while also being future proof to a geoarrow encoding.

(Though to be clear, I'm not arguing for this PR to explicitly allow geoarrow; for now we can only document top-level columns and a struct column)

@jwass
Copy link
Collaborator Author

jwass commented Nov 28, 2023

This also means that a struct point column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.x",
    "ymin": "geometry.y",
    "xmax": "geometry.x",
    "ymax": "geometry.y",
  }
}

One nice outcome of this approach is it avoids the duplication (triplication?) of point coordinates that @rouault raised in #188 (comment).

Although I think I share in @jorisvandenbossche's concern that allowing it to point to non-primitive float columns makes simple row filtering more complicated.

@rouault
Copy link
Contributor

rouault commented Nov 28, 2023

Although I think I share in @jorisvandenbossche's concern that allowing it to point to non-primitive float columns makes simple row filtering more complicated.

agreed on that: I would expect the content of the xmin/... field to be a scalar to KISS

@Maxxen
Copy link

Maxxen commented Nov 28, 2023

Just chiming in with my 2c. DuckDB can't do predicate pushdowns into struct columns yet (hopefully someday!), and I imagine other simpler readers may also struggle to fully utilize nested types, so I would think the "flat" representation is the most practical/has the lowest "barrier of entry". Although from a conceptual standpoint its kind of a bummer because I feel like this is the perfect use case for structs :/.

@migurski
Copy link

migurski commented Nov 29, 2023

Just popping in with a 👍 to this proposal. GDAL’s new support for column stats in 3.8.0 combined with spatially-clustered rows and relatively-small row groups has shown excellent performance in my tests with country-scale OSM datasets.

I hadn’t seen the work on structs here so I was experimenting with separate xmin/ymin/xmax/ymax columns. They seem to be equivalent for read performance.

@cholmes
Copy link
Member

cholmes commented Nov 29, 2023

What about specifically listing the names of the paths in the schema

I like the upside of being able to handle both structs and top level columns, and handling points without duplication would definitely be a win.

But I worry about the complexity - a reader would need to at least know about all the options, and if it's not capable of predict pushdown would have to know that some fields wouldn't work. And ideally a reader would notify the user in some way that it's not working / implemented. And writer implementations would have to make choices of how to do it. And either push that decision on the user, or else we recommend a 'default'. So if we just allow any field specified then it feels like we just sorta punt on the question of which approach to do, and we push the complexity out to the ecosystem. And it'd be harder to just look at a parquet file and know just by looking at the data if it's got a bbox - if we have one set approach of fields of set names then you can be pretty sure if a dataset has that then it is implementing a spatial optimization.

So I think I lean towards first picking one way and seeing if it will work for everyone. But it could be interesting to use this way to specify that one way if we want to future proof it a bit - like we list the path names, but we align on just having one locked in value for the first release, like how we did the geometry encoding. Maybe that's what you were proposing? I wasn't sure if the idea was people could specify any field they wanted to serve as the box, or if it'd just be a way to specify a set of options.

@cholmes
Copy link
Member

cholmes commented Nov 29, 2023

One thing that I do think should accompany this is some recommendations on how to actually use this construct to accelerate spatial querying. Like talk through different indexing options, explain a bit about how you need to order the file and how the row group stats work. I think this would likely fit best in some sort of 'best practice' document that sits alongside the spec. And I don't think it needs to be part of this PR, but would be good to have in place for a first 1.1-beta release. I'd be up to take a first crack at it, but would be good for others who are deeper into spatial indexing and parquet to help out.

@paleolimbot
Copy link
Collaborator

It seems like top-level columns are the only way for this proposal to fulfill its goal (allow GeoParquet files to be written such that querying a large file for a small area does not have awful performance) for at least one engine we care about (DuckDB)? The other engine I know can't do this is cudf (less of a priority perhaps). It is a bummer because a struct column is definitely the right concept for this.

That would mean:

"covering": {
  "box": {
    "xmin": "name_of_top_level_xmin",
    "ymin": "name_of_top_level_ymin",
    "xmax": "name_of_top_level_xmax",
    "ymax": "name_of_top_level_ymax"
  }
}

A future version of the spec could allow nesting with "xmin": ["name_of_top_level_column", "xmin"] (there's no guarantee that a column name does not contain a period). I don't think we should start by allowing both (or maybe we will never get to a place where we allow both).

@jwass
Copy link
Collaborator Author

jwass commented Nov 29, 2023

It seems like top-level columns are the only way for this proposal to fulfill its goal (allow GeoParquet files to be written such that querying a large file for a small area does not have awful performance) for at least one engine we care about (DuckDB)? The other engine I know can't do this is cudf (less of a priority perhaps).

Agreed. We had discussed surveying existing Parquet readers prior to making a decision, but it feels like we already know the answer without having to do more investigation...?

As @paleolimbot said, we could go with @kylebarron's suggestion of having the column specified per box element, but the spec could enforce that they MUST point to fields in the root of the schema. We always have the option later to relax that constraint without breaking backward compatibility.

Separately, I like the idea of the increased flexibility but wonder if it allows too many personal preferences to exist in the ecosystem... Someone could write:

"covering": {
  "box": {
    "xmin": "minx",
    "ymin": "miny",
    "xmax": "maxx",
    "ymax": "maxy"
  }
}

because they prefer minx over xmin. I think I'd find the competing opinions annoying if many different ones are out there. I guess we could just have a best practice recommendation of how to define these. If we want to recommend plain xmin/... we'd need a suggestion for multiple geometries too.

@paleolimbot
Copy link
Collaborator

If we want to recommend plain xmin/... we'd need a suggestion for multiple geometries too.

Perhaps a prefix? <geometry column name>_xmin, <geometry column name>_ymin, etc?

@kylebarron
Copy link
Collaborator

Someone could write ... because they prefer minx over xmin

Is that a problem? I don't think readers should be assuming anything about the column names except what is defined in the metadata. If someone wants minx, that seems totally fine as long as the metadata is correctly specified.

@jwass
Copy link
Collaborator Author

jwass commented Dec 3, 2023

Is that a problem? I don't think readers should be assuming anything about the column names except what is defined in the metadata. If someone wants minx, that seems totally fine as long as the metadata is correctly specified.

As someone that spends a lot of time writing SQL, the real interface of a Parquet file for me is the table schema's columns and data types because most SQL query engines can't read the GeoParquet metadata - at least not for a while. If I want to reuse queries across different GeoParquet datasets, they'll have to be dynamic over column names which is a pain especially if determining the columns requires reading the data before I know how to query it.

I guess in my opinion, the more standardized the Parquet schema is - column names and types - and the less flexible the metadata allows it to be, the easier GeoParquet will be to work with and the more successful it will be. This is a very similar discussion to #169 as well. The broader difference of philosophy is beyond the scope of this PR but a fun one that I think will keep coming up :)

Either way, I think Chris's proposal to constrain the allowed values could make this work and future-proof to make it more open-ended later.

@jwass
Copy link
Collaborator Author

jwass commented Dec 3, 2023

Thanks everyone for the feedback and good discussion. Here's my read on what would be a good path forward:

Specify every bounding box column

We can adopt @kylebarron's idea to define the bounding box like this:

"covering": {
  "box": {
    "xmin": "xmin",
    "ymin": "ymin",
    "xmax": "xmax",
    "ymax": "ymax"
  }
}

Constraints on the initial release that can be relaxed later

Top-Level Columns

Since duckdb and others don't do predicate pushdown on nested columns, the initial definition should require the box columns to be at the schema root. Otherwise we'll be negating most performance benefits for some systems.

Restricting Column Names

Per @cholmes's comment, we can start by restricting the column names and pinning them to be xmin, xmax, etc. initially. This will keep GeoParquet datasets consistent in their schemas for some time, but give us some future-proofing to allow nested columns (bbox.xmin) or arbitrary column names ({"xmin": "minx", ...}") in the future without having to redesign the box spec presumably.

What to do about files with multiple geometries?

We can't require the same names for a different geometry column. I like @paleolimbot's idea of constraining the names to be a prefix that is the geometry column name + "_". So the bbox columns for my_other_geom are my_other_geom_xmin, my_other_geom_ymin, etc. But I also like the idea of the primary geometry's bbox just being called xmin rather than e.g. geometry_xmin. In other words, we could say the primary geometry column's prefix is the empty string, but all others are column_name_.

Thoughts? I can make the PR changes soon if there's agreement on the direction here.

@tschaub
Copy link
Collaborator

tschaub commented Dec 3, 2023

@jwass - I'm not sure if this is what you are suggesting as well, but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns name like <geometry-column>_xmin (etc), those may be treated as values representing the geometry bounds. This would allow us to later add geo metadata that pointed to other columns serving the same purpose (and to potentially have other covering types).

@jwass
Copy link
Collaborator Author

jwass commented Dec 3, 2023

I'm not sure if this is what you are suggesting as well, but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns name like <geometry-column>_xmin (etc), those may be treated as values representing the geometry bounds.

@tschaub That's not quite what I was suggesting since I am proposing adding this to the metadata. That way the systems that can interact with the metadata can still find it in a definitive way. We could define the bounding box more as convention for now without touching the geo metadata yet, and/or add it to the compatible parquet spec. But it seemed like most folks wanted it in there.

@tschaub
Copy link
Collaborator

tschaub commented Dec 3, 2023

If we don't make a change to the geo metadata, it would give us a chance to actually demonstrate that the min/max columns have some value in real use cases - without needing a breaking release if we find out that the initial proposal wasn't quite right.

@migurski
Copy link

migurski commented Dec 4, 2023

The min/max columns create a single-level spatial index and are absolutely useful in selection use cases. This example shows a 44MB Geoparquet file for the country of Ghana with only data for the town of Kumasi selected:

https://nbviewer.org/urls/gist.githubusercontent.com/migurski/e88e6d34d8686ed54c8a3ed158061409/raw/3ba8640c38f4a87574e976000450d5e17ade4c17/index.ipynb

Preview, showing in-Kumasi row groups in blue vs. non-Kumasi row groups in gray:

Screen Shot 2023-12-03 at 6 23 09 PM

With row groups of 100 features and everything sorted by quadkey of one feature corner we use the column stats alone to identify 106 row groups out of 1,001 total that must be read. This saves almost 90% of required reads. I did some tests with other space-filling curves and they don’t make a substantial difference. It’s the min/max column stats and the clustering alone that do it.

For larger source files this would be a requirement to use the network efficiently. Worldwide OSM road and building data can be dozens or hundreds of GB in size and I wouldn’t be able to afford a table scan to efficiently query a region.

Copy link
Collaborator

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have anything additional to add here!

My only question is what happens when there is a period in column_name. I don't think it's hard to imagine that it would live with the lefthand side (i.e., not one of the struct fields), but also not particularly important to solve right now!

@jwass
Copy link
Collaborator Author

jwass commented Jan 30, 2024

My only question is what happens when there is a period in column_name. I don't think it's hard to imagine that it would live with the lefthand side (i.e., not one of the struct fields), but also not particularly important to solve right now!

@paleolimbot This came up today and @jorisvandenbossche mentioned it might be common in R. My guess is in some SQL engines you'd refer to a column like that with quotes like:

SELECT
    "my.column.with.periods".xmin
FROM t

We could require something similar such that periods are nested groups but when they are within quotes they're just part of the column name. The quotes would just have to be properly escaped in the JSON.

Describe bounding box as the "coordinate range" of the geometry which is
language used by the GeoJSON spec
@jorisvandenbossche
Copy link
Collaborator

A more robust way to specify a "nested path", is to specify each part of the path as a separate string. In Python we would naturally use a tuple for that, like ("column_name", "xmin"). In JSON that translates to an array, I don't know if that would be strange to use here.

I was quickly looking into how I would implement this bbox filter with pyarrow, and there I will actually need to convert the dotted path we have here to a tuple or list of paths, because to specify a filter when reading I can do something like filter=(pc.field(("bbox", "xmin")) > xx) & (..) & ... The current pc.field(..) helper function to specify a field doesn't support a dotted path, but instead requires you to be specific (which allows to specify a top-level field name with dots in the name). We could of course change this in pyarrow, but I also like the explicitness.

That means that I would need to split the dotted path. A simple bbox_metadata["xmin"].split(".") would work in most cases, but that would not work when there are quoted sub-paths inside the string. Supporting something like "\"my.column.with.periods\".xmin" would make this quite a bit more complicated.

@rouault
Copy link
Contributor

rouault commented Jan 30, 2024

In JSON that translates to an array, I don't know if that would be strange to use here.

I was also thinking to that

Or, as we want all the xmin/ymin/xmax/ymax to be subfields of the same field:

"covering": {
  "bbox": {
    "struct_field": "bbox",
    "xmin_subfield": "xmin",
    "ymin_subfield": "ymin",
    "xmax_subfield": "xmax",
    "ymax_subfield": "ymax"
  }
}

@jwass
Copy link
Collaborator Author

jwass commented Jan 30, 2024

If we want to be the most future-proof, maybe we should just go with the array:

"covering": {
  "bbox": {
    "xmin": ["bbox", "xmin"],
    "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"],
    "ymax": ["bbox", "ymax"],
  }
}

Kind of annoying but it should make things much easier for readers that need to find the right field in the schema as @jorisvandenbossche pointed out. We can constrain the json schema so that each element has length 2 and fix the final element which gives lots of flexibility to relax those later.

I'm good with it.

Update bbox so that each element is an array. For example:
["bbox", "xmin"] which represents a top-level bbox struct field
with an xmin member beneath it
Accidentally committed in last version for testing
@jwass
Copy link
Collaborator Author

jwass commented Jan 30, 2024

Updated to new format as described above. @jorisvandenbossche I re-requested a review just to make sure this in line with your thinking here.

@rouault
Copy link
Contributor

rouault commented Feb 1, 2024

FYI, I've a GDAL implementation of this in OSGeo/gdal#9185


##### bbox covering encoding

Including a per-row bounding box can be useful for accelerating spatial queries by allowing consumers to inspect row group bounding box summary statistics. Furthermore a bounding box may be used to avoid complex spatial operations by first checking for bounding box overlaps. This field captures the column name and fields containing the bounding box of the geometry for every row.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that pages should be also mentioned besides row groups, as bbox column also works for page level indexes. Wrote a longer rant about this in #188 (comment)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just updated to reflect this. Thanks!

Copy link
Collaborator Author

@jwass jwass Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csringhofer Can you take a look at 40ebb37?

Pretty small change just to also mention page indexes in addition to row groups. There aren't any other discussions of row groups in the spec so that's the only change. Let me know if you think that works.

```
"covering": {
"bbox": {
"xmin": ["bbox", "xmin"],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first the column name "bbox" was confusing to me as it is the same as the json struct name. Maybe "bbox_col" would be clearer? Afterwards it could be added that if there is a single geometry column, then the recommended bbox column name is simply "bbox".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csringhofer Are you referring to the example as in:

"covering": {
    "bbox": {
        "xmin": ["bbox_col", "xmin"],
        ...

I'm hesitant to change it because our recommendation really is to call it "bbox". I agree it's a bit confusing. If there's anything to rename it might be the "bbox" under covering. It used to be called just "box" in earlier versions of the PR but now that it's just the bbox columns, I put it back. I'm open to other ideas though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed a bit confusing here in the example, but for the actual spec I would also keep "bbox" both for the recommended column name as the key here in the metadata.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe another example could be added with bbox column for multiple geometry columns.
It is also not clear what is the recommended name in that case - there is an example with "any_column", but using something like "geom_column_name_bbox" seems clearer to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but using something like "geom_column_name_bbox" seems clearer to me.

FWIW, that's the convention I've used in the GDAL writer


### Bounding Box Columns

A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE`. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be added the the coordinates must have the same type, e.g. all FLOAT or all DOUBLE

About repetition: while the struct's repetition must be the same as geometry column's, the nested fields' repetation must be "required", right? So they can never be null if their parent is not null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be added the the coordinates must have the same type, e.g. all FLOAT or all DOUBLE

I concur with that! This is an assumption I've actually made in my GDAL implementation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the update in 40ebb37. Let me know if okay or not.

* Mention that bounding boxes are useful for both row group
statistics and page indexes
* Require that bbox encoding columns be all of the same type (float or double)
@cholmes
Copy link
Member

cholmes commented Feb 13, 2024

Can we merge this in? Any remaining work to do? If I don't hear in a day or two I'm going to hit the green button.


### Bounding Box Columns

A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE` and all columns MUST use the same type. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.
Copy link

@csringhofer csringhofer Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xmin, xmax, ymin, and ymax representing the geometry's coordinate range.

I am confused about the semantics in case of spherical geometries.
"range" suggests to me that xmin should be always <= xmax, but this is not true in the spherical case, right? How to represent a bbox that crosses the 180.0° line of longitude or contains a pole? Or such a bbox cannot be represented?

It would be nice to add some guidance/warning about interpreting in the spherical case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the bounding box at the file-level metadata (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#bbox), we refer to the GeoJSON spec:

For geometries in a geographic coordinate reference system, longitude and latitude values are listed for the most southwesterly coordinate followed by values for the most northeasterly coordinate. This follows the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for a set of geometries that cross the antimeridian.

Would that work here as well / provide sufficient information?

Of course, in contrast with the bbox metadata for the full file which is just a JSON array of 4 numbers, here we need to give the numbers explicit field names. The current proposal uses "xmin", "xmax", etc, which is not ideal for the geographical case.
(in general a simple bbox might not be ideal for geographical data anyway, and the current proposal leaves it open to add other "covering" types later)

Copy link
Collaborator Author

@jwass jwass Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, @csringhofer.

I agree with @jorisvandenbossche that we'd just defer to how GeoJSON defines bbox for anti-meridian crossings. (There's another discussion about also adopting GeoJSON's recommendation to split geometries at the anti-meridian but that's probably further out).

I suppose we could rename the bbox fields to "south", "west", etc. but it feels off to me and I think xmin, xmax is still the right name with the caveats Joris listed. It's also worth mentioning that naive queries against the bbox won't be effective for anti-meridian crossings, including row group filtering optimizations. But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately including row group filtering

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this sentence to the docs: As with the top-level [bbox](#bbox) column, the values follow the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for geometries that cross the antimeridian.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately including row group filtering

I assume that would require the geometries crossing the anti-meridian are in dedicated row groups. If you start mixing geometries crossing the A.M. and geometries not crossing it, then the min(minx), min(miny), max(maxx), max(maxy) statistics aren't going to make any sense.

e.g if you have a geometry [-10,-10,10,10] and a [170,-10,-170,10] (crossing A-M) in a single row group, then the row group stats are going to be [-10,-10,10,10] and thus the geometry crossing A-M will not be selected.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, that's a good point. Silently not selecting a row if you are not aware of this, doesn't sound good.
Shall we for now just say that this feature (bbox column) doesn't support A-M crossing geometries, and thus cannot be used for data including such geometries?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, @rouault. We discussed this at the last GeoParquet meeting and decided that we'll just say that antimeridian crossings aren't supported for now. I'll also make an issue to track that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #198 to continue that discussion.

Copy link
Collaborator

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the last updates!

Going to merge this now, we can always iterate a bit further on the details in follow-ups (such as the antimeridian case for which you already opened an issue)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants