Introduce bounding box column definition #191

Merged
merged 15 commits into from
Mar 11, 2024
Changes from 11 commits
Binary file modified examples/example.parquet
Binary file not shown.
20 changes: 20 additions & 0 deletions examples/example_metadata.json
Expand Up @@ -8,6 +8,26 @@
180.0,
83.6451
],
"covering": {
"bbox": {
"xmax": [
"bbox",
"xmax"
],
"xmin": [
"bbox",
"xmin"
],
"ymax": [
"bbox",
"ymax"
],
"ymin": [
"bbox",
"ymin"
]
}
},
"crs": {
"$schema": "https://proj.org/schemas/v0.6/projjson.schema.json",
"area": "World.",
32 changes: 32 additions & 0 deletions format-specs/geoparquet.md
Expand Up @@ -58,6 +58,8 @@ Each geometry column in the dataset MUST be included in the `columns` field abov
| edges | string | Name of the coordinate system for the edges. Must be one of `"planar"` or `"spherical"`. The default value is `"planar"`. |
| bbox | \[number] | Bounding Box of the geometries in the file, formatted according to [RFC 7946, section 5](https://tools.ietf.org/html/rfc7946#section-5). |
| epoch | number | Coordinate epoch in case of a dynamic CRS, expressed as a decimal year. |
| covering | object | Object containing bounding box column names to help accelerate spatial data retrieval. |


#### crs

Expand Down Expand Up @@ -134,6 +136,36 @@ For non-geographic coordinate reference systems, the items in the bbox are minim

The bbox values are in the same coordinate reference system as the geometry.

#### covering

The `covering` field specifies optional simplified representations of each geometry. Each key of the "covering" object MUST be a supported encoding. Currently the only supported encoding is "bbox", which specifies the names of [bounding box columns](#bounding-box-columns).

Example:
```
"covering": {
"bbox": {
"xmin": ["bbox", "xmin"],

At first the column name "bbox" was confusing to me as it is the same as the json struct name. Maybe "bbox_col" would be clearer? Afterwards it could be added that if there is a single geometry column, then the recommended bbox column name is simply "bbox".

Collaborator Author

@csringhofer Are you referring to the example as in:

"covering": {
    "bbox": {
        "xmin": ["bbox_col", "xmin"],
        ...

I'm hesitant to change it because our recommendation really is to call it "bbox". I agree it's a bit confusing. If there's anything to rename it might be the "bbox" under covering. It used to be called just "box" in earlier versions of the PR but now that it's just the bbox columns, I put it back. I'm open to other ideas though.

Collaborator

It's indeed a bit confusing here in the example, but for the actual spec I would also keep "bbox" both as the recommended column name and as the key here in the metadata.

Maybe another example could be added showing bbox columns for multiple geometry columns.
It is also not clear what the recommended name is in that case: there is an example with "any_column", but using something like "geom_column_name_bbox" seems clearer to me.

Contributor

> but using something like "geom_column_name_bbox" seems clearer to me.

FWIW, that's the convention I've used in the GDAL writer

"ymin": ["bbox", "ymin"],
"xmax": ["bbox", "xmax"],
"ymax": ["bbox", "ymax"]
}
}
```
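
As a rough illustration of how a reader might consume this metadata, the sketch below turns each path in the `bbox` covering into a dotted column name such as `bbox.xmin`. The helper `covering_bbox_paths` and the dotted-name convention are assumptions made for this example, not part of the spec or of any particular library.

```python
import json

# Hypothetical per-column metadata, shaped like the example above.
column_meta = json.loads("""
{
  "covering": {
    "bbox": {
      "xmin": ["bbox", "xmin"],
      "ymin": ["bbox", "ymin"],
      "xmax": ["bbox", "xmax"],
      "ymax": ["bbox", "ymax"]
    }
  }
}
""")


def covering_bbox_paths(meta):
    """Return {key: dotted path} for the bbox covering, or None if absent.

    Hypothetical helper: joins each Parquet schema path with "." so it can
    be handed to a reader that accepts dotted nested-field names.
    """
    bbox = meta.get("covering", {}).get("bbox")
    if bbox is None:
        return None
    return {key: ".".join(path) for key, path in bbox.items()}


paths = covering_bbox_paths(column_meta)
print(paths["xmin"])  # bbox.xmin
```

A column without a `covering` entry simply yields `None`, so callers can fall back to reading the geometry column directly.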

##### bbox covering encoding

Including a per-row bounding box can be useful for accelerating spatial queries by allowing consumers to inspect row group or page index bounding box summary statistics. Furthermore, a bounding box may be used to avoid complex spatial operations by first checking for bounding box overlaps. This field captures the column name and fields containing the bounding box of the geometry for every row.
jwass marked this conversation as resolved.
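
The overlap check mentioned above can be sketched in a few lines. This is only an illustration under simple assumptions (planar boxes, no antimeridian crossings); `BBox`, `overlaps`, and `prefilter` are names invented for the example.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlaps(a: BBox, b: BBox) -> bool:
    """True if two axis-aligned boxes share at least one point."""
    return (
        a.xmin <= b.xmax
        and b.xmin <= a.xmax
        and a.ymin <= b.ymax
        and b.ymin <= a.ymax
    )


def prefilter(rows, query):
    """Keep ids of rows whose bbox overlaps the query box.

    Only these candidates would need the expensive exact geometry test.
    """
    return [row_id for row_id, box in rows if overlaps(box, query)]


# Hypothetical rows: (id, per-row bounding box).
rows = [
    ("a", BBox(0, 0, 1, 1)),
    ("b", BBox(5, 5, 6, 6)),
    ("c", BBox(0.5, 0.5, 2, 2)),
]
print(prefilter(rows, BBox(0.9, 0.9, 1.5, 1.5)))  # ['a', 'c']
```

The same comparison applies one level up: a row group whose min/max statistics do not overlap the query box can be skipped without reading its rows.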

The format of the `bbox` encoding is `{"xmin": ["column_name", "xmin"], "ymin": ["column_name", "ymin"], "xmax": ["column_name", "xmax"], "ymax": ["column_name", "ymax"]}`. The arrays represent Parquet schema paths for nested groups. In this example, `column_name` is a Parquet group with fields `xmin`, `ymin`, `xmax`, `ymax`. The value in `column_name` MUST exist in the Parquet file and meet the criteria in the [Bounding Box Column](#bounding-box-columns) definition. To constrain each path to a single field of the bounding box group, the second item in each array MUST match its key: `xmin`, `ymin`, `xmax`, or `ymax`. All values MUST use the same column name.
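
The rules in this paragraph could be checked along the following lines. This is a minimal sketch based on a reading of the prose, not a port of `schema.json`; the function name `validate_bbox_covering` is hypothetical, and whether the column actually exists in the Parquet file is not checked here.

```python
REQUIRED_KEYS = ("xmin", "ymin", "xmax", "ymax")


def validate_bbox_covering(bbox):
    """Return a list of problems; an empty list means the encoding looks valid."""
    problems = []
    column_names = set()
    for key in REQUIRED_KEYS:
        path = bbox.get(key)
        if path is None:
            problems.append(f"missing key {key!r}")
            continue
        is_pair_of_strings = (
            isinstance(path, list)
            and len(path) == 2
            and all(isinstance(part, str) for part in path)
        )
        if not is_pair_of_strings:
            problems.append(f"{key!r} must be a two-element list of strings")
            continue
        # The second item must name the same field as the key.
        if path[1] != key:
            problems.append(f"{key!r} path must end in {key!r}, got {path[1]!r}")
        column_names.add(path[0])
    # All four paths must point into the same bounding box column.
    if len(column_names) > 1:
        problems.append("all paths must use the same column name")
    return problems


valid = {k: ["any_column", k] for k in REQUIRED_KEYS}
print(validate_bbox_covering(valid))  # []
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a file's metadata at once.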

Note: the value specified in this field should not be confused with the top-level [`bbox`](#bbox) field, which contains a single bounding box covering this geometry column over the whole GeoParquet file.

### Bounding Box Columns

A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE` and all columns MUST use the same type. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.

@csringhofer Feb 13, 2024

> xmin, xmax, ymin, and ymax representing the geometry's coordinate range.

I am confused about the semantics in the case of spherical geometries.
"Range" suggests to me that xmin should always be <= xmax, but this is not true in the spherical case, right? How should a bbox that crosses the 180.0° line of longitude or contains a pole be represented? Or can such a bbox not be represented?

It would be nice to add some guidance or a warning about interpretation in the spherical case.

Collaborator

For the bounding box at the file-level metadata (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#bbox), we refer to the GeoJSON spec:

> For geometries in a geographic coordinate reference system, longitude and latitude values are listed for the most southwesterly coordinate followed by values for the most northeasterly coordinate. This follows the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for a set of geometries that cross the antimeridian.

Would that work here as well / provide sufficient information?

Of course, in contrast with the bbox metadata for the full file which is just a JSON array of 4 numbers, here we need to give the numbers explicit field names. The current proposal uses "xmin", "xmax", etc, which is not ideal for the geographical case.
(in general a simple bbox might not be ideal for geographical data anyway, and the current proposal leaves it open to add other "covering" types later)

Collaborator Author

@jwass Feb 15, 2024

Good point, @csringhofer.

I agree with @jorisvandenbossche that we'd just defer to how GeoJSON defines bbox for anti-meridian crossings. (There's another discussion about also adopting GeoJSON's recommendation to split geometries at the anti-meridian but that's probably further out).

I suppose we could rename the bbox fields to "south", "west", etc., but it feels off to me, and I think xmin, xmax is still the right name with the caveats Joris listed. It's also worth mentioning that naive queries against the bbox won't be effective for anti-meridian crossings, including row group filtering optimizations. But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately, including row group filtering.

Collaborator Author

I added this sentence to the docs: As with the top-level [bbox](#bbox) column, the values follow the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for geometries that cross the antimeridian.

Contributor

> But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately including row group filtering

I assume that would require that the geometries crossing the anti-meridian be in dedicated row groups. If you start mixing geometries that cross the A.M. with geometries that don't, then the min(minx), min(miny), max(maxx), max(maxy) statistics aren't going to make any sense.

E.g., if you have a geometry [-10,-10,10,10] and a [170,-10,-170,10] (crossing the A.M.) in a single row group, then the row group stats are going to be [-10,-10,10,10], and thus the geometry crossing the A.M. will not be selected.
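
The arithmetic in this example can be reproduced with a small sketch. It assumes a writer that derives row group statistics as plain min/max over the per-row fields and a reader that applies a naive overlap test; `row_group_stats` and `naive_overlap` are hypothetical names, and the query box is made up for illustration.

```python
def row_group_stats(boxes):
    """Min/max summary over per-row (xmin, ymin, xmax, ymax) tuples."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def naive_overlap(a, b):
    """Overlap test that assumes xmin <= xmax, i.e. no antimeridian crossing."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# The second box crosses the antimeridian, so its xmin (170) exceeds its xmax (-170).
boxes = [(-10, -10, 10, 10), (170, -10, -170, 10)]
stats = row_group_stats(boxes)
print(stats)  # (-10, -10, 10, 10)

# A query near longitude 175-179 lies inside the crossing geometry's box,
# yet the row group statistics say there is nothing to read.
query = (175, -5, 179, 5)
print(naive_overlap(stats, query))  # False
```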

Collaborator

Hmm, yes, that's a good point. Silently skipping a row when the user is not aware of this doesn't sound good.
Shall we for now just say that this feature (bbox column) doesn't support A.M.-crossing geometries, and thus cannot be used for data including such geometries?

Collaborator Author

Good point, @rouault. We discussed this at the last GeoParquet meeting and decided that we'll just say that antimeridian crossings aren't supported for now. I'll also make an issue to track that.

Collaborator Author

Created #198 to continue that discussion.


The bounding box column MUST be at the root of the schema. The bounding box column MUST NOT be nested in a group.
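
To make the per-row rule concrete, here is a small sketch that derives one bounding box record per row and leaves the record empty exactly when the geometry is empty. Geometries are represented as plain lists of `(x, y)` pairs purely for illustration; a real writer would compute bounds from its own geometry type, and `row_bbox` is a hypothetical helper.

```python
def row_bbox(coords):
    """Bounding box record for one row; None when the row has no geometry.

    Assumes `coords` is either None or a non-empty list of (x, y) pairs.
    """
    if coords is None:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {"xmin": min(xs), "ymin": min(ys), "xmax": max(xs), "ymax": max(ys)}


# Hypothetical rows: two geometries and one missing value.
geometries = [
    [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)],
    None,
    [(-5.0, 4.0)],
]
bbox_column = [row_bbox(g) for g in geometries]

# A bounding box value is present if and only if the geometry is present.
assert all((g is None) == (b is None) for g, b in zip(geometries, bbox_column))
print(bbox_column[0])  # {'xmin': 0.0, 'ymin': 0.0, 'xmax': 2.0, 'ymax': 3.0}
```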

### Additional information

#### Feature identifiers
44 changes: 44 additions & 0 deletions format-specs/schema.json
Expand Up @@ -71,6 +71,50 @@
},
"epoch": {
"type": "number"
},
"covering": {
"type": "object",
"minProperties": 1,
"properties": {
"bbox": {
"type": "object",
"required": ["xmin", "xmax", "ymin", "ymax"],
"properties": {
"xmin": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "xmin" }
],
"additionalItems": false
},
"xmax": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "xmax" }
],
"additionalItems": false
},
"ymin": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "ymin" }
],
"additionalItems": false
},
"ymax": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "ymax" }
],
"additionalItems": false
}
}
}
}
}
}
}
22 changes: 18 additions & 4 deletions scripts/generate_example.py
Expand Up @@ -8,6 +8,7 @@
>>> import json, pprint, pyarrow.parquet as pq
>>> pprint.pprint(json.loads(pq.read_schema("example.parquet").metadata[b"geo"]))
"""
from collections import OrderedDict
import json
import pathlib

Expand All @@ -19,6 +20,14 @@

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df = df.to_crs("ogc:84")

geometry_bbox = df.bounds.rename(
    OrderedDict(
        [("minx", "xmin"), ("miny", "ymin"), ("maxx", "xmax"), ("maxy", "ymax")]
    ),
    axis=1,
)
df["bbox"] = geometry_bbox.to_dict("records")
table = pa.Table.from_pandas(df.head().to_wkb())


Expand All @@ -39,14 +48,19 @@ def get_version() -> str:
"crs": json.loads(df.crs.to_json()),
"edges": "planar",
"bbox": [round(x, 4) for x in df.total_bounds],
"covering": {
"bbox": {
"xmin": ["bbox", "xmin"],
"ymin": ["bbox", "ymin"],
"xmax": ["bbox", "xmax"],
"ymax": ["bbox", "ymax"],
},
},
},
},
}

schema = (
table.schema
.with_metadata({"geo": json.dumps(metadata)})
)
schema = table.schema.with_metadata({"geo": json.dumps(metadata)})
table = table.cast(schema)

pq.write_table(table, HERE / "../examples/example.parquet")
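
For readers without geopandas at hand, the renaming step in this script (bounds named `minx`, `miny`, `maxx`, `maxy` becoming per-row records with `xmin`, `ymin`, `xmax`, `ymax`) can be pictured with plain dictionaries. The sketch below is only an analogy: `to_bbox_records` is a hypothetical helper and the input numbers are made up.

```python
# Mapping from geopandas-style bounds names to the bbox field names.
RENAMES = {"minx": "xmin", "miny": "ymin", "maxx": "xmax", "maxy": "ymax"}


def to_bbox_records(bounds_rows):
    """Rename the keys of each bounds row, yielding one bbox record per row."""
    return [{RENAMES[key]: value for key, value in row.items()} for row in bounds_rows]


# Made-up bounds for two rows, for illustration only.
bounds_rows = [
    {"minx": -10.0, "miny": -5.0, "maxx": 10.0, "maxy": 5.0},
    {"minx": 20.0, "miny": 30.0, "maxx": 25.0, "maxy": 35.0},
]
records = to_bbox_records(bounds_rows)
print(records[0])  # {'xmin': -10.0, 'ymin': -5.0, 'xmax': 10.0, 'ymax': 5.0}
```

Each resulting record corresponds to one value of the `bbox` column, which is what the `covering` metadata paths point into.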
86 changes: 84 additions & 2 deletions scripts/test_json_schema.py
Expand Up @@ -40,8 +40,8 @@ def get_version() -> str:
"columns": {
"geometry": {
"encoding": "WKB",
"geometry_types": [],
},
"geometry_types": []
}
},
}

Expand Down Expand Up @@ -210,6 +210,88 @@ def get_version() -> str:
metadata["columns"]["geometry"]["epoch"] = "2015.1"
invalid_cases["epoch_string"] = metadata

# Geometry Bbox
metadata_covering_template = copy.deepcopy(metadata_template)
metadata_covering_template["columns"]["geometry"]["covering"] = {
"bbox": {
"xmin": ["bbox", "xmin"],
"ymin": ["bbox", "ymin"],
"xmax": ["bbox", "xmax"],
"ymax": ["bbox", "ymax"],
},
}


# Allow "any_column.xmin" etc.
metadata = copy.deepcopy(metadata_covering_template)
valid_cases["valid_default_bbox"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"] = {
"xmin": ["any_column", "xmin"],
"ymin": ["any_column", "ymin"],
"xmax": ["any_column", "xmax"],
"ymax": ["any_column", "ymax"],
}
valid_cases["valid_but_not_bbox_struct_name"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"].pop("bbox")
invalid_cases["empty_geometry_bbox"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"] = {}
invalid_cases["empty_geometry_bbox_missing_fields"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("xmin")
invalid_cases["covering_bbox_missing_xmin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("ymin")
invalid_cases["covering_bbox_missing_ymin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("xmax")
invalid_cases["covering_bbox_missing_xmax"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("ymax")
invalid_cases["covering_bbox_missing_ymax"] = metadata

# Invalid bbox xmin/xmax/ymin/ymax values
metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmin"] = ["bbox", "not_xmin"]
invalid_cases["covering_bbox_invalid_xmin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmax"] = ["bbox", "not_xmax"]
invalid_cases["covering_bbox_invalid_xmax"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymin"] = ["bbox", "not_ymin"]
invalid_cases["covering_bbox_invalid_ymin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymax"] = ["bbox", "not_ymax"]
invalid_cases["covering_bbox_invalid_ymax"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmin"] = ["bbox", "xmin", "invalid_extra"]
invalid_cases["covering_bbox_extra_xmin_elements"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmax"] = ["bbox", "xmax", "invalid_extra"]
invalid_cases["covering_bbox_extra_xmax_elements"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymin"] = ["bbox", "ymin", "invalid_extra"]
invalid_cases["covering_bbox_extra_ymin_elements"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymax"] = ["bbox", "ymax", "invalid_extra"]
invalid_cases["covering_bbox_extra_ymax_elements"] = metadata


# # Tests
