Introduce bounding box column definition (#191)
jwass authored Mar 11, 2024
1 parent 882ea48 commit 0309eac
Showing 6 changed files with 200 additions and 6 deletions.
Binary file modified examples/example.parquet
Binary file not shown.
20 changes: 20 additions & 0 deletions examples/example_metadata.json
@@ -8,6 +8,26 @@
180.0,
83.6451
],
"covering": {
"bbox": {
"xmax": [
"bbox",
"xmax"
],
"xmin": [
"bbox",
"xmin"
],
"ymax": [
"bbox",
"ymax"
],
"ymin": [
"bbox",
"ymin"
]
}
},
"crs": {
"$schema": "https://proj.org/schemas/v0.6/projjson.schema.json",
"area": "World.",
34 changes: 34 additions & 0 deletions format-specs/geoparquet.md
@@ -58,6 +58,8 @@ Each geometry column in the dataset MUST be included in the `columns` field abov
| edges | string | Name of the coordinate system for the edges. Must be one of `"planar"` or `"spherical"`. The default value is `"planar"`. |
| bbox | \[number] | Bounding Box of the geometries in the file, formatted according to [RFC 7946, section 5](https://tools.ietf.org/html/rfc7946#section-5). |
| epoch | number | Coordinate epoch in case of a dynamic CRS, expressed as a decimal year. |
| covering | object | Object containing bounding box column names to help accelerate spatial data retrieval. |


#### crs

@@ -134,6 +136,38 @@ For non-geographic coordinate reference systems, the items in the bbox are minim

The bbox values are in the same coordinate reference system as the geometry.

#### covering

The `covering` field specifies optional simplified representations of each geometry. Each key of the `covering` object MUST be a supported encoding. Currently the only supported encoding is `bbox`, which specifies the names of [bounding box columns](#bounding-box-columns).

Example:
```
"covering": {
    "bbox": {
        "xmin": ["bbox", "xmin"],
        "ymin": ["bbox", "ymin"],
        "xmax": ["bbox", "xmax"],
        "ymax": ["bbox", "ymax"]
    }
}
```

##### bbox covering encoding

Including a per-row bounding box can accelerate spatial queries by allowing consumers to inspect row group and page index bounding box summary statistics. Furthermore, a bounding box may be used to avoid complex spatial operations by first checking for bounding box overlaps. This field captures the column name and fields containing the bounding box of the geometry for every row.
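
As an illustration of the row group pruning described above, the following is a minimal sketch rather than a normative algorithm. It assumes the reader has already extracted the per-row-group minimum and maximum statistics of the four bounding box fields into a plain dictionary. The `BBox` type, the `stats` layout, and both function names are hypothetical and do not belong to any Parquet library. It also assumes that no geometry crosses the antimeridian.

```python
from typing import Dict, NamedTuple, Tuple


class BBox(NamedTuple):
    """Hypothetical helper type holding one 2D bounding box."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    """Return True if the two boxes overlap or touch."""
    return (
        a.xmin <= b.xmax
        and a.xmax >= b.xmin
        and a.ymin <= b.ymax
        and a.ymax >= b.ymin
    )


def row_group_may_match(stats: Dict[str, Tuple[float, float]], query: BBox) -> bool:
    """Decide whether a row group can contain rows overlapping `query`.

    `stats` is an assumed layout: field name -> (min, max) statistics
    of that bounding box field within one row group.
    """
    # The envelope of every row in the group is bounded by the smallest
    # xmin/ymin and the largest xmax/ymax seen in the statistics.
    envelope = BBox(
        xmin=stats["xmin"][0],
        ymin=stats["ymin"][0],
        xmax=stats["xmax"][1],
        ymax=stats["ymax"][1],
    )
    return intersects(envelope, query)
```

A row group for which `row_group_may_match` returns `False` can be skipped without decoding any geometry. A `True` result only means the group might match, so the rows still need to be checked individually.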

The format of the `bbox` encoding is `{"xmin": ["column_name", "xmin"], "ymin": ["column_name", "ymin"], "xmax": ["column_name", "xmax"], "ymax": ["column_name", "ymax"]}`. The arrays represent Parquet schema paths for nested groups. In this example, `column_name` is a Parquet group with fields `xmin`, `ymin`, `xmax`, `ymax`. The value in `column_name` MUST exist in the Parquet file and meet the criteria in the [Bounding Box Column](#bounding-box-columns) definition. To constrain each path to a field of a single bounding box group, the second item of each array MUST match its key (`xmin`, `ymin`, `xmax`, or `ymax`). All values MUST use the same column name.
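
The constraints on the `bbox` encoding can be checked with a few lines of code. The sketch below is illustrative only: `validate_bbox_covering` is a hypothetical helper, not part of any GeoParquet library, and it checks just the metadata rules stated above. It does not verify that the referenced column actually exists in the Parquet file.

```python
def validate_bbox_covering(covering: dict) -> str:
    """Return the bounding box column name referenced by `covering`.

    Raises ValueError if the object breaks one of the metadata rules.
    """
    required = ("xmin", "ymin", "xmax", "ymax")
    bbox = covering.get("bbox")
    if not isinstance(bbox, dict):
        raise ValueError('"covering" must contain a "bbox" object')

    column_names = set()
    for key in required:
        path = bbox.get(key)
        # Each value is a two-element Parquet schema path: [column, field].
        if not isinstance(path, list) or len(path) != 2:
            raise ValueError(f"{key!r} must be a two-element path")
        column, field = path
        if not isinstance(column, str):
            raise ValueError(f"{key!r} must start with a column name")
        # The second item must repeat the key, e.g. ["bbox", "xmin"].
        if field != key:
            raise ValueError(f"{key!r} must point at field {key!r}")
        column_names.add(column)

    # All four paths must refer to the same bounding box column.
    if len(column_names) != 1:
        raise ValueError("all paths must use the same column name")
    return column_names.pop()
```

For the example metadata shown earlier, the function returns `"bbox"`, which a reader could then look up in the Parquet schema.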

The value specified in this field should not be confused with the top-level [`bbox`](#bbox) field, which contains the single bounding box of all geometries in this column across the whole GeoParquet file.

Note: Using the bounding box to accelerate spatial queries does not apply to geometries that cross the antimeridian. Such geometries are unsupported by this method.

### Bounding Box Columns

A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. As with the top-level [`bbox`](#bbox) field, the values follow the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for geometries that cross the antimeridian. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE` and all fields MUST use the same type. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.

The bounding box column MUST be at the root of the schema. The bounding box column MUST NOT be nested in a group.
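
To make the per-row rule concrete, the sketch below derives one bounding box value from a geometry's coordinates. It is an assumption-laden illustration: `bbox_value` is a hypothetical helper, geometries are simplified to non-empty lists of 2D `(x, y)` tuples that do not cross the antimeridian, and a missing geometry is modelled as `None`.

```python
from typing import Dict, List, Optional, Tuple


def bbox_value(
    coords: Optional[List[Tuple[float, float]]]
) -> Optional[Dict[str, float]]:
    """Build the bounding box struct for one row.

    Returns None when the row has no geometry, so that a bounding box
    value is present if and only if a geometry value is present.
    """
    if coords is None:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    # The four child fields required of a bounding box column.
    return {"xmin": min(xs), "ymin": min(ys), "xmax": max(xs), "ymax": max(ys)}
```

A writer would store the returned dictionary as the group value of the bounding box column at the root of the schema, using one floating point type for all four fields.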

### Additional information

#### Feature identifiers
44 changes: 44 additions & 0 deletions format-specs/schema.json
@@ -71,6 +71,50 @@
},
"epoch": {
"type": "number"
},
"covering": {
"type": "object",
"minProperties": 1,
"properties": {
"bbox": {
"type": "object",
"required": ["xmin", "xmax", "ymin", "ymax"],
"properties": {
"xmin": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "xmin" }
],
"additionalItems": false
},
"xmax": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "xmax" }
],
"additionalItems": false
},
"ymin": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "ymin" }
],
"additionalItems": false
},
"ymax": {
"type": "array",
"items": [
{ "type": "string" },
{ "const": "ymax" }
],
"additionalItems": false
}
}
}
}
}
}
}
22 changes: 18 additions & 4 deletions scripts/generate_example.py
@@ -8,6 +8,7 @@
>>> import json, pprint, pyarrow.parquet as pq
>>> pprint.pprint(json.loads(pq.read_schema("example.parquet").metadata[b"geo"]))
"""
from collections import OrderedDict
import json
import pathlib

@@ -19,6 +20,14 @@

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df = df.to_crs("ogc:84")

geometry_bbox = df.bounds.rename(
    OrderedDict(
        [("minx", "xmin"), ("miny", "ymin"), ("maxx", "xmax"), ("maxy", "ymax")]
    ),
    axis=1,
)
df["bbox"] = geometry_bbox.to_dict("records")
table = pa.Table.from_pandas(df.head().to_wkb())


@@ -39,14 +48,19 @@ def get_version() -> str:
"crs": json.loads(df.crs.to_json()),
"edges": "planar",
"bbox": [round(x, 4) for x in df.total_bounds],
"covering": {
"bbox": {
"xmin": ["bbox", "xmin"],
"ymin": ["bbox", "ymin"],
"xmax": ["bbox", "xmax"],
"ymax": ["bbox", "ymax"],
},
},
},
},
}

schema = (
table.schema
.with_metadata({"geo": json.dumps(metadata)})
)
schema = table.schema.with_metadata({"geo": json.dumps(metadata)})
table = table.cast(schema)

pq.write_table(table, HERE / "../examples/example.parquet")
86 changes: 84 additions & 2 deletions scripts/test_json_schema.py
@@ -40,8 +40,8 @@ def get_version() -> str:
"columns": {
"geometry": {
"encoding": "WKB",
"geometry_types": [],
},
"geometry_types": []
}
},
}

@@ -210,6 +210,88 @@ def get_version() -> str:
metadata["columns"]["geometry"]["epoch"] = "2015.1"
invalid_cases["epoch_string"] = metadata

# Geometry Bbox
metadata_covering_template = copy.deepcopy(metadata_template)
metadata_covering_template["columns"]["geometry"]["covering"] = {
"bbox": {
"xmin": ["bbox", "xmin"],
"ymin": ["bbox", "ymin"],
"xmax": ["bbox", "xmax"],
"ymax": ["bbox", "ymax"],
},
}


# Allow "any_column.xmin" etc.
metadata = copy.deepcopy(metadata_covering_template)
valid_cases["valid_default_bbox"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"] = {
"xmin": ["any_column", "xmin"],
"ymin": ["any_column", "ymin"],
"xmax": ["any_column", "xmax"],
"ymax": ["any_column", "ymax"],
}
valid_cases["valid_but_not_bbox_struct_name"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"].pop("bbox")
invalid_cases["empty_geometry_bbox"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"] = {}
invalid_cases["empty_geometry_bbox_missing_fields"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("xmin")
invalid_cases["covering_bbox_missing_xmin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("ymin")
invalid_cases["covering_bbox_missing_ymin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("xmax")
invalid_cases["covering_bbox_missing_xmax"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"].pop("ymax")
invalid_cases["covering_bbox_missing_ymax"] = metadata

# Invalid bbox xmin/xmax/ymin/ymax values
metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmin"] = ["bbox", "not_xmin"]
invalid_cases["covering_bbox_invalid_xmin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmax"] = ["bbox", "not_xmax"]
invalid_cases["covering_bbox_invalid_xmax"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymin"] = ["bbox", "not_ymin"]
invalid_cases["covering_bbox_invalid_ymin"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymax"] = ["bbox", "not_ymax"]
invalid_cases["covering_bbox_invalid_ymax"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmin"] = ["bbox", "xmin", "invalid_extra"]
invalid_cases["covering_bbox_extra_xmin_elements"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["xmax"] = ["bbox", "xmax", "invalid_extra"]
invalid_cases["covering_bbox_extra_xmax_elements"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymin"] = ["bbox", "ymin", "invalid_extra"]
invalid_cases["covering_bbox_extra_ymin_elements"] = metadata

metadata = copy.deepcopy(metadata_covering_template)
metadata["columns"]["geometry"]["covering"]["bbox"]["ymax"] = ["bbox", "ymax", "invalid_extra"]
invalid_cases["covering_bbox_extra_ymax_elements"] = metadata


# # Tests
