Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script to process geoJson files into mcf #915

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 76 additions & 0 deletions scripts/geo/geojson/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# Processing GeoJSON files

The directory contains script to process GeoJSON files.

## Generate GeoJSON from Shape file.
GeoJSON files can be downloaded from source or generated from shape fles.
To extract geoJSON from a shape file (.shp), run the following command:

```
ogr2ogr -f GeoJSON <output.geojson> <input.shp>
```

Note that the remainign scripts assume geoJSON contains coordinates with
Latitude/Longitude. In case, the shape file contains other coordinates, such as
EPSG:27700, convert it to lat/long coordinates (EPSG:4326) using additional
arguments for coordinates:
```
ogr2ogr -f GeoJSON -s_srs EPSG:27700 -t_srs EPSG:4326 <output.geojson> <input.shp>
```

## Resolve places
If the dcid for the place can't be extracted from the geoJSON features
extract the places and the properties in the geoJSON into a csv
with a row per feature using the following command:
```
python3 process_geojson.py --input_geojson=<input.gsojson> --output_csv=<place_properties.csv>
```
Then add a 'dcid' column with the resolved place using other tools such as the
[place_name_resolver](https://github.com/datacommonsorg/data/tree/master/tools/place_name_resolver).

Pass the csv with resolved places to the next step to generated MCFs with the
flag --places_csv.


## Generate MCFs with geoJsonCoordinates
Run the `process_geojson.py` script to generate MCF files with the
geoJsonCoordinate property containing the boundary polygons. One MCF file is
generated for each polygon simplification level with the suffix DP<N> where
N is a simplification level.

```
# Create output_geojson.mcf with Nodes for each feature in the input geoJSON
python $GEO_DIR/process_geojson.py --input_geojson=<input.geojson> \
--output_mcf_prefix=output_geojson
```


The script supports additional arguments:
```
--place_name_properties: list of properties used to create the name of the
place. The name is looked up in the places_csv for the place dcid.

--places_csv: csv file with dcid for resolved places.
If should contain the columns: 'place_name' with the name of the place
and 'dcid' with the resolved dcid.

--new_dcid_template: Format string used to generate the dcid for places that
can't be resolved using the places_csv.
It can refer to the geoJSON features. For example: 'geoID/{FIPS}' where 'FIPS'
is a feature in the geoJSON.

--output_mcf_pvs: Comma separated list of property:value to be added to the
output MCF. The value can be a python format string with reference to other
properties. For example: "location: [LatLong {LAT} {LONG}]"
where 'LAT' and 'LONG' are features in the geoJSON.
These property:values are added to the first output MCF without the
simplified polygons.
Some common properties to add to the output:
'typeOf: dcid:AdministrativeArea1,location: [LatLong {latitude} {longitude}],containedInPlace: dcid:country/AUS,name: "{place_name}"'

The value can also be a python expression prefixed with '='.
For example:
'typeOf: ="AdministrativeArea1" if len(dcid) <= 8 else "AdministrativeArea2"'

```

112 changes: 112 additions & 0 deletions scripts/geo/geojson/geojson_util.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Utilities for geoJson coordinates."""

import json
import sys

from absl import logging
from shapely import geometry
from typing import Union

# Maximum geoJsonCoordinates value in bytes.
# Polygons larger than this are simplified with a node in geoJsonCoordinatesNote.
_MAX_GEOJSON_SIZE = 10000000 # ~10MB
# Tolerance to simplify the polygon when it exceeds _MAX_GEOJSON_SIZE
# Polygon is simplified successively by this until it fits.
_MIN_TOLERANCE = 0.005

def get_geojson_dict(geojson: Union[str, dict]) -> dict:
"""Returns a geoJson dict parsed from string.

Args:
geoJson: geojson as astring with escaped quotes or a dictionary

Returns:
dictionary of geojson
"""
geo_json = None
if isinstance(geojson, dict):
geo_json = geojson
elif isinstance(geojson, str):
# Convert geojson to a dict
try:
# Parse quoted string with escaped double quotes
geo_json = json.loads(json.loads(geojson))
except TypeError:
geo_json = None
logging.debug(f'Failed to load geojson: {geojson[:100]}')
if geo_json is None:
# Parse dictionary as a string
try:
geo_json = ast.literal_eval(geojson)
except TypeError:
geo_json = None
logging.debug(f'Failed to eval geojson: {geojson[:100]}')
return geo_json


def get_geojson_polygon(geojson: dict) -> geometry.Polygon:
"""Returns a polygon for the geoJson.

Args:
geoJson: polygon as s dictionary of lines wiht lat/long

Returns:
geometry.Polygon object created from the input dict
"""
geo_json = get_geojson_dict(geojson)
if not geo_json:
return None
polygon = geometry.shape(geo_json)
if not polygon or not polygon.is_valid:
return None
return polygon


def get_limited_geojson_str(geojson_str: str,
polygon: geometry.Polygon = None,
tolerance: float = 0.0,
max_length: int = _MAX_GEOJSON_SIZE) -> str:
"""Returns a polygon that is within the max length.

Args:
geojson_str: string representation of polygon geojson
polygon: olygon object to be converted to string
tolerance: simplification factor for Douglas Peucker algo
max_length: Maximum length of the geojson string
polygons larger than that size are successivly simplified
until it is within the max_length

Returns:
geojson as a string with double quotes escaped that is less than max_length.
"""
if geojson_str is None:
geojson_str = ''
sgeojson_str = geojson_str
iteration_count = 0
while not sgeojson_str or len(sgeojson_str) > max_length:
tolerance = tolerance + _MIN_TOLERANCE * iteration_count
if polygon is None:
polygon = get_geojson_polygon(geojson_str)
if polygon:
spolygon = polygon.simplify(tolerance)
sgeojson_str = json.dumps(json.dumps(geometry.mapping(spolygon)))
logging.debug(
f'Simplifying {len(geojson_str)} bytes polygon with tolerance'
f' {tolerance} to {len(sgeojson_str)}')
else:
break
iteration_count += 1
return sgeojson_str, tolerance
Loading
Loading