Python implementation of an OSM / OpenAddresses.io geocoder.
This geocoder is implemented in PostgreSQL DB functions as far as possible; a simple API and an example Flask app are included.
You will need PostgreSQL 9.5+ (or 11.0+ for OpenAddresses.io support) with PostGIS installed as well as some disk space and data-files from OpenStreetMap and (optionally) OpenAddresses.io.
Data import will be done via Omniscale's imposm3 and a supplied python script to import the openaddresses.io data.
Optionally you can use the libpostal machine learning address classifier to parse addresses supplied as input to the forward geocoder.
To format the addresses returned by the reverse geocoder, the worldwide.yml from the OpenCageData address-formatting repository is used, so each address follows the customs of the country that is being geocoded.
See README.md in the repository for more information.
- Initial release, reverse geocoding works, forward geocoding is slow
Warning: DB Format changed, you'll have to re-import data
- Fixed forward geocoding speed
- Fixed import scripts to be more resilient
- Made Openaddresses.io completely optional
- Restored compatibility with older 3.x Python versions
- Restored compatibility with older PostgreSQL DB versions (9.5+ if you do not use openaddresses.io)
- Switched to pipenv
- Fix missing import for structured forward geocoding
- Fix Copy and Paste error in forward geocoding SQL
If you're coming from 2.0.0, re-run the finalize step to update the SQL functions:
$ pipenv run bin/finalize_geocoder.py --db postgresql://geocoder:password@localhost/osmgeocoder
- Add type hints to all interfaces
- Add _dict variants for geocoding functions to get raw data instead of formatted strings
- Bugfix: Reading of custom opencage data file for address formatting was broken
- Returned addresses now contain county and state if available
- Return Attribution in API and in webservices
Statistics outdated, will be updated shortly
Be aware that a big import takes a long time. Example figures on a machine with a Core i7-7700K at 4.2 GHz, a Samsung (SATA) SSD and 32 GB of RAM (and some tuned buffer sizes for Postgres):
- Import of the Europe region of OpenStreetMap:
  - Import time: 3 hours
  - OSM data file: 20 GB
  - Temporary space needed: 35 GB
  - Final size in DB: 58.7 GB
  - Summary of space requirement: 115 GB
- Import of the two OpenAddresses.io files for Europe:
  - Import time: 1 hour
  - Data files: 4 GB
  - Temporary space needed: 2 GB
  - Final size in DB: 18 GB
  - Summary of space requirement: 24 GB
- Conversion of the OpenStreetMap data into geocoding format:
  - Conversion time: 5 hours
  - Final size in DB: 10.5 GB
In summary you'll need about 9 hours and 150 GB of disk space. After cleanup the Europe data set occupies 28.5 GB. A compressed DB export of the converted data comes to 2.8 GB of raw data and expands on import to the 28.5 GB mentioned above.
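The totals can be re-derived from the per-step figures. The following is only a small sketch using the numbers listed above; the variable names are made up for illustration, and reading the 28.5 GB figure as "OpenAddresses tables plus converted geocoding data" is an assumption that happens to match 18 + 10.5.

```python
# Figures copied from the list above (hours, GB); names are illustrative only.
steps = {
    "osm_import": {"hours": 3, "space_gb": 115.0, "final_gb": 58.7},
    "openaddresses_import": {"hours": 1, "space_gb": 24.0, "final_gb": 18.0},
    "conversion": {"hours": 5, "space_gb": 10.5, "final_gb": 10.5},
}

total_hours = sum(step["hours"] for step in steps.values())
total_space_gb = sum(step["space_gb"] for step in steps.values())

# Assumption: after cleanup only the OpenAddresses tables and the
# converted geocoding data stay in the DB.
after_cleanup_gb = (
    steps["openaddresses_import"]["final_gb"] + steps["conversion"]["final_gb"]
)

print(total_hours, total_space_gb, after_cleanup_gb)
```

The sum of the space requirements comes out just under the rounded 150 GB quoted in the text.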
- Create a PostgreSQL Database (we use the name osmgeocoder for the DB name and geocoder for the DB user in the example)
- Create the PostGIS, trigram and fuzzy string search extensions for the DB:
CREATE SCHEMA gis; -- isolate postgis into its own schema for easier development
ALTER SCHEMA gis OWNER TO geocoder;
CREATE EXTENSION postgis WITH SCHEMA gis; -- put postgis into gis schema
CREATE SCHEMA str; -- isolate string functions into its own schema for easier development
ALTER SCHEMA str OWNER TO geocoder;
CREATE EXTENSION pg_trgm WITH SCHEMA str; -- trigram search, used for forward geocoding
CREATE EXTENSION fuzzystrmatch WITH SCHEMA str; -- metaphone search, used for text prediction
CREATE SCHEMA crypto; -- isolate crypto functions into its own schema for easier development
ALTER SCHEMA crypto OWNER TO geocoder;
CREATE EXTENSION pgcrypto WITH SCHEMA crypto; -- used to generate uuids
ALTER DATABASE osmgeocoder SET search_path TO public, gis, str, crypto; -- set search path to include the other schemas
- Fetch a copy of imposm3
- Get an OpenStreetMap data file (for example from Geofabrik, start with a small region!)
- Create a virtualenv and install packages:
pipenv sync
- See below for importing openaddresses.io data if needed (this is completely optional)
- Import some OpenStreetMap data into the DB (grab a coffee or two):
$ bin/prepare_osm.py --db postgresql://geocoder:password@localhost/osmgeocoder --import-data osm.pbf --optimize
- Modify the configuration file to match your setup. The example config is in osmgeocoder/data/config-example.json.
- Optionally install and start the postal machine learning address categorizer (see below)
- Import the geocoding functions into the DB:
$ bin/finalize_geocoder.py --db postgresql://geocoder:password@localhost/osmgeocoder
- Geocode:
bin/address2coordinate.py --config config.json --center 48.3849 10.8631 Lauterl
bin/coordinate2address.py --config config.json 48.3849 10.8631
For a full example see the example_setup.sh shell script.
NOTE: you can also install this via pip:
- The scripts from the bin directory will be copied to your environment.
- An example config file will be placed in your virtualenv in osmgeocoder/data/config-example.json
- The pip installation will not install flask and gunicorn, nor will it try to install postal. If you want to use those services you need to install those optional dependencies yourself (read on!)
For some countries there are not enough buildings tagged in the OSM data, so we can use the OpenAddresses.io data to augment the OSM data.
The import is relatively slow because the data comes as a large number of zipped CSV files. The importer uses multiple threads to speed this up, but it can still take a while.
# download openaddresses.io data
wget https://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-europe.zip
# run an import
pipenv run bin/import_openaddress_data.py \
--db postgresql://geocoder:password@host/osmgeocoder \
--threads 4 \
--optimize \
openaddr-collected-europe.zip
The import creates some tables in your DB: license, which contains the licenses of the imported data (the API will return the license attribution string with the data), oa_city, which is a foreign key target of oa_street, and oa_house, which contains the imported data and in turn references oa_street.
If you want to import more than one file, just do so: the tables are not cleared between import runs, although the indices are dropped and rebuilt after each import. Skip the --optimize flag for the individual imports and run an optimize-only pass last to save some time.
If you want to save even more time import with --fast, but be aware this leaves the DB without any indices or foreign key constraints, so an optimize pass is required after importing with this flag!
If you want to start over run the command with the --clean-start flag. Be careful, this destroys all openaddresses.io data in the tables.
Be aware that the make process will download some data-files (about 1GB in size). The installation of libpostal will need around 1 GB of disk space and about 2 GB of disk space while compiling.
Currently there is no Ubuntu package for libpostal, so we have to install it by hand:
git clone https://github.com/openvenues/libpostal
cd libpostal
./bootstrap.sh
./configure --prefix=/opt/libpostal --datadir=/opt/libpostal/share
make -j4
sudo make install
echo "/opt/libpostal/lib" | sudo tee /etc/ld.so.conf.d/libpostal.conf
sudo ldconfig
echo 'export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/libpostal/lib/pkgconfig"' | sudo tee /etc/profile.d/libpostal.sh
Now log out and back in, or run a new login shell (e.g. bash -l), and install the missing Python modules:
workon osmgeocoder
CFLAGS="-L/opt/libpostal/lib -I/opt/libpostal/include" pip install postal
pip install gunicorn
pip install flask
Source checkout:
pipenv run bin/postal_service.py --config config/config.json
PIP install:
/path/to/virtualenv/bin/postal_service.py --config config.json
Attention: Depending on the speed of your disk, startup of this service may take some seconds (this is why this is implemented as a service) and it will take about 2 GB of RAM, so be warned!
If you want to run it in production mode just run it with gunicorn directly.
See the Gunicorn documentation for further information.
A simple example follows (one worker, run as daemon, bind to 127.0.0.1:3200):
pipenv run gunicorn postal_service:app \
--bind 127.0.0.1:3200 \
--workers 1 \
--pid /var/run/postal_service.pid \
--log-file /var/log/postal_service.log \
--daemon
Attention: Every worker takes that 2GB RAM toll!
The file geocoder_service.py is a simple Flask app to present the geocoder as an HTTP service.
pipenv run pip install gunicorn
pipenv run pip install flask
You will need a working config file too.
The service will search for a config file in the following places:
~/.osmgeocoderrc
~/.config/osmgeocoder.json
/etc/osmgeocoder.json
osmgeocoder.json
You can override the path by setting the environment variable GEOCODER_CONFIG.
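The lookup described above could be sketched as follows. This is not the service's actual code: find_config is a hypothetical helper, and both "the environment variable wins" and "the first existing file in the list wins" are assumptions about the order of precedence.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Search locations as listed above; first-match-wins is an assumption.
SEARCH_PATHS = [
    "~/.osmgeocoderrc",
    "~/.config/osmgeocoder.json",
    "/etc/osmgeocoder.json",
    "osmgeocoder.json",
]


def find_config(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Hypothetical helper: return the config path the service would likely use."""
    env = os.environ if env is None else env
    override = env.get("GEOCODER_CONFIG")
    if override:
        # The environment variable overrides the search list.
        return Path(override)
    for candidate in SEARCH_PATHS:
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None
```

Calling find_config() with no arguments uses the real process environment; passing a dictionary makes the lookup easy to try out without touching your shell.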
Gunicorn example:
pipenv run gunicorn geocoder_service:app \
--env 'GEOCODER_CONFIG=config/config.json' \
--bind 127.0.0.1:8080 \
--workers 4 \
--pid /var/run/osmgeocoder_service.pid \
--log-file /var/log/osmgeocoder_service.log \
--daemon
Address string to coordinate.
- Endpoint /forward
- Method POST
- Content-Type application/json
- Body:
  - address: (required) User input / address to convert to coordinates
  - center: (optional) Array with center coordinate to sort matches
  - country: (optional) ISO country code, use only if no center coordinate is available as it slows down the geocoder massively.
- Response: Array of objects
  - address: Fully written address line, formatted by country standards
  - lat: Latitude
  - lon: Longitude
  - license: License attribution string
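A request against this endpoint could be built with the Python standard library roughly like this. It is a sketch under assumptions: the base URL is taken from the Gunicorn example (127.0.0.1:8080), the center array is assumed to be ordered [lat, lon] like the --center option of the command line tool, and build_forward_request is a made-up helper name.

```python
import json
import urllib.request

# Assumption: the service listens where the Gunicorn example binds it.
BASE_URL = "http://127.0.0.1:8080"


def build_forward_request(address, center=None, country=None):
    """Build (but do not send) a POST request for the /forward endpoint."""
    body = {"address": address}
    if center is not None:
        body["center"] = list(center)  # assumed order: [lat, lon]
    if country is not None:
        body["country"] = country
    return urllib.request.Request(
        BASE_URL + "/forward",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_forward_request("Lauterl", center=(48.3849, 10.8631))
# To send it against a running service:
#     with urllib.request.urlopen(request) as response:
#         results = json.load(response)  # array of objects as described above
```

Building the request separately from sending it keeps the example usable without a running service.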
Coordinate to address string.
- Endpoint /reverse
- Method POST
- Content-Type application/json
- Body:
  - lat: Latitude
  - lon: Longitude
- Response: Object
  - address: Nearest address to the point (building search) or null, formatted by country standards
  - license: License attribution string
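Because address may be null, a client should handle the "nothing found" case explicitly. A minimal sketch of parsing the response body follows; the sample payloads are hand-written for illustration and are not real service output, and parse_reverse_response is a hypothetical helper.

```python
import json
from typing import Optional, Tuple


def parse_reverse_response(payload: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (address, license) from a /reverse response body.

    The address is None when the service found no building near the point.
    """
    data = json.loads(payload)
    return data.get("address"), data.get("license")


# Hand-written sample payloads, not real service output.
sample_hit = json.dumps(
    {"address": "Example Street 1\n12345 Example City", "license": "Example attribution"}
)
sample_miss = json.dumps({"address": None})

hit = parse_reverse_response(sample_hit)
miss = parse_reverse_response(sample_miss)
```

JSON null maps to Python None, so a simple `if address is None` check is enough on the client side.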
Intelligent text completion while typing.
- Endpoint /predict
- Method POST
- Content-Type application/json
- Body:
  - query: User input
- Response: Object
  - predictions: Up to 10 text predictions, sorted by equality and most common first
Example:
{
"db": {
"dbname": "osm",
"user": "osm",
"password": "password"
},
"opencage_data_file": "data/worldwide.yml",
"postal": {
"service_url": "http://localhost:3200/",
"port": 3200
}
}
Keys:
- db: Database configuration, this will be built into a Postgres connection string
- postal -> service_url: (optional) URL where to find the libpostal service, if not supplied searching is reduced to street names only
- postal -> port: (optional) only used when running the libpostal service directly without explicitly using gunicorn
- opencage_data_file: (optional) Data file for the address formatter, defaults to the one included in the package
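How the db dictionary turns into a connection string is not shown here. One plausible way, given purely as a sketch, is to join the keys into a libpq keyword/value string; build_dsn is a hypothetical helper and not necessarily what the package does internally.

```python
def build_dsn(db: dict) -> str:
    """Join config keys into a libpq keyword/value connection string."""
    parts = []
    for key, value in db.items():
        # Quote every value and escape backslashes and single quotes,
        # which is the escaping libpq expects inside quoted values.
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        parts.append(f"{key}='{escaped}'")
    return " ".join(parts)


dsn = build_dsn({"dbname": "osm", "user": "osm", "password": "password"})
print(dsn)  # dbname='osm' user='osm' password='password'
```

With this approach any additional libpq keyword (for example host or port) placed in the db section would simply be passed through.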
The complete project actually contains only two classes:
Publicly accessible method prototypes are:
def __init__(self, db=None, db_handle=None, address_formatter_config=None, postal=None):
    pass
def forward(self, address, country=None, center=None):
    pass
def forward_dict(self, address, country=None, center=None):
    pass
def forward_structured(self, road=None, house_number=None, postcode=None, city=None, country=None, center=None):
    pass
def forward_structured_dict(self, road=None, house_number=None, postcode=None, city=None, country=None, center=None):
    pass
def reverse(self, lat, lon, radius=100, limit=10):
    pass
def reverse_dict(self, lat, lon, radius=100, limit=10):
    pass
def reverse_epsg3857(self, x, y, radius=100, limit=10):
    pass
def reverse_epsg3857_dict(self, x, y, radius=100, limit=10):
    pass
def predict_text(self, input):
    pass
Initialize a geocoder. This reads all required files and sets up the DB connection.
- db: Dictionary with DB config, when used the geocoder will create a DB connection on its own
- db_handle: Postgres connection, use this if the connection is handled outside the scope of the geocoder (for example when you want to use the geocoder in Django)
- address_formatter_config: Path to the worldwide.yml (optional)
- postal: Dictionary with postal config (at least the service_url key)
See Config File above for more info.
Geocode an address to a lat, lon location.
- address: Address to code
- country: (optional) Country code to restrict search and format address
- center: (optional) Center coordinate to sort results for (will be used to determine country too, so you can skip the country flag)
This function is a generator which yields the obtained results.
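Since the results come from a generator, you can stop early without fetching everything. The sketch below uses a stand-in generator instead of a real geocoder instance so it runs without a database; fake_forward and the (address, lat, lon) tuple shape are assumptions for illustration only.

```python
from itertools import islice


def fake_forward(address):
    """Stand-in for the geocoder's forward(); yields made-up results lazily."""
    for i in range(1, 1000):
        yield (f"{address} {i}", 48.0 + i / 1000, 10.0 + i / 1000)


# Take only the first three matches; later results are never produced.
top_three = list(islice(fake_forward("Example Street"), 3))

# Take just the best match, or None if the generator is empty.
best = next(fake_forward("Example Street"), None)
```

The same islice and next patterns apply to the other generator functions described in this section.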
Geocode an address to a lat, lon location without using the address classifier. Use this if your input is already structured.
- road: (optional) Street/Road name
- house_number: (optional) House number, this is a string because of things like 1a
- postcode: (optional) Post code, this is a string because not all countries use numbers only and some use zero prefixes
- city: (optional) City
- country: (optional) Country code to restrict search and format address
- center: (optional) Center coordinate to sort results for (will be used to determine country too, so you can skip the country flag)
Be sure that at least one of road, postcode or city is filled, results are not predictable if none is set.
This function is a generator which yields the obtained results.
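A caller can guard against the "nothing set" case before querying. This is a minimal sketch; has_search_field is a hypothetical helper on the calling side and not part of the package.

```python
def has_search_field(road=None, postcode=None, city=None) -> bool:
    """True if at least one of road, postcode or city is non-empty."""
    return any(
        value is not None and bool(str(value).strip())
        for value in (road, postcode, city)
    )
```

For example, has_search_field(postcode="01067") passes, while a call with only whitespace in road does not, so the request can be rejected before it reaches the database.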
Geocode a lat, lon location into a readable address:
- lat: Latitude to code
- lon: Longitude to code
- radius: Search radius in meters
- limit: (optional) maximum number of results to return
This function is a generator which yields the obtained results.
Geocode an x, y location in EPSG 3857 projection (aka Web Mercator) into a readable address:
- x: X coordinate
- y: Y coordinate
- radius: Search radius in meters
- limit: (optional) maximum number of results to return
Use this function if you're using Web Mercator in your application internally to avoid constant re-projection between lat, lon and x, y.
This function is a generator which yields the obtained results.
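If you only need to convert at the edge of your application, the spherical Web Mercator formulas are short enough to write by hand. The sketch below is plain math and independent of this package; the function names are made up.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def latlon_to_epsg3857(lat: float, lon: float):
    """Project WGS84 degrees to Web Mercator meters."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def epsg3857_to_latlon(x: float, y: float):
    """Inverse projection: Web Mercator meters back to WGS84 degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lat, lon


x, y = latlon_to_epsg3857(48.3849, 10.8631)
lat, lon = epsg3857_to_latlon(x, y)
```

The projection is undefined at the poles, so latitudes should stay within roughly ±85.05 degrees.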
Return possible text prediction results for the user input. This can be used while the user is typing their query to reduce the load on the database (by avoiding typos and running fewer requests against the geocoder, because the user no longer types long words one character at a time).
- input: User input
This function is a generator which yields the obtained results.
ATTENTION: Do not feed complete "sentences" into this function as it will not yield the expected result. Tokenize into words on the client side and only request predictions for the current word the user is editing.
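Picking the word under the cursor can be done with a few lines on the client. The following is a sketch that splits on whitespace only; current_word is a hypothetical helper, and a real client may want to treat punctuation as a separator too.

```python
def current_word(text: str, cursor: int) -> str:
    """Return the whitespace-delimited word surrounding the cursor position."""
    start = cursor
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = cursor
    while end < len(text) and not text[end].isspace():
        end += 1
    return text[start:end]


word = current_word("Main Str", 8)  # cursor at the end of the input
```

Only the returned word would then be sent as the prediction query, and an empty result (cursor right after a space) means there is nothing to predict yet.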
Publicly accessible method prototypes are:
def __init__(self, config=None):
    pass
def format(self, address, country=None):
    pass
Initialize the address formatter.
- config: (optional) override default config file to use for the address formatter, defaults to config file included in this package
Format an address in the default layout used in the specified country. Return value may contain line breaks.
- address: Dictionary that contains the address parts, see below for recognized keys
- country: Country code of the formatting template to use
Recognized keys in address:
attention
house
road
house_number
postcode
city
town
village
county
state
country
suburb
city_district
state_district
state_code
neighbourhood
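Before handing a dictionary to the formatter it can be useful to check which of its keys are in the list above. This is a sketch only: split_address_keys is a hypothetical helper, and what the formatter itself does with unrecognized keys is not specified here.

```python
# Keys copied from the list above.
RECOGNIZED_KEYS = {
    "attention", "house", "road", "house_number", "postcode", "city", "town",
    "village", "county", "state", "country", "suburb", "city_district",
    "state_district", "state_code", "neighbourhood",
}


def split_address_keys(address: dict):
    """Split a dict into (recognized non-empty parts, sorted unknown key names)."""
    known = {k: v for k, v in address.items() if k in RECOGNIZED_KEYS and v}
    unknown = sorted(k for k in address if k not in RECOGNIZED_KEYS)
    return known, unknown


known, unknown = split_address_keys(
    {"road": "Example Street", "house_number": "1a", "zip": "12345", "city": ""}
)
```

Here "zip" is reported as unknown (the recognized key is postcode) and the empty city is dropped, which makes typos in key names easy to spot.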