Merge pull request #2757 from lonvia/filter-postcodes
Add filtering, normalisation and variants for postcodes
lonvia authored Jun 24, 2022
2 parents 0cd3a1b + 536f08f commit 3bf3b89
Showing 35 changed files with 1,563 additions and 222 deletions.
2 changes: 1 addition & 1 deletion .pylintrc
@@ -13,4 +13,4 @@ ignored-classes=NominatimArgs,closing
# 'too-many-ancestors' is triggered already by deriving from UserDict
disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use

good-names=i,x,y,fd,db
good-names=i,x,y,fd,db,cc
149 changes: 149 additions & 0 deletions docs/customize/Country-Settings.md
@@ -0,0 +1,149 @@
# Customizing Per-Country Data

Whenever an OSM object is imported into Nominatim, the object is first assigned
a country. Nominatim can use this information to adapt various aspects of
the address computation to the local customs of the country. This section
explains how country assignment works and the principal per-country
localizations.

## Country assignment

Countries are assigned on the basis of country data from the OpenStreetMap
input data itself. Countries are expected to be tagged according to the
[administrative boundary schema](https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative):
an OSM relation with `boundary=administrative` and `admin_level=2`. Nominatim
uses the country code to distinguish the countries.

If there is no country data available for a point, then Nominatim uses the
fallback data imported from `data/country_osm_grid.sql.gz`. This was computed
from OSM data as well but is guaranteed to cover all countries.

Some OSM objects may also be located outside any country, for example a buoy
in the middle of the ocean. These objects do not get any country assigned and
get a default treatment when it comes to localized handling of data.
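
The lookup order described above can be sketched as follows. This is only an illustration: `assign_country` and the two lookup callables are hypothetical names, not part of Nominatim, and the real assignment happens inside the database.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]
Lookup = Callable[[Point], Optional[str]]


def assign_country(point: Point,
                   from_boundaries: Lookup,
                   from_fallback_grid: Lookup) -> Optional[str]:
    """Return a country code for the point, or None if it lies outside any country.

    Both lookups are hypothetical stand-ins: the first for the OSM country
    boundaries, the second for the data from country_osm_grid.sql.gz.
    """
    return from_boundaries(point) or from_fallback_grid(point)


# Toy lookups: only one point is covered by a boundary, one more by the grid.
boundaries = {(42.5, 1.5): "ad"}
grid = {(32.3, -64.8): "bm"}

print(assign_country((42.5, 1.5), boundaries.get, grid.get))
print(assign_country((32.3, -64.8), boundaries.get, grid.get))
print(assign_country((0.0, -30.0), boundaries.get, grid.get))
```

An object for which both lookups return nothing keeps `None` as its country and would receive the default treatment.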

## Per-country settings

### Global country settings

The main place to configure settings per country is the file
`settings/country_settings.yaml`. This file has one section per country that
is recognised by Nominatim. Each section is tagged with the country code
(in lower case) and contains the different localization information. Only
countries which are listed in this file are taken into account for computations.

For example, the section for Andorra looks like this:

```
ad:
    partition: 35
    languages: ca
    names: !include country-names/ad.yaml
    postcode:
      pattern: "(ddd)"
      output: AD\1
```

The individual settings are described below.

#### `partition`

Nominatim internally splits the data into multiple tables to improve
performance. The partition number tells Nominatim into which table to put
the country. This is purely internal management and has no effect on the
output data.

The default is to have one partition per country.

#### `languages`

A comma-separated list of ISO-639 language codes of default languages in the
country. These are the languages used in name tags without a language suffix.
Note that this is not necessarily the same as the list of official languages
in the country. There may be officially recognised languages in a country
which are only ever used in name tags with the appropriate language suffixes.
Conversely, a non-official language may appear a lot in the name tags, for
example when used as an unofficial Lingua Franca.

List the languages in order of frequency of appearance with the most frequently
used language first. It is not recommended to add languages when there are only
very few occurrences.

If only one language is listed, then Nominatim will 'auto-complete' the
language of names without an explicit language-suffix.

#### `names`

List of names of the country and its translations. These names are used as
a baseline. It is always possible to search countries by the given names, no
matter what other names are in the OSM data. They are also used as a fallback
when a needed translation is not available.

!!! Note
    The list of names per country is currently fairly large because Nominatim
    supports translations in many languages by default. That is why the
    name lists have been separated out into extra files. You can find the
    name lists in the file `settings/country-names/<country code>.yaml`.
    The names section in the main country settings file only refers to these
    files via the special `!include` directive.

#### `postcode`

Describes the format of the postcode that is in use in the country.

When a country has no official postcodes, set this to `no`. Example:

```
ae:
    postcode: no
```

When a country has a postcode, you need to state the postcode pattern and
the default output format. Example:

```
bm:
    postcode:
      pattern: "(ll)[ -]?(dd)"
      output: \1 \2
```

The **pattern** is a regular expression that describes the possible formats
accepted as a postcode. The pattern follows the standard syntax for
[regular expressions in Python](https://docs.python.org/3/library/re.html#regular-expression-syntax)
with two extra shortcuts: `d` is a shortcut for a single digit (`[0-9]`)
and `l` for a single ASCII letter (`[A-Z]`).

Use match groups to indicate groups in the postcode that may optionally be
separated with a space or a hyphen.

For example, the postcode for Bermuda above always consists of two letters
and two digits. They may optionally be separated by a space or hyphen. That
means that Nominatim will consider `AB56`, `AB 56` and `AB-56` spelling variants
for one and the same postcode.
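
As a minimal sketch of how such a pattern can be read, the snippet below expands the two shortcuts by plain string replacement and tests the Bermuda variants. The helper `expand_shortcuts` is a hypothetical name, and the naive replacement is an assumption made for illustration; it would also rewrite any literal `d` or `l` in a pattern.

```python
import re


def expand_shortcuts(pattern: str) -> str:
    """Replace the 'd' and 'l' shortcuts with ordinary character classes.

    Assumption: a plain string replacement is enough for the simple
    patterns shown in this section.
    """
    return pattern.replace('d', '[0-9]').replace('l', '[A-Z]')


BERMUDA = re.compile(expand_shortcuts("(ll)[ -]?(dd)"))

for candidate in ("AB56", "AB 56", "AB-56", "AB_56", "A56"):
    # fullmatch() requires the whole string to be a postcode.
    print(candidate, bool(BERMUDA.fullmatch(candidate)))
```

The first three candidates match; `AB_56` and `A56` do not, because an underscore is not an accepted separator and the pattern requires exactly two letters.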

Never add the country code in front of the postcode pattern. Nominatim will
automatically accept variants with a country code prefix for all postcodes.
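
One way to picture the automatic prefix handling is to wrap the configured pattern in an optional, non-capturing country-code group. The function name `match_postcode` and the choice of an optional space or hyphen after the prefix are assumptions for this sketch, not a description of Nominatim's internals.

```python
import re
from typing import Optional


def match_postcode(postcode: str, country_code: str, pattern: str) -> Optional[str]:
    """Return the postcode without an optional country-code prefix, or None."""
    expanded = pattern.replace('d', '[0-9]').replace('l', '[A-Z]')
    prefix = re.escape(country_code.upper())
    # Assumed form of the prefix: country code plus optional space or hyphen.
    full = re.compile(rf'\s*(?:{prefix}[ -]?)?({expanded})\s*')
    match = full.fullmatch(postcode.upper())
    return match.group(1) if match else None


print(match_postcode("AD500", "ad", "(ddd)"))
print(match_postcode("ad-500", "ad", "(ddd)"))
print(match_postcode("FR500", "ad", "(ddd)"))
```

Because the prefix group is optional, the regular expression engine can still match a postcode that itself starts with the country's letters, such as `BM12` in Bermuda.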

The **output** field is an optional field that describes what the canonical
spelling of the postcode should be. The format is the
[regular expression expand syntax](https://docs.python.org/3/library/re.html#re.Match.expand) referring back to the bracket groups in the pattern.

Most simple postcodes only have one spelling variant. In that case, the
**output** can be omitted. The postcode will simply be used as is.

In the Bermuda example above, the canonical spelling would be to have a space
between letters and digits.

!!! Warning
    When your postcode pattern covers multiple variants of the postcode, then
    you must explicitly state the canonical output or Nominatim will not
    handle the variations correctly.
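
The interplay of **pattern** and **output** can be sketched with `re.Match.expand()`. The function `canonical_postcode` is a hypothetical helper, and the default template `\g<0>` (the whole match, unchanged) is an assumption standing in for "used as is".

```python
import re
from typing import Optional


def canonical_postcode(postcode: str, pattern: str,
                       output: str = r'\g<0>') -> Optional[str]:
    """Return the canonical spelling of a postcode, or None if it is invalid."""
    expanded = pattern.replace('d', '[0-9]').replace('l', '[A-Z]')
    match = re.fullmatch(expanded, postcode.strip().upper())
    if match is None:
        return None
    # The output template refers back to the bracket groups of the pattern.
    return match.expand(output)


BM_PATTERN = "(ll)[ -]?(dd)"

print(canonical_postcode("ab-56", BM_PATTERN, r"\1 \2"))
print(canonical_postcode("AB56", BM_PATTERN, r"\1 \2"))
print(canonical_postcode("500", "(ddd)", r"AD\1"))
# Without an output template the spelling variants stay distinct.
print(canonical_postcode("AB-56", BM_PATTERN))
```

The last call illustrates the warning: with a multi-variant pattern and no **output**, `AB-56` and `AB 56` would be kept as different strings.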

### Other country-specific configuration

There are some other configuration files where you can set localized settings
according to the assigned country. These are:

* [Place ranking configuration](Ranking.md)

Please see the linked documentation sections for more information.
24 changes: 22 additions & 2 deletions docs/customize/Tokenizers.md
@@ -205,6 +205,14 @@ The following is a list of sanitizers that are shipped with Nominatim.
    rendering:
        heading_level: 6

##### clean-postcodes

::: nominatim.tokenizer.sanitizers.clean_postcodes
    selection:
        members: False
    rendering:
        heading_level: 6


#### Token Analysis

@@ -222,8 +230,12 @@ by a sanitizer (see for example the
The token-analysis section contains the list of configured analyzers. Each
analyzer must have an `id` parameter that uniquely identifies the analyzer.
The only exception is the default analyzer that is used when no special
analyzer was selected. There is one special id '@housenumber'. If an analyzer
with that name is present, it is used for normalization of house numbers.
analyzer was selected. There are analyzers with special ids:

* '@housenumber'. If an analyzer with that name is present, it is used
  for normalization of house numbers.
* '@postcode'. If an analyzer with that name is present, it is used
  for normalization of postcodes.

Different analyzer implementations may exist. To select the implementation,
the `analyzer` parameter must be set. The different implementations are
@@ -356,6 +368,14 @@ house numbers of the form '3 a', '3A', '3-A' etc. are all considered equivalent.

The analyzer cannot be customized.

##### Postcode token analyzer

The analyzer `postcodes` is purpose-made to analyze postcodes. It supports
a 'lookup' variant of the token, which produces variants with optional
spaces. Use together with the clean-postcodes sanitizer.

The analyzer cannot be customized.
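
To make "variants with optional spaces" concrete, the following sketch generates every spelling in which each space of a canonical postcode is either kept or dropped. `space_variants` is a hypothetical helper; whether the shipped analyzer enumerates variants in exactly this way is an assumption.

```python
from itertools import product
from typing import List


def space_variants(postcode: str) -> List[str]:
    """Return all spellings where each space is either kept or removed."""
    parts = postcode.split(' ')
    variants = []
    # One separator choice (' ' or '') per gap between two parts.
    for separators in product((' ', ''), repeat=len(parts) - 1):
        variant = parts[0]
        for separator, part in zip(separators, parts[1:]):
            variant += separator + part
        variants.append(variant)
    return variants


print(space_variants("AB 56"))
print(space_variants("12345"))
```

With the clean-postcodes sanitizer producing one canonical spelling such as `AB 56`, a lookup for either `AB 56` or `AB56` could then find the same postcode.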

### Reconfiguration

Changing the configuration after the import is currently not possible, although
6 changes: 3 additions & 3 deletions docs/develop/Tokenizers.md
@@ -245,11 +245,11 @@ Currently, tokenizers are encouraged to make sure that matching works against
both the search token list and the match token list.

```sql
FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
FUNCTION token_get_postcode(info JSONB) RETURNS TEXT
```

Return the normalized version of the given postcode. This function must return
the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.
Return the postcode for the object, if any exists. The postcode must be in
the form that should also be presented to the end-user.
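
In this commit, both the ICU and the legacy tokenizer implement the function as `SELECT info->>'postcode'`. A rough Python equivalent is sketched below, assuming the token info arrives as a JSON object encoded in a string; the Python function is purely illustrative and not part of the tokenizer interface.

```python
import json
from typing import Optional


def token_get_postcode(info: Optional[str]) -> Optional[str]:
    """Mimic `info->>'postcode'` for a JSON object given as text."""
    if info is None:
        # The SQL function is declared STRICT: NULL in, NULL out.
        return None
    value = json.loads(info).get('postcode')
    return None if value is None else str(value)


print(token_get_postcode('{"postcode": "AB 56", "names": "{1,2}"}'))
print(token_get_postcode('{"names": "{1,2}"}'))
```

An object without a postcode in its token info yields `None`, which lets the caller fall back to other sources.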

```sql
FUNCTION token_strip_info(info JSONB) RETURNS JSONB
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -28,6 +28,7 @@ pages:
- 'Overview': 'customize/Overview.md'
- 'Import Styles': 'customize/Import-Styles.md'
- 'Configuration Settings': 'customize/Settings.md'
- 'Per-Country Data': 'customize/Country-Settings.md'
- 'Place Ranking' : 'customize/Ranking.md'
- 'Tokenizers' : 'customize/Tokenizers.md'
- 'Special Phrases': 'customize/Special-Phrases.md'
7 changes: 6 additions & 1 deletion lib-php/TokenPostcode.php
@@ -25,7 +25,12 @@ class Postcode
public function __construct($iId, $sPostcode, $sCountryCode = '')
{
$this->iId = $iId;
$this->sPostcode = $sPostcode;
$iSplitPos = strpos($sPostcode, '@');
if ($iSplitPos === false) {
$this->sPostcode = $sPostcode;
} else {
$this->sPostcode = substr($sPostcode, 0, $iSplitPos);
}
$this->sCountryCode = empty($sCountryCode) ? '' : $sCountryCode;
}

16 changes: 10 additions & 6 deletions lib-php/tokenizer/icu_tokenizer.php
@@ -190,13 +190,17 @@ private function addTokensFromDB(&$oValidTokens, $aTokens, $sNormQuery)
if ($aWord['word'] !== null
&& pg_escape_string($aWord['word']) == $aWord['word']
) {
$sNormPostcode = $this->normalizeString($aWord['word']);
if (strpos($sNormQuery, $sNormPostcode) !== false) {
$oValidTokens->addToken(
$sTok,
new Token\Postcode($iId, $aWord['word'], null)
);
$iSplitPos = strpos($aWord['word'], '@');
if ($iSplitPos === false) {
$sPostcode = $aWord['word'];
} else {
$sPostcode = substr($aWord['word'], 0, $iSplitPos);
}

$oValidTokens->addToken(
$sTok,
new Token\Postcode($iId, $sPostcode, null)
);
}
break;
case 'S': // tokens for classification terms (special phrases)
5 changes: 5 additions & 0 deletions lib-sql/functions/address_lookup.sql
@@ -320,6 +320,11 @@ BEGIN
location := ROW(null, null, null, hstore('ref', place.postcode), 'place',
'postcode', null, null, false, true, 5, 0)::addressline;
RETURN NEXT location;
ELSEIF place.address is not null and place.address ? 'postcode'
and not place.address->'postcode' SIMILAR TO '%(,|;)%' THEN
location := ROW(null, null, null, hstore('ref', place.address->'postcode'), 'place',
'postcode', null, null, false, true, 5, 0)::addressline;
RETURN NEXT location;
END IF;

RETURN;
19 changes: 9 additions & 10 deletions lib-sql/functions/interpolation.sql
@@ -156,7 +156,6 @@ DECLARE
linegeo GEOMETRY;
splitline GEOMETRY;
sectiongeo GEOMETRY;
interpol_postcode TEXT;
postcode TEXT;
stepmod SMALLINT;
BEGIN
@@ -174,8 +173,6 @@ BEGIN
ST_PointOnSurface(NEW.linegeo),
NEW.linegeo);

interpol_postcode := token_normalized_postcode(NEW.address->'postcode');

NEW.token_info := token_strip_info(NEW.token_info);
IF NEW.address ? '_inherited' THEN
NEW.address := hstore('interpolation', NEW.address->'interpolation');
@@ -207,6 +204,11 @@ BEGIN
FOR nextnode IN
SELECT DISTINCT ON (nodeidpos)
osm_id, address, geometry,
-- Take the postcode from the node only if it has a housenumber itself.
-- Note that there is a corner-case where the node has a wrongly
-- formatted postcode and therefore 'postcode' contains a derived
-- variant.
CASE WHEN address ? 'postcode' THEN placex.postcode ELSE NULL::text END as postcode,
substring(address->'housenumber','[0-9]+')::integer as hnr
FROM placex, generate_series(1, array_upper(waynodes, 1)) nodeidpos
WHERE osm_type = 'N' and osm_id = waynodes[nodeidpos]::BIGINT
@@ -260,13 +262,10 @@ BEGIN
endnumber := newend;

-- determine postcode
postcode := coalesce(interpol_postcode,
token_normalized_postcode(prevnode.address->'postcode'),
token_normalized_postcode(nextnode.address->'postcode'),
postcode);
IF postcode is NULL THEN
SELECT token_normalized_postcode(placex.postcode)
FROM placex WHERE place_id = NEW.parent_place_id INTO postcode;
postcode := coalesce(prevnode.postcode, nextnode.postcode, postcode);
IF postcode is NULL and NEW.parent_place_id > 0 THEN
SELECT placex.postcode FROM placex
WHERE place_id = NEW.parent_place_id INTO postcode;
END IF;
IF postcode is NULL THEN
postcode := get_nearest_postcode(NEW.country_code, nextnode.geometry);
5 changes: 2 additions & 3 deletions lib-sql/functions/placex_triggers.sql
@@ -992,7 +992,7 @@ BEGIN
{% if debug %}RAISE WARNING 'Got parent details from search name';{% endif %}

-- determine postcode
NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
NEW.postcode := coalesce(token_get_postcode(NEW.token_info),
location.postcode,
get_nearest_postcode(NEW.country_code, NEW.centroid));

@@ -1150,8 +1150,7 @@ BEGIN

{% if debug %}RAISE WARNING 'RETURN insert_addresslines: %, %, %', NEW.parent_place_id, NEW.postcode, nameaddress_vector;{% endif %}

NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
NEW.postcode);
NEW.postcode := coalesce(token_get_postcode(NEW.token_info), NEW.postcode);

-- if we have a name add this to the name search table
IF NEW.name IS NOT NULL THEN
27 changes: 25 additions & 2 deletions lib-sql/tokenizer/icu_tokenizer.sql
@@ -97,10 +97,10 @@ AS $$
$$ LANGUAGE SQL IMMUTABLE STRICT;


CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
RETURNS TEXT
AS $$
SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode))END;
SELECT info->>'postcode';
$$ LANGUAGE SQL IMMUTABLE STRICT;


@@ -223,3 +223,26 @@ BEGIN
END;
$$
LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION create_postcode_word(postcode TEXT, lookup_terms TEXT[])
RETURNS BOOLEAN
AS $$
DECLARE
existing INTEGER;
BEGIN
SELECT count(*) INTO existing
FROM word WHERE word = postcode and type = 'P';

IF existing > 0 THEN
RETURN TRUE;
END IF;

-- postcodes don't need word ids
INSERT INTO word (word_token, type, word)
SELECT lookup_term, 'P', postcode FROM unnest(lookup_terms) as lookup_term;

RETURN FALSE;
END;
$$
LANGUAGE plpgsql;

4 changes: 2 additions & 2 deletions lib-sql/tokenizer/legacy_tokenizer.sql
@@ -97,10 +97,10 @@ AS $$
$$ LANGUAGE SQL IMMUTABLE STRICT;


CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
RETURNS TEXT
AS $$
SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode))END;
SELECT info->>'postcode';
$$ LANGUAGE SQL IMMUTABLE STRICT;


Empty file added nominatim/data/__init__.py
Empty file.