add developers documentation for query-side of tokenizer
lonvia committed Dec 13, 2024
1 parent fbb6edf commit 5b40aa5
Showing 1 changed file with 33 additions and 5 deletions.
docs/develop/Tokenizers.md
@@ -91,14 +91,19 @@
for a custom tokenizer implementation.

### Directory Structure

Nominatim expects two files containing the Python part of the implementation:

* `src/nominatim_db/tokenizer/<NAME>_tokenizer.py` contains the tokenizer
  code used during import and
* `src/nominatim_api/search/<NAME>_tokenizer.py` has the code used during
  query time.

`<NAME>` is a unique name for the tokenizer consisting of only lower-case
letters, digits and underscores. A tokenizer also needs to install some SQL
functions. By convention, these should be placed in `lib-sql/tokenizer`.

If the tokenizer has a default configuration file, this should be saved in
`settings/<NAME>_tokenizer.<SUFFIX>`.
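
As a concrete example, a hypothetical tokenizer called `mytok` that ships a
YAML default configuration would add roughly the following files (the name,
the SQL file name and the `.yaml` suffix are chosen purely for illustration):

```
src/nominatim_db/tokenizer/mytok_tokenizer.py
src/nominatim_api/search/mytok_tokenizer.py
lib-sql/tokenizer/mytok_tokenizer.sql
settings/mytok_tokenizer.yaml
```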

### Configuration and Persistence

@@ -110,9 +115,11 @@
are tied to a database installation and must only be read during installation
time. If they are needed at runtime, they must be saved into the
`nominatim_properties` table and later loaded from there.
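
For illustration, import code might persist such a setting along these lines.
This is a minimal sketch, assuming psycopg 3 and the conventional
`property`/`value` columns of `nominatim_properties` with `property` being
unique; the helper names are made up and not part of any documented interface:

```python
from typing import Optional

import psycopg


def save_tokenizer_property(dsn: str, name: str, value: str) -> None:
    """Persist a tokenizer setting during import so it is available at runtime."""
    with psycopg.connect(dsn) as conn:
        # The upsert relies on a unique constraint on 'property' (assumed here).
        conn.execute(
            """INSERT INTO nominatim_properties (property, value)
               VALUES (%s, %s)
               ON CONFLICT (property) DO UPDATE SET value = EXCLUDED.value""",
            (name, value))


def load_tokenizer_property(dsn: str, name: str) -> Optional[str]:
    """Read back a previously persisted tokenizer setting."""
    with psycopg.connect(dsn) as conn:
        row = conn.execute(
            "SELECT value FROM nominatim_properties WHERE property = %s",
            (name,)).fetchone()
        return row[0] if row else None
```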

### The Python modules

#### `src/nominatim_db/tokenizer/`

The import Python module is expected to export a single factory function:

```python
def create(dsn: str, data_dir: Path) -> AbstractTokenizer
```

The `dsn` parameter contains the DSN of the Nominatim database. The `data_dir`
is a directory in the project directory that the tokenizer may use to save
database-specific data. The function must return the instance of the tokenizer
class as defined below.
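
Put together, the import-side module of a hypothetical `mytok` tokenizer could
be sketched as follows. `MyTokenizer` and its constructor are illustrative
only; the abstract methods of `AbstractTokenizer` documented below still have
to be implemented before the class becomes usable:

```python
# src/nominatim_db/tokenizer/mytok_tokenizer.py -- illustrative sketch
from pathlib import Path

from .base import AbstractTokenizer


class MyTokenizer(AbstractTokenizer):
    """Import-side tokenizer.

       All abstract methods of AbstractTokenizer must be implemented here
       before the class can actually be instantiated.
    """

    def __init__(self, dsn: str, data_dir: Path) -> None:
        self.dsn = dsn
        self.data_dir = data_dir


def create(dsn: str, data_dir: Path) -> AbstractTokenizer:
    """Factory function called by Nominatim during import and setup."""
    return MyTokenizer(dsn, data_dir)
```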

#### `src/nominatim_api/search/`

The query-time Python module must also export a factory function:

```python
def create_query_analyzer(conn: SearchConnection) -> AbstractQueryAnalyzer
```

The `conn` parameter contains the current search connection. See the
[library documentation](../library/Low-Level-DB-Access.md#searchconnection-class)
for details on the class. The function must return the instance of the query
analyzer class as defined below.
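
The query-time module of the same hypothetical `mytok` tokenizer could be
sketched along the same lines. `MyQueryAnalyzer` is illustrative, the import
of `SearchConnection` from `nominatim_api.connection` is an assumption, and
the abstract methods of `AbstractQueryAnalyzer` documented below still have to
be implemented:

```python
# src/nominatim_api/search/mytok_tokenizer.py -- illustrative sketch
from ..connection import SearchConnection
from .query_analyzer_factory import AbstractQueryAnalyzer


class MyQueryAnalyzer(AbstractQueryAnalyzer):
    """Query-time analyzer that turns an incoming query into tokens.

       All abstract methods of AbstractQueryAnalyzer must be implemented
       here before the class can actually be instantiated.
    """

    def __init__(self, conn: SearchConnection) -> None:
        self.conn = conn


def create_query_analyzer(conn: SearchConnection) -> AbstractQueryAnalyzer:
    """Factory function called when Nominatim sets up a search connection."""
    return MyQueryAnalyzer(conn)
```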


### Python Tokenizer Class

All tokenizers must inherit from `nominatim_db.tokenizer.base.AbstractTokenizer`
and implement the abstract functions defined there.

::: nominatim_db.tokenizer.base.AbstractTokenizer
    options:
        heading_level: 6


### Python Query Analyzer Class

::: nominatim_api.search.query_analyzer_factory.AbstractQueryAnalyzer
    options:
        heading_level: 6

### PL/pgSQL Functions

The tokenizer must provide access functions for the `token_info` column