diff --git a/README.md b/README.md
index 7614277a..0ab6182e 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,40 @@
-| WARNING: This repo is a work in progress! |
-| --- |
+# A natural language classification engine for geocoding
 
-An address parsing engine designed for geocoding.
+This library contains primitive 'building blocks' which can be composed together to produce a powerful and flexible natural language parser.
 
-Strategic goals:
-- Seperate unit, housenumber, road from 'everything else'
-- Does not require a corpus of 'real places' to operate
-- Do not attempt to classify administrative areas
-- Provide multiple solutions in the case of ambiguous parses
-- Basic typo correction
-- Honour delimiters
-- Extensible to handle queries such as 'pizza near new york'
-- Record offsets to the original token positions in the input text
-- Pluggable classifiers
-- Support for partially complete 'autocomplete' tokens
+The project was designed and built to work with the [Pelias geocoder](https://github.com/pelias/pelias), so it comes bundled with a parser called `AddressParser` which can be included in other npm projects independently of Pelias.
 
-#### CLI
+It is also possible to modify the configuration of `AddressParser`, the dictionaries or the semantics. You can also easily create a completely new parser to suit your own domain.
+
+[![NPM](https://nodei.co/npm/pelias-parser.png?downloads=true&stars=true)](https://nodei.co/npm/pelias-parser)
+
+[![Gitter](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/pelias/pelias)
+
+## AddressParser Example
+
+```
+30 w 26 st nyc 10010
+
+(0.95) ➜ [
+  { housenumber: '30' },
+  { street: 'w 26 st' },
+  { locality: 'nyc' },
+  { postcode: '10010' }
+]
+```
+
+## Application Interfaces
+
+You can access the library via three different interfaces:
+- all parts of the codebase are available in `javascript` via `npm`
+- on the `command line` via the `node bin/cli.js` script
+- through a `web service` via the `node server/http.js` script
+
+> the web service provides an interactive demo at the URL `/parser/parse`
+
+## Quick Start
+
+A quick and easy way to get started with the library is to use the command-line interface:
 
 ```
 node bin/cli.js West 26th Street, New York, NYC, 10010
@@ -23,89 +42,167 @@ node bin/cli.js West 26th Street, New York, NYC, 10010
 
 ![cli](./docs/cli.png)
 
-#### Server
+---
 
-```bash
-$ PORT=6100 npm run server;
-```
+# Architecture Description
 
-![demo](./docs/demo.png)
-![api](./docs/api.png)
+Please refer to the CLI screenshot above for a visual reference.
 
-### open browser
+## Tokenization
 
-the server should now be running and you should be able to access the http API:
+Tokenization is the process of splitting text into individual words.
 
-```bash
-http://localhost:6100/
-```
+The splitting process used by the engine maintains token positions, so it's able to 'remember' where each character was in the original input text.
+
+> Tokenization is coloured `blue` on the command-line.
+
+### Span
+
+The most primitive element is called a `span`; this is essentially just a single string of text with some metadata attached.
 
-try the following paths:
+The terms `word`, `phrase` and `section` (explained below) are all just ways of using a `span`.
+
+### Section Boundaries
+
+Some parsers like [libpostal](https://github.com/openvenues/libpostal) ignore characters such as `comma`, `tab`, `newline` and `quote`.
+
+While it's unrealistic to expect commas to always be present, it's very useful to record their positions when they are.
+
+These boundary positions help to avoid parsing errors, such as the query `Main St, East Village` being parsed as `Main St East` in `Village`.
+
+Once sections are established there is no 'bleeding' of information between sections, avoiding the issue above.
+
+### Word Splitting
+
+Each section is then split into individual `words`; by default this simply considers whitespace as a word boundary.
+
+As per the `section`, the original token positions are maintained.
+
+### Phrase Generation
+
+Many terms such as 'New York City' span multiple words; these multi-word tokens are called `phrases`.
+
+In order to be able to classify `phrase` terms, permutations of adjacent words are generated.
+
+Phrase generation is performed per-section, so it will not generate a `phrase` which contains words from more than one `section`.
+
+Phrase generation is controlled by a configuration which specifies things like the minimum & maximum number of words allowed in a `phrase`.
+
+### Token Graph
+
+A graph is used to associate `word`, `phrase` and `section` elements to each other.
+
+The graph is free-form, so it's easy to add a new relationship between terms in the future, as required.
+
+Graph Example:
 
 ```javascript
-/demo
-/parser/parse?text=12 main st
+// find the next word in this section
+word.findOne('next')
+
+// find all words in this phrase
+phrase.findAll('child')
 ```
 
-### Architecture overview
+## Classification
 
-#### 1. start with an input:
-```
-30 West 26th Street, New York, NYC, 10010
-```
+Classification is the process of establishing that a `word` or `phrase` represents a 'concept' (such as a street name).
 
-#### 2. split tokens in to logical groups:
-```
-[
-  "30 West 26th Street",
-  "New York",
-  "NYC",
-  "10010"
-]
-```
+Classification can be based on:
+- Dictionary matching (usually with normalization applied)
+- Pattern matching (such as regular expressions)
+- Composite matching (such as relative positioning)
+- External API calls (such as calling other services)
+- Other semantic matching techniques
 
-#### 3. tokenize groups:
-```
-[
-  [ "30", "west", "26th", "street" ],
-  [ "new", "york" ],
-  [ "nyc" ],
-  [ "10010" ]
-]
-```
+> Classification is coloured `green` and `red` on the command-line.
 
-#### 4. generate phrase permutations:
-```
-[
-  [
-    "30 west 26th street",
-    "30 west 26th",
-    "30 west",
-    "30",
-    "west 26th street",
-    "west 26th",
-    "west",
-    "26th street",
-    "26th"
-  ],
-  [
-    "new york",
-    "new",
-    "york"
-  ],
-  [ "nyc" ],
-  [ "10010" ]
-]
+### Classifier Types
+
+The library comes with three generic classifiers which can be extended in order to create a new `classifier`:
+
+- WordClassifier
+- PhraseClassifier
+- SectionClassifier
+
+### Classifiers
+
+The library comes bundled with a range of classifiers out of the box.
+
+You can find them in the `/classifier` directory; dictionary-based classifiers usually store their data in the `/resources` directory.
+
+Example of some of the included classifiers:
+
+```javascript
+// word classifiers
+HouseNumberClassifier
+PostcodeClassifier
+StreetPrefixClassifier
+StreetSuffixClassifier
+CompoundStreetClassifier
+DirectionalClassifier
+OrdinalClassifier
+StopWordClassifier
+
+// phrase classifiers
+IntersectionClassifier
+PersonClassifier
+GivenNameClassifier
+SurnameClassifier
+PersonalSuffixClassifier
+PersonalTitleClassifier
+ChainClassifier
+PlaceClassifier
+WhosOnFirstClassifier
 ```
 
-#### 5. run classifiers against all phrases and record potential classes per phrase
+## Solvers
+
+Solving is the final process, where `solutions` are generated based on all the classifications that have been made.
+
+Each parse can contain multiple `solutions`; each is provided with a `confidence` score, and they are displayed sorted from highest to lowest score.
+
+The core of this process is the `ExclusiveCartesianSolver` module.
+
+This `solver` generates all the possible permutations of the different classifications while taking care to:
+- ensure the same `span` position is not used more than once
+- ensure that the same `classification` is not used more than once
+
+After the `ExclusiveCartesianSolver` has run there are additional solvers which can:
+- filter the `solutions` to remove inconsistencies
+- add new `solutions` to provide additional functionality (such as intersections)
+
+### Solution Masks
+
+It is possible to produce a simple `mask` for any generated solution; this is useful for comparing the `solution` to the original text:
+
+```javascript
+VVV VVVV NN SSSSSSS AAAAAA PPPPP
+Foo Cafe 10 Main St London 10010 Earth
 ```
-'10010' -> postcode
-'west 26th street' -> street
-'26th street' -> street
-'street' -> street_postfix
+
+# Contributing
+
+Please fork and pull request against upstream master on a feature branch. Pretty please; provide unit tests.
+
+## Unit tests
+
+You can run the unit test suite using the command:
+
+```bash
+$ npm test
 ```
 
-#### 6. generate solutions
-Given the classifications for each phrase, compute an array of potential parses for the input, a confidence score can also be provided.
\ No newline at end of file
+### Continuous Integration
+
+Travis tests every release against all supported Node.js versions.
+
+[![Build Status](https://travis-ci.org/pelias/parser.png?branch=master)](https://travis-ci.org/pelias/parser)
+
+
+### Versioning
+
+We rely on semantic-release and Greenkeeper to maintain our module and dependency versions.
+
+[![Greenkeeper badge](https://badges.greenkeeper.io/pelias/parser.svg)](https://greenkeeper.io/)
\ No newline at end of file
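
The Tokenization section added by this diff describes sections, words and phrases as spans that remember their position in the input. The following is a minimal JavaScript sketch of that idea, not the library's implementation: `makeSpan`, `split` and `phrases` are hypothetical names, and the boundary characters and the phrase length limits (`min`, `max`) are assumptions chosen for illustration.

```javascript
// Hypothetical sketch: these helpers are NOT part of the library's API.

// A span is a piece of text plus the offsets it occupied in the original input.
const makeSpan = (body, start) => ({ body, start, end: start + body.length })

// Split a span on a boundary pattern, recording where each piece began.
function split (span, boundary) {
  const regex = new RegExp(boundary.source, 'g')
  const spans = []
  const push = (from, to) => {
    if (to > from) spans.push(makeSpan(span.body.slice(from, to), span.start + from))
  }
  let cursor = 0
  let match
  while ((match = regex.exec(span.body)) !== null) {
    if (match[0].length === 0) { regex.lastIndex++; continue } // guard against empty matches
    push(cursor, match.index)
    cursor = match.index + match[0].length
  }
  push(cursor, span.body.length)
  return spans
}

// Generate phrases from runs of adjacent words, longest first.
// The words must all come from one section, so phrases never cross a boundary.
function phrases (words, input, min = 1, max = 4) {
  const out = []
  for (let i = 0; i < words.length; i++) {
    for (let n = Math.min(max, words.length - i); n >= min; n--) {
      const first = words[i]
      const last = words[i + n - 1]
      out.push(makeSpan(input.slice(first.start, last.end), first.start))
    }
  }
  return out
}

const input = 'Main St, East Village'
const sections = split(makeSpan(input, 0), /[,\t\n"]/) // assumed boundary characters
const result = sections.map(section =>
  phrases(split(section, /\s+/), input).map(phrase => phrase.body)
)
console.log(result)
```

Because phrases are built per section, the sketch never produces `St East`, which is the 'bleeding' problem the Section Boundaries text describes.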
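
The Solvers section added by this diff states two constraints for the `ExclusiveCartesianSolver`: a span position may not be used twice and a classification may not be used twice. The sketch below illustrates only those two constraints and should not be read as the module's actual code. The candidate shape (`label`, `body`, `start`, `end`), the `solve` function and the scoring are all assumptions; in particular, the score here is simply the fraction of input characters covered, a stand-in for the `confidence` value whose real calculation the README does not describe.

```javascript
// Hypothetical sketch: data shapes, names and scoring are assumptions.

// Two candidates overlap when their character ranges intersect.
const overlaps = (a, b) => a.start < b.end && b.start < a.end

// A candidate may join a partial solution only if it reuses neither a
// character range nor a classification label that is already present.
const compatible = (solution, candidate) => solution.every(
  used => !overlaps(used, candidate) && used.label !== candidate.label
)

function solve (candidates, inputLength) {
  const solutions = []
  // Depth-first enumeration of every valid combination of candidates.
  const extend = (solution, from) => {
    if (solution.length > 0) solutions.push(solution)
    for (let i = from; i < candidates.length; i++) {
      if (compatible(solution, candidates[i])) {
        extend([...solution, candidates[i]], i + 1)
      }
    }
  }
  extend([], 0)

  // Assumed scoring: share of the input text covered by the solution.
  const covered = pairs => pairs.reduce((sum, c) => sum + (c.end - c.start), 0)
  return solutions
    .map(pairs => ({ pairs, score: covered(pairs) / inputLength }))
    .sort((a, b) => b.score - a.score)
}

const text = '30 w 26 st'
const candidates = [
  { label: 'housenumber', body: '30', start: 0, end: 2 },
  { label: 'housenumber', body: '26', start: 5, end: 7 },
  { label: 'street', body: 'w 26 st', start: 3, end: 10 },
  { label: 'street', body: '26 st', start: 5, end: 10 }
]
const ranked = solve(candidates, text.length)
console.log(ranked[0].pairs.map(c => `${c.label}: ${c.body}`))
```

In this example the top-ranked combination pairs the house number `30` with the street `w 26 st`. No solution contains both house numbers or both streets, which is what the two exclusivity rules require.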