Supports for hyphen as alternative spans #56

Joxit · 2019-08-27T12:39:28Z

Background

Sometimes parsing fails when words are not split well. A hyphen's main purpose is to glue words together. That means that when a hyphen is used, we can treat it like a simple space in order to get two separate words.

Unfortunately, treating hyphens purely as spaces cannot be the final solution, because the hyphen is also meaningful in some other cases.

That's why I suggest taking advantage of our graphs and adding alternative paths that complete a phrase without hyphens.

How does it work?

When we split each section, we first compute spans on spaces only (as before) and then run a second pass that computes spans on hyphens.

For example, splitting the section 10 Boulevard Saint-Germain Paris on spaces gives: 10, Boulevard, Saint-Germain, Paris.

[Figure: graph of the spans 10, Boulevard, Saint-Germain, Paris]

With the hyphen step, we additionally get the spans Saint and Germain: 10, Boulevard, Saint-Germain, Paris, Saint, Germain.
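
The two passes described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `splitWithHyphenAlternatives` is a hypothetical helper name, spans are plain strings, and the real code in tokenization/permutate.js works on span objects and may differ.

```javascript
// Hypothetical helper for illustration only; not the parser's actual API.
function splitWithHyphenAlternatives(section) {
  // First pass: split on whitespace only (the behaviour before this PR).
  const primary = section.split(/\s+/).filter(Boolean);

  // Second pass: for each token that contains a hyphen,
  // emit its hyphen-separated parts as alternative spans.
  const alternatives = primary
    .filter((token) => token.includes('-'))
    .flatMap((token) => token.split('-').filter(Boolean));

  // Original spans first, alternative spans appended after them.
  return [...primary, ...alternatives];
}

console.log(splitWithHyphenAlternatives('10 Boulevard Saint-Germain Paris'));
// [ '10', 'Boulevard', 'Saint-Germain', 'Paris', 'Saint', 'Germain' ]
```

Keeping the hyphenated span alongside its parts, rather than replacing it, is what lets the hyphen stay meaningful in the cases where it matters.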

Thanks to this, we will be able to parse phrases such as:

10 Boulevard Saint-Germain Paris: parsed as housenumber + street (first solution without this PR 👎)
10 Boulevard Saint-Germain Paris: parsed as housenumber + street + locality (first solution with this PR 👍)
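
One way to picture the alternative paths the hyphen step adds to the graph is to enumerate every reading of the phrase. The sketch below is an assumption-laden illustration: `buildGraph` and `enumeratePaths` are hypothetical names, the graph is modelled as a plain array of alternative readings per position, and the parser's own span graph is structured differently.

```javascript
// Hypothetical data model for illustration; the parser's real span graph differs.
// Each position in the phrase holds one or more readings, each a list of words.
function buildGraph(section) {
  const tokens = section.split(/\s+/).filter(Boolean);
  return tokens.map((token) => {
    const readings = [[token]]; // the token as written
    const parts = token.split('-').filter(Boolean);
    if (parts.length > 1) readings.push(parts); // the hyphen-split alternative
    return readings;
  });
}

// Enumerate every path through the graph by choosing one reading per position.
function enumeratePaths(graph) {
  return graph.reduce(
    (paths, readings) =>
      paths.flatMap((path) => readings.map((reading) => [...path, ...reading])),
    [[]]
  );
}

const paths = enumeratePaths(buildGraph('10 Boulevard Saint-Germain Paris'));
paths.forEach((path) => console.log(path.join(' -> ')));
// 10 -> Boulevard -> Saint-Germain -> Paris
// 10 -> Boulevard -> Saint -> Germain -> Paris
```

Both readings stay available, so later classification steps can pick whichever path yields the better solution.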

missinglink

Wow very cool.

I'm trying to wrap my head around all the code since it's quite complex, but I think it looks good.
I've added a couple of minor comments but looks good 👍

tokenization/permutate.js

tokenization/permutate.test.js

Joxit added 2 commits July 8, 2019 15:27

feat: Supports for hyphen as alternative spans

42bc72d

test: Add tests for hyphen as alternative spans

f82c944

missinglink approved these changes Sep 5, 2019

fix: typo

403e228

NickStallman mentioned this pull request Sep 5, 2019

Stop word issue pelias/pelias#822

Open

feat: Add static function connectSiblings in Span

519d117

Joxit merged commit b643e0b into master Sep 15, 2019

Joxit deleted the joxit/alternative-spans branch September 15, 2019 12:21

Joxit mentioned this pull request Apr 17, 2020

Update pelias-parser pelias/api#1420

Merged

Joxit mentioned this pull request May 1, 2020

Add support for unit type numbered #87

Merged
