Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supports for hyphen as alternative spans #56

Merged
merged 4 commits into from
Sep 15, 2019
Merged

Conversation

Joxit
Copy link
Member

@Joxit Joxit commented Aug 27, 2019

Background

Sometime, parsing fails when words are not well split. Hyphens' main purpose is to glue words together. That meas, when an hyphen is used, we can process it like a simple space in order to have two separate words.

Only processing hyphens like spaces can unfortunately not be the final solution because the hyphen is also useful in some other cases.

That's why I suggest to take advantage of our graphs and add some alternative ways to complete a phrase without hyphens.

How it works ?

When we split all sections, we do a first compute on spaces only (like before) and then a second compute on hyphen.

Example for 10 Boulevard Saint-Germain Paris, when we split this section, we get this: 10, Boulevard, Saint-Germain, Paris, here is the graph:

step1

With the hyphen step, we will have 10, Boulevard, Saint-Germain, Paris, Saint, Germain

step2

Thanks to this, we will be able to parse phrases such as :

  • 10 Boulevard Saint-Germain Paris: which is housenumber + street (first solution without this PR 👎)
  • 10 Boulevard Saint-Germains Paris: which is housenumber + street + locality (first solution with this PR 👍)

Copy link
Member

@missinglink missinglink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow very cool.

I'm trying to wrap my head around all the code since it's quite complex, but I think it looks good.
I've added a couple of minor comments but looks good 👍

tokenization/permutate.js Outdated Show resolved Hide resolved
tokenization/permutate.test.js Outdated Show resolved Hide resolved
@Joxit Joxit merged commit b643e0b into master Sep 15, 2019
@Joxit Joxit deleted the joxit/alternative-spans branch September 15, 2019 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants