Supports for hyphen as alternative spans #56
Merged
Background
Sometimes parsing fails because words are not split well. A hyphen's main purpose is to glue words together. That means that when a hyphen is used, we can treat it like a simple space in order to get two separate words.
Unfortunately, treating hyphens purely as spaces cannot be the final solution, because the hyphen is also meaningful in some other cases.
That's why I suggest taking advantage of our graphs and adding alternative ways to complete a phrase without hyphens.
How does it work?
When we split all sections, we first compute the spans on spaces only (as before) and then run a second pass on hyphens.
Example for 10 Boulevard Saint-Germain Paris: when we split this section, we get 10, Boulevard, Saint-Germain, Paris.
Here is the graph: [graph image]
With the hyphen step, we will have 10, Boulevard, Saint-Germain, Paris, Saint, Germain.
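The two-pass split described above can be sketched roughly as follows. This is only an illustration in Python, not the code from this PR: the function name `split_spans` is hypothetical, and the real implementation works on span objects inside a graph rather than on plain strings.

```python
def split_spans(section: str) -> list[str]:
    """Return the space-split spans, followed by hyphen-split alternatives."""
    # First pass: split on whitespace only (the behaviour before this change).
    space_spans = section.split()

    # Second pass: for every span that contains a hyphen, also keep its
    # hyphen-separated parts as alternative spans.
    hyphen_spans = []
    for span in space_spans:
        if "-" in span:
            hyphen_spans.extend(part for part in span.split("-") if part)

    return space_spans + hyphen_spans


print(split_spans("10 Boulevard Saint-Germain Paris"))
# ['10', 'Boulevard', 'Saint-Germain', 'Paris', 'Saint', 'Germain']
```

The original hyphenated span is kept, so nothing that parsed before is lost; the hyphen parts are only added as extra candidates.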
Thanks to this, we will be able to parse phrases such as:
10 Boulevard Saint-Germain Paris: which is housenumber + street (first solution without this PR 👎)
10 Boulevard Saint-Germains Paris: which is housenumber + street + locality (first solution with this PR 👍)
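To make the idea of alternative ways to complete a phrase more concrete, here is a small sketch that lists the word sequences a graph with hyphen alternatives would offer for one phrase. It is an assumption-laden illustration in Python: `alternative_readings` is a hypothetical helper, and it says nothing about how the parser classifies or ranks each reading.

```python
from itertools import product


def alternative_readings(section: str) -> list[list[str]]:
    """List every word sequence obtained by keeping or splitting hyphenated spans."""
    options = []
    for span in section.split():
        parts = [p for p in span.split("-") if p]
        # Each span can always be read as-is...
        choices = [[span]]
        # ...and, if it is hyphenated, also as its separate parts.
        if len(parts) > 1:
            choices.append(parts)
        options.append(choices)

    # Pick one choice per span and flatten it into a single word sequence.
    return [
        [word for choice in combo for word in choice]
        for combo in product(*options)
    ]


for reading in alternative_readings("10 Boulevard Saint-Germain Paris"):
    print(reading)
# ['10', 'Boulevard', 'Saint-Germain', 'Paris']
# ['10', 'Boulevard', 'Saint', 'Germain', 'Paris']
```

Both readings stay available, so a solver can still choose the hyphenated form when it gives the better result.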