support for comma delimited housenumber + street #29

Closed
missinglink opened this issue May 27, 2019 · 2 comments · Fixed by #30

Comments

missinglink commented May 27, 2019

I've seen a few cases internationally where users insert a comma between every component of the address. I'm not sure whether this is done manually or is a side effect of joining cells in a spreadsheet.

This is actually great for most tokens because it helps us to avoid parsing ambiguities.
The issue arises when a comma is placed between the housenumber and the street.

So the parser will fail for an address such as:

1, Foo St, Foo, Bar, 411027

but pass for one where the first comma is not present:

1 Foo St, Foo, Bar, 411027

The code responsible for this is the TokenDistanceFilter, which should be modified to ignore section boundaries when considering adjacency.
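To illustrate the failure mode, here is a minimal Python sketch, assuming that commas split the input into sections and that each token's prev link is only set within its own section. The `Token` class and `tokenize` function are hypothetical stand-ins for illustration, not the parser's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    section: int  # index of the comma-delimited section the token belongs to
    prev: Optional["Token"] = None  # previous token in the SAME section only


def tokenize(address: str) -> List[Token]:
    """Split on commas into sections, then on whitespace into tokens."""
    tokens: List[Token] = []
    for section_index, section in enumerate(address.split(",")):
        previous = None  # reset at every section boundary
        for word in section.split():
            token = Token(word, section_index, previous)
            tokens.append(token)
            previous = token
    return tokens


with_comma = tokenize("1, Foo St, Foo, Bar, 411027")
without_comma = tokenize("1 Foo St, Foo, Bar, 411027")

# With the comma, "Foo" starts a new section, so it has no prev link
# and an adjacency check against the housenumber cannot succeed.
print(with_comma[1].text, with_comma[1].prev)

# Without the comma, "Foo" is linked back to the housenumber "1".
print(without_comma[1].text, without_comma[1].prev.text)
```

Under these assumptions, any filter that relies on the section-local prev link would never see "1" and "Foo" as neighbours in the first address.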

missinglink commented May 27, 2019

Off the top of my head there are two ways to accomplish this:

  • the prev and next graph nodes only apply within the same section, so we could consider changing this behaviour, or add a new graph relationship which links spans across sections (we would need to consider the impact of this and the potential errors that might be caused by having an API like this)
  • record a 'token position', so that each token is assigned a number starting from 0 and incrementing by one per token as we read from left to right. It would then be possible to write a query which finds a token by position (although this type of query is not currently trivial to write, as it would require iterating over all sections to locate a span in that way).
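The second option could look roughly like the following Python sketch. It assumes comma-delimited sections and whitespace-delimited tokens; `PositionedToken`, `assign_positions`, `find_by_position` and `are_adjacent` are hypothetical names used for illustration only.

```python
from typing import List, NamedTuple, Optional


class PositionedToken(NamedTuple):
    text: str
    section: int   # index of the comma-delimited section
    position: int  # global index, counted left to right across all sections


def assign_positions(address: str) -> List[PositionedToken]:
    """Number every token from 0, ignoring section boundaries."""
    tokens: List[PositionedToken] = []
    position = 0
    for section_index, section in enumerate(address.split(",")):
        for word in section.split():
            tokens.append(PositionedToken(word, section_index, position))
            position += 1
    return tokens


def find_by_position(tokens: List[PositionedToken], position: int) -> Optional[PositionedToken]:
    """Linear scan; mirrors the need to iterate over all sections."""
    for token in tokens:
        if token.position == position:
            return token
    return None


def are_adjacent(a: PositionedToken, b: PositionedToken) -> bool:
    """Adjacency by global position, regardless of section."""
    return abs(a.position - b.position) == 1


tokens = assign_positions("1, Foo St, Foo, Bar, 411027")
housenumber, street = tokens[0], tokens[1]
print(housenumber.section != street.section)  # different sections
print(are_adjacent(housenumber, street))      # yet adjacent by position
```

The trade-off in this sketch is that adjacency becomes a simple integer comparison, while looking a token up by position still requires walking every section.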

@missinglink

One other approach would be to check for a prev relationship and, if that doesn't exist, check whether there is a previous span; if so, use the child:last node from that span.
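A rough Python sketch of that fallback is shown below, assuming each section is a span holding an ordered list of child tokens. The `Span` class, `build_sections` and `previous_token` are hypothetical names for illustration and do not reflect the parser's real graph API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    text: str
    prev: Optional["Span"] = None  # section-local previous sibling
    children: List["Span"] = field(default_factory=list)


def build_sections(address: str) -> List[Span]:
    """One span per comma-delimited section, with one child per word."""
    sections: List[Span] = []
    for part in address.split(","):
        section = Span(part.strip())
        previous = None
        for word in part.split():
            child = Span(word, prev=previous)
            section.children.append(child)
            previous = child
        sections.append(section)
    return sections


def previous_token(sections: List[Span], section_index: int, child_index: int) -> Optional[Span]:
    """Prefer the prev link; otherwise use the last child of the previous section."""
    token = sections[section_index].children[child_index]
    if token.prev is not None:
        return token.prev
    if section_index > 0 and sections[section_index - 1].children:
        return sections[section_index - 1].children[-1]
    return None


sections = build_sections("1, Foo St, Foo, Bar, 411027")
print(previous_token(sections, 1, 0).text)  # "Foo" falls back to the housenumber
print(previous_token(sections, 1, 1).text)  # "St" uses its ordinary prev link
```

This sketch only looks at the immediately preceding section, so an empty section (for example from two consecutive commas) would end the search; whether to skip over empty sections is left open here.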
