autocomplete milestone #127

Merged
merged 21 commits into from
Apr 22, 2016

@missinglink missinglink commented Apr 22, 2016

This PR refactors the analyzers used by the /v1/autocomplete endpoint, with the goals of:

  • removing all interdependencies with the /v1/search endpoint, which makes subsequent refactoring easier.
  • providing a more robust method of handling synonym substitution:
    • by considering the differences between 'index time' analysis and 'query time' analysis.
    • by handling 'partial tokens' (partially complete words) and 'full tokens' differently.

Currently we use 3 different analyzers in the /v1/autocomplete endpoint:

| analyzer | "trade center" |
| --- | --- |
| peliasOneEdgeGram | "t", "tr", "tra", "trad", "trade", "c", "ce", "cen", "cent", "cente", "center" |
| peliasTwoEdgeGram | "tr", "tra", "trad", "trade", "ce", "cen", "cent", "cente", "center" |
| peliasPhrase | "trade", "ctr" |

The peliasPhrase analyzer was originally intended for use with /v1/search. As shown above, it handles synonyms differently from the other two analyzers: it contracts the word "center" to "ctr", a token that neither edge gram analyzer produces. This mismatch is the cause of pelias/pelias#211.
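To make the mismatch concrete, here is a minimal Python sketch. It is not the Elasticsearch analyzer configuration; `edge_grams` is a hypothetical helper that only mimics the edge gram output listed in the table above, and the peliasPhrase tokens are copied from that table rather than computed.

```python
def edge_grams(text, min_gram):
    """Return the prefixes of each whitespace-separated word, shortest first."""
    grams = []
    for word in text.lower().split():
        for size in range(min_gram, len(word) + 1):
            grams.append(word[:size])
    return grams


# Approximations of peliasOneEdgeGram and peliasTwoEdgeGram for one input.
one_edge_gram = edge_grams("trade center", min_gram=1)
two_edge_gram = edge_grams("trade center", min_gram=2)

# Token stream of peliasPhrase for the same input, copied from the table.
phrase_tokens = ["trade", "ctr"]

# Tokens from peliasPhrase that the edge gram token stream does not contain.
unmatched = [token for token in phrase_tokens if token not in one_edge_gram]
```

In this sketch `unmatched` holds only "ctr": the contracted form never appears among the edge grams of "center", so the two token streams cannot line up.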

new analyzers:

The new analyzers proposed in this PR are:

| analyzer | tokenizer | partial safe? | "center" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "c", "ce", "cen", "cent", "cente", "center" |
| peliasIndexTwoEdgeGram | 2gram | × | "ce", "cen", "cent", "cente", "center" |
| peliasQueryPartialToken | word | | "center" |
| peliasQueryFullToken | keyword | × | "center" |

They produce the same tokens when given the abbreviated/contracted form "ctr":

| analyzer | tokenizer | partial safe? | "ctr" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "c", "ce", "cen", "cent", "cente", "center" |
| peliasIndexTwoEdgeGram | 2gram | × | "ce", "cen", "cent", "cente", "center" |
| peliasQueryPartialToken | word | | "center" |
| peliasQueryFullToken | keyword | × | "center" |
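The idea behind these two tables can be sketched in Python as follows. This is an illustration only, under the assumption that synonyms are normalised to their long form before edge grams are generated; the `SYNONYMS` map and both function names are hypothetical and stand in for the real analyzer configuration.

```python
# Hypothetical synonym map; the real analyzers use their own synonym lists.
SYNONYMS = {"ctr": "center"}


def index_tokens(text, min_gram):
    """Index-time sketch: normalise synonyms, then emit edge grams."""
    grams = []
    for word in text.lower().split():
        word = SYNONYMS.get(word, word)
        grams.extend(word[:size] for size in range(min_gram, len(word) + 1))
    return grams


def query_full_token(word):
    """Query-time sketch for one completed word: normalise it, emit no grams."""
    word = word.lower()
    return SYNONYMS.get(word, word)


# Both spellings are indexed identically, so either can be found by the other.
same_at_index_time = index_tokens("ctr", 1) == index_tokens("center", 1)
query_matches = query_full_token("ctr") in index_tokens("center", 2)
```

Because both sides normalise to "center", a completed query word matches the indexed grams regardless of which spelling the document or the user chose.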

directionals:

They also handle directional synonyms in a similar way:

| analyzer | tokenizer | partial safe? | "north" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "n", "no", "nor", "nort", "north" |
| peliasIndexTwoEdgeGram | 2gram | × | "no", "nor", "nort", "north", "n" |
| peliasQueryPartialToken | word | | "north" |
| peliasQueryFullToken | keyword | × | "north" |

Again, they produce the same tokens when given the abbreviated/contracted form "n", except for peliasQueryPartialToken, which leaves the partial token "n" unexpanded:

| analyzer | tokenizer | partial safe? | "n" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "n", "no", "nor", "nort", "north" |
| peliasIndexTwoEdgeGram | 2gram | × | "no", "nor", "nort", "north", "n" |
| peliasQueryPartialToken | word | | "n" |
| peliasQueryFullToken | keyword | × | "north" |

note: there is a bit of a 'hack' in place for the above peliasIndexTwoEdgeGram analysis that is specific to directionals: it adds the single gram 'n' to a token stream which usually only contains grams of size 2+. This improves address matching and reduces 'jitter'.
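A rough Python sketch of that 'hack', assuming it amounts to appending the one-letter abbreviation after the usual 2+ grams. The `DIRECTIONALS` map and the function name are hypothetical; the real behaviour lives in the analyzer's filter chain.

```python
# Hypothetical directional map; only "n" is taken from the tables above.
DIRECTIONALS = {"n": "north"}
ABBREVIATIONS = {full: short for short, full in DIRECTIONALS.items()}


def two_edge_grams(word):
    """Sketch of peliasIndexTwoEdgeGram for a single word."""
    word = word.lower()
    full = DIRECTIONALS.get(word, word)
    grams = [full[:size] for size in range(2, len(full) + 1)]
    # The 'hack': directionals also keep their single-letter gram.
    if full in ABBREVIATIONS:
        grams.append(ABBREVIATIONS[full])
    return grams


north_grams = two_edge_grams("north")
center_grams = two_edge_grams("center")

# A lone query token "n" can match the 2gram field only because of the hack.
single_letter_matches = "n" in north_grams
```

Without the extra gram, a partial query token "n" would have nothing to match in a field whose shortest gram is two characters long.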

api/query changes:

All usages of existing analyzers in /v1/autocomplete must be updated:

  • peliasOneEdgeGram -> peliasQueryPartialToken
  • peliasPhrase -> peliasQueryFullToken

Additionally, the autocomplete queries should no longer need to use the phrase.* index; all queries can safely be performed against the name.* index (if they are not already).
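As a rough illustration of what an updated query could look like, the Python sketch below builds an Elasticsearch `bool` query out of `match` clauses that set the `analyzer` parameter. Several things here are assumptions and not taken from this PR: the field name `name.default`, the function name, and the rule that every word except the last is treated as a full token while the last word is treated as a partial token.

```python
def build_autocomplete_query(text, field="name.default"):
    """Build a query body: full-token clauses plus one partial-token clause."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")

    # Assumption: only the final word may still be incomplete.
    *full_words, partial_word = words

    def clause(word, analyzer):
        return {"match": {field: {"query": word, "analyzer": analyzer}}}

    must = [clause(word, "peliasQueryFullToken") for word in full_words]
    must.append(clause(partial_word, "peliasQueryPartialToken"))
    return {"bool": {"must": must}}


query = build_autocomplete_query("trade cen")
```

Both clauses target the same field, which reflects the point above that the phrase.* index is no longer needed for autocomplete.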

note: we can discuss removing the phrase.* index completely! This would greatly reduce the cluster disk/RAM usage, and it might be possible to achieve all the functionality of /v1/search using the prefixGram index. Let's discuss this in another issue.

dataset importer changes:

nil

risks / expected acceptance test changes:

There is not much that can go wrong here. The only differences at index time are that:

  • peliasIndexOneEdgeGram expands directionals whereas peliasOneEdgeGram does not.
  • peliasIndexTwoEdgeGram is the same and includes the 'hack' mentioned above.

The differences at query time are:

  • issue 211 is resolved
  • expect to see better handling of queries containing a single directional gram such as 'w 26 st'.

I've left some other changes I would like to make for a future PR, to reduce the number of changes going in at the same time.

related:

closes #96 (contained in this branch)
closes #109 (contained)
closes #113 (contained)

closes #105
resolves pelias/pelias#211
related pelias/openaddresses#68

@orangejulius

I copied and pasted the PR notes from #109 to here since we're going to link directly to this PR in the release notes!

@orangejulius orangejulius deleted the missinglink branch May 24, 2016 15:23