
Compute an ngram field for all admin data #345

Status: Closed (wants to merge 1 commit)

Conversation

missinglink (Member) commented Feb 8, 2019

This PR adds one new field for each parent.* admin entry, named, for example, parent.locality_ngram.

It uses the copy_to directive to copy the admin name and admin abbreviation inputs for each record into the new ngram field.

The ngram field is tokenized using the existing peliasIndexOneEdgeGram analyzer, producing prefix ngrams that can be used for autocomplete.

The motivation is to improve, simply and efficiently, autocomplete queries that contain admin areas.
We currently only autocomplete on the name.default field, and the admin inputs must be typed out in full before they have any effect on the results.

I suspect the changes required for the queries (such as this) will be minimal.
Due to the way addressit works (splitting 'address parts' and 'admin parts') this should play well with the existing parsing logic.
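The mapping change described above might look roughly like the sketch below. The PR's actual diff isn't shown here, so field names and structure are illustrative; only peliasIndexOneEdgeGram is taken from the text (parent.locality_a is assumed to be the abbreviation input):

```json
{
  "mappings": {
    "properties": {
      "parent": {
        "properties": {
          "locality": {
            "type": "text",
            "copy_to": "parent.locality_ngram"
          },
          "locality_a": {
            "type": "text",
            "copy_to": "parent.locality_ngram"
          },
          "locality_ngram": {
            "type": "text",
            "analyzer": "peliasIndexOneEdgeGram"
          }
        }
      }
    }
  }
}
```

With this shape, both the name and abbreviation values are analyzed a second time into parent.locality_ngram, which autocomplete queries could then match with a plain match query on the partial input.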

missinglink (Member, Author) commented Feb 8, 2019

@orangejulius there will certainly be additional disk overhead for these extra fields, but given the repetitive nature of the data in them, I suspect it won't be very noticeable once converted to an inverted index.

I disabled fielddata and doc_values, which should also reduce the index size.

One nice, unexpected benefit of using copy_to is that we don't need to add these fields to _source.excludes: they are not stored, so they contain no _source data.

I'm not 100% sure about the performance impact, especially for common terms such as "new" or "west", so we'll need to do a global-build test to check that it doesn't kill performance.
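On the field mapping itself, the space-saving settings mentioned above would sit roughly as follows; which of these parameters a field type accepts varies by Elasticsearch version, so this is a sketch rather than the PR's actual mapping:

```json
{
  "locality_ngram": {
    "type": "text",
    "analyzer": "peliasIndexOneEdgeGram",
    "fielddata": false,
    "doc_values": false,
    "store": false
  }
}
```

The field then exists only as postings in the inverted index: it can be searched, but not aggregated on, sorted by, or returned.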

missinglink (Member, Author) commented Feb 8, 2019

I did a build of portland-metro and compared disk usage to master:

➜  portland-metro du -sh elasticsearch   # master
423M	elasticsearch

➜  portland-metro du -sh elasticsearch   # this branch
976M	elasticsearch

Looks like it more than doubles disk usage...

I'd still like to pursue this; I wonder whether we can reduce the disk requirement, or should just accept it for the value it offers.

missinglink closed this Feb 8, 2019
missinglink reopened this Feb 8, 2019
orangejulius (Member) commented Feb 8, 2019

Ouch, yeah that disk space usage would be a problem.

What about using the fields feature? I think it's even better suited than copy_to to storing multiple analysis variants of a single field, and it might reduce disk usage.

On the other hand, we might just have to accept that storing partial admin tokens requires more disk.
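For comparison, a fields (multi-field) variant of the same idea might look like this sketch; the sub-field name "ngram" is illustrative, and only the analyzer name comes from the thread:

```json
{
  "parent": {
    "properties": {
      "locality": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "peliasIndexOneEdgeGram"
          }
        }
      }
    }
  }
}
```

Here each value is indexed once per sub-field from the same input, and queries would address the variant as parent.locality.ngram, with no separate copy_to target needed.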

missinglink (Member, Author) commented

@orangejulius I rewrote this PR using the fields feature:

➜  portland-metro du -sh elasticsearch
459M	elasticsearch

I'm not sure why this uses so much less disk space, but it's great; I will open a separate PR.

missinglink (Member, Author) commented Feb 19, 2019

Closing; this PR has been superseded by #347.

orangejulius (Member) commented

That's awesome that it saves a bunch of space and makes this feature possible. Perhaps we should use it to replace phrase as well.
