
Compute an ngram field for all admin data #345

Status: Closed (wants to merge 1 commit)

Conversation

missinglink (Member) commented Feb 8, 2019

This PR adds one new field for each parent.* admin entry, named, for example, parent.locality_ngram.

It uses the copy_to directive to copy the admin name and admin abbreviation inputs for each record into the new ngram field.

The ngram field is tokenized using the existing peliasIndexOneEdgeGram analyzer, producing prefix ngrams that can be used for autocomplete.

The motivation is to improve, simply and efficiently, autocomplete queries that contain admin areas.
We currently only autocomplete on the name.default field, and the admin inputs must be typed out in full before they have any effect on the results.

I suspect the changes required for the queries (such as this) will be minimal.
Due to the way addressit works (splitting 'address parts' and 'admin parts') this should play well with the existing parsing logic.
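The mapping change described above might look roughly like the sketch below. The PR's actual diff isn't shown here, so field names and structure are illustrative; only peliasIndexOneEdgeGram is taken from the text (parent.locality_a is assumed to be the abbreviation input):

```json
{
  "mappings": {
    "properties": {
      "parent": {
        "properties": {
          "locality": {
            "type": "text",
            "copy_to": "parent.locality_ngram"
          },
          "locality_a": {
            "type": "text",
            "copy_to": "parent.locality_ngram"
          },
          "locality_ngram": {
            "type": "text",
            "analyzer": "peliasIndexOneEdgeGram"
          }
        }
      }
    }
  }
}
```

With this shape, both the name and abbreviation values are analyzed a second time into parent.locality_ngram, which autocomplete queries could then match with a plain match query on the partial input.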

missinglink (Member, Author) commented Feb 8, 2019

@orangejulius there will certainly be additional disk overhead for these extra fields, but given the repetitive nature of the data in them, I suspect it won't be very noticeable once converted to an inverted index.

I disabled fielddata and doc_values, which should also reduce the index size.

One nice, unexpected benefit of using copy_to is that we don't need to add these fields to _source.excludes: they are not stored, so they contain no _source data.

I'm not 100% sure about the performance impact, especially for common terms such as "new" or "west", so we'll need to do a global-build test to check that it doesn't kill performance.
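On the field mapping itself, the space-saving settings mentioned above would sit roughly as follows; which of these parameters a field type accepts varies by Elasticsearch version, so this is a sketch rather than the PR's actual mapping:

```json
{
  "locality_ngram": {
    "type": "text",
    "analyzer": "peliasIndexOneEdgeGram",
    "fielddata": false,
    "doc_values": false,
    "store": false
  }
}
```

The field then exists only as postings in the inverted index: it can be searched, but not aggregated on, sorted by, or returned.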

missinglink (Member, Author) commented Feb 8, 2019

I did a build of portland-metro and compared disk usage to master:

➜  portland-metro du -sh elasticsearch   # master
423M	elasticsearch

➜  portland-metro du -sh elasticsearch   # this branch
976M	elasticsearch

Looks like it more than doubles disk usage...

I'd still like to pursue this; I wonder whether we can reduce the disk requirement, or should just accept it for the value it offers.

missinglink closed this Feb 8, 2019
missinglink reopened this Feb 8, 2019
orangejulius (Member) commented Feb 8, 2019

Ouch, yeah that disk space usage would be a problem.

What about using the fields feature? I think it's even better suited than copy_to to storing multiple analysis variants of a single field, and it might reduce disk usage.

On the other hand, we might just have to accept that storing partial admin tokens requires more disk.
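For comparison, a fields (multi-field) variant of the same idea might look like this sketch; the sub-field name "ngram" is illustrative, and only the analyzer name comes from the thread:

```json
{
  "parent": {
    "properties": {
      "locality": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "peliasIndexOneEdgeGram"
          }
        }
      }
    }
  }
}
```

Here each value is indexed once per sub-field from the same input, and queries would address the variant as parent.locality.ngram, with no separate copy_to target needed.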

missinglink (Member, Author) commented

@orangejulius I rewrote this PR using the fields feature:

➜  portland-metro du -sh elasticsearch
459M	elasticsearch

I'm not sure why this uses so much less disk space, but it's great; I will open a separate PR.

missinglink (Member, Author) commented Feb 19, 2019

Closing; this PR has been superseded by #347.

orangejulius (Member) commented

That's awesome that it saves a bunch of space and makes this feature possible. Perhaps we should use it to replace phrase as well.
