Use Elasticsearch fields feature #285
Here's a full writeup of the changes that are likely required:

Elasticsearch Multi-Field Refactor

Background

To facilitate both search and autocomplete, we break down all our text data into two separate full text indices. They are called the name and phrase fields.

We do this by actually having two separate fields in our Elasticsearch schema. From Elasticsearch's perspective, there is no relation between these two fields. Only our application code in the Pelias API understands the relationship between them.

However, although we didn't know it at the time, Elasticsearch has a feature that allows one input text to be analyzed in multiple ways. Using this feature would be a pure refactoring from a functionality standpoint (all queries should return exactly the same results as before), but it should lead to increased readability in our code, probably some minor disk usage savings, and possibly performance increases in Elasticsearch.

Pelias Components Involved

Pelias schema library

This is where our code for managing Elasticsearch's schema lives. It's a set of Node.js scripts with lots of unit and integration tests for the behavior of the Elasticsearch schema. Adopting the fields feature will be done here.

Pelias API

This is where all of our core logic for the geocoder lives. It's a fairly large, completely stateless Node.js Express app. Most of the significant logic changes will live here, or in a subset of the code that we extracted into the pelias-query module. At least four different query types will have to be updated (two for search, one for structured search, and one for autocomplete). Fortunately most of our query code is nicely parameterized: we have a templating system that allows us to extract much of the logic into configuration variables. It might be the case that a prototype is as simple as changing two lines that control the names of the schema fields used.

Pelias Model

This is a library included in all of our different importers that allows us to easily create new records in a format that lines up nicely with our Elasticsearch schema. The changes here should be limited to essentially removing all references to the phrase field in the main Document definition, since all the existing code does is copy one text string into two places, and that's exactly what this change will remove.

Acceptance tests (to verify functionality)

No code changes will be required here, but we have great acceptance-tests that verify pretty much all Pelias functionality. We have a new and growing set of tests for a small city (Portland) and our most tried and true global acceptance-tests, which are essentially the ultimate authority on whether any change makes it into Pelias.

The global acceptance-tests require a full planet build, which is a high barrier for most Pelias contributors. It's especially painful for testing schema changes like this one (which require a rebuild, and can take a day or two). To bridge the gap we are working on building out a collection of test suites for areas of different sizes, so code can be tested on progressively larger builds.

How to get started

Start with pelias/schema

You should be able to clone the pelias/schema repository and follow the usage guide to re-create the Pelias schema on Elasticsearch with behavior no different than when using the Docker image. Once that's verified (perhaps by doing an import again), you can move on to changing the schema.

Because we have multiple name and phrase sub-fields for different purposes (alternative names, variations in formatting, different languages), we use the Elasticsearch dynamic template feature to configure them all at once. I think the change to make will be to remove the separate phrase templates and instead define the phrase analysis as sub-fields of the name fields.
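As a rough illustration only (the template and analyzer names here are guesses modeled on the existing schema, not the exact definitions in document.js), a name dynamic template could gain a phrase sub-field via fields:

```json
{
  "dynamic_templates": [
    {
      "nameGram": {
        "path_match": "name.*",
        "match_mapping_type": "string",
        "mapping": {
          "type": "text",
          "analyzer": "peliasIndexOneEdgeGram",
          "fields": {
            "phrase": {
              "type": "text",
              "analyzer": "peliasPhrase"
            }
          }
        }
      }
    }
  ]
}
```

With a mapping along these lines, queries would reference something like name.default.phrase instead of a standalone phrase.default field, and importers would no longer need to write the phrase field at all.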
Importers

It would be worth re-running an importer without any changes, because it might just work well enough to test. I'd suggest running the OpenAddresses importer, via Docker if possible. If not, you'll have to run it yourself like pelias/schema. The instructions in the readme, especially the example configuration of pelias.json, will be helpful (a rough sketch of one is included at the end of this writeup).

As mentioned earlier, much logic common to all importers is stored in the pelias/model library. If changes are required there (such as removing any references to the phrase field), you'll have to use npm link to "point" your OpenAddresses importer at a local copy of the model library with any required changes.

pelias/api

Once data is successfully imported, try starting the API and running some simple address queries. http://localhost:3100/v1/search?text=777 NE MLK Blvd, Portland, OR should work with OpenAddresses imported.

pelias/query changes, if required

If changes to the pelias/query module turn out to be required, you can "point" the API at a local copy of it with npm link, just like with pelias/model.

acceptance-tests

We have a suite of several hundred acceptance tests for Portland that can be run once all data is re-imported with the schema changes, to validate that it was indeed a pure refactor. The acceptance tests can be run from pelias/docker.
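For reference, a hypothetical minimal pelias.json for an OpenAddresses-only import might look roughly like the sketch below. The host, data path, and file name are placeholders, and the exact keys should be checked against the importer readme:

```json
{
  "esclient": {
    "hosts": [{ "host": "localhost", "port": 9200 }]
  },
  "imports": {
    "openaddresses": {
      "datapath": "/data/openaddresses",
      "files": ["us/or/portland_metro.csv"]
    }
  }
}
```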
Connects pelias/schema#285
Currently, we take the "name" field (and all its language variants) and run it through Elasticsearch twice: once with the "name" analyzers, and once in the "phrase.*" field with the phrase analyzers. This is done so that we can have special analysis to handle all the different use cases of autocomplete and regular search.
It works, but is a bit of a pain to manage.
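Schematically, the current layout looks something like the sketch below. Field and analyzer names are simplified stand-ins for the real schema, which has many more sub-fields; the point is that name and phrase are two unrelated top-level fields holding the same text:

```json
{
  "properties": {
    "name": {
      "properties": {
        "default": { "type": "text", "analyzer": "peliasIndexOneEdgeGram" }
      }
    },
    "phrase": {
      "properties": {
        "default": { "type": "text", "analyzer": "peliasPhrase" }
      }
    }
  }
}
```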
Even worse, in order to save space and improve the performance of our indices, we exclude the phrase.* fields from the _source object (see schema/mappings/document.js, line 195 at commit 2a2d691).

This causes all sorts of things to break: any code that re-creates documents has to rebuild the phrase.* fields again from the contents of the name.* fields. Forgetting to do this will usually prevent the document from showing up in forward geocoding queries.

It turns out Elasticsearch has built-in functionality to support exactly this use case: fields.
This allows one field to be indexed in multiple ways, without the confusion of multiple fields that do not have any inherent relationship in Elasticsearch. While there would have to be some cosmetic changes to all our queries, it looks like the change is not a big deal overall.
Here's an example of how fields can work in an Elasticsearch index mapping, from the Elasticsearch docs:
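(A minimal sketch; the actual example in the Elasticsearch multi-fields documentation may differ slightly in detail.)

```json
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
```

Queries can then reference either city (full-text analysis) or city.raw (exact keyword matching), even though only one value is indexed per document. That is exactly the relationship we want between name and phrase.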
This would be a great first issue for someone new to Pelias, but with some Elasticsearch experience, and we'd be glad to help you get started.
Handling breaking changes
Fixing this will likely require a breaking change to our schema design. When those are required, we try to make a change to the Pelias API codebase well in advance that can handle either the old or new schema. This helps avoid problems for people who are transitioning from older to newer builds.
While it might not be possible or practical to do so in this case, we will want to consider how to make the transition across this breaking change as smooth as possible.