feat(query): Use full form for match_phrase query #100

orangejulius · 2019-06-18T20:24:02Z

There are two forms for the Elasticsearch match_phrase query:

A shorter, condensed version which does not allow setting extra options:

{
  "match_phrase": {
    "the-field": "the search terms"
  }
}

A complete version that does allow setting options:

{
  "match_phrase": {
    "the-field": {
      "query": "the search terms",
      "some-option": "some-value"
    }
  }
}

Pelias currently uses a mix of the two formats. This change removes all usage of the condensed format in favor of the full format.

As a pure refactoring, with no functional changes, this should be safe to merge immediately. However, it will make it easier to add new options to phrase queries (such as slop) in the future.

There are two forms for the Elasticsearch [match_phrase](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) query: 1. A shorter, condensed version which does not allow setting extra options: ``` { "match_phrase": { "the-field": "the search terms" } } ``` 2. A complete version that does allow setting options: ``` { "match_phrase": { "the-field": { "query": "the search terms", "some-option": "some-value" } } } ``` Pelias currently uses a mix of the two formats. This change removes all usage of the condensed format in favor of the full format. As a pure refactoring, with no functional changes, this should be safe to merge immediately. However, it will make it easier to add new options to phrase queries (such as `slop`) in the future.

This updates the fixture expectations for pelias/query#100 which uses the full `match_phrase` query format everywhere for consistency and to allow adding future parameters easily.

* chore(package): update ciao to version 2.0.0 * chore(ciao): Update ciao tests Amazingly in over a year since running these, only two things had to be changed. * fix(errors): Only return 400 status on known parameter errors Pelias has for a long time returned 400 as a default status whenever anything goes wrong, as well as when a user has passed invalid parameters. By using a new exception class, it is now possible to differetiate between known parameter errors, and unexpected errors that truly represent an HTTP 500. * feat(trimByGranularity): refactor code to add logging and debug output, modify loop to break after first match * feat(trimByGranularity): refactor to use an index instead of multiple loops over the results * feat(address_search_using_ids): add postcode to variable store if we have it * Update pelias-query to 9.7.0 This is required for the other half of postalcode matching * fix(log): Log for each interpolation request The groundwork for this was laid in pelias#1220, but it wasn't completed. By logging information about each individual interpolation request (a single API response may require many calls to the interpolation service if several streets were returned from a query), we can get a good overview of the interpolation service's performance. * feat(log): Add full clean context to parser logs Without the unmodified text input, it's hard to tell what cases is causing these messages. * feat(max_character_count_layer_filter): filter layer=address records for character counts <= 2 * feat(autocomplete): drive address filter by config * feat(unit_number_only_queries): do not consider parses with only the unit number as "admin only" * feat(libpostal_numeric_inputs): recast entirely numeric parses to label "house" * feat(focus.point): use exponential instead of linear scoring Linear scoring, by design, gives all records the same score past a certain point. This has the disadvantage that identical records that are very far away cannot be sorted by distance. By using exponential scoring, we can achieve decent sorting of even very far away records. This is very helpful for cities and postalcodes. Connects pelias#1206 * fix(autocomplete focus.point): Use 50km scale parameter The `scale` parameter controls how quickly scores decrease from the maximum as the distance from the `center_point` to the record in question increases. Set this to 50km, which is the same as search. Connects pelias#1206 * Use test description consistent with fixture file * feat(test): add test cases for autocomplete token matching permutations * feat(refactor): refactor addressit sanitizer for readability * feat(refactor): refactor addressit sanitizer - bugfix * Update service/configurations/Interpolation.js Change Language to Interpolation to fix copy/paste error. * feat(docs): update gitter link Most of our chatting happens in pelias/pelias, not pelias/api * chore(package): update dependencies * travis: reflect new best-practice travis-ci configuration Travis-CI has or will shortly make in early December 2018 a number of beneficial changes to their Linux continuous integration testing infrastructure. Changes that impact pelias/api are: * Linux infrastructure combined into one (virtualized), from two previously (virtualized and container-based). [0][1] * Offering a more modern, supported Ubuntu Xenial (16.04 LTS) [2] Projects using "sudo: false" (container-based infrastructure), have been recommended to remove that configuration soon. In any case, the transition will happen regardless for projects by December 7, 2018. [0] https://blog.travis-ci.com/2018-10-04-combining-linux-infrastructures [1] https://blog.travis-ci.com/2018-11-19-required-linux-infrastructure-migration [2] https://docs.travis-ci.com/user/reference/xenial/ * Add boundary.wof parameter to search, search/structured and autocomplete endpoints * fix(dedupe_query_size_1): check size when deduping * feat(autocomplete_empty_tokens): remove empty tokens * Change name from boundary.wof to boundary.gid and update param format * Add boundary.gid to /reverse endpoint * Add filter function for Placeholder results * Cleanup * feat(params): Add boundary.gid parameter Connects pelias#1249 * feat(stable_sort): fix stable sorting on node 11.6.0 * feat(dedupe): improved matching across languages * feat(dedupe): consider the user agent language in deduplication * feat(dedupe_improved_parent_matching): only check parent fields above the layer of the least granular match * feat(inter_layer_deduping): dedupe between layers and prefer some layers over others * feat(dedupe_improved_parent_matching): do not compare parent items at same level, use explicit list of synonymous layers * feat(dedupe_improved_parent_matching): add Vientiane test case * fix typos * fix(dedupe): Compare parent layers above lowest record This expands the layers checked for hierarchy difference to be any layer higher than the _lower_ of the two records being compared. Previously, it was any layer higher than the _higher_ of the two record (another way to think about it was it was only comparing hierarchy elements that BOTH records have). This fixes the issue where the city of Pennsylvania, Illinois was being considered a duplicate of the state of Pennsylvania. * feat(query) Set cutoff_frequency in all queries * feat(addendum): display addendum data in geojson * docs: Add link to Pelias services documentation The rules for when different Pelias services are required or recommended is now too complex to fit in the API readme. Fortunately we have it all written down in one place. * fix(package): update addressit to version 1.7.0 * feat(search): enable fallback queries when placeholder returns 0 results * feat(admin_ngram): query ngram fields for admin matching on autocomplete * fix(jshint): change syntax to silence jshint eerror on commit * feat(admin_ngrams): revert ccb9bc8 * fix(logging): Remove log of every query sent to addressit This is pretty verbose as its fairly common, and we can determine what queries were sent to addressit in other ways. * fix(logging): Clean up coarse reverse error log It was logging both info and error level messages. While not commonly called, this seems unnecessary. * fix(logging): allow access log to be empty This permits disabling access logs, which may or may not be useful in different use cases. * correct -> incorrect in documentation This error completely changed the meaning of a line in documentation * feat(deps): Remove `tmp` devDependency This package hasn't been used since 33341b7, which was removed in favor of the [microservice-wrapper](https://github.com/pelias/microservice-wrapper). Closes pelias#1266 * feat(admin_ngrams): Restore admin ngrams functionality This restores the functionality from pelias#1259 to allow autocomplete on partial administrative area names. It's a solid improvement, and we are waiting to merge only to give the community time to update their schema settings. This reverts commit 43a4bed. * feat(deps): Remove unused source-map dependency Like pelias#1269, this dev dependency was not being used. We added it long ago to try to fix issues with resolving NPM dependencies, but those issues haven't been seen recently. * fix(package): update check-types to version 8.0.0 * feat(categories): enable filter for autocomplete, show categories in geojson even when categories param is blank * feat(categories): gracefully handle a mixture of valid and invalid categories * Allow size parameter for autocomplete endpoint. * Add support for boundary circle to autocomplete. * Make sanitizer aware of boundary.circle. * feat(multi boundary.country): Support for multi `boundary.country` query parameter * fix(query): add code comments and sanity check to text_parser_addressit * feat(performance): add single_token_address_filter sanitizer * feat(performance): update single_token_address_filter to consider only layers relevant to clean.sources * feat(performance): rename single_token_address_filter to _address_layer_filter, add numeric check, emit warning * feat(performance): use parsed_text input to determine token length where available * fix(package): update pelias-config to version 4.0.0 Closes pelias#1289 * feat(localNamingConventions): fix for ESP street number formatting * Allow postalcode + admin area queries We currently have very strict logic for which combinations of inputs the fallback search query supports. Any input that is out of alignment with those expectations will be deferred to the lower quality addressit query system. This changes some of that expectation logic. Where previously it ignored inputs with a postcode and any other field, now a postcode in combination with any administrative fields are allowed. This is so, for example, a user can query for '90210, California' and hopefully see postalcodes called 90210 that are in California first in the results. Obviously postalcodes are complex so there might be a few cases that don't work, but it should help. As a bonus, it will remove fragility and complexity from our query generation. * feat(cutoff_frequency): enable cutoff_frequency for multi_match queries * feat(docs): Add README section for custom sources/layers This documentation exists only in the notes for a pull request, which inevitably will get out of date over time and is not as readily accessible. Connects pelias#1131 * Add Romania to flipped street number countries. * fix(package): update geolib to version 3.0.0 * Use new function names in geolib 3 * Update distance test expectation for precision changes in geolib 3 * fix(naming conventions): Columbian addresses should be flipped Fixes pelias#972 * Add comment explaining purpose of helper/placeTypes.js Its a list of _administrative_ placetypes. * fix(search_with_ids): Skip query when not address related The search_with_ids query (in some places more accurately referred to as the `address_search_with_ids` query), can really ONLY return addresses or related results such as streets, cities, etc. It will never return venues. However, if the user passed `layers=venue` as a query param, this query was still executing and if any results were returned, then a non-venue record would be returned as a result. * feat(translation) Translate name to request language when available * fix(package): update async to version 3.0.1 * feat(debug_population_popularity): display population & popularity in geojson when debug flag enabled * feat(improved_es_debugging): display more information about the elasticsearch request when in debug mode * fix(improved_es_debugging): remove unused statement * feat(max_text_length): limit max text length to 140 characters * feat(auto_discover_type_mapping): discover sources & layers by querying elasticsearch * feat(auto_discover_type_mapping): remove _replica_first preference * fix(predicate): Fix address related layers detection This fixes an error in the logic from pelias#1307. The `address_search_with_ids` query can only return results in the address or street layers. However, the predicate logic was allowing the query to execute if _any_ admin layer was specified, which could allow such a query to return street or address layers when not intended. * feat(controller): Remove search_with_ids controller I was attempting to use the new debug output from pelias#1313 only to realize it was not added to the `search_with_ids` controller, only the `search` controller. These two controllers are very similar, so this PR combines them into one that can be used in all cases. * feat(query): Use more consistent names The search endpoint uses multiple query types that are executed one after the other: if the first does not return acceptable results, the next is executed, and so on. The last query in this list used to be the only query the search endpoint had. Partially because it's been around for a long time, it is referred to in several different ways across different parts of the code. This change unifies all of them so that the query is called `search_addressit` everywhere. * chore(package): update pelias-query to version 9.13.0 * feat(query) use full match_phrase query This updates the fixture expectations for pelias/query#100 which uses the full `match_phrase` query format everywhere for consistency and to allow adding future parameters easily. * chore(package): update pelias-query to version 9.14.0 * feat(query): Remove deprecated optimize_bbox param This parameter of the `geo_distance` query was deprecated in Elasticsearch 2.2 Connects pelias/query#102 * chore(tests): Fix test description * fix(package): update elasticsearch to version 16.0.0 * chore(tests): Ensure test ES documents have proper format `source_id` should (currently) match `_id`, and should be present. * feat(geojsonify): Use `source_id` rather than `_id` field This change does not have any impact on the output of the API, but allows us to make changes to the contents of the `_id` field in order to correct the issue described in pelias/pelias#672 * fix(interpolation): Use street source_id if none found in interpolated result This fix ensures the interpolation middleware always returns results with a `source_id` field, fixing two issues: 1. It ensures the `source_id` property is always present in Pelias API responses 2. It allows `helper/geojsonify.js` to rely on `source_id` * feat(place): Query for old and new contents of `_id` field _This change completes step 2 of the `_id` field contents migration described in pelias/pelias#672 (comment) _ As part of the transition away from Elasticsearch types, we must change the contents of the `_id` field in our Elasticsearch documents. This change modifies the `/v1/place` endpoint so that it queries Elasticsearch for documents based on both the old and new contents of the `_id` field, allowing the same API code to work regardless of the version of Pelias used to import data. Connects pelias/pelias#672 * feat(auto_discover_type_mapping): update README * fix(auto_discover_type_mapping): bugfix to allow existing references to the type mapping to continue functioning after update * fix(auto_discover_type_mapping): update README * Netherlands use 'street number' The localized results of Pelias for the Netherlands are in the wrong order. The Dutch way is `street number`, not `number street` eg `damstraat 1` * chore(tests): Ensure fixture name is in test output This makes it faster to update the correct fixture * feat(search): Use minimum_should_match to filter hits This change utilizes the [minimum_should_match](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-minimum-should-match.html) Elasticsearch query parameter to reduce the number of hits to search queries. Previously, only one token in an input had to match, regardless of the input size. This allows for queries to potentially match a huge number of documents, especially as the number of tokens grows. Now, most `ngrams` queries have thee parameter set to `1<-1 3<-25%` Breaking this down, it means that for the given number of tokens in a query, there is a certain number of optional tokens: | token count | optional tokens | | --- | --- | | 1 | 0 (obviously) | | 2 | 1 | | 3 | 1 | | 4 | 1 | | 5 | 1 | | 6+ | at least 75% of tokens must match | This should help ensure quality results are returned where possible, even if long inputs contain a few extraneous bits of information, while reducing the chance of extremely expensive queries. * feat(autocomplete): focus point distance filter This change uses an extra filter clause that takes into account a focus point (if used) and the input text length to filter out very far away street and adress records, improving performance. Replaces pelias#1215 * feat(peliasQueryAnalyzer): replace usage of peliasQueryPartialToken and peliasQueryFullToken analyzers with the new peliasQuery analyzer * feat(dedupe): Prefer OSM records over Geonames In the case of a venue, we should prefer OSM records over Geonames. * feat(geocodeJSON): remove geocoding.query.tokens array * fix(package): update check-types to version 9.0.0 * feat(elasticsearch): do not reuse config for elasticsearch client * Handle signals in a Docker friendly way Similar to pelias/placeholder#160, this small change allows `ctrl+C` to immediately and gracefully shut down a docker container started with `docker run pelias/api`. It's a nice user experience improvement for people running our containers on the command line. * chore(Node.js): Add Node.js 12 to CI Node.js 12 (and especially Node.js 12.9) is supposed to be significantly faster than previous versions, especially when it comes to parsing JSON, which is often a bottleneck for Pelias importers. https://nodejs.org/en/blog/release/v12.9.0/ https://v8.dev/blog/v8-release-76 Connects pelias/pelias#800 * feat(flip_housenumber_street): update config to cover more country codes where the label format is "{{street}} {{housenumber}}" * fix(package): update pelias-model to version 6.0.0 * chore(tests): Update tests for GID changes Connects pelias/pelias#672 * feat(deps): Move from joi to hapi/joi This just keeps the output of `npm install` free of warnings * fix(package): update pelias-model to version 7.0.0 * chore(tests): Update test expectations for pelias-model-7.0.0 * fix(package): update check-types to version 10.0.0 * feat(140_chars): restrict autocomplete inputs to 140 chars * chore(test): Remove all instances of `_type` from ES docs This will help ensure that in the future no code depends on the `_type` value from Elasticsearch. Connects pelias/pelias#719 * chore(travis): Pin to Node.js 12.10 Pending resolution of issues that may be related to the `geolib` module. See manuelbieh/geolib#208 * feat(parser): replace addressit with pelias native parser * feat(parser): improved postfix cursor position for text with no admin classification * feat(parser): pelias/parser improvements * feat(parser): bump pelias/parser version * feat(parser): bump pelias/parser version * feat(parser): bump pelias/parser version * feat(parser): bump pelias/parser version * feat(parser): updates to tokenizer sanitizer * typo * feat(parser): stricter tokenization of exact matching admin queries * feat(parser): switch to using multi_match for admin subqueries * feat(admin_subqueries): test cross_fields query * feat(admin_subqueries): test operator:and query * feat(admin_subqueries): set all boosts to 1 * feat(admin_subqueries): add locality_a and country_a to multi_match * feat(admin_subqueries): revert to operator:or * feat(admin_subqueries): remove cutoff_frequency * feat(admin_subqueries): move admin matching to MUST condition * feat(tokenizer): consider query as complete if the final char is a numeral * feat(autocomplete): test removing exact_matching subquery * feat(admin_subqueries): add cutoff_frequency * feat(pelias_parser): admin queries - remove subject from admin subquery * feat(admin_subqueries): remove cutoff_frequency * feat(autocomplete): use phrase index for complete tokens * feat(parser): remove parsed_text.name * feat(parser): so not consider address parses as safe to use with an ngrams index due to parses potentially containing partial suffixes * feat(parser): bump pelias/parser version * feat(tokenizer): consider query as complete if the $subject is not at the end of $clean.text * feat(parser): bump pelias/parser version * feat(autocomplete): experiment adding name.default to admin multi_match * feat(autocomplete): progess commit * feat(autocomplete): typo * feat(autocomplete): improved matching at the cusp * feat(autocomplete): improved performance and reduced noise for admin matching * feat(deps): bump parser dep version * feat(deps): bump parser dep version * feat(deps): bump parser dep version * test: disable parserConsumedAllTokens for admin parses * feat(deps): bump parser dep version * feat(query): add should subquery for cross_street matching * feat(logging): add summary logging for pelias parser * feat(deps): bump parser dep version * feat(pelias_parser): additional parser tests * feat(deps): bump parser dep version * feat(pelias_parser): fix tests * feat(search_addressit): generate cross_street subquery where available * feat(pelias_parser): limit input text to 140 characters * feat(pelias_parser): replace peliasQueryPartialToken analyzer with peliasQuery * feat(pelias_parser): disable "ngrams_last_token_only_multi" view when every "completed" token is numeric * Add context to pelias parser logs * Pin to pelias-parser-1.38.0 for now * refactor(pelias_parser): add code comments relating to "add_name_to_multimatch", clean up related code * refactor(pelias_parser): remove disused code/comments * feat(pelias_parser): completely remove "addressit" and references to it * refactor(pelias_parser): remove references to "original style queries" * chore(Travis): Use string to specify minor version Otherwise 12.10 is interpreted as a number and becomes version 12.1. * chore(query): Remove empty pop_subquery view `pelias-query` now includes basic views for standard Elasticsearch queries, including `match_all`. The API esentially created a wrapper around the `match_all` view, which was used in some cases and resulted in an extra query view that didn't do much. This change removes that boilerplate code and makes it a bit more clear what the underlying functionality of a query will be. * chore(tests): Remove unused fixture * chore(query): Remove unused function parameter At first glance, this code made it appear that the phrase query was being used across the focus point filter functionality on libpostal based search queries, but it turns out it's not! The `focus_only_function` view constructor takes no parameters, so this code wasn't having any effect. * updated with latest api * fix(local_naming_conventions): improve localNamingConventions middleware to support records with name/street aliases * fixed lint errors * added back placeholder polygon * fixed geometry results

orangejulius merged commit ca7ddd6 into master Jun 18, 2019

orangejulius deleted the use-full-match_phrase-query branch June 18, 2019 21:16

orangejulius mentioned this pull request Jun 18, 2019

Support pelias-query 9.13.0 pelias/api#1321

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(query): Use full form for match_phrase query #100

feat(query): Use full form for match_phrase query #100

orangejulius commented Jun 18, 2019

feat(query): Use full form for match_phrase query #100

feat(query): Use full form for match_phrase query #100

Conversation

orangejulius commented Jun 18, 2019