Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(query): Use full form for match_phrase query #100

Merged
merged 1 commit into from
Jun 18, 2019

Conversation

orangejulius
Copy link
Member

There are two forms for the Elasticsearch match_phrase query:

  1. A shorter, condensed version which does not allow setting extra options:
{
  "match_phrase": {
    "the-field": "the search terms"
  }
}
  1. A complete version that does allow setting options:
{
  "match_phrase": {
    "the-field": {
      "query": "the search terms",
      "some-option": "some-value"
    }
  }
}

Pelias currently uses a mix of the two formats. This change removes all usage of the condensed format in favor of the full format.

As a pure refactoring, with no functional changes, this should be safe to merge immediately. However, it will make it easier to add new options to phrase queries (such as slop) in the future.

There are two forms for the Elasticsearch [match_phrase](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) query:

1. A shorter, condensed version which does not allow setting extra
options:
```
{
  "match_phrase": {
    "the-field": "the search terms"
  }
}
```

2. A complete version that does allow setting options:
```
{
  "match_phrase": {
    "the-field": {
      "query": "the search terms",
      "some-option": "some-value"
    }
  }
}
```

Pelias currently uses a mix of the two formats. This change removes all
usage of the condensed format in favor of the full format.

As a pure refactoring, with no functional changes, this should be safe
to merge immediately. However, it will make it easier to add new options
to phrase queries (such as `slop`) in the future.
@orangejulius orangejulius merged commit ca7ddd6 into master Jun 18, 2019
@orangejulius orangejulius deleted the use-full-match_phrase-query branch June 18, 2019 21:16
orangejulius added a commit to pelias/api that referenced this pull request Jun 18, 2019
This updates the fixture expectations for
pelias/query#100 which uses the full
`match_phrase` query format everywhere for consistency and to allow
adding future parameters easily.
missinglink pushed a commit to pelias/api that referenced this pull request Jun 26, 2019
This updates the fixture expectations for
pelias/query#100 which uses the full
`match_phrase` query format everywhere for consistency and to allow
adding future parameters easily.
jsvrcek added a commit to EventKit/pelias-api that referenced this pull request Oct 8, 2019
* chore(package): update ciao to version 2.0.0

* chore(ciao): Update ciao tests

Amazingly in over a year since running these, only two things had to be
changed.

* fix(errors): Only return 400 status on known parameter errors

Pelias has for a long time returned 400 as a default status whenever
anything goes wrong, as well as when a user has passed invalid
parameters.

By using a new exception class, it is now possible to differetiate
between known parameter errors, and unexpected errors that truly
represent an HTTP 500.

* feat(trimByGranularity): refactor code to add logging and debug output, modify loop to break after first match

* feat(trimByGranularity): refactor to use an index instead of multiple loops over the results

* feat(address_search_using_ids): add postcode to variable store if we have it

* Update pelias-query to 9.7.0

This is required for the other half of postalcode matching

* fix(log): Log for each interpolation request

The groundwork for this was laid in pelias#1220, but it wasn't completed.

By logging information about each individual interpolation request (a
single API response may require many calls to the interpolation service
if several streets were returned from a query), we can get a good
overview of the interpolation service's performance.

* feat(log): Add full clean context to parser logs

Without the unmodified text input, it's hard to tell what cases is
causing these messages.

* feat(max_character_count_layer_filter): filter layer=address records for character counts <= 2

* feat(autocomplete): drive address filter by config

* feat(unit_number_only_queries): do not consider parses with only the unit number as "admin only"

* feat(libpostal_numeric_inputs): recast entirely numeric parses to label "house"

* feat(focus.point): use exponential instead of linear scoring

Linear scoring, by design, gives all records the same score past a
certain point.

This has the disadvantage that identical records that are very far away
cannot be sorted by distance.

By using exponential scoring, we can achieve decent sorting of even very
far away records. This is very helpful for cities and postalcodes.

Connects pelias#1206

* fix(autocomplete focus.point): Use 50km scale parameter

The `scale` parameter controls how quickly scores decrease from the
maximum as the distance from the `center_point` to the record in
question increases.

Set this to 50km, which is the same as search.

Connects pelias#1206

* Use test description consistent with fixture file

* feat(test): add test cases for autocomplete token matching permutations

* feat(refactor): refactor addressit sanitizer for readability

* feat(refactor): refactor addressit sanitizer - bugfix

* Update service/configurations/Interpolation.js

Change Language to Interpolation to fix copy/paste error.

* feat(docs): update gitter link

Most of our chatting happens in pelias/pelias, not pelias/api

* chore(package): update dependencies

* travis: reflect new best-practice travis-ci configuration

Travis-CI has or will shortly make in early December 2018 a number of beneficial
changes to their Linux continuous integration testing infrastructure.

Changes that impact pelias/api are:
* Linux infrastructure combined into one (virtualized), from two previously
  (virtualized and container-based). [0][1]
* Offering a more modern, supported Ubuntu Xenial (16.04 LTS) [2]

Projects using "sudo: false" (container-based infrastructure), have been
recommended to remove that configuration soon. In any case, the transition
will happen regardless for projects by December 7, 2018.

[0] https://blog.travis-ci.com/2018-10-04-combining-linux-infrastructures
[1] https://blog.travis-ci.com/2018-11-19-required-linux-infrastructure-migration
[2] https://docs.travis-ci.com/user/reference/xenial/

* Add boundary.wof parameter to search, search/structured and autocomplete endpoints

* fix(dedupe_query_size_1): check size when deduping

* feat(autocomplete_empty_tokens): remove empty tokens

* Change name from boundary.wof to boundary.gid and update param format

* Add boundary.gid to /reverse endpoint

* Add filter function for Placeholder results

* Cleanup

* feat(params): Add boundary.gid parameter

Connects pelias#1249

* feat(stable_sort): fix stable sorting on node 11.6.0

* feat(dedupe): improved matching across languages

* feat(dedupe): consider the user agent language in deduplication

* feat(dedupe_improved_parent_matching): only check parent fields above the layer of the least granular match

* feat(inter_layer_deduping): dedupe between layers and prefer some layers over others

* feat(dedupe_improved_parent_matching): do not compare parent items at same level, use explicit list of synonymous layers

* feat(dedupe_improved_parent_matching): add Vientiane test case

* fix typos

* fix(dedupe): Compare parent layers above lowest record

This expands the layers checked for hierarchy difference to be any layer
higher than the _lower_ of the two records being compared. Previously,
it was any layer higher than the _higher_ of the two record (another way
to think about it was it was only comparing hierarchy elements that BOTH
records have).

This fixes the issue where the city of Pennsylvania, Illinois was being
considered a duplicate of the state of Pennsylvania.

* feat(query) Set cutoff_frequency in all queries

* feat(addendum): display addendum data in geojson

* docs: Add link to Pelias services documentation

The rules for when different Pelias services are required or recommended
is now too complex to fit in the API readme. Fortunately we have it all
written down in one place.

* fix(package): update addressit to version 1.7.0

* feat(search): enable fallback queries when placeholder returns 0 results

* feat(admin_ngram): query ngram fields for admin matching on autocomplete

* fix(jshint): change syntax to silence jshint eerror on commit

* feat(admin_ngrams): revert ccb9bc8

* fix(logging): Remove log of every query sent to addressit

This is pretty verbose as its fairly common, and we can determine what
queries were sent to addressit in other ways.

* fix(logging): Clean up coarse reverse error log

It was logging both info and error level messages. While not commonly
called, this seems unnecessary.

* fix(logging): allow access log to be empty

This permits disabling access logs, which may or may not be useful in
different use cases.

* correct -> incorrect in documentation

This error completely changed the meaning of a line in documentation

* feat(deps): Remove `tmp` devDependency

This package hasn't been used since
33341b7, which was removed in favor of
the [microservice-wrapper](https://github.com/pelias/microservice-wrapper).

Closes pelias#1266

* feat(admin_ngrams): Restore admin ngrams functionality

This restores the functionality from pelias#1259
to allow autocomplete on partial administrative area names.

It's a solid improvement, and we are waiting to merge only to give the
community time to update their schema settings.

This reverts commit 43a4bed.

* feat(deps): Remove unused source-map dependency

Like pelias#1269, this dev dependency was
not being used. We added it long ago to try to fix issues with resolving
NPM dependencies, but those issues haven't been seen recently.

* fix(package): update check-types to version 8.0.0

* feat(categories): enable filter for autocomplete, show categories in geojson even when categories param is blank

* feat(categories): gracefully handle a mixture of valid and invalid categories

* Allow size parameter for autocomplete endpoint.

* Add support for boundary circle to autocomplete.

* Make sanitizer aware of boundary.circle.

* feat(multi boundary.country): Support for multi `boundary.country` query parameter

* fix(query): add code comments and sanity check to text_parser_addressit

* feat(performance): add single_token_address_filter sanitizer

* feat(performance): update single_token_address_filter to consider only layers relevant to clean.sources

* feat(performance): rename single_token_address_filter to _address_layer_filter, add numeric check, emit warning

* feat(performance): use parsed_text input to determine token length where available

* fix(package): update pelias-config to version 4.0.0

Closes pelias#1289

* feat(localNamingConventions): fix for ESP street number formatting

* Allow postalcode + admin area queries

We currently have very strict logic for which combinations of inputs the
fallback search query supports. Any input that is out of alignment with
those expectations will be deferred to the lower quality addressit query
system.

This changes some of that expectation logic. Where previously it ignored
inputs with a postcode and any other field, now a postcode in
combination with any administrative fields are allowed.

This is so, for example, a user can query for '90210, California' and
hopefully see postalcodes called 90210 that are in California first in
the results.

Obviously postalcodes are complex so there might be a few cases that
don't work, but it should help. As a bonus, it will remove fragility and
complexity from our query generation.

* feat(cutoff_frequency): enable cutoff_frequency for multi_match queries

* feat(docs): Add README section for custom sources/layers

This documentation exists only in the notes for a pull request, which
inevitably will get out of date over time and is not as readily
accessible.

Connects pelias#1131

* Add Romania to flipped street number countries.

* fix(package): update geolib to version 3.0.0

* Use new function names in geolib 3

* Update distance test expectation for precision changes in geolib 3

* fix(naming conventions): Columbian addresses should be flipped

Fixes pelias#972

* Add comment explaining purpose of helper/placeTypes.js

Its a list of _administrative_ placetypes.

* fix(search_with_ids): Skip query when not address related

The search_with_ids query (in some places more accurately referred to as
the `address_search_with_ids` query), can really ONLY return addresses
or related results such as streets, cities, etc. It will never return
venues.

However, if the user passed `layers=venue` as a query param, this query
was still executing and if any results were returned, then a non-venue
record would be returned as a result.

* feat(translation) Translate name to request language when available

* fix(package): update async to version 3.0.1

* feat(debug_population_popularity): display population & popularity in geojson when debug flag enabled

* feat(improved_es_debugging): display more information about the elasticsearch request when in debug mode

* fix(improved_es_debugging): remove unused statement

* feat(max_text_length): limit max text length to 140 characters

* feat(auto_discover_type_mapping): discover sources & layers by querying elasticsearch

* feat(auto_discover_type_mapping): remove _replica_first preference

* fix(predicate): Fix address related layers detection

This fixes an error in the logic from
pelias#1307.

The `address_search_with_ids` query can only return results in the
address or street layers. However, the predicate logic was allowing the
query to execute if _any_ admin layer was specified, which could allow
such a query to return street or address layers when not intended.

* feat(controller): Remove search_with_ids controller

I was attempting to use the new debug output from
pelias#1313 only to realize it was not added
to the `search_with_ids` controller, only the `search` controller.

These two controllers are very similar, so this PR combines them into
one that can be used in all cases.

* feat(query): Use more consistent names

The search endpoint uses multiple query types that are executed one
after the other: if the first does not return acceptable results, the
next is executed, and so on.

The last query in this list used to be the only query the search
endpoint had. Partially because it's been around for a long time, it is
referred to in several different ways across different parts of the
code.

This change unifies all of them so that the query is called
`search_addressit` everywhere.

* chore(package): update pelias-query to version 9.13.0

* feat(query) use full match_phrase query

This updates the fixture expectations for
pelias/query#100 which uses the full
`match_phrase` query format everywhere for consistency and to allow
adding future parameters easily.

* chore(package): update pelias-query to version 9.14.0

* feat(query): Remove deprecated optimize_bbox param

This parameter of the `geo_distance` query was deprecated in
Elasticsearch 2.2

Connects pelias/query#102

* chore(tests): Fix test description

* fix(package): update elasticsearch to version 16.0.0

* chore(tests): Ensure test ES documents have proper format

`source_id` should (currently) match `_id`, and should be present.

* feat(geojsonify): Use `source_id` rather than `_id` field

This change does not have any impact on the output of the API, but
allows us to make changes to the contents of the `_id` field in order to
correct the issue described in pelias/pelias#672

* fix(interpolation): Use street source_id if none found in interpolated result

This fix ensures the interpolation middleware always returns results
with a `source_id` field, fixing two issues:

1. It ensures the `source_id` property is always present in Pelias API
responses
2. It allows `helper/geojsonify.js` to rely on `source_id`

* feat(place): Query for old and new contents of `_id` field

_This change completes step 2 of the `_id` field contents migration
described in pelias/pelias#672 (comment) _

As part of the transition away from Elasticsearch types, we must change
the contents of the `_id` field in our Elasticsearch documents.

This change modifies the `/v1/place` endpoint so that it queries
Elasticsearch for documents based on both the old and new contents of
the `_id` field, allowing the same API code to work regardless of the
version of Pelias used to import data.

Connects pelias/pelias#672

* feat(auto_discover_type_mapping): update README

* fix(auto_discover_type_mapping): bugfix to allow existing references to the type mapping to continue functioning after update

* fix(auto_discover_type_mapping): update README

* Netherlands use 'street number'

The localized results of Pelias for the Netherlands are in the wrong order. The Dutch way is `street number`, not `number street` eg `damstraat 1`

* chore(tests): Ensure fixture name is in test output

This makes it faster to update the correct fixture

* feat(search): Use minimum_should_match to filter hits

This change utilizes the [minimum_should_match](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-minimum-should-match.html)
Elasticsearch query parameter to reduce the number of hits to search
queries.

Previously, only one token in an input had to match, regardless of the
input size. This allows for queries to potentially match a huge number
of documents, especially as the number of tokens grows.

Now, most `ngrams` queries have thee parameter set to `1<-1 3<-25%`

Breaking this down, it means that for the given number of tokens in a
query, there is a certain number of optional tokens:

| token count | optional tokens |
| --- | --- |
| 1   | 0 (obviously) |
| 2   | 1   |
| 3   | 1   |
| 4   | 1   |
| 5   | 1   |
| 6+  | at least 75% of tokens must match |

This should help ensure quality results are returned where possible,
even if long inputs contain a few extraneous bits of information, while
reducing the chance of extremely expensive queries.

* feat(autocomplete): focus point distance filter

This change uses an extra filter clause that takes into account a focus
point (if used) and the input text length to filter out very far away
street and adress records, improving performance.

Replaces pelias#1215

* feat(peliasQueryAnalyzer): replace usage of peliasQueryPartialToken and peliasQueryFullToken analyzers with the new peliasQuery analyzer

* feat(dedupe): Prefer OSM records over Geonames

In the case of a venue, we should prefer OSM records over Geonames.

* feat(geocodeJSON): remove geocoding.query.tokens array

* fix(package): update check-types to version 9.0.0

* feat(elasticsearch): do not reuse config for elasticsearch client

* Handle signals in a Docker friendly way

Similar to pelias/placeholder#160, this small
change allows `ctrl+C` to immediately and gracefully shut down a docker
container started with `docker run pelias/api`.

It's a nice user experience improvement for people running our
containers on the command line.

* chore(Node.js): Add Node.js 12 to CI

Node.js 12 (and especially Node.js 12.9) is supposed to be significantly
faster than previous versions, especially when it comes to parsing JSON, which
is often a bottleneck for Pelias importers.

https://nodejs.org/en/blog/release/v12.9.0/
https://v8.dev/blog/v8-release-76

Connects pelias/pelias#800

* feat(flip_housenumber_street): update config to cover more country codes where the label format is "{{street}} {{housenumber}}"

* fix(package): update pelias-model to version 6.0.0

* chore(tests): Update tests for GID changes

Connects pelias/pelias#672

* feat(deps): Move from joi to hapi/joi

This just keeps the output of `npm install` free of warnings

* fix(package): update pelias-model to version 7.0.0

* chore(tests): Update test expectations for pelias-model-7.0.0

* fix(package): update check-types to version 10.0.0

* feat(140_chars): restrict autocomplete inputs to 140 chars

* chore(test): Remove all instances of `_type` from ES docs

This will help ensure that in the future no code depends on the `_type`
value from Elasticsearch.

Connects pelias/pelias#719

* chore(travis): Pin to Node.js 12.10

Pending resolution of issues that may be related to the `geolib` module.

See manuelbieh/geolib#208

* feat(parser): replace addressit with pelias native parser

* feat(parser): improved postfix cursor position for text with no admin classification

* feat(parser): pelias/parser improvements

* feat(parser): bump pelias/parser version

* feat(parser): bump pelias/parser version

* feat(parser): bump pelias/parser version

* feat(parser): bump pelias/parser version

* feat(parser): updates to tokenizer sanitizer

* typo

* feat(parser): stricter tokenization of exact matching admin queries

* feat(parser): switch to using multi_match for admin subqueries

* feat(admin_subqueries): test cross_fields query

* feat(admin_subqueries): test operator:and query

* feat(admin_subqueries): set all boosts to 1

* feat(admin_subqueries): add locality_a and country_a to multi_match

* feat(admin_subqueries): revert to operator:or

* feat(admin_subqueries): remove cutoff_frequency

* feat(admin_subqueries): move admin matching to MUST condition

* feat(tokenizer): consider query as complete if the final char is a numeral

* feat(autocomplete): test removing exact_matching subquery

* feat(admin_subqueries): add cutoff_frequency

* feat(pelias_parser): admin queries - remove subject from admin subquery

* feat(admin_subqueries): remove cutoff_frequency

* feat(autocomplete): use phrase index for complete tokens

* feat(parser): remove parsed_text.name

* feat(parser): so not consider address parses as safe to use with an ngrams index due to parses potentially containing partial suffixes

* feat(parser): bump pelias/parser version

* feat(tokenizer): consider query as complete if the $subject is not at the end of $clean.text

* feat(parser): bump pelias/parser version

* feat(autocomplete): experiment adding name.default to admin multi_match

* feat(autocomplete): progess commit

* feat(autocomplete): typo

* feat(autocomplete): improved matching at the cusp

* feat(autocomplete): improved performance and reduced noise for admin matching

* feat(deps): bump parser dep version

* feat(deps): bump parser dep version

* feat(deps): bump parser dep version

* test: disable parserConsumedAllTokens for admin parses

* feat(deps): bump parser dep version

* feat(query): add should subquery for cross_street matching

* feat(logging): add summary logging for pelias parser

* feat(deps): bump parser dep version

* feat(pelias_parser): additional parser tests

* feat(deps): bump parser dep version

* feat(pelias_parser): fix tests

* feat(search_addressit): generate cross_street subquery where available

* feat(pelias_parser): limit input text to 140 characters

* feat(pelias_parser): replace peliasQueryPartialToken analyzer with peliasQuery

* feat(pelias_parser): disable "ngrams_last_token_only_multi" view when every "completed" token is numeric

* Add context to pelias parser logs

* Pin to pelias-parser-1.38.0 for now

* refactor(pelias_parser): add code comments relating to "add_name_to_multimatch", clean up related code

* refactor(pelias_parser): remove disused code/comments

* feat(pelias_parser): completely remove "addressit" and references to it

* refactor(pelias_parser): remove references to "original style queries"

* chore(Travis): Use string to specify minor version

Otherwise 12.10 is interpreted as a number and becomes version 12.1.

* chore(query): Remove empty pop_subquery view

`pelias-query` now includes basic views for standard Elasticsearch
queries, including `match_all`. The API esentially created a wrapper
around the `match_all` view, which was used in some cases and resulted
in an extra query view that didn't do much.

This change removes that boilerplate code and makes it a bit more clear
what the underlying functionality of a query will be.

* chore(tests): Remove unused fixture

* chore(query): Remove unused function parameter

At first glance, this code made it appear that the phrase query was
being used across the focus point filter functionality on libpostal
based search queries, but it turns out it's not!

The `focus_only_function` view constructor takes no parameters, so this
code wasn't having any effect.

* updated with latest api

* fix(local_naming_conventions): improve localNamingConventions middleware to support records with name/street aliases

* fixed lint errors

* added back placeholder polygon

* fixed geometry results
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant