Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[experiment] Disable Elasticsearch field norms #323

Closed
wants to merge 1 commit into from

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Sep 28, 2018

This is the PR to add schema support for elasticsearch 5.6.12 while still maintaining backwards compatibility with 2.4.

This feature has been long requested, so I'm happy to have a PR ready for testing 😃

About the upgrade

Elasticsearch 5.x is mostly backwards compatible with 2.x, so not many changes are required at this stage, we will make more significant changes in the migration from 5->6, but this PR is the start of a bunch of work to modernize the elasticsearch version supported by Pelias.

One of the major changes between the two versions is the 'Similarity', in 2.4 this was something called 'default' (now called 'classic'). In version 5 this was replaced by BM25 as the default similarity scoring engine.

As a result of the similarity algorthm changes, queries will score differently, the actual floating point numbers allocated by elasticsearch are different when compared to 2.4.

Pelias is only concerned with relative scoring between documents, however in some cases there are minor difference in relative scoring.

These scoring differences are fairly rare, I have disabled 'norms' in this PR which helps in mitigating this issue.

Please be aware that there will inevitably be changes to sorting order in a small amount of queries, for the most part these will be improvements, if there are regressions please open an issue to discuss.

The other major change was in query composition, for instance the error message no [query] registered for [filtered] is thrown for queries which begin { "query": { "filtered".

We started rewriting these queries in Pelias some time ago, when I ran automated tests and manual checks I was not able to trigger any query errors, I suspect that we have already completed the query migration, again we would be happy to hear if we've missed any 😃

Upgrade period & deprecation notice

I wanted to do this in a way that is also backwards compatible with [email protected], so that users don't have a 'hard cutover', giving them some time to allocate resources to get their infrastructure moved over to 5.6.12.

The plan is to maintain support for both 2.4 & 5.6 for only as long is required to confirm that the transition was successful, at which stage we deprecate support for 2.4.

We'll try to support both where we can, but sometime in 2019 we will completely drop support for 2.4 so that we can move forward and start taking advantage of new features only available in versions 5 & 6.

How to enable support for 5.6?

You'll need to run the code in this PR and ensure you have a current version of npm elasticsearch installed. (ie. run npm install)

Then you will need to change the esclient.apiVersion in your pelias.json file.

head -n3 ~/pelias.json 
{
  "esclient": {
    "apiVersion": "2.4",
head -n3 ~/pelias.json 
{
  "esclient": {
    "apiVersion": "5.6",

A quick note on versions:

  • If you set your apiVersion to 5.6 then it will work fine against either version of elasticsearch.
  • If you set your apiVersion to 2.4 then it will only work against 2.4, an error will be thrown when trying to use a 5.6 server with a 2.4 client.
  • I will soon update the default version in pelias/config to be 5.6

Which docker images should I use?

For elasticsearch 5.6 you can use pelias/elasticsearch:5.6.12:

docker run --rm -p 9200:9200 pelias/elasticsearch:5.6.12

For elasticsearch 2.4 you can use pelias/elasticsearch:2.4:

docker run --rm -p 9200:9200 pelias/elasticsearch:2.4

Note that a PR will be coming soon for pelias/docker which updates the existing 'projects'.

What's in the PR?

  • Updates to npm elastictest to support 5.6
  • Update npm elasticsearch to support new APIs
  • Updated Travis config to run integration tests against 5.6.12 (updated readme on how to run this locally)
  • Pass esclient version to elastictest so that the correct APIs are used for integration tests
  • Rewite any integration test queries which used the old query syntax

Schema Changes

geo_point

The geo_point properties were removed in #321

norms

norms have been disabled for all relevant fields using the syntax: norms: { enabled: false }

norms are useful for a traditional fulltext search engine where 'smaller' fields like a title are considered more important than 'larger' fields such as the text body.

this is not relevant to geocoding so I have disabled norms, this is required in this PR because without it the BM25 algorithm takes norms in to account and scores integration test queries in a way which isn't compatible between versions.

the other benefit of disabling norms is reducing the amount of disk storage (see the documentation linked above for more info).

@missinglink missinglink force-pushed the support_both_es2_and_es5 branch 14 times, most recently from f05a94e to b6f3d9b Compare September 28, 2018 11:35
@missinglink
Copy link
Member Author

🎉 Unit tests and Integration tests passing on CI for all combinations of nodejs versions 6, 8 & 10 and elasticsearch versions 2.4.6 and 5.6.12

https://travis-ci.org/pelias/schema/builds/434573990

@orangejulius
Copy link
Member

This is awesome, so many good changes here.

analyzer: 'peliasPhrase',
fielddata : {
loading: 'eager_global_ordinals'
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears in 2.4 fielddata is enabled by default. How about we remove this change from this PR to keep it as small as possible, and let #302 handle it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removal of the fielddata sections has been undone via rebase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@missinglink
Copy link
Member Author

missinglink commented Sep 28, 2018

cc/ @Ashleyholt, please forward this on to your team :)

@Joxit
Copy link
Member

Joxit commented Oct 6, 2018

That's a great job!
This seems to work with both es2 and es5 for me !

@gopi-ar
Copy link

gopi-ar commented Oct 19, 2018

Ran a full planet import yesterday on this branch with ES 5.6; here are some observations:

Import

  • much smaller index size: ~170 GB - 15 shards, 3 nodes. Earlier it was ~300 GB with 25 shards, 3 nodes
  • import duration time decreased by a few hours or so (took ~3 days on a 16 core, 96 GB ram server)

Performance

  • search RPS decreased from 20 RPS on 2.4 (master branch) to 5 RPS (on this branch) with p95 over 1s. This was random address data around the world. We tried it on a much larger instance with 72 vCPUs, only went to 10 RPS.
  • reverse RPS is the same at 150 RPS, with p95 at around 500ms.

For context, the following services were running during the loadtest:

  • placeholder
  • pip
  • libpostal

Interpolation was not running.

@orangejulius
Copy link
Member

Hey @gopi-ar,
Thanks for all the info and for taking the time to do the testing.

All our indications are that ES5 should be much faster (especially reverse queries, as they've improved the algorithms for geo_distance), so to be honest I'm surprised by your results.

We are planning an Elasticsearch 5 checkin sometime next week, perhaps you can join us and we can figure out what sort of performance to expect with ES5?

@gopi-ar
Copy link

gopi-ar commented Oct 23, 2018

@orangejulius yep, we were also expecting a huge bump with reverse since photon saw a drastic increase as well.

Also, pre-loading the whole index into RAM didn't seem to help. RAM usage for 2 types of settings we tried for the whole build are:

["*"] - ~140GB
["nvd", "dvd", "tim", "doc", "dim"] - ~ 67GB

@orangejulius orangejulius changed the title support for elasticsearch 5.x while also maintaining backwards compatibility with 2.x [experiment] Disable Elasticsearch field norms Oct 25, 2018
@orangejulius
Copy link
Member

An update on the status of this branch: we initially tried out disabling norms on various fields in our Elasticsearch mapping, because that helped make some of our schema integration tests pass.

It turns out those changes didn't translate very well to real world testing, so we split most of the work in this PR out into #328 and merged that. We'll do a little more testing, but most likely will close this particular PR.

@orangejulius orangejulius deleted the support_both_es2_and_es5 branch March 6, 2019 18:37
@missinglink missinglink mentioned this pull request Feb 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants