Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate fielddata on _id/_uid #25240

Closed
jpountz opened this issue Jun 15, 2017 · 3 comments
Closed

Deprecate fielddata on _id/_uid #25240

jpountz opened this issue Jun 15, 2017 · 3 comments
Labels
blocker discuss :Search/Search Search-related issues that do not fall into other categories v6.0.0-beta1

Comments

@jpountz
Copy link
Contributor

jpountz commented Jun 15, 2017

Some features like random_score use fielddata on _id and our documentation sometimes recommends to sort on _id in order to have a stable sort across pages (eg. https://www.elastic.co/guide/en/elasticsearch/reference/master/search-request-search-after.html). However fielddata on such unique fields has a huge cost if you have significant amounts of data, we should move away from it.

We probably have two options here, which are to either add doc values to _id (#11887) but this proved controversial due to the overhead it adds to the index, or switch to _doc but this has the drawback of not being comparable across shards or after merges.

@jpountz
Copy link
Contributor Author

jpountz commented Jun 16, 2017

We discussed in FixitFriday and agreed to cut over random_score to use sequence numbers for random score generation. The only drawback compared to today is that documents would get a different score after they are updated, even if the same seed is used.

Consistent sorting on different point-in-time snapshots was more controversial. Here are some opinions:

  • there is nothing we can do in that case, we should set expectations and just remove the recommendation to use the _uid as a tie breaker
  • we should recommend using _seq_no as a tie-breaker for that purpose rather than _uid. It would not be stable in case of updates but is the best we can provide which works in practice.
  • we should automatically append _seq_no to every sort internally. Same reasoning as the previous idea but without exposing _seq_no which is quite an expert/internal feature.

@rjernst
Copy link
Member

rjernst commented Jun 22, 2017

The only drawback compared to today is that documents would get a different score after they are updated

I don't see how this is any different than the way it used to be, using the docid. This is very trappy, to have a random score with the capability of setting a seed, but the seed does not actually guarantee anything.

Why not expose the ability to set what field is used, and make it required if using a seed. Then the default can be a truly random score (using docid), but if a seed+field is set, that is used.

@jpountz
Copy link
Contributor Author

jpountz commented Jul 4, 2017

I like that idea.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Jul 7, 2017
…ion.

We currently use fielddata on the `_id` field which is trappy, especially as we
do it implicitly. This changes the `random_score` function to use doc ids when
no seed is provided and to suggest a field when a seed is provided.

For now the change only emits a deprecation warning when no field is supplied
but this should be replaced by a strict check on 7.0.

Closes elastic#25240
jpountz added a commit that referenced this issue Jul 19, 2017
…ion. (#25594)

We currently use fielddata on the `_id` field which is trappy, especially as we
do it implicitly. This changes the `random_score` function to use doc ids when
no seed is provided and to suggest a field when a seed is provided.

For now the change only emits a deprecation warning when no field is supplied
but this should be replaced by a strict check on 7.0.

Closes #25240
jpountz added a commit to jpountz/elasticsearch that referenced this issue Jul 19, 2017
…, `_shard`].

`_id` requires fielddata to be loaded into memory for sorting, which does not
scale on large clusters. Adding doc values to the `_id` field proved
controversial so instead, we are removing use-cases for fieddata on the `_uid`
or `_id` fields.

Relates elastic#25240
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Fielddata labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker discuss :Search/Search Search-related issues that do not fall into other categories v6.0.0-beta1
Projects
None yet
Development

No branches or pull requests

4 participants