Dynamically map all numerics to floats by default? #16018

Closed
jpountz opened this issue Jan 15, 2016 · 10 comments
Labels
>breaking >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@jpountz
Contributor

jpountz commented Jan 15, 2016

Elasticsearch assumes that if a number contains a dot, then it should be mapped as a floating-point number (double in 2.x and float in master), and otherwise as a long. But this is quite trappy, as it means we expect floating-point numbers to be consistently serialized with a dot (see e.g. https://twitter.com/bitemyapp/status/687415657651154944 or #15961).
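
To make the trap concrete, here is a toy model of the behavior described above. It is only a sketch, not Elasticsearch code: Python's `json` module stands in for the real parser, `detect_type` and `index` are hypothetical helper names, and the truncation of later fractional values assumes coercion is enabled on the integer field.

```python
import json

def detect_type(value):
    """Guess a field type from the parsed JSON value (toy model)."""
    if isinstance(value, bool):  # bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):   # serialized without a dot or exponent
        return "long"
    if isinstance(value, float):
        return "float"
    return "other"

def index(mapping, raw_doc):
    """Index one document; the first value seen for a field fixes its type."""
    stored = {}
    for field, value in json.loads(raw_doc).items():
        field_type = mapping.setdefault(field, detect_type(value))
        if field_type == "long" and isinstance(value, float):
            value = int(value)    # assumed coercion: fractional part is dropped
        elif field_type == "float" and isinstance(value, int):
            value = float(value)
        stored[field] = value
    return stored

mapping = {}
print(index(mapping, '{"price": 10}'))     # {'price': 10}
print(mapping)                             # {'price': 'long'}
print(index(mapping, '{"price": 10.75}'))  # {'price': 10}
```

Swapping the order of the two documents gives a `float` mapping and no data loss, which is why a serializer that sometimes writes `10` instead of `10.0` matters.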

Instead, we could map all numerics to floats by default (but you could still use dynamic templates to override it if you want). This would have two drawbacks:

  • floats can only represent integer values accurately up to 2^24 (~16M)
  • it would increase storage requirements
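
The 2^24 limit in the first bullet can be checked directly. The sketch below round-trips integers through a 32-bit float using Python's `struct` module; `to_float32` is a helper name introduced here for illustration.

```python
import struct

def to_float32(value):
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

limit = 2 ** 24  # 16,777,216

print(to_float32(limit) == limit)          # True: still exact
print(to_float32(limit + 1) == limit + 1)  # False: first integer that is lost
print(to_float32(limit + 1))               # 16777216.0
```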

I ran some simulations to see how much worse it would be to store integers as floating-point numbers. The good news is that since most bits will be zeros on the right side of the mantissa, gcd compression will help save some bits:

  • less than 256 unique values: storing numbers as a float or as a long doesn't matter since we would use table compression in both cases
  • more than 256 unique values between 1 and 1000: an int would require 10 bits per value, rounded to 12 for retrieval efficiency, while a float would use 17 bits rounded to 20 for retrieval efficiency (+67%)
  • more than 256 unique values between 1 and 100,000: an int would require 17 bits per value, rounded to 20 for retrieval efficiency, while a float would use 24 bits rounded to 24 for retrieval efficiency (+20%)
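
The bit counts in the last two bullets can be reproduced with a simplified model. This is a sketch under stated assumptions, not the actual doc-values codec: it treats each float as its raw 32-bit pattern, divides every value by the gcd of all values, and rounds the result up to the next width in `SUPPORTED_BITS`, which is an assumed list of packed widths. The helper names are made up for this example.

```python
import struct
from functools import reduce
from math import gcd

# Assumed set of bit widths that packed storage rounds up to.
SUPPORTED_BITS = [1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64]

def float32_bits(value):
    """Return the IEEE 754 single-precision bit pattern as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def bits_per_value(encoded):
    """Return (raw bits, rounded bits) after dividing by the common gcd."""
    divisor = reduce(gcd, encoded)
    raw = (max(encoded) // divisor).bit_length()
    rounded = next(b for b in SUPPORTED_BITS if b >= raw)
    return raw, rounded

for upper in (1000, 100000):
    values = range(1, upper + 1)
    as_long = bits_per_value(list(values))
    as_float = bits_per_value([float32_bits(v) for v in values])
    print(upper, as_long, as_float)
# 1000 (10, 12) (17, 20)
# 100000 (17, 20) (24, 24)
```

The trailing zero bits of the mantissa are what the gcd removes: 14 of them for values below 1024, but only 7 for values below 131,072, which is why the relative overhead shrinks as the range grows.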

I'm not sold yet on what we should do, but thought we should have this discussion. Again, note that it would only apply to dynamically mapped fields; integers that are explicitly mapped as integers would remain as efficient as they are today.

@clintongormley clintongormley added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Jan 15, 2016
@clintongormley
Contributor

I think you'd make half the people happy, and the other half unhappy. It's so easy to add a dynamic mapping rule that allows you to add all numeric fields as float should you choose to do so, I'm not sure it's worth the change.
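
For readers wondering what such a rule looks like, here is a minimal sketch that builds the mapping fragment as a Python dict and prints it as JSON. It relies on `match_mapping_type`, which matches on the type detected from the JSON token, so whole numbers are detected as `long`. The helper name and template name are hypothetical, where the fragment sits inside the mapping depends on the Elasticsearch version, and it has not been run against a cluster here.

```python
import json

def numeric_as_float_template(target_type="float"):
    """Build a dynamic-template fragment mapping detected longs to target_type."""
    return {
        "dynamic_templates": [
            {
                # Template name is arbitrary; chosen here for readability.
                "longs_as_" + target_type: {
                    # JSON whole numbers are detected as "long".
                    "match_mapping_type": "long",
                    "mapping": {"type": target_type},
                }
            }
        ]
    }

print(json.dumps(numeric_as_float_template(), indent=2))
```

Passing `"double"` instead of the default trades extra storage for a much larger exact-integer range.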

@lpic10

lpic10 commented Mar 1, 2017

I think this should be done because the defaults should try to favor usability over performance (or storage in this case)

@jpountz
Contributor Author

jpountz commented Aug 1, 2017

This might become more necessary as we are considering rejecting numbers that have a decimal part on integer types: #25861.

@agirbal

agirbal commented Aug 4, 2017

+1, this is very trappy.
Just realized I had some random docs getting dropped due to float vs. long indexing, depending on which type is picked up from the first doc. Often people don't fully control how JSON numbers get serialized.
It would not be realistic to name every field in the mapping, so I'd set up a catch-all rule to force all numeric types to float, but I have to check the docs to see which types can be autodetected, or whether I need to catch every possible numeric type (which would be pretty ugly).
Maybe I'm missing something easy in the docs.
Thanks!

@javanna
Member

javanna commented Mar 16, 2018

@elastic/es-search-aggs

@polyfractal
Contributor

We chatted about this in the search/aggs meeting a little while ago (forgot to update, sorry).

We decided that the breaking change + potential confusion around floating point error made this less than ideal. In our experience, floating point errors are difficult to understand for even relatively savvy users. Especially if we were to map to floats instead of doubles, it was feared many users could be bitten by rounding without understanding what was happening and start seeing strange search results because of it.

Ranges can look very strange when fp rounding errors happen. And while a keyword should be used instead, many people accidentally use the dynamic long for IDs which also tend to be very large and could easily hit fp errors with very strange side effects (returning the wrong users, etc)
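
A small illustration of the ID concern, using Python's built-in `float` (an IEEE 754 double) as a stand-in for a double-mapped field. The IDs and the `users` lookup are made up for this example; the effect appears far sooner with 32-bit floats.

```python
# Two distinct 64-bit IDs just above the exact-integer range of a double.
id_a = 2 ** 53       # 9007199254740992
id_b = 2 ** 53 + 1   # 9007199254740993

# Both collapse to the same floating-point value.
print(float(id_a) == float(id_b))  # True

# A lookup keyed by the converted ID now returns the wrong user.
users = {float(id_a): "alice"}
print(users.get(float(id_b)))      # alice
```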

We felt it would be at least as tricky as truncation errors, so breaking for a different set of hard-to-understand semantics wasn't worth it.

I just realized we didn't discuss the decision made in #25861 to remove coerce however, and how that might affect this issue. I was going to close, but perhaps it should be discussed again. @jpountz thoughts?

@jtibshirani
Contributor

@polyfractal would you be able to clarify how removing coerce might affect the decision in this issue (and require further discussion)?

@polyfractal
Contributor

@jtibshirani I think it was related to Adrien's earlier comment in #16018 (comment)

E.g. if coerce goes away, then any number with a fractional portion (except fractional parts that are zero, like 1.0) will be rejected on integer types. So dynamically mapping all values to float/double makes it more user-friendly in that all values can be indexed by default. If this weren't implemented, I think we'd be in a situation where floats/doubles would have to be explicitly mapped first.

But I'm not positive, that's just my guess based on Adrien's comment. :)

@jtibshirani
Contributor

Thanks for the additional context! To me, even with the coerce option it seems like floats need to be mapped explicitly -- if the first document indexed happens to contain a number without a decimal point, then all subsequent floats will be truncated (which is likely undesirable/ confusing behavior). Hopefully @jpountz will be able to clarify his comment, and we can see if we can close or another discussion is needed.

@jpountz
Contributor Author

jpountz commented May 7, 2019

> if the first document indexed happens to contain a number without a decimal point, then all subsequent floats will be truncated (which is likely undesirable/ confusing behavior)

Agreed! To be sure we are on the same page, the behavior you are describing is our current default behavior.

> And while a keyword should be used instead, many people accidentally use the dynamic long for IDs which also tend to be very large and could easily hit fp errors with very strange side effects (returning the wrong users, etc)

This argument convinces me that we should not map all numbers as floats by default, so I'll close this issue.

@jpountz jpountz closed this as completed May 7, 2019
@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
8 participants