Dynamically map all numerics to floats by default? #16018

jpountz · 2016-01-15T15:22:29Z

Elasticsearch assumes that if a number contains a dot, then it should be mapped as a floating-point number (double in 2.x and float in master) and otherwise as a long. But this is quite trappy, as it means we expect floating-point numbers to be consistently serialized with a dot (see e.g. https://twitter.com/bitemyapp/status/687415657651154944 or #15961).
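The trap is easy to reproduce outside Elasticsearch. Below is a minimal Python sketch, assuming a hypothetical helper `guess_dynamic_type` that mimics the rule described above by leaning on Python's own JSON parser. It is an illustration only, not Elasticsearch code.

```python
import json


def guess_dynamic_type(json_number: str) -> str:
    """Hypothetical stand-in for the rule described above: a numeric token
    that the JSON parser reads as a float maps to "float", anything else
    maps to "long"."""
    parsed = json.loads(json_number)
    return "float" if isinstance(parsed, float) else "long"


# The same logical value, as two different clients might serialize it.
print(guess_dynamic_type("1.0"))  # float
print(guess_dynamic_type("1"))    # long

# Python's serializer keeps the dot for a float, but not every serializer does.
print(json.dumps(1.0))  # 1.0
```

JavaScript's `JSON.stringify(1.0)`, for example, emits `1`, so a field fed by such a client can be detected as a long even though the application treats it as floating point.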

Instead, we could map all numerics to floats by default (but you could still use dynamic templates to override it if you want). This would have two drawbacks:

floats can only represent integer values accurately up to 2^24 (~16M)
it would increase storage requirements
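The 2^24 limit can be checked with a short Python sketch. The helper name `to_float32` is an assumption made for this example; it round-trips a value through IEEE 754 single precision using the standard `struct` module.

```python
import struct


def to_float32(value: int) -> float:
    """Round-trip a value through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


limit = 2 ** 24  # 16,777,216

print(to_float32(limit) == limit)          # True
print(int(to_float32(limit + 1)))          # 16777216 (the odd neighbor is lost)
print(to_float32(limit + 2) == limit + 2)  # True (only even integers survive here)
```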

I ran some simulations to see how much worse it would be to store integers as floating-point numbers. The good news is that since most bits will be zeros on the right side of the mantissa, gcd compression will help save some bits:

less than 256 unique values: storing numbers as a float or as a long doesn't matter since we would use table compression in both cases
more than 256 unique values between 1 and 1000: an int would require 10 bits per value, rounded to 12 for retrieval efficiency, while a float would use 17 bits rounded to 20 for retrieval efficiency (+67%)
more than 256 unique values between 1 and 100,000: an int would require 17 bits per value, rounded to 20 for retrieval efficiency, while a float would use 24 bits rounded to 24 for retrieval efficiency (+20%)
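The unrounded bit counts above can be approximated with a simplified model. The Python sketch below assumes that every integer in the range occurs, that floats are stored as their raw 32-bit patterns, and that the only compression is dividing by the greatest common divisor of all values. The actual simulation may have worked differently, and the helper names are made up for this example.

```python
import struct
from functools import reduce
from math import gcd


def float32_bits(value: float) -> int:
    """Raw IEEE 754 single-precision bit pattern as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_per_value(encoded) -> int:
    """Bits needed for the largest value once the common gcd is divided out."""
    common = reduce(gcd, encoded)
    return (max(encoded) // common).bit_length()


for upper in (1_000, 100_000):
    ints = list(range(1, upper + 1))
    as_long = bits_per_value(ints)
    as_float = bits_per_value([float32_bits(float(v)) for v in ints])
    print(upper, as_long, as_float)
# 1000 10 17
# 100000 17 24
```

The trailing zero bits of the mantissa are what make the gcd large for floats: 2^14 for the first range and 2^7 for the second under this model.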

I'm not sold yet about what we should do but thought we should have this discussion. Again, note that it would only apply to dynamically mapped fields, integers that are mapped as integers would remain as efficient as they are today.

clintongormley · 2016-01-15T16:28:30Z

I think you'd make half the people happy, and the other half unhappy. It's so easy to add a dynamic mapping rule that allows you to add all numeric fields as float should you choose to do so, I'm not sure it's worth the change.
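For reference, such a rule could look roughly like the sketch below, written as a Python dict that would be sent as the body of an index-creation request. The template name `numbers_as_float` is made up, and the layout assumes a typeless mapping; on releases that still use mapping types, `dynamic_templates` sits under the type name instead.

```python
import json

# Hypothetical index-creation body. "match_mapping_type": "long" targets
# JSON numbers that were detected as integers; swap "float" for "double"
# if more precision is wanted.
body = {
    "mappings": {
        "dynamic_templates": [
            {
                "numbers_as_float": {
                    "match_mapping_type": "long",
                    "mapping": {"type": "float"},
                }
            }
        ]
    }
}

print(json.dumps(body, indent=2))
```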

lpic10 · 2017-03-01T23:23:36Z

I think this should be done because the defaults should try to favor usability over performance (or storage in this case)

jpountz · 2017-08-01T09:00:27Z

This might become more necessary as we are considering rejecting numbers that have a decimal part on integer types: #25861.

agirbal · 2017-08-04T16:17:55Z

+1, this is very trappy.
Just realized I had some random docs getting dropped due to float vs long indexing, depending on which type is picked up on the 1st doc. Many times people don't fully control how the JSON numbers get serialized.
It would not be realistic to name all fields in mapping, so I'd set up some catch all number types to force all to floats, but I have to look into docs which types are possibly autodetected or if I need to catch all possible numeric types (which would be pretty ugly).
Maybe I'm missing something easy in the docs.
Thanks!

javanna · 2018-03-16T10:32:46Z

@elastic/es-search-aggs

polyfractal · 2018-11-19T20:04:56Z

We chatted about this in the search/aggs meeting a little while ago (forgot to update, sorry).

We decided that the breaking change + potential confusion around floating point error made this less than ideal. In our experience, floating point errors are difficult to understand for even relatively savvy users. Especially if we were to map to floats instead of doubles, it was feared many users could be bitten by rounding without understanding what was happening and start seeing strange search results because of it.

Ranges can look very strange when fp rounding errors happen. And while a keyword should be used instead, many people accidentally use the dynamic long for IDs which also tend to be very large and could easily hit fp errors with very strange side effects (returning the wrong users, etc)
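The ID scenario is easy to illustrate. In the Python sketch below, the user IDs and names are made up, and `to_float32` is a helper written for this example that round-trips a value through single precision.

```python
import struct


def to_float32(value: int) -> float:
    """Round-trip a value through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


# Two distinct (made-up) user IDs that collapse to one float32 value.
users = {123456789: "alice", 123456790: "bob"}

index = {}
for user_id, name in users.items():
    index.setdefault(to_float32(user_id), []).append(name)

print(index)  # {123456792.0: ['alice', 'bob']}

# Doubles only push the problem further out: above 2^53 neighbors collide too.
print(float(2 ** 53 + 1) == float(2 ** 53))  # True
```

A lookup for either ID would match both users, which is the kind of side effect described above.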

We felt it would be at least as tricky as truncation errors, so breaking for a different set of hard-to-understand semantics wasn't worth it.

I just realized we didn't discuss the decision made in #25861 to remove coerce however, and how that might affect this issue. I was going to close, but perhaps it should be discussed again. @jpountz thoughts?

jtibshirani · 2019-05-01T18:17:08Z

@polyfractal would you be able to clarify how removing coerce might affect the decision in this issue (and require further discussion)?

polyfractal · 2019-05-01T20:11:18Z

@jtibshirani I think it was related to Adrien's earlier comment in #16018 (comment)

E.g. if coerce goes away, then any number with a fractional portion (except fractional parts that are zero, like 1.0) will be rejected. So dynamically mapping all values to float/double makes it more user-friendly, in that all values can be indexed by default. If this wasn't implemented, I think we'd be in a situation where floats/doubles would have to be explicitly mapped first.

But I'm not positive, that's just my guess based on Adrien's comment. :)

jtibshirani · 2019-05-01T20:58:49Z

Thanks for the additional context! To me, even with the coerce option it seems like floats need to be mapped explicitly: if the first document indexed happens to contain a number without a decimal point, then all subsequent floats will be truncated (which is likely undesirable/confusing behavior). Hopefully @jpountz will be able to clarify his comment, and we can see whether we can close or another discussion is needed.
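The order dependence can be shown with a toy model. The Python sketch below is not Elasticsearch code; `index_all` is a hypothetical function that fixes the field type from the first value, then truncates later fractional values when `coerce` is on and rejects them when it is off.

```python
def index_all(values, coerce=True):
    """Toy model: the first value fixes the field type; later values must fit."""
    field_type = "float" if isinstance(values[0], float) else "long"
    stored = []
    for value in values:
        if field_type == "long":
            if isinstance(value, float) and not value.is_integer() and not coerce:
                raise ValueError(f"{value} has a decimal part")
            stored.append(int(value))  # int() truncates any fractional part
        else:
            stored.append(float(value))
    return field_type, stored


print(index_all([1, 2.5, 3.7]))  # ('long', [1, 2, 3])
print(index_all([1.0, 2.5, 3]))  # ('float', [1.0, 2.5, 3.0])
```

The same three documents give different stored values depending only on which one arrives first.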

jpountz · 2019-05-07T07:19:34Z

if the first document indexed happens to contain a number without a decimal point, then all subsequent floats will be truncated (which is likely undesirable/ confusing behavior)

Agreed! To be sure we are on the same page, the behavior you are describing is our current default behavior.

And while a keyword should be used instead, many people accidentally use the dynamic long for IDs which also tend to be very large and could easily hit fp errors with very strange side effects (returning the wrong users, etc)

This argument convinces me that we should not map all numbers as floats by default, so I'll close this issue.

jpountz added the discuss label Jan 15, 2016

clintongormley added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Jan 15, 2016

jpountz mentioned this issue Mar 24, 2016

Dynamic mappings fail when a single document generates inconsistent mapping updates #15377

Open

colings86 added >enhancement >breaking labels Apr 24, 2018

polyfractal removed the discuss label Nov 15, 2018

jpountz closed this as completed May 7, 2019

javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
