
XContentParser shouldn't lose data from floating-point numbers. #46531

Open
jpountz wants to merge 8 commits into main
Conversation

@jpountz (Contributor) commented Sep 10, 2019

Currently the JSON and YAML parsers parse all floating-point numbers as doubles.
This is problematic for numbers with higher precision than a double can represent.
This change proposes to parse as a BigDecimal instead whenever the decimal number
in the JSON is not equal to the string representation of any double.

Closes #46261
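
A minimal sketch of the round-trip check this description refers to, assuming the number's text is already available; this is illustrative only, not the actual PR code:

import java.math.BigDecimal;

// Keep the value as a double only if some double's string representation denotes the
// same decimal number as the JSON text; otherwise fall back to a BigDecimal.
static Number parseFloatingPoint(String text) {
    double nearest = Double.parseDouble(text);   // nearest double to the JSON number
    BigDecimal original = new BigDecimal(text);
    if (Double.isFinite(nearest)
            && new BigDecimal(Double.toString(nearest)).compareTo(original) == 0) {
        return nearest;     // lossless: keep the cheap double representation
    }
    return original;        // a double would lose digits: keep full precision
}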

@jpountz jpountz added >bug :Core/Infra/Core Core issues without another label v8.0.0 labels Sep 10, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

@jpountz (Contributor, Author) commented Sep 10, 2019

Quick note: Elasticsearch may still alter the source when applying filtering; for instance, leading zeroes would still be removed. However, it would no longer alter the accuracy of floating-point numbers.

I did my best to prevent this change from hurting performance too much; however, there might be cases that are significantly affected. For this reason I'm considering only merging into master (8.0).

@@ -105,7 +105,7 @@ And the following may be the response:
         ]
       },
       "avg_monthly_sales": {
-        "value": 328.33333333333333
+        "value": 328.3333333333333
@jpountz (Contributor, Author) commented on this diff:

note to reviewers: this only passed before because parsing was lossy
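
A quick way to see why (illustrative snippet, not part of the PR):

// With lossy parsing, both the old 17-digit literal and the corrected 16-digit one
// were read as the same double, so the old assertion still passed.
double expected = Double.parseDouble("328.33333333333333");
double actual = Double.parseDouble("328.3333333333333");
System.out.println(expected == actual); // true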

@Schwaller

Thank you for working on this! Looks good IMHO :)

@jpountz (Contributor, Author) commented Sep 11, 2019

I've discussed this change with other people at Elastic, here are some thoughts:

  • this should get backported to 7.x since this is a bug fix
  • since this could break some users in ways that are hard to fix, we should introduce a system property that makes Elasticsearch use the old behavior (a rough sketch follows this list)
  • when the system property is set, we should emit deprecation warnings
  • we will need to run some benchmarks to assess the impact of this change
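
A rough sketch of what the escape hatch from the second and third bullets could look like; the property name and the way the warning is emitted are assumptions, not the actual Elasticsearch API:

// Hypothetical property name; the real one would be decided during review.
private static final boolean LOSSY_DOUBLES =
    Boolean.parseBoolean(System.getProperty("es.xcontent.lossy_floating_point", "false"));

static Number readFloatingPoint(String text) {
    if (LOSSY_DOUBLES) {
        // old behavior: always parse as a double, possibly losing precision;
        // a real implementation would emit this through the deprecation logger instead
        System.err.println("[deprecation] es.xcontent.lossy_floating_point is deprecated");
        return Double.parseDouble(text);
    }
    return parseFloatingPoint(text); // precision-preserving path from the earlier sketch
}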

@ebadyano (Contributor) commented Dec 5, 2019

I ran a microbenchmark on the two functions isDouble and slowIsDouble. I ran it on a few number sets, each an array of 100 char arrays with the following number patterns to parse (a rough sketch of the harness follows this list):
Empty: empty char arrays, as a baseline to see the total overhead
Double: numbers with 14 significant digits, in the format 1612253569.7123
Decimal: numbers with 17 significant digits, in the format 1612253569.7123565
DoubleScientific: numbers with 15 significant digits, in the format 1.61225356771235E9
DecimalScientific: numbers with 18 significant digits, in the format 1.61225356971235689E9
DoubleLead0: numbers with 10 significant digits, in the format 0.000001612253569
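
A rough reconstruction of the harness shape (the class name matches the results below; the set generation, annotations, and the isDouble stand-in are assumptions, not the actual benchmark source):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class IsDoubleBenchmark {

    @Param({ "Empty", "Double", "Decimal", "DoubleScientific", "DecimalScientific", "DoubleLead0" })
    public String numberSet;

    private char[][] numbers;

    @Setup
    public void setUp() {
        // 100 char arrays per set, following the patterns listed above (values are illustrative)
        numbers = new char[100][];
        for (int i = 0; i < numbers.length; i++) {
            String s;
            if ("Empty".equals(numberSet)) {
                s = "";
            } else if ("Double".equals(numberSet)) {
                s = "1612253569." + (7123 + i);              // 14 significant digits
            } else if ("Decimal".equals(numberSet)) {
                s = "1612253569.7123" + (565 + i);           // 17 significant digits
            } else if ("DoubleScientific".equals(numberSet)) {
                s = "1.6122535677" + (1235 + i) + "E9";      // 15 significant digits
            } else if ("DecimalScientific".equals(numberSet)) {
                s = "1.6122535697123" + (5689 + i) + "E9";   // 18 significant digits
            } else {
                s = "0.0000016122" + (53569 + i);            // DoubleLead0: 10 significant digits
            }
            numbers[i] = s.toCharArray();
        }
    }

    @Benchmark
    public void isDoubleFastOrg(Blackhole bh) {
        for (char[] n : numbers) {
            bh.consume(isDouble(n, 0, n.length));            // the fast check under test
        }
    }

    // stand-in so the sketch is self-contained; the real benchmark calls the PR's implementation
    private static boolean isDouble(char[] chars, int off, int len) {
        return len > 0;
    }
}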

Benchmark                                (numberSet)   Mode  Cnt      Score    Error   Units

IsDoubleBenchmark.isDoubleFastOrg              Empty  thrpt   25  16223.762 ± 16.001  ops/ms
IsDoubleBenchmark.isDoubleFastOrg             Double  thrpt   25    868.049 ±  0.516  ops/ms
IsDoubleBenchmark.isDoubleFastOrg            Decimal  thrpt   25     11.351 ±  0.114  ops/ms
IsDoubleBenchmark.isDoubleFastOrg   DoubleScientific  thrpt   25     28.283 ±  0.040  ops/ms
IsDoubleBenchmark.isDoubleFastOrg  DecimalScientific  thrpt   25      9.955 ±  0.114  ops/ms
IsDoubleBenchmark.isDoubleFastOrg        DoubleLead0  thrpt   25     17.428 ±  0.133  ops/ms

IsDoubleBenchmark.isDoubleSlowOrg             Double  thrpt   25     28.143 ±  1.209  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg            Decimal  thrpt   25     11.347 ±  0.089  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg   DoubleScientific  thrpt   25     27.812 ±  0.247  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg  DecimalScientific  thrpt   25     10.053 ±  0.077  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg        DoubleLead0  thrpt   25     17.846 ±  0.134  ops/ms

I tried adding one optimization in slowIsDouble, which improves the case for scientific notation:

 static boolean slowIsDouble(char[] chars, int charsOff, int charsLen) {
     try {
         BigDecimal bigDec = new BigDecimal(chars, charsOff, charsLen);
         // short-circuit: assume values with at most 15 significant digits
         // can be represented by a double without losing information
         if (bigDec.precision() <= 15) {
             return true;
         }
         double asDouble = bigDec.doubleValue();
         if (Double.isFinite(asDouble) == false) {
             return false;
         }
...
IsDoubleBenchmark.isDoubleFastOpt              Empty  thrpt   25  16202.767 ± 17.566  ops/ms
IsDoubleBenchmark.isDoubleFastOpt             Double  thrpt   25    867.313 ±  0.735  ops/ms
IsDoubleBenchmark.isDoubleFastOpt            Decimal  thrpt   25     11.312 ±  0.045  ops/ms
IsDoubleBenchmark.isDoubleFastOpt   DoubleScientific  thrpt   25    179.539 ±  0.473  ops/ms
IsDoubleBenchmark.isDoubleFastOpt  DecimalScientific  thrpt   25      9.885 ±  0.042  ops/ms
IsDoubleBenchmark.isDoubleFastOpt        DoubleLead0  thrpt   25    186.776 ±  0.090  ops/ms

@jpountz @danielmitterdorfer FYI

@jpountz (Contributor, Author) commented Dec 6, 2019

@evangoetzman Thanks for running these benchmarks. I reviewed usage of XContentParser#numberType() / XContentParser#numberValue() / XContentParser#objectText() / XContentParser#objectBytes() and I think the only performance-sensitive call sites are the terms queries, and XContent copying, like we do for _source filtering, both at index and search-time.

My reading of the benchmark is that these operations are unlikely to get dramatically slow as long as we remain on the fast path. For instance, a terms query with 10k values could take 1 second just to get parsed. I think that's ok because of the way the PR is built: it only slows things down when values are floating-point numbers, since it delegates to the JSON parser first and only checks whether a BigDecimal is needed when the parser produces a double, so if it produced a long this code would never get called. Floating-point numbers are not a good fit for terms queries anyway. I'm having more trouble reasoning about _source includes, so we'll have to bet that we would follow the fast path most of the time.

Your idea for the optimization is good, but don't we need to check the scale too? If the exponent is extremely high, the value would parse to infinity as a double.

@evangoetzman

@jpountz I don't believe I'm the right person to be tagged on this ticket. You may have intended to tag @ebadyano, who ran the benchmarking tests

@ebadyano (Contributor) commented Dec 6, 2019

@jpountz Right, good catch about the optimization. Yes, we would need to check scale() as well.
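
A sketch of what the corrected shortcut inside slowIsDouble could look like; the exact exponent bounds are an assumption, the real change would pick limits that are safe for the double range:

import java.math.BigDecimal;

// Hypothetical corrected version of the precision shortcut discussed above.
static boolean fitsInDoubleShortcut(BigDecimal bigDec) {
    // adjusted decimal exponent: position of the leading digit relative to the decimal point
    int adjustedExponent = bigDec.precision() - bigDec.scale() - 1;
    // 15 significant decimal digits always survive a round trip through a double,
    // provided the value cannot overflow to infinity or underflow to zero
    return bigDec.precision() <= 15 && adjustedExponent < 300 && adjustedExponent > -300;
}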

@ebadyano ebadyano self-requested a review December 13, 2019 15:23
@ebadyano (Contributor) left a comment

LGTM!

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@Schwaller

Sooooo ... can we get this merged/fixed please?

@Schwaller

Hey guys, hope you are doing fine in this tricky time! Would it be possible to schedule a fix for this? It seems others are struggling too but don't really use the bug search first ;-) https://discuss.elastic.co/t/bug-big-decimal-value-losed-precision/253105

@arteam arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022
@elasticsearchmachine (Collaborator)

Hi @jpountz, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:14
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-core-infra (Team:Core/Infra)

@Schwaller

Can we make this happen? Our customers need full precision now, and we would basically be forced to do our own implementation of this :/ I'm guessing the probability of a quick fix is low, so how would you suggest working around it?

Labels
>bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team
Development

Successfully merging this pull request may close these issues.

Numbers of the original document lose precision if the query contains excludes