
XContentParser shouldn't lose data from floating-point numbers. #46531

Open
jpountz wants to merge 8 commits into main
Conversation

@jpountz (Contributor) commented Sep 10, 2019

Currently the JSON and YAML parsers parse all floating-point numbers as doubles.
This is problematic for numbers with higher precision than a double can represent.
This change proposes to parse as a BigDecimal instead whenever the decimal number
in the JSON is not equal to the string representation of any double.

Closes #46261
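
A minimal sketch of the round-trip check this description refers to, assuming the number's text is already available; this is illustrative only, not the actual PR code:

import java.math.BigDecimal;

// Keep the value as a double only if some double's string representation denotes the
// same decimal number as the JSON text; otherwise fall back to a BigDecimal.
static Number parseFloatingPoint(String text) {
    double nearest = Double.parseDouble(text);   // nearest double to the JSON number
    BigDecimal original = new BigDecimal(text);
    if (Double.isFinite(nearest)
            && new BigDecimal(Double.toString(nearest)).compareTo(original) == 0) {
        return nearest;     // lossless: keep the cheap double representation
    }
    return original;        // a double would lose digits: keep full precision
}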

@jpountz jpountz added >bug :Core/Infra/Core Core issues without another label v8.0.0 labels Sep 10, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

@jpountz (Contributor, Author) commented Sep 10, 2019

Quick note: Elasticsearch may still alter the source when applying filtering; for instance, leading zeroes would still be removed. However, it would no longer alter the accuracy of floating-point numbers.

I did my best to prevent this change from hurting performance too much; however, there might be cases that are significantly affected. For this reason I'm considering only merging into master (8.0).

@@ -105,7 +105,7 @@ And the following may be the response:
         ]
       },
       "avg_monthly_sales": {
-        "value": 328.33333333333333
+        "value": 328.3333333333333
@jpountz (Contributor, Author) commented on this diff:

note to reviewers: this only passed before because parsing was lossy
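
A quick way to see why (illustrative snippet, not part of the PR):

// With lossy parsing, both the old 17-digit literal and the corrected 16-digit one
// were read as the same double, so the old assertion still passed.
double expected = Double.parseDouble("328.33333333333333");
double actual = Double.parseDouble("328.3333333333333");
System.out.println(expected == actual); // true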

@Schwaller

Thank you for working on this! Looks good IMHO :)

@jpountz (Contributor, Author) commented Sep 11, 2019

I've discussed this change with other people at Elastic, here are some thoughts:

  • this should get backported to 7.x since this is a bug fix
  • since this could break some users in ways that are hard to fix, we should introduce a system property that makes Elasticsearch use the old behavior (a rough sketch follows this list)
  • when the system property is set, we should emit deprecation warnings
  • we will need to run some benchmarks to assess the impact of this change
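
A rough sketch of what the escape hatch from the second and third bullets could look like; the property name and the way the warning is emitted are assumptions, not the actual Elasticsearch API:

// Hypothetical property name; the real one would be decided during review.
private static final boolean LOSSY_DOUBLES =
    Boolean.parseBoolean(System.getProperty("es.xcontent.lossy_floating_point", "false"));

static Number readFloatingPoint(String text) {
    if (LOSSY_DOUBLES) {
        // old behavior: always parse as a double, possibly losing precision;
        // a real implementation would emit this through the deprecation logger instead
        System.err.println("[deprecation] es.xcontent.lossy_floating_point is deprecated");
        return Double.parseDouble(text);
    }
    return parseFloatingPoint(text); // precision-preserving path from the earlier sketch
}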

@ebadyano (Contributor) commented Dec 5, 2019

I ran a microbenchmark on the two functions isDouble and slowIsDouble. I ran it on a few number sets, each an array of 100 char arrays with the following number patterns to parse (a rough sketch of the harness follows this list):
Empty: empty char arrays, as a baseline to see the total overhead
Double: numbers with 14 significant digits, in the format 1612253569.7123
Decimal: numbers with 17 significant digits, in the format 1612253569.7123565
DoubleScientific: numbers with 15 significant digits, in the format 1.61225356771235E9
DecimalScientific: numbers with 18 significant digits, in the format 1.61225356971235689E9
DoubleLead0: numbers with 10 significant digits, in the format 0.000001612253569
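
A rough reconstruction of the harness shape (the class name matches the results below; the set generation, annotations, and the isDouble stand-in are assumptions, not the actual benchmark source):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class IsDoubleBenchmark {

    @Param({ "Empty", "Double", "Decimal", "DoubleScientific", "DecimalScientific", "DoubleLead0" })
    public String numberSet;

    private char[][] numbers;

    @Setup
    public void setUp() {
        // 100 char arrays per set, following the patterns listed above (values are illustrative)
        numbers = new char[100][];
        for (int i = 0; i < numbers.length; i++) {
            String s;
            if ("Empty".equals(numberSet)) {
                s = "";
            } else if ("Double".equals(numberSet)) {
                s = "1612253569." + (7123 + i);              // 14 significant digits
            } else if ("Decimal".equals(numberSet)) {
                s = "1612253569.7123" + (565 + i);           // 17 significant digits
            } else if ("DoubleScientific".equals(numberSet)) {
                s = "1.6122535677" + (1235 + i) + "E9";      // 15 significant digits
            } else if ("DecimalScientific".equals(numberSet)) {
                s = "1.6122535697123" + (5689 + i) + "E9";   // 18 significant digits
            } else {
                s = "0.0000016122" + (53569 + i);            // DoubleLead0: 10 significant digits
            }
            numbers[i] = s.toCharArray();
        }
    }

    @Benchmark
    public void isDoubleFastOrg(Blackhole bh) {
        for (char[] n : numbers) {
            bh.consume(isDouble(n, 0, n.length));            // the fast check under test
        }
    }

    // stand-in so the sketch is self-contained; the real benchmark calls the PR's implementation
    private static boolean isDouble(char[] chars, int off, int len) {
        return len > 0;
    }
}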

Benchmark                                (numberSet)   Mode  Cnt      Score    Error   Units

IsDoubleBenchmark.isDoubleFastOrg              Empty  thrpt   25  16223.762 ± 16.001  ops/ms
IsDoubleBenchmark.isDoubleFastOrg             Double  thrpt   25    868.049 ±  0.516  ops/ms
IsDoubleBenchmark.isDoubleFastOrg            Decimal  thrpt   25     11.351 ±  0.114  ops/ms
IsDoubleBenchmark.isDoubleFastOrg   DoubleScientific  thrpt   25     28.283 ±  0.040  ops/ms
IsDoubleBenchmark.isDoubleFastOrg  DecimalScientific  thrpt   25      9.955 ±  0.114  ops/ms
IsDoubleBenchmark.isDoubleFastOrg        DoubleLead0  thrpt   25     17.428 ±  0.133  ops/ms

IsDoubleBenchmark.isDoubleSlowOrg             Double  thrpt   25     28.143 ±  1.209  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg            Decimal  thrpt   25     11.347 ±  0.089  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg   DoubleScientific  thrpt   25     27.812 ±  0.247  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg  DecimalScientific  thrpt   25     10.053 ±  0.077  ops/ms
IsDoubleBenchmark.isDoubleSlowOrg        DoubleLead0  thrpt   25     17.846 ±  0.134  ops/ms

I tried adding one optimization in slowIsDouble, which improves the case for scientific notation:

 static boolean slowIsDouble(char[] chars, int charsOff, int charsLen) {
     try {
         BigDecimal bigDec = new BigDecimal(chars, charsOff, charsLen);
         // short-circuit: assume values with at most 15 significant digits
         // can be represented by a double without losing information
         if (bigDec.precision() <= 15) {
             return true;
         }
         double asDouble = bigDec.doubleValue();
         if (Double.isFinite(asDouble) == false) {
             return false;
         }
...
IsDoubleBenchmark.isDoubleFastOpt              Empty  thrpt   25  16202.767 ± 17.566  ops/ms
IsDoubleBenchmark.isDoubleFastOpt             Double  thrpt   25    867.313 ±  0.735  ops/ms
IsDoubleBenchmark.isDoubleFastOpt            Decimal  thrpt   25     11.312 ±  0.045  ops/ms
IsDoubleBenchmark.isDoubleFastOpt   DoubleScientific  thrpt   25    179.539 ±  0.473  ops/ms
IsDoubleBenchmark.isDoubleFastOpt  DecimalScientific  thrpt   25      9.885 ±  0.042  ops/ms
IsDoubleBenchmark.isDoubleFastOpt        DoubleLead0  thrpt   25    186.776 ±  0.090  ops/ms

@jpountz @danielmitterdorfer FYI

@jpountz (Contributor, Author) commented Dec 6, 2019

@evangoetzman Thanks for running these benchmarks. I reviewed usage of XContentParser#numberType() / XContentParser#numberValue() / XContentParser#objectText() / XContentParser#objectBytes() and I think the only performance-sensitive call sites are the terms queries, and XContent copying, like we do for _source filtering, both at index and search-time.

My reading of the benchmark is that these operations are unlikely to get dramatically slow as long as we remain on the fast path. For instance, a terms query with 10k values could take 1 second just to get parsed. I think that's ok because of the way the PR is built: it only slows things down when values are floating-point numbers, since it delegates to the JSON parser first and only checks whether a BigDecimal is needed when the parser produces a double, so if it produced a long this code would never get called. Floating-point numbers are not a good fit for terms queries anyway. I'm having more trouble reasoning about _source includes, so we'll have to bet that we would follow the fast path most of the time.

Your idea for the optimization is good, but don't we need to check the scale too? If the exponent is extremely high, the value would parse to infinity as a double.

@evangoetzman

@jpountz I don't believe I'm the right person to be tagged on this ticket. You may have intended to tag @ebadyano, who ran the benchmarking tests

@ebadyano (Contributor) commented Dec 6, 2019

@jpountz Right, good catch about the optimization. Yes, we would need to check scale() as well.
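
A sketch of what the corrected shortcut inside slowIsDouble could look like; the exact exponent bounds are an assumption, the real change would pick limits that are safe for the double range:

import java.math.BigDecimal;

// Hypothetical corrected version of the precision shortcut discussed above.
static boolean fitsInDoubleShortcut(BigDecimal bigDec) {
    // adjusted decimal exponent: position of the leading digit relative to the decimal point
    int adjustedExponent = bigDec.precision() - bigDec.scale() - 1;
    // 15 significant decimal digits always survive a round trip through a double,
    // provided the value cannot overflow to infinity or underflow to zero
    return bigDec.precision() <= 15 && adjustedExponent < 300 && adjustedExponent > -300;
}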

@ebadyano ebadyano self-requested a review December 13, 2019 15:23
@ebadyano (Contributor) left a comment

LGTM!

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@Schwaller

Sooooo ... can we get this merged/fixed please?

@Schwaller

Hey guys, hope you are doing fine in this tricky time! Would it be possible to schedule a fix for this? It seems others are struggling too but don't really use the bug search first ;-) https://discuss.elastic.co/t/bug-big-decimal-value-losed-precision/253105

@arteam arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022
@elasticsearchmachine (Collaborator)

Hi @jpountz, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:14
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-core-infra (Team:Core/Infra)

@Schwaller

Can we make this happen? Our customers need full precision now, and we would basically be forced to do our own implementation of this :/ I'm guessing the probability of a quick fix is low, so how would you suggest working around it?

Labels
>bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team
Development

Successfully merging this pull request may close these issues.

Numbers of the original document lose precision if the query contains excludes