Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid variance computed in extended_stats aggregation #37303

Closed
raphz opened this issue Jan 10, 2019 · 5 comments · Fixed by #37384
Closed

Invalid variance computed in extended_stats aggregation #37303

raphz opened this issue Jan 10, 2019 · 5 comments · Fixed by #37384
Labels

Comments

@raphz
Copy link

raphz commented Jan 10, 2019

Elasticsearch version (bin/elasticsearch --version): 6.1.4

Plugins installed: n/a

JVM version (java -version): 1.8.191

OS version (uname -a if on a Unix-like system): Centos 7.5

Description of the problem including expected versus actual behavior:
In some conditions, the variance computed in an extended_stats aggregation is computed as a negative number that should never append.

The variance is a sum of positive numbers, hence cannot be negative. What makes it negative here is the way it is computed (probably as the difference of two positive numbers here: "sum_of_squares / count" and "avg * avg"). Due to the non-infinite precision of floating point numbers, both numbers are 'almost' the same...

Proposed solution:

At least prevent negative values to appear in the variance: add a "Math.max(0.0, ...)" to the existing computation formula.

Steps to reproduce:

Using the attached zip file, do the following :

  • in env.properties: put the host/port to the elastic-search instance (default: localhost:9200)
  • run ./reproduce.sh my-test-index (or any other index name, beware, this index will be dropped!)

What does it do?

  • drop the given index
  • create it new with a specific mapping: index: long, amount: double
  • load 3 documents with same amount: 49.95
  • ask for an "extended_stats" aggregation on the amounts

Provide logs (if relevant):

{"acknowledged":true}

{"acknowledged":true,"shards_acknowledged":true,"index":"my-test-index"}

{"_index":"my-test-index","_type":"document","_id":"BErfN2gBWLZgAGEHAdb8","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

{"_index":"my-test-index","_type":"document","_id":"BUrfN2gBWLZgAGEHAtY2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

{"_index":"my-test-index","_type":"document","_id":"BkrfN2gBWLZgAGEHAtZd","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

{"_shards":{"total":10,"successful":5,"failed":0}}
{
 "took" : 1,
 "timed_out" : false,
 "_shards" : {
   "total" : 5,
   "successful" : 5,
   "skipped" : 0,
   "failed" : 0
 },
 "hits" : {
   "total" : 3,
   "max_score" : 1.0,
   "hits" : [
     {
       "_index" : "my-test-index",
       "_type" : "document",
       "_id" : "BUrfN2gBWLZgAGEHAtY2",
       "_score" : 1.0,
       "_source" : {
         "amount" : 49.95,
         "index" : 2
       }
     },
     {
       "_index" : "my-test-index",
       "_type" : "document",
       "_id" : "BErfN2gBWLZgAGEHAdb8",
       "_score" : 1.0,
       "_source" : {
         "amount" : 49.95,
         "index" : 1
       }
     },
     {
       "_index" : "my-test-index",
       "_type" : "document",
       "_id" : "BkrfN2gBWLZgAGEHAtZd",
       "_score" : 1.0,
       "_source" : {
         "amount" : 49.95,
         "index" : 3
       }
     }
   ]
 },
 "aggregations" : {
   "amount" : {
     "count" : 3,
     "min" : 49.95,
     "max" : 49.95,
     "avg" : 49.95000000000001,
     "sum" : 149.85000000000002,
     "sum_of_squares" : 7485.0075000000015,
     "variance" : -3.0316490059097606E-13,
     "std_deviation" : "NaN",
     "std_deviation_bounds" : {
       "upper" : "NaN",
       "lower" : "NaN"
     }
   }
 }
}

File used to reproduce:
to-reproduce-it.zip

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

@polyfractal
Copy link
Contributor

Ah yeah, I think you're probably right. Would you like to work on a PR to fix this?

If not, I'm marking this as adopt-me and we'll get it fixed when we have a few spare cycles. Thanks for the bug report!

@polyfractal polyfractal added good first issue low hanging fruit help wanted adoptme labels Jan 10, 2019
@vishnugt
Copy link
Contributor

I would like to work on a PR to fix this.

@polyfractal
Copy link
Contributor

@vishnugt 👍 feel free!

vishnugt added a commit to vishnugt/elasticsearch that referenced this issue Jan 12, 2019
vishnugt added a commit to vishnugt/elasticsearch that referenced this issue Jan 14, 2019
@mysticaltech
Copy link

mysticaltech commented Jun 17, 2020

As mentioned in a previous issue, this is due to catastrophic cancellation, the sum of square method of calculating variances in floating point environment is not appropriate, instead the Welford algorithm has to be used to get proper variance values https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants