
New type of aggregation #9791

Closed
zmathe opened this issue Feb 20, 2015 · 9 comments
zmathe commented Feb 20, 2015

Hello,
I am using Elasticsearch 1.4.2 and Kibana 4. I cannot get the correct plots, because the following aggregation exists in neither Elasticsearch nor Kibana.
Given time periods T1, T2, T3, ..., Tn, each period contains an element of the form {T1, d1:v1, d2:v2, ..., dn:vn}, where the d:v entries are key-value pairs. For example: {"09:00", "Product":"a", "Color":"red", "quantity":1}, {"09:01", "Product":"c", "Color":"blue", "quantity":5}.

I want to plot time (X axis) against quantity (Y axis). The current Elasticsearch and Kibana behaviour returns the "wrong" values in my case.

The system automatically splits the T1..Tn time interval into bucket[1], bucket[2], ..., bucket[m] (bucket[1] >= T1 and bucket[m] <= Tn), and each bucket sums all the periods that fall inside it:

bucket[1] = SUM(T1) + SUM(T2)
bucket[2] = SUM(T3) + SUM(T4)
...
bucket[m] = SUM(Tn-1) + SUM(Tn)

This aggregation is wrong for me. I would like to have the average of the per-period sums instead:
bucket[1] = AVG(SUM(T1), SUM(T2))
bucket[2] = AVG(SUM(T3), SUM(T4))
...
bucket[m] = AVG(SUM(Tn-1), SUM(Tn))

For example:

screen shot 2015-02-18 at 18 13 53

The bucket length is 18 minutes: we get the buckets 09:00, 16:12 and 16:30.
09:00 SUM(T1) + SUM(T2) = 1 + 1 + 1 + 1 + 1 = 5
16:12 SUM(T3) = 1 + 2 + 2 + 2 = 7
16:30 SUM(T4) = 1

What I want to have:
09:00 AVG(SUM(T1), SUM(T2)) = (2 + 3)/2 = 2.5
16:12 AVG(SUM(T3)) = 7/1 = 7
16:30 AVG(SUM(T4)) = 1
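To make the desired behaviour concrete, here is a minimal client-side sketch of the aggregation I am asking for (plain JS; `avgOfSums` is my own hypothetical helper, not an ES or Kibana feature): first sum the quantities per exact timestamp (the T1..Tn periods), then average those per-timestamp sums within each histogram bucket.

```javascript
// Sketch of the requested "avg of sums" aggregation (illustrative only).
// events: [{time: ISO-8601 string, quantity: number}], bucketMs: bucket length in ms.
function avgOfSums(events, bucketMs) {
  // 1. Sum quantities per exact timestamp (one entry per period T1..Tn).
  const sums = new Map();
  for (const e of events) {
    const t = Date.parse(e.time);
    sums.set(t, (sums.get(t) || 0) + e.quantity);
  }
  // 2. Group the per-timestamp sums into histogram buckets.
  const buckets = new Map();
  for (const [t, s] of sums) {
    const b = Math.floor(t / bucketMs) * bucketMs;
    if (!buckets.has(b)) buckets.set(b, []);
    buckets.get(b).push(s);
  }
  // 3. Average the sums inside each bucket.
  const result = {};
  for (const [b, vals] of buckets) {
    result[new Date(b).toISOString()] =
      vals.reduce((a, v) => a + v, 0) / vals.length;
  }
  return result;
}
```

With the sample dataset below and `bucketMs = 1080000` (18 minutes), this yields 2.5 for the 09:00 bucket, 7 for 16:12 and 1 for 16:30, exactly as in the hand-computed example above.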

I provide an example that is easy to reproduce using Kibana 4 and the latest Elasticsearch:
- Insert the following data into Elasticsearch (I used Sense from Marvel):
POST /example/transactions/_bulk
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:30:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}

- Make the query using Kibana 4:
    screen shot 2015-02-18 at 17 31 43

The query sent to Elasticsearch is:

{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [{
                        "range": {
                            "time": {
                                "gte": 1423399451544,
                                "lte": 1423631917911
                            }
                        }
                    }],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "3": {
            "date_histogram": {
                "field": "time",
                "interval": "3600000ms",
                "min_doc_count": 1,
                "extended_bounds": {
                    "min": 1423399451544,
                    "max": 1423631917911
                }
            },
            "aggs": {
                "4": {
                    "terms": {
                        "field": "Product",
                        "size": 0,
                        "order": {
                            "1": "desc"
                        }
                    },
                    "aggs": {
                        "1": {
                            "sum": {
                                "field": "quantity"
                            }
                        }
                    }
                }
            }
        }
    },
    "fields": ["*", "_source"],
    "script_fields": {
        "time": {
            "script": "if (doc[\"time\"].value == 0) { null } else { doc[\"time\"].value }"
        }
    }
}

I made a simple web page where I can make queries to ES and the result of the query is plotted. I changed the query as follows in order to make the "correct" plot.

{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "time": {
                                   "gte": 1423147775940,
                                    "lte": 1423752575940
                                }
                            }
                        }
                    ],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "size": 0,
                "field": "Product"
            },
            "aggs": {
                "end_data": {
                    "date_histogram": {
                        "field": "time",
                        "interval": "1080000ms",
                        "min_doc_count": 1,
                        "extended_bounds": {
                           "min":1423147775940,
                           "max":1423752575940
                        }
                    },
                    "aggs": {
                        "times": {
                            "terms": {
                                "field": "time"
                            },
                            "aggs": {
                                "total_jobs": {
                                    "sum": {
                                        "field": "quantity"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "fields": [
        "*",
        "_source"
    ],
    "script_fields": {
        "time": {
            "script": "if (doc[\"time\"].value == 0) { null } else { doc[\"time\"].value }"
        }
    }
}

I computed the averages in JS and got the plot I want:
screen shot 2015-02-18 at 17 36 35
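The client-side post-processing can be sketched like this (illustrative; `averagePerBucket` is a made-up name, but the structure mirrors the response of the second query): for every product and every date_histogram bucket, average the `total_jobs` values of the inner `times` sub-buckets.

```javascript
// Post-process the terms -> date_histogram -> times -> total_jobs response:
// average the per-timestamp sums inside each (product, histogram bucket) pair.
function averagePerBucket(response) {
  const rows = [];
  for (const product of response.aggregations["2"].buckets) {
    for (const bucket of product.end_data.buckets) {
      // One sum per distinct timestamp inside this histogram bucket.
      const sums = bucket.times.buckets.map(t => t.total_jobs.value);
      const avg = sums.reduce((a, v) => a + v, 0) / sums.length;
      rows.push({ time: bucket.key_as_string, product: product.key, avg });
    }
  }
  return rows;
}
```

For product "a" in the 09:00 bucket this averages the sums 3 and 2 into 2.5, which is the value I plot.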

Is it possible to support this kind of aggregation?

Thank you,
Zoltan


zmathe commented Feb 20, 2015

Let me know if you need more information. I can provide a real example.


zmathe commented Feb 27, 2015

Here is a real example to better illustrate the problem. I would like to migrate part of our monitoring system, which is based on MySQL and matplotlib, to Elasticsearch and Kibana.
The first plot is the correct one, generated using MySQL.
screen shot 2015-02-19 at 18 00 59

The equivalent plot in elasticsearch is:
screen shot 2015-02-19 at 17 57 26

As you can see, the number of jobs on the Y axis is ~7 million, which is not correct (the bucket size is 3 hours).

I implemented a simple tool where I can execute a query and see the result.
screen shot 2015-02-19 at 18 18 56

As you can see, with an extra step I can get the correct plot. But this means I cannot use Kibana.

Thanks,
Zoltan

clintongormley (Contributor) commented:

@colings86 this sounds like it would fit into the new aggs features?

colings86 (Contributor) commented:

@clintongormley yep this would fit into the new agg features in #9876, although it would probably need to be done through the scriptable implementation.

Closing in favour of that issue (#9876)


zmathe commented Mar 2, 2015

Thank you very much for adding this to the new aggregation framework!


bleskes commented Mar 3, 2015

@zmathe I may be totally missing something, but in your aggs examples I see that the Kibana version does a date histogram first, then a product terms agg, and then sums the quantity.

The request under

> I made a simple web page where I can make queries to ES and the result of the query is plotted. I changed the query as follows in order to make the "correct" plot.

is different: it does the terms agg first, then the date histogram, and then sums the quantity. I wonder if just changing the agg order in Kibana would also give you what you want?


zmathe commented Mar 3, 2015

@bleskes Changing the order does not work in Kibana, because there is an extra step (calculating the averages) which cannot be done in ES; I had to do this extra step on the client side (in JS).
Changing the order gives the correct buckets, but I do not want to sum the quantity, I want the average. The following short example explains the problem.
The dataset I use:

POST /example/transactions/_bulk
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:30:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}

Kibana uses the following query to plot the quantity over time, grouped by product:

GET /example/_search/?search_type=count
{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [{
                        "range": {
                            "time": {
                                "gte": 1423399451544,
                                "lte": 1423631917911
                            }
                        }
                    }],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "3": {
            "date_histogram": {
                "field": "time",
                "interval": "3600000ms",
                "min_doc_count": 1,
                "extended_bounds": {
                    "min": 1423399451544,
                    "max": 1423631917911
                }
            },
            "aggs": {
                "4": {
                    "terms": {
                        "field": "Product",
                        "size": 0,
                        "order": {
                            "1": "desc"
                        }
                    },
                    "aggs": {
                        "1": {
                            "sum": {
                                "field": "quantity"
                            }
                        }
                    }
                }
            }
        }
    }
}

The result of this query:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 10,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "3": {
         "buckets": [
            {
               "4": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "1": {
                           "value": 5
                        },
                        "key": "a",
                        "doc_count": 5
                     }
                  ]
               },
               "key_as_string": "2015-02-09T09:00:00.000Z",
               "key": 1423472400000,
               "doc_count": 5
            },
            {
               "4": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "1": {
                           "value": 8
                        },
                        "key": "b",
                        "doc_count": 5
                     }
                  ]
               },
               "key_as_string": "2015-02-09T16:00:00.000Z",
               "key": 1423497600000,
               "doc_count": 5
            }
         ]
      }
   }
}

Changing the aggs order, the query becomes:

GET /example/_search/?search_type=count
{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "time": {
                                   "gte": 1423147775940,
                                    "lte": 1423752575940
                                }
                            }
                        }
                    ],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "size": 0,
                "field": "Product"
            },
            "aggs": {
                "end_data": {
                    "date_histogram": {
                        "field": "time",
                        "interval": "1080000ms",
                        "min_doc_count": 1,
                        "extended_bounds": {
                           "min":1423147775940,
                           "max":1423752575940
                        }
                    },
                    "aggs": {
                        "times": {
                            "terms": {
                                "field": "time"
                            },
                            "aggs": {
                                "total_jobs": {
                                    "sum": {
                                        "field": "quantity"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

The result of this query is:
{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 10,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "2": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "a",
               "doc_count": 5,
               "end_data": {
                  "buckets": [
                     {
                        "key_as_string": "2015-02-09T09:00:00.000Z",
                        "key": 1423472400000,
                        "doc_count": 5,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423473300000,
                                 "key_as_string": "2015-02-09T09:15:00.000Z",
                                 "doc_count": 3,
                                 "total_jobs": {
                                    "value": 3
                                 }
                              },
                              {
                                 "key": 1423472400000,
                                 "key_as_string": "2015-02-09T09:00:00.000Z",
                                 "doc_count": 2,
                                 "total_jobs": {
                                    "value": 2
                                 }
                              }
                           ]
                        }
                     }
                  ]
               }
            },
            {
               "key": "b",
               "doc_count": 5,
               "end_data": {
                  "buckets": [
                     {
                        "key_as_string": "2015-02-09T16:12:00.000Z",
                        "key": 1423498320000,
                        "doc_count": 4,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423498500000,
                                 "key_as_string": "2015-02-09T16:15:00.000Z",
                                 "doc_count": 4,
                                 "total_jobs": {
                                    "value": 7
                                 }
                              }
                           ]
                        }
                     },
                     {
                        "key_as_string": "2015-02-09T16:30:00.000Z",
                        "key": 1423499400000,
                        "doc_count": 1,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423499400000,
                                 "key_as_string": "2015-02-09T16:30:00.000Z",
                                 "doc_count": 1,
                                 "total_jobs": {
                                    "value": 1
                                 }
                              }
                           ]
                        }
                     }
                  ]
               }
            }
         ]
      }
   }
}

As you can see, in this result the top-level buckets are 'a' and 'b', while in the previous example they are 2015-02-09T09:00:00.000Z and 2015-02-09T16:00:00.000Z.

In the last example, in bucket 'a', I had to compute the averages myself (because ES cannot calculate them):
{
               "key": "a",
               "doc_count": 5,
               "end_data": {
                  "buckets": [
                     {
                        "key_as_string": "2015-02-09T09:00:00.000Z",
                        "key": 1423472400000,
                        "doc_count": 5,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423473300000,
                                 "key_as_string": "2015-02-09T09:15:00.000Z",
                                 "doc_count": 3,
                                 "total_jobs": {
                                    "value": 3
                                 }
                              },
                              {
                                 "key": 1423472400000,
                                 "key_as_string": "2015-02-09T09:00:00.000Z",
                                 "doc_count": 2,
                                 "total_jobs": {
                                    "value": 2
                                 }
                              }
                           ]
                        }
                     }
                  ]
               }
            }

I get: a -> 2015-02-09T09:00:00.000Z, and the average of:
 "buckets": [
                              {
                                 "key": 1423473300000,
                                 "key_as_string": "2015-02-09T09:15:00.000Z",
                                 "doc_count": 3,
                                 "total_jobs": {
                                    "value": 3
                                 }
                              },
                              {
                                 "key": 1423472400000,
                                 "key_as_string": "2015-02-09T09:00:00.000Z",
                                 "doc_count": 2,
                                 "total_jobs": {
                                    "value": 2
                                 }
                              }
                           ]
(3 + 2)/2 = 2.5.
After computing the averages I built the correct buckets by reordering the rows:
2015-02-09T09:00:00.000Z, a, 2.5.
However, in the first example (the current ES implementation):
2015-02-09T09:00:00.000Z, a, 5
You can see the difference:

![screen shot 2015-03-03 at 14 27 48](https://cloud.githubusercontent.com/assets/981448/6463479/bca0959e-c1b2-11e4-939b-6fe434bf6563.png)


Because of the avg calculation I cannot use Kibana. I think this would be useful not only for us. Sorry for the long message.
Thanks,
Zoltan


zmathe commented Mar 3, 2015

I am sorry, the screenshot is missing from the previous post:
screen shot 2015-03-03 at 14 27 48

aschokking commented:

We've been running into this exact same problem with summed averages.
