
New type of aggregation #9791

Closed
zmathe opened this issue Feb 20, 2015 · 9 comments
zmathe commented Feb 20, 2015

Hello,
I am using Elasticsearch 1.4.2 and Kibana 4. I cannot get the correct plots, because the following aggregation exists in neither Elasticsearch nor Kibana.
Given time periods T1, T2, T3, ..., Tn, each period contains an element of the form {T1, d1:v1, d2:v2, ..., dn:vn}, where the d:v entries are key-value pairs. For example: {"09:00", "Product":"a", "Color":"red", "quantity":1}, {"09:01", "Product":"c", "Color":"blue", "quantity":5}.

I want to plot time (X axis) against quantity (Y axis). The current Elasticsearch and Kibana behaviour returns the "wrong" values in my case.

The system automatically splits the T1..Tn time interval into bucket[1], bucket[2], ..., bucket[m] (bucket[1] >= T1 and bucket[m] <= Tn), and each bucket sums all the periods that fall inside it:

bucket[1] = SUM(T1) + SUM(T2)
bucket[2] = SUM(T3) + SUM(T4)
...
bucket[m] = SUM(Tn-1) + SUM(Tn)

This aggregation is wrong for me. I would like to have the average of the per-period sums instead:
bucket[1] = AVG(SUM(T1), SUM(T2))
bucket[2] = AVG(SUM(T3), SUM(T4))
...
bucket[m] = AVG(SUM(Tn-1), SUM(Tn))

For example:

screen shot 2015-02-18 at 18 13 53

The bucket length is 18 minutes: we get the buckets 09:00, 16:12 and 16:30.
09:00 SUM(T1) + SUM(T2) = 1 + 1 + 1 + 1 + 1 = 5
16:12 SUM(T3) = 1 + 2 + 2 + 2 = 7
16:30 SUM(T4) = 1

What I want to have:
09:00 AVG(SUM(T1), SUM(T2)) = (2 + 3)/2 = 2.5
16:12 AVG(SUM(T3)) = 7/1 = 7
16:30 AVG(SUM(T4)) = 1
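To make the desired behaviour concrete, here is a minimal client-side sketch of the aggregation I am asking for (plain JS; `avgOfSums` is my own hypothetical helper, not an ES or Kibana feature): first sum the quantities per exact timestamp (the T1..Tn periods), then average those per-timestamp sums within each histogram bucket.

```javascript
// Sketch of the requested "avg of sums" aggregation (illustrative only).
// events: [{time: ISO-8601 string, quantity: number}], bucketMs: bucket length in ms.
function avgOfSums(events, bucketMs) {
  // 1. Sum quantities per exact timestamp (one entry per period T1..Tn).
  const sums = new Map();
  for (const e of events) {
    const t = Date.parse(e.time);
    sums.set(t, (sums.get(t) || 0) + e.quantity);
  }
  // 2. Group the per-timestamp sums into histogram buckets.
  const buckets = new Map();
  for (const [t, s] of sums) {
    const b = Math.floor(t / bucketMs) * bucketMs;
    if (!buckets.has(b)) buckets.set(b, []);
    buckets.get(b).push(s);
  }
  // 3. Average the sums inside each bucket.
  const result = {};
  for (const [b, vals] of buckets) {
    result[new Date(b).toISOString()] =
      vals.reduce((a, v) => a + v, 0) / vals.length;
  }
  return result;
}
```

With the sample dataset below and `bucketMs = 1080000` (18 minutes), this yields 2.5 for the 09:00 bucket, 7 for 16:12 and 1 for 16:30, exactly as in the hand-computed example above.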

I provide an example that is easy to reproduce using Kibana 4 and the latest Elasticsearch:
- Insert the following data into Elasticsearch (I used Sense from Marvel):
POST /example/transactions/_bulk
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:30:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}

- Make the query using Kibana 4:
    screen shot 2015-02-18 at 17 31 43

The query sent to Elasticsearch is:

{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [{
                        "range": {
                            "time": {
                                "gte": 1423399451544,
                                "lte": 1423631917911
                            }
                        }
                    }],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "3": {
            "date_histogram": {
                "field": "time",
                "interval": "3600000ms",
                "min_doc_count": 1,
                "extended_bounds": {
                    "min": 1423399451544,
                    "max": 1423631917911
                }
            },
            "aggs": {
                "4": {
                    "terms": {
                        "field": "Product",
                        "size": 0,
                        "order": {
                            "1": "desc"
                        }
                    },
                    "aggs": {
                        "1": {
                            "sum": {
                                "field": "quantity"
                            }
                        }
                    }
                }
            }
        }
    },
    "fields": ["*", "_source"],
    "script_fields": {
        "time": {
            "script": "if (doc[\"time\"].value == 0) { null } else { doc[\"time\"].value }"
        }
    }
}

I made a simple web page where I can make queries to ES and the result of the query is plotted. I changed the query as follows in order to make the "correct" plot.

{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "time": {
                                   "gte": 1423147775940,
                                    "lte": 1423752575940
                                }
                            }
                        }
                    ],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "size": 0,
                "field": "Product"
            },
            "aggs": {
                "end_data": {
                    "date_histogram": {
                        "field": "time",
                        "interval": "1080000ms",
                        "min_doc_count": 1,
                        "extended_bounds": {
                           "min":1423147775940,
                           "max":1423752575940
                        }
                    },
                    "aggs": {
                        "times": {
                            "terms": {
                                "field": "time"
                            },
                            "aggs": {
                                "total_jobs": {
                                    "sum": {
                                        "field": "quantity"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "fields": [
        "*",
        "_source"
    ],
    "script_fields": {
        "time": {
            "script": "if (doc[\"time\"].value == 0) { null } else { doc[\"time\"].value }"
        }
    }
}

I computed the averages in JS and got the plot I want:
screen shot 2015-02-18 at 17 36 35
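The client-side post-processing can be sketched like this (illustrative; `averagePerBucket` is a made-up name, but the structure mirrors the response of the second query): for every product and every date_histogram bucket, average the `total_jobs` values of the inner `times` sub-buckets.

```javascript
// Post-process the terms -> date_histogram -> times -> total_jobs response:
// average the per-timestamp sums inside each (product, histogram bucket) pair.
function averagePerBucket(response) {
  const rows = [];
  for (const product of response.aggregations["2"].buckets) {
    for (const bucket of product.end_data.buckets) {
      // One sum per distinct timestamp inside this histogram bucket.
      const sums = bucket.times.buckets.map(t => t.total_jobs.value);
      const avg = sums.reduce((a, v) => a + v, 0) / sums.length;
      rows.push({ time: bucket.key_as_string, product: product.key, avg });
    }
  }
  return rows;
}
```

For product "a" in the 09:00 bucket this averages the sums 3 and 2 into 2.5, which is the value I plot.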

Is it possible to support this kind of aggregation?

Thank you,
Zoltan


zmathe commented Feb 20, 2015

Let me know if you need more information. I can provide a real example.


zmathe commented Feb 27, 2015

Here is a real example to better illustrate the problem. I would like to migrate part of our monitoring system, which is based on MySQL and matplotlib, to Elasticsearch and Kibana.
The first plot is the correct one, generated using MySQL.
screen shot 2015-02-19 at 18 00 59

The equivalent plot in elasticsearch is:
screen shot 2015-02-19 at 17 57 26

As you can see, the number of jobs on the Y axis is ~7 million, which is not correct (the bucket size is 3 hours).

I implemented a simple tool where I can execute a query and see the result.
screen shot 2015-02-19 at 18 18 56

As you can see, with an extra step I can get the correct plot. But this means I cannot use Kibana.

Thanks,
Zoltan

clintongormley (Contributor) commented:

@colings86 this sounds like it would fit into the new aggs features?

colings86 (Contributor) commented:

@clintongormley yep this would fit into the new agg features in #9876, although it would probably need to be done through the scriptable implementation.

Closing in favour of that issue (#9876)


zmathe commented Mar 2, 2015

Thank you very much for adding this to the new aggregation framework!


bleskes commented Mar 3, 2015

@zmathe I may be totally missing something, but in your aggs examples I see that the Kibana version does a date histogram first, then a product terms agg, and then sums the quantity.

The request under

> I made a simple web page where I can make queries to ES and the result of the query is plotted. I changed the query as follows in order to make the "correct" plot.

is different: it does the terms agg first, then the date histogram, and then sums the quantity. I wonder if just changing the agg order in Kibana would also give you what you want?


zmathe commented Mar 3, 2015

@bleskes Changing the order does not work in Kibana, because there is an extra step (calculating the averages) which cannot be done in ES; I had to do this extra step on the client side (in JS).
Changing the order gives the correct buckets, but I do not want to sum the quantity, I want the average. The following short example explains the problem.
The dataset I use:

POST /example/transactions/_bulk
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "b", "time": "2015-02-09T16:30:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:00:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time":"2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 1, "Product": "a", "time": "2015-02-09T09:15:00Z"}
{ "index": {}}
{"Color": "red", "quantity": 2, "Product": "b", "time": "2015-02-09T16:15:00Z"}

Kibana uses the following query to plot the quantity over time, grouped by product:

GET /example/_search/?search_type=count
{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [{
                        "range": {
                            "time": {
                                "gte": 1423399451544,
                                "lte": 1423631917911
                            }
                        }
                    }],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "3": {
            "date_histogram": {
                "field": "time",
                "interval": "3600000ms",
                "min_doc_count": 1,
                "extended_bounds": {
                    "min": 1423399451544,
                    "max": 1423631917911
                }
            },
            "aggs": {
                "4": {
                    "terms": {
                        "field": "Product",
                        "size": 0,
                        "order": {
                            "1": "desc"
                        }
                    },
                    "aggs": {
                        "1": {
                            "sum": {
                                "field": "quantity"
                            }
                        }
                    }
                }
            }
        }
    }
}

The result of this query:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 10,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "3": {
         "buckets": [
            {
               "4": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "1": {
                           "value": 5
                        },
                        "key": "a",
                        "doc_count": 5
                     }
                  ]
               },
               "key_as_string": "2015-02-09T09:00:00.000Z",
               "key": 1423472400000,
               "doc_count": 5
            },
            {
               "4": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "1": {
                           "value": 8
                        },
                        "key": "b",
                        "doc_count": 5
                     }
                  ]
               },
               "key_as_string": "2015-02-09T16:00:00.000Z",
               "key": 1423497600000,
               "doc_count": 5
            }
         ]
      }
   }
}

Changing the aggs order, the query becomes:

GET /example/_search/?search_type=count
{
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "*"
                }
            },
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "time": {
                                   "gte": 1423147775940,
                                    "lte": 1423752575940
                                }
                            }
                        }
                    ],
                    "must_not": []
                }
            }
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "size": 0,
                "field": "Product"
            },
            "aggs": {
                "end_data": {
                    "date_histogram": {
                        "field": "time",
                        "interval": "1080000ms",
                        "min_doc_count": 1,
                        "extended_bounds": {
                           "min":1423147775940,
                           "max":1423752575940
                        }
                    },
                    "aggs": {
                        "times": {
                            "terms": {
                                "field": "time"
                            },
                            "aggs": {
                                "total_jobs": {
                                    "sum": {
                                        "field": "quantity"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

The result of this query is:
{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 10,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "2": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "a",
               "doc_count": 5,
               "end_data": {
                  "buckets": [
                     {
                        "key_as_string": "2015-02-09T09:00:00.000Z",
                        "key": 1423472400000,
                        "doc_count": 5,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423473300000,
                                 "key_as_string": "2015-02-09T09:15:00.000Z",
                                 "doc_count": 3,
                                 "total_jobs": {
                                    "value": 3
                                 }
                              },
                              {
                                 "key": 1423472400000,
                                 "key_as_string": "2015-02-09T09:00:00.000Z",
                                 "doc_count": 2,
                                 "total_jobs": {
                                    "value": 2
                                 }
                              }
                           ]
                        }
                     }
                  ]
               }
            },
            {
               "key": "b",
               "doc_count": 5,
               "end_data": {
                  "buckets": [
                     {
                        "key_as_string": "2015-02-09T16:12:00.000Z",
                        "key": 1423498320000,
                        "doc_count": 4,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423498500000,
                                 "key_as_string": "2015-02-09T16:15:00.000Z",
                                 "doc_count": 4,
                                 "total_jobs": {
                                    "value": 7
                                 }
                              }
                           ]
                        }
                     },
                     {
                        "key_as_string": "2015-02-09T16:30:00.000Z",
                        "key": 1423499400000,
                        "doc_count": 1,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423499400000,
                                 "key_as_string": "2015-02-09T16:30:00.000Z",
                                 "doc_count": 1,
                                 "total_jobs": {
                                    "value": 1
                                 }
                              }
                           ]
                        }
                     }
                  ]
               }
            }
         ]
      }
   }
}

As you can see, in this result the top-level buckets are 'a' and 'b', while in the previous example they are 2015-02-09T09:00:00.000Z and 2015-02-09T16:00:00.000Z.

In the last example, in bucket 'a', I had to compute the averages myself (because ES cannot calculate them):
{
               "key": "a",
               "doc_count": 5,
               "end_data": {
                  "buckets": [
                     {
                        "key_as_string": "2015-02-09T09:00:00.000Z",
                        "key": 1423472400000,
                        "doc_count": 5,
                        "times": {
                           "doc_count_error_upper_bound": 0,
                           "sum_other_doc_count": 0,
                           "buckets": [
                              {
                                 "key": 1423473300000,
                                 "key_as_string": "2015-02-09T09:15:00.000Z",
                                 "doc_count": 3,
                                 "total_jobs": {
                                    "value": 3
                                 }
                              },
                              {
                                 "key": 1423472400000,
                                 "key_as_string": "2015-02-09T09:00:00.000Z",
                                 "doc_count": 2,
                                 "total_jobs": {
                                    "value": 2
                                 }
                              }
                           ]
                        }
                     }
                  ]
               }
            }

I get: a -> 2015-02-09T09:00:00.000Z, and the average of:
 "buckets": [
                              {
                                 "key": 1423473300000,
                                 "key_as_string": "2015-02-09T09:15:00.000Z",
                                 "doc_count": 3,
                                 "total_jobs": {
                                    "value": 3
                                 }
                              },
                              {
                                 "key": 1423472400000,
                                 "key_as_string": "2015-02-09T09:00:00.000Z",
                                 "doc_count": 2,
                                 "total_jobs": {
                                    "value": 2
                                 }
                              }
                           ]
(3 + 2)/2 = 2.5.
After computing the averages I built the correct buckets by reordering the rows:
2015-02-09T09:00:00.000Z, a, 2.5.
However, in the first example (the current ES implementation):
2015-02-09T09:00:00.000Z, a, 5
You can see the difference:

![screen shot 2015-03-03 at 14 27 48](https://cloud.githubusercontent.com/assets/981448/6463479/bca0959e-c1b2-11e4-939b-6fe434bf6563.png)


Because of the avg calculation I cannot use Kibana. I think this would be useful not only for us. Sorry for the long message.
Thanks,
Zoltan


zmathe commented Mar 3, 2015

I am sorry, the screenshot is missing from the previous post:
screen shot 2015-03-03 at 14 27 48

aschokking commented:

We've been running into this exact same problem with summed averages.
