range_histogram and date_range_histogram aggregations to help analyse "session" duration type data #23182

Closed
colings86 opened this issue Feb 15, 2017 · 11 comments

@colings86
Contributor

Now that (since 5.2) we support range field types I am wondering if we can use them to help users with the concurrent sessions problem (e.g. https://discuss.elastic.co/t/display-concurrency-in-data-on-kibana/26006/3)

The problem detailed in the post above is that the user is trying to determine, for each 30-second period, how many concurrent phone calls are occurring. This generalises to wanting to analyse how many concurrent 'sessions' are occurring over fixed intervals of time (or potentially some other unit for this axis). By 'session' I mean anything that has a start time and an end time: phone calls, web sessions, calendar meetings or appointments.

The aggregation would work by adding each collected document to all the histogram buckets which fall into the range given by the value of the range field. Currently the range field does not write doc_values when indexing so we will either need to write doc_values or have a different way to retrieve the field values in a columnar way.
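The bucketing rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Elasticsearch implementation: the function name `range_histogram`, the use of plain integers for timestamps, and the choice to treat both ends of a range as inclusive are all assumptions made for the example.

```python
from collections import Counter


def range_histogram(ranges, interval):
    """Count, for each bucket, how many ranges overlap it.

    ranges:   iterable of (start, end) pairs with start <= end
              (both ends treated as inclusive, which is an assumption).
    interval: bucket width; each bucket is keyed by its lower bound.
    """
    counts = Counter()
    for start, end in ranges:
        first = int(start // interval)  # index of the bucket holding start
        last = int(end // interval)     # index of the bucket holding end
        # The document is added to every bucket between first and last.
        for bucket in range(first, last + 1):
            counts[bucket * interval] += 1
    return dict(sorted(counts.items()))


# Three hypothetical phone calls as (start_second, end_second).
calls = [(5, 40), (20, 25), (35, 95)]
print(range_histogram(calls, 30))  # {0: 2, 30: 2, 60: 1, 90: 1}
```

Each call contributes to every 30-second bucket it touches, so the per-bucket count is the number of calls that were active at some point in that bucket.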

The following should be interpreted as thinking out loud and may or may not be useful:

For non-date applications, one (possibly contrived) use case is aggregated metric data. If I were collecting temperature data for every weather station in the UK, I might have one document per day containing the mean and median temperature for the day, plus the minimum and maximum temperature, which I could store in a range field covering the temperatures reported that day. When analysing the data, one useful thing to see would be how many days the temperature was between -10C and 0C, 0C and 10C, 10C and 20C, etc. The range_histogram aggregation would answer this: for each 10C interval, it would tell me on how many days the temperature fell within that interval at some point in the day. Analysing the max and min temperatures independently would only tell me on which days the maximum or the minimum was in each interval, which answers a slightly different question.
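To make the difference between the two questions concrete, here is a small Python sketch. The daily (min, max) values are invented for illustration, and the helper names `bucket_key` and `overlapped_keys` are hypothetical. A value that sits exactly on a boundary (say 10C) is assigned to the bucket that starts there, which is an assumption about how the edges would be treated.

```python
from collections import Counter

INTERVAL = 10  # bucket width in degrees C


def bucket_key(value, interval=INTERVAL):
    """Lower bound of the bucket that contains value."""
    return int(value // interval) * interval


def overlapped_keys(low, high, interval=INTERVAL):
    """Keys of every bucket touched by the closed range [low, high]."""
    return range(bucket_key(low, interval),
                 bucket_key(high, interval) + interval,
                 interval)


# Hypothetical (min, max) temperatures in degrees C, one pair per day.
days = [(-3, 4), (2, 12), (8, 21)]

# Range semantics: a day counts in every interval its range touches.
by_range = Counter(k for lo, hi in days for k in overlapped_keys(lo, hi))
# Independent semantics: a day counts only where its min (or max) falls.
by_min = Counter(bucket_key(lo) for lo, _ in days)
by_max = Counter(bucket_key(hi) for _, hi in days)

print(dict(sorted(by_range.items())))  # {-10: 1, 0: 3, 10: 2, 20: 1}
print(dict(sorted(by_min.items())))    # {-10: 1, 0: 2}
print(dict(sorted(by_max.items())))    # {0: 1, 10: 1, 20: 1}
```

With these made-up values, all three days touched the 0C to 10C interval, yet neither the min-only nor the max-only histogram reports three days there.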

colings86 changed the title from: range_histogram and date_range_histogram aggregations to help analysis "session" duration type data, to: range_histogram and date_range_histogram aggregations to help analyse "session" duration type data (Feb 15, 2017)
@colings86
Contributor Author

Discussed in FixItFriday and it was generally considered that this would be good to implement but we would need to use Binary DocValues for the columnar data for range fields. We expect that the number of range fields should be small so we don't think this would affect compression too much. In keeping with the other field types that support doc values, we would enable doc_values by default on range fields.

@esequiasneto

esequiasneto commented Apr 5, 2017

I also had this problem and worked around it by adding a Painless script. Here is the complete query along with the script:

{
  "size": 0,
  "aggs": {
    "4": {
      "date_range": {
        "field": "base_iptv_terminal_data_inicio",
        "ranges": [{ "from": "now-24h-3h", "to": "now-3h" }],
        "time_zone": "UTC"
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "base_iptv_canal_nome.keyword",
            "size": 5,
            "order": { "_count": "desc" }
          },
          "aggs": {
            "2": {
              "date_histogram": {
                "interval": "minute",
                "time_zone": "UTC",
                "min_doc_count": 0,
                "script": {
                  "inline": "long start = doc['base_iptv_terminal_data_inicio'].value; def data_fim = doc['base_iptv_terminal_data_fim'].value ?: new Date().getTime(); if ( data_fim > doc['base_iptv_epg_data_fim'].value ){data_fim = doc['base_iptv_epg_data_fim'].value;}def l = []; l.add(start); for (long i = start; i < data_fim; i += 60000) { l.add(i); } return l;"
                }
              }
            }
          }
        }
      }
    }
  }
}

I do not know whether it is a good solution, because it has had an impact on how long the graph takes to generate in Kibana.

Example Document:
{
  "_index": "sysrec",
  "_type": "terminallog",
  "_id": "AVsWcx_uxyYI9_eHPD9k",
  "_score": 1,
  "_source": {
    "base_iptv_terminal_data_fim_mili": 1490723191928,
    "base_iptv_terminal_data_fim": "2017-03-28 14:46:31",
    "base_iptv_canal_id": 30,
    "base_iptv_epg_data_fim_mili": 1490723700000,
    "base_iptv_canal_nome": "Fox",
    "base_iptv_programa_ano": 2011,
    "base_iptv_programa_id": 30891,
    "base_iptv_epg_data_fim": "2017-03-28 14:55:00",
    "base_iptv_terminal_ip": "10.0.171.19",
    "base_iptv_terminal_data_inicio_mili": 1490723097616,
    "base_iptv_terminal_data_inicio": "2017-03-28 14:44:57",
    "base_iptv_epg_data_inicio_mili": 1490718300000,
    "base_iptv_terminal_numeroserie": "A30100041638010006623135",
    "base_iptv_terminal_mac": "0c:56:5c:65:0f:9f",
    "base_iptv_programa_titulo_original": "Attack The Block",
    "base_iptv_programa_titulo": "Ataque ao Prédio",
    "acao": "i",
    "data_criacao": "2017-03-28 16:44:55",
    "base_iptv_epg_data_inicio": "2017-03-28 13:25:00"
  }
}
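For readers who want to trace what the inline script in the query above emits for one document, the following Python sketch mirrors its logic step by step. It is an illustration only: the function name `session_timestamps` is hypothetical, times are epoch milliseconds (taken here from the `_mili` fields of the example document), and treating a missing end time as `None` that falls back to "now" is an assumption about the intent of the `?:` operator in the script.

```python
import time

MINUTE_MS = 60_000  # same step as the script's "i += 60000"


def session_timestamps(start_ms, end_ms, epg_end_ms, now_ms=None):
    """Return the list of values the script would hand to date_histogram."""
    if end_ms is None:
        # Session still open: fall back to the current time (assumption).
        end_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    # Never run past the end of the programme (the EPG end time).
    end_ms = min(end_ms, epg_end_ms)
    values = [start_ms]  # the script adds start once up front...
    # ...and again as the first loop value, then steps one minute at a time.
    values.extend(range(start_ms, end_ms, MINUTE_MS))
    return values


# Values from the example document's *_mili fields.
vals = session_timestamps(1490723097616, 1490723191928, 1490723700000)
print(vals)  # [1490723097616, 1490723097616, 1490723157616]

# Minute buckets those values fall into (lower bound of each minute).
buckets = sorted({v // MINUTE_MS * MINUTE_MS for v in vals})
print(buckets)  # [1490723040000, 1490723100000]
```

Two things stand out in this trace. The start value is emitted twice, and because the loop steps from the start time rather than from minute boundaries, the minute in which the session ends (the third minute here) receives no value, so the session would appear in two minute buckets rather than three.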

@colings86
Contributor Author

Stalled until #24823 is merged

@pickypg
Member

pickypg commented Jul 27, 2017

@colings86 just as a note: #24823 was merged.

@gmoskovicz
Contributor

@colings86 any news regarding this?

@colings86
Contributor Author

@gmoskovicz this is no longer stalled, but it is not currently being worked on either. It should be possible to implement now, though a ValuesSource implementation will need to be created for range fields so that they can be used with the new aggregation.

colings86 removed the stalled label Jan 5, 2018
@gmoskovicz
Contributor

Sounds good.

So until this is implemented, the only way to aggregate these fields is probably either to reindex into a regular date field (for example), creating multiple values for the field, or to use a script.

@colings86
Contributor Author

Yes, for now the way to do this is to have separate date fields for the start and end date (or start date and duration) and use a script to calculate the histogram data (as in the Discuss post linked in the description of this issue).

@markharwood
Contributor

cc @elastic/es-search-aggs

@kdward

kdward commented Mar 21, 2018

The fact that range fields don't work in aggregations should probably be documented as a warning in the description of the range types, in the aggregation documentation, or both. Without documentation to the contrary, a user would naturally expect a date histogram to work on data that has been mapped as a date range.

@polyfractal
Contributor

We are organizing all the agg range issues into a central ticket, so I am closing this in favor of #34644.
