[Feature][InfraUI] Make "Last 1 minute" time range configurable #36774

gaby · 2019-05-21T14:01:48Z

Describe the feature:

Currently the Infrastructure UI only shows data for the Last 1 minute. When using Metricbeat with intervals greater than 1 minute you get 0 results.
The default time range should be configurable in the UI, thus allowing systems with greater intervals to be shown.
This will also reduce the resource usage of Metricbeat, which can currently exceed 100MB in events per day for a single host when using the System/Docker metricsets.

Describe a specific use case for the feature:

Systems collecting metrics at intervals greater than 1 minute. Metricbeat generates a lot of events, and collecting all those events 3600 times during the day is excessive.
Tracking metrics for non-critical infrastructure at intervals greater than 1 minute.
When the InfraUI was first released this was set to 1hr; it was later changed to 5min, and is now Last 1min.
A settings panel was added for several fields in the InfraUI, but the time range interval setting is missing from this panel.

elasticmachine · 2019-05-22T20:28:15Z

Pinging @elastic/infra-logs-ui

simianhacker · 2019-05-23T19:37:41Z

@makwarth This is a feature I anticipated implementing, and the plumbing is in place in the APIs. I think we just need to figure out where this setting will live. I think our configuration UI needs to be redesigned to support multiple sections for different aspects of the UI.

jasonrhodes · 2019-05-24T13:08:30Z

I mentioned this at GAH but just to reiterate: would it make sense to use the super date picker "range" format instead of a single point in time for the infrastructure UI? Then it would be configured the same way as in other apps... there may very well be reasons that won't work here but wanted to throw it out there.

makwarth · 2019-05-27T08:01:43Z

The waffle map is great for showing point in time snapshots (1m, 5m, 15min), but not so much for longer time ranges. I'd worry about users having to adjust the time range too often when opening the Infra UI - and then having to revert the time range when navigating away from the Infra UI.

@gabrielcalderon, which time range defaults would suit your needs in the Infra UI?

gaby · 2019-05-27T08:48:09Z

@makwarth I can test again today, but I'm pretty sure my waffle map was empty when using any interval over 1 minute.

Something like the ones you mentioned would work. Maybe also 3m.

skh · 2019-07-15T12:30:20Z

I think we won't be able to use one interval for all metrics in the future, because the different metricbeat modules have different default reporting periods. One notable example is the AWS module (see https://www.elastic.co/guide/en/beats/metricbeat/master/metricbeat-module-aws.html), which uses 300s by default, a value determined by how often AWS sends metrics internally to CloudWatch.

The potential of ending up with specific interval settings for certain metricbeat modules (or even metricsets) in the Infra UI configuration feels clunky to me, but maybe that is what will be necessary in the end -- one default interval, and the option to override that setting by module/metricset.

Another option could be to find out the reporting period, and therefore the appropriate interval, from the data with a query like:

GET metricbeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match_phrase": {
            "host.name": "noether"
          }
        },
        {
          "match_phrase": {
            "event.module": "system"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-10m"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggregations": {
    "my_buckets": {
      "composite": {
        "size": 10, 
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "1m",
                "order": "desc"
              }
            }
          }
        ]
      }
    } 
  }
}
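As a rough sketch of how the response to a query like this could be read: a composite aggregation returns only non-empty buckets, so the gaps between consecutive bucket keys hint at the reporting period. The response below is hand-written sample data, and `bucket_gaps_seconds` is a hypothetical helper, not Infra UI code.

```python
from typing import Any, Dict, List


def bucket_gaps_seconds(response: Dict[str, Any], agg_name: str = "my_buckets") -> List[float]:
    """Return the gaps (in seconds) between consecutive non-empty buckets."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # The composite source is named "date"; its key is epoch milliseconds.
    keys_ms = sorted(b["key"]["date"] for b in buckets if b["doc_count"] > 0)
    return [(later - earlier) / 1000.0 for earlier, later in zip(keys_ms, keys_ms[1:])]


# Hand-written sample: data arriving every 5 minutes, seen through 1m buckets.
sample_response = {
    "aggregations": {
        "my_buckets": {
            "buckets": [
                {"key": {"date": 1563193800000}, "doc_count": 4},
                {"key": {"date": 1563193500000}, "doc_count": 4},
                {"key": {"date": 1563193200000}, "doc_count": 4},
            ]
        }
    }
}

gaps = bucket_gaps_seconds(sample_response)
print(gaps)
```

Note that the resolution of this estimate is limited by the histogram interval: with 1m buckets, any reporting period of one minute or less shows up as a 60-second gap.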

simianhacker · 2019-07-15T17:30:42Z

@skh I like this approach, it's actually how I would solve the problem. The only thing I would possibly change is the interval to a smaller unit, maybe something like 1 second. Then from there we should be able to work out what the interval "might" be. My reasoning for using a smaller interval is that if we can guess "It looks like we have 1 second data", then in the metrics detail page we could show the data at a higher resolution by setting the TSVB interval to >=1s (which will let the user zoom down to 1s), depending on what we see in the data. I think right now we are using >=1m for the metrics detail page because it seems like a safe default.

jasonrhodes · 2019-07-16T03:16:05Z

> Another option could be to find out the reporting period, and therefore the appropriate interval, from the data with a query

I definitely like this idea but what happens if the data is being sent up every 20 min, or every hour? And in this query how do you tell the difference between lack of data due to interval vs. lack of data due to the monitor having been down, or failed to send, or something along those lines? (That second point may not make much difference here but I'm never sure.)

For now, is there a simple solution we can think of to solve the basic case? Maybe switching to bar charts to show this time series data would help so that you don't have to draw the interpolation from non-zero to "zero" for the missing buckets and therefore you don't draw attention to them...

skh · 2019-07-17T10:24:04Z

> For now, is there a simple solution we can think of to solve the basic case? Maybe switching to bar charts to show this time series data would help so that you don't have to draw the interpolation from non-zero to "zero" for the missing buckets and therefore you don't draw attention to them...

The issue is unrelated to charts: the inventory view (waffle map) is empty when the bucket interval does not match the data.

Solutions I can think of, roughly ordered from easiest (for us) to cleverest:

1. Use hardcoded intervals that work with the metricbeat default reporting periods, and document that. We're halfway there:
   - we still need to use different intervals for different types of data (the aws module has a reporting period of 300s, many others 10s)
   - we need to add good documentation so users can tweak their setup to make it work
2. Make it configurable in kibana.yml and document it. Use intervals that work with the metricbeat default reporting periods as defaults. As we need different intervals for different modules, we need to find a way to configure several intervals that is still manageable.
3. Same as above, but expose the configuration in the source configuration UI.
4. Expose the interval length in the UI on all pages that are affected, using the metricbeat default reporting periods as a starting point. This needs visual design, and probably a way to locally persist the user's choice, so that they don't have to select the interval every time they come back to a page.
5. Dynamically analyze the data as proposed in the earlier comment on this issue (#36774). This could be combined with exposing the interval length in the UI, so that we would start with the interval size we guessed from the data, instead of with the hardcoded defaults.
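To make the "one default interval, with overrides per module or metricset" idea concrete, here is a minimal sketch. The class, its field names, and the `aws.ec2` override are all hypothetical; only the 300s figure for the aws module comes from this thread, and nothing here reflects an actual kibana.yml setting.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IntervalConfig:
    """Hypothetical interval settings: one default plus optional overrides."""

    default_seconds: int = 60  # assumption: mirrors the current "Last 1 minute"
    module_overrides: Dict[str, int] = field(default_factory=dict)
    metricset_overrides: Dict[str, int] = field(default_factory=dict)

    def interval_for(self, module: str, metricset: Optional[str] = None) -> int:
        # The most specific setting wins: metricset, then module, then default.
        if metricset is not None:
            key = f"{module}.{metricset}"
            if key in self.metricset_overrides:
                return self.metricset_overrides[key]
        return self.module_overrides.get(module, self.default_seconds)


config = IntervalConfig(
    default_seconds=60,
    module_overrides={"aws": 300},          # reporting period from the thread
    metricset_overrides={"aws.ec2": 600},   # made-up value, for illustration only
)

print(config.interval_for("system"))
print(config.interval_for("aws"))
print(config.interval_for("aws", "ec2"))
```

The lookup order keeps the common case to a single setting while still allowing the clunkier per-metricset override where a module needs it.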

skh · 2019-07-17T10:54:44Z

> what happens if the data is being sent up every 20 min, or every hour? And in this query how do you tell the difference between lack of data due to interval vs. lack of data due to the monitor having been down, or failed to send, or something along those lines? (That second point may not make much difference here but I'm never sure.)

Good points.

For very large interval sizes:

Right now our assumptions are much stricter (reporting interval <= 60s). What kinds of data do we want to support?
Time series data from infrastructure monitoring will most likely not arrive at 20- or 60-minute intervals. Other data might (scientific measurements? weather?), but do we support it?
Remember that users can always use Kibana dashboards to visualize other types of time series data.
What are our requirements, other than "make it work and don't bother users too much"?

For missing data:

We can't distinguish between a monitor being down, a system being down, and the network being down.
When data comes in again, it will backfill ("rewrite history"), because the @timestamp field is written by metricbeat, not elasticsearch.
I think we shouldn't try to do the job of Uptime monitoring. When we find no data for the time range being queried, we don't need to find out the reporting period because we can't display metric data for that time range anyway.

For partially missing data:

This is an edge case we need to handle. Options I can think of right now:
- fall back to a default
- use the average of the distances between found timestamps
- use the shortest distance between found timestamps
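The three options can be compared with a small sketch. `estimate_interval`, the strategy names, and the 60-second fallback are assumptions made for illustration; timestamps are plain seconds.

```python
from statistics import mean
from typing import Sequence

DEFAULT_INTERVAL_SECONDS = 60.0  # assumed fallback, not a value from the Infra UI


def estimate_interval(
    timestamps: Sequence[float],
    strategy: str = "shortest",
    default: float = DEFAULT_INTERVAL_SECONDS,
) -> float:
    """Estimate the reporting interval from the timestamps that were found."""
    ordered = sorted(set(timestamps))
    if len(ordered) < 2:
        # Not enough data points to measure a distance: fall back to a default.
        return default
    distances = [b - a for a, b in zip(ordered, ordered[1:])]
    if strategy == "average":
        return float(mean(distances))
    if strategy == "shortest":
        return float(min(distances))
    raise ValueError(f"unknown strategy: {strategy}")


# A 10-second reporter with two samples missing (at 30s and 40s).
found = [0.0, 10.0, 20.0, 50.0, 60.0]

print(estimate_interval(found, "average"))
print(estimate_interval(found, "shortest"))
print(estimate_interval([42.0]))
```

In this sample the average is pulled upward by the gap left by the missing samples, while the shortest distance still recovers the 10-second period.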

jasonrhodes · 2019-07-19T21:16:01Z

> For very large interval sizes:

I agree with all your points, in the sense that I don't think we need to support those edge cases. Sounds like we would just know internally that we support data being sent at intervals of up to about 10m.

> I think we shouldn't try to do the job of Uptime monitoring. When we find no data for the time range being queried, we don't need to find out the reporting period because we can't display metric data for that time range anyway.

Works for me.

> For partially missing data:

I like "use the shortest distance between found timestamps" which I suppose we could do all the time, whether we know if there is partially missing data or not, right?

skh · 2019-07-29T13:40:28Z

We ended up finding another solution, as described in elastic/beats#12616:

When the data contains a period field (in seconds), the Infra UI will use this value to determine the bucket interval size. If this field is not present, we will default to the default period value of the respective metricbeat module.
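The fallback described here could look roughly like the following. This is a sketch under stated assumptions: the document is treated as a plain mapping with a top-level `period` key in seconds, which may not match the actual field name or location in Metricbeat events, and the defaults table only holds the figures mentioned in this thread.

```python
from typing import Any, Mapping

# Illustrative defaults in seconds; "aws" and the generic 10s come from the thread.
MODULE_DEFAULT_PERIOD_SECONDS = {"aws": 300, "system": 10}
GENERIC_DEFAULT_PERIOD_SECONDS = 10  # assumption for modules not listed above


def bucket_interval_seconds(doc: Mapping[str, Any], module: str) -> float:
    """Pick the bucket interval: the document's period if present, else a module default."""
    period = doc.get("period")  # hypothetical field location
    if isinstance(period, (int, float)) and period > 0:
        return float(period)
    return float(MODULE_DEFAULT_PERIOD_SECONDS.get(module, GENERIC_DEFAULT_PERIOD_SECONDS))


print(bucket_interval_seconds({"period": 120}, "system"))
print(bucket_interval_seconds({}, "aws"))
print(bucket_interval_seconds({}, "docker"))
```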

It is still possible that we will also add UI elements for choosing the bucket interval size, but we'll need separate tickets for those.

gaby · 2019-07-29T13:48:32Z

@skh I think that's the best solution for this problem. Thanks for looking into this.

lukasolson added the Feature:Metrics UI and triage_needed labels May 22, 2019

sgrodzicki added [zube]: Investigate discuss and removed triage_needed labels May 28, 2019

sgrodzicki assigned skh May 28, 2019

skh added the Team:Infra Monitoring UI - DEPRECATED label Jul 1, 2019

jasonrhodes mentioned this issue Jul 15, 2019

[Infra UI] Change metadata endpoint to check for cloud metrics #39280

Closed

skh mentioned this issue Jul 24, 2019

Send period in metricbeat data elastic/beats#12616

Closed

skh closed this as completed Jul 29, 2019

zube bot added [zube]: Done and removed [zube]: Investigate labels Jul 29, 2019

sgrodzicki removed the [zube]: Done label Sep 9, 2019
