[Feature][InfraUI] Make "Last 1 minute" time range configurable #36774

Closed
gaby opened this issue May 21, 2019 · 13 comments
Labels
discuss Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services

Comments

@gaby

gaby commented May 21, 2019

Describe the feature:

  • Currently the Infrastructure UI only shows data for the Last 1 minute. When using Metricbeat with intervals greater than 1 minute you get 0 results.
  • The default time range should be configurable in the UI, thus allowing systems with greater intervals to be shown.
  • This will also reduce the resource usage of Metricbeat, which currently can be over 100MB in events per day for a single host when using System/Docker Metricset.
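
To make the first point concrete, here is a simplified model (not Infra UI code) of why a fixed "Last 1 minute" window can come back empty when the reporting period is longer. The function name and the sample numbers are made up for illustration, and the real inventory query is more involved than a single window check.

```python
def window_has_sample(now_s: int, window_s: int, period_s: int,
                      first_sample_s: int = 0) -> bool:
    """Return True if a sample taken every `period_s` seconds falls in (now - window, now]."""
    if now_s < first_sample_s:
        return False
    # Timestamp of the most recent sample at or before `now_s`.
    last_sample = now_s - ((now_s - first_sample_s) % period_s)
    return last_sample > now_s - window_s


# A host reporting every 5 minutes, queried with a 1-minute window
# at every second of a 50-minute stretch.
hits = sum(window_has_sample(t, 60, 300) for t in range(0, 3000))
print(hits / 3000)  # fraction of query times that see any data
```

In this model only one query in five sees a sample, so the view is empty most of the time even though the host is healthy.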

Describe a specific use case for the feature:

  • Systems collecting metrics at intervals greater than 1 minute. Metricbeat generates a lot of events, and collecting all those events 3600 times during the day is excessive.

  • Tracking metrics for non-critical infrastructure at intervals greater than 1 minute.

  • When the InfraUI was first released this was set to 1hr; it was later changed to 5min, and is now Last 1min.

  • A settings panel was added for several fields in the InfraUI, but the time range interval setting is missing from this panel.

@elasticmachine
Contributor

Pinging @elastic/infra-logs-ui

@simianhacker
Member

@makwarth This is a feature I anticipated implementing, and the plumbing is in place for the APIs. I think we just need to figure out where this setting will live. I think our configuration UI needs to be redesigned to support multiple sections for different aspects of the UI.

Member

I mentioned this at GAH but just to reiterate: would it make sense to use the super date picker "range" format instead of a single point in time for the infrastructure UI? Then it would be configured the same way as in other apps... there may very well be reasons that won't work here but wanted to throw it out there.

@makwarth

The waffle map is great for showing point in time snapshots (1m, 5m, 15min), but not so much for longer time ranges. I'd worry about users having to adjust the time range too often when opening the Infra UI - and then having to revert the time range when navigating away from the Infra UI.

@gabrielcalderon, which time range defaults would suit your needs in the Infra UI?

@gaby
Author

gaby commented May 27, 2019

@makwarth I can test again today, but I'm pretty sure my waffle map was empty when using any interval over 1 minute.

Something like the ones you mentioned would work. Maybe also 3m.

@skh skh added the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Jul 1, 2019
@skh
Contributor

skh commented Jul 15, 2019

I think we won't be able to use one interval for all metrics in the future, because the different metricbeat modules have different default reporting periods. One notable example is the AWS module (see https://www.elastic.co/guide/en/beats/metricbeat/master/metricbeat-module-aws.html), which uses 300s by default. That value is determined by the frequency at which AWS sends metrics internally to CloudWatch.

The potential of ending up with specific interval settings for certain metricbeat modules (or even metricsets) in the Infra UI configuration feels clunky to me, but maybe that is what will be necessary in the end -- one default interval, and the option to override that setting by module/metricset.

Another option could be to find out the reporting period, and therefore the appropriate interval, from the data with a query like:

GET metricbeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match_phrase": {
            "host.name": "noether"
          }
        },
        {
          "match_phrase": {
            "event.module": "system"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-10m"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggregations": {
    "my_buckets": {
      "composite": {
        "size": 10, 
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "1m",
                "order": "desc"
              }
            }
          }
        ]
      }
    } 
  }
}
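
As a rough sketch of what could be done with the result of a query like this, the following reads the bucket keys out of a composite aggregation response and computes the distances between them. The helper names and the hand-written response fragment are hypothetical; only the `my_buckets` and `date` names come from the query above.

```python
from typing import Any, Dict, List


def bucket_timestamps_ms(response: Dict[str, Any], agg_name: str = "my_buckets",
                         source_name: str = "date") -> List[int]:
    """Collect the epoch-millisecond keys of all returned buckets, oldest first."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    # Keep only buckets that actually contain documents.
    return sorted(b["key"][source_name] for b in buckets if b.get("doc_count", 0) > 0)


def gaps_seconds(timestamps_ms: List[int]) -> List[float]:
    """Distances between consecutive bucket timestamps, in seconds."""
    return [(b - a) / 1000 for a, b in zip(timestamps_ms, timestamps_ms[1:])]


# Hand-written response fragment for a host reporting every 5 minutes.
sample = {
    "aggregations": {
        "my_buckets": {
            "buckets": [
                {"key": {"date": 1563192600000}, "doc_count": 14},
                {"key": {"date": 1563192300000}, "doc_count": 14},
            ]
        }
    }
}
print(gaps_seconds(bucket_timestamps_ms(sample)))  # [300.0]
```

With 1m buckets the keys are minute-aligned, so this can only resolve reporting periods in whole minutes; a 10s period would show up as 60s gaps.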

@simianhacker
Member

@skh I like this approach; it's actually how I would solve the problem. The only thing I would possibly change is the interval, to a smaller unit, maybe something like 1 second. From there we should be able to work out what the interval "might" be. My reasoning for using a smaller interval is that if we can guess "it looks like we have 1 second data", then in the metrics detail page we could show the data at a higher resolution by setting the TSVB interval to >=1s (which will let the user zoom down to 1s), depending on what we see in the data. I think right now we are using >=1m for the metrics detail page because it seems like a safe default.
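
One way to picture the last step is a small helper that turns a guessed reporting period into a ">=" interval string of the kind mentioned above. This is only a sketch: the function name is hypothetical, and the ">=1m" fallback mirrors the "safe default" described in the comment rather than any actual Infra UI code.

```python
import math
from typing import Optional


def tsvb_min_interval(estimated_period_s: Optional[float], fallback: str = ">=1m") -> str:
    """Format a minimum-interval string from an estimated reporting period."""
    if estimated_period_s is None or estimated_period_s <= 0:
        # Nothing could be inferred from the data, so keep the safe default.
        return fallback
    seconds = max(1, math.ceil(estimated_period_s))
    if seconds % 60 == 0:
        return f">={seconds // 60}m"
    return f">={seconds}s"


print(tsvb_min_interval(1))     # >=1s
print(tsvb_min_interval(300))   # >=5m
print(tsvb_min_interval(None))  # >=1m
```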

@jasonrhodes
Member

> Another option could be to find out the reporting period, and therefore the appropriate interval, from the data with a query

I definitely like this idea but what happens if the data is being sent up every 20 min, or every hour? And in this query how do you tell the difference between lack of data due to interval vs. lack of data due to the monitor having been down, or failed to send, or something along those lines? (That second point may not make much difference here but I'm never sure.)

For now, is there a simple solution we can think of to solve the basic case? Maybe switching to bar charts to show this time series data would help so that you don't have to draw the interpolation from non-zero to "zero" for the missing buckets and therefore you don't draw attention to them...

@skh
Contributor

skh commented Jul 17, 2019

> For now, is there a simple solution we can think of to solve the basic case? Maybe switching to bar charts to show this time series data would help so that you don't have to draw the interpolation from non-zero to "zero" for the missing buckets and therefore you don't draw attention to them...

The issue is unrelated to charts: it causes the inventory view (waffle map) to be empty when the bucket interval does not match the data.

Solutions I can think of, roughly ordered from easiest (for us) to cleverest:

  • Use hardcoded intervals working with metricbeat default reporting periods, and document that. We're halfway there:
    • we still need to use different intervals for different types of data (aws module has reporting period 300s, many others 10s)
    • we need to add good documentation so users can tweak their setup to make it work
  • Make it configurable in kibana.yml and document it. Use intervals working with metricbeat default reporting periods as defaults. As we need different intervals for different modules, we need to find a way to configure several intervals that is still manageable.
  • Same as above, but expose the configuration in the source configuration UI.
  • Expose the interval length in the UI on all pages that are affected, using metricbeat default reporting periods as a starting point. This needs visual design, and probably a way to locally persist the user's choice, so that they don't have to select the interval every time they come back to a page.
  • Dynamically analyze the data as proposed in #36774 (comment). This could be combined with exposing the interval length in the UI, so that we would start with the interval size we guessed from the data, instead of with the hardcoded defaults.
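
The "one default interval, with overrides by module/metricset" idea behind the first three options could look roughly like the sketch below. Everything here is hypothetical: the table, the fallback value, and the override keys are illustrative and are not real Kibana settings. The 300s value for aws comes from this thread; 10s for system is an assumption based on "many others 10s".

```python
from typing import Mapping, Optional

# Hypothetical table of module default reporting periods, in seconds.
MODULE_DEFAULT_PERIOD_S = {"aws": 300, "system": 10}
FALLBACK_PERIOD_S = 60  # hypothetical catch-all for modules not listed


def resolve_period_s(module: str, metricset: Optional[str] = None,
                     overrides: Optional[Mapping[str, int]] = None) -> int:
    """Pick the most specific configured period: metricset, then module, then default."""
    overrides = overrides or {}
    if metricset is not None:
        key = f"{module}.{metricset}"
        if key in overrides:
            return overrides[key]
    if module in overrides:
        return overrides[module]
    return MODULE_DEFAULT_PERIOD_S.get(module, FALLBACK_PERIOD_S)


user_overrides = {"system": 120, "system.cpu": 30}
print(resolve_period_s("aws"))                            # 300
print(resolve_period_s("system", "cpu", user_overrides))  # 30
print(resolve_period_s("system", "load", user_overrides)) # 120
```

The same lookup would apply whether the overrides come from a config file or from a settings UI; only the source of the mapping changes.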

@skh
Contributor

skh commented Jul 17, 2019

> what happens if the data is being sent up every 20 min, or every hour? And in this query how do you tell the difference between lack of data due to interval vs. lack of data due to the monitor having been down, or failed to send, or something along those lines? (That second point may not make much difference here but I'm never sure.)

Good points.

For very large interval sizes:

  • Right now our assumptions are much stricter (reporting interval <= 60s). What kinds of data do we want to support?
  • Time series data coming from infrastructure monitoring most likely will not arrive at 20- or 60-minute intervals. Other data might (scientific measurements?, weather?), but do we support it?
  • Remember that users can always use Kibana dashboards for their visualization needs for other types of time series data.
  • What are our requirements, other than "make it work and don't bother users too much"?

For missing data:

  • We can't distinguish between a monitor being down, a system being down, or the network being down.
  • When data comes in again, it will backfill ("rewrite history"), because the @timestamp field is written by metricbeat, not elasticsearch.
  • I think we shouldn't try to do the job of Uptime monitoring. When we find no data for the time range being queried, we don't need to find out the reporting period because we can't display metric data for that time range anyway.

For partially missing data:

  • This is an edge case we need to handle. Options I can think of right now:
    • fall back to a default
    • use the average of the distances between found timestamps
    • use the shortest distance between found timestamps
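
The three options for partially missing data can be compared side by side with a small sketch. The function name, the strategy labels, and the 60s default are assumptions made for illustration only.

```python
from typing import Iterable

DEFAULT_PERIOD_S = 60.0  # hypothetical fallback value


def estimate_period_s(timestamps_s: Iterable[float], strategy: str = "shortest",
                      default: float = DEFAULT_PERIOD_S) -> float:
    """Estimate the reporting period from the timestamps that were found."""
    ts = sorted(set(timestamps_s))
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps or strategy == "default":
        # Fewer than two distinct timestamps: nothing to measure.
        return default
    if strategy == "average":
        return sum(gaps) / len(gaps)
    if strategy == "shortest":
        return min(gaps)
    raise ValueError(f"unknown strategy: {strategy}")


# 10-second data where the samples at 20s and 30s are missing.
seen = [0, 10, 40, 50, 60]
print(estimate_period_s(seen, "average"))   # 15.0
print(estimate_period_s(seen, "shortest"))  # 10
```

In this example the average is pulled upward by the gap, while the shortest distance still recovers the 10s period as long as at least two consecutive samples arrived.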

@jasonrhodes
Member

> For very large interval sizes:

I agree with all your points in the sense that I don't think we need to support those edge cases. Sounds like we would just know internally that we support data being sent at up to about 10m intervals.

> I think we shouldn't try to do the job of Uptime monitoring. When we find no data for the time range being queried, we don't need to find out the reporting period because we can't display metric data for that time range anyway.

Works for me.

> For partially missing data:

I like "use the shortest distance between found timestamps" which I suppose we could do all the time, whether we know if there is partially missing data or not, right?

@skh
Contributor

skh commented Jul 29, 2019

We ended up finding another solution as described in elastic/beats#12616 :

When the data contains a period field (in seconds), the Infra UI will use this value to determine the bucket interval size. If this field is not present, we will default to the default period value of the respective metricbeat module.

It is still possible that we will also add UI elements for choosing the bucket interval size, but we'll need separate tickets for those.
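
The resolution described above (use the period carried by the data, otherwise the module default) can be sketched as follows. This is not the actual Infra UI implementation: the `period` key is a placeholder for whatever field elastic/beats#12616 introduced, the unit is taken from the comment above, and the defaults table is hypothetical.

```python
from typing import Any, Mapping, Optional


def bucket_interval_s(doc: Mapping[str, Any], module_defaults: Mapping[str, float],
                      field: str = "period") -> Optional[float]:
    """Prefer the period reported in the document, else the module's default period."""
    value = doc.get(field)
    if isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0:
        return float(value)
    # No usable period in the data: fall back to the default for the module.
    module = doc.get("event", {}).get("module")
    default = module_defaults.get(module)
    return float(default) if default is not None else None


defaults = {"aws": 300, "system": 10}  # hypothetical defaults table
print(bucket_interval_s({"period": 30, "event": {"module": "aws"}}, defaults))  # 30.0
print(bucket_interval_s({"event": {"module": "aws"}}, defaults))                # 300.0
```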

@gaby
Author

gaby commented Jul 29, 2019

@skh I think that's the best solution for this problem. Thanks for looking into this.
