[Metrics UI] Enhanced host details - Processes - Additional Features #83968

Closed
Zacqary opened this issue Nov 20, 2020 · 26 comments · Fixed by #84716

@Zacqary
Contributor

Zacqary commented Nov 20, 2020

Some acceptance criteria from #80307 still need to be met:

  • De-couple the queries for the data needed for the expanded view and only make that request for an individually expanded accordion on demand, rather than loading it as part of the list query for all processes.
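
As a rough illustration of the on-demand criterion, here is a minimal sketch of a loader that only requests details when a row is first expanded and reuses that request afterwards. The `ProcessDetails` shape and the `fetchDetails` callback are hypothetical stand-ins, not existing Kibana APIs.

```typescript
// Hypothetical shape of the per-process detail payload (not specified in this issue).
interface ProcessDetails {
  command: string;
  samples: number[];
}

// Hypothetical fetcher: in practice this would call whatever endpoint serves the expanded view.
type DetailsFetcher = (command: string) => Promise<ProcessDetails>;

// Returns a loader that fetches details only when a row is first expanded
// and reuses the in-flight or completed request on later expansions.
function createDetailsLoader(fetchDetails: DetailsFetcher) {
  const cache = new Map<string, Promise<ProcessDetails>>();
  return function loadDetails(command: string): Promise<ProcessDetails> {
    let pending = cache.get(command);
    if (pending === undefined) {
      pending = fetchDetails(command);
      cache.set(command, pending);
    }
    return pending;
  };
}

// Usage with a stand-in fetcher that counts how often it is called.
let fetchCount = 0;
const loadDetails = createDetailsLoader(async (command) => {
  fetchCount += 1;
  return { command, samples: [] };
});

loadDetails("/usr/bin/dockerd"); // first expansion: triggers the request
loadDetails("/usr/bin/dockerd"); // second expansion: served from the cache
```

A real implementation would probably also want to evict failed requests so that a row can be retried.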

Note: We are NOT going to implement the sparkline visualizations from the design, to avoid performance bucketing problems.

APM correlation

host.name, process.pid, and @timestamp can be used to correlate a process with an APM service, which would let us place an APM agent language icon beside the process name in the table.

Discussion about how to link to APM is partially here -> #80307 (comment)

@Zacqary Zacqary added enhancement New value added to drive a business result Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 20, 2020
@Zacqary Zacqary added this to the Metrics UI 7.11 milestone Nov 20, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@Zacqary Zacqary self-assigned this Nov 24, 2020
@jasonrhodes
Member

jasonrhodes commented Nov 24, 2020

@elastic/apm @sqren @sorantis @nehaduggal can any of you help us understand the "View Trace in APM" requirement in this ticket? How can we most easily/safely link to APM from a running process with host.name, process.pid, and @timestamp? And would we be linking to a trace or would it more likely be to a service page? Thanks!

UPDATE: I just saw this comment -> #80307 (comment)

It looks like there is still some ambiguity about where exactly to link?

@jasonrhodes
Member

After meeting with @hbharding, @sorantis, @simianhacker, @Zacqary, @phillipb, and @katefarrar about this work, we decided to make a few performance and UX optimizations.

Step 1 (these will be added to the AC for this ticket):

  • Remove the sparkline visualizations from the process list view
  • De-couple the queries for the data needed for the expanded view and only make that request for an individually expanded accordion on demand, rather than loading it as part of the list query for all processes.

Step 2 (this will be moved to a new, separate ticket):

  • Consider whether we need to query for all processes or whether a Top N Processes by X view would be more useful for users and more performant. We still need to figure out a good value or set of values for N (10? 20?) and a way to easily switch X (CPU, Memory).

@sorenlouv
Member

sorenlouv commented Nov 25, 2020

It looks like there is still some ambiguity about where exactly to link?

In that discussion we talked about linking to a service given a host.name and process.pid. I gave the following suggestion:

# link:
/app/apm/link-to/service?host=my-host&pid=my-pid 

# redirects to:
/app/apm/my-java-service/overview

If we want to link to a specific trace some additional info is needed. Obviously if you have access to a trace.id that would solve it since we already have /app/apm/link-to/trace/${trace.id}. Without trace.id I'm not sure how we can do it. So linking to a service might do it for now.
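
For illustration, a minimal sketch of how a caller could build these links. The `/app/apm/link-to/service` route is only the proposal above and does not exist yet, and both function names are hypothetical.

```typescript
// Builds the PROPOSED service link; this route is a suggestion in this thread, not an existing endpoint.
function buildApmServiceLink(hostName: string, pid: number): string {
  const host = encodeURIComponent(hostName);
  return `/app/apm/link-to/service?host=${host}&pid=${pid}`;
}

// Builds the trace link mentioned above, for the case where a trace.id is available.
function buildApmTraceLink(traceId: string): string {
  return `/app/apm/link-to/trace/${encodeURIComponent(traceId)}`;
}

const serviceLink = buildApmServiceLink("my-host", 1234);
const traceLink = buildApmTraceLink("abc123");
```

Encoding the host name keeps the query string valid if a host name ever contains reserved characters.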

@sorantis

sorantis commented Nov 25, 2020

@jasonrhodes The sparkline provides a quick overview of recent process behavior and the plan was to standardize the use of sparklines for other table views, e.g. containers, pods, processes, and (in the future) functions. We're already using it in other places, like APM. I'd not consider removing it until we have a good understanding of the actual performance impact.

@sgrodzicki

sgrodzicki commented Nov 25, 2020

After meeting with @hbharding, @sorantis, @simianhacker, @Zacqary, @phillipb, and @katefarrar about this work, we decided to make a few performance and UX optimizations.

Can we start by getting a performance baseline of the current query vs new ones?

  1. Current query with sparklines
  2. Current query without sparklines
  3. Query for TOP 10/20/30 with sparklines
  4. Query for TOP 10/20/30 without sparklines

@jasonrhodes
Member

@sorantis if we don't get rid of the sparklines we are sort of back at square one on what to do with this ticket. The performance complexity that the sparklines introduce is exponential because of max bucket size problems. I think choosing to standardize on those may end up creating a ton of issues across the app, if we aren't very careful.
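
To give a feel for the bucket growth, here is a back-of-the-envelope sketch. It assumes one date_histogram series per process and ignores any extra edge bucket; the process count of 500 is purely illustrative. Under these assumptions the count grows as the product of the number of processes and the number of time buckets.

```typescript
// Rough estimate of how many date_histogram buckets a sparkline request creates:
// one series per process, one bucket per interval in the time range.
function estimateSparklineBuckets(
  processCount: number,
  rangeSeconds: number,
  intervalSeconds: number
): number {
  const bucketsPerProcess = Math.ceil(rangeSeconds / intervalSeconds);
  return processCount * bucketsPerProcess;
}

// 15 minutes of data at a 30s interval gives 30 buckets per process.
const perPageOfNine = estimateSparklineBuckets(9, 15 * 60, 30);
// 500 processes is an illustrative number, not a measurement.
const perBusyHost = estimateSparklineBuckets(500, 15 * 60, 30);
```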

@sgrodzicki yeah we can do some query comparisons. I was hoping we could do something quick and easy now and do something more holistic after that, but it sounds like we need more information. I'll work with @Zacqary to look at those query numbers and we'll report back.

@jasonrhodes
Member

@sqren I wasn't sure if that "link-to" link exists already? If it does, that looks like it would be perfect for us.

As for trace vs service, it feels like we want to link to service. @sorantis can you confirm?

@sorantis

@jasonrhodes correct, @alex-fedotyev and I discussed this and came to the conclusion that linking to a service is better than to a particular trace.

@jasonrhodes
Member

OK so for the query performance issues, let's do what @sgrodzicki suggested before we make any changes to the UI/queries. To do this, I'd like to see a full example query for each of the following 4 scenarios attached to this ticket, so we can reference them later.

  1. Current query, as-is
  2. Current query, without the data needed for sparklines
  3. Query for TOP 10/20/30 (with the sparkline data)
  4. Query for TOP 10/20/30 (without the sparkline data)

Then we can profile those queries against the dev-next cluster and see the differences in timing. That's still just one somewhat arbitrary set of data, but it's at least a start so we can understand which decision makes sense for now, as well as for moving forward.

Let me know if that doesn't make sense.

@sorenlouv
Member

@sqren I wasn't sure if that "link-to" link exists already? If it does, that looks like it would be perfect for us.

It doesn't. But it should be trivial to implement if this is what you need.

@simianhacker
Member

simianhacker commented Nov 25, 2020

1. Current Request (from /api/infra/metrics_api)

This request is made via the new Metrics API. It is called repeatedly with the afterKey until there are no more results. Once we have all the results, they are sorted in memory based on the requested field. The fields for the meta key are extracted from the last bucket.

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-15m",
              "lte": "now"
            }
          }
        },
        {
          "exists": {
            "field": "system.process.cmdline"
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "groupings": {
      "composite": {
        "size": 9,
        "sources": [
          {
            "groupBy0": {
              "terms": {
                "field": "system.process.cmdline"
              }
            }
          }
        ]
      },
      "aggs": {
        "histogram": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "30s",
            "offset": "0s",
            "extended_bounds": {
              "min": "now-15m",
              "max": "now"
            }
          },
          "aggregations": {
            "cpu": {
              "avg": {
                "field": "system.process.cpu.total.norm.pct"
              }
            },
            "memory": {
              "avg": {
                "field": "system.process.memory.rss.pct"
              }
            },
            "meta": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "@timestamp": {
                      "order": "desc"
                    }
                  }
                ],
                "_source": [
                  "system.process.cpu.start_time",
                  "system.process.state",
                  "user.name",
                  "process.pid"
                ]
              }
            }
          }
        }
      }
    }
  }
}
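
The paginate-then-sort flow described for this request can be sketched roughly as follows. This is not the Metrics API code: `fetchPage` is a hypothetical callback that runs the composite query with the given after key, and the bucket shape is simplified to one flattened cpu and memory value per process.

```typescript
// Simplified bucket shape (assumption): one averaged value per metric per process.
interface ProcessBucket {
  key: { groupBy0: string };
  cpu: number;
  memory: number;
}

interface CompositePage {
  buckets: ProcessBucket[];
  // Present while more pages may remain.
  afterKey?: { groupBy0: string };
}

// Hypothetical callback that runs the composite query, optionally resuming after a key.
type FetchPage = (after?: { groupBy0: string }) => Promise<CompositePage>;

// Collects every page, then sorts the full result set in memory (descending).
async function fetchAllProcesses(
  fetchPage: FetchPage,
  sortBy: "cpu" | "memory"
): Promise<ProcessBucket[]> {
  const all: ProcessBucket[] = [];
  let after: { groupBy0: string } | undefined = undefined;
  do {
    const page: CompositePage = await fetchPage(after);
    all.push(...page.buckets);
    // Stop once a page comes back empty or without an after key.
    after = page.buckets.length > 0 ? page.afterKey : undefined;
  } while (after !== undefined);
  return all.sort((a, b) => b[sortBy] - a[sortBy]);
}
```

The cost of this approach is that every process on the host has to be fetched before the first sorted row can be shown.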

2. Current query (without data for sparklines)

This query doesn't really exist but I imagine this is what it would look like. It would need to be a custom query that uses the same mechanism to retrieve all the results and sort in memory just like the Metrics API request.

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1m",
              "lte": "now"
            }
          }
        },
        {
          "exists": {
            "field": "system.process.cmdline"
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "groupings": {
      "composite": {
        "size": 9,
        "sources": [
          {
            "groupBy0": {
              "terms": {
                "field": "system.process.cmdline"
              }
            }
          }
        ]
      },
      "aggs": {
        "cpu": {
          "avg": {
            "field": "system.process.cpu.total.norm.pct"
          }
        },
        "memory": {
          "avg": {
            "field": "system.process.memory.rss.pct"
          }
        },
        "meta": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": [
              "system.process.cpu.start_time",
              "system.process.state",
              "user.name",
              "process.pid"
            ]
          }
        }
      }
    }
  }
}

3. Top N for Process with Sparklines

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-15m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "processes": {
      "terms": {
        "field": "system.process.cmdline",
        "size": 20,
        "order": {
          "cpu": "desc"
        }
      },
      "aggs": {
        "cpu": {
          "avg": {
            "field": "system.process.cpu.total.pct"
          }
        },
        "memory": {
          "avg": {
            "field": "system.process.memory.rss.pct"
          }
        },
        "time": {
          "max": {
            "field": "system.process.cpu.start_time"
          }
        },
        "meta": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": [
              "system.process.state",
              "user.name",
              "process.pid"
            ]
          }
        },
        "sparklines": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "1m",
            "extended_bounds": {
              "max": "now",
              "min": "now-15m"
            }
          },
          "aggs": {
            "cpu": {
              "avg": {
                "field": "system.process.cpu.total.pct"
              }
            },
            "memory": {
              "avg": {
                "field": "system.process.memory.rss.pct"
              }
            }
          }
        }
      }
    }
  }
}

4. Top N for Process without Sparklines

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "processes": {
      "terms": {
        "field": "system.process.cmdline",
        "size": 20,
        "order": {
          "cpu": "desc"
        }
      },
      "aggs": {
        "cpu": {
          "avg": {
            "field": "system.process.cpu.total.pct"
          }
        },
        "memory": {
          "avg": {
            "field": "system.process.memory.rss.pct"
          }
        },
        "time": {
          "max": {
            "field": "system.process.cpu.start_time"
          }
        },
        "meta": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": [
              "system.process.state",
              "user.name",
              "process.pid"
            ]
          }
        }
      }
    }
  }
}

@Zacqary
Contributor Author

Zacqary commented Nov 25, 2020

  1. Current query, as-is

Tested this on the gke-* hosts on dev-next. Took 40 seconds the first time, but as I continued retrying it to get an average, the query time decreased by about 10 seconds each time until finally settling around 15-18 seconds.

  2. Current query, without the data needed for sparklines

For this test, I reduced the time range to look back at 1 minute of data instead of 15 minutes.

These took between 5 and 10 seconds, though this was after running the first test, so they may still have been benefiting from whatever caching caused the sparkline-compatible query to speed up drastically after multiple refreshes.

Will post results of the TOP queries after testing them.

@simianhacker
Member

simianhacker commented Nov 25, 2020

I updated the ranges to reflect what we are doing in production. Everything with the date histogram should cover the last 15 minutes, and anything without should cover the last 1 minute. To be fair, we should probably make 2 requests: one for the summary data over the last 1 minute and one for the sparklines over the last 15 minutes.
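
A minimal sketch of that two-request split, showing only the shared filter and the two time ranges. The helper names are hypothetical and the aggregations are left out for brevity.

```typescript
interface TimeRange {
  gte: string;
  lte: string;
}

// Ranges taken from this thread: 1 minute for summary data, 15 minutes for sparklines.
const SUMMARY_RANGE: TimeRange = { gte: "now-1m", lte: "now" };
const SPARKLINE_RANGE: TimeRange = { gte: "now-15m", lte: "now" };

// Builds the bool filter shared by both requests.
function buildHostFilter(hostName: string, range: TimeRange) {
  return {
    bool: {
      filter: [
        { range: { "@timestamp": range } },
        { term: { "host.name": hostName } },
      ],
    },
  };
}

// Returns two independent request bodies (aggregations omitted here).
function buildProcessRequests(hostName: string) {
  return {
    summary: { size: 0, query: buildHostFilter(hostName, SUMMARY_RANGE) },
    sparklines: { size: 0, query: buildHostFilter(hostName, SPARKLINE_RANGE) },
  };
}

const requests = buildProcessRequests("my-host");
```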

@simianhacker
Member

@Zacqary On the subsequent requests for #1, did you change the after key? That would make it more realistic because they are paginating.

@Zacqary
Contributor Author

Zacqary commented Nov 25, 2020

@simianhacker I didn't, I just kept refreshing the page because I wanted to get an average, but then it turned out it sped up.

I ran this through the Inventory view, which goes through the Metrics API, and measured the XHR time, rather than querying directly from Dev Tools.

@Zacqary
Contributor Author

Zacqary commented Nov 25, 2020

  3. Top N for Process with Sparklines

Ran this in the Dev Tools. This took about 10 seconds initially but only 100ms on subsequent requests.

  4. Top N for Process without Sparklines

Between 50-100ms in the Dev Tools, but I'm not sure if that's accurate since it's probably taking advantage of Query 3's cache.

EDIT: I ran Query 4 again using a timerange from several weeks ago and it took 790ms, so that's probably a more accurate non-cached reading.

@simianhacker
Member

@Zacqary I think you can break the cache by changing the host name

@Zacqary
Contributor Author

Zacqary commented Nov 25, 2020

@simianhacker Tried that; only playing with the date broke the cache

@Zacqary
Contributor Author

Zacqary commented Nov 25, 2020

Updated the AC just now to make sure we group by process.command_line instead of system.process.cmdline due to some migration changes in Metricbeat

@dgieselaar
Member

@Zacqary in case you didn't try yet, you can also use request_cache=false or set profile to true. They all break different caching mechanisms I think.
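
As a small sketch of what that could look like when timing queries, the helper below adds the `request_cache=false` query parameter and sets `profile: true` in the body. The helper name is hypothetical, and applying both options at once is just one choice for benchmarking.

```typescript
// Wraps a search body so that the shard request cache is bypassed and profiling is enabled.
function withoutRequestCache(index: string, body: Record<string, unknown>) {
  return {
    method: "GET",
    path: `/${index}/_search?request_cache=false`,
    body: { ...body, profile: true },
  };
}

const benchmarkRequest = withoutRequestCache("metricbeat-*", { size: 0 });
```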

You could perhaps also use top_metrics instead of top_hits. It should support keywords now as well and might be faster.
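
For reference, a sketch of what the meta aggregation might look like with top_metrics. The builder function is hypothetical, and whether every listed field type is supported depends on the Elasticsearch version in use, so treat this as an assumption to verify.

```typescript
// Builds a top_metrics aggregation that returns the given fields from the most recent document.
function buildMetaAgg(fields: string[]) {
  return {
    top_metrics: {
      metrics: fields.map((field) => ({ field })),
      sort: { "@timestamp": "desc" },
    },
  };
}

const metaAgg = buildMetaAgg(["system.process.state", "user.name", "process.pid"]);
```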

We're also using sparklines in one of our new pages. I had to separately fetch the sparklines data because I was running into the too_many_buckets exception.

@jasonrhodes
Member

jasonrhodes commented Nov 30, 2020

OK so 5-10 seconds for "current query as-is, without sparklines" is still not great, especially compared to the Top N queries without sparklines (I agree the sparklines should definitely be queried separately, if we're doing them).

Is there a real reason we want to preserve the pagination for this UX or are we just sticking to it because we had it originally? If there isn't a great reason to preserve it, I think we should move forward with Top 10 (or Top 15) and just let clicking on the headers sort with a new query. Then we should log a separate ticket to do the sparklines query separately and add those back into the design. We can always rethink the sorting later, as well. We'll need help from @hbharding to figure out how to show people we are showing them Top 10 Memory, Top 10 CPU, but I don't think we should get too hung up on the sorting.
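
A rough sketch of how a header click could translate into a new Top N query, based on the Top N queries posted earlier in this thread. The function name and the `SortMetric` type are hypothetical, and the grouping field may need to be process.command_line per the updated AC.

```typescript
type SortMetric = "cpu" | "memory";

// Builds the aggs section of a Top N query, ordered by the chosen metric.
function buildTopNProcessesAggs(size: number, sortBy: SortMetric) {
  return {
    processes: {
      terms: {
        field: "system.process.cmdline",
        size,
        order: { [sortBy]: "desc" },
      },
      aggs: {
        cpu: { avg: { field: "system.process.cpu.total.pct" } },
        memory: { avg: { field: "system.process.memory.rss.pct" } },
      },
    },
  };
}

// Clicking the Memory header would issue a new query like this one.
const topTenByMemory = buildTopNProcessesAggs(10, "memory");
```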

cc @sorantis @simianhacker @sgrodzicki Thoughts?

@Zacqary
Contributor Author

Zacqary commented Dec 2, 2020

I'm focusing on the View Trace in APM functionality now, and I want to clarify what we're expecting. Should this button still say "Trace" if it's linking to a service, and not a trace? Or is there a way that we can fetch a relevant trace.id with another ES query?

Also, I'm testing on edge-oblt and querying APM for one of the hostnames in the Inventory view doesn't return any results. Do we always want the "View in APM" button to be present, even if clicking it will return no results like this? Is there a way we can check to make sure clicking the button will actually link to something?

@Zacqary
Contributor Author

Zacqary commented Dec 2, 2020

@sqren Should we consider the View in APM button dependent on #84814?

@sorenlouv
Member

@Zacqary Yes, unless you have another way of linking to APM you are dependent on #84814. With only 8 working days left before FF it might be hard to squeeze into 7.11 (with all the other stuff we are also trying to cram in), but we'll do our best.

@Zacqary
Contributor Author

Zacqary commented Dec 2, 2020

Okay, I've split the APM button into a separate issue in that case: #84849
