Rework system metrics overview and host overview #3630

flash1293 · 2022-06-30T08:46:08Z

What does this PR do?

This PR reworks the system overview and host overview dashboards, following the best practices documented here: https://docs.google.com/document/d/1uyyFGx6xA5Kvl8c-ZdvXdvBGrHTylxU9F69TGqfzdmw/edit

Checklist

~~I have reviewed tips for building integrations and this pull request is aligned with them.~~
~~I have verified that all data streams collect metrics or logs.~~
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.

Detailed changes in this PR

System overview:

Make all visualizations "by value" (no library visualizations necessary)
Enable margins between panels
Remove titles from most panels if the vis itself is sufficient
Replace the navigation on top with an explainer for how to drill down to the host overview (by using the row click menu of the "hosts" table)
Convert "Number of hosts" to Lens vis
Memory/CPU/Disk usage gauges: Use EUI status color scheme
Turn the two top n TSVB metrics for top hosts by CPU and Memory into a single "hosts" table with both metrics ordered by CPU usage, showing 1000 hosts
The user can navigate to the host overview page using a row click drilldown (the three dots context menu next to each row) - it will navigate to the dashboard with a pre-set filter - by default it's inline because this speeds up the navigation massively, but shift-click works to open in a new tab (like with any link)
⚠️ It shows the top 1000 hosts by CPU, the table can be client-side ordered by memory directly on the dashboard, but it will only consider the already loaded 1000 hosts - does this make sense or are two tables needed?(one for top by CPU, one for top by memory)
⚠️ The hosts table is not using "last value" mode, instead it takes the average of cpu/memory over the full time range - does this make sense?
Convert CPU usage histogram to Lens and use similar coloring like for the gauges, but de-emphasizing the "normal" state (light grey for below 70%, yellow up to 85%, then red)
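
As an illustration of the threshold coloring described above (not taken from the dashboard JSON), the banding could be sketched like this. The color names are placeholders, since the PR does not state the exact EUI color values, and the handling of the exact boundary values (70 and 85) is an assumption.

```python
def cpu_usage_color(usage_pct: float) -> str:
    """Map a CPU usage percentage (0-100) to a status color name.

    Color names are placeholders, not the actual EUI palette values.
    """
    if usage_pct < 70:
        return "lightgrey"  # "normal" state, visually de-emphasized
    if usage_pct <= 85:
        return "yellow"     # warning band, up to 85%
    return "red"            # above 85%


print(cpu_usage_color(42.0))
print(cpu_usage_color(78.5))
print(cpu_usage_color(93.0))
```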

Host overview:

Make all visualizations "by value" (no library visualizations necessary)
Enable margins between panels
Remove titles from most panels if the vis itself is sufficient
Add input control to allow changing the host without leaving the page and without knowing available host names
Change markdown header to explain how to use input control and also link back to overview page
Restructure the page:
- Most important metrics, then a separate section for CPU, Memory, Disk and Network
- Convert "Number of processes" to Lens vis
Memory/CPU/Disk usage/Load/Swap gauges: Use EUI status color scheme
CPU section
- Convert CPU Usage and Load over time charts to Lens
  - ⚠️ Lens uses larger intervals by default than TSVB - is this a problem?
- Turn Top N processes by CPU usage into Lens table with color coding
- ⚠️ The processes table is not using "last value" mode, instead it takes the average of cpu over the full time range - does this make sense?
Memory section
- Convert Memory Usage over time charts to Lens
  - ⚠️ Lens uses larger intervals by default than TSVB - is this a problem?
- Turn Top N processes by memory usage into Lens table with color coding
- ⚠️ The processes table is not using "last value" mode, instead it takes the average of memory usage over the full time range - does this make sense?
- Move swap usage gauge from top section down
Disk section
- Restyle Disk IO chart to use same colors
  - ⚠️ Lens uses larger intervals by default than TSVB - is this a problem?
- Turn Top N mountpoints by disk usage into Lens table with color coding
- ⚠️ The mountpoints table is not using "last value" mode, instead it takes the average of disk usage over the full time range - does this make sense?
- Copy disk usage gauge from top section down as a reference
Network section
- Copy down inbound and outbound traffic metric vis from top section as a reference
- Split up in and out packetloss visualizations into separate panels
- Turn top in and out interfaces top n visualizations into a single combined Lens table, using the max of the counter metrics
- ⚠️ The old interface topn visualizations were using average of in/out bytes, but using the maximum seems more correct, does this make sense?
- Color in/out traffic in bytes and packets visualizations similar to Lens
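
To illustrate the average-versus-maximum question for counter metrics, here is a minimal sketch with made-up sample values. It assumes the in/out byte fields are cumulative counters, as the list above describes; the numbers are hypothetical.

```python
from statistics import mean

# Hypothetical cumulative byte counter samples for one interface, oldest first.
samples = [1_000, 4_000, 9_000, 16_000]

avg_value = mean(samples)  # a midpoint of the counter, not a traffic total
max_value = max(samples)   # equals the latest reading when no reset occurred

# After a counter reset (for example a reboot) the counter restarts near zero,
# so the maximum no longer matches the latest reading.
samples_with_reset = [1_000, 4_000, 9_000, 200, 700]
latest_after_reset = samples_with_reset[-1]
max_after_reset = max(samples_with_reset)

print(avg_value, max_value, latest_after_reset, max_after_reset)
```

The second half shows why ranking by last value can be more accurate than the maximum once resets are involved.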

Divergence from datavis proposal

Moving some metrics from the top section into the aspect-specific sections (swap usage, packet loss) - these seem too specific to me and don't help when trying to get a "general understanding" of the health of the host
Adding the host switcher to the top of the dashboard - it takes away
Giving the aspect-specific sections a header instead of simple separators - IMHO it doesn't hurt the overall look of the dashboard and it helps setting the context
Not duplicating as many metrics for the aspect-specific sections - I didn't want to take away as much space from the time series chart, in some places I did duplication to make the dashboard look more consistent. Happy to discuss this part
Not adding a "Top hosts by memory usage" heatmap to the system overview - it wasn't there in the existing dashboard, not sure whether it makes sense
Using tables instead of horizontal bar charts for the top processes / interfaces / mountpoints and putting them into their respective section instead of grouping them at the bottom. Visually I like them better in the bottom, but for investigation I think it's helpful to have them next to the chart they relate to. I also picked a table so the list can grow larger than the available space and start scrolling - IMHO in this situation it's a helpful model to discover more

cc @gvnmagni

Problems/Leftovers

I couldn't do some things because they are either not available in Lens yet at all or not at the current version

Rank by last value not in 8.1 - could be used for the disk usage per mountpoint table instead of max (it's more correct when resets are happening) - this is available in 8.2
As mentioned above - some of the "top processes/interfaces/... by x" visualizations use "last value" - this is not possible in Lens yet, we are working on this feature: [Lens] Window config for last value, counter rate and average/percentile kibana#132112 (might be available in 8.4)
Nicely looking metrics are not available, we are working on this [Lens] Implement new metric grid visualization kibana#134242 (might be available in 8.4)
Formula can't be time scaled in Lens (prevents conversion of disk.io traffic over time) - this is available in 8.3
Breakdown can't be collapsed (prevents conversion of network in/out over time for bytes/packets) - this is available in 8.3
Can't use the new input controls - the ones I used got deprecated in 8.3 in favor of a new implementation but they need to be integrated with drilldowns [Controls] Drilldown and Links panel Integration kibana#136650 - unclear when available

How to test this PR locally

For each visualization, validate whether the configuration still makes sense - I'm not that familiar with the dashboard and may have made mistakes
Take special care with the points marked with a warning triangle in the list above - I'm not 100% sure about these changes myself, but the others might be problematic too

Screenshots

elasticmachine · 2022-06-30T09:03:38Z

💚 Build Succeeded

Build stats

Start Time: 2022-09-27T15:01:51.587+0000
Duration: 18 min 50 sec

Test stats 🧪

Test	Results
Failed	0
Passed	246
Skipped	0
Total	246

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.

elasticmachine · 2022-06-30T09:03:59Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	100.0% (`3/3`)	💚
Files	100.0% (`4/4`)	💚 2.688
Classes	100.0% (`4/4`)	💚 2.688
Methods	60.759% (`48/79`)	👎 -29.538
Lines	98.793% (`2702/2735`)	👍 7.408
Conditionals	100.0% (`0/0`)	💚

cmacknz · 2022-06-30T12:21:16Z

Added @joshdover as reviewer to get someone with more Kibana experience to look at this. Josh, feel free to assign someone else if needed.

joshdover · 2022-07-18T12:20:09Z

@flash1293 This is a huge improvement in visual layout and use of new Kibana features. I exported the dashboard to my personal cluster and I found it a lot easier to use. 🎉

⚠️ It shows the top 1000 hosts by CPU, the table can be client-side ordered by memory directly on the dashboard, but it will only consider the already loaded 1000 hosts - does this make sense or are two tables needed?(one for top by CPU, one for top by memory)

I think being able to find the top memory consumers is important, regardless of CPU usage. If we have to do a separate table for this we probably should or we should add a similar heatmap for Top Hosts by Memory Usage.

⚠️ The hosts table is not using "last value" mode, instead it takes the average of cpu/memory over the full time range - does this make sense?

For this comment + all others regarding last value mode:

I think this is tricky because there are two use cases:

Help me find what is the problem with my system right now
Help me determine what processes are consuming the most resources in <time range>

The previous behavior showed the "realtime" value. My guess is the most helpful metric is the "what is happening right now" case, but makes the tables useless for the other use case. Should we have separate dashboards or visualizations for these different use cases? I'm guessing this is a problem we need to help solve for all integrations.

IMO this is the biggest outstanding problem with this PR.

Question about the drilldowns: it seems that clicking on the table rows doesn't open the host drilldown, but clicking the context menu works. Is there a bug here, because I see the drilldown is configured for "table row click" but that doesn't seem to work?

Related to that, it'd be great for us to expand on this in the future and have a process dashboard that can be drilled down into from the host overview board.

joshdover · 2022-07-18T12:23:03Z

Also, can the CPU heatmap on the overview dashboard also drilldown to the host dashboard on click?

flash1293 · 2022-07-18T12:45:03Z

Thanks for the review @joshdover . I think I can address most of your points somehow - I will report back on these once I have a solution.

about

The previous behavior showed the "realtime" value. My guess is the most helpful metric is the "what is happening right now" case, but makes the tables useless for the other use case. Should we have separate dashboards or visualizations for these different use cases? I'm guessing this is a problem we need to help solve for all integrations.

An important note is that the tsvb vis would show the average of the last few seconds/minutes, but still order the list by the overall average, so the shown entries even in the current dashboard are not necessarily the top consumers. We have a feature on our near term list (8.5 or even 8.4 if we hurry) to allow to do the same thing in Lens (order by overall time range, show the last few minutes in the table). However I’m not sure whether it’s the best behavior as it’s hard to understand for a user - it’s a bit averaged over the full time range and a bit “current state”.

to make it consistent there are two approaches:

Show data for the full time range (like it is right now) - if the user wants to see the latest state, they have to set the dashboard time range to 30s
Show the very last state - take the very last value, order the list by that and also show that very last value (not averaging at all, just show the value from the last document in the current time range that has the field defined).
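
The difference between the two approaches can be sketched with a small example. The host names and CPU values are invented for illustration; this is not how Lens or TSVB compute the tables internally.

```python
from statistics import mean

# Hypothetical (timestamp, cpu_pct) samples per host, oldest first.
samples = {
    "host-a": [(1, 90.0), (2, 80.0), (3, 10.0)],
    "host-b": [(1, 20.0), (2, 30.0), (3, 95.0)],
}


def rank_by_average(data):
    """Order hosts by their average value over the full time range."""
    return sorted(data, key=lambda h: mean(v for _, v in data[h]), reverse=True)


def rank_by_last_value(data):
    """Order hosts by the value of their most recent sample."""
    return sorted(
        data,
        key=lambda h: max(data[h], key=lambda s: s[0])[1],
        reverse=True,
    )


print(rank_by_average(samples))
print(rank_by_last_value(samples))
```

With these values the two rankings disagree: the host that was busy earlier leads on average, while the host that is busy now leads on last value.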

What do you think? Should we keep the current tsvb behavior or go one of the other ways?

joshdover · 2022-07-18T13:09:54Z

What do you think? Should we keep the current tsvb behavior or go one of the other ways?

Thanks for clarifying how this currently works and I do think this likely matches what we want, but the downside is we'd need to bump the minimum package version to 8.4 or 8.5 which isn't super desirable.

I'd be ok keeping the current behavior in the PR over increasing the minimum version right now. As you mentioned the user can still get the latest value using a shorter time span, which I think is better than only solving one of the use cases or increasing the minimum version right now.

We should track some of the improvements we'd like to make the next time we increase the minimum version. I think we'll want to try supporting at least 2-3 minor releases back from the latest public one. Today the latest release is 8.3.x, so bumping the minimum to 8.1 seems fine.

flash1293 · 2022-07-18T13:20:07Z

I think we'll want to try supporting at least 2-3 minor releases back from the latest public one. Today the latest release is 8.3.x, so bumping the minimum to 8.1 seems fine.

Actually that’s an important point - in the original PR description I listed things that can be improved in future versions (most notably the new metric vis). When discussing this with @akshay-saraswat we concluded bumping to the latest minor would be justifiable (as users of older stack versions can still use the existing legacy assets). Do you think we should make a lag of 2 minors a rule?

flash1293 · 2022-07-18T13:24:20Z

Another important point I just realized - the url drilldown used to link from the overview to the detail dashboard is not part of the basic subscription. I guess we need to make sure to only use basic features, is that assumption correct?

joshdover · 2022-07-18T14:44:05Z

Do you think we should make a lag of 2 minors a rule?

We should probably have some policy on this, but I don't think we do. I think the owning team of this package should chime in on how often we make bug fixes to this package that need to be backported to previous minors. cc @cmacknz

Another important point I just realized - the url drilldown used to link from the overview to the detail dashboard is not part of the basic subscription. I guess we need to make sure to only use basic features, is that assumption correct?

Hmm that will be a breaking change of sorts since the current dashboard does support this (is it also only on basic?).

I still think it'd be good to include it assuming that there is some graceful fallback when not on basic and nothing breaks. May need to update the copy to not mention the drilldown feature though.

flash1293 · 2022-07-18T14:59:08Z

The problem with drilldown is that I can’t link the filter up with the input control on the details dashboard. But in 8.3 there’s a new input controls feature that might not have this problem (I have to check though)

flash1293 · 2022-07-22T14:47:17Z

Addressed some points:

Two tables on the overview page - one ordered by CPU, one ordered by memory
Switched to filter-based drilldown as table context url drilldown is not available in basic
Added another heatmap for memory usage over time
Added drilldowns to the heatmaps, too
- ⚠️ The heatmap always selects both host and time range and there's no way to select only one of them. However it still seems useful to me
Removed input control on host overview page as it can't be integrated with multiple dashboards

The biggest open question is

The previous behavior showed the "realtime" value. My guess is the most helpful metric is the "what is happening right now" case, but makes the tables useless for the other use case. Should we have separate dashboards or visualizations for these different use cases? I'm guessing this is a problem we need to help solve for all integrations.

@cmacknz @joshdover could you have a look and check whether it makes sense like this?

joshdover · 2022-08-04T10:51:43Z

Removed input control on host overview page as it can't be integrated with multiple dashboards

@flash1293 What does this mean exactly?

We have a feature on our near term list (8.5 or even 8.4 if we hurry) to allow to do the same thing in Lens (order by overall time range, show the last few minutes in the table).

A couple questions on this feature:

Is it available yet?
Would it be possible to both show the overall average in the time range and the last value?

flash1293 · 2022-08-04T11:06:12Z

@joshdover

What does this mean exactly?

When navigating from one dashboard to another via filter drilldown, then the filter will be set on the target dashboard, but if there is an input control on the same field it won't be picked up as the "current value". So the user ends up with a regular filter pill and an empty select box - if they select something from the select box, a second filter pill is added, effectively filtering out everything. The user would need to remove the filter pill first and reselect which is pretty confusing.
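
A minimal sketch of why the second filter pill empties the dashboard, assuming filters on the same field are combined with a logical AND. The documents and host names are hypothetical.

```python
docs = [{"host.name": "host-a"}, {"host.name": "host-b"}]


def apply_filters(documents, filters):
    """Keep documents that match every (field, value) filter (logical AND)."""
    return [d for d in documents if all(d.get(f) == v for f, v in filters)]


drilldown_pill = ("host.name", "host-a")     # set by the filter drilldown
control_selection = ("host.name", "host-b")  # picked later in the input control

only_pill = apply_filters(docs, [drilldown_pill])
both = apply_filters(docs, [drilldown_pill, control_selection])

print(len(only_pill), len(both))
```

No document can have two different values for the same single-valued field, so the combined result is empty until the original pill is removed.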

A couple questions on this feature:
Is it available yet?
Would it be possible to both show the overall average in the time range and the last value?

I merged it yesterday, so it will be in 8.5

We can show the overall average and the very last value (top metric of the field sorted by timestamp descending) - would that make sense?
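
For reference, an Elasticsearch request body that returns both values per host could look roughly like the sketch below, using the `avg` and `top_metrics` aggregations. The field names and the aggregation names (`average_cpu`, `last_cpu`) are assumptions for illustration, and this is not necessarily the request Lens generates.

```python
# Field name is an assumption for illustration; adjust to the actual data stream.
CPU_FIELD = "system.cpu.total.norm.pct"


def hosts_table_aggs(size: int = 10) -> dict:
    """Build a terms aggregation with an average and a last-value metric per host."""
    return {
        "hosts": {
            "terms": {
                "field": "host.name",
                "size": size,
                "order": {"average_cpu": "desc"},
            },
            "aggs": {
                # Average over the full time range of the query.
                "average_cpu": {"avg": {"field": CPU_FIELD}},
                # Value from the most recent document (timestamp descending).
                "last_cpu": {
                    "top_metrics": {
                        "metrics": {"field": CPU_FIELD},
                        "sort": {"@timestamp": "desc"},
                    }
                },
            },
        }
    }


aggs = hosts_table_aggs(size=5)
print(sorted(aggs["hosts"]["aggs"]))
```

To match "the last document that has the field defined", the last-value metric may additionally need to be restricted to documents where the field exists.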

joshdover · 2022-08-08T10:36:54Z

We can show the overall average and the very last value (top metric of the field sorted by timestamp descending) - would that make sense?

I think if there were two separate columns and clearly labeled it would make sense ("Average CPU" and "Last CPU"?). We could compare this to what other visualizations are doing in other packages but I think we're trying to define what should be the best practice pattern here and I'm not sure looking at what we've done elsewhere is helpful.

WDYT?

flash1293 · 2022-08-08T10:43:58Z

Agreed. I like "average" vs "last" - it's less vague than "average over the last few seconds" which is what TSVB is doing at the moment. Gonna update the PR

flash1293 · 2022-08-12T12:50:50Z

@joshdover Split up all the metrics in the tables into "Average" and "Last value":

It seems helpful to me - the space is there to show another value and it's better information than just the current value or just the average

joshdover

LGTM - did not manually test the latest iteration. Thanks for all your work on this @flash1293! 🎉

nimarezainia

Thanks for these changes. I haven't been able to personally view the changes, but based on the discussions this looks like a great improvement. Once I have access to 8.5, I'll provide more feedback if required.

joshdover · 2022-08-29T18:26:25Z

@nimarezainia As mentioned, we can have some folks test this before we merge these changes. This can be done by downloading these two files and then importing them into Kibana from Stack Management > Saved Objects. I'd recommend creating a test space to do this in since it will override the dashboards from the integration that is currently installed:

https://raw.githubusercontent.com/flash1293/integrations/system-dashboard-rework/packages/system/kibana/dashboard/system-79ffd6e0-faa0-11e6-947f-177f697178b8.json
https://raw.githubusercontent.com/flash1293/integrations/system-dashboard-rework/packages/system/kibana/dashboard/system-Metrics-system-overview.json

flash1293 · 2022-08-29T18:30:43Z

Slight correction - the files can’t be imported directly as they aren’t in the right format (elastic-package is doing some transformations). I will provide a proper export tomorrow and send it around.

flash1293 · 2022-08-31T14:43:07Z

OK, as discussed offline I reverted the Lens tables on the system overview dashboard back to the TSVB top n visualizations for one-click-drilldown functionality (with adjusted color schemes):

@joshdover @nimarezainia The latest state as importable file can be found here: https://gist.githubusercontent.com/flash1293/d3a8b167ad91576f9c9a770d163e1b20/raw/505cfe2a74083f957f29e260a435bc14be0560b3/export.ndjson

Save this link as an ndjson file, then go to Stack management > Saved object management and import it there (this should work for every instance that is receiving system metric data via the integration >= 8.1.0)

ruflin · 2022-09-12T09:38:16Z

@joshdover @cmacknz @nimarezainia @jlind23 It would be great to get this over the line. This is not only about the system dashboard itself which is a huge improvement but it will also serve as an example for many other integrations on how we should build the dashboards.

joshdover · 2022-09-12T10:23:10Z

I'm good with this being merged. @nimarezainia did you still want to get additional feedback from SAs or are you happy with the modifications @flash1293 made?

nimarezainia · 2022-09-21T14:45:48Z

@joshdover sorry I haven't been able to take care of this. if you are all happy let's merge and I will try and find SAs to review it.

drewdaemon

Beautiful work. Only one comment from my side.

I think the directions in the System Overview markdown panel are out of date since the tables got switched back to TSVB.

Also, "table below" should probably be changed to "tables below."

Approving anyway so as not to hold this PR up.

elasticmachine · 2022-09-26T09:10:02Z

🚀 Benchmarks report

Package `system` 👍(1) 💚(1) 💔(1)

Data stream	Previous EPS	New EPS	Diff (%)	Result
`syslog`	16393.44	9433.96	-6959.48 (-42.45%)	💔

To see the full report comment with /test benchmark fullreport

joshdover · 2022-09-27T12:15:31Z

@flash1293 I say we ship this. If we get user complaints, it's not hard to revert and release another update.

flash1293 · 2022-09-27T15:03:58Z

Alright, I just removed the unused visualizations - if the build goes green I'm going to merge.

flash1293 · 2022-09-27T15:28:41Z

@cmacknz could you take it from here in terms of releasing?

cmacknz · 2022-09-29T01:28:53Z

Yes I can promote the integration, thanks!

flash1293 added 3 commits June 29, 2022 19:18

rework system dashboard

a1c5fd7

some updates

611f51f

some updates

c2cfab3

flash1293 added the enhancement New feature or request label Jun 30, 2022

flash1293 requested review from a team as code owners June 30, 2022 08:46

flash1293 requested review from cmacknz and kvch June 30, 2022 08:46

cmacknz requested review from fearful-symmetry and joshdover June 30, 2022 12:20

cmacknz removed the request for review from kvch June 30, 2022 12:21

flash1293 added 3 commits July 22, 2022 16:14

Merge remote-tracking branch 'origin/main' into system-dashboard-rework

7ee34dc

remove merge marker

2a58ce0

review comments

ec778ac

flash1293 added 2 commits August 12, 2022 14:48

Merge remote-tracking branch 'origin/main' into system-dashboard-rework

c7f0d4f

review comments

ed5eb80

joshdover approved these changes Aug 23, 2022

cmacknz approved these changes Aug 24, 2022

mukeshelastic requested a review from nimarezainia August 29, 2022 08:18

nimarezainia approved these changes Aug 29, 2022

flash1293 added 2 commits August 31, 2022 15:47

Merge remote-tracking branch 'origin/main' into system-dashboard-rework

f51ff93

switch back to topn for the top host lists=

b1d8954

drewdaemon approved these changes Sep 23, 2022

flash1293 and others added 2 commits September 26, 2022 10:52

Update system-Metrics-system-overview.json

37adf18

Merge branch 'main' into system-dashboard-rework

18e696a

flash1293 added 2 commits September 27, 2022 16:45

Merge remote-tracking branch 'upstream/main' into system-dashboard-re…

7e8c8c0

…work

remove unused visualizations

1bfb8ca

flash1293 merged commit a57592a into elastic:main Sep 27, 2022

drewdaemon mentioned this pull request Dec 19, 2022

Make system integration the gold standard for Kibana best practices #4868

Open

11 tasks

drewdaemon mentioned this pull request Jan 30, 2023

[System] Issue with short time frame (5min) #1437

Closed
