
[Monitoring] Scale alerts in UI #80397

Closed
chrisronline opened this issue Oct 13, 2020 · 21 comments
Labels
enhancement New value added to drive a business result Team:Monitoring Stack Monitoring team

Comments

@chrisronline
Contributor

Screen Shot 2020-10-13 at 3 00 54 PM

Eventually, this will not scale to the large number of alerts we plan to add. We need to think of a way to show only some of them and provide a friendly way to see them all.

@chrisronline chrisronline added enhancement New value added to drive a business result Team:Monitoring Stack Monitoring team labels Oct 13, 2020
@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@ravikesarwani
Contributor

We have a few other usability issues that we should look at solving as well.
I'm adding them here so we can have a brainstorming session on how to make this user experience better.

Show this information in a more concise manner.
Screen Shot 2020-10-15 at 2 16 41 PM

We need some organization and unique information to distinguish these alerts.
Screen Shot 2020-07-22 at 10 36 47 AM

@igoristic
Contributor

igoristic commented Oct 20, 2020

For the 8 CPU alert notifications, I think we can group them into 5-minute (or some other interval) time buckets, e.g.:

1 minute ago  
2 High Disk Usage   >
--------------
5 minutes ago  
2 High CPU Usage    >
--------------
1 hour ago  
5 High CPU Usage    >
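
A minimal sketch of how that bucketing could work (the `AlertNotification` shape and the five-minute constant are assumptions for illustration, not the monitoring plugin's actual types):

```ts
// Hypothetical notification shape; the real alert objects differ.
interface AlertNotification {
  alertName: string; // e.g. "High CPU Usage"
  firedAt: number;   // epoch millis
}

const BUCKET_MS = 5 * 60 * 1000; // five-minute buckets, per the suggestion above

// Group notifications into time buckets and count duplicates of the same alert,
// so each bucket can render as "2 High CPU Usage / 5 minutes ago".
function bucketNotifications(notifications: AlertNotification[]) {
  const buckets = new Map<number, Map<string, number>>();
  for (const { alertName, firedAt } of notifications) {
    const bucketStart = Math.floor(firedAt / BUCKET_MS) * BUCKET_MS;
    const counts = buckets.get(bucketStart) ?? new Map<string, number>();
    counts.set(alertName, (counts.get(alertName) ?? 0) + 1);
    buckets.set(bucketStart, counts);
  }
  // Most recent bucket first, matching the mock above.
  return [...buckets.entries()].sort(([a], [b]) => b - a);
}
```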

And we could also do some kind of grouping/categorization when editing alerts, e.g.:

Usage Alerts        >
--------------
Query Alerts        >
--------------
Monitoring Alerts   >

Clicking on Usage Alerts would go to the next menu, e.g.:

CPU Usage      >
--------------
Disk Usage     >
--------------
Memory Usage   >
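
One way to get that two-level drill-down with existing EUI building blocks is nested EuiContextMenu panels. A rough sketch, using the category names from the mock above (everything else is illustrative):

```tsx
import React from 'react';
import { EuiContextMenu, EuiContextMenuPanelDescriptor } from '@elastic/eui';

// Panel 0 lists the categories; choosing "Usage Alerts" drills into panel 1.
const panels: EuiContextMenuPanelDescriptor[] = [
  {
    id: 0,
    title: 'Edit alerts',
    items: [
      { name: 'Usage Alerts', panel: 1 },
      { name: 'Query Alerts', panel: 2 },
      { name: 'Monitoring Alerts', panel: 3 },
    ],
  },
  {
    id: 1,
    title: 'Usage Alerts',
    items: [
      { name: 'CPU Usage', onClick: () => {} },
      { name: 'Disk Usage', onClick: () => {} },
      { name: 'Memory Usage', onClick: () => {} },
    ],
  },
  // Stubs so the other categories also resolve in this sketch.
  { id: 2, title: 'Query Alerts', items: [] },
  { id: 3, title: 'Monitoring Alerts', items: [] },
];

export const EditAlertsMenu = () => (
  <EuiContextMenu initialPanelId={0} panels={panels} />
);
```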

But I really think we should get a UI/UX designer involved to help us out.

@ravikesarwani
Contributor

I will spend some time thinking through these UX issues and propose possible presentation options.

cc: @katrin-freihofner I know the UX team is super busy in the 7.11 timeframe, but at least you can follow the discussion and maybe keep us straight if we are going completely off the rails.

@chrisronline
Contributor Author

I think there is progress we can make until design can help us more, such as grouping alerts better per Igor's comment.

I have opened #81569 to explore some of these ways.

@ravikesarwani
Contributor

Organizing alerts in "setup mode"
As we add more alerts in 7.11, the ES node list will get long.
In 7.11 we are adding the following alerts:

  • Write threadpool rejects
  • Search threadpool rejects
  • CCR Read Exceptions
  • Max shard size

We need to organize the alerts into categories for easy visualization.

Consolidate the alerts on the ES nodes under the following submenus (see the sketch after this list):

  • Cluster health
    - Nodes changed
    - Version mismatch
    - Max shard size
  • Resource utilization
    - CPU usage
    - Disk usage
    - Memory usage (JVM)
  • Errors and exceptions
    - Missing monitoring data
    - Write threadpool rejects
    - Search threadpool rejects
    - CCR Read Exceptions
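
A possible shape for that grouping, as a plain lookup the setup-mode UI could render from (the constant name is illustrative; the alert names are the ones listed above):

```ts
// Illustrative mapping of the 7.11 Elasticsearch node alerts into
// the three categories proposed above.
const ES_NODE_ALERT_CATEGORIES: Record<string, string[]> = {
  'Cluster health': ['Nodes changed', 'Version mismatch', 'Max shard size'],
  'Resource utilization': ['CPU usage', 'Disk usage', 'Memory usage (JVM)'],
  'Errors and exceptions': [
    'Missing monitoring data',
    'Write threadpool rejects',
    'Search threadpool rejects',
    'CCR Read Exceptions',
  ],
};
```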

@ravikesarwani
Contributor

Firing alerts
When looking at the firing alerts on the overview pages we currently show the alerts sorted by timestamp and display the alert name. When multiple nodes of the cluster have the same alert firing, there is no way to distinguish them and help the user drill down to the right one.

When there are 8 items or fewer:
Show a flat list sorted by most recent, with the following 3 values shown at the top level:

  • Timestamp
  • Alert name
  • Node name with link

Screen Shot 2020-10-26 at 10 36 03 AM

When there are more than 8 alerts firing, we group the alerts by node at the top level.
The hypothesis is that investigation and any fix will be done by the admin on a per-node basis, so grouping by node when many alerts are firing can help them focus on fixing issues in a more organized manner.

  • Node with link (# of alerts)
    • Time stamp, Alert

Screen Shot 2020-10-26 at 10 36 43 AM

We should allow the user to switch between a flat, timestamp-sorted list (like right now, with node name and link added) and the grouped-by-node view. We can choose the default view based on the number of firing alerts, but the user can switch between them with a toggle button at any time.
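
A minimal sketch of that default-view rule, assuming the threshold of 8 suggested above (the `FiringAlert` shape and helper names are illustrative, not the monitoring plugin's actual types):

```ts
interface FiringAlert {
  alertName: string;
  nodeName: string;
  timestamp: number; // epoch millis
}

type AlertListView = 'flat' | 'byNode';

// Default to a flat, most-recent-first list for small counts; group by node
// once more than 8 alerts are firing. A toggle lets the user override this.
function defaultView(alerts: FiringAlert[]): AlertListView {
  return alerts.length > 8 ? 'byNode' : 'flat';
}

function flatList(alerts: FiringAlert[]): FiringAlert[] {
  return [...alerts].sort((a, b) => b.timestamp - a.timestamp);
}

function groupByNode(alerts: FiringAlert[]): Map<string, FiringAlert[]> {
  const groups = new Map<string, FiringAlert[]>();
  for (const alert of alerts) {
    const list = groups.get(alert.nodeName) ?? [];
    list.push(alert);
    groups.set(alert.nodeName, list);
  }
  return groups;
}
```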

@ravikesarwani
Contributor

On the node details page we currently show each firing alert, with all the investigative suggestions expanded, at the top. If there are many alerts, that takes up a lot of space at the top, pushing the node metrics lower and requiring more scrolling by the user.
Screen Shot 2020-10-26 at 11 10 08 AM

Can we add an expand/collapse design for the firing alerts?
Each alert is shown on one line with an expand arrow on the right; clicking it expands the details and shows the options for the investigative workflows.
Something like this:
Screen Shot 2020-10-26 at 11 10 34 AM
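
With EUI this could be as simple as wrapping each firing alert in an accordion; a rough sketch, where the `FiringAlertSummary` shape and the component name are assumptions for illustration:

```tsx
import React from 'react';
import { EuiAccordion, EuiText } from '@elastic/eui';

// Hypothetical summary shape for a firing alert on the details page.
interface FiringAlertSummary {
  id: string;
  title: string;         // e.g. "CPU usage alert is firing"
  investigation: string; // the investigative suggestions currently shown expanded
}

// One collapsed row per firing alert; expanding it reveals the investigative
// workflow, keeping the node metrics visible without extra scrolling.
export const CollapsibleAlert = ({ alert }: { alert: FiringAlertSummary }) => (
  <EuiAccordion id={alert.id} buttonContent={alert.title} paddingSize="s">
    <EuiText size="s">
      <p>{alert.investigation}</p>
    </EuiText>
  </EuiAccordion>
);
```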

@chrisronline
Contributor Author

How do we feel about organizing alerts based on severity level? It's a concept we have, but I don't think we are leveraging it much. Right now we show badges in the UI for both warning and danger severity level alerts, but it doesn't seem we are giving much weight to this concept moving forward.

Should we keep it and organize each alert under one of them? Or should we move away from that level of categorization?

@ravikesarwani
Contributor

To me, defining severity to distinguish alerts works only for a very minimal set of use cases.

Is disk capacity a "warning" or a "danger"? I would say it depends on how full the disk is:
80% full is a warning to me, and 85% (when ES will start to behave differently) maybe a danger.
Currently our alerts don't have multiple threshold levels, but when we do, we can tie severity to them and display it visually.
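
If we do add multiple thresholds, the mapping could be as simple as the sketch below, using the 80% / 85% figures from this comment (the function and where it would live are illustrative only):

```ts
type Severity = 'warning' | 'danger';

// Illustrative: derive the severity of a disk usage alert from how far past
// each threshold the node actually is.
function diskUsageSeverity(percentUsed: number): Severity | undefined {
  if (percentUsed >= 85) return 'danger';  // ES starts to behave differently
  if (percentUsed >= 80) return 'warning';
  return undefined; // below both thresholds: nothing fires
}
```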

@chrisronline
Contributor Author

chrisronline commented Nov 11, 2020

I have some screenshots to share from the work I've been doing here. Please let me know if this matches expectations or where we need to make corrections.

Firing mode

Screen Shot 2020-11-11 at 3 31 49 PM

Screen Shot 2020-11-11 at 3 33 10 PM

Screen Shot 2020-11-11 at 3 33 19 PM

Screen Shot 2020-11-11 at 3 33 23 PM

Screen Shot 2020-11-11 at 3 33 29 PM

Screen Shot 2020-11-11 at 3 33 35 PM

Screen Shot 2020-11-11 at 3 33 39 PM

Setup mode

Screen Shot 2020-11-11 at 3 34 52 PM

Screen Shot 2020-11-11 at 3 36 32 PM

Screen Shot 2020-11-11 at 3 36 36 PM

Screen Shot 2020-11-11 at 3 36 41 PM

Screen Shot 2020-11-11 at 3 36 45 PM

@ravikesarwani
Contributor

@chrisronline looks really great. Awesome work here.

One minor question/suggestion I had was around the sorting of the list. For "Firing mode", can we sort the list by most recent alert whenever possible?

@chrisronline
Contributor Author

@ravikesarwani Should we sort within the defined categories? Or keep the categories ordering constant and just sort the list of firing alerts within each category by most recent?

@chrisronline
Contributor Author

I also have an update on the detail pages:

Screen Shot 2020-11-12 at 3 05 08 PM

Screen Shot 2020-11-12 at 3 05 19 PM

Screen Shot 2020-11-12 at 3 05 23 PM

Any thoughts appreciated on the direction here too.

@ravikesarwani
Contributor

I like the "configure" option being added here in the details page.

While the design you propose here will do the job, I see expand/collapse as a little more user friendly (compared to a popup) for the following reasons:

  • It allows users to expand an alert, work on the page, explore graphs, etc., and come back to review the alert details again without having to find and click the right link again (to show the popup).
  • Multiple alerts can be expanded and reviewed at the same time, when needed.
  • We will have more space to work with. This could come in handy as more alerts are added or enhancements are made to improve the investigative workflows.

@ravikesarwani
Contributor

@ravikesarwani Should we sort within the defined categories? Or keep the categories ordering constant and just sort the list of firing alerts within each category by most recent?

Just sorting the list of firing alerts within each category should be a good start.
We do have to think a little bit about dynamic updates and not change the ordering while the user has the popup open.

@chrisronline
Contributor Author

@ravikesarwani How about:

Screen Shot 2020-11-16 at 1 36 52 PM

Screen Shot 2020-11-16 at 1 36 56 PM

Screen Shot 2020-11-16 at 1 37 01 PM

@ravikesarwani
Contributor

Thanks, I like this better and feel it makes for a better customer experience. Great work here!
A few minor comments:

  • Did you take out the "bell" icon? I thought it was a good icon to indicate an alert.
  • Can we make the arrow ">" a little more visible, sort of like a call to action? I don't want the user to miss that action.

@chrisronline
Contributor Author

Can we make the arrow ">" a little more visible, sort of like a call to action? I don't want the user to miss that action.

Unfortunately, we can't, as the component doesn't allow it and the EUI team isn't in favor of making the size configurable.

Here is a new screenshot with the icon back in:

Screen Shot 2020-11-17 at 11 59 52 AM

@ravikesarwani
Contributor

ravikesarwani commented Nov 18, 2020

Let's get this code reviewed, tested & merged. Very helpful usability improvements.

@chrisronline
Contributor Author

Resolved with #81569
