[Stack Monitoring] [Test Scenario] Out of the box alerting #85841

chrisronline · 2020-12-14T19:30:11Z

Summary

Stack Monitoring provides a set of out-of-the-box alerts, which are created when the Stack Monitoring UI is loaded in Kibana. The default action for each alert is a server log, and the action messaging is controlled directly by the Stack Monitoring UI code.

PRs

Original, and CPU alert: #68805
Disk usage alert: #75419
JVM memory usage alert: #79039
Missing monitoring data alert: #78208
Threadpool rejections alert: #79433

Testing

Creation

Ensure alerts are created upon visiting the Stack Monitoring UI
Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts

Management

Ensure that you can view and manage these alerts when Setup Mode is active
Ensure you can properly add an additional action (to any alert) and it works as expected

UX

Ensure you can see all created alerts while in Setup Mode
Ensure you can properly navigate between display modes of alerts ([Monitoring] Some progress on making alerts better in the UI #81569)

Specific alerts

Edge cases

Ensure a CCS environment works as expected
Ensure Setup Mode works as expected on cloud ([Monitoring] Ensure setup mode works on cloud but only for alerts #73127)
Ensure a dedicated monitoring cluster environment works as expected
Ensure a quality UX with multiple Kibanas. We recommend a dedicated Kibana for a dedicated monitoring cluster; "duplicate" alerts can potentially exist based on what the user is doing in each Kibana, but they should be able to easily disable the ones they don't want.
Ensure the Stack Monitoring UI works properly when alerts are not available (both when SSL is disabled and when an encryption key is not set: [Monitoring] Fix UI error when alerting is not available #77179)
Ensure Stack Monitoring alerts are not editable or createable in the Alerts & Management UI ([Monitoring] Prevent edit/create for Stack Monitoring alerts in Alerts Management #77097)

elasticmachine · 2020-12-15T17:10:05Z

Pinging @elastic/stack-monitoring (Team:Monitoring)

Zacqary · 2021-01-05T20:20:12Z

> Ensure you can properly navigate between display modes of alerts

Not quite sure how to test this one. Does this state only happen if you're monitoring multiple clusters?

chrisronline · 2021-01-05T20:28:22Z

@Zacqary

Sure! In #81569, we introduce two ways to view alerts in the UI: grouped by "node/instance" or grouped by alert type. That line item means we should just verify that you can toggle between the two, that both of them make sense, and that the grouping looks right.
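A minimal sketch of the two display modes described above, assuming each firing alert can be reduced to an (alert type, node) pair. The record shape, the sample values, and the function name `group_alerts` are hypothetical and are not taken from the Kibana code.

```python
from collections import defaultdict

# Hypothetical flattened view of firing alerts: one record per (alert type, node).
firing = [
    {"type": "CPU usage", "node": "node-1"},
    {"type": "Disk usage", "node": "node-1"},
    {"type": "CPU usage", "node": "node-2"},
]


def group_alerts(alerts, key):
    """Group alert records by `key` ("node" or "type"), listing the other field."""
    other = "type" if key == "node" else "node"
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert[key]].append(alert[other])
    return dict(groups)


by_node = group_alerts(firing, "node")  # one entry per node, listing alert types
by_type = group_alerts(firing, "type")  # one entry per alert type, listing nodes
```

When testing the toggle, the same set of firing alerts should appear in both views, only arranged differently.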

Zacqary · 2021-01-05T20:30:13Z

@chrisronline Got it, is it expected that the toggle switch doesn't appear if you're only monitoring a single node?

chrisronline · 2021-01-05T20:32:27Z

@Zacqary Negative. That's something we should probably optimize in the near future, but it does not do that now.

Zacqary · 2021-01-05T20:42:26Z

Okay, well I'm not seeing the Group by node toggle at all. Guess I can't pass that test case then?

chrisronline · 2021-01-05T20:57:41Z

@Zacqary Oh wow, that's weird. Do you mind posting a screenshot of what you are seeing?

Zacqary · 2021-01-05T21:28:37Z

Just not seeing Group By Node anywhere. This example is with only one node, but I still wasn't able to see it when adding more nodes.

Zacqary · 2021-01-06T17:53:13Z

> Ensure you can properly trigger and see the server log for both threadpool rejection alerts

~~Which ones are these?~~ Never mind, missed em in the Errors and Exceptions section

chrisronline · 2021-01-06T18:12:01Z

@Zacqary I'm a bit dumbfounded. #85719 made it into 7.11, as you can see here, but I don't know why it's not showing up in the UI. I just tested on cloud staging, but I'm going to test locally and see if I can figure out what's up.

Zacqary · 2021-01-06T18:24:09Z

For the threadpool alerts, not sure if I'm doing it right but it doesn't seem to be firing when I:

Set both Write and Read to alert every 1 minute, and fire on 1 rejection
Do a GET .monitoring-es*/_search to find the most recent document with a node_stats field
POST .monitoring-es*/_update/<doc id> to set node_stats.thread_pool.search.rejected to a number greater than 0, same for write.rejected

chrisronline · 2021-01-06T18:44:57Z

@Zacqary #85719 doesn't seem to be in BC1, but it is in the 7.11 branch so maybe we need to wait until BC2 to test some of this.

chrisronline · 2021-01-06T19:04:57Z

@igoristic Can you advise on #85841 (comment)?

Zacqary · 2021-01-06T19:12:01Z

@chrisronline I'm running 7.11 locally and still don't see the Group By Node toggle. I actually tried on master first by accident and it also didn't show up.

chrisronline · 2021-01-06T19:27:23Z

@Zacqary Oh, my bad! You are unable to see this functionality while in Setup Mode. It's only accessible for firing alerts. I'd recommend using Setup Mode to set the thresholds for various alerts very low so they trigger, and testing it that way.

Zacqary · 2021-01-06T20:29:49Z

Ah that makes sense. Works outside Setup Mode.

Zacqary · 2021-01-07T17:07:08Z

Do I still need to do something to enable Watcher for the legacy alerts? Seems like I'm having trouble getting the license expiration pipeline to set it off

chrisronline · 2021-01-07T17:23:46Z

You'll need a trial or higher license and that should be it. I'd double check that the watches exist, and then you can do something like:

POST .monitoring-es-*/_search?filter_path=hits.hits._source.license
{
  "size": 1,
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "type": {
        "value": "cluster_stats"
      }
    }
  }
}

to verify the pipeline is properly changing the document
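For scripted checks, the same verification could be sketched as below. This is only an illustration: it builds the request body shown above and pulls the license object out of a response trimmed by `filter_path=hits.hits._source.license`. The helper names and the sample response values are hypothetical.

```python
def latest_license_query():
    """Build the same search body as the console request above."""
    return {
        "size": 1,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {"term": {"type": {"value": "cluster_stats"}}},
    }


def extract_license(response):
    """Return the license object from a filter_path-trimmed response, or None."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0].get("_source", {}).get("license")


# Hypothetical response body; the field values are made up for illustration.
sample_response = {
    "hits": {"hits": [{"_source": {"license": {"type": "trial", "status": "active"}}}]}
}
license_info = extract_license(sample_response)
```

An empty response (no `hits`) yields `None`, which would suggest that no `cluster_stats` document matched.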

Zacqary · 2021-01-07T17:28:19Z

Yeah the document's updating, watches don't exist though.

chrisronline · 2021-01-07T17:29:25Z

What license level are you on? If you are on trial+ and using legacy monitoring collection, the watches should be created for you.

Zacqary · 2021-01-07T17:31:30Z

Using Trial and Metricbeat monitoring, alerts are created but no watches.

chrisronline · 2021-01-07T17:39:18Z

Ah, there is a known issue around watch creation and using only metricbeat monitoring: elastic/elasticsearch#51762 (comment)

We're tracking the removal of these watches in 7.12 (#85047), so this bug will be moot, but it's still there for now.

To get around it, please enable legacy monitoring (via the cluster setting) in order for the watches to exist. You can disable legacy monitoring as soon as the watches have been created

Zacqary · 2021-01-07T17:43:34Z

Remind me how to enable legacy monitoring again? It's a dev tools request, right?

chrisronline · 2021-01-07T17:43:50Z

Yea

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}
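The workaround discussed above (enable legacy collection until the watches exist, then turn it back off) amounts to sending two bodies to `PUT _cluster/settings`. A small sketch of those bodies, where the helper name `collection_settings_body` is hypothetical:

```python
import json

SETTING = "xpack.monitoring.collection.enabled"


def collection_settings_body(enabled):
    """Body for PUT _cluster/settings that toggles legacy monitoring collection."""
    return {"persistent": {SETTING: enabled}}


enable_body = collection_settings_body(True)    # send first, then wait for the watches
disable_body = collection_settings_body(False)  # send once the watches have been created

print(json.dumps(enable_body, indent=2))
print(json.dumps(disable_body, indent=2))
```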

Zacqary · 2021-01-07T17:52:00Z

> Ensure Stack Monitoring alerts are not editable or createable in the Alerts & Management UI

Marked this as working, but note that alerts can be deleted from the management UI

chrisronline · 2021-01-07T17:58:40Z

@Zacqary Really? I'm able to do it on staging cloud 7.11:

Zacqary · 2021-01-07T18:07:46Z

@chrisronline Yeah, is that not a bug? There doesn't seem to be a way to create them again

chrisronline · 2021-01-07T18:16:55Z

It's a bit of a confusing flow, I'll admit. Users should be able to delete from there, but the creation is handled purely by the Stack Monitoring UI. In the future, we hope to simplify this but we don't imagine it's a huge point of confusion, as users are most likely not deleting these alerts.

chrisronline · 2021-01-07T19:33:47Z

@ravikesarwani Any thoughts on this experience and if we need to change it?

Zacqary · 2021-01-07T19:35:45Z

Set up a remote cluster using yarn es snapshot --license trial but it's reporting a Basic license instead

Zacqary · 2021-01-07T19:56:48Z

> Ensure you can properly trigger and see the server log for the missing monitoring data alert

Unsure how to get this one to fire. Tried turning off Metricbeat for a while, but that just switched the stack to Self-Monitoring mode and didn't fire any alerts.

chrisronline · 2021-01-07T20:24:04Z

> Set up a remote cluster using yarn es snapshot --license trial but it's reporting a Basic license instead

I'm not sure if that's a bug with yarn es snapshot or what?

> Unsure how to get this one to fire. Tried turning off Metricbeat for a while, but that just switched the stack to Self-Monitoring mode and didn't fire any alerts.

That should work. The collection method doesn't matter here; I'd just be consistent. Monitor the ES node with either legacy collection or Metricbeat for enough time to see the monitoring data show up, then disable whichever one you used and make sure an alert fires in the Kibana server log. If not, then sounds like a bug
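A rough sketch of the idea behind this test, assuming the alert fires once a node's most recent monitoring document is older than some allowed gap. The function name, the 15-minute default, and the timestamps are hypothetical; the actual alert logic lives in the Stack Monitoring code and may differ.

```python
from datetime import datetime, timedelta, timezone


def is_missing_data(last_seen, now, max_gap=timedelta(minutes=15)):
    """True when the newest monitoring document is older than max_gap."""
    return now - last_seen > max_gap


now = datetime(2021, 1, 7, 20, 30, tzinfo=timezone.utc)
recent = now - timedelta(minutes=2)   # collection still running
stale = now - timedelta(minutes=40)   # collection disabled a while ago

still_reporting = is_missing_data(recent, now)
gone_quiet = is_missing_data(stale, now)
```

Under this assumption, collection has to stay disabled for longer than the configured gap before the alert can fire.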

Zacqary · 2021-01-07T20:36:19Z

> If not, then sounds like a bug

Yeah, seems to not be firing, then

ravikesarwani · 2021-01-07T20:43:37Z

> @ravikesarwani Any thoughts on this experience and if we need to change it?

If the user deleted the alert don't they get recreated at the time of stack monitoring UI reload?

chrisronline · 2021-01-07T20:45:45Z

> If the user deleted the alert don't they get recreated at the time of stack monitoring UI reload?

They do, but I think @Zacqary is saying it's not obvious to the user that will happen

Zacqary · 2021-01-07T20:50:35Z

> If the user deleted the alert don't they get recreated at the time of stack monitoring UI reload?

That actually didn't happen for me the first time I tried it, but it's working now that I've tried it again. If I can find a consistent way to reproduce I'll let y'all know.

Zacqary · 2021-01-07T21:51:00Z

Setting up CCS on Cloud doesn't seem to be showing the remote cluster in the Stack Monitoring UI. This is from creating two Cloud deployments and adding one of them as a remote cluster using the Cloud UI, not through Kibana (where Remote Clusters isn't available in Stack Management).

chrisronline · 2021-01-11T15:04:52Z

> Setting up CCS on Cloud doesn't seem to be showing the remote cluster in the Stack Monitoring UI. This is from creating two Cloud deployments and adding one of them as a remote cluster using the Cloud UI, not through Kibana (where Remote Clusters isn't available in Stack Management).

I'm not sure how this works. I did this as well and there is no remote info from GET _remote/info. If that doesn't return anything, CCS will not work in the Stack Monitoring UI.

Zacqary · 2021-01-11T16:32:12Z

I can mark the CCS case as working if we're not targeting the Cloud CCS for this release, but I'd recommend tracking that as an issue for the future.

Still unable to get the two unchecked alerts above (missing monitoring data and threadpool rejection) to fire on my end.

chrisronline · 2021-01-11T16:53:08Z

@Zacqary Thanks.

> missing monitoring data

This one appears to be bugged now so great catch. Fix is #87882

@igoristic Are you able to look into the threadpool rejection alert?

igoristic · 2021-01-14T17:08:51Z

@Zacqary This won't work for triggering the alert, because that value is actually a cumulative counter. By the time the alert query runs, the document you edited will have been superseded by a more recent document that has the counter set to zero, so you won't get the min/max delta that the query was designed to detect.

The only way to do this is to set the size to 0 on a designated node that is monitored (see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html) and then execute a search operation on that specific node. However, I was only able to do this locally, because I couldn't change the thread pool size on cloud.
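To illustrate the point about cumulative counters, here is a minimal sketch assuming the alert compares the maximum and minimum of `rejected` across the documents in its lookback window. The function name and the sample values are hypothetical, and the real check is an Elasticsearch query rather than Python.

```python
def rejection_delta(samples):
    """Max minus min of a cumulative counter over a window of documents."""
    return max(samples) - min(samples) if samples else 0


# Rejections really happening on the node: each new document carries the
# counter forward, so the window shows growth.
real_rejections = [0, 0, 3, 3, 5]

# After hand-editing one older document: by the time the alert runs, the
# window holds newer documents that still report the node's true counter (0).
window_after_manual_edit = [0, 0, 0, 0]

real_delta = rejection_delta(real_rejections)
edited_delta = rejection_delta(window_after_manual_edit)
```

With a threshold of 1 rejection, only the first window would fire under this assumption, which matches the behavior reported earlier in the thread.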

sgrodzicki · 2021-01-21T12:44:13Z

@Zacqary @chrisronline @igoristic where are we with this?

Zacqary · 2021-01-21T17:03:28Z

Sorry, missed the updates in the transition back to 7.12 work. I can try these test cases again asap.

chrisronline added this to the Logs & Metrics UI 7.11.0 test plan milestone Dec 14, 2020

chrisronline added Team:Monitoring Stack Monitoring team test-plan labels Dec 15, 2020

Zacqary self-assigned this Jan 5, 2021

chrisronline mentioned this issue Jan 11, 2021

[Montoring] Use fetchClustersRange #87882

Merged

Zacqary closed this as completed Jan 21, 2021

chrisronline mentioned this issue Mar 1, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #93072

Closed

24 tasks

simianhacker mentioned this issue Apr 29, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #98765

Closed

24 tasks

neptunian mentioned this issue Jul 6, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #104440

Closed

35 tasks
