[Stack Monitoring] [Test Scenario] Out of the box alerting #85841

Closed
23 tasks done
chrisronline opened this issue Dec 14, 2020 · 43 comments
Labels
Team:Monitoring Stack Monitoring team test-plan

Comments

@chrisronline
Contributor

chrisronline commented Dec 14, 2020

Summary

Stack Monitoring provides a set of out-of-the-box alerts, which are created when you load the Stack Monitoring UI in Kibana. The default action for each alert is a server log, and the action message is controlled directly by the Stack Monitoring UI code.

PRs

Original implementation and CPU usage alert: #68805
Disk usage alert: #75419
JVM memory usage alert: #79039
Missing monitoring data alert: #78208
Threadpool rejections alert: #79433

Testing

Creation

  • Ensure alerts are created when you visit the Stack Monitoring UI
  • Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts

Management

  • Ensure that you can view and manage these alerts when Setup Mode is active
  • Ensure you can properly add an additional action (to any alert) and it works as expected

UX

Specific alerts

  • Ensure you can properly trigger and see the server log for the CPU usage alert
  • Ensure you can properly trigger and see the server log for the disk usage alert
  • Ensure you can properly trigger and see the server log for the JVM memory usage alert
  • Ensure you can properly trigger and see the server log for the missing monitoring data alert (Note: this alert is now only concerned with Elasticsearch and no longer looks at other stack products; see "[Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud", #83309)
  • Ensure you can properly trigger and see the server log for both threadpool rejection alerts
  • Ensure you can properly trigger and see the server log for the legacy cluster health alert
  • Ensure you can properly trigger and see the server log for the legacy nodes change alert
  • Ensure you can properly trigger and see the server log for the legacy Elasticsearch version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy Kibana version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy Logstash version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy license expiration alert

Edge cases

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@Zacqary Zacqary self-assigned this Jan 5, 2021
@Zacqary
Contributor

Zacqary commented Jan 5, 2021

> Ensure you can properly navigate between display modes of alerts

Not quite sure how to test this one. Does this state only happen if you're monitoring multiple clusters?

@chrisronline
Contributor Author

@Zacqary

Sure! In #81569, we introduced two ways to view alerts in the UI: grouped by "node/instance" or grouped by alert type. That line item means we should just verify that you can toggle between the two, that both of them make sense, and that the grouping looks right.
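To illustrate the two display modes, here is a minimal Python sketch of the same alerts grouped two ways. It is only an illustration, not the UI code from #81569: the `group_alerts` helper, the `node` and `type` keys, and the sample values are all hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, key):
    """Group a list of alert dicts by the given key (hypothetical: "node" or "type")."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert[key]].append(alert)
    return dict(groups)


# Made-up firing alerts, for illustration only.
firing = [
    {"node": "node-1", "type": "cpu_usage"},
    {"node": "node-1", "type": "disk_usage"},
    {"node": "node-2", "type": "cpu_usage"},
]

by_node = group_alerts(firing, "node")  # one group per node/instance
by_type = group_alerts(firing, "type")  # one group per alert type
```

Both views hold the same three alerts; only the grouping key changes, which is what the toggle is expected to switch between.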

@Zacqary
Contributor

Zacqary commented Jan 5, 2021

@chrisronline Got it, is it expected that the toggle switch doesn't appear if you're only monitoring a single node?

@chrisronline
Contributor Author

@Zacqary Negative. That's something we should probably optimize in the near future, but it does not do that now.

@Zacqary
Contributor

Zacqary commented Jan 5, 2021

Okay, well I'm not seeing the Group by node toggle at all. Guess I can't pass that test case then?

@chrisronline
Contributor Author

@Zacqary Oh wow, that's weird. Do you mind posting a screenshot of what you are seeing?

@Zacqary
Contributor

Zacqary commented Jan 5, 2021

Just not seeing Group By Node anywhere. This example is with only one node, but I still wasn't able to see it when adding more nodes.

Screen Shot 2021-01-05 at 3 27 50 PM

Screen Shot 2021-01-05 at 3 27 46 PM

Screen Shot 2021-01-05 at 3 27 42 PM

@Zacqary
Contributor

Zacqary commented Jan 6, 2021

> Ensure you can properly trigger and see the server log for both threadpool rejection alerts

Which ones are these? Never mind, missed em in the Errors and Exceptions section

@chrisronline
Contributor Author

@Zacqary I'm a bit dumbfounded. #85719 made it into 7.11, as you can see here but I don't know why it's not showing up in the UI. I just tested on cloud staging, but I'm going to test locally and see if I can figure out what's up

@Zacqary
Contributor

Zacqary commented Jan 6, 2021

For the threadpool alerts, not sure if I'm doing it right but it doesn't seem to be firing when I:

  • Set both Write and Read to alert every 1 minute, and fire on 1 rejection
  • Do a GET .monitoring-es*/_search to find the most recent document with a node_stats field
  • POST .monitoring-es*/_update/<doc id> to set node_stats.thread_pool.search.rejected to a number greater than 0, same for write.rejected
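The steps above can be sketched as request bodies. This is a minimal Python sketch that only builds the JSON payloads and sends nothing. The function names are hypothetical, the `timestamp` sort field is borrowed from the license query elsewhere in this thread, and the `exists` query and partial-document `doc` body are the standard Elasticsearch shapes rather than anything confirmed against the alert code.

```python
import json


def latest_node_stats_search():
    """Search body for the most recent monitoring document that has a node_stats field."""
    return {
        "size": 1,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {"exists": {"field": "node_stats"}},
    }


def rejected_update(search_rejected, write_rejected):
    """Partial-document body for POST <index>/_update/<doc id>."""
    return {
        "doc": {
            "node_stats": {
                "thread_pool": {
                    "search": {"rejected": search_rejected},
                    "write": {"rejected": write_rejected},
                }
            }
        }
    }


# Print the payloads so they can be pasted into Dev Tools.
print(json.dumps(latest_node_stats_search(), indent=2))
print(json.dumps(rejected_update(5, 5), indent=2))
```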

@chrisronline
Contributor Author

@Zacqary #85719 doesn't seem to be in BC1, but it is in the 7.11 branch so maybe we need to wait until BC2 to test some of this.

@chrisronline
Contributor Author

@igoristic Can you advise on #85841 (comment)?

@Zacqary
Contributor

Zacqary commented Jan 6, 2021

@chrisronline I'm running 7.11 locally and still don't see the Group By Node toggle. I actually tried on master first by accident and it also didn't show up.

@chrisronline
Contributor Author

@Zacqary Oh, my bad! You are unable to see this functionality while in Setup mode. It's only accessible for firing alerts. I'd recommend using Setup mode to set the threshold for various alerts very low to trigger them and test it that way.

@Zacqary
Contributor

Zacqary commented Jan 6, 2021

Ah that makes sense. Works outside Setup Mode.

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

Do I still need to do something to enable Watcher for the legacy alerts? Seems like I'm having trouble getting the license expiration pipeline to set it off

@chrisronline
Contributor Author

You'll need a trial or higher license, and that should be it. I'd double-check that the watches exist, and then you can do something like:

POST .monitoring-es-*/_search?filter_path=hits.hits._source.license
{
  "size": 1,
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "type": {
        "value": "cluster_stats"
      }
    }
  }
}

to verify the pipeline is properly changing the document
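For checking the result of that query programmatically, here is a minimal Python sketch that pulls the license object out of the response. The `latest_license` helper is hypothetical, the response shape is assumed from the `filter_path` in the request, and the sample values (including the `status`, `type`, and `expiry_date_in_millis` fields) are made up for illustration.

```python
def latest_license(response):
    """Return the license object from a filtered _search response, or None if absent."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0].get("_source", {}).get("license")


# Hypothetical response, shaped by filter_path=hits.hits._source.license.
sample_response = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "license": {
                        "status": "active",
                        "type": "trial",
                        "expiry_date_in_millis": 1612137600000,
                    }
                }
            }
        ]
    }
}

license_doc = latest_license(sample_response)
```

If the pipeline is rewriting the document, the returned object should show the modified values rather than the cluster's real license details.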

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

Yeah the document's updating, watches don't exist though.

@chrisronline
Contributor Author

What license level are you on? If you are on trial+ and using legacy monitoring collection, the watches should be created for you.

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

Using Trial and Metricbeat monitoring, alerts are created but no watches.

@chrisronline
Contributor Author

Ah, there is a known issue around watch creation and using only metricbeat monitoring: elastic/elasticsearch#51762 (comment)

We're tracking the removal of these watches in 7.12 (#85047), so this bug will be moot, but it's still there for now.

To get around it, please enable legacy monitoring (via the cluster setting) so that the watches get created. You can disable legacy monitoring as soon as the watches exist.

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

Remind me how to enable legacy monitoring again? It's a dev tools request, right?

@chrisronline
Contributor Author

Yea

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}
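Since the workaround is to enable legacy collection only until the watches exist, the enable and disable requests can be sketched as bodies for `PUT _cluster/settings`. This is a minimal Python sketch under stated assumptions: the `legacy_collection_settings` helper is hypothetical and only builds the JSON; passing `None` relies on the standard cluster settings behavior that a `null` value resets a setting to its default.

```python
import json


def legacy_collection_settings(enabled):
    """Body for PUT _cluster/settings; enabled may be True, False, or None (reset)."""
    return {"persistent": {"xpack.monitoring.collection.enabled": enabled}}


enable_body = json.dumps(legacy_collection_settings(True))
disable_body = json.dumps(legacy_collection_settings(False))
reset_body = json.dumps(legacy_collection_settings(None))
```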

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

> Ensure Stack Monitoring alerts are not editable or createable in the Alerts & Management UI

Marked this as working, but note that alerts can be deleted from the management UI

@chrisronline
Contributor Author

@Zacqary Really? I'm able to do it on staging cloud 7.11:

Kapture 2021-01-07 at 12 56 53

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

@chrisronline Yeah, is that not a bug? There doesn't seem to be a way to create them again

@chrisronline
Contributor Author

It's a bit of a confusing flow, I'll admit. Users should be able to delete from there, but the creation is handled purely by the Stack Monitoring UI. In the future, we hope to simplify this but we don't imagine it's a huge point of confusion, as users are most likely not deleting these alerts.

@chrisronline
Contributor Author

@ravikesarwani Any thoughts on this experience and if we need to change it?

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

Screen Shot 2021-01-07 at 1 34 23 PM

Set up a remote cluster using yarn es snapshot --license trial but it's reporting a Basic license instead

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

> Ensure you can properly trigger and see the server log for the missing monitoring data alert

Unsure how to get this one to fire. Tried turning off Metricbeat for a while, but that just switched the stack to Self-Monitoring mode and didn't fire any alerts.

@chrisronline
Contributor Author

> Set up a remote cluster using yarn es snapshot --license trial but it's reporting a Basic license instead

I'm not sure if that's a bug with yarn es snapshot or what?

> Unsure how to get this one to fire. Tried turning off Metricbeat for a while, but that just switched the stack to Self-Monitoring mode and didn't fire any alerts.

That should work. The collection method doesn't matter here - I'd just be consistent. Monitor the ES node with either legacy or MB for enough time to see the monitoring data show up, then disable either and make sure an alert fires in the Kibana server log. If not, then sounds like a bug

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

> If not, then sounds like a bug

Yeah, seems to not be firing, then

@ravikesarwani
Contributor

ravikesarwani commented Jan 7, 2021

> @ravikesarwani Any thoughts on this experience and if we need to change it?

If the user deleted the alert don't they get recreated at the time of stack monitoring UI reload?

@chrisronline
Contributor Author

> If the user deleted the alert don't they get recreated at the time of stack monitoring UI reload?

They do, but I think @Zacqary is saying it's not obvious to the user that will happen

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

> If the user deleted the alert don't they get recreated at the time of stack monitoring UI reload?

That actually didn't happen for me the first time I tried it, but it's working now that I've tried it again. If I can find a consistent way to reproduce I'll let y'all know.

@Zacqary
Contributor

Zacqary commented Jan 7, 2021

Setting up CCS on Cloud doesn't seem to be showing the remote cluster in the Stack Monitoring UI. This is from creating two Cloud deployments and adding one of them as a remote cluster using the Cloud UI, not through Kibana (where Remote Clusters isn't available in Stack Management).

@chrisronline
Contributor Author

> Setting up CCS on Cloud doesn't seem to be showing the remote cluster in the Stack Monitoring UI. This is from creating two Cloud deployments and adding one of them as a remote cluster using the Cloud UI, not through Kibana (where Remote Clusters isn't available in Stack Management).

I'm not sure how this works. I did this as well and there is no remote info from GET _remote/info. If that doesn't return anything, CCS will not work in the Stack Monitoring UI.

@Zacqary
Contributor

Zacqary commented Jan 11, 2021

I can mark the CCS case as working if we're not targeting the Cloud CCS for this release, but I'd recommend tracking that as an issue for the future.

Still unable to get the two unchecked alerts above (missing monitoring data and threadpool rejection) to fire on my end.

@chrisronline
Contributor Author

chrisronline commented Jan 11, 2021

@Zacqary Thanks.

> missing monitoring data

This one appears to be bugged now, so great catch. The fix is #87882.

@igoristic Are you able to look into the threadpool rejection alert?

@igoristic
Contributor

@Zacqary This won't work for triggering the alert, because that value is actually a cumulative counter. By the time you test the query, the document you edited will have been replaced by a more recent one that has the value set to zero, so you won't get the min and max delta that the query was designed for.

The only way to do this is by setting the size to 0 on a designated node that is monitored, via: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html and then executing a search operation on that specific node. Though, I was only able to do this locally, because I couldn't change the thread pool size on cloud
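The min and max delta mentioned above can be illustrated with a minimal Python sketch. It is not the alert's actual query: `rejection_delta` and `should_fire` are hypothetical names, the sample counts are made up, and comparing the delta with `>=` against the threshold is an assumption.

```python
def rejection_delta(samples):
    """Difference between the highest and lowest cumulative count seen in one window."""
    if not samples:
        return 0
    return max(samples) - min(samples)


def should_fire(samples, threshold):
    """Assumed rule: fire when the counter grew by at least `threshold` in the window."""
    return rejection_delta(samples) >= threshold


# A counter that is large but flat produces no delta, so nothing fires.
steady = [12, 12, 12]
# A counter that grows inside the window produces a positive delta.
growing = [12, 15, 20]

steady_fires = should_fire(steady, 1)
growing_fires = should_fire(growing, 1)
```

Because the value is cumulative, its absolute size does not matter; only growth between documents in the window does, which is why real rejections on a monitored node are the reliable way to exercise the alert.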

@sgrodzicki

@Zacqary @chrisronline @igoristic where are we with this?

@Zacqary
Contributor

Zacqary commented Jan 21, 2021

Sorry, missed the updates in the transition back to 7.12 work. I can try these test cases again asap.
