[APM] Alerts for throughput and failure rate anomalies #159288

Open
sorenlouv opened this issue Jun 8, 2023 · 7 comments
Assignees
Labels
apm:alerting epic Feature:Alerting Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" Team:APM All issues that need APM UI Team support

Comments

@sorenlouv
Member

Today we allow users to create anomaly detection jobs (ML Jobs) which will produce anomaly results for latency, throughput and failure rates.
Users can create rules and be alerted when there are latency anomalies, but they have no way of doing the same for throughput and failure rate anomalies.

There is a rule called ApmRuleType.Anomaly and the user-facing description for this rule is:

Alert when either the latency, throughput, or failed transaction rate of a service is anomalous.

This is quite misleading because the rule does not in fact produce alerts for throughput or failed transaction rate. It only alerts on latency, as can be seen in the terms filter below:

...termQuery(
'detector_index',
getApmMlDetectorIndex(ApmMlDetectorType.txLatency)
),

Solution

It should be possible to receive alerts for throughput and failure rate anomalies. Instead of creating new rules the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency.
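
As a rough illustration of the proposed change, the filter could match several detector indexes instead of only the latency one. This is a minimal sketch, not the actual Kibana code: `DetectorType`, `ASSUMED_DETECTOR_INDEX` and `detectorFilter` are hypothetical local stand-ins, and the index values (0, 1, 2) are an assumption about how the ML job orders its detectors.

```typescript
// Hypothetical stand-ins for illustration; not the real Kibana helpers.
type DetectorType = 'txLatency' | 'txThroughput' | 'txFailureRate';

// Assumed mapping from detector type to the job's detector_index.
const ASSUMED_DETECTOR_INDEX: Record<DetectorType, number> = {
  txLatency: 0,
  txThroughput: 1,
  txFailureRate: 2,
};

// Builds an Elasticsearch `terms` clause that matches any selected detector.
function detectorFilter(types: DetectorType[]): {
  terms: { detector_index: number[] };
} {
  return {
    terms: { detector_index: types.map((t) => ASSUMED_DETECTOR_INDEX[t]) },
  };
}

// Roughly equivalent to today's behaviour: latency only.
const latencyOnly = detectorFilter(['txLatency']);

// Proposed behaviour: all three anomaly types.
const allDetectors = detectorFilter([
  'txLatency',
  'txThroughput',
  'txFailureRate',
]);
```

Whether the rule should always query all three detectors or let the user pick a subset is left open here.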

Related enhancement request: https://github.com/elastic/enhancements/issues/12409 (internal)

@sorenlouv sorenlouv added the Team:APM All issues that need APM UI Team support label Jun 8, 2023
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:APM)

@gbamparop gbamparop changed the title [APM] Alerts for for throughput and failure rate anomalies [APM] Alerts for throughput and failure rate anomalies Jun 12, 2023
@gbamparop
Contributor

@elastic/apm-pm do you think that these should be separate rule types or just the one we currently have that will alert on latency, throughput and failed transaction rate?

@katrin-freihofner
Contributor

@sqren I agree, the name and description are misleading. We have plans to add the Anomaly rule (currently only available in Stack management) to Observability.

Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management? I think that, as the ML job covers latency, throughput, and failure rates, the Anomaly detection rule in Stack management would be able to alert on all three.

@sorenlouv
Member Author

sorenlouv commented Jul 11, 2023

> Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management?

I'm not 100% sure but APM has a rule called "Anomaly" and I see another one under Stack management called "Anomaly detection alert" - I assume that's the one you are referring to.

APM: Anomaly rule

The APM Anomaly rule will break down alerts by service.name, service.environment and transaction.type - similar to how all APM rules work. This means that a user will know exactly which service was anomalous and caused an alert. Furthermore, they can choose to only receive alerts for specific services, environments, etc.
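
The per-service breakdown can be pictured with a small sketch. The `AnomalyRecord` shape, the `groupKey` format and `maxScoreByGroup` are hypothetical and only illustrate grouping by service.name, service.environment and transaction.type; they are not taken from the rule's implementation.

```typescript
// Hypothetical record shape for illustration only.
interface AnomalyRecord {
  serviceName: string;
  environment: string;
  transactionType: string;
  score: number;
}

// One alert group per service / environment / transaction type.
function groupKey(r: AnomalyRecord): string {
  return [r.serviceName, r.environment, r.transactionType].join('_');
}

// Keeps the highest anomaly score seen for each group.
function maxScoreByGroup(records: AnomalyRecord[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const r of records) {
    const key = groupKey(r);
    result.set(key, Math.max(result.get(key) ?? 0, r.score));
  }
  return result;
}

const grouped = maxScoreByGroup([
  { serviceName: 'cart', environment: 'prod', transactionType: 'request', score: 40 },
  { serviceName: 'cart', environment: 'prod', transactionType: 'request', score: 82 },
  { serviceName: 'search', environment: 'prod', transactionType: 'request', score: 10 },
]);
```

Each key then identifies exactly which service, environment and transaction type was anomalous.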

Machine Learning: "Anomaly detection alert"

This rule asks the user to select an existing ML job, and then specify a result type. This is much more generic, but it is also quite a bit harder to understand and use. Furthermore, it's not possible to group or filter by service.name / service.environment / transaction.type afaict.


@sorenlouv sorenlouv added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Jul 11, 2023
@elasticmachine
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@akhileshpok

@gbamparop - I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule. We should make sure that the threshold settings/ranges are appropriate for the new metrics.

@sorenlouv sorenlouv added the bug Fixes for quality problems that affect the customer experience label Jul 17, 2023
@sorenlouv
Member Author

sorenlouv commented Jul 17, 2023

> I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule.

Agree, that's also what I've suggested in the issue description:

"Instead of creating new rules the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency"

> We should make sure that the threshold settings/ranges are appropriate for the new metrics.

Actually, we don't even need to think of this. The only metric the rule cares about is severity, meaning a severity like "critical" can apply equally to latency anomalies, throughput anomalies and failure rate anomalies.
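
To illustrate why no per-metric thresholds are needed, here is a minimal sketch in which the alerting decision depends only on the anomaly score, never on the detector. The `Anomaly` shape and `shouldAlert` are hypothetical, and the score cut-offs are an assumption modelled on the usual ML severity bands, not values read from the rule.

```typescript
type Severity = 'critical' | 'major' | 'minor' | 'warning';

// Assumed minimum anomaly score for each severity band.
const ASSUMED_MIN_SCORE: Record<Severity, number> = {
  critical: 75,
  major: 50,
  minor: 25,
  warning: 3,
};

// Hypothetical anomaly shape; `detector` is carried along but never inspected.
interface Anomaly {
  detector: 'latency' | 'throughput' | 'failureRate';
  score: number;
}

// The same threshold check applies regardless of which detector fired.
function shouldAlert(anomaly: Anomaly, threshold: Severity): boolean {
  return anomaly.score >= ASSUMED_MIN_SCORE[threshold];
}

const throughputCritical = shouldAlert({ detector: 'throughput', score: 80 }, 'critical');
const latencyBelowCritical = shouldAlert({ detector: 'latency', score: 60 }, 'critical');
const failureRateMajor = shouldAlert({ detector: 'failureRate', score: 60 }, 'major');
```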

@katrin-freihofner katrin-freihofner self-assigned this Jul 18, 2023
@katrin-freihofner katrin-freihofner removed the bug Fixes for quality problems that affect the customer experience label Jul 18, 2023

6 participants