[APM] Alerts for throughput and failure rate anomalies #159288

sorenlouv · 2023-06-08T09:04:52Z

Today we allow users to create anomaly detection jobs (ML Jobs) which will produce anomaly results for latency, throughput and failure rates.
Users can create rules and be alerted when there are anomalies for latency but they have no way of doing the same for throughput and failure rate anomalies.

There is a rule called ApmRuleType.Anomaly, and its user-facing description is:

Alert when either the latency, throughput, or failed transaction rate of a service is anomalous.

This is quite misleading because the rule does not in fact produce alerts for throughput or failed transaction rate, only for latency, as can be seen in the term filter below:

kibana/x-pack/plugins/apm/server/routes/alerts/rule_types/anomaly/register_anomaly_rule_type.ts

Lines 172 to 175 in 7890be6

    ...termQuery(
      'detector_index',
      getApmMlDetectorIndex(ApmMlDetectorType.txLatency)
    ),

Solution

It should be possible to receive alerts for throughput and failure rate anomalies. Instead of creating new rules the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency.
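As a rough illustration of the proposed change, the single-detector term filter could become a terms filter over several detector indices. This is only a sketch, not the Kibana implementation: the enum values, the index mapping, and the `detectorFilter` helper below are local stand-ins assumed for the example, and only the names `ApmMlDetectorType`, `getApmMlDetectorIndex` and `detector_index` come from the snippet above.

```typescript
// Local stand-in for the enum referenced in the issue; member values are assumed.
enum ApmMlDetectorType {
  txLatency = 'txLatency',
  txThroughput = 'txThroughput',
  txFailureRate = 'txFailureRate',
}

// Assumed mapping from detector type to its position in the ML job.
const detectorIndexByType: Record<ApmMlDetectorType, number> = {
  [ApmMlDetectorType.txLatency]: 0,
  [ApmMlDetectorType.txThroughput]: 1,
  [ApmMlDetectorType.txFailureRate]: 2,
};

// Stand-in with the same name as the helper used in the quoted snippet.
function getApmMlDetectorIndex(type: ApmMlDetectorType): number {
  return detectorIndexByType[type];
}

type QueryClause = Record<string, unknown>;

// Hypothetical helper: builds one `terms` clause matching any of the given
// detectors, instead of a `term` clause hard-coded to latency.
function detectorFilter(types: ApmMlDetectorType[]): QueryClause[] {
  if (types.length === 0) {
    return [];
  }
  return [{ terms: { detector_index: types.map(getApmMlDetectorIndex) } }];
}

// Usage: match latency, throughput and failure rate anomalies.
const allDetectors = detectorFilter([
  ApmMlDetectorType.txLatency,
  ApmMlDetectorType.txThroughput,
  ApmMlDetectorType.txFailureRate,
]);
console.log(JSON.stringify(allDetectors));
```

Passing the detector list as a parameter would also let the rule expose the anomaly type as a user setting, if that turns out to be wanted.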

Related enhancement request: https://github.com/elastic/enhancements/issues/12409 (internal)

elasticmachine · 2023-06-08T09:04:55Z

Pinging @elastic/apm-ui (Team:APM)

gbamparop · 2023-06-12T14:37:32Z

@elastic/apm-pm do you think that these should be separate rule types or just the one we currently have that will alert on latency, throughput and failed transaction rate?

katrin-freihofner · 2023-07-11T09:14:23Z

@sqren I agree, the name and description are misleading. We have plans to add the Anomaly rule (currently only available in Stack management) to Observability.

Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management? Since the ML job covers latency, throughput, and failure rates, I think the Anomaly detection rule in Stack management would be able to alert on all three.

sorenlouv · 2023-07-11T11:37:59Z

> Do you know why there is a separate "APM anomaly" rule and how it is different from the one in Stack management?

I'm not 100% sure, but APM has a rule called "Anomaly" and I see another one under Stack management called "Anomaly detection alert". I assume that's the one you are referring to.

APM: Anomaly rule

The APM Anomaly rule will break down alerts by service.name, service.environment and transaction.type, similar to how all APM rules work. This means that a user will know exactly which service was anomalous and caused an alert. Furthermore, they can choose to only receive alerts for specific services, environments, etc.
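The breakdown described here can be pictured as grouping anomaly records by a composite key, so that each service, environment and transaction type combination produces its own alert. The sketch below is illustrative only: the `ServiceAnomaly` shape, the `groupKey` format and `highestScorePerGroup` are hypothetical names, not taken from the APM plugin.

```typescript
// Hypothetical, simplified shape of an anomaly record for one service.
interface ServiceAnomaly {
  serviceName: string;
  environment: string;
  transactionType: string;
  score: number;
}

// One alert group per service.name / service.environment / transaction.type.
function groupKey(anomaly: ServiceAnomaly): string {
  return [anomaly.serviceName, anomaly.environment, anomaly.transactionType].join('|');
}

// Keeps the highest-scoring anomaly in each group.
function highestScorePerGroup(
  anomalies: ServiceAnomaly[]
): Record<string, ServiceAnomaly> {
  const groups: Record<string, ServiceAnomaly> = {};
  for (const anomaly of anomalies) {
    const key = groupKey(anomaly);
    const current: ServiceAnomaly | undefined = groups[key];
    if (current === undefined || anomaly.score > current.score) {
      groups[key] = anomaly;
    }
  }
  return groups;
}

// Usage: two records for the same group collapse into one entry.
const grouped = highestScorePerGroup([
  { serviceName: 'checkout', environment: 'production', transactionType: 'request', score: 78 },
  { serviceName: 'checkout', environment: 'production', transactionType: 'request', score: 91 },
  { serviceName: 'cart', environment: 'staging', transactionType: 'request', score: 55 },
]);
console.log(Object.keys(grouped));
```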

Machine Learning: "Anomaly detection alert"

This rule asks the user to select an existing ML job, and then specify a result type. This is much more generic but also quite a bit harder to understand how to use. Furthermore, it's not possible to group or filter by service.name / service.environment / transaction.type afaict.

elasticmachine · 2023-07-11T11:38:13Z

Pinging @elastic/actionable-observability (Team: Actionable Observability)

akhileshpok · 2023-07-13T09:46:28Z

@gbamparop - I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule. We should make sure that the threshold settings/ranges are appropriate for the new metrics.

sorenlouv · 2023-07-17T12:04:52Z

> I would suggest that we re-use and extend the capabilities of the existing APM anomaly rule.

Agree, that's also what I've suggested in the issue description:

"Instead of creating new rules the existing ApmRuleType.Anomaly rule should be updated to also produce alerts for other types of anomalies than latency"

> We should make sure that the threshold settings/ranges are appropriate for the new metrics.

Actually, we don't even need to think about this. The only metric the rule cares about is severity, meaning that a severity like "critical" applies equally to latency anomalies, throughput anomalies and failure rate anomalies.
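To make the point concrete, here is a minimal sketch of a severity check that never looks at the detector type. The threshold numbers are an assumption based on the usual ML anomaly score bands (warning 3, minor 25, major 50, critical 75), and `AnomalyRecord`, `meetsSeverity` and `anomaliesToAlert` are hypothetical names used only for this example.

```typescript
type Severity = 'warning' | 'minor' | 'major' | 'critical';

// Assumed minimum anomaly score for each severity band.
const severityThresholds: Record<Severity, number> = {
  warning: 3,
  minor: 25,
  major: 50,
  critical: 75,
};

// Hypothetical record shape: the detector is carried along but not used
// when deciding whether to alert.
interface AnomalyRecord {
  detector: 'latency' | 'throughput' | 'failureRate';
  score: number;
}

function meetsSeverity(score: number, minimum: Severity): boolean {
  return score >= severityThresholds[minimum];
}

// Same threshold for every detector: only the score matters.
function anomaliesToAlert(records: AnomalyRecord[], minimum: Severity): AnomalyRecord[] {
  return records.filter((record) => meetsSeverity(record.score, minimum));
}

// Usage: alert on "critical" anomalies of any type.
const criticalAnomalies = anomaliesToAlert(
  [
    { detector: 'latency', score: 80 },
    { detector: 'throughput', score: 76 },
    { detector: 'failureRate', score: 40 },
  ],
  'critical'
);
console.log(criticalAnomalies.map((record) => record.detector));
```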

sorenlouv added the Team:APM All issues that need APM UI Team support label Jun 8, 2023

This was referenced Jun 8, 2023

[APM] Alerting use cases and examples #103785

Open

[APM] Add rule types to the general list of rules available #126909

Closed

gbamparop changed the title ~~[APM] Alerts for for throughput and failure rate anomalies~~ [APM] Alerts for throughput and failure rate anomalies Jun 12, 2023

gbamparop added the needs product label Jun 12, 2023

gbamparop mentioned this issue Jun 30, 2023

[APM] Documentation updates #160568

Merged

5 tasks

sorenlouv added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Jul 11, 2023

sorenlouv added the bug Fixes for quality problems that affect the customer experience label Jul 17, 2023

gbamparop removed the needs product label Jul 17, 2023

katrin-freihofner self-assigned this Jul 18, 2023

katrin-freihofner removed the bug Fixes for quality problems that affect the customer experience label Jul 18, 2023

gbamparop added the apm:alerting label Jul 31, 2023

emma-raffenne added Feature:Alerting epic labels Sep 8, 2023
