This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Throttle Period #14

Closed
Nishant23 opened this issue Mar 29, 2019 · 10 comments

@Nishant23

Nishant23 commented Mar 29, 2019

If we configure email or Slack in the action block, an alert is sent every time the monitor is triggered. In Elasticsearch X-Pack this can be controlled by setting throttle_period in the action block: after the first alert, it waits for the throttle_period and then resends the alert only if the issue is still unresolved. Can we have the same functionality here?

For more info, you can look at https://www.elastic.co/guide/en/x-pack/current/actions.html#actions-ack-throttle

@CarlMeadows
Contributor

Noted - thanks for the feedback. We do have the Acknowledge feature which will suspend subsequent notifications until the issue is resolved - but I can see folks also only wanting alerts to go out every 15 or 30 minutes on a 1 minute polling event - even before they can acknowledge it.

@elfisher elfisher added the enhancement New feature or request label Apr 2, 2019
@vamshin
Member

vamshin commented Apr 3, 2019

Adding my observations for the throttling feature.

We could provide throttling at 3 levels.
Level 1: Monitor
Level 2: Trigger (if there is no trigger-level throttling, it falls back to the monitor)
Level 3: Action (if there is no action-level throttling, it falls back to the trigger)

Why would we need these levels?
A monitor could have multiple triggers. If we want all triggers to have the same throttling, then instead of adding a throttling configuration to each trigger, we could set it once at the monitor level. The same logic holds for triggers and actions.
Currently X-Pack supports throttling at the trigger level and the action level. But in X-Pack, 1 watch (monitor) = 1 trigger, so its trigger-level throttling covers both level 1 and level 2 described above.

Also, should the throttling be global to all the alerts of a trigger, or should it apply to each alert individually?
If throttling applies per alert, then the throttling timer resets when the alert is completed, and a notification is sent as soon as the next alert starts.
If throttling is global to all alerts, then the timer keeps running across alerts.
One drawback we could think of with the former approach is that if an alert keeps going in and out of the alerting state, throttling never comes into play.

Please share feedback if you see any concerns or if I am missing something here.

@Nishant23
Author

Nishant23 commented Apr 12, 2019

@CarlMeadows When is this planned to be released?

@elfisher
Contributor

@Nishant23 we are still working through how we want this mechanic to work. I'll be posting a high-level summary of our plans shortly to this issue thread. Stay tuned.

@elfisher
Contributor

[RFC] Alert throttling

The purpose of this request for comments (RFC) is to discuss how to enhance Open Distro for Elasticsearch Alerting to include throttling mechanisms on actions and provide users with the ability to undo acknowledgements.

Problem statement

Currently monitors, triggers, and actions all run on the schedule defined in the monitor. Each time a monitor runs, its triggers are evaluated and actions are taken if the trigger thresholds are exceeded. This can create a lot of noise if you have monitors that run often. However, you may not want to reduce the monitor frequency, because you still want to check the data often. For example, you may want to check the error rates of application logs in Elasticsearch every 5 minutes, but only send alerts every 30 minutes while your error rates are high.

Proposed solution

We will introduce a throttling property in actions which will be used to reduce the frequency at which the action is taken. Taking the same example from above, you will be able to run the monitor checking error rates in your application every 5 minutes, but define your alert to be sent at most once every 30 minutes.

Note: We are planning to apply throttling to unique alerting events. Let's look at an example of what this means with throttling set to 30 minutes. Say a trigger goes into alerting at 10:00 and completes at 10:10. If the trigger goes back into alerting at 10:20, alerts would be sent for both the 10:00 event and the 10:20 event, because each alert is a unique event. If the 10:20 alert does not complete, it will then send notifications at 10:50, 11:20, 11:50, and so on until it completes or is acknowledged.
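
The timeline in the note can be reproduced with a small simulation. This is only a sketch of the per-event semantics described above, not the plugin's implementation: the function name `notification_times`, its parameters, the 5-minute check interval, and the 12:00 cutoff are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def notification_times(events, throttle, check_interval, horizon):
    """Return the times at which notifications would be sent.

    events: list of (start, end) pairs; end=None means the alert never completes.
    Each event keeps its own throttle timer, so a new event notifies immediately.
    (Hypothetical helper, written only to illustrate the RFC note.)
    """
    sent = []
    for start, end in events:
        stop = horizon if end is None else min(end, horizon)
        last_sent = None  # timer is reset for every unique event
        t = start
        while t < stop:
            if last_sent is None or t - last_sent >= throttle:
                sent.append(t)
                last_sent = t
            t += check_interval  # next monitor run
    return sent


day = datetime(2019, 4, 12)
events = [
    (day.replace(hour=10, minute=0), day.replace(hour=10, minute=10)),
    (day.replace(hour=10, minute=20), None),  # never completes
]
times = notification_times(
    events,
    throttle=timedelta(minutes=30),
    check_interval=timedelta(minutes=5),
    horizon=day.replace(hour=12),
)
print([t.strftime("%H:%M") for t in times])
# ['10:00', '10:20', '10:50', '11:20', '11:50']
```

Under a global timer (the alternative raised earlier in the thread), the 10:20 notification would be suppressed because it falls within 30 minutes of the 10:00 one.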

Example Action configuration

{
  "name": "test-action",
  "destination_id": "RtaaOmkBC25HCRGm0fxi",
  "throttle": {
    "value": 30,
    "unit": "MINUTES"
  },
  ...
}
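
As a rough illustration of how a client might read this property, the sketch below converts the `throttle` object into a duration. The helper `throttle_period` is hypothetical, and only the `MINUTES` unit appears in this RFC; the other units in the lookup table are assumptions.

```python
import json
from datetime import timedelta

# Only "MINUTES" is shown in the RFC; "HOURS" and "DAYS" are assumed here.
_UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def throttle_period(action):
    """Return the action's throttle as a timedelta, or None if it is not set."""
    throttle = action.get("throttle")
    if throttle is None:
        return None
    unit = throttle["unit"]
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported throttle unit: {unit!r}")
    value = throttle["value"]
    if value <= 0:
        raise ValueError("throttle value must be positive")
    return timedelta(seconds=value * _UNIT_SECONDS[unit])


action = json.loads("""
{
  "name": "test-action",
  "destination_id": "RtaaOmkBC25HCRGm0fxi",
  "throttle": {"value": 30, "unit": "MINUTES"}
}
""")
print(throttle_period(action))  # 0:30:00
```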

Additional functionality we are considering

Additionally, we are thinking of adding default throttling settings at the trigger and monitor level, which you can configure if you want actions to inherit a default throttling setting from their parent. For example, you may have multiple triggers and actions configured for one monitor, but you want them all to have the same throttling settings. You could use the monitor default_throttling property to set a default that all of its actions will use.

It is important to note that if a child's throttling property is configured, it takes precedence over its parent's. For example, if I configure a default_throttling property on both a monitor and a trigger, the trigger's property will be used for its actions. If I configure a throttling property on an action, it overrules its trigger's property. Below are examples of how priority is selected.

| Monitor (throttle configured) | Trigger (throttle configured) | Action (throttle configured) | Throttle used |
| --- | --- | --- | --- |
| Yes | Yes | Yes | Action |
| Yes | Yes | No | Trigger |
| Yes | No | No | Monitor |
| No | Yes | Yes | Action |
| No | Yes | No | Trigger |
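
The precedence rule in the table can be sketched as a simple fallback chain. This is an illustration only: `effective_throttle` is a hypothetical helper, and the field names (`throttle` on actions, `default_throttle` on triggers and monitors) are taken from the example configurations in this RFC, which may not match the final property names.

```python
def effective_throttle(monitor, trigger, action):
    """Pick the throttle that applies to an action: action, then trigger, then monitor.

    Returns a (throttle, source) pair, or (None, None) if nothing is configured.
    """
    if action.get("throttle") is not None:
        return action["throttle"], "action"
    if trigger.get("default_throttle") is not None:
        return trigger["default_throttle"], "trigger"
    if monitor.get("default_throttle") is not None:
        return monitor["default_throttle"], "monitor"
    return None, None


monitor = {"name": "test-monitor", "default_throttle": {"value": 10, "unit": "MINUTES"}}
trigger = {"name": "test-trigger"}  # no trigger-level default
action = {"name": "test-action"}    # no action-level throttle

throttle, source = effective_throttle(monitor, trigger, action)
print(source, throttle)  # monitor {'value': 10, 'unit': 'MINUTES'}
```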

Example Monitor Configuration

{
  "name": "test-monitor",
  "type": "monitor",
  "enabled": true,
  "schedule": {
    "period": {
      "interval": 1,
      "unit": "MINUTES"
    }
  },
  "default_throttle": {
    "value": 10,
    "unit": "MINUTES"
  },
  ...
}

Example Trigger Configuration

{
  "id": "StaeOmkBC25HCRGmL_y-",
  "name": "test-trigger",
  "severity": "1",
  "default_throttle": {
    "value": 10,
    "unit": "MINUTES"
  },
  ...
}

@elfisher
Contributor

@vamshin I've incorporated your thoughts from above into the RFC.

@ylwu-amzn
Contributor

ylwu-amzn commented Apr 12, 2019

@elfisher The throttling here is meant to prevent sending too many alerts in a short time; acknowledging an alert is another, manual way to do that. Could we consider enhancing acknowledge so that it suppresses the alert for a period (for example, for the next hour)?

@ylwu-amzn
Contributor

ylwu-amzn commented Apr 12, 2019

For better tracking, create another issue to track the enhancement of acknowledge for a period of time.

@rhyscooper

This would be a very valuable addition to remove noise from alerting and is something I've been looking for!

It might be useful to be able to configure whether the throttling applies only to single events or can span multiple events, so that it is the user's choice whether to filter out intermittent 'spiky' alerts or still receive multiple alerts when the alert condition is fluctuating between true and false.

@ylwu-amzn
Contributor

Code change merged. Closing this issue.
