Alarms suddenly stop alerting (query log without data or not updating / being frozen) #540

Closed
IlliciteS opened this issue Jun 22, 2023 · 11 comments
Labels
question Further information is requested

Comments

@IlliciteS

IlliciteS commented Jun 22, 2023

Hello,

We ran into something we cannot understand. All our alarms are set up manually through the Praeco UI (not by editing YAML files).
Some alarms run fine, while others just stop working all of a sudden.

An example:

AlarmMainView

The data in the main view are fine.

QueryLog

Here's the Query Log view.
You can see it stopped working on 4/6/2023 at 9:00:00 AM, and I don't know why. I could not make it work again, so I had to duplicate the alarm, edit the copy, change its Limit Execution setting slightly (for example 4 minutes instead of 5), save it, delete the old broken alarm, and then rename the duplicate back to the original name.

The Query Log worked again... and then, since 6/21/2023 2:34:52 PM, the alarm has not run anymore.

Praeco YAML:
`__praeco_full_path: "FOTT/Services/High number error access Service TV HISENSE"
__praeco_query_builder: "{\"query\":{\"logicalOperator\":\"all\",\"children\":[{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"actKey\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"actKey\",\"value\":\"accessServiceError\"}},{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"media\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"media\",\"value\":\"tvhisense\"}}]}}"
alert:
  - "ms_teams"
  - "pagerduty"
alert_subject: "High number error access Service TV HISENSE"
alert_text: ""
doc_type: "dynamic_templates"
filter:
  - query:
      query_string:
        query: "actKey:accessServiceError AND media:tvhisense"
generate_kibana_discover_url: true
import: "../../BaseRule.config"
index: "switchplus-ott-prod"
is_enabled: true
kibana_discover_app_url: ""
kibana_discover_columns:
  - "actkey.keyword"
kibana_discover_from_timedelta:
  minutes: 10
kibana_discover_index_pattern_id: "fb57fb40-67ed-11ec-9329-ff0008e8c62a"
kibana_discover_to_timedelta:
  minutes: 10
kibana_discover_version: "7.15"
limit_execution: "0 */1 * * *"
match_enhancements: []
ms_teams_alert_summary: "ElastAlert Message"
ms_teams_attach_kibana_discover_url: false
ms_teams_kibana_discover_title: "Discover in Kibana"
ms_teams_proxy: ""
ms_teams_theme_color: "#F912DE"
ms_teams_webhook_url: ""
name: "High number error access Service TV HISENSE"
num_events: 2500
pagerduty_api_version: "v2"
pagerduty_client_name: "elastalert"
pagerduty_event_type: "trigger"
pagerduty_incident_key: "nbErrorAccessTVHISENSE"
pagerduty_service_key: "R03DL1FOO4FU3BPVDWUPARRTGWOCH29Z"
pagerduty_v2_payload_component: "threshold"
pagerduty_v2_payload_group: "FOTT"
pagerduty_v2_payload_severity: "critical"
pagerduty_v2_payload_source: "ElastAlert"
priority: 2
realert:
  hours: 1
terms_size: 50
timeframe:
  hours: 1
timestamp_field: "dt"
timestamp_type: "iso"
type: "frequency"
use_count_query: true
use_strftime_index: false`

For some alarms, the duplicate workaround does not help: they don't update at all and show this Query Log tab (while the graph in the overview works perfectly):
QueryLogNoData

Its overview:

Overview

Praeco YAML for that one:
`__praeco_full_path: "FOTT/Usage/Nb de lancement de player bas PLAYSTATION CPFRA"
__praeco_query_builder: "{\"query\":{\"logicalOperator\":\"all\",\"children\":[{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"actKey\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"actKey\",\"value\":\"launchOnePlayer\"}},{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"media\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"media\",\"value\":\"playstation\"}},{\"type\":\"query-builder-rule\",\"query\":{\"rule\":\"zone\",\"selectedOperator\":\"contains\",\"selectedOperand\":\"zone\",\"value\":\"cpfra\"}}]}}"
alert:
  - "pagerduty"
alert_subject: "Very low launched player on PLAYSTATION on CPFRA"
alert_text: "Very low launched player on PLAYSTATION on CPFRA"
doc_type: "dynamic_templates"
filter:
  - query:
      query_string:
        query: "actKey:launchOnePlayer AND media:playstation AND zone:cpfra"
generate_kibana_discover_url: true
import: "../../BaseRule.config"
index: "switchplus-ott-prod"
is_enabled: true
kibana_discover_app_url: ""
kibana_discover_from_timedelta:
  minutes: 10
kibana_discover_index_pattern_id: "fb57fb40-67ed-11ec-9329-ff0008e8c62a"
kibana_discover_to_timedelta:
  minutes: 10
kibana_discover_version: "7.15"
limit_execution: "0 */1 * * *"
match_enhancements: []
name: "Nb de lancement de player bas PLAYSTATION CPFRA"
pagerduty_api_version: "v2"
pagerduty_client_name: "elastalert"
pagerduty_event_type: "trigger"
pagerduty_incident_key: "nbLaunchPlayerPlaystationCPFRA"
pagerduty_service_key: "R03DL1FOO4FU3BPVDWUPARRTGWOCH29Z"
pagerduty_v2_payload_component: "threshold"
pagerduty_v2_payload_group: "FOTT"
pagerduty_v2_payload_severity: "critical"
pagerduty_v2_payload_source: "ElastAlert"
priority: 2
realert:
  minutes: 10
terms_size: 50
threshold: 150
timeframe:
  hours: 1
timestamp_field: "dt"
timestamp_type: "iso"
type: "flatline"
use_count_query: true
use_strftime_index: false`

And some other alarms, after being duplicated and the original removed, show the original's Query Log again, as if the original had not been deleted at all, keeping the old frozen history. I wonder whether the old alarm has really been deleted (and if not, why it no longer appears in Praeco).

That's why I have 3 questions:

1 - Any idea why this happens (apart from after a Docker container is destroyed / rebuilt, which I have already noticed)?
2 - Is there a way to "reconnect" all the alarms after such an incident without duplicating them (when duplicating works at all)? I tried enabling / disabling an alarm; it does not work.
3 - Are there any logs about a specific alarm in the Praeco container and / or in the ElastAlert container? If so, where?

👀 Operating environment

  • your praeco / elastalert docker files
  • elasticsearch version : 7.15.2
  • version of praeco : praeco 1.8.13
  • praecoapp/elastalert-server:20230402
@IlliciteS IlliciteS added the question Further information is requested label Jun 22, 2023
@nsano-rururu
Collaborator

It may be related to the following behavior. In the next version, I would like to make this setting configurable from the Praeco UI.
https://elastalert2.readthedocs.io/en/latest/elastalert.html
image

@nsano-rururu
Collaborator

Try adding the following to BaseRule.config:

disable_rules_on_error: false

@IlliciteS
Author

I added that to BaseRule.config (just that line, nothing else), but it does not work.
That being said, we found a curious workaround:

For a frozen alarm, if we go to Praeco, choose Edit, disable "Limit Execution" and wait a bit, the alarm starts running again. We can then re-enable Limit Execution and the alarm stays fine.

It also works through the YAML. So we plan to write a script that adds a # to comment out limit_execution in the YAML of every alarm and then, about 2 minutes later, removes that # to uncomment it.

Note: we updated to your latest versions of Praeco and ElastAlert, and the workaround still works.

@nsano-rururu
Collaborator

There is a comment introducing this feature: it only limits rule execution to specific times of the day, rather than disabling alerts.

Yelp/elastalert#492 (comment)

I've merged this feature into a new branch, beta, and released it as a new package version 0.2.0b1 available on pypi.

This includes a couple other changes as well, like threading support, but you can now limit rule execution to certain times of the day using limit_execution using cron syntax. For example

limit_execution: "* 7-22 * * *"
Would mean to only run the rule between 7 am and 10 pm every day.

This feature is still in beta, of course, but you're welcome to try.
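
To make the cron syntax concrete, here is a minimal Python sketch of how a five-field cron expression maps to clock times. It is not ElastAlert's parsing code; the helper names `field_matches` and `cron_matches` are made up for this illustration. It handles only `*`, single numbers, ranges, lists and `/step`, and it simply requires every field to match, which differs from full cron when both day fields are restricted.

```python
from datetime import datetime


def field_matches(expr: str, value: int, low: int, high: int) -> bool:
    """Return True if `value` satisfies one cron field.

    Supports "*", "a", "a-b", comma-separated lists and an optional "/step".
    """
    for part in expr.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` falls on a minute selected by `expression`."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, Python counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        field_matches(minute, moment.minute, 0, 59)
        and field_matches(hour, moment.hour, 0, 23)
        and field_matches(day, moment.day, 1, 31)
        and field_matches(month, moment.month, 1, 12)
        and field_matches(weekday, cron_weekday, 0, 6)
    )


window = "* 7-22 * * *"
print(cron_matches(window, datetime(2023, 6, 26, 7, 0)))    # True
print(cron_matches(window, datetime(2023, 6, 26, 22, 59)))  # True
print(cron_matches(window, datetime(2023, 6, 26, 23, 0)))   # False

hourly = "0 */1 * * *"
print(cron_matches(hourly, datetime(2023, 6, 26, 9, 0)))    # True
print(cron_matches(hourly, datetime(2023, 6, 26, 9, 5)))    # False
```

Under these standard cron semantics the hour range 7-22 is inclusive, so the quoted example keeps matching until 22:59, and `0 */1 * * *` (the value used in the rules above) matches only minute 0 of each hour.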

@IlliciteS
Author

IlliciteS commented Jun 26, 2023

Yes, we are already using the limit_execution. To be clear, this:

image

Which, in YAML, is equivalent to:
image

If I am not mistaken?

So it is this feature that causes the issue (for us, at least). To make the alarms work again, we do not disable the alarm itself, only this feature.

So the script we will implement does the following.
First, it comments out limit_execution in the YAML:
#limit_execution: "0 */1 * * *"

Then it removes the #, i.e. it uncomments the setting so the feature is active again:
limit_execution: "0 */1 * * *"
The alarms stay enabled the whole time.
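
The comment / uncomment steps described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names, the rules directory and the `*.yaml` file pattern are hypothetical, and the 120-second pause simply mirrors the "2 min" mentioned earlier in the thread.

```python
import re
import time
from pathlib import Path

# Match the key at the start of a line, keeping any leading indentation.
ACTIVE = re.compile(r"^([ \t]*)limit_execution:", re.MULTILINE)
COMMENTED = re.compile(r"^([ \t]*)#[ \t]*limit_execution:", re.MULTILINE)


def comment_limit_execution(yaml_text: str) -> str:
    """Turn 'limit_execution: ...' into '#limit_execution: ...'."""
    return ACTIVE.sub(r"\1#limit_execution:", yaml_text)


def uncomment_limit_execution(yaml_text: str) -> str:
    """Turn '#limit_execution: ...' back into 'limit_execution: ...'.

    Caveat: this also re-enables lines that were commented out by hand
    before the script ran.
    """
    return COMMENTED.sub(r"\1limit_execution:", yaml_text)


def toggle_rules(rules_dir: Path, pause_seconds: float = 120.0) -> None:
    """Comment out limit_execution in every rule file, wait, then restore it."""
    rule_files = sorted(rules_dir.rglob("*.yaml"))
    for path in rule_files:
        text = path.read_text(encoding="utf-8")
        path.write_text(comment_limit_execution(text), encoding="utf-8")
    time.sleep(pause_seconds)  # give the rule loader time to pick up the change
    for path in rule_files:
        text = path.read_text(encoding="utf-8")
        path.write_text(uncomment_limit_execution(text), encoding="utf-8")


# Hypothetical usage; adjust the path to wherever the rule files live:
# toggle_rules(Path("/path/to/rules"))
```

Working on the text line by line, instead of loading and re-dumping the YAML, leaves the rest of each rule file byte-for-byte unchanged.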

Before, the alarms were "frozen": the Query Log showed either "no data" or an old date.
After this workaround, the alarms are no longer frozen and work again: the Query Log tab displays all the queries made by the alarms.
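
The "frozen" condition described here (no data, or a last query that is too old) can be expressed as a small check. The sketch below is hypothetical: `find_frozen_rules` is a made-up helper, the current time, the 2-hour tolerance and the "Healthy example rule" entry are invented for the example, and how the last query time of each rule is obtained (for instance from the data behind the Query Log tab) is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def find_frozen_rules(
    last_query_by_rule: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return the names of rules with no query at all or a stale last query."""
    frozen = []
    for name, last_query in sorted(last_query_by_rule.items()):
        if last_query is None or now - last_query > max_age:
            frozen.append(name)
    return frozen


# Assumed "current" time and tolerance, chosen only for the example.
now = datetime(2023, 6, 22, 10, 0, 0)
last_queries = {
    # Last Query Log entry reported in this issue.
    "High number error access Service TV HISENSE": datetime(2023, 6, 21, 14, 34, 52),
    # Query Log shows "no data".
    "Nb de lancement de player bas PLAYSTATION CPFRA": None,
    # Invented healthy rule for contrast.
    "Healthy example rule": datetime(2023, 6, 22, 9, 55, 0),
}

print(find_frozen_rules(last_queries, now, max_age=timedelta(hours=2)))
# ['High number error access Service TV HISENSE',
#  'Nb de lancement de player bas PLAYSTATION CPFRA']
```

A check like this could decide which alarms need the workaround instead of toggling all of them.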

By the way, maybe we are not using Limit Execution the proper way. We use it to run a query every 5 minutes, or every hour, for instance.
If we want an alarm to run only between 10 am and 11 pm, we use the "Use Time Window" feature, as shown in the first screenshot.

@nsano-rururu
Collaborator

Repository owner locked and limited conversation to collaborators Jun 26, 2023
@nsano-rururu
Collaborator

Repository owner unlocked this conversation Jun 26, 2023
@nsano-rururu
Collaborator

this
image

@nsano-rururu
Collaborator

Yelp/elastalert#2119

@IlliciteS
Author

Thanks for your answer. That's interesting; this issue has been known since 2019.
So perhaps we should disable limit_execution for now and let the main cron run every alarm.

@nsano-rururu
Collaborator

Regarding limit_execution, I think there is a bug; I recall a related inquiry in the elastalert2 discussions.
https://github.com/jertel/elastalert2/discussions
