Alarms suddenly stop alerting (query log without data or not updating / being frozen) #540
Comments
It may be related to the following behavior. In the next version, I would like to make these settings configurable from the Praeco screen. |
Try adding disable_rules_on_error: false to BaseRule.config. |
I added that to BaseRule.config (and nothing else), but it does not work. If we go to Praeco and, for a frozen alarm, choose Edit -> disable "Limit Execution" and wait a bit, the alarm is re-enabled. We can then re-enable Limit Execution and the alarm still works. It also works through the YAML. So we plan to write a script that adds a # to comment out limit_execution in the YAML of every alarm and then, about 2 minutes later, removes that # to uncomment it. Note: we updated to your latest versions of Praeco and Elastalert, and the workaround still works. |
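For reference, here is a minimal sketch of the kind of script described above. It is only an illustration under several assumptions: that the rule files are `.yaml` files under a single rules directory (the `rules_dir` argument is hypothetical), that `limit_execution` sits at the top level of each file, and that ElastAlert picks up file changes within the wait period. Rather than deleting the # afterwards, it writes the saved original text back, which has the same effect.

```python
import re
import time
from pathlib import Path

# Matches an active (uncommented) top-level limit_execution line.
ACTIVE = re.compile(r"^(limit_execution:.*)$", re.MULTILINE)


def comment_out(text: str) -> str:
    """Prefix every active limit_execution line with '#'."""
    return ACTIVE.sub(r"#\1", text)


def toggle_rules(rules_dir: str, wait_seconds: int = 120) -> list:
    """Comment out limit_execution in every rule file, wait, then restore it.

    Returns the list of files that were modified and restored.
    """
    originals = {}
    for path in sorted(Path(rules_dir).rglob("*.yaml")):
        original = path.read_text(encoding="utf-8")
        patched = comment_out(original)
        if patched != original:
            path.write_text(patched, encoding="utf-8")
            originals[path] = original

    # Assumption: this is long enough for ElastAlert to reload the rules.
    time.sleep(wait_seconds)

    # Put the exact original content back (i.e. remove the added '#').
    for path, original in originals.items():
        path.write_text(original, encoding="utf-8")
    return list(originals)
```

Note that any edit made to a rule through Praeco during the wait would be overwritten when the original text is restored, so the wait should be kept short.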
There is a comment that implements a function that only limits the execution of rules to a specific time of the day, rather than disabling alerts. I've merged this feature into a new branch, beta, and released it as a new package version 0.2.0b1 available on pypi. This includes a couple of other changes as well, like threading support, but you can now limit rule execution to certain times of the day with limit_execution, using cron syntax. For example: limit_execution: "* 7-22 * * *". This feature is still in beta, of course, but you're welcome to try. |
Yes, we are already using limit_execution. To be clear, this: If I am not mistaken, right? So it is this feature that causes the issue (for us, at least). We do not disable the alarm itself, only this feature, to make the alarms work again. So the script we will implement will do that: And then it will delete this #, i.e., it will uncomment the feature so that it works again: Before, the alarms were "frozen". The Query Log showed either "no data" or an old date. By the way, maybe we do not use Limit Execution the proper way. We use it to run a query every 5 min, or every hour, for instance. |
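To make the cron expressions in this thread easier to compare, here is a small self-contained sketch that counts how many minutes of a day each expression matches. It does not use or model ElastAlert itself; `field_matches`, `cron_matches` and `minutes_allowed_per_day` are hypothetical helpers that handle only `*`, `a-b`, `*/n` and comma lists. As far as I understand it (an assumption worth checking against the ElastAlert documentation), limit_execution describes the window in which a rule may execute rather than how often the query runs, so a narrow expression leaves a narrow window.

```python
from datetime import datetime


def field_matches(spec: str, value: int, lo: int, hi: int) -> bool:
    """Return True if value satisfies one cron field (*, a-b, */n, a,b)."""
    for part in spec.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            end = hi if step else start
        if start <= value <= end and (value - start) % step_n == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Simplified 5-field cron match (day-of-month and day-of-week are ANDed)."""
    minute, hour, dom, month, dow = expr.split()
    return (
        field_matches(minute, when.minute, 0, 59)
        and field_matches(hour, when.hour, 0, 23)
        and field_matches(dom, when.day, 1, 31)
        and field_matches(month, when.month, 1, 12)
        and field_matches(dow, when.isoweekday() % 7, 0, 6)  # cron: 0 = Sunday
    )


def minutes_allowed_per_day(expr: str, day: datetime) -> int:
    """Count the minutes of the given day that match the expression."""
    return sum(
        cron_matches(expr, day.replace(hour=h, minute=m))
        for h in range(24)
        for m in range(60)
    )


day = datetime(2023, 6, 21)
print(minutes_allowed_per_day("0 */1 * * *", day))   # 24
print(minutes_allowed_per_day("*/5 * * * *", day))   # 288
print(minutes_allowed_per_day("* 7-22 * * *", day))  # 960
```

So "0 */1 * * *" matches a single minute per hour, while "* 7-22 * * *" matches every minute between 07:00 and 22:59.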
Thanks for your answer. That's interesting; this case has been known since 2019. |
Regarding limit_execution, I think there is a bug; I recall a related inquiry in the elastalert2 discussions. |
Hello,
We ran into something we cannot understand. All our Praeco alarms are set up manually through Praeco (so not by YAML).
Some alarms run pretty well while others just stop working all of a sudden.
An example:
The data in the main view are fine.
Here's the Query Log view.
You can see it stopped working on 4/6/2023 9:00:00 AM. I don't know why. I could not make it work again, so I had to duplicate the alarm, edit it, change the limit execution to a slightly different value (like 4 min instead of 5), save it, delete the old broken alarm, go back to the duplicated alarm and change its name back to the original one.
The Query Log worked again... and then, since 6/21/2023 2:34:52 PM, it has not run anymore.
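A "frozen" alarm as described here is one whose last Query Log entry is much older than expected. As a rough sketch of how such alarms could be spotted in bulk, the hypothetical helper below takes a mapping from rule name to the time of its last query (however that is obtained, for example copied from the Query Log tab) and lists the rules that have been silent for too long. The second rule name and its timestamp are made up for illustration.

```python
from datetime import datetime, timedelta


def find_frozen_rules(last_runs: dict, now: datetime, max_age: timedelta) -> list:
    """Return names of rules whose last query is older than max_age, oldest first."""
    stale = [(ts, name) for name, ts in last_runs.items() if now - ts > max_age]
    return [name for ts, name in sorted(stale)]


last_runs = {
    # Timestamp taken from the Query Log described above.
    "High number error access Service TV HISENSE": datetime(2023, 6, 21, 14, 34, 52),
    # Hypothetical healthy rule, for comparison.
    "Some healthy rule": datetime(2023, 6, 21, 17, 55, 0),
}
print(find_frozen_rules(last_runs, datetime(2023, 6, 21, 18, 0, 0), timedelta(hours=2)))
# ['High number error access Service TV HISENSE']
```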
Praeco YAML:
`__praeco_full_path: "FOTT/Services/High number error access Service TV HISENSE"
__praeco_query_builder: "{"query":{"logicalOperator":"all","children":[{"type":"query-builder-rule","query":{"rule":"actKey","selectedOperator":"contains","selectedOperand":"actKey","value":"accessServiceError"}},{"type":"query-builder-rule","query":{"rule":"media","selectedOperator":"contains","selectedOperand":"media","value":"tvhisense"}}]}}"
alert:
alert_subject: "High number error access Service TV HISENSE"
alert_text: ""
doc_type: "dynamic_templates"
filter:
  query_string:
    query: "actKey:accessServiceError AND media:tvhisense"
generate_kibana_discover_url: true
import: "../../BaseRule.config"
index: "switchplus-ott-prod"
is_enabled: true
kibana_discover_app_url: ""
kibana_discover_columns:
kibana_discover_from_timedelta:
  minutes: 10
kibana_discover_index_pattern_id: "fb57fb40-67ed-11ec-9329-ff0008e8c62a"
kibana_discover_to_timedelta:
  minutes: 10
kibana_discover_version: "7.15"
limit_execution: "0 */1 * * *"
match_enhancements: []
ms_teams_alert_summary: "ElastAlert Message"
ms_teams_attach_kibana_discover_url: false
ms_teams_kibana_discover_title: "Discover in Kibana"
ms_teams_proxy: ""
ms_teams_theme_color: "#F912DE"
ms_teams_webhook_url: ""
name: "High number error access Service TV HISENSE"
num_events: 2500
pagerduty_api_version: "v2"
pagerduty_client_name: "elastalert"
pagerduty_event_type: "trigger"
pagerduty_incident_key: "nbErrorAccessTVHISENSE"
pagerduty_service_key: "R03DL1FOO4FU3BPVDWUPARRTGWOCH29Z"
pagerduty_v2_payload_component: "threshold"
pagerduty_v2_payload_group: "FOTT"
pagerduty_v2_payload_severity: "critical"
pagerduty_v2_payload_source: "ElastAlert"
priority: 2
realert:
  hours: 1
terms_size: 50
timeframe:
  hours: 1
timestamp_field: "dt"
timestamp_type: "iso"
type: "frequency"
use_count_query: true
use_strftime_index: false`
Some alarms, when doing that duplicate workaround, just don't update at all and get this Query Log tab (while the graph in the overview is working perfectly):
Its overview:
Praeco YAML for that one:
`__praeco_full_path: "FOTT/Usage/Nb de lancement de player bas PLAYSTATION CPFRA"
__praeco_query_builder: "{"query":{"logicalOperator":"all","children":[{"type":"query-builder-rule","query":{"rule":"actKey","selectedOperator":"contains","selectedOperand":"actKey","value":"launchOnePlayer"}},{"type":"query-builder-rule","query":{"rule":"media","selectedOperator":"contains","selectedOperand":"media","value":"playstation"}},{"type":"query-builder-rule","query":{"rule":"zone","selectedOperator":"contains","selectedOperand":"zone","value":"cpfra"}}]}}"
alert:
alert_subject: "Very low launched player on PLAYSTATION on CPFRA"
alert_text: "Very low launched player on PLAYSTATION on CPFRA"
doc_type: "dynamic_templates"
filter:
  query_string:
    query: "actKey:launchOnePlayer AND media:playstation AND zone:cpfra"
generate_kibana_discover_url: true
import: "../../BaseRule.config"
index: "switchplus-ott-prod"
is_enabled: true
kibana_discover_app_url: ""
kibana_discover_from_timedelta:
  minutes: 10
kibana_discover_index_pattern_id: "fb57fb40-67ed-11ec-9329-ff0008e8c62a"
kibana_discover_to_timedelta:
  minutes: 10
kibana_discover_version: "7.15"
limit_execution: "0 */1 * * *"
match_enhancements: []
name: "Nb de lancement de player bas PLAYSTATION CPFRA"
pagerduty_api_version: "v2"
pagerduty_client_name: "elastalert"
pagerduty_event_type: "trigger"
pagerduty_incident_key: "nbLaunchPlayerPlaystationCPFRA"
pagerduty_service_key: "R03DL1FOO4FU3BPVDWUPARRTGWOCH29Z"
pagerduty_v2_payload_component: "threshold"
pagerduty_v2_payload_group: "FOTT"
pagerduty_v2_payload_severity: "critical"
pagerduty_v2_payload_source: "ElastAlert"
priority: 2
realert:
  minutes: 10
terms_size: 50
threshold: 150
timeframe:
  hours: 1
timestamp_field: "dt"
timestamp_type: "iso"
type: "flatline"
use_count_query: true
use_strftime_index: false`
And some other alarms, after being duplicated and the original removed, have a Query Log tab that reverts to the original's Query Log tab, as if the original had not been deleted at all, keeping the old frozen history. I am wondering whether the old / first alarm has really been deleted (and if not, why it does not appear in Praeco).
And that's why I have 3 questions:
1 - Any idea why this happens? (Other than after a Docker container is destroyed / rebuilt; I noticed it in that case.)
2 - Is there a way to "reconnect" all the alarms after such an incident without duplicating them (when that even works)? I tried enabling / disabling an alarm; it does not work.
3 - Are there any logs about a specific alarm in the Praeco docker and / or in the Elastalert docker? If so, where?
👀 Operating environment