
[BUG] - apscheduler skipping alerts #187

Closed
markus-nclose opened this issue May 24, 2021 · 13 comments · Fixed by #192

Comments

@markus-nclose
Contributor

Firstly, thanks for maintaining the project.

Elastalert version - latest
Python version - Python 3.8.5
OS - Ubuntu 20.04.1 LTS

Problem description - This problem carries over from the original ElastAlert. We noticed that the number of rules actually being run by ElastAlert was different every time it ran - this was observed in the ElastAlert Elasticsearch index.
We never had this issue with a "small" number of rules and only noticed it when a large set of rules was loaded.

In the Elastalert logs you would see this intermittently:

May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:03 UTC)" was missed by 0:00:02.944895
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.912215
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:06 UTC)" was missed by 0:00:02.827846
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.758194
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:07 UTC)" was missed by 0:00:02.758226
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:08 UTC)" was missed by 0:00:02.617983
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:06 UTC)" was missed by 0:00:02.407513
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:07 UTC)" was missed by 0:00:02.351592
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.262315
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.244299
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:04 UTC)" was missed by 0:00:02.242278
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:04 UTC)" was missed by 0:00:02.237550

We modified elastalert.py and added misfire_grace_time to the job as a hack to ensure all the rules run.
The parameter is documented here: https://apscheduler.readthedocs.io/en/stable/modules/job.html
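For anyone hitting the same warnings, here is a minimal sketch of the kind of change we made. It only illustrates the apscheduler API; the actual call site in elastalert.py is different, and the rule dict and 480-second interval below are placeholders:

```python
# Minimal sketch of the workaround (not the exact elastalert.py diff): pass
# misfire_grace_time to add_job so a job that starts a few seconds late is
# still executed instead of being skipped. The rule dict and interval are
# placeholders standing in for ElastAlerter.handle_rule_execution and run_every.
import time
from apscheduler.schedulers.background import BackgroundScheduler

def handle_rule_execution(rule):
    # Stand-in for the per-rule job ElastAlert schedules.
    print(f"running rule: {rule['name']}")

scheduler = BackgroundScheduler()
for rule in [{"name": "example_rule", "run_every": 480}]:  # 0:08:00, as in the logs above
    scheduler.add_job(
        handle_rule_execution,
        "interval",
        seconds=rule["run_every"],
        args=[rule],
        misfire_grace_time=300,  # still run the job if it starts up to 5 minutes late
    )
scheduler.start()
time.sleep(600)  # keep the process alive so the background jobs can fire
```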

This is the result of the change:
[screenshot]

@jertel
Owner

jertel commented May 24, 2021

Yes, ElastAlert can struggle with large numbers of rules, especially if the rule queries aren't tuned adequately or if the ES cluster is underpowered. If you can put together a PR for this new option, we can get it merged in for the next release.

@ferozsalam
Collaborator

  • Would a misfire_grace_time of None be better here? According to the docs: None means “allow the job to run no matter how late it is”
  • Out of curiosity, what are the consequences of missing a job? Does the job get run again, taking into account any documents from the missed time period? Or does ElastAlert potentially skip documents entirely?

@markus-nclose
Contributor Author

markus-nclose commented May 25, 2021

@jertel

  • The Elastic cluster is quite fast and optimized. 95% of our queries are very basic and run extremely fast.
  • ElastAlert runs through the rules within a minute and then sits and waits for the next scheduled run (with a large number of the above warnings). It's almost like there is a hard limit on the number of rules based on a time calculation.
  • Currently we're running 330 rules/alerts.
  • I'm not confident in my ability to submit a PR for this; I'm not sure what the right fix is.

@ferozsalam Monitoring the elastalert_status index, we can see an average of 150-180 out of ~330 rules/alerts being run each schedule, and this happens indefinitely. We couldn't tell whether the same rules were excluded from running every time.

@ferozsalam
Collaborator

This is very curious! I did some investigation on our setup - we run ~128 rules against our cluster and I can see no evidence of this issue occurring.

I think setting the misfire_grace_time might be a reasonable hotfix for the moment, but I agree the root question of whether the scheduling is buggy/not performant enough probably requires a more thorough investigation. 330 rules is really not very many, in particular if the queries are simple and the cluster is performant.

I would be interested to see whether others are also having this issue, and to see the average time_taken values for rules in the cluster, as stored in the elastalert_status index.
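If it helps others gather the same data, here's a rough sketch (elasticsearch-py 7.x style) for pulling average time_taken per rule out of elastalert_status. It assumes the rule_name and time_taken fields ElastAlert writes and a default writeback index name; adjust if your setup differs:

```python
# Rough sketch: average time_taken per rule over the last hour, from the
# elastalert_status writeback index. Field names (rule_name, time_taken) and
# the index name assume a default ElastAlert setup.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="elastalert_status",
    body={
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "aggs": {
            "per_rule": {
                "terms": {"field": "rule_name", "size": 500},
                "aggs": {"avg_time_taken": {"avg": {"field": "time_taken"}}},
            }
        },
    },
)

for bucket in resp["aggregations"]["per_rule"]["buckets"]:
    print(f"{bucket['key']}: {bucket['avg_time_taken']['value']:.2f}s average")
```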

@markus-nclose
Contributor Author

Here you can see the interval of ElastAlert and the schedule. Note the inconsistent number of rules run.
[screenshot]
Average time_taken for the same period:
[screenshot]

@ferozsalam
Collaborator

Thanks for the prompt reply. One thing worth mentioning is that the time_taken from your cluster is relatively high. Most of the rules in our cluster run in <0.5s, and nothing exceeds 2s.

I'm not certain that this is the source of the problem - I see that you have mentioned that ElastAlert seems to sit waiting with nothing to do - but I think there's a chance it might be making it worse.

For the rules that take a particularly long time (perhaps >5s), are you able to provide an example of what they look like?

@markus-nclose
Contributor Author

Yeah, like I mentioned initially, 95% of our queries are very basic. Unfortunately we have a few (~15 alerts) that run regex and take 20-40s, which skews the numbers; all the rest run for about 0.1-0.6s.
I will quickly do a test against an Elastic instance with no data to confirm that search time is not the issue.

@markus-nclose
Contributor Author

Okay, @ferozsalam, you're right. It looks like when the queries take some time, it skips some rules. I find it strange that it skips running some rules and then sits and waits 5 minutes for the next run. Unfortunately there is no way around the regex/slow rules, and I'm sure other people will unknowingly experience the same issue.

@jertel
Owner

jertel commented May 25, 2021

The BackgroundScheduler defaults to a 10-thread limit (https://github.com/agronholm/apscheduler/blob/3.0/apscheduler/executors/pool.py), so if your roughly 15 slow rules are using up all those threads for 20-30 seconds, then no other threads are available to service the rest of the rules.
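For reference, a minimal sketch of what raising that limit looks like with the apscheduler 3.x API (ElastAlert's actual scheduler setup may wire this differently):

```python
# Sketch: replace the default 10-thread executor with a larger pool so a
# handful of slow rules can't starve all the other scheduled rule jobs.
from apscheduler.executors.pool import ThreadPoolExecutor
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler(
    executors={"default": ThreadPoolExecutor(max_workers=30)}  # default is 10
)
scheduler.start()
```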

@mrfroggg
Contributor

Could we maybe add an optional configuration parameter to change/raise that number of threads?

@jertel
Owner

jertel commented May 25, 2021

Yes, I think both options would be useful:

  1. Customize number of scheduler task threads
  2. Customize misfire grace time (seconds)
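A rough sketch of how the two options might be wired together when the scheduler is built; the config key names below are placeholders, not necessarily the option names the eventual PR uses:

```python
# Hypothetical wiring of both proposed options; "scheduler_threads" and
# "misfire_grace_time" are placeholder config keys, not ElastAlert's real
# option names.
from apscheduler.executors.pool import ThreadPoolExecutor
from apscheduler.schedulers.background import BackgroundScheduler

conf = {
    "scheduler_threads": 30,    # option 1: number of scheduler task threads
    "misfire_grace_time": 300,  # option 2: misfire grace time, in seconds
}

scheduler = BackgroundScheduler(
    executors={"default": ThreadPoolExecutor(max_workers=conf["scheduler_threads"])},
    job_defaults={"misfire_grace_time": conf["misfire_grace_time"]},
)
scheduler.start()
```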

@jertel
Owner

jertel commented May 26, 2021

Take a look at PR #192 and let me know if you have any concerns with this change.

@mrfroggg
Contributor

Looks good!
