
[BUG] - apscheduler skipping alerts #187

Closed
markus-nclose opened this issue May 24, 2021 · 13 comments · Fixed by #192

Comments

@markus-nclose
Contributor

Firstly, thanks for maintaining the project.

Elastalert version - latest
Python version - Python 3.8.5
OS - Ubuntu 20.04.1 LTS

Problem description - This problem carries over from the original ElastAlert. We noticed that the number of rules actually being run by ElastAlert was different every time it ran - this was observed in the ElastAlert Elasticsearch index.
We never had this issue with a "small" number of rules and only noticed it when a large set of rules was loaded.

In the Elastalert logs you would see this intermittently:

May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:03 UTC)" was missed by 0:00:02.944895
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.912215
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:06 UTC)" was missed by 0:00:02.827846
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.758194
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:07 UTC)" was missed by 0:00:02.758226
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:08 UTC)" was missed by 0:00:02.617983
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:06 UTC)" was missed by 0:00:02.407513
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:07 UTC)" was missed by 0:00:02.351592
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.262315
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:05 UTC)" was missed by 0:00:02.244299
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:04 UTC)" was missed by 0:00:02.242278
May 24 09:27:06 minimdr python[2498368]: WARNING:apscheduler.executors.default:Run time of job "ElastAlerter.handle_rule_execution (trigger: interval[0:08:00], next run at: 2021-05-24 09:35:04 UTC)" was missed by 0:00:02.237550

We modified elastalert.py and added misfire_grace_time to the job as a hack to ensure all the rules run.
The parameter is documented here: https://apscheduler.readthedocs.io/en/stable/modules/job.html
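For anyone hitting the same warnings, here is a minimal sketch of the kind of change we made. It only illustrates the apscheduler API; the actual call site in elastalert.py is different, and the rule dict and 480-second interval below are placeholders:

```python
# Minimal sketch of the workaround (not the exact elastalert.py diff): pass
# misfire_grace_time to add_job so a job that starts a few seconds late is
# still executed instead of being skipped. The rule dict and interval are
# placeholders standing in for ElastAlerter.handle_rule_execution and run_every.
import time
from apscheduler.schedulers.background import BackgroundScheduler

def handle_rule_execution(rule):
    # Stand-in for the per-rule job ElastAlert schedules.
    print(f"running rule: {rule['name']}")

scheduler = BackgroundScheduler()
for rule in [{"name": "example_rule", "run_every": 480}]:  # 0:08:00, as in the logs above
    scheduler.add_job(
        handle_rule_execution,
        "interval",
        seconds=rule["run_every"],
        args=[rule],
        misfire_grace_time=300,  # still run the job if it starts up to 5 minutes late
    )
scheduler.start()
time.sleep(600)  # keep the process alive so the background jobs can fire
```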

This is the result of the change:
[screenshot]

@jertel
Owner

jertel commented May 24, 2021

Yes, ElastAlert can struggle with large numbers of rules, especially if the rule queries aren't tuned adequately or if the ES cluster is underpowered. If you can put together a PR for this new option, we can get it merged in for the next release.

@ferozsalam
Collaborator

  • Would a misfire_grace_time of None be better here? According to the docs: None means “allow the job to run no matter how late it is”
  • Out of curiosity, what are the consequences of missing a job? Does the job get run again, taking into account any documents from the missed time period? Or does ElastAlert potentially skip documents entirely?

@markus-nclose
Contributor Author

markus-nclose commented May 25, 2021

@jertel

  • The Elastic cluster is quite fast and optimized. 95% of our queries are very basic and run extremely fast.
  • ElastAlert runs through the rules within a minute and then sits and waits for the next scheduled run (with a large number of the above warnings). It's almost like there is a hard limit on the number of rules based on a time calculation.
  • Currently we're running 330 rules/alerts.
  • I'm not confident in my ability to submit a PR for this; I'm not sure what the right fix is.

@ferozsalam Monitoring the elastalert_status index, we can see an average of 150-180 out of ~330 rules/alerts being run each schedule, and this happens indefinitely. We couldn't tell whether the same rules were excluded from running every time.

@ferozsalam
Collaborator

This is very curious! I did some investigation on our setup - we run ~128 rules against our cluster and I can see no evidence of this issue occurring.

I think setting the misfire_grace_time might be a reasonable hotfix for the moment, but I agree the root question of whether the scheduling is buggy/not performant enough probably requires a more thorough investigation. 330 rules is really not very many, in particular if the queries are simple and the cluster is performant.

I would be interested to see whether others are also having this issue, and to see the average time_taken values for rules in the cluster, as stored in the elastalert_status index.
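If it helps others gather the same data, here's a rough sketch (elasticsearch-py 7.x style) for pulling average time_taken per rule out of elastalert_status. It assumes the rule_name and time_taken fields ElastAlert writes and a default writeback index name; adjust if your setup differs:

```python
# Rough sketch: average time_taken per rule over the last hour, from the
# elastalert_status writeback index. Field names (rule_name, time_taken) and
# the index name assume a default ElastAlert setup.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="elastalert_status",
    body={
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "aggs": {
            "per_rule": {
                "terms": {"field": "rule_name", "size": 500},
                "aggs": {"avg_time_taken": {"avg": {"field": "time_taken"}}},
            }
        },
    },
)

for bucket in resp["aggregations"]["per_rule"]["buckets"]:
    print(f"{bucket['key']}: {bucket['avg_time_taken']['value']:.2f}s average")
```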

@markus-nclose
Contributor Author

Here you can see the interval of ElastAlert and the schedule. Note the inconsistent number of rules run.
[screenshot]
Average time_taken for the same period:
[screenshot]

@ferozsalam
Collaborator

Thanks for the prompt reply. One thing worth mentioning is that the time_taken from your cluster is relatively high. Most of the rules in our cluster run in <0.5s, and nothing exceeds 2s.

I'm not certain that this is the source of the problem - I see that you have mentioned that ElastAlert seems to sit waiting with nothing to do - but I think there's a chance it might be making it worse.

For the rules that take a particularly long time (perhaps >5s), are you able to provide an example of what they look like?

@markus-nclose
Contributor Author

Yeah, like I mentioned initially, 95% of our queries are very basic. Unfortunately we have a few (~15 alerts) that run regex and take 20-40s, which skews the numbers; all the rest run for about 0.1-0.6s.
I will quickly do a test against an Elastic instance with no data to confirm that search time is not the issue.

@markus-nclose
Contributor Author

Okay, @ferozsalam, you're right. It looks like when the queries take some time, it skips some rules. I find it strange that it skips running some rules and then sits and waits 5 minutes for the next run. Unfortunately there is no way around the regex/slow rules, and I'm sure other people will unknowingly experience the same issue.

@jertel
Owner

jertel commented May 25, 2021

The BackgroundScheduler defaults to a 10-thread limit (https://github.com/agronholm/apscheduler/blob/3.0/apscheduler/executors/pool.py), so if your roughly 15 slow rules are using up all those threads for 20-30 seconds, then no other threads are available to service the rest of the rules.
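For reference, a minimal sketch of what raising that limit looks like with the apscheduler 3.x API (ElastAlert's actual scheduler setup may wire this differently):

```python
# Sketch: replace the default 10-thread executor with a larger pool so a
# handful of slow rules can't starve all the other scheduled rule jobs.
from apscheduler.executors.pool import ThreadPoolExecutor
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler(
    executors={"default": ThreadPoolExecutor(max_workers=30)}  # default is 10
)
scheduler.start()
```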

@mrfroggg
Contributor

Could we maybe add an optional configuration parameter to change/raise that number of threads?

@jertel
Owner

jertel commented May 25, 2021

Yes, I think both options would be useful:

  1. Customize number of scheduler task threads
  2. Customize misfire grace time (seconds)
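A rough sketch of how the two options might be wired together when the scheduler is built; the config key names below are placeholders, not necessarily the option names the eventual PR uses:

```python
# Hypothetical wiring of both proposed options; "scheduler_threads" and
# "misfire_grace_time" are placeholder config keys, not ElastAlert's real
# option names.
from apscheduler.executors.pool import ThreadPoolExecutor
from apscheduler.schedulers.background import BackgroundScheduler

conf = {
    "scheduler_threads": 30,    # option 1: number of scheduler task threads
    "misfire_grace_time": 300,  # option 2: misfire grace time, in seconds
}

scheduler = BackgroundScheduler(
    executors={"default": ThreadPoolExecutor(max_workers=conf["scheduler_threads"])},
    job_defaults={"misfire_grace_time": conf["misfire_grace_time"]},
)
scheduler.start()
```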

@jertel
Owner

jertel commented May 26, 2021

Take a look at PR #192 and let me know if you have any concerns with this change.

@mrfroggg
Contributor

Looks good!
