-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Task Manager] Improve scheduling of recurring tasks so they re-run more on-time #189114
Labels
Feature:Task Manager
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
Comments
mikecote
added
Feature:Task Manager
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
labels
Jul 24, 2024
Pinging @elastic/response-ops (Team:ResponseOps) |
48 tasks
After prototyping this (#190093), I did notice the tasks sometimes run more frequently than scheduled but it's much closer to the set number of tasks per minute. |
mikecote
added a commit
that referenced
this issue
Sep 16, 2024
…r original time (#190093) Resolves #189114 In this PR, I'm changing the logic to calculate the task's next run at. Whenever the gap between the task's runAt and when it was picked up is less than the poll interval, we'll use the `runAt` to schedule the next. This way we don't continuously add time to the task's next run (ex: running every 1m turns into every 1m 3s). I've had to modify a few tests to have a more increased interval because this made tasks run more frequently (on time), which introduced flakiness. ## To verify 1. Create an alerting rule that runs every 10s 2. Apply the following diff to your code ``` diff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts index 55d5f85e5d3..4342dcdd845 100644 --- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts +++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts @@ -31,5 +31,7 @@ export function getNextRunAt( Date.now() ); + console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`); + return new Date(nextCalculatedRunAt); } ``` 3. Observe the logs, the gap between runAt and startedAt should be less than the poll interval, so the next run at is based on `runAt` instead of `startedAt`. 4. Stop Kibana for 15 seconds then start it again 5. Observe the first logs when the rule runs again and notice now that the gap between runAt and startedAt is larger than the poll interval, the next run at is based on `startedAt` instead of `runAt` to spread the tasks out evenly. --------- Co-authored-by: Elastic Machine <[email protected]>
kibanamachine
pushed a commit
to kibanamachine/kibana
that referenced
this issue
Sep 16, 2024
…r original time (elastic#190093) Resolves elastic#189114 In this PR, I'm changing the logic to calculate the task's next run at. Whenever the gap between the task's runAt and when it was picked up is less than the poll interval, we'll use the `runAt` to schedule the next. This way we don't continuously add time to the task's next run (ex: running every 1m turns into every 1m 3s). I've had to modify a few tests to have a more increased interval because this made tasks run more frequently (on time), which introduced flakiness. ## To verify 1. Create an alerting rule that runs every 10s 2. Apply the following diff to your code ``` diff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts index 55d5f85e5d3..4342dcdd845 100644 --- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts +++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts @@ -31,5 +31,7 @@ export function getNextRunAt( Date.now() ); + console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`); + return new Date(nextCalculatedRunAt); } ``` 3. Observe the logs, the gap between runAt and startedAt should be less than the poll interval, so the next run at is based on `runAt` instead of `startedAt`. 4. Stop Kibana for 15 seconds then start it again 5. Observe the first logs when the rule runs again and notice now that the gap between runAt and startedAt is larger than the poll interval, the next run at is based on `startedAt` instead of `runAt` to spread the tasks out evenly. --------- Co-authored-by: Elastic Machine <[email protected]> (cherry picked from commit 1f673dc)
kibanamachine
referenced
this issue
Sep 16, 2024
…f their original time (#190093) (#193022) # Backport This will backport the following commits from `main` to `8.x`: - [Consistent scheduling when tasks run within the poll interval of their original time (#190093)](#190093) <!--- Backport version: 9.4.3 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Mike Côté","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-09-16T14:10:36Z","message":"Consistent scheduling when tasks run within the poll interval of their original time (#190093)\n\nResolves https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm changing the logic to calculate the task's next run at.\r\nWhenever the gap between the task's runAt and when it was picked up is\r\nless than the poll interval, we'll use the `runAt` to schedule the next.\r\nThis way we don't continuously add time to the task's next run (ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to modify a few tests to have a more increased interval because\r\nthis made tasks run more frequently (on time), which introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting rule that runs every 10s\r\n2. Apply the following diff to your code\r\n```\r\ndiff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex 55d5f85e5d3..4342dcdd845 100644\r\n--- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5 +31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+ console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap between runAt and startedAt should be less\r\nthan the poll interval, so the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4. Stop Kibana for 15 seconds then start it again\r\n5. Observe the first logs when the rule runs again and notice now that\r\nthe gap between runAt and startedAt is larger than the poll interval,\r\nthe next run at is based on `startedAt` instead of `runAt` to spread the\r\ntasks out evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine <[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Task Manager","Team:ResponseOps","v9.0.0","backport:prev-minor","v8.16.0"],"title":"Consistent scheduling when tasks run within the poll interval of their original time","number":190093,"url":"https://github.com/elastic/kibana/pull/190093","mergeCommit":{"message":"Consistent scheduling when tasks run within the poll interval of their original time (#190093)\n\nResolves https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm changing the logic to calculate the task's next run at.\r\nWhenever the gap between the task's runAt and when it was picked up is\r\nless than the poll interval, we'll use the `runAt` to schedule the next.\r\nThis way we don't continuously add time to the task's next run (ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to modify a few tests to have a more increased interval because\r\nthis made tasks run more frequently (on time), which introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting rule that runs every 10s\r\n2. Apply the following diff to your code\r\n```\r\ndiff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex 55d5f85e5d3..4342dcdd845 100644\r\n--- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5 +31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+ console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap between runAt and startedAt should be less\r\nthan the poll interval, so the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4. Stop Kibana for 15 seconds then start it again\r\n5. Observe the first logs when the rule runs again and notice now that\r\nthe gap between runAt and startedAt is larger than the poll interval,\r\nthe next run at is based on `startedAt` instead of `runAt` to spread the\r\ntasks out evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine <[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/190093","number":190093,"mergeCommit":{"message":"Consistent scheduling when tasks run within the poll interval of their original time (#190093)\n\nResolves https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm changing the logic to calculate the task's next run at.\r\nWhenever the gap between the task's runAt and when it was picked up is\r\nless than the poll interval, we'll use the `runAt` to schedule the next.\r\nThis way we don't continuously add time to the task's next run (ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to modify a few tests to have a more increased interval because\r\nthis made tasks run more frequently (on time), which introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting rule that runs every 10s\r\n2. Apply the following diff to your code\r\n```\r\ndiff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex 55d5f85e5d3..4342dcdd845 100644\r\n--- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5 +31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+ console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap between runAt and startedAt should be less\r\nthan the poll interval, so the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4. Stop Kibana for 15 seconds then start it again\r\n5. Observe the first logs when the rule runs again and notice now that\r\nthe gap between runAt and startedAt is larger than the poll interval,\r\nthe next run at is based on `startedAt` instead of `runAt` to spread the\r\ntasks out evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine <[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}] BACKPORT--> Co-authored-by: Mike Côté <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Feature:Task Manager
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
Today, recurring tasks are rescheduled based on when they started running. This approach helps spread tasks over time when there is a large amount of task delay so the load onto the system is more constant instead of sporadic. It also avoids constant thundering herds when enabling many rules simultaneously or when Kibana has been down for a period of time.
In the image below, you can see this in action where downtime causes a series of tasks to queue up the next time Kibana runs, but over time, the number of requests start to even out in the middle.
However, using
startedAt
in the calculation always causes some delay between each execution because of the time it takes for the system to start running a task after it's been due to run. For example, a task running1m
would normally run every ~1m 3s
because the default poll interval of3s
contributes to the delay before tasks start running.To preserve task spreading when the system is overloaded while fixing the minor delays added after every run, we should use the
runAt
whenever the task runs within10s
of when it was due. This way, in normal scenarios, the tasks run as close to their schedule as possible.A code sample can be found in the file
task_runner.ts
#186972. We would need to add logic to apply this calculation only if the gap between runAt and startedAt is less than 10s.Definition of Done
startedAt - runAt
greater than 10s use the same calculation as exists todaystartedAt - runAt
less than 10s use the new calculation based onrunAt
insteadMath.max(..., Date.now());
)The text was updated successfully, but these errors were encountered: