Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] Improve scheduling of recurring tasks so they re-run more on-time #189114

Closed
mikecote opened this issue Jul 24, 2024 · 2 comments · Fixed by #190093
Closed

[Task Manager] Improve scheduling of recurring tasks so they re-run more on-time #189114

mikecote opened this issue Jul 24, 2024 · 2 comments · Fixed by #190093
Assignees
Labels
Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Copy link
Contributor

mikecote commented Jul 24, 2024

Today, recurring tasks are rescheduled based on when they started running. This approach helps spread tasks over time when there is a large amount of task delay so the load onto the system is more constant instead of sporadic. It also avoids constant thundering herds when enabling many rules simultaneously or when Kibana has been down for a period of time.

In the image below, you can see this in action where downtime causes a series of tasks to queue up the next time Kibana runs, but over time, the number of requests start to even out in the middle.
image

However, using startedAt in the calculation always causes some delay between each execution because of the time it takes for the system to start running a task after it's been due to run. For example, a task running 1m would normally run every ~1m 3s because the default poll interval of 3s contributes to the delay before tasks start running.

To preserve task spreading when the system is overloaded while fixing the minor delays added after every run, we should use the runAt whenever the task runs within 10s of when it was due. This way, in normal scenarios, the tasks run as close to their schedule as possible.

A code sample can be found in the file task_runner.ts #186972. We would need to add logic to apply this calculation only if the gap between runAt and startedAt is less than 10s.

Definition of Done

  • Tasks that have startedAt - runAt greater than 10s use the same calculation as exists today
  • Tasks that have startedAt - runAt less than 10s use the new calculation based on runAt instead
  • 10s defined as a constant variable somewhere so it's easy to change at a later time
  • Tasks never get scheduled in the past (Math.max(..., Date.now());)
  • Tests
@mikecote mikecote added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 24, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Copy link
Contributor Author

mikecote commented Aug 14, 2024

After prototyping this (#190093), I did notice the tasks sometimes run more frequently than scheduled but it's much closer to the set number of tasks per minute.

@mikecote mikecote self-assigned this Sep 10, 2024
mikecote added a commit that referenced this issue Sep 16, 2024
…r original time (#190093)

Resolves #189114

In this PR, I'm changing the logic to calculate the task's next run at.
Whenever the gap between the task's runAt and when it was picked up is
less than the poll interval, we'll use the `runAt` to schedule the next.
This way we don't continuously add time to the task's next run (ex:
running every 1m turns into every 1m 3s).

I've had to modify a few tests to have a more increased interval because
this made tasks run more frequently (on time), which introduced
flakiness.

## To verify
1. Create an alerting rule that runs every 10s
2. Apply the following diff to your code
```
diff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
index 55d5f85e5d3..4342dcdd845 100644
--- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
+++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
@@ -31,5 +31,7 @@ export function getNextRunAt(
     Date.now()
   );

+  console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`);
+
   return new Date(nextCalculatedRunAt);
 }
```
3. Observe the logs, the gap between runAt and startedAt should be less
than the poll interval, so the next run at is based on `runAt` instead
of `startedAt`.
4. Stop Kibana for 15 seconds then start it again
5. Observe the first logs when the rule runs again and notice now that
the gap between runAt and startedAt is larger than the poll interval,
the next run at is based on `startedAt` instead of `runAt` to spread the
tasks out evenly.

---------

Co-authored-by: Elastic Machine <[email protected]>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Sep 16, 2024
…r original time (elastic#190093)

Resolves elastic#189114

In this PR, I'm changing the logic to calculate the task's next run at.
Whenever the gap between the task's runAt and when it was picked up is
less than the poll interval, we'll use the `runAt` to schedule the next.
This way we don't continuously add time to the task's next run (ex:
running every 1m turns into every 1m 3s).

I've had to modify a few tests to have a more increased interval because
this made tasks run more frequently (on time), which introduced
flakiness.

## To verify
1. Create an alerting rule that runs every 10s
2. Apply the following diff to your code
```
diff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
index 55d5f85e5d3..4342dcdd845 100644
--- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
+++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
@@ -31,5 +31,7 @@ export function getNextRunAt(
     Date.now()
   );

+  console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`);
+
   return new Date(nextCalculatedRunAt);
 }
```
3. Observe the logs, the gap between runAt and startedAt should be less
than the poll interval, so the next run at is based on `runAt` instead
of `startedAt`.
4. Stop Kibana for 15 seconds then start it again
5. Observe the first logs when the rule runs again and notice now that
the gap between runAt and startedAt is larger than the poll interval,
the next run at is based on `startedAt` instead of `runAt` to spread the
tasks out evenly.

---------

Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 1f673dc)
kibanamachine referenced this issue Sep 16, 2024
…f their original time (#190093) (#193022)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Consistent scheduling when tasks run within the poll interval of
their original time
(#190093)](#190093)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Mike
Côté","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-09-16T14:10:36Z","message":"Consistent
scheduling when tasks run within the poll interval of their original
time (#190093)\n\nResolves
https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm
changing the logic to calculate the task's next run at.\r\nWhenever the
gap between the task's runAt and when it was picked up is\r\nless than
the poll interval, we'll use the `runAt` to schedule the next.\r\nThis
way we don't continuously add time to the task's next run
(ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to
modify a few tests to have a more increased interval because\r\nthis
made tasks run more frequently (on time), which
introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting
rule that runs every 10s\r\n2. Apply the following diff to your
code\r\n```\r\ndiff --git
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex
55d5f85e5d3..4342dcdd845 100644\r\n---
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5
+31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+
console.log(`*** Next run at: ${new
Date(nextCalculatedRunAt).toISOString()},
interval=${newSchedule?.interval ?? schedule.interval},
originalRunAt=${originalRunAt.toISOString()},
startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new
Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap
between runAt and startedAt should be less\r\nthan the poll interval, so
the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4.
Stop Kibana for 15 seconds then start it again\r\n5. Observe the first
logs when the rule runs again and notice now that\r\nthe gap between
runAt and startedAt is larger than the poll interval,\r\nthe next run at
is based on `startedAt` instead of `runAt` to spread the\r\ntasks out
evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Task
Manager","Team:ResponseOps","v9.0.0","backport:prev-minor","v8.16.0"],"title":"Consistent
scheduling when tasks run within the poll interval of their original
time","number":190093,"url":"https://github.com/elastic/kibana/pull/190093","mergeCommit":{"message":"Consistent
scheduling when tasks run within the poll interval of their original
time (#190093)\n\nResolves
https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm
changing the logic to calculate the task's next run at.\r\nWhenever the
gap between the task's runAt and when it was picked up is\r\nless than
the poll interval, we'll use the `runAt` to schedule the next.\r\nThis
way we don't continuously add time to the task's next run
(ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to
modify a few tests to have a more increased interval because\r\nthis
made tasks run more frequently (on time), which
introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting
rule that runs every 10s\r\n2. Apply the following diff to your
code\r\n```\r\ndiff --git
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex
55d5f85e5d3..4342dcdd845 100644\r\n---
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5
+31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+
console.log(`*** Next run at: ${new
Date(nextCalculatedRunAt).toISOString()},
interval=${newSchedule?.interval ?? schedule.interval},
originalRunAt=${originalRunAt.toISOString()},
startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new
Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap
between runAt and startedAt should be less\r\nthan the poll interval, so
the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4.
Stop Kibana for 15 seconds then start it again\r\n5. Observe the first
logs when the rule runs again and notice now that\r\nthe gap between
runAt and startedAt is larger than the poll interval,\r\nthe next run at
is based on `startedAt` instead of `runAt` to spread the\r\ntasks out
evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/190093","number":190093,"mergeCommit":{"message":"Consistent
scheduling when tasks run within the poll interval of their original
time (#190093)\n\nResolves
https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm
changing the logic to calculate the task's next run at.\r\nWhenever the
gap between the task's runAt and when it was picked up is\r\nless than
the poll interval, we'll use the `runAt` to schedule the next.\r\nThis
way we don't continuously add time to the task's next run
(ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to
modify a few tests to have a more increased interval because\r\nthis
made tasks run more frequently (on time), which
introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting
rule that runs every 10s\r\n2. Apply the following diff to your
code\r\n```\r\ndiff --git
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex
55d5f85e5d3..4342dcdd845 100644\r\n---
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5
+31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+
console.log(`*** Next run at: ${new
Date(nextCalculatedRunAt).toISOString()},
interval=${newSchedule?.interval ?? schedule.interval},
originalRunAt=${originalRunAt.toISOString()},
startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new
Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap
between runAt and startedAt should be less\r\nthan the poll interval, so
the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4.
Stop Kibana for 15 seconds then start it again\r\n5. Observe the first
logs when the rule runs again and notice now that\r\nthe gap between
runAt and startedAt is larger than the poll interval,\r\nthe next run at
is based on `startedAt` instead of `runAt` to spread the\r\ntasks out
evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Mike Côté <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants