
Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur #77096

Conversation

@mikecote (Contributor) commented on Sep 9, 2020

⚠️ This PR merges into a feature branch

Part of #65553.

Feature branch PR: #75666.

In this PR, I'm connecting two previous PRs (#75293 and #75679) by creating a configuration manager. As described in the proposal, back pressure will be applied as follows:

  • Reduce max workers by 20% every 10 seconds until 429 errors are no longer encountered
  • Increase poll interval by 20% every 10 seconds until 429 errors are no longer encountered

Once 429 errors are no longer encountered, the system will gradually return to the normal configuration as follows:

  • Increase max workers by 5% every 10 seconds until original configuration is reached
  • Decrease poll interval by 5% every 10 seconds until original configuration is reached

Each time the system starts deviating from the normal configuration values, a warning message is logged. For further insight, a series of debug messages is also logged.
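As a rough illustration of the idea (this is a simplified sketch, not the actual createManagedConfiguration implementation; the constant names, rounding, and lower bound here are illustrative), the maxWorkers side could be modeled as an observable derived from the stream of 429 errors:

```ts
import { Observable, interval } from 'rxjs';
import { scan, startWith, distinctUntilChanged } from 'rxjs/operators';

// Illustrative constants mirroring the behaviour described above.
const ADJUST_INTERVAL = 10 * 1000; // re-evaluate every 10 seconds
const DECREASE_FACTOR = 0.8;       // -20% while 429s are being seen
const INCREASE_FACTOR = 1.05;      // +5% while recovering

// Derive a maxWorkers$ stream from a stream of 429 errors (hypothetical shape;
// the real code also adjusts pollInterval and logs a warning on the first change).
function createMaxWorkers$(errors$: Observable<Error>, startingMaxWorkers: number) {
  let errorsSinceLastCheck = 0;
  errors$.subscribe(() => errorsSinceLastCheck++);

  return interval(ADJUST_INTERVAL).pipe(
    scan((maxWorkers) => {
      const sawErrors = errorsSinceLastCheck > 0;
      errorsSinceLastCheck = 0;
      return sawErrors
        ? // back off, but never drop below a single worker
          Math.max(1, Math.floor(maxWorkers * DECREASE_FACTOR))
        : // recover gradually until the configured value is reached again
          Math.min(startingMaxWorkers, Math.ceil(maxWorkers * INCREASE_FACTOR));
    }, startingMaxWorkers),
    startWith(startingMaxWorkers),
    distinctUntilChanged()
  );
}
```

The pollInterval side follows the same pattern with the factors inverted (increase while errors occur, decrease while recovering).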

@mikecote added the release_note:skip, Feature:Task Manager, and Team:ResponseOps labels on Sep 9, 2020
@mikecote self-assigned this on Sep 9, 2020
@gmmorris self-requested a review on September 17, 2020
@gmmorris (Contributor) left a comment

Looking good!
Let's get this to the finish line :)

@@ -31,6 +39,7 @@ export function createObservableMonitor<T, E>(
  return new Observable((subscriber) => {
    const subscription: Subscription = interval(heartbeatInterval)
      .pipe(
+       startWith(0),
@mikecote (Contributor, Author):

@gmmorris it seems that when the observable monitor was introduced, task manager wouldn't start claiming tasks until poll_interval * 2 had elapsed. With this change, it's now only poll_interval. I noticed this when writing some integration tests.

@gmmorris (Contributor):

Wouldn't that mean master was broken and hadn't been working for weeks?
What do you mean by "task manager wouldn't start claiming tasks until poll_interval * 2"?
That wouldn't be true on master, as poll_interval doesn't change on master...

Am I misunderstanding? 🤔

@mikecote (Contributor, Author):

Sorry, I should have clarified. In my tests, I noticed that updateByQuery wasn't called until 6000ms had passed instead of 3000ms when using a poll_interval of 3000. Adding this makes the call happen 3000ms after starting task manager instead of 6000ms.
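To illustrate the difference (a simplified sketch, not the actual createObservableMonitor code):

```ts
import { interval, timer, Observable } from 'rxjs';
import { startWith } from 'rxjs/operators';

const pollInterval = 3000;

// interval(pollInterval) emits its first value only after pollInterval ms,
// so a heartbeat built on it does nothing at startup.
const delayedHeartbeat$: Observable<number> = interval(pollInterval);

// Prepending an immediate emission makes the monitor fire once right away,
// so the first claim cycle runs one poll_interval sooner.
const immediateHeartbeat$: Observable<number> = interval(pollInterval).pipe(startWith(0));

// Equivalent alternative: timer(0, pollInterval) emits at 0ms and then every
// pollInterval ms (the commit history below shows startWith was later swapped for this).
const timerHeartbeat$: Observable<number> = timer(0, pollInterval);
```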

@gmmorris (Contributor) left a comment

Looking good, this aligns nicely with how we do other things in TM; can't wait to see all the pieces fall into place :)

@mikecote marked this pull request as ready for review on September 29, 2020
@mikecote requested a review from a team as a code owner on September 29, 2020
@elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) left a comment

LGTM, left some nit comments

  errors$.next(SavedObjectsErrorHelpers.createTooManyRequestsError('a', 'b'));
  clock.tick(ADJUST_THROUGHPUT_INTERVAL);
}
expect(subscription).toHaveBeenNthCalledWith(2, 80);
@pmuellr (Member):

nit: could put these numbers in an array and do the expect()s in a loop, but it may be harder to debug problems that way...

@mikecote (Contributor, Author):

I've been thinking about this as well. The two upsides I saw with this approach: 1) it provides a clear example of how the configuration gets reduced from 100 when errors keep emitting, and 2) it let me add comments explaining some of the inner usage of Math.floor and distinctUntilChanged() as the assertions happen. I could always cut a few assertions out.
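For reference, a rough sketch of what the loop-based version could look like, reusing the identifiers from the test above and assuming a starting maxWorkers of 100 reduced by 20% with Math.floor on each interval (the expected values are illustrative, not copied from the PR):

```ts
// Expected maxWorkers after each 10s window containing a 429, starting from 100:
// 100 -> 80 -> 64 -> 51 -> 40 -> 32.
const expectedMaxWorkers = [80, 64, 51, 40, 32];

expectedMaxWorkers.forEach((value, i) => {
  errors$.next(SavedObjectsErrorHelpers.createTooManyRequestsError('a', 'b'));
  clock.tick(ADJUST_THROUGHPUT_INTERVAL);
  // Call 1 is the initial emission (100); reduced values start at call 2.
  expect(subscription).toHaveBeenNthCalledWith(i + 2, value);
});
```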

@@ -0,0 +1,102 @@
/*
@pmuellr (Member):

So, a new directory, integration_tests? I guess the idea with these is that they actually launch a task manager to operate on, so they're a little different from our other jest tests. Cool - I could see us adding more tests here!

@mikecote (Contributor, Author):

Exactly! There are a few plugins that use the concept of jest integration tests to have something higher level than a unit test yet lower level than an API integration test, to make sure it all works together. I agree there's a lot of potential here for future tests.

I realized the test is run by the node scripts/jest script instead of node scripts/jest_integration. I'll look into it.
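For illustration only (a generic jest projects config, not Kibana's actual scripts/jest or scripts/jest_integration setup), one way to keep an integration_tests directory out of the default unit-test run is to give each runner its own match pattern:

```ts
// jest.config.ts - illustrative only; directory and project names are assumptions.
export default {
  projects: [
    {
      displayName: 'unit',
      testMatch: ['**/*.test.ts'],
      // keep integration tests out of the default unit run
      testPathIgnorePatterns: ['<rootDir>/.*integration_tests.*'],
    },
    {
      displayName: 'integration',
      testMatch: ['**/integration_tests/**/*.test.ts'],
    },
  ],
};
```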

@kibanamachine (Contributor) commented:

💚 Build Succeeded

Metrics [docs]

distributable file count

id      | value | diff | baseline
default | 45823 | +1   | 45822

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mikecote merged commit aa787b6 into elastic:feature/task_manager_429 on Sep 30, 2020
mikecote added a commit that referenced this pull request on Oct 13, 2020
…ith a 429 (#75666)

* Make task manager maxWorkers and pollInterval observables (#75293)

* WIP step 1

* WIP step 2

* Cleanup

* Make maxWorkers an observable for the task pool

* Cleanup

* Fix test failures

* Use BehaviorSubject

* Add some tests

* Make the task manager store emit error events (#75679)

* Add errors$ observable to the task store

* Add unit tests

* Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (#77096)

* WIP

* Cleanup

* Add error count to message

* Reset observable values on stop

* Add comments

* Fix issues when changing configurations

* Cleanup code

* Cleanup pt2

* Some renames

* Fix typecheck

* Use observables to manage throughput

* Rename class

* Switch to createManagedConfiguration

* Add some comments

* Start unit tests

* Add logs

* Fix log level

* Attempt at adding integration tests

* Fix test failures

* Fix timer

* Revert "Fix timer"

This reverts commit 0817e5e.

* Use Symbol

* Fix merge scan

* replace startsWith with a timer that is scheduled to 0

* typo

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Gidi Meir Morris <[email protected]>
mikecote added a commit to mikecote/kibana that referenced this pull request on Oct 13, 2020
…ith a 429 (elastic#75666)
mikecote added a commit that referenced this pull request on Oct 13, 2020
…ith a 429 (#75666) (#80355)