
Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur #77096

Conversation

@mikecote (Contributor) commented on Sep 9, 2020

⚠️ This PR merges into a feature branch

Part of #65553.

Feature branch PR: #75666.

In this PR, I'm connecting two previous PRs (#75293 and #75679) by creating a configuration manager. As described in the proposal, back pressure will be applied as follows:

  • Reduce max workers by 20% every 10 seconds until 429 errors are no longer encountered
  • Increase poll interval by 20% every 10 seconds until 429 errors are no longer encountered

Once 429 errors are no longer encountered, the system will gradually return to the normal configuration as follows:

  • Increase max workers by 5% every 10 seconds until original configuration is reached
  • Decrease poll interval by 5% every 10 seconds until original configuration is reached

Each time the system starts deviating from the normal configuration values, a warning message is logged. For further insight, a series of debug messages is also logged.
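As a rough illustration of the idea (this is a simplified sketch, not the actual createManagedConfiguration implementation; the constant names, rounding, and lower bound here are illustrative), the maxWorkers side could be modeled as an observable derived from the stream of 429 errors:

```ts
import { Observable, interval } from 'rxjs';
import { scan, startWith, distinctUntilChanged } from 'rxjs/operators';

// Illustrative constants mirroring the behaviour described above.
const ADJUST_INTERVAL = 10 * 1000; // re-evaluate every 10 seconds
const DECREASE_FACTOR = 0.8;       // -20% while 429s are being seen
const INCREASE_FACTOR = 1.05;      // +5% while recovering

// Derive a maxWorkers$ stream from a stream of 429 errors (hypothetical shape;
// the real code also adjusts pollInterval and logs a warning on the first change).
function createMaxWorkers$(errors$: Observable<Error>, startingMaxWorkers: number) {
  let errorsSinceLastCheck = 0;
  errors$.subscribe(() => errorsSinceLastCheck++);

  return interval(ADJUST_INTERVAL).pipe(
    scan((maxWorkers) => {
      const sawErrors = errorsSinceLastCheck > 0;
      errorsSinceLastCheck = 0;
      return sawErrors
        ? // back off, but never drop below a single worker
          Math.max(1, Math.floor(maxWorkers * DECREASE_FACTOR))
        : // recover gradually until the configured value is reached again
          Math.min(startingMaxWorkers, Math.ceil(maxWorkers * INCREASE_FACTOR));
    }, startingMaxWorkers),
    startWith(startingMaxWorkers),
    distinctUntilChanged()
  );
}
```

The pollInterval side follows the same pattern with the factors inverted (increase while errors occur, decrease while recovering).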

@mikecote added the release_note:skip, Feature:Task Manager, and Team:ResponseOps labels on Sep 9, 2020
@mikecote self-assigned this on Sep 9, 2020
@gmmorris self-requested a review on September 17, 2020
@gmmorris (Contributor) left a comment

Looking good!
Let's get this to the finish line :)

@@ -31,6 +39,7 @@ export function createObservableMonitor<T, E>(
  return new Observable((subscriber) => {
    const subscription: Subscription = interval(heartbeatInterval)
      .pipe(
+       startWith(0),
@mikecote (Contributor, Author):

@gmmorris it seems that when the observable monitor was introduced, task manager wouldn't start claiming tasks until poll_interval * 2 had elapsed. With this change, it's now only poll_interval. I noticed this when writing some integration tests.

@gmmorris (Contributor):

Wouldn't that mean master was broken and hadn't been working for weeks?
What do you mean by "task manager wouldn't start claiming tasks until poll_interval * 2"?
That wouldn't be true on master, as poll_interval doesn't change on master...

Am I misunderstanding? 🤔

@mikecote (Contributor, Author):

Sorry, I should have clarified. In my tests, I noticed that updateByQuery wasn't called until 6000ms had passed instead of 3000ms when using a poll_interval of 3000. Adding this makes the call happen 3000ms after starting task manager instead of 6000ms.
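To illustrate the difference (a simplified sketch, not the actual createObservableMonitor code):

```ts
import { interval, timer, Observable } from 'rxjs';
import { startWith } from 'rxjs/operators';

const pollInterval = 3000;

// interval(pollInterval) emits its first value only after pollInterval ms,
// so a heartbeat built on it does nothing at startup.
const delayedHeartbeat$: Observable<number> = interval(pollInterval);

// Prepending an immediate emission makes the monitor fire once right away,
// so the first claim cycle runs one poll_interval sooner.
const immediateHeartbeat$: Observable<number> = interval(pollInterval).pipe(startWith(0));

// Equivalent alternative: timer(0, pollInterval) emits at 0ms and then every
// pollInterval ms (the commit history below shows startWith was later swapped for this).
const timerHeartbeat$: Observable<number> = timer(0, pollInterval);
```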

@gmmorris (Contributor) left a comment

Looking good, this aligns nicely with how we do other things in TM; can't wait to see all the pieces fall into place :)

@mikecote marked this pull request as ready for review on September 29, 2020
@mikecote requested a review from a team as a code owner on September 29, 2020
@elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) left a comment

LGTM, left some nit comments

  errors$.next(SavedObjectsErrorHelpers.createTooManyRequestsError('a', 'b'));
  clock.tick(ADJUST_THROUGHPUT_INTERVAL);
}
expect(subscription).toHaveBeenNthCalledWith(2, 80);
@pmuellr (Member):

nit: could put these numbers in an array and do the expect()s in a loop, but it may be harder to debug problems that way...

@mikecote (Contributor, Author):

I've been thinking about this as well. The two upsides I saw with this approach: 1) it provides a clear example of how the configuration gets reduced from 100 when errors keep emitting, and 2) it let me add comments explaining some of the inner usage of Math.floor and distinctUntilChanged() as the assertions happen. I could always cut a few assertions out.
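For reference, a rough sketch of what the loop-based version could look like, reusing the identifiers from the test above and assuming a starting maxWorkers of 100 reduced by 20% with Math.floor on each interval (the expected values are illustrative, not copied from the PR):

```ts
// Expected maxWorkers after each 10s window containing a 429, starting from 100:
// 100 -> 80 -> 64 -> 51 -> 40 -> 32.
const expectedMaxWorkers = [80, 64, 51, 40, 32];

expectedMaxWorkers.forEach((value, i) => {
  errors$.next(SavedObjectsErrorHelpers.createTooManyRequestsError('a', 'b'));
  clock.tick(ADJUST_THROUGHPUT_INTERVAL);
  // Call 1 is the initial emission (100); reduced values start at call 2.
  expect(subscription).toHaveBeenNthCalledWith(i + 2, value);
});
```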

@@ -0,0 +1,102 @@
/*
@pmuellr (Member):

So, a new directory, integration_tests? I guess the idea with these is that they actually launch a task manager to operate on, so they're a little different from our other jest tests. Cool - I could see us adding more tests here!

@mikecote (Contributor, Author):

Exactly! There are a few plugins that use the concept of jest integration tests to have something higher level than a unit test yet lower level than an API integration test, to make sure it all works together. I agree there's a lot of potential here for future tests.

I realized the test is run by the node scripts/jest script instead of node scripts/jest_integration. I'll look into it.
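For illustration only (a generic jest projects config, not Kibana's actual scripts/jest or scripts/jest_integration setup), one way to keep an integration_tests directory out of the default unit-test run is to give each runner its own match pattern:

```ts
// jest.config.ts - illustrative only; directory and project names are assumptions.
export default {
  projects: [
    {
      displayName: 'unit',
      testMatch: ['**/*.test.ts'],
      // keep integration tests out of the default unit run
      testPathIgnorePatterns: ['<rootDir>/.*integration_tests.*'],
    },
    {
      displayName: 'integration',
      testMatch: ['**/integration_tests/**/*.test.ts'],
    },
  ],
};
```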

@kibanamachine (Contributor) commented:

💚 Build Succeeded

Metrics [docs]

distributable file count

id      | value | diff | baseline
default | 45823 | +1   | 45822

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mikecote merged commit aa787b6 into elastic:feature/task_manager_429 on Sep 30, 2020
mikecote added a commit that referenced this pull request on Oct 13, 2020
…ith a 429 (#75666)

* Make task manager maxWorkers and pollInterval observables (#75293)

* WIP step 1

* WIP step 2

* Cleanup

* Make maxWorkers an observable for the task pool

* Cleanup

* Fix test failures

* Use BehaviorSubject

* Add some tests

* Make the task manager store emit error events (#75679)

* Add errors$ observable to the task store

* Add unit tests

* Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (#77096)

* WIP

* Cleanup

* Add error count to message

* Reset observable values on stop

* Add comments

* Fix issues when changing configurations

* Cleanup code

* Cleanup pt2

* Some renames

* Fix typecheck

* Use observables to manage throughput

* Rename class

* Switch to createManagedConfiguration

* Add some comments

* Start unit tests

* Add logs

* Fix log level

* Attempt at adding integration tests

* Fix test failures

* Fix timer

* Revert "Fix timer"

This reverts commit 0817e5e.

* Use Symbol

* Fix merge scan

* replace startsWith with a timer that is scheduled to 0

* typo

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Gidi Meir Morris <[email protected]>
mikecote added a commit to mikecote/kibana that referenced this pull request on Oct 13, 2020
…ith a 429 (elastic#75666)
mikecote added a commit that referenced this pull request on Oct 13, 2020
…ith a 429 (#75666) (#80355)