-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur #77096
Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur #77096
Conversation
…into task-manager/throughput-manager
…into task-manager/throughput-manager
…into task-manager/throughput-manager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
Lets get this to the finish line :)
…into task-manager/throughput-manager
…into task-manager/throughput-manager
…into task-manager/throughput-manager
@@ -31,6 +39,7 @@ export function createObservableMonitor<T, E>( | |||
return new Observable((subscriber) => { | |||
const subscription: Subscription = interval(heartbeatInterval) | |||
.pipe( | |||
startWith(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmmorris it seems when introducing the observable monitor that task manager wouldn't start claiming tasks until poll_interval * 2. With this change, it's now only poll_interval. I noticed this when writing some integration tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that mean Master was broken and wouldn't be working for weeks?
What do you mean by task manager wouldn't start claiming tasks until poll_interval * 2.
?
That wouldn't be true on master as poll_interval
doesn't change on master....
Am I misunderstanding? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I should have clarified. In my tests, I noticed the updateByQuery
wasn't called until 6000ms passed instead of 3000ms when using a poll_interval of 3000. Adding this makes the call happen 3000ms after starting task manager instead of 6000ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, this aligns nicely with how we do other things in TM, can't wait to se all the pieces fall into place :)
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
This reverts commit 0817e5e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left some nit comments
x-pack/plugins/task_manager/server/lib/create_managed_configuration.ts
Outdated
Show resolved
Hide resolved
x-pack/plugins/task_manager/server/lib/create_managed_configuration.ts
Outdated
Show resolved
Hide resolved
errors$.next(SavedObjectsErrorHelpers.createTooManyRequestsError('a', 'b')); | ||
clock.tick(ADJUST_THROUGHPUT_INTERVAL); | ||
} | ||
expect(subscription).toHaveBeenNthCalledWith(2, 80); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could put these numbers in an array and do the expect()
's in a loop, but may be harder to debug problems that way ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this as well. The two up sides I saw this way 1) it provides a clear example of how the configuration gets reduced from 100 when errors keep emitting 2) it allowed to add comments to explain some of the inner usage of Math.floor and distinctUntilChanged() as the assertions happened. I could always cut a few assertions out.
@@ -0,0 +1,102 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, new directory integration_tests
? I guess the idea with these is that they actually launch a task manager to operate on, so a little different than our other jest tests. Cool - I could see us adding more tests here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly! There's a few plugins that use the concept of jest integration tests to have something higher level than a unit test yet lower level than an API integration test to make sure it all works together. I agree there's a lot of potential here for future tests.
I realized the test ran by the node scripts/jest
script instead of the node scripts/jest_integration
. I'll look into it.
💚 Build SucceededMetrics [docs]distributable file count
History
To update your PR or re-run it, just comment with: |
…ith a 429 (#75666) * Make task manager maxWorkers and pollInterval observables (#75293) * WIP step 1 * WIP step 2 * Cleanup * Make maxWorkers an observable for the task pool * Cleanup * Fix test failures * Use BehaviorSubject * Add some tests * Make the task manager store emit error events (#75679) * Add errors$ observable to the task store * Add unit tests * Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (#77096) * WIP * Cleanup * Add error count to message * Reset observable values on stop * Add comments * Fix issues when changing configurations * Cleanup code * Cleanup pt2 * Some renames * Fix typecheck * Use observables to manage throughput * Rename class * Switch to createManagedConfiguration * Add some comments * Start unit tests * Add logs * Fix log level * Attempt at adding integration tests * Fix test failures * Fix timer * Revert "Fix timer" This reverts commit 0817e5e. * Use Symbol * Fix merge scan * replace startsWith with a timer that is scheduled to 0 * typo Co-authored-by: Kibana Machine <[email protected]> Co-authored-by: Gidi Meir Morris <[email protected]>
…ith a 429 (elastic#75666) * Make task manager maxWorkers and pollInterval observables (elastic#75293) * WIP step 1 * WIP step 2 * Cleanup * Make maxWorkers an observable for the task pool * Cleanup * Fix test failures * Use BehaviorSubject * Add some tests * Make the task manager store emit error events (elastic#75679) * Add errors$ observable to the task store * Add unit tests * Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (elastic#77096) * WIP * Cleanup * Add error count to message * Reset observable values on stop * Add comments * Fix issues when changing configurations * Cleanup code * Cleanup pt2 * Some renames * Fix typecheck * Use observables to manage throughput * Rename class * Switch to createManagedConfiguration * Add some comments * Start unit tests * Add logs * Fix log level * Attempt at adding integration tests * Fix test failures * Fix timer * Revert "Fix timer" This reverts commit 0817e5e. * Use Symbol * Fix merge scan * replace startsWith with a timer that is scheduled to 0 * typo Co-authored-by: Kibana Machine <[email protected]> Co-authored-by: Gidi Meir Morris <[email protected]>
…ith a 429 (#75666) (#80355) * Make task manager maxWorkers and pollInterval observables (#75293) * WIP step 1 * WIP step 2 * Cleanup * Make maxWorkers an observable for the task pool * Cleanup * Fix test failures * Use BehaviorSubject * Add some tests * Make the task manager store emit error events (#75679) * Add errors$ observable to the task store * Add unit tests * Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (#77096) * WIP * Cleanup * Add error count to message * Reset observable values on stop * Add comments * Fix issues when changing configurations * Cleanup code * Cleanup pt2 * Some renames * Fix typecheck * Use observables to manage throughput * Rename class * Switch to createManagedConfiguration * Add some comments * Start unit tests * Add logs * Fix log level * Attempt at adding integration tests * Fix test failures * Fix timer * Revert "Fix timer" This reverts commit 0817e5e. * Use Symbol * Fix merge scan * replace startsWith with a timer that is scheduled to 0 * typo Co-authored-by: Kibana Machine <[email protected]> Co-authored-by: Gidi Meir Morris <[email protected]> Co-authored-by: Kibana Machine <[email protected]> Co-authored-by: Gidi Meir Morris <[email protected]>
Part of #65553.
Feature branch PR: #75666.
In this PR, I'm connecting two previous PRs (#75293 and #75679) by creating a configuration manager. As described in the proposal, back pressure will be applied as follows:
Once 429 errors are no longer encountered, the system will start going back to normal configuration as follows:
Each time the system starts changing configurations from the normal values, a warning message will be logged. For further insight, a series of debug messages are also logged.