-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Task Manager] time out work when it overruns in poller #74980
[Task Manager] time out work when it overruns in poller #74980
Conversation
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine merge upstream |
💚 Build SucceededBuild metrics
History
To update your PR or re-run it, just comment with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but left some questions, mainly about potential timing issues
// The `work` phase includes the prework needed *before* executing a task | ||
// (such as polling for new work, marking tasks as running etc.) but does not | ||
// include the time of actually running the task | ||
workTimeout: opts.config.poll_interval * opts.config.max_poll_inactivity_cycles, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why you used a "count" as the config bit here, instead of a duration - my guess is the "count" makes this setting "move" if the poll interval changes, which I think is a nice pattern - I'd been thinking some of our other explicit timeouts could be "relative" like this as well - eg, throttle based on alert interval (eg, throttle for 10 intervals, not 10 minutes, or probably allow both).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, Mike asked the same. :)
It's exactly that - to make it relative to the polling interval.
It also reduces the scenarios you need to support, such as a work duration that's lower than polling interval, and the potential complexity that introduces.
pollRequests$.next(some(three)); | ||
|
||
advance(workTimeout); | ||
await sleep(workTimeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we account for a bit of slop here? Seems like it shouldn't be possible for this to resolve before the workTimeout, but you know ... node, time, etc, heh. Maybe just add 100ms or so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to start here and address if it does introduce flakiness.
In this case it's using fake timers, which I've found to be quite accurate, because it isn't actually time based, but rather is based on the queued timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya, I'm guessing even if we did see some timing flakiness, it would be fairly obvious that was the problem, given the context, so WORKSFORME
await promiseResult<H, Error>( | ||
timeoutPromiseAfter<H, Error>( | ||
work(...pullFromSet(set, getCapacity())), | ||
workTimeout ?? pollInterval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the case for workTimeout
being undefined
? Tests? I worry that if there a path in production code to have this as undefined
, the pollInterval
value used would be too close to a valid cycle - should it at least be some kind of multiple of pollInterval
, maybe like 2 or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm it was really just a default... but yeah, you're right that if someone sets the interval to be lower than the latency of talking to Elasticsearch this will behave badly 🤔
That said... same is true for the actual polling....
Not sure about the * 2
approach, as we have no idea what the interval will be and it might be set to something ridiculous like 30m, in which case, timeout will be 60m....
That said... if your poll is 30m you don't care about immediate results anyway, do you 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the default config is set to 10
I think we can leave it as is for now and see how it behaves.
Unless someone specifically changes it, it'll always be far longer than the interval itself, and as this is here as a safeguard against issues in marking tasks as running (this time does not include task execution) I feel comfortable waiting to see how it behaves in the real world.
If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work. This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
* master: Skip failing test in CI (elastic#75266) [Task Manager] time out work when it overruns in poller (elastic#74980) [Drilldowns] misc improvements & fixes (elastic#75276) Small README note on bumping memory for builds (elastic#75247) [Security Solution][Detections] Adds exception modal tests (elastic#74596) [Dashboard] Sample data link does not work (elastic#75262) [Dashboard First] Unlink from Library Action With ReferenceOrValueEmbeddable (elastic#74905) [Form lib] Fix issue where serializer on fields are called on every change (elastic#75166) convert processor labels to sentence case (elastic#75278) [Monaco] Refactor the way XJSON grammar checker gets registered (elastic#75160) Clarify no documents error message when filtering by is_training (elastic#75227) [Lens] Fix crash when two layers xychart switches to pie (elastic#75267) [Observability Homepage] Fix console error because of side effect (elastic#75258) [Usage Collection] Add `legacy=true` option to the /api/stats request in the docs (elastic#75146) [ML] Functional tests - re-activate DFA test suites (elastic#75257) GS providers improvements (elastic#75174) [Visualize] First version of by-value visualize editor (elastic#72256)
…emove-header * saved-objects/version-on-create: (59 commits) remove version when loading sample data omit version from SO import/export Skip failing test in CI (elastic#75266) [Task Manager] time out work when it overruns in poller (elastic#74980) [Drilldowns] misc improvements & fixes (elastic#75276) Small README note on bumping memory for builds (elastic#75247) [Security Solution][Detections] Adds exception modal tests (elastic#74596) Revert "Revert "added missing core docs"" Revert "Revert "added version to saved object bulk creation"" [Dashboard] Sample data link does not work (elastic#75262) [Dashboard First] Unlink from Library Action With ReferenceOrValueEmbeddable (elastic#74905) [Form lib] Fix issue where serializer on fields are called on every change (elastic#75166) convert processor labels to sentence case (elastic#75278) [Monaco] Refactor the way XJSON grammar checker gets registered (elastic#75160) Clarify no documents error message when filtering by is_training (elastic#75227) [Lens] Fix crash when two layers xychart switches to pie (elastic#75267) [Observability Homepage] Fix console error because of side effect (elastic#75258) [Usage Collection] Add `legacy=true` option to the /api/stats request in the docs (elastic#75146) [ML] Functional tests - re-activate DFA test suites (elastic#75257) GS providers improvements (elastic#75174) ...
) If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work. This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
* master: (112 commits) [Ingest Manager] Fix agent config rollout rate limit to use constants (elastic#75364) Update Node.js to version 10.22.0 (elastic#75254) [ML] Anomaly Explorer / Single Metric Viewer: Fix error reporting for annotations. (elastic#74953) [Discover] Fix histogram cloud tests (elastic#75268) Uiactions to navigate to visualize or maps (elastic#74121) Use prefix search invis editor field/agg combo box (elastic#75290) Fix docs in trigger alerting UI (elastic#75363) [SIEM] Fixes search bar Cypress test (elastic#74833) Add libnss3.so to Dockerfile template (reporting) (elastic#75370) [Discover] Create field_button and add popovers to sidebar (elastic#73226) [Reporting] Network Policy: Do not throw from the intercept handler (elastic#75105) [Reporting] Increase capture.timeouts.openUrl to 1 minute (elastic#75207) Allow routes to specify the idle socket timeout in addition to the payload timeout (elastic#73730) [src/dev/build] remove node-version from snapshots (elastic#75303) [ENDPOINT] Reintroduced tabs to endpoint management and migrated pages to use common security components (elastic#74886) [Canvas] Remove dependency on legacy expressions APIs (elastic#74885) Skip failing test in CI (elastic#75266) [Task Manager] time out work when it overruns in poller (elastic#74980) [Drilldowns] misc improvements & fixes (elastic#75276) Small README note on bumping memory for builds (elastic#75247) ...
Summary
This PR addresses a potential failure case discovered during the investigation of this issue: #74785
If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work.
This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
Checklist
Delete any items that are not applicable to this PR.
For maintainers