-
Notifications
You must be signed in to change notification settings - Fork 669
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix fasttask backlog timeouts (#353)
## Overview The PR addresses two separate issues (1) the fasttask plugin can assign tasks to a worker that is at full capacity (parallelism + backlog_length). The worker than transparently drops these tasks and the plugin failsover to another worker and (2) the fasttask plugin has no notion of `PhaseVersion` which FlytePropeller uses to determine if any updates have occurred and consequently it needs to store state to etcd. This means that the `LastAccessedAt` field on the fasttask plugin state increment will never be persisted. Therefore, all backlogged tasks will fail after the grace period occurs regardless of whether updates are sent by the worker heartbeat or not. The former is addressed by making the worker backlog_length a suggestion, similar to how `max-parallelism` is applied within FlytePropeller. That is, the fasttask plugin will attempt to only assign tasks to full worker capacity (ie. parallelism + backlog_length), but if it assigns more (race condition) then the worker will backlog them. The latter is fixed by adding a `PhaseVersion` field on the fasttask plugin state that is incremented with each worker task status heartbeat. ## Test Plan This has been tested locally against a variety of backlog scenarios (ex. differing lengths, timeouts, etc). ## Rollout Plan (if applicable) This may be rolled out to all tenants immediately. ## Upstream Changes Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F). - [ ] To be upstreamed to OSS ## Issue https://linear.app/unionai/issue/COR-1455/execution-frequently-fails-due-to-missing-task-status-reporting ## Checklist * [x] Added tests * [ ] Ran a deploy dry run and shared the terraform plan * [x] Added logging and metrics * [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list) * [ ] Updated documentation
- Loading branch information
Showing
5 changed files
with
46 additions
and
36 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters