
[Bug] Update resource failures w/ Finalizers set (#423) #5673

Merged: 1 commit merged into master from add-backoff-clearing-finalizers on Aug 21, 2024

Conversation

@pvditt (Contributor) commented Aug 20, 2024

Upstreamed from https://github.com/unionai/flyte/pull/423

Overview

When the informer cache has stale values, we cannot update the k8s resource when clearing finalizers and get Error: Operation cannot be fulfilled on pods. The current implementation bubbles the error up, resulting in a system retry. By the next loop, the informer cache is up to date and the update can be applied. However, in an ArrayNode with many subnodes executing in parallel, the execution can easily run out of retries.

What changes were proposed in this pull request?

This update adds a basic retry with exponential backoff to wait for the informer cache to get up to date.
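For illustration, a conflict-aware retry of this kind could be sketched with client-go's retry helper. This is not the actual plugin_manager.go change; the function name and client wiring below are assumptions. It only shows the pattern of re-reading the object and retrying the update with exponential backoff when the API server rejects a stale resourceVersion.

```go
// Hypothetical sketch only; not the code merged in this PR.
package finalizers

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// clearFinalizersWithRetry removes all finalizers from a pod, retrying on
// 409 Conflict errors with client-go's default exponential backoff.
func clearFinalizersWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		// Re-read the latest object from the API server on each attempt so the
		// update targets a fresh resourceVersion instead of the possibly stale
		// informer-cache copy.
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		pod.SetFinalizers(nil)
		_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```

If the update still conflicts after the backoff is exhausted, the error would surface to the caller as before, so only transient staleness is absorbed.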

How was this patch tested?

  • Deployed in Union
  • Confirmed finalizers were being cleared, retries were happening, and errors were not bubbling up as system errors

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

## Overview
When the [informer cache has stale values](https://github.com/unionai/flyte/blob/1e82352dd95f89630e333fe6105d5fdb5487a24e/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go#L478), we cannot update the k8s resource when [clearing finalizers](https://github.com/unionai/flyte/blob/1e82352dd95f89630e333fe6105d5fdb5487a24e/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go#L450) and get `Error: Operation cannot be fulfilled on pods.` The current implementation bubbles the error up, resulting in a system retry. By the next loop, the informer cache is up to date and the update can be applied. However, in an ArrayNode with many subnodes executing in parallel, the execution can easily run out of retries.

This update adds a basic retry with exponential backoff to wait for the informer cache to get up to date.
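As a rough illustration of what a "basic retry with exponential backoff" can look like (the durations, step count, and function names below are assumptions, not values taken from this PR), the loop retries only on conflict errors and lets every other error bubble up immediately:

```go
// Illustrative sketch; parameters are assumed, not taken from this PR.
package finalizers

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

// retryOnConflict runs update with exponential backoff, retrying only when the
// API server reports a 409 Conflict (i.e. the cached object was stale).
func retryOnConflict(update func() error) error {
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond, // initial delay (assumed)
		Factor:   2.0,                    // double the delay on each attempt
		Jitter:   0.1,
		Steps:    5, // give the informer cache a few sync cycles to catch up
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		err := update()
		switch {
		case err == nil:
			return true, nil // update applied; stop retrying
		case apierrors.IsConflict(err):
			return false, nil // stale cache; back off and try again
		default:
			return false, err // any other error still bubbles up immediately
		}
	})
}
```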

## Test Plan
Ran in dogfood-gcp:
- https://buildkite.com/unionai/managed-cluster-staging-sync/builds/4622 + manually updated the configmap to enable finalizers
- Run without the change (https://dogfood-gcp.cloud-staging.union.ai/console/projects/flytesnacks/domains/development/executions/fd16ac81fd7b5480fb6f/nodes)
- Run with the change (https://dogfood-gcp.cloud-staging.union.ai/console/projects/flytesnacks/domains/development/executions/f016a3be7fa304db5a77/nodeId/n0/nodes)

Confirmed in the logs that conflict errors occurred:
```
{"json":{"exec_id":"f016a3be7fa304db5a77","node":"n0/n42","ns":"development","res_ver":"146129599","routine":"worker-66","src":"plugin_manager.go:455","wf":"flytesnacks:development:tests.flytekit.integration.map_task_issue.wf8"},"level":"warning","msg":"Failed to clear finalizers for Resource with name: development/f016a3be7fa304db5a77-n0-0-n42-0. Error: Operation cannot be fulfilled on pods \"f016a3be7fa304db5a77-n0-0-n42-0\": the object has been modified; please apply your changes to the latest version and try again","ts":"2024-08-17T02:02:48Z"}

```
The conflict errors did not bubble up, and the finalizers were confirmed removed:

```
➜  ~ k get pods -n development f016a3be7fa304db5a77-n0-0-n42-0 -o json | grep -i final
INFO[0000] [0] Couldn't find a config file []. Relying on env vars and pflags.
➜  ~
```

## Rollout Plan (if applicable)
- Revert the change to the customer's config that disabled finalizers

## Upstream Changes
Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).
- [x] To be upstreamed to OSS

## Issue
fixes: https://linear.app/unionai/issue/COR-1558/investigate-why-finalizers-consume-system-retries-in-map-tasks

## Checklist
* [ ] Added tests
* [x] Ran a deploy dry run and shared the terraform plan
* [ ] Added logging and metrics
* [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list)
* [ ] Updated documentation

Signed-off-by: Paul Dittamo <[email protected]>

codecov bot commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 48.64865% with 19 lines in your changes missing coverage. Please review.

Project coverage is 36.17%. Comparing base (d797f08) to head (d374648).
Report is 153 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...er/pkg/controller/nodes/task/k8s/plugin_manager.go | 45.71% | 16 Missing and 3 Partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5673       +/-   ##
===========================================
+ Coverage    9.74%   36.17%   +26.42%     
===========================================
  Files         214     1302     +1088     
  Lines       39190   109513    +70323     
===========================================
+ Hits         3820    39611    +35791     
- Misses      35031    65765    +30734     
- Partials      339     4137     +3798     
| Flag | Coverage Δ |
|---|---|
| unittests-datacatalog | 51.37% <ø> (ø) |
| unittests-flyteadmin | 55.33% <ø> (?) |
| unittests-flytecopilot | 12.17% <ø> (ø) |
| unittests-flytectl | 62.28% <ø> (?) |
| unittests-flyteidl | 7.08% <ø> (ø) |
| unittests-flyteplugins | 53.32% <100.00%> (?) |
| unittests-flytepropeller | 41.72% <45.71%> (?) |
| unittests-flytestdlib | 55.35% <ø> (?) |

Flags with carried forward coverage won't be shown.
@eapolinario merged commit 7866c31 into master on Aug 21, 2024 — 50 checks passed
@eapolinario deleted the add-backoff-clearing-finalizers branch on August 21, 2024
pmahindrakar-oss pushed a commit that referenced this pull request Sep 9, 2024
bgedik pushed a commit to bgedik/flyte that referenced this pull request Sep 12, 2024
…eorg#5673)
