[BUG] use deep copy of bit arrays when getting array node state #5681

pvditt · 2024-08-21T09:03:43Z

Upstreaming https://github.com/unionai/flyte/pull/428

Why are the changes needed?

ArrayNodeState is not supposed to get persisted unless the handle loop completes and successfully calls PutArrayNodeState

However, any SetItem call on a bitarray (ex) that is used to store subnode phases + subtask phases will persist even if the Handle loop does not successfully call PutArrayNodeState at the end of the handle loop.

This is because the BitSet field in a bitarray is a pointer -> SetItem modifies the underlying data with an explicit call to PutArrayNodeState.

This can cause issues as we don't finish handling a subnode at a given phase if there is an error causing a return UnknownTransition after we persist subnode state.

Example issue:

subnode "1" gets set to a terminal node phase
eventing fails
next handle loops has the accidentally updated state so doesn't get re-evaluated at the previous, non-terminal phase so we don't propagate the necessary eventing to set the subnode's correct phases

What changes were proposed in this pull request?

How was this patch tested?

deployed in Union
ran a myriad of workflows, mapping over tasks and launch plans, that had varying levels of successes, failures, and manual aborts.
manually set eventing to fail to ensure that the next loop properly evaluated with the correct phases

Setup process

Screenshots

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

branched off #5680 instead of master to reduce conflicts while upstreaming. Will merge that in first to clear up diff.

Docs link

If there is an error updating a [FlyteWorkflow CRD](https://github.com/unionai/flyte/blob/6a7207c5345604a28a9d4e3699becff767f520f5/flytepropeller/pkg/controller/handler.go#L378), then the propeller streak ends without the CRD getting updated and the in-memory copy of the FlyteWorkflow is not utilized on the next loop. [TaskPhaseVersion](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/apis/flyteworkflow/v1alpha1/node_status.go#L239) is stored in the FlyteWorkflow. This is incremented when there is an update to node/subnode state to ensure that events are unique. If the events stay in the same state and have the same TaskPhaseVersion, then they [get short-circuited and don't get emitted to admin](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/events/admin_eventsink.go#L59) or will get returned as an [AlreadyExists error](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flyteadmin/pkg/manager/impl/task_execution_manager.go#L172) and get [handled in propeller to not bubble up in an error](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/controller/nodes/node_exec_context.go#L38). We can run into issues with ArrayNode eventing when: - array node handler increments task phase version from "0" to "1" - admin event sink emits event with version "1" - the propeller controller is not able to update the FlyteWorkflow CRD, so the ArrayNodeStatus indicates taskPhaseVersion is still 0 - next loop, array node handler increments task phase version from "0" to "1" - admin event sink prevents the event from getting emitted as an event with the same ID has already been received. No error is bubbled up. This means we lose subnode state until there is an event that contains an update to that subnode. If the lost state is the subnode reaching a terminal state, then the subnode state (from admin/UI) is "stuck" in a non-terminal state. I confirmed this to be an issue in the load-test-cluster. Whenever, there was an [error syncing the FlyteWorkflow](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/controller/workers.go#L91), the next round of eventing in ArrayNode would fail unless the ArrayNode phase changed. - added unit test - tested locally in sandbox - test in dogfood - https://buildkite.com/unionai/managed-cluster-staging-sync/builds/4398#01914a1a-f6d6-42a5-b41b-7b6807f27370 - should be fine to rollout to prod Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F). - [x] To be upstreamed to OSS fixes: https://linear.app/unionai/issue/COR-1534/bug-arraynode-shows-non-complete-jobs-in-ui-when-the-job-is-actually * [x] Added tests * [x] Ran a deploy dry run and shared the terraform plan * [ ] Added logging and metrics * [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list) * [ ] Updated documentation Signed-off-by: Paul Dittamo <[email protected]>

* handle already exists error on array node abort Signed-off-by: Paul Dittamo <[email protected]> * update comment Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]>

* set cause for already exists EventError Signed-off-by: Paul Dittamo <[email protected]> * add nil check event error Signed-off-by: Paul Dittamo <[email protected]> * lint Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]>

Signed-off-by: Paul Dittamo <[email protected]>

codecov · 2024-08-21T09:08:02Z

Codecov Report

Attention: Patch coverage is 3.44828% with 28 lines in your changes missing coverage. Please review.

Project coverage is 36.17%. Comparing base (e9413c0) to head (1f94bfe).
Report is 153 commits behind head on master.

Files with missing lines	Patch %	Lines
...ler/pkg/apis/flyteworkflow/v1alpha1/node_status.go	0.00%	13 Missing ⚠️
...opeller/pkg/controller/nodes/node_state_manager.go	0.00%	12 Missing ⚠️
...is/flyteworkflow/v1alpha1/zz_generated.deepcopy.go	0.00%	3 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5681   +/-   ##
=======================================
  Coverage   36.17%   36.17%           
=======================================
  Files        1302     1302           
  Lines      109556   109580   +24     
=======================================
+ Hits        39630    39643   +13     
- Misses      65786    65797   +11     
  Partials     4140     4140

Flag	Coverage Δ
unittests-datacatalog	`51.37% <ø> (ø)`
unittests-flyteadmin	`55.33% <ø> (+0.06%)`	⬆️
unittests-flytecopilot	`12.17% <ø> (ø)`
unittests-flytectl	`62.28% <ø> (ø)`
unittests-flyteidl	`7.08% <ø> (ø)`
unittests-flyteplugins	`53.34% <ø> (ø)`
unittests-flytepropeller	`41.71% <3.44%> (-0.05%)`	⬇️
unittests-flytestdlib	`55.35% <ø> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

pingsutw · 2024-08-21T09:21:42Z

.run/single-binary.run.xml

@@ -1,13 +0,0 @@
-<component name="ProjectRunConfigurationManager">


Did you accidentally delete this file?

oh thanks for catching this. Unsure how I did that. Let me add it back.

This reverts commit dde7595. Signed-off-by: Paul Dittamo <[email protected]>

eapolinario · 2024-08-21T14:33:38Z

flytepropeller/pkg/apis/flyteworkflow/v1alpha1/zz_generated.deepcopy.go

+	if in.ArrayNodeStatus != nil {
+		in, out := &in.ArrayNodeStatus, &out.ArrayNodeStatus
+		*out = (*in).DeepCopy()
+	}


can you document how you generated the change in this file?

I did this manually on our fork when working on an array node feature a couple months ago. I couldn't get the script to run.

eapolinario · 2024-08-21T14:44:51Z

flytepropeller/pkg/controller/config/config.go

@@ -259,6 +259,7 @@ const (
 type EventConfig struct {
 	RawOutputPolicy           RawOutputPolicy `json:"raw-output-policy" pflag:",How output data should be passed along in execution events."`
 	FallbackToOutputReference bool            `json:"fallback-to-output-reference" pflag:",Whether output data should be sent by reference when it is too large to be sent inline in execution events."`
+	ErrorOnAlreadyExists      bool            `json:"error-on-already-exists" pflag:",Whether to return an error when an event already exists."`


worth enabling this in single-binary by default?

This should never really be enabled except for array node. I didn't want to update the record task event function, but I guess that might be better if users were to unknowingly change this and have executions fail.

Actually, I can just remove the ability to set this and keep it in the eventconfig struct

ah darn, I should've made that change here #5680

Signed-off-by: Paul Dittamo <[email protected]>

flytepropeller/pkg/apis/flyteworkflow/v1alpha1/node_status.go

eapolinario · 2024-08-23T18:57:10Z

flytepropeller/pkg/controller/config/config.go

@@ -259,7 +259,8 @@ const (
 type EventConfig struct {
 	RawOutputPolicy           RawOutputPolicy `json:"raw-output-policy" pflag:",How output data should be passed along in execution events."`
 	FallbackToOutputReference bool            `json:"fallback-to-output-reference" pflag:",Whether output data should be sent by reference when it is too large to be sent inline in execution events."`
-	ErrorOnAlreadyExists      bool            `json:"error-on-already-exists" pflag:",Whether to return an error when an event already exists."`
+	// only meant to be overridden for certain node types that have different eventing behavior such as ArrayNode
+	ErrorOnAlreadyExists bool `json:"-"`


I understand the purpose of this now, thanks for the comment. We're adding a field to a central config object and assuming that such field will only be used in a particular situation. This feels hacky to me, but maybe there's a precedent and good reason to follow this pattern.

yup - it definitely is hacky. Looking forward when/if we eventually re-work eventing in propeller

flytepropeller/pkg/apis/flyteworkflow/v1alpha1/node_status.go

* [BUG] add retries to handle array node eventing race condition (#421) If there is an error updating a [FlyteWorkflow CRD](https://github.com/unionai/flyte/blob/6a7207c5345604a28a9d4e3699becff767f520f5/flytepropeller/pkg/controller/handler.go#L378), then the propeller streak ends without the CRD getting updated and the in-memory copy of the FlyteWorkflow is not utilized on the next loop. [TaskPhaseVersion](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/apis/flyteworkflow/v1alpha1/node_status.go#L239) is stored in the FlyteWorkflow. This is incremented when there is an update to node/subnode state to ensure that events are unique. If the events stay in the same state and have the same TaskPhaseVersion, then they [get short-circuited and don't get emitted to admin](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/events/admin_eventsink.go#L59) or will get returned as an [AlreadyExists error](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flyteadmin/pkg/manager/impl/task_execution_manager.go#L172) and get [handled in propeller to not bubble up in an error](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/controller/nodes/node_exec_context.go#L38). We can run into issues with ArrayNode eventing when: - array node handler increments task phase version from "0" to "1" - admin event sink emits event with version "1" - the propeller controller is not able to update the FlyteWorkflow CRD, so the ArrayNodeStatus indicates taskPhaseVersion is still 0 - next loop, array node handler increments task phase version from "0" to "1" - admin event sink prevents the event from getting emitted as an event with the same ID has already been received. No error is bubbled up. This means we lose subnode state until there is an event that contains an update to that subnode. If the lost state is the subnode reaching a terminal state, then the subnode state (from admin/UI) is "stuck" in a non-terminal state. I confirmed this to be an issue in the load-test-cluster. Whenever, there was an [error syncing the FlyteWorkflow](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/controller/workers.go#L91), the next round of eventing in ArrayNode would fail unless the ArrayNode phase changed. - added unit test - tested locally in sandbox - test in dogfood - https://buildkite.com/unionai/managed-cluster-staging-sync/builds/4398#01914a1a-f6d6-42a5-b41b-7b6807f27370 - should be fine to rollout to prod Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F). - [x] To be upstreamed to OSS fixes: https://linear.app/unionai/issue/COR-1534/bug-arraynode-shows-non-complete-jobs-in-ui-when-the-job-is-actually * [x] Added tests * [x] Ran a deploy dry run and shared the terraform plan * [ ] Added logging and metrics * [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list) * [ ] Updated documentation Signed-off-by: Paul Dittamo <[email protected]> * handle already exists error on array node abort (#427) * handle already exists error on array node abort Signed-off-by: Paul Dittamo <[email protected]> * update comment Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> * [BUG] set cause for already exists EventError (#432) * set cause for already exists EventError Signed-off-by: Paul Dittamo <[email protected]> * add nil check event error Signed-off-by: Paul Dittamo <[email protected]> * lint Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> * add deep copy for array node status Signed-off-by: Paul Dittamo <[email protected]> * add deep copy for array node status Signed-off-by: Paul Dittamo <[email protected]> * use deep copy of bit arrays when getting array node state Signed-off-by: Paul Dittamo <[email protected]> * Revert "add deep copy for array node status" This reverts commit dde7595. Signed-off-by: Paul Dittamo <[email protected]> * ignore ErrorOnAlreadyExists when marshalling event config Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> Signed-off-by: pmahindrakar-oss <[email protected]>

…eorg#5681) * [BUG] add retries to handle array node eventing race condition (flyteorg#421) If there is an error updating a [FlyteWorkflow CRD](https://github.com/unionai/flyte/blob/6a7207c5345604a28a9d4e3699becff767f520f5/flytepropeller/pkg/controller/handler.go#L378), then the propeller streak ends without the CRD getting updated and the in-memory copy of the FlyteWorkflow is not utilized on the next loop. [TaskPhaseVersion](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/apis/flyteworkflow/v1alpha1/node_status.go#L239) is stored in the FlyteWorkflow. This is incremented when there is an update to node/subnode state to ensure that events are unique. If the events stay in the same state and have the same TaskPhaseVersion, then they [get short-circuited and don't get emitted to admin](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/events/admin_eventsink.go#L59) or will get returned as an [AlreadyExists error](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flyteadmin/pkg/manager/impl/task_execution_manager.go#L172) and get [handled in propeller to not bubble up in an error](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/controller/nodes/node_exec_context.go#L38). We can run into issues with ArrayNode eventing when: - array node handler increments task phase version from "0" to "1" - admin event sink emits event with version "1" - the propeller controller is not able to update the FlyteWorkflow CRD, so the ArrayNodeStatus indicates taskPhaseVersion is still 0 - next loop, array node handler increments task phase version from "0" to "1" - admin event sink prevents the event from getting emitted as an event with the same ID has already been received. No error is bubbled up. This means we lose subnode state until there is an event that contains an update to that subnode. If the lost state is the subnode reaching a terminal state, then the subnode state (from admin/UI) is "stuck" in a non-terminal state. I confirmed this to be an issue in the load-test-cluster. Whenever, there was an [error syncing the FlyteWorkflow](https://github.com/flyteorg/flyte/blob/37b4e13ac4a3594ac63b7a35058f4b2220e51282/flytepropeller/pkg/controller/workers.go#L91), the next round of eventing in ArrayNode would fail unless the ArrayNode phase changed. - added unit test - tested locally in sandbox - test in dogfood - https://buildkite.com/unionai/managed-cluster-staging-sync/builds/4398#01914a1a-f6d6-42a5-b41b-7b6807f27370 - should be fine to rollout to prod Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F). - [x] To be upstreamed to OSS fixes: https://linear.app/unionai/issue/COR-1534/bug-arraynode-shows-non-complete-jobs-in-ui-when-the-job-is-actually * [x] Added tests * [x] Ran a deploy dry run and shared the terraform plan * [ ] Added logging and metrics * [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list) * [ ] Updated documentation Signed-off-by: Paul Dittamo <[email protected]> * handle already exists error on array node abort (flyteorg#427) * handle already exists error on array node abort Signed-off-by: Paul Dittamo <[email protected]> * update comment Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> * [BUG] set cause for already exists EventError (flyteorg#432) * set cause for already exists EventError Signed-off-by: Paul Dittamo <[email protected]> * add nil check event error Signed-off-by: Paul Dittamo <[email protected]> * lint Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> * add deep copy for array node status Signed-off-by: Paul Dittamo <[email protected]> * add deep copy for array node status Signed-off-by: Paul Dittamo <[email protected]> * use deep copy of bit arrays when getting array node state Signed-off-by: Paul Dittamo <[email protected]> * Revert "add deep copy for array node status" This reverts commit dde7595. Signed-off-by: Paul Dittamo <[email protected]> * ignore ErrorOnAlreadyExists when marshalling event config Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]> Signed-off-by: Bugra Gedik <[email protected]>

pvditt and others added 6 commits August 21, 2024 00:49

handle already exists error on array node abort (#427)

73155cd

* handle already exists error on array node abort Signed-off-by: Paul Dittamo <[email protected]> * update comment Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Paul Dittamo <[email protected]>

add deep copy for array node status

dde7595

Signed-off-by: Paul Dittamo <[email protected]>

add deep copy for array node status

1a898f9

Signed-off-by: Paul Dittamo <[email protected]>

use deep copy of bit arrays when getting array node state

2a20e02

Signed-off-by: Paul Dittamo <[email protected]>

pvditt marked this pull request as ready for review August 21, 2024 09:04

pvditt requested review from wild-endeavor and andrewwdye August 21, 2024 09:10

pingsutw reviewed Aug 21, 2024

View reviewed changes

Revert "add deep copy for array node status"

e3a5952

This reverts commit dde7595. Signed-off-by: Paul Dittamo <[email protected]>

pvditt requested a review from pingsutw August 21, 2024 09:27

eapolinario reviewed Aug 21, 2024

View reviewed changes

pvditt added 2 commits August 21, 2024 14:36

Merge branch 'master' into bug/array-node-persist-state

36d1288

Signed-off-by: Paul Dittamo <[email protected]>

ignore ErrorOnAlreadyExists when marshalling event config

1f94bfe

Signed-off-by: Paul Dittamo <[email protected]>

pvditt requested a review from eapolinario August 21, 2024 22:02

pvditt mentioned this pull request Aug 22, 2024

[BUG] Fix ArrayNode not reporting terminal state in TaskExecutionEvent #5135

Open

2 tasks

eapolinario reviewed Aug 23, 2024

View reviewed changes

pvditt requested a review from eapolinario August 23, 2024 19:04

eapolinario approved these changes Aug 23, 2024

View reviewed changes

flytepropeller/pkg/apis/flyteworkflow/v1alpha1/node_status.go Show resolved Hide resolved

eapolinario merged commit 780f985 into master Aug 23, 2024
49 of 50 checks passed

eapolinario deleted the bug/array-node-persist-state branch August 23, 2024 19:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] use deep copy of bit arrays when getting array node state #5681

[BUG] use deep copy of bit arrays when getting array node state #5681

pvditt commented Aug 21, 2024 •

edited

Loading

codecov bot commented Aug 21, 2024 •

edited

Loading

pingsutw Aug 21, 2024

pvditt Aug 21, 2024

eapolinario Aug 21, 2024

pvditt Aug 21, 2024

eapolinario Aug 21, 2024

pvditt Aug 21, 2024

pvditt Aug 21, 2024 •

edited

Loading

eapolinario Aug 23, 2024

pvditt Aug 23, 2024

		@@ -1,13 +0,0 @@
		<component name="ProjectRunConfigurationManager">

[BUG] use deep copy of bit arrays when getting array node state #5681

[BUG] use deep copy of bit arrays when getting array node state #5681

Conversation

pvditt commented Aug 21, 2024 • edited Loading

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

Related PRs

Docs link

codecov bot commented Aug 21, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pvditt Aug 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pvditt commented Aug 21, 2024 •

edited

Loading

codecov bot commented Aug 21, 2024 •

edited

Loading

pvditt Aug 21, 2024 •

edited

Loading