Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear past errors from workflow state #4624

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Dec 19, 2023

Tracking issue

#4569

Why are the changes needed?

Reduce un-needed information stored in etcd when using failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE. This allows flyte to scale to larger workflows before hitting etcd size limits.

What changes were proposed in this pull request?

  • Clears old errors whenever new ones occur
    • This is controlled by the node-config.enable-cr-debug-metadata config option. Set this to true to restore the previous behaviour.
    • Behaviour is only changed when running workflows with FAIL_AFTER_EXECUTABLE_NODES_COMPLETE. Without this the workflow will fail as soon as there is one failure so there can never be more than one error regardless of this PR.
  • Update docstring for enable-cr-debug-metadata config option.
  • Update TestWorkflowExecutor_HandleFlyteWorkflow_Failing to have sub test cases for combinations of FAIL_AFTER_EXECUTABLE_NODES_COMPLETE and enable-cr-debug-metadata

How was this patch tested?

Updated unittests
I have been running something very similar to this in our prod deployment for some time.

Setup process

Screenshots

Check all the applicable boxes

Related PRs

Follow up to #4596

Docs link

@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from 62f6154 to 2b76a91 Compare December 21, 2023 19:32
Signed-off-by: Thomas Newton <[email protected]>
This reverts commit dab428d.

Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from 2b76a91 to 216e1a3 Compare January 16, 2024 20:42
@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from 852f527 to 6c5650c Compare January 16, 2024 20:56
@Tom-Newton Tom-Newton changed the title Option to clear past errors from workflow state Clear past errors from workflow state Jan 17, 2024
Signed-off-by: Thomas Newton <[email protected]>
@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from f7c5f80 to 2cea075 Compare January 17, 2024 09:57
@Tom-Newton Tom-Newton marked this pull request as ready for review January 17, 2024 09:59
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. documentation Improvements or additions to documentation enhancement New feature or request labels Jan 17, 2024
Copy link

codecov bot commented Jan 17, 2024

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (0c8dc61) 58.20% compared to head (2cea075) 58.07%.
Report is 1 commits behind head on master.

Files Patch % Lines
flytepropeller/pkg/controller/nodes/executor.go 11.11% 8 Missing ⚠️
...ler/pkg/apis/flyteworkflow/v1alpha1/node_status.go 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4624      +/-   ##
==========================================
- Coverage   58.20%   58.07%   -0.14%     
==========================================
  Files         626      476     -150     
  Lines       53800    38119   -15681     
==========================================
- Hits        31316    22138    -9178     
+ Misses      19976    14056    -5920     
+ Partials     2508     1925     -583     
Flag Coverage Δ
unittests 58.26% <23.07%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Tom-Newton
Copy link
Contributor Author

It looks like I'm missing a little bit of code coverage. I'll try to find some time to fix that.

Copy link
Contributor

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this logic the goal is to delete all the errors but the last one right? I'm trying to wrap my head around the logic of iterating over downstream nodes but need to dive deeper. Is there determinism in the ordering? Or is there a scenario here where we delete all of the error messages? For example, if we have two nodes (n0 and n1) if the first time we iterate over these the order is n0, n1 then we clear the error from n0 if the second time we iterate n1, n0 then we clear the error from n0 and just cleared all of our errors.

@@ -298,6 +299,10 @@ func (c *recursiveNodeExecutor) handleDownstream(ctx context.Context, execContex
// If the failure policy allows other nodes to continue running, do not exit the loop,
// Keep track of the last failed state in the loop since it'll be the one to return.
// TODO: If multiple nodes fail (which this mode allows), consolidate/summarize failure states in one.
if executableNodeStatusOnComplete != nil {
c.nodeExecutor.Clear(executableNodeStatusOnComplete)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to add the enableCRDebugMetadata bool argument here to the ClearExecutionError function on the MutableNodeStatus interface. Then this call more closely reflects the UpdatePhase call above and we can remove adding a Clear function to the nodeExecutor struct and similarly the NodeExecutor interface. Thoughts?

Copy link
Contributor Author

@Tom-Newton Tom-Newton Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never used golang before I started using flyte, so I have little opinion on how the interfaces are organised. I will try to implement it as you suggested.

@Tom-Newton
Copy link
Contributor Author

If I understand this logic the goal is to delete all the errors but the last one right? I'm trying to wrap my head around the logic of iterating over downstream nodes but need to dive deeper. Is there determinism in the ordering? Or is there a scenario here where we delete all of the error messages? For example, if we have two nodes (n0 and n1) if the first time we iterate over these the order is n0, n1 then we clear the error from n0 if the second time we iterate n1, n0 then we clear the error from n0 and just cleared all of our errors.

That's a good question. I guess I assumed that stateOnComplete would only get updated when there are new errors and not get randomly set to a different error on each round. I think that assumption is correct given that we've been using this in prod and nobody has noticed any problems, but I will need to read the code again to understand how that is achieved.

@Tom-Newton
Copy link
Contributor Author

Unfortunately I've been very distracted from this recently but I do plan to come back to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants