Option to clear node state on any termination #4596
Signed-off-by: Thomas Newton <[email protected]>
3e55c74 to 625ab3d
There are still a couple of TODO comments that I want to address, but any early feedback would be appreciated. I was rather uncertain about how to put this behind a feature flag.
Codecov Report
Coverage diff, master vs. #4596:
Coverage: 58.98% -> 58.15% (-0.83%)
Files: 621 -> 626 (+5)
Lines: 52483 -> 53790 (+1307)
Hits: 30957 -> 31282 (+325)
Misses: 19059 -> 20000 (+941)
Partials: 2467 -> 2508 (+41)
Signed-off-by: Thomas Newton <[email protected]>
e50ebb8 to 42fb049
… with `clearStateOnAnyTermination=true` Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
5e07651 to 453a69f
Signed-off-by: Thomas Newton <[email protected]>
b812954 to fb9bd90
I'm still not super confident about how I implemented the config option, but I think it's time to get a review on this.
@Tom-Newton this looks pretty good to me. Do you know how big part 2 (i.e. leaving the error when the workflow failure policy is set) is? It feels to me like it makes sense to submit these in the same PR to make the changes easier to track. Thoughts? Also, in the original PRs I thought we needed to keep the error around for one round (i.e. transition node from
Thanks for taking a look
The actual change is pretty small but I haven't written any tests for it yet.
This PR actually doesn't change the error field. I think what you looked at before was the sum of part 1 and part 2. I had always intended to split it like this, but I made a bit of a mess with rebasing. Part 2 just clears the previous error whenever a new one occurs. I'll make the second part a stacked PR on top of this one, and we can decide whether to combine it into this PR before or after this one is merged.
@hamersaw Tom-Newton#6 implements part 2 (apparently I can't create a PR for the upstream that is stacked on a branch from my fork).
@Tom-Newton thanks so much for these! They look great. I'll have to run some tests locally just to confirm everything is working the way I think it should. In the 2nd PR here, is there a reason we only strip the error if the workflow failure policy says to keep executing? It seems to me that we can always delete the error after it's been reported to flyteadmin, right? In the meantime, I was thinking that leaving the additional metadata in the CRD is really only useful for manual debugging, which very few users do. I'm not sure fine-grained control over specific metadata is necessary (i.e. having separate flags for deletion on all terminal states and for removing errors after they have been reported). Do you think it makes sense to put both of these things behind a flag like
Good question. I don't think it makes a lot of difference but that would provide slightly smaller workflow state. I did actually try that initially but I couldn't find the right place to clear the error to make sure it was after the last usage. Personally I'm quite happy with the current solution because I think it's relatively easy to understand.
I'm not sure. While it's very likely that users of one of these options will also want to use the other, I also like to have lots of granularity in configuration. There may be more options similar to this one that are more controversial, in which case some users will want more granular control than a single config option. Do you have any thoughts on whether you want to review and merge both parts in one PR? Personally, I would go for two consecutive PRs, as it is now.
865f63f to 05208ec
We agreed to rename the config option to |
Signed-off-by: Thomas Newton <[email protected]>
05208ec to 64d7fa2
Tracking issue
Implements the first part of #4569
Why are the changes needed?
Reduce the amount of unneeded information stored in etcd. This allows Flyte to scale to larger workflows before hitting etcd size limits.
What changes were proposed in this pull request?
node-config.enable-cr-debug-metadata
NodeStatus are cleared except for Phase, StoppedAt and Error.
NodeStatus which therefore needs to be written to etcd.
TestNodeStatus_UpdatePhase -> non-terminal-timing-out to fix a code coverage failure.
How was this patch tested?
Updated the unit tests.
We have also been using very similar changes in our production Flyte deployment for a few weeks.
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link