-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dws: saving workflows to KVS #183
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Problem: in tests, coral2_dws.py runs as a Flux job, meaning FLUX_KVS_NAMESPACE is set. However, this makes KVS lookups look relative to that namespace, which is undesirable. Lookups would behave differently in production when coral2_dws runs as a systemd service. Remove FLUX_KVS_NAMESPACE from the environment.
Problem: HPE and others working with k8s workflows have been frustrated that k8s resources disappear shortly after their associated flux job moves to cleanup, and have requested that Flux save some of the resources. Save workflow resources to jobs' KVS. More resources may be needed in the future, but start small to reduce KVS pressure. Fix some whitespace/style errors in the same file.
Problem: there are no tests to ensure that workflows are written out to jobs' kvs properly. Add tests.
jameshcorbett
commented
Jul 17, 2024
payload={"id": jobid, "memo": {"rabbits": Hostlist(nodes_per_nnf.keys()).encode()}}, | ||
payload={ | ||
"id": jobid, | ||
"memo": {"rabbits": Hostlist(nodes_per_nnf.keys()).encode()}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated whitespace change, sorry
Problem: now that workflows are saved to the KVS, there is no need to log them out, which only clutters the journal. Stop logging them.
Problem: as in flux-framework#169, for most states, raising an exception should be enough to trigger other logic that eventually moves the workflow to Teardown. However, if the workflow is in PostRun or DataOut, the exception won't affect the dws-epilog action holding the job, so the workflow should be moved to Teardown immediately. Move workflows that are stuck in TransientCondition in DataOut or PostRun to Teardown immediately.
Problem: setting the log level to DEBUG makes the k8s library very verbose, which is fine sometimes but too much for some use cases. However, the INFO level is very open. Change some log messages to INFO from WARNING and DEBUG.
Problem: a single -v flag has no effect on coral2_dws, because it checks for > 1 verbose flags when it should be > 0 or >= 1. Change the logic to be > 0.
grondo
approved these changes
Jul 17, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I was able to review this, LGTM!
Thanks! Setting MWP. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem: HPE and others working with k8s workflows have been
frustrated that k8s resources disappear shortly after their
associated flux job moves to cleanup, and have requested that
Flux save some of the resources.
Save workflow resources to jobs' KVS. More resources may be needed
in the future, but start small to reduce KVS pressure.
See also #180 .