Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dws: saving workflows to KVS #183

Merged
merged 7 commits into from
Jul 17, 2024

Conversation

jameshcorbett
Copy link
Member

Problem: HPE and others working with k8s workflows have been
frustrated that k8s resources disappear shortly after their
associated flux job moves to cleanup, and have requested that
Flux save some of the resources.

Save workflow resources to jobs' KVS. More resources may be needed
in the future, but start small to reduce KVS pressure.

See also #180 .

Problem: in tests, coral2_dws.py runs as a Flux job, meaning
FLUX_KVS_NAMESPACE is set. However, this makes KVS lookups look
relative to that namespace, which is undesirable. Lookups would
behave differently in production when coral2_dws runs as a systemd
service.

Remove FLUX_KVS_NAMESPACE from the environment.
Problem: HPE and others working with k8s workflows have been
frustrated that k8s resources disappear shortly after their
associated flux job moves to cleanup, and have requested that
Flux save some of the resources.

Save workflow resources to jobs' KVS. More resources may be needed
in the future, but start small to reduce KVS pressure.

Fix some whitespace/style errors in the same file.
Problem: there are no tests to ensure that workflows are written
out to jobs' kvs properly.

Add tests.
@jameshcorbett jameshcorbett requested a review from grondo July 17, 2024 00:06
payload={"id": jobid, "memo": {"rabbits": Hostlist(nodes_per_nnf.keys()).encode()}},
payload={
"id": jobid,
"memo": {"rabbits": Hostlist(nodes_per_nnf.keys()).encode()},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated whitespace change, sorry

Problem: now that workflows are saved to the KVS, there is no need
to log them out, which only clutters the journal.

Stop logging them.
Problem: as in flux-framework#169, for most states, raising an exception should
be enough to trigger other logic that eventually moves the workflow
to Teardown. However, if the workflow is in PostRun or DataOut, the
exception won't affect the dws-epilog action holding the job, so
the workflow should be moved to Teardown immediately.

Move workflows that are stuck in TransientCondition in DataOut or
PostRun to Teardown immediately.
Problem: setting the log level to DEBUG makes the k8s library
very verbose, which is fine sometimes but too much for some use
cases. However, the INFO level is very open.

Change some log messages to INFO from WARNING and DEBUG.
Problem: a single -v flag has no effect on coral2_dws, because it
checks for > 1 verbose flags when it should be > 0 or >= 1.

Change the logic to be > 0.
Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I was able to review this, LGTM!

@jameshcorbett
Copy link
Member Author

Thanks! Setting MWP.

@mergify mergify bot merged commit d1f5c17 into flux-framework:master Jul 17, 2024
8 checks passed
@jameshcorbett jameshcorbett deleted the kvs-workflows branch July 17, 2024 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants