fault tolerance: need error propagation analysis #618
Comments
There are several call sites where return codes are not checked: |
There are a few error paths where |
From #679:
EDIT: we should also decide how we want to recover from the above failure. Since technically the allocation for the job has already been made in |
Here is an additional problem: our DFU traverser concatenates one or more error strings to its
ahn1@5b12c7ea7263:/usr/src$ python3 t/scripts/flux-ion-resource.py find status=adown
2020-07-15T17:03:56.268829Z sched-fluxion-resource.err[0]: run_find: find: invalid criteria: status=adown.
2020-07-15T17:03:56.268864Z sched-fluxion-resource.err[0]: : Invalid argument
Look at the extra colon in front of "Invalid argument". We will need a way for the upper layer to iterate over each error string so that it can print them properly. This may also give us a better way to resolve one of the pending issues: #409. |
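A minimal sketch of the "iterate over each error string" idea, written in Python for brevity. Everything here is hypothetical: `ErrorLog`, `render`, and the `run_find` prefix are illustrative names, not the traverser's actual API. The point is only that if the lower layer keeps its messages as separate entries instead of one concatenated string, the upper layer can print one well-formed line per message and never emit a dangling separator.

```python
import errno
import os
from typing import Iterator, List


class ErrorLog:
    """Hypothetical container that keeps each error message as its own entry."""

    def __init__(self) -> None:
        self._messages: List[str] = []

    def add(self, message: str) -> None:
        # Drop surrounding whitespace and skip empty messages, so an empty
        # entry can never turn into a line that starts with a bare ": ".
        message = message.strip()
        if message:
            self._messages.append(message)

    def __iter__(self) -> Iterator[str]:
        return iter(self._messages)


def render(prefix: str, log: ErrorLog, code: int) -> List[str]:
    """Return one printable line per message, followed by the errno text."""
    lines = [f"{prefix}: {message}" for message in log]
    lines.append(f"{prefix}: {os.strerror(code)}")
    return lines


if __name__ == "__main__":
    log = ErrorLog()
    log.add("find: invalid criteria: status=adown.\n")
    log.add("")  # ignored: would otherwise print as an empty message
    for line in render("run_find", log, errno.EINVAL):
        print(line)
```

With this shape, the caller decides how to join or prefix the messages, which is the part the concatenated string makes hard.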
I want to spend a bit more time on this. Targeting the Sep release. |
This will likely be broken into multiple issues, but I wanted to open this to remember these important items as we work on "stabilization" tasks towards a tape out. Within resource and qmanager, there are some RPCs that leave the internal state inconsistent when a failure occurs. We need to analyze this more closely and define clearer error-handling semantics.
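One possible shape for "clearer error-handling semantics" is to make each RPC handler all-or-nothing: snapshot the state it may touch, and restore the snapshot if any step fails. The sketch below is in Python and is only an illustration under assumptions; `SchedulerState`, `handle_rpc`, and the two sample actions are hypothetical and do not correspond to real resource or qmanager code.

```python
import copy
from typing import Callable, Dict


class SchedulerState:
    """Hypothetical stand-in for the module state an RPC handler mutates."""

    def __init__(self) -> None:
        self.allocations: Dict[int, str] = {}


def handle_rpc(state: SchedulerState,
               action: Callable[[SchedulerState], None]) -> bool:
    """Run `action`; if it raises, restore the state as it was before the call."""
    snapshot = copy.deepcopy(state.allocations)
    try:
        action(state)
    except Exception:
        # Broad catch is deliberate in this sketch: any failure rolls back.
        state.allocations = snapshot
        return False
    return True


def allocate_then_fail(state: SchedulerState) -> None:
    """Sample action: records an allocation, then fails at a later step."""
    state.allocations[42] = "node0"
    raise RuntimeError("later step failed")


def allocate_ok(state: SchedulerState) -> None:
    """Sample action that completes without error."""
    state.allocations[7] = "node1"


if __name__ == "__main__":
    state = SchedulerState()
    print(handle_rpc(state, allocate_then_fail), state.allocations)
    print(handle_rpc(state, allocate_ok), state.allocations)
```

A full snapshot is the simplest way to show the idea; real code would more likely undo only the specific allocation it made, since copying the whole state on every RPC could be expensive.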