fault tolerance: need error propagation analysis #618

Open
dongahn opened this issue Mar 7, 2020 · 5 comments

Comments

@dongahn
Member

dongahn commented Mar 7, 2020

This will likely be broken into multiple issues, but I wanted to open it to keep track of these important items as we work on "stabilization" tasks towards a tape out. Within resource and qmanager, some RPCs leave internal state inconsistent when a failure occurs. We need to analyze this more closely and define clearer error-handling semantics.

@dongahn
Member Author

dongahn commented Jun 11, 2020

There are several call sites where return codes are not checked:

https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L461
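
One way to make unchecked return codes harder to miss is sketched below. This is not the code at the linked call site: `update_state` and `apply_all` are hypothetical names used only for illustration, and the sketch assumes a C++17 compiler so that `[[nodiscard]]` is available.

```cpp
#include <cerrno>
#include <vector>

// Hypothetical helper that can fail. [[nodiscard]] (C++17) asks the compiler
// to warn whenever a caller drops the return value on the floor.
[[nodiscard]] inline int update_state (std::vector<int> &state, int value)
{
    if (value < 0) {
        errno = EINVAL;
        return -1;
    }
    state.push_back (value);
    return 0;
}

// Hypothetical caller: every return code is checked, and the first failure
// is propagated upward with errno left as set by update_state ().
[[nodiscard]] inline int apply_all (std::vector<int> &state,
                                    const std::vector<int> &values)
{
    for (int v : values) {
        if (update_state (state, v) < 0)
            return -1;
    }
    return 0;
}
```

Note that `apply_all` stops at the first failure but does not undo the updates already applied, which is exactly the kind of partially updated state this issue is concerned with.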

@dongahn
Member Author

dongahn commented Jun 11, 2020

There are a few error paths where errno is not preserved. We need to save and restore errno around library calls (e.g., json_decref) that are made on the error paths.

https://github.com/flux-framework/flux-sched/blob/master/resource/writers/match_writers.cpp#L261
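
A minimal sketch of the save-and-restore pattern, assuming an RAII guard is acceptable in this code base. `errno_guard`, `noisy_cleanup`, and `emit_or_fail` are hypothetical names; `noisy_cleanup` only stands in for a cleanup call such as json_decref that may change errno as a side effect.

```cpp
#include <cerrno>

// RAII guard: records errno on construction and restores it on destruction,
// so cleanup calls made on an error path cannot clobber the original error.
class errno_guard {
public:
    errno_guard () noexcept : m_saved (errno) {}
    ~errno_guard () noexcept { errno = m_saved; }
    errno_guard (const errno_guard &) = delete;
    errno_guard &operator= (const errno_guard &) = delete;
private:
    int m_saved;
};

// Hypothetical stand-in for a cleanup call that overwrites errno.
inline void noisy_cleanup ()
{
    errno = EIO;
}

// Hypothetical emit function: fails with EINVAL, then cleans up.
inline int emit_or_fail ()
{
    errno = EINVAL;          // the error the caller should see
    {
        errno_guard guard;   // saves EINVAL
        noisy_cleanup ();    // overwrites errno with EIO
    }                        // guard restores EINVAL here
    return -1;
}
```

The same effect can be had with an explicit `int saved_errno = errno; ...; errno = saved_errno;` pair around the cleanup call; the guard just makes it harder to forget the restore on paths with several exits.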

@SteVwonder
Member

SteVwonder commented Jul 8, 2020

From #679:

There are some error paths within the sched-fluxion-resource module that let errors pass silently and respond to the request RPC with a successful response. In particular, this happens when the writer fails to emit properly. We should make these errors loud and respond to the RPC with an error.

EDIT: we should also decide how we want to recover from the above failure, since technically the allocation for the job has already been made in fluxion-resource. Do we want to roll back the allocation automatically, or let the requesting client handle the cancellation/rollback?
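
To make the first option concrete, here is a rough sketch of an automatic rollback when the emit step fails. It is a toy model, not the fluxion-resource code: `alloc_state`, `emit`, and `match_allocate` are hypothetical, and the allocation is reduced to a set of job IDs.

```cpp
#include <cerrno>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical, simplified module state: the job IDs that hold an allocation.
struct alloc_state {
    std::set<std::int64_t> allocated;
};

// Hypothetical writer step; writer_ok simulates whether emitting succeeds.
inline int emit (bool writer_ok, std::string &out)
{
    if (!writer_ok) {
        errno = EINVAL;
        return -1;
    }
    out = "R";
    return 0;
}

// Allocate, then emit. If emit fails, undo the allocation and report the
// failure instead of returning success to the caller.
inline int match_allocate (alloc_state &st, std::int64_t jobid,
                           bool writer_ok, std::string &out)
{
    if (!st.allocated.insert (jobid).second) {
        errno = EEXIST;      // job already holds an allocation
        return -1;
    }
    if (emit (writer_ok, out) < 0) {
        int saved_errno = errno;
        st.allocated.erase (jobid);   // roll back the allocation
        errno = saved_errno;          // keep the writer's error for the caller
        return -1;
    }
    return 0;
}
```

On a -1 return, the RPC handler would then send an error response (in flux-core, presumably via `flux_respond_error`) rather than a success response. Leaving the rollback to the client would instead mean dropping the `erase` call and documenting that a failed request may still hold resources.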

@dongahn
Member Author

dongahn commented Jul 15, 2020

Here is an additional problem:

Our DFU traverser concatenates one or more error strings into its err_message string member so that the upper layer can use get_err_message() to print it. In several places, the string added to err_message ends with a newline character, which doesn't work well with flux_log_error. An example:

ahn1@5b12c7ea7263:/usr/src$ python3 t/scripts/flux-ion-resource.py find status=adown
2020-07-15T17:03:56.268829Z sched-fluxion-resource.err[0]: run_find: find: invalid criteria: status=adown.
2020-07-15T17:03:56.268864Z sched-fluxion-resource.err[0]: : Invalid argument

Note the extra colon in front of "Invalid argument". We need a way for the upper layer to iterate over each error string so it can print them properly.

This may also give a better way to resolve one of the pending issues: #409.
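
One possible shape for this, sketched under the assumption that the traverser can keep its messages in a list instead of a single concatenated string. `error_log` is a hypothetical class, not an existing flux-sched type.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical replacement for a single concatenated err_message string:
// each message is stored separately with trailing newlines stripped, so the
// upper layer can decide how to print or log every entry.
class error_log {
public:
    void add (std::string msg)
    {
        while (!msg.empty () && (msg.back () == '\n' || msg.back () == '\r'))
            msg.pop_back ();
        if (!msg.empty ())
            m_messages.push_back (std::move (msg));
    }
    const std::vector<std::string> &messages () const
    {
        return m_messages;
    }
    void clear ()
    {
        m_messages.clear ();
    }
private:
    std::vector<std::string> m_messages;
};
```

The upper layer could then loop over `messages ()` and log each entry on its own line. Because no entry ends in a newline, a logger that appends `: <strerror text>` after the message would no longer produce a line that starts with a bare colon.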

@dongahn dongahn added this to the 2020 August Release milestone Aug 14, 2020
@dongahn
Member Author

dongahn commented Aug 31, 2020

I want to spend a bit more time on this. Targeting the September release.

@dongahn dongahn modified the milestones: 2020 August Release, 2020 September Release Aug 31, 2020
@garlick garlick removed this from the 2020 September Release milestone Sep 22, 2022

3 participants