fault tolerance: need error propagation analysis #618

dongahn · 2020-03-07T07:47:38Z

This will likely be broken into multiple issues, but I wanted to open it so that we remember these important items as we work on "stabilization" tasks towards a tape out. Within resource and qmanager, some RPCs leave internal state inconsistent when a failure occurs. We need to analyze this more closely and define clearer error-handling semantics.


dongahn · 2020-06-11T04:44:06Z

There are several call sites where return codes are not checked:

https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L461
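As a minimal sketch of one way to catch unchecked return codes at compile time, the C++17 `[[nodiscard]]` attribute makes the compiler warn whenever a caller drops a result. The names `parse_positive` and `handle_request` are hypothetical and are not taken from resource_match.cpp; they only illustrate the check-and-propagate pattern.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper standing in for any call whose return code matters.
// [[nodiscard]] asks the compiler to warn if a caller ignores the result.
[[nodiscard]] int parse_positive (const std::string &s, long &out)
{
    if (s.empty ()) {
        errno = EINVAL;
        return -1;
    }
    char *end = nullptr;
    errno = 0;
    long v = std::strtol (s.c_str (), &end, 10);
    if (errno != 0 || *end != '\0' || v <= 0) {
        if (errno == 0)
            errno = EINVAL; // malformed or non-positive input
        return -1;
    }
    out = v;
    return 0;
}

// Hypothetical caller: checks the return code and propagates the failure
// instead of continuing with an unset result.
int handle_request (const std::string &arg, long &result)
{
    if (parse_positive (arg, result) < 0)
        return -1; // errno was set by the callee
    return 0;
}
```

Marking the internal helpers this way would turn each silently ignored return code into a compiler warning, which could make the audit of call sites more mechanical.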

dongahn · 2020-06-11T04:47:34Z

There are a few error paths where errno is not preserved. We need to save and restore errno for library calls (e.g., json_decref) being made on the error paths.

https://github.com/flux-framework/flux-sched/blob/master/resource/writers/match_writers.cpp#L261
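The save-and-restore pattern could look like the sketch below. `noisy_cleanup` is a hypothetical stand-in for any library call made on an error path that may overwrite errno; `do_work` is likewise illustrative and not code from match_writers.cpp.

```cpp
#include <cerrno>

// Hypothetical cleanup routine that overwrites errno as a side effect,
// standing in for a library call made while unwinding from an error.
void noisy_cleanup ()
{
    errno = ENOENT;
}

// Illustrative function: on failure it sets errno, runs cleanup, and makes
// sure the caller still sees the original error code.
int do_work (bool fail)
{
    if (fail) {
        errno = EINVAL;           // the error we want the caller to see
        int saved_errno = errno;  // save before calling anything else
        noisy_cleanup ();         // may overwrite errno
        errno = saved_errno;      // restore the original cause
        return -1;
    }
    return 0;
}
```

Without the save and restore, the caller in this sketch would observe ENOENT from the cleanup call instead of the EINVAL that describes the actual failure.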

SteVwonder · 2020-07-08T19:11:31Z

From #679:

There are some error paths within the sched-fluxion-resource module that let errors pass silently and respond to the request RPC with a success response. In particular, this happens when the writer fails to emit properly. We should make these errors loud and respond to the RPC with an error.

EDIT: we should also decide how we want to recover from the above failure, since technically the allocation for the job has already been made in fluxion-resource. Do we want to automatically roll back the allocation, or let the requesting client handle the cancellation/rollback?
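To make the first option concrete, here is a rough sketch of automatic rollback when emission fails after the allocation has been recorded. Everything in it is an assumption for illustration: `alloc_table`, `match_and_emit`, and the `emit` callback are hypothetical and do not reflect the actual fluxion-resource data structures or writer interface.

```cpp
#include <cerrno>
#include <functional>
#include <set>
#include <string>

// Hypothetical stand-in for the module's record of allocated jobs.
struct alloc_table {
    std::set<long> jobs;
};

// Records the allocation, then asks the writer to emit. If emission fails,
// the allocation is undone and the error is returned so that the RPC handler
// can respond with an error instead of a success.
int match_and_emit (alloc_table &table,
                    long jobid,
                    const std::function<int (std::string &)> &emit,
                    std::string &out)
{
    if (!table.jobs.insert (jobid).second) {
        errno = EEXIST; // job already has an allocation
        return -1;
    }
    if (emit (out) < 0) {
        int saved_errno = errno;   // keep the writer's error code
        table.jobs.erase (jobid);  // roll back the allocation
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```

The alternative, leaving the allocation in place and letting the client cancel, would drop the `erase` call but would still need the error return so the failure is visible to the requester.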

dongahn · 2020-07-15T17:18:51Z

Here is an additional problem:

Our DFU traverser concatenates one or more error strings onto its err_message string member so that the upper layer can use get_err_message() to print it. In several places, the string appended to err_message ends with a newline character, which doesn't work well with flux_log_error. An example:

ahn1@5b12c7ea7263:/usr/src$ python3 t/scripts/flux-ion-resource.py find status=adown
2020-07-15T17:03:56.268829Z sched-fluxion-resource.err[0]: run_find: find: invalid criteria: status=adown.
2020-07-15T17:03:56.268864Z sched-fluxion-resource.err[0]: : Invalid argument

Note the extra colon in front of "Invalid argument". We need a way for the upper layer to iterate over the individual error strings so that each one is printed properly.

This may also give a better way to resolve one of the pending issues: #409.
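One possible shape for this, sketched under assumptions: keep the error strings as separate entries instead of one concatenated string, and strip trailing newlines as they are added. The `err_log` class below is hypothetical and is not the traverser's existing interface.

```cpp
#include <string>
#include <vector>

// Hypothetical container that keeps each error message as its own entry so
// the upper layer can log them one at a time.
class err_log {
public:
    // Stores a message after stripping trailing newline characters.
    // Messages that are empty after stripping are dropped.
    void add (const std::string &msg)
    {
        std::string m = msg;
        while (!m.empty () && (m.back () == '\n' || m.back () == '\r'))
            m.pop_back ();
        if (!m.empty ())
            m_msgs.push_back (m);
    }

    // Read-only access for the upper layer to iterate over.
    const std::vector<std::string> &messages () const
    {
        return m_msgs;
    }

    void clear ()
    {
        m_msgs.clear ();
    }

private:
    std::vector<std::string> m_msgs;
};
```

With something like this, the upper layer could loop over `messages ()` and make one logging call per entry, so no entry carries an embedded newline into the log line.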

dongahn · 2020-08-31T16:34:57Z

I want to spend a bit more time on this. Targeting the September release.

dongahn mentioned this issue Mar 19, 2020

Use Jansson for JSON-based writers + time keys in RV1 #614

Merged

dongahn mentioned this issue Jun 11, 2020

Add basic state recovery support #663

Merged

dongahn mentioned this issue Jun 19, 2020

Resource status #665

Merged

dongahn mentioned this issue Jul 8, 2020

Match allocate_with_satisfiability success repsonse contains empty R #679

Closed

dongahn mentioned this issue Jul 12, 2020

Traverser: accum_to_parent always returns success #680

Open

dongahn added this to the 2020 August Release milestone Aug 14, 2020

dongahn modified the milestones: 2020 August Release, 2020 September Release Aug 31, 2020

garlick removed this from the 2020 September Release milestone Sep 22, 2022
