Fluxion can't restart running jobs with match-format `rv1_nosched` #991

grondo · 2022-12-01T18:52:38Z

The Flux system instance needed to be restarted on tioga recently and there were two active jobs in CLEANUP state. This caused Fluxion to fail to restart with the following errors:

[  +0.001134] job-manager[0]: scheduler: hello
[  +0.002056] sched-fluxion-resource[0]: parse_R: no scheduling key in R
[  +0.002077] sched-fluxion-resource[0]: run_update: parsing R: No such file or directory
[  +0.002084] sched-fluxion-resource[0]: update_request_cb: update failed (id=138630396509161472): No such file or directory
[  +0.002316] sched-fluxion-qmanager[0]: jobmanager_hello_cb: reconstruct (id=138630396509161472 queue=default): No such file or directory
[  +0.002330] sched-fluxion-qmanager[0]: hello: error loading R for id=138630396509161472: No such file or directory
[  +0.002347] sched-fluxion-qmanager[0]: handshake_jobmanager: schedutil_hello: No such file or directory
[  +0.002352] sched-fluxion-qmanager[0]: handshake: handshake_jobmanager: No such file or directory
[  +0.002362] sched-fluxion-qmanager[0]: mod_start: handshake: No such file or directory
[  +0.002728] sched-fluxion-qmanager[0]: service_unregister
[  +0.002759] sched-fluxion-qmanager[0]: module exiting abnormally

It appears that parse_R() in resource/modules/resource_match.cpp always requires a scheduling key be set with JGF, but this will not be the case for any job when sched-fluxion-resource.match-format = "rv1_nosched".

Since Fluxion supports a rv1exec reader, parse_R() should fall back to this reader when there is no scheduling key in R for a job.
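The proposed fallback could look roughly like the sketch below. It is an illustration in Python, not Fluxion's C++ code: `select_reader` is a hypothetical helper, and the only assumption about R is that the optional JGF lives under a top-level `scheduling` key, as the `parse_R` log message suggests.

```python
import json

def select_reader(r_string: str) -> str:
    """Return the name of the reader to use for a job's R (hypothetical helper)."""
    r = json.loads(r_string)
    # Assumption: a non-empty "scheduling" key means JGF is available.
    if r.get("scheduling"):
        return "jgf"
    # rv1_nosched omits the key, so fall back to the rv1exec reader.
    return "rv1exec"

# An R object without a scheduling key, as written with match-format=rv1_nosched.
nosched = (
    '{"version": 1, "execution": {"R_lite": [{"rank": "0", '
    '"children": {"core": "0-3"}}], "nodelist": ["node1"], '
    '"starttime": 0, "expiration": 3600}}'
)
print(select_reader(nosched))  # rv1exec
```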


grondo · 2022-12-01T22:17:46Z

I verified this case (reloading sched-fluxion-resource with match-format=rv1_nosched) is missed in the Fluxion testsuite. Strangely, many other cases are tested, so I wonder if this was purposeful?

Anyway, adding this test to the testsuite demonstrates the issue:

diff --git a/t/t1007-recovery-full.t b/t/t1007-recovery-full.t
index 5b2de636..0923eeaa 100755
--- a/t/t1007-recovery-full.t
+++ b/t/t1007-recovery-full.t
@@ -150,6 +150,18 @@ test_expect_success 'recovery: qmanager restarts (rv1_nosched->rv1_nosched)' '
     test_expect_code 3 flux ion-resource info ${jobid5}
 '
 
+test_expect_success 'recovery: both modules restart (rv1_nosched->rv1_nosched)' '
+    reload_resource match-format=rv1_nosched \
+    policy=high &&
+    reload_qmanager &&
+    flux module stats sched-fluxion-qmanager &&
+    flux module stats sched-fluxion-resource &&
+    flux ion-resource info ${jobid1} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid2} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid3} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid4} | grep "ALLOCATED"
+'
+
 test_expect_success 'recovery: a cancel leads to a job schedule (rv1_nosched)' '
     flux job cancel ${jobid1} &&
     flux job wait-event -t 60 ${jobid5} start

grondo · 2022-12-02T16:49:35Z

I started to look into fixing this issue, but got lost fairly quickly. It seems like we need a way to convert an R object without the scheduling key into a jgf representation so it can be passed to the run() function to update the graph. It seems like this should be possible using the rv1exec reader, then emitting jgf, but I'm unsure of the exact mechanism for that.

Perhaps one of the developers more familiar with Fluxion (@jameshcorbett, @milroy, or @trws) could take a crack at it or offer some advice. Thanks!

garlick · 2023-01-26T17:40:51Z

Reloading the sched-fluxion-qmanager module with a running job is sufficient to reproduce this:

2023-01-26T17:33:25.033690Z sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed
2023-01-26T17:33:25.033973Z job-manager.debug[0]: scheduler: hello
2023-01-26T17:33:25.038243Z sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
2023-01-26T17:33:25.038288Z sched-fluxion-qmanager.err[0]: hello: error loading R for id=131481139893239808: No such file or directory
2023-01-26T17:33:25.038351Z sched-fluxion-qmanager.err[0]: handshake_jobmanager: schedutil_hello: No such file or directory
2023-01-26T17:33:25.038363Z sched-fluxion-qmanager.err[0]: handshake: handshake_jobmanager: No such file or directory
2023-01-26T17:33:25.038373Z sched-fluxion-qmanager.err[0]: mod_start: handshake: No such file or directory
2023-01-26T17:33:25.040393Z sched-fluxion-qmanager.debug[0]: service_unregister
2023-01-26T17:33:25.040490Z sched-fluxion-qmanager.crit[0]: module exiting abnormally

A test could just leave resource running.

Problem: t1008-recovery-none.t expects the job manager to abort the scheduler if a job fails to re-allocate resources during the hello handshake, but this behavior will change soon.

Drop this test. The behavior it is looking for will either be addressed by a true fix to flux-framework#991 or the workaround proposed in flux-framework/flux-core#4894.


Problem: there is no test coverage for module reload with running jobs and rv1_nosched.

Add test proposed by @grondo in flux-framework#991, expecting failure for now. The test fails before and after the work-around proposed in flux-framework/flux-core#4894 because it checks for both:
- qmanager reload fails (fails before the work-around)
- job resources remain allocated (fails after the work-around)

Increase the broker stderr log verbosity so the fatal job exceptions generated by the work-around at LOG_INFO level are visible when the test is run with -v.

milroy · 2023-02-02T03:07:07Z

I started looking into this issue. While it is true that Fluxion features the rv1_nosched match format and the rv1exec reader, Fluxion doesn't support recovery with the rv1_nosched writer, in particular:

> By omitting the scheduling key, "rv1_nosched" will result in higher scheduling performance. However, this format will not contain sufficient information to reconstruct the state of sched-fluxion-resource on module reload (as required for system instance failure recovery).

After looking at rv1_nosched output, I agree with the documentation. R_lite and nodelist do not provide unique identifiers to find and update the corresponding resource graph vertices. Is it possible to use rv1 for the system instance instead, or is the performance too low?

grondo · 2023-02-02T03:23:07Z

I could be wrong, but I think the issue is the amount of data added to R in the scheduling key, and the fact that job data, including R, will remain in the KVS for a long time at the system instance level, which could contribute to content store growth.

Can you expand on your statement that R_lite does not have the right identifiers to find and update existing graph vertices correctly, for those of us that do not know the internals of the graph implementation very well? To the casual observer, it would seem that since Fluxion can read and construct a graph from R_lite, it should also be able to reconstruct the necessary information to find those same vertices in an existing graph created from the same format.

milroy · 2023-02-02T03:23:17Z

The intention appears to have been to use the rv1 writer for the system instance:

# system-instance will use full-up rv1 writer
# so that R will contain scheduling key needed
# for failure recovery.
match-format = "rv1"

It may be possible to refactor Fluxion to change writers at runtime. If I understand correctly, then when restarted under certain conditions Fluxion could switch writers from rv1_nosched to rv1.

milroy · 2023-02-02T03:46:51Z

> Can you expand on your statement that R_lite does not have the right identifiers to find and update existing graph vertices correctly, for those of us that do not know the internals of the graph implementation very well?

Yes, here's the writer output for a job match with the rv1_nosched writer:

{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "35"}}], "nodelist": ["node1"], "starttime": 0, "expiration": 3600}}

In contrast, JGF provides each resource's unique IDs and graph paths which allow for full resolution of the vertex:

{"graph": {"nodes": [{"id": "79", "metadata": {"type": "core", "basename": "core", "name": "core35", "id": 35, "uniq_id": 79, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1/core35"}}}, {"id": "7", "metadata": {"type": "socket", "basename": "socket", "name": "socket1", "id": 1, "uniq_id": 7, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1"}}}, {"id": "3", "metadata": {"type": "node", "basename": "node", "name": "node1", "id": 1, "uniq_id": 3, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1"}}}, {"id": "1", "metadata": {"type": "rack", "basename": "rack", "name": "rack0", "id": 0, "uniq_id": 1, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "tiny", "name": "tiny0", "id": 0, "uniq_id": 0, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0"}}}], "edges": [{"source": "7", "target": "79", "metadata": {"name": {"containment": "contains"}}}, {"source": "3", "target": "7", "metadata": {"name": {"containment": "contains"}}}, {"source": "1", "target": "3", "metadata": {"name": {"containment": "contains"}}}, {"source": "0", "target": "1", "metadata": {"name": {"containment": "contains"}}}]}}
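To make the contrast concrete, here is a small sketch (Python, for illustration only) that builds a lookup from containment path to `uniq_id` out of JGF. The helper name `containment_index` is hypothetical, and the input is a trimmed copy of the JGF above with only two vertices and a subset of their metadata. Nothing comparable can be built from the `rv1_nosched` output alone, which carries only a rank, a core idset, and a hostname.

```python
import json

def containment_index(jgf: dict) -> dict:
    """Map each vertex's containment path to its uniq_id."""
    index = {}
    for node in jgf["graph"]["nodes"]:
        meta = node["metadata"]
        index[meta["paths"]["containment"]] = meta["uniq_id"]
    return index

# Trimmed copy of the JGF shown above (two of the five vertices).
jgf = json.loads("""
{"graph": {"nodes": [
  {"id": "79", "metadata": {"type": "core", "name": "core35", "id": 35, "uniq_id": 79,
                            "paths": {"containment": "/tiny0/rack0/node1/socket1/core35"}}},
  {"id": "3", "metadata": {"type": "node", "name": "node1", "id": 1, "uniq_id": 3,
                           "paths": {"containment": "/tiny0/rack0/node1"}}}
], "edges": []}}
""")

index = containment_index(jgf)
print(index["/tiny0/rack0/node1/socket1/core35"])  # 79
```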

> that since Fluxion can read and construct a graph from R_lite

I didn't know that, and I'm struggling to understand how that happens. Fluxion only supports grug, hwloc, jgf, and rv1exec as graph load formats, and of those only hwloc, rv1exec, and jgf for creating the resource graph via populate_resource_db_acquire.

grondo · 2023-02-02T03:49:38Z

> I didn't know that, and I'm struggling to understand how that happens.

Sorry, I meant rv1exec

grondo · 2023-02-02T03:50:57Z

> Yes, here's the writer output for a job match with the rv1_nosched writer:

I'm assuming that {"rank": "-1"} is an error?

milroy · 2023-02-02T03:56:06Z

I think {"rank": "-1"} is just an artifact of running this in resource-query vs a running Flux instance.

milroy · 2023-02-02T07:54:48Z

> Fluxion could switch writers from rv1_nosched to rv1

What I have in mind is fairly complicated and may not work in the end. It would consist of dumping the resources of running jobs to the KVS via the rv1 writer when Flux stops. Let me know if there's interest and we can discuss it at a coffee time.

grondo · 2023-02-02T14:54:24Z

That might work for a planned and orderly shutdown, but we also have to support a restart after a broker crash, in which case this mechanism could not be used.

Could we devise something to put into the scheduling key of an rv1exec object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them? Naively, would a mapping between rank and unique_id work? (I'm actually guessing not, because if that were enough, Fluxion could create this at runtime since broker ranks are already unique. However, I guess it cannot hurt to ask.)

trws · 2023-02-02T15:25:37Z

If that isn’t sufficient I’d be somewhat interested to know why not, because it seems like the kind of problem we could solve with an index if it’s not currently feasible. Do we currently require the full path keys?

milroy · 2023-02-02T20:26:22Z

> but we also have to support a restart after a broker crash, in which case this mechanism could not be used.

Good point.

> Could we devise something to put into the scheduling key of an rv1exec object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them?

Yes, I think so. I'll refresh my memory on the graph update functions and how they use the global lookups to devise some minimum schema.

grondo · 2023-06-08T16:39:11Z

This came up again because the inability of Fluxion to be reloaded with running jobs is preventing us from reconfiguring resources (e.g. adding or modifying queues) without a downtime on production clusters.

As an experiment, I tried setting match-format = "rv1" and then seeing if Fluxion modules can be reloaded with running jobs, but I still get the error:

[  +9.794686] job-manager[0]: scheduler: hello
[  +9.797168] sched-fluxion-qmanager[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
[  +9.797231] sched-fluxion-qmanager[0]: raising fatal exception on running job id=63504667937603584

The job in question does seem to have JGF in R:

$ flux job info 63504667937603584 R | jq .scheduling | head
{
  "graph": {
    "nodes": [
      {
        "id": "17",
        "metadata": {
          "type": "core",
          "basename": "core",
          "name": "core0",
          "id": 0,

grondo · 2023-06-08T16:40:37Z

It is possible I performed the experiment incorrectly, so it may be good if someone else can verify this behavior. If so, then we'll need a plan to address this issue in the near term (i.e. this issue just got high priority).

trws · 2023-06-08T17:09:41Z

The only line that should be able to produce that error is this one:

flux-sched/qmanager/modules/qmanager_callbacks.cpp

Line 172 in c65026c

queue = ctx->queues.at (queue_name);

I haven't had time to go in and experiment with it yet, but this makes me think that the map of queues doesn't get repopulated on a restart, so we might not even be getting to the graph.

grondo · 2023-06-08T21:46:45Z

Ah, ok, that's a good point and something (hopefully) easy to fix as a start?

I did verify that with no queues enabled and match-format = "rv1", Fluxion can be reloaded with running jobs.

I'll open a separate issue.

trws · 2023-06-08T21:56:49Z

Yeah, I don't think that will be all that bad. I'm a bit confused how that can happen, since everything is called in the right order, but if we have a reasonable reproducing test case this should be relatively straightforward. Glancing at it I'm guessing it's something like the list of queues in the initial config got cleared or some generated name changed or... 🤷 If we add in a bit of context to figure out which queue is missing and which queues exist it should fall out quickly.
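The kind of added context suggested here might look like the following sketch. It is written in Python for brevity rather than in qmanager's C++, so a plain dict stands in for the `ctx->queues` map and `KeyError` stands in for the exception thrown by `map::at`; `lookup_queue` is a hypothetical helper, not an existing Fluxion function.

```python
def lookup_queue(queues: dict, queue_name: str):
    """Look up a queue, naming the missing and known queues on failure."""
    try:
        return queues[queue_name]
    except KeyError:
        known = ", ".join(sorted(queues)) or "(none)"
        raise KeyError(
            f"queue '{queue_name}' not found; known queues: {known}"
        ) from None

# Stand-in for the queue map after a reload that lost the "default" queue.
queues = {"batch": object(), "debug": object()}
try:
    lookup_queue(queues, "default")
except KeyError as exc:
    print(exc.args[0])  # queue 'default' not found; known queues: batch, debug
```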

garlick · 2023-12-20T23:43:38Z

Just following up on the meeting today:

fluxion gets the initial Rv1 object from the resource.acquire RPC, described in RFC 28. The Rv1 object is returned in the first response payload. This is designed so resources are automatically returned when fluxion unloads and can be re-acquired when it loads again.

flux-sched/resource/modules/resource_match.cpp

Line 1282 in fe872c8

static int populate_resource_db_acquire (std::shared_ptr<resource_ctx_t> &ctx)

Fluxion then uses the job-manager.hello RPC to acquire a list of running jobs. That one is described in RFC 27. Although the Rv1 object allocated to each job is not provided in the response payload, the libschedutil helper library from flux-core fetches it from the kvs and provides it to the scheduler in the hello callback. Thus the wall clock expiration time for each allocated Rv1 is available.

flux-sched/qmanager/modules/qmanager_callbacks.cpp

Line 123 in fe872c8

int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,

So my probably naive question is what required information is missing to allow fluxion to restart with running jobs and rv1_nosched?

garlick · 2024-04-11T01:10:04Z

In our meeting today it was asserted that the Rv1 to graph uuid mapping would need to be preserved (in a file or KVS) across a restart in order to map the resources of running jobs to the resource graph. Afterward I'm once again wondering why. The uuids are purely internal to fluxion - fluxion won't receive a uuid from a running job that was allocated from a past instance of fluxion and need to map it to resources. It will receive rv1 with hostname/ranks and gpu/core logical indices. I don't see how having the old uuid mapping would help.

What am I missing?

milroy · 2024-04-11T08:54:34Z

I ended up deciding the vertex uniq_id (I mistakenly called it a UUID) wasn't necessary to identify the underlying graph vertex. I actually think it is possible to reconstruct the vertices' information from RV1 with hostnames and logical indices of a rank's children provided that the original resource graph was also generated from the rv1exec reader.

However, edge data (such as exclusivity tracking) can't be reconstructed without a fully specified format like JGF. Here's an example of what the JGF reader does to update the edges:

flux-sched/resource/readers/resource_reader_jgf.cpp

Line 1048 in c8e03f8

m.v_rt_edges[kv.first].set_for_trav_update (vmap[source].needs,

That said, I don't know whether not setting those parameters will result in undesirable scheduling behavior or only in fewer optimizations.
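The vertex-identification part of this (hostnames plus logical indices of a rank's children) can be sketched as below. This is a Python illustration under stated assumptions, not the rv1exec reader: `expand_idset` and `vertex_keys` are hypothetical helpers, the idset parser handles only non-negative ranges such as "0-2,5", and `nodelist` is assumed to be already expanded into one hostname per rank, in rank order. It says nothing about edges or exclusivity, which is exactly the information noted above as missing.

```python
def expand_idset(idset: str) -> list:
    """Expand a simple idset such as "0-2,5" into [0, 1, 2, 5]."""
    ids = []
    for part in idset.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids

def vertex_keys(r: dict) -> list:
    """List (hostname, resource type, logical index) for every allocated child."""
    execution = r["execution"]
    # Assumption: one hostname per rank, listed in rank order.
    hosts = iter(execution["nodelist"])
    keys = []
    for entry in execution["R_lite"]:
        for _rank in expand_idset(entry["rank"]):
            host = next(hosts)
            for rtype, idset in entry["children"].items():
                keys.extend((host, rtype, i) for i in expand_idset(idset))
    return keys

r = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0-1", "children": {"core": "0-1"}}],
        "nodelist": ["node1", "node2"],
        "starttime": 0,
        "expiration": 3600,
    },
}
for key in vertex_keys(r):
    print(key)
```

Each tuple is a candidate key for locating a vertex in a graph that was itself built from rv1exec.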

garlick · 2024-04-11T14:18:13Z

Thanks for that explanation! Well, I think having the rv1 reader, even a naive one, would be an excellent near term step since it would let the scheduler be unloaded and reconfigured or even updated without affecting running jobs. When the scheduler is misbehaving, that could be really handy.

Edit: we could always augment that with some side band storage or whatever if it turns out to be required.

Problem: issue flux-framework#991 identified the need for an `rv1exec` implementation of update (). The need for the implementation is described in detail in the issue, but the primary motivation is to enable reloading Fluxion when using RV1 without the scheduling key and payload.

The reader was not originally implemented due to the lack of information in the format. Examples include edges, exclusivity, paths with subsystems, and vertex sizes. To create a workable implementation, strong assumptions need to be made about resource exclusivity and edges.

Add support for update () through helper functions that update vertex and edge metadata and infer exclusivity of node-level resources.

grondo changed the title ~~Fluxion can't restart with running jobs with match-format rv1_nosched~~ Fluxion can't restart running jobs with match-format rv1_nosched Dec 1, 2022

ryanday36 added this to TOSS4 system instance tracking Jan 19, 2023

garlick mentioned this issue Jan 23, 2023

crashing with running jobs can lead to problems flux-framework/flux-core#4862

Closed

garlick mentioned this issue Jan 27, 2023

WIP: raise fatal exception on running jobs during reload #999

Closed

garlick mentioned this issue Jan 28, 2023

testsuite: adjust expectations of recovery in rv1_nosched mode #1000

Merged

grondo mentioned this issue Feb 23, 2023

test impact of using rv1 vs rv1_nosched on instance performance #1009

Open

grondo mentioned this issue Jun 8, 2023

fluxion can't restart with queues enabled #1035

Closed

milroy mentioned this issue Apr 15, 2024

Implement rv1exec reader update to facilitate Fluxion reload #1176

Merged

2 tasks

milroy linked a pull request Apr 15, 2024 that will close this issue

Implement rv1exec reader update to facilitate Fluxion reload #1176

Merged

2 tasks

mergify bot closed this as completed in #1176 Apr 24, 2024

github-project-automation bot moved this to Done in TOSS4 system instance tracking Apr 24, 2024
