Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fluxion can't restart running jobs with match-format rv1_nosched #991

Closed
grondo opened this issue Dec 1, 2022 · 23 comments · Fixed by #1176
Closed

Fluxion can't restart running jobs with match-format rv1_nosched #991

grondo opened this issue Dec 1, 2022 · 23 comments · Fixed by #1176

Comments

@grondo
Copy link
Contributor

grondo commented Dec 1, 2022

The Flux system instance needed to be restarted on tioga recently and there were two active jobs in CLEANUP state. This caused Fluxion to fail to restart with the following errors:

[  +0.001134] job-manager[0]: scheduler: hello
[  +0.002056] sched-fluxion-resource[0]: parse_R: no scheduling key in R
[  +0.002077] sched-fluxion-resource[0]: run_update: parsing R: No such file or directory
[  +0.002084] sched-fluxion-resource[0]: update_request_cb: update failed (id=138630396509161472): No such file or directory
[  +0.002316] sched-fluxion-qmanager[0]: jobmanager_hello_cb: reconstruct (id=138630396509161472 queue=default): No such file or directory
[  +0.002330] sched-fluxion-qmanager[0]: hello: error loading R for id=138630396509161472: No such file or directory
[  +0.002347] sched-fluxion-qmanager[0]: handshake_jobmanager: schedutil_hello: No such file or directory
[  +0.002352] sched-fluxion-qmanager[0]: handshake: handshake_jobmanager: No such file or directory
[  +0.002362] sched-fluxion-qmanager[0]: mod_start: handshake: No such file or directory
[  +0.002728] sched-fluxion-qmanager[0]: service_unregister
[  +0.002759] sched-fluxion-qmanager[0]: module exiting abnormally

It appears that parse_R() in resource/modules/resource_match.cpp always requires a scheduling key be set with JGF, but this will not be the case for any job when sched-fluxion-resource.match-format = "rv1_nosched".

Since Fluxion supports a rv1exec reader, parse_R() should fall back to this reader when there is no scheduling key in R for a job.

@grondo grondo changed the title Fluxion can't restart with running jobs with match-format rv1_nosched Fluxion can't restart running jobs with match-format rv1_nosched Dec 1, 2022
@grondo
Copy link
Contributor Author

grondo commented Dec 1, 2022

I verified this case (reloading sched-fluxion-resource with match-format=rv1_nosched) is missed in the Fluxion testsuite. Strangely, many other cases are tested, so I wonder if this was purposeful?

Anyway, adding this test to the testsuite demonstrates the issue:

diff --git a/t/t1007-recovery-full.t b/t/t1007-recovery-full.t
index 5b2de636..0923eeaa 100755
--- a/t/t1007-recovery-full.t
+++ b/t/t1007-recovery-full.t
@@ -150,6 +150,18 @@ test_expect_success 'recovery: qmanager restarts (rv1_nosched->rv1_nosched)' '
     test_expect_code 3 flux ion-resource info ${jobid5}
 '
 
+test_expect_success 'recovery: both modules restart (rv1_nosched->rv1_nosched)' '
+    reload_resource match-format=rv1_nosched \
+    policy=high &&
+    reload_qmanager &&
+    flux module stats sched-fluxion-qmanager &&
+    flux module stats sched-fluxion-resource &&
+    flux ion-resource info ${jobid1} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid2} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid3} | grep "ALLOCATED" &&
+    flux ion-resource info ${jobid4} | grep "ALLOCATED"
+'
+
 test_expect_success 'recovery: a cancel leads to a job schedule (rv1_nosched)' '
     flux job cancel ${jobid1} &&
     flux job wait-event -t 60 ${jobid5} start

@grondo
Copy link
Contributor Author

grondo commented Dec 2, 2022

I started to look into fixing this issue, but got lost fairly quickly. It seems like we need a way to convert an R object without the scheduling key into a jgf representation so it can be passed to the run() function to update the graph. It seems like this should be possible using the rv1exec reader, then emitting jgf, but I'm unsure of the exact mechanism for that.

Perhaps one of the developers more familiar with Fluxion (@jameshcorbett, @milroy, or @trws) could take a crack at it or offer some advice. Thanks!

@garlick
Copy link
Member

garlick commented Jan 26, 2023

Reloading the sched-fluxion-qmanager module with a running job is sufficient to reproduce this:

2023-01-26T17:33:25.033690Z sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed
2023-01-26T17:33:25.033973Z job-manager.debug[0]: scheduler: hello
2023-01-26T17:33:25.038243Z sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
2023-01-26T17:33:25.038288Z sched-fluxion-qmanager.err[0]: hello: error loading R for id=131481139893239808: No such file or directory
2023-01-26T17:33:25.038351Z sched-fluxion-qmanager.err[0]: handshake_jobmanager: schedutil_hello: No such file or directory
2023-01-26T17:33:25.038363Z sched-fluxion-qmanager.err[0]: handshake: handshake_jobmanager: No such file or directory
2023-01-26T17:33:25.038373Z sched-fluxion-qmanager.err[0]: mod_start: handshake: No such file or directory
2023-01-26T17:33:25.040393Z sched-fluxion-qmanager.debug[0]: service_unregister
2023-01-26T17:33:25.040490Z sched-fluxion-qmanager.crit[0]: module exiting abnormally

A test could just leave resource running.

garlick added a commit to garlick/flux-sched that referenced this issue Jan 28, 2023
Problem: t1008-recovery-none.t expects the job manager to abort the
scheduler if a job fails to re-allocate resources during the hello
handshake, but this behavior will change soon.

Drop this test.  The behavior it is looking for will either be
addressed by a true fix to flux-framework#991 or the workaround proposed in
flux-framework/flux-core#4894.
garlick added a commit to garlick/flux-sched that referenced this issue Jan 28, 2023
Problem: there is no test coverage for module reload with running
jobs and rv1_nosched.

Add test proposed by @grondo in flux-framework#991, expecting failure for now.

The test fails before and after the work-around proposed in
flux-framework/flux-core#4894 because it checks for both:
- qmanager reload fails (fails before the work-around)
- job resources remain allocated (fails after the work-around)

Increase the broker stderr log verbosity so the fatal job exceptions
generated by the work-around at LOG_INFO level are visible when
the test is run with -v.
@milroy
Copy link
Member

milroy commented Feb 2, 2023

I started looking into this issue. While it is true that Fluxion features the rv1_nosched match format and the rv1exec reader, Fluxion doesn't support recovery with the rv1_nosched writer, in particular:

By omitting the scheduling key, "rv1_nosched" will result in higher scheduling performance. However, this format will not contain sufficient information to reconstruct the state of sched-fluxion-resource on module reload (as required for system instance failure recovery).

After looking at rv1_nosched output, I agree with the documentation. R_lite and nodelist do not provide unique identifiers to find and update the corresponding resource graph vertices. Is it possible to use rv1 for the system instance instead, or is the performance too low?

@grondo
Copy link
Contributor Author

grondo commented Feb 2, 2023

I could be wrong, but I think the issue is the amount of data added to R in the scheduling key, and the fact that job data, including R, will remain in the KVS for a long time at the system instance level, which could contribute to content store growth.

Can you expand on your statement that R_lite does not have the right identifiers to find update existing graph vertices correctly, for those of us that do not know the internals of the graph implementation very well? To the casual observer, it would seem that since Fluxion can read and construct a graph from R_lite, it should also be able to reconstruct the necessary information to find those same vertices in an existing graph created from the same format?

@milroy
Copy link
Member

milroy commented Feb 2, 2023

The intention appears to have been to use the rv1 writer for the system instance:

# system-instance will use full-up rv1 writer
# so that R will contain scheduling key needed
# for failure recovery.
match-format = "rv1"

It may be possible to refactor Fluxion to change writers at runtime. If I understand correctly then if restarted under certain conditions Fluxion could switch writers from rv1_nosched to rv1.

@milroy
Copy link
Member

milroy commented Feb 2, 2023

Can you expand on your statement that R_lite does not have the right identifiers to find update existing graph vertices correctly, for those of us that do not know the internals of the graph implementation very well?

Yes, here's the writer output for a job match with the rv1_nosched writer:

{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "35"}}], "nodelist": ["node1"], "starttime": 0, "expiration": 3600}}

In contrast, JGF provides each resource's unique IDs and graph paths which allow for full resolution of the vertex:

{"graph": {"nodes": [{"id": "79", "metadata": {"type": "core", "basename": "core", "name": "core35", "id": 35, "uniq_id": 79, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1/core35"}}}, {"id": "7", "metadata": {"type": "socket", "basename": "socket", "name": "socket1", "id": 1, "uniq_id": 7, "rank": -1, "exclusive": true, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1/socket1"}}}, {"id": "3", "metadata": {"type": "node", "basename": "node", "name": "node1", "id": 1, "uniq_id": 3, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0/node1"}}}, {"id": "1", "metadata": {"type": "rack", "basename": "rack", "name": "rack0", "id": 0, "uniq_id": 1, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0/rack0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "tiny", "name": "tiny0", "id": 0, "uniq_id": 0, "rank": -1, "exclusive": false, "unit": "", "size": 1, "paths": {"containment": "/tiny0"}}}], "edges": [{"source": "7", "target": "79", "metadata": {"name": {"containment": "contains"}}}, {"source": "3", "target": "7", "metadata": {"name": {"containment": "contains"}}}, {"source": "1", "target": "3", "metadata": {"name": {"containment": "contains"}}}, {"source": "0", "target": "1", "metadata": {"name": {"containment": "contains"}}}]}}

that since Fluxion can read and construct a graph from R_lite

I didn't know that, and I'm struggling to understand how that happens. Fluxion only supports grug, hwloc, jgf, and rv1exec as graph load formats, and of those only hwloc, rv1exec, and jgf for creating the resource graph via populate_resource_db_acquire

@grondo
Copy link
Contributor Author

grondo commented Feb 2, 2023

I didn't know that, and I'm struggling to understand how that happens.

Sorry, I meant rv1exec

@grondo
Copy link
Contributor Author

grondo commented Feb 2, 2023

Yes, here's the writer output for a job match with the rv1_nosched writer:

I'm assuming that {"rank": "-1"} is an error?

@milroy
Copy link
Member

milroy commented Feb 2, 2023

I think {"rank": "-1"} is just an artifact of running this in resource-query vs a running Flux instance.

@milroy
Copy link
Member

milroy commented Feb 2, 2023

Fluxion could switch writers from rv1_nosched to rv1

What I have in mind is fairly complicated and may not work in the end. It would consist of dumping the resources of running jobs to the KVS via rv1 writer when Flux stops. Let me know if there's interest and we can discuss it at a coffee time.

@grondo
Copy link
Contributor Author

grondo commented Feb 2, 2023

That might work for a planned and orderly shutdown, but we also have to support a restart after a broker crash, in which case this mechanism could not be used.

Could we devise something to put into the scheduling key of an rv1exec object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them? Naively, would a mapping between rank and unique_id work? (I'm actually guessing not, because if that was enough, then Fluxion could create this at runtime since broker ranks are already unique, however, I guess it cannot hurt to ask)

@trws
Copy link
Member

trws commented Feb 2, 2023

If that isn’t sufficient I’d be somewhat interested to know why not, because it seems like the kind of problem we could solve with an index if it’s not currently feasible. Do we currently require the full path keys?

@milroy
Copy link
Member

milroy commented Feb 2, 2023

but we also have to support a restart after a broker crash, in which case this mechanism could not be used.

Good point.

Could we devise something to put into the scheduling key of an rv1exec object which would be brief but give Fluxion enough of a hint to find the correct vertices in the graph and free them?

Yes, I think so. I'll refresh my memory on the graph update functions and how they use the global lookups to devise some minimum schema.

@grondo
Copy link
Contributor Author

grondo commented Jun 8, 2023

This came up again because the inability of Fluxion to be reloaded with running jobs is preventing us from reconfiguring resources (e.g. adding or modifying queues) without a downtime on production clusters.

As an experiment, I tried setting match-format = "rv1" and then seeing if Fluxion modules can be reloaded with running jobs, but I still get the error:

[  +9.794686] job-manager[0]: scheduler: hello
[  +9.797168] sched-fluxion-qmanager[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
[  +9.797231] sched-fluxion-qmanager[0]: raising fatal exception on running job id=63504667937603584

The job in question does seem to have JGF in R:

$ flux job info 63504667937603584 R | jq .scheduling | head
{
  "graph": {
    "nodes": [
      {
        "id": "17",
        "metadata": {
          "type": "core",
          "basename": "core",
          "name": "core0",
          "id": 0,

@grondo
Copy link
Contributor Author

grondo commented Jun 8, 2023

It is possible I performed the experiment incorrectly, so it may be good if someone else can verify this behavior. If so, then we'll need a plan to address this issue in the near term (i.e. this issue just got high priority)

@trws
Copy link
Member

trws commented Jun 8, 2023

The only line that should be able to produce that error is this one:

queue = ctx->queues.at (queue_name);

I haven't had time to go in and experiment with it yet, but this makes me think that the map of queues doesn't get repopulated on a restart, we might not even be getting to the graph.

@grondo
Copy link
Contributor Author

grondo commented Jun 8, 2023

Ah, ok, thats a good point and something (hopefully) easy to fix as a start?

I did verify that with no queues enabled and match-format = "rv1", Fluxion can be reloaded with running jobs.

I'll open a separate issue.

@trws
Copy link
Member

trws commented Jun 8, 2023

Yeah, I don't think that will be all that bad. I'm a bit confused how that can happen, since everything is called in the right order, but if we have a reasonable reproducing test case this should be relatively straightforward. Glancing at it I'm guessing it's something like the list of queues in the initial config got cleared or some generated name changed or... 🤷 If we add in a bit of context to figure out which queue is missing and which queues exist it should fall out quickly.

@garlick
Copy link
Member

garlick commented Dec 20, 2023

Just following up on the meeting today:

fluxion gets the initial Rv1 object from the resource.acquire RPC, described in RFC 28. The Rv1 object is returned in the first response payload. This is designed so resources are automatically returned when fluxion unloads and can be re-acquired when it loads again.

static int populate_resource_db_acquire (std::shared_ptr<resource_ctx_t> &ctx)

Fluxion then uses the job-manager.hello RPC to acquire a list of running jobs. That one is described in RFC 27. Although the Rv1 object allocated to each job is not provided in the response payload, the libschedutil helper library from flux-core fetches it from the kvs and provides it to the scheduler in the hello callback. Thus the wall clock expiration time for each allocated Rv1 is available.

int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,

So my probably naive question is what required information is missing to allow fluxion to restart with running jobs and rv1_nosched?

@garlick
Copy link
Member

garlick commented Apr 11, 2024

In our meeting today it was asserted that the Rv1 to graph uuid mapping would need to be preserved (in a file or KVS) across a restart in order to map the resources of running jobs to the resource graph. Afterward I'm once again wondering why. The uuid's are purely internal to fluxion - fluxion won't receive a uuid from a running job that was allocated from a past instance of fluxion and need to map it to resources. It will receive rv1 with hostname/ranks and gpu/core logical indices. I don't see how having the old uuid mapping would help.

What am I missing?

@milroy
Copy link
Member

milroy commented Apr 11, 2024

I ended up deciding the vertex uniq_id (I mistakenly called it a UUID) wasn't necessary to identify the underlying graph vertex. I actually think it is possible to reconstruct the vertices' information from RV1 with hostnames and logical indices of a rank's children provided that the original resource graph was also generated from the rv1exec reader.

However, edge data (such as exclusivity tracking) can't be reconstructed without a fully specified format like JGF. Here's an example of what the JGF reader does to update the edges:

m.v_rt_edges[kv.first].set_for_trav_update (vmap[source].needs,
. That said, I don't know if not setting those parameters will result in undesirable scheduling behavior or only in reduction of optimizations.

@garlick
Copy link
Member

garlick commented Apr 11, 2024

Thanks for that explanation! Well, I think having the rv1 reader, even a naive one, would be an excellent near term step since it would let the scheduler be unloaded and reconfigured or even updated without affecting running jobs. When the scheduler is misbehaving, that could be really handy.

Edit: we could always augment that with some side band storage or whatever if it turns out to be required.

milroy added a commit to milroy/flux-sched that referenced this issue Apr 15, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec` implementation of update (). The
need for the implementation is described in detail in the issue, but
the primary motivation is to enable reloading Fluxion when using RV1
without the scheduling key and payload. The reader was not originally
implemented due to the lack of information in the format. Examples
include edges, exclusivity, paths with subsystems, and vertex sizes. To
create a workable implementation, strong assumptions need to be made
about resource exclusivity and edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 15, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec` implementation of update (). The
need for the implementation is described in detail in the issue, but
the primary motivation is to enable reloading Fluxion when using RV1
without the scheduling key and payload. The reader was not originally
implemented due to the lack of information in the format. Examples
include edges, exclusivity, paths with subsystems, and vertex sizes. To
create a workable implementation, strong assumptions need to be made
about resource exclusivity and edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 15, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
@milroy milroy linked a pull request Apr 15, 2024 that will close this issue
2 tasks
milroy added a commit to milroy/flux-sched that referenced this issue Apr 17, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 23, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 23, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 24, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 24, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 24, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
milroy added a commit to milroy/flux-sched that referenced this issue Apr 24, 2024
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.
@mergify mergify bot closed this as completed in #1176 Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging a pull request may close this issue.

4 participants