add a faster way to get resource allocation status than sched.resource-status RPC #5796
Conversation
Hmm, getting an ASAN failure here, and failing sched tests.
Another thing to consider here is backwards compatibility (perhaps a good reason to keep the old sched.resource-status RPC in place).
I'm looking into the ASan errors, which are reproducible but only in the CI environment. In this case, libasan is getting a SEGV while running …

I have no idea why this just started happening, nor what to try next. Maybe I will try moving the ASan build to Fedora 36 and see if that addresses this issue.
Looking at the first sched test failure, the test fakes resources consisting of 4 nodes, each with 44 cores and 4 gpus. In the first failure, it allocates two nodes exclusively and expects …

But what it actually shows is …
I think what may be going on is that librlist doesn't handle gpus for the things I'm asking it to do. For completeness, the new objects returned are:

all:
```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0-3",
        "children": {
          "core": "0-43",
          "gpu": "0-3"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682,3179,3683,3178]"
    ]
  }
}
```

alloc:
```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0,2",
        "children": {
          "core": "0-43"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682-3683]"
    ]
  }
}
```
Uh oh, that sounds like an unfortunate bug :-(
FWIW I was able to fix the sched test failures by returning the allocated resource set from the job-manager directly. Some other tests are failing now in flux-core that weren't before, so I need to sort that out and then post an update here.
I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).

My other worry here is that librlist starts to get pretty slow when there are a lot of resources. I worry this will have a significant impact on throughput (I know you were already planning on testing that, but just a heads up). For example, in the testing in flux-framework/flux-sched#1137, we saw …
Ah: the allocated set from …
Ah, probably dropped in …
Yeah that would allow the original technique I was using to work. Supporting gpus in sched-simple seems like a pretty big bonus as well!
I was hoping that the …
Hmm, actually …
I'm not sure, but I did think properties were passed through an alloc, e.g. on fluke the queue properties are available in subinstances (whether this is right or not is questionable).
Hmm, maybe something was wrong with my test then. Thanks for that! Edit: oh, that would be fluxion though. Maybe sched-simple behaves differently.
I added a workaround for now, in which properties are re-added to the allocated set by the resource module if missing.
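To make that workaround concrete, here is a minimal standalone Python sketch of the idea. The helper names are my own invention and R objects are handled as plain RFC 20 dicts; the real code operates on struct rlist in C:

```python
def expand_idset(s):
    # Expand an RFC 22 idset string like "0-3,7" into a set of ints.
    ids = set()
    for part in s.split(","):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids

def encode_idset(ids):
    # Encode a set of ints back into an RFC 22 idset string.
    ids, out, i = sorted(ids), [], 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        out.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(out)

def copy_missing_properties(alloc, inventory):
    # If the allocated R lacks properties, copy each property from the
    # resource inventory R, restricted to the allocated ranks.
    if "properties" in alloc["execution"]:
        return alloc
    alloc_ranks = set()
    for entry in alloc["execution"]["R_lite"]:
        alloc_ranks |= expand_idset(entry["rank"])
    props = {}
    for name, ranks in inventory["execution"].get("properties", {}).items():
        common = expand_idset(ranks) & alloc_ranks
        if common:
            props[name] = encode_idset(common)
    if props:
        alloc["execution"]["properties"] = props
    return alloc
```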
FWIW …
Oof. I guess sched-simple is so slow at that scale it doesn't matter. I'd like to see results for something smaller as well, just in case though.
Here are some more numbers. It does look like there is a negative effect: …
Hmm, maybe instead of keeping a "running R" in the job manager, it would be better (less impact on throughput) to simply gather the R's upon request and either combine them in the job manager or even ship them to the resource module and do it there.
The impact is not as bad as I'd feared! The rlist implementation is kind of dumb and not optimized at all, so we could look into improving it. This would help a bit with #5819 too. However, since each time a user runs … I'll also note that the slow …

Sorry, all that may not have been too helpful. Just throwing some thoughts out there.
In case it wasn't clear, this PR already moves the contact point for the tools to the resource module, but the resource module doesn't do a whole lot except prepare the response. The RPC in the job manager that returns the …
That makes sense, sorry I misunderstood. So on receipt of a …

Shifting some of the work to the resource module seems like a good idea to me. Your comment about just shipping to the resource module makes a lot of sense now that I better understand.
(force-pushed from ce564ba to b1f3819)
Just pushed the change discussed above. I started testing throughput and then realized this doesn't touch the critical path at all, so there is little point. The job manager RPC now just walks all active jobs, creating a union of all allocated Rs, which it returns. It's kind of dumb and simple, but if it turns out to be slow in some scenario (for example, if there are a huge number of active jobs), there are obvious things to do to improve responsiveness, such as caching the result until the next alloc/free, keeping all jobs with resources in a list, etc.
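A toy version of that union step, reusing the expand_idset/encode_idset helpers from the sketch above. It merges R_lite entries whose "children" match and ignores nodelist/starttime bookkeeping, which the real C code also has to handle:

```python
import json

def union_rlite(r_objects):
    # Union the R_lite sections of several RFC 20 R objects by merging
    # the rank idsets of entries with identical children.
    by_children = {}
    for r in r_objects:
        for entry in r["execution"]["R_lite"]:
            key = json.dumps(entry["children"], sort_keys=True)
            by_children.setdefault(key, set()).update(
                expand_idset(entry["rank"]))
    return [
        {"rank": encode_idset(ranks), "children": json.loads(key)}
        for key, ranks in sorted(by_children.items())
    ]
```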
I've updated the description with a few todos, assuming this approach is acceptable. I'm leaning towards splitting the …
Here's some timing results on this current branch vs master. Huge improvements: …

on this branch: …

This result is somewhat obvious, since this branch removes the influence of the scheduler on the timing for …. Another useful test might be to do a similar run with a lot of active jobs, though I'm sure this branch will still be a net win.
I didn't see this early comment addressed: …

It might be ok to just break compatibility in this case, but I wonder if we can take advantage of the fact that the …
Oh, sorry! I took your comment to heart and left the old RPC in place. To restore a bit of test coverage without much effort, I added a FLUX_RESOURCE_STATUS_RPC environment variable that allows the topic string used in the python bindings to be overridden.
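For example, a sketch of the override (note: the final commits on this PR name the variable FLUX_RESOURCE_LIST_RPC; the raw RPC comparison uses the bindings' standard flux.Flux.rpc call):

```python
import os

# Point the python bindings at the legacy scheduler RPC (variable name
# as it appears in the final commits on this PR).
os.environ["FLUX_RESOURCE_LIST_RPC"] = "sched.resource-status"

import flux

h = flux.Flux()
# For comparison, the two RPCs side by side; the new one must be
# directed to rank 0 since the resource module loads on all ranks.
new = h.rpc("resource.sched-status", nodeid=0).get()
old = h.rpc("sched.resource-status").get()
print(sorted(new.keys()) == sorted(old.keys()))  # compatible payloads
```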
Ah, the case I was thinking about is a new client talking to an old server. I'm still waffling on whether this is worth the effort, since it would probably have to be handled explicitly in all use cases (perhaps we could use some trick to make it automatic in Python though).
Ah, I was thinking old client + new server. You are right, for new client + old server, we'd probably have to retry with the old topic string on ENOSYS.
If we want to do that, I wonder if it would be simplest to create a C wrapper function (it could return an empty future that is either fulfilled by the first or retry response).
If it simplifies things, supporting backwards compatibility could be done in a follow-on PR (and I'd be willing to work on that if you've got other things going).
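A rough Python sketch of that retry-on-ENOSYS idea (the thread above proposes doing this in C with a composite future; this is only an illustration, relying on the bindings raising OSError with errno set when an RPC fails):

```python
import errno

import flux

def sched_resource_status(h):
    # Prefer the new resource.sched-status RPC; fall back to the
    # legacy scheduler RPC if the server predates it (ENOSYS).
    try:
        return h.rpc("resource.sched-status", nodeid=0).get()
    except OSError as e:
        if e.errno != errno.ENOSYS:
            raise
        return h.rpc("sched.resource-status").get()

resp = sched_resource_status(flux.Flux())
```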
This looks great to me, and what an improvement! Seems like this is the way we should have been doing things all along.
I didn't really see much to comment on, and given that everyone's pretty annoyed that flux resource list seems to take >30s on elcap at the moment, we should get this in ASAP.
Amazing work!
BTW, I tried to add a "worst case scenario" to my benchmark by having one running job per node (so 16K jobs in the largest test case). The results show good performance even for that case:

```
SCHEDULER     NNODES   T(resource.sched-status)   T(flux resource list)
sched-simple     128   0.168                      0.196
sched-simple     256   0.178                      0.203
sched-simple     512   0.193                      0.235
sched-simple    1024   0.229                      0.293
sched-simple    2048   0.309                      0.420
sched-simple    4096   0.481                      0.694
sched-simple    8192   0.881                      1.295
sched-simple   16384   1.835                      2.633
```
Impact of many jobs doesn't seem to be too bad! I say we get this in.
```
* posted. Thus, after waiting, resource.status (flux resource status)
* should show those ranks up, while sched.resource-status
* (flux resource list command) may still show them down.
```
Commit message problem statement: do you mean flux resource list instead of flux resource status here?
```c
static int update_properties (struct rlist *alloc, struct rlist *all)
{
    struct idset *alloc_ranks;
    json_t *props;
    /* ... */
```
I cannot recall, did we ever determine whether the missing properties in the response are a bug in rlist? If so, is there an open issue? I'd be happy to attempt to address that, since it is slightly depressing that the resource module has to do this extra work...
Well, I wasn't sure if it's actually desirable to have properties propagate to subinstances. Currently there is inconsistent behavior in fluxion (includes properties) vs sched-simple (omits properties). Queue properties are clearly irrelevant, but maybe other properties would be useful. For example, on my test cluster I have properties assigned for different memory amounts, and those would be useful to propagate. I can open an issue and we can discuss there if you like.
I may have misspoken there! At least in a quick test, sched-simple does seem to assign properties in the R allocated to jobs. More testing is needed to see where exactly properties were being dropped.
Problem: 'flux resource list' can take a long time when the scheduler is busy. Add a new 'resource.sched-status' RPC method that returns a payload compatible with 'sched.resource-status'. It avoids contacting the scheduler by making use of the new 'job-manager.resource-status' RPC. Properties are sometimes not included in the R returned by the job manager (see flux-framework#5826), so copy any applicable properties from the resource inventory into the allocated set before returning it in the RPC response.
Problem: when a client disconnects with a resource.status RPC pending, the RPC is not aborted. Change the resource module disconnect handler to allow non-owners to send the message (and verify that the acquire handler is using proper authentication). Then call status_disconnect() from the handler. Since the disconnect handler also calls acquire_disconnect() and that only works on rank 0, add a check so it doesn't segfault.
Problem: updates to the summary pane are subject to large delays when the scheduler is busy. Use resource.sched-status instead of sched.resource-status to poll for resource information.
Problem: 'flux resource list' can take a long time when the scheduler is busy. Use resource.sched-status instead of sched.resource-status to request resource information. Force the RPC to go to rank 0 since, unlike the scheduler, resource is loaded on all ranks and would fail the RPC if first sent to the local broker.
Problem: test code would need to be duplicated or written from scratch to get coverage for the sched.resource-status RPC, which needs to be retained for a while for backwards compatibility. Add support for a FLUX_RESOURCE_LIST_RPC environment variable. If set to an alternate topic string, it is used by the python SchedResourceList class instead of the new resource.sched-status RPC.
Problem: the FLUX_RESOURCE_LIST_RPC environment variable is undocumented. Add it to the man page in the test section.
Problem: the sched.resource-status RPC no longer has test coverage. Add a couple of trivial tests that verify it still works, using the FLUX_RESOURCE_LIST_RPC environment variable just added.
Problem: the flux-resource(1) man page mentions querying the scheduler for resource status several times, but this is no longer done. Use "scheduler view of resources" instead where applicable. Although the resource module now answers this query, it provides the same "view" as before.
Problem: there is no test coverage for running 'flux resource list' on a broker rank other than the leader, but the resource module loads on all ranks and only offers resource.sched-status on rank 0. Make sure it works on ranks other than zero.
Fixed the commit typo and opened #5826 on the missing properties. I'll go ahead and set MWP. Thanks!
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5796      +/-   ##
==========================================
- Coverage   83.33%   83.29%   -0.05%
==========================================
  Files         509      510       +1
  Lines       82528    82817     +289
==========================================
+ Hits        68776    68982     +206
- Misses      13752    13835      +83
```
Hmm, in one of the builders, …
As discussed in #5776, this extends resource.status to include the keys found in sched.resource-status, and switches the tools to query resource.status instead of sched.resource-status.

Marking as a WIP pending:
- decide what to do with the alloc-check plugin, since this effectively does the same thing. Edit: second iteration of the design no longer does this.
- decide if it was a good idea to combine the two RPCs or whether we should provide two, or perhaps take a step back and design a new one. Edit: let's have two RPCs, just like before.
- should we leave the scheduler RPC handlers in place and provide some way to use them in test? Edit: keep for back compat, but add tests so the code is not left uncovered.
- update flux-resource(1), which mentions "scheduler view of resources" a lot. Edit: view is ok - just fix language about explicitly contacting the scheduler.
- flux resource status currently calls the resource.status RPC twice. Edit: now that there are two RPCs again, this is perfectly fine.