add a faster way to get resource allocation status than sched.resource-status RPC #5796

Merged: 13 commits from the issue#5776 branch merged into flux-framework:master on Mar 25, 2024

Conversation

@garlick (Member) commented Mar 15, 2024

As discussed in #5776, this adds

  • resource tracking in the job manager and a new RPC to request the current set of allocated resources
  • expansion of the payload of resource.status to include the keys found in sched.resource-status
  • update users (top, python) to call resource.status instead of sched.resource-status (a minimal client-side sketch follows at the end of this description)

Marking as a WIP pending

  • see what the throughput impact is. Edit: none at this point
  • decide what to do with the alloc-check plugin since this effectively does the same thing. Edit: the second iteration of the design no longer does this
  • decide if it was a good idea to combine the two RPCs or whether we should provide two, or perhaps take a step back and design a new one. Edit: let's have two RPCs, just like before.
  • should we leave the scheduler RPC handlers in place and provide some way to use them in test? Edit: keep for back compat, but add tests so the code is not left uncovered
  • update flux-resource(1), which mentions "scheduler view of resources" a lot. Edit: "view" is OK; just fix the language about explicitly contacting the scheduler
  • flux resource status currently calls the resource.status RPC twice. Now that there are two RPCs again, this is perfectly fine.
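
For illustration, here is a minimal sketch of the client-side change (pointing tools at the resource module instead of the scheduler), using the Python bindings and assuming the final two-RPC design settled on in the edits above and the commit messages below, where the new resource.sched-status RPC carries the same payload as sched.resource-status and is answered by the resource module on rank 0:

    import flux

    h = flux.Flux()

    # before: ask the scheduler directly (slow when the scheduler is busy)
    # status = h.rpc("sched.resource-status").get()

    # after: ask the resource module on rank 0 instead
    status = h.rpc("resource.sched-status", nodeid=0).get()
    print(sorted(status.keys()))  # same payload keys as the old scheduler RPC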

@garlick (Member Author) commented Mar 15, 2024

Hmm, getting an ASAN failure here:

2024-03-15T15:21:32.2114399Z Making check in common
2024-03-15T15:21:32.2245352Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T15:21:32.2246109Z =================================================================
2024-03-15T15:21:32.2249768Z ==37186==ERROR: AddressSanitizer: SEGV on unknown address 0x63abbfc31750 (pc 0x7ab1a6bb7cb7 bp 0x000000000000 sp 0x7ffe4ec09ea0 T0)
2024-03-15T15:21:32.2251177Z ==37186==The signal is caused by a READ memory access.
2024-03-15T15:21:32.2254299Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T15:21:32.2254926Z AddressSanitizer: nested bug in the same thread, aborting.
2024-03-15T15:21:32.2258036Z make[1]: *** [Makefile:1027: check-recursive] Error 1
2024-03-15T15:21:32.2308431Z make: *** [Makefile:514: check-recursive] Error 1

and failing sched tests

2024-03-15T15:30:30.0668719Z 18/92 Test #20: t1012-find-status.t ...................***Failed    5.82 sec

2024-03-15T15:30:30.0744267Z expecting success: 
2024-03-15T15:30:30.0744721Z     flux resource list > resource.list.out &&
2024-03-15T15:30:30.0746652Z     validate_list_row resource.list.out 2 0 0 0 &&
2024-03-15T15:30:30.0747487Z     validate_list_row resource.list.out 3 4 176 16 &&
2024-03-15T15:30:30.0748166Z     validate_list_row resource.list.out 4 1 44 4
2024-03-15T15:30:30.0748609Z 
2024-03-15T15:30:30.0749421Z not ok 14 - find/status: flux resource list works

2024-03-15T15:30:30.0785797Z 27/92 Test #29: t1021-qmanager-nodex.t ................***Failed    6.44 sec

2024-03-15T15:30:30.0801907Z expecting success: 
2024-03-15T15:30:30.0802300Z     cat >status.expected1 <<-EOF &&
2024-03-15T15:30:30.0802597Z 	2 88 8
2024-03-15T15:30:30.0802784Z 	2 88 8
2024-03-15T15:30:30.0803047Z EOF
2024-03-15T15:30:30.0803284Z     flux resource list > resources.out1 &&
2024-03-15T15:30:30.0803756Z     cat resources.out1 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0804304Z 	| awk "{ print \$2,\$3,\$4 }" > status.out1 &&
2024-03-15T15:30:30.0804878Z     test_cmp status.expected1 status.out1
2024-03-15T15:30:30.0805121Z 
2024-03-15T15:30:30.0805347Z --- status.expected1	2024-03-15 15:29:16.098812556 +0000
2024-03-15T15:30:30.0805809Z +++ status.out1	2024-03-15 15:29:16.286811847 +0000
2024-03-15T15:30:30.0806168Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0806398Z -2 88 8
2024-03-15T15:30:30.0806601Z -2 88 8
2024-03-15T15:30:30.0806781Z +4 88 16
2024-03-15T15:30:30.0806973Z +2 88 0
2024-03-15T15:30:30.0807303Z not ok 6 - qmanager-nodex: free/alloc node count (hinodex)

2024-03-15T15:30:30.0832122Z expecting success: 
2024-03-15T15:30:30.0832615Z     cat >status.expected2 <<-EOF &&
2024-03-15T15:30:30.0833130Z 	0 0 0
2024-03-15T15:30:30.0833446Z 	4 176 16
2024-03-15T15:30:30.0833770Z EOF
2024-03-15T15:30:30.0834161Z     flux resource list > resources.out2 &&
2024-03-15T15:30:30.0834930Z     cat resources.out2 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0835613Z         | awk "{ print \$2,\$3,\$4 }" > status.out2 &&
2024-03-15T15:30:30.0836241Z     test_cmp status.expected2 status.out2
2024-03-15T15:30:30.0836615Z 
2024-03-15T15:30:30.0836947Z --- status.expected2	2024-03-15 15:29:16.670810399 +0000
2024-03-15T15:30:30.0837648Z +++ status.out2	2024-03-15 15:29:16.830809797 +0000
2024-03-15T15:30:30.0838235Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0838607Z -0 0 0
2024-03-15T15:30:30.0838961Z -4 176 16
2024-03-15T15:30:30.0839303Z +4 0 16
2024-03-15T15:30:30.0839643Z +4 176 0
2024-03-15T15:30:30.0840206Z not ok 9 - qmanager-nodex: free/alloc node count 2 (hinodex)

2024-03-15T15:30:30.0865149Z expecting success: 
2024-03-15T15:30:30.0865545Z     cat >status.expected3 <<-EOF &&
2024-03-15T15:30:30.0865849Z 	2 88 8
2024-03-15T15:30:30.0866036Z 	2 88 8
2024-03-15T15:30:30.0866301Z EOF
2024-03-15T15:30:30.0866548Z     flux resource list > resources.out3 &&
2024-03-15T15:30:30.0867207Z     cat resources.out3 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0867639Z         | awk "{ print \$2,\$3,\$4 }" > status.out3 &&
2024-03-15T15:30:30.0868319Z     test_cmp status.expected3 status.out3
2024-03-15T15:30:30.0868701Z 
2024-03-15T15:30:30.0869053Z --- status.expected3	2024-03-15 15:29:17.794806161 +0000
2024-03-15T15:30:30.0869520Z +++ status.out3	2024-03-15 15:29:17.894805784 +0000
2024-03-15T15:30:30.0869875Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0870177Z -2 88 8
2024-03-15T15:30:30.0870528Z -2 88 8
2024-03-15T15:30:30.0870721Z +4 88 16
2024-03-15T15:30:30.0870906Z +2 88 0
2024-03-15T15:30:30.0871255Z not ok 16 - qmanager-nodex: free/alloc node count (lonodex)

2024-03-15T15:30:30.0880480Z expecting success: 
2024-03-15T15:30:30.0880794Z     cat >status.expected4 <<-EOF &&
2024-03-15T15:30:30.0881078Z 	0 0 0
2024-03-15T15:30:30.0881271Z 	4 176 16
2024-03-15T15:30:30.0881467Z EOF
2024-03-15T15:30:30.0881692Z     flux resource list > resources.out4 &&
2024-03-15T15:30:30.0882111Z     cat resources.out4 | grep -E "(free|alloc)" \
2024-03-15T15:30:30.0882511Z         | awk "{ print \$2,\$3,\$4 }" > status.out4 &&
2024-03-15T15:30:30.0882887Z     test_cmp status.expected4 status.out4
2024-03-15T15:30:30.0883127Z 
2024-03-15T15:30:30.0883381Z --- status.expected4	2024-03-15 15:29:18.218804562 +0000
2024-03-15T15:30:30.0883894Z +++ status.out4	2024-03-15 15:29:18.442803717 +0000
2024-03-15T15:30:30.0884254Z @@ -1,2 +1,2 @@
2024-03-15T15:30:30.0884514Z -0 0 0
2024-03-15T15:30:30.0884909Z -4 176 16
2024-03-15T15:30:30.0885248Z +4 0 16
2024-03-15T15:30:30.0885587Z +4 176 0
2024-03-15T15:30:30.0886227Z not ok 19 - qmanager-nodex: free/alloc node count 2 (lonodex)


@grondo (Contributor) commented Mar 15, 2024

Another thing to consider here is backwards compatibility (perhaps a good reason to keep the sched.resource-status RPC for a bit). E.g. if a newer instance of Flux is running under an older version, tools like flux jobs, flux top, and flux pstree would get ENOSYS when attempting to get the subinstance resource information.

@grondo (Contributor) commented Mar 15, 2024

I'm looking into the ASan errors, which are reproducible but only in the CI environment. In this case, libasan is getting a SEGV while running make. Here's more info:

runner@fv-az523-355:/usr/src/src$ ASAN_OPTIONS=$ASAN_OPTIONS,verbosity=1 LD_PRELOAD=/usr/lib64/libasan.so.6 make
==134==AddressSanitizer: failed to intercept '__isoc99_printf'
==134==AddressSanitizer: failed to intercept '__isoc99_sprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_snprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_fprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vsprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vsnprintf'
==134==AddressSanitizer: failed to intercept '__isoc99_vfprintf'
==134==AddressSanitizer: failed to intercept 'xdr_quad_t'
==134==AddressSanitizer: failed to intercept 'xdr_u_quad_t'
==134==AddressSanitizer: failed to intercept 'xdr_destroy'
==134==AddressSanitizer: libc interceptors initialized
|| `[0x10007fff8000, 0x7fffffffffff]` || HighMem    ||
|| `[0x02008fff7000, 0x10007fff7fff]` || HighShadow ||
|| `[0x00008fff7000, 0x02008fff6fff]` || ShadowGap  ||
|| `[0x00007fff8000, 0x00008fff6fff]` || LowShadow  ||
|| `[0x000000000000, 0x00007fff7fff]` || LowMem     ||
MemToShadow(shadow): 0x00008fff7000 0x000091ff6dff 0x004091ff6e00 0x02008fff6fff
redzone=16
max_redzone=2048
quarantine_size_mb=256M
thread_local_quarantine_size_kb=1024K
malloc_context_size=30
SHADOW_SCALE: 3
SHADOW_GRANULARITY: 8
SHADOW_OFFSET: 0x7fff8000
==134==Installed the sigaction for signal 11
==134==Installed the sigaction for signal 7
==134==Installed the sigaction for signal 8
==134==Deactivating ASan
AddressSanitizer:DEADLYSIGNAL
=================================================================
==134==ERROR: AddressSanitizer: SEGV on unknown address 0x60cf8ad2c9f8 (pc 0x71fac17b5cb7 bp 0x000000000000 sp 0x7ffd602751c0 T0)
==134==The signal is caused by a READ memory access.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.

I have no idea why this just started happening, nor what to try next. Maybe I will try moving the ASan build to Fedora 36 and see if that addresses this issue.

@garlick (Member Author) commented Mar 16, 2024

Looking at the first sched test failure, the test fakes resources consisting of 4 nodes, each with 44 cores and 4 gpus. In the first failure, it allocates two nodes exclusively and expects flux resource list to show

  • free: 2 nodes, 88 cores, 8 gpus
  • alloc: 2 nodes, 88 cores, 8 gpus

But what it actually shows is

  • free: 4 nodes, 88 cores, 16 gpus
  • alloc: 2 nodes, 88 cores, 0 gpus

I think what may be going on is that librlist doesn't handle gpus for the things I'm asking it to do.

For completeness, the new objects returned are:

all

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0-3",
        "children": {
          "core": "0-43",
          "gpu": "0-3"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682,3179,3683,3178]"
    ]
  }
}

alloc

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0,2",
        "children": {
          "core": "0-43"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682-3683]"
    ]
  }
}
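
As a standalone illustration of the arithmetic above, the totals can be tallied directly from an R object like these (count_ids and summarize are hypothetical helpers for this sketch, not librlist or the Flux API, and assume the R_lite layout shown):

    def count_ids(idset):
        """Count the members of an idset string like "0-43" or "0,2"."""
        total = 0
        for part in idset.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                total += int(hi) - int(lo) + 1
            else:
                total += 1
        return total

    def summarize(R):
        """Tally (nnodes, ncores, ngpus) from an RFC 20 R object."""
        nnodes = ncores = ngpus = 0
        for entry in R["execution"]["R_lite"]:
            n = count_ids(entry["rank"])
            nnodes += n
            children = entry["children"]
            if "core" in children:
                ncores += n * count_ids(children["core"])
            if "gpu" in children:
                ngpus += n * count_ids(children["gpu"])
        return nnodes, ncores, ngpus

Running summarize() on the "alloc" object gives (2, 88, 0): the gpu children are simply absent, which lines up with the 0 allocated / 16 free gpus reported by the failing test.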

@grondo (Contributor) commented Mar 17, 2024

Uhoh, that sounds like an unfortunate bug :-(

@garlick (Member Author) commented Mar 20, 2024

FWIW I was able to fix the sched test failures by returning the allocated resource set from the job-manager directly in resource.status, instead of marking those resources allocated in the full rlist and then extracting them again with rlist_copy_allocated().

Some other tests are failing now in flux-core that weren't before so I need to sort that out then post an update here.

@grondo (Contributor) commented Mar 20, 2024

I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).

My other worry here is that librlist starts to get pretty slow when there are a lot of resources. I worry this will have a significant impact on throughput (I know you were already planning on testing that, but just a heads up). For example, in the testing in flux-framework/flux-sched#1137, we saw sched.resource-status for sched-simple taking >2s for 16K nodes.

@garlick (Member Author) commented Mar 20, 2024

Ah: the allocated set from job-manager.resource-status does not include properties, so tests involving queues are failing in t2350-resource-list.t and t2801-top-cmd.t. I'm not sure if this is because sched-simple isn't returning them in the alloc response, or they are disappearing when passing through rlist_append(). In any case, they were previously included because the allocated set was carved out of the original R with rlist_copy_allocated().

@grondo (Contributor) commented Mar 20, 2024

Ah, probably dropped in rlist_append and that seems like a bug :-(

@garlick (Member Author) commented Mar 20, 2024

I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).

Yeah that would allow the original technique I was using to work. Supporting gpus in sched-simple seems like a pretty big bonus as well!

My other worry here is that librlist starts to get pretty slow when there are a lot of resources.

I was hoping that the rlist_append() / rlist_diff() functions I am using in job-manager would not be too bad, but after I get this working, I'll probe that with a large resource set.

@garlick (Member Author) commented Mar 20, 2024

Hmm, actually rlist_append() seems to pass properties through OK, but it doesn't look like sched-simple is returning queue properties in allocated Rs. Is it possible that was by design?

@grondo (Contributor) commented Mar 20, 2024

but it doesn't look like sched-simple is returning queue properties in allocated Rs. Is it possible that was by design?

I'm not sure, but I did think properties were passed through an alloc, e.g. on fluke the queue properties are available in subinstances (whether this is right or not is questionable)

 grondo@fluke108:~$ flux alloc -N1 flux resource list
     STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
      free batch           1        4        0 fluke95
 allocated                 0        0        0 
      down                 0        0        0 
[detached: session exiting]

@garlick (Member Author) commented Mar 20, 2024

Hmm, maybe something was wrong with my test then. Thanks for that!

Edit: oh, that would be fluxion though. Maybe sched-simple behaves differently.

@garlick (Member Author) commented Mar 20, 2024

I added a workaround for now, in which properties are re-added to the allocated set by the resource module if missing.
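
Roughly, the idea in pure-Python terms (a sketch of the concept only, not the C code in the resource module; expand, encode, and this update_properties are ad hoc helpers, and RFC 20 properties are assumed to be a name-to-rank-idset mapping under "execution"):

    def expand(idset):
        """Expand an idset string like "0-3,7" into a set of ranks."""
        ranks = set()
        for part in idset.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                ranks.update(range(int(lo), int(hi) + 1))
            else:
                ranks.add(int(part))
        return ranks

    def encode(ranks):
        """Encode a set of ranks as a simple comma-separated idset string."""
        return ",".join(str(r) for r in sorted(ranks))

    def update_properties(alloc, inventory):
        """Copy properties from the inventory R into the allocated R,
        restricted to the allocated ranks, if the allocated R has none."""
        if "properties" in alloc["execution"]:
            return
        alloc_ranks = set()
        for entry in alloc["execution"]["R_lite"]:
            alloc_ranks |= expand(entry["rank"])
        props = {}
        for name, idset in inventory["execution"].get("properties", {}).items():
            overlap = expand(idset) & alloc_ranks
            if overlap:
                props[name] = encode(overlap)
        if props:
            alloc["execution"]["properties"] = props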

@garlick (Member Author) commented Mar 21, 2024

FWIW throughput.py is stable at about 21 jobs/sec with 16K nodes/32 cores each, on master and on this branch with sched-simple. 🤷

@grondo (Contributor) commented Mar 21, 2024

Oof. I guess sched-simple is so slow at that scale it doesn't matter. I'd like to see results for something smaller as well just in case though.

@garlick (Member Author) commented Mar 22, 2024

Here are some more numbers. It does look like there is a negative effect.

throughput.py -n 1000 (each run in a new instance)

Node: 32 cores / 4 gpus

        fluxion simple  fluxion simple
        master  master  #5776   #5776
nnodes  thruput thruput thruput thruput
64      595.4   497.0   551.8   505.9
        625.9   523.2   570.5   524.1
        600.4   508.8   527.7   497.4
128     441.4   522.4   429.6   497.6
        468.7   495.4   440.9   485.9
        484.6   505.2   426.3   475.3
256     251.7   501.7   242.0   469.6
        254.7   499.7   238.6   481.4
        249.9   491.5   233.3   454.2
512     142.5   463.2   138.4   456.5
        143.6   482.7   138.2   456.8
        146.2   465.7   137.3   460.5
1024    69.0    443.1   67.8    431.7
        68.6    472.7   68.6    428.7
        68.6    466.5   68.4    449.3
2048    33.1    434.5   32.6    428.5
        32.8    458.5   32.8    428.3
        32.8    447.5   33.5    435.2
4096    16.9    301.5   16.5    285.4
        16.9    293.1   16.5    275.2
        16.8    288.0   16.6    283.2
8192    8.3     104.1   8.2     103.3
        8.3     108.2   8.4     97.7
        8.3     108.0   8.3     103.3
16384   4.1     24.2    4.0     24.0
        4.1     23.9    4.0     23.5
        4.0     24.3    4.1     24.0

@garlick (Member Author) commented Mar 22, 2024

Hmm, maybe instead of keeping a "running R" in the job manager, it would be better (less impact on throughput) to simply gather the R's upon request and either combine them in the job manager or even ship them to the resource module and do it there.

@grondo (Contributor) commented Mar 22, 2024

The impact is not as bad as I'd feared! The rlist implementation is kind of dumb and not optimized at all, so we could look into improving it. This would help a bit with #5819 too.

However, since each time a user runs flux resource it will interrupt the job manager, moving the resource-list service to the resource or another module like we did for the job-list service is not a bad idea IMO.

I'll also note that the slow sched.resource-status RPC is an immediate concern on existing clusters, so a solution that works within the next couple releases would be better than spending more time implementing the "perfect" solution (However, it would probably be bad to break compatibility multiple times)

Sorry, all that may not have been too helpful. Just throwing some thoughts out there.

@garlick (Member Author) commented Mar 22, 2024

In case it wasn't clear, this PR already moves the contact point for the tools to the resource module, but the resource module doesn't do a whole lot except prepare the response. The RPC in the job manager that returns the allocated set is used exclusively by the resource module. So my comment above was about shifting some of the work from the job manager to the resource module to reduce the penalty on every allocation.

@grondo (Contributor) commented Mar 22, 2024

So my comment above was about shifting some of the work from the job manager to the resource module to reduce the penalty on every allocation.

That makes sense, sorry I misunderstood. So on receipt of a resource.status request, the resource module then requests the allocated set from the job manager, but in the current implementation the job-manager has already prepared the allocated resource set, so all it has to do is return it? This is a pretty clever implementation IMO.

Shifting some of the work to the resource module seems like a good idea to me. Your comment about just shipping to the resource module makes a lot of sense now that I better understand.

@garlick force-pushed the issue#5776 branch 2 times, most recently from ce564ba to b1f3819, on March 23, 2024 02:45
@garlick (Member Author) commented Mar 23, 2024

Just pushed the change discussed above. I started testing throughput and then realized this doesn't touch the critical path at all so there is little point.

The job manager RPC now just walks all active jobs, creating a union of all allocated Rs, which it returns. It's kind of dumb and simple, but if it turns out to be slow in some scenario (for example, if there are a huge number of active jobs), there are obvious things to do to improve responsiveness, such as caching the result until the next alloc/free, keeping all jobs with resources in a list, etc.

@garlick (Member Author) commented Mar 23, 2024

I've updated the description with a few todos, assuming this approach is acceptable.

I'm leaning towards splitting the resource.status RPC into two RPCs again. Since the job manager query is now somewhat costly (depending on the number of active jobs), it probably makes sense to avoid it when only the previous resource.status fields are really wanted.

@grondo (Contributor) commented Mar 25, 2024

Here are some timing results on this branch vs master. Huge improvements:
master:

SCHEDULER       NNODES T(sched.resource-status)  T(flux resource list)
sched-simple       128                    0.175                  0.191
sched-simple       256                    0.166                  0.197
sched-simple       512                    0.186                  0.218
sched-simple      1024                    0.209                  0.254
sched-simple      2048                    0.239                  0.336
sched-simple      4096                    0.322                  0.510
sched-simple      8192                    0.610                  0.864
sched-simple     16384                    1.228                  1.588
fluxion            128                    0.273                  0.300
fluxion            256                    0.392                  0.431
fluxion            512                    0.648                  0.696
fluxion           1024                    1.167                  1.208
fluxion           2048                    2.161                  2.275
fluxion           4096                    4.228                  4.478
fluxion           8192                    8.352                  8.650
fluxion          16384                   16.570                 17.434

on this branch:

SCHEDULER       NNODES T(resource.sched-status)  T(flux resource list)
sched-simple       128                    0.159                  0.179
sched-simple       256                    0.176                  0.195
sched-simple       512                    0.185                  0.208
sched-simple      1024                    0.204                  0.248
sched-simple      2048                    0.229                  0.326
sched-simple      4096                    0.302                  0.492
sched-simple      8192                    0.454                  0.834
sched-simple     16384                    0.790                  1.550
fluxion            128                    0.169                  0.199
fluxion            256                    0.177                  0.205
fluxion            512                    0.184                  0.222
fluxion           1024                    0.202                  0.262
fluxion           2048                    0.237                  0.336
fluxion           4096                    0.307                  0.496
fluxion           8192                    0.462                  0.833
fluxion          16384                    0.770                  1.543

This result is somewhat obvious since this branch removes the influence of the scheduler on the timing for flux resource list, but I thought I'd share anyway.

Another useful test might be to do a similar run with a lot of active jobs, though I'm sure this branch will still be a net win.
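
For reference, timings like these need nothing more elaborate than wrapping the RPC in a timer from the Python bindings. A minimal sketch (illustrative only: topic names are the ones discussed in this PR, rank-0 addressing follows the commit messages, and it assumes the RPC takes no request payload; on master only the sched topic exists, on this branch both do):

    import sys
    import time
    import flux

    # usage: time-rpc.py [topic]
    #   e.g. sched.resource-status on master, resource.sched-status on this branch
    topic = sys.argv[1] if len(sys.argv) > 1 else "resource.sched-status"
    h = flux.Flux()
    t0 = time.monotonic()
    h.rpc(topic, nodeid=0).get()
    print(f"T({topic}) = {time.monotonic() - t0:.3f}s")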

@grondo (Contributor) commented Mar 25, 2024

I didn't see this early comment addressed:

Another thing to consider here is backwards compatibility (perhaps a good reason to keep the sched.resource-status RPC for a bit). E.g. if a newer instance of Flux is running under an older version, tools like flux jobs, flux top, and flux pstree would get ENOSYS when attempting to get the subinstance resource information.

It might be ok to just break compatibility in this case, but I wonder if we can take advantage of the fact that the resource.sched-status and sched.resource-status RPCs are compatible to support backwards compatibility in the tools for a short while. Sorry if you already responded to this comment and I missed it.

@garlick (Member Author) commented Mar 25, 2024

I didn't see this early comment addressed:

Oh, sorry! I took your comment to heart and left the old RPC in place. To restore a bit of test coverage without much effort, I added a FLUX_RESOURCE_STATUS_RPC environment variable that allows the topic string used in the python bindings to be overridden, e.g.

$ FLUX_RESOURCE_STATUS_RPC=sched.resource-status flux resource list

@grondo (Contributor) commented Mar 25, 2024

Ah, the case I was thinking about is flux top or other recursive tools being used against older versions of Flux (e.g. a user running a batch job that uses older Flux). If the newer flux top or flux pstree is used with the older versions, the RPC perhaps should be retried with sched.resource-status if resource.sched-status returns ENOSYS.

I'm still waffling on whether this is worth the effort, since it would probably have to be handled explicitly in all use cases (perhaps we could use some trick to make it automatic in Python though)
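
In Python, the retry could look something like this sketch (illustrative only; resource_sched_status is a made-up helper name, and it assumes a failed RPC surfaces as an OSError carrying the remote errno):

    import errno
    import flux

    def resource_sched_status(h):
        """Get the scheduler view of resources, falling back to the old
        topic string when talking to an older Flux instance."""
        try:
            return h.rpc("resource.sched-status", nodeid=0).get()
        except OSError as exc:
            if exc.errno != errno.ENOSYS:
                raise
            # older instance: resource.sched-status not available, ask the scheduler
            return h.rpc("sched.resource-status").get()

    status = resource_sched_status(flux.Flux())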

@garlick (Member Author) commented Mar 25, 2024

Ah, I was thinking old client + new server. You are right, for new client + old server, we'd probably have to retry with the old topic string on ENOSYS.

@grondo (Contributor) commented Mar 25, 2024

If we want to do that, I wonder if it would be simplest to create a C wrapper function (it could return an empty future that is either fulfilled by the first or retry response)

@grondo (Contributor) commented Mar 25, 2024

If it simplifies things, supporting backwards compatibility could be done in a follow-on PR (and I'd be willing to work on that if you've got other things going)

@grondo (Contributor) left a review comment:

This looks great to me, and what an improvement! Seems like this is the way we should have been doing things all along.

I didn't really see much to comment on, and given that everyone's pretty annoyed that flux resource list seems to take >30s on elcap at the moment, we should get this in ASAP.

Amazing work!

BTW, I tried to add a "worst case scenario" to my benchmark by having one running job per node (so 16K jobs in the largest test case). The results show good performance for even that case

SCHEDULER       NNODES T(resource.sched-status)  T(flux resource list)
sched-simple       128                    0.168                  0.196
sched-simple       256                    0.178                  0.203
sched-simple       512                    0.193                  0.235
sched-simple      1024                    0.229                  0.293
sched-simple      2048                    0.309                  0.420
sched-simple      4096                    0.481                  0.694
sched-simple      8192                    0.881                  1.295
sched-simple     16384                    1.835                  2.633

Impact of many jobs doesn't seem to be too bad! I say we get this in.

* posted. Thus, after waiting, resource.status (flux resource status)
* should show those ranks up, while sched.resource-status
* (flux resource list command) may still show them down.
* posted.
@grondo (Contributor) commented on the diff above:

Commit message: Problem statement: do you mean flux resource list instead of flux resource status here?

Comment on lines +209 to +212
static int update_properties (struct rlist *alloc, struct rlist *all)
{
struct idset *alloc_ranks;
json_t *props;
@grondo (Contributor) commented:

I cannot recall, did we ever determine if the missing properties in the response is a bug in rlist? If so, is there an open issue? I'd be happy to attempt to address that since it is slightly depressing the resource module has to do this extra work...

@garlick (Member Author) replied:

Well I wasn't sure if it's actually desirable to have properties propagate to subinstances. Currently there is inconsistent behavior in fluxion (includes properties) vs sched-simple (omits properties). Queue properties are clearly irrelevant, but maybe other properties would be useful. For example on my test cluster I have properties assigned for different memory amounts, and those would be useful to propagate. I can open an issue and we can discuss there if you like.

@garlick (Member Author) replied:

I may have misspoken there! At least in a quick test, sched-simple does seem to assign properties in the R allocated to jobs. More testing needed to see where exactly properties were being dropped.

garlick added 9 commits March 25, 2024 11:31
Problem: 'flux resource list' can take a long time when the scheduler
is busy.

Add a new 'resource.sched-status' RPC method that returns a compatible
payload to 'sched.resource-status'.  It avoids contacting the scheduler
by making use of the new 'job-manager.resource-status' RPC.

Properties are sometimes not included in the R returned by the job
manager (see flux-framework#5826), so copy any applicable
properties from the resource inventory into the allocated set before
returning it in the RPC response.

Problem: when a client disconnects with a resource.status RPC
pending, the RPC is not aborted.

Change the resource module disconnect handler to allow non-owners
to send the message (and verify that the acquire handler is using
proper authentication).  Then call status_disconnect() from the
handler.

Since the disconnect handler also calls acquire_disconnect() and
that only works on rank 0, add a check so it doesn't segfault.

Problem: updates to the summary pane are subject to large delays
when the scheduler is busy.

Use resource.sched-status instead of sched.resource-status to
poll for resource information.

Problem: 'flux resource list' can take a long time when the scheduler
is busy.

Use resource.sched-status instead of sched.resource-status to
request resource information.  Force the RPC to go to rank 0 since,
unlike the scheduler, resource is loaded on all ranks and would
fail the RPC if first sent to the local broker.

Problem: test code would need to be duplicated or written from scratch
to get coverage for the sched.resource-status RPC, which needs to be
retained for a while for backwards compatibility.

Add support for a FLUX_RESOURCE_LIST_RPC environment variable.  If set to an alternate
topic string, it is used by the python SchedResourceList class
instead of the new resource.sched-status RPC.

Problem: the FLUX_RESOURCE_LIST_RPC environment variable is
undocumented.

Add it to the man page in the test section.

Problem: the sched.resource-status RPC no longer has test coverage.

Add a couple of trivial tests that verify it still works, using
the FLUX_RESOURCE_LIST_RPC environment variable just added.

Problem: the flux-resource(1) man page mentions querying the
scheduler for resource status several times, but this is no longer
done.

Use "scheduler view of resources" instead where applicable.  Although
the resource module now answers this query, it provides the same "view"
as before.

Problem: there is no test coverage for running 'flux resource list'
on a broker rank other than the leader, but the resource module
loads on all ranks and only offers resource.sched-status on rank 0.

Make sure it works on ranks other than zero.
@garlick
Copy link
Member Author

garlick commented Mar 25, 2024

Fixed the commit typo and opened #5826 on the missing properties. I'll go ahead and set MWP. Thanks!

codecov bot commented Mar 25, 2024

Codecov Report

Merging #5796 (98dc6b5) into master (d115427) will decrease coverage by 0.05%.
The diff coverage is 68.14%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5796      +/-   ##
==========================================
- Coverage   83.33%   83.29%   -0.05%     
==========================================
  Files         509      510       +1     
  Lines       82528    82817     +289     
==========================================
+ Hits        68776    68982     +206     
- Misses      13752    13835      +83     
Files Coverage Δ
src/bindings/python/flux/job/info.py 93.76% <100.00%> (ø)
src/bindings/python/flux/resource/list.py 95.55% <100.00%> (+0.20%) ⬆️
src/modules/resource/monitor.c 67.21% <ø> (ø)
src/modules/resource/acquire.c 60.44% <85.71%> (+0.17%) ⬆️
src/modules/resource/resource.c 80.44% <80.00%> (+2.18%) ⬆️
src/cmd/top/summary_pane.c 85.37% <0.00%> (+1.58%) ⬆️
src/modules/job-manager/alloc.c 75.77% <63.63%> (-1.59%) ⬇️
src/modules/resource/status.c 68.23% <68.23%> (ø)

... and 14 files with indirect coverage changes

@garlick (Member Author) commented Mar 25, 2024

Hmm, in one of the builders, t2812-flux-job-last.t hangs in this test (which is for issue #4390).
I don't see how it could be related to this PR but it is a bit concerning. I'll open an issue and restart the builder.

 expecting success: 
  	flux job last "[:]" >lastdump.exp &&
  	flux dump dump.tgz &&
  	flux start -o,-Scontent.restore=dump.tgz \
  		flux job last "[:]" >lastdump.out &&
  	test_cmp lastdump.exp lastdump.out
  
  
  flux-dump: archived 1 keys
  flux-dump: archived 2 keys
  flux-dump: archived 3 keys
  flux-dump: archived 4 keys
  flux-dump: archived 5 keys
  flux-dump: archived 6 keys
  flux-dump: archived 7 keys
  flux-dump: archived 8 keys
  flux-dump: archived 9 keys
  flux-dump: archived 65 keys
  Mar 25 18:41:44.457761 job-manager.err[0]: replay warning: INACTIVE action failed on job fEyxE9d: Read-only file system
  Mar 25 18:41:44.785969 job-manager.err[0]: sched.alloc-response: id=fF1SDRy already allocated

mergify bot merged commit fea7f25 into flux-framework:master on Mar 25, 2024
34 of 35 checks passed
@garlick deleted the issue#5776 branch on March 25, 2024 22:40