add a faster way to get resource allocation status than sched.resource-status RPC #5796
Conversation
Hmm, getting an ASAN failure here, and failing sched tests.
Another thing to consider here is backwards compatibility (perhaps a good reason to keep the old sched.resource-status RPC in place).
I'm looking into the ASan errors, which are reproducible but only in the CI environment. In this case, libasan is getting a SEGV while running …

I have no idea why this just started happening, nor what to try next. Maybe I will try moving the ASan build to Fedora 36 and see if that addresses this issue.
Looking at the first sched test failure, the test fakes resources consisting of 4 nodes, each with 44 cores and 4 gpus. In the first failure, it allocates two nodes exclusively and expects …

But what it actually shows is …
I think what may be going on is that librlist doesn't handle gpus for the things I'm asking it to do. For completeness, the new objects returned are:

all:
```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0-3",
        "children": {
          "core": "0-43",
          "gpu": "0-3"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682,3179,3683,3178]"
    ]
  }
}
```

alloc:
```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0,2",
        "children": {
          "core": "0-43"
        }
      }
    ],
    "starttime": 0,
    "expiration": 0,
    "nodelist": [
      "sierra[3682-3683]"
    ]
  }
}
```
Uh oh, that sounds like an unfortunate bug :-(
FWIW I was able to fix the sched test failures by returning the allocated resource set from the job-manager directly. Some other tests are failing now in flux-core that weren't before, so I need to sort that out and then post an update here.
I wonder how much work it would be to properly support gpus in librlist? Perhaps that would then allow sched-simple to allocate them (currently it cannot).

My other worry here is that librlist starts to get pretty slow when there are a lot of resources. I worry this will have a significant impact on throughput (I know you were already planning on testing that, but just a heads up). For example, in the testing in flux-framework/flux-sched#1137, we saw …
Ah: the allocated set from …
Ah, probably dropped in …
Yeah that would allow the original technique I was using to work. Supporting gpus in sched-simple seems like a pretty big bonus as well!
I was hoping that the …
Hmm, actually …
I'm not sure, but I did think properties were passed through an alloc, e.g. on fluke the queue properties are available in subinstances (whether this is right or not is questionable).
Hmm, maybe something was wrong with my test then. Thanks for that! Edit: oh, that would be fluxion though. Maybe sched-simple behaves differently.
I added a workaround for now, in which properties are re-added to the allocated set by the resource module if missing.
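To make that workaround concrete, here is a minimal standalone Python sketch of the idea. The helper names are my own invention and R objects are handled as plain RFC 20 dicts; the real code operates on struct rlist in C:

```python
def expand_idset(s):
    # Expand an RFC 22 idset string like "0-3,7" into a set of ints.
    ids = set()
    for part in s.split(","):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids

def encode_idset(ids):
    # Encode a set of ints back into an RFC 22 idset string.
    ids, out, i = sorted(ids), [], 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        out.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(out)

def copy_missing_properties(alloc, inventory):
    # If the allocated R lacks properties, copy each property from the
    # resource inventory R, restricted to the allocated ranks.
    if "properties" in alloc["execution"]:
        return alloc
    alloc_ranks = set()
    for entry in alloc["execution"]["R_lite"]:
        alloc_ranks |= expand_idset(entry["rank"])
    props = {}
    for name, ranks in inventory["execution"].get("properties", {}).items():
        common = expand_idset(ranks) & alloc_ranks
        if common:
            props[name] = encode_idset(common)
    if props:
        alloc["execution"]["properties"] = props
    return alloc
```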
FWIW …
Oof. I guess sched-simple is so slow at that scale it doesn't matter. I'd like to see results for something smaller as well, just in case though.
Here are some more numbers. It does look like there is a negative effect: …
Hmm, maybe instead of keeping a "running R" in the job manager, it would be better (less impact on throughput) to simply gather the R's upon request and either combine them in the job manager or even ship them to the resource module and do it there.
The impact is not as bad as I'd feared! The rlist implementation is kind of dumb and not optimized at all, so we could look into improving it. This would help a bit with #5819 too. However, since each time a user runs … I'll also note that the slow …

Sorry, all that may not have been too helpful. Just throwing some thoughts out there.
In case it wasn't clear, this PR already moves the contact point for the tools to the resource module, but the resource module doesn't do a whole lot except prepare the response. The RPC in the job manager that returns the …
That makes sense, sorry I misunderstood. So on receipt of a …

Shifting some of the work to the resource module seems like a good idea to me. Your comment about just shipping to the resource module makes a lot of sense now that I better understand.
(force-pushed from ce564ba to b1f3819)
Just pushed the change discussed above. I started testing throughput and then realized this doesn't touch the critical path at all, so there is little point. The job manager RPC now just walks all active jobs, creating a union of all allocated Rs, which it returns. It's kind of dumb and simple, but if it turns out to be slow in some scenario (for example, if there are a huge number of active jobs), there are obvious things to do to improve responsiveness, such as caching the result until the next alloc/free, keeping all jobs with resources in a list, etc.
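A toy version of that union step, reusing the expand_idset/encode_idset helpers from the sketch above. It merges R_lite entries whose "children" match and ignores nodelist/starttime bookkeeping, which the real C code also has to handle:

```python
import json

def union_rlite(r_objects):
    # Union the R_lite sections of several RFC 20 R objects by merging
    # the rank idsets of entries with identical children.
    by_children = {}
    for r in r_objects:
        for entry in r["execution"]["R_lite"]:
            key = json.dumps(entry["children"], sort_keys=True)
            by_children.setdefault(key, set()).update(
                expand_idset(entry["rank"]))
    return [
        {"rank": encode_idset(ranks), "children": json.loads(key)}
        for key, ranks in sorted(by_children.items())
    ]
```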
I've updated the description with a few todos, assuming this approach is acceptable. I'm leaning towards splitting the …
Here's some timing results on this current branch vs master. Huge improvements: …

on this branch: …

This result is somewhat obvious, since this branch removes the influence of the scheduler on the timing for …. Another useful test might be to do a similar run with a lot of active jobs, though I'm sure this branch will still be a net win.
I didn't see this early comment addressed: …

It might be ok to just break compatibility in this case, but I wonder if we can take advantage of the fact that the …
Oh, sorry! I took your comment to heart and left the old RPC in place. To restore a bit of test coverage without much effort, I added a FLUX_RESOURCE_STATUS_RPC environment variable that allows the topic string used in the python bindings to be overridden.
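For example, a sketch of the override (note: the final commits on this PR name the variable FLUX_RESOURCE_LIST_RPC; the raw RPC comparison uses the bindings' standard flux.Flux.rpc call):

```python
import os

# Point the python bindings at the legacy scheduler RPC (variable name
# as it appears in the final commits on this PR).
os.environ["FLUX_RESOURCE_LIST_RPC"] = "sched.resource-status"

import flux

h = flux.Flux()
# For comparison, the two RPCs side by side; the new one must be
# directed to rank 0 since the resource module loads on all ranks.
new = h.rpc("resource.sched-status", nodeid=0).get()
old = h.rpc("sched.resource-status").get()
print(sorted(new.keys()) == sorted(old.keys()))  # compatible payloads
```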
Ah, the case I was thinking about is a new client talking to an old server. I'm still waffling on whether this is worth the effort, since it would probably have to be handled explicitly in all use cases (perhaps we could use some trick to make it automatic in Python though).
Ah, I was thinking old client + new server. You are right, for new client + old server, we'd probably have to retry with the old topic string on ENOSYS.
If we want to do that, I wonder if it would be simplest to create a C wrapper function (it could return an empty future that is either fulfilled by the first or retry response).
If it simplifies things, supporting backwards compatibility could be done in a follow-on PR (and I'd be willing to work on that if you've got other things going).
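A rough Python sketch of that retry-on-ENOSYS idea (the thread above proposes doing this in C with a composite future; this is only an illustration, relying on the bindings raising OSError with errno set when an RPC fails):

```python
import errno

import flux

def sched_resource_status(h):
    # Prefer the new resource.sched-status RPC; fall back to the
    # legacy scheduler RPC if the server predates it (ENOSYS).
    try:
        return h.rpc("resource.sched-status", nodeid=0).get()
    except OSError as e:
        if e.errno != errno.ENOSYS:
            raise
        return h.rpc("sched.resource-status").get()

resp = sched_resource_status(flux.Flux())
```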
This looks great to me, and what an improvement! Seems like this is the way we should have been doing things all along.
I didn't really see much to comment on, and given that everyone's pretty annoyed that flux resource list seems to take >30s on elcap at the moment, we should get this in ASAP.
Amazing work!
BTW, I tried to add a "worst case scenario" to my benchmark by having one running job per node (so 16K jobs in the largest test case). The results show good performance even for that case:

```
SCHEDULER     NNODES   T(resource.sched-status)   T(flux resource list)
sched-simple     128   0.168                      0.196
sched-simple     256   0.178                      0.203
sched-simple     512   0.193                      0.235
sched-simple    1024   0.229                      0.293
sched-simple    2048   0.309                      0.420
sched-simple    4096   0.481                      0.694
sched-simple    8192   0.881                      1.295
sched-simple   16384   1.835                      2.633
```
Impact of many jobs doesn't seem to be too bad! I say we get this in.
```
* posted. Thus, after waiting, resource.status (flux resource status)
* should show those ranks up, while sched.resource-status
* (flux resource list command) may still show them down.
```
Commit message problem statement: do you mean flux resource list instead of flux resource status here?
```c
static int update_properties (struct rlist *alloc, struct rlist *all)
{
    struct idset *alloc_ranks;
    json_t *props;
    /* ... */
```
I cannot recall, did we ever determine whether the missing properties in the response are a bug in rlist? If so, is there an open issue? I'd be happy to attempt to address that, since it is slightly depressing that the resource module has to do this extra work...
Well, I wasn't sure if it's actually desirable to have properties propagate to subinstances. Currently there is inconsistent behavior in fluxion (includes properties) vs sched-simple (omits properties). Queue properties are clearly irrelevant, but maybe other properties would be useful. For example, on my test cluster I have properties assigned for different memory amounts, and those would be useful to propagate. I can open an issue and we can discuss there if you like.
I may have misspoken there! At least in a quick test, sched-simple does seem to assign properties in the R allocated to jobs. More testing is needed to see where exactly properties were being dropped.
Problem: 'flux resource list' can take a long time when the scheduler is busy. Add a new 'resource.sched-status' RPC method that returns a payload compatible with 'sched.resource-status'. It avoids contacting the scheduler by making use of the new 'job-manager.resource-status' RPC. Properties are sometimes not included in the R returned by the job manager (see flux-framework#5826), so copy any applicable properties from the resource inventory into the allocated set before returning it in the RPC response.
Problem: when a client disconnects with a resource.status RPC pending, the RPC is not aborted. Change the resource module disconnect handler to allow non-owners to send the message (and verify that the acquire handler is using proper authentication). Then call status_disconnect() from the handler. Since the disconnect handler also calls acquire_disconnect() and that only works on rank 0, add a check so it doesn't segfault.
Problem: updates to the summary pane are subject to large delays when the scheduler is busy. Use resource.sched-status instead of sched.resource-status to poll for resource information.
Problem: 'flux resource list' can take a long time when the scheduler is busy. Use resource.sched-status instead of sched.resource-status to request resource information. Force the RPC to go to rank 0 since, unlike the scheduler, resource is loaded on all ranks and would fail the RPC if first sent to the local broker.
Problem: test code would need to be duplicated or written from scratch to get coverage for the sched.resource-status RPC, which needs to be retained for a while for backwards compatibility. Add support for a FLUX_RESOURCE_LIST_RPC environment variable. If set to an alternate topic string, it is used by the python SchedResourceList class instead of the new resource.sched-status RPC.
Problem: the FLUX_RESOURCE_LIST_RPC environment variable is undocumented. Add it to the man page in the test section.
Problem: the sched.resource-status RPC no longer has test coverage. Add a couple of trivial tests that verify it still works, using the FLUX_RESOURCE_LIST_RPC environment variable just added.
Problem: the flux-resource(1) man page mentions querying the scheduler for resource status several times, but this is no longer done. Use "scheduler view of resources" instead where applicable. Although the resource module now answers this query, it provides the same "view" as before.
Problem: there is no test coverage for running 'flux resource list' on a broker rank other than the leader, but the resource module loads on all ranks and only offers resource.sched-status on rank 0. Make sure it works on ranks other than zero.
Fixed the commit typo and opened #5826 on the missing properties. I'll go ahead and set MWP. Thanks!
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5796      +/-   ##
==========================================
- Coverage   83.33%   83.29%   -0.05%
==========================================
  Files         509      510       +1
  Lines       82528    82817     +289
==========================================
+ Hits        68776    68982     +206
- Misses      13752    13835      +83
```
Hmm, in one of the builders, …
As discussed in #5776, this extends resource.status to include the keys found in sched.resource-status, and switches the tools to query resource.status instead of sched.resource-status.

Marking as a WIP pending:
- decide what to do with the alloc-check plugin, since this effectively does the same thing. Edit: second iteration of the design no longer does this.
- decide if it was a good idea to combine the two RPCs or whether we should provide two, or perhaps take a step back and design a new one. Edit: let's have two RPCs, just like before.
- should we leave the scheduler RPC handlers in place and provide some way to use them in test? Edit: keep for back compat, but add tests so the code is not left uncovered.
- update flux-resource(1), which mentions "scheduler view of resources" a lot. Edit: view is ok - just fix language about explicitly contacting the scheduler.
- flux resource status currently calls the resource.status RPC twice. Edit: now that there are two RPCs again, this is perfectly fine.