resource-hwloc: add synchronization to startup and reload #1931
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1931      +/-   ##
==========================================
+ Coverage   80.01%   80.05%   +0.03%
==========================================
  Files         195      196       +1
  Lines       34931    34981      +50
==========================================
+ Hits        27951    28004      +53
+ Misses       6980     6977       -3
```
Nice! I had a few minutes before leaving home this morning and was able to verify that sched still builds/checks against this branch. Will poke some more later today.
Noticed one more test in
Just playing with this - I did see a segfault one time running the following, but sadly no core file, so no idea if it has anything to do with this PR.
I really like the new … Easy to confirm that resources are divided among brokers as one might hope when running multiple brokers per node:
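The commands used for that check were not preserved here; a hypothetical way to reproduce it, assuming the default rc scripts load resource-hwloc, might be:

```sh
# Hypothetical check (not the original commands from this thread):
# start 4 brokers on a single node, then look at how the node's resources
# were divided among the broker ranks in the aggregated by_rank object.
flux start --size=4 flux kvs get resource.hwloc.by_rank
```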
Oh, that's not good. I'll try to reproduce the segfault, as it seems likely it is due to this PR...
In `aggregate.h`:
You have a nice description of the synchronization and use of aggregator in the comment for 81edaac. A block comment derived from that might be useful in the source somewhere for the casual flux tourist.
Yeah, good calls. Thanks!
You are correct that there are some stale comments and external functions there, due to a failed past attempt at a better interface.
One other vestige from that earlier version: an extra
Any reason not to have

```c
flux_kvs_lookup_get (f, &value);
flux_future_fulfill (f_orig, strdup (value), free);
```

Then provide a
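A minimal sketch of how that suggestion might look inside a KVS lookup continuation, assuming a separate future `f_orig` that was handed back to the caller (names here are illustrative, not the PR's actual code):

```c
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <flux/core.h>

/* Hypothetical continuation for the suggestion above: copy the raw lookup
 * value into a caller-visible future.  flux_kvs_lookup_get() returns a
 * pointer owned by the lookup future, so the value is strdup'd before
 * fulfilling f_orig. */
static void lookup_continuation (flux_future_t *f, void *arg)
{
    flux_future_t *f_orig = arg;   /* future previously returned to caller */
    const char *value;

    if (flux_kvs_lookup_get (f, &value) < 0)
        flux_future_fulfill_error (f_orig, errno, NULL);
    else
        flux_future_fulfill (f_orig, strdup (value), free);
    flux_future_destroy (f);
}
```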
That's exactly what the first version did. However, I got frustrated that I had to duplicate
Yeah, I only get a weak reading on my egrig-o-meter for that one.
Ok, I've pushed some changes that may address @garlick's comments. There were some conflicting requests, so I wasn't quite sure where to go with the
Using

```c
flux_kvs_lookup_get (f, &value);
flux_future_fulfill (f_orig, strdup (value), free);
```

was not a great solution because you end up having to parse the JSON multiple times. Instead, I ended up with a solution that parses the JSON once in the lookup handler, then embeds that JSON object in the aggregate_wait future for later use.

I also added some comments to the reload handlers as suggested, added a bit of code to add a GPU count to the topo summary, and added a
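A rough sketch of that parse-once approach, with the parsed object stashed on the wait future. Function names, the `"entries"` field, and the aux key are assumptions for illustration, not the PR's actual implementation:

```c
#include <errno.h>
#include <stdarg.h>
#include <jansson.h>
#include <flux/core.h>

/* Hypothetical lookup continuation: parse the aggregate JSON exactly once,
 * attach the parsed object to the aggregate_wait future, then fulfill it. */
static void lookup_continuation (flux_future_t *f, void *arg)
{
    flux_future_t *f_wait = arg;          /* the aggregate_wait future */
    json_t *entries = NULL;

    /* "entries" as the field holding the aggregate is an assumption */
    if (flux_kvs_lookup_get_unpack (f, "{s:O}", "entries", &entries) < 0
        || flux_future_aux_set (f_wait, "entries", entries,
                                (flux_free_f) json_decref) < 0) {
        json_decref (entries);
        flux_future_fulfill_error (f_wait, errno, NULL);
    }
    else
        flux_future_fulfill (f_wait, NULL, NULL);
    flux_future_destroy (f);
}

/* Hypothetical accessor in the spirit of aggregate_wait_get_unpack():
 * callers unpack the already-parsed object, no re-parsing of raw JSON. */
static int aggregate_wait_get_unpack (flux_future_t *f_wait,
                                      const char *fmt, ...)
{
    json_t *entries;
    va_list ap;
    int rc;

    if (flux_future_get (f_wait, NULL) < 0)   /* wait + propagate errors */
        return -1;
    if (!(entries = flux_future_aux_get (f_wait, "entries")))
        return -1;
    va_start (ap, fmt);
    rc = json_vunpack_ex (entries, NULL, 0, fmt, ap);
    va_end (ap);
    return rc;
}
```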
Hmm, got an error I haven't seen before in only one builder:
I just restarted the failed builder. Given that it is a possibly difficult-to-reproduce issue in the wreck code, I don't think it is worth pursuing.
Another couple of builders are failing. I peeked at one and it's
Agreed, I'll open an issue on that.
Yeah, sorry I forgot to add the new xml files to
That makes complete sense. I should have noticed that when I made my suggestion.
Hit an error in
I'm OK with this going in if you are ready. We can always tweak it a bit if the sched guys have additional thoughts upon returning from ECP.
It at least needs some squashing if you are ok with the general approach. BTW, I tested this with up to 512 brokers (on 32 nodes) on one of our clusters. My one worry is, now that resource-hwloc synchronizes on module load,
I think the extra delay is fine, since the scheduler needs to wait for this data to load anyway. By the way, is the xml also guaranteed to be loaded when the module load completes?
Yes, the xml is loaded with a synchronous
Maybe if the load time becomes concerning at scale, we can investigate performing those two steps in parallel? E.g. start xml commit (or maybe fence?), perform aggregate, wait for commit result. It looks like that would involve some refactoring that's probably not justified at this point, but we could keep that one in our back pocket just in case.
Using a fence is a good idea. Currently, each rank issues a separate commit for its XML entry in the kvs. If we did this in parallel with the aggregation, then rank 0 would be unable to guarantee synchronization. (Unless I'm misunderstanding something, which is not unlikely.)
Remove tests for flux-hwloc reload --walk-topology in preparation for removal of this option in the module.
The hwloc by_rank "HostName" key is largely redundant with resource.hosts, so remove the test for it in preparation for removal of support for this feature.
resource-hwloc module will soon require the aggregator module, so be sure to load it before any tests begin. Additionally, make some other minor cleanups and remove polling synchronization since that will no longer be necessary.
The walk-topology support in resource-hwloc is an unused feature, so remove the option from flux-hwloc to simplify the reload command.
Remove --walk-topology from the flux-hwloc(1) manpage.
Problem: the "walk_topology" flag in resource-hwloc was an unused feature, and while nice, generated a lot of keys in the kvs. In the interest of simplifying the resource-hwloc module, remove support for walk_topology and its associated code.
Remove the HostName key from the resource.hwloc.by_rank. directory. This is redundant with the resource.hosts key (as well as the hwloc xml data), and thus creates an unnecessary extra key per rank in the kvs.
In preparation for using the aggregator in resource-hwloc, add some convenience code for constructing and waiting on aggregates.
Problem: The resource-hwloc module has no startup and reload synchronization, so `flux module load resource-hwloc` as well as `flux hwloc reload` may return before the module is ready and topology is populated in the KVS. Additionally, the `by_rank` directory hierarchy that the module populates creates a lot of kvs keys and large transactions. This change replaces the `by_rank` directory with a single JSON object, aggregated for all identical ranks. The module uses the aggregator to accomplish this, and thus also gains a synchronization point. The module on rank 0 will always wait for the aggregate to be "complete" before entering the reactor on startup, or before responding to the RPC for a reload request. All `resource-hwloc.reload` requests go to rank 0 now, and the global reload is accomplished via a sequenced event that indicates which ranks should actually reload topology information. All ranks, however, re-perform the aggregation for any reload event.
The resource-hwloc module now ensures the hwloc xml is loaded in the kvs before answering RPCs, so the internal `loaded` boolean is no longer necessary.
Count GPU objects of type cuda or opencl and include them in the topology summary aggregated by each resource-hwloc module, for informational purposes.
Remove a leftover test for broken-down topology information in the KVS, which isn't necessary anymore.
Re-implement interfaces in resource-hwloc/aggregate.[ch] to offer an aggregate_wait_get_unpack() for accessing the final json object result from the underlying flux_kvs_lookup(). Also, ensure lookup RPC is canceled and properly completed before fulfilling the aggregate future.
Add additional comments to reload request handler functions to clarify their operation.
Rank 0 sends reload events; it doesn't receive them. Don't bother listening for these events on this rank.
Add a few calls to flux-hwloc reload to the valgrind workload to ensure this call path does not leak memory or generate errors.
Add some xml for 2 brokers with GPU devices so that the GPU count code for resource.hwloc.by_rank can be tested.
Use the `fwd_count` hint in the aggregator.push RPC to allow the aggregator module to more efficiently forward the hwloc aggregate upstream. When fwd_count == 0, the module only forwards aggregates after a timeout, while with fwd_count set the module is able to immediately forward aggregate entries upstream when all descendants have added their entries to the aggregate.
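As a rough illustration of that hint, a hypothetical push to the local aggregator might look like the sketch below. Only `fwd_count` comes from the commit above; the other field names, the function name, and the arguments (`h`, `total`, `nr_descendants`, `entries`) are assumptions, not verified against the aggregator module's actual protocol:

```c
#include <jansson.h>
#include <flux/core.h>

/* Hypothetical push of this broker's entries to the local aggregator.
 * Field names other than "fwd_count" are assumptions for illustration. */
static flux_future_t *push_aggregate (flux_t *h, int total,
                                      int nr_descendants, json_t *entries)
{
    /* With fwd_count set, an intermediate broker can forward upstream as
     * soon as nr_descendants entries have arrived, instead of waiting
     * for a timeout. */
    return flux_rpc_pack (h, "aggregator.push", FLUX_NODEID_ANY, 0,
                          "{s:s s:i s:i s:O}",
                          "key", "resource.hwloc.by_rank",
                          "total", total,
                          "fwd_count", nr_descendants,
                          "entries", entries);
}
```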
For convenience, add the elapsed time spent in aggregate_wait() to the debug message emitted by resource-hwloc on rank 0.
Waaa, this module startup synchronization stuff is hard when modules can be unloaded and reloaded on individual ranks at any time. With the current approach here, any …

To fix the first problem, my plan is to have each module check for existing …

I'm not sure what to do about the second problem without some kind of tracking of which ranks have active resource-hwloc modules. This seems like a more generic problem that shouldn't be tackled here.
I'll also pose this question -- does …

```sh
$ flux exec -r all sh -c 'flux kvs put resource.hwloc.xml.$(flux getattr rank)="$(lstopo-no-graphics --of xml --restrict binding)"'
```

in …
It seems like a great simplification!
This PR started as a simple change, but ended up with a rework of the `resource-hwloc` startup and topology reload functions to add a synchronization point and aggregation.

The basic change here is that the `resource.hwloc.by_rank.` directory, which had one subdir per rank, is replaced with a single aggregate JSON object generated by the aggregator module, where the keys are `idset` strings and the values are JSON objects containing the same fields as the original `by_rank`.

Also, as part of cleanup, the unused `walk_topology` support was removed, along with `by_rank.HostName`, which is superseded by `resource.hosts` (and makes it more difficult to aggregate like resources for different hosts).

E.g. of the new format:
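The example output from the original PR description is not preserved in this capture. Purely as a hypothetical illustration of the shape described above (an idset key mapping to a per-idset summary object), it might look something like:

```sh
# Hypothetical illustration only: the field names are guesses, not the PR's actual output
$ flux kvs get resource.hwloc.by_rank
{"[0-3]": {"NUMANode": 1, "Socket": 1, "Core": 4, "PU": 8, "GPU": 0}}
```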
Or, on a real system with different resources per rank: …

As a consequence of having all ranks push an aggregate instead of committing individually to the KVS, the topology load is now synchronized by having the rank 0 module wait for the aggregate to be "complete" before entering the reactor on startup. Thus, after these changes, once `flux module load -r all resource-hwloc` completes, it is guaranteed that all ranks have finished populating the kvs.

In order to support the aggregate, however, `flux hwloc reload` support was changed to always be a global event. All `resource-hwloc.reload` RPCs are now sent to rank 0, which then issues a reload event. Only the targeted ranks in the reload event payload actually reload the topology; however, all ranks participate in another aggregation so that rank 0 can synchronously wait for the reload to complete. This is probably unnecessary at this point, but the reloads are sequenced to ensure multiple reload requests do not stomp on each other. (This could allow `resource-hwloc` to asynchronously wait for the aggregates to complete for reloads, but that is saved for future work.)

I apologize for the large diff in 8b7116e; there wasn't a good way to really stage this rewrite into a series of functional, understandable chunks. Since the previous cleanup commits don't really help readability of the final large change, I'd be willing to squash most of this together if that would be better.
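A hypothetical usage sketch of the guarantee described above (the commands appear in this PR; running `flux kvs get` immediately afterwards is illustrative, not output from the thread):

```sh
# Hypothetical sketch: after these changes, no polling loop is needed before
# reading the aggregated topology. Module load returns only once every rank
# has pushed its entry and rank 0 has seen the aggregate complete.
flux module load -r all resource-hwloc
flux kvs get resource.hwloc.by_rank

# Likewise, a global reload is synchronized through rank 0:
flux hwloc reload
flux kvs get resource.hwloc.by_rank
```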