resource-hwloc: add synchronization to startup and reload #1931

Closed
wants to merge 21 commits into from

Conversation

grondo
Contributor

@grondo grondo commented Jan 18, 2019

This PR started as a simple change, but ended up as a rework of the resource-hwloc startup and topology reload functions to add a synchronization point and aggregation.

The basic change here is that the resource.hwloc.by_rank. directory, which had one subdir per rank, is replaced with a single aggregate JSON object generated by the aggregator module, where the keys are idset strings and the values are JSON objects containing the same fields as the original by_rank entries.

Also, as part of cleanup, the unused walk_topology support was removed, along with by_rank.HostName, which is superseded by resource.hosts (and which made it more difficult to aggregate like resources across different hosts).

An example of the new format:

$ src/cmd/flux start -b selfpmi -s 4 flux kvs get resource.hwloc.by_rank | jq
{
  "[0-3]": {
    "NUMANode": 2,
    "Package": 2,
    "Core": 16,
    "PU": 32
  }
}

or on a real system with different resources per rank:

$ srun -p pall -N4 src/cmd/flux start flux kvs get resource.hwloc.by_rank | jq
{
  "[1-3]": {
    "NUMANode": 2,
    "Package": 2,
    "Core": 36,
    "PU": 72
  },
  "0": {
    "NUMANode": 2,
    "Package": 2,
    "Core": 16,
    "PU": 32
  }
}
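
To illustrate how a consumer might use this key: the idset-keyed object is easy to expand back into a per-rank table. Below is a minimal Python sketch (not part of this PR; the bracketed-range parser is a simplification of idset syntax, and a real consumer would use a proper idset parser):

import json
import subprocess

def expand_idset(s):
    """Expand a simple idset string like "0", "[0-3]", or "[0,2-3]" into a
    list of rank integers.  Illustrative only; it does not cover every
    idset form."""
    ranks = []
    for part in s.strip("[]").split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ranks.extend(range(int(lo), int(hi) + 1))
        else:
            ranks.append(int(part))
    return ranks

by_rank = json.loads(
    subprocess.check_output(["flux", "kvs", "get", "resource.hwloc.by_rank"]))

# Expand the idset-keyed aggregate back into one summary dict per rank.
per_rank = {rank: summary
            for idset, summary in by_rank.items()
            for rank in expand_idset(idset)}
print(per_rank[0]["Core"])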

As a consequence of having all ranks push an aggregate instead of committing individually to the KVS, topology load is now synchronized by having the rank 0 module wait for the aggregate to be "complete" before entering the reactor on startup. Thus, after these changes, once flux module load -r all resource-hwloc completes, it is guaranteed that all ranks have finished populating the KVS.

In order to support the aggregate, however, flux hwloc reload was changed to always be a global event. All resource-hwloc.reload RPCs are now sent to rank 0, which then issues a reload event. Only the ranks targeted in the reload event payload actually reload the topology, but all ranks participate in another aggregation so that rank 0 can synchronously wait for the reload to complete. This is probably unnecessary at this point, but the reloads are sequenced to ensure multiple reload requests do not stomp on each other. (This could allow resource-hwloc to asynchronously wait for the aggregates to complete for reloads, but that is saved for future work.)
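
Conceptually, the sequencing on rank 0 looks something like the sketch below (Python pseudocode with hypothetical helper names, not the module's actual C implementation):

# Hypothetical sketch of the sequenced reload flow on rank 0.  The helpers
# publish_reload_event() and wait_for_aggregate() stand in for the real
# event publish and aggregate wait logic in the C module.
class ReloadSequencer:
    def __init__(self, publish_reload_event, wait_for_aggregate):
        self.seq = 0
        self.publish_reload_event = publish_reload_event
        self.wait_for_aggregate = wait_for_aggregate

    def handle_reload_request(self, target_ranks):
        seq, self.seq = self.seq, self.seq + 1
        # Announce which ranks should reload; every rank re-aggregates.
        self.publish_reload_event(seq=seq, targets=target_ranks)
        # Wait synchronously for this sequence's aggregate before replying,
        # so overlapping reload requests cannot stomp on each other.
        self.wait_for_aggregate(seq)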

I apologize for the large diff in 8b7116e; there wasn't a good way to stage this rewrite into a series of functional, understandable chunks. Since the previous cleanup commits don't really help readability of the final large change, I'd be willing to squash most of this together if that would be better.

@codecov-io

codecov-io commented Jan 18, 2019

Codecov Report

Merging #1931 into master will increase coverage by 0.03%.
The diff coverage is 79.54%.

@@            Coverage Diff             @@
##           master    #1931      +/-   ##
==========================================
+ Coverage   80.01%   80.05%   +0.03%     
==========================================
  Files         195      196       +1     
  Lines       34931    34981      +50     
==========================================
+ Hits        27951    28004      +53     
+ Misses       6980     6977       -3
Impacted Files Coverage Δ
src/cmd/builtin/hwloc.c 79.72% <100%> (-0.28%) ⬇️
src/modules/resource-hwloc/resource.c 70.46% <78.62%> (+1.34%) ⬆️
src/modules/resource-hwloc/aggregate.c 79.26% <79.26%> (ø)
src/common/libzio/zio.c 74% <0%> (-0.23%) ⬇️
src/common/libflux/message.c 81.51% <0%> (+0.12%) ⬆️
src/cmd/flux-module.c 83.96% <0%> (+0.23%) ⬆️
src/modules/kvs-watch/kvs-watch.c 79.95% <0%> (+0.9%) ⬆️
src/common/libflux/mrpc.c 88.93% <0%> (+1.18%) ⬆️

@garlick
Member

garlick commented Jan 18, 2019

Nice! I had a few minutes before leaving home this morning and was able to verify that sched still builds/checks against this branch. Will poke some more later today.

@grondo
Contributor Author

grondo commented Jan 18, 2019

Noticed one more test in t2005-hwloc-basic.t that doesn't make sense anymore, and I should add a couple tests for the new by_rank key format...

@garlick
Member

garlick commented Jan 18, 2019

Just playing with this - I did see a segfault one time running the following, but sadly no core file, so no idea if it has anything to do with this PR.

[garlick@ipa15:cmd]$ srun -N2 ./flux start flux kvs dir -R resource
srun: error: ipa4: task 0: Segmentation fault

^Csrun: interrupt (one more within 1 sec to abort)
srun: step:21049.0 task 1: running
srun: step:21049.0 task 0: exited abnormally
srun: Terminating job step 21049.0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
^Csrun: sending Ctrl-C to job 21049.0

I really like the new by_rank format.

Easy to confirm that resources are divided among brokers as one might hope when running multiple brokers per node:

[garlick@ipa15:cmd]$ srun -N2 -n2 ./flux start flux kvs get resource.hwloc.by_rank | jq
{
  "[0-1]": {
    "NUMANode": 2,
    "Package": 2,
    "Core": 36,
    "PU": 36
  }
}
[garlick@ipa15:cmd]$  srun -N2 -n4 ./flux start flux kvs get resource.hwloc.by_rank | jq
{
  "[0-3]": {
    "NUMANode": 1,
    "Package": 1,
    "Core": 18,
    "PU": 18
  }
}
[garlick@ipa15:cmd]$ srun -N2 -n8 ./flux start flux kvs get resource.hwloc.by_rank | jq
{
  "[0-7]": {
    "NUMANode": 1,
    "Package": 1,
    "Core": 9,
    "PU": 9
  }
}

@grondo
Contributor Author

grondo commented Jan 18, 2019

Just playing with this - I did see a segfault one time running the following, but sadly no core file, so no idea if it has anything to do with this PR.

Oh, that's not good. I'll try to reproduce the segfault, as it seems likely it is due to this PR...

@garlick
Member

garlick commented Jan 18, 2019

In aggregate.h

  • comment on aggregate_wait() says the future can be decoded with flux_kvs_lookup_get_unpack(), but I think that pertains to your earlier implementation and now only applies to the inner future obtained with aggregate_get()?
  • aggregate_get() doesn't have external users, so maybe it should be taken out of the header, and the user instructed to call aggregate_unpack() on the future from aggregate_wait()?

You have a nice description of the synchronization and use of aggregator in the comment for 81edaac. A block comment derived from that might be useful in the source somewhere for the casual flux tourist.

@grondo
Contributor Author

grondo commented Jan 18, 2019

Yeah, good calls. Thanks!

aggregate.h is private to resource-hwloc for now, but I was playing with interfaces for potentially creating flux_aggregate_* interfaces in the future, so I was working on abstracting some of that out. That's probably unwise until the flux_future_t issue I mentioned earlier has a solution...

You are correct that there are some stale comments and external functions there, due to a failed past attempt at a better interface.

@garlick
Member

garlick commented Jan 18, 2019

One other vestige from that earlier version: an extra flux_future_get() in aggregate_check() that's redundant with the flux_kvs_lookup_get_unpack() that follows it.

@garlick
Member

garlick commented Jan 18, 2019

Any reason not to have aggregate_check() call

flux_kvs_lookup_get (f, &value);
flux_future_fulfill (f_orig, strdup (value), free);

Then provide an aggregate_wait_get() that returns the copy of value, and/or an unpack variant?

@grondo
Contributor Author

grondo commented Jan 18, 2019

Any reason not to have aggregate_check() call

That's exactly what the first version did.

However, I got frustrated that I had to duplicate flux_kvs_lookup_unpack and undid all of that in a huff 😄. Now that the solution has been reproposed here, it doesn't seem so bad. There is, after all, only one form an aggregate can take, so at least multiple flux_kvs_lookup_get* functions would not need to be duplicated.

@garlick
Member

garlick commented Jan 18, 2019

Yeah, I only get a weak reading on my egrig-o-meter for that one.

@grondo
Contributor Author

grondo commented Jan 19, 2019

Ok, I've pushed some changes that may address @garlick's comments. There were some conflicting requests, so I wasn't quite sure where to go with the resource-hwloc/aggregate.h interface. Keep in mind this interface is private to resource-hwloc for now, and I was just using it as an experiment to see if the aggregator module could or should have a public interface in any kind of API. If the interfaces are confusing or you feel they shouldn't appear in a header, we can move them back into resource.c and make them static.

Yeah, I only get a weak reading on my egrig-o-meter for that one.

Using

flux_kvs_lookup_get (f, &value);
flux_future_fulfill (f_orig, strdup (value), free);

was not a great solution because you end up having to parse the JSON multiple times. Instead, I ended up with a solution that parses the JSON once in the lookup handler, then embeds that json object in the aggregate_wait future for later use.
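
In other words, the pattern is "decode once in the lookup continuation, keep the decoded object with the future for later getters." A rough Python analogue (hypothetical class, standing in for the flux_future_t mechanics used in the C code):

import json

class AggregateWaitFuture:
    """Rough analogue of the aggregate_wait future: the KVS lookup handler
    decodes the JSON once, and later accessors reuse the cached object."""
    def __init__(self):
        self._obj = None

    def fulfill_from_lookup(self, raw_json):
        # Called once by the lookup continuation; parse here, not per getter.
        self._obj = json.loads(raw_json)

    def get(self):
        # Equivalent in spirit to aggregate_wait_get_unpack(): no re-parsing.
        return self._obj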

I found using flux_kvs_lookup in this mode a bit unergonomic, and very prone to surprising errors if you forget to cancel the lookup and wait for ENODATA. I know this was programmer error, but I spent a while trying to figure out why a kvs commit RPC somewhere else was returning ENODATA when I had this bug in my code. It would be nice if the ENODATA response was just dropped if its future was "destroyed" (pending destruction), but I'm not sure it would even be possible to implement that.

I also added some comments to the reload handlers as suggested, added a bit of code to add a GPU count to the topo summary, and added a flux hwloc reload "workload" to the valgrind tests to ensure there are no leaks or memory errors captured by valgrind.

@grondo
Contributor Author

grondo commented Jan 19, 2019

Hmm, got an error I haven't seen before in only one builder:

expecting success: 
        for i in `seq 0 3`; do echo "hi $i" > i.$i; done &&
        flux wreckrun --input="i.0:0;i.1:1;i.2:2;i.3:3" -l -n4 cat |
		sort -n > output.multi &&
        cat >expected.multi <<-EOF &&
	0: hi 0
	1: hi 1
	2: hi 2
	3: hi 3
	EOF
        test_cmp expected.multi output.multi

--- expected.multi	2019-01-19 18:23:27.900035131 +0000
+++ output.multi	2019-01-19 18:23:27.892035087 +0000
@@ -1,4 +1,3 @@
-0: hi 0
-1: hi 1
+0: hi 1
 2: hi 2
 3: hi 3
not ok 15 - wreckrun: --input different for each task works

@grondo
Contributor Author

grondo commented Jan 19, 2019

I just restarted the failed builder. Given that it is a possibly difficult-to-reproduce issue in the wreck code, I don't think it is worth pursuing.

@garlick
Member

garlick commented Jan 19, 2019

Another couple builders are failing. I peeked at one and it's

FAIL: t2005-hwloc-basic.t 15 - hwloc: reload xml with GPU objects
#	
#	    flux hwloc reload ${sierra2} &&
#	    flux kvs get resource.hwloc.by_rank | grep "\"GPU\":  *4"
#	

@garlick
Member

garlick commented Jan 19, 2019

I found using flux_kvs_lookup in this mode a bit unergonomic, and very prone to surprising errors if you forget to cancel the lookup and wait for ENODATA. I know this was programmer error, but I spent a while trying to figure out why a kvs commit RPC somewhere else was returning ENODATA when I had this bug in my code. It would be nice if the ENODATA response was just dropped if its future was "destroyed" (pending destruction), but I'm not sure it would even be possible to implement that.

Agreed, I'll open an issue on that.

@grondo
Contributor Author

grondo commented Jan 19, 2019

Another couple builders are failing. I peeked at one and it's

Yeah, sorry, I forgot to add the new xml files to dist_check_DATA.

@garlick
Member

garlick commented Jan 19, 2019

Instead, I ended up with a solution that parses the JSON once in the lookup handler, then embeds that json object in the aggregate_wait future for later use

That makes complete sense. I should have noticed that when I made my suggestion.

@grondo
Contributor Author

grondo commented Jan 19, 2019

Hit an error in t3000-mpi-basic.t after all tests passed; possibly it exceeded the grace time? (I think @chu11 hit something similar earlier?) Will restart.

ERROR: t3000-mpi-basic
======================

# ./t3000-mpi-basic.t: flux session size will be 4
ok 1 - mpi hello singleton with stdio-delay-commit
PASS: t3000-mpi-basic.t 1 - mpi hello singleton with stdio-delay-commit
ok 2 - mpi hello all ranks with stdio-delay-commit
PASS: t3000-mpi-basic.t 2 - mpi hello all ranks with stdio-delay-commit
ok 3 - mpi hello oversubscribed with stdio-delay-commit
PASS: t3000-mpi-basic.t 3 - mpi hello oversubscribed with stdio-delay-commit
# passed all 3 test(s)
lt-flux-broker: module 'content-sqlite' was not cleanly shutdown
lt-flux-broker: module 'kvs' was not cleanly shutdown
flux-start: 2 (pid 32495) Killed
1..3
ERROR: t3000-mpi-basic.t - exited with status 137 (terminated by signal 9?)

@garlick
Member

garlick commented Jan 19, 2019

I'm OK with this going in if you are ready. We can always tweak it a bit if the sched guys have additional thoughts upon returning from ECP.

@grondo
Contributor Author

grondo commented Jan 19, 2019

I'm OK with this going in if you are ready. We can always tweak it a bit if the sched guys have additional thoughts upon returning from ECP.

It at least needs some squashing if you are ok with the general approach. BTW, I tested this with up to 512 brokers (on 32 nodes) on one of our clusters.

My one worry is that, now that resource-hwloc synchronizes on module load, rc1 may be slower for large jobs. (Note that for the 512 broker case it seemed to take >1s for the initial topology "load" to take place.)

$ srun --pty -N32 -n 512 src/cmd/flux start
(flux-r5pT8q) $ flux dmesg | grep seq
2019-01-19T20:59:00.274577Z resource-hwloc.debug[0]: seq=0: waiting for reload
2019-01-19T20:59:01.403893Z resource-hwloc.debug[0]: seq=0: reload complete
(flux-r5pT8q) $ flux kvs get resource.hwloc.by_rank | jq
{
  "[0-511]": {
    "NUMANode": 1,
    "Package": 1,
    "Core": 1,
    "PU": 1
  }
}

@garlick
Member

garlick commented Jan 19, 2019

I think the extra delay is fine, since the scheduler needs to wait for this data to load anyway. By the way, is the xml guaranteed to be loaded also when the module load completes?

@grondo
Contributor Author

grondo commented Jan 19, 2019

By the way, is the xml guaranteed to be loaded also when the module load completes?

Yes, the xml is loaded with a synchronous kvs_commit, then the aggregate is started... Since the aggregate is started only after the xml is loaded, that is why I was concerned about the extra time (I believe this used to be done in a single kvs transaction per rank).

@garlick
Member

garlick commented Jan 20, 2019

Maybe if the load time becomes concerning at scale, we can investigate performing those two steps in parallel? E.g. start xml commit (or maybe fence?), perform aggregate, wait for commit result. It looks like that would involve some refactoring that's probably not justified at this point, but we could keep that one in our back pocket just in case.
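
A rough sketch of that overlap (Python pseudocode; start_xml_commit, push_aggregate, and wait are hypothetical stand-ins for the real flux futures):

# Hypothetical sketch of overlapping the per-rank XML commit (or fence) with
# the aggregator push, instead of committing synchronously first.
def load_topology_overlapped(start_xml_commit, push_aggregate, wait):
    commit_f = start_xml_commit()   # async commit of resource.hwloc.xml.<rank>
    aggregate_f = push_aggregate()  # async aggregator.push of the summary
    # Both operations were in flight concurrently; wait for each result.
    wait(commit_f)
    wait(aggregate_f)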

@grondo
Contributor Author

grondo commented Jan 20, 2019

Using a fence is a good idea. Currently, each rank issues a separate commit for its XML entry in the kvs. If we did this in parallel with the aggregation, then rank 0 would be unable to guarantee synchronization.

(Unless I'm misunderstanding something, which is not unlikely)

grondo added 21 commits January 23, 2019 10:34
Remove tests for flux-hwloc reload --walk-topology in preparation for
removal of this option in the module.
The hwloc by_rank "HostName" key is largely redundant with resource.hosts,
so remove the test for it in preparation for removal of support for this
feature.
resource-hwloc module will soon require the aggregator module, so be
sure to load it before any tests begin.

Additionally, make some other minor cleanups and remove polling
synchronization since that will no longer be necessary.
The walk-topology support in resource-hwloc is an unused feature,
so remove the option from flux-hwloc to simplify the reload command.
Remove --walk-topology from the flux-hwloc(1) manpage.
Problem: the "walk_topology" flag in resource-hwloc was an unused
feature, and while nice, generated a lot of keys in the kvs.

In the interest of simplifying the resource-hwloc module, remove
support for walk_topology and its associated code.
Remove the HostName key from resource.hwloc.by_rank. directory.
This is redundant with the resource.hosts key (as well as the hwloc
xml data), and thus creates an unnecessary extra key per rank in
the kvs.
In preparation for using the aggregator in resource-hwloc, add some
convenience code for constructing and waiting on aggregates.
Problem: The resource-hwloc module has no startup and reload
synchronization, so `flux module load resource-hwloc` as well as
`flux hwloc reload` may return before the module is ready and
topology is populated in the KVS.

Additionally, the `by_rank` directory hierarchy that the module
populates creates a lot of kvs keys and large transactions.

This change replaces the `by_rank` directory with a single JSON
object, aggregated for all identical ranks. The module uses the
aggregator to accomplish this, and thus also gains a synchronization
point. The module on rank 0 will always wait for the aggregate to
be "complete" before entering the reactor on startup, or responding
to RPC for a reload request.

All `resource-hwloc.reload` requests go to rank 0 now, and the
global reload is accomplished via a sequenced event that indicates
which ranks should actually reload topology information. All ranks,
however, re-perform the aggregation for any reload event.
The resource-hwloc module now ensures the hwloc xml is loaded
in the kvs before answering RPCs, so the internal `loaded`
boolean is no longer necessary.
Count GPU objects of type cuda or opencl and include in the topology
summary aggregated by each resource-hwloc module for informational
purposes.
Remove left over test for broken down topology information in KVS
which isn't necessary anymore.
Re-implement interfaces in resource-hwloc/aggregate.[ch] to offer
an aggregate_wait_get_unpack() for accessing the final json object
result from the underlying flux_kvs_lookup().

Also, ensure lookup RPC is canceled and properly completed before
fulfilling the aggregate future.
Add additional comments to reload request handler functions to clarify
their operation.
Rank 0 sends reload events; it doesn't receive them. Don't bother listening
for these events on this rank.
Add a few calls to flux-hwloc reload to the valgrind workload to
ensure this call path does not leak memory or generate errors.
Add some xml for 2 brokers with GPU devices so that the GPU count
code for resource.hwloc.by_rank can be tested.
Use the `fwd_count` hint in the aggregator.push rpc to allow the
aggregator module to forward hwloc aggregates upstream more
efficiently. When fwd_count == 0, the module only forwards aggregates
after a timeout, while with fwd_count set the module is able to
immediately forward aggregate entries upstream once all descendants
have added their entries to the aggregate.
For convenience, add the elapsed time spent in aggregate_wait()
to the debug message emitted by resource-hwloc on rank 0.
@grondo
Contributor Author

grondo commented Jan 23, 2019

Waaa, this module startup synchronization stuff is hard when modules can be unloaded and reloaded on individual ranks at any time.

With the current approach here, any flux module unload/reload resource-hwloc will hang except with -r all. Obviously, if a single module is unloaded and not reloaded, then flux hwloc reload will hang.

To fix the first problem, my plan is to have each module check for an existing resource.hwloc.xml.<rank> directory, and if it exists, do nothing at startup instead of re-loading xml and attempting to re-aggregate by_rank information. This means that after flux module reload returns, flux hwloc reload -r <rank> would need to be issued to ensure the xml was re-inserted into the kvs and the by_rank information updated.
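
The proposed startup check might look roughly like this (Python pseudocode with hypothetical helpers; the real module is C):

# Hypothetical sketch of the proposed per-rank startup behavior: skip the
# topology load and aggregation when this rank's xml entry already exists.
def module_startup(rank, kvs_exists, load_topology_and_commit, push_aggregate):
    key = "resource.hwloc.xml.%d" % rank
    if kvs_exists(key):
        # Module was reloaded on this rank alone; leave existing data intact.
        # A subsequent `flux hwloc reload -r <rank>` refreshes it if needed.
        return
    load_topology_and_commit(key)
    push_aggregate()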

I'm not sure what to do about the second problem without some kind of tracking of which ranks have active resource-hwloc modules. This seems like a more generic problem that shouldn't be tackled here.

@grondo
Contributor Author

grondo commented Jan 24, 2019

I'll also pose this question -- does resource-hwloc have to be a module? It isn't doing anything active, and its current functionality could be replaced with something like

$  flux exec -r all sh -c 'flux kvs put resource.hwloc.xml.$(flux getattr rank)="$(lstopo-no-graphics --of xml --restrict binding)"'

in rc1. (And of course, we could wrap the above command in a flux hwloc reload builtin, including support for generating an aggregate as proposed in this PR.)

@garlick
Member

garlick commented Jan 24, 2019

It seems like a great simplification!
