Add support for hwloc 2.0+ #677
Conversation
LGTM. Let me see if I can quickly test the current failure in t4004-match-hwloc.t
@SteVwonder: I think I manually reproduced the issue, and this was a problem I wanted to look into further, so this came up at an opportune time. Before this test hangs, I get the following error message:

This probably means that the graph created with hwloc2 support doesn't match the underlying jobspec. I will look, but this could be a case where the graph contains more hierarchical resource types than the jobspec supports (e.g., socket). The bigger issue, though, is what happens when a match error is detected. Let me look at a couple of things and make a specific suggestion.
The following patch should work for this PR:

```diff
diff --git a/t/t4004-match-hwloc.t b/t/t4004-match-hwloc.t
index 3a994ee4..7aca3868 100755
--- a/t/t4004-match-hwloc.t
+++ b/t/t4004-match-hwloc.t
@@ -119,7 +119,8 @@ test_expect_success 'reloading session/hwloc information with test data' '
 '
 
 test_expect_success 'loading resource module with default resource info source' '
-    load_resource subsystems=containment policy=high
+    load_resource subsystems=containment policy=high \
+        load-whitelist=node,socket,core
 '
 
 test_expect_success 'match-allocate works with four two-socket jobspecs' '
```

My guess is that without this option the resulting graph contains other intermediate resource types between node and socket, such that the following jobspec won't match:

```yaml
version: 1
resources:
  - type: node
    count: 1
    with:
      - type: slot
        count: 2
        label: default
        with:
          - type: socket
            count: 1
```

I will push this change as a "fixup" commit to your branch to see if that makes CI happy. For the "match error" leading to a front-end hang issue, I will open an issue ticket to address it in a separate PR.
FYI -- pushed this change.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #677      +/-   ##
==========================================
- Coverage   75.22%   75.20%   -0.03%
==========================================
  Files          78       78
  Lines        8300     8301       +1
==========================================
- Hits         6244     6243       -1
- Misses       2056     2058       +2
```

Continue to review the full report at Codecov.
@SteVwonder: the CI is happy now. Once you address the review comments and merge my fixup commit, this PR should be good to be merged.
Thanks @dongahn for figuring that out! I appreciate the help! That reminds me that I should add the latest ubuntu image that was added to flux-core so that we have CI coverage of hwloc 2.X (the current testers only have 1.X). I'll address your review comments today as well.
Awesome! Thanks @SteVwonder.
I just restarted the Ubuntu 20.04 builder now that the flux-core image is being published (thanks @grondo !!!!). Looks like both issues occur there (the failed test and the empty
I'll begin poking at that tomorrow and next week. |
@SteVwonder: LMK if you need any help to finalize this. This should def. go into our end-of-July freeze.
Problem: when `HWLOC_OBJ_GROUP` is ignored, the GPUs on the Sierra/Lassen clusters are represented in the resource topology as direct children of the node. This topology ignores the fact that the GPUs actually have locality with respect to the CPU sockets, and it is causing downstream effects in the fluxion scheduler and its testsuite. Depending on how the hwloc data is read (either via flux-core's flux-hwloc or directly via the hwloc API), the resource topology changes (i.e., GPUs are children of the node versus the sockets). It is also worth noting that the GPUs are children of the sockets when using the hwloc V2 API, so ignoring the group creates a significant difference in the topologies between hwloc versions.

Solution: remove the call to ignore `HWLOC_OBJ_GROUP` so that on the Sierra/Lassen systems the GPUs are children of the sockets. This also normalizes the resource topology across reading methods and hwloc versions. Requesting a GPU on Sierra/Lassen can now always be done with a `node->socket->gpu` jobspec.

Related PR: flux-framework/flux-sched/pull/677
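For context, here is a hedged sketch of what an "ignore group" call looks like under each hwloc API generation. It is illustrative only and uses standard hwloc symbols; the exact call site removed in flux-sched may look different.

```c
#include <hwloc.h>

/* Illustrative sketch: the kind of call this commit removes, so that
 * HWLOC_OBJ_GROUP objects (which carry GPU/socket locality on
 * Sierra/Lassen) are kept in the topology instead of being dropped. */
static int ignore_groups (hwloc_topology_t topo)
{
#if HWLOC_API_VERSION < 0x20000
    /* hwloc 1.x: drop Group objects entirely */
    return hwloc_topology_ignore_type (topo, HWLOC_OBJ_GROUP);
#else
    /* hwloc 2.x: the equivalent is a KEEP_NONE type filter */
    return hwloc_topology_set_type_filter (topo, HWLOC_OBJ_GROUP,
                                           HWLOC_TYPE_FILTER_KEEP_NONE);
#endif
}
```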
Problem: the first python in `PATH` may not be the same python that Flux is configured against, causing tests to fail. Solution: modify tests to use `flux python`, ensuring that the "correct" python interpreter is used.
Problem: in hwloc 2.X, the `obj->children` array now only contains resources like PUs, cores, and sockets. Memory, IO, and Bridge types are no longer included in that array or its associated `obj->arity`; those objects are now accessible via the new `obj->*_first_child` attributes (where `*` is one of `memory`, `io`, and `misc`). As a result, the existing method of traversing the hwloc tree via the `obj->children` attribute misses memory, GPUs, and storage resources.

Solution: use the `hwloc_get_next_child` function, which exists in both hwloc 1.X and 2.X, to iterate over all child resources that aren't filtered out by other means (e.g., `hwloc_topology_set_flags` and `hwloc_topology_set_io_types_filter`). It manages the complexity of handling these various new child attributes and accessing all valid children of a resource.
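A minimal sketch (not the flux-sched reader itself) of a recursive walk built on `hwloc_get_next_child()`. In hwloc 2.x this helper visits the normal, memory, I/O, and misc child lists in turn, so GPUs and storage hanging off the special child lists are not missed the way an `obj->children`/`obj->arity` loop would miss them.

```c
#include <stdio.h>
#include <hwloc.h>

/* Recursively print every child reachable from 'parent', regardless of
 * which child list (normal, memory, I/O, misc) it lives on in hwloc 2.x. */
static void walk (hwloc_topology_t topo, hwloc_obj_t parent, int depth)
{
    hwloc_obj_t child = NULL;
    char type[64];

    while ((child = hwloc_get_next_child (topo, parent, child)) != NULL) {
        hwloc_obj_type_snprintf (type, sizeof (type), child, 0);
        printf ("%*s%s\n", depth * 2, "", type);
        walk (topo, child, depth + 1);
    }
}

int main (void)
{
    hwloc_topology_t topo;

    if (hwloc_topology_init (&topo) < 0 || hwloc_topology_load (topo) < 0)
        return 1;
    walk (topo, hwloc_get_root_obj (topo), 0);
    hwloc_topology_destroy (topo);
    return 0;
}
```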
Travis is finally green! 🥳 https://travis-ci.org/github/flux-framework/flux-sched/builds/708137964 @dongahn, I addressed your comments as a fixup commit. Let me know if the changes look good, and then I'll squash and this PR should be good to go.
Congrats!
Just the alignment issues remain. I gave my approval and I will put the WMP label on as well. Once you address the issues and squash the fixup commit, this should go in automatically.
Yep, that commit looks perfect to me, and the detailed commit message is awesome! Great findings!
Problem: hwloc 2.X drops the I/O and cache topology flags that were passed to `hwloc_topology_set_flags` in favor of the various `hwloc_topology_set_*_types_filter` functions. Solution: include both code paths but `#ifdef` them based on the hwloc API version. The values for the new `hwloc_topology_set_*_types_filter` calls mimic those used in flux-core. In particular, instruction and memory caches are only included if they affect the structure of the resource tree, and only "important" IO resources are kept (e.g., GPUs, NICs, storage).
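A hedged sketch of that kind of `#ifdef` guard, written directly against public hwloc calls. The filter values mirror the ones described above, but this is not a copy of the flux-core or flux-sched code.

```c
#include <hwloc.h>

/* Configure a topology so that "important" I/O objects (GPUs, NICs,
 * storage) are kept and caches are kept only when they affect the
 * structure of the tree, under both hwloc 1.x and 2.x. */
static int setup_filters (hwloc_topology_t topo)
{
#if HWLOC_API_VERSION < 0x20000
    /* hwloc 1.x: I/O devices come in via a topology flag, and caches are
     * collapsed with the keep-structure variant of the ignore API. */
    if (hwloc_topology_set_flags (topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES) < 0)
        return -1;
    return hwloc_topology_ignore_type_keep_structure (topo, HWLOC_OBJ_CACHE);
#else
    /* hwloc 2.x: the same intent is expressed with per-type filters */
    if (hwloc_topology_set_io_types_filter (topo,
                                            HWLOC_TYPE_FILTER_KEEP_IMPORTANT) < 0)
        return -1;
    if (hwloc_topology_set_cache_types_filter (topo,
                                               HWLOC_TYPE_FILTER_KEEP_STRUCTURE) < 0)
        return -1;
    return hwloc_topology_set_icache_types_filter (topo,
                                                   HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
#endif
}
```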
Problem: the `yaml.load` method is unsafe (it allows arbitrary code execution) and causes lots of warnings to be printed to the console/test logs. Solution: switch to the `yaml.safe_load` method, since we do not use any of the unsafe functionality. Further reading:
- https://pyyaml.org/wiki/PyYAMLDocumentation#LoadingYAML
- https://security.openstack.org/guidelines/dg_avoid-dangerous-input-parsing-libraries.html
Problem: the python scripts in `t/scripts` are intended to be run as subcommands via the `flux` binary, and the comments at the top of each file make that clear. Unfortunately, if you execute them directly anyway (e.g., `./flux-ion-resource.py`), they get interpreted as bash scripts. In one unfortunate case, they locked up my entire machine (still not sure why that happened). Solution: add a shebang line that points to `/bin/false` so that when they are directly executed, they immediately fail with an exit code of 1.
Problem: in hwloc 2.X, the `numanode` resource is no longer the parent of any compute resources; it is instead a sibling. This means that any jobspecs with `numanode` in them will no longer match on `fluxion-resource` instances populated from hwloc data. Any `L*cache` resources that do not affect the structure of the resource tree (commonly L1 and L2 caches) are now filtered out in both flux-core and flux-sched. Any expected output from `resource-query` that includes `numanode` or `L*cache` resources will also no longer be correct. Solution: first, use the `-W` flag of `resource-query` to filter out any resources that aren't a core, pu, socket, node, or gpu; this creates a set of resources that both hwloc 1 and 2 treat uniformly. Second, replace any references to `numanode` in test jobspecs with `socket`. Third, replace some of the expected output files for the affected tests.
Problem: without the load-whitelist option, the resulting graph contains other intermediate resource types between node and socket in such a way that the given jobspec (node[1]->slot[2]->socket[1]) won't match.
as opposed to just a generic failure
** Problem **

The resource graph produced by the following process (referred to in this commit message as "single read"):
- read V1-xml on disk with V2 API
- traverse hwloc topo with traversal APIs

is *NOT* equivalent to the graph produced by the following process (referred to in this commit message as "double read"):
- read V1-xml on disk with V2 API
- serialize with V1-compatible writer
- read serialized V1-compatible XML with V2 API
- traverse hwloc topo with traversal APIs

The "single read" process, when applied to our `04-brokers-sierra2` hwloc XML data, produces a resource graph where the GPUs are direct children of the compute node. The "double read" process, when applied to the same data, produces a resource graph where the GPUs are direct children of the sockets. In terms of locality, the latter is "more correct". The difference between these two graphs breaks matches against jobspecs that assume the GPUs are direct children of the node, specifically `basics/test013.yaml`.

The "single read" process is what happens when you test with the `resource-query` utility. The "double read" process is what happens when you test with the `flux ion-resource` utility against a `fluxion-resource` module that has been populated with XML from `flux hwloc reload`. The "double read" process also more closely mimics what happens "in production".

Note: all of the above "reads" use the following values for the resource filtering functions added in V2:
- io_types: HWLOC_TYPE_FILTER_KEEP_IMPORTANT
- cache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
- icache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE

** Solution **

Run the `04-brokers-sierra2` XML files through the `hwloc-convert` test utility in flux-core to emulate the first read in the "double read" process. Specifically, it performs the first two steps of the process: read V1-xml on disk with the V2 API and serialize with the V1-compatible writer. The result is that `resource-query` and `flux ion-resource` now both instantiate the same resource graph and thus produce the same results on the same jobspecs and hwloc XML. The resource graphs produced by these utilities now include a socket between the nodes and GPUs, which affects a jobspec request (basics/test013.yaml) and the expected output of two GPU-based match tests (018.R.out and 021.R.out). This commit updates the jobspec and the expected outputs to include the socket resources.

Note: this commit does not attempt to normalize the resource graphs produced by hwloc V1 versus V2. As we discussed in flux-framework#656, where this would cause issues in production, we can leverage `--load-allowlist` to filter out resources that cause differences. If one sticks strictly to Jobspec V1 and thus filters out `socket`s, this difference will be normalized out.
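For anyone reproducing the "double read" by hand, here is a minimal sketch of its first two steps written directly against the hwloc 2.x API. The file names are placeholders, and in the testsuite these steps are performed by flux-core's `hwloc-convert` utility rather than by custom code.

```c
#include <hwloc.h>

int main (void)
{
    hwloc_topology_t topo;

    if (hwloc_topology_init (&topo) < 0)
        return 1;
    /* (the type filters listed in the note above would also be set here) */
    /* step 1: read V1 XML on disk with the V2 API (placeholder path) */
    if (hwloc_topology_set_xml (topo, "04-brokers-sierra2/0.xml") < 0
        || hwloc_topology_load (topo) < 0)
        return 1;
    /* step 2: re-serialize with the V1-compatible writer (placeholder path) */
    if (hwloc_topology_export_xml (topo, "0-v1compat.xml",
                                   HWLOC_TOPOLOGY_EXPORT_XML_FLAG_V1) < 0)
        return 1;
    hwloc_topology_destroy (topo);
    return 0;
}
```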
Thanks @dongahn for the quick reviews! Just pushed the fixes and squashed.
This PR adds support for hwloc versions 2.0+. Specifically, it:
- adds an `#ifdef` block similar to the one in flux-core to do filtering of resources
- uses the `resource-query` whitelist feature to normalize the resource set tested across hwloc 1 & 2

I have it marked as a WIP because `t4004-match-hwloc.t` still has a single failing test at the end. Oddly enough, the same resource data and jobspec works perfectly fine with `resource-query` but not with `flux-ion-resource` and the full `sched-fluxion-resource` module. I also haven't verified that things still work 100% in hwloc 1.

EDIT: Closes #656