Add support for hwloc 2.0+ #677
Conversation
LGTM. Let me see if I can quickly test the current failure in t4004-match-hwloc.t
@SteVwonder: I think I manually reproduced the issue, and this was a problem I wanted to look into further, so this came up at an opportune time. Before this test hangs, I get the following error message:

This probably means that the graph created with hwloc2 support doesn't match the underlying jobspec. I will look, but this could be a case where the graph contains more hierarchical resource types than the jobspec supports (e.g., socket). The bigger issue, though, is what happens when a match error is detected. Let me look at a couple of things and make a specific suggestion.
The following patch should work for this PR:

```diff
diff --git a/t/t4004-match-hwloc.t b/t/t4004-match-hwloc.t
index 3a994ee4..7aca3868 100755
--- a/t/t4004-match-hwloc.t
+++ b/t/t4004-match-hwloc.t
@@ -119,7 +119,8 @@ test_expect_success 'reloading session/hwloc information with test data' '
 '
 
 test_expect_success 'loading resource module with default resource info source' '
-    load_resource subsystems=containment policy=high
+    load_resource subsystems=containment policy=high \
+        load-whitelist=node,socket,core
 '
 
 test_expect_success 'match-allocate works with four two-socket jobspecs' '
```

My guess is that without this option the resulting graph contains other intermediate resource types between node and socket, such that the following jobspec won't match:

```yaml
version: 1
resources:
  - type: node
    count: 1
    with:
      - type: slot
        count: 2
        label: default
        with:
          - type: socket
            count: 1
```

I will push this change as a "fixup" commit to your branch to see if that makes CI happy. For the "match error" leading to a front-end hang issue, I will open an issue ticket to address it in a separate PR.
FYI -- pushed this change.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #677      +/-   ##
==========================================
- Coverage   75.22%   75.20%   -0.03%
==========================================
  Files          78       78
  Lines        8300     8301       +1
==========================================
- Hits         6244     6243       -1
- Misses       2056     2058       +2
```

Continue to review the full report at Codecov.
@SteVwonder: the CI is happy now. Once you address the review comments and merge my fixup commit, this PR should be good to be merged.
Thanks @dongahn for figuring that out! I appreciate the help! That reminds me that I should add the latest ubuntu image that was added to flux-core so that we have CI coverage of hwloc 2.X (the current testers only have 1.X). I'll address your review comments today as well.
Awesome! Thanks @SteVwonder.
I just restarted the Ubuntu 20.04 builder now that the flux-core image is being published (thanks @grondo !!!!). Looks like both issues occur there (the failed test and the empty
I'll begin poking at that tomorrow and next week. |
@SteVwonder: LMK if you need any help to finalize this. This should def. go into our end-of-July freeze.
Problem: when `HWLOC_OBJ_GROUP` is ignored, the GPUs on the Sierra/Lassen clusters are represented in the resource topology as direct children of the node. This topology ignores the fact that the GPUs actually have locality with respect to the CPU sockets, and it is causing downstream effects in the fluxion scheduler and its testsuite. Depending on how the hwloc data is read (either via flux-core's flux-hwloc or directly via the hwloc API), the resource topology changes (i.e., GPUs are children of the node versus the sockets). It is also worth noting that the GPUs are children of the sockets when using the hwloc V2 API, so ignoring the group creates a significant difference in the topologies between hwloc versions.

Solution: remove the call to ignore `HWLOC_OBJ_GROUP` so that on the Sierra/Lassen systems the GPUs are children of the sockets. This also normalizes the resource topology across reading methods and hwloc versions. Requesting a GPU on Sierra/Lassen can now always be done with a `node->socket->gpu` jobspec.

Related PR: flux-framework/flux-sched/pull/677
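For context, here is a hedged sketch of what an "ignore group" call looks like under each hwloc API generation. It is illustrative only and uses standard hwloc symbols; the exact call site removed in flux-sched may look different.

```c
#include <hwloc.h>

/* Illustrative sketch: the kind of call this commit removes, so that
 * HWLOC_OBJ_GROUP objects (which carry GPU/socket locality on
 * Sierra/Lassen) are kept in the topology instead of being dropped. */
static int ignore_groups (hwloc_topology_t topo)
{
#if HWLOC_API_VERSION < 0x20000
    /* hwloc 1.x: drop Group objects entirely */
    return hwloc_topology_ignore_type (topo, HWLOC_OBJ_GROUP);
#else
    /* hwloc 2.x: the equivalent is a KEEP_NONE type filter */
    return hwloc_topology_set_type_filter (topo, HWLOC_OBJ_GROUP,
                                           HWLOC_TYPE_FILTER_KEEP_NONE);
#endif
}
```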
Problem: the first python in `PATH` may not be the same python that Flux is configured against, causing tests to fail. Solution: modify tests to use `flux python`, ensuring that the "correct" python interpreter is used.
Problem: in hwloc 2.X, the `obj->children` array now only contains resources like PUs, cores, and sockets. Memory, IO, and Bridge types are no longer included in that array or its associated `obj->arity`; those objects are now accessible via the new `obj->*_first_child` attributes (where `*` is one of `memory`, `io`, and `misc`). As a result, the existing method of traversing the hwloc tree via the `obj->children` attribute misses memory, GPUs, and storage resources.

Solution: use the `hwloc_get_next_child` function, which exists in both hwloc 1.X and 2.X, to iterate over all child resources that aren't filtered out by other means (e.g., `hwloc_topology_set_flags` and `hwloc_topology_set_io_types_filter`). It manages the complexity of handling these various new child attributes and accessing all valid children of a resource.
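A minimal sketch (not the flux-sched reader itself) of a recursive walk built on `hwloc_get_next_child()`. In hwloc 2.x this helper visits the normal, memory, I/O, and misc child lists in turn, so GPUs and storage hanging off the special child lists are not missed the way an `obj->children`/`obj->arity` loop would miss them.

```c
#include <stdio.h>
#include <hwloc.h>

/* Recursively print every child reachable from 'parent', regardless of
 * which child list (normal, memory, I/O, misc) it lives on in hwloc 2.x. */
static void walk (hwloc_topology_t topo, hwloc_obj_t parent, int depth)
{
    hwloc_obj_t child = NULL;
    char type[64];

    while ((child = hwloc_get_next_child (topo, parent, child)) != NULL) {
        hwloc_obj_type_snprintf (type, sizeof (type), child, 0);
        printf ("%*s%s\n", depth * 2, "", type);
        walk (topo, child, depth + 1);
    }
}

int main (void)
{
    hwloc_topology_t topo;

    if (hwloc_topology_init (&topo) < 0 || hwloc_topology_load (topo) < 0)
        return 1;
    walk (topo, hwloc_get_root_obj (topo), 0);
    hwloc_topology_destroy (topo);
    return 0;
}
```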
Travis is finally green! 🥳 https://travis-ci.org/github/flux-framework/flux-sched/builds/708137964 @dongahn, I addressed your comments as a fixup commit. Let me know if the changes look good, and then I'll squash and this PR should be good to go.
Congrats!
Just the alignment issues remain. I gave my approval and I will put the WMP label on as well. Once you address the issues and squash the fixup commit, this should go in automatically.
Yep, that commit looks perfect to me, and the detailed commit message is awesome! Great findings!
Problem: hwloc 2.X drops the I/O and cache topology flags that were passed to `hwloc_topology_set_flags` in favor of the various `hwloc_topology_set_*_types_filter` functions. Solution: include both code paths but `#ifdef` them based on the hwloc API version. The values for the new `hwloc_topology_set_*_types_filter` calls mimic those used in flux-core. In particular, instruction and memory caches are only included if they affect the structure of the resource tree, and only "important" IO resources are kept (e.g., GPUs, NICs, storage).
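A hedged sketch of that kind of `#ifdef` guard, written directly against public hwloc calls. The filter values mirror the ones described above, but this is not a copy of the flux-core or flux-sched code.

```c
#include <hwloc.h>

/* Configure a topology so that "important" I/O objects (GPUs, NICs,
 * storage) are kept and caches are kept only when they affect the
 * structure of the tree, under both hwloc 1.x and 2.x. */
static int setup_filters (hwloc_topology_t topo)
{
#if HWLOC_API_VERSION < 0x20000
    /* hwloc 1.x: I/O devices come in via a topology flag, and caches are
     * collapsed with the keep-structure variant of the ignore API. */
    if (hwloc_topology_set_flags (topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES) < 0)
        return -1;
    return hwloc_topology_ignore_type_keep_structure (topo, HWLOC_OBJ_CACHE);
#else
    /* hwloc 2.x: the same intent is expressed with per-type filters */
    if (hwloc_topology_set_io_types_filter (topo,
                                            HWLOC_TYPE_FILTER_KEEP_IMPORTANT) < 0)
        return -1;
    if (hwloc_topology_set_cache_types_filter (topo,
                                               HWLOC_TYPE_FILTER_KEEP_STRUCTURE) < 0)
        return -1;
    return hwloc_topology_set_icache_types_filter (topo,
                                                   HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
#endif
}
```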
Problem: the `yaml.load` method is unsafe (it allows arbitrary code execution) and causes lots of warnings to be printed to the console/test logs. Solution: switch to the `yaml.safe_load` method, since we do not use any of the unsafe functionality. Further reading:
- https://pyyaml.org/wiki/PyYAMLDocumentation#LoadingYAML
- https://security.openstack.org/guidelines/dg_avoid-dangerous-input-parsing-libraries.html
Problem: the python scripts in `t/scripts` are intended to be run as subcommands via the `flux` binary, and the comments at the top of each file make that clear. Unfortunately, if you execute them directly anyway (e.g., `./flux-ion-resource.py`), they get interpreted as bash scripts. In one unfortunate case, they locked up my entire machine (still not sure why that happened). Solution: add a shebang line that points to `/bin/false` so that when they are directly executed, they immediately fail with an exit code of 1.
Problem: in hwloc 2.X, the `numanode` resource is no longer the parent of any compute resources; it is instead a sibling. This means that any jobspecs with `numanode` in them will no longer match on `fluxion-resource` instances populated from hwloc data. Any `L*cache` resources that do not affect the structure of the resource tree (commonly L1 and L2 caches) are now filtered out in both flux-core and flux-sched. Any expected output from `resource-query` that includes `numanode` or `L*cache` resources will also no longer be correct. Solution: first, use the `-W` flag of `resource-query` to filter out any resources that aren't a core, pu, socket, node, or gpu; this creates a set of resources that both hwloc 1 and 2 treat uniformly. Second, replace any references to `numanode` in test jobspecs with `socket`. Third, replace some of the expected output files for the affected tests.
Problem: without the load-whitelist option, the resulting graph contains other intermediate resource types between node and socket in such a way that the given jobspec (node[1]->slot[2]->socket[1]) won't match.
as opposed to just a generic failure
** Problem **

The resource graph produced by the following process (referred to in this commit message as "single read"):
- read V1-xml on disk with V2 API
- traverse hwloc topo with traversal APIs

is *NOT* equivalent to the graph produced by the following process (referred to in this commit message as "double read"):
- read V1-xml on disk with V2 API
- serialize with V1-compatible writer
- read serialized V1-compatible XML with V2 API
- traverse hwloc topo with traversal APIs

The "single read" process, when applied to our `04-brokers-sierra2` hwloc XML data, produces a resource graph where the GPUs are direct children of the compute node. The "double read" process, when applied to the same data, produces a resource graph where the GPUs are direct children of the sockets. In terms of locality, the latter is "more correct". The difference between these two graphs breaks matches against jobspecs that assume the GPUs are direct children of the node, specifically `basics/test013.yaml`.

The "single read" process is what happens when you test with the `resource-query` utility. The "double read" process is what happens when you test with the `flux ion-resource` utility against a `fluxion-resource` module that has been populated with XML from `flux hwloc reload`. The "double read" process also more closely mimics what happens "in production".

Note: all of the above "reads" use the following values for the resource filtering functions added in V2:
- io_types: HWLOC_TYPE_FILTER_KEEP_IMPORTANT
- cache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
- icache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE

** Solution **

Run the `04-brokers-sierra2` XML files through the `hwloc-convert` test utility in flux-core to emulate the first read in the "double read" process. Specifically, it performs the first two steps of the process: read V1-xml on disk with the V2 API and serialize with the V1-compatible writer. The result is that `resource-query` and `flux ion-resource` now both instantiate the same resource graph and thus produce the same results on the same jobspecs and hwloc XML. The resource graphs produced by these utilities now include a socket between the nodes and GPUs, which affects a jobspec request (basics/test013.yaml) and the expected output of two GPU-based match tests (018.R.out and 021.R.out). This commit updates the jobspec and the expected outputs to include the socket resources.

Note: this commit does not attempt to normalize the resource graphs produced by hwloc V1 versus V2. As we discussed in flux-framework#656, where this would cause issues in production, we can leverage `--load-allowlist` to filter out resources that cause differences. If one sticks strictly to Jobspec V1 and thus filters out `socket`s, this difference will be normalized out.
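For anyone reproducing the "double read" by hand, here is a minimal sketch of its first two steps written directly against the hwloc 2.x API. The file names are placeholders, and in the testsuite these steps are performed by flux-core's `hwloc-convert` utility rather than by custom code.

```c
#include <hwloc.h>

int main (void)
{
    hwloc_topology_t topo;

    if (hwloc_topology_init (&topo) < 0)
        return 1;
    /* (the type filters listed in the note above would also be set here) */
    /* step 1: read V1 XML on disk with the V2 API (placeholder path) */
    if (hwloc_topology_set_xml (topo, "04-brokers-sierra2/0.xml") < 0
        || hwloc_topology_load (topo) < 0)
        return 1;
    /* step 2: re-serialize with the V1-compatible writer (placeholder path) */
    if (hwloc_topology_export_xml (topo, "0-v1compat.xml",
                                   HWLOC_TOPOLOGY_EXPORT_XML_FLAG_V1) < 0)
        return 1;
    hwloc_topology_destroy (topo);
    return 0;
}
```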
Thanks @dongahn for the quick reviews! Just pushed the fixes and squashed.
This PR adds support for hwloc versions 2.0+. Specifically, it:
- adds an `#ifdef` block similar to the one in flux-core to do filtering of resources
- uses the `resource-query` whitelist feature to normalize the resource set tested across hwloc 1 & 2

I have it marked as a WIP because `t4004-match-hwloc.t` still has a single failing test at the end. Oddly enough, the same resource data and jobspec works perfectly fine with `resource-query` but not with `flux-ion-resource` and the full `sched-fluxion-resource` module. I also haven't verified that things still work 100% in hwloc 1.

EDIT: Closes #656