
Add support for hwloc 2.0+ #677

Merged
merged 10 commits into from
Jul 14, 2020
Conversation

@SteVwonder (Member) commented Jun 24, 2020

This PR adds support for hwloc versions 2.0+. Specifically, it:

  • Adds an #ifdef block similar to the one in flux-core to do filtering of resources
  • Replaces how children are accessed with a more forward-compatible method
  • Uses the resource-query whitelist feature to normalize the resource set tested across hwloc 1 & 2
    • Updates the test input/output accordingly
  • Adds a few tweaks to various tests and test scripts

I have it marked as a WIP because t4004-match-hwloc.t still has a single failing test at the end. Oddly enough, the same resource data and jobspec work perfectly fine with resource-query but not with flux-ion-resource and the full sched-fluxion-resource module. I also haven't verified that things still work 100% with hwloc 1.

EDIT: Closes #656

@dongahn (Member) left a comment

LGTM. Let me see if I can quickly test the current failure in t4004-match-hwloc.t

Review comments on:
t/scripts/flux-ion-resource.py
resource/readers/resource_reader_hwloc.cpp
@dongahn (Member) commented Jun 24, 2020

I have it marked as a WIP because t4004-match-hwloc.t still has a single failing test at the end.

@SteVwonder: I think I manually reproduced the issue, and this was a problem I wanted to look into further, so this came up at an opportune time.

Before this test hangs, I get the following error message:

2020-06-24T18:30:21.552622Z sched-fluxion-resource.err[0]: match_request_cb: match failed due to match error (id=0): Success
2020-06-24T18:30:21.552668Z sched-fluxion-resource.err[0]: match_request_cb: flux_respond_error: Invalid argument

This probably means that the graph created with hwloc2 support doesn't match the underlying jobspec. I will look, but this could be a case where the graph contains more hierarchical resource types than the jobspec supports (e.g., socket).

The bigger issue, though, is that when a match error is detected, flux mini run (or equivalent) should handle it more gracefully. I actually hit this while I was testing PR #675.

Let me look at a couple of things and make a specific suggestion.

@dongahn (Member) commented Jun 24, 2020

The following patch should work for this PR:

diff --git a/t/t4004-match-hwloc.t b/t/t4004-match-hwloc.t
index 3a994ee4..7aca3868 100755
--- a/t/t4004-match-hwloc.t
+++ b/t/t4004-match-hwloc.t
@@ -119,7 +119,8 @@ test_expect_success 'reloading session/hwloc information with test data' '
 '

 test_expect_success 'loading resource module with default resource info source' '
-    load_resource subsystems=containment policy=high
+    load_resource subsystems=containment policy=high \
+load-whitelist=node,socket,core
 '

 test_expect_success 'match-allocate works with four two-socket jobspecs' '

My guess is that without this option, the resulting graph contains other intermediate resource types between node and socket in such a way that the given jobspec won't match.

version: 1
resources:
  - type: node
    count: 1
    with:
      - type: slot
        count: 2
        label: default
        with:
          - type: socket
            count: 1

I will push this change as a "fixup" commit to your branch to see if that makes CI happy.

For the "match error" leading to a front-end hang, I will open an issue ticket to address it in a separate PR.

@dongahn (Member) commented Jun 24, 2020

FYI -- pushed this change.

@codecov-commenter commented Jun 24, 2020

Codecov Report

Merging #677 into master will decrease coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #677      +/-   ##
==========================================
- Coverage   75.22%   75.20%   -0.03%     
==========================================
  Files          78       78              
  Lines        8300     8301       +1     
==========================================
- Hits         6244     6243       -1     
- Misses       2056     2058       +2     
Impacted Files Coverage Δ
resource/readers/resource_reader_hwloc.hpp 100.00% <ø> (ø)
resource/readers/resource_reader_hwloc.cpp 85.18% <100.00%> (-0.64%) ⬇️
qmanager/modules/qmanager_callbacks.cpp 76.42% <0.00%> (-0.72%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d60c75...a52ebca. Read the comment docs.

@dongahn (Member) commented Jun 24, 2020

@SteVwonder: CI is happy now. Once you address the review comments and merge my fixup commit, this PR should be good to merge.

@SteVwonder (Member, Author)

Thanks @dongahn for figuring that out! I appreciate the help! That reminds me that I should add the latest Ubuntu image that was added to flux-core so that we have CI coverage of hwloc 2.X (the current testers only have 1.X).

I'll address your review comments today as well.

@dongahn (Member) commented Jun 24, 2020

Awesome! Thanks @SteVwonder.

@SteVwonder (Member, Author)

I just restarted the Ubuntu 20.04 builder now that the flux-core image is being published (thanks @grondo !!!!). Looks like both issues occur there (the failed test and the empty R in match responses):

ok 11 - loading resource module with default resource info source
loading resource module with default resource info source
expecting success: 
    flux ion-resource match allocate ${jobspec_2socket} &&
    flux ion-resource match allocate ${jobspec_2socket} &&
    flux ion-resource match allocate ${jobspec_2socket} &&
    flux ion-resource match allocate ${jobspec_2socket}
JOBID                STATUS               AT                   OVERHEAD (Secs)     
0                    ALLOCATED            2020-06-26T04:22:41  0.0005309581756591797
================================================================================
MATCHED RESOURCES:
JOBID                STATUS               AT                   OVERHEAD (Secs)     
1                    ALLOCATED            2020-06-26T04:22:41  0.00048804283142089844
================================================================================
MATCHED RESOURCES:
JOBID                STATUS               AT                   OVERHEAD (Secs)     
2                    ALLOCATED            2020-06-26T04:22:41  0.00046181678771972656
================================================================================
MATCHED RESOURCES:
JOBID                STATUS               AT                   OVERHEAD (Secs)     
3                    ALLOCATED            2020-06-26T04:22:41  0.0004439353942871094
================================================================================
MATCHED RESOURCES:

expecting success: 
    remove_resource &&
    flux hwloc reload ${excl_4N4B_sierra2} &&
    load_resource subsystems=containment \
policy=high load-whitelist=node,socket,core,gpu &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    flux ion-resource match allocate ${jobspec_1socket_2gpu} &&
    test_expect_code 16 flux ion-resource match allocate ${jobspec_1socket_2gpu}
OSError: error(16): Device or resource busy
not ok 14 - reloading sierra xml and match allocate

I'll begin poking at that tomorrow and next week.
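As an aside, the exit code 16 that `test_expect_code` checks for above corresponds to `EBUSY` ("Device or resource busy") on Linux, matching the `OSError: error(16)` in the log. A quick standard-library sanity check (an illustration, not part of the test suite):

```python
import errno
import os

# errno.EBUSY is 16 on Linux, matching both the test_expect_code value
# and the "OSError: error(16): Device or resource busy" message above.
print(errno.EBUSY)                    # 16
print(os.strerror(errno.EBUSY))       # Device or resource busy
```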

@dongahn (Member) commented Jul 10, 2020

@SteVwonder: LMK if you need any help to finalize this. This should definitely go into our end-of-July freeze.

SteVwonder added a commit to SteVwonder/flux-core that referenced this pull request Jul 14, 2020
Problem: When the `HWLOC_OBJ_GROUP` is ignored, the GPUs on the
Sierra/Lassen clusters are represented in the resource topology as
direct children of the node.  This topology ignores the fact that the
GPUs actually have locality with respect to the CPU sockets.  This
topology is also causing downstream effects with the fluxion scheduler
and its testsuite.  Depending on how the hwloc is read (either via
flux-core's flux-hwloc or directly via the hwloc API), the resource
topology changes (i.e., GPUs are children of the node versus the
sockets).  Also worth noting that the GPUs are children of the sockets
when using the hwloc V2 API, so ignoring the group creates a significant
difference in the topologies between hwloc versions.

Solution: remove the call to ignore `HWLOC_OBJ_GROUP` so that on the
Sierra/Lassen systems, the GPUs are children of the sockets.  This also
normalizes the resource topology across reading methods and hwloc
versions.  Now requesting a GPU on Sierra/Lassen can always be done with
a `node->socket->gpu` jobspec.

Related PR: flux-framework/flux-sched/pull/677
@SteVwonder changed the title [WIP] Add support for hwloc 2.0+ → Add support for hwloc 2.0+ Jul 14, 2020
Problem: the first python in `PATH` may not be the same python that Flux
is configured against causing tests to fail.

Solution: modify tests to use `flux python`, ensuring that the "correct"
python interpreter is used.
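The mismatch that motivates this change can be sketched with the standard library alone (paths are machine-dependent; the point is that the first `python` on `PATH` need not be the interpreter the current process runs under, which is why the tests pin themselves via `flux python`):

```python
import shutil
import sys

# The "python" a test script would pick up from PATH...
path_python = shutil.which("python") or shutil.which("python3")

# ...versus the interpreter actually executing this code. These two
# need not agree, e.g. when Flux was configured against a different
# Python than the one a virtualenv or system default puts first.
print("first on PATH:", path_python)
print("running under:", sys.executable)
```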
Problem: in hwloc 2.X, the `obj->children` array now only contains
resources like PUs, cores, and sockets.  Memory, IO, and Bridge types
are no longer included in that array and its associated `obj->arity`.
Those objects are now accessible via the new `obj->*_first_child`
attributes (where `*` can be one of `memory`, `io`, and `misc`).  As a
result, the existing method of traversing the hwloc tree via the
`obj->children` attribute results in missing memory, GPUs, and storage
resources.

Solution: the `hwloc_get_next_child` function, which exists in both
hwloc 1.X and 2.X, iterates over all child resources that aren't
filtered by other means (e.g., `hwloc_topology_set_flags` and
`hwloc_set_io_types_filter`).  It manages the complexity of handling
these various new child attributes and accessing all valid children of a
resource.
@SteVwonder (Member, Author)

Travis is finally green! 🥳 https://travis-ci.org/github/flux-framework/flux-sched/builds/708137964

@dongahn, I addressed your comments as a fixup commit. Let me know if the changes look good, and then I'll squash and this PR should be good to go.

@dongahn (Member) left a comment

Congrats!

Just the alignment issues. I gave my approval, and I will add the merge-when-passing (MWP) label as well. Once you address the issues and squash the fixup commit, this should go in automatically.

Review comments on:
resource/readers/resource_reader_hwloc.cpp
@dongahn added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) Jul 14, 2020
@SteVwonder (Member, Author)

Thanks @dongahn! I'll address those right away.

I forgot to call out that I did have to add a new commit: e350c5b. If you haven't already, that's probably worth looking over.

@dongahn (Member) commented Jul 14, 2020

I forgot to call out that I did have to add a new commit: e350c5b. If you haven't already, that's probably worth looking over.

Yep, that commit looks perfect to me, and the detailed commit message is awesome! Great findings!

Problem: hwloc 2.X drops the I/O- and cache-related `hwloc_topology_set_flags` flags (e.g., `HWLOC_TOPOLOGY_FLAG_IO_DEVICES`) in
favor of the various `hwloc_topology_set_*_types_filter` functions.

Solution: include both functions but #ifdef them based on the hwloc API
version.  The values for the new `hwloc_topology_set_*_types_filter`
calls mimic those used in flux-core.  In particular, instruction and
memory caches are only included if they affect the structure of the
resource tree and only "important" IO resources are kept (e.g., GPUs,
NICs, storage).
Problem: the `yaml.load` method is unsafe (allows arbitrary code
execution) and causes lots of warnings to be printed to the console/test
logs.

Solution: switch to the `yaml.safe_load` method since we do not use any
of the unsafe functionality.

Further reading:
- https://pyyaml.org/wiki/PyYAMLDocumentation#LoadingYAML
- https://security.openstack.org/guidelines/dg_avoid-dangerous-input-parsing-libraries.html
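The difference can be illustrated in a few lines (assumes PyYAML is installed; the document is a fragment of the jobspec format used earlier in this thread):

```python
import yaml

doc = """
version: 1
resources:
  - type: node
    count: 1
"""

# safe_load only constructs plain Python objects (dicts, lists, and
# scalars). Unlike a bare yaml.load with no explicit Loader, it cannot
# be tricked by tags like !!python/object into instantiating arbitrary
# classes, and it emits no deprecation warnings on modern PyYAML.
data = yaml.safe_load(doc)
print(data["resources"][0]["type"])  # node
```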
Problem: the python scripts in `t/scripts` are intended to be run as
subcommands via the `flux` binary, and the comments at the top of the
file make that clear.  Unfortunately, if you decide to execute them
directly anyway (e.g., `./flux-ion-resource.py`), they get interpreted
as bash scripts.  In one unfortunate case, they locked up my entire
machine (still not sure why that happened).

Solution: add a shebang line that points to `/bin/false` so that when they
are directly executed, they immediately fail with an exit code of 1.
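The effect can be demonstrated with a standalone sketch (not the actual test scripts; it writes a throwaway file whose shebang is `/bin/false` and executes it directly):

```python
import os
import stat
import subprocess
import tempfile

# A script whose shebang points at /bin/false: executing it directly
# runs /bin/false (which exits 1) instead of interpreting the body,
# so the script fails immediately rather than being run as bash.
script = "#!/bin/false\nprint('never reached when executed directly')\n"
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write(script)
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
try:
    result = subprocess.run([path])
    print("exit code:", result.returncode)  # 1
finally:
    os.unlink(path)
```

Running the same file via `flux python` (or any explicit interpreter) bypasses the shebang entirely, which is why the intended invocation path is unaffected.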
Problem: in hwloc 2.X, the `numanode` resource is no longer the parent
of any compute resources, it is instead a sibling.  This means that any
jobspecs with `numanode` in them will no longer match on
`fluxion-resource` instances populated from hwloc data.  Any `L*cache`
resources that do not affect the structure of the resource tree (commonly
L1 and L2 caches) are now filtered out in both flux-core and
flux-sched. Any expected output from `resource-query` that includes
`numanodes` or `L*cache` resources will also no longer be correct.

Solution: First, use the `-W` flag of `resource-query` to filter out any
resources that aren't a core, pu, socket, node, or gpu. This creates a
set of resources that both hwloc 1 and 2 treat uniformly. Second,
replace any references to `numanode` in test jobspecs with `socket`.
Third, replace some of the expected output files for the affected tests.
dongahn and others added 4 commits July 14, 2020 15:04
Problem: without the load-whitelist option, the resulting graph contains
other intermediate resource types between node and socket in such a way
that the given jobspec (node[1]->slot[2]->socket[1]) won't match.
** Problem **

The resource graph produced by the following process (referred to in
this commit message as "single read"):

- read V1-xml on disk with V2 API
- traverse hwloc topo with traversal APIs

is *NOT* equivalent to the graph produced by the following
process (referred to in this commit message as "double read"):

- read V1-xml on disk with V2 API
- serialize with V1 compatible writer
- read serialized V1-compatible XML with V2 API
- traverse hwloc topo with traversal APIs

The "single read" process, when applied to our `04-brokers-sierra2`
hwloc XML data, produces a resource graph where the GPUs are direct
children of the compute node. The "double read" process, when applied
to our `04-brokers-sierra2` hwloc XML data, produces a resource graph
where the GPUs are direct children of the sockets.  In terms of
locality, the latter is "more correct".  The difference in these two
graphs breaks matches against jobspecs that assume the GPUs are direct
children of the node; specifically `basics/test013.yaml`.

The "single read" process is what happens when you test with the
`resource-query` utility.  The "double read" process is what happens
when you test with the `flux ion-resource` utility against a
`fluxion-resource` module that has been populated with xml from `flux
hwloc reload`.  The "double read" process also more closely mimics what
happens "in production".

Note: All of the above "reads" use the following flags for the various
resource filtering functions added to V2:

  - io_types: HWLOC_TYPE_FILTER_KEEP_IMPORTANT
  - cache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
  - icache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE

** Solution **

Run the `04-brokers-sierra2` XML files through the `hwloc-convert` test
utility in flux-core to emulate the first read in the "double read"
process.  Specifically it performs the first two steps of the process:
read V1-xml on disk with V2 API and serialize with V1 compatible writer.
The result is that `resource-query` and `flux ion-resource` now both
instantiate the same resource graph and thus produce the same results on
the same jobspecs and hwloc XML.

The resource graphs produced by these utilities now includes a socket in
between the nodes and GPUs, which affects a jobspec
request (basics/test013.yaml) and the expected output of two GPU-based
match tests (018.R.out and 021.R.out).  This commit updates the jobspec
and the expected outputs to include the socket resources.

Note: This commit does not attempt to normalize the resource graphs
produced by hwloc V1 versus V2.  As we discussed in
flux-framework#656, where this would cause issue in
production, we can leverage the use of `--load-allowlist` to filter out
resources that cause differences.  If one sticks strictly to Jobspec V1
and thus filters out `socket`s, this difference will be normalized out.
@SteVwonder (Member, Author)

Thanks @dongahn for the quick reviews! Just pushed the fixes and squashed.

@mergify[bot] merged commit 8d9a953 into flux-framework:master Jul 14, 2020
@SteVwonder deleted the hwloc2 branch July 14, 2020 22:35
Labels
merge-when-passing mergify.io - merge PR automatically once CI passes
Successfully merging this pull request may close these issues.

Hwloc 2.0+ Support
3 participants