flux-hwloc: remove ignore of `HWLOC_OBJ_GROUP` #3046

SteVwonder · 2020-07-14T17:33:52Z

Problem: When HWLOC_OBJ_GROUP is ignored, the GPUs on the
Sierra/Lassen clusters are represented in the resource topology as
direct children of the node. This topology ignores the fact that the
GPUs actually have locality with respect to the CPU sockets. It also
causes downstream effects in the fluxion scheduler and its test suite:
depending on how the hwloc is read (either via flux-core's flux-hwloc
or directly via the hwloc API), the resource topology changes (i.e.,
GPUs are children of the node versus the sockets). It is also worth
noting that the GPUs are children of the sockets when using the hwloc
V2 API, so ignoring the group creates a significant difference in the
topologies between hwloc versions.

Solution: remove the call to ignore HWLOC_OBJ_GROUP so that on the
Sierra/Lassen systems, the GPUs are children of the sockets. This also
normalizes the resource topology across reading methods and hwloc
versions. Now requesting a GPU on Sierra/Lassen can always be done with
a node->socket->gpu jobspec.
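
To illustrate the difference between the two topologies described above, here is a minimal toy sketch. It does not use hwloc or any Flux API: the `make` and `has_path` helpers and the nested-dict layout are hypothetical, chosen only to show why a node->socket->gpu request can be satisfied when the GPU sits under its socket but not when it hangs directly off the node.

```python
# Toy model only: nested dicts stand in for the resource topology.
# Nothing here calls hwloc or Flux; all names are illustrative assumptions.

def make(type_, *children):
    """Build a resource vertex with the given type and child vertices."""
    return {"type": type_, "children": list(children)}

def has_path(root, path):
    """Return True if the list of types in `path` can be followed downward from root."""
    if not path or root["type"] != path[0]:
        return False
    if len(path) == 1:
        return True
    return any(has_path(child, path[1:]) for child in root["children"])

# Group ignored: the GPU is a direct child of the node.
flat = make("node", make("socket", make("core")), make("gpu"))

# Group kept: the GPU is a child of its socket.
nested = make("node", make("socket", make("core"), make("gpu")))

request = ["node", "socket", "gpu"]
print(has_path(flat, request))    # False
print(has_path(nested, request))  # True
```

In this sketch the flat topology only matches a node->gpu request, which is why the shape of a valid request would otherwise depend on how the topology was read.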

Related PR: flux-framework/flux-sched/pull/677

grondo · 2020-07-14T18:50:51Z

The flux-sched build is failing because this PR is not based on current master. Ok to do an automated rebase via mergify?

grondo

As noted in Slack, LGTM!

Thank you for the detailed commit message!

grondo · 2020-07-14T18:52:00Z

Ah, never mind. I will set MWP.

SteVwonder · 2020-07-14T19:14:39Z

Thanks @grondo!

SteVwonder · 2020-07-14T19:19:13Z

@Mergifyio rebase


mergify · 2020-07-14T19:19:42Z

Command rebase: success

codecov-commenter · 2020-07-14T19:42:07Z

Codecov Report

Merging #3046 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3046      +/-   ##
==========================================
+ Coverage   80.84%   80.87%   +0.02%     
==========================================
  Files         270      270              
  Lines       42798    42796       -2     
==========================================
+ Hits        34602    34613      +11     
+ Misses       8196     8183      -13

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/cmd/builtin/hwloc.c | `84.39% <ø> (+0.16%)` | ⬆️ |
| src/modules/resource/acquire.c | `65.06% <0.00%> (-2.06%)` | ⬇️ |
| src/broker/broker.c | `75.31% <0.00%> (-0.11%)` | ⬇️ |
| src/common/libsubprocess/subprocess.c | `87.47% <0.00%> (+0.32%)` | ⬆️ |
| src/broker/runat.c | `84.61% <0.00%> (+0.80%)` | ⬆️ |
| src/common/libflux/handle.c | `85.61% <0.00%> (+2.05%)` | ⬆️ |
| src/broker/state_machine.c | `89.06% <0.00%> (+4.68%)` | ⬆️ |

SteVwonder requested a review from grondo July 14, 2020 17:33

grondo approved these changes Jul 14, 2020


grondo added the merge-when-passing label Jul 14, 2020

SteVwonder force-pushed the hwloc-unignore-group branch from 9dd99b3 to 9464337 Compare July 14, 2020 19:19

mergify bot merged commit 3105f8f into flux-framework:master Jul 14, 2020

SteVwonder deleted the hwloc-unignore-group branch July 14, 2020 19:58
