flux-hwloc: remove ignore of HWLOC_OBJ_GROUP #3046

Merged
merged 1 commit into from
Jul 14, 2020

Conversation

SteVwonder
Member

Problem: When `HWLOC_OBJ_GROUP` is ignored, the GPUs on the
Sierra/Lassen clusters are represented in the resource topology as
direct children of the node. This topology ignores the fact that the
GPUs have locality with respect to the CPU sockets. It also has
downstream effects on the fluxion scheduler and its test suite:
depending on how the hwloc topology is read (either via flux-core's
flux-hwloc or directly via the hwloc API), the resource topology
changes (i.e., GPUs are children of the node versus the sockets). Note
also that the GPUs are children of the sockets when using the hwloc V2
API, so ignoring the group creates a significant difference in the
topologies between hwloc versions.

Solution: Remove the call that ignores `HWLOC_OBJ_GROUP` so that, on the
Sierra/Lassen systems, the GPUs are children of the sockets. This also
normalizes the resource topology across reading methods and hwloc
versions. A GPU on Sierra/Lassen can now always be requested with a
`node->socket->gpu` jobspec.
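
To illustrate why the parent of the GPU vertices matters, here is a minimal Python sketch of a toy resource tree. It is not Flux's or hwloc's data model: the `Resource` class, the `has_path` helper, and the counts (2 sockets, 4 GPUs) are hypothetical and chosen only for illustration. It shows that a `node->socket->gpu` request can only be satisfied when the GPUs sit under the sockets.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Resource:
    """One vertex in a toy resource tree (hypothetical, not Flux's real model)."""
    type: str
    children: List["Resource"] = field(default_factory=list)


def has_path(root: Resource, path: Sequence[str]) -> bool:
    """Return True if `path` names a parent->child chain of types starting at `root`."""
    if not path or root.type != path[0]:
        return False
    if len(path) == 1:
        return True
    return any(has_path(child, path[1:]) for child in root.children)


def make_node(gpus_under_sockets: bool) -> Resource:
    """Build a hypothetical node with 2 sockets and 4 GPUs (counts are assumptions)."""
    sockets = [Resource("socket", [Resource("core")]) for _ in range(2)]
    gpus = [Resource("gpu") for _ in range(4)]
    if gpus_under_sockets:
        # GPUs keep their socket locality: two per socket.
        for i, gpu in enumerate(gpus):
            sockets[i % 2].children.append(gpu)
        return Resource("node", sockets)
    # GPUs appear as direct children of the node, alongside the sockets.
    return Resource("node", sockets + gpus)


REQUEST = ("node", "socket", "gpu")
flat = make_node(gpus_under_sockets=False)    # shape described under "Problem"
nested = make_node(gpus_under_sockets=True)   # shape described under "Solution"

print(has_path(flat, REQUEST))    # False
print(has_path(nested, REQUEST))  # True
```

In this toy model, a request written against one shape fails against the other, which is the inconsistency between reading methods and hwloc versions that the change removes.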

Related PR: flux-framework/flux-sched/pull/677

@SteVwonder SteVwonder requested a review from grondo July 14, 2020 17:33
@grondo
Contributor

grondo commented Jul 14, 2020

The flux-sched build is failing because this PR is not based on current master. Ok to do an automated rebase via mergify?

Contributor

@grondo grondo left a comment

As noted in Slack, LGTM!

Thank you for the detailed commit message!

@grondo
Contributor

grondo commented Jul 14, 2020

Ah, never mind. I will set MWP.

@SteVwonder
Member Author

Thanks @grondo!

@SteVwonder
Member Author

@Mergifyio rebase

@mergify
Contributor

mergify bot commented Jul 14, 2020

Command rebase: success

@SteVwonder SteVwonder force-pushed the hwloc-unignore-group branch from 9dd99b3 to 9464337 Compare July 14, 2020 19:19
@codecov-commenter

Codecov Report

Merging #3046 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3046      +/-   ##
==========================================
+ Coverage   80.84%   80.87%   +0.02%     
==========================================
  Files         270      270              
  Lines       42798    42796       -2     
==========================================
+ Hits        34602    34613      +11     
+ Misses       8196     8183      -13     
| Impacted Files | Coverage Δ |
|---|---|
| src/cmd/builtin/hwloc.c | 84.39% <ø> (+0.16%) ⬆️ |
| src/modules/resource/acquire.c | 65.06% <0.00%> (-2.06%) ⬇️ |
| src/broker/broker.c | 75.31% <0.00%> (-0.11%) ⬇️ |
| src/common/libsubprocess/subprocess.c | 87.47% <0.00%> (+0.32%) ⬆️ |
| src/broker/runat.c | 84.61% <0.00%> (+0.80%) ⬆️ |
| src/common/libflux/handle.c | 85.61% <0.00%> (+2.05%) ⬆️ |
| src/broker/state_machine.c | 89.06% <0.00%> (+4.68%) ⬆️ |

@mergify mergify bot merged commit 3105f8f into flux-framework:master Jul 14, 2020
@SteVwonder SteVwonder deleted the hwloc-unignore-group branch July 14, 2020 19:58
3 participants