flux-hwloc: remove ignore of HWLOC_OBJ_GROUP
#3046
The flux-sched build is failing because this PR is not based on current master. OK to do an automated rebase via Mergify?
As noted in Slack, LGTM!
Thank you for the detailed commit message!
Ah, never mind. I will set MWP.
Thanks @grondo!
@Mergifyio rebase
Problem: When `HWLOC_OBJ_GROUP` is ignored, the GPUs on the Sierra/Lassen clusters are represented in the resource topology as direct children of the node. This topology ignores the fact that the GPUs have locality with respect to the CPU sockets. It is also causing downstream effects in the fluxion scheduler and its testsuite: depending on how the hwloc topology is read (either via flux-core's flux-hwloc or directly via the hwloc API), the resource topology changes (i.e., GPUs are children of the node versus the sockets). Note also that the GPUs are children of the sockets when using the hwloc V2 API, so ignoring the group creates a significant difference in the topologies between hwloc versions.

Solution: Remove the call to ignore `HWLOC_OBJ_GROUP` so that on the Sierra/Lassen systems, the GPUs are children of the sockets. This also normalizes the resource topology across reading methods and hwloc versions. Requesting a GPU on Sierra/Lassen can now always be done with a `node->socket->gpu` jobspec.

Related PR: flux-framework/flux-sched/pull/677
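To illustrate why the parent of the GPU matters for a `node->socket->gpu` request, here is a minimal, self-contained sketch. It is a toy model only: `struct res`, `add_child`, `path_matches`, and `request_matches` are hypothetical names invented for this example and are not part of hwloc, flux-core, or fluxion, and the matching logic is far simpler than what a real scheduler does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 8

/* Hypothetical, simplified resource vertex (not an hwloc or flux type). */
struct res {
    const char *type;
    struct res *children[MAX_CHILDREN];
    size_t nchildren;
};

/* Attach child directly under parent. */
void add_child(struct res *parent, struct res *child)
{
    assert(parent->nchildren < MAX_CHILDREN);
    parent->children[parent->nchildren++] = child;
}

/* Return 1 if path[0]->path[1]->...->path[len-1] matches starting at
 * root, where each element must be a direct child of the previous one. */
int path_matches(const struct res *root, const char *const *path, size_t len)
{
    if (len == 0 || strcmp(root->type, path[0]) != 0)
        return 0;
    if (len == 1)
        return 1;
    for (size_t i = 0; i < root->nchildren; i++) {
        if (path_matches(root->children[i], path + 1, len - 1))
            return 1;
    }
    return 0;
}

/* Build a one-node toy topology and test a node->socket->gpu request.
 * gpus_under_socket != 0: GPU is a child of the socket.
 * gpus_under_socket == 0: GPU is a direct child of the node, as the
 * commit message describes for the ignored-group case. */
int request_matches(int gpus_under_socket)
{
    struct res node = { .type = "node" };
    struct res sock = { .type = "socket" };
    struct res core = { .type = "core" };
    struct res gpu = { .type = "gpu" };
    const char *const request[] = { "node", "socket", "gpu" };

    add_child(&node, &sock);
    add_child(&sock, &core);
    add_child(gpus_under_socket ? &sock : &node, &gpu);
    return path_matches(&node, request, 3);
}
```

In this model, `request_matches(1)` succeeds while `request_matches(0)` does not, which mirrors the motivation above: the same request only works on both reading paths if they agree on where the GPU sits in the tree.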
Branch updated from 9dd99b3 to 9464337.
Codecov Report
Coverage Diff

| | master | #3046 | +/- |
|---|---|---|---|
| Coverage | 80.84% | 80.87% | +0.02% |
| Files | 270 | 270 | |
| Lines | 42798 | 42796 | -2 |
| Hits | 34602 | 34613 | +11 |
| Misses | 8196 | 8183 | -13 |