
Hwloc 2.0+ Support #656

Closed · SteVwonder opened this issue May 15, 2020 · 2 comments · Fixed by #677

@SteVwonder (Member)
One change between hwloc 1.x and 2.0+ that affects flux-sched is:

In hwloc v1.x, NUMA nodes were inside the tree; for instance, a Package contained 2 NUMA nodes, each of which contained an L3 and several caches.

Starting with hwloc v2.0, NUMA nodes are not in the main tree anymore. They are attached under objects as Memory Children on the side of normal children. This memory children list starts at obj->memory_first_child and its size is obj->memory_arity. Hence there can now exist two local NUMA nodes, for instance on Intel Xeon Phi processors.

If there are two NUMA nodes per package, a Group object may be added to keep cores together with their local NUMA node.

[Source]
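
For concreteness, here is a minimal sketch (not flux-sched code) of how the hwloc 2.x memory-children list described above can be walked; the printing is purely illustrative:

    #include <hwloc.h>
    #include <stdio.h>

    /* Illustrative only: list the NUMA nodes attached to an object as
     * memory children in hwloc 2.x.  The list starts at
     * obj->memory_first_child and has obj->memory_arity entries. */
    static void print_memory_children (hwloc_obj_t obj)
    {
        hwloc_obj_t mem;
        for (mem = obj->memory_first_child; mem; mem = mem->next_sibling) {
            if (mem->type == HWLOC_OBJ_NUMANODE)
                printf ("%s L#%u has memory child NUMANode L#%u\n",
                        hwloc_obj_type_string (obj->type),
                        obj->logical_index, mem->logical_index);
        }
    }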

This means that the NUMANode resource type never appears in the resource graph when using an XML generated by hwloc 2.0+ (even when the V1 compatibility flag is provided at export time). In the case where the NUMANode affects the structure of the tree, a Group resource appears instead. When the NUMANode doesn't affect the structure, the tree goes straight from Machine to Package to L3Cache (or Core or whatever used to be the child of NUMANode).

This creates issues for several of the tests in t4004-match-hwloc.t, where any of several conditions causes an error:

  • the expected output from resource-query contains the numanode vertex
  • the requested jobspec explicitly requests a numanode resource
  • the requested jobspec implicitly requires a common ancestor between two resources, where that ancestor used to be a numanode (at least, I think that is what is happening; I need to dig further into it. The failing jobspec is data/resource/jobspecs/basics/test013.yml)

One option is to not try to make these two versions of hwloc produce the same resource graph or the same output. This would hurt jobspec portability.

Another option is to try to make the resource graph generated with the hwloc2 reader look the same as (or at least similar to) the resource graph generated with the hwloc1 reader. This will require more work and may not be possible in every situation, but I think at least attempting it is worth the potential portability gains.

One thought I had in the direction of the latter option was to rename any Groups with a NUMANode child to be a NUMANode. This would put numanode back "in the tree", but it won't apply to every topology. Specifically, if the NUMANode doesn't affect the topology structure, then hwloc2 elides the Group object entirely and makes NUMANode a child of Machine:

(in .../data/hwloc-data/001N/exclusive)
$ lstopo-no-graphics -i 04-brokers/0.xml
Machine (16GB total)
  NUMANode L#0 (P#0 16GB)
  Package L#0 + L3 L#0 (20MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3 + PU L#3 (P#3)

Maybe, as we traverse the resource tree, whenever we detect a NUMANode at a given level we could insert the numanode back into the tree and then make the resources that used to be its siblings into its children. I'm worried about the potential corner cases with this approach; I don't have a good enough feel for all the possible topologies to be confident this wouldn't break anything. At a high level, it sounds like the inverse of the operation hwloc performed during the v1->v2 transition.
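
A rough sketch of the detection step for that idea (illustrative only, assuming the hwloc 2.x object layout quoted above; not actual flux-sched reader code):

    #include <hwloc.h>

    /* Illustrative only: would this Group object be the one we rename to
     * NUMANode?  It qualifies if a NUMANode hangs off its memory-children
     * list. */
    static int group_stands_in_for_numanode (hwloc_obj_t obj)
    {
        hwloc_obj_t mem;
        if (obj->type != HWLOC_OBJ_GROUP)
            return 0;
        for (mem = obj->memory_first_child; mem; mem = mem->next_sibling)
            if (mem->type == HWLOC_OBJ_NUMANODE)
                return 1;
        return 0;
    }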

@dongahn (Member) commented May 15, 2020

@SteVwonder: for normal short-term use cases, resource types like numanode and group will not be used for scheduling. So I'm wondering if it would be possible to normalize the resource graph representation produced by the hwloc reader by setting --load-whitelist for the test cases? That way, we only populate graph objects for the resource types listed with the option. One place this could break down is when we add socket as a resource type for the scheduler, since gpu then appears either as a sibling of socket or as a child of socket, depending on how hwloc v1 vs. v2 shapes the topology. But it may be worthwhile to try this quickly on this sharness test, since it won't require any code changes.

@garlick (Member) commented Jun 16, 2020

I think for TOSS 4 we will be building our packages against hwloc 2, so I'll attach this to the v0.9.0 milestone.

garlick added this to the v0.9.0 milestone Jun 16, 2020
SteVwonder added a commit to SteVwonder/flux-sched that referenced this issue Jul 14, 2020
** Problem **

The resource graph produced by the following process (referred to in
this commit message as "single read"):

- read V1-xml on disk with V2 API
- traverse hwloc topo with traversal APIs

is *NOT* equivalent to the graph produced by the following
process (referred to in this commit message as "double read"):

- read V1-xml on disk with V2 API
- serialize with V1 compatible writer
- read serialized V1-compatible XML with V2 API
- traverse hwloc topo with traversal APIs
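
As a rough sketch (not the actual reader code; error handling omitted), the "double read" round trip with the hwloc 2.x API looks roughly like:

    #include <hwloc.h>

    /* Rough sketch of the "double read": load V1 XML with the V2 API,
     * re-export it with the V1-compatible writer, then load that result
     * again with the V2 API.  Error handling omitted. */
    static void double_read (const char *v1_xml_path)
    {
        hwloc_topology_t topo1, topo2;
        char *xmlbuf;
        int xmllen;

        /* read V1-xml on disk with V2 API */
        hwloc_topology_init (&topo1);
        hwloc_topology_set_xml (topo1, v1_xml_path);
        hwloc_topology_load (topo1);

        /* serialize with V1 compatible writer */
        hwloc_topology_export_xmlbuffer (topo1, &xmlbuf, &xmllen,
                                         HWLOC_TOPOLOGY_EXPORT_XML_FLAG_V1);

        /* read serialized V1-compatible XML with V2 API */
        hwloc_topology_init (&topo2);
        hwloc_topology_set_xmlbuffer (topo2, xmlbuf, xmllen);
        hwloc_topology_load (topo2);

        /* ... traverse topo2 with the usual traversal APIs ... */

        hwloc_free_xmlbuffer (topo1, xmlbuf);
        hwloc_topology_destroy (topo1);
        hwloc_topology_destroy (topo2);
    }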

The "single read" process, when applied to our `04-brokers-sierra2`
hwloc XML data, produces a resource graph where the GPUs are direct
children of the compute node. The "double read" process, when applied
to our `04-brokers-sierra2` hwloc XML data, produces a resource graph
where the GPUs are direct children of the sockets.  In terms of
locality, the latter is "more correct".  The difference in these two
graphs breaks matches against jobspecs that assume the GPUs are direct
children of the node; specifically `basics/test013.yaml`.

The "single read" process is what happens when you test with the
`resource-query` utility.  The "double read" process is what happens
when you test with the `flux ion-resource` utility against a
`fluxion-resource` module that has been populated with xml from `flux
hwloc reload`.  The "double read" process also more closely mimics what
happens "in production".

Note: All of the above "reads" use the following flags for the various
resource filtering functions added to V2:

  - io_types: HWLOC_TYPE_FILTER_KEEP_IMPORTANT
  - cache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
  - icache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
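
As a sketch, those filters map onto hwloc 2.x calls roughly as follows (illustrative only; error handling omitted):

    #include <hwloc.h>

    /* Sketch of the filter setup listed above (hwloc 2.x API). */
    static void init_filtered_topology (hwloc_topology_t *topo)
    {
        hwloc_topology_init (topo);
        hwloc_topology_set_io_types_filter (*topo,
                                            HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
        hwloc_topology_set_cache_types_filter (*topo,
                                               HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
        hwloc_topology_set_icache_types_filter (*topo,
                                                HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
        /* ... set an XML source and call hwloc_topology_load() ... */
    }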

** Solution **

Run the `04-brokers-sierra2` XML files through the `hwloc-convert` test
utility in flux-core to emulate the first read in the "double read"
process.  Specifically it performs the first two steps of the process:
read V1-xml on disk with V2 API and serialize with V1 compatible writer.
The result is that `resource-query` and `flux ion-resource` now both
instantiate the same resource graph and thus produce the same results on
the same jobspecs and hwloc XML.

The resource graphs produced by these utilities now include a socket in
between the nodes and GPUs, which affects a jobspec
request (basics/test013.yaml) and the expected output of two GPU-based
match tests (018.R.out and 021.R.out).  This commit updates the jobspec
and the expected outputs to include the socket resources.

Note: This commit does not attempt to normalize the resource graphs
produced by hwloc V1 versus V2.  As we discussed in
flux-framework#656, where this would cause issues in
production, we can use `--load-allowlist` to filter out
resources that cause differences.  If one sticks strictly to Jobspec V1
and thus filters out `socket`s, this difference will be normalized out.

mergify bot closed this as completed in #677 Jul 14, 2020