
Hwloc 2.0+ Support #656

Closed · SteVwonder opened this issue May 15, 2020 · 2 comments · Fixed by #677

@SteVwonder (Member)
One change between hwloc 1.x and 2.0+ that affects flux-sched is:

In hwloc v1.x, NUMA nodes were inside the tree; for instance, a Package contained 2 NUMA nodes, each of which contained an L3 and several caches.

Starting with hwloc v2.0, NUMA nodes are not in the main tree anymore. They are attached under objects as Memory Children on the side of normal children. This memory children list starts at obj->memory_first_child and its size is obj->memory_arity. Hence there can now exist two local NUMA nodes, for instance on Intel Xeon Phi processors.

If there are two NUMA nodes per package, a Group object may be added to keep cores together with their local NUMA node.

[Source]
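
For concreteness, here is a minimal sketch (not flux-sched code) of how the hwloc 2.x memory-children list described above can be walked; the printing is purely illustrative:

    #include <hwloc.h>
    #include <stdio.h>

    /* Illustrative only: list the NUMA nodes attached to an object as
     * memory children in hwloc 2.x.  The list starts at
     * obj->memory_first_child and has obj->memory_arity entries. */
    static void print_memory_children (hwloc_obj_t obj)
    {
        hwloc_obj_t mem;
        for (mem = obj->memory_first_child; mem; mem = mem->next_sibling) {
            if (mem->type == HWLOC_OBJ_NUMANODE)
                printf ("%s L#%u has memory child NUMANode L#%u\n",
                        hwloc_obj_type_string (obj->type),
                        obj->logical_index, mem->logical_index);
        }
    }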

This means that the NUMANode resource type never appears in the resource graph when using an XML generated by hwloc 2.0+ (even when the V1 compatibility flag is provided at export time). In the case where the NUMANode affects the structure of the tree, a Group resource appears instead. When the NUMANode doesn't affect the structure, the tree goes straight from Machine to Package to L3Cache (or Core or whatever used to be the child of NUMANode).

This creates issues for several of the tests in t4004-match-hwloc.t, where any of several conditions causes an error:

  • the expected output from resource-query contains the numanode vertex
  • the requested jobspec explicitly requests a numanode resource
  • the requested jobspec implicitly requires a common ancestor between two resources, where that ancestor used to be a numanode (at least, I think that is what is happening; I need to dig further into it. The failing jobspec is data/resource/jobspecs/basics/test013.yml)

One option is to not try to make these two versions of hwloc produce the same resource graph or the same output. This would hurt jobspec portability.

Another option is to try to make the resource graph generated with the hwloc2 reader look the same as (or at least similar to) the resource graph generated with the hwloc1 reader. This will require more work and may not be possible in every situation, but I think at least attempting it is worth the potential portability gains.

One thought I had in the direction of the latter option was to rename any Groups with a NUMANode child to be a NUMANode. This would put numanode back "in the tree", but it won't apply to every topology. Specifically, if the NUMANode doesn't affect the topology structure, then hwloc2 elides the Group object entirely and makes NUMANode a child of Machine:

(in .../data/hwloc-data/001N/exclusive)
$ lstopo-no-graphics -i 04-brokers/0.xml
Machine (16GB total)
  NUMANode L#0 (P#0 16GB)
  Package L#0 + L3 L#0 (20MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3 + PU L#3 (P#3)

Maybe, as we traverse the resource tree, whenever we detect a NUMANode at a given level we could insert the numanode back into the tree and then make the resources that used to be its siblings into its children. I'm worried about the potential corner cases with this approach; I don't have a good enough feel for all the possible topologies to be confident this wouldn't break anything. At a high level, it sounds like the inverse of the operation hwloc performed during the v1->v2 transition.
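
A rough sketch of the detection step for that idea (illustrative only, assuming the hwloc 2.x object layout quoted above; not actual flux-sched reader code):

    #include <hwloc.h>

    /* Illustrative only: would this Group object be the one we rename to
     * NUMANode?  It qualifies if a NUMANode hangs off its memory-children
     * list. */
    static int group_stands_in_for_numanode (hwloc_obj_t obj)
    {
        hwloc_obj_t mem;
        if (obj->type != HWLOC_OBJ_GROUP)
            return 0;
        for (mem = obj->memory_first_child; mem; mem = mem->next_sibling)
            if (mem->type == HWLOC_OBJ_NUMANODE)
                return 1;
        return 0;
    }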

@dongahn (Member) commented May 15, 2020

@SteVwonder: for normal short-term use cases, resource types like numanode and group will not be used for scheduling. So I'm wondering if it would be possible to normalize the resource graph representation produced by the hwloc reader by setting --load-whitelist for the test cases? That way, we only populate graph objects for the resource types listed with the option. One place this could break down is when we add socket as a resource type for the scheduler, since gpu then appears either as a sibling of socket or as a child of socket, depending on how hwloc v1 vs. v2 shapes the topology. But it may be worthwhile to try this quickly on this sharness test, since it won't require any code changes.

@garlick (Member) commented Jun 16, 2020

I think for TOSS 4 we will be building our packages against hwloc 2, so I'll attach this to the v0.9.0 milestone.

garlick added this to the v0.9.0 milestone Jun 16, 2020
SteVwonder added a commit to SteVwonder/flux-sched that referenced this issue Jul 14, 2020
** Problem **

The resource graph produced by the following process (referred to in
this commit message as "single read"):

- read V1-xml on disk with V2 API
- traverse hwloc topo with traversal APIs

is *NOT* equivalent to the graph produced by the following
process (referred to in this commit message as "double read"):

- read V1-xml on disk with V2 API
- serialize with V1 compatible writer
- read serialized V1-compatible XML with V2 API
- traverse hwloc topo with traversal APIs
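
As a rough sketch (not the actual reader code; error handling omitted), the "double read" round trip with the hwloc 2.x API looks roughly like:

    #include <hwloc.h>

    /* Rough sketch of the "double read": load V1 XML with the V2 API,
     * re-export it with the V1-compatible writer, then load that result
     * again with the V2 API.  Error handling omitted. */
    static void double_read (const char *v1_xml_path)
    {
        hwloc_topology_t topo1, topo2;
        char *xmlbuf;
        int xmllen;

        /* read V1-xml on disk with V2 API */
        hwloc_topology_init (&topo1);
        hwloc_topology_set_xml (topo1, v1_xml_path);
        hwloc_topology_load (topo1);

        /* serialize with V1 compatible writer */
        hwloc_topology_export_xmlbuffer (topo1, &xmlbuf, &xmllen,
                                         HWLOC_TOPOLOGY_EXPORT_XML_FLAG_V1);

        /* read serialized V1-compatible XML with V2 API */
        hwloc_topology_init (&topo2);
        hwloc_topology_set_xmlbuffer (topo2, xmlbuf, xmllen);
        hwloc_topology_load (topo2);

        /* ... traverse topo2 with the usual traversal APIs ... */

        hwloc_free_xmlbuffer (topo1, xmlbuf);
        hwloc_topology_destroy (topo1);
        hwloc_topology_destroy (topo2);
    }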

The "single read" process, when applied to our `04-brokers-sierra2`
hwloc XML data, produces a resource graph where the GPUs are direct
children of the compute node. The "double read" process, when applied
to our `04-brokers-sierra2` hwloc XML data, produces a resource graph
where the GPUs are direct children of the sockets.  In terms of
locality, the latter is "more correct".  The difference in these two
graphs breaks matches against jobspecs that assume the GPUs are direct
children of the node; specifically `basics/test013.yaml`.

The "single read" process is what happens when you test with the
`resource-query` utility.  The "double read" process is what happens
when you test with the `flux ion-resource` utility against a
`fluxion-resource` module that has been populated with xml from `flux
hwloc reload`.  The "double read" process also more closely mimics what
happens "in production".

Note: All of the above "reads" use the following flags for the various
resource filtering functions added to V2:

  - io_types: HWLOC_TYPE_FILTER_KEEP_IMPORTANT
  - cache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
  - icache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
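
As a sketch, those filters map onto hwloc 2.x calls roughly as follows (illustrative only; error handling omitted):

    #include <hwloc.h>

    /* Sketch of the filter setup listed above (hwloc 2.x API). */
    static void init_filtered_topology (hwloc_topology_t *topo)
    {
        hwloc_topology_init (topo);
        hwloc_topology_set_io_types_filter (*topo,
                                            HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
        hwloc_topology_set_cache_types_filter (*topo,
                                               HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
        hwloc_topology_set_icache_types_filter (*topo,
                                                HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
        /* ... set an XML source and call hwloc_topology_load() ... */
    }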

** Solution **

Run the `04-brokers-sierra2` XML files through the `hwloc-convert` test
utility in flux-core to emulate the first read in the "double read"
process.  Specifically it performs the first two steps of the process:
read V1-xml on disk with V2 API and serialize with V1 compatible writer.
The result is that `resource-query` and `flux ion-resource` now both
instantiate the same resource graph and thus produce the same results on
the same jobspecs and hwloc XML.

The resource graphs produced by these utilities now include a socket in
between the nodes and GPUs, which affects a jobspec
request (basics/test013.yaml) and the expected output of two GPU-based
match tests (018.R.out and 021.R.out).  This commit updates the jobspec
and the expected outputs to include the socket resources.

Note: This commit does not attempt to normalize the resource graphs
produced by hwloc V1 versus V2.  As we discussed in
flux-framework#656, where this would cause issues in
production, we can use `--load-allowlist` to filter out
resources that cause differences.  If one sticks strictly to Jobspec V1
and thus filters out `socket`s, this difference will be normalized out.

mergify bot closed this as completed in #677 Jul 14, 2020