Hwloc 2.0+ Support #656
@SteVwonder: for normal short-term use cases, resource types like …
I think for TOSS 4 we will be building our packages against hwloc 2, so I'll attach this to the v0.9.0 milestone.
** Problem **

The resource graph produced by the following process (referred to in this commit message as "single read"):

- read V1-xml on disk with V2 API
- traverse hwloc topo with traversal APIs

is *NOT* equivalent to the graph produced by the following process (referred to in this commit message as "double read"):

- read V1-xml on disk with V2 API
- serialize with V1-compatible writer
- read serialized V1-compatible XML with V2 API
- traverse hwloc topo with traversal APIs

The "single read" process, when applied to our `04-brokers-sierra2` hwloc XML data, produces a resource graph where the GPUs are direct children of the compute node. The "double read" process, applied to the same data, produces a resource graph where the GPUs are direct children of the sockets. In terms of locality, the latter is "more correct". The difference between these two graphs breaks matches against jobspecs that assume the GPUs are direct children of the node; specifically `basics/test013.yaml`.

The "single read" process is what happens when you test with the `resource-query` utility. The "double read" process is what happens when you test with the `flux ion-resource` utility against a `fluxion-resource` module that has been populated with XML from `flux hwloc reload`. The "double read" process also more closely mimics what happens "in production".

Note: all of the above "reads" use the following flags for the various resource-filtering functions added in V2:

- io_types: HWLOC_TYPE_FILTER_KEEP_IMPORTANT
- cache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
- icache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE

** Solution **

Run the `04-brokers-sierra2` XML files through the `hwloc-convert` test utility in flux-core to emulate the first read in the "double read" process. Specifically, it performs the first two steps of the process: read V1-xml on disk with the V2 API and serialize with the V1-compatible writer. The result is that `resource-query` and `flux ion-resource` now both instantiate the same resource graph and thus produce the same results on the same jobspecs and hwloc XML.

The resource graphs produced by these utilities now include a socket between the nodes and GPUs, which affects a jobspec request (basics/test013.yaml) and the expected output of two GPU-based match tests (018.R.out and 021.R.out). This commit updates the jobspec and the expected outputs to include the socket resources.

Note: this commit does not attempt to normalize the resource graphs produced by hwloc V1 versus V2. As we discussed in flux-framework#656, where this would cause issues in production, we can leverage `--load-allowlist` to filter out resources that cause differences. If one sticks strictly to Jobspec V1 and thus filters out `socket`s, this difference will be normalized out.
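For anyone reproducing the two read paths outside of the flux-sched readers, a minimal sketch against the plain hwloc 2 C API might look like the following, using the filter settings listed in the commit message. The file paths are placeholders and error handling is omitted; this is not the actual flux-sched or flux-core code.

```c
/* Sketch of the "single read" vs. "double read" paths with the hwloc 2 C API.
 * File names are placeholders; the filter calls mirror the flags listed above. */
#include <hwloc.h>
#include <stdio.h>

static hwloc_topology_t load_with_filters (const char *xml_path)
{
    hwloc_topology_t topo;
    hwloc_topology_init (&topo);
    hwloc_topology_set_xml (topo, xml_path);   /* accepts V1 or V2 XML */
    hwloc_topology_set_io_types_filter (topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_set_cache_types_filter (topo, HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
    hwloc_topology_set_icache_types_filter (topo, HWLOC_TYPE_FILTER_KEEP_STRUCTURE);
    hwloc_topology_load (topo);
    return topo;
}

int main (void)
{
    /* "single read": load the V1 XML once and traverse it directly */
    hwloc_topology_t single = load_with_filters ("04-brokers-sierra2.xml");

    /* "double read": re-export with the V1-compatible writer, then reload */
    hwloc_topology_export_xml (single, "/tmp/v1-compat.xml",
                               HWLOC_TOPOLOGY_EXPORT_XML_FLAG_V1);
    hwloc_topology_t twice = load_with_filters ("/tmp/v1-compat.xml");

    printf ("single-read depth: %d, double-read depth: %d\n",
            hwloc_topology_get_depth (single),
            hwloc_topology_get_depth (twice));

    hwloc_topology_destroy (single);
    hwloc_topology_destroy (twice);
    return 0;
}
```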
One change between hwloc 1.0 and 2.0+ that affects flux-sched is that NUMA nodes are no longer regular objects in the main topology tree; they are attached to their parents as memory children instead. [Source]
This means that the `NUMANode` resource type never appears in the resource graph when using an XML generated by hwloc 2.0+ (even when the V1 compatibility flag is provided at export time). In the case where the `NUMANode` affects the structure of the tree, a `Group` resource appears instead. When the `NUMANode` doesn't affect the structure, the tree goes straight from `Machine` to `Package` to `L3Cache` (or `Core`, or whatever used to be the child of `NUMANode`).
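To make that structural change concrete, here is a rough traversal sketch (not flux-sched's reader code) showing where hwloc 2 puts NUMANodes: they appear on an object's memory-children list rather than in the normal `children[]` array, which is why a plain child traversal never sees them.

```c
/* Rough illustration: in hwloc 2, NUMANodes are reached through
 * memory_first_child/memory_arity rather than the normal children[] array. */
#include <hwloc.h>
#include <stdio.h>

static void walk (hwloc_obj_t obj, int depth)
{
    printf ("%*s%s\n", 2 * depth, "", hwloc_obj_type_string (obj->type));

    /* NUMANodes no longer show up here ... */
    for (unsigned i = 0; i < obj->arity; i++)
        walk (obj->children[i], depth + 1);

    /* ... they hang off the memory-children list instead */
    for (hwloc_obj_t m = obj->memory_first_child; m; m = m->next_sibling)
        printf ("%*s[memory] %s\n", 2 * (depth + 1), "",
                hwloc_obj_type_string (m->type));
}

int main (void)
{
    hwloc_topology_t topo;
    hwloc_topology_init (&topo);
    hwloc_topology_load (topo);     /* or hwloc_topology_set_xml() first, as above */
    walk (hwloc_get_root_obj (topo), 0);
    hwloc_topology_destroy (topo);
    return 0;
}
```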
This creates issues for several of the tests in `t4004-match-hwloc.t`, where one of several conditions causes an error:

- the expected `resource-query` output contains the `numanode` vertex
- the jobspec requests a `numanode` resource (`data/resource/jobspecs/basics/test013.yml`)

One option is to not try to make these two versions of hwloc produce the same resource graph or the same output. This would affect jobspec portability.
Another option is to try and make the resource graph generated with a hwloc2 reader look similar/the same as the resource graph generated with a hwloc1 reader. This will require more work and may not be possible in every situation, but I think at least attempting it is worth the potential portability gains.
One thought I had in the direction of the latter option was to rename any `Group` with a `NUMANode` child to be a `NUMANode`. This would put `numanode` back "in the tree", but it won't apply to every topology. Specifically, if the `NUMANode` doesn't affect the topology structure, then hwloc2 elides the `Group` object entirely and makes the `NUMANode` a child of `Machine`.

Maybe as we are traversing the resource tree, whenever we detect a `NUMANode` at a given level, we insert the `numanode` back into the tree and then make the resources that used to be its siblings into its children. I'm worried about the potential corner cases with this approach; I don't have a good enough feel for all the possible topologies to say confidently that this wouldn't break anything. At a high level, it sounds like the inverse of the operation performed during the v1->v2 transition.