t: run sierra2 hwloc XML through hwloc-convert and fix the fallout
** Problem **

The resource graph produced by the following process (referred to in
this commit message as "single read"):

- read V1-xml on disk with V2 API
- traverse hwloc topo with traversal APIs

is *NOT* equivalent to the graph produced by the following
process (referred to in this commit message as "double read"):

- read V1-xml on disk with V2 API
- serialize with V1 compatible writer
- read serialized V1-compatible XML with V2 API
- traverse hwloc topo with traversal APIs

The "single read" process, when applied to our `04-brokers-sierra2`
hwloc XML data, produces a resource graph where the GPUs are direct
children of the compute node. The "double read" process, when applied
to our `04-brokers-sierra2` hwloc XML data, produces a resource graph
where the GPUs are direct children of the sockets.  In terms of
locality, the latter is "more correct".  The difference in these two
graphs breaks matches against jobspecs that assume the GPUs are direct
children of the node; specifically `basics/test013.yaml`.
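
To make the mismatch concrete, here is a toy sketch (not Fluxion code) of how a request that expects GPUs directly under the node fares against the two graph shapes. The `Vertex` class, the `satisfies` helper, and the dict-based request layout are all hypothetical, and the matcher is deliberately naive: it only looks at direct children, which is a simplifying assumption and not how the real traverser works.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """Hypothetical stand-in for a resource graph vertex."""
    type: str
    children: list = field(default_factory=list)


def satisfies(vertex, request):
    """True if `vertex` has the requested type and, for every nested
    request, enough *direct* children that satisfy it (toy rule)."""
    if vertex.type != request["type"]:
        return False
    for sub in request.get("with", []):
        hits = sum(1 for child in vertex.children if satisfies(child, sub))
        if hits < sub.get("count", 1):
            return False
    return True


def gpus(n):
    return [Vertex("gpu") for _ in range(n)]


# "single read" shape: GPUs are direct children of the node.
single_read = Vertex("node", [Vertex("socket"), Vertex("socket")] + gpus(4))
# "double read" shape: GPUs are direct children of the sockets.
double_read = Vertex("node", [Vertex("socket", gpus(2)),
                              Vertex("socket", gpus(2))])

gpu_under_node = {"type": "node", "with": [{"type": "gpu", "count": 2}]}
gpu_under_socket = {
    "type": "node",
    "with": [{"type": "socket", "count": 1,
              "with": [{"type": "gpu", "count": 2}]}],
}

print(satisfies(single_read, gpu_under_node))    # True
print(satisfies(double_read, gpu_under_node))    # False
print(satisfies(double_read, gpu_under_socket))  # True
```

Under this simplified rule, a request written for one graph shape does not match the other, which is the kind of breakage described above.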

The "single read" process is what happens when you test with the
`resource-query` utility.  The "double read" process is what happens
when you test with the `flux ion-resource` utility against a
`fluxion-resource` module that has been populated with xml from `flux
hwloc reload`.  The "double read" process also more closely mimics what
happens "in production".

Note: All of the above "reads" use the following flags for the various
resource filtering functions added to V2:

  - io_types: HWLOC_TYPE_FILTER_KEEP_IMPORTANT
  - cache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE
  - icache_types: HWLOC_TYPE_FILTER_KEEP_STRUCTURE

** Solution **

Run the `04-brokers-sierra2` XML files through the `hwloc-convert` test
utility in flux-core to emulate the first read in the "double read"
process.  Specifically it performs the first two steps of the process:
read V1-xml on disk with V2 API and serialize with V1 compatible writer.
The result is that `resource-query` and `flux ion-resource` now both
instantiate the same resource graph and thus produce the same results on
the same jobspecs and hwloc XML.

The resource graphs produced by these utilities now include a socket in
between the nodes and GPUs, which affects a jobspec
request (basics/test013.yaml) and the expected output of two GPU-based
match tests (018.R.out and 021.R.out).  This commit updates the jobspec
and the expected outputs to include the socket resources.
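
As a rough illustration of what changes in the expected outputs, the sketch below prints a small tree in the same children-first, three-dashes-per-level layout that appears in the `.R.out` files. The `emit` helper and the tuple-based tree are hypothetical and only mimic the layout; the bracketed suffixes are copied through as opaque labels.

```python
def emit(vertex, depth=1):
    """Yield one line per vertex, children before their parent,
    prefixed with three dashes per level of depth."""
    label, children = vertex
    for child in children:
        yield from emit(child, depth + 1)
    yield "-" * (3 * depth) + label


# GPUs directly under the node (shape of the old expected output).
before = ("cluster0[1:s]", [
    ("sierra3682[1:s]", [("gpu2[1:x]", []), ("gpu3[1:x]", [])]),
])

# GPUs under a socket (shape of the new expected output).
after = ("cluster0[1:s]", [
    ("sierra3682[1:s]", [
        ("socket1[1:s]", [("gpu2[1:x]", []), ("gpu3[1:x]", [])]),
    ]),
])

print("\n".join(emit(before)))
print()
print("\n".join(emit(after)))
```

The second tree gains one `socket1` line and its GPU lines move one level deeper, which is the shape of the change in `018.R.out`.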

Note: This commit does not attempt to normalize the resource graphs
produced by hwloc V1 versus V2.  As we discussed in
#656, where this would cause issues in
production, we can leverage the use of `--load-allowlist` to filter out
resources that cause differences.  If one sticks strictly to Jobspec V1
and thus filters out `socket`s, this difference will be normalized out.
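
The normalization argument can be sketched with a toy model. This assumes that a type missing from the allowlist is dropped and its children are re-attached to the nearest surviving ancestor; that is an assumption made for illustration, and `prune` is a hypothetical helper, not the `--load-allowlist` implementation.

```python
def prune(vertex, allowlist):
    """Return the vertices that replace `vertex` once every type outside
    `allowlist` is dropped. Children of a dropped vertex move up to its
    nearest surviving ancestor. Children are sorted so that sibling
    order does not affect the comparison below."""
    vtype, children = vertex
    kept = sorted(v for child in children for v in prune(child, allowlist))
    if vtype in allowlist:
        return [(vtype, kept)]
    return kept


def leaf(vtype):
    return (vtype, [])


# GPUs under the node versus GPUs under the sockets.
single_read = ("node", [("socket", [leaf("core")]),
                        ("socket", [leaf("core")]),
                        leaf("gpu"), leaf("gpu")])
double_read = ("node", [("socket", [leaf("core"), leaf("gpu")]),
                        ("socket", [leaf("core"), leaf("gpu")])])

allowlist = {"node", "core", "gpu"}  # no "socket"

print(single_read == double_read)                                     # False
print(prune(single_read, allowlist) == prune(double_read, allowlist)) # True
```

With `socket` left out of the allowlist, both shapes collapse to a node whose direct children are the cores and GPUs.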
SteVwonder committed Jul 14, 2020
1 parent f27f3b3 commit e350c5b
Showing 7 changed files with 2,152 additions and 2,670 deletions.
1,197 changes: 534 additions & 663 deletions t/data/hwloc-data/004N/exclusive/04-brokers-sierra2/0.xml

Large diffs are not rendered by default.

1,197 changes: 534 additions & 663 deletions t/data/hwloc-data/004N/exclusive/04-brokers-sierra2/1.xml

Large diffs are not rendered by default.

1,203 changes: 536 additions & 667 deletions t/data/hwloc-data/004N/exclusive/04-brokers-sierra2/2.xml

Large diffs are not rendered by default.

1,203 changes: 536 additions & 667 deletions t/data/hwloc-data/004N/exclusive/04-brokers-sierra2/3.xml

Large diffs are not rendered by default.

10 changes: 6 additions & 4 deletions t/data/resource/expected/basics/018.R.out
@@ -1,14 +1,16 @@
----------gpu2[1:x]
----------gpu3[1:x]
+------------gpu2[1:x]
+------------gpu3[1:x]
+---------socket1[1:s]
 ------sierra3682[1:s]
 ---cluster0[1:s]
 INFO: =============================
 INFO: JOBID=1
 INFO: RESOURCES=ALLOCATED
 INFO: SCHEDULED AT=Now
 INFO: =============================
----------gpu0[1:x]
----------gpu1[1:x]
+------------gpu0[1:x]
+------------gpu1[1:x]
+---------socket0[1:s]
 ------sierra3682[1:s]
 ---cluster0[1:s]
 INFO: =============================
8 changes: 4 additions & 4 deletions t/data/resource/expected/basics/021.R.out
@@ -20,9 +20,9 @@
 ------------core41[1:x]
 ------------core42[1:x]
 ------------core43[1:x]
+------------gpu2[1:x]
+------------gpu3[1:x]
 ---------socket1[1:x]
----------gpu2[1:x]
----------gpu3[1:x]
 ------sierra3682[1:s]
 ---cluster0[1:s]
 INFO: =============================
@@ -52,9 +52,9 @@ INFO: =============================
 ------------core19[1:x]
 ------------core20[1:x]
 ------------core21[1:x]
+------------gpu0[1:x]
+------------gpu1[1:x]
 ---------socket0[1:x]
----------gpu0[1:x]
----------gpu1[1:x]
 ------sierra3682[1:s]
 ---cluster0[1:s]
 INFO: =============================
4 changes: 2 additions & 2 deletions t/data/resource/jobspecs/basics/test013.yaml
@@ -12,8 +12,8 @@ resources:
             with:
               - type: core
                 count: 22
-          - type: gpu
-            count: 2
+              - type: gpu
+                count: 2
 # a comment
 attributes:
   system: