
slow throughput on high core-count nodes #4365

Closed

grondo opened this issue Jun 10, 2022 · 3 comments

grondo (Contributor) commented Jun 10, 2022

I noticed that the throughput test, and related high-throughput workloads running real jobs, were very slow on modern systems with high core counts, e.g.

ƒ(s=1,d=0) grondo@corona171:~$ flux resource list -no {ncores}
48
ƒ(s=1,d=0) grondo@corona171:~/git/flux-core.git$ src/test/throughput.py -n 128
number of jobs: 128
submit time:    0.301 s (425.8 job/s)
script runtime: 0.652 s
job runtime:    0.449 s
throughput:     285.4 job/s (script: 196.2 job/s)
ƒ(s=1,d=0) grondo@corona171:~/git/flux-core.git$ src/test/throughput.py -xn 128
number of jobs: 128
submit time:    0.082 s (1558.5 job/s)
script runtime: 15.444s
job runtime:    15.431s
throughput:     8.3 job/s (script:   8.3 job/s)

@woodard had an intuition that the job shell's default use of CPU affinity might be to blame here, and that guess appears to have been on the right track: disabling CPU affinity yields roughly an 8x increase in throughput:

ƒ(s=1,d=0) grondo@corona171:~/git/flux-core.git$ src/test/throughput.py -o cpu-affinity=off -xn 256
number of jobs: 256
submit time:    0.382 s (670.4 job/s)
script runtime: 4.179 s
job runtime:    3.913 s
throughput:     65.4 job/s (script:  61.3 job/s)

However, that isn't the whole story. I ran the test again with the call to hwloc_bind(3) disabled (so no CPU affinity is actually set, but all of the code that computes the CPU mask via libhwloc still executes), and still saw the slow throughput.
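
A rough standalone illustration of that experiment (a sketch, not the actual shell code; it assumes the bind call in question is hwloc_set_cpubind(3), and it only looks at the first core):

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main (void)
{
    hwloc_topology_t topo;
    hwloc_obj_t core;
    char *maskstr = NULL;

    /* Build the full topology - exactly the expensive step under suspicion */
    if (hwloc_topology_init (&topo) < 0 || hwloc_topology_load (topo) < 0) {
        fprintf (stderr, "failed to load hwloc topology\n");
        return 1;
    }

    /* Compute a CPU mask the way an affinity plugin might... */
    if (!(core = hwloc_get_obj_by_type (topo, HWLOC_OBJ_CORE, 0))) {
        fprintf (stderr, "no core objects found\n");
        return 1;
    }
    hwloc_bitmap_asprintf (&maskstr, core->cpuset);
    printf ("would bind to cpuset %s\n", maskstr);

    /* ...but deliberately skip hwloc_set_cpubind(), so any slowdown seen
     * when running many of these comes from topology processing, not from
     * the binding itself.
     */
    free (maskstr);
    hwloc_topology_destroy (topo);
    return 0;
}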

This leads me to believe that libhwloc is the source of the problem here. Indeed, running hwloc-info on its own is slow:

$ time hwloc-info
depth 0:           1 Machine (type #0)
 depth 1:          2 Package (type #1)
  depth 2:         16 L3Cache (type #6)
   depth 3:        48 L2Cache (type #5)
    depth 4:       48 L1dCache (type #4)
     depth 5:      48 L1iCache (type #9)
      depth 6:     48 Core (type #2)
       depth 7:    96 PU (type #3)
Special depth -3:  2 NUMANode (type #13)
Special depth -4:  48 Bridge (type #14)
Special depth -5:  22 PCIDev (type #15)
Special depth -6:  15 OSDev (type #16)

real	0m1.433s
user	0m0.016s
sys	0m0.583s

But running 48 copies in parallel indicates that something is serializing execution of the hwloc program:

$ time pdsh -Rexec -f 48 -w [1-48] hwloc-info 
...
real	0m57.167s
user	0m1.408s
sys	0m43.121s

Perhaps the issue is all the sysfs access required by libhwloc?
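
The same serialization can be reproduced with libhwloc directly, without pdsh or hwloc-info. The following standalone sketch (not from the issue) forks 48 processes that each load the topology, and reports the total wall time; build with something like cc par-load.c $(pkg-config --cflags --libs hwloc).

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <hwloc.h>

#define NPROCS 48

static double now (void)
{
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main (void)
{
    double t0 = now ();

    for (int i = 0; i < NPROCS; i++) {
        pid_t pid = fork ();
        if (pid < 0) {
            perror ("fork");
            return 1;
        }
        if (pid == 0) {              /* child: one full topology scan */
            hwloc_topology_t topo;
            if (hwloc_topology_init (&topo) < 0
                || hwloc_topology_load (topo) < 0)
                _exit (1);
            hwloc_topology_destroy (topo);
            _exit (0);
        }
    }
    while (wait (NULL) > 0)          /* reap all children */
        ;
    printf ("%d concurrent topology loads: %.3fs\n", NPROCS, now () - t0);
    return 0;
}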

If this becomes an issue for real jobs, one approach might be to somehow cache the hwloc topology in the broker for reuse by the job shell. Or perhaps using libhwloc to compute affinity is overkill, and we could use a manual, much more efficient method.
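
For reference, the broker-side caching idea could be as simple as exporting the topology to an XML buffer once and letting each job shell rebuild its topology from that buffer instead of rescanning sysfs. A minimal sketch, assuming hwloc >= 2.0 (topo_export_xml and topo_load_from_xml are hypothetical helper names):

#include <string.h>
#include <hwloc.h>

/* Broker side: scan the machine once and hand back an XML buffer that
 * can be cached and served to job shells. The caller frees the buffer
 * with hwloc_free_xmlbuffer().
 */
static int topo_export_xml (hwloc_topology_t topo, char **xmlp, int *lenp)
{
    return hwloc_topology_export_xmlbuffer (topo, xmlp, lenp, 0);
}

/* Shell side: rebuild the topology from the cached XML instead of
 * rescanning. HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM tells hwloc that the
 * XML really does describe the local machine, so binding stays valid.
 */
static int topo_load_from_xml (const char *xml, hwloc_topology_t *topop)
{
    hwloc_topology_t topo;

    if (hwloc_topology_init (&topo) < 0)
        return -1;
    if (hwloc_topology_set_xmlbuffer (topo, xml, strlen (xml)) < 0
        || hwloc_topology_set_flags (topo,
                                     HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) < 0
        || hwloc_topology_load (topo) < 0) {
        hwloc_topology_destroy (topo);
        return -1;
    }
    *topop = topo;
    return 0;
}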

garlick (Member) commented Jun 10, 2022 via email

grondo (Contributor, Author) commented Jun 11, 2022

The shell could actually fetch the topology from the local resource module with the existing resource.topo-get RPC, if we open that RPC up to all users instead of only the instance owner. As an experiment, the patch below seems to resolve 80-90% of the slowdown.

Eventually the shell could cache this topology and hand it out to multiple shell plugins, including the pmi/pmix plugin (a rough sketch of that idea follows the patch). Just an idea for now; I'm not sure how critical this issue really is:

diff --git a/src/shell/affinity.c b/src/shell/affinity.c
index 98602fd66..98f58d675 100644
--- a/src/shell/affinity.c
+++ b/src/shell/affinity.c
@@ -158,10 +158,41 @@ static void shell_affinity_destroy (void *arg)
 
 /*  Initialize topology object for affinity processing.
  */
-static int shell_affinity_topology_init (struct shell_affinity *sa)
+static int shell_affinity_topology_init (flux_shell_t *shell,
+                                         struct shell_affinity *sa)
 {
+    const char *xml;
+    flux_t *h = flux_shell_get_flux (shell);
+
+    flux_future_t *f = flux_rpc (h,
+                                 "resource.topo-get",
+                                 NULL, FLUX_NODEID_ANY, 0);
+    if (flux_rpc_get (f, &xml) < 0)
+        return shell_log_errno ("resource.topo-get");
+
     if (hwloc_topology_init (&sa->topo) < 0)
         return shell_log_errno ("hwloc_topology_init");
+
+    if (hwloc_topology_set_xmlbuffer (sa->topo, xml, strlen (xml)) < 0)
+        return shell_log_errno ("hwloc_topology_set_xmlbuffer");
+
+    if (hwloc_topology_set_flags (sa->topo,
+                                  HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) < 0)
+        return shell_log_errno ("hwloc_topology_set_flags");
+
+
+#if HWLOC_API_VERSION >= 0x20000
+    /*  Keep only cores. This should speed up topology load.
+     *  See: https://hwloc.readthedocs.io/en/stable/faq.html
+     */
+    if (hwloc_topology_set_all_types_filter(sa->topo,
+                                            HWLOC_TYPE_FILTER_KEEP_NONE) < 0
+        || hwloc_topology_set_type_filter(sa->topo,
+                                          HWLOC_OBJ_CORE,
+                                          HWLOC_TYPE_FILTER_KEEP_ALL) < 0)
+        return shell_log_errno ("hwloc: failed to set core filtering");
+#endif /*  HWLOC_API_VERSION >= 0x20000 */
+
     if (hwloc_topology_load (sa->topo) < 0)
         return shell_log_errno ("hwloc_topology_load");
     if (topology_restrict_current (sa->topo) < 0)
@@ -178,7 +209,7 @@ static struct shell_affinity * shell_affinity_create (flux_shell_t *shell)
     struct shell_affinity *sa = calloc (1, sizeof (*sa));
     if (!sa)
         return NULL;
-    if (shell_affinity_topology_init (sa) < 0)
+    if (shell_affinity_topology_init (shell, sa) < 0)
         goto err;
     if (flux_shell_rank_info_unpack (shell,
                                      -1,

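A rough sketch of the "cache it in the shell" follow-on mentioned above (a hypothetical helper, not existing flux-core API; it reuses the resource.topo-get RPC from the patch and memoizes the result so the affinity and pmi/pmix plugins could share one fetch):

#include <stdlib.h>
#include <string.h>
#include <flux/core.h>
#include <flux/shell.h>

static const char *shell_topo_xml (flux_shell_t *shell)
{
    static char *cached = NULL;      /* one RPC per shell process */
    const char *xml;
    flux_future_t *f;
    flux_t *h;

    if (cached)
        return cached;
    if (!(h = flux_shell_get_flux (shell)))
        return NULL;
    if (!(f = flux_rpc (h, "resource.topo-get", NULL, FLUX_NODEID_ANY, 0)))
        return NULL;
    if (flux_rpc_get (f, &xml) < 0 || !(cached = strdup (xml))) {
        flux_future_destroy (f);
        return NULL;
    }
    flux_future_destroy (f);         /* safe: payload was copied */
    return cached;
}

Each plugin would then pass the returned string to hwloc_topology_set_xmlbuffer(), as in the patch above.
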
garlick (Member) commented Jun 11, 2022

That solution sounds completely reasonable to me!

grondo added a commit to grondo/flux-core that referenced this issue Jun 15, 2022
Problem: Loading hwloc topology can be very slow, especially on a
system with many cores and when possibly many processes are trying
to simultaneously call hwloc_topology_load(3). This can occur when
many short-running jobs are being launched by Flux, since the job
shell loads topology by default in the affinity plugin.

Since the job shell now caches the hwloc XML in the shell info object,
fetch this XML and use it to load topology, avoiding redundant scans
of the system. This may greatly improve job throughput on many-core
systems.

Fixes flux-framework#4365
mergify bot closed this as completed in 313ae8f on Jun 20, 2022