
slow throughput on high core-count nodes #4365

Closed

grondo opened this issue Jun 10, 2022 · 3 comments

grondo (Contributor) commented Jun 10, 2022

I noticed that the throughput test, and related high-throughput workloads running real jobs, were very slow on modern systems with high core counts, e.g.

ƒ(s=1,d=0) grondo@corona171:~$ flux resource list -no {ncores}
48
ƒ(s=1,d=0) grondo@corona171:~/git/flux-core.git$ src/test/throughput.py -n 128
number of jobs: 128
submit time:    0.301 s (425.8 job/s)
script runtime: 0.652 s
job runtime:    0.449 s
throughput:     285.4 job/s (script: 196.2 job/s)
ƒ(s=1,d=0) grondo@corona171:~/git/flux-core.git$ src/test/throughput.py -xn 128
number of jobs: 128
submit time:    0.082 s (1558.5 job/s)
script runtime: 15.444s
job runtime:    15.431s
throughput:     8.3 job/s (script:   8.3 job/s)

@woodard had an intuition that the job shell's default use of CPU affinity might be to blame here, and that guess appears to have been on the right track: disabling CPU affinity yields roughly an 8x increase in throughput:

ƒ(s=1,d=0) grondo@corona171:~/git/flux-core.git$ src/test/throughput.py -o cpu-affinity=off -xn 256
number of jobs: 256
submit time:    0.382 s (670.4 job/s)
script runtime: 4.179 s
job runtime:    3.913 s
throughput:     65.4 job/s (script:  61.3 job/s)

However, that isn't the whole story. I ran the test again with the call to hwloc_bind(3) disabled (so no CPU affinity is actually set, but all of the code that computes the CPU mask via libhwloc still executes), and still saw the slow throughput.
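
A rough standalone illustration of that experiment (a sketch, not the actual shell code; it assumes the bind call in question is hwloc_set_cpubind(3), and it only looks at the first core):

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main (void)
{
    hwloc_topology_t topo;
    hwloc_obj_t core;
    char *maskstr = NULL;

    /* Build the full topology - exactly the expensive step under suspicion */
    if (hwloc_topology_init (&topo) < 0 || hwloc_topology_load (topo) < 0) {
        fprintf (stderr, "failed to load hwloc topology\n");
        return 1;
    }

    /* Compute a CPU mask the way an affinity plugin might... */
    if (!(core = hwloc_get_obj_by_type (topo, HWLOC_OBJ_CORE, 0))) {
        fprintf (stderr, "no core objects found\n");
        return 1;
    }
    hwloc_bitmap_asprintf (&maskstr, core->cpuset);
    printf ("would bind to cpuset %s\n", maskstr);

    /* ...but deliberately skip hwloc_set_cpubind(), so any slowdown seen
     * when running many of these comes from topology processing, not from
     * the binding itself.
     */
    free (maskstr);
    hwloc_topology_destroy (topo);
    return 0;
}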

This leads me to believe that libhwloc is the source of the problem here. Indeed, running hwloc-info on its own is slow:

$ time hwloc-info
depth 0:           1 Machine (type #0)
 depth 1:          2 Package (type #1)
  depth 2:         16 L3Cache (type #6)
   depth 3:        48 L2Cache (type #5)
    depth 4:       48 L1dCache (type #4)
     depth 5:      48 L1iCache (type #9)
      depth 6:     48 Core (type #2)
       depth 7:    96 PU (type #3)
Special depth -3:  2 NUMANode (type #13)
Special depth -4:  48 Bridge (type #14)
Special depth -5:  22 PCIDev (type #15)
Special depth -6:  15 OSDev (type #16)

real	0m1.433s
user	0m0.016s
sys	0m0.583s

But running 48 copies in parallel indicates that something is serializing execution of the hwloc program:

$ time pdsh -Rexec -f 48 -w [1-48] hwloc-info 
...
real	0m57.167s
user	0m1.408s
sys	0m43.121s

Perhaps the issue is all the sysfs access required by libhwloc?
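
The same serialization can be reproduced with libhwloc directly, without pdsh or hwloc-info. The following standalone sketch (not from the issue) forks 48 processes that each load the topology, and reports the total wall time; build with something like cc par-load.c $(pkg-config --cflags --libs hwloc).

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <hwloc.h>

#define NPROCS 48

static double now (void)
{
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main (void)
{
    double t0 = now ();

    for (int i = 0; i < NPROCS; i++) {
        pid_t pid = fork ();
        if (pid < 0) {
            perror ("fork");
            return 1;
        }
        if (pid == 0) {              /* child: one full topology scan */
            hwloc_topology_t topo;
            if (hwloc_topology_init (&topo) < 0
                || hwloc_topology_load (topo) < 0)
                _exit (1);
            hwloc_topology_destroy (topo);
            _exit (0);
        }
    }
    while (wait (NULL) > 0)          /* reap all children */
        ;
    printf ("%d concurrent topology loads: %.3fs\n", NPROCS, now () - t0);
    return 0;
}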

If this becomes an issue for real jobs, one approach might be to somehow cache the hwloc topology in the broker for reuse by the job shell. Or perhaps using libhwloc to compute affinity is overkill, and we could use a manual, much more efficient method.
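
For reference, the broker-side caching idea could be as simple as exporting the topology to an XML buffer once and letting each job shell rebuild its topology from that buffer instead of rescanning sysfs. A minimal sketch, assuming hwloc >= 2.0 (topo_export_xml and topo_load_from_xml are hypothetical helper names):

#include <string.h>
#include <hwloc.h>

/* Broker side: scan the machine once and hand back an XML buffer that
 * can be cached and served to job shells. The caller frees the buffer
 * with hwloc_free_xmlbuffer().
 */
static int topo_export_xml (hwloc_topology_t topo, char **xmlp, int *lenp)
{
    return hwloc_topology_export_xmlbuffer (topo, xmlp, lenp, 0);
}

/* Shell side: rebuild the topology from the cached XML instead of
 * rescanning. HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM tells hwloc that the
 * XML really does describe the local machine, so binding stays valid.
 */
static int topo_load_from_xml (const char *xml, hwloc_topology_t *topop)
{
    hwloc_topology_t topo;

    if (hwloc_topology_init (&topo) < 0)
        return -1;
    if (hwloc_topology_set_xmlbuffer (topo, xml, strlen (xml)) < 0
        || hwloc_topology_set_flags (topo,
                                     HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) < 0
        || hwloc_topology_load (topo) < 0) {
        hwloc_topology_destroy (topo);
        return -1;
    }
    *topop = topo;
    return 0;
}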

garlick (Member) commented Jun 10, 2022 via email

grondo (Contributor, Author) commented Jun 11, 2022

The shell could actually fetch the topology from the local resource module with the existing resource.topo-get RPC, if we open that RPC up to all users instead of only the instance owner. As an experiment, the patch below seems to resolve 80-90% of the slowdown.

Eventually the shell could cache this topology and hand it out to multiple shell plugins, including the pmi/pmix plugin (a rough sketch of that idea follows the patch). Just an idea for now; I'm not sure how critical this issue really is:

diff --git a/src/shell/affinity.c b/src/shell/affinity.c
index 98602fd66..98f58d675 100644
--- a/src/shell/affinity.c
+++ b/src/shell/affinity.c
@@ -158,10 +158,41 @@ static void shell_affinity_destroy (void *arg)
 
 /*  Initialize topology object for affinity processing.
  */
-static int shell_affinity_topology_init (struct shell_affinity *sa)
+static int shell_affinity_topology_init (flux_shell_t *shell,
+                                         struct shell_affinity *sa)
 {
+    const char *xml;
+    flux_t *h = flux_shell_get_flux (shell);
+
+    flux_future_t *f = flux_rpc (h,
+                                 "resource.topo-get",
+                                 NULL, FLUX_NODEID_ANY, 0);
+    if (flux_rpc_get (f, &xml) < 0)
+        return shell_log_errno ("resource.topo-get");
+
     if (hwloc_topology_init (&sa->topo) < 0)
         return shell_log_errno ("hwloc_topology_init");
+
+    if (hwloc_topology_set_xmlbuffer (sa->topo, xml, strlen (xml)) < 0)
+        return shell_log_errno ("hwloc_topology_set_xmlbuffer");
+
+    if (hwloc_topology_set_flags (sa->topo,
+                                  HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) < 0)
+        return shell_log_errno ("hwloc_topology_set_flags");
+
+
+#if HWLOC_API_VERSION >= 0x20000
+    /*  Keep only cores. This should speed up topology load.
+     *  See: https://hwloc.readthedocs.io/en/stable/faq.html
+     */
+    if (hwloc_topology_set_all_types_filter(sa->topo,
+                                            HWLOC_TYPE_FILTER_KEEP_NONE) < 0
+        || hwloc_topology_set_type_filter(sa->topo,
+                                          HWLOC_OBJ_CORE,
+                                          HWLOC_TYPE_FILTER_KEEP_ALL) < 0)
+        return shell_log_errno ("hwloc: failed to set core filtering");
+#endif /*  HWLOC_API_VERSION >= 0x20000 */
+
     if (hwloc_topology_load (sa->topo) < 0)
         return shell_log_errno ("hwloc_topology_load");
     if (topology_restrict_current (sa->topo) < 0)
@@ -178,7 +209,7 @@ static struct shell_affinity * shell_affinity_create (flux_shell_t *shell)
     struct shell_affinity *sa = calloc (1, sizeof (*sa));
     if (!sa)
         return NULL;
-    if (shell_affinity_topology_init (sa) < 0)
+    if (shell_affinity_topology_init (shell, sa) < 0)
         goto err;
     if (flux_shell_rank_info_unpack (shell,
                                      -1,

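A rough sketch of the "cache it in the shell" follow-on mentioned above (a hypothetical helper, not existing flux-core API; it reuses the resource.topo-get RPC from the patch and memoizes the result so the affinity and pmi/pmix plugins could share one fetch):

#include <stdlib.h>
#include <string.h>
#include <flux/core.h>
#include <flux/shell.h>

static const char *shell_topo_xml (flux_shell_t *shell)
{
    static char *cached = NULL;      /* one RPC per shell process */
    const char *xml;
    flux_future_t *f;
    flux_t *h;

    if (cached)
        return cached;
    if (!(h = flux_shell_get_flux (shell)))
        return NULL;
    if (!(f = flux_rpc (h, "resource.topo-get", NULL, FLUX_NODEID_ANY, 0)))
        return NULL;
    if (flux_rpc_get (f, &xml) < 0 || !(cached = strdup (xml))) {
        flux_future_destroy (f);
        return NULL;
    }
    flux_future_destroy (f);         /* safe: payload was copied */
    return cached;
}

Each plugin would then pass the returned string to hwloc_topology_set_xmlbuffer(), as in the patch above.
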
garlick (Member) commented Jun 11, 2022

That solution sounds completely reasonable to me!

grondo added a commit to grondo/flux-core that referenced this issue Jun 15, 2022
Problem: Loading hwloc topology can be very slow, especially on a
system with many cores and when possibly many processes are trying
to simultaneously call hwloc_topology_load(3). This can occur when
many short-running jobs are being launched by Flux, since the job
shell loads topology by default in the affinity plugin.

Since the job shell now caches the hwloc XML in the shell info object,
fetch this XML and use it to load topology, avoiding redundant scans
of the system. This may greatly improve job throughput on many-core
systems.

Fixes flux-framework#4365
mergify bot closed this as completed in 313ae8f on Jun 20, 2022