slow throughput on high core-count nodes #4365
Comments
One side note is that pmix provides the topology/cpuset to openmpi.
Perhaps this is why.
On Fri, Jun 10, 2022, 11:21 AM Mark Grondona wrote:
I noticed that the throughput test and related high throughput workloads
running real jobs were very slow on modern systems with high core counts,
e.g.
ƒ(s=1,d=0) ***@***.***:~$ flux resource list -no {ncores}
48
ƒ(s=1,d=0) ***@***.***:~/git/flux-core.git$ src/test/throughput.py -n 128
number of jobs: 128
submit time: 0.301 s (425.8 job/s)
script runtime: 0.652 s
job runtime: 0.449 s
throughput: 285.4 job/s (script: 196.2 job/s)
ƒ(s=1,d=0) ***@***.***:~/git/flux-core.git$ src/test/throughput.py -xn 128
number of jobs: 128
submit time: 0.082 s (1558.5 job/s)
script runtime: 15.444 s
job runtime: 15.431 s
throughput: 8.3 job/s (script: 8.3 job/s)
@woodard had an intuition that the job
shell's default use of CPU affinity might be to blame here, and that guess
appeared to be helpful. By disabling CPU affinity, we see an 8x increase in
throughput:
ƒ(s=1,d=0) ***@***.***:~/git/flux-core.git$ src/test/throughput.py -o cpu-affinity=off -xn 256
number of jobs: 256
submit time: 0.382 s (670.4 job/s)
script runtime: 4.179 s
job runtime: 3.913 s
throughput: 65.4 job/s (script: 61.3 job/s)
However, that isn't the whole story. I ran the test again with the call to
hwloc_bind(3) disabled (so no CPU affinity is actually set, but all the
code to compute the CPU mask via libhwloc is still executed), and still see
the slow throughput.
This leads me to believe that libhwloc is the source of the problem here.
Indeed, running hwloc-info is slow:
$ time hwloc-info
depth 0: 1 Machine (type #0)
depth 1: 2 Package (type #1)
depth 2: 16 L3Cache (type #6)
depth 3: 48 L2Cache (type #5)
depth 4: 48 L1dCache (type #4)
depth 5: 48 L1iCache (type #9)
depth 6: 48 Core (type #2)
depth 7: 96 PU (type #3)
Special depth -3: 2 NUMANode (type #13)
Special depth -4: 48 Bridge (type #14)
Special depth -5: 22 PCIDev (type #15)
Special depth -6: 15 OSDev (type #16)
real 0m1.433s
user 0m0.016s
sys 0m0.583s
But running 48 in parallel seems to indicate something is serializing the
execution of the hwloc program:
$ time pdsh -Rexec -f 48 -w [1-48] hwloc-info
...
real 0m57.167s
user 0m1.408s
sys 0m43.121s
Perhaps the issue is all the sysfs access required by libhwloc?
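As a quick side check (this sketch is not from the original report), the topology scan can be timed by itself with a minimal hwloc program, separating hwloc_topology_load(3) from everything else hwloc-info does:

/* Hypothetical standalone check: time only hwloc_topology_init/load.
 * On Linux the load step is where the sysfs scanning happens.
 * Build with: cc -o topotime topotime.c -lhwloc
 */
#include <stdio.h>
#include <time.h>
#include <hwloc.h>

int main (void)
{
    hwloc_topology_t topo;
    struct timespec t0, t1;

    clock_gettime (CLOCK_MONOTONIC, &t0);
    if (hwloc_topology_init (&topo) < 0
        || hwloc_topology_load (topo) < 0) {
        fprintf (stderr, "topology init/load failed\n");
        return 1;
    }
    clock_gettime (CLOCK_MONOTONIC, &t1);

    printf ("topology load: %.3f s\n",
            (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    hwloc_topology_destroy (topo);
    return 0;
}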
If this becomes an issue for real jobs, one approach might be to somehow
cache the hwloc topology in the broker for reuse by the job shell. Or
perhaps using libhwloc to compute affinity is overkill, and we can use a
manual and much more efficient method.
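To make that last idea concrete, here is a rough sketch, purely an assumption about what a "manual" method might look like and not flux-core code, that splits the already-allowed CPUs among local tasks using only sched_getaffinity(2)/sched_setaffinity(2); rank and ntasks are hypothetical inputs:

/* Sketch only: bind local task 'rank' of 'ntasks' to a contiguous slice
 * of the CPUs this process is already allowed to run on, without hwloc.
 * 'rank' and 'ntasks' are hypothetical inputs, not the job shell's API.
 */
#define _GNU_SOURCE
#include <sched.h>

static int bind_task (int rank, int ntasks)
{
    cpu_set_t avail, mine;
    int i, ncpus = 0, seen = 0, assigned = 0;

    if (sched_getaffinity (0, sizeof (avail), &avail) < 0)
        return -1;
    for (i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET (i, &avail))
            ncpus++;

    /* Give each task a block of roughly ncpus/ntasks available CPUs */
    int per_task = ncpus / ntasks > 0 ? ncpus / ntasks : 1;
    int first = rank * per_task;

    CPU_ZERO (&mine);
    for (i = 0; i < CPU_SETSIZE && assigned < per_task; i++) {
        if (!CPU_ISSET (i, &avail))
            continue;
        if (seen++ >= first) {
            CPU_SET (i, &mine);
            assigned++;
        }
    }
    return sched_setaffinity (0, sizeof (mine), &mine);
}

int main (void)
{
    /* Example: pretend to be local task 0 of 4 */
    return bind_task (0, 4) < 0 ? 1 : 0;
}

A real implementation would still need to handle remainders, ranks past the end of the set, and cores vs. hardware threads; the point is only that no topology scan is required.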
The shell could actually fetch the topology from the local resource module with the existing resource.topo-get RPC. Eventually the shell could cache this topology and hand it out to multiple shell plugins, including the pmi/pmix plugin. Just an idea for now; not sure how critical this issue really is:
diff --git a/src/shell/affinity.c b/src/shell/affinity.c
index 98602fd66..98f58d675 100644
--- a/src/shell/affinity.c
+++ b/src/shell/affinity.c
@@ -158,10 +158,41 @@ static void shell_affinity_destroy (void *arg)
 
 /* Initialize topology object for affinity processing.
  */
-static int shell_affinity_topology_init (struct shell_affinity *sa)
+static int shell_affinity_topology_init (flux_shell_t *shell,
+                                         struct shell_affinity *sa)
 {
+    const char *xml;
+    flux_t *h = flux_shell_get_flux (shell);
+
+    flux_future_t *f = flux_rpc (h,
+                                 "resource.topo-get",
+                                 NULL, FLUX_NODEID_ANY, 0);
+    if (flux_rpc_get (f, &xml) < 0)
+        return shell_log_errno ("resource.topo-get");
+
     if (hwloc_topology_init (&sa->topo) < 0)
         return shell_log_errno ("hwloc_topology_init");
+
+    if (hwloc_topology_set_xmlbuffer (sa->topo, xml, strlen (xml)) < 0)
+        return shell_log_errno ("hwloc_topology_set_xmlbuffer");
+
+    if (hwloc_topology_set_flags (sa->topo,
+                                  HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) < 0)
+        return shell_log_errno ("hwloc_topology_set_flags");
+
+
+#if HWLOC_API_VERSION >= 0x20000
+    /* Keep only cores. This should speed up topology load.
+     * See: https://hwloc.readthedocs.io/en/stable/faq.html
+     */
+    if (hwloc_topology_set_all_types_filter(sa->topo,
+                                            HWLOC_TYPE_FILTER_KEEP_NONE) < 0
+        || hwloc_topology_set_type_filter(sa->topo,
+                                          HWLOC_OBJ_CORE,
+                                          HWLOC_TYPE_FILTER_KEEP_ALL) < 0)
+        return shell_log_errno ("hwloc: failed to set core filtering");
+#endif /* HWLOC_API_VERSION >= 0x20000 */
+
     if (hwloc_topology_load (sa->topo) < 0)
         return shell_log_errno ("hwloc_topology_load");
     if (topology_restrict_current (sa->topo) < 0)
@@ -178,7 +209,7 @@ static struct shell_affinity * shell_affinity_create (flux_shell_t *shell)
     struct shell_affinity *sa = calloc (1, sizeof (*sa));
     if (!sa)
         return NULL;
-    if (shell_affinity_topology_init (sa) < 0)
+    if (shell_affinity_topology_init (shell, sa) < 0)
         goto err;
     if (flux_shell_rank_info_unpack (shell,
                                      -1,
That solution sounds completely reasonable to me!
Problem: Loading the hwloc topology can be very slow, especially on a system with many cores and when many processes may be calling hwloc_topology_load(3) simultaneously. This can occur when many short-running jobs are being launched by Flux, since the job shell loads the topology by default in the affinity plugin.

Since the job shell now caches the hwloc XML in the shell info object, fetch this XML and use it to load the topology, avoiding redundant scans of the system. This may greatly improve job throughput on many-core systems.

Fixes flux-framework#4365
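As an aside, here is a standalone sketch (not flux-core code; it assumes hwloc >= 2.0, where hwloc_topology_export_xmlbuffer(3) takes a flags argument) of the XML round trip this fix relies on: scan the system once, export to an XML buffer, then build a second topology from that buffer alone:

/* Standalone sketch (assumes hwloc >= 2.0): scan the system once,
 * export the topology to an XML buffer, then build a second topology
 * from that buffer alone -- the same shape as loading from a cached
 * resource.topo-get result instead of rescanning sysfs.
 * Build with: cc -o xmltopo xmltopo.c -lhwloc
 */
#include <stdio.h>
#include <hwloc.h>

int main (void)
{
    hwloc_topology_t sys, fromxml;
    char *xml;
    int len;

    /* The one real scan of the system (the expensive part) */
    if (hwloc_topology_init (&sys) < 0
        || hwloc_topology_load (sys) < 0)
        return 1;
    if (hwloc_topology_export_xmlbuffer (sys, &xml, &len, 0) < 0)
        return 1;

    /* Second topology comes entirely from the XML buffer */
    if (hwloc_topology_init (&fromxml) < 0
        || hwloc_topology_set_xmlbuffer (fromxml, xml, len) < 0
        || hwloc_topology_set_flags (fromxml,
                                     HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM) < 0
        || hwloc_topology_load (fromxml) < 0)
        return 1;

    printf ("xml: %d bytes, cores: %d\n", len,
            hwloc_get_nbobjs_by_type (fromxml, HWLOC_OBJ_CORE));

    hwloc_free_xmlbuffer (sys, xml);
    hwloc_topology_destroy (fromxml);
    hwloc_topology_destroy (sys);
    return 0;
}

Timing the two hwloc_topology_load(3) calls separately should show whether the XML path really avoids the per-job sysfs cost.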