kvs: eliminate per-job namespace event subscriptions on rank 0 #2777

Merged 5 commits on Feb 28, 2020

chu11
Member

@chu11 chu11 commented Feb 25, 2020

Per discussion in #2727. I'll just throw up my commit message here:

    On rank 0, we have to subscribe to all events on all KVS namespaces,
    as rank 0 must have knowledge of all KVS namespaces.  On the other
    hand, workers only need to know about the KVS namespaces that
    callers are using on that rank.
    
    To improve KVS namespace create performance, subscribe to all KVS
    namespaces on rank 0 once on initialization, instead of subscribing
    to each KVS namespace as it is created.  On workers, subscribe
    to only the KVS namespaces that are necessary.

Also threw in a fix for #2762
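
A rough sketch of the idea in the commit message, with subscriptions simulated by a counter instead of real `flux_event_subscribe()` calls. The `sub_ctx` struct, the `sim_*` function names, and the catch-all `"kvs.setroot"` prefix are assumptions made for illustration; this is not the code in the PR.

```c
#include <stdio.h>

/* Hypothetical stand-in for the KVS module context; only the fields
 * needed for this illustration. */
struct sub_ctx {
    unsigned int rank;      /* broker rank this module runs on */
    int subscribe_calls;    /* number of simulated subscribe requests */
};

/* Simulated subscribe: only counts the call. */
static int sim_subscribe (struct sub_ctx *ctx, const char *topic)
{
    (void)topic;
    ctx->subscribe_calls++;
    return 0;
}

/* Called once at module start.  Rank 0 subscribes to a prefix that
 * covers every namespace (the prefix name is an assumption). */
static int sim_init (struct sub_ctx *ctx)
{
    if (ctx->rank == 0)
        return sim_subscribe (ctx, "kvs.setroot");
    return 0;
}

/* Called when a namespace is created or first used on this rank.
 * Only workers need a per-namespace subscription. */
static int sim_namespace_create (struct sub_ctx *ctx, const char *ns)
{
    char topic[256];
    int n;

    if (ctx->rank == 0)
        return 0;   /* already covered by the subscription in sim_init() */
    n = snprintf (topic, sizeof (topic), "kvs.setroot-%s", ns);
    if (n < 0 || n >= (int)sizeof (topic))
        return -1;
    return sim_subscribe (ctx, topic);
}
```

In this model, rank 0 issues one subscribe request no matter how many namespaces are created, while a worker issues one per namespace it actually uses.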

@grondo
Contributor

grondo commented Feb 26, 2020

Definitely not a silver bullet, but there's an improvement here. There isn't much of a change in "single" alloc mode (still ~50 job/s), likely because the alloc mode is the limiter. However, when we enable "unlimited" alloc mode for sched-simple, I'm getting steady ~70 job/s on this PR branch:

test.sh

#!/bin/bash

echo "Default test:"
flux python bulksubmit.py jobs/*.json

echo
echo "sched-simple in unlimited alloc mode:"
flux module reload sched-simple unlimited
flux python bulksubmit.py jobs/*.json

Before:

$ srun -N64 --pty --mpi=none src/cmd/flux start bash test.sh
Default test:
bulksubmit: Starting...
bulksubmit: scheduling disabled, reason=Testing
bulksubmit: submitted 1024 jobs in 1.72s. 595.74job/s
bulksubmit: scheduling enabled
bulksubmit: First job finished in about 0.265s
|██████████████████████████████████████████████████████████| 100.0% (48.0 job/s)
bulksubmit: Ran 1024 jobs in 21.3s. 48.0 job/s

sched-simple in unlimited alloc mode:
bulksubmit: Starting...
bulksubmit: scheduling disabled, reason=Testing
bulksubmit: submitted 1024 jobs in 1.52s. 673.72job/s
bulksubmit: scheduling enabled
bulksubmit: First job finished in about 5.892s
|██████████████████████████████████████████████████████████| 100.0% (40.8 job/s)
bulksubmit: Ran 1024 jobs in 25.1s. 40.8 job/s

After:

$ srun -N64 --pty --mpi=none src/cmd/flux start bash test.sh
Default test:
bulksubmit: Starting...
bulksubmit: scheduling disabled, reason=Testing
bulksubmit: submitted 1024 jobs in 1.76s. 582.05job/s
bulksubmit: scheduling enabled
bulksubmit: First job finished in about 0.247s
|██████████████████████████████████████████████████████████| 100.0% (49.4 job/s)
bulksubmit: Ran 1024 jobs in 20.7s. 49.4 job/s

sched-simple in unlimited alloc mode:
bulksubmit: Starting...
bulksubmit: scheduling disabled, reason=Testing
bulksubmit: submitted 1024 jobs in 1.73s. 592.51job/s
bulksubmit: scheduling enabled
bulksubmit: First job finished in about 3.425s
|██████████████████████████████████████████████████████████| 100.0% (73.8 job/s)
bulksubmit: Ran 1024 jobs in 13.9s. 73.8 job/s

(the progress bar in the 2nd case is very bursty -- I think we're hitting general kvs busy-ness now)
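
For anyone comparing the runs above, the arithmetic is simple enough to sketch. The helper names below are made up for illustration, and because the run times printed by bulksubmit are rounded, recomputed rates can differ slightly from the job/s figures it reports.

```c
/* Jobs completed per second of wall-clock time. */
static double jobs_per_sec (int njobs, double seconds)
{
    return njobs / seconds;
}

/* Speedup for the same number of jobs: how many times faster the
 * "after" run finished compared to the "before" run. */
static double speedup (double before_seconds, double after_seconds)
{
    return before_seconds / after_seconds;
}
```

With the numbers above, `speedup (25.1, 13.9)` for the unlimited alloc mode runs comes out at roughly 1.8, while `speedup (21.3, 20.7)` for the default mode is close to 1.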

Member

@garlick garlick left a comment

Changes look reasonable to me. Thanks!

    }
    if (ctx->rank != 0) {
        if (asprintf (&setroot_topic, "kvs.setroot-%s", ns) < 0) {
            errno = ENOMEM;
Member

asprintf() already sets errno on error, so might want to drop the explicit errno = ENOMEM assignment in the cases you are touching in this PR.

Since the missing coverage is mainly these error paths, it might increase coverage a bit too.
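
A minimal sketch of what the simplified error path could look like. `setroot_topic_create()` is a hypothetical helper, not a function from kvs.c, and the sketch relies on the point made in this review comment that `asprintf()` sets `errno` itself on failure.

```c
#define _GNU_SOURCE   /* asprintf() is a GNU/BSD extension */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build the per-namespace setroot topic.
 * Returns a malloc'd string the caller must free(), or NULL on
 * failure with errno set. */
static char *setroot_topic_create (const char *ns)
{
    char *topic = NULL;

    if (!ns || strlen (ns) == 0) {
        errno = EINVAL;   /* our own check, so we set errno ourselves */
        return NULL;
    }
    if (asprintf (&topic, "kvs.setroot-%s", ns) < 0)
        return NULL;      /* no explicit errno = ENOMEM here */
    return topic;
}
```

The only explicit `errno` assignment left is for the argument check, which no library call would report on its own.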

@chu11
Member Author

chu11 commented Feb 26, 2020

re-pushed with asprintf changes per @garlick's comment. Fixed up in places other than just the one mentioned.

@chu11
Member Author

chu11 commented Feb 26, 2020

Definitely not a silver bullet, but there's an improvement here.

I had forgotten to run perf record on this branch. The waiting on flux_event_subscribe() is gone, and no obvious bottleneck remains: the time is mostly spent in ev, the reactor, & zeromq.

-   28.37%     0.12%  flux-broker-0    libflux-core.so.2.0.0         [.] ev_run
   - 28.25% ev_run
      - 26.68% ev_invoke_pending
         - 9.33% check_cb
            + 5.08% module_cb
            + 4.04% 0x2aaaab441d32
         - 5.39% handle_cb
            + 3.45% dispatch_message (inlined)
            + 1.85% flux_recv
         - 4.15% prepare_cb
            + 3.84% 0x2aaaab441d32
         + 1.62% transaction_check_cb
         + 1.61% transaction_prep_cb
         + 1.03% check_cb
         + 0.85% check_cb
         + 0.84% prepare_cb
           0.69% conn_read_cb
           0.54% prepare_cb
      + 1.47% epoll_poll
+   28.37%     0.00%  flux-broker-0    libflux-core.so.2.0.0         [.] flux_reactor_run
-   26.68%     0.05%  flux-broker-0    libflux-core.so.2.0.0         [.] ev_invoke_pending
   - 26.63% ev_invoke_pending
      - 9.33% check_cb
         + 5.08% module_cb
         + 4.04% 0x2aaaab441d32
      - 5.39% handle_cb
         + 3.45% dispatch_message (inlined)
         + 1.85% flux_recv
      - 4.15% prepare_cb
         + 3.84% 0x2aaaab441d32
      - 1.62% transaction_check_cb
         + 1.62% kvsroot_mgr_iter_roots
      - 1.61% transaction_prep_cb
         + kvsroot_mgr_iter_roots
      + 1.03% check_cb
      + 0.85% check_cb
      + 0.84% prepare_cb
        0.69% conn_read_cb
        0.54% prepare_cb

0x2aaaab441d32 maps to something in zeromq land.

I suspect improving performance past this will involve trying to eke out small gains in a bunch of different areas.

Edit: OR we could think of some batching mechanism, tell the KVS to make > 1 namespace in a single RPC request? But I suspect that would take major changes in job-manager, job-exec, etc. and that's way down the line.

@grondo
Contributor

grondo commented Feb 26, 2020

Edit: OR we could think of some batching mechanism, tell the KVS to make > 1 namespace in a single RPC request? But I suspect that would take major changes in job-manager, job-exec, etc. and that's way down the line.

I assume that the namespace_create/destroy is no longer a bottleneck for job throughput, and the bottleneck has now moved to the job-exec system (lots of kvs activity with jobs writing to eventlogs, etc.)

I think there will be some opportunities for batching up the final kvs commits for jobs when we redo the exec system, which might help some. Right now the final write and the kvs move of the guest namespace to the main namespace are done under a separate commit for each job.

        goto cleanup;
    }

    if (flux_event_subscribe (ctx->h, "kvs.namespace-remove") < 0) {
Contributor

Since the error handling for these 3 flux_event_subscribe calls is exactly the same, you could get another small bump in coverage by combining these conditionals into one. Not important though, so only if you feel like it

Member Author

Good idea, actually I can combine the flux_event_subscribe calls above it as well.
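
A sketch of the consolidation being discussed, using a stub in place of `flux_event_subscribe()` so it can be compiled on its own. The stub, its counters, and the `"kvs.setroot"` and `"kvs.error"` topic strings are assumptions for illustration; only `"kvs.namespace-remove"` appears in the diff context above.

```c
#include <string.h>

/* Hypothetical stand-in for flux_event_subscribe(): counts calls and
 * fails on one chosen topic so the error path can be exercised. */
static int stub_calls = 0;
static const char *stub_fail_topic = NULL;

static int stub_subscribe (const char *topic)
{
    stub_calls++;
    if (stub_fail_topic && strcmp (topic, stub_fail_topic) == 0)
        return -1;
    return 0;
}

/* All three calls share one error path, so they can share one
 * conditional.  The || operator short-circuits: after the first
 * failure the remaining calls are skipped. */
static int subscribe_all (void)
{
    if (stub_subscribe ("kvs.setroot") < 0
        || stub_subscribe ("kvs.error") < 0
        || stub_subscribe ("kvs.namespace-remove") < 0)
        return -1;
    return 0;
}
```

With a single conditional there is one error branch to cover in tests instead of three, which is where the small coverage bump would come from.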

Consolidate multiple calls to flux_event_subscribe(), that go through
the same error path, into a single if statement.

When a namespace is removed, unsubscribe from the
kvs.namespace-removed-<NS> event.

Fixes flux-framework#2762

On rank 0, we have to subscribe to all events on all KVS namespaces,
as rank 0 must have knowledge of all KVS namespaces.  On the other
hand, workers only need to know about the KVS namespaces that
callers are using on that rank.

To improve KVS namespace create performance, subscribe to all KVS
namespaces on rank 0 once on initialization, instead of subscribing
to each KVS namespace as it is created.  On workers, subscribe
to only the KVS namespaces that are necessary.

Fixes flux-framework#2727

asprintf already sets errno on error, so there is no need to explicitly
set errno = ENOMEM on asprintf errors.

@chu11
Member Author

chu11 commented Feb 26, 2020

re-pushed w/ the cleanup @grondo suggested. I was able to clean up in multiple places, so there's 1 additional cleanup patch added.

@garlick
Member

garlick commented Feb 26, 2020

Quick additional thought: could we rename the events so that only one subscription is necessary per namespace in the rank > 0 case?

@grondo
Contributor

grondo commented Feb 26, 2020

Quick additional thought: could we rename the events so that only one subscription is necessary per namespace in the rank > 0 case?

Oh, good idea. I'd be interested to see if there is any effect in the job throughput benchmark.

@chu11
Member Author

chu11 commented Feb 26, 2020

one builder failed with

expecting success: 
	id=$(flux jobspec srun -N4 "exit \$JOB_SHELL_RANK" | flux job submit) &&
	flux job wait-event -vt 3 $id finish | grep status=768

flux-job: wait-event timeout on event 'finish'

not ok 5 - job-exec: status is maximum job shell exit codes

assuming timeout b/c of slow travis?

@grondo
Contributor

grondo commented Feb 26, 2020

Hm, I've seen that same failure before. We should probably bump that timeout up I suppose...

@chu11
Member Author

chu11 commented Feb 26, 2020

Quick additional thought: could we rename the events so that only one subscription is necessary per namespace in the rank > 0 case?

Was pondering this and an obvious solution did come to mind. But perhaps there's event sub/pub trickery I don't know about?

If I renamed the events like this: kvs.setroot-%s -> kvs.workernamespace-setroot-%s, then workers could just subscribe on kvs.workernamespace.

But then I think I have to send the setroot event out two times? Once to kvs.setroot and once to kvs.workernamespace-setroot? So possibly not a net win.

But maybe there's a naming trick I haven't thought about.

@chu11
Member Author

chu11 commented Feb 26, 2020

Hm, I've seen that same failure before. We should probably bump that timeout up I suppose...

Argh, it failed twice in a row. I'll give it one more try, otherwise I'm bumping the timeout up.

@chu11
Member Author

chu11 commented Feb 26, 2020

Argh, it failed twice in a row. I'll give it one more try, otherwise I'm bumping the timeout up.

3 times in a row!

@chu11
Member Author

chu11 commented Feb 26, 2020

If I renamed the events like this: kvs.setroot-%s -> kvs.workernamespace-setroot-%s, then workers could just subscribe on kvs.workernamespace.

I'm clearly not that clever, as @garlick figured out we can do kvs.namespace-%s-setroot as the pattern, so rank 0 can subscribe to kvs.namespace and the workers can subscribe to kvs.namespace-%s.
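
To make the naming trick concrete, here is a small sketch that assumes event subscriptions match on topic prefix, which is what the proposal relies on. `topic_matches()` is a hypothetical helper written for this illustration and is not part of libflux.

```c
#include <stdbool.h>
#include <string.h>

/* Prefix match: a subscription matches any event topic that begins
 * with the subscription string. */
static bool topic_matches (const char *subscription, const char *topic)
{
    return strncmp (subscription, topic, strlen (subscription)) == 0;
}
```

Under this model, rank 0 subscribing to `kvs.namespace` matches `kvs.namespace-job1-setroot` and every other namespace's events, while a worker subscribing to `kvs.namespace-job1` matches only topics that start with that string. One thing to watch for: `kvs.namespace-job1` is also a prefix of `kvs.namespace-job10-setroot`, so a worker subscription probably wants to include the trailing delimiter, e.g. `kvs.namespace-job1-`.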

Increase test timeouts throughout tests to work around slowness
in travis.

@chu11
Member Author

chu11 commented Feb 26, 2020

re-pushed with increased timeouts in t2402 to hopefully work around slow asan/travis.

Will work on renaming the events in another PR, as that will involve modifying code in other modules (e.g. kvs-watch).

@codecov-io

Codecov Report

Merging #2777 into master will decrease coverage by 0.01%.
The diff coverage is 68.57%.

@@            Coverage Diff             @@
##           master    #2777      +/-   ##
==========================================
- Coverage   81.06%   81.05%   -0.02%     
==========================================
  Files         250      250              
  Lines       39399    39407       +8     
==========================================
+ Hits        31939    31940       +1     
- Misses       7460     7467       +7

Impacted Files                        Coverage Δ
src/modules/kvs/kvs.c                 66.53% <68.57%> (+0.36%) ⬆️
src/modules/job-info/watch.c          70.98% <0%> (-1.56%) ⬇️
src/modules/job-info/guest_watch.c    76.72% <0%> (-0.58%) ⬇️
src/broker/module.c                   74.76% <0%> (-0.47%) ⬇️
src/broker/broker.c                   73.3% <0%> (-0.2%) ⬇️

@chu11
Member Author

chu11 commented Feb 27, 2020

woo hoo, it finally passed

Contributor

@grondo grondo left a comment

LGTM. Nice improvement!

@grondo
Contributor

grondo commented Feb 27, 2020

I'm just going to edit the title for more clarity when release notes are developed.

@grondo grondo changed the title from "kvs: update how namespace events are subscribed to" to "kvs: eliminate per-job namespace event subscriptions on rank 0" Feb 27, 2020
@grondo
Contributor

grondo commented Feb 28, 2020

Setting merge-when-passing.

@mergify mergify bot merged commit dd2e16b into flux-framework:master Feb 28, 2020
@chu11 chu11 deleted the issue2727 branch June 5, 2021 18:07