
Improve efficiency of asynchronous futures #1840

Merged: 6 commits merged into flux-framework:master from the future-efficiency branch on Nov 21, 2018

Conversation

@grondo (Contributor) commented Nov 16, 2018

As described in #1839, this PR improves efficiency of asynchronous use of flux_future_t by eliminating the prepare watcher and only starting the check and idle watchers at the time of fulfillment instead of immediately when flux_future_then(3) is called. This reduces the number of active watchers significantly when there are many unfulfilled futures associated with the reactor loop.

This PR should be carefully examined and tested to ensure I haven't missed some subtle use case that is not covered in our testsuite. During development, I did find one case that was missed by the unit tests and luckily caught by another test in make check. I'll see if I can figure out what that particular use case was, and codify it in the future_t unit tests.
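
To make the asynchronous pattern under discussion concrete, here is a minimal, hedged example of `flux_future_then(3)` driving a continuation from the reactor (not code from this PR; the topic string is a placeholder and error handling is abbreviated):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <flux/core.h>

static void continuation (flux_future_t *f, void *arg)
{
    const char *s;
    if (flux_rpc_get (f, &s) < 0)
        fprintf (stderr, "rpc failed: %s\n", strerror (errno));
    flux_future_destroy (f);
}

int main (void)
{
    flux_t *h = flux_open (NULL, 0);

    /* "some.topic" is a placeholder service.method name */
    flux_future_t *f = flux_rpc (h, "some.topic", NULL, FLUX_NODEID_ANY, 0);

    /* Register the continuation.  With this PR, the check/idle watchers
     * that dispatch it are started only once the future is fulfilled,
     * instead of immediately here. */
    flux_future_then (f, -1., continuation, NULL);

    flux_reactor_run (flux_get_reactor (h), 0);
    flux_close (h);
    return 0;
}
```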

@codecov-io commented Nov 16, 2018

Codecov Report

Merging #1840 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1840      +/-   ##
==========================================
+ Coverage   79.91%   79.91%   +<.01%     
==========================================
  Files         196      196              
  Lines       35267    35263       -4     
==========================================
- Hits        28185    28182       -3     
+ Misses       7082     7081       -1
| Impacted Files | Coverage Δ |
| -------------- | ---------- |
| src/common/libflux/future.c | 87.29% <100%> (-0.17%) ⬇️ |
| src/common/libflux/response.c | 79.62% <0%> (-1.24%) ⬇️ |
| src/common/libflux/message.c | 81.51% <0%> (-0.13%) ⬇️ |
| src/broker/module.c | 83.83% <0%> (+0.27%) ⬆️ |
| src/common/libflux/mrpc.c | 87.89% <0%> (+1.17%) ⬆️ |

@grondo (Contributor, Author) commented Nov 16, 2018

Ok, I've pushed some updates to libflux/test/future.c that I think exercise the case I hit during development of this PR. The main case, IIRC, was a multiple-result future where the first result is obtained synchronously. In one version of this PR the subsequent async continuation was never called because the watchers were not started (I don't remember the exact reason why, sorry). This case was luckily exercised by t/kvs/commit_order.c.
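
A hedged sketch of that scenario, not the actual test added to libflux/test/future.c: a multiple-result future whose first result is consumed synchronously, after which an asynchronous continuation must still fire for the next result. The helper names are illustrative.

```c
#include <flux/core.h>

static void cont (flux_future_t *f, void *arg)
{
    int *called = arg;
    *called = 1;                            /* continuation did run */
}

static void check_sync_then_async (flux_reactor_t *r)
{
    flux_future_t *f = flux_future_create (NULL, NULL);
    flux_future_set_reactor (f, r);

    flux_future_fulfill (f, NULL, NULL);    /* first result */
    flux_future_get (f, NULL);              /* consumed synchronously */
    flux_future_reset (f);                  /* expect more results */

    int called = 0;
    flux_future_then (f, -1., cont, &called);
    flux_future_fulfill (f, NULL, NULL);    /* second result arrives */
    flux_reactor_run (r, 0);                /* cont should now be called */

    flux_future_destroy (f);
}
```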

@garlick (Member) commented Nov 17, 2018

Here's a little test that indicates this PR has a positive impact on scaling of concurrent RPCs, versus current master (results are wall clock, based on one sample, run on my single-core Ubuntu VM, no flux-security):

$ time flux job submitbench --fanout=FANOUT --repeat=4096 basic.yaml
| fanout | master 8c23603 (sec) | future-efficiency (sec) |
| ------ | -------------------- | ----------------------- |
| 256    | 14.156               | 13.101                  |
| 512    | 14.844               | 12.116                  |
| 1024   | 15.781               | 12.772                  |
| 2048   | 16.235               | 11.336                  |
| 4096   | 16.813               | 10.101                  |

(each run was in a fresh instance, so KVS content was not cumulative)

@garlick (Member) commented Nov 17, 2018

My vote is to put this in. It might be good to get one more set of eyes on it though first - @chu11?

@grondo (Contributor, Author) commented Nov 17, 2018

Thanks for taking an extra careful look @garlick, @chu11!

@chu11 (Member) commented Nov 20, 2018

took a look and everything LGTM

@chu11 (Member) commented Nov 20, 2018

restarted a builder that hit

  python/t0009-security.py:  PASS: N=2   PASS=2   FAIL=0 SKIP=0 XPASS=0 XFAIL=0
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

@grondo if you're happy with it I can hit the button

@grondo (Contributor, Author) commented Nov 20, 2018

History might look cleaner if #1850 goes in first, so we don't have a future sandwich between two kvs improvements. ;-)

@garlick (Member) commented Nov 20, 2018

Mmm, sandwich. One builder hit this valgrind error. I'll go ahead and restart it.

==1624== HEAP SUMMARY:
==1624==     in use at exit: 6,346,975 bytes in 182 blocks
==1624==   total heap usage: 952,580 allocs, 952,398 frees, 223,263,756 bytes allocated
==1624== 
==1624== 1,048,593 bytes in 1 blocks are possibly lost in loss record 99 of 102
==1624==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1624==    by 0x4E74898: cbuf_create (cbuf.c:233)
==1624==    by 0x4E639FC: flux_buffer_create (buffer.c:80)
==1624==    by 0x4E6F0AC: remote_channel_setup (remote.c:354)
==1624==    by 0x4E6F69B: remote_setup_stdio (remote.c:443)
==1624==    by 0x4E6F69B: subprocess_remote_setup (remote.c:493)
==1624==    by 0x4E72C1A: flux_rexec (subprocess.c:677)
==1624==    by 0xB7CD6CB: spawn_exec_handler (job.c:694)
==1624==    by 0xB7CD6CB: runevent_continuation (job.c:757)
==1624==    by 0x4E88E12: ev_invoke_pending (ev.c:3314)
==1624==    by 0x4E8C3D8: ev_run (ev.c:3717)
==1624==    by 0x4E589E2: flux_reactor_run (reactor.c:140)
==1624==    by 0xB7CE10F: mod_main (job.c:938)
==1624==    by 0x1144EB: module_thread (module.c:157)
==1624==    by 0x55BC6DA: start_thread (pthread_create.c:463)
==1624==    by 0x636388E: clone (clone.S:95)

@chu11 (Member) commented Nov 20, 2018

@garlick hmmm, appears to be new. Don't know if it's a new variant of #1641

Problem: several places in libflux/future.c test if a future
is ready or not ready by checking both f->result_valid *and*
f->fatal_errnum_valid. This requirement could too easily lead to a
future maintainer (hah) forgetting one of these checks, so abstract
this simple test into a convenience function and use it throughout
the code.

This change also cleans up `flux_future_is_ready()` to use the new
function. Although it handily used `flux_future_wait_for (f, 0.)`
to test for readiness, that amounted to the same check now
implemented in `future_is_ready`, and calling the new function
directly is clearer.
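
A minimal sketch of the helper described in this commit message; the real definition lives inside future.c next to its file-private struct, and only the two field names are taken from the text above.

```c
#include <stdbool.h>

/* Sketch only: struct flux_future is private to future.c; the field
 * names below come from the commit message above. */
static bool future_is_ready (flux_future_t *f)
{
    return (f->result_valid || f->fatal_errnum_valid);
}

/* flux_future_is_ready(3) can then delegate to the helper instead of
 * calling flux_future_wait_for (f, 0.). */
bool flux_future_is_ready (flux_future_t *f)
{
    return f ? future_is_ready (f) : false;
}
```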
Problem: futures run in asynchronous mode have their prepare and
check watchers started immediately when `flux_future_then(3)`
is called. This means that the `prepare_cb` and `check_cb` are
run for every unfulfilled future on every reactor loop iteration.
In a process with many futures (e.g. thousands of outstanding
RPCs) this can result in a large slowdown.

Instead of starting the prepare and check watchers at the time
`flux_future_then` is called, start the watchers only after the
future has been fulfilled (with result or fatal error) by
calling `then_context_start` from `post_fulfill`.

Fixes flux-framework#1839

The flux_future_t prepare watcher callback is currently used only
to start the idle watcher. Eliminate the middle man and start
the idle watcher directly in `then_context_start`.
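
A hedged sketch of the change described in the two commit messages above. `then_context_start` and `post_fulfill` are internal to future.c, so the struct layout and watcher field names here are assumptions for illustration only.

```c
/* Illustrative only: the `then` context layout is an assumption. */
struct then_context {
    flux_watcher_t *check;
    flux_watcher_t *idle;
    /* ... continuation, timeout timer, etc. ... */
};

static void then_context_start (flux_future_t *f)
{
    /* The prepare watcher is gone: start the idle watcher directly so
     * the reactor loop stays awake until the check watcher runs the
     * continuation. */
    flux_watcher_start (f->then->idle);
    flux_watcher_start (f->then->check);
}

static void post_fulfill (flux_future_t *f)
{
    /* Watchers are started here, at fulfillment (result or fatal
     * error), rather than in flux_future_then(). */
    if (f->then)
        then_context_start (f);
}
```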
Add unit tests to ensure fatal errors in flux_future_t are
handled in asynchronous mode (then context) both before and after
a synchronous get of the error.
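
A hedged sketch (not the actual test code) of the scenario these new unit tests cover: a fatal error posted on a future must still drive the `then` continuation, whether or not the error was first observed via a synchronous get. The function names are illustrative.

```c
#include <errno.h>
#include <stdbool.h>
#include <flux/core.h>

static void error_cont (flux_future_t *f, void *arg)
{
    int *errnum = arg;
    if (flux_future_get (f, NULL) < 0)
        *errnum = errno;                /* expect the fatal errno here */
}

static void check_fatal_error_async (flux_reactor_t *r, bool sync_get_first)
{
    flux_future_t *f = flux_future_create (NULL, NULL);
    flux_future_set_reactor (f, r);
    flux_future_fatal_error (f, EPERM, "fatal error");

    if (sync_get_first)                 /* observe the error synchronously */
        (void)flux_future_get (f, NULL);

    int errnum = 0;
    flux_future_then (f, -1., error_cont, &errnum);
    flux_reactor_run (r, 0);            /* error_cont should see EPERM */
    flux_future_destroy (f);
}
```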
Clean up leaked flux_reactor_t in libflux/test/future.c: test_simple().

Ensure that the case where a multiple-result future is used first
synchronously and then asynchronously is covered in the unit tests.

@grondo (Contributor, Author) commented Nov 21, 2018

Hit another "no output received" timeout after python/t0009-security.py and restarted

@chu11 (Member) commented Nov 21, 2018

man, another hang, restarted

@chu11 (Member) commented Nov 21, 2018

finally it all passed!

@chu11 chu11 merged commit bbe885e into flux-framework:master Nov 21, 2018
@grondo grondo deleted the future-efficiency branch February 8, 2019 00:43