Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libsubprocess: reduce remote input prep/check #6002

Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented May 22, 2024

Problem: Profiling shows that a significant amount of time can be spent in the prep/check of remote subprocess input. This is even in the case when the input buffer is empty.

Enable the remote input prep/check only when the buffer is non-empty.

@garlick
Copy link
Member

garlick commented May 23, 2024

OK silly question. Could flux_subprocess_write() just call subprocess_write() (which makes the exec.write RPC) directly when the process is remote?

@chu11
Copy link
Member Author

chu11 commented May 23, 2024

OK silly question. Could flux_subprocess_write() just call subprocess_write() (which makes the exec.write RPC) directly when the process is remote?

Ya know what, i think it can. Lets try that.

@chu11
Copy link
Member Author

chu11 commented May 23, 2024

Ya know what, i think it can. Lets try that.

Ahhh I remember now. We allow users to flux_subprocess_close() on a remote subprocess channel even if the subprocess isn't running yet. The prep/check in the background handles this magic.

If we wait to call flux_subprocess_close() until the subprocess hits running, I think there would be fallout code adjustments elsewhere in flux-core, possibly non-trivial although I didn't look into it too deeply. Just as one example, the --noinput option in flux-exec closes stdin just a few lines below the call to flux_rexec() (i.e. reactor hasn't been re-entered yet, so the initial rexec RPC hasn't even been sent yet).

Perhaps a "LOCAL_UNBUF" on the input side is something we could investigate as well, but perhaps not something for this round?

@garlick
Copy link
Member

garlick commented May 23, 2024

Ah I didn't think of that. Hmm, could we just set the c->closed flag in the channel, and then upon receiving the pid from the remote server, send eofs for all closed channels?

I can't for the life of me think why the input buffering is ever helpful.

@chu11
Copy link
Member Author

chu11 commented May 23, 2024

Ah I didn't think of that. Hmm, could we just set the c->closed flag in the channel, and then upon receiving the pid from the remote server, send eofs for all closed channels?

Yeah, that's certainly a possibility. I hadn't thought about it too much yet mainly because ....

I just realized that next Tuesday is not the release deadline, it's in two Tuesdays. So I actually have time to experiment with an alternate approach. I was thinking this would be the approach for the upcoming release.

@garlick
Copy link
Member

garlick commented May 23, 2024

Well, it just seems like, in the input case, the buffer isn't helping at all, and in fact impacts scaling in 1:N configs like in job-exec because you have to allocate all the buffers, watchers, etc. So if it's not helping, it seems really tempting to just cut it out!

@chu11
Copy link
Member Author

chu11 commented May 24, 2024

Well, it just seems like, in the input case, the buffer isn't helping at all, and in fact impacts scaling in 1:N configs like in job-exec because you have to allocate all the buffers, watchers, etc. So if it's not helping, it seems really tempting to just cut it out!

As I played around with this more, we also allow users to flux_subprocess_write() before the job is running. If we want to keep that behavior, we probably need to keep the buffer around. But we could just subprocess_write() any data after the job is RUNNING.

@chu11 chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 1ea4905 to 3fcf596 Compare May 25, 2024 00:02
@chu11
Copy link
Member Author

chu11 commented May 25, 2024

after messing around today I think I got it to work. The only gotcha is that we still need the buffer when the caller writes to the subprocess before the subprocess is running. But we can completely remove the remote input prep/check.

I still to test against the reproducer to make sure this works, but it's looking pretty good and passing all tests.

@chu11
Copy link
Member Author

chu11 commented May 27, 2024

The reproducer on corona seems to work, and profiling shows exactly what we'd hope for.

-   23.38%     0.00%  job-exec         job-exec.so            [.] mod_main                              ▒
     mod_main                                                                                           ▒
     flux_reactor_run                                                                                   ▒
     ev_run                                                                                             ▒
   - ev_run                                                                                             ▒
      - 23.00% ev_invoke_pending                                                                        ▒
         - 16.50% check_cb                                                                              ▒
            - exec_start_cmds (inlined)                                                                 ▒
               - 16.48% exec_start_cmd (inlined)                                                        ▒
                  - 16.42% flux_rexec_ex                                                                ▒
                     - flux_rexec_ex                                                                    ▒
                        - 13.49% remote_exec                                                            ▒
                           - 13.42% subprocess_rexec                                                    ▒
                              - 7.20% flux_rpc_pack                                                     ▒
                                 + flux_rpc_vpack                                                       ▒
                              - 6.18% rexec_ctx_create (inlined)                                        ▒
                                 - 6.15% cmd_tojson                                                     ▒
                                    - 5.87% envz_tojson                                                 ▒
                                       + 2.13% json_object_setn_new_nocheck                             ▒
                                       + 1.05% 0x1555545b44bc                                           ▒
                                       + 1.04% 0x1555545b448e                                           ▒
                                         0.63% json_stringn                                             ▒
                        - 1.68% subprocess_remote_setup                                                 ▒
                           - 1.66% remote_setup_stdio (inlined)                                         ▒
                              - remote_channel_setup                                                    ▒
                                 - 1.34% fbuf_create                                                    ▒
                                    + 0.67% cbuf_create                                                 ▒
                                    + 0.64% __GI___libc_malloc (inlined)                                ▒
                        - 1.14% subprocess_create                                                       ▒
                           + 0.93% __socketpair (inlined)                                               ▒
         - 3.24% rexec_continuation                                                                     ▒
            - 1.53% subprocess_rexec_get                                                                ▒
               - 1.14% flux_rpc_get_unpack                                                              ▒
                  + flux_rpc_get_vunpack (inlined)                                                      ▒
            - 1.38% remote_output_local_unbuf (inlined)                                                 ▒
               - 1.32% exec_output_cb                                                                   ▒
                  - 1.29% output_cb                                                                     ▒
                     - 1.24% output_cb                                                                  ▒
                        + exec_barrier_enter (inlined)                                                  ▒
         - 2.67% handle_cb                                                                              ▒

@garlick
Copy link
Member

garlick commented May 28, 2024

Nice!

Hey, could we just send the exec matchtag instead of the pid in write requests and then get rid of the buffer? That would be a protocol change to RFC 42 (which I would be happy to propose if you think it would work out).

@chu11
Copy link
Member Author

chu11 commented May 28, 2024

Hey, could we just send the exec matchtag instead of the pid in write requests and then get rid of the buffer? That would be a protocol change to RFC 42 (which I would be happy to propose if you think it would work out).

Ahhh, you mean on the server side, lookup jobs in the hash via exec matchtag vs PID? Yeah, I think that would work, then all the buffering would only be on the server side. Although I think that's a follow up PR to this one as we're changing the protocol.

@garlick
Copy link
Member

garlick commented May 28, 2024

Perhaps we can slip the protocol change in before this PR then.

I'll get a PR open on the RFC for a start.

@chu11
Copy link
Member Author

chu11 commented May 28, 2024

Hey, could we just send the exec matchtag instead of the pid in write requests and then get rid of the buffer? That would be a protocol change to RFC 42 (which I would be happy to propose if you think it would work out).

Began looking through code and began wondering about the fallout of removing the local buffer.

  • It doesn't appear that libsdexec will buffer anything on writes if the process is not yet running, with the minor exception of whatever the fds of socketpair() buffers. This isn't an issue with rexec b/c we know it is running when flux_local_exec() is done.

  • Although I didn't look into it with great rigor, it's not clear to me if systemd will buffer stdin until the process is started and ready to receive data.

  • Would it be simpler to just not create the local buffer unless it's necessary? It's now somewhat obvious to me that the local buffer is not needed in most situations.

@garlick
Copy link
Member

garlick commented May 28, 2024

Well I guess we should check if write/close before start is required in the main sdexec case we care about right now which is job-exec.

We may need to add some buffering to sdexec in any case. I had forgotten about that. We won't have line buffering with the UNBUF flag...

@garlick
Copy link
Member

garlick commented May 28, 2024

Wait a sec, for local processes, there is an explicit check in flux_subprocess_write() that the subprocess is in RUNNING state or it fails with EPIPE.

https://github.com/flux-framework/flux-core/blob/master/src/common/libsubprocess/subprocess.c#L689

So we can't just send to the remote.

I guess your create on demand idea may be the best option. If we could delete the buffer after it empties out, even better.

But it also makes me wonder why we allow writing/closing before start for remote processes but not for local ones.

@chu11
Copy link
Member Author

chu11 commented May 28, 2024

But it also makes me wonder why we allow writing/closing before start for remote processes but not for local ones.

I think the main reason is that flux_local_exec() is fork/exec, so the process is running before the function returns, so writing before a local process is running is basically impossible. Dumb example is:

p = flux_local_exec();
flux_subprocess_write (p, "foobar", ...);
flux_reactor_run();

vs

p = flux_rexec();
flux_subprocess_write (p, "foobar", ...);
flux_reactor_run();

in the latter case, the remote subprocess RPC has probably not even been sent when flux_subprocess_write() is called, but with the former case, it's just fork/exec and the process is definitely running. So we'd like the API usage pattern between both to still be possible.

@garlick
Copy link
Member

garlick commented May 28, 2024

p = flux_rexec();
flux_subprocess_write (p, "foobar", ...);
flux_reactor_run();

in the latter case, the remote subprocess RPC has probably not even been sent when flux_subprocess_write() is called, but with the former case, it's just fork/exec and the process is definitely running. So we'd like the API usage pattern between both to still be possible.

In that example, the exec request definitely is sent before the write request, and the server is guaranteed to receive those messages in order. Thus if fork+exec/spawn really are synchronous, the remote pid will always be assigned to the subprocess before the write request is processed. Unless of course there was a failure but then it doesn't matter.

@chu11
Copy link
Member Author

chu11 commented May 28, 2024

Thus if fork+exec/spawn really are synchronous, the remote pid will always be assigned to the subprocess before the write request is processed.

I don't think we can be guaranteed this with sdexec though. After sdexec_start_transient_unit(), we've only informed systemd to run the command we want, but there's no guarantee it has done it before the write request comes in?

@garlick
Copy link
Member

garlick commented May 28, 2024

I don't think we can be guaranteed this with sdexec though. After sdexec_start_transient_unit(), we've only informed systemd to run the command we want, but there's no guarantee it has done it before the write request comes in?

Good point. I guess I was assuming we could fix that. In fact I should tack something on to #6013. For input I was just thinking we could enqueue the write request messages until the unit comes online.

@chu11 chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 3fcf596 to 1bf761c Compare May 28, 2024 22:57
@chu11
Copy link
Member Author

chu11 commented May 28, 2024

re-pushed, adding a commit to create the write buffer on demand when it is needed.

@chu11 chu11 force-pushed the issue5919_subprocess_remote_input_check branch 2 times, most recently from e1d2bd3 to d83e80d Compare May 31, 2024 21:01
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice cleanup!

I had a few comments, though nothing very substantial.

One other quick thought - it might not hurt to add a check that flux exec -n --service=sdexec works. Something like

diff --git a/t/t2409-sdexec.t b/t/t2409-sdexec.t
index cf7ab989c..491ce6717 100755
--- a/t/t2409-sdexec.t
+++ b/t/t2409-sdexec.t
@@ -55,6 +55,9 @@ test_expect_success 'clear broker logs' '
 test_expect_success 'sdexec true succeeds' '
        $sdexec -r 0 $true
 '
+test_expect_success 'sdexec -n cat succeeds' '
+       run_timeout 30 $sdexec -n -r 0 cat
+'
 test_expect_success 'sdexec false fails with exit code 1' '
        test_expect_code 1 $sdexec -r 0 $false
 '

Comment on lines 772 to 774
/* if process isn't running, will be sent after process
* converts to running. Or if it has already exited, it
* doesn't matter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe say "eof plus any previously writtten data" will be sent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code assumes that there is no buffered data if the state is RUNNING. Maybe it would be helpful to mention where that buffer is flushed.

Comment on lines 510 to 518
if (c->closed) {
if (subprocess_write (p->h,
p->service_name,
p->rank,
p->pid,
c->name,
NULL,
0,
true) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be combined with the subprocess_write() call above? E.g.

if (c->closed || fbuf_bytes (c->write_buffer) > 0) {
    ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh i think you're right. It would have been two calls if the user did it, but we can combine it at this point in the code.

Comment on lines +564 to +547
if (send_channel_data (p) < 0)
goto error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we destroy the fbuf here?

Copy link
Member Author

@chu11 chu11 Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh you mean in the error case if it wasn't destroyed in send_channel_data(). Hmmm, we certainly could. But since we're goto error, we're failing the subprocess and going to destroy the subprocess soon. So not super necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, derp! I didn't notice that it was destroyed in send_channel_data(). NM!

@chu11
Copy link
Member Author

chu11 commented Jun 3, 2024

One other quick thought - it might not hurt to add a check that flux exec -n --service=sdexec works. Something like

hmm it just seems that isn't covered in general. should add tests for the non-sdexec case too (i just grepped for -n and sdexec, which may not be the most accurate grep) oops, for non-sdexec shouldn't grep for sdexec :-)

@chu11 chu11 force-pushed the issue5919_subprocess_remote_input_check branch from d83e80d to 793807d Compare June 3, 2024 15:52
@chu11
Copy link
Member Author

chu11 commented Jun 3, 2024

re-pushed with tweaks per comments above. I also added an extra libsubprocess test that covers only closing stdin early vs write & closing it early.

@chu11 chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 793807d to 629ee0a Compare June 3, 2024 16:57
Problem: When starting remote processes, there are no tests to
ensure that we can write / close stdin before a process is running.

Add tests in libsubprocess/test/remote.c.
@chu11 chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 629ee0a to 21ed590 Compare June 3, 2024 16:58
chu11 added 3 commits June 3, 2024 12:25
Problem: The -n option in flux-exec is not covered with the sdexec
server.

Add coverage in t2409-sdexec.t.
Problem: Profiling shows that a significant amount of time
can be spent in the prep/check of remote subprocess input.
This is even in the case when the input buffer is empty.

It ends up that the prep/check is not necessary for remote input.
If the subprocess is already running, it can be written to directly
without buffering.  Buffering is only needed when a caller attempts
to write to the subprocess before the subprocess is running.

For remote subprocesses, remove all channel input prep/check.  Immediately
write to the remote subprocess if the subprocess is running.  If the subprocess
is not yet running, buffer the input and write it out later.
Problem: Profiling shows that creating the write buffer for
remote subprocesses eats up a healthy amount of cycles.  However,
the buffer is not needed for many circumstances.  It is only needed
when there is an attempt to write data to the subprocess before
the subprocess is running.

Create the write buffer only when it is needed.
@chu11 chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 21ed590 to b131d32 Compare June 3, 2024 19:26
@chu11
Copy link
Member Author

chu11 commented Jun 3, 2024

setting MWP, thanks!

@mergify mergify bot merged commit 148da68 into flux-framework:master Jun 3, 2024
34 of 35 checks passed
@chu11 chu11 deleted the issue5919_subprocess_remote_input_check branch June 4, 2024 14:54
Copy link

codecov bot commented Sep 1, 2024

Codecov Report

Attention: Patch coverage is 61.01695% with 23 lines in your changes missing coverage. Please review.

Project coverage is 83.30%. Comparing base (8e71f1c) to head (b131d32).
Report is 470 commits behind head on master.

Files with missing lines Patch % Lines
src/common/libsubprocess/subprocess.c 50.00% 14 Missing ⚠️
src/common/libsubprocess/remote.c 70.96% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6002      +/-   ##
==========================================
+ Coverage   83.27%   83.30%   +0.02%     
==========================================
  Files         519      519              
  Lines       83680    83637      -43     
==========================================
- Hits        69688    69675      -13     
+ Misses      13992    13962      -30     
Files with missing lines Coverage Δ
src/common/libsubprocess/remote.c 77.53% <70.96%> (+0.90%) ⬆️
src/common/libsubprocess/subprocess.c 87.87% <50.00%> (-1.69%) ⬇️

... and 12 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants