libsubprocess: reduce remote input prep/check #6002

chu11 · 2024-05-22T18:17:26Z

Problem: Profiling shows that a significant amount of time can be spent in the prep/check of remote subprocess input. This is even in the case when the input buffer is empty.

Enable the remote input prep/check only when the buffer is non-empty.

garlick · 2024-05-23T20:36:47Z

OK silly question. Could flux_subprocess_write() just call subprocess_write() (which makes the exec.write RPC) directly when the process is remote?

chu11 · 2024-05-23T21:53:41Z

OK silly question. Could flux_subprocess_write() just call subprocess_write() (which makes the exec.write RPC) directly when the process is remote?

Ya know what, i think it can. Lets try that.

chu11 · 2024-05-23T23:03:02Z

Ya know what, i think it can. Lets try that.

Ahhh I remember now. We allow users to flux_subprocess_close() on a remote subprocess channel even if the subprocess isn't running yet. The prep/check in the background handles this magic.

If we wait to call flux_subprocess_close() until the subprocess hits running, I think there would be fallout code adjustments elsewhere in flux-core, possibly non-trivial although I didn't look into it too deeply. Just as one example, the --noinput option in flux-exec closes stdin just a few lines below the call to flux_rexec() (i.e. reactor hasn't been re-entered yet, so the initial rexec RPC hasn't even been sent yet).

Perhaps a "LOCAL_UNBUF" on the input side is something we could investigate as well, but perhaps not something for this round?

garlick · 2024-05-23T23:37:55Z

Ah I didn't think of that. Hmm, could we just set the c->closed flag in the channel, and then upon receiving the pid from the remote server, send eofs for all closed channels?

I can't for the life of me think why the input buffering is ever helpful.

chu11 · 2024-05-23T23:43:37Z

Ah I didn't think of that. Hmm, could we just set the c->closed flag in the channel, and then upon receiving the pid from the remote server, send eofs for all closed channels?

Yeah, that's certainly a possibility. I hadn't thought about it too much yet mainly because ....

I just realized that next Tuesday is not the release deadline, it's in two Tuesdays. So I actually have time to experiment with an alternate approach. I was thinking this would be the approach for the upcoming release.

garlick · 2024-05-23T23:54:13Z

Well, it just seems like, in the input case, the buffer isn't helping at all, and in fact impacts scaling in 1:N configs like in job-exec because you have to allocate all the buffers, watchers, etc. So if it's not helping, it seems really tempting to just cut it out!

chu11 · 2024-05-24T18:41:32Z

Well, it just seems like, in the input case, the buffer isn't helping at all, and in fact impacts scaling in 1:N configs like in job-exec because you have to allocate all the buffers, watchers, etc. So if it's not helping, it seems really tempting to just cut it out!

As I played around with this more, we also allow users to flux_subprocess_write() before the job is running. If we want to keep that behavior, we probably need to keep the buffer around. But we could just subprocess_write() any data after the job is RUNNING.

chu11 · 2024-05-25T00:04:15Z

after messing around today I think I got it to work. The only gotcha is that we still need the buffer when the caller writes to the subprocess before the subprocess is running. But we can completely remove the remote input prep/check.

I still to test against the reproducer to make sure this works, but it's looking pretty good and passing all tests.

chu11 · 2024-05-27T04:47:03Z

The reproducer on corona seems to work, and profiling shows exactly what we'd hope for.

-   23.38%     0.00%  job-exec         job-exec.so            [.] mod_main                              ▒
     mod_main                                                                                           ▒
     flux_reactor_run                                                                                   ▒
     ev_run                                                                                             ▒
   - ev_run                                                                                             ▒
      - 23.00% ev_invoke_pending                                                                        ▒
         - 16.50% check_cb                                                                              ▒
            - exec_start_cmds (inlined)                                                                 ▒
               - 16.48% exec_start_cmd (inlined)                                                        ▒
                  - 16.42% flux_rexec_ex                                                                ▒
                     - flux_rexec_ex                                                                    ▒
                        - 13.49% remote_exec                                                            ▒
                           - 13.42% subprocess_rexec                                                    ▒
                              - 7.20% flux_rpc_pack                                                     ▒
                                 + flux_rpc_vpack                                                       ▒
                              - 6.18% rexec_ctx_create (inlined)                                        ▒
                                 - 6.15% cmd_tojson                                                     ▒
                                    - 5.87% envz_tojson                                                 ▒
                                       + 2.13% json_object_setn_new_nocheck                             ▒
                                       + 1.05% 0x1555545b44bc                                           ▒
                                       + 1.04% 0x1555545b448e                                           ▒
                                         0.63% json_stringn                                             ▒
                        - 1.68% subprocess_remote_setup                                                 ▒
                           - 1.66% remote_setup_stdio (inlined)                                         ▒
                              - remote_channel_setup                                                    ▒
                                 - 1.34% fbuf_create                                                    ▒
                                    + 0.67% cbuf_create                                                 ▒
                                    + 0.64% __GI___libc_malloc (inlined)                                ▒
                        - 1.14% subprocess_create                                                       ▒
                           + 0.93% __socketpair (inlined)                                               ▒
         - 3.24% rexec_continuation                                                                     ▒
            - 1.53% subprocess_rexec_get                                                                ▒
               - 1.14% flux_rpc_get_unpack                                                              ▒
                  + flux_rpc_get_vunpack (inlined)                                                      ▒
            - 1.38% remote_output_local_unbuf (inlined)                                                 ▒
               - 1.32% exec_output_cb                                                                   ▒
                  - 1.29% output_cb                                                                     ▒
                     - 1.24% output_cb                                                                  ▒
                        + exec_barrier_enter (inlined)                                                  ▒
         - 2.67% handle_cb                                                                              ▒

garlick · 2024-05-28T13:22:37Z

Nice!

Hey, could we just send the exec matchtag instead of the pid in write requests and then get rid of the buffer? That would be a protocol change to RFC 42 (which I would be happy to propose if you think it would work out).

chu11 · 2024-05-28T14:32:04Z

Hey, could we just send the exec matchtag instead of the pid in write requests and then get rid of the buffer? That would be a protocol change to RFC 42 (which I would be happy to propose if you think it would work out).

Ahhh, you mean on the server side, lookup jobs in the hash via exec matchtag vs PID? Yeah, I think that would work, then all the buffering would only be on the server side. Although I think that's a follow up PR to this one as we're changing the protocol.

garlick · 2024-05-28T16:58:42Z

Perhaps we can slip the protocol change in before this PR then.

I'll get a PR open on the RFC for a start.

chu11 · 2024-05-28T18:41:22Z

Hey, could we just send the exec matchtag instead of the pid in write requests and then get rid of the buffer? That would be a protocol change to RFC 42 (which I would be happy to propose if you think it would work out).

Began looking through code and began wondering about the fallout of removing the local buffer.

It doesn't appear that libsdexec will buffer anything on writes if the process is not yet running, with the minor exception of whatever the fds of socketpair() buffers. This isn't an issue with rexec b/c we know it is running when flux_local_exec() is done.
Although I didn't look into it with great rigor, it's not clear to me if systemd will buffer stdin until the process is started and ready to receive data.
Would it be simpler to just not create the local buffer unless it's necessary? It's now somewhat obvious to me that the local buffer is not needed in most situations.

garlick · 2024-05-28T19:14:43Z

Well I guess we should check if write/close before start is required in the main sdexec case we care about right now which is job-exec.

We may need to add some buffering to sdexec in any case. I had forgotten about that. We won't have line buffering with the UNBUF flag...

garlick · 2024-05-28T20:35:45Z

Wait a sec, for local processes, there is an explicit check in flux_subprocess_write() that the subprocess is in RUNNING state or it fails with EPIPE.

https://github.com/flux-framework/flux-core/blob/master/src/common/libsubprocess/subprocess.c#L689

So we can't just send to the remote.

I guess your create on demand idea may be the best option. If we could delete the buffer after it empties out, even better.

But it also makes me wonder why we allow writing/closing before start for remote processes but not for local ones.

chu11 · 2024-05-28T20:57:31Z

But it also makes me wonder why we allow writing/closing before start for remote processes but not for local ones.

I think the main reason is that flux_local_exec() is fork/exec, so the process is running before the function returns, so writing before a local process is running is basically impossible. Dumb example is:

p = flux_local_exec();
flux_subprocess_write (p, "foobar", ...);
flux_reactor_run();

vs

p = flux_rexec();
flux_subprocess_write (p, "foobar", ...);
flux_reactor_run();

in the latter case, the remote subprocess RPC has probably not even been sent when flux_subprocess_write() is called, but with the former case, it's just fork/exec and the process is definitely running. So we'd like the API usage pattern between both to still be possible.

garlick · 2024-05-28T21:21:33Z

p = flux_rexec();
flux_subprocess_write (p, "foobar", ...);
flux_reactor_run();

in the latter case, the remote subprocess RPC has probably not even been sent when flux_subprocess_write() is called, but with the former case, it's just fork/exec and the process is definitely running. So we'd like the API usage pattern between both to still be possible.

In that example, the exec request definitely is sent before the write request, and the server is guaranteed to receive those messages in order. Thus if fork+exec/spawn really are synchronous, the remote pid will always be assigned to the subprocess before the write request is processed. Unless of course there was a failure but then it doesn't matter.

chu11 · 2024-05-28T22:42:45Z

Thus if fork+exec/spawn really are synchronous, the remote pid will always be assigned to the subprocess before the write request is processed.

I don't think we can be guaranteed this with sdexec though. After sdexec_start_transient_unit(), we've only informed systemd to run the command we want, but there's no guarantee it has done it before the write request comes in?

garlick · 2024-05-28T22:56:24Z

I don't think we can be guaranteed this with sdexec though. After sdexec_start_transient_unit(), we've only informed systemd to run the command we want, but there's no guarantee it has done it before the write request comes in?

Good point. I guess I was assuming we could fix that. In fact I should tack something on to #6013. For input I was just thinking we could enqueue the write request messages until the unit comes online.

chu11 · 2024-05-28T22:58:37Z

re-pushed, adding a commit to create the write buffer on demand when it is needed.

garlick

This is a nice cleanup!

I had a few comments, though nothing very substantial.

One other quick thought - it might not hurt to add a check that flux exec -n --service=sdexec works. Something like

diff --git a/t/t2409-sdexec.t b/t/t2409-sdexec.t
index cf7ab989c..491ce6717 100755
--- a/t/t2409-sdexec.t
+++ b/t/t2409-sdexec.t
@@ -55,6 +55,9 @@ test_expect_success 'clear broker logs' '
 test_expect_success 'sdexec true succeeds' '
        $sdexec -r 0 $true
 '
+test_expect_success 'sdexec -n cat succeeds' '
+       run_timeout 30 $sdexec -n -r 0 cat
+'
 test_expect_success 'sdexec false fails with exit code 1' '
        test_expect_code 1 $sdexec -r 0 $false
 '

garlick · 2024-06-03T13:53:04Z

src/common/libsubprocess/subprocess.c

+        /* if process isn't running, will be sent after process
+         * converts to running.  Or if it has already exited, it
+         * doesn't matter.


Maybe say "eof plus any previously writtten data" will be sent?

This code assumes that there is no buffered data if the state is RUNNING. Maybe it would be helpful to mention where that buffer is flushed.

garlick · 2024-06-03T14:02:33Z

src/common/libsubprocess/remote.c

+        if (c->closed) {
+            if (subprocess_write (p->h,
+                                  p->service_name,
+                                  p->rank,
+                                  p->pid,
+                                  c->name,
+                                  NULL,
+                                  0,
+                                  true) < 0) {


Could this be combined with the subprocess_write() call above? E.g.

if (c->closed || fbuf_bytes (c->write_buffer) > 0) { ... }

ahhh i think you're right. It would have been two calls if the user did it, but we can combine it at this point in the code.

garlick · 2024-06-03T14:13:27Z

src/common/libsubprocess/remote.c

+        if (send_channel_data (p) < 0)
+            goto error;


Should we destroy the fbuf here?

ahhh you mean in the error case if it wasn't destroyed in send_channel_data(). Hmmm, we certainly could. But since we're goto error, we're failing the subprocess and going to destroy the subprocess soon. So not super necessary?

Oh, derp! I didn't notice that it was destroyed in send_channel_data(). NM!

chu11 · 2024-06-03T15:47:33Z

One other quick thought - it might not hurt to add a check that flux exec -n --service=sdexec works. Something like

hmm it just seems that isn't covered in general. ~~should add tests for the non-sdexec case too (i just grepped for -n and sdexec, which may not be the most accurate grep)~~ oops, for non-sdexec shouldn't grep for sdexec :-)

chu11 · 2024-06-03T15:53:16Z

re-pushed with tweaks per comments above. I also added an extra libsubprocess test that covers only closing stdin early vs write & closing it early.

Problem: When starting remote processes, there are no tests to ensure that we can write / close stdin before a process is running. Add tests in libsubprocess/test/remote.c.

Problem: The -n option in flux-exec is not covered with the sdexec server. Add coverage in t2409-sdexec.t.

Problem: Profiling shows that a significant amount of time can be spent in the prep/check of remote subprocess input. This is even in the case when the input buffer is empty. It ends up that the prep/check is not necessary for remote input. If the subprocess is already running, it can be written to directly without buffering. Buffering is only needed when a caller attempts to write to the subprocess before the subprocess is running. For remote subprocesses, remove all channel input prep/check. Immediately write to the remote subprocess if the subprocess is running. If the subprocess is not yet running, buffer the input and write it out later.

Problem: Profiling shows that creating the write buffer for remote subprocesses eats up a healthy amount of cycles. However, the buffer is not needed for many circumstances. It is only needed when there is an attempt to write data to the subprocess before the subprocess is running. Create the write buffer only when it is needed.

chu11 · 2024-06-03T20:15:30Z

setting MWP, thanks!

codecov · 2024-09-01T20:30:36Z

Codecov Report

Attention: Patch coverage is 61.01695% with 23 lines in your changes missing coverage. Please review.

Project coverage is 83.30%. Comparing base (8e71f1c) to head (b131d32).
Report is 470 commits behind head on master.

Files with missing lines	Patch %	Lines
src/common/libsubprocess/subprocess.c	50.00%	14 Missing ⚠️
src/common/libsubprocess/remote.c	70.96%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6002      +/-   ##
==========================================
+ Coverage   83.27%   83.30%   +0.02%     
==========================================
  Files         519      519              
  Lines       83680    83637      -43     
==========================================
- Hits        69688    69675      -13     
+ Misses      13992    13962      -30

Files with missing lines	Coverage Δ
src/common/libsubprocess/remote.c	`77.53% <70.96%> (+0.90%)`	⬆️
src/common/libsubprocess/subprocess.c	`87.87% <50.00%> (-1.69%)`	⬇️

... and 12 files with indirect coverage changes

chu11 mentioned this pull request May 23, 2024

shell: do not fail on oom init failure #6004

Merged

chu11 mentioned this pull request May 24, 2024

job-exec: possible scaling issue managing many job shells #5919

Closed

chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 1ea4905 to 3fcf596 Compare May 25, 2024 00:02

garlick mentioned this pull request May 28, 2024

rfc42: exec.write should use matchtag not pid flux-framework/rfc#416

Merged

chu11 mentioned this pull request May 28, 2024

libsubprocess: use matchtag instead of pid for flux_subprocess_write() #6013

Merged

chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 3fcf596 to 1bf761c Compare May 28, 2024 22:57

chu11 force-pushed the issue5919_subprocess_remote_input_check branch 2 times, most recently from e1d2bd3 to d83e80d Compare May 31, 2024 21:01

garlick approved these changes Jun 3, 2024

View reviewed changes

chu11 force-pushed the issue5919_subprocess_remote_input_check branch from d83e80d to 793807d Compare June 3, 2024 15:52

chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 793807d to 629ee0a Compare June 3, 2024 16:57

libsubprocess: cover pre-running write/close

64f578a

Problem: When starting remote processes, there are no tests to ensure that we can write / close stdin before a process is running. Add tests in libsubprocess/test/remote.c.

chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 629ee0a to 21ed590 Compare June 3, 2024 16:58

chu11 added 3 commits June 3, 2024 12:25

t: cover flux-exec -n with sdexec server

5ef83da

Problem: The -n option in flux-exec is not covered with the sdexec server. Add coverage in t2409-sdexec.t.

chu11 force-pushed the issue5919_subprocess_remote_input_check branch from 21ed590 to b131d32 Compare June 3, 2024 19:26

chu11 added the merge-when-passing label Jun 3, 2024

mergify bot merged commit 148da68 into flux-framework:master Jun 3, 2024
34 of 35 checks passed

chu11 deleted the issue5919_subprocess_remote_input_check branch June 4, 2024 14:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libsubprocess: reduce remote input prep/check #6002

libsubprocess: reduce remote input prep/check #6002

chu11 commented May 22, 2024

garlick commented May 23, 2024

chu11 commented May 23, 2024

chu11 commented May 23, 2024

garlick commented May 23, 2024

chu11 commented May 23, 2024

garlick commented May 23, 2024

chu11 commented May 24, 2024

chu11 commented May 25, 2024

chu11 commented May 27, 2024

garlick commented May 28, 2024

chu11 commented May 28, 2024

garlick commented May 28, 2024

chu11 commented May 28, 2024 •

edited

Loading

garlick commented May 28, 2024

garlick commented May 28, 2024 •

edited

Loading

chu11 commented May 28, 2024

garlick commented May 28, 2024

chu11 commented May 28, 2024 •

edited

Loading

garlick commented May 28, 2024

chu11 commented May 28, 2024

garlick left a comment

garlick Jun 3, 2024

garlick Jun 3, 2024

garlick Jun 3, 2024

chu11 Jun 3, 2024

garlick Jun 3, 2024

chu11 Jun 3, 2024 •

edited

Loading

garlick Jun 3, 2024

chu11 commented Jun 3, 2024 •

edited

Loading

chu11 commented Jun 3, 2024

chu11 commented Jun 3, 2024

codecov bot commented Sep 1, 2024

libsubprocess: reduce remote input prep/check #6002

libsubprocess: reduce remote input prep/check #6002

Conversation

chu11 commented May 22, 2024

garlick commented May 23, 2024

chu11 commented May 23, 2024

chu11 commented May 23, 2024

garlick commented May 23, 2024

chu11 commented May 23, 2024

garlick commented May 23, 2024

chu11 commented May 24, 2024

chu11 commented May 25, 2024

chu11 commented May 27, 2024

garlick commented May 28, 2024

chu11 commented May 28, 2024

garlick commented May 28, 2024

chu11 commented May 28, 2024 • edited Loading

garlick commented May 28, 2024

garlick commented May 28, 2024 • edited Loading

chu11 commented May 28, 2024

garlick commented May 28, 2024

chu11 commented May 28, 2024 • edited Loading

garlick commented May 28, 2024

chu11 commented May 28, 2024

garlick left a comment

Choose a reason for hiding this comment

garlick Jun 3, 2024

Choose a reason for hiding this comment

garlick Jun 3, 2024

Choose a reason for hiding this comment

garlick Jun 3, 2024

Choose a reason for hiding this comment

chu11 Jun 3, 2024

Choose a reason for hiding this comment

garlick Jun 3, 2024

Choose a reason for hiding this comment

chu11 Jun 3, 2024 • edited Loading

Choose a reason for hiding this comment

garlick Jun 3, 2024

Choose a reason for hiding this comment

chu11 commented Jun 3, 2024 • edited Loading

chu11 commented Jun 3, 2024

chu11 commented Jun 3, 2024

codecov bot commented Sep 1, 2024

Codecov Report

chu11 commented May 28, 2024 •

edited

Loading

garlick commented May 28, 2024 •

edited

Loading

chu11 commented May 28, 2024 •

edited

Loading

chu11 Jun 3, 2024 •

edited

Loading

chu11 commented Jun 3, 2024 •

edited

Loading