libsubprocess: use matchtag instead of pid for flux_subprocess_write() #6013

garlick · 2024-05-28T21:53:34Z

This is an attempt to implement the subprocess protocol change proposed in flux-framework/rfc#416, which allows a write request to be sent to a remote subprocess before the pid is received.

Unfortunately what seemed like a straightforward change broke a bunch of tests and I've yet to figure out why!

Wanted to post what I had so far and continue discussing this with @chu11

chu11 · 2024-05-28T22:05:17Z

In our #6002 discussion, we were considering just not creating the write buffer unless it is needed. Were you thinking of going forward with this protocol change anyways?

garlick · 2024-05-28T22:28:12Z

See my last couple of comments in #6002. It seems like this should work without any buffering. IOW the only thing preventing flux_subprocess_write() on a remote subprocess without waiting for the status callback to notify that the process is RUNNING is the lack of a token to correlate the request to the subprocess. The matchtag should work for this. I might be missing something important though.

garlick · 2024-05-29T14:57:19Z

I'll be danged if I know why this doesn't work. I've wasted half a day trying to figure it out and am getting nowhere. Parking for now.

chu11 · 2024-05-31T23:40:53Z

I played around with the branch and am equally befuddled, especially on the non-sdexec side of the code. The changes seem so simple and obvious.

I tried to see if there was a racy scenario where we were writing earlier than we could before (b/c matchtag is available immediately, whereas pid wasn't), but I don't think that is what's happening.

I also rebased your branch on master, in case some of the recent fixes we did were part of the problem.

hmmm

garlick · 2024-06-06T01:25:48Z

I figured it out - we need to match both the matchtag and the sender on the server side. Otherwise input can go to the wrong subprocess, which causes myriad test suite failures. Derp! 🤦‍♂️
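To illustrate the fix described above, here is a minimal sketch of a server-side lookup keyed on both values. Matchtags are allocated per client handle, so two clients can legitimately use the same number. The `proc_entry` struct and the `entry_match()` and `lookup()` helpers are hypothetical stand-ins, not the code from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a server-side subprocess record. */
struct proc_entry {
    const char *sender;   /* uuid of the client that sent the exec request */
    uint32_t matchtag;    /* matchtag of that exec request */
    const char *name;     /* label used only for illustration */
};

/* A write request refers to this entry only if BOTH the sender and the
 * matchtag agree.  Comparing the matchtag alone is ambiguous because
 * different clients may reuse the same value.
 */
static bool entry_match (const struct proc_entry *e,
                         const char *sender,
                         uint32_t matchtag)
{
    return e->matchtag == matchtag && strcmp (e->sender, sender) == 0;
}

/* Return the first matching entry, or NULL if there is none. */
static const struct proc_entry *lookup (const struct proc_entry *tab,
                                        size_t count,
                                        const char *sender,
                                        uint32_t matchtag)
{
    assert (tab != NULL && sender != NULL);
    for (size_t i = 0; i < count; i++) {
        if (entry_match (&tab[i], sender, matchtag))
            return &tab[i];
    }
    return NULL;
}
```

With two entries that share matchtag 1 but come from senders "client-A" and "client-B", a lookup for ("client-B", 1) returns only the second entry. A matchtag-only comparison would have returned the first, which is the "input goes to the wrong subprocess" failure.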

codecov · 2024-06-06T14:32:29Z

Codecov Report

Attention: Patch coverage is 73.84615% with 17 lines in your changes missing coverage. Please review.

Project coverage is 83.30%. Comparing base (9944231) to head (1a6ea83).

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #6013       +/-   ##
===========================================
+ Coverage   54.42%   83.30%   +28.87%     
===========================================
  Files         471      519       +48     
  Lines       76273    83671     +7398     
===========================================
+ Hits        41515    69704    +28189     
+ Misses      34758    13967    -20791

Files	Coverage Δ
src/common/libsubprocess/client.c	`79.69% <100.00%> (+0.33%)`	⬆️
src/common/libsubprocess/remote.c	`78.08% <ø> (+4.38%)`	⬆️
src/common/libsubprocess/server.c	`80.17% <93.75%> (+22.09%)`	⬆️
src/common/libsubprocess/subprocess.c	`88.83% <33.33%> (+3.11%)`	⬆️
src/modules/sdexec/sdexec.c	`70.71% <64.28%> (+0.30%)`	⬆️

... and 436 files with indirect coverage changes

garlick · 2024-06-06T14:58:47Z

This and flux-framework/rfc#416 are probably ready for a review.

chu11

LGTM! just some nits

chu11 · 2024-06-06T21:59:32Z

src/modules/sdexec/sdexec.c

+        if (flux_cancel_match (msg, m))
+            return m;


looks a little weird to call flux_cancel_match(), maybe just a small comment here

chu11 · 2024-06-06T22:05:12Z

src/common/libsubprocess/server.c

+        const flux_msg_t *msg;
+
+        if ((msg = flux_subprocess_aux_get (p, msgkey))
+            && flux_cancel_match (request, msg))


similar to prior comment

chu11 · 2024-06-06T23:29:37Z

src/modules/sdexec/sdexec.c

+    /* If the systemd unit has not started yet, enqueue the write request for
+     * later processing in start_continuation().
+     */
+    if (sdexec_channel_get_fd (proc->in) != -1) { // not yet claimed by systemd


this could probably use a little more comment, it's not super intuitive why you'd check for != -1. Perhaps mention that start_continuation() closes the fd after acknowledgement that the unit has started.

Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Expect a "matchtag" key in the write request and use it along with the sender's uuid to look up the systemd unit.

Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Expect a "matchtag" key in the write request and use it along with the requestor's uuid to look up the subprocess.

Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Send the matchtag in the write request instead of the pid and alter the internal function prototype for subprocess_write() to accept the exec future in lieu of rank, service, and matchtag. Update users of that function. Also update a unit test that uses the raw protocol.

Problem: if an sdexec.write request arrives before the unit stdin is ready, it will be dropped. This is possible now that sdexec.write identifies the unit by the matchtag instead of the pid. Queue early sdexec.write requests until stdin is valid, then move them back to the flux handle for processing as though they are being received for the first time.
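The queue-then-replay idea in this commit message can be sketched roughly as follows. This is an illustration under simplifying assumptions: the `unit` struct, `unit_write()`, and `unit_started()` are hypothetical, the queue is a fixed-size array, and queued requests are replayed by calling the handler directly rather than by moving messages back to a flux handle as the commit describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for a unit with a stdin channel. */
struct unit {
    bool stdin_ready;                 /* set once the unit has started */
    const char *pending[MAX_PENDING]; /* write requests that arrived early */
    size_t pending_count;
    size_t delivered;                 /* requests actually written to stdin */
};

/* Handle one write request.
 * Returns 0 on success, -1 if the request arrived early and the queue is full.
 */
static int unit_write (struct unit *u, const char *data)
{
    assert (u != NULL && data != NULL);
    if (!u->stdin_ready) {
        if (u->pending_count == MAX_PENDING)
            return -1;
        u->pending[u->pending_count++] = data; /* defer instead of dropping */
        return 0;
    }
    u->delivered++; /* a real implementation would write data to the fd here */
    return 0;
}

/* Called once the unit has started: mark stdin ready, then replay the
 * queued requests in arrival order through the normal handler.
 */
static void unit_started (struct unit *u)
{
    assert (u != NULL);
    size_t count = u->pending_count;

    u->stdin_ready = true;
    u->pending_count = 0;
    for (size_t i = 0; i < count; i++)
        unit_write (u, u->pending[i]);
}
```

Replaying through the same handler keeps a single code path for writes, so an early request is treated the same as one that arrives after the unit has started.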

Problem: data written to the subprocess server before the subprocess has entered RUNNING state is silently discarded, but must be handled in order to eliminate extraneous buffering on the client side. Drop stdin data only when the subprocess is in FAILED or EXITED state.
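The changed drop condition can be pictured as a small predicate. The enum and function below are hypothetical names chosen for illustration, modeled on the states named in the commit message; they are not the libsubprocess definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state names, modeled on those in the commit message. */
enum proc_state { PROC_INIT, PROC_RUNNING, PROC_EXITED, PROC_FAILED };

/* Return true if stdin data should be discarded in the given state.
 * Data that arrives before RUNNING (PROC_INIT here) is kept, since the
 * client may now send it before it has learned the pid.
 */
static bool should_drop_stdin (enum proc_state state)
{
    switch (state) {
        case PROC_FAILED:
        case PROC_EXITED:
            return true;
        case PROC_INIT:
        case PROC_RUNNING:
            return false;
    }
    assert (!"unknown state");
    return false;
}
```

Under the previous behavior described in the commit message, the equivalent check would have returned true for every state other than running.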

Problem: when stdin is written to a remote subprocess before the pid has been received, a buffer is created on the client side, but now that the protocol uses the matchtag instead of the pid, data can be sent early and this extra complexity can be avoided. Drop pre-running stdin buffering.

garlick · 2024-06-10T13:24:39Z

Thanks - I fixed up those comments.

garlick · 2024-06-10T14:24:51Z

setting MWP

garlick mentioned this pull request May 28, 2024

libsubprocess: reduce remote input prep/check #6002

Merged

garlick force-pushed the rexec_matchtag branch 2 times, most recently from 8ea1d76 to cada1e7 Compare May 29, 2024 14:56

garlick force-pushed the rexec_matchtag branch from cada1e7 to 84d2739 Compare June 6, 2024 01:22

garlick force-pushed the rexec_matchtag branch from 84d2739 to 1a6ea83 Compare June 6, 2024 14:08

garlick changed the title ~~WIP: libsubprocess: use matchtag instead of pid for flux_subprocess_write()~~ libsubprocess: use matchtag instead of pid for flux_subprocess_write() Jun 6, 2024

chu11 mentioned this pull request Jun 6, 2024

sdexec: add stdin buffering #6035

Open

chu11 approved these changes Jun 6, 2024

garlick added 6 commits June 10, 2024 06:24

sdexec: use matchtag in server write request

ca34683

Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Expect a "matchtag" key in the write request and use it along with the sender's uuid to look up the systemd unit.

libsubprocess: use matchtag in server write req

5f80d28

Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Expect a "matchtag" key in the write request and use it along with the requestor's uuid to look up the subprocess.

garlick force-pushed the rexec_matchtag branch from 1a6ea83 to b84e4eb Compare June 10, 2024 13:24

garlick added the merge-when-passing label Jun 10, 2024

mergify bot merged commit 67fc412 into flux-framework:master Jun 10, 2024
33 checks passed

garlick deleted the rexec_matchtag branch June 10, 2024 14:26
