Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libsubprocess: use matchtag instead of pid for flux_subprocess_write() #6013

Merged
merged 6 commits into from
Jun 10, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented May 28, 2024

This is an attempt to implement the subprocess protocol change proposed in flux-framework/rfc#416, which allows a write request to be sent to a remote subprocess before the pid is received.

Unfortunately what seemed like a straightforward change broke a bunch of tests and I've yet to figure out why!

Wanted to post what I had so far and continue discussing this with @chu11

@chu11
Copy link
Member

chu11 commented May 28, 2024

In our #6002 discussion, we were considering just not creating the write buffer unless it is needed. Were you thinking of going forward with this protocol change anyways?

@garlick
Copy link
Member Author

garlick commented May 28, 2024

See my last couple of comments in #6002. It seems like this should work without any buffering. IOW the only thing preventing flux_subprocess_write() on a remote subprocess without waiting for the status callback to notify that the process is RUNNING is the lack of a token to correlate the request to the subprocess. The matchtag should work for this. I might be missing something important though.

@garlick garlick force-pushed the rexec_matchtag branch 2 times, most recently from 8ea1d76 to cada1e7 Compare May 29, 2024 14:56
@garlick
Copy link
Member Author

garlick commented May 29, 2024

I'll be danged if I know why this doesn't work. I've wasted half a day trying to figure it out and am getting nowhere. Parking for now.

@chu11
Copy link
Member

chu11 commented May 31, 2024

i played around with the branch and equally befuddled, especially from the non-sdexec side of the code. The changes seem so simple and obvious.

I tried to see if there was a racy scenario where we were writing earlier than we could before (b/c matchtag is available immediately, whereas pid wasn't), but I don't think that scenario is being violated.

I also rebased your branch on master, in case some of the recent fixes we did were part of the problem.

hmmm

@garlick
Copy link
Member Author

garlick commented Jun 6, 2024

I figured it out - we need to match both the matchtag and the sender on the server side. Otherwise input can go to the wrong subprocess, which causes myriad test suite failures. Derp! 🤦‍♂️

@garlick garlick changed the title WIP: libsubprocess: use matchtag instead of pid for flux_subprocess_write() libsubprocess: use matchtag instead of pid for flux_subprocess_write() Jun 6, 2024
Copy link

codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 73.84615% with 17 lines in your changes missing coverage. Please review.

Project coverage is 83.30%. Comparing base (9944231) to head (1a6ea83).

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6013       +/-   ##
===========================================
+ Coverage   54.42%   83.30%   +28.87%     
===========================================
  Files         471      519       +48     
  Lines       76273    83671     +7398     
===========================================
+ Hits        41515    69704    +28189     
+ Misses      34758    13967    -20791     
Files Coverage Δ
src/common/libsubprocess/client.c 79.69% <100.00%> (+0.33%) ⬆️
src/common/libsubprocess/remote.c 78.08% <ø> (+4.38%) ⬆️
src/common/libsubprocess/server.c 80.17% <93.75%> (+22.09%) ⬆️
src/common/libsubprocess/subprocess.c 88.83% <33.33%> (+3.11%) ⬆️
src/modules/sdexec/sdexec.c 70.71% <64.28%> (+0.30%) ⬆️

... and 436 files with indirect coverage changes

@garlick
Copy link
Member Author

garlick commented Jun 6, 2024

This and flux-framework/rfc#416 are probably ready for a review.

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! just some nits

Comment on lines +149 to +156
if (flux_cancel_match (msg, m))
return m;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks a little weird to call flux_cancel_match(), maybe just a small comment here

const flux_msg_t *msg;

if ((msg = flux_subprocess_aux_get (p, msgkey))
&& flux_cancel_match (request, msg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to prior comment

/* If the systemd unit has not started yet enqueue the write request for
* later processing in start_continuation().
*/
if (sdexec_channel_get_fd (proc->in) != -1) { // not yet claimed by systemd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could probably use a little more comment, it's not super intuitive why you'd check for ~= -1. Perhaps mention that start_contiunation() closes the fd after acknowledgement unit has started.

garlick added 6 commits June 10, 2024 06:24
Problem: RFC 42 now requires write requests to reference the
subprocess via the exec matchtag rather than the pid.

Expect a "matchtag" key in the write request and use it along with
the sender's uuid to look up the systemd unit.
Problem: RFC 42 now requires write requests to reference the
subprocess via the exec matchtag rather than the pid.

Expect a "matchtag" key in the write request and use it along
with the requestor's uuid to look up the subprocess.
Problem: RFC 42 now requires write requests to reference the
subprocess via the exec matchtag rather than the pid.

Send the matchtag in the write request instead of the pid and
alter the internal function prototype for subprocess_write() to
accept the exec future in lieu of rank, service, and matchtag.
Update users of that function.

Also update a unit test that uses the raw protocol.
Problem: if an sdexec.write request arrives before the unit stdin
is ready, it will be dropped.

This is possible now that sdexec.write identifies the unit by the
matchtag instead of the pid.

Queue early sdexec.write requests until stdin is valid, then move
them back to the flux handle for processing as though they are being
received for the first time.
Problem: data written to the subprocess server before the subprocess
has entered RUNNING state is silently discarded, but must be handled
in order to eliminate extraneous buffering on the client size.

Drop stdin data only when the subprocess is in FAILED or EXITED state.
Problem: when stdin is written to a remote subprocess before the
pid has been received, a buffer is created on the client side,
but now that the protocol uses the matchtag instead of the pid,
data can be sent early and this extra complexity can be avoided.

Drop pre-running stdin buffering.
@garlick
Copy link
Member Author

garlick commented Jun 10, 2024

Thanks - I fixed up those comments.

@garlick
Copy link
Member Author

garlick commented Jun 10, 2024

setting MWP

@mergify mergify bot merged commit 67fc412 into flux-framework:master Jun 10, 2024
33 checks passed
@garlick garlick deleted the rexec_matchtag branch June 10, 2024 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants