-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libsubprocess: use matchtag instead of pid for flux_subprocess_write() #6013
Conversation
In our #6002 discussion, we were considering just not creating the write buffer unless it is needed. Were you thinking of going forward with this protocol change anyways? |
See my last couple of comments in #6002. It seems like this should work without any buffering. IOW the only thing preventing |
8ea1d76
to
cada1e7
Compare
I'll be danged if I know why this doesn't work. I've wasted half a day trying to figure it out and am getting nowhere. Parking for now. |
i played around with the branch and equally befuddled, especially from the non-sdexec side of the code. The changes seem so simple and obvious. I tried to see if there was a racy scenario where we were writing earlier than we could before (b/c matchtag is available immediately, whereas pid wasn't), but I don't think that scenario is being violated. I also rebased your branch on master, in case some of the recent fixes we did were part of the problem. hmmm |
I figured it out - we need to match both the matchtag and the sender on the server side. Otherwise input can go to the wrong subprocess, which causes myriad test suite failures. Derp! 🤦♂️ |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6013 +/- ##
===========================================
+ Coverage 54.42% 83.30% +28.87%
===========================================
Files 471 519 +48
Lines 76273 83671 +7398
===========================================
+ Hits 41515 69704 +28189
+ Misses 34758 13967 -20791
|
This and flux-framework/rfc#416 are probably ready for a review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! just some nits
if (flux_cancel_match (msg, m)) | ||
return m; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks a little weird to call flux_cancel_match()
, maybe just a small comment here
const flux_msg_t *msg; | ||
|
||
if ((msg = flux_subprocess_aux_get (p, msgkey)) | ||
&& flux_cancel_match (request, msg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to prior comment
/* If the systemd unit has not started yet enqueue the write request for | ||
* later processing in start_continuation(). | ||
*/ | ||
if (sdexec_channel_get_fd (proc->in) != -1) { // not yet claimed by systemd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could probably use a little more comment, it's not super intuitive why you'd check for ~= -1. Perhaps mention that start_contiunation() closes the fd after acknowledgement unit has started.
Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Expect a "matchtag" key in the write request and use it along with the sender's uuid to look up the systemd unit.
Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Expect a "matchtag" key in the write request and use it along with the requestor's uuid to look up the subprocess.
Problem: RFC 42 now requires write requests to reference the subprocess via the exec matchtag rather than the pid. Send the matchtag in the write request instead of the pid and alter the internal function prototype for subprocess_write() to accept the exec future in lieu of rank, service, and matchtag. Update users of that function. Also update a unit test that uses the raw protocol.
Problem: if an sdexec.write request arrives before the unit stdin is ready, it will be dropped. This is possible now that sdexec.write identifies the unit by the matchtag instead of the pid. Queue early sdexec.write requests until stdin is valid, then move them back to the flux handle for processing as though they are being received for the first time.
Problem: data written to the subprocess server before the subprocess has entered RUNNING state is silently discarded, but must be handled in order to eliminate extraneous buffering on the client size. Drop stdin data only when the subprocess is in FAILED or EXITED state.
Problem: when stdin is written to a remote subprocess before the pid has been received, a buffer is created on the client side, but now that the protocol uses the matchtag instead of the pid, data can be sent early and this extra complexity can be avoided. Drop pre-running stdin buffering.
Thanks - I fixed up those comments. |
setting MWP |
This is an attempt to implement the subprocess protocol change proposed in flux-framework/rfc#416, which allows a write request to be sent to a remote subprocess before the pid is received.
Unfortunately what seemed like a straightforward change broke a bunch of tests and I've yet to figure out why!
Wanted to post what I had so far and continue discussing this with @chu11