Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-4210: Preserve return code from nonblocking send #1602

Closed
wants to merge 1 commit into from

Conversation

smikes
Copy link
Contributor

@smikes smikes commented Feb 12, 2021

Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
ZOPERATIONTIMEOUT
ZSYSTEMERROR
ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

adaptor_send_queue(zh, 0);
return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

@smikes
Copy link
Contributor Author

smikes commented Feb 12, 2021

Additional notes:

  • developed against 3.4 and adapted to 3.6
  • allows earlier discovery of connection loss

@smikes
Copy link
Contributor Author

smikes commented Feb 12, 2021

Failure in PR-build:

[2021-02-12T23:45:10.488Z] [ERROR] Errors: 
[2021-02-12T23:45:10.488Z] [ERROR]   ReadOnlyModeTest.testConnectionEvents:205 » Timeout Failed to connect in read-...
[2021-02-12T23:45:10.488Z] [INFO] 
[2021-02-12T23:45:10.488Z] [ERROR] Tests run: 2901, Failures: 0, Errors: 1, Skipped: 4

I will try to reproduce the failure locally.

@maoling
Copy link
Member

maoling commented Feb 15, 2021

@smikes

  • Thanks for this contribution. It's better to create a JIRA issue (sign up JIRA if you don't have an account)
    to bind this PR to a JIRA-ID. ZooKeeper uses the GitHub workflow. The contributor guideline is here
  • The CI failure may be not your fault and you can ignore it:)

@smikes
Copy link
Contributor Author

smikes commented Feb 15, 2021

@maoling maoling changed the title Preserve return code from nonblocking send ZOOKEEPER-4210: Preserve return code from nonblocking send Feb 16, 2021
@maoling
Copy link
Member

maoling commented Feb 16, 2021

Ping @ztzg @eolivelli @anmolnar :)

@smikes
Copy link
Contributor Author

smikes commented Feb 16, 2021

Confirmed that ReadOnlyModeTest passes locally, so if this is known to be flaky/slow under load, then I'm fine.

@smikes
Copy link
Contributor Author

smikes commented Feb 16, 2021

Running the build on branch-3.6 I see some unit tests failing, both with this change and without

Copy link
Contributor

@ztzg ztzg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @smikes,

Looks good, in general. Thank you!

But…

This is technically an API & ABI break, as it is widening the range of error codes returned by API functions. As such, I would recommend not including it in upcoming 3.6 releases. (I would be okay with 3.7+ as 1/ C only uses integer codes, which most programs don't interpret and 2/ we are not on the "semantic versioning" train. @eolivelli, WDYT?)

In any case, we usually take PRs against master and backport to the relevant branches; would you mind doing that?

I have also left a couple of comments on the diff, one of which is actionable.

Finally, there is the question of documenting the change in zookeeper.h; the comment on e.g. zoo_aset is now outdated and should probably be rephrased:

/* [...]
 * \return ZOK on success or one of the following errcodes on failure:
 * ZBADARGUMENTS - invalid input parameters
 * ZINVALIDSTATE - zhandle state is either ZOO_SESSION_EXPIRED_STATE or ZOO_AUTH_FAILED_STATE
 * ZMARSHALLINGERROR - failed to marshall a request; possibly, out of memory
 */
ZOOAPI int zoo_aset(zhandle_t *zh, const char *path, const char *buffer, int buflen,
        int version, stat_completion_t completion, const void *data);

Would you agree?

Cheers, -D

P.-S. — You commented:

Running the build on branch-3.6 I see some unit tests failing, both with this change and without

Right; we have some stability issues with the current test suite. Passing -Dsurefire-forkcount=1 helps with Java tests, at the cost of a looong runtime. Something we have to fix ASAP, but please ignore CI failures which don't seem relevant in the meantime, as @maoling suggested.

zookeeper-client/zookeeper-client-c/src/zookeeper.c Outdated Show resolved Hide resolved
zookeeper-client/zookeeper-client-c/src/zookeeper.c Outdated Show resolved Hide resolved
@smikes
Copy link
Contributor Author

smikes commented Feb 17, 2021

Sounds good, will take a look. Internally we have 3.4 and 3.6 branches active so I developed against that, I will rebase this against the main branch as you suggest

@smikes smikes closed this Feb 17, 2021
@smikes smikes reopened this Feb 17, 2021
@smikes smikes force-pushed the asyncsend-returncode-3.6 branch from acce97b to c3081b8 Compare February 17, 2021 20:31
@smikes smikes changed the base branch from branch-3.6 to master February 17, 2021 20:34
@smikes smikes force-pushed the asyncsend-returncode-3.6 branch 2 times, most recently from 2335cb3 to e20bf70 Compare February 18, 2021 00:34
@smikes
Copy link
Contributor Author

smikes commented Feb 18, 2021

Summary of current changes:

  • bc5c997
    A revised version of the original change. This now DOES NOT change the ABI -- it does not modify the return values that come from the zoo_a* functions.
    Also, the action on ZCONNECTIONLOSS is more conservative. Now, a connection loss that is detected during an opportunistic nonblocking send will be treated as a normal connection loss, not an error. The pending send buffers will not be cleared, and completions/watchers will be preserved at least until the next zookeeper_interest call.

  • 9c0cd16
    Addresses the comment above, that handle_socket_error_msg reports the caller's line but not the caller's function name. Not needed to address ZOOKEEPER-4210, so I would be happy to split it into a different PR. Please advise.

  • e20bf70
    In addition to changing the behavior when connection loss is detected, also allow the return code ZCONNECTIONLOSS to propagate upward to the zoo_a* functions. Add documentation in zookeeper.h for the new return codes.
    Note: included for completeness, but I now think this is probably a bad idea, because there is nothing the consumer of the zookeeper.c API can do with the new information (about connection loss during async send).

@smikes smikes requested a review from ztzg February 18, 2021 00:51
@smikes
Copy link
Contributor Author

smikes commented Feb 18, 2021

Separately: I was seeing the zktest_mt and zktest_st cppunit test suites fail, because of my change to the ABI, so I am hoping this will now build cleanly.

@smikes smikes force-pushed the asyncsend-returncode-3.6 branch from e20bf70 to bc5c997 Compare February 19, 2021 03:23
@ztzg
Copy link
Contributor

ztzg commented Feb 19, 2021

@smikes: Still planning to have a look at this, and #1609, ASAP. Thank you for all that work so far!

Copy link
Contributor

@ztzg ztzg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@eolivelli, @anmolnar, @symat: PTAL.

@ztzg
Copy link
Contributor

ztzg commented Feb 21, 2021

Summary of current changes:

  • bc5c997
    A revised version of the original change. This now DOES NOT change the ABI -- it does not modify the return values that come from the zoo_a* functions.
    Also, the action on ZCONNECTIONLOSS is more conservative. Now, a connection loss that is detected during an opportunistic nonblocking send will be treated as a normal connection loss, not an error. The pending send buffers will not be cleared, and completions/watchers will be preserved at least until the next zookeeper_interest call.

This is this PR. LGTM.

  • 9c0cd16
    Addresses the comment above, that handle_socket_error_msg reports the caller's line but not the caller's function name. Not needed to address ZOOKEEPER-4210, so I would be happy to split it into a different PR. Please advise.

Now #1609 and ZOOKEEPER-4217. Thank you for splitting it.

I note that the PR currently is in "draft" state. Do you have specific improvements in mind?

  • e20bf70
    In addition to changing the behavior when connection loss is detected, also allow the return code ZCONNECTIONLOSS to propagate upward to the zoo_a* functions. Add documentation in zookeeper.h for the new return codes.
    Note: included for completeness, but I now think this is probably a bad idea, because there is nothing the consumer of the zookeeper.c API can do with the new information (about connection loss during async send).

Agree. (But thank you for preparing the commit anyway!)

@smikes
Copy link
Contributor Author

smikes commented Feb 21, 2021

Summary of current changes:

  • bc5c997
    A revised version of the original change. This now DOES NOT change the ABI -- it does not modify the return values that come from the zoo_a* functions.

This is this PR. LGTM.

Cool! I am happy to take more feedback on this, or it can be merged.

  • 9c0cd16
    Addresses the comment above, that handle_socket_error_msg reports the caller's line but not the caller's function name. Not needed to address ZOOKEEPER-4210, so I would be happy to split it into a different PR. Please advise.

Now #1609 and ZOOKEEPER-4217. Thank you for splitting it.

I note that the PR currently is in "draft" state. Do you have specific improvements in mind?

No. Today I removed the "draft" status.

After this, I will make some documentation updates to the C client dev notes.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me,
but I left one question

zookeeper-client/zookeeper-client-c/src/zookeeper.c Outdated Show resolved Hide resolved
@smikes smikes force-pushed the asyncsend-returncode-3.6 branch from bc5c997 to f0539f2 Compare February 23, 2021 02:27
@smikes smikes requested a review from eolivelli February 23, 2021 15:01
@smikes smikes force-pushed the asyncsend-returncode-3.6 branch 2 times, most recently from 548004f to 96c13e2 Compare February 24, 2021 00:14
@smikes smikes requested a review from eolivelli February 24, 2021 00:19
@smikes smikes force-pushed the asyncsend-returncode-3.6 branch from 96c13e2 to bc8adb0 Compare February 26, 2021 23:24
@smikes smikes requested a review from ztzg February 26, 2021 23:25
In a nonblocking send, we may be notified that the
connection was lost.  Currently this is not handled because
the return value of flush_send_queue() is ignored.

This change DOES NOT change the ABI or the return
codes that are returned by zoo_a* async functions.

This change DOES close the file descriptor and
change zh->state to reflect that the connection was lost.

Reorganized to minimize the change in behavior. Always
call adaptor_send_queue() even if there was a marshalling
error. Use `fd == -1` as test for whether to call close_zsock()
@smikes smikes force-pushed the asyncsend-returncode-3.6 branch from bc8adb0 to 037a4f8 Compare February 26, 2021 23:27
Copy link
Contributor

@ztzg ztzg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; thanks!

@ztzg ztzg closed this in 1944f77 Mar 9, 2021
@ztzg
Copy link
Contributor

ztzg commented Mar 9, 2021

This is now in master. Thank you for the contribution, and for your patience, @smikes!

RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Aug 31, 2022
Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
  ZOPERATIONTIMEOUT
  ZSYSTEMERROR
  ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

    adaptor_send_queue(zh, 0);
    return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

Author: Sam Mikes <[email protected]>

Reviewers: Enrico Olivelli <[email protected]>, Damien Diederen <[email protected]>

Closes apache#1602 from smikes/asyncsend-returncode-3.6
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Aug 31, 2022
Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
  ZOPERATIONTIMEOUT
  ZSYSTEMERROR
  ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

    adaptor_send_queue(zh, 0);
    return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

Author: Sam Mikes <[email protected]>

Reviewers: Enrico Olivelli <[email protected]>, Damien Diederen <[email protected]>

Closes apache#1602 from smikes/asyncsend-returncode-3.6
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Aug 31, 2022
Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
  ZOPERATIONTIMEOUT
  ZSYSTEMERROR
  ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

    adaptor_send_queue(zh, 0);
    return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

Author: Sam Mikes <[email protected]>

Reviewers: Enrico Olivelli <[email protected]>, Damien Diederen <[email protected]>

Closes apache#1602 from smikes/asyncsend-returncode-3.6
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022
Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
  ZOPERATIONTIMEOUT
  ZSYSTEMERROR
  ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

    adaptor_send_queue(zh, 0);
    return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

Author: Sam Mikes <[email protected]>

Reviewers: Enrico Olivelli <[email protected]>, Damien Diederen <[email protected]>

Closes apache#1602 from smikes/asyncsend-returncode-3.6
RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 29, 2022
Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
  ZOPERATIONTIMEOUT
  ZSYSTEMERROR
  ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

    adaptor_send_queue(zh, 0);
    return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

Author: Sam Mikes <[email protected]>

Reviewers: Enrico Olivelli <[email protected]>, Damien Diederen <[email protected]>

Closes apache#1602 from smikes/asyncsend-returncode-3.6
anuragmadnawat1 pushed a commit to anuragmadnawat1/zookeeper that referenced this pull request Nov 2, 2022
Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
  ZOPERATIONTIMEOUT
  ZSYSTEMERROR
  ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

    adaptor_send_queue(zh, 0);
    return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

Author: Sam Mikes <[email protected]>

Reviewers: Enrico Olivelli <[email protected]>, Damien Diederen <[email protected]>

Closes apache#1602 from smikes/asyncsend-returncode-3.6
anuragmadnawat1 added a commit to anuragmadnawat1/zookeeper that referenced this pull request Nov 2, 2022
Async API calls attempt to flush the send buffer, which
calls flush_send_queue(); and can report
  ZOPERATIONTIMEOUT
  ZSYSTEMERROR
  ZCONNECTIONLOSS

Specifically: send_buffer() calls send(2) with MSG_NOSIGNAL,
which can return EPIPE; then send_buffer return -1, causing
ZCONNECTIONLOSS from flush_send_queue().

Current async API calls drop the return value from flush_send_queue(),
as below:

    adaptor_send_queue(zh, 0);
    return (rc < 0)?ZMARSHALLINGERROR:ZOK;

The async API then returns ZOK instead of ZCONNECTIONLOSS.

Author: Sam Mikes <[email protected]>

Reviewers: Enrico Olivelli <[email protected]>, Damien Diederen <[email protected]>

Closes apache#1602 from smikes/asyncsend-returncode-3.6

Co-authored-by: Sam Mikes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants