Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(go-client): loopForRequest not return and retry forever #1444

Merged
merged 3 commits into from
Apr 18, 2023

Conversation

littlepangdi
Copy link
Contributor

@littlepangdi littlepangdi commented Apr 17, 2023

What problem does this PR solve?

fix #1385

What is changed and how does it work?

expected behavior:

after replica server restart,The client should be able to reconnect with node without considering the client configuration update (reselecting the primary replica).

Actually, client configuration update is another issue we should talk about. I'd like to discuss that in another issue.

In this pr, we focus on why client cannot reconnect with restarted server.

  1. SDK use several loop to monitor rpc.conn, when server is closed, loopForDialing will continue to retry dialling until connected with server(when server is restart),then, two new loop loopForRequest & loopForResponse will be created to handle request.
    however, loopForRequest will not return when correlative loopForResponse returned because of IsNetworkClosed(EOF), since latter only return nil and will not shutdown tom. thus, there will be more aliveloopForRequest than loopForResponse in this case.

  2. SDK retried not on timeout err, however, we wrapped timeout err incorrectly in here

    return fmt.Errorf("session %s is unable to connect (used %dms), the context error: %s", n, time.Since(dialStart)/time.Millisecond, ctx.Err())
    ,thus every request like multiset will continue to do retry here since err we passed to outer is no longer timeout err

Checklist

monitor capture based on this pr:
timeout is eliminated after server restart.
image

should update config and change target
@littlepangdi
Copy link
Contributor Author

During test, we found that there is still timeout exists, thus locate another problem--sdk should not retry on replica is not primary error after reconnect with restarted server ip.
In this case, configuration update should be processed and keep retrying means nothing.

@littlepangdi
Copy link
Contributor Author

Futhermore, if we remove user-side call of TableConnector.Close(), client is working perfectly even when server replica is stopped as if nothing happened. see for this issue #1446
image

log:
triggers configuration update(ips are blurred):

time="2023-04-17T18:47:06+08:00" level=info msg="trigger configuration update of table [CL74485_loki_staging2_324] due to RPC failure [session unresponsive for long] to [xx(replica)]" func="pegasus.(*pegasusTableConnector).tryConfUpdate" file="table_connector.go:753"
time="2023-04-17T18:47:06+08:00" level=info msg="querying configuration of table(CL74485_loki_staging2_324) [metaList=[xxxx]]" func="session.(*MetaManager).QueryConfig" file="meta_session.go:110"
time="2023-04-17T18:47:06+08:00" level=info msg="querying configuration of table(CL74485_loki_staging2_324) from [xxxx(meta)]" func="session.(*metaSession).queryConfig" file="meta_session.go:45"
time="2023-04-17T18:47:06+08:00" level=info msg="create session with [xxxx(replica)]" func=session.newNodeSession file="session.go:141"
time="2023-04-17T18:47:13+08:00" level=info msg="failed to read response from [xxxx(replica)]: read tcp xxxx:46252->xxxx:15801: read: connection reset by peer" func="session.(*nodeSession).loopForResponse" file="session.go:298"
time="2023-04-17T18:47:14+08:00" level=info msg="dial to [xxxx(replica)]" func="session.(*nodeSession).dial" file="session.go:199"
time="2023-04-17T18:47:14+08:00" level=info msg="stop dialing for [xxxx(replica)], connection state: ConnStateReady" func="session.(*nodeSession).dial" file="session.go:211"

@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@d16e65c). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #1444   +/-   ##
=========================================
  Coverage          ?   53.58%           
=========================================
  Files             ?       27           
  Lines             ?     2538           
  Branches          ?        0           
=========================================
  Hits              ?     1360           
  Misses            ?     1132           
  Partials          ?       46           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@hycdong hycdong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution~

@acelyc111 acelyc111 changed the title fix loopForRequest not return and retry forever fix(go-client): loopForRequest not return and retry forever Apr 18, 2023
@hycdong hycdong merged commit b4937fa into apache:master Apr 18, 2023
@empiredan empiredan mentioned this pull request Aug 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The client does not respond after the replication service is restarted
4 participants