fix(go-client): loopForRequest not return and retry forever #1444

littlepangdi · 2023-04-17T05:09:50Z

What problem does this PR solve?

What is changed and how does it work?

expected behavior:

after replica server restart，The client should be able to reconnect with node without considering the client configuration update (reselecting the primary replica).

Actually, client configuration update is another issue we should talk about. I'd like to discuss that in another issue.

In this pr, we focus on why client cannot reconnect with restarted server.

SDK use several loop to monitor rpc.conn, when server is closed, loopForDialing will continue to retry dialling until connected with server(when server is restart),then, two new loop loopForRequest & loopForResponse will be created to handle request.
however, loopForRequest will not return when correlative loopForResponse returned because of IsNetworkClosed(EOF), since latter only return nil and will not shutdown tom. thus, there will be more aliveloopForRequest than loopForResponse in this case.
SDK retried not on timeout err, however, we wrapped timeout err incorrectly in here

incubator-pegasus/go-client/session/session.go

Line 338 in d16e65c

return fmt.Errorf("session %s is unable to connect (used %dms), the context error: %s", n, time.Since(dialStart)/time.Millisecond, ctx.Err())

,thus every request like multiset will continue to do retry here since err we passed to outer is no longer timeout err

incubator-pegasus/go-client/pegasus/table_connector.go

Line 719 in d16e65c

retry = true

Checklist

monitor capture based on this pr:
timeout is eliminated after server restart.

should update config and change target

littlepangdi · 2023-04-17T07:37:11Z

During test, we found that there is still timeout exists, thus locate another problem--sdk should not retry on replica is not primary error after reconnect with restarted server ip.
In this case, configuration update should be processed and keep retrying means nothing.

littlepangdi · 2023-04-17T11:15:26Z

Futhermore, if we remove user-side call of TableConnector.Close()， client is working perfectly even when server replica is stopped as if nothing happened. see for this issue #1446

log:
triggers configuration update(ips are blurred):

time="2023-04-17T18:47:06+08:00" level=info msg="trigger configuration update of table [CL74485_loki_staging2_324] due to RPC failure [session unresponsive for long] to [xx(replica)]" func="pegasus.(*pegasusTableConnector).tryConfUpdate" file="table_connector.go:753"
time="2023-04-17T18:47:06+08:00" level=info msg="querying configuration of table(CL74485_loki_staging2_324) [metaList=[xxxx]]" func="session.(*MetaManager).QueryConfig" file="meta_session.go:110"
time="2023-04-17T18:47:06+08:00" level=info msg="querying configuration of table(CL74485_loki_staging2_324) from [xxxx(meta)]" func="session.(*metaSession).queryConfig" file="meta_session.go:45"
time="2023-04-17T18:47:06+08:00" level=info msg="create session with [xxxx(replica)]" func=session.newNodeSession file="session.go:141"
time="2023-04-17T18:47:13+08:00" level=info msg="failed to read response from [xxxx(replica)]: read tcp xxxx:46252->xxxx:15801: read: connection reset by peer" func="session.(*nodeSession).loopForResponse" file="session.go:298"
time="2023-04-17T18:47:14+08:00" level=info msg="dial to [xxxx(replica)]" func="session.(*nodeSession).dial" file="session.go:199"
time="2023-04-17T18:47:14+08:00" level=info msg="stop dialing for [xxxx(replica)], connection state: ConnStateReady" func="session.(*nodeSession).dial" file="session.go:211"

codecov-commenter · 2023-04-18T02:35:22Z

Codecov Report

❗ No coverage uploaded for pull request base (master@d16e65c). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #1444   +/-   ##
=========================================
  Coverage          ?   53.58%           
=========================================
  Files             ?       27           
  Lines             ?     2538           
  Branches          ?        0           
=========================================
  Hits              ?     1360           
  Misses            ?     1132           
  Partials          ?       46

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

hycdong

Thanks for the contribution~

fix loopForRequest not return and retry forever

8ebe394

github-actions bot added the go-client label Apr 17, 2023

should not retry if target replica is not primary

b94b7e8

should update config and change target

fix unit test

ccb0e62

hycdong approved these changes Apr 18, 2023

View reviewed changes

acelyc111 approved these changes Apr 18, 2023

View reviewed changes

acelyc111 changed the title ~~fix loopForRequest not return and retry forever~~ fix(go-client): loopForRequest not return and retry forever Apr 18, 2023

hycdong merged commit b4937fa into apache:master Apr 18, 2023

empiredan mentioned this pull request Aug 21, 2023

Release 2.5.0 #1584

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(go-client): loopForRequest not return and retry forever #1444

fix(go-client): loopForRequest not return and retry forever #1444

littlepangdi commented Apr 17, 2023 •

edited

Loading

littlepangdi commented Apr 17, 2023

littlepangdi commented Apr 17, 2023

codecov-commenter commented Apr 18, 2023

hycdong left a comment •

edited

Loading

fix(go-client): loopForRequest not return and retry forever #1444

fix(go-client): loopForRequest not return and retry forever #1444

Conversation

littlepangdi commented Apr 17, 2023 • edited Loading

What problem does this PR solve?

What is changed and how does it work?

expected behavior:

Checklist

littlepangdi commented Apr 17, 2023

littlepangdi commented Apr 17, 2023

codecov-commenter commented Apr 18, 2023

Codecov Report

hycdong left a comment • edited Loading

Choose a reason for hiding this comment

littlepangdi commented Apr 17, 2023 •

edited

Loading

hycdong left a comment •

edited

Loading