Skip to content
This repository has been archived by the owner on Dec 1, 2022. It is now read-only.

reconnect when EOF #431

Merged
merged 3 commits into from
Mar 11, 2021
Merged

reconnect when EOF #431

merged 3 commits into from
Mar 11, 2021

Conversation

critical27
Copy link
Contributor

@critical27 critical27 commented Mar 8, 2021

Almost same as #411. Two difference:

  1. use AsyncSocket::hangup to tell if we encounter EOF, in chaos graphd will failed to send any request to storage after storage reboots in worst case. As @guojun85 said in Fix problem: storage restarted but reconnect is not working. #411, readable will return true while the connection is still working (during my pressure test, it will return true occasionally, once I stop the pressure, no more true is returned), that is probably because it use POLLIN, and default read buffer if quite small. And AsyncSocket::hangup use POLLRDHUP | POLLHUP, which is what we want.
  2. remove duplicate check in impl(), it is uesless now because we didn't use ReconnectRequestChannel, and ReconnectingRequestChannel::impl() will be invoked multiple times, and check more than once. After this PR, we manage retry by ourself.

@guojun85, PTAL, thx

@critical27
Copy link
Contributor Author

Hold on a second, there maybe a way to make changes minimum.

@critical27
Copy link
Contributor Author

critical27 commented Mar 8, 2021

I believe it is the right solution now, ready to review.

@critical27 critical27 closed this Mar 8, 2021
@critical27 critical27 reopened this Mar 9, 2021
liuyu85cn
liuyu85cn previously approved these changes Mar 9, 2021
Copy link
Contributor

@liuyu85cn liuyu85cn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

panda-sheep
panda-sheep previously approved these changes Mar 9, 2021
Copy link
Contributor

@panda-sheep panda-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well done!

@critical27
Copy link
Contributor Author

Wait dutor to have a look

@bright-starry-sky
Copy link
Contributor

Wait dutor to have a look

👌

@dutor
Copy link
Contributor

dutor commented Mar 10, 2021

Skimmed through code of the AsyncSocket transport, I think calling transport->good() is enough, hangup() is unnecessary. Because if the remote side closes the connection(either restarted or crashed), there will be a EPOLLIN event which triggers a read action which returns a EOF, then the socket states will be set to indicate not good()

You could have a try to verify it.

@dutor
Copy link
Contributor

dutor commented Mar 10, 2021

Skimmed through code of the AsyncSocket transport, I think calling transport->good() is enough, hangup() is unnecessary. Because if the remote side closes the connection(either restarted or crashed), there will be a EPOLLIN event which triggers a read action which returns a EOF, then the socket states will be set to indicate not good()

You could have a try to verify it.

emmm. It seems that channel->good() calls transport->good() for us. It think we could just check if the channel is good, then create one if not.

@critical27
Copy link
Contributor Author

critical27 commented Mar 10, 2021

Skimmed through code of the AsyncSocket transport, I think calling transport->good() is enough, hangup() is unnecessary. Because if the remote side closes the connection(either restarted or crashed), there will be a EPOLLIN event which triggers a read action which returns a EOF, then the socket states will be set to indicate not good()
You could have a try to verify it.

emmm. It seems that channel->good() calls transport->good() for us. It think we could just check if the channel is good, then create one if not.

transport->good() is AsyncSocket::good() actually, and it did not check POLLIN, however AsyncSocket::readable() will check POLLIN.

During my test, readable will be effected by pressure, and will return true even the socket is valid. You could try #411 and you will find raft client transport will be released with a little pressure, very easy to reproduce.

@dutor
Copy link
Contributor

dutor commented Mar 10, 2021

Skimmed through code of the AsyncSocket transport, I think calling transport->good() is enough, hangup() is unnecessary. Because if the remote side closes the connection(either restarted or crashed), there will be a EPOLLIN event which triggers a read action which returns a EOF, then the socket states will be set to indicate not good()

You could have a try to verify it.

After digging, we found out that, after each RPC call finishes, READ event of the connection will be removed from the event set, so that even the connection is lost the client side won't be able to update the state. So we have to be ask to kernel about the connection state, which is what hangup does.

@dutor
Copy link
Contributor

dutor commented Mar 10, 2021

During my test, readable will be effected by pressure, and will return true even the socket is valid. You could try #411 and you will find raft client transport will be released with a little pressure, very easy to reproduce.

This is because a disconnect from the remove side will incur a READ event on the client side, so the connection IS readable in terms of event handling.

@critical27
Copy link
Contributor Author

Skimmed through code of the AsyncSocket transport, I think calling transport->good() is enough, hangup() is unnecessary. Because if the remote side closes the connection(either restarted or crashed), there will be a EPOLLIN event which triggers a read action which returns a EOF, then the socket states will be set to indicate not good()
You could have a try to verify it.

After digging, we found out that, after each RPC call finishes, READ event of the connection will be removed from the event set, so that even the connection is lost the client side won't be able to update the state. So we have to be ask to kernel about the connection state, which is what hangup does.

👍 , let me have a look.

@critical27 critical27 force-pushed the reconnect branch 2 times, most recently from d888edd to 7d27ebc Compare March 10, 2021 10:41
@critical27
Copy link
Contributor Author

Ready to go, close #411

std::shared_ptr<apache::thrift::async::TAsyncSocket> socket;
evb->runImmediatelyOrRunInEventBaseThreadAndWait(
[&socket, evb, host]() {
socket = apache::thrift::async::TAsyncSocket::newSocket(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use tempHost instead of host

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter, actually validate won't change the value of the parameter, so tempHost will be equal to host.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what you mean, good catch

dutor
dutor previously approved these changes Mar 11, 2021
Copy link
Contributor

@dutor dutor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zouni

panda-sheep
panda-sheep previously approved these changes Mar 11, 2021
Copy link
Contributor

@panda-sheep panda-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job

dutor
dutor previously approved these changes Mar 11, 2021
Copy link
Contributor

@dutor dutor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kim_yes

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants