
rpc: Handle closed error #100525

Merged

Conversation

andrewbaptist
Collaborator

We close the listener before closing the connection. This can result in a spurious failure because the listener also closes our connection.

Epic: none
Fixes: #100391
Fixes: #77754
Informs: #80034

Release note: None
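
A minimal sketch (assumed, not the actual diff in this PR) of the failure mode and fix described above: once the connection has already been closed as part of listener/server shutdown, a later conn.Close() reports net.ErrClosed, and treating that specific error as benign avoids the spurious failure. The closeQuietly helper below is hypothetical.

```go
// Minimal sketch, assuming teardown can close the connection (via the
// listener/server) before our own Close call runs.
package main

import (
	"errors"
	"fmt"
	"log"
	"net"
)

// closeQuietly is a hypothetical helper: it closes conn and ignores
// net.ErrClosed, which shows up when something else (e.g. listener
// shutdown) already closed the connection.
func closeQuietly(conn net.Conn) error {
	if err := conn.Close(); err != nil && !errors.Is(err, net.ErrClosed) {
		return err
	}
	return nil
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}

	// Simulate the connection already having been closed elsewhere.
	_ = conn.Close()

	// Without the net.ErrClosed check, this second Close would surface as
	// a spurious error; with it, teardown succeeds.
	if err := closeQuietly(conn); err != nil {
		log.Fatalf("unexpected close error: %v", err)
	}
	fmt.Println("teardown ok")
}
```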

@andrewbaptist andrewbaptist requested a review from a team as a code owner April 3, 2023 21:52
@cockroach-teamcity
Member

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 2023-04-03-handle-ErrClosed branch from d7cb57f to 41288e1 on April 3, 2023 21:53
@andrewbaptist andrewbaptist requested a review from knz April 3, 2023 21:54
@andrewbaptist
Collaborator Author

I fixed the first issue (conn.Close() returning net.ErrClosed) easily under stress, and then hit the second issue (cluster ID) after a longer stress run (5-10 min / 50K iterations). I think the second issue is due to ports being reused between test runs, but I'm not sure. It has been happening for ~1 year, but very infrequently.

I can run the test for >30min without failure with both of these fixes.

Comment on lines +1277 to +1281
// TODO(baptist): Better understand when this happens. It appears we can get
// spurious connections to other tests on a stress run. This has been
// happening for a while, but only comes out rarely when this package is
// stressed. This test is very aggressive since it is calling GRPCDialNode in
// a busy loop for 50ms.
Contributor

I think this is a known problem. Test servers bind to port 0 to let the OS pick a random unused port, but as soon as we close it then a different test can be randomly assigned that very same port number, so if we later try to connect to that port we get unexpected results. This typically causes problems when we attempt to restart a stopped server reusing the same port number it was previously assigned, but someone else bound to it in the meanwhile.
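
An illustrative sketch (assumed, not CockroachDB test code) of the port-reuse hazard described above: listen on port 0, remember the assigned port, close the listener, and note that dialing the remembered port later may reach a completely unrelated listener.

```go
// A test server binds to port 0 so the OS assigns a free port, but once
// the listener is closed the port is no longer reserved and may be
// reassigned to another process or test.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // OS picks an unused port
	if err != nil {
		log.Fatal(err)
	}
	port := ln.Addr().(*net.TCPAddr).Port
	fmt.Println("assigned port:", port)

	// Closing the listener releases the port; nothing holds it for us.
	_ = ln.Close()

	// Dialing the remembered port later (e.g. when "restarting" the
	// server) may reach whatever else was assigned that port in the
	// meantime, or nothing at all, which matches the spurious
	// cross-test connections mentioned in the TODO above.
	if conn, err := net.Dial("tcp", fmt.Sprintf("127.0.0.1:%d", port)); err == nil {
		fmt.Println("connected, but to an unrelated listener")
		_ = conn.Close()
	} else {
		fmt.Println("dial failed:", err)
	}
}
```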

@andrewbaptist andrewbaptist added the backport-23.1.x label Apr 4, 2023
@andrewbaptist
Collaborator Author

bors r=erikgrinaker

@craig craig bot merged commit 76956df into cockroachdb:master Apr 4, 2023
@craig
Contributor

craig bot commented Apr 4, 2023

Build succeeded:

@andrewbaptist andrewbaptist deleted the 2023-04-03-handle-ErrClosed branch April 5, 2023 01:03
Successfully merging this pull request may close these issues:

rpc: TestHeartbeatHealthTransport failed