Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud: add optional connection timeout retries #99811

Merged
merged 1 commit into from
Mar 28, 2023

Conversation

stevendanna
Copy link
Collaborator

@stevendanna stevendanna commented Mar 28, 2023

This optionally adds connection timeout errors to the set of errors to retry in the resuming reader. Additionally, it adds the file name to error messages returned by the resuming reader.

We test for ETIMEDOUT. Linux returns ETIMEDOUT in cases where the remote end of a connection fails to acknowledge a configured number of TCP Keep-Alive messages.

This shouldn't catch timeouts that are the result of IO deadlines or context cancellations. But, we don't enable this retrying by default, to avoid changing any current behaviour.

Epic: none

Release note: None

@stevendanna stevendanna requested a review from a team as a code owner March 28, 2023 15:34
@stevendanna stevendanna requested review from benbardin and removed request for a team March 28, 2023 15:34
@blathers-crl
Copy link

blathers-crl bot commented Mar 28, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

pkg/cloud/cloud_io.go Outdated Show resolved Hide resolved
@stevendanna stevendanna force-pushed the retry-connection-timeout branch from 148a659 to 0de35f3 Compare March 28, 2023 15:52
pkg/cloud/cloud_io.go Outdated Show resolved Hide resolved
@stevendanna stevendanna force-pushed the retry-connection-timeout branch from 84c9ee2 to c63e1a4 Compare March 28, 2023 22:19
This optionally adds connection timeout errors to the set of errors to
retry in the resuming reader. Additionally, it adds the file name to
error messages returned by the resuming reader.

We test for ETIMEDOUT. Linux returns ETIMEDOUT in cases where the
remote end of a connection fails to acknowledge a configured number of
TCP Keep-Alive messages.

This shouldn't catch timeouts that are the result of IO deadlines or
context cancellations. But, just in case we've put this behind a
cluster setting to avoid changing behaviour.

Epic: none

Release note: None
@stevendanna stevendanna force-pushed the retry-connection-timeout branch from c63e1a4 to e5862c2 Compare March 28, 2023 22:27
@stevendanna
Copy link
Collaborator Author

stevendanna commented Mar 28, 2023

For the permanent record, I tested this by:

  1. Setting sudo sysctl -w net.ipv4.tcp_keepalive_probes=2 to shorten how long the following test takes. This is how many missed keepalives result in the connection being torn down. The default is 9 which takes a few minutes. Keep-Alives are set every 30s on our current version of Go.
  2. Creating some calling code that started a read and then slept for a while.
  3. While that code was stopped, using ss -ontap to find the local ephemeral port
  4. Using iptables to drop inbound traffic on that port: sudo iptables -A INPUT -p tcp --dport 33256 -j DROP
  5. Wait for the connection to fail with a timeout.

With the setting set to false, the error is returned to the user. With the setting set to true, we get a retry:

E230328 22:13:28.744289 3804 cloud/cloud_io.go:262 ⋮ [n1,client=127.0.0.1:56942,user=‹demo›] 754  read ‹cockroach-v20.1.11.darwin-10.9-amd64›: read tcp ‹10.142.0.102:33256› -> ‹52.219.110.2:443›: read: connection timed out
W230328 22:13:28.744408 3804 cloud/cloud_io.go:169 ⋮ [n1,client=127.0.0.1:56942,user=‹demo›] 755  retrying connection timed out because ‹cloudstorage.connection_timed_out_retries.enabled› = true
E230328 22:13:28.744421 3804 cloud/cloud_io.go:270 ⋮ [n1,client=127.0.0.1:56942,user=‹demo›] 756  Retry IO error: read ‹cockroach-v20.1.11.darwin-10.9-amd64›: read tcp ‹10.142.0.102:33256› -> ‹52.219.110.2:443›: read: connection timed out

@dt dt merged commit b3b04e5 into cockroachdb:release-22.2 Mar 28, 2023
@dt
Copy link
Member

dt commented Mar 29, 2023

does this need a forward port to master and 23.1?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants