Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc: disable and document overrides of OS default TCP keepalive by Go #6672

Merged
merged 7 commits into from
Nov 7, 2023

Conversation

JaydenTeoh
Copy link
Contributor

@JaydenTeoh JaydenTeoh commented Oct 1, 2023

Pertaining to #6250

Summary of Changes made (as suggested by @easwars):

  • transport/http2_client.go: Disable Go's override of OS default TCP keepalive interval (currently 15s) on the client-side
  • Add documentation to retain OS defaults on server-side and example to implement it within helloworld/greeter_server

I am still unable to replicate the bug mentioned originally in the issue and test whether my changes on client-side will fix it.

grpc version: v1.58.2
go version: go1.20.4 darwin/arm64
OS: macOS Ventura 13.4

image

at packet 3773, I applied the rule to pftcl:

block drop in on lo0 proto tcp from any to any port = 8080

but my server does not reset the connection after a single lost keepalive packet (I am using the default keepalive settings on both client and server side).

RELEASE NOTES: none

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 1, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@ginayeh ginayeh requested a review from easwars October 2, 2023 18:02
@easwars easwars added this to the 1.59 Release milestone Oct 5, 2023
easwars
easwars previously requested changes Oct 5, 2023
internal/transport/http2_client.go Show resolved Hide resolved
examples/helloworld/greeter_server/main.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
dialoptions.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
dialoptions.go Outdated Show resolved Hide resolved
@JaydenTeoh
Copy link
Contributor Author

@easwars Thank you for the comments. I have made the changes as requested. Do note I am still unable to replicate the original issue surfaced in #6250 to test if changes to retain the OS defaults does fix it.

Copy link
Member

@arvindbr8 arvindbr8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the previous comments. the changes LGTM, modulo minor formatting nits.

In general for docstrings, ensure that commentary is readable from source even on narrow screens. When a comment gets too long(typically anything > 80), it is recommended to wrap it into multiple single-line comments.


cc: @dfawley. Requesting a 2nd review from you.

examples/helloworld/greeter_server/main.go Outdated Show resolved Hide resolved
internal/transport/http2_client.go Outdated Show resolved Hide resolved
internal/transport/http2_client.go Outdated Show resolved Hide resolved
dialoptions.go Outdated
Comment on lines 417 to 418
// Note: Go overrides the OS defaults for TCP keepalive time and interval to 15s.
// To retain OS defaults, use a net.Dialer with the KeepAlive field set to a negative value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Note: Go overrides the OS defaults for TCP keepalive time and interval to 15s.
// To retain OS defaults, use a net.Dialer with the KeepAlive field set to a negative value.
// Note: Go overrides the OS defaults for TCP keepalive time and interval to 15s.
// To retain OS defaults, use a net.Dialer with the KeepAlive field set to a
// negative value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say something like "As of Go 1.21, the standard library overrides ............"

Also, something that gives clarity into our default behavior would be nice, too. E.g. "gRPC-Go's default dialer does this in order to restore the OS defaults."

Also:

//
// For more information, please see [issue 48622] in the Go github repo.
//
// [issue 48622]: https://github.com/golang/go/issues/48622

@arvindbr8 arvindbr8 requested a review from dfawley October 16, 2023 21:58
@arvindbr8 arvindbr8 linked an issue Oct 17, 2023 that may be closed by this pull request
examples/helloworld/greeter_server/main.go Outdated Show resolved Hide resolved
dialoptions.go Outdated
Comment on lines 417 to 418
// Note: Go overrides the OS defaults for TCP keepalive time and interval to 15s.
// To retain OS defaults, use a net.Dialer with the KeepAlive field set to a negative value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say something like "As of Go 1.21, the standard library overrides ............"

Also, something that gives clarity into our default behavior would be nice, too. E.g. "gRPC-Go's default dialer does this in order to restore the OS defaults."

Also:

//
// For more information, please see [issue 48622] in the Go github repo.
//
// [issue 48622]: https://github.com/golang/go/issues/48622

server.go Outdated Show resolved Hide resolved
@dfawley dfawley removed their assignment Oct 18, 2023
@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

❗ No coverage uploaded for pull request base (master@1466283). Click here to learn what that means.
The diff coverage is n/a.

Additional details and impacted files

@JaydenTeoh
Copy link
Contributor Author

JaydenTeoh commented Oct 19, 2023

@dfawley I've reverted the changes to the greeter_server example as requested. Actually the default TCP keepalive times were implemented since Go 1.13. The original issue in the Go github repo that discussed the implementation of default keepalive times is here. Here is the change I made for the documentation in dialoptions.go instead, do let me know if it is okay.

// Note: As of Go 1.13, the standard library overrides the OS defaults for
// TCP keepalive time and interval to 15s.
// To retain OS defaults, use a net.Dialer with the KeepAlive field set to a
// negative value.
//
// For more information, please see [issue 23459] in the Go github repo
//
// [issue 23459]: https://github.com/golang/go/issues/23459

Also, I am not really sure what you meant by including "gRPC-Go's default dialer does this in order to restore the OS defaults.". Based on my understanding, there has not been any restoration of OS defaults on grpc side until this PR. And the only restoration this PR achieves is only on the client side (see this)

@dfawley
Copy link
Member

dfawley commented Oct 20, 2023

I've reverted the changes to the greeter_server example as requested. Actually the default TCP keepalive times were implemented since Go 1.13.

Sorry, by "As of Go 1.21" I meant "ALL versions of go, up to and including 1.21 SO FAR". I think we should document that it's currently the case, but that it may not always be the case, and is even expected to be changed. If you can think of better wording than "As of" for this, then that's fine too!

Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the punctuation and potentially the benchmark changes.

dialoptions.go Outdated Show resolved Hide resolved
internal/transport/http2_client.go Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
@dfawley
Copy link
Member

dfawley commented Oct 31, 2023

@arvindbr8 please review again after @JaydenTeoh updates this. Thanks!

Copy link

github-actions bot commented Nov 6, 2023

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Nov 6, 2023
@JaydenTeoh
Copy link
Contributor Author

JaydenTeoh commented Nov 7, 2023

@arvindbr8 updated. Hmm but seems like the tests are failing. I don't think it is related to changing the dialer on the benchmark/benchmain/main.go file though

@github-actions github-actions bot removed the stale label Nov 7, 2023
@arvindbr8
Copy link
Member

@JaydenTeoh -- it is a known flake, I've rerun the failed test for you. Meanwhile, let me take a look at the diff.

Copy link
Member

@arvindbr8 arvindbr8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo minor comment about docstring. Just to maintain a similar formatting for references.

dialoptions.go Show resolved Hide resolved
server.go Show resolved Hide resolved
@arvindbr8
Copy link
Member

arvindbr8 commented Nov 7, 2023

nvmd, I've been informed this doesnt work. Your comments was per the golang doc

@arvindbr8 arvindbr8 merged commit f1a1fcd into grpc:master Nov 7, 2023
13 checks passed
@@ -176,7 +176,9 @@ func dial(ctx context.Context, fn func(context.Context, string) (net.Conn, error
if networkType == "tcp" && useProxy {
return proxyDial(ctx, address, grpcUA)
}
return (&net.Dialer{}).DialContext(ctx, networkType, address)
// KeepAlive is set to a negative value to prevent Go's override of the TCP
// keepalive time and interval; retain the OS default.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this comment is technically correct, I'm not sure this is what you intended and I think the comment could be more clear: the default on linux is to disable keepalives unless opted in via SO_KEEPALIVE, which then uses the os-level keepalive configuration.

At least this is different from Java and what Eric proposed in #6250 (comment), which would be to set SO_KEEPALIVE to true unconditionally. This could be done by adding something like:

		Control: func(network, address string, c syscall.RawConn) error {
			return c.Control(func(fd uintptr) {
				syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_KEEPALIVE, 1)
			})
		},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default on linux is to disable keepalives unless opted in via SO_KEEPALIVE, which then uses the os-level keepalive configuration.

Hmm, I believe you are right. We were under the impression that keepalives were always enabled and that a negative value would not disable them, but would use the OS defaults for the timers. However, this doesn't call setKeepAlive when the value is negative:

https://go.dev/src/net/tcpsock.go#L238

And they set SO_KEEPALIVE explicitly for posix systems in setKeepAlive:

https://go.dev/src/net/sockopt_posix.go#L116

So presumably not doing that means keepalives will be disabled otherwise.

Copy link
Collaborator

@atollena atollena Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of unconditionally enabling keepalives via Dialer control then?

(3) is something that Java's been doing forever, and still do independent of (1). Even with (1), (3) is still useful as some people configure their OS to use more aggressive settings for their specific environment (e.g., AWS). For those environments (3) behaves well without any extra work from users. And if you aren't in such an environment, then (3) is very low or zero cost.

I think we should try to do that before the release.

edit: nevermind, just saw your comment on #6250 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of unconditionally enabling keepalives via Dialer control then?

A little bit more thought is required I guess. We do set the Dialer.Control to a different function when running inside of the Google production network to set certain socket options that are appropriate for that environment. Unconditionally enabling TCP keepalives using the Dialer.Control here could still work (if we enhance the above function to also do the same). Thoughts @dfawley ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it appears that the proposal mentioned here has made significant progress and has a pending change. This would allow us to disable TCP keepalive or use OS defaults with a simple API. We would have to wait for a new Go version though, to be able to use this and would be a while until that Go version becomes the least supported Go version for grpc.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TCP Connection reset after a single lost keepalive packet
6 participants