Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: more detailed TCP keep-alive configuration #62254

Closed
neild opened this issue Aug 23, 2023 · 13 comments
Closed

net: more detailed TCP keep-alive configuration #62254

neild opened this issue Aug 23, 2023 · 13 comments

Comments

@neild
Copy link
Contributor

neild commented Aug 23, 2023

TCP keep-alives are enabled by setting the SO_KEEPALIVE socket option. When keep-alives are enabled, the operating system will probe idle connections and close connections with an unresponsive peer.

Linux, FreeBSD, and Windows permit configuring keep-alive probes with the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT socket options. (Darwin appears to call TCP_KEEPIDLE TCP_KEEPALIVE.) These options set the idle duration before the first probe is sent, the interval between probes, and the maximum number of probes to send before declaring a connection dead, respectively.

The net package permits configuring TCP keepalive behavior with:

Setting the keep-alive period sets both TCP_KEEPIDLE and TCP_KEEPINTVL. There is no way in the net package to set TCP_KEEPCNT or otherwise configure how long it takes to declare a connection dead. There is no way to set different values for the idle period and the probe interval.

Since Go 1.12 (https://go.dev/cl/107196), the net package has set TCP_KEEPIDLE and TCP_KEEPINTVL on all new sockets to 15 seconds by default.

This 15 second value is not appropriate for all uses. For example, #48622 complains that the default keep-alive settings cause excessive CPU consumption on mobile devices due to frequent probes. Reducing the time until the first keep-alive probe is sent with Dialer.KeepAlive also reduces the time between probes; setting the period to five minutes will cause the kernel to take ~50 minutes to detect a dead connection.

I propose that we change the net package functions which set the keep-alive period to set TCP_KEEPIDLE, but not TCP_KEEPINTVL. It is surprising and not useful for a "keep-alive period" of ten minutes to translate into a connection being declared dead only after several hours of unresponsiveness.

I propose that we also add a finer-grained API for configuring keep-alive behavior. I haven't exhaustively surveyed the operating systems we support, but Linux, Windows, Darwin, and at least one BSD all appear to support the same three settings. It's possible to set these via golang.org/x/sys, but providing a common way to configure them in the net package seems reasonable.

The following API also permits for enabling keep-alives on a connection while using the operating system default keep-alive configuration (set KeepAliveConfig.Enable, set the other fields to -1).

package net

// KeepAliveConfig contains TCP keep-alive options.
//
// If the Idle, Interval, or Count fields are zero, a default value is chosen.
// If a field is negative, the operating system default is used.
type KeepAliveConfig struct {
        // If Enable is true, keep-alive probes are enabled.
        Enable bool

        // Idle is the time that the connection must be idle before
        // the first keep-alive probe is sent.
        Idle time.Duration

        // Interval is the time between keep-alive probes.
        Interval time.Duration

        // Count is the maximum number of keep-alive probes that
        // should be sent before dropping a connection.
        Count int
}

type Dialer { // contains existing unchanged fields
        // KeepAlive specifies the interval between keep-alive
        // probes for an active network connection.
        //
        // KeepAlive is ignored if KeepAliveConfig.Enable is true.
        //
        // If zero, keep-alive probes are sent with a default value
        // (currently 15 seconds), if supported by the protocol and operating
        // system. Network protocols or operating systems that do
        // not support keep-alives ignore this field.
        // If negative and KeepAliveConfig.Enable is false, keep-alive probes are disabled.
        KeepAlive time.Duration

        // KeepAliveConfig specifies the keep-alive probe configuration
        // for an active network connection, when supported by the
        // protocol and operating system.
        //
        // If KeepAliveConfig.Enable is true, keep-alives probes are enabled.
        // If KeepAliveConfig.Enable is false and KeepAlive is negative,
        // keep-alive probes are disabled.
        KeepAliveConfig KeepAliveConfig
}

type ListenConfig {
        // same fields as Dialer above
}

// SetKeepAliveConfig configures keep-alive messages sent by the operating system.
func (c *TCPConn) SetKeepAliveConfig(config KeepAliveConfig) error {}
@neild
Copy link
Contributor Author

neild commented Aug 23, 2023

I found the following survey of keep-alive behaviors to be useful:
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

That post also covers the TCP_USER_TIMEOUT socket option, which configures another aspect of how dead TCP connections are detected. I'm not certain if anything other than Linux supports that option, so I've left it out of this proposal.

@bcmills
Copy link
Contributor

bcmills commented Aug 24, 2023

Presumably SetKeepAliveConfig would return a non-nil error if the configured keepalive behavior can't be provided on the current platform?

@neild
Copy link
Contributor Author

neild commented Aug 24, 2023

Presumably SetKeepAliveConfig would return a non-nil error if the configured keepalive behavior can't be provided on the current platform?

That was my thought. If a caller doesn't mind a platform not supporting an option, they can just not check the returned error.

@panjf2000
Copy link
Member

Presumably SetKeepAliveConfig would return a non-nil error if the configured keepalive behavior can't be provided on the current platform?

Maybe we should return an error that wraps errors. ErrUnsupported from #41198

@rsc
Copy link
Contributor

rsc commented Oct 11, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Oct 11, 2023
@rsc
Copy link
Contributor

rsc commented Oct 24, 2023

This seems reasonable, well-motivated, and well-justified. Thank you.
I think you can drop the "and KeepAliveConfig.Enable is false" in the
KeepAlive docs, since it already said the field is ignored when Enable==true.

It would help to say what the current defaults are in the config struct docs.

And in Count maybe "should be sent" should be "can go unanswered".

@neild
Copy link
Contributor Author

neild commented Nov 2, 2023

Updated proposal with the suggested changes.

I needed to pick default values for KeepAliveConfig fields. I went with the current behavior for Idle and Interval (15 seconds), and the Linux kernel default for Count (9). We could also say that the default for Count matches the current behavior of picking the OS default, but it seems less surprising for us to set all three.

package net

// KeepAliveConfig contains TCP keep-alive options.
//
// If the Idle, Interval, or Count fields are zero, a default value is chosen.
// If a field is negative, the operating system default is used.
type KeepAliveConfig struct {
        // If Enable is true, keep-alive probes are enabled.
        Enable bool

        // Idle is the time that the connection must be idle before
        // the first keep-alive probe is sent.
        // If zero, a default value of 15 seconds is used.
        Idle time.Duration

        // Interval is the time between keep-alive probes.
        // If zero, a default value of 15 seconds is used.
        Interval time.Duration

        // Count is the maximum number of keep-alive probes that
        // can go unanswered before dropping a connection.
        // If zero, a default value of 9 is used.
        Count int
}

type Dialer { // contains existing unchanged fields
        // KeepAlive specifies the interval between keep-alive
        // probes for an active network connection.
        //
        // KeepAlive is ignored if KeepAliveConfig.Enable is true.
        //
        // If zero, keep-alive probes are sent with a default value
        // (currently 15 seconds), if supported by the protocol and operating
        // system. Network protocols or operating systems that do
        // not support keep-alives ignore this field.
        // If negative, keep-alive probes are disabled.
        KeepAlive time.Duration

        // KeepAliveConfig specifies the keep-alive probe configuration
        // for an active network connection, when supported by the
        // protocol and operating system.
        //
        // If KeepAliveConfig.Enable is true, keep-alives probes are enabled.
        // If KeepAliveConfig.Enable is false and KeepAlive is negative,
        // keep-alive probes are disabled.
        KeepAliveConfig KeepAliveConfig
}

type ListenConfig {
        // same fields as Dialer above
}

// SetKeepAliveConfig configures keep-alive messages sent by the operating system.
func (c *TCPConn) SetKeepAliveConfig(config KeepAliveConfig) error {}

@rsc
Copy link
Contributor

rsc commented Nov 2, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

Proposal details are in #62254 (comment).

@rsc rsc moved this from Active to Likely Accept in Proposals Nov 2, 2023
@rsc
Copy link
Contributor

rsc commented Nov 10, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

Proposal details are in #62254 (comment).

@rsc rsc moved this from Likely Accept to Accepted in Proposals Nov 10, 2023
@rsc rsc changed the title proposal: net: more detailed TCP keep-alive configuration net: more detailed TCP keep-alive configuration Nov 10, 2023
@rsc rsc modified the milestones: Proposal, Backlog Nov 10, 2023
@panjf2000
Copy link
Member

I've done some investigations about TCP keep-alive on various platforms and had a CL for this proposal drafted, I'd volunteer to implement this if you don't mind.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/542275 mentions this issue: net: add KeepAliveConfig and implement SetKeepAliveConfig

@panjf2000 panjf2000 self-assigned this Nov 14, 2023
@panjf2000 panjf2000 modified the milestones: Backlog, Go1.22 Nov 17, 2023
@odeke-em
Copy link
Member

Thanks everyone for the discourse! Looks like it is too late in the cycle to get in, forward rolling to Go1.23.

@odeke-em odeke-em modified the milestones: Go1.22, Go1.23 Dec 13, 2023
@panjf2000

This comment was marked as outdated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests

6 participants