pool: track usage of incoming streams #10710
Conversation
Track usage of incoming streams on a connection. Connections without reference counts get marked as unused and reaped in a periodic job. This fixes a bug where `alloc exec` and `alloc fs` sessions get terminated unexpectedly. Previously, when a client's heartbeat switches between servers, the pool connection reaper eventually identifies the connection as unused and closes it even if it has active exec/fs sessions.
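At a high level, every in-flight RPC and accepted stream now holds a reference on its pooled connection, and the reaper only closes connections whose count has dropped to zero. A minimal sketch of that idea, with simplified field and type shapes assumed for illustration rather than copied from the actual `helper/pool` code:

```go
package pool

import "sync/atomic"

// Conn is a simplified stand-in for the pooled connection; only the
// reference-counting fields relevant to this change are sketched here.
type Conn struct {
	refCount    int32 // in-flight RPCs and accepted streams using this conn
	shouldClose int32 // set once the reaper decides the conn is unused
}

// markForUse records another active user of the connection so the
// periodic reaper will not treat it as idle.
func (c *Conn) markForUse() {
	atomic.AddInt32(&c.refCount, 1)
}

// releaseUse drops one reference; the last user closes the connection
// if the reaper has already flagged it for closing.
func (c *Conn) releaseUse() {
	refCount := atomic.AddInt32(&c.refCount, -1)
	if refCount == 0 && atomic.LoadInt32(&c.shouldClose) == 1 {
		c.Close()
	}
}

// Close tears down the underlying session; stubbed out in this sketch.
func (c *Conn) Close() error {
	return nil
}
```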
Nice work, LGTM. I've left a couple of comments where I think we could make the existing code harder for developers to break, but I'll leave that to your judgement.
client/rpc.go (Outdated)
```diff
@@ -301,7 +300,7 @@ func (c *Client) rpcConnListener() {

 // listenConn is used to listen for connections being made from the server on
 // pre-existing connection. This should be called in a goroutine.
-func (c *Client) listenConn(s *yamux.Session) {
+func (c *Client) listenConn(s *pool.Conn) {
```
Tiniest of nitpicks, but with the type being changed to `*pool.Conn`, naming this argument `s` doesn't make sense anymore. Maybe rename the arg to `p`?

(`pool.Conn` is kind of a weirdly named type in general because it's more like "connection factory" or "connect proxy" given that `Accept()` returns `net.Conn`, but it isn't a "connection pool" either as it contains the connection pool. But probably best not to rework the whole thing. 😀 )
I'll change it to `c`. FWIW, `pool.Conn` is the underlying physical connection - `pool.Accept()` returns a wrapped `yamux.Session` which also implements `net.Conn` (and which clients expect).
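To make that layering concrete, here is a rough sketch of how such a type might be shaped (assumed structure for illustration only; the real `helper/pool` code differs in detail):

```go
package pool

import (
	"net"

	"github.com/hashicorp/yamux"
)

// Conn sketches the pooled object: it owns the physical connection to a
// server and the yamux session multiplexed on top of it.
type Conn struct {
	addr    net.Addr
	session *yamux.Session
	// plus refCount, lastUsed, and other bookkeeping fields
}

// Accept hands back the next stream the remote side opens on this
// connection; a yamux stream satisfies net.Conn, which is what callers
// such as the client's listenConn expect.
func (c *Conn) Accept() (net.Conn, error) {
	stream, err := c.session.AcceptStream()
	if err != nil {
		return nil, err
	}
	return stream, nil
}
```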
helper/pool/pool.go (Outdated)
```diff
@@ -461,7 +496,7 @@ func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string,
 		p.clearConn(conn)
 	}

-	p.releaseConn(conn)
+	conn.releaseUse()
```
While reviewing this PR I'm noticing we have an existing hidden temporal coupling of `conn.releaseUse()` to `getRPCClient`. If someone were to call `conn.releaseUse` outside of this method without first calling `p.clearConn(conn)`, we can get a value of -1 for the reference count and then not close the connection. The current code is correct but easy for someone to break.

Two suggestions:

- Move this into a `defer conn.releaseUse()` right after we check the error from `getRPCClient` to make the temporal coupling more explicit.
- Maybe make `releaseUse` tolerant of misuse by having it close on `refCount < 1`:
```go
// releaseUse is the complement of `markForUse`, to free up the reference count
func (c *Conn) releaseUse() {
	refCount := atomic.AddInt32(&c.refCount, -1)
	if refCount < 1 && atomic.LoadInt32(&c.shouldClose) == 1 {
		c.Close()
	}
}
```
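For the first suggestion, the call site inside `ConnPool.RPC` would end up looking roughly like this (a sketch only; the return values of `getRPCClient` and the surrounding error handling are assumed, not copied from the real method):

```go
func (p *ConnPool) RPC(region string, addr net.Addr, version int, method string, args, reply interface{}) error {
	// getRPCClient bumps the connection's reference count (markForUse)
	// before handing it back.
	conn, sc, err := p.getRPCClient(region, addr, version)
	if err != nil {
		return err
	}
	// Pair the release with the acquisition on every return path, so the
	// temporal coupling is explicit and impossible to skip.
	defer conn.releaseUse()

	if err := msgpackrpc.CallWithCodec(sc.codec, method, args, reply); err != nil {
		p.clearConn(conn) // drop the broken connection from the pool
		return err
	}
	return nil
}
```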
I agree it's brittle, I'll move to defer.

I'm unsure about handling negative refCounts. It's unclear how `refCount` could go negative just from re-ordering calls, since only `conn.releaseUse` decrements the `refCount`. So negative values indicate double-release bugs, and defensive handling will probably lead to yet more subtle cases of unexpected connection closing like this one. I wish we could simply panic on negative values so we can find the source of the double-free case.
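A loud variant along those lines might look like this (purely illustrative; it is not part of this PR):

```go
// releaseUse with a hard failure on double release, to surface unpaired
// calls during development instead of silently absorbing them.
func (c *Conn) releaseUse() {
	refCount := atomic.AddInt32(&c.refCount, -1)
	if refCount < 0 {
		// A negative count can only come from a release without a matching
		// markForUse; failing fast points straight at the offending caller.
		panic("pool: connection reference count went negative")
	}
	if refCount == 0 && atomic.LoadInt32(&c.shouldClose) == 1 {
		c.Close()
	}
}
```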
> So negative values indicate double-release bugs, and defensive handling will probably lead to yet more subtle cases of unexpected connection closing like this one.
That's a good point. Totally agreed on that.
defer releaseUse to ensure it gets run. Also, clarify conn vs stream naming a little bit.
Co-authored-by: Tim Gross <[email protected]>
I've tested this code in a live cluster, and had an `alloc exec` command session run for almost 3 hours before I killed it. Without the change, an `alloc exec` session used to last around 7-20 minutes or so.

Fixes #10579