[DNM] rpc/nodedialer: avoid tripping breaker on context.Canceled
Marked DNM because testing is coming: nodedialer currently has no tests and I'm going to change that. I just want to get this change out there to start the conversation and make myself more visibly accountable.

In cockroachdb#34026 we added logic to ensure that the context was not canceled before calling into GRPCDial. This change extends that to also avoid calling breaker.Fail if the returned error was context.Canceled. The motivation is an observation that breakers can trip due to context cancellations which race with calls to Dial (from, say, DistSQL). This behavior was observed after a node died due to an unrelated corruption bug. It appears that this node failure triggered a context cancellation, which tripped a breaker, which led a different flow to fail, which led to another cancellation, which seems to have then tripped another breaker. The evidence for this exact series of events is somewhat scant, but we do know for certain that we saw breakers tripped due to context cancellation, which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share some copy-pasted code which was becoming burdensome.

Release note: None