Improve HA behavior of database agents in leaf clusters #10641
Conversation
@gabrielcorado Can you also check if the HA for application access we implemented a couple of months ago is prone to the same issue, and if so implement a fix/test similar to this one?
// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
func isReverseTunnelDownError(err error) bool {
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}
After adding the missing case types.DatabaseTunnel to the switch r.ConnType and returning trace.ConnectionProblem(err, NoDatabaseTunnel), the check based on the NoDatabaseTunnel error message text seems to be unnecessary.
I don't think that's correct - previously we would also return trace.ConnectionProblem there (if you look a few lines below, after directDial) and it wasn't detected in the db proxy. To confirm, I also tried removing the error message matching, and the integration test I wrote failed with the error this PR fixes.
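To illustrate the point, here is a minimal standalone sketch (not the actual Teleport code; the NoDatabaseTunnel message text is an assumption): an error that crosses the leaf cluster transport comes back as a generic error, so the typed check fails and only the message text survives.

package main

import (
	"fmt"
	"strings"

	"github.com/gravitational/trace"
)

// NoDatabaseTunnel stands in for the reversetunnel constant; the exact
// message text is assumed for this sketch.
const NoDatabaseTunnel = "no database reverse tunnel available"

func main() {
	// Same-cluster dial: the typed trace error is preserved.
	local := trace.ConnectionProblem(nil, NoDatabaseTunnel)
	fmt.Println(trace.IsConnectionProblem(local)) // true

	// Leaf-cluster dial: the error is re-created generically in transit,
	// so only its message text is left to match on.
	remote := fmt.Errorf("remote dial: %s", NoDatabaseTunnel)
	fmt.Println(trace.IsConnectionProblem(remote))                  // false
	fmt.Println(strings.Contains(remote.Error(), NoDatabaseTunnel)) // true
}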
// Connections to applications and databases should never occur over
// a direct dial, return right away.
switch r.ConnType {
case types.AppTunnel:
	return nil, false, trace.ConnectionProblem(err, NoApplicationTunnel)
case types.DatabaseTunnel:
	return nil, false, trace.ConnectionProblem(err, NoDatabaseTunnel)
}
nit: can we somehow unify this switch so we have the same common logic for remoteSite and localSite? https://github.com/gravitational/teleport/blob/roman/leafdb/lib/reversetunnel/localsite.go#L327
We could, but to be honest it doesn't feel like moving it out is really worth it (and it would probably complicate the implementation a bit too).
var (
	// mu protects the shuffleFunc global access.
	mu sync.RWMutex
	// shuffleFunc provides shuffle behavior for multiple database agents.
	shuffleFunc ShuffleFunc = ShuffleRandom
)
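For context, a sketch of how this guarded global might be swapped out in tests (the SetShuffleFunc/getShuffleFunc helper names are assumptions based on the diff, not necessarily the PR's exact API):

// ShuffleFunc defines a function that reorders a list of database servers.
type ShuffleFunc func([]types.DatabaseServer) []types.DatabaseServer

// SetShuffleFunc overrides the shuffle behavior, e.g. with a deterministic
// sort in tests.
func SetShuffleFunc(fn ShuffleFunc) {
	mu.Lock()
	defer mu.Unlock()
	shuffleFunc = fn
}

// getShuffleFunc returns the current shuffle behavior under the read lock.
func getShuffleFunc() ShuffleFunc {
	mu.RLock()
	defer mu.RUnlock()
	return shuffleFunc
}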
I'm not a big fan of global variables, but I assume this logic is hard to test without one. Though I wonder if the db servers' order can be enforced in integration tests by setting a specific time point in the clock used by the proxy db service:
clock := NewFakeClockAt(time.Date(2021, time.April, 4, 0, 0, 0, 0, time.UTC))
pack := setupDatabaseTest(t, withClock(clock))
Yeah, I actually thought about the same thing but didn't know how to ensure a stable shuffle order even with the frozen time. I think in this case the shuffle will always produce the same order, but there's no way of telling which order. So while I'm not a big fan of this either, it felt like the most reliable way to ensure a guaranteed order.
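A minimal sketch of that point, assuming the shuffle is seeded from the clock: freezing time makes the order reproducible across runs, but says nothing about which permutation you get.

package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/jonboulle/clockwork"
)

func main() {
	// Frozen clock -> fixed seed -> the same shuffle on every run.
	clock := clockwork.NewFakeClockAt(time.Date(2021, time.April, 4, 0, 0, 0, 0, time.UTC))
	r := rand.New(rand.NewSource(clock.Now().UnixNano()))

	servers := []string{"db-1", "db-2", "db-3"}
	r.Shuffle(len(servers), func(i, j int) {
		servers[i], servers[j] = servers[j], servers[i]
	})

	// Deterministic, but there is no way to know up front whether the
	// test should expect db-1 or db-3 to be dialed first.
	fmt.Println(servers)
}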
Technically we could always shuffle and in tests add an extra pass to sort them.
If you mean adding some sort of "test mode", I was trying to avoid that. I generally try not to make "if isTestMode()" switches in the code.
I was about to write NVM, but you were faster. Initially I thought about passing an extra sort function in tests, but that introduces the same issue we already have.
I agree on avoiding the "test mode" too.
Left some suggestions but otherwise the PR LGTM.
// isReverseTunnelDownError returns true if the provided error indicates that
// the reverse tunnel connection is down e.g. because the agent is down.
func isReverseTunnelDownError(err error) bool {
	return trace.IsConnectionProblem(err) ||
		strings.Contains(err.Error(), reversetunnel.NoDatabaseTunnel)
}
nit: Any reason not to create a var ErrNoDatabaseTunnel and just use err == ErrNoDatabaseTunnel here instead of comparing strings? To me that sounds cleaner and safer, as the current implementation panics when err == nil - not very likely to happen, but still.
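A sketch of that sentinel-error variant, assuming ErrNoDatabaseTunnel were introduced (it is hypothetical here, not an existing Teleport identifier); errors.Is handles wrapping and is nil-safe, unlike calling err.Error():

package main

import (
	"errors"
	"fmt"
)

// ErrNoDatabaseTunnel is a hypothetical sentinel for this sketch.
var ErrNoDatabaseTunnel = errors.New("no database reverse tunnel available")

func isReverseTunnelDownError(err error) bool {
	// errors.Is is safe when err == nil, whereas err.Error() would panic.
	return errors.Is(err, ErrNoDatabaseTunnel)
}

func main() {
	wrapped := fmt.Errorf("dial leaf agent: %w", ErrNoDatabaseTunnel)
	fmt.Println(isReverseTunnelDownError(wrapped)) // true
	fmt.Println(isReverseTunnelDownError(nil))     // false
}

As the reply below notes, though, this only helps if the error's identity actually survives the transport.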
As far as I looked at it, we just get a generic trace error from reversetunnel/transport in this case, so I'm not sure introducing another error type is going to help. Otherwise, the check for trace.ConnectionProblem which we already had here would have worked too.
if trace.IsConnectionProblem(err) {
s.log.WithError(err).Warnf("Failed to dial %v.", server)
// If an agent is down, we'll retry on the next one (if available).
if isReverseTunnelDownError(err) {
Why do we need to check for error type anyway? Shouldn't we just try to connect to all servers in case of any error?
Yeah, I thought about it originally and decided to only retry reverse tunnel errors specifically, to be on the safe side and not retry errors that shouldn't be retried (like target database connection errors, RBAC, etc.). We do it the same way for kube and apps.
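Roughly, the fallback described here looks like the following sketch (an assumed shape reusing isReverseTunnelDownError and the shuffle helper from the diffs above, not the PR's exact code):

// dialWithFallback tries each shuffled agent in turn, but only moves on to
// the next one when the failure looks like a downed reverse tunnel; other
// errors (RBAC, target database failures, etc.) are returned immediately.
func dialWithFallback(servers []types.DatabaseServer, dial func(types.DatabaseServer) (net.Conn, error)) (net.Conn, error) {
	var lastErr error
	for _, server := range getShuffleFunc()(servers) {
		conn, err := dial(server)
		if err == nil {
			return conn, nil
		}
		if !isReverseTunnelDownError(err) {
			return nil, trace.Wrap(err)
		}
		// Agent appears to be down, retry on the next one.
		lastErr = err
	}
	return nil, trace.Wrap(lastErr)
}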
Fixes #10640.
The database agent fallback logic in the proxy server did not recognize the reverse tunnel error properly when connecting to a leaf cluster agent, so the retry logic did not kick in. The reason is that when connecting to an agent in the same cluster the returned error is trace.ConnectionProblem, but when connecting to an agent in a leaf cluster it's just a generic error. This fix ensures we check for the proper error message to recognize it.
I've also written integration tests for both scenarios, HA in root and HA in leaf. Had to do some shenanigans to make sure we can override the shuffling behavior in tests so we don't rely on chance.
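For reference, the test-side override could look something like this sketch (the ShuffleSort name and db package alias are assumptions, not confirmed identifiers from the PR):

func TestDatabaseAgentHA(t *testing.T) {
	// Force a deterministic agent order so the assertions below don't
	// depend on which permutation the random shuffle happens to pick.
	db.SetShuffleFunc(db.ShuffleSort)
	t.Cleanup(func() { db.SetShuffleFunc(db.ShuffleRandom) })

	// ... start root and leaf clusters, register two database agents,
	// take the first one down, and assert the connection still succeeds
	// through the remaining agent.
}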