-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry app access requests in different servers when there is a connection failure #9288
Conversation
68f72c0
to
100bacf
Compare
lib/web/app/handler.go
Outdated
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) { | ||
if !trace.IsConnectionProblem(err) && err != io.EOF { | ||
w.WriteHeader(http.StatusInternalServerError) | ||
w.Write([]byte(http.StatusText(http.StatusInternalServerError))) | ||
return | ||
} | ||
|
||
session, err := h.renewSession(req) | ||
if err != nil { | ||
w.WriteHeader(http.StatusInternalServerError) | ||
w.Write([]byte(http.StatusText(http.StatusInternalServerError))) | ||
return | ||
} | ||
|
||
session.fwd.ServeHTTP(w, req) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will if all agents are unavailable ?
At first glance it looks that we will create a spin while loop with the while !trace.IsConnectionProblem
condition:
- newSessionIs Created
- fwd app agent dial call fails with ConnectionProblem err.
- session is renewed and a new transport is created.
- loop flow starts from 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, the new session flow now only matches "healthy" servers, so if there isn't any, it will return an error on the renew the session function.
However, there can be a case where the server is unstable, meaning that at the session creation, it is healthy, but when the forward happens, it is not. So we'll have the loop you mentioned in this case, and the client will timeout its request.
lib/web/app/transport.go
Outdated
t.mu.Lock() | ||
defer t.mu.Unlock() | ||
|
||
for i := len(t.c.servers) - 1; i >= 0; i-- { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this approach in scenario where all app agents are available the we will forward all connection to only one app agent without any LB approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I'll add a shuffle on the Match
function, so we get a new ordination of the server's list every time a new session is created and keep the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would shuffle here as well, to avoid relying on some external behavior.
@@ -62,30 +63,34 @@ func (h *Handler) newSession(ctx context.Context, ws types.WebSession) (*session | |||
if err != nil { | |||
return nil, trace.Wrap(err) | |||
} | |||
server, err := Match(ctx, accessPoint, MatchPublicAddr(identity.RouteToApp.PublicAddr)) | |||
|
|||
servers, err := Match(ctx, accessPoint, MatchAll(MatchHealthy(h.c.ProxyClient, identity), MatchPublicAddr(identity.RouteToApp.PublicAddr))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why we need check all app servers healthy status here.
It looks like the logic in Transport Dial function does already the same thing: https://github.com/gravitational/teleport/pull/9288/files#diff-9dd39725077cc062cbe4ded242810d79af548977d645d288875b70d49d759072R174
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to give the transport a more accurate list of servers, and in the case of no healthy status, it would fail before the actual request forward. It is also aligned with the solution, where if the transport's list of servers is empty, the new transport will have only healthy servers to forward requests to, also working as a stop condition of this "renew loop".
Do you think it is better to avoid dialing apps during session creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now. If a bit unintuitive for me that Dial status check was done in two places. Could you please add a comment explaining why we need to call MatchHealthy
here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/web/app/transport.go
Outdated
tr.DialContext = t.DialContext | ||
tr.TLSClientConfig = t.clientTLSConfig | ||
return t, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you make some changes to tr
, but then return t
? Looks like a typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, the tr
is inside the t
(which is the *transport
we want to return). I will change the order of the changes to make it more clear.
79d120c
to
9160bf0
Compare
lib/web/app/handler.go
Outdated
// to get "fresh" app servers, and then will forwad the request to the newly | ||
// created session. | ||
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) { | ||
if !trace.IsConnectionProblem(err) && err != io.EOF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we check specifically for the error that is returned when the agent is offline? I'm a little worried that we may retry a request that should not be retried (e.g. some HTTP request that is not idempotent in case connection drops mid-flight).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. ConnectionProblem
error should be enough here since it is the error that comes from the DialContext
function. I'll update it.
lib/web/app/transport.go
Outdated
if err != nil { | ||
// Connection problem with the server. | ||
if trace.IsConnectionProblem(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, can we only check for reverse tunnel connection error specifically to avoid inadvertently retrying requests we shouldn't retry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the ConnectionProblem
error happen in other scenarios? Looking at the code it seems to be the correct error to check for reverse tunnel connectivity issues.
lib/web/app/transport.go
Outdated
t.mu.Lock() | ||
defer t.mu.Unlock() | ||
|
||
for i := len(t.c.servers) - 1; i >= 0; i-- { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would shuffle here as well, to avoid relying on some external behavior.
3d821cb
to
7693233
Compare
// Check that the session exists in the backend cache. This allows the user | ||
// to logout and invalidate their application session immediately. This | ||
// lookup should also be fast because it's in the local cache. | ||
return h.c.AccessPoint.GetAppSession(r.Context(), types.GetAppSessionRequest{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding the AccessPoint is used to retrieve WebSession
where "cache" is uses to cache the transport forwarder.
The /teleport-logout
logout flow doesn't remove anything from local "cache" and only calls the DeleteAppSession
. So this check probably checks if user is still logged and AppSession is still valid.
@@ -62,30 +63,34 @@ func (h *Handler) newSession(ctx context.Context, ws types.WebSession) (*session | |||
if err != nil { | |||
return nil, trace.Wrap(err) | |||
} | |||
server, err := Match(ctx, accessPoint, MatchPublicAddr(identity.RouteToApp.PublicAddr)) | |||
|
|||
servers, err := Match(ctx, accessPoint, MatchAll(MatchHealthy(h.c.ProxyClient, identity), MatchPublicAddr(identity.RouteToApp.PublicAddr))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now. If a bit unintuitive for me that Dial status check was done in two places. Could you please add a comment explaining why we need to call MatchHealthy
here ?
7693233
to
d57f2a7
Compare
session, err := h.getSession(ctx, ws) | ||
// Remove the session from the cache, this will force a new session to be | ||
// generated and cached. | ||
h.cache.remove(ws.GetName()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on current code I think that handling this situation seems to be "safe". The h.cache.remove
called from other goroutine will remove previously inserted by other goroutine but the new session will be inserted.
The only concerns that comes to my mind is performance impact because newSession flow tries to dial all app services to check it services are healthy.
So maybe we should consider using something like https://pkg.go.dev/golang.org/x/[email protected]/singleflight that allows to block simultaneous code executed based on a key but this probably be done in a separate ticket/PR.
45a8205
to
cfafbed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor formatting nit comments. I think that there is one open issues
https://github.com/gravitational/teleport/pull/9288/files#discussion_r768280901 that needs to be confirmed by @r0mant. Otherwise, LGTM
7a5c5c1
to
17b3b98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gabrielcorado lgtm. Have you tested this fix in the Kubernetes rolling update scenario? To make sure app access keeps working when a Teleport app agent pod gets deleted and another one spinned up during K8s deployment rolling update? Let's please make sure it works in that scenario before merging.
Also, please backport to branch/v8 after this gets merged.
lib/web/app/transport.go
Outdated
clusterClient, err := c.proxyClient.GetSite(c.identity.RouteToApp.ClusterName) | ||
if err != nil { | ||
return nil, trace.Wrap(err) | ||
t.servers.Range(func(serverID interface{}, appServerInterface interface{}) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.servers.Range(func(serverID interface{}, appServerInterface interface{}) bool { | |
t.servers.Range(func(serverID, appServerInterface interface{}) bool { |
lib/web/app/transport.go
Outdated
} | ||
// DialContext dials and connect to the application service over the reverse | ||
// tunnel subsystem. | ||
func (t *transport) DialContext(ctx context.Context, _ string, _ string) (net.Conn, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (t *transport) DialContext(ctx context.Context, _ string, _ string) (net.Conn, error) { | |
func (t *transport) DialContext(ctx context.Context, _, _ string) (net.Conn, error) { |
lib/web/app/transport.go
Outdated
if dialErr != nil { | ||
// Connection problem with the server. | ||
if trace.IsConnectionProblem(dialErr) { | ||
t.c.log.Warnf("Failed to connect to application server \"%d\": %v.", serverID, dialErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.c.log.Warnf("Failed to connect to application server \"%d\": %v.", serverID, dialErr) | |
t.c.log.Warnf("Failed to connect to application server %q: %v.", serverID, dialErr) |
@r0mant Yes, I tested this scenario with app agents deployed in Kubernetes using the Helm chart ( |
17b3b98
to
ae08bc4
Compare
Closes #9126 by retrying the app requests in different agents.
The solution consists of two parts:
transport
(lib/web/app/transport.go
) to receive a list of servers, and whenDialContext
fails, it will try a different server from the list. Note: every time a server has connection issues, it is removed from the list, and when the list is empty, the transport returns an error;transport
forwarder, provide anErrorHandler
which will (when called) expire the current session, create a new one (with a fresh list of agents) and forward the request using it;In addition to these changes, the new session process now matches only "healthy" servers. A "healthy" server is a server where the handler could establish a connection.