Retry app access requests in different servers when there is a connection failure #9288

Merged 1 commit on Jan 13, 2022

Conversation

gabrielcorado (Contributor)

Closes #9126 by retrying app requests on different agents.

The solution consists of two parts:

  1. Changing the transport (lib/web/app/transport.go) to receive a list of servers; when DialContext fails, it tries a different server from the list. Note: every time a server has connection issues it is removed from the list, and once the list is empty the transport returns an error (a rough sketch of this dial loop follows below);
  2. When creating the transport forwarder, providing an ErrorHandler which (when called) expires the current session, creates a new one (with a fresh list of agents) and forwards the request using it.

In addition to these changes, the new session process now matches only "healthy" servers. A "healthy" server is one the handler could successfully establish a connection to.
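For readers, a minimal sketch of the dial-and-retry idea described above. It is illustrative only: the servers list here is plain addresses and dial is a stand-in for the reverse tunnel dial; the actual PR code differs.

```go
package app

import (
	"context"
	"net"
	"sync"

	"github.com/gravitational/trace"
)

// transport keeps the list of app agents still considered reachable.
type transport struct {
	mu      sync.Mutex
	servers []string
	dial    func(ctx context.Context, addr string) (net.Conn, error)
}

// DialContext tries each remaining server in turn. A server that fails with a
// connection problem is dropped from the list so later requests skip it; once
// the list is exhausted, a connection problem error is returned.
func (t *transport) DialContext(ctx context.Context, _, _ string) (net.Conn, error) {
	t.mu.Lock()
	defer t.mu.Unlock()

	for i := len(t.servers) - 1; i >= 0; i-- {
		conn, err := t.dial(ctx, t.servers[i])
		if err == nil {
			return conn, nil
		}
		if !trace.IsConnectionProblem(err) {
			// Non-connectivity failures are surfaced immediately.
			return nil, trace.Wrap(err)
		}
		// Drop the unreachable server and fall through to the next one.
		t.servers = append(t.servers[:i], t.servers[i+1:]...)
	}
	return nil, trace.ConnectionProblem(nil, "no healthy application servers remaining")
}
```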

@gabrielcorado gabrielcorado self-assigned this Dec 8, 2021
@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 3 times, most recently from 68f72c0 to 100bacf Compare December 8, 2021 19:20
Comment on lines 148 to 164
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) {
	if !trace.IsConnectionProblem(err) && err != io.EOF {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session, err := h.renewSession(req)
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session.fwd.ServeHTTP(w, req)
}
Contributor
What will happen if all agents are unavailable?

At first glance it looks like we will create a spinning loop around the !trace.IsConnectionProblem condition:

  1. A new session is created (newSession).
  2. The forwarder's dial call to the app agent fails with a ConnectionProblem error.
  3. The session is renewed and a new transport is created.
  4. The flow loops back to step 2.

Contributor Author

Not really: the new session flow now only matches "healthy" servers, so if there aren't any, the renew session function returns an error.

However, there can be a case where the server is unstable: at session creation it is healthy, but by the time the forward happens it is not. In that case we will have the loop you mentioned, and the client's request will eventually time out.

t.mu.Lock()
defer t.mu.Unlock()

for i := len(t.c.servers) - 1; i >= 0; i-- {
Contributor
With this approach, in the scenario where all app agents are available, we will forward every connection to only one app agent, without any load balancing.

Contributor Author

True. I'll add a shuffle in the Match function so we get a new ordering of the servers list every time a new session is created, keeping the current behavior.
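As a rough illustration of the shuffle idea (not the PR's exact code; addrs is a placeholder for the matched server list, which in the real Match function is a list of app server resources):

```go
package app

import "math/rand"

// shuffleServers randomizes the order of the matched servers so each new
// session starts dialing from a different agent, spreading load while
// preserving the existing single-agent behavior per session.
func shuffleServers(addrs []string) []string {
	rand.Shuffle(len(addrs), func(i, j int) {
		addrs[i], addrs[j] = addrs[j], addrs[i]
	})
	return addrs
}
```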

Collaborator

I would shuffle here as well, to avoid relying on some external behavior.

@@ -62,30 +63,34 @@ func (h *Handler) newSession(ctx context.Context, ws types.WebSession) (*session
 	if err != nil {
 		return nil, trace.Wrap(err)
 	}
-	server, err := Match(ctx, accessPoint, MatchPublicAddr(identity.RouteToApp.PublicAddr))
+	servers, err := Match(ctx, accessPoint, MatchAll(MatchHealthy(h.c.ProxyClient, identity), MatchPublicAddr(identity.RouteToApp.PublicAddr)))
Contributor

Not sure why we need to check all app servers' health status here.
It looks like the logic in the transport's Dial function already does the same thing: https://github.com/gravitational/teleport/pull/9288/files#diff-9dd39725077cc062cbe4ded242810d79af548977d645d288875b70d49d759072R174

Contributor Author

The idea is to give the transport a more accurate list of servers; if none are healthy, session creation fails before the request is actually forwarded. It also aligns with the overall solution: when the transport's list of servers becomes empty, the newly created transport contains only healthy servers to forward requests to, which also works as a stop condition for this "renew loop".

Do you think it is better to avoid dialing apps during session creation?
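For context, a rough sketch of what a health-based matcher could look like, under the assumption that a matcher boils down to a dial-and-close check. The names and signatures here are illustrative, not the PR's actual API.

```go
package app

import (
	"context"
	"net"
)

// DialFunc is a stand-in for dialing an app agent over the reverse tunnel.
type DialFunc func(ctx context.Context, addr string) (net.Conn, error)

// matchHealthy returns a matcher that keeps only servers we can actually
// connect to: it dials the agent and immediately closes the connection.
func matchHealthy(dial DialFunc) func(ctx context.Context, addr string) bool {
	return func(ctx context.Context, addr string) bool {
		conn, err := dial(ctx, addr)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	}
}
```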

Contributor

I see now. It is a bit unintuitive to me that the dial status check is done in two places. Could you please add a comment explaining why we need to call MatchHealthy here?

Contributor Author

Done.

Comment on lines 116 to 125
tr.DialContext = t.DialContext
tr.TLSClientConfig = t.clientTLSConfig
return t, nil
Contributor

Here you make some changes to tr, but then return t? Looks like a typo.

Contributor Author

Not really: tr is inside t (which is the *transport we want to return). I will change the order of the statements to make it clearer.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 79d120c to 9160bf0 Compare December 10, 2021 14:04
// to get "fresh" app servers, and then will forward the request to the newly
// created session.
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) {
	if !trace.IsConnectionProblem(err) && err != io.EOF {
Collaborator

Can we check specifically for the error that is returned when the agent is offline? I'm a little worried that we may retry a request that should not be retried (e.g. some HTTP request that is not idempotent in case connection drops mid-flight).

Contributor Author

I agree with you. ConnectionProblem error should be enough here since it is the error that comes from the DialContext function. I'll update it.
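For illustration, this is roughly how the narrowed check could look, adapted from the handleForwardError snippet quoted earlier (not necessarily the exact merged code):

```go
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) {
	// Only a connection problem (the error DialContext returns when the agents
	// are unreachable) triggers a session renewal and retry; anything else is
	// returned to the client unchanged.
	if !trace.IsConnectionProblem(err) {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session, err := h.renewSession(req)
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session.fwd.ServeHTTP(w, req)
}
```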

if err != nil {
// Connection problem with the server.
if trace.IsConnectionProblem(err) {
Collaborator

Similar to above, can we only check for reverse tunnel connection error specifically to avoid inadvertently retrying requests we shouldn't retry?

Contributor Author

Can the ConnectionProblem error happen in other scenarios? Looking at the code it seems to be the correct error to check for reverse tunnel connectivity issues.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 3d821cb to 7693233 Compare December 16, 2021 18:03
// Check that the session exists in the backend cache. This allows the user
// to logout and invalidate their application session immediately. This
// lookup should also be fast because it's in the local cache.
return h.c.AccessPoint.GetAppSession(r.Context(), types.GetAppSessionRequest{
Contributor
From my understanding, the AccessPoint is used to retrieve the WebSession, whereas the "cache" is used to cache the transport forwarder.

The /teleport-logout flow doesn't remove anything from the local "cache" and only calls DeleteAppSession. So this check probably verifies that the user is still logged in and the AppSession is still valid.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch from 7693233 to d57f2a7 Compare January 5, 2022 14:16
session, err := h.getSession(ctx, ws)
// Remove the session from the cache, this will force a new session to be
// generated and cached.
h.cache.remove(ws.GetName())
Contributor

Based on the current code I think handling this situation is "safe": h.cache.remove called from one goroutine will remove the entry previously inserted by another goroutine, but a new session will then be inserted.
The only concern that comes to my mind is the performance impact, because the newSession flow tries to dial all app services to check whether they are healthy.

So maybe we should consider using something like https://pkg.go.dev/golang.org/x/sync/singleflight, which allows blocking simultaneous executions keyed by a value, but this can probably be done in a separate ticket/PR.
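A minimal sketch of the singleflight idea, assuming the web session name would be used as the deduplication key; the helper, key, and session type here are illustrative, not an existing Teleport API:

```go
package app

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// session is a placeholder for the cached app session type.
type session struct{}

var sessionGroup singleflight.Group

// getOrCreateSession deduplicates concurrent newSession calls for the same
// key: only one goroutine runs the expensive dial-all-agents flow, and the
// other callers waiting on the same key share its result.
func getOrCreateSession(ctx context.Context, key string, newSession func(context.Context) (*session, error)) (*session, error) {
	v, err, _ := sessionGroup.Do(key, func() (interface{}, error) {
		return newSession(ctx)
	})
	if err != nil {
		return nil, err
	}
	return v.(*session), nil
}
```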

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 45a8205 to cfafbed Compare January 5, 2022 18:18
@smallinsky (Contributor) left a comment

Left some minor formatting nit comments. I think there is one open issue (https://github.com/gravitational/teleport/pull/9288/files#discussion_r768280901) that needs to be confirmed by @r0mant. Otherwise, LGTM.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 7a5c5c1 to 17b3b98 Compare January 6, 2022 13:12
@r0mant (Collaborator) left a comment

@gabrielcorado lgtm. Have you tested this fix in the Kubernetes rolling update scenario, to make sure app access keeps working when a Teleport app agent pod gets deleted and another one is spun up during a K8s deployment rolling update? Let's please make sure it works in that scenario before merging.

Also, please backport to branch/v8 after this gets merged.

clusterClient, err := c.proxyClient.GetSite(c.identity.RouteToApp.ClusterName)
if err != nil {
return nil, trace.Wrap(err)
t.servers.Range(func(serverID interface{}, appServerInterface interface{}) bool {
Collaborator

Suggested change:
-	t.servers.Range(func(serverID interface{}, appServerInterface interface{}) bool {
+	t.servers.Range(func(serverID, appServerInterface interface{}) bool {

}
// DialContext dials and connects to the application service over the reverse
// tunnel subsystem.
func (t *transport) DialContext(ctx context.Context, _ string, _ string) (net.Conn, error) {
Collaborator

Suggested change:
-func (t *transport) DialContext(ctx context.Context, _ string, _ string) (net.Conn, error) {
+func (t *transport) DialContext(ctx context.Context, _, _ string) (net.Conn, error) {

if dialErr != nil {
// Connection problem with the server.
if trace.IsConnectionProblem(dialErr) {
t.c.log.Warnf("Failed to connect to application server \"%d\": %v.", serverID, dialErr)
Collaborator
Suggested change:
-	t.c.log.Warnf("Failed to connect to application server \"%d\": %v.", serverID, dialErr)
+	t.c.log.Warnf("Failed to connect to application server %q: %v.", serverID, dialErr)

@gabrielcorado (Contributor Author)

@r0mant Yes, I tested this scenario with app agents deployed in Kubernetes using the Helm chart (teleport-kube-agent), and everything worked as expected, with no downtime during rollouts.

Successfully merging this pull request may close these issues:

App access requests should fail over if agent is unavailable