Retry app access requests in different servers when there is a connection failure #9288

Merged 1 commit on Jan 13, 2022

Conversation

gabrielcorado (Contributor)

Closes #9126 by retrying app requests on different agents.

The solution consists of two parts:

  1. Changing the transport (lib/web/app/transport.go) to receive a list of servers; when DialContext fails, it tries a different server from the list. Note: every time a server has connection issues it is removed from the list, and once the list is empty the transport returns an error (a rough sketch of this dial loop follows below);
  2. When creating the transport forwarder, providing an ErrorHandler which (when called) expires the current session, creates a new one (with a fresh list of agents) and forwards the request using it.

In addition to these changes, the new session process now matches only "healthy" servers. A "healthy" server is one the handler could successfully establish a connection to.
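For readers, a minimal sketch of the dial-and-retry idea described above. It is illustrative only: the servers list here is plain addresses and dial is a stand-in for the reverse tunnel dial; the actual PR code differs.

```go
package app

import (
	"context"
	"net"
	"sync"

	"github.com/gravitational/trace"
)

// transport keeps the list of app agents still considered reachable.
type transport struct {
	mu      sync.Mutex
	servers []string
	dial    func(ctx context.Context, addr string) (net.Conn, error)
}

// DialContext tries each remaining server in turn. A server that fails with a
// connection problem is dropped from the list so later requests skip it; once
// the list is exhausted, a connection problem error is returned.
func (t *transport) DialContext(ctx context.Context, _, _ string) (net.Conn, error) {
	t.mu.Lock()
	defer t.mu.Unlock()

	for i := len(t.servers) - 1; i >= 0; i-- {
		conn, err := t.dial(ctx, t.servers[i])
		if err == nil {
			return conn, nil
		}
		if !trace.IsConnectionProblem(err) {
			// Non-connectivity failures are surfaced immediately.
			return nil, trace.Wrap(err)
		}
		// Drop the unreachable server and fall through to the next one.
		t.servers = append(t.servers[:i], t.servers[i+1:]...)
	}
	return nil, trace.ConnectionProblem(nil, "no healthy application servers remaining")
}
```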

@gabrielcorado gabrielcorado self-assigned this Dec 8, 2021
@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 3 times, most recently from 68f72c0 to 100bacf Compare December 8, 2021 19:20
Comment on lines 148 to 164
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) {
	if !trace.IsConnectionProblem(err) && err != io.EOF {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session, err := h.renewSession(req)
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session.fwd.ServeHTTP(w, req)
}
Contributor
What will happen if all agents are unavailable?

At first glance it looks like we will create a spinning loop around the !trace.IsConnectionProblem condition:

  1. A new session is created (newSession).
  2. The forwarder's dial call to the app agent fails with a ConnectionProblem error.
  3. The session is renewed and a new transport is created.
  4. The flow loops back to step 2.

Contributor Author

Not really: the new session flow now only matches "healthy" servers, so if there aren't any, the renew session function returns an error.

However, there can be a case where the server is unstable: at session creation it is healthy, but by the time the forward happens it is not. In that case we will have the loop you mentioned, and the client's request will eventually time out.

t.mu.Lock()
defer t.mu.Unlock()

for i := len(t.c.servers) - 1; i >= 0; i-- {
Contributor
With this approach, in the scenario where all app agents are available, we will forward every connection to only one app agent, without any load balancing.

Contributor Author

True. I'll add a shuffle in the Match function so we get a new ordering of the servers list every time a new session is created, keeping the current behavior.
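As a rough illustration of the shuffle idea (not the PR's exact code; addrs is a placeholder for the matched server list, which in the real Match function is a list of app server resources):

```go
package app

import "math/rand"

// shuffleServers randomizes the order of the matched servers so each new
// session starts dialing from a different agent, spreading load while
// preserving the existing single-agent behavior per session.
func shuffleServers(addrs []string) []string {
	rand.Shuffle(len(addrs), func(i, j int) {
		addrs[i], addrs[j] = addrs[j], addrs[i]
	})
	return addrs
}
```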

Collaborator

I would shuffle here as well, to avoid relying on some external behavior.

@@ -62,30 +63,34 @@ func (h *Handler) newSession(ctx context.Context, ws types.WebSession) (*session
 	if err != nil {
 		return nil, trace.Wrap(err)
 	}
-	server, err := Match(ctx, accessPoint, MatchPublicAddr(identity.RouteToApp.PublicAddr))
+	servers, err := Match(ctx, accessPoint, MatchAll(MatchHealthy(h.c.ProxyClient, identity), MatchPublicAddr(identity.RouteToApp.PublicAddr)))
Contributor

Not sure why we need to check all app servers' health status here.
It looks like the logic in the transport's Dial function already does the same thing: https://github.com/gravitational/teleport/pull/9288/files#diff-9dd39725077cc062cbe4ded242810d79af548977d645d288875b70d49d759072R174

Contributor Author

The idea is to give the transport a more accurate list of servers; if none are healthy, session creation fails before the request is actually forwarded. It also aligns with the overall solution: when the transport's list of servers becomes empty, the newly created transport contains only healthy servers to forward requests to, which also works as a stop condition for this "renew loop".

Do you think it is better to avoid dialing apps during session creation?
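For context, a rough sketch of what a health-based matcher could look like, under the assumption that a matcher boils down to a dial-and-close check. The names and signatures here are illustrative, not the PR's actual API.

```go
package app

import (
	"context"
	"net"
)

// DialFunc is a stand-in for dialing an app agent over the reverse tunnel.
type DialFunc func(ctx context.Context, addr string) (net.Conn, error)

// matchHealthy returns a matcher that keeps only servers we can actually
// connect to: it dials the agent and immediately closes the connection.
func matchHealthy(dial DialFunc) func(ctx context.Context, addr string) bool {
	return func(ctx context.Context, addr string) bool {
		conn, err := dial(ctx, addr)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	}
}
```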

Contributor

I see now. It is a bit unintuitive to me that the dial status check is done in two places. Could you please add a comment explaining why we need to call MatchHealthy here?

Contributor Author

Done.

Comment on lines 116 to 125
tr.DialContext = t.DialContext
tr.TLSClientConfig = t.clientTLSConfig
return t, nil
Contributor

Here you make some changes to tr, but then return t? Looks like a typo.

Contributor Author

Not really: tr is inside t (which is the *transport we want to return). I will change the order of the statements to make it clearer.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 79d120c to 9160bf0 Compare December 10, 2021 14:04
// to get "fresh" app servers, and then will forward the request to the newly
// created session.
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) {
	if !trace.IsConnectionProblem(err) && err != io.EOF {
Collaborator

Can we check specifically for the error that is returned when the agent is offline? I'm a little worried that we may retry a request that should not be retried (e.g. some HTTP request that is not idempotent in case connection drops mid-flight).

Contributor Author

I agree with you. ConnectionProblem error should be enough here since it is the error that comes from the DialContext function. I'll update it.
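For illustration, this is roughly how the narrowed check could look, adapted from the handleForwardError snippet quoted earlier (not necessarily the exact merged code):

```go
func (h *Handler) handleForwardError(w http.ResponseWriter, req *http.Request, err error) {
	// Only a connection problem (the error DialContext returns when the agents
	// are unreachable) triggers a session renewal and retry; anything else is
	// returned to the client unchanged.
	if !trace.IsConnectionProblem(err) {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session, err := h.renewSession(req)
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(http.StatusText(http.StatusInternalServerError)))
		return
	}

	session.fwd.ServeHTTP(w, req)
}
```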

if err != nil {
// Connection problem with the server.
if trace.IsConnectionProblem(err) {
Collaborator

Similar to above, can we only check for reverse tunnel connection error specifically to avoid inadvertently retrying requests we shouldn't retry?

Contributor Author

Can the ConnectionProblem error happen in other scenarios? Looking at the code it seems to be the correct error to check for reverse tunnel connectivity issues.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 3d821cb to 7693233 Compare December 16, 2021 18:03
// Check that the session exists in the backend cache. This allows the user
// to logout and invalidate their application session immediately. This
// lookup should also be fast because it's in the local cache.
return h.c.AccessPoint.GetAppSession(r.Context(), types.GetAppSessionRequest{
Contributor
From my understanding, the AccessPoint is used to retrieve the WebSession, whereas the "cache" is used to cache the transport forwarder.

The /teleport-logout flow doesn't remove anything from the local "cache" and only calls DeleteAppSession. So this check probably verifies that the user is still logged in and the AppSession is still valid.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch from 7693233 to d57f2a7 Compare January 5, 2022 14:16
session, err := h.getSession(ctx, ws)
// Remove the session from the cache, this will force a new session to be
// generated and cached.
h.cache.remove(ws.GetName())
Contributor

Based on the current code I think handling this situation is "safe": h.cache.remove called from one goroutine will remove the entry previously inserted by another goroutine, but a new session will then be inserted.
The only concern that comes to my mind is the performance impact, because the newSession flow tries to dial all app services to check whether they are healthy.

So maybe we should consider using something like https://pkg.go.dev/golang.org/x/sync/singleflight, which allows blocking simultaneous executions keyed by a value, but this can probably be done in a separate ticket/PR.
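A minimal sketch of the singleflight idea, assuming the web session name would be used as the deduplication key; the helper, key, and session type here are illustrative, not an existing Teleport API:

```go
package app

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// session is a placeholder for the cached app session type.
type session struct{}

var sessionGroup singleflight.Group

// getOrCreateSession deduplicates concurrent newSession calls for the same
// key: only one goroutine runs the expensive dial-all-agents flow, and the
// other callers waiting on the same key share its result.
func getOrCreateSession(ctx context.Context, key string, newSession func(context.Context) (*session, error)) (*session, error) {
	v, err, _ := sessionGroup.Do(key, func() (interface{}, error) {
		return newSession(ctx)
	})
	if err != nil {
		return nil, err
	}
	return v.(*session), nil
}
```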

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 45a8205 to cfafbed Compare January 5, 2022 18:18
@smallinsky (Contributor) left a comment

Left some minor formatting nit comments. I think there is one open issue (https://github.com/gravitational/teleport/pull/9288/files#discussion_r768280901) that needs to be confirmed by @r0mant. Otherwise, LGTM.

@gabrielcorado gabrielcorado force-pushed the gabrielcorado/app-requests-failover branch 2 times, most recently from 7a5c5c1 to 17b3b98 Compare January 6, 2022 13:12
@r0mant (Collaborator) left a comment

@gabrielcorado lgtm. Have you tested this fix in the Kubernetes rolling update scenario, to make sure app access keeps working when a Teleport app agent pod gets deleted and another one is spun up during a K8s deployment rolling update? Let's please make sure it works in that scenario before merging.

Also, please backport to branch/v8 after this gets merged.

clusterClient, err := c.proxyClient.GetSite(c.identity.RouteToApp.ClusterName)
if err != nil {
return nil, trace.Wrap(err)
t.servers.Range(func(serverID interface{}, appServerInterface interface{}) bool {
Collaborator

Suggested change:
-	t.servers.Range(func(serverID interface{}, appServerInterface interface{}) bool {
+	t.servers.Range(func(serverID, appServerInterface interface{}) bool {

}
// DialContext dials and connects to the application service over the reverse
// tunnel subsystem.
func (t *transport) DialContext(ctx context.Context, _ string, _ string) (net.Conn, error) {
Collaborator

Suggested change:
-func (t *transport) DialContext(ctx context.Context, _ string, _ string) (net.Conn, error) {
+func (t *transport) DialContext(ctx context.Context, _, _ string) (net.Conn, error) {

if dialErr != nil {
// Connection problem with the server.
if trace.IsConnectionProblem(dialErr) {
t.c.log.Warnf("Failed to connect to application server \"%d\": %v.", serverID, dialErr)
Collaborator
Suggested change:
-	t.c.log.Warnf("Failed to connect to application server \"%d\": %v.", serverID, dialErr)
+	t.c.log.Warnf("Failed to connect to application server %q: %v.", serverID, dialErr)

@gabrielcorado (Contributor Author)

@r0mant Yes, I tested this scenario with app agents deployed in Kubernetes using the Helm chart (teleport-kube-agent), and everything worked as expected, with no downtime during rollouts.

Successfully merging this pull request may close these issues:

App access requests should fail over if agent is unavailable