
Update application servers match to retry on errors #11807

Closed

Conversation

gabrielcorado
Contributor

Related to #10142.

Brief problem description

When starting, proxy servers do not wait until all reverse tunnels are established before setting their ready status to OK, so new servers can receive requests before all available reverse tunnels are connected. In the case of application access, a proxy can receive a request before it has a connection with the application service. This causes the request to fail with: no tunnel connection found: no app reverse tunnel for xxx found.

Solution

To reduce the number of failures during periods of proxy server instability (such as an update), we added a retry mechanism to the code that finds an application server for the request.

How it was tested

A Teleport instance was configured with HA and an application service instance with debug_app enabled. Then the following was executed:

  1. Log in to the example application: tsh app login dumper;
  2. Execute curl in a loop to make multiple requests;
  3. Increase the number of proxy replicas;
  4. Observe the loop of requests returning some errors.
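Steps 2 and 4 can be reproduced with a small shell loop. The URL below is a placeholder for the logged-in app's address, and the loop simply counts failed requests while the proxy replicas are being scaled:

```shell
#!/bin/sh
# Hit the app repeatedly and count failed requests while the proxy
# replicas are scaled. APP_URL is a placeholder, not a real endpoint.
APP_URL="${APP_URL:-https://dumper.teleport.example.com}"
ATTEMPTS="${ATTEMPTS:-5}"

total=0
fails=0
i=0
while [ "$i" -lt "$ATTEMPTS" ]; do
	i=$((i + 1))
	total=$((total + 1))
	# Any non-zero curl exit (DNS failure, refused connection,
	# "no app reverse tunnel" error page timeout) counts as a failure.
	if ! curl -sk -o /dev/null --max-time 2 "$APP_URL"; then
		fails=$((fails + 1))
	fi
done
echo "requests=$total failures=$fails"
```

Without the retry, the failure count spikes while new proxy replicas come online; with it, failures should drop to near zero.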

@gabrielcorado gabrielcorado requested a review from smallinsky April 7, 2022 19:11
@gabrielcorado gabrielcorado self-assigned this Apr 7, 2022
The inline comments below are on this snippet from the diff:

    }

    var servers []types.AppServer
    err = backoff.For(ctx, func() error {
Contributor

I'm not totally sure about this approach. It has several drawbacks:

  • If there is no app service available, the call will block the request for 10s.
  • It works only for app access, while this issue also occurs for database and Windows desktop access.

Contributor Author

@gabrielcorado gabrielcorado Apr 11, 2022

Retrying or delaying the service startup doesn't seem to fully solve the issue. I've done more extensive testing during the development of this PR. The time for a reverse tunnel to be established can vary depending on how the cluster is configured (more or fewer instances, for example); even the default 10s couldn't prevent all errors from happening in some of my tests.

About your points:

  • The retry gives the service some room for cases where the reverse tunnel is not connected yet. However, it also means that if there isn't any app service up, all requests will run until the timeout;
  • I'm not sure we can have something more generic, even if the implementation is placed in the service startup instead of per request. Each service has its own startup logic (which can be shared, but we would have to update all of them);

That being said, I think we should work on some alternatives to this problem. Here is what I think would solve the issue: change the startup logic of the service to wait until at least one reverse tunnel per registered application is up. This can take a long time if there are dynamic apps, but this way we guarantee that the proxy is ready to serve any application request.

Contributor

For example, even the default 10s couldn't prevent all errors from happening in some of my tests.

I wonder how this retry solution differs and solves this case. Generally, the MatchWithRetry function waits up to 10s until a reverse tunnel connection is established and the app agent is successfully dialed. I'm curious what difference there is between postponing the response from the service readiness probe and the current approach. Namely, if proxy discovery by an app agent takes more than 10s, the current approach should also fail.

Contributor Author

For the most common scenario, both solutions would have similar effects, but they differ from the user's perspective. The rationale behind not adding the delay during the proxy start was that it could affect other proxy responsibilities (like authenticating users or serving database access). This way, we isolate the delay to just where it is needed. For example, if some proxy instances start to be killed for some reason, the newer ones would take longer to come online even if users are not using application access.

Contributor

For example, if some proxy instances start to be killed for some reason, the newer ones would take longer to come online even if users are not using application access.

I think this is OK. The Kubernetes pod scheduler is pretty smart: it can distribute pods across multiple instances with an anti-affinity policy and enforce a minimum number of available replicas, so adding a 10-20s delay to the readyz probe should not significantly affect cluster behavior.

The rationale behind not adding the delay during the proxy start was that it could affect other proxy responsibilities (like authenticating users or serving DB access)

I'm not sure if I'm missing something, but how can an unstarted pod that doesn't handle any traffic affect other proxy responsibilities? In my opinion, adding an additional readyz probe delay affects only cluster scaling behavior.

Contributor Author

Not sure if I'm missing something but how an unstarted pod that doesn't handle any traffic can affect other proxy responsibilities? In my opinion adding an additional readyz probe delay affects only cluster scaling behavior.

During that delay, the proxy won't be able to serve any other service (for example, database access).

But still, the delay solution won't cover all scenarios. Do you think we should skip them for now, or think about a more reliable way to determine whether the proxy is ready (like the one I mentioned in my previous comments)?

Collaborator

@r0mant r0mant left a comment

@gabrielcorado @smallinsky Didn't we discuss the following approach instead:

  1. Introduce a delay of max 30s before the proxy starts returning "ready" on its readyz endpoint.
  2. As a next step, switch it to "ready" sooner if all tunnels have been established.

This will help with all protocols, not just app access. Is this PR related to that?

@gabrielcorado gabrielcorado deleted the gabrielcorado/app-servers-match-with-retry branch November 2, 2022 00:24
3 participants