
Update application servers match to retry on errors #11807

Closed

Conversation

gabrielcorado
Contributor

Related to #10142.

Brief problem description

When starting, proxy servers do not wait until all reverse tunnels are established before setting their ready status to OK, so new servers can receive requests before all available reverse tunnels are connected. In the case of application access, a proxy can receive a request before it has a connection with the application service. This causes the request to fail with: no tunnel connection found: no app reverse tunnel for xxx found.

Solution

To reduce the number of failures during periods of proxy server instability (such as an update), we added a retry mechanism to the code that finds an application server for the request.

How it was tested

A Teleport instance was configured with HA and an application service instance with debug_app enabled. Then the following was executed:

  1. Log in to the example application: tsh app login dumper;
  2. Execute curl in a loop to make multiple requests;
  3. Increase the number of proxy replicas;
  4. Observe the loop of requests returning some errors.
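Steps 2 and 4 can be reproduced with a small shell loop. The URL below is a placeholder for the logged-in app's address, and the loop simply counts failed requests while the proxy replicas are being scaled:

```shell
#!/bin/sh
# Hit the app repeatedly and count failed requests while the proxy
# replicas are scaled. APP_URL is a placeholder, not a real endpoint.
APP_URL="${APP_URL:-https://dumper.teleport.example.com}"
ATTEMPTS="${ATTEMPTS:-5}"

total=0
fails=0
i=0
while [ "$i" -lt "$ATTEMPTS" ]; do
	i=$((i + 1))
	total=$((total + 1))
	# Any non-zero curl exit (DNS failure, refused connection,
	# "no app reverse tunnel" error page timeout) counts as a failure.
	if ! curl -sk -o /dev/null --max-time 2 "$APP_URL"; then
		fails=$((fails + 1))
	fi
done
echo "requests=$total failures=$fails"
```

Without the retry, the failure count spikes while new proxy replicas come online; with it, failures should drop to near zero.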

@gabrielcorado gabrielcorado requested a review from smallinsky April 7, 2022 19:11
@gabrielcorado gabrielcorado self-assigned this Apr 7, 2022
The inline comments below are on this snippet from the diff:

    }

    var servers []types.AppServer
    err = backoff.For(ctx, func() error {
Contributor

I'm not totally sure about this approach. It has several drawbacks:

  • If there is no app service available, the call will block the request for 10s.
  • It works only for app access, while this issue also occurs for database and Windows desktop access.

Contributor Author

@gabrielcorado gabrielcorado Apr 11, 2022

Retrying or delaying the service startup doesn't seem to fully solve the issue. I've done more extensive testing during the development of this PR. The time for a reverse tunnel to be established can vary depending on how the cluster is configured (more or fewer instances, for example); even the default 10s couldn't prevent all errors from happening in some of my tests.

About your points:

  • The retry gives the service some room for cases where the reverse tunnel is not connected yet. However, it also means that if there isn't any app service up, all requests will run until the timeout;
  • I'm not sure we can have something more generic, even if the implementation is placed in the service startup instead of per request. Each service has its own startup logic (which can be shared, but we would have to update all of them);

That being said, I think we should work on some alternatives to this problem. Here is what I think would solve the issue: change the startup logic of the service to wait until at least one reverse tunnel per registered application is up. This can take a long time if there are dynamic apps, but this way we guarantee that the proxy is ready to serve any application request.

Contributor

For example, even the default 10s couldn't prevent all errors from happening in some of my tests.

I wonder how this retry solution differs and solves this case. Generally, the MatchWithRetry function waits up to 10s until a reverse tunnel connection is established and the app agent is successfully dialed. I'm curious what difference there is between postponing the response from the service readiness probe and the current approach. Namely, if proxy discovery by an app agent takes more than 10s, the current approach should also fail.

Contributor Author

For the most common scenario, both solutions would have similar effects, but they differ from the user's perspective. The rationale behind not adding the delay during the proxy start was that it could affect other proxy responsibilities (like authenticating users or serving database access). This way, we isolate the delay to just where it is needed. For example, if some proxy instances start to be killed for some reason, the newer ones would take longer to come online even if users are not using application access.

Contributor

For example, if some proxy instances start to be killed for some reason, the newer ones would take longer to come online even if users are not using application access.

I think this is OK. The Kubernetes pod scheduler is pretty smart: it can distribute pods across multiple instances with an anti-affinity policy and enforce a minimum number of available replicas, so adding a 10-20s delay to the readyz probe should not significantly affect cluster behavior.

The rationale behind not adding the delay during the proxy start was that it could affect other proxy responsibilities (like authenticating users or serving DB access)

I'm not sure if I'm missing something, but how can an unstarted pod that doesn't handle any traffic affect other proxy responsibilities? In my opinion, adding an additional readyz probe delay affects only cluster scaling behavior.

Contributor Author

Not sure if I'm missing something but how an unstarted pod that doesn't handle any traffic can affect other proxy responsibilities? In my opinion adding an additional readyz probe delay affects only cluster scaling behavior.

During that delay, the proxy won't be able to serve any other service (for example, database access).

But still, the delay solution won't cover all scenarios. Do you think we should skip them for now, or think about a more reliable way to determine whether the proxy is ready (like the one I mentioned in my previous comments)?

Collaborator

@r0mant r0mant left a comment

@gabrielcorado @smallinsky Didn't we discuss the following approach instead:

  1. Introduce a delay of max 30s before the proxy starts returning "ready" on its readyz endpoint.
  2. As a next step, switch it to "ready" sooner if all tunnels have been established.

This will help with all protocols, not just app access. Is this PR related to that?

@gabrielcorado gabrielcorado deleted the gabrielcorado/app-servers-match-with-retry branch November 2, 2022 00:24
3 participants