Update application servers match to retry on errors #11807
Conversation
```go
}

var servers []types.AppServer
err = backoff.For(ctx, func() error {
```
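For reference, here is a rough, self-contained sketch of the retry-on-match idea the diff above hints at. The `backoff.For` call in the diff is Teleport's own helper; the `appServer` stand-in, the `matchWithRetry` name, the 500ms retry interval, and the callbacks below are illustrative assumptions for this discussion, not the PR's actual code:

```go
package app

import (
	"context"
	"errors"
	"time"
)

// appServer is a stand-in for types.AppServer, used only for this sketch.
type appServer interface {
	GetHostname() string
}

// matchWithRetry keeps listing and matching application servers until at least
// one matches (i.e. a usable reverse tunnel exists) or the timeout expires.
func matchWithRetry(
	ctx context.Context,
	list func(context.Context) ([]appServer, error),
	match func(appServer) bool,
	timeout time.Duration,
) ([]appServer, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for {
		servers, err := list(ctx)
		if err == nil {
			var matched []appServer
			for _, s := range servers {
				if match(s) {
					matched = append(matched, s)
				}
			}
			if len(matched) > 0 {
				return matched, nil
			}
		}
		select {
		case <-ctx.Done():
			// Mirrors the "no app reverse tunnel for xxx found" failure mode,
			// but only after the retry budget is exhausted.
			return nil, errors.New("no matching app server became available before the timeout")
		case <-ticker.C:
			// Try again on the next tick.
		}
	}
}
```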
I'm not totally sure about this approach. It has several drawbacks:
- If there is no app service available, the call will block the request for ~10s.
- It works only for app access, while this issue also occurs for db and Windows desktop access.
Retrying or delaying the service startup does not seem to solve the issue. I've done more extensive testing during the development of this PR. The time for a reverse tunnel to be established can vary depending on how the cluster is configured (more or fewer instances, for example). Even the default 10s couldn't prevent all errors from happening in some of my tests.
About your points:
- The retry gives the service some room when the reverse tunnel is not connected yet. However, that also means that if there isn't any app service up, all requests will exceed the timeout;
- I'm not sure we can have something more generic, even if the implementation is placed in the service startup instead of per request. Still, each service has its own startup logic (which can be shared, but we would have to update all of them);
That being said, I think we should work on some alternatives to this problem. Here is what I think would solve the issue: change the startup logic of the service to wait until at least one reverse tunnel per registered application is up. It can take a long time if there are dynamic apps, but this way we guarantee that the proxy is ready to serve any application request.
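A minimal sketch of that alternative, reusing the `context` and `time` imports from the earlier snippet; the `waitForAppTunnels` name and the `tunnelUp` callback are hypothetical stand-ins, not actual Teleport code:

```go
// waitForAppTunnels blocks service startup until every registered application
// has at least one reverse tunnel up, or the context is cancelled.
// tunnelUp is a hypothetical callback standing in for a real tunnel registry
// lookup; it is not part of the actual Teleport API.
func waitForAppTunnels(ctx context.Context, apps []string, tunnelUp func(app string) bool) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		allUp := true
		for _, app := range apps {
			if !tunnelUp(app) {
				allUp = false
				break
			}
		}
		if allUp {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Re-check on the next tick.
		}
	}
}
```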
For example, even the default 10s couldn't prevent all errors from happening in some of my tests.
I wonder how this retry delay solution differs and solves this case. Generally, the MatchWithRetry function waits up to 10s until a reverse tunnel connection is established and the app agent has been successfully dialled. I'm curious what difference there is between postponing the response from the service readiness probe and the current approach. Namely, if proxy discovery by an app agent takes more than 10s, the current approach should also fail.
For the most common scenario, both solutions would have similar effects but differ from the user's perspective. The rationale behind not adding the delay during the proxy start was that it could affect other proxy responsibilities (like authenticating users or serving DB access). This way, we isolate the delay to just where it is needed. For example, if some proxy instances start to be killed for some reason, the newer ones would take longer to come online even if users are not using application access.
For example, if some proxy instances start to be killed for some reason, the newer ones would take longer to come online even if users are not using application access.
I think that this is OK. The k8s pod scheduler is pretty smart: it can distribute pods across multiple instances with an anti-affinity policy and enforce a minimum number of available replicas, so adding a 10-20s delay to the readyz probe should not significantly affect the k8s cluster behavior.
The rationale behind not adding the delay during the proxy start was that it could affect other proxy responsibilities (like authenticating users or serving DB access)
Not sure if I'm missing something, but how can an unstarted pod that doesn't handle any traffic affect other proxy responsibilities? In my opinion, adding an additional readyz probe delay affects only cluster scaling behavior.
Not sure if I'm missing something, but how can an unstarted pod that doesn't handle any traffic affect other proxy responsibilities? In my opinion, adding an additional readyz probe delay affects only cluster scaling behavior.
The proxy won't be able to serve any other service (for example, database access).
But still, the delay solution won't cover all scenarios. Do you think we should skip them for now, or think about a more reliable way to determine whether the proxy is ready (like the one I mentioned in my previous comments)?
@gabrielcorado @smallinsky Didn't we discuss the following approach instead:
- Introduce a delay of at most 30s before the proxy starts returning "ready" on its readyz endpoint.
- As a next step, switch it to "ready" sooner if all tunnels have been established (a rough sketch of this gating follows below).
This will help with all protocols, not just app access. Is this PR related to that?
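For illustration, here is a hedged sketch of what gating the readyz response could look like, assuming the `net/http` and `time` imports; the handler shape and the `allTunnelsEstablished` check are assumptions for this discussion, not the existing Teleport readiness endpoint:

```go
// readyzHandler is a rough illustration of the two-step idea: report ready as
// soon as all reverse tunnels are established, and unconditionally once a
// maximum grace period (e.g. 30s) has elapsed since proxy start.
// allTunnelsEstablished is a hypothetical check, not an existing Teleport call.
func readyzHandler(start time.Time, maxDelay time.Duration, allTunnelsEstablished func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if allTunnelsEstablished() || time.Since(start) >= maxDelay {
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte("ok"))
			return
		}
		// Not ready yet: keep the pod out of rotation until tunnels catch up
		// or the grace period expires.
		w.WriteHeader(http.StatusServiceUnavailable)
	}
}
```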
Related to #10142.
Brief problem description
When starting, proxy servers do not wait until all the reverse tunnels are established before setting their ready status to OK, so new servers can receive requests before all available reverse tunnels are connected. In the case of application access, a proxy can receive a request before it has a connection with the application service. This causes the request to fail with
no tunnel connection found: no app reverse tunnel for xxx found
Solution
To reduce the number of failures during instability on proxy servers (like an update), we added a retry mechanism to the part that finds an application server for the request, where:
How it was tested
A Teleport instance configured with HA and an application service instance with debug_app enabled. Then the following was executed:
- tsh app login dumper;
- curl in a loop to make multiple requests;