Restart entire node on tunnel collapse #8102
Conversation
Before merge: needs tests, buy-in from owners/experts, delete debug.
The change seems reasonable and I like how you took a somewhat generic approach, so it may address problems we didn't foresee. I have a slight concern that something like this could cause nodes to exist in a perpetual restart cycle, but I'm probably being paranoid - seems better to merge this than not.
Requesting changes mainly due to test coverage.
@fspmarshall @knisbet What do you two think about this approach?
If we have to restart the entire node (or close to it), then this is a good solution in my eyes, though I wonder whether that is strictly necessary. If avoiding that would involve significant effort, though, this seems like a generic, passable "should work well enough" solution.
I think the only part I'm a bit curious about is how different supervisors would react. I'm not sure how the Teleport container is set up, but it might not be able to gracefully restart when the parent pid exits if it's running as pid 1. But as long as that happens, it should be fine. I suspect most supervisors Teleport runs under should be able to handle a consistently restarting process; if the grace periods are long enough, things like systemd and Kubernetes will back off. Although my guess is the Teleport team would be far more familiar with which supervisors are commonly used with Teleport.
Agree with @codingllama's comments, otherwise looks good to me
* Dynamically resolve reverse tunnel address (#9958) The reverse tunnel address is currently a static string that is retrieved from config and passed around for the duration of a service's lifetime. When `tunnel_public_address` is changed on the proxy and the proxy is then restarted, all established reverse tunnels over the old address will fail indefinitely. As a means to get around this, #8102 introduced a mechanism that causes nodes to restart if their connection to the auth server has been down for a period of time. While this did allow the nodes to pick up the new address after they restarted, it was meant to be a stopgap until a more robust solution could be applied. Instead of using a static address, the reverse tunnel address is now resolved via a `reversetunnel.Resolver`. Anywhere that previously relied on the static proxy address will now fetch the actual reverse tunnel address via the webclient by using the Resolver. In addition, this builds on the refactoring done in #4290 to further simplify the reversetunnel package. Since we no longer track multiple proxies, all the leftover bits that did so have been removed to accommodate using a dynamic reverse tunnel address.
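Conceptually, the follow-up swaps the cached string for a resolver callback that is invoked on each (re)connect. Below is a rough Go sketch of that shape; the function names, endpoint handling, and response field are assumptions for illustration, not the real `reversetunnel.Resolver` API.

```go
// Illustrative sketch of a "resolver"-style lookup for the reverse tunnel
// address. Names and the ping response shape are assumptions, not the real
// reversetunnel.Resolver API.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Resolver returns the current reverse tunnel address each time it is called,
// instead of a value cached once at startup.
type Resolver func(ctx context.Context) (string, error)

// webclientResolver builds a Resolver that asks the proxy's web endpoint for
// its advertised tunnel address (the JSON field name below is hypothetical).
func webclientResolver(proxyWebAddr string) Resolver {
	client := &http.Client{Timeout: 5 * time.Second}
	return func(ctx context.Context) (string, error) {
		req, err := http.NewRequestWithContext(ctx,
			http.MethodGet, "https://"+proxyWebAddr+"/webapi/ping", nil)
		if err != nil {
			return "", err
		}
		resp, err := client.Do(req)
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()

		var ping struct {
			// Hypothetical field carrying the proxy's tunnel_public_address.
			TunnelPublicAddr string `json:"tunnel_public_addr"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&ping); err != nil {
			return "", err
		}
		return ping.TunnelPublicAddr, nil
	}
}

func main() {
	resolve := webclientResolver("example.com:3080")
	addr, err := resolve(context.Background())
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	// Dial the reverse tunnel using the freshly resolved address.
	fmt.Println("dialing reverse tunnel at", addr)
}
```

The point of the shape is that every dial asks the proxy for its currently advertised tunnel address, so a changed `tunnel_public_address` is picked up without restarting the node.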
Imagine you have a cluster with a node connected in via a tunnel through a proxy at `example.com` on port `3024`. Now imagine you change the proxy config so that `tunnel_public_address` is `example.com:4024`, and you either restart the proxy or reload the proxy config with a `SIGHUP`.

...and the node doesn't reconnect to the proxy, because even though the `auth_server` address hasn't changed, the node has cached the old `tunnel_public_address` and keeps trying to connect to that. You can always manually restart the node to have it reconnect, but that would be a pain if you have thousands of nodes.
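For reference, `tunnel_public_address` lives under the proxy's `proxy_service` section in `teleport.yaml`; a trimmed, illustrative snippet of the changed setting (unrelated required fields omitted) looks roughly like:

```yaml
# teleport.yaml on the proxy (illustrative; unrelated fields omitted)
proxy_service:
  enabled: yes
  # Previously example.com:3024; nodes that cached the old value keep
  # dialing it after the proxy restarts or reloads on SIGHUP.
  tunnel_public_address: example.com:4024
```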
To avoid having to manually restart all nodes, this change implements a check for connection failures to the auth server and restarts the node if there are multiple connection failures in a given period of time. The check as implemented piggybacks on the node's "common.rotate" service, which can already restart the node in certain circumstances, and uses the success of the periodic rotation sync as a proxy for the health of the node's connection to the auth server.
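A minimal sketch of the failure-window idea, assuming made-up names, thresholds, and a placeholder sync helper rather than the actual Teleport implementation: the periodic rotation sync feeds a monitor that requests a restart once failures have persisted past a grace period.

```go
// Sketch of the "restart after sustained connection failures" idea.
// All names and thresholds here are illustrative, not Teleport's actual code.
package main

import (
	"log"
	"time"
)

// connectionMonitor tracks how long the periodic sync against the auth
// server has been failing and signals a restart once the outage exceeds
// a grace period.
type connectionMonitor struct {
	gracePeriod  time.Duration
	firstFailure time.Time // zero value means "currently healthy"
}

// reportSync records the result of one rotation-sync attempt and returns
// true when the node should restart itself.
func (m *connectionMonitor) reportSync(err error) (restart bool) {
	if err == nil {
		m.firstFailure = time.Time{} // healthy again, reset the window
		return false
	}
	if m.firstFailure.IsZero() {
		m.firstFailure = time.Now()
		return false
	}
	return time.Since(m.firstFailure) > m.gracePeriod
}

func main() {
	m := &connectionMonitor{gracePeriod: 10 * time.Minute}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		err := syncRotationState() // stand-in for the periodic sync against auth
		if m.reportSync(err) {
			log.Println("auth connection down past grace period; restarting node")
			// In the real change this triggers a full process restart so the
			// node re-reads the tunnel address; here we just stop the loop.
			return
		}
	}
}

// syncRotationState is a placeholder for the node's periodic CA rotation
// sync, whose success is used as a health signal for the auth connection.
func syncRotationState() error { return nil }
```

Resetting the window on the first successful sync keeps a node with a merely flaky connection from restarting; only a sustained outage, such as the stale `tunnel_public_address` case described above, trips the restart.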