-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start autoupdate_agent_rollout
controller in auth service
#49101
Start autoupdate_agent_rollout
controller in auth service
#49101
Conversation
f611f35
to
1042ef7
Compare
lib/service/service.go
Outdated
agentRolloutController := rolloutcontroller.New(authServer, logger, process.Clock) | ||
process.RegisterFunc("auth.autoupdate_agent_rollout_controller", func() error { | ||
return trace.Wrap(agentRolloutController.Run(process.GracefulExitContext()), "running autoupdate_agent_rollout controller") | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this service be opt-in? What if I have no desire to ever use automatic upgrades (i.e. most tests and likely many self hosted customers), how can I disable this service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this service be opt-in?
Most users are having trouble keeping their Teleport cluster up-to-date. We want to make every Teleport cluster support self-updating agents by default. This implies having the agent rollout controller running.
If there are concerns about the 3 reads and 1 optional write per minute this controller adds, I can make it opt-out and expose settings to control it. Although I think this goes against our goal of simplifying updates as users will now be able to break the core automatic update functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine I never use automatic upgrades and an accidental panic in the rollout controller gets released. I would find it really annoying as a cluster admin that a feature which I have no desire to use, and no way to disable can bring my auth instances down and prevent anyone from accessing resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be OK with
auth_service:
enabled: yes
autoupdate_controller:
# enabled by default, disabling this will freeze the autoupdate agent rollout
# if you want to disable automatic update, use `autoupdate_config.spec.agents.mode: disabled` instead
enabled: true
I find this setting a bit awkward because it looks like you are opting out of AUs, but you are not, you are just breaking the reconciliation. If you have existing AU resources, they will stay unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make more sense if I could disable all of auto update instead of just this component
auth_service:
enabled: yes
autoupdates:
enabled: yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proxies are streaming autoupdate_agent_rollout
to know what to answer when queried on the ping
and find
routes. Disabling automatic updates means removing the resource (more accurately, proxies will be streaming autoupdate_agent_rollout, as of today they are looking at the config/version because the controller is not running yet, this will go away in the next PR and proxies will only look at autoupdate_agent_rollout
to figure out the agent desired state).
The autoupdate_agent_rollout
resource is asynchronously created from autoupdate_config
and autoupdate_version
. It's the controller's job to look at the config, version, and understand if there should be an agent rollout resource or not. If the config says AUs are disabled, the controller must delete the autoupdate_agent_rollout
.
Disabling the controller without reconciling first would leave your cluster in an unknown state where autoupdate_agent_rollout
might be out of sync with the version adn config resources. So for example we would have an autoupdate_config
saying AUs are disabled, but a leftover autoupdate_agent_rollout
saying that AUs are enabled and agents should update to version X ASAP. The proxies will indefinitely serve the stale rollout because there's no controller to delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like the only meaningful way to disable via config would be to have an option that is consumed by both auth and the proxy then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and options spread over multiple services add more potential inconsistent cases which I'm not a huge fan of. If you think that obtaining the ability to bypass the update-related code at the price of more potential misconfigurations is a good trade, I will add the config fields.
Another option that could help with your stability concern would be to have the controller recover from a panic and return en error. A the function is not critical, this would not stop the teleport process. To detect if this happened we could monitor the reconciliation count via Prometheus metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be worse, a misconfiguration of someone intentionally trying to turn off automatic upgrades, or a silent panic in the roll out controller that prevents using automatic upgrades?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be worse, a misconfiguration of someone intentionally trying to turn off automatic upgrades, or a silent panic in the roll out controller that prevents using automatic upgrades?
I added safeties to avoid panics in 955f09e . Now we handle them gracefully like reconciliation errors. Stuff is not working properly but we will retry. Now, recovering from panics does not add new silent failure modes.
With this change, I think that it would be more harmful to add a parameter in auth/proxy that does not do what people expect (e.g. I set the param in auth_service, think AUs are disabled while they're not because I've not set it on the proxy) than staying always on as we won't bring down the auth by mistake.
31f4d6c
to
955f09e
Compare
lib/autoupdate/rollout/controller.go
Outdated
} | ||
|
||
// NewController creates a new Controller for the autoupdate_agent_rollout kind. | ||
func NewController(client Client, log *slog.Logger, clock clockwork.Clock) *Controller { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: validate the inputs and return an error or use default values
func NewController(client Client, log *slog.Logger, clock clockwork.Clock) *Controller { | |
func NewController(client Client, log *slog.Logger, clock clockwork.Clock) (*Controller, error) { |
Co-authored-by: rosstimothy <[email protected]>
a451250
to
327039a
Compare
@hugoShaka See the table below for backport results.
|
This PR adds the autoupdate_agent_rollout controller and starts it when we run as the auth service.
This also makes the Reconciler type private as there is currently no usage outside of the Controller.