Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start autoupdate_agent_rollout controller in auth service #49101

Merged
merged 3 commits into from
Nov 27, 2024

Conversation

hugoShaka
Copy link
Contributor

@hugoShaka hugoShaka commented Nov 15, 2024

This PR adds the autoupdate_agent_rollout controller and starts it when we run as the auth service.

This also makes the Reconciler type private as there is currently no usage outside of the Controller.

@hugoShaka hugoShaka added the no-changelog Indicates that a PR does not require a changelog entry label Nov 15, 2024
@hugoShaka hugoShaka force-pushed the hugo/add-autoupdate-agent-rollout-controller branch from f611f35 to 1042ef7 Compare November 15, 2024 22:22
Comment on lines 2434 to 2440
agentRolloutController := rolloutcontroller.New(authServer, logger, process.Clock)
process.RegisterFunc("auth.autoupdate_agent_rollout_controller", func() error {
return trace.Wrap(agentRolloutController.Run(process.GracefulExitContext()), "running autoupdate_agent_rollout controller")
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this service be opt-in? What if I have no desire to ever use automatic upgrades (i.e. most tests and likely many self hosted customers), how can I disable this service?

Copy link
Contributor Author

@hugoShaka hugoShaka Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this service be opt-in?

Most users are having trouble keeping their Teleport cluster up-to-date. We want to make every Teleport cluster support self-updating agents by default. This implies having the agent rollout controller running.

If there are concerns about the 3 reads and 1 optional write per minute this controller adds, I can make it opt-out and expose settings to control it. Although I think this goes against our goal of simplifying updates as users will now be able to break the core automatic update functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine I never use automatic upgrades and an accidental panic in the rollout controller gets released. I would find it really annoying as a cluster admin that a feature which I have no desire to use, and no way to disable can bring my auth instances down and prevent anyone from accessing resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be OK with

auth_service:
    enabled: yes
    autoupdate_controller:
      # enabled by default, disabling this will freeze the autoupdate agent rollout
      # if you want to disable automatic update, use `autoupdate_config.spec.agents.mode: disabled` instead
      enabled: true

I find this setting a bit awkward because it looks like you are opting out of AUs, but you are not, you are just breaking the reconciliation. If you have existing AU resources, they will stay unchanged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense if I could disable all of auto update instead of just this component

auth_service:
    enabled: yes
    autoupdates:
        enabled: yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proxies are streaming autoupdate_agent_rollout to know what to answer when queried on the ping and find routes. Disabling automatic updates means removing the resource (more accurately, proxies will be streaming autoupdate_agent_rollout, as of today they are looking at the config/version because the controller is not running yet, this will go away in the next PR and proxies will only look at autoupdate_agent_rollout to figure out the agent desired state).

The autoupdate_agent_rollout resource is asynchronously created from autoupdate_config and autoupdate_version. It's the controller's job to look at the config, version, and understand if there should be an agent rollout resource or not. If the config says AUs are disabled, the controller must delete the autoupdate_agent_rollout.

Disabling the controller without reconciling first would leave your cluster in an unknown state where autoupdate_agent_rollout might be out of sync with the version adn config resources. So for example we would have an autoupdate_config saying AUs are disabled, but a leftover autoupdate_agent_rollout saying that AUs are enabled and agents should update to version X ASAP. The proxies will indefinitely serve the stale rollout because there's no controller to delete it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like the only meaningful way to disable via config would be to have an option that is consumed by both auth and the proxy then?

Copy link
Contributor Author

@hugoShaka hugoShaka Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and options spread over multiple services add more potential inconsistent cases which I'm not a huge fan of. If you think that obtaining the ability to bypass the update-related code at the price of more potential misconfigurations is a good trade, I will add the config fields.

Another option that could help with your stability concern would be to have the controller recover from a panic and return en error. A the function is not critical, this would not stop the teleport process. To detect if this happened we could monitor the reconciliation count via Prometheus metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be worse, a misconfiguration of someone intentionally trying to turn off automatic upgrades, or a silent panic in the roll out controller that prevents using automatic upgrades?

Copy link
Contributor Author

@hugoShaka hugoShaka Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be worse, a misconfiguration of someone intentionally trying to turn off automatic upgrades, or a silent panic in the roll out controller that prevents using automatic upgrades?

I added safeties to avoid panics in 955f09e . Now we handle them gracefully like reconciliation errors. Stuff is not working properly but we will retry. Now, recovering from panics does not add new silent failure modes.

With this change, I think that it would be more harmful to add a parameter in auth/proxy that does not do what people expect (e.g. I set the param in auth_service, think AUs are disabled while they're not because I've not set it on the proxy) than staying always on as we won't bring down the auth by mistake.

lib/service/service_test.go Outdated Show resolved Hide resolved
lib/service/service_test.go Outdated Show resolved Hide resolved
@hugoShaka hugoShaka force-pushed the hugo/add-autoupdate-agent-rollout-controller branch from 31f4d6c to 955f09e Compare November 19, 2024 16:49
}

// NewController creates a new Controller for the autoupdate_agent_rollout kind.
func NewController(client Client, log *slog.Logger, clock clockwork.Clock) *Controller {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: validate the inputs and return an error or use default values

Suggested change
func NewController(client Client, log *slog.Logger, clock clockwork.Clock) *Controller {
func NewController(client Client, log *slog.Logger, clock clockwork.Clock) (*Controller, error) {

lib/autoupdate/rollout/controller.go Outdated Show resolved Hide resolved
@hugoShaka hugoShaka force-pushed the hugo/add-autoupdate-agent-rollout-controller branch from a451250 to 327039a Compare November 27, 2024 20:02
@hugoShaka hugoShaka added this pull request to the merge queue Nov 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 27, 2024
@hugoShaka hugoShaka added this pull request to the merge queue Nov 27, 2024
Merged via the queue into master with commit b207bb7 Nov 27, 2024
40 checks passed
@hugoShaka hugoShaka deleted the hugo/add-autoupdate-agent-rollout-controller branch November 27, 2024 20:59
@public-teleport-github-review-bot

@hugoShaka See the table below for backport results.

Branch Result
branch/v17 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/branch/v17 no-changelog Indicates that a PR does not require a changelog entry size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants