
Refactor configuration loading #3072

Merged (4 commits) on Jun 7, 2023
Conversation

kke (Contributor) commented May 5, 2023

Description

Refactor configuration loading, eliminating the hard-to-follow ClientConfigLoadingRules.

The "runtime config" concept has been kept for now, as it is a smaller change than what was attempted before.

Key changes:

  • The runtime config has been converted into a "first-class citizen" by giving it a dedicated struct (see the sketch after this list). It now stores not only the config but also the "k0svars". This allows commands like token create or k0s status to reuse all the parameters of the controller, such as --data-dir and --status-socket, without the user having to supply them again. The runtime config also stores a pid, which is used to roughly check whether the file belongs to an active process or is a leftover from a crash.
  • The k0svars, a.k.a. constant.CfgVars, has been moved to config.CfgVars, as it is not a constant.
  • The k0sVars now duplicates most of the config.CLIOptions. This is to reduce the number of arguments that need to be passed around, as the essentials are available from the k0svars, which is already passed around almost everywhere. This will help in the effort to eliminate the globals, which are now used almost exclusively to build the CLIOptions.
  • Usages of "k0svars" have been changed to use a pointer, since that struct holds a memoized nodeconfig. This removes the need to reload it everywhere.
  • config.GetCmdOptions() now takes the *cobra.Command as an argument and additionally returns an error. The cobra command is used to build the "k0svars". Returning an error is better than the surprising os.Exit in the old implementation (which silently ended the test suite).
  • The same goes for c.NodeConfig, which is now a function that returns (*v1beta1.ClusterConfig, error).
  • Fixes k0s config validate --config, even when using - for stdin. I think k0s controller and install controller may still be broken for -.
  • Node and dynamic configs are loaded via k0svars.NodeConfig() and k0svars.FetchDynamicConfig(), which may be a bit counterintuitive and doesn't exactly follow the single-responsibility principle.
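To illustrate the first bullet, here is a minimal sketch of a dedicated runtime-config struct with a pid liveness check; all type and field names are illustrative placeholders, not the actual k0s types:

```go
// Sketch only; CfgVars and ClusterConfig are trimmed-down placeholders
// for the real k0s types.
package config

import (
	"os"
	"syscall"
)

type CfgVars struct{ DataDir, StatusSocket string }
type ClusterConfig struct{}

// RuntimeConfig bundles everything a follow-up command needs in order
// to reuse the running controller's parameters.
type RuntimeConfig struct {
	Pid     int            // pid of the controller that wrote the file
	K0sVars *CfgVars       // resolved parameters such as --data-dir
	Spec    *ClusterConfig // the effective node config
}

// IsAlive roughly checks whether the writing process still exists by
// probing the recorded pid with signal 0; an error means the file is
// probably a leftover from a crash.
func (rc *RuntimeConfig) IsAlive() bool {
	proc, err := os.FindProcess(rc.Pid)
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}
```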

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

kke added the area/configuration, area/cli, and go (Pull requests that update Go code) labels on May 5, 2023
twz123 (Member) commented May 5, 2023

Just a general thought: Would it be feasible to break this down into more than one PR? The list of key changes reads quite nicely and it would be really helpful if there'd be dedicated commits for those. Depending on the amount of changes, those commits could then be bundled up into smaller PRs that are easier/safer to review.

twz123 (Member) left a review comment:

Did a first pass on this. Just loads of nits so far. I've also left some "notes to self" to figure out when engaging with the code more thoroughly. Feel free to ignore them 😝

Review threads (resolved): cmd/airgap/listimages.go (outdated), cmd/api/api.go (outdated), cmd/backup/backup.go

if err := retry.Do(
	func() (err error) {
		ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
Member commented:

I think the old timeout was ten seconds. I'd at least restore that. Maybe remove the timeout completely and rely on the client-go timeout settings instead?
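A sketch of the two options suggested here, assuming the retry package from the excerpt above (avast/retry-go) and client-go's rest.Config; the helper names are hypothetical:

```go
package example

import (
	"context"
	"time"

	"github.com/avast/retry-go"
	"k8s.io/client-go/rest"
)

// fetchClusterConfig is a hypothetical stand-in for the API call that
// the retry loop in the excerpt wraps.
func fetchClusterConfig(ctx context.Context) error { return nil }

// Option 1: restore a ten-second per-attempt timeout.
func loadWithRetry(ctx context.Context) error {
	return retry.Do(func() error {
		attemptCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
		return fetchClusterConfig(attemptCtx)
	})
}

// Option 2: drop the per-attempt context entirely and bound every
// request at the client level via client-go's rest.Config.Timeout.
func withClientTimeout(base *rest.Config) *rest.Config {
	cfg := rest.CopyConfig(base)
	cfg.Timeout = 10 * time.Second
	return cfg
}
```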

kke (Contributor, Author) commented May 6, 2023:

The only things calling ClusterConfig() are 1) controller startup 2) airgap list-images.

I think the controller could live without it. Components that require the "dynamic config" could probably just wait for config reconciliation before starting up. It will fail every time there are no other controllers in the cluster.

I don't know if even 10 seconds is enough for list-images if you run sudo k0s start and k0s airgap list-images right after it. Maybe it should instead wait for the status socket before proceeding (in case c.K0sVars.EnableDynamicConfig == true; otherwise it can just use the nodeconfig), something like the sketch below.
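Waiting for the status socket could look roughly like this (a sketch under the assumption that it is a plain unix socket; the helper name is hypothetical, and the path would come from the k0svars):

```go
package example

import (
	"context"
	"net"
	"time"
)

// waitForStatusSocket polls the controller's unix status socket until
// it accepts a connection or the context expires.
func waitForStatusSocket(ctx context.Context, path string) error {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		if conn, err := net.Dial("unix", path); err == nil {
			conn.Close()
			return nil // the controller is up and serving status
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// socket not there yet, try again
		}
	}
}
```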

kke (Contributor, Author) commented:

Does list-images need to work with dynamic config?

kke (Contributor, Author) commented May 11, 2023:

The config business in controller startup is complicated.

node components:

  • all of the nodeComponents that need the clusterConfig receive it in their constructor, i.e. &foo.Foo{Config: nodeConfig} or foo.NewFoo(nodeConfig)
  • none of the nodeComponents have Reconcile; they will never receive a config in any other way after that.

This mostly makes sense.

cluster components:

  • some of the clusterComponents receive a nodeconfig in their constructor, some don't. It's a bit hard to see whether they use the nodeconfig only until a dynamic config is available or have mixed usage (coredns, for example, I think uses both).
  • many of them have a Reconcile function, so they will receive a new config when:
    • the ClusterConfigReconciler first starts
    • the ClusterConfigReconciler detects a dynamic config change

The startup process is somewhat exotic:

  1. All of the components are Add()ed to cluster component manager
  2. Finally component manager's Start() is called
  3. Start iterates over all of the Add()ed components, calls Start() on each, and then waits for Ready() if the component has defined that function.
  4. When the iteration encounters the ClusterConfigReconciler component somewhere in the middle, it calls Start() on it as for any other component, which makes the ClusterConfigReconciler call the component manager's Reconcile(ctx, cfg).
  5. The component manager's Reconcile iterates over all of the Add()ed components (we're still in the middle of the iteration in step 3) and calls Reconcile(ctx, cfg) on them if they have defined that function.
  6. At this point, all of the components that have a Reconcile function receive a new config (local or dynamic). This effectively starts some of the components, because their logic lives in Reconcile.
  7. The iteration in step 3 continues to Start() the rest of the components

So,

  • the components before step 4 are started with either the local config or no config
  • the components after step 4 receive a new config via Reconcile() (if they implement it), before they are started

This needs to be cleaned up somehow.
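A condensed sketch of the flow described above, with paraphrased names rather than the actual k0s types, to make the re-entrancy in steps 4-6 visible:

```go
package example

import "context"

type ClusterConfig struct{} // placeholder for v1beta1.ClusterConfig

type Component interface {
	Start(ctx context.Context) error
}

// Components that react to config changes additionally implement this.
type Reconciler interface {
	Reconcile(ctx context.Context, cfg *ClusterConfig) error
}

type Manager struct {
	components []Component // step 1: everything is Add()ed here
}

// Start mirrors steps 2-7: components are started strictly in order.
// When the loop reaches the config reconciler component (step 4), that
// component's Start calls back into Reconcile below, so reconciliation
// runs while this loop is still in flight.
func (m *Manager) Start(ctx context.Context) error {
	for _, c := range m.components {
		if err := c.Start(ctx); err != nil {
			return err
		}
		// (waiting for Ready(), when defined, omitted for brevity)
	}
	return nil
}

// Reconcile mirrors steps 5-6: the config is fanned out to every
// reconciling component, including ones whose Start has not run yet.
func (m *Manager) Reconcile(ctx context.Context, cfg *ClusterConfig) {
	for _, c := range m.components {
		if r, ok := c.(Reconciler); ok {
			_ = r.Reconcile(ctx, cfg)
		}
	}
}
```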

kke (Contributor, Author) commented May 12, 2023:

Some additional issues:

The component manager:

Init / Add:
  • It has an Init() function that is called once for both of the component managers (node components and the reconciling cluster components) in the controller.
  • It also has the Add() function, which tries to call Reconcile on the components in case the manager has a config available. It never has one unless something calls the manager's Reconcile, which only happens once it has been started.
  • Any component .Add()ed after the first Init() call would never be .Init()ed before being started
Start:
  • The Start() function iterates over all of the components to call Start on them and then returns. This happens once (and Start is called only once).
  • Nothing starts or initializes any components that are added after Start (I don't think we add anything after start)
Other:
  • Nothing restarts components that fail
  • If a component fails, nothing kills the manager or stops k0s
  • the "reconcile" term is used rather sloppily for multiple things, and it's slightly difficult to figure out what it means for each of the components. I think the functions related to dynamic config reconciliation should be named more explicitly.

Regarding the components:

  • There are multiple components that just return nil in their Init() and Start(), with the actual functionality implemented in Reconcile().
  • Some components have functionality in their Start, but they are one-shot: the component "starts" but nothing is actually left running. These probably shouldn't be components at all.
  • There are several components added to the clusterComponents manager that do not have a Reconcile function, so they could just as well be in the nodeComponents manager, which doesn't reconcile.
  • The pkg/component/controller package defines 48 structs and 5 interfaces. The "node components" and "cluster components" are all mixed up in there in a single directory without any separation.
  • Many of them have a // Run ... doc comment on top of their Start() function, presumably a remnant of some function rename.

The prober:

  • The only components that implement the prober healthcheck interface are Konnectivity and mockComponent.
  • The only components that implement the prober event emitter interface are Konnectivity, OCIBundleReconciler (which is a one-shot; Start just unpacks stuff and returns) and mockComponentWithEvents.

The cluster config reconciler:

  • Nothing restarts the watcher if the server or a watch timeout closes it
  • It does not differentiate between add, create, update, and delete. I think it should at least somehow react to delete.
  • It does not check for nils. If it receives a *v1beta1.ClusterConfig typed nil, it will call Validate() on that, which returns nil, and then it proceeds to reconcile using the nil (see the sketch below).
  • Nothing tells you if the component has actually stopped, not even the experimental k0s status components subcommand. It doesn't implement the event emitter or health check interfaces; the only way you notice is that config reconciliation stops happening.
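The typed-nil pitfall from the third bullet in isolation, together with the kind of guard one might add (a sketch, not the actual k0s code):

```go
package main

import (
	"errors"
	"fmt"
)

type ClusterConfig struct{}

// Validate has a pointer receiver, so calling it on a typed nil is
// legal Go; here it reports "no errors", as described above.
func (c *ClusterConfig) Validate() []error { return nil }

// reconcile sketches the guard the reconciler currently lacks.
func reconcile(cfg *ClusterConfig) error {
	if cfg == nil {
		return errors.New("refusing to reconcile a nil cluster config")
	}
	if errs := cfg.Validate(); len(errs) > 0 {
		return fmt.Errorf("invalid cluster config: %v", errs)
	}
	// ... fan the config out to the components ...
	return nil
}

func main() {
	var cfg *ClusterConfig      // typed nil
	fmt.Println(cfg.Validate()) // [] -- validation "passes" on nil
	fmt.Println(reconcile(cfg)) // refusing to reconcile a nil cluster config
}
```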

kke (Contributor, Author) commented:

The components are started and reconciled sequentially; I'm not sure if that's necessary.

kke (Contributor, Author) commented:

> The only things calling ClusterConfig() are 1) controller startup 2) airgap list-images.

Now only number 2.

Does "airgap list-images" need to work with dynamic config, @jnummelin ?

Member commented:

AFAIK no. I believe that's anyway more of a helper command to get the list of images to be used, so that one can create a custom image tarball if needed.

kke (Contributor, Author) commented:

airgap list-images will now error out if dynamic config is enabled. Nothing uses FetchDynamicConfig() anymore, so it has been removed, and configs are never merged. A rough sketch of the new guard is below.
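What that guard might look like in the list-images command; EnableDynamicConfig and NodeConfig() are named in this thread, everything else is an assumption:

```go
package example

import (
	"errors"
	"fmt"
)

// Minimal placeholders for the surrounding types.
type ClusterConfig struct{}

type CfgVars struct{ EnableDynamicConfig bool }

func (v *CfgVars) NodeConfig() (*ClusterConfig, error) {
	return &ClusterConfig{}, nil
}

type command struct{ K0sVars *CfgVars }

// listImages sketches the new guard: refuse to run under dynamic
// config instead of trying (and timing out) to fetch it.
func (c *command) listImages() error {
	if c.K0sVars.EnableDynamicConfig {
		return errors.New("list-images does not support dynamic config; it reads the node config only")
	}
	nodeConfig, err := c.K0sVars.NodeConfig()
	if err != nil {
		return fmt.Errorf("failed to load node config: %w", err)
	}
	_ = nodeConfig // ... collect and print the image list ...
	return nil
}
```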

Review threads (resolved, outdated): pkg/config/cfgvars.go
kke (Contributor, Author) commented May 8, 2023:

> Just a general thought: Would it be feasible to break this down into more than one PR? The list of key changes reads quite nicely and it would be really helpful if there'd be dedicated commits for those. Depending on the amount of changes, those commits could then be bundled up into smaller PRs that are easier/safer to review.

Without some intelligent tool to split it up, it seems like a big effort to identify and extract closely related changes into separate commits when the files include multiple change sets.

kke force-pushed the refactor-config-loading-2 branch from 1cda571 to 8a718c4 on May 8, 2023 12:44
kke requested a review from twz123 on May 8, 2023 12:45
kke force-pushed the refactor-config-loading-2 branch 2 times, most recently from 785685b to 54c3453 on May 10, 2023 06:55
kke changed the title from "[RFC] Refactor configuration loading" to "Refactor configuration loading" on May 10, 2023
kke marked this pull request as ready for review on May 10, 2023 08:23
kke requested a review from a team as a code owner on May 10, 2023 08:23
kke requested a review from ncopa on May 10, 2023 08:23
kke force-pushed the refactor-config-loading-2 branch from 59d3e9c to ccf13d4 on May 10, 2023 10:16
Review threads (resolved): cmd/controller/controller.go (outdated), cmd/ctr/ctr.go
kke force-pushed the refactor-config-loading-2 branch from ccf13d4 to 5a4f2ab on May 15, 2023 20:08
kke requested reviews from jnummelin and mikhail-sakhnov and removed the request for ncopa on May 16, 2023 08:08
kke enabled auto-merge on May 17, 2023 11:24
Refactor configuration loading, eliminating the hard to follow
`ClientConfigLoadingRules`.

The Runtime config has been converted into a "first class citizen" by making a
dedicated struct for it.

It now stores not only the config, but also the "k0svars". This allows
commands like `token create` or `k0s status` to reuse all the parameters of
the controller, such as `--data-dir` and `--status-socket` without the user
having to supply them again. The runtime config also stores a pid, which is
used to roughly check if the file belongs to an active process or if it's a
leftover from a previous crash. The pid mechanism also prevents two k0s
controllers from running in the same directory.

The `k0svars`, a.k.a. `constant.CfgVars`, has been moved to
`config.CfgVars`, as it is not a constant to begin with.

The `k0sVars` now duplicates most of the `config.CLIOptions`; this is to
reduce the number of arguments that need to be passed around, as the
essentials are available from the `k0svars`, which was already passed
around almost everywhere.

The `config.GetCmdOptions()` now takes the `*cobra.Command` as an argument
and can return an error. The cobra command is used to build the "k0svars"
by looking at flag values.

Signed-off-by: Kimmo Lehto <[email protected]>
Co-authored-by: Tom Wieczorek <[email protected]>
Signed-off-by: Kimmo Lehto <[email protected]>
 	return fmt.Errorf("failed to load cluster config: %w", err)
 }
-configSource, err = clusterconfig.NewStaticSource(clusterConfig)
+configSource, err = clusterconfig.NewStaticSource(nodeConfig)
Member commented:

hmm, I think we should use the full cluster config here

kke (Contributor, Author) commented:

NodeConfig is always a full cluster config when dynamic config is not enabled. It is only stripped down to BootstrapConfig when merged with a dynamic one.

Reconcile should only deal with ClusterWideConfig. Any component that uses anything from the NodeConfig should receive it as a static config during initialization; if it needs anything from the cluster-wide config, it should receive that from reconcile.

kke (Contributor, Author) commented:

Now GetBootstrappingConfig() is never called, so it too can be removed.

kke (Contributor, Author) commented:

GetClusterWideConfig() is only used when creating the initial clusterconfig API object.

Member commented:

👍

