Refactor configuration loading #3072
Just a general thought: Would it be feasible to break this down into more than one PR? The list of key changes reads quite nicely and it would be really helpful if there were dedicated commits for those. Depending on the amount of changes, those commits could then be bundled up into smaller PRs that are easier/safer to review.
Did a first pass on this. Just loads of nits so far. I've also left some "notes to self" for me to figure out when engaging with the code more thoroughly. Feel free to ignore them 😝
pkg/config/cfgvars.go (Outdated)

	if err := retry.Do(
		func() (err error) {
			ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
I think the old timeout was ten seconds. I'd at least restore that. Maybe remove the timeout completely and rely on the client-go timeout settings instead?
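To make the timeout question concrete, here is a minimal sketch of a per-attempt timeout inside a retry loop. It uses a plain loop rather than the `retry` package from the snippet above, and `retryWithAttemptTimeout` and `fetchDemo` are hypothetical names, not k0s code. The parent context still bounds the whole operation, while each attempt gets its own deadline (ten seconds here, matching the old value mentioned above).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithAttemptTimeout calls fn up to attempts times. Every call gets its
// own context that expires after perAttempt; the parent ctx still bounds the
// whole operation. No backoff between attempts, to keep the sketch short.
func retryWithAttemptTimeout(ctx context.Context, attempts int, perAttempt time.Duration, fn func(context.Context) error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		// Stop early if the caller has already cancelled.
		if err := ctx.Err(); err != nil {
			return err
		}
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		lastErr = fn(attemptCtx)
		cancel()
		if lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// fetchDemo simulates a config fetch that only succeeds on the third attempt.
func fetchDemo() (int, error) {
	calls := 0
	err := retryWithAttemptTimeout(context.Background(), 5, 10*time.Second, func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("cluster config not available yet")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := fetchDemo()
	fmt.Println(calls, err) // prints: 3 <nil>
}
```

Dropping the per-attempt `context.WithTimeout` and passing `ctx` straight through would be the "rely on the client timeout settings" variant.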
The only things calling `ClusterConfig()` are 1) controller startup 2) airgap list-images.
I think the controller could live without it. Components that require the "dynamic config" could probably just wait for config reconciliation before starting up. It will fail every time when there are no other controllers in the cluster.
I don't know if even 10 seconds is enough for list-images if you run `sudo k0s start` and `k0s airgap list-images` right after it. Maybe it should instead wait for the status socket before proceeding (in case `c.K0sVars.EnableDynamicConfig == true`, otherwise it can just use the nodeconfig).
Does list-images need to work with dynamic config?
The config business in controller startup is complicated.
node components:
- all of the nodeComponents that need clusterConfig receive it in the constructor, i.e. `&foo.Foo{Config: nodeConfig}` or `foo.NewFoo(nodeConfig)`
- none of the nodeComponents have `Reconcile`, so they will never receive a config in any other way after that.
This mostly makes sense.
cluster components:
- some of the clusterComponents receive a nodeconfig in the constructor, some don't. It's a bit hard to see whether they use the nodeconfig only until a dynamic config is available or have mixed usage (coredns, for example, I think uses both).
- many of them have a `Reconcile` function, so they will receive a new config when:
  - `ClusterConfigReconciler` first starts
  - `ClusterConfigReconciler` detects a dynamic config change
The startup process is somewhat exotic:
1. All of the components are `Add()`ed to the cluster component manager
2. Finally the component manager's `Start()` is called
3. `Start` iterates over all of the `Add()`ed components and calls `Start()` for each and then waits for `Ready()` if the component has defined that function.
4. When the iteration encounters the `ClusterConfigReconciler` component somewhere in the middle, it calls `Start()` on it like for any other component, which makes `ClusterConfigReconciler` call the component manager's `Reconcile(ctx, cfg)`
5. The component manager's `Reconcile` iterates over all of the `Add`ed components (we're still in the middle of the iterator in step 3) and calls `Reconcile(ctx, cfg)` on them if they have defined this function.
6. At this point, all of the components that have a `Reconcile` function will receive a new config (local or dynamic). This starts some of the components because their logic is in `Reconcile`.
7. The iteration in step 3 continues to `Start()` the rest of the components
So,
- the components before step 4 are started with either the local config or no config
- the components after step 4 receive a new config via `Reconcile()` (if they implement it), before they are started
This needs to be cleaned up somehow.
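The ordering described above can be reproduced with a toy model. This is only a sketch: `component`, `manager` and the component names are hypothetical stand-ins, not the k0s types, and `Ready()` is left out. It shows how a `Reconcile` triggered from inside the `Start` loop reaches components that have not been started yet.

```go
package main

import "fmt"

// component is a hypothetical stand-in for a k0s component.
type component struct {
	name         string
	reconcilable bool   // true if the component "implements" Reconcile
	onStart      func() // optional extra work performed during Start
}

// manager is a toy component manager that records the order of calls.
type manager struct {
	components []*component
	log        []string
}

func (m *manager) Add(c *component) { m.components = append(m.components, c) }

// Start starts the components one by one, in the order they were added.
func (m *manager) Start() {
	for _, c := range m.components {
		m.log = append(m.log, "start "+c.name)
		if c.onStart != nil {
			c.onStart()
		}
	}
}

// Reconcile visits every added component, started or not.
func (m *manager) Reconcile() {
	for _, c := range m.components {
		if c.reconcilable {
			m.log = append(m.log, "reconcile "+c.name)
		}
	}
}

// simulate places the config reconciler in the middle of the list; starting
// it calls back into the manager's Reconcile while Start is still iterating.
func simulate() []string {
	m := &manager{}
	m.Add(&component{name: "early", reconcilable: true})
	m.Add(&component{name: "config-reconciler", onStart: m.Reconcile})
	m.Add(&component{name: "late", reconcilable: true})
	m.Start()
	return m.log
}

func main() {
	for _, line := range simulate() {
		fmt.Println(line)
	}
	// Output:
	// start early
	// start config-reconciler
	// reconcile early
	// reconcile late
	// start late
}
```

Note how "late" is reconciled before it is started, while "early" was started without ever having seen a reconciled config.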
Some additional issues:
The component manager:
Init / Add:
- It has an `Init()` function that is called once for both of the component managers (node components and the reconciling cluster components) in the controller.
- It also has the `Add()` function. The function tries to call `Reconcile` on the components in case the manager has a config available. It never has unless there's something calling the manager's `Reconcile`, which only happens once it has been started.
- Any component `.Add()`ed after the first `Init()` call will/would never be `.Init()`ed before being started
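A toy illustration of the last point, assuming a manager whose `Init()` only walks the components known at that moment. `comp` and `mgr` are hypothetical names; here `Start` just reports which components it would start without a prior init.

```go
package main

import "fmt"

// comp is a hypothetical component that remembers whether Init reached it.
type comp struct {
	name        string
	initialized bool
}

// mgr is a toy manager: Init only touches components added so far.
type mgr struct{ comps []*comp }

func (m *mgr) Add(c *comp) { m.comps = append(m.comps, c) }

func (m *mgr) Init() {
	for _, c := range m.comps {
		c.initialized = true
	}
}

// Start returns the names of components that would be started uninitialized.
func (m *mgr) Start() []string {
	var uninitialized []string
	for _, c := range m.comps {
		if !c.initialized {
			uninitialized = append(uninitialized, c.name)
		}
	}
	return uninitialized
}

// lateAdd adds one component before Init and one after it.
func lateAdd() []string {
	m := &mgr{}
	m.Add(&comp{name: "first"})
	m.Init()
	m.Add(&comp{name: "late"})
	return m.Start()
}

func main() {
	fmt.Println(lateAdd()) // prints: [late]
}
```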
Start:
- The `Start()` function iterates over all of the components to call `Start` on them and then returns. This happens once (if `Start` is called once, like it is.)
- Nothing starts or initializes any components that are added after `Start` (I don't think we add anything after start)
Other:
- Nothing restarts components that fail
- If a component fails, nothing kills the manager or stops k0s
- the "reconcile" term is used rather sloppily for multiple things and it's slightly difficult to figure out what it means for each of the components. I think the dynamic config reconciliation related functions should be named more explicitly.
Regarding the components:
- There are multiple components that have just `return nil` in their `Init()` and `Start()` and the actual full functionality is implemented in `Reconcile()`.
- Some components have functionality in their `Start` but they are one-shot: the component "starts" but nothing is actually left running. These probably shouldn't be components at all.
- There are several components that are added to the `clusterComponents` manager that do not have a `Reconcile` function, so they could just as well be in the `nodeComponents` manager that doesn't reconcile.
- The `pkg/component/controller` package defines 48 structs and 5 interfaces. The "node components" and "cluster components" are all mixed up in there in a single directory without any separation.
- Many of them have a `// Run ....` doc comment on top of the `Start()` function, a remnant from some function rename I suppose
The prober:
- The only components that implement the prober healthcheck interface are: Konnectivity and `mockComponent`.
- The only components that implement the prober event emitter interface are: Konnectivity, OCIBundleReconciler (which is a one-shot, `Start` just unpacks stuff and returns) and `mockComponentWithEvents`.
The cluster config reconciler:
- Nothing restarts the watcher if the server or a watch timeout closes it
- It does not differentiate between add, create, update, delete. I think it should at least somehow react to delete.
- It does not check for nils. If it receives a `*v1beta1.ClusterConfig` typed nil, it will call `Validate()` on that, which will return `nil` and then it proceeds to reconcile using the nil.
- Nothing tells you if the component has actually stopped, not even the experimental `k0s status components` subcommand. It doesn't implement the event emitter or health check interfaces. The only way you notice is that config reconciliation does not happen.
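The typed-nil point is easy to miss, so here is a minimal sketch. `ClusterConfig` below is a hypothetical stand-in for `*v1beta1.ClusterConfig`, and the nil-tolerant `Validate()` returning `[]error` is an assumption made to mirror the behaviour described above. `accept` is a hypothetical helper showing the explicit nil check that validation alone does not provide.

```go
package main

import "fmt"

// ClusterConfig is a hypothetical, simplified stand-in for the real type.
type ClusterConfig struct {
	Name string
}

// Validate tolerates a nil receiver and reports no errors for it, which is
// the behaviour that lets a typed nil slip through validation.
func (c *ClusterConfig) Validate() []error {
	if c == nil {
		return nil
	}
	var errs []error
	if c.Name == "" {
		errs = append(errs, fmt.Errorf("name is empty"))
	}
	return errs
}

// accept decides whether an object received as an interface value (as it
// would arrive from a watch event) is safe to reconcile with.
func accept(obj interface{}) (bool, string) {
	cfg, ok := obj.(*ClusterConfig)
	if !ok {
		return false, "unexpected type"
	}
	// A typed nil passes the type assertion, so it needs its own check.
	if cfg == nil {
		return false, "typed nil"
	}
	if errs := cfg.Validate(); len(errs) > 0 {
		return false, "invalid"
	}
	return true, ""
}

func main() {
	var cfg *ClusterConfig // typed nil

	// Validation alone does not catch the nil.
	fmt.Println(len(cfg.Validate())) // prints: 0

	ok, reason := accept(cfg)
	fmt.Println(ok, reason) // prints: false typed nil
}
```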
The components are started and reconciled sequentially; I'm not sure if that's necessary.
> The only things calling `ClusterConfig()` are 1) controller startup 2) airgap list-images.

Now only number 2.
Does "airgap list-images" need to work with dynamic config, @jnummelin?
AFAIK no. I believe that's anyways more like a helper command to get the list of images to be used so that one can create a custom image tarball if needed.
Now `airgap list-images` will error out if dynamic config is enabled. Now nothing uses `FetchDynamicConfig()` so it has been removed. Now configs are never merged.
Without some intelligent tool to splice it up, it would seem like a big effort to identify and extract closely related changes into separate commits when the files include multiple change sets.
1cda571 to 8a718c4
785685b to 54c3453
59d3e9c to ccf13d4
ccf13d4 to 5a4f2ab
Refactor configuration loading, eliminating the hard to follow `ClientConfigLoadingRules`. The Runtime config has been converted into a "first class citizen" by making a dedicated struct for it. It now stores not only the config, but also the "k0svars". This allows commands like `token create` or `k0s status` to reuse all the parameters of the controller, such as `--data-dir` and `--status-socket` without the user having to supply them again. The runtime config also stores a pid, which is used to roughly check if the file belongs to an active process or if it's a leftover from a previous crash. The pid mechanism also prevents two k0s controllers from running in the same directory. The `k0svars`, aka `constant.CfgVars` has been moved to `config.CfgVars` as it is not a constant to begin with. The `k0sVars` now duplicates most of the `config.CLIOptions`, this is to reduce the number of arguments that need to be passed around, as the essentials are available from `k0svars`, which was passed around to almost everywhere already. The `config.GetCmdOptions()` now takes the `*cobra.Command` as an argument and can return an error. The cobra command is used to build the "k0svars" by looking at flag values. Signed-off-by: Kimmo Lehto <[email protected]> Co-authored-by: Tom Wieczorek <[email protected]> Signed-off-by: Kimmo Lehto <[email protected]>
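As a rough illustration of the pid check mentioned in the commit message, the sketch below probes a pid with signal 0 on Unix. This is an assumption about how such a check could look, not the code from this PR, and `pidAlive` and `selfAlive` are hypothetical names. As the commit message says, the check is only rough: a pid may have been reused by an unrelated process.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// pidAlive reports whether a process with the given pid appears to exist.
// Unix-only: signal 0 performs the existence and permission checks without
// actually delivering a signal.
func pidAlive(pid int) bool {
	if pid <= 0 {
		return false
	}
	err := syscall.Kill(pid, syscall.Signal(0))
	// EPERM means the process exists but belongs to another user.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// selfAlive checks the current process, which is alive by definition.
func selfAlive() bool {
	return pidAlive(os.Getpid())
}

func main() {
	fmt.Println(selfAlive()) // prints: true
}
```

A runtime config file whose stored pid fails this check can be treated as a leftover from a crash; one whose pid passes it suggests another controller is still using the directory.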
5a4f2ab to 7e42dba
	  return fmt.Errorf("failed to load cluster config: %w", err)
	  }
	- configSource, err = clusterconfig.NewStaticSource(clusterConfig)
	+ configSource, err = clusterconfig.NewStaticSource(nodeConfig)
hmm, I think we should use the full cluster config here
NodeConfig is always a full cluster config when dynamic config is not enabled. It is only stripped down to BootstrapConfig when merged to a dynamic one.
Reconcile should only deal with ClusterWideConfig. Any component that uses anything from the NodeConfig should receive it as a static config at initialization; if it needs anything from the cluster-wide config, it should receive that via reconcile.
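A minimal sketch of that split, with hypothetical, heavily simplified types (`NodeConfig`, `ClusterWideConfig` and `Foo` are illustrative and not the k0s definitions): node-level settings are fixed in the constructor, and cluster-wide settings only ever arrive through `Reconcile`.

```go
package main

import "fmt"

// NodeConfig and ClusterWideConfig are hypothetical, simplified stand-ins.
type NodeConfig struct{ DataDir string }
type ClusterWideConfig struct{ PodCIDR string }

// Foo keeps the two kinds of config strictly apart.
type Foo struct {
	node    NodeConfig         // static, fixed at construction
	cluster *ClusterWideConfig // nil until the first Reconcile
}

// NewFoo receives everything node-specific up front.
func NewFoo(node NodeConfig) *Foo { return &Foo{node: node} }

// Reconcile is the only way cluster-wide settings reach the component.
func (f *Foo) Reconcile(cfg *ClusterWideConfig) error {
	if cfg == nil {
		return fmt.Errorf("nil cluster-wide config")
	}
	f.cluster = cfg
	return nil
}

// Describe shows which settings the component currently knows about.
func (f *Foo) Describe() string {
	if f.cluster == nil {
		return f.node.DataDir + " (waiting for cluster-wide config)"
	}
	return f.node.DataDir + " " + f.cluster.PodCIDR
}

func main() {
	foo := NewFoo(NodeConfig{DataDir: "/var/lib/k0s"})
	fmt.Println(foo.Describe())
	if err := foo.Reconcile(&ClusterWideConfig{PodCIDR: "10.244.0.0/16"}); err != nil {
		fmt.Println("reconcile failed:", err)
	}
	fmt.Println(foo.Describe())
}
```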
Now `GetBootstrappingConfig()` is never called so it too can be removed.
GetClusterWideConfig() is only used upon creating the initial clusterconfig api object.
👍
Signed-off-by: Kimmo Lehto <[email protected]>
Description
Refactor configuration loading, eliminating the hard to follow `ClientConfigLoadingRules`.
The "runtime config" concept has been kept for now, as it is a smaller change than what was attempted before.
Key changes:
- […] `token create` or `k0s status` to reuse all the parameters of the controller, such as `--data-dir` and `--status-socket` without the user having to supply them again. The runtime config also stores a pid, which is used to roughly check if the file belongs to an active process or if it's a leftover from a crash.
- `k0svars`, aka `constant.CfgVars`, has been moved to `config.CfgVars` as it is not a constant.
- `k0sVars` now duplicates most of the `config.CLIOptions`. This is to reduce the number of arguments that need to be passed around, as the essentials are available from `k0svars`, which is passed around to almost everywhere already. This will help in the effort to eliminate the globals, which are now used almost exclusively to build the CLIOptions.
- `config.GetCmdOptions()` now takes the `*cobra.Command` as an argument and additionally returns an error. The cobra command is used to build the "k0svars". Returning an error is better than the surprise `os.Exit` that the old implementation did (which silently ended the test suite).
- […] `c.NodeConfig`, which is now a function that returns `*v1beta1.ClusterConfig, error`.
- […] `k0s config validate --config`, even when using `-` for stdin. I think `k0s controller` and `install controller` may still be broken for `-`.
- […] `k0svars.NodeConfig()` and `k0svars.FetchDynamicConfig()`, which may be a bit counterintuitive and doesn't exactly follow the Single Responsibility Principle.
Type of change
How Has This Been Tested?
Checklist: