-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Post Create Initialization #868
Conversation
docs/proposals/update-machine.md
Outdated
|
||
Each AWS EC2 instance performs source/destination checks by default. | ||
For [EC2 NAT](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html#EIP_Disable_SrcDestCheck) instances | ||
these should be disabled. This is done by issueing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be disabled. This is done by issueing | |
these should be disabled. This is done by issuing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar corrections elsewhere as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo fixed
docs/proposals/update-machine.md
Outdated
|
||
We have abstract operations for creation/deletion and listing of machines (actually compute instances) but we do not correctly handle post-creation startup logic. Nor do we provide an abstract operation to represent the hot update of an instance after creation. | ||
|
||
We have found this to be necessary for several use cases. Today in the MCM AWS Provider, we already misuse `driver.GetMachineStatus` which is supposed to be a read-only operation obtaining the status of an instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/proposals/update-machine.md
Outdated
|
||
We will split the fulfilment of this overall need into 2 stages of implementation. | ||
|
||
1. **Stage-A**: Support post-VM creation logic in implementations of `Driver.CreateMachine` by permitting provider implementors to add one-time logic after VM creation, return with special new error code `codes.Startup` for post-VM errors and correspondingly support a new machine operation stage `InitiateStartup`. The [triggerCreationFlow](https://github.com/gardener/machine-controller-manager/blob/rel-v0.50/pkg/util/provider/machinecontroller/machine.go#L310) - a reconciliation sub-flow of the MCM responsible for orchestrating instance creation and updating machine status will be changed to support this behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- what does
one-time logic
precisely mean here? codes.Startup
- This does not really tell whether a post launch update failed. Can you suggest alternative names?- Similarly
InitiateStartup
- maybe we need something more apt to depict that there is a post-launch-vm or post-create-vm operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about codes.StartupFailure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- One-time logic means logic is executed only once after creation.
- will change
codes.Startup
tocodes.Initialization
- will change
InitiateStartup
toInstanceInitialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed review comments
docs/proposals/update-machine.md
Outdated
|
||
|
||
|
||
2. **Stage-B**: Introduction of `Driver.UpdateMachine` and enhancing the MCM, MCM providers and gardener extension providers to support hot update of instances through `Driver.UpdateMachine`. The MCM [triggerUpdationFlow](https://github.com/gardener/machine-controller-manager/blob/v0.50.1/pkg/util/provider/machinecontroller/machine.go#L531) - a reconciliation sub-flow of the MCM which is supposed to be responsible for orchestrating instance update - but currently not used, will be updated to invoke the provider `Driver.UpdateMachine` on hot-updates to to the `Machine` object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should now be Driver.HotUpdateMachine
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think UpdateMachine
sounds better and inline with already existing driver names. We could also update docstring to tell that its for hot-updatable... since we have moved all the startup related update operation to CreateMachine
anyone reading UpdateMachine
wouldn't be that confused. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently a point of contention since we have now changed the boundaries for UpdateMachine
. Since this is part of Stage-B i would keep it Out-Of-Scope
completely in this document. When we take up hot updates of tags then we can rethink on it.
From POV it is a bit un-clean (maybe i am not convinced now and some thinking from my end is required). Reasons for hesitation:
- Create machine creates an instance and also does
update
on the created instance. This is part ofcreate machine
. - Update machine also now does an update but you wish to differentiate it by having a doc string? Why have one name of a method and do something different which is then described as part of docstring?
- When will
Update machine
be used other than hot-updating the VM? If it results in rolling update of the VM then it would most probably be part ofCreateMachine
itself now, right?
docs/proposals/update-machine.md
Outdated
|
||
### Instance Not Ready Taint | ||
|
||
- Due to the fact that creation flow for machines will now be enhanced to correctly support post-creation startup logic, we should not scheduled workload until this startup logic is complete. Even without this feature we have a need for such a taint as described in [MCM#740](https://github.com/gardener/machine-controller-manager/issues/740) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not really post creation startups, but post-creation VM updates. I am not convinced that we should be calling this Startup
docs/proposals/update-machine.md
Outdated
|
||
- Due to the fact that creation flow for machines will now be enhanced to correctly support post-creation startup logic, we should not scheduled workload until this startup logic is complete. Even without this feature we have a need for such a taint as described in [MCM#740](https://github.com/gardener/machine-controller-manager/issues/740) | ||
- We propose a new taint `node.machine.sapcloud.io/instance-not-ready` which will be added as a node startup taint in gardener core [KubeletConfiguration.RegisterWithTaints](https://github.com/gardener/gardener/blob/v1.83.1/pkg/component/extensions/operatingsystemconfig/original/components/kubelet/config.go#L101) | ||
- The will will then removed by MCM in health check reconciliation, once the machine becomes fully ready. (when moving to `Running` phase) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The will will then removed by MCM in health check reconciliation, once the machine becomes fully ready. (when moving to `Running` phase) | |
- This taint will then removed by MCM in health check reconciliation, once the machine becomes fully ready. (when moving to `Running` phase) |
docs/proposals/update-machine.md
Outdated
|
||
*NOTES* | ||
* The `lastop` below is an abbreviation for `machine.Status.LastOperation`. This, along with the machine phase is generally updated on the `Machine` object just before returning from the method. | ||
* `machineCreateErrorHandler` is a separate method on the controller that handles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sentence is incomplete
docs/proposals/update-machine.md
Outdated
1. We need to introduce enhancements around the call to `Driver.CreateMachine` in the flow illustrated above. | ||
1. Observe that after the call to a successful `Driver.CreateMachine`, the machine phase is set to `Pending`, the `LastOperation.Type` is currently set to `Create` and the `LastOperation.State` set to `Processing`before returning with a `ShortRetry`. The `LastOperation.Description` is (unfortunately) set to the fixed message: `Creating machine on cloud provider`. | ||
1. Observe that after an erroneous call to `Driver.CreateMachine`, the machine phase is set to `CrashLoopBackOff` or `Failed` (in case of creation timeout). | ||
1. We propose introducing a new MC error code `codes.Startup` indicating that the VM Instance was created but there was a startup error in post VM creation activities. The implementor of `Driver.CreateMachine` can return this error code, indicating that `CreateMachine` needs to be called again. However, the Machine Controller will keep the machine phase for such a situation as `Pending` and not change the phase to `CrashLoopBackOff` when encountering a `codes.Startup` error. (The instance for the machine HAS been created hence `CrashLoopBackoff` is inappropriate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to use another error code then do change it here as well. Startup signals that the VM was created and the startup of the VM which involves running the user config script, starting up kubelet and other k8s components failed. But here we are hinting at an update call to the VM post create which does not signal Startup
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One can follow simple terminology - PreCreate, Create, PostCreate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this InstanceInitialization
docs/proposals/update-machine.md
Outdated
1. Observe that after the call to a successful `Driver.CreateMachine`, the machine phase is set to `Pending`, the `LastOperation.Type` is currently set to `Create` and the `LastOperation.State` set to `Processing`before returning with a `ShortRetry`. The `LastOperation.Description` is (unfortunately) set to the fixed message: `Creating machine on cloud provider`. | ||
1. Observe that after an erroneous call to `Driver.CreateMachine`, the machine phase is set to `CrashLoopBackOff` or `Failed` (in case of creation timeout). | ||
1. We propose introducing a new MC error code `codes.Startup` indicating that the VM Instance was created but there was a startup error in post VM creation activities. The implementor of `Driver.CreateMachine` can return this error code, indicating that `CreateMachine` needs to be called again. However, the Machine Controller will keep the machine phase for such a situation as `Pending` and not change the phase to `CrashLoopBackOff` when encountering a `codes.Startup` error. (The instance for the machine HAS been created hence `CrashLoopBackoff` is inappropriate) | ||
1. We will introduce a new _machine operation_ stage `InitiateStartup`. In case of an `codes.Startup` error, the `machine.Status.LastOperation.Description` will be set to this stage, the `LastOperation.Type` will be set to `Create` and the `LastOperation.State` set to `Failed` before returning with a `ShortRetry` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we said that if a driver.CreateMachine
call fails then we simply stick to the current behavior and always move it to either CrashLoopBackoff
or Failed
right? Pending
unfortunately is way too generic and today it means that CreateMachine
call was successful but in this case it was not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(convo from slack)
It was said in the meeting not to do this as
CrashLoopBackoff
is meant for VMs that don’t get created andFailed
is meant for VMs that don’t get created after machine creation timeout. For cases where the VM gets created but post-creation (aka startup activities) failed, we still retain thePending
phase
From Madhav:
There was too much back and forth on this unfortunately and what i recollect is that we should retain the current behavior of moving the VM to CrashLoopBackoff because today it signifies failure in creating the VM which now involves one or more steps. Maybe @rishabh-11 and @himanshu-kun can confirm.
It is a pity that we use Crashlookbackoff
when in context of a Pod (from which we borrowed) indicates that there is an error which is preventing the containers within a pod to startup correctly. So overall the error category is StartupFailure and the underline cause is a crashing container. However in terms of MCM our usage is plain and simply wrong. We do not have a crashing VM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i recollect is that we should retain the current behavior of moving the VM to CrashLoopBackoff because today it signifies failure in creating the VM.
Yes, this is correct.
However in terms of MCM our usage is plain and simply wrong. We do not have a crashing VM
I don't think it is wrong. MCM uses CrashloopBackoff
to say that VM creation requests are failing. if we compare it to a pod, then it is similar to pod restarts due to failure. For a VM creation, we only have success or failure, no restart happens in creation, so multiple restarts can be compared to the CreateMachine calls that MCM makes in each reconciliation until it succeeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @rishabh-11 and @himanshu-kun can confirm.
We decided to use a separate stage for the error in startup updates. Now if we want to follow the definition of CLBF, then it can be argued that startup activities should be included in it. Then in that case, we need to have proper logs that can help us differentiate why the machine is in CLBF and this should be reflected in the status of the Machine object as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is wrong. MCM uses CrashloopBackoff to say that VM creation requests are failing. if we compare it to a pod, then it is similar to pod restarts due to failure. For a VM creation, we only have success or failure, no restart happens in creation, so multiple restarts can be compared to the CreateMachine calls that MCM makes in each reconciliation until it succeeds.
CrashLoopBackoff
has a word crash in it for a very simple reason that the failure to start a container within a pod is that it is crashing. This means that the container gets created but it then crashes. This is not the same as failing to create a VM, because this error code is returned when mcm-provider fails to create a VM (this is not because it crashes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest leaving these phase renaming decisions to the overhaul. After reading the arguments above and thinking on it a bit, I agree that we can simply preserve the current phase change from <empty>
to CrashLoopBackoff
(yes, naming is not appropriate but this was decision taken before us) when there is an error not just on VM instance creation but on post-creation/startup logic.
The argument supporting this is that this is consistent phase handling on error wrt API. The argument against this is that it can confuse operators since they expect such an error code to indicate that the VM it-self was not created. However, we can mitigate this by documenting change of expected behavior in FAQ/release-notes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to use a separate stage for the error in startup updates. Now if we want to follow the definition of CLBF, then it can be argued that startup activities should be included in it. Then in that case, we need to have proper logs that can help us differentiate why the machine is in CLBF and this should be reflected in the status of the Machine object as well.
In the triggerCreationFlow
a call is made to driver.CreateMachine
. If an error is returned then that is handled in method machineCreateErrorHandler which then calls getCreateFailurePhase to get the phase. With this proposal, if an update failed then an error will be returned which will have the same control flow. The error code will be used to determine the retry interval and if the timeout has not happened then it will simply return CrashLoopBackoff
as the phase which will then trigger invocation of driver.CreateMachine
again.
So with minimal changes you will achieve what you wanted to, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So with minimal changes you will achieve what you wanted to, right?
yes. We just need to add the reason for CLBF in the machine status and have additional logs as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So with minimal changes you will achieve what you wanted to, right?
yes. We just need to add the reason for CLBF in the machine status and have additional logs as well.
Will take note of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal looks good, as it tries to make minimum changes to the logic, plus keep the flow logical. I have added a corner case which we should think of handling, if its valid
docs/proposals/update-machine.md
Outdated
|
||
We will split the fulfilment of this overall need into 2 stages of implementation. | ||
|
||
1. **Stage-A**: Support post-VM creation logic in implementations of `Driver.CreateMachine` by permitting provider implementors to add one-time logic after VM creation, return with special new error code `codes.Startup` for post-VM errors and correspondingly support a new machine operation stage `InitiateStartup`. The [triggerCreationFlow](https://github.com/gardener/machine-controller-manager/blob/rel-v0.50/pkg/util/provider/machinecontroller/machine.go#L310) - a reconciliation sub-flow of the MCM responsible for orchestrating instance creation and updating machine status will be changed to support this behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about codes.StartupFailure
docs/proposals/update-machine.md
Outdated
|
||
|
||
|
||
2. **Stage-B**: Introduction of `Driver.UpdateMachine` and enhancing the MCM, MCM providers and gardener extension providers to support hot update of instances through `Driver.UpdateMachine`. The MCM [triggerUpdationFlow](https://github.com/gardener/machine-controller-manager/blob/v0.50.1/pkg/util/provider/machinecontroller/machine.go#L531) - a reconciliation sub-flow of the MCM which is supposed to be responsible for orchestrating instance update - but currently not used, will be updated to invoke the provider `Driver.UpdateMachine` on hot-updates to to the `Machine` object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think UpdateMachine
sounds better and inline with already existing driver names. We could also update docstring to tell that its for hot-updatable... since we have moved all the startup related update operation to CreateMachine
anyone reading UpdateMachine
wouldn't be that confused. WDYT?
docs/proposals/update-machine.md
Outdated
1. Observe that after the call to a successful `Driver.CreateMachine`, the machine phase is set to `Pending`, the `LastOperation.Type` is currently set to `Create` and the `LastOperation.State` set to `Processing`before returning with a `ShortRetry`. The `LastOperation.Description` is (unfortunately) set to the fixed message: `Creating machine on cloud provider`. | ||
1. Observe that after an erroneous call to `Driver.CreateMachine`, the machine phase is set to `CrashLoopBackOff` or `Failed` (in case of creation timeout). | ||
1. We propose introducing a new MC error code `codes.Startup` indicating that the VM Instance was created but there was a startup error in post VM creation activities. The implementor of `Driver.CreateMachine` can return this error code, indicating that `CreateMachine` needs to be called again. However, the Machine Controller will keep the machine phase for such a situation as `Pending` and not change the phase to `CrashLoopBackOff` when encountering a `codes.Startup` error. (The instance for the machine HAS been created hence `CrashLoopBackoff` is inappropriate) | ||
1. We will introduce a new _machine operation_ stage `InitiateStartup`. In case of an `codes.Startup` error, the `machine.Status.LastOperation.Description` will be set to this stage, the `LastOperation.Type` will be set to `Create` and the `LastOperation.State` set to `Failed` before returning with a `ShortRetry` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are situations which we should think of dealing with. For example, what if following condition happens
- VM was created &&
- there was a startup failure &&,
status.LastOperation
couldn't be updated due to KAPI rate limiting or MCM restart
then in the next reconcile, we wouldn't know at what step we are
GetMachineStatus
is used today to see if VM exists ,but it won't be able to help in identifying if startup failure happened or not.
This is happening as the CreateMachine
call won't be atomic as per the proposal (its not atomic in Azure as well presently, but there we separately list nic to see where we have reached in create operation)
Probable solutions to the above situation:
- always assume StartupFailure happened in such a situation, let
CreateMachine()
be called , but enhanceCreateMachine()
to first check if the VM has all the startup settings done(which I'm not sure is possible for every startup setting), if not then call startup related calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally GetMachineStatus
should be enhanced to give the status of the machine keeping in mind the desired state of the machine vs as-is. This means that in cases where a Create
involves creating the VM + updating properties on the VM (e.g. ipv6 specific configuration or source-destination-checks) then that should be separately sent back as part of the response. This is due to the fact that now after this proposal the VM is considered usable only when create + post-create-update succeeds. Therefore GetMachineStatus
should also be changed to reflect this as post-creation-activities constitute part of the status of a machine.
MCM can then decide what call to make based on the status returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree it the clearly the job of GetMachineStatus
to report this correctly as it is already invoked today in the triggerCreationFlow
. Provider implementations can simply return the same proposed error code when the VM exists but startup not fully done. codes.Startup(Error)
(Himanshu preference) or codes.PostCreate(Error)
. (Madhav preference) or codes.Initialization
(Tarun Prefrerence)
Please note that suffixing Error
to the error codes currently breaks the convention in codes
. There is no Error
suffix in any of the error codes and I would like to preserve existing convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed point of modification to GetMachineStatus
semantics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the point that Himanshu, mentioned above, the job of Driver.CreateMachine
is becoming more and more complex. After thinking on this yesterday and today, I would propose simplifying this by introducing a new Driver.InitializeMachine
which is separate from Driver.UpdateMachine
. All post-create initialization logic set such as disabling src-dest checks (aws) and ipv6 address assignment (aws), etc. will be put in here. This method will be invoked only once after Driver.CreateMachine
is invoked successfully. If Driver.InitializeMachine
fails it will return status.Error(codes.Initialization)
as the error. Driver.GetMachine
also will return status.Error(codes.Initialization)
if the VM is not fully initialized. The MCM will invoke Driver.InitializeMachine
if Driver.GetMachine
returns such a status. Only the triggerCreationFlow
will need changes and no other flows.
InstanceInitialization
will be the new machine operation stage. machine.Status.LastOperation.Description
will have this set if Driver.InitializeMachine
fails. The machine phase will be CrashLoopBackoff
as usual if either Driver.CreateMachine
or Driver.InitializeMachine
fails. Or move to Failed
phase (as usual) after the machine creation timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
What this PR does / why we need it:
See #871
Which issue(s) this PR fixes:
Fixes #871
Special notes for your reviewer:
Release note: