
Add an intermediate Machine state for non-atomic Cloud Provider APIs #444

Closed
zuzzas opened this issue Apr 14, 2020 · 8 comments
Labels
exp/intermediate: Issue that requires some project experience
kind/design
kind/enhancement: Enhancement, improvement, extension
lifecycle/stale: Nobody worked on this for 6 months (will further age)
priority/backlog: Issues which needs implementation

Comments

@zuzzas
Contributor

zuzzas commented Apr 14, 2020

What would you like to be added:

An intermediate Machine state for non-atomic Cloud Provider APIs. After VM creation is complete, apply a second wave of changes to the machine. Consider Machine Ready only after all API calls are complete.

Why is this needed:

Some Cloud Provider APIs require a second API call (after the creation of a VM) to change certain parameters. A perfect (and dear to my heart) example is setting the source-dest-check option on AWS instances.

If MCM is restarted (or somehow loses its internal state), the second non-atomic call to the Cloud Provider's API will never be made.
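For illustration, a minimal sketch of such a "second wave" call, assuming the AWS SDK for Go v1. disableSourceDestCheck is a hypothetical helper name; ec2.ModifyInstanceAttribute is the real AWS call, and RunInstances itself cannot set this attribute, which is what makes the creation flow non-atomic:

```go
// Hypothetical sketch: the second, non-atomic call after instance creation.
// If MCM restarts between instance creation and this call, the attribute is
// never set, which is exactly the problem described above.
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func disableSourceDestCheck(instanceID string) error {
	svc := ec2.New(session.Must(session.NewSession()))
	// source-dest-check cannot be set at RunInstances time, so it has to
	// be applied as a separate call once the instance exists.
	_, err := svc.ModifyInstanceAttribute(&ec2.ModifyInstanceAttributeInput{
		InstanceId:      aws.String(instanceID),
		SourceDestCheck: &ec2.AttributeBooleanValue{Value: aws.Bool(false)},
	})
	return err
}
```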

@zuzzas zuzzas added the kind/enhancement Enhancement, improvement, extension label Apr 14, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 14, 2020
@prashanth26
Contributor

Also, we should update the status on the machine object right before the VM creation request is made to the provider.
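For illustration, a minimal sketch of that ordering. Every name here is hypothetical rather than MCM's real API; the point is only that the status write is persisted before the provider request goes out:

```go
package sketch

import "fmt"

type phase string

const (
	phaseCreating phase = "Creating" // persisted before the provider call
	phaseRunning  phase = "Running"
)

type machine struct {
	name  string
	phase phase
}

// persistStatus stands in for a status update against the apiserver.
func persistStatus(m *machine) error {
	fmt.Printf("machine %s -> %s\n", m.name, m.phase)
	return nil
}

// createMachine records intent before the VM creation request goes out. If
// the controller restarts after the first write, the next reconciliation
// sees phaseCreating and knows a VM may already exist on the provider side.
func createMachine(m *machine, providerCreate func(*machine) error) error {
	m.phase = phaseCreating
	if err := persistStatus(m); err != nil {
		return err
	}
	if err := providerCreate(m); err != nil {
		return err
	}
	m.phase = phaseRunning
	return persistStatus(m)
}
```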

@prashanth26 prashanth26 added exp/intermediate Issue that requires some project experience and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 16, 2020
@hardikdr hardikdr added kind/design priority/backlog Issues which needs implementation labels Aug 28, 2020
@prashanth26
Contributor

prashanth26 commented Oct 21, 2020

Hi @zuzzas,

I have thought a bit more deeply about the issue you raised. I see two separate requirements here; correct me if I have misunderstood.

  1. Having multiple cloud provider calls for a single machine creation, so that operations are continued/retried on failures (e.g. MCM restarts). This should already be achievable with the OOT code, where the CreateMachine() call could be smart enough to decode the previous state using the lastKnownState field; providers can use this field to save partial state during VM creation/deletion. The partial state of the VMs on the provider side can also always be retrieved, as long as the VMs are tagged correctly during the initial VM operation call (see the sketch after this list).

Although the above-mentioned logic is achievable with the OOT, IMO one small caveat is that the OOT would currently delete such partially created machines, because the machineSet controller deletes failed machines. The fix I see would be to keep the machine in a Pending state while machine creation is only partially complete. We could probably also switch between Pending / Failed / other phases based on the type of error returned by the provider's CreateMachine() code. I am currently working on the AWS OOT provider and will try to have an initial release in the coming week. We can probably test it against your use-case of partially created machines with multiple calls (with the source-dest-check option).

  2. However, all this leads into the larger problem of in-place updates, which is yet to be solved in the MCM and probably needs more thought. The way I envision achieving this in the future would be to use machineClass hashes as labels on the VM: a getMachineStatus() call would return this label hash, compare it with the current machineClass hash, and update the VM when they don't match. However, this is just a rough thought which needs more discussion.
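For requirement (1), a rough sketch of what a lastKnownState-driven CreateMachine() could look like. All names (the checkpoint strings, createVM, applySecondWave) are hypothetical and heavily simplified relative to the real OOT driver interface:

```go
package sketch

import "fmt"

// Hypothetical provider driver; createVM and applySecondWave stand in for
// real provider SDK calls. In a real driver, createVM would first look the
// VM up by tag so that repeating it is idempotent.
type driver struct{}

func (d *driver) createVM(name string) error        { fmt.Println("creating VM", name); return nil }
func (d *driver) applySecondWave(name string) error { fmt.Println("configuring VM", name); return nil }

// Checkpoints persisted in the machine's lastKnownState field.
const (
	stateNone         = ""              // nothing created yet
	stateVMCreated    = "vm-created"    // VM exists, second-wave call still pending
	stateVMConfigured = "vm-configured" // all calls complete, machine can be Ready
)

// CreateMachine resumes from whatever checkpoint the previous (possibly
// interrupted) reconciliation reached, making the flow restart-safe.
func (d *driver) CreateMachine(machineName, lastKnownState string) (string, error) {
	switch lastKnownState {
	case stateNone:
		if err := d.createVM(machineName); err != nil {
			return stateNone, err
		}
		fallthrough // continue straight into the second wave
	case stateVMCreated:
		if err := d.applySecondWave(machineName); err != nil {
			// Returning this checkpoint makes the next reconciliation
			// retry only the second wave, not the whole creation.
			return stateVMCreated, err
		}
	}
	return stateVMConfigured, nil
}
```

The key property is that each persisted checkpoint narrows what the next reconciliation has to redo, so an MCM restart no longer loses the pending second-wave call.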

If your requirement is only (1), i.e. VM creation with multiple rounds of CreateMachine() calls, where CreateMachine() would be smart enough to decode which operations still need to be performed (e.g. if the VM is already present, only make an update call to set the source-dest-check option), then this is something I can test in the coming days and update you on. However, if it's about updating VMs in place in the generic case, I think that would be a much larger topic and discussion.

Also, these are just some thoughts I have tried to pen down; please correct me if I am wrong or if something can be improved.

@zuzzas
Contributor Author

zuzzas commented Oct 22, 2020

Hi @prashanth26,

Thanks for the thorough response. Yeah, the first point seems feasible and is exactly what I require.

The reason for #538 is that I still use in-tree providers. It was an attempt to create a simple interim solution, which simply removes the Machine and restarts the creation flow. That helps us avoid adopting half-created Machines, which were created as VMs but failed to have Security Groups added to their interfaces.

@prashanth26
Contributor

prashanth26 commented Oct 22, 2020

That helps us avoid adopting half-created Machines, which were created as VMs but failed to have Security Groups added to their interfaces

But wouldn't this case lead to orphan VMs? Anyway, you can identify VMs using the tags, right? IMO, you can capture this case in the driver Create() call and either fix/adopt the partially created VM, or try to perform cleanup when Create() is called the second time and return an error to declare the machine failed. Wouldn't this be more meaningful than creating an additional phase? Adding this phase seems like a hack to me.

cc: @hardikdr @ggaurav10 @amshuman-kr WDYT?
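For illustration, a rough sketch of this alternative under hypothetical names throughout: on a repeated Create() call, look the VM up by tag; if it exists but is only partially configured, clean it up and return an error rather than introducing a new phase:

```go
package sketch

import "fmt"

type vm struct {
	id              string
	fullyConfigured bool // e.g. source-dest-check / Security Groups applied
}

type provider struct{}

// findVMByTag stands in for a tag-based lookup in the provider's API; tags
// written during the first attempt are what identify partially created VMs.
func (p *provider) findVMByTag(machineName string) (*vm, error) { return nil, nil }
func (p *provider) createVM(machineName string) error           { return nil }
func (p *provider) deleteVM(v *vm) error                        { return nil }

func (p *provider) Create(machineName string) error {
	existing, err := p.findVMByTag(machineName)
	if err != nil {
		return err
	}
	if existing == nil {
		return p.createVM(machineName) // first attempt: nothing to clean up
	}
	if existing.fullyConfigured {
		return nil // an earlier attempt actually succeeded
	}
	// Partially created: clean up and report failure, so the machineSet
	// controller replaces the Failed machine with a fresh creation.
	if err := p.deleteVM(existing); err != nil {
		return err
	}
	return fmt.Errorf("machine %s was partially created; cleaned up and marked failed", machineName)
}
```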

@zuzzas
Contributor Author

zuzzas commented Oct 23, 2020

IMO, you can capture this case in the driver Create() call

This happens after adoption. If a VM was created without the other important pieces being set up (like SGs in OpenStack), the GetVMs() call will still return the VM, and it will get adopted.

@amshuman-kr

@zuzzas We are in the middle of moving from the in-tree providers to the out-of-tree providers; the existing in-tree providers are in different stages of being migrated. As mentioned above by @prashanth26, the out-of-tree approach proposes using the lastKnownState field to maintain the state of reconciliation for the Machine (which covers provisioning, update, and de-provisioning) and to aid decisions during the next reconciliation.

Clearly, something like this doesn't exist in the in-tree approach. But considering that the in-tree approach is being phased out, we are a bit reluctant to introduce new functionality there. Would it work for you to do this in the out-of-tree approach? Also, do you see any aspect of the problem that the lastKnownState field in the out-of-tree approach doesn't address?

@zuzzas
Contributor Author

zuzzas commented Oct 27, 2020

Nah, everything is clear. I had been unsure about the current policy on changing things in the in-tree controller. Thanks, @amshuman-kr.

Also, do you see any aspect of the problem that the lastKnownState field in the out-of-tree approach doesn't address?

It's perfect.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 26, 2021
@prashanth26
Contributor

This has been taken care of with this field.
/close in favour of it.
