Address the delete of the donor instance #69

Merged
merged 4 commits into from
Apr 2, 2021
15 changes: 12 additions & 3 deletions docs/status.md
@@ -55,7 +55,9 @@ the full solution.

Step 2 is very much dependent on step 1 - there are bits of data or annotations
that are needed to automatically reason about which node is most suited to
-becoming the prototype pattern node.
+becoming the prototype pattern node. We have a mechanism in the code that is a
+reasonable base implementation of Step 2, based on the rules
+with attributes as we have stated them.

Step 3 is the part that no one (or no one we know of) has. This part can
be used, in the worst case, by someone manually selecting the target node.
@@ -68,6 +70,13 @@ Yes - our goal is to bring steps 1 and 2 as pluggable elements such that
each of them could be replaced with their own implementations. Our goal
is to have a basic reference implementation of all 3 components.

+With the latest changes to the open source [Kured](https://github.com/weaveworks/kured)
+tool, we now have a baseline of step 1 plus our [Kamino auto update](../helm/vmss-prototype/auto-update.md) for
+steps 2 and 3.

Our hope is that these components are composable and replaceable as needed.
-Again, the big push with doing step 3 first is that we feel that is the
-most critical and unique component right now.
+Again, the big push was to do step 3 first because we felt that it was the
+most critical and unique component. Having our step 2 code there and validated
+against at least 2 implementations of step 1 (an internal system and now
+the public [Kured](https://github.com/weaveworks/kured) project) gives us
+confidence that Kamino is a viable, operational, and useful first release.
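
The "composable and replaceable" goal described in this hunk can be pictured as three plug points wired together by a small driver. The following is only a minimal sketch under stated assumptions: `NodeInfo`, `run_pipeline`, the three callable types, and the `example.com/ready` annotation key are hypothetical names made up for illustration, not part of Kamino or Kured.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical view of a node: its name plus the annotations step 1 produced."""
    name: str
    annotations: Dict[str, str]


# Step 1 (assumed shape): something that reports nodes and their annotations.
NodeSource = Callable[[], Iterable[NodeInfo]]
# Step 2 (assumed shape): picks the node best suited to become the prototype, or None.
Selector = Callable[[Iterable[NodeInfo]], Optional[NodeInfo]]
# Step 3 (assumed shape): turns the chosen node into a prototype; returns an identifier.
Prototyper = Callable[[NodeInfo], str]


def run_pipeline(source: NodeSource, select: Selector, prototype: Prototyper) -> Optional[str]:
    """Wire the three steps together; any of them can be swapped independently."""
    candidate = select(source())
    if candidate is None:
        return None  # nothing suitable this round, so step 3 is skipped
    return prototype(candidate)


# Stub implementations standing in for real step 1/2/3 components.
def stub_source() -> Iterable[NodeInfo]:
    return [
        NodeInfo("node-a", {}),
        NodeInfo("node-b", {"example.com/ready": "true"}),  # hypothetical annotation
    ]


def stub_select(nodes: Iterable[NodeInfo]) -> Optional[NodeInfo]:
    for node in nodes:
        if node.annotations.get("example.com/ready") == "true":
            return node
    return None


def stub_prototype(node: NodeInfo) -> str:
    return "prototype-from-" + node.name


result = run_pipeline(stub_source, stub_select, stub_prototype)
```

Because each step is just a callable, manually selecting the target node (the worst case mentioned above) amounts to passing a selector that returns a fixed node.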
4 changes: 2 additions & 2 deletions helm/vmss-prototype/README.md
@@ -66,13 +66,13 @@ The `vmss-prototype` operation carries out a procedural set of steps, each of wh
8. Cordon + drain the target node in preparation for taking it offline. If the cordon + drain fails, we will fail the operation _unless we pass in the `--force` option to the `vmss-prototype` tool (see the Helm Chart usage of `kamino.drain.force` below)_.
9. Deallocate the VMSS instance. This is a fancy, Azure-specific way of saying that we release the reservation of the underlying compute hardware running that instance virtual machine. This is a pre-condition to performing a snapshot of the underlying disk.
10. Make a snapshot of the OS disk image attached to the deallocated VMSS instance.
-11. *Permanently delete the VMSS instance.* This is due to an [open issue](https://github.com/jackfrancis/kamino/issues/26). Long-term, we aim to solve that issue and simply re-introduce the snapshotted node back into the cluster. In the meanwhile, one operational side-effect of `vmss-prototype` is the loss of one node in the node pool. If you wish to re-add one node after `vmss-prototype` has completed updating the VMSS model, you may use the `--set kamino.newUpdatedNodes=1` option when invoking `helm install`.
+11. Restart the node's VMSS instance that we just grabbed a snapshot of.
12. Uncordon the node to allow Kubernetes to schedule workloads onto it.
13. Remove the `cluster-autoscaler.kubernetes.io/scale-down-disabled` cluster-autoscaler node annotation as we no longer care if this node is chosen for removal by cluster-autoscaler.
14. Build a new SIG Image Definition _version_ (i.e., the actual image we're going to update the VMSS to use) from the recently captured snapshot image. This takes a long time! In our tests we see a 30 GB image (the OS disk size default for many Linux distros) take between 30 minutes and 2 _hours_ to be rendered as a SIG Image Definition version!
15. After the new SIG Image Definition version has been created, we delete the snapshot image as it will no longer be needed.
16. We now prune older SIG Image Definition versions (configurable, see the usage of `kamino.imageHistory` in the official Helm Chart docs below).
-17. Update the target instance's VMSS model so that its OS image refers to the newly created SIG Image Definition version. This means that the very next instance built with this VMSS will derive from the newly created image. *This update operation does not affect existing instances: The `vmss-prototype` tool does not instruct the VMSS API to perform a "rolling upgrade" to ensure that all instances are running this new OS image! Similarly, `vmss-prototype` **will not** perform a "rolling upgrade" across the other, existing VMSS instances, nor will it create new, replacement instances, and delete old instances!*
+17. Update the target instance's VMSS model so that its OS image refers to the newly created SIG Image Definition version. This means that the very next instance built with this VMSS will derive from the newly created image. *This update operation does not affect existing instances: The `vmss-prototype` tool does not instruct the VMSS API to perform a "rolling upgrade" to ensure that all instances are running this new OS image! Similarly, `vmss-prototype` **will not** perform a "rolling upgrade" across the other, existing VMSS instances, nor will it create new, replacement instances, or delete old instances!*
18. Update the target instance's cloud-init configuration so that it no longer includes "one-time bootstrap" configuration. Because this instance was _already_ bootstrapped when the cluster was created, we don't need to perform those prerequisite file system operations again. By updating the VMSS's OS image reference to a "post-bootstrapped" image, `vmss-prototype` makes it unnecessary for new instances to pay this cloud-init bootstrap overhead, so our new nodes will come online more quickly!
19. Similarly, we remove any VMSS "Extensions" that were used to execute "one-time bootstrap executable code" (i.e., all the stuff we execute to turn a vanilla Linux VM into a Kubernetes node running in a cluster), except for any "provenance-identifying" Extensions, e.g. "computeAksLinuxBilling". Similar to the cloud-init savings, `vmss-prototype` allows us to create new instances _already configured to come online immediately as Kubernetes nodes in this cluster!_
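
To make the node-facing part of this list concrete (steps 8 through 13, using the "restart" form of step 11 rather than the delete), here is a rough sketch that only builds the ordered `kubectl` and `az` command lines and does not execute anything. It is an illustration, not how `vmss-prototype` is actually implemented: the function name and all parameter names are hypothetical, and the choice of CLI subcommands is an assumption about one way to perform each step.

```python
from typing import List

# Annotation named in step 13 of the list above.
SCALE_DOWN_DISABLED = "cluster-autoscaler.kubernetes.io/scale-down-disabled"


def snapshot_cycle_commands(
    node: str,
    resource_group: str,
    vmss: str,
    instance_id: str,
    os_disk_id: str,
    snapshot_name: str,
) -> List[List[str]]:
    """Return the ordered argv lists for steps 8-13 (assumed CLI equivalents)."""
    vmss_target = ["--resource-group", resource_group, "--name", vmss,
                   "--instance-ids", instance_id]
    return [
        # Step 8: cordon + drain the target node.
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node, "--ignore-daemonsets"],
        # Step 9: deallocate the VMSS instance (pre-condition for the snapshot).
        ["az", "vmss", "deallocate"] + vmss_target,
        # Step 10: snapshot the OS disk of the deallocated instance.
        ["az", "snapshot", "create", "--resource-group", resource_group,
         "--name", snapshot_name, "--source", os_disk_id],
        # Step 11: restart the instance instead of deleting it.
        ["az", "vmss", "start"] + vmss_target,
        # Step 12: let Kubernetes schedule workloads onto the node again.
        ["kubectl", "uncordon", node],
        # Step 13: a trailing "-" removes the annotation from the node.
        ["kubectl", "annotate", "node", node, SCALE_DOWN_DISABLED + "-"],
    ]


# Example with placeholder values; nothing here touches a real cluster.
plan = snapshot_cycle_commands(
    node="example-node",
    resource_group="example-rg",
    vmss="example-vmss",
    instance_id="0",
    os_disk_id="example-disk-id",
    snapshot_name="example-snapshot",
)
```

Keeping the plan as data makes the ordering easy to inspect or dry-run before anything is handed to `subprocess.run`; the node is only uncordoned after the instance has been started again.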
