redeploy scheme improvement #25

Closed
dseeley-sky opened this issue Feb 19, 2020 · 1 comment · Fixed by #47

The current redeploy scheme is very conservative and quite slow (it redeploys one node at a time). An alternative scheme might be to deploy an entirely new cluster (with canary=start), which would effectively create a blue/green deployment. Then (on canary=finish), run the 'predelete' callback to sync-out the old cluster (and then perhaps add a new canary=clean stage to delete the old cluster).
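
A minimal sketch of how the proposed stages might be driven, assuming the existing `redeploy.yml` entry point; the stage names are those suggested above, and all other cluster-specific extra-vars are omitted:

```bash
# Proposed blue/green flow (stage names from the description above; any other
# extra-vars needed to identify the cluster are omitted for brevity).

# 1. Deploy an entirely new cluster alongside the existing one (the "green" copy).
ansible-playbook redeploy.yml -e canary=start

# 2. Once the new cluster is healthy, run the 'predelete' callback to sync-out
#    the old cluster.
ansible-playbook redeploy.yml -e canary=finish

# 3. Possibly a new stage to delete the old cluster afterwards.
ansible-playbook redeploy.yml -e canary=clean
```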

dseeley-sky self-assigned this Feb 19, 2020
@dseeley-sky (Contributor, Author) commented:

See https://github.com/sky-uk/clusterverse/tree/dps_addallscheme for initial work.

dseeley referenced this issue in dseeley/clusterverse Mar 22, 2020

antoineserrano pushed a commit that referenced this issue Apr 14, 2020
* A refactoring of the rollback code, which also necessitated a refactor of some other variables. Speed is considerably improved in all circumstances, and a couple of edge cases are fixed. This breaks the rollback interface: predelete_role now receives a list of VMs to delete in 'hosts_to_remove', rather than a single VM (previously 'host_to_redeploy'). This has the potential to massively increase redeploy speed if clustering affinity is configured.

+ A new VM tag/label 'lifecycle_state' is created describing the lifecycle state of the VM.  It is either 'current', 'retiring' or 'redeployfail'.
  + A cluster's VMs will now always have the same epoch suffix (even when adding to the cluster)
  + The '-e clean' functionality has changed: you can now clean hosts in every 'lifecycle_state' ('-e clean=all'), or only the VMs in one of the above states (e.g. '-e clean=retiring'); see the sketch after this list.
  + Redeploy will fail an assertion if the topology has changed (the number of defined VMs does not match reality)
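
The changed '-e clean' behaviour might look like this in practice; the `cluster.yml` playbook name and any other required extra-vars are assumptions, so substitute your usual clusterverse invocation:

```bash
# Remove hosts in every lifecycle_state ('current', 'retiring' and 'redeployfail').
ansible-playbook cluster.yml -e clean=all

# Remove only the VMs labelled with a particular lifecycle_state, e.g. the retiring ones.
ansible-playbook cluster.yml -e clean=retiring
```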

+ A new global fact 'cluster_hosts_state' is created that contains information on all running VMs with the derived cluster_name; i.e. the _state_ of the cluster.
  + Variables in 'cluster_hosts_state' are used instead of constantly querying the infrastructure, especially during redeploy.

+ Alternate redeploy scheme: '_scheme_addallnew_rmdisk_rollback'.
  + A full mirror of the cluster is deployed.
  + If the process proceeds correctly:
    + `predeleterole` is called with a _list_ of the old VMs, in 'hosts_to_remove'.
    + The old VMs are stopped.
  + If the process fails for any reason, the old VMs are reinstated, and the new VMs stopped (rollback)
  + To delete the old VMs, either set '-e canary_tidy_on_success=true', or call redeploy.yml with '-e canary=tidy' (see the sketch below)
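
A hedged sketch of driving this scheme from the command line; the `redeploy_scheme` variable name is an assumption (the scheme name, canary values and 'canary_tidy_on_success' are from the notes above), and the usual cluster extra-vars are omitted:

```bash
# One-shot redeploy: build the mirror cluster, sync-out and stop the old VMs, then tidy them away.
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback \
  -e canary=none -e canary_tidy_on_success=true

# Or stage it manually, deleting the old VMs only once you are happy with the new cluster.
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback -e canary=start
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback -e canary=finish
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback -e canary=tidy
```

The same invocation pattern would apply to the refactored '_scheme_addnewvm_rmdisk_rollback' scheme described below.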

+ The existing '_scheme_addnewvm_rmdisk_rollback' scheme is refactored to use the new variables. It is functionally similar, but does not terminate the VMs on success.
  + For each node in the cluster:
    + Create a new VM
    + Run `predeleterole` with the previous node passed as a _list_ (for interface compatibility) in 'hosts_to_remove'.
    + Shut down the previous node.
  + If the process fails for any reason, the old VMs are reinstated, and any new VMs that were built are stopped (rollback)
  + To delete the old VMs, either set '-e canary_tidy_on_success=true', or call redeploy.yml with '-e canary=tidy'

Fixes #25

* Ensure that the 'release' tag/label is consistent within a cluster (e.g. during a scaling deploy); don't allow the user to set a different label, and if one is not specified on the command line, apply the existing label.
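
For illustration, a sketch of the intended behaviour; the `cluster.yml` playbook name and the example value are assumptions, and 'release_version' is the variable referred to in the next item:

```bash
# First deploy: set an explicit release label on the cluster's VMs.
ansible-playbook cluster.yml -e release_version=v1.0.1

# Later scaling deploy: omit the variable and the existing label is re-applied;
# attempting to set a *different* label on an existing cluster is not allowed.
ansible-playbook cluster.yml
```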

* Move location of release_version logic for redeploy

* Fix canary_tidy_on_success to apply only when canary is "none" or "finish"

* + Add a short sleep to allow the DNS operation to complete. The records are possibly not yet replicated when the Ansible module returns, and without a small sleep the dig command will sometimes fail and create a negative cache entry, which means the name won't resolve until the SOA TTL expires.
+ Remove `delegate_to: localhost` on the dig command, so that it can work if we are running through a bastion host.
  + If the dig command needs to check an external IP, query 8.8.8.8 explicitly; otherwise it will default to resolving against the cloud DNS and return the internal VPC IP, which will not validate against the ansible_host (see the sketch after this list).
+ Add some sequence diagrams to show redeploy lifecycle_state for _scheme_addallnew_rmdisk_rollback
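
A rough shell equivalent of the check described above; the hostname is illustrative and the sleep length arbitrary:

```bash
# Give the provider's DNS a moment to replicate the new record before querying it,
# so a failed lookup doesn't leave a negative cache entry behind.
sleep 10

# Query an external resolver explicitly; the default (cloud) resolver would return
# the internal VPC address, which would not match ansible_host.
dig @8.8.8.8 +short myhost-a0-1579000000.example.com
```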

* + Enable redeploying to larger or smaller clusters.
+ Prevent redeploy from running on a cluster built with an older version of clusterverse.
  + Add a new playbook `clusterverse_label_upgrade_v1-v2.yml` to add the necessary labels to an older cluster.
+ Add a skip_release_version_check option (see the sketch after this list).
+ Make the external DNS resolver a configurable variable.
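
A sketch of how the new pieces might be used; any extra-vars beyond those named above are omitted, and which playbook accepts skip_release_version_check is an assumption:

```bash
# One-off: label a cluster built with an older clusterverse so the new redeploy logic
# (lifecycle_state etc.) can operate on it.
ansible-playbook clusterverse_label_upgrade_v1-v2.yml

# Explicitly opt out of the release-version guard on redeploy, if you accept the risk.
ansible-playbook redeploy.yml -e skip_release_version_check=true
```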

* + Change cluster_hosts_flat to cluster_hosts_target
+ Change nested logging output to print a useful trace

* Fix for DNS dig check in GCP - only add a '.' to fqdn when there isn't already one at the end.

* Only allow canary=tidy to tidy (remove) powered-down VMs. Tidy is meant to clean up after a successful redeploy - if there are non-current machines still powered up, something is wrong.

* Fix for canary_tidy_on_success

* Fix merge error in installing file/metricbeat

Co-authored-by: Dougal Seeley <[email protected]>