redeploy scheme improvement #25

Closed
dseeley-sky opened this issue Feb 19, 2020 · 1 comment · Fixed by #47

The current redeploy scheme is very conservative and quite slow (it redeploys one node at a time). An alternative scheme might be to deploy an entirely new cluster (with canary=start), which would effectively create a blue/green deployment. Then (on canary=finish), run the 'predelete' callback to sync-out the old cluster (and then perhaps add a new canary=clean stage to delete the old cluster).
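
A minimal sketch of how the proposed stages might be driven, assuming the existing `redeploy.yml` entry point; the stage names are those suggested above, and all other cluster-specific extra-vars are omitted:

```bash
# Proposed blue/green flow (stage names from the description above; any other
# extra-vars needed to identify the cluster are omitted for brevity).

# 1. Deploy an entirely new cluster alongside the existing one (the "green" copy).
ansible-playbook redeploy.yml -e canary=start

# 2. Once the new cluster is healthy, run the 'predelete' callback to sync-out
#    the old cluster.
ansible-playbook redeploy.yml -e canary=finish

# 3. Possibly a new stage to delete the old cluster afterwards.
ansible-playbook redeploy.yml -e canary=clean
```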

dseeley-sky self-assigned this Feb 19, 2020
@dseeley-sky (Contributor, Author) commented:

See https://github.com/sky-uk/clusterverse/tree/dps_addallscheme for initial work.

dseeley referenced this issue in dseeley/clusterverse Mar 22, 2020

antoineserrano pushed a commit that referenced this issue Apr 14, 2020
* A refactoring of the rollback code, which also necessitated a refactor of some other variables. Speed is considerably improved in all circumstances, and a couple of edge cases are fixed. This breaks the rollback interface: predelete_role now receives a list of VMs to delete in 'hosts_to_remove', rather than a single VM (previously 'host_to_redeploy'). This has the potential to massively increase redeploy speed if clustering affinity is configured.

+ A new VM tag/label 'lifecycle_state' is created describing the lifecycle state of the VM.  It is either 'current', 'retiring' or 'redeployfail'.
  + A cluster's VMs will now always have the same epoch suffix (even when adding to the cluster)
  + The '-e clean' functionality has changed: you can now clean hosts in every 'lifecycle_state' ('-e clean=all'), or only the VMs in one of the above states (e.g. '-e clean=retiring'); see the sketch after this list.
  + Redeploy will fail an assertion if the topology has changed (the number of defined VMs does not match reality)
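
The changed '-e clean' behaviour might look like this in practice; the `cluster.yml` playbook name and any other required extra-vars are assumptions, so substitute your usual clusterverse invocation:

```bash
# Remove hosts in every lifecycle_state ('current', 'retiring' and 'redeployfail').
ansible-playbook cluster.yml -e clean=all

# Remove only the VMs labelled with a particular lifecycle_state, e.g. the retiring ones.
ansible-playbook cluster.yml -e clean=retiring
```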

+ A new global fact 'cluster_hosts_state' is created that contains information on all running VMs with the derived cluster_name; i.e. the _state_ of the cluster.
  + Variables in 'cluster_hosts_state' are used instead of constantly querying the infrastructure, especially during redeploy.

+ Alternate redeploy scheme: '_scheme_addallnew_rmdisk_rollback'.
  + A full mirror of the cluster is deployed.
  + If the process proceeds correctly:
    + `predeleterole` is called with a _list_ of the old VMs, in 'hosts_to_remove'.
    + The old VMs are stopped.
  + If the process fails for any reason, the old VMs are reinstated, and the new VMs stopped (rollback)
  + To delete the old VMs, either set '-e canary_tidy_on_success=true', or call redeploy.yml with '-e canary=tidy' (see the sketch below)
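
A hedged sketch of driving this scheme from the command line; the `redeploy_scheme` variable name is an assumption (the scheme name, canary values and 'canary_tidy_on_success' are from the notes above), and the usual cluster extra-vars are omitted:

```bash
# One-shot redeploy: build the mirror cluster, sync-out and stop the old VMs, then tidy them away.
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback \
  -e canary=none -e canary_tidy_on_success=true

# Or stage it manually, deleting the old VMs only once you are happy with the new cluster.
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback -e canary=start
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback -e canary=finish
ansible-playbook redeploy.yml -e redeploy_scheme=_scheme_addallnew_rmdisk_rollback -e canary=tidy
```

The same invocation pattern would apply to the refactored '_scheme_addnewvm_rmdisk_rollback' scheme described below.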

+ The existing '_scheme_addnewvm_rmdisk_rollback' scheme is refactored to use the new variables. It is functionally similar, but does not terminate the VMs on success.
  + For each node in the cluster:
    + Create a new VM
    + Run `predeleterole` with the previous node passed as a _list_ (for interface compatibility) in 'hosts_to_remove'.
    + Shut down the previous node.
  + If the process fails for any reason, the old VMs are reinstated, and any new VMs that were built are stopped (rollback)
  + To delete the old VMs, either set '-e canary_tidy_on_success=true', or call redeploy.yml with '-e canary=tidy'

Fixes #25

* Ensure that the 'release' tag/label is consistent within a cluster (e.g. during a scaling deploy); don't allow the user to set a different label, and if one is not specified on the command line, apply the existing label.
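
For illustration, a sketch of the intended behaviour; the `cluster.yml` playbook name and the example value are assumptions, and 'release_version' is the variable referred to in the next item:

```bash
# First deploy: set an explicit release label on the cluster's VMs.
ansible-playbook cluster.yml -e release_version=v1.0.1

# Later scaling deploy: omit the variable and the existing label is re-applied;
# attempting to set a *different* label on an existing cluster is not allowed.
ansible-playbook cluster.yml
```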

* Move location of release_version logic for redeploy

* Fix canary_tidy_on_success to apply only when canary is "none" or "finish"

* + Add a short sleep to allow the DNS operation to complete. The records are possibly not yet replicated when the Ansible module returns, and without a small sleep the dig command will sometimes fail and create a negative cache entry, which means the name won't resolve until the SOA TTL expires.
+ Remove `delegate_to: localhost` on the dig command, so that it can work if we are running through a bastion host.
  + If the dig command needs to check an external IP, query 8.8.8.8 explicitly; otherwise it will default to resolving against the cloud DNS and return the internal VPC IP, which will not validate against the ansible_host (see the sketch after this list).
+ Add some sequence diagrams to show redeploy lifecycle_state for _scheme_addallnew_rmdisk_rollback
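
A rough shell equivalent of the check described above; the hostname is illustrative and the sleep length arbitrary:

```bash
# Give the provider's DNS a moment to replicate the new record before querying it,
# so a failed lookup doesn't leave a negative cache entry behind.
sleep 10

# Query an external resolver explicitly; the default (cloud) resolver would return
# the internal VPC address, which would not match ansible_host.
dig @8.8.8.8 +short myhost-a0-1579000000.example.com
```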

* + Enable redeploying to larger or smaller clusters.
+ Prevent redeploy from running on a cluster built with an older version of clusterverse.
  + Add a new playbook `clusterverse_label_upgrade_v1-v2.yml` to add the necessary labels to an older cluster.
+ Add a skip_release_version_check option (see the sketch after this list).
+ Make the external DNS resolver a configurable variable.
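
A sketch of how the new pieces might be used; any extra-vars beyond those named above are omitted, and which playbook accepts skip_release_version_check is an assumption:

```bash
# One-off: label a cluster built with an older clusterverse so the new redeploy logic
# (lifecycle_state etc.) can operate on it.
ansible-playbook clusterverse_label_upgrade_v1-v2.yml

# Explicitly opt out of the release-version guard on redeploy, if you accept the risk.
ansible-playbook redeploy.yml -e skip_release_version_check=true
```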

* + Change cluster_hosts_flat to cluster_hosts_target
+ Change nested logging output to print a useful trace

* Fix for DNS dig check in GCP - only add a '.' to fqdn when there isn't already one at the end.

* Only allow canary=tidy to tidy (remove) powered-down VMs. Tidy is meant to clean up after a successful redeploy - if there are non-current machines still powered up, something is wrong.

* Fix for canary_tidy_on_success

* Fix merge error in installing file/metricbeat

Co-authored-by: Dougal Seeley <[email protected]>