Skip to content
This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

Consul servers should leave gracefully when they are restarted #760

Closed
wants to merge 1 commit into from

Conversation

ishustava
Copy link
Contributor

@ishustava ishustava commented Jan 8, 2021

Background

Currently, if consul servers are restarted when ACLs are enabled (either by doing the recommended upgrade procedure, doing a stateful set restart, or simply deleting consul server pods), they exit with code 1, which means a non-graceful termination. This is because the pre-stop hook that runs the consul leave command doesn't have the right ACL permissions to perform a graceful leave. This could leave the cluster unhealthy and could lead to various errors, including split brain, node name collisions and others.

Changes proposed in this PR:

  • Set leave_on_terminate on consul servers to true
  • Remove the preStop hook that calls consul leave since we don't need it anymore

How I've tested this PR:

  1. Create a kind cluster and install the helm chart from this branch with
    helm install iryna --set global.name=consul --set server.replicas=1 --set global.acls.manageSystemACLs=true .
    
  2. Watch consul servers logs with kubectl logs -f consul-server-0
  3. Trigger the restart of the servers kubectl rollout restart sts/consul-server
  4. Observe that the server has exited 0 in Consul logs:
    2021-01-08T00:36:42.266Z [INFO]  agent.router.manager: shutting down
    2021-01-08T00:36:42.266Z [INFO]  agent: consul server down
    2021-01-08T00:36:42.266Z [INFO]  agent: shutdown complete
    2021-01-08T00:36:42.266Z [INFO]  agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=tcp
    2021-01-08T00:36:42.266Z [INFO]  agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=udp
    2021-01-08T00:36:42.266Z [INFO]  agent: Stopping server: address=[::]:8500 network=tcp protocol=http
    2021-01-08T00:36:42.766Z [INFO]  agent: Waiting for endpoints to shut down
    2021-01-08T00:36:42.766Z [INFO]  agent: Endpoints down
    2021-01-08T00:36:42.766Z [INFO]  agent: Exit code: code=0

How I expect reviewers to test this PR:

[Optional] In the same way, as I've described above.

Checklist:

  • N/A - Bats tests added - No bats tests necessary for this change since there is no template logic
  • CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)
  • N/A - PR submitted to update consul.io Helm docs (see Generating Helm Reference Docs)

@ishustava ishustava added bug Something isn't working area/upgrades Related to upgrading the helm chart labels Jan 8, 2021
@ishustava ishustava requested review from a team, lkysow and thisisnotashwin and removed request for a team January 8, 2021 00:45
@ishustava
Copy link
Contributor Author

Thank you both so much for your reviews! I'm going to close this one in favor of #764.

@ishustava ishustava closed this Jan 13, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/upgrades Related to upgrading the helm chart bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants