Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul Catalog Integration Fixes #5536

Merged
merged 3 commits into from
May 9, 2019
Merged

Consul Catalog Integration Fixes #5536

merged 3 commits into from
May 9, 2019

Conversation

endocrimes
Copy link
Contributor

@endocrimes endocrimes commented Apr 10, 2019

fixes #4566
fixes #4537

The current implementation of Service Registration uses a hash of the nomad-internal state of a service to register it with Consul, this means that any update to the service invalidates this name and we then de-register, and recreate the service in Consul.

While this behavior slightly simplifies reasoning about service registration, this becomes problematic when we add consul health checks to a service. When the service is re-registered, so are the checks, which default to failing for at least one check period. This causes issues when a canary is promoted, as the old, stable allocs are immediately stopped, and the new ones updated, which causes a consul registration cycle and takes your service offline for one health check period.

This pull request introduces v3 of the Nomad-Consul integration style that moves back to having a stable identifier for a service, and registers updates based on changes to the internal struct state and remote state.

Notes

  • There is currently a UX issue in Consul that means that when you update a Service's tags, its Checks will still return the ones they were registered against, however this does not effect the functionality of Consul or Nomad (Updating a Service definition via the API does not update Checks consul#5631).

  • There is an outstanding Nomad bug where the consul reconciler races against task restoration on a host, which means when restarting or updating a client, services will be removed from consul and re-registered, causing a 1-health-check-period status flap. This PR does not attempt to solve that bug as it will require a larger refactor of the client startup flow to be reliable, thus this PR does not attempt to handle older registrations outside of removing them as expected.

@endocrimes endocrimes force-pushed the dani/consul branch 3 times, most recently from 54f0e0e to 761314d Compare April 15, 2019 15:18
endocrimes added 2 commits May 2, 2019 16:54
The current implementation of Service Registration uses a hash of the
nomad-internal state of a service to register it with Consul, this means that
any update to the service invalidates this name and we then deregister, and
recreate the service in Consul.

While this behaviour slightly simplifies reasoning about service registration,
this becomes problematic when we add consul health checks to a service. When
the service is re-registered, so are the checks, which default to failing for
at least one check period.

This commit migrates us to using a stable identifier based on the
allocation, task, and service identifiers, and uses the difference
between the remote and local state to decide when to push updates.

It uses the existing hashing mechanic to decide when UpdateTask should
regenerate service registrations for providing to Sync, but this should
be removable as part of a future refactor.

It additionally introduces the _nomad-check- prefix for check
definitions, to allow for future allowing of consul features like
maintenance mode.
This commit causes sync to skip deregistering checks that are not
managed by nomad, such as service maintenance mode checks.  This is
handled in the same way as service registrations - by doing a Nomad
specific prefix match.
@endocrimes endocrimes force-pushed the dani/consul branch 2 times, most recently from 429d157 to 14c231a Compare May 3, 2019 09:46
@endocrimes endocrimes marked this pull request as ready for review May 3, 2019 09:55
@endocrimes
Copy link
Contributor Author

Only travis job that's failing on push is the website and that seems unrelated.

@endocrimes endocrimes requested a review from schmichael May 3, 2019 10:03
Copy link
Member

@nickethier nickethier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good Dani! Don't forget to update the CHANGELOG either in this PR or post merge.

command/agent/consul/client.go Show resolved Hide resolved
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants