
Service maintenance mode check removed by nomad agent from consul #4537

Closed
i-prudnikov opened this issue Jul 27, 2018 · 3 comments · Fixed by #5536

Comments

@i-prudnikov

i-prudnikov commented Jul 27, 2018

Nomad & Consul version

Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)
Consul v1.2.0 Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Operating system and Environment details

Server - Windows 2016 Datacenter, run on VM
Agent - Windows 2016 Datacenter, inside container microsoft/windowsservercore:latest.

Issue

Service maintenance mode check removed by nomad agent from consul. It is probably a bug related to #4170.

We have the following setup:
nomad agent and consul agent are installed on the same VM.
When running a nomad job, it successfully registers a service and its health checks, and they are visible from both the consul agent and the server.
Then we switch the service registered by nomad into maintenance mode by calling the agent API directly (or via the command line).

consul maint -enable -service=_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw -reason "Testing"
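For reference, the same switch can be made programmatically through the agent API (the consul log below shows the resulting PUT /v1/agent/service/maintenance/... request). A minimal sketch using the official Consul API client for Go, with the service ID taken from this report and error handling kept minimal:

package main

import (
    "log"

    consulapi "github.com/hashicorp/consul/api"
)

func main() {
    // Connect to the local consul agent (127.0.0.1:8500 by default).
    client, err := consulapi.NewClient(consulapi.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Equivalent of `consul maint -enable -service=... -reason "Testing"`;
    // this issues PUT /v1/agent/service/maintenance/<service-id>?enable=true&reason=Testing.
    serviceID := "_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw"
    if err := client.Agent().EnableServiceMaintenance(serviceID, "Testing"); err != nil {
        log.Fatal(err)
    }
}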

For some time the service is shown in consul as it should be, in maintenance mode, and then it is switched back to the normal state.
Studying the logs of both the nomad and consul agents shows explicitly that the consul agent receives a request from localhost to de-register the service maintenance health check:

2018/07/27 10:44:12 [INFO] agent: Service "_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw" entered maintenance mode
2018/07/27 10:44:12 [DEBUG] agent: Service "_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw" in sync
2018/07/27 10:44:12 [DEBUG] agent: Check "_service_maintenance:_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw" in sync
2018/07/27 10:44:12 [DEBUG] http: Request PUT /v1/agent/service/maintenance/_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw?enable=true&reason=Testing (42.0073ms) from=127.0.0.1:62209
...
...
2018/07/27 10:44:40 [DEBUG] http: Request PUT /v1/agent/check/deregister/_service_maintenance:_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw (21.0081ms) from=127.0.0.1:52544

And it looks like this is triggered by the nomad agent.
The corresponding log from the nomad agent:

2018/07/27 10:44:39.764948 [DEBUG] http: Request GET /v1/agent/health?type=client (997.6µs)
2018/07/27 10:44:40.152101 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 1 checks

I have a feeling that nomad should not remove maintenance mode checks from services in consul in this case, though the rest should still be synced as it works now, according to #4170.

The tests also showed that if the node as a whole is set to maintenance mode, it remains in that state until it is explicitly taken out of maintenance mode.

P.S. There are no traces of this at all in the consul server or nomad server logs.

Reproduction steps

  1. Run the consul agent and the nomad agent on the same VM. Point the nomad agent to consul on localhost: consul-address=127.0.0.1:8500. It does not matter whether consul runs in dev mode or the consul agent connects to a server.
  2. Run a nomad job with a service registration.
  3. Set the service registered by nomad into maintenance mode.
  4. After some time the maintenance mode health check is automatically removed (a sketch for observing this follows this list).
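One way to observe step 4 without watching the UI is to poll the local consul agent's check list and wait for the maintenance check to disappear. A rough sketch using the Consul API client for Go; the check ID is the one from the consul agent log above, and the polling interval is arbitrary:

package main

import (
    "fmt"
    "log"
    "time"

    consulapi "github.com/hashicorp/consul/api"
)

func main() {
    client, err := consulapi.NewClient(consulapi.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Maintenance check ID as it appears in the consul agent log above.
    checkID := "_service_maintenance:_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw"

    for {
        // GET /v1/agent/checks lists all checks registered with the local agent.
        checks, err := client.Agent().Checks()
        if err != nil {
            log.Fatal(err)
        }
        if _, ok := checks[checkID]; !ok {
            fmt.Println("maintenance check has been removed")
            return
        }
        fmt.Println("maintenance check still present")
        time.Sleep(5 * time.Second)
    }
}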

Nomad Client logs (if appropriate)

2018/07/27 10:44:39.764948 [DEBUG] http: Request GET /v1/agent/health?type=client (997.6µs)
2018/07/27 10:44:40.152101 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 1 checks
@preetapan
Contributor

@i-prudnikov Thanks for the details and reproduction steps; I confirmed this behavior as well.

As part of #4170 we made an assumption that any checks registered on behalf of Nomad tasks are only created and managed by Nomad, so we remove extraneous checks that Nomad is not aware of. This plays badly with maintenance mode, which caused the behavior you saw.

We'll fix this in an upcoming release so that maintenance mode works. In general, any out-of-band registered checks for services that Nomad manages will still get removed; i.e. if you want to register any checks, use the service stanza in Nomad to do so. Maintenance mode is a special case, so we will fix that.
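For context, the maintenance mode health check is recognizable by its check ID, which appears in the consul agent log above as "_service_maintenance:<service-id>". A minimal, purely illustrative sketch of the kind of filter a sync step could apply before deregistering checks it does not recognize; this is not Nomad's actual code, and the node-level ID "_node_maintenance" is an assumption about Consul's convention:

package main

import (
    "fmt"
    "strings"
)

// isConsulMaintenanceCheck reports whether a check ID looks like one of
// Consul's built-in maintenance mode checks, which a sync loop should
// leave alone rather than treat as an unknown check to deregister.
func isConsulMaintenanceCheck(checkID string) bool {
    // Per-service maintenance checks, as seen in the agent log above.
    if strings.HasPrefix(checkID, "_service_maintenance:") {
        return true
    }
    // Node-level maintenance check (assumed ID; set by `consul maint` on the whole node).
    return checkID == "_node_maintenance"
}

func main() {
    fmt.Println(isConsulMaintenanceCheck("_service_maintenance:_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw")) // true
    fmt.Println(isConsulMaintenanceCheck("my-regular-http-check"))                                             // false
}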

@i-prudnikov
Author

@preetapan thank you for the fast reply! Will wait for the next nomad release.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022