Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.8.3 upgrade seems to make consul checks unstable. #4256

Closed
apenney opened this issue May 5, 2018 · 25 comments
Closed

0.8.3 upgrade seems to make consul checks unstable. #4256

apenney opened this issue May 5, 2018 · 25 comments

Comments

@apenney
Copy link

apenney commented May 5, 2018

Nomad version

Nomad v0.8.3 (c85483d)

Operating system and Environment details

Ubuntu 16.04

Issue

I'm in the process of doing a 0.7 -> 0.8.3 upgrade and I've run into a weird situation. We have some services that seem stable from the point of view of nomad (they've been running in docker for an hour now) but the service check inside consul seems to get created and deregistered over and over.

Example of the check:

			service {
				port = "http"

				tags = [ "urlprefix-service.production.example.com/" ]

				check {
					type     = "http"
					path     = "/status"
					interval = "5s"
					timeout  = "1s"
				}
			}

When I watch the consul UI I can see the check appear, eventually go green, then just disappear. I'll wait a few seconds and see it come back. From the nomad client side running the successful job I see:

May 05 20:03:39 ip-10-1-1-169 nomad[1751]: consul.sync: successfully updated services in Consul
May 05 20:04:10 ip-10-1-1-169 nomad[1751]:     2018/05/05 20:04:10.003616 [WARN] consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "2e73a1727975aa33b585da9b707d5ebbdae524e3")
May 05 20:04:10 ip-10-1-1-169 nomad[1751]: consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "2e73a1727975aa33b585da9b707d5ebbdae524e3")
May 05 20:04:11 ip-10-1-1-169 nomad[1751]:     2018/05/05 20:04:11.031347 [INFO] consul.sync: successfully updated services in Consul
May 05 20:04:11 ip-10-1-1-169 nomad[1751]: consul.sync: successfully updated services in Consul
May 05 20:04:41 ip-10-1-1-169 nomad[1751]:     2018/05/05 20:04:41.131895 [WARN] consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "ddb4c73fbaae81f13073b570fcf262278535f1ca")
May 05 20:04:41 ip-10-1-1-169 nomad[1751]: consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "ddb4c73fbaae81f13073b570fcf262278535f1ca")
May 05 20:04:42 ip-10-1-1-169 nomad[1751]:     2018/05/05 20:04:42.153094 [INFO] consul.sync: successfully updated services in Consul
May 05 20:04:42 ip-10-1-1-169 nomad[1751]: consul.sync: successfully updated services in Consul
May 05 20:05:12 ip-10-1-1-169 nomad[1751]:     2018/05/05 20:05:12.247792 [WARN] consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "4e8db1e601cc839694927a70974dedefacfeebbb")
May 05 20:05:12 ip-10-1-1-169 nomad[1751]: consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "4e8db1e601cc839694927a70974dedefacfeebbb")
May 05 20:05:15 ip-10-1-1-169 nomad[1751]:     2018/05/05 20:05:15.311213 [INFO] consul.sync: successfully updated services in Consul
May 05 20:05:15 ip-10-1-1-169 nomad[1751]: consul.sync: successfully updated services in Consul
May 05 20:06:15 ip-10-1-1-169 nomad[1751]:     2018/05/05 20:06:15.418388 [WARN] consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "2a1468b67ea9be0b2b82c09f4022fab366170bc4")
May 05 20:06:15 ip-10-1-1-169 nomad[1751]: consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "2a1468b67ea9be0b2b82c09f4022fab366170bc4")

Which kind of lines up with it disappearing unexpectedly.

Reproduction steps

This is tough. It's consul 1.0.7 with nomad 0.8.3. I'm not sure it's easy to reproduce as some services on 0.8.3 are working out. The 0.7 nodes are also still working fine. At this point I'm looking for help figuring out:

  • Is there a way to see where a deregister for a check comes from?
  • Is debug logging on the nomad client enough to show me if it's the one killing the check?
  • Any other advice? :(
@apenney
Copy link
Author

apenney commented May 5, 2018

Something that may be related:

https://gist.github.com/apenney/f09f90177c7ba168b8f120b299729961

The first example (-production) is working fine. I notice the address/taggedaddress/serviceaddress all match, and are of the nomad instances themselves.

In the second example (-uat) the address/taggedaddress is of one of the consul servers itself, and then the serviceAddress is set correctly. Could this be related? I don't know how nomad would be able to make a check that had the address of a consul server in the first place.

@apenney
Copy link
Author

apenney commented May 5, 2018

I think my last comment was a red herring. I do notice:

[
  {
    "ID": "b603650b-98dd-bb71-fd84-b5da6a3a9391",
    "Node": "i-0934a022d3a85c672",
    "Address": "10.50.2.20",
    "Datacenter": "us-east-1",
    "TaggedAddresses": {
      "lan": "10.50.2.20",
      "wan": "10.50.2.20"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "_nomad-task-vmvg5psbe7dnoqulo775qdjdxsf5q5js",
    "ServiceName": "hashi-ui-web-hashi-ui",
    "ServiceTags": [
      "urlprefix-hashi-ui.management.cotahealthcare.lan/"
    ],
    "ServiceAddress": "10.50.1.215",
    "ServiceMeta": {},
    "ServicePort": 20860,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 43809134,
    "ModifyIndex": 43809134
  }
oot@ip-10-1-1-169:~# curl http://10.1.1.169:8500/v1/catalog/service/hashi-ui-web-hashi-ui | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0    722      0 --:--:-- --:--:-- --:--:--  1000
[]

(wait a bit)

[
  {
    "ID": "b603650b-98dd-bb71-fd84-b5da6a3a9391",
    "Node": "i-0934a022d3a85c672",
    "Address": "10.50.2.20",
    "Datacenter": "us-east-1",
    "TaggedAddresses": {
      "lan": "10.50.2.20",
      "wan": "10.50.2.20"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "_nomad-task-vmvg5psbe7dnoqulo775qdjdxsf5q5js",
    "ServiceName": "hashi-ui-web-hashi-ui",
    "ServiceTags": [
      "urlprefix-hashi-ui.management.cotahealthcare.lan/"
    ],
    "ServiceAddress": "10.50.1.215",
    "ServiceMeta": {},
    "ServicePort": 20860,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 43810354,
    "ModifyIndex": 43810354
  }
]

So it's definitely deregistering and reregistering the service over and over for no reason.

@apenney
Copy link
Author

apenney commented May 5, 2018

In debug mode one of the clients exihibiting this issue is doing this -constantly-:

May 05 22:54:43 ip-10-50-1-215 nomad[3425]:     2018/05/05 22:54:43.718930 [DEBUG] client: state changed, updating node and re-registering.
May 05 22:54:43 ip-10-50-1-215 nomad[3425]: client: state changed, updating node and re-registering.

I feel like it's weird that it would need to re-register constantly, but this may be normal behavior.

@apenney
Copy link
Author

apenney commented May 6, 2018

I tried a bunch of other stuff tonight:

  • Moved my whole consul cluster (and all members) to 1.0.7.
  • Moved the consul servers back to 1.0.6 on a suggestion from slack.
  • Upgraded 100% of my nomad agents to 0.8.3
  • Downgraded one agent to 0.7.1 (the service continued to register/deregister)
  • Made sure I upgraded my raft protocol to 3 in consul (because.. why not)
  • Bumped raft protocol to 2 for nomad 0.8 servers.
  • Stared at numerous logs, traces, kept doing curl's over and over for the services only to see them appear and disappear with new index numbers every time.

Not sure what else to try anymore.

@apenney
Copy link
Author

apenney commented May 6, 2018

I did notice from staring at sysdig traces:

------ Read 252B from 10.30.3.123:36470->10.50.3.8:8500 (consul)
PUT /v1/agent/service/deregister/_nomad-task-5e24ujna6dkjxcdcrykyt4fhr3qyqae2

10.30.3.123 is a UAT box (so node.class = uat).

{“ID”:“_nomad-task-5e24ujna6dkjxcdcrykyt4fhr3qyqae2",“Name”:“tricorder-staging-
web-tricorder-staging”,“Tags”:[“urlprefix-tricorder.staging.example.com/“],“Port”:9090,“Address”:“172.17.0.2”,“Check”:null,“Checks”:null
}.

But it's deregistering a service that was registered by a totally different staging box. It's like they are all fighting with each other.

@apenney
Copy link
Author

apenney commented May 6, 2018

I just downgraded the entire cluster to 0.8.1, hoping #4170 was involved. Unfortunately, while at first it seemed to help, it then fell apart and started deregistering services again.

@apenney
Copy link
Author

apenney commented May 7, 2018

I downgraded all nodes to 0.7.1 which stopped the service churn in Consul. The masters are still 0.8.3. This seems to be a nomad bug.

@apenney
Copy link
Author

apenney commented May 7, 2018

From someone on slack, who's been trying to help diagnose:

Aha yep, these are correlated to the second on our deploys; we do canary + promote, and it's at the promotion step where we see the rpc errors

May  5 23:19:42 somehost consul: 2018/05/05 23:19:42 [ERR] consul: "Catalog.Deregister" RPC failed to server 1.2.3.4:8300: rpc error making call: rpc error making call: Unknown check 'e64810a18e7d5ec394854d48d2bdc683138416aa'
May  5 23:19:42 somehost consul: 2018/05/05 23:19:42 [INFO] agent: Deregistered check 'e64810a18e7d5ec394854d48d2bdc683138416aa'
May  5 23:19:42 somehost consul: 2018/05/05 23:19:42 [ERR] http: Request PUT /v1/agent/check/deregister/e64810a18e7d5ec394854d48d2bdc683138416aa, error: Unknown check "e64810a18e7d5ec394854d48d2bdc683138416aa" from=127.0.0.1:57442
May  5 23:19:42 somehost nomad: 2018/05/05 23:19:42.292765 [WARN] consul.sync: failed to update services in Consul: Unexpected response code: 500 (Unknown check "e64810a18e7d5ec394854d48d2bdc683138416aa")
May  5 23:19:43 somehost nomad: 2018/05/05 23:19:43.293714 [INFO] consul.sync: successfully updated services in Consul

Hmm I wonder if there's a problem with the order of operations? This strikes me as a potentially benign error

  1. De-register service
  2. De-register check

Does de-registering the service de-register the check at the same time, making the second step redundant?

Is it complainy that it can't de-register a check that has already gone away due to the first step?
Though I guess there's also potentially a big difference between your case and ours; in our case, these are only popping up when deploys happen, and if so the log messages are a bit annoying but not actually harmful
I'm not sure if you're seeing the same only during deploys, or if the catalog churning is happening all the time

wolfman [11:48 PM]
Pretty sure this is the thing: https://github.com/hashicorp/consul/blob/v1.0.0/agent/local.go#L593-L650 (edited)
Loops through all services and conditionally deletes some; then loops through all checks and conditionally deletes some
I'm not smart enough to figure out if there's a corner case here but I'm 99% certain that syncChanges is the starting point where these catalog messages you're seeing are coming from

wolfman [11:56 PM]
Ok this is looking pretty certain https://github.com/hashicorp/consul/blob/v1.0.0/agent/consul/fsm.go#L153-L162 (edited)
syncChanges at some point calls deleteService which somewhat indirectly calls applyDeregister
applyDeregister deletes the service and checks (edited)
syncChanges continues on its way and then calls deleteCheck but the check is already gone
syncChanges also seems to lock localstate at the top of the function which perhaps means(?) it doesn't know the check is already gone

wolfman [12:00 AM]
I don't really know golang and I'm diving down into a code base that I don't know/understand so I'm going to stop my grasping at straws there; if this is working the way I think it's working though, it doesn't look too dangerous if it's occurring on deploys
If it's occurring frequently on its own without anything changing, then perhaps a service is churning, and these deregistration errors you're seeing are perhaps symptoms of that problem, and not the actual problem itself (edited)

@preetapan
Copy link
Contributor

@apenney Thanks for reporting this and attempting to debug. Wanted to note a few things:

Re:But it's deregistering a service that was registered by a totally different staging box. It's like they are all fighting with each other.
Nomad uses the task id/alloc id as part of the service ID when registering into Consul, so the above should definitely not happen.

Re: Could this be related? I don't know how nomad would be able to make a check that had the address of a consul server in the first place.`
That can only happen if your infrastructure had a nomad client node running on the same host/IP as a Consul server.

The comment trail above is pretty long, I will try to reproduce this locally and post an update by tomorrow. Don't want to rule out #4170 too early, we are syncing services and checks with Consul regularly in Nomad now and there might be a race condition with that.

@apenney
Copy link
Author

apenney commented May 7, 2018

I did suspect #4170 but (buried in the giant wall of text) I mentioned we rolled the entire cluster to 0.8.1 to try and roll that out.

@apenney
Copy link
Author

apenney commented May 7, 2018

Quick update:

Using the same consul cluster I made a new nomad cluster, with new nodes/servers (completely separate from the other cluster) and deployed a single service. It immediately started deleting/recreating services over and over.

It may just be interference from the other nodes as they talk to the same consul (different "datacenter" name for the new cluster, along with different was hoping this would rule out cross talk). Just a data point.

@apenney
Copy link
Author

apenney commented May 8, 2018

Further woes:

  • I built a brand new 0.8.3 cluster, with no history or state. I redeployed all our services from scratch. The same problem, services registering/deregistering.
  • I built a brand new 0.7.1 cluster, with no history or state. I redeployed all our services from scratch. Everything worked!

I went to bed, woke up, everything was broken and no consul services existed in two of the three environments we run in the 0.7.1 cluster. All the containers were working. I restarted nomad on one of the three production boxes and all the prod services registered themselves again. Same for UAT.

I don't know anymore.

@apenney
Copy link
Author

apenney commented May 8, 2018

I... may have it. We had our consul {} section with an address of consul.service.consul instead of 127.0.0.1:8500. Since rebuilding the cluster (again) with this changed to localhost, the fighting over services I was seeing seems to have stopped. I'll wait the rest of the day to confirm we've actually helped it, but this does seem to be involved.

Talking directly to the masters instead of the local agent seemed to cause problems.

@dadgar
Copy link
Contributor

dadgar commented May 9, 2018

@apenney Your latest comment seems very likely to be the issue. You should have the agents pointing to the local consul agent. As you said give it a day but if this is the root cause, which is likely, would you mind closing this issue.

@apenney
Copy link
Author

apenney commented May 9, 2018

We haven't had any issues since. We should add some serious documentation additions about the requirement to point at a local consul node and not a remote one, as the docs only really reference localhost as a default and not a mandatory requirement.

@apenney apenney closed this as completed May 9, 2018
@dadgar
Copy link
Contributor

dadgar commented May 9, 2018

@apenney Would appreciate either a PR or filing an issue for this. Obviously would prefer a PR

@karlgrz
Copy link

karlgrz commented May 10, 2018

I'm experiencing this same issue trying to experiment with consul and nomad locally. I'm just running a consul node in docker in dev mode and can confirm the constant churn of the nomad services in consul even though the nomad job containers seem to be stable and all the health checks should be passing. What was your resolution here, @apenney?

@karlgrz
Copy link

karlgrz commented May 10, 2018

Ah, forgive my ignorance, I had only run the single server node in dev mode and did not run the agents alongside it, my mistake.

@karlgrz
Copy link

karlgrz commented May 10, 2018

Ok, that didn't resolve it either. Even with a 3 node consul cluster locally the services keep registering and deregistering.

@dadgar
Copy link
Contributor

dadgar commented May 10, 2018

@karlgrz You should have your Nomad clients talking to a local Consul agent. You want to avoid the Nomad clients all talking to the same agent. The Nomad clients are doing a diff between what it wants registered and what the consul agent has. So if you have multiple nodes all pointing at the same Consul agent you will see this flapping.

@karlgrz
Copy link

karlgrz commented May 10, 2018

@dadgar thanks for the comments. That's interesting as I can replicate this consistently with a single consul node, a single Nomad server, and a single Nomad client. This is all running locally, consul running in a docker container, the Nomad server and client running on my host. The health checks all look to be passing according to consul, but then Nomad client deregisters all of them and they disappear from consul completely for a time, which is basically rendering any kind of reliable consul-template service discovery useless. I'd be happy to supply configurations I'm using if it would be helpful, could be PEBKAC with my configurations or something but it seems to be wired up properly. Maybe I'm just missing something obvious.

@karlgrz
Copy link

karlgrz commented May 10, 2018

@dadgar I can reproduce this just by spinning up the redis job generated by:

nomad job init

@dadgar
Copy link
Contributor

dadgar commented May 10, 2018

@karlgrz Ah you are still pointing two agents at the same consul. The Nomad Server and the Client. I am assuming what is flapping is the Nomad agent health checks.

If you want to run a server/client on the same host and have it talk to the same consul, run nomad agent -dev

@karlgrz
Copy link

karlgrz commented May 10, 2018

@dadgar ah ha! Success! Thank you so much! That seems to be holding up well.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants