Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

catalog/from-k8s: Don't spin loop on watchService #33

Merged
merged 1 commit into from
Feb 5, 2019

Conversation

mattrobenolt
Copy link
Contributor

With the default: clause here, the time.After doesn't work and just
falls through to default causing a spin loop on the consul agent.

With the `default:` clause here, the time.After doesn't work and just
falls through to `default` causing a spin loop on the consul agent.
@mattrobenolt mattrobenolt changed the title syncer: Don't spin loop on watchService catalog/from-k8s: Don't spin loop on watchService Nov 21, 2018
@mattrobenolt
Copy link
Contributor Author

So for context, I'm working through some unexplained issues I'm having, and this stood out as something that was hammering our system.

But in changing this, it seems I have exaggerated my original issue I was seeing. So I'm not sure yet about merging this until I fully understand the behaviors I am seeing.

Basically, consul-k8s seems to flap services regularly when syncing from k8s -> consul. Services come in and out relatively rapidly. Maybe every minute or so, but it's not super predictable. It's also relatively unreliable to do any syncing. Sometimes it just doesn't work and takes a while for it to fix itself.

I'm investigating more here on what's actually going on since it could be something weird I'm doing, but I don't think it really is at this point. I'm happy to share more information.

@descrepes
Copy link

Hi,
I'm hiting the same issue while exposing services with NodePoprt.
There's plenty of deregisterd services in the agents's logs:

2018/11/22 00:07:45 [INFO] agent: Deregistered service "zenko-cloudserver-9f684bfcb159" 2018/11/22 00:07:46 [INFO] agent: Deregistered service "zenko-zenko-quorum-headless-6bc17f9fa8eb" 2018/11/22 00:07:46 [INFO] agent: Deregistered service "zenko-cloudserver-fa025fa76c47" 2018/11/22 00:07:46 [INFO] agent: Deregistered service "zenko-prometheus-server-headless-7257a2538875" 2018/11/22 00:07:46 [INFO] agent: Deregistered service "zenko-redis-ha-b3586c82a181" 2018/11/22 00:07:46 [INFO] agent: Deregistered service "zenko-prometheus-server-fc6bc13f27c2" 2018/11/22 00:07:46 [INFO] agent: Deregistered service "zenko-zenko-quorum-794043f1e2dd" 2018/11/22 00:12:58 [INFO] agent: Deregistered service "redis-ha-redis-ha-9867315a0434"

One of my hypothesis is that the NodePort registration is done using the Catalog API.
And unfortunately the agent automatically deregister services during the anti-entropy.
If this hypothesis is true, the NodePort registration should use the agent API instead to prevent that behavior.
This issue doesn't appear when exposing services with Loadbalencer or externalIP.

Regards.

@mattrobenolt
Copy link
Contributor Author

@descrepes You might be right. I looked into how I've done similar things outside of kubernetes in the past, and I always registered services into the local agent, not directly into the catalog.

I'm going to investigate this further today and make sure I can reproduce that behavior outside of consul-k8s.

@mattrobenolt
Copy link
Contributor Author

So according to docs:

During this synchronization, the catalog is also checked for correctness. If any services or checks exist in the catalog that the agent is not aware of, they will be automatically removed to make the catalog reflect the proper set of services and health information for that agent. Consul treats the state of the agent as authoritative; if there are any differences between the agent and catalog view, the agent-local view will always be used.

@mattrobenolt
Copy link
Contributor Author

@descrepes do you happen to be running master of the consul-helm chart? I think this is the culprit: hashicorp/consul-helm#15

The problem with this change is it changes the consul client agent to register itself with the name of the host machine. The services themselves also are registered with the Node to match the host name of the server. This makes consul thing these are services running on real client nodes. Which causes the anti-entropy stuff to kick in since it says "hey, we have a node with the same name, and this consul client says we have no services!" so it yanks it from the catalog. When the service is registered in the catalog with a node that doesn't have a paired consul client... it's treated as an external service, in which it doesn't do anti-entropy to remove it. Since it's assumed to be a fake node.

@descrepes
Copy link

@mattrobenolt yes i'm running with that.

I will try to make some tests on my cluster without registering the agent with the Node IP.

@descrepes
Copy link

I removed the -node="${NODE}" \, but unfortunately no services where registered at all.

@mattrobenolt
Copy link
Contributor Author

My guess is the old nodes are still registered under the member list as left, but they haven't actually been removed from the cluster. In my case, I'm just poking around with new stuff, so I just removed the consul cluster, and rebuilt it with -node= removed, and things work just fine now, especially since I have a better understanding of how these things work.

@descrepes
Copy link

Well i dropped my actual cluster and bootstrapped a fresh one without the -node= and everything is working well now.

Copy link
Contributor

@adilyse adilyse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great find, thanks! It definitely fixes the performance of the sync process. The other issue about services being de-registered that you mention in the comments should be fixed now that #63 is merged.

Since you created this, HashiCorp has implemented a Contributor License Agreement (more info available here. I'll merge this as soon as you've had a change to sign the agreement!

Thanks!

@mattrobenolt
Copy link
Contributor Author

@adilyse From what I read, it says a bot will comment with information to sign the CLA. Aside from that, I see no way to sign it. :(

But it's just a single line change, feel free to just take it without crediting me.

@adilyse
Copy link
Contributor

adilyse commented Feb 5, 2019

Try this link.

@mattrobenolt
Copy link
Contributor Author

Thanks! Done. The check hasn't cleared up, but the CLA is signed on my end.

@adilyse adilyse merged commit 3d0438e into hashicorp:master Feb 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants