CockroachDB cannot be meshed (Cluster-communication breaks) #11281
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
If memory serves, we've seen issues like this in the past with CockroachDB due to its use of an init container. I believe that the native sidecar containers entering beta in Kubernetes 1.29 may resolve issues like this, since they could be used to allow Linkerd proxies to start up before other […]
I'm not very familiar with the past CockroachDB issues, but by inspecting this example I see the sts' init container is just a shell command that doesn't hit the network. OTOH, when the sts pods are rolled out, they need to keep connecting to one another in order to become ready: e.g. cockroachdb-0 is the first to be rolled out and gets injected while cockroachdb-1 and cockroachdb-2 remain uninjected, but cockroachdb-0 can't receive connections from the others because of policy, so it doesn't fully start and the sts rollout process gets stuck.
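The rollout deadlock described in this comment can be illustrated with a toy model. This is only a sketch of the reasoning, not Linkerd's implementation: it assumes that a meshed pod under the "all-authenticated" policy accepts inbound connections only from meshed clients, and that an unmeshed pod enforces no policy at all. The names `Pod`, `inbound_allowed`, and `accepted_clients` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    meshed: bool  # True once the Linkerd proxy has been injected


def inbound_allowed(server: Pod, client: Pod) -> bool:
    """Toy version of an "all-authenticated" inbound policy (assumption)."""
    if not server.meshed:
        # No proxy in front of the server, so nothing enforces policy.
        return True
    # A meshed server only accepts clients that carry a mesh identity.
    return client.meshed


def accepted_clients(server: Pod, pods: list) -> list:
    """Names of the peers whose connections the server would accept."""
    return [c.name for c in pods if c != server and inbound_allowed(server, c)]


if __name__ == "__main__":
    # Mid-rollout: only the first pod has been restarted with the proxy.
    mid_rollout = [
        Pod("cockroachdb-0", meshed=True),
        Pod("cockroachdb-1", meshed=False),
        Pod("cockroachdb-2", meshed=False),
    ]
    print(accepted_clients(mid_rollout[0], mid_rollout))
```

In this model the injected pod accepts no peers while the others are still uninjected, which matches the stuck rollout. Once every pod is meshed the model accepts all peers, so it does not by itself explain a cluster that stays broken after all pods have been restarted.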
Thank you very much for the reply :) @alpeb I was already aware of this issue and manually deleted the two remaining pods so that all three run with the linkerd sidecar container. They still cannot talk to each other. Also, NO policy violation is logged. Please re-read my original description for more details.
I still believe this is an actual bug. If you want me to re-test with the latest version of linkerd or do some more analysis, just tell me what to do :)
I have also gained some experience with issues related to other init containers that are created via webhook injection. I have this issue with HashiCorp's Vault Agent Injector webhook in combination with Linkerd init- and sidecar-container injection. But with CockroachDB, the issue is not about init containers. With CockroachDB, the main containers simply cannot talk to each other anymore, even if all of them are meshed.
I think this is caused by CockroachDB's delicate consensus mechanism. From my testing, this works as long as you issue […]
Do you maybe have a log or script that I can use to reproduce what you did to get it working? Have you tried to reproduce what I did by following the steps in my original description?
What is the issue?
Hello :)
I deploy linkerd 2.13.5 in HA mode.
Then I set a namespace to policy "all-authenticated":
Then I deploy a CockroachDB cluster via the helm chart with default values:
The CockroachDB cluster works fine afterwards.
Then I try to perform linkerd injection:
The rollout process gets stuck because the first restarted pod does not become ready, so I manually restart the other pods.
But even after all pods have been restarted and contain the linkerd init and sidecar containers, the CockroachDB cluster does not work anymore - nodes cannot reach each other:
The linkerd-proxy sidecar container does not log anything related to port 26257. (It only logs unauthorized connection attempts from Prometheus to port 8080, which is correct, but unrelated to the CockroachDB cluster communication issue.)
Also
does not show anything related to port 26257.
I don't know how to further debug this issue.
I have tried to set annotation
but this did not change anything.
The k8s cluster is an EKS cluster, version 1.26.6.
Can anybody give me a hint on how to further debug this issue?
Thanks in advance :)
How can it be reproduced?
Set namespace to policy "all-authenticated":
Deploy CockroachDB-Cluster via helm chart with default values:
Perform linkerd injection:
As the rollout process gets stuck, delete the remaining two pods so that all three pods run with the linkerd sidecar.
The CockroachDB pods remain unhealthy and end up in a crash loop. The CockroachDB pods log "unable to connect (is the peer up and reachable?)".
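One way to narrow down the "unable to connect" symptom is to probe the cluster port from inside a pod. The following is a minimal sketch using only the Python standard library; the function name `probe` and its result labels are hypothetical, and it assumes Python is available wherever the probe runs. Because a proxy may accept a TCP connection before deciding whether to forward it, a bare connect can succeed even when traffic is later dropped, so the sketch also reports whether the peer closes the connection right away.

```python
import socket


def probe(host: str, port: int = 26257, hold: float = 1.0, timeout: float = 2.0) -> str:
    """Classify a TCP endpoint as 'unreachable', 'closed_by_peer' or 'held_open'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or the name did not resolve.
        return "unreachable"
    with sock:
        # Wait briefly for data: an immediate EOF or reset means the other
        # side (or something in between) dropped the connection.
        sock.settimeout(hold)
        try:
            data = sock.recv(1)
        except socket.timeout:
            # Nothing was sent and nothing was closed within `hold` seconds.
            return "held_open"
        except OSError:
            # For example, a connection reset.
            return "closed_by_peer"
        return "closed_by_peer" if data == b"" else "held_open"
```

Calling `probe` with the DNS name of each peer pod, from both a meshed and an unmeshed pod, gives a rough picture of where connections are dropped. A `held_open` result only shows that the connection was not closed within `hold` seconds; it does not prove that CockroachDB traffic would pass.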
Logs, error output, etc
No error logs from linkerd as far as I can see. No logs regarding blocked packets related to cluster port 26257.
output of
linkerd check -o short
Environment
AWS EKS: v1.26.7-eks-2d98532
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None