
CockroachDB cannot be meshed (Cluster-communication breaks) #11281

Closed
sebschlue opened this issue Aug 22, 2023 · 9 comments
Comments

@sebschlue

What is the issue?

Hello :)

I deploy linkerd 2.13.5 in HA-mode

linkerd install --crds ...
linkerd install --ha ...
linkerd viz install --ha ...

Then I set a namespace to policy "all-authenticated":

kubectl annotate namespace development config.linkerd.io/default-inbound-policy=all-authenticated

Then I deploy CockroachDB-Cluster via helm chart with default values:

helm upgrade --install cockroachdb cockroachdb/cockroachdb --version 11.1.3 --namespace development

CockroachDB-Cluster works fine afterwards.

Then I try to perform linkerd injection:

kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -

The rollout process gets stuck because the first restarted pod does not become ready, so I manually restart the other pods.

But even after all pods have been restarted and contain linkerd init- and sidecar-containers, CockroachDB-Cluster does not work anymore - nodes cannot reach each other:

E230801 17:25:57.509642 927 2@rpc/context.go:2404 ⋮ [T1,n1,rnode=2,raddr=‹cockroachdb-2.cockroachdb.development.svc.cluster.local:26257›,class=default,rpc] 108  unable to connect (is the peer up and reachable?): initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: authentication handshake failed: EOF"› [code 14/Unavailable]

The linkerd-proxy sidecar container does not log anything related to port 26257. (It only logs unauthorized connection attempts from Prometheus to port 8080, which is correct, but unrelated to the CockroachDB cluster-communication issue.)

Also

linkerd viz tap -n development sts/cockroachdb

does not show anything related to port 26257.

I don't know how to further debug this issue.

I have tried to set annotation

config.linkerd.io/opaque-ports: 26257,8080

but this did not change anything.
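For reference, a minimal sketch of applying the annotation to the StatefulSet's pod template rather than only at the namespace or object level, so the injected proxy picks it up on the next rollout (the patch shape is an assumption based on the standard Linkerd annotation; values are from this issue):

```shell
# Put the opaque-ports annotation on the pod template, then restart so the
# re-injected proxies treat 26257 and 8080 as opaque TCP.
kubectl -n development patch sts cockroachdb --type merge -p '
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/opaque-ports: "26257,8080"
'
kubectl -n development rollout restart sts cockroachdb
```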

The Kubernetes cluster is EKS 1.26.6.

Can anybody give me a hint on how to further debug this issue?

Thanks in advance :)

How can it be reproduced?

Set namespace to policy "all-authenticated":

kubectl annotate namespace development config.linkerd.io/default-inbound-policy=all-authenticated

Deploy CockroachDB-Cluster via helm chart with default values:

helm upgrade --install cockroachdb cockroachdb/cockroachdb --version 11.1.3 --namespace development

Perform linkerd injection:

kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -

As the rollout process gets stuck, delete the remaining two pods so that all three pods run with the linkerd sidecar.

CockroachDB pods remain unhealthy and end up in a crash loop, logging "unable to connect (is the peer up and reachable?)".

Logs, error output, etc

No error logs from linkerd as far as I can see, and no logs regarding blocked packets on cluster port 26257.

output of linkerd check -o short

$ linkerd check -o short
Status check results are √

Environment

AWS EKS: v1.26.7-eks-2d98532

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None


stale bot commented Nov 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 22, 2023
@hawkw
Contributor

hawkw commented Nov 23, 2023

If memory serves, we've seen issues like this in the past with CockroachDB due to its use of an initContainer which must communicate over the network as part of the startup process. Because initContainers run before the Linkerd proxy, they are unmeshed, and an all-authenticated policy will deny traffic from those init containers, because their traffic cannot be authenticated (as there is no Linkerd proxy performing mTLS on their behalf yet).

I believe that the native sidecar containers entering beta in Kubernetes 1.29 may resolve issues like this, since they could be used to allow Linkerd proxies to start up before other initContainers run, allowing initContainer traffic to be meshed. Linkerd added support for native sidecar containers in today's edge release, edge-23.11.4 (see PR #11465), so this issue may be fixed for edge-23.11.4 running on Kubernetes 1.28. It's also possible that additional steps are necessary in order to use this new, beta functionality in Kubernetes to resolve this issue --- perhaps @alpeb knows more about this?
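A minimal sketch of opting a workload into the native sidecar behavior described above, assuming a Linkerd release with that support (edge-23.11.4 or later) on a Kubernetes version with the SidecarContainers feature available; the annotation is the alpha one introduced alongside that work:

```shell
# Run the Linkerd proxy as a native (restartable init) sidecar so it starts
# before the workload's regular initContainers.
kubectl -n development patch sts cockroachdb --type merge -p '
spec:
  template:
    metadata:
      annotations:
        config.alpha.linkerd.io/proxy-enable-native-sidecar: "true"
'
```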

@alpeb
Member

alpeb commented Nov 23, 2023

I'm not very familiar with the past CockroachDB issues, but by inspecting this example I see the sts' init container is just a shell command that doesn't hit the network. OTOH, when sts pods are rolled out they need to keep connecting to one another in order to become ready: e.g. cockroachdb-0 is the first to be rolled out and gets injected while cockroachdb-1 and cockroachdb-2 remain uninjected, but cockroachdb-0 can't receive connections from the others because of policy, so it doesn't fully start and the sts rollout gets stuck.
What you need to do here is add the all-authenticated policy to the namespace only after the sts has been injected.
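A sketch of that ordering, using the commands from the issue (note that the namespace default is captured when each pod is injected, so the tightened default only takes effect for pods created after the annotation changes):

```shell
# 1. Relax the namespace default before injecting.
kubectl annotate namespace development \
  config.linkerd.io/default-inbound-policy=all-unauthenticated --overwrite

# 2. Inject the StatefulSet and wait for the rollout to finish.
kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -
kubectl -n development rollout status sts cockroachdb

# 3. Tighten the default again once all pods are meshed.
kubectl annotate namespace development \
  config.linkerd.io/default-inbound-policy=all-authenticated --overwrite
```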

@stale stale bot removed the wontfix label Nov 23, 2023
@sebschlue
Author

Thank you very much for the reply :)

@alpeb I was already aware of this issue and manually deleted the two remaining pods so that all three run with the linkerd sidecar container. They still cannot talk to each other. Also, NO policy violation is logged. Please re-read my original description for more details.

@sebschlue
Author

I still believe this is an actual bug. If you want me to re-test with latest version of linkerd or do some more analysis, just tell me what to do :)

@sebschlue
Author

I have also gained some experience with issues related to other init containers that are created via webhook injection. I have this issue with Hashicorp's Vault Agent Injector Webhook in combination with Linkerd Init- and Sidecar-container injection. But with CockroachDB, the issue is not about init containers.

With CockroachDB, the main containers just simply cannot talk to each other anymore even if all of them are meshed.

@alpeb
Member

alpeb commented Nov 23, 2023

I think this is caused by cockroach's delicate consensus mechanism. From my testing, this works as long as you issue node drain on a pod before bouncing it so it becomes injected. This has to be done one by one.
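One way that one-by-one drain-and-bounce procedure could look as a script. This is a hedged sketch only: the container name (`db`), the certs directory, and the `--self` flag are assumptions based on the CockroachDB Helm chart's defaults and recent `cockroach` CLI versions, and may need adjusting:

```shell
# Drain each CockroachDB node before deleting its pod, one at a time, so the
# consensus group stays healthy while pods come back meshed.
for i in 2 1 0; do
  kubectl -n development exec "cockroachdb-$i" -c db -- \
    ./cockroach node drain --self --certs-dir=/cockroach/cockroach-certs
  kubectl -n development delete pod "cockroachdb-$i"
  kubectl -n development wait pod "cockroachdb-$i" \
    --for=condition=Ready --timeout=5m
done
```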

@sebschlue
Author

@alpeb

Do you maybe have a log or script that I can use to reproduce what you did to get it working?

Have you tried to reproduce what I did by following the steps in my original description?


stale bot commented Feb 23, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 23, 2024
@stale stale bot closed this as completed Mar 10, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 10, 2024