
CockroachDB cannot be meshed (Cluster-communication breaks) #11281

Closed
sebschlue opened this issue Aug 22, 2023 · 9 comments
Comments

@sebschlue

What is the issue?

Hello :)

I deploy linkerd 2.13.5 in HA-mode

linkerd install --crds ...
linkerd install --ha ...
linkerd viz install --ha ...

Then I set a namespace to policy "all-authenticated":

kubectl annotate namespace development config.linkerd.io/default-inbound-policy=all-authenticated

Then I deploy CockroachDB-Cluster via helm chart with default values:

helm upgrade --install cockroachdb cockroachdb/cockroachdb --version 11.1.3 --namespace development

CockroachDB-Cluster works fine afterwards.

Then I try to perform linkerd injection:

kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -

The rollout process gets stuck because the first restarted pod does not become ready, so I manually restart the other pods.

But even after all pods have been restarted and contain linkerd init- and sidecar-containers, CockroachDB-Cluster does not work anymore - nodes cannot reach each other:

E230801 17:25:57.509642 927 2@rpc/context.go:2404 ⋮ [T1,n1,rnode=2,raddr=‹cockroachdb-2.cockroachdb.development.svc.cluster.local:26257›,class=default,rpc] 108  unable to connect (is the peer up and reachable?): initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: authentication handshake failed: EOF"› [code 14/Unavailable]

The linkerd-proxy sidecar container does not log anything related to port 26257. (It only logs unauthorized connection attempts from Prometheus to port 8080, which is correct, but unrelated to the CockroachDB cluster-communication issue.)

Also

linkerd viz tap -n development sts/cockroachdb

does not show anything related to port 26257.

I don't know how to further debug this issue.

I have tried to set annotation

config.linkerd.io/opaque-ports: 26257,8080

but this did not change anything.
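For reference, a minimal sketch of applying the annotation to the StatefulSet's pod template rather than only at the namespace or object level, so the injected proxy picks it up on the next rollout (the patch shape is an assumption based on the standard Linkerd annotation; values are from this issue):

```shell
# Put the opaque-ports annotation on the pod template, then restart so the
# re-injected proxies treat 26257 and 8080 as opaque TCP.
kubectl -n development patch sts cockroachdb --type merge -p '
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/opaque-ports: "26257,8080"
'
kubectl -n development rollout restart sts cockroachdb
```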

The Kubernetes cluster is EKS 1.26.6.

Can anybody give me a hint on how to further debug this issue?

Thanks in advance :)

How can it be reproduced?

Set namespace to policy "all-authenticated":

kubectl annotate namespace development config.linkerd.io/default-inbound-policy=all-authenticated

Deploy CockroachDB-Cluster via helm chart with default values:

helm upgrade --install cockroachdb cockroachdb/cockroachdb --version 11.1.3 --namespace development

Perform linkerd injection:

kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -

As the rollout process gets stuck, delete the remaining two pods so that all three pods run with the linkerd sidecar.

CockroachDB pods remain unhealthy and end up in a crash loop, logging "unable to connect (is the peer up and reachable?)".

Logs, error output, etc

No error logs from linkerd as far as I can see, and no logs regarding blocked packets on cluster port 26257.

output of linkerd check -o short

$ linkerd check -o short
Status check results are √

Environment

AWS EKS: v1.26.7-eks-2d98532

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None


stale bot commented Nov 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 22, 2023
@hawkw
Contributor

hawkw commented Nov 23, 2023

If memory serves, we've seen issues like this in the past with CockroachDB due to its use of an initContainer which must communicate over the network as part of the startup process. Because initContainers run before the Linkerd proxy, they are unmeshed, and an all-authenticated policy will deny traffic from those init containers, because their traffic cannot be authenticated (as there is no Linkerd proxy performing mTLS on their behalf yet).

I believe that the native sidecar containers entering beta in Kubernetes 1.29 may resolve issues like this, since they could be used to allow Linkerd proxies to start up before other initContainers run, allowing initContainer traffic to be meshed. Linkerd added support for native sidecar containers in today's edge release, edge-23.11.4 (see PR #11465), so this issue may be fixed for edge-23.11.4 running on Kubernetes 1.28. It's also possible that additional steps are necessary in order to use this new, beta functionality in Kubernetes to resolve this issue --- perhaps @alpeb knows more about this?
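A minimal sketch of opting a workload into the native sidecar behavior described above, assuming a Linkerd release with that support (edge-23.11.4 or later) on a Kubernetes version with the SidecarContainers feature available; the annotation is the alpha one introduced alongside that work:

```shell
# Run the Linkerd proxy as a native (restartable init) sidecar so it starts
# before the workload's regular initContainers.
kubectl -n development patch sts cockroachdb --type merge -p '
spec:
  template:
    metadata:
      annotations:
        config.alpha.linkerd.io/proxy-enable-native-sidecar: "true"
'
```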

@alpeb
Member

alpeb commented Nov 23, 2023

I'm not very familiar with the past CockroachDB issues, but by inspecting this example I see the sts' init container is just a shell command that doesn't hit the network. OTOH, when sts pods are rolled out they need to keep connecting to one another in order to become ready: e.g. cockroachdb-0 is the first to be rolled out and gets injected while cockroachdb-1 and cockroachdb-2 remain uninjected, but cockroachdb-0 can't receive connections from the others because of policy, so it doesn't fully start and the sts rollout gets stuck.
What you need to do here is add the all-authenticated policy to the namespace only after the sts has been injected.
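A sketch of that ordering, using the commands from the issue (note that the namespace default is captured when each pod is injected, so the tightened default only takes effect for pods created after the annotation changes):

```shell
# 1. Relax the namespace default before injecting.
kubectl annotate namespace development \
  config.linkerd.io/default-inbound-policy=all-unauthenticated --overwrite

# 2. Inject the StatefulSet and wait for the rollout to finish.
kubectl -n development get sts cockroachdb -o yaml | linkerd inject - | kubectl apply -f -
kubectl -n development rollout status sts cockroachdb

# 3. Tighten the default again once all pods are meshed.
kubectl annotate namespace development \
  config.linkerd.io/default-inbound-policy=all-authenticated --overwrite
```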

@stale stale bot removed the wontfix label Nov 23, 2023
@sebschlue
Author

Thank you very much for the reply :)

@alpeb I was already aware of this issue and manually deleted the two remaining pods so that all three run with the linkerd sidecar container. They still cannot talk to each other. Also, NO policy violation is logged. Please re-read my original description for more details.

@sebschlue
Author

I still believe this is an actual bug. If you want me to re-test with latest version of linkerd or do some more analysis, just tell me what to do :)

@sebschlue
Author

I have also gained some experience with issues related to other init containers that are created via webhook injection. I have this issue with Hashicorp's Vault Agent Injector Webhook in combination with Linkerd Init- and Sidecar-container injection. But with CockroachDB, the issue is not about init containers.

With CockroachDB, the main containers just simply cannot talk to each other anymore even if all of them are meshed.

@alpeb
Member

alpeb commented Nov 23, 2023

I think this is caused by cockroach's delicate consensus mechanism. From my testing, this works as long as you issue node drain on a pod before bouncing it so it becomes injected. This has to be done one by one.
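One way that one-by-one drain-and-bounce procedure could look as a script. This is a hedged sketch only: the container name (`db`), the certs directory, and the `--self` flag are assumptions based on the CockroachDB Helm chart's defaults and recent `cockroach` CLI versions, and may need adjusting:

```shell
# Drain each CockroachDB node before deleting its pod, one at a time, so the
# consensus group stays healthy while pods come back meshed.
for i in 2 1 0; do
  kubectl -n development exec "cockroachdb-$i" -c db -- \
    ./cockroach node drain --self --certs-dir=/cockroach/cockroach-certs
  kubectl -n development delete pod "cockroachdb-$i"
  kubectl -n development wait pod "cockroachdb-$i" \
    --for=condition=Ready --timeout=5m
done
```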

@sebschlue
Author

@alpeb

Do you maybe have a log or script that I can use to reproduce what you did to get it working?

Have you tried to reproduce what I did by following the steps in my original description?


stale bot commented Feb 23, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 23, 2024
@stale stale bot closed this as completed Mar 10, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 10, 2024