Endpoints Controller queuing up service registrations/deregistrations when request to agent on a terminated pod does not time out #714
Hi @dschaaff, thanks for filing. Would you be able to provide a YAML manifest, either an example or a repro, that may introduce this behavior? We have seen some issues around multi-port pods, but you may not have the same scenario or architecture. |
Here is the deployment and service manifest for one of our services. Our other microservices mirror this same setup and differ only by name and image.
```
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "30"
meta.helm.sh/release-name: example-service
meta.helm.sh/release-namespace: example-service
labels:
app.kubernetes.io/instance: example-service
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: example-service
app.kubernetes.io/version: "1.0"
helm.sh/chart: example-service-2.2.1
name: example-service
namespace: example-service
spec:
# note this is managed by HPA, scales based cpu percentage
replicas: 19
selector:
matchLabels:
app.kubernetes.io/instance: example-service
app.kubernetes.io/name: example-service
strategy:
rollingUpdate:
maxSurge: 200%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
consul.hashicorp.com/connect-inject: "true"
consul.hashicorp.com/connect-service: example-service
consul.hashicorp.com/connect-service-port: "80"
consul.hashicorp.com/service-meta-namespace: example-service
consul.hashicorp.com/service-tags: eks
labels:
app.kubernetes.io/instance: example-service
app.kubernetes.io/name: example-service
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- example-service
topologyKey: kubernetes.io/hostname
weight: 100
containers:
- args:
- agent
- -config=/etc/vault/vault-agent-config.hcl
env:
- name: VAULT_ADDR
value: https://redacted
- name: VAULT_CAPATH
value: /etc/ssl/certs
image: vault:1.6.1
imagePullPolicy: IfNotPresent
name: vault-agent-auth
resources:
limits:
cpu: 200m
requests:
cpu: 5m
memory: 150Mi
volumeMounts:
- mountPath: /etc/vault
name: vault-agent-config
readOnly: true
- mountPath: /etc/ssl/certs/vault-ca.pem
name: vault-cert
readOnly: true
subPath: vault-ca.pem
- mountPath: /home/vault
name: vault-token
- env:
- name: HOST_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
- name: K8S_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: VAULT_ADDR
value: http://127.0.0.1:8200
- name: VAULT_CAPATH
value: /etc/ssl/certs
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
image: example-service
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sleep
- "30"
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 80
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
name: example-service
ports:
- containerPort: 80
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /readinezz
port: 80
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 3
resources:
limits:
cpu: "2"
memory: 700Mi
requests:
cpu: "1"
memory: 700Mi
volumeMounts:
- mountPath: /home/vault
name: vault-token
- mountPath: /etc/vault
name: vault-agent-config
readOnly: true
dnsConfig:
options:
- name: ndots
value: "1"
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
serviceAccount: example-service
serviceAccountName: example-service
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
name: example-service-vault-agent-config
name: vault-agent-config
- configMap:
defaultMode: 420
items:
- key: vault-ca.pem
path: vault-ca.pem
name: example-service-vault-agent-config
name: vault-cert
- emptyDir:
medium: Memory
name: vault-token
---
apiVersion: v1
kind: Service
metadata:
annotations:
consul.hashicorp.com/service-tags: example-service
meta.helm.sh/release-name: example-service
meta.helm.sh/release-namespace: example-service
labels:
app.kubernetes.io/instance: example-service
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: example-service
app.kubernetes.io/version: "1.0"
helm.sh/chart: example-service-2.2.1
name: example-service
namespace: example-service
spec:
clusterIP: 172.20.239.195
clusterIPs:
- 172.20.239.195
ports:
- name: http
port: 80
protocol: TCP
targetPort: http
selector:
app.kubernetes.io/instance: example-service
app.kubernetes.io/name: example-service
sessionAffinity: None
  type: ClusterIP
```
|
I'm in the process of updating the consul client pods to disable the streaming backend. If the behavior still exists after that, then I'm going to downgrade consul-k8s back to the sidecar architecture. Unfortunately, this is very disruptive and has led to the on-call engineer getting paged quite frequently. I'm hoping to grab some more info in the meantime. |
Thanks @dschaaff, we have seen some issues when folks declare the service and service-port via annotation. You shouldn't need to do this any longer since we are able to grab this information from the Service object itself. We'd need to investigate this further, but I don't think we'll have answers for you in the immediate short term. I do wonder if removing those annotations changes anything.
|
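For illustration, here is a rough sketch of what the pod template annotations from the manifest above could look like with the service and service-port annotations dropped, relying on the Service object instead. This is an assumption based on the suggestion above, not a verified configuration:
```
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        # connect-inject should be sufficient on its own; the service name and
        # port are discovered from the example-service Service object.
        consul.hashicorp.com/connect-inject: "true"
        # consul.hashicorp.com/connect-service and
        # consul.hashicorp.com/connect-service-port removed per the suggestion above.
        consul.hashicorp.com/service-meta-namespace: example-service
        consul.hashicorp.com/service-tags: eks
```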
I had to go ahead and downgrade our prod clusters. Next week I'll try and reproduce in a test cluster. |
Just to check in, I haven't had time to work on reproducing the issue. We have only experienced it in large EKS clusters (50-100 nodes, 1000s of pods) thus far. I had to downgrade our last larger cluster due to the issues. I hope to find time to work on reproducing it outside of a high-traffic environment soon. |
I am now seeing this behavior in our staging cluster. Can you please let me know what information I should grab to help debug this? I have a pod that has been stuck crash looping for 10 minutes. The consul-connect-inject-init container keeps saying
The consul agent on the node is up and healthy. One thing caught my eye in the injector logs
Is it expected for the pod name to be empty like that? It was in this state for 15 mins. Afterwards the injector finally registered the service and the pod was able to start.
|
I attempted to upgrade one of our production clusters yesterday to 0.39.0. Unfortunately, we ran into the same issue described here. Several pods were stuck at the init phase for more than 10m because the endpoints controller had not recognized the new pods and registered them with consul. Restarting the controller pods fixed the issue. |
Hi @dschaaff, sorry for not getting back to you. Are you seeing the endpoints controller not receiving the event for that pod? (I'm talking specifically about, IIRC, this line |
When reproducing the issue, could you grab and provide the full endpoints controller logs (all the logs from both pods of
We can try to correlate K8s events and the endpoint controller logs that way. Also could you provide the CPU and memory usage of the consul and injector pods while this happens? I was able to use |
I captured some logs from an issue this morning. I'm currently running consul-k8s 0.39.0 in our staging clusters. I'm leaving production on 0.25.0 until I can sort out why we get so many injection errors on the newer version. We had alerts go off for a pod in our staging environment today due to envoy not being available. The pod looks like it was rescheduled due to cluster autoscaling activity. When the pod started, the consul-connect-inject-init container put out these logs
```
LAST SEEN TYPE REASON OBJECT MESSAGE
44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container envoy-sidecar
44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container logrotate
44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container vault-agent-auth
44m Normal SuccessfulCreate replicaset/client-replication-1-6ff8bcb699 Created pod: client-replication-1-6ff8bcb699-mt6bx
43m Warning FailedScheduling pod/client-replication-1-6ff8bcb699-mt6bx 0/20 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 7 Insufficient memory.
44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container client-replication
44m Normal Killing pod/client-replication-1-6ff8bcb699-jhj7r Stopping container laravel-log
43m Normal TriggeredScaleUp pod/client-replication-1-6ff8bcb699-mt6bx pod triggered scale-up: [{eks-stg-arm-spot-m-2021101817251900310000000c-fabe4a37-d6a9-2c06-075f-e07215ca1dd3 1->2 (max: 15)}]
42m Warning FailedScheduling pod/client-replication-1-6ff8bcb699-mt6bx 0/22 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 7 Insufficient memory.
42m Normal Scheduled pod/client-replication-1-6ff8bcb699-mt6bx Successfully assigned client-replication/client-replication-1-6ff8bcb699-mt6bx to ip-10-20-2-136.us-west-2.compute.internal
42m Normal Pulling pod/client-replication-1-6ff8bcb699-mt6bx Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg"
41m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 23.294743018s
41m Normal Created pod/client-replication-1-6ff8bcb699-mt6bx Created container config-template-copy
41m Normal Started pod/client-replication-1-6ff8bcb699-mt6bx Started container config-template-copy
41m Normal Pulling pod/client-replication-1-6ff8bcb699-mt6bx Pulling image "registry/vault:1.7.7"
41m Normal Started pod/client-replication-1-6ff8bcb699-mt6bx Started container vault-agent-init
41m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Successfully pulled image "registry/vault:1.7.7" in 4.260832414s
41m Normal Created pod/client-replication-1-6ff8bcb699-mt6bx Created container vault-agent-init
41m Normal Pulling pod/client-replication-1-6ff8bcb699-mt6bx Pulling image "registry/cordial/consul-template:main-c2c65811"
41m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Successfully pulled image "registry/cordial/consul-template:main-c2c65811" in 4.907906483s
40m Normal Started pod/client-replication-1-6ff8bcb699-mt6bx Started container consul-template-init
40m Normal Created pod/client-replication-1-6ff8bcb699-mt6bx Created container consul-template-init
40m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Container image "registry/cordial/consul-template:main-c2c65811" already present on machine
40m Warning BackOff pod/client-replication-1-6ff8bcb699-mt6bx Back-off restarting failed container
40m Normal Pulled pod/client-replication-1-6ff8bcb699-mt6bx Container image "registry/consul:1.10.6" already present on machine
37m Normal Pulling pod/client-replication-1-6ff8bcb699-drctt Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg"
37m Normal SuccessfulCreate replicaset/client-replication-1-6ff8bcb699 Created pod: client-replication-1-6ff8bcb699-drctt
37m Normal Scheduled pod/client-replication-1-6ff8bcb699-drctt Successfully assigned client-replication/client-replication-1-6ff8bcb699-drctt to ip-10-20-11-129.us-west-2.compute.internal
36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 23.167028987s
36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container config-template-copy
36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container config-template-copy
36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container vault-agent-init
36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container vault-agent-init
36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/vault:1.7.7" already present on machine
36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container consul-template-init
36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/cordial/consul-template:main-c2c65811" already present on machine
36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container consul-template-init
36m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container copy-consul-bin
36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/consul:1.10.6" already present on machine
36m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container copy-consul-bin
36m Normal Pulling pod/client-replication-1-6ff8bcb699-drctt Pulling image "registry/consul-k8s-control-plane:0.39.0"
36m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Successfully pulled image "registry/consul-k8s-control-plane:0.39.0" in 1.860779427s
34m Normal Created pod/client-replication-1-6ff8bcb699-drctt Created container consul-connect-inject-init
34m Normal Started pod/client-replication-1-6ff8bcb699-drctt Started container consul-connect-inject-init
34m Normal Pulled pod/client-replication-1-6ff8bcb699-drctt Container image "registry/consul-k8s-control-plane:0.39.0" already present on machine
33m Normal SuccessfulCreate replicaset/client-replication-1-6ff8bcb699 Created pod: client-replication-1-6ff8bcb699-cw2cq
33m Normal Scheduled pod/client-replication-1-6ff8bcb699-cw2cq Successfully assigned client-replication/client-replication-1-6ff8bcb699-cw2cq to ip-10-20-24-127.us-west-2.compute.internal
33m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg"
32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 21.855575333s
32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container config-template-copy
32m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container config-template-copy
32m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/vault:1.7.7"
32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/vault:1.7.7" in 4.673750726s
32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container vault-agent-init
32m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container vault-agent-init
32m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/cordial/consul-template:main-c2c65811"
32m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container consul-template-init
32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container consul-template-init
32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/cordial/consul-template:main-c2c65811" in 4.893254952s
32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Container image "registry/consul:1.10.6" already present on machine
32m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container copy-consul-bin
32m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container copy-consul-bin
32m Normal Pulling pod/client-replication-1-6ff8bcb699-cw2cq Pulling image "registry/consul-k8s-control-plane:0.39.0"
32m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Successfully pulled image "registry/consul-k8s-control-plane:0.39.0" in 1.947241264s
28m Normal Created pod/client-replication-1-6ff8bcb699-cw2cq Created container consul-connect-inject-init
28m Normal Started pod/client-replication-1-6ff8bcb699-cw2cq Started container consul-connect-inject-init
28m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Container image "registry/consul-k8s-control-plane:0.39.0" already present on machine
28m Warning BackOff pod/client-replication-1-6ff8bcb699-cw2cq Back-off restarting failed container
19m Normal Pulled pod/client-replication-1-6ff8bcb699-cw2cq Container image "registry/cordial/consul-template:main-c2c65811" already present on machine
5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container vault-agent-auth
5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container dedicated-client-replication
5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container laravel-log
5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container consul-template
5m51s Warning FailedScheduling pod/dedicated-client-replication-1-757549d787-j4wl4 0/20 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 7 Insufficient memory.
5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container envoy-sidecar
5m52s Normal Killing pod/dedicated-client-replication-1-757549d787-zftjz Stopping container logrotate
5m52s Normal SuccessfulCreate replicaset/dedicated-client-replication-1-757549d787 Created pod: dedicated-client-replication-1-757549d787-j4wl4
5m49s Warning FailedPreStopHook pod/dedicated-client-replication-1-757549d787-zftjz Exec lifecycle hook ([bin/bash -c /usr/local/bin/consul logout -http-addr="${HOST_IP}:8500" -token-file /consul-template/acl-token]) for Container "consul-template" in Pod "dedicated-client-replication-1-757549d787-zftjz_client-replication(8b437f7f-93a6-4528-b57b-ecae7b39105b)" failed - error: command 'bin/bash -c /usr/local/bin/consul logout -http-addr="${HOST_IP}:8500" -token-file /consul-template/acl-token' exited with 1: Error destroying token: Unexpected response code: 403 (rpc error making call: rpc error making call: ACL not found)...
5m36s Warning FailedScheduling pod/dedicated-client-replication-1-757549d787-j4wl4 0/20 nodes are available: 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 8 Insufficient memory.
5m46s Normal TriggeredScaleUp pod/dedicated-client-replication-1-757549d787-j4wl4 pod triggered scale-up: [{eks-stg-spot-c-20211222190131236200000001-9cbef1c2-8e1d-848c-700a-2ab387a46cca 3->4 (max: 15)}]
5m16s Warning FailedScheduling pod/dedicated-client-replication-1-757549d787-j4wl4 0/21 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {role: ci}, that the pod didn't tolerate, 1 node(s) were unschedulable, 18 Insufficient cpu, 8 Insufficient memory.
5m6s Normal Scheduled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully assigned client-replication/dedicated-client-replication-1-757549d787-j4wl4 to ip-10-20-9-22.us-west-2.compute.internal
5m4s Warning FailedMount pod/dedicated-client-replication-1-757549d787-j4wl4 MountVolume.SetUp failed for volume "vault-agent-config" : failed to sync configmap cache: timed out waiting for the condition
5m4s Warning FailedMount pod/dedicated-client-replication-1-757549d787-j4wl4 MountVolume.SetUp failed for volume "vault-cert" : failed to sync configmap cache: timed out waiting for the condition
5m4s Warning FailedMount pod/dedicated-client-replication-1-757549d787-j4wl4 MountVolume.SetUp failed for volume "logrotate-conf" : failed to sync configmap cache: timed out waiting for the condition
4m47s Normal Pulling pod/dedicated-client-replication-1-757549d787-j4wl4 Pulling image "registry/cordial/core:apache-r-20220123.2006110319.stg"
4m25s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully pulled image "registry/cordial/core:apache-r-20220123.2006110319.stg" in 22.414868737s
4m15s Normal Created pod/dedicated-client-replication-1-757549d787-j4wl4 Created container config-template-copy
4m15s Normal Started pod/dedicated-client-replication-1-757549d787-j4wl4 Started container config-template-copy
4m14s Normal Pulling pod/dedicated-client-replication-1-757549d787-j4wl4 Pulling image "registry/vault:1.7.7"
4m12s Normal Started pod/dedicated-client-replication-1-757549d787-j4wl4 Started container vault-agent-init
4m12s Normal Created pod/dedicated-client-replication-1-757549d787-j4wl4 Created container vault-agent-init
4m12s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully pulled image "registry/vault:1.7.7" in 2.386491168s
4m9s Normal Pulling pod/dedicated-client-replication-1-757549d787-j4wl4 Pulling image "registry/cordial/consul-template:main-c2c65811"
4m6s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Successfully pulled image "registry/cordial/consul-template:main-c2c65811" in 3.911779387s
3m50s Normal Created pod/dedicated-client-replication-1-757549d787-j4wl4 Created container consul-template-init
3m50s Normal Started pod/dedicated-client-replication-1-757549d787-j4wl4 Started container consul-template-init
3m25s Normal Pulled pod/dedicated-client-replication-1-757549d787-j4wl4 Container image "registry/cordial/consul-template:main-c2c65811" already present on machine
3m37s Warning BackOff pod/dedicated-client-replication-1-757549d787-j4wl4 Back-off restarting failed container
```
I'm attaching the endpoints controller logs since there are a lot of them. What jumps out to me is
|
I have just discovered the advertise_reconnect_timeout setting that was added to the agent. I have set it to 15m for the k8s-based agents so that dead nodes get cleaned up from the member list faster. I do not know if this is related to the errors described in this ticket, but I want to note that I made that change after collecting the logs in my previous comment. |
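For reference, one way this could be configured for the Kubernetes-based agents is via the Helm chart's client.extraConfig value; this is a sketch of that approach, not necessarily how it was applied here:
```
# values.yaml sketch: extra agent configuration passed to the consul clients
# so that terminated nodes are reaped from the member list after 15 minutes.
client:
  enabled: true
  extraConfig: |
    {
      "advertise_reconnect_timeout": "15m"
    }
```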
I believe this issue may be caused by #779. Our environment auto scales quite frequently and there are often scenarios where a node that has gone away will still show up in the Consul member list. I deployed 0.40 with #991 into a staging cluster last week and things are looking better thus far. I have expanded the test by deploying 0.40 to a lower traffic production cluster. I'll update the issue after observing behavior further. |
Thank you @dschaaff for the update! |
Unfortunately, this behavior continues. We just had an alert fire for 2 pods that were not healthy. They both had been stuck on the
I checked the endpoints controller logs and the pod IDs for the pods that alerted do not appear at all. I'm going to attempt to leave this cluster at the latest version a bit longer, but this issue does not appear to be resolved by #991. All of our other production clusters remain on 0.25.0 of this project, which has been completely reliable. As a reminder, I attached the previously requested logs in #714 (comment). |
Hey @dschaaff, thanks so much for the update and for trying it out! Could you describe roughly what is happening in the cluster when you're seeing these errors? I'm asking more about how many pods are being created by the HPA at about the same time. Do you know roughly the range of pods per service you could have at any given time? I'm thinking that this may be related to scalability issues with Endpoints objects and trying to find out if you may be coming across that. We have looked at supporting EndpointSlices but haven't gotten to it yet. |
The registration failure today happened while we were deploying 3 different services. Between these 3 services, there was a total of 14 pods at the time. This is inside a secondary cluster that does not receive as much traffic as our main cluster. The HPA for the service that errored has a min/max of 2/10. I'm including a screenshot of every HPA in that cluster. The pod count for that cluster averages around 400. For context, our main cluster is about 6 times the size of this one and the services scale up and down a lot more often. Hopefully that is helpful. I'm happy to dig into any more info that would be useful in troubleshooting the issue. Thanks! |
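For context, a minimal sketch of an HPA with the min/max of 2/10 mentioned above, scaling on CPU like the deployment earlier in the thread. The name and utilization target are placeholders:
```
apiVersion: autoscaling/v2beta2   # v2beta2 matches the 1.20 cluster noted in the issue body
kind: HorizontalPodAutoscaler
metadata:
  name: example-service           # placeholder; the real service name isn't shown
  namespace: example-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  minReplicas: 2                  # min/max of 2/10 as described above
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # placeholder target; not stated in the thread
```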
I may have to downgrade the production cluster I was testing this upgrade in. The on-call engineer was paged again due to pods stuck in an init state.
The pod ids
Eventually pod service-86dd585d4-xb5ng started, but the delay between the pod starting and it becoming healthy is way too long. The only thing jumping out at me in the injector logs is this
This error |
Just a note that I once again tried to upgrade our primary production cluster today (to 0.41). I'm now going to roll it back. We've been flooded with alerts off and on for pods that are stuck in the init phase logging
Our alerts only fire after a pod has been stuck like this for 8m. I'd really like to be able to use the up-to-date injector, but it's definitely really buggy. |
Hello, I'm also facing the same issue on our deployment. On one of our services, the init container logs from
When the init container fails to start, the entire pod gets stuck. If we persistently restart the deployment, the init container eventually starts up successfully. We connected our Kubernetes cluster to HCP Consul. The cluster has about 50 worker nodes that are constantly spun up and down. On our Consul Helm deployment, we noticed the consul-connect-injector pods crashed rather frequently; in our Argo CD view, the two injector pods had restarted 27 and 33 times respectively (screenshots of the Argo CD view, the pod summary, and the pod events omitted). The injector pod's logs did not contain any exceptions, but it did flag that one of our services, which was annotated with
Hope this helps in the investigation. Components version:
|
@lkysow Thanks, looking forward to adopting the fix!
We have managed to resolve the root cause of the connect-injector pod crashing. It was due to allocating insufficient memory to the pod. The default memory allocation for each pod is 50Mi, but the pod was actually using up to 70Mi, and that led to frequent crashes. After bumping the memory allocation to 100Mi, the crashes stopped and we don't get liveness probe errors in the events anymore. |
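For anyone hitting the same crash, a sketch of raising the injector's memory allocation through the Helm chart's connectInject.resources (the 50Mi default and 100Mi bump come from the comment above; the CPU values are placeholders):
```
# values.yaml sketch
connectInject:
  enabled: true
  resources:
    requests:
      memory: "100Mi"   # chart default is 50Mi, which was not enough in this environment
      cpu: "50m"        # placeholder; CPU was not the issue discussed here
    limits:
      memory: "100Mi"
      cpu: "50m"        # placeholder
```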
@lkysow does this mean there is some sort of memory leak, perhaps caused by many pods coming and going due to the HPA? How can this be monitored so that it is caught before it occurs? @raphlcx is the error the same as what @dschaaff observed in production? @dschaaff is your production test also using the same default memory allocation of 50Mi, or did you encounter the issue even with a higher memory allocation? |
If @raphlcx is seeing stability at 100Mi, then I don't think there's a memory leak; probably just that at a certain environment size you need more memory. If they're seeing memory creep up, then yes, there may be a leak. |
In production I run the endpoints controller with
I haven't had any issues with the controller pods crashing, and they have never been close to using the amount of memory I've allocated for them. That sounds like a different issue than what I've been seeing. |
We'll have to continue observing the memory usage. So far it hovers around 50-70Mi, but at one point it spiked to 118Mi, which was above the threshold we have set (100Mi). Initially, we weren't expecting the memory usage for the connect-injector pod to rise past the default limit. Are there documented use cases, tutorials, or similar past incidents related to rising memory usage from the injector pod? |
@raphlcx can you share your values.yaml and your environment details (kube version, hosting provider, etc.) |
Hi everyone, we are definitely monitoring this issue closely. It sounds like this is occurring on K8s clusters that have the following properties:
A couple of questions as well.
We plan to try to reproduce this issue, but we are currently in the midst of preparing for a major release; any more helpful indicators on how to repro this would be appreciated. |
We have 53 connect services total on the mesh. 8 of those run directly on ec2 outside of Kubernetes.
We never experienced behavior like this with consul-k8s 0.25. I have tried every version of the consul-k8s project since 0.33 and have run into these problems in production every time. I have almost never seen this behavior in our pre-prod environments, which have been using the latest consul-k8s starting with 0.33 up to 0.42 now. The only difference between them is the scale: the production workloads get a lot more traffic and autoscale to handle spikes. Our consul cluster is usually around 144 nodes. The raft commit time is between 10 and 25ms at all times. Consul usually has 10-40 catalog operations a second. We run 5 consul servers on c6g.xlarge instances. CPU utilization has peaked at 10% on the leader. I'm happy to provide any other helpful info. I'd love to not rely on restarting the inject controller every 3 minutes to keep things working :) |
Just to give some info on why this is happening with 0.25.0+: it's because we moved from the pods registering themselves to being registered via the endpoints controller. This was needed to support transparent proxying, i.e. being able to use kube DNS URLs (among other reasons, such as being able to deregister pods that are force-killed and re-register pods efficiently when consul clients restart). |
Yep. The issue I have had is that the endpoints controller has not been reliably registering services with the local agent. Because of that, pods randomly get stuck in the init phase, sometimes for as long as 30 minutes. Restarting the endpoints controller regularly has made the alert noise manageable, but we still have issues during large deployments. |
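For context, a rough sketch of what such a periodic restart workaround could look like as a CronJob. The deployment name, namespace, and service account are illustrative assumptions (the actual injector deployment name depends on the Helm release), and the service account needs RBAC permission to patch the deployment:
```
apiVersion: batch/v1                        # use batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: restart-connect-injector            # illustrative name
  namespace: consul                         # assumes Consul is installed in the "consul" namespace
spec:
  schedule: "*/3 * * * *"                   # every 3 minutes, as described above
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: injector-restarter   # assumed SA allowed to patch the deployment
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest        # any image with kubectl works
              command:
                - /bin/sh
                - -c
                - kubectl -n consul rollout restart deployment/consul-connect-injector
```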
Yeah for sure; this is a high priority for us to fix now. |
While the restart cron has greatly reduced the number of errors we see, the issue still crops up. We just hit this, for example
That's 43 pods that have been stuck in the init phase due to the service not being registered for 13m+
After 15m they finally got registered by the controller and were able to start. This is brutal. During this time the consul agent on the node was logging the below errors repeatedly until the service registration finally happened.
|
I disabled my restart cron temporarily, hoping to get some more debug data. We hit this issue again afterwards. I've followed the chain of events in the logs as best I can. The pod in question was removed during a node scale-down.
The consul client on that node shows an event for the node shut down as well.
I then see the inject controller deleting the ACL token and trying to de-register the service
Next it logs some messages referring to the consul client on the node that was just killed.
The new pod
It remained in this state for about 16 minutes. I also confirmed the new pod was listed in the endpoints object during this period.
Then I see the controller log the following connection error trying to talk to the consul agent on the node that no longer exists
Then roughly 1 minute later the new pod is finally registered.
It's a pain to convert the unix timestamps in the logs, but if you do, you'll see that quite a lot of time passed during this period. Hopefully this is helpful in tracking this down. |
Quick note that we hit these injection errors in our staging cluster early this morning as well. The timing coincides with several spot terminations forcing the recycle of nodes. There was again a long period where services were not being registered and again we see these logs in the injector
|
Hey @dschaaff, thank you for all of the detailed logs and timelines. Please correct me if you think differently, but it sounds like:
I spent a few hours trying to recreate this yesterday. I think there is a missing element that you are running into that I am not adding to the recreation, or maybe it just won't be hit when not at scale. I tried to recreate it:
I tried a few different variations of this and every time the scenario was handled gracefully and it processed the termination similar to the |
I ran into this condition today by just doing a deployment of a service; unfortunately, I'm fighting with our ELK stack right now to get the connect injector logs. Here is the kubectl events output for the pod in question, though.
I'm disabling our restart cron for the day to try and recreate the issue and grab another set of detailed logs.
This does seem to be happening quite regularly.
Is there a way for me to verify this? Any way to introspect the queue?
What is the max retry count for this? Scale definitely seems to be a factor at play. Our staging cluster has only hit this error a handful of times. In production we hit it quite regularly, especially without the cronjob to restart the endpoints controller. |
We just hit the error again in our staging cluster. A new pod was created and stuck in the init phase
The endpoints controller did not register the service until after that long period had elapsed.
Prior to it registering the service, I again see the errors from it attempting to connect to a non-existent host. I'm also surprised by this particular error because the external-dns pods are not annotated
The host client pod 10.20.14.50 was running on an instance that was removed by cluster autoscaler. This follows the same pattern where the service happens to be registered after I see errors connecting to nodes that no longer exist. They definitely all seem to be following this pattern. |
Hi @dschaaff, we've been talking internally. It seems the HTTP client that is used to connect to the Consul client from the endpoints controller is the default Go client with no timeout configured. We believe this, rather than retries, is the culprit, because when the endpoints controller gets an endpoints entry, it gets the current clients that are available. So there should not be a retry scenario where this does not self-correct the next time that endpoints record is processed. Would you be open to testing a build with a timeout configured on that client? (This would be a temporary image to verify, and we'd have to release the code changes in a following patch release.) |
Yes, I'd be happy to pilot it in our environments. |
Just for my own benefit, is the HTTP client in question the one here: https://github.com/hashicorp/consul/blob/main/api/api.go#L739 ?
Yes, we use the function above it that calls the one you linked. It allows us to pass in an httpClient in the config; it will use that if it is present, otherwise it will create a new one, setting only the transport. We'll have to do some slight refactoring in Consul to accommodate the change in Consul on Kubernetes. I'll link draft PRs here for your reference when I send over instructions with the docker image to use in the values file. |
@jmurret Thank you so much for the help on this. I'm deploying the image now and will report back. |
My initial feedback on this change is positive. I've deployed the container in 2 different clusters. In the stg cluster I did a rolling replace of the Kubernetes nodes and did not encounter any errors during this operation. After running it in the stg cluster for a few hours, I went ahead and deployed it to the production cluster. I also disabled the cronjob that was restarting the controller every 3 minutes. Since deploying the updated controller in that environment, cluster autoscaler has removed 43 nodes due to scale-down operations. We have not experienced any service registration delays thus far. During this period of scale-down, I saw the timeout being hit and logged by the controller quite frequently. For example
I'm going to leave this setup in production and will report back if the service registration delay comes up at all. So far I am optimistic that this fixes our issue. |
That's great to hear. Please keep us posted. |
It's been almost a full 24 hours since I deployed the change, and we haven't had a single issue. At this point I'd say this has addressed the problem. Thank you so much for your effort looking into this and fixing it. We will continue to run the custom build until the official release. |
* Adding GH-714 as bug fix to change log. * Update CHANGELOG.md Co-authored-by: Luke Kysow <[email protected]>
Thanks again for all the work on this! |
Community Note
Overview of the Issue
Since upgrading the consul-k8s project to 0.33, we are seeing frequent failures within our primary cluster. I have not, yet at least, observed this behavior in other clusters we run. The primary difference with this cluster is that it runs close to 100 nodes and has several connect-injected services that are frequently scaling up and down via the horizontal pod autoscaler.
Observed Symptoms
Unable to find registered services; retrying
At this point the pod is stuck in this state. The endpoints controller never actually registers the service. After a few minutes, our on-call engineers are paged due to the stuck pods. Deleting the pods in this state usually gets things back on track.
Server Version: v1.20.7-eks-d88609
I have confirmed that these pods are present in the service's endpoints.
Helm Values
I could use some guidance on what information would be most useful to help debug this. Thanks for your help!