
Pods stuck in terminating state #1357

Closed
Setomidor opened this issue Mar 17, 2023 · 2 comments · Fixed by #1484

Comments

@Setomidor

This is probably not an issue for everyone, but I wanted to leave a note here in case other people run into the same problem.

The problem was that pods with SGX support were stuck in the Terminating state for a long time. It was tracked down to the SGX webhook:

kubectl -n kube-sgx logs -f sgx-webhook-webhook-5444cff965-cn4hz
I0317 07:11:31.984493       1 server.go:149] controller-runtime/webhook "msg"="Registering webhook" "path"="/pods-sgx"
I0317 07:11:31.984684       1 main.go:60] setup "msg"="starting manager"
I0317 07:11:31.985081       1 server.go:217] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0317 07:11:31.985521       1 certwatcher.go:131] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0317 07:11:31.985724       1 certwatcher.go:85] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0317 07:11:31.986040       1 server.go:271] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=9443

2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:26292: EOF
2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:49324: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:59980: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:42953: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:34228: read tcp 10.233.190.251:9443->10.233.240.0:34228: read: connection reset by peer
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:38431: EOF
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:36956: EOF
2023/03/17 07:12:35 http: TLS handshake error from 10.233.240.0:11239: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:27806: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:3522: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:2116: EOF

The webhook appeared to be blocking the cleanup of the pods, and they would stay stuck for days; undeploying the webhook released the stuck pods immediately.
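As a quick mitigation, the following sketch (using the resource name from the configuration below, which may differ in other deployments) confirms the stuck pods and temporarily removes the webhook configuration:

# List pods that are stuck in the Terminating state.
kubectl get pods --all-namespaces | grep Terminating

# Temporarily remove the mutating webhook configuration; the stuck pods
# should then terminate. Re-deploy or re-create the webhook afterwards.
kubectl delete mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration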

Changing the MutatingWebhookConfiguration to only act on CREATE and not UPDATE resolved the issue, presumably because pod updates made during termination no longer had to pass through the failing webhook. The working configuration for the webhook in our environment is:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    cert-manager.io/inject-ca-from: kube-sgx/sgx-webhook-serving-cert
  name: sgx-webhook-mutating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: sgx-webhook-service
      namespace: kube-sgx
      path: /pods-sgx
  failurePolicy: Ignore
  name: sgx.mutator.webhooks.intel.com
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
  sideEffects: None
  timeoutSeconds: 10
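If the webhook is already deployed, the same change can be applied in place. This is a sketch that assumes the resource name above and a single webhook entry with a single rule:

# Restrict the first rule of the first webhook to CREATE only.
kubectl patch mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/rules/0/operations", "value": ["CREATE"]}]'

# Verify that only CREATE remains in the rule.
kubectl get mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration \
  -o jsonpath='{.webhooks[0].rules[0].operations}'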

Feel free to close this issue immediately. :)

@eero-t
Contributor

eero-t commented Apr 12, 2023

@mythi Any comments on this?

@mythi
Contributor

mythi commented Jul 21, 2023

Changing the MutatingWebhookConfiguration to only act on Create and not Update resolved the issue.

I think I hit this too and this action is probably the right thing to do. Let me do some more testing and submit a PR with the change.
