[Flaking Test] [EventedPLEG] Containers Lifecycle should continue running liveness probes for restartable init containers and restart them while in preStop #127312

Open
pacoxu opened this issue Sep 12, 2024 · 9 comments · May be fixed by #124953 or #127954
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@pacoxu
Member

pacoxu commented Sep 12, 2024

Which jobs are flaking?

ci-crio-cgroupv1-evented-pleg

Which tests are flaking?

E2eNode Suite.[It] [sig-node] [NodeConformance] Containers Lifecycle when a pod is terminating because its liveness probe fails should continue running liveness probes for restartable init containers and restart them while in preStop [NodeConformance]

Since when has it been flaking?

8/24

https://storage.googleapis.com/k8s-triage/index.html?date=2024-09-12&job=ci-crio-cgroupv1-evented-pleg&test=%20Containers%20Lifecycle%20when%20a%20pod%20is%20terminating%20because%20its%20liveness%20probe%20fails%20should%20continue%20running%20liveness%20probes%20for%20restartable%20init%20containers%20and%20restart%20them%20while%20in%20preStop%20

Testgrid link

https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-evented-pleg

Reason for failure (if possible)

{ failed [FAILED] Expected an error to have occurred.  Got:
    <nil>: nil
In [It] at: k8s.io/kubernetes/test/e2e_node/container_lifecycle_test.go:903 @ 08/23/24 18:16:25.131
}

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@pacoxu pacoxu added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 12, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 12, 2024
@pacoxu
Member Author

pacoxu commented Sep 12, 2024

/cc @hshiina @SergeyKanzhelev

@pacoxu pacoxu changed the title [Flaking Test] Containers Lifecycle when a pod is terminating because its liveness probe fails should continue running liveness probes for restartable init containers and restart them while in preStop [Flaking Test] Containers Lifecycle should continue running liveness probes for restartable init containers and restart them while in preStop Sep 12, 2024
@pacoxu
Member Author

pacoxu commented Sep 12, 2024

@hshiina this seems to be the same problem as #123087 (static pod when EventedPLEG is enabled).

/cc @gjkim42 @liggitt

@hshiina
Contributor

hshiina commented Sep 12, 2024

As far as I can tell from the log, the containers do not appear to have been recreated.

If I understand correctly, this test works like:

  1. The liveness probe for the regular container fails.
  2. kubelet starts to stop the regular container. Then, the prestop hook is triggered.
  3. If the liveness probe for the sidecar container runs and fails before the prestop hook starts, this assertion passes. If the probe runs while the prestop hook is running, this assertion fails:
    err = results.RunTogetherLhsFirst(prefixedName(PreStopPrefix, regular1), prefixedName(LivenessPrefix, restartableInit1))
    gomega.Expect(err).To(gomega.HaveOccurred())

I'm afraid I'm not sure what is supposed to guarantee that the liveness probe for the sidecar container (restartable-init-1) runs or stops before the prestop hook starts.
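
To make the overlap condition in step 3 concrete, here is a minimal, self-contained sketch. It is not the helper from container_lifecycle_test.go: the `interval` type, the `ranTogetherLhsFirst` function, and the timestamps are hypothetical and only mirror the behavior described in this comment (a nil error when the right-hand step starts while the left-hand step is still running).

```go
package main

import (
	"errors"
	"fmt"
)

// interval is a hypothetical record of when a named step started and
// finished, in arbitrary time units. It is an assumption for illustration.
type interval struct {
	name       string
	start, end int
}

// ranTogetherLhsFirst returns nil only when lhs started first and rhs
// started before lhs finished, i.e. the two steps overlapped with lhs leading.
func ranTogetherLhsFirst(lhs, rhs interval) error {
	if rhs.start < lhs.start {
		return errors.New(rhs.name + " started before " + lhs.name)
	}
	if rhs.start >= lhs.end {
		return errors.New(rhs.name + " started after " + lhs.name + " finished")
	}
	return nil
}

func main() {
	preStop := interval{name: "PreStop-regular-1", start: 10, end: 20}

	// Probe finished before the prestop hook started: an error is returned,
	// so an "expect an error" assertion would pass.
	probeBefore := interval{name: "Liveness-restartable-init-1", start: 5, end: 6}
	fmt.Println(ranTogetherLhsFirst(preStop, probeBefore) != nil)

	// Probe ran while the prestop hook was running: nil is returned,
	// so an "expect an error" assertion would fail, as in the flake.
	probeDuring := interval{name: "Liveness-restartable-init-1", start: 12, end: 13}
	fmt.Println(ranTogetherLhsFirst(preStop, probeDuring) != nil)
}
```

Under these assumptions, the second case corresponds to the reported failure ("Expected an error to have occurred. Got: nil"): the probe overlapped with the prestop hook, so no error came back.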

@hshiina
Contributor

hshiina commented Sep 12, 2024

Because #124297 was recently merged, another issue (#124704) appeared. Pod workers sometimes get blocked for a few seconds in the kubelet, as in #124297 (comment). This may cause a race condition to surface.

@SergeyKanzhelev
Member

/retitle [Flaking Test] [EventedPLEG] Containers Lifecycle should continue running liveness probes for restartable init containers and restart them while in preStop

@k8s-ci-robot k8s-ci-robot changed the title [Flaking Test] Containers Lifecycle should continue running liveness probes for restartable init containers and restart them while in preStop [Flaking Test] [EventedPLEG] Containers Lifecycle should continue running liveness probes for restartable init containers and restart them while in preStop Sep 12, 2024
@SergeyKanzhelev
Member

Marking with evented PLEG.

Is the issue also happening outside the evented PLEG?

@hshiina
Contributor

hshiina commented Sep 13, 2024

I don't think this happens outside the evented PLEG.
Usually, the init container enters CrashLoopBackOff before the liveness probe for the regular container, whose InitialDelaySeconds is 10, starts. So the liveness probe for the init container does not run while the prestop hook is running.

If the pod worker is slowed down because it is blocked by #124704, the init container may not enter CrashLoopBackOff in time.

@SergeyKanzhelev
Member

/assign @hshiina
since the PR has been opened.

This is for an alpha feature and is NOT release-blocking.

/priority backlog
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 18, 2024
@pacoxu
Member Author

pacoxu commented Oct 12, 2024

It failed in pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2 as well.

Projects
Status: Issues - In progress
4 participants