Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure there is one running static pod with the same full name #104743

Merged
merged 2 commits into from
Nov 12, 2021

Conversation

gjkim42
Copy link
Member

@gjkim42 gjkim42 commented Sep 3, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR ensures there is no more than one running static pod with the same full name so that static pods are able to be updated/deleted gracefully.

Which issue(s) this PR fixes:

Fixes #97722

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Respect grace period when updating static pods.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 3, 2021
@gjkim42 gjkim42 changed the title Ensure pod uniqueness Ensure there is one running pod with the same full name Sep 3, 2021
@gjkim42 gjkim42 changed the title Ensure there is one running pod with the same full name WIP: Ensure there is one running pod with the same full name Sep 3, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2021
@gjkim42 gjkim42 force-pushed the ensure-pod-uniqueness branch 2 times, most recently from ca4dfc1 to affcc7e Compare September 3, 2021 03:12
@gjkim42 gjkim42 changed the title WIP: Ensure there is one running pod with the same full name WIP: Ensure there is one running static pod with the same full name Sep 3, 2021
@gjkim42
Copy link
Member Author

gjkim42 commented Sep 3, 2021

/retest

@gjkim42
Copy link
Member Author

gjkim42 commented Sep 3, 2021

/assign @smarterclayton

Made an alternative to #104031

@gjkim42
Copy link
Member Author

gjkim42 commented Sep 3, 2021

/retest

I think kubelet-serial tests are broken

https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&job=pull-kubernetes-node-kubelet-serial

@gjkim42 gjkim42 changed the title WIP: Ensure there is one running static pod with the same full name Ensure there is one running static pod with the same full name Sep 3, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2021
@gjkim42
Copy link
Member Author

gjkim42 commented Sep 3, 2021

/sig node
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 3, 2021
@gjkim42
Copy link
Member Author

gjkim42 commented Oct 19, 2021

/retest

@gjkim42 gjkim42 force-pushed the ensure-pod-uniqueness branch from e852ec2 to 3bce245 Compare October 19, 2021 07:30
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Oct 19, 2021

@gjkim42: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-node-kubelet-serial-containerd 530706b74c43ae5e025c49f4acc3ee6d5117684c link false /test pull-kubernetes-node-kubelet-serial-containerd
pull-kubernetes-node-kubelet-serial 530706b74c43ae5e025c49f4acc3ee6d5117684c link false /test pull-kubernetes-node-kubelet-serial

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gjkim42
Copy link
Member Author

gjkim42 commented Oct 19, 2021

/retest

@gjkim42
Copy link
Member Author

gjkim42 commented Oct 21, 2021

/assign @ehashman
For final review

@gjkim42
Copy link
Member Author

gjkim42 commented Nov 3, 2021

/cc @ehashman @derekwaynecarr @Random-Liu
Need final review from pod_worker maintainers :)

Comment on lines +505 to +507
if !started {
return false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible for a pod that hasn't been started to be terminating? I'm thinking of the edge case for a static pod that might be in the middle of setup (e.g. an initContainer) and gets removed before it reaches Running.

Maybe it could still be in waitingToStartStaticPodsByFullname?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of the edge case for a static pod that might be in the middle of setup (e.g. an initContainer) and gets removed before it reaches Running.

A static pod can start initContainer after the static pod has started(moved into startedStaticPodsFullname).

And I think that the non-started static pod does not have the mirror pod, so only the started static pod should be passed to this function.

Comment on lines +803 to +806
if !p.allowStaticPodStart(status.fullname, pod.UID) {
p.workQueue.Enqueue(pod.UID, wait.Jitter(p.backOffPeriod, workerBackOffPeriodJitterFactor))
status.working = false
return false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a static pod is never allowed to start, what happens? It appears that we keep requeueing it forever. Is that possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is possible if the previous static pod is terminating forever, but I think we are ensuring that the termination requested pod terminates after terminationGracePeriodSeconds.

if status, ok := p.podSyncStatuses[pod.UID]; ok {
if status.terminatingAt.IsZero() {
klog.V(4).InfoS("Pod worker was terminated but did not have terminatingAt set, likely programmer error", "pod", klog.KObj(pod), "podUID", pod.UID)
}
status.terminatedAt = time.Now()
status.finished = true
status.working = false

if p.startedStaticPodsByFullname[status.fullname] == pod.UID {
delete(p.startedStaticPodsByFullname, status.fullname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see waitingToStartStaticPodsByFullname getting cleared out anywhere when it's != 0... should that happen here or somewhere else?

Copy link
Member Author

@gjkim42 gjkim42 Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitingToStartStaticPodsByFullname is basically a set of queues and there can be more than one waiting to start static pods with the same full name.
So we can't clear waitingToStartStaticPodsByFullname until it's empty.

Comment on lines +850 to +851
types.UID("uid-2"),
types.UID("uid-2"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the same UID repeated here intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I just wanted to make sure that invalid and redundant waiting pods are trimmed.

} else {
t.Errorf("Pod should not be allowed")
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need this added I think

time.Sleep(time.Duration(200) * time.Millisecond)
}
ginkgo.By("wait for the mirror pod to be running for grace period")
gomega.Consistently(func() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be Eventually if we're waiting? Consistently checks that it's true every time and will fail on the first false if it's not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that the mirror pod is running for terminationGracePeriodSeconds consistently.

@gjkim42
Copy link
Member Author

gjkim42 commented Nov 11, 2021

@ehashman
Could you PTAL again? Thanks!

Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with @smarterclayton about this, I think we should merge and get signal on this from CI ASAP.

Sorry for my delays on review!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2021
@k8s-ci-robot k8s-ci-robot merged commit 5f0a94b into kubernetes:master Nov 12, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 12, 2021
@gjkim42 gjkim42 deleted the ensure-pod-uniqueness branch November 14, 2021 11:51
k8s-ci-robot added a commit that referenced this pull request Dec 2, 2021
…743-upstream-release-1.22

Automated cherry pick of #104743: Ensure there is one running static pod with the same full name
@liggitt
Copy link
Member

liggitt commented Jan 25, 2022

for anyone subscribed to this PR, a revert is proposed in #107734

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Static pod should be updated by respecting terminationGracePeriodSeconds
9 participants