Ensure there is one running static pod with the same full name #104743
Conversation
Force-pushed: ca4dfc1 → affcc7e
/retest
/assign @smarterclayton
Made an alternative to #104031
/retest
I think kubelet-serial tests are broken: https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&job=pull-kubernetes-node-kubelet-serial
/sig node
/retest
Force-pushed: e852ec2 → 3bce245
@gjkim42: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
/assign @ehashman
/cc @ehashman @derekwaynecarr @Random-Liu
if !started {
	return false
}
Is it not possible for a pod that hasn't been started to be terminating? I'm thinking of the edge case for a static pod that might be in the middle of setup (e.g. an initContainer) and gets removed before it reaches Running.
Maybe it could still be in waitingToStartStaticPodsByFullname?
I'm thinking of the edge case for a static pod that might be in the middle of setup (e.g. an initContainer) and gets removed before it reaches Running.
A static pod can start its initContainers only after the static pod has started (i.e., it has been moved into startedStaticPodsByFullname).
Also, I think a static pod that hasn't started does not have a mirror pod yet, so only started static pods should be passed to this function.
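To make the ordering concrete, here is a minimal sketch under my reading of the change (the struct and method names are simplified stand-ins, not the actual code in pkg/kubelet/pod_workers.go): the worker records the static pod as started before any sync work runs, so a pod that is still in its initContainers has already been added to startedStaticPodsByFullname.

```go
package podworkers

import "k8s.io/apimachinery/pkg/types"

// workersSketch is a hypothetical, stripped-down stand-in for the kubelet's
// podWorkers struct, used only to illustrate the ordering described above.
type workersSketch struct {
	startedStaticPodsByFullname map[string]types.UID
}

func (w *workersSketch) startStaticPod(fullname string, uid types.UID, syncPod func()) {
	// The pod is recorded as started first...
	w.startedStaticPodsByFullname[fullname] = uid
	// ...and only then does the sync run the pod's init containers and
	// regular containers. So a pod that is mid-initContainer has already
	// been moved into startedStaticPodsByFullname.
	syncPod()
}
```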
if !p.allowStaticPodStart(status.fullname, pod.UID) {
	p.workQueue.Enqueue(pod.UID, wait.Jitter(p.backOffPeriod, workerBackOffPeriodJitterFactor))
	status.working = false
	return false
If a static pod is never allowed to start, what happens? It appears that we keep requeueing it forever. Is that possible?
Yes, it is possible if the previous static pod keeps terminating forever, but I think we ensure that a pod whose termination has been requested actually terminates after terminationGracePeriodSeconds.
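As a side note on the requeue path quoted above, here is a small sketch of the jittered backoff it relies on (the period and jitter factor below are illustrative values, not the kubelet's actual defaults):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Illustrative values; the kubelet's real backOffPeriod and jitter
	// factor may differ.
	backOffPeriod := 10 * time.Second
	const workerBackOffPeriodJitterFactor = 0.5

	// wait.Jitter returns backOffPeriod plus a random extra of up to
	// backOffPeriod * workerBackOffPeriodJitterFactor, so a pod that is not
	// yet allowed to start is retried later rather than dropped. It keeps
	// being re-enqueued until the previous static pod with the same full
	// name terminates, which is bounded by terminationGracePeriodSeconds.
	delay := wait.Jitter(backOffPeriod, workerBackOffPeriodJitterFactor)
	fmt.Printf("re-enqueueing pod after %v\n", delay)
}
```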
if status, ok := p.podSyncStatuses[pod.UID]; ok {
	if status.terminatingAt.IsZero() {
		klog.V(4).InfoS("Pod worker was terminated but did not have terminatingAt set, likely programmer error", "pod", klog.KObj(pod), "podUID", pod.UID)
	}
	status.terminatedAt = time.Now()
	status.finished = true
	status.working = false

	if p.startedStaticPodsByFullname[status.fullname] == pod.UID {
		delete(p.startedStaticPodsByFullname, status.fullname)
I don't see waitingToStartStaticPodsByFullname getting cleared out anywhere when it's != 0... should that happen here or somewhere else?
waitingToStartStaticPodsByFullname is basically a set of queues, and there can be more than one static pod waiting to start with the same full name.
So we can't clear waitingToStartStaticPodsByFullname until it's empty.
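To illustrate what "a set of queues" means here, a rough sketch of how I read the bookkeeping (simplified, with hypothetical field names such as known; not the exact upstream allowStaticPodStart):

```go
package podworkers

import "k8s.io/apimachinery/pkg/types"

// workers is a hypothetical, simplified stand-in for the kubelet's podWorkers;
// it only models the two static-pod bookkeeping maps discussed here.
type workers struct {
	// one FIFO of waiting pod UIDs per static pod full name
	waitingToStartStaticPodsByFullname map[string][]types.UID
	// the single UID currently allowed to run per full name
	startedStaticPodsByFullname map[string]types.UID
	// UIDs the worker still tracks; entries missing here are stale/invalid
	known map[types.UID]bool
}

// allowStart returns true only when uid is the first still-valid waiter for
// fullname (or a pod with that full name has already started with this uid).
func (w *workers) allowStart(fullname string, uid types.UID) bool {
	if startedUID, started := w.startedStaticPodsByFullname[fullname]; started {
		return startedUID == uid
	}

	// Drop invalid (no longer tracked) waiters from the front of the queue.
	waiting := w.waitingToStartStaticPodsByFullname[fullname]
	for len(waiting) > 0 && !w.known[waiting[0]] {
		waiting = waiting[1:]
	}

	// Another still-valid pod is ahead of us: keep waiting.
	if len(waiting) > 0 && waiting[0] != uid {
		w.waitingToStartStaticPodsByFullname[fullname] = waiting
		return false
	}

	// It is our turn: trim ourselves (and any redundant duplicate entries)
	// from the queue and record this UID as the started one.
	for len(waiting) > 0 && (waiting[0] == uid || !w.known[waiting[0]]) {
		waiting = waiting[1:]
	}
	w.startedStaticPodsByFullname[fullname] = uid
	w.waitingToStartStaticPodsByFullname[fullname] = waiting
	return true
}
```

In this reading, the map entry for a full name is only drained as waiters become invalid or get their turn, which is why it can't simply be cleared when a single pod finishes.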
types.UID("uid-2"), | ||
types.UID("uid-2"), |
Is the same UID repeated here intentional?
It's intentional; I wanted to make sure that invalid and redundant waiting pods are trimmed.
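For example, a hypothetical test against the allowStart sketch above (not the test actually added in this PR) could check that both a stale entry and a duplicate of the allowed UID are dropped:

```go
package podworkers

import (
	"reflect"
	"testing"

	"k8s.io/apimachinery/pkg/types"
)

func TestInvalidAndRedundantWaitersAreTrimmed(t *testing.T) {
	w := &workers{
		waitingToStartStaticPodsByFullname: map[string][]types.UID{
			// "uid-0" is stale (unknown), and "uid-2" is queued twice.
			"foo_ns": {"uid-0", "uid-2", "uid-2", "uid-3"},
		},
		startedStaticPodsByFullname: map[string]types.UID{},
		known:                       map[types.UID]bool{"uid-2": true, "uid-3": true},
	}

	if !w.allowStart("foo_ns", "uid-2") {
		t.Fatalf("expected uid-2 to be allowed to start")
	}

	// Only the remaining valid waiter should be left in the queue.
	got := w.waitingToStartStaticPodsByFullname["foo_ns"]
	want := []types.UID{"uid-3"}
	if !reflect.DeepEqual(got, want) {
		t.Errorf("waiting queue = %v, want %v", got, want)
	}
}
```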
	} else {
		t.Errorf("Pod should not be allowed")
	}
}
I think this still needs to be added.
	time.Sleep(time.Duration(200) * time.Millisecond)
}
ginkgo.By("wait for the mirror pod to be running for grace period")
gomega.Consistently(func() error {
Shouldn't this be Eventually if we're waiting? Consistently checks that it's true every time and will fail on the first false if it's not.
This checks that the mirror pod stays running consistently for the full terminationGracePeriodSeconds.
@ehashman
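For anyone unfamiliar with the distinction, a short Gomega sketch (checkMirrorPodRunning is a hypothetical stub standing in for the e2e helper used by the test):

```go
package e2esketch

import (
	"time"

	"github.com/onsi/gomega"
)

// checkMirrorPodRunning is a hypothetical stand-in for the e2e helper; assume
// it returns an error unless the mirror pod is in the Running phase.
func checkMirrorPodRunning() error { return nil }

func verifyMirrorPodStaysRunning() {
	// Eventually passes as soon as the condition holds once: use it when
	// waiting for a state to be reached.
	gomega.Eventually(checkMirrorPodRunning, 2*time.Minute, time.Second).Should(gomega.Succeed())

	// Consistently requires the condition to hold on every poll for the whole
	// window: use it to assert that the mirror pod stays Running for the full
	// terminationGracePeriodSeconds, as the test above intends.
	gomega.Consistently(checkMirrorPodRunning, 30*time.Second, time.Second).Should(gomega.Succeed())
}
```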
Chatted with @smarterclayton about this, I think we should merge and get signal on this from CI ASAP.
Sorry for my delays on review!
/lgtm
…743-upstream-release-1.22 Automated cherry pick of #104743: Ensure there is one running static pod with the same full name
For anyone subscribed to this PR: a revert is proposed in #107734.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR ensures there is no more than one running static pod with the same full name, so that static pods can be updated and deleted gracefully.
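For context on "full name": as I understand it, the kubelet keys this bookkeeping on the pod's full name, which stays the same across updates of a static pod manifest even though each incarnation gets a new UID. A minimal sketch of that assumption:

```go
package podworkers

// podFullName mirrors how the kubelet builds a pod's full name
// ("<name>_<namespace>"); two incarnations of the same static pod share this
// full name while having different UIDs, which is why the single-running-pod
// gate is keyed on it.
func podFullName(name, namespace string) string {
	return name + "_" + namespace
}
```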
Which issue(s) this PR fixes:
Fixes #97722
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: