-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix PodAntiAffinity issues in case of multiple affinityTerms #68173
Fix PodAntiAffinity issues in case of multiple affinityTerms #68173
Conversation
/sig scheduling |
/retest |
071809f
to
7dc345d
Compare
7dc345d
to
f56568d
Compare
}, | ||
}, | ||
}, | ||
wantAffinityPods: make(map[string][]*v1.Pod), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You used nil
above instead of make()
, should be consistent :D
btw, I think nil is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, lol...
getPodsMatchingAffinity
simply return nil if it doesn't have affinity/antiaffinity:
kubernetes/pkg/scheduler/algorithm/predicates/metadata.go
Lines 376 to 382 in 058b26f
func getPodsMatchingAffinity(pod *v1.Pod, nodeInfoMap map[string]*schedulercache.NodeInfo) (affinityPods map[string][]*v1.Pod, antiAffinityPods map[string][]*v1.Pod, err error) { | |
allNodeNames := make([]string, 0, len(nodeInfoMap)) | |
affinity := pod.Spec.Affinity | |
if affinity == nil || (affinity.PodAffinity == nil && affinity.PodAntiAffinity == nil) { | |
return nil, nil, nil | |
} |
But Mr. reflect.DeepEqual
is unhappy to treat "nil" as equality of a "nil map" - so in some cases, I put nil, while some I put nil map there to make it happy...
BTW: apiequality.Semantic.DeepEqual
is considerate to say yes to "nil equals to nil map". But maybe not worth introducing an extra package dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, seems like using apiequality.Semantic.DeepEqual
is a approach. I noticed that function in other test files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Have to rebase code as PR #67788 refactored the data structure.
Now it's all consistent. PTAL.
|
||
// TestGetPodsMatchingAffinity_AntiAffinity tests againt method getPodsMatchingAffinity | ||
// on Anti Affinity cases | ||
func TestGetPodsMatchingAffinity_AntiAffinity(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very detailed test, thanks!
Maybe I am verbose, but I am thinking if you can add a test case to verify this bug is fixed in predicates.go
as well?
Because we still have many logics there after we got the antiAffinityPods
from metadata, so it would be great if we can ensure this change does work and will not break anything in predicates.go
.
Sounds reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@resouer Thanks for the detailed review!
Sure, it's totally reasonable. Do you think is it good enough to add additional UTs for predicates.go#satisfiesPodsAffinityAntiAffinity()
? (step 2 of symmetry check this PR tries to fix)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, should be.
Or, I think you can just add one more test case to verify this in TestInterPodAffinity
? which seems quicker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added one testcase - which only passed with this PR. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is valid to me, while some nits need to be fixed first :-)
Also, I would suggest a test case for predicates.go
to verify this fix solves the reported issue.
f56568d
to
817fc2e
Compare
affPods = append(affPods, existingPod) | ||
} | ||
// Check anti-affinity properties. | ||
if podMatchesAffinityTermProperties(existingPod, antiAffinityProperties) { | ||
if podMatchesAnyAffinityTermProperties(existingPod, antiAffinityProperties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we check InterPodAffinityMatches
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, implicitly - InterPodAffinityMatches
calls satisfiesPodsAffinityAntiAffinity
, satisfiesPodsAffinityAntiAffinity
needs "meta", and "meta" is generated in GetMetadata
where calls getPodsMatchingAffinity
.
817fc2e
to
e17e00f
Compare
e17e00f
to
a73bcc1
Compare
/xref kubernetes-retired/bootkube#1001 The changes in this PR are causing
Checkpoint manifest json includes:
Removing the affinity section within the checkpoint manifest fixes the restore issue. I'll be investigating the issue further, but am commenting to see if the upstream authors have any thoughts. This issue is a breaking change on upgrades and new installs for 1.12.x. |
I suspect setting the affinity to nil when matchExpressions == null/nil would preserve the prior behavior. |
@rphillips thanks for bringing this up. If I understand correctly, you're suspicious that after applying this PR, scheduler rejects a pod which has |
Considering this PR doesn't go to v1.11.3, I did a test on v1.11.3: kind: Pod
apiVersion: v1
metadata:
name: pause
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions: null
containers:
- name: pause
image: k8s.gcr.io/pause:3.1 The above pod is also rejected. In other words, this PR seems not to break behavior of null matchExpressions. $ k get po
NAME READY STATUS RESTARTS AGE
pause 0/1 Pending 0 2m
$ k describe po pause | tail -3
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 1m (x25 over 3m) default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector. |
Kubernetes 1.12.x introduced new logic for Affinity [1]. In addition to new logic, the Pod contains a default affinity. The new default affinity gets serialized into the checkpoint file, and the 1.12.x kubelet does not restore the pod due to the affinity. This PR removes the affinity from the spec and documents that affinity's are not supported. ``` "affinity": { "nodeAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": { "nodeSelectorTerms": [ { "matchExpressions": null } ] } } }, ``` [1] kubernetes/kubernetes#68173 [2] https://github.com/kubernetes/kubernetes/blob/e39b510726113581c6f6a9c2db1753d794aa9cce/pkg/controller/daemon/util/daemonset_util.go#L183-L196
@Huang-Wei Thank you for the prompt reply. The concern is with a static pod the kubelet will restore on restart. Bootkube, Tectonic, and other distributions may use a checkpointer to save the state in a manifest.
|
Note: We do not create the checkpointer pod with an affinity, so I'm not sure where it is being added in. |
Agree, this is the key point. |
Kubernetes 1.12.x introduced new logic for Affinity [1]. In addition to new logic, the Pod contains a default affinity. The new default affinity gets serialized into the checkpoint file, and the 1.12.x kubelet does not restore the pod due to the affinity. This PR removes the affinity from the spec and documents that affinity's are not supported. ``` "affinity": { "nodeAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": { "nodeSelectorTerms": [ { "matchExpressions": null } ] } } }, ``` [1] kubernetes/kubernetes#68173 [2] https://github.com/kubernetes/kubernetes/blob/e39b510726113581c6f6a9c2db1753d794aa9cce/pkg/controller/daemon/util/daemonset_util.go#L183-L196
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Use a patched pod-checkpointer that strips affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Use a patched pod-checkpointer that strips affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Use a patched pod-checkpointer that strips affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* checkpointer: ignore Affinity within podspec Kubernetes 1.12.x introduced new logic for Affinity [1]. In addition to new logic, the Pod contains a default affinity. The new default affinity gets serialized into the checkpoint file, and the 1.12.x kubelet does not restore the pod due to the affinity. This PR removes the affinity from the spec and documents that affinity's are not supported. ``` "affinity": { "nodeAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": { "nodeSelectorTerms": [ { "matchExpressions": null } ] } } }, ``` [1] kubernetes/kubernetes#68173 [2] https://github.com/kubernetes/kubernetes/blob/e39b510726113581c6f6a9c2db1753d794aa9cce/pkg/controller/daemon/util/daemonset_util.go#L183-L196 * gofmt * golang: bump to 1.11.1 * fixes checkpointer run * checkpointer test will have a modified asset file
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strips affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
* Mount an empty dir for the controller-manager to work around kubernetes/kubernetes#68973 * Update coreos/pod-checkpointer to strip affinity from checkpointed pod manifests. Kubernetes v1.12.0-rc.1 introduced a default affinity that appears on checkpointed manifests; but it prevented scheduling and checkpointed pods should not have an affinity, they're run directly by the Kubelet on the local node * kubernetes-retired/bootkube#1001 * kubernetes/kubernetes#68173
What this PR does / why we need it:
There are 2 phases to verify symmetry of PodAntiAffinity:
This PR is to fix issue in second step.
Currently logic of second step tries to invoke
podMatchesAffinityTermProperties
when checking both affinity and antiaffinity.It's not that accurate, as
podMatchesAffinityTermProperties
is a "match all" semantics, say for each term, ensure it matches target selector/labels, and return true if all term passes check. This is good for affinity, but not antiaffinity.Regarding antiaffinity, it's a "match any" semantics - i.e. if found any term match target selector/labels, return true.
So without this fix, unit tests (which is added in this PR) failed on following cases:
Case a) is guarded in before-mentioned step 1 of symmetric logic. So we don't see it in a running k8s cluster. Case b) and c) are real problems and specifically case c) is what issue #68101 reported.
Which issue(s) this PR fixes:
Fixes #68101
Special notes for your reviewer:
After reviewing the correctness of this PR, bsalamat and k82cn can decide if it needs to go into 1.12 release cycle.
Release note: