Ipvs: non-local access to externalTrafficPolicy:Local #97081
Conversation
@uablrek: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @andrewsykim
/retest
We are also facing the problem described in #93456 and are looking for a fix. Hopefully this patch will go in soon.
@andrewsykim any idea when we can take this patch in?
pkg/proxy/ipvs/proxier.go
Outdated
@@ -2046,6 +2046,12 @@ func (proxier *Proxier) syncEndpoint(svcPortName proxy.ServicePortName, onlyNode
		newEndpoints.Insert(epInfo.String())
	}

	if len(newEndpoints) == 0 && onlyNodeLocalEndpoints {
		for _, epInfo := range proxier.endpointsMap[svcPortName] {
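A hedged sketch of what the truncated hunk above appears to add, reconstructed only from the lines visible here (identifier names follow the snippet; the actual PR code may differ): when the service uses only node-local endpoints but none exist on this node, fall back to inserting every known endpoint for the port.

	if len(newEndpoints) == 0 && onlyNodeLocalEndpoints {
		// No local endpoints on this node: fall back to all endpoints for
		// this service port, mimicking the iptables proxier (see #93456).
		for _, epInfo := range proxier.endpointsMap[svcPortName] {
			newEndpoints.Insert(epInfo.String())
		}
	}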
I'm wondering if we should defer this logic until we implement internalTrafficPolicy #96600? That way the fallback to cluster-wide endpoints is opt-in by the user and expected.
How far ahead in time is it? This is a very small PR and may serve as a gap-filler.
That PR is targeted for v1.21. With that PR you can set internalTrafficPolicy: PreferLocal, which I think would achieve the same behavior as this. Is that true?
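For illustration, a minimal Go sketch of what opting in via internalTrafficPolicy could look like, assuming the field and the "PreferLocal" value proposed in #96600; that proposal had not merged at this point and the value that finally ships may differ.

	package main

	import (
		v1 "k8s.io/api/core/v1"
	)

	// preferLocalService builds a Service that asks the proxy to prefer local
	// endpoints but fall back to cluster-wide ones, per the #96600 proposal.
	// "PreferLocal" is the proposed value, not a guaranteed API constant.
	func preferLocalService() *v1.Service {
		policy := v1.ServiceInternalTrafficPolicyType("PreferLocal")
		return &v1.Service{
			Spec: v1.ServiceSpec{
				Ports:                 []v1.ServicePort{{Port: 80}},
				InternalTrafficPolicy: &policy,
			},
		}
	}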
Sounds perfect 👍 I'll apply the PR and test it when I get time, hopefully tomorrow.
Cool. It is good to know that #96600 can solve the problem, and that PR will be in v1.21. Thanks @andrewsykim for your review and sharing, and thanks @uablrek for working on this PR.
internalTrafficPolicy can't be used to solve this issue anyway. Please see the comment below.
I have now tested First
I would suggest merging this PR. The CI has failed for a long time because of this problem and is now disabled, see kubernetes/test-infra#21447. @andrewsykim PTAL
@andrewsykim Can you please re-check this PR? It will solve the most urgent use-case, described in the PR description, and it will fix #100925 (comment). The problem for
/cc
pkg/proxy/ipvs/proxier.go
Outdated
// Insert non-local endpoints to mimic the behavior of the iptables proxier
// https://github.com/kubernetes/kubernetes/issues/93456
if len(newEndpoints) == 0 {
	newEndpoints = readyEndpoints
This is like implementing "PreferLocal" semantics specific to IPVS external VIPs. Instead of doing it implicitly, we should offer an API-driven way to allow it IMO (e.g. topology-aware routing, eTP=PreferLocal, etc)
(For readers - this came up at sig-net call today :)
This is like implementing "PreferLocal" semantics specific to IPVS
My thoughts exactly, and I think you know my feeling on PreferLocal.
Why do people set ETP=local?
- To ensure the preservation of client IP
- To optimize (avoid the "2nd hop")
This PR breaks both guarantees in a non-deterministic way, for all clients. I don't think that's OK.
IF we could do this ONLY for traffic that originates on "this" node, would it be OK? You'd still be guaranteeing (1) and there is no "2nd hop", so (2) is satisfied. I can see how to express this in iptables, but I don't know how to do it in IPVS - is it possible?
Note, there are 2 ways to spin this:
Assume a service with a NodePort and ETP=Local.
- Traffic from a client pod to the nodeport on that pod's node may/should/must(?) use non-local endpoints when local endpoints are not available. Such traffic is still considered "external", but this special case is afforded.
- Traffic from a client pod to the nodeport on that pod's node is considered "internal" and is subject to ITP rather than ETP. Traffic from a client pod to the nodeport on a different node is considered "external" and is subject to ETP.
In either case, we are ONLY discussing traffic from a client pod to a nodeport on that pod's node. Other traffic (pod->other-node or non-pod->any-node) is "external" and must respect ETP.
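A purely hypothetical Go sketch of the second option, only to make the classification explicit; none of these names exist in kube-proxy.

	// TrafficPolicy is a stand-in for the Cluster/Local policy values.
	type TrafficPolicy string

	const (
		PolicyCluster TrafficPolicy = "Cluster"
		PolicyLocal   TrafficPolicy = "Local"
	)

	// effectivePolicy applies ITP to traffic from a client pod hitting the
	// nodeport on its own node, and ETP to everything else.
	func effectivePolicy(srcIsPodOnThisNode bool, itp, etp TrafficPolicy) TrafficPolicy {
		if srcIsPodOnThisNode {
			return itp
		}
		return etp
	}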
@danwinship ICYMI
This PR breaks both guarantees in a non-deterministic way, for all clients.
How?
On nodes with local endpoints there is no difference at all, so nothing is broken.
On nodes without local endpoints "other" endpoints are inserted, which means:
- From a POD on a node without endpoints: This will break 2) (obviously) but not 1). The (pod) source is preserved.
- From main netns on a node (including hostNetwork:true): This does not work, please see [ipvs] in-cluster traffic for loadbalancer IP with externalTrafficPolicy=Local should use cluster-wide endpoints #93456 (comment)
- External traffic: traffic arriving on a node with endpoints works as usual (no difference). Traffic arriving on a node without endpoints will not work but "in a roundabout and inefficient way" [ipvs] in-cluster traffic for loadbalancer IP with externalTrafficPolicy=Local should use cluster-wide endpoints #93456 (comment)
The most common use-case is when a service is accessed from PODs but the service may be local or on another cluster. The application wants to access the service in a uniform way using its external (loadBalancer) IP. But for services with externalTrafficPolicy:Local this will fail if no server is executing on the local node.
The use-case is not uncommon and is in several cases a reason not to use ipvs. Proxy-mode=iptables has this function without an API-driven way to allow it, and IMHO ipvs should work in a similar way.
Real use-case
The reason I was asked to bring this up was an application that had the problem described above but also wanted to use source based policy routing to direct return traffic from a loadBalancerIP to a specific interface on the node (in telco there may be very many interfaces). But that only works with proxy-mode=ipvs! So they were stuck.
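To make the use-case concrete, a small hedged client sketch (the address and port are placeholders): the pod always dials the service's external loadBalancer IP, regardless of whether the backing service runs in this cluster or another one.

	package main

	import (
		"fmt"
		"net"
		"time"
	)

	func main() {
		// With proxy-mode=ipvs and externalTrafficPolicy:Local today, this dial
		// fails when no server pod runs on the client's node; with this PR (as
		// with proxy-mode=iptables) it falls back to a non-local endpoint.
		conn, err := net.DialTimeout("tcp", "203.0.113.10:80", 3*time.Second)
		if err != nil {
			fmt.Println("connect failed:", err)
			return
		}
		defer conn.Close()
		fmt.Println("connected via", conn.RemoteAddr())
	}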
Maybe I misunderstand the patch - this seems to apply both to traffic from pods and to actual external traffic. Is that wrong?
External traffic: traffic arriving on a node with endpoints works as usual (no difference). Traffic arriving on a node without endpoints will not work but "in a roundabout and inefficient way"
I see. If we don't SNAT it, then yes we are setting it up to fail, which is what we want. So as you say "roundabout and inefficient". Hmm. Super subtle that is.
I can see how to express this in iptables, but I don't know how to do it in IPVS - is it possible?
I guess the IPVS expression is "don't SNAT". But don't we always use IPVS masquerade mode?
in a roundabout and inefficient way
I quoted @danwinship from #97081 (comment) 😄
But don't we always use IPVS masquerade mode?
Yes, but the meaning in ipvs is different from iptables: it means DNAT. The VIP address is NAT'ed to the pod address. The SNAT (masquerade) is made by iptables rules even for proxy-mode=ipvs.
For ETP=local the SNAT is bypassed with a match on the KUBE-LOAD-BALANCER-LOCAL ipset:
Chain KUBE-LOAD-BALANCER (1 references)
pkts bytes target prot opt in out source destination
1 60 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* Kubernetes service load balancer ip + port with externalTrafficPolicy=local */ match-set KUBE-LOAD-BALANCER-LOCAL dst,dst
0 0 KUBE-MARK-MASQ all -- * * 0.0.0.0/0 0.0.0.0/0
I'm OK to proceed on this - it seems directionally aligned and a net improvement in the short term. @andrewsykim - it's yours to approve.
(rebased)
/lgtm
/hold
Holding since it needs rebase
pkg/proxy/ipvs/proxier.go
Outdated
@@ -2004,6 +2004,12 @@ func (proxier *Proxier) syncEndpoint(svcPortName proxy.ServicePortName, onlyNode
			newEndpoints = localReadyTerminatingEndpoints
		}
	}

	// Insert non-local endpoints to mimic the behavior of the iptables proxier
It would be helpful to add more comments on the justification for this change, specifically about how this only applies to nodes that don't have local endpoints anyway.
Allow access to externalTrafficPolicy:Local services from PODs not on a node where a server executes. Problem described in kubernetes#93456
// https://github.com/kubernetes/kubernetes/pull/97081
// Allow access from local PODs even if no local endpoints exist.
// Traffic from an external source will be routed but the reply
// will have the POD address and will be discarded.
but the reply will have the POD address and will be discarded.
I'm still scratching my head on this one to be honest, and I think I'm just missing something.
Why are replies to external sources discarded? Wouldn't conntrack keep track of the NAT entry here and know to map the reply packet from the Pod back to the external source? This is effectively how the VIPs work when externalTrafficPolicy=Cluster, so what is happening differently here that changes that behavior (maybe an iptables rule that is specific to Local)?
IIUC, this bypasses SNAT, so the eventual reply will go to the original client (DSR), not the proxying node. That client THINKS it is talking to the service IP but the response comes from the pod IP.
I see -- and this is safe since the fallback only happens when traffic policy is "Local", so we're guaranteed that SNAT is never enabled.
/lgtm
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
This was LGTM before the freeze - do we want to push for it to go in?
What type of PR is this?
This PR adds the fix presented by @jpiper in #75262 (comment).
/kind feature
What this PR does / why we need it:
Allow access to externalTrafficPolicy:Local services from PODs not on a node where a server executes. Problem described in #93456
The most common use-case is when a service is accessed from PODs but the service may be local or on another cluster. The application wants to access the service in a uniform way using its external (loadBalancer) IP. But for services with externalTrafficPolicy:Local this will fail if no server is executing on the local node. The feature is already implemented for proxy-mode=iptables.
Which issue(s) this PR fixes:
Partially Fixes #93456
"Partially" because access from the main netns (e.g a
hostNetwork:True
POD) on a K8s node where the server POD does not execute will fail. This is not easily fixed since the route to thekube-ipvs0
must be altered, please see further #93456 (comment).This PR will however fix the problem wast majority of users and it is quite small.
Special notes for your reviewer:
Does this PR introduce a user-facing change?: