
Ipvs: non-local access to externalTrafficPolicy:Local #97081

Merged (1 commit) Mar 30, 2022

Conversation

@uablrek (Contributor) commented Dec 5, 2020

What type of PR is this?

This PR adds the fix presented by @jpiper in #75262 (comment).

/kind feature

What this PR does / why we need it:

Allow access to externalTrafficPolicy:Local services from pods not on a node where a server executes. The problem is described in #93456.

The most common use-case is when a service is accessed from pods, but the service may be local or on another cluster. The application wants to access the service in a uniform way using its external (loadBalancer) IP. But for services with externalTrafficPolicy:Local this will fail if no server is executing on the local node.

The feature is already implemented for proxy-mode=iptables.
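
(For illustration, the kind of Service this concerns can be constructed with the k8s.io/api/core/v1 types as in the sketch below; the name, selector, and ports are invented for the example:)

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// A load-balanced Service with externalTrafficPolicy: Local: only
	// node-local endpoints serve external traffic, which preserves the
	// client source IP but fails on nodes without a local endpoint.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "example-svc"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeLoadBalancer,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app": "example"},
			Ports: []corev1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
	fmt.Printf("%s: externalTrafficPolicy=%s\n", svc.Name, svc.Spec.ExternalTrafficPolicy)
}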

Which issue(s) this PR fixes:

Partially Fixes #93456

"Partially" because access from the main netns (e.g a hostNetwork:True POD) on a K8s node where the server POD does not execute will fail. This is not easily fixed since the route to the kube-ipvs0 must be altered, please see further #93456 (comment).

This PR will however fix the problem for the vast majority of users, and it is quite small.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot added the release-note-none, kind/feature, size/XS, do-not-merge/needs-kind, cncf-cla: yes, do-not-merge/needs-sig, and needs-triage labels, and removed do-not-merge/needs-kind (Dec 5, 2020)
@k8s-ci-robot (Contributor):

@uablrek: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-priority label (Dec 5, 2020)
@uablrek (Contributor, Author) commented Dec 5, 2020

/assign @andrewsykim

@k8s-ci-robot added the area/ipvs and sig/network labels and removed do-not-merge/needs-sig (Dec 5, 2020)
@uablrek (Contributor, Author) commented Dec 5, 2020

/retest

@namnx228 commented Jan 7, 2021

We are also facing the problem described in #93456 and looking for a fix. Hopefully this patch will go in soon.

@zoetrope mentioned this pull request Jan 25, 2021
@kashifest commented:
@andrewsykim any idea when we can take this patch in?

ymmt2005 added a commit to cybozu/neco-containers that referenced this pull request Jan 28, 2021
@@ -2046,6 +2046,12 @@ func (proxier *Proxier) syncEndpoint(svcPortName proxy.ServicePortName, onlyNode
		newEndpoints.Insert(epInfo.String())
	}

	if len(newEndpoints) == 0 && onlyNodeLocalEndpoints {
		for _, epInfo := range proxier.endpointsMap[svcPortName] {
Member:

I'm wondering if we should defer this logic for when we implement internalTrafficPolicy (#96600)? This way the behavior to fall back to cluster-wide endpoints is opt-in by the user and expected.

Contributor Author:

How far ahead in time is it? This is a very small PR and may serve as a gap-filler.

Member:

That PR is targeted for v1.21. With that PR you can set internalTrafficPolicy: PreferLocal, which I think would achieve the same behavior as this. Is that true?

Contributor Author:

Sounds perfect 👍 I'll apply the PR and test it when I get time, hopefully tomorrow.

Comment:

Cool. It is good to know that #96600 can solve the problem, and that PR will be in v1.21. Thanks @andrewsykim for your review and sharing, and thanks @uablrek for working on this PR.


Contributor Author:

internalTrafficPolicy can't be used to solve this issue anyway. Please see the comment below.
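
(For context, the fallback semantics debated in this thread can be sketched in a few lines of illustrative Go; this is not the actual kube-proxy code, and the function and variable names are invented:)

package main

import "fmt"

// pickEndpoints sketches the fallback: with node-local-only semantics,
// use the local endpoints when any exist; otherwise fall back to the
// cluster-wide set so a client pod on an endpoint-less node still gets
// a connection.
func pickEndpoints(local, all []string, onlyNodeLocal bool) []string {
	if !onlyNodeLocal {
		return all
	}
	if len(local) > 0 {
		return local
	}
	// No local endpoints: fall back to cluster-wide endpoints,
	// mirroring what the iptables proxier already does.
	return all
}

func main() {
	all := []string{"10.244.1.5:8080", "10.244.2.7:8080"}
	fmt.Println(pickEndpoints(nil, all, true))                         // no local endpoints: cluster-wide fallback
	fmt.Println(pickEndpoints([]string{"10.244.1.5:8080"}, all, true)) // local endpoints win
}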

@uablrek (Contributor, Author) commented Mar 22, 2021

I have now tested internalTrafficPolicy: PreferLocal on K8s v1.21.0-beta.1 where #96600 is merged.

First, internalTrafficPolicy: PreferLocal is not allowed. I am unsure if it ever will be, or if PreferLocal is no longer applicable.

internalTrafficPolicy: Local has no effect on externalTrafficPolicy and does not help with the issue this PR is fixing, #93456.

I suggest merging this PR.

The CI has failed for a long time because of this problem and is now disabled; see kubernetes/test-infra#21447.

@andrewsykim PTAL

@uablrek (Contributor, Author) commented May 19, 2021

@andrewsykim Can you please re-check this PR?

It will solve the most urgent use-case, described in the PR description, and it will fix #100925 (comment).

The problem for hostNetwork: true may need to be addressed in the future, but it is not so simple, since packets will get the VIP address as both src and dst. I can't see all the implications, so I would like to leave it for now.

@astoycos (Contributor) commented Mar 3, 2022

/cc

@k8s-ci-robot requested a review from astoycos (March 3, 2022, 22:59)
// Insert non-local endpoints to mimic the behavior of the iptables proxier
// https://github.com/kubernetes/kubernetes/issues/93456
if len(newEndpoints) == 0 {
	newEndpoints = readyEndpoints
}
Member:

This is like implementing "PreferLocal" semantics specific to IPVS external VIPs. Instead of doing it implicitly, we should offer an API-driven way to allow it IMO (e.g. topology-aware routing, eTP=PreferLocal, etc)

Member:

(For readers - this came up at sig-net call today :)

This is like implementing "PreferLocal" semantics specific to IPVS

My thoughts exactly, and I think you know my feeling on PreferLocal.

Why do people set ETP=local?

  1. To ensure the preservation of client IP
  2. To optimize (avoid the "2nd hop")

This PR breaks both guarantees in a non-deterministic way, for all clients. I don't think that's OK.

IF we could do this ONLY for traffic that originates on "this" node, would it be OK? You'd still be guaranteeing (1) and there is no "2nd hop", so (2) is satisfied. I can see how to express this in iptables, but I don't know how to do it in IPVS - is it possible?

Note, there are 2 ways to spin this:

Assume a service with a NodePort and ETP=Local.

  1. Traffic from a client pod to the nodeport on that pod's node may/should/must(?) use non-local endpoints when local endpoints are not available. Such traffic is still considered "external", but this special case is afforded.

  2. Traffic from a client pod to the nodeport on that pod's node is considered "internal" and is subject to ITP rather than ETP. Traffic from a client pod to the nodeport on a different node is considered "external" and is subject to ETP.

In either case, we are ONLY discussing traffic from a client pod to a nodeport on that pod's node. Other traffic (pod->other-node or non-pod->any-node) is "external" and must respect ETP.

Member:

@danwinship ICYMI

Contributor Author:

This PR breaks both guarantees in a non-deterministic way, for all clients.

How?

On nodes with local endpoints there is no difference at all, so nothing is broken.

On nodes without local endpoints "other" endpoints are inserted, which means:

  1. From a pod on a node without endpoints: this will break 2) (obviously) but not 1). The (pod) source is preserved.
  2. From the main netns on a node (including hostNetwork: true): this does not work, please see #93456 (comment).
  3. External traffic: traffic arriving on a node with endpoints works as usual (no difference). Traffic arriving on a node without endpoints will not work, but "in a roundabout and inefficient way" (#93456 (comment)).

The most common use-case is when a service is accessed from pods, but the service may be local or on another cluster. The application wants to access the service in a uniform way using its external (loadBalancer) IP. But for services with externalTrafficPolicy:Local this will fail if no server is executing on the local node.

The use-case is not uncommon and in several cases is a reason not to use ipvs. Proxy-mode=iptables has this function without an API-driven way to allow it, and IMHO ipvs should work in a similar way.

Real use-case

The reason I was asked to bring this up was an application that had the problem described above, but also wanted to use source-based policy routing to direct return traffic from a loadBalancerIP to a specific interface on the node (in telco there may be very many interfaces). But that only works with proxy-mode=ipvs! So they were stuck.

Member:

Maybe I misunderstand the patch - this seems to apply both to traffic from pods and to actual external traffic. Is that wrong?

External traffic: traffic arriving on a node with endpoints works as usual (no difference). Traffic arriving on a node without endpoints will not work but "in a roundabout and inefficient way"

I see. If we don't SNAT it, then yes, we are setting it up to fail, which is what we want. So as you say, "roundabout and inefficient". Hmm. Super subtle, that is.

I can see how to express this in iptables, but I don't know how to do it in IPVS - is it possible?

I guess the IPVS expression is "don't SNAT". But don't we always use IPVS masquerade mode?

Contributor Author:

in a roundabout and inefficient way

I quoted @danwinship from #97081 (comment) 😄

Contributor Author:

But don't we always use IPVS masquerade mode?

Yes, but the meaning in ipvs is different from iptables; it means DNAT. The VIP address is NAT'ed to the pod address. The SNAT (masquerade) is done by iptables rules, even for proxy-mode=ipvs.

For ETP=local the SNAT is bypassed with a match on the KUBE-LOAD-BALANCER-LOCAL ipset:

Chain KUBE-LOAD-BALANCER (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    1    60 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Kubernetes service load balancer ip + port with externalTrafficPolicy=local */ match-set KUBE-LOAD-BALANCER-LOCAL dst,dst
    0     0 KUBE-MARK-MASQ  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
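
(The decision those two rules encode can be sketched as follows; illustrative Go only, since kube-proxy realizes this with the ipset match shown above, and the names and addresses here are made up:)

package main

import "fmt"

// vipPort identifies a loadBalancer IP and port, the key that the
// KUBE-LOAD-BALANCER-LOCAL ipset matches on (dst,dst).
type vipPort struct {
	ip   string
	port int
}

// shouldMasquerade sketches the rule pair above: a destination in the
// eTP=Local set hits the RETURN rule before KUBE-MARK-MASQ, so it is
// DNAT'ed by IPVS but never SNAT'ed, and the backend pod sees the real
// client source address.
func shouldMasquerade(dst vipPort, localLBSet map[vipPort]bool) bool {
	if localLBSet[dst] {
		return false // eTP=Local: skip SNAT to preserve the client IP
	}
	return true // eTP=Cluster: mark for masquerade (SNAT to a node address)
}

func main() {
	localLBSet := map[vipPort]bool{{ip: "10.0.0.1", port: 80}: true}
	fmt.Println(shouldMasquerade(vipPort{"10.0.0.1", 80}, localLBSet)) // false: SNAT bypassed
	fmt.Println(shouldMasquerade(vipPort{"10.0.0.2", 80}, localLBSet)) // true: masqueraded
}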

@thockin (Member) commented Mar 25, 2022

I'm OK to proceed on this - it seems directionally aligned and a net improvement in the short term.

@andrewsykim - it's yours to approve.

@uablrek (Contributor, Author) commented Mar 26, 2022

(rebased)

@k8s-ci-robot added the needs-rebase label (Mar 28, 2022)
@andrewsykim (Member) left a comment

/lgtm
/hold

Holding since it needs rebase

@@ -2004,6 +2004,12 @@ func (proxier *Proxier) syncEndpoint(svcPortName proxy.ServicePortName, onlyNode
			newEndpoints = localReadyTerminatingEndpoints
		}
	}

// Insert non-local endpoints to mimic the behavior of the iptables proxier
Member:

It would be helpful to add more comments on the justification for this change, specifically about how this only applies to nodes that don't have local endpoints anyway.

@k8s-ci-robot added the do-not-merge/hold and lgtm labels (Mar 29, 2022)
Allow access to externalTrafficPolicy:Local services from PODs
not on a node where a server executes. Problem described in kubernetes#93456
@k8s-ci-robot removed the lgtm and needs-rebase labels (Mar 29, 2022)
// https://github.com/kubernetes/kubernetes/pull/97081
// Allow access from local PODs even if no local endpoints exist.
// Traffic from an external source will be routed but the reply
// will have the POD address and will be discarded.
Member:

but the reply will have the POD address and will be discarded.

I'm still scratching my head on this one to be honest, and I think I'm just missing something.

Why are replies to external sources discarded? Wouldn't conntrack keep track of the NAT entry here and know to map the reply packet from the Pod back to the external source? This is effectively how the VIPs work when externalTrafficPolicy=Cluster, so what is happening differently here that changes that behavior (maybe an iptables rule that is specific to Local)?

Member:

IIUC, this bypasses SNAT, so the eventual reply will go to the original client (DSR), not the proxying node. That client THINKS it is talking to the service IP but the response comes from the pod IP.

Member:

I see -- and this is safe since the fallback only happens when the traffic policy is "Local", so we're guaranteed that SNAT is never enabled.
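
(To make the discarded-reply case concrete, with made-up addresses: an external client 203.0.113.9 connects to loadBalancer IP 10.0.0.1:80 on a node with no local endpoint. With the fallback, IPVS DNATs the packet to a remote pod, say 10.244.1.5:8080, but because eTP=Local skips the SNAT, the reply leaves with source address 10.244.1.5 rather than 10.0.0.1. The client receives a packet from an address it never contacted and drops it, so external traffic through endpoint-less nodes still fails, just in the "roundabout and inefficient way" noted above.)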

@andrewsykim (Member):

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm label and removed the do-not-merge/hold label (Mar 29, 2022)
@k8s-triage-robot commented:

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@thockin (Member) commented Mar 30, 2022

This was LGTM before the freeze - do we want to push for it to go in?
