
sig-network: Add egress-source-ip-support KEP #1105

Closed
wants to merge 3 commits

Conversation

@mkimuram (Contributor):

No description provided.

@k8s-ci-robot (Contributor):

Welcome @mkimuram!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jun 13, 2019
@k8s-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files), kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory), and sig/network (categorizes an issue or PR as relevant to SIG Network) labels on Jun 13, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkimuram
To complete the pull request process, please assign dcbw
You can assign the PR to them by writing /assign @dcbw in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

As a user of Kubernetes, I have some pods which require access to different databases that restrict access by source IP and exist outside the k8s cluster.
So, some pods which require database access need a specific egress source IP when sending packets to the database, and other pods need another specific egress source IP.

### Implementation Details/Notes/Constraints [optional]
Member:

Should explain how this will work on the nodes themselves. For example (and I haven't looked too deep into the implementation) does your implementation assign all the egress-source-ips to the node that the pod lives on? If so, how would that work in cloud environments that have tighter constraints on the IPs available to nodes. That kind of thing.

@mkimuram (PR author):

Thank you for your comment.

It assigns each IP to one of the nodes by leveraging keepalived-vip (https://github.com/kubernetes-retired/contrib/tree/master/keepalived-vip), then forwards packets from the node that the pod lives on to the node that has the specific IP, using iptables rules and a routing table.
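
For illustration, the per-node plumbing this implies looks roughly like the following (a minimal sketch only; the IPs, fwmark, and routing-table number are made up, and the PoC generates the equivalent rules automatically):

```
# On the node where the pod runs: steer the pod's non-cluster traffic toward
# the node currently holding the egress VIP instead of SNATing it locally.
POD_IP=10.244.2.7            # example pod IP
POD_CIDR=10.244.0.0/16       # example cluster pod CIDR
VIP_NODE=192.168.1.12        # example node currently holding the VIP via keepalived
EGRESS_IP=192.168.122.222    # example egress source IP (the VIP)

iptables -t mangle -A PREROUTING -s "$POD_IP" ! -d "$POD_CIDR" -j MARK --set-mark 100
ip rule add fwmark 100 table 100
ip route add default via "$VIP_NODE" table 100

# On the node holding the VIP: SNAT the pod's outbound traffic to the egress IP.
iptables -t nat -A POSTROUTING -s "$POD_IP" ! -d "$POD_CIDR" -j SNAT --to-source "$EGRESS_IP"
```

Keepalived is only responsible for keeping the VIP alive on exactly one node; rules like the above are what actually pin the source IP.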


## Proposal

Expose an egress API to users, like the one below, to allow them to assign a static egress source IP to specific pod(s).
Member:

Are you proposing a Custom Resource Definition, or an actual API resource here?

@mkimuram (PR author):

In my PoC implementation it uses a CRD, because it is implemented as a k8s operator that reconciles the iptables rules and routing tables on all nodes. However, I think that we still have the choice to define a k8s API or keep it as a CRD.
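
For reference, reassembling the fragments quoted later in this review, the PoC's object looks roughly like the following (the apiVersion/kind and the pod-reference field name are placeholders, not taken verbatim from the KEP text):

```
$ cat << EOF | kubectl create -f -
# Sketch only: group/version, kind, and the pod-reference field are illustrative.
apiVersion: egress.example.com/v1alpha1
kind: Egress
metadata:
  name: example-pod1-egress
spec:
  ip: 192.168.122.222
  podName: pod1
EOF
```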


## Summary

Egress source IP is a feature to assign a static egress source IP to packets sent from a pod to outside the k8s cluster.
Member:

Do we want to limit this to a pod or something more stable like a label selector? Naming pods explicitly in resources can be fragile -- pod names are meant to be temporary.

Member:

It would also be good to define what "outside the k8s cluster" means. Where is the boundary? Is it when the packet leaves the node, some notion of network, etc?

@mkimuram (PR author):

Good point. As I mentioned in the SIG meeting, I meant "within the private network"; networking across clouds was not within my scope. I will update the KEP.

Also, a label-based approach sounds good, because the use cases include mapping multiple pods to one IP; see the sketch below.
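
A selector-based variant could look something like this (a hypothetical sketch only; none of these field names are settled in the KEP):

```
$ cat << EOF | kubectl create -f -
# Hypothetical label-selector form of the egress object discussed above.
apiVersion: egress.example.com/v1alpha1
kind: Egress
metadata:
  name: db-clients-egress
spec:
  ip: 192.168.122.222
  podSelector:
    matchLabels:
      app: db-client
EOF
```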

Comment:

(TL;DR: I'm a newbie, and you are allowed to ignore this comment.)

I would say that I understand the original term "outside of the k8s cluster" to hint that the destination is outside the k8s cluster's pod CIDR and service CIDR.

If the destination were inside the kubernetes cluster, this wouldn't make much sense IMO
(i.e., the destination could use a NetworkPolicy for securing it).

From my understanding, keepalived just provides yet another VIP that acts like a Service of type LoadBalancer, which kube-egress needs to be able to bind, plus iptables and routing entries to do the right thing to ensure a pod's traffic gets its source IP set to the keepalived VIP before leaving the node.

(Note: I'm new here and might be totally off; I just wanted to leave a note. Coming from a world where I used to pet servers in a datacenter, keepalived is a good friend for providing HA for a load-balancer IP that is put into DNS, as it uses software-defined VRRP to ensure the VIP is always available. Regarding the sig-network meeting and getting ingress/egress towards such a keepalived VIP, it seems someone has already been on an adventure to get it to work together with cloud providers using Service type LoadBalancer: https://github.com/munnerz/keepalived-cloud-provider)

As also mentioned during the sig-network meeting, if running in a cloud you could probably achieve the stories by getting a dedicated source IP for the whole cluster using platform/cloud-provider-dependent configuration of ingress/egress to/from the kubernetes cluster; that would solve the use cases of reaching a destination that requires a whitelist of source IPs.

Since the KEP is targeting a set of pods, using a label selector makes sense.

@mkimuram (PR author):

I might have confused you by not putting details of what should be done and how it is achieved in my PoC implementation. I will update the KEP to clarify them.

However, a short explanation is:

My goal is not to make the PodIP or ClusterIP visible to applications not running on the k8s cluster.
Instead, the goal is to make certain pods' source IPs appear as a fixed one to applications not running on the k8s cluster. To do that, I'm thinking about making an N:1 mapping of Pods to a VIP when the PodIP is SNATed. (Actually, the VIP could be assigned in any way, as a k8s LoadBalancer allows, but my PoC implementation just used keepalived-vip.)

My intention in excluding use cases like cross-cloud above was to exclude scenarios where there is another SNAT, VPN, and so on between the k8s cluster and the applications, which would require much more work than just doing such a mapping when leaving the k8s cluster.

Contributor:

I think there are use cases for both "outside the cluster but within the private network" and "all the way out to the internet". (eg, take any of the user stories below and assume that the kubernetes cluster is in a public cloud but the database server is not)

Comment:

Some things popping up in my mind:

I wonder if this is something that could be done on service IPs; they are already VIPs inside the k8s cluster. So could the SNATing you already do in the PoC be applied in kube-proxy?
How would this work for IPVS?
I.e., is it okay to only support iptables and not IPVS?

(I think I have a vague memory of @thockin maybe mentioning a customer who wanted similar support for egress on service VIPs during the sig-network call?)

@mkimuram (PR author):

> I think there are use cases for both "outside the cluster but within the private network" and "all the way out to the internet". (eg, take any of the user stories below and assume that the kubernetes cluster is in a public cloud but the database server is not)

O.K. Let's also consider "all the way out to the internet". And if needed, let's set another milestone to achieve it.

> Is it okay to only support iptables and not IPVS?

I think that it is a good idea to allow different implementations to forward packets for egress, as kube-proxy does. Also, we might be able to leverage Service as a mechanism to trace the PodIP. I will add a description of this to the "Design Details" section of this KEP so it can be discussed in detail.

metadata:
  name: example-pod1-egress
spec:
  ip: 192.168.122.222
Contributor:

What would the restrictions/instructions for this IP be? Any IP in the node CIDR?

Also, is this meant to be sharable, or unique per-Egress?

@mkimuram (PR author):

It is "Any IP in the node CIDR" and is sharable. I will add this to KEP.

@bjhaid commented Jun 16, 2019:

This is the calico feature I mentioned on the last sig-network call:

https://docs.projectcalico.org/v3.7/reference/cni-plugin/configuration#requesting-a-specific-ip-address


### Goals

Provide users with an official and common way to assign a static egress source IP to packets sent from a pod to outside the k8s cluster.
Contributor:

"a static egress source IP for packages from one or more pods" isn't it? (eg, Story 2 below)

@mkimuram (PR author):

I will fix it.

name: pod1
```

PoC implementation is
Contributor:

So what is the expectation of how the KEP'ed version of the feature would differ from the PoC?

More specifically, if this is something that can already be implemented entirely outside of kubernetes, then does it benefit from being moved into kubernetes?

Are you expecting that kubernetes would adopt essentially the PoC implementation, or merely the API surrounding it? Would this be something that would be core to Kubernetes, or would it be implemented by network plugins (who might be able to optimize in various ways that a generic implementation could not)?

@mkimuram (PR author):

> More specifically, if this is something that can already be implemented entirely outside of kubernetes, then does it benefit from being moved into kubernetes?

Benefits that I expect are:

  • Make it work with any CNI driver (some CNI drivers won't work well with just my current PoC implementation)
  • Define a stable API that can stay compatible with future k8s versions (I won't insist on making it a core k8s API, as long as it can keep compatibility, just as the volume snapshot feature is implemented as a CRD)
  • Make use of existing k8s mechanisms like kube-proxy and Service, if possible and useful

Then it will provide users with the same UX across any k8s cluster, and it will decrease developers' burden of maintaining compatibility for this feature.

@mkimuram (PR author):

@dcbw @bowei @norrs @vllry @bjhaid @danwinship

Thank you for your feedback. I've updated the KEP based on feedback.
Please check it.


## Motivation

In k8s, egress traffic has its source IP translated (SNAT) to appear as the node IP when it leaves the cluster. However, many devices and applications use IP-based ACLs to restrict incoming traffic for security reasons and bandwidth limitations. As a result, this kind of ACL outside the k8s cluster will block packets from the pod, which causes a connectivity issue. To resolve this issue, we need a feature to assign a particular static egress source IP to one or more particular pods.
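
For context, the SNAT behavior described above typically comes from a masquerade rule that many network plugins (not Kubernetes itself) install on each node, along these lines (the pod CIDR is an example):

```
# Pod traffic leaving the cluster network gets rewritten to the node's IP,
# which is what the external IP-based ACLs end up seeing.
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
```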
Member:

> In k8s, egress traffic has its source IP translated (SNAT) to appear as the node IP when it leaves the cluster

Is this defined somewhere? My understanding is that many network plugins can do this, but it's not mandated or enforced by Kubernetes itself. You could set up a cluster to maintain original source IP address if so desired.

In that sense, I feel like this is adding an additional responsibility to Kubernetes which currently exists on the other side of the plugin boundary. Are we sure that's the right thing here?

@mkimuram (PR author):

@caseydavenport

Thank you for your feedback.

> Is this defined somewhere? My understanding is that many network plugins can do this, but it's not mandated or enforced by Kubernetes itself.

You are right that this is not defined as part of the k8s network model; it is just the behavior of some CNI plugins' implementations. So, I will rephrase it to make it more accurate in the KEP.

> My understanding is that many network plugins can do this, but it's not mandated or enforced by Kubernetes itself.
> You could set up a cluster to maintain original source IP address if so desired.

In my understanding, some plugins can send packets from Pods directly to outside the k8s cluster, and they can assign a particular PodIP to a Pod in their own ways. However, this feature still has value that I believe existing plugins alone won't provide, as below:

  • Compatibility and interoperability: it provides a common interface to achieve this for any CNI plugin or cloud provider
  • Multiple-Pod use case: it will also solve use cases like Story 2 without adding many IPs to the ACL
  • Across-the-internet use case: it will also add a knob to expand the ability to handle the "(2) internet that is outside the private network where the k8s cluster is running" use case

> In that sense, I feel like this is adding an additional responsibility to Kubernetes which currently exists on the other side of the plugin boundary. Are we sure that's the right thing here?

I hope the above justifies adding the new responsibility to k8s.
This won't change the k8s network model, but adds a common way to solve common use cases, just like Service and Ingress do.

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Oct 9, 2019
@fejta-bot:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Nov 8, 2019
@skydoctor:

Wondering if there has been any progress on this?

@uablrek commented Dec 6, 2019:

Kubernetes can define an Egress object in the same way as the Ingress object. Some cloud providers already allow the egress IP to be specified, but outside "standard" Kubernetes (*). With a defined Egress object the way of defining an egress IP would be uniform; again, compare with Ingress.

The implementation of an egress-IP function will be dependent on the network environment, such as the CNI plugin, external routers, and network security policies. A generic in-cluster solution cannot be provided, and therefore Kubernetes should not implement this function IMO.

And BTW, the PoC referred to in this PR and the more mature https://github.com/nirmata/kube-static-egress-ip are not generic.

@skydoctor:

Agree that an Egress object in K8s with cloud-provider specific implementation would be ideal. Similar to how Service type LoadBalancer allows each cloud-provider to assign IP addresses in their own ways. But just like the Service object which specifies which pods should receive packets from that ingress external IP, we need some native K8s object which specifies which pods should be able to send out through the external egress IP.

@thockin self-assigned this on Dec 18, 2019
@thockin (Member) left a comment:

I went to the bottom of my email list and found this - it wasn't assigned to me so it never got priority. Sorry.

Thoughts.

  • Does the IP have to be a node's IP or can it be any IP in the network? Can it be a public IP? Can it be different for different destinations (e.g. an RFC-1918 IP if dest is RFC-1918, but a public IP if dest is not RFC-1918)?

  • Can or must the user ask for a specific IP or is it allocated by the controller?

  • What happens if the user asks for an IP that is already in use?

  • What happens if two Egress resources select the same pod, which IP is used? Perhaps this should be based on ServiceAccount or something else that is 1:1 instead?

  • I understand how you can make a single pod work. How can you make multiple pods work?

  • Overall if something like this is to be developed, I think it should be a CRD and live outside the core for the foreseeable future. I'd want to see several implementations to explore the intricacies of the idea.

That said, this is a few months old - is it something you still want to pursue?

@nitishm commented Dec 18, 2019:

@thockin Thanks for sharing those thoughts which are definitely something to be considered. This feature is still something we need and wish to pursue.
I can imagine this requirement is going to be something more and more internet/telco based organizations will require (similar to ours).

@bowei (Member) commented Dec 18, 2019:

cc: @satyasm @vbannai

@mkimuram (PR author):

Thank you for the comments, and sorry for the late response.

I'm thinking about making this feature available outside k8s, as it would require a CNI-plugin- or cloud-provider-specific implementation, which the k8s community would like to avoid adding to k8s. (For me, this feature doesn't need to be in k8s, as long as there is a well-maintained common way. Some might not agree with that, though.)

So, improving a project like kube-static-egress-ip would be one way to achieve this goal. (Thank you for sharing information on the project. @uablrek)

Another idea is to extend submariner's scope from cluster-to-cluster to cluster-to-non-cluster. The summary of my idea is making pods accessible via podIP from outside the cluster without NAT, instead of assigning a 1:1 static NAT accessible from outside the cluster. (It's still just an idea and I haven't discussed it with the submariner community yet, though.)

By doing this, it would allow:

  • Preserving the source IP (podIP) to outside the cluster where the pod is running
  • Connectivity over the internet, not just inside the same LAN
  • An HA gateway on the k8s cluster
  • Offloading IP address management to k8s (by utilizing podIP)

However, the ability to assign a specific IP is missing in this case. So, we need to think about a common way to solve it for some use cases, like an IP-based external firewall where the IPs are expected to be specific fixed values. This feature would be challenging, as @thockin pointed out, and it seems to be the kind of feature that would hardly be accepted into k8s.
(Actually, we won't need to stick to assigning static podIPs in the k8s cluster to achieve this goal. For example, we could do 1:1 static NAT in the remote (non-cluster-side) gateway by reconciling the mapping between podIP and floating IP, as is done in the "local" gateway in kube-static-egress-ip; see the sketch below.)
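
A minimal sketch of such a remote-gateway mapping, assuming the gateway can route to the pod network and using example addresses only:

```
POD_IP=10.244.2.7            # example pod IP, routable from the remote gateway
FLOATING_IP=192.168.122.222  # example floating IP owned by the remote gateway

# Outbound: traffic from the pod appears to come from the floating IP.
iptables -t nat -A POSTROUTING -s "$POD_IP" -j SNAT --to-source "$FLOATING_IP"
# Inbound: traffic sent to the floating IP is forwarded to the pod.
iptables -t nat -A PREROUTING -d "$FLOATING_IP" -j DNAT --to-destination "$POD_IP"
```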

@mkimuram (PR author) commented Jan 7, 2020:

Let me share the details of my implementation idea that I mentioned in my previous comment.
(It uses iptables and an ssh tunnel; however, there might be a better way. Any feedback is welcome.)

Let's assume a use case as described in figure 1:

  • The source IP for access from Client Pod#1 to port 8000 of External Server (A) should be External IP#1,
  • The source IP for access from Client Pod#2 to port 8000 of External Server (A) should be External IP#2.

[figure 1]

(We will be able to extend it to multiple external servers and/or N:1 mapping of pods to an external source IP, later.)

It can be achieved by the steps below (see figure 2):

  • Create a gateway server that has External IP#1 and External IP#2, and run an sshd service for each IP on the server (the gateway server needs to be able to reach the external server from those IPs),
  • Create a forwarder pod for the external server, create an ssh tunnel to the server on a different local port for each external IP, and add iptables rules to forward to the right tunnel depending on the source IP of the client pod (sshd needs to be accessible from the forwarder pod),
  • Create a service for the forwarder pod to expose it to client pods,
  • Then, each client pod will be able to access the external server via the service, and the source IP for the access to the external server will be as expected.

[figure 2]

This concept can be extended as below:

  • To add more external IPs, assign the external IPs to the gateway server and run sshd for each external IP,
  • To consume an external IP, create an ssh tunnel for the external IP in the forwarder pod,
  • To assign multiple pods to an external IP, add iptables rules forwarding those pods to the ssh tunnel,
  • To add external servers, add a forwarder pod for each external server and create iptables rules and an ssh tunnel for it inside the forwarder pod.

One forwarder pod needs to be created per external server, but the forwarding rules and ssh tunnels for a server exist only in the forwarder pod for that particular external server. Therefore, no change is required on the gateway server side for podIP changes or for mapping changes for other external servers. We will be able to create a k8s operator to reconcile the mapping in forwarder pods without introducing much complexity.

In addition, this idea only assumes that "pods on a node can communicate with all pods on all nodes without NAT", therefore it should work well with any CNI plugin and any cloud provider.
Also, if k8s clusters are connected at the pod network level, for example by using submariner, any external network accessible by one of the k8s clusters will be accessible by all the other k8s clusters (we can even create a small k8s cluster, like kind, in a certain external network just to access that external network from the other k8s clusters).

I manually confirmed that the above concept works well for the configuration of the above use case, using the steps below:

  • Setup test environment

[On external server]

  1. Confirm IP address of external server
# ip addr show eth0 | awk '/inet /{print $2}'
192.168.122.139/24
  2. Run service on the node (HTTP server on port 8000)
# python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

[On Gateway node]

  1. Assign IP addresses to be used as external IPs and run sshd for each IP (Just for test, see (*1))

[On k8s]

  1. Create client1 and client2
$ kubectl run client1 --image=centos:7 --restart=Never --command -- bash -c "sleep 10000"
$ kubectl run client2 --image=centos:7 --restart=Never --command -- bash -c "sleep 10000"
  2. Create forwarder pod
$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: fwd-pod
  labels:
    app: fwd-pod
spec:
  containers:
  - command:
    - bash
    - -c
    - sleep 10000
    image: centos:7
    name: fwd-pod
    securityContext: 
      privileged: true
EOF
  3. Confirm pod IPs
$ kubectl get pod -o wide
NAME      READY   STATUS    RESTARTS   AGE    IP           NODE               NOMINATED NODE   READINESS GATES
client1   1/1     Running   0          8m7s   10.244.2.7   cluster1-worker    <none>           <none>
client2   1/1     Running   0          8m3s   10.244.2.8   cluster1-worker    <none>           <none>
fwd-pod   1/1     Running   0          24s    10.244.2.9   cluster1-worker    <none>           <none>

[On forwarder pod ($ kubectl exec -it fwd-pod bash)]

  1. Create ssh tunnel to external server via gateway node
# yum install -y openssh-clients

# EXTIP1=192.168.122.200
# EXTIP2=192.168.122.201
# EXTSVRIP=192.168.122.139

# ssh -g -f -N -L 8001:$EXTSVRIP:8000 $EXTIP1
# ssh -g -f -N -L 8002:$EXTSVRIP:8000 $EXTIP2
  2. Test accessibility and check source IP on HTTP server
# curl localhost:8001
# curl localhost:8002

(Stdout from SimpleHTTPServer)

192.168.122.200 - - [20/Dec/2019 17:04:13] "GET / HTTP/1.1" 200 -
192.168.122.201 - - [20/Dec/2019 17:04:15] "GET / HTTP/1.1" 200 -
  3. Set up iptables rules to forward to different ports depending on the source IP
# yum install -y iptables

# CLIENT1IP=10.244.2.7
# CLIENT2IP=10.244.2.8
# FWDPODIP=10.244.2.9
# EXTSVRIP=192.168.122.139

# iptables -A PREROUTING -t nat -m tcp -p tcp --dst $FWDPODIP --src $CLIENT1IP --dport 8000 -j DNAT --to-destination $FWDPODIP:8001
# iptables -A POSTROUTING -t nat -m tcp -p tcp --dst $EXTSVRIP --dport 8001 -j SNAT --to-source $FWDPODIP
# iptables -A PREROUTING -t nat -m tcp -p tcp --dst $FWDPODIP --src $CLIENT2IP --dport 8000 -j DNAT --to-destination $FWDPODIP:8002
# iptables -A POSTROUTING -t nat -m tcp -p tcp --dst $EXTSVRIP --dport 8002 -j SNAT --to-source $FWDPODIP

# iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
DNAT       tcp  --  10.244.2.7           10.244.2.9           tcp dpt:8000 to:10.244.2.9:8001
DNAT       tcp  --  10.244.2.8           10.244.2.9           tcp dpt:8000 to:10.244.2.9:8002

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
SNAT       tcp  --  0.0.0.0/0            192.168.122.139      tcp dpt:8001 to:10.244.2.9
SNAT       tcp  --  0.0.0.0/0            192.168.122.139      tcp dpt:8002 to:10.244.2.9

[On k8s]

  1. Expose forwarder pod as service
# kubectl expose pod fwd-pod --name=ext-service --port=8000
# kubectl get svc
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
ext-service   ClusterIP   100.94.90.120   <none>        8000/TCP   16s
kubernetes    ClusterIP   100.94.0.1      <none>        443/TCP    26h
  • Test
    [On client1 ($ kubectl exec -it client1 bash)]
  1. Access to ext-service
$ curl ext-service:8000
  2. Check source IP for the access on external server
    (Stdout from SimpleHTTPServer)
192.168.122.200 - - [20/Dec/2019 17:30:14] "GET / HTTP/1.1" 200 -

[On client2 ($ kubectl exec -it client2 bash)]

  1. Access to ext-service
$ curl ext-service:8000
  2. Check source IP for the access on external server
    (Stdout from SimpleHTTPServer)
192.168.122.201 - - [20/Dec/2019 17:30:52] "GET / HTTP/1.1" 200 -

(*1)

# ip link add macvlan1 link eth0 type macvlan mode bridge
# ip netns add net1
# ip link set macvlan1 netns net1
# ip netns exec net1 bash
# ip link set lo up
# ip link set macvlan1 up
# ip addr add 192.168.122.200/24 dev macvlan1
# ip route add default via 192.168.122.1
# /usr/sbin/sshd -o PidFile=/run/sshd-net1.pid
# exit

# ip link add macvlan2 link eth0 type macvlan mode bridge
# ip netns add net2
# ip link set macvlan2 netns net2
# ip netns exec net2 bash
# ip link set lo up
# ip link set macvlan2 up
# ip addr add 192.168.122.201/24 dev macvlan2
# ip route add default via 192.168.122.1
# /usr/sbin/sshd -o PidFile=/run/sshd-net2.pid
# exit

@skydoctor:

Thanks for sharing the details of your mechanism @mkimuram. In part, this shows the complexity required to ensure that egressing packets from a given pod-type have a static source IP. A native approach would enlist the help of iptables at the node level to SNAT the packets going out. The key thing to think about is what entity creates the right iptables rule - an independent operator or possibly kube-proxy?
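
A minimal sketch of that native, node-level approach (addresses are examples; an operator or kube-proxy would have to keep one such rule per selected pod):

```
# On the node that owns the egress IP: traffic from the selected pod that
# leaves the cluster network is SNATed to the desired static egress source IP.
iptables -t nat -A POSTROUTING -s 10.244.2.7 ! -d 10.244.0.0/16 \
  -j SNAT --to-source 192.168.122.222
```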

@skydoctor:

@thockin: A common requirement in this regard is the need for symmetry: When a LoadBalancer service is used, an external client sends packets to the LoadBalancer IP. When pods that are part of that service initiate packets towards an external server, the packets should go out using the LoadBalancer IP as the source IP.

@mkimuram (PR author) commented Jan 9, 2020:

@skydoctor

> The key thing to think about is what entity creates the right iptables rule - an independent operator or possibly kube-proxy?

Yes. I meant to make a k8s operator, and possibly k8s itself, handle this complexity.
(It's too much work to do manually; the list of commands is just to test that it works.)

> A common requirement in this regard is the need for symmetry: When a LoadBalancer service is used, an external client sends packets to the LoadBalancer IP. When pods that are part of that service initiate packets towards an external server, the packets should go out using the LoadBalancer IP as the source IP.

The external IPs are already used by the gateway server, so we won't be able to assign the same external IP to k8s's load balancer when the above idea is used. However, if we create a per-external-IP forwarder pod and run an ssh client in it to create a remote-forward ssh tunnel, packets from external servers could be sent to client pods via the corresponding external IPs.

In that case, the source IP of the packets won't be the clusterIP of the external server's forwarder pod's service; instead it will be the IP of the per-external-IP forwarder pod. So it won't be symmetric on this point.
Also, doing it dedicates combinations of the external IP and particular ports to this purpose
(without it, only port 22 is used on the gateway server for an external IP).
So we will need to keep track of which external IPs and ports are used by which set of pods, if we do it.

I will consider whether there is a better way.

@mkimuram (PR author):

Let me also share the detailed idea for preserving source IPs for reverse access.

By using a remote ssh tunnel, the source IP of packets from an external server to each pod can be that external server's fwd-pod IP. So, each pod will be able to distinguish which external server sent the packets. (Each pod will be able to access the right external server via the fwd-pod's IP, although the IP might change.)

This won't be a perfect solution, but I couldn't find a good way to send a packet from the service IP.
I'm sharing this to get feedback. Please see figure 3 and the details below.

Note that in this idea, the tunnel is created by using an ssh client in the forwarder pod, instead of creating it in a per-external-IP pod as I mentioned in my previous comment. This is because a remote ssh tunnel doesn't preserve the source IP, so it's too late to find the original source IP after it has been tunneled to the pod network. As a result, iptables rules need to be updated on the gateway server, which would make management complex, but I guess it is still possible if we develop a k8s operator to handle it.

[figure 3]

Set up the same configuration as in #1105 (comment).

[On client1 ($ kubectl exec -it client1 bash)]

  1. Run service on the client1 (HTTP server on port 80)
# python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 ...

[On client2 ($ kubectl exec -it client2 bash)]

  1. Run service on the client2 (HTTP server on port 80)
# python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 ...

[On k8s]

  1. Create services for clients
$ kubectl expose pod client1 --name=cl1-service --port=80
$ kubectl expose pod client2 --name=cl2-service --port=80
  2. Confirm the service IP
# kubectl get svc cl1-service cl2-service
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
cl1-service   ClusterIP   100.94.212.105   <none>        80/TCP    12m
cl2-service   ClusterIP   100.94.16.100    <none>        80/TCP    14s

[On forwarder pod ($ kubectl exec -it fwd-pod bash)]

  1. Create remote ssh tunnels to the clients' service IPs via the gateway node
# EXTIP1=192.168.122.200
# EXTIP2=192.168.122.201
# CL1SVCIP=100.94.212.105
# CL2SVCIP=100.94.16.100
# ssh -f -N -R $EXTIP1:10080:$CL1SVCIP:80 $EXTIP1
# ssh -f -N -R $EXTIP2:10080:$CL2SVCIP:80 $EXTIP2

(To make this work, sshd on the gateway node needs to allow external access for remote forwarding, by setting GatewayPorts to clientspecified in sshd_config.)
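
For reference, that sshd setting could be applied along these lines (path and reload command may differ by distro):

```
# On the gateway node: allow remote forwards to bind non-loopback addresses,
# then reload sshd to pick up the change.
echo "GatewayPorts clientspecified" >> /etc/ssh/sshd_config
systemctl reload sshd
```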

[On Gateway node]

  1. Add iptables rules to forward packets to a particular port depending on the source IP, in namespace net1 (# ip netns exec net1 bash)
# EXTSVRIP=192.168.122.139
# EXTIP1=192.168.122.200

# iptables -A PREROUTING -t nat -m tcp -p tcp --dst $EXTIP1 --src $EXTSVRIP --dport 80 -j DNAT --to-destination $EXTIP1:10080
# iptables -A POSTROUTING -t nat -m tcp -p tcp --dst $EXTSVRIP --dport 10080 -j SNAT --to-source $EXTIP1

# iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
DNAT       tcp  --  192.168.122.139      192.168.122.200      tcp dpt:80 to:192.168.122.200:10080

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
SNAT       tcp  --  0.0.0.0/0            192.168.122.139      tcp dpt:10080 to:192.168.122.200

  2. Add iptables rules to forward packets to a particular port depending on the source IP, in namespace net2 (# ip netns exec net2 bash)

# EXTSVRIP=192.168.122.139
# EXTIP2=192.168.122.201

# iptables -A PREROUTING -t nat -m tcp -p tcp --dst $EXTIP2 --src $EXTSVRIP --dport 80 -j DNAT --to-destination $EXTIP2:10080
# iptables -A POSTROUTING -t nat -m tcp -p tcp --dst $EXTSVRIP --dport 10080 -j SNAT --to-source $EXTIP2

# iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
DNAT       tcp  --  192.168.122.139      192.168.122.201      tcp dpt:80 to:192.168.122.201:10080

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
SNAT       tcp  --  0.0.0.0/0            192.168.122.139      tcp dpt:10080 to:192.168.122.201
  • Test
    [On external server]
  1. Access to external IP#1
$ EXTIP1=192.168.122.200
$ curl $EXTIP1
  2. Check source IP for the access on external server (IP will be fwd-pod's IP.)
    (Stdout from SimpleHTTPServer)
10.244.2.9 - - [16/Jan/2020 22:27:54] "GET / HTTP/1.1" 200 -
  3. Access to external IP#2
$ EXTIP2=192.168.122.201
$ curl $EXTIP2
  4. Check source IP for the access on external server (IP will be fwd-pod's IP.)
    (Stdout from SimpleHTTPServer)
10.244.2.9 - - [16/Jan/2020 22:27:54] "GET / HTTP/1.1" 200 -

@mkimuram (PR author):

Just for your information: I've implemented PoC code of the above idea, here.

@fejta-bot:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor):

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nitishm commented Mar 2, 2020:

/remove-lifecycle rotten
