
Support for target-type IP #298

Open
rifelpet opened this issue Dec 23, 2019 · 4 comments

@rifelpet

For CNI providers that assign VPC IPs to pods, ALBs and NLBs can target the pod IPs directly rather than instance IDs. This bypasses kube-proxy and avoids routing traffic, potentially across AZs, through instances that do not run one of the service's pods.

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#target-type

I think it would be great if kube-ingress-aws-controller could support targeting the IPs behind a Service (its Endpoints). An additional target-type annotation could be added to support this. The controller would need to watch Endpoints in order to keep the target groups up to date as the ready pods behind a Service change. We're open to contributing this functionality if that's acceptable.
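
For illustration, a minimal sketch of what the endpoint lookup could look like with client-go; the namespace, Service name, and helper function here are made up for illustration, not a proposed API:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// readyServiceIPs lists the ready pod IPs behind a Service by reading its
// Endpoints object; these are the IPs a target group of type IP would hold.
func readyServiceIPs(clientset kubernetes.Interface, namespace, service string) ([]string, error) {
	ep, err := clientset.CoreV1().Endpoints(namespace).Get(context.TODO(), service, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, subset := range ep.Subsets {
		// Addresses contains only ready endpoints; NotReadyAddresses are skipped.
		for _, addr := range subset.Addresses {
			ips = append(ips, addr.IP)
		}
	}
	return ips, nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// "kube-system"/"skipper-ingress" are placeholder names.
	ips, err := readyServiceIPs(clientset, "kube-system", "skipper-ingress")
	if err != nil {
		panic(err)
	}
	fmt.Println(ips)
}
```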

Thoughts?

@szuecs
Member

szuecs commented Dec 24, 2019

@rifelpet from my side it’s fine to implement that, if it is opt-in via a CLI flag that enables endpoint lookups and opt-in per Ingress via an annotation. I am happy to read your pull request.

On the other hand, no one should use a NodePort service, and if you use Skipper as a hostNetwork target you get the same (no Service, but Endpoints). Of course you have an additional hop, which might be bad for some workloads with very tight latency demands.

@rverma-jm

+1

@universam1
Contributor

Good to see a discussion has already started. I made a test setup as a POC with Skipper using no hostPort, only a regular containerPort, accessed via the VPC IPs of those pods.

Basically, configure the ALB to use a target group of type IP instead of Instance, pointing directly at the Skipper pod IPs, and it works just fine!
That means that with the AWS CNI it is possible to untie Skipper from hostPort!
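
For reference, the AWS side of this is just registering pod IPs in a target group of type IP, roughly like this with aws-sdk-go (the target group ARN and pod IPs are placeholders, not from the POC):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))

	// Placeholder ARN and pod IPs; the target group must have been created
	// with TargetType "ip" for IP addresses to be accepted as target IDs.
	targetGroupARN := "arn:aws:elasticloadbalancing:eu-central-1:123456789012:targetgroup/skipper/0123456789abcdef"
	podIPs := []string{"172.16.60.193", "172.16.49.249"}

	targets := make([]*elbv2.TargetDescription, 0, len(podIPs))
	for _, ip := range podIPs {
		targets = append(targets, &elbv2.TargetDescription{Id: aws.String(ip)})
	}

	if _, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String(targetGroupARN),
		Targets:        targets,
	}); err != nil {
		log.Fatal(err)
	}
}
```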

advantages:

  • Skipper is independent of the nodes; it can scale horizontally in both directions freely
  • no unhealthy endpoints in the target group when Skipper runs as a Deployment with replicas < nodes
  • no extra hop introduced, compared to a "Service"
  • PDBs or scaling policies are better supported, because there can be more than one Skipper pod per node, simplifying rolling updates

disadvantage:

  • works only with AWS CNI (?)

Would that implementation make sense?

Thoughts, @szuecs?

@szuecs
Member

szuecs commented Aug 12, 2020

@universam1 sounds like a great step forward. If you can create an option such that this behavior can be enabled as opt-in, I would be happy to review the PR. 😊
I guess the Lyft CNI plugin might also work, but I have no idea about the state of either of these.

universam1 added a commit to o11n/kube-ingress-aws-controller that referenced this issue Jan 3, 2022
This adds a new operational mode that registers the **Ingress pods as targets in the target groups directly**, instead of the current mode where the Ingress pods are reached through a **HostPort**.

The core idea is based on the fact that standard AWS EKS clusters running the AWS VPC CNI have their pods as first-class members of the VPC. Their IPs are directly reachable from ALB/NLB target groups, just like the nodes, which means there is no need for the HostPort indirection.

There are several drivers and advantages for accessing the pods directly instead of via a HostPort:

ref: https://kubernetes.io/docs/concepts/configuration/overview/#services

The biggest trouble in operations has been that updating the node members of the target groups is slower than the nodes are physically replaced, which ends up in a black hole of **no Ingresses being available for a time**. We regularly face downtimes, especially when spot interruptions or ASG node rollings happen, because the ALB/NLB takes up to 2 minutes to reflect the group change. For smaller clusters this leads to no Skipper instance being registered and hence no target available to forward traffic to.
With this new mode the registration happens independently of ASGs and instantly: from scheduling a pod to it serving traffic from the ALB takes less than 10 seconds!

With HostPort there is ultimately a dependency on available nodes to scale the Ingress.
In addition, an Ingress pod cannot be replaced in place; it requires a termination first and then rescheduling. For a certain time, which can be more than a minute, that node is offline as an Ingress.
With this mode the HostPort (and presumably host networking) is obsolete, which allows node-independent scaling of the Skipper pods! Skipper becomes a regular Deployment and its ReplicaSet can be sized independently of the cluster size, which simplifies operations especially for smaller clusters. We are using a custom HPA metric tied to node group size to work around this Deployment/DaemonSet hybrid setup, which is now obsolete!

The core idea is event-based registration with Kubernetes, using a pod `Informer` that receives immediate notifications about pod changes, which allows target group updates with almost zero delay.

The registration happens as soon as the pod has received an IP from the AWS VPC. Hence the readiness probe of the ALB/NLB already starts monitoring while the pod is being scheduled, so it can serve as early as possible. Tests in the lab show pods serving ALB traffic well under 10s after scheduling!

Deregistration is bound to the Kubernetes event. That means the LB is now in sync with the cluster and will stop sending traffic before the pod is actually terminated. This implements safe deregistration without traffic loss. Tests in the lab show that even under aggressive deployment scaling no packet loss is measured!
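
A simplified sketch of this event flow with client-go follows; this is not the actual controller code, and `registerIP`/`deregisterIP` are hypothetical stand-ins for the target group register/deregister calls:

```go
package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// Hypothetical stand-ins for the ALB/NLB target group calls.
func registerIP(ip string)   { log.Printf("Registering CNI targets: [%s]", ip) }
func deregisterIP(ip string) { log.Printf("Deregistering CNI targets: [%s]", ip) }

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset,
		30*time.Second,                         // periodic resync recovers the desired state after manual TG changes
		informers.WithNamespace("kube-system"), // placeholder namespace
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = "application=skipper-ingress" // placeholder selector
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			if pod.Status.PodIP != "" {
				registerIP(pod.Status.PodIP) // register as soon as the VPC IP is assigned
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			switch {
			case pod.DeletionTimestamp != nil && pod.Status.PodIP != "":
				deregisterIP(pod.Status.PodIP) // terminating: drain it from the target group
			case pod.Status.PodIP != "":
				registerIP(pod.Status.PodIP) // IP assigned after scheduling; re-registering is idempotent
			}
		},
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok && pod.Status.PodIP != "" {
				deregisterIP(pod.Status.PodIP) // killed or dead: remove the target
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	<-stop
}
```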

Since the IP-based TGs are now managed by this controller, they represent pods, so all targets are shown as healthy; anything else is cleaned up by this controller.

* client-go Informer: these high-level functions provide convenient access to Kubernetes event registrations. Since event registration is the key to fast responses and is efficient compared to high-rate polling, using these standard factory methods stands to reason.

* successful transition of the TG from type Instance to type IP and vice versa
* the controller registers pods that are discovered
* the controller deregisters pods that are in "terminating" status
* the controller recovers the desired state by "resyncing" if the TG was changed by manual intervention
* it removes pods that are killed or dead

| access mode | HostNetwork | HostPort | Result |                      Notes                       |
| :---------: | :---------: | :------: | :----: | :----------------------------------------------: |
| `Instance`  |   `true`    |  `true`  |  `ok`  |                    status quo                    |
|    `IP`     |   `true`    |  `true`  |  `ok`  |       PodIP == HostIP --> limited benefits       |
|    `IP`     |   `false`   |  `true`  |  `ok`  | PodIP != HostIP --> limited scaling, node bound  |
|    `IP`     |   `true`    | `false`  | `N/A`  |            impossible, HN implies HP             |
|    `IP`     |   `false`   | `false`  |  `ok`  |        goal achieved: free scaling and HA        |

Example logs:
```
time="2021-12-17T15:36:36Z" level=info msg="Deleted Pod skipper-ingress-575548f66-k99vs IP:172.16.49.249"
time="2021-12-17T15:36:36Z" level=info msg="Deregistering CNI targets: [172.16.49.249]"
time="2021-12-17T15:36:37Z" level=info msg="New Pod skipper-ingress-575548f66-qff2q IP:172.16.60.193"
time="2021-12-17T15:36:40Z" level=info msg="Registering CNI targets: [172.16.60.193]"
```

* extended the service account with the required RBAC permissions to watch/list pods
* added an example of Skipper without HostPort and HostNetwork
Signed-off-by: Samuel Lang <[email protected]>
universam1 added a commit to o11n/kube-ingress-aws-controller that referenced this issue Jan 11, 2022
szuecs pushed a commit that referenced this issue Jan 31, 2022
* implements #298 Support for target-type IP


* fixing golangci-lint timeouts

```
Run make lint
golangci-lint run ./...
level=error msg="Running error: context loading failed: failed to load packages: timed out to load packages: context deadline exceeded"
level=error msg="Timeout exceeded: try increasing it by passing --timeout option"
```

ref: golangci/golangci-lint#825

This seems to be an incompatibility between the client-go package and the module's Go version.
Updating to Go 1.16 and recreating the go.sum fixes the timeouts and slims down the dependency list.

Also updated the golangci-lint install script used in the GitHub Actions workflow, as it is deprecated.

Signed-off-by: Samuel Lang <[email protected]>