✨ AWS CNI Mode #461
Conversation
@universam1 thanks, sounds like a great idea for aws-vpc CNI. One drawback is that we need informers and, because of that, client-go WATCH (internal to informers). We tried to avoid client-go because of kubernetes/client-go#374 (comment) and zalando-incubator/stackset-controller#263, but since kubernetes/kubernetes#95981 was done, maybe it's fine to use client-go now.
Thanks @szuecs for the review!
Please have a look at the test table below:
From my testing it is impossible to define HostNetwork but disable HostPort. AFAIK this combination is not available, as HostNetwork implies the pod port mapping.
Likewise, we don't either! Sorry, I mixed up NodePort with HostPort: the former refers to a Service, the latter to a port on the host. I was only referring to the host port and have corrected the description.
Curious, I was unaware of this issue. Wondering, since the informer is used across SIG projects such as the Cluster Autoscaler and is even the suggested pattern for controllers - is the issue you face a corner case?
force-pushed from 80db96d to a04d016
Thanks for the testing table, looks great to me.
Thank you for the context, pretty impressive deep dive! Issues like this are incredibly challenging to surface! Since I'm not familiar with this issue, can you help me understand what the desired config would look like?
IMHO a full refactor could simplify things, but might make this PR too large to land.
Interesting, not sure if this helps in this case. On the other hand, client-go uses HTTP/2 ping frames to detect broken connections, so it should be fine.
I would just start by passing a custom Transport; you can use the DefaultTransport for that.
Agreed, out of scope, but we should have an issue that we can work on later to migrate.
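For later reference, a minimal sketch of what a tuned transport with HTTP/2 ping health checks could look like. It assumes golang.org/x/net/http2 and client-go's rest.Config; the function names are illustrative and this is not the Skipper NewTransport used in this PR:

```go
package example

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
	"k8s.io/client-go/rest"
)

// newPingTransport clones the default transport and enables HTTP/2 ping based
// health checks so that broken connections are detected instead of hanging.
func newPingTransport() (http.RoundTripper, error) {
	tr := http.DefaultTransport.(*http.Transport).Clone()
	h2, err := http2.ConfigureTransports(tr) // enable HTTP/2 on the cloned transport
	if err != nil {
		return nil, err
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a ping when the connection is idle
	h2.PingTimeout = 15 * time.Second     // treat the connection as broken on ping timeout
	return tr, nil
}

// useCustomTransport wires the transport into a rest.Config. Note that client-go
// rejects a custom Transport combined with TLS options in the config, so TLS
// would have to be configured on the transport itself.
func useCustomTransport(cfg *rest.Config) error {
	tr, err := newPingTransport()
	if err != nil {
		return err
	}
	cfg.Transport = tr
	return nil
}
```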
@szuecs
@szuecs Added a complete set of unit tests, implemented the Skipper NewTransport, and tested in the lab; I believe this should be ready to go.
@universam1 from my side only DCO :)
force-pushed from 712b195 to 653679c
Restarted; in case it happens again please check why it takes so long, and if it really does take that long, please increase the timeout.
As far as I can see it requires a couple of adjustments to the CI config; fine from my side if that's desired.
yes, please
force-pushed from 782d97a to cd5f94b
👍
Added a section in the readme to describe that mode, anything missing @szuecs?
// PodInformer is an event handler for Pod events that builds a local list of valid and relevant pods
// and sends an event to the endpoint channel, triggering a resync of the targets.
func (a *Adapter) PodInformer(ctx context.Context, endpointChan chan<- []string) error {
	podEndpoints := sync.Map{}
It seems we need to pre-populate the podEndpoints map before launching the informer, otherwise it is empty after a controller restart and would trigger an update after receiving a single pod change.
there is a guard for that in here already:
https://github.com/zalando-incubator/kube-ingress-aws-controller/pull/461/files#diff-0ca2bf28ee146c014db869dd42c216dff322e265933c8b179b939f18016386d6R673
As far as I can tell, on controller restart:
- the main cycle will send a list of the relevant target groups.
- this informer will receive a change of a single pod and will publish an updated list of endpoints of size 1, since podEndpoints is empty.
The SetTargetsOnCNITargetGroups will then take the difference between the registered targets and this single-item endpoint list and will de-register the other valid endpoints.
The "guard" only verifies that TGs and EPs were received at least once, but the update itself assumes that the endpoint list contains all available EPs.
That case does not exist, because the informer has long been in sync before the TGs are enumerated; it flushes those first messages. There are several seconds in between.
I guess @AlexanderYastrebov you should take some time to explore this concept in a lab to learn how it behaves
#461 (comment) doesn't work; the events are only generated by informer.Run, not by factory.Start. That's why we cannot omit informer.Run, and the semaphore via atomic.StoreInt64 has no effect here either, unfortunately! 😞
I think, hypothetical or not, if the problem exists we need to tackle it.
We sometimes have more than 1k Ingress objects in a cluster, with an SLA that 40k is fine. This can easily make Kubernetes slower than the AWS API response. time.Sleep is never a solution; I would rather accept a TODO comment that needs to be addressed some day.
What if we just list endpoints once on creation to populate the cache?
One issue that could happen: if we do not execute an update operation on the cloud load balancers to refresh the endpoint list, then a deployment of kube-ingress-aws-controller could miss a newly created endpoint while it was down.
I think the right thing would be, at start, to make sure we list endpoints through the informer (warm up the cache) and use the list to trigger updates to all cloud load balancers. After that we can rely on Update events.
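A rough sketch of that warm-up idea, assuming client-go and the podEndpoints map / endpointChan channel from the snippet above; the label selector and the function name are illustrative, not the PR's actual code:

```go
package example

import (
	"context"
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// warmUpCache lists the matching pods once, seeds podEndpoints and publishes the
// full endpoint list so all target groups are reconciled before the controller
// starts relying on individual informer events.
func warmUpCache(ctx context.Context, client kubernetes.Interface, podEndpoints *sync.Map, endpointChan chan<- []string) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		LabelSelector: "application=skipper-ingress", // illustrative selector, not the controller's actual one
	})
	if err != nil {
		return err
	}
	endpoints := make([]string, 0, len(pods.Items))
	for _, p := range pods.Items {
		if p.Status.PodIP == "" {
			continue // no IP assigned by the CNI yet
		}
		podEndpoints.Store(p.Name, p.Status.PodIP)
		endpoints = append(endpoints, p.Status.PodIP)
	}
	endpointChan <- endpoints // trigger one full update of all cloud load balancers
	return nil
}
```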
That's actually a smart idea to determine when the cache equilibrium is reached - I like that and it doesn't sound hard to implement...
@szuecs @AlexanderYastrebov Please check if you like this version. Using the list is convenient in several ways, as it delivers the warm cache right from the start!
The successive informer updates result in a no-op since we do a map reduce anyway via LoadOrStore and LoadAndDelete. That means no warm-cache sync validation is required!
The beauty is that even in a hard race we will correctly process a pod deletion right after the initial list.
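A small sketch of why repeated informer events become no-ops with that approach; the handler names and the notify callback are illustrative:

```go
package example

import "sync"

// onPodAdd stores the pod IP; if the entry already exists (for example seeded by
// the initial list), LoadOrStore reports loaded=true and the resync is skipped.
func onPodAdd(podEndpoints *sync.Map, podName, podIP string, notify func()) {
	if _, loaded := podEndpoints.LoadOrStore(podName, podIP); !loaded {
		notify() // new endpoint -> trigger a target group resync
	}
}

// onPodDelete removes the entry; LoadAndDelete reports whether it was present, so
// deleting a pod that was never seen (or already removed) is a no-op.
func onPodDelete(podEndpoints *sync.Map, podName string, notify func()) {
	if _, loaded := podEndpoints.LoadAndDelete(podName); loaded {
		notify() // endpoint removed -> trigger a target group resync
	}
}
```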
I think we should not overcomplicate things, see also #461.
@AlexanderYastrebov do you agree?
force-pushed from 8ef03e3 to 377e6c9
informer := factory.Core().V1().Pods().Informer()
factory.Start(ctx.Done())
factory.Start() apparently starts the informers, but we configure the informer EventHandler later, which seems wrong.
From this perspective informer.Run(ctx.Done()) also feels wrong, as factory.Start already runs it.
I guess we can do what others do:
workqueue: https://github.com/kubernetes/client-go/blob/master/examples/workqueue/main.go#L128-L132
fakeclient: https://github.com/kubernetes/client-go/blob/master/examples/fake-client/main_test.go#L69-L72
The cache.WaitForCacheSync() does what we need:
https://github.com/kubernetes/client-go/blob/3255cdcd9cf0ce5ea718c97f4aea519130649aca/informers/factory.go#L141
https://github.com/kubernetes/client-go/blob/3255cdcd9cf0ce5ea718c97f4aea519130649aca/tools/cache/shared_informer.go#L254
I also asked in sig-apimachinery if they consider this safe or if we have to do something else.
so Kubernetes sig-apimachinery says that it's not an issue because List is used before Watch, which means there is no race.
@szuecs
In another discussion #461 (comment) this issue kubernetes/kubernetes#92777 (comment) was brought up, which is exactly about the problem we have now: configuring the informer after the factory has started.
Then we should just move the AddEventHandler call, with its body, to before Start().
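A sketch of that ordering, following the client-go examples linked above: the handler is registered before the factory is started, and cache.WaitForCacheSync blocks until the initial LIST has been processed. The handler bodies and the function name are placeholders, not this PR's code:

```go
package example

import (
	"context"
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func runPodInformer(ctx context.Context, client kubernetes.Interface) error {
	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().Pods().Informer()

	// Register the handler before starting the factory so no event can reach a
	// half-configured informer.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* record the pod IP */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* update the pod IP */ },
		DeleteFunc: func(obj interface{}) { /* drop the pod IP */ },
	})

	factory.Start(ctx.Done())

	// Block until the initial LIST has been processed, i.e. the cache is warm.
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		return fmt.Errorf("pod informer cache failed to sync")
	}
	return nil
}
```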
Note the extended unit test
See the sources https://github.com/kubernetes/client-go/blob/master/tools/cache/shared_informer.go#L552-L556 that explain how a safe join to a running informer happens.
I prefer this solution over a semaphore since it is cleaner and easier to grasp.
So this makes sure we get all entries from a fresh LIST, right?
Yes - so theoretically we could omit the explicit List before, since we get this one implicitly by late binding, but it serves two purposes so I'd keep it.
> So this makes sure we get all entries from a fresh LIST, right?
> Yes - so theoretically we could omit the explicit List

Idk, I see it uses addNotification but we only handle updates.
```
Run make lint
golangci-lint run ./...
level=error msg="Running error: context loading failed: failed to load packages: timed out to load packages: context deadline exceeded"
level=error msg="Timeout exceeded: try increasing it by passing --timeout option"
```
ref: golangci/golangci-lint#825
Seems like an incompatibility between the client-go pkg and the module Go version. Updating to Go 1.16 and recreating the go.sum fixes the timeouts and slims down the dependency list. Also updating the golangci-lint install script for the GitHub Action, as it is deprecated.
Signed-off-by: Samuel Lang <[email protected]>
I guess we've spent enough time pondering the proposed solution, so it is good enough to be accepted.
👍
@universam1 thanks, and let's have follow-up fixes if you detect something that does not work as well as it could.
Thank you @szuecs for releasing v0.12.17 - just wondering if this is really considered a bugfix release bump? 😺
@universam1 we increase the minor version in case there is a need to change flags or something like that. We always add a migration guide to the readme for a minor release increase.
…` version Before version `0.12.17` loadbalancer target group target type was not specified and defaulted to `instance` in CloudFormation, see https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticloadbalancingv2-targetgroup.html#cfn-elasticloadbalancingv2-targetgroup-targettype PR #461 introduced AWS CNI mode and configured target group target type either to `ip` or `instance`. Changing target type from unset to `instance` in CloudFormation triggers target group re-creation which makes it impossible to upgrade from pre `0.12.17` without downtime. This change: - makes `target-access-mode` flag required to force users to choose proper value - renames `HostPort` option of `target-access-mode` into `InstanceFilter` - introduces a new `Legacy` option that does not set target type and thus enables upgrade from pre `0.12.17`. Fixes #507 Signed-off-by: Alexander Yastrebov <[email protected]>
new operational mode of Direct Ingress Pod access
implements #298 Support for target-type IP
This adds a new operational mode that adds the Ingress pods as targets in the target groups directly, instead of the current mode where the Ingress pods are accessed through a HostPort.
The core idea is based on the fact that standard AWS EKS clusters running the AWS VPC CNI have their pods as first-class members of the VPC. Their IPs are directly reachable from the ALB/NLB target groups just like the nodes', which means there is no need for the HostPort indirection.
There are several drivers and advantages of accessing the pod directly vs. through a HostPort:
HostNetwork and HostPort conflict with best practices
ref: https://kubernetes.io/docs/concepts/configuration/overview/#services
Delayed, eventual consistency of AutoScalingGroup to ALB/NLB target group updates
This has been the biggest trouble in operations: the update of node members in target groups is slower than the nodes are physically replaced, which ends up in a black hole with no Ingresses available for a time. We regularly face downtimes, especially when spot interruptions or ASG node rollings happen, where the ALB/NLB takes up to 2 minutes to reflect the group change. For smaller clusters this leads to no Skipper instance being registered and hence no target available to forward traffic to.
With this new mode the registration happens independently of ASGs and instantly; from scheduling a pod to it serving traffic from the ALB takes less than 10 seconds!
Independent scaling from nodes
With HostPort there is an eventual dependency on available nodes to scale the Ingress.
Moreover, the Ingress pod cannot be replaced in place but requires a termination first and then rescheduling. For a certain time, which can be more than a minute, this node is offline as an Ingress.
With this mode host networking and HostPort become obsolete, which allows node-independent scaling of Skipper pods! Skipper becomes a regular deployment and its ReplicaSet can be independent of the cluster size, which simplifies operations especially for smaller clusters. We were using a custom HPA metric attached to the node group size to counteract this deployment/daemonset hybrid combination, which is now obsolete!
Synced de/registering and instantaneous response
The core idea is event-based registration with Kubernetes using a pod Informer that receives immediate notifications about pod changes, which allows nearly zero-delay updates of the target groups. Registration happens as soon as the pod receives an IP from the AWS VPC. Hence the ALB/NLB health check already starts monitoring during scheduling of the pod, so it can serve at the earliest possible moment. Tests in the lab show pods serving ALB traffic well under 10s from scheduling!
Deregistration is bound to the Kubernetes event. That means the LB is in sync with the cluster and will stop sending traffic before the pod is actually terminated. This implements safe deregistration without traffic loss. Tests in the lab show that even under aggressive deployment scaling no packet loss is measured!
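To illustrate what event-bound registration could mean on the AWS side, a hedged sketch using the AWS SDK's ELBv2 API; the function names and parameters are illustrative, not the controller's actual implementation:

```go
package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/elbv2"
	"github.com/aws/aws-sdk-go/service/elbv2/elbv2iface"
)

// registerPodIP adds a pod IP to an ip-mode target group as soon as the informer
// reports the pod, instead of waiting for ASG membership changes to propagate.
func registerPodIP(svc elbv2iface.ELBV2API, targetGroupARN, podIP string) error {
	_, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String(targetGroupARN),
		Targets:        []*elbv2.TargetDescription{{Id: aws.String(podIP)}},
	})
	return err
}

// deregisterPodIP removes the pod IP when the informer reports the deletion, so
// the load balancer stops sending traffic before the pod terminates.
func deregisterPodIP(svc elbv2iface.ELBV2API, targetGroupARN, podIP string) error {
	_, err := svc.DeregisterTargets(&elbv2.DeregisterTargetsInput{
		TargetGroupArn: aws.String(targetGroupARN),
		Targets:        []*elbv2.TargetDescription{{Id: aws.String(podIP)}},
	})
	return err
}
```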
TG without unhealthy targets
Since the IP-based TGs are now managed by this controller, they represent pods and thus all of them are shown healthy; otherwise they are cleaned up by this controller.
Implementation details:
Tests
HostNetwork and HostPort combinations tested:

| target type | HostNetwork | HostPort | result |
| --- | --- | --- | --- |
| Instance | true | true | ok |
| IP | true | true | ok |
| IP | false | true | ok |
| IP | true | false | N/A |
| IP | false | false | ok |
Example logs:
misc