KEP: Finalizer Protection for Service LoadBalancers #992
Conversation
Adding relevant folks for review. /assign @bowei @andrewsykim @jhorwit2 @jiatongw
@MrHohn: GitHub didn't allow me to assign the following users: jiatongw. Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(Force-pushed from ca34886 to b999ffe.)
Can you split the paragraphs into multiple lines so it's easier to review?
## Summary

We will be adding finalizer protection to ensure the Service resource is not fully deleted until the correlating load balancing resources are also deleted. Any service that has `type=LoadBalancer` (both existing and newly created ones) will be attached a service LoadBalancer finalizer, which should be removed by service controller upon the cleanup of related load balancing resources. Such finalizer protectiion mechanism will be released with phases to ensure downgrades can happen safely.
typo protectiion
Fixed.
## Motivation

There are various cases where service controller can leave orphaned load balancing resources after services are deleted (ref discussion on https://github.com/kubernetes/kubernetes/issues/32157, https://github.com/kubernetes/kubernetes/issues/53451). We are periodically getting bug reports and cusotmer issues that replicated such problem, which seems to be common enough and is worth to have a better mechanism for ensuring the cleanup of load balancing resources.
typo cusotmer
Fixed.
## Proposal

We are going to define a finalizer for service LoadBalancers with name `kubernetes.io/service-load-balancer`. This finalizer will be attached to any service that has `type=LoadBalancer`. Upon the deletion of such servicce, the actual deletion of the resource will be blocked until this finalizer is removed. This finalizer will not be removed until cleanup of the correlating load balancing resources are considered finished by service controller.
Would prefixing with `networking.kubernetes.io` be useful here?
I don't have a strong opinion here --- is there any naming convention we can follow?
The only finalizers I'm aware of are for volume, which use `kubernetes.io`: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/finalizer.go
Or one other option is `networking.k8s.io`? (Following https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20190125-ingress-api-group.md)
Ah I see. Maybe `kubernetes.io` is okay then, to be consistent with volumes & since it's added to the core API?
### Graduation Criteria

Beta: Allow Alpha ("remove finalzier") to soak for at least one release, then switch the "add finalizer" logic to be on by default.
Wondering if feature gating is necessary here. Can't think of a scenario where you would opt-out of this except for the first version where it's supported. Even if user does a version downgrade, your cluster would break if you forget to remove the feature gate.
That's a fair point. I was just following the usual lifecycle of a feature gate --- maybe we won't need that for this? Or we can decide when we actually hit beta.
My personal take here is to not do feature gates. I consider this more of a bug fix than an added feature.
To be clear, did you mean not doing feature gate for this enhancement at all, or not doing feature gate at beta and after?
I think having a feature gate at least in the initial release might help if anyone would like to turn it on earlier (for the "add finalizer" part).
I think having a feature gate at least in the initial release might help if anyone would like to turn it on earlier (for the "add finalizer" part).
Good point 🤔. If we're okay with cleaning up the feature gate after only 1 release then sgtm.
(Force-pushed from b999ffe to 0fbada4.)
Right, split those into multiple lines.
  3. Provision of load balancing resources.
- Deletion
  1. User issues a deletion for the service.
  2. Service resoure deletion is blocked due to the finalizer.
Suggested change: "2. Service resoure deletion is blocked due to the finalizer." → "2. Service resource deletion is blocked due to the finalizer."
Fixed.
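For illustration, here is a minimal Go sketch of the deletion path being described: the API server keeps a Service with a pending deletion around for as long as any finalizer is listed, so the controller can tear down the cloud load balancer before the object goes away. The controller type and helper names below are hypothetical stand-ins, not the actual service controller code.

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// serviceController is a stand-in for the real service controller; the
// helpers below are hypothetical stubs that only illustrate ordering.
type serviceController struct{}

// cleanupLoadBalancer would ask the cloud provider to delete the load
// balancing resources for this Service.
func (c *serviceController) cleanupLoadBalancer(ctx context.Context, svc *v1.Service) error {
	return nil
}

// removeFinalizer would patch the Service to drop the load balancer
// finalizer once cleanup has finished.
func (c *serviceController) removeFinalizer(ctx context.Context, svc *v1.Service) error {
	return nil
}

// syncService shows why "deletion is blocked": while the finalizer is
// present, the API server only marks the Service as terminating
// (deletionTimestamp set) instead of removing it.
func (c *serviceController) syncService(ctx context.Context, svc *v1.Service) error {
	if svc.DeletionTimestamp == nil {
		// Not terminating: ensure finalizer and load balancer (omitted).
		return nil
	}
	// Tear down cloud resources first; the Service stays visible while
	// this happens, so an error can simply be retried on the next sync.
	if err := c.cleanupLoadBalancer(ctx, svc); err != nil {
		return err // requeued; deletion remains blocked
	}
	// Dropping the finalizer is what finally lets the deletion complete.
	return c.removeFinalizer(ctx, svc)
}
```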
We are going to define a finalizer for service LoadBalancers with name
`kubernetes.io/service-load-balancer`. This finalizer will be attached to any
service that has `type=LoadBalancer`. Upon the deletion of such servicce, the
Suggested change: "service that has `type=LoadBalancer`. Upon the deletion of such servicce, the" → "service that has `type=LoadBalancer`. Upon the deletion of such service, the"
Fixed.
- [Proposal](#proposal)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
Can we add a section for "Risks and Mitigations"? I think we should discuss what happens if user does n+2 upgrade from v1.14 -> v1.16 and then does a downgrade back to v1.14. They would have added finalizers to the Service but then lose the removal logic on the downgrade. To be safe should we alpha for 2 releases or just assume a user would always do n+1 upgrade?
Thanks for bringing this up. I was under the impression the supported upgrade order is always 1.n to 1.(n+1)? Ref https://kubernetes.io/docs/setup/version-skew-policy/#supported-component-upgrade-order.
Though we should definitely cover that if the n+2 upgrade case is valid.
Thanks for bringing this up. I was under the impression the supported upgrade order is always 1.n to 1.(n+1)? Ref https://kubernetes.io/docs/setup/version-skew-policy/#supported-component-upgrade-order.
Yeah I believe officially it is 1 minor version that is supported but in practice/reality it's not uncommon for users to do n+2 :P. At the very least we should cover that in the KEP as a risk IMO even if we choose not to support it.
SG, will add a "Risks and Mitigations" section for this.
Added now.
We should be careful to not try to support more than what is documented. n+2 is definitely not something that is intended to work.
/assign @thockin
As such, we are proposing Alpha/Beta/GA phase for this enhancement as below:
- Alpha: The finailizer cleanup logic would be on by default while the finalizer
addition logic will be put behind a feature gate that is off by default.
- Beta: Both the cleanup and addition logic will be default on, but allowing
Should this be:
- Beta: Finalizer clean up will always be on. Finalizer addition will be on by default but can be disabled via a feature gate.
instead?
I think he's proposing this state for Alpha, and I think it's probably OK.
We could require:
- pre-alpha: cleanup and addition gated, default off
- alpha: cleanup gated, default on; addition gated default off
- beta: cleanup and addition gated, default on
- GA: default on
That seems tedious. The risk of dropping pre-alpha is that people come to depend on the cleanup but we might remove it OR that cleanup causes real problems. Do we think that's significant? I think it's a calculated risk that shaves a whole release off such a feature...
I guess what I meant was exactly what @andrewsykim suggested :)
We were trying to avoid the pre-alpha phase --- having the feature available but completely off by default. I can't easily think of how the cleanup could cause a significant problem, as it only tries to remove a specific finalizer that presumably doesn't exist yet.
people come to depend on the cleanup
Not sure if there is another use case; we wanted to enable the cleanup but not the addition just to accommodate the downgrade problem.
We were trying to avoid the pre-alpha phase --- having the feature available but completely off by default. I can't easily think of how the cleanup can cause significant problem, as that is trying to remove a specific finalizer that presumably doesn't exist yet.
Agreed, I think something like this is more ideal (and I think the KEP is supposed to reflect this?):
- pre-alpha / alpha: clean up always on, addition gated with default off
- beta: clean up always on, addition gated default on
- GA: clean up & addition always on
The risk of dropping pre-alpha is that people come to depend on the cleanup but we might remove it OR that cleanup causes real problems.
The biggest risk I think is in trying to change the clean-up logic. If we change it in any breaking way during alpha, we have to wait another release cycle to account for rollbacks. Two releases if we want to account for those users who like to be dangerous and do n+2 version upgrades.
I thought that sequence was what @MrHohn was proposing?
As to waiting another release - that is easy...don't screw up :)
I thought that sequence was what @MrHohn was proposing?
Sorry, I was a bit confused earlier. I think we're on the same page now :)
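To make the gating that is being agreed on here concrete, below is a rough sketch of how the alpha behaviour could be wired up. The gate name `ServiceLoadBalancerFinalizer` is only an assumption for illustration; the thread does not settle on a name, and the gate would still need to be registered with the default feature gate before `Enabled` may be called.

```go
package sketch

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// ServiceLoadBalancerFinalizer is an assumed gate name used only for
// this sketch; it must be registered (e.g. via
// DefaultMutableFeatureGate.Add) before Enabled is called.
const ServiceLoadBalancerFinalizer featuregate.Feature = "ServiceLoadBalancerFinalizer"

// shouldAddFinalizer captures the alpha proposal: adding the finalizer
// to Services is opt-in behind the gate (default off).
func shouldAddFinalizer() bool {
	return utilfeature.DefaultFeatureGate.Enabled(ServiceLoadBalancerFinalizer)
}

// shouldRemoveFinalizer captures the other half: the cleanup ("remove
// finalizer") path is unconditional in every phase, so a downgraded
// cluster can still shed finalizers added by a newer release.
func shouldRemoveFinalizer() bool {
	return true
}
```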
## Proposal

We are going to define a finalizer for service LoadBalancers with name
`kubernetes.io/service-load-balancer`. This finalizer will be attached to any
ANY?
Should this be generic code or cloud-provider code? What happens if someone makes an LB Service in an environment that doesn't implement service LBs? Does that mean they can't delete the Service without manual finalizer removal?
I guess the service controller calls into cloud-provider code anyway, so even if it is a no-op in the provider, the finalizer will be managed?
Yep, even in a no-op case we ensure the finalizer is properly managed. So service LB w/ finalizer would have the same lifecycle regardless of the specific cloud provider.
Worth noting that if the provider opts-out of LoadBalancer support, then we don't actually run service controller at all (see here). Does that behavior need to change or can we leave that as-is?
Do we support a service with Type=LoadBalancer if the cloud provider doesn't support it (I think yes)? There can be custom controllers that watch this but maybe they should implement their own finalizer protection?
Worth noting that if the provider opts-out of LoadBalancer support, then we don't actually run service controller at all (see here). Does that behavior need to change or can we leave that as-is?
Thanks, good to know. My thought is we leave that as-is. In that case creating a LoadBalancer type service will have no effect (from the service controller perspective), hence the finalizer won't be needed.
There can be custom controllers that watch this but maybe they should implement their own finalizer protection?
Implementing their own finalizer makes the most sense to me --- I imagine a custom controller may have a distinct pipeline from the service controller.
Makes sense to me :) Should we mention this in the KEP somewhere?
Sure, will add this in KEP.
Added an "Other notes" section for this.
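As a reminder of how that opt-out works today, here is a simplified sketch of the check (the surrounding controller-manager wiring is omitted, and the function is illustrative rather than the actual startup code):

```go
package sketch

import (
	"fmt"

	cloudprovider "k8s.io/cloud-provider"
)

// startServiceController sketches the existing behaviour referenced
// above: when the provider reports no LoadBalancer support, the service
// controller never runs, so no finalizer is added and none needs to be
// removed. Custom LoadBalancer controllers would manage their own
// finalizers.
func startServiceController(cloud cloudprovider.Interface) error {
	if _, supported := cloud.LoadBalancer(); !supported {
		fmt.Println("cloud provider has no LoadBalancer support; skipping service controller")
		return nil
	}
	// ... construct and run the service controller, which manages the
	// service LoadBalancer finalizer for type=LoadBalancer Services ...
	return nil
}
```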
## Proposal

We are going to define a finalizer for service LoadBalancers with name
`kubernetes.io/service-load-balancer`. This finalizer will be attached to any
`service.kubernetes.io/load-balancer-cleanup`? Do we have any convention for naming in-system finalizers?
For volume we use `kubernetes.io/pvc-protection` (ref https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/finalizer.go).
`load-balancer-cleanup` seems good though, as it is clearer what this is for.
harumph. Storage should have used a storage prefix.
Agreed, updated to use `service.kubernetes.io/load-balancer-cleanup`.
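For reference, a minimal sketch of what managing that finalizer might look like with client-go (a recent, context-aware client is assumed; the helper names are illustrative and not the service controller's actual code):

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// LoadBalancerCleanupFinalizer is the name settled on in this thread.
const LoadBalancerCleanupFinalizer = "service.kubernetes.io/load-balancer-cleanup"

// addFinalizer appends the finalizer to a LoadBalancer Service if it is
// not already present, so deletion will be held until cleanup finishes.
func addFinalizer(ctx context.Context, cs kubernetes.Interface, svc *v1.Service) error {
	for _, f := range svc.Finalizers {
		if f == LoadBalancerCleanupFinalizer {
			return nil // already attached
		}
	}
	updated := svc.DeepCopy()
	updated.Finalizers = append(updated.Finalizers, LoadBalancerCleanupFinalizer)
	_, err := cs.CoreV1().Services(svc.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
	return err
}

// removeFinalizer drops the finalizer once the load balancing resources
// are gone, which allows a pending deletion to complete.
func removeFinalizer(ctx context.Context, cs kubernetes.Interface, svc *v1.Service) error {
	updated := svc.DeepCopy()
	kept := make([]string, 0, len(updated.Finalizers))
	for _, f := range updated.Finalizers {
		if f != LoadBalancerCleanupFinalizer {
			kept = append(kept, f)
		}
	}
	updated.Finalizers = kept
	_, err := cs.CoreV1().Services(svc.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
	return err
}
```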
- Verify the downgrade path and ensure service finalizer removal works.
- Verify the upgrade path and ensure finalizer protection works with existing LB
services.
Thoughts on also adding regression tests?
Yeah I haven't put much thought on that. Would the existing LB related tests be enough for this purpose?
yup I think that's good 👍 thanks.
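If a dedicated regression check were ever wanted on top of the existing LB tests, it could look roughly like the sketch below — a hypothetical helper built on client-go, not the actual e2e suite:

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// verifyFinalizerBlocksDeletion deletes a LoadBalancer Service and
// checks that it lingers in a terminating state (finalizer still set)
// until the controller finishes cleanup and the object disappears.
func verifyFinalizerBlocksDeletion(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	if err := cs.CoreV1().Services(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		svc, err := cs.CoreV1().Services(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // finalizer removed and deletion completed
		}
		if err != nil {
			return false, err
		}
		if svc.DeletionTimestamp == nil {
			return false, fmt.Errorf("service %s/%s not marked for deletion", ns, name)
		}
		// Still terminating: the cleanup finalizer should be holding it.
		fmt.Printf("service %s/%s terminating, finalizers: %v\n", ns, name, svc.Finalizers)
		return false, nil
	})
}
```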
## Summary

We will be adding finalizer protection to ensure the Service resource is not
fully deleted until the correlating load balancing resources are also deleted.
Suggested change: "fully deleted until the correlating load balancing resources are also deleted." → "fully deleted until the correlating load balancing resources are deleted."
this seems better?
Thanks, I will apply these and some suggestions above at the next push.
Fixed.
👍
### Goals

Ensure the Service resource is not fully deleted until the correlating load
balancing resources are also deleted.
same as above
Fixed.
We are going to define a finalizer for service LoadBalancers with name
`service.kubernetes.io/load-balancer-cleanup`. This finalizer will be attached to
any service that has `type=LoadBalancer`. Upon the deletion of such service, the
I think it's worth mentioning here also that cloud provider integration is required.
This finalizer will be attached to any service that has `type=LoadBalancer` if the cluster has the cloud provider integration enabled.
Thanks, applied.
implication that the resources cleanup must be fulfilled before fully deleting
the service.

The lifecyle of a `LoadBalancer` type serivce with finalizer would look like:
typo: serivce
Fixed.
Is this ready for re-review?
…On Fri, Apr 26, 2019 at 8:36 PM Andrew Sy Kim wrote, commenting on this pull request, in keps/sig-network/20190423-service-lb-finalizer.md (#992 (comment)):
> + - The newly created `LoadBalancer` services will have finalizer attached upon
+ creation.
+- Downgrade from with-finailzer version
+ - All existing `LoadBalancer` service will have the attached finalizer removed
+ upon the cleanup of load balancing resources.
+ - The newly created `LoadBalancer` services will not have finailzer attached.
+
+To ensures that downgrades can happen safely, the first release will include the
+"remove finalizer" logic with the "add finalizer" logic behind a gate. Then in a
+later release we will remove the feature gate and enable both the "remove" and
+"add" logic by default.
+
+As such, we are proposing Alpha/Beta/GA phase for this enhancement as below:
+- Alpha: The finailizer cleanup logic would be on by default while the finalizer
+addition logic will be put behind a feature gate that is off by default.
+- Beta: Both the cleanup and addition logic will be default on, but allowing
I thought that sequence was what @MrHohn was proposing?
Sorry, I was a bit confused earlier. I think we're on the same page now :)
Addressed some feedback, PTAL thanks!
/lgtm
I think we should revisit the risks involved with n+2 upgrades after alpha. Maybe we should wait 2 releases just to be safe? SIG Storage folks might have some useful insight there since they had a similar problem with volumes.
(Force-pushed from 539149d to c6701fe.)
Thanks folks, I've changed the status to
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andrewsykim, MrHohn, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
This KEP proposes the plan for adding finalizer protection for service load balancers. The work was started a long while back but unfortunately hasn't been delivered yet. We are looking forward to making it happen in 1.15.
Note that much of the content is derived from the original implementation (kubernetes/kubernetes#54569) of this feature.
Enhancement issue: #980