diff --git a/keps/prod-readiness/sig-network/1860.yaml b/keps/prod-readiness/sig-network/1860.yaml
index 4c1af10d389..9ab489d6800 100644
--- a/keps/prod-readiness/sig-network/1860.yaml
+++ b/keps/prod-readiness/sig-network/1860.yaml
@@ -4,3 +4,5 @@ kep-number: 1860
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t" # tentative
diff --git a/keps/sig-network/1860-kube-proxy-IP-node-binding/README.md b/keps/sig-network/1860-kube-proxy-IP-node-binding/README.md
index 2144b554c0a..5d12a1c99ae 100644
--- a/keps/sig-network/1860-kube-proxy-IP-node-binding/README.md
+++ b/keps/sig-network/1860-kube-proxy-IP-node-binding/README.md
@@ -189,138 +189,125 @@ Yes. It is tested by `TestUpdateServiceLoadBalancerStatus` in pkg/registry/core/
 ### Rollout, Upgrade and Rollback Planning
 
-
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-
+In case of a rollback, kube-proxy will also roll back to the default behavior, switching
+back to `VIP` mode. This can break workloads that already rely on the new behavior
+(e.g. sending traffic to the LoadBalancer expecting additional features such as PROXY
+protocol and TLS termination, as described in the Motivation section).
 
 ###### What specific metrics should inform a rollback?
 
-
+If using kube-proxy, the metrics `sync_proxy_rules_duration_seconds` and
+`sync_proxy_rules_last_timestamp_seconds` may help identify problems and indicate
+that a rollback is required.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-
+Because this feature depends on a CCM/LoadBalancer controller, and none implements it yet,
+the scenario was simulated: the upgrade/downgrade/upgrade path consists of enabling and
+disabling the feature flag and making the corresponding changes on the Service status
+subresource.
+
+A LoadBalancer controller (MetalLB) runs in the environment and is responsible for the proper
+LB IP allocation and announcement, but the rest of the test verifies whether or not kube-proxy
+programs the iptables rules along this enablement/disablement path.
+
+* Initial scenario
+  * Started with a v1.29 cluster with the feature flag enabled
+  * Created 3 Deployments:
+    * web1 - will be using the new feature
+    * web2 - will NOT be using the new feature
+    * client - "the client"
+  * Created LoadBalancer Services for the two web Deployments. By default, both LBs get the
+    default `VIP` value:
+```yaml
+status:
+  loadBalancer:
+    ingress:
+    - ip: 172.18.255.200
+      ipMode: VIP
+```
+  * With the feature flag enabled but no change to the Service resources, both
+    web Deployments were accessible
+  * Verified that the iptables rules for both LBs exist on all nodes
+* Testing the feature ("upgrade")
+  * Changed the `ipMode` of the first LoadBalancer to `Proxy`
+  * Verified that the iptables rule for the second LB still exists, while the one for the
+    first LB does not
+  * Because the LoadBalancer of the first Service (MetalLB) is not aware of this new behavior,
+    that Service is no longer accessible from the client Pod
+  * The second Service, whose `ipMode` is `VIP`, is still accessible from the Pods
+* Disabling the feature flag ("downgrade")
+  * Edited the kube-apiserver manifest and disabled the feature flag
+  * Edited the kube-proxy ConfigMap, disabled the feature and restarted the kube-proxy Pods
+    (see the configuration sketch after this list)
+  * Confirmed that both iptables rules are present, even though the `ipMode` field was still
+    set to `Proxy`, confirming the feature is disabled
+  * Both accesses are working
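+
+For reference, a minimal sketch of how the feature was toggled in the steps above, assuming the
+feature gate name `LoadBalancerIPMode` from this KEP and the standard `featureGates` field of the
+kube-proxy configuration (on the kube-apiserver side, the equivalent is the
+`--feature-gates=LoadBalancerIPMode=false` flag in its manifest):
+```yaml
+apiVersion: kubeproxy.config.k8s.io/v1alpha1
+kind: KubeProxyConfiguration
+featureGates:
+  # true while testing the feature ("upgrade"), false for the "downgrade" step
+  LoadBalancerIPMode: false
+```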
+
+Additionally, an apiserver and kube-proxy upgrade test was executed as follows:
+* Created a KinD cluster with v1.28
+* Created the same Deployments and Services as described above
+  * Both LoadBalancers were accessible
+* Upgraded the apiserver and kube-proxy to v1.29 and enabled the feature flag
+* Set `ipMode` to `Proxy` on one of the Services and executed the same tests as above
+  * Observed the expected behavior: the iptables rule for the changed Service was not created
+  * Observed that the changed Service was no longer accessible, as expected
+* Disabled the feature flag
+* Rolled back kube-apiserver and kube-proxy to v1.28
+* Verified that both Services work correctly on v1.28
+* Upgraded again to v1.29, keeping the feature flag disabled
+  * Both LoadBalancers worked as expected, and the `ipMode` field is still present on the
+    changed Service.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
-
+No.
 
 ### Monitoring Requirements
 
-
-
 ###### How can an operator determine if the feature is in use by workloads?
 
-
+If the LB IP works correctly when accessed from Pods, then the feature is working.
 
 ###### How can someone using this feature know that it is working for their instance?
 
-
-
-- [ ] Events
-  - Event Reason:
-- [ ] API .status
+- [X] API .status
   - Condition name:
-  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+  - Other field: `.status.loadBalancer.ingress.ipMode` is not null
+- [X] Other:
+  - Details: To detect whether the traffic is being directed to the LoadBalancer and not
+    directly to another node, the user will need to rely on the LoadBalancer logs and the
+    destination workload logs to check whether the traffic comes from one Pod to the other
+    or from the LoadBalancer.
+
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-
+The quality of service for clouds using this feature is the same as the existing
+quality of service for clouds that don't need this feature.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
-
-
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+N/A
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-
+* On kube-proxy, a metric containing the count of programmed LoadBalancer IPs per `ipMode` would
+be useful to determine whether the feature is being used, and whether there is any drift between
+nodes.
 
 ### Dependencies
 
-
-
 ###### Does this feature depend on any specific services running in the cluster?
 
-
+- cloud controller manager / LoadBalancer controller
+  - If there is an outage of the cloud controller manager, the result is the same as if this
+    feature wasn't in use; the LoadBalancers will get out of sync with the Services
+- kube-proxy or another service proxy that implements this feature
+  - If there is a service proxy outage, the result is the same as if this feature wasn't in use
 
 ### Scalability
 
@@ -336,79 +323,34 @@ previous answers based on experience in the field.
 
 ###### Will enabling / using this feature result in any new API calls?
 
-
+No.
 
 ###### Will enabling / using this feature result in introducing new API types?
 
-
+No.
 
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
-
+No.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
-
+- API type: v1/Service
+- Estimated increase in size: one new string field (illustrated below); the longest value
+  supported at this time is `Proxy` (5 characters)
+- Estimated amount of new objects: 0
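+
+As an illustration (a sketch reusing the LB IP from the walkthrough above), the entire addition to
+a Service object is this single string in the status:
+```yaml
+status:
+  loadBalancer:
+    ingress:
+    - ip: 172.18.255.200
+      ipMode: Proxy  # the only new field; `VIP` is the default and keeps the existing behavior
+```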
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
-
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
-
+No.
 
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
 
-
+No.
 
 ### Troubleshooting
 
@@ -425,19 +367,14 @@ details). For now, we leave it here.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
+The same as for any LoadBalancer/cloud controller manager: the new IP and the new status will not
+be set.
+
+kube-proxy reacts to the IP in the status, so the Service's LoadBalancer IP and configuration will
+remain pending.
+
 ###### What are other known failure modes?
 
-
+N/A
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
+N/A
\ No newline at end of file
diff --git a/keps/sig-network/1860-kube-proxy-IP-node-binding/kep.yaml b/keps/sig-network/1860-kube-proxy-IP-node-binding/kep.yaml
index 0ff3e31001a..3ff674a2d85 100644
--- a/keps/sig-network/1860-kube-proxy-IP-node-binding/kep.yaml
+++ b/keps/sig-network/1860-kube-proxy-IP-node-binding/kep.yaml
@@ -14,9 +14,9 @@ approvers:
 - "@thockin"
 - "@andrewsykim"
 
-stage: "alpha"
+stage: "beta"
 
-latest-milestone: "v1.29"
+latest-milestone: "v1.30"
 
 milestone:
   alpha: "v1.29"