From f97e5e72b44b76e927d7b4621b2a56c567df7f49 Mon Sep 17 00:00:00 2001
From: Dan Winship
Date: Thu, 1 Dec 2022 12:46:13 -0500
Subject: [PATCH] Initial proposal of kube-proxy nftables mode

---
 .../sig-network/NNNN-nftables-proxy/README.md | 1342 +++++++++++++++++
 keps/sig-network/NNNN-nftables-proxy/kep.yaml |   39 +
 2 files changed, 1381 insertions(+)
 create mode 100644 keps/sig-network/NNNN-nftables-proxy/README.md
 create mode 100644 keps/sig-network/NNNN-nftables-proxy/kep.yaml

diff --git a/keps/sig-network/NNNN-nftables-proxy/README.md b/keps/sig-network/NNNN-nftables-proxy/README.md
new file mode 100644
index 000000000000..14dc5fd352aa
--- /dev/null
+++ b/keps/sig-network/NNNN-nftables-proxy/README.md
@@ -0,0 +1,1342 @@
# KEP-NNNN: An nftables-based kube-proxy backend

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [The iptables kernel subsystem has unfixable performance problems](#the-iptables-kernel-subsystem-has-unfixable-performance-problems)
  - [Upstream development has moved on from iptables to nftables](#upstream-development-has-moved-on-from-iptables-to-nftables)
  - [The ipvs mode of kube-proxy will not save us](#the--mode-of-kube-proxy-will-not-save-us)
  - [The nf_tables mode of /sbin/iptables will not save us](#the--mode-of--will-not-save-us)
  - [The iptables mode of kube-proxy has grown crufty](#the--mode-of-kube-proxy-has-grown-crufty)
  - [We will hopefully be able to trade 2 supported backends for 1](#we-will-hopefully-be-able-to-trade-2-supported-backends-for-1)
  - [Writing a new kube-proxy mode may help with our "KPNG" goals](#writing-a-new-kube-proxy-mode-may-help-with-our-kpng-goals)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [High level](#high-level)
  - [Low level](#low-level)
    - [Tables](#tables)
    - [Communicating with the kernel nftables subsystem](#communicating-with-the-kernel-nftables-subsystem)
    - [Versioning and compatibility](#versioning-and-compatibility)
    - [NAT rules](#nat-rules)
      - [General Service dispatch](#general-service-dispatch)
      - [Masquerading](#masquerading)
      - [Session affinity](#session-affinity)
    - [Filter rules](#filter-rules)
      - [Dropping or rejecting packets for services with no endpoints](#dropping-or-rejecting-packets-for-services-with-no-endpoints)
      - [Dropping traffic rejected by LoadBalancerSourceRanges](#dropping-traffic-rejected-by-)
      - [Forcing traffic on HealthCheckNodePorts to be accepted](#forcing-traffic-on--to-be-accepted)
    - [Future improvements](#future-improvements)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation 
History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Continue to improve the iptables mode](#continue-to-improve-the--mode) + - [Fix up the ipvs mode](#fix-up-the--mode) + - [Use an existing nftables-based kube-proxy implementation](#use-an-existing-nftables-based-kube-proxy-implementation) + - [Create an eBPF-based proxy implementation](#create-an-ebpf-based-proxy-implementation) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +The default kube-proxy implementation on Linux is currently based on +iptables. IPTables was the preferred packet filtering and processing +system in the Linux kernel for many years (starting with the 2.4 +kernel in 2001). However, problems with iptables led to the +development of a successor, nftables, first made available in the 3.13 +kernel in 2014, and growing increasingly featureful and usable as a +replacement for iptables since then. Development on iptables has +mostly stopped, with new features and performance improvements +primarily going into nftables instead. + +This KEP proposes the creation of a new official/supported nftables +backend for kube-proxy. While it is hoped that this backend will +eventually replace both the `iptables` and `ipvs` backends and become +the default kube-proxy mode on Linux, that replacement/deprecation +would be handled in a separate future KEP. + +## Motivation + +There are currently two officially supported kube-proxy backends for +Linux: `iptables` and `ipvs`. (The original `userspace` backend was +deprecated several releases ago and removed from the tree in 1.25.) + +The `iptables` mode of kube-proxy is currently the default, and it is +generally considered "good enough" for most use cases. Nonetheless, +there are good arguments for replacing it with a new `nftables` mode. 
+ +### The iptables kernel subsystem has unfixable performance problems + +Although much work has been done to improve the performance of the +kube-proxy `iptables` backend, there are fundamental +performance-related problems with the implementation of iptables in +the kernel, both on the "control plane" side and on the "data plane" +side: + + - The control plane is problematic because the iptables API does not + support making incremental changes to the ruleset. If you want to + add a single iptables rule, the iptables binary must acquire a lock, + download the entire ruleset from the kernel, find the appropriate + place in the ruleset to add the new rule, add it, re-upload the + entire ruleset to the kernel, and release the lock. This becomes + slower and slower as the ruleset increases in size (ie, as the + number of Kubernetes Services grows). If you want to replace a large + number of rules (as kube-proxy does frequently), then simply the + time that it takes `/sbin/iptables-restore` to parse all of the + rules becomes substantial. + + - The data plane is problematic because (for the most part), the + number of iptables rules used to implement a set of Kubernetes + Services is directly proportional to the number of Services. And + every packet going through the system then needs to pass through + all of these rules, slowing down the traffic. + +IPTables is the bottleneck in kube-proxy performance, and it always +will be until we stop using it. + +### Upstream development has moved on from iptables to nftables + +In large part due to its unfixable problems, development on iptables +in the kernel has slowed down and mostly stopped. New features are not +being added to iptables, because nftables is supposed to do everything +iptables does, but better. + +Although there is no plan to remove iptables from the upstream kernel, +that does not guarantee that iptables will remain supported by +_distributions_ forever. In particular, Red Hat has declared that +[iptables is deprecated in RHEL 9] and is likely to be removed +entirely in RHEL 10, a few years from now. Other distributions have +made smaller steps in the same direction; for instance, [Debian +removed `iptables` from the set of "required" packages] in Debian 11 +(Bullseye). + +The RHEL deprecation in particular impacts Kubernetes in two ways: + + 1. Many Kubernetes users run RHEL or one of its downstreams, so in a + few years when RHEL 10 is released, they will be unable to use + kube-proxy in `iptables` mode (or, for that matter, in `ipvs` or + `userspace` mode, since those modes also make heavy use of the + iptables API). + + 2. Several upstream iptables bugs and performance problems that + affect Kubernetes have been fixed by Red Hat developers over the + past several years. With Red Hat no longer making any effort to + maintain iptables, it is less likely that upstream iptables bugs + that affect Kubernetes in the future would be fixed promptly, if + at all. + +[iptables is deprecated in RHEL 9]: https://access.redhat.com/solutions/6739041 +[Debian removed `iptables` from the set of "required" packages]: https://salsa.debian.org/pkg-netfilter-team/pkg-iptables/-/commit/c59797aab9 + +### The `ipvs` mode of kube-proxy will not save us + +Because of the problems with iptables, some developers added an `ipvs` +mode to kube-proxy in 2017. It was generally hoped that this could +eventually solve all of the problems with the `iptables` mode and +become its replacement, but this never really happened. It's not +entirely clear why... 
[kubeadm #817], "Track when we can enable the +ipvs mode for the kube-proxy by default" is perhaps a good snapshot of +the initial excitement followed by growing disillusionment with the +`ipvs` mode: + + - "a few issues ... re: the version of iptables/ipset shipped in the + kube-proxy container image" + - "clearly not ready for defaulting" + - "complications ... with IPVS kernel modules missing or disabled on + user nodes" + - "we are still lacking tests" + - "still does not completely align with what [we] support in + iptables mode" + - "iptables works and people are familiar with it" + - "[not sure that it was ever intended for IPVS to be the default]" + +Additionally, the kernel IPVS APIs alone do not provide enough +functionality to fully implement Kubernetes services, and so the +`ipvs` backend also makes heavy use of the iptables API. Thus, if we +are worried about iptables deprecation, then in order to switch to +using `ipvs` as the default mode, we would have to port the iptables +parts of it to use nftables anyway. But at that point, there would be +little excuse for using IPVS for the core load-balancing part, +particularly given that IPVS, like iptables, is no longer an +actively-developed technology. + +[kubeadm #817]: https://github.com/kubernetes/kubeadm/issues/817 +[not sure that it was ever intended for IPVS to be the default]: https://en.wikipedia.org/wiki/The_Fox_and_the_Grapes + +### The `nf_tables` mode of `/sbin/iptables` will not save us + +In 2018, with the 1.8.0 release of the iptables client binaries, a new +mode was added to the binaries, to allow them to use the nftables API +in the kernel rather than the legacy iptables API, while still +preserving the "API" of the original iptables binaries. As of 2022, +most Linux distributions now use this mode, so the legacy iptables +kernel API is mostly dead. + +However, this new mode does not add any new _syntax_, and so it is not +possible to use any of the new nftables features (like maps) that are +not present in iptables. + +Furthermore, the compatibility constraints imposed by the user-facing +API of the iptables binaries themselves prevent them from being able +to take advantage of many of the performance improvements associated +with nftables. + +### The `iptables` mode of kube-proxy has grown crufty + +Because `iptables` is the default kube-proxy mode, it is subject to +strong backward-compatibility constraints which mean that certain +"features" that are now considered to be bad ideas cannot be removed +because they might break some existing users. A few examples: + + - It allows NodePort services to be accessed on `localhost`, which + requires it to set a sysctl to a value that may introduce security + holes on the system. More generally, it defaults to having + NodePort services be accessible on _all_ node IPs, when most users + would probably prefer them to be more restricted. + + - It implements the `LoadBalancerSourceRanges` feature for traffic + addressed directly to LoadBalancer IPs, but not for traffic + redirected to a NodePort by an external LoadBalancer. + + - Some new functionality only works correctly if the administrator + passes certain command-line options to kube-proxy (eg, + `--cluster-cidr`), but we cannot make those options be mandatory, + since that would break old clusters that aren't passing them. + +A new kube-proxy, which existing users would have to explicitly opt +into, could revisit these and other decisions. 

### We will hopefully be able to trade 2 supported backends for 1

Right now SIG Network is supporting both the `iptables` and `ipvs`
backends of kube-proxy, and does not feel like it can ditch `ipvs`
because of performance issues with `iptables`. If we create a new
backend which is as functional and non-buggy as `iptables` but as
performant as `ipvs`, then we could (eventually) deprecate both of the
existing backends and only have one backend to support in the future.

### Writing a new kube-proxy mode may help with our "KPNG" goals

The [KPNG] (Kube-Proxy Next Generation) working group has been working
on the future of kube-proxy's underlying architecture. They have
recently proposed a [kube-proxy library KEP]. Creating a new proxy
mode which will be officially supported, but which does not (yet) have
the same compatibility and non-bugginess requirements as the
`iptables` and `ipvs` modes should help with that project, because we
can target the new backend to the new library without worrying about
breaking the old backends.

[KPNG]: https://github.com/kubernetes-sigs/kpng
[kube-proxy library KEP]: https://github.com/kubernetes/enhancements/pull/3649

### Goals

- Design and implement an `nftables` mode for kube-proxy.

  - Drop support for localhost nodeports

  - Ensure that all configuration which is _required_ for full
    functionality (eg, `--cluster-cidr`) is actually required,
    rather than just logging warnings about missing functionality.

  - Consider other fixes to legacy `iptables` mode behavior.

- Come up with at least a vague plan to eventually make `nftables` the
  default backend.

- Decide whether we can/should deprecate or even remove the `iptables`
  and/or `ipvs` backends. (Perhaps they can be pushed out of tree, a
  la `cri-dockerd`.)

- Take advantage of kube-proxy-related work being done by the kpng
  working group.

### Non-Goals

- Falling into the same traps as the `ipvs` backend, to the extent
  that we can identify what those traps were.

## Proposal

### Notes/Constraints/Caveats

At least three nftables-based kube-proxy implementations already
exist, but none of them seems suitable either to adopt directly or to
use as a starting point:

- [kube-nftlb]: This is built on top of a separate nftables-based load
  balancer project called [nftlb], which means that rather than
  translating Kubernetes Services directly into nftables rules, it
  translates them into nftlb load balancer objects, which then get
  translated into nftables rules. Besides making the code more
  confusing for users who aren't already familiar with nftlb, this
  also means that in many cases, new Service features would need to
  have features added to the nftlb core first before kube-nftlb could
  consume them. (Also, it has not been updated in two years.)

- [nfproxy]: Its README notes that "nfproxy is not a 1:1 copy of
  kube-proxy (iptables) in terms of features. nfproxy is not going to
  cover all corner cases and special features addressed by
  kube-proxy". (Also, it has not been updated in two years.)

- [kpng's nft backend]: This was written as a proof of concept and is
  mostly a straightforward translation of the iptables rules to
  nftables, and doesn't make good use of nftables features that would
  let it reduce the total number of rules. It also makes heavy use of
  kpng's APIs, like "DiffStore", for which there is not yet consensus
  about adopting upstream.
+ +[kube-nftlb]: https://github.com/zevenet/kube-nftlb +[nftlb]: https://github.com/zevenet/nftlb +[nfproxy]: https://github.com/sbezverk/nfproxy +[kpng's nft backend]: https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft + +### Risks and Mitigations + +The primary risk of the proposal is feature regressions, which will be +addressed by testing, and by a slow, optional, rollout of the new proxy +mode. + +The `nftables` mode should not pose any new security issues relative +to the `iptables` mode. + +## Design Details + +### High level + +At a high level, the new mode should have the same architecture as the +existing modes; it will use the service/endpoint-tracking code in +`k8s.io/kubernetes/pkg/proxy` (or its eventual replacement from kpng) +to watch for changes, and update rules in the kernel accordingly. + +### Low level + +Some details will be figured out as we implement it. We may start with +an implementation that is architecturally closer to the `iptables` +mode, and then rewrite it to take advantage of additional nftables +features over time. + +#### Tables + +Unlike iptables, nftables does not have any reserved/default tables or +chains (eg, `nat`, `PREROUTING`). Users are expected to create their +own tables and chains for their own purposes. An nftables table can +only contain rules for a single "family" (`ip` (v4), `ip6`, `inet` +(both IPv4 and IPv6), `arp`, `bridge`, or `netdev`), but unlike in +iptables, you can have both "filter"-type chains and "NAT"-type chains +in the same table. + +So, we will create a single `kube_proxy` table in the `ip` family, and +another in the `ip6` family. All of our chains, sets, maps, etc, will +go into those tables. Other system components (eg, firewalld) should +ignore our table, so we should not need to worry about watching for +other people deleting our rules like we have to in the `iptables` +backend. + +(In theory, instead of creating one table each in the `ip` and `ip6` +families, we could create a single table in the `inet` family and put +both IPv4 and IPv6 chains/rules there. However, this wouldn't really +result in much simplification, because we would still need separate +sets/maps to match IPv4 addresses and IPv6 addresses. (There is no +data type that can store/match either an IPv4 address or an IPv6 +address.) Furthermore, because of how Kubernetes Services evolved in +parallel with the existing kube-proxy implementation, we have ended up +with a dual-stack Service semantics that is most easily implemented by +handling IPv4 and IPv6 completely separately anyway.) + +#### Communicating with the kernel nftables subsystem + +At least initially, we will use the `nft` command-line tool to read +and write rules, much like how we use command-line tools in the +`iptables` and `ipvs` backends. However, the `nft` tool is mostly just +a thin wrapper around `libnftables`, and it would be possible to use +that directly instead in the future, given a cgo wrapper. + +When reading data from the kernel (`nft list ...`), `nft` outputs the +data in a nested "object" form: + +``` +table ip kube_proxy { + comment "Kubernetes service proxying rules"; + + chain services { + ip daddr . ip protocol . th dport vmap @service_ips + } +} +``` + +(This is the "native" nftables syntax, but the tools also support a +JSON syntax that may be easier for us to work with...) 
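
For illustration, the JSON form of roughly the same data (as printed
by `nft -j list table ip kube_proxy`) looks something like the
abbreviated sketch below. This is only a sketch: the exact fields are
defined by the `libnftables-json` schema and vary somewhat between
versions, the handles and version number shown here are made up, and
the rule expression is elided rather than spelled out in full.

```
$ nft -j list table ip kube_proxy
{"nftables": [
  {"metainfo": {"version": "1.0.5", "json_schema_version": 1}},
  {"table": {"family": "ip", "name": "kube_proxy", "handle": 1}},
  {"chain": {"family": "ip", "table": "kube_proxy", "name": "services", "handle": 2}},
  {"rule": {"family": "ip", "table": "kube_proxy", "chain": "services", "handle": 3,
            "expr": [{"vmap": {"key": {"concat": ["..."]}, "data": "@service_ips"}}]}}
]}
```

The JSON form is more awkward for a human to read, but it is trivially
machine-parseable, which is what might make it easier for kube-proxy
to generate and consume than the native syntax.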

When writing data to the kernel, `nft` accepts the data in either the
same "object" form used by `nft list`, or in the form of a set of
`nft` command lines without the leading "`nft`" (which are then
executed atomically):

```
add table ip kube_proxy { comment "Kubernetes service proxying rules"; }
add chain ip kube_proxy services
add rule ip kube_proxy services ip daddr . ip protocol . th dport vmap @service_ips
```

The "object" form is more logical and easy to understand, but the
"command" form is better for dynamic usage. In particular, it allows
you to add and remove individual chains, rules, map/set elements, etc,
without needing to also include the chains/rules/elements that you are
not modifying.

The examples below all show the "object" form of data, but it should
be understood that these are examples of what would be seen in `nft
list` output after kube-proxy creates the rules (with additional
`#`-preceded comments added to help the KEP reader), not examples of
the data we will actually be passing to `nft`.

The examples below are also all IPv4-specific, for simplicity. When
actually writing out rules for nft, we will need to switch between,
e.g., "`ip daddr`" and "`ip6 daddr`" appropriately, to match an IPv4
or IPv6 destination address. This will actually be fairly simple
because the `nft` command lets you create "variables" (really
constants) and substitute their values into the rules. Thus, we can
just always have the rule-generating code write "`$IP daddr`", and
then pass either "`-D IP=ip`" or "`-D IP=ip6`" to `nft` to fix it up.

(Also, most of the examples below have not actually been tested and
may have syntax errors. Caveat lector.)

#### Versioning and compatibility

Since nftables is subject to much more development than iptables has
been recently, we will need to pay more attention to kernel and tool
versions.

The `nft` command has a `--check` option which can be used to check if
a command could be run successfully; it parses the input, and then
(assuming success) uploads the data to the kernel and asks the kernel
to check it (but not actually act on it) as well. Thus, with a few
`nft --check` runs at startup we should be able to confirm what
features are known to both the tooling and the kernel.

It is not yet clear what the minimum kernel or `nft` command-line
versions needed by the `nftables` backend will be. The newest feature
used in the examples below was added in Linux 5.6, released in March
2020 (though they could be rewritten to not need that feature).

It is possible some users will not be able to upgrade from the
`iptables` and `ipvs` backends to `nftables`. (Certainly the
`nftables` backend will not support RHEL 7, which some people are
still using Kubernetes with.)

#### NAT rules

##### General Service dispatch

For ClusterIP and external IP services, we will use an nftables
"verdict map" to store the logic about where to dispatch traffic,
based on destination IP, protocol, and port. We will then need only a
single actual rule to apply the verdict map to all inbound traffic.
(Or it may end up making more sense to have separate verdict maps for
ClusterIP, ExternalIP, and LoadBalancer IP?) Likewise, for NodePort
traffic, we will use a verdict map matching only on destination
protocol / port, with the rules set up to only check the `nodeports`
map for packets addressed to a local IP.
+ +``` +map service_ips { + comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"; + + # The "type" clause defines the map's datatype; the key type is to + # the left of the ":" and the value type to the right. The map key + # in this case is a concatenation (".") of three values; an IPv4 + # address, a protocol (tcp/udp/sctp), and a port (aka + # "inet_service"). The map value is a "verdict", which is one of a + # limited set of nftables actions. In this case, the verdicts are + # all "goto" statements. + + type ipv4_addr . inet_proto . inet_service : verdict; + + elements { + 172.30.0.44 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK, + 192.168.99.33 . tcp . 80 : goto svc_4SW47YFZTEDKD3PK, + ... + } +} + +map service_nodeports { + comment "NodePort traffic"; + type inet_proto . inet_service : verdict; + + elements { + tcp . 3001 : goto svc_4SW47YFZTEDKD3PK, + ... + } +} + +chain prerouting { + jump services + jump nodeports +} + +chain services { + # Construct a key from the destination address, protocol, and port, + # then look that key up in the `service_ips` vmap and take the + # associated action if it is found. + + ip daddr . ip protocol . th dport vmap @service_ips +} + +chain nodeports + # Return if the destination IP is non-local, or if it's localhost. + fib daddr type != local return + ip daddr == 127.0.0.1 return + + # If --nodeport-addresses was in use then the above would instead be + # something like: + # ip daddr != { 192.168.1.5, 192.168.3.10 } return + + # dispatch on the service_nodeports vmap + ip protocol . th dport vmap @service_nodeports +} + +# Example per-service chain +chain svc_4SW47YFZTEDKD3PK { + # Send to random endpoint chain using an inline vmap + numgen random mod 2 vmap { + 0 : goto sep_UKSFD7AGPMPPLUHC, + 1 : goto sep_C6EBXVWJJZMIWKLZ + } +} + +# Example per-endpoint chain +chain sep_UKSFD7AGPMPPLUHC { + # masquerade hairpin traffic + ip saddr 10.180.0.4 jump mark_for_masquerade + + # send to selected endpoint + dnat to 10.180.0.4:8000 +} +``` + +##### Masquerading + +The example rules above include + +``` + ip saddr 10.180.0.4 jump mark_for_masquerade +``` + +to masquerade hairpin traffic, as in the `iptables` proxier. This +assumes the existence of a `mark_for_masquerade` chain, not shown. + +nftables has the same constraints on DNAT and masquerading as iptables +does; you can only DNAT from the "prerouting" stage and you can only +masquerade from the "postrouting" stage. Thus, as with `iptables`, the +`nftables` proxy will have to handle DNAT and masquerading at separate +times. One possibility would be to simply copy the existing logic from +the `iptables` proxy, using the packet mark to communicate from the +prerouting chains to the postrouting ones. + +However, it should be possible to do this in nftables without using +the mark or any other externally-visible state; we can just create an +nftables `set`, and use that to communicate information between the +chains. Something like: + +``` +# Set of 5-tuples of connections that need masquerading +set need_masquerade { + type ipv4_addr . inet_service . ipv4_addr . inet_service . inet_proto; + flags timeout ; timeout 5s ; +} + +chain mark_for_masquerade { + update @need_masquerade { ip saddr . th sport . ip daddr . th dport . ip protocol } +} + +chain postrouting_do_masquerade { + # We use "ct original ip daddr" and "ct original proto-dst" here + # since the packet may have been DNATted by this point. + + ip saddr . th sport . ct original ip daddr . ct original proto-dst . 
ip protocol @need_masquerade masquerade +} +``` + +This is not yet tested, but some kernel nftables developers have +confirmed that it ought to work. + +##### Session affinity + +Session affinity can be done in roughly the same way as in the +`iptables` proxy, just using the more general nftables "set" framework +rather than the affinity-specific version of sets provided by the +iptables `recent` module. In fact, since nftables allows arbitrary set +keys, we can optimize relative to `iptables`, and only have a single +affinity set per service, rather than one per endpoint. (And we also +have the flexibility to change the affinity key in the future if we +want to, eg to key on source IP+port rather than just source IP.) + +``` +set affinity_4SW47YFZTEDKD3PK { + # Source IP . Destination IP . Destination Port + type ipv4_addr . ipv4_addr . inet_service; + flags timeout; timeout 3h; +} + +chain svc_4SW47YFZTEDKD3PK { + # Check for existing session affinity against each endpoint + ip saddr . 10.180.0.4 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_UKSFD7AGPMPPLUHC + ip saddr . 10.180.0.5 . 80 @affinity_4SW47YFZTEDKD3PK goto sep_C6EBXVWJJZMIWKLZ + + # Send to random endpoint chain + numgen random mod 2 vmap { + 0 : goto sep_UKSFD7AGPMPPLUHC, + 1 : goto sep_C6EBXVWJJZMIWKLZ + } +} + +chain sep_UKSFD7AGPMPPLUHC { + # Mark the source as having affinity for this endpoint + update @affinity_4SW47YFZTEDKD3PK { ip saddr . 10.180.0.4 . 80 } + + ip saddr 10.180.0.4 jump mark_for_masquerade + dnat to 10.180.0.4:8000 +} + +# likewise for other endpoint(s)... +``` + +#### Filter rules + +The `iptables` mode uses the `filter` table for three kinds of rules: + +##### Dropping or rejecting packets for services with no endpoints + +As with service dispatch, this is easily handled with a verdict map: + +``` +map no_endpoint_services { + type ipv4_addr . inet_proto . inet_service : verdict + elements = { + 192.168.99.22 . tcp . 80 : drop, + 172.30.0.46 . tcp . 80 : goto reject_chain, + 1.2.3.4 . tcp . 80 : drop + } +} + +chain filter { + ... + ip daddr . ip protocol . th dport vmap @no_endpoint_services + ... +} + +# helper chain needed because "reject" is not a "verdict" and so can't +# be used directly in a verdict map +chain reject_chain { + reject +} +``` + +##### Dropping traffic rejected by `LoadBalancerSourceRanges` + +The implementation of LoadBalancer source ranges will be similar to +the ipset-based implementation in the `ipvs` kube proxy: we use one +set to recognize "traffic that is subject to source ranges", and then +another to recognize "traffic that is _accepted_ by its service's +source ranges". Traffic which matches the first set but not the second +gets dropped: + +``` +set firewall { + comment "destinations that are subject to LoadBalancerSourceRanges"; + type ipv4_addr . inet_proto . inet_service +} +set firewall_allow { + comment "destination+sources that are allowed by LoadBalancerSourceRanges"; + type ipv4_addr . inet_proto . inet_service . ipv4_addr +} + +chain filter { + ... + ip daddr . ip protocol . th dport @firewall jump firewall_check + ... +} + +chain firewall_check { + ip daddr . ip protocol . th dport . ip saddr @firewall_allow return + drop +} +``` + +Where, eg, adding a Service with LoadBalancer IP `10.1.2.3`, port +`80`, and source ranges `["192.168.0.3/32", "192.168.1.0/24"]` would +result in: + +``` +add element ip kube_proxy firewall { 10.1.2.3 . tcp . 80 } +add element ip kube_proxy firewall { 10.1.2.3 . tcp . 80 } +add element ip kube_proxy firewall_allow { 10.1.2.3 . tcp . 
80 . 192.168.0.3/32 } +add element ip kube_proxy firewall_allow { 10.1.2.3 . tcp . 80 . 192.168.1.0/24 } +``` + +##### Forcing traffic on `HealthCheckNodePorts` to be accepted + +The `iptables` mode adds rules to ensure that traffic to NodePort +services' health check ports is allowed through the firewall. eg: + +``` +-A KUBE-NODEPORTS -m comment --comment "ns2/svc2:p80 health check node port" -m tcp -p tcp --dport 30000 -j ACCEPT +``` + +(There are also rules to accept any traffic that has already been +tagged by conntrack.) + +This cannot be done reliably in nftables; the `accept` and `drop` +rules work differently than they do in iptables, and so if there is a +firewall that would drop traffic to that port, then there is no +guaranteed way to "sneak behind its back" like you can in iptables; we +would need to actually properly configure _that firewall_ to accept +the packets. + +However, these sorts of rules are somewhat legacy anyway; they work +(in the `iptables` proxy) to bypass a _local_ firewall, but they would +do nothing to bypass a firewall implemented at the cloud network +layer, which is perhaps a more common configuration these days anyway. +Administrators using non-local firewalls are already required to +configure those firewalls correctly to allow Kubernetes traffic +through, and it is reasonable for us to just extend that requirement +to administrators using local firewalls as well. + +Thus, the `nftables` backend will not attempt to replicate these +`iptables`-backend rules. + +#### Future improvements + +Further improvements are likely possible. + +For example, it would be nice to not need a separate "hairpin" check for +every endpoint. There is no way to ask directly "does this packet have +the same source and destination IP?", but the proof-of-concept [kpng +nftables backend] does this instead: + +``` +set hairpin { + type ipv4_addr . ipv4_addr; + elements { + 10.180.0.4 . 10.180.0.4, + 10.180.0.5 . 10.180.0.5, + ... + } +} + +chain ... { + ... + ip saddr . ip daddr @hairpin jump mark_for_masquerade +} +``` + +More efficiently, if nftables eventually got the ability to call eBPF +programs as part of rule processing (like iptables's `-m ebpf`) then +we could write a trivial eBPF program to check "source IP equals +destination IP" and then call that rather than needing the giant set +of redundant IPs. + +If we do this, then we don't need the per-endpoint hairpin check +rules. If we could also get rid of the per-endpoint affinity-updating +rules, then we could get rid of the per-endpoint chains entirely, +since `dnat to ...` is an allowed vmap verdict: + +``` +chain svc_4SW47YFZTEDKD3PK { + # FIXME handle affinity somehow + + # Send to random endpoint + random mod 2 vmap { + 0 : dnat to 10.180.0.4:8000 + 1 : dnat to 10.180.0.5:8000 + } +} +``` + +With the current set of nftables functionality, it does not seem +possible to do this (in the case where affinity is in use), but future +features may make it possible. + +It is not yet clear what the tradeoffs of such rewrites are, either in +terms of runtime performance, or of admin/developer-comprehensibility +of the ruleset. + +[kpng nftables backend]: https://github.com/kubernetes-sigs/kpng/tree/master/backends/nft + +### Test Plan + + + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. 
+ +##### Prerequisite testing updates + + + +##### Unit tests + +We will add unit tests for the `nftables` mode that are equivalent to +the ones for the `iptables` mode. In particular, we will port over the +tests that feed Services and EndpointSlices into the proxy engine, +dump the generated ruleset, and then mock running packets through the +ruleset to determine how they would behave. + +The `cmd/kube-proxy/app` tests mostly only test configuration parsing, +and we will extend them to understand the new mode and its associated +configuration options, but there will not be many changes made there. + + + +- ``: `` - `` + +##### Integration tests + +Kube-proxy does not have integration tests. + +##### e2e tests + +Most of the e2e testing of kube-proxy is backend-agnostic. Initially, +we will need a separate e2e job to test the nftables mode (like we do +with ipvs). Eventually, if nftables becomes the default, then this +would be flipped around to having a legacy "iptables" job. + +The handful of e2e tests that specifically examine iptables rules will +need to be updated to be able to work with either backend. + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + +The new mode should not introduce any upgrade/downgrade problems, +excepting that you can't downgrade or feature-disable a cluster using +the new kube-proxy mode without switching it back to `iptables` or +`ipvs` first. + +When rolling out or rolling back the feature, it should be safe to +enable the feature gate and change the configuration at the same time, +since nothing cares about the feature gate except for kube-proxy +itself. Likewise, it is expected to be safe to roll out the feature in +a live cluster, even though this will result in different proxy modes +running on different nodes, because Kubernetes service proxying is +defined in such a way that no node needs to be aware of the +implementation details of the service proxy implementation on any +other node. + +(However, see the notes below in [Feature Enablement and +Rollback](#feature-enablement-and-rollback) about stale rule cleanup +when switching modes.) + +### Version Skew Strategy + +The feature is isolated to kube-proxy and does not introduce any API +changes, so the versions of other components do not matter. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + +The administrator must enable the feature gate to make the feature +available, and then must run kube-proxy with the +`--proxy-mode=nftables` flag. + +Kube-proxy does not delete its rules on exit (to avoid service +interruptions when restarting/upgrading kube-proxy, or if it crashes). +This means that when switching between proxy modes, it is necessary +for the administrator to ensure that the rules created by the old +proxy mode get deleted. (Failure to do so may result in stale service +rules being left behind for an arbitrarily long time.) The simplest +way to do this is to reboot each node when switching from one proxy +mode to another, but it is also possible to run kube-proxy in "cleanup +and exit" mode, eg: + +``` +kube-proxy --proxy-mode=iptables --cleanup +``` + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: NFTablesKubeProxy + - Components depending on the feature gate: + - kube-proxy +- [X] Other + - Describe the mechanism: + - See above + - Will enabling / disabling the feature require downtime of the control + plane? 
+ - No + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + - See above + +###### Does enabling the feature change any default behavior? + +Enabling the feature gate does not change any behavior; it just makes +the `--proxy-mode=nftables` option available. + +Switching from `--proxy-mode=iptables` or `--proxy-mode=ipvs` to +`--proxy-mode=nftables` will likely change some behavior, depending +on what we decide to do about certain un-loved kube-proxy features +like localhost nodeports. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes, though the same caveat about rebooting or running `kube-proxy +--cleanup` applies as in the "enabling" case. + +Of course, if the user is rolling back, that suggests that the +`nftables` mode was not working correctly, in which case the +`--cleanup` option may _also_ not work correctly, so rebooting the +node is safer. + +###### What happens if we reenable the feature if it was previously rolled back? + +It should just work. + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + +The feature is used by the cluster as a whole, and the operator would +know that it was in use from looking at the cluster configuration. + +###### How can someone using this feature know that it is working for their instance? + +- [X] Other (treat as last resort) + - Details: If Services still work then the feature is working + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +- [X] Metrics + - Metric names: + - ... + - Components exposing the metric: + - kube-proxy + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + +It may require a newer kernel than some current users have. It does +not depend on anything else in the cluster. + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + +Probably not; kube-proxy will still be using the same +Service/EndpointSlice-monitoring code, it will just be doing different +things locally with the results. + +###### Will enabling / using this feature result in introducing new API types? + +No + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? 

It is not expected to...

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

The same way that kube-proxy currently does; updates stop being
processed until the apiserver is available again.

###### What are other known failure modes?

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

- Initial proposal: 2023-02-01

## Drawbacks

Adding a new officially-supported kube-proxy implementation implies
more work for SIG Network (especially if we are not able to deprecate
either of the existing backends soon).

Replacing the default kube-proxy implementation will affect many
users.

However, doing nothing would result in a situation where, eventually,
many users would be unable to use the default proxy implementation.

## Alternatives

### Continue to improve the `iptables` mode

We have made many improvements to the `iptables` mode, and could make
more. In particular, we could make the `iptables` mode use IP sets
like the `ipvs` mode does.

However, even if we could solve literally all of the performance
problems with the `iptables` mode, there is still the looming
deprecation issue.

(See also "[The iptables kernel subsystem has unfixable performance
problems](#the-iptables-kernel-subsystem-has-unfixable-performance-problems)".)

### Fix up the `ipvs` mode

Rather than implementing an entirely new `nftables` kube-proxy mode,
we could try to fix up the existing `ipvs` mode.

However, the `ipvs` mode makes extensive use of the iptables API in
addition to the IPVS API. So while it solves the performance problems
with the `iptables` mode, it does not address the deprecation issue.
So we would at least have to rewrite it to be IPVS+nftables rather
than IPVS+iptables.

(See also "[The ipvs mode of kube-proxy will not save
us](#the--mode-of-kube-proxy-will-not-save-us)".)

### Use an existing nftables-based kube-proxy implementation

Discussed in [Notes/Constraints/Caveats](#notesconstraintscaveats).

### Create an eBPF-based proxy implementation

Another possibility would be to try to replace the `iptables` and
`ipvs` modes with an eBPF-based proxy backend, instead of an
nftables one. eBPF is very trendy, but it is also notoriously
difficult to work with.

One problem with this approach is that the APIs to access conntrack
information from eBPF programs only exist in the very newest kernels.
In particular, the API for NATting a connection from eBPF was only
added in the recently-released 6.1 kernel. It will be a long time
before a majority of Kubernetes users have a kernel new enough that we
can depend on that API.

Thus, an eBPF-based kube-proxy implementation would initially need a
number of workarounds for missing functionality, adding to its
complexity (and potentially forcing architectural choices that would
not otherwise be necessary, to support the workarounds).

One interesting eBPF-based approach for service proxying is to use
eBPF to intercept the `connect()` call in pods, and rewrite the
destination IP before the packets are even sent. In this case, eBPF
conntrack support is not needed (though it would still be needed for
non-local service connections, such as connections via NodePorts). 
One nice feature of this approach is that it integrates well with
possible future "multi-network Service" ideas, in which a pod might
connect to a service IP that resolves to an IP on a secondary network
which is only reachable by certain pods. In the case of a "normal"
service proxy that does destination IP rewriting in the host network
namespace, this would result in a packet that was undeliverable
(because the host network namespace has no route to the isolated
secondary pod network), but a service proxy that does `connect()`-time
rewriting would rewrite the connection before it ever left the pod
network namespace, allowing the connection to proceed.

The multi-network effort is still in the very early stages, and it is
not clear that it will actually adopt a model of multi-network
Services that works this way. (It is also _possible_ to make such a
model work with a mostly-host-network-based proxy implementation; it's
just more complicated.)

diff --git a/keps/sig-network/NNNN-nftables-proxy/kep.yaml b/keps/sig-network/NNNN-nftables-proxy/kep.yaml
new file mode 100644
index 000000000000..8e650d0ef0bf
--- /dev/null
+++ b/keps/sig-network/NNNN-nftables-proxy/kep.yaml
@@ -0,0 +1,39 @@
title: An nftables-based kube-proxy backend
kep-number: NNNN
authors:
  - "@danwinship"
owning-sig: sig-network
status: provisional
creation-date: 2023-02-01
reviewers:
  - "@thockin"
  - "@dcbw"
  - "@aojea"
approvers:
  - "@thockin"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.27"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.28"
  beta: "v1.30"
  stable: "v1.32"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: NFTablesKubeProxy
    components:
      - kube-proxy
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - ...