KNI [Kubernetes Networking Interface] Initial Draft KEP #4477

@@ -0,0 +1,113 @@

# KEP-4410: Kubernetes Networking reImagined

> **NOTE**: for the initial PR we've removed a lot of the templated text and
> aimed to keep this first iteration small and easier to consume. We are only
> focusing on the "What" and "Why" (e.g. motivation, goals, user stories) for
> this iteration so that we can build consensus on those first before we add
> any of the "How".

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
    - [Story 4](#story-4)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
<!-- /toc -->

## Summary

This proposal is to design and implement KNI (the Kubernetes Networking Interface), also known as Kubernetes Networking reImagined. KNI will introduce a Network resource and provide an API that reports network status and availability and defines how to attach a pod to a network, detach the pod from the network, and update a pod's network.
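
The KEP deliberately defers the "How", so no API is defined here. Purely as a hypothetical illustration of the surface the Summary describes (attach, detach, update, and status), the Go sketch below shows one shape such a runtime interface could take. Every type, field, and method name in it is an assumption made for illustration, not part of this proposal.

```go
package kni

import "context"

// Network is a hypothetical stand-in for the proposed Network resource.
type Network struct {
	Name   string
	Config map[string]string // implementation-specific settings, e.g. sourced from a ConfigMap
}

// PodIP pairs an address with an explicit family, reflecting the goal of
// supporting multiple pod IPs plus an IP-family field.
type PodIP struct {
	Address string
	Family  string // "IPv4" or "IPv6"
}

// PodNetworkStatus is a hypothetical result of an attach or update call.
type PodNetworkStatus struct {
	IPs   []PodIP
	Ready bool
}

// NetworkRuntime sketches the kind of gRPC-backed surface the Summary
// describes: attach a pod to a network, detach it, update it, and report
// network status and availability.
type NetworkRuntime interface {
	AttachNetwork(ctx context.Context, podNamespace, podName string, networks []Network) (PodNetworkStatus, error)
	DetachNetwork(ctx context.Context, podNamespace, podName string, networks []Network) error
	UpdateNetwork(ctx context.Context, podNamespace, podName string, networks []Network) (PodNetworkStatus, error)
	// QueryNetworks reports which networks are available on the node and
	// whether node networking is ready, without relying on files on disk.
	QueryNetworks(ctx context.Context) ([]Network, error)
}
```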

## Motivation

Kubernetes networking has traditionally been challenging for users of the
Kubernetes API to understand, and there has been considerable flexibility in
how Container Network Interfaces (CNIs) set up networking within clusters.
This has resulted in a situation where things like pod networking (including
pod-to-pod networking) are opaque to users, with different implementations
taking markedly different approaches. This fragmentation has spread networking
across all layers of the stack, including Kubernetes components such as
kube-proxy and network policy agents, the container runtime with its CNI
plugins, and low-level runtimes such as Kata, and issues with the API have
negatively impacted adoption in sectors such as telecommunications.

Our goal is to transform Kubernetes networking by making networks and their
components actual resources within the Kubernetes API. This will allow for the
development of shared functionalities and their integration into the API. We
anticipate that this new approach will enhance support for areas that are
currently struggling, facilitate the development and promotion of common
features, and better define and accommodate advanced functionalities and
potential areas for expansion.

### Goals

- Design a cool looking t-shirt
- Provide Kubernetes APIs for the creation, configuration, and management of interfaces
- Provide documentation, examples, troubleshooting, and FAQs for KNI
- KNI should provide the APIs required to establish feature parity with the current CNI [ADD, DEL]
- Handle support levels like Gateway API (e.g. "core" and "extended")
- Handle implementation-specific use cases through extension points
- Decouple the Pod and Node network setup
- Provide garbage collection to ensure that no resources created during pod setup, such as Linux bridges, eBPF programs,
  or allocated IP addresses, are left behind after pod deletion
- Improve the current IP handling for pods (PodIP) to handle multiple IP addresses and
  a field to identify the IP address family (IPv4 vs IPv6)

> **Comment:** Which is not possible in CNI because ...?

- Provide backwards compatibility for the existing CNI approach and a migration path to fully adopt KNI
- Guarantee the network is set up and in a healthy state before containers are started (ephemeral, init, regular)
- If feasible, provide API awareness of Pod network namespaces (e.g. interface names)
- Provide a uniform approach for network setup/teardown for both virtualized (kata) and non-virtualized (runc)
  runtimes, including kubevirt. This could eliminate the high- and low-level runtimes from the networking path
- Provide a reference implementation of the KNI network runtime
- Provide the ability to have all the dependencies packaged in the container image (no more CNI binaries in the host file system)
  - No more downloading CNI binaries via initContainers or mounting /etc/cni/net.d or /opt/cni/bin

> **Comment:** ... Downloading a container image to run the gRPC service is better how? Currently a cluster operator can just pre-populate these instead of relying on a daemonset; it's pretty straightforward and results in very fast startup.
>
> **Comment:** That's valid, I think - there needs to be a story for supporting pre-baked/pre-provisioned "locked down" nodes. That could be as simple as shipping node images with KNI as a system service, or shipping a node with a preloaded image - but warrants discussion.
>
> **Comment:** This doesn't make too much sense, since calico, flannel, and cilium are already running as daemonset pods. Since they would be implementing the KNI gRPC service, they would no longer have a need for CNI binaries on disk. Are you looking for a migration path here? This becomes a problem solved with KNI.
>
> **Comment:** There's 2 (somewhat edge case but still valid) operator questions that can be explicitly answered here. Both of these can be (and are today) solved with existing K8S primitives and patterns, with or without KNI, so I don't think KNI makes this materially more difficult. The KNI solution to (1) would be the same as CNI - don't allow privileged pods on your custom nodes. KNI might give you a bit more flexibility here by allowing things like admission webhooks for KNI config, while still allowing privileged pods, etc., that are not possible with the current out-of-K8S CNI config model. The KNI solution to (2) would be the same solution you would employ today to ship any node-required daemonset you didn't want to pull on every node provision (yep, like cilium/flannel/calico/whatever) -> preload the images on the node, or run them as system services and not privileged containers.

- Provide the ability to use native k8s resources for configuration, such as ConfigMaps, instead of configuration files in the host file system
- Provide an API to indicate network readiness for the node (no more files on disk)
- Eliminate the need to exec binaries, replacing it with gRPC
- Make troubleshooting easier by having logs accessible via `kubectl logs`
- Improve pod network startup time
- Provide the ability to prevent additional scheduling of pods if IPAM is out of IP addresses, without evicting running pods

> **Comment:** I think we need to rephrase this one to be providing the API around the IPAM state so that we don't scope creep.

> **Comment:** TODO: Add goal of having the pod object available at network runtime.
>
> **Comment:** @dougbtv I am drafting an update, so I might be able to get this. Do you have specific items you want off the Pod spec? Metadata (name, namespace, labels, annotations, ...)
>
> **Comment:** Metadata nails it, thanks. At least, I'm most interested in getting all you listed. Potentially someone might want more?
>
> **Comment:** I'd like to request CIDRs as an available piece of metadata if possible. That would be great for legacy applications (e.g., Ceph) that use CIDR configurations as config values.
>
> **Comment:** @BlaineEXE this sounds reasonable. I notice you are in Colorado; I am in the Boulder area. Does the application need the pod CIDR? You might be able to infer this via the Pod IP.
>
> **Comment:** that is exactly what we need, these experiences, and this one is something I've identified in multiple places: attach netdevices to pods, so I feel this is a strong use case ... what I also see is that these interfaces are used as "external" networks that are only relevant to the app running on the specific pod, so I don't feel that the IPs from these interfaces should be represented in the kubernetes topology ...
>
> **Comment:** @BlaineEXE I'm still trying to fully understand your use case; based on your comments it seems you need to have some prior work of setting up the infrastructure and the VLANs.
>
> **Comment:** Certainly someone must configure the additional hardware. In practice, an admin must add a separate switch (or create a VLAN on an existing switch) that connects to a different interface on the host systems. So if ... Our current deployment strategy leverages Multus and NetworkAttachmentDefinitions to connect storage (Ceph) pods to ... This does work, but because Multus is such a complex feature to understand, users often seem lost trying to configure an already-complex storage system with NADs. In addition, there is developer complexity, and there are friction points -- like not being able to get a Service with a static IP on a Multus network.
>
> **Comment:** what do you mean by a Multus network? some entity that is connected to the additional interface? i.e., if eth1 is connected to an external VLAN, some compute or host on that network?
>
> **Comment:** @BlaineEXE @dougbtv I want to understand this use case well; I don't quite get what the expectations are here and what problem we are trying to solve. Kubernetes definitively cannot manage the infra, connecting switches or creating VLANs; there are other projects that cover that area ...
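
To make the metadata discussion above concrete, the following Go sketch shows one hypothetical shape for the data an attach request could carry (name, namespace, labels, annotations, and the requested CIDRs). The struct and field names are assumptions for illustration only; the KEP does not define them.

```go
package kni

// PodMetadata lists the Pod fields the thread above suggests the network
// runtime would want at attach time. Purely illustrative; the KEP does not
// define such a type.
type PodMetadata struct {
	Name        string
	Namespace   string
	Labels      map[string]string
	Annotations map[string]string
}

// AttachNetworkRequest is a hypothetical request that carries that metadata
// along with the networks to attach. PodCIDRs is included only to mirror the
// reviewer's ask; whether CIDRs belong here is an open question in the thread.
type AttachNetworkRequest struct {
	Pod      PodMetadata
	Networks []string // names of the Network resources to attach
	PodCIDRs []string // optional, e.g. for legacy apps that consume CIDR config values
}
```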

### Non-Goals

1. Any changes to the kube-scheduler
2. Any specific implementation other than the reference implementation. However, we should ensure the KNI-API is flexible enough to support other implementations.

## Proposal

The proposal of this KEP is to design and implement the KNI-API and make the necessary changes to the CRI-API and container runtimes. The scope should be kept to a minimum, and we should target feature parity with the current CNI approach.

### User Stories

We are constantly adding these user stories; please join the community sync to discuss.

> **Comment:** Where?

#### Story 1

As a cluster operator, I need the ability to determine that my network(s) are ready so that my pods come up with a working network.

> **Comment:** this has to have more details about what "network is ready" means; there is no such thing as a "global network state", the whole point of IP networks is to be distributed. "Network is ready" today means I can provide a netdevice to the Pod (veth) and assign an IP address.
>
> **Comment:** Currently 'network is ready' is a CNI network configuration in /etc/cni/net.d. At a minimum, we should clearly define what 'network is ready' means. I can ask several people about this and get various answers depending on their environment. We should allow the user to implement this assuming it meets the criteria of K8s.
>
> **Comment:** My view is that "network ready" today is expected to mean "I can create a Pod that will be able to communicate with all the other Pods in the cluster", and this is a paradox https://github.com/kubernetes/enhancements/pull/4477/files#r1489293806; however, it is implemented today as "there is a CNI config file that we are expecting to do what is right when we create a pod", and this used to mean "I can create a Pod and it will be able to communicate within the node". @danwinship @thockin and @squeed on these philosophical questions :)

#### Story 2

As a cluster operator, I need the ability to determine what networks are available on my node so that upstream components can ensure the pod is scheduled on the appropriate node.

> **Comment:** I don't understand this user story; this is a scheduling problem that is already solved today: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
>
> **Comment:** I have been in debate about whether or not to focus more on interfaces and not networks, taking a more CNI approach. However, if one were to implement multiple networks, most likely they would do it with different interfaces. Thanks for pointing out the topology spread constraints. In the end we don't want to be involved with scheduling; however, we want to provide the API to give other efforts what is currently on this node.

#### Story 3

As a Kubernetes developer, I need the ability to have extension points for pod network setup, teardown, and update so that I can support future Kubernetes networking features while either reducing changes to core Kubernetes or eliminating them.

> **Comment:** So, I'm reading this as not just setup and teardown, but within a pod's lifecycle. Maybe we can expand this to say that it's not just "update" but during its lifecycle; can that be more explicit about taking actions while the pod is running? I think of this as a current limitation of only having execution points on pod creation or deletion. Also this could be split into its own thing, because there's kind of a double thing here with both extension points and reducing or eliminating changes to core.
>
> **Comment:** "Kubernetes developer that wants to do network things" is not a user story; you need to define the user story that requires those advanced networking features, and then we figure out what is the best solution.
>
> **Comment:** I have a use-case for consideration here that I can break down into 2 parts.
>
> Part 1: Currently the Rook project uses Multus to allow storage pods to attach to host network devices without exposing pods on the host network namespace. The goal is to be able to keep as much network isolation for security as possible while also being able to give the Ceph storage platform access to host-local network speeds. It's not clear to me whether the high-level use case I am describing -- attaching pods to specific host devices -- has a place in KNI. I certainly hope so. (As a note here, Rook prefers to use macvlan to get subdevices of a host device rather than exposing the host device itself, but I can also imagine systems that want to just expose the host device itself.)
>
> Part 2: Assuming part 1 fits into KNI's purpose, the next part relates to IP assignments. A big hangup the Rook project has with Multus is that we can't easily get static IP assignments on Multus networks. For the Pod network, it is easy to get a static IP via a k8s Service, but we don't have the same ease when doing this for dedicated NICs (*). I would propose that KNI consider whether it can create API flexibility to allow Service functionality on any network, including host-device-attached networks.
>
> (*): We are aware that MultusService exists, but it isn't at a stable enough place for us to require it as an add-on for users. Additionally, it requires Rook to ...
>
> **Comment:** @MikeZappa87 this comment is proving my point on the user story I added as an example here #4477 (comment)
>
> **Comment:** I think we need to rethink this user story.
>
> **Comment:** if the pod is connected through an external NIC to an external network, it is the external network's responsibility to assign IPs; are we trying to say that kubernetes should manage these external networks' IPAM? what stops an admin on that network from assigning a static IP?

#### Story 4

As a tool which manages eBPF programs on a Kubernetes cluster (bpfman,
inspektorgadget), I would like to be able to see the network interfaces of a
`Pod` via the Kubernetes API so that I can attach TC/XDP network programs to
those interfaces based on knowing the Pod name.

> **Comment:** I think we should keep the persona to an actual user?
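
As a purely hypothetical illustration of what this story implies, the sketch below shows the kind of per-pod interface information that could be surfaced through the Kubernetes API for such tools. The type and field names are assumptions, not a proposed API.

```go
package kni

// PodNetworkInterface is the kind of per-interface detail Story 4 implies a
// tool like bpfman or Inspektor Gadget would need in order to attach TC/XDP
// programs by pod name. All field names are illustrative assumptions.
type PodNetworkInterface struct {
	Name       string   // interface name inside the pod network namespace, e.g. "eth0"
	Network    string   // the network this interface belongs to
	MACAddress string   // hardware address, useful for matching devices
	IPs        []string // addresses assigned to the interface
}

// PodNetworkInterfaces associates a pod with its interfaces so that a
// controller could look them up via the Kubernetes API.
type PodNetworkInterfaces struct {
	PodName    string
	Namespace  string
	Interfaces []PodNetworkInterface
}
```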

### Notes/Constraints/Caveats

Additional Information/Diagrams: https://docs.google.com/document/d/1Gz7iNtJNMI-zKJhaOcI3aflPCx3etJ01JMxzbtvruKk/edit?usp=sharing

Changes to the pod specification will require hard evidence.

The specifics of "Network Readiness" are an implementation detail. We need to provide this RPC to the user.

We should consider the trade-offs of using a native K8s Network object versus CRDs.
Using a native object would allow passing a slice of Network types to AttachNetwork.

Since the network runtime can run separately from the container runtime, you can package all the dependencies into a pod and avoid having binaries on disk. This allows the CNI plugins to be isolated in the pod, and the pod never needs to mount /opt/cni/bin or /etc/cni/net.d. This potentially offers more control over execution. Keep in mind that CNI is the implementation; when it is used, chaining is still available.

@@ -0,0 +1,48 @@

title: k8s-network-interface
kep-number: 4410
authors:
  - "@mikezappa87"
  - "@shaneutt"
owning-sig: sig-network
participating-sigs:
  - sig-network

> **Comment:** Surely at least SIG Node should be participating (there's no way this doesn't affect kubelet, CRI)..? I would also tag Cluster Lifecycle at least as FYI / advisory since cluster lifecycle folks will know about and have suggestions re: node readiness and cluster configuration.

status: provisional
creation-date: 2024-01-11
reviewers:
  - "@aojea"
  - "@danwinship"
  - "@thockin"
approvers:

see-also:
  - "/keps/sig-aaa/1234-we-heard-you-like-keps"
  - "/keps/sig-bbb/2345-everyone-gets-a-kep"
replaces:
  - "/keps/sig-ccc/3456-replaced-kep"

> **Comment:** delete these, or update with relevant keps? same below for `replaces`.

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.30"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.31"
  beta: "v1.32"
  stable: "v1.33"

> **Comment:** This seems unlikely, without even defining an API yet?

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: kni
    components:
      - kubelet
      - cri-api
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - my_feature_metric

> **Comment:** I don't think it's "better" known as anything, given that it mostly does not exist yet. The Summary and Motivation should not assume that the reader is already familiar with the idea.
>
> **Comment:** It is funny you mention this because reImagined is how it got recirculated back to me and I just ran with it. However, 100% the summary/motivation should be written in a way that a nontechnical reader can understand. However, that is a high bar to hit, as the current state is very difficult, and through the years I have found that only a small number of people can actually articulate the current state accurately.
>
> **Comment:** I'm sure we can call the project "Kubernetes Networking Interface" and then colloquially we can refer to the effort as "reImagined" in less formal settings; recommend: ...
>
> **Comment:** This is my first time trying to get familiar with the KNI project, so I started reading this. Apologies if this sounds like a total newbie question; perhaps I must read other material first before coming to this KEP, and if so please point me the right way. Does "network status and availability" mean the cluster networking status? node networking status? pod networking status? The second part, "how to attach a pod to a network, detach the pod from the network and update a pod's network", refers to pod aspects => this is like providing what the CNI spec does but through an API, right?