Replace KubeProxy Design Draft (Linux Only) #1931
Comments
@hongliangl could you clarify what happens for Host ClusterIP access when the endpoints for the Service are not in the Pod Network (they can be host IPs, e.g. for the |
Some update about open questions from community meetings:
And if the flows are not ready within 15 seconds (even though the timeout is 30 seconds, the next backoff interval falls beyond the window), the call returns an error and the program exits and restarts:
|
The implementation for host IPs is a little complicated, and I'll explain it later in detail. |
Again, I feel we should come up with a single approach to redirect matched host traffic to OVS. Later we will need that for Node security policies too (though it may differ, in that most Node traffic should be directed to OVS). |
For an endpoint in a remote host network, e.g.,
The traffic is finally sent to the target endpoint. Now we handle the return traffic.
|
The key to implementing NodePort is how to "redirect" NodePort traffic into OVS. NodePort traffic can come from localhost or from remote hosts.
Traffic From Remote Hosts
Environment
This is the test environment. backend can be seen as an integration of OVS and Pods, and node is a VM. Note that this differs from the real Antrea environment, so some changes will be needed when integrating this design with Antrea.
Design
The destination of NodePort traffic may be any of several IP addresses if the Kubernetes Node has multiple network adaptors. Here we take eth0 as an example, and assume that the source IP address (src for short) arriving at eth0 is
In general,
This is a detailed example:
Hash Table
From the detailed example above, we can see that the key to redirecting NodePort traffic is the TC filter. By default, each network adaptor's ingress and egress has a default hash table with only one hash bucket. Items can be appended to each hash bucket as a list. We can also add another hash table to the list; however, a hash table can be created with at most 256 buckets. The hashkey should be set when attaching a filter to the hash table. For example,
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 handle 1: u32 divisor 256
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 u32 link 1: hashkey mask 0x000000ff at 20 match ip dst 192.168.2.211
I have two designs for filtering NodePort traffic.
Hash Design 1: Hash Table + List
From the picture we can see that:
Hash Design 2: Nested Hash Table
From the picture we can see that:
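Returning to the hash key in the tc commands above: assuming a 20-byte IPv4 header without options, the 32-bit word at offset 20 holds the TCP/UDP source and destination ports, so `mask 0x000000ff` keeps only the low byte of the destination port. A minimal sketch of that bucket computation follows; the function name is hypothetical and this is an illustration, not part of the design.

```shell
# Bucket selected by "hashkey mask 0x000000ff at 20" with divisor 256,
# assuming no IPv4 options: the low 8 bits of the destination port.
bucket_for_port() {
  printf '%d\n' $(( $1 & 0xff ))
}

bucket_for_port 30000   # 30000 = 0x7530, low byte 0x30, prints 48
bucket_for_port 30256   # differs from 30000 by 256, so it shares bucket 48
```

With the default NodePort range of 30000-32767 (2768 ports), a 256-bucket table would hold roughly 11 ports per bucket if every port were in use.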
Traffic From Localhost
Environment
Design
The design for handling NodePort traffic from the localhost differs from the one for remote hosts. Here I need two examples.
For Traffic to 127.0.0.1:30000
In general,
This is a detailed example:
For Traffic to 192.168.2.211:30000
Evaluation
Environment
The netperf server and client are two VMs.
All values are the average of 140 out of 160 results (the highest 10 and lowest 10 results are excluded).
TCP-STREAM
TCP-RR
TCP-CRR
|
Hello, @antoninbas @jianjuns @tnqn, I have posted the new design and test results. |
Hi @hongliangl , I do not understand the following:
What is gw1 and what is gw3? |
@hongliangl should this issue be closed? |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days |
Replace KubeProxy Design Draft (Linux Only)
original google doc
Why We Should Do This
Since we will soon provide NodePort Service support in AntreaProxy, we will then be able to remove KubeProxy entirely. This will save CPU cycles and memory, and give us more control over Service traffic.
Items to be Resolved
We still need the following abilities to do the replacement.
Direct APIServer Access
KubeProxy may not exist when we install Antrea in the future, so Services will not work during the installation. Meanwhile, Antrea needs to connect to the Kubernetes Service to watch resources. To overcome this issue, both AntreaAgent and AntreaController need to be able to connect to the APIServer Pod/Proxy directly.
Node ClusterIP Service Support
The ability to serve ClusterIP Service on the Node. AntreaProxy does not have this ability yet.
[Figure: Current vs. Expected]
LoadBalancer Service Support
Service Health Check Support
Bootstrap Order Re-arrange
Since AntreaProxy will be fully in charge of serving Services, Services will not be available until the first sync of AntreaProxy. Thus, we need to make sure that all sub-components that rely on Services wait until AntreaProxy is ready.
Detail Design
Direct APIServer Access
Kubernetes Client in Both AntreaAgent and AntreaController
In the current implementation, we use the in-cluster kubeconfig to set up the Kubernetes client connection. In the replacement solution, we need to change the server address from the Service address to the IP address of the Kubernetes APIServer. PR #1735 supports this in the Agent. We need a similar PR for the Controller.
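As a rough illustration of the address change described above: Kubernetes injects `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` into Pods, and the in-cluster configuration builds the server address from them. The sketch below prefers a direct address when one is configured. `KUBE_API_SERVER_OVERRIDE` is a hypothetical variable name used only for this example, and the IP addresses are made-up sample values; it does not describe how PR #1735 is implemented.

```shell
# Return the API server URL a client would use.
# KUBE_API_SERVER_OVERRIDE is a hypothetical name for this sketch;
# KUBERNETES_SERVICE_HOST/PORT are injected into Pods by Kubernetes.
api_server_url() {
  if [ -n "${KUBE_API_SERVER_OVERRIDE:-}" ]; then
    printf '%s\n' "$KUBE_API_SERVER_OVERRIDE"
  else
    printf 'https://%s:%s\n' "$KUBERNETES_SERVICE_HOST" "$KUBERNETES_SERVICE_PORT"
  fi
}

# Sample values only.
KUBERNETES_SERVICE_HOST=10.96.0.1
KUBERNETES_SERVICE_PORT=443
api_server_url    # Service address, which depends on a working proxy

KUBE_API_SERVER_OVERRIDE=https://192.168.77.100:6443
api_server_url    # direct address, usable before AntreaProxy has synced
```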
Antrea Client for Watching Internal NetworkPolicies
The Antrea client in AntreaAgent uses the Antrea Service to watch internal NetworkPolicies. We will initialize the Antrea client only after the first sync of AntreaProxy.
APIServer in Both AntreaAgent and AntreaController
The APIServers in Antrea components need to connect to the Kubernetes APIServer to retrieve the Secret. The Secret is retrieved only once, and no further retrieval happens after the APIServer is up. Since the retrieval is hidden inside the Kubernetes library, we need to rewrite the function in Antrea and override the address of the Kubernetes Service.
Startup Pressure
Once we override the Kubernetes Service, we lose its load balancing (LB). This may put heavy pressure on the specified Kubernetes APIServer, since the watchers will then keep using it to watch resources.
To overcome this issue, we can give AntreaProxy the ability to notify other components once it has synced. Once the Service is ready, we can switch to the Service to take advantage of the LB. Meanwhile, we should also have a flag to control whether to make the switch, since users may use a custom out-of-cluster LB to serve the Kubernetes APIServer.
Another way to reduce the pressure is to accept multiple override Endpoints; each Antrea component can then randomly pick one Endpoint from the list to connect to.
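The random selection among multiple override Endpoints could look like the sketch below. It assumes bash (for `$RANDOM`); the function name and the endpoint addresses are hypothetical.

```shell
# Pick one endpoint at random from the arguments (bash: uses $RANDOM).
pick_endpoint() {
  [ "$#" -gt 0 ] || return 1      # nothing to pick from
  shift $(( RANDOM % $# ))        # drop a random number of leading entries
  printf '%s\n' "$1"
}

# Hypothetical override list; each Antrea component would pick independently,
# which spreads the watch load across the APIServer instances.
pick_endpoint 192.168.77.100:6443 192.168.77.101:6443 192.168.77.102:6443
```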
Host ClusterIP Access Support
We use IPTables and IP rules to implement this, and one of the design goals is to keep runtime updates on the host to a minimum.
Like the NodePort implementation, we use an IPSet to match the ClusterIP address, protocol, and port.
Name: ANTREA-SVC-CLUSTER-SERVICE
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 600
References: 2
Number of entries: 8
Members:
10.96.0.10,udp:53
10.107.33.12,tcp:8080
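A set like the one listed above could be created and filled with commands along these lines. This is a sketch, not the agent's actual code: the helper function names are hypothetical, and the commands are only printed unless `APPLY=1` is set (running them for real requires root).

```shell
# Dry-run helper: prints each command; set APPLY=1 (as root) to execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

SET=ANTREA-SVC-CLUSTER-SERVICE

# Parameters mirror the Type and Header lines of the listing.
create_cluster_service_set() {
  run ipset create "$SET" hash:ip,port family inet hashsize 1024 maxelem 65536
}

# Add one "ClusterIP,protocol:port" entry, e.g. 10.96.0.10,udp:53.
add_cluster_service() {
  run ipset add "$SET" "$1,$2:$3"
}

create_cluster_service_set
add_cluster_service 10.96.0.10 udp 53
add_cluster_service 10.107.33.12 tcp 8080
```

Only set membership changes when Services come and go, which fits the goal of keeping runtime updates on the host small.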
In the mangle table of iptables, we create a custom chain that marks packets matching the IPSet with mark 0xf2. The chain should be referenced from both the OUTPUT and PREROUTING chains.
In the nat table, we match packets with the 0xf2 mark and masquerade them.
For packets with the 0xf2 mark, we add an IP rule that sends them to the antrea route table.
In the antrea route table, we have only one onlink default route, which forwards all packets to antrea-gw0.
We also need to add a flow to the pipeline to respond to ARP queries for the virtual IP address.
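The host-side steps above could be set up roughly as follows. The chain name, routing table ID, and virtual next-hop IP are hypothetical placeholders. Using the POSTROUTING chain is also an assumption: the text only names the nat table, but the MASQUERADE target is only valid there. Commands are printed unless `APPLY=1` is set (root required).

```shell
# Dry-run helper: prints each command; set APPLY=1 (as root) to execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

SET=ANTREA-SVC-CLUSTER-SERVICE
MARK=0xf2
CHAIN=ANTREA-SVC-MARK     # hypothetical chain name
TABLE=300                 # hypothetical ID for the "antrea" route table
VIP=169.254.169.110       # hypothetical virtual IP whose ARP is answered by OVS

setup_host_clusterip() {
  # mangle: mark packets whose destination IP and port are in the IPSet.
  run iptables -t mangle -N "$CHAIN"
  run iptables -t mangle -A "$CHAIN" -m set --match-set "$SET" dst,dst -j MARK --set-mark "$MARK"
  run iptables -t mangle -I OUTPUT -j "$CHAIN"
  run iptables -t mangle -I PREROUTING -j "$CHAIN"
  # nat: masquerade the marked packets.
  run iptables -t nat -I POSTROUTING -m mark --mark "$MARK" -j MASQUERADE
  # Policy routing: marked packets use the dedicated table, whose only
  # entry is an onlink default route through antrea-gw0.
  run ip rule add fwmark "$MARK" table "$TABLE"
  run ip route add default via "$VIP" dev antrea-gw0 onlink table "$TABLE"
}

setup_host_clusterip
```

None of these rules mention an individual Service, so they only need to be installed once; later Service changes touch the IPSet alone.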
Bootstrap Order Re-arrange
The bootstrap order of sub-components in AntreaController relies only on the Kubernetes APIServer, so we don't need to change it, since we will use a direct connection instead of the Service. We should take care of the bootstrap order of AntreaAgent, because sub-components in AntreaAgent rely on the antrea Service to set up the antrea client.
The dependency map of sub-components in AntreaAgent looks like the diagram below.
According to the dependency map, the bootstrap order could be:
Installation
We can still install Antrea with a single YAML file, but users need to specify a Kubernetes APIServer Pod address in the config. Since Antrea inserts its rules into IPTables instead of appending them, KubeProxy rules will no longer be hit once the Antrea components start.
Upgrade
Since AntreaProxy will install its IPTables rules before KubeProxy's, users can just delete the KubeProxy deployment after the installation. There will be short connection breaks during the switch.
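For example, on a kubeadm-style cluster the removal step might look like the sketch below. It assumes KubeProxy runs as the `kube-proxy` DaemonSet in the `kube-system` namespace, which may not hold for other distributions; the command is only printed unless `APPLY=1` is set.

```shell
# Dry-run helper: prints each command; set APPLY=1 to execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Assumption: kubeadm layout, where KubeProxy is the "kube-proxy"
# DaemonSet in the kube-system namespace.
remove_kube_proxy() {
  run kubectl -n kube-system delete daemonset kube-proxy
}

remove_kube_proxy
```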
Antctl as an installation wizard
We can also add install and update commands to Antctl to simplify installation and upgrade.
Instead of modifying a YAML file, specifying only the APIServer Endpoint through Antctl should be easier.
Antctl can take responsibility for cleanup work, e.g., removing KubeProxy rules, removing legacy rules, or updating the API endpoints of Antrea.
KubeProxy Options Compatible
Consider Supporting in future
Supported in Alternative Options
Implemented
Open Items