
Too many BGP routing entries and neighbors between kube-router server and connected network devices #923

Closed
cloudnativer opened this issue Jun 4, 2020 · 20 comments

Comments

@cloudnativer
Contributor

cloudnativer commented Jun 4, 2020

Using kube-router in a large-scale Kubernetes cluster leads, by default, to too many BGP neighbors and BGP routing entries on both the kube-router nodes and the connected network devices, which seriously affects the network performance of the cluster. Is there a good way to reduce the routing entries on both sides and the resulting performance loss, so that larger cluster networks can be supported?

[image: large-networks03]

@cloudnativer
Contributor Author

You can try the following two methods (a sketch of the corresponding kube-router args follows this list):
(1) Set the parameter "--enable-ibgp=false" so that Kubernetes nodes do not establish BGP neighbors directly with each other; let each node peer only with its uplink router device.
(2) Enable the BGP ECMP (multipath) function on the uplink router that the Kubernetes nodes connect to. With this in place, when user traffic enters the router it is first balanced across the backend Kubernetes nodes via ECMP, and then forwarded to the final pod via IPVS load balancing. When a device, link, or node in the network goes down, traffic is automatically switched to the remaining healthy devices, links, and nodes.
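
For reference, here is a minimal sketch of the kube-router container args for this mode. The flags mirror the ones used in the tests later in this thread; the ASN and peer router IP are illustrative placeholders, and the ECMP/multipath setting from point (2) is configured on the uplink router itself, not in kube-router:

     args:
     - --run-router=true
     - --run-service-proxy=true
     - --run-firewall=true
     - --enable-ibgp=false               # no iBGP full mesh between nodes
     - --advertise-pod-cidr=true         # each node advertises only its own pod CIDR
     - --cluster-asn=64558               # illustrative node ASN
     - --peer-router-ips=192.168.140.1   # illustrative uplink router address
     - --peer-router-asns=64558          # illustrative uplink router ASN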

@cloudnativer
Contributor Author

After our test, we found that the number of BGP neighbors and routes between kube-router and the uplink switch was indeed reduced significantly.
Before: [image: large-networks03]
After: [image: large-networks04]

@cloudnativer
Contributor Author

cloudnativer commented Jun 4, 2020


Following this practice, some of the problems were solved. But in a large-scale Kubernetes cluster, once we enable ECMP route load balancing, the BGP routing table on the switch changes a lot: there are tens of thousands of Kubernetes service routes on each switch.
[image: large-networks09]
However, our switch hardware only supports 200,000 forwarding routes. As the Kubernetes cluster grows, more and more routes accumulate on the switch, which will eventually exhaust the switch's capacity and stop it from working properly.

@cloudnativer
Contributor Author


We modified part of the kube-router source code and added parameters such as "--advertise-cluster-subnet" to solve this problem.

@cloudnativer
Contributor Author

Each Kubernetes cluster in our production environment has 4,000 nodes, and the whole network is interconnected via BGP; it has been running stably for more than a year. kube-router has quite a few problems in large Kubernetes clusters, and we have done a lot of optimization, so I want to contribute some of this back to the community. I have contributed an enhancement for large Kubernetes cluster networks to kube-router, as well as several practical documents about large cluster networks.
Please see #920.

@cloudnativer cloudnativer changed the title The problem of too many routing entries between the kube-router server and the connected network device Too many BGP neighbors and routes between kube-router server and connected network devices Jun 4, 2020
@cloudnativer cloudnativer changed the title Too many BGP neighbors and routes between kube-router server and connected network devices Too many BGP routing entries and neighbors between kube-router server and connected network devices Jun 4, 2020
@rearden-steel

I think your changes are reasonable; we have the same network topology and will suffer from the same problem.

@murali-reddy
Member

murali-reddy commented Jun 5, 2020

Just to clarify, there is nothing implicit in the kube-router design that makes one run into these challenges with routing the pod network CIDR. Users have to carefully choose the knobs provided by kube-router that suit them. You could use iBGP, or peer with just external routers, or use route reflectors, etc. These are standard BGP configurations that network engineers deal with. In this example (#923 (comment)), these are the types of choices (e.g. --enable-ibgp=false) one has to make at the network design stage.

> But in the case of a large-scale Kubernetes cluster, when we enable ECMP routing load balancing, BGP routing on the switch changes a lot. There are tens of thousands of Kubernetes service routes on each switch.

Again, I would not design a large-scale network where the VIPs of all services are advertised. You should use the kube-router.io/service.advertise.clusterip annotation and set --advertise-cluster-ip=false to choose which service cluster IPs are advertised. Not all services need to receive north-south traffic; only the services that are expected to receive north-south traffic should use this annotation. Yes, if you set --advertise-cluster-ip=true, all service cluster IPs are advertised, which is not desirable for large deployments.
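
For example, a service that is expected to receive north-south traffic could opt in individually while --advertise-cluster-ip stays false globally. This is a minimal sketch, assuming the annotation takes a boolean value; the service name is hypothetical:

      apiVersion: v1
      kind: Service
      metadata:
        name: my-frontend                          # hypothetical service name
        annotations:
          # assumed boolean opt-in: advertise only this service's ClusterIP
          kube-router.io/service.advertise.clusterip: "true"
      spec:
        selector:
          app: my-frontend
        ports:
        - port: 80
          targetPort: 8080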

A prescribed operations guide for designing a network topology with kube-router would be good. Hopefully the documentation in #920 will evolve in this direction.

@cloudnativer
Contributor Author

cloudnativer commented Jun 5, 2020

> I think your changes are reasonable; we have the same network topology and will suffer from the same problem.

Yes, when I talk with R&D engineers from other companies, I find that they have the same problem. As the Kubernetes cluster network grows, the problem becomes more serious.

@cloudnativer
Contributor Author

> You should use the kube-router.io/service.advertise.clusterip annotation and set --advertise-cluster-ip=false to choose which service cluster IPs are advertised.


If we set "--advertise-cluster-ip=false", our Kubernetes services can no longer be reached from outside the cluster.

However, in a large-scale Kubernetes cluster network we have the following requirements at the same time:
(1) We need to advertise the Kubernetes service addresses to the outside so that services can be accessed directly;
(2) ECMP load balancing is enabled at the same time, to improve the availability of the north-south network links;
(3) We also need to reduce the number of BGP neighbors and the number of routing entries on the connected network devices.

We therefore set the "--enable-ibgp=false", "--advertise-cluster-ip=true" and "--advertise-cluster-subnet=" parameters at the same time (see the excerpt below). Please see the solution documentation: https://github.com/cloudnativer/kube-router-cnlabs/blob/advertise-cluster-subnet/docs/large-networks01.md

The related YAML file can be found at https://github.com/cloudnativer/kube-router-cnlabs/blob/advertise-cluster-subnet/daemonset/kube-router-daemonset-advertise-cluster-subnet.yaml
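
For clarity, a minimal excerpt of the relevant args (a sketch based on the fork's documentation linked above; the service CIDR matches the test setup later in this thread):

     args:
     - --enable-ibgp=false                        # peer only with the uplink router
     - --advertise-cluster-ip=true                # advertise service cluster IPs
     - --advertise-cluster-subnet=172.30.0.0/16   # fork-specific flag: advertise one aggregated service route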


@cloudnativer
Contributor Author

> A prescribed operations guide for designing a network topology with kube-router would be good. Hopefully the documentation in #920 will evolve in this direction.

Let me add that I will further improve the documentation along these lines in the near future.

@murali-reddy
Member

> If we set "--advertise-cluster-ip=false", our Kubernetes services can no longer be reached from outside the cluster.

@cloudnativer Have you tried kube-router.io/service.advertise.clusterip?

@cloudnativer
Contributor Author

cloudnativer commented Jun 5, 2020

> If we set "--advertise-cluster-ip=false", our Kubernetes services can no longer be reached from outside the cluster.
>
> @cloudnativer Have you tried kube-router.io/service.advertise.clusterip?


[Requirements and test setup]

Suppose we have a Kubernetes service CIDR of 172.30.0.0/16, with 100 running services in the cluster.
The node under test has the pod CIDR 172.32.0.128/25, with 20 running pods.
We need to advertise both the service CIDR and the pod CIDR to the connected network device, so that services and pods can be accessed directly from outside.
We ran the following tests based on your suggestion.


[Test 1]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16
  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=false
     - --advertise-pod-cidr=true
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [not learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [learned]

[Test 2]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16
  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=false
     - --advertise-pod-cidr=false
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [not learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [not learned]

[Test 3]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16
  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=true
     - --advertise-pod-cidr=false
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [not learned]

[Test 4]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16

  3. Args are set to:

    args:
    - --run-router=true
    - --run-firewall=true
    - --run-service-proxy=true
    - --enable-overlay=false
    - --enable-pod-egress=false
    - --advertise-cluster-ip=true
    - --advertise-pod-cidr=true
    - --masquerade-all=false
    - --bgp-graceful-restart=true
    - --enable-ibgp=false
    - --nodes-full-mesh=true
    - --cluster-asn=64558
    - --peer-router-ips=192.168.140.1
    - --peer-router-asns=64558
    - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [learned]

[Test 5]

  1. kube-router image version:

     image: My branch version (https://github.com/cloudnativer/kube-router-cnlabs/tree/advertise-cluster-subnet)
    
  2. Annotations are not set.

  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=true
     - --advertise-cluster-subnet=172.30.0.0/16
     - --advertise-pod-cidr=true
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [not learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [learned]

@murali-reddy

Attached is my YAML template file used for testing:

test.yaml.txt

I did not manage to achieve the effect you described using "kube-router.io/service.advertise.clusterip". Did I test it incorrectly, or is "kube-router.io/service.advertise.clusterip" simply unable to meet the requirements above?
With the "advertise-cluster-subnet" parameter, however, we were able to meet those requirements.

@cloudnativer
Contributor Author

Please note that I have changed "advertise-cluster-subnet" to "advertise-service-cluster-ip-range" to keep the parameter name consistent with kube-apiserver, kubeadm, etc. (see the sketch below).
Please see #920.
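
Assuming the rest of the configuration from Test 5 stays the same, the renamed flag would be used like this (a sketch; see #920 for the authoritative documentation):

     args:
     - --advertise-cluster-ip=true
     - --advertise-service-cluster-ip-range=172.30.0.0/16   # renamed from --advertise-cluster-subnet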

@murali-reddy
Member

@cloudnativer Apologies for the delay in getting back to you. I am focused on getting the 1.0 release out, hence the delay. I will leave comments in the PR.

@cloudnativer
Contributor Author

> @cloudnativer Apologies for the delay in getting back to you. I am focused on getting the 1.0 release out, hence the delay. I will leave comments in the PR.

OK.

@murali-reddy
Member

murali-reddy commented Jun 16, 2020

Adding some context to the problem. Kube-router's implementation of a network load balancer is based on Ananta and Maglev. In both models there is a set of dedicated load-balancer nodes (Mux in Ananta, Maglev in Maglev) which are BGP speakers and advertise the service VIPs. In the case of Kubernetes, each node is a load balancer/service proxy as well, so essentially every node in the cluster is part of a distributed load balancer. So if each of them is a BGP speaker, advertising /32 routes for the service VIPs can bloat the routing table as described above.

But perhaps this is something that can be addressed at the leaf routers by advertising the service IP range. Nevertheless, it is good to weigh the pros and cons and prescribe when to use what.

@cloudnativer
Contributor Author

cloudnativer commented Jun 18, 2020

> So if each of them is a BGP speaker, advertising /32 routes for the service VIPs can bloat the routing table as described above.

Yes, I agree with that.

> But perhaps this is something that can be addressed at the leaf routers by advertising the service IP range. Nevertheless, it is good to weigh the pros and cons and prescribe when to use what.

Yes, we can advertise the service IP range on the leaf routers to reduce the number of routes on the spine routers. But in a large-scale Kubernetes cluster network, if every kube-router advertises /32 host routes, the number of routes on the leaf routers themselves will still multiply, and aggregating the service IP range only at the leaf routers does not solve that. Therefore, we need to be able to advertise the service IP range from kube-router on the node itself, which reduces the number of routes on both the leaf routers and the uplink routers.

@cloudnativer
Contributor Author

cloudnativer commented Jun 30, 2020

@github-actions

github-actions bot commented Sep 5, 2023

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Sep 5, 2023
@github-actions

This issue was closed because it has been stale for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 10, 2023