
Too many BGP routing entries and neighbors between kube-router server and connected network devices #923

Closed
cloudnativer opened this issue Jun 4, 2020 · 20 comments

Comments

@cloudnativer
Contributor

cloudnativer commented Jun 4, 2020

Using kube-router in a large-scale Kubernetes cluster leads, by default, to too many BGP neighbors and BGP routing entries on both the kube-router nodes and the connected network devices, which seriously affects the network performance of the cluster. Is there a good way to reduce the routing entries on both sides and the resulting performance loss, so that larger cluster networks can be supported?

[image: large-networks03]

@cloudnativer
Contributor Author

You can try the following two methods (a sketch of the corresponding kube-router args follows this list):
(1) Set the parameter "--enable-ibgp=false" so that Kubernetes nodes do not establish BGP neighbors directly with each other; let each node peer only with its uplink router device.
(2) Enable the BGP ECMP (multipath) function on the uplink router that the Kubernetes nodes connect to. With this in place, when user traffic enters the router it is first balanced across the backend Kubernetes nodes via ECMP, and then forwarded to the final pod via IPVS load balancing. When a device, link, or node in the network goes down, traffic is automatically switched to the remaining healthy devices, links, and nodes.
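
For reference, here is a minimal sketch of the kube-router container args for this mode. The flags mirror the ones used in the tests later in this thread; the ASN and peer router IP are illustrative placeholders, and the ECMP/multipath setting from point (2) is configured on the uplink router itself, not in kube-router:

     args:
     - --run-router=true
     - --run-service-proxy=true
     - --run-firewall=true
     - --enable-ibgp=false               # no iBGP full mesh between nodes
     - --advertise-pod-cidr=true         # each node advertises only its own pod CIDR
     - --cluster-asn=64558               # illustrative node ASN
     - --peer-router-ips=192.168.140.1   # illustrative uplink router address
     - --peer-router-asns=64558          # illustrative uplink router ASN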

@cloudnativer
Contributor Author

After our test, we found that the number of BGP neighbors and routes between kube-router and the uplink switch was indeed reduced significantly.
Before: [image: large-networks03]
After: [image: large-networks04]

@cloudnativer
Contributor Author

cloudnativer commented Jun 4, 2020


Following this practice, some of the problems were solved. But in a large-scale Kubernetes cluster, once we enable ECMP route load balancing, the BGP routing table on the switch changes a lot: there are tens of thousands of Kubernetes service routes on each switch.
[image: large-networks09]
However, our switch hardware only supports 200,000 forwarding routes. As the Kubernetes cluster grows, more and more routes accumulate on the switch, which will eventually exhaust the switch's capacity and stop it from working properly.

@cloudnativer
Contributor Author


We modified part of the kube-router source code and added parameters such as "--advertise-cluster-subnet" to solve this problem.

@cloudnativer
Contributor Author

Each Kubernetes cluster in our production environment has 4,000 nodes, and the whole network is interconnected via BGP; it has been running stably for more than a year. kube-router has quite a few problems in large Kubernetes clusters, and we have done a lot of optimization, so I want to contribute some of this back to the community. I have contributed an enhancement for large Kubernetes cluster networks to kube-router, as well as several practical documents about large cluster networks.
Please see #920.

@cloudnativer cloudnativer changed the title The problem of too many routing entries between the kube-router server and the connected network device Too many BGP neighbors and routes between kube-router server and connected network devices Jun 4, 2020
@cloudnativer cloudnativer changed the title Too many BGP neighbors and routes between kube-router server and connected network devices Too many BGP routing entries and neighbors between kube-router server and connected network devices Jun 4, 2020
@rearden-steel

I think your changes are reasonable; we have the same network topology and will suffer from the same problem.

@murali-reddy
Member

murali-reddy commented Jun 5, 2020

Just to clarify, there is nothing implicit in the kube-router design that makes one run into these challenges with routing the pod network CIDR. Users have to carefully choose the knobs provided by kube-router that suit them. You could use iBGP, or peer with just external routers, or use route reflectors, etc. These are standard BGP configurations that network engineers deal with. In this example (#923 (comment)), these are the types of choices (e.g. --enable-ibgp=false) one has to make at the network design stage.

> But in the case of a large-scale Kubernetes cluster, when we enable ECMP routing load balancing, BGP routing on the switch changes a lot. There are tens of thousands of Kubernetes service routes on each switch.

Again, I would not design a large-scale network where the VIPs of all services are advertised. You should use the kube-router.io/service.advertise.clusterip annotation and set --advertise-cluster-ip=false to choose which service cluster IPs are advertised. Not all services need to receive north-south traffic; only the services that are expected to receive north-south traffic should use this annotation. Yes, if you set --advertise-cluster-ip=true, all service cluster IPs are advertised, which is not desirable for large deployments.
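
For example, a service that is expected to receive north-south traffic could opt in individually while --advertise-cluster-ip stays false globally. This is a minimal sketch, assuming the annotation takes a boolean value; the service name is hypothetical:

      apiVersion: v1
      kind: Service
      metadata:
        name: my-frontend                          # hypothetical service name
        annotations:
          # assumed boolean opt-in: advertise only this service's ClusterIP
          kube-router.io/service.advertise.clusterip: "true"
      spec:
        selector:
          app: my-frontend
        ports:
        - port: 80
          targetPort: 8080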

A prescribed operations guide for designing a network topology with kube-router would be good. Hopefully the documentation in #920 will evolve in this direction.

@cloudnativer
Contributor Author

cloudnativer commented Jun 5, 2020

> I think your changes are reasonable; we have the same network topology and will suffer from the same problem.

Yes, when I talk with R&D engineers from other companies, I find that they have the same problem. As the Kubernetes cluster network grows, the problem becomes more serious.

@cloudnativer
Contributor Author

> You should use the kube-router.io/service.advertise.clusterip annotation and set --advertise-cluster-ip=false to choose which service cluster IPs are advertised.


If we set "--advertise-cluster-ip=false", our Kubernetes services can no longer be reached from outside the cluster.

However, in a large-scale Kubernetes cluster network we have the following requirements at the same time:
(1) We need to advertise the Kubernetes service addresses to the outside so that services can be accessed directly;
(2) ECMP load balancing is enabled at the same time, to improve the availability of the north-south network links;
(3) We also need to reduce the number of BGP neighbors and the number of routing entries on the connected network devices.

We therefore set the "--enable-ibgp=false", "--advertise-cluster-ip=true" and "--advertise-cluster-subnet=" parameters at the same time (see the excerpt below). Please see the solution documentation: https://github.com/cloudnativer/kube-router-cnlabs/blob/advertise-cluster-subnet/docs/large-networks01.md

The related YAML file can be found at https://github.com/cloudnativer/kube-router-cnlabs/blob/advertise-cluster-subnet/daemonset/kube-router-daemonset-advertise-cluster-subnet.yaml
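
For clarity, a minimal excerpt of the relevant args (a sketch based on the fork's documentation linked above; the service CIDR matches the test setup later in this thread):

     args:
     - --enable-ibgp=false                        # peer only with the uplink router
     - --advertise-cluster-ip=true                # advertise service cluster IPs
     - --advertise-cluster-subnet=172.30.0.0/16   # fork-specific flag: advertise one aggregated service route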


@cloudnativer
Contributor Author

> A prescribed operations guide for designing a network topology with kube-router would be good. Hopefully the documentation in #920 will evolve in this direction.

Let me add that I will further improve the documentation along these lines in the near future.

@murali-reddy
Member

> If we set "--advertise-cluster-ip=false", our Kubernetes services can no longer be reached from outside the cluster.

@cloudnativer Have you tried kube-router.io/service.advertise.clusterip?

@cloudnativer
Contributor Author

cloudnativer commented Jun 5, 2020

> If we set "--advertise-cluster-ip=false", our Kubernetes services can no longer be reached from outside the cluster.
>
> @cloudnativer Have you tried kube-router.io/service.advertise.clusterip?


[Requirements and test setup]

Suppose we have a Kubernetes service CIDR of 172.30.0.0/16, with 100 running services in the cluster.
The node under test has the pod CIDR 172.32.0.128/25, with 20 running pods.
We need to advertise both the service CIDR and the pod CIDR to the connected network device, so that services and pods can be accessed directly from outside.
We ran the following tests based on your suggestion.


[Test 1]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16
  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=false
     - --advertise-pod-cidr=true
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [not learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [learned]

[Test 2]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16
  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=false
     - --advertise-pod-cidr=false
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [not learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [not learned]

[Test 3]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16
  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=true
     - --advertise-pod-cidr=false
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [not learned]

[Test 4]

  1. kube-router image version:

     image: Cloudnativelabs official version (https://github.com/cloudnativelabs/kube-router)
    
  2. Annotations are set to:

      annotations:
        kube-router.io/service.advertise.clusterip: 172.30.0.0/16

  3. Args are set to:

    args:
    - --run-router=true
    - --run-firewall=true
    - --run-service-proxy=true
    - --enable-overlay=false
    - --enable-pod-egress=false
    - --advertise-cluster-ip=true
    - --advertise-pod-cidr=true
    - --masquerade-all=false
    - --bgp-graceful-restart=true
    - --enable-ibgp=false
    - --nodes-full-mesh=true
    - --cluster-asn=64558
    - --peer-router-ips=192.168.140.1
    - --peer-router-asns=64558
    - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [not learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [learned]

[Test 5]

  1. kube-router image version:

     image: My branch version (https://github.com/cloudnativer/kube-router-cnlabs/tree/advertise-cluster-subnet)
    
  2. Annotations are not set.

  3. Args are set to:

     args:
     - --run-router=true
     - --run-firewall=true
     - --run-service-proxy=true
     - --enable-overlay=false
     - --enable-pod-egress=false
     - --advertise-cluster-ip=true
     - --advertise-cluster-subnet=172.30.0.0/16
     - --advertise-pod-cidr=true
     - --masquerade-all=false
     - --bgp-graceful-restart=true
     - --enable-ibgp=false
     - --nodes-full-mesh=true
     - --cluster-asn=64558
     - --peer-router-ips=192.168.140.1
     - --peer-router-asns=64558
     - --kubeconfig=/etc/kubernetes/ssl/kubeconfig
    
  4. The test results are as follows:

Routing table description on the uplink network device:

  • The aggregated Kubernetes service route 172.30.0.0/16: [learned]
  • The 100 /32 Kubernetes service host routes of the cluster: [not learned]
  • The 172.32.0.128/25 pod CIDR route of the current node: [learned]

@murali-reddy

Attached is my YAML template file used for testing:

test.yaml.txt

I did not manage to achieve the effect you described using "kube-router.io/service.advertise.clusterip". Did I test it incorrectly, or is "kube-router.io/service.advertise.clusterip" simply unable to meet the requirements above?
With the "advertise-cluster-subnet" parameter, however, we were able to meet those requirements.

@cloudnativer
Contributor Author

Please note that I have changed "advertise-cluster-subnet" to "advertise-service-cluster-ip-range" to keep the parameter name consistent with kube-apiserver, kubeadm, etc. (see the sketch below).
Please see #920.
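
Assuming the rest of the configuration from Test 5 stays the same, the renamed flag would be used like this (a sketch; see #920 for the authoritative documentation):

     args:
     - --advertise-cluster-ip=true
     - --advertise-service-cluster-ip-range=172.30.0.0/16   # renamed from --advertise-cluster-subnet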

@murali-reddy
Member

@cloudnativer Apologies for the delay in getting back to you. I am focused on getting the 1.0 release out, hence the delay. I will leave comments in the PR.

@cloudnativer
Contributor Author

> @cloudnativer Apologies for the delay in getting back to you. I am focused on getting the 1.0 release out, hence the delay. I will leave comments in the PR.

OK.

@murali-reddy
Member

murali-reddy commented Jun 16, 2020

Adding some context to the problem. Kube-router's implementation of a network load balancer is based on Ananta and Maglev. In both models there is a set of dedicated load-balancer nodes (Mux in Ananta, Maglev in Maglev) which are BGP speakers and advertise the service VIPs. In the case of Kubernetes, each node is a load balancer/service proxy as well, so essentially every node in the cluster is part of a distributed load balancer. So if each of them is a BGP speaker, advertising /32 routes for the service VIPs can bloat the routing table as described above.

But perhaps this is something that can be addressed at the leaf routers by advertising the service IP range. Nevertheless, it is good to weigh the pros and cons and prescribe when to use what.

@cloudnativer
Contributor Author

cloudnativer commented Jun 18, 2020

> So if each of them is a BGP speaker, advertising /32 routes for the service VIPs can bloat the routing table as described above.

Yes, I agree with that.

> But perhaps this is something that can be addressed at the leaf routers by advertising the service IP range. Nevertheless, it is good to weigh the pros and cons and prescribe when to use what.

Yes, we can advertise the service IP range on the leaf routers to reduce the number of routes on the spine routers. But in a large-scale Kubernetes cluster network, if every kube-router advertises /32 host routes, the number of routes on the leaf routers themselves will still multiply, and aggregating the service IP range only at the leaf routers does not solve that. Therefore, we need to be able to advertise the service IP range from kube-router on the node itself, which reduces the number of routes on both the leaf routers and the uplink routers.

@cloudnativer
Contributor Author

cloudnativer commented Jun 30, 2020

@github-actions

github-actions bot commented Sep 5, 2023

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Sep 5, 2023
@github-actions

This issue was closed because it has been stale for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 10, 2023