
Add the advertise-service-cluster-ip-range parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. #920

Closed
wants to merge 36 commits into from

Conversation

cloudnativer
Contributor

@cloudnativer cloudnativer commented Jun 4, 2020

Each kubernetes cluster in our production environment has 4000 nodes, and the whole network is interconnected by BGP; it has been running stably for more than two years. There are many problems with kube-router in large kubernetes clusters, and we have done a lot of optimization, so I want to contribute some of it back to the community. I have contributed an enhancement for large kubernetes cluster networks to kube-router, as well as several practical documents about large kubernetes cluster networking.

  1. Added the "advertise-cluster-ip-range" flag for optimizing the number of routes.
    I added the "advertise-cluster-ip-range" flag to kube-router. When you set "--advertise-cluster-ip=true" and "--advertise-cluster-ip-range=your_service_ip_range" at the same time, the kubernetes node will announce only the aggregate cluster route you specified to the upstream router.
    The advantage is that when your kubernetes cluster is large and you need to announce cluster-ip routes, this feature can reduce the number of service routes by 90%. This greatly reduces the cost of the routers and lets the network cope with more concurrent traffic.

  2. Compiled documents on optimizing large kubernetes cluster networks. Please check Solve the routing optimization problem in large k8s cluster and support larger BGP routing network scale. #944 for details.
    For your architecture to support a larger network, you need to do the following three things:
    (1) Set "--enable-ibgp=false" so that kubernetes nodes do not establish BGP neighbors directly with each other; let each kubernetes node peer only with the upstream router. (See the large-networks02 documentation.)
    (2) You should also turn on BGP ECMP on the upstream router of the kubernetes nodes. With this setup, when user traffic enters the router it is first balanced across the backend kubernetes nodes through ECMP, and then to the final pod through IPVS load balancing. When devices, links, or nodes in the network go down, traffic is automatically switched to the remaining healthy ones. In this way the network achieves availability, high performance, and scalability. (See the large-networks04 documentation.)
    (3) Set both "--advertise-cluster-ip=true" and "--advertise-cluster-ip-range=subnet" so that each k8s node announces only the aggregate k8s service route to the upstream routers, reducing their service routing entries. (See the large-networks03 documentation.)
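The ECMP-plus-IPVS balancing described in step (2) can be sketched as a toy simulation. This is illustrative Go only, not router or IPVS code; real routers hash the packet 5-tuple in hardware, and IPVS offers several schedulers beyond hashing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pick hashes a flow identifier onto one member of a set, mimicking how a
// hash-based balancer keeps all packets of one flow on the same path.
func pick(members []string, flow string) string {
	h := fnv.New32a()
	h.Write([]byte(flow))
	return members[h.Sum32()%uint32(len(members))]
}

func main() {
	nodes := []string{"node1", "node2", "node3"} // BGP ECMP next-hops on the router
	pods := []string{"pod-a", "pod-b", "pod-c"}  // IPVS backends on the chosen node
	flow := "203.0.113.5:51000->10.96.0.10:80"

	node := pick(nodes, flow) // tier 1: router ECMP picks a kubernetes node
	pod := pick(pods, flow)   // tier 2: IPVS on that node picks a backend pod
	fmt.Println(node, pod)

	// If a node fails, its route is withdrawn and flows re-hash onto the rest.
	fmt.Println(pick([]string{"node1", "node3"}, flow))
}
```

The key property is that each tier is independent: the router only needs healthy node routes, and each node only needs healthy pod endpoints, so failures at either tier heal without coordination.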

root added 2 commits June 4, 2020 10:57
@murali-reddy
Member

murali-reddy commented Jun 4, 2020

Each kubernetes cluster in our production environment has 4000 nodes, and the whole network is interconnected by BGP; it has been running stably for more than two years.

@cloudnativer Have you been using kube-router as the CNI in these 4k-node clusters? If so, this has to be the largest reported kube-router cluster.

There are many problems with kube-router in large kubernetes clusters, and we have done a lot of optimization, so I want to contribute some of it back to the community.

thanks for contributing back.

Added the "advertise-cluster-subnet" flag parameter for optimizing the number of routes.
I added the "advertise-cluster-subnet" flag parameter to kube-router. When you set the parameters of "-advertise-cluster-IP=true" and "-advertise-cluster-subnet=subnet" at the same time,

I have not fully looked at the PR. I presume you are referring to the service-cluster-ip-range. Can we please use the same name as the one used by kube-apiserver, kubeadm, etc., e.g. --advertise-service-cluster-ip-range?

the kubernetes node will only announce the aggregate cluster route you specified to the on-line router device.

Let me clarify a concern here. This is specific to kube-router: kube-router will advertise a service VIP only if it has a backend pod for the service running on the node, when the service object is annotated with kube-router.io/service.local or has externalTrafficPolicy=Local. So if -advertise-cluster-subnet=subnet is set, each node will advertise the entire range. So this would break the above functionality, right?
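The existing per-node rule being described can be reduced to a small predicate. This is a sketch with illustrative names, not kube-router's actual code:

```go
package main

import "fmt"

// shouldAdvertiseVIP sketches kube-router's existing per-node rule: for a
// service marked kube-router.io/service.local or externalTrafficPolicy=Local,
// a node advertises the /32 VIP only if an endpoint pod for that service runs
// on the node; for all other services every node advertises the VIP.
func shouldAdvertiseVIP(localPolicy, hasLocalEndpoint bool) bool {
	if localPolicy {
		return hasLocalEndpoint
	}
	return true
}

func main() {
	fmt.Println(shouldAdvertiseVIP(true, false))  // false: Local service, no endpoint here
	fmt.Println(shouldAdvertiseVIP(true, true))   // true: Local service with a local pod
	fmt.Println(shouldAdvertiseVIP(false, false)) // true: non-Local, advertised everywhere
}
```

Advertising the whole range from every node unconditionally would bypass this predicate, which is exactly the concern raised here.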

Can you please close #919 if this PR is an update?

@cloudnativer
Contributor Author

cloudnativer commented Jun 4, 2020 via email

@cloudnativer cloudnativer changed the title [update]Add the advertise-cluster-subnet parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Add the advertise-cluster-subnet parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Jun 4, 2020
@cloudnativer
Contributor Author

cloudnativer commented Jun 12, 2020

So if -advertise-cluster-subnet=subnet is set, each node will advertise the entire range. So this would break the above functionality, right?

The purpose of setting this parameter is to reduce the number of BGP route entries declared to the connected network devices when the service subnet needs to be exposed to the outside in a large-scale k8s cluster network. Please see #923.
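The route-count reduction this parameter aims for can be sketched as follows. This is a minimal illustration, not kube-router's implementation; the CIDR values are examples:

```go
package main

import (
	"fmt"
	"net"
)

// aggregateRoutes sketches the idea behind advertise-service-cluster-ip-range:
// instead of one /32 route per service VIP, announce the whole service cluster
// IP range once; only VIPs outside that range still need individual /32 routes.
func aggregateRoutes(vips []string, serviceCIDR string) ([]string, error) {
	_, cidr, err := net.ParseCIDR(serviceCIDR)
	if err != nil {
		return nil, err
	}
	routes := []string{cidr.String()} // one aggregate route for the whole range
	for _, v := range vips {
		if ip := net.ParseIP(v); ip != nil && !cidr.Contains(ip) {
			routes = append(routes, v+"/32")
		}
	}
	return routes, nil
}

func main() {
	// Thousands of services would normally mean thousands of /32 routes per
	// peer; with the aggregate, all in-range VIPs collapse into one entry.
	vips := []string{"10.96.0.10", "10.96.4.7", "192.0.2.9"}
	routes, _ := aggregateRoutes(vips, "10.96.0.0/12")
	fmt.Println(routes) // [10.96.0.0/12 192.0.2.9/32]
}
```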

@cloudnativer
Contributor Author

cloudnativer commented Jun 12, 2020

@murali-reddy

Can we please use the same name as the one used by kube-apiserver, kubeadm, etc., e.g. --advertise-service-cluster-ip-range?

I have now changed "advertise-cluster-subnet" to "advertise-service-cluster-ip-range" to keep the parameter name consistent with kube-apiserver, kubeadm, etc.

@murali-reddy
Member

@cloudnativer I feel this use-case is relevant, given that an L3 core router's routing table can quickly fill as cluster size increases. But we need to see how it works with services that have externalTrafficPolicy=Local set. I would imagine that if the nodes continue to advertise a /32 VIP for such a service along with the service-cluster-ip-range, the /32 route would take precedence and everything should fall into place. But we need to verify. I will work on this PR next week.

Meanwhile, may I suggest splitting the PR? Move all the documentation to a separate PR; it would be far easier to focus and review.

@cloudnativer
Contributor Author

cloudnativer commented Jun 12, 2020

I feel this use-case is relevant, given that an L3 core router's routing table can quickly fill as cluster size increases. But we need to see how it works with services that have externalTrafficPolicy=Local set. I would imagine that if the nodes continue to advertise a /32 VIP for such a service along with the service-cluster-ip-range, the /32 route would take precedence and everything should fall into place. But we need to verify. I will work on this PR next week.

I will keep following this.

Meanwhile, may I suggest splitting the PR? Move all the documentation to a separate PR; it would be far easier to focus and review.

OK, of course.

@cloudnativer cloudnativer changed the title Add the advertise-cluster-subnet parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Add the advertise-service-cluster-ip-range parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Jun 19, 2020
@murali-reddy
Member

@cloudnativer Hi, can you please split the code changes and documentation into separate PRs? The 1.0 release is out now, so I want to work on this PR by testing it. Thanks for your patience.

@cloudnativer
Contributor Author

cloudnativer commented Jun 30, 2020

@cloudnativer Hi, can you please split the code changes and documentation into separate PRs? The 1.0 release is out now, so I want to work on this PR by testing it. Thanks for your patience.

Yes, of course.

As you requested, I have separated the code changes from the documentation.

@murali-reddy
Member

@cloudnativer thanks for separating the code changes and documentation into separate PRs.

Just so you know, I am testing the patch with --advertise-service-cluster-ip-range enabled and services with externalTrafficPolicy=Local. I should get back to you in a day.

@murali-reddy murali-reddy self-requested a review July 8, 2020 19:56
@@ -40,6 +40,7 @@ Usage of kube-router:
--advertise-external-ip Add External IP of service to the RIB so that it gets advertised to the BGP peers.
--advertise-loadbalancer-ip Add LoadbBalancer IP of service status as set by the LB provider to the RIB so that it gets advertised to the BGP peers.
--advertise-pod-cidr Add Node's POD cidr to the RIB so that it gets advertised to the BGP peers. (default true)
--advertise-service-cluster-ip-range string If this parameter is set, Kube-router will add the service cluster IP range set by this parameter to the RIB, and send the routing advertisement of the service cluster IP range to the BGP peer. The purpose of this parameter is to reduce the number of service route entries sent by the Kube-router to the uplink network device. (Please configure "advertise-cluster-ip=true" at the same time to ensure that the "advertise-service-cluster-ip-range" parameter takes effect.)
Member


Can you please cut the description down to a single line, and add a longer description of the flag in the documentation?

Contributor Author


Yes, of course.
I've cut the description down to a single line and added a longer description of the flag to the "Advertising IPs" section of the user-guide.md document.

for _, ip := range advIps {
	advIPPrefixList = append(advIPPrefixList, config.Prefix{IpPrefix: ip + "/32"})

// If the advertise-service-cluster-ip-range parameter is not empty, put its value into the RIB; otherwise follow the original rules.
Member

@murali-reddy murali-reddy Jul 8, 2020


I would like both the service CIDR and the /32 service VIP to be announced for services marked with externalTrafficPolicy=Local, so that on upstream routers the /32 route to the VIP takes precedence over the route to the service cluster IP range.

Member


@cloudnativer Did you give any thought to the above comment? I see three scenarios below:

  • If --advertise-service-cluster-ip-range is configured, advertise ONLY the service cluster IP range from the nodes and DO NOT advertise service VIPs from the node (which is what this PR intends to achieve), irrespective of whether the service has pods running on the node and is marked externalTrafficPolicy=Local.
  • If --advertise-service-cluster-ip-range is configured, advertise the service cluster IP range from the nodes AND advertise the /32 service VIP from a node if that node has an endpoint pod for the service running on it and the service is marked externalTrafficPolicy=Local.
  • If --advertise-service-cluster-ip-range is NOT configured, keep the original behaviour, i.e. advertise the /32 VIP from all nodes if the service is not marked externalTrafficPolicy=Local, and advertise the service VIP ONLY from the nodes which have an endpoint pod for the service running on them if the service is marked externalTrafficPolicy=Local.

The problem with #1 is that it can blackhole traffic for services set to externalTrafficPolicy=Local: traffic to the service VIP gets ECMP'd to a node not running any service endpoint pod, and the proxy running on that node will reject the traffic.
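The second scenario above can be sketched like this, with illustrative types and names rather than kube-router's actual code:

```go
package main

import "fmt"

type svc struct {
	vip           string
	localPolicy   bool // externalTrafficPolicy=Local
	localEndpoint bool // an endpoint pod for this service runs on this node
}

// nodeRoutes computes what one node would advertise under scenario 2: the
// aggregate service cluster IP range from every node, plus a /32 for each
// Local service that has an endpoint pod on this node. The more specific /32
// wins on upstream routers, so Local traffic is steered only to nodes that
// can serve it, avoiding the blackhole of scenario 1.
func nodeRoutes(serviceCIDR string, services []svc) []string {
	routes := []string{serviceCIDR}
	for _, s := range services {
		if s.localPolicy && s.localEndpoint {
			routes = append(routes, s.vip+"/32")
		}
	}
	return routes
}

func main() {
	routes := nodeRoutes("10.96.0.0/12", []svc{
		{vip: "10.96.0.10", localPolicy: true, localEndpoint: true},
		{vip: "10.96.4.7", localPolicy: true, localEndpoint: false}, // advertised from other nodes
		{vip: "10.96.9.9"}, // non-Local: covered by the aggregate alone
	})
	fmt.Println(routes) // [10.96.0.0/12 10.96.0.10/32]
}
```

This relies on longest-prefix matching at the upstream routers: the /32 always beats the aggregate, so whether it "falls into place" as hoped still needs the verification mentioned above.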

Contributor Author

@cloudnativer cloudnativer Jul 23, 2020


The problem with #1 is that it can blackhole traffic for services set to externalTrafficPolicy=Local: traffic to the service VIP gets ECMP'd to a node not running any service endpoint pod, and the proxy running on that node will reject the traffic.

This problem does exist if externalTrafficPolicy=Local is set.


(1) The applicable conditions of advertise-service-cluster-ip-range are as follows:

  • externalTrafficPolicy only applies when the service type is LoadBalancer or NodePort.
  • If the service type is LoadBalancer or NodePort, you should set externalTrafficPolicy=Cluster at the same time, so that the advertise-service-cluster-ip-range parameter remains meaningful.
  • Outside of that situation, you can simply set advertise-service-cluster-ip-range directly.

(2) In a large-scale environment, balanced network traffic is very important. The externalTrafficPolicy=Local configuration leads to unbalanced traffic load, so our production environment uses the default externalTrafficPolicy=Cluster.

@cloudnativer
Contributor Author

@murali-reddy I submitted the PR code successfully before, but now there seems to be a conflict.
Can you help resolve this code conflict?

@aauren
Collaborator

aauren commented Sep 17, 2020

Moved to 1.2 as these PRs are large items and we're focusing on library upgrades for 1.1

@cloudnativer
Contributor Author

Moved to 1.2 as these PRs are large items and we're focusing on library upgrades for 1.1

OK, I will keep watching for updates.

@cloudnativer
Contributor Author

cloudnativer commented Mar 22, 2021

I solved the above code conflict problem and put the latest code in #1050.
#920's code is based on kube-router v0.3, and it conflicts heavily with the new kube-router v1.1.1, so I rewrote the advertise-service-cluster-ip-range function on top of v1.1.1 and put it in #1050.

@cloudnativer
Contributor Author

Moved to 1.2 as these PRs are large items and we're focusing on library upgrades for 1.1

Please use the code in #1050 when merging to 1.2.

@aauren
Collaborator

aauren commented Mar 22, 2021

So does that mean that this PR should be closed?

@aauren
Collaborator

aauren commented Apr 11, 2021

Closing in favor of #1050 after no specific feedback from the user.

@aauren aauren closed this Apr 11, 2021