
Add the advertise-service-cluster-ip-range parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. #920

Closed
wants to merge 36 commits into from

Conversation

cloudnativer
Contributor

@cloudnativer cloudnativer commented Jun 4, 2020

Each kubernetes cluster in our production environment has 4000 nodes, and the whole network is interconnected by BGP; it has been running stably for more than two years. There are many problems with kube-router in large kubernetes clusters, and we have done a lot of optimization, so I want to contribute some of it back to the community. I have contributed an enhancement for large kubernetes cluster networks to kube-router, as well as several practical documents about large kubernetes cluster networking.

  1. Added the "advertise-cluster-ip-range" flag for optimizing the number of routes.
    I added the "advertise-cluster-ip-range" flag to kube-router. When you set "--advertise-cluster-ip=true" and "--advertise-cluster-ip-range=your_service_ip_range" at the same time, the kubernetes node will announce only the aggregate cluster route you specified to the upstream router.
    The advantage is that when your kubernetes cluster is large and you need to announce cluster-ip routes, this feature can reduce the number of service routes by 90%. This greatly reduces the cost of the routers and lets the network cope with more concurrent traffic.

  2. Compiled documents on optimizing large kubernetes cluster networks. Please check Solve the routing optimization problem in large k8s cluster and support larger BGP routing network scale. #944 for details.
    For your architecture to support a larger network, you need to do the following three things:
    (1) Set "--enable-ibgp=false" so that kubernetes nodes do not establish BGP neighbors directly with each other; let each kubernetes node peer only with the upstream router. (See the large-networks02 documentation.)
    (2) You should also turn on BGP ECMP on the upstream router of the kubernetes nodes. With this setup, when user traffic enters the router it is first balanced across the backend kubernetes nodes through ECMP, and then to the final pod through IPVS load balancing. When devices, links, or nodes in the network go down, traffic is automatically switched to the remaining healthy ones. In this way the network achieves availability, high performance, and scalability. (See the large-networks04 documentation.)
    (3) Set both "--advertise-cluster-ip=true" and "--advertise-cluster-ip-range=subnet" so that each k8s node announces only the aggregate k8s service route to the upstream routers, reducing their service routing entries. (See the large-networks03 documentation.)
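The ECMP-plus-IPVS balancing described in step (2) can be sketched as a toy simulation. This is illustrative Go only, not router or IPVS code; real routers hash the packet 5-tuple in hardware, and IPVS offers several schedulers beyond hashing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pick hashes a flow identifier onto one member of a set, mimicking how a
// hash-based balancer keeps all packets of one flow on the same path.
func pick(members []string, flow string) string {
	h := fnv.New32a()
	h.Write([]byte(flow))
	return members[h.Sum32()%uint32(len(members))]
}

func main() {
	nodes := []string{"node1", "node2", "node3"} // BGP ECMP next-hops on the router
	pods := []string{"pod-a", "pod-b", "pod-c"}  // IPVS backends on the chosen node
	flow := "203.0.113.5:51000->10.96.0.10:80"

	node := pick(nodes, flow) // tier 1: router ECMP picks a kubernetes node
	pod := pick(pods, flow)   // tier 2: IPVS on that node picks a backend pod
	fmt.Println(node, pod)

	// If a node fails, its route is withdrawn and flows re-hash onto the rest.
	fmt.Println(pick([]string{"node1", "node3"}, flow))
}
```

The key property is that each tier is independent: the router only needs healthy node routes, and each node only needs healthy pod endpoints, so failures at either tier heal without coordination.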

root added 2 commits June 4, 2020 10:57
@murali-reddy
Member

murali-reddy commented Jun 4, 2020

Each kubernetes cluster in our production environment has 4000 nodes, and the whole network is interconnected by BGP; it has been running stably for more than two years.

@cloudnativer Have you been using kube-router as the CNI in these 4k-node clusters? If so, this has to be the largest reported kube-router cluster.

There are many problems with kube-router in large kubernetes clusters, and we have done a lot of optimization, so I want to contribute some of it back to the community.

thanks for contributing back.

Added the "advertise-cluster-subnet" flag parameter for optimizing the number of routes.
I added the "advertise-cluster-subnet" flag parameter to kube-router. When you set the parameters of "-advertise-cluster-IP=true" and "-advertise-cluster-subnet=subnet" at the same time,

I have not fully looked at the PR. I presume you are referring to the service-cluster-ip-range. Can we please use the same name as the one used by kube-apiserver, kubeadm, etc., e.g. --advertise-service-cluster-ip-range?

the kubernetes node will only announce the aggregate cluster route you specified to the on-line router device.

Let me clarify a concern here. This is specific to kube-router: kube-router will advertise a service VIP only if it has a backend pod for the service running on the node, when the service object is annotated with kube-router.io/service.local or has externalTrafficPolicy=Local. So if -advertise-cluster-subnet=subnet is set, each node will advertise the entire range. So this would break the above functionality, right?
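The existing per-node rule being described can be reduced to a small predicate. This is a sketch with illustrative names, not kube-router's actual code:

```go
package main

import "fmt"

// shouldAdvertiseVIP sketches kube-router's existing per-node rule: for a
// service marked kube-router.io/service.local or externalTrafficPolicy=Local,
// a node advertises the /32 VIP only if an endpoint pod for that service runs
// on the node; for all other services every node advertises the VIP.
func shouldAdvertiseVIP(localPolicy, hasLocalEndpoint bool) bool {
	if localPolicy {
		return hasLocalEndpoint
	}
	return true
}

func main() {
	fmt.Println(shouldAdvertiseVIP(true, false))  // false: Local service, no endpoint here
	fmt.Println(shouldAdvertiseVIP(true, true))   // true: Local service with a local pod
	fmt.Println(shouldAdvertiseVIP(false, false)) // true: non-Local, advertised everywhere
}
```

Advertising the whole range from every node unconditionally would bypass this predicate, which is exactly the concern raised here.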

Can you please close #919 if this PR is an update?

@cloudnativer
Contributor Author

cloudnativer commented Jun 4, 2020 via email

@cloudnativer cloudnativer changed the title [update]Add the advertise-cluster-subnet parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Add the advertise-cluster-subnet parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Jun 4, 2020
@cloudnativer
Contributor Author

cloudnativer commented Jun 12, 2020

So if -advertise-cluster-subnet=subnet is set, each node will advertise the entire range. So this would break the above functionality, right?

The purpose of setting this parameter is to reduce the number of BGP route entries declared to the connected network devices when the service subnet needs to be exposed to the outside in a large-scale k8s cluster network. Please see #923.
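The route-count reduction this parameter aims for can be sketched as follows. This is a minimal illustration, not kube-router's implementation; the CIDR values are examples:

```go
package main

import (
	"fmt"
	"net"
)

// aggregateRoutes sketches the idea behind advertise-service-cluster-ip-range:
// instead of one /32 route per service VIP, announce the whole service cluster
// IP range once; only VIPs outside that range still need individual /32 routes.
func aggregateRoutes(vips []string, serviceCIDR string) ([]string, error) {
	_, cidr, err := net.ParseCIDR(serviceCIDR)
	if err != nil {
		return nil, err
	}
	routes := []string{cidr.String()} // one aggregate route for the whole range
	for _, v := range vips {
		if ip := net.ParseIP(v); ip != nil && !cidr.Contains(ip) {
			routes = append(routes, v+"/32")
		}
	}
	return routes, nil
}

func main() {
	// Thousands of services would normally mean thousands of /32 routes per
	// peer; with the aggregate, all in-range VIPs collapse into one entry.
	vips := []string{"10.96.0.10", "10.96.4.7", "192.0.2.9"}
	routes, _ := aggregateRoutes(vips, "10.96.0.0/12")
	fmt.Println(routes) // [10.96.0.0/12 192.0.2.9/32]
}
```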

@cloudnativer
Contributor Author

cloudnativer commented Jun 12, 2020

@murali-reddy

Can we please use the same name as the one used by kube-apiserver, kubeadm, etc., e.g. --advertise-service-cluster-ip-range?

I have now changed "advertise-cluster-subnet" to "advertise-service-cluster-ip-range" to keep the parameter name consistent with kube-apiserver, kubeadm, etc.

@murali-reddy
Member

@cloudnativer I feel this use-case is relevant, given that an L3 core router's routing table can quickly fill as cluster size increases. But we need to see how it works with services that have externalTrafficPolicy=Local set. I would imagine that if the nodes continue to advertise a /32 VIP for such a service along with the service-cluster-ip-range, the /32 route would take precedence and everything should fall into place. But we need to verify. I will work on this PR next week.

Meanwhile, may I suggest splitting the PR? Move all the documentation to a separate PR; it would be far easier to focus and review.

@cloudnativer
Contributor Author

cloudnativer commented Jun 12, 2020

I feel this use-case is relevant, given that an L3 core router's routing table can quickly fill as cluster size increases. But we need to see how it works with services that have externalTrafficPolicy=Local set. I would imagine that if the nodes continue to advertise a /32 VIP for such a service along with the service-cluster-ip-range, the /32 route would take precedence and everything should fall into place. But we need to verify. I will work on this PR next week.

I will keep following this.

Meanwhile, may I suggest splitting the PR? Move all the documentation to a separate PR; it would be far easier to focus and review.

OK, of course.

@cloudnativer cloudnativer changed the title Add the advertise-cluster-subnet parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Add the advertise-service-cluster-ip-range parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. Jun 19, 2020
@murali-reddy
Member

@cloudnativer Hi, can you please split the code changes and documentation into separate PRs? The 1.0 release is out now, so I want to work on this PR by testing it. Thanks for your patience.

@cloudnativer
Contributor Author

cloudnativer commented Jun 30, 2020

@cloudnativer Hi, can you please split the code changes and documentation into separate PRs? The 1.0 release is out now, so I want to work on this PR by testing it. Thanks for your patience.

Yes, of course.

As you requested, I have separated the code changes from the documentation.

@murali-reddy
Member

@cloudnativer thanks for separating the code changes and documentation into separate PRs.

Just so you know, I am testing the patch with --advertise-service-cluster-ip-range enabled and services with externalTrafficPolicy=Local. I should get back to you in a day.

@murali-reddy murali-reddy self-requested a review July 8, 2020 19:56
@@ -40,6 +40,7 @@ Usage of kube-router:
--advertise-external-ip Add External IP of service to the RIB so that it gets advertised to the BGP peers.
--advertise-loadbalancer-ip Add LoadbBalancer IP of service status as set by the LB provider to the RIB so that it gets advertised to the BGP peers.
--advertise-pod-cidr Add Node's POD cidr to the RIB so that it gets advertised to the BGP peers. (default true)
--advertise-service-cluster-ip-range string If this parameter is set, Kube-router will add the service cluster IP range set by this parameter to the RIB, and send the routing advertisement of the service cluster IP range to the BGP peer. The purpose of this parameter is to reduce the number of service route entries sent by the Kube-router to the uplink network device. (Please configure "advertise-cluster-ip=true" at the same time to ensure that the "advertise-service-cluster-ip-range" parameter takes effect.)
Member


Can you please cut the description down to a single line, and add a longer description of the flag in the documentation?

Contributor Author


Yes, of course.
I've cut the description down to a single line and added a longer description of the flag to the "Advertising IPs" section of the user-guide.md document.

for _, ip := range advIps {
	advIPPrefixList = append(advIPPrefixList, config.Prefix{IpPrefix: ip + "/32"})

// If the advertise-service-cluster-ip-range parameter is not empty, put its value into the RIB; otherwise follow the original rules.
Member

@murali-reddy murali-reddy Jul 8, 2020


I would like both the service CIDR and the /32 service VIP to be announced for services marked with externalTrafficPolicy=Local, so that on upstream routers the /32 route to the VIP takes precedence over the route to the service cluster IP range.

Member


@cloudnativer Did you give any thought to the above comment? I see three scenarios below:

  • If --advertise-service-cluster-ip-range is configured, advertise ONLY the service cluster IP range from the nodes and DO NOT advertise service VIPs from the node (which is what this PR intends to achieve), irrespective of whether the service has pods running on the node and is marked externalTrafficPolicy=Local.
  • If --advertise-service-cluster-ip-range is configured, advertise the service cluster IP range from the nodes AND advertise the /32 service VIP from a node if that node has an endpoint pod for the service running on it and the service is marked externalTrafficPolicy=Local.
  • If --advertise-service-cluster-ip-range is NOT configured, keep the original behaviour, i.e. advertise the /32 VIP from all nodes if the service is not marked externalTrafficPolicy=Local, and advertise the service VIP ONLY from the nodes which have an endpoint pod for the service running on them if the service is marked externalTrafficPolicy=Local.

The problem with #1 is that it can blackhole traffic for services set to externalTrafficPolicy=Local: traffic to the service VIP gets ECMP'd to a node not running any service endpoint pod, and the proxy running on that node will reject the traffic.
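The second scenario above can be sketched like this, with illustrative types and names rather than kube-router's actual code:

```go
package main

import "fmt"

type svc struct {
	vip           string
	localPolicy   bool // externalTrafficPolicy=Local
	localEndpoint bool // an endpoint pod for this service runs on this node
}

// nodeRoutes computes what one node would advertise under scenario 2: the
// aggregate service cluster IP range from every node, plus a /32 for each
// Local service that has an endpoint pod on this node. The more specific /32
// wins on upstream routers, so Local traffic is steered only to nodes that
// can serve it, avoiding the blackhole of scenario 1.
func nodeRoutes(serviceCIDR string, services []svc) []string {
	routes := []string{serviceCIDR}
	for _, s := range services {
		if s.localPolicy && s.localEndpoint {
			routes = append(routes, s.vip+"/32")
		}
	}
	return routes
}

func main() {
	routes := nodeRoutes("10.96.0.0/12", []svc{
		{vip: "10.96.0.10", localPolicy: true, localEndpoint: true},
		{vip: "10.96.4.7", localPolicy: true, localEndpoint: false}, // advertised from other nodes
		{vip: "10.96.9.9"}, // non-Local: covered by the aggregate alone
	})
	fmt.Println(routes) // [10.96.0.0/12 10.96.0.10/32]
}
```

This relies on longest-prefix matching at the upstream routers: the /32 always beats the aggregate, so whether it "falls into place" as hoped still needs the verification mentioned above.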

Contributor Author

@cloudnativer cloudnativer Jul 23, 2020


The problem with #1 is that it can blackhole traffic for services set to externalTrafficPolicy=Local: traffic to the service VIP gets ECMP'd to a node not running any service endpoint pod, and the proxy running on that node will reject the traffic.

This problem does exist if externalTrafficPolicy=Local is set.


(1) The applicable conditions of advertise-service-cluster-ip-range are as follows:

  • externalTrafficPolicy only applies when the service type is LoadBalancer or NodePort.
  • If the service type is LoadBalancer or NodePort, you should set externalTrafficPolicy=Cluster at the same time, so that the advertise-service-cluster-ip-range parameter remains meaningful.
  • Outside of that situation, you can simply set advertise-service-cluster-ip-range directly.

(2) In a large-scale environment, balanced network traffic is very important. The externalTrafficPolicy=Local configuration leads to unbalanced traffic load, so our production environment uses the default externalTrafficPolicy=Cluster.

@cloudnativer
Contributor Author

@murali-reddy I submitted the PR code successfully before, but now there seems to be a conflict.
Can you help resolve this code conflict?

@aauren
Collaborator

aauren commented Sep 17, 2020

Moved to 1.2 as these PRs are large items and we're focusing on library upgrades for 1.1

@cloudnativer
Contributor Author

Moved to 1.2 as these PRs are large items and we're focusing on library upgrades for 1.1

OK, I will keep watching for updates.

@cloudnativer
Contributor Author

cloudnativer commented Mar 22, 2021

I solved the above code conflict problem and put the latest code in #1050.
#920's code is based on kube-router v0.3, and it conflicts heavily with the new kube-router v1.1.1, so I rewrote the advertise-service-cluster-ip-range function on top of v1.1.1 and put it in #1050.

@cloudnativer
Contributor Author

Moved to 1.2 as these PRs are large items and we're focusing on library upgrades for 1.1

Please use the code in #1050 when merging to 1.2.

@aauren
Collaborator

aauren commented Mar 22, 2021

So does that mean that this PR should be closed?

@aauren
Collaborator

aauren commented Apr 11, 2021

Closing in favor of #1050 after no specific feedback from the user.

@aauren aauren closed this Apr 11, 2021