Add the advertise-service-cluster-ip-range parameter to summarize the announcement function of the service network segment, reduce the routing entries of the connected network devices, and support a larger BGP network. #920
Conversation
@cloudnativer Have you been using kube-router as the CNI in these 4k-node clusters? If so, this has to be the largest reported kube-router cluster.
Thanks for contributing back.
I have not fully looked at the PR. I presume you are referring to
Let me clarify a concern here. This is specific to kube-router: kube-router will advertise a service VIP only if it has a backend pod for the service running on the node when annotated with
Can you please close #919 if this PR is an update?
@murali-reddy
I have transferred this PR to #920 and associated it with #923.
We want to route the Kubernetes service IPs directly to the external network so that services in the Kubernetes cluster can be reached from outside. We set --advertise-cluster-ip=true, and kube-router then advertises a 32-bit host route for each service to the connected network devices.
If we have 10,000 service IPs, kube-router advertises the /32 host routes of all 10,000 services to the connected network devices. As the Kubernetes cluster network grows, what happens when we have 100,000 service IPs?
In fact, these service IPs all fall within the service cluster IP range, so we could instead advertise a single summarized service subnet to the upstream network devices.
In a large-scale Kubernetes network, we may also need to turn on BGP ECMP on the upstream network devices to load-balance network traffic. With ECMP enabled, the connected devices receive even more Kubernetes service routes: if there are 40 Kubernetes nodes under an uplink network device and we have 100,000 services, you will see 4,000,000 ECMP routes on that device.
The "--advertise-cluster-subnet" flag I added to kube-router lets you customize the Kubernetes service subnet route announced to the upstream network devices. We can aggregate 100,000 service routes into a single route and announce that instead. The number of routes received on the connected network devices then drops from 4,000,000 to 40, greatly saving routing table space and resources on those devices.
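The route-count arithmetic above can be sketched in a few lines of Go; the 40-node and 100,000-service figures are the ones from this discussion:

```go
package main

import "fmt"

// ecmpRoutes returns how many routes the uplink device holds when every
// one of `nodes` kubernetes nodes announces `routesPerNode` prefixes and
// the device installs an ECMP path per (node, prefix) pair.
func ecmpRoutes(nodes, routesPerNode int) int {
	return nodes * routesPerNode
}

func main() {
	nodes, services := 40, 100000

	// Per-service /32 advertisement: every node announces every VIP.
	fmt.Println(ecmpRoutes(nodes, services)) // 4000000

	// Aggregated advertisement: each node announces one summary route.
	fmt.Println(ecmpRoutes(nodes, 1)) // 40
}
```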
This looks like an advance on Calico's ADVERTISE_CLUSTER_IPS function; please see https://www.projectcalico.org/kubernetes-service-ip-route-advertisement/.
If you set "--advertise-cluster-subnet", you can advertise the Kubernetes service route in a large-scale network and enable ECMP, because "--advertise-cluster-subnet" summarizes the routes, reduces the number of service IP route entries on the connected network devices, and reduces the performance cost on those devices.
By the way, if you think "--advertise-service-cluster-ip-range" is a more reasonable name, I can of course change the "--advertise-cluster-subnet" parameter in the PR to "--advertise-service-cluster-ip-range".
In addition, I have closed PR #919.
The purpose of this parameter is to reduce the number of BGP route entries advertised to the connected network devices when the service subnet needs to be exposed externally in a large-scale k8s cluster network. Please see #923.
Now I've changed "advertise-cluster-subnet" to "advertise-service-cluster-ip-range".
@cloudnativer I feel this use-case is relevant, given that an L3 core router's routing table can quickly fill up as the cluster size increases. But we need to see how it works with services marked externalTrafficPolicy=Local. Meanwhile, may I suggest splitting the PR? Move all the documentation to a separate PR; it would be far easier to focus and review.
I will continue to pay attention.
OK, of course.
@cloudnativer Hi, can you please split the code changes and documentation into separate PRs? The 1.0 release is out now, so I want to work on this PR by testing it. Thanks for your patience.
Yes, of course. As you requested, I have separated the code changes from the documentation.
@cloudnativer thanks for separating the code changes and documentation into separate PRs. Just so you know, I am testing the patch with
docs/user-guide.md
Outdated
```
@@ -40,6 +40,7 @@ Usage of kube-router:
      --advertise-external-ip                      Add External IP of service to the RIB so that it gets advertised to the BGP peers.
      --advertise-loadbalancer-ip                  Add LoadBalancer IP of service status as set by the LB provider to the RIB so that it gets advertised to the BGP peers.
      --advertise-pod-cidr                         Add Node's POD cidr to the RIB so that it gets advertised to the BGP peers. (default true)
      --advertise-service-cluster-ip-range string  If this parameter is set, kube-router will add the service cluster IP range set by this parameter to the RIB and advertise it to the BGP peers. The purpose of this parameter is to reduce the number of service route entries sent by kube-router to the uplink network device. (Please configure "advertise-cluster-ip=true" at the same time to ensure that the "advertise-service-cluster-ip-range" parameter takes effect.)
```
Can you please cut the description down to a single line, and add a longer description of the flag in the documentation?
Yes, of course.
I've cut the description down to a single line and added a longer description of the flag to the "Advertising IPs" section of the user-guide.md document.
```go
for _, ip := range advIps {
	advIPPrefixList = append(advIPPrefixList, config.Prefix{IpPrefix: ip + "/32"})
}

// If the value of the advertise-service-cluster-ip-range parameter is not empty,
// its value is put into the RIB; otherwise the original rules apply.
```
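The branching that this comment describes can be sketched roughly as follows; `buildPrefixList`, the flag value, and the example CIDRs are illustrative names for this sketch, not kube-router's actual code:

```go
package main

import "fmt"

// Prefix mirrors the shape of kube-router's config.Prefix for illustration.
type Prefix struct{ IpPrefix string }

// buildPrefixList puts the aggregate service range into the RIB when the
// (hypothetical) flag value is non-empty; otherwise it falls back to the
// original per-VIP /32 behaviour.
func buildPrefixList(serviceClusterIPRange string, advIPs []string) []Prefix {
	if serviceClusterIPRange != "" {
		return []Prefix{{IpPrefix: serviceClusterIPRange}}
	}
	var list []Prefix
	for _, ip := range advIPs {
		list = append(list, Prefix{IpPrefix: ip + "/32"})
	}
	return list
}

func main() {
	// Flag set: only the summary route is advertised.
	fmt.Println(buildPrefixList("10.96.0.0/12", []string{"10.96.0.10"}))
	// Flag unset: one /32 per service VIP.
	fmt.Println(buildPrefixList("", []string{"10.96.0.10", "10.96.0.11"}))
}
```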
I would like both the service CIDR and the service VIP to be announced only for services marked with externalTrafficPolicy set to Local, so that on upstream routers the /32 route to a VIP takes precedence over the route to the service cluster IP range.
@cloudnativer Did you give any thought to the above comment? I see the three scenarios below:

1. If `--advertise-service-cluster-ip-range` is configured, advertise ONLY the service cluster IP range from the nodes and DO NOT advertise service VIPs from the node (which is what this PR intends to achieve), irrespective of whether the service has pods running on the node and is marked `externalTrafficPolicy=Local`.
2. If `--advertise-service-cluster-ip-range` is configured, advertise the service cluster IP range from the nodes AND advertise `/32` service VIPs from a node if the node has an endpoint pod corresponding to the service and the service is marked `externalTrafficPolicy=Local`.
3. If `--advertise-service-cluster-ip-range` is NOT configured, keep the original behaviour: advertise the `/32` VIP from all nodes if the service is not marked with `externalTrafficPolicy=Local`, and advertise the service VIP ONLY from nodes that have an endpoint pod corresponding to the service when the service is marked with `externalTrafficPolicy=Local`.

The problem with #1 is that it can blackhole traffic for services set to `externalTrafficPolicy=Local`: traffic to the service VIP gets ECMPed to a node not running any service endpoint pod, and the proxy running on that node will reject the traffic.
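Scenarios 2 and 3 above can be sketched as a small decision function; the name `advertise` and its parameters are illustrative, not kube-router's actual API:

```go
package main

import "fmt"

// advertise returns the routes one node announces for a single service.
// clusterIPRange is the (hypothetical) --advertise-service-cluster-ip-range
// value; localPolicy means the service is marked externalTrafficPolicy=Local;
// hasLocalEndpoint means an endpoint pod for the service runs on this node.
func advertise(clusterIPRange string, localPolicy, hasLocalEndpoint bool, vip string) []string {
	var routes []string
	if clusterIPRange != "" {
		// Scenario 2: always announce the aggregate range, plus a /32
		// for Local services that have an endpoint on this node.
		routes = append(routes, clusterIPRange)
		if localPolicy && hasLocalEndpoint {
			routes = append(routes, vip+"/32")
		}
		return routes
	}
	// Scenario 3: original behaviour. Non-Local services get a /32 from
	// every node; Local services only from nodes with a local endpoint.
	if !localPolicy || hasLocalEndpoint {
		routes = append(routes, vip+"/32")
	}
	return routes
}

func main() {
	fmt.Println(advertise("10.96.0.0/12", true, false, "10.96.0.10")) // range only
	fmt.Println(advertise("10.96.0.0/12", true, true, "10.96.0.10"))  // range + /32
	fmt.Println(advertise("", true, false, "10.96.0.10"))             // nothing
}
```

The `/32` taking precedence over the aggregate on upstream routers is just longest-prefix matching, which is why scenario 2 avoids the blackhole that scenario 1 creates.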
> The problem with #1 is that it can blackhole the traffic for services set to `externalTrafficPolicy=Local`. Traffic to the service VIP gets ECMPed to a node not running any service endpoint pod, and the proxy running on that node will reject the traffic.
This problem does exist if `externalTrafficPolicy=Local` is set.

(1) The applicable conditions of `advertise-service-cluster-ip-range` are as follows:
- `externalTrafficPolicy` can only be used when the service type is LoadBalancer or NodePort.
- If the service type is LoadBalancer or NodePort, you should set `externalTrafficPolicy=Cluster` at the same time, so that the `advertise-service-cluster-ip-range` parameter is meaningful.
- In all other cases, you can set `advertise-service-cluster-ip-range` directly.

(2) In a large-scale environment, network traffic load balancing is very important. However, `externalTrafficPolicy=Local` leads to unbalanced network traffic, so our production environment uses the default `externalTrafficPolicy=Cluster`.
@murali-reddy I submitted the PR code successfully before; now it seems there is a merge conflict?
Moved to 1.2, as these PRs are large items and we're focusing on library upgrades for 1.1.
OK, I will keep watching for updates.
I solved the above code conflict problem and put the latest code in #1050.
Please use the code in #1050 when merging to 1.2.
So does that mean this PR should be closed?
Closing in favor of #1050 after no specific feedback from the user.
Each Kubernetes cluster in our production environment has 4,000 nodes, and the whole network is interconnected via BGP; it has been running stably for more than two years. kube-router has many problems in large Kubernetes clusters, and we have done a lot of optimization, so I want to contribute some of it back to the community. I have contributed an enhancement for large Kubernetes cluster networks to kube-router, as well as several practical documents about large Kubernetes cluster networks.
Added the "advertise-cluster-ip-range" flag for optimizing the number of routes.
I added the "advertise-cluster-ip-range" flag to kube-router. When you set "--advertise-cluster-ip=true" and "--advertise-cluster-ip-range=your_service_ip_range" at the same time, the Kubernetes node will only announce the specified aggregate cluster route to the upstream router device.
The advantage is that when your Kubernetes cluster is large and you need to announce cluster IP routes, this feature can reduce the number of service routes by 90%. This greatly reduces the load on the routers and copes with larger volumes of concurrent network traffic.
Documents on optimizing large Kubernetes cluster networks have been compiled; please see "Solve the routing optimization problem in large k8s cluster and support larger BGP routing network scale" #944 for details.
In order for your architecture to support a larger network, you need to do the following three things:
(1) Set "--enable-ibgp=false" so that Kubernetes nodes do not establish BGP neighbors directly with each other; let each node peer only with its upstream router device. (See the large-networks02 documentation.)
(2) You should turn on BGP ECMP on the upstream router device of the Kubernetes nodes. With this, user traffic entering the router is first balanced across the backend Kubernetes nodes via ECMP, and then to the final pod via IPVS load balancing. When devices, links, or nodes in the network go down, traffic automatically switches to other healthy devices, links, and nodes. This provides availability, high performance, and scalability for the network. (See the large-networks04 documentation.)
(3) Set both "--advertise-cluster-ip=true" and "--advertise-cluster-ip-range=subnet" so that Kubernetes nodes only announce the aggregated service route to the upstream routers, reducing the number of service route entries on them. (See the large-networks03 documentation.)
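The three steps above boil down to a handful of kube-router flags. A sketch of the resulting invocation, using the flag names from this description; the service CIDR value 10.96.0.0/12 is an assumption standing in for your real service cluster IP range:

```shell
# Sketch only: substitute your cluster's actual service CIDR.
kube-router \
  --enable-ibgp=false \
  --advertise-cluster-ip=true \
  --advertise-cluster-ip-range=10.96.0.0/12
```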