CFP: Bandwidth Manager with fq_codel? #29083
Thank you for doing these benchmarks! I was very concerned when I heard of this new design that they had not taken a hard look at in-flow latency.
Several suggestions, if you have time:
The designers of this cilium subsystem were claiming that it was bbr (which does not support rfc3168 ecn) that was doing the magic, but it looks to my eye that it is merely massively excessive latency that makes it look more uniform. I am under the impression, however, that at least kernel 6.1 was needed for cilium + fq + bbr to work "correctly"?

One fq_codel/cake trick few DC users apply: for short paths you can actually tune fq_codel's target down quite a bit, to achieve the same bandwidth at much lower within-stream latency. I regularly run fq_codel on bare metal with a target of 250us and an interval of 5ms in the one transiting-a-DC-only app I have (the cross-DC latency is about 4ms tops). It is really hard to measure improvements below a ms, however! I have often wondered how low you could go across containers with ECN enabled. The theory behind the codel algorithm is that you can go down to one MTU and 2 RTTs with it, were it implemented directly in hardware.

Lastly, as much as I love the rrul test, most container workloads are pretty unidirectional, and I might start by taking packet captures of your application in deployment and trying to model that. Otherwise, my next test in your test series would be more like `--step-size=.05 -x --test-parameter=upload_streams=4 --socket-stats tcp_nup`.

https://blog.cerowrt.org/post/juniper/ has a simple script to sweep from 1-64 flows in it.
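For reference, a minimal sketch of that target/interval tuning; the device name is a placeholder, and the values are simply the ones quoted above, not a general recommendation:

```
tc qdisc replace dev eth0 root fq_codel target 250us interval 5ms ecn
```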
Hmm, so I have a hunch about the cause of this behaviour, but I don't have time to dig into the details right now. But just to see if I'm completely off base, could you try running a UDP-based bandwidth test as well? Just using iperf2 with …
Thank you both for your comments! I took some time to look into them.

First of all, regarding the "UDP-based bandwidth test": I tried it out on a pair of `n2-standard-8` nodes in GCP. I left the default `fq` setup and rate-limited the client to 100Mbps.

Interestingly enough there was 98% packet loss (this is the `175618/178334 (98%)` number reported by the server). Once I removed the `kubernetes.io/egress-bandwidth` annotation to disable rate-limiting, the packet loss was gone. (The full iperf output is in the quoted reply below.)
Regarding
To give a bit more context, we are mostly interested in This is why the forced switch to
I tried this out by setting

```
securityContext:
  sysctls:
    - name: net.ipv4.tcp_ecn
      value: "1"
```

on both pods and running …
Yes, this is definitely on my to-do list, we have 100s of applications running in our infra but I do have some in mind which are quite sensitive to latency and throughput, and which would be good candidates for more benchmarking. It's a bit of a hassle to benchmark them though because I will need to sync with application teams first, this is why I wanted to get as many insights as possible from synthetic benchmarks and only then move to real applications 😄
Added this to my to-do list as well, will try to get to it this week.
On 15 November 2023 18:22:58 CET, Anton Ippolitov ***@***.***> wrote:
Thank you both for your comments! I took some time to look into them.
First of all, regarding the "UDP-based bandwidth test", I tried it out on a pair of `n2-standard-8` nodes in GCP. I left the default `fq` setup and rate-limited the client to 100Mbps and this is the result I got:
```
# iperf --udp --client 10.19.1.46 --bandwidth 200M --enhancedreports
------------------------------------------------------------
Client connecting to 10.19.1.46, UDP port 5001 with pid 101
Sending 1470 byte datagrams, IPG target: 56.08 us (kalman adjust)
UDP buffer size: 16.0 MByte (default)
------------------------------------------------------------
[ 3] local 10.19.0.249 port 48567 connected with 10.19.1.46 port 5001
[ ID] Interval Transfer Bandwidth Write/Err PPS
[ 3] 0.0000-10.0001 sec 250 MBytes 210 Mbits/sec 178330/0 17832 pps
[ 3] Sent 178330 datagrams
[ 3] Server Report:
[ 3] 0.0-12.0 sec 3.81 MBytes 2.66 Mbits/sec 0.077 ms 175618/178334 (98%)
[ 3] 0.0000-11.9995 sec 1 datagrams received out-of-order
```
Interestingly enough there was 98% packet loss (this is the `175618/178334 (98%)` number reported by the server)
Once I removed the `kubernetes.io/egress-bandwidth` annotation to disable rate-limiting, the packet loss was gone:
```
# iperf --udp --client 10.19.1.46 --bandwidth 200M --enhancedreports
------------------------------------------------------------
Client connecting to 10.19.1.46, UDP port 5001 with pid 104
Sending 1470 byte datagrams, IPG target: 56.08 us (kalman adjust)
UDP buffer size: 16.0 MByte (default)
------------------------------------------------------------
[ 3] local 10.19.0.249 port 48935 connected with 10.19.1.46 port 5001
[ ID] Interval Transfer Bandwidth Write/Err PPS
[ 3] 0.0000-10.0001 sec 250 MBytes 210 Mbits/sec 178329/0 17832 pps
[ 3] Sent 178329 datagrams
[ 3] Server Report:
[ 3] 0.0-10.0 sec 250 MBytes 210 Mbits/sec 0.018 ms 0/178329 (0%)
[ 3] 0.0000-9.9989 sec 6 datagrams received out-of-order
```
Okay, that's very interesting. What about with egress-bandwidth set, but fq_codel as the qdisc?
Hm, I switched to `fq_codel` and here is what I got:

```
# iperf --udp --client 10.19.0.30 --bandwidth 200M --enhancedreports
------------------------------------------------------------
Client connecting to 10.19.0.30, UDP port 5001 with pid 24
Sending 1470 byte datagrams, IPG target: 56.08 us (kalman adjust)
UDP buffer size: 16.0 MByte (default)
------------------------------------------------------------
[ 3] local 10.19.0.8 port 40141 connected with 10.19.0.30 port 5001
[ 3] WARNING: did not receive ack of last datagram after 10 tries.
[ ID] Interval Transfer Bandwidth Write/Err PPS
[ 3] 0.0000-10.0000 sec 250 MBytes 210 Mbits/sec 178329/0 17832 pps
[ 3] Sent 178329 datagrams
```

The situation is better with the …
Here are the queue parameters btw:
I am pleased you are running fq_codel in the first place and would love a dump from prod of, say, 100 "boxes", of … In the non-rate-limited k8s case I have generally assumed it was mq doing a goodly percentage of the FQ in the first place. I usually post-process this stuff with awk. I am obsolete. If you are a JSON wizard, `tc -s -j qdisc show` and go to town.
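For what it's worth, a sketch of one way to collect and reduce that kind of dump, assuming `jq` is available for the JSON route:

```
tc -s -j qdisc show | jq -r '.[] | [.kind, .handle, .drops, .overlimits] | @tsv'
```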
quantum 1474? Not 1514? why?
Lastly, I think we have differing interpretations of "jitter" on the rrul test, and I would love you to describe how you were thinking about it before I launch into my standard lecture, so I can maybe lecture sanely to others in the future. I note that I used the word wrong myself in my initial comment.

I am happy you are delving into this. It seems 99.9999% of the k8s crowd thinks TCP is a function call. I was once able to save a cloudy user 80% of their bandwidth bill by cutting tcp_notsent_lowat down to something reasonable... there are other things worth tuning, like initcwnd... Anyway, in this topology:

(internet) <- web proxy <- local containers

the local container interfaces can be configured down to target 250us, interval 2.5ms on bare metal, especially with ECN enabled.

As for the "jitter" thing, you can possibly see the difference I was expecting via a CDF plot comparing the before/after ECN on the tcp_nup TCP RTT. I also note the raw JSON files in flent contain very little sensitive info (IP addresses, qdiscs) and Toke and I are really good at flipping through them. The -x option however gathers more than most corps would like.
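To make the tcp_notsent_lowat point concrete, a hedged example; the 16 kB value here is purely illustrative and not a recommendation from this thread:

```
sysctl -w net.ipv4.tcp_notsent_lowat=16384
```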
Hello, sorry for the delay due to conference travel.. digesting this thread a bit: [...]
Yep.
The initial motivation for fq was to do EDT for the Cilium Bandwidth Manager (see also https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF). As far as I know, fq_codel does not support EDT. Compare https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq.c#n530 vs https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq_codel.c#n184. See also https://lore.kernel.org/netdev/[email protected]/. The EDT we make use of for the …

The BBR for Pods etc. needs a v5.18+ kernel, otherwise pacing is broken given the skb->tstamp (delivery timestamp) is cleared upon netns traversal and the rates fluctuate. The relevant fixes, which are part of these kernels onwards, are at https://lore.kernel.org/bpf/[email protected]/.
Again, fq_codel does not support EDT, hence the fq_codel measurement is "broken" here given skb->tstamps that were set by the Cilium Bandwidth Manager's BPF code are ignored.
Btw, semi-related, did you also try to run a 6.7-rc2 kernel? These got merged there: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b49a948568dcbb5f38cbf5356ea0fb9c9c6f6953. Also, priority bands landed recently (https://netdev.bots.linux.dev/netconf/2023/eric.pdf).
Thank you @borkmann ! I looked into your main point
and I believe you are right, my I removed these two lines, then re-ran a benchmark with So it looks like using
I haven't had the chance to try it out yet, we are still mainly running 5.15 everywhere. I will give it a try. I am still curious to understand the performance issues I saw with …

Also, thank you @dtaht for the suggestions.
It would indeed be interesting to look into better tuning
This is a default value set by Google Cloud, probably their MTU of 1460 bytes + hardware header length of 14 bytes = 1474.
Yeah, I am not sure that "jitter" is the exact right term, I was trying to refer to the sharp oscillations looking like this:
Ah! This is the behaviour I was expecting, and couldn't for the life of me figure out why this wasn't the case; totally missed the horizon drop thing in the BPF code, that explains it :)
I wouldn't expect 6.7 to help with the latency issues you were seeing. This is caused by the way the bandwidth manager is implemented, AFAICT; basically it creates a virtual FIFO queue without implementing any kind of AQM or flow queueing, so the terrible latency is totally expected.

Specifically, the shaper logic here (right above the horizon drop thing you linked above) does a lookup into the rate config map using the previously set queue mapping as the key[0], finds the rate and the last timestamp, and sets a timestamp for the packet based on the rate and the length of the packet. So packets will be delayed in the …

From a queueing algorithm behaviour PoV this is obviously not great. And doing better at this level is not so straightforward either: since the dequeue time is computed before each packet is queued, tail dropping is the only action possible to control the queue. I've seen a CoDel implementation that works in this mode at one point, so that would be possible, I guess; but flow queueing is not, really, at least not without temporarily going above the bandwidth limit after the fact. Also, just straight round-robin scheduling between flows becomes quite challenging to do in this "virtual queue of future transmission times" mode.

So based on the above, my recommendation would simply be "don't use the bandwidth manager unless you don't care about latency at all". I'm frankly a little puzzled that no one noticed this behaviour before; I guess it's the ever-present curse of "only benchmark TCP throughput"? IDK. Or who knows, maybe I'm missing something fundamental here, and there's some reason why things are not as dire as I paint them above? If so, I would love to know what that reason is! :)

[0] Not quite sure how that is initially set, but it seems to be coming from inside the pod? So veth queue ID, I guess? Anyway, it doesn't look like there is more than at most a few IDs per pod.
I note that I do this sort of tuning for a living, and am between contracts at the moment. Anyway, thank you all for getting into the methods and claims behind how this cilium subsystem works without me having to poke into it much. I would like to know how much the rtt grows before shuddering overmuch.

A) Packet captures would be nice, seeing the rtt in both slow start and congestion avoidance.

In general the rrul test is a partial emulation of a BitTorrent workload, and has been a good proxy for creating "good" residential and small corporate network behaviors in the general case. Regrettably it has next to no bearing on the actual behaviors of a container workload, IMHO. Measure those. Most of the ones I have seen (with the exception of movie streaming) have been totally dominated by slow start, where tuning initcwnd and tcp_notsent_lowat matter most, and in the places I have been called in, leveraging fq_codel + ECN + cubic to manage it further, without packet drop. I never bothered to try and rate-shape anything before now, the problems being always more dominated by the behaviors of the flows from the web proxy outwards.
The "jitter" you observe on the rrul test is the conflation of two things: sampling error and actual bandwidth usage during the sample interval defined. It is normal for it to occilate somewhat because that is how the tcp sawtooth works as a function of the rtt, to go deep see https://ee.lbl.gov/papers/congavoid.pdf and really deep see https://en.wikipedia.org/wiki/Lyapunov_stability It is necessary for it to bounce around a bit in order for the internet to not collapse. rfc970. The height is related to the rtt, but flent sampling error so huge on this sub-1us RTT that you are not getting a real picture of the carnage underneath at all. Your second plot shows a classic example of slow convergence, where the first flows to start hog all the processor and bandwidth and the later flows - due in part to the massively inflating rtt, takes a longer time (15 seconds) to achieve equality. The FQ is not helping here, but somewhere there are some big buffers in this stack. A better test would be to start, say two saturating flows and then measure transactions per second for a zillion other flows, netperf's tcp_rr test for example. In other words this plot is massively better than the others because the latecomers to the party In both these cases, honestly, the underlying behavior of the container's real workload looks nothing like this and the only way I can ever convince someone of this is for them to take 5 minutes or so of packet capture and tear it, rather than iperf/netperf test traffic, apart. |
Well, no, not really, at least not in itself, for the reason I outlined above...
I would like a packet capture. And a glass of scotch.
The idea was to transfer https://netdevconf.info//0x14/pub/slides/55/slides.pdf / https://netdevconf.info//0x14/pub/papers/55/0x14-paper55-talk-paper.pdf for K8s and utilise the Pod egress rate annotations there. [...]
Hm, I wonder also if we're hitting other things such as sch->limit with the defaults; I've heard about issues like these due to defaults being too low. Would be good to trace kfree_skb to double check. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq.c#n530
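As a sketch of that tracing with perf (the `skb:kfree_skb` tracepoint is standard; interpreting the call graphs is left to the reader):

```
perf record -e skb:kfree_skb -a -g -- sleep 10
perf report --stdio | head -50
```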
Given all other skb fields such as mark etc. are already used up, it basically stores the endpoint ID into queue_mapping, which is preserved all the way through the stack, and later resets it, so the kernel picks the queue via flow hash. Not great, but it seems to function at least.
Yup, I did recognise the approach from there. I'm also wondering why the Google folks didn't see these latency spikes. My best guess is that it's BBR throttling that kicks in early enough to mostly mask the FIFO behaviour. That, combined with some workload-specific traffic properties, may put you into "don't care about latency" territory for some deployments (especially if you're coming from a highly contended global HTB lock scenario)?
Those limit drops should be visible in the qdisc stats, then. Hmm, or maybe we're not, on the AWS instances, at least? There will be some smaller packets (ACKs) in the queue as well, so the ~400 ms could well be the queue overflow point? Maybe that's also the reason for the difference between the GCP and AWS results - i.e., differences in TCP stack backpressure effectiveness?
Right, OK, so it's basically one ID per container/pod? That's what I was assuming (as you'd want the limit to be global for that entity), I just didn't manage to trace the code back far enough to figure out where those IDs were coming from :)
Gathering qdisc stats as Toke mentioned would be great indeed, if you have them, Anton. @antonipp Did you measure with BPF host routing? (https://docs.cilium.io/en/stable/operations/performance/tuning/#ebpf-host-routing) If not, could you try to set it and redo the measurement? (The upper stack has the skb_orphan which breaks TCP backpressure.. :/)

Either way, independent of that, I can craft a PR today to bump the sch->limit.. this was on my todo list anyway for some time.
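As an aside, one hedged way to confirm which routing mode is active, assuming a Cilium version whose agent `cilium status` output includes a "Host Routing" line:

```
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i "host routing"
```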
Yeah, in aggregate it's one ID per Pod (== netns; a Pod can hold one or more containers sharing the same netns). The BPF program on the host veth device knows that all traffic going through that device is from the given Pod, and each Pod has a unique ID on the node.
@borkmann I am not sure if we are talking past each other or not? In the 100mbit FIFO case sch->limit (and/or sch_fq) should be knocked way down, not up. 100 packets tops. In the 10gbit case, internal to a container, no more than 2ms is needed. It's worse than that, in that with gso present a "packet" can get bulked up by 42x, which is why byte limits are better than packet limits.
@tohojo the results so far are from the rrul test, which on a packet limit floods the queue with short acks and starves the data path somewhat, while taking 1/15th the time to transmit. So why it is merely 400ms, not 1.3 seconds (rule of thumb: it takes 13ms for 1Mbit, 130us for 100mbit) is explained by that. A pure upload test may well hit 1.2s (but due to exhausting other limits might take multiple flows to hit). See also: https://www.duo.uio.no/bitstream/handle/10852/45274/1/thesis.pdf

I do not know much about sch_fq, I thought it naively used 100 packets per flow? I recall Eric Dumazet suggesting putting any form of ECN with a 5ms brick-wall limit over the whole qdisc before EDT came out. BBR internal to Google has an ECN response. tsq does regulate simple things fairly well, but going back to the debate: systemd/systemd#9725 (comment) I won that debate, then. :) I was hoping this new facility truly extended containers to be doing more of the right thing.

A good simple test would be 4 saturating flows + a typical request/response payload for the application (sized to fit initcwnd) or tcp_rr (which will understate, being only 5 packets), as sketched below. My guess is that the tps (transactions per second) at fq 100mbit, 10k packets would be 1000x worse than fq_codel.
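A rough sketch of that kind of test with netperf (server hostname and durations are placeholders; the TCP_RR result is reported in transactions per second):

```
# four saturating bulk flows in the background
for i in 1 2 3 4; do netperf -H "$SERVER" -t TCP_STREAM -l 60 & done
# a request/response flow alongside them; the figure of merit is transactions/sec
netperf -H "$SERVER" -t TCP_RR -l 60 -- -r 1,1
```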
Just following up on a couple of things:

I indeed ran the benchmarks with Legacy Host Routing. I ran some more with eBPF Host Routing and the results are much better: the download throughput is now maxed out and the latency doesn't spike as much. Here are the results for comparison (all of these are from …):

**Legacy Host Routing**

**eBPF Host Routing**

I also gathered the qdisc stats on client nodes after running these benchmarks (I booted the node, ran 1 benchmark, gathered the stats and then killed the node).

**Stats after running the benchmark with Legacy Host Routing**

**Stats after running the benchmark with eBPF Host Routing**
And I also agree that the RRUL test is not very representative of a real container workload. As I mentioned, we have many hundreds of very different applications so it's a bit hard to find something representative tbh... I might try out
And also look for a couple of candidate applications to capture packets from as well.
Any progress here? 15ms is really miserable. It should be not much more than 250us at 100mbit. See also: https://blog.tohojo.dk/2023/12/the-big-fifo-in-the-cloud.html
I used iperf to test the Cilium bandwidth manager recently, and found a similar issue with its UDP throughput. My limit is 10Mbps (set via the egress-bandwidth annotation).
Thanks to @tohojo's explanation, I can see there is no backpressure for UDP and that it causes the drops. In my case, when sending at 10Mbps the limit is also 10Mbps, while the received throughput is smaller (7Mbps). I can see `flow_plimit` increases from `tc -s qdisc`; could you help me understand why this still hits the `flow_limit 100` since the sender is not exceeding the limit?
kangjie ***@***.***> writes:
I can see there is no backpressure for UDP from @tohojo 's explanation
and it causes the drop. In my case, when sending at 10Mbps, the limit
is also 10Mbps, while the received throughput is smaller(7Mbps) . I
can see `flow_plimit` increases from `tc -s qdisc`, could you help me
understand why this still hits the `flow_limit 100` since the sender's
not exceeding the limit.
My guess would be that this is due to a difference in what iperf
considers to be 10Mbit/s and what the shaper does. Running a quick test
on my own machine, when sending at 10Mbit/s iperf sleeps somewhere
between 960 and 985 usecs between each frame. Whereas the interpacket
gap between 1500-byte frames is 1200 usecs (1500*8/10000000), or 1227
(1534*8/10000000) usecs if you account for the ethernet framing.
That's a roughly 20% difference, which is not too far off from the
packet loss rate quoted in your iperf output above...
A quick status update on our end: we've paused our work on the Bandwidth Manager for now, while we migrate our fleet to eBPF Host Routing, which will take us a bit of time. Also, thank you @tohojo for the very well-written blog post! FWIW, I am also planning on sharing some of our findings in a short presentation at Cilium + eBPF Day in Paris in March (not sure how much I'll be able to fit in a 5-minute window but we'll see 😄)
Anton Ippolitov ***@***.***> writes:
> Any progress here?
A quick status update on our end: we've paused our work on the Bandwidth Manager for now, while we migrate our fleet to eBPF Host Routing which will take us a bit of time.
Also thank you @tohojo for the very well written blog post! FWIW, I am
also planning on sharing some of our findings in a short
[presentation](https://colocatedeventseu2024.sched.com/event/e9ce04de5abd9be45a47e7e90752c0a0)
at Cilium + eBPF day in Paris in March (not sure how much I'll be able
to fit in a 5 minute window but we'll see 😄)
Thanks for the update! Glad you enjoyed the blog post! :)
I would be tickled if you tried cake instead, and enabled ECN sender side: `tc qdisc replace dev the_interface root cake bandwidth 100Mbit rtt 5ms`. The above rtt is sized more or less correctly for codel and the target bandwidth within containers (250us target).
In the hope this might spawn a little out of the box thinking on this bug: https://www.youtube.com/watch?v=rWnb543Sdk8&t=2603s
ping?
Hi Dave, I have it on my roadmap to work on this likely for Cilium 1.17. Also, thanks for the pointer to your talk! |
For the current design, can we alleviate the latency problem by setting a smaller drop horizon (smaller queue length) in the code?
2 seconds for traffic within containers on the same device is kind of nuts...
Currently the bandwidth manager is enforcing a rate limit for flows in one pod, and the flows in one pod are sharing one queue. It uses a tail-drop policy with a threshold of 2 seconds. This can cause bufferbloat and 2-second queuing latency when there are many TCP connections. Here we introduce ECN marking to solve the issue; by default, the marking threshold is set to 1ms.

For tests, we had a pod with a 100Mbps egress limit and 128 TCP connections in the pod as background traffic, and we compared the TCP_RR latency:

| Method | Avg Latency |
| - | - |
| with-ECN | 3.1ms |
| without-ECN | 2247.3ms |

Fixes: cilium#29083
Signed-off-by: Kangjie Xu <[email protected]>
@dtaht we are a happy user of tc-cake. I wish you a happy Thanksgiving and a terrific holiday season!

tc-cake has been used to throttle the ingress bandwidth of trino clusters for several months. Peak bandwidth of all the trino containers is larger than 1.5 Tb/s. Lots of our services are latency-sensitive and their p99 latency is expected to be less than several milliseconds.

Let me share the stats of one of our k8s nodes:

```
# tc -s qdisc show dev ifb4eth0
qdisc mq 1: root
Sent 1116849779461062 bytes 1651798624 pkt (dropped 2454154, overlimits 1560279534 requeues 0)
backlog 0b 0p requeues 0
qdisc cake 8001: parent 1:1 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
Sent 558297167301650 bytes 2890888094 pkt (dropped 1192093, overlimits 2916917559 requeues 0)
backlog 0b 0p requeues 0
memory used: 4281088b of 4Mb
capacity estimate: 8Gbit
min/max network layer size: 46 / 1500
min/max overhead-adjusted size: 84 / 1538
average network hdr offset: 14
Bulk Best Effort Voice
thresh 500Mbit 8Gbit 2Gbit
target 50us 50us 50us
interval 1ms 1ms 1ms
pk_delay 91us 12us 0us
av_delay 20us 2us 0us
sp_delay 0us 1us 0us
backlog 0b 0b 0b
pkts 182803950 4192489032 0
bytes 368114945149999 190202052026351 0
way_inds 225131238 599585639 0
way_miss 533872792 2940120670 0
way_cols 0 943 0
drops 826120 365973 0
marks 51784581 46454061 0
ack_drop 0 0 0
sp_flows 1 2 0
bk_flows 0 0 0
un_flows 0 0 0
max_len 68519 68519 0
quantum 3028 3028 3028
qdisc cake 8002: parent 1:2 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
Sent 558552612159412 bytes 3055877826 pkt (dropped 1262061, overlimits 2938329271 requeues 0)
backlog 0b 0p requeues 0
memory used: 4281472b of 4Mb
capacity estimate: 8Gbit
min/max network layer size: 46 / 1500
min/max overhead-adjusted size: 84 / 1538
average network hdr offset: 14
Bulk Best Effort Voice
thresh 500Mbit 8Gbit 2Gbit
target 50us 50us 50us
interval 1ms 1ms 1ms
pk_delay 22us 3us 0us
av_delay 2us 1us 0us
sp_delay 0us 1us 0us
backlog 0b 0b 0b
pkts 193810664 4196261164 0
bytes 368251301980545 190321728536187 0
way_inds 225356579 604529018 0
way_miss 533784418 2930662682 0
way_cols 0 707 0
drops 839621 422440 0
marks 51804336 46385740 0
ack_drop 0 0 0
sp_flows 1 1 0
bk_flows 0 1 0
un_flows 0 0 0
max_len 68519 68519 0
quantum 3028 3028 3028
```

```
# tc -s qdisc show dev ifb4eth1
qdisc mq 1: root
Sent 1118804089717826 bytes 2944733618 pkt (dropped 2501391, overlimits 1760244539 requeues 0)
backlog 0b 0p requeues 0
qdisc cake 8003: parent 1:1 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
Sent 559677882348699 bytes 3804882758 pkt (dropped 1223367, overlimits 3044223444 requeues 0)
backlog 0b 0p requeues 0
memory used: 4281216b of 4Mb
capacity estimate: 8Gbit
min/max network layer size: 46 / 1500
min/max overhead-adjusted size: 84 / 1538
average network hdr offset: 14
Bulk Best Effort Voice
thresh 500Mbit 8Gbit 2Gbit
target 50us 50us 50us
interval 1ms 1ms 1ms
pk_delay 58us 34us 0us
av_delay 7us 2us 0us
sp_delay 0us 0us 0us
backlog 0b 0b 0b
pkts 242564561 4274611928 0
bytes 368332249638520 191365928859171 0
way_inds 225554613 606753461 0
way_miss 533905978 2932321244 0
way_cols 0 3117 0
drops 841871 381496 0
marks 52162961 46514679 0
ack_drop 0 0 0
sp_flows 0 22 0
bk_flows 1 0 0
un_flows 0 0 0
max_len 68519 68519 0
quantum 3028 3028 3028
qdisc cake 8004: parent 1:2 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
Sent 559126207369127 bytes 3434818156 pkt (dropped 1278024, overlimits 3010988391 requeues 0)
backlog 0b 0p requeues 0
memory used: 4181Kb of 4Mb
capacity estimate: 8Gbit
min/max network layer size: 46 / 1500
min/max overhead-adjusted size: 84 / 1538
average network hdr offset: 14
Bulk Best Effort Voice
thresh 500Mbit 8Gbit 2Gbit
target 50us 50us 50us
interval 1ms 1ms 1ms
pk_delay 50us 5us 0us
av_delay 12us 1us 0us
sp_delay 0us 0us 0us
backlog 0b 0b 0b
pkts 238720122 4251216926 0
bytes 368319815321517 190827026851471 0
way_inds 221314827 605251303 0
way_miss 533859584 2926938123 0
way_cols 0 480 0
drops 848823 429201 0
marks 52048251 46370197 0
ack_drop 0 0 0
sp_flows 0 12 0
bk_flows 0 0 0
un_flows 0 0 0
max_len 68519 68519 0
quantum 3028 3028 3028
```
50us is mind-blowing. One other thing that stands out for me is the very low hash collision rate - the theory said an 8-way set-associative hash would work, and seeing it work is great!
@wenjianhn you might drop fewer packets if you hand cake a bit more memory via the memlimit parameter - say 8 MBytes rather than 4. What that will do to your p99 would have to be measured, though!
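For example, a hedged sketch of that change against one of the cake instances from the dump above (assuming `tc qdisc change` on cake leaves the unspecified parameters as configured):

```
tc qdisc change dev ifb4eth0 parent 1:1 cake memlimit 8mb
```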
If you're interested in jitter, there's a nice video here |
Hi!

(First of all, a disclaimer: I am not trying to start an endless debate like systemd/systemd#9725 or a flamewar 🙂 I am mostly interested in understanding the technical reasons why the Cilium Bandwidth Manager is using `fq` under the hood and if it would make sense to give users an option to use `fq_codel` instead.)

In the Bandwidth Manager code, I can see that it sets `net.core.default_qdisc` to `fq` and then forces the `fq` qdisc on all relevant devices.

Based on the docs, it looks like the choice of `fq` was originally made in order to have an option to use BBR? I don't think that in 2023 it's a requirement anymore: according to this thread, `fq` was required for BBR because it supports TCP pacing, but it seems that `fq_codel` can also work now as well.

I was also curious about the performance impact of `fq` vs `fq_codel` when Cilium Bandwidth Manager rate-limiting was in place. In order to test this, I ran a set of benchmarks. My benchmark setup was as follows: `fq` vs `fq_codel`, with the Cilium Bandwidth Manager egress limit enforced vs unenforced. When the egress limit was set, it was limiting the Flent client bandwidth to 100Mbit/s with `kubernetes.io/egress-bandwidth: "100M"` (you can see its effect on the "Upload" graphs below). In order to try out the Bandwidth Manager + `fq_codel` scenarios, I manually replaced the qdiscs on all devices with `tc qdisc del` / `tc qdisc add` (see the sketch below).

More details about the instance types which were used: `m5.8xlarge` (AWS) / `n2-standard-8` (GCP) and `m5.large` (AWS) / `n2-standard-2` (GCP).

The results are generally consistent across Cloud Providers and instance types:

- Without the egress limit, `fq` and `fq_codel` are in the same latency ballpark and can easily attain the instances' max throughput and maintain it consistently (beware that you need to multiply all throughput numbers by 4 on all graphs because they show throughput by flow).
- With the egress limit, `fq_codel` is able to maintain low latency (< 2 ms) and is able to keep the same max download throughput as with Cilium Bandwidth Manager rate-limiting turned off. When it comes to `fq`, the situation seems to be much worse: the latency skyrockets (>400ms in worst cases, so ~200x worse than `fq_codel`) and the download throughput suffers significantly despite only the egress bandwidth being limited. The only advantage of `fq` is that the numbers are much more uniform compared to `fq_codel`, which is much more jittery.

Here are all the benchmark results 👇

So the main question is whether it would make sense to give users the option to run the Bandwidth Manager with `fq_codel`?

I am also curious to hear your thoughts about the benchmarks: is there anything I'm missing here? Is there any other metric to look at?
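For concreteness, the manual qdisc swap described above looked roughly like this; the device name is an example, and the Cilium agent may re-apply `fq` when it restarts:

```
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root fq_codel
```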