
CFP: Bandwidth Manager with fq_codel? #29083

Open
antonipp opened this issue Nov 9, 2023 · 38 comments
Labels
feature/bandwidth-manager (Impacts BPF bandwidth manager), kind/cfp, kind/feature (This introduces new functionality), sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages)

Comments

@antonipp
Contributor

antonipp commented Nov 9, 2023

Hi!

(First of all, a disclaimer: I am not trying to start an endless debate like systemd/systemd#9725 or a flamewar 🙂 I am mostly interested in understanding the technical reasons why the Cilium Bandwidth Manager is using fq under the hood and if it would make sense to give users an option to use fq_codel instead)

In the Bandwidth Manager code, I can see that it sets net.core.default_qdisc to fq and then forces the fq qdisc on all relevant devices.
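For reference, this is easy to verify on a node once the Bandwidth Manager is enabled (the device name is just an example from my GCP nodes):

sysctl net.core.default_qdisc   # reports fq once the Bandwidth Manager manages the qdiscs
tc qdisc show dev ens4          # shows an mq root with per-queue fq children, plus clsact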

Based on the docs, it looks like the choice of fq was originally made in order to have the option to use BBR? I don't think that in 2023 it's a requirement anymore: according to this thread, fq was required for BBR because it supports TCP pacing, but it seems that fq_codel now works as well:

Yes, it is fine to use BBR with fq_codel on recent kernels.
For kernels v4.20 and later, BBR will use the Linux TCP-layer pacing if the connection notices that there is no qdisc on the sending host implementing pacing.

I was also curious about the performance impact of fq vs fq_codel when Cilium Bandwidth Manager rate-limiting was in place. In order to test this, I ran a set of benchmarks. My benchmark setup was as follows:

  • I ran the standard 60-second "Realtime Response Under Load" test from the Flent test suite 1.3.2. The test creates 4 TCP flows doing downloads and 4 TCP flows doing uploads and tries to saturate the links. Latency is measured with ICMP and UDP pings.
  • I did 4 benchmarks: fq vs fq_codel, with the Cilium Bandwidth Manager egress limit enforced vs unenforced. When the egress limit was set, it limited the Flent client bandwidth to 100Mbit/s with kubernetes.io/egress-bandwidth: "100M" (you can see its effect on the "Upload" graphs below). In order to try out the Bandwidth Manager + fq_codel scenarios, I manually replaced the qdiscs on all devices with tc qdisc del / tc qdisc add (example commands are sketched after the table below).
  • I benchmarked 2 instances in AWS and 2 in GCP. In each Cloud Provider I chose one "small" instance with 2 vCPU and 2 NIC driver queues and one bigger instance with at least 8 vCPU and exactly 8 NIC driver queues.
  • I ensured that client & server instances are in the same AZ and that the Flent clients and servers are the only workload pods running on these instances.
  • The TCP Congestion Control algorithm was CUBIC in all tests
More details about the instance types which were used:

| Cloud Provider | Instance Type | vCPU | Egress bandwidth | NIC queues |
| --- | --- | --- | --- | --- |
| AWS | m5.8xlarge | 32 | 10 Gbps | 8 |
| GCP | n2-standard-8 | 8 | 16 Gbps | 8 |
| AWS | m5.large | 2 | 10 Gbps [up to] | 2 |
| GCP | n2-standard-2 | 2 | 10 Gbps | 2 |
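For anyone who wants to reproduce this, the runs boiled down to roughly the following (pod names and the device name are placeholders, and the exact flent flags may have differed slightly):

# enable the egress limit on the Flent client pod
kubectl annotate pod flent-client kubernetes.io/egress-bandwidth=100M

# for the Bandwidth Manager + fq_codel scenarios, manually swap the qdisc on the relevant devices
tc qdisc del dev ens4 root
tc qdisc add dev ens4 root fq_codel

# 60-second RRUL run from the client pod against the server pod
flent rrul -l 60 -H flent-server -t "bw-on-fq-codel"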

The results are generally consistent across Cloud Providers and instance types:

  • When the Cilium Bandwidth Manager rate-limiting is off, both fq and fq_codel are in the same latency ballpark and can easily attain the instances' max throughput and maintain it consistently (beware that you need to multiply all throughput numbers by 4 on all graphs because they show throughput per flow)
  • When the Cilium Bandwidth Manager rate-limiting is on, fq_codel is able to maintain low latency (< 2 ms) and keeps the same max download throughput as with the rate-limiting turned off. When it comes to fq, the situation seems to be much worse: the latency skyrockets (> 400 ms in the worst cases, so ~200x worse than fq_codel) and the download throughput suffers significantly despite only the egress bandwidth being limited. The only advantage of fq is that the numbers are much more uniform, compared to fq_codel which is much more jittery.
Here are all the benchmark results 👇

01-flent-benchmarks-aws-m5-8xlarge
02-flent-benchmarks-gcp-n2-standard-8
03-flent-benchmarks-aws-m5-large
04-flent-benchmarks-gcp-n2-standard-2

So the main question is: would it make sense to give users the option to run the Bandwidth Manager with fq_codel?
I am also curious to hear your thoughts about the benchmarks: is there anything I'm missing here? Is there any other metric to look at?

@antonipp antonipp added the kind/feature This introduces new functionality. label Nov 9, 2023
@hemanthmalla hemanthmalla added feature/bandwidth-manager Impacts BPF bandwidth manager. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Nov 9, 2023
@dtaht

dtaht commented Nov 12, 2023

Thank you for doing these benchmarks! I was very concerned when I heard of this new design that they had not taken a hard look at in-flow latency.

@dtaht

dtaht commented Nov 12, 2023

Several suggestions, if you have time:

  1. Try cake with its integral shaper in "besteffort" mode
  2. Try enabling ECN and TCP CUBIC on both sides of the container, and repeat the series. This will make fq_codel and cake less jittery: sudo sysctl -w net.ipv4.tcp_ecn=1
    (you can do this on a route-specific basis as well, which makes more sense for containers; both options are sketched below)
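Roughly, as a sketch (interface, rate, and route are placeholders; adapt to your setup):

tc qdisc replace dev eth0 root cake bandwidth 100Mbit besteffort   # cake's integral shaper, no diffserv tins
sysctl -w net.ipv4.tcp_ecn=1                                       # negotiate ECN for all TCP
ip route change 10.244.0.0/16 dev eth0 features ecn                # or enable ECN only for a specific route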

The designers of this Cilium subsystem were claiming that it was BBR (which does not support RFC 3168 ECN) that was doing the magic, but to my eye it is merely massively excessive latency that makes it look more uniform. I am under the impression, however, that at least kernel 6.1 was needed for Cilium + fq + BBR to work "correctly"?

One fq_codel/cake trick few DC users apply: for short paths you can actually tune fq_codel's target down quite a bit, to achieve the same bandwidth at much lower within-stream latency. I regularly run fq_codel on bare metal with a target of 250us and an interval of 5ms in the one transiting-a-DC-only app I have (the cross-DC latency is about 4ms tops). It is really hard to measure improvements below a ms, however! I have often wondered how low you could go across containers with ECN enabled. The theory behind the CoDel algorithm is that you could go down to one MTU and 2 RTTs, were it implemented directly in hardware.
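Something like this, purely as a sketch (device is a placeholder; on an mq NIC you would set it on each child queue rather than the root):

tc qdisc replace dev eth0 root fq_codel target 250us interval 5ms ecn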

Lastly, as much as I love the rrul test, most container workloads are pretty unidirectional, and I might start by taking packet captures of your application in deployment and trying to model that?

otherwise my next test in your test series would be more like --step-size=.05 -x --test-parameter=upload_streams=4 --socket-stats tcp_nup
to just look at uploads through the shaper. --socket-stats on the upload portion of many tests will give you plots of the underlying tcp rtts.
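i.e. roughly (server name is a placeholder; plot names depend on your flent version):

flent tcp_nup --step-size=.05 -x --test-parameter=upload_streams=4 --socket-stats -H flent-server -l 60
# then plot the socket-stats RTTs from the resulting .flent.gz, e.g. with -p tcp_rtt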

@dtaht

dtaht commented Nov 12, 2023

https://blog.cerowrt.org/post/juniper/ has a simple script to sweep from 1-64 flows in it.

@tohojo

tohojo commented Nov 12, 2023

Hmm, so I have a hunch about what's the cause of this behaviour, but I don't have time to dig into the details right now. But just to see if I'm completely off base, could you try running a UDP-based bandwidth test as well? Just using iperf2 with iperf -u -s on one end and iperf -u -c HOSTNAME -b 200M on the other end and seeing what actual bandwidth it reports at the end of the test should be sufficient (the traffic goes from the client to the server, so the client should be inside the rate-limited container). Don't worry about measuring latency for that test, I just want to see how effective the bandwidth enforcement is for UDP traffic with each qdisc.

@antonipp
Contributor Author

Thank you both for your comments! I took some time to look into them.

First of all, regarding the "UDP-based bandwidth test", I tried it out on a pair of n2-standard-8 nodes in GCP. I left the default fq setup and rate-limited the client to 100Mbps and this is the result I got:

# iperf --udp --client 10.19.1.46 --bandwidth 200M --enhancedreports
------------------------------------------------------------
Client connecting to 10.19.1.46, UDP port 5001 with pid 101
Sending 1470 byte datagrams, IPG target: 56.08 us (kalman adjust)
UDP buffer size: 16.0 MByte (default)
------------------------------------------------------------
[  3] local 10.19.0.249 port 48567 connected with 10.19.1.46 port 5001
[ ID] Interval            Transfer     Bandwidth      Write/Err  PPS
[  3] 0.0000-10.0001 sec   250 MBytes   210 Mbits/sec  178330/0    17832 pps
[  3] Sent 178330 datagrams
[  3] Server Report:
[  3]  0.0-12.0 sec  3.81 MBytes  2.66 Mbits/sec   0.077 ms 175618/178334 (98%)
[  3] 0.0000-11.9995 sec  1 datagrams received out-of-order

Interestingly enough, there was 98% packet loss (this is the 175618/178334 (98%) figure reported by the server).

Once I removed the kubernetes.io/egress-bandwidth annotation to disable rate-limiting, the packet loss was gone:

# iperf --udp --client 10.19.1.46 --bandwidth 200M --enhancedreports
------------------------------------------------------------
Client connecting to 10.19.1.46, UDP port 5001 with pid 104
Sending 1470 byte datagrams, IPG target: 56.08 us (kalman adjust)
UDP buffer size: 16.0 MByte (default)
------------------------------------------------------------
[  3] local 10.19.0.249 port 48935 connected with 10.19.1.46 port 5001
[ ID] Interval            Transfer     Bandwidth      Write/Err  PPS
[  3] 0.0000-10.0001 sec   250 MBytes   210 Mbits/sec  178329/0    17832 pps
[  3] Sent 178329 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec   250 MBytes   210 Mbits/sec   0.018 ms    0/178329 (0%)
[  3] 0.0000-9.9989 sec  6 datagrams received out-of-order

Regarding

Try cake with it's integral shaper in "besteffort" mode

To give a bit more context, we are mostly interested in fq_codel specifically. The reason is that we are currently running Cilium + fq_codel on 10000s of nodes, and we want to enable the Bandwidth Manager on all of these nodes while minimizing the number of changes to our infrastructure.

This is why the forced switch to fq when the Bandwidth Manager is enabled is problematic: it increases the delta from our current configuration. So using cake does sound like an interesting avenue to explore, but ideally we would switch things one at a time: first enable the Bandwidth Manager, and then potentially look into changing the qdiscs on all of our nodes.

Try enabling ecn and tcp cubic on both sides of the container, and repeat the series

I tried this out by setting

  securityContext:
    sysctls:
      - name: net.ipv4.tcp_ecn
        value: "1"

on both pods and running sysctl -w net.ipv4.tcp_ecn=1 on the hosts (it used to be set to 2 in my previous tests). CUBIC was already set up everywhere. Unfortunately, I didn't observe any noticeable difference; there is still a fair amount of jitter:

Here are the graphs

AWS m5.8xlarge benchmark with ECN on:

02-rrul-fq-codel-bw-on-ecn-on-aws

GCP n2-standard-8 benchmark with ECN on:

02-rrul-fq-codel-bw-on-ecn-on
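(For completeness, a quick way to confirm that ECN was actually negotiated on the test flows, run from inside the client pod; the flag names vary a bit between ss versions:)

ss -ti state established | grep -c ecn   # connections that negotiated ECN carry an "ecn"/"ecnseen" flag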

I might start by taking packet captures of your application in deployment and trying to model that

Yes, this is definitely on my to-do list. We have 100s of applications running in our infra, but I do have some in mind which are quite sensitive to latency and throughput and which would be good candidates for more benchmarking. It's a bit of a hassle to benchmark them though, because I will need to sync with application teams first, which is why I wanted to get as many insights as possible from synthetic benchmarks and only then move to real applications 😄

otherwise my next test in your test series would be more like --step-size=.05 -x --test-parameter=upload_streams=4 --socket-stats tcp_nup
to just look at uploads through the shaper. --socket-stats on the upload portion of many tests will give you plots of the underlying tcp rtts.

Added this to my to-do list as well, will try to get to it this week.

@tohojo

tohojo commented Nov 15, 2023 via email

@antonipp
Contributor Author

What about with egress-bandwidth set, but fq_codel as the qdisc?

Hm, I switched to fq_codel, but the situation is even worse with egress-bandwidth set: it looks like the server isn't getting any packets at all (or at least the client isn't getting any ACKs). I haven't had the chance to dig into it yet, but here are the results:

# iperf --udp --client 10.19.0.30 --bandwidth 200M --enhancedreports
------------------------------------------------------------
Client connecting to 10.19.0.30, UDP port 5001 with pid 24
Sending 1470 byte datagrams, IPG target: 56.08 us (kalman adjust)
UDP buffer size: 16.0 MByte (default)
------------------------------------------------------------
[  3] local 10.19.0.8 port 40141 connected with 10.19.0.30 port 5001
[  3] WARNING: did not receive ack of last datagram after 10 tries.
[ ID] Interval            Transfer     Bandwidth      Write/Err  PPS
[  3] 0.0000-10.0000 sec   250 MBytes   210 Mbits/sec  178329/0    17832 pps
[  3] Sent 178329 datagrams

The situation is better with the egress-bandwidth annotation removed:

# iperf --udp --client 10.19.0.30 --bandwidth 200M --enhancedreports
------------------------------------------------------------
Client connecting to 10.19.0.30, UDP port 5001 with pid 30
Sending 1470 byte datagrams, IPG target: 56.08 us (kalman adjust)
UDP buffer size: 16.0 MByte (default)
------------------------------------------------------------
[  3] local 10.19.0.8 port 53212 connected with 10.19.0.30 port 5001
[ ID] Interval            Transfer     Bandwidth      Write/Err  PPS
[  3] 0.0000-10.0001 sec   250 MBytes   210 Mbits/sec  178330/0    17832 pps
[  3] Sent 178330 datagrams
[  3] Server Report:
[  3]  0.0-430.7 sec   250 MBytes  4.87 Mbits/sec   0.021 ms    0/178330 (0%)
[  3] 0.0000-430.7236 sec  24 datagrams received out-of-order

Here are the queue parameters btw:

$ sudo tc qdisc show dev ens4
qdisc mq 8003: root
qdisc fq_codel 0: parent 8003:8 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent 8003:7 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent 8003:6 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent 8003:5 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent 8003:4 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent 8003:3 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent 8003:2 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent 8003:1 limit 10240p flows 1024 quantum 1474 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc clsact ffff: parent ffff:fff1

@dtaht

dtaht commented Nov 17, 2023

I am pleased you are running fq_codel in the first place and would love a dump from prod of, say 100 "boxes", of
tc -s qdisc show
so as to see packets, drops, marks, backlogs, and reschedules. Reschedules = FQ is working (but how often, compared to the total packets?). Backlog implies queues are forming within fq_codel, as well as in lower layers. Marks and drops are the AQM actually needing to "do stuff", also relative to the number of packets.

In the non-rate limited k8 case I have generally assumed it was mq doing a goodly percentage of the FQ in the first place.

I usually post process this stuff with awk. I am obsolete.

If you are a JSON wizard tc -s -j qdisc show and go to town.
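e.g. something along these lines (stat field names as emitted by recent iproute2; adjust to taste):

tc -s -j qdisc show dev eth0 | jq '.[] | {kind, handle, packets, drops, requeues, backlog}'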

@dtaht

dtaht commented Nov 17, 2023

quantum 1474? Not 1514? why?

@dtaht

dtaht commented Nov 17, 2023

Lastly, I think we have differing interpretations of "jitter" on the rrul test, and I would love you to describe how you were thinking about it before I launch into my standard lecture, so I can maybe lecture sanely to others in the future. I note that I used the word wrongly myself in my initial comment.

I am happy you are delving into this. It seemed 99.9999% of the k8 crowd thinks tcp is a function call. I was once able to save a cloudy user 80% of their bandwidth bill by cutting tcp_notsent_lowat down to something reasonable.... there are other things worth tuning, like initcwnd...

anyway in this topology:

(internet) <- web proxy <- local containers

The local container interfaces can be configured down to target 250us, interval 2.5ms on bare metal, especially with ECN enabled.

As for the "jitter" thing, you can possibly see the difference I was expecting via a CDF plot comparing before/after ECN on the tcp_nup TCP RTT. I also note the raw JSON files in flent contain very little sensitive info (IP addresses, qdiscs), and Toke and I are really good at flipping through them. The -x option, however, gathers more than most corps would like.

@borkmann
Member

borkmann commented Nov 22, 2023

Hello, sorry for the delay due to conference travel.. digesting this thread a bit:

[...]

In the Bandwidth Manager code, I can see that it sets net.core.default_qdisc to fq and then forces the fq qdisc on all relevant devices.

Yep.

Based on the docs, it looks like the choice of fq was originally made in order to have an option to use BBR? I don't think that in 2023 it's a requirement anymore: according to this thread, fq was required for BBR because it supports TCP pacing but it seems that fq_codel can also work now as well:

The initial motivation for fq was to do EDT for the Cilium Bandwidth Manager (see also https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF). As far as I know, fq_codel does not support EDT. Compare https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq.c#n530 vs https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq_codel.c#n184 . See also https://lore.kernel.org/netdev/[email protected]/ . We make use of EDT for the kubernetes.io/egress-bandwidth annotation.

BBR for Pods etc. needs a v5.18+ kernel; otherwise pacing is broken, given that the skb->tstamp (delivery timestamp) is cleared upon netns traversal and the rates fluctuate. The relevant fixes, which are part of these kernels onwards, are https://lore.kernel.org/bpf/[email protected]/ .

Yes, it is fine to use BBR with fq_codel on recent kernels.
For kernels v4.20 and later, BBR will use the Linux TCP-layer pacing if the connection notices that there is no qdisc on the sending host implementing pacing.

I was also curious about the performance impact of fq vs fq_codel when Cilium Bandwidth Manager rate-limiting was in place. In order to test this, I ran a set of benchmarks. My benchmark setup was as follows:

Again, fq_codel does not support EDT, hence the fq_codel measurement is "broken" here, given that the skb->tstamps set by the Cilium Bandwidth Manager's BPF code are ignored.

  • I ran the standard 60 second "Realtime Response Under Load" test from the Flent test suite 1.3.2. The test creates 4 TCP flows doing downloads and 4 TCP flows doing uploads and tries to saturate the links. Latency is measured with ICMP and UDP pings.
  • I did 4 benchmarks: fq vs fq_codel with Cilium Bandwidth Manager egress limit enforced vs unenforced. When the egress limit was set, it was limiting the Flent client bandwidth to 100Mbit/s with kubernetes.io/egress-bandwidth: "100M" (You can see its effect on the "Upload" graphs below). In order to try out the Bandwidth Manager + fq_codel scenarios, I manually replaced the qdiscs on all devices with tc qdisc del / tc qdisc add.
  • I benchmarked 2 instances in AWS and 2 in GCP. In each Cloud Provider I chose one "small" instance with 2 vCPU and 2 NIC driver queues and one bigger instance with at least 8 vCPU and exactly 8 NIC driver queues.
  • I ensured that client & server instances are in the same AZ and that the Flent clients and servers are the only workload pods running on these instances.
  • The TCP Congestion Control algorithm was CUBIC in all tests

Btw, semi-related, did you also try to run a 6.7-rc2 kernel? These ones got merged there: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b49a948568dcbb5f38cbf5356ea0fb9c9c6f6953 . Also priority bands landed recently (https://netdev.bots.linux.dev/netconf/2023/eric.pdf).

@borkmann borkmann self-assigned this Nov 22, 2023
@antonipp
Contributor Author

Thank you @borkmann !

I looked into your main point

fq_codel does not support EDT, hence the fq_codel measurement is "broken"

and I believe you are right; my fq_codel benchmark results seem to have been flawed: the egress rate-limiting that I saw in these benchmarks was achieved as a side-effect of horizon drops here, meaning that so many packets were dropped that the bandwidth was effectively capped at ~100Mbps.

I removed these two lines, then re-ran a benchmark with fq_codel and the bandwidth was essentially uncapped:

So it looks like using fq_codel with the Bandwidth Manager is not an option after all (unless we somehow add EDT support there)

did you also try to run a 6.7-rc2 kernel?

I haven't had the chance to try it out yet, we are still mainly running 5.15 everywhere. I will give it a try.

I am still curious to understand the performance issues I saw with fq when the Bandwidth Manager is enabled since we are not going to migrate to 6.7 any time soon. What would explain the latency + download throughput degradations?

Also thank you @dtaht for the suggestions.

I am pleased you are running fq_codel in the first place and would love a dump from prod of, say 100 "boxes", of
tc -s qdisc show so as to see packets, drops, marks, backlogs, and reschedules

It would indeed be interesting to look into better tuning fq_codel on our instances (and I'm sure we have room to improve), but it's slightly unrelated to the work I'm doing right now; I will try to come back to this later on.

quantum 1474? Not 1514? why?

This is a default value set by Google Cloud, probably their MTU of 1460 bytes + hardware header length of 14 bytes = 1474.

Lastly, I think we have a differing interpretation of "jitter" on the rrul and I would love you to describe how you were thinking about it

Yeah, I am not sure that "jitter" is the exact right term; I was trying to refer to the sharp oscillations looking like this:

image

As compared to smoother curves like this one:
image

@tohojo

tohojo commented Nov 23, 2023

I removed these two lines, then re-ran a benchmark with fq_codel and the bandwidth was essentially uncapped:

Ah! This is the behaviour I was expecting, and I couldn't for the life of me figure out why it wasn't happening; I totally missed the horizon drop thing in the BPF code, that explains it :)

I am still curious to understand the performance issues I saw with fq when the Bandwidth Manager is enabled since we are not going to migrate to 6.7 any time soon. What would explain the latency + download throughput degradations?

I wouldn't expect 6.7 to help with the latency issues you were seeing. This is caused by the way the bandwidth manager is implemented, AFAICT; basically it creates a virtual FIFO queue without implementing any kind of AQM or flow queueing, so the terrible latency is totally expected.

Specifically, the shaper logic here (right above the horizon drop thing you linked above), does a lookup into the rate config map using the previously set queue mapping as the key[0], finds the rate and the last timestamp, and sets a timestamp for the packet based on the rate and the length of the packet. So packets will be delayed in the fq qdisc until their virtual TX time, and since there's only a single running timestamp for each queue mapping ID, this in effect becomes a virtual FIFO, with the only limit being the horizon timestamp, as you found.

From a queueing algorithm behaviour PoV this is obviously not great. And doing better at this level is not so straightforward either: since the dequeue time is computed before each packet is queued, tail dropping is the only action possible to control the queue. I've seen a CoDel implementation that works in this mode at one point, so that would be possible, I guess; but flow queueing is not, really, at least not without temporarily going above the bandwidth limit after the fact. Also, just straight round-robin scheduling between flows becomes quite challenging to do in this "virtual queue of future transmission times" mode.

So based on the above, my recommendation would simply be "don't use the bandwidth manager unless you don't care about latency at all". I'm frankly a little puzzled that no one noticed this behaviour before; I guess it's the ever-present curse of "only benchmark TCP throughput"? IDK.

Or who knows, maybe I'm missing something fundamental here, and there's some reason why things are not as dire as I paint them above? If so, I would love to know what that reason is! :)

[0] Not quite sure how that is initially set, but seems to be coming from inside the pod? so veth queue id, I guess? anyway, doesn't look like there is more than at most a few IDs per pod.

@dtaht

dtaht commented Nov 23, 2023

I note that I do this sort of tuning for a living, and am between contracts at the moment.

Anyway, thank you all for getting into the methods and claims behind how this cilium subsystem works without me having to poke into it much. I would like to know how much the rtt grows before shuddering overmuch.

A) Packet captures would be nice. Seeing the rtt in both slow start and congestion avoidance.
B) BBR + EDT + short horizon is, I think, a means to do bandwidth shaping multicore?

In general the rrul test is a partial emulation of a BitTorrent workload, and has been a good proxy for creating "good" residential and small corporate network behaviors in the general case. Regrettably it has next to no bearing on the actual behaviors of a container workload, IMHO.

Measure those.

Most of these I have seen (with the exception of movie streaming) have been totally dominated by slow start, where tuning initcwnd and tcp_notsent_lowat matter most, and in the places I have been called in, leveraging fq_codel + ECN + cubic to manage it further, without packet drop. I never bothered to try and rate-shape anything before now, the problems always being more dominated by the behaviors of the flows from the web proxy outwards.

@dtaht

dtaht commented Nov 23, 2023

The "jitter" you observe on the rrul test is the conflation of two things: sampling error and actual bandwidth usage during the defined sample interval. It is normal for it to oscillate somewhat, because that is how the TCP sawtooth works as a function of the RTT; to go deep, see https://ee.lbl.gov/papers/congavoid.pdf, and to go really deep, see https://en.wikipedia.org/wiki/Lyapunov_stability

It is necessary for it to bounce around a bit in order for the internet not to collapse (RFC 970). The height is related to the RTT, but the flent sampling error is so huge on this sub-1us RTT that you are not getting a real picture of the carnage underneath at all.

Your second plot shows a classic example of slow convergence, where the first flows to start hog all the processor and bandwidth, and the later flows, due in part to the massively inflating RTT, take a longer time (15 seconds) to achieve equality. The FQ is not helping here, but somewhere there are some big buffers in this stack. A better test would be to start, say, two saturating flows and then measure transactions per second for a zillion other flows, netperf's tcp_rr test for example.

In other words, this plot
https://user-images.githubusercontent.com/17275833/285262486-c99feceb-4ece-48e2-909a-8c5daa11b058.png is massively better than the others because the latecomers to the party ramp up quickly (and are typically short), the opposite of your interpretation. Thank you for explaining how you saw this data differently than I did!

In both these cases, honestly, the underlying behavior of the container's real workload looks nothing like this, and the only way I can ever convince someone of this is for them to take 5 minutes or so of packet capture of it, rather than iperf/netperf test traffic, and tear it apart.

@tohojo

tohojo commented Nov 23, 2023

B) BBR + EDT + short horizon is, I think, a means to do bandwidth shaping multicore?

Well, no, not really, at least not in itself, for the reason I outlined above...

@dtaht

dtaht commented Nov 23, 2023

I would like a packet capture. And a glass of scotch.

@borkmann
Member

borkmann commented Nov 23, 2023

I am still curious to understand the performance issues I saw with fq when the Bandwidth Manager is enabled since we are not going to migrate to 6.7 any time soon. What would explain the latency + download throughput degradations?

I wouldn't expect 6.7 to help with the latency issues you were seeing. This is caused by the way the bandwidth manager is implemented, AFAICT; basically it creates a virtual FIFO queue without implementing any kind of AQM or flow queueing, so the terrible latency is totally expected.

Specifically, the shaper logic here (right above the horizon drop thing you linked above), does a lookup into the rate config map using the previously set queue mapping as the key[0], finds the rate and the last timestamp, and sets a timestamp for the packet based on the rate and the length of the packet. So packets will be delayed in the fq qdisc until their virtual TX time, and since there's only a single running timestamp for each queue mapping ID, this in effect becomes a virtual FIFO, with the only limit being the horizon timestamp, as you found.

The idea was to transfer https://netdevconf.info//0x14/pub/slides/55/slides.pdf / https://netdevconf.info//0x14/pub/papers/55/0x14-paper55-talk-paper.pdf for K8s and utilise the Pod egress rate annotations there.

[...]

Or who knows, maybe I'm missing something fundamental here, and there's some reason why things are not as dire as I paint them above? If so, I would love to know what that reason is! :)

Hm, I wonder also if we're hitting other things such as sch->limit with the defaults, I've heard about issues like these due to defaults being too low. Would be good to trace kfree_skb to double check.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq.c#n530
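e.g. something like this while a test is running (rough sketch):

perf record -e skb:kfree_skb -a -- sleep 30   # trace skb frees system-wide for 30s
perf script | head                            # the location field shows the call site of the drop
tc -s qdisc show dev ens4                     # and/or just compare the qdisc "dropped" counters before/after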

[0] Not quite sure how that is initially set, but seems to be coming from inside the pod? so veth queue id, I guess? anyway, doesn't look like there is more than at most a few IDs per pod.

Given all other skb fields such as mark etc are already used up, it basically stores the endpoint ID into queue_mapping which is preserved all the way through the stack and later resets it, so kernel picks queue via flow hash. Not great, but seems to function at least.

@tohojo

tohojo commented Nov 23, 2023

I am still curious to understand the performance issues I saw with fq when the Bandwidth Manager is enabled since we are not going to migrate to 6.7 any time soon. What would explain the latency + download throughput degradations?

I wouldn't expect 6.7 to help with the latency issues you were seeing. This is caused by the way the bandwidth manager is implemented, AFAICT; basically it creates a virtual FIFO queue without implementing any kind of AQM or flow queueing, so the terrible latency is totally expected.
Specifically, the shaper logic here (right above the horizon drop thing you linked above), does a lookup into the rate config map using the previously set queue mapping as the key[0], finds the rate and the last timestamp, and sets a timestamp for the packet based on the rate and the length of the packet. So packets will be delayed in the fq qdisc until their virtual TX time, and since there's only a single running timestamp for each queue mapping ID, this in effect becomes a virtual FIFO, with the only limit being the horizon timestamp, as you found.

The idea was to transfer https://netdevconf.info//0x14/pub/slides/55/slides.pdf / https://netdevconf.info//0x14/pub/papers/55/0x14-paper55-talk-paper.pdf for K8s and utilise the Pod egress rate annotations there.

[...]

Yup, I did recognise the approach from there. I'm also wondering why the Google folks didn't see these latency spikes. My best guess is that it's BBR throttling that kicks in early enough to mostly mask the FIFO behaviour. That, combined with some workload specific traffic properties may put you into "don't care about latency" territory for some deployments (especially if you're coming from a highly contended global HTB lock scenario)?

Or who knows, maybe I'm missing something fundamental here, and there's some reason why things are not as dire as I paint them above? If so, I would love to know what that reason is! :)

Hm, I wonder also if we're hitting other things such as sch->limit with the defaults, I've heard about issues like these due to defaults being too low. Would be good to trace kfree_skb to double check.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq.c#n530

Those limit drops should be visible in the qdisc stats, then (tc -s qdisc). The default limit for fq is 10k packets, which is around 1.2 seconds at 100Mbit (for 1500-MTU packets), so yeah, I guess you could hit that limit before you hit the horizon drop. However, TCP backpressure should also kick in before that (TSQ should still work all the way into the container, right?), which I think is supported by the results in the graphs above - we're a fair way from 1.2 seconds.
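Back-of-the-envelope, assuming full-size 1514-byte packets:

echo "scale=0; 1514*8*1000000/100000000" | bc   # ~121 us of virtual transmission time per packet at 100Mbit
echo "scale=2; 10000*1514*8/100000000" | bc     # ~1.21 s of queue at fq's default 10k-packet limit and 100Mbit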

Hmm, or maybe we're not, on the AWS instances, at least? There will be some smaller packets (ACKs) in the queue as well, so the ~400 ms could well be the queue overflow point? Maybe that's also the reason for the difference between the GCP and AWS results - i.e., differences in TCP stack backpressure effectiveness?

[0] Not quite sure how that is initially set, but seems to be coming from inside the pod? so veth queue id, I guess? anyway, doesn't look like there is more than at most a few IDs per pod.

Given all other skb fields such as mark etc are already used up, it basically stores the endpoint ID into queue_mapping which is preserved all the way through the stack and later resets it, so kernel picks queue via flow hash. Not great, but seems to function at least.

Right, OK, so it's basically one ID per container/pod? That's what I was assuming (as you'd want the limit to be global for that entity), I just didn't manage to trace the code back far enough to figure out where those IDs were coming from :)

@borkmann
Member

borkmann commented Nov 24, 2023

I am still curious to understand the performance issues I saw with fq when the Bandwidth Manager is enabled since we are not going to migrate to 6.7 any time soon. What would explain the latency + download throughput degradations?

I wouldn't expect 6.7 to help with the latency issues you were seeing. This is caused by the way the bandwidth manager is implemented, AFAICT; basically it creates a virtual FIFO queue without implementing any kind of AQM or flow queueing, so the terrible latency is totally expected.
Specifically, the shaper logic here (right above the horizon drop thing you linked above), does a lookup into the rate config map using the previously set queue mapping as the key[0], finds the rate and the last timestamp, and sets a timestamp for the packet based on the rate and the length of the packet. So packets will be delayed in the fq qdisc until their virtual TX time, and since there's only a single running timestamp for each queue mapping ID, this in effect becomes a virtual FIFO, with the only limit being the horizon timestamp, as you found.

The idea was to transfer https://netdevconf.info//0x14/pub/slides/55/slides.pdf / https://netdevconf.info//0x14/pub/papers/55/0x14-paper55-talk-paper.pdf for K8s and utilise the Pod egress rate annotations there.
[...]

Yup, I did recognise the approach from there. I'm also wondering why the Google folks didn't see these latency spikes. My best guess is that it's BBR throttling that kicks in early enough to mostly mask the FIFO behaviour. That, combined with some workload specific traffic properties may put you into "don't care about latency" territory for some deployments (especially if you're coming from a highly contended global HTB lock scenario)?

Or who knows, maybe I'm missing something fundamental here, and there's some reason why things are not as dire as I paint them above? If so, I would love to know what that reason is! :)

Hm, I wonder also if we're hitting other things such as sch->limit with the defaults, I've heard about issues like these due to defaults being too low. Would be good to trace kfree_skb to double check.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_fq.c#n530

Those limit drops should be visible in the qdisc stats, then (tc -s qdisc). The default limit for fq is 10k packets, which is around 1.2 seconds at 100Mbit (for 1500-MTU packets), so yeah, I guess you could hit that limit before you hit the horizon drop. However, TCP backpressure should also kick in before that (TSQ should still work all the way into the container, right?), which I think is supported by the results in the graphs above - we're a fair way from 1.2 seconds.

Gathering qdisc stats as Toke mentioned would be great indeed, if you have them Anton.

@antonipp Did you measure with BPF host routing? (https://docs.cilium.io/en/stable/operations/performance/tuning/#ebpf-host-routing) If not, could you try to set it and redo the measurement? (Upper stack has the skb_orphan which breaks TCP backpressure.. :/)
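(You can check which mode a node is actually using from the agent; the exact wording may differ between versions:)

kubectl -n kube-system exec ds/cilium -- cilium status | grep -i "host routing"   # "BPF" vs "Legacy"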

Either way, independent of that, I can craft a PR today to bump the sch->limit.. this was on my todo list anyway for some time.

Hmm, or maybe we're not, on the AWS instances, at least? There will be some smaller packets (ACKs) in the queue as well, so the ~400 ms could well be the queue overflow point? Maybe that's also the reason for the difference between the GCP and AWS results - i.e., differences in TCP stack backpressure effectiveness?

[0] Not quite sure how that is initially set, but seems to be coming from inside the pod? so veth queue id, I guess? anyway, doesn't look like there is more than at most a few IDs per pod.

Given all other skb fields such as mark etc are already used up, it basically stores the endpoint ID into queue_mapping which is preserved all the way through the stack and later resets it, so kernel picks queue via flow hash. Not great, but seems to function at least.

Right, OK, so it's basically one ID per container/pod? That's what I was assuming (as you'd want the limit to be global for that entity), I just didn't manage to trace the code back far enough to figure out where those IDs were coming from :)

Yeah aggregate is one ID per Pod (== netns, a Pod can hold one or more containers sharing the same netns). The BPF program on the host veth device knows that all traffic going through that device is from the given Pod, and each Pod has a unique ID on the node.

@dtaht

dtaht commented Nov 24, 2023

@borkmann I am not sure if we are talking past each other or not? In the 100mbit FIFO case sch->limit (and/or sch_fq) should be knocked way down, not up. 100 packets tops. In the 10gbit case, internal to a container, no more than 2ms is needed. It's worse than that, in that with gso present a "packet" can get bulked up by 42x, which is why byte limits are better than packet limits.
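For the 100mbit case that would look something like this (per mq child in practice, and untested by me here):

tc qdisc replace dev eth0 root fq limit 100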

@dtaht

dtaht commented Nov 24, 2023

@tohojo the results so far are from the rrul test, which on a packet limit floods the queue with short ACKs and starves the data path somewhat while taking 1/15th the time to transmit. That explains why it is merely 400ms, not 1.3 seconds (rule of thumb: ~13ms per full-size packet at 1Mbit, ~130us at 100Mbit). A pure upload test may well hit 1.2s (but due to exhausting other limits it might take multiple flows to hit it). See also: https://www.duo.uio.no/bitstream/handle/10852/45274/1/thesis.pdf

I do not know much about sch_fq; I thought it naively used 100 packets per flow? I recall Eric Dumazet suggesting putting any form of ECN with a 5ms brick-wall limit over the whole qdisc before EDT came out. BBR internal to Google has an ECN response.

tsq does regulate simple things fairly well, but going back to the debate: systemd/systemd#9725 (comment)

I won that debate, then. :) I was hoping this new facility truly extended containers to be doing more of the right thing.

A good simple test would be 4 saturating flows + a typical request/response payload for the application (sized to fit initcwnd) or tcp_rr (which will understate, being only 5 packets). My guess is that the tps (transactions per second) at fq 100mbit, 10k packets would be 1000x worse than fq_codel.
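e.g. with netperf (netserver running on the other side; hostname is a placeholder):

netperf -H flent-server -t TCP_STREAM -l 60 &       # one saturating flow; start a few of these
netperf -H flent-server -t TCP_RR -l 60 -- -r 1,1   # transactions per second for a tiny request/response flow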

@antonipp
Contributor Author

Just following up on a couple of things: I indeed ran the benchmarks with Legacy Host Routing. I ran some more with eBPF Host Routing, and the results are much better: the download throughput is now maxed out and the latency doesn't spike as much. Here are the results for comparison (all of these are from n2-standard-8 instances on GCP):

Legacy Host Routing

flent-test-bpf-host-routing-off

eBPF Host Routing

flent-test-bpf-host-routing-on

I also gathered the qdisc stats on client nodes after running these benchmarks (I booted the node, ran 1 benchmark, gathered the stats and then killed the node)
Here are the results:

Stats after running the benchmark with Legacy Host Routing
$ sudo tc -s qdisc show dev ens4
qdisc mq 8002: root
 Sent 783370773 bytes 1006231 pkt (dropped 584, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq 0: parent 8002:8 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 171368969 bytes 176640 pkt (dropped 104, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 163 (inactive 156 throttled 0)
  gc 0 highprio 0 throttled 87232 latency 26.6us flows_plimit 104
qdisc fq 0: parent 8002:7 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 205563021 bytes 201232 pkt (dropped 37, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 170 (inactive 167 throttled 0)
  gc 0 highprio 0 throttled 95733 latency 23.2us flows_plimit 37
qdisc fq 0: parent 8002:6 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 5867951 bytes 79383 pkt (dropped 185, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 165 (inactive 162 throttled 0)
  gc 0 highprio 0 throttled 37829 latency 26.1us flows_plimit 185
qdisc fq 0: parent 8002:5 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 4932330 bytes 56204 pkt (dropped 18, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 173 (inactive 172 throttled 0)
  gc 0 highprio 0 throttled 31996 latency 27.6us flows_plimit 18
qdisc fq 0: parent 8002:4 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 194851808 bytes 181687 pkt (dropped 40, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 176 (inactive 166 throttled 0)
  gc 0 highprio 0 throttled 93956 latency 24.5us flows_plimit 40
qdisc fq 0: parent 8002:3 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 4472102 bytes 61329 pkt (dropped 14, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 165 (inactive 157 throttled 0)
  gc 0 highprio 0 throttled 33655 latency 29.1us flows_plimit 14
qdisc fq 0: parent 8002:2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 5385404 bytes 70527 pkt (dropped 107, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 159 (inactive 155 throttled 0)
  gc 0 highprio 0 throttled 33977 latency 36.6us flows_plimit 107
qdisc fq 0: parent 8002:1 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 190929188 bytes 179229 pkt (dropped 79, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 173 (inactive 171 throttled 0)
  gc 0 highprio 0 throttled 88435 latency 27.7us flows_plimit 79
qdisc clsact ffff: parent ffff:fff1
 Sent 22038857299 bytes 1427707 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
Stats after running the benchmark with eBPF Host Routing
$ sudo tc -s qdisc show dev ens4
qdisc mq 8002: root
 Sent 768138725 bytes 2279589 pkt (dropped 13619, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq 0: parent 8002:8 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 286859943 bytes 392418 pkt (dropped 1743, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 93 (inactive 85 throttled 0)
  gc 0 highprio 0 throttled 110950 latency 18.3us flows_plimit 1743
qdisc fq 0: parent 8002:7 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 274331527 bytes 415580 pkt (dropped 2506, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 85 (inactive 79 throttled 0)
  gc 0 highprio 0 throttled 122406 latency 25.3us flows_plimit 2506
qdisc fq 0: parent 8002:6 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 69129154 bytes 243111 pkt (dropped 1803, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 103 (inactive 93 throttled 0)
  gc 0 highprio 0 throttled 116332 latency 17.6us flows_plimit 1803
qdisc fq 0: parent 8002:5 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 63169165 bytes 250551 pkt (dropped 1382, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 103 (inactive 97 throttled 0)
  gc 0 highprio 0 throttled 122248 latency 18us flows_plimit 1382
qdisc fq 0: parent 8002:4 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 16547541 bytes 255166 pkt (dropped 1791, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 87 (inactive 84 throttled 0)
  gc 0 highprio 0 throttled 124166 latency 21.1us flows_plimit 1791
qdisc fq 0: parent 8002:3 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 15222457 bytes 235888 pkt (dropped 1523, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 90 (inactive 85 throttled 0)
  gc 0 highprio 0 throttled 117827 latency 21.4us flows_plimit 1523
qdisc fq 0: parent 8002:2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 27700561 bytes 252835 pkt (dropped 1191, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 88 (inactive 84 throttled 0)
  gc 0 highprio 0 throttled 120161 latency 23.5us flows_plimit 1191
qdisc fq 0: parent 8002:1 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 2948b initial_quantum 14740b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 15178377 bytes 234040 pkt (dropped 1680, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 86 (inactive 82 throttled 0)
  gc 0 highprio 0 throttled 112652 latency 19.2us flows_plimit 1680
qdisc clsact ffff: parent ffff:fff1
 Sent 112585582090 bytes 2519853 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

And I also agree that the RRUL test is not very representative of a real container workload. As I mentioned, we have many hundreds of very different applications so it's a bit hard to find something representative tbh... I might try out

A better test would be to start, say two saturating flows and then measure transactions per second for a zillion other flows, netperf's tcp_rr test for example.

And I'll also look for a couple of candidate applications to capture packets from.

@dtaht

dtaht commented Jan 16, 2024

Any progress here? 15ms is really miserable. It should be not much more than 250us at 100mbit.

See also: https://blog.tohojo.dk/2023/12/the-big-fifo-in-the-cloud.html

@middaywords

middaywords commented Jan 18, 2024

I used iperf to test the Cilium Bandwidth Manager recently, and found a similar issue with its UDP throughput.

My limit is 10Mbps (kubernetes.io/egress-bandwidth: "10M"), and I tested it with 10Mbps of UDP traffic.

# iperf3 -c <server-ip> -b 10M -u
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  11.9 MBytes  10.0 Mbits/sec  0.000 ms  0/8633 (0%)  sender
[  5]   0.00-10.00  sec  9.10 MBytes  7.63 Mbits/sec  0.304 ms  1786/8376 (21%)  receiver

Those limit drops should be visible in the qdisc stats, then (tc -s qdisc). The default limit for fq is 10k packets, which is around 1.2 seconds at 100Mbit (for 1500-MTU packets), so yeah, I guess you could hit that limit before you hit the horizon drop. However, TCP backpressure should also kick in before that (TSQ should still work all the way into the container, right?), which I think is supported by the results in the graphs above - we're a fair way from 1.2 seconds.

Thanks to @tohojo's explanation, I can see that there is no backpressure for UDP, and that causes the drops. In my case, when sending at 10Mbps, the limit is also 10Mbps, while the received throughput is smaller (7Mbps). I can see flows_plimit increasing in tc -s qdisc; could you help me understand why this still hits the flow_limit of 100 if the sender is not exceeding the limit?

@tohojo

tohojo commented Jan 18, 2024 via email

@antonipp
Contributor Author

Any progress here?

A quick status update on our end: we've paused our work on the Bandwidth Manager for now, while we migrate our fleet to eBPF Host Routing which will take us a bit of time.

Also thank you @tohojo for the very well written blog post! FWIW, I am also planning on sharing some of our findings in a short presentation at Cilium + eBPF day in Paris in March (not sure how much I'll be able to fit in a 5 minute window but we'll see 😄)

@tohojo

tohojo commented Jan 26, 2024 via email

@dtaht

dtaht commented Jan 28, 2024

I would be tickled if you tried cake instead, and enabled ECN on the sender side.

tc qdisc add dev <the_interface> root cake bandwidth 100Mbit rtt 5ms

With host routing you can add ECN support via an ip route, or via the global sysctl -w net.ipv4.tcp_ecn=1.

The above rtt is sized more or less correctly for codel and the target bandwidth within containers (250us target).

@dtaht

dtaht commented Feb 23, 2024

In the hope this might spawn a little out of the box thinking on this bug: https://www.youtube.com/watch?v=rWnb543Sdk8&t=2603s

@dtaht

dtaht commented Apr 9, 2024

ping?

@borkmann
Member

borkmann commented Apr 9, 2024

ping?

Hi Dave, I have it on my roadmap to work on this likely for Cilium 1.17. Also, thanks for the pointer to your talk!

@middaywords

For the current design, can we alleviate the latency problem by setting a smaller drop horizon (i.e. a smaller queue length) in the code?
Currently it is set to 2 seconds by default. With many TCP flows in one pod, this causes a delay of about 2 seconds for some packets.
If we set it to a smaller value so that packets are dropped earlier, I think we would get lower latency.

@dtaht

dtaht commented Jun 2, 2024

2 seconds within containers on the same device is kind of nuts...

middaywords pushed a commit to middaywords/cilium that referenced this issue Jun 18, 2024
Currently the bandwidth manager enforces a rate limit for the flows in
one pod, and all flows in the pod share a single queue. It uses a
tail-drop policy with a threshold of 2 seconds. This can cause bufferbloat
and 2-second queuing latency when there are many TCP connections.

Here we introduce ECN marking to solve the issue; by default, the
marking threshold is set to 1ms.

For the tests, we had a pod with a 100Mbps egress limit and 128 TCP
connections in the pod as background traffic, and we compared the TCP_RR
latency:

Method		| Avg Latency
-		| -
with-ECN	| 3.1ms
without-ECN	| 2247.3ms

Fixes: cilium#29083

Signed-off-by: Kangjie Xu <[email protected]>
@aanm aanm added the kind/cfp label Jun 20, 2024
@wenjianhn

@dtaht we are a happy user of tc-cake. I wish you a happy Thanksgiving and a terrific holiday season!

tc-cake has been used to throttle the ingress bandwidth of our Trino clusters for several months. The peak bandwidth of all the Trino containers is larger than 1.5 Tb/s.

Lots of our services are latency-sensitive, and their p99 latency is expected to be less than a few milliseconds.
Thanks to tc-cake, those services are able to run on the same nodes as the Trino workers.

Let me share the stats of one of our k8s nodes.
A note for those who are not familiar with tc-cake: 'marks' means we have enabled TCP ECN.
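Roughly, the setup redirects ingress traffic to an ifb device and shapes it there with cake. A simplified sketch (the stats below show per-mq-queue cake instances; here a single root cake is used for brevity, and the names/rates are just what you can read off the stats):

ip link add ifb4eth0 type ifb
ip link set dev ifb4eth0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall action mirred egress redirect dev ifb4eth0
tc qdisc add dev ifb4eth0 root cake bandwidth 8Gbit diffserv3 dual-dsthost ingress rtt 1ms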

# tc -s qdisc show dev ifb4eth0
qdisc mq 1: root
 Sent 1116849779461062 bytes 1651798624 pkt (dropped 2454154, overlimits 1560279534 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8001: parent 1:1 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
 Sent 558297167301650 bytes 2890888094 pkt (dropped 1192093, overlimits 2916917559 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 4281088b of 4Mb
 capacity estimate: 8Gbit
 min/max network layer size:           46 /    1500
 min/max overhead-adjusted size:       84 /    1538
 average network hdr offset:           14

                   Bulk  Best Effort        Voice
  thresh        500Mbit        8Gbit        2Gbit
  target           50us         50us         50us
  interval          1ms          1ms          1ms
  pk_delay         91us         12us          0us
  av_delay         20us          2us          0us
  sp_delay          0us          1us          0us
  backlog            0b           0b           0b
  pkts        182803950   4192489032            0
  bytes    368114945149999 190202052026351            0
  way_inds    225131238    599585639            0
  way_miss    533872792   2940120670            0
  way_cols            0          943            0
  drops          826120       365973            0
  marks        51784581     46454061            0
  ack_drop            0            0            0
  sp_flows            1            2            0
  bk_flows            0            0            0
  un_flows            0            0            0
  max_len         68519        68519            0
  quantum          3028         3028         3028

qdisc cake 8002: parent 1:2 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
 Sent 558552612159412 bytes 3055877826 pkt (dropped 1262061, overlimits 2938329271 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 4281472b of 4Mb
 capacity estimate: 8Gbit
 min/max network layer size:           46 /    1500
 min/max overhead-adjusted size:       84 /    1538
 average network hdr offset:           14

                   Bulk  Best Effort        Voice
  thresh        500Mbit        8Gbit        2Gbit
  target           50us         50us         50us
  interval          1ms          1ms          1ms
  pk_delay         22us          3us          0us
  av_delay          2us          1us          0us
  sp_delay          0us          1us          0us
  backlog            0b           0b           0b
  pkts        193810664   4196261164            0
  bytes    368251301980545 190321728536187            0
  way_inds    225356579    604529018            0
  way_miss    533784418   2930662682            0
  way_cols            0          707            0
  drops          839621       422440            0
  marks        51804336     46385740            0
  ack_drop            0            0            0
  sp_flows            1            1            0
  bk_flows            0            1            0
  un_flows            0            0            0
  max_len         68519        68519            0
  quantum          3028         3028         3028
# tc -s qdisc show dev ifb4eth1
qdisc mq 1: root
 Sent 1118804089717826 bytes 2944733618 pkt (dropped 2501391, overlimits 1760244539 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8003: parent 1:1 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
 Sent 559677882348699 bytes 3804882758 pkt (dropped 1223367, overlimits 3044223444 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 4281216b of 4Mb
 capacity estimate: 8Gbit
 min/max network layer size:           46 /    1500
 min/max overhead-adjusted size:       84 /    1538
 average network hdr offset:           14

                   Bulk  Best Effort        Voice
  thresh        500Mbit        8Gbit        2Gbit
  target           50us         50us         50us
  interval          1ms          1ms          1ms
  pk_delay         58us         34us          0us
  av_delay          7us          2us          0us
  sp_delay          0us          0us          0us
  backlog            0b           0b           0b
  pkts        242564561   4274611928            0
  bytes    368332249638520 191365928859171            0
  way_inds    225554613    606753461            0
  way_miss    533905978   2932321244            0
  way_cols            0         3117            0
  drops          841871       381496            0
  marks        52162961     46514679            0
  ack_drop            0            0            0
  sp_flows            0           22            0
  bk_flows            1            0            0
  un_flows            0            0            0
  max_len         68519        68519            0
  quantum          3028         3028         3028

qdisc cake 8004: parent 1:2 bandwidth 8Gbit diffserv3 dual-dsthost nonat nowash ingress no-ack-filter no-split-gso rtt 1ms noatm overhead 38 mpu 84
 Sent 559126207369127 bytes 3434818156 pkt (dropped 1278024, overlimits 3010988391 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 4181Kb of 4Mb
 capacity estimate: 8Gbit
 min/max network layer size:           46 /    1500
 min/max overhead-adjusted size:       84 /    1538
 average network hdr offset:           14

                   Bulk  Best Effort        Voice
  thresh        500Mbit        8Gbit        2Gbit
  target           50us         50us         50us
  interval          1ms          1ms          1ms
  pk_delay         50us          5us          0us
  av_delay         12us          1us          0us
  sp_delay          0us          0us          0us
  backlog            0b           0b           0b
  pkts        238720122   4251216926            0
  bytes    368319815321517 190827026851471            0
  way_inds    221314827    605251303            0
  way_miss    533859584   2926938123            0
  way_cols            0          480            0
  drops          848823       429201            0
  marks        52048251     46370197            0
  ack_drop            0            0            0
  sp_flows            0           12            0
  bk_flows            0            0            0
  un_flows            0            0            0
  max_len         68519        68519            0
  quantum          3028         3028         3028
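For readers unfamiliar with this kind of setup: the stats above come from cake instances attached to ifb devices, which is the usual way to shape ingress traffic (redirect inbound packets from the NIC to an ifb, then shape them there as egress of the ifb). Below is a rough, simplified sketch of such a setup; the device names and cake parameters are copied from the stats above, while the single-queue layout is a simplification, since the stats actually show several cake instances under an mq root.

modprobe ifb numifbs=0
ip link add ifb4eth0 type ifb
ip link set dev ifb4eth0 up
# redirect everything arriving on eth0 to the ifb device
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all matchall \
    action mirred egress redirect dev ifb4eth0
# shape the redirected traffic with cake, using the parameters from the stats
tc qdisc add dev ifb4eth0 root cake bandwidth 8Gbit diffserv3 \
    dual-dsthost nonat nowash ingress no-ack-filter no-split-gso \
    rtt 1ms noatm overhead 38 mpu 84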

@dtaht

dtaht commented Nov 19, 2024

50us is mind-blowing. One other thing that stands out for me is the very low hash collision rate - the theory said an 8-way set-associative hash would work, and seeing it work is great!

@dtaht

dtaht commented Nov 19, 2024

@wenjianhn you might drop fewer packets if you hand cake a bit more memory via the memlimit parameter. Say 8Mbytes rather than 4. What that will do to your p99 would have to be measured, though!
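For example, bumping the limit on one of the cake instances from the stats above (the device, parent handle, and 8MB value simply mirror the suggestion; cake leaves its other parameters unchanged on a change operation):

tc qdisc change dev ifb4eth0 parent 1:1 cake memlimit 8mb

The same change would need to be repeated for the other cake instances (parent 1:2, and those on ifb4eth1).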

@randomizedcoder

If you're interested in jitter, there's a nice video here
https://youtu.be/I_TtMk5z0O0?si=EUdasVZg-nt2YmMV
regards,
Dave
