Change default default qdisc from fq_codel to sch_fq #9725

Open
roland-bless opened this issue Jul 26, 2018 · 53 comments
Labels: RFE 🎁 Request for Enhancement, i.e. a feature request · sysctl

Comments

@roland-bless

systemd version the issue has been seen with

e6c253e

Used distribution

Ubuntu 18.04

Expected behaviour you didn't see

Better performance than sch_fq

Unexpected behaviour you saw

Packet loss at the sender (ok, not really unexpected)

Steps to reproduce the problem
The main problem is that fq_codel is nice for routers or forwarding devices, but not for end-systems or servers (and I assume that these make up the majority of Linux installations): it provides no advantage there, but a serious drawback: packet loss at the sending end-system (note that this will also increase the overall latency for the affected TCP connections!). Normally, this should be avoided by using back-pressure mechanisms locally. However, CoDel drops packets instead of propagating back-pressure.
The steps to reproduce the problems are described here:
https://lists.bufferbloat.net/pipermail/bloat/2018-June/008318.html
(BTW: there is no technical argument in that thread for using fq_codel as the default.)
So if you're using Linux as a software-based router (OpenWrt), you'll need a special configuration anyway (so it's easy to set fq_codel as the default there), but for servers and normal end-systems/hosts the default should be sch_fq, not fq_codel!
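For context, the default being discussed here is the net.core.default_qdisc sysctl that systemd sets. A minimal sketch of checking and overriding it locally (the drop-in path and interface name below are illustrative, not systemd's own files):

sysctl net.core.default_qdisc                        # show the current default
echo 'net.core.default_qdisc = fq' > /etc/sysctl.d/99-qdisc.conf
sysctl -p /etc/sysctl.d/99-qdisc.conf                # apply without a reboot
# the default only affects qdiscs created afterwards; an interface that is
# already up keeps its qdisc until it is replaced explicitly:
tc qdisc replace dev eth0 root fq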

@poettering
Member

@michich opinion on this?

@poettering added the sysctl and RFE 🎁 Request for Enhancement, i.e. a feature request labels on Jul 30, 2018
@roland-bless
Author

Please note that the Bufferbloat wiki also recommends sch_fq for servers:
https://www.bufferbloat.net/projects/codel/wiki/
(see heading "Binary code and kernels for Linux based operating systems").
However, as I wrote: the default shipping Linux code should be for end-systems and CoDel doesn't make any sense inside an end-system (flow queueing is useful though!).

@dtaht

dtaht commented Aug 11, 2018

@michich @poettering

um. fq_codel is the most general-purpose qdisc there is, handling udp and tcp traffic rationally and equally while managing queue length, and it works for routers and virtual network substrates also.

sch_fq is for tcp-serving heavy workloads, primarily from the datacenter, and does nothing sane for other forms of traffic including acks in the reverse direction. IF all you are doing is tcp serving at 10gigE+, sch_fq is totally the right thing. Otherwise it can actually be worse than pfifo_fast! At 1gigE, sch_fq's defaults are set too high (for example) for even your typical nas or laptop.

One of the big benefits to sch_fq (pacing) arrived for all tcp connections in recent kernels so now fq_codel takes advantage of pacing also.

My vote is you keep fq_codel as the default, and I'll try to clarify the referenced bufferbloat page, and show some benchmarks as to why.

I'd also made some lengthy comments on your ecn enablement side on another bug report.

@dtaht

dtaht commented Aug 11, 2018

#9748 on that front. @roland-bless I have no idea how you are turning bufferbloat.net's general recommendations around. The best all-round general-purpose default for linux remains fq_codel. tcp serving from the dc is not what you want to optimize your typical linux distro user for.

@dtaht

dtaht commented Aug 11, 2018

ah, I see the thread: https://lists.bufferbloat.net/pipermail/bloat/2018-June/thread.html which is all over the place, and I'd had pithy comments.

one of the problems with your 100 flow test is that, in that case, current linux tcps cap cwnd reductions so they never get the right rate anymore at gigE speeds, lots of flows, and short rtts. In fact, fractional cwnds would be nice... But you'll find the achieved bandwidth is exactly the same (despite the retransmits), while the sch_fq caps will show an ever-expanding rtt.

You want things like fq_codel to regulate that behavior, not just for tcp, but things like webrtc, which doesn't run over tcp. All traffic, not just tcp serving workloads.

@cherusk

cherusk commented Aug 12, 2018

Technically, is there any indication that there has to be THE one? Why not converge on the compromise that every distro makes its own default choice? Most distros are even churned out in explicit flavours, e.g. for back-end or desktop use cases ...

@roland-bless
Author

@dtaht

um. fq_codel is the most general purpose qdisc there is handling udp and tcp traffic rationally and equally, while managing queue length, and working with routers, and virtual network substrates also.

fq_codel is great for forwarding devices that can easily drop on dequeue, but that's not the point. The point is that it's not reasonable to lose the backpressure signal in sending end-systems by using fq_codel.

sch_fq is for tcp-serving heavy workloads, primarily from the datacenter, and does nothing sane for other forms of traffic including acks in the reverse direction.

It also favors new/small flows like fq in fq_codel, so the benefit for those packets is the same as in fq_codel.

With respect to the test case: 100 concurrent flows from a 1GigE web server isn't unrealistic. It shows that fq_codel drops outgoing packets, because TCP congestion control doesn't work so well here either, since the load in terms of number of flows is the problem. sch_fq doesn't have to drop packets, because it propagates backpressure to applications locally. Throughput and latency are comparable, and sch_fq works quite well even in this scenario (we didn't adjust any sch_fq parameters).
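As a rough sketch of the kind of load described (not the exact setup from the linked mailing-list post; the host name and interface below are placeholders):

iperf3 -s                                   # on the receiver
iperf3 -c <receiver> -P 100 -t 60           # on the 1GigE sender: 100 parallel flows
tc -s qdisc show dev eth0                   # compare the "dropped" counter under fq_codel vs fq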

You want things like fq_codel to regulate that behavior, not just for tcp, but things like webrtc, which doesn't run over tcp. All traffic, not just tcp serving workloads.

That's not the point, because you want to use backpressure in the sending end-system. sch_fq works also for UDP or other traffic. One point is that in the mentioned scenario, fq_codel cannot even control the TCP flows so well, so the same would happen with UDP flows (which should also be congestion controlled).

@dtaht

dtaht commented Aug 13, 2018

Roland, you are wrong on multiple fronts here. I don't really want to take the time to exhaustively write this - it's that your viewpoint is "I'm the web server in the datacenter" that sticks in my craw. What can I do to convince you my points are valid? What concrete set of experiments?

Systemd runs on everything from iot to laptops to virtual hosts to servers. Having a good default that works across the widest range is what we want. "I'm the laptop on the edge of the network."

"sch_fq works also for UDP or other traffic. "

No, it doesn't manage queue length in that case. Next question. That really should be the end of this debate!!!

"One point is that in the mentioned scenario, fq_codel cannot even control the TCP flows so well"

Look at the rtts and throughput in both cases. Try using the flent tool to sample rtts. Things are being controlled, the fact the local tcp cannot in this case match the rate is more the fault of today's tcps. [1] I miss reno sometimes.

" so the same would happen with UDP flows (which should also be congestion controlled)."

"can" happen. When we are pushing more data out than we can match rates on, what do you want to happen? grow the buffer? Have backpressure for udp? You don't have backpressure through the whole system , you have drop, that's it.

  1. fq_codel, so far as I know, has been the systemd default for years now. I'm not aware of many complaints. It's way better than pfifo_fast at rates from 1mbit to 40gigE+.

  2. Take caps of sch_fq. Look at your rtts.

  3. sch_fq does not apply backpressure to anything besides local tcp traffic and not enough of it at low rates either. Switching to sch_fq would bring back the era of 10,000 packet buffers for all other kinds of traffic, no regulation of queue size at all. It would be bufferbloat^10. It won't do ecn. It won't handle quic. nor vms. nor encapsulated traffic. Is that what you want?

fq_codel applies global regulation when each application only has its own narrow viewpoint.

  1. TSQ only applies to a limited number of flows before self congesting. Last I recall it always stacked up 4 packets per flow. It works really well for low numbers of flows and/or longer paths. TSQ + sch_fq works spectacularly for lots of tcp flows, 10+gigE, and long paths.

sch_fq is great for serving tcps in the datacenter. It may even be a good choice for your local web server example. Although that example is flawed, using only 100 greedy, continuously running flows. A real webserver handles a variety of flows from 1 packet up to (last I checked) about 2Mbytes in size. A much better benchmark of web-server - nfs server - etc is in transactions per sec across a workload, where most of the workload lives in slow start. The vast majority of flows, being rtt-bound in their ramp, or size limited, will never exceed their fair share and get hit by codel.

I'd like to clearly establish where we are disagreeing or not. You've made several assertions that are provably false. Others, like the amount of backpressure needed, or your distaste for retransmits on a short path where it does not matter... are something of a matter of taste. My "taste" leans towards never having much more than 5ms of local buffering, except for what is needed to handle bursts. There's buffering in the socket (see tcp_notsent_lowat), in the app, in the qdisc, and in the network.
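That socket-level knob is a sysctl; a hedged example (the value here is purely illustrative, not a recommendation):

sysctl -w net.ipv4.tcp_notsent_lowat=131072   # cap un-sent data buffered per TCP socket at 128 KB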

[1] You can make an argument that fq_codel should interact with the local tcp stack better than it currently does, under conditions of extreme load. Or you can argue the local tcp stack should shrink its demands better. You can also argue that bql tends towards overbuffering.

But I think first up would be looking into how tsq and sch_fq currently interact at 1gbit rates. Because of this whole thread I did go looking at 100+ flow tests on short paths, at bbr's current behaviors, and at ecn, while looking at cake, sch_fq, and fq_codel, but I didn't get iproute2-next to let me sample buffer sizes again until a commit arrived fixing, ironically, its overbuffered output buffer!

@roland-bless
Author

Dave: my point is not about web-servers in a data center, it's about using fq_codel in the end-system, which can cause harm. So we have the case where the sender is the only bottleneck. Why should I abuse congestion control signals and add RTT latency when I can manage this locally? Congestion control should prevent overloading the network, but it's not adequate to manage overload in the end-system. In our test case you can see that congestion control doesn't work, while local backpressure mechanisms do work.

What can I do to convince you my points are valid? What concrete set of experiments?

Show me a case where fq_codel performs significantly better than sch_fq in an end-system.

...is more the fault of today's tcps

Basically, we have AQMs to deal with the fault of today's TCPs... :-)

No, it doesn't manage queue length in that case.

Hm, so flow isolation still works, but UDP flows may suffer from self-inflicted queuing delay
(due to lack of TSQ)?

Look at the rtts and throughput in both cases.

Both were fine, as described.

your distaste for retransmits on a short path where it does not matter

This could also be retransmits on a long path, where they do matter. So, yes, my main point is that we should use local backpressure instead of congestion signals in order to avoid these retransmits.

@dtaht

dtaht commented Aug 15, 2018

No, it doesn't manage queue length in that case.

Hm, so flow isolation still works, but UDP flows may suffer from self-inflicted queuing delay
(due to lack of TSQ)?

Flow isolation still works, but it's a second class citizen in sch_fq. Queue length management reverts to tail drop on a 10,000 packet (and GSO enabled, thus 64k possible per packet) queue.
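For reference, those numbers are visible directly in sch_fq's defaults (interface name is illustrative; the limit and flow_limit fields are the per-qdisc and per-flow packet caps, e.g. "limit 10000p flow_limit 100p" in the tc output quoted later in this thread):

tc qdisc replace dev eth0 root fq
tc qdisc show dev eth0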

All flows not directly managed by the local tcp stack do not get backpressure. This includes tcp flows from vms, stuff flowing through hypervisors, encapsulated traffic from vpns and containers and network name spaces, udp from any source including quic, webrtc, voip, dns, gaming, and attack traffic, or any other protocol. So any of these flows can self inflict queuing delay, and by being present still inflict some delay on other flows.

To me this ends the debate over sch_fq as a good default! Does it work for you yet?

What can I do to convince you my points are valid? What concrete set of experiments?
Show me a case where fq_codel performs significantly better than sch_fq in an end-system.

OK, well, that depends on what you consider a valid test. Would a MOS score of voip flows taken against a server that is also tcp-loaded work? Or a measurement of self inflicted latency from webrtc? or locally vpn'd traffic competing with local traffic? Bittorrent?

"end-system" partially depends on whether you are a client or server, on wifi or ethernet. Certainly the fq_codel queue management we did for wifi routers also applies to clients and we should probably do a followup on that paper running the same algo on both sides, ( https://arxiv.org/pdf/1703.00064.pdf )

Our viewpoint as to "performs" might differ. I'm all about low latency and filling the pipe not the queue. Forcing tcp to back off and retransmit "does no harm" so long as utilization is 100%.

your distaste for retransmits on a short path where it does not matter

This could also be retransmits on a long path, where they do matter. So, yes, my main point is that we should use local backpressure instead of congestion signals in order to avoid these retransmits.

Just to clearly establish things for those in the audience, retransmits "fill in the hole". If your rtt is short, the signal gets through faster. retransmits do cause additional work on behalf of both sides of the stack.

"where they do matter"

It only matters if the remaining duration of your transaction is less than the RTT, or you've dropped the last packet in flight, forcing an RTO.

as the rtt grows, the need to do congestion control via dropping declines dramatically. One drop means a lot. So we could repeat your 100 flow test over a 100ms or 1000ms rtt, to show that (or we could dig up a paper on relationship between drop rates and rtt)

Yes, I totally agree we should use local backpressure to avoid retransmits whenever possible. pfifo_fast does tail drop, where the packet is not actually dropped locally but a cwnd reduction is pushed into the flow and the packet is reposted. sch_fq + TSQ applies local backpressure and cwnd reduction (and even better, pacing) (but does grow the self inflicted apparent rtt). fq_codel does head drop, which means you generally won't see an RTO, as the flow's next packet immediately behind it is delivered, so the receiving tcp notices the hole and asks that it gets filled in.

we did produce a tail dropping version of fq_codel at one point. yep, local tcp stack backpressure. But it hurt all the other applications (as noted above) that can't observe that backpressure, by denying them the earliest congestion signal possible. It matters to voip/video/dns to drop the stalest packet in particular, and to tcps in general to avoid synchronized bulk drops (which is what happens when drop tail is highly congested). Also, often, when fq_codel is in a dropping state the local total queue is so full in the first place that by the time it empties it's already got the ack from the other side, indicating "please reduce your window and fill in this hole" - there's often (on short paths) waaay more than a bdp in there when overloaded in the first place.

I think we are both in agreement that having min/max fair fq (sch_fq, fq_codel, and now sch_cake) is better than a fifo queue?

...is more the fault of today's tcps
Basically, we have AQMs to deal with the fault of today's TCPs... :-)

Even if I could step back to 1986 and mandate a delay-based rather than drop-based tcp, it wouldn't have worked in today's highly variable rtt environment. The need for aqm was understood, and if only RED had worked - or drr/sfq been applied more universally - we'd have had a better internet. I got involved in all this because I ran an ISP in 1993-96 and still have the scars on my back...

If I could step back to 2000 and reserve 3 bits for QoS and 5 for ecn, instead of the diffserv mess, that would have helped.

I'm pretty sure, if I could go back to 1989 and the first ietf meeting, and stood shoulder to shoulder with John Nagle about the need for fair queuing everywhere, that would have made a difference! Most congestion would thus be self inflicted and applications could just do themselves in.

Dave: my point is not about web-servers in a data center, it's about using fq_codel in the end-system, which can cause harm.

What harm? Retransmits and congestion control are totally normal highly optimized aspects of tcps behavior.

No harm, and a general benefit. A light tap here and there to reduce self inflicted congestion in the general case.

So we have the case where the sender is the only bottleneck. Why should I abuse congestion control signals and add RTT latency when I can manage this locally? Congestion control should prevent overloading the network, but it's not adequate to manage overload in the end-system. In our test case you can see that congestion control doesn't work, while local backpressure mechanisms do work.

Our definition of working is different here. You want no retransmits, and effective backpressure. I'm saying that effective backpressure is impossible in the general case.

Utilization is 100% in both cases. The same amount of data is transferred. Locally observed RTT with fq_codel is lower (probably - I'd have to go look, as I mentioned earlier; it's a per-flow curve with tsq in place, pacing now helps a lot, and I only got iproute2 fixed last week). As I noted on your test case, it's not an example of a web workload, either.

It's "working".

I too want effective backpressure. I'd like it if modern tcps were IW4, not IW10, that cubic backed off .5 rather than .7, that tcps still reduced to minimal cwnd AND that since fractional cwnd is impossible, used pacing instead, and that tcps reduced their mss size when under congestion also.

I would not mind at all if the head dropping of packets in fq_codel immediately forced a cwnd or pacing reduction locally. Hmm... that might be feasible... similarly a local response to seeing ecn exerted... this is not stuff the main folk working at the big g care about.

and add RTT latency when I can manage this locally

so far as I know - and like I said, I have to go look as it has been a while, sch_fq + TSQ add RTT latency at a minimum of 4 packets outstanding locally per flow. (I will revise this statement after doing the work, but I did ask you to take packet caps of sch_fq in your benchmark). I would not mind if it dropped to 1 packet, nor would I mind if it then started reducing packet size - but the DC guys simply don't see problems at 1gigE and below - they would generally love to be able to self congest but are otherwise out of cpu.

so you are making an assertion that sch_fq is not contributing to rtt latency that I currently doubt is true. I observe HUGE buffer sizes in sch_fq when I look at it by eyeball as I add flows.

@dtaht

dtaht commented Aug 15, 2018

Not quite relevant but I had to get this off my chest:

IMHO, the ideal self inflicted delay is 1 packet. Ironically, as we go to higher rates, latency suffers further on the host, as we need more buffering because we can't handle interrupts fast enough. So you'll find - in BQL - that self inflicted delays at 100mbit are in the 2-packet, few-dozen-usec range, and often a few msec at 1gigE. In other words, going at higher rates, even with all this newfangled fq/aqm stuff in place, fairly fat fifo queues grow at the device driver, and 100mbit networks can actually have less latency than gig+ ones because of interrupt latency and fq being above it.

BQL is a godsend, because 100mbit networks had gone completely to hell prior to its development and the addition of GSO and big ring buffers to linux.

I wish we could do better.

@dtaht

dtaht commented Aug 15, 2018

OK, I setup a brief test using network namespaces to make the point. I did not go to any great extent to make it terribly scientific, I'd much rather you repeat my tests to make the point to yourself.
A tarball of the setup script, flent tests and captures for 100 flows and the square wave test is now up at: http://flent-newark.bufferbloat.net/~d/netns_sch_fq_sch_fq_codel.tgz

(if you've pulled this in the last 20 minutes, I updated it with newer and more data)
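A minimal sketch of the sort of veth/netns plumbing such a test uses (names and addresses are illustrative; the actual script is in the tarball above):

ip netns add client
ip link add veth0 type veth peer name veth1
ip link set veth1 netns client
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
ip netns exec client ip addr add 10.0.0.2/24 dev veth1
ip netns exec client ip link set veth1 up
tc qdisc replace dev veth0 root fq           # the qdisc under test; or: fq_codel
# then run flent/netperf between the namespaces across veth0/veth1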

This is that sch_fq result - 150-250ms delays on the netns'd cubic flow - (probably more regulated by all the other tcps competing and their RTTs than drop).

[plot: rtt_sch_fq]

The observed RTTs in this test (it's in the cap and xplot data) with fq_codel as the ending qdisc rather than sch_fq line up with the flent measurement as well, at ~10ms:

[plot: correct_with_noecn]

drop counts were much higher for fq_codel (100x?) (I still haven't fixed my sampler) - but throughput identical. RTTs 1/15th or better of that of the alternative.

Just for giggles, I did a couple tests with flent's tcp_square_wave test (4 flows, two cubic, two bbr). the cubic result - even for only two cubic flows going through sch_fq was painful.

Given the limited number of flows on this test the difference in drops was much better:
fq_codel drops ~260 packets over the duration of that test, 120 or so for sch_fq

bandwidth identical:

[plot: tcp_4up_squarewave_-_compared_bw_identical]

but which level of latency do you want for your tcp flow?

[plot: tcp_4up_squarewave_-_netns_sch_fq]
[plot: tcp_4up_squarewave_-_netns_sch_fq_codel]

(You can certainly see a compelling advantage to bbr over cubic in this test also). Either FQ system gives it a fair share to start with, and then BBR probes for the right rate (those drops in throughput every 10sec), and gets it. If it were competing with an overlarge fifo, and not self congested, it would be uglier still. I'd rather like it if fq made it across the edges of the internet, and then all sorts of congestion controls would work way better.

@dtaht

dtaht commented Aug 16, 2018

in terms of the local stack only, (no netns) TSQ works pretty good in both the sch_fq and sch_fq_codel cases. Over 60 seconds at gigE, 8 full rate flows going through either sch_fq or fq_codel never drop a packet. 16 flows drop 5 "packets" with fq_codel, none with sch_fq. you can't count conventional "packets" anymore as most of these are TSO and greater than 1514 bytes - but that was out of 7341269449 bytes sent.

at 15 flows it drops 3 packets. In both cases no difference in throughput, same rtt.

[plot: tcp_nup_-_tsq-selfcongest-15-sch_fq]

@chromi

chromi commented Aug 16, 2018

There's a really simple answer to all this that nobody's emphasised yet: TURN ON ECN. That will let AQM sort out the congestion backpressure without incurring packet losses and retransmissions.

More and more end-host platforms are turning on ECN by default. Shouldn't systemd do the same?

@filbranden
Member

@chromi:

There's a really simple answer to all this that nobody's emphasised yet: TURN ON ECN. That will let AQM sort out the congestion backpressure without incurring packet losses and retransmissions.

More and more end-host platforms are turning on ECN by default. Shouldn't systemd do the same?

Yes, starting with systemd v239, see #9143 and the update to the NEWS file.

We've had one issue (#9748) reported related to ECN though.

@chromi

chromi commented Aug 16, 2018

I see. So why is the OP seeing packet loss with fq_codel? Has Ubuntu overridden systemd's default ECN setting?

@dtaht

dtaht commented Aug 16, 2018

@chromi per that bug note, I retain grave doubts about ecn universally unless tcps evolve a better response. With people pushing it to have even less response to loss than cubic does instead of more, as in ( https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn-09 ), with no response defined towards drop and CE simultaneously in an RTT, with the extra damage an ECN-enabled DDoS can do, and with codel not increasing its signalling rate on overload for ecn vs normal packets... at this stage in the game my vote remains to leave it off by default until more things get sorted out.

You really should spend more time looking at queue depths in cake with ecn on and off, at high loads.

@filbranden
Member

@chromi Ubuntu 18.04LTS still ships systemd 237.

I haven't checked whether they changed the default for ECN (which, at that point, was still off), but I imagine they didn't.

I guess if @roland-bless would apply that same change locally (it's just a sysctl config, could even be configured dynamically on the local system) he might be able to tell whether that makes a difference?
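For reference, that change is just the tcp_ecn sysctl (a sketch; the drop-in path is illustrative):

sysctl -w net.ipv4.tcp_ecn=1                              # 1 = request and accept ECN, 2 = accept-only (kernel default)
echo 'net.ipv4.tcp_ecn = 1' > /etc/sysctl.d/99-ecn.conf   # persist across reboots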

@dtaht

dtaht commented Aug 16, 2018

@chromi - systemd has only had ecn "on" for a few weeks. This bug report is about a separate request to switch systemd's default to sch_fq, which is a horrible idea that I just spent several days attempting to refute, per above. Yep, tcp ecn on + fq_codel kills the retransmits the original poster was complaining about in this 100 flow test. Ran that. But the retransmits do no harm in the context of this test and self inflicted rtts with ecn off are lower than the actual path delay.

@roland-bless
Author

@chromi Yes, ECN avoids packet loss, but the main point was to use local backpressure mechanisms instead of abusing congestion control signals, having them returned by the other end and only then reacting to them. That costs you at least an RTT, whereas local feedback provides backpressure signals immediately.

@roland-bless
Author

@dtaht I'm currently on vacation and travelling, so I cannot respond at high frequencies. Maybe Mario kicks in.

All flows not directly managed by the local tcp stack do not get backpressure. This includes tcp flows from vms, stuff flowing through hypervisors, encapsulated traffic from vpns and containers and network name spaces, udp from any source including quic, webrtc, voip, dns, gaming, and attack traffic, or any other protocol. So any of these flows can self inflict queuing delay, and by being present still inflict some delay on other flows.

I see, but usually you have virtual switches for VMs and they should use fq_codel then. It would probably make sense to also limit the number of locally queued packets inside the OS in those cases, too.

OK, well, that depends on what you consider as a valid test. Would a MOS score of voip flows taken against an also tcp_loaded server work? Or a measurement of self inflicted latency from webrtc? or locally vpn'd traffic competing with local traffic? Bittorrent?

VoIP flows should benefit from fq's flow isolation. WebRTC should not cause self-inflicted delay due to the corresponding congestion control there (NADA, SCREAM, etc.).

I'll respond to the other stuff later if time permits...

@dtaht

dtaht commented Aug 17, 2018

@roland-bless - I took the time to do the easiest counter-example - network namespaces - in a long part of the post above. I imagine you didn't read that far. I'm on vacation also.

In the mere context of this bug report, which involves you asking to switch systemd over to sch_fq, no "shoulds" or "usually"s or other forms of wishful thinking can apply. What actually is - the situation where tons of different kinds of unregulated flows exist, across the wide variety of millions of possible systemd installations, the situation that exists today - is what needs to drive the engineering decision about the correct, best, basic default. sch_fq is worse than pfifo_fast in these respects, and fq_codel is the overall winner. I'd sooner revert systemd to pfifo_fast (packet limit 1000) than switch to sch_fq as a general purpose qdisc for the general public.

But it's not my call, either. I have no say in this matter, no connection with systemd at all. If someone hadn't mentioned this "bug" on the bloat mailing list I'd have not shown up and felt compelled to educate and argue.

I support knowledgeable sysadmins and distros changing their default to anything they choose based on their workload.

But: can we "close" that part of this bug report? That we're done discussing changing the default? I'm hoping that my network namespace example sufficies to prove for you and those in the audience that sufficient backpressure does not exist for a wide variety of common applications.

Then we can go and discuss making the shoulds and usuallys into things that always are. Can we close that part of this bug report?

(and btw, (bikeshedding!) particularly on wifi, webrtc does self inflict delay (managed beautifully by the fq_codel-for-wifi stuff ( https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf ), not anywhere else), and my preferred congestion control is Google's ( https://tools.ietf.org/html/draft-ietf-rmcat-gcc-02 ). I don't know if NADA or SCREAM actually got implemented in a shipping browser? I rather liked an early version of NADA.

But I'm off topic and all I want to do is shut the conversation down about foolishly changing systemd's default. Can we do that yet?

If you say yes (or hopefully, some of the systemd folk watching the fireworks?), I'll let you go enjoy your vacation. And I can go back to mine.

@dtaht

dtaht commented Aug 17, 2018

Oh core systemd folk? @poettering @michich -I don't know who else is core to systemd - can I go back to something else in life besides this bug report now? Your call, I made my points as best I could, and I'm going to logout now.

@phomes
Contributor

phomes commented Aug 17, 2018

for what it's worth I agree with Dave to keep fq_codel

@nealcardwell

dtaht said:
Forcing tcp to back off and retransmit "does no harm" so long as utilization is 100%.

This is only true if the application using TCP is a bulk transfer application. Many important applications (e.g. web traffic, RPC traffic) care about latency, and are harmed by the added latency for loss recovery forced by drops from fq_codel.

I agree with Roland's point that fq_codel, in its current implementation, does not seem like a good fit for end systems using TCP. For the current fq_codel and fq implementations, for end systems that use TCP, fq seems like a better fit than fq_codel.

IMHO for systems with TCP traffic it is not a good trade-off to knowingly impose latency regressions on latency-sensitive TCP applications in order to provide a better back-pressure signal for some subset of UDP applications that will use the drops.

I suspect it is possible to enhance fq_codel to avoid this penalty for local TCP traffic: for example, not dropping traffic that is using TSQ. That way, TCP traffic is not dropped, but UDP traffic is dropped. But I don't think the required mechanisms are in place yet.

@dtaht

dtaht commented Aug 18, 2018

I agree that tsq could probably be made more effective. I showed earlier that, in this test at gigE, it managed well up to 16 greedy flows. That seemed "good enough".

I don't think we see eye to eye on what an "end system" is yet. (?) IF your "end system" definition means "locally managed tcp traffic only - and nothing else - and isn't going to self congest" - then I am in agreement that sch_fq can be a good choice, and a knowledgeable sysadmin should flip it over. But for the vast swath of other possible uses, it isn't, and thus not a safe default for systemd.

I definitely view all networked computers as "routers", so I don't understand what you mean by an "end-system". An application puts or retrieves data. An OS arbitrates and routes it to the right place and back and enforces resource limits of various types.

@dtaht

dtaht commented Aug 18, 2018

Latency "regression" my ass! a factor of 15-25X improvement, no loss in throughput. You can't put in more data than you can get out in a reasonable time - Oh, man, has this thread 'caused a weeklong blood pressure spike on basic bufferbloat principles.

All tcp traffic is latency sensitive. The less rtt, the faster other tcps can react to changes in network conditions; the difference in response time is quadratic. At 10ms observed rtt, if 90 of our 100 flows ended, the other flows claw back bandwidth faster. A TPS benchmark would be useful here, not saturating greedy flows. I get really bugged by researchers always measuring big loads in congestion avoidance rather than lots of smaller flows in slow start.

the fq part, which I think all here agree on (for a change), is great for this also. All flows observe changes in load and react as fast as they can along the observed path, which is far, far slower in the case of a fifo, where one shorter-rtt flow can be very unfair to all the others trying to get their "fair share". As for my contention that most flows never hit their bandwidth allocation and rarely exit slow start, or get hit by codel's generous 100ms burst allowance - a simple tcpdump of your corporate or home lan in normal use suffices, looking at all the packets, not just the tcp (or quic, nowadays) ones. A good measure is observing dns rtt and the amount of dns you see on lans like that.

we have ~10ms of local buffering on this 1gbit 13us path. that's 760 times more buffering than what is required to fill the path. (I'm not crazy, I know painfully well that interrupt latency and cpu cost is too high to get away with much less than a ms, tcp timestamps an issue also - I'm just making a point of how much "better" things could be)

I'd love it if the pacing rate in tsq was fractional enough to handle, essentially, cwnds that needed to be less than 2. I'm going to go look into that...

I'd love it if more folk used the tcp lowwat sysctl - I'd support a sysctl for udp to do the same. Still doesn't fix the problem of so many other potential loads from other sources exceeding the max output rate for long periods of time.

@dtaht

dtaht commented Aug 21, 2018

Not enough people have read this: https://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/ - it's not directly relevant to the discussion at hand, but it helps to deeply grok the impact of rtt on web traffic as discussed there.

@heistp

heistp commented Aug 23, 2018

I'm surprised that replacing fq_codel with sch_fq on end hosts is under consideration. The reduction in intra-flow latency under load (i.e. TCP RTT) is demonstrable in benchmark tests as Dave showed, and afaik, this is rather settled in numerous scenarios at different bandwidths and RTTs, as claimed by CoDel's authors and verified in years of followup testing.

I wasn't quite in harmony with the characterization that the intra-flow latency gains from CoDel are "much less important" than the inter-flow latency gains from fq, and I'm not sure this conversation should have been used in the initial argument. But backing up, I think we can all agree that intra-flow latency for HTTP and other conversational protocols is very important, and hope we can we figure out what gets us there.

It was initially stated that:

packet loss at the sending end-system ... will also increase the overall latency for the affected TCP connections!

We have to be extraordinarily careful when making claims about queueing and congestion control not to make inductive fallacies. Is there any real-world data showing that fq actually leads to lower PLTs than fq_codel, for example?

@fox-mage

fox-mage commented Mar 1, 2019

If you set the queue discipline to any CoDel variant, the non-TCP/IP network stops working.
The world is not limited to TCP/IP networks.
Additionally, the random drop of packets in a stream increases the RTT for that stream several times over: 6x on average, with a rarely observed 22x (!!!) increase. The overall RTT does get smaller, but it becomes harder to look for network problems for a specific user.

@dtaht

dtaht commented Mar 2, 2019

please feel free to put up your test setup so we can have a coherent discussion.

@fox-mage

fox-mage commented Mar 7, 2019

please feel free to put up your test setup so we can have a coherent discussion.

Sorry for the delay, I was away from the computer. I enclose test results from a live school system. As you can see, with fq_codel there is a tcp stream with an rtt of 227018 (µs, i.e. ~227 ms). Such a large delay causes video frames to drop on the student's computer.

receiver - 1 vCPU, RAM 4G, nic vmxnet3
sender - 4 vCPU, RAM 32G, nic vmxnet3, video nvidia P40

OS Debian 9. Kernel 4.9
tcp CC = htcp

command - iperf3 -c X.X.X.X -i 0 -P 36 -t 20 -d | grep _rtt

qdisc = fq

tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 10729
tcpi_snd_cwnd 83 tcpi_snd_mss 1448 tcpi_rtt 7420
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 9990
tcpi_snd_cwnd 190 tcpi_snd_mss 1448 tcpi_rtt 8805
tcpi_snd_cwnd 130 tcpi_snd_mss 1448 tcpi_rtt 8291
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7894
tcpi_snd_cwnd 124 tcpi_snd_mss 1448 tcpi_rtt 8145
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 6898
tcpi_snd_cwnd 37 tcpi_snd_mss 1448 tcpi_rtt 7999
tcpi_snd_cwnd 97 tcpi_snd_mss 1448 tcpi_rtt 8322
tcpi_snd_cwnd 90 tcpi_snd_mss 1448 tcpi_rtt 8161
tcpi_snd_cwnd 39 tcpi_snd_mss 1448 tcpi_rtt 8053
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7586
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7431
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7337
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7818
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7591
tcpi_snd_cwnd 180 tcpi_snd_mss 1448 tcpi_rtt 8092
tcpi_snd_cwnd 78 tcpi_snd_mss 1448 tcpi_rtt 8085
tcpi_snd_cwnd 63 tcpi_snd_mss 1448 tcpi_rtt 8231
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7373
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7761
tcpi_snd_cwnd 68 tcpi_snd_mss 1448 tcpi_rtt 7900
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7361
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7903
tcpi_snd_cwnd 200 tcpi_snd_mss 1448 tcpi_rtt 8127
tcpi_snd_cwnd 127 tcpi_snd_mss 1448 tcpi_rtt 8077
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7383
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7888
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7883
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7662
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7754
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7807
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7795
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 7944
tcpi_snd_cwnd 118 tcpi_snd_mss 1448 tcpi_rtt 8092

qdisc = fq_codel
tcpi_snd_cwnd 187 tcpi_snd_mss 1448 tcpi_rtt 1641
tcpi_snd_cwnd 233 tcpi_snd_mss 1448 tcpi_rtt 1228
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 5711
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 11093
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 6401
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 8731
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 1944
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 12737
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 16561
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 43405
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3558
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 2905
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 4203
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 5305
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3911
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 5451
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3798
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3850
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3808
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 4243
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3869
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3816
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3826
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 6592
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 2373
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 5859
tcpi_snd_cwnd 251 tcpi_snd_mss 1448 tcpi_rtt 1490
tcpi_snd_cwnd 146 tcpi_snd_mss 1448 tcpi_rtt 1703
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 184364
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 25109
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 12433
tcpi_snd_cwnd 222 tcpi_snd_mss 1448 tcpi_rtt 1286
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 6213
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 5425
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 3612
tcpi_snd_cwnd 1 tcpi_snd_mss 1448 tcpi_rtt 227018

@heistp

heistp commented Mar 7, 2019

Fwiw, I cannot reproduce this on the same OS and kernel (Debian 9 / 4.9.0-8) using APU2 hardware as client and server.

Also just mentioning, fq_codel's choice of packets to drop isn't random. Drops are a normal part of congestion control, fq_codel or not. You can avoid them by enabling ECN.

# tc qdisc show dev enp1s0
qdisc fq 8002: root refcnt 9 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028 initial_quantum 15140 low_rate_threshold 550Kbit refill_delay 40.0ms 
# sysctl -w net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_congestion_control = htcp
# iperf3 -c apu2b -i 0 -P 36 -t 20 -d | grep _rtt
net.ipv4.tcp_congestion_control = htcp
tcpi_snd_cwnd 203 tcpi_snd_mss 1448 tcpi_rtt 18828
tcpi_snd_cwnd 204 tcpi_snd_mss 1448 tcpi_rtt 14384
tcpi_snd_cwnd 186 tcpi_snd_mss 1448 tcpi_rtt 21442
tcpi_snd_cwnd 205 tcpi_snd_mss 1448 tcpi_rtt 21740
tcpi_snd_cwnd 204 tcpi_snd_mss 1448 tcpi_rtt 25320
tcpi_snd_cwnd 205 tcpi_snd_mss 1448 tcpi_rtt 15195
tcpi_snd_cwnd 203 tcpi_snd_mss 1448 tcpi_rtt 14094
tcpi_snd_cwnd 198 tcpi_snd_mss 1448 tcpi_rtt 14731
tcpi_snd_cwnd 157 tcpi_snd_mss 1448 tcpi_rtt 16814
tcpi_snd_cwnd 214 tcpi_snd_mss 1448 tcpi_rtt 19405
tcpi_snd_cwnd 151 tcpi_snd_mss 1448 tcpi_rtt 19211
tcpi_snd_cwnd 147 tcpi_snd_mss 1448 tcpi_rtt 14704
tcpi_snd_cwnd 148 tcpi_snd_mss 1448 tcpi_rtt 13937
tcpi_snd_cwnd 146 tcpi_snd_mss 1448 tcpi_rtt 13963
tcpi_snd_cwnd 155 tcpi_snd_mss 1448 tcpi_rtt 19039
tcpi_snd_cwnd 172 tcpi_snd_mss 1448 tcpi_rtt 19873
tcpi_snd_cwnd 151 tcpi_snd_mss 1448 tcpi_rtt 14034
tcpi_snd_cwnd 147 tcpi_snd_mss 1448 tcpi_rtt 14448
tcpi_snd_cwnd 152 tcpi_snd_mss 1448 tcpi_rtt 18876
tcpi_snd_cwnd 154 tcpi_snd_mss 1448 tcpi_rtt 15819
tcpi_snd_cwnd 156 tcpi_snd_mss 1448 tcpi_rtt 18149
tcpi_snd_cwnd 153 tcpi_snd_mss 1448 tcpi_rtt 16956
tcpi_snd_cwnd 149 tcpi_snd_mss 1448 tcpi_rtt 16234
tcpi_snd_cwnd 149 tcpi_snd_mss 1448 tcpi_rtt 18557
tcpi_snd_cwnd 146 tcpi_snd_mss 1448 tcpi_rtt 15681
tcpi_snd_cwnd 150 tcpi_snd_mss 1448 tcpi_rtt 16537
tcpi_snd_cwnd 149 tcpi_snd_mss 1448 tcpi_rtt 18275
tcpi_snd_cwnd 149 tcpi_snd_mss 1448 tcpi_rtt 19147
tcpi_snd_cwnd 152 tcpi_snd_mss 1448 tcpi_rtt 18325
tcpi_snd_cwnd 148 tcpi_snd_mss 1448 tcpi_rtt 16892
tcpi_snd_cwnd 154 tcpi_snd_mss 1448 tcpi_rtt 15112
tcpi_snd_cwnd 157 tcpi_snd_mss 1448 tcpi_rtt 15669
tcpi_snd_cwnd 148 tcpi_snd_mss 1448 tcpi_rtt 14443
tcpi_snd_cwnd 152 tcpi_snd_mss 1448 tcpi_rtt 15190
tcpi_snd_cwnd 154 tcpi_snd_mss 1448 tcpi_rtt 19406
tcpi_snd_cwnd 146 tcpi_snd_mss 1448 tcpi_rtt 15151
# tc qdisc show dev enp1s0
qdisc fq_codel 8002: root refcnt 9 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn 
# sysctl -w net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_congestion_control = htcp
# iperf3 -c apu2b -i 0 -P 36 -t 20 -d | grep _rtt
tcpi_snd_cwnd 105 tcpi_snd_mss 1448 tcpi_rtt 9418
tcpi_snd_cwnd 45 tcpi_snd_mss 1448 tcpi_rtt 8575
tcpi_snd_cwnd 35 tcpi_snd_mss 1448 tcpi_rtt 6271
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 7220
tcpi_snd_cwnd 35 tcpi_snd_mss 1448 tcpi_rtt 5782
tcpi_snd_cwnd 40 tcpi_snd_mss 1448 tcpi_rtt 6663
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6695
tcpi_snd_cwnd 64 tcpi_snd_mss 1448 tcpi_rtt 8536
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 5988
tcpi_snd_cwnd 37 tcpi_snd_mss 1448 tcpi_rtt 6289
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6356
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6204
tcpi_snd_cwnd 62 tcpi_snd_mss 1448 tcpi_rtt 8837
tcpi_snd_cwnd 36 tcpi_snd_mss 1448 tcpi_rtt 7112
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6262
tcpi_snd_cwnd 35 tcpi_snd_mss 1448 tcpi_rtt 6168
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6082
tcpi_snd_cwnd 61 tcpi_snd_mss 1448 tcpi_rtt 8560
tcpi_snd_cwnd 38 tcpi_snd_mss 1448 tcpi_rtt 6518
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6979
tcpi_snd_cwnd 36 tcpi_snd_mss 1448 tcpi_rtt 6862
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6292
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6034
tcpi_snd_cwnd 38 tcpi_snd_mss 1448 tcpi_rtt 6209
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6532
tcpi_snd_cwnd 60 tcpi_snd_mss 1448 tcpi_rtt 8179
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6711
tcpi_snd_cwnd 51 tcpi_snd_mss 1448 tcpi_rtt 7622
tcpi_snd_cwnd 35 tcpi_snd_mss 1448 tcpi_rtt 6399
tcpi_snd_cwnd 40 tcpi_snd_mss 1448 tcpi_rtt 6590
tcpi_snd_cwnd 60 tcpi_snd_mss 1448 tcpi_rtt 8658
tcpi_snd_cwnd 34 tcpi_snd_mss 1448 tcpi_rtt 6552
tcpi_snd_cwnd 35 tcpi_snd_mss 1448 tcpi_rtt 5984
tcpi_snd_cwnd 35 tcpi_snd_mss 1448 tcpi_rtt 7107
tcpi_snd_cwnd 47 tcpi_snd_mss 1448 tcpi_rtt 6720
tcpi_snd_cwnd 33 tcpi_snd_mss 1448 tcpi_rtt 6074

@fox-mage

fox-mage commented Mar 7, 2019

I have a virtual 10G NIC in a VMware environment

@dtaht

dtaht commented Mar 8, 2019

A packet capture would be easier to look at than just cwnd stats. We use the flent.org toolset to analyze a lot of stuff. It's available in most linuxes (apt-get install flent flent-gui netperf for ubuntu, yum for fedora, or via pip install flent).

Totally ok if you don't use that, but it gives you a simple command line, simultaneous capture via tcpdump, and all sorts of useful stuff. To replicate your test with flent (with netperf running on the other side):

tcpdump -i your_interface -w my.cap -s 128 &
flent --socket-stats -H somehostwithnetperf -t title_of_whatever_qdisc_you_are_testing --te=upload_streams=1 tcp_nup
killall tcpdump

This generates a flent.gz file which can be inspected and plotted with flent-gui.

You can get live qdisc stats also with a few other options to flent, over the course of the test. See the man page

But I'd settle for a packet capture of whatever tool you are using, which I can take apart with tcptrace -G and xplot.org

What I infer above, given the cwnd reductions and timeouts, is that there is some sort of mismatch or bug between the vmware instance and the underlying OS (which is?). It could be a TSO/GSO interaction problem, mtu, driver problem, physical wire issue, all kinds of stuff. Certainly any qdisc can drop packets as it's a necessary part of congestion control, as can anything else on the path. Given that both sch_fq and fq_codel are doing something weird I'd expect it's much lower in the stack. What does a pure fifo do? Enabling ecn on your tcps can also be helpful as pete notes.
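A sketch of the checks suggested above (the interface name is a placeholder):

ethtool -k eth0 | grep offload                       # see which offloads are active
ethtool -K eth0 tso off gso off gro off              # temporarily rule out offload interactions
tc qdisc replace dev eth0 root pfifo limit 1000      # try a plain fifo for comparison
sysctl -w net.ipv4.tcp_ecn=1                         # enable ECN as suggested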

@dtaht

dtaht commented Mar 8, 2019

Ah, your tests MIGHT differ because you are using an ancient and unmaintained tcp (hyla?) cc algorithm, or so I guess. what happens with cubic/bbr/reno which are well tested? Still I suspect a problem lower in the stack.....

@heistp

heistp commented Mar 9, 2019

I believe the original test was with htcp, which I also used, so that shouldn't be it. I didn't use VMWare, and there are likely more things about the setup I didn't reproduce. I could install two Debian 9 instances on VMWare Fusion if we thought that'd get us anywhere.

@fox-mage Is there anything in dmesg when these latency spikes occur?

@dtaht

dtaht commented Mar 9, 2019

Another thing to try is using a different vmware virtual network interface. 'round here I tend to use the intel one.

@fox-mage

fox-mage commented Mar 9, 2019

Ah, your tests MIGHT differ because you are using an ancient and unmaintained tcp (hyla?) cc algorithm, or so I guess. what happens with cubic/bbr/reno which are well tested? Still I suspect a problem lower in the stack.....

I only use the htcp tcp CCA for video and for traffic that goes outside the data center; inside the datacenter I use loss/delay-based CCAs - New Vegas, Illinois.

ECN default
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_ecn_fallback = 1

@sergiuhlihor

I've started reading this whole topic with high interest but I fail to see any good advice. I have 3 setups: Ubuntu 14.04, 16.04 and 18.04. In 18.04 the default is fq_codel (vs pfifo_fast on the others), and with default settings I see constant packet drops, one every 2 seconds at idle. Consuming large amounts of data over the network from a database leads to massive packet loss for the same load compared to 16.04 or 14.04. What I can attest is that for large servers with over 20-50K persistent connections, the defaults for 18.04 are significantly worse. I cannot quantify how much is due to fq_codel itself or some other screwups at kernel level, but it's certainly worse on the same hardware, and that's without Spectre/Meltdown patches applied. I'd also be very interested to see which settings scale better with the number of connections. Is anyone aware of good articles on these topics, especially regarding the best settings on latest kernels when it comes to large servers (64-128 cores, 100Gbit network, 100K+ connections)?

@dtaht

dtaht commented Nov 5, 2019 via email

@sergiuhlihor

sergiuhlihor commented Nov 6, 2019

@dtaht Thank you for the hints. I'm starting a series of long running load tests and I will test your settings. For now I actually switched back to pfifo_fast as I saw better behavior.
To mention a few more details about my workloads: I have large servers, 24+ physical cores (128 planned in the future) as individual nodes, and each collects data from 20k to 100K devices (sometimes one data point per second) from outside, so a large number of persistent connections plus large local databases and public APIs for serving our clients. Due to the large amount of data, we also do batch processing on the same server, since the amount of idle CPU power is more than enough. This however leads to periodic micro-overloads where the CPU is fully booked for 100-200ms. At the application level, suddenly having a delay of 100 more ms when the whole database query takes 500ms is not an issue. What I care about is the total latency of the operation. Due to the temporary overload, the packet drop cannot improve latency, it can only make it worse. From my tests, the impact is not significant for the internal network where you have one hop between servers (going to the database from the other server), however when I have an IPsec connection and 20-100K devices connected over 2G with a latency of 500 to 3000 ms, every packet lost puts me at the mercy of the stack behind, where I depend on a chain of God knows how old Cisco routers or worse, a variety of small devices, many with their own embedded OS, many times non-Linux and with a custom TCP stack not properly implemented. In these cases any packet lost can lead to weird behaviours which cannot even be debugged and many times leads to hundreds of hours wasted in investigations. I already have a set of devices where all evidence points to a buggy retransmission mechanism and we have no way to update their firmware or change them completely for the next few years. This is the real world of IoT which I am confronted with daily, and so far the best strategy is to avoid packet loss at any cost. For this reason I have to say that research papers which describe the benefits of fq_codel may be fully decoupled from reality, or are focusing on solving maybe a subset of problems that were common 10 or 5 years ago but are fully outdated today.

A little off-topic, but in my opinion the whole topic of defaults, not only default_qdisc, should be reconsidered. If, for example, the set of defaults is totally unusable for servers, then this translates into hundreds if not thousands of man-hours lost by each company in investigations and fine-tuning. When multiplied by the thousands of companies facing similar issues, you now have millions of man-hours wasted due to poor defaults. This is a sad reality of Linux as a whole and I see it only getting worse. With Ubuntu 14.04 (kernel 3.x) I touched only a handful of parameters, mostly just to support a large number of connections. I did load tests to the point where servers were constantly overloaded and everything was fine. With 18.04 (kernel 4.15.x), even now, after 1.5 years, I do not have a stable set of parameters for the same workload. And this is not only networking. The default IO scheduler was changed in 4.15 from Deadline to CFQ, which completely killed my database. Whoever decided that in 2018 CFQ should be the default, when just about everyone is using SSDs (which at their core are RAID 0 arrays with 32 or 64 devices coupled with large DDR caches) or Intel Optane (which can deliver data faster than the Linux IO stack can process it), should be punished by working with HDDs for the rest of his life. This alone is the most retarded default, set without understanding the hardware market at all.

@dtaht

dtaht commented Nov 9, 2019

Thank you for the hints. I'm starting a series of long running load tests and I will test your settings. For now I actually switched back to pfifo_fast as I saw better behavior.

On this kernel and workload I'd recommend sch_fq over either pfifo_fast or fq_codel. The vast number of timers required to fire is best handled in sch_fq.
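Switching a running interface over, or making it the default for new interfaces, is a one-liner either way (interface name illustrative):

tc qdisc replace dev eth0 root fq
sysctl -w net.core.default_qdisc=fq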

To mention a few more details about my workloads: I have large servers, 24+ physical cores (128 planned in the future) as individual nodes, and each collects data from 20k to 100K devices (sometimes one data point per second) from outside, so a large number of persistent connections plus large local databases and public APIs for serving our clients.

The core ongoing optimizations we see for linux's kernel stack is actually as a server sending data, most of which is driven by google. Something managing 20k persistent connections, inbound, sparse? well I hope you are using epoll at least.

Due to the large amount of data, we also do batch processing on the same server, since the amount of idle CPU power is more than enough. This however leads to periodic micro-overloads where the CPU is fully booked for 100-200ms. At the application level, suddenly having a delay of 100 more ms when the whole database query takes 500ms is not an issue. What I care about is the total latency of the operation. Due to the temporary overload, the packet drop cannot improve latency, it can only make it worse.

This is where your statement is (currently) jumping the shark for me. Usually the flow of packets is entirely regulated inside the kernel. Even if you are pegging the cpu with your db workload, packet service time should be largely unaffected. If you have a query that takes > 250ms to generate a response, then it is possible a tcp keepalive timer will kick in and send a packet, but that's it.

fq_codel only engages when there is a persistent, filled queue that is not draining in under 100ms.

tc -s qdisc show will give you some data from there.
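Concretely, something like this shows whether the qdisc is the thing dropping (interface name illustrative):

tc -s qdisc show dev eth0
# in the fq_codel statistics, watch the "dropped" and "ecn_mark" counters; they
# only grow if the local queue really sat above the ~5ms target for longer
# than the 100ms interval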

From my tests, the impact is not significant for the internal network where you have one hop between servers (going to the database from the other server), however when I have an IPsec connection and 20-100K devices connected over 2G with a latency of 500 to 3000 ms, every packet lost puts me at the mercy of the stack behind, where I depend on a chain of God knows how old Cisco routers or worse, a variety of small devices, many with their own embedded OS, many times non-Linux and with a custom TCP stack not properly implemented.

Once latencies climb past 250ms, all sorts of other rarely invoked portions of tcp stacks begin to show their heads. It is very sad that so few in IoT and elsewhere are focused on making their stacks better - not just in light of bufferbloat, but in terms of overall robustness and stability - and buyers don't care, nor do they have sufficient info or means of recourse.

I would support all sorts of means to make the edge devices "better", ranging from government standards efforts, to certification efforts, to engaging legions of college students to help get it more right. IoT frightens me. The level of basic knowledge about how core network protocols work has deteriorated to such an extent that I often gasp at the level of ignorance "out there", and yet, as a "rocket scientist", I find paying jobs few and far between. I could spend the next several lifetimes making tcp stacks more robust and useful - and save the "thousands of man-hours" others would spend debugging stupidities in the deployment - but so far, no luck. I do what I can in my spare time, taking timeouts from the latest pets.com idea....

But again, in your case, it only matters when the local queue holds more than 5ms worth of backlog for over 100ms - fq_codel's default target and interval.
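
If someone really wanted more headroom for such stalls, the knobs are exposed on the qdisc itself; a sketch with illustrative values, not a recommendation:

```
# relax CoDel's target/interval so short CPU stalls are less likely to trigger drops
sudo tc qdisc replace dev eth0 root fq_codel target 15ms interval 300ms ecn
```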

I do note that I consider induced latencies of over 250ms (more than one time around the planet) as intensely damaging to the internet in general, which is in part why we focus so much on matching the length of the pipe to the actual length in things like fq_codel and BBR.

In these cases any packet lost can lead to weird behaviours which cannot even be debugged and which many times lead to hundreds of hours wasted in investigations.

Um... packet loss is integral to the internet, period.

I already have a set of devices where all evidence points to a buggy retransmission mechanism, and we have no way to update their firmware or replace them completely for the next few years.

Publish that so others can avoid... or fix.

This is the real world of IoT with which I am confronted daily, and so far the best strategy is to avoid packet loss at any cost.

From server to endpoint there are many places where packets can be lost. As for losing packets due to congestion control reasons, there is a lot of work going on for ECN enablement - notably the SCE work, https://tools.ietf.org/html/draft-morton-tsvwg-sce-01 - for congestion control without loss.
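
For classic (pre-SCE) ECN, the relevant knobs already exist today; a sketch, assuming eth0 and a reasonably recent kernel:

```
# 0 = off, 1 = also request ECN on outgoing connections, 2 = accept only (kernel default)
sudo sysctl -w net.ipv4.tcp_ecn=1
# make sure the local qdisc marks ECN-capable flows instead of dropping them
sudo tc qdisc replace dev eth0 root fq_codel ecn
```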

No IoT stack I know of has support for ECN.

I would very much support more work making IoT TCP stacks more robust. I'd even do some of it, if paid to do so.

However, do not avoid dropping packets "at any cost". Latency - and having the most current data points - matters for many valuable internet applications.

For this reason I have to say that research papers which describe the benefits of fq_codel may be fully decoupled from reality, or are focused on solving a subset of problems that were common 5 or 10 years ago but are fully outdated today.

We're still trying to establish the root causes of your troubles here; nothing - so far - points at fq_codel as the real source of your issues. Certainly I agree "times have changed" since the identification of the bufferbloat issue - hardware buffer growth has halted, deployment of fancy algorithms like fq_codel and PIE is well underway, BBR is getting some traction, and BQL (probably the most important technology of all) is on everything running >= 10gbit in Linux.

As for "fully outdated"? No, the bufferbloat problem remains at epidemic proportions. Probably the most successful effort we have ongoing is mostly invisible, but it's making wifi (and wifi 6) a LOT better for handling lots of device and videoconferencing and audio transport. https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf

A little off topic, but in my opinion, the whole topic of defaults, not only default_qdisc, should be reconsidered.

We are trying to optimize for a variety of network devices that span over 8 orders of magnitude of bandwidth - from tens of kilobits to 100gbit - and we simply don't know how to do that. Definitely! More folk should try to identify core use cases and then create either autotuning or good defaults!

I'm in a perpetual battle with the googlers over trying to keep <=gigE stuff working well in Linux; they are perpetually tuning for 40Gbit and above as a "default". The last fight was over GSO by default in sch_cake.
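
For context, cake exposes that choice explicitly; a sketch assuming a reasonably recent tc-cake (keywords from its man page, rates purely illustrative):

```
# split GSO super-packets for lower latency at <= 1GbE
sudo tc qdisc replace dev eth0 root cake bandwidth 1gbit split-gso
# keep super-packets intact to save CPU at very high rates
sudo tc qdisc replace dev eth0 root cake bandwidth 40gbit no-split-gso
```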

If, for example, a set of defaults is totally unusable for servers, then this translates into hundreds if not thousands of man hours lost by each company in investigations and fine-tuning. Multiplied by the thousands of companies facing similar issues, you now have millions of man hours wasted due to poor defaults. This is a sad reality of Linux as a whole and I see it getting only worse.

With Ubuntu 14.04 (kernel 3.x) I touched only a handful of parameters, mostly just to support a large number of connections. I did load tests to the point where servers were constantly overloaded and everything was fine. With 18.04 (kernel 4.15.x), even now, after 1.5 years, I do not have a stable set of parameters for the same workload.

And this is not only networking. The default IO scheduler was changed in 4.15 from deadline to CFQ, which completely killed my database. Whoever decided that in 2018 CFQ should be the default, when about everyone is using SSDs (which at their core are RAID 0 arrays of 32 or 64 devices coupled with large DDR caches) or Intel Optane (which can deliver data faster than the Linux IO stack can process it), should be punished by working his whole life with HDDs. This alone is the most misguided default, set without understanding the hardware market at all.
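
As an aside, the scheduler choice complained about above can be inspected and changed per block device; a sketch with sda as a placeholder:

```
# the active scheduler is shown in [brackets]
cat /sys/block/sda/queue/scheduler
# switch back to deadline ("mq-deadline" on blk-mq kernels, "none" for fast NVMe)
echo deadline | sudo tee /sys/block/sda/queue/scheduler
```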

Certainly it is off topic here, and I'd advise complaining to the right mailing lists. Personally I have found that the constant tuning for dc workloads makes using my laptop running Linux a far less pleasant experience than it used to be. I used to mix audio, do video production, etc., on Linux. No more. OSX folk "get" latency.

@sergiuhlihor
Copy link

sergiuhlihor commented Nov 10, 2019

@dtaht Thanks for the constructive comments. For now I'm going to do some more load tests. Overall kernel 4.15 looks worse compared to kernel 3.x but I cannot say yet if it's due to one parameter having a wrong default or a set of parameters or some regression.

Regarding optimization goals, what I was trying to say is that the load of today and tomorrow is no longer dominated by the server-as-sender pattern. IoT means acquiring large amounts of data from external sources, then processing it on large servers and serving back a small portion of it. This leads to two different problem classes on top of standard content serving:

  • a large number of connections, low bandwidth per connection but mostly incoming and predictable, an aggregated constant incoming bandwidth in the range of 100 to 1000 Mbit, and devices with non-standard TCP implementations which cope well with high end-to-end latencies but not so well with packet loss.
  • a low number of connections, high bandwidth per connection, an aggregated bandwidth of 10-100 Gbit with large variations second to second, both incoming and outgoing, on internal infrastructure

In my infrastructure, IoT receiving bandwidth is by far the most dominant factor. Years ago I would have expected the problems to be on the application side, but in practice this is not the case. I can easily handle 50K to 100K connections, all with a low enough CPU usage (and all in Java) that I can even do the processing on the same node, thus removing most of the need for node-to-node intercommunication. When looking at the Linux ecosystem, what is missing is good segmentation per use case. Trying to apply one size fits all across 8 orders of magnitude of load just does not work. The classes of workloads have been known for years. The missing piece of the puzzle is specialized configurations with optimal or close-to-optimal settings for a given workload, or automatic tuning of the parameters.

@ivanbaldo
Copy link

Don't kill me :-D : what about CAKE by default?

@chromi
Copy link

chromi commented Dec 12, 2019 via email

@dtaht
Copy link

dtaht commented Dec 15, 2019 via email

raphielscape added a commit to RaphielGang/disrupt_kernel_xiaomi_sdm845 that referenced this issue Jan 6, 2020
- Ref : systemd/systemd#9725

Suggested-by: Albert I <[email protected]>
Signed-off-by: Raphiel Rollerscaperers <[email protected]>
@voidzero
Copy link

Cake is designed primarily for last-mile CPE environments, and is relatively heavyweight with features intended to deal with effects seen there. So it might cause excessive CPU load if applied to a 10GbE interface, for little benefit relative to fq_codel. There are cases where you do want Cake's features on such an interface, but you should explicitly choose those cases, not have them foisted on you by a default. - Jonathan Morton

At the risk of being off topic - if I do want to try out cake on an asymmetric connection (ethernet, e.g. eth0 with 500mbit ingress and 40mbit egress), do I set the bandwidth property to the ingress or the egress?
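
For reference, a hedged sketch of the usual approach, not necessarily the answer given below (interface names are placeholders): cake attached to the root of eth0 only sees egress traffic, so the bandwidth parameter there would be the 40mbit figure; the 500mbit ingress side is typically shaped on an ifb device:

```
# egress: cake on the root qdisc shapes outgoing traffic only
sudo tc qdisc replace dev eth0 root cake bandwidth 40mbit

# ingress: redirect incoming traffic through an ifb device and shape it there
sudo modprobe ifb
sudo ip link add ifb0 type ifb
sudo ip link set ifb0 up
sudo tc qdisc add dev eth0 handle ffff: ingress
sudo tc filter add dev eth0 parent ffff: matchall action mirred egress redirect dev ifb0
sudo tc qdisc add dev ifb0 root cake bandwidth 500mbit ingress
```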

@chromi
Copy link

chromi commented Dec 30, 2020 via email

@voidzero
Copy link

@chromi fantastic, something to play around with. Thanks for your response.

@socketpair
Copy link

Finally. What to choose, LOL ?

@PranavBhattarai
Copy link

@socketpair The issue was opened on Jul 26, 2018.

We might wait another few years before choosing the right default.
