Change default qdisc from fq_codel to sch_fq #9725
Comments
@michich opinion on this? |
Please note that the Bufferbloat wiki also recommends sch_fq for servers: |
um. fq_codel is the most general-purpose qdisc there is, handling udp and tcp traffic rationally and equally, while managing queue length, and working with routers and virtual network substrates also. sch_fq is for tcp-serving-heavy workloads, primarily from the datacenter, and does nothing sane for other forms of traffic, including acks in the reverse direction. IF all you are doing is tcp serving at 10GigE+, sch_fq is totally the right thing. Otherwise it can actually be worse than pfifo_fast! At 1GigE, sch_fq's defaults are set too high (for example) for even your typical nas or laptop. One of the big benefits of sch_fq (pacing) arrived for all tcp connections in recent kernels, so now fq_codel takes advantage of pacing also. My vote is you keep fq_codel as the default, and I'll try to clarify the referenced bufferbloat page, and show some benchmarks as to why. I'd also made some lengthy comments on your ecn enablement side on another bug report. |
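For anyone following along, it is easy to check which qdisc a given machine is actually using before taking sides; a minimal sketch (eth0 is just a placeholder interface name):

```sh
# which qdisc the kernel will attach to newly created interfaces
sysctl net.core.default_qdisc

# what is attached right now, with drop / overlimit / backlog counters
tc -s qdisc show dev eth0
```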
#9748 on that front. @roland-bless I have no idea how you are turning bufferbloat.net's general recommendations around. The best all-round general-purpose default for linux remains fq_codel. tcp serving from the dc is not what you want to optimize your typical linux distro user for. |
ah, I see the thread: https://lists.bufferbloat.net/pipermail/bloat/2018-June/thread.html which is all over the place, and I'd had pithy comments. one of the problems with your 100 flow test is that in that case, current linux tcps cap cwnd reductions, so they never get the right rate anymore at gigE speeds, lots of flows, and short rtts. In fact, fractional cwnds would be nice... But you'll find the achieved bandwidth is exactly the same (despite the retransmits), while the sch_fq packet captures (.caps) will show an ever-expanding rtt. You want things like fq_codel to regulate that behavior, not just for tcp, but for things like webrtc, which doesn't run over tcp. All traffic, not just tcp-serving workloads. |
Technically, is there any reason there has to be THE one? Why not converge on the compromise that every distro makes its own default choice? Most distros are already churned out in explicit flavours for e.g. back-end or desktop use cases ... |
fq_codel is great for forwarding devices that can easily drop on dequeue, but that's not the point. The point is that it's not reasonable to lose the backpressure signal in sending end-systems by using fq_codel.
It also favors new/small flows, like the fq in fq_codel, so the benefit for those packets is the same as in fq_codel. With respect to the test case: 100 concurrent flows from a 1GigE web server isn't unrealistic. It shows that fq_codel drops outgoing packets, because TCP congestion control doesn't work so well here either, since the load in terms of number of flows is the problem. sch_fq doesn't have to drop packets, because it propagates backpressure to applications locally. Throughput and latency are comparable, and sch_fq works quite well even in this scenario (we didn't adjust any sch_fq parameters).
That's not the point, because you want to use backpressure in the sending end-system. sch_fq works also for UDP or other traffic. One point is that in the mentioned scenario, fq_codel cannot even control the TCP flows so well, so the same would happen with UDP flows (which should also be congestion controlled). |
Roland, you are wrong on multiple fronts here. I don't really want to take the time to exhaustively write this - it's that your viewpoint is "I'm the web server in the datacenter" that sticks in my craw. What can I do to convince you my points are valid? What concrete set of experiments? Systemd runs on everything from iot to laptops to virtual hosts to servers. Having a good default that works across the widest range is what we want. "I'm the laptop on the edge of the network." "sch_fq works also for UDP or other traffic." No, it doesn't manage queue length in that case. Next question. That really should be the end of this debate!!! "One point is that in the mentioned scenario, fq_codel cannot even control the TCP flows so well" Look at the rtts and throughput in both cases. Try using the flent tool to sample rtts. Things are being controlled; the fact that the local tcp cannot in this case match the rate is more the fault of today's tcps. [1] I miss reno sometimes. "so the same would happen with UDP flows (which should also be congestion controlled)." "can" happen. When we are pushing more data out than we can match rates on, what do you want to happen? grow the buffer? Have backpressure for udp? You don't have backpressure through the whole system, you have drop, that's it.
fq_codel applies global regulation when each application only has its own narrow viewpoint.
sch_fq is great for serving tcps in the datacenter. It may even be a good choice for your local web server example. Although that example is flawed, using only 100 greedy, continuously running flows. A real webserver handles a variety of flows from 1 packet up to (last I checked) about 2Mbytes in size. A much better benchmark of a web server - nfs server - etc. is transactions per sec across a workload, where most of the workload lives in slow start. The vast majority of flows, being rtt-bound in their ramp, or size limited, will never exceed their fair share and get hit by codel. I'd like to clearly establish where we are disagreeing or not. You've made several assertions that are provably false. Others, like the amount of backpressure needed, or your distaste for retransmits on a short path where it does not matter... are something of a matter of taste. My "taste" leans towards never having much more than 5ms local buffering, except for what is needed to handle bursts. There's buffering in the socket (see tcp_notsent_lowat), in the app, in the qdisc, and in the network. [1] You can make an argument that fq_codel should interact with the local tcp stack better than it currently does, under conditions of extreme load. Or you can argue the local tcp stack should shrink its demands better. You can also argue that bql tends towards overbuffering also. But I think first up would be looking into how tsq and sch_fq currently interact at 1gbit rates. Because of this whole thread I did go looking at 100+ flow tests on short paths, at bbr's current behaviors, and at ecn, while looking at cake, sch_fq, and fq_codel, but I didn't get iproute2-next to let me sample buffer sizes again until a commit arrived for fixing, ironically, its overbuffered output buffer! |
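As a rough illustration of the socket-level buffering knob referred to above, a hedged sketch (the 128 KB value is arbitrary, purely for illustration):

```sh
# cap the amount of not-yet-sent data each TCP socket may queue in the kernel
sysctl -w net.ipv4.tcp_notsent_lowat=131072

# applications can apply the same limit per socket with the TCP_NOTSENT_LOWAT
# socket option instead of changing the system-wide default
```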
Dave: my point is not about web-servers in a data center, it's about using fq_codel in the end-system, which can cause harm. So we have the case where the sender is the only bottleneck. Why should I abuse congestion control signals and add RTT latency when I can manage this locally? Congestion control should prevent overloading the network, but it's not adequate to manage overload in the end-system. In our test case you can see that congestion control doesn't work, while local backpressure mechanisms do work.
Show me a case where fq_codel performs significantly better than sch_fq in an end-system.
Basically, we have AQMs to deal with the fault of today's TCPs... :-)
Hm, so flow isolation still works, but UDP flows may suffer from self-inflicted queuing delay
Both were fine, as described.
This could also be retransmits on a long path, where they do matter. So, yes, my main point is that we should use local backpressure instead of congestion signals in order to avoid these retransmits. |
Flow isolation still works, but it's a second-class citizen in sch_fq. Queue length management reverts to tail drop on a 10,000 packet (and GSO enabled, thus 64k possible per packet) queue. All flows not directly managed by the local tcp stack do not get backpressure. This includes tcp flows from vms, stuff flowing through hypervisors, encapsulated traffic from vpns and containers and network namespaces, udp from any source including quic, webrtc, voip, dns, gaming, and attack traffic, or any other protocol. So any of these flows can self-inflict queuing delay, and by being present still inflict some delay on other flows. To me this ends the debate over sch_fq as a good default! Does that argument work on you yet?
OK, well, that depends on what you consider as a valid test. Would a MOS score of voip flows taken against an also-tcp-loaded server work? Or a measurement of self-inflicted latency from webrtc? or locally vpn'd traffic competing with local traffic? Bittorrent? "end-system" partially depends on whether you are a client or server, on wifi or ethernet. Certainly the fq_codel queue management we did for wifi routers also applies to clients, and we should probably do a followup on that paper running the same algo on both sides ( https://arxiv.org/pdf/1703.00064.pdf ). Our viewpoint as to "performs" might differ. I'm all about low latency and filling the pipe, not the queue. Forcing tcp to back off and retransmit "does no harm" so long as utilization is 100%.
Just to clearly establish things for those in the audience, retransmits "fill in the hole". If your rtt is short, the signal gets through faster. retransmits do cause additional work on behalf of both sides of the stack.
It only matters if the remaining duration of your transaction is less than the RTT, or you've dropped the last packet in flight, forcing an RTO. as the rtt grows, the need to do congestion control via dropping declines dramatically. One drop means a lot. So we could repeat your 100 flow test over a 100ms or 1000ms rtt, to show that (or we could dig up a paper on relationship between drop rates and rtt) Yes, I totally agree we should use local backpressure to avoid retransmits whenever possible. pfifo_fast does tail drop, which is not actually dropped, but pushes a cwnd reduction into it and reposts the packet. sch_fq + TSQ applies local backpressure and cwnd (and even better, pacing) (but does grow the self inflicted apparent rtt). fq_codel does head drop, which means you generally won't see an RTO, as the flow's next packet immediately behind it is delivered, thus the receiving tcp notices the hole and asks that it gets filled in. we did produce a tail dropping version of fq_codel at one point. yep, local tcp stack backpressure. But it hurt all other applications (as noted above), that can't observe that backpressure, to not get the earliest congestion signal possible. It matters to voip/video/dns to drop the stalest packet, in particular, and tcps in general not synchronized bulk drop (as what happens when drop tail is highly congested) Also, often, when fq_codel is in a dropping state the local total queue is so full in the first place that by the time it empties it's already got the ack from the other side, indicating please reduce your window and fill in this hole - there's often (on short paths) waaay more than a bdp in there when overloaded in the first place. I think we are both in agreement that having min/max fair fq (sch_fq, fq_codel, and now sch_cake) is better than a fifo queue?
Even if I could step back to 1986 and mandate a delay-based, rather than drop-based, tcp, it wouldn't have worked in today's highly variable rtt environment. The need for aqm was understood, and if only red had worked - or drr/sfq been applied more universally - we'd have had a better internet. I got involved in all this because I ran an ISP in 1993-96 and still have the scars on my back... If I could step back to 2000 and reserve 3 bits for QoS and 5 for ecn, instead of the diffserv mess, that would have helped. I'm pretty sure, if I could go back to 1989 and the first ietf meeting, and stood shoulder to shoulder with John Nagle about the need for fair queuing everywhere, that would have made a difference! Most congestion would thus be self-inflicted and applications could just do themselves in.
What harm? Retransmits and congestion control are totally normal, highly optimized aspects of tcp's behavior. No harm, and a general benefit. A light tap here and there to reduce self-inflicted congestion in the general case.
Our definition of working is different here. You want no retransmits, and effective backpressure. I'm saying that effective backpressure is impossible in the general case. Utilization is 100% in both cases. The same amount of data is transferred. Locally observed RTT with fq_codel is lower (probably - I'd have to go look as I mentioned earlier; it's a per-flow curve with tsq in place, and pacing now helps a lot, and I only last week got iproute2 fixed). As I noted on your test case, it's not an example of a web workload, either. It's "working". I too want effective backpressure. I'd like it if modern tcps were IW4, not IW10, that cubic backed off .5 rather than .7, that tcps still reduced to minimal cwnd AND that, since fractional cwnd is impossible, they used pacing instead, and that tcps reduced their mss size when under congestion also. I would not mind at all if the head dropping of packets in fq_codel immediately forced a cwnd or pacing reduction locally. Hmm... that might be feasible... similarly a local response to seeing ecn exerted... this is not stuff the main folk working at the big g care about.
so far as I know - and like I said, I have to go look as it has been a while - sch_fq + TSQ add RTT latency at a minimum of 4 packets outstanding locally per flow. (I will revise this statement after doing the work, but I did ask you to take packet caps of sch_fq in your benchmark). I would not mind if it dropped to 1 packet, nor would I mind if it then started reducing packet size - but the DC guys simply don't see problems at 1GigE and below - they would generally love to be able to self congest but are otherwise out of cpu. So you are making an assertion - that sch_fq is not contributing to rtt latency - that I currently doubt is true. I observe HUGE buffer sizes in sch_fq when I look at it by eyeball as I add flows. |
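To make the "defaults are set too high" point concrete, a hedged sketch of the sch_fq knobs involved (the lowered values are illustrative only; consult tc-fq(8) for the authoritative defaults on a given kernel):

```sh
# sch_fq defaults to roughly a 10000-packet overall limit and 100 packets
# queued per flow; both can be lowered when attaching the qdisc
tc qdisc replace dev eth0 root fq limit 1000 flow_limit 32
```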
Not quite relevant but I had to get this off my chest: IMHO, the ideal self-inflicted delay is 1 packet. Ironically, as we go to higher rates, latency suffers further on the host, as we need more buffering because we can't handle interrupts fast enough. So you'll find - in BQL - that self-inflicted delays at 100mbit are in the 2 packet, a few dozen usec range, and often a few msec at 1GigE. In other words, going at higher rates, even with all this newfangled fq/aqm stuff in place, fairly fat fifo queues grow at the device driver, and 100mbit networks can actually have less latency than gig+ ones because of interrupt latency and fq sitting above it. BQL is a godsend, because 100mbit networks had gone completely to hell prior to its development and the addition of GSO and big ring buffers to linux. I wish we could do better. |
OK, I set up a brief test using network namespaces to make the point. I did not go to any great extent to make it terribly scientific; I'd much rather you repeat my tests to make the point to yourself. (if you've pulled this in the last 20 minutes, I updated it with newer and more data.) This is the sch_fq result - 150-250ms delays on the netns'd cubic flow - (probably more regulated by all the other tcps competing and their RTTs than by drop). The observed RTTs in this test (it's in the cap and xplot data) for fq_codel as the ending qdisc rather than sch_fq: This lines up with the flent measurement as well, with fq_codel's ~10ms RTTs, is here: drop counts were much higher for fq_codel (100x?) (I still haven't fixed my sampler) - but throughput, identical. RTTs 1/15 or better that of the alternative. Just for giggles, I did a couple tests with flent's tcp_square_wave test (4 flows, two cubic, two bbr). the cubic result - even for only two cubic flows going through sch_fq - was painful. Given the limited number of flows on this test, the difference in drops was much better: bandwidth identical: but which level of latency do you want for your tcp flow? (You can certainly see a compelling advantage to bbr over cubic in this test also). Either FQ system gives it a fair share to start with, and then BBR probes for the right rate (those drops in throughput every 10sec), and gets it. If it were competing with an overlarge fifo, and not self congested, it would be uglier still. I'd rather like it if fq made it across the edges of the internet, and then all sorts of congestion controls would work way better. |
in terms of the local stack only (no netns), TSQ works pretty well in both the sch_fq and fq_codel cases. Over 60 seconds at gigE, 8 full-rate flows going through either sch_fq or fq_codel never drop a packet. 16 flows drop 5 "packets" with fq_codel, none with sch_fq. you can't count conventional "packets" anymore as most of these are TSO and greater than 1514 bytes - but that was out of 7341269449 bytes sent. at 15 flows it drops 3 packets. In both cases there was no difference in throughput, and the same rtt. |
There's a really simple answer to all this that nobody's emphasised yet: TURN ON ECN. That will let AQM sort out the congestion backpressure without incurring packet losses and retransmissions. More and more end-host platforms are turning on ECN by default. Shouldn't systemd do the same? |
Yes, starting with systemd v239, see #9143 and the update to the NEWS file. We've had one issue (#9748) reported related to ECN though. |
I see. So why is the OP seeing packet loss with fq_codel? Has Ubuntu overridden systemd's default ECN setting? |
@chromi per that bug note, I retain grave doubts about ecn universally unless tcps evolve a better response. With people pushing it to have even less response to loss than cubic does instead of more, as in ( https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn-09 ), with no response defined towards drop and CE simultaneously in an RTT, with the extra damage an ECN-enabled DDOS can do, and with codel not increasing its signalling rate on overload for ecn vs normal packets... at this stage in the game my vote remains to leave it off by default until more things get sorted out. You really should spend more time looking at queue depths in cake with ecn on and off, at high loads. |
@chromi Ubuntu 18.04LTS still ships systemd 237. I haven't checked whether they changed the default for ECN (which, at that point, was still off), but I imagine they didn't. I guess if @roland-bless would apply that same change locally (it's just a sysctl config, could even be configured dynamically on the local system) he might be able to tell whether that makes a difference? |
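For reference, the setting being discussed is the net.ipv4.tcp_ecn sysctl; a minimal sketch of checking and changing it locally (values per the standard kernel documentation, nothing systemd-specific):

```sh
# 0 = off, 1 = request ECN on outgoing and accept on incoming connections,
# 2 = accept only when the peer requests it (the kernel's own default)
sysctl net.ipv4.tcp_ecn

# enable at runtime
sysctl -w net.ipv4.tcp_ecn=1

# or persistently, via a sysctl.d drop-in
echo 'net.ipv4.tcp_ecn = 1' > /etc/sysctl.d/90-ecn.conf
sysctl --system
```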
@chromi - systemd has only had ecn "on" for a few weeks. This bug report is about a separate request to switch systemd's default to sch_fq, which is a horrible idea that I just spent several days attempting to refute, per above. Yep, tcp ecn on + fq_codel kills the retransmits the original poster was complaining about in this 100 flow test. Ran that. But the retransmits do no harm in the context of this test and self inflicted rtts with ecn off are lower than the actual path delay. |
@chromi Yes, ECN avoids packet loss, but the main point was to use local backpressure mechanisms instead of abusing congestion control signals, letting them be returned by the other end and only then reacting to them. This costs you at least an RTT, whereas local feedback provides backpressure signals immediately. |
@dtaht I'm currently on vacation and travelling, so I cannot respond at high frequency. Maybe Mario will chime in.
I see, but usually you have virtual switches for VMs and they should use fq_codel then. It would probably make sense to also limit the number of locally queued packets inside the OS in those cases, too.
VoIP flows should benefit from fq's flow isolation. WebRTC should not cause self-inflicted delay due to the corresponding congestion control there (NADA, SCREAM, etc.). I'll respond to the other stuff later if time permits... |
@roland-bless - I took the time to do the easiest counter-example - network namespaces - in a long part of the post above. I imagine you didn't read that far. I'm on vacation also. In the mere context of this bug report, which involves you asking to switch systemd over to sch_fq, no "shoulds" or "usually"s or other forms of wishful thinking can apply. What actually exists - the situation where tons of different kinds of unregulated flows exist, on the wide variety of millions of possible systemd installations, the situation that exists today - is what needs to drive the engineering decision about the correct, best, basic default. sch_fq is worse than pfifo_fast in these respects, and fq_codel is the overall winner. I'd sooner revert systemd to pfifo_fast (packet limit 1000) than sch_fq for a general purpose qdisc for the general public. But it's not my call, either. I have no say in this matter, no connection with systemd at all. If someone hadn't mentioned this "bug" on the bloat mailing list I'd have not shown up and felt compelled to educate and argue. I support knowledgeable sysadmins and distros changing their default to anything they choose based on their workload. But: can we "close" that part of this bug report? That we're done discussing changing the default? I'm hoping that my network namespace example suffices to prove for you and those in the audience that sufficient backpressure does not exist for a wide variety of common applications. Then we can go and discuss making the shoulds and usuallys into things that always are. Can we close that part of this bug report? (and btw, (bikeshedding!) particularly on wifi, webrtc does self-inflict delay (managed beautifully by the fq_codel for wifi stuff ( https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf ), not anywhere else), and my preferred congestion control is google's ( https://tools.ietf.org/html/draft-ietf-rmcat-gcc-02 ). I don't know if NADA or SCREAM actually got implemented in a shipping browser? I rather liked an early version of nada. But I'm off topic and all I want to do is shut the conversation down about foolishly changing systemd's default. Can we do that yet? If you say yes (or hopefully, some of the systemd folk watching the fireworks?), I'll let you go enjoy your vacation. And I can go back to mine. |
Oh core systemd folk? @poettering @michich -I don't know who else is core to systemd - can I go back to something else in life besides this bug report now? Your call, I made my points as best I could, and I'm going to logout now. |
for what it's worth I agree with Dave to keep fq_codel |
dtaht said: This is only true if the application using TCP is a bulk transfer application. Many important applications (e.g. web traffic, RPC traffic) care about latency, and are harmed by the added latency for loss recovery forced by drops from fq_codel. I agree with Roland's point that fq_codel, in its current implementation, does not seem like a good fit for end systems using TCP. For the current fq_codel and fq implementations, for end systems that use TCP, fq seems like a better fit than fq_codel. IMHO for systems with TCP traffic it is not a good trade-off to knowingly impose latency regressions on latency-sensitive TCP applications in order to provide a better back-pressure signal for some subset of UDP applications that will use the drops. I suspect it is possible to enhance fq_codel to avoid this penalty for local TCP traffic: for example, not dropping traffic that is using TSQ. That way, TCP traffic is not dropped, but UDP traffic is dropped. But I don't think the required mechanisms are in place yet. |
I agree that tsq could probably be made more effective. I showed earlier that on this test it managed well at gigE with 16 greedy flows. That seemed "good enough". I don't think we see eye to eye on what an "end system" is yet. (?) IF your "end system" definition means it uses "locally managed tcp traffic only - and nothing else - and isn't going to self congest" - then I am in agreement that sch_fq can be a good choice, and a knowledgeable sysadmin should flip it over. But for the vast swath of other possible uses, it isn't, and thus not a safe default for systemd. I definitely view all networked computers as "routers", so I don't understand what you mean by an "end-system". An application puts or retrieves data. An OS arbitrates and routes it to the right place and back and enforces resource limits of various types. |
Latency "regression" my ass! a factor of 15-25X improvement, no loss in throughput. You can't put in more data than you can get out in a reasonable time - Oh, man, has this thread 'caused a weeklong blood pressure spike on basic bufferbloat principles. All tcp traffic is latency sensitive. The less rtt, the faster other tcps can react to changes in network conditions, the difference in response time is quadratic. At 10ms observed rtt, if 90 of our 100 flows ended, the other flows claw back bandwidth faster. A TPS benchmark would be useful here, not saturating greedy flows. I get really bugged by rasearchers always measuring big loads in congestion avoidance rather than lots of smaller flows in slow start. the fq part, which I think all here agree on (for a change), is great for this also. All flows observe changes in load and react as fast as they can along the observed path, which is far, far slower in the case of a fifo, where one shorter rtt flow can be very unfair to all the others trying to get their "fair share". As for my contention that most flows never hit their bandwidth allocation and rarely exit out of slow start, or get hit by codel's generous 100ms burst allowance - a simple tcpdump of your corporate or home lan suffices in normal use suffices, looking at all, not just the tcp (or quic, nowadays) packets. A good measure is observing dns rtt and the amount of dns you see on lans like that. we have ~10ms of local buffering on this 1gbit 13us path. that's 760 times more buffering than what is required to fill the path. (I'm not crazy, I know painfully well that interrupt latency and cpu cost is too high to get away with much less than a ms, tcp timestamps an issue also - I'm just making a point of how much "better" things could be) I'd love it if the pacing rate in tsq was fractional enough to handle essentially cwnds that needed to be less that 2. ? I'm going to go look into that... I'd love it if more folk used the tcp lowwat sysctl- I'd support a sysctl for udp to do the same. Still doesn't fix the problem of so many other potential loads from other sources exceeding the max output rate for long periods of time. |
Not enough people have read this: https://gettys.wordpress.com/2013/07/10/low-latency-requires-smart-queuing-traditional-aqm-is-not-enough/ - it's not directly relevant to the discussion at hand, but it helps to deeply grok the impact of rtt on web traffic as discussed there. |
I'm surprised that replacing fq_codel with sch_fq on end hosts is under consideration. The reduction in intra-flow latency under load (i.e. TCP RTT) is demonstrable in benchmark tests as Dave showed, and afaik this is rather settled in numerous scenarios at different bandwidths and RTTs, as claimed by CoDel's authors and verified in years of followup testing. I wasn't quite in harmony with the characterization that the intra-flow latency gains from CoDel are "much less important" than the inter-flow latency gains from fq, and I'm not sure this conversation should have been used in the initial argument. But backing up, I think we can all agree that intra-flow latency for HTTP and other conversational protocols is very important, and hope we can figure out what gets us there. It was initially stated that:
We have to be extraordinarily careful when making claims about queueing and congestion control not to make inductive fallacies. Is there any real-world data showing that fq actually leads to lower PLTs than fq_codel, for example? |
If you set the queue discipline to any CoDel variant, non-TCP/IP networking stops working. |
please feel free to put up your test setup so we can have a coherent discussion. |
Sorry, I was away from the computer. I enclose test results from a live school system. As you can see, with fq_codel there is a tcp stream with an rtt of 227018 (µs). Such a large delay causes video frames to drop on the student's computer.
receiver: 1 vCPU, 4 GB RAM, vmxnet3 NIC, Debian 9, kernel 4.9
command: iperf3 -c X.X.X.X -i 0 -P 36 -t 20 -d | grep _rtt
qdisc = fq: tcpi_snd_cwnd 1, tcpi_snd_mss 1448, tcpi_rtt 10729
qdisc = fq_codel |
Fwiw, I cannot reproduce this on the same OS and kernel (Debian 9 / 4.9.0-8) using APU2 hardware as client and server. Also just mentioning, fq_codel's choice of packets to drop isn't random. Drops are a normal part of congestion control, fq_codel or not. You can avoid them by enabling ECN.
|
I have a virtual 10G NIC in a VMware environment. |
A packet capture would be easier to look at than just cwnd stats. We use the flent.org toolset to analyze a lot of stuff. It's available in most linuxes (apt-get install flent flent-gui netperf on ubuntu, yum on fedora, or via pip install flent). Totally ok if you don't use that, but it gives you a simple command line, simultaneous capture via tcpdump, and all sorts of useful stuff. To replicate your test with flent (with netperf running on the other side), start a capture with tcpdump -i your_interface -w my.cap -s 128 & and then run the flent test. This generates a flent.gz file which can be inspected and plotted with flent-gui. You can get live qdisc stats also with a few other options to flent, over the course of the test. See the man page. But I'd settle for a packet capture of whatever tool you are using, which I can take apart with tcptrace -G and xplot.org. What I infer above, given the cwnd reductions and timeouts, is that there is some sort of mismatch or bug between the vmware instance and the underlying OS (which is?). It could be a TSO/GSO interaction problem, mtu, driver problem, physical wire issue, all kinds of stuff. Certainly any qdisc can drop packets, as it's a necessary part of congestion control, as can anything else on the path. Given that both sch_fq and fq_codel are doing something weird, I'd expect it's much lower in the stack. What does a pure fifo do? Enabling ecn on your tcps can also be helpful as pete notes. |
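A hedged sketch of the kind of flent run described above (rrul is one of flent's standard tests; the interface name and server address are placeholders):

```sh
# server side: netperf's daemon must be running
netserver &

# client side: capture packets while a 60 s rrul test loads the link
tcpdump -i eth0 -w my.cap -s 128 &
flent rrul -H <server-ip> -l 60 -t "qdisc-comparison"
kill %1    # stop the capture

# inspect and plot the resulting *.flent.gz file interactively
flent-gui *.flent.gz
```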
Ah, your tests MIGHT differ because you are using an ancient and unmaintained tcp cc algorithm (hybla?), or so I guess. what happens with cubic/bbr/reno, which are well tested? Still, I suspect a problem lower in the stack..... |
I believe the original test was with @fox-mage. Is there anything in dmesg when these latency spikes occur? |
Another thing to try is using a different vmware virtual network interface. 'round here I tend to use the intel one. |
I only use the htcp tcp cca for video and for traffic going outside the data center, and loss/delay-based ccas - new vegas, illinois - inside the datacenter. ECN is at its default. |
I've started reading this whole topic with high interest but I fail to see any good advice. I have 3 setups: Ubuntu 14.04, 16.04 and 18.04. In 18.04 the default is fq_codel (vs pfifo_fast on the others), and with default settings I see constant packet drops, one every 2 seconds when idle. Consuming large amounts of data over the network from a database leads to massive packet loss for the same load compared to 16.04 or 14.04. What I can attest is that for large servers with over 20-50K persistent connections, the defaults for 18.04 are significantly worse. I cannot quantify how much is due to fq_codel itself or some other screwup at kernel level, but it's certainly worse on the same hardware, and that's without Spectre/Meltdown patches applied. I'd also be very interested to see which settings scale better with the number of connections. Is anyone aware of good articles on these topics, especially regarding the best settings on latest kernels when it comes to large servers (64-128 cores, 100Gbit network, 100K+ connections)? |
At one point, I might argue in favor of changing this long thread to a
separate bug, as I was hoping the idea of "changing the default" had
been killed thoroughly in context with all the other uses of a qdisc.
tc -s qdisc show dev your_device will show how many drops/marks fq_codel
is inducing. I imagine this is a mq'd device?
If that matches your observed drop count (from where?), then
you can point a finger at fq_codel struggling to hold on-host latencies
below 5ms on this workload. Otherwise, you may be having another problem
deeper in the stack, and your "dropping every 2 seconds thing" hints at something deeper than the qdisc.
Easiest thing to do is try switching your qdisc over to sch_fq, best
thing to do is connect to your database server with ecn enabled
on your tcps, so you can do rate control without loss.
Note that "drops" are not necessarily a bad thing. Better are
actual measurements of throughput and latency on your workload.
All drops do is tell tcp to slow down and try to match the rate
to the pipe. Usually (because fq_codel does head drop) this is
pretty invisible to the application and results in decreased latency
for all traffic.
As for other qdiscs - the trade-off is latency in the stack. Latencies
can climb hugely with sch_fq and large numbers of connections, so much
so, that a common thing to do with that is to add a bpf filter to hard
exert drops and marks after the latency target is exceeded.
It would help in this context if you would identify your kernel, and
my advice for a workload like this would indeed be sch_fq + ecn
enablement + that ebpf filter I mentioned.
There's a long list of other tcp tunables worth trying at a workload
like this.
|
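A minimal sketch of the suggestion above (eth0 is a placeholder; on a multiqueue NIC you may prefer changing net.core.default_qdisc and re-attaching mq rather than flattening the per-queue setup):

```sh
# switch one interface's root qdisc to sch_fq, effective immediately
tc qdisc replace dev eth0 root fq

# enable ECN so senders can be signalled without drops
sysctl -w net.ipv4.tcp_ecn=1

# watch drop/mark and backlog counters while the workload runs
watch -n 1 tc -s qdisc show dev eth0
```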
@dtaht Thank you for the hints. I'm starting a series of long-running load tests and I will test your settings. For now I actually switched back to pfifo_fast as I saw better behavior. A little offtopic, but in my opinion the whole topic of defaults, not only default_qdisc, should be reconsidered. If, for example, a set of defaults is totally unusable for servers, then this translates into hundreds if not thousands of man-hours lost by each company on investigation and fine-tuning. When multiplied by the thousands of companies facing similar issues, you now have millions of man-hours wasted due to poor defaults. This is a sad reality of Linux as a whole and I see it getting only worse. With Ubuntu 14.04 (kernel 3.x) I touched only a handful of parameters, mostly just to support a large number of connections. I did load tests to the point where servers were constantly overloaded and everything was fine. With 18.04 (kernel 4.15.x), even now, after 1.5 years, I do not have a stable set of parameters for the same workload. And this is not only networking. The default IO scheduler was changed in 4.15 from Deadline to CFQ, which killed my database completely. Whoever decided that in 2018 CFQ should be the default, when about everyone is using SSDs (which at the core are RAID 0 arrays with 32 or 64 devices coupled with large DDR caches) or Intel Optane (which can deliver data faster than the Linux IO stack can process it), should be punished by working their whole life with HDDs. This alone is a terrible default, set without understanding the hardware market at all.
On this kernel and workload I'd recommend sch_fq over either pfifo_fast or fq_codel. The vast number of timers required to fire is best handled in sch_fq.
The core ongoing optimization we see for linux's kernel stack is actually as a server sending data, most of which is driven by google. Something managing 20k persistent connections, inbound, sparse? well, I hope you are using epoll at least.
This is where your statement is (currently) jumping the shark for me. Usually the flow of packets is entirely regulated inside the kernel. Even if you are pegging the cpu with your db workload, packet service time should be largely unaffected. If you have a query that takes > 250ms to generate a response, then it is possible a tcp keepalive timer will kick in and send a packet, but that's it. fq_codel only engages when there is a persistent, filled queue that is not draining in under 100ms. tc -s qdisc show will give you some data from there.
Once latencies climb past 250ms, all sorts of other rarely invoked portions of tcp stacks begin to kick in. I would support all sorts of means to make the edge devices "better", ranging from government standards efforts, to certification efforts, to engaging legions of college students to help get it more right. IoT frightens me. The level of basic knowledge about how core network protocols work has deteriorated to such an extent that I often gasp at the level of ignorance "out there", and yet... But again, in your case, it only matters when the local queue fills past 5ms full of stuff with fq_codel for over 100ms. I do note that I consider induced latencies of over 250ms (more than one time around the planet) as intensely damaging to the internet in general, which is in part why we focus so much on matching the buffering to the actual length of the pipe in things like fq_codel and BBR.
Um... packet loss is integral to the internet, period.
Publish that so others can avoid... or fix.
From server to endpoint there are many places where packets can be lost. As for losing packets due to congestion control reasons, there is a lot of work going on for ecn-enablement - notably the SCE work - https://tools.ietf.org/html/draft-morton-tsvwg-sce-01 - for congestion control without loss. No IoT stack I know of has support for ECN. I would very much support more work making iot tcp stacks more robust. I'd even do some of it, if paid to do so. However, we should not try to avoid dropping packets "at any cost". Latency - and the most current data points - matter for many valuable internet applications.
We're still trying to establish root causes of your troubles here; nothing - so far - points at fq_codel as the real source of your issues. Certainly I agree "times have changed" since identification of the bufferbloat issue - hardware buffer growth has halted - deployment of fancy algorithms like fq_codel and pie is well underway - bbr is getting some traction - bql (probably the most important technology of all) is on everything running >= 10gbit in linux. As for "fully outdated"? No, the bufferbloat problem remains at epidemic proportions. Probably the most successful effort we have ongoing is mostly invisible, but it's making wifi (and wifi 6) a LOT better at handling lots of devices and videoconferencing and audio transport. https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf
We are trying to optimize for a variety of network devices that spans over 8 orders of magnitude of bandwidth - from tens of kbit to 100gbit - and we simply don't know how to do that. Definitely! More folk should try to identify core use cases and then create either autotuning or good defaults! I'm in a perpetual battle with the googlers over trying to keep <=GigE stuff working well in linux; they are perpetually tuning for 40Gbit and above as a "default". The last fight was over GSO by default in sch_cake.
That is certainly off topic here, and I'd advise complaining to the right mailing lists. Personally I have found that the constant tuning for dc workloads has made using my laptop running linux a far less pleasant experience than it used to be. I used to mix audio, do video production, etc, on linux. No more. OSX folk "get" latency. |
@dtaht Thanks for the constructive comments. For now I'm going to do some more load tests. Overall, kernel 4.15 looks worse compared to kernel 3.x, but I cannot say yet if it's due to one parameter having a wrong default, a set of parameters, or some regression. Regarding optimization goals, what I was trying to say is that the load of today and tomorrow is no longer dominated by the server as sender. IoT means acquiring large amounts of data from external sources, then processing it on large servers and serving back a small portion of it. This leads to two different problems on top of standard content serving:
In my infrastructure, IoT receiving bandwidth is by far the most dominant factor. Years ago I would have expected the problems to be on the application side, but in practice this is not the case. I can handle 50K to 100K connections easily, all with a low enough CPU usage (and all in Java) that I can easily do even the processing on the same node, thus removing most of the need for node-to-node intercommunication. When looking at the Linux ecosystem, what is missing is good segmentation per use case. Trying to apply one size fits all across 8 orders of magnitude of load just does not work. The classes of workloads have been known for years. The missing piece of the puzzle is specialized configurations with optimal or close-to-optimal settings for a given workload, or automatic tuning of the parameters. |
Don't kill me :-D : what about CAKE by default? |
This might not be the best idea - and I say that as the principal author of Cake.
Cake is designed primarily for last-mile CPE environments, and is relatively heavyweight with features intended to deal with effects seen there. So it might cause excessive CPU load if applied to a 10GbE interface, for little benefit relative to fq_codel. There are cases where you *do* want Cake's features on such an interface, but you should explicitly choose those cases, not have them foisted on you by a default.
- Jonathan Morton
|
Jonathan Morton <[email protected]> writes:
> On 12 Dec, 2019, at 2:10 am, Iván Baldo ***@***.***>
wrote:
>
> Don't kill me :-D : what about CAKE by default?
This might not be the best idea - and I say that as the principal
author of Cake.
Cake is designed primarily for last-mile CPE environments, and is
relatively heavyweight with features intended to deal with effects
seen there. So it might cause excessive CPU load if applied to a 10GbE
interface, for little benefit relative to fq_codel. There are cases
where you *do* want Cake's features on such an interface, but you
should explicitly choose those cases, not have them foisted on you by
a default.
I have already seen cake used on 10GigE links in two scenarios:
1) Where you wanted to essentially load balance multiple segments of the
network across a major internet gateway (host fairness)
2) Where the ISP wanted to prioritize certain kinds of traffic.
It has been tested up to about 50Gbits. But, no, as a default, no...
|
- Ref : systemd/systemd#9725 Suggested-by: Albert I <[email protected]> Signed-off-by: Raphiel Rollerscaperers <[email protected]>
At the risk of being off topic - if I do want to try out cake on an asymmetric connection (ethernet, e.g. eth0 with 500mbit ingress and 40mbit egress), do I set the bandwidth property to the ingress or the egress? |
Ideally you would have two Cake instances, one controlling ingress and the other controlling egress. The one directly attached to the interface would be the egress, and should be set accordingly (to 40Mbit in your case).
Again ideally, the ingress instance should be upstream of the bottleneck link, but you might not have access to put it there. The normal workaround for putting it downstream of the bottleneck is to attach it to an IFB interface, then use act_mirred to redirect ingress traffic to that. There the bandwidth should be set somewhat *less* than the true link rate; perhaps 450Mbit would work for you. Add the "ingress" keyword to this instance to inform Cake that the traffic has already been through the bottleneck.
In both directions, don't forget to also account for link overhead. In many cases, adding the "ethernet" keyword may be sufficient, or "docsis" if you have cable.
- Jonathan Morton
|
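A minimal sketch of the two-instance setup described above, assuming eth0 is the interface facing the asymmetric link and using the rates from the example (the IFB plumbing is the standard act_mirred workaround, not anything cake-specific):

```sh
# egress: cake directly on the interface, set to the 40 Mbit uplink
tc qdisc replace dev eth0 root cake bandwidth 40mbit ethernet

# ingress: redirect incoming traffic to an IFB device and shape it there,
# somewhat below the true 500 Mbit downlink
modprobe ifb numifbs=1
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all prio 10 u32 match u32 0 0 \
    flowid 1:1 action mirred egress redirect dev ifb0
tc qdisc replace dev ifb0 root cake bandwidth 450mbit ingress ethernet
```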
@chromi fantastic, something to play around with. Thanks for your response. |
Finally. What to choose, LOL? |
@socketpair The issue was opened on Jul 26, 2018. We might wait another few years before choosing the right default. |
systemd version the issue has been seen with
Used distribution
Expected behaviour you didn't see
Unexpected behaviour you saw
Steps to reproduce the problem
The main problem is that fq_codel is nice for routers or forwarding devices, but not for end-systems or servers (and I assume that these make up the majority of linux installations): it doesn't provide an advantage there, but rather a serious drawback: packet loss at the sending end-system (note that this will also increase the overall latency for the affected TCP connections!). Normally, this should be avoided by using back-pressure mechanisms locally. However, CoDel drops packets instead of propagating back-pressure.
The steps to reproduce the problems are described here:
https://lists.bufferbloat.net/pipermail/bloat/2018-June/008318.html
(BTW: there is no technical argument in this thread for using fq_codel as default).
So if you're using Linux for a software-based router (openwrt), you'll need a special configuration anyway (so it's easy to set fq_codel as default there), but for servers and normal end-systems/hosts the default should be sch_fq, not fq_codel!
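Whatever default ends up shipping, it is a sysctl and trivially overridable per machine; a minimal sketch, assuming a standard sysctl.d setup (the default qdisc only applies to interfaces initialized after the change, so existing ones need re-plumbing or a reboot):

```sh
# local override of the shipped default qdisc
printf 'net.core.default_qdisc = fq\n' > /etc/sysctl.d/99-qdisc.conf
sysctl --system
```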