
testing the behavior of small queue building non-ecn'd flows #148

Open

dtaht opened this issue Aug 18, 2018 · 21 comments

@dtaht (Collaborator) commented Aug 18, 2018

@chromi @heistp @jg @richb-hanover

Our tests with typical sampling rates in the 200ms range are misleading. We were (until the development of irtt) basically pitting request/response traffic against heavy tcp traffic, and I think it's been leading us to draw conclusions that are untrue for many other kinds of traffic, particularly with ecn enabled and the collateral damage it might cause.

The kerfuffle over systemd/systemd#9748 and systemd/systemd#9725 is a symptom of my uneasiness.

I'm probably the only one that runs flent with a 20ms sampling interval regularly. Queues do finally build in this case for voip-like traffic: we end up in the "slow" queue, and even the "fast" queue gets more than one packet to deliver.
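For reference, a minimal sketch of such a run (the host name is a placeholder; flent's -s/--step-size takes the sample interval in seconds):

# run rrul for 60s, sampling every 20ms instead of the usual 200ms
flent rrul -H flent-server.example.com -l 60 -s 0.02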

Having to prioritize arp slightly, as cake does in diffserv mode, is one symptom; having to ecn-mark babel packets on a congested router (as I've done for years now) is another. Other routing protocols that don't use IP will also always end up in a fixed queue.

In an ecn'd world, I've long thought a special "1025th" queue for things like arp was possibly needed. Right now that traffic maps to the "0th" queue and can collide. There are other protocols not handled by the flow dissector.

  • tracking packet loss better for the measurement flows would comfort me A LOT (having a graph mixin that could pull that data out?)

  • a rrul_v2 test that did the 20ms irtt thing always would be good

  • a test that tested ecn'd flows vs non-ecn'd flows would be good.

  • a fixed-rate, non-ecn'd, but queue-building flow mixin (sort of like what babel does to me now). Toke picks on me for using babel on workloads like this; I view it as a subtle reminder that real networks are not like a lab.

  • syn repeats?

  • RTO tracking?

  • tests with heavy flows going plus a squarewave flow

In the latest string of extreme tests - at 100mbit - I was also regularly able to get some of the hundred flows started simultaneously to wind up in ecn fallback mode.

Using "flows 32" for fq_codel & ecn was often "not good" from the perspective of my (non-ecned) monitoring flow, things like "top" would have their output pause half screened.

@dtaht (Collaborator, Author) commented Aug 19, 2018

maybe we can flow-dissect arp?

@heistp (Contributor) commented Aug 25, 2018

Is there an argument for lowering the default interval for when flent calls irtt?

The default irtt packet length with no payload is 60 bytes, so here are bitrates at various intervals for IPv4+Ethernet (106 byte frames, and tripled for RRUL's three UDP flows):

200ms => 4.2 Kbit * 3 = 12.7 Kbit
100ms => 8.5 Kbit * 3 = 25.4 Kbit
50ms => 17.0 Kbit * 3 = 51 Kbit
20ms => 42.4 Kbit * 3 = 127.2 Kbit
10ms => 84.8 Kbit * 3 = 254.4 Kbit

50ms wouldn't be too disruptive in most cases. At 1 Mbit, though, that's where the 5%-of-bandwidth threshold is crossed.
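As a quick sanity check of the arithmetic (a one-liner sketch; 106-byte frames and three flows as above):

# per-flow bits/s = frame_bytes * 8 / interval_s, tripled for rrul;
# checking the 20ms row:
echo 'scale=1; 106 * 8 / 0.02 * 3 / 1000' | bc
# prints 127.2 (Kbit, matching the table)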

Bitrates could also be lowered by ~15% (16 bytes per packet) by passing in --tstamp=midpoint and sacrificing the server processing time stat.
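For example (a sketch; the host is a placeholder, and the flag is as described above):

# trade the server processing time stat for 16 fewer bytes per packet
irtt client -q -i 20ms -d 30s --tstamp=midpoint irtt-server.example.com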

I'd also like to see packet loss (up vs down separately) shown by default, somehow. :)

@flent-users commented Aug 25, 2018 via email

@flent-users commented Aug 25, 2018 via email

@heistp (Contributor) commented Aug 26, 2018

Yeah, I haven't thought about what it would mean to change the semantics of the existing tests by changing the default interval. Although, as it is, there's still a fallback to UDP_RR for current tests, so results can change if irtt isn't installed or the server isn't reachable for some reason.

I'm fine with a 20ms default interval, but that could affect folks testing on lower rate ADSL.

Sub-10ms intervals mean -i needs to be passed to the server to reduce the minimum interval it will accept.

2.7ms intervals should be no problem. 200µs still functions on decent hardware, but below that isn't much good. There can be any number of things that could cause those kinds of latencies.

tron:~:% irtt client -q -i 400us -d 10s localhost
timer stats: 53/25000 (0.21%) missed, 2.63% error
tron:~:% irtt client -q -i 300us -d 10s localhost
timer stats: 44/33334 (0.13%) missed, 3.30% error
tron:~:% irtt client -q -i 200us -d 10s localhost
timer stats: 176/50000 (0.35%) missed, 7.40% error
tron:~:% irtt client -q -i 150us -d 10s localhost
timer stats: 2077/66666 (3.12%) missed, 12.43% error
tron:~:% irtt client -q -i 100us -d 10s localhost
timer stats: 22136/99999 (22.14%) missed, 17.79% error
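For sub-10ms probing, both ends would look something like this (a sketch, assuming the server's -i sets the minimum interval it will accept, per the above):

# server: accept client intervals down to 1ms (default minimum is 10ms)
irtt server -i 1ms
# client: probe at 2.7ms as discussed above
irtt client -q -i 2.7ms -d 10s localhost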

@flent-users commented Aug 27, 2018 via email

@heistp (Contributor) commented Aug 27, 2018 via email

@flent-users commented Aug 27, 2018 via email

@tohojo (Owner) commented Aug 27, 2018 via email

@heistp (Contributor) commented Aug 27, 2018

Ok, well if we do go for it, so far in irtt's JSON there's just an average send_rate and receive_rate under stats, both of which contain an integer bps and a string text representation. send_rate ignores lost packets and receive_rate takes them into account. Let me know if anything different would be expected...
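A sketch of pulling those out, assuming irtt's -o JSON output flag and jq (field values illustrative only):

# write results as JSON, then extract the two rate stats
irtt client -q -i 20ms -d 10s -o r.json localhost
jq '.stats | {send_rate, receive_rate}' r.json
# each rate holds an integer bps plus a text representation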

@dtaht (Collaborator, Author) commented Aug 27, 2018

I really do care about measuring packet loss and re-orders accurately.

I've also been fiddling with setting the tos field to do ect(0), ect(1) and CE. Doing that at a higher level, and noting the result, would be good. --ecn 1,2,3?

A summary line covering forward/backward path dscp stripping, CE marks,
reorders and loss
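A sketch of what the proposed flag might look like (not implemented; the 1/2/3-to-codepoint mapping is my assumption: 1=ECT(1), 2=ECT(0), 3=CE):

# hypothetical: send probes marked ECT(0) and note stripped dscp / CE marks
irtt client -q -i 20ms -d 10s --ecn 2 irtt-server.example.com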

@dtaht (Collaborator, Author) commented Aug 27, 2018

on plotting stuff I could see adding a 4th graph much like TSDE's for loss and reorder.

@dtaht (Collaborator, Author) commented Aug 28, 2018

actually - and I can see pete running screaming from the room - we could add tcp-like behavior to irtt and obsolete netperf entirely, except for referencing the main stack. The main reason we use netperf is that core linux devs trusted it, and the reason we only sample is that timestamping each packet and extracting stats from it is hard in light of mss and the complexity of the netperf codebase.

Implementing tcp-like behavior and tcp-like congestion controllers on top of irtt seems simpler in comparison, and we already have better timestamp facilities than tcp in irtt.

Who here likes playing the Zerg as much as I do?

@heistp (Contributor) commented Aug 28, 2018

As for packet loss and reorders, there's the lost property on each round_trip that could be plotted, but for re-orders there's so far just a global late_packets, which is the number of packets whose sequence number is lower than the previous one received. It would be possible to add a late flag to round_trip without breaking anything, so I added that to the list.
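Something like this could get the per-round-trip loss into a plot (a sketch; assumes the -o JSON output mentioned above, and that each round_trip carries a seqno alongside lost):

# one line per probe: sequence number and whether it was lost
jq '.round_trips[] | {seqno, lost}' r.json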

What's TSDE?

As for tcp'ish irtt, I think I need to go canicross the dog in the forest before I internalize that. :) Although I bet per-packet RTTs would be invaluable for investigating ecn?

pping gives per-packet rtt for tcp today, in case that's useful. Perhaps an integrated tool could combine traffic generation using the standard stack with passive analysis for gathering results...

@heistp (Contributor) commented Aug 28, 2018

Ah, I see TSDE is Pollere's work. I need to go through the talks referenced on pollere.net asap to get smarter on that. Will be on some roofs today though, p2p connection for the neighbors...

@dtaht (Collaborator, Author) commented Aug 28, 2018

this convo is (purposefully) all over the place, but I'm leaning towards a rrul_v2 test with 10ms irtt intervals. It's not clear to me if flent could deal with two different sample rates. Also perhaps an IRTT_REQUIRE flag: --te=irtt=1
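A sketch of how that might be invoked (rrul_v2 and the require-irtt test parameter are proposals, not implemented; --te= is flent's existing test-parameter mechanism):

# hypothetical: 10ms probes/samples, fail rather than fall back to UDP_RR
flent rrul_v2 -H flent-server.example.com -s 0.01 --te=irtt=1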

@dtaht (Collaborator, Author) commented Sep 3, 2018

Another rrul_v2 issue would be to correctly end up in all the queues on wifi.

@flent-users commented Sep 3, 2018 via email

@tohojo (Owner) commented Sep 3, 2018 via email

@flent-users commented Sep 3, 2018 via email

@tohojo (Owner) commented Sep 3, 2018 via email
