-
Notifications
You must be signed in to change notification settings - Fork 10
Linux patches to support microsecond granularity RTOs in datacenters.
vrv/linux-microsecondrto
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Linux TCP microsecond timers patch Author: Vijay Vasudevan <[email protected]> This patch adds support for microsecond-granularity TCP retransmissions. It replaces the existing coarse-grained timers using the jiffy timer with microsecond-graunlarity timers using the high-resolution timer (hrtimer) infrastructure. Warning: This patch can cause the system to hang during heavy network activity for versions after 2.6.28. I have not root caused the reason, so proceed with caution. If you identify the problem, please let me know or submit a pull request to this patch. Background: This patch was developed to help mitigate the "incast" problem for wide-fan-in, synchronized communication in datacenters. See http://www.pdl.cmu.edu/Incast/ for details. Our solution required retransmissions to fire on the timescale of latencies in a datacenter (i.e., microseconds). This patch enables machines with high-resolution timer capabilities to retransmit quickly on packet loss. Keep in mind that the retransmission timeout is always calculated using the flow latency, so wide-area communication should still have timeouts in the tens to hundreds of milliseconds using this patch. This patch has only been tested in local clusters and on a few externally facing machines, but needs much larger-scale testing before further adoption. This repository contains two versions of the patch, one for Linux 2.6.28.10 and one for 2.6.39 (cloned just hours before 3.0-rc1 was released, so it should work on 3.0-rc1 too). The 2.6.28.10 patch has been tested in real-world clusters, whereas the 2.6.39 patch compiles but has not been tested and may contain bugs (maintaining a separate patch across 2 years of independent kernel development is tough!). This patch requires that the kernel config options HIGH_RES_TIMERS and TCP_HIGH_RES_TIMERS are enabled. The TCP timeout defaults are unchanged on boot (e.g., 200ms minimum TCP RTO), so you will have to execute the following sysctl commands to change them. sysctl -w net.ipv4.tcp_rto_min=X sets the minimum retransmission timeout to X microseconds. (Default 200000) sysctl -w net.ipv4.tcp_delack_min=X sets the minimum delayed ACK timeout to X microseconds. (Default 40000) sysctl -w net.ipv4.tcp_delayed_ack=0 "disables" the delayed ACK mechanism. (=1 enables). (Default 1) Notes: 1) These patches have not been tested, nor do they compile, with TCP congestion control options such as cubic, bicubic, lp, etc. Modifying these congestion control implementations should be somewhat straightforward if based off the changes from the default implementation. Volunteers to implement/test are encouraged. 2) An alternative to these patches is to simply reduce TCP_RTO_MIN in tcp.h to 1. However, this provides no less than 5ms retransmissions because the existing timer infrastructure is tied to HZ, which does not increase above 1000 in most configurations. However, this is a simple one-line change. Note that this may have strange interactions with delayed ACK unless you use a patch above to disable delayed ACK. See http://www.pdl.cmu.edu/PDL-FTP/Storage/sigcomm147-vasudevan.pdf for details. 3) Please read up on the incast problem using the links above to understand when this patch is applicable. Please email me if you have trouble and let me know what kernel version and the errors you are receiving. Pull requests to fix bugs or add functionality are greatly appreciated!
About
Linux patches to support microsecond granularity RTOs in datacenters.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published