
protect against thrashing not caused by swapping? #2

Closed
duanyao opened this issue Jun 23, 2015 · 6 comments

Comments

duanyao commented Jun 23, 2015

I had heard from multiple sources (including your README) that turning off swap can prevent thrashing, but this is not true. The executable files (and some data files) of processes have to be cached by the OS for them to run. If there is not enough physical memory and swap is off, the OS has to discard and refill huge amounts of cache during process scheduling, which can cause thrashing.

I did observe this issue on my laptop with 4GB of memory and swap turned off. I monitored IO with atop/iotop during the thrashing and found that firefox, thunderbird, eclipse, amule, etc. generated an enormous amount of reads, and the disk stayed 100% busy.
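One way to check that a slowdown like this is page-cache thrashing rather than swapping is to compare the kernel's major page fault counter with its swap counters. Below is a minimal sketch, assuming a Linux kernel that exposes `pgmajfault`, `pswpin` and `pswpout` in `/proc/vmstat`; the function names and thresholds are made up for illustration and are not part of thrash-protect. Swap-ins are also counted as major faults, so a high fault rate with almost no swap traffic points at file-backed pages (executables, libraries, data files) being re-read from disk.

```python
# Sketch: tell page-cache thrashing apart from swap thrashing via /proc/vmstat.
# Assumes a Linux kernel exposing pgmajfault, pswpin and pswpout.
# Function names and thresholds are illustrative, not thrash-protect code.
import time


def parse_vmstat(text: str) -> dict:
    """Turn the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def classify(prev: dict, cur: dict, interval_s: float,
             fault_limit: float = 500.0, swap_limit: float = 10.0) -> str:
    """Classify the interval between two /proc/vmstat snapshots."""
    majfaults = (cur["pgmajfault"] - prev["pgmajfault"]) / interval_s
    swapped = ((cur["pswpin"] - prev["pswpin"])
               + (cur["pswpout"] - prev["pswpout"])) / interval_s
    if majfaults < fault_limit:
        return "ok"
    if swapped >= swap_limit:
        return "swap thrashing"
    # Many major faults with (almost) no swap traffic: the pages are being
    # re-read from files, i.e. executables and data evicted from the page cache.
    return "page-cache thrashing"


def sample(interval_s: float = 1.0) -> str:
    """Take two snapshots interval_s apart and classify them (Linux only)."""
    with open("/proc/vmstat") as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_s)
    with open("/proc/vmstat") as f:
        after = parse_vmstat(f.read())
    return classify(before, after, interval_s)
```

Calling `sample()` during a freeze like the one described above would be expected to report "page-cache thrashing" when swap is off.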

Currently thrash-protect does not seem able to handle this situation. I suggest suspending some processes (kill -STOP) if the disk has been 100% busy for a while.
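A minimal sketch of how "the disk has been 100% busy" could be measured, assuming Linux's `/proc/diskstats` layout, in which the 13th whitespace-separated column is the cumulative number of milliseconds the device has spent doing I/O (the counter `iostat` derives `%util` from). The helper names and the default device `sda` are illustrative assumptions, not thrash-protect code.

```python
# Sketch: per-device utilization (like iostat's %util) from /proc/diskstats.
# Assumes the Linux layout where column index 12 is cumulative ms doing I/O.
# Helper names and the default device are hypothetical.
import time

IO_TICKS_COLUMN = 12


def io_ticks(diskstats: str, device: str) -> int:
    """Return the cumulative milliseconds the device has been busy."""
    for line in diskstats.splitlines():
        fields = line.split()
        if len(fields) > IO_TICKS_COLUMN and fields[2] == device:
            return int(fields[IO_TICKS_COLUMN])
    raise KeyError(device)


def utilization(prev_ticks: int, cur_ticks: int, interval_ms: float) -> float:
    """Fraction of the interval (0.0 .. 1.0) during which the device was busy."""
    busy = (cur_ticks - prev_ticks) / interval_ms
    return max(0.0, min(1.0, busy))


def read_utilization(device: str = "sda", interval_s: float = 1.0) -> float:
    """Sample /proc/diskstats twice and return the device's utilization."""
    with open("/proc/diskstats") as f:
        before = io_ticks(f.read(), device)
    start = time.monotonic()
    time.sleep(interval_s)
    with open("/proc/diskstats") as f:
        after = io_ticks(f.read(), device)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return utilization(before, after, elapsed_ms)
```

A value near 1.0 from `read_utilization()` over several consecutive samples corresponds to the disk being "100% busy for a while".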

tobixen (Owner) commented Jun 23, 2015

That's interesting, and indeed I think I've observed such situations myself. I will fix the README at once. I will have to think a bit about the suggestion above.

tobixen (Owner) commented Jun 23, 2015

I've fixed the README, and I'm done thinking; I don't think your suggestion will fit very well for the following reasons:

  1. It is perfectly OK if one process causes 100% IO-utilization, for instance a backup script running every night. We wouldn't want it to be disturbed by thrash-protect. Possibly we could monitor the number of page faults; indeed, early versions of thrash-protect did that, but it was dropped: for example, starting up firefox always causes a huge number of page faults, and we don't want thrash-protect to suspend firefox (or other innocent processes) while it's starting up.

  2. I would need to consider whether the logic I've written for selecting which process to suspend is suitable for the "ran out of cache"-scenario (I believe it is, but I don't have the time to verify that right now).

  3. I think that thrash-protect should be used together with "enough" swap (I've stressed this in the documentation now), and then I believe the "ran out of cache"-scenario is quite moot. You'd almost always (?) see lots of swapping in and out before getting into such problems.
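To illustrate the objection in point 1, here is a hypothetical sketch of per-process page-fault monitoring; it is not the code early thrash-protect versions used. It reads the `majflt` counter (field 12 of `/proc/<pid>/stat`, per the Linux proc(5) layout) and ranks processes by major faults per second. A process that has just started, such as firefox loading its binaries and libraries, shows a high rate here even though it is not misbehaving, which is why such a rule would suspend innocent processes.

```python
# Hypothetical sketch: rank processes by major page faults per second.
# Assumes the Linux /proc/<pid>/stat layout; names and threshold are made up.
import os


def majflt(stat_line: str) -> int:
    """Extract majflt (field 12) from a /proc/<pid>/stat line."""
    # comm (field 2) may contain spaces or ')', so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[9])  # rest[0] is field 3 (state), so field 12 is rest[9]


def snapshot() -> dict:
    """Map pid -> majflt for every process we can read (Linux only)."""
    counts = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                counts[int(entry)] = majflt(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or its stat file is unreadable
    return counts


def heavy_faulters(prev: dict, cur: dict, interval_s: float,
                   limit_per_s: float = 100.0) -> list:
    """Pids whose major-fault rate exceeded limit_per_s, highest first."""
    # A pid missing from prev (a freshly started process) has all of its
    # startup faults counted in this interval, so it easily tops the list.
    rates = {pid: (cur[pid] - prev.get(pid, 0)) / interval_s for pid in cur}
    return sorted((pid for pid, r in rates.items() if r > limit_per_s),
                  key=lambda pid: rates[pid], reverse=True)
```

Feeding two `snapshot()` results taken a second apart into `heavy_faulters()` while launching firefox would typically put firefox at the top of the list, even without any memory pressure.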

Anyway, thanks for the report. If nothing else, the documentation has been corrected and hardened now 👍

tobixen closed this as completed Jun 23, 2015
duanyao (Author) commented Jun 23, 2015

  1. It is perfectly OK if one process causes 100% IO-utilization, for instance a backup script running every night.

So it depends on the role of the computer: batch or interactive processing. I think my suggestion is useful for the latter. Maybe it would be better as an opt-in feature? I have also seen some IO-intensive programs (dpkg, svn, updatedb) freeze the system, even when there is enough memory.

  3. I think that thrash-protect should be used together with "enough" swap

I'm OK with this requirement. However, as I pointed out above, it may be valuable to also protect against 100% disk usage in general on interactive systems.

tobixen (Owner) commented Jun 23, 2015

We have quite a few production servers with a 24/7, 99.7% SLA; our backup script runs nightly, causing 100% IO utilization (read-only), and it's almost never an issue.

I think that if a single IO-intensive program actually freezes your system, you're probably experiencing some hardware problem. I once saw the "sync" command cause a three-second delay; the hardware vendor denied it could be a hardware problem, but the problem disappeared when they replaced the server.

tobixen (Owner) commented Jun 23, 2015

I think I would accept a patch / pull request enabling thrash-protect to kick in on "100% IO utilization", on an opt-in basis. It shouldn't be very complicated, and I'm OK with it as long as it's not the default. :-)
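A hypothetical sketch of what such an opt-in trigger might look like. The environment variable names (`THRASH_PROTECT_IO_TRIGGER` and friends), the defaults and the class are invented for illustration and are not existing thrash-protect options. The trigger is off by default and only fires after utilization has stayed at or above a threshold for several consecutive samples, so short bursts do not cause suspensions; the suspension itself is just `SIGSTOP`, the signal `kill -STOP` sends.

```python
# Hypothetical sketch of an opt-in "disk saturated" trigger.
# Env variable names, defaults and the class are illustrative only.
import os
import signal


class IoSaturationTrigger:
    """Fires only when enabled and utilization stays >= threshold for N samples."""

    def __init__(self, enabled: bool = False, threshold: float = 0.99,
                 required_samples: int = 10):
        self.enabled = enabled
        self.threshold = threshold
        self.required_samples = required_samples
        self.busy_samples = 0

    @classmethod
    def from_env(cls, env=os.environ) -> "IoSaturationTrigger":
        # Off unless explicitly enabled, so batch servers are unaffected.
        return cls(
            enabled=env.get("THRASH_PROTECT_IO_TRIGGER", "0") == "1",
            threshold=float(env.get("THRASH_PROTECT_IO_THRESHOLD", "0.99")),
            required_samples=int(env.get("THRASH_PROTECT_IO_SAMPLES", "10")),
        )

    def update(self, utilization: float) -> bool:
        """Feed one sample (0.0 .. 1.0); True while saturation has persisted."""
        if not self.enabled:
            return False
        if utilization >= self.threshold:
            self.busy_samples += 1
        else:
            self.busy_samples = 0  # any non-saturated sample resets the streak
        return self.busy_samples >= self.required_samples


def suspend(pid: int) -> None:
    """Equivalent of `kill -STOP pid`; resume later with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
```

In use, one utilization sample per polling interval would be fed to `update()`; when it returns `True`, a process would be picked with whatever selection logic is in use and passed to `suspend()`.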

tobixen reopened this Jun 23, 2015
duanyao (Author) commented Jun 23, 2015

Thanks for the information. Poor hardware could be the problem (it is a laptop hard disk, after all), and some software configuration (there are ntfs-3g partitions) probably makes it worse.

tobixen closed this as completed Oct 12, 2015