uninterruptible sleep in linux-sandbox processes - bazel hang only fixed with reboot #3494
Comments
We have a theory that this is due to the build-cache server. What's unusual with us is that we regularly switch between the …

Two more failures happened within the last 24 hours from the same engineer:
@philwo could you take a look?
A few comments from me based on what I saw in tensorflow/tensorflow#717 (comment).

@brunobowden: if you get a reproduction with a bazel process in an uninterruptible sleep, can you get its stack trace? For example, if the PID of your hung process is 17988, you can run:
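For example, one common way to do this, assuming the example PID above and root access, is to read the process's kernel stack from procfs:

```sh
# Show the kernel-side stack of the hung process; a process stuck in "D"
# state will list the kernel function it is blocked in.
sudo cat /proc/17988/stack
```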
Hopefully the stack trace will pinpoint where in the kernel bazel is stuck. I am aware of two possible forms of this issue: a hang caused by an inaccessible mount (e.g. a stale NFS mount) that any process touching it would hit, and a hang arising from bazel's sandbox setup itself.

The first form of the issue is really not the fault of bazel, since any application accessing the mount will behave similarly. The second form of the issue could be argued to be an issue with bazel, or possibly with the implementation of Linux namespaces (if that's where the behavior is coming from).
@igor0 - you were right on target with your NFS speculation. Here's the stack trace:
Here's a summarized version:
This would also match up with what I mentioned earlier about the problem occurring when we switched from using the build-cache to …
I just had another occurrence with exactly the same stack trace (copied below). I won't report further stack traces unless I see something different from what was reported before. Logging out and back in doesn't fix the issue on Ubuntu; only a restart does.
I think you found the correct cause here. As the sandbox iterates over all mount points in order to make them read-only in its mount namespace via a bind/remount mount syscall, I could imagine that this causes a hang when the NFS mount is not available. I don't think there's a way to avoid this. I would advise just not having inaccessible NFS mount points.
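For illustration, making an existing mount point read-only inside a mount namespace is typically done with a bind mount followed by a read-only remount, along these lines (a sketch of the general technique, not Bazel's actual code; /mnt/nfs is a placeholder path):

```sh
# Bind-mount the directory onto itself, then flip the bind mount to read-only.
# If this is a hard NFS mount whose server is unreachable, the mount syscall
# can block indefinitely in uninterruptible ("D") sleep.
mount --bind /mnt/nfs /mnt/nfs
mount -o remount,ro,bind /mnt/nfs
```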
We've figured out a workaround. If you just …
@philwo: is the sandbox code able to report errors, say by printing to stderr? If so, a possible improvement would be to print an error if a remount takes more than, say, 60 seconds. When you hit this issue, seeing an error message that tells you what's wrong would be very helpful.
bonus points if you can detect which mount it's blocked on and suggest that it's unmounted if that can be safely done |
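For what it's worth (and not necessarily the workaround referred to above), a stale NFS mount can often be detached without a reboot using a forced, lazy unmount; processes already stuck in D state on that mount may stay stuck, but new accesses stop hanging. /mnt/nfs is again a placeholder path:

```sh
# List NFS mounts that might be stale.
grep nfs /proc/mounts
# Force-detach the unresponsive mount: -l (lazy) detaches it from the
# namespace immediately, -f asks the kernel to abort outstanding NFS requests.
sudo umount -f -l /mnt/nfs
```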
We have disabled the SandboxModule locally to combat this. I have tried augmenting this mount enumeration code to find mount points matching a particular filesystem type or prefix and switching those to be remounted as a read-only empty tmpfs (since a private unmount doesn't seem to be a thing, and would likely exhibit the hang anyway), similar to the extra mount point locations specified as options. I suppose it is our fault for 'letting' our machines operate in the degraded state of possibly hung mounts, but I have several issues with the existing solution: there is no way to disable the …

Compounding this is the fact that sandboxing buys us so little that we've long since decided to turn it off in favor of buildfarm remote execution. A prime example of a problematic mount is, as described here in no uncertain terms, an unavailable remotely served NFS share, which the sandbox will gleefully provide access to, yet remote execution cuts off.
I don't think we can do anything reasonable here in the presence of non-functional NFS mounts. The sandbox can be disabled, so I think that's a valid workaround. And last I looked, we now cache the sandboxing state for the life of the Bazel server rather than checking for it on every build as we did before.
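For completeness, sandboxed execution can be turned off with strategy flags, e.g. in a .bazelrc; flag names have shifted between Bazel releases, so treat this as a sketch for Bazel versions of that era:

```
# .bazelrc: fall back to non-sandboxed (standalone) execution.
build --spawn_strategy=standalone
build --genrule_strategy=standalone
```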
Summary
A group of us at the same company have been seeing repeated bazel hangs for months now. Bazel will hang indefinitely and can't be killed with `kill -9`. The cause appears to be multiple `linux-sandbox` processes that are stuck in the uninterruptible "D" state (waiting on unknown IO operations). Only a reboot fixes the problem, but thankfully it always does.

What's curious is that it has happened on multiple engineers' laptops as well as desktop machines. The cause is unclear, but possible contributing factors are:
A very similar report from @igor0 mentioned an identical issue when building TensorFlow. Given that this issue hasn't been tracked separately, I'm wondering whether it's something specific to our build setup or a more widely seen issue:
tensorflow/tensorflow#717 (comment)
Reproduction Steps
Unknown. Happens fairly infrequently, maybe every few hundred builds?
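When a hang does occur, the stuck processes can be spotted with standard tools. As a diagnostic sketch (assuming procps `ps`), this lists processes in the uninterruptible "D" state together with the kernel symbol they are waiting in:

```sh
# STAT beginning with "D" means uninterruptible sleep; WCHAN shows the
# kernel function the process is blocked in (e.g. an NFS wait).
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```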
Environment info
Have you found anything relevant by searching the web?
Nothing found on SO, GitHub or the web aside from issue 717:
tensorflow/tensorflow#717 (comment)
Other Info
bazel either hangs after the INFO line or part way through the build (shown below):