Spurious 3-hour timeout for dist-* jobs due to 20 minutes spent on compiling stage0-libstd #48192
I'm going to try to investigate at some point this week, so assigning myself.
Timing breakdown for these 4 logs
The MIPS and PowerPC jobs spent >20 minutes on stage0-std, which may be a different bug (maybe related to rust-lang/cargo#5030, as suggested by @alexcrichton). PowerPC64 looks most suspicious, as everything looks so uniform.
Looking at some of those with the Travis output:
Not sure what happened with PowerPC 64 and i686... I'm looking into the 20m libstd stage0 bug.
I've narrowed down this bug to focus on the 20-minute libstd builds.
So, some information about why libstd is taking 20 minutes to build. I randomly ran across this the other day and was very surprised to see what was happening. The behavior I was seeing is that building libstd hung in stage0 even though the build itself appeared to have finished. Looking into why Cargo was eating CPU, I discovered that Cargo was busy-waiting for the child's file descriptors to get closed. I have no idea why Cargo is waiting 20 minutes for child file descriptors to get closed. The fix in rust-lang/cargo#5030, if I understand it correctly, will basically just cause Cargo to block correctly rather than spin, but it won't make the condition Cargo is waiting on come about any sooner. I've got no idea why one fd of a child process would get closed and the other wouldn't for 20 minutes. I wonder, though, if this is related to the sccache-and-build-scripts change, as sccache could in theory hold a descriptor open for a long time by accident. I haven't been able to confirm this locally though.
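To illustrate the condition Cargo is waiting on, here's a minimal sketch (my own example, not Cargo's actual code): reading a child's stdout pipe only returns EOF once *every* copy of the write end is closed, so a backgrounded grandchild that inherits the descriptor keeps the read blocked long after the direct child has exited.

```rust
use std::io::Read;
use std::process::{Command, Stdio};

fn main() {
    // The shell exits immediately after `echo`, but the backgrounded
    // `sleep` inherits the pipe's write end and holds it open.
    let mut child = Command::new("sh")
        .arg("-c")
        .arg("sleep 30 & echo done")
        .stdout(Stdio::piped())
        .spawn()
        .unwrap();

    let mut out = String::new();
    // Blocks for ~30 seconds: EOF arrives only when the last copy of
    // the write end (held by `sleep`) goes away.
    child.stdout.take().unwrap().read_to_string(&mut out).unwrap();
    let _ = child.wait();
    println!("got: {}", out.trim());
}
```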
Ok I'm like 99% sure it's the sccache change now. The sccache server is starting in the libstd/libcore build scripts and is somehow inheriting too many fds. My current assumption is that the server inherits a descriptor from the build script and keeps it open, which is exactly the condition Cargo sits there waiting on.
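One way to check a theory like this on Linux is to look at what descriptors a long-lived server process actually holds open via `/proc`. A quick sketch (the `list_fds` helper is hypothetical, not something from this thread; Linux-only):

```rust
use std::fs;

// List the open file descriptors of a process by reading /proc/<pid>/fd.
fn list_fds(pid: u32) -> std::io::Result<()> {
    for entry in fs::read_dir(format!("/proc/{}/fd", pid))? {
        let entry = entry?;
        let target = fs::read_link(entry.path())?;
        println!("fd {} -> {}", entry.file_name().to_string_lossy(), target.display());
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // For the real investigation you'd pass the sccache server's pid;
    // here we just inspect ourselves.
    list_fds(std::process::id())
}
```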
I'm not entirely sure why a configure script would be holding descriptors open like this, though.

@kennytm I'm gonna be tight on time soon, mind doing the revert for me?
#48209 includes a revert of the sccache commit; presumably if things stabilize after that lands we can close this.
Try to fix 48116 and 48192

The bug #48116 happens because of a misoptimization of the `import_path_to_string` function, where a `names` slice is empty but the `!names.is_empty()` branch is executed: https://github.com/rust-lang/rust/blob/4d2d3fc5dadf894a8ad709a5860a549f2c0b1032/src/librustc_resolve/resolve_imports.rs#L1015-L1042

Yesterday, @eddyb had locally reproduced the bug and [came across the `position` function](https://mozilla.logbot.info/rust-infra/20180214#c14296834), where the `assume()` call was found to be suspicious. We have *not* concluded that this `assume()` causes #48116, but given [the reputation of `assume()`](#45501 (comment)), it seems highly relevant. Here we try to see if commenting it out fixes the errors.

Later, @alexcrichton bisected and found a potential bug [on the LLVM side](#48116 (comment)). We are currently testing whether reverting that LLVM commit is enough to stop the bug. If so, this PR can be reverted (keeping the `assume()`) and we could backport the LLVM patch instead.

(This PR also includes an earlier commit from #48127 to help debug ICEs happening in compile-fail/parse-fail tests.)

The PR also reverts #48059, which seems to cause #48192.

r? @alexcrichton

cc @eddyb, @arthurprs (#47333)
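As a rough illustration of why a bogus `assume()` is dangerous (a hand-written sketch, not the actual `position` or resolver code), telling LLVM to assume a condition that can be false gives it license to delete the "impossible" branch:

```rust
#![feature(core_intrinsics)] // nightly-only; `assume` is a compiler intrinsic

// Sketch: `assume` is UB if its argument is false, so the optimizer may
// treat the `is_empty` check below as always-false and run the non-empty
// arm even for an empty slice; the same shape of misoptimization as the
// `!names.is_empty()` branch in `import_path_to_string`.
fn last_segment<'a>(names: &[&'a str]) -> &'a str {
    unsafe {
        std::intrinsics::assume(!names.is_empty());
    }
    if !names.is_empty() {
        names[names.len() - 1]
    } else {
        "<empty>"
    }
}

fn main() {
    // Fine for non-empty input; calling with `&[]` would be UB.
    println!("{}", last_segment(&["std", "io", "Read"]));
}
```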
Hm no, never mind, those flags don't include the cloexec flag.
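For reference, here's a sketch of how one can inspect and set the flag in question with `fcntl` (using the `libc` crate; this is illustrative, not what Cargo or sccache actually does). A descriptor without `FD_CLOEXEC` stays open across `exec()`, which is exactly the kind of leak being discussed.

```rust
use std::io;
use std::os::unix::io::RawFd;

// Check whether FD_CLOEXEC is set on `fd`, and set it if missing so the
// descriptor is closed across exec() instead of leaking into children.
fn ensure_cloexec(fd: RawFd) -> io::Result<bool> {
    unsafe {
        let flags = libc::fcntl(fd, libc::F_GETFD);
        if flags < 0 {
            return Err(io::Error::last_os_error());
        }
        let had_it = flags & libc::FD_CLOEXEC != 0;
        if !had_it && libc::fcntl(fd, libc::F_SETFD, flags | libc::FD_CLOEXEC) < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(had_it)
    }
}

fn main() -> io::Result<()> {
    // stdin (fd 0) normally does not have FD_CLOEXEC set.
    println!("fd 0 already had FD_CLOEXEC: {}", ensure_cloexec(0)?);
    Ok(())
}
```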
Ok I'm now 100% sure that this is a "bug", or rather, simply a fact of how shells behave. Jemalloc's configure script has an innocuous line which I had to look up b/c I had no idea what it does. Apparently it duplicates stdout onto file descriptor 6! It also apparently doesn't set CLOEXEC in the shell, meaning that file descriptor (a dup of stdout) will now leak into all further processes. I created a build script that looks like this:

```rust
use std::process::Command;

fn main() {
    // Duplicate stdout onto fd 3, close stderr, and leave a long-lived
    // backgrounded child (standing in for something like the sccache
    // server) holding those descriptors open.
    let s = Command::new("sh")
        .arg("-c")
        .arg("exec 3>&1; exec 2>&-; sleep 600 &")
        .status()
        .unwrap();
    println!("{}", s);
}
```

and I can deterministically reproduce the hanging behavior. Overall this build script exits immediately, but Cargo sits there spinning, burning CPU. After rust-lang/cargo#5030, Cargo "properly blocks" waiting for the file descriptor to be closed, but it's still not getting closed.

tl;dr: I think this issue is for sure now fixed with #48209 landed. In the meantime I think we just can't use sccache unless we start up the server outside of the main build process, where we're sure that fds won't leak in. From what I can tell the shell is "at fault" here, but I'm also like 99% sure that's a feature of the shell as well, so that's probably not gonna change. There's not much that Cargo/sccache can do here as they're basically just bystanders. Closing...
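For the curious, the configure-script behavior can be reproduced directly. The exact jemalloc line isn't quoted in this thread, but `exec 6>&1` is the standard autoconf idiom for duplicating stdout onto fd 6; a sketch (Linux-only, since it inspects `/proc`):

```rust
use std::process::Command;

fn main() {
    // `exec 6>&1` dups stdout onto fd 6 with no FD_CLOEXEC; the fd table
    // then shows the extra descriptor, and the backgrounded child
    // inherits it and keeps it open after the shell exits.
    Command::new("sh")
        .arg("-c")
        .arg("exec 6>&1; ls -l /proc/$$/fd; sleep 600 &")
        .status()
        .unwrap();
}
```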
This is a re-attempt at rust-lang#48192, hopefully this time with 100% less randomly [blocking builds for 20 minutes][block]. To work around rust-lang#48192, the sccache server is started in the `run.sh` script very early on in the compilation process.

[block]: rust-lang#48192
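A minimal sketch of what that looks like in a CI script (`--start-server` is an existing sccache flag, but the surrounding lines are illustrative, not the actual `run.sh`):

```sh
#!/bin/sh
# Start the sccache server before the build spawns any pipes, so the
# server inherits only this script's original descriptors and cannot
# accidentally hold open an fd that Cargo is waiting on.
sccache --start-server
# ...then run the normal build; later sccache invocations reuse the server.
python2.7 ../x.py dist
```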
Symptom: one of the `dist-*-linux` jobs timed out after 3 hours; the stage0-std phase needs over 1000 seconds to compile. Typically all 3 compilers (host stage0, host stage1, target stage1) are completely built, but the RLS build or final deployment doesn't finish in time.
The log should be inspected to ensure this is not caused by a network error (e.g. cloning a submodule taking over 1 hour to complete).
Examples:
- PowerPC 64 (Timed out before completing RLS) (not 20-minute stage0-libstd)
- i686 (Timed out before completing RLS) (not 20-minute stage0-libstd)