
Spurious 3-hour timeout for dist-* jobs due to 20 minutes spent on compiling stage0-libstd #48192

Closed
kennytm opened this issue Feb 13, 2018 · 10 comments
Labels:
  • A-spurious: Spurious failures in builds (spuriously == for no apparent reason)
  • C-tracking-issue: A tracking issue for an RFC or an unstable feature
  • T-infra: Relevant to the infrastructure team, which will review and decide on the PR/issue

@kennytm (Member) commented Feb 13, 2018

Symptom: One of the dist-*-linux jobs timed out after 3 hours. The stage0-std phase needed over 1000 seconds to compile.

Typically all 3 compilers (host stage0, host stage1, target stage1) are completely built, but the RLS build or final deployment doesn't finish in time.

The log should be inspected to make sure this is not caused by a network error (e.g. cloning a submodule taking over 1 hour to complete).

Examples:

  • PowerPC (Build completed at 2:57:22, but 3 minutes isn't enough for deployment)
  • MIPS (Timed out before completing RLS)
  • PowerPC 64 (Timed out before completing RLS) (not 20-minute stage0-libstd)
  • i686 (Timed out before completing RLS) (not 20-minute stage0-libstd)
kennytm added the A-spurious, T-infra, and C-tracking-issue labels Feb 13, 2018
@Mark-Simulacrum (Member)

I'm going to try to investigate at some point this week, so I'm assigning myself.

Mark-Simulacrum self-assigned this Feb 13, 2018
@kennytm (Member, Author) commented Feb 13, 2018

Timing breakdown for these 4 logs (times in seconds):

| Phase | i686 | ppc64 | mips | ppc |
|---|---:|---:|---:|---:|
| make-prepare | 69.19 | 74.98 | 60.87 | 52.38 |
| pytest/bootstrap | 0.00 | 0.00 | 0.00 | 0.00 |
| stage0-std | 91.80 | 104.70 | 1260.49 | 1250.99 |
| stage0-test | 20.93 | 24.50 | 22.87 | 17.52 |
| stage0-rustc | 1478.91 | 1616.18 | 1511.29 | 1234.51 |
| llvm | 223.83 | 253.40 | 379.83 | 320.81 |
| stage0-trans-llvm | 205.13 | 244.81 | 206.33 | 168.54 |
| llvm-emscripten | 136.07 | | | |
| stage0-trans-emscripten | 116.82 | | | |
| stage1-std | 200.31 | 101.66 | 86.61 | 68.47 |
| stage1-test | 23.13 | 27.84 | 23.53 | 18.37 |
| stage1-rustc | 1727.49 | 1871.25 | 1667.80 | 1325.40 |
| stage1-trans-llvm | 217.83 | 255.12 | 221.41 | 177.87 |
| stage1-trans-emscripten | 126.75 | | | |
| stage1-std | 89.00 | 99.63 | 85.85 | 65.50 |
| stage0-unstable-book-gen | 13.26 | 14.43 | 14.54 | 11.37 |
| stage0-rustbook | 240.16 | 251.63 | 208.86 | 196.78 |
| stage2-rustdoc | 287.74 | 306.87 | 289.94 | 239.42 |
| doc/std | 15.35 | 15.90 | 14.23 | 12.92 |
| doc/compiler | 22.85 | 26.30 | 24.32 | 18.23 |
| stage1-rustc | 1701.37 | 1885.18 | 1566.31 | 1342.24 |
| doc/proc_macro | 1.35 | 1.38 | 1.43 | 1.19 |
| stage2-error_index_generator | 8.24 | 8.74 | 8.56 | 6.48 |
| stage0-fabricate | 129.00 | 137.81 | 139.86 | 109.77 |
| llvm | 223.84 | 228.95 | 368.32 | 287.82 |
| stage1-trans-llvm | 222.48 | 281.78 | 218.72 | 196.22 |
| llvm-emscripten | 133.85 | | | |
| stage1-trans-emscripten | 131.67 | | | |
| stage2-rustdoc | 304.44 | 327.96 | 275.49 | 236.46 |
| dist/rustsrc | 171.99 | | | |
| stage2-cargo | 589.43 | 629.74 | 498.94 | 406.81 |
| stage2-rustfmt | 338.24 | 361.24 | 314.88 | 264.20 |
| stage2-cargo-fmt | 0.00 | 0.00 | 0.00 | 0.00 |
| stage2-rls | | 665.78 | 523.11 | 440.94 |

The MIPS and PowerPC builds spent >20 minutes on stage0-std, which may be a different bug (maybe related to rust-lang/cargo#5030, as suggested by @alexcrichton).

PowerPC64 looks the most suspicious, as everything looks so uniform.

Graphed:

[screenshot: chart of the timing breakdown above]

@alexcrichton (Member)

Looking at some of those with the travis output:

  • PowerPC
    • The travis container was rebuilt here, which took ~20m; this would be an anomaly
    • This also took 20m to build libstd in stage0 (!)
  • MIPS
    • Took 20m to build libstd in stage0 (!)

Not sure what happened with PowerPC 64 and i686... I'm looking into the 20m libstd stage0 bug

kennytm changed the title from "Spurious 3-hour timeout for dist-* jobs" to "Spurious 3-hour timeout for dist-* jobs due to 20 minutes spent on compiling stage0-libstd" Feb 14, 2018
@kennytm (Member, Author) commented Feb 14, 2018

I've narrowed down this bug to focus on the 20-minute libstd builds.

@alexcrichton (Member)

Some information about why libstd is taking 20 minutes to build: I randomly ran across this the other day and was very surprised to see what was happening. The behavior I was seeing is that building libstd hung in stage0, and when looking at top, Cargo was just sitting there eating tons of CPU.

Looking into why Cargo was eating CPU, I discovered that Cargo's read2 function wasn't quite right: when one of the fds it was interested in was closed, it would just spin infinitely without blocking until the other fd was closed. The fix (rust-lang/cargo#5030) should fix the spin behavior, but I don't believe it will fix the "cargo hangs for 20 minutes" bug.

I have no idea why Cargo is waiting 20 minutes for child file descriptors to get closed. The fix in rust-lang/cargo#5030, if I understand it correctly, will basically just cause Cargo to block correctly rather than spin, but it won't help actually get the condition Cargo is waiting on to come about any sooner.

I've got no idea why one fd of a child process would get closed and the other wouldn't for 20 minutes. I wonder, though, whether this is related to the sccache and build scripts change, as sccache could in theory hold a descriptor open for a long time by accident. I haven't been able to confirm this locally, though.
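
To make the spin-vs-block distinction concrete, here is a minimal Rust sketch, not Cargo's actual `read2` code: the function name and shape are invented for illustration and it relies on the `libc` crate. It drains two pipe read-ends; once one of them hits EOF it has to be dropped from the poll set, otherwise `poll()` keeps returning `POLLHUP` for it immediately and the loop burns CPU exactly as described above.

```rust
use std::os::unix::io::RawFd;

// Sketch only: drain two pipe read-ends without busy-waiting.
fn drain_both(mut fds: [RawFd; 2]) {
    let mut open = 2;
    while open > 0 {
        let mut pfds = [
            libc::pollfd { fd: fds[0], events: libc::POLLIN, revents: 0 },
            libc::pollfd { fd: fds[1], events: libc::POLLIN, revents: 0 },
        ];
        // Blocks until a still-open descriptor is readable or hung up;
        // entries with a negative fd are ignored by poll().
        unsafe { libc::poll(pfds.as_mut_ptr(), 2, -1) };
        for (i, p) in pfds.iter().enumerate() {
            if p.revents == 0 {
                continue;
            }
            let mut buf = [0u8; 4096];
            let n = unsafe { libc::read(p.fd, buf.as_mut_ptr().cast(), buf.len()) };
            if n <= 0 {
                // EOF (or error): disable this entry. Leaving it in the set is
                // what causes the spin, since poll() would keep reporting
                // POLLHUP for it on every iteration.
                fds[i] = -1;
                open -= 1;
            }
        }
    }
}
```

The fix in rust-lang/cargo#5030 addresses the spinning half of this; as noted above, it does nothing to make the leaked descriptor close any sooner.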

@alexcrichton (Member)

Ok I'm like 99% sure it's the sccache change now. The sccache server is starting in the libstd/libcore build scripts and is somehow inheriting too many fds. My current assumption for what's happening is:

  • Cargo spawns a build script with stdout/stderr fds
  • The build script then spawns a subprocess like the configure script
  • The configure script now does stuff that doesn't use CLOEXEC. For example the configure script may dup its own stdout/stderr onto separate fds
  • The configure script eventually invokes sccache to test it
  • Sccache realizes no server is running so it daemonizes
  • Sccache's daemon inherits the original stdout/stderr fds, dup'd by the configure script as not CLOEXEC, and keeps them open
  • Everything exits except for sccache's daemon which stays alive for a 10m idle time
  • Cargo now attempts to finish reading the original stdout/stderr fds. One is closed but one leaked into sccache's server, so Cargo thinks it never gets closed.

I'm not entirely sure why a configure script (aka sh) would dup a stdio fd into a separate location without CLOEXEC, but it's currently the only way that I can explain this. I think we've got two fixes: either revert the sccache patch or try to start the sccache server earlier on in the compilation process. I vote we revert the sccache patch.
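
As a side note on the inheritance step above, here is a small hypothetical Rust program (not part of rustbuild; it only needs the `libc` crate) that checks whether a descriptor carries `FD_CLOEXEC`. A descriptor dup'd without that flag survives `exec()` and can end up held open by a long-lived daemon such as the sccache server.

```rust
use std::os::unix::io::AsRawFd;

fn main() {
    // Inspect this process's stdout as an example; a build script would have
    // inherited its stdout/stderr descriptors from Cargo.
    let fd = std::io::stdout().as_raw_fd();
    // F_GETFD reads the descriptor flags; FD_CLOEXEC is the only defined one.
    let flags = unsafe { libc::fcntl(fd, libc::F_GETFD) };
    if flags >= 0 && flags & libc::FD_CLOEXEC == 0 {
        println!("fd {} is not close-on-exec and will leak into exec()'d children", fd);
    } else {
        println!("fd {} is close-on-exec (or could not be inspected)", fd);
    }
}
```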

@kennytm I'm gonna be tight on time soon, mind doing the revert for me?

@alexcrichton (Member)

#48209 includes a revert of the sccache commit; presumably, if things stabilize after that lands, we can close this.

bors added a commit that referenced this issue Feb 14, 2018
Try to fix 48116 and 48192

The bug #48116 happens because of a misoptimization of the `import_path_to_string` function, where a `names` slice is empty but the `!names.is_empty()` branch is executed.

https://github.com/rust-lang/rust/blob/4d2d3fc5dadf894a8ad709a5860a549f2c0b1032/src/librustc_resolve/resolve_imports.rs#L1015-L1042

Yesterday, @eddyb had locally reproduced the bug, and [came across the `position` function](https://mozilla.logbot.info/rust-infra/20180214#c14296834) where the `assume()` call looked suspicious. We have *not* concluded that this `assume()` causes #48116, but given [the reputation of `assume()`](#45501 (comment)), this seems highly relevant. Here we try to see if commenting it out can fix the errors.

Later, @alexcrichton bisected and found a potential bug [in the LLVM side](#48116 (comment)). We are currently testing whether reverting that LLVM commit is enough to stop the bug. If so, this PR can be reverted (keeping the `assume()`) and we could backport the LLVM patch instead.

(This PR also includes an earlier commit from #48127 for help debugging ICE happening in compile-fail/parse-fail tests.)

The PR also reverts #48059, which seems to cause #48192.

r? @alexcrichton
cc @eddyb, @arthurprs (#47333)
@alexcrichton (Member)

omg cargo clobbers cloexec

@alexcrichton (Member)

Hm, no, never mind; those flags don't include the cloexec flag.

@alexcrichton (Member) commented Feb 14, 2018

Ok, I'm now 100% sure that this is a "bug", or rather simply a fact of how sh works.

Jemalloc's configure script has an innocuous line which I had to look up because I had no idea what it does. Apparently it duplicates stdout onto file descriptor 6! It also apparently doesn't set CLOEXEC in the shell, meaning that file descriptor (a dup of stdout) will now leak into all further processes.

I created a build script that looks like this:

```rust
use std::process::Command;

fn main() {
    let s = Command::new("sh")
        .arg("-c")
        .arg("exec 3>&1; exec 2>&-; sleep 600 &")
        .status()
        .unwrap();
    println!("{}", s);
}
```

and I can deterministically reproduce the hanging behavior. The `exec 3>&1` says "put file descriptor 1, stdout, on file descriptor 3"; the shell apparently doesn't set that duplicate to cloexec either. The next command says "close file descriptor 2", which helps show Cargo spinning. The last one is "go sleep in some process for a while, but return from the shell".

Overall this build script exits immediately, but Cargo sits there spinning and burning CPU. After rust-lang/cargo#5030, Cargo "properly blocks" waiting for the file descriptor to be closed, but it still never gets closed.

tl;dr: I think this issue is now for sure fixed with #48209 landed. In the meantime I think we just can't use sccache unless we start up the server outside of the main build process, where we're sure that fds won't leak in. From what I can tell the shell is "at fault" here, but I'm also like 99% sure that's a feature of the shell as well, so that's probably not gonna change. There's not much that Cargo/sccache can do here as they're basically just bystanders.
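
For reference, the "start the server outside the main build" workaround is simple in principle. The sketch below is hypothetical in where it would live (an outer driver script or rustbuild itself), but `--start-server` is an existing sccache flag: launching the server from the outer process before any build scripts run means the daemon only inherits that process's descriptors.

```rust
use std::process::Command;

fn main() {
    // Start the sccache server up front, before any build script can spawn it
    // with leaked descriptors.
    let status = Command::new("sccache")
        .arg("--start-server")
        .status()
        .expect("failed to spawn sccache");
    if !status.success() {
        // A non-zero exit may just mean a server is already running.
        eprintln!("sccache --start-server exited with {}", status);
    }
}
```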

Closing...

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 2, 2018
This is a re-attempt at rust-lang#48192 hopefully this time with 100% less randomly
[blocking builds for 20 minutes][block]. To work around rust-lang#48192 the sccache
server is started in the `run.sh` script very early on in the compilation
process.

[block]: rust-lang#48192
bors added a commit that referenced this issue Mar 4, 2018
rustbuild: Pass `ccache` to build scripts

This is a re-attempt at #48192 hopefully this time with 100% less randomly
[blocking builds for 20 minutes][block]. To work around #48192 the sccache
server is started in the `run.sh` script very early on in the compilation
process.

[block]: #48192