Tweak Travis to use GCE #28500

Merged
merged 1 commit into from
Sep 30, 2015
Conversation

alexcrichton
Member

Travis CI has new infrastructure using the Google Compute Engine which has both
faster CPUs and more memory, and we've been encouraged to switch as it should
help our build times! The only downside currently, however, is that IPv6 is
disabled, causing a number of standard library tests to fail.

Consequently this commit tweaks our travis config in a few ways:

  • ccache is disabled as it's not working on GCE just yet
  • Tests are run inside a Docker container, which reportedly gets IPv6 working
  • A system LLVM installation is used instead of building LLVM itself. This is
    primarily done to reduce build times, but we want automation for this sort of
    behavior anyway and we can extend this in the future with building from source
    as well if needed.
  • gcc-specific logic is removed as the docker image for Ubuntu gives us a
    recent-enough gcc by default.
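Because the GCE workers have IPv6 disabled, a test harness could probe for IPv6 up front instead of failing deep inside a networking test. The following is a minimal sketch. It assumes that being able to bind a UDP socket to the IPv6 loopback address is a reasonable proxy for "IPv6 works". The helper names are hypothetical and are not part of Rust's test suite.

```rust
use std::net::UdpSocket;

/// Hypothetical helper: returns true if a UDP socket can be bound to `addr`.
/// Port 0 lets the OS pick any free port, so the result depends only on
/// whether the address itself is usable.
fn can_bind(addr: &str) -> bool {
    UdpSocket::bind(addr).is_ok()
}

/// Hypothetical check: treat IPv6 as usable if the IPv6 loopback can be bound.
fn ipv6_loopback_available() -> bool {
    can_bind("[::1]:0")
}
```

A test that needs IPv6 could call ipv6_loopback_available() first and report a clear "IPv6 unavailable" message, rather than surfacing an opaque socket error.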

@rust-highfive
Collaborator

r? @pcwalton

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton
Member Author

(should hold off on merging until travis actually passes)

cc @joshk, gonna have this be the continuation of #28437

@Gankra
Contributor

Gankra commented Sep 18, 2015

cc me

@Gankra
Contributor

Gankra commented Sep 18, 2015

Maybe I'm misremembering, but I thought we preferred building our patched up LLVM for maximal similarity to the buildbots?

@alexcrichton
Member Author

Yeah that's nice to have, but we've also gotten more requests recently to ensure we work with a stock build of Travis, and we may want to start gating on Travis CI as well soon (in addition to buildbot). The gating should fix the "accidentally broken" problem and if possible we can also have a build on Travis which builds LLVM from source.

Looks like the travis build still had failures:


failures:

---- [run-pass] run-pass/core-run-destroy.rs stdout ----

error: test run failed!
status: exit code: 101
command: x86_64-unknown-linux-gnu/test/run-pass/core-run-destroy.stage2-x86_64-unknown-linux-gnu 
stdout:
------------------------------------------

running 1 test
test test_destroy_actually_kills ... FAILED

failures:

---- test_destroy_actually_kills stdout ----
    thread 'test_destroy_actually_kills' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 13, message: "Permission denied" } }', src/libcore/result.rs:736



failures:
    test_destroy_actually_kills

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured


------------------------------------------
stderr:
------------------------------------------

------------------------------------------

thread '[run-pass] run-pass/core-run-destroy.rs' panicked at 'explicit panic', /build/src/compiletest/runtest.rs:1501


---- [run-pass] run-pass/issue-26468.rs stdout ----

error: test run failed!
status: exit code: 101
command: x86_64-unknown-linux-gnu/test/run-pass/issue-26468.stage2-x86_64-unknown-linux-gnu 
stdout:
------------------------------------------

------------------------------------------
stderr:
------------------------------------------
thread '<main>' panicked at 'assertion failed: `(left == right)` (left: `42`, right: `19`)', /build/src/test/run-pass/issue-26468.rs:37

------------------------------------------

thread '[run-pass] run-pass/issue-26468.rs' panicked at 'explicit panic', /build/src/compiletest/runtest.rs:1501



failures:
    [run-pass] run-pass/core-run-destroy.rs
    [run-pass] run-pass/issue-26468.rs
failures:

---- sys::process::tests::test_process_mask stdout ----
    thread 'sys::process::tests::test_process_mask' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 13, message: "Permission denied" } }', src/libcore/result.rs:736



failures:
    net::udp::tests::bind_error
    process::tests::signal_reported_right
    sys::process::tests::test_process_mask

Retrying with some more diagnostics and hopefully some fixes.

@alexcrichton
Member Author

@joshk interestingly it looks like all our calls to the kill syscall are failing with EPERM; you wouldn't happen to have seen this before, would you?

@alexcrichton
Member Author

Oh wait, travis-ci/travis-ci#4751 indicates that --privileged is a thing for Docker, let's try that!

@alexcrichton alexcrichton force-pushed the docker-travis branch 3 times, most recently from cb4dd2d to a6b9fe8 September 19, 2015 19:28
@joshk

joshk commented Sep 20, 2015

Is it working better now?

@joshk

joshk commented Sep 20, 2015

Also, I see some things we can do to improve this further :)

@alexcrichton
Member Author

@joshk yeah so far looking so good, the IPv6 tests are passing in the docker container and the only remaining failure is because there's a known bug in LLVM 3.6 which causes our tests to fail, so I just need to figure out how to install LLVM 3.7 instead!

@joshk

joshk commented Sep 21, 2015

@alexcrichton I would suggest creating a new Docker image (e.g. something like rust-base) which has all the deps you need for testing preinstalled. Then all Travis needs to do is clone the repo, pull the Docker image, and run the tests.

@alexcrichton alexcrichton force-pushed the docker-travis branch 2 times, most recently from f02f8ff to b284636 September 21, 2015 18:44
@alexcrichton
Member Author

Huzzah! That did the trick! So, to summarize the changes here to get the build working:

  • I added a Dockerfile to src/etc which is used by the Travis bots to build an image inside of which all tests are run. This is built via docker build in a before_install step on Travis.
  • I fixed a UDP test to work even if the tests are run as root (which happens here apparently)
  • A few process-related tests were touched up in terms of style and what they do, but nothing major was changed. Without the --privileged flag to Docker we can't use the kill syscall, and I had to do some diagnosing to figure this out, but it ended up "just fixing" all the tests once this flag was passed. To be clear, I have no idea what this flag does beyond make our tests start passing.
  • Turns out there were a few straggler commits in LLVM 3.7 we didn't pick up recently, so I updated our fork of LLVM to use the true upstream 3.7 release and tweaked one of our bindings as well.
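The thread does not show the actual UDP test change, so what follows is only a sketch of one privilege-independent approach. A test that expects binding a port below 1024 to fail will pass as an ordinary user but break as root, because root is allowed to bind low ports. Binding to an address that no local interface owns should fail for any user. The address 1.1.1.1 and the helper name are illustrative assumptions.

```rust
use std::io;
use std::net::UdpSocket;

/// Hypothetical helper for a "bind must fail" test that does not depend on
/// privileges. Assumes 1.1.1.1 is not assigned to any local interface, so the
/// kernel should refuse the bind (typically with EADDRNOTAVAIL), even for root.
fn bind_non_local_address() -> io::Result<UdpSocket> {
    UdpSocket::bind("1.1.1.1:9999")
}
```

On Linux, this can still succeed if the host allows binding to non-local addresses (for example, via the net.ipv4.ip_nonlocal_bind sysctl). A real test would need to tolerate or rule out that configuration.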

So I'm pretty comfortable with this:

  • I like the idea of us having a higher timeout by default here so we're not "sketchily getting past timeouts"
  • I like using Docker to start building from an "absolutely clean slate" to keep our dependencies under control
  • I like docker as it's relatively easy to reproduce failures locally
  • I like having some form of CI for using system LLVM as we want this regardless
  • This opens up the door to running a lot more tests on Travis (yay!)

So all in all, r? @brson

@rust-highfive rust-highfive assigned brson and unassigned pcwalton Sep 21, 2015
@joshk

joshk commented Sep 21, 2015

Wow, nice!

Two quick questions:

  1. Does using a higher -j value in make improve the build time at all?
  2. Can the build be split up so that you can use different jobs to run different parts of the test suite? (thus reducing the overall build time)

@alexcrichton
Member Author

Does using a higher -j value in make improve the build time at all?

How many cores do these machines have? e.g. is there a recommended -j for us to use?

Can the build be split up so that you can use different jobs to run different parts of the test suite? (thus reducing the overall build time)

The current way our makefiles are set up doesn't make this super easy, but logically this is totally plausible. We've got one ~30min build to produce a compiler followed by N test suites (which greatly vary in size), but all of the N test suites can be run in parallel (just gotta make sure they're all run).

@joshk

joshk commented Sep 21, 2015

The current instances use 2 CPUs, but playing around with a higher -j value might pay off, or might not.

As for breaking up the test suite, this is cool to know. It would mean, at least for now, that each job would also need to build its own compiler. We have plans to improve this, but it might be work we schedule for next year and work on together.

@alexcrichton
Member Author

Ah ok, I think that @gankro played around with different values of -j in the past and didn't see much benefit beyond 2 (but went with 4 to be safe). Also we'd totally be willing to tweak our makefiles to conform to whatever framework is in place to run tests in parallel, I have a feeling it'd definitely speed up our builds!

@joshk

joshk commented Sep 21, 2015

You can use env vars to break up your build into N jobs. But if we can partner on some work, we can look into build pipeline support for Travis, allowing you to build a compiler once instead of N times. And maybe we can also partner this year on the ability to opt in to larger VMs?
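The env-var approach fits the "one compiler build, then N independent test suites" layout described above. Each Travis job is given an index and a job count, and it runs only its share of the suites. Here is a minimal round-robin sketch. The variable names TEST_SHARD_INDEX and TEST_SHARD_COUNT, as well as the suite names, are hypothetical and not part of Rust's build. As noted in the thread, each job would still have to build its own compiler first.

```rust
/// Hypothetical helper: round-robin assignment of test suites to CI jobs.
/// Job `job_index` (0-based) out of `job_count` runs every suite whose
/// position modulo `job_count` equals its index.
fn suites_for_job<'a>(suites: &[&'a str], job_index: usize, job_count: usize) -> Vec<&'a str> {
    assert!(job_count > 0 && job_index < job_count, "invalid shard");
    suites
        .iter()
        .enumerate()
        .filter(|(i, _)| i % job_count == job_index)
        .map(|(_, s)| *s)
        .collect()
}

/// Reads the shard from hypothetical env vars set per job in .travis.yml,
/// defaulting to a single job (index 0 of 1) when they are absent.
fn shard_from_env() -> (usize, usize) {
    let read = |name: &str, default: usize| {
        std::env::var(name)
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    };
    (read("TEST_SHARD_INDEX", 0), read("TEST_SHARD_COUNT", 1))
}
```

With the suites run-pass, compile-fail, run-fail, codegen, and debuginfo split across two jobs, job 0 runs run-pass, run-fail, and debuginfo, and job 1 runs compile-fail and codegen. Round-robin ignores the fact that the suites vary greatly in size, so weighting by past run times would balance the jobs better.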

@alexcrichton
Member Author

That sounds like it'd work for us! We'd love to put more of our testing on Travis. We'd basically be producing N different compiler configurations, each of which needs to run M test suites, so we could manually encode the NxM matrix into .travis.yml, but for now we'll probably stick to just N lines :). If, however, we could encode N+M steps into our configuration (e.g. how to build a compiler and then how to test any compiler), that'd be awesome!

I'm sure we'd also be more than willing to help out wherever possible. Ideally we'd be able to move off buildbot completely, but we're probably a ways out from that!

@alexcrichton
Member Author

@joshk hm interesting, looks like the recent build we added (which enables debug assertions and debuginfo in the compiler itself) timed out. It also looks like the normal build took ~15min longer than usual (perhaps that's normal?). In light of that, should we keep trying to optimize our build, or would it be possible to increase the timeout a bit more? It's totally reasonable to say a 3hr timeout is a bit unreasonable :)

@joshk

joshk commented Sep 22, 2015

I can increase it to 3hrs if you like, so you can test further, but this means, of the 5 jobs you can run at once, you could have the queue blocked for 3 hours at a time due to long running jobs.

@alexcrichton
Member Author

@joshk hm ok, if it's alright we'll take the higher timeout for now and probably investigate how to parallelize more or just run fewer tests on our end.

@joshk

joshk commented Sep 27, 2015

Done!

(sorry for the wait)

@brson
Contributor

brson commented Sep 28, 2015

r=me

@alexcrichton
Member Author

Hm, still waiting on a successful run from travis, haven't gotten one from the debug builder yet...

@alexcrichton
Member Author

OK, looks like we may not be able to run the debug builder on travis (just takes too long), so just updated back to only using the system LLVM + make check, and let's see how far we get!

@alexcrichton
Member Author

@bors: r+ 0f09984

bors added a commit that referenced this pull request Sep 29, 2015
@bors
Contributor

bors commented Sep 29, 2015

⌛ Testing commit 0f09984 with merge 759bf1c...

@bors
Contributor

bors commented Sep 29, 2015

💔 Test failed - auto-linux-64-x-android-t

@tamird
Copy link
Contributor

tamird commented Sep 29, 2015

http://buildbot.rust-lang.org/builders/auto-linux-64-x-android-t/builds/6536/steps/test/logs/stdio

failure was signal_reported_right

Weird, that test is marked #[cfg(all(unix, not(target_os="android")))]

@alexcrichton
Member Author

@bors: r=brson 27dd6dd

Oh that's because I switched it to cfg(unix), oops!
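A test like signal_reported_right exercises the kind of process-killing call that the thread reports failing with EPERM until Docker was given --privileged. Its cfg attribute is what keeps it off Android. Switching to plain cfg(unix) re-enabled the test on Android, which is what tripped the auto-linux-64-x-android-t builder. Below is a minimal sketch of such a test, not the actual std test. It assumes a `sleep` binary is on PATH, and the helper name is hypothetical.

```rust
use std::io;
use std::process::Command;

/// Hypothetical helper: spawn a long-running child, kill it, and return the
/// signal that terminated it (None if it exited normally).
#[cfg(unix)]
fn kill_child_and_report_signal() -> io::Result<Option<i32>> {
    use std::os::unix::process::ExitStatusExt;

    // Assumes a `sleep` binary on PATH, as on typical Linux images.
    let mut child = Command::new("sleep").arg("1000").spawn()?;
    // On Unix, Child::kill sends SIGKILL via the kill syscall.
    child.kill()?;
    let status = child.wait()?;
    Ok(status.signal())
}

// The gating tamird expected: any Unix target except Android.
#[cfg(all(unix, not(target_os = "android")))]
#[test]
fn signal_reported_right() {
    // SIGKILL is signal 9 on Linux.
    assert_eq!(kill_child_and_report_signal().unwrap(), Some(9));
}
```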

bors added a commit that referenced this pull request Sep 30, 2015
@bors
Contributor

bors commented Sep 30, 2015

⌛ Testing commit 27dd6dd with merge 15db6ec...

#[cfg(all(unix, not(target_os="android")))]
#[test]
Contributor


@alexcrichton did you mean to remove this?

Member Author


Gah oops! I did indeed not mean to!

@bors bors merged commit 27dd6dd into rust-lang:master Sep 30, 2015
RUN apt-get -y --force-yes install llvm-3.7-tools

RUN mkdir /build
WORKDIR /build
Member


@alexcrichton IIRC, each RUN creates a new file system layer. Combining the RUNs into one (just using &&) might speed up your docker build.

If this image is cached anywhere (which might make sense), you might also want to append some clean-up calls to the RUN calling apt-get to reduce the size. I'm no authority on this, but I've often seen stuff like apt-get autoremove -y && apt-get clean all && rm -rf /var/lib/apt/lists/*.

Member Author


Good point! Right now I don't think that this is anywhere near the limiting factor of our builds, however, so it may not be too bad one way or the other.

If building an image ends up taking too long in the future we'll probably want to just send it up to the hub and download it from there, but hopefully it won't be taking too too long!

steveklabnik added a commit to steveklabnik/rust that referenced this pull request Oct 20, 2015
steveklabnik added a commit to steveklabnik/rust that referenced this pull request Oct 20, 2015
@alexcrichton alexcrichton deleted the docker-travis branch October 21, 2015 06:16
9 participants