Use GitHub Actions for CI (attempt 2) #2465

Merged
merged 11 commits into from
May 15, 2020

Member

@richardxia richardxia commented May 13, 2020

Replaces #2459 to see if not having two simultaneous PRs open from the same branch fixes anything. This is still being submitted from my personal fork, since I want to test that this still works for people who don't have commit access to chipsalliance/rocket-chip.


Related issue: N/A

Type of change: other enhancement

Impact: API modification

Development Phase: implementation

Release Notes


I'm going to make up my own PR template to organize my thoughts:

Problem Statement

I've occasionally run into friction while using Travis CI, and I think it's worth trying out other CI services to see whether they have nicer experiences. Some of my main issues with Travis CI are:

  1. Documentation just doesn't feel that clear or polished. I often find myself not understanding exactly how something works and having to just try it out in order to test it.
  2. Parts of their product feel a bit "Frankenstein"-like: they ended up at a dead end API-wise and then had to support both an older API and a newer API simultaneously. For example, jobs configured as Build Stages and jobs configured using a normal build have noticeably different behavior and syntax.
  3. Travis CI kills any jobs that don't print to stdout at least once every 10 minutes, treating them as hangs. travis_wait is an annoying kludge, since it has the side effect of hiding all of the normal console output of your command until either it completes or the timeout kicks in.
  4. You are limited to five parallel jobs at a time.
  5. Anecdotally, I feel that I very often see transient failures due to networking issues, and I'm actually inclined to believe that the way Travis CI have set up their server networking, it's often an issue on their end rather than on the other side of the internet.
  6. The way caching is handled is confusing.

Why GitHub Actions may be a better solution

Addressing each of my points above, one by one:

  1. Although I feel that GitHub Actions' documentation could be better, it already feels a lot more detailed and precise than Travis CI's.
  2. GitHub Actions' APIs and schemas feel much more conceptually unified than Travis CI's. I find it less difficult to reason about how to make a configuration change in GitHub Actions.
  3. GitHub Actions does not have the restriction on needing to print to stdout at least once every 10 minutes. GitHub Actions also seems to have much longer max timeouts set (I think 5 hours per job).
  4. GitHub Actions appears to have a max of 20 concurrent jobs.
  5. Anecdotally, I feel that I see errors due to networking less frequently on GitHub Actions.
  6. GitHub Actions' caching APIs make a lot more sense to me. You can specify multiple caches, each with a different key, and they also have utility functions for computing hashes for keys based on the contents of files (e.g. build.sbt), which allows us to control cache hits and misses.
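
To illustrate the idea behind content-based cache keys in point 6, here is a minimal Python sketch. It is not how GitHub Actions implements its hashing helper; the function name `cache_key`, the prefix, and the file paths are assumptions made only for illustration. It shows that the key stays the same until one of the listed files changes.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, paths: list[str]) -> str:
    """Build a cache key from a prefix plus a digest of the given files' contents.

    Hypothetical helper for illustration; not the GitHub Actions implementation.
    """
    digest = hashlib.sha256()
    for path in sorted(paths):  # sort so the key does not depend on argument order
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between the file name and its contents
        digest.update(Path(path).read_bytes())
        digest.update(b"\0")  # separator between consecutive files
    return f"{prefix}-{digest.hexdigest()}"
```

A call such as `cache_key("sbt", ["build.sbt"])` yields the same key on every run until `build.sbt` is edited, at which point the key changes and the next lookup is a miss.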

Drawbacks to using GitHub Actions

Although I have had a fairly positive experience with GitHub Actions, I have noticed some things that feel like a step back from Travis CI:

  1. You cannot delete caches. You must either wait for the cache to be cleaned up automatically (one week with no hits) or change the cache key, which requires making a Git commit to change the configuration file.
  2. Parallel jobs flood the status indicator at the bottom of a PR. Travis normally has a single bubble for the entire workflow, but GitHub Actions has a bubble for each job in a workflow.
  3. GitHub Actions has a much, much smaller amount of disk available (14 GB, including whatever comes on the preinstalled VM image they use, vs. Travis CI's ~50 GB).
  4. You cannot restart an individual parallel job; you must restart the whole workflow.
  5. Anecdotally, I feel that GitHub Actions has run into a lot of infrastructural issues over the past several months (coincident with COVID-19?). https://www.githubstatus.com/ shows quite a lot of outages. Travis CI's for comparison: https://www.traviscistatus.com/

What this PR changes

I've done a heavy amount of rebasing and squashing so that my commits are ordered in a specific way.

Refactoring work that is probably useful to merge in even if we don't want to use GitHub Actions

45ef2f3 - I pulled out the actual make commands from the .travis.yml file into a bash script so that I could more easily run them in either Travis CI or GitHub Actions. This would affect anyone that wants to update the CI tests in both Travis CI and GitHub Actions, since they must now modify that bash script. I would want this to be easily modifiable by others, so please let me know if you find this more confusing or worse than what we had previously with Travis CI.

b5fca2d - I added a verilator.hash, modeled after riscv-tools.hash, so that I could share that version number between both Travis CI and GitHub Actions and so that I could use that hash to compute the cache key in GitHub Actions.

468da4e - I modified the regression/Makefile targets to invert the dependency between riscv-tests.stamp and rocket-tools_checkout.stamp. Previously, riscv-tests.stamp depended on rocket-tools_checkout.stamp, and riscv-tests.stamp was essentially a no-op, since it relied on rocket-tools_checkout.stamp to actually check out riscv-tests. I modified this because I noticed it was taking 30 minutes to just clone all of riscv-tools' submodules, particularly riscv-gnu-toolchain and fsf-binutils-gdb, even if you have a cache hit on a precompiled riscv-tools, since the GDB tests require the actual riscv-tests source code to be checked out.

I changed this so that now rocket-tools_checkout.stamp depends on riscv-tests.stamp, and riscv-tests.stamp is now responsible for doing the clone of rocket-tools.git and riscv-tests.git. rocket-tools_checkout.stamp is still responsible for cloning all the other submodules. If @aswaterman or someone could let me know if I've done this correctly, that would be great, since I don't think I understand how these targets work or how they are meant to be used in local development.
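
The effect of inverting the dependency can be pictured with a small Python sketch. The two graphs contain only the two stamp targets named above, and the helper `needed_targets` is hypothetical; the real regression/Makefile has more targets and actual recipes. The sketch only shows which targets have to be built to obtain riscv-tests.stamp before and after the change.

```python
def needed_targets(target: str, deps: dict[str, list[str]]) -> list[str]:
    """Return the targets to build, prerequisites first, to obtain `target`.

    Assumes the dependency graph is acyclic, as a Makefile's must be.
    """
    order: list[str] = []

    def visit(name: str) -> None:
        if name in order:
            return
        for prerequisite in deps.get(name, []):
            visit(prerequisite)
        order.append(name)

    visit(target)
    return order


# Before: riscv-tests.stamp pulled in the checkout of every submodule.
before = {"riscv-tests.stamp": ["rocket-tools_checkout.stamp"]}
# After: the full checkout depends on the lighter riscv-tests clone instead.
after = {"rocket-tools_checkout.stamp": ["riscv-tests.stamp"]}

print(needed_targets("riscv-tests.stamp", before))
print(needed_targets("riscv-tests.stamp", after))
```

With the "after" graph, asking for riscv-tests.stamp no longer drags in rocket-tools_checkout.stamp, which is where the slow submodule clones live.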

Adding the actual GitHub Actions workflow definitions

I broke up the commits such that each commit introduces a different job in the workflow file (e.g. wit submodule check, prepare riscv-tools cache, etc.). I think the main interesting one is the one that actually runs the main tests, which I set up as a matrix job: 015dc0b

The other main commit of note is 7885c0f, where I added a README_GITHUB_ACTIONS.md.

Proposed rollout

I'm not sure if we want to make a big change to flip from Travis CI to GitHub Actions, so I tried to develop this PR to support running both CI systems simultaneously. I think it may even be advantageous to perpetually run both CI systems, since an outage in one service will (hopefully) not affect the other service, and we can more comfortably waive transient failures from one service with the passing results from the other.

I think we actually do get quite a lot of resilience from using both providers, since they even use different cloud vendors underneath: Travis CI uses Google Cloud Platform, while GitHub Actions uses Microsoft Azure. This means that we're more robust to datacenter-wide failures as well as entire GCP- or Azure-wide failures.

That said, it is more overhead to maintain two CI systems, and I think there is a social risk of letting one system rot if it starts deterministically failing, since we always have the other to rely on.

Special thanks

@aswaterman, for helping me with working out some issues with the compilation of the fesvr and Verilator when using mismatched versions of g++! That really helped unblock me and get me to the finish line!

Member

@aswaterman aswaterman left a comment

🤞

@richardxia
Member Author

Ironically, the only thing that failed was one of the long-running test buckets in Travis CI. Let me restart that one job.

@richardxia
Member Author

It's still failing Travis CI...

Since I'm not a frequent contributor to rocket-chip, I don't think I have a good sense of the current state of the build. Have other people noticed the seventh and final test bucket (make emulator-ndebug -C regression SUITE=Miscellaneous) having timeout issues? Do we just need to bump the timeout, or is there a potential underlying problem we need to be aware of?

Comparing the running times between GitHub Actions and Travis CI, it seems like the GitHub Actions builds usually complete about 20 minutes faster, although I see that Test Bucket 3 took 1 hr 30 min on GH Actions and 1 hr 35 min on Travis CI, so sometimes they are close. Bucket 7 took 55 min on GH Actions, and is timing out after 1 hr 20 min on Travis. One thing that is notable is that with Travis CI, we are setting timeouts per make command, not per job, and Bucket 7 comprises only a single make command.

@richardxia
Member Author

For another data point, the last build that I did on my personal fork of rocket-chip took 1 hr 5 min on Bucket 7: https://github.com/richardxia/rocket-chip/runs/663756436?check_suite_focus=true

@terpstra
Contributor

I have had to restart travis jobs 4-5 times to get past the download/timeout issues we've been having. So, your PR failing travis is not an anomaly.

@aswaterman
Member

The Travis failure doesn't surprise me, either. We can ignore it if we want to give Actions a shot.

Ironically due to Travis CI continuing to time out on it.
@richardxia
Member Author

I tried restarting it like three times today, to no avail.

I am now bumping the Travis CI timeout on that last bucket from 80 to 100 minutes: 3145f71

@terpstra
Contributor

Can we please merge this and disable Travis? 🥇

@richardxia
Member Author

I will merge this in and let someone else do the disabling Travis part.

@richardxia richardxia merged commit a31e00d into chipsalliance:master May 15, 2020
@richardxia richardxia deleted the use-github-action-redux branch May 15, 2020 00:24
@aswaterman
Member

aswaterman commented May 15, 2020 via email
