
LLVM pre-merge tests operations blog


2022-01-19

Updated k8s deployment from PR https://github.com/google/llvm-premerge-checks/pull/380:

  1. Checked out the PR locally:
     git fetch origin pull/380/head:pr380
     git checkout pr380
  2. Applied the changes from kubernetes/buildkite/linux-agents.yaml to kubernetes/buildkite/linux-agents-test.yaml.
  3. Built and deployed the Docker image: sudo ./containers/build_deploy.sh buildkite-premerge-debian
  4. Applied the changes to the linux-agents-test deployment: kubectl apply -f kubernetes/buildkite/linux-agents-test.yaml
  5. Started a new build on "main" at https://buildkite.com/llvm-project/llvm-main with additional parameters (see the sketch after this list):
     ph_linux_agents={"queue": "linux-test"}
     ph_skip_windows=yes
     ph_skip_generated=yes
  6. Connected manually to an agent to verify that Buildkite uses the intended folder: kubectl exec -it linux-agents-test-7c7ddf5c88-w9t62 -n buildkite -- /bin/bash
  7. Waited for the build to succeed: https://buildkite.com/llvm-project/llvm-main/builds/4121#7dd1104c-402d-4f98-8dca-8ccc4055607c
  8. Merged the PR.
  9. Updated the image label to "stable" in the GCP interface.
  10. Updated the stable deployment: kubectl apply -f kubernetes/buildkite/linux-agents.yaml
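
For step 5, the ph_* parameters are passed to the build as environment variables. A minimal sketch of triggering such a build through the Buildkite REST API in Python, assuming the requests package and an API token with write_builds scope in BUILDKITE_API_TOKEN (illustrative only, not a script from the repo):

      # Sketch: start an llvm-main build with the custom ph_* env vars set.
      # Token handling and error reporting are simplified; illustrative only.
      import json
      import os

      import requests

      API = "https://api.buildkite.com/v2/organizations/llvm-project/pipelines/llvm-main/builds"

      payload = {
          "commit": "HEAD",
          "branch": "main",
          "message": "test the linux-test queue",
          "env": {
              "ph_linux_agents": json.dumps({"queue": "linux-test"}),
              "ph_skip_windows": "yes",
              "ph_skip_generated": "yes",
          },
      }

      resp = requests.post(
          API,
          headers={"Authorization": f"Bearer {os.environ['BUILDKITE_API_TOKEN']}"},
          json=payload,
          timeout=30,
      )
      resp.raise_for_status()
      print("started build:", resp.json()["web_url"])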

2021-05-20

  • Updated Windows instances: added a few n2d instances to replace the e2 ones, which were quite slow. Added 'machine_type' to the agent metadata.

2021-05-17

  • Updated Linux agent nodes. The biggest change: they no longer use an external SSD; instead, the boot disk itself is an SSD. The node type changed from n1-standard-32 to e2-standard-32.

2021-05-04

  • Already-running builds are now cancelled when a new diff is uploaded for the same revision (#278).

2021-04-26

  • Added a separate 'service' queue to process lightweight tasks.

2021-02-24

  • Updated Phabricator rules to trigger builds on all revisions except those for known "other" repos (#263).

2020-09-28

  • All Windows builds started failing on Friday
    • error message:
      # Removing C:\ws\w16n2-1\llvm-project\llvm-master-build
      🚨 Error: Failed to remove "C:\ws\w16n2-1\llvm-project\llvm-master-build" (remove C:\ws\w16n2-1\llvm-project\llvm-master-build\build\.ninja_deps: The process cannot access the file because it is being used by another process.)
      
    • Manually deleting the files also failed on buildkite-windows-16n2-1; it looks like some process still had open file handles.
    • Debugging with Process Explorer (choco install procexp), run as Admin, showed two ninja processes still running; I killed both of them. After that I could delete the folders. The open file handle was: \Device\HarddiskVolume7\ws\w1\llvm-project\premerge-checks\build\tools\clang\test\ARCMT\
    • I rebooted all machines. Rebooting failed to automatically restart the docker containers on buildkite-windows-32cpu-ssd-1; I restarted them with windows_agent_start_buildkite.ps1. The machines needed a few minutes to become visible in Buildkite after a restart.
    • Since all Windows machines were affected, this seems to be caused by a recent change. So far it's unclear what caused this and whether it is solved now.
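    • One way to guard against this in the workspace cleanup step would be to kill leftover ninja processes before deleting the build directory. A rough sketch of that idea in Python, assuming the psutil package is available on the Windows agents (illustrative only, not part of our scripts):
      # Sketch: kill stale ninja processes holding handles in the workspace,
      # then remove it. psutil and the hard-coded path are assumptions.
      import shutil

      import psutil

      def kill_stale_ninja_and_clean(workspace):
          for proc in psutil.process_iter(["name"]):
              try:
                  name = (proc.info["name"] or "").lower()
                  if name.startswith("ninja"):
                      proc.kill()  # releases the open file handles
              except (psutil.NoSuchProcess, psutil.AccessDenied):
                  continue
          shutil.rmtree(workspace, ignore_errors=True)

      kill_stale_ninja_and_clean(r"C:\ws\w16n2-1\llvm-project\llvm-master-build")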

2020-09-01

  • Removed Jenkins configuration, nodes, disk storage, etc. For reference: the last commit with the Jenkins configuration is tagged jenkins.

2020-08-06

  • Looked into stability of the master branch as that is crucial for the false-positive rate of pre-merge testing.
    • We had several long-term build breakages on master, some lasting for several days.
    • Most recently Polly was failing; that has since been fixed.
    • I reverted https://reviews.llvm.org/D79219 to get the Windows build green again.
    • I suppose we need to be actively monitoring the master branch and revert commits that break it. However, it's probably easier to point to failing buildbots, as people are used to that.
  • The only proper solution is to have mandatory pre-merge testing in place so that it becomes really hard to break the master branch. However, I do not see that happening in the current infrastructure and workflows.
  • With the goal of more stability on the master branch, I guess we need to look into options for automatic reverts of commits breaking master. But this means we need a solution where we can timely build and test every commit on master so we know what to revert. With Windows builds taking ~1 hour, bisecting is not really an option for a fast response.

2020-05-29

2020-05-28

  • Jenkins master was offline: "503 Service Temporarily Unavailable".
    • GCP reports Jenkins as "unavailable".
    • Builds on Phabricator failed since last night: https://reviews.llvm.org/harbormaster/build/68946/
      • It seems to have given up on those builds.
    • Jenkins logs did not contain anything insightful: kubectl logs <pod name> --namespace jenkins | less
    • I killed the master process and let k8s restart it. The master is back online and immediately started building something.
  • I just saw another two instances of the 503 Service Temporarily Unavailable message, 4 hours after fixing the last one.
    • The event log of the pod shows >4400 errors of type Readiness probe failed: HTTP probe failed with statuscode: 503 in the last month and 30 restarts. So something is definitely broken there :(
    • And after waiting for 2 minutes it recovered automatically.
    • As we have "only" 30 restarts for those >4,400 failed readiness probes, the service seems to recover most of the time.
    • Hypothesis: Jenkins is sometimes too slow to respond to the health check, so the probe fails (a latency-probe sketch follows at the end of this entry).
  • Jenkins metrics show lots of 503 errors, CPUs maxing out at 50%, RAM usage at 75%.
    • HTTP requests take up to 110 seconds!
    • I increased CPU and RAM limits in Kubernetes, we still have resources available on our default node pool.
    • We could also switch to a "Persistent SSD" for the Jenkins home directory. This should speed up IO.
    • I had to cancel a few Windows builds as they somehow got stuck as part of the master restart.
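  • To test the health-check hypothesis above, a small probe that measures Jenkins response times over time would help. A sketch in Python; the URL and thresholds are placeholders and this is not something we deployed:
      # Sketch: poll Jenkins and log slow or failing responses.
      # The URL and the 10 s threshold are placeholders; illustrative only.
      import time

      import requests

      JENKINS_URL = "http://jenkins.example.internal/login"  # placeholder

      while True:
          start = time.monotonic()
          try:
              resp = requests.get(JENKINS_URL, timeout=30)
              elapsed = time.monotonic() - start
              if resp.status_code != 200 or elapsed > 10:
                  print(f"slow/unhealthy: status={resp.status_code} elapsed={elapsed:.1f}s")
          except requests.RequestException as exc:
              print(f"probe failed after {time.monotonic() - start:.1f}s: {exc}")
          time.sleep(60)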

2020-05-27

2020-05-25

  • Windows build logs are not accessible any more (#187)
    • If you now open the URL from a browser in private mode, you are required to log in with a Google account. This is not the intention! I am very sure that this used to work before.
    • I double checked the GCS bucket configuration against the documentation and everything looks good.
    • The bucket is also listed as "Public to internet".
    • I created a ticket for that: b/157393846
  • Checked on the daily-cleanup job and that seems to be doing fine. We have around 1,900 open branches on the GitHub repo.
  • Checked on the build times (Jenkins only):
    • Linux: 11-40 minutes
    • Windows: 17-80 minutes
      • The slowest build there really was that slow: 50 min for ninja all, 26 min for ninja check-all. The queuing time was only 3 seconds, and sccache was enabled.
      • So to speed up this job, we need to speed up compile times.

2020-05-20

  • Fixed a missing dependency for libc in 'llvm-dependencies.yaml'.

2020-05-14

  • No major issues.
  • To further optimize the tests we would need better metrics and monitoring:
    • We sometimes have long queuing times for apply_patch in Jenkins. I've seen up to 1 hour of queueing.
      • Maybe we should add 2 n2-standard-1 machines for these small jobs to take the load off the large build machines and get them done quickly. This would also be useful on Buildkite. I'm not sure what the IO performance of such a machine with a persistent SSD instead of the local SSD would be.
    • I enabled timestamps in the build logs a few days back, so now we can extract the queuing times from these build logs (time between Waiting for next available.* and Running on .*; a parsing sketch follows at the end of this entry):
      00:00:16.413  Waiting for next available executor on ‘linux’
      01:18:15.398  Running on agent-debian-testing-ssd-76469c58dd-xm76d-20d26a14 in /mnt/disks/ssd0/agent/workspace/apply_patch
      
    • There were also 2 timeouts (1, 2) on Linux builds. We should also create a metric for this. In both cases the tests never finished.
      Cancelling nested steps due to timeout
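    • A minimal sketch of the queuing-time extraction mentioned above, assuming the HH:MM:SS.mmm elapsed-time format shown in the log snippet (an idea for a metric, not an existing script):
      # Sketch: compute the queuing time from a timestamped Jenkins build log,
      # i.e. the gap between "Waiting for next available ..." and "Running on ...".
      import re

      TS = r"(\d+:\d+:\d+\.\d+)"
      WAIT_RE = re.compile(TS + r"\s+Waiting for next available")
      RUN_RE = re.compile(TS + r"\s+Running on ")

      def to_seconds(ts):
          h, m, s = ts.split(":")
          return int(h) * 3600 + int(m) * 60 + float(s)

      def queuing_time(log_text):
          waiting = running = None
          for line in log_text.splitlines():
              if waiting is None and (m := WAIT_RE.search(line)):
                  waiting = to_seconds(m.group(1))
              elif running is None and (m := RUN_RE.search(line)):
                  running = to_seconds(m.group(1))
          if waiting is None or running is None:
              return None
          return running - waiting  # ~4679 s (~78 min) for the example above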
      

2020-05-13

  • Servers seem to be doing fine, no problems so far. This is the first day in quite some time without new Windows issues...

2020-05-12

  • Windows build times (including queuing) are between 15 and 100 minutes.
    • Maybe we should add one or two more machines.
    • I'll keep observing this for some more time.
  • fixed #181

2020-05-11

  • lldb tests are failing on Linux, while they pass on my workstation. Disabled lldb again.
    • full log
    • Pinged Eric and Pavel on what to do.
  • Some Windows agents were failing while uploading results to GCS, as I forgot to copy the .boto file to c:\credentails.
    • The file is "hidden" and is not copied by default in Windows Explorer.
    • So I added the file and restarted the docker containers.
    • Also triggered master_windows_visualstudio builds on all machines to check the results.
  • The new daily-cleanup job worked well over the weekend.
  • Created some charts on peak hours and days based on the recent builds.
    • Saturday and Sunday are quite slow, Tuesday is the most busy day.
    • It would definitely make sense to scale the number of workers, either based on the build load or on the day of the week.
    • I looked into scaling the number of agents, but I would wait with that until we've moved to Buildkite; see #70 for details.
  • Enabled Windows builds for all users and sent out email.

2020-05-08

  • preparing for Windows rollout to everyone
    • Production has about 120 builds per day, beta testing around 40. So the difference is 3x.
    • We now have 6 windows agents for Jenkins (numbers 1,2,4,5,6,7).
      • configuration of these: n1-standard-16 (16 vCPUs, 60 GB memory) with a local scratch SSD
      • Agent 3 is working for Buildkite, still using the old setup with 32 cores.
      • I hope this is enough to cover the additional Windows build load.
    • We have 2 sets of Jenkinsfiles and build jobs, one for beta testing and one for everyone. The ones for beta testing have the prefix BETA. This way we can roll out changes separately to beta testing and production.
    • On top of that we have one build job each for Linux and Windows, building the master branch every 4 hours.
    • I cleaned up the old build jobs and Jenkinsfiles.
  • deployed new pipeline daily-cleanup to delete branches older than 30 days.
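    • The cleanup logic boils down to listing the branches with their last commit date and deleting the ones older than 30 days. A rough sketch of that logic in Python, using plain git commands via subprocess (filtering to only the pre-merge branches is omitted; this is not the actual pipeline code):
      # Sketch: delete remote branches whose last commit is older than 30 days.
      # Filtering to only pre-merge branches is omitted; illustrative only.
      import subprocess
      from datetime import datetime, timedelta, timezone

      CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)

      out = subprocess.run(
          ["git", "for-each-ref",
           "--format=%(refname:short) %(committerdate:iso8601-strict)",
           "refs/remotes/origin/"],
          capture_output=True, text=True, check=True,
      ).stdout

      for line in out.splitlines():
          ref, date_str = line.rsplit(" ", 1)
          if datetime.fromisoformat(date_str) < CUTOFF:
              branch = ref.split("/", 1)[1]  # strip the "origin/" prefix
              print("deleting", branch)
              subprocess.run(["git", "push", "origin", "--delete", branch], check=True)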

2020-05-07

  • All Phabricator windows builds (at least build 35 through 39) on Agent1 seem to be failing with the same error.
    • Master branch on Agent4 is doing fine.
    • Took Agent1 offline and restarted build 39 so it gets executed on a different machine: build 41.
      • That build also failed on Agent4 with the same failing tests. So it's not the machine.
      • Agent1 does not have any obvious hanging processes, git reports working directory as clean.
    • Maybe a problem with path lengths again?
      • path length for failing file was ~120 characters, should also not be the problem...
      • But it was: when moving the builds to a shorter path, C:\ws\beta, the builds started passing again.
    • Linux builds are doing fine.
    • Some LLVM Windows builds are also failing, but for different reasons...
    • I can't find a bug in the infrastructure, so I'll keep observing this.
  • Buildkite.com is down because of database maintenance.
  • The buildable for D78743 was stuck for >20 hours.
    • Branch was created, but Linux and Windows Builds never returned results.
    • Phabricator log shows build was triggered.
    • Linux and Windows builds are nowhere to be found on Jenkins.
    • It looks like these jobs were lost in the Jenkins queue. It's the first time I've seen that happen.
  • Reinstalled cert-manager, as the installed version v0.10.1 was outdated and the cert for the Buildkite integration at http://build.llvm-merge-guard.org was not issued correctly. Build plans for pre-merge checks were disabled for ~30 minutes. full log
  • Re-enabled lldb project on Linux. The failing tests were fixed.

2020-05-06

  • investigated #176
  • re-configured Phabricator to now use the split patch/Linux pipelines, as we have them for BETA testers.
  • Looking at some statistics for the Windows builds with sccache enabled:
    • Moving from 16-core machines to 32 cores is not a big difference when caching works fine: for the 80th percentile of build times, the speedup is 1.26x.
    • When caching fails, 32 cores are clearly faster: for the 90th percentile of build times, the speedup is 2.08x.
    • VM pricing is linear in the number of cores: for the price of one 32-core machine, we can also have two 16-core machines.
    • As a conclusion, I would use 16-core machines and see how build times develop over longer periods of time.
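    • For reference, a minimal sketch in Python of how such a percentile comparison can be computed from two lists of build durations (the function is illustrative; the numbers above came from the actual Jenkins build history):
      # Sketch: ratio of the pct-th percentile build time, 16-core over 32-core.
      import statistics

      def speedup_at_percentile(times_16core, times_32core, pct):
          p16 = statistics.quantiles(times_16core, n=100)[pct - 1]
          p32 = statistics.quantiles(times_32core, n=100)[pct - 1]
          return p16 / p32

      # e.g. speedup_at_percentile(d16, d32, 80) gave ~1.26x on our data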

2020-05-05

  • Windows builds on master are failing because of MLIR, notified the MLIR build cops.
  • other builds look fine
  • not part of pre-merge testing, mentioning it here anyway:
    • mlir-nvidia build bot was offline for a week, nobody noticed/complained.
    • From the logs: buildbot agent has lost connection to the server and shut down.
    • However, systemd did not restart the service; I'm not sure why.
    • I manually started the service, drained the build queue and the agent is up and running again.

2020-05-04

  • Windows builds were failing on agent2 with an sccache timeout. The other agents are doing fine so far.
    • I've seen that before and wasn't able to reproduce it.
    • Reproducing the problem: log in to agent2, enter the container with docker exec -it agent-windows-jenkins powershell, and run:
      PS C:\ws> $Env:SCCACHE_DIR = "C:\\ws\\sccache"
      PS C:\ws> sccache --start-server
      Starting sccache server...
      error: Timed out waiting for server startup
      PS C:\ws> echo $LASTEXITCODE
      2
    • The problem does not occur when using an empty SCCACHE_DIR.
    • After deleting the SCCACHE_DIR dir, the problem disappears.
    • Enabling logging by setting the environment variable $Env:SCCACHE_ERROR_LOG="C:\ws\sccache.log" creates this in the log file: error: The system cannot find the file specified. (os error 2)
    • Analyzing with Process Monitor did not reveal what the problem might be.
    • I'll try this workaround: in run_ninja.py, check whether sccache can be started; if it fails and the error message contains "timeout", wipe the cache directory (a sketch of the idea follows at the end of this entry).
    • I created a bug report.
  • One Linux agent ran out of disk space; created issue 174 for this. It was the first time I saw that happening.
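  • A sketch of the sccache workaround idea mentioned above (not the actual run_ninja.py code): try to start the sccache server, and if it fails with a timeout, wipe the cache directory and retry once.
      # Sketch: if "sccache --start-server" times out, wipe SCCACHE_DIR and retry.
      # Illustrative only; not the real run_ninja.py.
      import os
      import shutil
      import subprocess

      def ensure_sccache_server():
          cmd = ["sccache", "--start-server"]
          result = subprocess.run(cmd, capture_output=True, text=True)
          if result.returncode != 0 and "Timed out" in (result.stdout + result.stderr):
              cache_dir = os.environ.get("SCCACHE_DIR", "")
              if cache_dir and os.path.isdir(cache_dir):
                  shutil.rmtree(cache_dir, ignore_errors=True)
              result = subprocess.run(cmd, capture_output=True, text=True)
          result.check_returncode()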