LLVM pre-merge tests operations blog
Updated k8s deployment from PR https://github.com/google/llvm-premerge-checks/pull/380:

- checked out the PR locally:

  ```shell
  git fetch origin pull/380/head:pr380
  git checkout pr380
  ```

- applied the changes from kubernetes/buildkite/linux-agents.yaml to kubernetes/buildkite/linux-agents-test.yaml
- built and deployed the docker image:

  ```shell
  sudo ./containers/build_deploy.sh buildkite-premerge-debian
  ```

- applied the changes to the linux-agents-test deployment:

  ```shell
  kubectl apply -f kubernetes/buildkite/linux-agents-test.yaml
  ```

- started a new build on "main" https://buildkite.com/llvm-project/llvm-main with additional parameters:

  ```
  ph_linux_agents={"queue": "linux-test"}
  ph_skip_windows=yes
  ph_skip_generated=yes
  ```

- connected manually to an agent to verify that buildkite uses the intended folder:

  ```shell
  kubectl exec -it linux-agents-test-7c7ddf5c88-w9t62 -n buildkite -- /bin/bash
  ```

- waited for the build to succeed: https://buildkite.com/llvm-project/llvm-main/builds/4121#7dd1104c-402d-4f98-8dca-8ccc4055607c
- merged the PR
- updated the image label to "stable" in the GCP interface
- updated the stable deployment:

  ```shell
  kubectl apply -f kubernetes/buildkite/linux-agents.yaml
  ```
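- For future runs of this procedure, a quick way to verify such a rollout between the `kubectl apply` and starting the build (the `app` label selector is an assumption about the manifest):

  ```shell
  # Wait until the test deployment has finished rolling out:
  kubectl rollout status deployment/linux-agents-test -n buildkite
  # List the test agent pods and confirm they are Running
  # ('app=linux-agents-test' is an assumed label):
  kubectl get pods -n buildkite -l app=linux-agents-test
  ```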
- Updated Windows instances: added a few n2d instances to replace the e2 instances, which were quite slow. Added 'machine_type' to the agent metadata.
- Updated Linux agent nodes. The biggest change: they no longer use an external SSD; instead, the boot disk itself is now an SSD. The node type changed from n1-standard-32 to e2-standard-32.
- Builds that are already running are now cancelled when a new diff is uploaded for the same D# revision. #278
- Added a separate 'service' queue to process lightweight tasks.
- Updated Phabricator rules to trigger builds on all revisions except known "other" repos. #263
- All Windows builds started failing on Friday.
  - error message:

    ```
    # Removing C:\ws\w16n2-1\llvm-project\llvm-master-build
    🚨 Error: Failed to remove "C:\ws\w16n2-1\llvm-project\llvm-master-build" (remove C:\ws\w16n2-1\llvm-project\llvm-master-build\build\.ninja_deps: The process cannot access the file because it is being used by another process.)
    ```
  - manually deleting the files also failed on buildkite-windows-16n2-1; it looks like some process still had open file handles
  - debugging with Process Explorer (`choco install procexp`) as Admin showed that two `ninja` processes were still running; I killed both of them. After that I could delete the folders. The open file handle was: `\Device\HarddiskVolume7\ws\w1\llvm-project\premerge-checks\build\tools\clang\test\ARCMT\`
  - I rebooted all machines. Rebooting failed to automatically restart the docker containers on buildkite-windows-32cpu-ssd-1; I restarted it with `windows_agent_start_buildkite.ps1`. The machines needed a few minutes to become visible in buildkite after a restart.
  - Since all Windows machines were affected, this seems to be caused by a recent change. So far it's unclear what caused this and whether it is solved now:
    - the last change to these tests was in August; the last change to the sources was in July
- Removed the Jenkins configuration, nodes, disk storage, etc. For reference: the last commit with the Jenkins configuration is tagged `jenkins`.
- Looked into the stability of the master branch, as that is crucial for the false-positive rate of pre-merge testing.
  - We had several long-term build breakages on master, some lasting for several days.
  - The most recent breakage, in Polly, got fixed.
  - I reverted https://reviews.llvm.org/D79219 to get the Windows build green again.
  - I suppose we need to actively monitor the master branch and revert commits that break it. However, it's probably easier to point to failing buildbots, as people are used to that.
  - The only proper solution is to have mandatory pre-merge testing in place, so that it becomes really hard to break the master branch. However, I do not see that happening with the current infrastructure and workflows.
  - With the goal of more stability on the master branch, I guess we need to look into options for automatically reverting commits that break master. But this means we need a solution where we can timely build and test every commit on master, so we know what to revert. With Windows builds taking ~1 hour, bisecting is not really an option for a fast response.
- Looking at the new metrics:
  - Since increasing the CPU and RAM limits, RAM usage has increased a bit, CPU usage has decreased, the time spent on garbage collection has decreased, GCP no longer lists any failed health checks, and the http errors have mostly disappeared.
  - So I suppose this really was a resource issue.
  - One `apply patch` step failed with connection issues to github. We should add a retry for that (see the sketch below).
  - Build infrastructure looks good.
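  - A minimal retry sketch in shell; `git fetch origin master` stands in for whatever network call the apply-patch step actually makes:

    ```shell
    # Re-run a flaky network step a few times before giving up:
    retry() {
      local n
      for n in 1 2 3; do
        "$@" && return 0
        echo "attempt $n failed: $*; retrying in 30s..." >&2
        sleep 30
      done
      return 1
    }

    retry git fetch origin master
    ```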
- Jenkins master was offline: "503 Service Temporarily Unavailable".
- GCP reports Jenkins as "unavailable".
- Builds on Phabricator failed since last night: https://reviews.llvm.org/harbormaster/build/68946/
- It seems to have given up on those builds.
- Jenkins logs did not contain anything insightful:

  ```shell
  kubectl logs <pod name> --namespace jenkins | less
  ```
- I killed the master process and let k8s restart it. The master is back online and immediately started building something.
- I just saw another two instances of the `503 Service Temporarily Unavailable` message, 4 hours after fixing the last one.
  - The event log of the pod shows >4400 errors of type `Readiness probe failed: HTTP probe failed with statuscode: 503` in the last month, and 30 restarts. So something is definitely broken there :(
  - And after waiting for 2 minutes, it recovered automatically.
  - As we have "only" 30 restarts for >4400 failed readiness probes, the service seems to recover most of the time.
  - Hypothesis: Jenkins is sometimes too slow to respond to the health check, so the probe fails.
  - So I installed a server monitoring plugin: https://jenkins.llvm-merge-guard.org/monitoring. Maybe this gives us some data on whether we have a performance issue on our hands.
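  - If the hypothesis holds, giving the probe more slack should cut down the failure count. A possible sketch with `kubectl patch`; the deployment name and container layout are assumptions (check with `kubectl get deployments -n jenkins` first):

    ```shell
    # Relax the readiness probe: allow 10s per probe and 6 consecutive
    # failures before the pod is marked unready ('jenkins-master' is an
    # assumed deployment name):
    kubectl patch deployment jenkins-master -n jenkins --type=json -p='[
      {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10},
      {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 6}
    ]'
    ```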
- Jenkins metrics show lots of 503 errors, CPUs maxing out at 50%, RAM usage at 75%.
- http requests take up to 110 sec!
- I increased CPU and RAM limits in Kubernetes, we still have resources available on our default node pool.
- We could also switch to a "Persistent SSD" for the Jenkins home directory. This should speed up IO.
- I had to cancel a few Windows builds as they somehow got stuck as part of the master restart.
- Got feedback on b/157393846:
- There are different APIs for storage access. For anonymous access we're supposed to use: https://storage.googleapis.com/llvm-premerge-checks/results/amd64_windows_visualstudio-530/console-log.txt
- This works on my machine in Incognito Mode in Chrome. I replied on #187 asking the user to check it as well.
- If that works, we just need to update the URLs in the reports we're creating.
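- The check is also scriptable, so we don't have to rely on a browser's private mode (URL from above):

  ```shell
  # Fetch the object anonymously; "200" means public access works,
  # "401"/"403" means a Google account is still required:
  curl -s -o /dev/null -w "%{http_code}\n" \
    "https://storage.googleapis.com/llvm-premerge-checks/results/amd64_windows_visualstudio-530/console-log.txt"
  ```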
- Builds on Jenkins are looking good; I did not see any infrastructure issues there.
- Windows build logs are not accessible any more #187
- If you now open the URL from a browser in private mode, you are required to log in with a Google account. This is not the intention! I am very sure that this used to work before.
- I double checked the GCS bucket configuration against the documentation and everything looks good.
- The bucket is also listed as "Public to internet".
- I created a ticket for that: b/157393846
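- A bucket-side check that would complement the browser test (bucket name taken from the result URLs):

  ```shell
  # Show the bucket's IAM policy; a bucket that is "Public to internet"
  # should bind "allUsers" to roles/storage.objectViewer:
  gsutil iam get gs://llvm-premerge-checks
  ```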
- Checked on the `daily-cleanup` job; it seems to be doing fine. We have around 1,900 open branches on the github repo.
- Checked on the build times (Jenkins only):
  - Linux: 11-40 minutes
  - Windows: 17-80 minutes
  - The slowest build there was genuinely that slow: 50 min `ninja all`, 26 min `ninja check-all`. The queuing time was only 3 secs, and sccache was enabled. So to speed up this job, we need to speed up compile times.
- Fixed a missing dependency for libc in 'llvm-dependencies.yaml'.
- No major issues.
- To further optimize the tests we would need better metrics and monitoring:
  - We sometimes have long queuing times for `apply_patch` in Jenkins; I've seen up to 1 hour of queuing.
    - Maybe we should add 2 `n2-standard-1` machines for these small jobs to take the load off the large build machines and get them done quickly. This would also be useful on Buildkite. I'm not sure what the IO performance of such a machine with a persistent SSD instead of the local SSD would be.
  - I enabled timestamps in the build logs a few days back, so now we can extract the queuing times from these build logs (time between `Waiting for next available.*` and `Running on .*`); a sketch for extracting them follows this list:

    ```
    00:00:16.413 Waiting for next available executor on ‘linux’
    01:18:15.398 Running on agent-debian-testing-ssd-76469c58dd-xm76d-20d26a14 in /mnt/disks/ssd0/agent/workspace/apply_patch
    ```

  - There were also 2 timeouts (1, 2) on Linux builds. We should also create a metric for this. In both cases the tests never finished:

    ```
    Cancelling nested steps due to timeout
    ```
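  - The sketch mentioned above, for pulling the queuing time out of a single build log; `build.log` is a placeholder name, and the HH:MM:SS.mmm prefixes are as shown:

    ```shell
    # Compute the delta between the "Waiting for next available" and
    # "Running on" timestamps of one build log:
    grep -E "Waiting for next available|Running on " build.log |
    awk '{ split($1, t, "[:.]"); secs = t[1]*3600 + t[2]*60 + t[3]
           if (NR == 1) start = secs
           else print "queued for", secs - start, "seconds" }'
    ```

    For the sample above this prints `queued for 4679 seconds`, i.e. the ~1h18m of queuing.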
- Servers seem to be doing fine, no problems so far. This is the first day in quite some time without new Windows issues...
- Windows build times (including queuing) are between 15 and 100 minutes.
- Maybe we should add one or two more machines.
- I'll keep observing this for some more time.
- fixed #181
- lldb tests are failing on Linux, while they pass on my workstation. Disabled lldb again.
- full log
- Pinged Eric and Pavel on what to do.
- Some Windows agents were failing while uploading results to GCS, as I forgot to copy the `.boto` file to `c:\credentails`.
  - The file is "hidden" and is not copied by default in the Windows explorer.
  - So I added the file and restarted the docker containers.
  - Also triggered `master_windows_visualstudio` builds on all machines to check the results.
- The new daily-cleanup job worked well over the weekend.
- Created some charts of peak hours and days based on the recent builds.
  - Saturday and Sunday are quite slow; Tuesday is the busiest day.
  - It would definitely make sense to scale the number of workers, either based on the build load or on the day of the week.
  - I looked into scaling the number of agents, but I would wait with that until we've moved to Buildkite; see #70 for details. A sketch of schedule-based scaling follows below.
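  - For reference, schedule-based scaling could be as small as two cron entries resizing the agent deployment; the deployment name, namespace, and replica counts here are assumptions:

    ```shell
    # Hypothetical crontab (m h dom mon dow command): scale agents up
    # for the work week, down for the weekend:
    0 6  * * 1   kubectl scale deployment/agent-debian -n jenkins --replicas=8
    0 20 * * 5   kubectl scale deployment/agent-debian -n jenkins --replicas=2
    ```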
- Enabled Windows builds for all users and sent out email.
- preparing for the Windows rollout to everyone
  - Production has about 120 builds per day, beta testing around 40. So the difference is 3x.
  - We now have 6 Windows agents for Jenkins (numbers 1,2,4,5,6,7).
    - configuration of these: n1-standard-16 (16 vCPUs, 60 GB memory) with a local scratch SSD
    - Agent 3 is working for Buildkite, still using the old setup with 32 cores.
  - I hope this is enough to cover the additional Windows build load.
  - We have 2 sets of `Jenkinsfile`s and build jobs, one for beta testing and one for everyone. The ones for beta testing have the prefix `BETA`. This way we can roll out changes separately to beta testing and production.
  - On top of that, we have one build job each for Linux and Windows, building the master branch every 4 hours.
  - I cleaned up the old build jobs and `Jenkinsfile`s.
- deployed a new pipeline `daily-cleanup` to delete branches older than 30 days (sketched below)
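  - roughly what the pipeline does, as a shell sketch (the actual job may differ, and it must skip long-lived branches like master before deleting anything):

    ```shell
    # Delete remote branches whose last commit is older than 30 days.
    # Sketch only: a real run needs a filter for protected branches.
    cutoff=$(date -d '30 days ago' +%s)   # GNU date
    git for-each-ref --format='%(refname:short) %(committerdate:unix)' refs/remotes/origin |
    while read -r branch stamp; do
      if [ "$stamp" -lt "$cutoff" ]; then
        git push origin --delete "${branch#origin/}"
      fi
    done
    ```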
- All Phabricator Windows builds (at least builds 35 through 39) on Agent1 seem to be failing with the same error.
- Master branch on Agent4 is doing fine.
- Took Agent1 offline and restarted build 39 so it gets executed on a different machine: build 41.
- That build also failed on Agent4 with the same failing tests. So it's not the machine.
- Agent1 does not have any obvious hanging processes, git reports working directory as clean.
- Maybe a problem with path lengths again?
- path length for failing file was ~120 characters, should also not be the problem...
- But it was. When the builds were moved to the shorter path `C:\ws\beta`, they started passing again.
- Linux builds are doing fine.
- Some LLVM Windows build are also failing, but for different reasons...
- I can't find a bug in the infrastructure, so I'll keep observing this.
- Buildkite.com is down because of database maintenance.
- Buildable for D78743 stuck for >20 hours.
- Branch was created, but Linux and Windows Builds never returned results.
- Phabricator log shows build was triggered.
- Linux and Windows builds are nowhere to be found on Jenkins.
- It looks like these jobs were lost in the Jenkins queue. It's the first time I've seen that happen.
- reinstalled cert-manager, as the current version v0.10.1 was outdated and the cert for the buildkite integration at http://build.llvm-merge-guard.org was not issued correctly. Build plans for premerge checks were disabled for ~30 minutes. full log
- Re-enabled lldb project on Linux. The failing tests were fixed.
- investigated #176
- re-configured Phabricator to now use the split patch/Linux pipelines, as we have them for BETA testers.
- re-configured the Harbormaster plan
- Windows is not yet enabled for non-beta testers.
- first builds are getting triggered:
- disabled the old pipeline, but did not yet delete it so we can turn it back on
- updated the user documentation
- Looking at some statistics for the Windows builds with sccache enabled:
  - Moving from 16-core machines to 32 cores does not make a big difference when caching works fine: at the 80th percentile of build times, the speedup is 1.26x.
  - When caching fails, 32 cores are clearly faster: at the 90th percentile of build times, the speedup is 2.08x.
  - VM pricing is linear in the number of cores: for the price of one 32-core machine, we can also have two 16-core machines.
  - As a conclusion, I would use 16-core machines and see how build times develop over longer periods of time.
- Windows builds on master are failing because of MLIR, notified the MLIR build cops.
- other builds look fine
- not part of pre-merge testing, but mentioning it here anyway:
  - the mlir-nvidia build bot was offline for a week; nobody noticed/complained.
  - From the logs: the buildbot agent lost its connection to the server and shut down.
  - However, systemd did not restart the service; I am not sure why.
  - I manually started the service and drained the build queue; the agent is up and running again.
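  - A possible fix on the systemd side, so the agent comes back by itself next time (the service name `buildbot-worker` is an assumption):

    ```shell
    # Why did the service stop, and is any Restart= policy configured?
    systemctl status buildbot-worker
    # A clean shutdown does not count as a failure, so Restart=on-failure
    # would not help here; a drop-in override with Restart=always would:
    sudo systemctl edit buildbot-worker
    #   [Service]
    #   Restart=always
    #   RestartSec=60
    sudo systemctl restart buildbot-worker
    ```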
- Windows builds were failing on agent2 with an sccache timeout. The other agents are doing fine so far.
  - I've seen that before and wasn't able to reproduce it.
  - reproducing the problem: log in to agent2, `docker exec -it agent-windows-jenkins powershell` into the container and run:

    ```
    PS C:\ws> $Env:SCCACHE_DIR = "C:\\ws\\sccache"
    PS C:\ws> sccache --start-server
    Starting sccache server...
    error: Timed out waiting for server startup
    PS C:\ws> echo $LASTEXITCODE
    2
    ```

  - The problem does not occur when using an empty `SCCACHE_DIR`.
  - After deleting the `SCCACHE_DIR` directory, the problem disappears.
  - Enabling logging by setting the environment variable `$Env:SCCACHE_ERROR_LOG="C:\ws\sccache.log"` produces this in the log file:

    ```
    error: The system cannot find the file specified. (os error 2)
    ```

  - Analyzing with Process Monitor did not reveal what the problem might be.
  - I'll try this workaround (sketched below): in `run_ninja.py`, check if sccache can be started; if it fails and the error message contains "timeout", wipe the cache directory.
  - I created a bug report.
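  - The workaround itself belongs in `run_ninja.py`; as a shell sketch of the same logic, using the paths from the reproduction above:

    ```shell
    # Try to start the sccache server; if startup times out, wipe the
    # cache directory and try once more (sketch of the planned logic,
    # not the actual run_ninja.py change):
    export SCCACHE_DIR='C:\ws\sccache'
    if out=$(sccache --start-server 2>&1); then
      echo "sccache server started"
    elif printf '%s\n' "$out" | grep -qi 'timed out'; then
      echo "startup timed out; wiping $SCCACHE_DIR and retrying"
      rm -rf "$SCCACHE_DIR"
      sccache --start-server
    fi
    ```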
- One Linux agent ran out of disk space; created issue 174 for this. It was the first time I saw that happening.