Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ci): handle disk mounting and logs reading edge-cases #7690

Merged
merged 24 commits into from
Oct 9, 2023
Merged

Conversation

gustavovalverde
Copy link
Member

@gustavovalverde gustavovalverde commented Oct 8, 2023

Motivation

Some tests are still failing with tee: 'standard output': Broken pipe, recently it has been more frequently with the sync-to-checkpoint test, but it could happen on other tests too.

Also, mounting errors are still happening, and we'd like to capture the dmesg when this happens at launch, for better debugging. Other options like set -x will also help debugging and replicating some commands locally.

Edit:
In the end we had this different outputs for disks from the same VM:

NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda         8:0    0  300G  0 disk 
sdb         8:16   0   50G  0 disk 
|-sdb1      8:17   0 45.8G  0 part /var/lib/toolbox
|                                  /var/lib/google
|                                  /var/lib/docker
|                                  /var/lib/containerd
|                                  /var
|                                  /home
|                                  /mnt/stateful_partition
|-sdb2      8:18   0   16M  0 part 
|-sdb3      8:19   0    2G  0 part 
| `-vroot 253:0    0  1.9G  1 dm   /
|-sdb4      8:20   0   16M  0 part 
|-sdb5      8:21   0    2G  0 part 
|-sdb6      8:22   0  512B  0 part 
|-sdb7      8:23   0  512B  0 part 
|-sdb8      8:24   0   16M  0 part /usr/share/oem
|-sdb9      8:25   0  512B  0 part 
|-sdb10     8:26   0  512B  0 part 
|-sdb11     8:27   0    8M  0 part 
`-sdb12     8:28   0   32M  0 part 
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda         8:0    0   50G  0 disk 
|-sda1      8:1    0 45.8G  0 part /var/lib/toolbox
|                                  /var/lib/google
|                                  /var/lib/docker
|                                  /var/lib/containerd
|                                  /var
|                                  /home
|                                  /mnt/stateful_partition
|-sda2      8:2    0   16M  0 part 
|-sda3      8:3    0    2G  0 part 
| `-vroot 253:0    0  1.9G  1 dm   /
|-sda4      8:4    0   16M  0 part 
|-sda5      8:5    0    2G  0 part 
|-sda6      8:6    0  512B  0 part 
|-sda7      8:7    0  512B  0 part 
|-sda8      8:8    0   16M  0 part /usr/share/oem
|-sda9      8:9    0  512B  0 part 
|-sda10     8:10   0  512B  0 part 
|-sda11     8:11   0    8M  0 part 
`-sda12     8:12   0   32M  0 part 
sdb         8:16   0  300G  0 disk 

Based on the preceding output, we can see that GCP is mounting the disks in different order on each VM. This is why the /dev/sda and /dev/sdb devices are not consistent across the VMs, nor across reboots on the same VM.

Fixes #7564
Fixes #7659
Fixes #7614

Solution

For tee: 'standard output': Broken pipe:

  • Use shell: /usr/bin/bash -exo pipefail {0} in GitHub jobs, for tests deployed to GCP
  • Temporarily disable "set -e" to handle the broken pipe error gracefully
  • Handle the exit code a bit better when reading from grep and docker wait
  • Use trap quotes correctly
  • Standardized --command quoting for GCP tests

For failed to mount local volume: mount /dev/sdb device or resource busy:

To work around this we're extracting the disk to be use, based on the device name we set in previous GitHub Action steps.

# Extract the correct disk name based on the device-name
DISK_NAME=$(ls -l /dev/disk/by-id | grep -oE "google-${{ inputs.test_id }}-${{ env.GITHUB_SHA_SHORT }} -> ../../[^ ]+" | grep -oE "/[^/]+$" | cut -c 2-); \
echo "Disk name: $DISK_NAME";

Review

  • Confirm this is working with sync-to-checkpoint test.
  • Make the tests fail and confirm we're getting the required EXIT_STATUS code.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

@gustavovalverde gustavovalverde added C-bug Category: This is a bug A-devops Area: Pipelines, CI/CD and Dockerfiles C-enhancement Category: This is an improvement P-Critical 🚑 I-integration-fail Continuous integration fails, including build and test failures labels Oct 8, 2023
@gustavovalverde gustavovalverde self-assigned this Oct 8, 2023
@github-actions github-actions bot added the C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG label Oct 8, 2023
Temporarily disabled the `set -e` option around the docker logs command to handle the broken pipe error gracefully.

Handle more complex scenarios in our `Result of ${{ inputs.test_id }} test` job
@teor2345
Copy link
Contributor

teor2345 commented Oct 8, 2023

The sync-to-checkpoint test won't run any more because we have a checkpoint disk now. So it won't run on PRs or in the scheduled jobs.

I think the tee error might be caused by set -o pipefail, or the input or output processes exiting early. Either way, replacing tee… with (tee … || true) should fix it. (That runs tee in a subshell and ignores its exit status.)

There might also be arguments we can pass to tee to change its exit status, like --mode=warn-nopipe:
https://www.man7.org/linux/man-pages/man1/tee.1.html

@gustavovalverde gustavovalverde changed the title fix(ci): do not use tee to read container logs fix(ci): handle tests launch and run edge-cases Oct 8, 2023
@gustavovalverde
Copy link
Member Author

@teor2345 yeah, at least I was able to capture the errors before the checkpoint disk got fully build and the job started to get skipped.

@gustavovalverde gustavovalverde marked this pull request as ready for review October 8, 2023 23:15
@gustavovalverde gustavovalverde requested a review from a team as a code owner October 8, 2023 23:15
@gustavovalverde gustavovalverde requested review from arya2 and removed request for a team October 8, 2023 23:15
@teor2345
Copy link
Contributor

teor2345 commented Oct 8, 2023

@teor2345 yeah, at least I was able to capture the errors before the checkpoint disk got fully build and the job started to get skipped.

Let's see if the error happens on other jobs? If it doesn't then maybe we can lower the priority of this PR and ticket.

@gustavovalverde
Copy link
Member Author

@teor2345
Copy link
Contributor

teor2345 commented Oct 8, 2023

It might be worth trying tee --mode=warn-nopipe ... or (tee … || true) if you're looking for a quick fix.

@teor2345
Copy link
Contributor

teor2345 commented Oct 9, 2023

It might be worth trying tee --mode=warn-nopipe ...

It looks like --mode=exit-nopipe doesn't work, because we need tee to exit when grep finds something, and it's not doing that here:
https://github.com/ZcashFoundation/zebra/actions/runs/6450645551/job/17510271960?pr=7690#step:4:291

That means all the other modes won't work either.

But (tee … || true) will definitely ignore the exit status.

@gustavovalverde
Copy link
Member Author

This should be good to go

teor2345
teor2345 previously approved these changes Oct 9, 2023
Copy link
Contributor

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's a lot better.

mergify bot added a commit that referenced this pull request Oct 9, 2023
@teor2345
Copy link
Contributor

teor2345 commented Oct 9, 2023

Unfortunately this PR failed in the merge queue due to issue #7659, so it might help to focus on fixing that issue first.

@gustavovalverde
Copy link
Member Author

Yeah, I'll now focus on #7659

@teor2345
Copy link
Contributor

teor2345 commented Oct 9, 2023

Yeah, I'll now focus on #7659

I've tried another way to un-block that issue in PR #7703: repeating the failing job until it works. It's not a long-term solution, but it would let us merge other PRs while we're fixing it.

@gustavovalverde gustavovalverde changed the title fix(ci): handle tests launch and run edge-cases fix(ci): handle disk mounting and logs reading edge-cases Oct 9, 2023
Copy link
Contributor

@arya2 arya2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.github/workflows/deploy-gcp-tests.yml Show resolved Hide resolved
@mergify mergify bot merged commit 8d0a17e into main Oct 9, 2023
142 checks passed
@mergify mergify bot deleted the ref-docker-log branch October 9, 2023 18:00
@teor2345
Copy link
Contributor

teor2345 commented Oct 9, 2023

Looks good to me, sync-to-checkpoint test passed here: ZcashFoundation/zebra/actions/runs/6455786703/job/17524400741

That's the sync-past-checkpoint test, this PR also needs a full sync to verify it works. @gustavovalverde did you want to run a full sync now, or wait until Friday?

@gustavovalverde
Copy link
Member Author

@teor2345 last Friday it failed, so I triggered a new one here:

https://github.com/ZcashFoundation/zebra/actions/runs/6460865308

@gustavovalverde
Copy link
Member Author

Running https://github.com/ZcashFoundation/zebra/actions/runs/6460865308 failed, as that's basically the only test not needing a cached state and thus the only one which was not tested in this PR 😞

It should be fixed now in:

I'm running a Full sync from that branch to confirm: https://github.com/ZcashFoundation/zebra/actions/runs/6461129131

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles C-bug Category: This is a bug C-enhancement Category: This is an improvement C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG I-integration-fail Continuous integration fails, including build and test failures
Projects
None yet
3 participants