Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ci): Split Docker logs into sprout, other checkpoints, and full validation #4704

Merged
merged 14 commits into from
Jun 30, 2022

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Jun 28, 2022

Motivation

We want to avoid the 6 hour job limit for GitHub actions by splitting CI jobs.

Depends-On: #4690, #4694

Close #4661

Solution

Split the log following CI job into:

  • sprout checkpoints (ends on sapling activation)
  • sapling/orchard checkpoints (ends at the final checkpoint)
  • full validation (ends at the end of the test)

Some tests don't do a full sync, so all of the log jobs also end if the test has ended.

Review

This PR blocks a bunch of other PRs, so I've marked it as a critical hotfix.

The diff on this PR will show extra code until PRs #4690 and #4694 merge.

Reviewer Checklist

  • Test job structure is correct
  • Full sync consistently passes without timing out
  • Failing jobs still fail

@teor2345 teor2345 added A-devops Area: Pipelines, CI/CD and Dockerfiles C-enhancement Category: This is an improvement P-Critical 🚑 I-integration-fail Continuous integration fails, including build and test failures labels Jun 28, 2022
@teor2345 teor2345 requested a review from a team as a code owner June 28, 2022 00:25
@teor2345 teor2345 self-assigned this Jun 28, 2022
@teor2345 teor2345 requested review from a team as code owners June 28, 2022 00:25
@teor2345 teor2345 requested review from gustavovalverde and oxarbitrage and removed request for a team June 28, 2022 00:25
@teor2345

This comment was marked as outdated.

@codecov
Copy link

codecov bot commented Jun 28, 2022

Codecov Report

Merging #4704 (569f32e) into main (c8cdf06) will increase coverage by 0.07%.
The diff coverage is 1.07%.

❗ Current head 569f32e differs from pull request most recent head e00636c. Consider uploading reports for the commit e00636c to get more accurate results

@@            Coverage Diff             @@
##             main    #4704      +/-   ##
==========================================
+ Coverage   78.86%   78.94%   +0.07%     
==========================================
  Files         306      306              
  Lines       37552    37498      -54     
==========================================
- Hits        29617    29601      -16     
+ Misses       7935     7897      -38     

@teor2345 teor2345 removed the request for review from a team June 28, 2022 03:33
@teor2345

This comment was marked as outdated.

@teor2345 teor2345 changed the title fix(ci): Split Google Cloud logs into sprout, other checkpoints, and full validation fix(ci): Split Docker logs into sprout, other checkpoints, and full validation Jun 28, 2022
@conradoplg
Copy link
Collaborator

conradoplg commented Jun 28, 2022

The command to show the output and grep it at the same time was wrong.

Here's a new full sync and checkpoint job: https://github.com/ZcashFoundation/zebra/actions/runs/2573686935

We need to check that all the log jobs run, finish, and succeed.

That failed with Process completed with exit code 141.. I think that when grep quits, tee attempts to write to a closed pipe and finishes with an error, making the entire script fail because of the pipefail. I remove the pipefails in a7ee37b and I'm trying a full sync in https://github.com/ZcashFoundation/zebra/actions/runs/2576925253

@conradoplg
Copy link
Collaborator

I think I launched that incorrectly... trying again in https://github.com/ZcashFoundation/zebra/actions/runs/2577580881

@teor2345 teor2345 marked this pull request as draft June 28, 2022 20:49
@teor2345 teor2345 removed the request for review from oxarbitrage June 28, 2022 22:47
@teor2345
Copy link
Contributor Author

teor2345 commented Jun 29, 2022

The command to show the output and grep it at the same time was wrong.
Here's a new full sync and checkpoint job: https://github.com/ZcashFoundation/zebra/actions/runs/2573686935
We need to check that all the log jobs run, finish, and succeed.

That failed with Process completed with exit code 141.. I think that when grep quits, tee attempts to write to a closed pipe and finishes with an error, making the entire script fail because of the pipefail. I remove the pipefails in a7ee37b and I'm trying a full sync in https://github.com/ZcashFoundation/zebra/actions/runs/2576925253

Thanks for fixing this up!

I'd like to keep pipefail, so that the test fails if the docker logs command fails.

tee --output-error=exit-nopipe should ignore pipe errors, but fail writing to /dev/stderr.
(Just in case we get that path wrong, or writing fails.)

@teor2345 teor2345 marked this pull request as ready for review June 29, 2022 00:41
@teor2345

This comment was marked as outdated.

@gustavovalverde
Copy link
Member

I'm reviewing this, I'm not completely sure yet. I'll block it until I'm sure of the process and outcome

@gustavovalverde gustavovalverde added the do-not-merge Tells Mergify not to merge this PR label Jun 29, 2022
teor2345 and others added 12 commits June 30, 2022 08:09
But put TODOs where we might be able to skip checkouts
This reverts commit a7ee37b.
We had this issue before, I can't recall if this was a parsing error between GitHub Actions and gcloud `--command` parsing, but we had to change this into two pieces.

This implementation keeps it how we did it before https://github.com/ZcashFoundation/zebra/blob/9b9578c99975952a291006dde8d2828fd3e97799/.github/workflows/test.yml#L235-L243
We can't remove  `actions/checkout` nor set `create_credentials_file` to `false` as next steps won't be able to authenticate to GCP.

We can surely remove `actions/checkout` and leave `create_credentials_file` as `true`, but this will raise a warning on each step, and there's no benefit of doing so.
@teor2345
Copy link
Contributor Author

teor2345 commented Jun 29, 2022

We're running:

exit "Generating ssh key ..."

Which causes an error.

I've just pushed a change to print that output, so I can see which parts of it are the exit code.

@teor2345
Copy link
Contributor Author

I think we can fall back to docker inspect if docker wait doesn't work.
(docker wait has known bugs.)

If that doesn't work, we could also try docker events, but it only keeps the last 1000 events.

Currently I'm failing the test if docker can't find the container.
But we could also make the test succeed for now, and rely on PR #4710 to check the test logs for failures.

Here's a full tip and checkpoint sync to test it all out:
https://github.com/ZcashFoundation/zebra/actions/runs/2587061833

@teor2345
Copy link
Contributor Author

For some reason, the job status isn't showing on this PR, but you can see it here:
https://github.com/ZcashFoundation/zebra/actions/runs/2587046986

@teor2345
Copy link
Contributor Author

I think we have the behaviour we want here.

Passing tests pass the workflow:
https://github.com/ZcashFoundation/zebra/runs/7123703714

Failing tests fail the workflow, and dependent jobs:
https://github.com/ZcashFoundation/zebra/runs/7124420969
(The sync bug causing this failure will be fixed by ticket #4715.)

Cancellations stop most jobs, but still do cleanup jobs:
https://github.com/ZcashFoundation/zebra/actions/runs/2586943058
(I cancelled the lightwalletd tip job.)

And the tip and checkpoint images get created:
https://github.com/ZcashFoundation/zebra/actions/runs/2587061833

I haven't been able to confirm that docker wait || docker inspect works, because I haven't seen docker wait fail on the most recent commit. If the fix in this PR doesn't work, PR #4710 checks the test status using the test logs (and we can just ignore the unreliable docker wait status).

I'd like to do any other fixes in separate PRs.

@teor2345
Copy link
Contributor Author

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Jun 30, 2022

update

✅ Branch has been successfully updated

@gustavovalverde gustavovalverde removed the do-not-merge Tells Mergify not to merge this PR label Jun 30, 2022
@mergify mergify bot merged commit 67dc26f into main Jun 30, 2022
@mergify mergify bot deleted the ci-nu-split branch June 30, 2022 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles C-enhancement Category: This is an improvement I-integration-fail Continuous integration fails, including build and test failures
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Split full sync test into checkpoint and full validation jobs
3 participants