Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(docker): Stop resetting the cargo-chef cache in the Dockerfile #6934

Merged
merged 12 commits into from
Jun 22, 2023

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Jun 13, 2023

Motivation

We're getting very slow build speeds with the Docker cache, as if it isn't working at all.

Reference / Solution

A cargo-chef user recommends removing COPY . . commands, because they overwrite the cache:
LukeMathWalker/cargo-chef#180 (comment)

This solution isn't complete, because we still need to reset the cargo-chef changes somehow. rsync can do this reset, only updating the timestamps of the changed files:
https://serverfault.com/questions/211005/rsync-difference-between-checksum-and-ignore-times-options
https://www.man7.org/linux/man-pages/man1/rsync.1.html

It isn't possible to do this with Dockerfile commands, due to a Docker bug:
https://stackoverflow.com/questions/36553502/is-there-a-way-to-add-only-changed-files-to-a-docker-image-as-a-new-layer-with#36561768

Testing

I checked that these commands worked locally, and only updated the timestamps on the files modified by cargo-chef.

Review

This is important but not a release blocker.

We might want to merge it separately to PR #6933, so we can see the cache speed improvement. The full speed improvement won't be visible until this PR merges to main.

Reviewer Checklist

  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
  • How do you know it works? Does it have tests?

@teor2345 teor2345 added C-bug Category: This is a bug A-devops Area: Pipelines, CI/CD and Dockerfiles P-High 🔥 I-slow Problems with performance or responsiveness do-not-merge Tells Mergify not to merge this PR C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG labels Jun 13, 2023
@teor2345 teor2345 requested a review from a team as a code owner June 13, 2023 07:36
@teor2345 teor2345 self-assigned this Jun 13, 2023
@teor2345 teor2345 requested review from dconnolly and removed request for a team June 13, 2023 07:36
Copy link
Member

@gustavovalverde gustavovalverde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very interesting fix. And something new for me to consider as I've built several Dockerfile and I don't recall this being an issue. But that's commonly the case as the build tools create new files, and do not modify existing ones.

In any case, this was an oversight anyways as the COPY command was already used in the first stage, and we're inheriting the whole stage and not specific output files from it.

@teor2345 teor2345 changed the title fix(docker): Stop overwriting the cargo-chef cache in the Dockerfile fix(docker): Stop resetting the cargo-chef cache in the Dockerfile Jun 13, 2023
@teor2345
Copy link
Contributor Author

This is a very interesting fix. And something new for me to consider as I've built several Dockerfile and I don't recall this being an issue. But that's commonly the case as the build tools create new files, and do not modify existing ones.

Yep, it's very unexpected. cargo-chef has to do it this way because cargo uses file timestamps for caching.

We still need to reset the files changed by cargo-chef, I did that using git because it detects changes and updates timestamps correctly.

@teor2345 teor2345 requested a review from a team as a code owner June 13, 2023 22:08
@teor2345
Copy link
Contributor Author

Temporary network error:

buildx failed with: ERROR: failed to solve: failed to push us-docker.pkg.dev/zfnd-dev-zebra/zebra/zebrad-test:pr-6934: failed to copy: io: read/write on closed pipe

@teor2345 teor2345 requested a review from gustavovalverde June 14, 2023 04:45
Copy link
Member

@gustavovalverde gustavovalverde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the last changes including the .git directory, I do not feel as comfortable as before, I think it's too hacky and we might find a better approach, like using Docker mount caches.

I'll go through the documentation you shared and validate each option https://github.com/ZcashFoundation/zebra/pull/6934/files#diff-2f754321d62f08ba8392b9b168b83e24ea2852bb5d815d63e767f6c3d23c6ac5R31-R32

@dconnolly dconnolly removed the do-not-merge Tells Mergify not to merge this PR label Jun 14, 2023
@teor2345
Copy link
Contributor Author

teor2345 commented Jun 14, 2023

After the last changes including the .git directory, I do not feel as comfortable as before, I think it's too hacky and we might find a better approach, like using Docker mount caches.

I don't think any purely Docker-based solution will work. Lots of people have tried, but they are still running into this known Docker bug:
https://stackoverflow.com/questions/36553502/is-there-a-way-to-add-only-changed-files-to-a-docker-image-as-a-new-layer-with#36561768

I have replaced git with COPY and rsync.

@teor2345 teor2345 requested review from gustavovalverde and removed request for dconnolly and a team June 14, 2023 21:29
Base automatically changed from fix-dockerfile-cache-use to main June 14, 2023 22:15
@teor2345
Copy link
Contributor Author

The testnet failure is a known issue related to some ongoing testing, we'll fix it in #6936.

@teor2345 teor2345 dismissed gustavovalverde’s stale review June 18, 2023 22:10

Requested changes were made

Copy link
Member

@gustavovalverde gustavovalverde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we measuring the caching improvements in this PR? I'd also all the ls commands from the Dockerfile

@teor2345
Copy link
Contributor Author

How are we measuring the caching improvements in this PR?

The full speed improvement won't be visible until this PR merges to main. So I suggest we merge it, then check the build speed on a Rust PR that doesn't change any dependencies.

Currently it takes around 70 minutes to do a full rebuild, which is the same as the Dockerfile in the main branch.

The best speed we can achieve rebuilding Rust changes is 42 minutes, because that's how long it takes to rebuild the test binaries (and the caching layer only caches dependencies, not binaries):
https://github.com/ZcashFoundation/zebra/actions/runs/5273943072/jobs/9604406765?pr=6934#step:10:1521

If this PR makes things slower we can fix the Dockerfile so useless steps aren't being re-run, or revert this PR.

I'd also all the ls commands from the Dockerfile

I was using them to check whether caching is active, but sure.

@teor2345
Copy link
Contributor Author

Failed due to temporary error #7015, I'll restart the jobs.

@teor2345 teor2345 dismissed gustavovalverde’s stale review June 19, 2023 23:55

Requested changes were made

@teor2345
Copy link
Contributor Author

This is not a high priority, not sure why I thought it was.

Copy link
Member

@gustavovalverde gustavovalverde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

mergify bot added a commit that referenced this pull request Jun 22, 2023
@mergify mergify bot merged commit 8861de6 into main Jun 22, 2023
@mergify mergify bot deleted the fix-docker-cache-disable branch June 22, 2023 04:16
@teor2345
Copy link
Contributor Author

This appears to have reduced the build time by about 14 minutes:

PR before:
https://github.com/ZcashFoundation/zebra/actions/runs/5329582704/jobs/9655409075

PR after:
https://github.com/ZcashFoundation/zebra/actions/runs/5347717722/jobs/9696610852

That's good, but there might be more we can do here to improve caching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles C-bug Category: This is a bug C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG I-slow Problems with performance or responsiveness
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants