
Rebuild occurring every time in dockerized build, with cache #4493

Closed
pwaller opened this issue Jan 8, 2019 · 10 comments

Comments

@pwaller

pwaller commented Jan 8, 2019

General summary/comments (optional)

For me, stack often seems to rebuild things when it seems as though it shouldn't. See, for example, the behavior described by someone else in #4490. In this issue I'm describing something a bit different: here the project (and subprojects) are being rebuilt, not the dependencies.

I can only apologise that I don't have the time to make a straightforward reproducer. I have tried my best to rule out things which might interfere; the most obvious candidate would be, say, an editor/IDE which modifies files and causes a rebuild.

This could well be user error on my part, but it is difficult to understand what's going wrong. I'm inexperienced in the Haskell ecosystem. I've only been maintaining this medium-sized project for a short time; it currently takes 30 minutes to build from cold on a beefy machine. Full rebuilds are expensive and worth avoiding.

I thought to work around the issue in part by dockerizing the build, thus providing isolation from anything else happening on the system.

Now I need you to suspend disbelief for a moment if this is new to you - docker has recently introduced a new caching mechanism which enables you to retain files across docker builds. They're revamping the way docker build works. The feature is in recent docker versions (18.09 I believe) and can be turned on from the docker client if you have DOCKER_BUILDKIT=1 in your environment. This means for example that we can retain the .stack and .stack-work directories across docker rebuilds, whilst still rebuilding layers.

However, we can still avoid rebuilding layers too, so let's do that. My execution plan in the Dockerfile is something like this:

(With $HOME/.stack and $PWD/.stack-work cached):

  1. Install stack.
  2. Copy across the stack.yaml.
  3. Run stack update.
  4. Copy across .cabal files.
  5. stack build --dependencies-only
  6. Copy across everything.
  7. stack build --test --bench --no-run-benchmarks --no-run-tests
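For concreteness, the steps above can be sketched as a BuildKit-enabled Dockerfile. All paths, package names, and cache ids below are illustrative, not taken from the real project:

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y curl \
 && curl -sSL https://get.haskellstack.org/ | sh          # 1. install stack
WORKDIR /app

COPY stack.yaml ./                                         # 2. resolver config
RUN --mount=type=cache,target=/root/.stack,id=stack-root \
    stack update                                           # 3. package index

COPY project.cabal ./                                      # 4. cabal files
COPY subproject1/subproject1.cabal subproject1/

RUN --mount=type=cache,target=/root/.stack,id=stack-root \
    --mount=type=cache,target=/app/.stack-work,id=stack-work \
    stack build --dependencies-only                        # 5. deps only

COPY . .                                                   # 6. full source
RUN --mount=type=cache,target=/root/.stack,id=stack-root \
    --mount=type=cache,target=/app/.stack-work,id=stack-work \
    stack build --test --bench --no-run-benchmarks --no-run-tests   # 7
```

Because the cabal files are copied in before the full source, Docker's layer cache only invalidates the dependency build when a cabal file changes.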

This way, in principle, the external dependencies only get invalidated if the cabal files change.

My hope was that, given that we have the caches enabled, if only a few Haskell files changed, the final build in step 7 would be 'fast', i.e. only rebuild what changed. However, if there is a trivial change anywhere, all of the subprojects get rebuilt. (The external dependencies don't, which is good, because those are huge too.)

The issue seems to be that stack always thinks the files have changed (seen in stack build --verbose):

[info] subproject1-0.1.0.0: unregistering (local file changes: subproject.cabal spec/Spec.hs ...)

All of the other projects are then unregistered and rebuilt because one of the dependencies has changed.

One possible cause is that the Changed Time field shown by stat indicates that the 'inode-changed time' reflects the moment that the docker COPY occurred, not the moment the file was modified. Docker copies across atime and mtime, but the ctime cannot be changed without unmounting the filesystem and poking the bits in the filesystem image. I took a brief look at stack's source code to see if the 'inode-changed time' was leaking in, but I could not quickly find it.
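To illustrate the ctime/mtime distinction: mtime can be set freely from userspace (which is what Docker's COPY does when it preserves timestamps), but ctime always records when the inode was last touched and cannot be back-dated. A quick demonstration on a GNU/Linux system:

```shell
# Create a file and back-date its modification time; the inode-change
# time (ctime) cannot be set from userspace, so it stays "now".
tmp=$(mktemp)
touch -d '2020-01-01 00:00:00 UTC' "$tmp"
mtime=$(stat -c '%Y' "$tmp")   # epoch seconds of last data modification
ctime=$(stat -c '%Z' "$tmp")   # epoch seconds of last inode change
echo "mtime=$mtime ctime=$ctime"
# mtime is the back-dated 2020 timestamp; ctime is the current time,
# so ctime > mtime even though the contents never changed.
[ "$ctime" -gt "$mtime" ] && echo "ctime differs from mtime"
rm -f "$tmp"
```

This is exactly the situation after a docker COPY: contents and mtime are unchanged, but ctime reflects the moment of the copy.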

One possible resolution to this issue would be to stop using the 'inode-changed time' as a factor for build cache invalidation. If possible, keying the validation on file contents would be a better alternative in this regard.
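A minimal sketch of what content-based keying could look like, assuming nothing about stack's internals: hash every tracked file and combine the digests into one cache key, so the key is stable across metadata-only changes like a docker COPY.

```shell
# Hypothetical sketch: derive a cache key from file *contents* only.
dir=$(mktemp -d)
printf 'main = putStrLn "hi"\n' > "$dir/Main.hs"

# Hash each file, then hash the sorted list of per-file digests.
key1=$(find "$dir" -type f -print0 | sort -z | xargs -0 sha256sum \
       | sha256sum | cut -d' ' -f1)

# Simulate what docker COPY does: timestamps change, contents do not.
touch "$dir/Main.hs"
key2=$(find "$dir" -type f -print0 | sort -z | xargs -0 sha256sum \
       | sha256sum | cut -d' ' -f1)

[ "$key1" = "$key2" ] && echo "cache key stable across metadata-only changes"
rm -rf "$dir"
```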

Steps to reproduce

Here is a Dockerfile which reproduces the problem. Unfortunately I haven't been able to make a minimal example project yet - I will update the issue if I get time to do so.

If I fail to update the issue with a reproducer, and no one else encounters the problem, I won't be offended if the issue is closed.

Stack yaml:

resolver: lts-10.5
packages:
- '.'
- './subproject1'
- './subproject2'
- './subproject3'
- './subproject4'
extra-deps:
- about-10-of-these
flags: {}
extra-package-dbs: []
ghc-options:
  "$locals": -j

Expected

When I rebuild the Dockerfile described above, only projects which are changed (or whose dependencies are changed) get rebuilt.

So for example, below, I would expect 'project' to be rebuilt if a file within 'project' is changed, but not 'subproject1'; that should only be rebuilt if something in subproject1 changes.

Actual

$ DOCKER_BUILDKIT=1 docker build .
[info] subproject1-0.1.0.0: unregistering (local file changes: subproject.cabal spec/Spec.hs ...)
configure subproject1
build subproject1
configure subproject2
build subproject2
configure subproject3
build subproject3
configure subproject4
build subproject4
configure project
build project

Stack version

$ stack --version
Version 1.9.3, Git revision 40cf7b37526b86d1676da82167ea8758a854953b (6211 commits) x86_64 hpack-0.31.1

Method of installation

  • Official binary, downloaded from stackage.org or fpcomplete's package repository
@dbaynard
Contributor

dbaynard commented Jan 8, 2019

Hi @pwaller

It would be helpful to see the full log — would you kindly paste that?

It looks, though, like you aren't caching enough. The subproject directories have their own '.stack-work' directories which contain the build output.

Incidentally, instead of the "$locals": -j line in your config, you may wish to set jobs: explicitly.

Stack has docker support. Did you consider this?

@pwaller
Author

pwaller commented Jan 8, 2019

Thanks for the quick and very helpful reply!

It would be helpful to see the full log — would you kindly paste that?

This relates to my inability to make a quick reproducer, the log contains lots of information I would rather not disclose.

The lack of caches mounted in the subproject directories might well be the reason! Is there a way to redirect the caches to just one path fragment? It is awkward to need to redirect many of them!

Thanks for the tip about jobs:.

I'm reluctant to use the docker abstraction built into stack - naively, it seems to increase the complexity, since a developer would need to understand how stack interacts with docker in addition to the complexity docker brings. I also don't expect it will help with the problem of caching things at a fine-grained level? So I haven't looked into it yet.

@dbaynard
Contributor

dbaynard commented Jan 8, 2019

The lack of caches mounted in the subproject directories might well be the reason! Is there a way to redirect the caches to just one path fragment? It is awkward to need to redirect many of them!

#1178 (comment)

Basically, no.

At least you only have to do it once.

I guess you could cache the entire directory, then copy in your source, and rely on stack not needing to rebuild unnecessary stuff?

Something like

  --mount=type=cache,target=/app/,id=work-dir

I'd be interested to know whether that works, by the way — I've been needing to do something like this for a while. Also it would make a good addition to the stack documentation.

@pwaller
Author

pwaller commented Jan 8, 2019

I guess you could cache the entire directory, then copy in your source, and rely on stack not needing to rebuild unnecessary stuff?

No, this won't work, or at least it would be very non-idiomatic, since you'd have to copy it in within the run command:

RUN  --mount=type=cache,target=/app/,id=work-dir cp /somewhere/else /build/path && stack build

Any chance to add environment configuration which controls the caching? It's nice to not pollute the source directories in any case!

@dbaynard
Contributor

dbaynard commented Jan 8, 2019

No, this won't work, or at least it would be very non-idiomatic, since you'd have to copy it in within the run command:

Surely you'd be copying the source directly into the cached directory outside of docker?

Any chance to add environment configuration which controls the caching?

I don't quite understand.


At least you only have to do it once.

The simplest method is just to extend the list of caches for your final RUN. Everything up to and including the --dependencies-only line only uses the project cache.
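Concretely, that could look something like the following (package names and cache ids are illustrative):

```dockerfile
# One cache mount per package's .stack-work, plus the stack root.
RUN --mount=type=cache,target=/root/.stack,id=stack-root \
    --mount=type=cache,target=/app/.stack-work,id=work-root \
    --mount=type=cache,target=/app/subproject1/.stack-work,id=work-sub1 \
    --mount=type=cache,target=/app/subproject2/.stack-work,id=work-sub2 \
    stack build --test --bench --no-run-benchmarks --no-run-tests
```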

@pwaller
Author

pwaller commented Jan 8, 2019

Surely you'd be copying the source directly into the cached directory outside of docker?

The way the cache works is that it applies to a RUN command. So if you did a COPY to $DIR followed by a RUN with $DIR as a cache, the command being run would see an empty $DIR.
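In Dockerfile terms (paths illustrative), the cache mount shadows whatever the image layers placed at the same path:

```dockerfile
COPY . /app                      # /app is populated in the image layer
RUN --mount=type=cache,target=/app,id=work-dir ls /app
# The RUN sees the (initially empty) cache volume mounted at /app,
# not the files that COPY put there.
```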

The simplest method is just to extend the list of caches for your final RUN.

Thanks for that clarification. Makes sense.

Any chance to add environment configuration which controls the caching?

I don't quite understand.

Sorry, I was not specific! I was wondering if you could allow the cache directory root to be specified (for example through configuration or the environment), so that it's possible to avoid polluting the source directory tree.

@pwaller
Author

pwaller commented Jan 8, 2019

I can report success with your suggested approach, thanks! I had to put a lot of cache lines on the final build, but I guess it isn't /that/ bad.

@pwaller
Author

pwaller commented Jan 8, 2019

Happy at this point to close the issue unless there is anything else in here you'd like to track.

@dbaynard
Contributor

dbaynard commented Jan 8, 2019

Might be worth adding to the documentation — I don't think it explains that there are multiple .stack-work directories. I'll take a look.

I was wondering if you could allow the cache directory root to be specified (for example through configuration or the environment), so that it's possible to avoid polluting the source directory tree.

The build cache (the ones in the subproject folders) has to be in directories below the project cabal files — it's all explained in #1178.

@snoyberg
Contributor

Looks like this is resolved, closing
