Per-user compiled artifact cache #5931

djc · 2018-08-23T14:30:36Z

I was wondering if anyone has contemplated somehow sharing compiled crates. If I have a number of projects on disk that often have similar dependencies, I'm spending a lot of time recompiling the same packages. (Even correcting for features, compiler flags and compilation profiles.) Would it make sense to store symlinks in ~/.cargo or equivalent pointing to compiled artefacts?

The text was updated successfully, but these errors were encountered:

alexcrichton · 2018-08-23T18:33:00Z

There's been musings about this historically but never any degree of serious consideration. I've always wanted to explore it though! (I think it's definitely plausible)

aidanhs · 2018-09-02T17:17:52Z

sccache is one option here - it has a local disk cache in addition to the more exotic options to store compiled artifacts in the cloud.

djc · 2018-09-02T18:15:48Z

sccache would be good for the compilation time part, but it'd be nice to also get a handle on the disk size part of it.

Eh2406 · 2018-11-23T19:41:58Z

cc #6229

Vlad-Shcherbina · 2018-11-23T22:10:35Z

I think you can put

[build]
target-dir = "/my/shared/target/dir"

in ~/.cargo/config.

But I have no idea if this mode is officially supported. Is it?

Eh2406 · 2018-11-23T22:34:21Z

Yes it is, as is setting it with the corresponding environment variable. However the problems with cargo never deleting the unused artifacts gets to be dramatic quickly. Hence the connection to #6229

epage · 2023-05-17T14:14:41Z

@joshtriplett and I had a brainstorming session on this at RustNL last week.

It'd be great if cargo could have a very small subset of sccache's logic: per-user caching of intermediate build artifacts. By building this into cargo, we can tie it into all that cargo knows and cane make extensions to better support it.

Risks

Poisoning the cache from
- Broken builds (e.g. from incremental compilation bugs)
- Non-deterministic builds
- mtime bugs
Races with parallel builds
- Performance hits from locking
Running out of disk space

To mitigate problems with cache poisoning

Packages must advertise that they are deterministic
- Long term: Wasm build scripts and proc-macros to ensure determinism
Initially limit the caching to packages from immutable sources
- No mtime bugs
- No need for incremental compilation

As a contingency for if the cache is poisoned, we need a way to clear the cache (see also #3289)

To mitigate running out of disk space, we need a GC / prune (see also #6509)

One strategy is to clear caches for a rust version that is no longer installed
We could manually track atime

Locking strategy to mitigate race conditions / locking performance

Assumptions:
- Parallel builds are likely
- Parallel builds of the same package fingerprint are unlikely
Design:
1. Does it exist (without lock)
2. Build outside the cache
3. If can't do atomic rename, move into a temp dir, then do an atomic rename. If it now exists, just delete the tmpdir
Read/prune lock
- Multi-Reader/single-writer lock
- What about reading it?
  - Hold lock wile building, blocking prunes
  - Could copy out while holding lock to minimize lock time
  - Hold a lock for copy it
  - Or don't bother copying, just block the prune until lock is available
- Prune by renaming and then deleting

Transition plan (modeled off of sparse registry)

Steps:
1. Unstable
2. Hacky env variable to opt-in (unstable)
- On stable, warn if env variable is set so people can set it globally and use versions of Rust from Step 2 and Step 3
1. Hacky env variable to opt-in (stable)
2. Stable
Don't need full cleanup/pruning strategy until it is stable

epage · 2023-05-17T19:13:33Z

Wonder if something like reflink would be useful

epage · 2023-05-24T20:05:48Z

See also #7150

This was broken in rust-lang#12268 when we stopped using an intermediate `Cargo.toml` file. Unlike pre-rust-lang#12268, - We are hashing the path, rather than the content, with the assumption that people change content more frequently than the path - We are using a simpler hash than `blake3` in the hopes that we can get away with it Unlike the Pre-RFC demo - We are not forcing a single target dir for all scripts in the hopes that we get rust-lang#5931

fix(embeded): Don't pollute the scripts dir with `target/` ### What does this PR try to resolve? This PR is part of #12207. This specific behavior was broken in #12268 when we stopped using an intermediate `Cargo.toml` file. Unlike pre-#12268, - We are hashing the path, rather than the content, with the assumption that people change content more frequently than the path - We are using a simpler hash than `blake3` in the hopes that we can get away with it Unlike the Pre-RFC demo - We are not forcing a single target dir for all scripts in the hopes that we get #5931 ### How should we test and review this PR? A new test was added specifically to show the target dir behavior, rather than overloading an existing test or making all tests sensitive to changes in this behavior. ### Additional information In the future, we might want to resolve symlinks before we get to this point

epage · 2023-08-07T17:19:37Z

Some complications that came up when discussing this this with ehuss.

First, some background. We track rebuilds in two ways. The first is we have an external fingerprint that is a hash that we use to tell when to rebuild. The second is we have the hash of build inputs we pass to rustc with -Cmetadata that is used to keep symbols unique. We include this in the file name, so if -Cmetadata changes, then the filename changes. If the file doesn't exist, that is a sure sign it needs to be built.

Problems

We co-mingle files in the target directory.
- This makes it easy for us to pass a single directory for rustc to slurp up rlibs
- Some other tools depend on this for slurping up rlibs or for asm output from rustc
- This is a problem because we'll need all artifacts for an immutable package to be in isolated directories, for capturing the files and reading from the cache
RUSTFLAGS is only present in the fingerprint and not in -Cmetadata (Reconsider RUSTFLAGS artifact caching. #8716).
- We don't want PGO related RUSTFLAGS to change symbols
- We don't want remap related RUSTFLAGS to change symbols
- However, this introduces mutable data into the immutable package, making it so we can't cache it

I guess the first question is whether the per-user cache should be organized around fingerprint, -Cmetadata, or something else. Well, -Cmetadata isn't an option so long as it doesn't have RUSTFLAGS, which it shouldn't, so it would more be fingerprint or us adding a new hash type, one that maybe we reuse with the file names and ensure doesn't cause problems with rustc.

weihanglo · 2023-08-17T09:36:07Z

Cargo uses relative paths to workspace root for path dependencies to generate stable hashes. This causes an issue (#12516) when sharing target directories between package with the same name and version and relative path to workspace.

epage · 2023-08-21T15:51:59Z

For me, the biggest thing that needs to be figured out before any other progress is worth it is how to get a reasonable amount of value out of this cache.

Take my system

74 repos with a Cargo.lock in the root
65 of those repos have syn in the lockfile
Among those 65 repos, 44 different versions of syn are used (there is a mix of v1 and v2)

This is a "bottom of the stack" package. As you go up the stack, the impact of version combinations grows dramatically.

I worry a per-user cache's value will only be slightly more than making cargo clean && cargo build faster and that doesn't feel worth the complexity to me.

jplatte · 2023-08-21T18:24:23Z

How did you do that analysis? I'd be interested in running it on my own system.

Also. re caching and RUSTFLAGS, could it be an optional to simply fall back to the existing caching scheme (project-specific target dir) if RUSTFLAGS is set at all? Personally, I work on very few projects that utilize RUSTFLAGS (AFAIK.. although I also have it configured globally right now, with -C link-arg=-fuse-ld=mold, which I'd have to disable or find an alternative solution for), so everything else benefitting from a shared cache dir might already be useful.

epage · 2023-08-21T18:29:33Z

How did you do that analysis? I'd be interested in running it on my own system.

Pre-req: I keep all repos in a single folder.

$ ls */Cargo.lock | wc -l
$ rg 'name = "syn"' */Cargo.lock -l | wc -l
$ rg 'name = "syn"' */Cargo.lock -A 1 | rg version | rg -o '".*"' | sort -u | wc -l

(and yes, there are likely rg features I can use to code-golf this)

jplatte · 2023-08-21T18:46:11Z

Thanks! I keep my projects in two dirs (approximated active and inactive projects), but running this separately on both I get the following. Also included futures-util as another commonly-used crate, one that does not get released as often.

stat	active	inactive
number of crates / workspaces	34	91
workspaces pulling in `syn`	29	72
different versions of `syn`	15	33
workspaces pulling in `futures-util`	20	44
different versions of `futures-util`	3	14

djc · 2023-08-22T07:27:22Z

Also syn is part of a set of crates that gets a lot of little bumps. This is common for dtolnay crates, but not so much for a whole host of other crates -- so I'm not sure this particular test is very representative. (Note that I'm definitely not disagreeing that the utility of a per-user compiled artifact cache might not be as great as hoped.)

FWIW, given what I see Rust-Analyzer doing in a large workspace at work (some 670 crates are involved) it seems to be doing a lot of recompilation even with only weekly updates to the dependencies so even within a single workspace there might be some wins?

lu-zero · 2023-08-22T07:40:12Z

Somebody with a deduplicating file system could share their statistics for a theoretical upper bound?

idelvall · 2023-11-29T09:56:15Z

When expecting to compile a package

If in cache, copy it into CARGO_TARGET_DIR

If not in cache, compile it and write it to cache

@epage could you avoid the copy to CARGO_TARGET_DIR and read the binaries directly from the new cache?

That would help us having three mount caches at earthly/lib/rust without duplicated entries: One for CARGO_HOME, other for CARGO_TARGET_DIR and other for the new cache.

epage · 2023-11-29T18:19:58Z

Yes, we could have locks on a per-cached item basis and read directly from it. Whether we do depends on how much we trust the end-to-end process.

RobJellinghaus · 2024-02-15T23:13:32Z

Hi folks, chiming in here to merge two streams: some of us at Microsoft did a hackathon project to prototype a per-user Cargo cache, late last September.

Here's the Zulip chat: https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/Per-user.20build.20caches
Our HackMD with status as of our last discussion of this with the Cargo team: https://hackmd.io/R64ykWblRr-y9-jeWNLGtQ?view
Comparison of our branch's changes: https://github.com/rust-lang/cargo/compare/master...arlosi:cargo:hackathon?expand=1

Our initial testing of this generally showed surprisingly small speedup, even when things were entirely cached. It seems that rustc is just really damn fast at crate compilation :-O And that (as we know) the long pole is always the final LLVM binary build.

This change took the approach of creating a user-shared cache using the cacache crate, which worked very well and allowed us to get something functioning quickly, but which was generally considered by the Cargo team to perpetuate the anti-pattern of having a not very human-readable filesystem layout. @arlosi had a background goal of creating a more human-manageable filesystem cache provider implementation, but I know other priorities have supervened since then.

I do think this change's approach of having a very narrow cache interface is a good design direction that was reasonably proven by this experiment.

Happy to discuss our approach on any level, hope it is useful to people wanting to move this further forwards.

mydoghasfleas · 2024-04-15T08:43:50Z

Our initial testing of this generally showed surprisingly small speedup, even when things were entirely cached. It seems that rustc is just really damn fast at crate compilation :-O And that (as we know) the long pole is always the final LLVM binary build.

But the problem being solved here is not only the speed, but the amount of storage space being consumed. When you have dozens of projects each compiling the same crates and each easily taking up 2GB, your drive starts filling up very quickly!

ssokolow · 2024-04-15T08:57:53Z

Our initial testing of this generally showed surprisingly small speedup, even when things were entirely cached. It seems that rustc is just really damn fast at crate compilation :-O And that (as we know) the long pole is always the final LLVM binary build.

But the problem being solved here is not only the speed, but the amount of storage space being consumed. When you have dozens of projects each compiling the same crates and each easily taking up 2GB, your drive starts filling up very quickly!

Exactly.

I've now got a Ryzen 5 7600. Combined with mold, cargo clean; cargo build --release takes almost no time... but I've only got so much disk space and, even without deleting and re-building things to limit space consumption, I'd still rather avoid unnecessary write cycles on my SSD.

mydoghasfleas · 2024-04-22T08:24:33Z

Our initial testing of this generally showed surprisingly small speedup, even when things were entirely cached. It seems that rustc is just really damn fast at crate compilation :-O And that (as we know) the long pole is always the final LLVM binary build.

But the problem being solved here is not only the speed, but the amount of storage space being consumed. When you have dozens of projects each compiling the same crates and each easily taking up 2GB, your drive starts filling up very quickly!

Exactly.

I've now got a Ryzen 5 7600. Combined with mold, cargo clean; cargo build --release takes almost no time... but I've only got so much disk space and, even without deleting and re-building things to limit space consumption, I'd still rather avoid unnecessary write cycles on my SSD.

The write cycles issue is very pertinent!

jaskij · 2024-08-14T12:56:12Z

Came across this, reading the design document , and decided to share my thoughts here. Hopefully this will be good input.

In CI, users generally have to declare what directory is should be cached between jobs. This directory will be compressed and uploaded at the end of the job. If the next job's cache key matches, the tarball will be downloaded and decompressed. If too much is cached, the time for managing the cache can dwarf the benefits of the cache. Some third-party projects exist to help manage cache size.

This makes several assumption about CI behavior, probably based on how GH Actions behaves. Note that other runners will behave differently, for example by default GitLab does not upload the cache anywhere. I'm not even sure if the cache is compressed. This shouldn't be an issue, but it is a set of bad assumptions in the design doc and could theoretically could lead to bad design.

I would also like to add that CI interactions should have supporting multiple CI runners from the get go. GitHub is, for better or worse, dominant, especially in mindshare, and the Rust project shouldn't be furthering it's monopoly.

Hashes and fingerprinting

Stale caches are a major pain, to the point that I have learned to recognize the signs working with at least two build systems. Currently, Cargo does not even use the whole hash, only 64 bits (sixteen digits) of it. I'm worried that with user-wide caches, collisions may happen. To that end, I'd prefer there was a plaintext copy of all the data going into the cache, to be verified that it is indeed the correct cache. Or at least use the full length of the hash.

Garbage collection

I've seen someone mention clearing old stuff by access time. This would work, in theory, but is also something to be careful around. For example, a lot of desktop systems use relatime, and some people set noatime for performance reasons. Are we even sure that the data is always available?

Personally, I would love to see something more advanced, with data tracking.

Cargo tracks which dependencies in the global cache are used by which project, including:
- which projects use which dependencies
- projects' filesystem paths
cargo clean:
- removes connections between projects and it's dependencies
- deletes all cached dependencies with no connections left (configurable via config.toml)
connections are restored automatically whenever a project is built
a new command with subcommands:
- go through dependencies and removes all that have no project connections
- go through all projects in the cache tracking database and verify that the projects indeed still exist in the system, removing stale project entries and their connections
- a combination of both

Setting the cache directory

It is a feature that would greatly improve flexibility, while being fairly simple. I know people who would put the build cache in a ramdisk. I can envision a situation where the cache is put on a network share, to provide rudimentary sharing between people. Just allowing the user to configure the cache directory would make it much easier to set up. The default should be under $CARGO_HOME, but it should be possible to change independently. I know all the stuff I described above should be possible under Linux using just filesystem manipulation, but this variable makes it much more user friendly.

epage · 2024-08-14T13:22:23Z

This makes several assumption about CI behavior, probably based on how GH Actions behaves. Note that other runners will behave differently, for example by default GitLab does not upload the cache anywhere. I'm not even sure if the cache is compressed. This shouldn't be an issue, but it is a set of bad assumptions in the design doc and could theoretically could lead to bad design.

I would also like to add that CI interactions should have supporting multiple CI runners from the get go. GitHub is, for better or worse, dominant, especially in mindshare, and the Rust project shouldn't be furthering it's monopoly.

This is mentioning a use case. Nothing in the design is "github specific"

I've seen someone mention clearing old stuff by access time. This would work, in theory, but is also something to be careful around. For example, a lot of desktop systems use relatime, and some people set noatime for performance reasons. Are we even sure that the data is always available?

We are doing our own access time tracking in an sqlite database, see https://doc.rust-lang.org/nightly/cargo/reference/unstable.html#gc

CosmicHorrorDev · 2024-08-14T16:16:15Z

Currently, Cargo does not even use the whole hash, only 64 bits (sixteen digits) of it. I'm worried that with user-wide caches, collisions may happen

Looks like with 64 bits the probability of getting a collision starts becoming more of a possibility when you start getting close to a billion entries

From: https://en.wikipedia.org/wiki/Birthday_problem#Probability_table

ehuss added the A-caching Area: caching of dependencies, repositories, and build artifacts label Feb 21, 2019

ehuss mentioned this issue Jul 19, 2019

Cache usage meta tracking issue #7150

Open

kentfredric mentioned this issue Oct 28, 2019

Future support for shared compile cache kentnl-gentoo-rust/rust-dev-overlay#9

Open

epage mentioned this issue May 17, 2023

cargo clean ~/.cargo #3289

Closed

epage changed the title ~~Per-user compiled artefact cache~~ Per-user compiled artifact cache May 17, 2023

This was referenced May 24, 2023

Cache compilations of everything from crates.io #4436

Closed

Suggestion: re-use built dependencies across directories #4301

Closed

epage mentioned this issue Jun 17, 2023

fix(embeded): Don't pollute the scripts dir with target/ #12282

Merged

This was referenced Jun 27, 2023

RFC: Templating CARGO_TARGET_DIR to make it the parent of all target directories rust-lang/rfcs#3371

Closed

Tracking Issue for cargo-script RFC 3424 #12207

Open

epage mentioned this issue Jul 12, 2023

Reconsider RUSTFLAGS artifact caching. #8716

Open

idelvall mentioned this issue Nov 29, 2023

[lib/rust] RUSTFLAGS might introduce cache-misses in the target mount cache earthly/lib#30

Open

weihanglo mentioned this issue Jan 11, 2024

cargo workspace made easier #13219

Closed

epage mentioned this issue Jan 11, 2024

Cargo allows you to have a different name in [lib] and [package] when publishing #6827

Open

This was referenced Apr 23, 2024

RFC: cargo-script rust-lang/rfcs#3502

Merged

Clean package perf improvements #13818

Merged

epage mentioned this issue May 10, 2024

cargo package --workspace is not very useful #10948

Open

4 tasks

This was referenced Jun 1, 2024

"Cargo install" should provide download binaries only from crate.io if availibale rather than build these binaries every time when calling cargo install #13994

Open

--artifact-dir (nee --out-dir) Tracking Issue #6790

Open

epage mentioned this issue Jun 22, 2024

Redefine CARGO_TARGET_DIR to be only an artifacts directory #14125

Open

epage mentioned this issue Jul 9, 2024

Add named path bases to cargo (v2) rust-lang/rfcs#3529

Merged

This was referenced Jul 16, 2024

Add user-wide cache goal rust-lang/rust-project-goals#50

Merged

Refactoring: move hex and hasher modules from util module to util crate #11429

Closed

not download and build artifacts again, example: arch linux, cosmic desktop #14278

Closed

weihanglo mentioned this issue Jul 24, 2024

Consider renaming [gc] config table. #14292

Open

This was referenced Jul 25, 2024

User-wide build cache rust-lang/rust-project-goals#124

Open

CARGO_TARGET_DIR spreads out files in many directories #14310

Closed

This comment was marked as off-topic.

Sign in to view

weihanglo mentioned this issue Oct 23, 2024

Support cargo clean --workspace #14720

Open

weihanglo mentioned this issue Nov 3, 2024

Snake Game when compiling #14771

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Per-user compiled artifact cache #5931

Per-user compiled artifact cache #5931

djc commented Aug 23, 2018

alexcrichton commented Aug 23, 2018

aidanhs commented Sep 2, 2018

djc commented Sep 2, 2018

Eh2406 commented Nov 23, 2018

Vlad-Shcherbina commented Nov 23, 2018

Eh2406 commented Nov 23, 2018

epage commented May 17, 2023

epage commented May 17, 2023

epage commented May 24, 2023

epage commented Aug 7, 2023

weihanglo commented Aug 17, 2023 •

edited

Loading

epage commented Aug 21, 2023

jplatte commented Aug 21, 2023 •

edited

Loading

epage commented Aug 21, 2023 •

edited

Loading

jplatte commented Aug 21, 2023 •

edited

Loading

djc commented Aug 22, 2023 •

edited

Loading

lu-zero commented Aug 22, 2023

idelvall commented Nov 29, 2023

epage commented Nov 29, 2023

RobJellinghaus commented Feb 15, 2024 •

edited

Loading

mydoghasfleas commented Apr 15, 2024

ssokolow commented Apr 15, 2024

mydoghasfleas commented Apr 22, 2024 •

edited

Loading

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

jaskij commented Aug 14, 2024

epage commented Aug 14, 2024

CosmicHorrorDev commented Aug 14, 2024 •

edited

Loading

Per-user compiled artifact cache #5931

Per-user compiled artifact cache #5931

Comments

djc commented Aug 23, 2018

alexcrichton commented Aug 23, 2018

aidanhs commented Sep 2, 2018

djc commented Sep 2, 2018

Eh2406 commented Nov 23, 2018

Vlad-Shcherbina commented Nov 23, 2018

Eh2406 commented Nov 23, 2018

epage commented May 17, 2023

epage commented May 17, 2023

epage commented May 24, 2023

epage commented Aug 7, 2023

weihanglo commented Aug 17, 2023 • edited Loading

epage commented Aug 21, 2023

jplatte commented Aug 21, 2023 • edited Loading

epage commented Aug 21, 2023 • edited Loading

jplatte commented Aug 21, 2023 • edited Loading

djc commented Aug 22, 2023 • edited Loading

lu-zero commented Aug 22, 2023

idelvall commented Nov 29, 2023

epage commented Nov 29, 2023

RobJellinghaus commented Feb 15, 2024 • edited Loading

mydoghasfleas commented Apr 15, 2024

ssokolow commented Apr 15, 2024

mydoghasfleas commented Apr 22, 2024 • edited Loading

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

jaskij commented Aug 14, 2024

Hashes and fingerprinting

Garbage collection

Setting the cache directory

epage commented Aug 14, 2024

CosmicHorrorDev commented Aug 14, 2024 • edited Loading

weihanglo commented Aug 17, 2023 •

edited

Loading

jplatte commented Aug 21, 2023 •

edited

Loading

epage commented Aug 21, 2023 •

edited

Loading

jplatte commented Aug 21, 2023 •

edited

Loading

djc commented Aug 22, 2023 •

edited

Loading

RobJellinghaus commented Feb 15, 2024 •

edited

Loading

mydoghasfleas commented Apr 22, 2024 •

edited

Loading

CosmicHorrorDev commented Aug 14, 2024 •

edited

Loading