RFC: Turborepo Local Cache Location #1023

nathanhammond · 2022-04-08T03:11:57Z

RFC: Turborepo Local Cache Location

After doing an extensive review, I wanted to spend a few minutes discussing our options for temporary file storage and then offer a proposal for what I believe is our best ... path ... for moving forward.

Considerations

When working to identify the best solution for our caching setup, these are the things that we should be considering and optimizing for.

1. Cache Usefulness

The first thing we should take under consideration is how we think about the value provided by Turborepo caching, and ensuring that we're able to meet those demands.

Growth & Boundedness

In theory, the Turborepo execution node caches can grow in an unbounded manner. In reality, the maximum cache size can be approximated by the cost of storing executions across all active HEADs of the repository (including those on local development boxes). This should also include some history for things like bisection, incremental code reviews, and backtracking during local development.

From a technical standpoint for the majority of use-cases, Turborepo is capable of reviewing the state of the local repository, the origin repository, and other remotes to determine which caches are possibly useful and which caches can be expired. Further, known unstable execution nodes and descendants can be excluded from caching as we know that they will never be encountered twice.

Local Cache

The local cache for Turborepo is designed to optimize for the performance of builds. On a daily basis the number of times that a Turborepo user receives a majority of cache misses from their local cache should be equivalent to the (number of major tasks undertaken * number of descendant nodes invalidated by that task). We further expect that the majority of changes will occur closer to the leaf nodes of the execution graph, limiting the time spent on builds.

For a user's own execution graph invalidations, we can even know in advance that there will be a cache miss and skip checking for remote caches.

Given what we expect to be the typical user pattern, maintaining a local cache of executions is paramount for achieving high-performance builds in day-to-day use.

Remote Cache

The remote cache for Turborepo is designed to share the build work across the team. No person on the team should have to run the same build task twice. By rule, excepting changes you have made and unstable execution nodes, somebody has already run that execution node by the time you are able to check it out to your own local environment. Even if somebody blind-pushed to let CI do the work for them, CI itself can populate that cache (approximating a remote execution).

So, given that, the value of the remote cache is asymptotic to (time to run execution node - (time to download with current network conditions<cache consumer> + time to upload with current network conditions<cache creator>)). We assume that most Turborepo users will be working under good network conditions such that reconstituting the local cache from the remote cache is likely a valuable exercise—and something that should only happen once per major task undertaken.
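The relationship above can be made concrete with a small calculation. This is only a sketch: the function name and the sample timings are hypothetical and are not measurements from Turborepo.

```python
def remote_cache_benefit(run_seconds: float,
                         download_seconds: float,
                         upload_seconds: float) -> float:
    """Approximate time saved by a remote cache hit for one execution node.

    A positive result means restoring from the remote cache is cheaper than
    re-running the node; a negative result means the transfer cost outweighs
    the work saved.
    """
    return run_seconds - (download_seconds + upload_seconds)


# Hypothetical numbers: a 90 s build whose artifact takes 4 s to download
# (cache consumer) and 6 s to upload (cache creator).
fast_network = remote_cache_benefit(90.0, 4.0, 6.0)

# The same artifact over a slow connection.
slow_network = remote_cache_benefit(90.0, 60.0, 75.0)
```

With good network conditions the result stays close to the full run time of the node; with poor conditions it can go negative, which is the case where the remote cache stops paying for itself.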

2. File Watching

A second consideration for how (and where) we cache is system file watching behavior.

A number of individual developer systems have multiple concurrent file watching services and we need to determine how best to optimize integration with those systems. If we trigger a large number of monitored file system events we could dramatically reduce the overall machine performance—which runs counter to our goals with Turborepo.

As examples of these systems:

Dropbox ships an agent that, in some configurations and releases, monitors nearly every single file system event on a computer. This performance tax is generally evenly spread across the entire system.
Watchman runs as a daemonized process which receives registrations from multiple clients. It attempts to minimize the cost of file system event monitoring and triggers notifications to services.

We need to consider how the location we choose on disk interacts with systems like these. For global system watching the location that we choose doesn't matter. For subscription-based watching based upon location, we need to make sure we avoid accidentally landing in somebody's glob match.
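The glob-match concern can be illustrated with a rough check. This sketch assumes a watcher that subscribes by shell-style pattern; real watchers each have their own matching rules, and the paths and the `is_watched` helper are hypothetical.

```python
from fnmatch import fnmatchcase


def is_watched(path: str, watch_patterns: list[str]) -> bool:
    """Return True if any watcher subscription pattern matches the path."""
    return any(fnmatchcase(path, pattern) for pattern in watch_patterns)


# Hypothetical subscription registered by a dev server for one project.
patterns = ["/home/dev/app/*"]

# Hypothetical cache locations: one inside the project, one outside it.
in_project_cache = "/home/dev/app/cache/3f2a.tar.gz"
outside_project_cache = "/home/dev/.turbo/caches/3f2a.tar.gz"

in_project_is_watched = is_watched(in_project_cache, patterns)
outside_is_watched = is_watched(outside_project_cache, patterns)
```

Every write to the in-project location would generate an event for that subscriber, while writes to the location outside the project would not.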

3. Cache Persistence

A third consideration is what the ideal duration of a cache is. Though validity of execution node caches is expected to decay exponentially, some caches could last near-indefinitely. Given this, we should choose a location that gives us the ability to store things indefinitely.

Technically, there are a lot of options for where to store things. Temporary folders on Linux and other Unix-like systems are commonly cleared on reboot, either by startup scripts or because they live on an in-memory file system (though rebooting is uncommon behavior these days). Windows temporary folders grow unbounded. Removing files at process end can be accomplished cross-platform, even in the event of an error (unlink after open on POSIX, FileOptions.DeleteOnClose in .NET).
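As an illustration of removal at close, Python's standard library exposes this behavior through `tempfile.NamedTemporaryFile`. This is a sketch of the general technique, not of how Turborepo handles its files; the file prefix is made up for the example.

```python
import os
import tempfile

# The file is removed when it is closed, including when the `with` block
# exits because of an exception.
with tempfile.NamedTemporaryFile(prefix="turbo-example-") as scratch:
    scratch.write(b"intermediate build output")
    scratch.flush()
    scratch_path = scratch.name
    existed_while_open = os.path.exists(scratch_path)

exists_after_close = os.path.exists(scratch_path)
```

This suits scratch data that should never outlive the process. It is the opposite of what a long-lived cache needs, which is why the persistence question matters.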

Non-temporary locations require us to implement Cache Management of some sort.

4. Cache Management

A fourth consideration is how (or even whether) Turborepo should manage its cache. Managing the cache on every invocation is expensive and takes up a disproportionate share of total execution time.

Our hypothesis is that, on average, developer device disk space is cheap and developer time is expensive. Given that, we believe that expanding the local cache to cover as much as possible is the right strategy. We find it unlikely that we will find ourselves under disk space pressure, even in the JavaScript ecosystem. If a developer somehow finds themselves disk-space bound because of Turborepo caches, a smart IT organization will immediately invest in the bigger hard drive that will pay for itself within a week.

So, in service of our goal of maximizing cache hits, we should expand the local cache—even up to the point of disk limits.
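If the cache ever does reach a disk limit, one inexpensive policy is to drop the least recently modified entries until the cache fits a budget, run out of band rather than on every invocation. The sketch below is an assumption about what such a policy could look like; this RFC does not specify one, and `evict_oldest` and the byte budget are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def evict_oldest(cache_dir: Path, max_bytes: int) -> list[str]:
    """Delete least-recently-modified files until the cache fits max_bytes.

    Returns the names of the removed entries, oldest first.
    """
    entries = sorted(
        (p for p in cache_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    total = sum(p.stat().st_size for p in entries)
    removed = []
    for entry in entries:
        if total <= max_bytes:
            break
        total -= entry.stat().st_size
        entry.unlink()
        removed.append(entry.name)
    return removed


# Demonstration in a throwaway directory with three 100-byte artifacts.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    for age_rank, name in enumerate(["old", "middle", "new"]):
        artifact = cache / name
        artifact.write_bytes(b"x" * 100)
        # Set explicit modification times so the ordering is deterministic.
        os.utime(artifact, (1_000 + age_rank, 1_000 + age_rank))
    removed = evict_oldest(cache, max_bytes=150)
    remaining = sorted(p.name for p in cache.iterdir())
```

Modification time is a coarse stand-in for usefulness. A policy informed by repository state, as described under Growth & Boundedness, could make better choices.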

5. Continuous Integration (CI) Environments

A fifth consideration is how to integrate caching with CI services. Many CI providers will reconstitute their images with a local cache, avoiding requests to the Turborepo remote cache origin to download remote caches. We need to make sure it is simple to integrate with this to help speed up CI runs.

6. API Design Enabling User Error

A sixth consideration is to make sure that we optimize away any place where user error can creep into the system. (Though I use the term "user error" here, if we make it possible for the error to occur, it is not the user's error; it is Turborepo's.)

Currently one of the steps in the Turborepo documentation requires the user to make modifications to their project to avoid accidentally committing information about Turborepo's internal state to the repository.

Ensuring that users "fall into the pit of success" should be an important part of what makes Turborepo special. Including anything that requires the user to be conscientious (even just once!) introduces friction.

If we move caches outside of product folders this neatly sidesteps the problem.

7. Inspection

Currently the outputs of Turborepo are only semi-opaque. A user is able to investigate individual items inside of the caches and make some sense of them.

While we want to support this use case we should make this available in a different way rather than by browsing through internals.

Options

Given all of these considerations, the three options for cache path are:

Individual project directory.
System temporary directory.
Turborepo-specific cache directory.

Individual Project Directory

This helps with:

Inspection. Easy to find, always correlated.

It has possible drawbacks in:

API Design Enabling User Error. It is possible that the cache could accidentally end up committed to a repository.
Cache Management. By distributing the caches throughout the system, they're hard to individually track down and remove.
Cache Persistence. Any time somebody runs git clean -fdx all the caches disappear.
File Watching. The application could have a file watching setup that accidentally conflicts with the Turborepo cache folder resulting in incompatibility.

It has no material impact on:

CI Environments

System Temporary Directory

This helps with:

API Design Enabling User Error. No possible errors here.
File Watching. Nothing is in the project folder, nothing to go wrong.
Cache Management. On many POSIX environments, this location is emptied on reboot.
Cache Sharing. All Turborepo caches are content addressable, enabling them to easily be shared across multiple projects or instances.
Performance? Some systems may elect to use a RAM disk for the system temporary directory or make other optimizations regarding fsync.

It has possible drawbacks in:

Inspection. Spelunking to your system temporary folder generally requires deeper system knowledge—especially on Windows.
Cache Persistence. Every reboot things disappear.
Cache Management. It is not a complete solution to cache management and therefore does not eliminate the need for cache management.

Unknown impact on:

CI Environments. Likely fine, but possibly a bit more complicated.

Turborepo-specific Cache Directory

This would be equivalent to something like ~/.turbo/caches/.
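A minimal sketch of how such a location might be resolved follows. The `~/.turbo/caches` default mirrors the example path above; the optional override and the `EXAMPLE_TURBO_CACHE_DIR` variable name are assumptions made for illustration and are not an existing Turborepo setting.

```python
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str], home: Path) -> Path:
    """Pick the cache directory: explicit override first, else a home default."""
    # Hypothetical environment variable, used only for this example.
    override = env.get("EXAMPLE_TURBO_CACHE_DIR")
    if override:
        return Path(override)
    return home / ".turbo" / "caches"


default_dir = resolve_cache_dir({}, Path("/home/dev"))
override_dir = resolve_cache_dir(
    {"EXAMPLE_TURBO_CACHE_DIR": "/mnt/big/turbo"}, Path("/home/dev")
)
```

Passing the environment and home directory as arguments keeps the function easy to test and keeps the decision in one place, so the path itself stays easy to change later.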

This helps with:

API Design Enabling User Error. No possible errors here.
File Watching. Nothing is in the project folder, nothing to go wrong.
Cache Management. All caches are co-located, enabling easy management.
Cache Persistence. In order for the caches to be removed, the user must specifically opt in to their removal; it would not occur as a side effect.
Cache Sharing. All Turborepo caches are content addressable, enabling them to easily be shared across multiple projects or instances.
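The cache-sharing point relies on content addressing: the key is derived from what went into a task, not from where the project lives. The sketch below shows the general idea only. The key derivation is an assumption and not Turborepo's actual hashing scheme, and the file names and hash strings are placeholders.

```python
import hashlib
import json


def cache_key(task: str, inputs: dict[str, str]) -> str:
    """Derive a content-addressed key from a task name and its input hashes."""
    # sort_keys makes the key independent of dictionary ordering.
    payload = json.dumps({"task": task, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# One shared store standing in for a single cache directory.
shared_cache: dict[str, bytes] = {}

# Checkout A runs the task and stores its output.
inputs = {"src/index.ts": "placeholder-hash-1", "package.json": "placeholder-hash-2"}
key_a = cache_key("build", inputs)
shared_cache[key_a] = b"compiled output"

# Checkout B has identical inputs (listed in a different order) and gets a hit.
key_b = cache_key("build", dict(reversed(list(inputs.items()))))
hit = shared_cache.get(key_b)

# Changing any input produces a different key, so stale output is never reused.
key_c = cache_key("build", {**inputs, "src/index.ts": "placeholder-hash-3"})
```

Because identical inputs map to the same key regardless of which checkout produced them, two projects sharing one directory reuse entries only when reuse is safe.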

It has possible drawbacks in:

Inspection. Compared to in-project cache storage, it is not as easy to get visibility into what is happening. This tradeoff can be mitigated with tooling.

It has no material impact on:

CI Environments

Proposal

After weighing the options on their merits, I propose creating a Turborepo-specific cache directory. The path name itself risks becoming a bikeshed and is not an important part of this decision (and is easy to change later).

weyert · 2022-04-08T10:33:24Z

I am voting for the project-specific cache directory. There are already a bunch of things that need to be ignored by file watchers (such as the .history directory, generated build artefacts, etc.), so I think it's not a major thing.

Can I propose a combination of the two, similar to how you can configure the store-dir in PNPM? The default is ~/.pnpm but you can also configure it to use something like ./pnpm-store via the .npmrc in your project directory.

3 replies

nathanhammond Apr 8, 2022
Author

Can you explain your reasons as to why specifically you prefer per-project organization? What is the goal you are hoping to accomplish by storing the cache relative to the project? So far I primarily understand that you don't find file watching to be a negative.

TIL git clean -fdx. Feels a bit odd to use -x and so ignore .gitignore

-x is the only way you can get things like build products or node_modules which are the primary reason I use git clean. (Usually just before publish to make sure I don't make any mistakes.) For me at least, git clean -fdx is typed entirely via muscle memory.

a combination of the two

[I'm assuming that you mean system store with an optional configuration option to local store.]

This introduces a lot of maintenance complexity and edge cases that we could accidentally ship to users if we fail to account for them in tests. Whichever configuration is not the default will, by rule, have significantly fewer users and be more-likely to encounter bugs. Each bit of configuration has costs, and our goal is to help reduce the decisions that people have to make.

weyert Apr 8, 2022

Can you explain your reasons as to why specifically you prefer per-project organization? What is the goal you are hoping to accomplish by storing the cache relative to the project?

It's easier to convince people to white list a project folder for the virus scanner than a random directory in the user directory. Also, it gives more visibility into what's going on in the local cache, while the one in the user directory hides the magic. Out of sight, out of mind.

It helps avoid mixing caches between different projects/monorepos, and it mimics the approach we are using for the remote cache. I would like to avoid the chance of accidentally using a local cache from a different project/monorepo.

So far I primarily understand that you don't find file watching to be a negative.

Yes, personally. I would add the turbo cache directory (and pnpm store directory, .history) to the repository's .gitignore file. I would assume that file watchers will ignore the directories listed there. I am just not sure whether that's the case for Next.js's next dev command.

-x is the only way you can get things like build products or node_modules which are the primary reason I use git clean

For me, it's something I learned today :) I typically have a clean, clean:cache, and clean:coverage script or similar in my projects to cleanse the project from build artefacts.

[I'm assuming that you mean system store with an optional configuration option to local store.]

That's correct

nathanhammond Apr 8, 2022
Author

It's easier to convince people to white list a project folder for the virus scanner

The audience of turbo is developers. Do you believe that virus scanning is a significant concern considering that, prior to being able to use turbo, they've already had to go through the process of installing node, npm, and git, and been able to successfully navigate a command line? I find that to be very unlikely.

what's going on in the local cache, while the one in the user directory hides the magic. Out of sight, out of mind.

Do you regularly use the cache folder as an interface into the workings of turbo? If so, how do you use it? Do you believe that spelunking into the cache folder is the best possible interface that we could create for getting information about your builds? (Yes, this is a hint.)

Helps avoiding mixing caches between/of different projects/monorepos

We're using a content-addressable caching strategy so that if a particular execution node returns the same hash we should be able to reuse it. That reuse is considered a feature of the design we have chosen, not a bug.

I would assume that file watchers will ignore the directories listed here. Only I am not sure if that's the case for Next.js's next dev command.

You have precisely demonstrated my concern, and that is within just the products for which I'd be on the hook to provide an out-of-box support story. Now multiply that by the number of projects in the JavaScript ecosystem. That's an impossibly high maintenance burden without even accounting for integration of turbo into projects written three years ago in unmaintained tools (or out-of-date versions) in which we have no method to address the problem.

rafaeltab · 2022-04-08T10:55:05Z

I would like to add a drawback to both the 'Turborepo-specific Cache Directory' and 'System Temporary Directory' options.
These options both ignore the drive that the user has chosen for their project.
With these options, the cache will often live on a smaller SSD that is intended for OS use only.

While it may be easy to buy an additional hard drive for the cache, I believe it is not a smart idea to intentionally choose the drive that is likely to be the smallest drive on the system. Choosing the same drive as the project will allow the developer to consider disk space issues when deciding on a good spot for their project.

4 replies

nathanhammond Apr 8, 2022
Author

Being able to place the cache directory on the "correct" drive is a valid need. However, inferring which drive is the "correct" drive isn't easily accomplished. I would in fact posit that the system temporary directory is the most likely "regular" location that could be assumed to be on the "correct" drive for frequent reads/writes/deletes of derived data.

I also believe that we should look at the typical devices being used for computing these days (laptops) and make the assumption that fewer and fewer people have multiple drives. My hypothesis is that in the majority of cases this would be primarily a CI concern.

Does "single system directory with a configurable location" address the concerns that you bring up with regards to drive choice?

rafaeltab Apr 8, 2022

Being able to place the cache directory on the "correct" drive is a valid need. However, inferring which drive is the "correct" drive isn't easily accomplished. I would in fact posit that the system temporary directory is the most likely "regular" location that could be assumed to be on the "correct" drive for frequent reads/writes/deletes of derived data.

I would say that when only looking at drive space the temporary directory would be better than for example the home directory, since it is cleared. However, it still exists on the smaller SSD, if that is what the developer uses, and if not cleared often enough it might pose a problem. Also, the correct drive would often be the same drive as the one on which the project lives.

I also believe that we should look at the typical devices being used for computing these days (laptops) and make the assumption that fewer and fewer people have multiple drives. My hypothesis is that in the majority of cases this would be primarily a CI concern.

Personally, I do believe that multi-drive systems are still very common. I myself have four computers (two older ones, one new one, and one for work only), of which three have multiple drives, and two of those have more than two. Both my parents also have multiple drives, and so does a good friend who is also a software engineer. I do not see this as something that is uncommon at all.

Does "single system directory with a configurable location" address the concerns that you bring up with regards to drive choice?

A "single system directory with a configurable location" would of course solve this issue, since you can decide for yourself where to put it, which could then be on another drive. In this use case, I do believe it to be important for the developer to be able to delete the cache for one specific project, instead of all caches at once. When working on open-source projects, a developer may want to delete the cache after use, and this would become impossible if it also means deleting every other cache. Therefore, I would be for "single system directory with a configurable location" with subdirectories for specific turbo projects. Naming those folders would become an issue in that case, however, since duplicates may occur.

An alternative to creating folders for each project could be a turbo prune-cache command, which would delete the cache associated with the current working directory. However, it may be appropriate to put this idea in another RFC if this feature seems desirable.

nathanhammond Apr 8, 2022
Author

able to delete the cache for one specific project

Why do you feel like you need to be able to delete the cache? What problem are you trying to address with that? Do you feel like the failure modes are better or worse for the alternative solution of individual project caches? (e.g. npm + node_modules, https://twitter.com/MarkPieszak/status/1159136343559155712)

An alternative to creating folders for each project could be a turbo prune-cache command

Have you been joining my zoom calls? There are ongoing discussions about features we can unlock with this change, as well as those which would need to be added to achieve the most optimal system we can build.

rafaeltab Apr 8, 2022

Personally, I would like to delete the cache for a project I am not working on anymore. If I am unable to identify which cache belongs to which project, this becomes impossible.

This would be a use case specifically because the cache of turbo can grow indefinitely and can be very large.
