RFC: Turborepo Local Cache Location #1023
I am voting for the project-specific cache directory. There are already a bunch of things that need to be ignored for file watchers (such as Can I propose a combination of the two, similar to how you can configure the
I would like to add a drawback to both the 'Turborepo-specific Cache Directory' and 'System Temporary Directory' options. While it may be easy to buy an additional hard drive for cache, I believe it is not a smart idea to intentionally choose the drive that is likely to be the smallest drive on the system. Choosing the same drive as the project will allow the developer to consider disk space issues when deciding on a good spot for their project.
RFC: Turborepo Local Cache Location
After doing an extensive review, I wanted to spend a few minutes discussing our options for temporary file storage and then offer a proposal for what I believe is our best ... path ... for moving forward.
Considerations
When working to identify the best solution for our caching setup, these are the things that we should be considering and optimizing for.
1. Cache Usefulness
The first thing we should take under consideration is how we think about the value provided by Turborepo caching, and ensuring that we're able to meet those demands.
Growth & Boundedness
In theory, the Turborepo execution node caches can grow in an unbounded manner. In reality, the maximum cache size can be approximated by the cost of storing executions across all active HEADs of the repository (including those on local development boxes). This should also include maintaining some history for things like bisection, incremental code reviews, and backtracking while doing local development.
From a technical standpoint, for the majority of use cases, Turborepo is capable of reviewing the state of the local repository, the origin repository, and other remotes to determine which caches are possibly useful and which caches can be expired. Further, known unstable execution nodes and their descendants can be excluded from caching, as we know that they will never be encountered twice.
Local Cache
The local cache for Turborepo is designed to optimize for the performance of builds. On a daily basis, the number of times that a Turborepo user receives a majority of cache misses from their local cache should be equivalent to (number of major tasks undertaken * number of descendant nodes invalidated by that task). We further expect that the majority of changes will occur closer to the leaf nodes of the execution graph, limiting the time spent on builds.
For a user's own execution graph invalidations, we can even know in advance that there will be a cache miss and skip checking for remote caches.
Given what we expect to be the typical user pattern, maintaining a local cache of executions is paramount for achieving high-performance builds in day-to-day use.
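To make the "descendant nodes invalidated" term concrete, here is a minimal sketch of how a change propagates through an execution graph. The graph shape and the package names (`utils`, `ui`, `api`, `web`, `docs`) are hypothetical, and this is not how Turborepo itself computes invalidation; it only illustrates why changes near the leaves invalidate less.

```python
from collections import deque


def invalidated_nodes(dependents: dict[str, list[str]], changed: str) -> set[str]:
    """Return the changed node plus every node that transitively depends on it."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical graph: each key maps to the nodes that depend on it.
dependents = {
    "utils": ["ui", "api"],
    "ui": ["web", "docs"],
    "api": ["web"],
}

near_root = invalidated_nodes(dependents, "utils")  # touches the whole graph
near_leaf = invalidated_nodes(dependents, "web")    # touches only itself
```

A change to `web`, which nothing depends on, invalidates a single node, while a change to `utils` invalidates all five.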
Remote Cache
The remote cache for Turborepo is designed to share the build work across the team. No person on the team should have to run the same build task twice. By rule, excepting changes you have made and unstable execution nodes, somebody has already run that execution node by the time you are able to check it out to your own local environment. Even if somebody blind-pushed to let CI do it for them, then CI itself can even populate that cache (approximating a remote execution).
So, given that, the value of the remote cache is asymptotic to (time to run execution node - (time to download with current network conditions<cache consumer> + time to upload with current network conditions<cache creator>)). We assume that most Turborepo users will be working under good network conditions such that reconstituting the local cache from the remote cache is likely a valuable exercise, and something that should only happen once per major task undertaken.
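As a worked instance of the remote cache value expression, the sketch below plugs in illustrative numbers. The function name and the timings are assumptions made up for this example, not measurements.

```python
def remote_cache_value(run_seconds: float,
                       download_seconds: float,
                       upload_seconds: float) -> float:
    """Time saved by a remote cache hit: run time minus transfer costs."""
    return run_seconds - (download_seconds + upload_seconds)


# A long task under good network conditions: the remote cache pays off.
slow_task = remote_cache_value(run_seconds=120.0, download_seconds=4.0, upload_seconds=6.0)

# A trivial task: transfer costs exceed the work saved, so the value is negative.
fast_task = remote_cache_value(run_seconds=2.0, download_seconds=3.0, upload_seconds=3.0)
```

The second case shows why the assumption about good network conditions matters: for very cheap execution nodes, the transfer time can outweigh the saving.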
2. File Watching
A second consideration for how (and where) we cache is system file watching behavior.
A number of individual developer systems have multiple concurrent file watching services and we need to determine how best to optimize integration with those systems. If we trigger a large number of monitored file system events we could dramatically reduce the overall machine performance—which runs counter to our goals with Turborepo.
As examples of these systems:
We need to consider how the location we choose on disk interacts with systems like these. For global system watching the location that we choose doesn't matter. For subscription-based watching based upon location, we need to make sure we avoid accidentally landing in somebody's glob match.
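The glob concern can be sketched as a quick check of a candidate cache path against a watcher's subscription patterns. The paths and the pattern are hypothetical, and `fnmatchcase` lets `*` match across path separators, which is broader than many real watchers, so treat this as a conservative approximation.

```python
from fnmatch import fnmatchcase


def watched_by(path: str, watch_globs: list[str]) -> list[str]:
    """Return the watcher globs that would match the given path."""
    return [glob for glob in watch_globs if fnmatchcase(path, glob)]


# Hypothetical subscription: a tool watching everything under the project.
watch_globs = ["/home/dev/project/*"]

in_project = "/home/dev/project/.cache/turbo/abc123"  # hypothetical project-local cache file
in_home = "/home/dev/.turbo/caches/abc123"            # hypothetical home-level cache file

project_hits = watched_by(in_project, watch_globs)
home_hits = watched_by(in_home, watch_globs)
```

Under these assumptions, a cache file inside the project lands in the watcher's match set, while one under a home-level directory does not.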
3. Cache Persistence
A third consideration is what the ideal duration of a cache is. Though validity of execution node caches is expected to decay exponentially, some caches could last near-indefinitely. Given this, we should choose a location that gives us the ability to store things indefinitely.
Technically, there are a lot of options on where to store things. Temporary folders on Linux and Unix devices are typically cleared on reboot (though rebooting is uncommon behavior these days). Windows temporary folders grow unbounded. Removing files at process end can be accomplished cross-platform, even in the event of an error (unlink after open, FileOption.DeleteOnClose).
Non-temporary locations require us to implement Cache Management of some sort.
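For illustration, Python's standard library exposes the same delete-on-close idea. This sketch uses `tempfile.NamedTemporaryFile`, which removes the file when the handle closes on every platform; the `turbo-scratch-` prefix is a made-up name, and this is not a claim about how Turborepo handles its own files.

```python
import os
import tempfile


def write_scratch(data: bytes) -> tuple[str, bytes]:
    """Write data to a scratch file that is removed when the handle closes."""
    with tempfile.NamedTemporaryFile(prefix="turbo-scratch-") as handle:
        handle.write(data)
        handle.flush()
        handle.seek(0)
        # Leaving the with-block closes the handle, which deletes the file,
        # whether we leave by returning or by raising an exception.
        return handle.name, handle.read()


path, contents = write_scratch(b"task output")
still_there = os.path.exists(path)
```

The trade-off is that such files cannot outlive the process, which makes this approach suitable for scratch data but not for a cache meant to persist between runs.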
4. Cache Management
A fourth consideration is how (or even whether) Turborepo should manage its cache. Cache management on every invocation is expensive and takes up a disproportionate share of total execution time.
Our hypothesis is that, on average, developer device disk space is cheap and developer time is expensive. Given that, we believe that expanding the local cache to cover as much as possible is the right strategy. We find it unlikely that we will find ourselves under disk space pressure, even in the JavaScript ecosystem. If a developer somehow finds themselves disk-space bound because of Turborepo caches, a smart IT organization will immediately invest in a bigger hard drive that will pay for itself within a week.
So, in service of our goal of maximizing cache hits, we should expand the local cache—even up to the point of disk limits.
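One cheap way to respect disk limits without per-invocation cache management is to check free space only at write time. The sketch below assumes a hypothetical `can_store` helper and an optional reserve; it is one possible policy, not something this RFC specifies.

```python
import shutil


def can_store(cache_dir: str, artifact_bytes: int, reserve_bytes: int = 0) -> bool:
    """True if writing the artifact still leaves at least reserve_bytes free."""
    free = shutil.disk_usage(cache_dir).free
    return free - artifact_bytes >= reserve_bytes


# An empty artifact always fits; an absurdly large one never does.
fits_empty = can_store(".", artifact_bytes=0)
fits_huge = can_store(".", artifact_bytes=10**30)
```

A single `disk_usage` call per write is inexpensive compared with scanning or pruning the cache on every run.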
5. Continuous Integration (CI) Environments
A fifth consideration is how to integrate caching with CI services. Many CI providers will reconstitute their images with a local cache, avoiding requests to the Turborepo remote cache origin to download remote caches. We need to make sure it is simple to integrate with this to help speed up CI runs.
6. API Design Enabling User Error
A sixth consideration is to make sure that we optimize away any place where user error can creep into the system. (Though I use the term "user error" here, if we make it possible for the error to occur, the fault is Turborepo's, not the user's.)
Currently one of the steps in the Turborepo documentation requires the user to make modifications to their project to avoid accidentally committing information about Turborepo's internal state to the repository.
Ensuring that users "fall into the pit of success" should be an important part of what makes Turborepo special. Including anything that requires the user to be conscientious (even just once!) introduces friction.
If we move caches outside of product folders this neatly sidesteps the problem.
7. Inspection
Currently the outputs of Turborepo are only semi-opaque. A user is able to investigate individual items inside of the caches and make some sense of them.
While we want to support this use case we should make this available in a different way rather than by browsing through internals.
Options
Given all of these considerations, the three options for cache path are:
Individual Project Directory
This helps with:
It has possible drawbacks in:
git clean -fdx all the caches disappear.
It has no material impact on:
System Temporary Directory
This helps with:
fsync.
It has possible drawbacks in:
Unknown impact on:
Turborepo-specific Cache Directory
This would be equivalent to something like ~/.turbo/caches/.
This helps with:
It has possible drawbacks in:
It has no material impact on:
Proposal
After weighing the options on their merits, I propose creating a Turborepo-specific cache directory. The path name itself risks becoming a bikeshed and is not an important part of this decision (and is easy to change later).
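As a rough sketch of what resolving such a directory could look like, the snippet below defaults to the `~/.turbo/caches/` example path and allows an override. The override variable `EXAMPLE_TURBO_CACHE_DIR` and the function name are hypothetical, introduced only for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical override variable; not an actual Turborepo setting.
OVERRIDE_VAR = "EXAMPLE_TURBO_CACHE_DIR"


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the cache directory: explicit override first, else ~/.turbo/caches."""
    env = os.environ if env is None else env
    override = env.get(OVERRIDE_VAR)
    if override:
        return Path(override).expanduser()
    return Path.home() / ".turbo" / "caches"


default_dir = resolve_cache_dir({})
custom_dir = resolve_cache_dir({OVERRIDE_VAR: "/tmp/turbo-cache"})
```

Keeping the lookup in one function is what makes the path name easy to change later, and an override would let a developer point the cache at a larger drive.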