Implement automatic garbage collection for the disk cache #5139
Comments
I will look into implementing this, unless someone else is faster than me. |
I don't have time to work on this right now. @davido, if you don't get around to working on this in the next 2-3 weeks, I'm happy to pick this up. |
During the migration from Buck to Bazel we lost action cache activation by default. For one, a local action cache wasn't implemented in Bazel; for another, there was no option to specify the HOME directory. I fixed both problems, and starting with Bazel 0.14.0 both features are included in the released Bazel version: [1], [2]. There is still one unimplemented option: limiting the cache directory to a maximum size: [3]. But for now the advantage of activating the caches by default far outweighs the disadvantage of unbounded growth of the cache directory beyond an imaginary maximum size of, say, 10 GB. In the meantime we add a warning to watch the size of the cache directory and periodically clean it:
$ rm -rf ~/.gerritcodereview/bazel-cache/cas/*
[1] https://bazel-review.googlesource.com/#/c/bazel/+/16810
[2] bazelbuild/bazel#4852
[3] bazelbuild/bazel#5139
Change-Id: I42e8f6fb9770a5976751ffef286c0fe80b75cf93
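For reference, a minimal sketch of how the disk cache is enabled (the path is only a placeholder; any writable directory works, and the same flag can live in a .bazelrc):
$ bazel build //... --disk_cache=/path/to/disk-cache
As described above, Bazel does not prune this directory on its own; that is precisely what this issue asks for.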
Not finished, but I had an initial stab: RNabel/bazel@baseline-0.16.1...RNabel:feature/5139-implement-disk-cache-size (this is mostly plumbing and figuring out where to put the logic; it definitely doesn't work yet). I figured the simplest solution is an LRU relying on the file system for access times and modification times. Unfortunately, access times are not available on Windows through Bazel's file system abstraction. One alternative would be a simple database, but that feels like overkill here. @davido, what do you think is the best solution here? Also happy to write up a brief design doc for discussion.
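To make the LRU idea concrete, here is a rough shell sketch (an illustration of the general approach, not the patch above): it orders cache files by modification time and deletes the oldest ones until the total size fits under a budget. It assumes GNU find/sort/du, a placeholder cache path and budget, uses mtimes because atimes may be unavailable, and assumes no build is writing to the cache while it runs.
# Placeholder location and size budget.
CACHE_DIR=/path/to/disk-cache
MAX_BYTES=$((10 * 1024 * 1024 * 1024))   # 10 GB
total=$(du -sb "$CACHE_DIR" | cut -f1)
# Oldest-first listing: "<mtime> <size-in-bytes> <path>" per line.
find "$CACHE_DIR" -type f -printf '%T@ %s %p\n' | sort -n |
while read -r mtime size path; do
  if [ "$total" -le "$MAX_BYTES" ]; then break; fi
  rm -f -- "$path"
  total=$((total - size))
done
A real implementation inside Bazel would additionally need to cope with concurrent invocations, partially written entries, and platforms where GNU tools are unavailable.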
What do you guys think about just running a local proxy service that has this functionality already implemented? For example: https://github.com/Asana/bazels3cache or https://github.com/buchgr/bazel-remote? One could then point Bazel to it using --remote_http_cache=http://localhost:XXX. We could even think about Bazel automatically launching such a service if it is not running already.
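As a sketch of that setup (the port, cache directory, and size limit below are placeholders, and bazel-remote's flags are quoted from memory, so check its README), one would start the proxy and point Bazel at it:
$ bazel-remote --dir /path/to/proxy-cache --max_size 10 --port 9090 &
$ bazel build //... --remote_http_cache=http://localhost:9090
The eviction policy then lives entirely in the proxy process, so Bazel itself only needs its existing HTTP cache support.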
I think @aehlig solved this problem for the repository cache. Maybe you can borrow his implementation here as well.
@buchgr, I feel this is core Bazel functionality and in my humble opinion outsourcing it isn't the right direction. People at my company are often amazed Bazel doesn't have this fully supported out of the box.
@ittaiz, what solution are you talking about? What we have so far for the repository cache is that the file gets touched on every cache hit (see e0d8035), so that deleting the oldest files would be a cleanup; the latter, however, is not yet implemented, for lack of time. For the repository cache it is also a slightly different story, as cleanup should always be manual; upstream might have disappeared, so the cache might be the last copy of the archive available to the user, and we don't want to remove that on the fly.
I would be interested to learn more about why you think so. |
@aehlig sorry, my bad. You are indeed correct.
@buchgr, I think so because a disk cache is a really basic feature of Bazel, and the fact that it doesn't work like this by default is IMHO a leaky abstraction (of how exactly the caches work), influenced greatly by the fact that Googlers work mainly (almost exclusively?) with remote execution. I've explained Bazel to tens, maybe hundreds of people. All of them were surprised the disk cache isn't out of the box (eviction-wise and also plans-wise, like we discussed).
@ittaiz the disk cache is indeed a leaky abstraction that was mainly added because it was easy to do so. I agree that if Bazel should have a disk cache in the long term, then it should also support read/write through to a remote cache and garbage collection. However, I am not convinced that Bazel should have a disk cache built in; instead, this functionality could also be handled by another program running locally. So I am trying to better understand why this should be part of Bazel. Please note that there are no immediate plans to remove it and we will not do so without a design doc of an alternative. I am mainly interested in kicking off a discussion.
Thanks for the clarification and I appreciate the discussion.
I think that users don’t want to operate many different tools and servers
locally. They want a build tool that works.
The main disadvantage I see is that it sounds like you’re offering a
cleaner design at the user’s expense.
I partly agree. I'd argue that in many companies that would change, as you would typically have an IT department configuring workstations and laptops.
I think that also depends. I'd say that if one only wants to use the local disk cache, then I agree that providing two flags is as frictionless as it gets. However, I think it's possible that most disk cache users will also want to do remote caching/execution, and that for them this might not be noteworthy additional work. So I think there are two possible future scenarios for the disk cache:
I think 1) makes sense if we think that the disk cache will be a standalone feature that a lot of people will find useful on its own, and if so I think it's worth the effort to implement this in Bazel. For 2) I am not so sure, as I can see several challenges that might be better solved in a separate process:
So having a standard local caching proxy that is a separate process, can be operated independently, and/or can be launched automatically by Bazel for improved usability might be an idea worth thinking about.
Is there any plan to roll out the "virtual remote filesystem" soon? I am interested in learning more about it and can help if needed. We are hitting a network speed bottleneck.
yep, please follow #6862 |
Any plans to implement the max size feature or a garbage collector for the local cache?
This is a much-needed feature in order to use Remote Builds without the Bytes, since naively cleaning up the disk cache results in build failures.
This will be used by the implementation of garbage collection for the disk cache, as discussed in #5139 and the linked design doc. I judge this to be preferred over https://github.com/xerial/sqlite-jdbc for the following reasons: 1. It's a much smaller dependency. 2. The JDBC API is too generic and becomes awkward to use when dealing with the peculiarities of SQLite. 3. We can (more easily) compile it from source for all host platforms, including the BSDs. PiperOrigin-RevId: 628046749 Change-Id: I17bd0547876df460f48af24944d3f7327069375f
How far are we from getting this done? rc1 is scheduled for next Monday, but judging by the urgency and remaining work, we can push it out a bit. |
Unfortunately, I ran into some difficulties and this is not ready yet. I'm aiming to build up to a minimally useful implementation within the next few days. If this FR is the only reason we would delay rc1, it would be fine to get it out today, under the (not so unreasonable?) assumption that there will be an rc2 that the remaining changes can still make it into. |
Here is my workaround:
find /path/to/bazel-cache -amin +1440 -delete 2>/dev/null || true
The command searches for files in the /path/to/bazel-cache directory that have not been accessed in the last 1440 minutes (24 hours) and deletes them.
It won't help you to "trim to size" if you need to fit cache below a certain threshold of disk space. |
@peaceiris I don't know for certain, but our experience strongly suggests that any process that deletes files directly from the cache like that is doomed to cause strange failures sooner or later. It may depend on specifically what tasks bazel runs; we've found that code coverage jobs are especially brittle. |
Yes, I think that's right. So we're looking forward to implementing this feature in Bazel. |
Unfortunately we won't have enough time for this in 7.2.0; postponing to 7.3.0 |
Thanks, but this is not a good workaround for this problem, because you cannot control which files will be deleted, and different cache files depend on each other. If somebody plans to try it, please be careful.
7.3.0 is not as far away as we might think. Is there a way to try this with nightly builds or something similar?
Unfortunately we didn't have time for this in 7.3.0 either -- @tjgq has been occupied by other work. This may need to wait until 7.4.0 or even 8.0.
We do have rolling releases, which are cut every ~2 weeks. But I'm not sure if this feature in particular is ready for even preliminary testing. |
I really appreciate everyone who is looking into this, and I left comments in the design doc. I am in a similar boat as @ittaiz and trying to help a 100-person company improve their build infrastructure. Having a default local cache, with GC, enabled by default, is all very valuable for Bazel in this environment. I was also surprised that it doesn't already work this way. |
Status update: Bazel 8 is going to ship with built-in garbage collection, with knobs to set a maximum size in bytes, a maximum age for cache entries, or both. However, contrary to previously announced plans, we're going to provide an offline implementation, whereby garbage collection occurs in the background while the Bazel server is idle. It therefore remains possible for the set limits to be exceeded during the course of a build, which is a necessary tradeoff to preserve build performance. We will also provide an official tool to run a garbage collection manually, for users who prefer to have more control. The documentation page at https://bazel.build/remote/caching#disk-cache will soon be updated with the details. The more sophisticated online implementation described in #5139 (comment) is, unfortunately, extremely challenging to implement across all supported operating systems, while providing hard size guarantees, allowing multiple Bazel processes to share the same disk cache, and preserving current build performance in all scenarios. It's unclear to me at this point that all of these goals can be simultaneously fulfilled. |
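For anyone wanting to try this out, a sketch of what the configuration might look like (the flag names and value syntax below are my best guess at the experimental flags referred to above; treat them as assumptions and consult the linked documentation page for the authoritative spelling and defaults):
$ bazel build //... --disk_cache=/path/to/disk-cache \
    --experimental_disk_cache_gc_max_size=10G \
    --experimental_disk_cache_gc_max_age=7d
With settings along these lines, an idle Bazel server periodically prunes the cache down to roughly the size limit and drops entries older than the maximum age; as noted above, a single large build can still temporarily overshoot the limit.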
Thanks for the update.
And maybe document what will happen if the size isn't set right. Hopefully something like: "The current upload will be lost, but there won't be any corruption and nothing will crash. So the only consequence is the recomputation of some entries."
I think the best rule of thumb we can give is: prior to starting a build, you must have enough disk space to fit the entirety of the build outputs twice (one time for the contents of the output tree, and one time for the copies stored in the disk cache). If you do run out of space during a build, it's always safe to retry it after you have (manually) recovered some space. Bazel is designed to produce a correct build result regardless of the starting conditions.
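As a worked example of that rule of thumb (the numbers are illustrative): if a clean build of your project produces roughly 20 GB of outputs, plan for at least about 40 GB of free space before starting, roughly 20 GB for the output tree plus roughly 20 GB for the disk cache copies. You can check the available space with something like:
$ df -h /path/to/workspace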
*** Reason for rollback *** The experiment with using sqlite didn't pan out. We might find other uses for it in the future, but we can always resurrect it from version control; no need to saddle Bazel with the additional dependency until then. *** Original change description *** Implement a JNI wrapper around SQLite. This will be used by the implementation of garbage collection for the disk cache, as discussed in #5139 and the linked design doc. I judge this to be preferred over https://github.com/xerial/sqlite-jdbc for the following reasons: 1. It's a much smaller dependency. 2. The JDBC API is too generic and becomes awkward to use when dealing with the peculiarities of SQLite. 3. We can (more easily) compile it from source for all host platforms, including the BSDs. *** PiperOrigin-RevId: 679600756 Change-Id: Ic3748fa30404a31504426c523c9b9a60ec451863
+1 to this general perspective. Developers will generally maintain enough free disk space that they don't need to be right at the limit all the time. As such, Bazel doesn't need to go overboard on the disk space limit, whenever it cuts into other desiderata. |
@tjgq So sqlite won't be used for GC, right? |
@dkashyn-sfdc Correct. I experimented with speeding up filesystem scans by using a sqlite database as an index, but it's very tricky to do while observing all of the constraints I mentioned above, and in the end I felt that it had an unfavorable complexity/utility ratio (so it will ship without an index, just a parallelized filesystem scan). |
To close the loop: in addition to shipping in 8.0.0, this feature has also been cherry-picked into 7.4.0, and the documentation has been updated accordingly. |
Broken out from #4870.
Bazel can use a local directory as a remote cache via the --disk_cache flag. We want it to also be able to automatically clean the cache after a size threshold has been reached. It probably makes sense to clean based on least-recently-used (LRU) semantics.
@RNabel would you want to work on this?
@RNabel @davido