-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include execution cache hit ratios #16043
Comments
Consider trying
If you want to have the info programmatically, you can use the BEP, e.g. by adding the flag
|
I think one main issue we have is that it's impossible to get the ratio of cacheable actions / actual cache hits from the BEP...the only way as mentioned is by using the exec log. it would be good to have the aggregate count of potentially cacheable actions in the BEP as well. |
In what cases is the distinction |
In the use case of analyzing client performance. In our situations cache missing can take a heavy toll on infrastructure and user experience. Knowing how healthy or unhealthy our cache performance is from a clients point of view helps us understand if we have problems. We collect and monitor metrics from both bazel clients via BEP and remote cache servers. Also, if we had these metrics it would have been very clear that bazel 5.2.0 was released with broken client side caching, two executions would show what was expected to be cached and was was or was not found (in terms of counts) in cache from build events. I don't expect most folks look at execution logs between cold local cache builds. I also find it hard to reason about making calculations from runner counts because they don't always show up, and the naming might be a bit dynamic (such as the sandbox names), if the runners always initiated to 0, it would be better for using those counts from a machine parsed point of view, I do see from user experience point of view why printing a bunch of 0 counts for runners wouldn't be ideal. |
@chancila @meisterT we use BEP to extract runner counts, but it's not a very concrete set of information and you need to make assumptions on what the runner means in terms of its relation to caching. Execution logs provide concrete information about caching, the problem is on large targets the size of an execution log is huge, I've seen some of our executions logs in the size of GB (even when serialized into protobuf). Our attempt at calculation of cache hit ratio is the following (its crude and I don't think its accurate)
@chancila @meisterT we use BEP to extract runner counts, but it's not a very concrete set of information and you need to make assumptions on what the runner means in terms of its relation to caching. Execution logs provide concrete information about caching, the problem is on large targets the size of an execution log is huge, I've seen some of our executions logs in the size of GB (even when serialized into protobuf). Our attempt at calculation of cache hit ratio is the following (its crude and I don't think its accurate)
While this "works" its probably lossy at best if not completely wrong. |
isn't the example could be the build has 1000 total actions, 500 of which are not remote cacheable, and 500 which are...technically in this build if i get 500 remote cache hits I have 100% cache hit...this number is currently impossible to derive without the execution log. Currently we use the total number of actions / cache hits which gives us a signal we can track for regression as long as the ratio of cacheable/not cacheable doesn't change drastically. a significant portion of our build graph is not cacheable, this is on purpose because in some cases certain actions are faster to run locally
yea we have exec logs the size of 40gb+ but they compress down to a few 100 MBs with zstd and a couple of tweaks, mainly long range compression...it would be cool if bazel compressed the log as it writes it to avoid murdering IO...on macos our builds are significantly slower due to the IO hit of the exec log. |
Description of the feature request:
It would be ideal if we could surface cache hit ratios as first class information. As it stands now I'm only aware of two ways to try and calculate cache hit ratios from a users point of view. One is to create the execution log and parse it for
remoteCacheable
andremoteCacheHit
then calculate the standard hit ratio. The other is to use build event logs and pluck out the runner information, for exampleINFO: 1436 processes: 977 disk cache hit, 456 internal, 1 darwin-sandbox, 2 worker
. As discussed in https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1659330964512159?thread_ts=1659329231.807649&cid=C01E7TH8XK9 there are some heuristics we could use to assume some knowledge of cache hit ratio, but it is not concrete.The request here is to surface cache hit ratio for different dimensions
What underlying problem are you trying to solve with this feature?
We are trying to understand when we hit cache regressions from a client side point of view. Recently bazel was released with broken caching in 5.2.0 #15682 (comment), we (twitter) realized it when our remote build cache systems experienced a lower cache hit ratio on cache requests over the network. It would have been ideal for us to also log that kind of metrics from the clients point of view as the source of truth that the runner indeed did miss and what state it might have been in at that time. It would also allow us to put monitoring metrics around the health of the runners when executing on some branches, such as
master
.Easy access to this information would greatly help us identify regressions sooner and give us more confidence in bazel and rules upgrades.
Which operating system are you running Bazel on?
osx,linux
What is the output of
bazel info release
?No response
If
bazel info release
returnsdevelopment version
or(@non-git)
, tell us how you built Bazel.No response
What's the output of
git remote get-url origin; git rev-parse master; git rev-parse HEAD
?No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
The text was updated successfully, but these errors were encountered: