Long overflow in cache statistics #3503
/cc @ben-manes |
Thanks, I'll wait to see what the Guava team decides and emulate. If this is not resolved timely, then I'll probably add a |
Thanks @ben-manes. Sounds good. Just to give a more detailed example of how we're seeing this seemingly very unlikely behaviour: Imagine a cache with a few hundred thousand entries, with "refresh after write" of 5 minutes so that frequently-requested items are reasonably fresh. The With an average queue size of 50k, we are effectively accruing 5.7 years (50,000 hours) of load time per hour, well over a hundred years per day. Hence seeing the overflows after only a few days of the application running in production. |
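As a rough check of the arithmetic above, here is a small sketch that estimates how long a nanosecond-precision total takes to pass Long.MAX_VALUE. It assumes every queued entry accrues load time continuously while it waits; the class and method names are made up for illustration and are not part of Guava or Caffeine.

```java
import java.util.concurrent.TimeUnit;

final class LoadTimeOverflowEstimate {
    /**
     * Estimates the wall-clock hours until a total load time kept in nanoseconds
     * exceeds Long.MAX_VALUE, assuming averageQueueSize entries are always waiting.
     */
    static double hoursUntilOverflow(long averageQueueSize) {
        // Each waiting entry accrues one nanosecond of load time per nanosecond of wall-clock time.
        double nanosAccruedPerHour = (double) averageQueueSize * TimeUnit.HOURS.toNanos(1);
        return Long.MAX_VALUE / nanosAccruedPerHour;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f hours until overflow%n", hoursUntilOverflow(50_000));
    }
}
```

With a queue of 50,000 this comes out at roughly 51 hours, which matches overflows showing up after only a few days in production.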
One resolution is to calculate the It is still useful to know the service rate of an executor, but that should be instrumented independently. |
What about using Presumably that would pin the totals to |
I suppose we'd also need to use it in |
I think
1 & 2 are trivial. 3 may not be doable since |
RE: (2):
Meaning, set it to 0 if it's negative? Would it make sense to manually compare to zero and set to long hitCount = hitCount.sum();
... hitCount >= 0 ? hitCount : Long.MAX_VALUE ... Of course it's possible that the value would later wrap so far that it goes past 0 again, at which point we'd start returning positive values that are too low, but maybe it's still better to return A more sophisticated but still incomplete (specifically, racy, though maybe it could be made to work fully) option is to track whether any snapshot has ever seen a negative value for each metric. We'd have to maintain |
@cpovirk Really racy? AFAIK for booleans which get only set and never cleared, volatile must do. You obviously may always miss the overflow... Anyway, Something like
might be better... maybe... |
I'd be happy with that. Once overflow is reached, I think the stats should be considered non-deterministic. If users want to do better then they should implement a |
Yeah, I was thinking of the extremely unlikely case in which two threads read at almost the same time, one with a negative value after the overflow and another with a then-positive-again value after the overflow. The latter might not see the former's update to the boolean. But you're right that this isn't really any worse (or likely) than missing the overflow entirely.
That's fair. As long as we've got some kind of handling to avoid negatives, I feel pretty good. Maybe |
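For reference, the "compare to zero" idea discussed above could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, and it does nothing about a counter that wraps all the way past zero again.

```java
import java.util.concurrent.atomic.LongAdder;

final class NonNegativeSnapshot {
    /** Reports a wrapped (negative) sum as Long.MAX_VALUE instead of a negative count. */
    static long nonNegative(long sum) {
        return sum >= 0 ? sum : Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        LongAdder hitCount = new LongAdder();
        hitCount.add(Long.MAX_VALUE);
        hitCount.add(1); // wraps around to Long.MIN_VALUE

        // The raw sum is negative; the clamped value is pinned at Long.MAX_VALUE.
        System.out.println(hitCount.sum());
        System.out.println(nonNegative(hitCount.sum()));
    }
}
```

This only promises that a snapshot never sees a negative value. Once the counter has wrapped, the reported number is no longer an accurate count.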
Fixes #3503 RELNOTES=avoid overflows/underflows in CacheStats ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=254899180
Use saturatedToNanos() in CacheBuilder to avoid overflows. google/guava@7d04f72 Use LongMath.saturatedAdd/Subtract in CacheStats. google/guava@9f3d048 google/guava#3503 Added overflow handling to snapshot() in ConcurrentStatsCounter. (missing in Guava?) Co-authored-by: Kurt Alfred Kluever <[email protected]>
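To illustrate what saturating arithmetic does when combining two stats values, here is a hand-rolled sketch. The commits above use Guava's LongMath.saturatedAdd and LongMath.saturatedSubtract; the helpers below are stand-ins written so the example runs without Guava on the classpath, and they are not the library's source.

```java
final class SaturatingMath {
    /** Returns a + b, or Long.MAX_VALUE / Long.MIN_VALUE if the true sum does not fit in a long. */
    static long saturatedAdd(long a, long b) {
        long naive = a + b;
        // Overflow happened iff both operands have a sign that differs from the result's sign.
        if (((a ^ naive) & (b ^ naive)) < 0) {
            return a < 0 ? Long.MIN_VALUE : Long.MAX_VALUE;
        }
        return naive;
    }

    /** Returns a - b, or Long.MAX_VALUE / Long.MIN_VALUE if the true difference does not fit in a long. */
    static long saturatedSubtract(long a, long b) {
        long naive = a - b;
        // Overflow happened iff the operands have different signs and the result's sign differs from a's.
        if (((a ^ b) & (a ^ naive)) < 0) {
            return a < 0 ? Long.MIN_VALUE : Long.MAX_VALUE;
        }
        return naive;
    }

    public static void main(String[] args) {
        // A plain a + b would wrap to a negative number here; the saturated version stays at the maximum.
        System.out.println(saturatedAdd(Long.MAX_VALUE, 1));
        System.out.println(saturatedSubtract(Long.MIN_VALUE, 1));
    }
}
```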
@ronshapiro This is not fixed until |
@ben-manes Sorry, what needs to be done to
|
The sum can overflow and then the constructor throws an exception. That was in the original description of this problem. |
OK so the overflow is really happening inside the |
That’s my understanding, yes. |
Gotcha. Hmm, I'm not super keen on mucking with The alternative is also gross though, and would require us to change negative results to |
Yep, that was @cpovirk’s suggestion which I followed when porting your commits. ben-manes/caffeine@aa63462#diff-276cf7945bb6416652e3b90ab41ef47f |
jshell> import java.util.concurrent.atomic.*;
jshell> var x = new LongAdder();
x ==> 0
jshell> x.add(Long.MAX_VALUE);
jshell> x.add(1)
jshell> x.sum();
$6 ==> -9223372036854775808 |
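The same wrap-around is what trips the non-negative check when a snapshot is built. Below is a minimal sketch of that failure mode. The Snapshot class is a hypothetical stand-in for a stats value object that validates its arguments; it is not the actual CacheStats code.

```java
import java.util.concurrent.atomic.LongAdder;

final class NegativeTotalRepro {
    /** Hypothetical stand-in for an immutable stats snapshot that rejects negative totals. */
    static final class Snapshot {
        final long totalLoadTime;

        Snapshot(long totalLoadTime) {
            if (totalLoadTime < 0) {
                throw new IllegalArgumentException("totalLoadTime must be non-negative: " + totalLoadTime);
            }
            this.totalLoadTime = totalLoadTime;
        }
    }

    /** Returns true if building a snapshot from the adder's current sum throws. */
    static boolean snapshotFails(LongAdder totalLoadTime) {
        try {
            new Snapshot(totalLoadTime.sum());
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        LongAdder totalLoadTime = new LongAdder();
        totalLoadTime.add(Long.MAX_VALUE);
        System.out.println(snapshotFails(totalLoadTime)); // still representable, so no exception

        totalLoadTime.add(1); // the sum wraps to Long.MIN_VALUE
        System.out.println(snapshotFails(totalLoadTime)); // the validation now rejects the negative sum
    }
}
```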
@ben-manes Thanks - I went with a similar approach, but I'm a little concerned about: ...but I also don't want to start messing with |
We could possibly use |
We'd still need a solution for the Android branch. Maybe pushing |
I think if we consider overflow non-deterministic and only promise to avoid exceptions, then it is not our responsibility to make counts correct. That means either is okay, and better handling is the user's responsibility by providing an alternative |
See #3503 RELNOTES=n/a ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=256193549
Fixes #3503 RELNOTES=Fix potential overflow/IAE during cache stats calculations. ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=256243122
sync'd to Caffeine |
This has been released in Caffeine 2.8. |
We use Guava caches inside a large enterprise application.
After a few days of runtime, calling the stats() method on heavily-used caches throws an IllegalArgumentException at the equivalent line to CacheStats.java:87, presumably because the total load time has overflowed a long and looped back round to a negative value.
It looks like this is down to the load time stopwatch being started at the point that an entry is submitted for asynchronous refresh, i.e. added to the refresh queue. Our largest queues are drained in batches and end up with usual sizes in the tens of thousands, so the nanosecond-precision load time increases very quickly under these circumstances.
I am aware that it is a design decision for the cache statistics to be monotonically increasing, so a reasonably sane solution might be to simply drop the non-negative validation as part of the CacheStats constructor, or try to somehow wrap back to zero instead of Long.MIN_VALUE. Metrics systems using something similar to Graphite's nonNegativeDerivative should handle that gracefully anyway.
NB: I think this issue also affects Caffeine caches in the same way.