Long overflow in cache statistics #3503

timw6n · 2019-06-17T11:30:10Z

We use Guava caches inside a large enterprise application.

After a few days of runtime, calling the stats() method on heavily used caches throws an IllegalArgumentException at the equivalent of CacheStats.java:87, presumably because the total load time has overflowed a long and wrapped round to a negative value.

It looks like this is down to the load time stopwatch being started at the point that an entry is submitted for asynchronous refresh, i.e. added to the refresh queue. Our largest queues are drained in batches and end up with typical sizes in the tens of thousands, so the nanosecond-precision load time increases very quickly under these circumstances.

I am aware that it is a design decision for the cache statistics to be monotonically increasing, so a reasonably sane solution might be to simply drop the non-negative validation in the CacheStats constructor, or try to somehow wrap back to zero instead of Long.MIN_VALUE. Metrics systems using something similar to Graphite's nonNegativeDerivative should handle that gracefully anyway.

NB: I think this issue also affects Caffeine caches in the same way.


kluever · 2019-06-17T14:17:11Z

/cc @ben-manes

ben-manes · 2019-06-17T16:22:07Z

Thanks, I'll wait to see what the Guava team decides and emulate it. If this is not resolved in a timely manner, then I'll probably add a Math.max(0, totalLoadTime.sum()) to the default StatsCounter. Since an overflow occurs only after 292 years of accumulated load time, it should be rarely hit and in Caffeine you can supply your own StatsCounter, e.g. to publish negative values if desired, making it easy to work around. I slightly lean against removing the negative check in CacheStats, but I will follow Guava's direction if they choose to.

timw6n · 2019-06-17T16:54:00Z

Thanks @ben-manes. Sounds good.

Just to give a more detailed example of how we're seeing this seemingly very unlikely behaviour:

Imagine a cache with a few hundred thousand entries, with "refresh after write" of 5 minutes so that frequently-requested items are reasonably fresh.

The reload method on the loader adds the element to a queue which, to optimise network calls to a slow downstream service, drains slowly in large batches, i.e. with bulk calls.

With an average queue size of 50k, we are effectively accruing 5.7 years (50,000 hours) of load time per hour, well over a hundred years per day. Hence seeing the overflows after only a few days of the application running in production.
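The arithmetic above can be checked with a short sketch. It assumes, as in the description, a steady backlog of 50,000 queued reloads, each accruing one nanosecond of load time per nanosecond of wall-clock time. The class and method names are illustrative only.

```java
import java.util.concurrent.TimeUnit;

final class LoadTimeOverflowEstimate {
  /** Wall-clock hours until a long nanosecond counter overflows, given a steady backlog. */
  static double hoursUntilOverflow(long averageQueueSize) {
    // Every queued entry accrues load time in parallel, so the counter grows
    // averageQueueSize times faster than the wall clock.
    double nanosPerHour = (double) TimeUnit.HOURS.toNanos(1);
    return Long.MAX_VALUE / (nanosPerHour * averageQueueSize);
  }

  public static void main(String[] args) {
    double hours = hoursUntilOverflow(50_000);
    System.out.printf("overflow after about %.1f hours (%.1f days)%n", hours, hours / 24);
  }
}
```

With a queue size of 50,000 this comes out at roughly 51 hours, a little over two days, which matches the overflows appearing after a few days in production.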

ben-manes · 2019-06-17T23:43:01Z

One resolution is to calculate the loadTime by wrapping the task so that only the execution is included, not the wait time in the executor. The time of a load-on-miss does not include the time waiting to lock the hash table. In Caffeine, we include Map.compute methods as loads whereas Guava does not (which seems wrong, imho). Similarly, in Caffeine, an AsyncCache computes the time only for the method itself, not how long it waits in the executor. So an argument that the load time is only the operation's execution time, not any queuing time, would fit all other usages. This would also avoid needing to take care of the overflow directly.

It is still useful to know the service rate of an executor, but that should be instrumented independently.
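A minimal sketch of that separation, assuming the work can be handed to an executor as a Supplier. TimedTask and its fields are hypothetical names, not Guava or Caffeine API, and the clock is injected (System::nanoTime in practice) so the two durations are easy to verify.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical wrapper that records queue wait and execution time separately. */
final class TimedTask<V> implements Supplier<V> {
  private final Supplier<V> delegate;
  private final LongSupplier nanoClock;   // e.g. System::nanoTime
  private final LongAdder executionNanos; // what would be reported as load time
  private final LongAdder queueWaitNanos; // executor service rate, tracked on its own
  private final long submittedAt;

  TimedTask(Supplier<V> delegate, LongSupplier nanoClock,
      LongAdder executionNanos, LongAdder queueWaitNanos) {
    this.delegate = delegate;
    this.nanoClock = nanoClock;
    this.executionNanos = executionNanos;
    this.queueWaitNanos = queueWaitNanos;
    this.submittedAt = nanoClock.getAsLong(); // read when the task is created and queued
  }

  @Override
  public V get() {
    long startedAt = nanoClock.getAsLong();
    queueWaitNanos.add(startedAt - submittedAt);
    try {
      return delegate.get();
    } finally {
      // Only the time spent inside the delegate counts towards the load time.
      executionNanos.add(nanoClock.getAsLong() - startedAt);
    }
  }
}
```

Such a task could be passed to CompletableFuture.supplyAsync(task, executor), with only executionNanos feeding the cache's total load time.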

kluever · 2019-06-19T18:14:17Z

What about using LongMath.saturatedAdd(long, long) in CacheStats.plus(CacheStats)?

Presumably that would pin the totals to Long.MAX_VALUE, which seems better than having them wrap-around to negative values and then get reset to 0?
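For illustration, saturating addition can be sketched without Guava on the classpath. This hand-written saturatedAdd is a stand-in with the same contract as LongMath.saturatedAdd(long, long), not Guava's actual source, and the class name is made up.

```java
final class SaturatedMath {
  /** Adds a and b, clamping to Long.MAX_VALUE or Long.MIN_VALUE instead of wrapping. */
  static long saturatedAdd(long a, long b) {
    long sum = a + b;
    // The addition overflowed if and only if both operands have the same sign
    // and the sign of the result differs from it.
    if (((a ^ sum) & (b ^ sum)) < 0) {
      return a < 0 ? Long.MIN_VALUE : Long.MAX_VALUE;
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(Long.MAX_VALUE + 1);               // wraps to Long.MIN_VALUE
    System.out.println(saturatedAdd(Long.MAX_VALUE, 1));  // stays at Long.MAX_VALUE
  }
}
```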

kluever · 2019-06-19T18:45:10Z

I suppose we'd also need to use it in loadCount(), requestCount(), loadExceptionRate(), and averageLoadPenalty().

ben-manes · 2019-06-19T19:09:13Z

I think plus and minus being saturated make sense. SimpleStatsCounter.snapshot() also has to be changed. The counters are stored using LongAdder for efficiency, so we can detect the wrap-around only when snapshotting. I think the recommendations boil down to,

1. Use LongMath.saturatedAdd and LongMath.saturatedSubtract in CacheStats plus and minus.
2. Use Math.max in SimpleStatsCounter.snapshot when creating a CacheStats.
3. Ideally, calculate the loadTime as only its execution duration, do not include the queuing wait time.

1 & 2 are trivial. 3 may not be doable since CacheLoader.reload returns a future, so we can't wrap the execution logic.

cpovirk · 2019-06-19T21:59:55Z

RE: (2):

> Use Math.max in SimpleStatsCounter.snapshot when creating a CacheStats.

Meaning, set it to 0 if it's negative?

Would it make sense to manually compare to zero and set to MAX_VALUE instead? As in:

long hits = hitCount.sum();
... hits >= 0 ? hits : Long.MAX_VALUE ...

Of course it's possible that the value would later wrap so far that it goes past 0 again, at which point we'd start returning positive values that are too low, but maybe it's still better to return MAX_VALUE as long as we can, rather than 0?
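A small self-contained sketch of that comparison, using a single LongAdder-backed counter. ClampedCounter and its methods are hypothetical and much simpler than the real SimpleStatsCounter.

```java
import java.util.concurrent.atomic.LongAdder;

final class ClampedCounter {
  private final LongAdder hitCount = new LongAdder();

  void recordHits(long count) {
    hitCount.add(count);
  }

  /** Returns the hit count, reporting Long.MAX_VALUE once the sum has wrapped negative. */
  long hitCountSnapshot() {
    long hits = hitCount.sum();
    return hits >= 0 ? hits : Long.MAX_VALUE;
  }

  public static void main(String[] args) {
    ClampedCounter counter = new ClampedCounter();
    counter.recordHits(Long.MAX_VALUE);
    counter.recordHits(1); // the underlying sum is now Long.MIN_VALUE
    System.out.println(counter.hitCountSnapshot() == Long.MAX_VALUE);
  }
}
```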

A more sophisticated but still incomplete (specifically, racy, though maybe it could be made to work fully) option is to track whether any snapshot has ever seen a negative value for each metric. We'd have to maintain AtomicBoolean hitCountSaturated, etc. If that boolean were ever set, then we'd always return MAX_VALUE for that value.

Maaartinus · 2019-06-19T23:21:00Z

@cpovirk Really racy? AFAIK for booleans which get only set and never cleared, volatile must do. You obviously may always miss the overflow...

Anyway, hitCountSaturated doesn't prevent you from returning 3, 2, 1, 0 in that order (pretty improbable, but possible). For a while, I thought that hitCountDecreased would be better for detecting overflow, but with LongAdder a decrease might be possible even without overflow (just guessing that it may happen; the docs aren't explicit about that).

Something like

long hitCount = hitCountLongAdder.sum();
long hitCount2 = hitCount >= 0 ? hitCount : Long.MAX_VALUE;
long hitCount3 = hitCountAtomicLong.updateAndGet(x -> Math.max(x, hitCount2));

might be better... maybe...
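Fleshed out into a runnable sketch, the idea is to keep a high-water mark so that a snapshot never reports less than an earlier one. MonotonicHitCount is a hypothetical class, and the extra AtomicLong update on every snapshot is a cost the real counters may not want to pay.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

final class MonotonicHitCount {
  private final LongAdder hits = new LongAdder();
  private final AtomicLong highWaterMark = new AtomicLong();

  void recordHits(long count) {
    hits.add(count);
  }

  /** Returns a value that never decreases between calls, even after the sum wraps. */
  long snapshot() {
    long raw = hits.sum();
    long clamped = raw >= 0 ? raw : Long.MAX_VALUE;
    return highWaterMark.updateAndGet(previous -> Math.max(previous, clamped));
  }
}
```

Once a snapshot has observed a negative sum, every later snapshot returns Long.MAX_VALUE, even if the raw sum wraps past zero again. An overflow that wraps all the way back to a positive value between two snapshots would still go unnoticed.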

ben-manes · 2019-06-20T00:11:21Z

> Would it make sense to manually compare to zero and set to MAX_VALUE instead?

I'd be happy with that.

Once overflow is reached, I think the stats should be considered non-deterministic. If users want to do better then they should implement a StatsCounter and we should avoid penalizing performance for an uncommon case. Then if they want better overflow logic, such as 128-bit counters, it's solvable without us providing a canonical fix.

cpovirk · 2019-06-24T17:23:30Z

> @cpovirk Really racy? AFAIK for booleans which get only set and never cleared, volatile must do. You obviously may always miss the overflow...

Yeah, I was thinking of the extremely unlikely case in which two threads read at almost the same time, one with a negative value after the overflow and another with a then-positive-again value after the overflow. The latter might not see the former's update to the boolean. But you're right that this isn't really any worse (or likely) than missing the overflow entirely.

> > Would it make sense to manually compare to zero and set to MAX_VALUE instead?
>
> I'd be happy with that.
>
> Once overflow is reached, I think the stats should be considered non-deterministic. If users want to do better then they should implement a StatsCounter and we should avoid penalizing performance for an uncommon case. Then if they want better overflow logic, such as 128-bit counters, it's solvable without us providing a canonical fix.

That's fair. As long as we've got some kind of handling to avoid negatives, I feel pretty good. Maybe MAX_VALUE is slightly better, but it's only a stopgap if we're not trying my boolean approach (and you've convinced me that we shouldn't).

Fixes #3503 RELNOTES=avoid overflows/underflows in CacheStats ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=254899180

Use saturatedToNanos() in CacheBuilder to avoid overflows. google/guava@7d04f72 Use LongMath.saturatedAdd/Subtract in CacheStats. google/guava@9f3d048 google/guava#3503 Added overflow handling to snapshot() in ConcurrentStatsCounter. (missing in Guava?) Co-authored-by: Kurt Alfred Kluever <[email protected]>


ben-manes · 2019-06-30T15:57:42Z

@ronshapiro This is not fixed until SimpleStatsCounter.snapshot has overflow protection as well.

kluever · 2019-07-01T14:46:34Z

@ben-manes Sorry, what needs to be done to SimpleStatsCounter.snapshot?

    public CacheStats snapshot() {
      return new CacheStats(
          hitCount.sum(),
          missCount.sum(),
          loadSuccessCount.sum(),
          loadExceptionCount.sum(),
          totalLoadTime.sum(),
          evictionCount.sum());
    }

ben-manes · 2019-07-01T14:52:02Z

The sum can overflow and then the constructor throws an exception. That was in the original description of this problem.

kluever · 2019-07-01T15:04:51Z

OK so the overflow is really happening inside the LongAdder.sum() impl, right?

ben-manes · 2019-07-01T15:11:55Z

That’s my understanding, yes.

kluever · 2019-07-01T15:22:34Z

Gotcha - Hmm, I'm not super keen on mucking with LongAdder, since it's from Doug Lea's jsr166e.

The alternative is also gross though, and would require us to change negative results to Long.MAX_VALUE. Blehhhh

ben-manes · 2019-07-01T15:32:44Z

Yep, that was @cpovirk’s suggestion which I followed when porting your commits.

ben-manes/caffeine@aa63462#diff-276cf7945bb6416652e3b90ab41ef47f

ben-manes · 2019-07-01T17:04:28Z

jshell> import java.util.concurrent.atomic.*;
jshell> var x = new LongAdder();
x ==> 0
jshell> x.add(Long.MAX_VALUE);
jshell> x.add(1)
jshell> x.sum();
$6 ==> -9223372036854775808

kluever · 2019-07-03T12:41:15Z

@ben-manes Thanks - I went with a similar approach, but I'm a little concerned about:
<some large value> + <some large value> + <some large value>
...that results in an overflow, but ends up being positive (and thus doesn't get saturated).

...but I also don't want to start messing with LongAdder.
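The concern is easy to reproduce. In this small demo (the class name is made up), three large additions wrap twice and leave a positive sum, so a check for negative values would not catch the overflow.

```java
import java.util.concurrent.atomic.LongAdder;

final class DoubleWrapDemo {
  static long sumOfThreeMaxValues() {
    LongAdder adder = new LongAdder();
    adder.add(Long.MAX_VALUE);
    adder.add(Long.MAX_VALUE); // the sum has wrapped to -2
    adder.add(Long.MAX_VALUE); // the sum is positive again: Long.MAX_VALUE - 2
    return adder.sum();
  }

  public static void main(String[] args) {
    long sum = sumOfThreeMaxValues();
    System.out.println(sum >= 0);                   // true, so a "negative means overflow" check misses it
    System.out.println(sum == Long.MAX_VALUE - 2);  // true
  }
}
```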

lowasser · 2019-07-03T15:59:51Z

We could possibly use LongAccumulator with LongMath.saturatedAdd?

kluever · 2019-07-03T16:05:02Z

We'd still need a solution for the Android branch.

Maybe pushing LongMath.saturatedAdd() calls into LongAdder isn't the worst thing in the world?

ben-manes · 2019-07-03T16:18:52Z

LongAccumulator would work, as you can push the saturatedAdd as the accumulation function.

> Class {@link LongAdder} provides analogs of the functionality of
> this class for the common special case of maintaining counts and
> sums. The call {@code new LongAdder()} is equivalent to {@code new
> LongAccumulator((x, y) -> x + y, 0L)}.
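A sketch of how that might look, assuming only non-negative values are ever recorded (as is the case for counts and durations). Under that assumption a negative result can only mean overflow, so a simple sign check stands in for LongMath.saturatedAdd here. SaturatingCounter is a hypothetical name.

```java
import java.util.concurrent.atomic.LongAccumulator;

final class SaturatingCounter {
  // Assumes non-negative inputs: the sum of two non-negative longs is negative
  // only when the addition has overflowed. The function is side-effect-free and
  // order-independent for such inputs, as LongAccumulator requires.
  private final LongAccumulator total = new LongAccumulator(
      (a, b) -> {
        long sum = a + b;
        return sum < 0 ? Long.MAX_VALUE : sum;
      },
      0L);

  void add(long nonNegativeValue) {
    total.accumulate(nonNegativeValue);
  }

  long sum() {
    return total.get();
  }

  public static void main(String[] args) {
    SaturatingCounter counter = new SaturatingCounter();
    counter.add(Long.MAX_VALUE);
    counter.add(1);
    counter.add(Long.MAX_VALUE);
    System.out.println(counter.sum() == Long.MAX_VALUE); // pinned rather than wrapped
  }
}
```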

I think if we consider overflow non-deterministic and only promise to avoid exceptions, then it is not our responsibility to make counts correct. That means either is okay, and better handling is the user's responsibility by providing an alternative StatsCounter (note - Guava doesn't allow that currently).

See #3503 RELNOTES=n/a ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=256193549

Fixes #3503 RELNOTES=Fix potential overflow/IAE during cache stats calculations. ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=256243122


ben-manes · 2019-07-09T04:25:19Z

sync'd to Caffeine

ben-manes · 2019-08-06T06:22:47Z

This has been released in Caffeine 2.8.

RELNOTES=Add MediaType for "image/heif" and "image/jp2" ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=260960132

kluever added package=cache status=triaged labels Jun 17, 2019

raghsriniv added the P3 label Jun 24, 2019

ronshapiro mentioned this issue Jun 27, 2019

Moe Sync #3515

Closed

ronshapiro pushed a commit that referenced this issue Jun 27, 2019

Use LongMath.saturatedAdd/Subtract in CacheStats.

0a0f357

Fixes #3503 RELNOTES=avoid overflows/underflows in CacheStats ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=254899180

ronshapiro mentioned this issue Jun 28, 2019

Moe Sync #3516

Closed

ronshapiro pushed a commit that referenced this issue Jun 28, 2019

Use LongMath.saturatedAdd/Subtract in CacheStats.

de78928

Fixes #3503 RELNOTES=avoid overflows/underflows in CacheStats ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=254899180

This was referenced Jun 28, 2019

DO NOT SUBMIT: trying to fix travis #3518

Closed

Moe Sync #3519

Merged

ronshapiro pushed a commit that referenced this issue Jun 28, 2019

Use LongMath.saturatedAdd/Subtract in CacheStats.

9f3d048

Fixes #3503 RELNOTES=avoid overflows/underflows in CacheStats ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=254899180

ronshapiro closed this as completed in #3519 Jun 30, 2019

ronshapiro pushed a commit that referenced this issue Jun 30, 2019

Use LongMath.saturatedAdd/Subtract in CacheStats.

705101e

Fixes #3503 RELNOTES=avoid overflows/underflows in CacheStats ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=254899180

ronshapiro reopened this Jun 30, 2019

kluever self-assigned this Jul 1, 2019

ronshapiro pushed a commit that referenced this issue Jul 8, 2019

Add a test for LongAdder overflow behavior.

558321c

See #3503 RELNOTES=n/a ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=256193549

ronshapiro mentioned this issue Jul 8, 2019

Moe Sync #3525

Merged

ronshapiro closed this as completed in #3525 Jul 9, 2019

ronshapiro pushed a commit that referenced this issue Jul 9, 2019

Add a test for LongAdder overflow behavior.

ab7caa4

See #3503 RELNOTES=n/a ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=256193549

fburas85 referenced this issue Aug 6, 2019

Add MediaType for "image/heif" and "image/jp2"

508696a

RELNOTES=Add MediaType for "image/heif" and "image/jp2" ------------- Created by MOE: https://github.com/google/moe MOE_MIGRATED_REVID=260960132

This was referenced Aug 4, 2021

Update dependency com.github.ben-manes.caffeine:caffeine to v3 - autoclosed grails/gorm-mongodb#269

Closed

Update dependency com.github.ben-manes.caffeine:caffeine to v3 - autoclosed grails/grails-core#11968

Closed
