[LW] Add additional fail-safe mechanisms around stored state #5658

Jolyon-S · 2021-09-23T13:02:26Z

Goals (and why):
We've hit several memory leaks in the past within the lock watch code path. The aim here is not only to prevent these from taking down a service, but also to surface them automatically. Rather than adding a metric for each internal structure, an oversized structure causes the cache to fall back. This in turn increments the metric for cache fallbacks, which we can then monitor or alert on.

Implementation Description (bullets):
Throw SafeIllegalStateException if any internal structure exceeds a constant size threshold. In practice, these structures should be approximately proportional in size to the number of open transactions, which will almost never exceed 5k, so 20k is a very safe limit.
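
For illustration, a minimal sketch of the bound check described above. `BoundedStateStore` and its methods are hypothetical stand-ins rather than the classes touched in this PR, and plain `IllegalStateException` is used instead of `SafeIllegalStateException` so the snippet has no dependencies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one of the lock watch stores; not an AtlasDB class.
final class BoundedStateStore {
    // 20k, as in the description: about 4x the ~5k open transactions expected at most.
    static final int MAXIMUM_SIZE = 20_000;

    private final Map<Long, Long> timestampToVersion = new HashMap<>();

    void put(long startTimestamp, long version) {
        timestampToVersion.put(startTimestamp, version);
        // Validate after every mutation that can grow the structure.
        validateStateSize();
    }

    void remove(long startTimestamp) {
        timestampToVersion.remove(startTimestamp);
    }

    int size() {
        return timestampToVersion.size();
    }

    private void validateStateSize() {
        int currentSize = timestampToVersion.size();
        if (currentSize > MAXIMUM_SIZE) {
            // The PR throws SafeIllegalStateException here; this sketch uses the JDK equivalent.
            throw new IllegalStateException(
                    "Exceeded maximum state size: " + currentSize + " > " + MAXIMUM_SIZE);
        }
    }
}
```

Because the check runs after the insert, the offending entry is still present when the exception is thrown; in this sketch, recovery is left to whatever catches the exception.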

Note that any runtime exception is caught in the resilient cache and rewrapped as a retryable exception after the fallback has been selected.
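
The catch-and-rewrap behaviour could look roughly like the sketch below. Every name here (`ResilientCacheSketch`, `RetryableCacheException`, the `Supplier`-based delegate, the `AtomicLong` counter standing in for the fallback metric) is an assumption made for illustration and does not reflect the actual resilient cache in AtlasDB.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical retryable exception type; the real code uses its own exception.
final class RetryableCacheException extends RuntimeException {
    RetryableCacheException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical wrapper: delegates to a primary cache until it fails, then to a fallback.
final class ResilientCacheSketch {
    private final Supplier<String> fallback;
    // Stand-in for the cache fallback metric mentioned in the goals.
    private final AtomicLong fallbackCount = new AtomicLong();
    private volatile Supplier<String> delegate;

    ResilientCacheSketch(Supplier<String> primary, Supplier<String> fallback) {
        this.delegate = primary;
        this.fallback = fallback;
    }

    String read() {
        try {
            return delegate.get();
        } catch (RuntimeException e) {
            // Select the fallback first, then surface a retryable failure to the caller.
            delegate = fallback;
            fallbackCount.incrementAndGet();
            throw new RetryableCacheException("Cache failed and has fallen back; please retry", e);
        }
    }

    long fallbackCount() {
        return fallbackCount.get();
    }
}
```

Under these assumptions, a size violation in the primary cache fails one call, bumps the counter, and the retried call is then served by the fallback.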

Testing (What was existing testing like? What have you done to improve it?):
No tests for these bounds.

Concerns (what feedback would you like?):
Should we just take the $$$ hit and add metrics? My instincts say no, since we likely won't need the metrics unless there is a real issue.

Where should we start reviewing?:
Diff

Priority (whenever / two weeks / yesterday):
This/next week

changelog-app · 2021-09-23T13:02:32Z

Generate changelog in `changelog/@unreleased`

Type

Description

Add additional fail-safe mechanisms around the size of lock watch internal state; the cache will fall back if that state exceeds a certain size.

Check the box to generate changelog(s)

Generate changelog entry

gmaretic

Looks good, but I'd add a bit more info to the log lines

atlasdb-impl-shared/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/CacheStoreImpl.java

...sdb-impl-shared/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/SnapshotStoreImpl.java

...pl-shared/src/main/java/com/palantir/atlasdb/keyvalue/api/watch/LockWatchEventCacheImpl.java

gmaretic · 2021-09-24T13:19:54Z

...b-impl-shared/src/main/java/com/palantir/atlasdb/keyvalue/api/watch/TimestampStateStore.java

+    private void validateStateSize() {
+        if (timestampMap.size() > MAXIMUM_SIZE || livingVersions.size() > MAXIMUM_SIZE) {
+            log.warn(
+                    "Timestamp state store has exceeded its maximum size. This likely indicates a memory leak",


same as above

add logging around possibly leaky stores

51ba28a

Add generated changelog entries

4528a93

Jolyon-S requested a review from gmaretic September 23, 2021 13:11

Jolyon-S added 3 commits September 23, 2021 14:11

fix test

0ae3b10

Merge branch 'acv-glue-gun' of github.com:palantir/atlasdb into acv-glue-gun

34f5399

Merge branch 'develop' into acv-glue-gun

c7e831f

gmaretic approved these changes Sep 24, 2021

more logging

f8e92c7

Jolyon-S added the merge when ready label Sep 24, 2021

Jolyon-S added 2 commits September 24, 2021 15:26

stick

27a41fe

star import why

5ee75eb

bulldozer-bot bot merged commit 30cdf93 into develop Sep 27, 2021

bulldozer-bot bot deleted the acv-glue-gun branch September 27, 2021 09:27

Jolyon-S added a commit that referenced this pull request Sep 28, 2021

[LW] Add additional fail-safe mechanisms around stored state (#5658)

ba9eafa
