This repository has been archived by the owner on Nov 14, 2024. It is now read-only.
[LW] Add additional fail-safe mechanisms around stored state #5658
Merged
gmaretic
approved these changes
Sep 24, 2021
Looks good, but I'd add a bit more info to the log lines
atlasdb-impl-shared/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/CacheStoreImpl.java
...sdb-impl-shared/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/SnapshotStoreImpl.java
...pl-shared/src/main/java/com/palantir/atlasdb/keyvalue/api/watch/LockWatchEventCacheImpl.java
private void validateStateSize() {
    if (timestampMap.size() > MAXIMUM_SIZE || livingVersions.size() > MAXIMUM_SIZE) {
        log.warn(
                "Timestamp state store has exceeded its maximum size. This likely indicates a memory leak",
same as above
Goals (and why):
We've hit several memory leaks in the past within the lock watch code path. The aim here is not only to prevent these from taking down a service, but also to surface them automatically. Rather than adding a metric for each internal structure, exceeding a size bound causes the cache to fall back. This in turn increments the metric for cache fallbacks, which we can then monitor or create alerts from.
Implementation Description (bullets):
- Throw `SafeIllegalStateException` if internal structures exceed a constant threshold. In practice, these structures should be approximately proportional in size to the number of open transactions, which will almost never exceed 5k, and thus 20k is a very safe limit.
- Note that any runtime exception is caught in the resilient cache and rewrapped as a retryable exception after the fallback has been selected.
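For reviewers who want the shape of the mechanism without opening the diff, here is a minimal, self-contained sketch of the two pieces described above: a size check that throws once a structure passes the threshold, and a wrapper that catches the runtime exception, selects the fallback, counts it, and rethrows something retryable. It is not the AtlasDB code. `BoundedStateSketch`, `TimestampStore`, `ResilientStore`, `RetryableException`, and the no-op fallback are hypothetical names and behaviour chosen for illustration, a plain `IllegalStateException` stands in for `SafeIllegalStateException`, and an `AtomicLong` stands in for the real fallback metric. Only the 20k limit comes from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class BoundedStateSketch {
    // 20k is the limit discussed in this PR; everything else here is illustrative.
    static final int MAXIMUM_SIZE = 20_000;

    // Hypothetical stand-in for whatever retryable exception the real cache throws.
    static final class RetryableException extends RuntimeException {
        RetryableException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical store whose size should track the number of open transactions.
    static final class TimestampStore {
        private final Map<Long, Long> timestampMap = new HashMap<>();
        private final int maximumSize;

        TimestampStore(int maximumSize) {
            this.maximumSize = maximumSize;
        }

        void put(long startTimestamp, long version) {
            timestampMap.put(startTimestamp, version);
            validateStateSize();
        }

        // Fail loudly instead of growing without bound.
        // IllegalStateException stands in for SafeIllegalStateException.
        private void validateStateSize() {
            if (timestampMap.size() > maximumSize) {
                throw new IllegalStateException(
                        "Timestamp state store has exceeded its maximum size: " + timestampMap.size());
            }
        }
    }

    // Hypothetical resilient wrapper: on any runtime exception it switches to a
    // no-op fallback, bumps a counter, and rethrows as retryable.
    static final class ResilientStore {
        private final TimestampStore delegate;
        private final AtomicLong fallbackCount = new AtomicLong();
        private volatile boolean fallenBack = false;

        ResilientStore(int maximumSize) {
            this.delegate = new TimestampStore(maximumSize);
        }

        void put(long startTimestamp, long version) {
            if (fallenBack) {
                return; // fallback selected earlier: do nothing
            }
            try {
                delegate.put(startTimestamp, version);
            } catch (RuntimeException e) {
                fallenBack = true;               // select the fallback first
                fallbackCount.incrementAndGet(); // this counter is what gets monitored
                throw new RetryableException("State store failed; fell back", e);
            }
        }

        long fallbackCount() {
            return fallbackCount.get();
        }
    }

    public static void main(String[] args) {
        ResilientStore store = new ResilientStore(MAXIMUM_SIZE);
        store.put(1L, 1L);
        System.out.println("fallbacks so far: " + store.fallbackCount());
    }
}
```

In this sketch the caller that trips the limit sees one retryable failure, and every later call goes through the fallback path, so a leak shows up as a single increment of the fallback counter rather than as unbounded memory growth.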
Testing (What was existing testing like? What have you done to improve it?):
No tests for these bounds.
Concerns (what feedback would you like?):
Should we just take the $$$ hit and add metrics? My instincts say no, since we likely won't need the metrics unless there is a real issue.
Where should we start reviewing?:
Diff
Priority (whenever / two weeks / yesterday):
This/next week