Skip to content
This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

[LW] Add additional fail-safe mechanisms around stored state #5658

Merged
merged 8 commits into from
Sep 27, 2021

Conversation

Jolyon-S
Copy link
Contributor

@Jolyon-S Jolyon-S commented Sep 23, 2021

Goals (and why):
We've hit several memory leaks in the past within the lock watch code path. The aim here is to not only prevent these from taking down a service, but also to automatically surface them. Instead of adding metrics for each of these, they instead cause the cache to fall back. This in turn increments the metric for cache fallbacks, which we can then monitor or create alerts from.

Implementation Description (bullets):
Throw SafeIllegalStateException if internal structures exceed a constant threshold. In practice, these structures should be approximately proportional in size to the number of open transactions, which will almost never exceed 5k, and thus 20k is a very safe limit.

Note that any runtime exception is caught in the resilient cache and rewrapped as a retryable exception after the fallback has been selected.

Testing (What was existing testing like? What have you done to improve it?):
No tests for these bounds.

Concerns (what feedback would you like?):
Should we just take the $$$ hit and add metrics? My instincts say no, since we likely won't need the metrics unless there is a real issue.

Where should we start reviewing?:
Diff

Priority (whenever / two weeks / yesterday):
This/next week

@changelog-app
Copy link

changelog-app bot commented Sep 23, 2021

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Add additional failsafe mechanisms around the size of lock watch internal state - this will cause the cache to fallback if it exceeds a certain size.

Check the box to generate changelog(s)

  • Generate changelog entry

@Jolyon-S Jolyon-S requested a review from gmaretic September 23, 2021 13:11
Copy link
Contributor

@gmaretic gmaretic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I'd add a bit more info to the log lines

private void validateStateSize() {
if (timestampMap.size() > MAXIMUM_SIZE || livingVersions.size() > MAXIMUM_SIZE) {
log.warn(
"Timestamp state store has exceeded its maximum size. This likely indicates a memory leak",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@bulldozer-bot bulldozer-bot bot merged commit 30cdf93 into develop Sep 27, 2021
@bulldozer-bot bulldozer-bot bot deleted the acv-glue-gun branch September 27, 2021 09:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants