[WIP] broker: do not drop dirty cache entries on error #4524
Problem: If an error occurs during a content store, a dirty cache entry can be lost forever, as it is on neither the flush list nor the lru list. As a result, the "dirty count" of entries becomes inconsistent with the known dirty entries (i.e. entries on the flush list or in the process of being stored).
Solution: If a content store fails, add the entry to a new flush_errors list so it can be tried again during a forced content flush.
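To make the idea concrete, here is a minimal, self-contained sketch of the approach. It is not the broker code: the `struct entry`, `struct cache`, `cache_mark_dirty()`, and `cache_flush()` names are hypothetical, and the store is modeled as a synchronous callback for brevity, whereas the real content store is asynchronous. It only illustrates that failed entries are parked on a separate `flush_errors` list, stay counted as dirty, and are retried only on a forced flush.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cache entry (not the broker's real struct) */
struct entry {
    bool dirty;
    struct entry *next;
};

struct list {
    struct entry *head;
    struct entry *tail;
};

struct cache {
    struct list flush;         /* dirty entries waiting to be stored */
    struct list flush_errors;  /* dirty entries whose store failed */
    int dirty_count;           /* total number of dirty entries */
};

static void list_append (struct list *l, struct entry *e)
{
    e->next = NULL;
    if (l->tail)
        l->tail->next = e;
    else
        l->head = e;
    l->tail = e;
}

static struct entry *list_pop (struct list *l)
{
    struct entry *e = l->head;
    if (e) {
        l->head = e->next;
        if (!l->head)
            l->tail = NULL;
        e->next = NULL;
    }
    return e;
}

int list_count (const struct list *l)
{
    int n = 0;
    for (const struct entry *e = l->head; e != NULL; e = e->next)
        n++;
    return n;
}

/* Assumed store callback: returns 0 on success, -1 on failure */
typedef int (*store_f)(struct entry *e, void *arg);

/* Example callbacks, for illustration only */
int store_ok (struct entry *e, void *arg) { (void)e; (void)arg; return 0; }
int store_fail (struct entry *e, void *arg) { (void)e; (void)arg; return -1; }

void cache_mark_dirty (struct cache *c, struct entry *e)
{
    e->dirty = true;
    c->dirty_count++;
    list_append (&c->flush, e);
}

/* Try to store each entry on the flush list exactly once.
 * A failed entry is parked on flush_errors instead of being put back on
 * the flush list, so this loop terminates even if the store keeps failing.
 * If force is true, entries parked by earlier failures are retried too.
 * Returns the number of entries that failed to store.
 */
int cache_flush (struct cache *c, store_f store, void *arg, bool force)
{
    struct entry *e;
    int failures = 0;

    if (force) {
        while ((e = list_pop (&c->flush_errors)))
            list_append (&c->flush, e);
    }
    while ((e = list_pop (&c->flush))) {
        if (store (e, arg) < 0) {
            list_append (&c->flush_errors, e); /* still dirty, still counted */
            failures++;
        }
        else {
            e->dirty = false;
            c->dirty_count--;
        }
    }
    return failures;
}
```

In this sketch, `dirty_count` always equals the number of entries on `flush` plus `flush_errors`, which is the consistency property the problem statement says is currently violated.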
Fixes #4472
Notes:
While I'm in the middle of all this broker cache code, I thought it would be wise to try and fix this up before forgetting about this chunk of the broker. I'm marking WIP b/c:
A) this is based on code review while trying to figure out #4482. We haven't actually hit this in practice.
B) while not very complex, it adds a bit more complexity than I wanted. I couldn't put the dirty entries back on the flush list b/c that could lead to an "infinite loop" as dirty cache entries get retried over and over again (assuming the error is ongoing).
B2) we could optimize this by checking for certain error conditions, but I elected not to do that for now
C) I don't know how to test this without introducing a lot of instrumentation, so I punted on that for now.
So I'm thinking either:
- we consider merging based on code review alone, or
- we park this code as "for future knowledge" in case we ever need to look into it more seriously down the road