
[WIP] broker: do not drop dirty cache entries on error #4524

Closed

Conversation

@chu11 (Member) commented Aug 26, 2022

Problem: If an error occurs during a content store, a dirty cache entry can be lost forever because it is not on the flush list or the lru list. As a result, the "dirty count" of entries will be inconsistent with the known dirty entries (i.e. entries on the flush list or in the process of being stored).

Solution: If a content store fails, add the entry to a new flush_errors list so it can be tried again during a forced content flush.

Fixes #4472

Notes:

While I'm in the middle of all this broker cache code, I thought it would be wise to try and fix this up before forgetting about this chunk of the broker. I'm marking WIP b/c:

A) this is based on code review while trying to figure out #4482. We haven't actually seen this be an issue in practice.

B) while not very complex, it adds a bit more complexity than I wanted. I couldn't put the dirty entries back on the flush list b/c that could lead to an infinite loop, with dirty cache entries retried over and over (assuming the error is ongoing).

B2) we could make this more optimal by checking for certain error conditions, but I elected not to do that for now.

C) I don't know how to test this without introducing a lot of instrumentation. So punted on that for now.

So I'm thinking either:

  1. We consider merging based on code review alone

  2. We park this code as "for future knowledge" in case we ever need to look into it more seriously down the road

Problem: If an error occurs during a content store, a dirty
cache entry can be lost forever because it is not on the flush
list or the lru list.  As a result, the "dirty count" of entries
will be inconsistent with the known dirty entries (i.e. entries
on the flush list or in the process of being stored).

Solution: If a content store fails, add the entry to a new
flush_errors list so it can be tried again at a later time.

Fixes flux-framework#4472
@garlick (Member) commented Aug 26, 2022

I'd wager that the most likely cause of failure would be a full file system, and no amount of retries is going to fix that. For now, I would suggest we log a LOG_CRIT message to be sure the first error is captured by the systemd journal (ENOSPC for example) and just stop the reactor. It would be OK IMHO to forgo testing for this for now.

@chu11 (Member, Author) commented Aug 26, 2022

> I'd wager that the most likely cause of failure would be a full file system, and no amount of retries is going to fix that. For now, I would suggest we log a LOG_CRIT message to be sure the first error is captured by the systemd journal (ENOSPC for example) and just stop the reactor. It would be OK IMHO to forgo testing for this for now.

Hmmm, I wasn't contemplating going down this path.

Alternate idea: could we unregister the backing module? (This goes along with PR #4492; things could keep on working for a while.)

@garlick (Member) commented Aug 26, 2022

> Alternate idea, could we unregister the backing module?

That would create dangling refs from the KVS.

We could mark it read-only so that current cache entries remain dirty (as when no backing module is loaded), and then maybe try to put the system in some kind of safe mode, or shut it down with a dump (to some configured alternate location).

We may want to just park this for a while though. Maybe a distributed/replicated content store (as @trws was alluding to in slack earlier) would reduce the odds of getting here and then we might prefer the "fail fast" alternative in the unlikely event that we do.

@chu11 (Member, Author) commented Aug 26, 2022

> We may want to just park this for a while though. Maybe a distributed/replicated content store (as @trws was alluding to in slack earlier) would reduce the odds of getting here and then we might prefer the "fail fast" alternative in the unlikely event that we do.

Yeah, let's park it for the time being. I do like your idea of logging to LOG_CRIT instead. Perhaps I'll just do that and call it a day. I'll put that in a different PR since I want to preserve this branch for the future.

Successfully merging this pull request may close these issues.

broker: content-cache acct_dirty can be greater than the number of flushable entries