
[WIP] broker: do not drop dirty cache entries on error #4524

Closed

Conversation

@chu11 (Member) commented Aug 26, 2022

Problem: If an error occurs during a content store, a dirty cache entry can be lost forever because it is not on the flush list or the lru list. As a result, the "dirty count" of entries will be inconsistent with the known dirty entries (i.e. entries on the flush list or in the process of being stored).

Solution: If a content store fails, add the entry to a new flush_errors list so it can be tried again during a forced content flush.

Fixes #4472

Notes:

While I'm in the middle of all this broker cache code, I thought it would be wise to try and fix this up before forgetting about this chunk of the broker. I'm marking WIP b/c:

A) this is based on code review while trying to figure out #4482. We haven't actually seen this be an issue in practice.

B) while not very complex, it adds a bit more complexity than I wanted. I couldn't put the dirty entries back on the flush list b/c that could lead to an infinite loop, with dirty cache entries retried over and over (assuming the error is ongoing).

B2) we could make this more optimal by checking for certain error conditions, but I elected not to do that for now.

C) I don't know how to test this without introducing a lot of instrumentation. So punted on that for now.

So I'm thinking either:

  1. We consider merging based on code review alone

  2. We park this code as "for future knowledge" in case we ever need to look into it more seriously down the road

Problem: If an error occurs during a content store, a dirty
cache entry can be lost forever because it is not on the flush
list or the lru list.  As a result, the "dirty count" of entries
will be inconsistent with the known dirty entries (i.e. entries
on the flush list or in the process of being stored).

Solution: If a content store fails, add the entry to a new
flush_errors list so it can be tried again at a later time.

Fixes flux-framework#4472
@garlick (Member) commented Aug 26, 2022

I'd wager that the most likely cause of failure would be a full file system, and no amount of retries is going to fix that. For now, I would suggest we log a LOG_CRIT message to be sure the first error is captured by the systemd journal (ENOSPC for example) and just stop the reactor. It would be OK IMHO to forgo testing for this for now.

@chu11 (Member, Author) commented Aug 26, 2022

> I'd wager that the most likely cause of failure would be a full file system, and no amount of retries is going to fix that. For now, I would suggest we log a LOG_CRIT message to be sure the first error is captured by the systemd journal (ENOSPC for example) and just stop the reactor. It would be OK IMHO to forgo testing for this for now.

Hmmm, I wasn't contemplating going down this path.

Alternate idea: could we unregister the backing module? (This goes along with PR #4492; things could keep on working for a while.)

@garlick (Member) commented Aug 26, 2022

> Alternate idea, could we unregister the backing module?

That would create dangling refs from the KVS.

We could mark it read-only so that current cache entries remain dirty (as when no backing module is loaded), and then maybe try to put the system in some kind of safe mode, or shut it down with a dump (to some configured alternate location).

We may want to just park this for a while though. Maybe a distributed/replicated content store (as @trws was alluding to in slack earlier) would reduce the odds of getting here and then we might prefer the "fail fast" alternative in the unlikely event that we do.

@chu11 (Member, Author) commented Aug 26, 2022

> We may want to just park this for a while though. Maybe a distributed/replicated content store (as @trws was alluding to in slack earlier) would reduce the odds of getting here and then we might prefer the "fail fast" alternative in the unlikely event that we do.

Yeah, let's park it for the time being. I do like your idea of logging to LOG_CRIT instead. Perhaps I'll just do that and call it a day. I'll put that in a different PR since I want to preserve this branch for the future.

Successfully merging this pull request may close these issues.

broker: content-cache acct_dirty can be greater than the number of flushable entries