
broker: content-cache acct_dirty can be greater than the number of flushable entries #4472

Closed
chu11 opened this issue Aug 5, 2022 · 7 comments

@chu11
Member

chu11 commented Aug 5, 2022

While working on #4267, I noticed that the acct_dirty field can sometimes be greater than the length of the flush list, even when a backing module is not loaded. A flux content flush will therefore hang, because acct_dirty will never reach zero, so the content-cache will never respond to the original flush request.

I'm not sure how this happened, but looking through the code, here are a couple of possibilities. Some of this text is duplicated from #4267.

  1. One potential path is in cache_store() and cache_store_continuation() (see the sketch at the end of this comment). An entry from the flush list is sent to the content backing module to be stored and is removed from the flush list. If an error occurs, cache_store_continuation() sees the error but does not put the entry back on the flush list, and acct_dirty is not decremented.

  2. Another potential (similar) path is when we unload the backing module, so that in-flight backing store requests get lost. acct_dirty stays incremented even though the entries have already been taken off the flush list.

(Update: now that I think about it, #2 is a strong possibility. I noticed this in the regression tests: once one of the test fixes for #4267 went into effect, content-sqlite was being loaded and reloaded regularly.)
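
To make path 1 concrete, here is a minimal, self-contained sketch of the accounting drift. It is not the broker's actual code: the names echo the real cache_store()/cache_store_continuation()/acct_dirty, but the structures and logic are deliberately simplified.

```c
/* Minimal sketch (not the actual flux-core code) of how acct_dirty can
 * drift from the flush list when a store error is dropped on the floor. */
#include <stdio.h>
#include <stdbool.h>

struct cache {
    int acct_dirty;     /* count of dirty entries */
    int flush_len;      /* stand-in for the flush list length */
};

/* an entry becomes dirty: counted and queued for flushing */
static void mark_dirty (struct cache *c)
{
    c->acct_dirty++;
    c->flush_len++;
}

/* the entry is handed to the backing module: it leaves the flush list
 * now, but stays counted in acct_dirty until the store response arrives */
static void cache_store (struct cache *c)
{
    c->flush_len--;
}

/* store response: only the success path decrements acct_dirty */
static void cache_store_continuation (struct cache *c, bool error)
{
    if (error)
        return;         /* the entry is neither re-queued nor uncounted */
    c->acct_dirty--;
}

int main (void)
{
    struct cache c = { 0 };
    mark_dirty (&c);
    cache_store (&c);
    cache_store_continuation (&c, true);    /* backing store failed */
    /* prints acct_dirty=1 flush_len=0: a flush waiter now hangs forever */
    printf ("acct_dirty=%d flush_len=%d\n", c.acct_dirty, c.flush_len);
    return 0;
}
```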

@chu11
Member Author

chu11 commented Aug 9, 2022

This bothered me so much that I kept looking into it, and I figured out the problem: list_add_tail() does not check whether the node is already on the list, nor does it manipulate pointers to move the node to the end of the list, which is what I think the code's authors (myself included) assumed it would do.

This issue occurred via the regression test for issue 1760, in which the KVS drops its internal cache via dropcache. The broker's content-cache therefore received content store requests for the same data twice, ultimately adding the same dirty entry to the flush list twice, corrupting the flush list and making it inconsistent with the acct_dirty count.
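
For illustration, here is a minimal, self-contained sketch of why blindly appending a node that is already on a circular doubly linked list corrupts it. This is hand-rolled stand-in code that mimics the behavior of a ccan-style list_add_tail(), not the actual ccan/list implementation.

```c
#include <stdio.h>

/* stand-in node with embedded links, like a ccan list_node */
struct node {
    struct node *next;
    struct node *prev;
};

static void list_init (struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* blindly append: no check that n might already be linked into the list */
static void add_tail (struct node *head, struct node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/* walk forward from the head, giving up after 'max' hops */
static int count (struct node *head, int max)
{
    int n = 0;
    struct node *p = head->next;
    while (p != head && n < max) {
        n++;
        p = p->next;
    }
    return n;
}

int main (void)
{
    struct node head, a;
    list_init (&head);

    add_tail (&head, &a);
    printf ("after first add: %d entries\n", count (&head, 10));    /* 1 */

    /* The second add splices the node in "after" the current tail, which
     * is the node itself, so it ends up pointing at itself and a
     * traversal never gets back to the head. */
    add_tail (&head, &a);
    printf ("after second add: %d entries before giving up\n",
            count (&head, 10));                                      /* 10 */
    return 0;
}
```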

It appeared during my work on #4267 because the content backing store wasn't loaded by default.

I'm trying to see if I can reproduce this issue in master, as it is a semi-severe bug that might even cause a segfault or memory corruption.

Note that I think the error paths I described in my first comment are still issues, so I may open a separate issue for this specific list corruption.

@garlick
Member

garlick commented Aug 9, 2022

Oof, glad you ran that down!

@garlick
Member

garlick commented Aug 15, 2022

So it sounds like this is still an issue?

One potential path is in cache_store() and cache_store_continuation(). An entry from the flush list is sent to the content backing module to be stored and is removed from the flush list. If an error occurs, cache_store_continuation() sees the error but does not put the entry back on the flush list, and acct_dirty is not decremented.

@chu11
Member Author

chu11 commented Aug 15, 2022

Yeah, I split the list corruption out into #4482. What I wrote above is still possible ... and perhaps there are other paths I did not see.

@chu11
Member Author

chu11 commented Aug 25, 2022

While I'm in the middle of doing so much work on the broker's content-cache, I thought I'd try to fix the two potential error paths I noted above.

As I think about it more, issue 2 above shouldn't be a problem if issue 1 is solved. The content-cache should eventually get an ENOSYS indicating the backing module is gone.

As for issue 1, it'd be easy to simply put an entry back on the flush list if an error occurs. But not all errors are equal. If the file system is full and the backing module can never write anything to disk, then we'd just be constantly retrying flushes that will never succeed, on an ever-growing flush list. On the other hand, if we get ENOSYS, we can be confident the backing module is gone for now and can just stick the entry back on the flush list for later.

An optimal solution would be to mark each cache entry with a timeout of sorts saying "don't try to flush this for a while" if we believe the backing module to be borked. But that is probably more feature/work than is truly necessary at this point. So I'm thinking that if an error occurs on the backing module, just sticking the entry back on the flush list is a more than suitable solution.

Edit: oh no, issue 2 has to be handled the same way as issue 1, since we can't guarantee the backing module received the message. We just have to assume the request failed and put the dirty entry back on the flush list.
Edit2: Nope, I think I was right the first time :P
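
A sketch of the retry policy being discussed, assuming the errno from the failed backing store is available to the continuation. The names (classify_store_error, RETRY_LATER, PARK_ENTRY) are hypothetical, not broker code.

```c
#include <errno.h>
#include <stdio.h>

enum store_action {
    RETRY_LATER,    /* put the entry back on the flush list */
    PARK_ENTRY      /* keep it dirty, but don't retry immediately */
};

/* Decide what to do with a dirty entry whose backing store failed. */
static enum store_action classify_store_error (int errnum)
{
    switch (errnum) {
    case ENOSYS:
        /* backing module unloaded: safe to re-queue, it will be flushed
         * once a backing module is loaded again */
        return RETRY_LATER;
    case ENOSPC:
        /* disk full: re-queueing immediately would just retry forever */
    default:
        return PARK_ENTRY;
    }
}

int main (void)
{
    printf ("ENOSYS -> %s\n",
            classify_store_error (ENOSYS) == RETRY_LATER ? "retry" : "park");
    printf ("ENOSPC -> %s\n",
            classify_store_error (ENOSPC) == RETRY_LATER ? "retry" : "park");
    return 0;
}
```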

chu11 self-assigned this Aug 25, 2022
chu11 added a commit to chu11/flux-core that referenced this issue Aug 26, 2022
Problem: If an error occurs during a content store, a dirty
cache entry can be lost forever, as it is no longer on the flush
list or the lru list.  As a result, the "dirty count" of entries
will be inconsistent with the known dirty entries (i.e. entries
on the flush list or in the process of being stored).

Solution: If a content store fails, add the entry to a new
flush_errors list so it can be tried again at a later time.

Fixes flux-framework#4472
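
A rough sketch of the data-structure shape the commit describes; it is not the branch's actual code, and the list type here is a minimal stand-in for the ccan-style lists the broker uses.

```c
/* minimal stand-in for a ccan-style list head */
struct list_head { struct list_head *next, *prev; };

/* A dirty entry whose store failed moves to flush_errors instead of
 * being dropped, so acct_dirty keeps matching the entries the cache can
 * still account for. */
struct content_cache {
    int acct_dirty;                 /* all dirty entries, wherever they live */
    struct list_head flush;         /* dirty, waiting to be stored */
    struct list_head flush_errors;  /* dirty, last store attempt failed */
    struct list_head lru;           /* entries tracked for cache expiration */
};
```
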
@chu11
Member Author

chu11 commented Aug 26, 2022

Per some discussion in #4524 on this branch:

https://github.com/chu11/flux-core/tree/issue4472_put_back_on_flush_list

A) This is based on code review while trying to figure out #4482; we haven't actually seen this occur in practice.

B) While not very complex, it adds a bit more complexity than I wanted. I couldn't put the dirty entries straight back on the flush list because that could lead to an "infinite loop", with dirty cache entries retried over and over (assuming the error is ongoing).

B2) We could make this more optimal by checking for certain error conditions, but I elected not to do that for now.

C) I don't know how to test this without introducing a lot of instrumentation, so I punted on that for now.

@garlick then noted:

We could mark it read-only so that current cache entries remain dirty like none and then maybe try to put the system in some kind of safe mode, or shut it down with a dump (to some configured alternate location).

This is more complex than perhaps desired, and it's for a scenario we have never actually seen and which is likely rare (ENOSPC, i.e. disk full, being the most likely cause).

Since there is no current need for any of the above, the decision is to log at LOG_CRIT instead of LOG_ERR, so that it is more obvious to system administrators that things are bad bad bad.
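
A small sketch of what that looks like with the broker's flux_log() call; the wrapper function and message text are illustrative only.

```c
#include <string.h>
#include <syslog.h>
#include <flux/core.h>

/* Log a failed dirty-entry store at LOG_CRIT rather than LOG_ERR so it
 * stands out to system administrators. */
static void log_store_failure (flux_t *h, int errnum)
{
    flux_log (h, LOG_CRIT, "error storing dirty cache entry: %s",
              strerror (errnum));
}
```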

@chu11
Member Author

chu11 commented Sep 1, 2022

We'll consider this closed with #4526, but the branch listed above remains available for reference as a "fuller" solution.

chu11 closed this as completed Sep 1, 2022