kvs object store needs garbage collection #258
Comments
This is potentially becoming relevant again. A long-running job I have testing longevity is up to 26GB of sqlite database at 200,000 jobs, none of which use the kvs for IO. Do you think it would be reasonable to do something like etcd, where older versions can be traversed, but only up to a certain distance back? In their case the limit is a certain number of updates to a value, on the order of 1000 or so, but we might be able to establish a limit beyond which a client couldn't expect to be able to look back in time, to ease this a bit.
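Something along the lines of the etcd approach could be modeled as a bounded window of KVS root versions: anything older than the window is no longer guaranteed to be traversable and becomes fair game for compaction. Below is a minimal sketch of such a retention policy; `RootHistory`, the sequence numbers, and the blobref strings are all hypothetical and not part of any existing flux-core interface.

```python
# Minimal sketch of an etcd-style retention window over KVS root versions.
# Nothing here is flux-core API; all names are illustrative only.
from collections import deque

RETAIN_VERSIONS = 1000  # clients may look back at most this many commits


class RootHistory:
    """Bounded history of KVS root references."""

    def __init__(self, retain=RETAIN_VERSIONS):
        self.retain = retain
        self.history = deque()  # (seq, root blobref), oldest first

    def append(self, seq, blobref):
        """Record a new root; return roots that fell off the retention window."""
        self.history.append((seq, blobref))
        expired = []
        while len(self.history) > self.retain:
            expired.append(self.history.popleft())
        return expired  # candidates whose exclusively-referenced blobs may be purged


# usage: feed each commit's root in; expired roots seed a compaction pass
hist = RootHistory(retain=3)
for seq in range(1, 6):
    old = hist.append(seq, f"sha1-root{seq:04d}")
    if old:
        print("eligible for compaction:", old)
```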
An idle thought: if we could tell the kvs explicitly that nothing will ever look at a given key again, would it be reasonable to clear out all of the data that was ever used to represent that key? I'm thinking specifically of something we could include with purge to take care of old lwj data that we're explicitly deleting.
Hmm, I'm not sure how that would work, since content blobs can be pointed to from multiple keys/directories and the content store is inherently deduplicating, with no refcount/back-reference data kept with the blobs. It seems like we need something like git-gc here to identify "unreachable objects". For us that is complicated by multiple namespaces sharing one content store, the possible existence of content references outside of any KVS namespace, and the difficulty of taking the KVS offline for any length of time to walk every reference.
I like the etcd idea. As a first cut, I wonder if we could add an epoch to each content blob, then periodically walk the namespace(s), updating the epoch for all currently-referenced blobs, and finally purge all blobs whose epoch is older than some threshold (a rough sketch follows at the end of this comment). The other thing that seems worth pursuing is to add some "persistence flags" to a namespace to handle cases like this.
Another thought: the current "write back" cache on rank 0 might have some opportunity to avoid writing some objects to the backing store at all, e.g. if they are "dereferenced" before being written. Content blobs in flight could carry some additional flags that affect their persistence (second sketch below). Just thinking out loud really; more discussion/thought needed.
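A rough sketch of the epoch-marking idea above, under loudly-stated assumptions: `store` stands in for whatever interface the content backing store would expose (`load`, `set_epoch`, `iter_epochs`, and `delete` are invented names, not real content-sqlite or flux-core APIs), and `blobrefs_in()` is a hypothetical helper that extracts child blobrefs from a directory blob.

```python
# Hypothetical interfaces only: store.* and blobrefs_in() are invented
# for this sketch and do not correspond to flux-core code.

def mark_epoch(store, root_blobrefs, epoch, blobrefs_in):
    """Walk every namespace root, stamping all reachable blobs with `epoch`."""
    seen = set()
    stack = list(root_blobrefs)
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue
        seen.add(ref)
        store.set_epoch(ref, epoch)                 # record "referenced recently"
        stack.extend(blobrefs_in(store.load(ref)))  # descend into child blobrefs


def purge_stale(store, current_epoch, threshold):
    """Delete blobs that have not been marked within `threshold` epochs."""
    for ref, blob_epoch in store.iter_epochs():
        if current_epoch - blob_epoch > threshold:
            store.delete(ref)
```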
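And a tiny sketch of the write-back idea: an in-flight cache entry carries a flag, and the flush path simply drops entries that were dereferenced before they were ever written. `CacheEntry`, `flush()`, and the dict-shaped cache are illustrative only and do not correspond to the actual rank-0 cache code.

```python
# Illustrative only: not the real rank-0 write-back cache.

class CacheEntry:
    def __init__(self, blobref, data):
        self.blobref = blobref
        self.data = data
        self.dirty = True          # not yet written to the backing store
        self.dereferenced = False  # set when no live reference remains


def flush(cache, backing_store):
    """Write dirty entries, dropping ones that were dereferenced in flight."""
    for entry in list(cache.values()):
        if entry.dereferenced:
            del cache[entry.blobref]   # never touches the backing store
        elif entry.dirty:
            backing_store.store(entry.blobref, entry.data)
            entry.dirty = False
```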
Per the coffee discussion today:
@trws suggested that we could have a separate content store for each namespace. For a job's guest namespace, the final "snapshot" could be copied into the main content store and the guest content store deleted (a sketch of that copy step follows below).
It was also mentioned that for our TOSS4 timeline, we could start running this gc process after an instance restart, as long as the job shells don't try to access "old"/"stale" references (that code will need some auditing).
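For the per-namespace content store idea, the "copy the final snapshot" step could look roughly like the following, reusing the same hypothetical `load`/`store`/`blobrefs_in` helpers as the earlier sketch; deleting the guest store then becomes a single file removal rather than a GC problem.

```python
# Same hypothetical helpers as above; exists/load/store are invented names.

def export_snapshot(guest_store, main_store, snapshot_root, blobrefs_in):
    """Copy every blob reachable from the guest's final root into the main store."""
    stack = [snapshot_root]
    seen = set()
    while stack:
        ref = stack.pop()
        if ref in seen or main_store.exists(ref):  # main store already dedups
            continue
        data = guest_store.load(ref)
        main_store.store(ref, data)
        seen.add(ref)
        stack.extend(blobrefs_in(data))
    # the guest store can now be removed wholesale (e.g. unlink its sqlite file)
```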
Let's say that this issue can be closed if we can garbage collect the content store on the way up from an instance restart, based on following the last-written root blobref checkpoint and deleting everything that's not referenced. Sort of like WALL-E: the trash piles up, then we send Flux away until the robots finish cleaning up. What could go wrong? Let's save KVS redesign with refcounting for another day/issue.
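For the restart-time approach, here is a minimal offline mark-and-sweep sketch, assuming the checkpointed root blobref is known and assuming (purely for illustration) an `objects(hash, data)` sqlite table and a `blobrefs_in()` decoder; the real content-sqlite schema and checkpoint mechanism may differ.

```python
# Offline mark-and-sweep at restart. The sqlite schema here is an assumption,
# not the real content-sqlite layout.
import sqlite3

def gc_on_restart(db_path, checkpoint_root, blobrefs_in):
    """Mark everything reachable from the checkpointed root, sweep the rest."""
    conn = sqlite3.connect(db_path)

    def load(ref):
        row = conn.execute(
            "SELECT data FROM objects WHERE hash = ?", (ref,)).fetchone()
        return row[0]

    # mark: walk the tree rooted at the last-written checkpoint
    reachable, stack = set(), [checkpoint_root]
    while stack:
        ref = stack.pop()
        if ref in reachable:
            continue
        reachable.add(ref)
        stack.extend(blobrefs_in(load(ref)))

    # sweep: delete everything else, then reclaim file space
    for (ref,) in conn.execute("SELECT hash FROM objects").fetchall():
        if ref not in reachable:
            conn.execute("DELETE FROM objects WHERE hash = ?", (ref,))
    conn.commit()
    conn.execute("VACUUM")   # sqlite only shrinks the file after a vacuum
    conn.close()
```

Since the instance is down while this runs, there is no consistency concern with in-flight references; the main costs are the full reference walk and the final VACUUM, which rewrites the database file.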
Problem: a system instance that runs flux-dump(1) from rc3 might get killed by systemd TimeoutStopSec. Have flux-shutdown(1) arrange for the dump. If the instance is being shut down by this method, then systemctl stop is not being run, so TimeoutStopSec does not apply. Fixes flux-framework#258
The kvs object store currently grows without bound.
This is hard to address in the current design. Reference counting would add overhead to the system. Periodically walking the namespace to find disconnected objects ignores the fact that eventually-consistent slaves may still be using those objects, or that clients may be traversing old versions of the namespace after kvsdir_t is turned into a "snapshot reference" per issue #64.