
kvs object store needs garbage collection #258

Closed · garlick opened this issue Jul 10, 2015 · 5 comments
@garlick (Member) commented Jul 10, 2015

The kvs object store currently grows without bound.

This is hard to address in the current design. Reference counting would add overhead to the system. Periodically walking the namespace to find disconnected objects ignores the fact that eventually consistent slaves may still be using those objects, or that clients may be traversing old versions of the namespace after kvsdir_t is turned into a "snapshot reference" per issue #64.

@trws (Member) commented Mar 25, 2018

This is potentially becoming relevant again. A long-running job I have for testing longevity has reached 200,000 jobs, none of which use the kvs for IO, and its sqlite database is up to 26GB at this point.

Do you think it would be reasonable to do something like etcd, where older versions can be traversed, but only up to a certain distance back? In their case the limit is a certain number of updates to a value, on the order of 1000 or so, but we might be able to establish a limit beyond which a client couldn't expect to look back in time, to ease this a bit.
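
For comparison, here is a toy sketch of that bounded look-back window. It is illustrative only; the class and its behavior are made up for this comment, not etcd's or Flux's actual mechanism:

```python
from collections import OrderedDict

class BoundedHistory:
    """Toy store that keeps only the last max_versions root snapshots,
    loosely modeled on etcd's compaction window (hypothetical)."""

    def __init__(self, max_versions=1000):
        self.max_versions = max_versions
        self.versions = OrderedDict()  # version number -> root blobref
        self.current = 0

    def commit(self, rootref):
        self.current += 1
        self.versions[self.current] = rootref
        # Compact: drop versions that fall outside the window.
        while len(self.versions) > self.max_versions:
            self.versions.popitem(last=False)

    def lookup(self, version):
        if version not in self.versions:
            raise LookupError(f"version {version} has been compacted")
        return self.versions[version]
```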

@trws (Member) commented Mar 28, 2018

An idle thought: if we could tell the kvs explicitly that nothing will ever look at a given key again, would it be reasonable to clear out all data that was ever used to represent that key? I'm thinking specifically of something we could include with purge to take care of old lwj data that we're explicitly deleting.

@garlick (Member, Author) commented Mar 29, 2018

Hmm, not sure how that would work, since content blobs can be pointed to from multiple keys/directories and the content store is inherently deduplicating, with no refcount/back-reference data kept with the blobs.

It seems like we need something like git-gc here, to identify "unreachable objects". For us this is complicated by multiple namespaces sharing one content store, the possible existence of content references outside of any KVS namespace, and the difficulty of taking the KVS offline for any length of time to walk every reference.

I like the etcd idea. I wonder as a first cut if we could add an epoch to each content blob and then periodically walk the namespace(s), updating the epoch for all currently-referenced blobs? Then purge all blobs whose epoch is older than some threshold.
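
A rough sketch of that first cut, with entirely hypothetical content_store and blob helpers (the real content service keeps no per-blob metadata like this today):

```python
import time

def epoch_gc(content_store, namespace_roots, max_age):
    """Epoch-based GC sketch (all APIs hypothetical): stamp every blob
    reachable from a current namespace root, then purge blobs whose
    stamp is older than max_age seconds."""
    now = time.time()

    # Mark: walk each namespace from its root, refreshing epochs.
    for rootref in namespace_roots:
        seen = set()
        stack = [rootref]
        while stack:
            blobref = stack.pop()
            if blobref in seen:
                continue
            seen.add(blobref)
            blob = content_store.load(blobref)
            blob.epoch = now
            stack.extend(blob.child_refs())  # empty for leaf blobs

    # Sweep: purge anything whose epoch fell behind the threshold.
    for blobref in list(content_store.refs()):
        if now - content_store.load(blobref).epoch > max_age:
            content_store.delete(blobref)
```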

The other thing that seems worth pursuing is to add some "persistence flags" to a namespace (see the sketch after this list) to handle:

  • namespaces that need no persistence at all, like the per-job PMI namespace
  • namespaces that might only need their "final" snapshot captured (some jobs maybe?)
  • namespaces that require strong persistence (on disk after every commit, say)
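
To make that concrete, the three cases might map onto something like this (names are illustrative, not a proposed Flux API):

```python
from enum import Enum

class Persistence(Enum):
    """Hypothetical per-namespace persistence levels (not a Flux API)."""
    NONE = "none"      # never written to the backing store (e.g. per-job PMI)
    FINAL = "final"    # only the final root snapshot is captured
    STRONG = "strong"  # flushed to disk after every commit
```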

Another thought is that the current "write back" cache on rank 0 might have some opportunity to avoid writing some objects to the backing store at all, e.g. if they are "dereferenced" before being written. Content blobs in flight could carry some additional flags that affect their persistence.

Just thinking out loud really, more discussion/thought needed.

@trws mentioned this issue Jul 10, 2018
@SteVwonder (Member) commented:

Per the coffee discussion today:

> For us this is complicated by multiple namespaces sharing one content store

@trws suggested that we could have a separate content store for each namespace. For a job's guest namespace, the final "snapshot" could be copied into the main content store and the guest content store deleted.
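
A sketch of that copy step, assuming hypothetical guest_store/main_store objects with has/load/store methods:

```python
def export_final_snapshot(guest_store, main_store, final_rootref):
    """Hypothetical sketch: copy every blob reachable from a guest
    namespace's final root into the main content store, after which
    the guest store can be deleted wholesale."""
    stack = [final_rootref]
    while stack:
        blobref = stack.pop()
        if main_store.has(blobref):
            continue  # deduplicated: already present in the main store
        blob = guest_store.load(blobref)
        main_store.store(blobref, blob)
        stack.extend(blob.child_refs())  # empty for leaf blobs
```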

> It seems like we need something like git-gc here, to identify "unreachable objects".

It was also mentioned that, for our TOSS4 timeline, we could start running this gc process after an instance restart, as long as the job shells don't try to access "old"/"stale" references (that code will need some auditing).

@garlick (Member, Author) commented Feb 24, 2022

Let's say that this issue can be closed if we can garbage collect the content store on the way up from an instance restart, based on following the last-written root blobref checkpoint, and deleting everything that's not referenced.

Sort of like WALL-E. The trash piles up, then we send Flux away until the robots finish cleaning up. What could go wrong?

Let's save KVS redesign with refcounting for another day/issue.
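
In pseudocode, the restart-time sweep described above might look like this (helper names hypothetical; the instance is down, so nothing races the walk):

```python
def gc_on_restart(content_store, checkpoint_rootref):
    """Restart-time GC sketch (hypothetical APIs): mark everything
    reachable from the last checkpointed root blobref, then delete
    the rest."""
    reachable = set()
    stack = [checkpoint_rootref]
    while stack:
        blobref = stack.pop()
        if blobref in reachable:
            continue
        reachable.add(blobref)
        stack.extend(content_store.load(blobref).child_refs())

    for blobref in list(content_store.refs()):
        if blobref not in reachable:
            content_store.delete(blobref)
```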

@garlick added this to the flux-core v0.39.0 milestone May 2, 2022
garlick added a commit to garlick/flux-core that referenced this issue May 2, 2022
Problem: a system instance that runs flux-dump(1) from rc3
might get killed by systemd TimeoutStopSec.

Have flux-shutdown(1) arrange for the dump.  If the instance is
being shut down by this method, then systemctl stop is not being run,
so TimeoutStopSec does not apply.

Fixes flux-framework#258
@mergify (bot) closed this as completed in 358f21b May 2, 2022