More intelligent blockstore garbage collection #3092
Comments
It might also be good to have both a min and a max threshold, and a garbage collector that only removes things until the min threshold is reached. It could be triggered automatically at the max threshold, as is done now. |
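The min/max idea above can be sketched roughly as follows. This is only an illustration under assumptions: `Block`, `collectToLowWatermark`, and the in-memory slice are hypothetical stand-ins, not go-ipfs types, and blocks are removed in the order given rather than by any smarter policy.

```go
package main

import "fmt"

// Block is a hypothetical stand-in for a stored block.
type Block struct {
	Key    string
	Size   uint64
	Pinned bool
}

// collectToLowWatermark does nothing until total usage reaches max.
// Once triggered, it selects unpinned blocks (in the order given) for
// removal until usage is at or below min. It returns the selected keys
// and the resulting usage.
func collectToLowWatermark(blocks []Block, min, max uint64) ([]string, uint64) {
	var usage uint64
	for _, b := range blocks {
		usage += b.Size
	}
	if usage < max {
		return nil, usage // below the trigger point: no collection
	}
	var removed []string
	for _, b := range blocks {
		if usage <= min {
			break // low watermark reached: stop early
		}
		if b.Pinned {
			continue // pinned blocks are never candidates
		}
		usage -= b.Size
		removed = append(removed, b.Key)
	}
	return removed, usage
}

func main() {
	blocks := []Block{
		{"a", 4, false}, {"b", 3, true}, {"c", 2, false}, {"d", 3, false},
	}
	removed, usage := collectToLowWatermark(blocks, 6, 10)
	fmt.Println(removed, usage) // prints "[a c] 6"
}
```

Block `d` survives here even though it is unpinned, which is the point of stopping at the min threshold instead of deleting everything that is not pinned.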
@jakobvarmose how would your reference counting idea work? Would it count existing references to blocks? Or would it simply increment some counter each time a block is referenced by another object? I agree 100% that the current GC implementation is a bit... blunt. The issue is that the marking phase is relatively expensive, so doing smaller runs is not cheap. There has been some previous discussion on this topic here: ipfs/notes#130 One related topic I was talking about earlier (for |
I would add two new variables associated with each block:
When an object is pinned directly just set To check if a block is pinned simply do |
@jakobvarmose yeah, we used to do that. It was really slow, and expensive. We have thousands to millions (or more) of blocks. Blocks may also be very small, and the overhead of that reference information would be significant. Every pin operation would need to iterate over every child block and update every entry for that child block. |
@whyrusleeping Oh, I didn't know. Yeah, it would use quite a bit of memory. But I can't see how it would be much slower, as the current implementation also reads each child block from disk, and this is probably the slowest part. Unpinning will of course be slower, but pinning should be about the same speed. Or what am I missing? |
@jakobvarmose no, currently when you pin and unpin we only update the list of root hashes; reference counting would require iterating over the whole hash tree and updating the reference counts. |
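To make the trade-off above concrete, here is a rough sketch comparing the bookkeeping for the two approaches. Everything in it is an assumption for illustration: `dag`, `pinRoot`, and `pinRefCount` are hypothetical names, the DAG lives in a map rather than a datastore, and "reference count" is taken to mean the number of recursive pins that cover a block.

```go
package main

import "fmt"

// dag maps a block key to the keys of its children. It is a hypothetical
// in-memory stand-in for reading links out of the blockstore.
type dag map[string][]string

// pinRoot records only the root hash and returns the number of
// entries written (always one).
func pinRoot(roots map[string]bool, root string) int {
	roots[root] = true
	return 1
}

// pinRefCount walks every block reachable from root and increments its
// count once for this pin. It returns the number of entries written.
func pinRefCount(d dag, counts map[string]int, root string) int {
	seen := map[string]bool{}
	stack := []string{root}
	for len(stack) > 0 {
		k := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[k] {
			continue // shared child already counted for this pin
		}
		seen[k] = true
		counts[k]++
		stack = append(stack, d[k]...)
	}
	return len(seen)
}

func main() {
	d := dag{"root": {"x", "y"}, "x": {"leaf"}, "y": {"leaf"}}
	roots := map[string]bool{}
	counts := map[string]int{}
	fmt.Println(pinRoot(roots, "root"))        // prints "1"
	fmt.Println(pinRefCount(d, counts, "root")) // prints "4"
}
```

The number of writes for the reference-counting variant grows with the size of the pinned tree, while the root-list variant writes a single entry and defers the tree walk to GC time.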
@whyrusleeping I was wondering why we don't use reference counting for indirect pins. How is that worse than what we have to do now? How is iterating over every child of a single recursive pin to update it worse than always iterating over every child of every recursive pin just to check whether a block is pinned? The checking operation would seem to me to be more frequent than the updating operation. Have we considered storing the indirectly pinned blocks individually in the datastore? That is, for example, for each indirectly pinned block have an entry under the name "/local/pins/indirect/". This will have disk space overhead, but since it is no longer in memory it will scale well. The underlying leveldb is designed to be fast, so at this point I am having a hard time understanding how a mass update could be really slow. The cost of checking pins will become more of an issue once "block rm" lands (#2962) and also in my filestore code (#2634). A bloom filter will help in the case that a block is not pinned, but to make sure that it is, we will still have to iterate over every child of every single recursive pin. |
@kevina we've been down this road before. The disk overhead is significant, the cost to pin large objects becomes relatively obscene. If you want to dig into it more, go check out the old code and PRs |
@Kubuxu When unpinning you are right that the current implementation runs very quickly. But when pinning all child blocks are read from the datastore (see https://github.com/ipfs/go-ipfs/blob/master/pin/pin.go#L140-L144). |
Another related idea: to make GC faster, instead of simply marking, count the number of references (you could call it count-and-sweep) and store the result on disk. Additionally, store incremental updates of recursive pins. The next time the garbage collector runs, it reads the results from last time and updates them incrementally. |
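A minimal sketch of this count-and-sweep idea, under stated assumptions: the `counts` map stands in for the result persisted from the previous run, `pinDelta` is a hypothetical log entry for a recursive pin added or removed since then, and the links of an unpinned root are assumed to still be readable when the delta is applied. None of these names come from go-ipfs.

```go
package main

import "fmt"

// dag maps a block key to its child keys (hypothetical stand-in).
type dag map[string][]string

// pinDelta records one recursive pin added (Add=true) or removed
// (Add=false) since the previous GC run.
type pinDelta struct {
	Root string
	Add  bool
}

// reachable returns the set of keys reachable from root, including root.
func reachable(d dag, root string) map[string]bool {
	seen := map[string]bool{}
	stack := []string{root}
	for len(stack) > 0 {
		k := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[k] {
			continue
		}
		seen[k] = true
		stack = append(stack, d[k]...)
	}
	return seen
}

// applyDeltas updates the stored counts using only the pins that changed,
// instead of re-marking from every recursive pin.
func applyDeltas(d dag, counts map[string]int, deltas []pinDelta) {
	for _, delta := range deltas {
		step := 1
		if !delta.Add {
			step = -1
		}
		for k := range reachable(d, delta.Root) {
			counts[k] += step
			if counts[k] <= 0 {
				delete(counts, k) // keep the stored table small
			}
		}
	}
}

// sweep returns the keys in all that no pin currently covers.
func sweep(all []string, counts map[string]int) []string {
	var garbage []string
	for _, k := range all {
		if counts[k] == 0 {
			garbage = append(garbage, k)
		}
	}
	return garbage
}

func main() {
	d := dag{"root1": {"a", "b"}, "root2": {"b", "c"}}
	all := []string{"a", "b", "c", "root1", "root2"}
	counts := map[string]int{} // would be loaded from disk in practice

	applyDeltas(d, counts, []pinDelta{{"root1", true}, {"root2", true}})
	fmt.Println(sweep(all, counts)) // prints "[]"

	applyDeltas(d, counts, []pinDelta{{"root1", false}})
	fmt.Println(sweep(all, counts)) // prints "[a root1]"
}
```

Block `b` survives the second run because `root2` still covers it; only the subtree of the changed pin is walked.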
Maybe we should review proper gc algorithms and find something that matches
|
reference: #2030 |
A rather long IRC discussion: https://botbot.me/freenode/ipfs/msg/71626736/ (starting around 9 pm PDT on Aug 19). No real consensus, but lots of ideas and background info. |
ref: #4149 |
Type: Feature
Area: Blockstore, Pin
Description:
The current garbage collector deletes all un-pinned blocks. This makes the system more fragile, as people don't usually pin a lot of objects. And if I have configured my node to keep up to 10GB of data I would also expect it to keep close to that limit at all times.
One solution is to delete blocks at random until the disk usage falls below the threshold limit. But blocks could also be deleted based on supply and demand, or when they were last accessed.
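As a rough illustration of the last-accessed variant, the sketch below evicts the least recently used unpinned blocks until usage is at or below the limit. It assumes per-block access times are available, which the thread does not say the blockstore tracks; `entry`, `LastAccess`, and `evictLeastRecentlyUsed` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical per-block record. LastAccess is in Unix seconds
// and is assumed metadata.
type entry struct {
	Key        string
	Size       uint64
	LastAccess int64
	Pinned     bool
}

// evictLeastRecentlyUsed returns the keys of unpinned blocks to delete,
// oldest access first, until total usage is at or below limit.
func evictLeastRecentlyUsed(entries []entry, limit uint64) []string {
	var usage uint64
	candidates := make([]entry, 0, len(entries))
	for _, e := range entries {
		usage += e.Size
		if !e.Pinned {
			candidates = append(candidates, e)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].LastAccess < candidates[j].LastAccess
	})
	var evicted []string
	for _, e := range candidates {
		if usage <= limit {
			break
		}
		usage -= e.Size
		evicted = append(evicted, e.Key)
	}
	return evicted
}

func main() {
	entries := []entry{
		{"new", 5, 300, false},
		{"old", 5, 100, false},
		{"pinned", 5, 50, true},
		{"mid", 5, 200, false},
	}
	fmt.Println(evictLeastRecentlyUsed(entries, 10)) // prints "[old mid]"
}
```

Swapping the sort for a random shuffle would give the delete-at-random variant, and a different ordering key could stand in for a supply-and-demand measure.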
The current garbage collector uses a mark-and-sweep algorithm, but another option would be to use reference-counting, and I think that is better.