[RFC] SPFS Safe Deletion #1011
Comments
Yeah, that's an interesting case that the age-based protections don't help with. I don't think the

I think the fundamental race problem is that there is no way to write both a manifest object and its child objects into the database atomically. So there is a window of time between when the child objects are checked for existence or written (and can be considered garbage) and when the manifest object is written, which creates the hard reference to those child objects. This window actually extends up to the point where a tag is written that references the manifest(s).

I suggest a different approach that tackles the race more directly using a cooperative locking mechanism. What is needed is an extra way to track object liveness that is separate from walking references from tags. Here's what would happen:
The lock would be short-lived and still allow concurrency for writes.
I'm happy to explore the idea of transactions, but it sounds like you are suggesting a global lock on the whole DB. That makes me nervous, but maybe a more detailed discussion of the implementation would be good. One consideration for us is that the clean operation can take a significant amount of time on our repository. It's not reasonable to ask the server to stop accepting requests during that period, because we now have processes writing new data at most hours of the day.
It seems to me that this point is not actually true, since you would need to block writes even while collecting the list of objects, lest you select something for deletion that was re-added while scanning (the original problem that spawned this). As I think about the transactions, I'm hard pressed to see how we actually rid ourselves of a similar race condition, but I'm keen to keep talking about it. I arrived at the
This is not true.
From the meeting today:
Background and Problem Space
On our server, we have a race condition with respect to the spfs clean process. Consider the following scenario:
The use of object age in the cleaner is not helpful here, because OLD can be of any age.
Proposal
I propose a safe-deletion process for spfs that works in two stages: data is first marked for deletion, and is only removed at a later time. This two-stage removal process would allow administrators to leverage a second age-out window for deleted data. This second window would need to be tuned for each workflow to fully eliminate race conditions, but a safe and reasonable default would suffice for all currently known cases.
Step-by-Step
At this stage the repository would appear to no longer contain the object via `has_object`. If the object/payload was read, though, the repository would recognize the deleted instance and restore it instead of failing. Crucially, this restore should make the object "new" again in the eyes of the cleaner.
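To make the intended behavior concrete, here is a minimal in-memory sketch of the two-stage idea. It is not spfs code: apart from `has_object`, every name here (`SafeDeleteStore`, `mark_deleted`, `read_object`, `purge`, `purge_after_seconds`) is hypothetical, and a real repository would persist this state on disk rather than in dictionaries.

```python
import time


class SafeDeleteStore:
    """Toy object store with two-stage (mark, then purge) deletion.

    Illustrative only; all names except has_object are assumptions.
    """

    def __init__(self, purge_after_seconds, clock=time.time):
        self.purge_after = purge_after_seconds  # the second age-out window
        self.clock = clock
        self.live = {}     # digest -> (data, created_at)
        self.deleted = {}  # digest -> (data, marked_at)

    def write_object(self, digest, data):
        # Writing an object also cancels any pending deletion of it.
        self.deleted.pop(digest, None)
        self.live[digest] = (data, self.clock())

    def has_object(self, digest):
        # Objects marked for deletion are reported as absent.
        return digest in self.live

    def mark_deleted(self, digest):
        # Stage 1: hide the object but keep its data around.
        data, _ = self.live.pop(digest)
        self.deleted[digest] = (data, self.clock())

    def read_object(self, digest):
        # A read of a marked object restores it and resets its age,
        # so the cleaner sees it as "new" again.
        if digest in self.deleted:
            data, _ = self.deleted.pop(digest)
            self.live[digest] = (data, self.clock())
        data, _ = self.live[digest]  # KeyError if the object is truly gone
        return data

    def purge(self):
        # Stage 2: permanently remove objects whose mark has aged out.
        now = self.clock()
        expired = [
            digest
            for digest, (_, marked_at) in self.deleted.items()
            if now - marked_at >= self.purge_after
        ]
        for digest in expired:
            del self.deleted[digest]
        return expired
```

In this sketch, a writer that checked `has_object` before the mark and only reads or references the object afterwards still finds the data, as long as it does so within the purge window. That window is what would need tuning per workflow.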