This issue was moved to a discussion.
Reconsider gc implementation #2325
Comments
This is more about UI defaults than implementation, which will surely be affected too.
yes - very much agree. I am new to dvc, but the gc implementation is already making me nervous. One additional note on gc. Should "metrics" files be treated differently than the rest? I agree that gc should easily collect old model files etc, but keeping the development of performance over time around would be nice. @Suor Why do you think this is more about UI than implementation? I am not familiar in detail with the implementation, but the existence of the "--projects" parameter points to an interesting problem. What should happen to files that aren't referenced anywhere in the current git repository? The conservative answer to this would be to not remove them. However this sounds like a major change in strategy, where only those items are considered for removal that Or am I misunderstanding this? Thanks |
Hi @hhoeflin ! Sorry for the delay.
We usually recommend keeping metrics files in the git (e.g.
In the current state of things, those files are going to be removed, that is why
Currently dvc cache/remote doesn't have any notion of which project it came from, it would require some additional complications to make it aware of that.
It is clear that we need different strategies for gc. Here are a few related tickets about other strategies: #678 #377 #155 #855 Improving gc is around the top of our TODO list, so we will get to dedicating it more time soon. Thank you for the feedback, we really appreciate it! 🙂 |
An excellent overview and answer @efiop, thank you :) A few interesting questions have come to mind reading this. For example, should it be analyzing all commits in all branches in the repo by default (vs only all commits in the current branch)? How do we ensure that we have all the branches we need? How do we ensure (and should we?) that when we run it, the branch is correct? Should it be for example, a global command (in a |
One source of issues is that we currently treat local gc and remote gc the same, which is wrong for the very common scenario of using both. In that case the local cache is temporary, so we want to say what we want to keep. For last/permanent storage such as a remote cache, however, we usually want to keep everything referenced by default. This is complicated by the existence of other scenarios, like a shared machine with a shared local cache. We should somehow separate these cases. BTW, when we think like that, #2037 becomes obvious. Another thing is, given the distributed nature of git, it's impossible to implement
@Suor |
Extracted from #2147: Since git has a distributed nature and we collect everything not referenced, we may remove something that was recently pushed and is only referenced in git commits not available locally yet (not pulled, or not even pushed yet by the author). The sane way to circumvent this is to provide a grace period: do not gc anything newer than N days. If someone has just pushed something they shouldn't have and wants to remove it, that should still be possible: dvc gc -c --grace-period=0 This is the "I know what I am doing even though I just messed up" flag :)
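To make the grace-period idea concrete, here is a minimal sketch in Python. It is not DVC code: `select_removable`, `cache_mtimes`, and `referenced` are hypothetical names, and it assumes each cache entry's modification time is a usable proxy for "when it was added".

```python
import time

SECONDS_PER_DAY = 86400


def select_removable(cache_mtimes, referenced, grace_days, now=None):
    """Pick cache entries that are safe to collect under a grace period.

    cache_mtimes: dict mapping checksum -> modification time (epoch seconds).
    referenced:   set of checksums referenced by the revisions we can see.
    grace_days:   entries newer than this many days are never collected.
    (All names here are illustrative assumptions, not DVC's API.)
    """
    now = time.time() if now is None else now
    cutoff = now - grace_days * SECONDS_PER_DAY
    return {
        checksum
        for checksum, mtime in cache_mtimes.items()
        # Collect only entries that are both unreferenced and old enough.
        if checksum not in referenced and mtime <= cutoff
    }
```

With `grace_days=0` every unreferenced entry becomes eligible, which matches the "I know what I am doing" override described above.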
@Suor grace-period won't work, because a user might use some models that he has generated earlier and by removing those we would break the project for him. Next logical step is LRU, but, once again it is not possible to properly implement it in a distributed nature. I think gc should only be run with the upstream project, to include all important references, and for forks users should use separate storage until their PR is merged. |
@efiop LRU won't work either, if that model is in local cache it won't be accessed anyway. But grace period will work for most cases, the example with using some old model, which is still in use, but never referenced by any tag/branch is artificial. |
@Suor I'm not saying that is not referenced by any tag/commit/branch, I'm saying that you might've added a dataset 2 years ago and still using it on your |
@efiop if it's referenced then it won't be collected, the grace period is needed to protect something only recently added and not referenced in a particular repo |
@hhoeflin "deleting everything that I don't know I need" is the right thing for local repo backed by remote. So the issue is the same approach in both situations. |
I agree with this. By default I would also say that the name Another thing is that a remote storage seems to me like a backup/restore place for the local cache. If we see it like this, then the only thing that has to deal with it is the script (or scripts, or command) that makes the synchronization between the local cache and the remote storage. Depending on the sync policy, everything that is deleted on the local cache may be deleted on the remote storage as well, or everything that is deleted on the remote storage will be deleted on the local cache as well. The point is that |
I think that the only way to make the cleanup process both safe and flexible is to make it interactive. This means that the user should have enough information to decide which cached files to delete and which ones to keep. When I look at the cache directory, I see only files and directories with strange names, and I am not able to understand which files in my workspace they represent, when they were cached, what is the git tag/branch to which they are related (if there is one), what is the git commit message (if there is one), etc. If I had this information, then I might be able to decide easily which ones of the cache files to delete and which ones to keep. I wouldn't mind even deleting them manually (plain So, the crucial question is: can DVC show such metadata information about the cached files? Maybe it stores such information on the state DB. Extracting this information from git would not be a good idea, both for performance reasons and because DVC is supposed to be able to work independently of git (with In any case, the |
Interactivity would be one of the next stages. We at least should be able to delete everything that is not used in any commit, it is basic stuff. Ok guys. Looks like the first step in this would be to make |
@efiop |
My suggestions are pretty much the same as before. I still think that we got stuck because we moved too far from the initial issue(s) in this ticket and expanded the scope way too much:
@skshetry some additional consideration, most of them should go into a separate ticket(s) to my mind:
To me there is no superior semantics. Both of them are clear and have their place. Remove when it's exactly clear what is supposed to be removed (you put some burden on user's shoulders in terms of somehow getting this knowledge) - e.g. some large dataset that is unused, keep when it's exactly clear what I'm interested in - like some scope of experiments (it correlates very well with what we use for push/pull/show/, etc - users operate in terms of experiments they need).
I don't quite understand/feel statement. Could you elaborate, please?
Yes, let's try to stay focused. See above. I think the whole ticket is now about something else entirely.
As @dmpetrov pointed out, we should provide an option to
As we discussed in #2498, it looks like the agreement is to not use prompts but fail instead.
Yes, instead of taking one step at a time, I proposed a whole lot of changes, and that's a failure on my part. But yes, we can do what you proposed as a first iteration. The questions that I raised are just that: discussion.
In all of the programming languages and in git, Also, this stems from how the user is using P.S: I'm not trying to say, |
I believe it's a very precise definition in our case: everything that is not referenced in DVC-files anymore. And the scope of commits we collect references from corresponds one-to-one with all other commands in DVC (push/pull/metrics show, etc.), meaning that it corresponds well with the higher-level abstraction we operate with: experiments.
Never felt this way regarding it being automatic. It feels like an implementation detail in itself. If it's confusing to a lot of people, I'm fine with renaming it to clean or something else. But keep the semantics. If we need
All of this feels like unnecessary complication, or at least does not make this workflow superior in any way. With "keep" semantics I don't care about old stuff, I just keep some experiments I care about - the most recent stuff. Also, I don't want to run additional commands to figure out what to remove. It feels too low level - like
that's the good point. And see my other responses - users use it in different modes. |
@shcheklein and @skshetry you are going down the rabbit hole again. GC/not-GC is not the most important part here, and how it corresponds to programming languages is not very important. The important part is: how do we clean space? If we just make GC safe, users won't be able to clean the space. So we cannot take this step without introducing more aggressive strategies, or at least without keeping the current one (probably as an option). Also, the specific remove is needed Actions that I'd suggest (the commands' names TBD) sorted by priorities:
Ok, had another round of discussion with @dmpetrov and @shcheklein . Here are the proposed first steps:
And the rest we can leave for the future discussion. What do you think guys, can we start with these steps? |
@efiop thank you for the brief summary. Let's start with this. PS: you can keep only (1) and (2) in this issue and make (3) a separate issue if you think the scope is too big. Up to you...
This is what you proposed last week as well, right @efiop? I stalled (eh? 😄) on it a bit as I wanted to discuss more of the use cases (ref: #2325 (comment)) that I felt were important from reading this issue. Still, we can start with this rather than getting stuck. Let's roll with these for now. 🙂
How about |
Coming late to this thread. I expected to have a I'd probably suggest requiring either I also find the wording of the
To me, that actually sounds like we will put the garbage into the remote repository, i.e. find garbage locally and upload it to the remote. I think if you just changed "in" to "from", that would help; otherwise I think the following would be more clear:
@casperdcl @kenahoo, it's not really possible to provide much information in
Thanks for pointing that out. It seems to be a help message. I'll try making a PR this week for that.
I just want to also vote for considering interactivity when running
@skshetry I don't think |
@kenahoo If you don't mind |
Interactivity is a very good idea, but not as a default behavior. It is somewhat related to the prompt/no-prompt problem (#2498) because it complicates scripting.
From #2451
As pointed out in discussion in #1691, we should reconsider gc implementation.
Currently, if called without any options, dvc will collect current branch dependencies and outputs checksums, and remove everything besides it. We can easily clear history of changes with this command.
gc should be safer with default options. Straightforward implementation could get all outputs for all revisions in git repo and remove everything that is not on list.
As pointed out by @Suor, this approach might be slow for repository with long history.
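The difference between the current default and the proposed safer default can be sketched as follows. This is only an illustration, not DVC's implementation: `outputs_by_revision` is a hypothetical mapping from a git revision to the set of checksums its DVC-files reference, and building that mapping (the potentially slow part for long histories) is left out.

```python
def garbage_current_revision(cache, outputs_by_revision, current):
    """Current behavior as described: keep only what the current revision uses."""
    return set(cache) - set(outputs_by_revision[current])


def garbage_all_revisions(cache, outputs_by_revision):
    """Proposed safer default: keep everything used by any known revision."""
    used = set()
    for checksums in outputs_by_revision.values():
        used.update(checksums)
    return set(cache) - used


# Toy data: three cached objects, two revisions.
cache = {"aaa", "bbb", "ccc"}
outputs_by_revision = {
    "rev-old": {"aaa"},
    "rev-new": {"bbb"},
}
```

On this toy data the current-revision strategy (run on `rev-new`) would collect both `aaa` and `ccc`, losing the output of `rev-old`, while the all-revisions strategy collects only `ccc`.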