Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconsider gc implementation #2325

Closed
pared opened this issue Jul 26, 2019 · 73 comments
Closed

Reconsider gc implementation #2325

pared opened this issue Jul 26, 2019 · 73 comments
Labels
enhancement Enhances DVC p1-important Important, aka current backlog of things to do research ui user interface / interaction

Comments

@pared
Copy link
Contributor

pared commented Jul 26, 2019

As pointed out in discussion in #1691, we should reconsider gc implementation.
Currently, if called without any options, dvc will collect current branch dependencies and outputs checksums, and remove everything besides it. We can easily clear history of changes with this command. gc should be safer with default options. Straightforward implementation could get all outputs for all revisions in git repo and remove everything that is not on list.

As pointed out by @Suor, this approach might be slow for repository with long history.

@Suor
Copy link
Contributor

Suor commented Jul 27, 2019

This is more about UI defaults than implementation, which will surely be affected too.

@efiop efiop added enhancement Enhances DVC p1-important Important, aka current backlog of things to do research ui user interface / interaction labels Jul 27, 2019
@hhoeflin
Copy link

hhoeflin commented Aug 9, 2019

yes - very much agree. I am new to dvc, but the gc implementation is already making me nervous.

One additional note on gc. Should "metrics" files be treated differently than the rest? I agree that gc should easily collect old model files etc, but keeping the development of performance over time around would be nice.

@Suor Why do you think this is more about UI than implementation? I am not familiar in detail with the implementation, but the existence of the "--projects" parameter points to an interesting problem. What should happen to files that aren't referenced anywhere in the current git repository?

The conservative answer to this would be to not remove them. However this sounds like a major change in strategy, where only those items are considered for removal that
a) are known where they are referenced
b) are cleared for deletion (e.g. by being in a certain branch or by having never versions for the same file in the a branch)

Or am I misunderstanding this?

Thanks

@efiop
Copy link
Contributor

efiop commented Aug 11, 2019

Hi @hhoeflin ! Sorry for the delay.

Should "metrics" files be treated differently than the rest? I agree that gc should easily collect old model files etc, but keeping the development of performance over time around would be nice.

We usually recommend keeping metrics files in the git (e.g. dvc run -M) for that precise reason plus if you'd lose access to your cache at some point, you would still be able to use git to see what the metric was like.

I am not familiar in detail with the implementation, but the existence of the "--projects" parameter points to an interesting problem. What should happen to files that aren't referenced anywhere in the current git repository?

In the current state of things, those files are going to be removed, that is why gc needs --projects if you are using the same cache or remote for a number of projects. The perk of doing that is that you get inter-project deduplication of cache files, which might be handy 🙂 In other cases, we usually recommend keeping separate caches/remotes for your projects.

a) are known where they are referenced

Currently dvc cache/remote doesn't have any notion of which project it came from, it would require some additional complications to make it aware of that.

b) are cleared for deletion (e.g. by being in a certain branch or by having never versions for the same file in the a branch)

It is clear that we need different strategies for gc. Here are a few related tickets about other strategies: #678 #377 #155 #855
And we clearly need to adopt some reasonable default approach by gathering a number of strategies 🙂

Improving gc is around the top of our TODO list, so we will get to dedicating it more time soon. Thank you for the feedback, we really appreciate it! 🙂

@shcheklein
Copy link
Member

An excellent overview and answer @efiop, thank you :)

A few interesting questions have come to mind reading this. For example, should it be analyzing all commits in all branches in the repo by default (vs only all commits in the current branch)? How do we ensure that we have all the branches we need? How do we ensure (and should we?) that when we run it, the branch is correct?

Should it be for example, a global command (in a dvc get sense) - (e.g. dvc remote gc, dvc cache gc - just throwing some ideas) that takes a Git repo link (or multiple links) and analyzes it (them) holistically? By default saving everything - all commits/tags/branches/repos. Then a few options that will be cutting the scope down - branch names, tag names, dates of commits, etc.

@Suor
Copy link
Contributor

Suor commented Aug 16, 2019

One of the sources of issues is we are treating local gc and remote gc the same now, which is wrong for the very usual scenario of using both - in that case local cache is temporary, so we want to say what we want to keep. However for last/permanent storage such as remote cache we usually want to keep everything referenced by default. This is complicated by the existence of other scenarios like shared machine with shared local cache. We should somehow separate this cases. BTW, when we think like that #2037 becomes obvious.

Another thing is, given distributed nature of git, it's impossible to implement dvc gc against shared cache safely - one might just have pushed something only he has ref to and we will collect that unknowingly. So we shouldn't try to be perfect, we should try to be safe though. We may not collect recent files in shared cache for example.

@hhoeflin
Copy link

@Suor
Thank you for that explanation. From my point of view, the main issue of gc is a bit of a different one than you describe. It is its basic philosophy of "deleting everything that I don't know I need" rather than "deleting everything that I know I don't need".

@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Aug 19, 2019
@Suor
Copy link
Contributor

Suor commented Aug 20, 2019

Extracted from #2147:

Since git has a distributed nature and we are collecting everything not referenced we may remove something recently pushed and only referenced in git commits not available locally yet - not pulled or not even pushed yet by an author.

The sane method to circumvent this is providing a grace period: do not gc anything newer than N days. In the case someone has just pushed something one shouldn't and wants to remove that it should be still possible to do that:

dvc gc -c --grace-period=0

This is "I know what I am doing even thoufgh I just messed up flag" :)

@efiop
Copy link
Contributor

efiop commented Aug 22, 2019

@Suor grace-period won't work, because a user might use some models that he has generated earlier and by removing those we would break the project for him. Next logical step is LRU, but, once again it is not possible to properly implement it in a distributed nature. I think gc should only be run with the upstream project, to include all important references, and for forks users should use separate storage until their PR is merged.

@Suor
Copy link
Contributor

Suor commented Aug 23, 2019

@efiop LRU won't work either, if that model is in local cache it won't be accessed anyway. But grace period will work for most cases, the example with using some old model, which is still in use, but never referenced by any tag/branch is artificial.

@efiop
Copy link
Contributor

efiop commented Aug 26, 2019

@Suor I'm not saying that is not referenced by any tag/commit/branch, I'm saying that you might've added a dataset 2 years ago and still using it on your master. How would a grace period work in that case?

@Suor
Copy link
Contributor

Suor commented Aug 26, 2019

@efiop if it's referenced then it won't be collected, the grace period is needed to protect something only recently added and not referenced in a particular repo dvc gc is run in, but maybe referenced in other clone.

@Suor
Copy link
Contributor

Suor commented Aug 26, 2019

@hhoeflin "deleting everything that I don't know I need" is the right thing for local repo backed by remote. So the issue is the same approach in both situations.

@jorgeorpinel

This comment has been minimized.

@efiop

This comment has been minimized.

@ghost

This comment has been minimized.

@dashohoxha
Copy link
Contributor

gc should be safer with default options

I agree with this. By default gc should delete as little as possible (maybe nothing?), and as you add more options to the command it should delete more. This makes sense because if you delete something by mistake you cannot restore it back, but if you don't delete something by mistake you can run the command again with more options in order to delete it.

I would also say that the name gc is a bit cryptic and maybe misleading. I was wondering what it means, until I read the docs. Why not call it cleanup or delete?

Another thing is that a remote storage seems to me like a backup/restore place for the local cache. If we see it like this, then the only thing that has to deal with it is the script (or scripts, or command) that makes the synchronization between the local cache and the remote storage. Depending on the sync policy, everything that is deleted on the local cache may be deleted on the remote storage as well, or everything that is deleted on the remote storage will be deleted on the local cache as well.

The point is that dvc commands should operate only on the local cache and not touch the remote storage, except for the command(s) that synchronize them (the local cache and the remote storage).

@dashohoxha
Copy link
Contributor

I think that the only way to make the cleanup process both safe and flexible is to make it interactive. This means that the user should have enough information to decide which cached files to delete and which ones to keep.

When I look at the cache directory, I see only files and directories with strange names, and I am not able to understand which files in my workspace they represent, when they were cached, what is the git tag/branch to which they are related (if there is one), what is the git commit message (if there is one), etc. If I had this information, then I might be able to decide easily which ones of the cache files to delete and which ones to keep. I wouldn't mind even deleting them manually (plain rm command and other shell tricks).

So, the crucial question is: can DVC show such metadata information about the cached files? Maybe it stores such information on the state DB. Extracting this information from git would not be a good idea, both for performance reasons and because DVC is supposed to be able to work independently of git (with dvc init --no-scm).

In any case, the gc command (or whatever it is called) would be safer if it does not delete the files immediately, but shows a list of files that it is going to delete and asks for confirmation. But this list of files makes sense only if we know something about them and their context (the name of the file that was cached, the time, the commit message, etc.). Otherwise, if all we see is a list of checksums, we won't be able to make a useful decision.

@efiop efiop added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Sep 9, 2019
@efiop
Copy link
Contributor

efiop commented Sep 9, 2019

@dashohoxha

Extracting this information from git would not be a good idea, both for performance reasons and because DVC is supposed to be able to work independently of git (with dvc init --no-scm).

--no-scm case is the simplest one, anything that is used in the current workspace stays, rest is deleted.

Interactivity would be one of the next stages. We at least should be able to delete everything that is not used in any commit, it is basic stuff.


Ok guys. Looks like the first step in this would be to make dvc gc by default walk through all commits for all branches/tags/etc to collect everything used and then remove anything that is not in that list. Optimizations and grace periods will follow later, for now we can keep assuming that users have ceased using the shared cache or remote when running gc. Your thoughts? @iterative/engineering

@pared
Copy link
Contributor Author

pared commented Sep 9, 2019

@efiop
I agree with that. This seems easy to implement, and will make gc much safer than it is now. But grace period should follow soon after initial improvements.

@shcheklein
Copy link
Member

Mu suggestion are pretty much the same as before. I still think that we got stuck because we moved too far from the initial issue(s) in this ticket and expanded the scope way too much:

  1. Do first the thing we wanted to do - make gc safer. Two options - do not allow running gc in its current default at all (ask to specify -f or something). Or make --all-commits the default and introduce the --workspace, -w option. No significant changes to any options, symmetry, etc. Has nothing to do with remove discussed above.

  2. Move remove semantics (pretty much all the discussion above - its need overall, speific command names -remove or clean, what are other commands needed to make that workflow usable, etc) to a separate ticket. I don't see any reason to remove or redo gc itself and it feels that providing some additional remove-like functionality is a separate topic.

  3. (note relate directly to this ticket) As a next step for gc/pull/push/show metrics/etc start introducing -n and filters to define scope by names (tag names, commit names, etc, etc). This is a much needed change that we started implementing initially and went into this hole of discussing removes, etc. To my mind this is more important than even providing remove.


@skshetry some additional consideration, most of them should go into a separate ticket(s) to my mind:

Keeping things sound reverse thinking and a weird semantic to use when removing stuffs

To me there is no superior semantics. Both of them are clear and have their place. Remove when it's exactly clear what is supposed to be removed (you put some burden on user's shoulders in terms of somehow getting this knowledge) - e.g. some large dataset that is unused, keep when it's exactly clear what I'm interested in - like some scope of experiments (it correlates very well with what we use for push/pull/show/, etc - users operate in terms of experiments they need).

We should move away from the current gc terminology (it's a very git-centric term). Let's have dvc cache remove/clean.

I don't quite understand/feel statement. Could you elaborate, please? gc is a very precise term and usually intuition is that is counts/collects references in some way and then removes everything that is not referenced. This is exactly what out gc does. I also don't see why is it git-centric. To me it corresponds more with GC in programming languages :)

And, let's move config commands to dvc config or, separately discuss those.

Yes, let's try to stay focused. See above. I think the whole ticket is not about something completely else now.

Separate cleaning up remote and local cache: dvc remote vs dvc cache.

As @dmpetrov pointed we should provide an option to dvc cache to remove stuff in remote along the way (removing both in cache and remote effectively). It is a regular workflow and this is how gc operates now in -c`-rmodes. It's a good question if we need a separate command togc\clean\remove` on the remote alone? I don't know. As I mentioned above, my opinion - it should be a separate ticket. We try to solve to many things at once.

Of course, we prompt and try to be more transparent about removing things.

as we discussed in the #2498 , it looks like the agreement is to not use prompts but fail instead

@skshetry
Copy link
Member

skshetry commented Feb 17, 2020

Yes, instead of walking a step at a time, I proposed wild lot of changes, and that's a failure in my part. But, yes, we can do what you proposed as a first iteration. The discussions or questions, that I raised is just that, discussion.

I don't quite understand/feel statement. Could you elaborate, please? gc is a very precise term and usually intuition is that is counts/collects references in some way and then removes everything that is not referenced. This is exactly what out gc does. I also don't see why is it git-centric. To me it corresponds more with GC in programming languages :)

gc is about removing unused stuffs automatically. ref-counting is an implementation detail, and we also doing the same could be co-incidental (or, not :) ). And, unused in our case is debatable (related to history? workspace? heads?). That's why we are focused on safety issues.

In all of the programming languages and in git, gc is automatic, meaning, user is not notified of the changes nor the user tells the tool to do it (usually). So, what we are really talking about is, cache cleaning/removing.

Also, this stems from how the user is using gc: to clean/remove items from the cache. If they are explicitly telling to clean stuffs, it's good to offer support for removing stuffs.

P.S: I'm not trying to say, gc is not important. But, it feels people are really trying to remove stuffs off of the cache that they think is unused than removing garbage.

@shcheklein
Copy link
Member

And, unused in our case is debatable (related to history? workspace? heads?). That's why we are focused on safety issues.

I believe, it's a very precise definition in our case - everything that is not referenced in DVC-files anymore. And the scope of commits we collect references from corresponds one-to-one with all other commands in DVC - push/pull/metrics show, etc, meaning that is corresponds well with the higher level abstractions we operate - experiments.

gc is about removing unused stuffs automatically

Never felt this way regarding it being automatic. It feels like an implementation details on itself. If it's confusing to a lot of people, I'm fine to rename it to clean or something else. But keep the semantics. If we need remove - let's discuss it separately.

If they are explicitly telling to clean stuffs, it's good to offer support for removing stuffs.

All of this feels like unnecessary complication or at least does not make this workflow superior in any way. With "keep" semantics I don't care about old stuff, I just keep some experiments I care about - the most recent stuff. Also, I don't want to do additional commands to figure our what to remove. It feels too low level - like du -hs + rm. Not saying that it does not have its place, again. It just a separate remove like logic that we should be introducing separately.

Also, this stems from how the user is using gc: to clean/remove items from the cache. If they are explicitly telling to clean stuffs, it's good to offer support for removing stuffs.

that's the good point. And see my other responses - users use it in different modes.

@dmpetrov
Copy link
Member

@shcheklein and @skshetry you are going to the rabbit hole again. GC/not-GC is not the most important part here and it is not very important how it corresponds to programming languages. What is the important part - how to clean space?

If we just make GC safe - users won't be able to clean the space. So, we cannot take this step without introducing more aggressive strategies or at least without keeping the current one (probably as an optional). Also, the specific remove is needed dvc cache remove <datafile-or-dir>.

Actions that I'd suggest (the commands' names TBD) sorted by priorities:

  1. In this ticket. Replace the current default scenario to the safe one and keep the current one as an option (probably in the same command).
    1. (optional) Consider renaming dvc gc to dvc cache gc.
  2. Create a separate ticket for the specific remove dvc cache remove <datafile-or-dir>.
  3. Start introducing -n for others commands (see @shcheklein comment).
  4. Think about other gc/delete strategies.

@efiop
Copy link
Contributor

efiop commented Feb 17, 2020

Ok, had another round of discussion with @dmpetrov and @shcheklein . Here are the proposed first steps:

  1. Forbid dvc gc(no arguments) and create -w|--workspace flag to work as current behaviour; Remove confirmation prompt that we already have;
  2. For --all-tags and --all-commits assume -w automatically to match previous behaviour;
  3. Introduce -n(not sure about the best long option name) that will tell how much history commits to keep. So dvc gc -n 10 will keep 10 commits back in your workspace. dvc gc -n 10 -a will keep 10 commits back in all tags and the workspace and so on.

And the rest we can leave for the future discussion. What do you think guys, can we start with these steps?

@dmpetrov
Copy link
Member

@efiop thank you for the brief summary. Let's start with this.

PS: you can keep only (1) and (2) in this issue and make (3) as a separate issue if you think the scope is too big. Ups to you...

@skshetry
Copy link
Member

Ok, had another round of discussion with @dmpetrov and @shcheklein . Here are the proposed first steps:

  1. Forbid dvc gc(no arguments) and create -w|--workspace flag to work as current behaviour; Remove confirmation prompt that we already have;
  2. For --all-tags and --all-commits assume -w automatically to match previous behaviour;
  3. Introduce -n(not sure about the best long option name) that will tell how much history commits to keep. So dvc gc -n 10 will keep 10 commits back in your workspace. dvc gc -n 10 -a will keep 10 commits back in all tags and the workspace and so on.

And the rest we can leave for the future discussion. What do you think guys, can we start with these steps?

This is what you proposed last week as well, right @efiop? I stalled (eh? 😄) on it a bit as I wanted to discuss more on the usecases (ref: #2325 (comment)) that I felt like were important from reading this issue.

Though, we can start with this than getting stuck. Let's roll with these for now. 🙂

@pared
Copy link
Contributor Author

pared commented Feb 21, 2020

Introduce -n(not sure about the best long option name)

How about --tail? though that would not match -n, on the other hand -t could be confused with -T

@kenahoo
Copy link

kenahoo commented Mar 3, 2020

Coming late to this thread.

I expected to have a --dry-run option in dvc gc that would just report what would happen, without actually removing anything. That's similar in spirit to the "require --force to do anything" suggestion, but I think it's a little more straightforward and explicit.

I'd probably suggest requiring either --dry-run or --force to be explicitly specified, just like git clean does.

I also find the wording of the --cloud and --remote options confusing:

  • --cloud: "Collect garbage in remote repository"
  • --remote: "Remote storage to collect garbage in."

To me, that actually sounds like we will put the garbage into the remote repository, i.e. find garbage locally and upload it to the remote. I think if you just changed "in" to "from", that would help; otherwise I think the following would be more clear:

  • --cloud: "Remove garbage from remote repository"
  • --remote: "Remote repository to remove garbage from"

@casperdcl
Copy link
Contributor

git clean is a good point. I use it all the time but never without arguments - to the extent that I forgot that it raises an error without args. Though you could argue that makes it a little pointless (surely the default should be dry run).

@skshetry
Copy link
Member

skshetry commented Mar 3, 2020

@casperdcl @kenahoo, it's not really possible to provide much information in --dry-run because we only delete what we don't know (though we can probably dig into the older commits and try to find out more information).

I also find the wording of the --cloud and --remote options confusing:

Thanks for pointing that out. It seems to be a help message. I'll try making a PR this week for that.

@jorgeorpinel
Copy link
Contributor

that the only way to make the cleanup process both safe and flexible is to make it interactive
Interactivity would be one of the next stages

I just want to also vote for considering interactivity when running dvc gc without flags. Or with an -i option not to interfere with shell scripting.

Related discussion: iterative/dvc.org#1023 (comment)

@kenahoo
Copy link

kenahoo commented Mar 9, 2020

@skshetry I don't think --dry-run has to be hugely informative, it just mainly has to disable the actual deletion.

@Suor
Copy link
Contributor

Suor commented Mar 9, 2020

@kenahoo If you don't mind --dry-run to be almost as slow as a non-dry one then that is not hard to implement. The thing is we don't really know what we will delete until we list all the cache/all the remote cache, which is the slow part.

@dmpetrov
Copy link
Member

I just want to also vote for considering interactivity when running dvc gc without flags

Interactivity is a very good idea but not as a default behavior. It is a bit related to prompt\not-prompt problem (#2498) because it complicates scripting.

@efiop
Copy link
Contributor

efiop commented Dec 10, 2020

From #2451

Maybe a report of what was collected (and from where) would be more informative. This way the user could know something actually happened, without having to understand and check DVC internals before and after running the command.

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
enhancement Enhances DVC p1-important Important, aka current backlog of things to do research ui user interface / interaction
Projects
None yet
Development

No branches or pull requests