-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] gc: change option to use single remove flags #3207
Conversation
8679cd4
to
c3b44f7
Compare
c3b44f7
to
907263b
Compare
dvc/command/gc.py
Outdated
all_commits=self.args.all_commits, | ||
all_branches="branches" in self.args.remove, | ||
all_tags="tags" in self.args.remove, | ||
all_commits="commits" in self.args.remove, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's pass remove
to the repo.gc
and let it deal with it. We need to keep API and CLI symmetric here, no reason to change one and keep the other unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree about the symmetricity. But, I find it better to use args rather than a collection of filters, kinda like buttons which you can turn on or off to be more aggressive/effective.
I don't have strong opinions though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skshetry well, but why not? :) Think about them as being used in future to remove specific references. E.g. remove=mytag
, which you will have to pass as remove=["mytag"]
to API anyways, so why bother with additional flags for repo.gc
, if you can repo.gc(remove=["commits", "tags", "branches"])
right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it makes sense from future, as it allows a lot of flexibility. I was just saying, all_commits=True, all_branches=True
are better interface than remove={"commits", "branches"}
, unless there are too many arguments to pass (not really flexible).
I was looking from user interface-side. I'll fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see three downsides to this, on top of @Efiops point:
- it's very confusing that
all_branches
inRepo.gc()
is somewhat but not entirely reverse toall_branches
inRepo.used_cache()/brancher()
- three params instead of one combining them, I was looking into reducing the number of params in our functions and combining them is one of the ways
- we try to keep
Repo
methods API close to cli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop I don't like changing brancher. The logic in both brancher and .used_cache()
should stay the same. There is no reason to translate cli/api to internal code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop additionally all that new logic you propose is more complex than existing one and does not solve anything that is required by this PR/issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor Let's be constructive here. What do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop wrote that below, but maybe to make it more clear:
- leave
.brancher()
,.used_cache()
and any non-cli non Repo api-methods code unchanged. - accept new options in
dvc gc
, raise someDeprecatedError
on the old ones - pass that to
Repo.gc()
as is - translate
remove={...}
toall_branches
and friends that.brancher()
and.used_cache()
understand inRepo.gc()
implementation.
This way only CmdGC
and Repo.gc()
would be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accept new options in dvc gc, raise some DeprecatedError on the old ones
No need, we are breaking the default behavior anyway, preserving old options would be hard and confusing. We need to drop them right away, without any DeprecationError's.
I think you are missing something here. There is no direct translation of current options into the new ones, so your proposition is impossible to implement without changing some of that. For example, dvc gc --remove=branches
needs to collect all used cache for all references except for all branches, so it would translate to something like git rev-list --all --skip $ALL_BRANCHES
, which requires adjustments for used_cache
and brancher
.
907263b
to
7708de1
Compare
7708de1
to
b9838b7
Compare
For the record: @skshetry left for a holiday, so I'm taking over here π |
@@ -244,8 +244,18 @@ def list_branches(self): | |||
def list_tags(self): | |||
return [t.name for t in self.repo.tags] | |||
|
|||
def list_all_commits(self): | |||
return [c.hexsha for c in self.repo.iter_commits("--all")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a misuse of iter_commits
, where passing --all
to it worked accidentally because it was passing it as rev
to rev-list
. New implementation uses rev-list
directly, which should be significantly faster (no benches yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, previous implementation was silently keeping refs/remotes
and refs/stash
as well, which might or might not be what we want, considering that we don't currently have any means of deleting cache for those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it was previously simply "keep anything, which is referenced from git". Now it's more complicated )
all_branches=self.args.all_branches, | ||
all_tags=self.args.all_tags, | ||
all_commits=self.args.all_commits, | ||
remove_all_tags=self.args.remove_all_tags, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opted for explicit flags to avoid confusion such as previous --remove=tags,commits
(where it is not clear what commits
are and why you are not able to pass specific tag or branch to it) and also to simplify completion.
@@ -92,16 +92,26 @@ def test(self): | |||
|
|||
self._check_cache(4) | |||
|
|||
self.dvc.gc(all_tags=True, all_branches=True) | |||
self.dvc.gc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look into converting to pytest and possibly adding new tests soon.
else "and all git branches", | ||
history="" | ||
if self.args.remove_all_history | ||
else "and their history", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the phrase weird - many ands. Maybe just list things instead:
This will remove all cache not referenced from:
- the working tree
- all branch heads
...
? Will make it more glanceable, the changing items will be always in the same place, any user already familiar with gc, will be simply skipping the entering phrase without loose of any meaning.
"--remove-all-history", | ||
action="store_true", | ||
default=False, | ||
help="Remove cache for all history of all branches and tags.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I recall, your intention was to remove history by default. Why did you change your mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor To keep it as safe as possible for now π And, well, I implemented a proper approach by excluding refs when not deleting full history.
revs = scm.list_all_commits( | ||
exclude_all_tags=not all_tags, | ||
exclude_all_branches=not all_branches, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes .brancher
and .used_cache()
interfaces lying ones, all_commits
no longer means all commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also makes dvc gc --remove-all-branches --remove-all-tags
approaching noop for many scenarios, but not quite noop, a confusing situation. Suppose you make things iteratively and last commits only about some polish like README adjustments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the naming brancher
is at fault here. It should really have been something like iter_commits()
instead.
But --remove-all-branches --remove-all-tags
is not a noop, it will exclude tags and branches from the refs for which it iterates through. Hence why I've mentioned an issue with refs/remotes
and refs/stash
etc. We should either exclude those by default or add an option to get rid of those. For now this is a "safe" approach, that can be revisited iteratively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, it's even more confusing. You need a whole paragraph of text to explain how it works, but still it's usually just a noop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor Well, that is because of refs/remotes
and refs/stash
. Will need to handle them somehow. To me seems like a good idea to just drop them by default and only keep refs/heads and refs/tags. What do you think?
Looking at this I see it becoming more and more weird. I hate to crush the party, but we should consider closing it with the original ticket. So far I see only one advantage to whatever we have now - it's safer for no upstream or
What we should care instead probably:
EDIT. Those things, while making more sense to me, are also not that important probably. It feels that we are inventing the problem or just elevating its priority for no good reason. So the right approach might be saving the effort here. |
Some background, the way I remember/see this. The whole things (interestingly enough) originates from the two discussions:
(The discussion went into questions around symmetry, and I still believe symmetry is important if we can preserve it)
To be precise, they were concerned about the default behavior. But as I discussed with them providing an option to keep some N last commits should be enough. Though some folks were complaining about the default behavior besides ODS guys. I agree that changing the default behavior creates a mess, but it also looks like it not exactly necessary to solve the problems pointed by the users. For example, we can do something like:
This way it can be safe enough, no need to change names/behavior, we still reuse the |
@shcheklein my impression now is that people express their concerns with e.g. non-safe defaults but they don't really care about it. I mean they don't do any followups or even +1s on them, which means this is not bugging them in the long run. We are the only ones still pushing it. I agree that those changes might be beneficial just not that necessary, thus I was talking about saving the effort. |
Ok, so we are lacking a complete solution here, which was expected, as this change was only meant to make the first iteration on our path to good gc. Something safe that we could jump to and then chip away from it with new options. But now the discussion is again about the broader topic, which will again result in it just hanging out there for a few more months more. Current gc suffers from the same issues solved in this new version, but has old cryptic options that won't allow us to do something like I know that with that example I'm making something out to apply to current option list, but the reason I do that is that in the next step you could easily apply that to potential That being said, it is clear that we also need to improve user awareness, which should be solved by some So I think we should bite the bullet, switch to the safest behaviour and start iterating on this with a list of options that we pretty much already agree on like preserving data sources, freeing up local cache if there is stuff on the cloud already, separating cloud and local gc better, introducing history depth option etc. If there will be errors, so be it, we could always improve or switch behaviour, at least we will get more experience and will know better what we want and what we need, which will fuel next iterations. Otherwise, we will be "saving the effort" indefinitely and won't move anywhere, not gaining any new feedback/info and so being stuck in the discussion forever flying in our own fantasies. The history of the current gc ticket is a clear illustration that we are stuck, we need to stop talking the talk and finally walk the walk, to iterate and collect new feedback. |
@efiop I don't agree that most people want to remove something specific. People usually want to simply free up some space, which Bottom lines:
I understand that there is an invested effort/code/time here, I was pushing this for a long time too, but now I think it's not worth it, both overall and this particular PR. EDIT. TLDR. We've spawned a monster. |
@Suor I have never ever heard any user ask how to only keep something and delete the rest in all of these years, have you? Maybe @shcheklein or @dmpetrov have a different experience. |
@efiop I think there is a natural bias - The best way to answer here is to try to understand what actually users want in their workflow. Probably again, not one can come up with a 100% correct answer, unfortunately. I think both approaches would be valid in different cases. My 2cs here - keep semantics is more user friendly if you can put it that way. I care about certain things from my domain (experiments, models, datasets and whatnot) to stay on my machine because I still iterate with them. And I don't care about anything else. Remove feels more advanced - I need to understand now what exactly do I want to remove (either do a negation from what I want to keep) or recall what takes most of the space and go remove it. I definitely recognize the lack of a "pinpoint" remove operation - few users have been asking - how to un-add something, for example. But it feels it should be solved separately? |
@shcheklein How I imagine this. I'm a DS and I have created a new branch People seem to rely on remotes as a backup and just push stuff as they go and nuke the whole local cache whenever it is taking too much space, as they don't understand current |
how about
so, we change it, right? we provide all-commits, last N commits, branches, tags, etc. Next step would be to provide filters on tags and branches - would love to better explorer the "they don't understand" part, for now I tend to agree that there are not that much evidence for this. saying all that, I understand that there is not one-fits-all solution in this case - some might prefer keep, some remove semantics. In some case it would easier to write in one, in other cases in another. But, I'm not sure we have strong enough evidence that "remove" is superior to the extent to jump into breaking the symmetry, default behavior and redoing all options. May be we should try to come with more examples to start feeling this better? (but again, it can be an argument to slightly change it for now, not changing the whole semantics of it?) |
If you delete the branch prior to that - yes, but it is not a "no-brainer" command. And it doesn't extend to the aforementioned removal of particular dvc-file.
No, it doesn't have to be symmetrical with metrics/pull/push, really, as they have different purposes.
Wasn't it the previously constantly discussed argument about current
Well, it has been a couple of years already, and the feedback we've received was about gc being dangerous and people asking how to delete something in particular, at least in my experience. I think we need to iterate based on what we have and not freeze this ticket again for another couple of years.
Well, we could forbid I'll try to prepare more examples, but looks like we are already out of focus here and this is turning back into a large discussion... Hope we will agree on some action points... |
Interesting discussion... and it is more than one year long π A lot of good points for each of the options.
My major concern with GC - I have a strong feeling that we are still lacking a big picture and don't fully understand the scenarios. The only thing that we can say with confidence - the current GC does not solve users' problems. Without the big pictures, it is quite dangerous to move forward with the current approach because it is quite opinionated and is based on the assumption that we can implement all possible scenarios (with probably
It might be more efficient and much safer to implement a very basic, low-level GC-option for removing specific cache items. I can imagine a command With this specific GC-commands, users will be well-equipped for implementing their own GC-scenarios. So, we can identify common GC-scenarios and implement them later as high-level options like I'd suggest reverting the current change and implementing the low-level GC-option. |
Also, I quite donβt understand why we care about a safe approach so much in the context of GC. The core propose of GC is removing data. It is not safe. I think it was a mistake (my mistake btwβ¦. π) to call this command GC. Today PS:
So, introducing a |
Not really it would be the same speed. You can't really delete whatever is referenced from some branch because anything referenced from there might also be referenced from somewhere else. So the implementation is always the same:
This is why a proper unadd is such a complicated operation.
The issue always was that it is not safe, not that it's non-intuitive. At least this is how I understood that all the way.
I doubt this. It looks like most users are ok with it, some people worry occasionally about non-safe defaults, but they don't follow up on their worries, which makes me think the issue is theoritical and we are the only ones really pushing it. |
It looks like different people have a different understanding and expectations of GC command. Some people (@efiop and I'm) trying to clean the cache and cloud/remote to solve user issues with space. While others (@Suor and @shcheklein) are trying to preserve more "keep" semantic which makes sense for GC name. To my mind, the actual problem we need to focus on is cleaning space (local and cloud) and we should give up with the GC name. Let's introduce a new subcommand to With this command, we can:
UPDATED cmd name: |
@dmpetrov Thanks for the feedback! I really like your |
From my 1-1 with @efiop: The remove specific branch or file commands won't care if same checksum is referenced anywhere else in the history, they will assume a user knows what he or she is doing. |
Closing this fo |
Related to #2325.
β Have you followed the guidelines in the Contributing to DVC list?
π Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
β Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.
Thank you for the contribution - we'll try to review it as soon as possible. π