-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
run: prompt when removing outputs #2187
Conversation
should we have an option for |
create a ticket for the documentation with the description of changes or update the docs please :) |
@shcheklein , this doesn't affect |
changed the code, pawel, now it's another behavior, please check it out!
Fix #2027 ---- ```bash mkdir data/ echo "foo" > data/foo.txt dvc run --outs data/ 'echo "hello" > data/hello' ``` The following command above assumes `data/` directory will exist, however, `dvc run` deletes all the outputs before running the command (this is the easiest way to check if the command is, indeed, creating the outputs), and in this case, `data/foo.txt` will also be deleted. This is error prone and can cause **data loss**. The following patch prompts the user when removing outputs that are not cached. To prevent it from prompting, use `--remove-outs`.
@@ -630,6 +630,9 @@ def exists_with_progress(chunks): | |||
return list(set(checksums) & set(self.all())) | |||
|
|||
def already_cached(self, path_info): | |||
if not self.cache: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change? was there a bug before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output.remote
doesn't have a path_info
and crashes the whole thing, the alternative was to use output.cache.already_cached
(as cache
has path_info
), however, when a remote is not configured (for example, non-cached external outputs) cache
would be None.
Remembering these edge cases is error prone, so now you can use output.remote.already_cached
and you'll be covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, what is the difference with was here before? I'm not sure I understand this still.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are calling stage.remove_outs
on Stage.create
(we are deleting the outputs to make sure the command is generating them)
# dvc/stage.py (create)
# Removing outputs for reproducibility sake,
# it is expected that the `cmd` will create them.
if stage.cmd:
confirm = kwargs.get("remove_outs", False)
stage.remove_outs(ask=not confirm)
I added the ask
parameter to remove_outs
(previously, it just removed the outputs without asking (calling output.remove
), but now, output.remove
calls output.safe_remove
if ask
is set to true
(this method asks the user for confirmation before it deletes the output).
# dvc/stage.py
def remove_outs(self, ignore_remove=False, force=False, ask=False):
"""Used mainly for `dvc remove --outs` and :func:`Stage.reproduce`."""
for out in self.outs:
if out.persist and not force:
out.unprotect()
else:
logger.debug(
"Removing output '{out}' of '{stage}'.".format(
out=out, stage=self.relpath
)
)
if force:
ask = False
out.remove(ignore_remove=ignore_remove, force=not ask)
# dvc/output/base.py
def remove(self, ignore_remove=False, force=True):
self.remote.safe_remove(self.path_info, force=force)
This is were things start to get confusing with the ask
, force
, and confirm
.
- For
stage.remove_outs
:force
means to remove persisting outputs.ask
means to ask for confirmation when removing outputs (by default set to false (to not affect the previously defined interface)
For output.remove
:
force
is passed directly tosafe_remove
, and it means to not ask for confirmation while "safe removing".
Now, output.remove
calls remote.safe_remove
.
For safe_remove
to work, it needs a cache_dir
(because safe remove checks if the output is on the cache, if it has no cache it crashes -- this was not a bug before because we were not using safe remove on stage.create
but now, if the remote has no cache_dir
it will break, that's why I'm checking if the cache exists and sending methods to the cache
object instead of the remote
one).
Imagine the following case:
dvc remote add ssh ssh://localhost
dvc run --outs-no-cache remote://ssh/file.txt
The stage file is going to have one output.
This output doesn't require any cache.
This is where the if not self.cache
is relevant.
I hope this helps, it took me a while to get everything on my head.
@@ -195,7 +195,7 @@ def already_cached(self, path_info): | |||
if not current_md5: | |||
return False | |||
|
|||
return not self.changed_cache(current_md5) | |||
return not self.cache.changed_cache(current_md5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, what's the difference here. All this relationship between remote, cache, and output is very confusing tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shcheklein , it is confusing, indeed. This solve the same problem explained above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrOutis So why not self.changed_cache
? It should do the same thing, no? Why do we need to access self.cache.changed_cache directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self
is a remote with no path_info
, due to the way outputs are instantiated:
self.remote = remote or self.REMOTE(self.repo, {})
If we call changed_cache
directly it will fail on this one:
def checksum_to_path_info(self, checksum):
return self.path_info / checksum[0:2] / checksum[2:]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrOutis I guess you meant that
return not self.cache.changed_cache(current_md5)
is different from
def changed_cache(self):
if not self.use_cache or not self.checksum:
return True
return self.cache.changed_cache(self.checksum)
because in the self.changed_cache we are using self.checksum wich might not match current_md5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop , where did you get the self.use_cache
and self.checksum
? I'm not sure if we are talking about the same changed_cache
😅 there's one in output/base.py
and another one in remote/base.py
. I'm talking about latter one. (this change is on remote/base.py
-- just clearing it out of the way, it might result confusing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrOutis Aah, indeed, i was looking at output
for some reason. Sorry! So looks like we've had a bug here before, right?
if force: | ||
ask = False | ||
|
||
out.remove(ignore_remove=ignore_remove, force=not ask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all this naming is super confusing - force, aks, confirm mean different things even in the same function sometimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, can't agree more, @shcheklein , but changing this will require a big refactor, force
means different things, for example: ask
/ prompt
/ confirm
are different names for the same thing, the problem is that prompt
is the module, confirm
is the method and _ask
is the implementation beneath confirm
😅
We can address this in another issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrOutis Seems like force
has the meaning of something like remove_persistent
? Maybe let's switch them up? Your ask
-> force
and force
-> remove_persistent
. Would that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop , lol, it make sense, I'll do that ^
"unable to remove '{}' without a confirmation from the user. Use " | ||
"'-f' to force.".format(path) | ||
"unable to remove '{}' without a confirmation from the user." | ||
" Use '--force' (repro) or '--force-remove-outs' (run) to confirm" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check that on repro
though? Didn't we agree that it is allowed for repro
to do that silently? --force
for repro
has a totally different meaning than --force-remove-outs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 that's true, @efiop , when --force
is used with repro
it reproduces the stage even if the stage didn't change.
I'm not sure why the error message was like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a bug:
def run(self, dry=False, no_commit=False, force=False):
if (self.cmd or self.is_import) and not self.locked and not dry:
self.remove_outs(ignore_remove=False, force=False)
I think it should have been self.remove_outs(ignore_remove=False, force=force)
def remove(self, ignore_remove=False): | ||
self.remote.remove(self.path_info) | ||
def remove(self, ignore_remove=False, force=True): | ||
self.remote.safe_remove(self.path_info, force=force) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But self.remote
might not be the same as self.cache
and self.cache
is the one that actually has cache and can check if cache for that path_info is present. I'm pretty sure you've had a if self.use_cache
here previously, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is solved on this one:
def already_cached(self, path_info):
assert path_info.scheme in ["", "local"]
current_md5 = self.get_checksum(path_info)
if not current_md5:
return False
- return not self.changed_cache(current_md5)
+ return not self.cache.changed_cache(current_md5)
It doesn't matter if you call it with remote
or cache
(It thought this was a better interface / less error prone) 😅 at the end, it will call the cache
, the tricky part is to get that you can call cache.cache.cache....
multiple times and it will return the same instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if self.use_cache
is on changed_cache
already 🤔
Close #2027
Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.
Uses
--remove-outs
to prevent prompting.