run: prompt when removing outputs #2187

ghost · 2019-06-25T03:35:44Z

Close #2027

Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.

Uses --remove-outs to prevent prompting.

dvc/repo/run.py

dvc/repo/imp_url.py

shcheklein · 2019-06-25T05:07:46Z

should we have an option for repro to pass down to the stages to skip the confirmation?

shcheklein · 2019-06-25T05:08:21Z

create a ticket for the documentation with the description of changes or update the docs please :)

dvc/repo/run.py

ghost · 2019-06-25T18:13:27Z

should we have an option for repro to pass down to the stages to skip the confirmation?

@shcheklein , this doesn't affect repro, it is run specific

dvc/stage.py

dvc/output/base.py

changed the code, pawel, now it's another behavior, please check it out!

dvc/output/base.py

Fix #2027 ---- ```bash mkdir data/ echo "foo" > data/foo.txt dvc run --outs data/ 'echo "hello" > data/hello' ``` The following command above assumes `data/` directory will exist, however, `dvc run` deletes all the outputs before running the command (this is the easiest way to check if the command is, indeed, creating the outputs), and in this case, `data/foo.txt` will also be deleted. This is error prone and can cause **data loss**. The following patch prompts the user when removing outputs that are not cached. To prevent it from prompting, use `--remove-outs`.

dvc/output/base.py

dvc/command/imp_url.py

shcheklein · 2019-07-23T23:12:44Z

dvc/remote/base.py

@@ -630,6 +630,9 @@ def exists_with_progress(chunks):
        return list(set(checksums) & set(self.all()))

    def already_cached(self, path_info):
+        if not self.cache:


why this change? was there a bug before?

output.remote doesn't have a path_info and crashes the whole thing, the alternative was to use output.cache.already_cached (as cache has path_info), however, when a remote is not configured (for example, non-cached external outputs) cache would be None.

Remembering these edge cases is error prone, so now you can use output.remote.already_cached and you'll be covered.

so, what is the difference with was here before? I'm not sure I understand this still.

We are calling stage.remove_outs on Stage.create (we are deleting the outputs to make sure the command is generating them)

# dvc/stage.py (create) # Removing outputs for reproducibility sake, # it is expected that the `cmd` will create them. if stage.cmd: confirm = kwargs.get("remove_outs", False) stage.remove_outs(ask=not confirm)

I added the ask parameter to remove_outs (previously, it just removed the outputs without asking (calling output.remove), but now, output.remove calls output.safe_remove if ask is set to true (this method asks the user for confirmation before it deletes the output).

# dvc/stage.py def remove_outs(self, ignore_remove=False, force=False, ask=False): """Used mainly for `dvc remove --outs` and :func:`Stage.reproduce`.""" for out in self.outs: if out.persist and not force: out.unprotect() else: logger.debug( "Removing output '{out}' of '{stage}'.".format( out=out, stage=self.relpath ) ) if force: ask = False out.remove(ignore_remove=ignore_remove, force=not ask)

# dvc/output/base.py def remove(self, ignore_remove=False, force=True): self.remote.safe_remove(self.path_info, force=force)

This is were things start to get confusing with the ask, force, and confirm.

For stage.remove_outs:

force means to remove persisting outputs.

ask means to ask for confirmation when removing outputs (by default set to false (to not affect the previously defined interface)

For output.remove:

force is passed directly to safe_remove, and it means to not ask for confirmation while "safe removing".

Now, output.remove calls remote.safe_remove.
For safe_remove to work, it needs a cache_dir (because safe remove checks if the output is on the cache, if it has no cache it crashes -- this was not a bug before because we were not using safe remove on stage.create but now, if the remote has no cache_dir it will break, that's why I'm checking if the cache exists and sending methods to the cache object instead of the remote one).

Imagine the following case:

dvc remote add ssh ssh://localhost dvc run --outs-no-cache remote://ssh/file.txt

The stage file is going to have one output.
This output doesn't require any cache.

This is where the if not self.cache is relevant.

I hope this helps, it took me a while to get everything on my head.

scripts/completion/dvc.zsh

dvc/exceptions.py

shcheklein · 2019-07-28T03:08:49Z

dvc/remote/local/__init__.py

@@ -195,7 +195,7 @@ def already_cached(self, path_info):
        if not current_md5:
            return False

-        return not self.changed_cache(current_md5)
+        return not self.cache.changed_cache(current_md5)


again, what's the difference here. All this relationship between remote, cache, and output is very confusing tbh.

@shcheklein , it is confusing, indeed. This solve the same problem explained above.

@MrOutis So why not self.changed_cache? It should do the same thing, no? Why do we need to access self.cache.changed_cache directly?

self is a remote with no path_info, due to the way outputs are instantiated:

self.remote = remote or self.REMOTE(self.repo, {})

If we call changed_cache directly it will fail on this one:

def checksum_to_path_info(self, checksum): return self.path_info / checksum[0:2] / checksum[2:]

@MrOutis I guess you meant that

return not self.cache.changed_cache(current_md5)

is different from

def changed_cache(self): if not self.use_cache or not self.checksum: return True return self.cache.changed_cache(self.checksum)

because in the self.changed_cache we are using self.checksum wich might not match current_md5?

@efiop , where did you get the self.use_cache and self.checksum? I'm not sure if we are talking about the same changed_cache 😅 there's one in output/base.py and another one in remote/base.py. I'm talking about latter one. (this change is on remote/base.py -- just clearing it out of the way, it might result confusing)

@MrOutis Aah, indeed, i was looking at output for some reason. Sorry! So looks like we've had a bug here before, right?

dvc/stage.py

shcheklein · 2019-07-28T03:25:13Z

dvc/stage.py

+                if force:
+                    ask = False
+
+                out.remove(ignore_remove=ignore_remove, force=not ask)


all this naming is super confusing - force, aks, confirm mean different things even in the same function sometimes

yep, can't agree more, @shcheklein , but changing this will require a big refactor, force means different things, for example: ask / prompt / confirm are different names for the same thing, the problem is that prompt is the module, confirm is the method and _ask is the implementation beneath confirm 😅

We can address this in another issue.

@MrOutis Seems like force has the meaning of something like remove_persistent? Maybe let's switch them up? Your ask -> force and force -> remove_persistent. Would that make sense?

@efiop , lol, it make sense, I'll do that ^

efiop · 2019-07-29T12:11:39Z

dvc/exceptions.py

-            "unable to remove '{}' without a confirmation from the user. Use "
-            "'-f' to force.".format(path)
+            "unable to remove '{}' without a confirmation from the user."
+            " Use '--force' (repro) or '--force-remove-outs' (run) to confirm"


Do we need to check that on repro though? Didn't we agree that it is allowed for repro to do that silently? --force for repro has a totally different meaning than --force-remove-outs.

🤔 that's true, @efiop , when --force is used with repro it reproduces the stage even if the stage didn't change.

I'm not sure why the error message was like that.

It might be a bug:

def run(self, dry=False, no_commit=False, force=False): if (self.cmd or self.is_import) and not self.locked and not dry: self.remove_outs(ignore_remove=False, force=False)

I think it should have been self.remove_outs(ignore_remove=False, force=force)

efiop · 2019-07-29T12:16:12Z

dvc/output/base.py

-    def remove(self, ignore_remove=False):
-        self.remote.remove(self.path_info)
+    def remove(self, ignore_remove=False, force=True):
+        self.remote.safe_remove(self.path_info, force=force)


But self.remote might not be the same as self.cache and self.cache is the one that actually has cache and can check if cache for that path_info is present. I'm pretty sure you've had a if self.use_cache here previously, no?

This is solved on this one:

def already_cached(self, path_info): assert path_info.scheme in ["", "local"] current_md5 = self.get_checksum(path_info) if not current_md5: return False - return not self.changed_cache(current_md5) + return not self.cache.changed_cache(current_md5)

It doesn't matter if you call it with remote or cache (It thought this was a better interface / less error prone) 😅 at the end, it will call the cache, the tricky part is to get that you can call cache.cache.cache.... multiple times and it will return the same instance

The if self.use_cache is on changed_cache already 🤔

ghost requested a review from efiop June 25, 2019 03:49

shcheklein reviewed Jun 25, 2019

View reviewed changes

dvc/repo/run.py Outdated Show resolved Hide resolved

shcheklein reviewed Jun 25, 2019

View reviewed changes

dvc/repo/run.py Outdated Show resolved Hide resolved

shcheklein reviewed Jun 25, 2019

View reviewed changes

dvc/repo/run.py Outdated Show resolved Hide resolved

shcheklein reviewed Jun 25, 2019

View reviewed changes

dvc/repo/imp_url.py Outdated Show resolved Hide resolved

pared previously requested changes Jun 25, 2019

View reviewed changes

dvc/repo/run.py Outdated Show resolved Hide resolved

efiop reviewed Jun 26, 2019

View reviewed changes

dvc/stage.py Show resolved Hide resolved

efiop reviewed Jul 23, 2019

View reviewed changes

dvc/output/base.py Outdated Show resolved Hide resolved

ghost requested a review from efiop July 23, 2019 01:41

efiop reviewed Jul 23, 2019

View reviewed changes

dvc/output/base.py Outdated Show resolved Hide resolved

efiop reviewed Jul 23, 2019

View reviewed changes

dvc/output/base.py Outdated Show resolved Hide resolved

Mr. Outis added 2 commits July 23, 2019 09:29

assume there will be always a cache when safe remove is used

287d34e

use correct terminology in safe_remove message

3059d63

efiop added the c3-small-fix label Jul 23, 2019

efiop assigned ghost Jul 23, 2019

prompt even if remote has no cache configured

8f3c7c4

ghost requested a review from efiop July 23, 2019 19:17

shcheklein reviewed Jul 23, 2019

View reviewed changes

dvc/command/imp_url.py Outdated Show resolved Hide resolved

shcheklein reviewed Jul 23, 2019

View reviewed changes

dvc/command/imp_url.py Outdated Show resolved Hide resolved

shcheklein reviewed Jul 23, 2019

View reviewed changes

dvc/command/imp_url.py Show resolved Hide resolved

shcheklein reviewed Jul 23, 2019

View reviewed changes

scripts/completion/dvc.zsh Outdated Show resolved Hide resolved

shcheklein reviewed Jul 27, 2019

View reviewed changes

dvc/exceptions.py Outdated Show resolved Hide resolved

rename --remove-outs to --force-remove-outs

b591cd5

shcheklein reviewed Jul 28, 2019

View reviewed changes

dvc/stage.py Show resolved Hide resolved

shcheklein reviewed Jul 28, 2019

View reviewed changes

efiop reviewed Jul 29, 2019

View reviewed changes

ghost closed this Dec 13, 2019

ghost deleted the fix-2027 branch December 13, 2019 23:31

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

run: prompt when removing outputs #2187

run: prompt when removing outputs #2187

ghost commented Jun 25, 2019 •

edited by ghost

Loading

shcheklein commented Jun 25, 2019

shcheklein commented Jun 25, 2019

ghost commented Jun 25, 2019

shcheklein Jul 23, 2019

ghost Jul 24, 2019

shcheklein Jul 28, 2019

ghost Jul 28, 2019

shcheklein Jul 28, 2019

ghost Jul 28, 2019

efiop Jul 29, 2019

ghost Jul 29, 2019 •

edited by ghost

Loading

efiop Jul 30, 2019

ghost Jul 30, 2019

efiop Jul 30, 2019

shcheklein Jul 28, 2019

ghost Jul 28, 2019

efiop Jul 30, 2019

ghost Jul 30, 2019

efiop Jul 29, 2019

ghost Jul 30, 2019

ghost Jul 30, 2019

efiop Jul 29, 2019

ghost Jul 30, 2019

ghost Jul 30, 2019

run: prompt when removing outputs #2187

run: prompt when removing outputs #2187

Conversation

ghost commented Jun 25, 2019 • edited by ghost Loading

shcheklein commented Jun 25, 2019

shcheklein commented Jun 25, 2019

ghost commented Jun 25, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jul 29, 2019 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost commented Jun 25, 2019 •

edited by ghost

Loading

ghost Jul 29, 2019 •

edited by ghost

Loading