Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run: prompt when removing outputs #2187

Closed
wants to merge 5 commits into from
Closed

run: prompt when removing outputs #2187

wants to merge 5 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jun 25, 2019

Close #2027

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


asciicast

Uses --remove-outs to prevent prompting.

@ghost ghost requested a review from efiop June 25, 2019 03:49
dvc/repo/run.py Outdated Show resolved Hide resolved
dvc/repo/run.py Outdated Show resolved Hide resolved
dvc/repo/run.py Outdated Show resolved Hide resolved
dvc/repo/imp_url.py Outdated Show resolved Hide resolved
@shcheklein
Copy link
Member

should we have an option for repro to pass down to the stages to skip the confirmation?

@shcheklein
Copy link
Member

create a ticket for the documentation with the description of changes or update the docs please :)

pared
pared previously requested changes Jun 25, 2019
dvc/repo/run.py Outdated Show resolved Hide resolved
@ghost
Copy link
Author

ghost commented Jun 25, 2019

should we have an option for repro to pass down to the stages to skip the confirmation?

@shcheklein , this doesn't affect repro, it is run specific

dvc/stage.py Show resolved Hide resolved
dvc/output/base.py Outdated Show resolved Hide resolved
@ghost ghost dismissed pared’s stale review July 23, 2019 01:41

changed the code, pawel, now it's another behavior, please check it out!

@ghost ghost requested a review from efiop July 23, 2019 01:41
dvc/output/base.py Outdated Show resolved Hide resolved
Fix #2027

----

```bash
mkdir data/
echo "foo" > data/foo.txt
dvc run --outs data/ 'echo "hello" > data/hello'
```

The following command above assumes `data/` directory will exist, however,
`dvc run` deletes all the outputs before running the command (this
is the easiest way to check if the command is, indeed, creating
the outputs), and in this case, `data/foo.txt` will also be deleted.

This is error prone and can cause **data loss**.

The following patch prompts the user when removing outputs that are not
cached. To prevent it from prompting, use `--remove-outs`.
dvc/output/base.py Outdated Show resolved Hide resolved
@efiop efiop assigned ghost Jul 23, 2019
@ghost ghost requested a review from efiop July 23, 2019 19:17
dvc/command/imp_url.py Outdated Show resolved Hide resolved
dvc/command/imp_url.py Outdated Show resolved Hide resolved
@@ -630,6 +630,9 @@ def exists_with_progress(chunks):
return list(set(checksums) & set(self.all()))

def already_cached(self, path_info):
if not self.cache:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change? was there a bug before?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output.remote doesn't have a path_info and crashes the whole thing, the alternative was to use output.cache.already_cached (as cache has path_info), however, when a remote is not configured (for example, non-cached external outputs) cache would be None.

Remembering these edge cases is error prone, so now you can use output.remote.already_cached and you'll be covered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, what is the difference with was here before? I'm not sure I understand this still.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calling stage.remove_outs on Stage.create (we are deleting the outputs to make sure the command is generating them)

        # dvc/stage.py (create)

        # Removing outputs for reproducibility sake,
        # it is expected that the `cmd` will create them.
        if stage.cmd:
            confirm = kwargs.get("remove_outs", False)
            stage.remove_outs(ask=not confirm)

I added the ask parameter to remove_outs (previously, it just removed the outputs without asking (calling output.remove), but now, output.remove calls output.safe_remove if ask is set to true (this method asks the user for confirmation before it deletes the output).

    # dvc/stage.py
    def remove_outs(self, ignore_remove=False, force=False, ask=False):
        """Used mainly for `dvc remove --outs` and :func:`Stage.reproduce`."""
        for out in self.outs:
            if out.persist and not force:
                out.unprotect()
            else:
                logger.debug(
                    "Removing output '{out}' of '{stage}'.".format(
                        out=out, stage=self.relpath
                    )
                )

                if force:
                    ask = False

                out.remove(ignore_remove=ignore_remove, force=not ask)
    # dvc/output/base.py
    def remove(self, ignore_remove=False, force=True):
        self.remote.safe_remove(self.path_info, force=force)

This is were things start to get confusing with the ask, force, and confirm.

  • For stage.remove_outs:
    • force means to remove persisting outputs.
    • ask means to ask for confirmation when removing outputs (by default set to false (to not affect the previously defined interface)

For output.remove:

  • force is passed directly to safe_remove, and it means to not ask for confirmation while "safe removing".

Now, output.remove calls remote.safe_remove.
For safe_remove to work, it needs a cache_dir (because safe remove checks if the output is on the cache, if it has no cache it crashes -- this was not a bug before because we were not using safe remove on stage.create but now, if the remote has no cache_dir it will break, that's why I'm checking if the cache exists and sending methods to the cache object instead of the remote one).

Imagine the following case:

dvc remote add ssh ssh://localhost
dvc run --outs-no-cache remote://ssh/file.txt

The stage file is going to have one output.
This output doesn't require any cache.

This is where the if not self.cache is relevant.

I hope this helps, it took me a while to get everything on my head.

dvc/exceptions.py Outdated Show resolved Hide resolved
@@ -195,7 +195,7 @@ def already_cached(self, path_info):
if not current_md5:
return False

return not self.changed_cache(current_md5)
return not self.cache.changed_cache(current_md5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, what's the difference here. All this relationship between remote, cache, and output is very confusing tbh.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein , it is confusing, indeed. This solve the same problem explained above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrOutis So why not self.changed_cache? It should do the same thing, no? Why do we need to access self.cache.changed_cache directly?

Copy link
Author

@ghost ghost Jul 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self is a remote with no path_info, due to the way outputs are instantiated:

        self.remote = remote or self.REMOTE(self.repo, {})

If we call changed_cache directly it will fail on this one:

    def checksum_to_path_info(self, checksum):
        return self.path_info / checksum[0:2] / checksum[2:]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrOutis I guess you meant that

        return not self.cache.changed_cache(current_md5)

is different from

    def changed_cache(self):                           
        if not self.use_cache or not self.checksum:    
            return True                                
                                                       
        return self.cache.changed_cache(self.checksum) 

because in the self.changed_cache we are using self.checksum wich might not match current_md5?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop , where did you get the self.use_cache and self.checksum? I'm not sure if we are talking about the same changed_cache 😅 there's one in output/base.py and another one in remote/base.py. I'm talking about latter one. (this change is on remote/base.py -- just clearing it out of the way, it might result confusing)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrOutis Aah, indeed, i was looking at output for some reason. Sorry! So looks like we've had a bug here before, right?

if force:
ask = False

out.remove(ignore_remove=ignore_remove, force=not ask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this naming is super confusing - force, aks, confirm mean different things even in the same function sometimes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, can't agree more, @shcheklein , but changing this will require a big refactor, force means different things, for example: ask / prompt / confirm are different names for the same thing, the problem is that prompt is the module, confirm is the method and _ask is the implementation beneath confirm 😅

We can address this in another issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrOutis Seems like force has the meaning of something like remove_persistent? Maybe let's switch them up? Your ask -> force and force -> remove_persistent. Would that make sense?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop , lol, it make sense, I'll do that ^

"unable to remove '{}' without a confirmation from the user. Use "
"'-f' to force.".format(path)
"unable to remove '{}' without a confirmation from the user."
" Use '--force' (repro) or '--force-remove-outs' (run) to confirm"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check that on repro though? Didn't we agree that it is allowed for repro to do that silently? --force for repro has a totally different meaning than --force-remove-outs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 that's true, @efiop , when --force is used with repro it reproduces the stage even if the stage didn't change.

I'm not sure why the error message was like that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a bug:

    def run(self, dry=False, no_commit=False, force=False):
        if (self.cmd or self.is_import) and not self.locked and not dry:
            self.remove_outs(ignore_remove=False, force=False)

I think it should have been self.remove_outs(ignore_remove=False, force=force)

def remove(self, ignore_remove=False):
self.remote.remove(self.path_info)
def remove(self, ignore_remove=False, force=True):
self.remote.safe_remove(self.path_info, force=force)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But self.remote might not be the same as self.cache and self.cache is the one that actually has cache and can check if cache for that path_info is present. I'm pretty sure you've had a if self.use_cache here previously, no?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is solved on this one:

    def already_cached(self, path_info):
        assert path_info.scheme in ["", "local"]

        current_md5 = self.get_checksum(path_info)

        if not current_md5:
            return False

-        return not self.changed_cache(current_md5)
+        return not self.cache.changed_cache(current_md5)

It doesn't matter if you call it with remote or cache (It thought this was a better interface / less error prone) 😅 at the end, it will call the cache, the tricky part is to get that you can call cache.cache.cache.... multiple times and it will return the same instance

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if self.use_cache is on changed_cache already 🤔

@ghost ghost closed this Dec 13, 2019
@ghost ghost deleted the fix-2027 branch December 13, 2019 23:31
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

run: warn and/or prompt user when deleting an output
3 participants