Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run: warn and/or prompt user when deleting an output #2027

Closed
david8381 opened this issue May 21, 2019 · 19 comments
Closed

run: warn and/or prompt user when deleting an output #2027

david8381 opened this issue May 21, 2019 · 19 comments
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important research

Comments

@david8381
Copy link

The following command deletes data/clean -- it would be nice if there was a warning or something since there may already be data in data/clean. The solution was to modify the script so that the data/clean directory was created by the script.

dvc run -d data/raw/ -d src/featherize_data.py --outs data/clean/ python src/featherize_data.py

<<Hi @david8381 ! That happens because dvc run is trying to ensure that your command is the one creating your output.
So that when you run dvc repro later, it will be able to fully reproduce the output.
So you need to make your script create that directory.>>

@shcheklein shcheklein transferred this issue from iterative/dvc.org May 21, 2019
@efiop
Copy link
Contributor

efiop commented May 21, 2019

@efiop efiop changed the title setting a directory as an output with run command will delete the directory without warning run: warn and/or prompt user when deleting an output May 21, 2019
@efiop efiop added c5-half-a-day enhancement Enhances DVC p1-important Important, aka current backlog of things to do labels May 21, 2019
@efiop efiop assigned ghost Jun 11, 2019
@ghost ghost mentioned this issue Jun 25, 2019
2 tasks
@ghost
Copy link

ghost commented Aug 2, 2019

Problem

The command above assumes data/ directory exist:

mkdir data/
echo "foo" > data/foo.txt
dvc run --outs data/ 'echo "hello" > data/hello'
  • dvc run deletes the outputs before running the command.:

    • Behavior is error prone and can cause data loss
    • However it is the easiest way to check if outputs were created by the command
  • Output removal is done by calling stage.remove_outs

    • It happens during stage.run:
    # dvc/stage.py
    def run(self, dry=False, no_commit=False, force=False):
      if (self.cmd or self.is_import) and not self.locked and not dry:
          self.remove_outs(ignore_remove=False, force=False)
    • Also could happen on Stage.create if --remove-outs option is used:
    # dvc/stage.py
    # def create...
    if kwargs.get("remove_outs", False):
        logger.warning(
            "--remove-outs is deprecated."
            " It is now the default behavior,"
            " so there's no need to use this option anymore."
        )
        stage.remove_outs(ignore_remove=False)
        logger.warning("Build cache is ignored when using --remove-outs.")
        ignore_build_cache = True
    else:
        stage.unprotect_outs()
    • Both calls don't remove persisted outputs.
      • It can only happen during run if you use the --force flag

Notes

  • ⚠️ Implementation shouldn't affect repro

    • This will result in unwanted behavior and cumbersome UX
  • There's a remote.safe_remove method that ask for deletion if a path_info is not cached.

    • ⚠️ It assumes remote has a cache_dir
    • remote.safe_remove is only used on remote.checkout

Idea

Always remove outputs during Stage.create:

  • Commands using create:

    • imp
    • imp-url
    • run
    • add
  • Use safe_remove on stage.remove_outs / implement stage.safe_remove_outs

  • Add a command line option to force output removal without prompting

  • Add distinction between file and directory on safe_remove confirmation message

Questions

  • Should we reuse --remove-outs or name the option differently?

  • stage.remove_outs has a parameter force to remove persistent outputs,
    should we rename this to remove_persistent and use --unpersist instead of
    --force?


@efiop , @shcheklein , what do you think?

@efiop
Copy link
Contributor

efiop commented Aug 2, 2019

@MrOutis

Should we reuse --remove-outs or name the option differently?

As @shcheklein noted, the naming might be confusing, so another suggestion was --force-remove-outs, which is ok with me.

stage.remove_outs has a parameter force to remove persistent outputs,
should we rename this to remove_persistent and use --unpersist instead of
--force?

Makes sense to me :)

warning It assumes remote has a cache_dir

It assumes that self is a cache, not just a dummy remote, which is reasonable since safe_remove should check against cache.

Implementation shouldn't affect repro

It is a good assumption to start with, but we might think about enabling that behaviour for repro too, since if you are running repro, most of the outputs should already be in cache, so you won't get prompted for those and those that are not in cache are probably worth verifying if you are ok with deleting them or not.

@ghost
Copy link

ghost commented Aug 2, 2019

It is a good assumption to start with, but we might think about enabling that behaviour for repro too, since if you are running repro, most of the outputs should already be in cache, so you won't get prompted for those and those that are not in cache are probably worth verifying if you are ok with deleting them or not.

🤔 Interesting, @efiop , what about --outs-no-cache? (we remove them too)

@efiop
Copy link
Contributor

efiop commented Aug 2, 2019

@MrOutis oh, right. That is a really good point about dvc run too. Currently, we remove those silently. For dvc run asking would be fine, but for repro it would be really annoying, so anyone using --outs-no-cache would be forced to use --force-remove-outs with every dvc repro.

Ok, let's summarize which scenarios we are trying to protect users from:

  1. Placing some file from outside in place of some output file(cached or not cached) and running run or repro, so if it is deleted it would be a complete loss, with no way to get it back;
  2. Losing uncached output that was produced by the pipeline; Loss would occur if the user didn't track it with git(usually people do that to --outs-no-cache files, or at least I think so) and we either
    a) lose part of the pipeline(if user didn't git commit old dvcfile and then new dvc run removed old dvcfile or dvc repro overwritten it) or
    b) if the pipeline became broken for external reasons that were not caused by dvc itself.
  3. Losing cached output by losing the checksum reference in or with the dvcfile like in 2.a

Please feel free to add if I've missed some scenarios.

In the first case, there is no way we could remove it automatically. We need to either move it out of the way to something like out.back(this is really similar to what you've suggested previously) and notify the user about it or to explicitly confirm with the user that he wants us to delete this file.

For 2 and 3, I don't see a good way we could force the user to commit things so he doesn't lose the old dvcfile.

For 2.a, there is no way we could force user to git commit that output(esp since he might not want to do that). Both run and repro could check if output's checksum is the one that is in the old dvcfile, which would kinda prove that it came from the pipeline so if user did git commit things, then he would have a way to reproduce that output later if it is deleted automatically by us. We could detect if old dvcfile is tracked by git and unchanged, but what if someone did a git rebase or commit --amend? That would result in a loss, so in my opinion, these assumptions are not strong enough to automatically remove outputs. Another way to solve it would be to treat them as 1, but that would get annoying on dvc repro.

For 2.b there is nothing we can do about it, it would be 100% user's fault, but it makes auto-removal a little less viable still, but we could stick with the same solution as in 1.

For 3 on checkout we currently make an assumption that if it is already in the cache then we are fine, but that approach is prone to reference loss 🙁

So in summary 1 and 2.b will have the same solution, and 2a and 3 could use some imperfect tricks for auto-removal.

@shcheklein
Copy link
Member

Few questions:

Always remove outputs during Stage.create:

as far as I understand, it's not about create only, right? it's about repro, etc. In general, it sounds it should be part of run, not create. May be some refactoring is required here - to separate or isolate stage execution properly, even if create depends on it. (Note. To be honest I don't understand semantics of Stage.create. Can someone explain it to me :). I understand that there should be Stage.run, Stage.save, Stage.read, Stage._init_. All other logic should be moved out of Stage.*)

============================

Now, back to the problem. Regardless of where it's happening it looks like it's not safe to remove data that is not cached. Why don't we make a generic method that would list in advance the data that is going to be deleted and ask for a confirmation? No matter dvc repro, dvc run or anything else you are running.

Ideally there should be a single simple flag with some meaningful name that will be the same internally. I'm fine with -c | --confirm-remove-outs. force sounds too aggressive and dangerous in this case.

Would love to cleanup the mess with all the options internally. For example, if we use --overwrite-dvcfile it should be more or less the same throughout all code.

@ghost
Copy link

ghost commented Aug 3, 2019

@shcheklein, agree with the refactor (I'll open an issue for it), right now:

  • Stage.__init__ could be viewed as a namedtuple for Stage
  • Stage.create is where the magic happens, and turns "strings" into objects, for example, the outs and deps parameters are a list of paths, but this method converts them into their corresponding objects (Output, Dependency), it also verifies that the stage doesn't have any circular dependencies and other sorts of validation.

Regardless of where it's happening it looks like it's not safe to remove data that is not cached. Why don't we make a generic method that would list in advance the data that is going to be deleted and ask for a confirmation? No matter dvc repro, dvc run or anything else you are running.

I like the idea, the only inconvenience is when using --outs-no-cache (it would be asking every time you repro).

Would love to cleanup the mess with all the options internally. For example, if we use --overwrite-dvcfile it should be more or less the same throughout all code.
Yep, looks like we should create another issue for it

@shcheklein
Copy link
Member

Yeah, the problem is that Stage.create is doing too much. I'm fine with it initializing outputs/deps and even certain validations. But implementing actual execution logic inside it looks strange - like removing outputs for example.

I like the idea, the only inconvenience is when using --outs-no-cache (it would be asking every time you repro).

don't see any problems. It should be analyzing the graph in advance and show the full list of files that are going to be destroyed.

@ghost
Copy link

ghost commented Aug 7, 2019

@efiop , thanks a lot for such a great summary!

Always prompting doesn't sound that bad after all.
It worries me that automated scripts will start failing due to a missing flag (--confirm-remove-outs).
Besides that, I've no other complains to implement this safer alternative to run and repro.

Let me know if you and @shcheklein are good with this, so I can start thinking about how to implement it.

@shcheklein
Copy link
Member

@shcheklein I think, I'm good with this 👍

@efiop
Copy link
Contributor

efiop commented Aug 8, 2019

Sounds good. @MrOutis could you please write a final summary of what we need to do and how it should behave, please? Though, please be sure to prioritize #2319 over this. 🙂

@ghost
Copy link

ghost commented Aug 20, 2019

@efiop, sorry for the late response, here's the sum up

Problem:

  • dvc run deletes outputs prior running the script.
    The reasoning behind this behavior is to ensure the command in
    run/repro is the one creating the outputs.

  • Data lost could happen when:

    • Using a file not intended to be reproduced as an output,
      either by mistake or a misuse due to lack of knowledge about the
      side-effects of using run.

    • Loosing reference to the cache (i.e. DVC-file)

    • Not tracking --outs-no-cache elsewhere (i.e. git)

  • Things to consider when thinking about data lost:

    • If data was previously cached and you have the DVC-file,
      you can dvc checkout to retrieve the outputs.

    • If you have cached data but no DVC-file referencing it,
      it would need some grep -R -H <unique-content> .dvc/cache.
      Another way could be using the modification time to get an approximation
      of when the output was created, and try to retrieve a version based on
      those assumptions. I'll consider lossing the DVC-file as data lost.

    • DVC-files are tracked by Git, the only operation where DVC destroys
      a such files is when a user accepts deleting a previous version of it
      during dvc run, either by using --overwrite or responding yes to
      the prompt. However, the user needs to know the implications before hand.

    • In theory, if user's command is deterministic, dvc repro
      can generate the outputs again, however, it relies on the
      user specifying the --deps correctly and using a command/script
      that guarantees integrity.

Summary:

  • DVC should strive to be secure by default. (No data loss!)

  • The solution is to prompt anytime DVC is going to delete information.

    • We need to create an option to never prompt before removing,
      @shcheklein suggested --confirm-remove-outs.
  • We can't force the user to commit intermediate DVC-files or not
    cached outputs.

  • Other solutions:

    • Move outputs to a temporary location instead of removing them.

Implementation:

  • Modify stage.remove_outs to prompt when removing uncached outputs.
  • Add a --confirm-remove-outs option to all the methods using run:
    • imp
    • imp-url
    • run
    • repo

Related:

@efiop efiop removed the c8-full-day label Aug 20, 2019
@ghost
Copy link

ghost commented Aug 20, 2019

@efiop , @shcheklein , the only missing part is deciding between moving outputs to tmp files vs removing them.

If I recall correctly from offline discussions with @efiop , the problem with moving outputs is that it might confuse the user when moving them back to the original place after an unsuccessful operation, for example:

dvc init --no-scm
mkdir data
echo foo > data/foo
dvc run -o data/ "echo bar > data/bar"

Internally, it would do:

mv data/ data.tmp/
echo bar > data/bar

The previous command would be unsuccesful because data/ doesn't exist, so we will be moving back the tmp file to the original location:

mv data.tmp/ data/

The user would see an error message about data/ not existing but
clearly it exists.

Another downside would be that this mv operation happens implicitly, instead of being explicit about what DVC is doing (as in the case of removing the outputs).

I can't think of other reasons.

@shcheklein
Copy link
Member

@MrOutis I would go with removing outputs like we discussed initially. A good prompt should be enough in this case. We can even mention the commands you need to run to save the data as part of this prompt.

@ghost
Copy link

ghost commented Aug 24, 2019

@shcheklein, okok, I'm good with that

@dmpetrov
Copy link
Member

dmpetrov commented Sep 10, 2019

I'm trying to understand the priority of this issue. Are there other users who were complaining about the issue?

Removing derivative data (outputs) does not seem like a big issue in contrast to removing sources. Even the topic starter was asking only about a warning.

Also, introducing a user notice y/n is actually a very strong move. We will need to introduce --yes option (which is not a problem) and users should remember to use this option from their scripts all the time otherwise script stuck. This difference (script experience versus user experience) is an important issue for unix tools.

@dashohoxha
Copy link
Contributor

The following command deletes data/clean -- it would be nice if there was a warning or something since there may already be data in data/clean. The solution was to modify the script so that the data/clean directory was created by the script.

dvc run -d data/raw/ -d src/featherize_data.py --outs data/clean/ python src/featherize_data.py

In this case it is expected that dvc run will overwrite the outputs, and also data/clean/ is given explicitly. So, there is no need for a warning or a confirmation. Similar to the style of unix commands, we can assume that the user knows what he is doing.
However it might be nice to have an -i, --interactive option, so that the command can ask for confirmation in case it has to overwrite some data.

See also this discussion: https://discordapp.com/channels/485586884165107732/565699007037571084/620868507151892493

@jorgeorpinel
Copy link
Contributor

Hi! Reviving this 🧟

I'm trying to understand the priority of this issue. Are there other users who were complaining about the issue?

I've seen this come up in Discord now and then — keeping in mind I only get to read about 10-15% of all the questions there, so I'm pretty sure it's a common confusion. Example: https://discord.com/channels/485586884165107732/485596304961962003/735110378702241792

I haven't read the entire discussion above, some of which I assume may be outdated. But in general, I believe this behavior of deleting outputs, especially for directories, is kind of unintuitive or unexpected. Maybe it should need a --remove_outs flag (at least for directories), but at the very least a message + confirmation prompt seems appropriate to me.

@dberenbaum
Copy link
Collaborator

Interesting discussion. Not sure the cleanest way to warn people without making the workflow more cumbersome.

Anyway, this isn't on the current roadmap, so lowering priority to p2.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Feb 18, 2022
@mattseddon mattseddon closed this as not planned Won't fix, can't repro, duplicate, stale Mar 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important research
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants