run: warn and/or prompt user when deleting an output #2027
Problem
The command above assumes
mkdir data/
echo "foo" > data/foo.txt
dvc run --outs data/ 'echo "hello" > data/hello'
Notes
Idea
Always remove outputs during
Questions
@efiop , @shcheklein , what do you think? |
As @shcheklein noted, the naming might be confusing, so another suggestion was
Makes sense to me :)
It assumes that self is a cache, not just a dummy remote, which is reasonable since safe_remove should check against cache.
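To make the "check against cache" idea concrete, here is a minimal standalone sketch. It is not DVC's code: `remove_if_cached`, `file_sha256`, and the `cached_hashes` set are hypothetical names, and the choice of SHA-256 is an assumption made only for illustration.

```python
import hashlib
import os


def file_sha256(path):
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as fobj:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fobj.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_if_cached(path, cached_hashes):
    """Remove `path` only if its content is already known to the cache.

    `cached_hashes` is a hypothetical stand-in for a real cache lookup.
    Returns True if the file was removed, False if it was kept.
    """
    if file_sha256(path) not in cached_hashes:
        return False  # content would be lost, so keep the file
    os.remove(path)
    return True
```

The point of the sketch is only that removal is refused whenever the content cannot be restored from the cache.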
It is a good assumption to start with, but we might think about enabling that behaviour for repro too, since if you are running |
🤔 Interesting, @efiop , what about |
@MrOutis oh, right. That is a really good point about Ok, let's summarize which scenarios we are trying to protect users from:
Please feel free to add if I've missed some scenarios.
In the first case, there is no way we could remove it automatically. We need to either move it out of the way to something like
For 2 and 3, I don't see a good way we could force the user to commit things so they don't lose the old dvcfile.
For 2.a, there is no way we could force the user to git commit that output (especially since they might not want to do that). Both
For 2.b there is nothing we can do about it; it would be 100% the user's fault. It still makes auto-removal a little less viable, but we could stick with the same solution as in 1.
For 3, on checkout we currently assume that if it is already in the cache then we are fine, but that approach is prone to reference loss 🙁
So in summary, 1 and 2.b will have the same solution, and 2.a and 3 could use some imperfect tricks for auto-removal. |
Few questions:
As far as I understand, it's not about create only, right? It's about repro, etc. In general, it sounds like it should be part of run, not create. Maybe some refactoring is required here to separate or isolate stage execution properly, even if
============================
Now, back to the problem. Regardless of where it's happening, it looks like it's not safe to remove data that is not cached. Why don't we make a generic method that would list in advance the data that is going to be deleted and ask for a confirmation? No matter
Ideally there should be a single simple flag with some meaningful name that will be the same internally. I'm fine with
Would love to clean up the mess with all the options internally. For example, if we use |
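A rough sketch of what such a generic "list first, then ask" helper could look like. The names `list_files` and `confirm_removal` are made up for illustration and do not correspond to anything in DVC; the `ask` parameter exists only so the prompt can be replaced in tests.

```python
import os


def list_files(path):
    """Return every file under `path` (or `path` itself if it is a file)."""
    if os.path.isfile(path):
        return [path]
    found = []
    # os.walk yields nothing for a missing path, so the result is then empty.
    for root, _dirs, files in os.walk(path):
        for name in files:
            found.append(os.path.join(root, name))
    return sorted(found)


def confirm_removal(outputs, ask=input):
    """List what would be deleted and ask once before anything is removed."""
    doomed = [f for out in outputs for f in list_files(out)]
    if not doomed:
        return True  # nothing exists yet, so there is nothing to lose
    listing = "\n".join("  " + f for f in doomed)
    answer = ask(
        "The following files will be removed:\n" + listing + "\nContinue? [y/n] "
    )
    return answer.strip().lower() in ("y", "yes")
```

Nothing is deleted by the helper itself; the caller would only proceed with removal when it returns True.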
@shcheklein, agree with the refactor (I'll open an issue for it), right now:
I like the idea, the only inconvenience is when using
|
Yeah, the problem is that
don't see any problems. It should be analyzing the graph in advance and show the full list of files that are going to be destroyed. |
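For what "analyzing the graph in advance" might mean, here is a small sketch over a plain dictionary rather than DVC's real stage objects. The `stages` layout and the function name `outputs_at_risk` are assumptions, and paths are matched by exact string equality only (a file inside a directory output would not be detected).

```python
from collections import deque


def outputs_at_risk(stages, changed):
    """Collect the outputs of `changed` and of every stage downstream of it.

    `stages` maps a stage name to {"deps": [...], "outs": [...]}; a stage is
    downstream of another when one of its deps is among the other's outs.
    """
    queue = deque([changed])
    seen = {changed}
    at_risk = []
    while queue:
        name = queue.popleft()
        outs = stages[name]["outs"]
        at_risk.extend(outs)
        # Enqueue every not-yet-visited stage that consumes one of these outputs.
        for other, spec in stages.items():
            if other not in seen and set(spec["deps"]) & set(outs):
                seen.add(other)
                queue.append(other)
    return at_risk


# Hypothetical three-stage pipeline, used only as an example.
stages = {
    "clean": {"deps": ["data/raw"], "outs": ["data/clean"]},
    "train": {"deps": ["data/clean"], "outs": ["model.pkl"]},
    "report": {"deps": ["notes.md"], "outs": ["report.html"]},
}
print(outputs_at_risk(stages, "clean"))  # ['data/clean', 'model.pkl']
```

The resulting list is what a single up-front prompt could show before any stage is executed.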
@efiop , thanks a lot for such a great summary! Always prompting doesn't sound that bad after all. Let me know if you and @shcheklein are good with this, so I can start thinking about how to implement it. |
@shcheklein I think, I'm good with this 👍 |
@efiop, sorry for the late response, here's the sum-up:
Problem:
Summary:
Implementation:
Related:
|
@efiop, @shcheklein, the only missing part is deciding between moving outputs to tmp files vs removing them. If I recall correctly from offline discussions with @efiop, the problem with moving outputs is that it might confuse the user when moving them back to the original place after an unsuccessful operation, for example:
dvc init --no-scm
mkdir data
echo foo > data/foo
dvc run -o data/ "echo bar > data/bar"
Internally, it would do:
mv data/ data.tmp/
echo bar > data/bar
The previous command would be unsuccessful because data/ doesn't exist, so we would move the tmp directory back to the original location:
mv data.tmp/ data/
The user would see an error message about
Another downside would be that this
I can't think of other reasons. |
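The move-aside-and-restore behaviour described above can be sketched in a few lines. This is only an illustration of the idea under discussion, not how DVC works: `run_with_outputs_moved` is a hypothetical helper, and the stage command is modelled as a plain Python callable.

```python
import os
import shutil


def run_with_outputs_moved(out_dir, command):
    """Move `out_dir` aside, run `command`, and restore it if the command fails."""
    out_dir = os.path.normpath(out_dir)  # drop a trailing slash such as "data/"
    backup = out_dir + ".tmp"
    shutil.move(out_dir, backup)
    try:
        command()
    except Exception:
        # The command failed: discard any partial output and put the old data back.
        shutil.rmtree(out_dir, ignore_errors=True)
        shutil.move(backup, out_dir)
        raise
    # The command succeeded: the previous output is no longer needed.
    shutil.rmtree(backup)
```

With a command that writes into the output directory without creating it first, the call raises and the old directory reappears, which is exactly the potentially confusing sequence from the example.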
@MrOutis I would go with removing outputs like we discussed initially. A good prompt should be enough in this case. We can even mention the commands you need to run to save the data as part of this prompt. |
@shcheklein, okok, I'm good with that |
I'm trying to understand the priority of this issue. Are there other users who were complaining about the issue? Removing derivative data (outputs) does not seem like a big issue in contrast to removing sources. Even the topic starter was asking only about a warning. Also, introducing a user notice |
In this case it is expected that See also this discussion: https://discordapp.com/channels/485586884165107732/565699007037571084/620868507151892493 |
Hi! Reviving this 🧟
I've seen this come up in Discord now and then — keeping in mind I only get to read about 10-15% of all the questions there, so I'm pretty sure it's a common confusion. Example: https://discord.com/channels/485586884165107732/485596304961962003/735110378702241792 I haven't read the entire discussion above, some of which I assume may be outdated. But in general, I believe this behavior of deleting outputs, especially for directories, is kind of unintuitive or unexpected. Maybe it should need a |
Interesting discussion. Not sure the cleanest way to warn people without making the workflow more cumbersome. Anyway, this isn't on the current roadmap, so lowering priority to p2. |
The following command deletes data/clean -- it would be nice if there was a warning or something since there may already be data in data/clean. The solution was to modify the script so that the data/clean directory was created by the script.
dvc run -d data/raw/ -d src/featherize_data.py --outs data/clean/ python src/featherize_data.py
<<Hi @david8381 ! That happens because dvc run is trying to ensure that your command is the one creating your output.
So that when you run dvc repro later, it will be able to fully reproduce the output.
So you need to make your script create that directory.>>
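As an illustration of "make your script create that directory": the contents of src/featherize_data.py are not shown in this thread, so the helper below is a hypothetical stand-in that only demonstrates the pattern of creating the output directory before writing into it.

```python
import os


def write_output(out_dir, name, text):
    """Create `out_dir` if needed, then write one file into it."""
    # exist_ok=True makes this safe whether or not the directory already exists.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "w") as fobj:
        fobj.write(text)
    return path
```

A script written this way keeps working after the output directory has been removed, because it never relies on data/clean existing beforehand.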