Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prompt or not to prompt? #2498

Closed
dmpetrov opened this issue Sep 14, 2019 · 25 comments
Closed

Prompt or not to prompt? #2498

dmpetrov opened this issue Sep 14, 2019 · 25 comments
Labels
discussion requires active participation to reach a conclusion

Comments

@dmpetrov
Copy link
Member

dmpetrov commented Sep 14, 2019

EDITED 9/15/19

There are a couple of discussions around prompting for user input from CLI. This issue-question tries to summarize the rules around prompting.

Now, DVC asks y/n questions from the command line. We have to introduce --yes option (which is not a problem) and users should remember to use this option from their scripts all the time otherwise script stuck. This difference (script experience versus user experience) is an important issue for Unix tools.

Traditional Unix-tools do not ask questions (and do not break the scripting experience) which leads to overwriting files or aborting commands. There is an opinion that DVC should behave the same way. The backside of this - it might lead to data loss.

Examples:

  1. [Overwriting] cp, tar - will overwrite files with
  2. [Aborting] git checkout - fails if there is a conflict
  3. [Prompt] there are some exceptions with prompt: unzip, apt-get. However, zip is not a traditional Unix tool, it was ported from DOS/Windows and follows Windows rules.

First, we need to decide if prompting is okay. Second, what are the set of rules for DVC for prompting, overwriting and aborting command?

The first rules summarization from @shcheklein #2497 (comment)

This is my take on abort vs prompt:

  1. They both give overhead when you use them in scripts. You have to use some CLI argument to force.
  2. If script is not written right (without set -eux) abort is less safer choice.
  3. If script is written right, prompt is more annoying since script hangs.
  4. The biggest difference for us in terms of workflow and usability comes from some long-running operations (as opposed to Git which is very fast in general). In our case it might be the case that we calculate status and/or md5 before even showing the prompt. Abort does not give a chance to continue - you have to run it second time.

Let's discuss the rules.

@ghost
Copy link

ghost commented Sep 18, 2019

We can use conventions as guidelines, but in the end, it is important that we consider the usefulness of prompting in our own terms.

Also, there's no consistency on how to skip those questions (--force vs --yes vs --overwrite-dvcfile).

A quick grep for prompt showed that we ask for confirmation on the following methods:

  • safe_remove: "File is going to be removed. Are you sure you want to proceed? [y/n]
  • _collect_used_dir_cache: "Missing cache for directory. Cache for files inside will be lost. Would you like to continue? [y/n/]"
  • gc: "This command ... (is dangerous). Are you sure you want to proceed? [y/n]"
  • _is_outs_only: "Are you sure you want to remove dvcfile with its outputs? [y/n]" (on dvc remove --purge)
  • destroy: "This will destroy all info about pipelines... yada yada. Are you sure you want to continue? [y/n]
  • check_can_commit: "dependencies changed. Are you sure you commit it? [y/n]" (:warning: there's a typo here)
  • stage.create: "dvcfile already exists. Do you wish to run the command and overwrite it? [y/n]"
  • stage.reproduce: "Going to reproduce dvcfile. Are you sure you want to continue? [y/n]" (only if --interactive)

From those prompts, I can point out the following:

  • All the prompts are "guarding" the user from potential data loss
  • Some prompts are on commands that are destructive by itself (I mean, don't expect dvc destroy or dvc remove to be safe)
  • I'm not sure if the user can predict when the command will result in some data loss
    • At least for me, it is still not that straight forward to think that dvc run -o foo will remove foo
    • Some of those prompts happen in specific cases (the ones if private methods)
    • What is data loss? Sometimes, files are in the cache but there's no reference (dvcfile) to them, is this data loss?

With that being said:

  • If we are going to remove the prompts, we should make the commands more predictable (better documentation, different naming, different levels of verbosity, etc.)
    • Tools were you can predict the behavior are safer than the ones that you can't (at least, you develop trust)
  • I like dvc's flexibility, but I can't imagine "designing" a tool for a lot of purposes and environments.
  • It shouldn't take us a lot to go through all the commands and discuss them.
  • It would help to identify common use cases for it (scripts, files sizes, workflows, etc.)

@shcheklein
Copy link
Member

@dmpetrov @MrOutis nice summaries, thanks!!

Just a few comments from the top of my head:

and do not break the scripting experience

To my mind and in my experience aborting is very similar to prompting in terms of scripts. You have to take that into account and remember to provide an option. Aborting can be even more dangerous because people tend to forget to put set -uex and script goes even if there was a problem in the middle.

there are some exceptions

heroku CLI is another example, even if you specify explicitly that you want to do something it will explain what is going on ask you to put a name of your project (repeating their Web UI) before destroying something for you.

@dashohoxha
Copy link
Contributor

Now, DVC asks y/n questions from the command line. We have to introduce --yes option (which is not a problem) and users should remember to use this option from their scripts all the time otherwise script stuck.

For me it is not a problem if a command asks for confirmation (or is interactive), as long as this can be overridden by a --yes or --force option.

For example the command apt install in debian/ubuntu. If it has to download lots of files it asks you for confirmation (showing the list of packages that is going to be installed). But in a script you can called it like apt install --yes ....

It shouldn't take us a lot to go through all the commands and discuss them.

I also think that discussing the commands one by one is better than formulating some general (abstract) principles/rules and trying to follow them strictly.

@dmpetrov
Copy link
Member Author

@dashohoxha I agree that we need to discuss commands one by one to make actionable decisions. However, we need to formulate basic principles to simplify the decision-making process. It should help to each of the team members to make and explain these UX decisions without any holy wars.

Data file importance

The biggest issue related to prompting is data loss. Importance of data files is different. In a high level, we can separate two data types by the importance:

  1. critical data or data sources which DVC cannot recover. In some cases, data might be completely unrecoverable for users (no others copies).
  2. not critical data or data derivatives/outputs which can be reproduced by DVC. But it might require some time and resources from a user.

In general, it is better not to lose any of the data types. However, it is okay to remove not critical data (data derivatives\outputs) in order to avoid prompting or simplifying UX (not to introduce additional options). Failing is always an option of keeping data.

Examples

DVC aligns a lot to Git and we expect users to have Git experience. It will be a mistake not to reuse the knowledge and behavior patterns that users have built with Git. In general, we should use Git UX as a guiding star when we make decisions about UX. Every time we don't agree on something we should ask ourselves a simple question - how it should look like if it becomes a part of a Git. If there is no clear answer, I'd recommend finding analogies with classical Unix commands such as ls, rm etc which most of our users should be familiar with.

Based on my understanding, Git avoids prompting as much as possible (see discussion in #2497). In some exceptional cases (merge, rebase), dialogs are needed. These exceptions are not just prompted questions, they are full dialogs with multiple questions.

@dmpetrov
Copy link
Member Author

@MrOutis thank you for the list of methods when DVC ask questions. Is it possible to find DVC commands and scenarios/cases when it is happening?

@dashohoxha
Copy link
Contributor

@dmpetrov The principles that you lay out are simple and reasonable, I completely agree with them.

It should help to each of the team members to make and explain these UX decisions without any holy wars.

My opinion is that developers (in general) are a little bit biased towards the software that they are building, and it might be difficult for them to look at the software from the point of view of a user that knows nothing about implementation. Once you know all the implementation details you can't become a novice again.

So, it might be useful if someone is selected as a UX expert, who decides about the details of user interface and user interaction. The others should try to convince him about their point of view, but at the end it is his decision that prevails. This would avoid any holy wars and would bring consistency.

@ghost
Copy link

ghost commented Sep 27, 2019

yes, @dmpetrov ! I can come up with some scenarios, I'll try to do it soon

@ghost ghost added discussion requires active participation to reach a conclusion and removed question I have a question? labels Sep 30, 2019
@ghost
Copy link

ghost commented Oct 1, 2019

@dmpetrov , it is taking a while 😅 !

⚠️ this is a work in progress ⚠️


checkout a file without cache:

dvc init --no-scm
echo 'foo' > foo
dvc add foo
rm -rf .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00
dvc checkout

# WARNING: Cache 'd3b07384d113edec49eaa6238ad5ff00' not found. File 'foo' won't be created.
#   0%|          |Checkout                                                                                                 0/1 [00:00<?,     ?file/s]
# file 'foo' is going to be removed. Are you sure you want to proceed? [y/n]

remove "redundant files" during a directory checkout:

dvc init --no-scm
mkdir directory
echo 'foo' > directory/foo
dvc add directory
echo 'bar' > directory/bar
dvc checkout

# 0%|          |Checkout                                                                                                 0/1 [00:00<?,     ?file/sfile 'directory/bar' is going to be removed. Are you sure you want to proceed? [y/n]

checkout a file that changed:

dvc init --no-scm
echo 'foo' > foo
dvc add foo
echo 'bar' > foo
dvc checkout
#   0%|          |Checkout                                                                                                 0/1 [00:00<?,     ?file/s]
# file 'foo' is going to be removed. Are you sure you want to proceed? [y/n]

@jorgeorpinel
Copy link
Contributor

it is still not that straight forward to think that dvc run -o foo will remove foo

Agree! Vote for prompt in this case.

@dmpetrov
Copy link
Member Author

Thank you @MrOutis for funding these scenarios!

Let's go one by one.

  1. checkout a file without cache

Is there are cases when it makes sense to checkout data files without checking out a particular one (answer Yes case)? It seems like in most of the cases this is going to happen when a user forgets to run dvc pull\fetch - it is fine to fail in this case (wee can also show a proper message like Data file 'foo' is missing in cache. Please run 'dvc fetch' first.). Also, I can imagine when dvc fetch won't give you the data file (corrupted cache and remote) - in this "debug" mode partial checkout can be done by *.dvc files. So, in both of cases, failing seems like a better option.

I propose to fail instead of prompting

  1. remove "redundant files" during a directory checkout:

It seems like there are no conflicts to fail or prompt questions. DVC shouldn't remove files that are not committed. A similar Git example:

git init
echo foo > foo
git add foo
git commit -m 'foo commit'
echo bar > bar
git checkout ffb38d8 # the current commit

No changes, no errors. bar is in its place. I'd expect the same behavior from DVC.

I propose to avoid prompting and outputting any error messages or warning

  1. checkout a file that changed

It looks the same as the previous case (even though the implementation might be different). So, checkout should not raise any errors or prompts or even error messages.

I propose to avoid prompting and outputting any error messages or warning

Please take a look at my proposals. If there are no concerns, I'll create an issue.

@MrOutis are these three the only prompts that DVC has or we can find more?

@dmpetrov
Copy link
Member Author

A few more things...

It is okay to remove data without prompting if user runs destructive commands like destroy and remove.

Regarding dvc run. It might be not so important as the other commands (run is just a wrapper). But I'd suggest removing outputs with no prompting because it removes data derivatives, not data sources.

@jorgeorpinel
Copy link
Contributor

dvc run... removes data derivatives, not data sources

I thought if you have some files manually added to a directory and then use that dir as -o for run, it will remove your files with no prompt. These could be some other arbitrary files you needed and have no way of recovering.

@dmpetrov
Copy link
Member Author

if you have some files manually added to a directory and then use that dir as -o for run, it will remove your files with no prompt.

Good point @jorgeorpinel. It would be great to keep the information about data sources (like source: true in dvc-file) and raise an error if run tries to remove it. But even in this case, I'd prefer to fail and ask a user to remove the data source manually first (through dvc remove added-source-file.dvc).

How do you @iterative/engineering think, is it a good idea to introduce a new 'source: true' parameter in DVC file specifically for the cases not to removing them automatically. I'm wondering, can this information be derived from the dependency graph (and how reliable this information would be in this case)?

@dmpetrov
Copy link
Member Author

It looks like we got all the pros and cons of this issue. It makes sense to get rid of the prompting when it is possible. I created 3 issues for all the prompt cases that we've found except dvc run.

dvc run is a special case. From one point of view, it generates data derivatives and it is fine to remove them without asking questions. On another hand - there are some cases when data sources can be removed. Let's keep it as is and return to this question later if it arises.

I'm keeping checkboxes. We can close the current issue when all the checkboxes will be checked:

@jorgeorpinel
Copy link
Contributor

We can close the current issue when all the checkboxes will be checked:

Another option is to simply close this issue now. 🙂

@dmpetrov
Copy link
Member Author

dmpetrov commented Nov 19, 2019

@jorgeorpinel we can lose all the details if we close it :)

@dmpetrov
Copy link
Member Author

@shcheklein brought a good point that this issue is not only about prompting or not-prompting - it is about changing the checkout logic behavior in the first place.

So, we need to make dvc checkout logic align with git checkout logic and fail or proceed as Git does. There are two possible cases for checkout conflicts and none of these requires prompting user input:

  1. Keep the modified files in the workspace without failing when "checkouting" file is the same as changed one.
  2. Fail if an old version of a file needs to be checkout.

More details with Git code examples: #2802 (comment)

All 3 scenarios have to follow this checkout logic.

@shcheklein
Copy link
Member

Still trying to understand this :) (btw I still don't quite understand how is related to prompt vs not to prompt and why we are discussing this here).

Let's start with a simple case, first:

$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
$ dvc fetch
$ dvc checkout

What is the effect of running the last command in terms of changes to the workspace?

@dmpetrov
Copy link
Member Author

Let's start with a simple case, first:

I'm not sure I understand how it is related to the topic we discuss. Could you please elaborate?

@shcheklein
Copy link
Member

@dmpetrov we are discussing the changes to the dvc checkout behavior, right? The question is exactly about dvc checkout. I'm trying to understand exactly the new semantics of it. For example, how would it determine if files are new or old in a sense you mentioned. So, I'm going with really simple real scenarios to see outcomes and better understand what do you have in mind.

@dmpetrov
Copy link
Member Author

@shcheklein prompts happens in case of conflicts during checkout as you can see from the scenarios that @MrOutis showed. Git in such conflicts avoids prompts and (as we realized) behaves differently. It would be great to align DVC behavior with Git behavior (and avoid prompts as a result).

We found 3 DVC "misbehavior" scenarios (2 are actually related) and I create 3 issues to fix these scenarios.

In your last scenario, I see no checkout conflicts and no prompts. This is why I don't understand how it is related to the issue.

@shcheklein
Copy link
Member

shcheklein commented Nov 20, 2019

So, are you saying that checkout will behave differently if data is cached or not? E.g. if the new version is cached then we use the current semantics if not then we don't fail and do not effectively do checkout.

Again, I'm really confused about the change. It would be really beneficial to show some examples and simple scenarios when people do checkouts, modifications, etc.

@dmpetrov
Copy link
Member Author

@shcheklein I showed all the examples with Git: #2802 (comment)
We just need to reflect the Git checkout behavior in these 3 scenarios.

@shcheklein
Copy link
Member

@dmpetrov DVC is not Git by definition. E.g. we use Git with DVC to version DVC-files, it means that Git workflow affects the DVC workflow. DVC is a higher level abstraction in this sense. We can't just reflect even if we want to keep the same philosophy. I'm just saying I don't understand the suggested change (to the DVC semantics) still. And git examples are not enough for me to better understand those. I'm fine if it's only me. Would love to hear what others have to say.

@dmpetrov
Copy link
Member Author

@shcheklein you are right but it is not clear how it related to the current issue. It would be great if you can provide an example when Git's failing/proceeding behavior doesn't make sense for DVC in the context of these 3 checkout scenarios.

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
discussion requires active participation to reach a conclusion
Projects
None yet
Development

No branches or pull requests

5 participants