No prompt: do not remove redundant files during a directory checkout #2802

Closed
dmpetrov opened this issue Nov 16, 2019 · 5 comments
Labels
enhancement Enhances DVC

Comments

@dmpetrov
Member

See #2498

Some quotes:

dvc init --no-scm
mkdir directory
echo 'foo' > directory/foo
dvc add directory
echo 'bar' > directory/bar
dvc checkout
# 0%|          |Checkout           0/1 [00:00<?,     ?file/s
file 'directory/bar' is going to be removed. Are you sure you want to proceed? [y/n]

There seems to be no conflict here that would justify failing or prompting. DVC shouldn't remove files that are not committed. A similar Git example:

git init
echo foo > foo
git add foo
git commit -m 'foo commit'
echo bar > bar
git checkout ffb38d8 # the current commit

No changes, no errors. bar is in its place. I'd expect the same behavior from DVC.
I propose that DVC neither prompt nor output any error messages or warnings in this case.
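To make "files that are not committed" concrete, here is a minimal sketch that lists files present in a directory but absent from a listing of tracked paths. The `untracked_files` helper and the plain-text `tracked.txt` listing are hypothetical stand-ins for illustration only; this is not how DVC records directory contents.

```shell
# Hypothetical helper: print files that exist under a directory but are
# missing from a listing of tracked paths (one relative path per line).
# $1 = directory, $2 = file containing the tracked listing
untracked_files() {
    tracked_sorted=$(mktemp)
    present_sorted=$(mktemp)
    LC_ALL=C sort "$2" > "$tracked_sorted"
    (cd "$1" && find . -type f | sed 's|^\./||' | LC_ALL=C sort) > "$present_sorted"
    # comm -13 keeps only the lines unique to the second input.
    LC_ALL=C comm -13 "$tracked_sorted" "$present_sorted"
    rm -f "$tracked_sorted" "$present_sorted"
}

workdir=$(mktemp -d)
mkdir "$workdir/directory"
echo 'foo' > "$workdir/directory/foo"
echo 'foo' > "$workdir/tracked.txt"      # assumed listing: only "foo" is tracked
echo 'bar' > "$workdir/directory/bar"    # added afterwards, never committed
extra=$(untracked_files "$workdir/directory" "$workdir/tracked.txt")
echo "$extra"                            # prints: bar
```

In this sketch `directory/bar` is the only file a checkout would have to decide about; everything else already matches the tracked state.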

@dmpetrov dmpetrov added the enhancement Enhances DVC label Nov 16, 2019
@shcheklein
Member

My 2c on this specific ticket.

I don't think the analogy works in this case. Git and DVC treat directories differently: for DVC a directory is a first-class citizen, while Git is all about files. You can see this, for example, in the way we calculate a single checksum for a directory, or in the way you can specify either a directory or a file in -o, -d, etc.

So, I think not restoring a DVC-controlled directory to its state (consistent with the checksum in the DVC-file) can be considered the same as not restoring a Git-controlled file to its state.

Also, it means that we would have different semantics for files and directories from a dependency/output management perspective.

So, I would suggest keeping the behavior of directories and files consistent and raising an error if we can't safely restore the directory.
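The point about a single checksum per directory can be sketched as follows. This is an illustration only, not DVC's actual algorithm: the `dir_checksum` helper is hypothetical, and it simply hashes a sorted listing of per-file MD5 sums. It shows why any added file makes the whole directory look modified.

```shell
# Hypothetical helper: derive one checksum for a whole directory by hashing
# the sorted list of per-file checksums and relative paths.
dir_checksum() {
    (cd "$1" && find . -type f -exec md5sum {} + \
        | LC_ALL=C sort -k 2 \
        | md5sum \
        | cut -d ' ' -f 1)
}

workdir=$(mktemp -d)
mkdir "$workdir/directory"
echo 'foo' > "$workdir/directory/foo"
before=$(dir_checksum "$workdir/directory")

echo 'bar' > "$workdir/directory/bar"    # an extra, uncommitted file
after=$(dir_checksum "$workdir/directory")

if [ "$before" != "$after" ]; then
    echo "directory checksum changed"
fi
```

Under this model an extra file is not a separate untracked object, as it would be in Git; it changes the identity of the tracked directory itself.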

@dmpetrov
Member Author

This is a good point. There are actually two cases:

  1. The current commit initially has the same version of the file (or dir) as the target commit. In this case, a checkout to the target commit should keep the modified file in the workspace and Git/DVC should not fail.
  2. You are checking out an old version. The old version cannot be rewritten by modified files. So, failing is the only option.

Of course, both of the cases need to be covered as a part of this issue.

Example:

$ git init
$ echo aa > foo
$ git add foo
$ git commit -m 1st
$ FIRST_VERSION=$(git rev-parse HEAD)   # remember the 1st commit

$ echo bb >> foo
$ git add foo
$ git commit -m 2nd
$ SECOND_VERSION=$(git rev-parse HEAD)  # remember the 2nd commit

$ echo cc >> foo
$ git status -s
 M foo    # it was modified

$ git checkout $SECOND_VERSION # it works!
$ git status -s
 M foo    # it is still there and modified

$ git checkout master # go back
$ git status -s
 M foo    # it is still there and modified

$ git checkout $FIRST_VERSION # fails!!!
error: Your local changes to the following files would be overwritten by checkout:
	foo
Please commit your changes or stash them before you switch branches.
Aborting
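The two cases, as the transcript above illustrates them, can be condensed into a small decision rule. This is a simplified sketch of the behavior described in this comment, not Git's or DVC's actual implementation; the `checkout_action` helper and the version labels are hypothetical.

```shell
# Hypothetical helper: decide what a checkout should do with one path.
# $1 = version recorded in the current commit
# $2 = version recorded in the target commit
# $3 = version actually present in the workspace
checkout_action() {
    current=$1
    target=$2
    workspace=$3
    if [ "$current" = "$target" ]; then
        # Case 1: the target does not change this path, so local edits survive.
        echo "keep"
    elif [ "$workspace" = "$current" ]; then
        # No local edits: safe to replace the path with the target version.
        echo "update"
    else
        # Case 2: the target differs and local edits would be overwritten.
        echo "fail"
    fi
}

same_target=$(checkout_action v2 v2 v2-modified)
older_target=$(checkout_action v2 v1 v2-modified)
clean_workspace=$(checkout_action v2 v1 v2)
echo "$same_target $older_target $clean_workspace"   # prints: keep fail update
```

The open question in this thread is whether the same rule should apply when the "path" is a whole DVC-tracked directory rather than a single file.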

Also, it means that we would have different semantics for files and directories from a dependency/output management perspective.

Do these two cases show that there should be no difference between Git and DVC logic, despite the differences in how they treat directories? If not, could you please elaborate on the semantic differences?

@shcheklein
Member

See the comment here #2803 (comment)

@efiop efiop changed the title from "No prompt: remove redundant files during a directory checkout" to "No prompt: do not remove redundant files during a directory checkout" on Oct 28, 2021
@dberenbaum
Collaborator

@efiop

  2. You are checking out an old version. The old version cannot be rewritten by modified files. So, failing is the only option.

This isn't really possible in DVC (at least for now). I can git checkout any version of a .dvc file, but once I do that, dvc checkout will always be checking out the version in that .dvc file. It's not like there's a dvc checkout --rev.

Let's do the DVC equivalent to the example above. What behavior do we expect?

git init
dvc init
mkdir directory
echo 'foo' > directory/foo
dvc add directory
git add .
git commit -m 1st
echo 'bar' > directory/bar
dvc commit
git add .
git commit -m 2nd
echo 'baz' > directory/baz
dvc status # show directory as having modifications (directory/baz added)?
dvc checkout # don't change directory; keep directory/baz?
dvc status # still show directory as having modifications (directory/baz added)?
git checkout HEAD~1
dvc checkout

What would be the expected behavior of the last command?

  • Succeed and show directory as having directory/bar and directory/baz added?
  • Fail? How could DVC determine that this should fail? Under what conditions would checkout fail?

If I have a huge dataset and want to go back to a previous version that was a subset of my current version, how do I do that if all the extra files are kept on checkout?

Also, would we do the same thing for changes within a file once everything is object-based?

@efiop
Contributor

efiop commented Dec 8, 2023

DVC no longer deletes files unless --force is specified. And even that will be changed in the future.

@efiop efiop closed this as completed Dec 8, 2023

4 participants