Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc get can can delete the current directory #3105

Closed
Persedes opened this issue Jan 10, 2020 · 12 comments · Fixed by #3170
Closed

dvc get can can delete the current directory #3105

Persedes opened this issue Jan 10, 2020 · 12 comments · Fixed by #3170
Assignees
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.

Comments

@Persedes
Copy link

Persedes commented Jan 10, 2020

First off, I saw it coming, had my stuff backed up etc, so no harm done.
I was trying to test out dvc get to pull models from a different project and the cache must not have been present in the remote:

models/mlr/production is defined as the output of a dvc file.

$ dvc get <internal_repo> models/mlr/production/ --rev <name of branch>
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:    
name: models/mlr/production, md5: ebbb5c351028a6e5485dfdec639fbb57.dir
Missing cache for directory '../../../tmp/tmp5jtl3fccdvc-erepo/models/mlr/production'. Cache for files inside will be lost. Would you like to continue? Use '-f' to force. [y/n] y            
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:   
name: models/mlr/production, md5: ebbb5c351028a6e5485dfdec639fbb57.dir
WARNING: Cache 'ebbb5c351028a6e5485dfdec639fbb57.dir' not found. File '.' won't be created.                                                                                                   
file '.' is going to be removed. Are you sure you want to proceed? [y/n] y                                                                                                                    
ERROR: unexpected error - [Errno 2] No such file or directory

Once I saw the File '.' won't be created. file '.' is going to be removed. I knew what was going to happen, but was surprised that dvc lets you do that. (It wiped out my whole git repo ;))

DVC version 0.80.0,
Platform WSL
method of installation: pip

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jan 10, 2020
@efiop efiop added bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. labels Jan 11, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 11, 2020
@efiop
Copy link
Contributor

efiop commented Jan 11, 2020

Hi @Persedes !

Wow, that is awful. Investigating right now. Thank you for reporting it! 🙏

@efiop
Copy link
Contributor

efiop commented Jan 11, 2020

Indeed, the issue is caused by dvc being confused by the trailing / in the models/mlr/production/. A workaround for now is to not include the trailing slash. Will send a fix soon...

@pared
Copy link
Contributor

pared commented Jan 13, 2020

Reproduction script:

#!/bin/bash

rm -rf erepo repo git_repo
mkdir erepo repo git_repo

maindir=$(pwd)
set -ex
pushd git_repo
git init --bare --quiet
popd

pushd erepo
git init --quiet
git remote add origin $maindir/git_repo
dvc init -q

mkdir models
echo model >> models/1

dvc add models
git add -A
git commit -am "add stuff"

git push origin master
rm models/1
rm -rf .dvc/cache

popd

pushd repo
dvc get $maindir/erepo models/

Expected behavior:

@Persedes
Copy link
Author

I have found another oddity, which might be related to the bug above. dvc get keeps nesting outputs.

The first get pulls a folder with two tsv files.
The second one creates a prepared folder inside that one. I would have expected dvc get to either do nothing or overwrite the existing tsv files when pulling from a different branch/ commit.

# start in an empty folder
$ ls

# pull output from a folder and define the output path
$ dvc get https://github.com/iterative/example-get-started data/prepared -o dvc_files/
$ ls dvc_files/
test.tsv  train.tsv

# running it again creates a prepared folder inside of the defined output_path
$ dvc get https://github.com/iterative/example-get-started data/prepared -o dvc_files/
# prepared gets nested inside of dvc_files
$ ls dvc_files/
prepared  test.tsv  train.tsv

@ghost
Copy link

ghost commented Jan 13, 2020

@efiop , are you working on this one?

@ghost
Copy link

ghost commented Jan 13, 2020

#3105 (comment)

Nesting only happens once, this is due to resolve_output. Can't tell what was the original intention behind those lines:

dvc/dvc/utils/__init__.py

Lines 336 to 344 in fe635a5

def resolve_output(inp, out):
from urllib.parse import urlparse
name = os.path.basename(urlparse(inp).path)
if not out:
return name
if os.path.isdir(out):
return os.path.join(out, name)
return out

@efiop ?

@ghost
Copy link

ghost commented Jan 13, 2020

@pared , I was trying to write a test based on your reproduction script. But didn't understand why you were git pushing. Could you explain it for me?

By the way, here's what I have so far:

def test_error(tmp_dir, dvc, erepo_dir):
    with tempfile.TemporaryDirectory() as remote:
        git.Repo.init(remote)

        with erepo_dir.chdir():
            erepo_dir.scm_gen("dir/file", "text", commit="create file")
            erepo_dir.dvc_add("dir")
            origin = erepo_dir.scm.repo.create_remote("origin", remote)
            origin.push()

        os.remove(fspath(erepo_dir / "dir" / "file"))
        shutil.rmtree(fspath(erepo_dir / ".dvc" / "cache"))

        dvc.get(erepo_dir, "dir")
        assert git.Repo(remote)

It passes, so I'm not sure if I'm translating it correctly 😕

@pared
Copy link
Contributor

pared commented Jan 15, 2020

@MrOutis, git push is unnecessary, dunno how it got there, removing it still makes the bug reproducible.

[EDIT]
@MrOutis, I modified your test a little bit:

def test_error(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.gen("dir/file", "text") # just gen here, not scm_gen
        erepo_dir.dvc_add("dir", commit="add dir")

    os.remove(fspath(erepo_dir / "dir" / "file"))
    shutil.rmtree(fspath(erepo_dir / ".dvc" / "cache"))

    try:
        with mock.patch("dvc.prompt.confirm", return_value=True):
            dvc.get(fspath(erepo_dir), "dir/") # note that we get "dir/" not "dir"
    except Exception:
        pass
    finally:
        assert tmp_dir.exists()

@ghost
Copy link

ghost commented Jan 15, 2020

@pared , great! Thanks for clearing it out)

@ghost
Copy link

ghost commented Jan 16, 2020

@pared , I still don't get it. Tests are passing on my side, I'm loosing it on this one 😅

@pared
Copy link
Contributor

pared commented Jan 16, 2020

@MrOutis even the one that I provided?

@pared
Copy link
Contributor

pared commented Jan 16, 2020

@Persedes I did a little patch that should fix the original issue,

as to the second one, the one described in #3105 (comment). I am afraid that has to stay. I do agree that in this particular case it does not do what you would like it to do, but this behaviour is consistent with what filesystem operations allow as to do.
For example:

#!/bin/bash

rm -rf src clone
mkdir src
echo data >> src/file

cp -r src clone
tree clone
cp -r src clone
tree clone

Will show that the second cp will create clone/src.
If we were to prevent such behaviour, we would also have to forbid get-ing file -> dir that results in dir/file, or we would have to check whether src is directory, but then we would break consistency with filesystem operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants