Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when re-importing a renamed directory #3339

Closed
charlesbaynham opened this issue Feb 15, 2020 · 6 comments
Closed

Error when re-importing a renamed directory #3339

charlesbaynham opened this issue Feb 15, 2020 · 6 comments
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.

Comments

@charlesbaynham
Copy link
Contributor

charlesbaynham commented Feb 15, 2020

DVC 0.84.0, Mac, conda installation

Summary

When re-importing a directory from a git repository having specified an alternative output, dvc fails with an "overlapping outs" error. This does not occur if the output directory was not renamed.

Reproduction

The following code sets up a source git repo with a subdirectory and a dvc repo which imports from it. It

  • Successfully imports the subdirectory
  • Successfully reimports the subdirectory
  • Successfully imports and renames the subdirectory
  • Fails to reimport the renamed subdirectory

# Set up a source repo
mkdir source_repo
cd source_repo
git init
mkdir src
echo Here is some code > src/code.txt
git add .
git commit -m "Save some code"
cd  -

# Set up a dvc repo
mkdir dvc_repo
cd dvc_repo
git init
dvc init
git add .
git commit -m "Init DVC"

echo --- Preparations completed --
read -rsp $'Press any key to continue...\n' -n1 key

# Import "src" directory from the source repo
dvc import ../source_repo src

# Reimport same: works fine
dvc import ../source_repo src

echo ---
echo That import + reimport worked fine, but the next will fail 
echo ---
echo Now import the same directory, but rename it
read -rsp $'Press any key to continue...\n' -n1 key

# Now import src again, but rename it
dvc import -o src_renamed ../source_repo src

echo Reimport same...
read -rsp $'Press any key to continue...\n' -n1 key

# Finally, reimport the last
# This command will fail
dvc import -o src_renamed ../source_repo src

Expected behavior

No error.

Actual behaviour

ERROR: failed to import 'src' from '../source_repo'. - Paths for outs: 'src_renamed'('src_renamed.dvc') 'src_renamed/src'('src.dvc') overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 15, 2020
@charlesbaynham
Copy link
Contributor Author

This doesn't seem to be a problem with importing files, only directories

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Feb 15, 2020

This seems to occur because the import command behaves differently if there's already an output directory or not.

If I call e.g. dvc import -o "step_A" "../../sample_steps/step_A" src in an empty directory then I get a directory called step_A which contains the contents of src and a dvc file called step_A.dvc. I think that's expected behavior.

If however, I call it in a directory which already contains a subdir called step_A then I instead get a dvcfile called src.dvc and src is created as a subdir of step_A.

It's this changed behaviour that causes the reimport to fail, since the dvc file that it creates has a different name to the first run, so it's a duplicated output.

@dmpetrov dmpetrov added p0-critical Critical issue. Needs to be fixed ASAP. bug Did we break something? and removed triage Needs to be triaged labels Feb 15, 2020
@charlesbaynham
Copy link
Contributor Author

Full trace:

ERROR: failed to import 'src' from '../source_repo'. - Paths for outs:
'src_renamed'('src_renamed.dvc')
'src_renamed/src'('src.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
------------------------------------------------------------
Traceback (most recent call last):
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/command/imp.py", line 19, in run
    rev=self.args.rev,
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/imp.py", line 6, in imp
    return self.imp_url(path, out=out, erepo=erepo, locked=True)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 31, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/imp_url.py", line 20, in imp_url
    self.check_modified_graph([stage])
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 184, in check_modified_graph
    self._collect_graph(self.stages + new_stages)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 337, in _collect_graph
    raise OverlappingOutputPathsError(outs[p], out)
dvc.exceptions.OverlappingOutputPathsError: Paths for outs:
'src_renamed'('src_renamed.dvc')
'src_renamed/src'('src.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
------------------------------------------------------------

@skshetry
Copy link
Member

skshetry commented Feb 17, 2020

@charlesbaynham, this is clearly mentioned in the docs (i know, it's not what you'd expect, and this is an edge-case that happened when re-importing).

-o, --out - specify a path (directory and/or file name) to the desired location to place the imported
data and import stage (DVC-file) in. The default value (when this option isn't used) is the current
working directory (.) and original file name. If an existing directory is specified, then the output
will be placed inside of it
.

With #3337, update is on par with re-importing (yet to be released, though). And, I feel, we should encourage users to use it instead of re-importing.

@charlesbaynham
Copy link
Contributor Author

Hi @skshetry ,

Ah thanks for that, sorry I missed it in the docs. Happy to close this issue therefore.

Those changes to update sound like they'll remove the strict need for reimports, but IMO the ability to reimport is still useful. In my use-case, for example, I need to create an import if it doesn't exist or update it if it does. I could write code to detect an existing import, but how will I know that it's the same as the one I want (e.g. what if the url has changed?). Without doing a reimport I have to:

  • Check if there's an existing .dvc file
  • Check if it has the right url
  • Either run update or delete the .dvc file and run import

Do you know of a way to correctly reimport a directory?

@efiop
Copy link
Contributor

efiop commented Mar 2, 2020

@charlesbaynham Thanks for your patience, notification got lost in the flow :) Your command is correct, but as @skshetry noted, import behaves like other tools (e.g. cp) and if directory already exists then it places the output inside of it. So the responsibility for removing is on you in this case. You could do dvc update though (now released) and it will handle it for you. Closing this issue for now.

@efiop efiop closed this as completed Mar 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.
Projects
None yet
Development

No branches or pull requests

4 participants