Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: dvc add fails with a modified file (or directory) at the end of a list of files #2886

Closed
jaredsampson opened this issue Dec 3, 2019 · 16 comments · Fixed by #3349
Closed
Assignees
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. research

Comments

@jaredsampson
Copy link

Using DVC 0.71.0 on RHEL6, installed via conda, configured for system-wide hard links (not sure if it's relevant):

$ dvc --version
0.71.0

$ cat ~/.config/dvc/config
[cache]
protected = true
type = hardlink

I have come across what appears to be a bug, where attempting to dvc add a directory previously under DVC version control whose contents have changed, results in an error, but only when adding along with a list of files, and doesn't occur if the command is repeated (i.e. after all the other files have been added). I have reproduced this in several project directories and via the following minimal example (courtesy of @MrOutis):

dvc init --no-scm
mkdir data
echo "foo" > data/foo
dvc add data
echo "bar" > bar.txt
dvc unprotect data
echo "change" > data/foo
dvc add bar.txt data

This results in the following output:

WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
100%|██████████|Add                                       2.00/2.00 [00:02<00:00,  1.08s/file]
ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc

But if the command is re-run:

$ dvc add bar.txt data
Stage is cached, skipping.
100% Add                                                  1.00/1.00 [00:01<00:00,  1.96s/file]

So it appears dvc is somehow mishandling the list of files. Of course, the expected behavior is that it will add the directory successfully on the first try.

Thanks for any effort to track down the source of the bug.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Dec 3, 2019
@ghost
Copy link

ghost commented Dec 3, 2019

Thanks for reporting this, @jaredsampson !

Including the stack trace:

ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc
------------------------------------------------------------
Traceback (most recent call last):
  File "/dvc/command/add.py", line 25, in run
    fname=self.args.file,
  File "/dvc/repo/__init__.py", line 36, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/dvc/repo/add.py", line 50, in add
    repo.check_modified_graph(stages)
  File "/dvc/repo/__init__.py", line 177, in check_modified_graph
    self._collect_graph(self.stages + new_stages)
  File "/dvc/repo/__init__.py", line 327, in _collect_graph
    raise OutputDuplicationError(str(out), stages)
dvc.exceptions.OutputDuplicationError: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc
------------------------------------------------------------

@jaredsampson
Copy link
Author

I get the same behavior with a flat file foo.txt instead of data/foo as well:

dvc init --no-scm
mkdir data
echo "foo" > foo.txt
dvc add foo.txt
echo "bar" > bar.txt
dvc unprotect foo.txt
echo "change" > foo.txt
dvc add foo.txt bar.txt

yields

WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100%|██████████|Add                                       2.00/2.00 [00:01<00:00,  1.51file/s]
ERROR: file/directory 'foo.txt' is specified as an output in more than one stage: foo.txt.dvc
    foo.txt.dvc

However, reversing the order of the files in the last command, everything works fine:

$ dvc add foo.txt bar.txt
WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100% Add                                                  2.00/2.00 [00:01<00:00,  1.47file/s]

@ghost
Copy link

ghost commented Dec 3, 2019

Thanks for the additional info, @jaredsampson !

The problem seems to be related to the way we collect the stages.

@efiop efiop added the p1-important Important, aka current backlog of things to do label Dec 3, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Dec 3, 2019
@efiop efiop added bug Did we break something? triage Needs to be triaged labels Dec 3, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Dec 3, 2019
@efiop
Copy link
Contributor

efiop commented Dec 3, 2019

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

dvc init --no-scm

mkdir data

echo "foo" > foo.txt
dvc add foo.txt

echo "bar" > bar.txt
dvc unprotect foo.txt

echo "change" > foo.txt
dvc add foo.txt bar.txt

doesn't reproduce it for me 🙁I suspect that the original issue might be cause by us not deduping the targets in dvc/repo/add.py, but I haven't looked into it closely enough.

@ghost
Copy link

ghost commented Dec 3, 2019

@efiop , change the order of the last call and it should work:

- dvc add foo.txt bar.txt
+ dvc add bar.txt foo.txt

@ghost
Copy link

ghost commented Dec 3, 2019

If someone is willing to give it a try, here's a test that you can play with:

diff --git a/tests/func/test_add.py b/tests/func/test_add.py
index 6d51f233..7328c84d 100644
--- a/tests/func/test_add.py
+++ b/tests/func/test_add.py
@@ -24,7 +24,7 @@ from dvc.system import System
 from dvc.utils import file_md5
 from dvc.utils import LARGE_DIR_SIZE
 from dvc.utils import relpath
-from dvc.utils.compat import range
+from dvc.utils.compat import range, pathlib
 from dvc.utils.stage import load_stage_file
 from tests.basic_env import TestDvc
 from tests.utils import get_gitignore_content
@@ -649,3 +649,10 @@ def test_escape_gitignore_entries(git, dvc_repo, repo_dir):
     dvc_repo.add(fname)
 
     assert ignored_fname in get_gitignore_content()
+
+
+def test_adding_several_files_after_one_has_been_modified(dvc_repo):
+    # https://github.com/iterative/dvc/issues/2886
+    dvc_repo.add('foo')
+    pathlib.Path('foo').write_text('change')
+    dvc_repo.add(['bar', 'foo'])

Might need revisiting the graph building process 😬

@jaredsampson jaredsampson changed the title Bug: dvc add fails with a modified directory after a list of files. Bug: dvc add fails with a modified file (or directory) at the end of a list of files Dec 3, 2019
@dmpetrov
Copy link
Member

dmpetrov commented Feb 3, 2020

@iterative/engineering it seems like an important bug. Should we make it p0?

@efiop efiop added p0-critical Critical issue. Needs to be fixed ASAP. and removed p1-important Important, aka current backlog of things to do labels Feb 3, 2020
@pared pared added the research label Feb 11, 2020
@pared pared self-assigned this Feb 14, 2020
@pared
Copy link
Contributor

pared commented Feb 14, 2020

can confirm that problem still exists on master:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
git init --quiet
dvc init -q

set -ex

mkdir data

echo foo>>data/foo
dvc add -q data

echo bar>>bar
echo change>>data/foo

dvc add bar data

@pared
Copy link
Contributor

pared commented Feb 17, 2020

Ok, so little investigation helped me to narrow down the issue.
The bug shows up when we are re-add-ing existing DVC-tracked output, but only after calling
dvc.stages first. For example:
Latest version i checked with: 0.86.1

def test_re_add(tmp_dir, scm, dvc):
    tmp_dir.dvc_gen( {"file": "file content"})

    tmp_dir.gen(
            {"file": "modified content"}
    )
    dvc.stages
    dvc.add(["file"])

We will get the same OutputDuplicationError.

  1. Why this is happening?
    repo.stages is cached property, so repo is storing file.dvc. The problem is that when we are re-add-ing some file, the chain of calls is repo.add -> repo._create_stages -> Stage.create which detects that file.dvc exists and removes it, so it can be created again.

So now we don't have file.dvc because it was removed by Stage.create, we have file.dvc in memory as "new" stage that we are creating right now, and because repo's stages were not invalidated, we also have repo.stages==[file.dvc]. That is why we try to check modified graph here during add we have self.stages==[file.dvc] and new_stages==[file.dvc]. While, actually self.stages should not return it (since it was removed by Stage.create.

TLDR
When re-adding we remove stage file, but we don't invalidate cached repo.stages and end up with repo._collect_graph(['file.dvc', 'file.dvc']) which has to fail.

Solution: invalidate repo.stages when removing the stage file.

@efiop
Copy link
Contributor

efiop commented Feb 17, 2020

dvc.stages
dvc.add(["file"])

But we don't do that during the dvc add, right? So what specifically has caused the issue?

@pared
Copy link
Contributor

pared commented Feb 17, 2020

@efiop

But we don't do that during the dvc add, right? So what specifically has caused the issue?

Well thats where the devil is. We call stages on repo.check_modified_graph that is why original issue was so strange:

  1. If we re-add only modified artifact, stages will be called only after file.dvc was removed on Stage.create
  2. If we do dvc add other_thingy file (where file exists and has been updated) repo.stages will be called when handling other_thingy, cached, and cause problems when handling updated file.

@efiop
Copy link
Contributor

efiop commented Feb 17, 2020

@pared Got it, makes sense now. So our issue is that we don't reset after creating a stage inside repo.add(), right? I suppose that issue was introduced while moving that multitarget logic from CLI to API.

@pared
Copy link
Contributor

pared commented Feb 17, 2020

@efiop
not only repo.add()
To generalize discussion here and under #3349 :
Whenever we Stage.create pre-existing stage and then repo.check_modified_graph (add, imp, run) we are vulnerable to cached stages.
examples:

def test_run_update(tmp_dir, dvc):
    dvc.run(outs=["file"], cmd="echo content > file")
    dvc.stages
    dvc.run(outs=["file"], cmd="echo new_content > file")

def test_imp_update(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="commit")

    dvc.imp(fspath(erepo_dir), "file")

    dvc.stages
    dvc.imp(fspath(erepo_dir), "file")

@pared
Copy link
Contributor

pared commented Feb 20, 2020

@jaredsampson the fix is done and should be included in next release

@efiop
Copy link
Contributor

efiop commented Feb 20, 2020

@pared @jaredsampson Released in 0.86.4, please upgrade and give it a try 🙂

@jaredsampson
Copy link
Author

Terrific. Thanks for the fix!

skshetry added a commit to skshetry/dvc that referenced this issue Jun 28, 2021
Because of the way we collect stages and cache them, we were not able to
collect them for the `add` without removing them from the workspace.
As doing so, we'd have two same/similar stages - one collected from the
workspace and the other just created from the `dvc add` in-memory.

This would raise errors during graph checks, so we started to delete them
and reset them (which is very recently, see iterative#2886 and iterative#3349).

By deleting the file before we even do any checks, we are making DVC
fragile, and results in data loss for the users with even simple mistakes.
This should make it more reliable and robust.

And, recently, we have started to keep state of a lot of things, that by
resetting them on each stage, we waste a lot of performance, especially
on gitignores. We cache the dulwich's IgnoreManager, which when resetted
too many times, will waste a lot of our time just collecting them again
next time (see iterative#6227).

It's hard to say how much this improves, as this very much depends on
no. of gitignores in the repo (which can be assumed to be quite in number
for a dvc repo) and the amount of files that we are adding
(eg: `-R` adding a large directory).

On a directory with 10,000 files (in a datadet-registry repo),
creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.
efiop pushed a commit that referenced this issue Jul 3, 2021
* add: do not delete stage files before add

Because of the way we collect stages and cache them, we were not able to
collect them for the `add` without removing them from the workspace.
As doing so, we'd have two same/similar stages - one collected from the
workspace and the other just created from the `dvc add` in-memory.

This would raise errors during graph checks, so we started to delete them
and reset them (which is very recently, see #2886 and #3349).

By deleting the file before we even do any checks, we are making DVC
fragile, and results in data loss for the users with even simple mistakes.
This should make it more reliable and robust.

And, recently, we have started to keep state of a lot of things, that by
resetting them on each stage, we waste a lot of performance, especially
on gitignores. We cache the dulwich's IgnoreManager, which when resetted
too many times, will waste a lot of our time just collecting them again
next time (see #6227).

It's hard to say how much this improves, as this very much depends on
no. of gitignores in the repo (which can be assumed to be quite in number
for a dvc repo) and the amount of files that we are adding
(eg: `-R` adding a large directory).

On a directory with 10,000 files (in a datadet-registry repo),
creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.

* add tests

* make the test more specific
skshetry added a commit to skshetry/dvc that referenced this issue Apr 27, 2022
Because of the way we collect stages and cache them, we were not able to
collect them for the `add` without removing them from the workspace.
As doing so, we'd have two same/similar stages - one collected from the
workspace and the other just created from the `dvc add` in-memory.

This would raise errors during graph checks, so we started to delete them
and reset them (which is very recently, see iterative#2886 and iterative#3349).

By deleting the file before we even do any checks, we are making DVC
fragile, and results in data loss for the users with even simple mistakes.
This should make it more reliable and robust.

And, recently, we have started to keep state of a lot of things, that by
resetting them on each stage, we waste a lot of performance, especially
on gitignores. We cache the dulwich's IgnoreManager, which when resetted
too many times, will waste a lot of our time just collecting them again
next time (see iterative#6227).

It's hard to say how much this improves, as this very much depends on
no. of gitignores in the repo (which can be assumed to be quite in number
for a dvc repo) and the amount of files that we are adding
(eg: `-R` adding a large directory).

On a directory with 10,000 files (in a datadet-registry repo),
creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. research
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants