Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files #2886
Comments
Thanks for reporting this, @jaredsampson ! Including the stack trace:
|
I get the same behavior with a flat file foo.txt instead of
yields
However, reversing the order of the files in the last command, everything works fine:
|
Thanks for the additional info, @jaredsampson ! The problem seems to be related to the way we collect the stages. |
doesn't reproduce it for me 🙁 I suspect that the original issue might be caused by us not deduping the targets in dvc/repo/add.py, but I haven't looked into it closely enough. |
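To make the deduping suspicion above concrete, here is a minimal sketch of order-preserving target deduplication. The helper name `dedupe_targets` is hypothetical and is not taken from dvc/repo/add.py; it assumes Python 3.7+, where `dict` keeps insertion order. The later comments in this thread trace the actual bug to stage caching rather than to duplicate targets.

```python
from typing import Iterable, List


def dedupe_targets(targets: Iterable[str]) -> List[str]:
    """Drop repeated targets while keeping first-seen order (hypothetical helper)."""
    # dict preserves insertion order, so the order typed on the
    # command line survives the deduplication.
    return list(dict.fromkeys(targets))


print(dedupe_targets(["bar.txt", "foo.txt", "bar.txt"]))  # ['bar.txt', 'foo.txt']
```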
@efiop , change the order of the last call and it should work:
- dvc add foo.txt bar.txt
+ dvc add bar.txt foo.txt |
If someone is willing to give it a try, here's a test that you can play with:

diff --git a/tests/func/test_add.py b/tests/func/test_add.py
index 6d51f233..7328c84d 100644
--- a/tests/func/test_add.py
+++ b/tests/func/test_add.py
@@ -24,7 +24,7 @@ from dvc.system import System
 from dvc.utils import file_md5
 from dvc.utils import LARGE_DIR_SIZE
 from dvc.utils import relpath
-from dvc.utils.compat import range
+from dvc.utils.compat import range, pathlib
 from dvc.utils.stage import load_stage_file
 from tests.basic_env import TestDvc
 from tests.utils import get_gitignore_content
@@ -649,3 +649,10 @@ def test_escape_gitignore_entries(git, dvc_repo, repo_dir):
     dvc_repo.add(fname)
     assert ignored_fname in get_gitignore_content()
+
+
+def test_adding_several_files_after_one_has_been_modified(dvc_repo):
+    # https://github.com/iterative/dvc/issues/2886
+    dvc_repo.add('foo')
+    pathlib.Path('foo').write_text('change')
+    dvc_repo.add(['bar', 'foo'])

Might need revisiting the graph building process 😬 |
@iterative/engineering it seems like an important bug. Should we make it p0? |
can confirm that problem still exists on master:
|
Ok, so a little investigation helped me narrow down the issue.
We will get the same OutputDuplicationError.
So now we don't have TLDR Solution: invalidate |
But we don't do that during the |
Well, that's where the devil is. We call stages on
|
@pared Got it, makes sense now. So our issue is that we don't reset after creating a stage inside |
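The failure mode discussed above (a cached stage collection going stale once a stage is re-created) can be illustrated with a toy model. This is a sketch under loud assumptions, not DVC's code: `ToyRepo`, its `workspace` dict, and the `reset_after_create` flag are all hypothetical, and `OutputDuplicationError` is redefined locally only so the example runs on its own.

```python
class OutputDuplicationError(Exception):
    """Toy stand-in: two stages claim the same output."""


class ToyRepo:
    """Hypothetical repo that caches the stages collected from the workspace."""

    def __init__(self):
        self.workspace = {}   # stage file -> output it tracks ("on disk")
        self._stages = None   # cached collection result

    @property
    def stages(self):
        if self._stages is None:          # collect once, then reuse
            self._stages = dict(self.workspace)
        return self._stages

    def reset(self):
        self._stages = None               # invalidate the cache

    def add(self, targets, reset_after_create=False):
        created = []
        for target in targets:
            # Creating a stage replaces any stage file already on disk.
            self.workspace.pop(target + ".dvc", None)
            if reset_after_create:
                self.reset()
            # Graph check: no collected stage may already own this output.
            if target in self.stages.values():
                raise OutputDuplicationError(target)
            created.append(target)
        for target in created:
            self.workspace[target + ".dvc"] = target
        self.reset()


def try_add(targets, **kwargs):
    repo = ToyRepo()
    repo.add(["foo"])                     # first `add foo`
    try:
        repo.add(targets, **kwargs)       # then re-add foo as part of a list
        return "ok"
    except OutputDuplicationError:
        return "OutputDuplicationError"


print(try_add(["bar", "foo"]))                           # OutputDuplicationError
print(try_add(["foo", "bar"]))                           # ok
print(try_add(["bar", "foo"], reset_after_create=True))  # ok
```

In this toy, handling `bar` first fills the cache while the old `foo.dvc` is still on disk, so the later check for `foo` sees a stale entry. Putting `foo` first, or invalidating the cache after each stage is created, avoids the stale read.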
@efiop
|
@jaredsampson the fix is done and should be included in the next release |
@pared @jaredsampson Released in 0.86.4, please upgrade and give it a try 🙂 |
Terrific. Thanks for the fix! |
Because of the way we collect and cache stages, we were not able to collect them for `add` without removing them from the workspace. Otherwise we'd have two identical or similar stages: one collected from the workspace and the other just created in memory by `dvc add`. This would raise errors during graph checks, so we started to delete them and reset them (a very recent change, see iterative#2886 and iterative#3349). By deleting the file before we even do any checks, we make DVC fragile, and even simple mistakes can result in data loss for users. This change should make `add` more reliable and robust. Also, we have recently started to keep state for a lot of things, so resetting it on each stage wastes a lot of performance, especially on gitignores. We cache dulwich's IgnoreManager, and resetting it too many times wastes a lot of time collecting the ignores again the next time (see iterative#6227). It's hard to say how much this improves things, as it depends heavily on the number of gitignores in the repo (which can be assumed to be fairly large for a DVC repo) and on the number of files being added (e.g. `-R` adding a large directory). On a directory with 10,000 files (in a dataset-registry repo), creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.
* add: do not delete stage files before add
* add tests
* make the test more specific
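To put the throughput figures from the commit message in perspective, the short calculation below converts them into wall-clock time for the 10,000-file directory it mentions. It assumes "1.1k files/sec" means exactly 1,100 and that both rates stay constant over the run; neither is stated in the source, and the numbers cover only stage creation during `dvc add -R`.

```python
files = 10_000        # directory size quoted in the commit message
before_rate = 64      # files/sec before the change
after_rate = 1_100    # files/sec after the change (assuming 1.1k == 1100)

before_secs = files / before_rate    # time to create all stages before
after_secs = files / after_rate      # time to create all stages after
speedup = after_rate / before_rate   # ratio of the two rates

print(f"{before_secs:.0f}s -> {after_secs:.0f}s (~{speedup:.0f}x)")  # 156s -> 9s (~17x)
```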
Using DVC 0.71.0 on RHEL6, installed via `conda`, configured for system-wide hard links (not sure if it's relevant):

I have come across what appears to be a bug, where attempting to `dvc add` a directory previously under DVC version control whose contents have changed results in an error, but only when adding it along with a list of files. The error doesn't occur if the command is repeated (i.e. after all the other files have been added). I have reproduced this in several project directories and via the following minimal example (courtesy of @MrOutis):

This results in the following output:

But if the command is re-run:

So it appears `dvc` is somehow mishandling the list of files. Of course, the expected behavior is that it will add the directory successfully on the first try.

Thanks for any effort to track down the source of the bug.