`dvc gc` does not remove files under dir.unpacked #2946
Comments
Hi @tlouismarie! Indeed, this looks like a bug. Those unpacked dirs should be removed by gc, but for now you can simply remove them yourself as a workaround. We'll investigate shortly. Thanks for the feedback!
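The manual workaround could be scripted roughly as follows. This is only a minimal sketch, not part of DVC: the helper name `remove_unpacked_dirs` is hypothetical, and it assumes the cache layout described in this issue, where the leftover directories have names ending in `.dir.unpacked` somewhere under `.dvc/cache`. It does not check whether DVC still needs a directory, so it defaults to a dry run that only lists what it would delete.

```python
import shutil
from pathlib import Path


def remove_unpacked_dirs(cache_dir, dry_run=True):
    """List (and optionally delete) '*.dir.unpacked' directories under cache_dir.

    Hypothetical helper, not a DVC API. Returns the sorted list of matches.
    """
    cache = Path(cache_dir)
    # Collect every match first so the tree is not modified while it is walked.
    targets = sorted(
        p for p in cache.rglob("*.dir.unpacked")
        if p.is_dir() and not p.is_symlink()
    )
    if not dry_run:
        for path in targets:
            # A match nested inside an already deleted match no longer exists.
            if path.exists():
                shutil.rmtree(path)
    return targets
```

Calling `remove_unpacked_dirs(".dvc/cache")` only reports the matching directories; passing `dry_run=False` deletes them. Regular cache entries such as `dd/file5.dir` are left alone because only directories with the `.dir.unpacked` suffix are matched.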
To fix it, we need to make |
Here is a reproducer:
|
I am looking into this issue.
Created PR: #3054. I have tested it with the reproducer script suggested in #2946 (comment).
Setup is dvc 0.75.0 with the .deb package under Ubuntu 18.04.
Also tested on Windows 10 with the .exe package. I tried different configurations for the cache type (default, copy and symlink) and got similar behavior.
I'm testing DVC and trying to understand how it manages the datasets in its cache.
I initialize an empty repository with `dvc init` and add data with `dvc add data` in a directory that contains two datasets:

- `data/data.json` of size 240M
- `data/data_1.json` of size 65M

I then run a script that produces an output dataset: `dvc run -f prepare_data.dvc -d src/prepare_data.py -d data -o output python src/prepare_data.py data`. It creates `output/prepared_data.npy` of size 360M.

The `.dvc/cache` directory now contains the following files (names are simplified):

- `06/file1` of size 240M at inode 60294316
- `14/file2` of size 360M at inode 60949288
- `20/file3.dir` at inode 60294287
- `59/file4` of size 65M at inode 60294327
- `dd/file5.dir` at inode 60949289

I commit to git and tag it as expe1. I now modify my script and run
`dvc repro prepare_data.dvc`. It produces a new file: `output/prepared_data.npy` of size 500M. The `.dvc/cache` directory now contains the following files:

- `06/file1` of size 240M at inode 60294316
- `14/file2` of size 360M at inode 60949288
- `19/file6` of size 500M at inode 60949290
- `20/file3.dir` at inode 60294287
- `20/file3.dir.unpacked/data.json` of size 240M at inode 60294316
- `20/file3.dir.unpacked/data_1.json` of size 65M at inode 60294327
- `22/file7.dir` at inode 60294291
- `59/file4` of size 65M at inode 60294327
- `dd/file5.dir` at inode 60949289
- `dd/file5.dir.unpacked/prepared_data.npy` of size 360M at inode 60949288

I commit and tag it as expe2. I now want to clean the cache to remove previous outputs from expe1 and run
`dvc gc`. The `.dvc/cache` directory now contains the following files:

- `06/file1` of size 240M at inode 60294316
- `19/file6` of size 500M at inode 60949290
- `20/file3.dir` at inode 60294287
- `20/file3.dir.unpacked/data.json` of size 240M at inode 60294316
- `20/file3.dir.unpacked/data_1.json` of size 65M at inode 60294327
- `22/file7.dir` at inode 60294291
- `59/file4` of size 65M at inode 60294327
- `dd/file5.dir` at inode 60949289
- `dd/file5.dir.unpacked/prepared_data.npy` of size 360M at inode 60949288

Therefore, contrary to what I expected, the previous version of the output file is still present in the cache at `dd/file5.dir.unpacked/prepared_data.npy`. Is this the expected behavior? How can I properly clean the cache?
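The inode numbers matter for reading these listings: entries that share an inode (for example `06/file1` and `20/file3.dir.unpacked/data.json`) are hard links to the same data and take disk space only once, whereas after `dvc gc` the unpacked `prepared_data.npy` is the only listed entry left at inode 60949288, so its 360M stays on disk. A listing of this kind can be produced with a short script. The following is a minimal sketch using only the Python standard library; the function name `cache_usage` is made up for illustration and is not part of DVC.

```python
import os
from collections import defaultdict


def cache_usage(cache_dir):
    """Group files under cache_dir by (device, inode).

    Returns (by_inode, total_bytes). by_inode maps (st_dev, st_ino) to
    (size_in_bytes, sorted relative paths); total_bytes counts each
    inode once, so hard-linked copies are not counted twice.
    """
    sizes = {}
    paths = defaultdict(list)
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full = os.path.join(root, name)
            st = os.lstat(full)  # lstat: describe symlinks themselves, not their targets
            key = (st.st_dev, st.st_ino)
            sizes[key] = st.st_size
            paths[key].append(os.path.relpath(full, cache_dir))
    by_inode = {key: (sizes[key], sorted(paths[key])) for key in sizes}
    total_bytes = sum(sizes.values())
    return by_inode, total_bytes
```

Running `cache_usage(".dvc/cache")` before and after `dvc gc` and comparing `total_bytes` would show whether the collection actually freed space, and any group whose only path lies under a `.dir.unpacked` directory points at data that is kept alive solely by the unpacked copy.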
This file seems to be useless: if I try to check out expe1, it raises an error, and I have to reproduce the experiment anyway: