dvc gc does not remove files under dir.unpacked #2946

Closed

tlouismarie opened this issue Dec 13, 2019 · 5 comments · Fixed by #3054

Labels: bug, p1-important, research

tlouismarie commented Dec 13, 2019

Setup: DVC 0.75.0 installed from the .deb package on Ubuntu 18.04.
Also tested on Windows 10 with the .exe package. I tried different configurations for the cache type (default, copy and symlink) and get similar behavior.

I'm testing DVC and trying to understand how it manages datasets in its cache.
I initialize an empty repository with dvc init and add data with dvc add data in a directory that contains two datasets:

  • data/data.json of size 240M
  • data/data_1.json of size 65M

I then run a script that produces an output dataset: dvc run -f prepare_data.dvc -d src/prepare_data.py -d data -o output python src/prepare_data.py data. It creates output/prepared_data.npy of size 360M.
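For reference, the setup described so far amounts to roughly the following commands (a sketch; the contents of data/ and src/prepare_data.py are whatever produces the sizes above):

dvc init
dvc add data                          # data/data.json (240M), data/data_1.json (65M)
dvc run -f prepare_data.dvc \
    -d src/prepare_data.py -d data \
    -o output \
    python src/prepare_data.py data   # writes output/prepared_data.npy (360M)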
The .dvc/cache directory now contains the following files (names are simplified):

  • 06/file1 of size 240M at inode 60294316
  • 14/file2 of size 360M at inode 60949288
  • 20/file3.dir at inode 60294287
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289

I commit to git and tag it as expe1. I now modify my script and run dvc repro prepare_data.dvc. It produces a new file: output/prepared_data.npy of size 500M. The .dvc/cache directory now contains the following files:

  • 06/file1 of size 240M at inode 60294316
  • 14/file2 of size 360M at inode 60949288
  • 19/file6 of size 500M at inode 60949290
  • 20/file3.dir at inode 60294287
  • 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
  • 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
  • 22/file7.dir at inode 60294291
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289
  • dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288

I commit and tag it as expe2. I now want to clean the cache to remove previous outputs from expe1 and run dvc gc. The .dvc/cache directory now contains the following files:

  • 06/file1 of size 240M at inode 60294316
  • empty directory 14
  • 19/file6 of size 500M at inode 60949290
  • 20/file3.dir at inode 60294287
  • 20/file3.dir.unpacked/data.json of size 240M at inode 60294316
  • 20/file3.dir.unpacked/data_1.json of size 65M at inode 60294327
  • 22/file7.dir at inode 60294291
  • 59/file4 of size 65M at inode 60294327
  • dd/file5.dir at inode 60949289
  • dd/file5.dir.unpacked/prepared_data.npy of size 360M at inode 60949288

Therefore, contrary to what I expected, the previous version of the output file is still present in the cache at dd/file5.dir.unpacked/prepared_data.npy. Is this the expected behavior?
How can I properly clean the cache?
It also seems that this leftover file is useless: if I try to check out expe1, it raises an error and I have to reproduce the experiment anyway:

$ git checkout expe1
$ dvc checkout
ERROR: unexpected error - Checkout failed for the following target: 
   output
Did you forget to fetch?
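For reference, something like this lists the leftover unpacked copies and their sizes (assuming the default cache location):

du -sh .dvc/cache/*/*.dir.unpacked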
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Dec 13, 2019
@efiop efiop added bug Did we break something? research labels Dec 13, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Dec 13, 2019
efiop commented Dec 13, 2019

Hi @tlouismarie!

Indeed, this looks like a bug. Those unpacked dirs should indeed be removed by gc, but for now you can simply remove them yourself as a workaround. We'll investigate shortly. Thanks for the feedback!
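A minimal sketch of that workaround, assuming the default cache layout shown in the report (only the *.dir.unpacked copies are removed; the packed .dir entries and the file contents themselves stay in place):

rm -rf .dvc/cache/*/*.dir.unpacked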

@efiop efiop added the p1-important Important, aka current backlog of things to do label Dec 13, 2019
efiop commented Dec 30, 2019

To fix it, we need to make gc (https://github.com/iterative/dvc/blob/0.78.1/dvc/remote/base.py#L694) check whether the checksum is a dir checksum (by the .dir suffix; see the helpers around there) and, if so, also remove the corresponding .unpacked dir. It is important to note that the unpacked optimization is only valid for local path_infos (though we need to double-check ssh too), so we could consider checking that as well.
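In terms of the cache layout, the intended effect is roughly the following (an illustration only, with a hypothetical checksum; not the actual change to dvc/remote/base.py):

# hypothetical unused directory checksum that gc has decided to collect
entry=.dvc/cache/aa/0123456789abcdef.dir
rm -f  "$entry"              # removing the unused .dir entry (gc already does this)
rm -rf "${entry}.unpacked"   # the fix: also remove the matching .unpacked directory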

efiop commented Jan 3, 2020

Here is a reproducer:

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

git init
dvc init

mkdir data
echo foo > data/foo

# track a directory, so the cache gets a <checksum>.dir entry
dvc add data

tree .dvc/cache

# dvc status is what creates the <checksum>.dir.unpacked copy
# (compare this tree listing with the previous one)
dvc status

tree .dvc/cache

# drop the stage so the cached data becomes unused
rm data.dvc

dvc gc -f

# the <checksum>.dir.unpacked directory is still there
tree .dvc/cache

sharidas commented Jan 3, 2020

I am looking into this issue.

sharidas commented Jan 4, 2020

Created PR: #3054

I have tested it with the reproducer script suggested in #2946 (comment).
