dvc gc remove <datafile-or-dir> #4218
Comments
This feature would be very important in my projects as well.
The only real blocker to this besides prioritization is that it's dangerous since it could delete something needed elsewhere, right? Could we add the command with a stern warning that it's dangerous?
Thanks @dberenbaum. In the meantime, could you please comment on whether the following hack would corrupt the DVC setup in any way? Imagine I have pulled some tracked data. This will place the downloaded data into `my_data_folder`, and with `ls -l my_data_folder` we can see the individual files. The idea is to be able to delete only specific files: by observing the hash associated with a given file, I can remove both the workspace copy and the corresponding cache entry. Later, if I want to download this file again, I can just pull it back. Thanks!
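For concreteness, a minimal sketch of this kind of manual cleanup, assuming a DVC-tracked directory and the default local cache layout (content addressed by MD5, under `.dvc/cache/files/md5/` on DVC 3.x or directly under `.dvc/cache/` on older versions). All file names and hashes below are hypothetical; as advised in the reply below, try it on a throwaway repo first.

```bash
# my_data_folder.dvc records the directory's hash (a value ending in ".dir").
# The matching .dir object in the cache is a small JSON list of
# {"relpath": ..., "md5": ...} entries, one per file, which gives the
# hash of each individual file. The hashes shown here are made up.
cat .dvc/cache/files/md5/aa/bbccddeeff00112233445566778899.dir

# Drop one file from the workspace and delete its cache entry by hash
# (on DVC 2.x the path would be .dvc/cache/c1/57a790...).
rm my_data_folder/huge_file.bin
rm .dvc/cache/files/md5/c1/57a79031e1c40f85931829bc5fc552

# Later, restore the file (and its cache entry) from the remote.
dvc pull my_data_folder.dvc
```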
I think as long as you aren't editing the files in place, and are only dropping and adding files and pulling them, it should be safe. You may want to test on an example before trying it on your real data.
Thank you very much @dberenbaum.
Currently `dvc gc` can only be scoped by Git revisions (workspace, branches, tags, commits), not by individual targets. Proposed usage:
- `$ dvc gc -w --outs data/abc.xml`: keep the current version of `data/abc.xml` (referenced in the workspace).
- `$ dvc gc -A -o *.dvc -o model.pt`: keep raw data (all outputs of all `.dvc` files) and a specific model file referenced in all commits (in essence, this removes all intermediate artifacts, which can always be reproduced anyway).
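Written out as a shell sketch, with the caveat that `--outs`/`-o` is the syntax being proposed in this comment, not an existing `dvc gc` option:

```bash
# Proposed (not implemented): keep only the listed outputs, collect the rest.

# Keep the current version of data/abc.xml as referenced in the workspace.
dvc gc -w --outs data/abc.xml

# Keep all outputs of all .dvc files plus model.pt across all commits,
# i.e. drop intermediate pipeline artifacts that can always be reproduced.
dvc gc -A -o '*.dvc' -o model.pt
```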
This would be very useful in cases where data has been pushed to the remote accidentally before setting […]
Here's another use case. Our team has been asked to remove all traces of one vendor's data from our systems because our contract with them has ended. We removed all the relevant pipelines and source code, and we used `dvc remove` to delete the `.dvc` files we had in the codebase, but that vendor's data is still in the cache. There are also files tracked by `dvc.lock` files still in the cache. We could use `gc` to obliterate all old data, but for all our other data we want to preserve old versions in the cache so that we can see exactly what changed between data versions. This is useful when our data providers introduce new errors that break our pipelines.

Another thing that could be useful is being able to garbage collect files more than one year old, for example, or to garbage collect all old versions except the current and previous one.
These 2 things should be possible today using: https://dvc.org/doc/command-reference/gc#--date
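For example, a minimal sketch of age-based collection with today's `--date` option (the date is hypothetical; check `dvc gc --help` for the exact semantics on your DVC version, roughly: keep data referenced by commits on or after the given date):

```bash
# Keep workspace data plus data referenced in commits since the given date,
# removing older versions from the local cache.
dvc gc -w --date 2023-01-01

# Add --cloud to also collect the configured remote storage.
dvc gc -w --date 2023-01-01 --cloud
```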
Thanks @daavoo, those will be useful!
Are there any plans to implement this feature?

I would be willing to work on this, but I'd appreciate some input first: is there any real reason why this is not already implemented?
@asiron no particular reason besides what is already mentioned, and I don't see any critical blockers. There just aren't enough hands to prioritize this at the moment.
This has already been mentioned a few times in #2325, but I wanted to draw attention to this aspect again specifically: since `dvc pull` and `dvc fetch` allow for granular selection of targets, it would be very helpful to be able to use `dvc gc` to remove those same targets from the cache once we are done with them. In my case specifically, I have a few semi-independent datasets that I would rather not keep in the cache at the same time, but would like to be able to switch between for different analyses (and occasionally have both in cache for specific tasks).
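For illustration, a minimal sketch with two hypothetical dataset targets; the last command shows the behavior requested here and is not valid syntax today:

```bash
# Granular fetch/pull already works per target.
dvc pull data/dataset_a.dvc
dvc fetch data/dataset_b.dvc

# Requested (hypothetical, not implemented): drop a specific target's
# objects from the local cache when done with it, without touching
# anything else in the cache.
dvc gc data/dataset_a.dvc
```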