dvc gc remove <datafile-or-dir> #4218

Open
edwardwbarber opened this issue Jul 16, 2020 · 13 comments
Labels
A: gc (related to garbage collection); feature request (requesting a new feature); p2-medium (medium priority, should be done, but less important)

Comments

@edwardwbarber

This has already been mentioned a few times in #2325, but I wanted to draw attention again to this aspect specifically:

Since dvc pull and dvc fetch allow for granular selection of targets it would be very helpful to be able to use dvc gc to remove those same targets from cache once we are done with them. In my case specifically, I have a few semi-independent datasets I would rather avoid having to keep in cache at the same time, but would like to be able to switch between for different analyses (and occasionally have both in cache for specific tasks).

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jul 16, 2020
@efiop efiop added feature request Requesting a new feature p2-medium Medium priority, should be done, but less important labels Jul 17, 2020
@triage-new-issues triage-new-issues bot removed triage Needs to be triaged labels Jul 17, 2020
@Jaume-JCI

This feature would be very important in my projects as well.

@daavoo daavoo added the A: gc Related to garbage collection label Oct 7, 2022
@dberenbaum
Collaborator

The only real blocker to this besides prioritization is that it's dangerous since it could delete something needed elsewhere, right? Could we add the command with a stern warning that it's dangerous?

@Jaume-JCI

Jaume-JCI commented Oct 7, 2022

Thanks @dberenbaum. In the meantime, could you please comment on whether the following hack would corrupt the DVC setup in any way?

Imagine I have run

dvc pull my_data_folder.dvc

This will place the downloaded data into .dvc/cache and create a set of soft links in my_data_folder (if you have configured DVC to use soft links). That is, if we list the contents of my_data_folder with

ls -l my_data_folder

We see something like:

my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...

The idea is to be able to delete only specific files. Using the hash displayed by the ls -l command, I can directly delete the corresponding files in the DVC cache. For instance, to remove my_data_file_1.pk, I can do:

rm my_data_folder/my_data_file_1.pk
rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792

Later, if I want to download this file again, I can just run dvc pull my_data_folder.dvc again. Would that corrupt DVC? Or do I need to delete all the files that are linked in my_data_folder, rather than just a single one?
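
For illustration only, the manual steps above could be wrapped in a small Python helper that refuses to delete anything outside the cache. This is a sketch, not a DVC feature: the function name `drop_cached_file` and the `dry_run` parameter are hypothetical, and it assumes the workspace file is a symlink into the cache directory, as in the `ls -l` listing above.

```python
from pathlib import Path


def drop_cached_file(link: Path, cache_dir: Path, dry_run: bool = True) -> Path:
    """Remove a workspace symlink and the cache object it points to.

    Hypothetical helper, not part of DVC. Returns the cache path that
    was (or, with dry_run=True, would be) removed.
    """
    if not link.is_symlink():
        raise ValueError(f"{link} is not a symlink; refusing to guess its cache entry")

    target = link.resolve()
    cache_root = cache_dir.resolve()
    # Safety check: only ever delete objects that live inside the cache.
    if cache_root not in target.parents:
        raise ValueError(f"{target} is outside {cache_root}")

    if not dry_run:
        link.unlink()                   # the link in the workspace
        target.unlink(missing_ok=True)  # the cached object itself
    return target
```

Calling it with the default `dry_run=True` only reports which cache object would go, which makes it easy to try on a test repo first. If several workspace files link to the same cache object, removing that object leaves the other links dangling until the next dvc pull.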

Thanks!

@dberenbaum
Collaborator

I think as long as you aren't editing the files in place, and are only dropping and adding files and pulling them, it should be safe. You may want to test on an example to be safe before trying on your real data.
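
One way to test for accidental in-place edits is to compare a cache object's content hash with the hash encoded in its path. The sketch below assumes the layout shown earlier in this thread (a two-character directory followed by the remaining 30 hex characters) and that the name is a plain MD5 of the file bytes. Both are assumptions: some DVC versions normalize line endings in text files before hashing, and directory entries are stored differently, so treat a mismatch as a prompt to investigate rather than as proof of corruption. The function name is hypothetical.

```python
import hashlib
from pathlib import Path


def cache_object_is_intact(obj: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's MD5 matches the hash encoded in its cache path.

    Assumes a path of the form <cache>/<2 hex chars>/<30 hex chars>.
    """
    expected = obj.parent.name + obj.name  # e.g. "4f" + "7bc7..."
    digest = hashlib.md5()
    with obj.open("rb") as fh:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```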

@Jaume-JCI

Thank you very much @dberenbaum.

@jorgeorpinel
Contributor

Currently gc works by blacklisting, i.e. you specify what NOT to collect. Would it still be helpful to have an option to specify certain target files to keep? Ideas:

$ dvc gc -w --outs data/abc.xml

Keep the current version of data/abc.xml (referenced in the workspace)

$ dvc gc -A -o *.dvc -o model.pt

Keep raw data (all outputs of all .dvc files) and a specific model file referenced in all commits (in essence this removes all intermediate artifacts which can always be reproduced anyway).

@oadams

oadams commented Dec 5, 2022

This would be very useful in cases where data has been pushed to the remote accidentally before setting cache: false / push: false in dvc.yaml. Currently it's quite difficult to selectively purge stuff that was accidentally pushed to the remote.

@jeremyherr

Here's another use case. Our team has been asked to remove all traces of one vendor's data from our systems, because our contract with them has ended. We removed all the relevant pipelines and source code, and we used dvc remove to remove any .dvc files we had from the codebase, but that vendor's data is still in the cache. Also, there are files tracked by dvc.lock files still in the cache. We could use gc to obliterate all old data, but for all our other data, we want to preserve old versions in the cache so that we can see exactly what changes were made between data versions. This can be useful when our data providers introduce new errors that break our pipelines.

Another thing that could be useful is to be able to garbage collect files more than one year old, for example. Or garbage collect all old versions except the current and previous one.

@daavoo
Contributor

daavoo commented Aug 3, 2023

> Another thing that could be useful is to be able to garbage collect files more than one year old, for example. Or garbage collect all old versions except the current and previous one.

These 2 things should be possible today using:

https://dvc.org/doc/command-reference/gc#--date
https://dvc.org/doc/command-reference/gc#--rev & https://dvc.org/doc/command-reference/gc#-n
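
As a rough sketch of how the linked options might be combined from a script, the helper below only builds a dvc gc argument list and never runs it. The function name `gc_keep_args` is hypothetical, the YYYY-MM-DD date format is an assumption, and the exact semantics of each flag (and whether an additional scope flag is required) should be checked against `dvc gc --help` for the installed version.

```python
from typing import List, Optional


def gc_keep_args(rev: Optional[str] = None,
                 num: Optional[int] = None,
                 date: Optional[str] = None) -> List[str]:
    """Build (but do not run) a `dvc gc` command line.

    Hypothetical helper: it only assembles the --rev, -n and --date
    options referenced in the links above.
    """
    args = ["dvc", "gc"]
    if rev is not None:
        args += ["--rev", rev]
        if num is not None:
            args += ["-n", str(num)]  # number of commits to keep, counted from rev
    elif num is not None:
        # Design choice of this sketch: -n is only emitted together with --rev.
        raise ValueError("num requires rev in this sketch")
    if date is not None:
        args += ["--date", date]  # assumed format: YYYY-MM-DD
    return args


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    print(" ".join(gc_keep_args(rev="HEAD", num=2)))
    print(" ".join(gc_keep_args(date="2023-01-01")))
```

Printing the command before running it is deliberate here, since gc deletes data.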

@jeremyherr

Thanks @daavoo , those will be useful!

@asiron

asiron commented Nov 18, 2024

Are there any plans to implement this feature?
There is essentially no way to delete specific files from the cache.
For example, I have a datasets repo as well as a model repo that imports a bunch of data from datasets.
The model repo created a lot of files during rapid development and quickly bloated the shared cache and the storage.
Let's say that we now have a more mature version and would like to clean the cache and storage:

  • remove all targets from this (model) repo older than X (date or rev)
  • keep in cache/storage all dvc imported targets in data/
  • an option to remove data just from the storage while keeping it in cache (equivalent to git push -d origin <branch>)

Currently that is not possible, and it seems to me like a simple use case.

@asiron

asiron commented Dec 16, 2024

I would be willing to work on this, but I'd appreciate any input beforehand. Is there any real reason why this is not already implemented?

@shcheklein
Member

@asiron no particular reason besides what is already mentioned. And I don't see any critical blockers atm. Just not enough hands to prioritize this atm.

10 participants