cmd-ref: document new gc behavior #1023
Changes from all commits
21572eb
50156f1
ab7f625
e530271
e4f1b3c
06445b8
f90622d
e2ea8a7
49e4f53
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -1,60 +1,80 @@ | ||||||||||||||||
# gc | ||||||||||||||||
|
||||||||||||||||
Remove unused objects from <abbr>cache</abbr> or remote storage. | ||||||||||||||||
Remove unused files and directories from <abbr>cache</abbr> or | ||||||||||||||||
[remote storage](/doc/command-reference/remote). | ||||||||||||||||
|
||||||||||||||||
## Synopsis | ||||||||||||||||
|
||||||||||||||||
```usage | ||||||||||||||||
usage: dvc gc [-h] [-q | -v] [-a] [-T] [-c] [-r <name>] | ||||||||||||||||
usage: dvc gc [-h] [-q | -v] | ||||||||||||||||
[-w] [-a] [-T] [--all-commits] [-c] [-r <name>] | ||||||||||||||||
[-f] [-j <number>] [-p [<path> [<path> ...]]] | ||||||||||||||||
``` | ||||||||||||||||
|
||||||||||||||||
## Description | ||||||||||||||||
|
||||||||||||||||
This command deletes (garbage collects) data files or directories that may exist | ||||||||||||||||
in the cache (or [remote storage](/doc/command-reference/remote) if `-c` is | ||||||||||||||||
used) but no longer referenced in [DVC-files](/doc/user-guide/dvc-file-format) | ||||||||||||||||
currently in the <abbr>workspace</abbr>. By default, this command only cleans up | ||||||||||||||||
the local cache, which is typically located on the same machine as the project | ||||||||||||||||
in question. This usually helps to free up disk space. | ||||||||||||||||
This command deletes (garbage collects) data files or directories that exist in | ||||||||||||||||
DVC cache but are no longer needed. With `--cloud` it also removes data in | ||||||||||||||||
[remote storage](/doc/command-reference/remote). | ||||||||||||||||
|
||||||||||||||||
There are important things to note when using Git to version the | ||||||||||||||||
<abbr>project</abbr>: | ||||||||||||||||
To avoid accidentally deleting data, it raises an error and doesn't touch any | ||||||||||||||||
files if no scope options are provided. It is the user's responsibility to | ||||||||||||||||
explicitly provide the right set of options to specify what data is still needed | ||||||||||||||||
(so that DVC can figure out what files can be safely deleted). | ||||||||||||||||
|
||||||||||||||||
- If the cache/remote holds several versions of the same data, all except the | ||||||||||||||||
current one will be deleted. | ||||||||||||||||
- Use the `--all-branches` or `--all-tags` options to avoid collecting data | ||||||||||||||||
referenced in the tips of all branches or all tags, respectively. | ||||||||||||||||
At least one of the scope options (`--workspace`, `--all-branches`, `--all-tags`, | ||||||||||||||||
`--all-commits`) must be provided, and they can be combined. Each of them | ||||||||||||||||
corresponds to the current workspace _and_ a set of commits whose DVC-files are | ||||||||||||||||
analyzed to determine which files, directories, and versions are still needed | ||||||||||||||||
and should be kept. | ||||||||||||||||
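
The way scope options combine can be pictured with a toy model. This is only a sketch under stated assumptions, not DVC's implementation: the function `collect_garbage`, the scope names used as dictionary keys, and the hash strings are all hypothetical. It assumes each scope contributes a set of hashes referenced by DVC-files, that the workspace is always included, and that anything in the cache outside the combined set is removed.

```python
from typing import Dict, Iterable, Set


def collect_garbage(
    cache: Set[str],
    referenced: Dict[str, Set[str]],
    scopes: Iterable[str],
) -> Set[str]:
    """Return the cache entries that would be removed (toy model, not DVC code)."""
    selected = set(scopes)
    if not selected:
        # Mirrors the documented behavior: without a scope option nothing is touched.
        raise ValueError("no scope options provided; nothing was removed")
    # Every scope option also covers the current workspace.
    selected.add("workspace")
    keep: Set[str] = set()
    for scope in selected:
        keep |= referenced.get(scope, set())
    return cache - keep


# Hypothetical hashes referenced by DVC-files in each scope.
cache = {"a1", "b2", "c3", "d4"}
referenced = {
    "workspace": {"a1"},
    "all_branches": {"b2"},
    "all_tags": {"c3"},
}

removed_workspace = collect_garbage(cache, referenced, ["workspace"])
removed_combined = collect_garbage(cache, referenced, ["all_branches", "all_tags"])
```

In this model, widening the scope can only shrink the set of removed entries, which is why combining options is the conservative choice.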
Review thread on lines +26 to +29 (suggested change):

Question: what's the point of this option? To me it sounds like "don't gc anything" except maybe deleted parts of the repo like removed branches, rewritten commits, etc.

DVC cache contains data that is not referenced in any way in Git history. E.g. run do a lot of

We probably mention this.

mentioned a possible use case in the recent version - @jorgeorpinel pls take a look

Makes sense, thanks.

Still this suggestion wasn't applied but I'll address in regular updates.

I put a new paragraph that explains the use case, have you seen it? Or may be I didn't get you specific suggestion ... 🤔

I see it, thanks. May reword a little in regular updates but it's great!
|
||||||||||||||||
|
||||||||||||||||
The default remote is used (see `dvc config core.remote`) unless the `--remote` | ||||||||||||||||
option is used. | ||||||||||||||||
|
||||||||||||||||
Unless the `--cloud` (`-c`) option is used, `dvc gc` does not remove data files | ||||||||||||||||
from any remote. This means that any files collected from the local cache can be | ||||||||||||||||
Unless the `--cloud` option is used, `dvc gc` does not remove data files from | ||||||||||||||||
any remote. This means that any files collected from the local cache can be | ||||||||||||||||
restored using `dvc fetch`, as long as they have previously been uploaded with | ||||||||||||||||
`dvc push`. | ||||||||||||||||
|
||||||||||||||||
### Removing data in remote storage | ||||||||||||||||
|
||||||||||||||||
If the `--cloud` option is provided, the command deletes unused data not only in | ||||||||||||||||
the local DVC cache, but also in remote storage. This can be dangerous, since in | ||||||||||||||||
most cases removing data both locally and in remote storage is irreversible. | ||||||||||||||||
|
||||||||||||||||
The default remote is cleaned (see `dvc config core.remote`) unless the | ||||||||||||||||
`--remote` option is used. | ||||||||||||||||
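
The difference between a local-only collection and one that also cleans the remote can be sketched with a small model. This is an illustration under assumptions, not how DVC is implemented: `gc_effect` and the hash strings are hypothetical, and the model assumes that a file removed locally can be restored with `dvc fetch` only if it is still present in remote storage.

```python
from typing import Set, Tuple


def gc_effect(
    local: Set[str],
    remote: Set[str],
    keep: Set[str],
    cloud: bool = False,
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (local after gc, remote after gc, removed entries still restorable)."""
    removed_locally = local - keep
    new_local = local & keep
    # Remote storage is only cleaned when the cloud flag is set.
    new_remote = (remote & keep) if cloud else set(remote)
    # A removed entry can be fetched again only if the remote still holds it.
    restorable = removed_locally & new_remote
    return new_local, new_remote, restorable


# Hypothetical state: "b2" was pushed earlier, "c3" never was.
local = {"a1", "b2", "c3"}
remote = {"a1", "b2"}
keep = {"a1"}

local_only = gc_effect(local, remote, keep)
with_cloud = gc_effect(local, remote, keep, cloud=True)
```

Without the cloud flag, `b2` remains recoverable from the remote while `c3` is gone for good. With it, nothing that was removed can be restored.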
|
||||||||||||||||
## Options | ||||||||||||||||
|
||||||||||||||||
- `-a`, `--all-branches` - keep cached objects referenced in all Git branches. | ||||||||||||||||
Useful for keeping data for all the latest experiment versions. It's | ||||||||||||||||
recommended to consider including this option when using `-c` i.e. | ||||||||||||||||
`dvc gc -ac`. | ||||||||||||||||
- `-w`, `--workspace` - keep _only_ files and directories referenced in the | ||||||||||||||||
current workspace. This option is enabled automatically if `--all-tags`, | ||||||||||||||||
`--all-branches`, or `--all-commits` is used. | ||||||||||||||||
|
||||||||||||||||
- `-a`, `--all-branches` - keep cached objects referenced in all Git branches as | ||||||||||||||||
well as in the workspace (implies `-w`). Useful if branches are used to track | ||||||||||||||||
different experiments. | ||||||||||||||||
|
||||||||||||||||
- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well | ||||||||||||||||
as the workspace (implies `-w`). Useful if tags are used to track | ||||||||||||||||
"checkpoints" of an experiment or project. Note that both options can be | ||||||||||||||||
combined, for example using the `-aT` flag. | ||||||||||||||||
|
||||||||||||||||
- `--all-commits` - the same as `-a` or `-T` above, but applies to _all_ Git | ||||||||||||||||
commits as well as the workspace (implies `-w`). Useful for keeping all the | ||||||||||||||||
data used in the entire existing commit history of the project. | ||||||||||||||||
|
||||||||||||||||
- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags. It's | ||||||||||||||||
useful if tags are used to track "checkpoints" of an experiment or project. | ||||||||||||||||
Note that both options can be combined, for example using the `-aT` flag. | ||||||||||||||||
One of the use cases for this option is to safely delete all temporary data | ||||||||||||||||
DVC cached when `dvc run` and/or `dvc repro` were run without committing | ||||||||||||||||
changes to DVC-files (thus potentially caching data that is not referenced | ||||||||||||||||
from workspace or Git commits). | ||||||||||||||||
|
||||||||||||||||
- `-p <paths>`, `--projects <paths>` - if a single remote or a single cache is | ||||||||||||||||
shared among different projects (e.g. a configuration like the one described | ||||||||||||||||
[here](/doc/use-cases/shared-development-server)), this option can be used to | ||||||||||||||||
specify a list of them (each project is a path) to keep data that is currently | ||||||||||||||||
referenced from them. | ||||||||||||||||
|
||||||||||||||||
- `-c`, `--cloud` - also remove files in remote storage. _This operation is | ||||||||||||||||
dangerous._ It removes datasets, models, other files that are not linked in | ||||||||||||||||
the current commit (unless `-a` or `-T` are also used). The default remote is | ||||||||||||||||
used unless a specific one is given with `-r`. | ||||||||||||||||
- `-c`, `--cloud` - remove files in remote storage in addition to local cache. | ||||||||||||||||
**This option is dangerous.** The default remote is used unless a specific one | ||||||||||||||||
is given with `-r`. | ||||||||||||||||
|
||||||||||||||||
- `-r <name>`, `--remote <name>` - name of the | ||||||||||||||||
[remote storage](/doc/command-reference/remote) to collect unused objects from | ||||||||||||||||
|
@@ -83,11 +103,12 @@ $ du -sh .dvc/cache/ | |||||||||||||||
7.4G .dvc/cache/ | ||||||||||||||||
``` | ||||||||||||||||
|
||||||||||||||||
When you run `dvc gc` it removes all objects from cache that are not referenced | ||||||||||||||||
in the <abbr>workspace</abbr> (by collecting hash values from the DVC-files): | ||||||||||||||||
When you run `dvc gc --workspace`, DVC removes all objects from cache that are | ||||||||||||||||
not referenced in the <abbr>workspace</abbr> (by collecting hash values from the | ||||||||||||||||
DVC-files): | ||||||||||||||||
|
||||||||||||||||
```dvc | ||||||||||||||||
$ dvc gc | ||||||||||||||||
$ dvc gc --workspace | ||||||||||||||||
|
||||||||||||||||
'.dvc/cache/27e30965256ed4d3e71c2bf0c4caad2e' was removed | ||||||||||||||||
'.dvc/cache/2e006be822767e8ba5d73ebad49ef082' was removed | ||||||||||||||||
|
Review thread:

It'd be better to express this in any other way. I don't have a better suggestion, though than the following:

Good question, I think needed was there before. I don't have a very strong opinion ... may be even `needed` better because it is more precise here - I just keep stuff that I need. need is more general then use. Still, need to review other place with in use/need.

Changed to always use `need` vs `in-use`. I think `need` and `needed` are actually more correct here.