cmd-ref: document new gc behavior #1023

Merged
merged 9 commits into from
Mar 19, 2020
87 changes: 54 additions & 33 deletions content/docs/command-reference/gc.md
@@ -1,60 +1,80 @@
# gc

Remove unused objects from <abbr>cache</abbr> or remote storage.
Remove unused files and directories from <abbr>cache</abbr> or
[remote storage](/doc/command-reference/remote).

## Synopsis

```usage
usage: dvc gc [-h] [-q | -v] [-a] [-T] [-c] [-r <name>]
usage: dvc gc [-h] [-q | -v]
[-w] [-a] [-T] [--all-commits] [-c] [-r <name>]
[-f] [-j <number>] [-p [<path> [<path> ...]]]
```

## Description

This command deletes (garbage collects) data files or directories that may exist
in the cache (or [remote storage](/doc/command-reference/remote) if `-c` is
used) but no longer referenced in [DVC-files](/doc/user-guide/dvc-file-format)
currently in the <abbr>workspace</abbr>. By default, this command only cleans up
the local cache, which is typically located on the same machine as the project
in question. This usually helps to free up disk space.
This command deletes (garbage collects) data files or directories that exist in
DVC cache but are no longer needed. With `--cloud` it also removes data in
Member Author

but are no longer needed.

It'd be better to express this in any other way. I don't have a better suggestion, though than the following:

Suggested change
DVC cache but are no longer needed. With `--cloud` it also removes data in
DVC cache but are no longer in use. With `--cloud` it also removes data in

Member

Good question, I think "needed" was there before. I don't have a very strong opinion ... maybe "needed" is even better because it is more precise here: I just keep stuff that I need. "Need" is more general than "use". Still, we need to review other places that use "in use"/"need".

Member

Changed to always use need vs in-use. I think need and needed are actually more correct here.

[remote storage](/doc/command-reference/remote).

There are important things to note when using Git to version the
<abbr>project</abbr>:
To avoid accidentally deleting data, this command raises an error and doesn't
touch any files if no scope options are provided. This means it is the user's
responsibility to explicitly provide the right set of options to specify what
data is still needed (so that DVC can figure out what files can be safely
deleted).

- If the cache/remote holds several versions of the same data, all except the
current one will be deleted.
- Use the `--all-branches` or `--all-tags` options to avoid collecting data
referenced in the tips of all branches or all tags, respectively.
One of the scope options, `--workspace`, `--all-branches`, `--all-tags`,
`--all-commits`, or any combination of them must be provided. Each of them
corresponds to the current workspace _and_ a set of commits to analyze what
files, directories and what versions are still needed and should be kept (by
analyzing DVC-files in those commits).
Comment on lines +26 to +29
Contributor

Suggested change
`--all-commits`, or any combination of them must be provided. Each of them
corresponds to the current workspace _and_ a set of commits to analyze what
files, directories and what versions are still needed and should be kept (by
analyzing DVC-files in those commits).
`--all-commits`, or any combination of them must be provided. Each of them
corresponds to keeping the data for the current workspace and for a
different set of commits (determined by reading the DVC-files in them).

Contributor

Question: what's the point of this option? To me it sounds like "don't gc anything" except maybe deleted parts of the repo like removed branches, rewritten commits, etc.

Member

The DVC cache can contain data that is not referenced in any way in Git history, e.g. after running `dvc run` or `dvc repro` many times following changes to certain files/settings. Unless `--no-commit` is specified, DVC saves data to the cache whether or not you `git commit` the DVC-files.

Member

We probably mention this.

Member

mentioned a possible use case in the recent version - @jorgeorpinel pls take a look

Contributor

Makes sense, thanks.

Contributor

Still this suggestion wasn't applied but I'll address in regular updates.

Member

I put in a new paragraph that explains the use case, have you seen it? Or maybe I didn't get your specific suggestion ... 🤔

Contributor

I see it, thanks. May reword a little in regular updates but it's great!

One of the use cases for this option is to safely delete all temporary data DVC cached when dvc run and/or dvc repro were run without committing changes to DVC-files (thus potentially caching data that is not referenced from workspace or Git commits).
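
As a rough illustration of the scope options discussed in this thread, garbage collection can be modeled as a set difference: everything in the cache minus everything referenced by the selected scopes. The sketch below is a conceptual model only, not DVC's implementation. The function `collect_garbage`, the mapping `refs_by_scope`, the scope names, and the short hash strings are all hypothetical names made up for the example.

```python
# Conceptual sketch only; this is NOT how DVC is implemented.
# `refs_by_scope` is a hypothetical mapping from a scope name to the cache
# hashes referenced by the DVC-files found in that scope.

def collect_garbage(cache, refs_by_scope, scopes):
    """Return the set of cache hashes that are not needed by any scope."""
    if not scopes:
        # Mirrors the documented behavior: with no scope option the command
        # raises an error and deletes nothing.
        raise ValueError("at least one scope option is required")

    # The workspace is always kept, since every other scope option implies it.
    keep = set(refs_by_scope.get("workspace", ()))
    for scope in scopes:
        keep |= set(refs_by_scope.get(scope, ()))

    return set(cache) - keep


cache = {"aaa", "bbb", "ccc", "ddd"}
refs_by_scope = {
    "workspace": {"aaa"},
    "all_branches": {"aaa", "bbb"},
    "all_tags": {"ccc"},
}

workspace_only = collect_garbage(cache, refs_by_scope, ["workspace"])
branches_and_tags = collect_garbage(
    cache, refs_by_scope, ["all_branches", "all_tags"]
)
print(sorted(workspace_only))
print(sorted(branches_and_tags))
```

In this toy data, `ddd` stands for data that was cached but never referenced from the workspace or any commit, which is the use case raised in the thread: it is collected under every scope combination, while widening the scope only ever shrinks the set of deletions.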


The default remote is used (see `dvc config core.remote`) unless the `--remote`
option is used.

Unless the `--cloud` (`-c`) option is used, `dvc gc` does not remove data files
from any remote. This means that any files collected from the local cache can be
Unless the `--cloud` option is used, `dvc gc` does not remove data files from
any remote. This means that any files collected from the local cache can be
restored using `dvc fetch`, as long as they have previously been uploaded with
`dvc push`.

### Removing data in remote storage

If the `--cloud` option is provided, the command deletes unused data not only in
the local DVC cache, but also in remote storage. This can be dangerous, since in
most cases removing data both locally and in remote storage is irreversible.

The default remote is cleaned (see `dvc config core.remote`) unless the
`--remote` option is used.
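
To make the risk concrete, here is a small conceptual model of why the cloud option is irreversible while a local-only collection is not. It is a sketch under stated assumptions, not DVC's implementation: `gc_targets`, its `cloud` parameter, and the short hash strings are hypothetical names used only for illustration.

```python
# Conceptual sketch only; this is NOT how DVC is implemented.

def gc_targets(local_cache, remote_storage, keep, cloud=False):
    """Return (local_deletions, remote_deletions) as sets of hashes."""
    local_deletions = set(local_cache) - set(keep)
    # Without the cloud flag the remote is left untouched, so anything that
    # was pushed earlier can still be fetched back.
    remote_deletions = (set(remote_storage) - set(keep)) if cloud else set()
    return local_deletions, remote_deletions


local_cache = {"aaa", "bbb"}
remote_storage = {"aaa", "bbb", "ccc"}
keep = {"aaa"}

local_only = gc_targets(local_cache, remote_storage, keep)
with_cloud = gc_targets(local_cache, remote_storage, keep, cloud=True)

# Hashes that could still be restored from the remote after each run.
recoverable_local_only = remote_storage - local_only[1]
recoverable_with_cloud = remote_storage - with_cloud[1]
print(sorted(recoverable_local_only))
print(sorted(recoverable_with_cloud))
```

In the local-only run, `bbb` leaves the local cache but stays recoverable from the remote. With the cloud flag set, it is removed from both places and nothing outside the keep set can be restored.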

## Options

- `-a`, `--all-branches` - keep cached objects referenced in all Git branches.
Useful for keeping data for all the latest experiment versions. It's
recommended to consider including this option when using `-c` i.e.
`dvc gc -ac`.
- `-w`, `--workspace` - keep _only_ files and directories referenced in the
current workspace. This option is enabled automatically if `--all-tags`,
`--all-branches`, or `--all-commits` is used.

- `-a`, `--all-branches` - keep cached objects referenced in all Git branches as
well as in the workspace (implies `-w`). Useful if branches are used to track
different experiments.

- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well
as the workspace (implies `-w`). Useful if tags are used to track
"checkpoints" of an experiment or project. Note that both options can be
combined, for example using the `-aT` flag.

- `--all-commits` - the same as `-a` or `-T` above, but applies to _all_ Git
commits as well as the workspace (implies `-w`). Useful for keeping all the
data used in the entire existing commit history of the project.

- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags. It's
useful if tags are used to track "checkpoints" of an experiment or project.
Note that both options can be combined, for example using the `-aT` flag.
One of the use cases for this option is to safely delete all temporary data
DVC cached when `dvc run` and/or `dvc repro` were run without committing
changes to DVC-files (thus potentially caching data that is not referenced
from workspace or Git commits).

- `-p <paths>`, `--projects <paths>` - if a single remote or a single cache is
shared among different projects (e.g. a configuration like the one described
[here](/doc/use-cases/shared-development-server)), this option can be used to
specify a list of them (each project is a path) to keep data that is currently
referenced from them.

- `-c`, `--cloud` - also remove files in remote storage. _This operation is
dangerous._ It removes datasets, models, other files that are not linked in
the current commit (unless `-a` or `-T` are also used). The default remote is
used unless a specific one is given with `-r`.
- `-c`, `--cloud` - remove files in remote storage in addition to local cache.
**This option is dangerous.** The default remote is used unless a specific one
is given with `-r`.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) to collect unused objects from
@@ -83,11 +103,12 @@ $ du -sh .dvc/cache/
7.4G .dvc/cache/
```

When you run `dvc gc` it removes all objects from cache that are not referenced
in the <abbr>workspace</abbr> (by collecting hash values from the DVC-files):
When you run `dvc gc --workspace`, DVC removes all objects from cache that are
not referenced in the <abbr>workspace</abbr> (by collecting hash values from the
DVC-files):

```dvc
$ dvc gc
$ dvc gc --workspace

'.dvc/cache/27e30965256ed4d3e71c2bf0c4caad2e' was removed
'.dvc/cache/2e006be822767e8ba5d73ebad49ef082' was removed