cmd-ref: document new gc behavior (#1023)
* cmd-ref: document new gc behavior

* mention about -w as scope specifier for garbage collection

* apply some corrections to the latest gc doc

* Update content/docs/command-reference/gc.md

Co-Authored-By: Jorge Orpinel <[email protected]>

* Update content/docs/command-reference/gc.md

Co-Authored-By: Jorge Orpinel <[email protected]>

* Update content/docs/command-reference/gc.md

Co-Authored-By: Jorge Orpinel <[email protected]>

* Update content/docs/command-reference/gc.md

Co-Authored-By: Jorge Orpinel <[email protected]>

* address reviews feedback, always use need, mention use case for all-commits

Co-authored-by: Ivan Shcheklein <[email protected]>
Co-authored-by: Jorge Orpinel <[email protected]>
3 people authored Mar 19, 2020
1 parent 4be8579 commit 0381090
Showing 1 changed file with 54 additions and 33 deletions.
87 changes: 54 additions & 33 deletions content/docs/command-reference/gc.md
@@ -1,60 +1,80 @@
# gc

Remove unused objects from <abbr>cache</abbr> or remote storage.
Remove unused files and directories from <abbr>cache</abbr> or
[remote storage](/doc/command-reference/remote).

## Synopsis

```usage
usage: dvc gc [-h] [-q | -v] [-a] [-T] [-c] [-r <name>]
usage: dvc gc [-h] [-q | -v]
[-w] [-a] [-T] [--all-commits] [-c] [-r <name>]
[-f] [-j <number>] [-p [<path> [<path> ...]]]
```

## Description

This command deletes (garbage collects) data files or directories that may exist
in the cache (or [remote storage](/doc/command-reference/remote) if `-c` is
used) but are no longer referenced in [DVC-files](/doc/user-guide/dvc-file-format)
currently in the <abbr>workspace</abbr>. By default, this command only cleans up
the local cache, which is typically located on the same machine as the project
in question. This usually helps to free up disk space.
This command deletes (garbage collects) data files or directories that exist in
the DVC cache but are no longer needed. With `--cloud`, it also removes data in
[remote storage](/doc/command-reference/remote).

There are important things to note when using Git to version the
<abbr>project</abbr>:
To avoid accidentally deleting data, `dvc gc` raises an error and doesn't touch
any files if no scope options are provided. This means it is the user's
responsibility to explicitly provide the right set of options to specify what
data is still needed (so that DVC can figure out which files can be safely
deleted).

- If the cache/remote holds several versions of the same data, all except the
current one will be deleted.
- Use the `--all-branches` or `--all-tags` options to avoid collecting data
referenced in the tips of all branches or all tags, respectively.
At least one of the scope options (`--workspace`, `--all-branches`,
`--all-tags`, `--all-commits`), or any combination of them, must be provided.
Each of them corresponds to the current workspace _and_ a set of commits. DVC
analyzes the DVC-files in those commits to determine which files, directories,
and versions are still needed and should be kept.
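
As a rough illustration of how the scope options combine, the following Python
sketch models the set of data to keep as the union of what the workspace and
the selected commits reference. It is a conceptual model only, not DVC's
implementation: the function names, keyword arguments, and the short hash
values are all hypothetical.

```python
# Conceptual model of gc scopes; every name here is hypothetical, not a DVC API.
def needed_hashes(workspace, branches, tags, commits, *,
                  use_workspace=False, all_branches=False,
                  all_tags=False, all_commits=False):
    """Return the hashes to keep for the chosen scope options."""
    if not (use_workspace or all_branches or all_tags or all_commits):
        # Mirrors the documented behavior: no scope option means an error.
        raise ValueError("no scope option given; nothing is collected")
    keep = set(workspace)  # every scope option includes the workspace
    scopes = ((all_branches, branches), (all_tags, tags), (all_commits, commits))
    for enabled, refs in scopes:
        if enabled:
            for hashes in refs.values():
                keep |= set(hashes)
    return keep


def garbage(cache, keep):
    """Everything in the cache that no selected scope references."""
    return set(cache) - keep


# Made-up example data: hash values referenced by DVC-files in each place.
cache = {"a1", "b2", "c3", "d4"}
workspace = {"a1"}
branches = {"master": {"a1"}, "experiment": {"b2"}}
tags = {"v1.0": {"c3"}}
commits = {"c1": {"d4"}, "c2": {"c3"}, "c3": {"b2"}, "c4": {"a1"}}

only_workspace = garbage(
    cache, needed_hashes(workspace, branches, tags, commits, use_workspace=True)
)
branches_and_tags = garbage(
    cache,
    needed_hashes(workspace, branches, tags, commits,
                  all_branches=True, all_tags=True),
)
print(sorted(only_workspace), sorted(branches_and_tags))
```

In this model, widening the scope can only shrink the set of files that get
collected, which is why combining options is the more conservative choice.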

The default remote is used (see `dvc config core.remote`) unless the `--remote`
option is used.

Unless the `--cloud` (`-c`) option is used, `dvc gc` does not remove data files
from any remote. This means that any files collected from the local cache can be
Unless the `--cloud` option is used, `dvc gc` does not remove data files from
any remote. This means that any files collected from the local cache can be
restored using `dvc fetch`, as long as they have previously been uploaded with
`dvc push`.

### Removing data in remote storage

If the `--cloud` option is provided, the command deletes unused data not only in
the local DVC cache, but also in remote storage. This can be dangerous, since in
most cases removing data both locally and in remote storage is irreversible.

The default remote is cleaned (see `dvc config core.remote`) unless the
`--remote` option is used.
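
The difference between collecting locally and collecting with `--cloud` can be
sketched with plain Python sets. This is a simplified model under stated
assumptions, not DVC code: `run_gc`, its parameters, and the hash values are
hypothetical, and the remote is assumed to already hold everything that was
pushed.

```python
# Simplified model; run_gc and its arguments are hypothetical, not a DVC API.
def run_gc(cache, remote, keep, *, cloud=False):
    """Return (cache, remote, removed) after collecting unneeded hashes."""
    removed = set(cache) - set(keep)
    new_cache = set(cache) & set(keep)
    # The remote is only touched when cloud=True (modeling --cloud).
    new_remote = (set(remote) & set(keep)) if cloud else set(remote)
    return new_cache, new_remote, removed


cache = {"a1", "b2"}
remote = {"a1", "b2"}   # assume both objects were pushed earlier
keep = {"a1"}           # only a1 is still referenced

# Local-only collection: b2 leaves the cache but stays in the remote,
# so it could still be fetched again.
local_cache, local_remote, removed = run_gc(cache, remote, keep)
still_fetchable = removed & local_remote

# Collection with cloud=True: b2 is gone from both places.
cloud_cache, cloud_remote, removed_cloud = run_gc(cache, remote, keep, cloud=True)
lost = removed_cloud - cloud_remote

print(sorted(still_fetchable), sorted(lost))
```

The second call is the case the warning above is about: once an object is
removed from both the cache and the remote, this model has no copy left to
restore it from.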

## Options

- `-a`, `--all-branches` - keep cached objects referenced in all Git branches.
Useful for keeping data for all the latest experiment versions. It's
recommended to consider including this option when using `-c` i.e.
`dvc gc -ac`.
- `-w`, `--workspace` - keep _only_ files and directories referenced in the
current workspace. This option is enabled automatically if `--all-tags`,
`--all-branches`, or `--all-commits` is used.

- `-a`, `--all-branches` - keep cached objects referenced in all Git branches as
well as in the workspace (implies `-w`). Useful if branches are used to track
different experiments.

- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well
as the workspace (implies `-w`). Useful if tags are used to track
"checkpoints" of an experiment or project. Note that both options can be
combined, for example using the `-aT` flag.

- `--all-commits` - the same as `-a` or `-T` above, but applies to _all_ Git
commits as well as the workspace (implies `-w`). Useful for keeping all the
data used in the entire existing commit history of the project.

- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags. It's
useful if tags are used to track "checkpoints" of an experiment or project.
Note that both options can be combined, for example using the `-aT` flag.
One of the use cases for this option is to safely delete all temporary data
that DVC cached when `dvc run` and/or `dvc repro` were run without committing
changes to DVC-files (thus potentially caching data that is not referenced
from the workspace or Git commits).

- `-p <paths>`, `--projects <paths>` - if a single remote or a single cache is
shared among different projects (e.g. a configuration like the one described
[here](/doc/use-cases/shared-development-server)), this option can be used to
specify a list of them (each project is a path) to keep data that is currently
referenced from them.

- `-c`, `--cloud` - also remove files in remote storage. _This operation is
dangerous._ It removes datasets, models, other files that are not linked in
the current commit (unless `-a` or `-T` are also used). The default remote is
used unless a specific one is given with `-r`.
- `-c`, `--cloud` - remove files in remote storage in addition to local cache.
**This option is dangerous.** The default remote is used unless a specific one
is given with `-r`.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) to collect unused objects from
@@ -83,11 +103,12 @@ $ du -sh .dvc/cache/
7.4G .dvc/cache/
```

When you run `dvc gc` it removes all objects from cache that are not referenced
in the <abbr>workspace</abbr> (by collecting hash values from the DVC-files):
When you run `dvc gc --workspace`, DVC removes all objects from cache that are
not referenced in the <abbr>workspace</abbr> (by collecting hash values from the
DVC-files):

```dvc
$ dvc gc
$ dvc gc --workspace
'.dvc/cache/27e30965256ed4d3e71c2bf0c4caad2e' was removed
'.dvc/cache/2e006be822767e8ba5d73ebad49ef082' was removed