diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 1173c78b41..e2146257b1 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -56,13 +56,8 @@ each one:
Summarizing, the result is that the target data is replaced by small `.dvc`
files that can be easily tracked with Git.
-> Note that `.dvc` files can be considered _orphan stages_, because they have no
-> dependencies, only outputs. These are treated as _always changed_
-> by `dvc status` and `dvc repro`, which always executes them. See `dvc.yaml` to
-> learn more about stages.
-
-To avoid adding files inside a directory accidentally, you can add the
-corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file.
+You can prevent DVC from adding certain files or directories by listing the
+corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file.
By default, DVC tries to use reflinks (see
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
@@ -73,15 +68,14 @@ large files. DVC also supports other link types for use on file systems without
### Adding entire directories
-A `dvc add` target can be an individual file or a directory. In the latter case,
-a `.dvc` file is created for the top of the directory (with default name
+A `dvc add` target can be either a file or a directory. In the latter case, a
+`.dvc` file is created for the top of the hierarchy (with default name
`.dvc`).
-Every file in the hierarchy is added to the cache (unless the `--no-commit`
-option is used), but DVC does not produce individual `.dvc` files for each file
-in the directory tree. Instead, the single `.dvc` file references a special JSON
-file in the cache (with `.dir` extension), that in turn points to the added
-files.
+Every file inside is added to the cache (unless the `--no-commit` option is
+used), but DVC does not produce individual `.dvc` files for each file in the
+entire tree. Instead, the single `.dvc` file references a special JSON file in
+the cache (with `.dir` extension), that in turn points to the added files.
> Refer to
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
@@ -97,6 +91,9 @@ generated for each file in he same location. This may be helpful to save time
adding several data files grouped in a structural directory, but it's
undesirable for data directories with a large number of files.
+To avoid adding files inside a directory accidentally, you can add the
+corresponding [patterns](/doc/user-guide/dvcignore) to `.dvcignore`.
+
## Options
- `-R`, `--recursive` - determines the files to add by searching each target
@@ -180,7 +177,8 @@ pics
└── dogs [more image files]
```
-Tracking a directory with DVC as simple as with a single file:
+[Tracking a directory](#tracking-directories) with DVC is as simple as with a
+single file:
```dvc
$ dvc add pics
diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md
index baafff3655..b90fdd9f7c 100644
--- a/content/docs/command-reference/checkout.md
+++ b/content/docs/command-reference/checkout.md
@@ -1,7 +1,7 @@
# checkout
-Update data files and directories in the workspace based on current
-DVC-files.
+Update DVC-tracked files and directories in the workspace based on
+current `dvc.lock` and `.dvc` files.
## Synopsis
@@ -10,39 +10,39 @@ usage: dvc checkout [-h] [-q | -v] [--summary] [-d] [-R] [-f]
[--relink] [targets [targets ...]]
positional arguments:
- targets Limit command scope to these stages or .dvc files.
- Using -R, directories to search for stages or .dvc
- files can also be given.
+ targets Limit command scope to these tracked files/directories,
+ .dvc files, or stage names.
```
## Description
-`.dvc` and `dvc.lock` [files](/doc/user-guide/dvc-files-and-directories) act as
-pointers to specific version of data files or directories tracked by DVC. This
-command synchronizes the workspace data with the versions specified in the
-current `.dvc` and `dvc.lock` files.
+This command is usually needed after `git checkout`, `git clone`, or any other
+operation that changes the current `dvc.lock` or `.dvc` files. It restores the
+corresponding versions of the DVC-tracked files and directories from the
+cache to the workspace.
-`dvc checkout` is useful, for example, when using Git in the
-project, after `git clone`, `git checkout`, or any other operation
-that changes the DVC files in the workspace.
-
-💡 For convenience, a Git hook is available to automate running `dvc checkout`
-after `git checkout`. See the
-[Automating example](#example-automating-dvc-checkout) below or `dvc install`
-for more details.
+The `targets` given to this command (if any) limit what to check out. It
+accepts paths to tracked files or directories (including paths inside tracked
+directories), `.dvc` files, or stage names (found in `dvc.yaml`).
The execution of `dvc checkout` does the following:
-- Scans the `.dvc` and `dvc.lock` files to compare against the data files or
- directories in the workspace. DVC knows which data
- (outputs) match because the corresponding hash values are saved
- in the `outs` fields in those files. Scanning is limited to the given
- `targets` (if any). See also options `--with-deps` and `--recursive` below.
+- Checks `dvc.lock` and `.dvc` files to compare the hash values of their
+ outputs against the actual files or directories in the
+ workspace (similar to `dvc status`).
+
+ > Stage outputs should be defined in `dvc.yaml`. If found there but not in
+ > `dvc.lock`, they'll be skipped, with a warning.
-- Missing data files or directories are restored from the cache.
- Those that don't match with any DVC-file are removed. See options `--force`
+- Missing data files or directories are restored from the cache. Those that
+ don't match with `dvc.lock` or `.dvc` files are removed. See options `--force`
and `--relink`. A list of the changes done is printed.
+💡 For convenience, a Git hook is available to automate running `dvc checkout`
+after `git checkout`. See the
+[Automating example](#example-automating-dvc-checkout) below or `dvc install`
+for more details.
+
By default, this command tries not make copies of cached files in the workspace,
using reflinks instead when supported by the file system (refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)).
@@ -64,9 +64,9 @@ such a case, `dvc checkout` prints a warning message. It also lists the partial
progress made by the checkout.
There are two methods to restore a file missing from the cache, depending on the
-situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
-regenerate its outputs (see also `dvc dag`). In other cases the cache can be
-pulled from remote storage using `dvc pull`.
+situation. In some cases the cache can be pulled from
+[remote storage](/doc/command-reference/remote) using `dvc pull`. In other cases
+the pipeline must be reproduced (using `dvc repro`) to regenerate its outputs.
## Options
@@ -130,9 +130,8 @@ below.
The workspace looks like this:
-````dvc
+```dvc
.
-├── README.md
├── data
│ └── data.xml.dvc
├── dvc.lock
@@ -141,15 +140,11 @@ The workspace looks like this:
├── prc.json
├── scores.json
└── src
- ├── evaluate.py
- ├── featurization.py
- ├── prepare.py
- ├── requirements.txt
- └── train.py```
-````
+ └──
+```
-This repository includes the following tags, that represent different variants
-of the resulting model:
+Note that this repository includes the following tags, which represent
+different variants of the resulting model:
```dvc
$ git tag
@@ -158,10 +153,9 @@ baseline-experiment <- First simple version of the model
bigrams-experiment <- Uses bigrams to improve the model
```
-We can now just run `dvc checkout` that will update the most recent `model.pkl`,
-`data.xml`, and other files that are tracked by DVC. The model file hash is
-defined in the `dvc.lock` file, and in the `data.xml.dvc` file for the
-`data.xml`:
+We can now run `dvc checkout` to update the most recent `model.pkl`, `data.xml`,
+and any other files tracked by DVC. The model file hash (`ab349c2...`) is saved
+in `dvc.lock`, and it can be confirmed with:
```dvc
$ dvc checkout
@@ -170,13 +164,15 @@ $ md5 model.pkl
MD5 (data.xml) = ab349c2b5fa2a0f66d6f33f94424aebe
```
+## Example: Switch versions
+
What if we want to "rewind history", so to speak? The `git checkout` command
-lets us restore any point in the repository history, including any tags. It
-automatically adjusts the files, by replacing file content and adding or
-deleting files as necessary.
+lets us restore any commit in the repository history (including tags). It
+automatically adjusts the repo files, by replacing, adding, or deleting them as
+necessary.
```dvc
-$ git checkout baseline-experiment # Stage where model is first created
+$ git checkout baseline-experiment # Git commit where model was created
```
Let's check the hash value of `model.pkl` in `dvc.lock` now:
@@ -187,16 +183,10 @@ outs:
md5: 98af33933679a75c2a51b953d3ab50aa
```
-But if you check `model.pkl`, the file hash is still the same:
-
-```dvc
-$ md5 model.pkl
-MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
-```
-
-This is because `git checkout` changed `dvc.lock` and other DVC files. But it
-did nothing with the `model.pkl` and `matrix.pkl` files. Git doesn't track those
-files; DVC does, so we must do this:
+But if you check the MD5 of `model.pkl`, the file hash is still the same
+(`ab349c2...`). This is because `git checkout` changed `dvc.lock` and other DVC
+files, but it did nothing with `model.pkl`, or any other DVC-tracked files/dirs.
+Since Git doesn't track them, we must do this:
```dvc
$ dvc checkout
@@ -207,8 +197,29 @@ $ md5 model.pkl
MD5 (model.pkl) = 98af33933679a75c2a51b953d3ab50aa
```
-What happened is that DVC went through the DVC-files and adjusted the current
-set of output files to match the `outs` in them.
+DVC went through the stages (in `dvc.yaml`) and adjusted the current set of
+outputs to match the `outs` in the corresponding `dvc.lock`.
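
The manual check used in this example (comparing the MD5 of a workspace file
with the value saved in `dvc.lock`) can be sketched as a small shell helper.
This is only an illustration: `file_md5` and `matches_hash` are hypothetical
names, not DVC commands, and DVC itself may compute hashes differently in some
cases.

```bash
# Hypothetical helper (not a DVC command): print the MD5 of a file using
# whichever tool is available (`md5sum` on Linux, `md5 -q` on macOS).
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d ' ' -f 1
  else
    md5 -q "$1"
  fi
}

# Succeeds only if the file's MD5 equals the expected value, for example
# the `md5` recorded for that output in `dvc.lock`.
matches_hash() {
  [ "$(file_md5 "$1")" = "$2" ]
}
```

For instance, `matches_hash model.pkl 98af33933679a75c2a51b953d3ab50aa` should
succeed after the `dvc checkout` above, and fail before it.
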
+
+## Example: Specific files or directories
+
+`dvc checkout` only affects the tracked data corresponding to any given
+`targets`:
+
+```dvc
+$ git checkout master
+$ dvc checkout # Start with latest version of everything.
+
+$ git checkout baseline-experiment -- dvc.lock
+$ dvc checkout model.pkl # Get previous model file only.
+```
+
+Note that you can check out data within tracked directories. For example, the
+`featurize` stage has the entire `data/features` directory as output, but we can
+just get this:
+
+```dvc
+$ dvc checkout data/features/test.pkl
+```
## Example: Automating DVC checkout
@@ -235,5 +246,5 @@ MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
```
Previously this took two commands, `git checkout` followed by `dvc checkout`. We
-can now skip the second one, which is automatically run for us. The workspace is
-automatically synchronized accordingly.
+can now skip the second one, which is automatically run for us. The workspace
+files are automatically updated accordingly.
diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md
index af3abf8509..c9ba9a05ab 100644
--- a/content/docs/command-reference/fetch.md
+++ b/content/docs/command-reference/fetch.md
@@ -11,19 +11,28 @@ usage: dvc fetch [-h] [-q | -v] [-j ] [-r ] [-a] [-T]
[targets [targets ...]]
positional arguments:
- targets Limit command scope to these stages or .dvc files.
- Using -R, directories to search for stages or .dvc
- files can also be given.
+ targets Limit command scope to these tracked files/directories,
+ .dvc files, or stage names.
```
## Description
-The `dvc fetch` downloads DVC-tracked files from remote storage into the cache
-of the project, but without placing them in the workspace. This
-makes the data files available for linking (or copying) into the workspace.
-(Refer to [dvc config cache.type](/doc/command-reference/config#cache).) Along
-with `dvc checkout`, it's performed automatically by `dvc pull` when the target
-`dvc.yaml` or `.dvc` files are not already in the cache:
+Downloads DVC-tracked files from remote storage into the cache of the project
+(without placing them in the workspace, unlike `dvc pull`).
+This makes them available for linking (or copying) into the workspace (refer to
+[`dvc config cache.type`](/doc/command-reference/config#cache)).
+
+Without arguments, `dvc fetch` ensures that the files specified in all
+`dvc.lock` and `.dvc` files in the workspace exist in the cache. The
+`--all-branches`, `--all-tags`, and `--all-commits` options enable fetching data
+for multiple Git commits.
+
+The `targets` given to this command (if any) limit what to fetch. It accepts
+paths to tracked files or directories (including paths inside tracked
+directories), `.dvc` files, or stage names (found in `dvc.yaml`).
+
+Fetching is performed automatically by `dvc pull` (when the data is not already
+in the cache), along with `dvc checkout`:
```
Controlled files Commands
@@ -42,32 +51,19 @@ project's cache ++ | dvc pull |
workspace
```
-Fetching could be useful when first checking out a DVC project,
-since files tracked by DVC should already exist in remote storage, but won't be
-in the project's cache. (Refer to `dvc remote` for more information
-on DVC remotes.) These necessary data or model files are listed as
-dependencies or outputs in a target
-[stage](/doc/command-reference/run) (in `dvc.yaml`) or `.dvc` file, so they are
-required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the
-corresponding [pipeline](/doc/command-reference/dag).
-
-`dvc fetch` ensures that the files needed for a stage or `.dvc` file to be
-[reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in
-cache. If no `targets` are specified, the set of data files to fetch is
-determined by analyzing all `dvc.yaml` and `.dvc` files in the current branch,
-unless `--all-branches` or `--all-tags` is specified.
-
-The default remote is used (see `dvc config core.remote`) unless the `--remote`
-option is used.
-
-`dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands
-perform data synchronization among local and remote storage. The specific way in
-which the set of files to push/fetch/pull is determined begins with calculating
-file hashes when these are [added](/doc/command-reference/add) with DVC. File
-hash values are stored in the corresponding `dvc.yaml` or `.dvc` files
-(typically versioned with Git). Only the hash specified in `dvc.yaml` or `.dvc`
-files currently in the workspace are considered by `dvc fetch` (unless the `-a`
-or `-T` options are used).
+Here are some scenarios in which `dvc fetch` is useful, instead of pulling:
+
+- After checking out a fresh copy of a DVC repository, to get
+  DVC-tracked data from multiple project branches or tags onto your machine.
+- To use comparison commands across different Git commits, for example
+ `dvc metrics show` with its `--all-branches` option.
+- If you want to avoid [linking](/doc/user-guide/large-dataset-optimization)
+ files from the cache, or keep the workspace clean for any other
+ reason.
+
+The default remote is used (see
+[`dvc config core.remote`](/doc/command-reference/config#core)) unless the
+`--remote` option is used.
## Options
@@ -119,8 +115,8 @@ or `-T` options are used).
Let's employ a simple workspace with some data, code, ML models,
pipeline stages, such as the DVC project created for the
-[Get Started](/doc/tutorials/get-started). Then we can see what happens with
-`dvc fetch` as we switch from tag to tag.
+[Get Started](/doc/tutorials/get-started). Then we can see what `dvc fetch` does
+in different scenarios.
@@ -135,30 +131,21 @@ $ cd example-get-started
+The workspace looks like this:
+
```dvc
.
├── data
│ └── data.xml.dvc
-├── evaluate.dvc
-├── featurize.dvc
-├── prepare.dvc
-├── train.dvc
+├── dvc.lock
+├── dvc.yaml
+├── params.yaml
+├── prc.json
+├── scores.json
└── src
└──
```
-We have these tags in the repository that represent different iterations of
-solving the problem:
-
-```dvc
-$ git tag
-
-baseline-experiment <- first simple version of the model
-bigrams-experiment <- use bigrams to improve the model
-```
-
-## Example: Default behavior
-
This project comes with a predefined HTTP
[remote storage](/doc/command-reference/remote). We can now just run `dvc fetch`
to download the most recent `model.pkl`, `data.xml`, and other DVC-tracked files
@@ -167,24 +154,22 @@ into our local cache.
```dvc
$ dvc status --cloud
...
- deleted: model.pkl
- deleted: data/features/...
+ deleted: data/features/train.pkl
+ deleted: model.pkl
$ dvc fetch
+
+$ tree .dvc/cache
+.dvc/cache
+├── 38
+│ └── 63d0e317dee0a55c4e59d2ec0eef33
+├── 42
+│ └── c7025fc0edeb174069280d17add2d4.dir
...
-$ tree .dvc
-.dvc
-├── cache
-│ ├── 38
-│ │ └── 63d0e317dee0a55c4e59d2ec0eef33
-│ ├── 42
-│ │ └── c7025fc0edeb174069280d17add2d4.dir
-│ ├── ...
-├── config
-├── ...
```
-> `dvc status --cloud` compares the cache contents vs. the default remote.
+> `dvc status --cloud` compares the cache contents against the default remote.
+> Refer to `dvc status`.
Note that the `.dvc/cache` directory was created and populated.
@@ -192,11 +177,10 @@ Note that the `.dvc/cache` directory was created and populated.
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
> for more info.
-Used without arguments (as above), `dvc fetch` downloads all assets needed by
-all `dvc.yaml` and `.dvc` files in the current branch, including for
-directories. The hash values `3863d0e317dee0a55c4e59d2ec0eef33` and
-`42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and
-`data/features/` directory, respectively.
+Used without arguments (as above), `dvc fetch` downloads all files and
+directories needed by all `dvc.yaml` and `.dvc` files in the current branch. For
+example, the hash values `3863d0e...` and `42c7025...` correspond to the
+`model.pkl` file and `data/features/` directory, respectively.
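
The relationship between a hash value and its cache entry can be sketched as
follows. This assumes the layout visible in the `tree .dvc/cache` output above
(first two characters of the hash as the directory name, the rest as the file
name, plus a `.dir` suffix for directories); `cache_path` is a hypothetical
helper, not part of DVC.

```bash
# Hypothetical helper (not part of DVC): map an MD5 hash to its cache path,
# assuming the two-character prefix layout seen in the tree output above.
cache_path() {
  hash="$1"
  suffix="${2:-}"   # pass ".dir" for directory entries
  prefix=$(printf '%s' "$hash" | cut -c1-2)
  rest=$(printf '%s' "$hash" | cut -c3-)
  printf '.dvc/cache/%s/%s%s\n' "$prefix" "$rest" "$suffix"
}

cache_path 3863d0e317dee0a55c4e59d2ec0eef33        # model.pkl
cache_path 42c7025fc0edeb174069280d17add2d4 .dir   # data/features/
```
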
Let's now link files from the cache to the workspace with:
@@ -204,34 +188,42 @@ Let's now link files from the cache to the workspace with:
$ dvc checkout
```
-## Example: Specific stages
+## Example: Specific files or directories
-> Please delete the `.dvc/cache` directory first (with `rm -Rf .dvc/cache`) to
-> follow this example if you tried the previous one (**Default behavior**).
+> If you tried the previous example, please delete the `.dvc/cache` directory
+> first (e.g. `rm -Rf .dvc/cache`) to follow this one.
-`dvc fetch` only downloads the data files of a specific stage when the
-corresponding `.dvc` file (command target) is specified:
+`dvc fetch` only downloads the tracked data corresponding to any given
+`targets`:
```dvc
-$ dvc fetch prepare.dvc
+$ dvc fetch prepare
$ tree .dvc/cache
.dvc/cache
-├── 42
-│ └── c7025fc0edeb174069280d17add2d4.dir
-├── 58
-│ └── 245acfdc65b519c44e37f7cce12931
-├── 68
-│ └── 36f797f3924fb46fcfd6b9f6aa6416.dir
-└── 9d
- └── 603888ec04a6e75a560df8678317fb
+├── 20
+│ └── b786b6e6f80e2b3fcf17827ad18597.dir
+├── 32
+│ └── b715ef0d71ff4c9e61f55b09c15e75
+└── 6f
+ └── 597d341ceb7d8fbbe88859a892ef81
```
-> Note that `prepare.dvc` is the first stage in our example's pipeline.
+Cache entries for the `data/prepared` directory (output of the
+`prepare` target), as well as the actual `test.tsv` and `train.tsv` files, were
+downloaded. Their hash values are shown above.
+
+Note that you can fetch data within tracked directories. For example, the
+`featurize` stage has the entire `data/features` directory as output, but we can
+just get this:
+
+```dvc
+$ dvc fetch data/features/test.pkl
+```
-Cache entries for the necessary directories, as well as the actual
-`data/prepared/test.tsv` and `data/prepared/train.tsv` files were downloaded.
-Their hash values are shown above.
+If you check `.dvc/cache` again, you'll see that a couple more files were
+downloaded: the cache entries for the `data/features` directory, and
+`data/features/test.pkl` itself.
## Example: With dependencies
diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md
index 8a4a9293e8..d09a375804 100644
--- a/content/docs/command-reference/get.md
+++ b/content/docs/command-reference/get.md
@@ -36,11 +36,11 @@ the data source. Both HTTP and SSH protocols are supported for online repos
to an "offline" repo (if it's a DVC repo without a default remote, instead of
downloading, DVC will try to copy the target data from its cache).
-The `path` argument is used to specify the location of the target to be
-downloaded within the source repository at `url`. `path` can specify any file or
-directory in the source repo, including those tracked by DVC, or by Git. Note
-that DVC-tracked targets should be found in a `dvc.yaml` or `.dvc` file of the
-project.
+The `path` argument is used to specify the location of the target to download
+within the source repository at `url`. `path` can specify any file or directory
+in the source repo, either tracked by DVC (including paths inside tracked
+directories), or by Git. Note that DVC-tracked targets should be found in a
+`dvc.yaml` or `.dvc` file of the project.
⚠️ The project should have a default
[DVC remote](/doc/command-reference/remote), containing the actual data for this
diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md
index 97f0a08b8f..c504be21c1 100644
--- a/content/docs/command-reference/import.md
+++ b/content/docs/command-reference/import.md
@@ -39,11 +39,11 @@ the data source. Both HTTP and SSH protocols are supported for online repos
to an "offline" repo (if it's a DVC repo without a default remote, instead of
downloading, DVC will try to copy the target data from its cache).
-The `path` argument is used to specify the location of the target to be
-downloaded within the source repository at `url`. `path` can specify any file or
-directory in the source repo, including those tracked by DVC, or by Git. Note
-that DVC-tracked targets should be found in a `dvc.yaml` or `.dvc` file of the
-project.
+The `path` argument is used to specify the location of the target to download
+within the source repository at `url`. `path` can specify any file or directory
+in the source repo, either tracked by DVC (including paths inside tracked
+directories), or by Git. Note that DVC-tracked targets should be found in a
+`dvc.yaml` or `.dvc` file of the project.
⚠️ The project should have a default
[DVC remote](/doc/command-reference/remote), containing the actual data for this
diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md
index 546bc0e2d9..13e83f50c9 100644
--- a/content/docs/command-reference/list.md
+++ b/content/docs/command-reference/list.md
@@ -41,10 +41,10 @@ data source. Both HTTP and SSH protocols are supported for online repos (e.g.
`[user@]server:project.git`). `url` can also be a local file system path to an
"offline" Git repo.
-The optional `path` argument is used to specify directory to list within the
-source repository at `url`. It's similar to providing a path to list to commands
-such as `ls` or `aws s3 ls`. And similar to the, `-R` option might be used to
-list files recursively.
+The optional `path` argument is used to specify a directory to list within the
+source repository at `url` (including paths inside tracked directories). It's
+similar to providing a path to list to commands such as `ls` or `aws s3 ls`, and
+similar to the former, the `-R` option might be used to list files recursively.
Please note that `dvc list` doesn't check whether the listed data (tracked by
DVC) actually exists in remote storage, so it's not guaranteed whether it can be
diff --git a/content/docs/command-reference/metrics/show.md b/content/docs/command-reference/metrics/show.md
index 422a8b7b39..35d4220281 100644
--- a/content/docs/command-reference/metrics/show.md
+++ b/content/docs/command-reference/metrics/show.md
@@ -23,7 +23,7 @@ Finds and prints all metrics in the project by examining all of its
> (`--metrics-no-cache`) options of `dvc run`.
If `targets` are provided, it will show those specific metric files instead.
-With the `-a` or`-T` options, this command shows the different metrics values
+With the `-a` or `-T` options, this command shows the different metrics values
across all Git branches or tags, respectively. With the `-R` option, some of the
target can even be directories, so that DVC recursively shows all metric files
inside.
diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md
index 19182db6f2..cc4588b83b 100644
--- a/content/docs/command-reference/pull.md
+++ b/content/docs/command-reference/pull.md
@@ -12,9 +12,8 @@ usage: dvc pull [-h] [-q | -v] [-j ] [-r ] [-a] [-T]
[targets [targets ...]]
positional arguments:
- targets Limit command scope to these stages or .dvc files.
- Using -R, directories to search for stages or .dvc files
- can also be given.
+ targets Limit command scope to these tracked files/directories,
+ .dvc files, or stage names.
```
## Description
@@ -29,31 +28,29 @@ are the most common use cases for these commands.
The `dvc pull` command allows one to retrieve data from remote storage.
`dvc pull` has the same effect as running `dvc fetch` and `dvc checkout`
-immediately after that.
+immediately after.
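
As a rough sketch of that equivalence, the two steps can be wrapped in a shell
function. `pull_in_two_steps` is a hypothetical name used only for
illustration; it forwards the same targets to both commands and does not cover
options that only one of them accepts.

```bash
# Hypothetical wrapper (not part of DVC): approximate `dvc pull` by running
# its two steps explicitly, forwarding any targets to both commands.
pull_in_two_steps() {
  dvc fetch "$@" && dvc checkout "$@"
}
```

Calling `pull_in_two_steps model.pkl` would then fetch the file into the cache
and, only if that succeeds, place it in the workspace.
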
The default remote is used (see `dvc config core.remote`) unless the `--remote`
option is used. See `dvc remote` for more information on how to configure a
remote.
With no arguments, just `dvc pull` or `dvc pull --remote `, it downloads
-only the files (or directories) missing from the workspace by searching all
-stages in `dvc.yaml` or `.dvc` files currently in the project. It
-will not download files associated with earlier commits in the
-repository (if using Git), nor will it download files that have not
-changed.
+only the files (or directories) missing from the workspace by checking all
+`.dvc` files and stages (in `dvc.yaml` and `dvc.lock`) currently in the
+project. It will not download files associated with earlier commits
+in the repository (if using Git), nor will it download files that
+have not changed.
The command `dvc status -c` can list files referenced in current stages (in
`dvc.yaml`) or `.dvc` files, but missing from the cache. It can be
used to see what files `dvc pull` would download.
-If one or more `targets` are specified, DVC only considers the files associated
-with those stages or `.dvc` files. Using the `--with-deps` option, DVC tracks
-dependencies backward from the target [stage files](/doc/command-reference/run),
-through the corresponding [pipelines](/doc/command-reference/dag), to find data
-files to pull.
+The `targets` given to this command (if any) limit what to pull. It accepts
+paths to tracked files or directories (including paths inside tracked
+directories), `.dvc` files, or stage names (found in `dvc.yaml`).
-After a data file is in cache, `dvc pull` can use OS-specific mechanisms like
-reflinks or hardlinks to put it in the workspace without copying. See
+After the data is in the cache, `dvc pull` uses OS-specific mechanisms like
+reflinks or hardlinks to put it in the workspace, trying to avoid copying. See
`dvc checkout` for more details.
## Options
diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md
index e9168bb81b..6e48a9cf46 100644
--- a/content/docs/command-reference/push.md
+++ b/content/docs/command-reference/push.md
@@ -11,9 +11,8 @@ usage: dvc push [-h] [-q | -v] [-j ] [-r ] [-a] [-T]
[targets [targets ...]]
positional arguments:
- targets Limit command scope to these stages or .dvc files.
- Using -R, directories to search for stages or .dvc files
- can also be given.
+ targets Limit command scope to these tracked files/directories,
+ .dvc files, or stage names.
```
## Description
@@ -38,9 +37,9 @@ with `git commit` and `git push`).
Under the hood a few actions are taken:
-- The push command by default uses all `dvc.yaml` and `.dvc` files in the
- workspace. The command options listed below will either limit or
- expand the set of stages (in dvc.yaml) or `.dvc` files to consult.
+- The push command by default uses all stages (in `dvc.yaml` and `dvc.lock`) and
+ `.dvc` files in the workspace. The command options will either
+ limit or expand the set of stages or `.dvc` files to consult.
- For each output referenced in every selected stage or `.dvc`
file, DVC finds a corresponding file or directory in the cache.
@@ -64,10 +63,9 @@ The `dvc status -c` command can list files tracked by DVC that are new in the
cache (compared to the default remote.) It can be used to see what files
`dvc push` would upload.
-If one or more `targets` are specified, DVC only considers the files associated
-with them. Using the `--with-deps` option, DVC tracks dependencies backward from
-the target [stage files](/doc/command-reference/run), through the corresponding
-[pipelines](/doc/command-reference/dag), to find data files to push.
+The `targets` given to this command (if any) limit what to push. It accepts
+paths to tracked files or directories (including paths inside tracked
+directories), `.dvc` files, or stage names (found in `dvc.yaml`).
## Options
diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md
index 941260e405..969ba884aa 100644
--- a/content/docs/command-reference/status.md
+++ b/content/docs/command-reference/status.md
@@ -13,82 +13,68 @@ usage: dvc status [-h] [-v] [-j ] [-q] [-c] [-r ] [-a] [-T]
[targets [targets ...]]
positional arguments:
- targets Limit command scope to these stages or .dvc files.
- Using -R, directories to search for stages or .dvc
- files can also be given.
+ targets Limit command scope to these tracked files/directories,
+ .dvc files, or stage names.
```
## Description
-`dvc status` searches for changes in the existing pipelines, either showing
-which [stages](/doc/command-reference/run) have changed in the workspace (not
-yet tracked by DVC) and must be added again (with `dvc add`) or reproduced (with
+`dvc status` searches for changes in the existing tracked data and pipelines,
+either showing which files or directories have changed in the
+workspace and must be added or reproduced again (with `dvc add` or
`dvc repro`); or differences between cache vs. remote storage
-(meaning `dvc push` or `dvc pull` should be run to synchronize them). The two
-modes, _local_ and _cloud_ are triggered by using the `--cloud` or `--remote`
-options:
-
-| Mode | Command option | Description |
-| ------ | -------------- | --------------------------------------------------------------------------------------------------------------------------- |
-| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. `.dvc/cache`) |
-| remote | `--remote` | Comparisons are made between the cache, and the given remote. Remote storage is defined using the `dvc remote` command. |
-| remote | `--cloud` | Comparisons are made between the cache, and the default remote, typically defined with `dvc remote --default`. |
-
-DVC determines which data and code files to compare by analyzing all stages (in
-`dvc.yaml` and `.dvc` files in the workspace (the `--all-branches`
-and `--all-tags` options compare multiple workspace versions).
-
-The comparison can be limited to certain stages (in `dvc.yaml`) or `.dvc` files
-only, by listing them as `targets`. (Changes are reported only against these.)
-When this is combined with the `--with-deps` option, a search is made for
-changes in other stages that affect each target.
-
-In the local mode, changes are detected through the hash value of every file
-listed in every stage (in `dvc.yaml` or `.dvc` files) in question against the
-corresponding file in the file system. The command output indicates the detected
-changes, if any. If no differences are detected, `dvc status` prints this
-message:
-
-```dvc
-$ dvc status
-Data and pipelines are up to date.
-```
-
-This indicates that no differences were detected, and therefore no stages would
-be executed by `dvc repro`.
-
-If instead, differences are detected, `dvc status` lists those changes. For each
-stage with differences, the changes in dependencies and/or
+(implying `dvc push` or `dvc pull` should be run to synchronize them). The
+_remote_ mode is triggered by using the `--cloud` or `--remote` options:
+
+| Mode | Option | Description |
+| ------ | ---------- | --------------------------------------------------------------------------------------------------------------------------- |
+| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. `.dvc/cache`) |
+| remote | `--remote` | Comparisons are made between the cache, and the given remote. Remote storage is defined using the `dvc remote` command. |
+| remote | `--cloud` | Comparisons are made between the cache, and the default remote (typically defined with `dvc remote default`). |
+
+Without arguments, this command checks all stages (defined in `dvc.yaml`) and
+`.dvc` files, and compares the hash values of their outputs (found
+in `dvc.lock` for stages) against the actual data files or directories in the
+workspace. The `--all-branches`, `--all-tags`, and `--all-commits` options
+enable checking data for multiple Git commits.
+
+The `targets` given to this command (if any) limit what to check. Paths to
+tracked files or directories (including paths inside tracked directories),
+`.dvc` files, or stage names (found in `dvc.yaml`) are accepted.
+
+If no differences are detected, `dvc status` prints
+`Data and pipelines are up to date.` Otherwise, the command output
+indicates the changes. For each stage with differences, the changes
+in dependencies and/or
outputs that differ are listed. For each item listed, either the
-file name or hash is shown, and additionally a status word is shown describing
-the changes (described below).
+file name or hash is shown, along with a _state description_, as detailed below:
-- _changed checksum_ means that the stage (in `dvc.yaml`) or `.dvc` file hash
- has changed (e.g. someone manually edited the file).
+- _changed checksum_ means that the `.dvc` file hash has changed (e.g. someone
+ manually edited it).
-- _always changed_ means that this is a `.dvc` file with no dependencies (an
- _orphan stage_ (see [`dvc add`](/doc/command-reference/add)) or that the stage
- in `dvc.yaml` has the `always_changed: true` value set (see `--always-changed`
- option in `dvc run`).
+- _always changed_ means that this is a `.dvc` file with no dependencies (see
+ `dvc add`) or that the stage in `dvc.yaml` has the `always_changed: true`
+ value set (see `--always-changed` option in `dvc run`).
- _changed deps_ or _changed outs_ means that there are changes in dependencies
- or outputs tracked by the stage in `dvc.yaml` or `.dvc` file. Depending on the
- use case, commands like `dvc commit`, `dvc repro`, or `dvc run` can be used to
- update the file. Possible states are:
+ or outputs tracked by the stage or `.dvc` file. Depending on the use case,
+ commands like `dvc commit`, `dvc repro`, or `dvc run` can be used to update
+ the file. Possible states are:
- - _new_: An output is found in the workspace, but there is no
- corresponding file hash saved in the `dvc.lock` or `.dvc` file yet.
+ - _new_: An output is found in the workspace, but
+ there is no corresponding file hash saved in the `dvc.lock` or `.dvc` file
+ yet.
- _modified_: An output or dependency is found in the workspace,
but the corresponding file hash in the `dvc.lock` or `.dvc` file is not up
to date.
- - _deleted_: The output or dependency is referenced in a `dvc.yaml` or `.dvc`
+ - _deleted_: The output or dependency is referenced in a `dvc.lock` or `.dvc`
file, but does not exist in the workspace.
- _not in cache_: An output exists in the workspace, and the corresponding
file hash in the `dvc.lock` or `.dvc` file is up to date, but there is no
corresponding cache file or directory.
-- _update available_ means that import stages are outdated. The
- original file or directory has changed. The imported data can be moved to its
+- _update available_ means that an import stage is outdated (the
+ original data source has changed). The imported data can be brought to its
latest version by using `dvc update`.
**For comparison against remote storage:**
@@ -155,11 +141,10 @@ workspace) is different from remote storage. Bringing the two into sync requires
- `-v`, `--verbose` - displays detailed tracing information.
-## Example: Simple usage
+## Examples
```dvc
$ dvc status
-
bar.dvc:
changed deps:
modified: bar
@@ -179,24 +164,59 @@ This shows that for stage `bar.dvc`, the dependency `foo` and the
output `bar` have changed. Likewise for `foo.dvc`, the dependency
`foo` has changed, but no output has changed.
+## Example: Specific files or directories
+
+`dvc status` only checks the tracked data corresponding to any given `targets`:
+
+```dvc
+$ dvc status foo.dvc dobar
+foo.dvc
+ changed outs:
+ deleted: foo
+ changed checksum
+dobar
+ changed deps:
+ modified: bar
+ changed outs:
+ not in cache: foo
+```
+
+> In this case, the target `foo.dvc` is a `.dvc` file to track the `foo` file,
+> while `dobar` is the name of a stage defined in `dvc.yaml`.
+
+Note that you can check data within tracked directories, such as the `data/raw`
+directory (tracked with `data/raw.dvc`):
+
+```dvc
+$ tree data
+data
+├── raw
+│ ├── partition.1.dat
+│ ├── ...
+│ └── partition.n.dat
+└── raw.dvc
+
+$ dvc status data/raw/partition.1.dat
+new: data/raw
+```
+
## Example: Dependencies
```dvc
$ vi code/featurization.py
... edit the code
-$ dvc status model.p.dvc
+$ dvc status model.p
Data and pipelines are up to date.
-$ dvc status model.p.dvc --with-deps
-matrix-train.p.dvc
+$ dvc status model.p --with-deps
+matrix-train.p
changed deps:
modified: code/featurization.py
```
-If the `dvc status` command is limited to a target that had no changes, result
-shows no changes. By adding `--with-deps` the change will be found, so long as
-the change is in a preceding stage.
+The `dvc status` command may be limited to a target that had no changes, but by
+adding `--with-deps`, any change in a preceding stage will be found.
## Example: Remote comparisons
@@ -213,17 +233,16 @@ remote yet:
```dvc
$ dvc status --remote storage
-Preparing to collect status from s3://dvc-remote
- new: data/model.p
- new: data/eval.txt
- new: data/matrix-train.p
- new: data/matrix-test.p
+new: data/model.p
+new: data/eval.txt
+new: data/matrix-train.p
+new: data/matrix-test.p
```
The output shows where the location of the remote storage is, as well as any
differences between the cache and `storage` remote.
-## Example: Import stage
+## Example: Check imported data
Let's import a data file (`data.csv`) from a different DVC repository
into our current project using `dvc import`.