diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 1173c78b41..e2146257b1 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -56,13 +56,8 @@ each one: Summarizing, the result is that the target data is replaced by small `.dvc` files that can be easily tracked with Git. -> Note that `.dvc` files can be considered _orphan stages_, because they have no -> dependencies, only outputs. These are treated as _always changed_ -> by `dvc status` and `dvc repro`, which always executes them. See `dvc.yaml` to -> learn more about stages. - -To avoid adding files inside a directory accidentally, you can add the -corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file. +It's possible to prevent files or directories from being added by DVC by adding +the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file. By default, DVC tries to use reflinks (see [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) @@ -73,15 +68,14 @@ large files. DVC also supports other link types for use on file systems without ### Adding entire directories -A `dvc add` target can be an individual file or a directory. In the latter case, -a `.dvc` file is created for the top of the directory (with default name +A `dvc add` target can be either a file or a directory. In the latter case, a +`.dvc` file is created for the top of the hierarchy (with default name `.dvc`). -Every file in the hierarchy is added to the cache (unless the `--no-commit` -option is used), but DVC does not produce individual `.dvc` files for each file -in the directory tree. Instead, the single `.dvc` file references a special JSON -file in the cache (with `.dir` extension), that in turn points to the added -files. 
+Every file inside is added to the cache (unless the `--no-commit` option is +used), but DVC does not produce individual `.dvc` files for each file in the +entire tree. Instead, the single `.dvc` file references a special JSON file in +the cache (with `.dir` extension), that in turn points to the added files. > Refer to > [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) @@ -97,6 +91,9 @@ generated for each file in he same location. This may be helpful to save time adding several data files grouped in a structural directory, but it's undesirable for data directories with a large number of files. +To avoid adding files inside a directory accidentally, you can add the +corresponding [patterns](/doc/user-guide/dvcignore) to `.dvcignore`. + ## Options - `-R`, `--recursive` - determines the files to add by searching each target @@ -180,7 +177,8 @@ pics └── dogs [more image files] ``` -Tracking a directory with DVC as simple as with a single file: +[Tracking a directory](#tracking-directories) with DVC is as simple as with a +single file: ```dvc $ dvc add pics diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md index baafff3655..b90fdd9f7c 100644 --- a/content/docs/command-reference/checkout.md +++ b/content/docs/command-reference/checkout.md @@ -1,7 +1,7 @@ # checkout -Update data files and directories in the workspace based on current -DVC-files. +Update DVC-tracked files and directories in the workspace based on +current `dvc.lock` and `.dvc` files. ## Synopsis @@ -10,39 +10,39 @@ usage: dvc checkout [-h] [-q | -v] [--summary] [-d] [-R] [-f] [--relink] [targets [targets ...]] positional arguments: - targets Limit command scope to these stages or .dvc files. - Using -R, directories to search for stages or .dvc - files can also be given. + targets Limit command scope to these tracked files/directories, + .dvc files, or stage names.
``` ## Description -`.dvc` and `dvc.lock` [files](/doc/user-guide/dvc-files-and-directories) act as -pointers to specific version of data files or directories tracked by DVC. This -command synchronizes the workspace data with the versions specified in the -current `.dvc` and `dvc.lock` files. +This command is usually needed after `git checkout`, `git clone`, or any other +operation that changes the current `dvc.lock` or `.dvc` files. It restores the +corresponding versions of the DVC-tracked files and directories from the +cache to the workspace. -`dvc checkout` is useful, for example, when using Git in the -project, after `git clone`, `git checkout`, or any other operation -that changes the DVC files in the workspace. - -💡 For convenience, a Git hook is available to automate running `dvc checkout` -after `git checkout`. See the -[Automating example](#example-automating-dvc-checkout) below or `dvc install` -for more details. +The `targets` given to this command (if any) limit what to checkout. It accepts +paths to tracked files or directories (including paths inside tracked +directories), `.dvc` files, or stage names (found in `dvc.yaml`). The execution of `dvc checkout` does the following: -- Scans the `.dvc` and `dvc.lock` files to compare against the data files or - directories in the workspace. DVC knows which data - (outputs) match because the corresponding hash values are saved - in the `outs` fields in those files. Scanning is limited to the given - `targets` (if any). See also options `--with-deps` and `--recursive` below. +- Checks `dvc.lock` and `.dvc` files to compare the hash values of their + outputs against the actual files or directories in the + workspace (similar to `dvc status`). + + > Stage outputs should be defined in `dvc.yaml`. If found there but not in + > `dvc.lock`, they'll be skipped, with a warning. -- Missing data files or directories are restored from the cache. - Those that don't match with any DVC-file are removed. 
See options `--force` +- Missing data files or directories are restored from the cache. Those that + don't match with `dvc.lock` or `.dvc` files are removed. See options `--force` and `--relink`. A list of the changes done is printed. +💡 For convenience, a Git hook is available to automate running `dvc checkout` +after `git checkout`. See the +[Automating example](#example-automating-dvc-checkout) below or `dvc install` +for more details. + By default, this command tries not make copies of cached files in the workspace, using reflinks instead when supported by the file system (refer to [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)). @@ -64,9 +64,9 @@ such a case, `dvc checkout` prints a warning message. It also lists the partial progress made by the checkout. There are two methods to restore a file missing from the cache, depending on the -situation. In some cases a pipeline must be reproduced (using `dvc repro`) to -regenerate its outputs (see also `dvc dag`). In other cases the cache can be -pulled from remote storage using `dvc pull`. +situation. In some cases the cache can be pulled from +[remote storage](/doc/command-reference/remote) using `dvc pull`. In other cases +the pipeline must be reproduced (using `dvc repro`) to regenerate its outputs. ## Options @@ -130,9 +130,8 @@ below. The workspace looks like this: -````dvc +```dvc . 
-├── README.md ├── data │   └── data.xml.dvc ├── dvc.lock @@ -141,15 +140,11 @@ The workspace looks like this: ├── prc.json ├── scores.json └── src - ├── evaluate.py - ├── featurization.py - ├── prepare.py - ├── requirements.txt - └── train.py``` -```` + └── +``` -This repository includes the following tags, that represent different variants -of the resulting model: +Note that this repository includes the following tags, that represent different +variants of the resulting model: ```dvc $ git tag @@ -158,10 +153,9 @@ baseline-experiment <- First simple version of the model bigrams-experiment <- Uses bigrams to improve the model ``` -We can now just run `dvc checkout` that will update the most recent `model.pkl`, -`data.xml`, and other files that are tracked by DVC. The model file hash is -defined in the `dvc.lock` file, and in the `data.xml.dvc` file for the -`data.xml`: +We can now run `dvc checkout` to update the most recent `model.pkl`, `data.xml`, +and any other files tracked by DVC. The model file hash (`ab349c2...`) is saved +in `dvc.lock`, and it can be confirmed with: ```dvc $ dvc checkout @@ -170,13 +164,15 @@ $ md5 model.pkl MD5 (data.xml) = ab349c2b5fa2a0f66d6f33f94424aebe ``` +## Example: Switch versions + What if we want to "rewind history", so to speak? The `git checkout` command -lets us restore any point in the repository history, including any tags. It -automatically adjusts the files, by replacing file content and adding or -deleting files as necessary. +lets us restore any commit in the repository history (including tags). It +automatically adjusts the repo files, by replacing, adding, or deleting them as +necessary. 
```dvc -$ git checkout baseline-experiment # Stage where model is first created +$ git checkout baseline-experiment # Git commit where model was created ``` Let's check the hash value of `model.pkl` in `dvc.lock` now: @@ -187,16 +183,10 @@ outs: md5: 98af33933679a75c2a51b953d3ab50aa ``` -But if you check `model.pkl`, the file hash is still the same: - -```dvc -$ md5 model.pkl -MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe -``` - -This is because `git checkout` changed `dvc.lock` and other DVC files. But it -did nothing with the `model.pkl` and `matrix.pkl` files. Git doesn't track those -files; DVC does, so we must do this: +But if you check the MD5 of `model.pkl`, the file hash is still the same +(`ab349c2...`). This is because `git checkout` changed `dvc.lock` and other DVC +files, but it did nothing with `model.pkl`, or any other DVC-tracked files/dirs. +Since Git doesn't track them, we must do this: ```dvc $ dvc checkout @@ -207,8 +197,29 @@ $ md5 model.pkl MD5 (model.pkl) = 98af33933679a75c2a51b953d3ab50aa ``` -What happened is that DVC went through the DVC-files and adjusted the current -set of output files to match the `outs` in them. +DVC went through the stages (in `dvc.yaml`) and adjusted the current set of +outputs to match the `outs` in the corresponding `dvc.lock`. + +## Example: Specific files or directories + +`dvc checkout` only affects the tracked data corresponding to any given +`targets`: + +```dvc +$ git checkout master +$ dvc checkout # Start with latest version of everything. + +$ git checkout baseline-experiment -- dvc.lock +$ dvc checkout model.pkl # Get previous model file only. +``` + +Note that you can check out data inside tracked directories.
For example, the +`featurize` stage has the entire `data/features` directory as output, but we can +just get this: + +```dvc +$ dvc checkout data/features/test.pkl +``` ## Example: Automating DVC checkout @@ -235,5 +246,5 @@ MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe ``` Previously this took two commands, `git checkout` followed by `dvc checkout`. We -can now skip the second one, which is automatically run for us. The workspace is -automatically synchronized accordingly. +can now skip the second one, which is automatically run for us. The workspace +files are automatically updated accordingly. diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index af3abf8509..c9ba9a05ab 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -11,19 +11,28 @@ usage: dvc fetch [-h] [-q | -v] [-j ] [-r ] [-a] [-T] [targets [targets ...]] positional arguments: - targets Limit command scope to these stages or .dvc files. - Using -R, directories to search for stages or .dvc - files can also be given. + targets Limit command scope to these tracked files/directories, + .dvc files, or stage names. ``` ## Description -The `dvc fetch` downloads DVC-tracked files from remote storage into the cache -of the project, but without placing them in the workspace. This -makes the data files available for linking (or copying) into the workspace. -(Refer to [dvc config cache.type](/doc/command-reference/config#cache).) Along -with `dvc checkout`, it's performed automatically by `dvc pull` when the target -`dvc.yaml` or `.dvc` files are not already in the cache: +Downloads DVC-tracked files from remote storage into the cache of the project +(without placing them in the workspace, like `dvc pull` would). +This makes them available for linking (or copying) into the workspace (refer to +[`dvc config cache.type`](/doc/command-reference/config#cache)). 
+ +Without arguments, `dvc fetch` ensures that the files specified in all +`dvc.lock` and `.dvc` files in the workspace exist in the cache. The +`--all-branches`, `--all-tags`, and `--all-commits` options enable fetching data +for multiple Git commits. + +The `targets` given to this command (if any) limit what to fetch. It accepts +paths to tracked files or directories (including paths inside tracked +directories), `.dvc` files, or stage names (found in `dvc.yaml`). + +Fetching is performed automatically by `dvc pull` (when the data is not already +in the cache), along with `dvc checkout`: ``` Controlled files Commands @@ -42,32 +51,19 @@ project's cache ++ | dvc pull | workspace ``` -Fetching could be useful when first checking out a DVC project, -since files tracked by DVC should already exist in remote storage, but won't be -in the project's cache. (Refer to `dvc remote` for more information -on DVC remotes.) These necessary data or model files are listed as -dependencies or outputs in a target -[stage](/doc/command-reference/run) (in `dvc.yaml`) or `.dvc` file, so they are -required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the -corresponding [pipeline](/doc/command-reference/dag). - -`dvc fetch` ensures that the files needed for a stage or `.dvc` file to be -[reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in -cache. If no `targets` are specified, the set of data files to fetch is -determined by analyzing all `dvc.yaml` and `.dvc` files in the current branch, -unless `--all-branches` or `--all-tags` is specified. - -The default remote is used (see `dvc config core.remote`) unless the `--remote` -option is used. - -`dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands -perform data synchronization among local and remote storage. 
The specific way in -which the set of files to push/fetch/pull is determined begins with calculating -file hashes when these are [added](/doc/command-reference/add) with DVC. File -hash values are stored in the corresponding `dvc.yaml` or `.dvc` files -(typically versioned with Git). Only the hash specified in `dvc.yaml` or `.dvc` -files currently in the workspace are considered by `dvc fetch` (unless the `-a` -or `-T` options are used). +Here are some scenarios in which `dvc fetch` is useful, instead of pulling: + +- After checking out a fresh copy of a DVC repository, to get + DVC-tracked data from multiple project branches or tags into your machine. +- To use comparison commands across different Git commits, for example + `dvc metrics show` with its `--all-branches` option. +- If you want to avoid [linking](/doc/user-guide/large-dataset-optimization) + files from the cache, or keep the workspace clean for any other + reason. + +The default remote is used (see +[`dvc config core.remote`](/doc/command-reference/config#core)) unless the +`--remote` option is used. ## Options @@ -119,8 +115,8 @@ or `-T` options are used). Let's employ a simple workspace with some data, code, ML models, pipeline stages, such as the DVC project created for the -[Get Started](/doc/tutorials/get-started). Then we can see what happens with -`dvc fetch` as we switch from tag to tag. +[Get Started](/doc/tutorials/get-started). Then we can see what `dvc fetch` does +in different scenarios.
@@ -135,30 +131,21 @@ $ cd example-get-started
+The workspace looks like this: + ```dvc . ├── data │   └── data.xml.dvc -├── evaluate.dvc -├── featurize.dvc -├── prepare.dvc -├── train.dvc +├── dvc.lock +├── dvc.yaml +├── params.yaml +├── prc.json +├── scores.json └── src └── ``` -We have these tags in the repository that represent different iterations of -solving the problem: - -```dvc -$ git tag - -baseline-experiment <- first simple version of the model -bigrams-experiment <- use bigrams to improve the model -``` - -## Example: Default behavior - This project comes with a predefined HTTP [remote storage](/doc/command-reference/remote). We can now just run `dvc fetch` to download the most recent `model.pkl`, `data.xml`, and other DVC-tracked files @@ -167,24 +154,22 @@ into our local cache. ```dvc $ dvc status --cloud ... - deleted: model.pkl - deleted: data/features/... + deleted: data/features/train.pkl + deleted: model.pkl $ dvc fetch + +$ tree .dvc/cache +.dvc/cache +├── 38 +│   └── 63d0e317dee0a55c4e59d2ec0eef33 +├── 42 +│   └── c7025fc0edeb174069280d17add2d4.dir ... -$ tree .dvc -.dvc -├── cache -│   ├── 38 -│   │   └── 63d0e317dee0a55c4e59d2ec0eef33 -│   ├── 42 -│   │   └── c7025fc0edeb174069280d17add2d4.dir -│   ├── ... -├── config -├── ... ``` -> `dvc status --cloud` compares the cache contents vs. the default remote. +> `dvc status --cloud` compares the cache contents against the default remote. +> Refer to `dvc status`. Note that the `.dvc/cache` directory was created and populated. @@ -192,11 +177,10 @@ Note that the `.dvc/cache` directory was created and populated. > [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) > for more info. -Used without arguments (as above), `dvc fetch` downloads all assets needed by -all `dvc.yaml` and `.dvc` files in the current branch, including for -directories. 
The hash values `3863d0e317dee0a55c4e59d2ec0eef33` and -`42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and -`data/features/` directory, respectively. +Used without arguments (as above), `dvc fetch` downloads all files and +directories needed by all `dvc.yaml` and `.dvc` files in the current branch. For +example, the hash values `3863d0e...` and `42c7025...` correspond to the +`model.pkl` file and `data/features/` directory, respectively. Let's now link files from the cache to the workspace with: @@ -204,34 +188,42 @@ Let's now link files from the cache to the workspace with: $ dvc checkout ``` -## Example: Specific stages +## Example: Specific files or directories -> Please delete the `.dvc/cache` directory first (with `rm -Rf .dvc/cache`) to -> follow this example if you tried the previous one (**Default behavior**). +> If you tried the previous example, please delete the `.dvc/cache` directory +> first (e.g. `rm -Rf .dvc/cache`) to follow this one. -`dvc fetch` only downloads the data files of a specific stage when the -corresponding `.dvc` file (command target) is specified: +`dvc fetch` only downloads the tracked data corresponding to any given +`targets`: ```dvc -$ dvc fetch prepare.dvc +$ dvc fetch prepare $ tree .dvc/cache .dvc/cache -├── 42 -│   └── c7025fc0edeb174069280d17add2d4.dir -├── 58 -│   └── 245acfdc65b519c44e37f7cce12931 -├── 68 -│   └── 36f797f3924fb46fcfd6b9f6aa6416.dir -└── 9d - └── 603888ec04a6e75a560df8678317fb +├── 20 +│ └── b786b6e6f80e2b3fcf17827ad18597.dir +├── 32 +│ └── b715ef0d71ff4c9e61f55b09c15e75 +└── 6f + └── 597d341ceb7d8fbbe88859a892ef81 ``` -> Note that `prepare.dvc` is the first stage in our example's pipeline. +Cache entries for the `data/prepared` directory (output of the +`prepare` target), as well as the actual `test.tsv` and `train.tsv` files, were +downloaded. Their hash values are shown above. + +Note that you can fetch data within directories tracked. 
For example, the +`featurize` stage has the entire `data/features` directory as output, but we can +just get this: + +```dvc +$ dvc fetch data/features/test.pkl +``` -Cache entries for the necessary directories, as well as the actual -`data/prepared/test.tsv` and `data/prepared/train.tsv` files were downloaded. -Their hash values are shown above. +If you check `.dvc/cache` again, you'll see a couple more files were downloaded: +the cache entries for the `data/features` directory, and +`data/features/test.pkl` itself. ## Example: With dependencies diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 8a4a9293e8..d09a375804 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -36,11 +36,11 @@ the data source. Both HTTP and SSH protocols are supported for online repos to an "offline" repo (if it's a DVC repo without a default remote, instead of downloading, DVC will try to copy the target data from its cache). -The `path` argument is used to specify the location of the target to be -downloaded within the source repository at `url`. `path` can specify any file or -directory in the source repo, including those tracked by DVC, or by Git. Note -that DVC-tracked targets should be found in a `dvc.yaml` or `.dvc` file of the -project. +The `path` argument is used to specify the location of the target to download +within the source repository at `url`. `path` can specify any file or directory +in the source repo, either tracked by DVC (including paths inside tracked +directories), or by Git. Note that DVC-tracked targets should be found in a +`dvc.yaml` or `.dvc` file of the project.
⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 97f0a08b8f..c504be21c1 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -39,11 +39,11 @@ the data source. Both HTTP and SSH protocols are supported for online repos to an "offline" repo (if it's a DVC repo without a default remote, instead of downloading, DVC will try to copy the target data from its cache). -The `path` argument is used to specify the location of the target to be -downloaded within the source repository at `url`. `path` can specify any file or -directory in the source repo, including those tracked by DVC, or by Git. Note -that DVC-tracked targets should be found in a `dvc.yaml` or `.dvc` file of the -project. +The `path` argument is used to specify the location of the target to download +within the source repository at `url`. `path` can specify any file or directory +in the source repo, either tracked by DVC (including paths inside tracked +directories), or by Git. Note that DVC-tracked targets should be found in a +`dvc.yaml` or `.dvc` file of the project. ⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index 546bc0e2d9..13e83f50c9 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -41,10 +41,10 @@ data source. Both HTTP and SSH protocols are supported for online repos (e.g. `[user@]server:project.git`). `url` can also be a local file system path to an "offline" Git repo. -The optional `path` argument is used to specify directory to list within the -source repository at `url`. It's similar to providing a path to list to commands -such as `ls` or `aws s3 ls`. 
And similar to the, `-R` option might be used to -list files recursively. +The optional `path` argument is used to specify a directory to list within the +source repository at `url` (including paths inside tracked directories). It's +similar to providing a path to list to commands such as `ls` or `aws s3 ls`, and +similar to the former, the `-R` option might be used to list files recursively. Please note that `dvc list` doesn't check whether the listed data (tracked by DVC) actually exists in remote storage, so it's not guaranteed whether it can be diff --git a/content/docs/command-reference/metrics/show.md b/content/docs/command-reference/metrics/show.md index 422a8b7b39..35d4220281 100644 --- a/content/docs/command-reference/metrics/show.md +++ b/content/docs/command-reference/metrics/show.md @@ -23,7 +23,7 @@ Finds and prints all metrics in the project by examining all of its > (`--metrics-no-cache`) options of `dvc run`. If `targets` are provided, it will show those specific metric files instead. -With the `-a` or`-T` options, this command shows the different metrics values +With the `-a` or `-T` options, this command shows the different metrics values across all Git branches or tags, respectively. With the `-R` option, some of the target can even be directories, so that DVC recursively shows all metric files inside. diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 19182db6f2..cc4588b83b 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -12,9 +12,8 @@ usage: dvc pull [-h] [-q | -v] [-j ] [-r ] [-a] [-T] [targets [targets ...]] positional arguments: - targets Limit command scope to these stages or .dvc files. - Using -R, directories to search for stages or .dvc files - can also be given. + targets Limit command scope to these tracked files/directories, + .dvc files, or stage names. 
``` ## Description @@ -29,31 +28,29 @@ are the most common use cases for these commands. The `dvc pull` command allows one to retrieve data from remote storage. `dvc pull` has the same effect as running `dvc fetch` and `dvc checkout` -immediately after that. +immediately after. The default remote is used (see `dvc config core.remote`) unless the `--remote` option is used. See `dvc remote` for more information on how to configure a remote. With no arguments, just `dvc pull` or `dvc pull --remote `, it downloads -only the files (or directories) missing from the workspace by searching all -stages in `dvc.yaml` or `.dvc` files currently in the project. It -will not download files associated with earlier commits in the -repository (if using Git), nor will it download files that have not -changed. +only the files (or directories) missing from the workspace by checking all +`.dvc` files and stages (in `dvc.yaml` and `dvc.lock`) currently in the +project. It will not download files associated with earlier commits +in the repository (if using Git), nor will it download files that +have not changed. The command `dvc status -c` can list files referenced in current stages (in `dvc.yaml`) or `.dvc` files, but missing from the cache. It can be used to see what files `dvc pull` would download. -If one or more `targets` are specified, DVC only considers the files associated -with those stages or `.dvc` files. Using the `--with-deps` option, DVC tracks -dependencies backward from the target [stage files](/doc/command-reference/run), -through the corresponding [pipelines](/doc/command-reference/dag), to find data -files to pull. +The `targets` given to this command (if any) limit what to pull. It accepts +paths to tracked files or directories (including paths inside tracked +directories), `.dvc` files, or stage names (found in `dvc.yaml`). -After a data file is in cache, `dvc pull` can use OS-specific mechanisms like -reflinks or hardlinks to put it in the workspace without copying. 
See +After the data is in the cache, `dvc pull` uses OS-specific mechanisms like +reflinks or hardlinks to put it in the workspace, trying to avoid copying. See `dvc checkout` for more details. ## Options diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index e9168bb81b..6e48a9cf46 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -11,9 +11,8 @@ usage: dvc push [-h] [-q | -v] [-j ] [-r ] [-a] [-T] [targets [targets ...]] positional arguments: - targets Limit command scope to these stages or .dvc files. - Using -R, directories to search for stages or .dvc files - can also be given. + targets Limit command scope to these tracked files/directories, + .dvc files, or stage names. ``` ## Description @@ -38,9 +37,9 @@ with `git commit` and `git push`). Under the hood a few actions are taken: -- The push command by default uses all `dvc.yaml` and `.dvc` files in the - workspace. The command options listed below will either limit or - expand the set of stages (in dvc.yaml) or `.dvc` files to consult. +- The push command by default uses all stages (in `dvc.yaml` and `dvc.lock`) and + `.dvc` files in the workspace. The command options will either + limit or expand the set of stages or `.dvc` files to consult. - For each output referenced in every selected stage or `.dvc` file, DVC finds a corresponding file or directory in the cache. @@ -64,10 +63,9 @@ The `dvc status -c` command can list files tracked by DVC that are new in the cache (compared to the default remote.) It can be used to see what files `dvc push` would upload. -If one or more `targets` are specified, DVC only considers the files associated -with them. Using the `--with-deps` option, DVC tracks dependencies backward from -the target [stage files](/doc/command-reference/run), through the corresponding -[pipelines](/doc/command-reference/dag), to find data files to push. 
+The `targets` given to this command (if any) limit what to push. It accepts +paths to tracked files or directories (including paths inside tracked +directories), `.dvc` files, or stage names (found in `dvc.yaml`). ## Options diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 941260e405..969ba884aa 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -13,82 +13,68 @@ usage: dvc status [-h] [-v] [-j ] [-q] [-c] [-r ] [-a] [-T] [targets [targets ...]] positional arguments: - targets Limit command scope to these stages or .dvc files. - Using -R, directories to search for stages or .dvc - files can also be given. + targets Limit command scope to these tracked files/directories, + .dvc files, or stage names. ``` ## Description -`dvc status` searches for changes in the existing pipelines, either showing -which [stages](/doc/command-reference/run) have changed in the workspace (not -yet tracked by DVC) and must be added again (with `dvc add`) or reproduced (with +`dvc status` searches for changes in the existing tracked data and pipelines, +either showing which files or directories have changed in the +workspace and must be added or reproduced again (with `dvc add` or `dvc repro`); or differences between cache vs. remote storage -(meaning `dvc push` or `dvc pull` should be run to synchronize them). The two -modes, _local_ and _cloud_ are triggered by using the `--cloud` or `--remote` -options: - -| Mode | Command option | Description | -| ------ | -------------- | --------------------------------------------------------------------------------------------------------------------------- | -| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. `.dvc/cache`) | -| remote | `--remote` | Comparisons are made between the cache, and the given remote. 
Remote storage is defined using the `dvc remote` command. | -| remote | `--cloud` | Comparisons are made between the cache, and the default remote, typically defined with `dvc remote --default`. | - -DVC determines which data and code files to compare by analyzing all stages (in -`dvc.yaml` and `.dvc` files in the workspace (the `--all-branches` -and `--all-tags` options compare multiple workspace versions). - -The comparison can be limited to certain stages (in `dvc.yaml`) or `.dvc` files -only, by listing them as `targets`. (Changes are reported only against these.) -When this is combined with the `--with-deps` option, a search is made for -changes in other stages that affect each target. - -In the local mode, changes are detected through the hash value of every file -listed in every stage (in `dvc.yaml` or `.dvc` files) in question against the -corresponding file in the file system. The command output indicates the detected -changes, if any. If no differences are detected, `dvc status` prints this -message: - -```dvc -$ dvc status -Data and pipelines are up to date. -``` - -This indicates that no differences were detected, and therefore no stages would -be executed by `dvc repro`. - -If instead, differences are detected, `dvc status` lists those changes. For each -stage with differences, the changes in dependencies and/or +(implying `dvc push` or `dvc pull` should be run to synchronize them). The +_remote_ mode is triggered by using the `--cloud` or `--remote` options: + +| Mode | Option | Description | +| ------ | ---------- | --------------------------------------------------------------------------------------------------------------------------- | +| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. `.dvc/cache`) | +| remote | `--remote` | Comparisons are made between the cache, and the given remote. Remote storage is defined using the `dvc remote` command. 
| +| remote | `--cloud` | Comparisons are made between the cache, and the default remote (typically defined with `dvc remote --default`). | + +Without arguments, this command checks all stages (defined in `dvc.yaml`) and +`.dvc` files, and compares the hash values of their outputs (found +in `dvc.lock` for stages) against the actual data files or directories in the +workspace. The `--all-branches`, `--all-tags`, and `--all-commits` options +enable checking data for multiple Git commits. + +The `targets` given to this command (if any) limit what to check. Paths to +tracked files or directories (including paths inside tracked directories), +`.dvc` files, or stage names (found in `dvc.yaml`) are accepted. + +If no differences are detected, `dvc status` prints +`Data and pipelines are up to date.` If differences are detected by +`dvc status`, the command output indicates the changes. For each stage with +differences, the changes in dependencies and/or outputs that differ are listed. For each item listed, either the -file name or hash is shown, and additionally a status word is shown describing -the changes (described below). +file name or hash is shown, along with a _state description_, as detailed below: -- _changed checksum_ means that the stage (in `dvc.yaml`) or `.dvc` file hash - has changed (e.g. someone manually edited the file). +- _changed checksum_ means that the `.dvc` file hash has changed (e.g. someone + manually edited it). -- _always changed_ means that this is a `.dvc` file with no dependencies (an - _orphan stage_ (see [`dvc add`](/doc/command-reference/add)) or that the stage - in `dvc.yaml` has the `always_changed: true` value set (see `--always-changed` - option in `dvc run`). +- _always changed_ means that this is a `.dvc` file with no dependencies (see + `dvc add`) or that the stage in `dvc.yaml` has the `always_changed: true` + value set (see `--always-changed` option in `dvc run`). 
- _changed deps_ or _changed outs_ means that there are changes in dependencies - or outputs tracked by the stage in `dvc.yaml` or `.dvc` file. Depending on the - use case, commands like `dvc commit`, `dvc repro`, or `dvc run` can be used to - update the file. Possible states are: + or outputs tracked by the stage or `.dvc` file. Depending on the use case, + commands like `dvc commit`, `dvc repro`, or `dvc run` can be used to update + the file. Possible states are: - - _new_: An output is found in the workspace, but there is no - corresponding file hash saved in the `dvc.lock` or `.dvc` file yet. + - _new_: An output is found in the workspace, but + there is no corresponding file hash saved in the `dvc.lock` or `.dvc` file + yet. - _modified_: An output or dependency is found in the workspace, but the corresponding file hash in the `dvc.lock` or `.dvc` file is not up to date. - - _deleted_: The output or dependency is referenced in a `dvc.yaml` or `.dvc` + - _deleted_: The output or dependency is referenced in a `dvc.lock` or `.dvc` file, but does not exist in the workspace. - _not in cache_: An output exists in the workspace, and the corresponding file hash in the `dvc.lock` or `.dvc` file is up to date, but there is no corresponding cache file or directory. -- _update available_ means that import stages are outdated. The - original file or directory has changed. The imported data can be moved to its +- _update available_ means that an import stage is outdated (the + original data source has changed). The imported data can be brought to its latest version by using `dvc update`. **For comparison against remote storage:** @@ -155,11 +141,10 @@ workspace) is different from remote storage. Bringing the two into sync requires - `-v`, `--verbose` - displays detailed tracing information. 
-## Example: Simple usage +## Examples ```dvc $ dvc status - bar.dvc: changed deps: modified: bar @@ -179,24 +164,59 @@ This shows that for stage `bar.dvc`, the dependency `foo` and the output `bar` have changed. Likewise for `foo.dvc`, the dependency `foo` has changed, but no output has changed. +## Example: Specific files or directories + +`dvc status` only checks the tracked data corresponding to any given `targets`: + +```dvc +$ dvc status foo.dvc dobar +foo.dvc + changed outs: + deleted: foo + changed checksum +dobar + changed deps: + modified: bar + changed outs: + not in cache: foo +``` + +> In this case, the target `foo.dvc` is a `.dvc` file to track the `foo` file, +> while `dobar` is the name of a stage defined in `dvc.yaml`. + +Note that you can check data within directories tracked, such as the `data/raw` +directory (tracked with `data/raw.dvc`): + +```dvc +$ tree data +data +├── raw +│ ├── partition.1.dat +│ ├── ... +│ └── partition.n.dat +└── raw.dvc + +$ dvc status data/raw/partition.1.dat +new: data/raw +``` + ## Example: Dependencies ```dvc $ vi code/featurization.py ... edit the code -$ dvc status model.p.dvc +$ dvc status model.p Data and pipelines are up to date. -$ dvc status model.p.dvc --with-deps -matrix-train.p.dvc +$ dvc status model.p --with-deps +matrix-train.p changed deps: modified: code/featurization.py ``` -If the `dvc status` command is limited to a target that had no changes, result -shows no changes. By adding `--with-deps` the change will be found, so long as -the change is in a preceding stage. +The `dvc status` command may be limited to a target that had no changes, but by +adding `--with-deps`, any change in a preceding stage will be found.
## Example: Remote comparisons @@ -213,17 +233,16 @@ remote yet: ```dvc $ dvc status --remote storage -Preparing to collect status from s3://dvc-remote - new: data/model.p - new: data/eval.txt - new: data/matrix-train.p - new: data/matrix-test.p +new: data/model.p +new: data/eval.txt +new: data/matrix-train.p +new: data/matrix-test.p ``` The output shows where the location of the remote storage is, as well as any differences between the cache and `storage` remote. -## Example: Import stage +## Example: Check imported data Let's import a data file (`data.csv`) from a different DVC repository into our current project using `dvc import`.