From 64182e2d370086690aee46fecfa0d3201f169914 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 11 Jun 2020 17:43:34 -0500
Subject: [PATCH] cmd ref: dvc add 1.0 update (#1411)

* cmd ref: add note that move creates dirs
* cmd ref: improve structure of add ref desc.
* grammar: add some commas
* term: checksum -> hash value in dvcignore guide
* style: lower case bullet text
* cmd ref: remove some redundancy in metrics index
* cmd ref: update plots refs synopsis and descriptions per iterative/dvc/issues/3924 et al.
* Add plots modify cmd
* typo: CSV->csv
* term: working tree -> workspace per iterative/dvc/pull/3914
* cmd ref: couple improvements to add ref per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-422235749 and https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-422237494
* Update config/prismjs/dvc-commands.js
* cmd ref: update plots modify description
* cmd ref: add plots modify to nav, with a few more improvements
* cmd ref: plots --show-json -> --show-vega per https://github.com/iterative/dvc/pull/3891#issuecomment-638251223
* rename x-lab to x-label
* cmd ref: review descriptions of plots index, show, and diff
* cmd ref: review and update old plots cmds options per https://github.com/iterative/dvc/pull/3948 et al.
* cmd ref: a couple more option updates per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-424070145
* cmd ref: emphasize add works with any large file/dir per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-423970876
* cmd ref: update plots modify top half (definition, description) per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-423722291 et al.
* Update content/docs/command-reference/plots/modify.md
* cmd ref: review examples (mainly images) in plots modify per https://github.com/iterative/dvc.org/pull/1382#discussion_r434968322 et al.
* cmd ref: rephrase info about how data arrays are injected to plot templates per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-425713344
* cmd ref: update info on how targets for plots show/diff per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-425713399
* cmd ref: double check all plots examples per https://github.com/iterative/dvc.org/pull/1382#issuecomment-639989366
* cmd ref: remove info about plots show --select
* cmd ref: update add desc per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-425768295
* cmd ref: re-explain dvc add for dirs per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-425768492
* cmd ref: improve description about targets in plots diff per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-425768658
* cmd ref: make emoji note in plots index per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-425769215
* cmd ref: remove ineffective CSV code block highlighting from plots refs per https://github.com/iterative/dvc.org/pull/1382#pullrequestreview-425769562
* get started: improve intro in index
* glossary: remove external deps entry (no need)
* cmd ref: update add for 1.0 (1) up to...
before Examples
* cmd ref: 1.0 updates for add (2) - examples
* cmd ref: remove note about comments in add example per https://github.com/iterative/dvc.org/pull/1411#pullrequestreview-426727725

Co-authored-by: Dmitry Petrov
---
 content/docs/command-reference/add.md        | 128 +++++++++----------
 content/docs/command-reference/plots/diff.md |   3 +-
 content/docs/command-reference/status.md     |   6 +-
 content/docs/tutorials/pipelines.md          |   2 +-
 4 files changed, 66 insertions(+), 73 deletions(-)

diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index c9b321f0e2..bfaa51b274 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -6,11 +6,11 @@ Track data files or directories with DVC, by creating a corresponding
 
 ## Synopsis
 
 ```usage
-usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [-f ]
-               targets [targets ...]
+usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
+               [-f ] targets [targets ...]
 
 positional arguments:
-  targets        Input files/directories to add.
+  targets        Files or directories to add
 ```
 
 ## Description
@@ -36,29 +36,30 @@ Under the hood, a few actions are taken for each file (or directory) in
 
 1. Calculate the file hash.
 2. Move the file contents to the cache (by default in `.dvc/cache`), using the
-   file hash to form the cached file names. (See
+   file hash to form the cached file path. (See
    [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
    for more details.)
-3. Attempt to replace the file with a link to the cached data (more details
-   further down).
-4. Create a corresponding `.dvc` file to store the file (as an
-   output), using its path and hash to identify the cached data.
-   Unless the `-f` option is used, the `.dvc` file name generated by default is
-   `.dvc`, where `` is the file name of the first target.
-5. Unless `dvc init --no-scm` was used when initializing the project, add the
-   `targets` to `.gitignore` in order to prevent them from being committed to
-   the Git repository.
+3. Attempt to replace the file with a link to the cached data (more details on
+   file linking further down).
+4. Create a corresponding [`.dvc` file](/doc/user-guide/dvc-file-format) to
+   track the file, using its path and hash to identify the cached data. The
+   `.dvc` file lists the DVC-tracked file as an output (`outs`
+   field). Unless the `-f` option is used, the `.dvc` file name generated by
+   default is `.dvc`, where `` is the file name of the first target.
+5. Add the `targets` to `.gitignore` in order to prevent them from being
+   committed to the Git repository (unless `dvc init --no-scm` was used when
+   initializing the DVC project).
 6. Instructions are printed showing `git` commands for adding the files, if
    appropriate.
 
 Summarizing, the result is that the target data is replaced by small `.dvc`
-files that can be tracked with Git. See
+files that can easily be tracked with Git. See
 [DVC-File Format](/doc/user-guide/dvc-file-format) for more details.
 
-> Note that `.dvc` files created by this command are considered _orphan stage
-> files_ because they have no _dependencies_, only outputs. These are always
-> treated as _changed_ by `dvc repro`, which always executes them. See `dvc run`
-> to learn more about stage files.
+> Note that `.dvc` files can be considered _orphan stages_, because they have no
+> dependencies, only outputs. These are treated as _always changed_
+> by `dvc status` and `dvc repro`, which always executes them. See
+> [`dvc.yaml`](/doc/user-guide/dvc-file-format) to learn more about stages.
 
 To avoid adding files inside a directory accidentally, you can add the
 corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file.
@@ -111,6 +112,9 @@ undesirable for data directories with a large number of files.
   file name of the given target. This option allows to set the name and the path
   of the generated `.dvc` file.
 
+- `--external` - allow `targets` that are outside of the DVC repository. See
+  [Managing External Data](/doc/user-guide/managing-external-data).
+
 - `-h`, `--help` - prints the usage/help message, and exit.
 
 - `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
@@ -124,15 +128,14 @@ Track a file with DVC:
 
 ```dvc
 $ dvc add data.xml
+...
 
-Saving information to 'data.xml.dvc'.
-
-To track the changes with git run:
+To track the changes with git, run:
 
-  git add .gitignore data.xml.dvc
+    git add .gitignore data.xml.dvc
 ```
 
-As shown above, a [`.dvc` file](/doc/user-guide/dvc-file-format) has been
+As indicated above, a [`.dvc` file](/doc/user-guide/dvc-file-format) has been
 created for `data.xml`. Let's explore the result:
 
 ```dvc
@@ -145,32 +148,21 @@ $ tree
 Let's check the `data.xml.dvc` file inside:
 
 ```yaml
-md5: aae37d74224b05178153acd94e15956b
 outs:
-  - cache: true
-    md5: d8acabbfd4ee51c95da5d7628c7ef74b
-    metric: false
+  - md5: 6137cde4893c59f76f005a8123d8e8e6
     path: data.xml
-meta: # Special field to contain arbitary user data
-  name: John
-  email: john@xyz.com
 ```
 
-This is a standard `.dvc` file with only one output (in the `outs` field). The
-hash value should correspond to a file path in the cache.
-
-> Note that the `meta` values above were entered manually for this example. Meta
-> values and `#` comments are not preserved when a `.dvc` file is overwritten
-> with the `dvc add`, `dvc run`, `dvc import`, or `dvc import-url` commands.
+This is a standard `.dvc` file with only one output (`outs` field). The hash
+value (`md5` field) corresponds to a file path in the cache.
 
 ```dvc
 $ file .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b
-
-.dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b: ASCII text
+.dvc/cache/61/37cde4893c59f76f005a8123d8e8e6: ASCII text
 ```
 
-Note that tracking compressed files (e.g. ZIP or TAR archives) is not
-recommended, as `dvc add` supports tracking directories. (Details below.)
+⚠️ Note that tracking compressed files (e.g. ZIP or TAR archives) is not
+recommended, as `dvc add` supports tracking directories (details below).
 
 ## Example: Directory
 
@@ -193,28 +185,17 @@ Tracking a directory with DVC as simple as with a single file:
 
 ```dvc
 $ dvc add pics
-Computing md5 for a large number of files. This is only done once.
-...
-Linking directory 'pics'.
-
-Saving information to 'pics.dvc'.
-...
 ```
 
 There are no [`.dvc` files](/doc/user-guide/dvc-file-format) generated within
-this directory structure, but the images are all added to the
-cache. DVC prints a message mentioning that MD5 hash values are
-computed for each file. A single `pics.dvc` file is generated for the top-level
+this directory structure to match each images, but the image files are all
+cached. A single `pics.dvc` file is generated for the top-level
 directory, and it contains:
 
 ```yaml
-md5: df06d8d51e6483ed5a74d3979f8fe42e
 outs:
-  - cache: true
-    md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
-    metric: false
+  - md5: ce57450aa92ab8f2b957c24b0df73edc.dir
     path: pics
-wdir: .
 ```
 
 > Refer to
@@ -222,34 +203,46 @@ wdir: .
 > for more info.
 
 This allows us to treat the entire directory structure as a single data
-artifact. This lets you pass the whole directory tree as a
+artifact. For example, you can pass the whole directory tree as a
 dependency to a `dvc run` stage definition:
 
 ```dvc
-$ dvc run -f train.dvc \
+$ dvc run -n train \
           -d train.py -d pics \
           -M metrics.json -o model.h5 \
           python train.py
 ```
 
-> To follow the full example, see the [Versioning](/doc/tutorials/versioning)
-> tutorial.
+> To try this example, see the [Versioning](/doc/tutorials/versioning) tutorial.
 
 If instead we use the `--recursive` (`-R`) option, the output looks like this:
 
 ```dvc
 $ dvc add -R pics
-Saving information to 'pics/cat1.jpg.dvc'.
-Saving information to 'pics/cat3.jpg.dvc'.
-Saving information to 'pics/cat2.jpg.dvc'.
-Saving information to 'pics/cat4.jpg.dvc'.
-...
 ```
 
 In this case, a `.dvc` file is generated for each file in the `pics/` directory
-tree. No top-level `.dvc` file is generated, which is typically less convenient.
-For example, we cannot use the directory structure as one unit with `dvc run` or
-other commands.
+tree:
+
+```dvc
+$ tree pics
+pics
+├── train
+|   ├── cats
+|   |   ├── img1.jpg
+|   |   ├── img1.jpg.dvc
+|   |   ├── img2.jpg
+|   |   ├── img2.jpg.dvc
+|   |   ├── ...
+|   └── dogs
+|       ├── img1.jpg
+|       ├── img1.jpg.dvc
+|       ...
+```
+
+Note that no top-level `.dvc` file is generated, which is typically less
+convenient. For example, we cannot use the directory structure as one unit with
+`dvc run` or other commands.
 
 ## Example: Dvcignore
 
@@ -290,6 +283,7 @@ $ tree .dvc/cache
     └── 4bcc8502a70ac49bf441db350eafc2
 ```
 
-Only the hash values of directory (`dir/`) and `file2` have been cached.
+Only the hash values of the `dir/` directory (with `.dir` file extension) and
+`file2` have been cached.
 
 See [Dvcignore](/doc/user-guide/dvcignore) for more details.
diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md
index eda3d28d04..8125363141 100644
--- a/content/docs/command-reference/plots/diff.md
+++ b/content/docs/command-reference/plots/diff.md
@@ -102,8 +102,7 @@ file:///Users/dmitry/src/plots/logs.html
 
 > Note that we renamed the X axis label with option `--x-label x`.
 
-Compare two specific versions (commit hashes, tags, or branches can be provided,
-for example):
+Compare two specific versions (commit hashes, tags, or branches):
 
 ```dvc
 $ dvc plots diff --targets logs.csv HEAD 0135527
diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md
index 9c1a270adb..a336c70559 100644
--- a/content/docs/command-reference/status.md
+++ b/content/docs/command-reference/status.md
@@ -65,9 +65,9 @@ describing the changes (described below).
   someone manually edited the file).
 
 - _always changed_ means that this is a DVC-file with no dependencies (an
-  _orphan_ stage file) or that it has the `always_changed: true` value set (see
-  `--always-changed` option in `dvc run`), so its considered always changed, and
-  thus is always executed by `dvc repro`.
+  _orphan stage_ (see `dvc add`) or that it has the `always_changed: true` value
+  set (see `--always-changed` option in `dvc run`), so its considered always
+  changed, and thus is always executed by `dvc repro`.
 
 - _changed deps_ or _changed outs_ means that there are changes in dependencies
   or outputs tracked by the DVC-file. Depending on the use case,
diff --git a/content/docs/tutorials/pipelines.md b/content/docs/tutorials/pipelines.md
index 8594dc3af2..a305fabe9f 100644
--- a/content/docs/tutorials/pipelines.md
+++ b/content/docs/tutorials/pipelines.md
@@ -110,7 +110,7 @@ hidden from the user. This directory is automatically staged with `git add`, so
 it can be easily committed with Git.
 
 Note that the DVC-file created by `dvc add` has no dependencies, a.k.a. an
-_orphan_ [stage file](/doc/command-reference/run):
+_orphan stage_ (see `dvc add`):
 
 ```yaml
 md5: c183f094869ef359e87e68d2264b6cdd