From 44b9b17c6e98e162e7ed96a7a05da0e4078351d9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 28 May 2020 02:15:44 -0500 Subject: [PATCH 01/36] get-started: update index for DVC 1.0 --- content/docs/tutorials/get-started/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/content/docs/tutorials/get-started/index.md b/content/docs/tutorials/get-started/index.md index 1d176c6392..7da757e1ab 100644 --- a/content/docs/tutorials/get-started/index.md +++ b/content/docs/tutorials/get-started/index.md @@ -34,6 +34,7 @@ $ git status Changes to be committed: new file: .dvc/.gitignore new file: .dvc/config + ... $ git commit -m "Initialize DVC repository" ``` From cd800fa4e5d7c3d83fc62fdd874e617acc280188 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 28 May 2020 03:35:51 -0500 Subject: [PATCH 02/36] user-guide: DVC File guide update (1) --- content/docs/sidebar.json | 4 +- ...file-format.md => dvc-metafile-formats.md} | 56 +++++++++++++++---- 2 files changed, 47 insertions(+), 13 deletions(-) rename content/docs/user-guide/{dvc-file-format.md => dvc-metafile-formats.md} (67%) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 8bb03b1a9d..9a19e3712f 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -104,8 +104,8 @@ "slug": "dvc-files-and-directories" }, { - "label": "File Format (.dvc)", - "slug": "dvc-file-format" + "label": "DVC Metafile Formats", + "slug": "dvc-metafile-formats" }, { "slug": "dvcignore", diff --git a/content/docs/user-guide/dvc-file-format.md b/content/docs/user-guide/dvc-metafile-formats.md similarity index 67% rename from content/docs/user-guide/dvc-file-format.md rename to content/docs/user-guide/dvc-metafile-formats.md index a4b13ffe03..305c8c9f41 100644 --- a/content/docs/user-guide/dvc-file-format.md +++ b/content/docs/user-guide/dvc-metafile-formats.md @@ -1,16 +1,52 @@ -# DVC-File Format +# DVC Metafile Formats -When you add a file (with `dvc add`) or a command (with `dvc run`) to 
a -[pipeline](/doc/command-reference/pipeline), DVC creates a special text metafile -with the `.dvc` file extension (e.g. `process.dvc`), or with the default name -`Dvcfile`. These **DVC-files** (a.k.a. stage files) contain all the needed -information to track your data and reproduce pipeline stages. The file itself -contains a simple YAML format that could be easily written or altered manually. +There are two special metafiles created by DVC commands: + +- Files ending with the `.dvc` extension are basic placeholders to version data + files and directories. A project can have multiple `.dvc` files. +- The `dvc.yaml` file or _pipeline(s) file_ specifies stages that form the + pipeline(s) of a project, and their connections (_dependency graph_ or DAG). + +Both use human-friendly YAML schemas, described below. We encourage you to get +familiar with them so you may edit them freely, as needed. + +Both type of files should be versioned with Git (for Git-enabled +repositories). + +## .dvc files + +When you add a file or directory to a DVC project with `dvc add`, a +`.dvc` file is created based on the data file name (e.g. `data.txt.dvc`). These +files contain all the needed information to track your data with DVC. They use a +simple YAML format that can be easily written or altered manually. See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable the highlighting for your editor. -Here is a sample DVC-file: +Here is a sample `.dvc` file: + +```yaml +outs: + - md5: a304afb96060aad90176268345e10355 + path: data.xml +# Comments like this line persist through multiple executions of +# dvc repro/commit but not through dvc run/add/import-url/get-url commands. 
+``` + +On the top level, `.dvc` file consists of these possible fields: + +- `outs`: List of outputs for this stage + +An output entry consists of these fields: + +- `md5`: MD5 hash for the output +- `path`: Path to the output, relative to the `wdir` path +- `cache`: Whether or not DVC should cache the output + +## dvc.yaml file + +When you add commands to a pipeline with `dvc run`, the `dvc.yaml` file is +created or updated. Here's a simple example: ```yaml always_changed: true @@ -37,13 +73,11 @@ outs: # Comments like this line persist through multiple executions of # dvc repro/commit but not through dvc run/add/import-url/get-url commands. -meta: # Special field to contain arbitary user data +meta: # Special field to contain arbitrary user data name: John email: john@xyz.com ``` -## Structure - On the top level, `.dvc` file consists of these possible fields: - `cmd`: Executable command defined in this stage From 92b3bf3e632fde79d069729331b0b36a6f8cac80 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 28 May 2020 11:49:35 -0500 Subject: [PATCH 03/36] user-guide: change DVC Metafile Formats guide link URL --- content/docs/api-reference/get_url.md | 2 +- content/docs/command-reference/add.md | 22 +++++++++---------- content/docs/command-reference/checkout.md | 6 ++--- content/docs/command-reference/commit.md | 10 ++++----- content/docs/command-reference/fetch.md | 16 +++++++------- content/docs/command-reference/get.md | 9 ++++---- content/docs/command-reference/import-url.md | 12 +++++----- content/docs/command-reference/import.md | 17 +++++++------- content/docs/command-reference/init.md | 8 +++---- content/docs/command-reference/install.md | 16 +++++++------- content/docs/command-reference/lock.md | 2 +- content/docs/command-reference/metrics/add.md | 10 ++++----- .../docs/command-reference/metrics/diff.md | 2 +- .../docs/command-reference/metrics/modify.md | 9 ++++---- .../docs/command-reference/metrics/remove.md | 9 ++++---- 
.../docs/command-reference/metrics/show.md | 8 +++---- content/docs/command-reference/move.md | 6 ++--- .../docs/command-reference/params/index.md | 8 +++---- .../docs/command-reference/pipeline/show.md | 2 +- content/docs/command-reference/plots/diff.md | 2 +- content/docs/command-reference/pull.md | 4 ++-- content/docs/command-reference/push.md | 6 ++--- .../docs/command-reference/remote/modify.md | 3 ++- content/docs/command-reference/remove.md | 4 ++-- content/docs/command-reference/repro.md | 4 ++-- content/docs/command-reference/run.md | 8 +++---- content/docs/command-reference/status.md | 5 +++-- content/docs/command-reference/unlock.md | 2 +- content/docs/command-reference/update.md | 9 ++++---- content/docs/install/plugins.md | 2 +- .../docs/tutorials/deep/define-ml-pipeline.md | 18 +++++++-------- .../docs/tutorials/deep/reproducibility.md | 6 ++--- content/docs/tutorials/deep/sharing-data.md | 10 ++++----- .../tutorials/get-started/data-pipelines.md | 6 ++--- .../tutorials/get-started/data-versioning.md | 18 ++++++++------- content/docs/tutorials/pipelines.md | 2 +- content/docs/tutorials/versioning.md | 20 ++++++++--------- .../docs/understanding-dvc/how-it-works.md | 2 +- .../understanding-dvc/related-technologies.md | 11 +++++----- content/docs/understanding-dvc/what-is-dvc.md | 4 ++-- content/docs/use-cases/data-registries.md | 17 +++++++------- .../use-cases/sharing-data-and-model-files.md | 4 ++-- .../versioning-data-and-model-files.md | 4 ++-- .../user-guide/basic-concepts/dvc-project.md | 2 +- content/docs/user-guide/contributing/docs.md | 4 ++-- .../user-guide/dvc-files-and-directories.md | 4 ++-- .../docs/user-guide/managing-external-data.md | 8 +++---- 47 files changed, 187 insertions(+), 176 deletions(-) diff --git a/content/docs/api-reference/get_url.md b/content/docs/api-reference/get_url.md index 05d123d7f5..0d0d1befa7 100644 --- a/content/docs/api-reference/get_url.md +++ b/content/docs/api-reference/get_url.md @@ -30,7 +30,7 @@ specified 
by its `path` in a `repo` (DVC project), is stored. The URL is formed by reading the project's [remote configuration](/doc/command-reference/config#remote) and the -[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is found +[DVC-file](/doc/user-guide/dvc-metafile-formats) where the given `path` is found (`outs` field). The URL schema returned depends on the [type](/doc/command-reference/remote/add#supported-storage-types) of the `remote` used (see the [Parameters](#parameters) section). diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 49fc6fc36b..9a66c59ae7 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -1,7 +1,7 @@ # add Track data files or directories with DVC, by creating a corresponding -[DVC-file](/doc/user-guide/dvc-file-format). +[DVC-file](/doc/user-guide/dvc-metafile-formats). ## Synopsis @@ -17,7 +17,7 @@ positional arguments: The `dvc add` command is analogous to `git add`, in that it makes DVC aware of the target data, as a first step to version it. It creates a -[DVC-file](/doc/user-guide/dvc-file-format) to track the added data. +[DVC-file](/doc/user-guide/dvc-metafile-formats) to track the added data. The `targets` are files or directories to add with this command, that are turned into data artifacts of the project. By default, these @@ -48,8 +48,8 @@ Under the hood, a few actions are taken for each file (or directory) in appropriate. Summarizing, the result is that the target data is replaced small DVC-files can -be tracked with Git. See [DVC-File Format](/doc/user-guide/dvc-file-format) for -more details. +be tracked with Git. See [DVC-File Format](/doc/user-guide/dvc-metafile-formats) +for more details. > Note that DVC-files created by this command are considered _orphan stage > files_ because they have no _dependencies_, only outputs. 
These are always @@ -125,8 +125,8 @@ To track the changes with git run: git add .gitignore data.xml.dvc ``` -As shown above, a [DVC-file](/doc/user-guide/dvc-file-format) has been created -for `data.xml`. Let's explore the result: +As shown above, a [DVC-file](/doc/user-guide/dvc-metafile-formats) has been +created for `data.xml`. Let's explore the result: ```dvc $ tree @@ -194,11 +194,11 @@ Saving information to 'pics.dvc'. ... ``` -There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this -directory structure, but the images are all added to the cache. DVC -prints a message mentioning that MD5 hash values are computed for each file. A -single `pics.dvc` DVC-file is generated for the top-level directory, and it -contains: +There are no [DVC-files](/doc/user-guide/dvc-metafile-formats) generated within +this directory structure, but the images are all added to the +cache. DVC prints a message mentioning that MD5 hash values are +computed for each file. A single `pics.dvc` DVC-file is generated for the +top-level directory, and it contains: ```yaml md5: df06d8d51e6483ed5a74d3979f8fe42e diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md index e0be68bc46..5985de501f 100644 --- a/content/docs/command-reference/checkout.md +++ b/content/docs/command-reference/checkout.md @@ -16,9 +16,9 @@ positional arguments: ## Description -[DVC-files](/doc/user-guide/dvc-file-format) act as pointers to specific version -of data files or directories tracked by DVC. This command synchronizes the -workspace data with the versions specified in the current DVC-files. +[DVC-files](/doc/user-guide/dvc-metafile-formats) act as pointers to specific +version of data files or directories tracked by DVC. This command synchronizes +the workspace data with the versions specified in the current DVC-files. 
`dvc checkout` is useful, for example, when using Git in the project, after `git clone`, `git checkout`, or any other operation diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 6c1266a5b7..960f26dc9a 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -1,8 +1,8 @@ # commit Record changes to DVC-tracked files in the project, by updating -[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to -the cache. +[DVC-files](/doc/user-guide/dvc-metafile-formats) and saving outputs +to the cache. ## Synopsis @@ -66,8 +66,8 @@ cache. This is where the `dvc commit` command comes into play. It performs that last step (saving the data in cache). Note that it's best to avoid the last two scenarios. They essentially -force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to -cache. They are still useful, but keep in mind that DVC can't guarantee +force-update the [DVC-files](/doc/user-guide/dvc-metafile-formats) and save data +to cache. They are still useful, but keep in mind that DVC can't guarantee reproducibility in those cases. ## Options @@ -226,7 +226,7 @@ the new instance of `model.pkl` is there. It is also possible to execute the commands that are executed by `dvc repro` by hand. You won't have DVC helping you, but you have the freedom to run any command you like, even ones not defined in a -[DVC-file](/doc/user-guide/dvc-file-format). For example: +[DVC-file](/doc/user-guide/dvc-metafile-formats). For example: ```dvc $ python src/featurization.py data/prepared data/features diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 7e63436ad3..e4e8afc081 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -22,7 +22,7 @@ of the project, but without placing them in the workspace. 
This makes the data files available for linking (or copying) into the workspace. (Refer to [dvc config cache.type](/doc/command-reference/config#cache).) Along with `dvc checkout`, it's performed automatically by `dvc pull` when the target -[DVC-files](/doc/user-guide/dvc-file-format) are not already in the cache: +[DVC-files](/doc/user-guide/dvc-metafile-formats) are not already in the cache: ``` Controlled files Commands @@ -49,7 +49,7 @@ on DVC remotes.) These necessary data or model files are listed as [stage](/doc/command-reference/run)) so they are required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the corresponding [pipeline](/doc/command-reference/pipeline). (See -[DVC-File Format](/doc/user-guide/dvc-file-format) for more information on +[DVC-File Format](/doc/user-guide/dvc-metafile-formats) for more information on dependencies and outputs.) `dvc fetch` ensures that the files needed for a DVC-file to be @@ -276,12 +276,12 @@ $ tree .dvc/cache ``` Fetching using `--with-deps` starts with the target -[DVC-file](/doc/user-guide/dvc-file-format) (`train.dvc` stage) and searches -backwards through its pipeline for data to download into the project's cache. -All the data for the second and third stages ("featurize" and "train") has now -been downloaded to the cache. We could now use `dvc checkout` to get the data -files needed to reproduce this pipeline up to the third stage into the workspace -(with `dvc repro train.dvc`). +[DVC-file](/doc/user-guide/dvc-metafile-formats) (`train.dvc` stage) and +searches backwards through its pipeline for data to download into the project's +cache. All the data for the second and third stages ("featurize" and "train") +has now been downloaded to the cache. We could now use `dvc checkout` to get the +data files needed to reproduce this pipeline up to the third stage into the +workspace (with `dvc repro train.dvc`). 
> Note that in this example project, the last stage file `evaluate.dvc` doesn't > add any more data files than those form previous stages, so at this point all diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 9b636773da..eba12a36f3 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -40,7 +40,7 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. Note that DVC-tracked targets should be found in a -[DVC-file](/doc/user-guide/dvc-file-format) of the project. +[DVC-file](/doc/user-guide/dvc-metafile-formats) of the project. ⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this @@ -181,9 +181,10 @@ The `model.monograms.pkl` file now contains the older version of the model. To get the most recent one, we use a similar command, but with `-o model.bigrams.pkl` and `--rev bigrams-experiment` (or even without `--rev` since that tag has the latest model version anyway). In fact, in this case using -`dvc pull` with the corresponding [DVC-files](/doc/user-guide/dvc-file-format) -should suffice, downloading the file as just `model.pkl`. We can then rename it -to make its variant explicit: +`dvc pull` with the corresponding +[DVC-files](/doc/user-guide/dvc-metafile-formats) should suffice, downloading +the file as just `model.pkl`. 
We can then rename it to make its variant +explicit: ```dvc $ dvc pull train.dvc diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index ab5894f4d0..255007f125 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -41,8 +41,8 @@ while `out` can be used to specify the directory and/or file name desired for the downloaded data. If an existing directory is specified, the file or directory will be placed inside. -[DVC-files](/doc/user-guide/dvc-file-format) support references to data in an -external location, see +[DVC-files](/doc/user-guide/dvc-metafile-formats) support references to data in +an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such a DVC-file, the `deps` field stores the remote URL, and the `outs` field contains the corresponding local path in the workspace. It records enough @@ -102,8 +102,8 @@ $ dvc run -d https://example.com/path/to/data.csv \ wget https://example.com/path/to/data.csv -O data.csv ``` -Both methods generate a [DVC-files](/doc/user-guide/dvc-file-format) with an -external dependency, but the one created by `dvc import-url` preserves the +Both methods generate a [DVC-files](/doc/user-guide/dvc-metafile-formats) with +an external dependency, but the one created by `dvc import-url` preserves the connection to the data source. We call this an _import stage_. Note that import stages are considered always locked, meaning that if you run @@ -188,8 +188,8 @@ The `etag` field in the DVC-file contains the If the remote file changes, its ETag will be different. This metadata allows DVC to determine whether its necessary to download it again. -> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the -> text format above. +> See [DVC-File Format](/doc/user-guide/dvc-metafile-formats) for more details +> on the text format above. 
You may want to get out of and remove the `example-get-started/` directory after trying this example (especially if trying out the following one). diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 8b43cdaa86..4980c280c5 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -2,7 +2,7 @@ Download a file or directory tracked by DVC or by Git into the workspace. It also creates a -[DVC-file](/doc/user-guide/dvc-file-format) with information about the data +[DVC-file](/doc/user-guide/dvc-metafile-formats) with information about the data source, which can later be used to [update](/doc/command-reference/update) the import. @@ -44,7 +44,7 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. Note that DVC-tracked targets should be found in a -[DVC-file](/doc/user-guide/dvc-file-format) of the project. +[DVC-file](/doc/user-guide/dvc-metafile-formats) of the project. ⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this @@ -112,9 +112,10 @@ Importing 'data/data.xml (git@github.com:iterative/example-get-started)' In contrast with `dvc get`, this command doesn't just download the data file, but it also creates an import stage -([DVC-file](/doc/user-guide/dvc-file-format)) with a link to the data source (as -explained in the description above). (This import stage can later be used to -[update](/doc/command-reference/update) the import.) Check `data.xml.dvc`: +([DVC-file](/doc/user-guide/dvc-metafile-formats)) with a link to the data +source (as explained in the description above). (This import stage can later be +used to [update](/doc/command-reference/update) the import.) 
Check +`data.xml.dvc`: ```yaml md5: 7de90e7de7b432ad972095bc1f2ec0f8 @@ -152,8 +153,8 @@ Importing ``` When using this option, the import stage -([DVC-file](/doc/user-guide/dvc-file-format)) will also have a `rev` subfield -under `repo`: +([DVC-file](/doc/user-guide/dvc-metafile-formats)) will also have a `rev` +subfield under `repo`: ```yaml deps: @@ -184,7 +185,7 @@ If you take a look at our [dataset registry](https://github.com/iterative/dataset-registry) project, you'll see that it's organized into different directories such as `tutorial/ver` and `use-cases/`, and these contain -[DVC-files](/doc/user-guide/dvc-file-format) that track different datasets. +[DVC-files](/doc/user-guide/dvc-metafile-formats) that track different datasets. Given this simple structure, its data files can be easily shared among several other projects using `dvc get` and `dvc import`. For example: diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 1d9d95f6f2..3551d21691 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -56,7 +56,7 @@ sub-projects to mitigate the issues of initializing in the Git repository root: - Not enough isolation/granularity - commands like `dvc pull`, `dvc checkout`, and others analyze the whole repository to look for - [DVC-files](/doc/user-guide/dvc-file-format) to download files and + [DVC-files](/doc/user-guide/dvc-metafile-formats) to download files and directories, to reproduce pipelines, etc. It can be expensive in the large repositories with a lot of projects. @@ -126,9 +126,9 @@ include: - SCM other than Git is being used. Even though there are DVC features that require DVC to be run in the Git repo, DVC can work well with other version control systems. Since DVC relies on simple text - [DVC-files](/doc/user-guide/dvc-file-format) to manage pipelines, - data, etc, they can be added into any SCM thus providing large data files and - directories versioning. 
+ [DVC-files](/doc/user-guide/dvc-metafile-formats) to manage + pipelines, data, etc, they can be added into any SCM thus + providing large data files and directories versioning. - There is no need to keep the history at all, e.g. having a deployment automation like running a data pipeline using `cron`. diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index d554b12662..d8631d5695 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -22,10 +22,10 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present). Namely: **Checkout**: For any commit hash, branch or tag, `git checkout` retrieves the -[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The -project's DVC-files in turn refer to data stored in cache, but not -necessarily in the workspace. Normally, it would be necessary to -use `dvc checkout` to synchronize workspace and DVC-files. +[DVC-files](/doc/user-guide/dvc-metafile-formats) corresponding to that version. +The project's DVC-files in turn refer to data stored in cache, but +not necessarily in the workspace. Normally, it would be necessary +to use `dvc checkout` to synchronize workspace and DVC-files. This hook automates `dvc checkout` after `git checkout`. @@ -153,7 +153,7 @@ $ dvc pull --all-branches --all-tags ## Example: Checkout both Git and DVC Switching from one Git commit to another (with `git checkout`) may change the -set of [DVC-files](/doc/user-guide/dvc-file-format) in the +set of [DVC-files](/doc/user-guide/dvc-metafile-formats) in the workspace. This would mean that the currently present data files and directories no longer matches project's version (which can be fixed with `dvc checkout`). @@ -206,9 +206,9 @@ project's cache and the data files currently in the workspace. Git changed the DVC-files in the workspace, which changed references to data files. 
`dvc status` first informed us that the data files in the workspace no longer matched the hash values in the corresponding -[DVC-files](/doc/user-guide/dvc-file-format). Running `dvc checkout` then brings -them up to date, and a second `dvc status` tells us that the data files now do -match the DVC-files. +[DVC-files](/doc/user-guide/dvc-metafile-formats). Running `dvc checkout` then +brings them up to date, and a second `dvc status` tells us that the data files +now do match the DVC-files. ```dvc $ git checkout master diff --git a/content/docs/command-reference/lock.md b/content/docs/command-reference/lock.md index 3b26ea8815..89c18d99fb 100644 --- a/content/docs/command-reference/lock.md +++ b/content/docs/command-reference/lock.md @@ -1,6 +1,6 @@ # lock -Lock a [DVC-file](/doc/user-guide/dvc-file-format) +Lock a [DVC-file](/doc/user-guide/dvc-metafile-formats) ([stage](/doc/command-reference/run)). Use `dvc unlock` to unlock the file. ## Synopsis diff --git a/content/docs/command-reference/metrics/add.md b/content/docs/command-reference/metrics/add.md index 448ddbcb7f..3d22a87eb2 100644 --- a/content/docs/command-reference/metrics/add.md +++ b/content/docs/command-reference/metrics/add.md @@ -13,9 +13,9 @@ positional arguments: ## Description -Sets the `metric` field in the [DVC-file](/doc/user-guide/dvc-file-format) that -defines the given `path` as an output, marking `path` as a metric -file to track. +Sets the `metric` field in the [DVC-file](/doc/user-guide/dvc-metafile-formats) +that defines the given `path` as an output, marking `path` as a +metric file to track. Note that outputs can also be marked as metrics via the `-m` or `-M` options of `dvc run`. We recommend using `-M` option to keep metrics in Git history. @@ -65,8 +65,8 @@ $ dvc run -O metrics.json \ Even when we named this output file `metrics.json`, DVC won't know that it's a metric if we don't specify so. 
The content of stage file `metrics.json.dvc` (a -[DVC-file](/doc/user-guide/dvc-file-format)) should look like this: (Notice the -`metric: false` field.) +[DVC-file](/doc/user-guide/dvc-metafile-formats)) should look like this: (Notice +the `metric: false` field.) ```yaml md5: 906ea9489e432c85d085b248c712567b diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index 3f3e8cc6c5..e12cad2560 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -35,7 +35,7 @@ difference (delta) from the previous value of metrics (with 3-digit accuracy). They're calculated between two commits (hash, branch, tag, or any [Git revision](https://git-scm.com/docs/revisions)) for all metrics in the project, found by examining all of the -[DVC-files](/doc/user-guide/dvc-file-format) in both references. +[DVC-files](/doc/user-guide/dvc-metafile-formats) in both references. ## Options diff --git a/content/docs/command-reference/metrics/modify.md b/content/docs/command-reference/metrics/modify.md index 5ede1c74d2..8ab88379ac 100644 --- a/content/docs/command-reference/metrics/modify.md +++ b/content/docs/command-reference/metrics/modify.md @@ -14,10 +14,11 @@ positional arguments: ## Description -This command finds a corresponding [DVC-file](/doc/user-guide/dvc-file-format) -for the provided metric file (`path` is defined among the outputs -of the DVC-file), and updates the default formatting of the metric. See the -[options](#options) below and `dvc metrics show` for more info. +This command finds a corresponding +[DVC-file](/doc/user-guide/dvc-metafile-formats) for the provided metric file +(`path` is defined among the outputs of the DVC-file), and updates +the default formatting of the metric. See the [options](#options) below and +`dvc metrics show` for more info. 
If `path` isn't tracked by DVC (described in one of the workspace DVC-files), the following error will be raised: diff --git a/content/docs/command-reference/metrics/remove.md b/content/docs/command-reference/metrics/remove.md index a3f1503f79..998948465c 100644 --- a/content/docs/command-reference/metrics/remove.md +++ b/content/docs/command-reference/metrics/remove.md @@ -16,9 +16,10 @@ positional arguments: ## Description -This command finds a corresponding [DVC-file](/doc/user-guide/dvc-file-format) -for the provided metric file (`path` is defined among the outputs -of the DVC-file), and resets the `metric` field for the file. +This command finds a corresponding +[DVC-file](/doc/user-guide/dvc-metafile-formats) for the provided metric file +(`path` is defined among the outputs of the DVC-file), and resets +the `metric` field for the file. This does not remove or delete the file in question. It only unmarks it as a metric file. It also keeps the file as an output of the corresponding DVC-file. @@ -81,7 +82,7 @@ $ dvc metrics remove metrics.json ``` Let's check the outputs field (`outs`) of same -[DVC-file](/doc/user-guide/dvc-file-format) again: +[DVC-file](/doc/user-guide/dvc-metafile-formats) again: ```yaml outs: diff --git a/content/docs/command-reference/metrics/show.md b/content/docs/command-reference/metrics/show.md index 784fbaf178..80dae49da7 100644 --- a/content/docs/command-reference/metrics/show.md +++ b/content/docs/command-reference/metrics/show.md @@ -17,9 +17,9 @@ positional arguments: ## Description Finds and prints all metrics in the project by examining all of its -[DVC-files](/doc/user-guide/dvc-file-format). If `targets` are provided, it will -show those specific metric files instead. With the `-a` or`-T` options, this -command shows the different metrics values across all Git branches or tags, +[DVC-files](/doc/user-guide/dvc-metafile-formats). If `targets` are provided, it +will show those specific metric files instead. 
With the `-a` or`-T` options, +this command shows the different metrics values across all Git branches or tags, respectively. The optional `targets` argument can contain one or more metric files. With the @@ -28,7 +28,7 @@ shows all metric files inside. Providing a `type` (`-t` option) overrides the full metric specification (both `type` and `xpath` fields) defined in the -[DVC-file](/doc/user-guide/dvc-file-format) (with `dvc metrics modify`, +[DVC-file](/doc/user-guide/dvc-metafile-formats) (with `dvc metrics modify`, typically). If `type` (via `-t`) is not specified and only `xpath` (`-x` option) is, only diff --git a/content/docs/command-reference/move.md b/content/docs/command-reference/move.md index 74a72a9a8e..0eec010c5d 100644 --- a/content/docs/command-reference/move.md +++ b/content/docs/command-reference/move.md @@ -1,7 +1,7 @@ # move Rename a file or a directory and modify the corresponding -[DVC-file](/doc/user-guide/dvc-file-format) (see `dvc add`) to reflect the +[DVC-file](/doc/user-guide/dvc-metafile-formats) (see `dvc add`) to reflect the change. If the file or directory has the same name as the corresponding DVC-file, it also renames it. @@ -19,7 +19,7 @@ positional arguments: `dvc move` is useful when a `src` file or directory has previously been added to the project with `dvc add`, creating a -[DVC-file](/doc/user-guide/dvc-file-format) (with `src` as a dependency). +[DVC-file](/doc/user-guide/dvc-metafile-formats) (with `src` as a dependency). `dvc move` behaves like `mv src dst`, moving `src` to the given `dst` path, but it also renames and updates the corresponding DVC-file appropriately. @@ -107,7 +107,7 @@ $ tree We use `dvc add` to track a file with DVC, then we use `dvc move` to change its location. If target path already exists and is a directory, data file is moved with unchanged name into this folder. Note that the `data.csv.dvc` -[DVC-file](/doc/user-guide/dvc-file-format) is also moved. 
+[DVC-file](/doc/user-guide/dvc-metafile-formats) is also moved. ```dvc $ tree diff --git a/content/docs/command-reference/params/index.md b/content/docs/command-reference/params/index.md index b5c9a07f4c..a39dbd27a7 100644 --- a/content/docs/command-reference/params/index.md +++ b/content/docs/command-reference/params/index.md @@ -54,7 +54,7 @@ written, or generated, and these can be versioned directly with Git. You can then use `dvc run` with the `-p` (`--params`) option to specify parameter dependencies for your pipeline's stages (instead of or in addition to regular `-d` deps.) DVC saves the param names and values in the stage file (see -[DVC-file format](/doc/user-guide/dvc-file-format)). These values will be +[DVC-file format](/doc/user-guide/dvc-metafile-formats)). These values will be compared to the ones in the params files to determine if the stage is invalidated upon pipeline [reproduction](/doc/command-reference/repro). @@ -109,9 +109,9 @@ $ dvc run -d users.csv -o model.pkl \ ``` You can find that each parameter and it's value were saved in the -[DVC-file](/doc/user-guide/dvc-file-format). These values will be compared to -the ones in the parameters files whenever `dvc repro` is used, to determine if -dependency to the params file is invalidated: +[DVC-file](/doc/user-guide/dvc-metafile-formats). These values will be compared +to the ones in the parameters files whenever `dvc repro` is used, to determine +if dependency to the params file is invalidated: ```yaml md5: 05d178cfa0d1474b6c5800aa1e1b34ac diff --git a/content/docs/command-reference/pipeline/show.md b/content/docs/command-reference/pipeline/show.md index 8fa9e70fd1..336c5d9ef4 100644 --- a/content/docs/command-reference/pipeline/show.md +++ b/content/docs/command-reference/pipeline/show.md @@ -2,7 +2,7 @@ Show [stages](/doc/command-reference/run) in a pipeline that lead to the specified stage. By default it lists -[DVC-files](/doc/user-guide/dvc-file-format). 
+[DVC-files](/doc/user-guide/dvc-metafile-formats). ## Synopsis diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index af0ca1bb6e..e6ca8eb0eb 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -40,7 +40,7 @@ resulting plot shows all of them in a single output. This command can work with metric files that are committed to a repository history, data files controlled by DVC, or any other file in the workspace. In the case of DVC-tracked `datafile`, the `revisions` are used to find the -corresponding [DVC-files](/doc/user-guide/dvc-file-format). +corresponding [DVC-files](/doc/user-guide/dvc-metafile-formats). ## Options diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 195056d483..22621a3ed9 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -3,7 +3,7 @@ Download tracked files or directories from [remote storage](/doc/command-reference/remote) to the cache and workspace, based on the current -[DVC-files](/doc/user-guide/dvc-file-format). +[DVC-files](/doc/user-guide/dvc-metafile-formats). ## Synopsis @@ -37,7 +37,7 @@ remote. With no arguments, just `dvc pull` or `dvc pull --remote `, it downloads only the files (or directories) missing from the workspace by searching all -[DVC-files](/doc/user-guide/dvc-file-format) currently in the +[DVC-files](/doc/user-guide/dvc-metafile-formats) currently in the project. It will not download files associated with earlier commits in the repository (if using Git), nor will it download files that have not changed. 
diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 65f41494e4..e82ee547f0 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -38,9 +38,9 @@ save any changes in the code or DVC-files (those should be saved by using Under the hood a few actions are taken: - The push command by default uses all - [DVC-files](/doc/user-guide/dvc-file-format) in the workspace. - The command options listed below will either limit or expand the set of - DVC-files to consult. + [DVC-files](/doc/user-guide/dvc-metafile-formats) in the + workspace. The command options listed below will either limit or + expand the set of DVC-files to consult. - For each output referenced from each selected DVC-file, DVC finds a corresponding file or directory in the cache. DVC then checks diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index 4b8620545a..e086090f30 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -68,7 +68,8 @@ The following config options are available for all remote types: DVC will recalculate the file hashes upon download (e.g. `dvc pull`) to make sure that these haven't been modified, or corrupted during download. It may slow down the aforementioned commands. The calculated hash is compared to the - value saved in the corresponding [DVC-file](/doc/user-guide/dvc-file-format). + value saved in the corresponding + [DVC-file](/doc/user-guide/dvc-metafile-formats). > Note that this option is enabled on **Google Drive** remotes by default. 
diff --git a/content/docs/command-reference/remove.md b/content/docs/command-reference/remove.md index db63ae521f..66cb4448e7 100644 --- a/content/docs/command-reference/remove.md +++ b/content/docs/command-reference/remove.md @@ -15,8 +15,8 @@ positional arguments: This command safely removes data files or directories that are tracked by DVC from the workspace. It takes a -[DVC-File](/doc/user-guide/dvc-file-format) as input, removes all of its outputs -(`outs`), and optionally removes the DVC-file itself. +[DVC-File](/doc/user-guide/dvc-metafile-formats) as input, removes all of its +outputs (`outs`), and optionally removes the DVC-file itself. Note that it does not remove files from the DVC cache or remote storage (see `dvc gc`). However, remember to run `dvc push` to save the files you actually diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 3f68b585e3..2326963316 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -35,7 +35,7 @@ There's a few ways to restrict the stages that will be regenerated by this command: by specifying stage file `targets`, or by using the `--single-item`, `--cwd`, or other options. -If specific [DVC-files](/doc/user-guide/dvc-file-format) (`targets`) are +If specific [DVC-files](/doc/user-guide/dvc-metafile-formats) (`targets`) are omitted, `Dvcfile` will be assumed. `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data @@ -274,7 +274,7 @@ Data and pipelines are up to date. ``` The reason being that the `text.txt` file is a dependency in the target -[DVC-file](/doc/user-guide/dvc-file-format) (`Dvcfile` by default). This +[DVC-file](/doc/user-guide/dvc-metafile-formats) (`Dvcfile` by default). 
This `Dvcfile` stage is dependent on `filter.dvc`, which happens first in this pipeline (shown in the following figure): diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 01802fadb5..c34fee52f0 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -1,7 +1,7 @@ # run -Generate a stage file ([DVC-file](/doc/user-guide/dvc-file-format)) from a given -command and execute the command. +Generate a stage file ([DVC-file](/doc/user-guide/dvc-metafile-formats)) from a +given command and execute the command. ## Synopsis @@ -208,8 +208,8 @@ To track the changes with git, run: git add .gitignore metric.dvc ``` -> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the -> text format above. +> See [DVC-File Format](/doc/user-guide/dvc-metafile-formats) for more details +> on the text format above. Execute a Python script as a DVC [pipeline](/doc/command-reference/pipeline) stage. The stage file name is not specified, so a `model.p.dvc` DVC-file is diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 8f8a3d6d98..54b6f8fd78 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -34,8 +34,9 @@ options: | remote | `--cloud` | Comparisons are made between the cache, and the default remote, typically defined with `dvc remote --default`. | DVC determines which data and code files to compare by analyzing all -[DVC-files](/doc/user-guide/dvc-file-format) in the workspace (the -`--all-branches` and `--all-tags` options compare multiple workspace versions). +[DVC-files](/doc/user-guide/dvc-metafile-formats) in the workspace +(the `--all-branches` and `--all-tags` options compare multiple workspace +versions). The comparison can be limited to certain DVC-files only, by listing them as `targets`. (Changes are reported only against these.) 
When this is combined with diff --git a/content/docs/command-reference/unlock.md b/content/docs/command-reference/unlock.md index 85988e2e1c..fc6fc46a2c 100644 --- a/content/docs/command-reference/unlock.md +++ b/content/docs/command-reference/unlock.md @@ -1,6 +1,6 @@ # unlock -Unlock [DVC-file](/doc/user-guide/dvc-file-format) +Unlock [DVC-file](/doc/user-guide/dvc-metafile-formats) ([stage](/doc/command-reference/run)). See `dvc lock` for more information. ## Synopsis diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index d19bc7e657..bced952031 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -1,7 +1,8 @@ # update Update data artifacts imported from external DVC -projects, and corresponding [DVC-files](/doc/user-guide/dvc-file-format). +projects, and corresponding +[DVC-files](/doc/user-guide/dvc-metafile-formats). ## Synopsis @@ -16,7 +17,7 @@ positional arguments: ## Description After creating import stages -([DVC-files](/doc/user-guide/dvc-file-format)) with `dvc import` or +([DVC-files](/doc/user-guide/dvc-metafile-formats)) with `dvc import` or `dvc import-url`, the data source can change. Use `dvc update` to bring these imported file, directory, or data artifact up to date. @@ -83,8 +84,8 @@ This time nothing has changed, since the source project is rather stable. > Note that `dvc update` updates the `rev_lock` field of the corresponding -> [DVC-file](/doc/user-guide/dvc-file-format) (when there are changes to bring -> in). +> [DVC-file](/doc/user-guide/dvc-metafile-formats) (when there are changes to +> bring in). 
## Example: Updating fixed revisions to a different version diff --git a/content/docs/install/plugins.md b/content/docs/install/plugins.md index 841f9d0f8a..8d34589bec 100644 --- a/content/docs/install/plugins.md +++ b/content/docs/install/plugins.md @@ -1,7 +1,7 @@ # IDE Plugins and Syntax Highlighting When you add a file or a stage to your pipeline, DVC creates a special -[DVC-file](/doc/user-guide/dvc-file-format) that contains all the needed +[DVC-file](/doc/user-guide/dvc-metafile-formats) that contains all the needed information to track your data and transformations. The file itself is in a simple YAML format. diff --git a/content/docs/tutorials/deep/define-ml-pipeline.md b/content/docs/tutorials/deep/define-ml-pipeline.md index bb4fde619f..ac15ff5774 100644 --- a/content/docs/tutorials/deep/define-ml-pipeline.md +++ b/content/docs/tutorials/deep/define-ml-pipeline.md @@ -51,11 +51,11 @@ or move it, you can use `dvc move`. ## Data file internals -If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by -`dvc add`, you will see that outputs are tracked in the `outs` -field. In this file, only one output is specified. The output contains the data -file path in the repository and its MD5 hash. This hash value determines the -location of the actual content file in the +If you take a look at the [DVC-file](/doc/user-guide/dvc-metafile-formats) +created by `dvc add`, you will see that outputs are tracked in the +`outs` field. In this file, only one output is specified. The output contains +the data file path in the repository and its MD5 hash. This hash value +determines the location of the actual content file in the [cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory), `.dvc/cache`. @@ -139,8 +139,8 @@ files written to by the command, if any. - `-o out.dat` (lower case o) specifies an output data file. 
DVC will track this data file by creating a corresponding - [DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add out.dat` - after `dvc run` instead). + [DVC-file](/doc/user-guide/dvc-metafile-formats) (as if running + `dvc add out.dat` after `dvc run` instead). - `-O tmp.dat` (upper case O) specifies a simple output file (not to be added to DVC). @@ -186,8 +186,8 @@ command and does some additional work if the command was successful: 2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage file in the project with information about this pipeline stage. - (See [DVC-File Format](/doc/user-guide/dvc-file-format)). Note that the name - of this file could be specified by using the `-f` option, for example + (See [DVC-File Format](/doc/user-guide/dvc-metafile-formats)). Note that the + name of this file could be specified by using the `-f` option, for example `-f extract.dvc`. Let's take a look at the resulting stage file created by `dvc run` above: diff --git a/content/docs/tutorials/deep/reproducibility.md b/content/docs/tutorials/deep/reproducibility.md index e43bfc3cff..ebca57b35d 100644 --- a/content/docs/tutorials/deep/reproducibility.md +++ b/content/docs/tutorials/deep/reproducibility.md @@ -19,9 +19,9 @@ automation tools ([Make](https://www.gnu.org/software/make/), Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of the graph nodes (pipeline [stages](/doc/command-reference/run)). -If you run `repro` on any [DVC-file](/doc/user-guide/dvc-file-format) from our -repository, nothing happens because nothing was changed in the pipeline defined -in the project: There's nothing to reproduce. +If you run `repro` on any [DVC-file](/doc/user-guide/dvc-metafile-formats) from +our repository, nothing happens because nothing was changed in the pipeline +defined in the project: There's nothing to reproduce. 
```dvc $ dvc repro model.p.dvc diff --git a/content/docs/tutorials/deep/sharing-data.md b/content/docs/tutorials/deep/sharing-data.md index 199dbe59f7..25cfd5bd88 100644 --- a/content/docs/tutorials/deep/sharing-data.md +++ b/content/docs/tutorials/deep/sharing-data.md @@ -2,11 +2,11 @@ ## Pushing data to the cloud -We've gone over how source code and [DVC-files](/doc/user-guide/dvc-file-format) -can be shared using a Git repository. These DVC repositories will -contain all the information needed for reproducibility, so it might be a good -idea to share them with your team using Git hosting services (such as -[GitHub](https://github.com/)). +We've gone over how source code and +[DVC-files](/doc/user-guide/dvc-metafile-formats) can be shared using a Git +repository. These DVC repositories will contain all the information +needed for reproducibility, so it might be a good idea to share them with your +team using Git hosting services (such as [GitHub](https://github.com/)). DVC is able to push the cache to cloud storage. diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index b2b5226012..02ecd54a5a 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -69,9 +69,9 @@ $ dvc run -f prepare.dvc \ ``` The `prepare.dvc` _stage file_ is generated. It has the same -[format](/doc/user-guide/dvc-file-format) as the DVC-file we created previously -to [tack data](/doc/tutorials/get-started/data-versioning#changes), but it -additionally includes information about the command we ran +[format](/doc/user-guide/dvc-metafile-formats) as the DVC-file we created +previously to [track data](/doc/tutorials/get-started/data-versioning#changes), +but it additionally includes information about the command we ran (`python src/prepare.py`), the dependencies, and outputs.
diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md index 2526c594fb..a0c1180b82 100644 --- a/content/docs/tutorials/get-started/data-versioning.md +++ b/content/docs/tutorials/get-started/data-versioning.md @@ -28,8 +28,8 @@ $ dvc add data/data.xml DVC stores information about the added file in a special _DVC-file_ named `data/data.xml.dvc`, a small text file with a human-readable -[format](/doc/user-guide/dvc-file-format). This metafile can committed with Git -instead, as a placeholder for the original data (which is added to +[format](/doc/user-guide/dvc-metafile-formats). This metafile can be committed +with Git instead, as a placeholder for the original data (which is added to `.gitignore`): ```dvc @@ -110,8 +110,9 @@ data\data.xml.dvc: $ dvc add data/data.xml ``` -DVC updates the `data/data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) -to match the updated data. Let's commit this new version with Git: +DVC updates the `data/data.xml.dvc` +[DVC-file](/doc/user-guide/dvc-metafile-formats) to match the updated data. +Let's commit this new version with Git:
@@ -201,7 +202,7 @@ $ dvc push ``` > Usually, we also want to `git commit` and `git push` the corresponding -> [DVC-files](/doc/user-guide/dvc-file-format). +> [DVC-files](/doc/user-guide/dvc-metafile-formats). Pushing data or models ensures they're safely backed up remotely. This also means they can be retrieved from other environments. @@ -330,9 +331,10 @@ The `url` and `rev_lock` subfields under `repo` are used to save the origin and
-Additionally, the `data/data.xml` [DVC-file](/doc/user-guide/dvc-file-format) -now includes metadata to track changes in the source data. This allows you to -bring in changes from the data source later, using `dvc update`. +Additionally, the `data/data.xml` +[DVC-file](/doc/user-guide/dvc-metafile-formats) now includes metadata to track +changes in the source data. This allows you to bring in changes from the data +source later, using `dvc update`. ### Python API diff --git a/content/docs/tutorials/pipelines.md b/content/docs/tutorials/pipelines.md index 5283c3ea2c..302b084887 100644 --- a/content/docs/tutorials/pipelines.md +++ b/content/docs/tutorials/pipelines.md @@ -97,7 +97,7 @@ $ dvc add data/Posts.xml.zip ``` When we run `dvc add` `Posts.xml.zip`, DVC creates a -[DVC-file](/doc/user-guide/dvc-file-format). +[DVC-file](/doc/user-guide/dvc-metafile-formats).
diff --git a/content/docs/tutorials/versioning.md b/content/docs/tutorials/versioning.md index ede422613b..dc6419dad4 100644 --- a/content/docs/tutorials/versioning.md +++ b/content/docs/tutorials/versioning.md @@ -132,7 +132,7 @@ the cache (while keeping a [file link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) to it in the workspace, so you can continue working the same way as before). This is achieved by creating a simple human-readable -[DVC-file](/doc/user-guide/dvc-file-format) that serves as a pointer to the +[DVC-file](/doc/user-guide/dvc-metafile-formats) that serves as a pointer to the cache. Next, we train our first model with `train.py`. Because of the small dataset, @@ -168,8 +168,8 @@ As we mentioned briefly, DVC does not commit the `data/` directory and then `git commit` DVC-files that contain file hashes that point to cached data. In this case we created `data.dvc` and `model.h5.dvc`. Refer to -[DVC-File Format](/doc/user-guide/dvc-file-format) to learn more about how these -files work. +[DVC-File Format](/doc/user-guide/dvc-metafile-formats) to learn more about how +these files work.
@@ -283,8 +283,8 @@ the `v2.0` tag. As we have learned already, DVC keeps data files out of Git (by adjusting `.gitignore`) and puts them into the cache (usually it's a `.dvc/cache` directory inside the repository). Instead, DVC creates -[DVC-files](/doc/user-guide/dvc-file-format). These text files serve as data -placeholders that point to the cached files, and they can be easily version +[DVC-files](/doc/user-guide/dvc-metafile-formats). These text files serve as +data placeholders that point to the cached files, and they can be easily version controlled with Git. When we run `git checkout` we restore pointers (DVC-files) first. Then, when we @@ -325,11 +325,11 @@ $ dvc run -f Dvcfile \ ``` Similar to `dvc add`, `dvc run` creates a -[DVC-file](/doc/user-guide/dvc-file-format) named `Dvcfile` (specified using the -`-f` option). It tracks all outputs (`-o`) the same way as `dvc add` does. -Unlike `dvc add`, `dvc run` also tracks dependencies (`-d`) and the command -(`python train.py`) that was run to produce the result. We call such a DVC-file -a "stage file". +[DVC-file](/doc/user-guide/dvc-metafile-formats) named `Dvcfile` (specified +using the `-f` option). It tracks all outputs (`-o`) the same way as `dvc add` +does. Unlike `dvc add`, `dvc run` also tracks dependencies (`-d`) and the +command (`python train.py`) that was run to produce the result. We call such a +DVC-file a "stage file". > At this point you could run `git add .` and `git commit` to save the `Dvcfile` > stage file and its changed outputs to the repository. diff --git a/content/docs/understanding-dvc/how-it-works.md b/content/docs/understanding-dvc/how-it-works.md index 732433fca5..f3efd7608c 100644 --- a/content/docs/understanding-dvc/how-it-works.md +++ b/content/docs/understanding-dvc/how-it-works.md @@ -44,7 +44,7 @@ - DVC introduces the concept of data files for Git repositories. 
DVC keeps data files outside of the repository, replacing them with special - [DVC-files](/doc/user-guide/dvc-file-format) in the Git repo: + [DVC-files](/doc/user-guide/dvc-metafile-formats) in the Git repo: ```dvc $ git checkout a03_normbatch_vgg16 # checkout code and DVC-files diff --git a/content/docs/understanding-dvc/related-technologies.md b/content/docs/understanding-dvc/related-technologies.md index 34afd0e9b6..9fa14ed929 100644 --- a/content/docs/understanding-dvc/related-technologies.md +++ b/content/docs/understanding-dvc/related-technologies.md @@ -60,8 +60,9 @@ Luigi, etc. (DAG): - The DAG or dependency graph is defined implicitly by the connections between - [DVC-files](/doc/user-guide/dvc-file-format) (with file names `.dvc` - or `Dvcfile`), based on their dependencies and outputs. + [DVC-files](/doc/user-guide/dvc-metafile-formats) (with file names + `.dvc` or `Dvcfile`), based on their dependencies and + outputs. - Each DVC-file defines one node in the DAG. All DVC-files in a repository make up a single pipeline (think a single Makefile). All DVC-files (and @@ -99,9 +100,9 @@ Luigi, etc. Git-annex repository is cloned via `git clone`, data files won't be copied to the local machine, as file contents are stored in separate [remotes](/doc/command-reference/remote). With DVC, - [DVC-files](/doc/user-guide/dvc-file-format), which provide the reproducible - workflow, are always included in the Git repository. Hence, they can be - executed locally with minimal effort. + [DVC-files](/doc/user-guide/dvc-metafile-formats), which provide the + reproducible workflow, are always included in the Git repository. Hence, they + can be executed locally with minimal effort. - DVC is not fundamentally bound to Git, and users have the option of using DVC without SCM. 
diff --git a/content/docs/understanding-dvc/what-is-dvc.md b/content/docs/understanding-dvc/what-is-dvc.md index 56865ef8c8..44bc274c21 100644 --- a/content/docs/understanding-dvc/what-is-dvc.md +++ b/content/docs/understanding-dvc/what-is-dvc.md @@ -45,8 +45,8 @@ DVC uses a few core concepts: - **Data files**: Cached files (for large files). Data files are stored outside of the Git repository on a local/shared hard drive or remote storage, but - [DVC-files](/doc/user-guide/dvc-file-format) describing that data are stored - in Git for DVC needs (to maintain pipelines and reproducibility). + [DVC-files](/doc/user-guide/dvc-metafile-formats) describing that data are + stored in Git for DVC needs (to maintain pipelines and reproducibility). - **Cache directory**: Directory with all data files on a local hard drive or in cloud storage, but not in the Git repository. See `dvc cache dir`. diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md index 4b66691b5f..b3377df8b8 100644 --- a/content/docs/use-cases/data-registries.md +++ b/content/docs/use-cases/data-registries.md @@ -36,8 +36,9 @@ Advantages of using a DVC **data registry**: copies on other remotes). This simplifies data management and optimizes space requirements. - Security: Registries can be setup to have read-only remote storage (e.g. an - HTTP location). Git versioning of [DVC-files](/doc/user-guide/dvc-file-format) - allows us to track and audit data changes. + HTTP location). Git versioning of + [DVC-files](/doc/user-guide/dvc-metafile-formats) allows us to track and audit + data changes. - Data as code: Leverage Git workflow such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle. Think Git for cloud storage, but without ad-hoc conventions. @@ -65,10 +66,10 @@ $ dvc add music/songs > [MillionSongSubset](http://millionsongdataset.com/pages/getting-dataset/#subset). 
A regular Git workflow can be followed with the tiny -[DVC-files](/doc/user-guide/dvc-file-format) that substitute the actual data -(`music/songs.dvc` in this example). This enables team collaboration on data at -the same level as with source code (commit history, branching, pull requests, -reviews, etc.): +[DVC-files](/doc/user-guide/dvc-metafile-formats) that substitute the actual +data (`music/songs.dvc` in this example). This enables team collaboration on +data at the same level as with source code (commit history, branching, pull +requests, reviews, etc.): ```dvc $ git add music/songs.dvc music/.gitignore @@ -147,8 +148,8 @@ $ dvc import https://github.com/example/registry \ Besides downloading, importing saves the dependency from the local project to the data source (registry repo). This is achieved by creating a particular kind -of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_). This -file can be used staged and committed with Git. +of [DVC-file](/doc/user-guide/dvc-metafile-formats) (a.k.a. _import stage_). +This file can be staged and committed with Git. As an addition to the import workflow, and enabled the saved dependency, we can easily bring it up to date in our consumer project(s) with `dvc update` whenever diff --git a/content/docs/use-cases/sharing-data-and-model-files.md b/content/docs/use-cases/sharing-data-and-model-files.md index b23c2e9b25..025169766c 100644 --- a/content/docs/use-cases/sharing-data-and-model-files.md +++ b/content/docs/use-cases/sharing-data-and-model-files.md @@ -67,8 +67,8 @@ with the `dvc push` command: $ dvc push ``` -Code and [DVC-files](/doc/user-guide/dvc-file-format) can be safely committed -and pushed with Git. +Code and [DVC-files](/doc/user-guide/dvc-metafile-formats) can be safely +committed and pushed with Git.
## Download code diff --git a/content/docs/use-cases/versioning-data-and-model-files.md b/content/docs/use-cases/versioning-data-and-model-files.md index 1cd1ceb296..f913021492 100644 --- a/content/docs/use-cases/versioning-data-and-model-files.md +++ b/content/docs/use-cases/versioning-data-and-model-files.md @@ -8,8 +8,8 @@ DVC allows versioning data files and directories, intermediate results, and ML models using Git, but without storing the file contents in the Git repository. It's useful when dealing with files that are too large for Git to handle properly in general. DVC saves information about your data in special -[DVC-files](/doc/user-guide/dvc-file-format), and these metafiles can be used -for versioning. To actually store the data, DVC supports various types of +[DVC-files](/doc/user-guide/dvc-metafile-formats), and these metafiles can be +used for versioning. To actually store the data, DVC supports various types of [remote storage](/doc/command-reference/remote). This allows easily saving and sharing data alongside code. diff --git a/content/docs/user-guide/basic-concepts/dvc-project.md b/content/docs/user-guide/basic-concepts/dvc-project.md index 486e994379..282df02c7b 100644 --- a/content/docs/user-guide/basic-concepts/dvc-project.md +++ b/content/docs/user-guide/basic-concepts/dvc-project.md @@ -16,5 +16,5 @@ match: Initialized by running `dvc init` in the **workspace** (typically in a Git repository). It will contain the [`.dvc/` directory](/doc/user-guide/dvc-files-and-directories) and -[DVC-files](/doc/user-guide/dvc-file-format) created with commands such as +[DVC-files](/doc/user-guide/dvc-metafile-formats) created with commands such as `dvc add` or `dvc run`. 
diff --git a/content/docs/user-guide/contributing/docs.md b/content/docs/user-guide/contributing/docs.md index d15ef180ab..4a613cac22 100644 --- a/content/docs/user-guide/contributing/docs.md +++ b/content/docs/user-guide/contributing/docs.md @@ -169,8 +169,8 @@ is installed when `yarn` runs (see [dev env](#development-environment)). `dvc`, `yaml`, or `diff` custom languages. `usage` is employed to show the `dvc --help` output for each command reference. `dvc` can be used to show examples of commands and their output in a terminal session. `yaml` is used to - show [DVC-file](/doc/user-guide/dvc-file-format) contents or other YAML data. - `diff` is used mainly for examples of `git diff` output. + show [DVC-file](/doc/user-guide/dvc-metafile-formats) contents or other YAML + data. `diff` is used mainly for examples of `git diff` output. > Check out the `.md` source code of any command reference to get a better idea, > for example in diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 7518ee9422..c2c08bc8a7 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -21,7 +21,7 @@ operation: > Note that DVC includes the cache directory in `.gitignore` during > initialization. No data tracked by DVC will ever be pushed to the Git - > repository, only [DVC-files](/doc/user-guide/dvc-file-format) that are + > repository, only [DVC-files](/doc/user-guide/dvc-metafile-formats) that are > needed to download or reproduce them. 
- `.dvc/plots`: Directory for @@ -84,7 +84,7 @@ $ dvc add data/images ``` When running `dvc add` on this directory of images, a `data/images.dvc` -[DVC-file](/doc/user-guide/dvc-file-format) is created, containing the hash +[DVC-file](/doc/user-guide/dvc-metafile-formats) is created, containing the hash value of the directory: ```yaml diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 994640eadc..63af957ca3 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -11,10 +11,10 @@ DVC to control data outside of the project directory. ## Description DVC can track files on an external storage with `dvc add` or specify external -files as outputs for [DVC-files](/doc/user-guide/dvc-file-format) -created by `dvc run` (stage files). External outputs are considered part of the -DVC project. DVC will track changes in them and reflect this in the output of -`dvc status`. +files as outputs for +[DVC-files](/doc/user-guide/dvc-metafile-formats) created by `dvc run` (stage +files). External outputs are considered part of the DVC project. DVC will track +changes in them and reflect this in the output of `dvc status`. 
Currently, the following types (protocols) of external outputs (and cache) are supported: From 1008321bb58a6fc23514e12cfa6b97d0d41f27e6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 28 May 2020 12:16:34 -0500 Subject: [PATCH 04/36] glossary: update DVC-file terminology for tooltips --- content/docs/user-guide/basic-concepts/dependency.md | 4 ++-- content/docs/user-guide/basic-concepts/dvc-project.md | 4 ++-- content/docs/user-guide/basic-concepts/import-stage.md | 4 ++-- content/docs/user-guide/basic-concepts/output.md | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md index 1ab32a397b..73cbe3440a 100644 --- a/content/docs/user-guide/basic-concepts/dependency.md +++ b/content/docs/user-guide/basic-concepts/dependency.md @@ -4,5 +4,5 @@ match: [dependency, dependencies] --- A file or directory (possibly tracked by DVC) recorded in the `deps` section of -a DVC-file (stage file). See `dvc run`. Stages are invalidated when any of their -dependencies change. +a DVC metafile (stage file). See `dvc run`. Stages are invalidated when any of +their dependencies change. diff --git a/content/docs/user-guide/basic-concepts/dvc-project.md b/content/docs/user-guide/basic-concepts/dvc-project.md index 282df02c7b..64a7d7fc30 100644 --- a/content/docs/user-guide/basic-concepts/dvc-project.md +++ b/content/docs/user-guide/basic-concepts/dvc-project.md @@ -16,5 +16,5 @@ match: Initialized by running `dvc init` in the **workspace** (typically in a Git repository). It will contain the [`.dvc/` directory](/doc/user-guide/dvc-files-and-directories) and -[DVC-files](/doc/user-guide/dvc-metafile-formats) created with commands such as -`dvc add` or `dvc run`. +[DVC metafiles](/doc/user-guide/dvc-metafile-formats) created with commands such +as `dvc add` or `dvc run`. 
diff --git a/content/docs/user-guide/basic-concepts/import-stage.md b/content/docs/user-guide/basic-concepts/import-stage.md index b1d96fecb7..23dece59d8 100644 --- a/content/docs/user-guide/basic-concepts/import-stage.md +++ b/content/docs/user-guide/basic-concepts/import-stage.md @@ -3,5 +3,5 @@ name: 'Import Stage' match: ['import stage', 'import stages'] --- -Stage (DVC-file) created with the `dvc import` or `dvc import-url` commands. -They represent files or directories from external sources. +`.dvc` file created with the `dvc import` or `dvc import-url` commands. They +represent files or directories from external sources. diff --git a/content/docs/user-guide/basic-concepts/output.md b/content/docs/user-guide/basic-concepts/output.md index a10b98d784..8776d7e541 100644 --- a/content/docs/user-guide/basic-concepts/output.md +++ b/content/docs/user-guide/basic-concepts/output.md @@ -3,6 +3,6 @@ name: Output match: [output, outputs] --- -A file or directory tracked by DVC, recorded in the `outs` section of a -DVC-file. Outputs are usually the result of stages. See `dvc add`, `dvc run`, +A file or directory tracked by DVC, recorded in the `outs` section of a DVC +metafile. Outputs are usually the result of stages. See `dvc add`, `dvc run`, `dvc import`, et al. A.k.a. 
_data artifact_ From bf050eb7dc8ece35a1eac4c40c21cb6deb7b3e36 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 28 May 2020 14:19:50 -0500 Subject: [PATCH 05/36] user-guide: full draft of new DVC Metafiles Format doc --- .../docs/user-guide/dvc-metafile-formats.md | 148 ++++++++---------- 1 file changed, 69 insertions(+), 79 deletions(-) diff --git a/content/docs/user-guide/dvc-metafile-formats.md b/content/docs/user-guide/dvc-metafile-formats.md index 305c8c9f41..ee80af94a9 100644 --- a/content/docs/user-guide/dvc-metafile-formats.md +++ b/content/docs/user-guide/dvc-metafile-formats.md @@ -1,47 +1,56 @@ # DVC Metafile Formats -There are two special metafiles created by DVC commands: +There are two special metafiles created by certain +[DVC commands](/doc/command-reference): - Files ending with the `.dvc` extension are basic placeholders to version data - files and directories. A project can have multiple `.dvc` files. -- The `dvc.yaml` file or _pipeline(s) file_ specifies stages that form the - pipeline(s) of a project, and their connections (_dependency graph_ or DAG). + files and directories. A DVC project can have multiple + [`.dvc` files](#dvc-files). +- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages + that form the pipeline(s) of a project, and their connections (_dependency + graph_ or DAG). Both use human-friendly YAML schemas, described below. We encourage you to get -familiar with them so you may edit them freely, as needed. +familiar with them so you may edit them freely, as needed. Both type of files +should be versioned with Git (for Git-enabled repositories). -Both type of files should be versioned with Git (for Git-enabled -repositories). +> See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable the +> highlighting for your editor. ## .dvc files -When you add a file or directory to a DVC project with `dvc add`, a -`.dvc` file is created based on the data file name (e.g. `data.txt.dvc`). 
These -files contain all the needed information to track your data with DVC. They use a -simple YAML format that can be easily written or altered manually. +When you add a file or directory to a DVC project with `dvc add` or +`dvc import`, a `.dvc` file is created based on the data file name (e.g. +`data.xml.dvc`). These files contain the basic information needed to track the +data with DVC. -See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable the -highlighting for your editor. - -Here is a sample `.dvc` file: +They use a simple YAML format that can be easily written or altered manually. +Here is a sample: ```yaml outs: - md5: a304afb96060aad90176268345e10355 path: data.xml -# Comments like this line persist through multiple executions of -# dvc repro/commit but not through dvc run/add/import-url/get-url commands. +# Manual comments can be added in. ``` -On the top level, `.dvc` file consists of these possible fields: +`.dvc` files contain a single top field: -- `outs`: List of outputs for this stage +- `outs` - list of outputs for this `.dvc` file -An output entry consists of these fields: +An output entry can consist of these fields: + +- `md5` - hash value for the output file +- `path` - path to the output in the workspace, relative to the + location of the `.dvc` file +- `cache` - (optional) whether or not DVC should cache the output. `true` by + default -- `md5`: MD5 hash for the output -- `path`: Path to the output, relative to the `wdir` path -- `cache`: Whether or not DVC should cache the output +Note that comments can be added to DVC metafiles using the `# comment` syntax. + +> `.dvc` file comments are preserved among executions of the `dvc repro` and +> `dvc commit` commands, but not when a `.dvc` file is overwritten by +> `dvc add`,`dvc import`, or `dvc import-url`. ## dvc.yaml file @@ -49,45 +58,34 @@ When you add commands to a pipeline with `dvc run`, the `dvc.yaml` file is created or updated. 
Here's a simple example: ```yaml -always_changed: true -locked: true -cmd: python cmd.py input.data output.data metrics.json -deps: - - md5: da2259ee7c12ace6db43644aef2b754c - path: cmd.py - - md5: e309de87b02312e746ec5a500844ce77 - path: input.data -md5: 521ac615cfc7323604059d81d052ce00 -outs: - - cache: true - md5: 70f3c9157e3b92a6d2c93eb51439f822 - metric: false - path: output.data - - cache: false - md5: d7a82c3cdfd45c4ace13484a931fc526 - metric: - type: json - xpath: AUC - path: metrics.json - -# Comments like this line persist through multiple executions of -# dvc repro/commit but not through dvc run/add/import-url/get-url commands. - -meta: # Special field to contain arbitrary user data - name: John - email: john@xyz.com +stages: + firstone: + cmd: python cmd.py input.data output.data metrics.json + deps: + - cmd.py + - input.data + outs: + - output.data + metrics: + - metrics.json + nextone: + cmd: python ... + ... ``` -On the top level, `.dvc` file consists of these possible fields: +`dvc.yaml` files consists of a group of `stages` with names provided explicitly +by the user with the `--name` (`-n`) option of `dvc run`. 
Each stage can contain +the following fields: -- `cmd`: Executable command defined in this stage -- `wdir`: Directory to run command in (default `.`) -- `md5`: MD5 hash for this DVC-file -- `deps`: List of dependencies for this stage -- `outs`: List of outputs for this stage -- `locked`: Whether or not this stage is locked from reproduction -- `always_changed`: Whether or not this stage is considered as changed by - commands such as `dvc status` and `dvc repro` (default `false`) +- `cmd` - executable command defined in this stage +- `deps` - list of dependencies for this stage +- `params` - (optional) list of the [parameter](/doc/command-reference/params) + names and their current values +- `outs` - list of outputs for this stage +- `metric` - (optional) list of [metric](/doc/command-reference/metrics) files +- `locked` - (optional) whether or not this stage is locked from reproduction +- `always_changed` (optional) - whether or not this stage is considered as + changed by commands such as `dvc status` and `dvc repro`. `false` by default A dependency entry consists of a these possible fields: @@ -95,8 +93,6 @@ A dependency entry consists of a these possible fields: - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) - `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) -- `params`: If this is a [parameter dependency](/doc/command-reference/params) - file, contains a list of the parameter names and their current values. 
- `repo`: This entry is only for external dependencies created with `dvc import`, and can contains the following fields: @@ -114,25 +110,19 @@ A dependency entry consists of a these possible fields: An output entry consists of these fields: -- `path`: Path to the output, relative to the `wdir` path -- `md5`: MD5 hash for the output -- `cache`: Whether or not DVC should cache the output -- `metric`: If this file is a [metric](/doc/command-reference/metrics), contains - the following fields: +- `md5` - hash value for the output file +- `path` - path to the output in the workspace, relative to the + location of the `.dvc` file +- `cache` - (optional) whether or not DVC should cache the output. `true` by + default - - `type`: Type of the metric file (`json`) - - `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` - for `{"AUC": {"value": 0.624321}}`) +Metrics entries can contain these fields: -A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry -can have any valid YAML structure containing any number of attributes. -`"meta: string"` is also possible, it doesn't need to contain a _hash_ structure -(a.k.a. dictionary) always. +- `type`: Type of the metric file (`json`) +- `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` for + `{"AUC": {"value": 0.624321}}`) -Comments can be added to the DVC-file using `# comment` syntax. Comments and -meta values are preserved among executions of the `dvc repro` and `dvc commit` -commands. +`dvc.yaml` files also support `# comments`. -> Note that comments and meta values are not preserved when a DVC-file is -> overwritten with the `dvc run`,`dvc add`,`dvc import`, and `dvc import-url` -> commands. +> `dvc.yaml` comments are preserved among executions of `dvc run`, `dvc repro`, +> and `dvc commit`. 
From 85ceb015298d86968c6c8e1ce7c2015f6f6a0aa9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 2 Jun 2020 17:55:04 -0500 Subject: [PATCH 06/36] term: don't use "meta" --- content/docs/command-reference/add.md | 12 +++--------- content/docs/command-reference/destroy.md | 6 +++--- 2 files changed, 6 insertions(+), 12 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 9a66c59ae7..5608c764b0 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -138,22 +138,16 @@ $ tree Let's check the `data.xml.dvc` file inside: ```yaml -md5: aae37d74224b05178153acd94e15956b outs: - cache: true - md5: d8acabbfd4ee51c95da5d7628c7ef74b - metric: false - path: data.xml -meta: # Special field to contain arbitary user data - name: John - email: john@xyz.com + md5: d8acabbfd4ee51c95da5d7628c7ef74b # file hash value + path: data.xml # file name ``` This is a standard DVC-file with only one output (in the `outs` field). The hash value should correspond to a file path in the cache. -> Note that the `meta` values above were entered manually for this example. Meta -> values and `#` comments are not preserved when a DVC-file is overwritten with +> Note that `#` comments are not preserved when a DVC-file is overwritten with > the `dvc add`, `dvc run`, `dvc import`, or `dvc import-url` commands. ```dvc diff --git a/content/docs/command-reference/destroy.md b/content/docs/command-reference/destroy.md index 533ba4f3b5..b4f38bfdc9 100644 --- a/content/docs/command-reference/destroy.md +++ b/content/docs/command-reference/destroy.md @@ -12,7 +12,7 @@ usage: dvc destroy [-h] [-q | -v] [-f] ## Description -`dvc destroy` removes DVC-files, and the entire `.dvc/` meta directory from the +`dvc destroy` removes DVC-files, and the internal `.dvc/` directory from the workspace. Note that the cache directory will normally be removed as well, unless it's set to an external location with `dvc cache dir`. 
(By default a local cache is located in the `.dvc/cache` @@ -94,8 +94,8 @@ $ ls -a .git code.py foo ``` -`dvc destroy` command removed DVC-files, and the entire `.dvc/` meta directory -from the workspace. But the cache files that are present in the +`dvc destroy` command removed DVC-files, and the internal `.dvc/` directory from +the workspace. But the cache files that are present in the `/mnt/cache` directory still persist: ```dvc From 7f9ed2cab706592fbef0faa17aa4a14d924f9c4e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 2 Jun 2020 18:41:11 -0500 Subject: [PATCH 07/36] user-guide: merge files and dirs + metafiles guides --- content/docs/api-reference/get_url.md | 4 +- content/docs/command-reference/add.md | 14 +- content/docs/command-reference/cache/index.md | 4 +- content/docs/command-reference/checkout.md | 7 +- content/docs/command-reference/commit.md | 10 +- content/docs/command-reference/config.md | 5 +- content/docs/command-reference/destroy.md | 4 +- content/docs/command-reference/fetch.md | 9 +- content/docs/command-reference/get.md | 8 +- content/docs/command-reference/import-url.md | 14 +- content/docs/command-reference/import.md | 18 +-- content/docs/command-reference/init.md | 10 +- content/docs/command-reference/install.md | 17 +- content/docs/command-reference/lock.md | 2 +- .../docs/command-reference/metrics/diff.md | 2 +- .../docs/command-reference/metrics/show.md | 2 +- content/docs/command-reference/move.md | 13 +- .../docs/command-reference/params/index.md | 10 +- .../docs/command-reference/pipeline/show.md | 2 +- content/docs/command-reference/plots/diff.md | 2 +- content/docs/command-reference/pull.md | 4 +- content/docs/command-reference/push.md | 2 +- .../docs/command-reference/remote/modify.md | 2 +- content/docs/command-reference/remove.md | 4 +- content/docs/command-reference/repro.md | 8 +- content/docs/command-reference/run.md | 8 +- content/docs/command-reference/status.md | 6 +- content/docs/command-reference/unlock.md | 2 
+- content/docs/command-reference/update.md | 8 +- content/docs/sidebar.json | 6 +- .../docs/tutorials/deep/define-ml-pipeline.md | 10 +- content/docs/tutorials/deep/preparation.md | 8 +- .../docs/tutorials/deep/reproducibility.md | 4 +- content/docs/tutorials/deep/sharing-data.md | 6 +- .../tutorials/get-started/data-pipelines.md | 2 +- .../tutorials/get-started/data-versioning.md | 14 +- content/docs/tutorials/get-started/index.md | 5 +- content/docs/tutorials/pipelines.md | 8 +- content/docs/tutorials/versioning.md | 20 +-- .../docs/understanding-dvc/how-it-works.md | 5 +- .../understanding-dvc/related-technologies.md | 6 +- content/docs/understanding-dvc/what-is-dvc.md | 4 +- content/docs/use-cases/data-registries.md | 14 +- .../use-cases/sharing-data-and-model-files.md | 2 +- .../versioning-data-and-model-files.md | 8 +- .../user-guide/basic-concepts/dvc-project.md | 6 +- content/docs/user-guide/contributing/docs.md | 4 +- .../user-guide/dvc-files-and-directories.md | 149 ++++++++++++++++-- .../docs/user-guide/dvc-metafile-formats.md | 128 --------------- .../user-guide/large-dataset-optimization.md | 4 +- .../docs/user-guide/managing-external-data.md | 6 +- 51 files changed, 313 insertions(+), 307 deletions(-) delete mode 100644 content/docs/user-guide/dvc-metafile-formats.md diff --git a/content/docs/api-reference/get_url.md b/content/docs/api-reference/get_url.md index 0d0d1befa7..153cc531fd 100644 --- a/content/docs/api-reference/get_url.md +++ b/content/docs/api-reference/get_url.md @@ -30,8 +30,8 @@ specified by its `path` in a `repo` (DVC project), is stored. The URL is formed by reading the project's [remote configuration](/doc/command-reference/config#remote) and the -[DVC-file](/doc/user-guide/dvc-metafile-formats) where the given `path` is found -(`outs` field). The URL schema returned depends on the +[DVC-file](/doc/user-guide/dvc-files-and-directories) where the given `path` is +found (`outs` field). 
The URL schema returned depends on the [type](/doc/command-reference/remote/add#supported-storage-types) of the `remote` used (see the [Parameters](#parameters) section). diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 5608c764b0..aa1d97d564 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -1,7 +1,7 @@ # add Track data files or directories with DVC, by creating a corresponding -[DVC-file](/doc/user-guide/dvc-metafile-formats). +[DVC-file](/doc/user-guide/dvc-files-and-directories). ## Synopsis @@ -17,7 +17,7 @@ positional arguments: The `dvc add` command is analogous to `git add`, in that it makes DVC aware of the target data, as a first step to version it. It creates a -[DVC-file](/doc/user-guide/dvc-metafile-formats) to track the added data. +[DVC-file](/doc/user-guide/dvc-files-and-directories) to track the added data. The `targets` are files or directories to add with this command, that are turned into data artifacts of the project. By default, these @@ -48,8 +48,8 @@ Under the hood, a few actions are taken for each file (or directory) in appropriate. Summarizing, the result is that the target data is replaced small DVC-files can -be tracked with Git. See [DVC-File Format](/doc/user-guide/dvc-metafile-formats) -for more details. +be tracked with Git. See +[DVC-File Format](/doc/user-guide/dvc-files-and-directories) for more details. > Note that DVC-files created by this command are considered _orphan stage > files_ because they have no _dependencies_, only outputs. These are always @@ -125,7 +125,7 @@ To track the changes with git run: git add .gitignore data.xml.dvc ``` -As shown above, a [DVC-file](/doc/user-guide/dvc-metafile-formats) has been +As shown above, a [DVC-file](/doc/user-guide/dvc-files-and-directories) has been created for `data.xml`. Let's explore the result: ```dvc @@ -188,8 +188,8 @@ Saving information to 'pics.dvc'. ... 
``` -There are no [DVC-files](/doc/user-guide/dvc-metafile-formats) generated within -this directory structure, but the images are all added to the +There are no [DVC-files](/doc/user-guide/dvc-files-and-directories) generated +within this directory structure, but the images are all added to the cache. DVC prints a message mentioning that MD5 hash values are computed for each file. A single `pics.dvc` DVC-file is generated for the top-level directory, and it contains: diff --git a/content/docs/command-reference/cache/index.md b/content/docs/command-reference/cache/index.md index d8e565a806..07c8fb54c6 100644 --- a/content/docs/command-reference/cache/index.md +++ b/content/docs/command-reference/cache/index.md @@ -17,8 +17,8 @@ positional arguments: At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories), that are -hidden from the user. +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), +that are hidden from the user. The cache is where your data files, models, etc. (anything you want to version with DVC) are actually stored. The corresponding files you see in the diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md index 5985de501f..07562305c3 100644 --- a/content/docs/command-reference/checkout.md +++ b/content/docs/command-reference/checkout.md @@ -16,9 +16,10 @@ positional arguments: ## Description -[DVC-files](/doc/user-guide/dvc-metafile-formats) act as pointers to specific -version of data files or directories tracked by DVC. This command synchronizes -the workspace data with the versions specified in the current DVC-files. +[DVC-files](/doc/user-guide/dvc-files-and-directories) act as pointers to +specific version of data files or directories tracked by DVC. This command +synchronizes the workspace data with the versions specified in the current +DVC-files. 
`dvc checkout` is useful, for example, when using Git in the project, after `git clone`, `git checkout`, or any other operation diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 960f26dc9a..bc2e7c364e 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -1,8 +1,8 @@ # commit Record changes to DVC-tracked files in the project, by updating -[DVC-files](/doc/user-guide/dvc-metafile-formats) and saving outputs -to the cache. +[DVC-files](/doc/user-guide/dvc-files-and-directories) and saving +outputs to the cache. ## Synopsis @@ -66,8 +66,8 @@ cache. This is where the `dvc commit` command comes into play. It performs that last step (saving the data in cache). Note that it's best to avoid the last two scenarios. They essentially -force-update the [DVC-files](/doc/user-guide/dvc-metafile-formats) and save data -to cache. They are still useful, but keep in mind that DVC can't guarantee +force-update the [DVC-files](/doc/user-guide/dvc-files-and-directories) and save +data to cache. They are still useful, but keep in mind that DVC can't guarantee reproducibility in those cases. ## Options @@ -226,7 +226,7 @@ the new instance of `model.pkl` is there. It is also possible to execute the commands that are executed by `dvc repro` by hand. You won't have DVC helping you, but you have the freedom to run any command you like, even ones not defined in a -[DVC-file](/doc/user-guide/dvc-metafile-formats). For example: +[DVC-file](/doc/user-guide/dvc-files-and-directories). For example: ```dvc $ python src/featurization.py data/prepared data/features diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index ddfe7fa3f9..5de61bdab3 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -179,8 +179,9 @@ for more details.) 
This section contains the following options: ### state -See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to -learn more about the state file (database) that is used for optimization. +See +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +to learn more about the state file (database) that is used for optimization. - `state.row_limit` - maximum number of entries in the state database, which affects the physical size of the state file itself, as well as the performance diff --git a/content/docs/command-reference/destroy.md b/content/docs/command-reference/destroy.md index b4f38bfdc9..258ac4a4c2 100644 --- a/content/docs/command-reference/destroy.md +++ b/content/docs/command-reference/destroy.md @@ -1,8 +1,8 @@ # destroy Remove all -[DVC files and directories](/doc/user-guide/dvc-files-and-directories) from a -DVC project. +[DVC files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +from a DVC project. ## Synopsis diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index e4e8afc081..f568178dd5 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -22,7 +22,8 @@ of the project, but without placing them in the workspace. This makes the data files available for linking (or copying) into the workspace. (Refer to [dvc config cache.type](/doc/command-reference/config#cache).) Along with `dvc checkout`, it's performed automatically by `dvc pull` when the target -[DVC-files](/doc/user-guide/dvc-metafile-formats) are not already in the cache: +[DVC-files](/doc/user-guide/dvc-files-and-directories) are not already in the +cache: ``` Controlled files Commands @@ -49,8 +50,8 @@ on DVC remotes.) 
These necessary data or model files are listed as [stage](/doc/command-reference/run)) so they are required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the corresponding [pipeline](/doc/command-reference/pipeline). (See -[DVC-File Format](/doc/user-guide/dvc-metafile-formats) for more information on -dependencies and outputs.) +[DVC-File Format](/doc/user-guide/dvc-files-and-directories) for more +information on dependencies and outputs.) `dvc fetch` ensures that the files needed for a DVC-file to be [reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in @@ -276,7 +277,7 @@ $ tree .dvc/cache ``` Fetching using `--with-deps` starts with the target -[DVC-file](/doc/user-guide/dvc-metafile-formats) (`train.dvc` stage) and +[DVC-file](/doc/user-guide/dvc-files-and-directories) (`train.dvc` stage) and searches backwards through its pipeline for data to download into the project's cache. All the data for the second and third stages ("featurize" and "train") has now been downloaded to the cache. We could now use `dvc checkout` to get the diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index eba12a36f3..f2f7b53b91 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -40,7 +40,7 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. Note that DVC-tracked targets should be found in a -[DVC-file](/doc/user-guide/dvc-metafile-formats) of the project. +[DVC-file](/doc/user-guide/dvc-files-and-directories) of the project. 
⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this @@ -182,9 +182,9 @@ get the most recent one, we use a similar command, but with `-o model.bigrams.pkl` and `--rev bigrams-experiment` (or even without `--rev` since that tag has the latest model version anyway). In fact, in this case using `dvc pull` with the corresponding -[DVC-files](/doc/user-guide/dvc-metafile-formats) should suffice, downloading -the file as just `model.pkl`. We can then rename it to make its variant -explicit: +[DVC-files](/doc/user-guide/dvc-files-and-directories) should suffice, +downloading the file as just `model.pkl`. We can then rename it to make its +variant explicit: ```dvc $ dvc pull train.dvc diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 255007f125..1e993519af 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -41,8 +41,8 @@ while `out` can be used to specify the directory and/or file name desired for the downloaded data. If an existing directory is specified, the file or directory will be placed inside. -[DVC-files](/doc/user-guide/dvc-metafile-formats) support references to data in -an external location, see +[DVC-files](/doc/user-guide/dvc-files-and-directories) support references to +data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such a DVC-file, the `deps` field stores the remote URL, and the `outs` field contains the corresponding local path in the workspace. It records enough @@ -102,9 +102,9 @@ $ dvc run -d https://example.com/path/to/data.csv \ wget https://example.com/path/to/data.csv -O data.csv ``` -Both methods generate a [DVC-files](/doc/user-guide/dvc-metafile-formats) with -an external dependency, but the one created by `dvc import-url` preserves the -connection to the data source. We call this an _import stage_. 
+Both methods generate a [DVC-files](/doc/user-guide/dvc-files-and-directories) +with an external dependency, but the one created by `dvc import-url` preserves +the connection to the data source. We call this an _import stage_. Note that import stages are considered always locked, meaning that if you run `dvc repro`, they won't be updated. Use `dvc update` on them to bring the import @@ -188,8 +188,8 @@ The `etag` field in the DVC-file contains the If the remote file changes, its ETag will be different. This metadata allows DVC to determine whether its necessary to download it again. -> See [DVC-File Format](/doc/user-guide/dvc-metafile-formats) for more details -> on the text format above. +> See [DVC-File Format](/doc/user-guide/dvc-files-and-directories) for more +> details on the text format above. You may want to get out of and remove the `example-get-started/` directory after trying this example (especially if trying out the following one). diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index b551df2142..655af54b67 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -2,9 +2,9 @@ Download a file or directory tracked by DVC or by Git into the workspace. It also creates a -[DVC-file](/doc/user-guide/dvc-metafile-formats) with information about the data -source, which can later be used to [update](/doc/command-reference/update) the -import. +[DVC-file](/doc/user-guide/dvc-files-and-directories) with information about the +data source, which can later be used to [update](/doc/command-reference/update) +the import. > See also our `dvc.api.open()` Python API function. @@ -44,7 +44,7 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. 
Note that DVC-tracked targets should be found in a -[DVC-file](/doc/user-guide/dvc-metafile-formats) of the project. +[DVC-file](/doc/user-guide/dvc-files-and-directories) of the project. ⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this @@ -112,7 +112,7 @@ Importing 'data/data.xml (git@github.com:iterative/example-get-started)' In contrast with `dvc get`, this command doesn't just download the data file, but it also creates an import stage -([DVC-file](/doc/user-guide/dvc-metafile-formats)) with a link to the data +([DVC-file](/doc/user-guide/dvc-files-and-directories)) with a link to the data source (as explained in the description above). (This import stage can later be used to [update](/doc/command-reference/update) the import.) Check `data.xml.dvc`: @@ -153,7 +153,7 @@ Importing ``` When using this option, the import stage -([DVC-file](/doc/user-guide/dvc-metafile-formats)) will also have a `rev` +([DVC-file](/doc/user-guide/dvc-files-and-directories)) will also have a `rev` subfield under `repo`: ```yaml @@ -185,9 +185,9 @@ If you take a look at our [dataset registry](https://github.com/iterative/dataset-registry) project, you'll see that it's organized into different directories such as `tutorial/ver` and `use-cases/`, and these contain -[DVC-files](/doc/user-guide/dvc-metafile-formats) that track different datasets. -Given this simple structure, its data files can be easily shared among several -other projects using `dvc get` and `dvc import`. For example: +[DVC-files](/doc/user-guide/dvc-files-and-directories) that track different +datasets. Given this simple structure, its data files can be easily shared among +several other projects using `dvc get` and `dvc import`. 
For example: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 3551d21691..8ff6e31740 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -24,9 +24,9 @@ advanced scenarios: At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories), that are -hidden from the user. This directory is automatically staged with `git add`, so -it can be easily committed with Git. +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), +that are hidden from the user. This directory is automatically staged with +`git add`, so it can be easily committed with Git. ### Initializing DVC in subdirectories @@ -56,7 +56,7 @@ sub-projects to mitigate the issues of initializing in the Git repository root: - Not enough isolation/granularity - commands like `dvc pull`, `dvc checkout`, and others analyze the whole repository to look for - [DVC-files](/doc/user-guide/dvc-metafile-formats) to download files and + [DVC-files](/doc/user-guide/dvc-files-and-directories) to download files and directories, to reproduce pipelines, etc. It can be expensive in the large repositories with a lot of projects. @@ -126,7 +126,7 @@ include: - SCM other than Git is being used. Even though there are DVC features that require DVC to be run in the Git repo, DVC can work well with other version control systems. Since DVC relies on simple text - [DVC-files](/doc/user-guide/dvc-metafile-formats) to manage + [DVC-files](/doc/user-guide/dvc-files-and-directories) to manage pipelines, data, etc, they can be added into any SCM thus providing large data files and directories versioning. 
diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index d8631d5695..459ed7bbfb 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -22,10 +22,11 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present). Namely: **Checkout**: For any commit hash, branch or tag, `git checkout` retrieves the -[DVC-files](/doc/user-guide/dvc-metafile-formats) corresponding to that version. -The project's DVC-files in turn refer to data stored in cache, but -not necessarily in the workspace. Normally, it would be necessary -to use `dvc checkout` to synchronize workspace and DVC-files. +[DVC-files](/doc/user-guide/dvc-files-and-directories) corresponding to that +version. The project's DVC-files in turn refer to data stored in +cache, but not necessarily in the workspace. Normally, +it would be necessary to use `dvc checkout` to synchronize workspace and +DVC-files. This hook automates `dvc checkout` after `git checkout`. @@ -153,7 +154,7 @@ $ dvc pull --all-branches --all-tags ## Example: Checkout both Git and DVC Switching from one Git commit to another (with `git checkout`) may change the -set of [DVC-files](/doc/user-guide/dvc-metafile-formats) in the +set of [DVC-files](/doc/user-guide/dvc-files-and-directories) in the workspace. This would mean that the currently present data files and directories no longer matches project's version (which can be fixed with `dvc checkout`). @@ -206,9 +207,9 @@ project's cache and the data files currently in the workspace. Git changed the DVC-files in the workspace, which changed references to data files. `dvc status` first informed us that the data files in the workspace no longer matched the hash values in the corresponding -[DVC-files](/doc/user-guide/dvc-metafile-formats). Running `dvc checkout` then -brings them up to date, and a second `dvc status` tells us that the data files -now do match the DVC-files. 
+[DVC-files](/doc/user-guide/dvc-files-and-directories). Running `dvc checkout` +then brings them up to date, and a second `dvc status` tells us that the data +files now do match the DVC-files. ```dvc $ git checkout master diff --git a/content/docs/command-reference/lock.md b/content/docs/command-reference/lock.md index 89c18d99fb..902cef58d8 100644 --- a/content/docs/command-reference/lock.md +++ b/content/docs/command-reference/lock.md @@ -1,6 +1,6 @@ # lock -Lock a [DVC-file](/doc/user-guide/dvc-metafile-formats) +Lock a [DVC-file](/doc/user-guide/dvc-files-and-directories) ([stage](/doc/command-reference/run)). Use `dvc unlock` to unlock the file. ## Synopsis diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index 7366c03cd0..3e3b823d37 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -32,7 +32,7 @@ difference (delta) from the previous value of metrics (with 3-digit accuracy). They're calculated between two commits (hash, branch, tag, or any [Git revision](https://git-scm.com/docs/revisions)) for all metrics in the project, found by examining all of the -[DVC-files](/doc/user-guide/dvc-metafile-formats) in both references. +[DVC-files](/doc/user-guide/dvc-files-and-directories) in both references. Another way to display metrics is the `dvc metrics show` command, which just lists all the current metrics without comparisons. diff --git a/content/docs/command-reference/metrics/show.md b/content/docs/command-reference/metrics/show.md index 7abdd6d76d..d08edf6007 100644 --- a/content/docs/command-reference/metrics/show.md +++ b/content/docs/command-reference/metrics/show.md @@ -15,7 +15,7 @@ positional arguments: ## Description Finds and prints all metrics in the project by examining all of its -[DVC-files](/doc/user-guide/dvc-metafile-formats). +[DVC-files](/doc/user-guide/dvc-files-and-directories). 
> This kind of metrics can be defined with the `-m` (`--metrics`) and `-M` > (`--metrics-no-cache`) options of `dvc run`. diff --git a/content/docs/command-reference/move.md b/content/docs/command-reference/move.md index 0eec010c5d..63437ad4e5 100644 --- a/content/docs/command-reference/move.md +++ b/content/docs/command-reference/move.md @@ -1,8 +1,8 @@ # move Rename a file or a directory and modify the corresponding -[DVC-file](/doc/user-guide/dvc-metafile-formats) (see `dvc add`) to reflect the -change. If the file or directory has the same name as the corresponding +[DVC-file](/doc/user-guide/dvc-files-and-directories) (see `dvc add`) to reflect +the change. If the file or directory has the same name as the corresponding DVC-file, it also renames it. ## Synopsis @@ -19,9 +19,10 @@ positional arguments: `dvc move` is useful when a `src` file or directory has previously been added to the project with `dvc add`, creating a -[DVC-file](/doc/user-guide/dvc-metafile-formats) (with `src` as a dependency). -`dvc move` behaves like `mv src dst`, moving `src` to the given `dst` path, but -it also renames and updates the corresponding DVC-file appropriately. +[DVC-file](/doc/user-guide/dvc-files-and-directories) (with `src` as a +dependency). `dvc move` behaves like `mv src dst`, moving `src` to the given +`dst` path, but it also renames and updates the corresponding DVC-file +appropriately. > Note that `src` may be a copy or a > [link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) @@ -107,7 +108,7 @@ $ tree We use `dvc add` to track a file with DVC, then we use `dvc move` to change its location. If target path already exists and is a directory, data file is moved with unchanged name into this folder. Note that the `data.csv.dvc` -[DVC-file](/doc/user-guide/dvc-metafile-formats) is also moved. +[DVC-file](/doc/user-guide/dvc-files-and-directories) is also moved. 
```dvc $ tree diff --git a/content/docs/command-reference/params/index.md b/content/docs/command-reference/params/index.md index 2b510a8d13..5474172651 100644 --- a/content/docs/command-reference/params/index.md +++ b/content/docs/command-reference/params/index.md @@ -54,8 +54,8 @@ written, or generated, and these can be versioned directly with Git. You can then use `dvc run` with the `-p` (`--params`) option to specify parameter dependencies for your pipeline's stages (instead of or in addition to regular `-d` deps.) DVC saves the param names and values in the stage file (see -[DVC-file format](/doc/user-guide/dvc-metafile-formats)). These values will be -compared to the ones in the params files to determine if the stage is +[DVC-file format](/doc/user-guide/dvc-files-and-directories)). These values will +be compared to the ones in the params files to determine if the stage is invalidated upon pipeline [reproduction](/doc/command-reference/repro). `dvc params diff` is available to show changes in parameters, displaying the @@ -109,9 +109,9 @@ $ dvc run -d users.csv -o model.pkl \ ``` You can find that each parameter and it's value were saved in the -[DVC-file](/doc/user-guide/dvc-metafile-formats). These values will be compared -to the ones in the parameters files whenever `dvc repro` is used, to determine -if dependency to the params file is invalidated: +[DVC-file](/doc/user-guide/dvc-files-and-directories). 
These values will be +compared to the ones in the parameters files whenever `dvc repro` is used, to +determine if dependency to the params file is invalidated: ```yaml md5: 05d178cfa0d1474b6c5800aa1e1b34ac diff --git a/content/docs/command-reference/pipeline/show.md b/content/docs/command-reference/pipeline/show.md index 336c5d9ef4..6064df0bb5 100644 --- a/content/docs/command-reference/pipeline/show.md +++ b/content/docs/command-reference/pipeline/show.md @@ -2,7 +2,7 @@ Show [stages](/doc/command-reference/run) in a pipeline that lead to the specified stage. By default it lists -[DVC-files](/doc/user-guide/dvc-metafile-formats). +[DVC-files](/doc/user-guide/dvc-files-and-directories). ## Synopsis diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index 4d11fe3e7d..7bd01c2704 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -40,7 +40,7 @@ resulting plot shows all of them in a single output. This command can work with metric files that are committed to a repository history, data files controlled by DVC, or any other file in the workspace. In the case of DVC-tracked `datafile`, the `revisions` are used to find the -corresponding [DVC-files](/doc/user-guide/dvc-metafile-formats). +corresponding [DVC-files](/doc/user-guide/dvc-files-and-directories). ## Options diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 88ae2bbfee..a9663a3bde 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -3,7 +3,7 @@ Download tracked files or directories from [remote storage](/doc/command-reference/remote) to the cache and workspace, based on the current -[DVC-files](/doc/user-guide/dvc-metafile-formats). +[DVC-files](/doc/user-guide/dvc-files-and-directories). ## Synopsis @@ -37,7 +37,7 @@ remote. 
With no arguments, just `dvc pull` or `dvc pull --remote `, it downloads only the files (or directories) missing from the workspace by searching all -[DVC-files](/doc/user-guide/dvc-metafile-formats) currently in the +[DVC-files](/doc/user-guide/dvc-files-and-directories) currently in the project. It will not download files associated with earlier commits in the repository (if using Git), nor will it download files that have not changed. diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index e82ee547f0..d390aaddd0 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -38,7 +38,7 @@ save any changes in the code or DVC-files (those should be saved by using Under the hood a few actions are taken: - The push command by default uses all - [DVC-files](/doc/user-guide/dvc-metafile-formats) in the + [DVC-files](/doc/user-guide/dvc-files-and-directories) in the workspace. The command options listed below will either limit or expand the set of DVC-files to consult. diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index 0dd9289684..24e25216f0 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -63,7 +63,7 @@ The following config options are available for all remote types: sure that these haven't been modified, or corrupted during download. It may slow down the aforementioned commands. The calculated hash is compared to the value saved in the corresponding - [DVC-file](/doc/user-guide/dvc-metafile-formats). + [DVC-file](/doc/user-guide/dvc-files-and-directories). > Note that this option is enabled on **Google Drive** remotes by default. 
diff --git a/content/docs/command-reference/remove.md b/content/docs/command-reference/remove.md index 66cb4448e7..6295c514e3 100644 --- a/content/docs/command-reference/remove.md +++ b/content/docs/command-reference/remove.md @@ -15,8 +15,8 @@ positional arguments: This command safely removes data files or directories that are tracked by DVC from the workspace. It takes a -[DVC-File](/doc/user-guide/dvc-metafile-formats) as input, removes all of its -outputs (`outs`), and optionally removes the DVC-file itself. +[DVC-File](/doc/user-guide/dvc-files-and-directories) as input, removes all of +its outputs (`outs`), and optionally removes the DVC-file itself. Note that it does not remove files from the DVC cache or remote storage (see `dvc gc`). However, remember to run `dvc push` to save the files you actually diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 2326963316..6a782744ef 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -35,8 +35,8 @@ There's a few ways to restrict the stages that will be regenerated by this command: by specifying stage file `targets`, or by using the `--single-item`, `--cwd`, or other options. -If specific [DVC-files](/doc/user-guide/dvc-metafile-formats) (`targets`) are -omitted, `Dvcfile` will be assumed. +If specific [DVC-files](/doc/user-guide/dvc-files-and-directories) (`targets`) +are omitted, `Dvcfile` will be assumed. `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, intermediate or final results. @@ -274,8 +274,8 @@ Data and pipelines are up to date. ``` The reason being that the `text.txt` file is a dependency in the target -[DVC-file](/doc/user-guide/dvc-metafile-formats) (`Dvcfile` by default). This -`Dvcfile` stage is dependent on `filter.dvc`, which happens first in this +[DVC-file](/doc/user-guide/dvc-files-and-directories) (`Dvcfile` by default). 
+This `Dvcfile` stage is dependent on `filter.dvc`, which happens first in this pipeline (shown in the following figure): ```dvc diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 492fec6c58..e527b4fa81 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -1,7 +1,7 @@ # run -Generate a stage file ([DVC-file](/doc/user-guide/dvc-metafile-formats)) from a -given command and execute the command. +Generate a stage file ([DVC-file](/doc/user-guide/dvc-files-and-directories)) +from a given command and execute the command. ## Synopsis @@ -212,8 +212,8 @@ To track the changes with git, run: git add .gitignore metric.dvc ``` -> See [DVC-File Format](/doc/user-guide/dvc-metafile-formats) for more details -> on the text format above. +> See [DVC-File Format](/doc/user-guide/dvc-files-and-directories) for more +> details on the text format above. Execute a Python script as a DVC [pipeline](/doc/command-reference/pipeline) stage. The stage file name is not specified, so a `model.p.dvc` DVC-file is diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 54b6f8fd78..c06e840dbd 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -34,9 +34,9 @@ options: | remote | `--cloud` | Comparisons are made between the cache, and the default remote, typically defined with `dvc remote --default`. | DVC determines which data and code files to compare by analyzing all -[DVC-files](/doc/user-guide/dvc-metafile-formats) in the workspace -(the `--all-branches` and `--all-tags` options compare multiple workspace -versions). +[DVC-files](/doc/user-guide/dvc-files-and-directories) in the +workspace (the `--all-branches` and `--all-tags` options compare +multiple workspace versions). The comparison can be limited to certain DVC-files only, by listing them as `targets`. (Changes are reported only against these.) 
When this is combined with diff --git a/content/docs/command-reference/unlock.md b/content/docs/command-reference/unlock.md index fc6fc46a2c..0fa5061bd9 100644 --- a/content/docs/command-reference/unlock.md +++ b/content/docs/command-reference/unlock.md @@ -1,6 +1,6 @@ # unlock -Unlock [DVC-file](/doc/user-guide/dvc-metafile-formats) +Unlock [DVC-file](/doc/user-guide/dvc-files-and-directories) ([stage](/doc/command-reference/run)). See `dvc lock` for more information. ## Synopsis diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index bced952031..047b997bbe 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -2,7 +2,7 @@ Update data artifacts imported from external DVC projects, and corresponding -[DVC-files](/doc/user-guide/dvc-metafile-formats). +[DVC-files](/doc/user-guide/dvc-files-and-directories). ## Synopsis @@ -17,7 +17,7 @@ positional arguments: ## Description After creating import stages -([DVC-files](/doc/user-guide/dvc-metafile-formats)) with `dvc import` or +([DVC-files](/doc/user-guide/dvc-files-and-directories)) with `dvc import` or `dvc import-url`, the data source can change. Use `dvc update` to bring these imported file, directory, or data artifact up to date. @@ -84,8 +84,8 @@ This time nothing has changed, since the source project is rather stable. > Note that `dvc update` updates the `rev_lock` field of the corresponding -> [DVC-file](/doc/user-guide/dvc-metafile-formats) (when there are changes to -> bring in). +> [DVC-file](/doc/user-guide/dvc-files-and-directories) (when there are changes +> to bring in). 
## Example: Updating fixed revisions to a different version diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 395acce892..705cb16b52 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -100,13 +100,9 @@ "source": "user-guide/index.md", "children": [ { - "label": "Files and Directories", + "label": "DVC Files and Directories", "slug": "dvc-files-and-directories" }, - { - "label": "DVC Metafile Formats", - "slug": "dvc-metafile-formats" - }, { "slug": "dvcignore", "tutorials": { diff --git a/content/docs/tutorials/deep/define-ml-pipeline.md b/content/docs/tutorials/deep/define-ml-pipeline.md index ac15ff5774..260d57839a 100644 --- a/content/docs/tutorials/deep/define-ml-pipeline.md +++ b/content/docs/tutorials/deep/define-ml-pipeline.md @@ -51,7 +51,7 @@ or move it, you can use `dvc move`. ## Data file internals -If you take a look at the [DVC-file](/doc/user-guide/dvc-metafile-formats) +If you take a look at the [DVC-file](/doc/user-guide/dvc-files-and-directories) created by `dvc add`, you will see that outputs are tracked in the `outs` field. In this file, only one output is specified. The output contains the data file path in the repository and its MD5 hash. This hash value @@ -139,7 +139,7 @@ files written to by the command, if any. - `-o out.dat` (lower case o) specifies an output data file. DVC will track this data file by creating a corresponding - [DVC-file](/doc/user-guide/dvc-metafile-formats) (as if running + [DVC-file](/doc/user-guide/dvc-files-and-directories) (as if running `dvc add out.dat` after `dvc run` instead). - `-O tmp.dat` (upper case O) specifies a simple output file (not to be added to @@ -186,9 +186,9 @@ command and does some additional work if the command was successful: 2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage file in the project with information about this pipeline stage. - (See [DVC-File Format](/doc/user-guide/dvc-metafile-formats)). 
Note that the - name of this file could be specified by using the `-f` option, for example - `-f extract.dvc`. + (See [DVC-File Format](/doc/user-guide/dvc-files-and-directories)). Note that + the name of this file could be specified by using the `-f` option, for + example `-f extract.dvc`. Let's take a look at the resulting stage file created by `dvc run` above: diff --git a/content/docs/tutorials/deep/preparation.md b/content/docs/tutorials/deep/preparation.md index 1983c9099b..4a83476842 100644 --- a/content/docs/tutorials/deep/preparation.md +++ b/content/docs/tutorials/deep/preparation.md @@ -67,9 +67,9 @@ with: At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories), that are -hidden from the user. This directory is automatically staged with `git add`, so -it can be easily committed with Git: +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), +that are hidden from the user. This directory is automatically staged with +`git add`, so it can be easily committed with Git: ```dvc $ dvc init @@ -92,4 +92,4 @@ explained in more detail in the next chapter.) Note that it won't be tracked by Git — It's a local-only directory, and you cannot push it to a Git remote. For more information refer to -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories). +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). diff --git a/content/docs/tutorials/deep/reproducibility.md b/content/docs/tutorials/deep/reproducibility.md index ebca57b35d..a7791935a7 100644 --- a/content/docs/tutorials/deep/reproducibility.md +++ b/content/docs/tutorials/deep/reproducibility.md @@ -19,8 +19,8 @@ automation tools ([Make](https://www.gnu.org/software/make/), Maven, Ant, Rakefile etc). 
It was designed in such a way to localize specification of the graph nodes (pipeline [stages](/doc/command-reference/run)). -If you run `repro` on any [DVC-file](/doc/user-guide/dvc-metafile-formats) from -our repository, nothing happens because nothing was changed in the pipeline +If you run `repro` on any [DVC-file](/doc/user-guide/dvc-files-and-directories) +from our repository, nothing happens because nothing was changed in the pipeline defined in the project: There's nothing to reproduce. ```dvc diff --git a/content/docs/tutorials/deep/sharing-data.md b/content/docs/tutorials/deep/sharing-data.md index 25cfd5bd88..054e88085e 100644 --- a/content/docs/tutorials/deep/sharing-data.md +++ b/content/docs/tutorials/deep/sharing-data.md @@ -3,7 +3,7 @@ ## Pushing data to the cloud We've gone over how source code and -[DVC-files](/doc/user-guide/dvc-metafile-formats) can be shared using a Git +[DVC-files](/doc/user-guide/dvc-files-and-directories) can be shared using a Git repository. These DVC repositories will contain all the information needed for reproducibility, so it might be a good idea to share them with your team using Git hosting services (such as [GitHub](https://github.com/)). @@ -15,8 +15,8 @@ DVC is able to push the cache to cloud storage. First, you need to setup remote storage for the project, that will be stored in the project's -[config file](https://dvc.org/doc/user-guide/dvc-files-and-directories). This -can be done using the CLI as shown below. +[config file](https://dvc.org/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). +This can be done using the CLI as shown below. 
> Note that we are using the `dvc-public` S3 bucket as an example and you don't > have write access to it, so in order to follow the tutorial you will need to diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index 02ecd54a5a..7cca660c0f 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -69,7 +69,7 @@ $ dvc run -f prepare.dvc \ ``` The `prepare.dvc` _stage file_ is generated. It has the same -[format](/doc/user-guide/dvc-metafile-formats) as the DVC-file we created +[format](/doc/user-guide/dvc-files-and-directories) as the DVC-file we created previously to [tack data](/doc/tutorials/get-started/data-versioning#changes), but it additionally includes information about the command we ran (`python src/prepare.py`), the dependencies, and diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md index a0c1180b82..ac8aa2b3ba 100644 --- a/content/docs/tutorials/get-started/data-versioning.md +++ b/content/docs/tutorials/get-started/data-versioning.md @@ -28,8 +28,8 @@ $ dvc add data/data.xml DVC stores information about the added file in a special _DVC-file_ named `data/data.xml.dvc`, a small text file with a human-readable -[format](/doc/user-guide/dvc-metafile-formats). This metafile can committed with -Git instead, as a placeholder for the original data (which is added to +[format](/doc/user-guide/dvc-files-and-directories). This metafile can committed +with Git instead, as a placeholder for the original data (which is added to `.gitignore`): ```dvc @@ -111,7 +111,7 @@ $ dvc add data/data.xml ``` DVC updates the `data/data.xml.dvc` -[DVC-file](/doc/user-guide/dvc-metafile-formats) to match the updated data. +[DVC-file](/doc/user-guide/dvc-files-and-directories) to match the updated data. Let's commit this new version with Git:
@@ -202,7 +202,7 @@ $ dvc push ``` > Usually, we also want to `git commit` and `git push` the corresponding -> [DVC-files](/doc/user-guide/dvc-metafile-formats). +> [DVC-files](/doc/user-guide/dvc-files-and-directories). Pushing data or models ensures they're safely backed up remotely. This also means they can be retrieved from other environments. @@ -332,9 +332,9 @@ The `url` and `rev_lock` subfields under `repo` are used to save the origin and
Additionally, the `data/data.xml` -[DVC-file](/doc/user-guide/dvc-metafile-formats) now includes metadata to track -changes in the source data. This allows you to bring in changes from the data -source later, using `dvc update`. +[DVC-file](/doc/user-guide/dvc-files-and-directories) now includes metadata to +track changes in the source data. This allows you to bring in changes from the +data source later, using `dvc update`. ### Python API diff --git a/content/docs/tutorials/get-started/index.md b/content/docs/tutorials/get-started/index.md index 7da757e1ab..e5a0eee2fe 100644 --- a/content/docs/tutorials/get-started/index.md +++ b/content/docs/tutorials/get-started/index.md @@ -39,8 +39,9 @@ $ git commit -m "Initialize DVC repository" ``` At DVC initialization, a new `.dvc/` directory is created for internal -[files and directories 📖](/doc/user-guide/dvc-files-and-directories). This -directory is automatically staged with Git, so it can be committed right away. +[files and directories 📖](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). +This directory is automatically staged with Git, so it can be committed right +away. ## What's ahead? diff --git a/content/docs/tutorials/pipelines.md b/content/docs/tutorials/pipelines.md index 302b084887..2789fa9167 100644 --- a/content/docs/tutorials/pipelines.md +++ b/content/docs/tutorials/pipelines.md @@ -97,7 +97,7 @@ $ dvc add data/Posts.xml.zip ``` When we run `dvc add` `Posts.xml.zip`, DVC creates a -[DVC-file](/doc/user-guide/dvc-metafile-formats). +[DVC-file](/doc/user-guide/dvc-files-and-directories).
@@ -105,9 +105,9 @@ When we run `dvc add` `Posts.xml.zip`, DVC creates a At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories) that are -hidden from the user. This directory is automatically staged with `git add`, so -it can be easily committed with Git. +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +that are hidden from the user. This directory is automatically staged with +`git add`, so it can be easily committed with Git. Note that the DVC-file created by `dvc add` has no dependencies, a.k.a. an _orphan_ [stage file](/doc/command-reference/run): diff --git a/content/docs/tutorials/versioning.md b/content/docs/tutorials/versioning.md index dc6419dad4..c7ee842bda 100644 --- a/content/docs/tutorials/versioning.md +++ b/content/docs/tutorials/versioning.md @@ -65,8 +65,8 @@ The repository you cloned is already DVC-initialized. It already contains a `.dvc/` directory with the `config` and `.gitignore` files. These and other files and directories are hidden from user, as typically there's no need to interact with them directly. See -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn -more. +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +to learn more.
@@ -132,8 +132,8 @@ the cache (while keeping a [file link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) to it in the workspace, so you can continue working the same way as before). This is achieved by creating a simple human-readable -[DVC-file](/doc/user-guide/dvc-metafile-formats) that serves as a pointer to the -cache. +[DVC-file](/doc/user-guide/dvc-files-and-directories) that serves as a pointer +to the cache. Next, we train our first model with `train.py`. Because of the small dataset, this training process should be small enough to run on most computers in a @@ -168,8 +168,8 @@ As we mentioned briefly, DVC does not commit the `data/` directory and then `git commit` DVC-files that contain file hashes that point to cached data. In this case we created `data.dvc` and `model.h5.dvc`. Refer to -[DVC-File Format](/doc/user-guide/dvc-metafile-formats) to learn more about how -these files work. +[DVC-File Format](/doc/user-guide/dvc-files-and-directories) to learn more about +how these files work. @@ -283,9 +283,9 @@ the `v2.0` tag. As we have learned already, DVC keeps data files out of Git (by adjusting `.gitignore`) and puts them into the cache (usually it's a `.dvc/cache` directory inside the repository). Instead, DVC creates -[DVC-files](/doc/user-guide/dvc-metafile-formats). These text files serve as -data placeholders that point to the cached files, and they can be easily version -controlled with Git. +[DVC-files](/doc/user-guide/dvc-files-and-directories). These text files serve +as data placeholders that point to the cached files, and they can be easily +version controlled with Git. When we run `git checkout` we restore pointers (DVC-files) first. 
Then, when we run `dvc checkout`, we use these pointers to put the right data in the right @@ -325,7 +325,7 @@ $ dvc run -f Dvcfile \ ``` Similar to `dvc add`, `dvc run` creates a -[DVC-file](/doc/user-guide/dvc-metafile-formats) named `Dvcfile` (specified +[DVC-file](/doc/user-guide/dvc-files-and-directories) named `Dvcfile` (specified using the `-f` option). It tracks all outputs (`-o`) the same way as `dvc add` does. Unlike `dvc add`, `dvc run` also tracks dependencies (`-d`) and the command (`python train.py`) that was run to produce the result. We call such a diff --git a/content/docs/understanding-dvc/how-it-works.md b/content/docs/understanding-dvc/how-it-works.md index f3efd7608c..576d1a9473 100644 --- a/content/docs/understanding-dvc/how-it-works.md +++ b/content/docs/understanding-dvc/how-it-works.md @@ -7,7 +7,8 @@ $ dvc init ``` - > See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) + > See + > [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). - DVC helps define command pipelines, and keeps each command [stage](/doc/command-reference/run) and dependencies in a Git repository: @@ -44,7 +45,7 @@ - DVC introduces the concept of data files for Git repositories. DVC keeps data files outside of the repository, replacing them with special - [DVC-files](/doc/user-guide/dvc-metafile-formats) in the Git repo: + [DVC-files](/doc/user-guide/dvc-files-and-directories) in the Git repo: ```dvc $ git checkout a03_normbatch_vgg16 # checkout code and DVC-files diff --git a/content/docs/understanding-dvc/related-technologies.md b/content/docs/understanding-dvc/related-technologies.md index 9fa14ed929..30108db6d1 100644 --- a/content/docs/understanding-dvc/related-technologies.md +++ b/content/docs/understanding-dvc/related-technologies.md @@ -38,7 +38,7 @@ Luigi, etc. result, but we expect some GUI services will be created on top of DVC. - DVC has transparent design. 
Its - [internal files and directories](/doc/user-guide/dvc-files-and-directories) + [internal files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) (including the cache directory) have a human-readable format and can be easily reused by external tools. @@ -60,7 +60,7 @@ Luigi, etc. (DAG): - The DAG or dependency graph is defined implicitly by the connections between - [DVC-files](/doc/user-guide/dvc-metafile-formats) (with file names + [DVC-files](/doc/user-guide/dvc-files-and-directories) (with file names `.dvc` or `Dvcfile`), based on their dependencies and outputs. @@ -100,7 +100,7 @@ Luigi, etc. Git-annex repository is cloned via `git clone`, data files won't be copied to the local machine, as file contents are stored in separate [remotes](/doc/command-reference/remote). With DVC, - [DVC-files](/doc/user-guide/dvc-metafile-formats), which provide the + [DVC-files](/doc/user-guide/dvc-files-and-directories), which provide the reproducible workflow, are always included in the Git repository. Hence, they can be executed locally with minimal effort. diff --git a/content/docs/understanding-dvc/what-is-dvc.md b/content/docs/understanding-dvc/what-is-dvc.md index 44bc274c21..24a32662e9 100644 --- a/content/docs/understanding-dvc/what-is-dvc.md +++ b/content/docs/understanding-dvc/what-is-dvc.md @@ -45,8 +45,8 @@ DVC uses a few core concepts: - **Data files**: Cached files (for large files). Data files are stored outside of the Git repository on a local/shared hard drive or remote storage, but - [DVC-files](/doc/user-guide/dvc-metafile-formats) describing that data are - stored in Git for DVC needs (to maintain pipelines and reproducibility). + [DVC-files](/doc/user-guide/dvc-files-and-directories) describing that data + are stored in Git for DVC needs (to maintain pipelines and reproducibility). - **Cache directory**: Directory with all data files on a local hard drive or in cloud storage, but not in the Git repository. 
See `dvc cache dir`. diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md index b3377df8b8..fc19733012 100644 --- a/content/docs/use-cases/data-registries.md +++ b/content/docs/use-cases/data-registries.md @@ -37,8 +37,8 @@ Advantages of using a DVC **data registry**: requirements. - Security: Registries can be setup to have read-only remote storage (e.g. an HTTP location). Git versioning of - [DVC-files](/doc/user-guide/dvc-metafile-formats) allows us to track and audit - data changes. + [DVC-files](/doc/user-guide/dvc-files-and-directories) allows us to track and + audit data changes. - Data as code: Leverage Git workflow such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle. Think Git for cloud storage, but without ad-hoc conventions. @@ -66,9 +66,9 @@ $ dvc add music/songs > [MillionSongSubset](http://millionsongdataset.com/pages/getting-dataset/#subset). A regular Git workflow can be followed with the tiny -[DVC-files](/doc/user-guide/dvc-metafile-formats) that substitute the actual -data (`music/songs.dvc` in this example). This enables team collaboration on -data at the same level as with source code (commit history, branching, pull +[DVC-files](/doc/user-guide/dvc-files-and-directories) that substitute the +actual data (`music/songs.dvc` in this example). This enables team collaboration +on data at the same level as with source code (commit history, branching, pull requests, reviews, etc.): ```dvc @@ -148,8 +148,8 @@ $ dvc import https://github.com/example/registry \ Besides downloading, importing saves the dependency from the local project to the data source (registry repo). This is achieved by creating a particular kind -of [DVC-file](/doc/user-guide/dvc-metafile-formats) (a.k.a. _import stage_). -This file can be used staged and committed with Git. +of [DVC-file](/doc/user-guide/dvc-files-and-directories) (a.k.a. _import +stage_). 
This file can be used staged and committed with Git. As an addition to the import workflow, and enabled the saved dependency, we can easily bring it up to date in our consumer project(s) with `dvc update` whenever diff --git a/content/docs/use-cases/sharing-data-and-model-files.md b/content/docs/use-cases/sharing-data-and-model-files.md index 025169766c..7b46bd91af 100644 --- a/content/docs/use-cases/sharing-data-and-model-files.md +++ b/content/docs/use-cases/sharing-data-and-model-files.md @@ -67,7 +67,7 @@ with the `dvc push` command: $ dvc push ``` -Code and [DVC-files](/doc/user-guide/dvc-metafile-formats) can be safely +Code and [DVC-files](/doc/user-guide/dvc-files-and-directories) can be safely committed and pushed with Git. ## Download code diff --git a/content/docs/use-cases/versioning-data-and-model-files.md b/content/docs/use-cases/versioning-data-and-model-files.md index f913021492..6f4bd7449e 100644 --- a/content/docs/use-cases/versioning-data-and-model-files.md +++ b/content/docs/use-cases/versioning-data-and-model-files.md @@ -8,10 +8,10 @@ DVC allows versioning data files and directories, intermediate results, and ML models using Git, but without storing the file contents in the Git repository. It's useful when dealing with files that are too large for Git to handle properly in general. DVC saves information about your data in special -[DVC-files](/doc/user-guide/dvc-metafile-formats), and these metafiles can be -used for versioning. To actually store the data, DVC supports various types of -[remote storage](/doc/command-reference/remote). This allows easily saving and -sharing data alongside code. +[DVC-files](/doc/user-guide/dvc-files-and-directories), and these metafiles can +be used for versioning. To actually store the data, DVC supports various types +of [remote storage](/doc/command-reference/remote). This allows easily saving +and sharing data alongside code. 
![](/img/model-versioning-diagram.png) diff --git a/content/docs/user-guide/basic-concepts/dvc-project.md b/content/docs/user-guide/basic-concepts/dvc-project.md index 64a7d7fc30..a862a9f222 100644 --- a/content/docs/user-guide/basic-concepts/dvc-project.md +++ b/content/docs/user-guide/basic-concepts/dvc-project.md @@ -15,6 +15,6 @@ match: Initialized by running `dvc init` in the **workspace** (typically in a Git repository). It will contain the -[`.dvc/` directory](/doc/user-guide/dvc-files-and-directories) and -[DVC metafiles](/doc/user-guide/dvc-metafile-formats) created with commands such -as `dvc add` or `dvc run`. +[`.dvc/` directory](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +and [DVC metafiles](/doc/user-guide/dvc-files-and-directories) created with +commands such as `dvc add` or `dvc run`. diff --git a/content/docs/user-guide/contributing/docs.md b/content/docs/user-guide/contributing/docs.md index 4a613cac22..34cae8b53d 100644 --- a/content/docs/user-guide/contributing/docs.md +++ b/content/docs/user-guide/contributing/docs.md @@ -169,8 +169,8 @@ is installed when `yarn` runs (see [dev env](#development-environment)). `dvc`, `yaml`, or `diff` custom languages. `usage` is employed to show the `dvc --help` output for each command reference. `dvc` can be used to show examples of commands and their output in a terminal session. `yaml` is used to - show [DVC-file](/doc/user-guide/dvc-metafile-formats) contents or other YAML - data. `diff` is used mainly for examples of `git diff` output. + show [DVC-file](/doc/user-guide/dvc-files-and-directories) contents or other + YAML data. `diff` is used mainly for examples of `git diff` output. 
> Check out the `.md` source code of any command reference to get a better idea, > for example in diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index c2c08bc8a7..def5781957 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -1,8 +1,138 @@ # DVC Files and Directories Once initialized in a project, DVC populates its installation -directory (`.dvc/`) with the internal files and directories needed for DVC -operation: +directory (`.dvc/`) with the +[internal directories and files](#internal-directories-and-files) needed for DVC +operation. + +Additionally, there are two special metafiles created by certain +[DVC commands](/doc/command-reference): + +- Files ending with the `.dvc` extension are basic placeholders to version data + files and directories. A DVC project can have multiple + [`.dvc` files](#dvc-files). +- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages + that form the pipeline(s) of a project, and their connections (_dependency + graph_ or DAG). + +Both use human-friendly YAML schemas, described below. We encourage you to get +familiar with them so you may edit them freely, as needed. Both type of files +should be versioned with Git (for Git-enabled repositories). + +> See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable > +> the highlighting for your editor. + +## .dvc files + +When you add a file or directory to a DVC project with `dvc add` or +`dvc import`, a `.dvc` file is created based on the data file name (e.g. +`data.xml.dvc`). These files contain the basic information needed to track the +data with DVC. + +They use a simple YAML format that can be easily written or altered manually. +Here is a sample: + +```yaml +outs: + - md5: a304afb96060aad90176268345e10355 + path: data.xml +# Manual comments can be added in. 
+``` + +`.dvc` files contain a single top field: + +- `outs` - list of outputs for this `.dvc` file + +An output entry can consist of these fields: + +- `md5` - hash value for the output file +- `path` - path to the output in the workspace, relative to the + location of the `.dvc` file +- `cache` - (optional) whether or not DVC should cache the output. `true` by + default + +Note that comments can be added to DVC metafiles using the `# comment` syntax. + +> `.dvc` file comments are preserved among executions of the `dvc repro` and +> `dvc commit` commands, but not when a `.dvc` file is overwritten by +> `dvc add`,`dvc import`, or `dvc import-url`. + +## dvc.yaml file + +When you add commands to a pipeline with `dvc run`, the `dvc.yaml` file is +created or updated. Here's a simple example: + +```yaml +stages: + stageone: + cmd: python cmd.py input.data output.data metrics.json + deps: + - cmd.py + - input.data + outs: + - output.data + metrics: + - metrics.json + stagetwo: + cmd: python ... + ... +``` + +`dvc.yaml` files consists of a group of `stages` with names provided explicitly +by the user with the `--name` (`-n`) option of `dvc run`. Each stage can contain +the following fields: + +- `cmd` - executable command defined in this stage +- `deps` - list of dependencies for this stage +- `params` - (optional) list of the [parameter](/doc/command-reference/params) + names and their current values +- `outs` - list of outputs for this stage +- `metric` - (optional) list of [metric](/doc/command-reference/metrics) files +- `locked` - (optional) whether or not this stage is locked from reproduction +- `always_changed` (optional) - whether or not this stage is considered as + changed by commands such as `dvc status` and `dvc repro`. 
`false` by default + +A dependency entry consists of a these possible fields: + +- `path`: Path to the dependency, relative to the `wdir` path (always present) +- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) +- `etag`: Strong ETag response header (only HTTP external + dependencies created with `dvc import-url`) +- `repo`: This entry is only for external dependencies created with + `dvc import`, and can contains the following fields: + + - `url`: URL of Git repository with source DVC project + - `rev`: Only present when the `--rev` option of `dvc import` is used. + Specific commit hash, branch or tag name, etc. (a + [Git revision](https://git-scm.com/docs/revisions)) used to import the + dependency from. + - `rev_lock`: Git commit hash of the external DVC repository at + the time of importing or updating (with `dvc update`) the dependency. + + > See the examples in + > [External Dependencies](/doc/user-guide/external-dependencies) for more + > info. + +An output entry consists of these fields: + +- `md5` - hash value for the output file +- `path` - path to the output in the workspace, relative to the + location of the `.dvc` file +- `cache` - (optional) whether or not DVC should cache the output. `true` by + default + +Metrics entries can contain these fields: + +- `type`: Type of the metric file (`json`) +- `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` for + `{"AUC": {"value": 0.624321}}`) + +`dvc.yaml` files also support `# comments`. + +> `dvc.yaml` comments are preserved among executions of `dvc run`, `dvc repro`, +> and `dvc commit`. + +## Internal directories and files - `.dvc/config`: This is a configuration file. The config file can be edited by hand or with the `dvc config` command. @@ -13,16 +143,17 @@ operation: (credentials, private locations, etc). The local config file can be edited by hand or with the command `dvc config --local`. 
-- `.dvc/cache`: The [cache directory](#structure-of-cache-directory) will store - your data. The data files and directories in the workspace will - only contain links to the data files in the cache. (Refer to +- `.dvc/cache`: The cache directory will store your data in a + special [structure](#structure-of-cache-directory). The data files and + directories in the workspace will only contain links to the data + files in the cache. (Refer to [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization). See `dvc config cache` for related configuration options. > Note that DVC includes the cache directory in `.gitignore` during > initialization. No data tracked by DVC will ever be pushed to the Git - > repository, only [DVC-files](/doc/user-guide/dvc-metafile-formats) that are - > needed to download or reproduce them. + > repository, only [DVC-files](/doc/user-guide/dvc-files-and-directories) that + > are needed to download or reproduce them. - `.dvc/plots`: Directory for [Plot templates](/doc/command-reference/plots#plot-templates). @@ -84,8 +215,8 @@ $ dvc add data/images ``` When running `dvc add` on this directory of images, a `data/images.dvc` -[DVC-file](/doc/user-guide/dvc-metafile-formats) is created, containing the hash -value of the directory: +[DVC-file](/doc/user-guide/dvc-files-and-directories) is created, containing the +hash value of the directory: ```yaml md5: 77e511dafe2178d936e54331d5d6288f diff --git a/content/docs/user-guide/dvc-metafile-formats.md b/content/docs/user-guide/dvc-metafile-formats.md deleted file mode 100644 index ee80af94a9..0000000000 --- a/content/docs/user-guide/dvc-metafile-formats.md +++ /dev/null @@ -1,128 +0,0 @@ -# DVC Metafile Formats - -There are two special metafiles created by certain -[DVC commands](/doc/command-reference): - -- Files ending with the `.dvc` extension are basic placeholders to version data - files and directories. A DVC project can have multiple - [`.dvc` files](#dvc-files). 
-- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages - that form the pipeline(s) of a project, and their connections (_dependency - graph_ or DAG). - -Both use human-friendly YAML schemas, described below. We encourage you to get -familiar with them so you may edit them freely, as needed. Both type of files -should be versioned with Git (for Git-enabled repositories). - -> See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable the -> highlighting for your editor. - -## .dvc files - -When you add a file or directory to a DVC project with `dvc add` or -`dvc import`, a `.dvc` file is created based on the data file name (e.g. -`data.xml.dvc`). These files contain the basic information needed to track the -data with DVC. - -They use a simple YAML format that can be easily written or altered manually. -Here is a sample: - -```yaml -outs: - - md5: a304afb96060aad90176268345e10355 - path: data.xml -# Manual comments can be added in. -``` - -`.dvc` files contain a single top field: - -- `outs` - list of outputs for this `.dvc` file - -An output entry can consist of these fields: - -- `md5` - hash value for the output file -- `path` - path to the output in the workspace, relative to the - location of the `.dvc` file -- `cache` - (optional) whether or not DVC should cache the output. `true` by - default - -Note that comments can be added to DVC metafiles using the `# comment` syntax. - -> `.dvc` file comments are preserved among executions of the `dvc repro` and -> `dvc commit` commands, but not when a `.dvc` file is overwritten by -> `dvc add`,`dvc import`, or `dvc import-url`. - -## dvc.yaml file - -When you add commands to a pipeline with `dvc run`, the `dvc.yaml` file is -created or updated. Here's a simple example: - -```yaml -stages: - firstone: - cmd: python cmd.py input.data output.data metrics.json - deps: - - cmd.py - - input.data - outs: - - output.data - metrics: - - metrics.json - nextone: - cmd: python ... - ... 
-``` - -`dvc.yaml` files consists of a group of `stages` with names provided explicitly -by the user with the `--name` (`-n`) option of `dvc run`. Each stage can contain -the following fields: - -- `cmd` - executable command defined in this stage -- `deps` - list of dependencies for this stage -- `params` - (optional) list of the [parameter](/doc/command-reference/params) - names and their current values -- `outs` - list of outputs for this stage -- `metric` - (optional) list of [metric](/doc/command-reference/metrics) files -- `locked` - (optional) whether or not this stage is locked from reproduction -- `always_changed` (optional) - whether or not this stage is considered as - changed by commands such as `dvc status` and `dvc repro`. `false` by default - -A dependency entry consists of a these possible fields: - -- `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) -- `etag`: Strong ETag response header (only HTTP external - dependencies created with `dvc import-url`) -- `repo`: This entry is only for external dependencies created with - `dvc import`, and can contains the following fields: - - - `url`: URL of Git repository with source DVC project - - `rev`: Only present when the `--rev` option of `dvc import` is used. - Specific commit hash, branch or tag name, etc. (a - [Git revision](https://git-scm.com/docs/revisions)) used to import the - dependency from. - - `rev_lock`: Git commit hash of the external DVC repository at - the time of importing or updating (with `dvc update`) the dependency. - - > See the examples in - > [External Dependencies](/doc/user-guide/external-dependencies) for more - > info. - -An output entry consists of these fields: - -- `md5` - hash value for the output file -- `path` - path to the output in the workspace, relative to the - location of the `.dvc` file -- `cache` - (optional) whether or not DVC should cache the output. 
`true` by - default - -Metrics entries can contain these fields: - -- `type`: Type of the metric file (`json`) -- `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` for - `{"AUC": {"value": 0.624321}}`) - -`dvc.yaml` files also support `# comments`. - -> `dvc.yaml` comments are preserved among executions of `dvc run`, `dvc repro`, -> and `dvc commit`. diff --git a/content/docs/user-guide/large-dataset-optimization.md b/content/docs/user-guide/large-dataset-optimization.md index 4c64803ce6..a3ddec19a1 100644 --- a/content/docs/user-guide/large-dataset-optimization.md +++ b/content/docs/user-guide/large-dataset-optimization.md @@ -5,8 +5,8 @@ In order to track the data files and directories added with `dvc add` or project's cache is the hidden storage (by default located in `.dvc/cache`) for files that are tracked by DVC, and their different versions. (See `dvc cache` and -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more -details.) +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +for more details.) However, the versions of the tracked files that [match the current code](/doc/tutorials/get-started/data-pipelines) are also diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 63af957ca3..71da15531c 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -12,9 +12,9 @@ DVC to control data outside of the project directory. DVC can track files on an external storage with `dvc add` or specify external files as outputs for -[DVC-files](/doc/user-guide/dvc-metafile-formats) created by `dvc run` (stage -files). External outputs are considered part of the DVC project. DVC will track -changes in them and reflect this in the output of `dvc status`. +[DVC-files](/doc/user-guide/dvc-files-and-directories) created by `dvc run` +(stage files). 
External outputs are considered part of the DVC project. DVC will +track changes in them and reflect this in the output of `dvc status`. Currently, the following types (protocols) of external outputs (and cache) are supported: From bf44d1f40f0412802d2c04d3e82d261e28d40c4a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 2 Jun 2020 18:43:04 -0500 Subject: [PATCH 08/36] install: remove InteliJ plugin info per https://github.com/iterative/intellij-dvc/issues/7 --- content/docs/install/plugins.md | 16 ++-------------- .../docs/user-guide/dvc-files-and-directories.md | 3 --- 2 files changed, 2 insertions(+), 17 deletions(-) diff --git a/content/docs/install/plugins.md b/content/docs/install/plugins.md index 8d34589bec..de4934f866 100644 --- a/content/docs/install/plugins.md +++ b/content/docs/install/plugins.md @@ -1,8 +1,8 @@ # IDE Plugins and Syntax Highlighting When you add a file or a stage to your pipeline, DVC creates a special -[DVC-file](/doc/user-guide/dvc-metafile-formats) that contains all the needed -information to track your data and transformations. +[DVC-file](/doc/user-guide/dvc-files-and-directories) that contains all the +needed information to track your data and transformations. The file itself is in a simple YAML format. @@ -16,15 +16,3 @@ autocmd! BufNewFile,BufRead Dvcfile,*.dvc setfiletype yaml ``` to your `~/.vimrc`(to be created if it doesn't exist). - -## IntelliJ IDEs - -A community member, [@prihoda](https://github.com/prihoda), maintains a plugin -for IntelliJ IDEs, it offers a more robust integration than just syntax -highlighting. 
- -You can download the plugin from -[JetBrains Plugins repository](https://plugins.jetbrains.com/plugin/11368-dvc-support-poc) - -For more information, visit the plugin's repository: -[iterative/intellij-dvc/](https://github.com/iterative/intellij-dvc/) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index def5781957..ee619f6712 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -19,9 +19,6 @@ Both use human-friendly YAML schemas, described below. We encourage you to get familiar with them so you may edit them freely, as needed. Both type of files should be versioned with Git (for Git-enabled repositories). -> See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable > -> the highlighting for your editor. - ## .dvc files When you add a file or directory to a DVC project with `dvc add` or From f928526e435a3cbed1e64757746a5c63ef464cf0 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 2 Jun 2020 19:37:22 -0500 Subject: [PATCH 09/36] term: avoid "metafile" --- content/docs/tutorials/get-started/data-versioning.md | 6 +++--- .../docs/use-cases/versioning-data-and-model-files.md | 2 +- content/docs/user-guide/dvc-files-and-directories.md | 9 +++++---- 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md index ac8aa2b3ba..d5b8e0e089 100644 --- a/content/docs/tutorials/get-started/data-versioning.md +++ b/content/docs/tutorials/get-started/data-versioning.md @@ -28,9 +28,9 @@ $ dvc add data/data.xml DVC stores information about the added file in a special _DVC-file_ named `data/data.xml.dvc`, a small text file with a human-readable -[format](/doc/user-guide/dvc-files-and-directories). 
This metafile can committed -with Git instead, as a placeholder for the original data (which is added to -`.gitignore`): +[format](/doc/user-guide/dvc-files-and-directories). This `.dvc` file can +committed with Git instead, as a placeholder for the original data (which is +added to `.gitignore`): ```dvc $ git add data/.gitignore data/data.xml.dvc diff --git a/content/docs/use-cases/versioning-data-and-model-files.md b/content/docs/use-cases/versioning-data-and-model-files.md index 6f4bd7449e..2ce83f14bb 100644 --- a/content/docs/use-cases/versioning-data-and-model-files.md +++ b/content/docs/use-cases/versioning-data-and-model-files.md @@ -8,7 +8,7 @@ DVC allows versioning data files and directories, intermediate results, and ML models using Git, but without storing the file contents in the Git repository. It's useful when dealing with files that are too large for Git to handle properly in general. DVC saves information about your data in special -[DVC-files](/doc/user-guide/dvc-files-and-directories), and these metafiles can +[`.dvc` files](/doc/user-guide/dvc-files-and-directories), and these files can be used for versioning. To actually store the data, DVC supports various types of [remote storage](/doc/command-reference/remote). This allows easily saving and sharing data alongside code. diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index ee619f6712..f560b00f32 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -5,7 +5,7 @@ directory (`.dvc/`) with the [internal directories and files](#internal-directories-and-files) needed for DVC operation. 
-Additionally, there are two special metafiles created by certain +Additionally, there are two special files created by certain [DVC commands](/doc/command-reference): - Files ending with the `.dvc` extension are basic placeholders to version data @@ -48,7 +48,8 @@ An output entry can consist of these fields: - `cache` - (optional) whether or not DVC should cache the output. `true` by default -Note that comments can be added to DVC metafiles using the `# comment` syntax. +Note that comments can be added to `.dvc` files and `dvc.yaml` using the +`# comment` syntax. > `.dvc` file comments are preserved among executions of the `dvc repro` and > `dvc commit` commands, but not when a `.dvc` file is overwritten by @@ -223,8 +224,8 @@ outs: # ... ``` -The directory in cache is stored as a JSON metafile describing it's contents, -along with the files it contains in cache, like this: +The directory in cache is stored as a JSON file (with `.dir` file extension) +describing it's contents, along with the files it contains in cache, like this: ```dvc $ tree .dvc/cache From 04cf1e63c264cb5e9268cb58c7bae10d67c09210 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 2 Jun 2020 19:57:26 -0500 Subject: [PATCH 10/36] get-started: DVC-file -> .dvc file/dvc.yaml (incomplete) --- .../tutorials/get-started/data-pipelines.md | 15 +++++----- .../tutorials/get-started/data-versioning.md | 28 +++++++++---------- 2 files changed, 21 insertions(+), 22 deletions(-) diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index 7cca660c0f..7fc681f73a 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -68,12 +68,11 @@ $ dvc run -f prepare.dvc \ python src/prepare.py data/data.xml data/prepared ``` -The `prepare.dvc` _stage file_ is generated. 
It has the same -[format](/doc/user-guide/dvc-files-and-directories) as the DVC-file we created -previously to [tack data](/doc/tutorials/get-started/data-versioning#changes), -but it additionally includes information about the command we ran +The `prepare.dvc` +[`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) is +generated. It includes information about the command we ran (`python src/prepare.py`), the dependencies, and -outputs. +outputs of this stage.
@@ -154,11 +153,11 @@ $ dvc run -f train.dvc \ python src/train.py data/features model.pkl ``` -Let's commit the changes, including the stage files (DVC-file) that describe our -pipeline so far: +Let's commit the changes, including to `dvc.yaml`, that describe our pipeline so +far: ```dvc -$ git add data/.gitignore .gitignore featurize.dvc train.dvc +$ git add data/.gitignore .gitignore dvc.yaml $ git commit -m "Create featurization & training stages (full ML pipeline)" ``` diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md index d5b8e0e089..c8637af07b 100644 --- a/content/docs/tutorials/get-started/data-versioning.md +++ b/content/docs/tutorials/get-started/data-versioning.md @@ -26,11 +26,11 @@ $ dvc get https://github.com/iterative/dataset-registry \ $ dvc add data/data.xml ``` -DVC stores information about the added file in a special _DVC-file_ named +DVC stores information about the added file in a special file named `data/data.xml.dvc`, a small text file with a human-readable -[format](/doc/user-guide/dvc-files-and-directories). This `.dvc` file can -committed with Git instead, as a placeholder for the original data (which is -added to `.gitignore`): +[format](/doc/user-guide/dvc-files-and-directories#dvcyaml-file). This `.dvc` +file can be committed with Git instead, as a placeholder for the original data +(which is added to `.gitignore`): ```dvc $ git add data/.gitignore data/data.xml.dvc @@ -111,8 +111,8 @@ $ dvc add data/data.xml ``` DVC updates the `data/data.xml.dvc` -[DVC-file](/doc/user-guide/dvc-files-and-directories) to match the updated data. -Let's commit this new version with Git: +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) to match +the updated data. Let's commit this new version with Git:
@@ -152,7 +152,7 @@ $ dvc checkout data/data.xml.dvc ### Expand to see what happened internally -`git checkout` brought the `data/data.xml.dvc` DVC-file back to the version, +`git checkout` brought the `data/data.xml.dvc` `.dvc` file back to the version, with the previous hash value of the data (`a304afb...`): ```yaml @@ -202,7 +202,7 @@ $ dvc push ``` > Usually, we also want to `git commit` and `git push` the corresponding -> [DVC-files](/doc/user-guide/dvc-files-and-directories). +> [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvcyaml-file). Pushing data or models ensures they're safely backed up remotely. This also means they can be retrieved from other environments. @@ -310,8 +310,9 @@ $ dvc import https://github.com/iterative/dataset-registry \ #### Expand to see what happened internally -DVC-files created by `dvc import` are called _import stages_. These have fields, -such as the data source `repo`, and `path` (under `deps`): +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) created +by `dvc import` are called _import stages_. These have fields, such as the data +source `repo`, and `path` (under `deps`): ```yaml deps: @@ -331,10 +332,9 @@ The `url` and `rev_lock` subfields under `repo` are used to save the origin and
-Additionally, the `data/data.xml` -[DVC-file](/doc/user-guide/dvc-files-and-directories) now includes metadata to -track changes in the source data. This allows you to bring in changes from the -data source later, using `dvc update`. +Additionally, `data/data.xml` now includes metadata to track changes in the +source data. This allows you to bring in changes from the data source later, +using `dvc update`. ### Python API From 5f6113be7d7cd36eb0f3fe2df92eecb50941b562 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 6 Jun 2020 19:45:03 -0500 Subject: [PATCH 11/36] cmd ref: include note about targets not being previously tracked in add --- content/docs/command-reference/add.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index aa1d97d564..5a02a66a14 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -24,11 +24,12 @@ into data artifacts of the project. By default, these are committed to the cache (use the `--no-commit` option to avoid this, and `dvc commit` to finish the process when needed). -Note that [external data](/doc/user-guide/managing-external-data) (targets -outside the workspace) is supported. +Note that [external data](/doc/user-guide/managing-external-data) is supported +(targets outside the workspace). -Under the hood, a few actions are taken for each file (or directory) in -`targets`: +After checking that each `target` file (or directory) hasn't been added before +(or tracked with other DVC commands), a few actions are taken under the hood for +each one: 1. Calculate the file hash. 2. Move the file contents to the cache directory (by default in `.dvc/cache`), From 5a605d4c8fb4711e843abc55f5d6fb32e3716036 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 7 Jun 2020 04:14:18 -0500 Subject: [PATCH 12/36] get-started: remove unnecessary unlock (unfreeze) cmd from ex. 
--- content/docs/tutorials/get-started/data-pipelines.md | 1 - 1 file changed, 1 deletion(-) diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index 50f01a7c78..6c0be3f449 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -174,7 +174,6 @@ Move to another location in your file system and do this: $ git clone https://github.com/iterative/example-get-started $ cd example-get-started $ git checkout 7-train -$ dvc unlock data/data.xml.dvc ```
From a011e23afaf83c21c6346e04d4ed5098d46b7479 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 9 Jun 2020 11:26:39 -0500 Subject: [PATCH 13/36] user guide: review some links to DVC files & dirs guide --- content/docs/command-reference/add.md | 6 +++--- content/docs/tutorials/deep/preparation.md | 3 +-- content/docs/tutorials/versioning.md | 4 ++-- content/docs/understanding-dvc/how-it-works.md | 3 +-- content/docs/use-cases/versioning-data-and-model-files.md | 8 ++++---- content/docs/user-guide/basic-concepts/dvc-project.md | 5 ++--- content/docs/user-guide/large-dataset-optimization.md | 4 ++-- 7 files changed, 15 insertions(+), 18 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index b8eba1d811..acbfa46404 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -17,8 +17,8 @@ positional arguments: The `dvc add` command is analogous to `git add`, in that it makes DVC aware of the target data, as a first step to version it. It creates a -[`.dvc` file](/doc/user-guide/dvc-files-and-directories) to track the added -data. +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) to track the +added data. The `targets` are files or directories to add with this command, that are turned into data artifacts of the project. By default, these @@ -51,7 +51,7 @@ each one: Summarizing, the result is that the target data is replaced by small `.dvc` files that can be tracked with Git. See -[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvcfiles) for more +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) for more details. 
> Note that `.dvc` files created by this command are considered _orphan stage diff --git a/content/docs/tutorials/deep/preparation.md b/content/docs/tutorials/deep/preparation.md index 4a83476842..6f3b593da7 100644 --- a/content/docs/tutorials/deep/preparation.md +++ b/content/docs/tutorials/deep/preparation.md @@ -91,5 +91,4 @@ The cache directory, one of the most important parts of any explained in more detail in the next chapter.) Note that it won't be tracked by Git — It's a local-only directory, and you cannot push it to a Git remote. -For more information refer to -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). +For more information refer to `dvc init`. diff --git a/content/docs/tutorials/versioning.md b/content/docs/tutorials/versioning.md index 608bcf5f0e..fa0e02f6a8 100644 --- a/content/docs/tutorials/versioning.md +++ b/content/docs/tutorials/versioning.md @@ -65,8 +65,8 @@ The repository you cloned is already DVC-initialized. It already contains a `.dvc/` directory with the `config` and `.gitignore` files. These and other files and directories are hidden from user, as typically there's no need to interact with them directly. See -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) -to learn more. +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn +more. diff --git a/content/docs/understanding-dvc/how-it-works.md b/content/docs/understanding-dvc/how-it-works.md index 576d1a9473..c5b5988ad3 100644 --- a/content/docs/understanding-dvc/how-it-works.md +++ b/content/docs/understanding-dvc/how-it-works.md @@ -7,8 +7,7 @@ $ dvc init ``` - > See - > [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). + > See `dvc init` for more info. 
- DVC helps define command pipelines, and keeps each command [stage](/doc/command-reference/run) and dependencies in a Git repository: diff --git a/content/docs/use-cases/versioning-data-and-model-files.md b/content/docs/use-cases/versioning-data-and-model-files.md index 2ce83f14bb..6d98891b1c 100644 --- a/content/docs/use-cases/versioning-data-and-model-files.md +++ b/content/docs/use-cases/versioning-data-and-model-files.md @@ -8,10 +8,10 @@ DVC allows versioning data files and directories, intermediate results, and ML models using Git, but without storing the file contents in the Git repository. It's useful when dealing with files that are too large for Git to handle properly in general. DVC saves information about your data in special -[`.dvc` files](/doc/user-guide/dvc-files-and-directories), and these files can -be used for versioning. To actually store the data, DVC supports various types -of [remote storage](/doc/command-reference/remote). This allows easily saving -and sharing data alongside code. +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files), and these +files can be used for versioning. To actually store the data, DVC supports +various types of [remote storage](/doc/command-reference/remote). This allows +easily saving and sharing data alongside code. ![](/img/model-versioning-diagram.png) diff --git a/content/docs/user-guide/basic-concepts/dvc-project.md b/content/docs/user-guide/basic-concepts/dvc-project.md index e5d6336f1c..4c724d8230 100644 --- a/content/docs/user-guide/basic-concepts/dvc-project.md +++ b/content/docs/user-guide/basic-concepts/dvc-project.md @@ -14,7 +14,6 @@ match: --- Initialized by running `dvc init` in the **workspace** (typically a Git -repository). It will contain the -[`.dvc/` directory](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) -and [DVC metafiles](/doc/user-guide/dvc-files-and-directories) created with +repository). 
It will contain the `.dvc/` directory and other +[special DVC files](/doc/user-guide/dvc-files-and-directories) created with commands such as `dvc add` or `dvc run`. diff --git a/content/docs/user-guide/large-dataset-optimization.md b/content/docs/user-guide/large-dataset-optimization.md index a3ddec19a1..4c64803ce6 100644 --- a/content/docs/user-guide/large-dataset-optimization.md +++ b/content/docs/user-guide/large-dataset-optimization.md @@ -5,8 +5,8 @@ In order to track the data files and directories added with `dvc add` or project's cache is the hidden storage (by default located in `.dvc/cache`) for files that are tracked by DVC, and their different versions. (See `dvc cache` and -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) -for more details.) +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more +details.) However, the versions of the tracked files that [match the current code](/doc/tutorials/get-started/data-pipelines) are also From d7ceede815febcddfaa97f93ff0bc9573ffc2567 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 9 Jun 2020 12:09:47 -0500 Subject: [PATCH 14/36] user guide: review links to files&dirs guide --- content/docs/command-reference/config.md | 2 +- content/docs/command-reference/import.md | 3 +-- content/docs/tutorials/deep/sharing-data.md | 6 +++--- content/docs/tutorials/get-started/index.md | 2 +- content/docs/understanding-dvc/related-technologies.md | 6 +++--- 5 files changed, 9 insertions(+), 10 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 5de61bdab3..1cd3892eba 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -180,7 +180,7 @@ for more details.) 
This section contains the following options: ### state See -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +[Internal directories and files](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) to learn more about the state file (database) that is used for optimization. - `state.row_limit` - maximum number of entries in the state database, which diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 015128a958..e2ebf4789d 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -65,8 +65,7 @@ data `path`, and the `outs` field contains the corresponding local path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. -To actually -[track the data](https://dvc.org/doc/tutorials/get-started/data-versioning), +To actually [track the data](/doc/tutorials/get-started/data-versioning), `git add` (and `git commit`) the import stage. Note that import stages are considered always diff --git a/content/docs/tutorials/deep/sharing-data.md b/content/docs/tutorials/deep/sharing-data.md index 054e88085e..6e945ad44e 100644 --- a/content/docs/tutorials/deep/sharing-data.md +++ b/content/docs/tutorials/deep/sharing-data.md @@ -13,9 +13,9 @@ DVC is able to push the cache to cloud storage. > Using shared cloud storage, a colleague can reuse ML models that were trained > on your machine. -First, you need to setup remote storage for the project, that will -be stored in the project's -[config file](https://dvc.org/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). +First, you need to setup the remote storage for this project, that +will be stored in the project's +[config file](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). This can be done using the CLI as shown below. 
> Note that we are using the `dvc-public` S3 bucket as an example and you don't diff --git a/content/docs/tutorials/get-started/index.md b/content/docs/tutorials/get-started/index.md index 83366096cb..b23ca49b3f 100644 --- a/content/docs/tutorials/get-started/index.md +++ b/content/docs/tutorials/get-started/index.md @@ -11,7 +11,7 @@ Move into the directory you want to use as workspace, and use `dvc init` inside to create a DVC project. It can contain existing project files. At initialization, a new `.dvc/` directory is created for the internal -[files and directories](/dvc-files-and-directories#internal-directories-and-files): +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files): ```dvc $ dvc init diff --git a/content/docs/understanding-dvc/related-technologies.md b/content/docs/understanding-dvc/related-technologies.md index 660e792103..ed3ab7b4d2 100644 --- a/content/docs/understanding-dvc/related-technologies.md +++ b/content/docs/understanding-dvc/related-technologies.md @@ -38,9 +38,9 @@ Luigi, etc. result, but we expect some GUI services will be created on top of DVC. - DVC has transparent design. Its - [internal files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) - (including the cache directory) have a human-readable format and - can be easily reused by external tools. + [files and directories](/doc/user-guide/dvc-files-and-directories) (including + the cache directory) have a human-readable format and can be + easily reused by external tools. ### Git workflows/methodologies such as Gitflow From 2b08a54d503c00849ab5937a48b2348ede7c3704 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 9 Jun 2020 12:29:00 -0500 Subject: [PATCH 15/36] cmd ref: update destroy desc. 
and link to DVC files/dirs guide --- content/docs/command-reference/destroy.md | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/destroy.md b/content/docs/command-reference/destroy.md index 258ac4a4c2..3b3c2a3830 100644 --- a/content/docs/command-reference/destroy.md +++ b/content/docs/command-reference/destroy.md @@ -1,8 +1,8 @@ # destroy Remove all -[DVC files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) -from a DVC project. +[DVC files and directories](/doc/user-guide/dvc-files-and-directories) from a +DVC project. ## Synopsis @@ -12,14 +12,19 @@ usage: dvc destroy [-h] [-q | -v] [-f] ## Description -`dvc destroy` removes DVC-files, and the internal `.dvc/` directory from the -workspace. Note that the cache directory will normally -be removed as well, unless it's set to an external location with -`dvc cache dir`. (By default a local cache is located in the `.dvc/cache` -directory.) If you were using +`dvc destroy` removes `dvc.yaml`, `.dvc` files, and the internal `.dvc/` +directory from the workspace. + +Note that the cache directory will be removed as well, unless it's +[set to an external location](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +(by default a local cache is located in `.dvc/cache`). If you were using [symlinks for linking](/doc/user-guide/large-dataset-optimization) data from the -cache, DVC will replace them with copies, so that your data is intact after the -project's destruction. +cache, DVC will replace them with the latest versions of the actual files and +directories first, so that your data is intact after the project's destruction. + +> Refer to +> [DVC files and directories](/doc/user-guide/dvc-files-and-directories) for +> more details on the directories and files deleted by this command. 
## Options From 8fbf12949011217747f9a476772565cc0e432555 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 9 Jun 2020 13:02:51 -0500 Subject: [PATCH 16/36] user guide: term locked->frozen in files&dirs guide per https://github.com/iterative/dvc.org/pull/1370#pullrequestreview-426091399 --- content/docs/command-reference/import.md | 5 +---- content/docs/user-guide/dvc-files-and-directories.md | 2 +- 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index e2ebf4789d..bbef0da556 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -212,8 +212,7 @@ in the future, where and when needed. This is achieved with the `repo` field, for example (matching the import command above): ```yaml -md5: 96fd8e791b0ee4824fc1ceffd13b1b49 -locked: true +frozen: true deps: - path: use-cases/cats-dogs repo: @@ -223,8 +222,6 @@ outs: - md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir path: cats-dogs cache: true - metric: false - persist: false ``` See a full explanation in our [Data Registries](/doc/use-cases/data-registries) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 920fa3334b..5288452984 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -86,7 +86,7 @@ the following fields: names and their current values - `outs` - list of outputs for this stage - `metric` - (optional) list of [metric](/doc/command-reference/metrics) files -- `locked` - (optional) whether or not this stage is locked from reproduction +- `frozen` - (optional) whether or not this stage is frozen from reproduction - `always_changed` (optional) - whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. 
`false` by default From bdab8aeecd5ecfdc49d18000f46191866cc3ddc1 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 9 Jun 2020 13:10:38 -0500 Subject: [PATCH 17/36] user guide: update and reorg all YAML structure fields per https://github.com/iterative/dvc.org/pull/1370#pullrequestreview-426092266 --- .../user-guide/dvc-files-and-directories.md | 80 +++++++++++-------- 1 file changed, 46 insertions(+), 34 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 5288452984..50a0fa681f 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -36,18 +36,39 @@ outs: # Manual comments can be added in. ``` -`.dvc` files contain a single top field: +`.dvc` files can contain the following two fields: -- `outs` - list of outputs for this `.dvc` file +- `outs`: List of outputs for this `.dvc` file +- `deps` (optional): List of dependencies for this stage, only + present when `dvc import` and `dvc import-url` are used. An output entry can consist of these fields: -- `md5` - hash value for the output file -- `path` - path to the output in the workspace, relative to the +- `md5`: Hash value for the output file +- `path`: Path to the output in the workspace, relative to the location of the `.dvc` file -- `cache` - (optional) whether or not DVC should cache the output. `true` by +- `cache` (optional): Whether or not DVC should cache the output. 
`true` by default +A `.dvc` file dependency entry consists of a these possible fields: + +- `path`: Path to the dependency, relative to the `wdir` path (always present) +- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) +- `repo`: This entry is only for external dependencies created with + `dvc import`, and can contains the following fields: + + - `url`: URL of Git repository with source DVC project + - `rev`: Only present when the `--rev` option of `dvc import` is used. + Specific commit hash, branch or tag name, etc. (a + [Git revision](https://git-scm.com/docs/revisions)) used to import the + dependency from. + - `rev_lock`: Git commit hash of the external DVC repository at + the time of importing or updating (with `dvc update`) the dependency. + + > See the examples in + > [External Dependencies](/doc/user-guide/external-dependencies) for more + > info. + Note that comments can be added to `.dvc` files and `dvc.yaml` using the `# comment` syntax. @@ -77,49 +98,40 @@ stages: ``` `dvc.yaml` files consists of a group of `stages` with names provided explicitly -by the user with the `--name` (`-n`) option of `dvc run`. Each stage can contain -the following fields: +by the user with the `--name` (`-n`) option of `dvc run`. 
-- `cmd` - executable command defined in this stage -- `deps` - list of dependencies for this stage -- `params` - (optional) list of the [parameter](/doc/command-reference/params) +Each stage can contain the following fields: + +- `cmd`: Executable command defined in this stage +- `deps`: List of dependencies for this stage +- `outs`: List of outputs for this stage +- `params` (optional): List of the [parameter](/doc/command-reference/params) names and their current values -- `outs` - list of outputs for this stage -- `metric` - (optional) list of [metric](/doc/command-reference/metrics) files -- `frozen` - (optional) whether or not this stage is frozen from reproduction -- `always_changed` (optional) - whether or not this stage is considered as +- `metric` (optional): List of [metric](/doc/command-reference/metrics) files +- `frozen` (optional): Whether or not this stage is frozen from reproduction +- `always_changed` (optional) : Whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. `false` by default -A dependency entry consists of a these possible fields: +An output entry consists of these fields: + +- `md5`: Hash value for the output file +- `path`: Path to the output in the workspace, relative to the + location of the `.dvc` file +- `cache` (optional): Whether or not DVC should cache the output. `true` by + default + +A `dvc.yaml` dependency entry consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) - `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) -- `repo`: This entry is only for external dependencies created with - `dvc import`, and can contains the following fields: - - - `url`: URL of Git repository with source DVC project - - `rev`: Only present when the `--rev` option of `dvc import` is used. 
- Specific commit hash, branch or tag name, etc. (a - [Git revision](https://git-scm.com/docs/revisions)) used to import the - dependency from. - - `rev_lock`: Git commit hash of the external DVC repository at - the time of importing or updating (with `dvc update`) the dependency. > See the examples in > [External Dependencies](/doc/user-guide/external-dependencies) for more > info. -An output entry consists of these fields: - -- `md5` - hash value for the output file -- `path` - path to the output in the workspace, relative to the - location of the `.dvc` file -- `cache` - (optional) whether or not DVC should cache the output. `true` by - default - -Metrics entries can contain these fields: +Metric entries can contain these fields: - `type`: Type of the metric file (`json`) - `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` for From ead834aaa3e21b97a9cb8893c7e75dffcca9af48 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 10 Jun 2020 17:30:02 -0500 Subject: [PATCH 18/36] user guide: reinstate info about meta fields and improve info on comments for DVC files&dirs guide per https://github.com/iterative/dvc.org/pull/1411#discussion_r438430369 --- .../user-guide/dvc-files-and-directories.md | 39 ++++++++++++------- 1 file changed, 26 insertions(+), 13 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 50a0fa681f..008f8867d2 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -27,13 +27,17 @@ When you add a file or directory to a DVC project with `dvc add` or data with DVC. They use a simple YAML format that can be easily written or altered manually. -Here is a sample: +Here is a full sample: ```yaml outs: - md5: a304afb96060aad90176268345e10355 path: data.xml -# Manual comments can be added in. + +# Comments and user metadata are supported. 
+meta: + name: 'John Doe' + email: john@doe.com ``` `.dvc` files can contain the following two fields: @@ -41,6 +45,9 @@ outs: - `outs`: List of outputs for this `.dvc` file - `deps` (optional): List of dependencies for this stage, only present when `dvc import` and `dvc import-url` are used. +- `meta` (optional): Arbitrary information can be added here manually. Any YAML + contents can be added. `meta` contents are ignored by DVC, but they can be + useful for user processes that read `.dvc` files. An output entry can consist of these fields: @@ -72,9 +79,9 @@ A `.dvc` file dependency entry consists of a these possible fields: Note that comments can be added to `.dvc` files and `dvc.yaml` using the `# comment` syntax. -> `.dvc` file comments are preserved among executions of the `dvc repro` and -> `dvc commit` commands, but not when a `.dvc` file is overwritten by -> `dvc add`,`dvc import`, or `dvc import-url`. +> `meta` fields and `#` comments are preserved among executions of the +> `dvc repro` and `dvc commit` commands, but not when a `.dvc` file is +> overwritten by `dvc add`,`dvc import`, or `dvc import-url`. ## dvc.yaml files @@ -86,21 +93,24 @@ stages: stageone: cmd: python cmd.py input.data output.data metrics.json deps: - - cmd.py - - input.data + - cmd.py + - input.data outs: - - output.data + - output.data metrics: - - metrics.json + - metrics.json stagetwo: cmd: python ... - ... + meta: '2nd stage' # User metadata and comments are supported. + deps: ... ``` `dvc.yaml` files consists of a group of `stages` with names provided explicitly by the user with the `--name` (`-n`) option of `dvc run`. 
-Each stage can contain the following fields: +Each stage's contents are similar to individual [`dvc` files](#dvcfiles) but +they can contain more information in `dvc.yaml` These are the possible following +fields: - `cmd`: Executable command defined in this stage - `deps`: List of dependencies for this stage @@ -111,6 +121,9 @@ Each stage can contain the following fields: - `frozen` (optional): Whether or not this stage is frozen from reproduction - `always_changed` (optional) : Whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. `false` by default +- `meta` (optional): Arbitrary information can be added here manually. Any YAML + contents can be added. `meta` contents are ignored by DVC, but they can be + useful for user processes that read `.dvc` files. An output entry consists of these fields: @@ -139,8 +152,8 @@ Metric entries can contain these fields: `dvc.yaml` files also support `# comments`. -> `dvc.yaml` comments are preserved among executions of `dvc run`, `dvc repro`, -> and `dvc commit`. +> `meta` fields and `#` comments are preserved among executions of `dvc run`, +> `dvc repro`, and `dvc commit`. 
## Internal directories and files From 5a97a2dce20db5ca0ed25bede3caf15bdaa1571a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 10 Jun 2020 17:41:10 -0500 Subject: [PATCH 19/36] user guide: term metric->metrics + add plots field per https://github.com/iterative/dvc.org/pull/1370#pullrequestreview-427696120 --- content/docs/user-guide/dvc-files-and-directories.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 008f8867d2..f9da0a3a62 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -117,7 +117,8 @@ fields: - `outs`: List of outputs for this stage - `params` (optional): List of the [parameter](/doc/command-reference/params) names and their current values -- `metric` (optional): List of [metric](/doc/command-reference/metrics) files +- `metrics` (optional): List of [metrics](/doc/command-reference/metrics) +- `plots` (optional): List of [plot metrics](/doc/command-reference/plots) - `frozen` (optional): Whether or not this stage is frozen from reproduction - `always_changed` (optional) : Whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. `false` by default @@ -144,9 +145,9 @@ A `dvc.yaml` dependency entry consists of a these possible fields: > [External Dependencies](/doc/user-guide/external-dependencies) for more > info. -Metric entries can contain these fields: +Metrics entries can contain these fields: -- `type`: Type of the metric file (`json`) +- `type`: Type of the metrics file (`json`) - `xpath`: Path within the metric file to the metrics data(e.g. 
`AUC.value` for `{"AUC": {"value": 0.624321}}`) From e957993e503ba864ff98b089c8fb7a925fff2703 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 11 Jun 2020 00:26:06 -0500 Subject: [PATCH 20/36] user guide: mark optional fields in dvc.yaml and .dvc files per https://github.com/iterative/dvc.org/pull/1370#pullrequestreview-427481072 --- .../docs/user-guide/dvc-files-and-directories.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index f9da0a3a62..297a93f977 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -45,7 +45,7 @@ meta: - `outs`: List of outputs for this `.dvc` file - `deps` (optional): List of dependencies for this stage, only present when `dvc import` and `dvc import-url` are used. -- `meta` (optional): Arbitrary information can be added here manually. Any YAML +- `meta` (manual): Arbitrary information can be added here manually. Any YAML contents can be added. `meta` contents are ignored by DVC, but they can be useful for user processes that read `.dvc` files. 
@@ -61,7 +61,7 @@ A `.dvc` file dependency entry consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) -- `repo`: This entry is only for external dependencies created with +- `repo` (optional): This entry is only for external dependencies created with `dvc import`, and can contains the following fields: - `url`: URL of Git repository with source DVC project @@ -113,16 +113,16 @@ they can contain more information in `dvc.yaml` These are the possible following fields: - `cmd`: Executable command defined in this stage -- `deps`: List of dependencies for this stage -- `outs`: List of outputs for this stage +- `deps` (optional): List of dependencies for this stage +- `outs` (optional): List of outputs for this stage - `params` (optional): List of the [parameter](/doc/command-reference/params) names and their current values - `metrics` (optional): List of [metrics](/doc/command-reference/metrics) - `plots` (optional): List of [plot metrics](/doc/command-reference/plots) - `frozen` (optional): Whether or not this stage is frozen from reproduction -- `always_changed` (optional) : Whether or not this stage is considered as +- `always_changed` (optional): Whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. `false` by default -- `meta` (optional): Arbitrary information can be added here manually. Any YAML +- `meta` (manual): Arbitrary information can be added here manually. Any YAML contents can be added. `meta` contents are ignored by DVC, but they can be useful for user processes that read `.dvc` files. 
@@ -138,7 +138,7 @@ A `dvc.yaml` dependency entry consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) -- `etag`: Strong ETag response header (only HTTP external +- `etag` (optional): Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) > See the examples in From f32ffd5723d43f0d2a73e97e8d6b6e03bcf53221 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 11 Jun 2020 00:36:59 -0500 Subject: [PATCH 21/36] user guide: metrics are now a regular outs field in dvc.yaml and also plots per https://github.com/iterative/dvc.org/pull/1370#pullrequestreview-427483677 --- content/docs/user-guide/dvc-files-and-directories.md | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 297a93f977..3f8458dbf7 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -126,7 +126,7 @@ fields: contents can be added. `meta` contents are ignored by DVC, but they can be useful for user processes that read `.dvc` files. -An output entry consists of these fields: +An output entry (`outs`, `metrics`, or `plots`) consists of these fields: - `md5`: Hash value for the output file - `path`: Path to the output in the workspace, relative to the @@ -145,12 +145,6 @@ A `dvc.yaml` dependency entry consists of a these possible fields: > [External Dependencies](/doc/user-guide/external-dependencies) for more > info. -Metrics entries can contain these fields: - -- `type`: Type of the metrics file (`json`) -- `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` for - `{"AUC": {"value": 0.624321}}`) - `dvc.yaml` files also support `# comments`. 
> `meta` fields and `#` comments are preserved among executions of `dvc run`, From 910e6dbe6c3e24825fa99f8c46160eb5fe22cd9f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 11 Jun 2020 00:44:57 -0500 Subject: [PATCH 22/36] user guide: meta fields always preserved in dvc.yaml per https://github.com/iterative/dvc.org/pull/1370#pullrequestreview-428605193 --- content/docs/user-guide/dvc-files-and-directories.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 3f8458dbf7..71c04a7346 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -126,6 +126,8 @@ fields: contents can be added. `meta` contents are ignored by DVC, but they can be useful for user processes that read `.dvc` files. +> `meta` fields and `#` comments are always preserved in `dvc.yaml` files. + An output entry (`outs`, `metrics`, or `plots`) consists of these fields: - `md5`: Hash value for the output file @@ -147,9 +149,6 @@ A `dvc.yaml` dependency entry consists of a these possible fields: `dvc.yaml` files also support `# comments`. -> `meta` fields and `#` comments are preserved among executions of `dvc run`, -> `dvc repro`, and `dvc commit`. - ## Internal directories and files - `.dvc/config`: This is a configuration file. 
The config file can be edited by From 3986283398f9e37177b4b339b06015a8042fc61a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 12 Jun 2020 18:05:03 -0500 Subject: [PATCH 23/36] user guide: bring more links from old guide --- content/docs/command-reference/add.md | 18 ++++++++++-------- content/docs/command-reference/fetch.md | 10 +++++----- content/docs/command-reference/import-url.md | 2 +- content/docs/command-reference/list.md | 9 +++++---- .../docs/command-reference/metrics/index.md | 4 ++-- content/docs/command-reference/plots/diff.md | 4 ++-- content/docs/command-reference/plots/modify.md | 5 +++-- content/docs/command-reference/plots/show.md | 5 +++-- content/docs/command-reference/status.md | 4 ++-- content/docs/command-reference/update.md | 2 +- .../basic-concepts/external-dependency.md | 6 +++--- 11 files changed, 37 insertions(+), 32 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 8e12fb8193..dea8b81016 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -43,11 +43,12 @@ each one: for more details.) 3. Attempt to replace the file with a link to the cached data (more details on file linking further down). -4. Create a corresponding [`.dvc` file](/doc/user-guide/dvc-file-format) to - track the file, using its path and hash to identify the cached data. The - `.dvc` file lists the DVC-tracked file as an output (`outs` - field). Unless the `-f` option is used, the `.dvc` file name generated by - default is `.dvc`, where `` is the file name of the first target. +4. Create a corresponding + [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) to track + the file, using its path and hash to identify the cached data. The `.dvc` + file lists the DVC-tracked file as an output (`outs` field). + Unless the `-f` option is used, the `.dvc` file name generated by default is + `.dvc`, where `` is the file name of the first target. 5. 
Add the `targets` to `.gitignore` in order to prevent them from being committed to the Git repository (unless `dvc init --no-scm` was used when initializing the DVC project). @@ -61,7 +62,8 @@ easily tracked with Git. > Note that `.dvc` files can be considered _orphan stages_, because they have no > dependencies, only outputs. These are treated as _always changed_ > by `dvc status` and `dvc repro`, which always executes them. See -> [`dvc.yaml`](/doc/user-guide/dvc-file-format) to learn more about stages. +> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) to learn +> more about stages. To avoid adding files inside a directory accidentally, you can add the corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file. @@ -76,8 +78,8 @@ large files. DVC also supports other link types for use on file systems without ### Tracking directories A `dvc add` target can be an individual file or a directory. In the latter case, -a [`.dvc` file](/doc/user-guide/dvc-file-format) is created for the top of the -directory (with default name `.dvc`). +a [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) is created +for the top of the directory (with default name `.dvc`). Every file in the hierarchy is added to the cache (unless the `--no-commit` option is used), but DVC does not produce individual `.dvc` files for each file diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 631161fe57..5f13b236a3 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -196,11 +196,11 @@ Note that the `.dvc/cache` directory was created and populated. > for more info. Used without arguments (as above), `dvc fetch` downloads all assets needed by -all [`dvc.yaml`](/doc/user-guide/dvc-file-format) and -[`.dvc`](/doc/user-guide/dvc-file-format) files in the current branch, including -for directories. 
The hash values `3863d0e317dee0a55c4e59d2ec0eef33` and -`42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and -`data/features/` directory, respectively. +all [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files in the +current branch, including for directories. The hash values +`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` +correspond to the `model.pkl` file and `data/features/` directory, respectively. Let's now link files from the cache to the workspace with: diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 26a66df5be..e2f5aec6ea 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -3,7 +3,7 @@ Download a file or directory from a supported URL (for example `s3://`, `ssh://`, and other protocols) into the workspace, and track changes in the remote data source. Creates a -[`.dvc` file](/doc/user-guide/dvc-file-format). +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files). > See `dvc import` to download and tack data/model files or directories from > other DVC repositories (e.g. hosted on Github). diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index c53ce73c01..f62f6e0687 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -19,10 +19,11 @@ positional arguments: DVC, by effectively replacing data files, models, directories with `.dvc` files (`.dvc`), hides actual locations and names. This means that you don't see data files when you browse a DVC repository on Git hosting (e.g. -Github), you just see the [`dvc.yaml`](/doc/user-guide/dvc-file-format) and -[`.dvc`](/doc/user-guide/dvc-file-format) files. 
This makes it hard to navigate -the project to find data artifacts for use with `dvc get`, -`dvc import`, or `dvc.api`. +Github), you just see the +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files. This makes +it hard to navigate the project to find data artifacts for use with +`dvc get`, `dvc import`, or `dvc.api`. `dvc list` prints a virtual view of a DVC repository, as if files and directories [tracked by DVC](/doc/use-cases/versioning-data-and-model-files) diff --git a/content/docs/command-reference/metrics/index.md b/content/docs/command-reference/metrics/index.md index 3624a8d245..2d271b043b 100644 --- a/content/docs/command-reference/metrics/index.md +++ b/content/docs/command-reference/metrics/index.md @@ -65,8 +65,8 @@ stages: > `cache: false` above specifies that `summary.json` is not tracked or > cached by DVC (`-M` option of `dvc run`). These metric files are > normally committed with Git instead. See -> [`dvc.yaml`](/doc/user-guide/dvc-file-format) for more information on the file -> format above. +> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) for more +> information on the file format above. ### Supported file formats diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index 8125363141..621a282f71 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -46,8 +46,8 @@ please see `dvc plots`. ## Options - `--targets ` - specific metric files to visualize. These must be listed - in a [`dvc.yaml`](/doc/user-guide/dvc-file-format) file (see the `--plots` - option of `dvc run`). + in a [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) + file (see the `--plots` option of `dvc run`). - `-o , --out ` - name of the generated file. 
By default, the output file name is equal to the input filename with a `.html` file extension (or diff --git a/content/docs/command-reference/plots/modify.md b/content/docs/command-reference/plots/modify.md index ac9c24101a..376b90fae3 100644 --- a/content/docs/command-reference/plots/modify.md +++ b/content/docs/command-reference/plots/modify.md @@ -23,8 +23,9 @@ plots are generated with `dvc plot show` or `dvc plot diff`. This command sets (or unsets) default display properties for a specific metrics file. The path to the metrics file `target` is required. It must be listed in a -[`dvc.yaml`](/doc/user-guide/dvc-file-format) file (see the `--plots` option of -`dvc run`). `dvc plots modify` adds the display properties to `dvc.yaml`. +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) file (see +the `--plots` option of `dvc run`). `dvc plots modify` adds the display +properties to `dvc.yaml`. Property names are passed as [options](#options) to this command (prefixed with `--`). These are based on the full diff --git a/content/docs/command-reference/plots/show.md b/content/docs/command-reference/plots/show.md index 2875380811..ba107e0f37 100644 --- a/content/docs/command-reference/plots/show.md +++ b/content/docs/command-reference/plots/show.md @@ -23,8 +23,9 @@ AUC curves, confusion matrices, etc. All plots defined in `dvc.yaml` are used by default. Optionally, specific metric file `targets` to show are accepted. These must be -listed in a [`dvc.yaml`](/doc/user-guide/dvc-file-format) file (see the -`--plots` option of `dvc run`). +listed in a +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) file (see +the `--plots` option of `dvc run`). 
The plot style can be customized with [plot templates](/doc/command-reference/plots#plot-templates), using the diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 0550c111e1..f00aa782c5 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -80,8 +80,8 @@ the changes (described below). - _new_: An output is found in the workspace, but there is no corresponding file hash saved in the - [`dvc.lock`](/doc/user-guide/dvc-file-format) or - [`.dvc`](/doc/user-guide/dvc-file-format) file yet. + [`dvc.lock`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or + [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file yet. - _modified_: An output or dependency is found in the workspace, but the corresponding file hash in the `dvc.lock` or `.dvc` file is not up to date. diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 80c2054ff3..25b0dbb235 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -2,7 +2,7 @@ Update data artifacts imported from external DVC projects, and corresponding import stage -[`.dvc` files](/doc/user-guide/dvc-file-format). +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files). ## Synopsis diff --git a/content/docs/user-guide/basic-concepts/external-dependency.md b/content/docs/user-guide/basic-concepts/external-dependency.md index 0856ff170e..76cc554fce 100644 --- a/content/docs/user-guide/basic-concepts/external-dependency.md +++ b/content/docs/user-guide/basic-concepts/external-dependency.md @@ -3,7 +3,7 @@ name: 'External Dependency' match: ['external dependency', 'external dependencies'] --- -A [`dvc.yaml`](/doc/user-guide/dvc-file-format) file dependency with origin in -an external source, for example HTTP, SSH, Amazon S3, Google Cloud Storage -remote locations, or even other DVC repositories. 
See +A [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) file +dependency with origin in an external source, for example HTTP, SSH, Amazon S3, +Google Cloud Storage remote locations, or even other DVC repositories. See [External Dependencies](/doc/user-guide/external-dependencies). From caa5d84302269e084928dc5525da57f0670c20d9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 12 Jun 2020 18:26:45 -0500 Subject: [PATCH 24/36] server:add dvc files & dirs page redirect --- redirects-list.json | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/redirects-list.json b/redirects-list.json index b16be8a49a..2bcdddf51e 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -22,8 +22,9 @@ "^/doc/get-started(/.*)?$ /doc/tutorials/get-started$1", "^/doc/tutorial/?$ /doc/tutorials", "^/doc/tutorial/(.*)? /doc/tutorials/deep/$1", - "^/doc/commands-reference(/.*)?$ /doc/command-reference$1", "^/doc/use-cases/data-and-model-files-versioning/?$ /doc/use-cases/versioning-data-and-model-files", + "^/doc/user-guide/dvc-file-format$ /doc/user-guide/dvc-files-and-directories", + "^/doc/commands-reference(/.*)?$ /doc/command-reference$1", "^/doc/command-reference/plot$ /doc/command-reference/plots", "^/doc/command-reference/lock$ /doc/command-reference/freeze", "^/doc/command-reference/unlock$ /doc/command-reference/unfreeze", From 72d09bfcb8c48f3917c50e62607053e0fd73913c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 12 Jun 2020 18:50:24 -0500 Subject: [PATCH 25/36] user guide: fix more links to new files&dirs and details around them --- content/docs/command-reference/import-url.md | 4 ++-- content/docs/command-reference/metrics/diff.md | 13 +++++++------ content/docs/command-reference/run.md | 7 ++----- content/docs/tutorials/deep/define-ml-pipeline.md | 6 +++--- content/docs/tutorials/versioning.md | 4 ++-- 5 files changed, 16 insertions(+), 18 deletions(-) diff --git a/content/docs/command-reference/import-url.md 
b/content/docs/command-reference/import-url.md index 0bd0fb8e97..22794916f1 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -193,8 +193,8 @@ The `etag` field in the `.dvc` file contains the If the remote file changes, its ETag will be different. This metadata allows DVC to determine whether its necessary to download it again. -> See [DVC-File Format](/doc/user-guide/dvc-files-and-directories) for more -> details on the text format above. +> See [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) for +> more details on the format above. You may want to get out of and remove the `example-get-started/` directory after trying this example (especially if trying out the following one). diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index e019e0c8b6..178688bb84 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -29,12 +29,13 @@ Run without arguments, this command compares metrics currently present in the workspace uncommitted changes) with the latest committed version. The differences shown by this command include the new value, and numeric -difference (delta) from the previous value of metrics. All values and the delta -are [round](https://docs.python.org/3/library/functions.html#round)ed to 5 -digits precision after the decimal point. They're calculated between two commits -(hash, branch, tag, or any [Git revision](https://git-scm.com/docs/revisions)) -for all metrics in the project, found by examining all of the -[DVC-files](/doc/user-guide/dvc-files-and-directories) in both references. +difference (delta) from the previous value of metrics (rounded to 5 digits +precision). 
They're calculated between two commits (hash, branch, tag, or any +[Git revision](https://git-scm.com/docs/revisions)) for all metrics in the +project, found by examining all of the +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files in both +versions. Another way to display metrics is the `dvc metrics show` command, which just lists all the current metrics without comparisons. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index e527b4fa81..7db1d4405f 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -212,12 +212,9 @@ To track the changes with git, run: git add .gitignore metric.dvc ``` -> See [DVC-File Format](/doc/user-guide/dvc-files-and-directories) for more -> details on the text format above. - Execute a Python script as a DVC [pipeline](/doc/command-reference/pipeline) -stage. The stage file name is not specified, so a `model.p.dvc` DVC-file is -created by default based on the registered output (`-o): +stage. The stage file name is not specified, so a `model.p.dvc` file is created +by default based on the registered output (`-o`): ```dvc # Train ML model on the training dataset. 20180226 is a seed value. diff --git a/content/docs/tutorials/deep/define-ml-pipeline.md b/content/docs/tutorials/deep/define-ml-pipeline.md index 260d57839a..aff377c8f9 100644 --- a/content/docs/tutorials/deep/define-ml-pipeline.md +++ b/content/docs/tutorials/deep/define-ml-pipeline.md @@ -186,9 +186,9 @@ command and does some additional work if the command was successful: 2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage file in the project with information about this pipeline stage. - (See [DVC-File Format](/doc/user-guide/dvc-files-and-directories)). Note that - the name of this file could be specified by using the `-f` option, for - example `-f extract.dvc`. 
+ (See [DVC Files](/doc/user-guide/dvc-files-and-directories)). Note that the + name of this file could be specified by using the `-f` option, for example + `-f extract.dvc`. Let's take a look at the resulting stage file created by `dvc run` above: diff --git a/content/docs/tutorials/versioning.md b/content/docs/tutorials/versioning.md index fa0e02f6a8..9630c790c7 100644 --- a/content/docs/tutorials/versioning.md +++ b/content/docs/tutorials/versioning.md @@ -168,8 +168,8 @@ As we mentioned briefly, DVC does not commit the `data/` directory and then `git commit` DVC-files that contain file hashes that point to cached data. In this case we created `data.dvc` and `model.h5.dvc`. Refer to -[DVC-File Format](/doc/user-guide/dvc-files-and-directories) to learn more about -how these files work. +[DVC Files](/doc/user-guide/dvc-files-and-directories) to learn more about how +these files work. From 2462c83f49e8504d04f278e6ee96b92a8bb2cd8d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 13 Jun 2020 12:11:13 -0500 Subject: [PATCH 26/36] Update content/docs/user-guide/basic-concepts/external-dependency.md Co-authored-by: Saugat Pachhai --- content/docs/user-guide/basic-concepts/external-dependency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/user-guide/basic-concepts/external-dependency.md b/content/docs/user-guide/basic-concepts/external-dependency.md index 275bec2631..373c3006cb 100644 --- a/content/docs/user-guide/basic-concepts/external-dependency.md +++ b/content/docs/user-guide/basic-concepts/external-dependency.md @@ -3,7 +3,7 @@ name: 'External Dependency' match: ['external dependency', 'external dependencies'] --- -A stage dependency (`dep` field in +A stage dependency (`deps` field in [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or in an [import stage](/doc/command-reference/import) `.dvc` file) with origin in an external source, for example HTTP, SSH, Amazon S3, Google Cloud Storage remote From 
c66d4eb84d83f8abefde96a17f2f90aa276b832c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 13 Jun 2020 17:36:21 -0500 Subject: [PATCH 27/36] user guide: improvements to files & dirs guide Address Ivan's private feedback --- .../tutorials/get-started/data-pipelines.md | 2 +- .../user-guide/dvc-files-and-directories.md | 35 ++++++++----------- 2 files changed, 15 insertions(+), 22 deletions(-) diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index 6c0be3f449..df84bbe08a 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -67,7 +67,7 @@ $ dvc run -f prepare.dvc \ python src/prepare.py data/data.xml data/prepared ``` -A [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) is +A [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) is generated. It includes information about the command we ran (`python src/prepare.py`), its dependencies, and outputs. diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 71c04a7346..18871c9f2c 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -5,13 +5,14 @@ directory (`.dvc/`) with the [internal directories and files](#internal-directories-and-files) needed for DVC operation. -Additionally, there are two special files created by certain +Additionally, there are two special kinds of files created by certain [DVC commands](/doc/command-reference): -- Files ending with the `.dvc` extension are basic placeholders to version data - files and directories. A DVC project can have multiple - [`.dvc` files](#dvc-files). -- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages +- Files ending with the `.dvc` extension are placeholders to version data files + and directories. 
A DVC project usually has one + [`.dvc` file](#dvc-files) per large data file or dataset directory being + tracked. +- The [`dvc.yaml` file](#dvcyaml-files) or _pipeline(s) file_ specifies stages that form the pipeline(s) of a project, and their connections (_dependency graph_ or DAG). @@ -23,11 +24,11 @@ should be versioned with Git (for Git-enabled repositories). When you add a file or directory to a DVC project with `dvc add` or `dvc import`, a `.dvc` file is created based on the data file name (e.g. -`data.xml.dvc`). These files contain the basic information needed to track the -data with DVC. +`data.xml.dvc`). These files contain the information needed to track the data +with DVC. -They use a simple YAML format that can be easily written or altered manually. -Here is a full sample: +They use a simple [YAML](https://yaml.org/) format, meant to be easy to read, +edit, or even created manually by users. Here is a full sample: ```yaml outs: @@ -61,6 +62,8 @@ A `.dvc` file dependency entry consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) +- `etag` (optional): Strong ETag response header (only HTTP external + dependencies created with `dvc import-url`) - `repo` (optional): This entry is only for external dependencies created with `dvc import`, and can contains the following fields: @@ -72,10 +75,6 @@ A `.dvc` file dependency entry consists of a these possible fields: - `rev_lock`: Git commit hash of the external DVC repository at the time of importing or updating (with `dvc update`) the dependency. - > See the examples in - > [External Dependencies](/doc/user-guide/external-dependencies) for more - > info. - Note that comments can be added to `.dvc` files and `dvc.yaml` using the `# comment` syntax. 
@@ -122,11 +121,11 @@ fields: - `frozen` (optional): Whether or not this stage is frozen from reproduction - `always_changed` (optional): Whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. `false` by default -- `meta` (manual): Arbitrary information can be added here manually. Any YAML +- `meta` (optional): Arbitrary information can be added here manually. Any YAML contents can be added. `meta` contents are ignored by DVC, but they can be useful for user processes that read `.dvc` files. -> `meta` fields and `#` comments are always preserved in `dvc.yaml` files. +> `meta` fields and `#` comments are always preserved in `dvc.yaml` stages. An output entry (`outs`, `metrics`, or `plots`) consists of these fields: @@ -140,12 +139,6 @@ A `dvc.yaml` dependency entry consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) -- `etag` (optional): Strong ETag response header (only HTTP external - dependencies created with `dvc import-url`) - - > See the examples in - > [External Dependencies](/doc/user-guide/external-dependencies) for more - > info. `dvc.yaml` files also support `# comments`. From de1f12c6315d9f63741bb1488f4f9a2723181461 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Jun 2020 01:50:00 -0500 Subject: [PATCH 28/36] user guide: md5 is optional et al. 
--- .../docs/user-guide/dvc-files-and-directories.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 18871c9f2c..6d789a1b5d 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -41,18 +41,18 @@ meta: email: john@doe.com ``` -`.dvc` files can contain the following two fields: +`.dvc` files can contain the following fields: - `outs`: List of outputs for this `.dvc` file - `deps` (optional): List of dependencies for this stage, only present when `dvc import` and `dvc import-url` are used. -- `meta` (manual): Arbitrary information can be added here manually. Any YAML +- `meta` (optional): Arbitrary information can be added here manually. Any YAML contents can be added. `meta` contents are ignored by DVC, but they can be useful for user processes that read `.dvc` files. An output entry can consist of these fields: -- `md5`: Hash value for the output file +- `md5` (optional): Hash value for the output file - `path`: Path to the output in the workspace, relative to the location of the `.dvc` file - `cache` (optional): Whether or not DVC should cache the output. 
`true` by @@ -61,7 +61,8 @@ An output entry can consist of these fields: A `.dvc` file dependency entry consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) +- `md5` (optional): MD5 hash for the dependency (most + [stages](/doc/command-reference/run)) - `etag` (optional): Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) - `repo` (optional): This entry is only for external dependencies created with @@ -129,7 +130,7 @@ fields: An output entry (`outs`, `metrics`, or `plots`) consists of these fields: -- `md5`: Hash value for the output file +- `md5` (optional): Hash value for the output file - `path`: Path to the output in the workspace, relative to the location of the `.dvc` file - `cache` (optional): Whether or not DVC should cache the output. `true` by @@ -138,7 +139,8 @@ An output entry (`outs`, `metrics`, or `plots`) consists of these fields: A `dvc.yaml` dependency entry consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) +- `md5` (optional): MD5 hash for the dependency (most + [stages](/doc/command-reference/run)) `dvc.yaml` files also support `# comments`. 
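The patch above marks `md5` and `cache` as optional in output entries. A minimal sketch of how a script that reads `.dvc` files might apply those documented defaults follows. It assumes the file has already been parsed into plain dictionaries (for example by a YAML library). The function name `normalize_output_entry` is hypothetical and not part of DVC, and treating `path` as required is an assumption based on it being the only field not marked optional.

```python
from typing import Any, Dict


def normalize_output_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults for one `outs` entry of a parsed `.dvc` file.

    Hypothetical helper for illustration only; DVC does not expose it.
    """
    # Assumption: `path` is the one field every output entry must carry.
    if "path" not in entry:
        raise ValueError("output entry has no `path` field")
    return {
        "path": entry["path"],
        # `md5` is optional: None means no hash has been recorded.
        "md5": entry.get("md5"),
        # `cache` is optional and documented as `true` by default.
        "cache": entry.get("cache", True),
    }
```

A caller could map this over the `outs` list of a parsed `.dvc` file so that later code does not need to check for missing keys.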
From b8055123bb5061f4ebc251a1ce21b574fd52ba88 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Jun 2020 05:54:14 -0500 Subject: [PATCH 29/36] user guide: update fields in dvc.yaml --- .../user-guide/dvc-files-and-directories.md | 88 ++++++++++--------- 1 file changed, 45 insertions(+), 43 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 6d789a1b5d..0b9fd78893 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -43,14 +43,16 @@ meta: `.dvc` files can contain the following fields: -- `outs`: List of outputs for this `.dvc` file -- `deps` (optional): List of dependencies for this stage, only - present when `dvc import` and `dvc import-url` are used. -- `meta` (optional): Arbitrary information can be added here manually. Any YAML - contents can be added. `meta` contents are ignored by DVC, but they can be - useful for user processes that read `.dvc` files. +- `outs`: List of output entries for this `.dvc` file. Typically + there is only one (but several can be added manually). +- `deps` (optional): List of dependency entries for this stage, + only present when `dvc import` and `dvc import-url` are used. Typically there + is only one (but several can be added manually). +- `meta` (optional): Arbitrary metadata can be added manually with this field. + Any YAML contents is supported. `meta` contents are ignored by DVC, but they + can be meaningful for user processes that read `.dvc` files. -An output entry can consist of these fields: +An _output entry_ can consist of these fields: - `md5` (optional): Hash value for the output file - `path`: Path to the output in the workspace, relative to the @@ -58,7 +60,7 @@ An output entry can consist of these fields: - `cache` (optional): Whether or not DVC should cache the output. 
`true` by default -A `.dvc` file dependency entry consists of a these possible fields: +A _dependency entry_ consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5` (optional): MD5 hash for the dependency (most @@ -90,19 +92,29 @@ created or updated. Here's a simple example: ```yaml stages: - stageone: - cmd: python cmd.py input.data output.data metrics.json + features: + cmd: jupyter nbconvert --execute featurize.ipynb deps: - - cmd.py - - input.data + - data/clean + params: + - levels.no outs: - - output.data + - features metrics: - - metrics.json - stagetwo: - cmd: python ... - meta: '2nd stage' # User metadata and comments are supported. - deps: ... + - performance.json + training: + cmd: python train.py + deps: + - train.py + - features + outs: + - model.pkl + plots: + - logs.csv: + x: epoch + x_label: Epoch + meta: 'For deployment' + # User metadata and comments are supported. ``` `dvc.yaml` files consists of a group of `stages` with names provided explicitly @@ -113,37 +125,27 @@ they can contain more information in `dvc.yaml` These are the possible following fields: - `cmd`: Executable command defined in this stage -- `deps` (optional): List of dependencies for this stage -- `outs` (optional): List of outputs for this stage -- `params` (optional): List of the [parameter](/doc/command-reference/params) - names and their current values -- `metrics` (optional): List of [metrics](/doc/command-reference/metrics) -- `plots` (optional): List of [plot metrics](/doc/command-reference/plots) +- `deps` (optional): List of dependency file or directory paths of + this stage +- `params` (optional): List of the [parameters](/doc/command-reference/params). + These are key paths referring to another YAML file (`params.yaml` by default). 
+- `outs` (optional): List of output file or directory paths of this + stage +- `metrics` (optional): List of [metric files](/doc/command-reference/metrics) +- `plots` (optional): List of [plot metrics](/doc/command-reference/plots) and + optionally, their default configuration (subfields matching the options of + `dvc plots modify`). - `frozen` (optional): Whether or not this stage is frozen from reproduction - `always_changed` (optional): Whether or not this stage is considered as changed by commands such as `dvc status` and `dvc repro`. `false` by default -- `meta` (optional): Arbitrary information can be added here manually. Any YAML - contents can be added. `meta` contents are ignored by DVC, but they can be - useful for user processes that read `.dvc` files. - -> `meta` fields and `#` comments are always preserved in `dvc.yaml` stages. - -An output entry (`outs`, `metrics`, or `plots`) consists of these fields: - -- `md5` (optional): Hash value for the output file -- `path`: Path to the output in the workspace, relative to the - location of the `.dvc` file -- `cache` (optional): Whether or not DVC should cache the output. `true` by - default - -A `dvc.yaml` dependency entry consists of a these possible fields: - -- `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5` (optional): MD5 hash for the dependency (most - [stages](/doc/command-reference/run)) +- `meta` (optional): Arbitrary metadata can be added manually with this field. + Any YAML contents is supported. `meta` contents are ignored by DVC, but they + can be meaningful for user processes that read `.dvc` files. `dvc.yaml` files also support `# comments`. +> `meta` fields and `#` comments are always preserved in `dvc.yaml` stages. + ## Internal directories and files - `.dvc/config`: This is a configuration file. 
The config file can be edited by From 137f044f23a6e8b26ba76f4aa8e6d81a3dcc44b5 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Jun 2020 06:54:02 -0500 Subject: [PATCH 30/36] user guide: rename dvc.yaml file section --- content/docs/api-reference/get_url.md | 2 +- content/docs/command-reference/add.md | 2 +- content/docs/command-reference/fetch.md | 4 ++-- content/docs/command-reference/get.md | 2 +- content/docs/command-reference/import-url.md | 6 +++--- content/docs/command-reference/import.md | 2 +- content/docs/command-reference/init.md | 4 ++-- content/docs/command-reference/list.md | 2 +- content/docs/command-reference/metrics/diff.md | 2 +- content/docs/command-reference/metrics/index.md | 2 +- content/docs/command-reference/plots/diff.md | 4 ++-- content/docs/command-reference/plots/modify.md | 2 +- content/docs/command-reference/plots/show.md | 5 ++--- content/docs/command-reference/pull.md | 4 ++-- content/docs/command-reference/push.md | 2 +- content/docs/command-reference/remove.md | 2 +- content/docs/command-reference/status.md | 4 ++-- content/docs/install/plugins.md | 2 +- content/docs/tutorials/get-started/data-pipelines.md | 2 +- content/docs/user-guide/basic-concepts/dependency.md | 2 +- content/docs/user-guide/basic-concepts/dvc-project.md | 2 +- .../docs/user-guide/basic-concepts/external-dependency.md | 2 +- content/docs/user-guide/basic-concepts/output.md | 2 +- content/docs/user-guide/dvc-files-and-directories.md | 4 ++-- 24 files changed, 33 insertions(+), 34 deletions(-) diff --git a/content/docs/api-reference/get_url.md b/content/docs/api-reference/get_url.md index 5114f00a83..e218d32c4a 100644 --- a/content/docs/api-reference/get_url.md +++ b/content/docs/api-reference/get_url.md @@ -30,7 +30,7 @@ specified by its `path` in a `repo` (DVC project), is stored. 
The URL is formed by reading the project's [remote configuration](/doc/command-reference/config#remote) and the -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) where the given `path` is found (`outs` field). The schema of the URL returned depends on the [type](/doc/command-reference/remote/add#supported-storage-types) of the diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index dea8b81016..dc0db7186d 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -62,7 +62,7 @@ easily tracked with Git. > Note that `.dvc` files can be considered _orphan stages_, because they have no > dependencies, only outputs. These are treated as _always changed_ > by `dvc status` and `dvc repro`, which always executes them. See -> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) to learn +> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) to learn > more about stages. To avoid adding files inside a directory accidentally, you can add the diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index cd1f5c7ebd..a6315a3d66 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -23,7 +23,7 @@ of the project, but without placing them in the workspace. This makes the data files available for linking (or copying) into the workspace. (Refer to [dvc config cache.type](/doc/command-reference/config#cache).) 
Along with `dvc checkout`, it's performed automatically by `dvc pull` when the target -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files are not already in the cache: @@ -196,7 +196,7 @@ Note that the `.dvc/cache` directory was created and populated. > for more info. Used without arguments (as above), `dvc fetch` downloads all assets needed by -all [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +all [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files in the current branch, including for directories. The hash values `3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 77efa05ebf..8fcf2e203d 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -40,7 +40,7 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. Note that DVC-tracked targets should be found in a -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file of the project. 
diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 22794916f1..eccb9df27c 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -106,9 +106,9 @@ $ dvc run -d https://example.com/path/to/data.csv \ `dvc import-url` generates an import stage [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) and `dvc run` a regular stage (in -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files)). Both -have an external dependency, but the one created by `dvc import-url` preserves -the connection to the data source. We call this an _import stage_. +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)). Both have +an external dependency, but the one created by `dvc import-url` preserves the +connection to the data source. We call this an _import stage_. Note that import stages are considered always [frozen](/doc/command-reference/freeze), meaning that if you run `dvc repro`, diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index e289bd7d68..8ad0f90cf4 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -44,7 +44,7 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. Note that DVC-tracked targets should be found in a -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file of the project. 
diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index f26d92ba6a..4965558e99 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -56,7 +56,7 @@ sub-projects to mitigate the issues of initializing in the Git repository root: - Not enough isolation/granularity - commands like `dvc pull`, `dvc checkout`, and others analyze the whole repository to look for - [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or + [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files to download files and directories, to reproduce pipelines, etc. It can be expensive in the large repositories with a lot of projects. @@ -127,7 +127,7 @@ include: - SCM other than Git is being used. Even though there are DVC features that require DVC to be run in the Git repo, DVC can work well with other version control systems. Since DVC relies on simple - [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) files to + [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) files to manage pipelines, data, etc, they can be added into any SCM thus providing large data files and directories versioning. diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index f62f6e0687..ebf7415a07 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -20,7 +20,7 @@ DVC, by effectively replacing data files, models, directories with `.dvc` files (`.dvc`), hides actual locations and names. This means that you don't see data files when you browse a DVC repository on Git hosting (e.g. 
Github), you just see the -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files. This makes it hard to navigate the project to find data artifacts for use with `dvc get`, `dvc import`, or `dvc.api`. diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index 178688bb84..d904c2294e 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -33,7 +33,7 @@ difference (delta) from the previous value of metrics (rounded to 5 digits precision). They're calculated between two commits (hash, branch, tag, or any [Git revision](https://git-scm.com/docs/revisions)) for all metrics in the project, found by examining all of the -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files in both versions. diff --git a/content/docs/command-reference/metrics/index.md b/content/docs/command-reference/metrics/index.md index 2d271b043b..2e27c1f626 100644 --- a/content/docs/command-reference/metrics/index.md +++ b/content/docs/command-reference/metrics/index.md @@ -65,7 +65,7 @@ stages: > `cache: false` above specifies that `summary.json` is not tracked or > cached by DVC (`-M` option of `dvc run`). These metric files are > normally committed with Git instead. See -> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) for more +> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) for more > information on the file format above. 
### Supported file formats diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index 621a282f71..632bb8bc52 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -46,8 +46,8 @@ please see `dvc plots`. ## Options - `--targets ` - specific metric files to visualize. These must be listed - in a [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) - file (see the `--plots` option of `dvc run`). + in a [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) file + (see the `--plots` option of `dvc run`). - `-o , --out ` - name of the generated file. By default, the output file name is equal to the input filename with a `.html` file extension (or diff --git a/content/docs/command-reference/plots/modify.md b/content/docs/command-reference/plots/modify.md index 376b90fae3..4bdeb75cde 100644 --- a/content/docs/command-reference/plots/modify.md +++ b/content/docs/command-reference/plots/modify.md @@ -23,7 +23,7 @@ plots are generated with `dvc plot show` or `dvc plot diff`. This command sets (or unsets) default display properties for a specific metrics file. The path to the metrics file `target` is required. It must be listed in a -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) file (see +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) file (see the `--plots` option of `dvc run`). `dvc plots modify` adds the display properties to `dvc.yaml`. diff --git a/content/docs/command-reference/plots/show.md b/content/docs/command-reference/plots/show.md index ba107e0f37..e4048c16c6 100644 --- a/content/docs/command-reference/plots/show.md +++ b/content/docs/command-reference/plots/show.md @@ -23,9 +23,8 @@ AUC curves, confusion matrices, etc. All plots defined in `dvc.yaml` are used by default. Optionally, specific metric file `targets` to show are accepted. 
These must be -listed in a -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) file (see -the `--plots` option of `dvc run`). +listed in a [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) +file (see the `--plots` option of `dvc run`). The plot style can be customized with [plot templates](/doc/command-reference/plots#plot-templates), using the diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 5b918f1470..159f393106 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -3,7 +3,7 @@ Download tracked files or directories from [remote storage](/doc/command-reference/remote) to the cache and workspace, based on the current -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files. ## Synopsis @@ -39,7 +39,7 @@ remote. With no arguments, just `dvc pull` or `dvc pull --remote `, it downloads only the files (or directories) missing from the workspace by searching all -stages in [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) +stages in [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files currently in the project. It will not download files associated with earlier commits in the repository (if using Git), nor will it download diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index a547da1354..788f53b4af 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -39,7 +39,7 @@ with `git commit` and `git push`). 
Under the hood a few actions are taken: - The push command by default uses all - [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and + [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) in the workspace. The command options listed below will either limit or expand the set of stages (in dvc.yaml) or `.dvc` files to consult. diff --git a/content/docs/command-reference/remove.md b/content/docs/command-reference/remove.md index ac912cca04..aa5ed00045 100644 --- a/content/docs/command-reference/remove.md +++ b/content/docs/command-reference/remove.md @@ -18,7 +18,7 @@ from the workspace. It takes one or more stage names (see `-n` option of `dvc run`) or [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) as target, removes all of its outputs (outs field), and optionally removes the stage entry -from [dvc.yaml](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or the +from [dvc.yaml](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or the `.dvc` file itself. Note that it does not remove files from the DVC cache or remote storage (see diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index d1bf33b27e..d5af083b71 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -35,7 +35,7 @@ options: | remote | `--cloud` | Comparisons are made between the cache, and the default remote, typically defined with `dvc remote --default`. | DVC determines which data and code files to compare by analyzing all stages (in -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) in the workspace (the `--all-branches` and `--all-tags` options compare multiple workspace versions). 
@@ -80,7 +80,7 @@ the changes (described below). - _new_: An output is found in the workspace, but there is no corresponding file hash saved in the - [`dvc.lock`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or + [`dvc.lock`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file yet. - _modified_: An output or dependency is found in the workspace, but the corresponding file hash in the `dvc.lock` or `.dvc` file is not up diff --git a/content/docs/install/plugins.md b/content/docs/install/plugins.md index e8846dc727..d1d6b2f8e1 100644 --- a/content/docs/install/plugins.md +++ b/content/docs/install/plugins.md @@ -2,7 +2,7 @@ When you add a file or a stage to your pipeline, DVC creates a special [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) or -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) file +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) file (respectively) that contains all the needed information to track your data and transformations. diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index df84bbe08a..6c0be3f449 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -67,7 +67,7 @@ $ dvc run -f prepare.dvc \ python src/prepare.py data/data.xml data/prepared ``` -A [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) is +A [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) is generated. It includes information about the command we ran (`python src/prepare.py`), its dependencies, and outputs. 
diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md index 6cbe8d3891..7f00f9f5e9 100644 --- a/content/docs/user-guide/basic-concepts/dependency.md +++ b/content/docs/user-guide/basic-concepts/dependency.md @@ -5,6 +5,6 @@ match: [dependency, dependencies] A file or directory (possibly tracked by DVC) recorded in the `deps` section of a stage (in -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files)) or +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)) or [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) file. See `dvc run`. Stages are invalidated when any of their dependencies change. diff --git a/content/docs/user-guide/basic-concepts/dvc-project.md b/content/docs/user-guide/basic-concepts/dvc-project.md index 4c44008915..f395861a78 100644 --- a/content/docs/user-guide/basic-concepts/dvc-project.md +++ b/content/docs/user-guide/basic-concepts/dvc-project.md @@ -16,6 +16,6 @@ match: Initialized by running `dvc init` in the **workspace** (typically a Git repository). It will contain the [`.dvc/` directory](/doc/user-guide/dvc-files-and-directories), as well as -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) and +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files created with commands such as `dvc add` or `dvc run`. 
diff --git a/content/docs/user-guide/basic-concepts/external-dependency.md b/content/docs/user-guide/basic-concepts/external-dependency.md index 373c3006cb..9f30fef770 100644 --- a/content/docs/user-guide/basic-concepts/external-dependency.md +++ b/content/docs/user-guide/basic-concepts/external-dependency.md @@ -4,7 +4,7 @@ match: ['external dependency', 'external dependencies'] --- A stage dependency (`deps` field in -[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files) or in an +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or in an [import stage](/doc/command-reference/import) `.dvc` file) with origin in an external source, for example HTTP, SSH, Amazon S3, Google Cloud Storage remote locations, or even other DVC repositories. See diff --git a/content/docs/user-guide/basic-concepts/output.md b/content/docs/user-guide/basic-concepts/output.md index 8636d76eef..d190c8fe14 100644 --- a/content/docs/user-guide/basic-concepts/output.md +++ b/content/docs/user-guide/basic-concepts/output.md @@ -4,7 +4,7 @@ match: [output, outputs] --- A file or directory tracked by DVC, recorded in the `outs` section of a stage -(in [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-files)) or +(in [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)) or [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files). Outputs are usually the result of stages. See `dvc add`, `dvc run`, `dvc import`, et al. A.k.a. _data artifact_ diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 0b9fd78893..40df23ad78 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -12,7 +12,7 @@ Additionally, there are two special kind of files created by certain and directories. A DVC project usually has one [`.dvc` file](#dvc-files) per large data file or dataset directory being tracked. 
-- The [`dvc.yaml` file](#dvcyaml-files) or _pipeline(s) file_ specifies stages +- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages that form the pipeline(s) of a project, and their connections (_dependency graph_ or DAG). @@ -20,7 +20,7 @@ Both use human-friendly YAML schemas, described below. We encourage you to get familiar with them so you may edit them freely, as needed. Both type of files should be versioned with Git (for Git-enabled repositories). -## .dvc files +## .dvc file When you add a file or directory to a DVC project with `dvc add` or `dvc import`, a `.dvc` file is created based on the data file name (e.g. From ed786b077731e2317c7d896db6a4ccb7ea7a07e7 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Jun 2020 14:46:03 -0500 Subject: [PATCH 31/36] user guide: remove "optional" from files&dirs guide --- .../user-guide/dvc-files-and-directories.md | 55 ++++++++----------- 1 file changed, 24 insertions(+), 31 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 40df23ad78..da2e444778 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -43,31 +43,29 @@ meta: `.dvc` files can contain the following fields: -- `outs`: List of output entries for this `.dvc` file. Typically - there is only one (but several can be added manually). -- `deps` (optional): List of dependency entries for this stage, - only present when `dvc import` and `dvc import-url` are used. Typically there - is only one (but several can be added manually). +- `outs` (always present): List of output entries for this `.dvc` + file. Typically there is only one (but several can be added manually). +- `deps`: List of dependency entries for this stage, only present + when `dvc import` and `dvc import-url` are used. Typically there is only one + (but several can be added manually). 
- `meta` (optional): Arbitrary metadata can be added manually with this field. Any YAML contents is supported. `meta` contents are ignored by DVC, but they can be meaningful for user processes that read `.dvc` files. An _output entry_ can consist of these fields: -- `md5` (optional): Hash value for the output file +- `md5`: Hash value for the output file - `path`: Path to the output in the workspace, relative to the location of the `.dvc` file -- `cache` (optional): Whether or not DVC should cache the output. `true` by - default +- `cache`: Whether or not DVC should cache the output. `true` by default A _dependency entry_ consists of a these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5` (optional): MD5 hash for the dependency (most - [stages](/doc/command-reference/run)) -- `etag` (optional): Strong ETag response header (only HTTP external +- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) +- `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) -- `repo` (optional): This entry is only for external dependencies created with +- `repo`: This entry is only for external dependencies created with `dvc import`, and can contains the following fields: - `url`: URL of Git repository with source DVC project @@ -118,26 +116,21 @@ stages: ``` `dvc.yaml` files consists of a group of `stages` with names provided explicitly -by the user with the `--name` (`-n`) option of `dvc run`. - -Each stage's contents are similar to individual [`dvc` files](#dvcfiles) but -they can contain more information in `dvc.yaml` These are the possible following -fields: - -- `cmd`: Executable command defined in this stage -- `deps` (optional): List of dependency file or directory paths of - this stage -- `params` (optional): List of the [parameters](/doc/command-reference/params). - These are key paths referring to another YAML file (`params.yaml` by default). 
-- `outs` (optional): List of output file or directory paths of this - stage -- `metrics` (optional): List of [metric files](/doc/command-reference/metrics) -- `plots` (optional): List of [plot metrics](/doc/command-reference/plots) and - optionally, their default configuration (subfields matching the options of +by the user with the `--name` (`-n`) option of `dvc run`. Each stage can contain +the possible following fields: + +- `cmd` (always present): Executable command defined in this stage +- `deps`: List of dependency file or directory paths of this stage +- `params`: List of the [parameters](/doc/command-reference/params). These are + key paths referring to another YAML file (`params.yaml` by default). +- `outs`: List of output file or directory paths of this stage +- `metrics`: List of [metric files](/doc/command-reference/metrics) +- `plots`: List of [plot metrics](/doc/command-reference/plots) and optionally, + their default configuration (subfields matching the options of `dvc plots modify`). -- `frozen` (optional): Whether or not this stage is frozen from reproduction -- `always_changed` (optional): Whether or not this stage is considered as - changed by commands such as `dvc status` and `dvc repro`. `false` by default +- `frozen`: Whether or not this stage is frozen from reproduction +- `always_changed`: Whether or not this stage is considered as changed by + commands such as `dvc status` and `dvc repro`. `false` by default - `meta` (optional): Arbitrary metadata can be added manually with this field. Any YAML contents is supported. `meta` contents are ignored by DVC, but they can be meaningful for user processes that read `.dvc` files. 
From c69e0086b55a5ec79ca45361577fa05ab83d2ed6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Jun 2020 15:08:43 -0500 Subject: [PATCH 32/36] user guide: make meta and ocmment notes part of prev p per https://github.com/iterative/dvc.org/pull/1370#discussion_r439842062 --- .../docs/user-guide/dvc-files-and-directories.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index da2e444778..c986246f3c 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -77,11 +77,9 @@ A _dependency entry_ consists of a these possible fields: the time of importing or updating (with `dvc update`) the dependency. Note that comments can be added to `.dvc` files and `dvc.yaml` using the -`# comment` syntax. - -> `meta` fields and `#` comments are preserved among executions of the -> `dvc repro` and `dvc commit` commands, but not when a `.dvc` file is -> overwritten by `dvc add`,`dvc import`, or `dvc import-url`. +`# comment` syntax. `meta` fields and `#` comments are preserved among +executions of the `dvc repro` and `dvc commit` commands, but not when a `.dvc` +file is overwritten by `dvc add`,`dvc import`, or `dvc import-url`. ## dvc.yaml files @@ -135,9 +133,8 @@ the possible following fields: Any YAML contents is supported. `meta` contents are ignored by DVC, but they can be meaningful for user processes that read `.dvc` files. -`dvc.yaml` files also support `# comments`. - -> `meta` fields and `#` comments are always preserved in `dvc.yaml` stages. +`dvc.yaml` files also support `# comments`. `meta` fields and `#` comments are +always preserved in `dvc.yaml` stages. 
## Internal directories and files

From e1db7e135728accfe11ad5f6219e7163cd4f0a62 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 14 Jun 2020 16:02:57 -0500
Subject: [PATCH 33/36] user guide: remove some periods

---
 content/docs/user-guide/dvc-files-and-directories.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md
index c986246f3c..9e037175fe 100644
--- a/content/docs/user-guide/dvc-files-and-directories.md
+++ b/content/docs/user-guide/dvc-files-and-directories.md
@@ -131,7 +131,7 @@ the possible following fields:
   commands such as `dvc status` and `dvc repro`. `false` by default
 - `meta` (optional): Arbitrary metadata can be added manually with this field.
   Any YAML contents is supported. `meta` contents are ignored by DVC, but they
-  can be meaningful for user processes that read `.dvc` files.
+  can be meaningful for user processes that read or write `.dvc` files directly.
 
 `dvc.yaml` files also support `# comments`. `meta` fields and `#` comments are
 always preserved in `dvc.yaml` stages.
@@ -160,12 +160,12 @@ always preserved in `dvc.yaml` stages.
   > are needed to download or reproduce them.
 
 - `.dvc/plots`: Directory for
-  [Plot templates](/doc/command-reference/plots#plot-templates).
+  [Plot templates](/doc/command-reference/plots#plot-templates)
 
 - `.dvc/tmp`: Directory for miscellaneous temporary files
 
 - `.dvc/tmp/index`: Directory for remote index files that are used for
-  optimizing `dvc push`, `dvc pull`, `dvc fetch` and `dvc status -c` operations.
+  optimizing `dvc push`, `dvc pull`, `dvc fetch` and `dvc status -c` operations
 
 - `.dvc/tmp/state`: This file is used for optimization. It is a SQLite database,
   that contains hash values for files tracked in a DVC project, with respective
@@ -187,7 +187,7 @@ always preserved in `dvc.yaml` stages.
 
 - `.dvc/tmp/rwlock`: JSON file that contains read and write locks for specific
   dependencies and outputs, to allow safely running multiple DVC commands in
-  parallel.
+  parallel
 
 ## Structure of cache directory
 

From d4705d637b8d31c14e2066ce5fe29eafde417df0 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 14 Jun 2020 18:26:48 -0500
Subject: [PATCH 34/36] user guide: don't use term "output" so much per https://github.com/iterative/dvc.org/pull/1370#pullrequestreview-430229304

---
 .../docs/user-guide/dvc-files-and-directories.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md
index 9e037175fe..4a9059d44c 100644
--- a/content/docs/user-guide/dvc-files-and-directories.md
+++ b/content/docs/user-guide/dvc-files-and-directories.md
@@ -43,8 +43,9 @@ meta:
 
 `.dvc` files can contain the following fields:
 
-- `outs` (always present): List of output entries for this `.dvc`
-  file. Typically there is only one (but several can be added manually).
+- `outs` (always present): List of output entries that represent
+  the files or directories tracked with DVC. Typically there is only one per
+  `.dvc` file (but several can be added or combined manually).
 - `deps`: List of dependency entries for this stage, only present when
   `dvc import` and `dvc import-url` are used. Typically there is only one (but
   several can be added manually).
@@ -54,10 +55,11 @@ meta:
 
 An _output entry_ can consist of these fields:
 
-- `md5`: Hash value for the output file
-- `path`: Path to the output in the workspace, relative to the
-  location of the `.dvc` file
-- `cache`: Whether or not DVC should cache the output. `true` by default
+- `md5`: Hash value for the file or directory being tracked with DVC
+- `path`: Path to the file or directory, relative to the location of the `.dvc`
+  file
+- `cache`: Whether or not DVC should cache the file or directory. `true` by
+  default
 
 A _dependency entry_ consists of a these possible fields:
 

From ac556c30bd6d62ecdc2be69934a8dca247d8efac Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 14 Jun 2020 18:42:35 -0500
Subject: [PATCH 35/36] user guide: few more improvements for iterative/dvc.org/pull/1370

---
 content/docs/user-guide/dvc-files-and-directories.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md
index 4a9059d44c..cc0a8c26a7 100644
--- a/content/docs/user-guide/dvc-files-and-directories.md
+++ b/content/docs/user-guide/dvc-files-and-directories.md
@@ -121,8 +121,9 @@ the possible following fields:
 
 - `cmd` (always present): Executable command defined in this stage
 - `deps`: List of dependency file or directory paths of this stage
-- `params`: List of the [parameters](/doc/command-reference/params). These are
-  key paths referring to another YAML file (`params.yaml` by default).
+- `params`: List of [parameter dependencies](/doc/command-reference/params).
+  These are key paths referring to a YAML or JSON file (`params.yaml` by
+  default).
 - `outs`: List of output file or directory paths of this stage
 - `metrics`: List of [metric files](/doc/command-reference/metrics)
 - `plots`: List of [plot metrics](/doc/command-reference/plots) and optionally,
@@ -135,8 +136,7 @@ the possible following fields:
   Any YAML contents is supported. `meta` contents are ignored by DVC, but they
   can be meaningful for user processes that read or write `.dvc` files directly.
 
-`dvc.yaml` files also support `# comments`. `meta` fields and `#` comments are
-always preserved in `dvc.yaml` stages.
+`dvc.yaml` files also support `# comments`.
 
 ## Internal directories and files
 
@@ -162,7 +162,7 @@ always preserved in `dvc.yaml` stages.
   > are needed to download or reproduce them.
 
 - `.dvc/plots`: Directory for
-  [Plot templates](/doc/command-reference/plots#plot-templates)
+  [plot templates](/doc/command-reference/plots#plot-templates)
 
 - `.dvc/tmp`: Directory for miscellaneous temporary files
 

From cd7bca056c78e554836afc07c3fcb642550169fa Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 14 Jun 2020 18:47:53 -0500
Subject: [PATCH 36/36] user guide: more unnecessary periods removed per https://github.com/iterative/dvc.org/pull/1370#discussion_r439874457

---
 content/docs/user-guide/dvc-files-and-directories.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md
index cc0a8c26a7..ec68b5fd00 100644
--- a/content/docs/user-guide/dvc-files-and-directories.md
+++ b/content/docs/user-guide/dvc-files-and-directories.md
@@ -76,7 +76,7 @@ A _dependency entry_ consists of a these possible fields:
     [Git revision](https://git-scm.com/docs/revisions)) used to import the
     dependency from.
   - `rev_lock`: Git commit hash of the external DVC repository at
-    the time of importing or updating (with `dvc update`) the dependency.
+    the time of importing or updating the dependency (with `dvc update`)
 
 Note that comments can be added to `.dvc` files and `dvc.yaml` using the
 `# comment` syntax. `meta` fields and `#` comments are preserved among
@@ -128,7 +128,7 @@ the possible following fields:
 - `metrics`: List of [metric files](/doc/command-reference/metrics)
 - `plots`: List of [plot metrics](/doc/command-reference/plots) and optionally,
   their default configuration (subfields matching the options of
-  `dvc plots modify`).
+  `dvc plots modify`)
 - `frozen`: Whether or not this stage is frozen from reproduction
 - `always_changed`: Whether or not this stage is considered as changed by
   commands such as `dvc status` and `dvc repro`. `false` by default
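
Note on the `params` wording in PATCH 35/36: the field is described as a list of key paths referring to a YAML or JSON file. The sketch below shows one way such a key path could be resolved against a parsed params file. It is not DVC's implementation: the dot-separated path syntax, the helper name `resolve_key_path`, and the sample file contents are all assumptions made for illustration, and JSON is used only so that the example needs nothing beyond the Python standard library.

```python
import json
from functools import reduce


def resolve_key_path(params, key_path):
    """Follow a dot-separated key path (assumed syntax) through nested dicts."""
    # Each path segment indexes one level deeper; a missing key raises KeyError.
    return reduce(lambda node, key: node[key], key_path.split("."), params)


# Hypothetical contents of a JSON params file.
params = json.loads('{"train": {"epochs": 10, "lr": 0.001}, "seed": 42}')

epochs = resolve_key_path(params, "train.epochs")
seed = resolve_key_path(params, "seed")
print(epochs, seed)
```

A stage that lists `train.epochs` as a parameter dependency would then depend on that single value rather than on the whole file, which is the point of describing the entries as key paths.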