diff --git a/content/docs/api-reference/get_url.md b/content/docs/api-reference/get_url.md index 003b85ad45..e218d32c4a 100644 --- a/content/docs/api-reference/get_url.md +++ b/content/docs/api-reference/get_url.md @@ -30,10 +30,10 @@ specified by its `path` in a `repo` (DVC project), is stored. The URL is formed by reading the project's [remote configuration](/doc/command-reference/config#remote) and the -[`dvc.yaml`](/doc/user-guide/dvc-file-format) or -[`.dvc` file](/doc/user-guide/dvc-file-format) where the given `path` is found -(`outs` field). The URL schema returned depends on the -[type](/doc/command-reference/remote/add#supported-storage-types) of the +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) where the +given `path` is found (`outs` field). The schema of the URL returned depends on +the [type](/doc/command-reference/remote/add#supported-storage-types) of the `remote` used (see the [Parameters](#parameters) section). If the target is a directory, the returned URL will end in `.dir`. Refer to diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index bfaa51b274..dc0db7186d 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -1,7 +1,7 @@ # add Track data files or directories with DVC, by creating a corresponding -[`.dvc` file](/doc/user-guide/dvc-file-format). +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files). ## Synopsis @@ -17,7 +17,8 @@ positional arguments: The `dvc add` command is analogous to `git add`, in that it makes DVC aware of the target data, in order to start versioning it. It creates a -[`.dvc` file](/doc/user-guide/dvc-file-format) to track the added data. +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) to track the +added data. 
This command can be used to [version control](/doc/use-cases/versioning-data-and-model-files) large files, @@ -31,8 +32,9 @@ The `targets` are the files or directories to add, which are turned into > See also `dvc run` for more advanced ways to version intermediate and final > results (like ML models). -Under the hood, a few actions are taken for each file (or directory) in -`targets`: +After checking that each `target` file (or directory) hasn't been added before +(or tracked with other DVC commands), a few actions are taken under the hood for +each one: 1. Calculate the file hash. 2. Move the file contents to the cache (by default in `.dvc/cache`), using the @@ -41,25 +43,27 @@ Under the hood, a few actions are taken for each file (or directory) in for more details.) 3. Attempt to replace the file with a link to the cached data (more details on file linking further down). -4. Create a corresponding [`.dvc` file](/doc/user-guide/dvc-file-format) to - track the file, using its path and hash to identify the cached data. The - `.dvc` file lists the DVC-tracked file as an output (`outs` - field). Unless the `-f` option is used, the `.dvc` file name generated by - default is `.dvc`, where `` is the file name of the first target. +4. Create a corresponding + [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) to track + the file, using its path and hash to identify the cached data. The `.dvc` + file lists the DVC-tracked file as an output (`outs` field). + Unless the `-f` option is used, the `.dvc` file name generated by default is + `.dvc`, where `` is the file name of the first target. 5. Add the `targets` to `.gitignore` in order to prevent them from being committed to the Git repository (unless `dvc init --no-scm` was used when initializing the DVC project). 6. Instructions are printed showing `git` commands for adding the files, if appropriate. 
-Summarizing, the result is that the target data is replaced by small `.dvc` -files that can easily be tracked with Git. See -[DVC-File Format](/doc/user-guide/dvc-file-format) for more details. +Summarizing, the result is that the target data is replaced by small +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) that can be +easily tracked with Git. > Note that `.dvc` files can be considered _orphan stages_, because they have no > dependencies, only outputs. These are treated as _always changed_ > by `dvc status` and `dvc repro`, which always executes them. See -> [`dvc.yaml`](/doc/user-guide/dvc-file-format) to learn more about stages. +> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) to learn +> more about stages. To avoid adding files inside a directory accidentally, you can add the corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file. @@ -74,8 +78,8 @@ large files. DVC also supports other link types for use on file systems without ### Tracking directories A `dvc add` target can be an individual file or a directory. In the latter case, -a [`.dvc` file](/doc/user-guide/dvc-file-format) is created for the top of the -directory (with default name `.dvc`). +a [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) is created +for the top of the directory (with default name `.dvc`). Every file in the hierarchy is added to the cache (unless the `--no-commit` option is used), but DVC does not produce individual `.dvc` files for each file @@ -135,7 +139,8 @@ To track the changes with git, run: git add .gitignore data.xml.dvc ``` -As indicated above, a [`.dvc` file](/doc/user-guide/dvc-file-format) has been +As indicated above, a +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) has been created for `data.xml`. 
Let's explore the result: ```dvc @@ -187,10 +192,10 @@ Tracking a directory with DVC as simple as with a single file: $ dvc add pics ``` -There are no [`.dvc` files](/doc/user-guide/dvc-file-format) generated within -this directory structure to match each images, but the image files are all -cached. A single `pics.dvc` file is generated for the top-level -directory, and it contains: +There are no [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) +generated within this directory structure to match each image, but the image +files are all cached. A single `pics.dvc` file is generated for the +top-level directory, and it contains: ```yaml outs: diff --git a/content/docs/command-reference/cache/index.md b/content/docs/command-reference/cache/index.md index d8e565a806..07c8fb54c6 100644 --- a/content/docs/command-reference/cache/index.md +++ b/content/docs/command-reference/cache/index.md @@ -17,8 +17,8 @@ positional arguments: At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories), that are -hidden from the user. +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), +that are hidden from the user. The cache is where your data files, models, etc. (anything you want to version with DVC) are actually stored. The corresponding files you see in the diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md index c5b085cbeb..6bb9512e92 100644 --- a/content/docs/command-reference/checkout.md +++ b/content/docs/command-reference/checkout.md @@ -17,9 +17,10 @@ positional arguments: ## Description -[DVC-files](/doc/user-guide/dvc-file-format) act as pointers to specific version -of data files or directories tracked by DVC. This command synchronizes the -workspace data with the versions specified in the current DVC-files. 
+[DVC-files](/doc/user-guide/dvc-files-and-directories) act as pointers to +specific version of data files or directories tracked by DVC. This command +synchronizes the workspace data with the versions specified in the current +DVC-files. `dvc checkout` is useful, for example, when using Git in the project, after `git clone`, `git checkout`, or any other operation diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 9fdbdbe85a..7b238473b7 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -1,8 +1,8 @@ # commit Record changes to DVC-tracked files in the project, by updating -[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to -the cache. +[DVC-files](/doc/user-guide/dvc-files-and-directories) and saving +outputs to the cache. ## Synopsis @@ -67,8 +67,8 @@ cache. This is where the `dvc commit` command comes into play. It performs that last step (saving the data in cache). Note that it's best to avoid the last two scenarios. They essentially -force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to -cache. They are still useful, but keep in mind that DVC can't guarantee +force-update the [DVC-files](/doc/user-guide/dvc-files-and-directories) and save +data to cache. They are still useful, but keep in mind that DVC can't guarantee reproducibility in those cases. ## Options @@ -227,7 +227,7 @@ the new instance of `model.pkl` is there. It is also possible to execute the commands that are executed by `dvc repro` by hand. You won't have DVC helping you, but you have the freedom to run any command you like, even ones not defined in a -[DVC-file](/doc/user-guide/dvc-file-format). For example: +[DVC-file](/doc/user-guide/dvc-files-and-directories). 
For example: ```dvc $ python src/featurization.py data/prepared data/features diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 7fa176bb43..b683cec054 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -179,8 +179,9 @@ for more details.) This section contains the following options: ### state -See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to -learn more about the state file (database) that is used for optimization. +See +[Internal directories and files](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +to learn more about the state file (database) that is used for optimization. - `state.row_limit` - maximum number of entries in the state database, which affects the physical size of the state file itself, as well as the performance diff --git a/content/docs/command-reference/destroy.md b/content/docs/command-reference/destroy.md index 533ba4f3b5..3b3c2a3830 100644 --- a/content/docs/command-reference/destroy.md +++ b/content/docs/command-reference/destroy.md @@ -12,14 +12,19 @@ usage: dvc destroy [-h] [-q | -v] [-f] ## Description -`dvc destroy` removes DVC-files, and the entire `.dvc/` meta directory from the -workspace. Note that the cache directory will normally -be removed as well, unless it's set to an external location with -`dvc cache dir`. (By default a local cache is located in the `.dvc/cache` -directory.) If you were using +`dvc destroy` removes `dvc.yaml`, `.dvc` files, and the internal `.dvc/` +directory from the workspace. + +Note that the cache directory will be removed as well, unless it's +[set to an external location](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +(by default a local cache is located in `.dvc/cache`). 
If you were using [symlinks for linking](/doc/user-guide/large-dataset-optimization) data from the -cache, DVC will replace them with copies, so that your data is intact after the -project's destruction. +cache, DVC will replace them with the latest versions of the actual files and +directories first, so that your data is intact after the project's destruction. + +> Refer to +> [DVC files and directories](/doc/user-guide/dvc-files-and-directories) for +> more details on the directories and files deleted by this command. ## Options @@ -94,8 +99,8 @@ $ ls -a .git code.py foo ``` -`dvc destroy` command removed DVC-files, and the entire `.dvc/` meta directory -from the workspace. But the cache files that are present in the +`dvc destroy` command removed DVC-files, and the internal `.dvc/` directory from +the workspace. But the cache files that are present in the `/mnt/cache` directory still persist: ```dvc diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 776cdd1d19..a6315a3d66 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -23,8 +23,9 @@ of the project, but without placing them in the workspace. This makes the data files available for linking (or copying) into the workspace. (Refer to [dvc config cache.type](/doc/command-reference/config#cache).) Along with `dvc checkout`, it's performed automatically by `dvc pull` when the target -[`dvc.yaml`](/doc/user-guide/dvc-file-format) or -[`.dvc`](/doc/user-guide/dvc-file-format) files are not already in the cache: +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files are not +already in the cache: ``` Controlled files Commands @@ -52,8 +53,7 @@ on DVC remotes.) 
These necessary data or model files are listed as required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the corresponding [pipeline](/doc/command-reference/pipeline). -`dvc fetch` ensures that the files needed for a -[stage](/doc/command-reference/run) or `.dvc` file to be +`dvc fetch` ensures that the files needed for a stage or `.dvc` file to be [reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in cache. If no `targets` are specified, the set of data files to fetch is determined by analyzing all `dvc.yaml` and `.dvc` files in the current branch, @@ -196,11 +196,11 @@ Note that the `.dvc/cache` directory was created and populated. > for more info. Used without arguments (as above), `dvc fetch` downloads all assets needed by -all [`dvc.yaml`](/doc/user-guide/dvc-file-format) and -[`.dvc`](/doc/user-guide/dvc-file-format) files in the current branch, including -for directories. The hash values `3863d0e317dee0a55c4e59d2ec0eef33` and -`42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and -`data/features/` directory, respectively. +all [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files in the +current branch, including for directories. The hash values +`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` +correspond to the `model.pkl` file and `data/features/` directory, respectively. Let's now link files from the cache to the workspace with: @@ -214,7 +214,8 @@ $ dvc checkout > follow this example if you tried the previous one (**Default behavior**). 
`dvc fetch` only downloads the data files of a specific stage when the -corresponding `.dvc` file (command target) is specified: +corresponding [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) +(command target) is specified: ```dvc $ dvc fetch prepare.dvc @@ -280,12 +281,12 @@ $ tree .dvc/cache ``` Fetching using `--with-deps` starts with the target -[`.dvc` file](/doc/user-guide/dvc-file-format) (`train.dvc` stage) and searches -backwards through its pipeline for data to download into the project's cache. -All the data for the second and third stages ("featurize" and "train") has now -been downloaded to the cache. We could now use `dvc checkout` to get the data -files needed to reproduce this pipeline up to the third stage into the workspace -(with `dvc repro train.dvc`). +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) (`train.dvc`) +and searches backwards through its pipeline for data to download into the +project's cache. All the data for the second and third stages ("featurize" and +"train") has now been downloaded to the cache. We could now use `dvc checkout` +to get the data files needed to reproduce this pipeline up to the third stage +into the workspace (with `dvc repro train.dvc`). > Note that in this example project, the last stage file `evaluate.dvc` doesn't > add any more data files than those form previous stages, so at this point all diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 62a0dc7821..8fcf2e203d 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -40,8 +40,9 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. 
Note that DVC-tracked targets should be found in a -[`dvc.yaml`](/doc/user-guide/dvc-file-format) or -[`.dvc`](/doc/user-guide/dvc-file-format) file of the project. +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file of the +project. ⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this @@ -183,8 +184,9 @@ get the most recent one, we use a similar command, but with `-o model.bigrams.pkl` and `--rev bigrams-experiment` (or even without `--rev` since that tag has the latest model version anyway). In fact, in this case using `dvc pull` with the corresponding -[`.dvc` files](/doc/user-guide/dvc-file-format) should suffice, downloading the -file as just `model.pkl`. We can then rename it to make its variant explicit: +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) should +suffice, downloading the file as just `model.pkl`. We can then rename it to make +its variant explicit: ```dvc $ dvc pull train.dvc diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 9842f5153b..eccb9df27c 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -3,7 +3,7 @@ Download a file or directory from a supported URL (for example `s3://`, `ssh://`, and other protocols) into the workspace, and track changes in the remote data source. Creates a -[`.dvc` file](/doc/user-guide/dvc-file-format). +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files). > See `dvc import` to download and tack data/model files or directories from > other DVC repositories (e.g. hosted on Github). @@ -42,8 +42,8 @@ while `out` can be used to specify the directory and/or file name desired for the downloaded data. If an existing directory is specified, the file or directory will be placed inside. 
-[`.dvc` files](/doc/user-guide/dvc-file-format) support references to data in an -external location, see +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) support +references to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such a `.dvc` file, the `deps` field stores the remote URL, and the `outs` field contains the corresponding local path in the workspace. It records enough @@ -104,10 +104,11 @@ $ dvc run -d https://example.com/path/to/data.csv \ ``` `dvc import-url` generates an import stage -[`.dvc` file](/doc/user-guide/dvc-file-format) and `dvc run` a regular stage (in -[`dvc.yaml`](/doc/user-guide/dvc-file-format)). Both have an external -dependency, but the one created by `dvc import-url` preserves the connection to -the data source. We call this an _import stage_. +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) and `dvc run` +a regular stage (in +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)). Both have +an external dependency, but the one created by `dvc import-url` preserves the +connection to the data source. We call this an _import stage_. Note that import stages are considered always [frozen](/doc/command-reference/freeze), meaning that if you run `dvc repro`, @@ -192,8 +193,8 @@ The `etag` field in the `.dvc` file contains the If the remote file changes, its ETag will be different. This metadata allows DVC to determine whether its necessary to download it again. -> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the -> text format above. +> See [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) for +> more details on the format above. You may want to get out of and remove the `example-get-started/` directory after trying this example (especially if trying out the following one). 
diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 160cfd11bd..8ad0f90cf4 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -2,9 +2,9 @@ Download a file or directory tracked by DVC or by Git into the workspace. It also creates a -[`.dvc` file](/doc/user-guide/dvc-file-format) with information about the data -source, which can later be used to [update](/doc/command-reference/update) the -import. +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) with +information about the data source, which can later be used to +[update](/doc/command-reference/update) the import. > See also our `dvc.api.open()` Python API function. @@ -44,8 +44,9 @@ The `path` argument is used to specify the location of the target to be downloaded within the source repository at `url`. `path` can specify any file or directory in the source repo, including those tracked by DVC, or by Git. Note that DVC-tracked targets should be found in a -[`dvc.yaml`](/doc/user-guide/dvc-file-format) or -[`.dvc`](/doc/user-guide/dvc-file-format) file of the project. +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file of the +project. ⚠️ The project should have a default [DVC remote](/doc/command-reference/remote), containing the actual data for this @@ -67,8 +68,7 @@ path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. -To actually -[track the data](https://dvc.org/doc/tutorials/get-started/data-versioning), +To actually [track the data](/doc/tutorials/get-started/data-versioning), `git add` (and `git commit`) the import stage. 
Note that import stages are considered always @@ -115,9 +115,10 @@ Importing 'data/data.xml (git@github.com:iterative/example-get-started)' In contrast with `dvc get`, this command doesn't just download the data file, but it also creates an import stage -([`.dvc` file](/doc/user-guide/dvc-file-format)) with a link to the data source -(as explained in the description above). (This import stage can later be used to -[update](/doc/command-reference/update) the import.) Check `data.xml.dvc`: +([`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files)) with a link +to the data source (as explained in the description above). (This import stage +can later be used to [update](/doc/command-reference/update) the import.) Check +`data.xml.dvc`: ```yaml md5: 7de90e7de7b432ad972095bc1f2ec0f8 @@ -155,8 +156,8 @@ Importing ``` When using this option, the import stage -([`.dvc` file](/doc/user-guide/dvc-file-format)) will also have a `rev` subfield -under `repo`: +([`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files)) will also +have a `rev` subfield under `repo`: ```yaml deps: @@ -187,9 +188,10 @@ If you take a look at our [dataset registry](https://github.com/iterative/dataset-registry) project, you'll see that it's organized into different directories such as `tutorial/ver` and `use-cases/`, and these contain -[`.dvc` files](/doc/user-guide/dvc-file-format) that track different datasets. -Given this simple structure, its data files can be easily shared among several -other projects using `dvc get` and `dvc import`. For example: +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) that track +different datasets. Given this simple structure, its data files can be easily +shared among several other projects using `dvc get` and `dvc import`. For +example: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ @@ -214,8 +216,7 @@ in the future, where and when needed. 
This is achieved with the `repo` field, for example (matching the import command above): ```yaml -md5: 96fd8e791b0ee4824fc1ceffd13b1b49 -locked: true +frozen: true deps: - path: use-cases/cats-dogs repo: @@ -225,8 +226,6 @@ outs: - md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir path: cats-dogs cache: true - metric: false - persist: false ``` See a full explanation in our [Data Registries](/doc/use-cases/data-registries) @@ -247,8 +246,8 @@ Importing ... > Note that Git-tracked files can be imported from DVC repos as well. The file is imported, and along with it, an import stage -([`.dvc` file](/doc/user-guide/dvc-file-format)) file is created. Check -`it-standards.csv.dvc`: +([`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files)) is created. +Check `it-standards.csv.dvc`: ```yaml deps: diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index c7113f35a4..4965558e99 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -24,9 +24,9 @@ advanced scenarios: At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories), that are -hidden from the user. This directory is automatically staged with `git add`, so -it can be easily committed with Git. +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), +that are hidden from the user. This directory is automatically staged with +`git add`, so it can be easily committed with Git. 
### Initializing DVC in subdirectories @@ -56,10 +56,10 @@ sub-projects to mitigate the issues of initializing in the Git repository root: - Not enough isolation/granularity - commands like `dvc pull`, `dvc checkout`, and others analyze the whole repository to look for - [`dvc.yaml`](/doc/user-guide/dvc-file-format) or - [`.dvc`](/doc/user-guide/dvc-file-format) files to download files and - directories, to reproduce pipelines, etc. It can be expensive in - the large repositories with a lot of projects. + [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or + [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files to + download files and directories, to reproduce pipelines, etc. It + can be expensive in the large repositories with a lot of projects. - Not enough isolation/granularity - commands like `dvc metrics diff`, `dvc pipeline show` and others by default dump all the metrics, all the @@ -127,8 +127,8 @@ include: - SCM other than Git is being used. Even though there are DVC features that require DVC to be run in the Git repo, DVC can work well with other version control systems. Since DVC relies on simple - [`dvc.yaml`](/doc/user-guide/dvc-file-format) files to manage - pipelines, data, etc, they can be added into any SCM thus + [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) files to + manage pipelines, data, etc, they can be added into any SCM thus providing large data files and directories versioning. - There is no need to keep the history at all, e.g. having a deployment diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index d554b12662..459ed7bbfb 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -22,10 +22,11 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present). 
Namely: **Checkout**: For any commit hash, branch or tag, `git checkout` retrieves the -[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The -project's DVC-files in turn refer to data stored in cache, but not -necessarily in the workspace. Normally, it would be necessary to -use `dvc checkout` to synchronize workspace and DVC-files. +[DVC-files](/doc/user-guide/dvc-files-and-directories) corresponding to that +version. The project's DVC-files in turn refer to data stored in +cache, but not necessarily in the workspace. Normally, +it would be necessary to use `dvc checkout` to synchronize workspace and +DVC-files. This hook automates `dvc checkout` after `git checkout`. @@ -153,7 +154,7 @@ $ dvc pull --all-branches --all-tags ## Example: Checkout both Git and DVC Switching from one Git commit to another (with `git checkout`) may change the -set of [DVC-files](/doc/user-guide/dvc-file-format) in the +set of [DVC-files](/doc/user-guide/dvc-files-and-directories) in the workspace. This would mean that the currently present data files and directories no longer matches project's version (which can be fixed with `dvc checkout`). @@ -206,9 +207,9 @@ project's cache and the data files currently in the workspace. Git changed the DVC-files in the workspace, which changed references to data files. `dvc status` first informed us that the data files in the workspace no longer matched the hash values in the corresponding -[DVC-files](/doc/user-guide/dvc-file-format). Running `dvc checkout` then brings -them up to date, and a second `dvc status` tells us that the data files now do -match the DVC-files. +[DVC-files](/doc/user-guide/dvc-files-and-directories). Running `dvc checkout` +then brings them up to date, and a second `dvc status` tells us that the data +files now do match the DVC-files. 
```dvc $ git checkout master diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index c53ce73c01..ebf7415a07 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -19,10 +19,11 @@ positional arguments: DVC, by effectively replacing data files, models, directories with `.dvc` files (`.dvc`), hides actual locations and names. This means that you don't see data files when you browse a DVC repository on Git hosting (e.g. -Github), you just see the [`dvc.yaml`](/doc/user-guide/dvc-file-format) and -[`.dvc`](/doc/user-guide/dvc-file-format) files. This makes it hard to navigate -the project to find data artifacts for use with `dvc get`, -`dvc import`, or `dvc.api`. +Github), you just see the +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files. This makes +it hard to navigate the project to find data artifacts for use with +`dvc get`, `dvc import`, or `dvc.api`. `dvc list` prints a virtual view of a DVC repository, as if files and directories [tracked by DVC](/doc/use-cases/versioning-data-and-model-files) diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index 181d456e69..d904c2294e 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -29,12 +29,13 @@ Run without arguments, this command compares metrics currently present in the workspace uncommitted changes) with the latest committed version. The differences shown by this command include the new value, and numeric -difference (delta) from the previous value of metrics. All values and the delta -are [round](https://docs.python.org/3/library/functions.html#round)ed to 5 -digits precision after the decimal point. 
They're calculated between two commits -(hash, branch, tag, or any [Git revision](https://git-scm.com/docs/revisions)) -for all metrics in the project, found by examining all of the -[DVC-files](/doc/user-guide/dvc-file-format) in both references. +difference (delta) from the previous value of metrics (rounded to 5 digits +precision). They're calculated between two commits (hash, branch, tag, or any +[Git revision](https://git-scm.com/docs/revisions)) for all metrics in the +project, found by examining all of the +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files in both +versions. Another way to display metrics is the `dvc metrics show` command, which just lists all the current metrics without comparisons. diff --git a/content/docs/command-reference/metrics/index.md b/content/docs/command-reference/metrics/index.md index 3624a8d245..2e27c1f626 100644 --- a/content/docs/command-reference/metrics/index.md +++ b/content/docs/command-reference/metrics/index.md @@ -65,8 +65,8 @@ stages: > `cache: false` above specifies that `summary.json` is not tracked or > cached by DVC (`-M` option of `dvc run`). These metric files are > normally committed with Git instead. See -> [`dvc.yaml`](/doc/user-guide/dvc-file-format) for more information on the file -> format above. +> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) for more +> information on the file format above. ### Supported file formats diff --git a/content/docs/command-reference/metrics/show.md b/content/docs/command-reference/metrics/show.md index 898d2995c8..6cf281573f 100644 --- a/content/docs/command-reference/metrics/show.md +++ b/content/docs/command-reference/metrics/show.md @@ -15,7 +15,7 @@ positional arguments: ## Description Finds and prints all metrics in the project by examining all of its -[DVC-files](/doc/user-guide/dvc-file-format). +[DVC-files](/doc/user-guide/dvc-files-and-directories). 
> This kind of metrics can be defined with the `-m` (`--metrics`) and `-M` > (`--metrics-no-cache`) options of `dvc run`. diff --git a/content/docs/command-reference/move.md b/content/docs/command-reference/move.md index 56ce8918ae..20ef132ce3 100644 --- a/content/docs/command-reference/move.md +++ b/content/docs/command-reference/move.md @@ -1,9 +1,9 @@ # move Rename a file or a directory and modify the corresponding -[`.dvc` file](/doc/user-guide/dvc-file-format) (see `dvc add`) to reflect the -change. If the file or directory has the same name as the corresponding `.dvc` -file, it also renames it. +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) (see +`dvc add`) to reflect the change. If the file or directory has the same name as +the corresponding `.dvc` file, it also renames it. ## Synopsis @@ -19,9 +19,9 @@ positional arguments: `dvc move` is useful when a `src` file or directory has previously been added to the project with `dvc add`, creating a -[`.dvc` file](/doc/user-guide/dvc-file-format) (with `src` as a dependency). -`dvc move` behaves similar to `mv src dst`, moving `src` to the given `dst` -path, but it also renames and updates the corresponding `.dvc` file +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) (with `src` +as a dependency). `dvc move` behaves like `mv src dst`, moving `src` to the +given `dst` path, but it also renames and updates the corresponding `.dvc` file appropriately. > Note that `src` may be a copy or a @@ -108,7 +108,8 @@ $ tree We use `dvc add` to track a file with DVC, then we use `dvc move` to change its location. If target path already exists and is a directory, data file is moved with unchanged name into this folder. Note that the `data.csv.dvc` -[`.dvc` file](/doc/user-guide/dvc-file-format) is also moved. +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) is also +moved. 
```dvc $ tree diff --git a/content/docs/command-reference/params/index.md b/content/docs/command-reference/params/index.md index aeba5558e8..4b3e41a15e 100644 --- a/content/docs/command-reference/params/index.md +++ b/content/docs/command-reference/params/index.md @@ -54,8 +54,8 @@ written, or generated, and these can be versioned directly with Git. You can then use `dvc run` with the `-p` (`--params`) option to specify parameter dependencies for your pipeline's stages (instead of or in addition to regular `-d` deps.) DVC saves the param names and values in the stage file (see -[DVC-file format](/doc/user-guide/dvc-file-format)). These values will be -compared to the ones in the params files to determine if the stage is +[DVC-file format](/doc/user-guide/dvc-files-and-directories)). These values will +be compared to the ones in the params files to determine if the stage is invalidated upon pipeline [reproduction](/doc/command-reference/repro). `dvc params diff` is available to show changes in parameters, displaying the @@ -109,9 +109,9 @@ $ dvc run -d users.csv -o model.pkl \ ``` You can find that each parameter and it's value were saved in the -[DVC-file](/doc/user-guide/dvc-file-format). These values will be compared to -the ones in the parameters files whenever `dvc repro` is used, to determine if -dependency to the params file is invalidated: +[DVC-file](/doc/user-guide/dvc-files-and-directories). These values will be +compared to the ones in the parameters files whenever `dvc repro` is used, to +determine if dependency to the params file is invalidated: ```yaml md5: 05d178cfa0d1474b6c5800aa1e1b34ac diff --git a/content/docs/command-reference/pipeline/show.md b/content/docs/command-reference/pipeline/show.md index 791c64c9be..e41cade4cd 100644 --- a/content/docs/command-reference/pipeline/show.md +++ b/content/docs/command-reference/pipeline/show.md @@ -2,7 +2,7 @@ Show [stages](/doc/command-reference/run) in a pipeline that lead to the specified stage. 
By default it lists -[DVC-files](/doc/user-guide/dvc-file-format). +[DVC-files](/doc/user-guide/dvc-files-and-directories). ## Synopsis diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index 8125363141..632bb8bc52 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -46,8 +46,8 @@ please see `dvc plots`. ## Options - `--targets ` - specific metric files to visualize. These must be listed - in a [`dvc.yaml`](/doc/user-guide/dvc-file-format) file (see the `--plots` - option of `dvc run`). + in a [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) file + (see the `--plots` option of `dvc run`). - `-o , --out ` - name of the generated file. By default, the output file name is equal to the input filename with a `.html` file extension (or diff --git a/content/docs/command-reference/plots/modify.md b/content/docs/command-reference/plots/modify.md index ac9c24101a..4bdeb75cde 100644 --- a/content/docs/command-reference/plots/modify.md +++ b/content/docs/command-reference/plots/modify.md @@ -23,8 +23,9 @@ plots are generated with `dvc plot show` or `dvc plot diff`. This command sets (or unsets) default display properties for a specific metrics file. The path to the metrics file `target` is required. It must be listed in a -[`dvc.yaml`](/doc/user-guide/dvc-file-format) file (see the `--plots` option of -`dvc run`). `dvc plots modify` adds the display properties to `dvc.yaml`. +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) file (see +the `--plots` option of `dvc run`). `dvc plots modify` adds the display +properties to `dvc.yaml`. Property names are passed as [options](#options) to this command (prefixed with `--`). 
These are based on the full diff --git a/content/docs/command-reference/plots/show.md b/content/docs/command-reference/plots/show.md index 2875380811..e4048c16c6 100644 --- a/content/docs/command-reference/plots/show.md +++ b/content/docs/command-reference/plots/show.md @@ -23,8 +23,8 @@ AUC curves, confusion matrices, etc. All plots defined in `dvc.yaml` are used by default. Optionally, specific metric file `targets` to show are accepted. These must be -listed in a [`dvc.yaml`](/doc/user-guide/dvc-file-format) file (see the -`--plots` option of `dvc run`). +listed in a [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) +file (see the `--plots` option of `dvc run`). The plot style can be customized with [plot templates](/doc/command-reference/plots#plot-templates), using the diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index c0b656ae6d..159f393106 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -3,8 +3,8 @@ Download tracked files or directories from [remote storage](/doc/command-reference/remote) to the cache and workspace, based on the current -[`dvc.yaml`](/doc/user-guide/dvc-file-format) and -[`.dvc`](/doc/user-guide/dvc-file-format) files. +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files. ## Synopsis @@ -39,11 +39,11 @@ remote. With no arguments, just `dvc pull` or `dvc pull --remote `, it downloads only the files (or directories) missing from the workspace by searching all -stages in [`dvc.yaml`](/doc/user-guide/dvc-file-format) or -[`.dvc`](/doc/user-guide/dvc-file-format) files currently in the -project. It will not download files associated with earlier commits -in the repository (if using Git), nor will it download files that -have not changed. 
+stages in [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) +or [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files currently +in the project. It will not download files associated with earlier +commits in the repository (if using Git), nor will it download +files that have not changed. The command `dvc status -c` can list files referenced in current stages (in `dvc.yaml`) or `.dvc` files, but missing from the cache. It can be diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index baf8cfb245..788f53b4af 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -39,10 +39,10 @@ with `git commit` and `git push`). Under the hood a few actions are taken: - The push command by default uses all - [`dvc.yaml`](/doc/user-guide/dvc-file-format) and - [`.dvc` files](/doc/user-guide/dvc-file-format) in the workspace. - The command options listed below will either limit or expand the set of stages - (in dvc.yaml) or `.dvc` files to consult. + [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and + [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) in the + workspace. The command options listed below will either limit or + expand the set of stages (in dvc.yaml) or `.dvc` files to consult. - For each output referenced in every selected stage or `.dvc` file, DVC finds a corresponding file or directory in the cache. diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index 60493cd499..95926a121e 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -62,7 +62,8 @@ The following config options are available for all remote types: DVC will recalculate the file hashes upon download (e.g. `dvc pull`) to make sure that these haven't been modified, or corrupted during download. 
It may slow down the aforementioned commands. The calculated hash is compared to the - value saved in the corresponding [DVC-file](/doc/user-guide/dvc-file-format). + value saved in the corresponding + [DVC-file](/doc/user-guide/dvc-files-and-directories). > Note that this option is enabled on **Google Drive** remotes by default. diff --git a/content/docs/command-reference/remove.md b/content/docs/command-reference/remove.md index a99c7f98c8..aa5ed00045 100644 --- a/content/docs/command-reference/remove.md +++ b/content/docs/command-reference/remove.md @@ -14,10 +14,12 @@ positional arguments: ## Description This command safely removes data files or directories that are tracked by DVC -from the workspace. It takes a stage name (see -n option of -`dvc run`) or a .dvc file as target, removes all of its outputs (outs field), -and optionally removes the stage entry in -[dvc.yaml(/doc/user-guide/dvc-file-format) or the `.dvc` file itself. +from the workspace. It takes one or more stage names (see `-n` +option of `dvc run`) or +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) as target, +removes all of its outputs (outs field), and optionally removes the stage entry +from [dvc.yaml](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or the +`.dvc` file itself. Note that it does not remove files from the DVC cache or remote storage (see `dvc gc`). However, remember to run `dvc push` to save the files you actually diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 359365b218..dc1048ca3a 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -35,8 +35,8 @@ There's a few ways to restrict the stages that will be regenerated by this command: by specifying stage file `targets`, or by using the `--single-item`, `--cwd`, or other options. -If specific [DVC-files](/doc/user-guide/dvc-file-format) (`targets`) are -omitted, `Dvcfile` will be assumed. 
+If specific [DVC-files](/doc/user-guide/dvc-files-and-directories) (`targets`) +are omitted, `Dvcfile` will be assumed. `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, intermediate or final results. @@ -274,8 +274,8 @@ Data and pipelines are up to date. ``` The reason being that the `text.txt` file is a dependency in the target -[DVC-file](/doc/user-guide/dvc-file-format) (`Dvcfile` by default). This -`Dvcfile` stage is dependent on `filter.dvc`, which happens first in this +[DVC-file](/doc/user-guide/dvc-files-and-directories) (`Dvcfile` by default). +This `Dvcfile` stage is dependent on `filter.dvc`, which happens first in this pipeline (shown in the following figure): ```dvc diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 24b93d71d7..7db1d4405f 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -1,7 +1,7 @@ # run -Generate a stage file ([DVC-file](/doc/user-guide/dvc-file-format)) from a given -command and execute the command. +Generate a stage file ([DVC-file](/doc/user-guide/dvc-files-and-directories)) +from a given command and execute the command. ## Synopsis @@ -212,12 +212,9 @@ To track the changes with git, run: git add .gitignore metric.dvc ``` -> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the -> text format above. - Execute a Python script as a DVC [pipeline](/doc/command-reference/pipeline) -stage. The stage file name is not specified, so a `model.p.dvc` DVC-file is -created by default based on the registered output (`-o): +stage. The stage file name is not specified, so a `model.p.dvc` file is created +by default based on the registered output (`-o): ```dvc # Train ML model on the training dataset. 20180226 is a seed value. 
diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 24c9665664..d5af083b71 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -35,10 +35,10 @@ options: | remote | `--cloud` | Comparisons are made between the cache, and the default remote, typically defined with `dvc remote --default`. | DVC determines which data and code files to compare by analyzing all stages (in -[`dvc.yaml`](/doc/user-guide/dvc-file-format) and -[`.dvc` files](/doc/user-guide/dvc-file-format) in the workspace -(the `--all-branches` and `--all-tags` options compare multiple workspace -versions). +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) in the +workspace (the `--all-branches` and `--all-tags` options compare +multiple workspace versions). The comparison can be limited to certain stages (in `dvc.yaml`) or `.dvc` files only, by listing them as `targets`. (Changes are reported only against these.) @@ -80,8 +80,8 @@ the changes (described below). - _new_: An output is found in the workspace, but there is no corresponding file hash saved in the - [`dvc.lock`](/doc/user-guide/dvc-file-format) or - [`.dvc`](/doc/user-guide/dvc-file-format) file yet. + [`dvc.lock`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or + [`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file yet. - _modified_: An output or dependency is found in the workspace, but the corresponding file hash in the `dvc.lock` or `.dvc` file is not up to date. 
diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index bf0c360637..5897e45868 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -1,8 +1,8 @@ # update Update data artifacts imported from external DVC -projects, and corresponding -[`.dvc` files](/doc/user-guide/dvc-file-format). +projects, and corresponding import stage +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files). ## Synopsis @@ -11,16 +11,17 @@ usage: dvc update [-h] [-q | -v] [--rev ] [-R] targets [targets ...] positional arguments: - targets import stage .dvc files to update. Using -R, directories + targets Import stage .dvc files to update. Using -R, directories to search for .dvc files can also be given. ``` ## Description After creating import stages -([`.dvc` files](/doc/user-guide/dvc-file-format)) with `dvc import` or -`dvc import-url`, the data source can change. Use `dvc update` to bring these -imported file, directory, or data artifact up to date. +([`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files)) with +`dvc import` or `dvc import-url`, the data source can change. Use `dvc update` +to bring these imported file, directory, or data artifact up to +date. To indicate which import stages to update, we must specify the corresponding `.dvc` file `targets` as command arguments. @@ -85,8 +86,8 @@ This time nothing has changed, since the source project is rather stable. > Note that `dvc update` updates the `rev_lock` field of the corresponding -> [`.dvc` file](/doc/user-guide/dvc-file-format) (when there are changes to -> bring in). +> [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) (when there +> are changes to bring in). 
## Example: Updating fixed revisions to a different version diff --git a/content/docs/install/plugins.md b/content/docs/install/plugins.md index 2ddb788d81..d1d6b2f8e1 100644 --- a/content/docs/install/plugins.md +++ b/content/docs/install/plugins.md @@ -1,8 +1,10 @@ # IDE Plugins and Syntax Highlighting When you add a file or a stage to your pipeline, DVC creates a special -[`.dvc` file](/doc/user-guide/dvc-file-format) that contains all the needed -information to track your data and transformations. +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) or +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) file +(respectively) that contains all the needed information to track your data and +transformations. The file itself is in a simple YAML format. @@ -16,15 +18,3 @@ autocmd! BufNewFile,BufRead Dvcfile,*.dvc setfiletype yaml ``` to your `~/.vimrc`(to be created if it doesn't exist). - -## IntelliJ IDEs - -A community member, [@prihoda](https://github.com/prihoda), maintains a plugin -for IntelliJ IDEs, it offers a more robust integration than just syntax -highlighting. 
- -You can download the plugin from -[JetBrains Plugins repository](https://plugins.jetbrains.com/plugin/11368-dvc-support-poc) - -For more information, visit the plugin's repository: -[iterative/intellij-dvc/](https://github.com/iterative/intellij-dvc/) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 38ebaa26c8..ae4a6937f8 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -105,13 +105,9 @@ "source": "user-guide/index.md", "children": [ { - "label": "Files and Directories", + "label": "DVC Files and Directories", "slug": "dvc-files-and-directories" }, - { - "label": "File Format (.dvc)", - "slug": "dvc-file-format" - }, { "slug": "dvcignore", "tutorials": { diff --git a/content/docs/tutorials/deep/define-ml-pipeline.md b/content/docs/tutorials/deep/define-ml-pipeline.md index bb4fde619f..aff377c8f9 100644 --- a/content/docs/tutorials/deep/define-ml-pipeline.md +++ b/content/docs/tutorials/deep/define-ml-pipeline.md @@ -51,11 +51,11 @@ or move it, you can use `dvc move`. ## Data file internals -If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by -`dvc add`, you will see that outputs are tracked in the `outs` -field. In this file, only one output is specified. The output contains the data -file path in the repository and its MD5 hash. This hash value determines the -location of the actual content file in the +If you take a look at the [DVC-file](/doc/user-guide/dvc-files-and-directories) +created by `dvc add`, you will see that outputs are tracked in the +`outs` field. In this file, only one output is specified. The output contains +the data file path in the repository and its MD5 hash. This hash value +determines the location of the actual content file in the [cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory), `.dvc/cache`. @@ -139,8 +139,8 @@ files written to by the command, if any. - `-o out.dat` (lower case o) specifies an output data file. 
DVC will track this data file by creating a corresponding - [DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add out.dat` - after `dvc run` instead). + [DVC-file](/doc/user-guide/dvc-files-and-directories) (as if running + `dvc add out.dat` after `dvc run` instead). - `-O tmp.dat` (upper case O) specifies a simple output file (not to be added to DVC). @@ -186,8 +186,8 @@ command and does some additional work if the command was successful: 2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage file in the project with information about this pipeline stage. - (See [DVC-File Format](/doc/user-guide/dvc-file-format)). Note that the name - of this file could be specified by using the `-f` option, for example + (See [DVC Files](/doc/user-guide/dvc-files-and-directories)). Note that the + name of this file could be specified by using the `-f` option, for example `-f extract.dvc`. Let's take a look at the resulting stage file created by `dvc run` above: diff --git a/content/docs/tutorials/deep/preparation.md b/content/docs/tutorials/deep/preparation.md index 1983c9099b..6f3b593da7 100644 --- a/content/docs/tutorials/deep/preparation.md +++ b/content/docs/tutorials/deep/preparation.md @@ -67,9 +67,9 @@ with: At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories), that are -hidden from the user. This directory is automatically staged with `git add`, so -it can be easily committed with Git: +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), +that are hidden from the user. This directory is automatically staged with +`git add`, so it can be easily committed with Git: ```dvc $ dvc init @@ -91,5 +91,4 @@ The cache directory, one of the most important parts of any explained in more detail in the next chapter.) 
Note that it won't be tracked by Git — It's a local-only directory, and you cannot push it to a Git remote. -For more information refer to -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories). +For more information refer to `dvc init`. diff --git a/content/docs/tutorials/deep/reproducibility.md b/content/docs/tutorials/deep/reproducibility.md index e43bfc3cff..a7791935a7 100644 --- a/content/docs/tutorials/deep/reproducibility.md +++ b/content/docs/tutorials/deep/reproducibility.md @@ -19,9 +19,9 @@ automation tools ([Make](https://www.gnu.org/software/make/), Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of the graph nodes (pipeline [stages](/doc/command-reference/run)). -If you run `repro` on any [DVC-file](/doc/user-guide/dvc-file-format) from our -repository, nothing happens because nothing was changed in the pipeline defined -in the project: There's nothing to reproduce. +If you run `repro` on any [DVC-file](/doc/user-guide/dvc-files-and-directories) +from our repository, nothing happens because nothing was changed in the pipeline +defined in the project: There's nothing to reproduce. ```dvc $ dvc repro model.p.dvc diff --git a/content/docs/tutorials/deep/sharing-data.md b/content/docs/tutorials/deep/sharing-data.md index 19b297bbc0..1739218c3c 100644 --- a/content/docs/tutorials/deep/sharing-data.md +++ b/content/docs/tutorials/deep/sharing-data.md @@ -2,21 +2,21 @@ ## Pushing data to the cloud -We've gone over how source code and [DVC-files](/doc/user-guide/dvc-file-format) -can be shared using a Git repository. These DVC repositories will -contain all the information needed for reproducibility, so it might be a good -idea to share them with your team using Git hosting services (such as -[GitHub](https://github.com/)). +We've gone over how source code and +[DVC-files](/doc/user-guide/dvc-files-and-directories) can be shared using a Git +repository. 
These DVC repositories will contain all the information +needed for reproducibility, so it might be a good idea to share them with your +team using Git hosting services (such as [GitHub](https://github.com/)). DVC is able to push the cache to cloud storage. > Using shared cloud storage, a colleague can reuse ML models that were trained > on your machine. -First, you need to setup remote storage for the project, that will -be stored in the project's -[config file](https://dvc.org/doc/user-guide/dvc-files-and-directories). This -can be done using the CLI as shown below. +First, you need to set up the remote storage for this project, that +will be stored in the project's +[config file](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files). +This can be done using the CLI as shown below. > Note that we are using the `dvc-public` S3 bucket as an example and you don't > have write access to it, so in order to follow the tutorial you will need to diff --git a/content/docs/tutorials/get-started/data-access.md b/content/docs/tutorials/get-started/data-access.md index 9416e69a1c..b3f0cc7b4b 100644 --- a/content/docs/tutorials/get-started/data-access.md +++ b/content/docs/tutorials/get-started/data-access.md @@ -52,10 +52,10 @@ $ dvc import https://github.com/iterative/dataset-registry \ use-cases/cats-dogs ``` -This is similar to `dvc get`+`dvc add`, but the resulting -[DVC-file](/doc/user-guide/dvc-file-format) includes metadata to track changes -in the source repository. This allows you to bring in changes from the data -source later, using `dvc update`. +This is similar to `dvc get` + `dvc add`, but the resulting +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) includes +metadata to track changes in the source repository. This allows you to bring in +changes from the data source later, using `dvc update`.
@@ -66,8 +66,8 @@ source later, using `dvc update`. > doesn't actually contain a `cats-dogs/` directory. Like `dvc get`, > `dvc import` downloads from [remote storage](/doc/command-reference/remote). -DVC-files created by `dvc import` are called _import stages_. These have special -fields, such as the data source `repo`, and `path` (under `deps`): +`.dvc` files created by `dvc import` are called _import stages_. These have +special fields, such as the data source `repo`, and `path` (under `deps`): ```yaml deps: diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index 977b7978fd..6c0be3f449 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -67,11 +67,8 @@ $ dvc run -f prepare.dvc \ python src/prepare.py data/data.xml data/prepared ``` -A `prepare.dvc` _stage file_ is generated with the same -[format](/doc/user-guide/dvc-file-format) as the DVC-file we created previously -to -[tack existing data](/doc/tutorials/get-started/data-versioning#tracking-changes). -Additionally, it includes information about the command we ran +A [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) is +generated. It includes information about the command we ran (`python src/prepare.py`), its dependencies, and outputs. @@ -157,7 +154,7 @@ $ dvc run -f train.dvc \ ``` This would be a good point to commit the changes with Git. This includes any -`.gitignore` files, and all the stage files that describe our pipeline so far. +`.gitignore` files, and `dvc.yaml` — which describes our pipeline. > 📖 See also the `dvc pipeline` command. @@ -177,7 +174,6 @@ Move to another location in your file system and do this: $ git clone https://github.com/iterative/example-get-started $ cd example-get-started $ git checkout 7-train -$ dvc unlock data/data.xml.dvc ```
diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md index 650beb3787..bf48c1d6c0 100644 --- a/content/docs/tutorials/get-started/data-versioning.md +++ b/content/docs/tutorials/get-started/data-versioning.md @@ -30,11 +30,11 @@ to rename it locally to `datadir/`. $ dvc add datadir ``` -DVC stores information about the added directory in a special _DVC-file_ named +DVC stores information about the added directory in a special file named `datadir.dvc`, a small text file with a human-readable -[format](/doc/user-guide/dvc-file-format). This file can be easily **versioned -like source code** with Git, as a placeholder for the original data (which is -listed in `.gitignore`): +[format](/doc/user-guide/dvc-files-and-directories#dvc-files). This `.dvc` file +can be easily **versioned like source code** with Git, as a placeholder for the +original data (which is listed in `.gitignore`): ```dvc $ git add .gitignore datadir.dvc @@ -98,8 +98,8 @@ $ dvc add datadir ``` DVC caches the changes to the `datadir/` directory, and updates the -`datadir.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to match the changes. -Let's commit this new version with Git: +`datadir.dvc` [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) to match the +updated data. Let's commit this new version with Git:
@@ -140,8 +140,8 @@ $ dvc checkout datadir.dvc ### Expand to see what happened internally -`git checkout` brought the `datadir.dvc` DVC-file back to the version, with the -previous hash value of the data (`a304afb...`): +`git checkout` brought the `datadir.dvc` `.dvc` file back to the previous +version, with the original hash value of the data (`a304afb...`): ```yaml outs: @@ -194,7 +194,7 @@ $ dvc push ``` Usually, we also want to `git commit` and `git push` the corresponding -[DVC-files](/doc/user-guide/dvc-file-format). +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files).
diff --git a/content/docs/tutorials/get-started/index.md b/content/docs/tutorials/get-started/index.md index 585e30085b..26486ae5eb 100644 --- a/content/docs/tutorials/get-started/index.md +++ b/content/docs/tutorials/get-started/index.md @@ -10,7 +10,8 @@ and concepts of DVC step by step. Move into the directory you want to use as workspace, and use `dvc init` inside to create a DVC project. It can contain existing project files. At initialization, a new `.dvc/` directory is created for the -internal [files and directories](/doc/user-guide/dvc-files-and-directories): +internal +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files): ```dvc $ dvc init diff --git a/content/docs/tutorials/pipelines.md b/content/docs/tutorials/pipelines.md index a305fabe9f..7babe6da2d 100644 --- a/content/docs/tutorials/pipelines.md +++ b/content/docs/tutorials/pipelines.md @@ -97,7 +97,7 @@ $ dvc add data/Posts.xml.zip ``` When we run `dvc add` `Posts.xml.zip`, DVC creates a -[DVC-file](/doc/user-guide/dvc-file-format). +[DVC-file](/doc/user-guide/dvc-files-and-directories).
@@ -105,9 +105,9 @@ When we run `dvc add` `Posts.xml.zip`, DVC creates a At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories) that are -hidden from the user. This directory is automatically staged with `git add`, so -it can be easily committed with Git. +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files) +that are hidden from the user. This directory is automatically staged with +`git add`, so it can be easily committed with Git. Note that the DVC-file created by `dvc add` has no dependencies, a.k.a. an _orphan stage_ (see `dvc add`): diff --git a/content/docs/tutorials/versioning.md b/content/docs/tutorials/versioning.md index 14dcce70cc..9630c790c7 100644 --- a/content/docs/tutorials/versioning.md +++ b/content/docs/tutorials/versioning.md @@ -132,8 +132,8 @@ the cache (while keeping a [file link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) to it in the workspace, so you can continue working the same way as before). This is achieved by creating a simple human-readable -[DVC-file](/doc/user-guide/dvc-file-format) that serves as a pointer to the -cache. +[DVC-file](/doc/user-guide/dvc-files-and-directories) that serves as a pointer +to the cache. Next, we train our first model with `train.py`. Because of the small dataset, this training process should be small enough to run on most computers in a @@ -168,8 +168,8 @@ As we mentioned briefly, DVC does not commit the `data/` directory and then `git commit` DVC-files that contain file hashes that point to cached data. In this case we created `data.dvc` and `model.h5.dvc`. Refer to -[DVC-File Format](/doc/user-guide/dvc-file-format) to learn more about how these -files work. +[DVC Files](/doc/user-guide/dvc-files-and-directories) to learn more about how +these files work.
@@ -283,9 +283,9 @@ the `v2.0` tag. As we have learned already, DVC keeps data files out of Git (by adjusting `.gitignore`) and puts them into the cache (usually it's a `.dvc/cache` directory inside the repository). Instead, DVC creates -[DVC-files](/doc/user-guide/dvc-file-format). These text files serve as data -placeholders that point to the cached files, and they can be easily version -controlled with Git. +[DVC-files](/doc/user-guide/dvc-files-and-directories). These text files serve +as data placeholders that point to the cached files, and they can be easily +version controlled with Git. When we run `git checkout` we restore pointers (DVC-files) first. Then, when we run `dvc checkout`, we use these pointers to put the right data in the right @@ -325,11 +325,11 @@ $ dvc run -f Dvcfile \ ``` Similar to `dvc add`, `dvc run` creates a -[DVC-file](/doc/user-guide/dvc-file-format) named `Dvcfile` (specified using the -`-f` option). It tracks all outputs (`-o`) the same way as `dvc add` does. -Unlike `dvc add`, `dvc run` also tracks dependencies (`-d`) and the command -(`python train.py`) that was run to produce the result. We call such a DVC-file -a "stage file". +[DVC-file](/doc/user-guide/dvc-files-and-directories) named `Dvcfile` (specified +using the `-f` option). It tracks all outputs (`-o`) the same way as `dvc add` +does. Unlike `dvc add`, `dvc run` also tracks dependencies (`-d`) and the +command (`python train.py`) that was run to produce the result. We call such a +DVC-file a "stage file". > At this point you could run `git add .` and `git commit` to save the `Dvcfile` > stage file and its changed outputs to the repository. 
diff --git a/content/docs/understanding-dvc/how-it-works.md b/content/docs/understanding-dvc/how-it-works.md index 732433fca5..c5b5988ad3 100644 --- a/content/docs/understanding-dvc/how-it-works.md +++ b/content/docs/understanding-dvc/how-it-works.md @@ -7,7 +7,7 @@ $ dvc init ``` - > See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) + > See `dvc init` for more info. - DVC helps define command pipelines, and keeps each command [stage](/doc/command-reference/run) and dependencies in a Git repository: @@ -44,7 +44,7 @@ - DVC introduces the concept of data files for Git repositories. DVC keeps data files outside of the repository, replacing them with special - [DVC-files](/doc/user-guide/dvc-file-format) in the Git repo: + [DVC-files](/doc/user-guide/dvc-files-and-directories) in the Git repo: ```dvc $ git checkout a03_normbatch_vgg16 # checkout code and DVC-files diff --git a/content/docs/understanding-dvc/related-technologies.md b/content/docs/understanding-dvc/related-technologies.md index bf1904857c..ed3ab7b4d2 100644 --- a/content/docs/understanding-dvc/related-technologies.md +++ b/content/docs/understanding-dvc/related-technologies.md @@ -38,9 +38,9 @@ Luigi, etc. result, but we expect some GUI services will be created on top of DVC. - DVC has transparent design. Its - [internal files and directories](/doc/user-guide/dvc-files-and-directories) - (including the cache directory) have a human-readable format and - can be easily reused by external tools. + [files and directories](/doc/user-guide/dvc-files-and-directories) (including + the cache directory) have a human-readable format and can be + easily reused by external tools. ### Git workflows/methodologies such as Gitflow @@ -60,8 +60,9 @@ Luigi, etc. (DAG): - The DAG or dependency graph is defined implicitly by the connections between - [DVC-files](/doc/user-guide/dvc-file-format) (with file names `.dvc` - or `Dvcfile`), based on their dependencies and outputs. 
+ [DVC-files](/doc/user-guide/dvc-files-and-directories) (with file names + `.dvc` or `Dvcfile`), based on their dependencies and + outputs. - Each DVC-file defines one node in the DAG. All DVC-files in a repository make up a single pipeline (think a single Makefile). All DVC-files (and @@ -99,9 +100,9 @@ Luigi, etc. Git-annex repository is cloned via `git clone`, data files won't be copied to the local machine, as file contents are stored in separate [remotes](/doc/command-reference/remote). With DVC, - [DVC-files](/doc/user-guide/dvc-file-format), which provide the reproducible - workflow, are always included in the Git repository. Hence, they can be - executed locally with minimal effort. + [DVC-files](/doc/user-guide/dvc-files-and-directories), which provide the + reproducible workflow, are always included in the Git repository. Hence, they + can be executed locally with minimal effort. - DVC is not fundamentally bound to Git, and users have the option of using DVC without Git. diff --git a/content/docs/understanding-dvc/what-is-dvc.md b/content/docs/understanding-dvc/what-is-dvc.md index 56865ef8c8..24a32662e9 100644 --- a/content/docs/understanding-dvc/what-is-dvc.md +++ b/content/docs/understanding-dvc/what-is-dvc.md @@ -45,8 +45,8 @@ DVC uses a few core concepts: - **Data files**: Cached files (for large files). Data files are stored outside of the Git repository on a local/shared hard drive or remote storage, but - [DVC-files](/doc/user-guide/dvc-file-format) describing that data are stored - in Git for DVC needs (to maintain pipelines and reproducibility). + [DVC-files](/doc/user-guide/dvc-files-and-directories) describing that data + are stored in Git for DVC needs (to maintain pipelines and reproducibility). - **Cache directory**: Directory with all data files on a local hard drive or in cloud storage, but not in the Git repository. See `dvc cache dir`. 
diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md index 4b66691b5f..fc19733012 100644 --- a/content/docs/use-cases/data-registries.md +++ b/content/docs/use-cases/data-registries.md @@ -36,8 +36,9 @@ Advantages of using a DVC **data registry**: copies on other remotes). This simplifies data management and optimizes space requirements. - Security: Registries can be setup to have read-only remote storage (e.g. an - HTTP location). Git versioning of [DVC-files](/doc/user-guide/dvc-file-format) - allows us to track and audit data changes. + HTTP location). Git versioning of + [DVC-files](/doc/user-guide/dvc-files-and-directories) allows us to track and + audit data changes. - Data as code: Leverage Git workflow such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle. Think Git for cloud storage, but without ad-hoc conventions. @@ -65,10 +66,10 @@ $ dvc add music/songs > [MillionSongSubset](http://millionsongdataset.com/pages/getting-dataset/#subset). A regular Git workflow can be followed with the tiny -[DVC-files](/doc/user-guide/dvc-file-format) that substitute the actual data -(`music/songs.dvc` in this example). This enables team collaboration on data at -the same level as with source code (commit history, branching, pull requests, -reviews, etc.): +[DVC-files](/doc/user-guide/dvc-files-and-directories) that substitute the +actual data (`music/songs.dvc` in this example). This enables team collaboration +on data at the same level as with source code (commit history, branching, pull +requests, reviews, etc.): ```dvc $ git add music/songs.dvc music/.gitignore @@ -147,8 +148,8 @@ $ dvc import https://github.com/example/registry \ Besides downloading, importing saves the dependency from the local project to the data source (registry repo). This is achieved by creating a particular kind -of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_). 
This -file can be used staged and committed with Git. +of [DVC-file](/doc/user-guide/dvc-files-and-directories) (a.k.a. _import +stage_). This file can be staged and committed with Git. As an addition to the import workflow, and enabled the saved dependency, we can easily bring it up to date in our consumer project(s) with `dvc update` whenever diff --git a/content/docs/use-cases/sharing-data-and-model-files.md b/content/docs/use-cases/sharing-data-and-model-files.md index b23c2e9b25..7b46bd91af 100644 --- a/content/docs/use-cases/sharing-data-and-model-files.md +++ b/content/docs/use-cases/sharing-data-and-model-files.md @@ -67,8 +67,8 @@ with the `dvc push` command: $ dvc push ``` -Code and [DVC-files](/doc/user-guide/dvc-file-format) can be safely committed -and pushed with Git. +Code and [DVC-files](/doc/user-guide/dvc-files-and-directories) can be safely +committed and pushed with Git. ## Download code diff --git a/content/docs/use-cases/versioning-data-and-model-files.md b/content/docs/use-cases/versioning-data-and-model-files.md index 1cd1ceb296..6d98891b1c 100644 --- a/content/docs/use-cases/versioning-data-and-model-files.md +++ b/content/docs/use-cases/versioning-data-and-model-files.md @@ -8,10 +8,10 @@ DVC allows versioning data files and directories, intermediate results, and ML models using Git, but without storing the file contents in the Git repository. It's useful when dealing with files that are too large for Git to handle properly in general. DVC saves information about your data in special -[DVC-files](/doc/user-guide/dvc-file-format), and these metafiles can be used -for versioning. To actually store the data, DVC supports various types of -[remote storage](/doc/command-reference/remote). This allows easily saving and -sharing data alongside code. +[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files), and these +files can be used for versioning.
To actually store the data, DVC supports +various types of [remote storage](/doc/command-reference/remote). This allows +easily saving and sharing data alongside code. ![](/img/model-versioning-diagram.png) diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md index 6f8246e46c..7f00f9f5e9 100644 --- a/content/docs/user-guide/basic-concepts/dependency.md +++ b/content/docs/user-guide/basic-concepts/dependency.md @@ -4,5 +4,7 @@ match: [dependency, dependencies] --- A file or directory (possibly tracked by DVC) recorded in the `deps` section of -a [`dvc.yaml`](/doc/user-guide/dvc-file-format) file. See `dvc run`. Stages are -invalidated when any of their dependencies change. +a stage (in +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)) or +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files). See +`dvc run`. Stages are invalidated when any of their dependencies change. diff --git a/content/docs/user-guide/basic-concepts/dvc-project.md b/content/docs/user-guide/basic-concepts/dvc-project.md index bf1da02f71..f395861a78 100644 --- a/content/docs/user-guide/basic-concepts/dvc-project.md +++ b/content/docs/user-guide/basic-concepts/dvc-project.md @@ -16,6 +16,6 @@ match: Initialized by running `dvc init` in the **workspace** (typically a Git repository). It will contain the [`.dvc/` directory](/doc/user-guide/dvc-files-and-directories), as well as -[`dvc.yaml`](/doc/user-guide/dvc-file-format) and -[`.dvc`](/doc/user-guide/dvc-file-format) files created with commands such as -`dvc add` or `dvc run`. +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and +[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files created with +commands such as `dvc add` or `dvc run`.
diff --git a/content/docs/user-guide/basic-concepts/external-dependency.md b/content/docs/user-guide/basic-concepts/external-dependency.md index f8d1113654..9f30fef770 100644 --- a/content/docs/user-guide/basic-concepts/external-dependency.md +++ b/content/docs/user-guide/basic-concepts/external-dependency.md @@ -3,8 +3,9 @@ name: 'External Dependency' match: ['external dependency', 'external dependencies'] --- -A stage dependency (`dep` field in [`dvc.yaml`](/doc/user-guide/dvc-file-format) -or in an [import stage](/doc/command-reference/import) `.dvc` file) with origin -in an external source, for example HTTP, SSH, Amazon S3, Google Cloud Storage -remote locations, or even other DVC repositories. See +A stage dependency (`deps` field in +[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or in an +[import stage](/doc/command-reference/import) `.dvc` file) with origin in an +external source, for example HTTP, SSH, Amazon S3, Google Cloud Storage remote +locations, or even other DVC repositories. See [External Dependencies](/doc/user-guide/external-dependencies). diff --git a/content/docs/user-guide/basic-concepts/import-stage.md b/content/docs/user-guide/basic-concepts/import-stage.md index 4378f2cf3d..39bf73bbf2 100644 --- a/content/docs/user-guide/basic-concepts/import-stage.md +++ b/content/docs/user-guide/basic-concepts/import-stage.md @@ -3,6 +3,6 @@ name: 'Import Stage' match: ['import stage', 'import stages'] --- -[`.dvc` file](/doc/user-guide/dvc-file-format) created with the `dvc import` or -`dvc import-url` commands. They represent files or directories from external -sources. +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) created with +the `dvc import` or `dvc import-url` commands. They represent files or +directories from external sources. 
diff --git a/content/docs/user-guide/basic-concepts/output.md b/content/docs/user-guide/basic-concepts/output.md index f5d38be705..d190c8fe14 100644 --- a/content/docs/user-guide/basic-concepts/output.md +++ b/content/docs/user-guide/basic-concepts/output.md @@ -3,7 +3,8 @@ name: Output match: [output, outputs] --- -A file or directory tracked by DVC, recorded in the `outs` section of a -[`dvc.yaml`](/doc/user-guide/dvc-file-format) or -[`.dvc` file](/doc/user-guide/dvc-file-format). Outputs are usually the result -of stages. See `dvc add`, `dvc run`, `dvc import`, et al. A.k.a. _data artifact_ +A file or directory tracked by DVC, recorded in the `outs` section of a stage +(in [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)) or +[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files). Outputs are +usually the result of stages. See `dvc add`, `dvc run`, `dvc import`, et al. +A.k.a. _data artifact_ diff --git a/content/docs/user-guide/contributing/docs.md b/content/docs/user-guide/contributing/docs.md index d15ef180ab..34cae8b53d 100644 --- a/content/docs/user-guide/contributing/docs.md +++ b/content/docs/user-guide/contributing/docs.md @@ -169,8 +169,8 @@ is installed when `yarn` runs (see [dev env](#development-environment)). `dvc`, `yaml`, or `diff` custom languages. `usage` is employed to show the `dvc --help` output for each command reference. `dvc` can be used to show examples of commands and their output in a terminal session. `yaml` is used to - show [DVC-file](/doc/user-guide/dvc-file-format) contents or other YAML data. - `diff` is used mainly for examples of `git diff` output. + show [DVC-file](/doc/user-guide/dvc-files-and-directories) contents or other + YAML data. `diff` is used mainly for examples of `git diff` output. 
> Check out the `.md` source code of any command reference to get a better idea, > for example in diff --git a/content/docs/user-guide/dvc-file-format.md b/content/docs/user-guide/dvc-file-format.md deleted file mode 100644 index d1036dfa95..0000000000 --- a/content/docs/user-guide/dvc-file-format.md +++ /dev/null @@ -1,105 +0,0 @@ -# DVC-File Format - -When you add a file (with `dvc add`) or a command (with `dvc run`) to a -[pipeline](/doc/command-reference/pipeline), DVC creates a special text metafile -with the `.dvc` file extension (e.g. `process.dvc`), or with the default name -`Dvcfile`. These **DVC-files** (a.k.a. stage files) contain all the needed -information to track your data and reproduce pipeline stages. The file itself -contains a simple YAML format that could be easily written or altered manually. - -See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable the -highlighting for your editor. - -Here is a sample DVC-file: - -```yaml -always_changed: true -locked: true -cmd: python cmd.py input.data output.data metrics.json -deps: - - md5: da2259ee7c12ace6db43644aef2b754c - path: cmd.py - - md5: e309de87b02312e746ec5a500844ce77 - path: input.data -md5: 521ac615cfc7323604059d81d052ce00 -outs: - - cache: true - md5: 70f3c9157e3b92a6d2c93eb51439f822 - metric: false - path: output.data - - cache: false - md5: d7a82c3cdfd45c4ace13484a931fc526 - metric: - type: json - xpath: AUC - path: metrics.json - -# Comments like this line persist through multiple executions of -# dvc repro/commit but not through dvc run/add/import-url/get-url commands. 
- -meta: # Special field to contain arbitary user data - name: John - email: john@xyz.com -``` - -## Structure - -On the top level, `.dvc` file consists of these possible fields: - -- `cmd`: Executable command defined in this stage -- `wdir`: Directory to run command in (default `.`) -- `md5`: MD5 hash for this DVC-file -- `deps`: List of dependencies for this stage -- `outs`: List of outputs for this stage -- `locked`: Whether or not this stage is locked from reproduction -- `always_changed`: Whether or not this stage is considered as changed by - commands such as `dvc status` and `dvc repro` (default `false`) - -A dependency entry consists of a these possible fields: - -- `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) -- `etag`: Strong ETag response header (only HTTP - [external dependencies](/doc/user-guide/external-dependencies) created with - `dvc import-url`) -- `params`: If this is a [parameter dependency](/doc/command-reference/params) - file, contains a list of the parameter names and their current values. -- `repo`: This entry is only for external dependencies created with - `dvc import`, and can contains the following fields: - - - `url`: URL of Git repository with source DVC project - - `rev`: Only present when the `--rev` option of `dvc import` is used. - Specific commit hash, branch or tag name, etc. (a - [Git revision](https://git-scm.com/docs/revisions)) used to import the - dependency from. - - `rev_lock`: Git commit hash of the external DVC repository at - the time of importing or updating (with `dvc update`) the dependency. - - > See the examples in - > [External Dependencies](/doc/user-guide/external-dependencies) for more - > info. 
- -An output entry consists of these fields: - -- `path`: Path to the output, relative to the `wdir` path -- `md5`: MD5 hash for the output -- `cache`: Whether or not DVC should cache the output -- `metric`: If this file is a [metric](/doc/command-reference/metrics), contains - the following fields: - - - `type`: Type of the metric file (`json`) - - `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` - for `{"AUC": {"value": 0.624321}}`) - -A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry -can have any valid YAML structure containing any number of attributes. -`"meta: string"` is also possible, it doesn't need to contain a _hash_ structure -(a.k.a. dictionary) always. - -Comments can be added to the DVC-file using `# comment` syntax. Comments and -meta values are preserved among executions of the `dvc repro` and `dvc commit` -commands. - -> Note that comments and meta values are not preserved when a DVC-file is -> overwritten with the `dvc run`,`dvc add`,`dvc import`, and `dvc import-url` -> commands. diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 7518ee9422..ec68b5fd00 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -1,8 +1,144 @@ # DVC Files and Directories Once initialized in a project, DVC populates its installation -directory (`.dvc/`) with the internal files and directories needed for DVC -operation: +directory (`.dvc/`) with the +[internal directories and files](#internal-directories-and-files) needed for DVC +operation. + +Additionally, there are two special kinds of files created by certain +[DVC commands](/doc/command-reference): + +- Files ending with the `.dvc` extension are placeholders to version data files + and directories. A DVC project usually has one + [`.dvc` file](#dvc-files) per large data file or dataset directory being + tracked.
+- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages + that form the pipeline(s) of a project, and their connections (_dependency + graph_ or DAG). + +Both use human-friendly YAML schemas, described below. We encourage you to get +familiar with them so you may edit them freely, as needed. Both types of files +should be versioned with Git (for Git-enabled repositories). + +## .dvc files + +When you add a file or directory to a DVC project with `dvc add` or +`dvc import`, a `.dvc` file is created based on the data file name (e.g. +`data.xml.dvc`). These files contain the information needed to track the data +with DVC. + +They use a simple [YAML](https://yaml.org/) format, meant to be easy to read, +edit, or even create manually by users. Here is a full sample: + +```yaml +outs: + - md5: a304afb96060aad90176268345e10355 + path: data.xml + +# Comments and user metadata are supported. +meta: + name: 'John Doe' + email: john@doe.com +``` + +`.dvc` files can contain the following fields: + +- `outs` (always present): List of output entries that represent + the files or directories tracked with DVC. Typically there is only one per + `.dvc` file (but several can be added or combined manually). +- `deps`: List of dependency entries for this stage, only present + when `dvc import` and `dvc import-url` are used. Typically there is only one + (but several can be added manually). +- `meta` (optional): Arbitrary metadata can be added manually with this field. + Any YAML content is supported. `meta` contents are ignored by DVC, but they + can be meaningful for user processes that read `.dvc` files. + +An _output entry_ can consist of these fields: + +- `md5`: Hash value for the file or directory being tracked with DVC +- `path`: Path to the file or directory, relative to the location of the `.dvc` + file +- `cache`: Whether or not DVC should cache the file or directory.
`true` by + default + +A _dependency entry_ consists of these possible fields: + +- `path`: Path to the dependency, relative to the `wdir` path (always present) +- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) +- `etag`: Strong ETag response header (only HTTP external + dependencies created with `dvc import-url`) +- `repo`: This entry is only for external dependencies created with + `dvc import`, and can contain the following fields: + + - `url`: URL of Git repository with source DVC project + - `rev`: Only present when the `--rev` option of `dvc import` is used. + Specific commit hash, branch or tag name, etc. (a + [Git revision](https://git-scm.com/docs/revisions)) used to import the + dependency from. + - `rev_lock`: Git commit hash of the external DVC repository at + the time of importing or updating the dependency (with `dvc update`) + +Note that comments can be added to `.dvc` files and `dvc.yaml` using the +`# comment` syntax. `meta` fields and `#` comments are preserved across +executions of the `dvc repro` and `dvc commit` commands, but not when a `.dvc` +file is overwritten by `dvc add`, `dvc import`, or `dvc import-url`. + +## dvc.yaml file + +When you add commands to a pipeline with `dvc run`, the `dvc.yaml` file is +created or updated. Here's a simple example: + +```yaml +stages: + features: + cmd: jupyter nbconvert --execute featurize.ipynb + deps: + - data/clean + params: + - levels.no + outs: + - features + metrics: + - performance.json + training: + cmd: python train.py + deps: + - train.py + - features + outs: + - model.pkl + plots: + - logs.csv: + x: epoch + x_label: Epoch + meta: 'For deployment' + # User metadata and comments are supported. +``` + +`dvc.yaml` files consist of a group of `stages` with names provided explicitly +by the user with the `--name` (`-n`) option of `dvc run`.
Each stage can contain +the following possible fields: + +- `cmd` (always present): Executable command defined in this stage +- `deps`: List of dependency file or directory paths of this stage +- `params`: List of [parameter dependencies](/doc/command-reference/params). + These are key paths referring to a YAML or JSON file (`params.yaml` by + default). +- `outs`: List of output file or directory paths of this stage +- `metrics`: List of [metric files](/doc/command-reference/metrics) +- `plots`: List of [plot metrics](/doc/command-reference/plots) and optionally, + their default configuration (subfields matching the options of + `dvc plots modify`) +- `frozen`: Whether or not this stage is frozen from reproduction +- `always_changed`: Whether or not this stage is considered as changed by + commands such as `dvc status` and `dvc repro`. `false` by default +- `meta` (optional): Arbitrary metadata can be added manually with this field. + Any YAML content is supported. `meta` contents are ignored by DVC, but they + can be meaningful for user processes that read or write `.dvc` files directly. + +`dvc.yaml` files also support `# comments`. + +## Internal directories and files - `.dvc/config`: This is a configuration file. The config file can be edited by hand or with the `dvc config` command. @@ -13,24 +149,25 @@ operation: (credentials, private locations, etc). The local config file can be edited by hand or with the command `dvc config --local`. -- `.dvc/cache`: The [cache directory](#structure-of-cache-directory) will store - your data. The data files and directories in the workspace will - only contain links to the data files in the cache. (Refer to +- `.dvc/cache`: The cache directory will store your data in a + special [structure](#structure-of-cache-directory). The data files and + directories in the workspace will only contain links to the data + files in the cache. (Refer to [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization).
See `dvc config cache` for related configuration options. > Note that DVC includes the cache directory in `.gitignore` during > initialization. No data tracked by DVC will ever be pushed to the Git - > repository, only [DVC-files](/doc/user-guide/dvc-file-format) that are - > needed to download or reproduce them. + > repository, only [DVC-files](/doc/user-guide/dvc-files-and-directories) that + > are needed to download or reproduce them. - `.dvc/plots`: Directory for - [Plot templates](/doc/command-reference/plots#plot-templates). + [plot templates](/doc/command-reference/plots#plot-templates) - `.dvc/tmp`: Directory for miscellaneous temporary files - `.dvc/tmp/index`: Directory for remote index files that are used for - optimizing `dvc push`, `dvc pull`, `dvc fetch` and `dvc status -c` operations. + optimizing `dvc push`, `dvc pull`, `dvc fetch` and `dvc status -c` operations - `.dvc/tmp/state`: This file is used for optimization. It is a SQLite database, that contains hash values for files tracked in a DVC project, with respective @@ -52,7 +189,7 @@ operation: - `.dvc/tmp/rwlock`: JSON file that contains read and write locks for specific dependencies and outputs, to allow safely running multiple DVC commands in - parallel. + parallel ## Structure of cache directory @@ -84,8 +221,8 @@ $ dvc add data/images ``` When running `dvc add` on this directory of images, a `data/images.dvc` -[DVC-file](/doc/user-guide/dvc-file-format) is created, containing the hash -value of the directory: +[DVC-file](/doc/user-guide/dvc-files-and-directories) is created, containing the +hash value of the directory: ```yaml md5: 77e511dafe2178d936e54331d5d6288f @@ -95,8 +232,8 @@ outs: # ... 
``` -The directory in cache is stored as a JSON metafile describing it's contents, -along with the files it contains in cache, like this: +The directory in cache is stored as a JSON file (with `.dir` file extension) +describing its contents, along with the files it contains in cache, like this: ```dvc $ tree .dvc/cache diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 994640eadc..71da15531c 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -11,10 +11,10 @@ DVC to control data outside of the project directory. ## Description DVC can track files on an external storage with `dvc add` or specify external -files as outputs for [DVC-files](/doc/user-guide/dvc-file-format) -created by `dvc run` (stage files). External outputs are considered part of the -DVC project. DVC will track changes in them and reflect this in the output of -`dvc status`. +files as outputs for +[DVC-files](/doc/user-guide/dvc-files-and-directories) created by `dvc run` +(stage files). External outputs are considered part of the DVC project. DVC will +track changes in them and reflect this in the output of `dvc status`. Currently, the following types (protocols) of external outputs (and cache) are supported: diff --git a/redirects-list.json b/redirects-list.json index b16be8a49a..2bcdddf51e 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -22,8 +22,9 @@ "^/doc/get-started(/.*)?$ /doc/tutorials/get-started$1", "^/doc/tutorial/?$ /doc/tutorials", "^/doc/tutorial/(.*)?
/doc/tutorials/deep/$1", - "^/doc/commands-reference(/.*)?$ /doc/command-reference$1", "^/doc/use-cases/data-and-model-files-versioning/?$ /doc/use-cases/versioning-data-and-model-files", + "^/doc/user-guide/dvc-file-format$ /doc/user-guide/dvc-files-and-directories", + "^/doc/commands-reference(/.*)?$ /doc/command-reference$1", "^/doc/command-reference/plot$ /doc/command-reference/plots", "^/doc/command-reference/lock$ /doc/command-reference/freeze", "^/doc/command-reference/unlock$ /doc/command-reference/unfreeze",
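
Review note on the `redirects-list.json` hunk: the new entry sends the removed `dvc-file-format` page to `dvc-files-and-directories`. A minimal Python sketch of how such entries could be checked follows. It assumes each entry is a regular expression and a target separated by a single space, that `$1` in the target stands for the first captured group, and that the first matching rule wins. The diff does not show how the site actually evaluates these rules, so `resolve` and `REDIRECTS` are hypothetical names used only for illustration.

```python
import re

# Two entries copied from the redirects-list.json hunk above.
REDIRECTS = [
    "^/doc/user-guide/dvc-file-format$ /doc/user-guide/dvc-files-and-directories",
    "^/doc/commands-reference(/.*)?$ /doc/command-reference$1",
]


def resolve(path, rules=REDIRECTS):
    """Return the redirect target for `path`, or None if no rule matches.

    Assumption: rules are tried in order and the first match wins.
    """
    for rule in rules:
        # Split "<regex> <target>" on the first space only.
        pattern, target = rule.split(" ", 1)
        match = re.search(pattern, path)
        if match:
            # Substitute `$N` with the N-th captured group (empty if it did not participate).
            return re.sub(
                r"\$(\d+)",
                lambda m: match.group(int(m.group(1))) or "",
                target,
            )
    return None
```

Under these assumptions, the old page path resolves to the new one, while a path with a trailing slash does not match the new rule because its pattern ends in `$` right after `dvc-file-format`.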