diff --git a/content/docs/api-reference/get_url.md b/content/docs/api-reference/get_url.md
index 0cdb531271..27aed63526 100644
--- a/content/docs/api-reference/get_url.md
+++ b/content/docs/api-reference/get_url.md
@@ -36,7 +36,7 @@ URL returned depends on the `remote` used (see the [Parameters](#parameters)
 section).

 If the target is a directory, the returned URL will end in `.dir`. Refer to
-[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
 and `dvc add` to learn more about how DVC handles data directories.

 ⚠️ This function does not check for the actual existence of the file or
diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index 75501d2eb9..1173c78b41 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -38,7 +38,7 @@ each one:
 1. Calculate the file hash.
 2. Move the file contents to the cache (by default in `.dvc/cache`), using the
    file hash to form the cached file path. (See
-   [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+   [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
    for more details.)
 3. Attempt to replace the file with a link to the cached data (more details on
    file linking further down).
@@ -71,7 +71,7 @@ large files. DVC also supports other link types for use on file systems without
 `reflink` support, but they have to be specified manually. Refer to the
 `cache.type` config option in `dvc config cache` for more information.

-### Tracking directories
+### Adding entire directories

 A `dvc add` target can be an individual file or a directory. In the latter case,
 a `.dvc` file is created for the top of the directory (with default name
@@ -83,9 +83,13 @@ in the directory tree.
 Instead, the single `.dvc` file references a special JSON file in the cache
 (with `.dir` extension), that in turn points to the added files.

-Note that DVC commands that use tracked files support granular targeting of
-files, even when the directory is added as a whole. Examples: `dvc push`,
-`dvc pull`, `dvc get`, `dvc import`, etc.
+> Refer to
+> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
+> for more info on `.dir` cache entries.
+
+Note that DVC commands that use tracked data support granular targeting of files
+and directories, even when contained in a parent directory added as a whole.
+Examples: `dvc push`, `dvc pull`, `dvc get`, `dvc import`, etc.

 As a rarely needed alternative, the `--recursive` option causes every file in
 the hierarchy to be added individually. A corresponding `.dvc` file will be
@@ -192,9 +196,8 @@ outs:
     path: pics
 ```

-> Refer to
-> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
-> for more info.
+> Refer to [Adding entire directories](#adding-entire-directories) for more
+> info.

 This allows us to treat the entire directory structure as a single data
 artifact. For example, you can pass the whole directory tree as a
diff --git a/content/docs/command-reference/cache/dir.md b/content/docs/command-reference/cache/dir.md
index cc8bc49ddc..d2c64e119d 100644
--- a/content/docs/command-reference/cache/dir.md
+++ b/content/docs/command-reference/cache/dir.md
@@ -17,7 +17,7 @@ positional arguments:
 ## Description

 Helper to set the `cache.dir` configuration option. (See
-[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).)
+[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).)
 Unlike doing so with `dvc config cache`, this command transform paths (`value`)
 that are provided relative to the current working directory into paths
 **relative to the config file location**. However, if the `value` provided is an
diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md
index b683cec054..9c5af7052f 100644
--- a/content/docs/command-reference/config.md
+++ b/content/docs/command-reference/config.md
@@ -99,7 +99,7 @@ remote. See `dvc remote` for more information.
 A DVC project cache is the hidden storage (by default located in the
 `.dvc/cache` directory) for files that are tracked by DVC, and their different
 versions. (See `dvc cache` and
-[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
 for more details.) This section contains the following options:

 - `cache.dir` - set/unset cache directory location. A correct value must be
diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md
index 36d3194ff8..af3abf8509 100644
--- a/content/docs/command-reference/fetch.md
+++ b/content/docs/command-reference/fetch.md
@@ -189,7 +189,7 @@ $ tree .dvc
 Note that the `.dvc/cache` directory was created and populated.

 > Refer to
-> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
 > for more info.

 Used without arguments (as above), `dvc fetch` downloads all assets needed by
diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md
index d3d17312d5..2fb92be3f2 100644
--- a/content/docs/command-reference/gc.md
+++ b/content/docs/command-reference/gc.md
@@ -29,7 +29,7 @@ of commits (determined by reading the DVC-files in them).
 See the [Options](#options) section for more details.

 > Note that `dvc gc` tries to fetch any missing
-> [`.dir` files](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+> [`.dir` files](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
 > from [remote storage](/doc/command-reference/remote) to the local
 > cache, in order to know which files should exist inside cached
 > directories. These files may be missing if the cache directory was previously
diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md
index 33ff39cc9b..e9168bb81b 100644
--- a/content/docs/command-reference/push.md
+++ b/content/docs/command-reference/push.md
@@ -194,7 +194,7 @@ Finally, we used `dvc status` to double check that all data had been uploaded.
 ## Example: What happens in the cache?

 Let's take a detailed look at what happens to the
-[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
 as you run an experiment locally and push data to remote storage. To set the
 example consider having created a workspace that contains some code and data,
 and having set up a remote.
@@ -242,7 +242,7 @@ the cache having more files in it than the remote – which is what the `new`
 state means.

 > Refer to
-> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
 > for more info.

 Next we can copy the remaining data from the cache to the remote using
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index e94a797c6b..7746a9cfdf 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -73,23 +73,38 @@ $ dvc run -n printer -d write.sh -o pages ./write.sh
 $ dvc run -n scanner -d read.sh -d pages -o signed.pdf ./read.sh
 ```

+Stage dependencies can be any file or directory, either untracked, or more
+commonly tracked by DVC or Git. Outputs will be tracked and cached
+by DVC when the stage is run. Every output version will be cached when the stage
+is reproduced (see also `dvc gc`).
+
 Relevant notes:

-- Typically, scripts being run (or a directory containing the source code) are
-  included among the specified `-d` dependencies. This ensures that when the
-  source code changes, DVC knows that the stage needs to be reproduced. (You can
-  chose whether to do this.)
+- Typically, scripts being run (or possibly a directory containing the source
+  code) are included among the specified `-d` dependencies. This ensures that
+  when the source code changes, DVC knows that the stage needs to be reproduced.
+  (You can choose whether to do this.)

 - `dvc run` checks the dependency graph integrity before creating a new stage.
-  For example: two stage cannot explicitly specify the same output, there should
-  be no cycles, etc.
+  For example: two stages cannot specify the same output or overlapping output
+  paths, there should be no cycles, etc.

 - DVC does not feed dependency files to the command being run. The program will
   have to read by itself the files specified with `-d`.

-- Outputs are deleted from the workspace before executing the
-  command (including at `dvc repro`), so it should be able to recreate any
-  directories marked as outputs.
+- Entire directories produced by the stage can be tracked as outputs by DVC,
+  which generates a single `.dir` entry in the cache (refer to
+  [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
+  for more info).
+
+- [External dependencies](/doc/user-guide/external-dependencies) and
+  [external outputs](/doc/user-guide/managing-external-data) (outside of the
+  workspace) are also supported.
+
+- Outputs are deleted from the workspace before executing the command (including
+  at `dvc repro`) if their paths are found as existing files/directories. This
+  also means that the stage command needs to recreate any directory structures
+  defined as outputs every time it's executed by DVC.

 ### For displaying and comparing data science experiments

@@ -117,7 +132,7 @@ systems and require certain software packages to be installed.

 Wrap the command with double quotes `"` if there are special characters in it
 like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to
-`dvc run` as a whole. Use single quotes `'` instead if there are environment
+`dvc run` itself. Use single quotes `'` instead if there are environment
 variables in it that should be evaluated dynamically. Examples:

 ```dvc
diff --git a/content/docs/user-guide/basic-concepts/dvc-cache.md b/content/docs/user-guide/basic-concepts/dvc-cache.md
index b1afec5846..1d080775f4 100644
--- a/content/docs/user-guide/basic-concepts/dvc-cache.md
+++ b/content/docs/user-guide/basic-concepts/dvc-cache.md
@@ -6,4 +6,4 @@ match: ['DVC cache', cache, caches, cached]
 The DVC cache is a hidden storage (by default located in the `.dvc/cache`
 directory) for files that are under DVC control, and their different versions.
 For more details, please refer to this
-[document](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).
+[document](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md
index d5e78bfea8..48b9c3a34d 100644
--- a/content/docs/user-guide/dvc-files-and-directories.md
+++ b/content/docs/user-guide/dvc-files-and-directories.md
@@ -236,7 +236,7 @@ separately under `params`, grouped by parameters file.
   hand or with the command `dvc config --local`.

 - `.dvc/cache`: The cache directory will store your data in a
-  special [structure](#structure-of-cache-directory). The data files and
+  special [structure](#structure-of-the-cache-directory). The data files and
   directories in the workspace will only contain links to the data files in the
   cache. (Refer to
   [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization). See
@@ -277,7 +277,7 @@ separately under `params`, grouped by parameters file.
   dependencies and outputs, to allow safely running multiple DVC commands in
   parallel

-## Structure of cache directory
+## Structure of the cache directory

 There are two ways in which the data is stored in cache: As a single file (eg.
 `data.csv`), or a directory of files.
diff --git a/content/docs/user-guide/dvcignore.md b/content/docs/user-guide/dvcignore.md
index eceef9dff1..5fc2ca3291 100644
--- a/content/docs/user-guide/dvcignore.md
+++ b/content/docs/user-guide/dvcignore.md
@@ -85,7 +85,7 @@ Only the hash values of a directory (`data/`) and one file have been
 (`data1`).

 > Refer to
-> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
 > for more info.

 Now, let's modify file `data1` and see if it affects `dvc status`.
diff --git a/content/docs/user-guide/setup-google-drive-remote.md b/content/docs/user-guide/setup-google-drive-remote.md
index d8d7d42b50..2c874bd89f 100644
--- a/content/docs/user-guide/setup-google-drive-remote.md
+++ b/content/docs/user-guide/setup-google-drive-remote.md
@@ -18,10 +18,13 @@ to establish GDrive remote connections (e.g. CI/CD).
 ## Quick start

 To start using a Google Drive remote, you only need to add it with a
-[valid URL format](#url-format). Then use any DVC command that needs it (e.g.
-`dvc pull`, `dvc fetch`, `dvc push`). For example:
+[valid URL format](#url-format). Then use any DVC command that needs to connect
+to it (e.g. `dvc pull` or `dvc push` once there's tracked data to synchronize).
+For example:

 ```dvc
+$ dvc add data
+...
 $ dvc remote add --default myremote \
                  gdrive://0AIac4JZqHhKmUk9PDA/dvcstore
 $ dvc push
@@ -192,9 +195,10 @@ authentication is needed.
 ## Authorization

 On the first usage of a GDrive [remote](/doc/command-reference/remote), for
-example when trying to `dvc push` for the first time after adding the remote
-with a [valid URL](#url-format), DVC will prompt you to visit a special Google
-authentication web page. There you'll need to sign into your Google account. The
+example when trying to `dvc push` tracked data for the first time, DVC will
+prompt you to visit a special Google authentication web page. There you'll need
+to sign into a Google account with the needed access to the GDrive
+[URL](#url-format) in question. The
 [auth process](https://developers.google.com/drive/api/v2/about-auth) will ask
 you to grant DVC the necessary permissions, and produce a verification code
 needed for DVC to complete the connection. On success, the necessary credentials