From 8f2e4b31e1293e48fd8133a42b320367cafa1767 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 13 Feb 2020 22:30:34 -0600
Subject: [PATCH] Regular updates (Feb 10+) (#988)

* cmd ref: copy edits and improve `move` examples, adding one with `import`
related to https://discordapp.com/channels/485586884165107732/485596304961962003/676360262416203776

* tutorials: correct section title in versioning
per https://github.com/iterative/dvc.org/pull/933#discussion_r378681324

* term: review "point" (file hash context)
per https://github.com/iterative/dvc.org/issues/552#issuecomment-520633331
---
 public/static/docs/command-reference/add.md   |  9 ++--
 public/static/docs/command-reference/get.md   | 13 +++---
 .../static/docs/command-reference/import.md   | 17 ++++----
 public/static/docs/command-reference/move.md  | 42 +++++++++----------
 .../docs/command-reference/remote/add.md      |  2 +-
 public/static/docs/tutorials/versioning.md    | 20 ++++-----
 .../docs/understanding-dvc/core-features.md   |  5 ++-
 .../docs/user-guide/external-dependencies.md  |  4 +-
 .../user-guide/large-dataset-optimization.md  | 16 +++----
 .../docs/user-guide/managing-external-data.md |  7 ++--
 10 files changed, 69 insertions(+), 66 deletions(-)

diff --git a/public/static/docs/command-reference/add.md b/public/static/docs/command-reference/add.md
index e74af2c70a..2d58084210 100644
--- a/public/static/docs/command-reference/add.md
+++ b/public/static/docs/command-reference/add.md
@@ -72,10 +72,11 @@ to work with directory hierarchies with `dvc add`:
    `--no-commit` flag is used).
 2. When not using `--recursive` a DVC-file is created for the top of the
    directory (with default name `dirname.dvc`). Every file in the hierarchy is
-   added to the cache (unless `--no-commit` flag is added), but DVC does not
-   produce individual DVC-files for each file in the directory tree. Instead,
-   the single DVC-file references a file in the cache that in turn points to the
-   files in the added hierarchy.
+   added to the cache (unless the `--no-commit` option is used), but DVC does
+   not produce individual DVC-files for each file in the directory tree.
+   Instead, the single DVC-file references a special JSON file in the cache
+   (with `.dir` extension), that in turn points to the files added from the
+   hierarchy.
 
 In a DVC project, `dvc add` can be used to version control any data artifact
 (input, intermediate, or output files and
diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md
index 28e36e3754..36ebc1fcba 100644
--- a/public/static/docs/command-reference/get.md
+++ b/public/static/docs/command-reference/get.md
@@ -34,12 +34,13 @@ to an "offline" repo (if it's a DVC repo without a default remote, instead of
 downloading, DVC will try to copy the target data from its cache).
 
 The `path` argument of this command is used to specify the location of the
-target to be downloaded within the source repository at `url`. It can point to
-any file or directory in there, including outputs tracked by DVC,
-as well as files tracked by Git. Note that for DVC repos, the target should be
-found in one of the [DVC-files](/doc/user-guide/dvc-file-format) of the project.
-The project should also have a default
-[DVC remote](/doc/command-reference/remote), containing the actual data.
+target to be downloaded within the source repository at `url`. `path` can
+specify any file or directory in the source repo, including outputs
+tracked by DVC, as well as files tracked by Git. Note that for DVC repos, the
+target should be found in one of the
+[DVC-files](/doc/user-guide/dvc-file-format) of the project. The project should
+also have a default [DVC remote](/doc/command-reference/remote), containing the
+actual data.
 
 > See `dvc get-url` to download data from other supported locations such as S3,
 > SSH, HTTP, etc.
diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md
index 92f323e5a6..5d24211551 100644
--- a/public/static/docs/command-reference/import.md
+++ b/public/static/docs/command-reference/import.md
@@ -35,12 +35,13 @@ to an "offline" repo (if it's a DVC repo without a default remote, instead of
 downloading, DVC will try to copy the target data from its cache).
 
 The `path` argument of this command is used to specify the location of the
-target to be downloaded within the source repository at `url`. It can point to
-any file or directory in there, including outputs tracked by DVC,
-as well as files tracked by Git. Note that for DVC repos, the target should be
-found in one of the [DVC-files](/doc/user-guide/dvc-file-format) of the project.
-The project should also have a default
-[DVC remote](/doc/command-reference/remote), containing the actual data.
+target to be downloaded within the source repository at `url`. `path` can
+specify any file or directory in the source repo, including outputs
+tracked by DVC, as well as files tracked by Git. Note that for DVC repos, the
+target should be found in one of the
+[DVC-files](/doc/user-guide/dvc-file-format) of the project. The project should
+also have a default [DVC remote](/doc/command-reference/remote), containing the
+actual data.
 
 > See `dvc import-url` to download and track data from other supported locations
 > such as S3, SSH, HTTP, etc.
@@ -156,8 +157,8 @@ deps:
       rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
 ```
 
-If `rev` is a Git branch or tag (where the commit it points to changes), the
-data source may have updates at a later time. To bring it up to date if so (and
+If `rev` is a Git branch or tag (where the underlying commit changes), the data
+source may have updates at a later time. To bring it up to date if so (and
 update `rev_lock` in the DVC-file), simply use `dvc update .dvc`. If
 `rev` is a specific commit hash (does not change), `dvc update` will never have
 an effect on the import stage. You may **re-import** a different commit instead,
diff --git a/public/static/docs/command-reference/move.md b/public/static/docs/command-reference/move.md
index 5196c05fbe..3cdf71754d 100644
--- a/public/static/docs/command-reference/move.md
+++ b/public/static/docs/command-reference/move.md
@@ -81,31 +81,33 @@ outs:
 
 - `-v`, `--verbose` - displays detailed tracing information.
 
-## Examples
+## Example: change an output file name
 
-Here we use `dvc add`to put a file under DVC control. Then we change the name of
-it using `dvc move`.
+We first use `dvc add` to track file with DVC. Then, we change its name using
+`dvc move`.
 
 ```dvc
 $ dvc add data.csv
+...
 $ tree
 .
 ├── data.csv
 └── data.csv.dvc
-
 $ dvc move data.csv other.csv
+...
 $ tree
 .
 ├── other.csv
 └── other.csv.dvc
 ```
 
-Here we use `dvc add` to put a file under DVC control. Then we use `dvc move` to
-change its location. Note that the `data.csv.dvc`
-[DVC-file](/doc/user-guide/dvc-file-format) is also moved. If target path
-already exists and is a directory, data file is moved with unchanged name into
-this folder.
+## Example: change an output location
+
+We use `dvc add` to track a file with DVC, then we use `dvc move` to change its
+location. If target path already exists and is a directory, data file is moved
+with unchanged name into this folder. Note that the `data.csv.dvc`
+[DVC-file](/doc/user-guide/dvc-file-format) is also moved.
 
 ```dvc
 $ tree
@@ -116,6 +118,7 @@ $ tree
     └── subdir
 
 $ dvc add data/foo
+...
 $ tree
 .
 ├── data
@@ -125,6 +128,7 @@ $ tree
     └── subdir
 
 $ dvc move data/foo data2/subdir/
+...
 $ tree
 .
 ├── data
@@ -134,28 +138,24 @@ $ tree
         └── foo.dvc
 ```
 
-In this example we use `dvc add` to put a directory under DVC control. Then we
-use `dvc move` to move the whole directory. As in other cases, DVC-file is also
-moved.
+## Example: change an imported directory name and location
 
-```dvc
-$ tree
-.
-├── data
-│   ├── bar
-│   └── foo
-└── data2
+Let's try the same with an entire directory imported from an external DVC
+repository with `dvc import`. Note that, as in the previous cases, the
+DVC-file is also moved.
 
-$ dvc add data
+```dvc
+$ dvc import ../another-repo data
+...
 $ tree
 .
 ├── data
 │   ├── bar
 │   └── foo
-├── data2
 └── data.dvc
 
 $ dvc move data data2/data3
+...
 $ tree
 .
 └── data2
diff --git a/public/static/docs/command-reference/remote/add.md b/public/static/docs/command-reference/remote/add.md
index c0249e6893..ba4d21a5fb 100644
--- a/public/static/docs/command-reference/remote/add.md
+++ b/public/static/docs/command-reference/remote/add.md
@@ -24,7 +24,7 @@ positional arguments:
 ## Description
 
 `name` and `url` are required. `url` specifies a location to store your data. It
-can point to a cloud storage service, an SSH server, network-attached storage,
+can represent a cloud storage service, an SSH server, network-attached storage,
 or even a directory in the local file system. (See all the supported remote
 storage types in the examples below.) If `url` is a relative path, it will be
 resolved against the current working directory, but saved **relative to the
diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md
index a649c89d59..cec7a622bf 100644
--- a/public/static/docs/tutorials/versioning.md
+++ b/public/static/docs/tutorials/versioning.md
@@ -15,8 +15,8 @@ to build a powerful image classifier using a pretty small dataset.
 
 We first train a classifier model using 1000 labeled images, then we double the
 number of images (2000) and retrain our model. We capture both datasets and
-classifier results and show how to use `dvc checkout` to switch between data
-and/or model versions.
+classifier results and show how to use `dvc checkout` to switch between
+workspace versions.
 
 The specific algorithm used to train and validate the classifier is not
 important, and no prior knowledge of Keras is required. We'll reuse the
@@ -165,7 +165,7 @@ $ git tag -a "v1.0" -m "model v1.0, 1000 images"
 As we mentioned briefly, DVC does not commit the `data/` directory and
 `model.h5` file with Git. Instead, `dvc add` stores them in the cache (usually
 in `.dvc/cache`) and adds them to `.gitignore`. We then `git commit` DVC-files
-that contain pointers to the cached data.
+that contain file hashes that point to cached data.
 
 In this case we created `data.dvc` and `model.h5.dvc`. Refer to
 [DVC-File Format](/doc/user-guide/dvc-file-format) to learn more about how these
@@ -241,11 +241,11 @@ $ git commit -m "Second model, trained with 2000 images"
 $ git tag -a "v2.0" -m "model v2.0, 2000 images"
 ```
 
-That's it! We have a second model and dataset saved and pointers to them
-committed with Git. Let's now look at how DVC can help us go back to the
-previous version if we need to.
+That's it! We have tracked a second dataset, model, and metrics versioned DVC,
+and the DVC-files that point to them committed with Git. Let's now look at how
+DVC can help us go back to the previous version if we need to.
 
-## Switching between data and/or model versions
+## Switching between workspace versions
 
 The DVC command that helps get a specific committed version of data is designed
 to be similar to `git checkout`. All we need to do in our case is to
@@ -291,8 +291,8 @@ directory inside the repository). Instead, DVC creates placeholders that point
 to the cached files, and they can be easily version controlled with Git.
 
-When we run `git checkout` we restore pointers (DVC-files) first, then when we
-run `dvc checkout` we use these pointers to put the right data in the right
+When we run `git checkout` we restore pointers (DVC-files) first. Then, when we
+run `dvc checkout`, we use these pointers to put the right data in the right
 place.
@@ -312,7 +312,7 @@ When you have a script that takes some data as an input and produces other data
 outputs, a better way to capture them is to use `dvc run`:
 
 > If you tried the commands in the
-> [Switching between data and/or model versions](#switching-between-data-and-or-model-versions)
+> [Switching between workspace versions](#switching-between-workspace-versions)
 > section, go back to the master branch code and data with:
 >
 > ```dvc
diff --git a/public/static/docs/understanding-dvc/core-features.md b/public/static/docs/understanding-dvc/core-features.md
index 4e3abcdc7c..44dee4ad2e 100644
--- a/public/static/docs/understanding-dvc/core-features.md
+++ b/public/static/docs/understanding-dvc/core-features.md
@@ -6,8 +6,9 @@
 - It makes data science projects **reproducible** by creating lightweight
   [pipelines](/doc/command-reference/pipeline) using implicit dependency graphs.
 
-- **Large data file versioning** works by creating pointers in your Git
-  repository to the cache, typically stored on a local hard drive.
+- **Large data file versioning** works by creating special files in your Git
+  repository that point to the cache, typically stored on a local
+  hard drive.
 
 - DVC is **Programming language agnostic**: Python, R, Julia, shell scripts,
   etc. as well as ML library agnostic: Keras, Tensorflow, PyTorch, Scipy, etc.
diff --git a/public/static/docs/user-guide/external-dependencies.md b/public/static/docs/user-guide/external-dependencies.md
index 1df8821ca2..cecbc21f84 100644
--- a/public/static/docs/user-guide/external-dependencies.md
+++ b/public/static/docs/user-guide/external-dependencies.md
@@ -28,8 +28,8 @@ supported:
 > `dvc remote`.
 
 In order to specify an external dependency for your stage, use the usual '-d'
-option in `dvc run` with the external path or URL pointing to your desired file
-or directory.
+option in `dvc run` with the external path or URL to your desired file or
+directory.
 
 ## Examples
 
diff --git a/public/static/docs/user-guide/large-dataset-optimization.md b/public/static/docs/user-guide/large-dataset-optimization.md
index b88a94a96f..aaff5dce72 100644
--- a/public/static/docs/user-guide/large-dataset-optimization.md
+++ b/public/static/docs/user-guide/large-dataset-optimization.md
@@ -16,17 +16,17 @@ will be duplicated between the workspace and the cache? **That would not be
 efficient!** Especially with large files (several Gigabytes or larger).
 
 In order to have the files present in both directories without duplication, DVC
-can automatically create **file links** in the workspace that "point" to the
-data in cache. In fact, by default it will attempt to use reflinks\* if
-supported by the file system.
+can automatically create **file links** to the cached data in the workspace. In
+fact, by default it will attempt to use reflinks\* if supported by the file
+system.
 
 ## File link types for the DVC cache
 
-File links are entries in the file system that don't necessarily hold the file
-contents, but point to where the file is actually stored. File links are more
-common in file systems used with UNIX-like operating systems and come in
-different kinds, that differ in how they connect file names to _inodes_ in the
-system.
+File links are lightweight entries in the file system that don't hold the file
+contents, but work as shortcuts to where the original data is actually stored.
+They're more common in file systems used with UNIX-like operating systems, and
+come in different kinds that differ in how they connect file names to _inodes_
+in the system.
 
 > **Inodes** are metadata file records to locate and store permissions to the
 > actual file contents. See **Linking files** in
diff --git a/public/static/docs/user-guide/managing-external-data.md b/public/static/docs/user-guide/managing-external-data.md
index d396148a0f..af673517e8 100644
--- a/public/static/docs/user-guide/managing-external-data.md
+++ b/public/static/docs/user-guide/managing-external-data.md
@@ -29,10 +29,9 @@ supported:
 > `dvc remote`.
 
 In order to specify an external output for a stage file, use the usual `-o` or
-`-O` options of the `dvc run` command, but with the external path or URL
-pointing to the file in question. For cached external outputs
-(`-o`) you will need to
-[setup an external cache](/doc/command-reference/config#cache) location.
+`-O` options of the `dvc run` command, but with the external path or URL to the
+file in question. For cached external outputs (`-o`) you will need
+to [setup an external cache](/doc/command-reference/config#cache) location.
 Non-cached external outputs (`-O`) do not require an external cache to be setup.
 
 > Avoid using the same remote location that you are using for `dvc push`,