From 6fc87643a7e83d7573511a3c2c4def1cf5a8feb3 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 9 Aug 2019 19:18:46 -0500 Subject: [PATCH 01/17] term: review usage of "workspace" which is a general concept, not super specific to DVC. See https://github.com/iterative/dvc.org/pull/522#pullrequestreview-272147157 --- src/Documentation/glossary.js | 21 +++-- static/docs/commands-reference/add.md | 18 ++-- static/docs/commands-reference/cache/index.md | 5 +- static/docs/commands-reference/checkout.md | 26 +++--- static/docs/commands-reference/commit.md | 10 +-- static/docs/commands-reference/config.md | 20 ++--- static/docs/commands-reference/destroy.md | 20 ++--- static/docs/commands-reference/fetch.md | 32 ++++--- static/docs/commands-reference/gc.md | 2 +- static/docs/commands-reference/import-url.md | 21 +++-- static/docs/commands-reference/import.md | 17 ++-- static/docs/commands-reference/install.md | 85 +++++++++---------- .../docs/commands-reference/metrics/modify.md | 4 +- static/docs/commands-reference/pull.md | 45 +++++----- static/docs/commands-reference/push.md | 38 ++++----- static/docs/commands-reference/remove.md | 22 +++-- static/docs/commands-reference/repro.md | 14 +-- static/docs/commands-reference/status.md | 40 ++++----- static/docs/commands-reference/unprotect.md | 19 +++-- static/docs/commands-reference/version.md | 32 +++---- 20 files changed, 256 insertions(+), 235 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index a6d31a05e5..3677018cc4 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -8,16 +8,25 @@ export default { name: 'Workspace', match: ['workspace'], desc: ` -By "workspace" we refer to the directory containing all your project files. For -example raw datasets, source code, ML models, etc. A workspace becomes a DVC -project when [\`dvc init\`](/doc/commands-reference/init) is run, and -[DVC-files](/doc/user-guide/dvc-file-format) are created in it.
It\s typically -also a Git repository. +Directory containing all your project files. For example raw datasets, source +code, ML models, etc. A workspace becomes a **DVC project** when +[\`dvc init\`](/doc/commands-reference/init) is run, and +[DVC-files](/doc/user-guide/dvc-file-format) are created in it. + ` + }, + { + name: 'DVC Project', + match: ['DVC project', 'project'], + desc: ` +Initialized by running \`dvc init\` in the **workspace**. It will contain the +\`.dvc/\` directory and [DVC-files](/doc/user-guide/dvc-file-format) created +with commands such as \`dvc add\` or \`dvc run\`. It's typically also a Git +repository. ` }, { name: 'DVC Cache', - match: ['DVC cache', 'cache', 'cache directory'], + match: ['DVC cache', 'cache', 'cache directory', 'cached'], desc: ` The DVC cache is a hidden storage (by default located in the \`.dvc/cache\` directory) for files that are under DVC control, and their different versions. diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 41ff03d2be..369dbd0954 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -27,8 +27,9 @@ Under the hood, a few actions are taken for each file in `targets`: 3. Replace the file by a link to the file in the cache (see details below). 4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store the checksum to identify the cache entry. -5. Add the target(s) to `.gitignore` (if Git is used in this workspace) to - prevent it from being committed to the Git repository. +5. Add the target(s) to `.gitignore` (if Git is used in this + workspace) to prevent it from being committed to the Git + repository. 6. Instructions are printed showing `git` commands for adding the files to a Git repository. 
If a different SCM system is being used, use the equivalent command for that system or nothing is printed if `--no-scm` was specified for @@ -69,12 +70,13 @@ to work with directory hierarchies with `dvc add`. the single DVC-file points to a file in the DVC cache that contains references to the files in the added hierarchy. -In a DVC project `dvc add` can be used to version control any data -artifact (input, intermediate, or output files and directories, and model -files). It is useful by itself to go back and forth between different versions -of datasets or models. Usually though, it is recommended to use `dvc run` and -`dvc repro` mechanism to version control intermediate and final results (like -models). This way you bring data provenance and make your project reproducible. +In a DVC project, `dvc add` can be used to version control any +data artifact (input, intermediate, or output files and +directories, and model files). It is useful by itself to go back and forth +between different versions of datasets or models. Usually though, it is +recommended to use `dvc run` and `dvc repro` mechanism to version control +intermediate and final results (like models). This way you bring data provenance +and make your project reproducible. ## Options diff --git a/static/docs/commands-reference/cache/index.md b/static/docs/commands-reference/cache/index.md index 0dfaf3a189..80b66d28b8 100644 --- a/static/docs/commands-reference/cache/index.md +++ b/static/docs/commands-reference/cache/index.md @@ -21,8 +21,9 @@ default `cache` directory. The DVC cache is where your data files, models, etc (anything you want to version with DVC) are actually stored. The corresponding files you see in the -workspace simply link to the ones in cache. (See `dvc config cache`, `type` -config option, for more information on file links on different platforms.) +workspace simply link to the ones in cache. 
(See +`dvc config cache`, `type` config option, for more information on file links on +different platforms.) > For more cache-related configuration options refer to `dvc config cache`. diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 7640c80ee4..943149825f 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -1,6 +1,7 @@ # checkout -Update data files and directories in workspace based on current DVC-files. +Update data files and directories in the workspace based on current +DVC-files. ## Synopsis @@ -15,14 +16,14 @@ positional arguments: ## Description -[DVC-files](/doc/user-guide/dvc-file-format) in the workspace specify which -instance of each data file or directory is to be used, using the checksum saved -in the `outs` fields. The `dvc checkout` command updates the workspace data to -match with the cache files corresponding to those checksums. +[DVC-files](/doc/user-guide/dvc-file-format) in a DVC project +specify which instance of each data file or directory is to be used, using the +checksum saved in the `outs` fields. The `dvc checkout` command updates the +workspace data to match with the cache files corresponding to those checksums. Using an SCM like Git, the DVC-files are kept under version control. At a given -branch or tag of the SCM workspace, the DVC-files will contain checksums for the -corresponding data files kept in the DVC cache. After an SCM command like +branch or tag of the SCM repository, the DVC-files will contain checksums for +the corresponding data files kept in the DVC cache. After an SCM command like `git checkout` is run, the DVC-files will change to the state at the specified branch or commit or tag. Afterwards, the `dvc checkout` command is required in order to synchronize the data files with the currently checked out DVC-files. 
@@ -45,8 +46,9 @@ The execution of `dvc checkout` does: `hardlink`, `symlink`, or `copy`) depends on the OS and the configured value for `cache.type` – See `dvc config cache`. -Note that this DVC by default tries NOT to copy files between the cache and the -workspace by using reflinks when supported by the file system. (Refer to +Note that this command by default tries NOT to copy files between the cache and +the workspace, using reflinks instead when supported by the file system. (Refer +to [File link types](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).) The next linking strategy default value is `copy` though, so unless other file link types are manually configured in `cache.type` (using `dvc config`), files @@ -103,8 +105,8 @@ be pulled from a remote cache using `dvc pull`. ## Examples -Let's employ a simple workspace with some data, code, ML models, pipeline -stages, as well as a few Git tags, such as our +Let's employ a simple workspace with some data, code, ML models, +pipeline stages, as well as a few Git tags, such as our [get started example repo](https://github.com/iterative/example-get-started). Then we can see what happens with `git checkout` and `dvc checkout` as we switch from tag to tag. @@ -122,7 +124,7 @@ $ cd example-get-started -The workspace looks almost like in this +The workspace `tree` looks almost like in this [pipeline setup](/doc/get-started/example-pipeline): ```dvc diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index e892d12cbb..06e5fc05c1 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -52,7 +52,7 @@ to the DVC cache as the last step. What _commit_ means is that DVC: - Computes a checksum for the file/directory - Enters the checksum and file name into the DVC-file - Tells the SCM to ignore the file/directory (e.g. 
add entry to `.gitignore`) - (Note that if the workspace was initialized with no SCM support + (Note that if the workspace was initialized with no SCM support (`dvc init --no-scm`), this does not happen.) - Adds the file/directory or to the DVC cache @@ -90,10 +90,10 @@ into play. It handles that last step of adding the file to the DVC cache. ## Examples -Let's employ a simple workspace with some data, code, ML models, pipeline -stages, such as the DVC project created in our [Get Started](/doc/get-started) -section. Then we can see what happens with `git commit` and `dvc commit` in -different situations. +Let's employ a simple workspace with some data, code, ML models, +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`git commit` and `dvc commit` in different situations.
diff --git a/static/docs/commands-reference/config.md b/static/docs/commands-reference/config.md index c1771c834b..c6b6fce4aa 100644 --- a/static/docs/commands-reference/config.md +++ b/static/docs/commands-reference/config.md @@ -51,7 +51,7 @@ corresponding config file. ## Configuration sections These are the `name` parameters that can be used with `dvc config`, or the -sections in the project config file (`.dvc/config`). +sections in the DVC project config file (`.dvc/config`). ### core @@ -98,21 +98,21 @@ for more details.) > option, properly transforming paths relative to the current working > directory into paths relative to the config file location. -- `cache.protected` - makes files in the workspace read-only. Possible values - are `true` or `false` (default). Run `dvc checkout` for the change go into - effect. (It affects only files that are under DVC control.) +- `cache.protected` - make files under DVC control read-only. Possible values + are `true` or `false` (default). Run `dvc checkout` for the change to go into + effect. Due to the way DVC handles linking between the data files in the cache and - their counterparts in the workspace, it's easy to accidentally corrupt the - cached version of a file by editing or overwriting it. Turning this config - option on forces you to run `dvc unprotect` before updating a file, providing - an additional layer of security to your data. + their counterparts in the workspace, it's easy to accidentally + corrupt the cached version of a file by editing or overwriting it. Turning + this config option on forces you to run `dvc unprotect` before updating a + file, providing an additional layer of security to your data. It's highly recommended to enable this mod when `cache.type` is set to `hardlink` or `symlink`. - `cache.type` - link type that DVC should use to link data files from cache to - your workspace. Possible values: `reflink`, `symlink`, `hardlink`, `copy` or a + the workspace. 
Possible values: `reflink`, `symlink`, `hardlink`, `copy` or a combination of those, separated by commas e.g: `reflink,hardlink,copy`. By default, DVC will try `reflink,copy` link types in order to choose the most @@ -188,7 +188,7 @@ Set the `dvc` log level to `debug`: $ dvc config core.loglevel debug ``` -Add an S3 remote and set it as the project default: +Add an S3 remote and set it as the project default: > **Note!** Before adding a new remote be sure to login into AWS services and > follow instructions at diff --git a/static/docs/commands-reference/destroy.md b/static/docs/commands-reference/destroy.md index 570f81b7e3..bd35dbe1fe 100644 --- a/static/docs/commands-reference/destroy.md +++ b/static/docs/commands-reference/destroy.md @@ -12,13 +12,13 @@ usage: dvc destroy [-h] [-q | -v] [-f] ## Description -It removes DVC-files, and the entire `.dvc/` meta directory from the current -workspace. Note that the DVC cache will normally be removed as -well, unless it's set to an external location with `dvc cache dir`. (By default -a local cache is located in the `.dvc/cache` directory.) If you were using -[symlinks for linking data](/doc/user-guide/large-dataset-optimization) from the -cache, DVC will replace them with copies, so that your data is intact after the -DVC repository destruction. +`dvc destroy` removes DVC-files, and the entire `.dvc/` meta directory from the +workspace. Note that the DVC cache will normally be +removed as well, unless it's set to an external location with `dvc cache dir`. +(By default a local cache is located in the `.dvc/cache` directory.) If you were +using [symlinks for linking data](/doc/user-guide/large-dataset-optimization) +from the cache, DVC will replace them with copies, so that your data is intact +after the DVC repository destruction. ## Options @@ -64,7 +64,7 @@ $ dvc cache dir /mnt/cache $ dvc add foo ``` -`dvc cache dir` changed the location of cache storage to exernal location. 
+`dvc cache dir` changed the location of cache storage to external location. Content of DVC repository: ```dvc @@ -96,8 +96,8 @@ yes ``` `dvc destroy` command removed DVC-files, and the entire `.dvc/` meta directory -from the workspace. But the cache files that are present in the `/mnt/cache` -directory still persist: +from the workspace. But the cache files that are present in the +`/mnt/cache` directory still persist: ```dvc $ tree /mnt/cache diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index 373d5ac495..175970fa09 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -19,10 +19,10 @@ positional arguments: ## Description The `dvc fetch` command is a means to download files from remote storage into -the local cache, but not directly into the workspace. This makes the data files -available for linking (or copying) into the workspace. (Refer to -[dvc config cache.type](/doc/commands-reference/config#cache).) Along with -`dvc checkout`, it's performed automatically by `dvc pull` when the target +the local cache, but without placing them in the workspace. This +makes the data files available for linking (or copying) into the workspace. +(Refer to [dvc config cache.type](/doc/commands-reference/config#cache).) Along +with `dvc checkout`, it's performed automatically by `dvc pull` when the target [DVC-files](/doc/user-guide/dvc-file-format) are not already in the local cache: ``` @@ -42,11 +42,12 @@ local cache ++ | dvc pull | workspace ``` -Fetching could be useful when first checking out an existing DVC project, since -files under DVC control could already exist in remote storage, but won't be in -your local cache. (Refer to `dvc remote` for more information on DVC remotes.) 
-These necessary data or model files are listed as dependencies or outputs in a -DVC-file (target [stage](/doc/commands-reference/run)) so they are required to +Fetching could be useful when first checking out an existing DVC +project, since files under DVC control could already exist in remote +storage, but won't be in your local cache. (Refer to `dvc remote` for more +information on DVC remotes.) These necessary data or model files are listed as +dependencies or outputs in a DVC-file (target +[stage](/doc/commands-reference/run)) so they are required to [reproduce](/doc/get-started/reproduce) the corresponding [pipeline](/doc/get-started/pipeline). (See [DVC-File Format](/doc/user-guide/dvc-file-format) for more information on @@ -67,7 +68,7 @@ which the set of files to push/fetch/pull is determined begins with calculating the checksums of the files in question, when these are [added](/doc/get-started/add-files) to DVC. File checksums are then stored in the corresponding DVC-files (usually saved in a Git branch). Only the checksums -specified in DVC-files currently in the workspace are considered by `dvc fetch` +specified in DVC-files currently in the project are considered by `dvc fetch` (unless the `-a` or `-T` options are used). ## Options @@ -114,8 +115,9 @@ specified in DVC-files currently in the workspace are considered by `dvc fetch` ## Examples -Let's employ a simple workspace with some data, code, ML models, pipeline -stages, as well as a few Git tags, such as our +Let's employ a simple workspace with some data, code, ML models, +pipeline stages, as well as a few Git tags, such as the DVC project +created in our [get started example repo](https://github.com/iterative/example-get-started). Then we can see what happens with `dvc fetch` as we switch from tag to tag. @@ -132,7 +134,7 @@ $ cd example-get-started
-The workspace looks almost like in this +The workspace `tree` looks almost like in this [pipeline setup](/doc/get-started/example-pipeline): ```dvc @@ -196,8 +198,10 @@ by all DVC-files in the current branch, including for directories. The checksums `3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and `data/features/` directory, respectively. +Let's link files from local cache to the workspace with: + ```dvc -$ dvc checkout <- links files from local cache to workspace +$ dvc checkout Checking out '{'scheme': 'local', 'path': '.../example-get-started/model.pkl'}' with cache '3863d0e317dee0a55c4e59d2ec0eef33'. Checking out '{'scheme': 'local', 'path': '.../example-get-started/data/... ``` diff --git a/static/docs/commands-reference/gc.md b/static/docs/commands-reference/gc.md index 92bbf6b503..6e983b1fdb 100644 --- a/static/docs/commands-reference/gc.md +++ b/static/docs/commands-reference/gc.md @@ -69,7 +69,7 @@ $ du -sh .dvc/cache/ ``` When you run `dvc gc` it removes all objects from cache that are not referenced -in the workspace (by collecting hash sums from the DVC-files): +in the workspace (by collecting hash sums from the DVC-files): ```dvc $ dvc gc diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index e946bcc429..67040dfcbb 100644 --- a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -117,8 +117,8 @@ downloaded file or directory from the external data source. ## Examples -To illustrate these examples we will be using the project explained in the -[Get Started](/doc/get-started) section. +To illustrate these examples we will be using the project explained +in the [Get Started](/doc/get-started) section.
@@ -139,15 +139,15 @@ $ git checkout 2-remote $ mkdir data ``` -You should now have a blank workspace, just before the +You should now have a blank workspace, just before the [Add Files](/doc/get-started/add-files) step.
## Example: Tracking a remote file -An advanced alternate to initialize the _Get Started_ workspace, is using -`dvc import-url`: +An advanced alternate to [Add Files](/doc/get-started/add-files) step of the +_Get Started_ section is to use `dvc import-url`: ```dvc $ dvc import-url https://dvc.org/s3/get-started/data.xml data/data.xml @@ -200,19 +200,18 @@ updated data source. A [pipeline](/doc/commands-reference/pipeline) can be triggered to re-execute based on a changed external dependency. Let's use the [Get Started](/doc/get-started) project again, simulating an -updated external data source. (Remember to prepare the sample project as -explained in [Examples](#examples)) +updated external data source. (Remember to prepare the workspace, +as explained in [Examples](#examples)) To make it easy to experiment with this, let's use a local machine directory -(external to the sample DVC project) to simulate a remote data source location. -(In real life, the data file will probably be on a remote server.) Run these -commands: +(external to the workspace) to simulate a remote data source location. (In real +life, the data file will probably be on a remote server.) Run these commands: ```dvc $ mkdir /tmp/dvc-import-url-example $ cd /tmp/dvc-import-url-example/ $ wget https://dvc.org/s3/get-started/data.xml -$ cd - # to go back to the Get Started project +$ cd - # to go back to the project ``` In a production system, you might have a process to update data files. That's diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index 16d695c941..aa2214e9e6 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -23,14 +23,15 @@ positional arguments: ## Description DVC provides an easy way to reuse datasets, intermediate results, ML models, or -other files and directories tracked in another DVC repository into the present -workspace. 
The `dvc import` command downloads such a data -artifact in a way that it is tracked with DVC, so it can be updated when -the external data source changes. - -The `url` argument specifies the external DVC project's Git repository URL (both -HTTP and SSH protocols supported, e.g. `[user@]server:project.git`), while -`path` is used to specify the path to the data to be downloaded within the repo. +other files and directories tracked in another DVC repository into the +workspace. The `dvc import` command downloads such a data artifact +in a way that it is tracked with DVC, so it can be updated when the external +data source changes. + +The `url` argument specifies the Git repository URL of the external DVC +project (both HTTP and SSH protocols are supported, e.g. +`[user@]server:project.git`), while `path` is used to specify the path to the +data to be downloaded within the repo. > See `dvc import-url` to download and tack data from other supported URLs. diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 059b88099e..e19bb5115f 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -17,20 +17,20 @@ automatically. Namely: -**Checkout**: For any given SCM branch or tag, Git checks out the +**Checkout**: For any given branch or tag, Git checks out the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The DVC-files in turn refer to data files in the DVC cache by checksum. When switching from one SCM branch or tag to another, the SCM retrieves the -corresponding DVC-files. By default that leaves the workspace in a state where -the DVC-files refer to data files other than what is currently in the workspace. -The user at this point should run `dvc checkout` so that the data files will -match the current DVC-files. +corresponding DVC-files. 
By default that leaves the project in a +state where the DVC-files refer to data files other than what is currently in +the workspace. The user at this point should run `dvc checkout` so +that the data files will match the current DVC-files. The installed Git hook automates running `dvc checkout`. **Commit**: When committing a change to the Git repository, that change possibly requires reproducing the corresponding [pipeline](/doc/get-started/pipeline) -(with `dvc repro`) to regenerate the workspace results. Or there might be files +(with `dvc repro`) to regenerate the project results. Or there might be files not yet in the cache, which is a reminder to run `dvc commit`. The installed Git hook automates reminding the user to run either `dvc repro` or @@ -45,11 +45,11 @@ The installed Git hook automates executing `dvc push`. ## Installed Git hooks -- Git `pre-commit` hook executes `dvc status` before `git commit` to inform the - user about the workspace status; -- Git `post-checkout` hook executes `dvc checkout` after `git checkout` to - automatically synchronize the data files with the new workspace state; -- Git `pre-push` hook executes `dvc push` before `git push` to upload files and +- A `pre-commit` hook executes `dvc status` before `git commit` to inform the + user about the differences between cache and workspace. +- A `post-checkout` hook executes `dvc checkout` after `git checkout` to + automatically synchronize the data files with the new workspace state. +- A `pre-push` hook executes `dvc push` before `git push` to upload files and directories under DVC control to remote. For more information about git hooks, refer to the @@ -85,37 +85,33 @@ and edit it._ ## Examples -To explore `dvc install` let's consider a simple pipeline with several stages: -the example workspace used in the [Get Started](/doc/get-started) section. 
+Let's employ a simple workspace with some data, code, ML models, +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`dvc install` in different situations.
### Click and expand to setup the project -This step is optional, and you can run it only if you want to run this examples -in your environment. First, you need to download the project: +Start by cloning our sample repo if you don't already have it: ```dvc $ git clone https://github.com/iterative/example-get-started +$ cd example-get-started ``` -Second, let's install the requirements. But before we do that, we **strongly** -recommend creating a virtual environment with -[virtualenv](https://virtualenv.pypa.io/en/stable/) or a similar tool: +Now let's install the requirements. But before we do that, we **strongly** +recommend creating a virtual environment with a tool such as +[virtualenv](https://virtualenv.pypa.io/en/stable/): ```dvc -$ cd example-get-started $ virtualenv -p python3 .env $ source .env/bin/activate -``` - -Now, we can install requirements for the project: - -```dvc $ pip install -r requirements.txt ``` -Then download the precomputed data using: +Download the precomputed data using: ```dvc $ dvc pull --all-branches --all-tags @@ -128,12 +124,12 @@ This data will be retrieved from a preconfigured remote cache. ## Example: Checkout both DVC and Git Let's start our exploration with the impact of `dvc install` on the -`dvc checkout` command. Remember that switching from one SCM tag or branch to -another changes the set of DVC-files in the workspace, which then also changes -the data files that should be in the workspace. +`dvc checkout` command. Remember that switching from one Git version to another +(with `git checkout`) changes the set of DVC-files in the project, which then +also changes the data files that should be placed in the workspace (with +`dvc checkout`). 
-With the _Get Started_ project described above, let's first list the available -tags: +Let's first list the available tags in the _Get Started_ project: ```dvc $ git tag @@ -152,9 +148,9 @@ baseline-experiment bigrams-experiment ``` -These tags are used to mark points in the development of this workspace, and to -document specific experiments conducted in the workspace. To take a look at one -we checkout the workspace using the SCM (in this case Git): +These tags are used to mark points in the development of the project, and to +document specific experiments conducted in it. To take a look at one, we +checkout the `6-featurization` tag: ```dvc $ git checkout 6-featurization @@ -178,16 +174,16 @@ Data and pipelines are up to date. ``` After running `git checkout` we are also shown a message saying _You are in -'detached HEAD' state_, and the Git documentation explains what that means. -Bottom line is returning the workspace to a normal state requires the command -`git checkout master`. +'detached HEAD' state_. Returning the workspace to a normal state requires +running `git checkout master`. -We also see that `dvc status` tells us about differences between the workspace -and the data files currently in the workspace. Git changed the DVC-files in the -workspace, which changed references to data files. What `dvc status` did is -inform us the data files in the workspace no longer matched the checksums in the -DVC-files. Running `dvc checkout` then checks out the corresponding data files, -and now `dvc status` tells us the data files match the DVC-files. +We also see that the first `dvc status` tells us about differences between the +project cache and the data files currently in the workspace. Git +changed the DVC-files in the workspace, which changed references to data files. +What `dvc status` did is inform us the data files in the workspace no longer +matched the checksums in the [DVC-files](/doc/user-guide/dvc-file-format). 
+Running `dvc checkout` then checks out the corresponding data files, and a +second `dvc status` now tells us the data files match the DVC-files. ```dvc $ git checkout master @@ -201,7 +197,7 @@ $ dvc checkout We've seen the default behavior with there being no Git hooks installed. We want to see how the behavior changes after installing the Git hooks. We must first -reset the workspace to he at the head commit before installing the hooks. +reset the workspace to the `HEAD` commit before installing the hooks. ```dvc $ dvc install @@ -242,7 +238,8 @@ matching what is referenced by the DVC-files. The other hook installed by `dvc install` runs before `git commit` operation. To see see what that does, start with the same workspace, making sure it is not in -the detached HEAD state from the previous example. +the _detached HEAD_ state from the previous example by first running +`git checkout master`. If we simply edit one of the code files: diff --git a/static/docs/commands-reference/metrics/modify.md b/static/docs/commands-reference/metrics/modify.md index 038f892c9f..82cd5ab817 100644 --- a/static/docs/commands-reference/metrics/modify.md +++ b/static/docs/commands-reference/metrics/modify.md @@ -19,8 +19,8 @@ for the metric file `path` provided (the one that specifies the file path in question among its outputs – see `dvc metrics add` or `dvc run` with `-m` and `-M` options), and updates the information that represents the metric. 
-If the path provided is not defined in a workspace DVC-file, the following error -will be raised: +If the path provided is not defined in a workspace DVC-file, the +following error will be raised: ```dvc ERROR: failed to modify metric file settings - diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index eb27f266ca..e5d065a11a 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -1,9 +1,9 @@ # pull Downloads missing files and directories from -[remote storage](/doc/commands-reference/remote) to the local cache based on -[DVC-files](/doc/user-guide/dvc-file-format) in the workspace, then links the -downloaded files into the workspace. +[remote storage](/doc/commands-reference/remote) to the local cache +based on [DVC-files](/doc/user-guide/dvc-file-format) in the +workspace, then links the downloaded files into the workspace. ## Synopsis @@ -37,14 +37,15 @@ configured with the `core.config` config option, is used. See `dvc remote`, on how to configure a remote. With no arguments, just `dvc pull` or `dvc pull --remote REMOTE`, it downloads -only the files (or directories) missing from the local repository to the project -by searching all [DVC-files](/doc/user-guide/dvc-file-format) in the current -version. It will not download files associated with earlier versions or branches -of the project directory, nor will it download files which have not changed. +only the files (or directories) missing from the workspace by searching all +[DVC-files](/doc/user-guide/dvc-file-format) currently in the +project. It will not download files associated with earlier +versions or branches of the repository if using Git, nor will it download files +which have not changed. The command `dvc status -c` can list files that are missing in the local cache -and are referenced in the current workspace. It can be used to see what files -`dvc pull` would download. 
+but referenced in the current project DVC-files. It can be used to see what +files `dvc pull` would download. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies @@ -52,7 +53,7 @@ backward from the target [stage](/doc/commands-reference/run) file(s), through the corresponding [pipeline(s)](/doc/get-started/pipeline), to find data files to pull. -After data file is in cache DVC, `dvc pull` uses OS-specific mechanisms like +After a data file is in cache, `dvc pull` can use OS-specific mechanisms like reflinks or hardlinks to put it in the workspace without copying. See `dvc checkout` for more details. @@ -61,14 +62,13 @@ reflinks or hardlinks to put it in the workspace without copying. See - `-r REMOTE`, `--remote REMOTE` specifies which remote cache (see `dvc remote list`) to pull from. The value for `REMOTE` is a cache name defined using the `dvc remote` command. If no `REMOTE` is given, or if no - remote's are defined in the workspace, an error message is printed. If the + remote's are defined in the project, an error message is printed. If the option is not specified, then the default remote, configured with the `core.config` config option, is used. -- `-a`, `--all-branches` - determines the files to download by examining files - associated with all branches of the DVC-files in the project directory. It's - useful if branches are used to track "checkpoints" of an experiment or - project. +- `-a`, `--all-branches` - determines the files to download by examining + DVC-files in all branches of the project repository (if using Git). It's + useful if branches are used to track checkpoints of an experiment or project. - `-T`, `--all-tags` - the same as `-a`, `--all-branches` but tags are used to save different experiments or project checkpoints. @@ -84,9 +84,9 @@ reflinks or hardlinks to put it in the workspace without copying. 
See each target directory and its subdirectories for DVC-files to inspect. - `-f`, `--force` - does not prompt when removing workspace files, which occurs - during the process of updating the workspace. This option surfaces behavior - from the `dvc checkout` command because `dvc pull` in effect performs a - _checkout_ after downloading files. + when these file no longer match the DVC-file references. This option surfaces + behavior from the `dvc fetch` and `dvc checkout` commands because `dvc pull` + in effect performs those 2 functions in a single command. - `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously while downloading files from the remote cache. The effect is to control the @@ -103,10 +103,11 @@ reflinks or hardlinks to put it in the workspace without copying. See ## Examples -For using the `dvc pull` command, remote storage must be defined. For an -existing project a remote is usually defined and you can use `dvc remote list` -to check existing remotes. Just to remind how it is done and set a context for -the example, let's define an SSH remote with the `dvc remote add` command: +For using the `dvc pull` command, remote storage must be defined. (See +`dvc remote`.) For an existing project, remotes are usually already set up and +you can use `dvc remote list` to check them. Just to remind how it is done and +set a context for the example, let's define an SSH remote with the +`dvc remote add` command: ```dvc $ dvc remote add r1 ssh://_username_@_host_/path/to/dvc/cache/directory diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index 1aec92bd61..aac2086487 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -36,9 +36,9 @@ Under the hood a few actions are taken: DVC-files to consult. - For each output referenced from each selected DVC-files, it finds a - corresponding entry in the local cache. 
DVC checks if the entry exists, or - not, in the remote simply by looking for it using the checksum. From this DVC - gathers a list of files missing from the remote storage. + corresponding entry in the local cache. DVC checks if the entry + exists, or not, in the remote simply by looking for it using the checksum. + From this DVC gathers a list of files missing from the remote storage. - Upload the cache files missing from the remote cache, if any, to the remote. @@ -52,11 +52,11 @@ configure a remote. With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads only the files (or directories) that are new in the local repository to the remote cache. It will not upload files associated with earlier versions or -branches of the project directory, nor will it upload files which have not -changed. +branches of the project directory, nor will it upload files which +have not changed. The command `dvc status -c` can list files that are new in the local cache and -are referenced in the current workspace. It can be used to see what files +are referenced in the workspace. It can be used to see what files `dvc push` would upload. The `dvc status -c` command can show files which exist in the remote cache and @@ -74,14 +74,13 @@ to push. - `-r REMOTE`, `--remote REMOTE` specifies which remote cache (see `dvc remote list`) to push to. The value for `REMOTE` is a cache name defined using the `dvc remote` command. If no `REMOTE` is given, or if no remote's are - defined in the workspace, an error message is printed. If the option is not + defined in the project, an error message is printed. If the option is not specified, then the default remote, configured with the `core.config` config option, is used. -- `-a`, `--all-branches` - determines the files to upload by examining files - associated with all branches of the DVC-files in the project directory. It's - useful if branches are used to track "checkpoints" of an experiment or - project. 
+- `-a`, `--all-branches` - determines the files to upload by examining DVC-files + in all branches of the project repository (if using Git). It's useful if + branches are used to track checkpoints of an experiment or project. - `-T`, `--all-tags` - the same as `-a`, `--all-branches` but tags are used to save different experiments or project checkpoints. @@ -111,10 +110,11 @@ to push. ## Examples -For using the `dvc push` command, remote storage must be defined. For an -existing project a remote is usually defined and you can use `dvc remote list` -to check existing remotes. Just to remind how it is done and set a context for -the example, let's define an SSH remote with the `dvc remote add` command: +For using the `dvc push` command, remote storage must be defined. (See +`dvc remote`.) For an existing project, remotes are usually already +set up and you can use `dvc remote list` to check them. Just to remind how it is +done and set a context for the example, let's define an SSH remote with the +`dvc remote add` command: ```dvc $ dvc remote add r1 ssh://_username_@_host_/path/to/dvc/cache/directory @@ -212,11 +212,9 @@ double check that all data had been uploaded. ## Example: What happens in the cache Let's take a detailed look at what happens to the DVC cache as you run an -experiment in a local workspace and push data to a remote cache. To set the -example consider having created a workspace that contains some code and data, -and having created a remote cache. In this section we'll show the cache of a -very simple project, but the details of this project doesn't matter so much as -what happens in the caches as data is pushed. +experiment locally and push data to a remote cache. To set the example consider +having created a workspace that contains some code and data, and +having set up a remote cache. Some work has been performed in the local workspace, and it contains new data to upload to the shared remote cache. 
When running `dvc status --cloud` the report diff --git a/static/docs/commands-reference/remove.md b/static/docs/commands-reference/remove.md index e1879b6c78..5682851fb6 100644 --- a/static/docs/commands-reference/remove.md +++ b/static/docs/commands-reference/remove.md @@ -1,15 +1,8 @@ # remove -Remove data file or data directory. +Properly remove data files or directories tracked by DVC. -This command safely removes data files or directories that are tracked by DVC -from your _workspace_. It takes a [DVC-File](/doc/user-guide/dvc-file-format) as -input, removes all of its outputs (`outs`), and optionally removes the file -itself. - -Note that it does not remove files from the DVC cache or remote storage (see -`dvc gc`). However, remember to run `dvc push` to save the files you actually -want to use or share in the future. +## Synopsis ```usage usage: dvc remove [-h] [-q | -v] [-o | -p] [-f] targets [targets ...] @@ -19,6 +12,17 @@ positional arguments: DVC-files in the workspace by default.) ``` +## Description + +This command safely removes data files or directories that are tracked by DVC +from the workspace. It takes a +[DVC-File](/doc/user-guide/dvc-file-format) as input, removes all of its outputs +(`outs`), and optionally removes the DVC-file itself. + +Note that it does not remove files from the DVC cache or remote storage (see +`dvc gc`). However, remember to run `dvc push` to save the files you actually +want to use or share in the future. + Refer to [Update Tracked Files](/doc/user-guide/update-tracked-file) to see how it can be used to replace or modify files that are under DVC control. diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index ea9da5b839..e76eb782bb 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -20,8 +20,8 @@ positional arguments: `dvc repro` provides an interface to run the commands in a computational graph (a.k.a. 
pipeline) again, as defined in the stage files (DVC-files) found in the -workspace. (A pipeline is typically defined using the `dvc run` command, while -data input nodes are defined by the `dvc add` command.) +project. (A pipeline is typically defined using the `dvc run` +command, while data input nodes are defined by the `dvc add` command.) There's a few ways to restrict the stages that will be run again by this command: by specifying stage file(s) as `targets`, or by using the @@ -36,8 +36,8 @@ corresponding commands again. `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, intermediate or final results. It saves all the data files, intermediate -or final results into the DVC cache (unless `--no-commit` option is specified), -and updates stage files with the new checksum information. +or final results into the DVC cache (unless `--no-commit` option is +specified), and updates stage files with the new checksum information. ## Options @@ -49,7 +49,7 @@ and updates stage files with the new checksum information. recursive search for changed dependencies. Multiple stages are run (non-recursively) if multiple stage files are given as `targets`. -- `-c`, `--cwd` - directory within your project to reproduce from. If no +- `-c`, `--cwd` - directory within the project to reproduce from. If no `targets` are given, it attempts to use `Dvcfile` in the specified directory. Instead of using `--cwd`, one can alternately specify a target in a subdirectory as `path/to/target.dvc`. This option can be useful for example @@ -63,8 +63,8 @@ and updates stage files with the new checksum information. inspect. - `--no-commit` - do not save outputs to cache. Useful when running different - experiments and you don't want to fill up your cache with temporary files. Use - `dvc commit` when you are ready to save your results to cache. + experiments and you don't want to fill up the cache with temporary files. 
Use + `dvc commit` when ready to save results to cache. - `-m`, `--metrics` - show metrics after reproduction. The target pipeline(s) must have at least one metrics file defined either with the `dvc metrics` diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 395acf8519..90d6987c45 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -18,11 +18,11 @@ positional arguments: ## Description `dvc status` searches for changes in the existing pipeline(s), either showing -which [stages](/doc/commands-reference/run) have changed in the local workspace -and must be reproduced (with `dvc repro`), or differences between the local -cache and remote cache (meaning `dvc push` or `dvc pull` should be run to -synchronize them). The two modes, _local_ and _cloud_ are triggered by using the -`--cloud` or `--remote` options: +which [stages](/doc/commands-reference/run) have changed in the +workspace and must be reproduced (with `dvc repro`), or differences +between local vs. remote cache (meaning `dvc push` or `dvc pull` +should be run to synchronize them). The two modes, _local_ and _cloud_ are +triggered by using the `--cloud` or `--remote` options: | Mode | CLI Option | Description | | ------ | ---------- | ----------------------------------------------------------------------------------------------------------------------------- | @@ -31,9 +31,9 @@ synchronize them). The two modes, _local_ and _cloud_ are triggered by using the | remote | `--cloud` | Comparisons are made between the local cache, and the default remote, defined with `dvc remote --default` command. | DVC determines data and code files to compare by analyzing all -[DVC-files](/doc/user-guide/dvc-file-format) in the workspace (`--all-branches` -and `--all-tags` in the `cloud` mode compare multiple workspaces - across all -branches or tags). 
The comparison can be limited to specific DVC-files by +[DVC-files](/doc/user-guide/dvc-file-format) in the project +(`--all-branches` and `--all-tags` in the `cloud` mode compare multiple +workspace versions). The comparison can be limited to specific DVC-files by listing them as `targets`. Changes are reported only against the given `targets`. When combined with the `--with-deps` option, a search is made for changes in other stages that affect the target. @@ -87,10 +87,10 @@ outputs described in it. - _deleted_ means the file doesn't exist in the local cache, but exists in the remote cache -For either the _new_ and _deleted_ cases, the local cache (subset of it, that is -determined by the active workspace) is different from the remote cache. Bringing -the two into sync requires `dvc pull` or `dvc push` to synchronize the DVC -cache. For the typical process to update workspaces, see +For either the _new_ and _deleted_ cases, the local cache (subset of it +determined by the current workspace) is different from the remote cache. +Bringing the two into sync requires `dvc pull` or `dvc push` to synchronize the +DVC cache. For the typical process to update the workspace, see [Share Data And Model Files](/doc/use-cases/share-data-and-model-files). ## Options @@ -112,14 +112,14 @@ cache. For the typical process to update workspaces, see name defined using the `dvc remote` command. Implies `--cloud`. - `-a`, `--all-branches` - compares cache content against all Git branches. - Instead of checking just the workspace, it runs the same status command in all - the branches of this repo. The corresponding branches are shown in the status - output. Applies only if `--cloud` or a remote is specified. - -- `-T`, `--all-tags` - compares cache content against all Git tags. Both the - `--all-branches` and `--all-tags` options cause DVC to check more than just - the workspace. The corresponding tags are shown in the status output. Applies - only if `--cloud` or a remote is specified. 
+ Instead of checking just the current workspace version, it runs the same + status command in all the branches of this repo. The corresponding branches + are shown in the status output. Applies only if `--cloud` or a `-r` remote is + specified. + +- `-T`, `--all-tags` - compares cache content against all Git tags instead of + checking just the current workspace version. The corresponding tags are shown + in the status output. Applies only if `--cloud` or a `-r` remote is specified. - `-j JOBS`, `--jobs JOBS` - specifies the number of jobs DVC can use to retrieve information from remote servers. This only applies when the `--cloud` diff --git a/static/docs/commands-reference/unprotect.md b/static/docs/commands-reference/unprotect.md index 9e2f2dd9dc..ab37d027d0 100644 --- a/static/docs/commands-reference/unprotect.md +++ b/static/docs/commands-reference/unprotect.md @@ -1,7 +1,7 @@ # unprotect -Unprotect tracked files or directories (when the cache protected mode has been -enabled with `dvc config cache`). +Unprotect tracked files or directories (when the cache protected +mode has been enabled with `dvc config cache`). ## Synopsis @@ -15,11 +15,13 @@ positional arguments: ## Description By default this command is not necessary, as DVC avoids hardlinks and symlinks -to link tracked data files in the workspace to the cache. However, these types -of file links can be enabled with `dvc config cache` (`cache.type` config -option). These link types also require the `cache.protected` mode to be turned -on, which makes the tracked data files in the workspace read-only to prevent -users from accidentally corrupting the cache by modifying them. +to link tracked data files from the cache to the workspace. +However, these types of file links can be enabled with `dvc config cache` +(`cache.type` config option). + +Enabling hardlinks or symlinks also requires the `cache.protected` mode to be +turned on, which makes the tracked data files in the workspace read-only. 
(This +prevent users from accidentally corrupting the cache by modifying file links.) Running `dvc unprotect` guarantees that the target files or directories (`targets`) in the workspace are physically "unlinked" from the cache and can be @@ -83,7 +85,8 @@ $ dvc unprotect Posts.xml.zip ``` Check that the file is writable now, the cached version is intact, and they are -not linked (the file in the workspace is a copy of the file): +not linked (the file in the workspace is a copy of the +cached file): ```dvc $ ls -lh diff --git a/static/docs/commands-reference/version.md b/static/docs/commands-reference/version.md index bd54d5802b..e868341181 100644 --- a/static/docs/commands-reference/version.md +++ b/static/docs/commands-reference/version.md @@ -14,26 +14,26 @@ usage: dvc version [-h] [-q | -v] Running the command `dvc version` outputs the following information about the system/environment: -| Type | Detail | -| ------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit hash in case of a development version) | -| `Python version` | Version of the Python being used for the project in which DVC is initialized | -| `Platform` | Information about the operating system of the machine | -| [`Binary`](#what-we-mean-by-binary) | Shows whether DVC was installed from a package or from a binary release | -| `Cache` | [Type of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) supported between the DVC workspace and the cache directory | -| `Filesystem type` | Shows the filesystem type (eg. ext4, FAT, etc.) and mount point of workspace and the cache directory | - -> If `dvc version` is executed outside a DVC workspace, the command outputs the -> filesystem type of the current working directory. 
+| Line | Detail | +| ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit hash in case of a development version) | +| `Python version` | Version of the Python being used on the environment in which DVC is initialized | +| `Platform` | Information about the operating system of the machine | +| [`Binary`](#what-we-mean-by-binary) | Shows whether DVC was installed from a package or from a binary release | +| `Cache` | [Type of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) supported between the workspace and the cache | +| `Filesystem type` | Shows the filesystem type (eg. ext4, FAT, etc.) and mount point of the cache and workspace directories | + +> If `dvc version` is executed outside a DVC project, no `Cache` is output and +> the `Filesystem type` output is of the current working directory. > **Note** that if you've installed dvc using pip, you will need to install > `psutil` by yourself with `pip install psutil` in order for `dvc version` to -> report fs information. Please see the original -> [issue on github](https://github.com/iterative/dvc/issues/2284) for more info. +> report file system information. Please see the original +> [issue on Github](https://github.com/iterative/dvc/issues/2284) for more info. #### Components of DVC version -The detail of DVC version depends upon the way of installing the project. +The detail of DVC version depends upon the way of installing DVC. 
- **Official release**: This [install guide](/doc/get-started/install) mentions ways to install DVC using the official package stored in Python Packaging @@ -100,7 +100,7 @@ The detail of `Binary` depends on the way DVC was downloading and Getting the DVC version and environment information: -Inside a DVC workspace: +Inside a DVC project: ```dvc $ dvc version @@ -114,7 +114,7 @@ Filesystem type (cache directory): ('ext4', '/dev/sdb3') Filesystem type (workspace): ('ext4', '/dev/sdb3') ``` -Outside a DVC workspace: +Outside a DVC project: ```dvc $ dvc version From a29b2a362de2202dc201cabccba79bb086813e44 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 10 Aug 2019 21:17:46 -0500 Subject: [PATCH 02/17] term: finish review of "workspace", and apply `` tags for #461 --- src/Documentation/glossary.js | 4 +-- static/docs/get-started/add-files.md | 14 ++++---- static/docs/get-started/example-pipeline.md | 13 +++---- static/docs/get-started/example-versioning.md | 20 ++++++----- static/docs/get-started/experiments.md | 7 ++-- static/docs/get-started/initialize.md | 4 +-- static/docs/get-started/retrieve-data.md | 9 ++--- static/docs/tutorial/define-ml-pipeline.md | 17 +++++---- .../understanding-dvc/related-technologies.md | 8 ++--- .../data-and-model-files-versioning.md | 12 +++---- ...ple-data-scientists-on-a-single-machine.md | 36 +++++++++---------- .../user-guide/dvc-files-and-directories.md | 19 +++++----- .../user-guide/large-dataset-optimization.md | 8 ++--- static/docs/user-guide/update-tracked-file.md | 6 ++-- 14 files changed, 91 insertions(+), 86 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index 3677018cc4..b072666ca2 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -16,7 +16,7 @@ code, ML models, etc. 
A workspace becomes a **DVC project** when }, { name: 'DVC Project', - match: ['DVC project', 'project'], + match: ['DVC project', 'project', 'projects'], desc: ` Initialized by running \`dvc init\` in the **workspace**. It will contain the \`.dvc/\` directory and [DVC-files](/doc/user-guide/dvc-file-format) created @@ -26,7 +26,7 @@ repository. }, { name: 'DVC Cache', - match: ['DVC cache', 'cache', 'cache directory', 'cached'], + match: ['DVC cache', 'cache', 'cache directory', 'data cache', 'cached'], desc: ` The DVC cache is a hidden storage (by default located in the \`.dvc/cache\` directory) for files that are under DVC control, and their different versions. diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index 9f71e2ce83..b09af1c6a7 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -44,9 +44,9 @@ the data as it evolves with the source code under Git control. ### Expand to learn about DVC internals -You can see that actual data file has been moved to the `.dvc/cache` directory, -while the entries in the workspace may be links to the actual files in the DVC -cache. +You can see that actual data file has been moved to the cache +directory, while the entries in the workspace may be file +links to the actual files in the DVC cache. ```dvc $ ls -R .dvc/cache @@ -85,10 +85,10 @@ and `dvc config cache` for more information.
-If your workspace uses Git, without DVC you would have to manually -put each data file or directory in into `.gitignore`. DVC commands that take or -make files that will go under its control automatically takes care of this for -you! (You just have to add the changes to Git.) +If your workspace uses Git, without DVC you would have to manually put each data +file or directory in into `.gitignore`. DVC commands that take or make files +that will go under its control automatically takes care of this for you! (You +just have to add the changes to Git.) Refer to [Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning), diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index b9bb70f1d6..ac166c75ec 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -64,9 +64,9 @@ $ pip install -r code/requirements.txt Next, we will create a pipeline step-by-step, utilizing the same set of commands that are described in earlier [get started](/doc/get-started) chapters. -> Note that its possible to define more than one pipeline in each project. This -> will be determined by the interdependencies between DVC-files, mentioned -> below. +> Note that its possible to define more than one pipeline in each DVC +> project. This will be determined by the interdependencies between +> DVC-files, mentioned below. Initialize DVC repository (run it inside your Git repository): @@ -112,9 +112,10 @@ of the data file itself. Actual data file `Posts.xml.zip` is linked into the `.dvc/cache` directory, under the `.dvc/cache/ce/68b98d82545628782c66192c96f2d2` name and is added to -`.gitignore`. Even if you remove it in the workspace, or checkout a different -branch/commit the data is not lost if a corresponding DVC-file is committed. -It's enough to run `dvc checkout` or `dvc pull` to restore data files. +`.gitignore`. 
Even if you remove it in the workspace, or checkout a +different branch/commit the data is not lost if a corresponding DVC-file is +committed. It's enough to run `dvc checkout` or `dvc pull` to restore data +files. diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 224eca5c70..08705554cb 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -68,9 +68,9 @@ $ pip install -r requirements.txt ### Expand to learn more about DVC internals The repository you cloned is already DVC-initialized. There should be a `.dvc/` -directory with `config`, `.gitignore` files and the `cache` directory. These -files and directories are hidden from users in general. Users don't interact -with these files directly. See +directory with `config`, `.gitignore` files and the cache +directory. These files and directories are hidden from users in general. +Users don't interact with these files directly. See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn more. @@ -133,10 +133,12 @@ $ dvc add data This command should be used instead of `git add` on files or directories that are too large to be put into Git. Usually, input datasets, models, some intermediate results, etc. It tells Git to ignore the directory and puts it into -the DVC cache (of course, it keeps a link to it in the workspace, so you can -continue working with it the same way as before). Instead, it creates a simple -human-readable [DVC-file](/doc/user-guide/dvc-file-format) that can be -considered as a pointer to the cache. +the DVC cache (while keeping a +[file link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +to it in the workspace, so you can continue working with it the +same way as before). Instead, it creates a simple human-readable +[DVC-file](/doc/user-guide/dvc-file-format) that can be considered as a pointer +to the cache. 
Next, we run the training with `python train.py`. We picked this example and datasets to be small enough to be run on your machine in a reasonable amount of @@ -238,11 +240,11 @@ An operation that helps to get the specific committed version of data is designed to be similar to Git. In Git (or any other code version control system) when you need to get to a previous committed version of the code you run `git checkout`. All we need to do in our case is to run additionally -`dvc checkout` to get the right data to the workspace. +`dvc checkout` to get the right data into the workspace. ![](/static/img/versioning.png) -There are two ways of doing this - a full workspace checkout or checkout of a +There are two ways of doing this: a full workspace checkout or checkout of a specific data or mode file. Let's consider the full checkout first. It's quite straightforward: diff --git a/static/docs/get-started/experiments.md b/static/docs/get-started/experiments.md index b83908c4e8..b4068e61e6 100644 --- a/static/docs/get-started/experiments.md +++ b/static/docs/get-started/experiments.md @@ -38,6 +38,7 @@ $ dvc checkout ``` DVC is designed to checkout large data files (no matter how large they are) into -your workspace instantly on almost all modern operating systems with file links. -See [Large Dataset Optimization](/docs/user-guide/large-dataset-optimization) -for more information. +your workspace instantly on almost all modern operating systems +with file links. See +[Large Dataset Optimization](/docs/user-guide/large-dataset-optimization) for +more information. diff --git a/static/docs/get-started/initialize.md b/static/docs/get-started/initialize.md index a937cefb59..d778778872 100644 --- a/static/docs/get-started/initialize.md +++ b/static/docs/get-started/initialize.md @@ -5,8 +5,8 @@ In order to start using DVC, you need first to initialize it in your control management system, but for the best experience we recommend using DVC on top of Git repositories. 
-If you don't have a directory for this project already, create it now with these -commands: +If you don't have a directory for this project already, create it +now with these commands: ```dvc $ mkdir example-get-started diff --git a/static/docs/get-started/retrieve-data.md b/static/docs/get-started/retrieve-data.md index 7288f84d20..1c35aad7f9 100644 --- a/static/docs/get-started/retrieve-data.md +++ b/static/docs/get-started/retrieve-data.md @@ -5,15 +5,16 @@ > [configuration](/doc/get-started/configure) are completed before you run the > `dvc pull` command in a newly cloned or initialized Git repository. -To retrieve data files to your local machine and your project's workspace run: +To retrieve data files into the workspace in your local machine, +run: ```dvc $ dvc pull ``` -This command retrieves data files that are referenced in _all_ -[DVC-files](/doc/user-guide/dvc-file-format) in the workspace. So, you usually -run it after `git clone`, `git pull`, or `git checkout`. +This command retrieves data files that are referenced in all +[DVC-files](/doc/user-guide/dvc-file-format) in the project. So, +you usually run it after `git clone`, `git pull`, or `git checkout`. As an easy way to test it: diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 346c5add5f..5d22bb11d3 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -81,8 +81,8 @@ $ du -sh .dvc/cache/ec/* > Outputs from DVC-files define the relationship between the data file path in a > repository and the path in a cache directory. -Keeping actual file content in a cache directory and a copy of the caches in -user workspace during `$ git checkout` is a regular trick that +Keeping actual file content in a cache directory and a copy of the caches in the +user workspace during `$ git checkout` is a regular trick that [Git-LFS](https://git-lfs.github.com/) (Git for Large File Storage) uses. 
This trick works fine for tracking small files with source code. For large data files, this might not be the best approach, because of _checkout_ operation for @@ -125,10 +125,9 @@ same. ## Running commands -Once data source files are in the workspace you can start processing the data -and train ML models out of the data files. DVC helps you to define steps of your -ML process and pipe them together into a ML -[pipeline](/doc/get-started/pipeline). +Once the data files are in the workspace, you can start processing the data and +train ML models out of the data files. DVC helps you to define steps of your ML +process and pipe them together into a ML [pipeline](/doc/get-started/pipeline). `dvc run` executes any command that you pass into it as a list of parameters. However, the command to run alone is not as interesting as its role within a @@ -190,9 +189,9 @@ and does some additional work if the command was successful: of the file names will be added to `.gitignore`. 2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage - file in the workspace with information about this pipeline stage. (See - [DVC-File Format](/doc/user-guide/dvc-file-format)). Note that the name of - this file could be specified by using the `-f` option, for example + file in the project with information about this pipeline stage. + (See [DVC-File Format](/doc/user-guide/dvc-file-format)). Note that the name + of this file could be specified by using the `-f` option, for example `-f extract.dvc`. Let's take a look at the resulting stage file created by `dvc run` above: diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 0ca460b1d8..371e5a77ea 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -37,8 +37,8 @@ process. 
- DVC has transparent design: [meta files and directories](/doc/user-guide/dvc-files-and-directories) - (including the data cache) have a human-readable format and can be easily - reused by external tools. + (including the data cache) have a human-readable format and + can be easily reused by external tools. 4. **Git workflows** and Git usage methodologies such as Gitflow. The differences are: @@ -116,8 +116,8 @@ process. - DVC attempts to use reflinks\* and has other [file linking options](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache). This way the `dvc checkout` command does not actually copy data files from - cache to the workspace, as copying files is a heavy operation for large - files (30 GB+). + cache to the workspace, as copying files is a heavy operation + for large files (30 GB+). - `git-lfs` was not made with data science scenarios in mind, so it does not provide related features (e.g. pipelines, metrics), and thus Github has a diff --git a/static/docs/use-cases/data-and-model-files-versioning.md b/static/docs/use-cases/data-and-model-files-versioning.md index 1ed3aa06c5..e6e4c3d29a 100644 --- a/static/docs/use-cases/data-and-model-files-versioning.md +++ b/static/docs/use-cases/data-and-model-files-versioning.md @@ -23,9 +23,9 @@ ML data artifacts like data files, models, etc. Unlike `git-lfs`, DVC doesn't require installing a server; it can be used on-premises (NAS, SSH, for example) or with any major cloud provider (S3, Google Cloud, Azure). -Let's say you already have a project that uses a bunch of images that are stored -in `images` directory and has a `model.pkl` file - your model file that is -deployed to production. +Let's say you already have a project that uses a bunch of images +stored in `images/` directory and has a `model.pkl` file - your model file that +is deployed to production. 
```dvc $ ls images @@ -79,9 +79,9 @@ $ git add .gitignore images.dvc model.pkl.dvc $ git commit -m "track images and models with dvc" ``` -There are two ways to get to the previous version of the dataset or model - a -full workspace checkout or checkout of a specific data or model file. Let's -consider the full checkout first. It's quite straightforward: +There are two ways to get to the previous version of the dataset or model: a +full workspace checkout or checkout of a specific data or model +file. Let's consider the full checkout first. It's quite straightforward: > `v1.0` is a Git tag that should be created in advance to identify the dataset > version you are interested in, it can be just a Git commit hash instead. diff --git a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md index ab95e84994..b727fceae6 100644 --- a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md +++ b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md @@ -7,15 +7,15 @@ such as ability to use multiple GPUs, store all your data in one place, etc. ![](/static/img/shared-server.png) With DVC, you can easily setup a shared data storage on the server that will -allow your team to share and store data for your projects as effectively as -possible and have a workspace restoration/switching speed as instant -as`git checkout` for your code. +allow your team to store and share data for your projects effectively, as well +as to have an instantaneous workspace restoration/switching speed – +similar to `git checkout` for your code. 
### Preparation -In order to make it work on a shared server, you need to setup a shared -cache location for your projects, so that every team member is -using the same cache storage: +In order to leverage DVC on a shared server, you need to setup a shared +cache location for your projects, so that every team +member is using the same cache storage: ```dvc $ mkdir -p /path/to/dvc-cache @@ -24,7 +24,7 @@ $ mkdir -p /path/to/dvc-cache You will have to make sure that the directory has proper permissions setup, so that every one on your team can read and write to it and can access cache files written by others. The most straightforward way to do that is to make sure that -you and your colleagues are members of the same group (e.g. 'users') and that +you and your colleagues are members of the same group (e.g. `users`) and that your shared cache directory is owned by that group and has respective permissions. @@ -33,17 +33,17 @@ permissions. This step is optional. You can skip it if you are setting up a new DVC repository and don't have your local cache stored in `.dvc/cache`. If you did work on your project with DVC previously and you wish to transfer your cache to -the shared cache directory (external to your workspace), you will -need to simply move it from an old cache location to the new one: +the shared cache directory (external to your workspace), you will need to simply +move it from an old cache location to the new one: ```dvc $ mv .dvc/cache/* /path/to/dvc-cache ``` -### Configure External Cache +### Configure External Cache (Optional) -Tell DVC to use the directory we've set up above as an shared cache location by -running: +This step is optional. 
Tell DVC to use the directory we've set up above as an +shared cache location by running: ```dvc $ dvc config cache.dir /path/to/dvc-cache @@ -58,9 +58,9 @@ $ git commit -m "dvc: setup external cache dir" ### Examples -You and your colleagues can work in your own workspaces as usual and DVC will -handle all your data in the most effective way possible. Let's say you are -cleaning up the data: +You and your colleagues can work in your own workspaces as usual +and DVC will handle all your data in the most effective way possible. Let's say +you are cleaning up the data: ```dvc $ dvc add raw @@ -71,8 +71,8 @@ $ git push ``` Your colleague can pull the code and have both `raw` and `clean` instantly -appear in his workspace without copying. After this he decides to continue -building this pipeline and process the cleaned up data: +appear in his workspace without copying anything. After this he decides to +continue building this pipeline and process the cleaned up data: ```dvc $ git pull @@ -83,7 +83,7 @@ $ git commit -m "process clean data" $ git push ``` -And now you can just as easily get his work appear in your workspace by: +And now you can just as easily make his work appear in your workspace by: ```dvc $ git pull diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 2c8aad4106..ace683d0e6 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -1,7 +1,7 @@ # DVC Files and Directories -Once initialized in a project, DVC populates its installation directory -(`.dvc/`) with special DVC internal files and directories: +Once initialized in a project, DVC populates its installation +directory (`.dvc/`) with special internal files and directories: - `.dvc/config` - this is a configuration file. The config file can be edited by hand or with a special command: `dvc config`. 
@@ -24,10 +24,10 @@ Once initialized in a project, DVC populates its installation directory > are needed to reproduce them. - `.dvc/state` - this file is used for optimization. It is a SQLite db, that - contains checksums for files in a project with respective timestamps and - inodes to avoid unnecessary checksum computations. It also contains a list of - links (from cache to workspace) created by DVC and - is used to cleanup your workspace when calling `dvc checkout`. + contains checksums for files tracked in a DVC project, with respective + timestamps and inodes to avoid unnecessary checksum computations. It also + contains a list of links (from cache to workspace) created by DVC + and is used to cleanup your workspace when calling `dvc checkout`. - `.dvc/state-journal` - temporary file for SQLite operations @@ -38,12 +38,13 @@ Once initialized in a project, DVC populates its installation directory - `.dvc/updater.lock` - a lock file for `.dvc/updater`. -- `.dvc/lock` - a lock file for the whole dvc project. +- `.dvc/lock` - a lock file for the whole DVC project. ## Structure of cache directory -There are two ways in which the data is stored in cache. It depends on if the -actual data is stored in a file (eg. `data.csv`) or it is a directory of files. +There are two ways in which the data is stored in cache. It depends +on if the actual data is stored in a file (eg. `data.csv`) or it is a directory +of files. We evaluate a checksum, usually MD5, for the data file which is a 32 characters long string. The first two characters are assigned to name the directory inside diff --git a/static/docs/user-guide/large-dataset-optimization.md b/static/docs/user-guide/large-dataset-optimization.md index e0de1368f4..f31dd156ea 100644 --- a/static/docs/user-guide/large-dataset-optimization.md +++ b/static/docs/user-guide/large-dataset-optimization.md @@ -9,10 +9,10 @@ details.) 
However, the versions of the tracked files that [match the current code](/doc/get-started/connect-code-and-data) are also needed -in the workspace, so a subset of the cached files must be kept in the working -directory (using `dvc checkout`). Does this mean that some files will be -duplicated between the workspace and the cache? **That would not be efficient!** -Especially with large files (several Gigabytes or larger). +in the workspace, so a subset of the cached files must be kept in +the working directory (using `dvc checkout`). Does this mean that some files +will be duplicated between the workspace and the cache? **That would not be +efficient!** Especially with large files (several Gigabytes or larger). In order to have the files present in both directories without duplication, DVC can automatically create **file links** in the workspace that "point" to the diff --git a/static/docs/user-guide/update-tracked-file.md b/static/docs/user-guide/update-tracked-file.md index c468c865af..61e21b8e4b 100644 --- a/static/docs/user-guide/update-tracked-file.md +++ b/static/docs/user-guide/update-tracked-file.md @@ -1,8 +1,8 @@ # Update a Tracked File -Due to the way DVC handles linking between the data files in the cache and their -counterparts in the workspace (refer to -[Large Dataset Optimization](/docs/user-guide/large-dataset-optimization)), +Due to the way DVC handles linking between the data files in the +cache and their counterparts in the workspace (refer +to [Large Dataset Optimization](/docs/user-guide/large-dataset-optimization)), updating tracked files has to be carried out with caution to avoid data corruption when the DVC config option `cache.type` is set to `hardlink` or/and `symlink`. (See `dvc config cache` for more details on setting the cache file From 2b96e0e28a270c49de1ce330b39439a7d9d0d3eb Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 10 Aug 2019 21:25:54 -0500 Subject: [PATCH 03/17] term: use abbreviations e.g. 
"doesn't" in autocomplete guide per https://github.com/iterative/dvc.org/pull/537#issuecomment-520149106 --- static/docs/user-guide/autocomplete.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/user-guide/autocomplete.md b/static/docs/user-guide/autocomplete.md index 63cf7dd5e4..f76c683a25 100644 --- a/static/docs/user-guide/autocomplete.md +++ b/static/docs/user-guide/autocomplete.md @@ -37,7 +37,7 @@ $ echo $0 /bin/bash ``` -In this case, follow the steps to configure Bash as it is your active shell. +In this case, follow these steps to configure Bash, as it is your active shell. ## Configure Bash @@ -77,7 +77,7 @@ completion. ### Click to expand if it doesn't work on Debian/Ubuntu -As mentioned above, it should work out of the box. But if it does not, try these +As mentioned above, it should work out of the box. But if it doesn't, try these steps: - Make sure that the package `bash-completion` is installed: @@ -86,7 +86,7 @@ steps: $ sudo apt install --reinstall bash-completion ``` -- Make sure that it is enabled. Edit `~/.bashrc` and make sure that these lines +- Make sure that it's enabled. 
Edit `~/.bashrc` and make sure that these lines are there: ```bash From b229540a83a38c37dd3cf9191697fd0d1c1a2d86 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 10 Aug 2019 21:33:05 -0500 Subject: [PATCH 04/17] gender neutral language --- .../multiple-data-scientists-on-a-single-machine.md | 6 +++--- static/docs/user-guide/contributing.md | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md index b727fceae6..c87cad55d5 100644 --- a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md +++ b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md @@ -70,8 +70,8 @@ $ git commit -m "cleanup raw data" $ git push ``` -Your colleague can pull the code and have both `raw` and `clean` instantly -appear in his workspace without copying anything. After this he decides to +Your colleagues can pull the code and have both `raw` and `clean` instantly +appear in their workspace without copying anything. After this they decide to continue building this pipeline and process the cleaned up data: ```dvc @@ -83,7 +83,7 @@ $ git commit -m "process clean data" $ git push ``` -And now you can just as easily make his work appear in your workspace by: +And now you can just as easily make their work appear in your workspace by: ```dvc $ git pull diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index f95b845ec0..bfdbd13497 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -153,8 +153,8 @@ $ source tests/remotes_env ``` If some member of your team had already went through all of this you may just -ask his `remotes_env` file and Google Cloud credentials and you can skip any -manipulations with `ENV` below. 
+ask for their `remotes_env` file and Google Cloud credentials and you can skip
+any manipulations with `ENV` below.
From ca78fb442755be0c947fe0ebfcab5022e6ee8ffb Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 10 Aug 2019 22:05:39 -0500 Subject: [PATCH 05/17] term: "pipeline" should mostly link to the cmd ref. --- static/docs/commands-reference/checkout.md | 2 +- static/docs/commands-reference/commit.md | 6 +++--- static/docs/commands-reference/fetch.md | 2 +- static/docs/commands-reference/install.md | 7 ++++--- static/docs/commands-reference/pull.md | 6 +++--- static/docs/commands-reference/push.md | 6 +++--- static/docs/commands-reference/repro.md | 2 +- static/docs/commands-reference/status.md | 2 +- static/docs/tutorial/define-ml-pipeline.md | 3 ++- static/docs/user-guide/dvc-file-format.md | 10 +++++----- static/docs/user-guide/dvcignore.md | 3 ++- 11 files changed, 26 insertions(+), 23 deletions(-) diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 943149825f..639fb1c92c 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -39,7 +39,7 @@ The execution of `dvc checkout` does: data files. The scanned DVC-files is limited by the listed `targets` (if any) on the command line. And if the `--with-deps` option is specified, it scans backward from the given `targets` in the corresponding - [pipeline](/doc/get-started/pipeline). + [pipeline](/doc/commands-reference/pipeline). - For any data files where the checksum doesn't match their DVC-file entry, the data file is restored from the cache. 
The link strategy used (`reflink`, diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index 06e5fc05c1..83547a87a2 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -18,9 +18,9 @@ positional arguments: The `dvc commit` command is useful for several scenarios where a dataset is being changed: when a [stage](/doc/commands-reference/run) or -[pipeline](/doc/get-started/pipeline) is in development, when one wishes to run -commands outside the control of DVC, or to force DVC-file updates to save time -tying stages or a pipeline. +[pipeline](/doc/commands-reference/pipeline) is in development, when one wishes +to run commands outside the control of DVC, or to force DVC-file updates to save +time tying stages or a pipeline. - Code or data for a stage is under active development, with rapid iteration of code, configuration, or data. Run DVC commands (`dvc run`, `dvc repro`, and diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index 175970fa09..939ce8d17c 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -49,7 +49,7 @@ information on DVC remotes.) These necessary data or model files are listed as dependencies or outputs in a DVC-file (target [stage](/doc/commands-reference/run)) so they are required to [reproduce](/doc/get-started/reproduce) the corresponding -[pipeline](/doc/get-started/pipeline). (See +[pipeline](/doc/commands-reference/pipeline). (See [DVC-File Format](/doc/user-guide/dvc-file-format) for more information on dependencies and outputs.) diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index e19bb5115f..b43ec5d9d9 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -29,9 +29,10 @@ that the data files will match the current DVC-files. 
The installed Git hook automates running `dvc checkout`. **Commit**: When committing a change to the Git repository, that change possibly -requires reproducing the corresponding [pipeline](/doc/get-started/pipeline) -(with `dvc repro`) to regenerate the project results. Or there might be files -not yet in the cache, which is a reminder to run `dvc commit`. +requires reproducing the corresponding +[pipeline](/doc/commands-reference/pipeline) (with `dvc repro`) to regenerate +the project results. Or there might be files not yet in the cache, which is a +reminder to run `dvc commit`. The installed Git hook automates reminding the user to run either `dvc repro` or `dvc commit`. diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index e5d065a11a..9cdf00fbe5 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -50,8 +50,8 @@ files `dvc pull` would download. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies backward from the target [stage](/doc/commands-reference/run) file(s), through -the corresponding [pipeline(s)](/doc/get-started/pipeline), to find data files -to pull. +the corresponding [pipeline(s)](/doc/commands-reference/pipeline), to find data +files to pull. After a data file is in cache, `dvc pull` can use OS-specific mechanisms like reflinks or hardlinks to put it in the workspace without copying. See @@ -145,7 +145,7 @@ default remote. The only files considered in this case are what is listed in the ## Example: With dependencies Demonstrating the `--with-deps` flag requires a larger example. 
First, assume a -[pipeline](/doc/get-started/pipeline) has been setup with these +[pipeline](/doc/commands-reference/pipeline) has been setup with these [stages](/doc/commands-reference/run): ```dvc diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index aac2086487..8f1b9dd915 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -66,8 +66,8 @@ remove nor modify those files in the remote cache. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies backward from the target [stage](/doc/commands-reference/run) file(s), through -the corresponding [pipeline(s)](/doc/get-started/pipeline), to find data files -to push. +the corresponding [pipeline(s)](/doc/commands-reference/pipeline), to find data +files to push. ## Options @@ -148,7 +148,7 @@ $ dvc push data.zip.dvc ## Example: With dependencies Demonstrating the `--with-deps` flag requires a larger example. First, assume a -[pipeline](/doc/get-started/pipeline) has been setup with these +[pipeline](/doc/commands-reference/pipeline) has been setup with these [stages](/doc/commands-reference/run): ```dvc diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index e76eb782bb..6ea13a4795 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -1,7 +1,7 @@ # repro Run again commands recorded in the [stages](/doc/commands-reference/run) of one -or more [pipelines](/doc/get-started/pipeline), in the correct order. The +or more [pipelines](/doc/commands-reference/pipeline), in the correct order. The commands to be run are determined by recursively analyzing target stages and changes in their dependencies. 
diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 90d6987c45..66931d2723 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -1,6 +1,6 @@ # status -Show changes in the [pipeline(s)](/doc/get-started/pipeline), as well as +Show changes in the [pipeline(s)](/doc/commands-reference/pipeline), as well as mismatches either between the local cache and local files, or between the local cache and remote cache. diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 5d22bb11d3..cd607b14eb 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -127,7 +127,8 @@ same. Once the data files are in the workspace, you can start processing the data and train ML models out of the data files. DVC helps you to define steps of your ML -process and pipe them together into a ML [pipeline](/doc/get-started/pipeline). +process and pipe them together into a ML +[pipeline](/doc/commands-reference/pipeline). `dvc run` executes any command that you pass into it as a list of parameters. However, the command to run alone is not as interesting as its role within a diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index 57ad6bcc6b..76eeec32fd 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -1,11 +1,11 @@ # DVC-File Format When you add a file (with `dvc add`) or a command (with `dvc run`) to a -[pipeline](/doc/get-started/pipeline), DVC creates a special text metafile with -the `.dvc` file extension (e.g. `process.dvc`), or with the default name -`Dvcfile`. DVC-files a.k.a. **stage files** contain all the needed information -to track your data and reproduce pipeline stages. The file itself contains a -simple YAML format that could be easily written or altered manually. 
+[pipeline](/doc/commands-reference/pipeline), DVC creates a special text +metafile with the `.dvc` file extension (e.g. `process.dvc`), or with the +default name `Dvcfile`. DVC-files a.k.a. **stage files** contain all the needed +information to track your data and reproduce pipeline stages. The file itself +contains a simple YAML format that could be easily written or altered manually. See the [Syntax Highlighting](/doc/user-guide/plugins) to learn how to enable the highlighting for your editor. diff --git a/static/docs/user-guide/dvcignore.md b/static/docs/user-guide/dvcignore.md index 5ae2e7c6d4..94c3c37c19 100644 --- a/static/docs/user-guide/dvcignore.md +++ b/static/docs/user-guide/dvcignore.md @@ -29,7 +29,8 @@ directories. **It is crucial to understand, that DVC might remove ignored files upon `dvc run` or `dvc repro`. If they are not produced by a -[pipeline](/doc/get-started/pipeline) step, they can be deleted permanently.** +[pipeline](/doc/commands-reference/pipeline) +[stage](/doc/commands-reference/run), they can be deleted permanently.** Keep in mind, that when you add to `.dvcignore` entries that affect one of the existing outputs, its status will change and DVC will behave as if From 9206bdfd787235a1f1122f44ddb5f1c6c4ae4fae Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 14:09:12 -0500 Subject: [PATCH 06/17] glossary: clarify DVC-file links in "workspace" and "project" terms per https://github.com/iterative/dvc.org/pull/549#pullrequestreview-273446912 and https://github.com/iterative/dvc.org/pull/549#pullrequestreview-273446918 --- src/Documentation/glossary.js | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index b072666ca2..d6077d3037 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -11,7 +11,7 @@ export default { Directory containing all your project files. 
For example raw datasets, source code, ML models, etc. A workspace becomes a **DVC project** when [\`dvc init\`](/doc/commands-reference/init) is run, and -[DVC-files](/doc/user-guide/dvc-file-format) are created in it. +[DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it. ` }, { @@ -19,9 +19,9 @@ code, ML models, etc. A workspace becomes a **DVC project** when match: ['DVC project', 'project', 'projects'], desc: ` Initialized by running \`dvc init\` in the **workspace**. It will contain the -\`.dvc/\` directory and [DVC-files](/doc/user-guide/dvc-file-format) created -with commands such as \`dvc add\` or \`dvc run\`. It's typically also a Git -repository. +[\`.dvc/\` directory](/doc/user-guide/dvc-files-and-directories) and +[DVC-files](/doc/user-guide/dvc-file-format) created with commands such as +\`dvc add\` or \`dvc run\`. It's typically also a Git repository. ` }, { From 0307e7481e68bb437a96611161f63cabe3d95cfd Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 14:44:47 -0500 Subject: [PATCH 07/17] use-cases: remove optionality of a step in multiple-... doc per https://github.com/iterative/dvc.org/pull/549#pullrequestreview-273447177 --- .../multiple-data-scientists-on-a-single-machine.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md index c87cad55d5..524d92b67a 100644 --- a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md +++ b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md @@ -40,10 +40,10 @@ move it from an old cache location to the new one: $ mv .dvc/cache/* /path/to/dvc-cache ``` -### Configure External Cache (Optional) +### Configure Shared Cache -This step is optional. 
Tell DVC to use the directory we've set up above as an -shared cache location by running: +Tell DVC to use the directory we've set up above as an shared cache location by +running: ```dvc $ dvc config cache.dir /path/to/dvc-cache @@ -53,7 +53,7 @@ Commit changes to `.dvc/config` and push them to your git remote: ```dvc $ git add .dvc/config -$ git commit -m "dvc: setup external cache dir" +$ git commit -m "dvc: shared external cache dir" ``` ### Examples From 8650a842594e192a554ec449859808a150bfc202 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 16:14:14 -0500 Subject: [PATCH 08/17] docs: consistent use of `/` after dir names rel https://github.com/iterative/dvc.org/pull/549#pullrequestreview-273447243 --- static/docs/commands-reference/cache/index.md | 2 +- static/docs/commands-reference/diff.md | 2 +- static/docs/commands-reference/import-url.md | 2 +- static/docs/commands-reference/init.md | 9 +++++---- static/docs/get-started/example-pipeline.md | 20 ++++++++++--------- static/docs/get-started/example-versioning.md | 12 +++++------ static/docs/get-started/initialize.md | 5 +++-- static/docs/tutorial/define-ml-pipeline.md | 2 +- static/docs/user-guide/dvcignore.md | 2 +- .../docs/user-guide/external-dependencies.md | 2 +- static/docs/user-guide/external-outputs.md | 2 +- 11 files changed, 32 insertions(+), 28 deletions(-) diff --git a/static/docs/commands-reference/cache/index.md b/static/docs/commands-reference/cache/index.md index 80b66d28b8..b0bd01f166 100644 --- a/static/docs/commands-reference/cache/index.md +++ b/static/docs/commands-reference/cache/index.md @@ -17,7 +17,7 @@ positional arguments: After DVC initialization, a hidden directory `.dvc/` is created with the [DVC internal files](/doc/user-guide/dvc-files-and-directories), including the -default `cache` directory. +default cache directory. The DVC cache is where your data files, models, etc (anything you want to version with DVC) are actually stored. 
The corresponding files you see in the diff --git a/static/docs/commands-reference/diff.md b/static/docs/commands-reference/diff.md index 82ebb12fe3..3f347e1297 100644 --- a/static/docs/commands-reference/diff.md +++ b/static/docs/commands-reference/diff.md @@ -114,7 +114,7 @@ reference these experiments. ### Click and expand to setup example Having followed the previous example's setup, move into the -`example-get-started` directory. Then make sure that you have the latest code +`example-get-started/` directory. Then make sure that you have the latest code and data with the following commands. ```dvc diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index 67040dfcbb..6080b3e643 100644 --- a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -189,7 +189,7 @@ its necessary to download it again. > See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the > text format above. -You may want to get out of and remove the `example-get-started` directory after +You may want to get out of and remove the `example-get-started/` directory after trying this example (especially if trying out the following one). ## Example: Detecting remote file changes diff --git a/static/docs/commands-reference/init.md b/static/docs/commands-reference/init.md index c84c02ade1..0286c4000a 100644 --- a/static/docs/commands-reference/init.md +++ b/static/docs/commands-reference/init.md @@ -14,14 +14,15 @@ usage: dvc init [-h] [-q | -v] [--no-scm] [-f] ## Description After DVC initialization, a new directory `.dvc/` will be created with `config` -and `.gitignore` files and `cache` directory. These files and directories are -hidden from the user generally and are not meant to be manipulated directly. +and `.gitignore` files and cache directory. These files and +directories are hidden from the user generally and are not meant to be +manipulated directly. 
`.dvc/cache` is one of the most important [DVC directories](/doc/user-guide/dvc-files-and-directories). It will hold all the contents of tracked data files. Note that `.dvc/.gitignore` lists this -directory, which means that the cache directory is not under Git -control. This is your local cache and you cannot push it to any Git remote. +directory, which means that the cache directory is not under Git control. This +is your local cache and you cannot push it to any Git remote. ## Options diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index ac166c75ec..58cf9511e6 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -75,7 +75,8 @@ $ dvc init $ git commit -m "initialize DVC" ``` -Download an input dataset to the `data` directory and take it under DVC control: +Download an input dataset to the `data/` directory and take it under DVC +control: ```dvc $ mkdir data @@ -91,10 +92,10 @@ When we run `dvc add` `Posts.xml.zip`, DVC creates a ### Expand to learn more about DVC internals `dvc init` created a new directory `example/.dvc/` with `config`, `.gitignore` -files and the `cache` directory. These files and directories are hidden from -users in general. Users don't interact with these files directly. See -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn -more. +files and the cache directory. These files and directories are +hidden from users in general. Users don't interact with these files directly. +See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to +learn more. Note that the DVC-file created by `dvc add` has no dependencies, a.k.a. an "_orphan_ stage file": @@ -165,10 +166,11 @@ outs: path: data/Posts.xml ``` -This file is using the same technique - pointers (md5 hashes) to the cache to -describe and version control dependencies and outputs. 
Output `Posts.xml` file -is automatically added to the `.gitignore` file and a link is created into a -cache `.dvc/cache/a3/04afb96060aad90176268345e10355` to save it. +This file is using the same technique - pointers (md5 hashes) to the +cache to describe and version control dependencies and outputs. +Output `Posts.xml` file is automatically added to the `.gitignore` file and a +link is created into a cache `.dvc/cache/a3/04afb96060aad90176268345e10355` to +save it. Two things are worth noticing here. First, by analyzing dependencies and outputs that DVC-files describe, we can restore the full chain (DAG) of commands we need diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 08705554cb..0cdf892604 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -165,11 +165,11 @@ $ git tag -a "v1.0" -m "model v1.0, 1000 images" ### Expand to learn more about DVC internals -As we mentioned briefly, DVC does not commit the `data` directory and `model.h5` -file into git, `dvc add` pushed them into the DVC cache and added to the -`.gitignore`. Instead, we commit DVC-files that serve as pointers to the cache -(usually in the `.dvc/cache` directory inside the repository) where actual data -resides. +As we mentioned briefly, DVC does not commit the `data/` directory and +`model.h5` file into git, `dvc add` pushed them into the DVC cache and added to +the `.gitignore`. Instead, we commit DVC-files that serve as pointers to the +cache (usually in the `.dvc/cache` directory inside the repository) where actual +data resides. In this case we created `data.dvc` and `model.h5.dvc` files. Refer to the [DVC-File Format](/doc/user-guide/dvc-file-format) to learn more about how these @@ -291,7 +291,7 @@ place. `dvc add` is appropriate when you need to keep track of different versions of datasets or model files that come and are updated from external sources. 
The -`data` directory above (with cats and dogs images) is a good example. +`data/` directory above (with cats and dogs images) is a good example. On the other hand, there are files that are a result of running some code. In our example, please notice that `train.py` produces binary files (e.g. diff --git a/static/docs/get-started/initialize.md b/static/docs/get-started/initialize.md index d778778872..8549ee92e8 100644 --- a/static/docs/get-started/initialize.md +++ b/static/docs/get-started/initialize.md @@ -23,8 +23,9 @@ $ git commit -m "initialize DVC" ``` After DVC initialization, a new directory `.dvc/` will be created with `config` -and `.gitignore` files and `cache` directory. These files and directories are -hidden from the user generally and are not meant to be manipulated directly. +and `.gitignore` files and cache directory. These files and +directories are hidden from the user generally and are not meant to be +manipulated directly. > See `dvc init` if you want to get more details about the initialization > process, and diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index cd607b14eb..5c2f319e9f 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -3,7 +3,7 @@ ## Get data file To include a data file into your data science environment, you need to copy the -file into the repository. We'll create a special `data` directory for the data +file into the repository. We'll create a special `data/` directory for the data files and download a 40MB data archive into this directory.
diff --git a/static/docs/user-guide/dvcignore.md b/static/docs/user-guide/dvcignore.md index 94c3c37c19..ff2fb152f6 100644 --- a/static/docs/user-guide/dvcignore.md +++ b/static/docs/user-guide/dvcignore.md @@ -60,7 +60,7 @@ $ tree . └── data2 ``` -We created the `data` directory. Let's ignore part of the `data` and add it +We created the `data/` directory. Let's ignore part of the `data` and add it under DVC control. ```dvc diff --git a/static/docs/user-guide/external-dependencies.md b/static/docs/user-guide/external-dependencies.md index 009bed7126..0ea0fe756a 100644 --- a/static/docs/user-guide/external-dependencies.md +++ b/static/docs/user-guide/external-dependencies.md @@ -33,7 +33,7 @@ As examples, let's take a look at a [stage](/doc/commands-reference/run) that simply moves local file from an external location, producing a `data.txt.dvc` stage file (DVC-file). -> Note that some of these commands use the `/home/shared/` directory, typical in +> Note that some of these commands use the `/home/shared` directory, typical in > Linux distributions. ### Local diff --git a/static/docs/user-guide/external-outputs.md b/static/docs/user-guide/external-outputs.md index a903f9a022..ac96ea9354 100644 --- a/static/docs/user-guide/external-outputs.md +++ b/static/docs/user-guide/external-outputs.md @@ -45,7 +45,7 @@ For the examples, let's take a look at a [stage](/doc/commands-reference/run) that simply moves local file to an external location, producing a `data.txt.dvc` stage file (DVC-file). -> Note that some of these commands use the `/home/shared/` directory, typical in +> Note that some of these commands use the `/home/shared` directory, typical in > Linux distributions. 
### Local From e0e0285211b3a696780914517746683c36897679 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 16:17:52 -0500 Subject: [PATCH 09/17] term: remove `tree` before sample console blocks per https://github.com/iterative/dvc.org/pull/549#pullrequestreview-273447293 --- static/docs/commands-reference/checkout.md | 2 +- static/docs/commands-reference/fetch.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 639fb1c92c..17d62424f2 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -124,7 +124,7 @@ $ cd example-get-started
-The workspace `tree` looks almost like in this +The workspace looks almost like in this [pipeline setup](/doc/get-started/example-pipeline): ```dvc diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index 939ce8d17c..193cf69017 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -134,7 +134,7 @@ $ cd example-get-started
-The workspace `tree` looks almost like in this +The workspace looks almost like in this [pipeline setup](/doc/get-started/example-pipeline): ```dvc From 1fa76ef7df68a0735e5f1b5ac40bb94e4ae11b9a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 16:24:03 -0500 Subject: [PATCH 10/17] use-cases: use lower case in headers after h2 in multi-... --- .../use-cases/multiple-data-scientists-on-a-single-machine.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md index 524d92b67a..85efd5c08a 100644 --- a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md +++ b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md @@ -28,7 +28,7 @@ you and your colleagues are members of the same group (e.g. `users`) and that your shared cache directory is owned by that group and has respective permissions. -### Transfer Existing Cache (Optional) +### Transfer existing cache (Optional) This step is optional. You can skip it if you are setting up a new DVC repository and don't have your local cache stored in `.dvc/cache`. 
If you did @@ -40,7 +40,7 @@ move it from an old cache location to the new one: $ mv .dvc/cache/* /path/to/dvc-cache ``` -### Configure Shared Cache +### Configure shared cache Tell DVC to use the directory we've set up above as an shared cache location by running: From 9fce9a7747666846ef078595a56e56808e856e63 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 16:24:54 -0500 Subject: [PATCH 11/17] cmd ref: remove unnecessary remark --- static/docs/commands-reference/commit.md | 2 -- static/docs/commands-reference/install.md | 2 -- 2 files changed, 4 deletions(-) diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index 83547a87a2..591d47c641 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -122,8 +122,6 @@ Download the precomputed data using: $ dvc pull --all-branches --all-tags ``` -This data will be retrieved from a preconfigured remote cache. - ## Example: Rapid iterations diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index b43ec5d9d9..b596b7ebf0 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -118,8 +118,6 @@ Download the precomputed data using: $ dvc pull --all-branches --all-tags ``` -This data will be retrieved from a preconfigured remote cache. 
- ## Example: Checkout both DVC and Git From 8d0451cfb09e3cfdcdc5bd47e70bc1e62ca7eb99 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 23:08:53 -0500 Subject: [PATCH 12/17] term: review usage of "the project" and add where needed --- .../docs/commands-reference/metrics/show.md | 6 ++--- .../docs/commands-reference/pipeline/list.md | 4 ++-- static/docs/commands-reference/pull.md | 6 ++--- static/docs/commands-reference/remote/add.md | 6 ++--- .../docs/commands-reference/remote/index.md | 22 +++++++++---------- static/docs/commands-reference/remote/list.md | 6 ++--- static/docs/get-started/agenda.md | 2 +- static/docs/get-started/configure.md | 14 ++++++------ static/docs/get-started/index.md | 2 +- static/docs/get-started/reproduce.md | 5 +++-- static/docs/tutorial/reproducibility.md | 2 +- static/docs/tutorial/sharing-data.md | 4 ++-- static/docs/user-guide/analytics.md | 6 ++--- .../user-guide/contributing-documentation.md | 4 ++-- static/docs/user-guide/contributing.md | 9 +++++--- static/docs/user-guide/dvcignore.md | 14 ++++++------ 16 files changed, 58 insertions(+), 54 deletions(-) diff --git a/static/docs/commands-reference/metrics/show.md b/static/docs/commands-reference/metrics/show.md index 934cadca3a..fe66290cb2 100644 --- a/static/docs/commands-reference/metrics/show.md +++ b/static/docs/commands-reference/metrics/show.md @@ -1,6 +1,6 @@ # metrics show -Find and print project metrics. +Find and print project metrics. ## Synopsis @@ -48,8 +48,8 @@ detected by the file extension automatically. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric value. Should be used if metric file contains multiple numbers and you need to get a only one of them. Only single path is allowed. If multiple metric files - exist in the project, the same parser and path will be applied to all of them. 
- If xpath for particular metric has been set using + exist in the project, the same parser and path will be applied to + all of them. If `xpath` for particular metric has been set using [`dvc metrics modify`](https://dvc.org/doc/commands-reference/metrics/modify#options) `xpath` passed in this option will owervrite it, only for current command run. It may fail to produce any results or parse files that are not in a diff --git a/static/docs/commands-reference/pipeline/list.md b/static/docs/commands-reference/pipeline/list.md index f46838249e..01b1d7f856 100644 --- a/static/docs/commands-reference/pipeline/list.md +++ b/static/docs/commands-reference/pipeline/list.md @@ -11,8 +11,8 @@ usage: dvc pipeline list [-h] [-q | -v] ## Description -`dvc list` displays a list of all existing stages in the project, grouped in -their corresponding pipeline(s) when connected. (See `dvc pipeline`.) +`dvc list` displays a list of all existing stages in the project, +grouped in their corresponding pipeline(s) when connected. (See `dvc pipeline`.) > Note that the stages in these lists are in ascending order, that is, from last > to first. diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index 9cdf00fbe5..9fd97f14b3 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -104,9 +104,9 @@ reflinks or hardlinks to put it in the workspace without copying. See ## Examples For using the `dvc pull` command, remote storage must be defined. (See -`dvc remote`.) For an existing project, remotes are usually already set up and -you can use `dvc remote list` to check them. Just to remind how it is done and -set a context for the example, let's define an SSH remote with the +`dvc remote`.) For an existing project, remotes are usually already +set up and you can use `dvc remote list` to check them. 
Just to remind how it is +done and set a context for the example, let's define an SSH remote with the `dvc remote add` command: ```dvc diff --git a/static/docs/commands-reference/remote/add.md b/static/docs/commands-reference/remote/add.md index e1c170b8aa..86b41a81bf 100644 --- a/static/docs/commands-reference/remote/add.md +++ b/static/docs/commands-reference/remote/add.md @@ -85,9 +85,9 @@ The following are the types and of remotes (protocols) supported: ### Click for a local remote example > While the term may seem contradictory, it doesn't have to be. The "local" part -> refers to the machine where the project is stored, so it can be any directory -> accessible to the same system. The "remote" part refers specifically to the -> project/repository itself. +> refers to the machine where the project is stored, so it can be +> any directory accessible to the same system. The "remote" part refers +> specifically to the project/repository itself. Using an absolute path (recommended): diff --git a/static/docs/commands-reference/remote/index.md b/static/docs/commands-reference/remote/index.md index d69ecce46e..a591c79173 100644 --- a/static/docs/commands-reference/remote/index.md +++ b/static/docs/commands-reference/remote/index.md @@ -25,13 +25,13 @@ positional arguments: What is data remote? -The same way as Github serves as a master storage for Git-based projects, DVC -data remotes provide a central place to keep and share data and model files. -With a remote data storage, you can pull models and data files which were -created by your team members without spending time and resources to re-build -models and re-process data files. It also saves space on your local -environment - DVC can [fetch](/doc/commands-reference/fetch) into the local -cache only the data you need for a specific branch/commit. +The same way as Github provides storage hosting for Git repositories, DVC data +remotes provide a central place to keep and share data and model files. 
With a +remote data storage, you can pull models and data files which were created by +your team members without spending time and resources to re-build models and +re-process data files. It also saves space on your local environment - DVC can +[fetch](/doc/commands-reference/fetch) into the local cache only the data you +need for a specific branch/commit. > If you installed DVC via `pip`, depending on the remote type you plan to use > you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`, @@ -51,7 +51,7 @@ repository), which enables basic DVC usage scenarios out of the box. [config files](/doc/commands-reference/config). Alternatively, `dvc config` can be used or these files could be edited manually. -For the typical process to share the project via remote, see +For the typical process to share the project via remote, see [Share Data And Model Files](/doc/use-cases/share-data-and-model-files). ## Options @@ -72,9 +72,9 @@ For the typical process to share the project via remote, see ### What is a "local remote" ? While the term may seem contradictory, it doesn't have to be. The "local" part -refers to the machine where the project is stored, so it can be any directory -accessible to the same system. The "remote" part refers specifically to the -project/repository itself. +refers to the machine where the project is stored, so it can be any +directory accessible to the same system. The "remote" part refers specifically +to the project/repository itself. diff --git a/static/docs/commands-reference/remote/list.md b/static/docs/commands-reference/remote/list.md index cac4550813..c4a4d6f634 100644 --- a/static/docs/commands-reference/remote/list.md +++ b/static/docs/commands-reference/remote/list.md @@ -38,9 +38,9 @@ Let's for simplicity add a default local remote: ### What is a "local remote" ? While the term may seem contradictory, it doesn't have to be. 
The "local" part -refers to the machine where the project is stored, so it can be any directory -accessible to the same system. The "remote" part refers specifically to the -project/repository itself. +refers to the machine where the project is stored, so it can be any +directory accessible to the same system. The "remote" part refers specifically +to the project/repository itself. diff --git a/static/docs/get-started/agenda.md b/static/docs/get-started/agenda.md index 647654a5f6..8a06121436 100644 --- a/static/docs/get-started/agenda.md +++ b/static/docs/get-started/agenda.md @@ -13,7 +13,7 @@ $ git clone https://github.com/iterative/example-get-started Otherwise, bear with us and we will introduce the basic DVC concepts and get to the same result together! -The idea of the project is a simplified version of the +The idea of this project is a simplified version of the [tutorial](/doc/tutorial). It explores the NLP problem of predicting tags for a given StackOverflow question. For example, we want one classifier which can predict a post that is about the Python language by tagging it `python`. diff --git a/static/docs/get-started/configure.md b/static/docs/get-started/configure.md index 90acffc13c..8f4f52205e 100644 --- a/static/docs/get-started/configure.md +++ b/static/docs/get-started/configure.md @@ -4,10 +4,10 @@ Once you install DVC, you will be able to start using it (in its local setup) immediately. However, remote storage should be set up (see `dvc remote`) if you need to share -data or models outside of the context of single project, for example with other -collaborators or even with yourself, in a different computing environment. It's -similar to the way you would use Github or any other Git server to store and -share your code. +data or models outside of the context of a single project, for example with +other collaborators or even with yourself, in a different computing environment. 
+It's similar to the way you would use Github or any other Git server to store +and share your code. For simplicity, let's setup a local remote: @@ -16,9 +16,9 @@ For simplicity, let's setup a local remote: ### What is a "local remote" ? While the term may seem contradictory, it doesn't have to be. The "local" part -refers to the machine where the project is stored, so it can be any directory -accessible to the same system. The "remote" part refers specifically to the -project/repository itself. +refers to the machine where the project is stored, so it can be any +directory accessible to the same system. The "remote" part refers specifically +to the project/repository itself. diff --git a/static/docs/get-started/index.md b/static/docs/get-started/index.md index 9e63e49e53..153b666608 100644 --- a/static/docs/get-started/index.md +++ b/static/docs/get-started/index.md @@ -15,7 +15,7 @@ if you have any questions or need any help. We are very responsive ⚡. us a ⭐ if you like the project! ✅ Contribute either on [Github](https://github.com/iterative/dvc) or -[Patreon](https://www.patreon.com/DVCorg/overview) to support the Project. +[Patreon](https://www.patreon.com/DVCorg/overview) to support the project. This longer [Tutorial](/doc/tutorial) introduces DVC step-by-step while explaining in great detail the motivation and what's happening internally. diff --git a/static/docs/get-started/reproduce.md b/static/docs/get-started/reproduce.md index dd1a460478..ad12233b73 100644 --- a/static/docs/get-started/reproduce.md +++ b/static/docs/get-started/reproduce.md @@ -6,8 +6,9 @@ single stage we need to run (a pipeline) towards a final result. Each depends on some data (either raw data files or intermediate results from another DVC-file) and code files. -If you freshly checked out the project, make sure you first fetch the input data -from DVC by calling `dvc pull`. 
+If you just cloned the +[project](https://github.com/iterative/example-get-started), make sure you first +fetch the input data from DVC by calling `dvc pull`. It's now extremely easy for you or anyone in your team to reproduce the result end-to-end: diff --git a/static/docs/tutorial/reproducibility.md b/static/docs/tutorial/reproducibility.md index ca8fd29cdc..13bf134011 100644 --- a/static/docs/tutorial/reproducibility.md +++ b/static/docs/tutorial/reproducibility.md @@ -21,7 +21,7 @@ designed in such a way to localize specification of DAG nodes. If you run `repro` on any [DVC-file](/doc/user-guide/dvc-file-format) from our repository, nothing happens because nothing was changed in the pipeline defined -in the project. There's nothing to reproduce. +in the project. There's nothing to reproduce. ```dvc $ dvc repro model.p.dvc diff --git a/static/docs/tutorial/sharing-data.md b/static/docs/tutorial/sharing-data.md index f828990b30..5dcfa7ff06 100644 --- a/static/docs/tutorial/sharing-data.md +++ b/static/docs/tutorial/sharing-data.md @@ -12,8 +12,8 @@ DVC is able to push the cache to a cloud. > Using your shared cache a colleague can reuse ML models that were trained on > your machine. -First, you need to set a data remote which will be stored in the project's -config file. This can be done using the CLI as shown below. +First, you need to set a data remote which will be stored in the config file of +the project. This can be done using the CLI as shown below. 
> Note that we are using `dvc-share` s3 bucket as an example and you don't have > write access to it, so in order to follow the tutorial you will need to either diff --git a/static/docs/user-guide/analytics.md b/static/docs/user-guide/analytics.md index cd06d47dfc..649b3fa3c8 100644 --- a/static/docs/user-guide/analytics.md +++ b/static/docs/user-guide/analytics.md @@ -57,6 +57,6 @@ However, if you want to opt out of DVC's analytics, you can disable it via $ dvc config core.analytics false ``` -This will disable it for the project. Alternatively, you can specify `--global` -or `--system` flags to disable it for an active user or for everyone in the -system. +This will disable it for the project. Alternatively, you can +specify `--global` or `--system` flags to disable it for an active user or for +everyone in the system. diff --git a/static/docs/user-guide/contributing-documentation.md b/static/docs/user-guide/contributing-documentation.md index 95ded4fc7a..1c1e7858fa 100644 --- a/static/docs/user-guide/contributing-documentation.md +++ b/static/docs/user-guide/contributing-documentation.md @@ -26,8 +26,8 @@ to update the docs and redeploy the website. ## Submitting changes In case of a minor change, you can use the **Edit on Github** button (found to -the right of each page) to fork the project, edit it in place (with the source -code file **Edit** button in Github), and create a pull request (PR). +the right of each page) to fork the repository, edit it in place (with the +source code file **Edit** button in Github), and create a pull request (PR). Otherwise, please refer to the following procedure: diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index bfdbd13497..a9800daf4a 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -9,10 +9,13 @@ guide if you want to fix or update the documentation or this website. 
Please search [issue tracker](https://github.com/iterative/dvc/issues) before creating a new issue (problem or an improvement request). Feel free to add -issues related to the project and [dvc.org](https://dvc.org/) site. +issues related to the project. -If you feel that you can fix or implement it, please read a few paragraphs below -to learn how to submit your changes. +For problems with [dvc.org](https://dvc.org/) site please use this +[Github repository](https://github.com/iterative/dvc.org/). + +If you feel that you can fix or implement it yourself, please read a few +paragraphs below to learn how to submit your changes. ## Submitting changes diff --git a/static/docs/user-guide/dvcignore.md b/static/docs/user-guide/dvcignore.md index ff2fb152f6..f77c458980 100644 --- a/static/docs/user-guide/dvcignore.md +++ b/static/docs/user-guide/dvcignore.md @@ -4,12 +4,12 @@ Marks which files and/or directories should be ignored when traversing repository. Sometimes you might want DVC to ignore some files while working with the -project. For example, when working on a project with many files in its data -directory, you might encounter extended execution time for operations that are -as simple as `dvc status`. In other case you might want to omit files or folders -unrelated to the project (like `.DS_Store` on Mac). To address these -requirements we are implementing `.dvcignore` files handling. `.dvcignore` by -design works similar way as `.gitignore` does. +project. For example, when working on a workspace with +many files in its data directory, you might encounter extended execution time +for operations that are as simple as `dvc status`. In other case you might want +to omit files or folders unrelated to the project (like `.DS_Store` on Mac). To +address these requirements we are implementing `.dvcignore` files handling. +`.dvcignore` by design works similar way as `.gitignore` does. ## How does it work? 
@@ -146,7 +146,7 @@ data.dvc: ## Example: Ignore dvc controlled file -Let's analyze an example project: +Let's analyze an example workspace: ```dvc $ mkdir dir1 dir2 From 0dce2c01cca4ddeed37e35052ae6e568c44f7a07 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 23:23:19 -0500 Subject: [PATCH 13/17] cmd ref: review `--xpath` option text (`modify` commands) --- static/docs/commands-reference/metrics/add.md | 26 ++++++++-------- .../docs/commands-reference/metrics/modify.md | 26 ++++++++-------- .../docs/commands-reference/metrics/show.md | 30 +++++++++---------- .../use-cases/share-data-and-model-files.md | 2 +- 4 files changed, 41 insertions(+), 43 deletions(-) diff --git a/static/docs/commands-reference/metrics/add.md b/static/docs/commands-reference/metrics/add.md index 1bdb255d10..eb0af582d2 100644 --- a/static/docs/commands-reference/metrics/add.md +++ b/static/docs/commands-reference/metrics/add.md @@ -35,19 +35,19 @@ contains multiple metrics. when no type is provided. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric - value. Should be used if metric file contains multiple numbers and you need to - get a only one of them. Only single path is allowed. This path will be saved - into the corresponding DVC-file and will be used automatically in - `dvc metrics show`. Accepted value depends on the metric file type (`-t` - option): - - - `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) for - available options. For example, `"AUC"` extracts the value from the - following json-formatted metric file: `{"AUC": "0.624652"}`. - - `tsv`/`csv` - `row,column`, e.g. `1,2`. Indices are 0-based. - - `htsv`/`hcsv` - `row,column name`. Row index is 0-based. First row is used - to specify column names and is not included into index. For example: - `0,Name`. + value. Should be used if the metric file contains multiple numbers and you + need to get a only one of them. Only a single path is allowed. 
This path will + be saved into the corresponding DVC-file and will be used automatically in + `dvc metrics show`. The accepted value depends on the metric file type + (`--type` option): + + - `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or + [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. + For example, `"AUC"` extracts the value from the following JSON-formatted + metric file: `{"AUC": "0.624652"}`. + - `tsv`/`csv` - `row,column` e.g. `1,2`. Indices are 0-based. + - `htsv`/`hcsv` - `row,column name` e.g. `0,Name`. Row index is 0-based. First + row is used to specify column names and is not included into index. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/static/docs/commands-reference/metrics/modify.md b/static/docs/commands-reference/metrics/modify.md index 82cd5ab817..4b13f1316e 100644 --- a/static/docs/commands-reference/metrics/modify.md +++ b/static/docs/commands-reference/metrics/modify.md @@ -40,19 +40,19 @@ ERROR: failed to modify metric file settings - when no type is provided. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric - value. Should be used if metric file contains multiple numbers and you need to - get a only one of them. Only single path is allowed. This path will be saved - into the corresponding DVC-file and will be used automatically in - `dvc metrics show`. Accepted value depends on the metric file type (`-t` - option): - - - `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) for - available options. For example, `"AUC"` extracts the value from the - following json-formatted metric file: `{"AUC": "0.624652"}`. - - `tsv`/`csv` - `row,column`, e.g. `1,2`. Indices are 0-based. - - `htsv`/`hcsv` - `row,column name`. Row index is 0-based. First row is used - to specify column names and is not included into index. For example: - `0,Name`. + value. 
Should be used if the metric file contains multiple numbers and you + need to get a only one of them. Only a single path is allowed. This path will + be saved into the corresponding DVC-file and will be used automatically in + `dvc metrics show`. The accepted value depends on the metric file type + (`--type` option): + + - `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or + [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. + For example, `"AUC"` extracts the value from the following JSON-formatted + metric file: `{"AUC": "0.624652"}`. + - `tsv`/`csv` - `row,column` e.g. `1,2`. Indices are 0-based. + - `htsv`/`hcsv` - `row,column name` e.g. `0,Name`. Row index is 0-based. First + row is used to specify column names and is not included into index. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/static/docs/commands-reference/metrics/show.md b/static/docs/commands-reference/metrics/show.md index fe66290cb2..8dac7c3259 100644 --- a/static/docs/commands-reference/metrics/show.md +++ b/static/docs/commands-reference/metrics/show.md @@ -46,28 +46,26 @@ detected by the file extension automatically. file. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric - value. Should be used if metric file contains multiple numbers and you need to - get a only one of them. Only single path is allowed. If multiple metric files - exist in the project, the same parser and path will be applied to - all of them. If `xpath` for particular metric has been set using - [`dvc metrics modify`](https://dvc.org/doc/commands-reference/metrics/modify#options) - `xpath` passed in this option will owervrite it, only for current command run. - It may fail to produce any results or parse files that are not in a - corresponding format in this case. Accepted value depends on the metric file - type (`-t` option): + value. 
Should be used if the metric file contains multiple numbers and you + need to get a only one of them. Only a single path is allowed. If multiple + metric files exist in the project, the same parser and path will + be applied to all of them. If `xpath` for a particular metric has been set + using `dvc metrics modify`, the path passed with this option will overwrite it + for the current command run only – It may fail to produce any results or parse + files that are not in a corresponding format in this case. The Accepted value + depends on the metric file type (`--type` option): - `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. For example, `"AUC"` extracts the value from the following json-formatted - metric file: `{"AUC": "0.624652"}`. You can also filter on certain values. - For example, `"$.metrics[?(@.deviation_mse<0.30) & (@.value_mse>0.4)]"` + metric file: `{"AUC": "0.624652"}`. You can also filter on certain values, + for example `"$.metrics[?(@.deviation_mse<0.30) & (@.value_mse>0.4)]"` extracts only the values for model versions if they meet the given - condition(s) from the metric file: + conditions from the metric file: `{"metrics": [{"dataset": "train", "deviation_mse": 0.173461, "value_mse": 0.421601}]}` - - `tsv`/`csv` - `row,column`, e.g. `1,2`. Indices are 0-based. - - `htsv`/`hcsv` - `row,column name`. Row index is 0-based. First row is used - to specify column names and is not included into index. For example: - `0,Name`. + - `tsv`/`csv` - `row,column` e.g. `1,2`. Indices are 0-based. + - `htsv`/`hcsv` - `row,column name` e.g. `0,Name`. Row index is 0-based. First + row is used to specify column names and is not included into index. - `-a`, `--all-branches` - get and print metric file contents across all branches. It can be used to compare different variants of an experiment. 
diff --git a/static/docs/use-cases/share-data-and-model-files.md b/static/docs/use-cases/share-data-and-model-files.md index 951d60d7c9..0ee6088b8a 100644 --- a/static/docs/use-cases/share-data-and-model-files.md +++ b/static/docs/use-cases/share-data-and-model-files.md @@ -44,7 +44,7 @@ remote = myremote ``` `dvc remote` provides a wide variety of options to configure S3 bucket. For more -information visit [`dvc remote modify`](/doc/commands-reference/remote/modify). +information visit `dvc remote modify`. Let's, commit your changes and push your code: From 97e3d7fe20b23f52197aa7050818c2b909600eb4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 11 Aug 2019 23:58:25 -0500 Subject: [PATCH 14/17] term: elimintae usage of "(s)" for possible plural in favor of simple plural --- static/docs/commands-reference/add.md | 4 ++-- static/docs/commands-reference/checkout.md | 6 +++--- static/docs/commands-reference/commit.md | 6 +++--- static/docs/commands-reference/fetch.md | 6 +++--- static/docs/commands-reference/get-url.md | 4 ++-- static/docs/commands-reference/get.md | 2 +- .../docs/commands-reference/metrics/show.md | 2 +- .../docs/commands-reference/pipeline/list.md | 3 ++- static/docs/commands-reference/pull.md | 14 +++++++------- static/docs/commands-reference/push.md | 14 +++++++------- static/docs/commands-reference/remove.md | 4 ++-- static/docs/commands-reference/repro.md | 19 ++++++++++--------- static/docs/commands-reference/status.md | 15 ++++++++------- .../user-guide/contributing-documentation.md | 2 +- static/docs/user-guide/contributing.md | 2 +- 15 files changed, 53 insertions(+), 50 deletions(-) diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 369dbd0954..e3c6bbfd1f 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -26,8 +26,8 @@ Under the hood, a few actions are taken for each file in `targets`: 2. 
Move the file content to the DVC cache (default location is `.dvc/cache`). 3. Replace the file by a link to the file in the cache (see details below). 4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store - the checksum to identify the cache entry. -5. Add the target(s) to `.gitignore` (if Git is used in this + the MD5 checksum to identify the cache entry. +5. Add the targets to `.gitignore` (if Git is used in this workspace) to prevent it from being committed to the Git repository. 6. Instructions are printed showing `git` commands for adding the files to a Git diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 17d62424f2..9a4e98d2dc 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -79,10 +79,10 @@ be pulled from a remote cache using `dvc pull`. ## Options - `-d`, `--with-deps` - determine files to update by tracking dependencies to - the target DVC-file(s) (stages). This option only has effect when one or more + the target DVC-files (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward from the target stage(s) in the corresponding pipeline(s). This means - DVC will not checkout files referenced in later stage(s) than `targets`. + backward from the target stages in the corresponding pipelines. This means DVC + will not checkout files referenced in later stages than the `targets`. - `-R`, `--recursive` - `targets` is expected to contain at least one directory path for this option to have effect. Determines the files to checkout by diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index 591d47c641..1dc80aa991 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -67,10 +67,10 @@ into play. 
It handles that last step of adding the file to the DVC cache. ## Options - `-d`, `--with-deps` - determine files to commit by tracking dependencies to - the target DVC-file(s) (stages). This option only has effect when one or more + the target DVC-files (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward from the target stage(s) in the corresponding pipeline(s). This means - DVC will not commit files referenced in later stage(s) than `targets`. + backward from the target stages in the corresponding pipelines. This means DVC + will not commit files referenced in later stages than the `targets`. - `-R`, `--recursive` - `targets` is expected to contain at least one directory path for this option to have effect. Determines the files to commit by diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index 193cf69017..7887a138df 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -80,10 +80,10 @@ specified in DVC-files currently in the project are considered by `dvc fetch` using the `dvc remote` command. - `-d`, `--with-deps` - determine files to download by tracking dependencies to - the target DVC-file(s) (stages). This option only has effect when one or more + the target DVC-files (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward from the target stage(s) in the corresponding pipeline(s). This means - DVC will not fetch files referenced in later stage(s) than `targets`. + backward from the target stages in the corresponding pipelines. This means DVC + will not fetch files referenced in later stages than the `targets`. - `-R`, `--recursive` - `targets` is expected to contain at least one directory path for this option to have effect. 
Determines the files to fetch by diff --git a/static/docs/commands-reference/get-url.md b/static/docs/commands-reference/get-url.md index cbd8cc53dc..5f9e1af344 100644 --- a/static/docs/commands-reference/get-url.md +++ b/static/docs/commands-reference/get-url.md @@ -3,8 +3,8 @@ Download or copy file or directory from any supported URL (for example `s3://`, `ssh://`, and other protocols) or local directory to the local file system. -> Unlike `dvc import-url`, this command does not track the downloaded data -> file(s) (does not create a DVC-file). +> Unlike `dvc import-url`, this command does not track the downloaded data files +> (does not create a DVC-file). ## Synopsis diff --git a/static/docs/commands-reference/get.md b/static/docs/commands-reference/get.md index 7bdc10cb43..cef60ad606 100644 --- a/static/docs/commands-reference/get.md +++ b/static/docs/commands-reference/get.md @@ -3,7 +3,7 @@ Download or copy file or directory from another DVC repository (on a git server such as Github) into the local file system. -> Unlike `dvc import`, this command does not track the downloaded data file(s) +> Unlike `dvc import`, this command does not track the downloaded data files > (does not create a DVC-file). ## Synopsis diff --git a/static/docs/commands-reference/metrics/show.md b/static/docs/commands-reference/metrics/show.md index 8dac7c3259..0a5e1d1316 100644 --- a/static/docs/commands-reference/metrics/show.md +++ b/static/docs/commands-reference/metrics/show.md @@ -32,7 +32,7 @@ detected by the file extension automatically. ## Options -- `-t`, `--type` - specify a type of the metric file(s) that will be used to +- `-t`, `--type` - specify a type of the metric file that will be used to determine how to handle `xpath` parameter from down below. Accepted values are: `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. 
If this parameter is not given, the type can be detected by the file extension automatically if the diff --git a/static/docs/commands-reference/pipeline/list.md b/static/docs/commands-reference/pipeline/list.md index 01b1d7f856..69be0e3271 100644 --- a/static/docs/commands-reference/pipeline/list.md +++ b/static/docs/commands-reference/pipeline/list.md @@ -12,7 +12,8 @@ usage: dvc pipeline list [-h] [-q | -v] ## Description `dvc list` displays a list of all existing stages in the project, -grouped in their corresponding pipeline(s) when connected. (See `dvc pipeline`.) +grouped in their corresponding [pipeline](/doc/commands-reference/pipeline), +when connected. > Note that the stages in these lists are in ascending order, that is, from last > to first. diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index 9fd97f14b3..8dfd7bdc27 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -49,9 +49,9 @@ files `dvc pull` would download. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies -backward from the target [stage](/doc/commands-reference/run) file(s), through -the corresponding [pipeline(s)](/doc/commands-reference/pipeline), to find data -files to pull. +backward from the target [stage files](/doc/commands-reference/run), through the +corresponding [pipelines](/doc/commands-reference/pipeline), to find data files +to pull. After a data file is in cache, `dvc pull` can use OS-specific mechanisms like reflinks or hardlinks to put it in the workspace without copying. See @@ -74,10 +74,10 @@ reflinks or hardlinks to put it in the workspace without copying. See save different experiments or project checkpoints. - `-d`, `--with-deps` - determines files to download by tracking dependencies to - the target DVC-file(s) (stages). 
This option only has effect when one or more + the target DVC-files (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward from the target stage(s) in the corresponding pipeline(s). This means - DVC will not pull files referenced in later stage(s) than `targets`. + backward from the target stages in the corresponding pipelines. This means DVC + will not pull files referenced in later stages than the `targets`. - `-R`, `--recursive` - `targets` is expected to contain at least one directory path for this option to have effect. Determines the files to pull by searching @@ -140,7 +140,7 @@ $ dvc pull data.zip.dvc In this case we left off the `--remote` option, so it will have pulled from the default remote. The only files considered in this case are what is listed in the -`out` section of the target DVC-file(s). +`out` section of the DVC-file `targets`. ## Example: With dependencies diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index 8f1b9dd915..7ee6436c53 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -65,9 +65,9 @@ remove nor modify those files in the remote cache. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies -backward from the target [stage](/doc/commands-reference/run) file(s), through -the corresponding [pipeline(s)](/doc/commands-reference/pipeline), to find data -files to push. +backward from the target [stage files](/doc/commands-reference/run), through the +corresponding [pipelines](/doc/commands-reference/pipeline), to find data files +to push. ## Options @@ -86,10 +86,10 @@ files to push. save different experiments or project checkpoints. - `-d`, `--with-deps` - determines files to upload by tracking dependencies to - the target DVC-file(s) (stages). 
This option only has effect when one or more + the target DVC-files (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward from the target stage(s) in the corresponding pipeline(s). This means - DVC will not push files referenced in later stage(s) than `targets`. + backward from the target stages in the corresponding pipelines. This means DVC + will not push files referenced in later stages than the `targets`. - `-R`, `--recursive` - `targets` is expected to contain at least one directory path for this option to have effect. Determines the files to push by searching @@ -268,7 +268,7 @@ the local cache compared to the remote. Next we can upload part of the data from the local cache to a remote using the command `dvc push --with-deps STAGE.dvc`. Remember that `--with-deps` searches -backwards from the target DVC-file(s) to locate files to upload, and does not +backwards from the DVC-file `targets` to locate files to upload, and does not upload files in subsequent stages. After doing that we can inspect the remote cache again: diff --git a/static/docs/commands-reference/remove.md b/static/docs/commands-reference/remove.md index 5682851fb6..3eb045c710 100644 --- a/static/docs/commands-reference/remove.md +++ b/static/docs/commands-reference/remove.md @@ -28,8 +28,8 @@ it can be used to replace or modify files that are under DVC control. ## Options -- `-o`, `--outs` (default) - remove outputs described in the provided DVC - file(s), keep the DVC-files. +- `-o`, `--outs` (default) - remove outputs described in the given `targets`, + keep the DVC-files. - `-p`, `--purge` - remove outputs and DVC-files. 
diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 6ea13a4795..5c5973c9af 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -19,13 +19,14 @@ positional arguments: ## Description `dvc repro` provides an interface to run the commands in a computational graph -(a.k.a. pipeline) again, as defined in the stage files (DVC-files) found in the +(a.k.a. pipeline) again, as defined in the +[stage files](/doc/commands-reference/run) (DVC-files) found in the project. (A pipeline is typically defined using the `dvc run` command, while data input nodes are defined by the `dvc add` command.) There's a few ways to restrict the stages that will be run again by this -command: by specifying stage file(s) as `targets`, or by using the -`--single-item`, `--cwd`, or other options. +command: by specifying stage file `targets`, or by using the `--single-item`, +`--cwd`, or other options. If specific [DVC-files](/doc/user-guide/dvc-file-format) (`targets`) are omitted, `Dvcfile` will be assumed. @@ -66,9 +67,9 @@ specified), and updates stage files with the new checksum information. experiments and you don't want to fill up the cache with temporary files. Use `dvc commit` when ready to save results to cache. -- `-m`, `--metrics` - show metrics after reproduction. The target pipeline(s) - must have at least one metrics file defined either with the `dvc metrics` - command, or by the `-M` or `-m` options on the `dvc run` command. +- `-m`, `--metrics` - show metrics after reproduction. The target pipelines must + have at least one metrics file defined either with the `dvc metrics` command, + or by the `-M` or `-m` options on the `dvc run` command. - `--dry` - only print the commands that would be executed without actually executing the commands. @@ -76,8 +77,8 @@ specified), and updates stage files with the new checksum information. 
- `-i`, `--interactive` - ask for confirmation before reproducing each stage. The stage is only run if the user types "y". -- `-p`, `--pipeline` - reproduce the entire pipeline(s) that the target stage - file(s) belong(s) to. Use `dvc pipeline show .dvc` to show the parent +- `-p`, `--pipeline` - reproduce the entire pipelines that the stage file + `targets` belong to. Use `dvc pipeline show .dvc` to show the parent pipeline of a target stage. - `-P`, `--all-pipelines` - reproduce all pipelines, for all the stage files @@ -107,7 +108,7 @@ specified), and updates stage files with the new checksum information. - `-v`, `--verbose` - displays detailed tracing information. - `--downstream` - only run again the stages after the given `targets` in their - corresponding pipeline(s), including the target stages themselves. + corresponding pipelines, including the target stages themselves. ## Examples diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 66931d2723..4260f73887 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -1,8 +1,9 @@ # status -Show changes in the [pipeline(s)](/doc/commands-reference/pipeline), as well as -mismatches either between the local cache and local files, or between the local -cache and remote cache. +Show changes in the project +[pipelines](/doc/commands-reference/pipeline), as well as mismatches either +between the local cache and local files, or between the local cache and remote +cache. ## Synopsis @@ -17,7 +18,7 @@ positional arguments: ## Description -`dvc status` searches for changes in the existing pipeline(s), either showing +`dvc status` searches for changes in the existing pipelines, either showing which [stages](/doc/commands-reference/run) have changed in the workspace and must be reproduced (with `dvc repro`), or differences between local vs. remote cache (meaning `dvc push` or `dvc pull` @@ -96,10 +97,10 @@ DVC cache. 
For the typical process to update the workspace, see ## Options - `-d`, `--with-deps` - determines files to check by tracking dependencies to - the target DVC-file(s) (stages). This option only has effect when one or more + the target DVC-files (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward from the target stage(s) in the corresponding pipeline(s). This means - DVC will not show changes occurring in later stage(s) than `targets`. Applies + backward from the target stages in the corresponding pipelines. This means DVC + will not show changes occurring in later stages than the `targets`. Applies whether or not `--cloud` is specified. - `-c`, `--cloud` - enables comparison against a remote cache. If no `--remote` diff --git a/static/docs/user-guide/contributing-documentation.md b/static/docs/user-guide/contributing-documentation.md index 1c1e7858fa..5602d6a917 100644 --- a/static/docs/user-guide/contributing-documentation.md +++ b/static/docs/user-guide/contributing-documentation.md @@ -91,7 +91,7 @@ in question. - We use [Prettier](https://prettier.io/) default conventions to format our source code files. The formatting of staged files will automatically be done by the Git pre-commit hook we have configured. You may also run - `npx prettier --write ` manually before committing changes. + `npx prettier --write ` manually before committing changes. - Using `dvc ` in the Markdown files, the docs engine will create a link to that command automatically. (No need to use `[]()` explicitly to diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index a9800daf4a..cf2083db24 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -130,7 +130,7 @@ This tests require additional effort to set up, so they are skipped by default. 
You don't need this in most cases, however, if you develop or fix some remote related code you might need to go through steps below. -Install requirements for whatever remote(s) you are going to test: +Install requirements for whatever remotes you are going to test: ```dvc $ pip install -e ".[s3]" From 131b04fb5b7dc055e094fc743f5aa2078be57294 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 12 Aug 2019 17:33:57 -0500 Subject: [PATCH 15/17] symbol: review usage of `-` (1) --- static/docs/commands-reference/repro.md | 6 +-- static/docs/get-started/configure.md | 4 +- static/docs/get-started/example-versioning.md | 2 +- static/docs/get-started/install.md | 2 +- static/docs/tutorial/define-ml-pipeline.md | 11 ++--- static/docs/understanding-dvc/resources.md | 2 +- static/docs/user-guide/autocomplete.md | 49 +++++++++---------- 7 files changed, 37 insertions(+), 39 deletions(-) diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 5c5973c9af..089fc3cf54 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -98,6 +98,9 @@ specified), and updates stage files with the new checksum information. nondeterministic stages the outputs can vary on each execution, meaning the cache cannot be trusted for such stages. +- `--downstream` - only run again the stages after the given `targets` in their + corresponding pipelines, including the target stages themselves. + - `-h`, `--help` - prints the usage/help message, and exit. - `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if all @@ -107,9 +110,6 @@ specified), and updates stage files with the new checksum information. - `-v`, `--verbose` - displays detailed tracing information. -- `--downstream` - only run again the stages after the given `targets` in their - corresponding pipelines, including the target stages themselves. 
- ## Examples For simplicity, let's build a pipeline defined below (if you want get your hands diff --git a/static/docs/get-started/configure.md b/static/docs/get-started/configure.md index 8f4f52205e..414294bc81 100644 --- a/static/docs/get-started/configure.md +++ b/static/docs/get-started/configure.md @@ -35,12 +35,12 @@ $ git commit .dvc/config -m "initialize DVC local remote" Adding a remote should be specified by both its type prefix (protocol) and its path. DVC currently supports seven types of remotes: -- `local` - Local directory +- `local` - local directory - `s3` - Amazon Simple Storage Service - `gs` - Google Cloud Storage - `azure` - Azure Blob Storage - `ssh` - Secure Shell -- `hdfs` - The Hadoop Distributed File System +- `hdfs` - Hadoop Distributed File System - `http` - HTTP and HTTPS protocols > If you installed DVC via `pip`, depending on the remote type you plan to use diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 0cdf892604..17958d38e2 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -143,7 +143,7 @@ to the cache. Next, we run the training with `python train.py`. We picked this example and datasets to be small enough to be run on your machine in a reasonable amount of time (a few minutes to train a model). This command produces a bunch of files, -among them `model.h5` and `metrics.json` - weights of the trained model and +among them `model.h5` and `metrics.json`, weights of the trained model and metrics history. 
The simplest way to capture the current version of the model is to use `dvc add` again: diff --git a/static/docs/get-started/install.md b/static/docs/get-started/install.md index 7a7c29eeba..070ec02a35 100644 --- a/static/docs/get-started/install.md +++ b/static/docs/get-started/install.md @@ -14,7 +14,7 @@ $ pip install dvc > [remote](/doc/commands-reference/remote) type you plan to use you might need > to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`, `[azure]`, and > `[oss]`; or `[all]` to include them all. The command should look like this: -> `pip install "dvc[s3]"` - it installs `boto3` library along with DVC to +> `pip install "dvc[s3]"`. This installs `boto3` library along with DVC to > support AWS S3 storage. The easiest option, self-contained binary packages (or Windows installer), are diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 5c2f319e9f..d86feb7ed2 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -141,9 +141,8 @@ written to by the command, if any. directory. The dependency can be a regular file from a repository or a data file. -2. `-o file.tsv` (lower case o) specifies output data file which means DVC will - transform this file into a data file (think — it will run - `dvc add file.tsv`). +2. `-o file.tsv` (lower case o) specifies output data file, which means DVC will + transform this file into a data file (as if running `dvc add file.tsv`). 3. `-O file.tsv` (upper case O) specifies a regular output file (not to be added to DVC). @@ -213,9 +212,9 @@ outs: Sections of the file above include: -- `cmd` — the command to run -- `deps` — dependencies with md5 checksums -- `outs` — outputs with md5 checksums +- `cmd` - the command to run +- `deps` - dependencies with md5 checksums +- `outs` - outputs with md5 checksums And (as with the `dvc add` command) the `data/.gitignore` file was modified. 
Now it includes the unarchived command output file `Posts.xml`. diff --git a/static/docs/understanding-dvc/resources.md b/static/docs/understanding-dvc/resources.md index 80f21ae405..08b69eae6b 100644 --- a/static/docs/understanding-dvc/resources.md +++ b/static/docs/understanding-dvc/resources.md @@ -23,7 +23,7 @@ picture-in-picture" allowfullscreen> - Podcast featured by `Podcast.__init__`: - [Version control For your Machine Learning Projects](https://www.pythonpodcast.com/data-version-control-episode-206/) - + [Version control For your Machine Learning Projects](https://www.pythonpodcast.com/data-version-control-episode-206/) ## Articles diff --git a/static/docs/user-guide/autocomplete.md b/static/docs/user-guide/autocomplete.md index 076ba23f0b..0c10f5bf91 100644 --- a/static/docs/user-guide/autocomplete.md +++ b/static/docs/user-guide/autocomplete.md @@ -77,34 +77,33 @@ completion. ### Click to expand if it doesn't work on Debian/Ubuntu -As mentioned above, it should work out of the box. But if it doesn't, try these -steps: - -- Make sure that the package `bash-completion` is installed: - - ```dvc - $ sudo apt install --reinstall bash-completion - ``` - -- Make sure that it's enabled. Edit `~/.bashrc` and make sure that these lines - are there: - - ```bash - # enable bash completion in interactive shells - if ! shopt -oq posix; then - if [ -f /usr/share/bash-completion/bash_completion ]; then - . /usr/share/bash-completion/bash_completion - elif [ -f /etc/bash_completion ]; then - . /etc/bash_completion - fi +As mentioned above, it should work out of the box. But if it doesn't, try this: + +Make sure that the package `bash-completion` is installed: + +```dvc +$ sudo apt install --reinstall bash-completion +``` + +Make sure that it's enabled. Edit `~/.bashrc` and make sure that these lines are +there: + +```bash +# enable bash completion in interactive shells +if ! shopt -oq posix; then + if [ -f /usr/share/bash-completion/bash_completion ]; then + . 
/usr/share/bash-completion/bash_completion + elif [ -f /etc/bash_completion ]; then + . /etc/bash_completion fi - ``` +fi +``` -- Exit from the shell and open a new one, or just reload `~/.bashrc`: +Exit from the shell and open a new one, or just reload `~/.bashrc`: - ```dvc - $ source ~/.bashrc - ``` +```dvc +$ source ~/.bashrc +``` For more details see: https://linuxhandbook.com/enable-tab-completion/ From 46db91988d26cf5caf5eb5611be928b38c335f4a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 12 Aug 2019 18:52:01 -0500 Subject: [PATCH 16/17] symbol: review usage of ` - ` (2) --- static/docs/changelog/0.18.md | 8 ++++---- static/docs/changelog/0.35.md | 7 +++++-- static/docs/commands-reference/commit.md | 2 +- static/docs/commands-reference/import-url.md | 2 +- static/docs/commands-reference/import.md | 2 +- .../docs/commands-reference/metrics/index.md | 2 +- .../docs/commands-reference/metrics/modify.md | 6 +++--- static/docs/commands-reference/remote/add.md | 2 +- .../docs/commands-reference/remote/index.md | 6 +++--- static/docs/commands-reference/remote/list.md | 7 +++++++ .../docs/commands-reference/remote/modify.md | 4 ++-- static/docs/commands-reference/repro.md | 6 +++--- static/docs/commands-reference/update.md | 2 +- static/docs/get-started/configure.md | 16 +++++++-------- static/docs/get-started/example-pipeline.md | 4 ++-- static/docs/get-started/example-versioning.md | 15 +++++++------- static/docs/get-started/experiments.md | 4 ++-- static/docs/get-started/index.md | 6 +++--- static/docs/get-started/metrics.md | 2 +- static/docs/get-started/reproduce.md | 8 ++++---- static/docs/tutorial/define-ml-pipeline.md | 6 +++--- .../data-and-model-files-versioning.md | 4 ++-- static/docs/use-cases/index.md | 2 +- static/docs/user-guide/contributing.md | 8 ++++---- .../user-guide/dvc-files-and-directories.md | 20 +++++++++---------- .../user-guide/large-dataset-optimization.md | 16 +++++++-------- static/docs/user-guide/update-tracked-file.md | 2 +- 
27 files changed, 89 insertions(+), 80 deletions(-) diff --git a/static/docs/changelog/0.18.md b/static/docs/changelog/0.18.md index 230e3bcf09..48072c3dd5 100644 --- a/static/docs/changelog/0.18.md +++ b/static/docs/changelog/0.18.md @@ -4,7 +4,7 @@ We have been working hard last few weeks improving **user experience**, **performance**, and **documentation**. Kudos to `@sotte` and `@Hong-Xiang` for the great feedback they gave us! -We are very close to the 1.0 release! Two major changes are coming in DVC 1.0 - +We are very close to the 1.0 release! Two major changes are coming in DVC 1.0: [commit semantics](https://github.com/iterative/dvc/issues/919#issuecomment-414540094) and [execution matrix](https://github.com/iterative/dvc/issues/973#issuecomment-412739728). @@ -14,10 +14,10 @@ discuss and let us know about your thoughts! Since the last announcement we have released versions 0.12 through 0.18 and are really excited to share the progress with you: -- ⚡ **DVC just got faster**: +- ⚡ **DVC just got faster** - - Data files management commands - `dvc add`, `dvc push`, `dvc pull`, etc got - up to 10x faster on data sets with large number of files. + - Data files management commands like `dvc add`, `dvc push`, `dvc pull`, etc. + got up to 10x faster on data sets with large number of files. - Commands startup latency reduced 3x diff --git a/static/docs/changelog/0.35.md b/static/docs/changelog/0.35.md index 35493b658a..1db77ca708 100644 --- a/static/docs/changelog/0.35.md +++ b/static/docs/changelog/0.35.md @@ -1,8 +1,8 @@ # v0.19 - v0.35 We've launched the -[DVC Patreon campaign](https://www.patreon.com/DVCorg/overview) - it's one of -the ways to support the project if you like it. +[DVC Patreon campaign](https://www.patreon.com/DVCorg/overview). It's one of the +ways to support the project if you like it. 
Now, let’s **highlight the changes** (not including bug fixes, and minor improvements) we have done in the last few months: @@ -43,12 +43,15 @@ improvements) we have done in the last few months: flag to give a way to **avoid uncontrolled cache growth** and as a way to save some `dvc repro` runs. In the future we plan to have “do-not-cache-my-data” as a default mode for `dvc run`, `dvc add` and `dvc repro`. + - **SSH remotes (data storage) support** - config options to set port, key files, timeouts, password, etc + improved stability and Windows support! Introduced **HTTP remotes** - external dependencies and as a read-only cache. + - **Control over where DVC-files are located in your project** - place them wherever you want with the `-f` option supported by all relevant commands - `dvc add`, `dvc run`, and `dvc import`. + - 🙂A lot of **UI improvements** . Starting from the finally fixed nasty issue with Windows terminal printing a lot of garbage symbols, to using progress bars for checkouts, better metrics output, and lots of smaller things: diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index 1dc80aa991..56b9bfa773 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -43,7 +43,7 @@ time tying stages or a pipeline. The last two use cases are **not recommended**, and essentially force update the DVC-files and save data to cache. They are still useful, but keep in mind that -DVC can't guarantee reproducibility in those cases - you commit any data your +DVC can't guarantee reproducibility in those cases – You commit any data you want. 
Let's take a look at what is happening in the fist scenario closely: Normally DVC commands like `dvc add`, `dvc repro` or `dvc run`, commit the data diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index 6080b3e643..3f40d8e55d 100644 --- a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -95,7 +95,7 @@ Both methods generate an equivalent [stage file](/doc/commands-reference/run) user from having to manually copy files from each of the remote storage schemes, and from having to install CLI tools for each service. -Note that import stages are considered always "locked" - meaning that if you run +Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. Use `dvc update` on them to update the downloaded file or directory from the external data source. diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index aa2214e9e6..e5d3395fe3 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -51,7 +51,7 @@ determine whether the local copy is out of date. To actually [track the data](https://dvc.org/doc/get-started/add-files), `git add` (and `git commit`) the import stage (DVC-file). -Note that import stages are considered always "locked" - meaning that if you run +Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. Use `dvc update` on them to update the downloaded data artifact from the external DVC repo. 
diff --git a/static/docs/commands-reference/metrics/index.md b/static/docs/commands-reference/metrics/index.md index 062a9a6634..6089498212 100644 --- a/static/docs/commands-reference/metrics/index.md +++ b/static/docs/commands-reference/metrics/index.md @@ -22,7 +22,7 @@ positional arguments: ## Description DVC has the ability to tag a specified output file as a file that contains -metrics to track. Metrics are usually any project specific numbers - `AUC`, +metrics to track. Metrics are usually any project specific numbers e.g. `AUC`, `ROC`, etc. DVC itself does not ascribe any specific meaning for these numbers. Usually these numbers are produced by the model evaluation script and serve as a way to compare and pick the best performing experiment variant. diff --git a/static/docs/commands-reference/metrics/modify.md b/static/docs/commands-reference/metrics/modify.md index 4b13f1316e..08b7c5e780 100644 --- a/static/docs/commands-reference/metrics/modify.md +++ b/static/docs/commands-reference/metrics/modify.md @@ -92,9 +92,9 @@ $ dvc metrics show metrics.csv metrics.csv: auc, 0.9567 ``` -Okay. Let's now, imagine we are interested only in numbers - second column of -the CSV file. We can specify the type `CSV` and a path to extract the second -column: +Okay. Let's now, imagine we are interested only in the numeric values – second +column of the CSV file. We can specify the `CSV` type (`-t`) and an `xpath` +(`-x`) to extract the second column: ```dvc $ dvc metrics modify -t csv -x '0,1' metrics.csv diff --git a/static/docs/commands-reference/remote/add.md b/static/docs/commands-reference/remote/add.md index 86b41a81bf..59d54a3d1e 100644 --- a/static/docs/commands-reference/remote/add.md +++ b/static/docs/commands-reference/remote/add.md @@ -34,7 +34,7 @@ though and will rely on default access settings. 
> If you installed DVC via `pip`, depending on the remote type you plan to use > you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`, > `[azure]`, and `[oss]`; or `[all]` to include them all. The command should -> look like this: `pip install "dvc[s3]"` - it installs `boto3` library along +> look like this: `pip install "dvc[s3]"`. This installs `boto3` library along > with DVC to support AWS S3 storage. This command creates a section in the DVC diff --git a/static/docs/commands-reference/remote/index.md b/static/docs/commands-reference/remote/index.md index a591c79173..602893962d 100644 --- a/static/docs/commands-reference/remote/index.md +++ b/static/docs/commands-reference/remote/index.md @@ -28,15 +28,15 @@ What is data remote? The same way as Github provides storage hosting for Git repositories, DVC data remotes provide a central place to keep and share data and model files. With a remote data storage, you can pull models and data files which were created by -your team members without spending time and resources to re-build models and -re-process data files. It also saves space on your local environment - DVC can +your team members without spending time and resources to rebuild models and +re-process data files. It also saves space on your local environment – DVC can [fetch](/doc/commands-reference/fetch) into the local cache only the data you need for a specific branch/commit. > If you installed DVC via `pip`, depending on the remote type you plan to use > you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`, > `[azure]`, and `[oss]`; or `[all]` to include them all. The command should -> look like this: `pip install "dvc[s3]"` - it installs `boto3` library along +> look like this: `pip install "dvc[s3]"`. This installs `boto3` library along > with DVC to support AWS S3 storage. Using DVC with a remote data storage is optional. 
By default, DVC is configured diff --git a/static/docs/commands-reference/remote/list.md b/static/docs/commands-reference/remote/list.md index c4a4d6f634..6d4920dc06 100644 --- a/static/docs/commands-reference/remote/list.md +++ b/static/docs/commands-reference/remote/list.md @@ -29,6 +29,13 @@ Including names and URLs. - `--local` - read a local [config file](/doc/commands-reference/config) instead of `.dvc/config`. It is located in `.dvc/config.local` and is Git-ignored. +- `-h`, `--help` - prints the usage/help message, and exit. + +- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no + problems arise, otherwise 1. + +- `-v`, `--verbose` - displays detailed tracing information. + ## Examples Let's for simplicity add a default local remote: diff --git a/static/docs/commands-reference/remote/modify.md b/static/docs/commands-reference/remote/modify.md index 10d882af15..04859b65c0 100644 --- a/static/docs/commands-reference/remote/modify.md +++ b/static/docs/commands-reference/remote/modify.md @@ -27,8 +27,8 @@ positional arguments: ## Description Remote `name` and `option` name are required. Option names are remote type -specific. See below examples and a list of per remote type - AWS S3, Google -cloud, Azure, SSH, ALiyun OSS, and others. +specific. See below examples and a list of per remote type: AWS S3, Google +Cloud, Azure, SSH, ALiyun OSS, and others. This command modifies a `remote` section in the DVC project's [config file](/doc/commands-reference/config). Alternatively, `dvc config` or diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 089fc3cf54..1f47c3cc4b 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -88,9 +88,9 @@ specified), and updates stage files with the new checksum information. reproduce `A` first and then `B` even if `B` was previously executed with the same inputs from `A` (cached). 
It might be useful when we have a common dependency among all stages and want to specify it once (for the stage `A` - here). For example, if we know that all stages - `A` and below - depend on - `requirements.txt`, we can specify it only once in `A` and omit in `B` and - `C`. To be precise - it reproduces all descendants of a changed stage, or the + here). For example, if we know that all stages (`A` and below) depend on + `requirements.txt`, we can specify it only once in `A`, omitting it in `B` and + `C`. To be precise , it reproduces all descendants of a changed stage or the stages following the changed stage, even if their direct dependencies did not change. Like with the same option on `dvc run`, this is a way to force stages without changes to run again. This can also be useful for pipelines containing diff --git a/static/docs/commands-reference/update.md b/static/docs/commands-reference/update.md index 9e37f37989..b1961921e7 100644 --- a/static/docs/commands-reference/update.md +++ b/static/docs/commands-reference/update.md @@ -18,7 +18,7 @@ After creating import stages `dvc import-url`, the external data source can change. Use `dvc update` to bring these imported file, directory, or data artifact up to date. -Note that import stages are considered always "locked" - meaning that if you run +Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. `dvc update` is the only command that can update them. diff --git a/static/docs/get-started/configure.md b/static/docs/get-started/configure.md index 414294bc81..95af1a68d4 100644 --- a/static/docs/get-started/configure.md +++ b/static/docs/get-started/configure.md @@ -35,18 +35,18 @@ $ git commit .dvc/config -m "initialize DVC local remote" Adding a remote should be specified by both its type prefix (protocol) and its path. 
DVC currently supports seven types of remotes: -- `local` - local directory -- `s3` - Amazon Simple Storage Service -- `gs` - Google Cloud Storage -- `azure` - Azure Blob Storage -- `ssh` - Secure Shell -- `hdfs` - Hadoop Distributed File System -- `http` - HTTP and HTTPS protocols +- `local`: local directory +- `s3`: Amazon Simple Storage Service +- `gs`: Google Cloud Storage +- `azure`: Azure Blob Storage +- `ssh`: Secure Shell +- `hdfs`: Hadoop Distributed File System +- `http`: HTTP and HTTPS protocols > If you installed DVC via `pip`, depending on the remote type you plan to use > you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`, > `[azure]`, and `[oss]`; or `[all]` to include them all. The command should -> look like this: `pip install "dvc[s3]"` - it installs `boto3` library along +> look like this: `pip install "dvc[s3]"`. This installs `boto3` library along > with DVC to support AWS S3 storage. For example, to setup an S3 remote we would use something like (make sure that diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index 58cf9511e6..5f091f21f2 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -166,8 +166,8 @@ outs: path: data/Posts.xml ``` -This file is using the same technique - pointers (md5 hashes) to the -cache to describe and version control dependencies and outputs. +This file is using the same technique (checksums that point to to the +cache) to describe and version control dependencies and outputs. Output `Posts.xml` file is automatically added to the `.gitignore` file and a link is created into a cache `.dvc/cache/a3/04afb96060aad90176268345e10355` to save it. 
diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 17958d38e2..bbd3296d20 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -10,9 +10,8 @@ models and datasets, let's play with a [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) that [François Chollet](https://twitter.com/fchollet) put together to show how to build a powerful image classifier, using only a small dataset. The goal of -this example is to give you some hands-on experience with a very basic -scenario - working with multiple versions of datasets and ML models using DVC -commands. +this example is to give you some hands-on experience with a very basic scenario +– working with multiple versions of datasets and ML models using DVC commands. ![](/static/img/cats-and-dogs.jpg) @@ -24,7 +23,7 @@ different versions. The specific algorithm that is used to train and validate the classifier is not important. No prior knowledge is required about Keras. We reuse the [script](https://gist.github.com/fchollet/f35fbc80e066a49d65f1688a7e99f069) (it -goes along the blog post) in a "black box" way - it takes some data and produces +goes along the blog post) in a "black box" way – it takes some data and produces a model file. We would highly recommend reading the [post](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) itself since it's a great demonstration on how a general pre-trained model can @@ -98,9 +97,9 @@ $ unzip data.zip $ rm -f data.zip ``` -This command downloads and extracts our raw dataset - 1000 labeled images for -training and 800 labeled images for validation. In summary, it's a 43 MB -dataset, with a directory structure like this: +This command downloads and extracts our raw dataset, consisting of 1000 labeled +images for training and 800 labeled images for validation. 
In summary, it's a 43 +MB dataset, with a directory structure like this: ```sh data @@ -347,7 +346,7 @@ DVC pipelines. See this [example](/doc/get-started/example-pipeline) to get a hands-on experience with them and try to apply it here. Don't hesitate to join our [community](/chat) to ask any questions! -Another thing, you should have noticed, is the metrics file - `metrics.json` and +Another thing, you should have noticed, is the metrics file (`metrics.json`) and the way we captured it with `-M metrics.json` option. Metric file is a special type of output DVC provides an interface on top to compare across tags or branches. See `dvc metrics` command and diff --git a/static/docs/get-started/experiments.md b/static/docs/get-started/experiments.md index b4068e61e6..14c0c628d1 100644 --- a/static/docs/get-started/experiments.md +++ b/static/docs/get-started/experiments.md @@ -1,7 +1,7 @@ # Experiments -Data science process is inherently iterative and R&D like - data scientist may -try many different approaches, different hyper-parameter values and "fail" many +Data science process is inherently iterative and R&D like. Data scientist may +try many different approaches, different hyper-parameter values, and "fail" many times before the required level of a metric is achieved. DVC is built to provide a way to capture different experiments and navigate diff --git a/static/docs/get-started/index.md b/static/docs/get-started/index.md index 153b666608..1533a7878d 100644 --- a/static/docs/get-started/index.md +++ b/static/docs/get-started/index.md @@ -4,9 +4,9 @@ Get started is a step by step introduction into basic DVC concepts. It doesn't go into details much, but provides links and expandable sections to learn more. 
At the very end there are a few complete step-by-step examples to give you more -hands-on experience with real-life scenarios - first is about model and dataset -[versioning](/doc/get-started/example-versioning), and the second one is focused -on [pipelines and reproducibility](/doc/get-started/example-pipeline). +hands-on experience with real life scenarios. The first on is about model and +dataset [versioning](/doc/get-started/example-versioning), and the second one is +focused on [pipelines and reproducibility](/doc/get-started/example-pipeline). ✅ Please, join our [community](/chat) or see these [support](/support) options if you have any questions or need any help. We are very responsive ⚡. diff --git a/static/docs/get-started/metrics.md b/static/docs/get-started/metrics.md index 2ac439ed11..81458b25a4 100644 --- a/static/docs/get-started/metrics.md +++ b/static/docs/get-started/metrics.md @@ -15,7 +15,7 @@ $ dvc run -f evaluate.dvc \ ``` `evaluate.py` calculates AUC value using the test dataset. It reads features -from the `features/test.pkl` file and produces a DVC metric file - `auc.metric`. +from the `features/test.pkl` file and produces a DVC metric file (`auc.metric`). It is a special DVC output file type, in this case it's just a plain text file with a single number inside. diff --git a/static/docs/get-started/reproduce.md b/static/docs/get-started/reproduce.md index ad12233b73..423d052983 100644 --- a/static/docs/get-started/reproduce.md +++ b/static/docs/get-started/reproduce.md @@ -19,10 +19,10 @@ $ dvc repro train.dvc `train.dvc` file internally describes what data files and code we should take and how to run the command to get the binary model file. For each data file it -depends on, we can, in turn, do the same analysis - find a corresponding -DVC-file that includes the data file in its outputs, get dependencies and -commands, and so on. It means that DVC can recursively build a complete tree of -commands it needs to execute to get the model file. 
+depends on, we can in turn do the same analysis – find a corresponding DVC-file +that includes the data file in its outputs, get dependencies and commands, and +so on. It means that DVC can recursively build a complete tree of commands it +needs to execute to get the model file. `dvc repro` is, essentially, building this execution graph, detects stages with modified dependencies or missing outputs and recursively executes this graph diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index d86feb7ed2..d91e5a6176 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -212,9 +212,9 @@ outs: Sections of the file above include: -- `cmd` - the command to run -- `deps` - dependencies with md5 checksums -- `outs` - outputs with md5 checksums +- `cmd`: The command to run +- `deps`: Dependencies with MD5 checksums +- `outs`: Outputs with MD5 checksums And (as with the `dvc add` command) the `data/.gitignore` file was modified. Now it includes the unarchived command output file `Posts.xml`. diff --git a/static/docs/use-cases/data-and-model-files-versioning.md b/static/docs/use-cases/data-and-model-files-versioning.md index e6e4c3d29a..dcfb51e568 100644 --- a/static/docs/use-cases/data-and-model-files-versioning.md +++ b/static/docs/use-cases/data-and-model-files-versioning.md @@ -24,8 +24,8 @@ DVC doesn't require installing a server; it can be used on-premises (NAS, SSH, for example) or with any major cloud provider (S3, Google Cloud, Azure). Let's say you already have a project that uses a bunch of images -stored in `images/` directory and has a `model.pkl` file - your model file that -is deployed to production. +stored in `images/` directory and has a `model.pkl` file – model file deployed +to production. 
```dvc $ ls images diff --git a/static/docs/use-cases/index.md b/static/docs/use-cases/index.md index 68b895ea11..e4154e4d5f 100644 --- a/static/docs/use-cases/index.md +++ b/static/docs/use-cases/index.md @@ -1,6 +1,6 @@ # Use Cases -Here we provide an overview for some DVC use cases - from very basic +Here we provide an overview for some DVC use cases, from very basic ([data and model files management](/doc/use-cases/data-and-model-files-versioning)) to more advanced (optimizing resources on a [single development machine](/doc/use-cases/multiple-data-scientists-on-a-single-machine)). diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index eeb6117703..70a0f6bf52 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -290,12 +290,12 @@ Fixes #(Github issue id). Message types: -- *component* - name of a component that this patch is affecting. Use `dvc` in a +- _component_: Name of a component that this patch is affecting. Use `dvc` in a general case -- _short description_ - short description of the patch -- _long description_ - if needed, longer message describing the patch in more +- _short description_: Short description of the patch +- _long description_: If needed, longer message describing the patch in more details -- _github issue id_ - id of the GitHub issue that this patch is addressing +- _github issue id_: ID of the GitHub issue that this patch is addressing Example: diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 6f7ed1bf55..7fcb142dd8 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -5,16 +5,16 @@ directory (`.dvc/`) with special internal files and directories: ### Special DVC internal files and directories -- `.dvc/config` - this is a configuration file. 
The config file can be edited by +- `.dvc/config`: This is a configuration file. The config file can be edited by hand or with a special command: `dvc config`. -- `.dvc/config.local` - this is a local configuration file, that will overwrite +- `.dvc/config.local`: This is a local configuration file, that will overwrite options in `.dvc/config`. This is useful when you need to specify private options in your config that you don't want to track and share through Git (credentials, private locations, etc). The local config file can be edited by hand or with a special command: `dvc config --local`. -- `.dvc/cache` - the [cache directory](#structure-of-cache-directory) will +- `.dvc/cache`: The [cache directory](#structure-of-cache-directory) will contain your data files. (The data directories of DVC repositories will only contain links to the data files in the cache, refer to [Large Dataset Optimization](/docs/user-guide/large-dataset-optimization).) @@ -25,22 +25,22 @@ directory (`.dvc/`) with special internal files and directories: > the Git repository, only [DVC-files](/doc/user-guide/dvc-file-format) that > are needed to reproduce them. -- `.dvc/state` - this file is used for optimization. It is a SQLite db, that +- `.dvc/state`: This file is used for optimization. It is a SQLite db, that contains checksums for files tracked in a DVC project, with respective timestamps and inodes to avoid unnecessary checksum computations. It also contains a list of links (from cache to workspace) created by DVC and is used to cleanup your workspace when calling `dvc checkout`. -- `.dvc/state-journal` - temporary file for SQLite operations +- `.dvc/state-journal`: Temporary file for SQLite operations -- `.dvc/state-wal` - another SQLite temporary file +- `.dvc/state-wal`: Another SQLite temporary file -- `.dvc/updater` - this file is used store latest available version of dvc, - which is used to remind user to upgrade. 
+- `.dvc/updater`: This file is used store latest available version of dvc, which + is used to remind user to upgrade. -- `.dvc/updater.lock` - lock file for `.dvc/updater` +- `.dvc/updater.lock`: Lock file for `.dvc/updater` -- `.dvc/lock` - lock file for the whole DVC project +- `.dvc/lock`: Lock file for the whole DVC project ## Structure of cache directory diff --git a/static/docs/user-guide/large-dataset-optimization.md b/static/docs/user-guide/large-dataset-optimization.md index f31dd156ea..70a7dd42ad 100644 --- a/static/docs/user-guide/large-dataset-optimization.md +++ b/static/docs/user-guide/large-dataset-optimization.md @@ -62,17 +62,17 @@ File link type benefits summary: Each file linking option is further detailed below, in function of their efficiency: -1. **`reflink`** - copy-on-write\* links or "reflinks" are the best possible - link type, when available. They're is as efficient as hard/symlinks, but - don't carry a risk of cache corruption since the file system takes care of - copying the file if you try to edit it in place, thus keeping the linked - cache file intact. +1. **`reflink`**: Copy-on-write\* links or "reflinks" are the best possible link + type, when available. They're is as efficient as hard/symlinks, but don't + carry a risk of cache corruption since the file system takes care of copying + the file if you try to edit it in place, thus keeping the linked cache file + intact. > Unfortunately reflinks are currently supported on a limited number of file > systems only (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but in the future > they will be supported by the majority of file systems in use. -2. **`hardlink`** - hard links are the most efficient way to link your data to +2. **`hardlink`**: Hard links are the most efficient way to link your data to cache if both your repo and your cache directory are located on the same partition or storage device. 
@@ -80,7 +80,7 @@ efficiency: > instead deleted and then replaced with a new file, otherwise it might cause > cache corruption – and automatic deletion of cached files by DVC. -3. **`symlink`** - symbolic (a.k.a. "soft") links are the most efficient way to +3. **`symlink`**: Symbolic (a.k.a. "soft") links are the most efficient way to link your data to cache if your repo and your cache directory are located on different file systems/drives (i.e. repo is located on SSD for performance, but cache dir is located on HDD for bigger storage). @@ -89,7 +89,7 @@ efficiency: > instead deleted and then replaced with a new file, otherwise it might cause > cache corruption – and automatic deletion of cached files by DVC. -4. **`copy`** - an inefficient "linking" strategy, yet supported on all file +4. **`copy`**: An inefficient "linking" strategy, yet supported on all file systems. Using `copy` means there will be no file links, but that the tracked files will be duplicated as copies existing in both the cache and workspace. Suitable for scenarios with relatively small data files, where copying them diff --git a/static/docs/user-guide/update-tracked-file.md b/static/docs/user-guide/update-tracked-file.md index 61e21b8e4b..09b63a9de6 100644 --- a/static/docs/user-guide/update-tracked-file.md +++ b/static/docs/user-guide/update-tracked-file.md @@ -20,7 +20,7 @@ manually, DVC removes them for you before running the stage which generates them. If you use DVC to track a file that is generated during your pipeline (e.g. some -intermediate result or a final model file - `model.pkl`) and you don't use +intermediate result or a final model file i.e. `model.pkl`) and you don't use `dvc run` and `dvc repro` to manage your pipeline, use the procedure below (run `dvc unprotect` or `dvc remove`) to unlink it from DVC cache prior to the execution of the script that modifies it. 
From be94200f162a9acd4abb43487a347ee8b1e5f148 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 12 Aug 2019 19:02:50 -0500 Subject: [PATCH 17/17] symbol: review usage of "re-" (3) --- static/docs/commands-reference/import-url.md | 4 ++-- static/docs/commands-reference/remote/index.md | 4 ++-- static/docs/get-started/example-pipeline.md | 2 +- static/docs/understanding-dvc/collaboration-issues.md | 4 ++-- static/docs/understanding-dvc/related-technologies.md | 2 +- 5 files changed, 8 insertions(+), 8 deletions(-) diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index 3f40d8e55d..b1e2db7562 100644 --- a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -196,8 +196,8 @@ trying this example (especially if trying out the following one). What if that remote file is one which will be updated regularly? The project goals might include regenerating a data artifact based on the -updated data source. A [pipeline](/doc/commands-reference/pipeline) can be -triggered to re-execute based on a changed external dependency. +updated data source. [Pipeline](/doc/commands-reference/pipeline) reproduction +can be triggered based on a changed external dependency. Let's use the [Get Started](/doc/get-started) project again, simulating an updated external data source. (Remember to prepare the workspace, diff --git a/static/docs/commands-reference/remote/index.md b/static/docs/commands-reference/remote/index.md index 602893962d..86573a15b3 100644 --- a/static/docs/commands-reference/remote/index.md +++ b/static/docs/commands-reference/remote/index.md @@ -28,8 +28,8 @@ What is data remote? The same way as Github provides storage hosting for Git repositories, DVC data remotes provide a central place to keep and share data and model files. 
With a remote data storage, you can pull models and data files which were created by -your team members without spending time and resources to rebuild models and -re-process data files. It also saves space on your local environment – DVC can +your team members without spending time and resources to build or process them +locally. It also saves space on your local environment – DVC can [fetch](/doc/commands-reference/fetch) into the local cache only the data you need for a specific branch/commit. diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index 5f091f21f2..9515c9e726 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -360,7 +360,7 @@ master: By wrapping your commands with `dvc run` it's easy to integrate DVC into your existing ML development pipeline/processes without any significant effort to -re-implement your code/application. +rewrite your code. The key step to notice is that DVC automatically derives the dependencies between the experiment stages and builds the dependency graph (DAG) diff --git a/static/docs/understanding-dvc/collaboration-issues.md b/static/docs/understanding-dvc/collaboration-issues.md index 1cc3a2c221..cd316ca06a 100644 --- a/static/docs/understanding-dvc/collaboration-issues.md +++ b/static/docs/understanding-dvc/collaboration-issues.md @@ -27,14 +27,14 @@ principled way: 3. Navigating through experiments. - How do you recover a model from last week without wasting time waiting for the - model to re-train? + model to retrain? - How do you quickly switch between the large data source and a small data subset without modifying source code? 4. Reproducibility. -- How do you run a model's evaluation again without re-training the model and +- How do you run a model's evaluation again without retraining the model and preprocessing a raw dataset? 5. Managing and sharing large data files. 
diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 2a7c1caafe..3e9035fa7d 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -67,7 +67,7 @@ process. - File tracking: - DVC tracks files based on checksum (md5) instead of file timestamps. This - helps avoid running into heavy processes like model re-training when you + helps avoid running into heavy processes like model retraining when you checkout a previous, trained version of a modeling code (Makefile will retrain the model).