From 7580d2b65c7bd6234d57ab0ccc538565730b1fa6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 2 Oct 2019 17:04:34 -0400 Subject: [PATCH] term: review occurrences of "which" to minimize and correct its use. Based on https://www.quickanddirtytips.com/education/grammar/which-versus-that-0 and common sense. Also reworded related phrases and sentences. --- src/Documentation/glossary.js | 4 +- static/docs/command-reference/add.md | 10 ++--- static/docs/command-reference/commit.md | 16 ++++---- static/docs/command-reference/config.md | 12 +++--- static/docs/command-reference/diff.md | 12 +++--- static/docs/command-reference/gc.md | 2 +- static/docs/command-reference/get.md | 17 ++++---- static/docs/command-reference/import-url.md | 14 +++---- static/docs/command-reference/import.md | 4 +- static/docs/command-reference/index.md | 6 +-- static/docs/command-reference/install.md | 17 ++++---- static/docs/command-reference/metrics/add.md | 6 +-- static/docs/command-reference/pull.md | 2 +- static/docs/command-reference/push.md | 4 +- static/docs/command-reference/remote/add.md | 7 ++-- .../docs/command-reference/remote/default.md | 6 +-- static/docs/command-reference/remote/index.md | 10 ++--- .../docs/command-reference/remote/modify.md | 4 +- static/docs/command-reference/repro.md | 10 ++--- static/docs/command-reference/run.md | 8 ++-- static/docs/command-reference/status.md | 6 +-- static/docs/command-reference/version.md | 14 +++---- static/docs/get-started/add-files.md | 2 +- static/docs/get-started/agenda.md | 4 +- static/docs/get-started/example-pipeline.md | 6 +-- static/docs/tutorial/define-ml-pipeline.md | 11 ++--- static/docs/tutorial/preparation.md | 7 ++-- static/docs/tutorial/reproducibility.md | 4 +- static/docs/tutorial/sharing-data.md | 6 ++- .../understanding-dvc/related-technologies.md | 2 +- static/docs/use-cases/index.md | 13 +++--- static/docs/user-guide/contributing-docs.md | 8 ++-- static/docs/user-guide/contributing.md | 8 ++-- 
.../user-guide/dvc-files-and-directories.md | 40 +++++++++---------- static/docs/user-guide/index.md | 7 ++-- .../user-guide/large-dataset-optimization.md | 8 ++-- .../docs/user-guide/update-tracked-files.md | 2 +- 37 files changed, 161 insertions(+), 158 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index 53370c5c01..52758da74f 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -1,8 +1,8 @@ export default { name: 'Glossary', desc: - 'This guide is aimed to familiarize the users with definitions to ' + - 'relevant DVC concepts and terminologies which are frequently used.', + 'This guide is aimed to provide definitions for relevant DVC concepts ' + + 'and terminologies that are frequently used.', contents: [ { name: 'Workspace', diff --git a/static/docs/command-reference/add.md b/static/docs/command-reference/add.md index b8c78177e6..d004651d65 100644 --- a/static/docs/command-reference/add.md +++ b/static/docs/command-reference/add.md @@ -24,7 +24,7 @@ The `targets` are files or directories to be places under DVC control. These are turned into outputs (`outs` field) in a resulting [DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.) Note that target data outside the current workspace is supported, -which becomes [external outputs](/doc/user-guide/managing-external-data). +that becomes [external outputs](/doc/user-guide/managing-external-data). Under the hood, a few actions are taken for each file (or directory) in `targets`: @@ -126,8 +126,8 @@ To track the changes with git run: git add .gitignore data.xml.dvc ``` -As the output says, a DVC-file has been created for `data.xml`. Let us explore -the result: +As the output says, a [DVC-file](/doc/user-guide/dvc-file-format) has been +created for `data.xml`. Let us explore the result: ```dvc $ tree @@ -229,8 +229,8 @@ Saving 'pics/train/cats/cat.438.jpg' to cache '.dvc/cache'. 
In this case a DVC-file corresponding to each file is generated, and no top-level DVC-file is generated. But this is less convenient. -With the `dvc add pics` a single DVC-file is generated, `pics.dvc`, which lets -us treat the entire directory structure in one unit. It lets you pass the whole +With `dvc add pics`, a single `pics.dvc` DVC-file is generated, that lets us +treat the entire directory structure in one unit. It lets you pass the whole directory tree as a dependency to a `dvc run` stage definition, like this: ```dvc diff --git a/static/docs/command-reference/commit.md b/static/docs/command-reference/commit.md index 8ce3dc8595..15a745305a 100644 --- a/static/docs/command-reference/commit.md +++ b/static/docs/command-reference/commit.md @@ -38,9 +38,9 @@ time tying stages or a pipeline. - Sometimes we want to clean up a code or configuration file in a way that doesn't cause a change in its results. We might write in-line documentation with comments, change indentation, remove some debugging printouts, or any - other change which doesn't introduce a change in the output of pipeline - stages. `dvc commit` can help avoid having to reproduce a pipeline in these - cases by forcing the update of the DVC-files. + other change that doesn't produce different output of pipeline stages. + `dvc commit` can help avoid having to reproduce a pipeline in these cases by + forcing the update of the DVC-files. Let's take a look at what is happening in the fist scenario closely. 
Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the @@ -145,8 +145,8 @@ bag_of_words = CountVectorizer(stop_words='english', max_features=6000, ngram_range=(1, 2)) ``` -This option not only changes the trained model, it also introduces a change -which would cause the `featurize.dvc`, `train.dvc` and `evaluate.dvc` stages to +This option not only changes the trained model, it also introduces a change that +would cause the `featurize.dvc`, `train.dvc` and `evaluate.dvc` stages to execute if we ran `dvc repro`. But if we want to try several values for this option and save only the best result to the cache, we can execute as so: @@ -230,15 +230,15 @@ $ python src/train.py data/features model.pkl $ python src/evaluate.py model.pkl data/features auc.metric ``` -As before, `dvc status` will show which the files have changed, and when your -work is finalized `dvc commit` will commit everything to the cache. +As before, `dvc status` will show which files have changed, and when your work +is finalized `dvc commit` will commit everything to the cache. ## Example: Updating dependencies Sometimes we want to clean up a code or configuration file in a way that doesn't cause a change in its results. We might write in-line documentation with comments, change indentation, remove some debugging printouts, or any other -change which doesn't introduce a change in the output of pipeline stages. +change that doesn't produce different output of pipeline stages. ```dvc $ git status -s diff --git a/static/docs/command-reference/config.md b/static/docs/command-reference/config.md index 1e24628f0f..943cfef446 100644 --- a/static/docs/command-reference/config.md +++ b/static/docs/command-reference/config.md @@ -96,8 +96,8 @@ for more details.) - `cache.dir` - set/unset cache directory location. A correct value must be either an absolute path or a path **relative to the config file location**. 
- The default value is `cache`, which resolved relative to the default project - config location results in `.dvc/cache`. + The default value is `cache` that, resolved relative to the default project + config location, results in `.dvc/cache`. > See also helper command `dvc cache dir` to intuitively set this config > option, properly transforming paths relative to the current working @@ -171,10 +171,10 @@ for more details.) State config options. See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn -more about the state file that is used for optimization. +more about the state file (database) that is used for optimization. -- `state.row_limit` - maximum number of entries in the state database which - affects the physical size of the state file itself as well as the performance +- `state.row_limit` - maximum number of entries in the state database, which + affects the physical size of the state file itself, as well as the performance of certain DVC operations. The bigger the limit the more checksum history DVC can keep in order to avoid sequential checksum recalculations for the files. Default limit is set to 10 000 000 rows. @@ -226,7 +226,7 @@ Clear default remote value: $ dvc config --unset core.remote ``` -which is equivalent to: +The above command is equivalent to: ```dvc $ dvc config core.remote -u diff --git a/static/docs/command-reference/diff.md b/static/docs/command-reference/diff.md index 8ffeb5a2b8..ed138ebb2d 100644 --- a/static/docs/command-reference/diff.md +++ b/static/docs/command-reference/diff.md @@ -104,8 +104,8 @@ added file with size 37.9 MB We can base this example in the [Metrics](/doc/get-started/metrics) and [Compare Experiments](/doc/get-started/compare-experiments) chapters of our _Get -Started_ section, which describe different experiments to produce the -`model.pkl` file. 
Our example repository has the `bigrams-experiment` and +Started_ section, that describe different experiments to produce the `model.pkl` +file. Our example repository has the `bigrams-experiment` and `baseline-experiment` [tags](https://github.com/iterative/example-get-started/tags) respectively to reference these experiments. @@ -155,7 +155,7 @@ command dependencies or outputs). We can use `dvc diff` to check for changes in a directory by specifying the directory as the target (with option `-t`). Note that we skip the `b_ref` -argument this time, which defaults to `HEAD`. +argument this time, that defaults to `HEAD`. ```dvc $ dvc diff -t data/features baseline-experiment @@ -171,11 +171,11 @@ diff for 'data/features' ## Example: Confirming that a target has not changed -Let's use our example repo once again, which has several +Let's use our example repo once again, that has several [available tags](https://github.com/iterative/example-get-started/tags) for conveniency. The `5-preparation` tag corresponds to the -[Connect Code and Data](/doc/get-started/connect-code-and-data) section of our -_Get Started_ section, in which the `dvc run` command is used to create the +[Connect Code and Data](/doc/get-started/connect-code-and-data) chapter of our +_Get Started_ section, where the `dvc run` command is used to create a `prepare.dvc` stage file. This DVC-file tracks the `data/prepared` directory output. diff --git a/static/docs/command-reference/gc.md b/static/docs/command-reference/gc.md index 517680204e..9bb53feb43 100644 --- a/static/docs/command-reference/gc.md +++ b/static/docs/command-reference/gc.md @@ -15,7 +15,7 @@ This command deletes (garbage collects) data files or directories that may exist in the cache (or [remote storage](/doc/command-reference/remote)) but no longer referred to in [DVC-files](/doc/user-guide/dvc-file-format) currently [checked out](/doc/command-reference/checkout) in the project. 
By -default this command only cleans up the local cache, which is typically located +default this command only cleans up the local cache, that is typically located on the same machine as the project in question. This usually helps to free up disk space. diff --git a/static/docs/command-reference/get.md b/static/docs/command-reference/get.md index fca37de14b..0a95351864 100644 --- a/static/docs/command-reference/get.md +++ b/static/docs/command-reference/get.md @@ -66,10 +66,10 @@ created in the current working directory, with its original file name. > DVC is [installed](/doc/get-started/install). We can use `dvc get` to download the resulting model file from our -[get started example repo](https://github.com/iterative/example-get-started), -which is a DVC project external to the current working directory. -The desired output file would be located in the root of the -external project (if the +[get started example repo](https://github.com/iterative/example-get-started), a +DVC project external to the current working directory. The desired +output file would be located in the root of the external project +(if the [`train.dvc` stage](https://github.com/iterative/example-get-started/blob/master/train.dvc) was reproduced) and named `model.pkl`. @@ -83,7 +83,7 @@ Note that the `model.pkl` file doesn't actually exist in the [root directory](https://github.com/iterative/example-get-started/tree/master/) of the external Git repository. Instead, the corresponding DVC-file [train.dvc](https://github.com/iterative/example-get-started/blob/master/train.dvc) -is found, which specifies `model.pkl` in its outputs (`outs`). DVC then +is found, that specifies `model.pkl` in its outputs (`outs`). DVC then [pulls](/doc/command-reference/pull) the file from the default [remote](/doc/command-reference/remote) of the external DVC project (found in its @@ -140,9 +140,10 @@ The `model.monograms.pkl` file now contains the older version of the model. 
To get the most recent one, we use a similar command, but with `-o model.bigrams.pkl` and `--rev 9-bigrams-model` or even without `--rev` -(since it's the latest version anyway). In fact in this case using `dvc pull` -should suffice, downloading the file as just `model.pkl`, which we can then -rename to make it extra obvious: +(since it's the latest version anyway). In fact, in this case using `dvc pull` +with the corresponding [DVC-files](/doc/user-guide/dvc-file-format) should +suffice, downloading the file as just `model.pkl`. We can then rename it to make +its version explicit: ```dvc $ dvc pull train.dvc diff --git a/static/docs/command-reference/import-url.md b/static/docs/command-reference/import-url.md index c34b32c042..441afa3266 100644 --- a/static/docs/command-reference/import-url.md +++ b/static/docs/command-reference/import-url.md @@ -5,7 +5,7 @@ Download or copy a file or directory from any supported URL (for example workspace, and track changes in the remote data source with DVC. Creates a DVC-file. -> See also `dvc get-url` which corresponds to the first half of what this +> See also `dvc get-url`, that corresponds to the first half of what this > command does (downloading the data artifact). ## Synopsis @@ -37,8 +37,8 @@ for the imported data file or directory in the workspace. > See `dvc import` to download and tack data or model files or directories from > other DVC repositories (e.g. Github URLs). -DVC supports [DVC-files](/doc/user-guide/dvc-file-format) which refer to data in -an external location, see +DVC supports [DVC-files](/doc/user-guide/dvc-file-format) that refer to data in +external locations, see [External Dependencies](/doc/user-guide/external-dependencies). In such a DVC-file, the `deps` section stores the remote URL, and the `outs` section contains the corresponding local path in the workspace. It records metadata from @@ -196,10 +196,10 @@ trying this example (especially if trying out the following one). 
## Example: Detecting remote file changes -What if that remote file is one which will be updated regularly? The project -goals might include regenerating a data artifact based on the -updated data source. [Pipeline](/doc/command-reference/pipeline) reproduction -can be triggered based on a changed external dependency. +What if that remote file is updated regularly? The project goals might include +regenerating a data artifact based on the updated data source. +[Pipeline](/doc/command-reference/pipeline) reproduction can be triggered based +on a changed external dependency. Let's use the [Get Started](/doc/get-started) project again, simulating an updated external data source. (Remember to prepare the workspace, diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index 3f584ee7d5..cec6139173 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -5,7 +5,7 @@ repository (e.g. hosted on Github) into the workspace, and track changes in this [external dependency](/doc/user-guide/external-dependencies). Creates a DVC-file. -> See also `dvc get` which corresponds to the first step this command performs +> See also `dvc get`, that corresponds to the first step this command performs > (just download the data). ## Synopsis @@ -47,7 +47,7 @@ stage (DVC-file) is then created extending the full file or directory name of the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` to generate the same output. -DVC supports [DVC-files](/doc/user-guide/dvc-file-format) which refer to data in +DVC supports [DVC-files](/doc/user-guide/dvc-file-format) that refer to data in an external DVC repository (hosted on a Git server). In such a DVC-file, the `deps` section specifies the `repo` URL and data `path`, and the `outs` section contains the corresponding local path in the workspace. 
It records enough data diff --git a/static/docs/command-reference/index.md b/static/docs/command-reference/index.md index 91cf9093e8..a709578221 100644 --- a/static/docs/command-reference/index.md +++ b/static/docs/command-reference/index.md @@ -6,9 +6,9 @@ DVC is a command-line tool. The typical DVC workflow goes as follows: `dvc init`. - Copy source code files for modeling into the repository and track the files with DVC using the `dvc add` command. -- Process raw data with your own data processing and modeling code using the - `dvc run` command, using the `--outs` option to outputs which will also be - tracked by DVC after the code is executed. +- Process raw data with your own data processing and modeling code, using the + `dvc run` command, along with its `--outs` option for outputs that should also + be tracked by DVC after the code is executed. - Sharing a Git repository with the source code of your ML [pipeline](/doc/command-reference/pipeline) will not include the project's cache. Use [remote storage](/doc/command-reference/remote) and diff --git a/static/docs/command-reference/install.md b/static/docs/command-reference/install.md index c3909d0e2e..faa7aa9a28 100644 --- a/static/docs/command-reference/install.md +++ b/static/docs/command-reference/install.md @@ -29,8 +29,8 @@ The installed Git hook automates running `dvc checkout`. **Commit**: When committing a change to the Git repository, that change possibly requires reproducing the corresponding [pipeline](/doc/command-reference/pipeline) (using `dvc repro`) to regenerate -the project results. Or there might be new data not yet in cache, which requires -running `dvc commit` to update. +the project results. Or there might be new data files not yet in cache, which +requires running `dvc commit` to store them. The installed Git hook automates reminding the user to run either `dvc repro` or `dvc commit`, as needed. 
@@ -121,9 +121,10 @@ $ dvc pull --all-branches --all-tags Let's start our exploration with the impact of `dvc install` on the `dvc checkout` command. Remember that switching from one Git version to another -(with `git checkout`) changes the set of DVC-files in the project, which then -also changes the data files that should be placed in the workspace (with -`dvc checkout`). +(with `git checkout`) changes the set of +[DVC-files](/doc/user-guide/dvc-file-format) in the project. This changes the +set of data files that should be located in the workspace (which can be achieved +with `dvc checkout`). Let's first list the available tags in the _Get Started_ project: @@ -209,7 +210,7 @@ exec dvc checkout ``` The two Git hooks have been installed, and the one of interest for this exercise -is the `post-checkout` script which runs after `git checkout`. +is the `post-checkout` script that runs after `git checkout`. We can now repeat the command run earlier, to see the difference. @@ -252,8 +253,8 @@ featurize.dvc: 1 file changed, 1 insertion(+), 1 deletion(-) ``` -We see that `dvc status` output has appeared in the `git commit` interaction. -This new behavior corresponds to the Git hook which was installed, and it +We see that the output of `dvc status` has appeared in the `git commit` +interaction. This new behavior corresponds to the Git hook installed, and it helpfully informs us the workspace is out of sync. We should therefore run the `dvc repro` command. diff --git a/static/docs/command-reference/metrics/add.md b/static/docs/command-reference/metrics/add.md index 8404406086..c35aff8530 100644 --- a/static/docs/command-reference/metrics/add.md +++ b/static/docs/command-reference/metrics/add.md @@ -72,9 +72,9 @@ $ dvc run -o metrics.txt "echo 0.9643 > metrics.txt" ``` Even when we named this output file `metrics.txt`, DVC won't know that it's a -metric if we don't specify so. 
The content of stage file `metrics.txt.dvc` -(which is a [DVC-file](/doc/user-guide/dvc-file-format)) should look like this: -(Notice the `metric: false` field.) +metric if we don't specify so. The content of stage file `metrics.txt.dvc` (a +[DVC-file](/doc/user-guide/dvc-file-format)) should look like this: (Notice the +`metric: false` field.) ```yaml cmd: echo 0.9643 > metrics.txt diff --git a/static/docs/command-reference/pull.md b/static/docs/command-reference/pull.md index bb5f7cf761..c190163c27 100644 --- a/static/docs/command-reference/pull.md +++ b/static/docs/command-reference/pull.md @@ -41,7 +41,7 @@ only the files (or directories) missing from the workspace by searching all [DVC-files](/doc/user-guide/dvc-file-format) currently in the project. It will not download files associated with earlier versions or branches of the repository if using Git, nor will it download files -which have not changed. +that have not changed. The command `dvc status -c` can list files referenced in current DVC-files, but missing in the cache. It can be used to see what files `dvc pull` diff --git a/static/docs/command-reference/push.md b/static/docs/command-reference/push.md index 92ede66d84..e3b472ec6f 100644 --- a/static/docs/command-reference/push.md +++ b/static/docs/command-reference/push.md @@ -52,7 +52,7 @@ configure a remote. With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads only the files (or directories) that are new in the local repository to remote storage. It will not upload files associated with earlier versions or branches -of the project directory, nor will it upload files which have not +of the project directory, nor will it upload files that have not changed. The `dvc status -c` command can list files tracked by DVC that are new in the @@ -259,7 +259,7 @@ $ tree ../vault/recursive The directory `.dvc/cache` is the local cache, while `../vault/recursive` is the remote storage – a "local remote" in this case. 
This listing shows the cache -having more files in it than the remote does (which is what `new` means). +having more files in it than the remote – which is what the `new` state means. Next we can upload part of the data from the cache to the remote using the command `dvc push --with-deps .dvc`. Remember that `--with-deps` searches diff --git a/static/docs/command-reference/remote/add.md b/static/docs/command-reference/remote/add.md index d0fecd9ef2..9e998fabe0 100644 --- a/static/docs/command-reference/remote/add.md +++ b/static/docs/command-reference/remote/add.md @@ -49,10 +49,9 @@ url = /tmp/dvc-storage remote = myremote ``` -DVC supports the concept of a _default remote_. For the commands which take a +DVC supports the concept of a _default remote_. For the commands that accept a `--remote` option (`dvc pull`, `dvc push`, `dvc status`, `dvc gc`, `dvc fetch`), -this option can be left off the command line and the default remote will be used -instead. +the default remote is used if that option is not used. Use `dvc config` to unset/change the default remote as so: `dvc config -u core.remote`. @@ -327,7 +326,7 @@ $ export OSS_ACCESS_KEY_ID='AccessKeyID' $ export OSS_ACCESS_KEY_SECRET='AccessKeySecret' ``` -> Use default key id and key secret when they are not given, which gives read +> Uses default key id and key secret when they are not given, which gives read > access to public read bucket and public bucket. diff --git a/static/docs/command-reference/remote/default.md b/static/docs/command-reference/remote/default.md index 6dcff01a88..9e5f5aa895 100644 --- a/static/docs/command-reference/remote/default.md +++ b/static/docs/command-reference/remote/default.md @@ -38,9 +38,9 @@ This command assigns the default remote in the core section of the DVC remote = myremote ``` -For the commands which take a `--remote` option (`dvc pull`, `dvc push`, -`dvc status`, `dvc gc`, `dvc fetch`), default remote is used if that option is -not specified. 
+For the commands that accept a `--remote` option (`dvc pull`, `dvc push`, +`dvc status`, `dvc gc`, `dvc fetch`), the default remote is used if that option +is not used. You can also use `dvc config`, `dvc remote add` and `dvc remote modify` commands to set/unset/change the default remote configurations. diff --git a/static/docs/command-reference/remote/index.md b/static/docs/command-reference/remote/index.md index f31fb6a7d6..1aea2bc2ad 100644 --- a/static/docs/command-reference/remote/index.md +++ b/static/docs/command-reference/remote/index.md @@ -27,9 +27,9 @@ What is data remote? The same way as Github provides storage hosting for Git repositories, DVC remotes provide a central place to keep and share data and model files. With -this remote storage, you can pull models and data files which were created by -your team members without spending time and resources to build or process them -locally. It also saves space on your local environment – DVC can +this remote storage, you can pull models and data files created by colleagues +without spending time and resources to build or process them locally. It also +saves space on your local environment – DVC can [fetch](/doc/command-reference/fetch) into the cache directory only the data you need for a specific branch/commit. @@ -44,8 +44,8 @@ more details. > along with DVC to support S3 storage. Using DVC with a remote data storage is optional. By default, DVC is configured -to use a local data storage only (usually `.dvc/cache` directory inside your -repository), which enables basic DVC usage scenarios out of the box. +to use a local data storage only (usually the `.dvc/cache` directory). This +enables basic DVC usage scenarios out of the box. 
[Add](/doc/command-reference/remote/add), [default](/doc/command-reference/remote/default), diff --git a/static/docs/command-reference/remote/modify.md b/static/docs/command-reference/remote/modify.md index d819e58fd0..7301c728be 100644 --- a/static/docs/command-reference/remote/modify.md +++ b/static/docs/command-reference/remote/modify.md @@ -267,8 +267,8 @@ For more information on configuring Azure Storage connection strings, visit - `gss_auth` - use Generic Security Services authentication if available on host (for example, [with kerberos](https://en.wikipedia.org/wiki/Generic_Security_Services_Application_Program_Interface#Relationship_to_Kerberos)). - Using this option requires `paramiko[gssapi]` which is currently only - supported by our pip package and could be installed with + Using this option requires `paramiko[gssapi]`, which is currently only + supported by our pip package, and could be installed with `pip install 'dvc[ssh_gssapi]'`. Other packages (Conda, Windows, Homebrew cask and Mac pkg) do not support it. diff --git a/static/docs/command-reference/repro.md b/static/docs/command-reference/repro.md index af7cbfbd87..c49b14ae0a 100644 --- a/static/docs/command-reference/repro.md +++ b/static/docs/command-reference/repro.md @@ -145,7 +145,7 @@ $ dvc run -f Dvcfile -d numbers.txt -d process.py -M count.txt \ > example because that's the default stage file name `dvc repro` will read > without having to provide any `targets`. -Where `process.py` is a script which for simplicity just prints the number of +Where `process.py` is a script that, for simplicity, just prints the number of lines: ```python @@ -233,10 +233,10 @@ Stage 'Dvcfile' didn't change. Data and pipelines are up to date. ``` -The reason being that the `text.txt` is a file which is a dependency in the -target DVC-file (`Dvcfile` by default). Instead, it's dependent on `filter.dvc`, -which happens before the target stage in this pipeline (shown above in the -following figure). 
+The reason being that the `text.txt` file is a dependency in the target +[DVC-file](/doc/user-guide/dvc-file-format) (`Dvcfile` by default). This +`Dvcfile` stage is dependent on `filter.dvc`, which happens first in this +pipeline (shown in the following figure): ```dvc $ dvc pipeline show --ascii diff --git a/static/docs/command-reference/run.md b/static/docs/command-reference/run.md index 22df9b308d..b015a719ad 100644 --- a/static/docs/command-reference/run.md +++ b/static/docs/command-reference/run.md @@ -121,12 +121,12 @@ pipeline. - `-y`, `--yes` - deprecated, use `--overwrite-dvcfile` instead. -- `--overwrite-dvcfile` - overwrite an existing DVC-file (the same file name - which is determined by the logic described in the `-f` option) without asking - for confirmation. +- `--overwrite-dvcfile` - overwrite an existing DVC-file (with file name + determined by the logic described in the `-f` option) without asking for + confirmation. - `--ignore-build-cache` - if an exactly equal DVC-file exists (same list of - outputs and inputs, the same command to run) which has been already executed, + outputs and inputs, the same command to run), which has been already executed and is up to date, `dvc run` won't normally execute the command again (thus "build cache"). This option gives a way to forcefully execute the command anyway. It's useful if the command is non-deterministic (meaning it produces diff --git a/static/docs/command-reference/status.md b/static/docs/command-reference/status.md index d4673ea901..aba47aea77 100644 --- a/static/docs/command-reference/status.md +++ b/static/docs/command-reference/status.md @@ -64,9 +64,9 @@ outputs described in it. (e.g. someone manually edited the file). - _always changed_ means that this is a DVC-file with no dependencies (an - _orphan_ stage file) or it has the `always_changed: true` value set (see - `--always-changed` option in `dvc run`), which is considered always changed - and is always executed by `dvc repro`. 
+ _orphan_ stage file) or that it has the `always_changed: true` value set (see + `--always-changed` option in `dvc run`), so it's considered always changed, and + thus is always executed by `dvc repro`. - _changed deps_ or _changed outs_ means that there are changes in dependencies or outputs tracked by the DVC-file. Depending on the use case, diff --git a/static/docs/command-reference/version.md b/static/docs/command-reference/version.md index 8ea75eb0d7..e5fef23fa7 100644 --- a/static/docs/command-reference/version.md +++ b/static/docs/command-reference/version.md @@ -45,13 +45,13 @@ The detail of DVC version depends upon the way of installing DVC. install DVC using the `master` branch of DVC's repository. Another way of setting up the development version is to clone the repository and run `pip install .`. The master branch is continuously being updated with changes - which might not be ready to publish yet. Therefore installing using the above + that might not be ready to publish yet. Therefore installing using the above command might have issues regarding its usage. So to trace any error reported - with this setup, we need to know exactly which version is being used. For - this, we rely on git commit hash which is displayed in output as - `0.40.2+292cab.mod`. The part before `+` is the `BASE_VERSION` and the latter - part is the git commit hash which is one of the commits in the `master` branch - (also, optional suffix `.mod` means that code is modified). + with this setup, we need to know exactly which version is being used. For this + we rely on a git commit hash that is displayed in this command's output like + this: `0.40.2+292cab.mod`. The part before `+` is the `BASE_VERSION` and the + latter part is the `master` branch commit hash. The optional suffix `.mod` + means that code is modified. 
#### What we mean by "Binary" @@ -68,7 +68,7 @@ The detail of `Binary` depends on the way DVC was downloading and - Windows executable (`.exe`) - file used to install applications on Windows These downloads are available from our [home page](/). They ultimately contain - a binary bundle, which is the executable version of a software program, + a binary bundle, which is the executable file of a software application, meaning that it will run natively on a specific platform (Linux, Windows, Mac). In our case, we use [PyInstaller](https://pythonhosted.org/PyInstaller/) to bundle our source code into the binary package app. diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index d66d297688..02b6757470 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -59,7 +59,7 @@ will see that it has this hash inside. DVC tries to use reflinks\* by default to link your data files from the DVC cache to the workspace, optimizing speed and storage space. However, reflinks are not widely supported yet and DVC falls back to actually copying data files -to/from the cache **which can be very slow with large files**, and duplicates +to/from the cache. **Copying can be very slow with large files**, and duplicates storage requirements. Hardlinks and symlinks are also available for optimized cache linking but, diff --git a/static/docs/get-started/agenda.md b/static/docs/get-started/agenda.md index 2022b1145a..c76390cb57 100644 --- a/static/docs/get-started/agenda.md +++ b/static/docs/get-started/agenda.md @@ -14,8 +14,8 @@ the same result together! The idea of the project is a simplified version of the [Tutorial](/doc/tutorial). It explores the NLP problem of predicting tags for a -given StackOverflow question. For example, we want one classifier which can -predict a post that is about the Python language by tagging it `python`. +given StackOverflow question. 
For example, we want a classifier that can predict +posts about the Python language by tagging them `python`. ![](/static/img/example-flow-2x.png) diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index ad7602a0a6..76163a613a 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -3,9 +3,9 @@ To show DVC in action, let's play with an actual machine learning scenario. Let's explore the natural language processing ([NLP](https://en.wikipedia.org/wiki/Natural_language_processing)) problem of -predicting tags for a given StackOverflow question. For example, we want one -classifier which can predict a post that is about the Python language by tagging -it `python`. This is a short version of the [Tutorial](/doc/tutorial). +predicting tags for a given StackOverflow question. For example, we want a +classifier that can predict posts about the Python language by tagging them +`python`. (This is a short version of the [Tutorial](/doc/tutorial).) In this example, we will focus on building a simple ML [pipeline](/doc/command-reference/pipeline) that takes an archive with diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 8d944e8fdf..36c8eaf2b8 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -98,7 +98,7 @@ hundreds of gigabytes in file size. Instead of copying files from cache to workspace, DVC can create reflinks or other file link types. > When reflinks are not supported by the file system, DVC defaults to copying -> files, which doesn't save file storage. However, it's easy to enable other +> files, which doesn't optimize file storage. However, it's easy to enable other > file link types on most systems. See > [File link types](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) > for more information. 
@@ -148,8 +148,10 @@ the actual command to run. Outputs are files written to by the command, if any. directory. The dependency can be a regular file from a repository or a data file. -2. `-o file.tsv` (lower case o) specifies output data file, which means DVC will - transform this file into a data file (as if running `dvc add file.tsv`). +2. `-o file.tsv` (lower case o) specifies output data file. DVC will track this + data file by creating a corresponding + [DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add file.tsv` + after `dvc run` instead). 3. `-O file.tsv` (upper case O) specifies a regular output file (not to be added to DVC). @@ -238,8 +240,7 @@ Posts.xml The output file `Posts.xml` was transformed by DVC into a data file in accordance with the `-o` option. You can find the corresponding cache file with -the checksum, which starts with `c1fa36d` as we can see in the `Posts.xml.dvc` -stage file: +the checksum, with a path starting in `c1/fa36d` as we can see below: ```dvc $ ls .dvc/cache/ diff --git a/static/docs/tutorial/preparation.md b/static/docs/tutorial/preparation.md index 8181b66177..8a45a73d38 100644 --- a/static/docs/tutorial/preparation.md +++ b/static/docs/tutorial/preparation.md @@ -96,10 +96,9 @@ $ git commit -m "init DVC" The `.dvc/cache` directory is one of the most important parts of any DVC repository. The directory contains all the content of data files and will be -described in the next chapter in more detail. The most important part about this -directory is that it is contained in the `.dvc/.gitignore` file, which means -that the cache directory is not under Git control — this is your local directory -and you cannot push it to any Git remote. +described in the next chapter in more detail. Note that the cache directory is +contained in the `.dvc/.gitignore` file, which means that it's not under Git +control — this is your local directory and you cannot push it to any Git remote. 
For more information refer to [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories). diff --git a/static/docs/tutorial/reproducibility.md b/static/docs/tutorial/reproducibility.md index e3b812d4f1..003fcd9931 100644 --- a/static/docs/tutorial/reproducibility.md +++ b/static/docs/tutorial/reproducibility.md @@ -7,7 +7,7 @@ The most exciting part of DVC is reproducibility. > Reproducibility is the time you are getting benefits out of DVC instead of > spending time managing ML pipelines. -DVC tracks all the dependencies, which helps you iterate on ML models faster +DVC tracks all the dependencies. This helps you iterate on ML models faster without thinking what was affected by your last change. In order to track all the dependencies, DVC finds and reads all the DVC-files in @@ -238,7 +238,7 @@ mismatches in the branches. You can properly merge conflicts by prioritizing the checksums from the bigrams branch: that is, by removing all checksums of the other branch. [Here](https://help.github.com/en/articles/resolving-a-merge-conflict-using-the-command-line) -you can find a tutorial which clarifies how to do that. It is also important to +you can find a tutorial that clarifies how to do that. It is also important to remove all automatically generated [conflict markers](https://git-scm.com/book/en/v2/Git-Tools-Advanced-Merging#_checking_out_conflicts) (<<<<<<<, diff --git a/static/docs/tutorial/sharing-data.md b/static/docs/tutorial/sharing-data.md index d80b62450b..e3015b6fe4 100644 --- a/static/docs/tutorial/sharing-data.md +++ b/static/docs/tutorial/sharing-data.md @@ -12,8 +12,10 @@ DVC is able to push the cache to cloud storage. > Using shared cloud storage, a colleague can reuse ML models that were trained > on your machine. -First, you need to set a remote storage which will be stored in the config file -of the project. This can be done using the CLI as shown below. 
+First, you need to setup remote storage for the project, that will +be stored in the +[config file](https://dvc.org/doc/user-guide/dvc-files-and-directories). This +can be done using the CLI as shown below. > Note that we are using the `dvc-public` S3 bucket as an example and you don't > have write access to it, so in order to follow the tutorial you will need to diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 933c8b7f1b..faacf6c2d7 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -97,7 +97,7 @@ process. Git-annex repository is cloned via `git clone`, data files won't be copied to the local machine as file contents are stored in separate [remotes](/doc/command-reference/remote). With DVC, - [DVC-files](/doc/user-guide/dvc-file-format) (which provide the reproducible + [DVC-files](/doc/user-guide/dvc-file-format) (that provide the reproducible workflow) are always included in the Git repository and hence can be recreated locally with minimal effort. diff --git a/static/docs/use-cases/index.md b/static/docs/use-cases/index.md index ceabc33c8d..3820ca3d5f 100644 --- a/static/docs/use-cases/index.md +++ b/static/docs/use-cases/index.md @@ -17,10 +17,9 @@ This list of use cases is _not_ exhaustive. We keep reviewing our docs and will include interesting scenarios that surface in our community. Please, [contact us](/support) if you need help or have suggestions! -Use cases are not written to be run end-to-end. For more general hands-on -experience with DVC, we recommend to follow the [Get Started](/doc/get-started), -and/or [Tutorial](/doc/tutorial)] first – which include prepared datasets, -source code files, and example commands that can be copied and pasted into -terminal. These articles also link to our more general -[User Guide](/doc/user-guide), and technical -[command references](/doc/command-reference). 
+Use cases are not written to be run end-to-end. For more general, hands-on +experience with DVC, we recommend following the [Get Started](/doc/get-started), +and/or [Tutorial](/doc/tutorial)] first. These include prepared datasets, source +code files, and example commands that can be copied and pasted into terminal. +These use cases also link to our more general [User Guide](/doc/user-guide), and +technical [command references](/doc/command-reference). diff --git a/static/docs/user-guide/contributing-docs.md b/static/docs/user-guide/contributing-docs.md index fbe031f341..b3939756f0 100644 --- a/static/docs/user-guide/contributing-docs.md +++ b/static/docs/user-guide/contributing-docs.md @@ -74,8 +74,8 @@ changes before submitting them, and its very much needed in order to make changes to the docs JavaScript engine itself (rare). Source code files need to be properly formatted as well, which is also ensured by the full setup below. -Start the development server using `yarn dev` which will start the server on the -default port `3000`. Visit `http://localhost:3000/` and navigate to the docs in +Start the development server using `yarn dev`. This will start the server on the +default port, `3000`. Visit `http://localhost:3000/` and navigate to the docs in question. If you intend to change JavaScript files, test the changes with `yarn test` @@ -89,8 +89,8 @@ command before committing them. Visual Studio Code and the [Rewrap](https://marketplace.visualstudio.com/items?itemName=stkb.rewrap) plugin. Correct formatting will be done automatically by a Git pre-commit hook - which is integrated when `yarn` installs the project dependencies (explained - in the instructions above). + that is integrated when `yarn` installs the project dependencies (explained in + the instructions above). - We use [Prettier](https://prettier.io/) default conventions to format our source code files. 
The formatting of staged files will automatically be done diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index 83c91f6f98..ccfbb711fc 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -25,9 +25,9 @@ paragraphs below to learn how to submit your changes. run tests or [run](#running-development-version) the DVC with your changes. - Fork [DVC](https://github.com/iterative/dvc.git) and prepare necessary changes. -- Add tests for your changes to `tests/test_*.py`. You can skip this step if - the effort to create tests for your change is unreasonable. Changes - without tests are still going to be considered by us. +- Add tests for your changes to `tests/test_*.py`. You can skip this step if the + effort to create tests for your change is unreasonable. Changes without tests + are still going to be considered by us. - [Run tests](#running-tests) and make sure all of them pass. - Submit a pull request, referencing any issues it addresses. @@ -170,7 +170,7 @@ Install [aws cli](https://docs.aws.amazon.com/en_us/cli/latest/userguide/cli-chap-install.html) tools. -Set up an account, get credentials, which will have access to S3. Then, set env +To set up AWS access, first get credentials with S3 permissions. Then, set env vars like this: ```dvc diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 8e5ece4b1b..c38b75bf21 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -24,8 +24,8 @@ operation: > repository, only [DVC-files](/doc/user-guide/dvc-file-format) that are > needed to download or reproduce them. -- `.dvc/state`: This file is used for optimization. It is a SQLite db, that - contains checksums for files tracked in a DVC project, with respective +- `.dvc/state`: This file is used for optimization. 
It is a SQLite database, + that contains checksums for files tracked in a DVC project, with respective timestamps and inodes to avoid unnecessary checksum computations. It also contains a list of links (from cache to workspace) created by DVC and is used to cleanup your workspace when calling `dvc checkout`. @@ -34,8 +34,8 @@ operation: - `.dvc/state-wal`: Another SQLite temporary file -- `.dvc/updater`: This file is used store latest available version of dvc, which - is used to remind user to upgrade. +- `.dvc/updater`: This file is used store the latest available version of DVC. + It's used to remind the user to upgrade when the installed version is behind. - `.dvc/updater.lock`: Lock file for `.dvc/updater` @@ -44,20 +44,20 @@ operation: ## Structure of cache directory There are two ways in which the data is stored in cache. It depends -on whether the actual data is stored in a single file (eg. `data.csv`) or in a -directory of files. +on whether the data in question is a single file (eg. `data.csv`) or a directory +of files. -We evaluate a checksum, usually MD5, for the data file which is a 32 characters -long string. The first two characters are assigned to name the directory inside -`.dvc/cache` and rest are given to name the cache file. For example, if a data -file, say `Posts.xml.zip`, is converted to a MD5 checksum, it will evaluate to -`ec1d2935f811b77cc49b031b999cbf17`. The cache file for this data file will be -stored as `.dvc/ec/1d2935f811b77cc49b031b999cbf17` on the local storage and if -it is pushed to a remote storage, its location will be -`/ec/1d2935f811b77cc49b031b999cbf17` where prefix is the name of the -remote storage. `/tmp/dvc-storage` can be one example of a prefix. +For the first case, we calculate the file's checksum, a 32 characters long +string (usually MD5). The first two characters are used to name the directory +inside `.dvc/cache` and the rest become the file name of the cached file. 
For +example, if a data file `Posts.xml.zip` has checksum +`ec1d2935f811b77cc49b031b999cbf17`, its cache entry will be +`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17` locally. If pushed to +[remote storage](/doc/command-reference/remote), its location will be +`/ec/1d2935f811b77cc49b031b999cbf17`, where prefix is the name of the +DVC remote. -For the second case, let us consider a directory of 2 images. +For the second case, let us consider a directory with 2 images. ```dvc $ tree data/images/ @@ -69,10 +69,9 @@ $ dvc add data/images ... ``` -On running `dvc add` on this directory of images, a -[DVC-file](/doc/user-guide/dvc-file-format) is created by default, with -information including the checksum of the directory, which is cached as a file -in `.dvc/cache`. +When running `dvc add` on this directory of images, a +[DVC-file](/doc/user-guide/dvc-file-format) is created, containing the checksum +of the directory. ```yaml - md5: 196a322c107c2572335158503c64bfba.dir @@ -91,6 +90,7 @@ $ tree │   └── a6c8271c0c8fbf75d3b97aecee589f └── df └── f70c0392d7d386c39a23c64fcc0376 +... ``` Like the previous case, the first two digits of the checksum are used to name diff --git a/static/docs/user-guide/index.md b/static/docs/user-guide/index.md index e33d74f013..173d31cf95 100644 --- a/static/docs/user-guide/index.md +++ b/static/docs/user-guide/index.md @@ -3,9 +3,10 @@ This section describes the main DVC concepts and features comprehensively, explaining when and how to use them, as well as connections between them. These guides don't focus on specific scenarios, but have a general scope – like a user -manual. Their topics range from more technical basics, which impact more parts -of DVC, to more advanced things you can do. We also include a few guides related -to contributing to this [open-source project](https://github.com/iterative/dvc). +manual. Their topics range from more technical foundations, impacting more parts +of DVC, to more advanced and specific things you can do. 
We also include a few +guides related to contributing to this +[open-source project](https://github.com/iterative/dvc). - [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) describes the internal `.dvc/` directory and it's contents. diff --git a/static/docs/user-guide/large-dataset-optimization.md b/static/docs/user-guide/large-dataset-optimization.md index 5eaba8e88b..d65612e51c 100644 --- a/static/docs/user-guide/large-dataset-optimization.md +++ b/static/docs/user-guide/large-dataset-optimization.md @@ -23,8 +23,8 @@ supported by the file system. ## File link types for the DVC cache File links are entries in the file system that don't necessarily hold the file -contents, but which point to where the file is actually stored. File links are -more common in file systems used with UNIX-like operating systems and come in +contents, but point to where the file is actually stored. File links are more +common in file systems used with UNIX-like operating systems and come in different kinds, that differ in how they connect filenames to inodes in the system. @@ -41,8 +41,8 @@ system, but may break your workflow since updating hard/sym-linked files tracked by DVC in the workspace causes cache corruption. These 2 link types thus require using cache **protected mode** (see the `cache.protected` config option in `dvc config cache`). Finally, a 4th "linking" option is to actually -copy files from/to the cache, which is safe but inefficient, especially for -large files (several GBs or more data). +copy files from/to the cache, which is safe but inefficient – especially for +large files (several GBs or more). > Some versions of Windows (e.g. 
Windows Server 2012+ and Windows 10 Enterprise) > support hard or soft links on the diff --git a/static/docs/user-guide/update-tracked-files.md b/static/docs/user-guide/update-tracked-files.md index 3866a2f004..6112157c8d 100644 --- a/static/docs/user-guide/update-tracked-files.md +++ b/static/docs/user-guide/update-tracked-files.md @@ -16,7 +16,7 @@ may mean either replacing `train.tsv` with a new file having the same name or editing the content of the file. If you run `dvc repro` there is no need to manage generated (output) files -manually, DVC removes them for you before executing the stage which generates +manually. DVC removes them for you before executing the stage that generates them. If you use DVC to track a file that is generated during your pipeline (e.g. some