diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 369e482950..23cb6c96a4 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -21,15 +21,15 @@ file is committed to the DVC cache. Using the `--no-commit` option, the file will not be added to the cache and instead the `dvc commit` command is used when (or if) the file is to be committed to the DVC cache. -Under the hood, a few actions are taken for each file in the target(s): +Under the hood, a few actions are taken for each file in `targets`: 1. Calculate the file checksum. 2. Move the file content to the DVC cache (default location is `.dvc/cache`). 3. Replace the file by a link to the file in the cache (see details below). 4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store the checksum to identify the cache entry. -5. Add the _target_ filename to `.gitignore` (if Git is used in this workspace) - to prevent it from being committed to the Git repository. +5. Add the target(s) to `.gitignore` (if Git is used in this workspace) to + prevent it from being committed to the Git repository. 6. Instructions are printed showing `git` commands for adding the files to a Git repository. If a different SCM system is being used, use the equivalent command for that system or nothing is printed if `--no-scm` was specified for @@ -79,8 +79,10 @@ This way you bring data provenance and make your project reproducible. ## Options -- `-R`, `--recursive` - recursively add each file under the named directory. For - each file a new DVC-file is created using the process described earlier. +- `-R`, `--recursive` - `targets` is expected to contain directory path(s). + Determines the files to add by searching each target directory and its + subdirectories for data files. For each file found, a new DVC-file is created + using the process described in this command's description. 
- `--no-commit` - do not put files/directories into cache. A DVC-file is created, and an entry is added to `.dvc/state`, while nothing is added to the diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 04ad62c070..5a5203aa75 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -37,7 +37,8 @@ The execution of `dvc checkout` does: - Scan the `outs` entries in DVC-files to compare with the currently checked out data files. The scanned DVC-files is limited by the listed `targets` (if any) on the command line. And if the `--with-deps` option is specified, it scans - backward in the [pipeline](/doc/get-started/pipeline) from the named targets. + backward from the given `targets` in the corresponding + [pipeline](/doc/get-started/pipeline). - For any data files where the checksum doesn't match their DVC-file entry, the data file is restored from the cache. The link strategy used (`reflink`, `hardlink`, `symlink`, or `copy`) depends on the OS and the configured value @@ -68,18 +69,17 @@ such a case, `dvc checkout` prints a warning message. Any files that can be checked out without error will be restored. There are two methods to restore a file missing from the cache, depending on the -situation. In some cases the pipeline must be rerun using the `dvc repro` -command. In other cases the cache can be pulled from a remote cache using the -`dvc pull` command. See also `dvc pipeline` +situation. In some cases a pipeline must be reproduced (using `dvc repro`) to +regenerate its outputs. (See also `dvc pipeline`.) In other cases the cache can +be pulled from a remote cache using `dvc pull`. ## Options -- `-d`, `--with-deps` - determine workspace files to update by tracking - dependencies to the named target DVC-file(s). This option only has effect when - one or more `targets` are specified. 
By traversing all stage dependencies, DVC - searches backward through the pipeline from the named target(s). This means - DVC will not checkout files referenced later in the pipeline than the named - target(s). +- `-d`, `--with-deps` - determine files to update by tracking dependencies to + the target DVC-file(s) (stages). This option only has effect when one or more + `targets` are specified. By traversing all stage dependencies, DVC searches + backward from the target stage(s) in the corresponding pipeline(s). This means + DVC will not checkout files referenced in later stage(s) than `targets`. - `-f`, `--force` - do not prompt when removing workspace files. Changing the current set of DVC-files with SCM commands like `git checkout` can result in @@ -134,8 +134,8 @@ $ pip install -r requirements.txt -The existing pipeline looks almost like in this -[example](/doc/get-started/example-pipeline): +The workspace looks almost like in this +[pipeline setup](/doc/get-started/example-pipeline): ```dvc . diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index a418187717..36e48f5226 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -17,10 +17,10 @@ positional arguments: ## Description The `dvc commit` command is useful for several scenarios where a dataset is -being changed: a [stage](/doc/commands-reference/run) or +being changed: when a [stage](/doc/commands-reference/run) or [pipeline](/doc/get-started/pipeline) is in development, when one wishes to run -commands outside the control of DVC, or to force DVC-files updates to save time -rerunning the stage or pipeline. +commands outside the control of DVC, or to force DVC-file updates to save time +tying stages or a pipeline. - Code or data for a stage is under active development, with rapid iteration of code, configuration, or data. 
Run DVC commands (`dvc run`, `dvc repro`, and @@ -36,8 +36,8 @@ rerunning the stage or pipeline. doesn't cause a change in its results. We might write in-line documentation with comments, change indentation, remove some debugging printouts, or any other change which doesn't introduce a change in the output of pipeline - stages. `dvc commit` can help avoid rerunning the pipeline in these cases by - forcing the update of the DVC-files. + stages. `dvc commit` can help avoid having to reproduce a pipeline in these + cases by forcing the update of the DVC-files. The last two use cases are **not recommended**, and essentially force update the DVC-files and save data to cache. They are still useful, but keep in mind that @@ -65,16 +65,16 @@ It handles that last step of adding the file to the DVC cache. ## Options - `-d`, `--with-deps` - determine files to commit by tracking dependencies to - the named target DVC-file(s). This option only has effect when one or more + the target DVC-file(s) (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward through the pipeline from the named target(s). This means DVC will - not commit files referenced later in the pipeline than the named target(s). + backward from the target stage(s) in the corresponding pipeline(s). This means + DVC will not commit files referenced in later stage(s) than `targets`. -- `-R`, `--recursive` - the `targets` value is expected to be a directory path. - With this option, `dvc commit` determines the files to commit by searching the - named directory, and its subdirectories, for DVC-files for which to commit - data. Along with providing a `target`, or `target` along with `--with-deps`, - it is yet another way to limit the scope of DVC-files to upload. +- `-R`, `--recursive` - `targets` is expected to contain directory path(s). 
+ Determines the files to commit by searching each target directory and its + subdirectories for DVC-files to inspect. Along with providing `targets`, or + `targets` and `--with-deps`, this is another way to limit the scope of + DVC-files to commit. - `-f`, `--force` - commit data even if checksums for dependencies or outputs did not change. @@ -131,11 +131,11 @@ This data will be retrieved from a preconfigured remote cache. ## Example: Rapid iterations -Sometimes we want to iterate through multiple changes to configuration, or to -code, sometimes to data, trying multiple options, and improving the output of a -stage. To avoid filling the DVC cache with undesired intermediate results, we -can rerun the whole pipeline using `dvc repro --no-commit`, or a single stage -with `dvc run --no-commit`. This prevents data from being pushed to cache. When +Sometimes we want to iterate through multiple changes to configuration, code, or +data, trying multiple options to improve the output of a stage. To avoid filling +the DVC cache with undesired intermediate results, we can run a single stage +with `dvc run --no-commit`, or reproduce an entire pipeline using +`dvc repro --no-commit`. This prevents data from being pushed to cache. When development of the stage is finished, `dvc commit` can be used to store data files in the DVC cache. @@ -195,11 +195,11 @@ outs: wdir: . ``` -To verify this instance of `model.pkl` is not in the cache, we must know how the -cache files are named. In the DVC cache the first two characters of the checksum -are used as a directory name, and the file name is the remaining characters. -Therefore, if the file had been committed to the cache it would appear in the -directory `.dvc/cache/70`. But: +To verify this instance of `model.pkl` is not in the cache, we must know the +names of the cache files. In the DVC cache the first two characters of the +checksum are used as a directory name, and the file name is the remaining +characters. 
Therefore, if the file had been committed to the cache it would +appear in the directory `.dvc/cache/70`. But: ```dvc $ ls .dvc/cache/70 @@ -256,9 +256,10 @@ train.dvc: Let's edit one of the source files. It doesn't matter which one. You'll see that both Git and DVC recognize a change was made. -If we ran `dvc repro` at this point the pipeline would be rerun. But since the -change was inconsequential, that would be a waste of time and CPU resources. -That's especially critical if the pipeline takes a long time to execute. +If we ran `dvc repro` at this point, this pipeline would be reproduced. But +since the change was inconsequential, that would be a waste of time and CPU. +That's especially critical if the corresponding stages take lots of resources to +execute. ```dvc $ git add src/train.py @@ -277,4 +278,4 @@ Pipeline is up to date. Nothing to reproduce. ``` Nothing special is required, we simply `commit` to both the SCM and DVC. Since -the pipeline is up to date, `dvc repro` will not do anything. +this pipeline is up to date, `dvc repro` will not do anything. diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index 6e67fa9df7..d139a83d3c 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -46,7 +46,7 @@ files under DVC control could already exist in remote storage, but won't be in your local cache. (Refer to `dvc remote` for more information on DVC remotes.) These necessary data or model files are listed as dependencies or outputs in a DVC-file (target [stage](/doc/commands-reference/run)) so they are required to -[reproduce](/doc/get-started/reproduce) the +[reproduce](/doc/get-started/reproduce) the corresponding [pipeline](/doc/get-started/pipeline). (See [DVC-File Format](/doc/user-guide/dvc-file-format) for more information on dependencies and outputs.) 
@@ -78,10 +78,10 @@ specified in DVC-files currently in the workspace are considered by `dvc fetch` using the `dvc remote` command. - `-d`, `--with-deps` - determine files to download by tracking dependencies to - the named target DVC-file(s). This option only has effect when one or more + the target DVC-file(s) (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward through the pipeline from the named target(s). This means DVC will - not fetch files referenced later in the pipeline than the named target(s). + backward from the target stage(s) in the corresponding pipeline(s). This means + DVC will not fetch files referenced in later stage(s) than `targets`. - `-R`, `--recursive` - this option tells DVC that `targets` are directories (not DVC-files), and to traverse them recursively. All DVC-files found will be @@ -112,9 +112,10 @@ specified in DVC-files currently in the workspace are considered by `dvc fetch` ## Examples -To explore `dvc fetch` let's consider a simple pipeline with several stages and -a few Git tags. Then we can see what happens with `fetch` as we shift from tag -to tag with `git`. +To explore `dvc fetch` let's consider a simple +[pipeline](/doc/get-started/pipeline) with several stages and a few Git tags. +Then we can see what happens with `fetch` as we shift from tag to tag with +`git`.
@@ -145,8 +146,8 @@ $ pip install -r requirements.txt
-The existing pipeline looks almost like in this -[example](/doc/get-started/example-pipeline): +The workspace looks almost like in this +[pipeline setup](/doc/get-started/example-pipeline): ```dvc . @@ -241,7 +242,7 @@ $ tree .dvc/cache └── 603888ec04a6e75a560df8678317fb ``` -> Note that `prepare.dvc` is the first stage in our example's implicit pipeline. +> Note that `prepare.dvc` is the first stage in our example's pipeline. Cache entries for the necessary directories, as well as the actual `data/prepared/test.tsv` and `data/prepared/train.tsv` files were download, @@ -251,7 +252,8 @@ checksums shown above. After following the previous example (**Specific stages**), only the files associated with the `prepare.dvc` stage file have been fetched. Several -dependencies/outputs for the full pipeline are still missing from local cache: +dependencies/outputs of other pipeline stages are still missing from local +cache: ```dvc $ dvc status -c @@ -296,13 +298,13 @@ $ tree .dvc/cache ``` Fetching using `--with-deps` starts with the target DVC-file (stage) and -searches backwards through the pipeline for data files to download into the +searches backwards through its pipeline for data files to download into the local cache. All the data for the second and third stages ("featurize" and "train") has now been downloaded to cache. We could now use `dvc checkout` to -get the data files needed to reproduce the pipeline up to the third stage into +get the data files needed to reproduce this pipeline up to the third stage into the workspace (with `dvc repro train.dvc`). > Note that in this sample project, the last stage file `evaluate.dvc` doesn't > add any more data files than those form previous stages so at this point all -> the pipeline's files are in local cache and `dvc status -c` would output -> "Pipeline is up to date. Nothing to reproduce." +> of the files for this pipeline are in local cache and `dvc status -c` would +> output "Pipeline is up to date. 
Nothing to reproduce." diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index b941fb0fd3..bd34bbb1a9 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -61,8 +61,8 @@ DVC supports several types of (local or) remote locations: > running it internally expands this URL into a regular S3, SSH, GS, etc URL by > appending `/path/to/file` to the `myremote`'s configured base path. -Another way to understand the `dvc import` command is as a short-cut for more -verbose `dvc run` commands. This is discussed in the +Another way to understand the `dvc import` command is as a short-cut for a more +verbose `dvc run` command. This is discussed in the [External Dependencies](/doc/user-guide/external-dependencies) documentation, where an alternative is demonstrated for each of these schemes. @@ -80,16 +80,16 @@ $ dvc run -d https://example.com/path/to/data.csv \ wget https://example.com/path/to/data.csv -O data.csv ``` -Both methods generate a DVC-file with an external dependency, and they perform a -roughly equivalent result. The `dvc import` command saves the user from using -the command to copy files from each of the remote storage schemes, and from +Both methods generate a stage file (DVC-file) with an external dependency, and +they produce equivalent results. The `dvc import` command saves the user from +having to manually copy files from each of the remote storage schemes, and from having to install CLI tools for each service. -When DVC inspects a DVC-file, one step is inspecting the dependencies to see if -any have changed. A changed dependency will appear in the `dvc status` report, -indicating the need to re-run the corresponding part of the pipeline. When DVC -inspects an external dependency, it uses a method appropriate to that dependency -to test its current status. +When DVC inspects a DVC-file, its dependencies will be checked to see if any +have changed. 
A changed dependency will appear in the `dvc status` report, +indicating the need to reproduce this import stage. When DVC inspects an +external dependency, it uses a method appropriate to that dependency to test its +current status. ## Options @@ -146,7 +146,7 @@ $ pip install -r requirements.txt ## Example: Tracking a remote file -The [DVC getting started tutorial](/doc/get-started) demonstrates a simple DVC +The [DVC getting started tutorial](/doc/get-started) demonstrates a simple pipeline. In the [Add Files step](/doc/get-started/add-files) we are told to download a file, then use `dvc add` to integrate it with the workspace. @@ -189,7 +189,7 @@ $ git add data/.gitignore data.xml.dvc > [stages](/doc/commands-reference/run) from the _Getting Started_ example, but > since we don't need them for this example, we'll skip it. -Let's take a look at the resulting DVC-file `data.xml.dvc`: +Let's take a look at the resulting stage file (DVC-file) `data.xml.dvc`: ```yaml deps: @@ -215,9 +215,8 @@ file has changed. ## Example: Detecting remote file changes What if that remote file is one which will be updated regularly? The project -goal might include regenerating some artifact based on the updated data. With a -DVC external dependency, the pipeline can be triggered to re-execute based on a -changed external dependency. +goal might include regenerating some artifact based on the updated data. A +pipeline can be triggered to re-execute based on a changed external dependency. Let us again use the [Getting Started](/doc/get-started) example, in a way which will mimic an updated external data source. 
@@ -242,8 +241,8 @@ On your machine initialize the workspace again: ### Click and expand to prepare the workspace -This is needed to actually run the command below in case you are reproducing -this example: +This is needed to actually run the command below in case you are trying this +example: ```dvc $ git checkout 2-remote @@ -269,9 +268,9 @@ To track the changes with git run: ``` At this point we have the workspace set up in a similar fashion. The difference -is that DVC-file references now references the editable data file in the data -store directory we just set up. We did this to make it easy to edit the data -file: +is that stage file (DVC-file) outputs (`outs`) now reference the editable file +in the data store directory we just set up. We did this to make it easy to edit +the data file: ```yaml deps: @@ -316,8 +315,8 @@ $ dvc run -f prepare.dvc \ python src/prepare.py data/data.xml ``` -Having setup this "prepare" stage means that later when we run `dvc repro` a -pipeline will be executed. +> Having set up this "prepare" stage means that later when we run `dvc repro`, a +> pipeline will be executed. The workspace says it is fine: @@ -393,7 +392,7 @@ Pipeline is up to date. Nothing to reproduce. ``` Because the external source for the data file changed, the change was noticed by -the `dvc status` command. Running `dvc repro` then ran both stages of the +the `dvc status` command. Running `dvc repro` then ran both stages of this pipeline, and if we had set up the other stages they also would have been run. It first downloaded the updated data file. And then noticing that `data/data.xml` had changed, that triggered the `prepare.dvc` stage to execute. 
diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 9f8e789347..d49d0b3ad5 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -10,26 +10,28 @@ usage: dvc install [-h] [-q] [-v] ## Description -DVC provides an intelligent data repository on top of a regular SCM like Git to -store code and configuration files. With `dvc install`, the two are more tightly -integrated in order to cause certain convenient actions to happen automatically. +DVC provides an intelligent data repository on top of a regular SCM repository +like Git to store code and configuration files. With `dvc install`, the two are +more tightly integrated in order to cause certain convenient actions to happen +automatically. Namely: -**Checkout**: For any given SCM branch or tag, Git checks out the DVC-files -corresponding to that version. The DVC-files in turn refer to data files in the -DVC cache by checksum. When switching from one SCM branch or tag to another, the -SCM retrieves the corresponding DVC-files. By default that leaves the workspace -in a state where the DVC-files refer to data files other than what is currently -in the workspace. The user at this point should run `dvc checkout` so that the -data files will match the current DVC-files. +**Checkout**: For any given SCM branch or tag, Git checks out the +[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The +DVC-files in turn refer to data files in the DVC cache by checksum. When +switching from one SCM branch or tag to another, the SCM retrieves the +corresponding DVC-files. By default that leaves the workspace in a state where +the DVC-files refer to data files other than what is currently in the workspace. +The user at this point should run `dvc checkout` so that the data files will +match the current DVC-files. The installed Git hook automates running `dvc checkout`. 
**Commit**: When committing a change to the Git repository, that change possibly -requires rerunning the [pipeline](/doc/get-started/pipeline) to reproduce the -workspace results, which is a reminder to run `dvc repro`. Or there might be -files not yet in the cache, which is a reminder to run `dvc commit`. +requires reproducing the corresponding [pipeline](/doc/get-started/pipeline) +(with `dvc repro`) to regenerate the workspace results. Or there might be files +not yet in the cache, which is a reminder to run `dvc commit`. The installed Git hook automates reminding the user to run either `dvc repro` or `dvc commit`. @@ -262,8 +264,8 @@ Pipeline is up to date. Nothing to reproduce. 5 files changed, 12 insertions(+), 12 deletions(-) ``` -After rerunning the DVC pipeline, of course the data files are in sync with the -other files but we must now commit some files to the Git repository. Looking -closely we see that `dvc status` is again run, informing us that the data files -are synchronized with the statement: _Pipeline is up to date. Nothing to -reproduce_. +After reproducing this pipeline up to the "evaluate" stage, the data files are +in sync with the code/config files, but we must now commit the changes to the +Git repository. Looking closely we see that `dvc status` is run again, informing +us that the data files are synchronized with the `Pipeline is up to date.` +message. diff --git a/static/docs/commands-reference/metrics.md b/static/docs/commands-reference/metrics.md index 37b9bdef5e..0e8df0de07 100644 --- a/static/docs/commands-reference/metrics.md +++ b/static/docs/commands-reference/metrics.md @@ -48,14 +48,14 @@ up and manage DVC metrics. First, let's create a simple DVC-file: ```dvc -$ dvc run -d code/evaluate.py -M data/eval.json -f Dvcfile \ +$ dvc run -d code/evaluate.py -M data/eval.json \ python code/evaluate.py ``` > `-M|--metrics-no-cache` is telling DVC to mark `data/eval.json` as a metric > file. 
Using this option is equivalent to using `-O|--outs-no-cache` and then -> using `dvc metrics add data/eval.json` to explicitly mark `data/eval.json` as -> a metric file. +> running `dvc metrics add data/eval.json` to explicitly mark `data/eval.json` +> as a metric file. Now let's print metric values that we are tracking in the current project: diff --git a/static/docs/commands-reference/metrics_add.md b/static/docs/commands-reference/metrics_add.md index 5fa182ce1c..98278ddf73 100644 --- a/static/docs/commands-reference/metrics_add.md +++ b/static/docs/commands-reference/metrics_add.md @@ -74,8 +74,8 @@ outs: If you run `dvc metrics show` you should get an error message like this: -```text -Error: failed to show metrics - no metric files in +```dvc +ERROR: failed to show metrics - no metric files in this repository. use 'dvc metrics add' to add a metric file to track. ``` @@ -104,6 +104,6 @@ outs: And if you run `dvc metrics show` you should see something like this: -```text +```dvc metrics.txt: 0.9643 ``` diff --git a/static/docs/commands-reference/metrics_modify.md b/static/docs/commands-reference/metrics_modify.md index 2ab6c06815..6fef1ef0da 100644 --- a/static/docs/commands-reference/metrics_modify.md +++ b/static/docs/commands-reference/metrics_modify.md @@ -19,12 +19,12 @@ for the metric file `path` provided (the one that specifies the file path in question among its outputs – see `dvc metrics add` or `dvc run` with `-m` and `-M` options), and updates the information that represents the metric. 
-If the path provided is not part of the pipeline, the following error will be -raised: +If the path provided is not defined in a workspace DVC-file, the following error +will be raised: -```text -Error: failed to modify metrics - unable - to find file '' in the pipeline +```dvc +ERROR: failed to modify metric file settings - + unable to find stage file with output '' ``` ## Options diff --git a/static/docs/commands-reference/move.md b/static/docs/commands-reference/move.md index 7fd391244b..cb8ce351d1 100644 --- a/static/docs/commands-reference/move.md +++ b/static/docs/commands-reference/move.md @@ -1,8 +1,9 @@ # move -Renames a file or a directory and modifies the corresponding DVC-file (see -`dvc add`) to reflect the change. If the file or directory has the same name as -the corresponding DVC-file, it would also rename the DVC-file. +Renames a file or a directory and modifies the corresponding +[DVC-file](/doc/user-guide/dvc-file-format) (see `dvc add`) to reflect the +change. If the file or directory has the same name as the corresponding +DVC-file, it would also rename the DVC-file. ## Synopsis @@ -17,13 +18,19 @@ positional arguments: ## Description -`dvc move` moves the file named by the `src` operand to the destination path -named by the `dst` operand. It also renames and updates the corresponding DVC -file. In general it behaves the same way as `mv src dst`, but takes care of a -DVC-file. +`dvc move` is useful when a `src` file or directory has previously been added to +DVC with `dvc add`, creating a [DVC-file](/doc/user-guide/dvc-file-format) (with +`src` as a dependency). `dvc move` behaves like `mv src dst`, moving `src` to +the given `dst` path, but it also renames and updates the corresponding DVC-file +appropriately. -If destination path already exists and is a directory, source file or directory -is moved unchanged into this folder along with the corresponding DVC-file. 
+> Note that `src` may be a copy or a +> [link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +> to a file in cache. The cached file is not changed by this command. + +If the destination path (`dst`) already exists and is a directory, the source +file or directory (`src`) is moved unchanged into this folder along with the +corresponding DVC-file. Let's imagine the following scenario: diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index 63e3cd45fa..0d7c2db6d5 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -48,9 +48,10 @@ and are referenced in the current workspace. It can be used to see what files `dvc pull` would download. If one or more `targets` are specified, DVC only considers the files associated -with those DVC-files. Using the `--with-deps` option DVC tracks dependencies -backward through the [pipeline](/doc/get-started/pipeline) to find data files to -pull. +with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies +backward from the target [stage](/doc/commands-reference/run) file(s), through +the corresponding [pipeline(s)](/doc/get-started/pipeline), to find data files +to pull. After data file is in cache DVC, `dvc pull` uses OS-specific mechanisms like reflinks or hardlinks to put it in the workspace without copying. See @@ -76,20 +77,20 @@ reflinks or hardlinks to put it in the workspace without copying. See save different experiments or project checkpoints. - `-d`, `--with-deps` - determines files to download by tracking dependencies to - the named target DVC-file(s). This option only has effect when one or more + the target DVC-file(s) (stages). This option only has effect when one or more `targets` are specified. By traversing all stage dependencies, DVC searches - backward through the pipeline from the named target(s). 
This means DVC will - not pull files referenced later in the pipeline than the named target(s). + backward from the target stage(s) in the corresponding pipeline(s). This means + DVC will not pull files referenced in later stage(s) than `targets`. - `-f`, `--force` - do not prompt when removing workspace files. This option surfaces behavior from the `dvc checkout` command because `dvc pull` in effect performs a _checkout_ after downloading files. -- `-R`, `--recursive` - `targets` values is expected to be a directory path. - Determines the files to download by searching the named directory and its - subdirectories for DVC-files to download data for. Along with providing a - `target`, or `target` along with `--with-deps` it is yet another way to cut - the scope of DVC-files to download. +- `-R`, `--recursive` - `targets` is expected to contain directory path(s). + Determines the files to download by searching each target directory and its + subdirectories for DVC-files to inspect. Along with providing a `target`, or + `target` and `--with-deps`, this is another way to limit the scope of + DVC-files to download. - `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously while downloading files from the remote cache. The effect is to control the @@ -142,7 +143,7 @@ $ dvc pull data.zip.dvc In this case we left off the `--remote` option, so it will have pulled from the default remote. The only files considered in this case are what is listed in the -`out` section of the named target DVC-file. +`out` section of the target DVC-file(s). ## Examples: With dependencies @@ -195,10 +196,10 @@ $ dvc pull --remote r1 Everything is up to date. ``` -With the first `dvc pull` we specified a stage in the middle of the pipeline +With the first `dvc pull` we specified a stage in the middle of this pipeline (`matrix-train.p.dvc`) while using `--with-deps`. DVC started with that DVC-file and searched backwards through the pipeline for data files to download. 
Because -the `model.p.dvc` stage occurs later in the pipeline, its data was not pulled. +the `model.p.dvc` stage occurs later, its data was not pulled. Then we ran `dvc pull` specifying the last stage, `model.p.dvc`, and its data was downloaded. Finally, we ran `dvc pull` with no options to make sure that all diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index 13abbfe77d..ab63b88b07 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -60,8 +60,10 @@ not exist in the local cache. Running `dvc push` from the local cache does not remove nor modify those files in the remote cache. If one or more `targets` are specified, DVC only considers the files associated -with those DVC-files. Using the `--with-deps` option DVC tracks dependencies -backward through the pipeline to find data files to push. +with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies +backward from the target [stage](/doc/commands-reference/run) file(s), through +the corresponding [pipeline(s)](/doc/get-started/pipeline), to find data files +to push. ## Options @@ -83,16 +85,16 @@ backward through the pipeline to find data files to push. save different experiments or project checkpoints. - `-d`, `--with-deps` - determines files to upload by tracking dependencies to - the named target DVC-file(s). This option only has effect when one or more - `targets` are specified. By traversing each stage dependencies, DVC searches - backward through the pipeline from the named target(s). This means DVC will - not push files referenced later in the pipeline than the named target(s). + the target DVC-file(s) (stages). This option only has effect when one or more + `targets` are specified. By traversing all stage dependencies, DVC searches + backward from the target stage(s) in the corresponding pipeline(s). This means + DVC will not push files referenced in later stage(s) than `targets`. 
-- `-R`, `--recursive` - the `targets` value is expected to be a directory path. - With this option, `dvc pull` determines the files to upload by searching the - named directory, and its subdirectories, for DVC-files for which to upload - data. Along with providing a `target`, or `target` along with `--with-deps`, - it is yet another way to limit the scope of DVC-files to upload. +- `-R`, `--recursive` - `targets` is expected to contain directory path(s). + Determines the files to upload by searching each target directory and its + subdirectories for DVC-files to inspect. Along with providing a `target`, or + `target` and `--with-deps`, this is another way to limit the scope of + DVC-files to upload. - `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously while uploading files to the remote cache. The effect is to control the number @@ -198,10 +200,10 @@ $ dvc status --cloud Pipeline is up to date. Nothing to reproduce. ``` -With the first `dvc push` we specified a stage in the middle of the pipeline +With the first `dvc push` we specified a stage in the middle of this pipeline (`matrix-train.p.dvc`) while using `--with-deps`. DVC started with that DVC-file and searched backwards through the pipeline for data files to upload. Because -the `model.p.dvc` stage occurs later in the pipeline, its data was not pushed. +the `model.p.dvc` stage occurs later, its data was not pushed. Then we ran `dvc push` specifying the last stage, `model.p.dvc`, and its data was uploaded. Finally, we ran `dvc push` and `dvc status` with no options to @@ -268,8 +270,8 @@ the local cache compared to the remote. Next we can upload part of the data from the local cache to a remote using the command `dvc push --with-deps STAGE.dvc`. Remember that `--with-deps` searches -backwards from the named DVC-file to locate files to upload, and does not upload -files in subsequent stages. 
+backwards from the target DVC-file(s) to locate files to upload, and does not +upload files in subsequent stages. After doing that we can inspect the remote cache again: diff --git a/static/docs/commands-reference/remote_add.md b/static/docs/commands-reference/remote_add.md index 267c9fd06b..ebd044dd66 100644 --- a/static/docs/commands-reference/remote_add.md +++ b/static/docs/commands-reference/remote_add.md @@ -152,12 +152,12 @@ So, make sure you have the following permissions enabled:
-### Click for an S3 API compatible storage example +### Click for S3 API compatible storage example To communicate with a remote object storage that supports an S3 compatible API -(e.g. [Minio](https://minio.io/), [Wasabi](https://wasabi.com/), -[Eucalyptus](https://www.eucalyptus.cloud/index.html), -[DigitalOcean Spaces](https://www.digitalocean.com/products/spaces/), etc.) you +(e.g. [Minio](https://minio.io/), +[DigitalOcean Spaces](https://www.digitalocean.com/products/spaces/), +[IBM Cloud Object Storage](https://www.ibm.com/cloud/object-storage) etc.) you must explicitly set the `endpointurl` in the configuration: For example: diff --git a/static/docs/commands-reference/remote_modify.md b/static/docs/commands-reference/remote_modify.md index 56f28aaa73..a2df593191 100644 --- a/static/docs/commands-reference/remote_modify.md +++ b/static/docs/commands-reference/remote_modify.md @@ -112,17 +112,23 @@ $ dvc remote modify myremote listobjects true $ dvc remote modify myremote sse AES256 ``` +
+ +
+ +### Click for S3 API compatible storage available options + To communicate with a remote object storage that supports an S3 compatible API -(e.g. [Minio](https://minio.io/), [Wasabi](https://wasabi.com/), -[Eucalyptus](https://www.eucalyptus.cloud/index.html), -[DigitalOcean Spaces](https://www.digitalocean.com/products/spaces/), etc.) you +(e.g. [Minio](https://minio.io/), +[DigitalOcean Spaces](https://www.digitalocean.com/products/spaces/), +[IBM Cloud Object Storage](https://www.ibm.com/cloud/object-storage) etc.) you must explicitly set the `endpointurl` in the configuration: For example: ```dvc -$ dvc remote add -d mybucket s3://path/to/dir -$ dvc remote modify mybucket endpointurl https://object-storage.example.com +$ dvc remote add -d myremote s3://path/to/dir +$ dvc remote modify myremote endpointurl https://object-storage.example.com ``` AWS S3 remote can also be configured entirely via environment variables: diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 30a6d2297b..268a9231df 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -1,80 +1,82 @@ # repro -Rerun commands recorded in the [pipeline](/doc/get-started/pipeline) -[stages](/doc/commands-reference/run) in the same order. Commands to rerun are -determined by recursively analyzing which stages and changes in their -dependencies to find only those that have to be rerun. +Run again commands recorded in the [stages](/doc/commands-reference/run) of one +or more [pipelines](/doc/get-started/pipeline), in the correct order. The +commands to be run are determined by recursively analyzing target stages and +changes in their dependencies. 
## Synopsis ```usage usage: dvc repro [-h] [-q | -v] - [-f] [-s] [-c CWD] [-m] [--dry] [-i] - [-p] [-P] [--ignore-build-cache] [--no-commit] + [-f] [-s] [-c CWD] [-m] [--dry] [-i] [-p] [-P] [-R] + [--ignore-build-cache] [--no-commit] [--downstream] [targets [targets ...]] positional arguments: - target DVC-file to reproduce. + targets DVC-file to reproduce (default - 'Dvcfile'). ``` ## Description -If the [DVC-file](/doc/user-guide/dvc-file-format) (`target`) is omitted, -`Dvcfile` will be assumed. +`dvc repro` provides an interface to run the commands in a computational graph +(a.k.a. pipeline) again, as defined in the stage files (DVC-files) found in the +current workspace. (A pipeline is typically defined using the `dvc run` command, +while data input nodes are defined by the `dvc add` command.) -`dvc repro` provides an interface to rerun the commands in the computational -graph (a.k.a. pipeline) defined by the connected stages (DVC-files) in the -current workspace. By default, this command recursively searches, starting from -the `Dvcfile`, the pipeline stages to find any which have changed. It then -reruns the corresponding commands. The pipeline is mostly defined using the -`dvc run` command, while data input nodes are defined by the `dvc add` command. +There's a few ways to restrict the stages that will be run again by this +command: by specifying stage file(s) as `targets`, or by using the +`--single-item`, `--cwd`, or other options. -There are several ways to restrict the stages to rerun, by listing DVC-files as -targets, or using the `--single-item`, `--pipeline`, or `--cwd` options. +If specific [DVC-files](/doc/user-guide/dvc-file-format) (`targets`) are +omitted, `Dvcfile` will be assumed. + +By default, this command recursively searches in pipeline stages, starting from +the `targets`, to determine which ones have changed. Then it executes the +corresponding commands again. 
`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get source data files, intermediate or final results. It saves all the data files, -intermediate or final results into the DVC local cache (unless `--no-commit` -option is specified) and updates DVC-files with the new checksum information. +intermediate or final results into the DVC cache (unless `--no-commit` option is +specified), and updates stage files with the new checksum information. ## Options -- `-f`, `--force` - rerun the pipeline, reproducing its results, even if no - changes were found. By default this reruns the entire pipeline. To rerun a - single stage, specify the stage name on the command-line along with the - `--single-item` option. +- `-f`, `--force` - reproduce a pipeline, regenerating its results, even if no + changes were found. By default this runs all of its stages but it can be + limited with the `targets` argument and `-s`, `-p`, or `-c` options. - `-s`, `--single-item` - reproduce only a single stage by turning off the - recursive search for changed dependencies. Multiple stages are rerun if - multiple stage names are listed on the command-line. - -- `-c`, `--cwd` - directory within your project to reproduce from. If no target - names are given, it attempts to use `Dvcfile` in the specified directory, if - it exists, for stages to rerun. Instead of using `--cwd` one can alternately - specify a target in a subdirectory as `path/to/target.dvc`. This could be - useful for subdirectories containing a semi-independent pipeline, that can - either be rerun as part of the pipeline in the parent directory, or as an + recursive search for changed dependencies. Multiple stages are run + (non-recursively) if multiple stage files are given as `targets`. + +- `-c`, `--cwd` - directory within your project to reproduce from. If no + `targets` are given, it attempts to use `Dvcfile` in the specified directory. 
+ Instead of using `--cwd`, one can alternately specify a target in a + subdirectory as `path/to/target.dvc`. This option can be useful for example + with subdirectories containing a separate pipeline that can either be + reproduced as part of the pipeline in the parent directory, or as an independent unit. - `--no-commit` - do not save outputs to cache. Useful when running different experiments and you don't want to fill up your cache with temporary files. Use `dvc commit` when you are ready to save your results to cache. -- `-m`, `--metrics` - show metrics after reproduction. The pipeline must have at - least one metrics file defined either with the `dvc metrics` command, or by - the `-M` or `-m` options on the `dvc run` command. +- `-m`, `--metrics` - show metrics after reproduction. The target pipeline(s) + must have at least one metrics file defined either with the `dvc metrics` + command, or by the `-M` or `-m` options on the `dvc run` command. - `--dry` - only print the commands that would be executed without actually executing the commands. - `-i`, `--interactive` - ask for confirmation before reproducing each stage. - The stage is rerun if the user types "y". + The stage is only run if the user types "y". -- `-p`, `--pipeline` - reproduce the whole pipeline that the specified stage - file belongs to. Use `dvc pipeline show target.dvc` to show the entire - pipeline the named stage belongs to. +- `-p`, `--pipeline` - reproduce the entire pipeline(s) that the target stage + file(s) belong(s) to. Use `dvc pipeline show .dvc` to show the parent + pipeline of a target stage. -- `--ignore-build-cache` - in case like `... -> A (changed) -> B -> C` it will +- `--ignore-build-cache` - in cases like `... -> A (changed) -> B -> C` it will reproduce `A` first and then `B` even if `B` was previously executed with the same inputs from `A` (cached). 
It might be useful when we have a common dependency among all stages and want to specify it once (for the stage `A` @@ -82,23 +84,23 @@ option is specified) and updates DVC-files with the new checksum information. `requirements.txt`, we can specify it only once in `A` and omit in `B` and `C`. To be precise - it reproduces all descendants of a changed stage, or the stages following the changed stage, even if their direct dependencies did not - change. Like with the same option on `dvc run`, this is a way to force certain - stages to run if they would not otherwise be rerun (thus the name). This can - be useful also for pipelines containing stages that produce nondeterministic - (semi-random) outputs. For nondeterministic stages the outputs can vary on - each execution, meaning the cache cannot be trusted for such stages. + change. Like with the same option on `dvc run`, this is a way to force stages + without changes to run again. This can also be useful for pipelines containing + stages that produce nondeterministic (semi-random) outputs. For + nondeterministic stages the outputs can vary on each execution, meaning the + cache cannot be trusted for such stages. - `-h`, `--help` - prints the usage/help message, and exit. - `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if all - stages are up to date or if all stages are successfully rerun, otherwise exit + stages are up to date or if all stages are successfully run, otherwise exit with 1. The command run by the stage is free to make output irregardless of this flag. - `-v`, `--verbose` - displays detailed tracing information. -- `--downstream` - rerun the commands down the pipeline of the target file - including the one in it. +- `--downstream` - only run again the stages after the given `targets` in their + corresponding pipeline(s), including the target stages themselves. 
## Examples @@ -126,6 +128,11 @@ $ dvc run -f Dvcfile -d numbers.txt -d process.py -M count.txt \ "python process.py numbers.txt > count.txt" ``` +> Note that using `-f Dvcfile` with `dvc run` above is optional, the stage file +> name would otherwise default to `count.txt.dvc`. We use `Dvcfile` in this +> example because that's the default stage file name `dvc repro` will read +> without having to provide any `targets`. + Where `process.py` is a script which for simplicity just prints the number of lines: @@ -151,12 +158,15 @@ $ tree └── text.txt <---- text file to process ``` +You may want to check the contents of `Dvcfile` and `count.txt` for later +reference. + Ok, now, let's run the `dvc repro` command (remember, by default it reproduces outputs defined in `Dvcfile`, `count.txt` in this case): ```dvc $ dvc repro - +WARNING: assuming default target 'Dvcfile'. Stage 'filter.dvc' didn't change. Stage 'Dvcfile' didn't change. Pipeline is up to date. Nothing to reproduce. @@ -176,26 +186,25 @@ If we now run `dvc repro`, that's what we should see: ```dvc $ dvc repro - +WARNING: assuming default target 'Dvcfile'. Stage 'filter.dvc' didn't change. Stage 'Dvcfile' changed. Reproducing 'Dvcfile' Running command: python process.py numbers.txt > count.txt - +Output 'count.txt' doesn't use cache. Skipping saving. Saving information to 'Dvcfile'. ``` -You can check now that `Dvcfile` and `count.txt` have been updated with the new +You can now check that `Dvcfile` and `count.txt` have been updated with the new information, new `md5` checksums and a new result respectively. ## Examples: Downstream -There is also an option which allows one to reproduce results from a specific -command in the pipeline. Enabling this option requires adding flag -`--downstream` to command `dvc repro`. - -To demonstrate working of this let us make a change in `text.txt`: +The `--downstream` option allows us to only reproduce results from commands +after a specific stage in a pipeline. 
To demonstrate how it works, let's make a +change in `text.txt` (the input of our first stage, defined in the previous +example): ``` ... @@ -203,18 +212,19 @@ The answer to universe is 42 - The Hitchhiker's Guide to the Galaxy ``` -Now running the command `dvc repro --downstream` results in the following -output: +Now, using the `--downstream` option results in the following output: ```dvc +$ dvc repro --downstream WARNING: assuming default target 'Dvcfile'. Stage 'Dvcfile' didn't change. Pipeline is up to date. Nothing to reproduce. ``` -The reason being that the `text.txt` is a file which is not directly dependent -on `Dvcfile`. Instead it is dependent on `filter.dvc` which is above our target -file in the pipeline. +The reason being that the `text.txt` is a file which is not a dependency in the +target DVC-file (`Dvcfile` by default). Instead, it's a dependency of `filter.dvc`, +which happens before the target stage in this pipeline (shown above in the +following figure). ```dvc $ dvc pipeline show --ascii diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md index afa035a019..f52e9a0fce 100644 --- a/static/docs/commands-reference/run.md +++ b/static/docs/commands-reference/run.md @@ -23,17 +23,17 @@ positional arguments: `dvc run` provides an interface to build a computational graph (aka pipeline). It's a way to describe commands, data inputs and intermediate results that went into a model (or other data results). By explicitly specifying a list of -dependencies (with `-d` option) and outputs (with `-o`, `-O`, or `-M` options) -DVC can connect individual stages (commands) into a directed acyclic graph -(DAG). `dvc repro` provides an interface to check state and reproduce this graph -later. This concept is similar to the one of the `Makefile` but DVC captures -data and caches data artifacts along the way.
Check this +dependencies (with `-d` option) and outputs (with `-o`, `-O`, `-m`, or `-M` +options) DVC can connect individual stages (commands) into a directed acyclic +graph (DAG). `dvc repro` provides an interface to check state and reproduce this +graph later. This concept is similar to the one of the `Makefile` but DVC +captures data and caches data artifacts along the way. Check this [example](/doc/get-started/example-pipeline) to learn more and try to build a pipeline. Unless the `-f` options is used, by default the DVC-file name generated is -`.dvc`, where `` is file name of the first output (`-o`, `-O`, or -`-M` option). If neither `-f`, nor outputs are specified, the stage name +`.dvc`, where `` is file name of the first output (`-o`, `-O`, `-m`, +or `-M` option). If neither `-f`, nor outputs are specified, the stage name defaults to `Dvcfile`. Since `dvc run` provides a way to build a graph of computations, using @@ -90,9 +90,9 @@ be no cycles, etc. - `-f`, `--file` - specify stage file name. By default the DVC-file name generated is `.dvc`, where `` is file name of the first output - (`-o`, `-O`, or `-M` option). The stage file is placed in the same directory - where `dvc run` is run by default, but `-f` can be used to change this - location, by including a path in the provided value (e.g. + (`-o`, `-O`, `-m`, or `-M` option). The stage file is placed in the same + directory where `dvc run` is run by default, but `-f` can be used to change + this location, by including a path in the provided value (e.g. `-f stages/stage.dvc`). 
- `-c`, `--cwd` - deprecated, use `-f` and `-w` to change location and working diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index c7ccd9c7eb..8387265238 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -1,8 +1,8 @@ # status -Show changes in the [pipeline](/doc/get-started/pipeline) and mismatches either -between the local cache and local files, or between the local cache and remote -cache. +Show changes in the [pipeline(s)](/doc/get-started/pipeline), as well as +mismatches either between the local cache and local files, or between the local +cache and remote cache. ## Synopsis @@ -17,12 +17,12 @@ positional arguments: ## Description -`dvc status` searches for changes in the pipeline, either showing which -[stages](/doc/commands-reference/run) have changed in the local workspace and -must be reproduced (with `dvc repro`), or differences between the local cache -and remote cache (meaning `dvc push` or `dvc pull` should be run to synchronize -them). The two modes, _local_ and _cloud_ are triggered by using the `--cloud` -or `--remote` options: +`dvc status` searches for changes in the existing pipeline(s), either showing +which [stages](/doc/commands-reference/run) have changed in the local workspace +and must be reproduced (with `dvc repro`), or differences between the local +cache and remote cache (meaning `dvc push` or `dvc pull` should be run to +synchronize them). 
The two modes, _local_ and _cloud_ are triggered by using the +`--cloud` or `--remote` options: | Mode | CLI Option | Description | | ------ | ---------- | ----------------------------------------------------------------------------------------------------------------------------- | @@ -33,14 +33,14 @@ or `--remote` options: DVC determines data and code files to compare by analyzing all DVC-files in the current workspace (`--all-branches` and `--all-tags` in the `cloud` mode compare multiple workspaces - across all branches or tags). The comparison can be -limited to specific DVC-files (stages) by listing them as `targets`. Changes are -reported only against the named `targets`. When combined with the `--with-deps` -option, a search is made for changes in other stages that affect the target. +limited to specific DVC-files by listing them as `targets`. Changes are reported +only against the given `targets`. When combined with the `--with-deps` option, a +search is made for changes in other stages that affect the target. In the `local` mode, changes are detected through the checksum of every file -listed in every DVC-file in the pipeline against the corresponding file in the -file system. The output indicates the detected changes, if any. If no -differences are detected, `dvc status` prints this message: +listed in every DVC-file in question against the corresponding file in the file +system. The output indicates the detected changes, if any. If no differences are +detected, `dvc status` prints this message: ```dvc $ dvc status @@ -48,7 +48,7 @@ differences are detected, `dvc status` prints this message: ``` This says that no differences were detected, and therefore that no stages would -be rerun if `dvc repro` were executed. +be run again if `dvc repro` were executed. If instead, differences are detected, `dvc status` lists those changes. For each DVC-file (stage) with differences, the _dependencies_ and/or _outputs_ that @@ -56,7 +56,7 @@ differ are listed. 
For each item listed, either the file name or the checksum is shown, and additionally a status word is shown describing the change: - For the local workspace: - - _changed_ means the named file has changed + - _changed_ means the file has changed - For comparison against a remote cache: - _new_ means the file exists in the local cache but not the remote cache - _deleted_ means the file doesn't exist in the local cache, but exists in the @@ -72,11 +72,11 @@ cache. For the typical process to update workspaces, see ## Options -- `-d`, `--with-deps` - finds changes by tracking dependencies to the named - target DVC-file(s). This option only has effect when one or more `targets` are - specified. By traversing each stage dependencies, DVC searches backward - through the pipeline from the named target(s). This means DVC will not show - changes occurring later in the pipeline than the named target(s). Applies +- `-d`, `--with-deps` - determines files to check by tracking dependencies to + the target DVC-file(s) (stages). This option only has effect when one or more + `targets` are specified. By traversing all stage dependencies, DVC searches + backward from the target stage(s) in the corresponding pipeline(s). This means + DVC will not show changes occurring in later stage(s) than `targets`. Applies whether or not `--cloud` is specified. - `-c`, `--cloud` - enables comparison against a remote cache. 
If no `--remote` diff --git a/static/docs/commands-reference/version.md b/static/docs/commands-reference/version.md index 183d1d8ae3..c6fb98b18c 100644 --- a/static/docs/commands-reference/version.md +++ b/static/docs/commands-reference/version.md @@ -19,7 +19,7 @@ system/environment: | [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit hash in case of a development version) | | `Python version` | Version of the Python being used for the project in which DVC is initialized | | `Platform` | Information about the operating system of the machine | -| [`Binary`](#output-of-binary) | Shows whether the package is installed from a binary release or source | +| [`Binary`](#what-we-mean-by-binary) | Shows whether the package is installed from a binary release or source | | `Cache` | [Type of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) supported between the DVC workspace and the cache directory | | `Filesystem type` | Shows the filesystem type (eg. ext4, FAT, etc.) and mount point of workspace and the cache directory | @@ -48,42 +48,39 @@ The detail of DVC version depends upon the way of installing the project. part is the git commit hash which is one of the commits in the `master` branch (also, optional suffix `.mod` means that code is modified). -#### Output of Binary +#### What we mean by "Binary" -The detail of Binary depends upon the way of downloading a package. +The detail of `Binary` depends on the way DVC was downloaded and +[installed](/doc/get-started/install). -- **`Binary: True`** - This output is displayed when DVC package is installed as - a: +- **`Binary: True`** - displayed when DVC is downloaded/installed as one of: - - Debian package (`.deb`) - file used to install a software in Linux - distributions like Ubuntu. - - Red Hat package (`.rpm`) - file used to install a software in Linux based - distributions such as Fedora, CentOS, etc.
- - Windows executable (`.exe`) - file used to install packages for Windows. - - PKG file (`.pkg`) - file used to install packages for macOS. + - Debian package (`.deb`) - file used to install packages in several Linux + distributions, like Ubuntu. + - Red Hat package (`.rpm`) - file used to install packages in some Linux based + distributions, such as Fedora, CentOS, etc. + - PKG file (`.pkg`) - file used to install apps on macOS. + - Windows executable (`.exe`) - file used to install applications on Windows. - All these files are bundled as a binary file which is the compiled version of - a software which means it has already been built as machine code and can be - understood by computer systems. In our case, we use - [PyInstaller](https://pythonhosted.org/PyInstaller/) to bundle our source code - into a binary package. + These downloads are available from our [home page](/). They ultimately contain + a binary bundle, which is the executable version of a software program, + meaning that it will run natively on a specific platform (Linux, Windows, + Mac). In our case, we use [PyInstaller](https://pythonhosted.org/PyInstaller/) + to bundle our source code into the binary package app. -* **`Binary: False`** - This output is displayed when DVC package is downloaded - from: +* **`Binary: False`** - shown when DVC is downloaded and installed from: - - [DVC's GitHub repository](https://github.com/iterative/dvc) - raw source - code is hosted. + - [DVC's GitHub repository](https://github.com/iterative/dvc) - where core + source code is hosted. - [The Python Package Index (PyPI)](https://pypi.org/project/dvc/) - source - code is stored as a python package. + code is stored as a Python package. - [Homebrew package manager](https://github.com/iterative/homebrew-dvc) (for macOS systems) - source code is stored as Python package. - This method of setting up downloads the project's source code which is simply - human understandable code and not compiled. 
A user has to follow certain setup - instructions to build the project and then use it. Some projects use a - `Makefile` to build their project from the source code. We include setup - instructions, written in `setup.py`, in our code which handles its compilation - and henceforth, setting it up for usage. + This method of installation involves downloading DVC source code, and + following certain setup instructions (see the + [development](/doc/user-guide/development) guide) to build the application + before being able to use it. ## Options diff --git a/static/docs/get-started/connect-code-and-data.md b/static/docs/get-started/connect-code-and-data.md index 477cf12e51..fe59822e33 100644 --- a/static/docs/get-started/connect-code-and-data.md +++ b/static/docs/get-started/connect-code-and-data.md @@ -60,9 +60,10 @@ $ git commit -m "add code"
-Having installed the `src/prepare.py` script in your repo, the following DVC -command transforms it into a reproducible **stage** for the ML **pipeline** -(describes in the next chapter). +Having installed the `src/prepare.py` script in your repo, the following command +transforms it into a reproducible +[stage](/doc/user-guide/dvc-files-and-directories) for the ML pipeline we're +building (described in the [next chapter](/doc/get-started/example-pipeline)). ```dvc $ dvc run -f prepare.dvc \ @@ -124,14 +125,14 @@ wdir: . > instructions on how to build a ML model (data file) from previous data files > (or directories). -We would recommend to try to read a few next chapters first, before switching to -other documents. Hopefully, `dvc run` and `dvc repro` will make more sense after +We would recommend to read a few next chapters first, before switching to other +documents. Hopefully, `dvc run` and `dvc repro` will make more sense after finishing up this guide. You can always refer to the `dvc run` and `dvc repro` documentation to learn the specific details about how they behave and all of their options. Let's briefly mention what the options used above mean for this particular example: -`-f prepare.dvc` specifies a name for the pipeline DVC-file (stage). It's +`-f prepare.dvc` specifies a name for the DVC-file (pipeline stage). It's optional but we highly recommend using it to make your project structure more readable. diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index 32de08d8e0..ca1bdb1e78 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -1,12 +1,13 @@ # Example: Pipelines To show DVC in action, let's play with an actual machine learning scenario. -Let's explore the natural language processing (NLP) problem of predicting tags -for a given StackOverflow question. 
For example, we want one classifier which -can predict a post that is about the Python language by tagging it `python`. -This is a short version of the [Tutorial](/doc/tutorial). +Let's explore the natural language processing +([NLP](https://en.wikipedia.org/wiki/Natural_language_processing)) problem of +predicting tags for a given StackOverflow question. For example, we want one +classifier which can predict a post that is about the Python language by tagging +it `python`. This is a short version of the [Tutorial](/doc/tutorial). -In this example, we will focus on building a simple pipeline that takes an +In this example, we will focus on building a simple ML pipeline that takes an archive with StackOverflow posts and trains the prediction model and saves it as an output. Check [get started](/doc/get-started) to see links to other examples, tutorials, use cases if you want to cover other aspects of the DVC. The pipeline @@ -61,8 +62,12 @@ Install the required dependencies: $ pip install -r code/requirements.txt ``` -Then, we are creating the pipeline step-by-step, utilizing the same set of -commands that are described in the [get started](/doc/get-started) chapters. +Next, we will create a pipeline step-by-step, utilizing the same set of commands +that are described in earlier [get started](/doc/get-started) chapters. + +> Note that it's possible to define more than one pipeline in each project. This +> will be determined by the interdependencies between DVC-files, mentioned +> below. - Initialize DVC repository (run it inside your Git repository): @@ -82,6 +87,9 @@ $ dvc add data/Posts.xml.zip
+When we run `dvc add` `Posts.xml.zip`, DVC creates a +[DVC-file](/doc/user-guide/dvc-file-format). + ### Expand to learn more about DVC internals `dvc init` created a new directory `example/.dvc/` with `config`, `.gitignore` @@ -90,8 +98,7 @@ users in general. Users don't interact with these files directly. Check [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn more. -When we run `dvc add` `Posts.xml.zip`, DVC creates a -[DVC-file](/doc/user-guide/dvc-file-format) with no dependencies, a.k.a. and +Note that the DVC-file created by `dvc add` has no dependencies, a.k.a. an "_orphan_ stage file": ```yaml @@ -120,15 +127,16 @@ $ git add data/Posts.xml.zip.dvc data/.gitignore $ git commit -m "add dataset" ``` -## Define staged +## Define stages -Each stage is described by providing a command to run, input data it takes and a +Each [stage](/doc/user-guide/dvc-files-and-directories) – the parts of a +pipeline – is described by providing a command to run, input data it takes and a list of output files. DVC is not Python or any other language specific and can wrap any command runnable via CLI. -- The first actual stage, extract XML from the archive. Note that we don't need - to run `dvc add` on `Posts.xml`, `dvc run` saves (commits into the cache, - takes the file under DVC control) automatically: +- The first stage is to extract XML from the archive. Note that we don't need to + run `dvc add` on `Posts.xml` below, `dvc run` saves the data automatically + (commits into the cache, takes the file under DVC control): ```dvc $ dvc run -d data/Posts.xml.zip \ @@ -139,10 +147,12 @@ $ dvc run -d data/Posts.xml.zip \
+Similar to `dvc add`, `dvc run` creates a +[DVC-file](/doc/user-guide/dvc-file-format) (or "stage file"). + ### Expand to learn more about DVC internals -Similar to `dvc add`, `dvc run` creates a -[DVC-file](/doc/user-guide/dvc-file-format) (or "stage file"): +Here's what the DVC-file (stage file, with dependencies `deps`) looks like: ```yaml cmd: ' unzip data/Posts.xml.zip -d data' @@ -235,8 +245,8 @@ By analyzing dependencies and outputs in DVC-files, we can restore the full chain of commands (DAG) we need to apply. This is important when you run `dvc repro` to reproduce the final or intermediate result. -`dvc pipeline show` helps to visualize the pipeline (run it with `-c` option to -see actual commands instead of DVC-files): +`dvc pipeline show` helps to visualize pipelines (run it with `-c` option to see +actual commands instead of DVC-files): ```dvc $ dvc pipeline show --ascii evaluate.dvc @@ -297,7 +307,7 @@ $ dvc metrics show auc.metric: 0.620091 ``` -It's time to save the pipeline. You can confirm that we do not save pickle model +It's time to save our pipeline. You can confirm that we do not save pickle model files or initial data sets into Git using the `git status` command. We are just saving a snapshot of the DVC-files that describe data and code versions and relationships between them. diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 025f09bd60..3c323cf455 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -308,15 +308,15 @@ $ dvc run -f Dvcfile \ python train.py ``` -Similar to `dvc add`, `dvc run` creates a single DVC-file (`Dvcfile` in this -case, specified by the `-f` option). It puts all outputs (`-o`) under DVC +Similar to `dvc add`, `dvc run` creates a single DVC-file (forced to have file +name `Dvcfile` with the `-f` option). It puts all outputs (`-o`) under DVC control the same way as `dvc add` does. 
Unlike, `dvc add`, `dvc run` also tracks dependencies (`-d`) and the command (`python train.py`) that was run to produce the result. -`dvc repro` runs `Dvcfile` when some dependencies changed (for example, we added -new images like we did when we built the second model version). It also updates -outputs and puts them into the cache. +`dvc repro` will run `Dvcfile` if any of its dependencies (`-d`) changed, for +example after we added new images like we did when we built the second model +version. It also updates outputs and puts them into the cache. If `dvc add` and `dvc checkout` provide a basic mechanism to version control large data files or models, `dvc run` and `dvc repro` provide a build system for @@ -340,8 +340,8 @@ changed. Here where DVC pipelines feature comes very handy and was designed for. We touched it briefly when we described `dvc run` and `dvc repro` at the very end. -Next step here would be splitting the script into two steps and utilize DVC -pipelines. Check this [example](/doc/get-started/example-pipeline) to get a +The next step here would be splitting the script into two steps and utilizing +DVC pipelines. Check this [example](/doc/get-started/example-pipeline) to get a hands-on experience with them and try to apply it here. Don't hesitate to join our [community](/chat) to ask any questions! diff --git a/static/docs/get-started/install.md b/static/docs/get-started/install.md index 358975167a..efe3d6c479 100644 --- a/static/docs/get-started/install.md +++ b/static/docs/get-started/install.md @@ -1,7 +1,7 @@ # Install -There are three ways to install DVC: `pip`, OS-specific package, and Homebrew -(depending on your OS some of these ways may be not available for you). +There are three recommended ways to install DVC: OS-specific package/installer, +`pip`, and Homebrew. (Depending on your OS, some of these may be not available.) To install DVC from terminal, run: @@ -17,10 +17,11 @@ $ pip install dvc > This is valid for `pip install` option only. 
Other ways to install DVC already > include support for all remotes. -As an easier option, self-contained binary packages are also available. Use the -Download button in the [home page](/) to the left or get them -[here](https://github.com/iterative/dvc/releases/). We also provide `deb`, `rpm` -and `homebrew` repositories: +The easiest option, self-contained binary packages (or Windows installer), are +available by using the big "Download" button in the [home page](/). You may also +get them [here](https://github.com/iterative/dvc/releases/). + +We also provide `deb`, `rpm` and `homebrew` repositories:
@@ -62,36 +63,15 @@ $ brew cask install iterative/homebrew-dvc/dvc
-
- -### Expand to install from pkg installer (Mac OS) - -Click the `Download` button on the main page and download `.pkg` to install it. -Alternatively, you can always find the latest version of this installer -[here](https://github.com/iterative/dvc/releases). - -
- -
- -### Expand to install using installer (Windows) - -If you have any problems with `pip install`, click the `Download` button on the -main page and download `.exe` to install DVC. Alternatively, you can always find -the latest version of this binary installer here: -[here](https://github.com/iterative/dvc/releases). - -
- -See [Development](/doc/user-guide/development) if you want to install the most -recent development version. +To install the most recent development version, see our +[development](/doc/user-guide/development) guide. ### Shell autocomplete -Visit [Shell Autocomplete](/doc/user-guide/autocomplete) section to find and -install the completion scripts for your shell. +Visit the [Shell Autocomplete](/doc/user-guide/autocomplete) guide to learn how +to install the completion scripts for your command-line shell. ### Editors and IDEs integration -Visit [Vim and IDE Integrations](/doc/user-guide/plugins) for reference on how -to enable shell syntax highlighting and install DVC support for different IDEs. +Visit [Vim and IDE Integrations](/doc/user-guide/plugins) for how to enable +shell syntax highlighting and install DVC support on different IDEs. diff --git a/static/docs/get-started/metrics.md b/static/docs/get-started/metrics.md index 6c2b3f05fb..dc55b710ac 100644 --- a/static/docs/get-started/metrics.md +++ b/static/docs/get-started/metrics.md @@ -1,11 +1,10 @@ # Experiment Metrics -The last stage we would like to add to the pipeline is the evaluation stage. -Data science is a metric-driven R&D-like process and `dvc metrics` along with -DVC metric files provide a framework to capture and compare experiments -performance. It doesn't require installing any databases or instrumenting your -code to use some API, all is tracked by Git and is stored in Git or DVC remote -storage: +The last stage we would like to add to our pipeline is its evaluation. Data +science is a metric-driven R&D-like process and `dvc metrics` along with DVC +metric files provide a framework to capture and compare experiments performance.
+It doesn't require installing any databases or instrumenting your code to use +some API, all is tracked by Git and is stored in Git or DVC remote storage: ```dvc $ dvc run -f evaluate.dvc \ @@ -27,7 +26,7 @@ Let's again commit and save results: ```dvc $ git add evaluate.dvc auc.metric -$ git commit -m "add evaluation step to the pipeline" +$ git commit -m "add evaluation step to pipeline" $ dvc push ``` diff --git a/static/docs/get-started/pipeline.md b/static/docs/get-started/pipeline.md index 9821afc014..b1bb6770f3 100644 --- a/static/docs/get-started/pipeline.md +++ b/static/docs/get-started/pipeline.md @@ -30,12 +30,14 @@ Let's commit DVC-files that describe our pipeline so far: ```dvc $ git add data/.gitignore .gitignore featurize.dvc train.dvc -$ git commit -m "add featurization and train steps to the pipeline" +$ git commit -m "add featurization and train steps to pipeline" $ dvc push ``` -This example is simplified just to show you an idea of the pipeline, check -[example](/doc/get-started/example-pipeline) or complete -[tutorial](/doc/tutorial) to see the NLP processing pipeline end-to-end. +This example is simplified just to show you a basic pipeline, see a more +advanced [example](/doc/get-started/example-pipeline) or complete +[tutorial](/doc/tutorial) to create a +[NLP](https://en.wikipedia.org/wiki/Natural_language_processing) pipeline +end-to-end. > See also the `dvc pipeline` command. diff --git a/static/docs/get-started/visualize.md b/static/docs/get-started/visualize.md index 1f6baa6a26..00620da547 100644 --- a/static/docs/get-started/visualize.md +++ b/static/docs/get-started/visualize.md @@ -4,7 +4,7 @@ Now that we have built our pipeline, we need a good way to visualize it to be able to wrap our heads around it. Luckily, DVC allows us to do that without leaving the terminal, making the experience distraction-less. -We are using `--ascii` option below to better illustrate the pipeline. 
Please, +We are using `--ascii` option below to better illustrate this pipeline. Please, refer to the `dvc pipeline show` documentation to explore other options this command supports (e.g. `.dot` files that can be used then in other tools). diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 8eb9a3d8d7..5f944e1f94 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -60,14 +60,11 @@ need to run `dvc unprotect` or `dvc remove` first (check the ## Data file internals -If you take a look at the DVC-file, you will see that only outputs are defined -in `outs`. In this file, only one output is defined. The output contains the -data file path in the repository and md5 cache. This md5 cache determines a -location of the actual content file in DVC cache directory `.dvc/cache`. - -> Output from DVC-files defines the relationship between the data file path in a -> repository and the path in a cache directory. See also -> [DVC-File Format](/doc/user-guide/dvc-file-format) +If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by +`dvc add`, you will see that only outputs are defined in `outs`. In this file, +only one output is defined. The output contains the data file path in the +repository and md5 cache. This md5 cache determines a location of the actual +content file in DVC cache directory `.dvc/cache`. ```dvc $ cat data/Posts.xml.zip.dvc @@ -81,6 +78,9 @@ $ du -sh .dvc/cache/ec/* 41M .dvc/cache/ec/1d2935f811b77cc49b031b999cbf17 ``` +> Outputs from DVC-files define the relationship between the data file path in a +> repository and the path in a cache directory. + Keeping actual file content in a cache directory and a copy of the caches in user workspace during `$ git checkout` is a regular trick that [Git-LFS](https://git-lfs.github.com/) (Git for Large File Storage) uses. 
This @@ -186,16 +186,16 @@ and does some additional work if the command was successful: 1. DVC transforms all the outputs `-o` files into data files. It is like applying `dvc add` for each of the outputs. As a result, all the actual data - files content goes to the cache directory `.dvc/cache` and each of the - filenames will be added to `.gitignore`. + files content goes to the cache directory `.dvc/cache` and each of the file + names will be added to `.gitignore`. -2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` DVC-file - in the workspace with information about this stage in the pipeline, see - [DVC-File Format](/doc/user-guide/dvc-file-format). Note that the name of +2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage + file in the workspace with information about this pipeline stage. (See + [DVC-File Format](/doc/user-guide/dvc-file-format)). Note that the name of this file could be specified by using the `-f` option, for example `-f extract.dvc`. -Let's take a look at the resulting DVC-file from the above example: +Let's take a look at the resulting stage file created by `dvc run` above: ```dvc $ cat Posts.xml.dvc @@ -234,8 +234,8 @@ Posts.xml The output file `Posts.xml` was transformed by DVC into a data file in accordance with the `-o` option. You can find the corresponding cache file with -the checksum, which starts with `c1fa36d` as we can see in the DVC-file -`Posts.xml.dvc`: +the checksum, which starts with `c1fa36d` as we can see in the `Posts.xml.dvc` +stage file: ```dvc $ ls .dvc/cache/ @@ -284,10 +284,10 @@ Reproducing 'Posts-test.tsv.dvc': Positive size 2049, negative size 97951 ``` -The result of the steps are two DVC-files corresponding to each of the commands -`Posts-test.tsv.dvc` and `Posts.tsv.dvc`. Also, a `code/conf.pyc` file was -created. This type of file should not be tracked by Git. Let’s manually include -this type of file into `.gitignore`. 
+The result of the steps are two stage files corresponding to each of the +commands: `Posts-test.tsv.dvc` and `Posts.tsv.dvc`. Also, a `code/conf.pyc` file +was created. This type of file should not be tracked by Git. Let’s manually +include this type of file into `.gitignore`. ```dvc $ git status -s @@ -306,9 +306,9 @@ $ git add . $ git commit -m "Process to TSV and separate test and train" ``` -Let’s run and commit the following steps of the pipeline. Define the feature -extraction step which takes train and test TSVs and generates corresponding -matrix files: +Let’s run and save the following commands for our pipeline. First, define the +feature extraction stage, that takes `train` and `test` TSVs and generates +corresponding matrix files: ```dvc $ dvc run -d code/featurization.py -d code/conf.py \ @@ -349,15 +349,18 @@ Reproducing 'Dvcfile': python code/evaluate.py ``` -The model evaluation step is the last one. To make it a reproducibility goal by -default we specify a DVC-file named `Dvcfile`. This will be discussed in the -next chapter in more details. +> Note that using `-f Dvcfile` with `dvc run` above isn't necessary as the +> default stage file name is `Dvcfile` when there are no outputs (option `-o`). + +The model evaluation step is the last one. To help in the pipeline's +reproducibility, we specify stage file name `Dvcfile`. (This will be discussed +in more detail in the next chapter.) Note that the output file `data/eval.txt` was transformed by DVC into a metric file in accordance with the `-M` option. -The result of the last three run commands execution is three DVC-files and a -modified .gitignore file. All the changes should be committed into Git. +The result of the last three `dvc run` commands execution is three stage files +and a modified .gitignore file. All the changes should be committed into Git: ```dvc $ git status -s @@ -390,4 +393,4 @@ focus is DVC, not ML modeling and we use a relatively small dataset without any advanced ML techniques. 
In the next chapter we will try to improve the metrics by changing our modeling -code and using reproducibility in the pipeline regeneration. +code and using reproducibility in our pipeline regeneration. diff --git a/static/docs/tutorial/preparation.md b/static/docs/tutorial/preparation.md index fd9cf54c0e..6462365f9a 100644 --- a/static/docs/tutorial/preparation.md +++ b/static/docs/tutorial/preparation.md @@ -53,7 +53,7 @@ $ pip install -r code/requirements.txt Now DVC software should be installed. The easiest way to install DVC is a system dependent package. DVC supports all common operating systems: Mac OS X, Linux and Windows. You can find the latest version of the package on the -[home page](). +[home page](/). Alternatively, you can install DVC by Python package manager — PIP if you use Python: diff --git a/static/docs/tutorial/reproducibility.md b/static/docs/tutorial/reproducibility.md index 106a9d1031..133c3005d0 100644 --- a/static/docs/tutorial/reproducibility.md +++ b/static/docs/tutorial/reproducibility.md @@ -5,7 +5,7 @@ The most exciting part of DVC is reproducibility. > Reproducibility is the time you are getting benefits out of DVC instead of -> spending time defining the ML pipelines. +> spending time managing ML pipelines. DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change. @@ -19,22 +19,23 @@ This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of DAG nodes. -If you run `repro` on any created DVC-file from our repository, nothing happens -because nothing was changed in the defined pipeline. +If you run `repro` on any [DVC-file](/doc/user-guide/dvc-file-format) from our +repository, nothing happens because nothing was changed in the pipeline defined +in the project. There's nothing to reproduce. 
```dvc -# Nothing to reproduce $ dvc repro model.p.dvc ``` -By default, `dvc repro` reads DVC-files named `Dvcfile`: +> By default, `dvc repro` tries to read the DVC-file with name `Dvcfile`, like +> the one we define in the previous chapter. ```dvc -# Reproduce Dvcfile. -# But there is still nothing to reproduce: $ dvc repro ``` +Tries to reproduce the same pipeline... But there is still nothing to reproduce. + ## Adding bigrams Our NLP model was based on [unigrams](https://en.wikipedia.org/wiki/N-gram) @@ -61,7 +62,7 @@ bag_of_words = CountVectorizer(stop_words='english', ngram_range=(1, 2)) ``` -Reproduce the pipeline: +Reproduce our changed pipeline: ```dvc $ dvc repro @@ -85,7 +86,7 @@ Reproducing 'Dvcfile': The process started with the feature creation step because one of its parameters was changed — the edited source code `code/featurization.py`. All dependent -steps were regenerated as well. +stages were run again as well. Let’s take a look at the metric’s change. The improvement is close to zero (+0.0075% to be precise): diff --git a/static/docs/understanding-dvc/collaboration-issues.md b/static/docs/understanding-dvc/collaboration-issues.md index 35984ea993..b1fa6f8744 100644 --- a/static/docs/understanding-dvc/collaboration-issues.md +++ b/static/docs/understanding-dvc/collaboration-issues.md @@ -8,7 +8,7 @@ the community and the industry now, when ML algorithms and methods are no longer simply "tribal knowledge" but are still difficult to implement, reuse, and manage. -To make progress in this challenge, many areas of the ML experimentation process +To make progress on this challenge, many areas of the ML experimentation process need to be formalized. Many common questions need to be answered in an unified, principled way: @@ -34,7 +34,7 @@ principled way: 4. Reproducibility.
- - How do you rerun a model's evaluation without re-training the model and + - How do you run a model's evaluation again without re-training the model and preprocessing a raw dataset? 5. Managing and sharing large data files. diff --git a/static/docs/understanding-dvc/core-features.md b/static/docs/understanding-dvc/core-features.md index 2ec95337de..eb49eaa9b1 100644 --- a/static/docs/understanding-dvc/core-features.md +++ b/static/docs/understanding-dvc/core-features.md @@ -12,7 +12,7 @@ 4. **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML library agnostic: Keras, Tensorflow, PyTorch, scipy, etc. -5. **Open-sourced** and **Self-served**. DVC is free and doesn't require any +5. **Open-sourced** and **Self-served**: DVC is free and doesn't require any additional services. 6. DVC supports cloud storage (AWS S3, Azure Blob Storage and GCP storage) for diff --git a/static/docs/understanding-dvc/existing-tools.md b/static/docs/understanding-dvc/existing-tools.md index 006d1ddf76..9d9b9eec97 100644 --- a/static/docs/understanding-dvc/existing-tools.md +++ b/static/docs/understanding-dvc/existing-tools.md @@ -6,7 +6,7 @@ There is one common opinion regarding data science tooling. Data scientists as engineers are supposed to use the best practices and collaboration software from software engineering. Source code version control system (Git), continuous integration services (CI), and unit test frameworks are all expected to be -utilized in the data science pipeline. +utilized in data science pipelines. But a comprehensive look at data science processes shows that the software engineering toolset does not cover data science needs. Try to answer all the @@ -15,12 +15,12 @@ left wanting for more. ## Experiment management software -To solve data scientists collaboration issues a new type of software was created -called "experiment management software". This software aims to cover the gap -between data scientist needs and the existing toolset. 
+This new type of software was created to solve data scientists collaboration +issues. This software aims to cover the gap between data scientist needs and the +existing toolset. -The experimentation software is usually **graphical user interface** (GUI) -based, in contrast to the existing command line engineering tools. The GUI is a +Experiment management software is usually **graphical user interface** (GUI) +based, in contrast to existing command line engineering tools. The GUI is a bridge to a separate **cloud based environment**. The cloud environment is usually not so flexible as local data scientists environment. And the cloud environment is not fully integrated with the local environment. diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md index ade91a0b1b..6e27d394e0 100644 --- a/static/docs/understanding-dvc/how-it-works.md +++ b/static/docs/understanding-dvc/how-it-works.md @@ -29,7 +29,7 @@ $ git commit -m "CNN plots" ``` -4. DVC can reproduce a pipeline with respect to the pipeline's dependencies: +4. DVC can reproduce a pipeline with respect to its dependencies: ```dvc # The input dataset was changed diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 2fd512a6f1..0ca460b1d8 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -25,7 +25,7 @@ process. and doesn't run any daemons or servers. Nevertheless, DVC can generate images with pipeline and experiment workflow visualization. -3. **Experiment management** software today is mostly designed for enterprise +3. **Experiment management software** today is mostly designed for enterprise usage. An open-sourced experimentation tool example: http://studio.ml/. The differences are: @@ -54,8 +54,8 @@ process. - DVC utilizes a DAG: - - The DAG is defined by DVC-files with filenames `.dvc` or - `Dvcfile`. 
+ - The DAG is defined by [DVC-files](/doc/user-guide/dvc-file-format) (with + file names `.dvc` or `Dvcfile`). - One DVC-file defines one node in the DAG. All DVC-files in a repository make up a single pipeline (think a single Makefile). All DVC-files (and diff --git a/static/docs/understanding-dvc/what-is-dvc.md b/static/docs/understanding-dvc/what-is-dvc.md index 4c117fa8be..b62c41c6a6 100644 --- a/static/docs/understanding-dvc/what-is-dvc.md +++ b/static/docs/understanding-dvc/what-is-dvc.md @@ -1,11 +1,11 @@ # What Is DVC? Data Version Control, or DVC, is **a new type of experiment management -software** that has been built **on top of the existing engineering toolset** +software** that has been built **on top of the existing engineering toolset**, and particularly on a source code version control system (currently Git). DVC reduces the gap between the existing tools and the data scientist needs. This -gives an ability to use the **advantages of the experimentation software while -reusing existing skills and intuition**. +gives an ability to use the advantages of experiment management software while +reusing existing skills and intuition. The underlying source code control system eliminates the need to use external services. Data science experiment sharing and collaboration can be done through @@ -18,34 +18,34 @@ branch or commit. DVC uses a few core concepts: -- **Experiment**: equivalent to a Git version. Each experiment (extract new +- **Experiment**: Equivalent to a Git version. Each experiment (extract new features, change model hyperparameters, data cleaning, add a new data source) should be performed in a separate branch and then merged into the master branch only if the experiment is successful. DVC allows experiments to be integrated into a project's history and NEVER needs to recompute the results after a successful merge. 
-- **Experiment state** or state: equivalent to a Git snapshot (all committed +- **Experiment state** or state: Equivalent to a Git snapshot (all committed files). Git checksum, branch name, or tag can be used as a reference to a experiment state. -- **Reproducibility**: action to reproduce an experiment state. This action +- **Reproducibility**: Action to reproduce an experiment state. This action generates output files based on a set of input files and source code. This action usually changes experiment state. -- **Pipeline**: directed acyclic graph (DAG) or chain of commands to reproduce +- **Pipeline**: Directed acyclic graph (DAG) or chain of commands to reproduce an experiment state. The commands are connected by input and output files. - Pipeline is defined by special **DVC-files** (which act like Makefiles). + Pipelines are defined by special **DVC-files** (which act like Makefiles). -- **Workflow**: set of experiments and relationships among them. Workflow +- **Workflow**: Set of experiments and relationships among them. Workflow corresponds to the entire Git repository. -- **Data files**: cached files (for large files). Data files are stored outside +- **Data files**: Cached files (for large files). Data files are stored outside of the Git repository on a local/shared hard drive or remote storage, but [DVC-files](/doc/user-guide/dvc-file-format) describing that data are stored in Git for DVC needs (to maintain pipelines and reproducibility). -- **Data cache**: directory with all data files on a local hard drive or in +- **Data cache**: Directory with all data files on a local hard drive or in cloud storage, but not in the Git repository. - **Cloud storage** support: available complement to the core DVC features. 
This diff --git a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md index dfa3147f5d..6622526caf 100644 --- a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md +++ b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md @@ -72,7 +72,7 @@ $ git push Your colleague can pull the code and have both `raw` and `clean` instantly appear in his workspace without copying. After this he decides to continue -building the pipeline and process the cleaned up data: +building this pipeline and process the cleaned up data: ```dvc $ git pull diff --git a/static/docs/user-guide/autocomplete.md b/static/docs/user-guide/autocomplete.md index c9a84cfafb..593b9b0ff5 100644 --- a/static/docs/user-guide/autocomplete.md +++ b/static/docs/user-guide/autocomplete.md @@ -85,8 +85,8 @@ The DVC specific completion script is located in this path of our main repository: [dvc/scripts/completion/dvc.zsh](https://github.com/iterative/dvc/blob/master/scripts/completion/dvc.zsh) -Place the completion script in a directory included in `$fpath`, the file should -be named `_dvc`. +Place the completion script in a directory included in `$fpath`, the file name +should be `_dvc`. For example: diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index 0d21096970..d0461cbed2 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -1,6 +1,6 @@ # DVC-File Format -When you add a file (with `dvc add`) or a command (with `dvc run`) to the +When you add a file (with `dvc add`) or a command (with `dvc run`) to a [pipeline](/doc/get-started/pipeline), DVC creates a special text metafile with the `.dvc` file extension (e.g. `process.dvc`), or with the default name `Dvcfile`. DVC-files a.k.a. 
**stage files** contain all the needed information @@ -45,7 +45,7 @@ locked: True On the top level, `.dvc` file consists of such fields: -- `cmd`: a command that is being run in this stage of the pipeline; +- `cmd`: a command that is being run in this stage; - `deps`: a list of dependencies for this stage; - `outs`: a list of outputs for this stage; - `md5`: md5 checksum for this DVC-file;