Commit

Merge branch 'master' into get-started-1.0-experiments
jorgeorpinel committed Jun 26, 2020
2 parents 7871e43 + 87e83e1 commit befb19b
Showing 22 changed files with 424 additions and 488 deletions.
3 changes: 0 additions & 3 deletions config/prismjs/dvc-commands.js
@@ -25,9 +25,6 @@ module.exports = [
'plots modify',
'plots diff',
'plots',
- 'pipeline show',
- 'pipeline list',
- 'pipeline',
'move',
'metrics show',
'metrics diff',
106 changes: 53 additions & 53 deletions content/blog/2020-06-22-dvc-1-0-release.md
@@ -2,13 +2,13 @@
title: 'DVC 1.0 release: new features for MLOps'
date: 2020-06-22
description: |
-  Today we're releasing DVC 1.0. New exciting features that users were waiting
-  for ❤️ . All the details in this blog post.
+  Today we're releasing DVC 1.0 with new exciting features that users were
+  waiting for ❤️. Find all the details in this blog post.
descriptionLong: |
-  Today we're releasing DVC 1.0. New exciting features that users were waiting
-  for ❤️. DVC is a more mature product now with stable release cycles and
-  benchmarks. All the details in this blog post.
+  Today we're releasing DVC 1.0. It brings new exciting features that users
+  were waiting for ❤️. DVC is a more mature product now, with stable release
+  cycles and benchmarks. Find all the details in this blog post.
picture: 2020-06-22/release.png
pictureComment: DVC 1.0 release
@@ -24,31 +24,31 @@ tags:
## Introduction

3 years ago, I was concerned about good engineering standards in data science:
-data versioning, reproducibility, workflow automation - like continuous
-integration and continuous delivery (CI/CD) - but for machine learning. I wanted
-there to be Git for data to make this possible. So I made DVC (Data Version
-Control), which works as version control for data projects.
+data versioning, reproducibility, workflow automation like continuous
+integration and continuous delivery (CI/CD), but for machine learning. I wanted
+there to be a "Git for data" to make all this possible. So I created DVC (Data
+Version Control), which works as version control for data projects.

Technically, DVC codifies your data and machine learning pipelines as text
metafiles (with pointers to actual data in S3/GCP/Azure/SSH), while you use Git
-for the actual versioning. DevOps folks call this approach GitOps or more
-specificaly in this case - _DataOps_ or _MLOps_.
+for the actual versioning. DevOps folks call this approach GitOps or, more
+specifically, in this case _DataOps_ or _MLOps_.

-The new DVC 1.0. is inspired by discussions and contributions from our community
-of data scientists, ML engineers, developers and software engineers.

## DVC 1.0

-The new DVC 1.0 is inspired by discussions and contributions from our
-community - both fresh ideas and bug reports 😅. All these contributions, big
-and small, have a collective impact on DVC's development - I'm confident 1.0
-wouldn't be possible without our community. They tell us what features matter
-most, what approaches work (and what don't!), and what they need from DVC to
-support their ML projects.
+The new DVC 1.0 is inspired by discussions and contributions from our community
+both fresh ideas and bug reports 😅. All these contributions, big and small,
+have a collective impact on DVC's development. I'm confident 1.0 wouldn't be
+possible without our community. They tell us what features matter most, which
+approaches work (and which don't!), and what they need from DVC to support their
+ML projects.

-A few weeks ago we announced the 1.0 prerelease. After lots of helpful feedback
+A few weeks ago we announced the 1.0 pre-release. After lots of helpful feedback
from brave users, it's time to go live. Now, DVC 1.0 is available with all the
-standard installation methods including pip, conda, brew, choco, and
+standard installation methods including `pip`, `conda`, `brew`, `choco`, and
system-specific packages: deb, rpm, msi, pkg. See https://dvc.org/doc/install
for more details.
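For instance, upgrading an existing installation with pip (`conda` and `brew` work analogously; this is just one of the supported methods):

```dvc
$ pip install --upgrade dvc
```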

Expand All @@ -62,21 +62,21 @@ learned in 3 years of this journey and how these are reflected in the new DVC.

Our users taught us that ML pipelines evolve much faster than data engineering
pipelines with data processing steps. People need to change the commands of the
-pipeline often and it was not easy to do with the old DVC files.
+pipeline often and it was not easy to do this with the old DVC-files.

-In DVC 1.0, the DVC file format was changed in three big ways. First, instead of
-multiple DVC stage files (`*.dvc`), each project has a single DVC file
-`dvc.yaml`. By default, all stages go in this single `.yaml` file.
+In DVC 1.0, the DVC metafile format was changed in three big ways. First,
+instead of multiple DVC stage files (`*.dvc`), each project has a single
+`dvc.yaml` file. By default, all stages go in this single YAML file.

-Second, we made clear connections between the `dvc run` command, where pipeline
-stages are defined, and how stages appear in `dvc.yaml`. Many of the `dvc run`
-options are mirrored in the metafile. We wanted to make it far less complicated
-to edit an existing pipeline by making `dvc.yaml` more human readable and
-writable.
+Second, we made clear connections between the `dvc run` command (a helper to
+define pipeline stages), and how stages are defined in `dvc.yaml`. Many of the
+options of `dvc run` are mirrored in the metafile. We wanted to make it far less
+complicated to edit an existing pipeline by making `dvc.yaml` more human
+readable and writable.

-Third, data hash values are no longer stored in the pipeline metafile. This
-approach aligns better with GitOps paradigms and simplifies the usage of DVC by
-tremendously improving metafile human-readability:
+Third, file and directory hash values are no longer stored in the pipeline
+metafile. This approach aligns better with the GitOps paradigms and simplifies
+the usage of DVC by tremendously improving metafile human-readability:

```yaml
stages:
Expand All @@ -99,53 +99,53 @@ stages:
      - dropout
    metrics:
      - logs.csv
-      - summary.json
+      - summary.json:
+          cache: false
    outs:
      - model.pkl
```
-All the hashes have been moved to a special file, `dvc.lock`, which is a lot
-like the old DVC file format. DVC uses the `.lock` file to define what data
-files need to be restored to the workspace from data remotes (cloud storage) and
-if a particular pipeline stage needs to be rerun. In other words, we're
-separating the human-readable parts of the pipeline into `dvc.yaml` and
-auto-generated "machine" parts into `dvc.lock`.
+All of the hashes have been moved to a special file, `dvc.lock`, which is a lot
+like the old DVC-file format. DVC uses this lock file to define which data files
+need to be restored to the workspace from data remotes (cloud storage) and if a
+particular pipeline stage needs to be rerun. In other words, we're separating
+the human-readable parts of the pipeline into `dvc.yaml`, and the auto-generated
+"machine" parts into `dvc.lock`.

-Another cool change: the auto-generated part doesn't necessarily need to be
-stored in your Git repository. The new run-cache feature eliminates the need of
-storing the lock file in Git repositories. That brings us to our next big
-feature:
+Another cool change: the auto-generated part (`dvc.lock`) doesn't necessarily
+have to be stored in your Git repository. The new run-cache feature eliminates
+the need of storing the lock file in Git repositories. That brings us to our
+next big feature:

### [Run cache](https://github.com/iterative/dvc/issues/1234)

We built DVC with a workflow in mind: one experiment to one commit. Some users
love it, but this approach gets clunky fast for others (like folks who are
-grid-searching hyperparameter space). Forcing users to make Git commits for each
-ML experiment was a requirement for the old DVC, if you wanted to snapshot your
+grid-searching a hyperparameter space). Making Git commits for each ML
+experiment was a requirement with the old DVC, if you wanted to snapshot your
project or pipelines on each experiment. Moving forward, we want to give users
more flexibility to decide how often they want to commit.

We had an insight that data remotes (S3, Azure Blob, SSH etc) can be used
instead of Git for storing the codified meta information, not only data. In DVC
-1.0 a special structure is implemented - run-cache - that preserves the state
-including all the hashes. Basically, all the information that is stored in the
+1.0, a special structure is implemented, the run-cache, that preserves the state
+(including all the hashes). Basically, all the information that is stored in the
new `dvc.lock` file is replicated in the run-cache.

The advantage of the run-cache is that pipeline runs (and output file versions)
-are not directly connected to Git commits anymore. New DVC can store all the
-runs in run-cache - even if they were never committed to Git.
+are not directly connected to Git commits anymore. The new DVC can store all the
+runs in the run-cache, even if they were never committed to Git.

This approach gives DVC a "long memory" of DVC stages runs. If a user runs a
command that was run before (whether Git committed or not), then DVC can return
the result of the command from the cache without rerunning it. It is a useful
feature for a hyperparameter optimization stage - when users return to the
This approach gives DVC a "long memory" of DVC stages runs. If a user tries to
run a stage that was previously run (whether committed to Git or not), then DVC
can return the result from the run-cache without rerunning it. It is a useful
feature for a hyperparameter optimization stage when users return to the
previous sets of the parameters and don't want to wait for ML retraining.
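A sketch of that workflow (the stage and file names here are hypothetical):

```dvc
$ dvc repro train                  # executes the stage, saves results to run-cache
$ git checkout HEAD~1 params.yaml  # return to an earlier parameter set
$ dvc repro train                  # inputs seen before: restored from run-cache
```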

Another benefit of the run-cache is related to CI/CD systems for ML, which is a
holy grail of MLOps. The long memory means users don't have to make auto-commits
in their CI/CD system side - see
-[this stackowerflow question](https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments).
+[this Stackowerflow question](https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments).

### [Plots](https://github.com/iterative/dvc/issues/3409)

Expand Down
4 changes: 2 additions & 2 deletions content/docs/command-reference/checkout.md
@@ -65,8 +65,8 @@ progress made by the checkout.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
-regenerate its outputs (see also `dvc pipeline`). In other cases the cache can
-be pulled from remote storage using `dvc pull`.
+regenerate its outputs (see also `dvc dag`). In other cases the cache can be
+pulled from remote storage using `dvc pull`.

## Options

108 changes: 108 additions & 0 deletions content/docs/command-reference/dag.md
@@ -0,0 +1,108 @@
# dag

Show [stages](/doc/command-reference/run) in a pipeline that lead to the
specified stage. By default it lists
[DVC-files](/doc/user-guide/dvc-files-and-directories).

## Synopsis

```usage
usage: dvc dag [-h] [-q | -v] [--dot] [--full] [target]
positional arguments:
  target          Stage or output to show pipeline for (optional)
                  Finds all stages in the workspace by default.
```

## Description

A data pipeline, in general, is a series of data processing
[stages](/doc/command-reference/run) (for example console commands that take an
input and produce an <abbr>output</abbr>). A pipeline may produce intermediate
data, and has a final result. Machine learning (ML) pipelines typically start
with large raw datasets, include intermediate featurization and training stages,
and produce a final model, as well as accuracy
[metrics](/doc/command-reference/metrics).

In DVC, pipeline stages and commands, their data I/O, interdependencies, and
results (intermediate or final) are specified with `dvc add` and `dvc run`,
among other commands. This allows DVC to restore one or more pipelines of stages
interconnected by their dependencies and outputs later. (See `dvc repro`.)

> DVC builds a dependency graph
> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this.

`dvc dag` displays the stages of a pipeline up to the target stage. If `target`
is omitted, it will show the full project DAG.

## Options

- `--dot` - show DAG in
[DOT](<https://en.wikipedia.org/wiki/DOT_(graph_description_language)>)
format. It can be passed to third party visualization utilities.

- `--full` - show the full DAG that the `target` belongs to, instead of showing
part that consists only of the target ancestors.

- `-h`, `--help` - prints the usage/help message, and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.
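For example, the `--dot` output can be rendered to an image with Graphviz (assuming the `dot` utility is installed; the output file name is arbitrary):

```dvc
$ dvc dag --dot | dot -Tpng -o pipeline.png
```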

## Paging the output

This command's output is automatically piped to
[Less](<https://en.wikipedia.org/wiki/Less_(Unix)>), if available in the
terminal. (The exact command used is `less --chop-long-lines --clear-screen`.)
If `less` is not available (e.g. on Windows), the output is simply printed out.

> It's also possible to
> [enable Less paging on Windows](/doc/user-guide/running-dvc-on-windows#enabling-paging-with-less).

### Providing a custom pager

It's possible to override the default pager via the `DVC_PAGER` environment
variable. For example, the following command will replace the default pager with
[`more`](<https://en.wikipedia.org/wiki/More_(command)>), for a single run:

```dvc
$ DVC_PAGER=more dvc dag
```

For a persistent change, define `DVC_PAGER` in the shell configuration. For
example in Bash, we could add the following line to `~/.bashrc`:

```bash
export DVC_PAGER=more
```

## Examples

Visualize DVC pipeline:

```dvc
$ dvc dag
+---------+
| prepare |
+---------+
*
*
*
+-----------+
| featurize |
+-----------+
** **
** *
* **
+-------+ *
| train | **
+-------+ *
** **
** **
* *
+----------+
| evaluate |
+----------+
```
6 changes: 2 additions & 4 deletions content/docs/command-reference/fetch.md
@@ -6,10 +6,8 @@ Get tracked files or directories from
## Synopsis

```usage
-usage: dvc fetch [-h] [-q | -v] [-j <number>]
-                 [-r <name>] [-a] [-T]
-                 [--all-commits] [-d] [-R]
-                 [--run-cache]
+usage: dvc fetch [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
+                 [--all-commits] [-d] [-R] [--run-cache]
                 [targets [targets ...]]
positional arguments:
