From 4a110afc5e7f84fed082afe235811a4e9c36d2ad Mon Sep 17 00:00:00 2001
From: jeremydesroches <18587991+jeremydesroches@users.noreply.github.com>
Date: Sun, 8 Nov 2020 22:21:04 -0700
Subject: [PATCH 1/4] Add "machine learning pipeline" references, outline of changes.

---
 content/docs/command-reference/dag.md                | 7 ++++---
 content/docs/command-reference/repro.md              | 8 +++++---
 content/docs/command-reference/run.md                | 7 +++++--
 content/docs/user-guide/dvc-files-and-directories.md | 3 ++-
 4 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/content/docs/command-reference/dag.md b/content/docs/command-reference/dag.md
index 23da655edf..94ec019d17 100644
--- a/content/docs/command-reference/dag.md
+++ b/content/docs/command-reference/dag.md
@@ -20,7 +20,7 @@ A data pipeline, in general, is a series of data processing
 input and produce an output). A pipeline may produce intermediate data, and has
 a final result.
 
-Data processing or ML pipelines typically start with large raw datasets, include
+Data science and ML pipelines typically start with large raw datasets, include
 intermediate featurization and training stages, and produce a final model, as
 well as accuracy [metrics](/doc/command-reference/metrics).
 
@@ -78,9 +78,10 @@ example in Bash, we could add the following line to `~/.bashrc`:
 export DVC_PAGER=more
 ```
 
-## Examples
+## Example: Visualize a DVC Pipeline
 
-Visualize DVC pipeline:
+Visualize the prepare, featurize, train, and evaluate stages of a machine
+learning pipeline as defined in `dvc.yaml`:
 
 ```dvc
 $ dvc dag
diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 6ddf7788a4..15f32adbb7 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -175,9 +175,11 @@ up-to-date and only execute the final stage.
 
 ## Examples
 
-For simplicity, let's build a pipeline defined below. (If you want get your
-hands-on something more real, see this short
-[pipeline tutorial](/doc/start/data-pipelines)). It takes this `text.txt` file:
+To get hands-on experience with data science and machine learning pipelines, see
+[Get Started: Data Pipelines](/doc/start/data-pipelines).
+
+To demonstrate `dvc repro`, let's build and reproduce the simple pipeline below.
+It takes this `text.txt` file:
 
 ```
 dvc
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index e9e19633cf..5ea3e2b8bc 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -23,8 +23,11 @@ positional arguments:
 
 `dvc run` is a helper for creating or updating
 [pipeline](/doc/command-reference/dag) stages in a `dvc.yaml` file (located in
-the current working directory). _Stages_ represent individual data processes,
-including their input and resulting outputs.
+the current working directory).
+
+_Stages_ represent individual data processes, including their input and
+resulting outputs. Combine stages to capture simple data workflows, organize
+data science projects, or build detailed machine learning pipelines.
 
 A stage name is required and can be provided using the `-n` (`--name`) option.
 The other available [options](#options) are mostly meant to describe different
diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md
index 09e2ff8ee7..ba366b7d22 100644
--- a/content/docs/user-guide/dvc-files-and-directories.md
+++ b/content/docs/user-guide/dvc-files-and-directories.md
@@ -109,7 +109,8 @@ can be written manually or generated by user code.
 > `dvc.yaml`. Additionally, a `dvc.lock` file is also created or updated by
 > `dvc run` and `dvc repro`, to record the pipeline state.
 
-Here's a comprehensive `dvc.yaml` example:
+Here's a comprehensive example of a machine learning pipeline, described in
+`dvc.yaml`:
 
 ```yaml
 stages:

From aacd352aeeb22576aac7f941b66f5319ddfd3bc1 Mon Sep 17 00:00:00 2001
From: jeremydesroches <18587991+jeremydesroches@users.noreply.github.com>
Date: Mon, 9 Nov 2020 22:24:36 -0700
Subject: [PATCH 2/4] Update content/docs/command-reference/repro.md

Co-authored-by: Jorge Orpinel
---
 content/docs/command-reference/repro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 15f32adbb7..c4b84ecd8b 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -178,7 +178,7 @@ up-to-date and only execute the final stage.
 To get hands-on experience with data science and machine learning pipelines, see
 [Get Started: Data Pipelines](/doc/start/data-pipelines).
 
-To demonstrate `dvc repro`, let's build and reproduce the simple pipeline below.
+Let's build and reproduce a simple pipeline.
 It takes this `text.txt` file:
 
 ```

From f273885765665637f619abde3118edd346d46f4f Mon Sep 17 00:00:00 2001
From: jeremydesroches <18587991+jeremydesroches@users.noreply.github.com>
Date: Mon, 9 Nov 2020 22:33:00 -0700
Subject: [PATCH 3/4] Update content/docs/command-reference/run.md

Co-authored-by: Jorge Orpinel
---
 content/docs/command-reference/run.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index 5ea3e2b8bc..8efa5ae8c1 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -26,7 +26,7 @@ positional arguments:
 the current working directory).
 
 _Stages_ represent individual data processes, including their input and
-resulting outputs. Combine stages to capture simple data workflows, organize
+resulting outputs. They can be combined to capture simple data workflows, organize
 data science projects, or build detailed machine learning pipelines.
 
 A stage name is required and can be provided using the `-n` (`--name`) option.

From f5b5a9a5e4bd5648a16f4d87bba7afed0abf9ed3 Mon Sep 17 00:00:00 2001
From: jeremydesroches <18587991+jeremydesroches@users.noreply.github.com>
Date: Mon, 9 Nov 2020 23:00:24 -0700
Subject: [PATCH 4/4] Updated placements and reverted edits based on feedback.

---
 content/docs/command-reference/dag.md                | 10 +++++-----
 content/docs/command-reference/repro.md              |  9 ++++-----
 content/docs/command-reference/run.md                |  4 ++--
 content/docs/user-guide/dvc-files-and-directories.md |  7 +++----
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/content/docs/command-reference/dag.md b/content/docs/command-reference/dag.md
index 94ec019d17..2fb7e9c5b4 100644
--- a/content/docs/command-reference/dag.md
+++ b/content/docs/command-reference/dag.md
@@ -20,9 +20,9 @@ A data pipeline, in general, is a series of data processing
 input and produce an output). A pipeline may produce intermediate data, and has
 a final result.
 
-Data science and ML pipelines typically start with large raw datasets, include
-intermediate featurization and training stages, and produce a final model, as
-well as accuracy [metrics](/doc/command-reference/metrics).
+Data science and machine learning pipelines typically start with large raw
+datasets, include intermediate featurization and training stages, and produce a
+final model, as well as accuracy [metrics](/doc/command-reference/metrics).
 
 In DVC, pipeline stages and commands, their data I/O, interdependencies, and
 results (intermediate or final) are specified in `dvc.yaml`, which can be
@@ -80,8 +80,8 @@ export DVC_PAGER=more
 
 ## Example: Visualize a DVC Pipeline
 
-Visualize the prepare, featurize, train, and evaluate stages of a machine
-learning pipeline as defined in `dvc.yaml`:
+Visualize the prepare, featurize, train, and evaluate stages of a pipeline as
+defined in `dvc.yaml`:
 
 ```dvc
 $ dvc dag
diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index c4b84ecd8b..27eb2c6651 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -30,6 +30,9 @@ results.
 > (either manually or by using `dvc run`) while initial data dependencies can be
 > registered with `dvc add`.
 
+To get hands-on experience with data science and machine learning pipelines, see
+[Get Started: Data Pipelines](/doc/start/data-pipelines).
+
 This command is similar to [Make](https://www.gnu.org/software/make/) in
 software build automation, but DVC captures build requirements
 ([dependencies and outputs](/doc/command-reference/run#dependencies-and-outputs))
@@ -175,11 +178,7 @@ up-to-date and only execute the final stage.
 
 ## Examples
 
-To get hands-on experience with data science and machine learning pipelines, see
-[Get Started: Data Pipelines](/doc/start/data-pipelines).
-
-Let's build and reproduce a simple pipeline.
-It takes this `text.txt` file:
+Let's build and reproduce a simple pipeline. It takes this `text.txt` file:
 
 ```
 dvc
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index 8efa5ae8c1..b897b2a5db 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -26,8 +26,8 @@ positional arguments:
 the current working directory).
 
 _Stages_ represent individual data processes, including their input and
-resulting outputs. They can be combined to capture simple data workflows, organize
-data science projects, or build detailed machine learning pipelines.
+resulting outputs. They can be combined to capture simple data workflows,
+organize data science projects, or build detailed machine learning pipelines.
 
 A stage name is required and can be provided using the `-n` (`--name`) option.
 The other available [options](#options) are mostly meant to describe different
diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md
index ba366b7d22..ea235edacb 100644
--- a/content/docs/user-guide/dvc-files-and-directories.md
+++ b/content/docs/user-guide/dvc-files-and-directories.md
@@ -100,8 +100,8 @@ and `dvc commit` commands, but not when a `.dvc` file is overwritten by
 
 ## `dvc.yaml` file
 
-`dvc.yaml` files describe data pipelines, similar to how
-[Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)
+`dvc.yaml` files describe data science or machine learning pipelines, similar to
+how [Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)
 work for building software. Its YAML structure contains a list of stages which
 can be written manually or generated by user code.
 
@@ -109,8 +109,7 @@ can be written manually or generated by user code.
 > `dvc.yaml`. Additionally, a `dvc.lock` file is also created or updated by
 > `dvc run` and `dvc repro`, to record the pipeline state.
 
-Here's a comprehensive example of a machine learning pipeline, described in
-`dvc.yaml`:
+Here's a comprehensive `dvc.yaml` example:
 
 ```yaml
 stages: