diff --git a/.github/workflows/link-check-deploy.yml b/.github/workflows/link-check-deploy.yml
index 45068a38d8..1dca96cd73 100644
--- a/.github/workflows/link-check-deploy.yml
+++ b/.github/workflows/link-check-deploy.yml
@@ -13,6 +13,8 @@ jobs:
github.event.deployment_status.state == 'success'
steps:
- uses: actions/checkout@v2
+ with:
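+        # fetch the full git history instead of a shallow clone (likely needed
+        # for the diff-based link check below)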
+ fetch-depth: 0
- id: build_check
uses: LouisBrunner/checks-action@v1.0.0
with:
@@ -21,7 +23,7 @@ jobs:
status: queued
- name: Run Link Check
id: check
- uses: iterative/link-check.action@v0.8
+ uses: iterative/link-check.action@v0.9
with:
diff: true
configFile: config/link-check/config.yml
diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 4e266aa374..3d0b22198a 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -6,32 +6,63 @@ version, and reproduce your data science and machine learning workflows.'
# Get Started: Data Pipelines
+>> ⁉️ It may be worthwhile to start with the question: "Why pipelines?"
+
Versioning large data files and directories for data science is great, but not
enough. How is data filtered, transformed, or used to train ML models? DVC
introduces a mechanism to capture _data pipelines_ — series of data processes
that produce a final result.
+>> ⁉️ What is a data process? Why do we tie "pipelines" to "code" or "data"
+>> here? They are more general ideas; we could have a pipeline that downloads
+>> data from a URL using `wget` and checks whether it has changed, for example
+>> (like `dvc get` or `dvc import-url`, but simpler).
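+
+>> For instance (just a sketch; the stage name and URL are placeholders):
+>>
+>> ```dvc
+>> $ dvc stage add -n download \
+>>                 -o data.xml \
+>>                 wget https://example.com/data.xml
+>> ```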
+
+>> I see that we are introducing pipelines to an ML/DS audience, but the idea is
+>> more general, and I believe we can say so here. It's also possible to convey
+>> this within the ML/DS context in broader terms.
+
DVC pipelines and their data can also be easily versioned (using Git). This
allows you to better organize projects, and reproduce your workflow and results
later — exactly as they were built originally! For example, you could capture a
-simple ETL workflow, organize a data science project, or build a detailed
+simple [ETL workflow][etl], organize a data science project, or build a detailed
machine learning pipeline.
+
+[etl]: https://en.wikipedia.org/wiki/Extract,_transform,_load
+
+>> We need a figure here.
+
Watch and learn, or follow along with the code example below!
https://youtu.be/71IGzyH95UY
-## Pipeline stages
+>> ✍️ DVC has features to handle pipelines easily. You can create stages
+>> associated with commands, code, data, and (hyper)parameters. It can run the
+>> commands and cache the outputs. DVC tracks the relationships between these
+>> stages, so when any of the associated elements change, the stage is
+>> invalidated and re-run. If no dependencies have changed, it reports this and
+>> reuses the cached results.
+
+Use `dvc stage add` to create _stages_. These represent processes (source code
+tracked with Git) which form the steps of a _pipeline_.
-Use `dvc run` to create _stages_. These represent processes (source code tracked
-with Git) which form the steps of a _pipeline_. Stages also connect code to its
-corresponding data _input_ and _output_. Let's transform a Python script into a
-[stage](/doc/command-reference/run):
+>> ⁉️ Adding _data process_ to the concepts doesn't seem helpful. Instead we
+>> could continue like: "Stages represent commands to run, along with their
+>> dependencies like data and code files, and outputs like model and plot files."
+
+>> ⁉️ I believe we don't need the following sentence if we phrase it as above.
+
+Stages also connect code
+to its corresponding data _input_ and _output_.
### ⚙️ Expand to download example code.
+>> ⁉️ I think it might be easier to grasp the concept if we use a simpler
+>> pipeline with 3 stages, without many parameters, metrics, and such.
+
Get the sample code like this:
```dvc
@@ -63,15 +94,20 @@ Please also add or commit the source code directory with Git at this point.
+
+>> ⁉️ The first stage we create may be a simpler one.
+
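+>> For example, the first stage could use just a dependency and an output (the
+>> same files as below, no `-p` yet; only a sketch):
+>>
+>> ```dvc
+>> $ dvc stage add -n prepare \
+>>                 -d src/prepare.py -d data/data.xml \
+>>                 -o data/prepared \
+>>                 python src/prepare.py data/data.xml
+>> ```
+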
```dvc
-$ dvc run -n prepare \
- -p prepare.seed,prepare.split \
- -d src/prepare.py -d data/data.xml \
- -o data/prepared \
- python src/prepare.py data/data.xml
+$ dvc stage add -n prepare \
+ -p prepare.seed,prepare.split \
+ -d src/prepare.py -d data/data.xml \
+ -o data/prepared \
+ python src/prepare.py data/data.xml
```
-A `dvc.yaml` file is generated. It includes information about the command we ran
+>> ⁉️ We can move the `dvc.yaml` discussion into a hidden section.
+
+A `dvc.yaml` file is generated. It includes information about the command we
+want to run
(`python src/prepare.py data/data.xml`), its dependencies, and
outputs.
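+
+>> For reference, the generated entry mirrors the options above and should look
+>> roughly like this (the exact formatting may differ):
+>>
+>> ```yaml
+>> stages:
+>>   prepare:
+>>     cmd: python src/prepare.py data/data.xml
+>>     deps:
+>>       - data/data.xml
+>>       - src/prepare.py
+>>     params:
+>>       - prepare.seed
+>>       - prepare.split
+>>     outs:
+>>       - data/prepared
+>> ```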
@@ -79,6 +115,11 @@ A `dvc.yaml` file is generated. It includes information about the command we ran
### 💡 Expand to see what happens under the hood.
+>> ⁉️ I think the short descriptions of the options can go in the main text
+>> instead of the `dvc.yaml` above. Also, the project should contain a simple
+>> pipeline that starts with `-d` and `-o`, then adds `-p` and `-m` to the mix
+>> in a later stage. The first `stage add` example is too complex here.
+
The command options used above mean the following:
- `-n prepare` specifies a name for the stage. If you open the `dvc.yaml` file
@@ -140,34 +181,53 @@ stages:
+>> ⁉️ The following information can also be hidden, or deleted. We assume this
+>> GS trail will be standalone, so there's no need to mention `dvc add` here.
+
There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared`
-in this case); `dvc run` already took care of this. You only need to run
-`dvc push` if you want to save them to
+in this case); `dvc stage add` and `dvc exp run` take care of this. You only
+need to run `dvc push` if you want to save them to
[remote storage](/doc/start/data-and-model-versioning#storing-and-sharing),
(usually along with `git commit` to version `dvc.yaml` itself).
+>> ⁉️ Here, it may be more natural to present the run-cache and `dvc push` as
+>> pushing "pipeline artifacts" instead of "storing and sharing".
+
+>> `dvc push` can push the individual stages, along with their associated code
+>> and data, so you don't have to re-run them on other machines.
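+
+>> For example (a sketch, assuming we want to surface the `--run-cache` flag
+>> here; a remote must already be configured):
+>>
+>> ```dvc
+>> $ dvc push --run-cache    # on this machine
+>> $ dvc pull --run-cache    # on another machine, before running the pipeline
+>> ```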
+
## Dependency graphs (DAGs)
-By using `dvc run` multiple times, and specifying outputs of a
+By using `dvc stage add` multiple times, and specifying outputs of a
stage as dependencies of another one, we can describe a sequence of
commands which gets to a desired result. This is what we call a _data pipeline_
or [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
+>> ⁉️ All pipelines are DAGs, but not all DAGs are pipelines, so the two terms
+>> are not identical. To me, the DAG reference complicates rather than
+>> simplifies things.
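+
+>> If the DAG framing stays, `dvc dag` could be shown here; it prints the stage
+>> graph as ASCII in the terminal:
+>>
+>> ```dvc
+>> $ dvc dag
+>> ```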
+
Let's create a second stage chained to the outputs of `prepare`, to perform
feature extraction:
+>> ⁉️ The second stage is almost identical to the first. It may be necessary
+>> for the project, but pedagogically we're spending the reader's attention
+>> here unnecessarily.
+
```dvc
-$ dvc run -n featurize \
- -p featurize.max_features,featurize.ngrams \
- -d src/featurization.py -d data/prepared \
- -o data/features \
- python src/featurization.py data/prepared data/features
+$ dvc stage add -n featurize \
+ -p featurize.max_features,featurize.ngrams \
+ -d src/featurization.py -d data/prepared \
+ -o data/features \
+ python src/featurization.py data/prepared data/features
```
The `dvc.yaml` file is updated automatically and should include two stages now.
+
### 💡 Expand to see what happens under the hood.
The changes to the `dvc.yaml` should look like this:
@@ -202,17 +262,23 @@ The changes to the `dvc.yaml` should look like this:
### ⚙️ Expand to add more stages.
-Let's add the training itself. Nothing new this time; just the same `dvc run`
+>> ⁉️ Another stage in the same mold. The first three stages look almost
+>> identical.
+
+Let's add the training itself. Nothing new this time; just the same `dvc stage add`
command with the same set of options:
```dvc
-$ dvc run -n train \
- -p train.seed,train.n_est,train.min_split \
- -d src/train.py -d data/features \
- -o model.pkl \
- python src/train.py data/features model.pkl
+$ dvc stage add -n train \
+ -p train.seed,train.n_est,train.min_split \
+ -d src/train.py -d data/features \
+ -o model.pkl \
+ python src/train.py data/features model.pkl
```
+>> ⁉️ The wording below is a bit _distrustful_. In case of an error, DVC should
+>> report it.
+
Please check the `dvc.yaml` again, it should have one more stage now.
@@ -220,13 +286,13 @@ Please check the `dvc.yaml` again, it should have one more stage now.
This should be a good time to commit the changes with Git. These include
`.gitignore`, `dvc.lock`, and `dvc.yaml` — which describe our pipeline.
-## Reproduce
+## Run the pipeline
The whole point of creating this `dvc.yaml` file is the ability to easily
reproduce a pipeline:
```dvc
-$ dvc repro
+$ dvc exp run
```
@@ -237,12 +303,14 @@ Let's try to play a little bit with it. First, let's try to change one of the
parameters for the training stage:
1. Open `params.yaml` and change `n_est` to `100`, and
-2. (re)run `dvc repro`.
+2. (re)run `dvc exp run`.
+
+>> Link to experiments trail here
You should see:
```dvc
-$ dvc repro
+$ dvc exp run
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Running stage 'train' with command: ...
@@ -251,10 +319,13 @@ Running stage 'train' with command: ...
DVC detected that only `train` should be run, and skipped everything else! All
the intermediate results are being reused.
-Now, let's change it back to `50` and run `dvc repro` again:
+Now, let's change it back to `50` and run `dvc exp run` again:
+
+>> It looks like these manual changes are a bit tedious. We could replace them
+>> with code or data changes that can't be made with `dvc exp run -S`.
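+
+>> For the parameter change specifically, something like this would avoid the
+>> manual edit (assuming we want to surface `--set-param`/`-S` at this point):
+>>
+>> ```dvc
+>> $ dvc exp run -S train.n_est=100
+>> ```
+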
```dvc
-$ dvc repro
+$ dvc exp run
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
```
@@ -300,7 +371,7 @@ stages:
-DVC pipelines (`dvc.yaml` file, `dvc run`, and `dvc repro` commands) solve a few
+DVC pipelines (`dvc.yaml` file, `dvc stage add`, and `dvc exp run` commands) solve a few
important problems:
- _Automation_: run a sequence of steps in a "smart" way which makes iterating