start: Pipelines trail #2919

Closed · wants to merge 7 commits
4 changes: 3 additions & 1 deletion .github/workflows/link-check-deploy.yml
```diff
@@ -13,6 +13,8 @@ jobs:
       github.event.deployment_status.state == 'success'
     steps:
       - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
       - id: build_check
         uses: LouisBrunner/checks-action@…
         with:
@@ -21,7 +23,7 @@
           status: queued
       - name: Run Link Check
         id: check
-        uses: iterative/link-check.action@v0.8
+        uses: iterative/link-check.action@v0.9
         with:
           diff: true
           configFile: config/link-check/config.yml
```
137 changes: 104 additions & 33 deletions content/docs/start/data-pipelines.md
@@ -6,32 +6,63 @@ version, and reproduce your data science and machine learning workflows.'

# Get Started: Data Pipelines

>> ⁉️ It may be worthwhile to start with the question: "Why pipelines?"

Versioning large data files and directories for data science is great, but not
enough. How is data filtered, transformed, or used to train ML models? DVC
introduces a mechanism to capture _data pipelines_ — series of data processes
that produce a final result.

>> ⁉️ What is a data process? Why do we tie "pipelines" with "code" or "data"
>> here? They are more general ideas; we can have a pipeline that downloads data
>> from a URL using `wget` and checks whether it has changed, for example (like
>> `dvc get` or `dvc import-url`, but simpler).

>> I see that we are introducing pipelines to an ML/DS audience, but the idea is
>> more general, and I believe we can say so here. It's also possible to convey
>> this within the ML/DS context in broader terms.

DVC pipelines and their data can also be easily versioned (using Git). This
allows you to better organize projects, and reproduce your workflow and results
later — exactly as they were built originally! For example, you could capture a
simple [ETL workflow][etl], organize a data science project, or build a detailed
machine learning pipeline.

[etl]: https://en.wikipedia.org/wiki/Extract,_transform,_load

>> We need a figure here.

Watch and learn, or follow along with the code example below!

https://youtu.be/71IGzyH95UY

## Pipeline stages

>> ✍️ DVC has features to handle pipelines easily. You can create stages
>> associated with commands, code, data, and (hyper)parameters. It can run the
>> commands and cache the outputs. DVC handles the relationships between these
>> stages, so when any of the associated elements change, the stage is
>> invalidated and rerun. If no dependencies have changed, it reports this and
>> reuses the cached results.

Use `dvc stage add` to create _stages_. These represent processes (source code
tracked with Git) which form the steps of a _pipeline_.

>> ⁉️ Adding _data process_ to the concepts doesn't seem to serve well. Instead
>> we can continue like: "Stages represent commands to run, along with their
>> dependencies like data and code files, and outputs like model and plot files."

>> ⁉️ I believe we don't need the following sentence if we write the previous
>> one as suggested.

Stages also connect code
to its corresponding data _input_ and _output_.

<details>

### ⚙️ Expand to download example code.

>> ⁉️ I think it might be easier to grasp the concept if we use a simpler
>> pipeline with 3 stages, without many parameters, metrics, and such.

Get the sample code like this:

@@ -63,22 +94,32 @@ Please also add or commit the source code directory with Git at this point.

</details>


>> ⁉️ The first stage we create may be a simpler one.

```dvc
$ dvc stage add -n prepare \
                -p prepare.seed,prepare.split \
                -d src/prepare.py -d data/data.xml \
                -o data/prepared \
                python src/prepare.py data/data.xml
```
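The `-p` flags above refer to a `params.yaml` file in the workspace, which the
example code reads. As a rough sketch (the parameter names come from the
command above; the values shown are only placeholders):

```yaml
prepare:
  split: 0.20
  seed: 20170428
```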

>> ⁉️ We can move the `dvc.yaml` discussion into a hidden section.
A `dvc.yaml` file is generated. It includes information about the command we run
(`python src/prepare.py data/data.xml`), its <abbr>dependencies</abbr>, and
<abbr>outputs</abbr>.
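As a sketch derived from the options passed above, the generated stage entry
should look roughly like this:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```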

<details>

### 💡 Expand to see what happens under the hood.

>> ⁉️ I think the short descriptions of options can be in the main text instead
>> of the `dvc.yaml` above. Also, the project should contain a simple pipeline
>> that starts with `-d` and `-o`, then adds `-p` and `-m` to the mix in a later
>> stage. The first example of `stage add` is too complex here.

The command options used above mean the following:

- `-n prepare` specifies a name for the stage. If you open the `dvc.yaml` file
@@ -140,34 +181,53 @@ stages:

</details>

>> ⁉️ The following information can also be hidden, or deleted. We assume this
>> GS trail will be standalone; there's no need to mention `dvc add` here.

There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared`
in this case); `dvc stage add` and `dvc exp run` take care of this. You only
need to run `dvc push` if you want to save them to
[remote storage](/doc/start/data-and-model-versioning#storing-and-sharing),
(usually along with `git commit` to version `dvc.yaml` itself).
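A typical sequence might look like this (illustrative commands; the exact
`.gitignore` path depends on where DVC wrote it):

```dvc
$ git add dvc.yaml data/.gitignore
$ git commit -m "Add prepare stage"
$ dvc push
```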

>> ⁉️ Here, it may be more natural to describe the run-cache and `dvc push` as
>> pushing "pipeline artifacts" instead of "storing and sharing".

>> `dvc push` can push the individual stages, and their associated code and
>> data, so you don't have to re-run them on other machines.

## Dependency graphs (DAGs)

By using `dvc stage add` multiple times, and specifying <abbr>outputs</abbr> of a
stage as <abbr>dependencies</abbr> of another one, we can describe a sequence of
commands which gets to a desired result. This is what we call a _data pipeline_
or [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

>> ⁉️ All pipelines are DAGs, but not all DAGs are pipelines, so the two are
>> not identical. The DAG reference seems to complicate things rather than
>> simplify them, to me.

Let's create a second stage chained to the outputs of `prepare`, to perform
feature extraction:

>> ⁉️ The second stage is almost identical to the first. It may be necessary
>> for the project, but pedagogically we're spending the reader's attention
>> unnecessarily here.

```dvc
$ dvc stage add -n featurize \
                -p featurize.max_features,featurize.ngrams \
                -d src/featurization.py -d data/prepared \
                -o data/features \
                python src/featurization.py data/prepared data/features
```

The `dvc.yaml` file is updated automatically and should include two stages now.

<details>


### 💡 Expand to see what happens under the hood.

The changes to the `dvc.yaml` should look like this:
@@ -202,31 +262,37 @@

### ⚙️ Expand to add more stages.

>> ⁉️ Another stage from the same mold. The first three stages look almost
>> identical.

Let's add the training itself. Nothing new this time; just the same `dvc stage add`
command with the same set of options:

```dvc
$ dvc stage add -n train \
                -p train.seed,train.n_est,train.min_split \
                -d src/train.py -d data/features \
                -o model.pkl \
                python src/train.py data/features model.pkl
```

>> ⁉️ The wording below is a bit _distrustful._ In case of an error, DVC should
>> report it.

Please check the `dvc.yaml` file again; it should have one more stage now.
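One way to check without opening the file (assuming a DVC version that ships
`dvc stage list`; output illustrative):

```dvc
$ dvc stage list
prepare    Outputs data/prepared
featurize  Outputs data/features
train      Outputs model.pkl
```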

</details>

This should be a good time to commit the changes with Git. These include
`.gitignore`, `dvc.lock`, and `dvc.yaml` — which describe our pipeline.
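You can also visualize the pipeline at this point; `dvc dag` renders the graph
of stages, producing output roughly like this (illustrative):

```dvc
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
+-----------+
| featurize |
+-----------+
      *
      *
  +-------+
  | train |
  +-------+
```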

## Run the pipeline

The whole point of creating this `dvc.yaml` file is the ability to easily
reproduce a pipeline:

```dvc
$ dvc exp run
```

<details>
@@ -237,12 +303,14 @@

Let's try to play a little bit with it. First, let's try to change one of the
parameters for the training stage:

1. Open `params.yaml` and change `n_est` to `100`, and
2. (re)run `dvc exp run`.

>> Link to experiments trail here

You should see:

```dvc
$ dvc exp run
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Running stage 'train' with command: ...
```

@@ -251,10 +319,13 @@
DVC detected that only `train` should be run, and skipped everything else! All
the intermediate results are being reused.
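Before reproducing, `dvc status` could have been used to see what DVC considers
out of date; with only `train.n_est` edited, it would report something along
these lines (illustrative output):

```dvc
$ dvc status
train:
    changed deps:
        params.yaml:
            modified:           train.n_est
```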

Now, let's change it back to `50` and run `dvc exp run` again:

>> It looks like these manual changes are a bit tedious. We could replace them
>> with code or data changes that can't be captured with `dvc exp run -S`.

```dvc
$ dvc exp run
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
```
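As the note above suggests, a parameter change like this could also be passed
on the command line instead of editing `params.yaml` by hand, using the
`--set-param` (`-S`) option of `dvc exp run`:

```dvc
$ dvc exp run -S train.n_est=100
```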
@@ -300,7 +371,7 @@ stages:

</details>

DVC pipelines (`dvc.yaml` file, `dvc stage add`, and `dvc exp run` commands) solve a few
important problems:

- _Automation_: run a sequence of steps in a "smart" way which makes iterating