guide: ML Pipelines (1): Defining Pipelines & Stages #3414

Merged
merged 103 commits into from
Sep 7, 2022
Changes from 8 commits
Commits
f91f015
initial plan and some content
Apr 5, 2022
fd20226
added content about stages
Apr 6, 2022
b279db7
title and restyle fixes
iesahin Apr 13, 2022
881f3af
added dag section
iesahin Apr 15, 2022
0e95212
depend to -> depend on
iesahin Apr 15, 2022
78f342d
added dependencies section
iesahin Apr 20, 2022
b2a949d
Update content/docs/user-guide/pipelines/index.md
jorgeorpinel May 9, 2022
9d5ee22
Update content/docs/user-guide/pipelines/index.md
jorgeorpinel May 9, 2022
678753d
Restyled by prettier (#3532)
restyled-io[bot] May 12, 2022
7fe19e7
Update content/docs/user-guide/pipelines/index.md
iesahin Jun 7, 2022
e932c8d
added pipelines to sidebar
iesahin Jun 7, 2022
baf6ce1
updated the title
iesahin Jun 7, 2022
caf3291
fixed formatting
iesahin Jun 7, 2022
2e98221
updating for dvc.yaml first
iesahin Jun 9, 2022
15ba67f
fed -> used
iesahin Jun 9, 2022
d749c2a
dvc.yaml-first
iesahin Jun 9, 2022
f01e750
editing to tell dvc.yaml first
iesahin Jun 9, 2022
78cfebd
minor fix
iesahin Jun 9, 2022
9989902
url dependency
iesahin Jun 9, 2022
f51bfd5
dvc lock example
iesahin Jun 9, 2022
5edc29b
section titles for deps
iesahin Jun 9, 2022
9649c0f
section titles for outputs
iesahin Jun 9, 2022
bb7c0a1
reproduction -> running
iesahin Jun 9, 2022
e26734e
adding hyperparameters section
iesahin Jun 9, 2022
424a196
added experiments section
iesahin Jun 14, 2022
e43bc98
adding url dependencies
iesahin Jun 14, 2022
ee7703c
added outputs section content
iesahin Jun 15, 2022
602e327
minor
iesahin Jun 24, 2022
1b08b42
added running pipelines content
iesahin Jun 24, 2022
1ff0479
moved outputs below running
iesahin Jun 24, 2022
27ef04f
removed plots section header
iesahin Jun 24, 2022
c2c0461
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Jul 13, 2022
7df801a
guide: Defining Data Pipelines
jorgeorpinel Jul 14, 2022
605a000
guide: split up Data Pipelines section
jorgeorpinel Jul 14, 2022
f22f2a0
Update content/docs/command-reference/plots/templates.md
jorgeorpinel Jul 19, 2022
928ad25
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Jul 20, 2022
73982c2
guide: Data Pipes -> ML Pipes
jorgeorpinel Jul 20, 2022
a97006e
guide: oops, remove op-pipes file
jorgeorpinel Jul 20, 2022
f9f0a59
guide: remoge ML Pipes intro
jorgeorpinel Jul 20, 2022
53b4321
guide: mention both imports in Def ML Pipes
jorgeorpinel Jul 20, 2022
b6d8a0c
guide: move DAG info from cmd ref
jorgeorpinel Jul 20, 2022
60af6a7
guide: move all info and links about DAG to ML Pipes
jorgeorpinel Jul 20, 2022
830fe2b
guide: point from some Stage links to ML Pipes
jorgeorpinel Jul 20, 2022
cb042af
guide: delete Running ML Pipes (for now)
jorgeorpinel Jul 20, 2022
b8844da
nav: remove future ML Pipes guides
jorgeorpinel Jul 20, 2022
2040246
guide: remove ML Pipes/ Experimental Pipes
jorgeorpinel Jul 20, 2022
192e189
roll back unrelated changes...
jorgeorpinel Jul 20, 2022
078b50d
guide: roll back dvc.yaml page changes
jorgeorpinel Jul 20, 2022
fed4832
guide: link ML Pipes/ Defining Stages to dvc.yaml/stages spec
jorgeorpinel Jul 20, 2022
49780b6
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Jul 21, 2022
3410b6a
guide: link deps and params tooltip to ML Pipes/ Stages guide sections
jorgeorpinel Jul 21, 2022
0ca32c2
guide: links from dvc.yaml doc to ML Pipes/ Stages
jorgeorpinel Jul 21, 2022
8a55302
guide: more links
jorgeorpinel Jul 21, 2022
7ebc3ad
guide: oops, remove unused files
jorgeorpinel Jul 21, 2022
7461b07
remove unrelated change
jorgeorpinel Jul 21, 2022
bea5368
guide: move stage definition details to ML Pipes
jorgeorpinel Jul 21, 2022
e44433c
guide: move stage command details into ML Pipes
jorgeorpinel Jul 21, 2022
e47d741
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Jul 28, 2022
a4824a1
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Aug 1, 2022
1411ec7
ref: roll back unrelated changes
jorgeorpinel Aug 2, 2022
9800a35
.
jorgeorpinel Aug 2, 2022
43517c4
ref: few more links to dependency graph in guide
jorgeorpinel Aug 2, 2022
0b81e74
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Aug 3, 2022
09f5aaf
ref: reorg exp init to include simple usage example in Desc
jorgeorpinel Aug 3, 2022
ded42ff
concept: reintroduce DAG in more places
jorgeorpinel Aug 4, 2022
8ebd1bb
guide: pipelines are not ML-specific
jorgeorpinel Aug 4, 2022
55afd17
guide: more details for params fields
jorgeorpinel Aug 4, 2022
076402f
one word
jorgeorpinel Aug 8, 2022
e065653
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Aug 21, 2022
adc4382
guide: restructure Def Pipes and
jorgeorpinel Aug 22, 2022
3f8e29a
guide: rewrite Def Pipes intro
jorgeorpinel Aug 22, 2022
4f3391a
guide: move DAG up in Def Pipes
jorgeorpinel Aug 22, 2022
86e5b18
guide: inner link in Def Pipes
jorgeorpinel Aug 22, 2022
8e88b7f
guide: fix link and typos
jorgeorpinel Aug 22, 2022
9216380
start: revert DAG changes
jorgeorpinel Aug 23, 2022
51151c8
guide: use typical ML stage names
jorgeorpinel Aug 23, 2022
50b2513
guide: better flow in Pipes index
jorgeorpinel Aug 23, 2022
29790fd
glossary: high-level def of Pipes
jorgeorpinel Aug 23, 2022
dc3b6bc
guide: move Stage command to dvc.yaml ref
jorgeorpinel Aug 23, 2022
dadd099
guide: remove abc mention
jorgeorpinel Aug 23, 2022
0283efc
guide: edits to Defining Pipes
jorgeorpinel Aug 24, 2022
b55ac41
guide: improve Param deps in Def Pipes and
jorgeorpinel Aug 24, 2022
b947c67
guide: add Outputs to Def Pipes
jorgeorpinel Aug 24, 2022
bfe0ec2
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Aug 24, 2022
8505bc7
guide: update dep, param and out tooltips
jorgeorpinel Aug 24, 2022
57ed5c0
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Aug 25, 2022
627afb5
guide: separate params in Pipes vs Exps
jorgeorpinel Aug 25, 2022
502b3bd
ref: move Stage commands section of dvc.yaml up
jorgeorpinel Aug 25, 2022
533a1b0
guide: update Def Pipes and DAG
jorgeorpinel Aug 25, 2022
c5cbf58
params: more separation of content and
jorgeorpinel Aug 25, 2022
416d312
concept: rehash params
jorgeorpinel Aug 25, 2022
75340be
guide: more holistic pipelining info
jorgeorpinel Aug 26, 2022
dc16807
guide: Pipe edits
jorgeorpinel Aug 26, 2022
23cd9c6
params: roll back changes for now...
jorgeorpinel Aug 26, 2022
3c8f635
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Aug 29, 2022
ef24051
guide: more edits to Pipelines
jorgeorpinel Aug 31, 2022
a542867
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Aug 31, 2022
ac15ca3
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Sep 2, 2022
aa7bfcb
guide: edits on Def Pipes
jorgeorpinel Sep 2, 2022
0f351bb
Merge branch 'main' into iesahin/ug-pipelines
jorgeorpinel Sep 7, 2022
c9e6272
guide: hide DAG text
jorgeorpinel Sep 7, 2022
afe2ca5
concept: expand on Stage
jorgeorpinel Sep 7, 2022
e242375
guide: term
jorgeorpinel Sep 7, 2022
185 changes: 185 additions & 0 deletions content/docs/user-guide/pipelines/index.md
@@ -0,0 +1,185 @@
---
title: Pipelines Management in DVC
---

# What is a (DVC) pipeline?

A machine learning pipeline is like an assembly line in a factory. Just as raw
materials enter a factory and are processed at each step to produce a final
product, a machine learning pipeline takes data as input, processes it at
every stage, and outputs a model.

If you are repeating a set of commands to get to the final artifacts (models,
results), then you already have a pipeline, albeit a manual one. DVC lets you
automate your workflow, whether it consists of a single command or many
commands with complex relationships.

## Stages

A pipeline is a collection of stages. For each stage, we define a (shell)
command to run and specify its inputs and outputs. From these inputs and
outputs, DVC can determine the order in which to run the stages: if an output of
stage `A` is used as an input to stage `B`, then DVC infers that `A` must run
before `B`.
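
To illustrate how an execution order can be derived from declared inputs and
outputs, here is a minimal Python sketch. It is not DVC's implementation: the
stage names and paths (`prepare`, `train`, `evaluate`) are hypothetical, and the
sorting is delegated to the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each declares the paths it reads (deps) and writes (outs).
stages = {
    "train": {"deps": ["data/prepared"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/prepared"], "outs": ["metrics.json"]},
    "prepare": {"deps": ["data/raw"], "outs": ["data/prepared"]},
}


def execution_order(stages):
    """Return stage names ordered so that producers run before consumers."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage's predecessors are the producers of its dependencies.
    # Paths nobody produces (such as data/raw) are plain inputs.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

The order in which stages are written down does not matter here; only the
input/output relationships determine what runs first.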

The `dvc stage` commands are used to create a pipeline by defining its stages.
Each stage requires a `name` and a `command`. Additionally, it can describe a
set of dependencies and outputs. When these outputs are missing or the contents
of the dependencies have changed since the last run, the command is run again.

There are two equally valid ways to describe stages. The first is using the
`dvc stage add` command, and the other is editing `dvc.yaml` directly.

For example, to add a stage named `preprocess` that depends on
`src/preprocess.py` and `data/raw`:

```dvc
$ dvc stage add --name preprocess \
                --deps src/preprocess.py \
                --deps data/raw \
                --outs data/preprocessed \
                python src/preprocess.py
```

The other way to define a stage is through editing `dvc.yaml`. You could create
the file with the following content to create the above stage:

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw
      - src/preprocess.py
    outs:
      - data/preprocessed
```

Essentially, the command creates the stage by writing it to the `dvc.yaml` file.
If you are creating the pipeline for the first time, adding stages with
`dvc stage add` may be easier. When editing an existing pipeline, though,
working with the YAML file directly is often more convenient.

<admon type="tip">

Some advanced configuration for the stages may only be available through
editing `dvc.yaml`.

</admon>

<admon type="warn">

DVC works with YAML 1.2. Earlier versions of YAML have some minor quirks that
may cause subtle bugs.

</admon>

## Directed Acyclic Graph (DAG)

By adding stages to the pipeline, we define a [graph] where the nodes are stages
and the edges are dependency relationships. The final topology of the graph
should be a [Directed Acyclic Graph]. It is directed because "stage A depends on
stage B" is different from "stage B depends on stage A". But why should it be
_acyclic_?

The pipeline graph shouldn't contain any cycles of the form
`A depends on B depends on C depends on A`. Otherwise, invalidating one of the
stages would cause the pipeline to run indefinitely. If we invalidate `B` in
this example, `A` has to be run again, and as `C` depends on `A`, it is
invalidated and run too. Finally, as `B` depends on `C`, `B` needs to be
invalidated again. This never ends: it is an infinite loop.

Because of this, DVC checks the stage dependencies for cycles and ensures that
the graph is a [DAG].
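
As a rough illustration of what such a check involves (a sketch only, not how
DVC implements it), the standard-library `graphlib` module raises `CycleError`
when a dependency graph contains a cycle. The helper name `find_cycle` and the
stage names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(depends_on):
    """Return the stages forming a cycle, or None if the graph is a DAG.

    `depends_on` maps each stage name to the set of stages it depends on.
    """
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError as err:
        # The second element of args holds the detected cycle as a list
        # whose first and last entries are the same node.
        return err.args[1]
    return None


# "A depends on B depends on C depends on A" is a cycle.
cyclic = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
# Dropping the last dependency makes the graph acyclic.
acyclic = {"A": {"B"}, "B": {"C"}, "C": set()}

print(find_cycle(cyclic))
print(find_cycle(acyclic))
```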

If you want to visualize the graph yourself, you can use the `dvc dag` command.
It outputs a text representation of the stages by default.

```dvc
$ dvc dag
```

TODO: output of `dvc dag example`

It can also output the graph in DOT format, which you can pass to `dot` to get
an image.

```dvc
$ dvc dag --dot | dot -Tpng -o pipeline.png
```

TODO: output of `dvc dag --dot`

ASK: Tell `dvc dag --output` here or skip it?

<admon type="warning">
DVC assumes the command you specify in a stage changes the outputs you specify.
Otherwise, the stage may never be considered up to date, and it will run every
time the pipeline is run.
</admon>

## Dependencies

A stage has three kinds of ingredients: a command, dependencies, and outputs.
The command is a required part of `dvc stage add`, and the other two are
optional. Although optional, dependency definitions enable most of the features
that characterize pipelines.

DVC has more than one type of dependency. The basic one, which we simply call a
`dependency`, is either a file or a directory. A stage that depends on a file or
directory is invalidated when the _contents_ of that file or directory change.

<admon type="note">
DVC doesn't just check file timestamps; it calculates the hash of their contents
to decide whether to invalidate the dependent stages. This is one of the
features that distinguishes it from build tools like `make`.
</admon>
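
The difference between timestamp-based and content-based checks can be sketched
in a few lines of Python. This is only an illustration of the idea, not DVC's
code: the helper `content_hash` and the file name `data.csv` are hypothetical,
and MD5 is used here simply as an example of a content hash.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_hash(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "data.csv"
    dep.write_text("a,b\n1,2\n")
    recorded = content_hash(dep)  # what a previous run would have stored

    # Change only the timestamp: the contents, and so the hash, stay the same.
    os.utime(dep, (0, 0))
    unchanged_after_touch = content_hash(dep) == recorded

    # Edit the file: the contents, and therefore the hash, change.
    dep.write_text("a,b\n1,3\n")
    changed_after_edit = content_hash(dep) != recorded

print(unchanged_after_touch, changed_after_edit)
```

A timestamp-based tool would treat the touched file as changed, whereas a
content-based check only reacts to the actual edit.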

The second type of dependency is _hyperparameters._ Hyperparameters are values
in YAML, JSON or Python files that affect stage commands in some way. For
example, if you have a `train.py` script that builds a deep learning model, it
can read certain parameters from such a file to change training attributes. DVC
can track each of these parameters as a separate dependency. If you have
multiple parameters in a `params.yaml` file that change the behavior of multiple
stages, then when you change a certain parameter, only the stages that depend on
it are invalidated and rerun. This is much more granular than making the whole
of `params.yaml` a file dependency.
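
The granularity described above can be sketched as follows. This is a simplified
model under stated assumptions, not DVC's implementation: parameters are treated
as a flat dictionary, and the stage names, parameter names, and the helper
`invalidated_stages` are all hypothetical.

```python
def invalidated_stages(stage_params, old, new):
    """Return the stages that track at least one parameter whose value changed."""
    # Parameters that were added, removed, or given a different value.
    changed = {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}
    return sorted(name for name, keys in stage_params.items() if changed & set(keys))


# Which parameters each (hypothetical) stage declares as dependencies.
stage_params = {
    "prepare": ["split_ratio"],
    "train": ["learning_rate", "epochs"],
}

old = {"split_ratio": 0.2, "learning_rate": 0.01, "epochs": 10}
new = {"split_ratio": 0.2, "learning_rate": 0.001, "epochs": 10}

result = invalidated_stages(stage_params, old, new)
print(result)
```

Only `train` tracks `learning_rate`, so only `train` needs to run again. Had the
whole parameters file been a single file dependency, `prepare` would have been
invalidated as well.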

Another kind of dependency is the _URL dependency._ Instead of using files that
reside on the local disk, you can `dvc import-url` a dependency from the web.
DVC will check whether the contents of this URL have changed, and invalidate the
dependent stage if so.

File and directory dependencies are defined using the `--deps` / `-d` option of
`dvc stage add`:

```dvc
$ dvc stage add -n train \
                --deps src/train.py \
                --deps data/ \
                python3 src/train.py
```

This means that when either `src/train.py` or `data/` changes, the command
associated with the stage has to be run again.

Note that we also added the source file as a dependency, so that when we update
the code that trains the model, the stage will be run again.

- Why?
- How?

## Outputs

- Why?
- How?

## Reproduction

- Why?
- How?

## Multiple Pipelines

## Experiments with Pipelines

- Why?
- How?