
added some notes for the next iteration
iesahin committed Oct 23, 2021
1 parent aa52356 commit dd1d7dc
Showing 1 changed file with 70 additions and 8 deletions.
78 changes: 70 additions & 8 deletions content/docs/start/data-pipelines.md

# Get Started: Data Pipelines

>> ⁉️ It may be worthwhile to start with the question: "Why pipelines?"

Versioning large data files and directories for data science is great, but not
enough. How is data filtered, transformed, or used to train ML models? DVC
introduces a mechanism to capture _data pipelines_ — series of data processes
that produce a final result.

>> ⁉️ What is a data process? Why do we tie "pipelines" to "code" or "data"
>> here? They are more general ideas; we can have a pipeline that downloads data
>> from a URL using `wget` and checks whether it has changed, for example (like
>> `dvc get` or `dvc import-url`, but simpler).
>> I see that we are introducing pipelines to an ML/DS audience, but the idea is
>> more general, and I believe we can say so here. It's also possible to convey
>> this within the ML/DS context in broader terms.
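>> For example, a minimal non-ML stage might look like this (a sketch; the URL
>> is hypothetical):
>>
>> ```dvc
>> $ dvc stage add -n download \
>>                 -o data/raw.csv \
>>                 'wget -O data/raw.csv https://example.com/data.csv'
>> ```
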
DVC pipelines and their data can also be easily versioned (using Git). This
allows you to better organize projects, and reproduce your workflow and results
later — exactly as they were built originally! For example, you could capture a
simple [ETL][etl] workflow, organize a data science project, or build a detailed
machine learning pipeline.

[etl]: https://en.wikipedia.org/wiki/Extract,_transform,_load

>> We need a figure here.

Watch and learn, or follow along with the code example below!

https://youtu.be/71IGzyH95UY

>> ✍️ DVC has features to handle pipelines easily. You can create stages
>> associated with commands, code, data, and (hyper)parameters. It can run the
>> commands and cache the outputs. DVC tracks the relationships between these
>> stages, so when any associated element changes, the stage is invalidated and
>> re-run. If no dependencies have changed, it can report this and reuse the
>> cached results.
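>> For instance, a sketch of the behavior described above:
>>
>> ```dvc
>> $ dvc repro    # runs stages whose dependencies changed
>> $ dvc repro    # nothing changed: stages are skipped, cached results reused
>> ```
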
Use `dvc stage add` to create _stages_. These represent processes (source code
tracked with Git) which form the steps of a _pipeline_.

>> ⁉️ Adding _data process_ to the concepts doesn't seem to serve us well.
>> Instead, we can continue with: "Stages represent commands to run, along with
>> their dependencies like data and code files, and outputs like model and plot
>> files."

>> ⁉️ I believe we don't need the following sentence if we write it as suggested
>> above.

Stages also connect code to its corresponding data _input_ and _output_. Let's
start by creating a stage to extract the data file in the project.

<details>

### ⚙️ Expand to download example code.

>> ⁉️ I think it might be easier to grasp the concept if we used a simpler
>> pipeline with 3 stages, without so many parameters, metrics, and such.

Get the sample code like this:

```dvc
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip
$ rm -f code.zip
```

Please also add or commit the source code directory with Git at this point.

</details>


>> ⁉️ The first stage we create may be a simpler one.

```dvc
$ dvc stage add -n prepare \
-p prepare.seed,prepare.split \
         -d src/prepare.py -d data/data.xml \
         -o data/prepared \
python src/prepare.py data/data.xml
```

>> ⁉️ We can move the `dvc.yaml` discussion into a hidden section.

A `dvc.yaml` file is generated. It includes information about the command we run
(`python src/prepare.py data/data.xml`), its <abbr>dependencies</abbr>, and
<abbr>outputs</abbr>.

<details>

### 💡 Expand to see what happens under the hood.

>> ⁉️ I think the short descriptions of the options could go in the main text
>> instead of in `dvc.yaml` above. Also, the project should contain a simple
>> pipeline that starts with `-d` and `-o`, then adds `-p` and `-m` to the mix
>> in a later stage. The first example of `stage add` is too complex here.

The command options used above mean the following:

- `-n prepare` specifies a name for the stage. If you open the `dvc.yaml` file,
  you will see a section with this name.
- `-p prepare.seed,prepare.split` defines special types of dependencies called
  parameters. Whenever their values change in `params.yaml`, the stage is
  invalidated.
- `-d src/prepare.py` and `-d data/data.xml` mean that the stage depends on
  these files. If any of them change, DVC knows that the stage needs to be
  reproduced.
- `-o data/prepared` specifies an output directory for this stage, which DVC
  will track.

</details>

>> ⁉️ The following information can also be hidden, or deleted. We assume this
>> Get Started trail is standalone; there is no need to mention `dvc add` here.

There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared`
in this case); `dvc stage add` and `dvc exp run` take care of this. You only
need to run `dvc push` if you want to save them to
[remote storage](/doc/start/data-and-model-versioning#storing-and-sharing)
(usually along with `git commit` to version `dvc.yaml` itself).
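
For example, a typical sequence after running the stage might look like this (a
sketch; it assumes a DVC remote is already configured):

```dvc
$ dvc exp run                  # execute the stage and cache its outputs
$ git add dvc.yaml dvc.lock
$ git commit -m "Add prepare stage"
$ dvc push                     # upload cached outputs to remote storage
```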

>> ⁉️ Here, it may be more natural to present the run-cache and `dvc push` as
>> pushing "pipeline artifacts" instead of "storing and sharing".

>> `dvc push` can push the individual stages, and their associated code and
>> data, so you don't have to re-run them on other machines.
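>> For example (a sketch; `--run-cache` also uploads the records of stage runs,
>> so their results can be reused elsewhere):
>>
>> ```dvc
>> $ dvc push --run-cache
>> ```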

## Dependency graphs (DAGs)

By using `dvc stage add` multiple times, and specifying the <abbr>outputs</abbr>
of one stage as <abbr>dependencies</abbr> of another, we can describe a sequence
of commands that leads to a desired result. This is what we call a _data
pipeline_ or
[_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
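
Once a second stage is connected, the graph can be drawn in the terminal. Here
is a rough sketch of what `dvc dag` prints for the first two stages (the exact
rendering may differ):

```dvc
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
+-----------+
| featurize |
+-----------+
```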

>> ⁉️ All pipelines are DAGs, but not all DAGs are pipelines, so the two are not
>> identical. To me, the DAG reference complicates rather than simplifies.

Let's create a second stage chained to the outputs of `prepare`, to perform
feature extraction:

>> ⁉️ The second stage is almost identical to the first. It may be necessary
>> for the project, but pedagogically we're spending the reader's attention
>> unnecessarily here.

```dvc
$ dvc stage add -n featurize \
         -p featurize.max_features,featurize.ngrams \
         -d src/featurization.py -d data/prepared \
         -o data/features \
         python src/featurization.py data/prepared data/features
```

The changes to the `dvc.yaml` file should look roughly like this:
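
(A sketch of the new entry under the existing `stages:` key; the exact content
is generated by the command above.)

```yaml
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
```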

<details>

### ⚙️ Expand to add more stages.

>> ⁉️ Another stage from the same mold. The first three stages look almost
>> identical.

Let's add the training itself. Nothing new this time; just the same `dvc stage add`
command with the same set of options:

```dvc
$ dvc stage add -n train \
         -p train.seed,train.n_est,train.min_split \
         -d src/train.py -d data/features \
         -o model.pkl \
python src/train.py data/features model.pkl
```

>> ⁉️ The wording below is a bit _distrustful_. In case of an error, DVC should
>> report it.

Please check the `dvc.yaml` file again; it should have one more stage now.
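
With all three stages in place, the pipeline can now be run end to end. A sketch
of the idea (stages run in dependency order, and unchanged stages are skipped):

```dvc
$ dvc exp run
```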

</details>

DVC pipelines (`dvc.yaml` file, `dvc stage add`, and `dvc exp run` commands) solve a few
important problems:

- _Automation_: run a sequence of steps in a "smart" way which makes iterating
  on your project faster. DVC automatically determines which parts of a project
  need to be run, and it caches "runs" and their results to avoid unnecessary
  re-runs.
