diff --git a/.github/workflows/link-check-deploy.yml b/.github/workflows/link-check-deploy.yml
index 45068a38d8..1dca96cd73 100644
--- a/.github/workflows/link-check-deploy.yml
+++ b/.github/workflows/link-check-deploy.yml
@@ -13,6 +13,8 @@ jobs:
       github.event.deployment_status.state == 'success'
     steps:
       - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
       - id: build_check
         uses: LouisBrunner/checks-action@v1.0.0
         with:
@@ -21,7 +23,7 @@ jobs:
           status: queued
       - name: Run Link Check
         id: check
-        uses: iterative/link-check.action@v0.8
+        uses: iterative/link-check.action@v0.9
         with:
           diff: true
           configFile: config/link-check/config.yml
diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 4e266aa374..3d0b22198a 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -6,32 +6,63 @@ version, and reproduce your data science and machine learning workflows.'

 # Get Started: Data Pipelines

+>> ⁉️ It may be worthwhile to start with the question: "Why pipelines?"
+
 Versioning large data files and directories for data science is great, but not
 enough. How is data filtered, transformed, or used to train ML models? DVC
 introduces a mechanism to capture _data pipelines_ — series of data processes
 that produce a final result.

+>> ⁉️ What is a data process? Why do we tie "pipelines" to "code" or "data"
+>> here? They are more general ideas; we can have a pipeline that downloads
+>> data from a URL using `wget` and checks whether it has changed, for example
+>> (like `dvc get` or `dvc import-url`, but simpler).
+
+>> I see that we are introducing pipelines to an ML/DS audience, but the idea
+>> is more general and I believe we can say so here. It's also possible to put
+>> this in broader terms within the ML/DS context.
+
 DVC pipelines and their data can also be easily versioned (using Git). This
 allows you to better organize projects, and reproduce your workflow and results
 later — exactly as they were built originally! For example, you could capture a
-simple ETL workflow, organize a data science project, or build a detailed
+simple [ETL workflow][etl], organize a data science project, or build a detailed
 machine learning pipeline.

+[etl]: https://en.wikipedia.org/wiki/Extract,_transform,_load
+
+>> We need a figure here.
+
 Watch and learn, or follow along with the code example below!

 https://youtu.be/71IGzyH95UY

-## Pipeline stages
+>> ✍️ DVC has features to handle pipelines easily. You can create stages
+>> associated with commands, code, data, and (hyper)parameters. It can run the
+>> commands and cache the outputs. DVC tracks the relationships between these
+>> stages, so when any of these associated elements changes, the stage is
+>> invalidated and run again. If no dependencies have changed, it can report
+>> this and reuse the cached results.
+
+Use `dvc stage add` to create _stages_. These represent processes (source code
+tracked with Git) which form the steps of a _pipeline_.

-Use `dvc run` to create _stages_. These represent processes (source code tracked
-with Git) which form the steps of a _pipeline_. Stages also connect code to its
-corresponding data _input_ and _output_. Let's transform a Python script into a
-[stage](/doc/command-reference/run):
+>> ⁉️ Adding _data process_ to the concepts doesn't seem to serve well.
+>> Instead, we can continue with something like: "Stages represent commands to
+>> run, along with their dependencies like data and code files, and outputs
+>> like model and plot files."
+
+>> ⁉️ I believe we don't need the following sentence if we phrase it as in the
+>> previous comment.
+
+Stages also connect code to its corresponding data _input_ and _output_.
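+>> ✍️ To illustrate the more general, non-ML use of stages mentioned above, a
+>> minimal sketch could work (the URL and output file are hypothetical, not
+>> part of the sample project):
+
+```dvc
+$ dvc stage add -n download \
+          -o data.xml \
+          wget https://example.com/data.xml
+```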
 ### ⚙️ Expand to download example code.

+>> ⁉️ I think it might be easier to grasp the concept if we use a simpler
+>> pipeline with 3 stages, without so many parameters, metrics, and such.
+
 Get the sample code like this:

 ```dvc
@@ -63,15 +94,20 @@
 Please also add or commit the source code directory with Git at this point.
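+>> ✍️ We could show the Git step explicitly; a minimal sketch, assuming the
+>> sample code was placed under `src/` as in the commands below:
+
+```dvc
+$ git add src
+$ git commit -m "Add source code for the pipeline"
+```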
+
+>> ⁉️ The first stage we create may be a simpler one.
+
 ```dvc
-$ dvc run -n prepare \
-          -p prepare.seed,prepare.split \
-          -d src/prepare.py -d data/data.xml \
-          -o data/prepared \
-          python src/prepare.py data/data.xml
+$ dvc stage add -n prepare \
+          -p prepare.seed,prepare.split \
+          -d src/prepare.py -d data/data.xml \
+          -o data/prepared \
+          python src/prepare.py data/data.xml
 ```

-A `dvc.yaml` file is generated. It includes information about the command we ran
+>> ⁉️ We can move the `dvc.yaml` discussion into a hidden section.
+
+A `dvc.yaml` file is generated. It includes information about the command we want to run
 (`python src/prepare.py data/data.xml`), its dependencies, and outputs.
@@ -79,6 +115,11 @@ A `dvc.yaml` file is generated. It includes information about the command we ran

 ### 💡 Expand to see what happens under the hood.

+>> ⁉️ I think the short descriptions of the options could go in the main text,
+>> instead of the `dvc.yaml` above. Also, the project should contain a simple
+>> pipeline that starts with `-d` and `-o`, then add `-p` and `-m` to the mix
+>> in a later stage. The first example of `stage add` is too complex here.
+
 The command options used above mean the following:

 - `-n prepare` specifies a name for the stage. If you open the `dvc.yaml` file
@@ -140,34 +181,53 @@ stages:

+>> ⁉️ The following information can also be hidden, or deleted. We assume this
+>> GS trail will be standalone, so there's no need to mention `dvc add` here.
+
 There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared`
-in this case); `dvc run` already took care of this. You only need to run
-`dvc push` if you want to save them to
+in this case); `dvc stage add` and `dvc exp run` take care of this. You only need
+to run `dvc push` if you want to save them to
 [remote storage](/doc/start/data-and-model-versioning#storing-and-sharing),
 (usually along with `git commit` to version `dvc.yaml` itself).

+>> ⁉️ Here, it may be more natural to present the Run-Cache and `dvc push` as
+>> pushing "pipeline artifacts" instead of "storing and sharing".
+
+>> `dvc push` can push the individual stages, and their associated code and
+>> data, so you don't have to re-run them on other machines.
+
 ## Dependency graphs (DAGs)

-By using `dvc run` multiple times, and specifying outputs of a
+By using `dvc stage add` multiple times, and specifying outputs of a
 stage as dependencies of another one, we can describe a sequence of commands
 which gets to a desired result. This is what we call a _data pipeline_ or
 [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

+>> ⁉️ All pipelines are DAGs, but not all DAGs are pipelines, so these two are
+>> not identical. The DAG reference seems to complicate rather than simplify
+>> things, to me.
+
 Let's create a second stage chained to the outputs of `prepare`, to perform
 feature extraction:

+>> ⁉️ The second stage is almost identical to the first. It may be necessary
+>> for the project, but pedagogically we're spending the reader's attention
+>> unnecessarily here.
+
 ```dvc
-$ dvc run -n featurize \
-          -p featurize.max_features,featurize.ngrams \
-          -d src/featurization.py -d data/prepared \
-          -o data/features \
-          python src/featurization.py data/prepared data/features
+$ dvc stage add -n featurize \
+          -p featurize.max_features,featurize.ngrams \
+          -d src/featurization.py -d data/prepared \
+          -o data/features \
+          python src/featurization.py data/prepared data/features
 ```

 The `dvc.yaml` file is updated automatically and should include two stages now.
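+>> ✍️ Now that two stages are chained, we could let readers see the graph for
+>> themselves; `dvc dag` prints the stages and their connections (just a
+>> suggestion for this spot):
+
+```dvc
+$ dvc dag
+```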
+
 ### 💡 Expand to see what happens under the hood.

 The changes to the `dvc.yaml` should look like this:
@@ -202,17 +262,23 @@ The changes to the `dvc.yaml` should look like this:

 ### ⚙️ Expand to add more stages.

-Let's add the training itself. Nothing new this time; just the same `dvc run`
+>> ⁉️ Another stage from the same mold. The first three stages look almost
+>> identical.
+
+Let's add the training itself. Nothing new this time; just the same `dvc stage add`
 command with the same set of options:

 ```dvc
-$ dvc run -n train \
-          -p train.seed,train.n_est,train.min_split \
-          -d src/train.py -d data/features \
-          -o model.pkl \
-          python src/train.py data/features model.pkl
+$ dvc stage add -n train \
+          -p train.seed,train.n_est,train.min_split \
+          -d src/train.py -d data/features \
+          -o model.pkl \
+          python src/train.py data/features model.pkl
 ```

+>> ⁉️ The wording below sounds a bit _distrustful_. In case of an error, DVC
+>> should report it.
+
 Please check the `dvc.yaml` again, it should have one more stage now.
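+>> ✍️ One way to make this check feel less manual is to list the registered
+>> stages; a sketch, assuming a DVC release that includes `dvc stage list`:
+
+```dvc
+$ dvc stage list
+```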
@@ -220,13 +286,13 @@ Please check the `dvc.yaml` again, it should have one more stage now.

 This should be a good time to commit the changes with Git. These include
 `.gitignore`, `dvc.lock`, and `dvc.yaml` — which describe our pipeline.

-## Reproduce
+## Run the pipeline

 The whole point of creating this `dvc.yaml` file is the ability to easily
 reproduce a pipeline:

 ```dvc
-$ dvc repro
+$ dvc exp run
 ```
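+>> ✍️ Might be worth a note that plain `dvc repro` still works here for
+>> readers who only want to rebuild the pipeline without experiment tracking:
+
+```dvc
+$ dvc repro
+```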
@@ -237,12 +303,14 @@ Let's try to play a little bit with it.
 First, let's try to change one of the
 parameters for the training stage:

 1. Open `params.yaml` and change `n_est` to `100`, and
-2. (re)run `dvc repro`.
+2. (re)run `dvc exp run`.
+
+>> Link to the experiments trail here.

 You should see:

 ```dvc
-$ dvc repro
+$ dvc exp run
 Stage 'prepare' didn't change, skipping
 Stage 'featurize' didn't change, skipping
 Running stage 'train' with command: ...
@@ -251,10 +319,13 @@ Running stage 'train' with command: ...

 DVC detected that only `train` should be run, and skipped everything else! All
 the intermediate results are being reused.

-Now, let's change it back to `50` and run `dvc repro` again:
+Now, let's change it back to `50` and run `dvc exp run` again:
+
+>> It looks like these manual changes are a bit tedious. We could replace them
+>> with code or data changes that can't be captured with `dvc exp run -S`.

 ```dvc
-$ dvc repro
+$ dvc exp run
 Stage 'prepare' didn't change, skipping
 Stage 'featurize' didn't change, skipping
 ```
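+>> ✍️ For reference, a parameter change like this can also be made on the fly
+>> with `--set-param`, so the reader doesn't have to edit `params.yaml` by
+>> hand; a minimal sketch:
+
+```dvc
+$ dvc exp run -S train.n_est=100
+```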
@@ -300,7 +371,7 @@ stages:

-DVC pipelines (`dvc.yaml` file, `dvc run`, and `dvc repro` commands) solve a few
+DVC pipelines (`dvc.yaml` file, `dvc stage add`, and `dvc exp run` commands) solve a few
 important problems:

 - _Automation_: run a sequence of steps in a "smart" way which makes iterating