From aa52356c355b36e6b8644884d35efe75edac3e9f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Emre=20=C5=9Eahin?=
Date: Thu, 21 Oct 2021 20:16:30 +0300
Subject: [PATCH] some command changes and inline comments.

---
 content/docs/start/data-pipelines.md | 52 ++++++++++++++++------------
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index a649246033..86c310e895 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -67,11 +67,11 @@ Please also add or commit the source code directory with Git at this point.
 
 ```dvc
-$ dvc run -n prepare \
-          -p prepare.seed,prepare.split \
-          -d src/prepare.py -d data/data.xml \
-          -o data/prepared \
-          python src/prepare.py data/data.xml
+$ dvc stage add -n prepare \
+                -p prepare.seed,prepare.split \
+                -d src/prepare.py -d data/data.xml \
+                -o data/prepared \
+                python src/prepare.py data/data.xml
 ```
 
 A `dvc.yaml` file is generated. It includes information about the command we run
@@ -144,14 +144,14 @@ stages:
 
 There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared`
-in this case); `dvc stage add` and `dvc repro` takes care of this. You only need
+in this case); `dvc stage add` and `dvc exp run` take care of this. You only need
 to run `dvc push` if you want to save them to
 [remote storage](/doc/start/data-and-model-versioning#storing-and-sharing),
 (usually along with `git commit` to version `dvc.yaml` itself).
 
 ## Dependency graphs (DAGs)
 
-By using `dvc run` multiple times, and specifying outputs of a
+By using `dvc stage add` multiple times, and specifying outputs of a
 stage as dependencies of another one, we can describe a sequence of commands
 which gets to a desired result. This is what we call a _data pipeline_ or
 [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
@@ -160,17 +160,18 @@ Let's create a second stage chained to the outputs of `prepare`, to perform
 feature extraction:
 
 ```dvc
-$ dvc run -n featurize \
-          -p featurize.max_features,featurize.ngrams \
-          -d src/featurization.py -d data/prepared \
-          -o data/features \
-          python src/featurization.py data/prepared data/features
+$ dvc stage add -n featurize \
+                -p featurize.max_features,featurize.ngrams \
+                -d src/featurization.py -d data/prepared \
+                -o data/features \
+                python src/featurization.py data/prepared data/features
 ```
 
 The `dvc.yaml` file is updated automatically and should include two stages now.
+
 ### 💡 Expand to see what happens under the hood.
 
 The changes to the `dvc.yaml` should look like this:
@@ -209,11 +210,11 @@ Let's add the training itself. Nothing new this time; just the same `dvc run`
 command with the same set of options:
 
 ```dvc
-$ dvc run -n train \
-          -p train.seed,train.n_est,train.min_split \
-          -d src/train.py -d data/features \
-          -o model.pkl \
-          python src/train.py data/features model.pkl
+$ dvc stage add -n train \
+                -p train.seed,train.n_est,train.min_split \
+                -d src/train.py -d data/features \
+                -o model.pkl \
+                python src/train.py data/features model.pkl
 ```
 
 Please check the `dvc.yaml` again, it should have one more stage now.
@@ -223,13 +224,13 @@ Please check the `dvc.yaml` again, it should have one more stage now.
 This should be a good time to commit the changes with Git. These include
 `.gitignore`, `dvc.lock`, and `dvc.yaml` — which describe our pipeline.
 
-## Reproduce
+## Run the pipeline
 
 The whole point of creating this `dvc.yaml` file is the ability to easily
 reproduce a pipeline:
 
 ```dvc
-$ dvc repro
+$ dvc exp run
 ```
@@ -240,12 +241,14 @@ Let's try to play a little bit with it. First, let's try to change one of the
 parameters for the training stage:
 
 1. Open `params.yaml` and change `n_est` to `100`, and
-2. (re)run `dvc repro`.
+2. (re)run `dvc exp run`.
+
+>> Link to experiments trail here
 
 You should see:
 
 ```dvc
-$ dvc repro
+$ dvc exp run
 Stage 'prepare' didn't change, skipping
 Stage 'featurize' didn't change, skipping
 Running stage 'train' with command: ...
@@ -254,10 +257,13 @@ Running stage 'train' with command: ...
 DVC detected that only `train` should be run, and skipped everything else! All
 the intermediate results are being reused.
 
-Now, let's change it back to `50` and run `dvc repro` again:
+Now, let's change it back to `50` and run `dvc exp run` again:
+
+>> It looks like these manual changes are a bit tedious. We can replace these with
+>> code or data changes that can't be captured with `dvc exp run -S`
 
 ```dvc
-$ dvc repro
+$ dvc exp run
 Stage 'prepare' didn't change, skipping
 Stage 'featurize' didn't change, skipping
 ```
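
Review note (not part of the patch): the skipping behaviour shown in the last two hunks can be pictured with a toy model. The Python sketch below is only an illustration and not DVC's implementation. `STAGES` and `stages_to_run` are hypothetical names, the stage fields are copied from the three `dvc stage add` commands in the patch, and the sketch assumes that a stage re-runs when one of its parameters changes or when an upstream stage producing one of its dependencies re-runs. The `prepare.seed` values in the usage lines are made up.

```python
# Toy model of the pipeline defined by the three `dvc stage add` commands.
# Hypothetical illustration only: DVC's real change detection also compares
# file hashes recorded in `dvc.lock`, which this sketch ignores.
STAGES = {
    "prepare": {
        "params": ["prepare.seed", "prepare.split"],
        "deps": ["src/prepare.py", "data/data.xml"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "params": ["featurize.max_features", "featurize.ngrams"],
        "deps": ["src/featurization.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "train": {
        "params": ["train.seed", "train.n_est", "train.min_split"],
        "deps": ["src/train.py", "data/features"],
        "outs": ["model.pkl"],
    },
}


def stages_to_run(stages, old_params, new_params):
    """Return the names of stages that need to run again, in pipeline order."""
    changed_outs = set()
    to_run = []
    # Assumes `stages` is listed in dependency order, as it is above.
    for name, spec in stages.items():
        params_changed = any(
            old_params.get(p) != new_params.get(p) for p in spec["params"]
        )
        deps_changed = any(dep in changed_outs for dep in spec["deps"])
        if params_changed or deps_changed:
            to_run.append(name)
            # Treat every output of a re-run stage as changed (a simplification).
            changed_outs.update(spec["outs"])
    return to_run


# Changing only `train.n_est` (50 -> 100, as in the walkthrough) touches one stage.
print(stages_to_run(STAGES, {"train.n_est": 50}, {"train.n_est": 100}))
# Made-up seed values: a change in an early stage cascades downstream.
print(stages_to_run(STAGES, {"prepare.seed": 1}, {"prepare.seed": 2}))
```

In this model, editing `train.n_est` selects only `train`, which matches the "didn't change, skipping" lines for `prepare` and `featurize` in the patched output. The model has no notion of cached results, so it says nothing about what happens when the value is set back to `50`.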