From 1c0cd3386bbf7d1968814c8a662608a45f30b4ba Mon Sep 17 00:00:00 2001
From: Emre Sahin
Date: Wed, 10 Mar 2021 12:38:04 +0300
Subject: [PATCH 1/3] updated intro for typos and changes ~/stages to ~/project

---
 get-started/stages/01-whats-a-stage.md         |  2 +-
 .../stages/02-manual-data-preparation.md       |  2 +-
 get-started/stages/05-how-dvc-tracks-stages.md |  2 +-
 .../stages/06-how-directories-are-cached.md    |  2 +-
 .../stages/07-add-featurization-stage.md       |  8 ++++----
 get-started/stages/08-reproduce-a-pipeline.md  |  2 +-
 .../stages/09-visualize-the-pipeline.md        |  4 ++--
 get-started/stages/index.json                  |  8 ++++----
 get-started/stages/init.sh                     |  2 +-
 get-started/stages/install.sh                  |  2 +-
 get-started/stages/intro.md                    | 18 +++++++++---------
 11 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/get-started/stages/01-whats-a-stage.md b/get-started/stages/01-whats-a-stage.md
index 2f55a1a..0ad2a39 100644
--- a/get-started/stages/01-whats-a-stage.md
+++ b/get-started/stages/01-whats-a-stage.md
@@ -7,7 +7,7 @@ machine learning project.

 [bcstage]: https://dvc.org/doc/user-guide/basic-concepts/stage

-We have a machine learning project already provided in `~/stages`. We covered
+We have a machine learning project already provided in `~/project`. We covered
 these steps in previous scenarios. DVC is installed. Data is downloaded from
 `https://github.com/iterative/dataset-registry` and made smaller. A _local
 remote_ is created in `/tmp/data-storage` named `mystorage`, and the data in the
diff --git a/get-started/stages/02-manual-data-preparation.md b/get-started/stages/02-manual-data-preparation.md
index 0b6bc18..968a839 100644
--- a/get-started/stages/02-manual-data-preparation.md
+++ b/get-started/stages/02-manual-data-preparation.md
@@ -3,7 +3,7 @@
 The script `src/prepare.py` splits the data into train and test sets. You can
 click the link below to open the preparation script in the editor.

-`stages/src/prepare.py`{{open}}
+`project/src/prepare.py`{{open}}

 We first run this script without DVC to see what happens:

diff --git a/get-started/stages/05-how-dvc-tracks-stages.md b/get-started/stages/05-how-dvc-tracks-stages.md
index 0151209..fcf96b0 100644
--- a/get-started/stages/05-how-dvc-tracks-stages.md
+++ b/get-started/stages/05-how-dvc-tracks-stages.md
@@ -6,7 +6,7 @@ define relationships between the data, code, parameters, and stages.

 Let's take a look at `dvc.yaml` file to see the content:

-`stages/dvc.yaml`{{open}}
+`project/dvc.yaml`{{open}}

 It contains what we supplied to `dvc stage add`. It lists stages by name and
 defines `cmd`, `deps` and `outs` for each of them.
diff --git a/get-started/stages/06-how-directories-are-cached.md b/get-started/stages/06-how-directories-are-cached.md
index ed91eaa..7a88a37 100644
--- a/get-started/stages/06-how-directories-are-cached.md
+++ b/get-started/stages/06-how-directories-are-cached.md
@@ -16,7 +16,7 @@ their hash values.
 For example we see that the individual hash value of `train.tsv` as
 `fcebfd4c6f1645ac4987d39f1c5cf610` and check its content

-`stages/.dvc/cache/fc/ebfd4c6f1645ac4987d39f1c5cf610`{{open}}.
+`project/.dvc/cache/fc/ebfd4c6f1645ac4987d39f1c5cf610`{{open}}.

 Note also that DVC adds `/prepared` to `.gitignore` to prevent output data files
 to be committed in Git.
diff --git a/get-started/stages/07-add-featurization-stage.md b/get-started/stages/07-add-featurization-stage.md
index 7f475a3..942f047 100644
--- a/get-started/stages/07-add-featurization-stage.md
+++ b/get-started/stages/07-add-featurization-stage.md
@@ -7,7 +7,7 @@ with DVC.
 _Featurization_ step is run by `src/featurization.py`. You can check the
 contents of this program by clicking the link below.

-`stages/src/featurization.py`{{open}}
+`project/src/featurization.py`{{open}}

 We use `dvc.yaml` file in the previous step to add another stage. We name the
 stage `featurize`. It has two dependencies: one is the code file, and
@@ -16,11 +16,11 @@ ready for training as an output.

 Please click the below link to open the file in the editor.

-`stages/dvc.yaml`{{open}}
+`project/dvc.yaml`{{open}}

 Now please click the below text to append the stage configuration to the file.

-
+
   featurize:
     cmd: >-
       python3 src/featurization.py data/prepared data/features
@@ -42,4 +42,4 @@ dataset.
 ```
 git add dvc.yaml dvc.lock data/.gitignore
 git commit -m "Configured prepare stage"
-```{{execute}}
\ No newline at end of file
+```{{execute}}
diff --git a/get-started/stages/08-reproduce-a-pipeline.md b/get-started/stages/08-reproduce-a-pipeline.md
index 664585d..4e7847d 100644
--- a/get-started/stages/08-reproduce-a-pipeline.md
+++ b/get-started/stages/08-reproduce-a-pipeline.md
@@ -53,7 +53,7 @@ changes `dvc repro` won't rerun any part of it.
 Suppose we decided to update our code for `src/prepare.py` by adding the
 following line to it.
 
-
+
 # THIS COMMENT CHANGES MD5 HASH OF THE FILE
 
diff --git a/get-started/stages/09-visualize-the-pipeline.md b/get-started/stages/09-visualize-the-pipeline.md
index 4d96be1..5507225 100644
--- a/get-started/stages/09-visualize-the-pipeline.md
+++ b/get-started/stages/09-visualize-the-pipeline.md
@@ -32,7 +32,7 @@ and convert the `.dot` file to PNG using:

 Now we can view the pipeline in an image format by clicking the link below:

-`stages/pipeline.png`{{open}}
+`project/pipeline.png`{{open}}

 Let's commit the changes in this step to Git.

@@ -44,4 +44,4 @@ git commit -m "another stage to the pipeline is added"

 In the next step, we'll see how to run these two stages together.

-[graphviz]: https://graphviz.org
\ No newline at end of file
+[graphviz]: https://graphviz.org
diff --git a/get-started/stages/index.json b/get-started/stages/index.json
index 7d99618..82b4b6c 100644
--- a/get-started/stages/index.json
+++ b/get-started/stages/index.json
@@ -66,20 +66,20 @@
         },
         {
           "file": "params.yaml",
-          "target": "/root/stages"
+          "target": "/root/project"
         },
         {
           "file": "src/",
-          "target": "/root/stages/"
+          "target": "/root/project/"
         }
       ]
     }
   },
   "environment": {
-    "uieditorpath": "/root/stages",
+    "uieditorpath": "/root/project",
     "uilayout": "vscode-terminal-split"
   },
   "backend": {
     "imageid": "ubuntu:2004"
   }
-}
\ No newline at end of file
+}
diff --git a/get-started/stages/init.sh b/get-started/stages/init.sh
index 6058e61..ae9cd6a 100755
--- a/get-started/stages/init.sh
+++ b/get-started/stages/init.sh
@@ -21,6 +21,6 @@ source /etc/bash_completion

 # clear screen
 clear
-cd stages
+cd project
 # auto-play preparation steps
 DELAY=0 play prepare.sh
diff --git a/get-started/stages/install.sh b/get-started/stages/install.sh
index b36d6f9..90e8e38 100755
--- a/get-started/stages/install.sh
+++ b/get-started/stages/install.sh
@@ -12,4 +12,4 @@ wget -O /etc/bash_completion.d/dvc \
     https://raw.githubusercontent.com/iterative/dvc/master/scripts/completion/dvc.bash

 # this is about a bug in index.json
-rm -f /root/stages/play /root/stages/prepare.sh /root/stages/example-flow.png
+rm -f /root/project/play /root/project/prepare.sh /root/project/example-flow.png
diff --git a/get-started/stages/intro.md b/get-started/stages/intro.md
index a3998c9..b33fe5b 100644
--- a/get-started/stages/intro.md
+++ b/get-started/stages/intro.md
@@ -1,17 +1,17 @@
 The commands that we have seen so far (`add`, `push`, `pull`, etc.) provide a
-useful framework to track, save and share models and large data files. In
-some cases and projects, this could be all you need.
+useful framework to track, save, and share models and large data files. In some
+cases and projects, this could be all you need.

-Usually, in ML projects, you need to process data and generate
-outputs in a reproducible way. This requires establishing a connection
-between the data processed, the program that processes them,
-the parameters, and the outputs.
+Usually, in ML projects, you need to process data and generate outputs in a
+reproducible way. This requires establishing a connection between the data
+processed, the program that processes them, its parameters and the outputs.

 In a typical machine learning project we have the following stages:

 ![](/dvc/courses/get-started/stages/assets/example-flow.png)

-This process is reflected in DVC with a [pipeline][bcpipeline]. In this scenario
-we begin to build pipelines using stage definitions and connect them together.
+This process is reflected in DVC with a [data pipeline][bcpipeline]. In this
+scenario we begin to build pipelines using stage definitions and connect them
+together.

-[bcpipeline]: https://dvc.org/doc/user-guide/basic-concepts/pipeline
\ No newline at end of file
+[bcpipeline]: https://dvc.org/doc/user-guide/basic-concepts/pipeline

From f1424133e16e59def079f7dca48b5d8146e8919a Mon Sep 17 00:00:00 2001
From: Emre Sahin
Date: Wed, 10 Mar 2021 13:39:50 +0300
Subject: [PATCH 2/3] merged step1 and intro and other fixes in #29

---
 ...ation.md => 01-manual-data-preparation.md} |  0
 get-started/stages/01-whats-a-stage.md        | 18 ------------
 ...adding-a-stage.md => 02-adding-a-stage.md} |  0
 ...nning-a-stage.md => 03-running-a-stage.md} |  0
 ...-stages.md => 04-how-dvc-tracks-stages.md} |  0
 ...ed.md => 05-how-directories-are-cached.md} |  0
 ...stage.md => 06-add-featurization-stage.md} |  0
 ...pipeline.md => 07-reproduce-a-pipeline.md} |  0
 ...peline.md => 08-visualize-the-pipeline.md} |  0
 .../stages/{10-ending.md => 09-ending.md}     |  0
 get-started/stages/index.json                 | 22 ++++++---------
 get-started/stages/intro.md                   | 28 +++++++++++++------
 12 files changed, 28 insertions(+), 40 deletions(-)
 rename get-started/stages/{02-manual-data-preparation.md => 01-manual-data-preparation.md} (100%)
 delete mode 100644 get-started/stages/01-whats-a-stage.md
 rename get-started/stages/{03-adding-a-stage.md => 02-adding-a-stage.md} (100%)
 rename get-started/stages/{04-running-a-stage.md => 03-running-a-stage.md} (100%)
 rename get-started/stages/{05-how-dvc-tracks-stages.md => 04-how-dvc-tracks-stages.md} (100%)
 rename get-started/stages/{06-how-directories-are-cached.md => 05-how-directories-are-cached.md} (100%)
 rename get-started/stages/{07-add-featurization-stage.md => 06-add-featurization-stage.md} (100%)
 rename get-started/stages/{08-reproduce-a-pipeline.md => 07-reproduce-a-pipeline.md} (100%)
 rename get-started/stages/{09-visualize-the-pipeline.md => 08-visualize-the-pipeline.md} (100%)
 rename get-started/stages/{10-ending.md => 09-ending.md} (100%)

diff --git a/get-started/stages/02-manual-data-preparation.md b/get-started/stages/01-manual-data-preparation.md
similarity index 100%
rename from get-started/stages/02-manual-data-preparation.md
rename to get-started/stages/01-manual-data-preparation.md
diff --git a/get-started/stages/01-whats-a-stage.md b/get-started/stages/01-whats-a-stage.md
deleted file mode 100644
index 0ad2a39..0000000
--- a/get-started/stages/01-whats-a-stage.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# What's a stage?
-
-[Stages][bcstage] are the basic building blocks of pipelines in DVC. They define
-and execute an action, like data import or feature extraction, and usually
-produce some output. In this scenario, we create stages and pipelines for a
-machine learning project.
-
-[bcstage]: https://dvc.org/doc/user-guide/basic-concepts/stage
-
-We have a machine learning project already provided in `~/project`. We covered
-these steps in previous scenarios. DVC is installed. Data is downloaded from
-`https://github.com/iterative/dataset-registry` and made smaller. A _local
-remote_ is created in `/tmp/data-storage` named `mystorage`, and the data in the
-DVC repository is pushed. Code and python requirements are prepared, and all
-changes are committed to Git.
-
-You can use the editor to browse the project.
-
diff --git a/get-started/stages/03-adding-a-stage.md b/get-started/stages/02-adding-a-stage.md
similarity index 100%
rename from get-started/stages/03-adding-a-stage.md
rename to get-started/stages/02-adding-a-stage.md
diff --git a/get-started/stages/04-running-a-stage.md b/get-started/stages/03-running-a-stage.md
similarity index 100%
rename from get-started/stages/04-running-a-stage.md
rename to get-started/stages/03-running-a-stage.md
diff --git a/get-started/stages/05-how-dvc-tracks-stages.md b/get-started/stages/04-how-dvc-tracks-stages.md
similarity index 100%
rename from get-started/stages/05-how-dvc-tracks-stages.md
rename to get-started/stages/04-how-dvc-tracks-stages.md
diff --git a/get-started/stages/06-how-directories-are-cached.md b/get-started/stages/05-how-directories-are-cached.md
similarity index 100%
rename from get-started/stages/06-how-directories-are-cached.md
rename to get-started/stages/05-how-directories-are-cached.md
diff --git a/get-started/stages/07-add-featurization-stage.md b/get-started/stages/06-add-featurization-stage.md
similarity index 100%
rename from get-started/stages/07-add-featurization-stage.md
rename to get-started/stages/06-add-featurization-stage.md
diff --git a/get-started/stages/08-reproduce-a-pipeline.md b/get-started/stages/07-reproduce-a-pipeline.md
similarity index 100%
rename from get-started/stages/08-reproduce-a-pipeline.md
rename to get-started/stages/07-reproduce-a-pipeline.md
diff --git a/get-started/stages/09-visualize-the-pipeline.md b/get-started/stages/08-visualize-the-pipeline.md
similarity index 100%
rename from get-started/stages/09-visualize-the-pipeline.md
rename to get-started/stages/08-visualize-the-pipeline.md
diff --git a/get-started/stages/10-ending.md b/get-started/stages/09-ending.md
similarity index 100%
rename from get-started/stages/10-ending.md
rename to get-started/stages/09-ending.md
diff --git a/get-started/stages/index.json b/get-started/stages/index.json
index 82b4b6c..902569a 100644
--- a/get-started/stages/index.json
+++ b/get-started/stages/index.json
@@ -7,43 +7,39 @@
     "steps": [
       {
         "title": "Step 1",
-        "text": "01-whats-a-stage.md"
+        "text": "01-manual-data-preparation.md"
       },
       {
         "title": "Step 2",
-        "text": "02-manual-data-preparation.md"
+        "text": "02-adding-a-stage.md"
       },
       {
         "title": "Step 3",
-        "text": "03-adding-a-stage.md"
+        "text": "03-running-a-stage.md"
       },
       {
         "title": "Step 4",
-        "text": "04-running-a-stage.md"
+        "text": "04-how-dvc-tracks-stages.md"
       },
       {
         "title": "Step 5",
-        "text": "05-how-dvc-tracks-stages.md"
+        "text": "05-how-directories-are-cached.md"
       },
       {
         "title": "Step 6",
-        "text": "06-how-directories-are-cached.md"
+        "text": "06-add-featurization-stage.md"
       },
       {
         "title": "Step 7",
-        "text": "07-add-featurization-stage.md"
+        "text": "07-reproduce-a-pipeline.md"
       },
       {
         "title": "Step 8",
-        "text": "08-reproduce-a-pipeline.md"
-      },
-      {
-        "title": "Step 9",
-        "text": "09-visualize-the-pipeline.md"
+        "text": "08-visualize-the-pipeline.md"
       },
       {
         "title": "Congratulations!",
-        "text": "10-ending.md"
+        "text": "09-ending.md"
       }
     ],
     "intro": {
diff --git a/get-started/stages/intro.md b/get-started/stages/intro.md
index b33fe5b..1d82f44 100644
--- a/get-started/stages/intro.md
+++ b/get-started/stages/intro.md
@@ -1,17 +1,27 @@
-The commands that we have seen so far (`add`, `push`, `pull`, etc.) provide a
-useful framework to track, save, and share models and large data files. In some
-cases and projects, this could be all you need.
-
-Usually, in ML projects, you need to process data and generate outputs in a
+In ML projects, usually we need to process data and generate outputs in a
 reproducible way. This requires establishing a connection between the data
-processed, the program that processes them, its parameters and the outputs.
-
-In a typical machine learning project we have the following stages:
+processed, the program that processes them, its parameters, and the outputs.

 ![](/dvc/courses/get-started/stages/assets/example-flow.png)

 This process is reflected in DVC with a [data pipeline][bcpipeline]. In this
-scenario we begin to build pipelines using stage definitions and connect them
+scenario, we begin to build pipelines using stage definitions and connect them
 together.

 [bcpipeline]: https://dvc.org/doc/user-guide/basic-concepts/pipeline
+
+[Stages][bcstage] are the basic building blocks of pipelines in DVC. They define
+and execute an action, like data import or feature extraction, and usually
+produce some output.
+
+[bcstage]: https://dvc.org/doc/user-guide/basic-concepts/stage
+
+We have a machine learning project already provided in `~/project`. We provided
+source files in `~/project/src/`, downloaded data to `data/data.xml`, and made
+it smaller. You can review these steps in more detail in [Data and Model
+Versioning][v] and [Accessing Data and Models][a] scenarios.
+
+[v]: https://katacoda.com/dvc/courses/get-started/versioning
+[a]: https://katacoda.com/dvc/courses/get-started/accessing
+
+You can use the editor to browse the project.

From 6d1acc95d60de3ff809dfd4c358ad9e9ddac7dda Mon Sep 17 00:00:00 2001
From: Emre Sahin
Date: Wed, 10 Mar 2021 14:04:54 +0300
Subject: [PATCH 3/3] Edited some sentences and moved them to intro

---
 get-started/stages/01-manual-data-preparation.md | 12 ++----------
 get-started/stages/intro.md                      | 10 +++++++---
 2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/get-started/stages/01-manual-data-preparation.md b/get-started/stages/01-manual-data-preparation.md
index 968a839..636d075 100644
--- a/get-started/stages/01-manual-data-preparation.md
+++ b/get-started/stages/01-manual-data-preparation.md
@@ -1,7 +1,6 @@
 # Manual Data Preparation

-The script `src/prepare.py` splits the data into train and test sets. You can
-click the link below to open the preparation script in the editor.
+The script `src/prepare.py` splits the data into train and test sets. (Click links to open in the editor)

 `project/src/prepare.py`{{open}}

@@ -11,14 +10,7 @@ We first run this script without DVC to see what happens:

 It splits the data into train and test sets. We check the contents:

-`head data/prepared/train.tsv`{{execute}}
-
-`head data/prepared/test.tsv`{{execute}}
-
-Our goal is to create a project that classifies the questions and assigns tags
-to them. In a world _without_ DVC tasks like data preparation, training,
-testing, evaluation, etc. are run manually, and this is prone to errors all we
-know from working with too many moving parts.
+`ls -l data/prepared`{{execute}}

 We use DVC to automate the tasks required to build a classifier and provide a
 fully reproducible pipeline.
diff --git a/get-started/stages/intro.md b/get-started/stages/intro.md
index 1d82f44..8008f8c 100644
--- a/get-started/stages/intro.md
+++ b/get-started/stages/intro.md
@@ -8,6 +8,7 @@ This process is reflected in DVC with a [data pipeline][bcpipeline]. In this
 scenario, we begin to build pipelines using stage definitions and connect them
 together.

+
 [bcpipeline]: https://dvc.org/doc/user-guide/basic-concepts/pipeline

 [Stages][bcstage] are the basic building blocks of pipelines in DVC. They define
@@ -16,9 +17,12 @@ produce some output.

 [bcstage]: https://dvc.org/doc/user-guide/basic-concepts/stage

-We have a machine learning project already provided in `~/project`. We provided
-source files in `~/project/src/`, downloaded data to `data/data.xml`, and made
-it smaller. You can review these steps in more detail in [Data and Model
+In this scenario, our goal is to create a project that classifies the
+questions and assigns tags to them. In a world _without_ DVC, tasks like
+data preparation, training, testing, evaluation are run manually, and this
+is prone to errors caused by too many moving parts. We provided the source
+files in `~/project/src/`, downloaded data to `data/data.xml`, and made it
+smaller. You can review these steps in more detail in [Data and Model
 Versioning][v] and [Accessing Data and Models][a] scenarios.

 [v]: https://katacoda.com/dvc/courses/get-started/versioning