This repository has been archived by the owner on Jul 5, 2022. It is now read-only.

GS: Update the Experiments scenario with the MNIST dataset and Tensorflow based dvc-get-started-mnist project #63

Merged · 7 commits · Apr 24, 2021
7 changes: 3 additions & 4 deletions get-started/06-experiments/01-running-experiments.md
@@ -15,17 +15,16 @@ see the help text first:

The first command we'll use is `dvc exp run`. It's like `dvc repro` with added
features for experiments, like changing the hyperparameters with command line
options:

```
dvc exp run --set-param featurize.max_features=1500 \
-S featurize.ngrams=2
dvc exp run --set-param model.name=mlp
```{{execute}}

The `--set-param` (or `-S`) flag sets the values for parameters as a shortcut
to editing `params.yaml`.
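
As a rough equivalent (a sketch, not a step in this scenario), the same change
could be made by editing the file directly and then running the pipeline; the
`yq` (v4) invocation below is an assumption about available tooling:

```
# Hypothetical manual alternative to `-S` (assumes the Go version of yq, v4):
yq -i '.model.name = "mlp"' params.yaml
dvc exp run
```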

Check that the `featurize.max_features` value has been updated in `params.yaml`:
Note that the `model.name` parameter has been updated in `params.yaml`:

`git diff params.yaml`{{execute}}

14 changes: 6 additions & 8 deletions get-started/06-experiments/02-queueing-experiments.md
@@ -1,9 +1,7 @@
## Queueing experiments

We have been tuning the `featurize` stage so far, but there are also parameters
for the `train` stage, which trains a [random forest classifier][rfc].

[rfc]: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
For the MLP version of the project, we have two parameters that change the
number of hidden units and the activation function.

`example-get-started/params.yaml`{{open}}
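
The exact contents of `params.yaml` aren't shown in this diff. A rough sketch of
the layout this scenario seems to rely on, inferred only from the keys used below
(`model.name`, `model.mlp.units`), might look like this; the activation key name
and the default values are assumptions:

```
# Hypothetical sketch of the relevant section of params.yaml; only model.name
# and model.mlp.units appear in this scenario, the rest is assumed.
cat <<'EOF'
model:
  name: mlp            # which of the two networks to train
  mlp:
    units: 64          # number of hidden units
    activation: relu   # activation function (assumed key name)
EOF
```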

@@ -12,10 +10,10 @@ combinations we want to try without executing anything, by using the `--queue`
flag:

```
dvc exp run --queue -n exp-1 -S train.n_est=50
dvc exp run --queue -n exp-2 -S train.n_est=100
dvc exp run --queue -n exp-3 -S train.n_est=150
dvc exp run --queue -n exp-4 -S train.n_est=200
dvc exp run --queue -n exp-1 -S model.mlp.units=32
dvc exp run --queue -n exp-2 -S model.mlp.units=64
dvc exp run --queue -n exp-3 -S model.mlp.units=128
dvc exp run --queue -n exp-4 -S model.mlp.units=256
```{{execute}}
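
Queued experiments are not executed until requested. In DVC 2.x they can
typically be run in one go with `--run-all`; the `--jobs` value below is
illustrative:

```
# Run everything in the queue; -j/--jobs runs queued experiments in parallel.
dvc exp run --run-all --jobs 2
```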

The `-n` option is used to label the experiments. If it's not specified,
6 changes: 3 additions & 3 deletions get-started/06-experiments/03-comparing-experiments.md
@@ -6,9 +6,9 @@ To compare all of these experiments, we need more than `dvc exp diff`:

```
dvc exp show --no-timestamp \
--include-params train.n_est \
--include-params model.mlp.units \
--no-pager
```{{execute}}
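
For a pairwise view, `dvc exp diff` can also compare two experiments directly;
the names below are the labels assigned in the previous step:

```
# Compare parameters and metrics between two of the queued experiments.
dvc exp diff exp-2 exp-4
```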

Although the differences in metrics are minuscule due to the small size of
the data set, `exp-2` is a bit better in terms of `avg_prec`.
Since `exp-4` has the most hidden units in the MLP, it achieves the highest
`categorical_accuracy`.
8 changes: 4 additions & 4 deletions get-started/06-experiments/04-persisting-experiments.md
@@ -5,22 +5,22 @@ ignore the rest.

`dvc exp apply` rolls back the workspace to the specified experiment:

`dvc exp apply exp-2`{{execute}}
`dvc exp apply exp-4`{{execute}}

`dvc exp apply` is similar to [`dvc checkout`][dvccheckout], but it works with experiments. DVC
tracks everything in the pipeline for each experiment (parameters, metrics,
dependencies, and outputs) and can later retrieve it as needed.

Check that `scores.json` reflects the metrics in the table above:
Check that `metrics.json` reflects the metrics in the table above:

`example-get-started/scores.json`{{open}}
`example-get-started/metrics.json`{{open}}

Once an experiment has been applied to the workspace, it is no different from
reproducing the result without `dvc exp run`. Let's make it persistent in our
regular pipeline by committing it in our Git branch:

```
git add dvc.lock params.yaml prc.json roc.json scores.json
git add dvc.lock params.yaml metrics.json train.log.csv
git commit -m "Preserve best Avg. Prec. experiment"
```{{execute}}
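
As an alternative to applying an experiment into the workspace and committing
it, DVC 2.x can also promote an experiment to its own Git branch; the branch
name below is illustrative:

```
# Persist the chosen experiment on a dedicated branch instead of the current one.
dvc exp branch exp-4 mnist-mlp-256
```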

6 changes: 3 additions & 3 deletions get-started/06-experiments/05-cleaning-up.md
@@ -5,7 +5,7 @@ experiments table:

```
dvc exp show --no-timestamp \
--include-params train.n_est \
--include-params model.mlp.units \
--no-pager
```{{execute}}

@@ -16,7 +16,7 @@ experiments from the previous _n_ commits:

```
dvc exp show -n 2 --no-timestamp \
--include-params train.n_est \
--include-params model.mlp.units \
--no-pager
```{{execute}}

@@ -27,7 +27,7 @@ Eventually, old experiments may clutter the experiments table.
```
dvc exp gc --workspace
dvc exp show -n 2 --no-timestamp \
--include-params train.n_est \
--include-params model.mlp.units \
--no-pager
```{{execute}}
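
Individual experiments can also be dropped by name rather than garbage-collected,
assuming a DVC version where `dvc exp remove` accepts experiment names:

```
# Remove specific experiments from those listed above (names are the ones set with -n).
dvc exp remove exp-1 exp-3
```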

15 changes: 14 additions & 1 deletion get-started/06-experiments/init.sh
@@ -4,12 +4,25 @@
PS1='\[\033[01;34m\]\w\[\033[00m\]$ \[\033[01;32m\]'
trap 'echo -ne "\033[00m"' DEBUG

export CONTAINER="dvcorg/doc-katacoda:start-experiments"

docker volume create example-get-started

if [ -e /root/example-get-started ] ; then
rm -rf /root/example-get-started
fi
ln -s /var/lib/docker/volumes/example-get-started/_data example-get-started

clear

:;: ###########################################
:;: INSTALLING CONTAINER FOR THE SCENARIO
:;: ###########################################
until [ -f /tmp/docker-ready ] ; do echo -n "." ; sleep 1 ; done
# until [ -f /tmp/docker-ready ] ; do echo -n "." ; sleep 1 ; done

echo "Starting: $CONTAINER"

docker run -d -it --name dvc -v example-get-started:/root/example-get-started "$CONTAINER"

clear

20 changes: 10 additions & 10 deletions get-started/06-experiments/install.sh
@@ -1,14 +1,14 @@
#!/bin/bash

export CONTAINER="dvcorg/doc-katacoda:start-experiments"

docker volume create example-get-started

if [ -e /root/example-get-started ] ; then
rm -rf /root/example-get-started
fi
ln -s /var/lib/docker/volumes/example-get-started/_data /root/example-get-started

docker run -d -it --name dvc -v example-get-started:/root/example-get-started "$CONTAINER"
# export CONTAINER="dvcorg/doc-katacoda:start-experiments"
#
# docker volume create example-get-started
#
# if [ -e /root/example-get-started ] ; then
# rm -rf /root/example-get-started
# fi
# ln -s /var/lib/docker/volumes/example-get-started/_data example-get-started
#
# docker run -d -it --name dvc -v example-get-started:/root/example-get-started "$CONTAINER"

touch /tmp/docker-ready
10 changes: 7 additions & 3 deletions get-started/06-experiments/intro.md
@@ -1,10 +1,14 @@
Experiments proliferate quickly in ML projects where there are many
parameters to tune or other permutations of the code. DVC 2.0 introduces a new
way to organize such projects and only keep what we ultimately need with `dvc
Experiments proliferate quickly in ML projects where there are many parameters
to tune or other permutations of the code. DVC 2.0 introduces a new way to
organize such projects and only keep what we ultimately need with `dvc
experiments`. DVC can track experiments for you so there's no need to commit
each one to Git. This way your repo doesn't become polluted with all of them.
You can discard experiments once they're no longer needed.

For this scenario we have a new project that uses TensorFlow and the venerable
[MNIST](http://yann.lecun.com/exdb/mnist/) dataset. The project has
two artificial neural networks with several hyperparameters.

> 📖 See [Experiment Management](https://dvc.org/doc/user-guide/experiment-management) for more
> information on DVC's approach.
