Skip to content

Commit

Permalink
added second draft for experiments
Browse files Browse the repository at this point in the history
  • Loading branch information
Emre Sahin committed Jul 5, 2021
1 parent c3b4a03 commit 66505f2
Showing 1 changed file with 228 additions and 0 deletions.
228 changes: 228 additions & 0 deletions content/docs/start/experiments-trail/experiments-2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,228 @@
---
title: 'Get Started: Experiments'
---

# Get Started with Experiments

<abbr>Experiments</abbr> proliferate quickly in ML projects where there are many
parameters to tune or other permutations of the code. We can organize such
projects and keep only what we ultimately need with `dvc experiments`. DVC can
track experiments for you so there's no need to commit each one to Git. This way
your repo doesn't become polluted with all of them. You can discard experiments
once they're no longer needed.

Previously, we learned how to tune [ML pipelines](/doc/start/data-pipelines) and
[compare the changes](/doc/start/metrics-parameters-plots). Let's further
increase the number of features in the `featurize` stage to see how it compares.

In this section, we will explore the basic features of DVC experiment management
with `[get-started-experiments]` project.

get-started-experiments: https://github.com/iterative/get-started-experiments

## Installing and Configuring the Project

These commands are run in the `[get-started-experiments]` project. You can run
the commands in this document after cloning the repository and installing the
requirements.

### Clone the project and create virtual environment

Please clone the project and create a virtual environment.

> We create a virtual environment to keep the libraries we use isolated from the
> rest of your system. This prevents version conflicts.
```dvc
$ git clone https://github.com/iterative/get-started-experiments -b pipeline-added
$ cd get-started-experiments
$ virtualenv .venv
$ . .venv/bin/activate
$ python -m pip install -r requirements.txt
```

### Get the data set

The repository you cloned doesn't contain the dataset. In order to get
`fashion-mnist.tar.gz` from the `dataset-registry`, we use `dvc pull` to update
the missing data files. `dvc pull` is used in DVC repositories to update the missing data dependencies.

```dvc
$ dvc pull
```

Then we extract this file that contains labeled images.

```dvc
$ tar -xvzf data/images.tar.gz --directory data/
```

</details>

## Running experiments

### Running with default parameters

The purpose of `dvc exp` commands is to run the pipeline for ephemeral
experiments. By _ephemeral_ we mean the experiments can be run without
committing parameter and dependency changes to Git. Instead the artifacts
produced for each experiment is tracked by DVC and persisted on demand.

Running the pipeline with default values requires only the command:

```dvc
$ dvc exp run
TK
```

It runs the pipeline starting from the basic dependencies and produces
`metrics.json` file for the default state.

<details>

### If you used `dvc repro` before

Earlier versions of DVC uses `dvc repro` to run the pipeline. If you already
have a DVC project, you may be used to `dvc repro`.

In DVC 2.0 `dvc exp run` supersedes `dvc repro`. Both of these commands run the
pipeline.

We use `dvc repro` to run the pipeline as found in the <abbr>workspace</abbr>.
All the parameters and dependencies are retrieved from the current workspace. It
doesn't use any special objects to track the experiments. When you have large
number of experiments that you don't want to commit into Git, it's better to use
`dvc exp run`. It allows to change the parameters quickly, can track the history
of artifacts and has facilities to compare these experiments easily.

`dvc repro` is still available to run the pipeline that don't need these extra
features.

</details>

### Running by setting parameters

Now let's do some more experimentation.

DVC allows to update the parameters defined in the pipeline without modifying
the files manually. We use this feature to set the convolutional units in
`train.py`.

```dvc
$ dvc exp run --set-param conv_units=24
TK
```

Note that the pipeline didn't run the earliest stage. Only the stages that
depend on the updated parameter and subsequent stages are run.

When you run `dvc exp run` with `--set-param`, it updates the parameter file. We
can see the effect of it by looking at the diff.

```dvc
$ git diff params.yaml
TK
```

### Run multiple experiments in parallel

Instead of running the experiments one-by-one, we can define them first to run
them in a batch. This is especially handy when you have long running
experiments.

We add experiments to the queue using the `--queue` option of `dvc exp run`. We
also use `-S` (`--set-param`) to set a value for the parameter.

```dvc
$ dvc exp run --queue -S conv_units=32
$ dvc exp run --queue -S conv_units=64
$ dvc exp run --queue -S conv_units=128
$ dvc exp run --queue -S conv_units=256
```

Next, run all (`--run-all`) queued experiments in parallel (using `--jobs`):

```dvc
$ dvc exp run --run-all --jobs 2
TK
```

## Comparing experiments

The pipeline is run several times with different parameters. To compare all of
these experiments, we use `dvc exp show`. This command presents the parameters
and metrics produced in experiments in a nicely formatted table.

```dvc
$ dvc exp show
```

TK

By default it shows all the parameters and the metrics along with the timestamp.
If you have large number of parameters, metrics or experiments, this may lead to
a cluttered view. You can limit the table to specific metrics, or parameters, or
hide the timestamp column with `--include-metrics`, `--include-params`, or
`--no-timestamp` options of the command, respectively.

```dvc
$ dvc exp show --no-timestamp --include-params conv_units --include-metrics acc
TK
```

## Persisting experiments

After selecting a experiments from the table, you can commit the hyperparameters
and other dependencies that produced this successful experiment to your Git
history.

`dvc exp apply` brings back all specific artifacts and parameters from the
experiment to the <abbr>workspace</abbr>.

```dvc
$ dvc exp apply
TK
```

We can see the changes in the repository and commit them to Git.

```dvc
$ git diff
$ git add .
$ git commit -m "Successful experiment"
```

### Preparing an experiments pipeline

At the beginning of this document, we assumed that there is already a configured DVC project to simplify the introduction. DVC experiments are a feature added in DVC 2.0 and requires a DVC pipeline is defined in the project. In this section we'll show how to configure a project to run DVC experiments. You can get detailed information about these commands in other sections of DVC documentation.

If DVC is not initialized before in the project, you can do so by:

```dvc
$ dvc init
```

DVC also requires commands to be run and their dependencies to be defined as stages. We use `dvc stage add` to add a stage and set its dependencies.

```dvc
$ dvc stage add
TK
```

Note that the parameters (added with `-p`) are in the default parameters file `params.yaml` and used in the code as normal, by reading the file. DVC only tracks the changes and updates them with `--set-param`.


## Go Further

You can continue to experiment with
[the project](https://github.com/iterative/get-started-experiments). Please see
the `README.md` file of the project for these. Don't forget to
[notify us](https://dvc.org/chat) if you happen to find good parameters.

There are many other features of `dvc exp`, like cleaning up the unused
experiments, sharing them without committing into Git or getting differences
between two experiments.

Please see the section on
[Experiment Management](/doc/user-guide/experiment-management) in the User's
Guide or `dvc exp` and subcommands in the Command Reference.

0 comments on commit 66505f2

Please sign in to comment.