-
Notifications
You must be signed in to change notification settings - Fork 394
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Emre Sahin
committed
Jul 5, 2021
1 parent
c3b4a03
commit 66505f2
Showing
1 changed file
with
228 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,228 @@ | ||
--- | ||
title: 'Get Started: Experiments' | ||
--- | ||
|
||
# Get Started with Experiments | ||
|
||
<abbr>Experiments</abbr> proliferate quickly in ML projects where there are many | ||
parameters to tune or other permutations of the code. We can organize such | ||
projects and keep only what we ultimately need with `dvc experiments`. DVC can | ||
track experiments for you so there's no need to commit each one to Git. This way | ||
your repo doesn't become polluted with all of them. You can discard experiments | ||
once they're no longer needed. | ||
|
||
Previously, we learned how to tune [ML pipelines](/doc/start/data-pipelines) and | ||
[compare the changes](/doc/start/metrics-parameters-plots). Let's further | ||
increase the number of features in the `featurize` stage to see how it compares. | ||
|
||
In this section, we will explore the basic features of DVC experiment management | ||
with `[get-started-experiments]` project. | ||
|
||
get-started-experiments: https://github.com/iterative/get-started-experiments | ||
|
||
## Installing and Configuring the Project | ||
|
||
These commands are run in the `[get-started-experiments]` project. You can run | ||
the commands in this document after cloning the repository and installing the | ||
requirements. | ||
|
||
### Clone the project and create virtual environment | ||
|
||
Please clone the project and create a virtual environment. | ||
|
||
> We create a virtual environment to keep the libraries we use isolated from the | ||
> rest of your system. This prevents version conflicts. | ||
```dvc | ||
$ git clone https://github.com/iterative/get-started-experiments -b pipeline-added | ||
$ cd get-started-experiments | ||
$ virtualenv .venv | ||
$ . .venv/bin/activate | ||
$ python -m pip install -r requirements.txt | ||
``` | ||
|
||
### Get the data set | ||
|
||
The repository you cloned doesn't contain the dataset. In order to get | ||
`fashion-mnist.tar.gz` from the `dataset-registry`, we use `dvc pull` to update | ||
the missing data files. `dvc pull` is used in DVC repositories to update the missing data dependencies. | ||
|
||
```dvc | ||
$ dvc pull | ||
``` | ||
|
||
Then we extract this file that contains labeled images. | ||
|
||
```dvc | ||
$ tar -xvzf data/images.tar.gz --directory data/ | ||
``` | ||
|
||
</details> | ||
|
||
## Running experiments | ||
|
||
### Running with default parameters | ||
|
||
The purpose of `dvc exp` commands is to run the pipeline for ephemeral | ||
experiments. By _ephemeral_ we mean the experiments can be run without | ||
committing parameter and dependency changes to Git. Instead the artifacts | ||
produced for each experiment is tracked by DVC and persisted on demand. | ||
|
||
Running the pipeline with default values requires only the command: | ||
|
||
```dvc | ||
$ dvc exp run | ||
TK | ||
``` | ||
|
||
It runs the pipeline starting from the basic dependencies and produces | ||
`metrics.json` file for the default state. | ||
|
||
<details> | ||
|
||
### If you used `dvc repro` before | ||
|
||
Earlier versions of DVC uses `dvc repro` to run the pipeline. If you already | ||
have a DVC project, you may be used to `dvc repro`. | ||
|
||
In DVC 2.0 `dvc exp run` supersedes `dvc repro`. Both of these commands run the | ||
pipeline. | ||
|
||
We use `dvc repro` to run the pipeline as found in the <abbr>workspace</abbr>. | ||
All the parameters and dependencies are retrieved from the current workspace. It | ||
doesn't use any special objects to track the experiments. When you have large | ||
number of experiments that you don't want to commit into Git, it's better to use | ||
`dvc exp run`. It allows to change the parameters quickly, can track the history | ||
of artifacts and has facilities to compare these experiments easily. | ||
|
||
`dvc repro` is still available to run the pipeline that don't need these extra | ||
features. | ||
|
||
</details> | ||
|
||
### Running by setting parameters | ||
|
||
Now let's do some more experimentation. | ||
|
||
DVC allows to update the parameters defined in the pipeline without modifying | ||
the files manually. We use this feature to set the convolutional units in | ||
`train.py`. | ||
|
||
```dvc | ||
$ dvc exp run --set-param conv_units=24 | ||
TK | ||
``` | ||
|
||
Note that the pipeline didn't run the earliest stage. Only the stages that | ||
depend on the updated parameter and subsequent stages are run. | ||
|
||
When you run `dvc exp run` with `--set-param`, it updates the parameter file. We | ||
can see the effect of it by looking at the diff. | ||
|
||
```dvc | ||
$ git diff params.yaml | ||
TK | ||
``` | ||
|
||
### Run multiple experiments in parallel | ||
|
||
Instead of running the experiments one-by-one, we can define them first to run | ||
them in a batch. This is especially handy when you have long running | ||
experiments. | ||
|
||
We add experiments to the queue using the `--queue` option of `dvc exp run`. We | ||
also use `-S` (`--set-param`) to set a value for the parameter. | ||
|
||
```dvc | ||
$ dvc exp run --queue -S conv_units=32 | ||
$ dvc exp run --queue -S conv_units=64 | ||
$ dvc exp run --queue -S conv_units=128 | ||
$ dvc exp run --queue -S conv_units=256 | ||
``` | ||
|
||
Next, run all (`--run-all`) queued experiments in parallel (using `--jobs`): | ||
|
||
```dvc | ||
$ dvc exp run --run-all --jobs 2 | ||
TK | ||
``` | ||
|
||
## Comparing experiments | ||
|
||
The pipeline is run several times with different parameters. To compare all of | ||
these experiments, we use `dvc exp show`. This command presents the parameters | ||
and metrics produced in experiments in a nicely formatted table. | ||
|
||
```dvc | ||
$ dvc exp show | ||
``` | ||
|
||
TK | ||
|
||
By default it shows all the parameters and the metrics along with the timestamp. | ||
If you have large number of parameters, metrics or experiments, this may lead to | ||
a cluttered view. You can limit the table to specific metrics, or parameters, or | ||
hide the timestamp column with `--include-metrics`, `--include-params`, or | ||
`--no-timestamp` options of the command, respectively. | ||
|
||
```dvc | ||
$ dvc exp show --no-timestamp --include-params conv_units --include-metrics acc | ||
TK | ||
``` | ||
|
||
## Persisting experiments | ||
|
||
After selecting a experiments from the table, you can commit the hyperparameters | ||
and other dependencies that produced this successful experiment to your Git | ||
history. | ||
|
||
`dvc exp apply` brings back all specific artifacts and parameters from the | ||
experiment to the <abbr>workspace</abbr>. | ||
|
||
```dvc | ||
$ dvc exp apply | ||
TK | ||
``` | ||
|
||
We can see the changes in the repository and commit them to Git. | ||
|
||
```dvc | ||
$ git diff | ||
$ git add . | ||
$ git commit -m "Successful experiment" | ||
``` | ||
|
||
### Preparing an experiments pipeline | ||
|
||
At the beginning of this document, we assumed that there is already a configured DVC project to simplify the introduction. DVC experiments are a feature added in DVC 2.0 and requires a DVC pipeline is defined in the project. In this section we'll show how to configure a project to run DVC experiments. You can get detailed information about these commands in other sections of DVC documentation. | ||
|
||
If DVC is not initialized before in the project, you can do so by: | ||
|
||
```dvc | ||
$ dvc init | ||
``` | ||
|
||
DVC also requires commands to be run and their dependencies to be defined as stages. We use `dvc stage add` to add a stage and set its dependencies. | ||
|
||
```dvc | ||
$ dvc stage add | ||
TK | ||
``` | ||
|
||
Note that the parameters (added with `-p`) are in the default parameters file `params.yaml` and used in the code as normal, by reading the file. DVC only tracks the changes and updates them with `--set-param`. | ||
|
||
|
||
## Go Further | ||
|
||
You can continue to experiment with | ||
[the project](https://github.com/iterative/get-started-experiments). Please see | ||
the `README.md` file of the project for these. Don't forget to | ||
[notify us](https://dvc.org/chat) if you happen to find good parameters. | ||
|
||
There are many other features of `dvc exp`, like cleaning up the unused | ||
experiments, sharing them without committing into Git or getting differences | ||
between two experiments. | ||
|
||
Please see the section on | ||
[Experiment Management](/doc/user-guide/experiment-management) in the User's | ||
Guide or `dvc exp` and subcommands in the Command Reference. |