DVC makes it easy to iterate on your project using Git commits, tags, or branches. You can try different ideas quickly by tuning parameters, compare their performance with metrics, and visualize them with plots.
First, let's see the mechanism DVC provides to capture values for these experiment attributes. Let's add a final evaluation stage to our pipeline:
$ dvc run -n evaluate \
          -d src/evaluate.py -d model.pkl -d data/features \
          -M scores.json \
          --plots-no-cache prc.json \
          python src/evaluate.py model.pkl \
                 data/features scores.json prc.json
DVC generates a new stage in the dvc.yaml file:
evaluate:
  cmd: python src/evaluate.py model.pkl data/features ...
  deps:
    - data/features
    - model.pkl
    - src/evaluate.py
  metrics:
    - scores.json:
        cache: false
  plots:
    - prc.json:
        cache: false
The biggest difference from previous stages in our pipeline is in the two new sections: metrics and plots. These are used to mark files containing experiment "telemetry". Metrics files contain simple numeric values (e.g. AUC), and plots files contain matrices and data series (e.g. ROC curves or model loss plots) that are meant to be visualized and compared. cache: false means that these files are small enough to be versioned directly with Git.
evaluate.py writes the model's AUC value to scores.json, which is marked as a metrics file with -M:
{ "auc": 0.57313829 }
It also writes precision, recall, and thresholds arrays (obtained using precision_recall_curve) into the plots file prc.json:
{
    "prc": [
        { "precision": 0.021473008227975116, "recall": 1.0, "threshold": 0.0 },
        ...,
        { "precision": 1.0, "recall": 0.009345794392523364, "threshold": 0.64 }
    ]
}
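The exact contents of src/evaluate.py are beyond the scope of this section, but as a rough, self-contained sketch, the pattern for producing both files looks something like this (the synthetic dataset and classifier below are illustrative stand-ins, not the project's actual features or model.pkl):

import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the pipeline's real features and model.pkl.
x, y = make_classification(n_samples=1000, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(x_train, y_train)
predictions = model.predict_proba(x_test)[:, 1]

# Metrics file: a small JSON file with simple numeric values.
with open("scores.json", "w") as fd:
    json.dump({"auc": roc_auc_score(y_test, predictions)}, fd)

# Plots file: a JSON data series that DVC can render and compare.
precision, recall, thresholds = precision_recall_curve(y_test, predictions)
with open("prc.json", "w") as fd:
    json.dump(
        {
            "prc": [
                {"precision": float(p), "recall": float(r), "threshold": float(t)}
                for p, r, t in zip(precision, recall, thresholds)
            ]
        },
        fd,
    )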
DVC doesn't force you to use any specific file names, or even a particular format or structure for metrics or plots files; it's pretty much user- and case-defined. Please refer to dvc metrics and dvc plots for more details.
Let's save this experiment, so we can compare it later:
$ git add scores.json prc.json
$ git commit -a -m "Create evaluation stage"
Later we will see how these and other features can be used to compare and visualize different experiment iterations. For now, let's see how to capture another important piece of information that is useful for comparing experiments: parameters.
It's pretty common for data science pipelines to include configuration files that define adjustable parameters to train a model, do pre-processing, etc. DVC provides a mechanism for stages to depend on the values of specific sections of such a config file (YAML or JSON formats are supported).
Luckily, we should already have a stage with parameters in dvc.yaml:
featurize:
  cmd: python src/featurization.py data/prepared data/features
  deps:
    - data/prepared
    - src/featurization.py
  params:
    - featurize.max_features
    - featurize.ngrams
  outs:
    - data/features
The featurize stage was created with this dvc run command. Notice the argument sent to the -p option (short for --params):
$ dvc run -n featurize \
          -p featurize.max_features,featurize.ngrams \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features
The params section defines the parameter dependencies of the featurize stage. By default, DVC reads those values (featurize.max_features and featurize.ngrams) from a params.yaml file. But as with metrics and plots, parameter file names and structure can also be user- and case-defined.
This is what our params.yaml file looks like:
prepare:
  split: 0.20
  seed: 20170428

featurize:
  max_features: 500
  ngrams: 1

train:
  seed: 20170428
  n_estimators: 50
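How a stage script consumes these values is up to you. As an illustrative sketch (the details of the real src/featurization.py may differ), a script typically loads its section of params.yaml and passes the values along, for instance to scikit-learn's CountVectorizer:

import yaml
from sklearn.feature_extraction.text import CountVectorizer

# Load only this stage's section of the parameters file.
with open("params.yaml") as fd:
    params = yaml.safe_load(fd)["featurize"]

# DVC tracks featurize.max_features and featurize.ngrams as dependencies,
# so changing them in params.yaml invalidates this stage on `dvc repro`.
vectorizer = CountVectorizer(
    max_features=params["max_features"],
    ngram_range=(1, params["ngrams"]),
)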
We are definitely not happy with the AUC value we got so far! Let's tune the parameters and run a new experiment. Edit the params.yaml file to use bigrams and increase the number of features:
 featurize:
-  max_features: 500
-  ngrams: 1
+  max_features: 1500
+  ngrams: 2
And the beauty of dvc.yaml is that all you need to do now is run:
$ dvc repro
It'll analyze the changes, use the existing cache of previous runs, and execute only the commands that are needed to produce the new results (model, metrics, plots).
The same logic applies to other possible experiment adjustments (editing source code, updating datasets, etc.): you make the changes, use dvc repro, and DVC runs what needs to be run.
Finally, we are now ready to compare everything! DVC has a few commands to see metrics and parameter changes, and to visualize plots, for one or more experiments. Let's compare the current "bigrams" run with the last committed "baseline" iteration:
$ dvc params diff
Path         Param                   Old    New
params.yaml  featurize.max_features  500    1500
params.yaml  featurize.ngrams        1      2
dvc params diff can show how params in the workspace differ from those in the last commit.
dvc metrics diff does the same for metrics:
$ dvc metrics diff
Path         Metric    Value    Change
scores.json  auc       0.61314  0.07139
And finally, we can compare precision-recall curves with a single command!
$ dvc plots diff
file:///Users/dvc/example-get-started/plots.html
All these commands also accept Git revisions (commits, tags, branch names) to compare against. This is a powerful mechanism for navigating experiments: seeing the history, picking the best iterations, and so on.
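For example, once the "bigrams" run is committed and tagged (the tag names below are hypothetical), two past iterations can be compared directly:
$ dvc metrics diff baseline-experiment bigrams-experiment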