Commit

Merge branch 'master' into short-alias-csv-json-md

shcheklein authored Oct 20, 2021
2 parents 15b9294 + 96cb495 commit a63d213
Showing 177 changed files with 10,103 additions and 10,882 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -8,7 +8,7 @@ defaults: &defaults
working_directory: ~/repo
docker:
# Specify the version you desire here.
- image: circleci/node:12
- image: circleci/node:16

# Specify service dependencies here if necessary.
# CircleCI maintains a library of pre-built images,
2 changes: 1 addition & 1 deletion .github/workflows/download-link-check-schedule.yml
@@ -23,4 +23,4 @@ jobs:
with:
title: DVC Download Link Checker Report
content-filepath: ./lychee/out.md
labels: website, automated issue
labels: website, link-checker
34 changes: 17 additions & 17 deletions .github/workflows/update.yaml
@@ -6,20 +6,20 @@ jobs:
update:
runs-on: ubuntu-18.04
steps:
- uses: actions/checkout@v2
- name: Update
id: update
shell: bash
run: |
url=https://api.github.com/repos/iterative/dvc/releases/latest
version=$(curl --silent $url | jq -r .tag_name)
path=src/components/DownloadButton/index.tsx
sed -i "s/^const VERSION = .*$/const VERSION = \`$version\`/g" $path
echo "::set-output name=changes::$(git diff)"
echo "::set-output name=version::$version"
- name: Create PR
if: ${{ steps.update.outputs.changes != '' }}
uses: peter-evans/create-pull-request@v3
with:
commit-message: dvc ${{ steps.update.outputs.version }}
title: dvc ${{ steps.update.outputs.version }}
- uses: actions/checkout@v2
- name: Update
id: update
shell: bash
run: |
url=https://api.github.com/repos/iterative/dvc/releases/latest
version=$(curl --silent $url | jq -r .tag_name)
path=src/components/DownloadButton/index.tsx
sed -i "s/^const VERSION = .*$/const VERSION = \`$version\`/g" $path
echo "::set-output name=changes::$(git diff)"
echo "::set-output name=version::$version"
- name: Create PR
if: ${{ steps.update.outputs.changes != '' }}
uses: peter-evans/create-pull-request@v3
with:
commit-message: dvc ${{ steps.update.outputs.version }}
title: dvc ${{ steps.update.outputs.version }}
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright {yyyy} {name of copyright owner}
Copyright 2018-2021 Iterative, Inc.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
10 changes: 9 additions & 1 deletion config/prismjs/dvc-commands.js
@@ -45,8 +45,16 @@ module.exports = [
'gc',
'freeze',
'fetch',
'exp show',
'exp run',
'exp remove',
'exp push',
'exp pull',
'exp gc',
'exp diff',
'exp branch',
'exp apply',
'exp',
'experiments',
'doctor',
'diff',
'destroy',
3 changes: 3 additions & 0 deletions content/blog/2021-08-24-transfer-learning-experiments.md
@@ -14,6 +14,7 @@ tags:
- Experiments
- Reproducibility
- DVC
- Pre-trained Models
---

## Intro
@@ -26,6 +27,8 @@ or even people. This is called
[transfer learning](https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a)
and it can save a lot of time on developing a model from scratch.

https://youtu.be/S3Hm_BPLie0

For us to take advantage of transfer learning, we can use fine-tuning to adopt
the model to our new problem. In many cases, we start by replacing the last
layer of the model. With the AlexNet example, this might mean the last layer was
262 changes: 262 additions & 0 deletions content/blog/2021-10-05-adding-data-to-build-a-more-generic-model.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,262 @@
---
title: Adding Data to Build a More Generic Model
date: 2021-10-05
description: >
You can easily make changes to your dataset using DVC to handle data
versioning. This will let you extend your models to handle more generic data.
descriptionLong: >
When you have an existing model trained for one problem, you might want to
extend it to handle other problems. When you have data versioning, it's easier
to see which data additions make your model better or worse and then you can
see where to make improvements.
picture: 2021-10-05/cats-and-dogs.png
pictureComment: Adding more data to your dataset for a more generic model
author: milecia_mcgregor
commentsUrl: https://discuss.dvc.org/t/extending-models-with-more-data/881
tags:
- MLOps
- DVC
- Git
- Experiments
- Data Versioning
---

## Intro

You might be in the middle of training a model when the business problem
shifts. Now you have a model that has been going through the training process
with a specific dataset, and you need to make it more generic.

Your model has likely learned something that will still be useful on the new
dataset, so you might not have to restart the entire training process. We'll
walk through an example of updating a pre-trained model to use a broader
dataset with DVC. By the end, you should see how to handle this change quickly
and start running new experiments to get a more generic model.

## The original pre-trained model

For this post, we'll be making a more generic image classifier by taking the
original dataset with bees and ants and adding cats and dogs to it. You can
clone [this GitHub repo](https://github.com/iterative/pretrained-model-demo) to
get the current bees and ants model and check out
[this post](https://dvc.org/blog/transfer-learning-experiments) on how we
experimented with both AlexNet and SqueezeNet to build this model.

We're starting from our current bees and ants model and extending it to
classify dogs and cats as well. First, we'll add some cats and dogs data to our
validation data and run some experiments with the current model to see how it
performs on more generic data.

Then we'll add the cats and dogs data to the training data and watch how the
model improves as we run experiments.

## Updating the dataset with DVC

To add the new cats and dogs dataset to the project, we'll use this DVC command.

```dvc
$ dvc get https://github.com/iterative/dataset-registry blog/cats-dogs
```

This downloads a sample dataset with images of cats and dogs. You can use this
command to download files or directories that are tracked by DVC or Git. This
command can be used from anywhere in the file system, as long as DVC is
installed.

This creates a new directory called `./cats-dogs/data/`, downloaded from the
DVC remote, containing images of cats and dogs. Now we can gradually add the
new data to the existing data.

We'll start by moving the `val` data for `cats` and `dogs` from the
`./cats-dogs/data/` directory to the corresponding directory in
`data/hymenoptera_data`.
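
As a rough sketch, the moves might look something like this (the exact class
folder layout inside `cats-dogs/data/` is an assumption here, so check your
local copy before running anything):

```dvc
$ mv cats-dogs/data/val/cats data/hymenoptera_data/val/cats
$ mv cats-dogs/data/val/dogs data/hymenoptera_data/val/dogs
```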

_Just a quick note, cats and dogs don't really belong in the `hymenoptera`
directory since that's specific to ants and bees, but it's the easiest and
fastest way to add the data for this tutorial._

With this new data in place, we can start training our model.

## Running new experiments with generic data

With the updated data, let's run an experiment on the model and see how good the
results are. To run a new experiment, open your terminal and make sure you have
a virtual environment activated. Then run this command:

```dvc
$ dvc exp run
```

Once the training epochs are finished, run the following command.

```dvc
$ dvc exp show --no-timestamp \
--include-metrics step,acc,val_acc,loss,val_loss \
--include-params lr,momentum
```

The `--no-timestamp` flag hides the timestamps from the table. The
`--include-metrics` option lets us choose which metrics we want to show in the
table, and the `--include-params` option does the same for hyperparameters.
This gives us a table that's easier to read quickly.

```dvctable
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ neutral:**Experiment** ┃ metric:**step** ┃ metric:**acc** ┃ metric:**val_acc** ┃ metric:**loss** ┃ metric:**val_loss** ┃ param:**lr** ┃ param:**momentum** ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ **workspace** │ **3** │ **0.86885** │ **0.46** │ **0.31573** │ **3.7067** │ **0.001** │ **0.09** │
│ **data-change** │ **-** │ **-** │ **-** │ **-** │ **-** │ **0.001** │ **0.09** │
│ │ ╓ 3b3a2a2 [exp-23593] │ 3 │ 0.86885 │ 0.46 │ 0.31573 │ 3.7067 │ 0.001 │ 0.09 │
│ │ ╟ 93d015d │ 2 │ 0.83197 │ 0.41333 │ 0.36851 │ 3.4259 │ 0.001 │ 0.09 │
│ │ ╟ d474c42 │ 1 │ 0.79918 │ 0.43333 │ 0.46612 │ 3.286 │ 0.001 │ 0.09 │
│ ├─╨ 1582b4b │ 0 │ 0.52869 │ 0.39 │ 0.94102 │ 2.5967 │ 0.001 │ 0.09 │
└─────────────────────────┴──────┴─────────┴─────────┴─────────┴──────────┴───────┴──────────┘
```

You'll notice that the validation accuracy is really low. That's because the
training metrics are based on bees and ants while the validation metrics are
based on bees, ants, cats, and dogs. If we looked at the validation metrics by
class, they'd likely be better for bees and ants than cats and dogs.

That means we should probably add more data to the training dataset.

## Adding the cats data to the training dataset

Let's add the `train` data for `cats` to the corresponding directory in
`data/hymenoptera_data` and run another experiment with this new data. One
important thing to note here is that we're using checkpoints in our
experiments. That's how we get the metrics for each training epoch.
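
As for the data move itself, it mirrors what we did for the validation split;
roughly (again, the exact folder name is an assumption):

```dvc
$ mv cats-dogs/data/train/cats data/hymenoptera_data/train/cats
```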

If we want to run a fresh experiment that doesn't resume training from the last
epoch, we need to reset our experiment. That's what we're going to do with this
command.

```dvc
$ dvc exp run --reset
```

This will reset all of the existing checkpoints and execute the training
script. Once it's finished, let's take a look at the metrics table with the
same command we ran last time.

```dvc
$ dvc exp show --no-timestamp \
--include-metrics step,acc,val_acc,loss,val_loss \
--include-params lr,momentum
```

Now you'll have a table that shows both experiments and you can see how much
better the new one did with the `cats` data added.

```dvctable
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ neutral:**Experiment** ┃ metric:**step** ┃ metric:**acc** ┃ metric:**val_acc** ┃ metric:**loss** ┃ metric:**val_loss** ┃ param:**lr** ┃ param:**momentum** ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ **workspace** │ **3** │ **0.91389** │ **0.87** │ **0.20506** │ **0.66306** │ **0.001** │ **0.09** │
│ **data-change** │ **-** │ **-** │ **-** │ **-** │ **-** │ **0.001** │ **0.09** │
│ │ ╓ 9405575 [exp-54e8a] │ 3 │ 0.91389 │ 0.87 │ 0.20506 │ 0.66306 │ 0.001 │ 0.09 │
│ │ ╟ 856d80f │ 2 │ 0.90215 │ 0.87333 │ 0.27204 │ 0.61631 │ 0.001 │ 0.09 │
│ │ ╟ 23dc98f │ 1 │ 0.87671 │ 0.86 │ 0.35964 │ 0.61713 │ 0.001 │ 0.09 │
│ ├─╨ 99a3c34 │ 0 │ 0.71429 │ 0.82 │ 0.67674 │ 0.62798 │ 0.001 │ 0.09 │
│ │ ╓ 3b3a2a2 [exp-23593] │ 3 │ 0.86885 │ 0.46 │ 0.31573 │ 3.7067 │ 0.001 │ 0.09 │
│ │ ╟ 93d015d │ 2 │ 0.83197 │ 0.41333 │ 0.36851 │ 3.4259 │ 0.001 │ 0.09 │
│ │ ╟ d474c42 │ 1 │ 0.79918 │ 0.43333 │ 0.46612 │ 3.286 │ 0.001 │ 0.09 │
│ ├─╨ 1582b4b │ 0 │ 0.52869 │ 0.39 │ 0.94102 │ 2.5967 │ 0.001 │ 0.09 │
└─────────────────────────┴──────┴─────────┴─────────┴─────────┴──────────┴───────┴──────────┘
```

There's another way you can look at the difference between the model before we
added the `cats` data and after. If you run this in your terminal, you'll get a
plot comparing the two experiments.

```dvc
$ dvc plots diff exp-23593 exp-54e8a
```

The `exp-23593` and `exp-54e8a` values are the IDs of the experiments you want
to compare. This generates a `dvc_plots` directory in your project, and that's
where you'll find the `index.html` file to open in your browser.
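
For example, on macOS you could open it straight from the terminal (on most
Linux desktops, `xdg-open` does the same job):

```dvc
$ open dvc_plots/index.html
```

You'll see something similar to this.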

![plots comparing the accuracy, validation accuracy, loss, and validation loss for all epochs of each experiment](2021-10-05/with-cats-data.png)

There's a huge difference in the accuracy of our model after we've added this
additional data. Let's see if we can make it even better by adding the `dogs`
data.

## Adding the dogs data to the training dataset

We'll add the `train` data for `dogs` to the corresponding directory in
`data/hymenoptera_data`, just like we did for the `cats` data. Now we can run a
new experiment with all of the new data included.
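Roughly (again, the exact folder name inside `cats-dogs/data/` is an
assumption):

```dvc
$ mv cats-dogs/data/train/dogs data/hymenoptera_data/train/dogs
```

We'll still need to reset the experiment like before, so run the following
command.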

```dvc
$ dvc exp run --reset
```

Once the training epochs are finished, we can take one more look at that metrics
table.

```dvc
$ dvc exp show --no-timestamp \
--include-metrics step,acc,val_acc,loss,val_loss \
--include-params lr,momentum
```

Now we'll have all three experiments to compare.

```dvctable
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ neutral:**Experiment** ┃ metric:**step** ┃ metric:**acc** ┃ metric:**val_acc** ┃ metric:**loss** ┃ metric:**val_loss** ┃ param:**lr** ┃ param:**momentum** ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ **workspace** │ **3** │ **0.8795** │ **0.90667** │ **0.29302** │ **0.25752** │ **0.001** │ **0.09** │
│ **data-change** │ **-** │ **-** │ **-** │ **-** │ **-** │ **0.001** │ **0.09** │
│ │ ╓ c20220f [exp-82f70] │ 3 │ 0.8795 │ 0.90667 │ 0.29302 │ 0.25752 │ 0.001 │ 0.09 │
│ │ ╟ fcb5a0b │ 2 │ 0.85915 │ 0.92333 │ 0.38274 │ 0.25257 │ 0.001 │ 0.09 │
│ │ ╟ 3768821 │ 1 │ 0.80751 │ 0.84667 │ 0.47681 │ 0.40228 │ 0.001 │ 0.09 │
│ ├─╨ 7e1b8fb │ 0 │ 0.64632 │ 0.84 │ 0.87301 │ 0.46744 │ 0.001 │ 0.09 │
│ │ ╓ 9405575 [exp-54e8a] │ 3 │ 0.91389 │ 0.87 │ 0.20506 │ 0.66306 │ 0.001 │ 0.09 │
│ │ ╟ 856d80f │ 2 │ 0.90215 │ 0.87333 │ 0.27204 │ 0.61631 │ 0.001 │ 0.09 │
│ │ ╟ 23dc98f │ 1 │ 0.87671 │ 0.86 │ 0.35964 │ 0.61713 │ 0.001 │ 0.09 │
│ ├─╨ 99a3c34 │ 0 │ 0.71429 │ 0.82 │ 0.67674 │ 0.62798 │ 0.001 │ 0.09 │
│ │ ╓ 3b3a2a2 [exp-23593] │ 3 │ 0.86885 │ 0.46 │ 0.31573 │ 3.7067 │ 0.001 │ 0.09 │
│ │ ╟ 93d015d │ 2 │ 0.83197 │ 0.41333 │ 0.36851 │ 3.4259 │ 0.001 │ 0.09 │
│ │ ╟ d474c42 │ 1 │ 0.79918 │ 0.43333 │ 0.46612 │ 3.286 │ 0.001 │ 0.09 │
│ ├─╨ 1582b4b │ 0 │ 0.52869 │ 0.39 │ 0.94102 │ 2.5967 │ 0.001 │ 0.09 │
└─────────────────────────┴──────┴─────────┴─────────┴─────────┴──────────┴───────┴──────────┘
```

These results make sense for the experiments we've run. We're paying attention
to the validation accuracy here because this gives us a fair comparison of
what's happening as we add more data.

The first experiment's training metrics are for bees and ants. The second
experiment's training metrics are for bees, ants, and cats. And the third
experiment's training metrics are for all four classes. So the training
metrics aren't directly comparable across experiments.

We can look at a comparison across the experiments: the original run, the run
with just the `cats` data added, and the run with both the `cats` and `dogs`
data.

```dvc
$ dvc plots diff exp-23593 exp-54e8a exp-82f70
```

![plot of differences between model with just cats data and model with both cats and dogs data](2021-10-05/with-cats-and-dogs-data.png)

The results line up with what we'd expect for the validation metrics, given how
we added the data to the training set. Now you can keep running experiments
until your model is tuned the way you need it!

## Conclusion

When you want to change datasets quickly and start tracking how they affect
your model, using a DVC remote makes it easy to do so on different computers.
You'll be able to quickly upload and download GBs of data and see how changes
affect individual experiments.
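
As a minimal sketch (the remote name and the S3 bucket below are made up; any
DVC-supported storage works the same way):

```dvc
$ dvc remote add -d storage s3://my-bucket/dvcstore
$ dvc push    # upload the tracked data from this machine
$ dvc pull    # download it on another machine
```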

If you need help with anything DVC or CML, make sure to
[join our Discord community](https://discord.com/invite/dvwXA2N)! We're always
answering questions and having good conversations with everybody that shows up.