This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

MLEM - DVC interaction update (#211)
* MLEM - DVC interaction update

* fix headers and stuff, write explanation on why we need this at all

* rewriting the scenario

* last fixes

Co-authored-by: Alexander Guschin <[email protected]>
igordertigor and aguschin authored Nov 3, 2022
1 parent 249c642 commit 6358cd9
Showing 2 changed files with 80 additions and 51 deletions.
96 changes: 45 additions & 51 deletions content/docs/user-guide/dvc.md
@@ -1,5 +1,17 @@
# Versioning MLEM objects with DVC

To use MLEM with Git and enable GitOps, we need to commit MLEM models to a Git
repository. While committing `.mlem` metafiles is easy, model binaries and
datasets are too heavy to store in Git. To fix that, we suggest using
[DVC](https://dvc.org). DVC
[stores objects in remote storage](https://dvc.org/doc/start/data-management/data-versioning),
allowing us to commit just pointers to them.

This page offers a short tutorial on how to use DVC with an existing MLEM
project. We will reorganize our example repo to showcase that.

## Setting things up

<details>

### ⚙️ Expand for setup instructions
@@ -10,7 +22,6 @@ If you want to follow along with this tutorial, you can use our
```shell
$ git clone https://github.com/iterative/example-mlem-get-started
$ cd example-mlem-get-started
$ git checkout 5-deploy-meta
```

Next let's create a Python virtual environment to cleanly install all the
@@ -24,57 +35,54 @@ $ pip install -r requirements.txt

</details>

Often it’s a bad idea to store binary files in Git, especially big ones. To
solve this MLEM can utilize [DVC](https://dvc.org/doc) capabilities to connect
external cloud storage for model and dataset versioning.

We will reorganize our example repo to use DVC.

## Setting up repo

First, let’s initialize DVC and add a remote (we will use a local one for easier
testing, but you can use whatever is available to you):
First, let’s initialize DVC and add a DVC remote (we will use a local one for
easier testing, but you can use whatever is available to you):

```cli
$ dvc init
$ dvc remote add myremote -d /tmp/dvcstore/
$ git add .dvc/config
```

[DVC Initialized](https://github.com/iterative/example-mlem-get-started/tree/7-dvc-dvc-init)

Now, we also need to set up MLEM so it knows to use DVC.

```cli
$ mlem config set core.storage.type dvc
✅ Set `storage.type` to `dvc` in repo .
```

Also, let’s add `.mlem` files to `.dvcignore` so that metafiles are ignored by
DVC.
After the initial configuration is done, we need to decide how we're going to
use MLEM with DVC:

1. We could manually add model binaries to version control. This scenario is
   covered in the [Versioning binaries manually](#versioning-binaries-manually)
   section below (use this option if you are new to DVC).
2. We could use
   [DVC Pipelines](https://dvc.org/doc/start/data-management/data-pipelines) to
   version model binaries automatically. DVC Pipelines are generally used to
   manage all stages of model creation (data cleaning, featurization, training,
   etc.). This case is covered below in
   [Using MLEM in DVC Pipeline](#using-mlem-in-dvc-pipeline).

## Versioning binaries manually

Let’s add `.mlem` files to `.dvcignore` so that metafiles are ignored by DVC.

```cli
$ echo "/**/?*.mlem" > .dvcignore
$ git add .dvcignore
```
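
To see which paths this pattern is meant to cover, here is a minimal Python
sketch that approximates the gitignore-style rule `/**/?*.mlem` with a regular
expression. The helper name `is_ignored_metafile` and the regex are
illustrative assumptions, not DVC's actual matcher. The point is only that `?`
requires at least one character before `.mlem`, so an entry named exactly
`.mlem` is not matched.

```python
import re

# Rough regex equivalent of the .dvcignore rule "/**/?*.mlem" (an
# approximation for illustration, not DVC's real matching code):
#   (?:[^/]+/)*  any number of parent directories, including none
#   [^/]+        "?*": at least one character that is not a slash
#   \.mlem       the literal ".mlem" suffix
MLEM_METAFILE = re.compile(r"^(?:[^/]+/)*[^/]+\.mlem$")


def is_ignored_metafile(path: str) -> bool:
    """Return True if a repo-relative POSIX path looks like a MLEM metafile."""
    return MLEM_METAFILE.match(path) is not None


if __name__ == "__main__":
    for path in ["models/rf.mlem", "rf.mlem", "models/rf", ".mlem"]:
        print(f"{path!r}: {is_ignored_metafile(path)}")
```

Under this approximation, `models/rf.mlem` is ignored by DVC (and therefore
left to Git), while the binary `models/rf` is not.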

Finally, we need to stop Git from keeping already indexed binaries.
We may need to stop Git from keeping already indexed binaries. For our example
repo, that would be:

```cli
$ git rm -r --cached models data
```

[Configured MLEM to work with DVC](https://github.com/iterative/example-mlem-get-started/tree/8-dvc-mlem-config)

## Saving objects

Next, let’s remove artifacts from Git and re-save them, so MLEM can use the new
storage for them. You don't need to change a single line of code:
Now we need to regenerate them:

```cli
$ git rm -r --cached models data
$ python train.py
```

@@ -95,34 +103,19 @@ $ git push
Now, you can load MLEM objects from your repo even though there are no actual
binaries stored in Git. MLEM will know to use DVC to load them.
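
For the manual scenario, a quick sanity check before pushing can be sketched in
Python. It assumes the layout used in this tutorial: each metafile
`<name>.mlem` sits next to its binary `<name>`, and `dvc add <name>` has
created a pointer file `<name>.dvc` (as with `models/rf.dvc`). The function
name `untracked_binaries` is hypothetical and is not part of MLEM or DVC.

```python
from pathlib import Path


def untracked_binaries(project_root: str) -> list[str]:
    """List binaries that have a MLEM metafile but no DVC pointer file.

    Assumed layout: models/rf.mlem (metafile), models/rf (binary),
    models/rf.dvc (pointer created by `dvc add models/rf`).
    """
    root = Path(project_root)
    missing = []
    for metafile in sorted(root.rglob("*.mlem")):
        # Skip directories and anything named exactly ".mlem"
        if not metafile.is_file() or metafile.name == ".mlem":
            continue
        binary = metafile.with_suffix("")  # models/rf.mlem -> models/rf
        pointer = binary.with_name(binary.name + ".dvc")  # -> models/rf.dvc
        if not pointer.exists():
            missing.append(binary.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    print(untracked_binaries("."))
```

An empty list suggests every binary with a metafile also has a pointer to
commit. The check does not apply to the pipeline scenario, where outputs are
recorded in `dvc.lock` rather than in `.dvc` files.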

[Switch to DVC](https://github.com/iterative/example-mlem-get-started/tree/9-dvc-save-models)
## Using MLEM in DVC Pipeline

# Using MLEM in DVC Pipeline

[DVC pipelines](https://dvc.org/doc/start/data-management/pipelines) are a
useful DVC mechanism to build data pipelines, in which you can process your data
and train your model. You may already be training your ML models in them and
want to start using MLEM to save those models.
[DVC pipelines](https://dvc.org/doc/start/data-management/pipelines) are a
mechanism to build data pipelines, in which you can process your data and train
your model. You may already be training your ML models in them and want to start
using MLEM to save those models.

MLEM can be easily plugged into existing DVC pipelines. You'll need to mark
`.mlem` files as `cache: false`
[outputs](https://dvc.org/doc/user-guide/project-structure/pipelines-files#output-subfields)
of a pipeline stage.

## Example

Let's continue using the example from above. First, let's stop tracking the
artifact `models/rf` in DVC and stop ignoring MLEM files in `.dvcignore`.

```dvc
$ dvc remove models/rf.dvc
# we can delete the file since it contains no records
# besides the one we added above:
$ git rm .dvcignore
```

Now let's create a simple pipeline to train your model:
Let's create a simple pipeline to train your model:

```yaml
# dvc.yaml
@@ -137,9 +130,8 @@ stages:
cache: false
```
The binary was already in, so there's no need to add it again. For the metafile,
we've added two rows that specify `cache: false` to track it with DVC while
storing it in Git.
We mark the metafile with `cache: false` so that the DVC pipeline is aware of
it, while we still commit it to Git.
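
The effect of `cache: false` can be illustrated with a small Python sketch that
splits a stage's outputs into those stored by DVC and those left for Git. The
`stage` dict below is a hypothetical stage written as a YAML parser would
return it; the stage command and the helper name `split_outputs` are
assumptions made for illustration, and this is not how DVC itself is
implemented.

```python
# Hypothetical stage, written as the dict a YAML parser would produce for one
# entry under `stages:` in dvc.yaml. The command is an assumption.
stage = {
    "cmd": "python train.py",
    "outs": [
        "models/rf",                           # plain output: cached by DVC
        {"models/rf.mlem": {"cache": False}},  # metafile: left for Git
    ],
}


def split_outputs(stage: dict) -> tuple[list[str], list[str]]:
    """Split stage outputs into (stored by DVC, committed to Git)."""
    dvc_cached, git_tracked = [], []
    for out in stage.get("outs", []):
        if isinstance(out, str):
            path, options = out, {}
        else:
            # An output with sub-fields is a single-key mapping {path: options}
            (path, options), = out.items()
            options = options or {}
        # Outputs are cached by DVC unless `cache: false` is set
        if options.get("cache", True):
            dvc_cached.append(path)
        else:
            git_tracked.append(path)
    return dvc_cached, git_tracked


if __name__ == "__main__":
    print(split_outputs(stage))
```

Here the binary `models/rf` ends up in the DVC cache, while
`models/rf.mlem` stays a regular file that is committed to Git.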

You can verify everything is working by running the pipeline:

@@ -151,5 +143,7 @@ Use `dvc push` to send your updates to remote storage.
```

Now DVC will take care of storing binaries, so you'll need to commit the model
metafile (`models/rf.mlem`) and `dvc.lock` only. Learn more about
[DVC](https://dvc.org/doc) and how it can be useful for training your ML models.
metafile (`models/rf.mlem`) and `dvc.lock` only.

Learn more about [DVC](https://dvc.org/doc) and how it can be useful for
training your ML models.
35 changes: 35 additions & 0 deletions yarn.lock
@@ -2536,6 +2536,15 @@
"@sentry/utils" "7.13.0"
tslib "^1.9.3"

"@sentry/core@7.17.3":
version "7.17.3"
resolved "https://registry.yarnpkg.com/@sentry/core/-/core-7.17.3.tgz#2b45c0507f1ef7018335b9bb61ed6b3f16accfad"
integrity sha512-PSboa9aOVnvZU+C6/shKlHUA7zjAl6z5BKRHF8mEljEYql6bh0HfJJKXtBHMz1sWnmzMa/qABSKLpnP5ZQlJNw==
dependencies:
"@sentry/types" "7.17.3"
"@sentry/utils" "7.17.3"
tslib "^1.9.3"

"@sentry/gatsby@^7.13.0":
version "7.13.0"
resolved "https://registry.yarnpkg.com/@sentry/gatsby/-/gatsby-7.13.0.tgz#50965623d43dc437704660a6cbef8ac2927043fc"
@@ -2556,6 +2565,19 @@
"@sentry/utils" "7.13.0"
tslib "^1.9.3"

"@sentry/node@^7.12.1":
version "7.17.3"
resolved "https://registry.yarnpkg.com/@sentry/node/-/node-7.17.3.tgz#d817e9ca53331b3c192d997ce0e570fb51c1eda9"
integrity sha512-kBmj5GiE0BWQ1CqnJN3bOOmaNNvS+HKb9nPic+QloPnH6xDFVUcmx774s3qjtnyLOQTzPpy3vXCA15rYflNJBQ==
dependencies:
"@sentry/core" "7.17.3"
"@sentry/types" "7.17.3"
"@sentry/utils" "7.17.3"
cookie "^0.4.1"
https-proxy-agent "^5.0.0"
lru_map "^0.3.3"
tslib "^1.9.3"

"@sentry/react@7.13.0":
version "7.13.0"
resolved "https://registry.yarnpkg.com/@sentry/react/-/react-7.13.0.tgz#64fa5a2b944c977f75626c6208afa3478c13714c"
@@ -2582,6 +2604,11 @@
resolved "https://registry.yarnpkg.com/@sentry/types/-/types-7.13.0.tgz#398e33e5c92ea0ce91e2c86e3ab003fe00c471a2"
integrity sha512-ttckM1XaeyHRLMdr79wmGA5PFbTGx2jio9DCD/mkEpSfk6OGfqfC7gpwy7BNstDH/VKyQj/lDCJPnwvWqARMoQ==

"@sentry/types@7.17.3":
version "7.17.3"
resolved "https://registry.yarnpkg.com/@sentry/types/-/types-7.17.3.tgz#ec66ea7b6881ae243255546680722488e7ff23bf"
integrity sha512-+buEJo/4TKErjwF8Tq3XXKFZx4Utpvqs52e7i7Sur2qfyBNwRgBILceQvdnzw86JNZT2myeYmrfVbsaxAk7ilA==

"@sentry/utils@7.13.0":
version "7.13.0"
resolved "https://registry.yarnpkg.com/@sentry/utils/-/utils-7.13.0.tgz#0d47a9278806ece78ba3a83c7dbebce817462759"
@@ -2590,6 +2617,14 @@
"@sentry/types" "7.13.0"
tslib "^1.9.3"

"@sentry/utils@7.17.3":
version "7.17.3"
resolved "https://registry.yarnpkg.com/@sentry/utils/-/utils-7.17.3.tgz#aafa67ed372f00be2e1bb490fa62d9d2d06a4c2f"
integrity sha512-Sd7BwVn6IClvaXbZaj/LnEcrMm8yjQtZkTVSrM2Vlv1lLeaH61JxSAFU6QntF+f/cCfZ7wSdNhWOfW3qZJ7t3Q==
dependencies:
"@sentry/types" "7.17.3"
tslib "^1.9.3"

"@sentry/webpack-plugin@1.19.0":
version "1.19.0"
resolved "https://registry.yarnpkg.com/@sentry/webpack-plugin/-/webpack-plugin-1.19.0.tgz#2b134318f1552ba7f3e3f9c83c71a202095f7a44"