This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

WIP: Docs draft for integration with DVC #323

Open
wants to merge 17 commits into
base: main
from
178 changes: 178 additions & 0 deletions content/docs/gto/get-started-dvc.md
@@ -0,0 +1,178 @@
# Get Started DVC

Contributor

@omesser omesser Mar 9, 2023

HL:
Suggest to add sub-headers / sections
And structure for this page will be something like this:

## Defining artifact types
... example of a model, defining in different ways
... mentioning that in a similar way you can define other types, e.g. `type: artifact` or similar

## Browsing Artifacts
.. studio
... dvc ls

## Additional artifact metadata
...

## Registered artifact versions
...
## Working with artifacts in CI
...
## Restricting which types are allowed
...

To leverage concepts of Model and Data Registries in a more explicit way, you
can denote the `type` of each output. This will let you browse models and data
separately, address them by `name` in `dvc get`, and eventually, see them in DVC
Studio.

Let's start by marking an artifact as data or a model. To do so, add it to a
top-level section called `artifacts` in your `dvc.yaml`:

```yaml
# dvc.yaml
artifacts: # artifact ID (name)
  def-detector: # just like with plots, this could be a path or any string ID
    # also, all options here are optional
    type: model
    description: glass defect image classifier
    labels:
      - algo=cnn
      - owner=aguschin
      - project=prod-qual-002
    path: models/mymodel.pkl # if not specified, DVC will use ID as path
```
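
The fallback noted in the last comment of the block above (when `path` is omitted, the artifact ID doubles as the path) can be sketched in Python. This is only an illustration under assumptions: `resolve_artifact_path` is a hypothetical helper, not a DVC function, the `ARTIFACTS` dict stands in for an already parsed `artifacts` section, and `models/other.pkl` is an invented second entry.

```python
from typing import Any, Dict, Optional

# Stand-in for the parsed `artifacts` section of dvc.yaml (assumed shape).
ARTIFACTS: Dict[str, Optional[Dict[str, Any]]] = {
    "def-detector": {
        "type": "model",
        "description": "glass defect image classifier",
        "labels": ["algo=cnn", "owner=aguschin", "project=prod-qual-002"],
        "path": "models/mymodel.pkl",
    },
    # Hypothetical entry with no options: the ID itself is used as the path.
    "models/other.pkl": None,
}


def resolve_artifact_path(
    artifacts: Dict[str, Optional[Dict[str, Any]]], artifact_id: str
) -> str:
    """Return the explicit `path` if one is set, otherwise fall back to the ID."""
    if artifact_id not in artifacts:
        raise KeyError(f"unknown artifact: {artifact_id}")
    options = artifacts[artifact_id] or {}  # all options are optional
    return options.get("path", artifact_id)


print(resolve_artifact_path(ARTIFACTS, "def-detector"))
print(resolve_artifact_path(ARTIFACTS, "models/other.pkl"))
```

The two calls cover both cases: an ID with an explicit `path`, and an ID that is itself the path.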

If you want this to be in a separate file (say, `artifacts.yaml`), you can tell
DVC to use it with:

Contributor Author

Now I think there is a use case this change may not support that well. One of our prospects asked to allow a single file (let's say mymodel.pkl) to be referenced as several GTO models (e.g. model1 and model2 - these are names). Since moving to DVC makes path essential (instead of name), I don't see how that feature would fit here. 🤔

The motivation is to be able to promote model1 and model2 to different stages at different moments of time separately. To clarify, let's assume there are two populations mymodel.pkl should be applied for. You can create stages like populationA-prod, populationA-staging and populationB-prod, populationB-staging, if you have many populations, this would make things cumbersome. The solution was to introduce model1 (for populationA) and model2 (for populationB). That required this feature.

The only workaround I see now is to create a "mirror file" with cp mymodel.pkl mymodel-for-populationB.pkl in some DVC PL stage. Or keep this name:path mapping outside of DVC somehow. Do you see any other solutions guys? WDYT?

Contributor

Agree with @omesser that it's related to the discussion above. Take a look at the top-level plots schema, where plots may be identified by either path or an arbitrary name. Feels like following a similar syntax may be best here.

```yaml
# dvc.yaml
artifacts: artifacts.yaml
```

You can also set the type when logging an artifact with DVCLive, which adds your
model to the `artifacts` section in `dvc.yaml`:

```py
from dvclive import Live

with Live() as live:  # needs dvcyaml=True which is set by default
    # you can pass `name`, `description`, `labels` as well
    live.log_artifact("model.pkl", type="model")
```

If no `artifacts` section exists yet, this produces:

```yaml
# dvclive/dvc.yaml
artifacts:
  model.pkl:
    type: model
```

When you commit and push this change, your models will appear in Studio Model
Registry:

![](https://user-images.githubusercontent.com/6797716/223443152-84f57b79-3395-4965-97f9-edc81896a1dc.png)

### As a next step, they will be available in `dvc ls`:

```dvc
# i didn't update this output to match the page
$ dvc ls --artifacts # add `--type model` to see models only
Path           Name                   Type   Labels                     Description
mymodel.pkl                           model
data.xml       stackoverflow-dataset  data   data-registry,get-started  imported code
data/data.xml  another-dataset        data   data-registry,get-started  imported
```

In the same way you specify `type`, you can specify `description`, `labels`, and
`name`. Defining a human-readable `name` (which should be unique) is useful when
you have a complex folder structure or when your artifact can have different
paths during the project lifecycle.
Contributor

I think here we need 1 (concise) example setting all relevant fields
e.g.

dvc add models/mymodel.pkl --name def-detector --type model --description "glass defect image classifier" --label "algo=cnn" --label "owner=aguschin" --label "project=prod-qual-002"


You can use `name` to address the object in `dvc get`:

```dvc
$ dvc get $REPO dvc.yaml:def-detector -o model.pkl
$ dvc get $REPO dvclive/dvc.yaml:model.pkl -o model.pkl
# or simpler:
$ dvc get $REPO :def-detector -o model.pkl
$ dvc get $REPO dvclive:model.pkl -o model.pkl
```
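
The shorthand in the commands above follows a pattern that can be sketched in Python: an address is `<dvc.yaml location>:<artifact ID>`, an empty location means the root `dvc.yaml`, and a bare directory such as `dvclive` stands for `dvclive/dvc.yaml`. The helper `parse_artifact_address` is hypothetical and only mirrors the four examples shown; it is not how DVC resolves addresses internally.

```python
import posixpath
from typing import Tuple


def parse_artifact_address(address: str) -> Tuple[str, str]:
    """Split `<location>:<artifact ID>` into (dvc.yaml path, artifact ID)."""
    location, sep, artifact_id = address.rpartition(":")
    if not sep or not artifact_id:
        raise ValueError(f"expected '<location>:<artifact ID>', got {address!r}")
    if location == "":
        # `:def-detector` -> the dvc.yaml at the repository root
        return "dvc.yaml", artifact_id
    if posixpath.basename(location) == "dvc.yaml":
        # Full form, e.g. `dvclive/dvc.yaml:model.pkl`
        return location, artifact_id
    # Directory shorthand, e.g. `dvclive:model.pkl`
    return posixpath.join(location, "dvc.yaml"), artifact_id


for example in (
    "dvc.yaml:def-detector",
    "dvclive/dvc.yaml:model.pkl",
    ":def-detector",
    "dvclive:model.pkl",
):
    print(example, "->", parse_artifact_address(example))
```

Under this reading, the full and the shorter forms resolve to the same `dvc.yaml` file and artifact ID.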

<details>

### Getting `path` or `desc` or `labels` for artifact [extra for now]

You can also use shortcuts in `gto describe`:

```dvc
$ gto describe -r $REPO def-detector@latest --path
models/mymodel.pkl
```

We most likely won't support this initially. Later, we can either implement it
in GTO (https://github.com/iterative/gto/pull/346#issue-1647512184), where
`gto describe` would use the DVC API under the hood, or implement it in DVC
itself.

</details>

Now, you usually need a specific model version rather than one from the `main`
branch. You can keep track of the model's lineage by
[registering Semantic versions and promoting your models](/doc/gto/get-started)
(or other artifacts) to stages such as `dev` or `production` with GTO. GTO
operates by creating Git tags such as `[email protected]` or
`dvclive__model.pkl#prod`. Knowing the right Git tag, you can get the model
locally:

```dvc
$ dvc get $REPO mymodel.pkl --rev [email protected]
```

Check out the
[GTO User Guide](/doc/gto/user-guide/#getting-artifacts-in-systems-downstream)
to learn how to get the Git tag of the `latest` version or of the version
currently promoted to a stage like `prod`.

<details>

### Getting `latest` or what's in `prod` from Studio [extra for now]

You can also use shortcuts in `dvc get`:

```dvc
$ dvc cloud get $REPO def-detector@latest # download the latest version
```

The discussion for this is happening at
https://github.com/iterative/studio/issues/5215#issuecomment-1488920109

</details>

## Getting models in CI/CD

Git tags are a great way to [kick off a CI/CD](/doc/gto/user-guide/#acting-in-cicd)
pipeline that consumes your model. You can use the
[GTO GitHub Action](https://github.com/iterative/gto-action) to interpret the
Git tag that triggered the workflow and act on it. If you simply need to
download the model in CI, you can also use this Action with the `download`
option:

```yaml
steps:
  - uses: actions/checkout@v3
  - id: gto
    uses: iterative/gto-action@v1
    with:
      download: True # you can provide a specific destination path here instead of `True`
Contributor

Can you give an example? Will it download all artifacts in the repo?

Contributor Author

This should download the artifact, e.g. it will run dvc get . mymodel --rev $GITHUB_REF for a Git tag [email protected].

Contributor Author

Updated docs to explain this.

```

This means that if the Git tag that triggered the workflow registers a version
or promotes it to a stage (like `[email protected]` or `mymodel#prod`), the action
will run `dvc get . mymodel`.
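
How a triggering Git ref might be turned into that `dvc get` call can be sketched in Python. This is an illustration under assumptions, not the Action's actual code: `dvc_get_command` is a hypothetical helper, the tag shapes `<name>@<version>` and `<name>#<stage>` (with an optional trailing `#<number>`) are assumed, and the `--rev` argument follows the author's review comment in this PR. The tags passed in at the bottom are made-up examples.

```python
import re
import shlex
from typing import List

# Assumed tag shapes: `<name>@<version>` registers a version;
# `<name>#<stage>` (optionally followed by `#<number>`) promotes to a stage.
_TAG_RE = re.compile(
    r"^(?P<name>[^@#]+)(?:@(?P<version>[^@#]+)|#(?P<stage>[^@#]+)(?:#\d+)?)$"
)


def dvc_get_command(github_ref: str) -> List[str]:
    """Build the `dvc get` argument list for the tag in `github_ref`."""
    prefix = "refs/tags/"
    tag = github_ref[len(prefix):] if github_ref.startswith(prefix) else github_ref
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a version or stage tag: {tag!r}")
    # Download the artifact by name at the revision the tag points to.
    return ["dvc", "get", ".", match.group("name"), "--rev", tag]


# Illustrative refs only; real tag names depend on your repository.
for ref in ("refs/tags/mymodel@v1.0.0", "refs/tags/mymodel#prod"):
    print(shlex.join(dvc_get_command(ref)))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.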

## Restricting which types are allowed [extra for now]

To specify which `type`s are allowed to be used, you can add the following to
your `.dvc/config`:

```
# .dvc/config
types: [model, data]
```
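
What such a restriction would check can be sketched in Python. Since this feature is marked as an extra, everything here is hypothetical: `find_disallowed_types` is an invented helper, the `artifacts` dict stands in for a parsed `artifacts` section, and `notebook.ipynb` with `type: report` is a made-up offending entry. Artifacts without a `type` are let through because all options are optional.

```python
from typing import Any, Dict, Optional, Set

ALLOWED_TYPES: Set[str] = {"model", "data"}  # mirrors `types: [model, data]`


def find_disallowed_types(
    artifacts: Dict[str, Optional[Dict[str, Any]]], allowed: Set[str]
) -> Dict[str, str]:
    """Return {artifact ID: type} for every artifact whose type is not allowed."""
    offenders: Dict[str, str] = {}
    for artifact_id, options in artifacts.items():
        artifact_type = (options or {}).get("type")
        # A missing type is fine; only an explicit, unlisted type is flagged.
        if artifact_type is not None and artifact_type not in allowed:
            offenders[artifact_id] = artifact_type
    return offenders


artifacts = {
    "def-detector": {"type": "model", "path": "models/mymodel.pkl"},
    "model.pkl": None,                     # no options at all
    "notebook.ipynb": {"type": "report"},  # hypothetical disallowed type
}
print(find_disallowed_types(artifacts, ALLOWED_TYPES))
```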

## Seeing new model versions pushed with DVC experiments
Contributor

Suggested change
- ## Seeing new model versions pushed with DVC experiments
+ ## Models and Experiments

Also, a question - is this an implicit behavior for artifacts with type: model specifically? or will there be similar side effects for any artifact with "type" defined?

Contributor Author

This only updated MDP, so this is only for type: model for now. How did you assume this should work with other types?

Contributor

I didn't, I'm a bit concerned for implicit behaviors, we should probably find a way to give the user control over what to do with which artifacts on exp push. wdyt?

Contributor Author

What example of implicit behavior you have in mind? Like pushing a model that can be few GB in size? Not quite have specific examples in mind.

Contributor

yes, auto-pushing, exactly. maybe auto-versioning as well in the future (could be useful if running in the pipeline CI-CD as part of release: generate model, push it, and assign a version using GTO)


After you run `dvc exp push` to push your experiment that updates your model,
you'll see a commit candidate to be registered:
Contributor

Do you think we should allow registration of unmerged experiments? Or maybe restrict what actions are available for them?

Contributor Author

Good question. Don't have a strong opinion.
First, it's possible to do, so why not. We can also allow to click on "register", but then say something like "We advise to merge the experiment first" with buttons like "create a PR in GH" (default) and "register anyway".
We can prohibit registering (again, don't see a reason except for skipping polluting repo with dangling refs in a workflow that requires users to merge experiments first).
We can delay answering this question for now I think.

Contributor Author

Btw, do we have an understanding how dvc exp push flow should look like on Studio's side? If that's still WIP, I guess we need to implement that first.


![](https://user-images.githubusercontent.com/6797716/223444959-d8ddd1a0-5582-405f-9ab0-807e1a0c9489.png)

Please note it's usually a good idea to merge your experiment before registering
a semantic version to avoid creating dangling commits (not reachable from any
branch).

In the future, you'll also be able to compare a newly pushed model version (even
one without a registered semantic version) with the latest one on this Model
Details Page, or use a button to go to the main repo view with "compare" enabled:

![](https://user-images.githubusercontent.com/6797716/223445799-7ae65e58-6a9e-42a8-890a-f04839349873.png)
5 changes: 5 additions & 0 deletions content/docs/sidebar.json
@@ -656,6 +656,11 @@
"label": "Get Started",
"source": "get-started.md"
},
{
"slug": "get-started-dvc",
"label": "Get Started for DVC",
"source": "get-started-dvc.md"
},
{
"slug": "user-guide",
"label": "User Guide",