WIP: Docs draft for integration with DVC #323

aguschin · 2023-03-07T09:32:41Z

This is a draft of the docs: how this should look to a user after we integrate the GTO `artifacts.yaml` part into DVC.

https://mlem-ai-gto-dvc-nrmufqqihx13et.herokuapp.com/doc/gto/get-started-dvc/

cc @dberenbaum to check this out

github-actions · 2023-03-07T09:42:06Z

Link Check Report

All 7 links passed!

dberenbaum · 2023-03-07T20:51:02Z

content/docs/gto/get-started-dvc.md

+and make them shown as models in `dvc ls`:
+
+```dvc
+$ dvc ls --registry  # add `--type model` to see models only
+ Path           Name                   Type     Labels                       Description
+ mymodel.pkl                           model
+ data.xml       stackoverflow-dataset  data     data-registry,get-started    imported code
+ data/data.xml  another-dataset        data     data-registry,get-started    imported
+```


Does Studio need this, or is it solely to provide a CLI option to view the registry? I don't think the latter needs to be high priority unless I'm missing some use case where you need to access it from the CLI.

Not sure whether it can fit into dvc ls since the output is quite different (and potentially so are the arguments like --type). Need to think about whether we need this and where it can fit.

By the way, do we define an artifact as any output that has a type?

> some use case where you need to access it from the CLI

If a user would like to download a model locally but isn't sure which one, they might want to see this. E.g. I don't remember the model name, but I know the labels or remember the description. This can OFC be solved via Studio, and if we want to push users toward that, that's also a decision.

Another use case would be if you're investigating a repo that's not familiar to you (let's say your team has a few repos or you're looking at another team's repo). Again, if we want to make people go to Studio every time for this, it's a valid workflow, but IMO it makes you leave the CLI and do extra things, which can be inconvenient.
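To make the lookup use case above concrete, here is a minimal Python sketch of filtering registry rows by type, label, or description text. It is not DVC code: the `Artifact` class, the `find_artifacts` helper, and its `type_`, `label`, and `text` parameters are hypothetical names, and the rows simply mirror the `dvc ls --registry` table from the draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Artifact:
    """One row of the hypothetical registry listing."""
    path: str
    name: str = ""
    type: str = ""
    labels: List[str] = field(default_factory=list)
    description: str = ""


def find_artifacts(artifacts, type_=None, label=None, text=None):
    """Return the artifacts that match every filter that was provided."""
    found = []
    for artifact in artifacts:
        if type_ is not None and artifact.type != type_:
            continue
        if label is not None and label not in artifact.labels:
            continue
        # Case-insensitive substring match on the description.
        if text is not None and text.lower() not in artifact.description.lower():
            continue
        found.append(artifact)
    return found


# Rows copied from the table in the draft.
REGISTRY = [
    Artifact("mymodel.pkl", type="model"),
    Artifact("data.xml", "stackoverflow-dataset", "data",
             ["data-registry", "get-started"], "imported code"),
    Artifact("data/data.xml", "another-dataset", "data",
             ["data-registry", "get-started"], "imported"),
]

for artifact in find_artifacts(REGISTRY, label="get-started", text="code"):
    print(artifact.path, artifact.name)
```

The point of the sketch is only that a CLI listing lets you find an artifact when you remember a label or part of the description but not the name.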

> By the way, do we define an artifact as any output that has a type?

Either that, or any input/output file DVC keeps track of can be an artifact (without a type if it's not defined). I think the latter is simpler and easier to convey. We can call it "file" instead of "artifact" I guess (if we're not going to introduce "compound" artifacts as we discussed before, which I don't think is the case). Let's not discuss this now though; it's unrelated to this PR and not necessary at the moment, I believe.

Sounds good. Not sure I see enough to make it a p1 yet. WDYT?

> By the way, do we define an artifact as any output that has a type?

> Either that, or any input/output file DVC keeps track of can be an artifact (without a type if it's not defined). I think the latter is simpler and easier to convey.

It might also depend on the schema discussion below. If we have a model/registry/artifacts section of dvc.yaml, I guess we will only include what's specified there.

dberenbaum · 2023-03-07T20:54:12Z

content/docs/gto/get-started-dvc.md

+branch. You can keep track of the model's lineage by
+[registering Semantic versions and promoting your models](/doc/gto/get-started)
+(or other artifacts) to stages such as `dev` or `production` with GTO. GTO
+operates by creating Git tags such as `[email protected]` or `mymodel#prod`.


I thought the stage tags always had some number like mymodel#prod#1?

It can be `mymodel#prod` as well. That's the "simple" Git tag format, and it's not the default one. https://mlem.ai/doc/gto/user-guide/#git-tags-format
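As an illustration of the tag shapes mentioned in this thread, here is a small Python sketch that tells version tags from stage tags. It is not GTO's own parser: the regular expressions and `parse_tag` are assumptions, the version `v1.0.0` is a made-up example, and any other tag forms GTO may create are not handled.

```python
import re

# name@vX.Y.Z registers a semantic version of an artifact.
VERSION_TAG = re.compile(r"^(?P<name>[^@#]+)@(?P<version>v\d+\.\d+\.\d+)$")
# name#stage (the "simple" format) or name#stage#N (with a counter) assigns a stage.
STAGE_TAG = re.compile(r"^(?P<name>[^@#]+)#(?P<stage>[^@#]+)(?:#(?P<counter>\d+))?$")


def parse_tag(tag):
    """Return a dict describing the tag, or None if it matches neither shape."""
    match = VERSION_TAG.match(tag)
    if match:
        return {"kind": "version", "name": match["name"], "version": match["version"]}
    match = STAGE_TAG.match(tag)
    if match:
        counter = int(match["counter"]) if match["counter"] else None
        return {"kind": "stage", "name": match["name"],
                "stage": match["stage"], "counter": counter}
    return None


for tag in ["mymodel@v1.0.0", "mymodel#prod", "mymodel#prod#1"]:
    print(tag, parse_tag(tag))
```

Both `mymodel#prod` and `mymodel#prod#1` come out as stage assignments; only the counter differs.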

dberenbaum · 2023-03-07T20:54:52Z

content/docs/gto/get-started-dvc.md

+  - id: gto
+    uses: iterative/gto-action@v1
+    with:
+      download: True # you can provide a specific destination path here instead of `True`


Can you give an example? Will it download all artifacts in the repo?

This should download the artifact, e.g. it will run `dvc get . mymodel --rev $GITHUB_REF` for a Git tag [email protected].

Updated docs to explain this.
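A rough Python sketch of the behavior described in the reply above, under stated assumptions: the `dvc_get_command` helper is hypothetical, the tag `mymodel@v1.0.0` is a made-up example, and mapping a destination path from the action's `download` input onto `dvc get -o` is a guess, not something confirmed in this thread. It only builds the command and does not run it.

```python
import re


def dvc_get_command(github_ref, out=None):
    """Build the `dvc get` call for the artifact named in a pushed Git tag."""
    prefix = "refs/tags/"
    if not github_ref.startswith(prefix):
        raise ValueError(f"not a tag ref: {github_ref}")
    tag = github_ref[len(prefix):]
    # The artifact name is everything before the first '@' or '#'.
    name = re.split(r"[@#]", tag, maxsplit=1)[0]
    cmd = ["dvc", "get", ".", name, "--rev", github_ref]
    if out is not None:
        # Assumption: a destination path maps to `dvc get -o <path>`.
        cmd += ["-o", out]
    return cmd


print(" ".join(dvc_get_command("refs/tags/mymodel@v1.0.0")))
```

So only the single artifact named in the tag would be downloaded, not every artifact in the repo.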

dberenbaum · 2023-03-07T20:56:56Z

content/docs/gto/get-started-dvc.md

+## Restricting which types are allowed
+
+To specify which `type`s are allowed to be used, you can add the following to
+your `.dvc/config`:
+
+```
+# .dvc/config
+types: [model, data]
+```


Do you think it needs to be part of the initial scope? I know it was requested, but it doesn't feel strictly necessary yet.

Not necessary. I can add a note in the docs page marking what's extra for now, if that makes sense.

dberenbaum · 2023-03-07T21:01:44Z

content/docs/gto/get-started-dvc.md

+## Seeing new model versions pushed with DVC experiments
+
+After you run `dvc exp push` to push your experiment that updates your model,
+you'll see a commit candidate to be registered:


Do you think we should allow registration of unmerged experiments? Or maybe restrict what actions are available for them?

Good question. Don't have a strong opinion.
First, it's possible to do, so why not. We can also allow clicking "register", but then show something like "We advise merging the experiment first" with buttons like "create a PR in GH" (default) and "register anyway".
Or we can prohibit registering (though I don't see a reason, except to avoid polluting the repo with dangling refs in a workflow that requires users to merge experiments first).
We can delay answering this question for now I think.

Btw, do we have an understanding of how the `dvc exp push` flow should look on Studio's side? If that's still WIP, I guess we need to implement that first.

dberenbaum · 2023-03-07T21:04:18Z

Looks good @aguschin! Just a few details to work out. Thanks!

omesser · 2023-03-09T01:00:02Z

content/docs/gto/get-started-dvc.md

+live.log_artifact(artifact, "path", type="model")
+```
+
+This will make them appear in DVC Model Registry:


Suggested change

This will make them appear in DVC Model Registry:

This will make them appear in [Studio Model Registry](https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry):

(For now, it's still called Studio and not DVC.Cloud or similar.)

Thanks! Overall, let's keep the review scope to the level of ideas and user experience. We don't even know if this will be a separate page in DVC docs, or maybe we integrate it with some other page.

yeah this is a nitpick 😉 but it was hard for me to pass up the opportunity to suggest it

omesser · 2023-03-09T01:04:24Z

content/docs/gto/get-started-dvc.md

+separately, address them by `name` in `dvc get`, and eventually, see them in DVC
+Studio.
+
+Let's start with marking an artifact as data or model.


Suggested change

Let's start with marking an artifact as data or model.

Let's start with marking a tracked artifact (file) as a `model`.

Personally, I don't think that "data" is a valid example for a type

Why? I assumed Data Registry would show type: data once we implement it.

It carries 0 information though, right?
"data of type data", similar to "artifact of type artifact" means the same as not defining type at all. it's the most general thing there is (data even more than artifact maybe?), just super abstract. doubt users will use it that way

Good point. Maybe it's dataset instead of data then? Or anyways, if after subtracting plots, metrics and models everything that's left (among DVC PL inputs and outputs) is dataset, then I guess it's redundant?

My 2cs. Data is not abstract for me (it's different vs model in my perception). But in DVC it's not needed. Any out w/o a specified type can be considered data.

I would personally try to simplify all of this - no multiple types initially. only models. We are prematurely generalizing this I think.

I think of the terms Data Registry, Model Registry, and Artifact Registry. I like having the type track to those names, so I like data.

Would love to keep the idea of data registry support around, haven't thought a lot about using gto for that, but certainly have thought about models and binaries.

I would love to use/try gto as an Artifact Registry for build artifacts - specifically compiled binaries. It might also be interesting to use as a Container Registry - there are lots of solutions in this space like Jfrog Artifactory, cloudsmith, GCP. If you wanted to be really fancy you could support some of those offerings as backends.

With all of that said I think prioritizing the Model Registry use case makes a lot of sense, a Data Registry would be my next priority. The solution seems like it may be general enough to support an Artifact Registry and Container Registry, might be worth doing some thinking about it and if it does make sense and there are advantages to keeping that in gto then keep that use case in mind while making the Model Registry experience awesome.

Trying to think of advantages of having gto also be a Container Registry and remembered that mlem can deploy docker containers, it would be nice to have the versioning of those container artifacts in gto and be able to reference with gto syntax.

If I might chip in as a user - I absolutely would like GTO to be used also to build a data(set) registry (together with DVC) and possibly a combined data(set) and model registry.

In fact, we intend to use GTO and DVC to build a dataset registry for one (fairly large) client in the coming weeks. By the way, while stuff like MLFlow is a viable alternative to the GTO+MLEM-based model registry (it does not have all of its features but has some others), I don't know any open source alternative to a GTO+DVC-based dataset registry (and in general I've yet to see data versioning done better than with DVC), which means it is a very good selling point IMO.

I like the idea of viewing GTO as a tool to build pretty much any artifact registry, though models and datasets are the most likely use-cases.

omesser · 2023-03-09T01:12:57Z

content/docs/gto/get-started-dvc.md

+types: [model, data]
+```
+
+## Seeing new model versions pushed with DVC experiments


Suggested change

## Seeing new model versions pushed with DVC experiments

## Models and Experiments

Also, a question - is this an implicit behavior for artifacts with type: model specifically? or will there be similar side effects for any artifact with "type" defined?

This only updated MDP, so this is only for type: model for now. How did you assume this should work with other types?

I didn't, I'm a bit concerned for implicit behaviors, we should probably find a way to give the user control over what to do with which artifacts on exp push. wdyt?

What example of implicit behavior do you have in mind? Like pushing a model that can be a few GB in size? I don't quite have specific examples in mind.

yes, auto-pushing, exactly. maybe auto-versioning as well in the future (could be useful if running in the pipeline CI-CD as part of a release: generate the model, push it, and assign a version using GTO)

omesser · 2023-03-09T01:13:22Z

content/docs/gto/get-started-dvc.md

@@ -0,0 +1,115 @@
+# Get Started DVC
+


HL:
I suggest adding sub-headers / sections.
The structure for this page could be something like this:

- `## Defining artifact types`: example of a model, defining it in different ways, mentioning that in a similar way you can define other types, e.g. `type: artifact` or similar
- `## Browsing Artifacts`: Studio, `dvc ls`
- `## Additional artifact metadata`
- `## Registered artifact versions`
- `## Working with artifacts in CI`
- `## Restricting which types are allowed`

omesser · 2023-03-09T01:18:14Z

content/docs/gto/get-started-dvc.md

+The same way you specify `type`, you can specify `description`, `labels` and
+`name`. Defining human-readable `name` (should be unique) is useful when you
+have complex folder structures or if you artifact can have different paths
+during the project lifecycle.


I think here we need 1 (concise) example setting all relevant fields
e.g.

`dvc add models/mymodel.pkl --name def-detector --type model --description "glass defect image classifier" --label "algo=cnn" --label "owner=aguschin" --label "project=prod-qual-002"`

shcheklein · 2023-03-09T04:32:23Z

content/docs/gto/get-started-dvc.md

+      - data.xml
+    outs:
+      - mymodel.pkl:
+          type: model # like this


To make it symmetrical with plots, params, etc., why don't we make it:

models: mymodel.pkl

do we plan / need other artifact types? (data can be covered by outs I think)

QQ: in this case, do you assume other fields can be specified like:

    models:
      mymodel.pkl:
        name: particles-classifier
        description: best hits
        labels:
          - boson
          - higgs

?

For now it looks like type: model may be enough. I can introduce this models: section in this docs page OFC (and I'd like to do this TBH: artifacts.yaml is almost like it, which will make a transition easier for existing users with something like models: artifacts.yaml), but then we need to take into account how we deal with duplication/overwrites if we allow models to be specified in models: section and in outs (type: model) simultaneously.

We also have a user asking if we could allow this to have a separate file, that's another "+" for having a models: section.

I'm thinking it might make more sense to have a dedicated section for registry with all metadata defined there, with artifact names, and the pipeline should only refer to items from there by names - otherwise it really clutters the pipeline mixing all those concerns.

A top-level dvc.yaml section means we need to do some more work on the DVC side instead of using what we have.

If we go this direction, the biggest question is whether to call it models: or make it generic like registry: or artifacts:. The appeal of models is that we start small, but I worry it doesn't make much sense long-term because:

- There's no model-specific functionality in DVC now or in current plans.
- People already use DVC as a data registry, and it's inevitable people will want to log non-model artifacts.
- It provides flexibility for custom types (not sure if this is good or bad).

Thanks for ideas. I've updated the docs page - please TAL.

I think the dvc.yaml schema can look like this:

    # dvc.yaml
    stages:
      train:
        cmd: python train.py
        deps:
          - data.xml
        outs:
          - models/mymodel.pkl
    artifacts:
      - def-detector:
          path: models/mymodel.pkl
          type: model
          desc: glass defect image classifier
          labels:
            - algo=cnn
            - owner=aguschin
            - project=prod-qual-002
      - models/othermodel.pkl # If no path provided, use name as the path

This would reuse most of what we already have between:

- Existing metadata fields for type, desc, labels.
- Top-level fields we have for plots, metrics, and params.
- GTO artifacts.yaml (artifacts are in a dict format in artifacts.yaml but a list format above; we could support both to make it less error-prone, which we already do for top-level plots).

I think we can mostly leave alone everything already implemented and only add this new top-level section. I would suggest that only things listed in artifacts: (and in artifacts.yaml if we keep it) show up in the Studio registry. Until we get to a non-model registry, we could hide artifacts without type: model.
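To show what "support both" could mean for the list and dict formats, here is a hedged Python sketch that normalizes either shape of an `artifacts:` section into one `{name: metadata}` mapping and then keeps only `type: model` entries for the registry view. `normalize_artifacts` and `registry_view` are hypothetical helpers written for this comment; they are not part of DVC or GTO, and the input is assumed to be already-parsed YAML.

```python
def normalize_artifacts(section):
    """Return {name: metadata} for a list-style or dict-style `artifacts:` section."""
    if isinstance(section, dict):
        items = list(section.items())
    else:
        items = []
        for entry in section:
            if isinstance(entry, str):
                # Bare string entry: the name doubles as the path.
                items.append((entry, {}))
            else:
                items.extend(entry.items())

    result = {}
    for name, meta in items:
        if name in result:
            raise ValueError(f"duplicate artifact name: {name}")
        meta = dict(meta or {})
        # If no path is provided, use the name as the path.
        meta.setdefault("path", name)
        result[name] = meta
    return result


def registry_view(artifacts):
    """Hide everything without `type: model`, as suggested for the first iteration."""
    return {name: meta for name, meta in artifacts.items() if meta.get("type") == "model"}


SECTION = [
    {"def-detector": {"path": "models/mymodel.pkl", "type": "model"}},
    "models/othermodel.pkl",
]

artifacts = normalize_artifacts(SECTION)
print(artifacts)
print(list(registry_view(artifacts)))
```

Rejecting duplicate names is a design choice in this sketch; the thread does not settle how duplicates or overwrites should be handled.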

It's in the right direction, I like it. But I would make it even simpler initially and add a first-class, top-level models section. I don't see a reason to overgeneralize everything at the moment tbh.

I think there are good reasons to generalize types. Besides those mentioned above in this thread and by users, we already support type in DVC metadata and GTO artifacts.yaml, so it fits with what we have already. Is there a good enough reason to rebuild to make it model-specific?

aguschin · 2023-03-09T10:20:51Z

content/docs/gto/get-started-dvc.md

+
+If you're producing your models in DVC pipeline, you'll need to add
+`type: model` to `dvc.yaml` instead:
+


Now I think there is a use case this change may not support that well. One of our prospects asked to allow a single file (let's say mymodel.pkl) to be referenced as several GTO models (e.g. model1 and model2 - these are names). Since moving to DVC makes path essential (instead of name), I don't see how that feature would fit here. 🤔

The motivation is to be able to promote model1 and model2 to different stages at different moments of time separately. To clarify, let's assume there are two populations mymodel.pkl should be applied for. You can create stages like populationA-prod, populationA-staging and populationB-prod, populationB-staging, if you have many populations, this would make things cumbersome. The solution was to introduce model1 (for populationA) and model2 (for populationB). That required this feature.

The only workaround I see now is to create a "mirror file" with cp mymodel.pkl mymodel-for-populationB.pkl in some DVC PL stage. Or keep this name:path mapping outside of DVC somehow. Do you see any other solutions guys? WDYT?

Agree with @omesser that it's related to the discussion above. Take a look at the top-level plots schema, where plots may be identified by either path or an arbitrary name. Feels like following a similar syntax may be best here.
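A small Python sketch of the name-to-path idea for the two-populations case, assuming artifacts are keyed by an arbitrary name with the path stored as metadata. The `ARTIFACTS` mapping, the helper functions, and the revision `abc123` are all hypothetical and exist only to illustrate the point.

```python
# Two names share one file, so each name can be promoted on its own schedule.
ARTIFACTS = {
    "model1": {"path": "mymodel.pkl", "type": "model"},  # populationA
    "model2": {"path": "mymodel.pkl", "type": "model"},  # populationB
}


def resolve_path(artifacts, name):
    """Look up the tracked file behind an artifact name."""
    return artifacts[name]["path"]


def names_for_path(artifacts, path):
    """List every artifact name that points at the given file."""
    return sorted(name for name, meta in artifacts.items() if meta["path"] == path)


def promote(promotions, name, stage, rev):
    """Record that `name` at revision `rev` is assigned to `stage`."""
    promotions[(name, stage)] = rev


promotions = {}
promote(promotions, "model1", "prod", "abc123")
promote(promotions, "model2", "staging", "abc123")

print(names_for_path(ARTIFACTS, "mymodel.pkl"))
print(promotions)
```

With this shape there is no need for a mirror file: the stages stay `prod` and `staging`, and the population is carried by the artifact name.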

aguschin · 2023-03-10T08:15:28Z

content/docs/gto/get-started-dvc.md

+If you want this to be in a separate file (say, `artifacts.yaml`), you can tell
+DVC to use it with:
+
+```yaml
+# dvc.yaml
+registry: artifacts.yaml
+```


Extra for now, but was requested by users

Love this! I would love to do this with plots too.

shortcipher3

Overall I think this looks great.

init

edd81ae

aguschin self-assigned this Mar 7, 2023

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 7, 2023 09:35 Inactive

fixes

76b9cd7

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 7, 2023 11:41 Inactive

fix

faa44dc

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 7, 2023 11:55 Inactive

dvclive

ec9d8d7

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 7, 2023 13:18 Inactive

aguschin added 2 commits March 7, 2023 19:57

add studio screenshot

16e10d7

add studio screenshot

6308803

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 7, 2023 13:58 Inactive

add studio screenshot

dd08adf

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 7, 2023 13:59 Inactive

add dvc exp push section

b71561b

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 7, 2023 14:08 Inactive

dberenbaum reviewed Mar 7, 2023

omesser reviewed Mar 9, 2023

omesser suggested changes Mar 9, 2023

shcheklein reviewed Mar 9, 2023

aguschin commented Mar 9, 2023

some fixes

56f393e

dberenbaum mentioned this pull request Mar 9, 2023

frameworks: Use log_artifact iterative/dvclive#465

Closed

top-level registry: section

2bf18ca

reference artifacts by name/alias in DVC PL

e8156ef

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 10, 2023 08:15 Inactive

aguschin commented Mar 10, 2023

Update content/docs/gto/get-started-dvc.md

446ed3a

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 10, 2023 08:16 Inactive

aguschin mentioned this pull request Mar 10, 2023

DVC+GTO integration iterative/gto#337

Closed

shortcipher3 reviewed Mar 10, 2023

dberenbaum mentioned this pull request Mar 15, 2023

log_model iterative/dvclive#472

Closed

aguschin added 2 commits March 21, 2023 16:35

remove dvc add

6b7f890

Merge branch 'gto-dvc' of github.com:iterative/mlem.ai into gto-dvc

0936229

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 21, 2023 10:36 Inactive

registry to artifacts

d80e5f3

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 21, 2023 10:40 Inactive

aguschin mentioned this pull request Mar 21, 2023

Introduce artifacts: section in DVC and make it work with GTO iterative/dvc#9219

Closed

20 tasks

dberenbaum mentioned this pull request Mar 22, 2023

Model registry docs iterative/dvc.org#4423

Closed

13 tasks

changes to keep page up-to-date

35470c1

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 31, 2023 06:22 Inactive

add gto describe as extra-for-now

1664b17

shcheklein temporarily deployed to mlem-ai-gto-dvc-nrmufqqihx13et March 31, 2023 06:32 Inactive

omesser self-requested a review June 4, 2023 13:37
