Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log_model #472

Closed
dberenbaum opened this issue Feb 27, 2023 · 15 comments
Closed

log_model #472

dberenbaum opened this issue Feb 27, 2023 · 15 comments
Labels
p1-important Include in the next sprint

Comments

@dberenbaum
Copy link
Collaborator

Depends on #378.

Replaces #305.

#378 adds support for tracking any file or directory artifact with DVC. In #305, there was discussion of saving model-specific artifacts using MLEM. Instead of using MLEM, this issue focuses on how to track an arbitrary model file like log_artifact, except that it should be tagged as a model so it is automatically shown in the Studio model registry and can be viewed, promoted, consumed, etc.

cc @aguschin Let's think about this as the starting point for the basic model registry workflow.

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Feb 27, 2023

DVC already has some built-in support that could be useful: iterative/dvc#8214. log_model could add to DVC with type: model for now, and Studio/GTO could consume that to show in the model registry. Could that be enough to support a simple end-to-end model workflow?

This could probably depend on the path of the model rather than a specified model name (which is what GTO uses now in artifacts.yaml). If that becomes too limited, we could add top-level models/artifacts section that looks like GTO's artifacts.yaml as a follow-up.

This was referenced Feb 27, 2023
@aguschin
Copy link
Contributor

aguschin commented Feb 28, 2023

Could that be enough to support a simple end-to-end model workflow

In general, yes. Although if we don't merge artifacts.yaml into DVC at that moment, we're getting two sources of info for artifacts. Now users may wonder why gto show doesn't show models logged with log_model.

We can go ahead and merge artifacts.yaml. Which, IIUC, means:

  • (done) taking Meta information for data dvc#8214,
  • (done) dvc models list or dvc data ls --type model,
  • dvc data describe $PATH --name to get name knowing model path,
  • implementing name metadata in "data" part of DVC, and implementing name/type/etc in pipelines,
  • taking out artifacts.yaml from GTO (should we keep read-only commands maybe?),
  • updating gto GH action,
  • updating docs,
  • communicating that to customers and OSS users).

What's unclear to me, is not implementing name concept in DVC. In this case, it can break users workflow (now they need to keep somewhere the connection between path and name themselves, otherwise GTO and DVC won't work together in scenarios we described, including "in CI, download a model called mymodel promoted to prod" scenario. We can also completely remove name from both GTO and Studio, but that's a big breaking change I wouldn't consider now.

we could add top-level models/artifacts section that looks like GTO's artifacts.yaml as a follow-up

I think it's a valid solution, but we as well could just add name: something field (just like type: model). It may work out fine even without models: ... section. But this may become clearer when we start implementing this for pipelines - so if we're ok with this, in general, we can just start implementing what's preferable (models: section, or type: model), and is it possible to maybe use both.

@dberenbaum
Copy link
Collaborator Author

Now users may wonder why gto show doesn't show models logged with log_model.

Sorry, my assumption here was that gto show would show the models logged with log_model.

* taking [Meta information for data dvc#8214](https://github.com/iterative/dvc/issues/8214),

* adding back `dvc models list` implemented there already,

Do you mean dvc data ls --type models instead of dvc models list?

* implementing metadata in pipelines,

I thought it was implemented in iterative/dvc#8251?

What's unclear to me, is not implementing name concept in DVC.

Sorry, I meant that if we default to using path when there is no name, we can simplify the basic scenario.

I think it's a valid solution, but we as well could just add name: something field (just like type: model). It may work out fine even without models: ... section. But this may become clearer when we start implementing this for pipelines

Sounds good. Agreed that it's not clear whether we need a models: section.

@aguschin
Copy link
Contributor

aguschin commented Mar 1, 2023

Do you mean dvc data ls --type models instead of dvc models list?

Yes, I meant that one, thanks. But I also would consider adding simpler commands dvc models ls and dvc data ls and dvc artifacts ls to highlight that DVC has a specific notion of "model" and "data" now. Not sure if that's possible considering existing dvc data status command though.

Right, just checked - it was merged indeed. I'll update this Notion page with regard to that.

Also, side-thought: as an alternative, we could make GTO integrated with DVC for read operations. So instead of

$ REV=$(...some gto command...)
$ dvc get $REPO $MODEL_PATH --rev $REV -o $MODEL_PATH

one would be able to also run

$ gto get $REPO $MODEL_NAME --version v1.0.0
# or 
$ gto get $REPO $MODEL_NAME --stage prod

Yet not sure how good is this idea from other angles, but having one command to download models, and do that without going back-and-forth with model names, paths and revisions sounds great to me.

@dberenbaum
Copy link
Collaborator Author

@aguschin suggested an alternative to log_model using an argument to live.log_artifact:

live.log_artifact(artifact, "path", type="model")

Makes sense to me! No need for an entirely separate method here.

@dberenbaum
Copy link
Collaborator Author

@aguschin @daavoo Related to the discussions in iterative/mlem.ai#323: should live.log_artifact("path/to/artifact") always add something like this to dvclive/dvc.yaml even if there is no type specified?

artifacts:
  - path/to/artifact

Also, do we plan to add other optional fields to log_artifact like name/alias, description, etc.?

None of this is urgent IMO but all would be nice to have.

@aguschin
Copy link
Contributor

@dberenbaum, it happened that we're discussing the same thing here iterative/dvc#9220 (comment), but let me answer with my current understanding:

  1. We want to write dvc.yaml at live.log_artifact("path/to/artifact"), but there are some corner cases - asking about them in the PR I linked.
  2. We want to allow logging name, description, type, etc.

We can continue the discussion here or there, whichever works best.

@daavoo
Copy link
Contributor

daavoo commented Mar 24, 2023

Moving the discussion to here.

At this step, if myartifactname exists in any other dvc.yaml, DVC will throw an error since no names duplication is allowed. How we should handle this? Is it ok to break runtime because of this (especially if user specified type, name, etc)? Or maybe better to issue a warning and continue code execution?

In general, we don't want DVCLive to fail but just warn.
In this specific case, I am not even sure if we should check for duplicates at all. IMO, checking for duplicates should happen at a different level, not at writing time.

if DVCLive wasn't set up to write to dvc.yaml the way you described, but user calls log_artifact(path, name="myname") or pass any other args (like labels, type or description), what should DVCLive do?

I think we want to keep it simple / consistent with current behavior:

  • log_artifact(path, **kwargs) will write to path.dvc, including the kwargs (name, labels, type, etc.).
  • If dvcyaml=True, we will add all that info to the artifacts section.
  • log_model would be just an alias for log_artifact(path, type="model", **kwargs)

I guess the answer should be similar to the case when one calls log_metric, but forgets to call make_dvcyaml at the end. What happens then, could you please explain?

https://dvc.org/doc/dvclive/how-it-works

log_{X} calls always write to some output file.
make_{X} calls are called on end and/or next_step depending on arguments passed to Live.

The user would not really forget to call make_dvcyaml but rather would be called depending on the value the user has passed to Live(dvcyaml=True/False).

@dberenbaum
Copy link
Collaborator Author

  • log_model would be just an alias for log_artifact(path, type="model", **kwargs)

Minor clarification: I don't think we need this method at all, it's just a relic from earlier discussions. Otherwise LGTM.

@aguschin
Copy link
Contributor

checking for duplicates should happen at a different level, not at writing time

Then after running log_artifact and committing changes, user may end up with a broken DVC repo (two artifacts with the same name, DVC don't know which one to take into account). Doesn't sound right to me... Is there some way to prevent this?

log_artifact(path, **kwargs) will write to path.dvc, including the kwargs (name, labels, type, etc.).

Sounds logical from dvclive/user perspective. Then as sources of artifacts we're having not only artifacts: section, also .dvc files.

Question then (again): are we showing in Studio artifacts from both places?

@dberenbaum
Copy link
Collaborator Author

Then after running log_artifact and committing changes, user may end up with a broken DVC repo (two artifacts with the same name, DVC don't know which one to take into account). Doesn't sound right to me... Is there some way to prevent this?

We also encourage manually editing dvc.yaml, so there's no way to prevent this. Having mechanisms to write to dvc.yaml and validate beforehand are nice to have but I don't think they are a priority. Let's please keep it simple to start, because I worry this will delay this PR which otherwise seems close to being ready to merge.

Question then (again): are we showing in Studio artifacts from both places?

My take from iterative/mlem.ai#323 was that the artifacts: section was the sole source of truth for GTO and the Studio model registry. We may want to drop support for desc, type, labels from .dvc files, although I don't think it's an immediate priority.

@aguschin
Copy link
Contributor

Ok, cool @dberenbaum. Then the PR is ready for review.

@daavoo
Copy link
Contributor

daavoo commented Mar 28, 2023

ay end up with a broken DVC repo (two artifacts with the same name, DVC don't know which one to take into account)
We also encourage manually editing dvc.yaml, so there's no way to prevent this.

Not really relevant since we leave that out of scope but my 2c is that if we consider artifacts to be a descriptive section (like top-level metrics,params,plots), the duplication should not mean a broken state.
DVC should differentiate conflicting names by adding the {path_do_dvcyam} prefix for those artifacts that are not defined in the root dvc.yaml.

@shcheklein
Copy link
Member

Folks, is this one done?

@dberenbaum
Copy link
Collaborator Author

I think we can close it. Thanks for the ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-important Include in the next sprint
Projects
None yet
Development

No branches or pull requests

4 participants