Download models (type: model) with dvc get #9100

Closed
aguschin opened this issue Mar 2, 2023 · 34 comments · Fixed by #9770
Assignees
Labels
A: get Related to dvc get feature request Requesting a new feature p1-important Important, aka current backlog of things to do

Comments

@aguschin
Contributor

aguschin commented Mar 2, 2023

As was partially discussed in iterative/dvclive#472 and iterative/gto#337, we're considering merging the artifacts.yaml part of GTO into DVC. We've listed the MR user scenarios we want to support after that, and one of the most important for the CLI is "Download model from registry". This could happen both in CI and in the CLI locally. In CI it's easier overall, since we can use things like the GTO action to help figure things out and even download models if needed, so let's think about the local CLI here.

After discussing this with @dberenbaum, we figured out the best solution would be to give a single command to download models (type: model). Something like:

$ dvc get $REPO mymodel --type model --rev [email protected]

Note that mymodel in this example is not a path but a model name (the one used in Git tags that register versions or assign stages, like [email protected] or mymodel#prod#1). Implementing such a command will save users from figuring out the path if they know the name.

What's more, the frequent way to use this is to get latest version or the version in prod/stage. In GTO I used these shortcuts to access those:

$ gto show mymodel@latest  # takes the latest version - the actual git tag will be like `[email protected]`
$ gto show mymodel#prod  # takes what's in prod - the actual git tag will be like `mymodel#prod#17`

I preferred this to gto show mymodel --latest or gto show mymodel --stage prod due to brevity and similarity to the actual Git tag format, which makes it easier to remember.
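
To make the shortcut syntax concrete, here is a minimal Python sketch of how a reference such as mymodel@latest or mymodel#prod could be split into a name plus a version or a stage. `ArtifactRef` and `parse_shortcut` are hypothetical names for illustration only; this is not GTO's actual parser.

```python
from typing import NamedTuple, Optional


class ArtifactRef(NamedTuple):
    """Parsed form of a shortcut reference (hypothetical structure)."""

    name: str
    version: Optional[str] = None  # e.g. "latest"
    stage: Optional[str] = None  # e.g. "prod"


def parse_shortcut(ref: str) -> ArtifactRef:
    """Split 'name@version' or 'name#stage' into its parts."""
    if "@" in ref and "#" in ref:
        raise ValueError(f"cannot mix '@' and '#' in {ref!r}")
    if "@" in ref:
        name, version = ref.split("@", 1)
        parsed = ArtifactRef(name, version=version)
    elif "#" in ref:
        name, stage = ref.split("#", 1)
        parsed = ArtifactRef(name, stage=stage)
    else:
        parsed = ArtifactRef(ref)
    if not parsed.name:
        raise ValueError(f"missing artifact name in {ref!r}")
    return parsed
```

A plain name with no suffix is returned with both `version` and `stage` unset, leaving the choice of default to the caller.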

Ultimately, providing a DVC command that downloads the latest version or whatever is in prod would be great. Maybe we can reuse the GTO shortcuts:

$ dvc get $REPO mymodel@latest --type model

For that, we'll need to call GTO API under the hood (since we probably don't want to merge GTO's Git tags managing part into DVC due to making DVC even more complex).

The last command would be handy for users, because right now, to get the latest version of the model named mymodel locally, you need to run something like:

$ ARTIFACT_PATH=$(gto describe --repo $REPO mymodel@latest --path)
$ REVISION=$(gto show --repo $REPO mymodel@latest --ref)
$ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $ARTIFACT_PATH
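
The same three steps can be sketched in Python, assuming the gto and dvc executables are on PATH and accept exactly the flags shown above. `get_artifact` and the injectable `run` parameter are hypothetical helpers, added so the flow can be exercised without touching a real repository.

```python
import subprocess
from typing import Callable, List


def _run(argv: List[str]) -> str:
    """Run a command and return its stripped standard output."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def get_artifact(repo: str, ref: str, run: Callable[[List[str]], str] = _run) -> str:
    """Resolve the path and revision with gto, then download with dvc get."""
    path = run(["gto", "describe", "--repo", repo, ref, "--path"])
    rev = run(["gto", "show", "--repo", repo, ref, "--ref"])
    run(["dvc", "get", repo, path, "--rev", rev, "-o", path])
    return path  # local path the artifact was written to
```

Calling `get_artifact(repo, "mymodel@latest")` would run the commands for real; passing a fake `run` makes it possible to check the command lines alone.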

See also this question about reusing dvc ls to list type: model; it relates to the overall experience of working with type: model in DVC.

WDYT about this? Does it look natural to extend dvc get to models or it could be better to have a separate command for that?

cc @dberenbaum @skshetry

@aguschin aguschin added feature request Requesting a new feature A: get Related to dvc get labels Mar 2, 2023
@skshetry
Member

skshetry commented Mar 2, 2023

Broadly related to 1st point in https://github.com/iterative/studio/issues/4782.

@dberenbaum
Collaborator

note that mymodel in this example is not a path, but a model name (that's used in Git tags registering versions/assigning stages like [email protected] or mymodel#prod#1)

As discussed in iterative/dvclive#472, I still think it would be useful to make the path the default name so we don't require people to think of a separate model name and additional abstraction. I doubt it changes much of the implementation, since I don't think we want to completely deprecate model names.

WDYT about this? Does it look natural to extend dvc get to models or it could be better to have a separate command for that?

Not really sure at this point. The UI doesn't look bad from what you show above, but hard to say yet whether it will make sense to shove gto tag aliases into --rev and artifact names as targets to dvc get, or whether we are better off with something like dvc get-artifact. We can try it out and decide this later.

@aguschin
Contributor Author

aguschin commented Mar 3, 2023

Also, @shcheklein pointed out that in some DVC commands paths and names are already interchangeable, as in dvc pull, where targets can be "tracked files/directories, .dvc files, or stage names". I believe dvc exp run is another example of this.

@dberenbaum
Collaborator

dberenbaum commented Mar 29, 2023

@aguschin We probably also need to make this work for dvc import and for API methods.

@dberenbaum
Collaborator

Some questions to discuss/decide:

  1. How to specify artifacts (for example dvclive/dvc.yaml:def-defector)?
  2. Do we need a flag for artifacts (dvc get --artifact dvclive/dvc.yaml:def-defector) or can they be positional targets (dvc get dvclive/dvc.yaml:def-defector)?
  3. Do we need/would it be better to give users a way to get the path and pass that to dvc get and other commands/methods (some simplified version or command for dvc.repo.Repo().artifacts()["dvclive/dvc.yaml"]["def-defector"]["path"])?
  4. Do we need a GTO dependency to get the latest version or the prod version?

@aguschin
Contributor Author

aguschin commented Mar 29, 2023

IMHO:

  1. dvc get $REPO dvclive/dvc.yaml:def-defector or dvc get $REPO dvclive:def-defector should work. Maybe we can allow dvc get $REPO dvclive/def-defector, but then we need to introduce a flag like --artifact to make it non-ambiguous. First approach is simpler to me if we use this already to reference stages/plots (I assume we don't support downloading plots/stages this way?).
  2. ^
  3. I think a simplified command would be nice, but it can wait until we hear this request from users.
  4. We're talking about dvc get $REPO dvclive/dvc.yaml:def-detector@latest where DVC calls GTO under the hood to figure out right rev. I'd prefer to make everything work without it, and then make that as an improvement. Rn it's too early to do this kind of things I assume.

@dberenbaum
Collaborator

Added a comment to https://github.com/iterative/studio/issues/5215#issuecomment-1488920109 to discuss how Studio could simplify, especially for 4 above.

I think we still need dvc get or some DVC support for getting the artifact once we know the right revision, since I assume GTO will only work with tags and not be able to retrieve the path or file now that we migrated this functionality to DVC. However, let's see how that discussion goes before prioritizing this.

@aguschin
Contributor Author

aguschin commented May 23, 2023

To sum up the discussion here and in #9345: when a user wants to get the model, there are two ways they may want to do that: in the CLI (e.g. in CI/CD), or directly in the Python runtime:

  1. CLI. Here, the revision will change more frequently than the path to the artifact. Thus implementing dvc get . mymodel and dvc get . mymodel#prod together makes sense. One without the other will feel incomplete.
  2. API. The next step would be adding Python API support to get artifacts. We need to support dvc.api.read, dvc.api.open and maybe DVCFileSystem?

A few notes on the first point:

  • instead of asking for a path to dvc.yaml, DVC should be able to find an artifact in the DVC repo using its name: dvc get . mymodel instead of dvc get . long/path/to/dvc.yaml:mymodel
  • mymodel could be in several dvc.yaml files; in that case we need to fail and ask the user to specify the right dvc.yaml
  • it may be better to use dvc get . --artifact mymodel instead of dvc get . :mymodel (taking into account the sometimes needed path to dvc.yaml, it could be dvc get . --dvcyaml long/path/to/dvc.yaml mymodel or something)
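
The lookup-by-name rule in the notes above can be sketched as follows. This assumes artifacts are available as a mapping from dvc.yaml path to that file's artifact entries; `find_artifact` and `AmbiguousArtifactError` are hypothetical names, not existing DVC APIs.

```python
from typing import Any, Dict, Optional, Tuple

# Assumed shape: {"path/to/dvc.yaml": {"artifact-name": {"path": ...}}}
Artifacts = Dict[str, Dict[str, Dict[str, Any]]]


class AmbiguousArtifactError(Exception):
    """The same artifact name is declared in more than one dvc.yaml."""


def find_artifact(
    name: str, artifacts: Artifacts, dvcyaml: Optional[str] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return (dvc.yaml path, artifact entry), failing on ambiguity."""
    if dvcyaml is not None:
        # An explicit dvc.yaml always wins and skips the search.
        return dvcyaml, artifacts[dvcyaml][name]
    matches = [path for path, entries in artifacts.items() if name in entries]
    if not matches:
        raise KeyError(f"artifact {name!r} not found")
    if len(matches) > 1:
        raise AmbiguousArtifactError(
            f"{name!r} is defined in {sorted(matches)}; specify the dvc.yaml"
        )
    return matches[0], artifacts[matches[0]][name]
```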

After this, we can get into supporting dvc import for artifacts.

@aguschin aguschin added this to DVC May 23, 2023
@aguschin aguschin moved this to Todo in DVC May 23, 2023
@dberenbaum
Collaborator

DVCFileSystem

We can skip this.

@efiop
Contributor

efiop commented Jun 1, 2023

Feels like we run a risk of overloading dvc get's target argument too much if we are going to teach it custom parsing (we have a similar problem in, for example, remove and repro and it is not fun and confusing for everyone). What we are probably looking for here is more like:

dvc get mymodel --artifact --type model ...

just so we can tell for sure that we should resolve mymodel as an artifact (with whatever semantics it will need) when we see the --artifact flag. This should avoid overcomplicating the base dvc get and make it easy to tweak the implementation. I'm sure this has been discussed before, but it makes me wonder if we need a dvc artifact subcommand (e.g. dvc artifact get) instead, so that we could keep the artifact-related logic in one place with clear semantics instead of overloading existing commands?

@aguschin
Contributor Author

aguschin commented Jun 1, 2023

I think we touched on that somewhere, but I couldn't find it. Overall, besides "getting" and "importing" an artifact, the user needs to list them (like dvc data ls --type model or what DVC had recently), get the annotation for a specific artifact (like "give me the labels for this specific artifact"), and add an annotation for an artifact (like what we have in the API now) or remove it.

A note: I think dvc get mymodel --artifact ... is enough (no --type model needed, since artifact names are unique within a single dvc.yaml).

@dberenbaum
Collaborator

I think dvc get mymodel --artifact ... is enough

👍 Do you expect --artifact to take simply an artifact name (--artifact mymodel) and handle the stage/tag with a separate flag, or do you expect it to be like --artifact mymodel#prod?

@efiop
Contributor

efiop commented Jun 1, 2023

If we will soon need artifact listing too, maybe this is yet another reason for dvc artifact get/ls/import? Can you foresee more artifact-specific things? This seems like a sign that we are already overloading ls/get/import. In practical terms an artifact (or other name) subcommand costs us nothing, will not duplicate code, is easy to maintain and to change later, and will save us the headache of shoving artifact info into import/get/ls docs. And we can also add stuff to dvc get/import/ls at any point in the future if we see the need. So would that be acceptable, WDYT?

@aguschin
Contributor Author

aguschin commented Jun 1, 2023

👍 Do you expect --artifact to take simply an artifact name (--artifact mymodel) and handle the stage/tag with a separate flag, or do you expect it to be like --artifact mymodel#prod?

Right, it should've been dvc get --artifact mymodel I think.

I think in DVC it would be better to have --stage dev and --version latest. I've used shortcuts like mymodel#prod and mymodel@latest, but that was because in the CLI --stage and --version were already used (like, find mymodel#stage and return its version). In any case it gets tricky there, since you have 2 actions with stage (find what's in stage, or show the stage of what you've found).

maybe this is yet another reason for dvc artifact get/ls/import

dvc artifact get/ls/import is fine with me. As I see it, the point here was to "simply" make working with artifacts similar to working with files, supporting the same workflows with familiar commands. Since we're still not sure what they'll look like and how well they fit into existing commands, making separate commands is a good idea I think @efiop

@dberenbaum
Collaborator

@aguschin Can we close or deprioritize this ticket based on the discussions in https://github.com/iterative/studio/issues/5177 and https://github.com/iterative/studio/issues/5215?

@aguschin
Contributor Author

aguschin commented Jun 8, 2023

Yes, let's do it. We should get back to this after we have API in Studio I think. We can reopen this issue then. Thanks everyone for the discussion!

@aguschin aguschin closed this as completed Jun 8, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in DVC Jun 8, 2023
@dberenbaum
Collaborator

dberenbaum commented Jul 20, 2023

As promised, reopening this issue since @amritghimire and @aguschin have made a lot of progress on a Studio API. I think we are ready to move forward with dvc artifacts and dvc.api.artifacts (as suggested above). Proposed design:

P1

  • dvc artifacts get repo name [--version] [--stage] [--out]
  • dvc.api.artifacts.open/read(repo: str, name: str, version: str = None, stage: str = None)
  • Directories
    • dvc artifacts get can download a file or directory artifact
    • dvc.api.artifacts.ls(repo: str, name: str, version: str = None, stage: str = None) should return a dict like {"reldir/model.pth": "https://...", "reldir/metadata.yml": "https://..."}
  • Use Studio REST API when possible (for simplicity, let's first implement without this)
    • Try to use token to get signed url from studio if supported remote type (warn user if token missing in this case)
    • If not able to use signed url (token missing, remote type unsupported, etc.), fallback to dvc get workflow
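
The signed-URL-first behaviour with a fallback, as described in the list above, could look roughly like the sketch below. Every name in it (`download_artifact`, `SignedUrlUnavailable`, and the three injected callables) is hypothetical; no real Studio or DVC API is called, so the network and storage parts are left to the caller.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)


class SignedUrlUnavailable(Exception):
    """Hypothetical error: Studio cannot sign URLs (e.g. unsupported remote)."""


def download_artifact(
    name: str,
    token: Optional[str],
    get_signed_urls: Callable[[str, str], Dict[str, str]],
    fetch_url: Callable[[str, str], None],
    dvc_get: Callable[[str], None],
) -> str:
    """Download via signed URLs when possible, else via the dvc get workflow.

    Returns "studio" or "dvc" to report which route was taken.
    """
    if token is None:
        logger.warning("No Studio token set; falling back to the dvc get workflow")
    else:
        try:
            urls = get_signed_urls(name, token)  # {relative path: signed URL}
        except SignedUrlUnavailable as exc:
            logger.warning("Signed URLs unavailable (%s); falling back", exc)
        else:
            for relpath, url in urls.items():
                fetch_url(url, relpath)
            return "studio"
    dvc_get(name)
    return "dvc"
```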

P2

  • artifacts can be packed into an archive when pushed/uploaded and extracted when pulled/downloaded to make transfers faster and handle directories easier
  • importing artifacts
  • listing all available artifacts

Questions/Clarifications

  • if no version or stage is provided, return the latest version
  • name of the model should be enough unless there's a conflict between dvc.yaml files, in which case we can throw an error and ask users to specify the correct dvc.yaml
  • should we shorten version and stage args to name@latest/name#prod syntax or at least allow it as an alternative?
  • should we have some way to get the info about the artifact, including its revision and the relative path of each file?
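
For the "return the latest version" default, one possible reading is "highest semantic version among the Git tags for that name". The sketch below assumes version tags look like name@vMAJOR.MINOR.PATCH, which is an assumption here, and `latest_version` is a hypothetical helper rather than GTO's own resolution logic (which may also consider creation time).

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed tag format: "<name>@v<major>.<minor>.<patch>"
_VERSION_TAG = re.compile(
    r"^(?P<name>.+)@v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def latest_version(name: str, tags: Iterable[str]) -> Optional[str]:
    """Return the tag with the highest version for `name`, or None."""
    best: Optional[Tuple[Tuple[int, int, int], str]] = None
    for tag in tags:
        match = _VERSION_TAG.match(tag)
        if match is None or match["name"] != name:
            continue  # stage tags and other artifacts are ignored
        key = (int(match["major"]), int(match["minor"]), int(match["patch"]))
        if best is None or key > best[0]:
            best = (key, tag)
    return best[1] if best else None
```

Comparing integer tuples rather than strings keeps 0.10.0 ahead of 0.9.0.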

@github-project-automation github-project-automation bot moved this from Done to Todo in DVC Jul 20, 2023
@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Jul 22, 2023
@pmrowla pmrowla self-assigned this Jul 26, 2023
@pmrowla
Contributor

pmrowla commented Jul 28, 2023

dvc.api.artifacts.ls(repo: str, name: str, version: str = None, stage: str = None) should return a dict like {"reldir/model.pth": "https://...", "reldir/metadata.yml": "https://..."}

Just to clarify, this is for listing the files contained in a single version of an artifact, and not for listing artifacts that are available in the repo/model registry?

@dberenbaum
Collaborator

dberenbaum commented Jul 31, 2023

dvc.api.artifacts.ls(repo: str, name: str, version: str = None, stage: str = None) should return a dict like {"reldir/model.pth": "https://...", "reldir/metadata.yml": "https://..."}

Just to clarify, this is for listing the files contained in a single version of an artifact, and not for listing artifacts that are available in the repo/model registry?

Yes, but maybe we should rename the method to clarify. WDYT about dvc.api.artifacts.listdir() or dvc.api.artifacts.get_url()?

@pmrowla
Contributor

pmrowla commented Jul 31, 2023

What's the purpose of the API call? Thinking about this some more, api.artifacts.open/read also feel kind of useless since they don't handle dirs (which is the same limitation as regular api.open/read).

It seems to me that we just need a single API call that returns the rev + relative path for the artifact the user wants and then they can use DVCFileSystem/api.open/read to do whatever else they need, whether that's listing files, downloading them, or streaming them from source.

artifact = dvc.api.show_artifact(repo_url, 'myartifact', stage='prod')
with DVCFileSystem(repo_url, rev=artifact['rev']) as fs:
    # download entire artifact (whether it's a file or dir)
    fs.get(artifact['path'])

    # do stuff with individual files in a directory
    for file in fs.ls(artifact['path']):
        # download a file
        fs.get(file)
        # stream a file
        fs.read(file)  # or fs.open().read()

The "use artifact name as a shortcut to avoid making the user use paths" premise is fine when an artifact is a single file, but as soon as you have a directory the user needs to use paths anyways. So I'm not convinced we actually need an api.artifacts.function() shortcut method for every other existing dvc.api or DVCFileSystem call.

@pmrowla pmrowla moved this from Todo to In Progress in DVC Jul 31, 2023
@dberenbaum
Collaborator

A couple points:

  1. Most of the time, I would expect a model to be a single file.
  2. This is partially about doing GTO operations to find the rev + relative path, but it's also about being able to use the Studio signed URLs when they are available.

Do you see a way to handle files with the Studio signed URLs without introducing dvc.api.artifacts.read/open? I agree that directories are clunky and am open to other ideas there.

@pmrowla
Contributor

pmrowla commented Aug 1, 2023

Is supporting streaming file objects (with open/read) from the studio URL in the native DVC API actually a requirement? Can we just expose the same API as studio with

dvc.api.artifacts.get_download_uris()

and then the user can do whatever they need with the studio URIs themselves (whether it's download or stream)?

e: the streaming/file object use case probably depends on clarification from the studio/gto side: https://github.com/iterative/studio/pull/6338#issuecomment-1659609627

@pmrowla
Contributor

pmrowla commented Aug 1, 2023

If we really need everything to work like native DVC usage but with studio urls instead of DVC remotes, we should just make the DVC data index aware of artifacts + studio. The remote: storage for artifact paths would use the signed url when it's available instead of the original DVC remote url. At that point, all of the regular DVC API or DVCFileSystem calls would just work as expected whenever they were used with a path that belongs to an artifact.

@pmrowla
Contributor

pmrowla commented Aug 1, 2023

One other thing we need to consider is that studio and the GTO tags use path/to/subdir:artifact-name for addressing artifacts in a nested dvc.yaml, but the DVC way of referencing the nested dvc.yaml would be path/to/subdir/dvc.yaml:artifact-name.

The current implementation in the PR assumes users are using the studio/GTO naming, but I'm not sure whether or not this is expected/intended.
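
To illustrate the difference between the two addressing styles, here is a small sketch that maps both onto the same (dvc.yaml path, artifact name) pair. `split_address` is a hypothetical helper, not the code in the PR.

```python
import posixpath
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Return (path to dvc.yaml, artifact name) for either addressing style."""
    prefix, sep, name = address.rpartition(":")
    if not sep:
        # Bare name: assume the dvc.yaml in the current directory.
        return "dvc.yaml", address
    if posixpath.basename(prefix) == "dvc.yaml":
        # DVC style: path/to/subdir/dvc.yaml:artifact-name
        return prefix, name
    # Studio/GTO style: path/to/subdir:artifact-name
    return posixpath.join(prefix, "dvc.yaml"), name
```

Note that this sketch cannot distinguish an address from a real file whose name contains a colon.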

@dberenbaum
Collaborator

dberenbaum commented Aug 1, 2023

Can we just expose the same API as studio

Another goal here is to have the user worry as little as possible about whether they are using the studio api or dvc to get the artifacts, so I would like to at least have the api work for both studio and non-studio artifacts. If dvc.api.artifacts.get_download_uris() can return either studio signed urls or direct paths to remote storage, then it might be enough for now. My only concern would be that it's extra work for users to open/stream/download those remote uris (need to use requests/boto/s3fs/etc).

we should just make the DVC data index aware of artifacts + studio. The remote: storage for artifact paths would use the signed url when it's available instead of the original DVC remote url. At that point, all of the regular DVC API or DVCFileSystem calls would just work as expected whenever they were used with a path that belongs to an artifact.

I'm not sure I follow. Would this all happen implicitly? If so, that sounds good to me, although we still need some artifacts api to look them up by name, version, and stage (edit: a benefit of this approach .

Edit: in that case, instead of a method to get the uris, we probably just need a python api to get the relative path and revision using gto, which can then be passed to any of the dvc methods (and dvc.api.get_url() could return the studio signed urls when available).

Edit: can it also work with the cli, so that commands like dvc get can try to use the signed url when available?

The current implementation in the PR assumes users are using the studio/GTO naming, but I'm not sure whether or not this is expected/intended.

Can we make it work with both? I know it's a bit ugly, but it would be nice to be robust to these kinds of errors.

@dberenbaum
Collaborator

@pmrowla Do you have questions on the CLI?

@pmrowla
Contributor

pmrowla commented Aug 2, 2023

I'm not sure I follow. Would this all happen implicitly? If so, that sounds good to me, although we still need some artifacts api to look them up by name, version, and stage (edit: a benefit of this approach .

Edit: in that case, instead of a method to get the uris, we probably just need a python api to get the relative path and revision using gto, which can then be passed to any of the dvc methods (and dvc.api.get_url() could return the studio signed urls when available).

Yes, in this case it would all be done internally in DVC without the user doing anything specific for artifacts, other than using something like the show_artifact call I suggested to get the actual path and rev for an artifact to use with the other API calls #9100 (comment)

Edit: can it also work with the cli, so that commands like dvc get can try to use the signed url when available?

Implementing it this way would make it work with any kind of read operation in DVC, whether it's the API or the CLI (so even something like dvc pull path/to/artifact would end up using the studio URL and not the DVC remote)

Can we make it work with both? I know it's a bit ugly, but it would be nice to be robust to these kinds of errors.

Yes, but it still adds the issue where the artifact tag could technically be an actual file (i.e. you can have a file named dir:model on posix). With dir/dvc.yaml:model it's less likely since it's unlikely anyone using DVC would also be using dvc.yaml in their own filenames. It would probably be best if studio used the dvc.yaml:... naming for consistency.

@pmrowla
Contributor

pmrowla commented Aug 2, 2023

On naming, we also never check to see if a model name conflicts with a stage name. This wasn't an issue when the plan was to have completely separate api/cli calls for artifacts, but if we are going to make handling artifacts part of the native DVC internals, we can't have any overlap with stage names and artifact names. Otherwise something like dvc pull dvc.yaml:train is ambiguous if there is both an artifact train and stage train in that pipeline file.

@dberenbaum
Collaborator

Great feedback @pmrowla, let's go with your suggestion.

Yes, but it still adds the issue where the artifact tag could technically be an actual file (i.e. you can have a file named dir:model on posix). With dir/dvc.yaml:model it's less likely since it's unlikely anyone using DVC would also be using dvc.yaml in their own filenames. It would probably be best if studio used the dvc.yaml:... naming for consistency.

More important than whether to include dvc.yaml is to allow to omit the path if it doesn't conflict with another artifact (see https://github.com/iterative/studio/issues/6939). Can we start with that?

On naming, we also never check to see if a model name conflicts with a stage name. This wasn't an issue when the plan was to have completely separate api/cli calls for artifacts, but if we are going to make handling artifacts part of the native DVC internals, we can't have any overlap with stage names and artifact names. Otherwise something like dvc pull dvc.yaml:train is ambiguous if there is both an artifact train and stage train in that pipeline file.

On all of these naming issues, can we try to make them as foolproof as possible and fail if there happens to be a conflict?

@shcheklein
Member

Great feedback @pmrowla, let's go with your suggestion.

Could you folks summarize it please? What will the command look like for a subrepo?

@pmrowla
Contributor

pmrowla commented Aug 3, 2023

Great feedback @pmrowla, let's go with your suggestion.

Could you folks summarize it please? What will the command look like for a subrepo?

On the CLI side, to download an artifact inside a subrepo the user would be able to do

dvc artifacts get https://github.com/my/repo.git path/to/subrepo:artifact_name [--version version] [--stage stage]
dvc get https://github.com/my/repo.git path/to/subrepo/path/to/artifact/file_or_dir
git clone https://github.com/my/repo.git
cd repo/path/to/subrepo
dvc pull artifact_name

In all of these cases, if the user has a studio token set, DVC would download the artifact using the studio generated HTTP URL instead of the DVC remote.

On the Python API side it will look something like

>>> artifact = dvc.api.show_artifact(repo="https://github.com/my/repo.git", name="artifact_name", version=version, stage=stage)
>>> artifact
{
    "path": "path/to/subrepo/path/to/artifact/file_or_dir",
    "rev": "abc123...",  # git SHA containing the requested artifact version/stage
}
>>> dvc.api.open(artifact["path"], repo="https://github.com/my/repo.git", rev=artifact["rev"])  # open a file artifact for streaming
>>> with dvc.api.DVCFileSystem(repo="https://github.com/my/repo.git", subrepos=True, rev=artifact["rev"]) as fs:
    fs.get(artifact["path"])  # download the artifact (whether it's a dir or file)
    for file in fs.ls(artifact["path"]):  # iterate over files in a dir artifact
        fs.open(file)  # open individual files in a dir artifact for streaming

And again, when a studio token is set, any DVC API operation used with an artifact path would stream/download from the studio URL instead of DVC remote. This also applies if the user already knows the path/rev combination they want and just uses the DVC API calls directly (without calling show_artifact first)

@pmrowla
Contributor

pmrowla commented Aug 3, 2023

More important than whether to include dvc.yaml is to allow to omit the path if it doesn't conflict with another artifact (see iterative/studio#6939). Can we start with that?

I don't think this is something we want in DVC. Supporting this in DVC would be a significant hit on performance since it requires searching the entire repo for all available dvc.yaml files on every DVC operation (instead of the current behavior which only looks in the current directory/repo root). For the git monorepo/dvc subrepo case, this also would end up forcing DVC to search the entire monorepo (and every subrepo within the monorepo) on every operation.
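
To give a sense of what "searching the entire repo" involves, the sketch below walks a working tree and yields every dvc.yaml it finds. `iter_dvcyaml` is a hypothetical helper operating on a local checkout; DVC's real traversal (which would also have to handle Git revisions and subrepos) is not shown.

```python
import os
from typing import Iterator

# Directories that never contain user dvc.yaml files (assumed for this sketch).
_SKIP_DIRS = {".git", ".dvc"}


def iter_dvcyaml(root: str) -> Iterator[str]:
    """Yield paths of all dvc.yaml files under `root`, relative to `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in _SKIP_DIRS]
        if "dvc.yaml" in filenames:
            yield os.path.relpath(os.path.join(dirpath, "dvc.yaml"), root)
```

The cost grows with the number of directories in the tree, which is why running such a scan on every operation would be expensive in a large monorepo.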

This also goes against the current convention for existing DVC commands that address anything in a dvc.yaml file. You cannot use dvc repro stage when stage actually refers to a named stage in path/to/some/other/dvc.yaml:stage, you have to address it explicitly if it is not in the current directory's dvc.yaml.

While this is possible to do in studio (since they only need to search the repo for artifacts once and then store the set of known artifacts in the studio DB), I don't think it should be done in the studio API either. We should be aiming for studio to be consistent with DVC; there should not be separate naming/addressing conventions for the studio API vs DVC CLI/API.

This also relates to the studio path/to/dir:artifact vs DVC path/to/dir/dvc.yaml:artifact addressing. We can make the DVC CLI/API recognize both, but IMO it would be better for us to just pick one and be consistent. (And the DVC style addressing which requires dvc.yaml is an established convention that DVC CLI users have already been using for a long time)

On all of these naming issues, can we try to make them as foolproof as possible and fail if there happens to be a conflict?

We can add the stage vs artifact naming check to the dvc.yaml validation, so DVC will fail when it encounters a dvc.yaml that contains an artifact with the same name as a stage. (But any users that have already been using model registry/artifacts may have existing dvc.yaml files that conflict with this rule, so we will no longer be able to parse the revs containing those dvc.yaml files)
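
A minimal sketch of such a check, assuming a dvc.yaml has already been loaded into a dict with top-level stages and artifacts mappings. `conflicting_names` and `validate_names` are hypothetical and not part of DVC's schema validation.

```python
from typing import Any, Dict, Set


def conflicting_names(dvcyaml: Dict[str, Any]) -> Set[str]:
    """Return names used for both a stage and an artifact."""
    stages = dvcyaml.get("stages") or {}
    artifacts = dvcyaml.get("artifacts") or {}
    return set(stages) & set(artifacts)


def validate_names(dvcyaml: Dict[str, Any]) -> None:
    """Raise ValueError if any stage name is also an artifact name."""
    clashes = conflicting_names(dvcyaml)
    if clashes:
        names = ", ".join(sorted(clashes))
        raise ValueError(f"names used for both a stage and an artifact: {names}")
```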

@pmrowla
Contributor

pmrowla commented Aug 3, 2023

I'm also still not sure whether this should actually apply to dvc fetch/pull. It seems like being able to fetch with only the studio credentials is useful, especially if you are dvc importing from a model registry. But this adds the requirement for us to support having separate URLs for fetch and push internally (since push would still have to be done with the DVC remote URL+credentials and not a studio URL).

@pmrowla
Contributor

pmrowla commented Aug 3, 2023

After discussions with @dberenbaum and @shcheklein we agreed we can limit the scope on this in DVC for now. If/when we have a better idea of user needs with regard to using DVC w/the Studio model registry we can expand on this in the future.

On the DVC end we will add:

CLI:

dvc artifacts get <name> [--version <version>] [--stage <stage>]

Which will use dvc-studio-client to get the signed URLs from studio API and download those files when a studio token is available.

API:

>>> dvc.api.show_artifact(name, repo=..., version=..., stage=...)

Which will return a path and rev the user can use in conjunction with the existing DVC API calls. No other changes to internal DVC behavior will be done at this time, so streaming/downloading/etc from the existing DVC API will still require DVC remote credentials even when using a path that matches an artifact.

Regarding the naming conventions, we will leave existing behavior alone for now, since there is no concern about stage/artifact overlap in DVC while we are using artifact-specific behavior in artifacts get/api.show_artifact.

pmrowla added a commit to iterative/gto that referenced this issue Aug 11, 2023
related: iterative/dvc#9100

Should fix #369

- Drops support for Python < 3.8
- Replaces gitpython usage with scmrepo
- Migrates tests to use pytest-test-utils

Public facing `gto.api` interface has not changed in this PR, but
internal GTO API has changed. Ideally these changes should probably be
released as a major version bump.

This PR will require changes in studio - gitpython instances can no
longer be passed into the GTO calls (cc @amritghimire)
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Sep 25, 2023

6 participants