-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download models (type: model
) with dvc get
#9100
Comments
Broadly related to 1st point in https://github.com/iterative/studio/issues/4782. |
As discussed in iterative/dvclive#472, I still think it would be useful to make the path the default name so we don't require people to think of a separate model name and additional abstraction. I doubt it changes much of the implementation, since I don't think we want to completely deprecate model names.
Not really sure at this point. The UI doesn't look bad from what you show above, but hard to say yet whether it will make sense to shove |
Also @shcheklein pointed me towards the fact that in some DVC commands paths/names are already interchangeable, like in |
@aguschin We probably also need to make this work for |
Some questions to discuss/decide:
|
IMHO:
|
Added a comment to https://github.com/iterative/studio/issues/5215#issuecomment-1488920109 to discuss how Studio could simplify, especially for 4 above. I think we still need |
To sum up the discussion in here and #9345. When user wants to get the model, there are two ways he may want to do that: in CLI (e.g. in CI/CD), or directly in Python runtime:
Few moments re 1st:
After this, we can get into supporting |
We can skip this. |
Feels like we run a risk of overloading
just so we can for sure tell that we should resolve |
I think we touched that somewhere, but couldn't find it. Overall, beside "getting" and "importing" artifact, user needs to list them (like A note: I think |
👍 Do you expect |
If we will soon need artifact listing too, maybe this is a yet another reason for |
Right, it should've been I think in DVC it would be better to have
|
@aguschin Can we close or deprioritize this ticket based on the discussions in https://github.com/iterative/studio/issues/5177 and https://github.com/iterative/studio/issues/5215? |
Yes, let's do it. We should get back to this after we have API in Studio I think. We can reopen this issue then. Thanks everyone for the discussion! |
As promised, reopening this issue since @amritghimire and @aguschin have made a lot of progress on a Studio API. I think we are ready to move forward with P1
P2
Questions/Clarifications
|
Just to clarify, this is for listing the files contained in a single version of an artifact, and not for listing artifacts that are available in the repo/model registry? |
Yes, but maybe we should rename the method to clarify. WDYT about |
What's the purpose of the API call? Thinking about this some more, It seems to me that we just need a single API call that returns the rev + relative path for the artifact the user wants and then they can use artifact = dvc.api.show_artifact(repo_url, 'myartifact', stage='prod')
with DVCFileSystem(repo_url, rev=artifact['rev']) as fs:
# download entire artifact (whether its a file or dir)
fs.get(artifact['path'])
# do stuff with individual files in a directory
for file in fs.ls(artifact['path']):
# download a file
fs.get(file)
# stream a file
fs.read(file) # or fs.open().read() The "use artifact name as a shortcut to avoid making the user use paths" premise is fine when an artifact is a single file, but as soon as you have a directory the user needs to use paths anyways. So I'm not convinced we actually need an |
A couple points:
Do you see a way to handle files with the Studio signed URLs without introducing |
Is supporting streaming file objects (with
and then the user can do whatever they need with the studio URIs themselves (whether it's download or stream)? e: the streaming/file object use case probably depends on clarification from the studio/gto side: https://github.com/iterative/studio/pull/6338#issuecomment-1659609627 |
If we really need everything to work like native DVC usage but with studio urls instead of DVC remotes, we should just make the DVC data index aware of artifacts + studio. The |
One other thing we need to consider is that studio and the GTO tags use The current implementation in the PR assumes users are using the studio/GTO naming, but I'm not sure whether or not this is expected/intended. |
Another goal here is to have the user worry as little as possible about whether they are using the studio api or dvc to get the artifacts, so I would like to at least have the api work for both studio and non-studio artifacts. If
I'm not sure I follow. Would this all happen implicitly? If so, that sounds good to me, although we still need some artifacts api to look them up by name, version, and stage (edit: a benefit of this approach . Edit: in that case, instead of a method to get the uris, we probably just need a python api to get the relative path and revision using gto, which can then be passed to any of the dvc methods (and Edit: can it also work with the cli, so that commands like
Can we make it work with both? I know it's a bit ugly, but it would be nice to be robust to these kinds of errors. |
@pmrowla Do you have questions on the CLI? |
Yes, in this case it would all be done internally in DVC without the user doing anything specific for artifacts, other than using something like the
Implementing it this way would make it work with any kind of read operation in DVC, whether it's the API or the CLI (so even something like
Yes, but it still adds the the issue where the artifact tag could technically be an actual file (i.e. you can have a file named |
On naming, we also never check to see if a model name conflicts with a stage name. This wasn't an issue when the plan was to have completely separate api/cli calls for artifacts, but if we are going to make handling artifacts part of the native DVC internals, we can't have any overlap with stage names and artifact names. Otherwise something like |
Great feedback @pmrowla, let's go with your suggestion.
More important than whether to include
On all of these naming issues, can we try to make them as foolproof as possible and fail if there happens to be a conflict? |
Could you folks summarize it please? How will the command look like for subrepo? |
On the CLI side, to download an artifact inside a subrepo the user would be able to do
In all of these cases, if the user has a studio token set, DVC would download the artifact using the studio generated HTTP URL instead of the DVC remote. On the Python API side it will look something like >>> artifact = dvc.api.show_artifact(repo="https://github.com/my/repo.git", name="artifact_name", version=version, stage=stage)
>>> artifact
{
"path": "path/to/subrepo/path/to/artifact/file_or_dir",
"rev": "abc123...", # git SHA containing the requested artifact version/stage
}
>>> dvc.api.open(artifact["path"], repo="https://github.com/my/repo.git", rev=artifact["rev"]) # open a file artifact for streaming
>>> with dvc.api.DVCFileSystem(repo="https://github.com/my/repo.git", subrepos=True, rev=artifact["rev"]) as fs:
fs.get(artifact["path"]) # download the artifact (whether it's a dir or file)
for file in fs.ls(artifact["path"]): # iterate over files in a dir artifact
fs.open(file) # open individual files in a dir artifact for streaming And again, when a studio token is set, any DVC API operation used with an artifact path would stream/download from the studio URL instead of DVC remote. This also applies if the user already knows the path/rev combination they want and just uses the DVC API calls directly (without calling |
I don't think this is something we want in DVC. Supporting this in DVC would be a significant hit on performance since it requires searching the entire repo for all available dvc.yaml files on every DVC operation (instead of the current behavior which only looks in the current directory/repo root). For the git monorepo/dvc subrepo case, this also would end up forcing DVC to search the entire monorepo (and every subrepo within the monorepo) on every operation. This also goes against the current convention for existing DVC commands that address anything in a dvc.yaml file. You cannot use While this is possible to do studio (since they only need to search the repo for artifacts once and then store the set of known artifacts in the studio DB), I don't think it should be done in the studio API either. We should be aiming for studio to be consistent with DVC, there should not be separate naming/addressing conventions for the studio API vs DVC CLI/API. This also relates to the studio
We can add the stage vs artifact naming check to the dvc.yaml validation, so DVC will fail when it encounters a dvc.yaml that contains an artifact with the same name as a stage. (But any users that have already been using model registry/artifacts may have existing dvc.yaml files that conflict with this rule, so we will no longer be able to parse the revs containing those dvc.yaml files) |
I'm also still not sure whether this should actually apply to |
After discussions with @dberenbaum and @shcheklein we agreed we can limit the scope on this in DVC for now. If/when we have a better idea of user needs with regard to using DVC w/the Studio model registry we can expand on this in the future. On the DVC end we will add: CLI:
Which will use dvc-studio-client to get the signed URLs from studio API and download those files when a studio token is available. API:
Which will return a path and rev the user can use in conjunction with the existing DVC API calls. No other changes to internal DVC behavior will be done at this time, so streaming/downloading/etc from the existing DVC API will still require DVC remote credentials even when using a path that matches an artifact. Regarding the naming conventions, we will leave existing behavior alone for now, since there is no concern about stage/artifact overlap in DVC while we are using artifacts specific behavior in |
related: iterative/dvc#9100 Should fix #369 - Drops support for Python < 3.8 - Replaces gitpython usage with scmrepo - Migrates tests to use pytest-test-utils Public facing `gto.api` interface has not changed in this PR, but internal GTO API has changed. Ideally these changes should probably be released as a major version bump. This PR will require changes in studio - gitpython instances can no longer be passed into the GTO calls (cc @amritghimire)
As it was partially discussed in iterative/dvclive#472 and iterative/gto#337, we're considering to merge
artifacts.yaml
part of GTO into DVC. We've listed down MR user scenarios we want to support after that, with one of the most important for CLI is "Download model from registry". This could happen both in CI and in CLI locally. In CI it's easier overall since we can use things like GTO action to help figure out things and even download models if needed, so let's think local CLI here.After discussing this with @dberenbaum, we figured out the best solution would be to give a single command to download models (
type: model
). Something like:note that
mymodel
in this example is not a path, but a modelname
(that's used in Git tags registering versions/assigning stages like[email protected]
ormymodel#prod#1
). Implementing such a command will save users figuring out thepath
if they knowname
.What's more, the frequent way to use this is to get latest version or the version in
prod
/stage
. In GTO I used these shortcuts to access those:I prefered this to
gto show mymodel --latest
orgto show mymodel --stage prod
due to brevity and similarity to actual Git tags format, which makes it easier to remember.Ultimately, providing DVC command that allows to download latest/what's in
prod
would be great. Maybe we can reuse GTO shortcuts:$ dvc get $REPO mymodel@latest --type model
For that, we'll need to call GTO API under the hood (since we probably don't want to merge GTO's Git tags managing part into DVC due to making DVC even more complex).
The last command would be handy to users, cause right now to get latest version of the model named
mymodel
locally, you need to run something like:see also this question about reusing
dvc ls
to listtype: model
- it's related to overall experience while working withtype: model
in DVCWDYT about this? Does it look natural to extend
dvc get
to models or it could be better to have a separate command for that?cc @dberenbaum @skshetry
The text was updated successfully, but these errors were encountered: