DVC+GTO integration #337
Comments
My perspective is that of someone who never used GTO without DVC, and also that of Unix philosophy:
Thus, I would say, let DVC do the data management and let GTO stay dedicated to managing git tags (for models). This is not to say that GTO cannot become merged into DVC as a subcommand - I imagine as One point in favour of this separation of concerns is that DVC has a lot of stability and contribution behind artifact management - artifact as in binaries, data, etc. From a development point of view, I would rather leverage the foundation of DVC rather than risk reinventing the wheel DVC has built - but I am not a code contributor to any of these projects so take that with a grain of salt. Second, would be visibility of the gto functionality: DVC has a bigger name. I mention this because I really want to see Git Tag Ops become a standard practice. For me, only 3. is a real use-case. As for 4., I don't know how DVC would handle potentially different remotes dissociated from git repositories, so I have no real take on that. Cheers ✌🏼 |
Hi @aguschin thanks for asking my opinion, I am happy to share my thoughts on this. I feel very much in line with @bgalvao concerning the Unix philosophy. However, the conclusions I come to are probably a bit different. What is What I think would be the right way to follow the Unix philosophy is to create an interface in gto to integrate other tools, like dvc. Allow users to write their own gto-plugins. And provide If we assume new gto commands like in #307, I can imagine an interface along these lines:

    from abc import ABC

    class GTOIntegration(ABC):
        def pre_put(self):
            pass

        def after_put(self):
            pass

        def pre_pick(self):
            pass

        def after_pick(self):
            pass

Btw, Then, one can call I should also mention that I don't like how many things does Hope my thoughts can be of some help :) |
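To make the plugin idea above a bit more concrete, here is a minimal sketch of how such hooks might be dispatched around a command. Everything in it is hypothetical: the `put` function, `LoggingIntegration`, and the `path` argument passed to the hooks are illustration names only, not GTO's actual API, and the `put` command itself is only proposed in #307.

```python
from abc import ABC


class GTOIntegration(ABC):
    """Hypothetical plugin base class; every hook defaults to a no-op."""

    def pre_put(self, path):
        pass

    def after_put(self, path):
        pass


class LoggingIntegration(GTOIntegration):
    """Example plugin that records which hooks were called, in order."""

    def __init__(self):
        self.calls = []

    def pre_put(self, path):
        self.calls.append(("pre_put", path))

    def after_put(self, path):
        self.calls.append(("after_put", path))


def put(path, integrations):
    """Stand-in for a hypothetical `gto put`: run plugin hooks around the core step."""
    for plugin in integrations:
        plugin.pre_put(path)
    # The core tagging logic would go here; it is omitted in this sketch.
    for plugin in integrations:
        plugin.after_put(path)


plugin = LoggingIntegration()
put("models/resnet.pt", [plugin])
print(plugin.calls)
```

With this shape, a DVC integration would be just one more subclass, and a git-lfs one could be added later without touching the core command.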
You mentioned merging the Like the others, I like the separation of concerns. I think using dvc as a component for I can think of use cases where one might use I think treating Since it is open-source maybe at some point another party would want to add an option to use git lfs as an alternative to dvc, this would be easier to do if it were kept separate. |
Thanks @bgalvao @francesco086 @shortcipher3! That helps me a lot.
Sure! Mainly, two reasons I'm aware of:
maybe @shcheklein can also add something to the table. As I see, there are 3 options now:
Last option would probably overcomplicate DVC too much... WDYT? Also, we have many issues collected over time blocked by uncertainty of GTO/DVC integration.#254. Sharing some thoughts about them below (feel free to add something if you have that in mind). Some of them should be easier to implement in options 2&3, probably. Some will still require an integration between tags and
|
Regarding merging I'm not seeing any killer features that are supported by 3 and not 2. I prefer 2 unless it blocks a great feature. |
Hi folks! We're drafting what this integration might look like. Happy to hear your thoughts at iterative/mlem.ai#323; there's a link to the docs page inside. |
Hi @shortcipher3 @francesco086 @bgalvao! Happy to finally share that we finished with the update - I hope that'll be a smoother experience now. I recorded a brief video explaining the update, will make something better in a few days now. We'll be working on implementing Sorry for deprecating Overall, thanks for participating here, your feedback definitely made it way easier for me to understand the direction of changes! |
If you'll need to migrate from |
@aguschin great work! Love seeing the value-proposition of GTO making it into DVC :) I have a question: if you declare models in the artifacts section of dvc.yaml, do you still declare them as outputs of a pipeline stage? For example:

    artifacts:
      cv-classification: # artifact ID (name)
        path: models/resnet.pt # same
        type: model
        desc: 'CV classification model, ResNet50'
        labels:
          - resnet50
          - classification
        meta:
          framework: pytorch
    stages:
      train_resnet50:
        cmd: python -m src.model.train.resnet50
        outs:
          - models/resnet.pt # same

|
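As a rough illustration of the question above, the following sketch checks whether each declared artifact path also appears among the stage outputs. It works on an already-parsed dvc.yaml (a plain dict, as a YAML loader would return). The helper name `artifact_output_status` is made up for this example and is not part of DVC.

```python
def artifact_output_status(dvc_yaml):
    """Map each artifact ID to whether its path is also listed as a stage output."""
    outs = set()
    for stage in dvc_yaml.get("stages", {}).values():
        for out in stage.get("outs", []):
            # An out is either a plain path or a one-key mapping {path: options}.
            if isinstance(out, dict):
                outs.update(out)
            else:
                outs.add(out)
    return {
        name: spec.get("path") in outs
        for name, spec in dvc_yaml.get("artifacts", {}).items()
    }


# Parsed form of the dvc.yaml from the question above (descriptive fields omitted).
config = {
    "artifacts": {
        "cv-classification": {"path": "models/resnet.pt", "type": "model"},
    },
    "stages": {
        "train_resnet50": {
            "cmd": "python -m src.model.train.resnet50",
            "outs": ["models/resnet.pt"],
        },
    },
}

print(artifact_output_status(config))
```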
Thanks @bgalvao! Yes - adding |
it does :D |
Awesome work as always @aguschin ! Thank you 🙏 |
@aguschin which github issue can we follow to know about the progress on this? |
@francesco086, it's iterative/dvc#9100 (comment). Linking the comment that has a summary of the whole issue, so you can skip reading the rest of it. |
Just had a discussion with @dberenbaum yesterday about this. Overall, feels like there are too many issues blocked by missing GTO/DVC integration. At the same time, I feel like GTO is not useful without DVC and is more like a plugin to DVC.
So, two options are possible: build an integration in GTO, or just merge GTO into DVC.
There are not many decent ways to actually store binaries in repo, which makes it hard to imagine using GTO without DVC in the real-world scenarios. I can recall these options:
Supporting option 4 in GTO to have upload/download functionality as in #307 will require integrations with each place (for s3 it would be fsspec probably, for artifactory some python client of theirs, etc). It makes me think these integrations could also be part of DVC (dvc import-url, etc).
Now if we can't imagine a good use-case for GTO without DVC, or all those use-cases would require some machinery that could also be part of DVC, why not just merge GTO into DVC?
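The per-backend integrations mentioned above could be organised as a small registry keyed by URL scheme. This is only a sketch under stated assumptions: `DOWNLOADERS`, `register`, and `download` are hypothetical names, not GTO or DVC APIs, and only a local `file` handler is implemented. An `s3` handler would probably wrap fsspec, as suggested above, but is left out here.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> function that fetches a URL to a local path.
DOWNLOADERS = {}


def register(scheme):
    """Decorator that registers a download function for one URL scheme."""
    def wrap(func):
        DOWNLOADERS[scheme] = func
        return func
    return wrap


@register("file")
def download_local(url, dest):
    """Copy the file behind a file:// URL (or a plain POSIX path) to dest."""
    src = Path(urlparse(url).path)
    shutil.copyfile(src, dest)
    return dest


def download(url, dest):
    """Dispatch to the handler registered for the URL's scheme."""
    scheme = urlparse(url).scheme or "file"
    try:
        handler = DOWNLOADERS[scheme]
    except KeyError:
        raise NotImplementedError(
            f"no integration registered for scheme {scheme!r}"
        ) from None
    return handler(url, dest)
```

Each new storage location then needs one registered function, which is the same kind of machinery the paragraph above suggests DVC could provide instead.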
Asking your opinion @francesco086 @shortcipher3 @bgalvao since you're the most active GTO users I know about :)
(this is not something we need to decide right away, since we can just build this integration inside of GTO and then merge, but I think merging could make it more straightforward. I'm just collecting opinions for now to get a more detailed picture).