how to update the data in a data registry within another project #8103
Comments
I got an answer on Discord saying
So I think this could be a feature request then ... Maybe a workaround is to use the data registry in a project in a way that is similar to a Git submodule, and use something like |
@shelper |
@pared that's right. I wonder whether it is feasible (and reasonable) to update the data registry from the training repo. |
I guess the right question to ask here is why not update the data registry? Is there a use case where one cannot update the data registry, and should be able to update it from another repo? |
I agree with you, but just a thought: if the registry gets bigger, someone who only uses one dataset from the registry will have to clone it locally and update it through GitHub. In case that person accidentally contaminates other datasets in the registry, is there a way to isolate the datasets in the registry from each other? |
Just a few points to clarify:
A person can download only a single dataset to update it; there is no need to check out / pull all the datasets in the registry to update a single one.
Yes, you can always update only a single dataset and do Btw, in your specific case, where do you store your data - cloud, NAS, SSH, something else? |
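A minimal shell sketch of that single-dataset update, assuming the registry tracks each dataset with its own `.dvc` file and has a default DVC remote configured. The registry URL, dataset path, and working directory are hypothetical placeholders, and `run` is an illustrative helper (not part of DVC) that only prints the commands when `DRY_RUN=1`:

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Step 1: clone the registry (Git metadata only) and fetch just one dataset.
# $1 = registry Git URL, $2 = dataset path without trailing slash, $3 = clone dir
checkout_dataset() (
  set -e
  registry_url="$1"; dataset="$2"; workdir="$3"
  run git clone "$registry_url" "$workdir"
  run cd "$workdir"
  run dvc pull "$dataset.dvc"      # downloads this dataset only
)

# Step 2: after editing the files under the dataset path, record and publish it.
# $1 = dataset path, $2 = clone dir, $3 = commit message
publish_dataset() (
  set -e
  dataset="$1"; workdir="$2"; message="$3"
  run cd "$workdir"
  run dvc add "$dataset"           # rewrites $dataset.dvc with the new hash
  run git add "$dataset.dvc"
  run git commit -m "$message"
  run dvc push "$dataset.dvc"      # uploads this dataset only
  run git push
)
```

For example, `checkout_dataset https://example.com/registry.git images/cats registry-clone`, then edit the files under `registry-clone/images/cats`, then `publish_dataset images/cats registry-clone "Update cats dataset"`. The other datasets in the registry are never downloaded in this flow.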
Hmm, please see my comments below, based on my understanding.
The step above is done by
Here, the user needs to run the commands in the local registry repo, which means they have to clone the whole registry repo locally.
I believe it does not matter, but for example, in the cloud. If my understanding above is correct, then here is what happens
If that is correct, my only concern is that it is a little tedious because the user needs to switch between the project and registry folders. I wonder if there could be a safer way to update the registry without leaving the project folder. If that is the way it should work, then never mind. |
That was my point. They need to do There is a way to avoid even the second copy on a machine where they would be updating a dataset that needs to be updated. |
That is fine in most cases, but it concerns me a little when an inexperienced GitHub user clones the registry repo and makes unexpected changes to it, especially if they change datasets other than the one they are using. So I just wonder if there is a way to restrict the user to changing only the dataset they import/get from the data registry. and by I may be asking for something without a strong reason for it, so I will close this issue for now. Thanks for answering my concerns though :) |
Hi @shelper! Please don't take the discussion above as dismissal of your points. It's just trying to understand and gather info. I think it makes sense that you want to update the data registry from the consumer repo, and I have heard of others asking for something similar. As you said, it forces inexperienced users to understand a lot about how to make changes in Git. It also feels broken to me that you have to manually copy either the data or at least the I don't have a good solution for you now, but let's keep this open for discussion! |
There's also a related proposal in #8066 for how to upload data back to a repo. |
I second the wish for a more seamless way to upload the data. Why it's better than just copying, adding and pushing:
|
There is now support for a cross-repo registry of artifacts in DVC Studio that simplifies this workflow. Although it's branded as a Model Registry, it works with non-model artifacts as well, and we have discussed branding it more generically to clarify that it can work for non-models. Of course, you may prefer an open-source solution, but Studio has several advantages over the current data registry approach that would be hard to replicate in DVC alone:
|
I went through the doc for the data registry at
https://dvc.org/doc/use-cases/data-registry
but I am still not clear about how to
update the registry.
If I understand correctly, for a project that imports data from a data registry (which is a Git repo), when I change the data in that project and run
dvc add changed_data.dvc
then what I change is the data for my project, not the repository of the data registry. And when I run git commit
what I commit to is my project's Git repo, not the registry's Git repo. How could I push the new data change back to the original data registry?
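Going by the replies in this thread, the change itself is made in the registry repo, and the project then picks up the new version. A rough sketch of that project-side step, assuming the data was brought in with `dvc import` so that a `.dvc` import file exists in the project. The file path is a hypothetical placeholder, and `run` is an illustrative helper (not part of DVC) that only prints the commands when `DRY_RUN=1`:

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Refresh an imported dataset inside the consuming project.
# $1 = path to the .dvc file created by `dvc import`
refresh_import() (
  set -e
  import_file="$1"
  run dvc update "$import_file"    # re-resolves the import against the registry
  run git add "$import_file"
  run git commit -m "Update $import_file from the data registry"
)
```

Calling `refresh_import data/cats.dvc` from the project root would record the registry's latest version of that dataset in the project's own Git history; the registry itself is not modified by this step.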