how to update the data in a data registry within another project #8103

Open
shelper opened this issue Aug 7, 2022 · 13 comments

Comments

@shelper

shelper commented Aug 7, 2022

I went through the docs for the data registry at https://dvc.org/doc/use-cases/data-registry
but I am still not clear about how to update the registry.

If I understand correctly, for a project that imports data from a data registry (which is a Git repo), if I change the data in that project and run dvc add changed_data.dvc, what I change is the data for my project, not the data registry's repository. And when I run git commit, I commit to my project's Git repo, not the registry's Git repo.

How could I push the new data changes back to the original data registry?

@shelper shelper changed the title from "how to update the data in a data registry repository" to "how to update the data in a data registry" Aug 7, 2022
@shelper shelper changed the title from "how to update the data in a data registry" to "how to update the data in a data registry within another project" Aug 7, 2022
@shelper
Author

shelper commented Aug 9, 2022

I got an answer on Discord saying:

Unfortunately, there is not currently a very seamless way to push back the changes. You would need to copy the data back to the data registry repo and push it from there.

So i think this could be a feature request then ...

Maybe a workaround is to use the data-registry in a project in a way that is similar to a git submodule, and use something like git sparse-checkout to check out only the data for that specific project from the data-registry submodule?

@pared
Contributor

pared commented Aug 9, 2022

@shelper
The data registry is a separate project from your training project, and the training project should be treated as a "consumer" of the data rather than something that updates the registry repo. So changes should be applied in the data registry repo, and then the data should be updated in your training repo using dvc update. As I understand it, you would like to update your data registry from your training repo?
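
For readers following along, the consumer-side half of this flow could look roughly like the sketch below. It assumes the dataset was brought into the training repo earlier with dvc import, so an import .dvc file already exists there (dataset1.dvc is a hypothetical name). The run and refresh_import helpers and the DRY_RUN switch are illustrative and not part of DVC.

```shell
# Sketch: refresh an imported dataset in a consumer (training) repo.
# Assumption: "$1" is an import .dvc file created earlier by `dvc import`.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

refresh_import() {
  dvc_file="$1"
  # Fetch the latest version of the data from the registry repo.
  run dvc update "$dvc_file" &&
  # Record the new version in the training repo's Git history.
  run git add "$dvc_file" &&
  run git commit -m "Update $dvc_file from the data registry"
}
```

Calling refresh_import dataset1.dvc from the training repo would run the three commands in order; with DRY_RUN=1 it only prints them.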

@shelper
Copy link
Author

shelper commented Aug 9, 2022

@pared that's right, i wonder if it is feasible (and reasonable) to update data registry from training repo.

@pared
Contributor

pared commented Aug 9, 2022

I guess the right question to ask here is why not update the data registry? Is there a use case where one cannot update the data registry, and should be able to update it from another repo?

@shelper
Author

shelper commented Aug 10, 2022

I guess the right question to ask here is why not update the data registry? Is there a use case where one cannot update the data registry, and should be able to update it from another repo?

I agree with you, but just a thought: if the registry gets bigger, someone who only uses one dataset from the registry will have to clone it locally and update it through GitHub.

In case the person accidentally contaminates other datasets in the registry, is there a way to isolate the datasets in the registry from one another?

@shcheklein
Member

Just a few points to clarify:

I agree with you, but just a thought: if the registry gets bigger, someone who only uses one dataset from the registry will have to clone it locally and update it through GitHub.

A person can download only a single dataset to update it; there is no need to check out / pull all the datasets in the registry to update a single one.

In case the person accidentally contaminates other datasets in the registry, is there a way to isolate the datasets in the registry from one another?

Yes, you can always update only a single dataset and do git commit dataset1.dvc + git push, no matter whether other datasets were updated or whether you ran dvc push for them. This way nothing is contaminated. You are sharing exactly one new version of a specific dataset. Everything else stays safe.
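
A minimal sketch of that single-dataset update, run from inside a clone of the registry repo, might look as follows. It assumes dataset1 is a directory already tracked through dataset1.dvc and that a DVC remote is configured. The run and publish_dataset helpers and the DRY_RUN switch are hypothetical names used only for illustration.

```shell
# Sketch: publish a new version of exactly one dataset from a registry clone.
# Assumption: "$1" is a data path already tracked by "$1.dvc" in this repo.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

publish_dataset() {
  dataset="$1"
  # Re-hash the changed data and rewrite its .dvc file.
  run dvc add "$dataset" &&
  # Upload only this dataset's files to remote storage.
  run dvc push "$dataset.dvc" &&
  # Commit only this .dvc file, whatever else has changed in the worktree.
  run git commit -m "Update $dataset" -- "$dataset.dvc" &&
  run git push
}
```

Passing the .dvc path to git commit limits the commit to that one file, which is what keeps the other datasets out of the change.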

Btw, in your specific case, where do you store your data - cloud, NAS, SSH, something else?

@shelper
Author

shelper commented Aug 10, 2022

Hmm, please see my comments below, based on my understanding.

A person can download only a single dataset to update it; there is no need to check out / pull all the datasets in the registry to update a single one.

This step is done by running dvc import registry-url dataset-folder in the project folder.

Yes, you can always update only a single dataset and do git commit dataset1.dvc + git push, no matter whether other datasets were updated or whether you ran dvc push for them. This way nothing is contaminated. You are sharing exactly one new version of a specific dataset. Everything else stays safe.

Here above, the user needs to run the commands in the local registry repo, which means he/she has to clone the whole registry repo to local.

Btw, in your specific case, where do you store your data - cloud, NAS, SSH, something else?

I believe it does not matter, but for example, in the cloud. If what I understand is correct, as commented above, then here is what happens:

  • someone works in the project folder, runs dvc import from the remote registry repo, and runs dvc pull to get the data from the cloud
  • he/she changes the data in the project
  • if he/she now wants to push the data change to the registry, he/she first runs dvc push to copy any new data to the cloud storage
  • he/she then clones the registry repo locally, copies the updated .dvc file from the project folder to the local registry repo folder, and runs git commit and git push

If that is correct, my only concern is that it is a little tedious because the user needs to switch between the project and registry folders. I wonder if there could be a safer way to update the registry without leaving the project folder.

If that is the way it should work, then never mind.

@shcheklein
Member

Here above, the user needs to run the commands in the local registry repo, which means he/she has to clone the whole registry repo to local.

That was my point. They need to do git clone data-registry (that should be fine, right?), but there is no need to do the full dvc pull. A specific dataset can be pulled with dvc pull dataset1 and then committed with dvc commit dataset1. It can be granular.
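
As a rough sketch of the granular flow described here: the registry URL, the data-registry directory name, and the helper names below are placeholders, and the DRY_RUN switch is only there to preview the commands.

```shell
# Sketch: clone the registry's Git repo but download only one dataset.
# Assumptions: "$1" is the registry's Git URL, "$2" is a dataset path tracked
# by DVC in that repo; the clone goes into ./data-registry.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The body runs in a subshell so the caller's working directory is untouched.
fetch_one_dataset() (
  registry_url="$1"
  dataset="$2"
  # The Git clone is small: it holds .dvc files, not the data itself.
  run git clone "$registry_url" data-registry &&
  run cd data-registry &&
  # Download just this dataset instead of running a full `dvc pull`.
  run dvc pull "$dataset"
)

# After editing the data, record the change for that dataset only.
record_dataset_changes() (
  run cd data-registry &&
  run dvc commit "$1"
)
```

The updated .dvc file would then still need a dvc push, git commit, and git push to be shared with others.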

There is a way to avoid even the second copy on the machine where they would be updating the dataset. The data-registry repo can be set up with dvc cache dir to point to the cache of the project repo where dvc import was used. It's a bit more involved, but this way there will be no additional downloads, data copies, etc.
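
One possible reading of this suggestion is sketched below; it is an interpretation, not a confirmed recipe. It assumes the project repo keeps its cache in the default .dvc/cache location, that the imported data is already in that cache, and that both repos use compatible DVC versions. The paths and helper names are hypothetical.

```shell
# Sketch: let a local registry clone reuse the consumer project's DVC cache.
# Assumptions: "$1" is the registry clone, "$2" is the project's cache directory
# (by default <project>/.dvc/cache), "$3" is a dataset tracked as "$3.dvc".

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The body runs in a subshell so the caller's working directory is untouched.
share_project_cache() (
  registry_dir="$1"
  project_cache="$2"
  dataset="$3"
  run cd "$registry_dir" &&
  # --local writes to .dvc/config.local, which is not committed to Git.
  run dvc cache dir --local "$project_cache" &&
  # Materialize the dataset from the shared cache rather than downloading it.
  run dvc checkout "$dataset.dvc"
)
```

Using --local keeps the machine-specific cache path out of the registry's shared configuration.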

@shelper
Author

shelper commented Aug 11, 2022

That was my point. They need to do git clone data-registry (that should be fine, right?)

That is fine in most cases, but it concerns me a little when an inexperienced GitHub user clones the registry repo and unexpectedly makes changes to it, especially if he/she changes datasets other than the one he/she is using.

So I just wonder if there is a way to restrict the user to making changes only to the dataset he/she imports/gets from the data-registry, where by "making changes" I mean committing changes to the original data-registry. That is why I thought it would be useful if we could commit the data changes (the .dvc file) directly to the data-registry from within the project folder when the dataset is imported from the data-registry.

I may be asking for something without a strong reason behind it, so I will close this issue for now.

thanks for answering my concerns though :)

@shelper shelper closed this as completed Aug 11, 2022
@dberenbaum
Collaborator

Hi @shelper! Please don't take the discussion above as dismissal of your points. It's just trying to understand and gather info.

I think it makes sense that you want to update the data registry from the consumer repo, and I have heard of others asking for something similar. As you said, it forces inexperienced users to understand a lot about how to make changes in Git. It also feels broken to me that you have to manually copy either the data or at least the .dvc file/checksum from the consumer repo to the data registry repo.

I don't have a good solution for you now, but let's keep this open for discussion!

@dberenbaum dberenbaum reopened this Sep 8, 2022
@dberenbaum
Collaborator

There's also a related proposal in #8066 for how to upload data back to a repo.

@elenavolkova93

elenavolkova93 commented Jan 2, 2024

I second the wish for a more seamless way to upload the data. Why it's better than just copying, adding and pushing:

  • less work (no need to clone the data registry repo or copy files). We have many dev servers where people run things, so they'd need to clone the registry on every server they work on.
  • easier to do programmatically (especially if you have a pipeline that generates multiple files that need to be saved to the registry on every run, creating a new commit)

@dberenbaum
Collaborator

There is now support for a cross-repo registry of artifacts in DVC Studio that simplifies this workflow. Although it's branded as a Model Registry, it works with non-model artifacts as well, and we have discussed branding it more generically to clarify that it can work for non-models.

Of course, you may prefer an open-source solution, but Studio has several advantages over the current data registry approach that would be hard to replicate in DVC alone:

  • Cross-repo: Studio provides a place to combine artifacts from all your projects, so you don't have to maintain a separate data registry repo. You can directly push updates to the consumer repos and use Studio as the centralized registry. You can also programmatically update the version numbers and/or stages of artifacts on each run.
  • Metadata: DVC doesn’t have a way to search or use structured metadata fields like those supported by the registry or to search across repos, and it would be hard to do a good job with this in a CLI tool.
  • Access: Studio can connect to your cloud storage and provide consumers with a temporary URL without giving them access to the entire cloud storage or requiring them to install or configure anything.
