Pull imports from source #8274

Closed
johnyaku opened this issue Sep 9, 2022 · 7 comments · Fixed by #8249
Labels
A: cloud-versioning (Related to cloud-versioned remotes)
A: data-sync (Related to dvc get/fetch/import/pull/push)
feature request (Requesting a new feature)
p1-important (Important, aka current backlog of things to do)

Comments

@johnyaku

johnyaku commented Sep 9, 2022

Files that have been added via dvc import-url can be individually downloaded from their source location via dvc update <target>, but dvc pull looks for the files only on the remote.

This request is for dvc pull to include an option to download these files from their source.

Whether this should be the default behaviour is open to discussion, but it would be helpful to at least have this option; let's call it --from-source for now.

Use case: In scientific research we often use previously published datasets as references (for comparison against new data, for example). These datasets are hosted on well-funded, stable file servers with stable URLs for each file. It would be helpful to be able to include such data in a DVC repo without having to include it in the DVC remote for that repo.

For less stable URLs (meaning URLs where the target data is subject to change), I can see the value in including the data in the remote, as this allows version control and fetching of previous versions.

One way to deal with these conflicting use cases is to include an option for marking imported URLs as "fixed" (non-variable). This property could be recorded in the .dvc file so that DVC knows a) not to push the data to the remote and b) to pull the data from source as part of a dvc pull. Vanilla dvc import-url operations continue to function as currently.
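To make the proposal concrete, here is a minimal sketch of how such a marker might drive push and pull decisions. It is not DVC code: the `fixed` field and the helper functions are hypothetical, and the dictionaries only loosely mirror the `deps`/`outs` layout of a .dvc file.

```python
from typing import Any, Dict


def is_fixed_import(dvc_entry: Dict[str, Any]) -> bool:
    """True if the entry has a source dependency and carries the
    hypothetical 'fixed' marker proposed above."""
    has_source = bool(dvc_entry.get("deps"))
    return has_source and bool(dvc_entry.get("fixed", False))


def should_push(dvc_entry: Dict[str, Any]) -> bool:
    """a) Fixed imports are not pushed to the remote."""
    return not is_fixed_import(dvc_entry)


def pull_location(dvc_entry: Dict[str, Any]) -> str:
    """b) Fixed imports are pulled from source; everything else from the remote."""
    return "source" if is_fixed_import(dvc_entry) else "remote"


# Illustrative entries only; the file URL is an assumption.
reference = {
    "fixed": True,  # hypothetical field, not part of the current .dvc format
    "deps": [{"path": "https://source.com/data.csv"}],
    "outs": [{"path": "data.csv"}],
}
vanilla = {
    "deps": [{"path": "https://source.com/data.csv"}],
    "outs": [{"path": "data.csv"}],
}

print(should_push(reference), pull_location(reference))
print(should_push(vanilla), pull_location(vanilla))
```

An entry without the marker behaves exactly as it does today, which matches the requirement that vanilla dvc import-url keeps working unchanged.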

@dberenbaum
Collaborator

Related: #8172

@johnyaku
Author

Perhaps the cleanest and simplest way to incorporate this kind of functionality is something like the following:

dvc update --all (download every file that was added via dvc import-url from its source into the workspace)

Then anyone wanting to replicate a dataset can grab all the necessary data with just

git clone <repo-url>
cd <repo>
dvc pull
dvc update --all

Alternatively, dvc pull could get an additional option as follows:

dvc pull --from-source (fall back to downloading from source iff data not available on remote, or if no remote set)

These alternatives are not exclusive.
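The fallback rule for the proposed --from-source option can be sketched as follows. This is only a model of the control flow under stated assumptions: `pull`, `NotInRemote`, and the two fetcher callables are hypothetical names, and passing `None` for the remote fetcher stands in for "no remote set".

```python
from typing import Callable, Optional


class NotInRemote(Exception):
    """Raised when the data is absent from the remote (hypothetical)."""


def pull(
    fetch_remote: Optional[Callable[[], bytes]],
    fetch_source: Callable[[], bytes],
    from_source: bool = False,
) -> bytes:
    """Prefer the remote; use the source only as an opt-in fallback."""
    if fetch_remote is not None:
        try:
            return fetch_remote()
        except NotInRemote:
            if not from_source:
                raise  # current behaviour: the pull fails
    elif not from_source:
        raise NotInRemote("no remote configured")
    # Reached only when from_source is set and the remote could not serve the data.
    return fetch_source()
```

The remote is always tried first, so enabling the option would not change the result for data that is already on the remote.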

But either way, suppose we have project A that uses dvc import-url to import data.csv from https://source.com. Then suppose that project B uses dvc import to import data.csv from project A into project B. Then dvc pull --from-source or dvc update --all (or both) should download data.csv from https://source.com into the workspace for project B.
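The project A / project B scenario amounts to following a chain of imports back to the original URL. A rough sketch, assuming a made-up in-memory table of imports rather than real .dvc files (the `resolve_source_url` helper, the table layout, and the full file URL are all illustrative assumptions):

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (project, path)

# Hypothetical model: each imported path maps to where it came from.
# ("url", <url>) models dvc import-url; ("repo", (<project>, <path>)) models dvc import.
IMPORTS: Dict[Key, tuple] = {
    ("A", "data.csv"): ("url", "https://source.com/data.csv"),
    ("B", "data.csv"): ("repo", ("A", "data.csv")),
}


def resolve_source_url(
    imports: Dict[Key, tuple], project: str, path: str, max_depth: int = 10
) -> Optional[str]:
    """Follow dvc import links upstream until an import-url source is found."""
    key: Key = (project, path)
    for _ in range(max_depth):
        origin = imports.get(key)
        if origin is None:
            return None  # not an import, so only a remote can provide it
        kind, target = origin
        if kind == "url":
            return target
        key = target  # step one project upstream
    raise RuntimeError("import chain too deep or cyclic")


print(resolve_source_url(IMPORTS, "B", "data.csv"))
```

With this kind of resolution, project B would not need project A's remote to hold data.csv at all.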

In #8172 this is described as an edge case. But I suggest that this could be a fairly central use case in scientific publishing, where invariant data is lodged in public repositories with stable URLs. Publishing raw data in these public repositories is often a condition of grant funding. So although we are likely to use DVC remotes while we initially compile and analyse our data, we will inevitably upload most if not all of the raw data to these public repositories. But if we continue to organise our datasets with DVC (just with data hosted in public repositories rather than DVC remotes) then future projects can build on already published datasets with a simple dvc import, regardless of where the data is actually hosted.

@dberenbaum
Collaborator

Makes sense @johnyaku! For both import and import-url, an option to determine whether to push a copy of the data may be needed, like proposed in #4527.

Regardless, I see no reason DVC should not try to fall back to the original source if the data is not in the remote.

dtrifiro added the A: data-sync and feature request labels on Sep 14, 2022
@dtrifiro
Contributor

Took a shot at update --all, here's a draft: #8288

@skshetry
Member

This is already supported using --recursive.

See #3511. Originally, it was requested to support --all, but we went with --recursive/-R. There is, however, a bug that should be fixed: it should skip non-import stages.

@pmrowla
Contributor

pmrowla commented Sep 16, 2022

update --recursive (or update --all) feels more like a workaround than an actual solution to this problem. There's a difference between pulling an import from source and updating it from source.

For fetch/pull I would expect DVC to verify that the source URL has not changed and then download it (like we do with import-url --no-download).

Update actually modifies the import to use the latest file from that source location, which is good enough when you know the source is "stable" but does not solve this problem in the general case.
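The distinction between pulling and updating can be illustrated with a small model. This is a sketch under stated assumptions, not how DVC implements either command: `ImportRecord`, `SourceChanged`, and the `current_etag`/`download` callables are hypothetical, with an ETag-like string standing in for whatever version identifier is recorded at import time.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ImportRecord:
    url: str
    etag: str  # version identifier recorded when the file was imported


class SourceChanged(Exception):
    """The source no longer matches what the import recorded (hypothetical)."""


def pull_from_source(
    record: ImportRecord,
    current_etag: Callable[[str], str],
    download: Callable[[str], bytes],
) -> bytes:
    """Pull semantics: verify the source is unchanged, then download it."""
    if current_etag(record.url) != record.etag:
        raise SourceChanged(record.url)
    return download(record.url)


def update(
    record: ImportRecord,
    current_etag: Callable[[str], str],
    download: Callable[[str], bytes],
) -> bytes:
    """Update semantics: accept whatever is there now and re-record it."""
    record.etag = current_etag(record.url)
    return download(record.url)
```

Pulling leaves the record untouched and fails loudly when the source has drifted, whereas updating silently moves the record forward, which is why the latter is only safe for sources known to be stable.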

@johnyaku
Author

johnyaku commented Sep 20, 2022

Thanks @dtrifiro for your efforts and @skshetry for the workaround with update --recursive. This gets me over the immediate hurdle that I'm facing, but @pmrowla is correct that it does not solve the general case.

In particular, using the example projects A and B above, with data.csv originally import-urled into project A, at the moment running dvc import <url-for-project-A> data.csv in Project B fails because data.csv is not in the remote for Project A. The error message is similar to that when dvc pull fails in Project A, and so presumably the mechanism is similar.

Despite my earlier insistence that there really is invariant data with stable URLs, the semantics of import and pull have important differences, and despite being grateful for the workaround with update --recursive, I think it would be better to have the ability to pull --from-source (and also checkout --from-source). Perhaps this could become the default behaviour, but it would probably be better to provide the functionality first and road test it in the real world.

Perhaps the simplest way to achieve this functionality is with an additional --from-source option for pull/checkout/import. Alternatively (or additionally) perhaps there could be an extra field in .dvc files to mark files as originating from a source URL rather than a source DVC project (and associated remote). Such files could then be pulled from source without the need to specify the --from-source option.

Tangentially, it might similarly be helpful to mark certain data files as "unprotected". We use symlinks to an external cache but often need to be able to rewrite files after first running dvc unprotect <target>. Generally we know which files need to be unprotected, and do this prior to running the workflow. But it would be convenient to be able to mark these files as "unprotected" so that an unprotect operation automatically follows pull/checkout/update etc. This is tangential to the central feature request here, but mentioned in case there is an appetite for revising the fields in .dvc files.

5 participants