Pull imports from source #8274
Related: #8172
Perhaps the cleanest and simplest way to incorporate this kind of functionality is something like the following:
Then anyone wanting to replicate a dataset can grab all the necessary data with just
Alternatively,
These alternatives are not exclusive. But either way, suppose we have project A that uses

In #8172 this is described as an edge case. But I suggest that this could be a fairly central use case in scientific publishing, where invariant data is lodged in public repositories with stable URLs. Publishing raw data in these public repositories is often a condition of grant funding. So although we are likely to use DVC remotes while we initially compile and analyse our data, we will inevitably upload most if not all of the raw data to these public repositories. But if we continue to organise our datasets with DVC (just with data hosted in public repositories rather than DVC remotes), then future projects can build on already published datasets with a simple
Took a shot at
This is already supported using See #3511. Originally, it was requested to support
For fetch/pull, I would expect DVC to verify that the source URL has not changed and then download it (like we do with

Update actually modifies the import to use the latest file from that source location, which is good enough when you know the source is "stable", but does not solve this problem in general.
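The "verify, then download" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not DVC's implementation: `fetch_from_source`, `SourceChangedError`, and the `download` callable are hypothetical names, and the sketch checks the checksum after downloading, whereas a real implementation might compare an ETag or similar metadata before transferring anything.

```python
import hashlib


class SourceChangedError(Exception):
    """Raised when the data at the source no longer matches the recorded checksum."""


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def fetch_from_source(recorded_md5: str, download) -> bytes:
    """Fetch data and reject it if it differs from what was recorded at import time.

    `download` is a hypothetical zero-argument callable that returns the file
    contents as bytes (for example, a wrapper around an HTTP GET).
    """
    data = download()
    actual = md5_hex(data)
    if actual != recorded_md5:
        # The source has changed since the import: refuse rather than update.
        raise SourceChangedError(f"expected {recorded_md5}, got {actual}")
    return data
```

Unlike an update, this never rewrites the recorded checksum: a changed source is reported as an error instead of silently becoming the new version.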
Thanks @dtrifiro for your efforts and @skshetry for the workaround with

In particular, using the example projects A and B above, with

Despite my earlier insistence that there really is invariant data with stable URLs, the semantics of

Perhaps the simplest way to achieve this functionality is with an additional

Tangentially, it might similarly be helpful to mark certain data files as "unprotected". We use symlinks to an external cache but often need to be able to rewrite files after first running
Files that have been added via `dvc import-url` can be individually downloaded from their source location via `dvc update <target>`, but `dvc pull` looks for the files on the remote.

This request is for `dvc pull` to include an option to download these files from their source.

Whether this should be the default behaviour is open to discussion, but it would be helpful to at least have this option; let's call it `--from-source` for now.

Use case: In scientific research we often use previously published datasets as references (for comparison against new data, for example). These datasets are hosted on well-funded, stable file servers with stable URLs for each file. It would be helpful to be able to include such data in a DVC repo without having to include it in the DVC remote for that repo.
For less stable URLs (meaning URLs where the target data is subject to change), I can see the value in including the data in the remote, as this will allow version control and fetching previous versions.
One way to deal with these conflicting use cases is to include an option for marking imported URLs as "fixed" (non-variable). This property could be recorded in the `.dvc` file so that DVC knows a) not to push the data to the remote and b) to pull the data from source as part of a `dvc pull`. Vanilla `dvc import-url` operations continue to function as currently.
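To make the proposed "fixed" marker concrete, here is a minimal sketch of the decision it implies. Everything in it is an assumption for illustration: the `fixed` key does not exist in DVC's `.dvc` file format, `plan_for_import` is a hypothetical helper, and the dictionaries only loosely mimic a parsed `.dvc` file for an imported URL.

```python
def plan_for_import(entry: dict) -> dict:
    """Decide how an imported URL would be handled under the proposal.

    `entry` stands in for a parsed .dvc file. The `fixed` key is hypothetical
    (it comes from this proposal, not from DVC itself) and defaults to False,
    so imports without the marker behave as they do today.
    """
    fixed = bool(entry.get("fixed", False))
    return {
        # a) fixed imports are not pushed to the DVC remote
        "push_to_remote": not fixed,
        # b) fixed imports are pulled from their original URL
        "pull_from": "source" if fixed else "remote",
    }


# Hypothetical entries: one marked as fixed, one vanilla import.
published = {"deps": [{"path": "https://example.org/reference.csv"}], "fixed": True}
vanilla = {"deps": [{"path": "https://example.org/changing.csv"}]}

print(plan_for_import(published))
print(plan_for_import(vanilla))
```

Defaulting to the current behaviour when the marker is absent keeps vanilla `dvc import-url` operations unchanged, as the proposal requires.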