
Use datalad-osf for already existing OSF repositories #97

Open
sappelhoff opened this issue Jun 20, 2020 · 5 comments

Comments

@sappelhoff
Contributor

Question

How can I turn an existing OSF repository into a datalad dataset and mirror it to GitHub?

Previous solution

I have an OSF repository with data: https://osf.io/cj2dr/

Several months after uploading this dataset on OSF, I wanted to easily download single files from it, and Datalad seemed a good way to do this.

So I did the following (it felt very hacky then, and still does):

  1. Locally created a new datalad dataset: datalad create eeg_matchingpennies
  2. Using a patched version of templateflow/datalad-osf, I recursively collected every file of my OSF repository into a CSV listing the filenames and their download URLs
  3. Then, using datalad addurls, I populated my new datalad dataset with the existing OSF files
  4. Finally, I pushed this dataset to GitHub using these steps:
    1. from my new datalad dataset: datalad install -s eeg_matchingpennies clone
    2. cd clone
    3. git annex dead origin
    4. git remote rm origin
    5. git remote add origin <put URL of new, empty GitHub repo here>
    6. datalad publish --to origin
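Collected into one shell session, the steps above look roughly like this (the GitHub URL is a placeholder for your own empty repository):

```shell
# 1. Create a fresh datalad dataset
datalad create eeg_matchingpennies

# 2./3. Populate it from a CSV of OSF filenames and download URLs
#       (via datalad addurls, as described above)

# 4. Push to GitHub via a throwaway clone
datalad install -s eeg_matchingpennies clone
cd clone
git annex dead origin
git remote rm origin
git remote add origin git@github.com:USER/eeg_matchingpennies.git
datalad publish --to origin
```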

This worked nicely, and the result is here: https://github.com/sappelhoff/eeg_matchingpennies

Problems

  • Recently I had to update the source dataset, so I edited the files on OSF ... which then of course screwed up my datalad version (see my post on NeuroStars)
  • I don't know how I could have done this in a "proper" datalad style way ... because when I get my OSF data via datalad (which goes the GitHub route), and then change files locally and datalad save and datalad publish, there is no connection to the actual source of the data on OSF ...

I imagine that this problem can now be solved using this new datalad extension. Is that correct? If yes, perhaps we can make a user case out of this for the documentation.

I could imagine that there are several people with already existing OSF datasets that would want to datalad-ify them.

In a few steps, what I imagine:

  1. I have an OSF data repo
  2. Do some steps to turn into datalad dataset
  3. mirror this on some GitHub like site
  4. When I need to change something, use datalad install from the GitHub like site
  5. Edit locally
  6. Then publish, which should automatically update (i) the GitHub like site, and (ii) the OSF source data
@mih
Member

mih commented Jun 21, 2020

Take a look at #100
This makes it possible to push the Git repo into the OSF project itself, and have it be self-contained. No need for GitHub anymore. But of course it can be kept as just another remote.

@bpoldrack
Member

bpoldrack commented Jun 21, 2020

@sappelhoff : Re importing existing OSF data into a dataset:

The solution you found and described on NeuroStars looks good, and I agree that it would have been the way to avoid the trouble in the first place. It could be made more convenient by integrating the necessary steps into this extension: a datalad import-osf command would be nice to have, skipping the CSV entirely and using OSF's API to discover the current versioned links on import. We could then call git annex addurl with the versioned download link directly.
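For reference, a minimal sketch of the CSV-based approach with datalad addurls; the file path and OSF link here are placeholders, and real rows would carry versioned download links discovered via OSF's API:

```shell
# Minimal CSV with one placeholder entry ("filename,url" header)
cat > files.csv <<'EOF'
filename,url
data/example.txt,https://osf.io/download/XXXXX/
EOF

# Add each URL to the dataset, placing content at the given filename
datalad addurls -d . files.csv '{url}' '{filename}'
```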

However, updating in the other direction (assuming #100 and an annex on OSF are not an option) should use the "export"-type osf-special-remote.
Two issues remain to be solved: allowing a path within a project (eeg_matchingpennies in your case) to be specified as the export target, and using OSF's versioning to (over)write files at the remote end.

Edit:
As of now (with those two issues not yet addressed and tested), you would need to configure the osf-special-remote to connect to an OSF project, and it would write to the root of it (in your case leading to two parallel trees). Alternatively, you could add a subcomponent to your OSF project and configure the special remote to connect to that one (technically there is no difference between a "project" and a "component"). For now, that would be the cleaner separation compared to two parallel trees in the same component.
Or, after getting all the data off your project, you could delete all of its content, use it as the target for the osf-special-remote, and export to it, turning it into what it would look like if you had built it from scratch with datalad-osf (--mode export).

Finally, re publish:
With an export-type remote, publish won't work. This is because datalad publish/push doesn't account for export remotes at all: it would require a git annex export --to call, as opposed to git push / git annex copy --to.
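In other words, the publish step differs by remote type; the sibling names osf-annex and osf-export below are placeholders:

```shell
# annex-type remote: regular push/copy works
datalad push . --to osf-annex      # or: git annex copy --to osf-annex

# export-type remote: push is not supported; export the worktree instead
git annex export HEAD --to osf-export
```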

@bpoldrack
Member

bpoldrack commented Jun 21, 2020

Also note that you can have several special remotes.
You can have one component in your project serving as an export remote, showing the worktree, and another one configured with --mode annex, providing a "real" annex store.
So, if you wanted to leave your OSF project as is but continue with the data being represented in a GitHub repo, you could go the way you described to get the initial data, create a subcomponent on OSF serving as the target for the annex-type special remote, and then just publish to that one.
In that case you would want to configure your GitHub remote (.git/config) with remote.NAME.datalad-publish-depends set to OSF-SPECIAL-REMOTE-NAME to publish to both in one go.
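Assuming sibling names github and osf-annex (both placeholders), that configuration could look like:

```shell
# Directly in .git/config:
git config remote.github.datalad-publish-depends osf-annex

# or via datalad's siblings command:
datalad siblings configure -s github --publish-depends osf-annex

# After this, publishing to GitHub also pushes the annexed data to OSF
datalad publish --to github
```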

And you can do the same with another subcomponent (or the actual project as described in the above post) to serve as target for an export remote, putting the worktree on OSF for human consumption.

@sappelhoff
Contributor Author

Thanks for your answers! I am a bit overwhelmed, though, as there seem to be many different ways to achieve what I want

--> just not as simple as datalad import-osf and datalad export-osf 🤔

I'll need to play around with all the options to understand them. I hope to find some time for that soon and then get back.

@bpoldrack
Member

@sappelhoff
Just use datalad create-sibling-osf on an existing dataset to create a new project.

  • once using --mode annex (the default), followed by datalad push . --to OSF-REMOTE-NAME, and
  • once again with --mode export, followed by git annex export HEAD --to OSF-REMOTE-NAME
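Put together, with placeholder project titles and sibling names, the two experiments might look like:

```shell
# annex-type sibling: a "real" annex store on OSF
datalad create-sibling-osf --title my-dataset -s osf-annex --mode annex
datalad push . --to osf-annex

# export-type sibling: a human-readable worktree on OSF
datalad create-sibling-osf --title my-dataset-view -s osf-export --mode export
git annex export HEAD --to osf-export
```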

If you then look at what's being created at OSF, that should make things a bit clearer.
