
Usage of pipeline ... and error when trying recursive update #1

Closed
sappelhoff opened this issue Jul 30, 2019 · 16 comments · Fixed by #2

@sappelhoff
Contributor

Hi, I am trying to turn an OSF directory into a git annex repository and datalad-osf seems to be great for this.

I am not entirely sure whether this works the way I think it does. Basically, I expect the following (rough sketch after the list):

  1. Make a new directory and call datalad create
  2. Call update_recursive with my OSF key and the directory I want to have git-annexed
  3. Upload the new directory, e.g., to GitHub and get the repository URL
  4. Be able to pass this URL to datalad install and get the data (which is stored on OSF, but indexed in my git annex repository)
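
Roughly, what I have in mind (a sketch with a placeholder GitHub URL, untested):

mkdir mystuff && cd mystuff
datalad create

python -c "import datalad_osf; datalad_osf.update_recursive('cj2dr', 'eeg_matchingpennies')"

# push the (small) repository to GitHub, e.g. https://github.com/<user>/mystuff

# then anyone can do:
datalad install https://github.com/<user>/mystuff
cd mystuff
datalad get .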

Can someone tell me whether I am just completely misunderstanding / misusing the pipeline? @yarikoptic @rciric

Is there a simpler way to achieve what I want?


Apart from this, this is also a bug report. Here are the steps to reproduce:

  1. mkdir mystuff
  2. cd mystuff
  3. datalad create

In mystuff, put the following Python file, try.py:

key='cj2dr'
subset='eeg_matchingpennies'

import datalad_osf

datalad_osf.update_recursive(key, subset)
  4. From mystuff, run python try.py

... after a considerable amount of time, this provides me with the following error message:

Traceback (most recent call last):
  File "try.py", line 6, in <module>
    datalad_osf.update_recursive(key, subset)
  File "/home/stefanappelhoff/Desktop/datalad-osf/datalad_osf/utils.py", line 186, in update_recursive
    addurls_from_csv(csv)
  File "/home/stefanappelhoff/Desktop/datalad-osf/datalad_osf/utils.py", line 65, in addurls_from_csv
    ifexists='overwrite')
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 492, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 480, in return_func
    results = list(results)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 429, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 522, in _process_results
    for res in results:
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/plugin/addurls.py", line 719, in __call__
    missing_value)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/plugin/addurls.py", line 407, in extract
    metacols = (c for c in sorted(rows[0].keys()) if c != urlcol)
IndexError: list index out of range
@yarikoptic

I haven't used datalad-osf myself yet. From the traceback it sounds like there is an empty row somewhere?
If you run with datalad --dbg you could get into pdb and troubleshoot the details, or just check out that csv?
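
e.g., something like this (just a sketch) should drop you into pdb right at the failure:

import pdb
import traceback

import datalad_osf

try:
    datalad_osf.update_recursive('cj2dr', 'eeg_matchingpennies')
except Exception:
    traceback.print_exc()
    pdb.post_mortem()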

@sappelhoff
Contributor Author

or just check out that csv?

there is no csv in my dataset 🤔

BUT having talked to @jasmainak a bit it seems like my premise is wrong.

I thought I could create a git annex repo that would look JUST LIKE my real dataset, but instead of the real data, it would contain symbolic links pointing to the OSF data.

And then I would be able to host that git annex repo (very low size) on GitHub, allow people to pull it with datalad, and use datalad.api.get() to download the data from OSF.

According to Mainak I would need my own git server to do something like that.

Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).

@yarikoptic

According to Mainak I would need my own git server to do something like that.

I don't think so. git-annex will just contain URLs pointing to OSF.

Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).

Yep, and then you can publish that repository to GitHub, along with the git-annex branch (datalad publish does that), so anyone who clones it should be able to get the actual files from OSF using git annex get or datalad get. So the premise is right as far as I can see ;)
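
Roughly (a sketch; the GitHub URL is a placeholder):

# inside the dataset
git remote add github https://github.com/<user>/<repo>
datalad publish --to github     # pushes master plus the git-annex branch

# anyone else:
datalad install https://github.com/<user>/<repo>
cd <repo>
datalad get .                   # or: git annex get .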

@yarikoptic

I didn't look into anything else, but just FYI: the fetched csv has only the header.

$> cat eeg_matchingpennies 
name,url,location,sha256,path

As for datalad crashing instead of just exiting silently or issuing a warning that no records were received, I filed datalad/datalad#3577.

kyleam added a commit to kyleam/datalad that referenced this issue Aug 1, 2019
extract() fails with an index error if the stream is valid JSON or CSV
but lacks any rows (i.e. for JSON an empty list or for CSV a
header-only file).  Update extract() to issue a warning and return an
empty list of rows.

Re: templateflow/datalad-osf#1
Closes datalad#3577.
@sappelhoff
Contributor Author

didn't look into anything else but just FYI that the fetched csv has only the header.

Mh, yes - this is a bug; the test example from the main README fails as well. Perhaps we should wait for @rciric to work this out.

In the meantime, do you have a pointer to docs / tutorials on how to do what I want (see my "premise" above) using just datalad?

@kyleam

kyleam commented Aug 2, 2019

In the meantime, do you have a pointer to docs / tutorials how to do what I want to (see above, my "premise") using just datalad?

I don't believe we have a high-level tutorial on addurls yet. But here's a quick example using a couple of the URLs from the OSF directory that you pointed to. This skips past the more involved task, which IIUC datalad-osf handles, of getting a set of stable URLs and putting them into either a .json or .csv file that addurls() understands.

example
#!/bin/sh

set -eu

datalad create someds
cd someds

cat >files.csv <<EOF
sub,url
sub-06,https://osf.io/9q8r2/download
sub-05,https://osf.io/5br27/download
EOF

datalad save -m"add list of URLs" files.csv

# Tell addurls to download the content from the "url" field to a
# file named by the "sub" field.
datalad addurls files.csv "{url}" "{sub}"

After executing the above, you end up with a dataset that has two files downloaded and annexed.

someds
|-- files.csv -> .git/annex/objects/[...]7f913d093561b0b385d076a32d1ea9f1.csv
|-- sub-05 -> .git/annex/objects/[...]f68f6c37ac758d82cd8c7d95dee70bbf
`-- sub-06 -> .git/annex/objects/[...]ecda9020e4f012517f531e5be571e8db

The public URLs for these files have been registered with git-annex:

> someds $ git annex whereis
whereis sub-05 (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	6411a76e-97a7-4c98-80a7-a9832599ddff -- kyle@hylob:~/scratch/dl/addurls-examples/someds [here]

  web: https://osf.io/5br27/download
ok
whereis sub-06 (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	6411a76e-97a7-4c98-80a7-a9832599ddff -- kyle@hylob:~/scratch/dl/addurls-examples/someds [here]

  web: https://osf.io/9q8r2/download
ok

This means that you can publish the repository without the data and people who have cloned it will be able to get the files with {git annex,datalad} get. (This requires publishing the git-annex branch.)

You can verify locally that this works by cloning the repo and then dropping the origin remote, so the only place annex can get the content from is the web.

$ datalad install -s someds clone
$ cd clone 
$ git annex dead origin
$ git remote rm origin 
$ git annex whereis
whereis sub-05 (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/5br27/download
ok
whereis sub-06 (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/9q8r2/download
ok
$ git annex get sub-05
get sub-05 (from web...) 
(checksum...) ok
(recording state in git...)
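
For completeness, the same two records in the JSON form that addurls also accepts would look roughly like this:

cat >files.json <<EOF
[
  {"sub": "sub-06", "url": "https://osf.io/9q8r2/download"},
  {"sub": "sub-05", "url": "https://osf.io/5br27/download"}
]
EOF

datalad addurls files.json "{url}" "{sub}"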

@sappelhoff
Contributor Author

Wow, this is really great @kyleam thanks!

@sappelhoff
Contributor Author

sappelhoff commented Aug 3, 2019

This seems to have worked to a large extent! I have made a CSV file with my file paths and URLs, "mp.csv", and made a datalad dataset:

CSV content for convenience
fpath,url
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_channels.tsv,https://osf.io/wdb42/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.eeg,https://osf.io/3at5h/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.vhdr,https://osf.io/3m8et/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.vmrk,https://osf.io/7gq4s/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_events.tsv,https://osf.io/9q8r2/download
eeg_matchingpennies/sourcedata/sub-05/eeg/sub-05_task-matchingpennies_eeg.xdf,https://osf.io/agj2q/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_channels.tsv,https://osf.io/256sk/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.eeg,https://osf.io/p52dn/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.vhdr,https://osf.io/jk649/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.vmrk,https://osf.io/wdjk9/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_events.tsv,https://osf.io/5br27/download
eeg_matchingpennies/sourcedata/sub-06/eeg/sub-06_task-matchingpennies_eeg.xdf,https://osf.io/rj3nf/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_channels.tsv,https://osf.io/qvze6/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.eeg,https://osf.io/z792x/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.vhdr,https://osf.io/2an4r/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.vmrk,https://osf.io/u7v2g/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_events.tsv,https://osf.io/uyhtd/download
eeg_matchingpennies/sourcedata/sub-07/eeg/sub-07_task-matchingpennies_eeg.xdf,https://osf.io/aqesz/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_channels.tsv,https://osf.io/4safg/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.eeg,https://osf.io/dg9b4/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.vhdr,https://osf.io/w6kn2/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.vmrk,https://osf.io/mrkag/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_events.tsv,https://osf.io/u76fs/download
eeg_matchingpennies/sourcedata/sub-08/eeg/sub-08_task-matchingpennies_eeg.xdf,https://osf.io/6t5vg/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_channels.tsv,https://osf.io/nqjfm/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.eeg,https://osf.io/6m5ez/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.vhdr,https://osf.io/btv7d/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.vmrk,https://osf.io/daz4f/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_events.tsv,https://osf.io/ue7ah/download
eeg_matchingpennies/sourcedata/sub-09/eeg/sub-09_task-matchingpennies_eeg.xdf,https://osf.io/59zde/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_channels.tsv,https://osf.io/5cfmh/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.eeg,https://osf.io/ya8kr/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.vhdr,https://osf.io/he3c2/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.vmrk,https://osf.io/bw6fp/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_events.tsv,https://osf.io/r5ydt/download
eeg_matchingpennies/sourcedata/sub-10/eeg/sub-10_task-matchingpennies_eeg.xdf,https://osf.io/gfsnv/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv,https://osf.io/6p8vr/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.eeg,https://osf.io/ywnpg/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.vhdr,https://osf.io/p7xk2/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.vmrk,https://osf.io/8u5fm/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_events.tsv,https://osf.io/rjzhy/download
eeg_matchingpennies/sourcedata/sub-11/eeg/sub-11_task-matchingpennies_eeg.xdf,https://osf.io/4m3g5/download
eeg_matchingpennies/.bidsignore,https://osf.io/6thgf/download
eeg_matchingpennies/CHANGES,https://osf.io/ckmbf/download
eeg_matchingpennies/dataset_description.json,https://osf.io/tsy4c/download
eeg_matchingpennies/LICENSE,https://osf.io/mkhd4/download
eeg_matchingpennies/participants.tsv,https://osf.io/6mceu/download
eeg_matchingpennies/participants.json,https://osf.io/ku2dn/download
eeg_matchingpennies/README,https://osf.io/k8hjf/download
eeg_matchingpennies/task-matchingpennies_eeg.json,https://osf.io/qf5d8/download
eeg_matchingpennies/task-matchingpennies_events.json,https://osf.io/3qztv/download
eeg_matchingpennies/stimuli/left_hand.png,https://osf.io/g45de/download
eeg_matchingpennies/stimuli/right_hand.png,https://osf.io/2r9zd/download

datalad create eeg_matchingpennies
datalad addurls mp.csv "{url}" "{fpath}" -d eeg_matchingpennies/

(Note: I did not commit the CSV to the repo, because I thought it was not necessary.)

There seems to be a bug, however, with some of the files:

cd eeg_matchingpennies
git annex whereis

For some files, this prints several links, all but one of which are wrong, e.g.:

whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/4safg/download
  web: https://osf.io/5cfmh/download
  web: https://osf.io/6p8vr/download
  web: https://osf.io/nqjfm/download
  web: https://osf.io/qvze6/download
ok

I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error, @yarikoptic @kyleam?


Separate question: I continued as @kyleam suggested to make a local clone and remove the origin, to get a publishable git-annex dataset with only the "web" source of the data.

See: https://github.com/sappelhoff/bogus

Apparently something went wrong - can you tell me what I should do?

After cloning and removing the origin, I did (with the clone):

  1. Make a new GitHub repository
  2. In the clone, run git remote add origin https://github.com/sappelhoff/bogus
  3. Run git push origin master

When I realized that this does not look right, I figured that datalad publish might be the way to go, so I tried (on top of the previous steps):

  1. From the root of the clone: datalad publish . --to origin --force

But all that gave me was a cryptic "git-annex" branch ...

I now want to use datalad install https://github.com/sappelhoff/bogus. Do I first have to merge the git-annex branch into master, or do I leave both branches untouched?

Is this the right way to go at all?

@yarikoptic

Just go ahead with
datalad install https://github.com/sappelhoff/bogus

The git-annex branch should never be merged into any normal branch. Leave it for git-annex to deal with.

@kyleam

kyleam commented Aug 3, 2019

@sappelhoff:

for some files, this prints several links, all except one are wrong, E.g.:
[...]
I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error @yarikoptic @kyleam ?

Hrm that's odd.

I tried with --fast first, and all of the URLs look OK on my end (i.e., I see only one web entry for each URL). Here's the one from the example:

$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/6p8vr/download
ok

I'm trying now without --fast.

I'm running this with datalad 0.11.6 and git-annex 7.20190730+git2-ga63bf35dc-1~ndall+1 on GNU/Linux. What's your version info?
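
(E.g., datalad --version and git annex version should show it; datalad wtf gives a fuller report.)

datalad --version
git annex version | head -n1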

@sappelhoff
Contributor Author

Thanks Yaroslav, I'll try that later!

@kyleam I am using:

  • datalad 0.12.0rc4.dev311 (installed via pip install -e. from my clone of master)
  • git-annex version: 7.20190730-g1030771 (installed from conda-forge)
  • operating system: linux x86_64 (Ubuntu 18.04)

Good to hear that it works with --fast ... I am curious what you'll see without it.

However, reading what --fast does, I should perhaps have used that in the first place, because I am purging the local data later on anyhow :-)

@kyleam

kyleam commented Aug 3, 2019

Good to hear that it works with --fast ... I am curious what you'll see without it.

Without --fast I see repeats, including the example you point to:

$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	24081d41-a5ee-434b-a58a-4401106dc189 -- foo [here]

  web: https://osf.io/4safg/download
  web: https://osf.io/5cfmh/download
  web: https://osf.io/6p8vr/download
  web: https://osf.io/nqjfm/download
  web: https://osf.io/qvze6/download
ok

It seems there has to be something going wrong in the underlying git annex addurl --batch call, but I don't know whether it's on our end (in AnnexRepo, not addurls.py) or git-annex's. Some time next week I'll try to see if I can trigger the issue using git-annex directly.
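
Something along these lines (an untested sketch; the file names are arbitrary) should exercise the same batch call directly:

# feed "url filename" pairs to addurl in batch mode
printf '%s\n' \
  'https://osf.io/6p8vr/download sub-11_channels.tsv' \
  'https://osf.io/qvze6/download sub-07_channels.tsv' \
  | git annex addurl --batch --with-files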

@kyleam

kyleam commented Aug 3, 2019

Aah, it should've occurred to me sooner, but that could happen if those files have the same content, and the files indeed point to the same key for all the cases I've checked. So I think things are working as expected.
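
(A quick way to check, as a sketch: compare the annex keys of two of the suspect files; identical output means identical content.)

git annex lookupkey \
  eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv \
  eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_channels.tsv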

@kyleam

kyleam commented Aug 3, 2019

@sappelhoff:

However, reading what --fast does, I should perhaps have used that in the first place, because I am purging the local data later on anyhow :-)

It's more expensive, but leaving out --fast buys you a content guarantee. With --fast, future downloads will only verify that the file has the expected size.

You can see this difference by looking at the link targets. Without --fast, you get a file that points to the key generated from the file's content:

test3 -> .git/annex/objects/wj/6x/SHA256E-s250--dd8[...]7a0/SHA256E-s250--dd8[...]7a0

With --fast, the target encodes only the URL and size, not a content checksum:

test4 -> '.git/annex/objects/81/K7/URL-s250--https&c%%osf.io%5cfmh%download/URL-s250--https&c%%osf.io%5cfmh%download'
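
If you start with --fast and later decide you do want the checksum, I think you can rekey after downloading (a sketch; I haven't checked whether the registered web URLs carry over to the new key):

git annex get test4        # fetch the content
git annex migrate test4    # rekey from the URL backend to the default checksum backend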

@sappelhoff
Contributor Author

that could happen if those files have the same content, and the files indeed point to the same key for all the cases I've checked

Interesting, thanks for the detective work!

leaving out --fast buys you a content guarantee.

Okay, that's something I would like. That also explains why we don't see duplicates with --fast.

@sappelhoff
Contributor Author

I think I found the reason why my CSV was never populated ...

It seems like this repo is MRI-centric and only .nii.gz files were expected to be loaded from OSF:

if item['attributes']['kind'] == 'file' and ext == '.nii.gz':
    sha = item['attributes']['extra']['hashes']['sha256']
    url = item['links']['download']
    path = item['attributes']['materialized']
    path = re.sub(subset_re, '', path)[1:] if subset else path[1:]
    f.write('{},{},{},{},{}\n'.format(name, url, url, sha, path))

That should be easy to fix!
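
E.g., something like this (an untested sketch) would write rows for all file types while still allowing an optional extension filter:

allowed_ext = None  # or e.g. ('.nii.gz', '.tsv', '.vhdr') to keep filtering
if item['attributes']['kind'] == 'file' and (allowed_ext is None or ext in allowed_ext):
    sha = item['attributes']['extra']['hashes']['sha256']
    url = item['links']['download']
    path = item['attributes']['materialized']
    path = re.sub(subset_re, '', path)[1:] if subset else path[1:]
    f.write('{},{},{},{},{}\n'.format(name, url, url, sha, path))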
