
Usage of pipeline ... and error when trying recursive update #1

Closed
sappelhoff opened this issue Jul 30, 2019 · 16 comments · Fixed by #2

@sappelhoff
Contributor

Hi, I am trying to turn an OSF directory into a git annex repository and datalad-osf seems to be great for this.

I am not entirely sure whether this works the way I think it does. Basically, I expect the following (rough sketch after the list):

  1. Make a new directory and call datalad create
  2. Call update_recursive with my OSF key and the directory I want to have git-annexed
  3. Upload the new directory, e.g., to GitHub and get the repository URL
  4. Be able to pass this URL to datalad install and get the data (which is stored on OSF, but indexed in my git annex repository)
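
Roughly, what I have in mind (a sketch with a placeholder GitHub URL, untested):

mkdir mystuff && cd mystuff
datalad create

python -c "import datalad_osf; datalad_osf.update_recursive('cj2dr', 'eeg_matchingpennies')"

# push the (small) repository to GitHub, e.g. https://github.com/<user>/mystuff

# then anyone can do:
datalad install https://github.com/<user>/mystuff
cd mystuff
datalad get .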

Can someone tell me whether I am just completely misunderstanding / misusing the pipeline? @yarikoptic @rciric

Is there a simpler way to achieve what I want?


Apart from this, this is also a bug report. Here are the steps to reproduce:

  1. mkdir mystuff
  2. cd mystuff
  3. datalad create

In mystuff, put the following Python file, try.py:

key='cj2dr'
subset='eeg_matchingpennies'

import datalad_osf

datalad_osf.update_recursive(key, subset)
  4. From mystuff, run python try.py

... after a considerable amount of time, this provides me with the following error message:

Traceback (most recent call last):
  File "try.py", line 6, in <module>
    datalad_osf.update_recursive(key, subset)
  File "/home/stefanappelhoff/Desktop/datalad-osf/datalad_osf/utils.py", line 186, in update_recursive
    addurls_from_csv(csv)
  File "/home/stefanappelhoff/Desktop/datalad-osf/datalad_osf/utils.py", line 65, in addurls_from_csv
    ifexists='overwrite')
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 492, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 480, in return_func
    results = list(results)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 429, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 522, in _process_results
    for res in results:
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/plugin/addurls.py", line 719, in __call__
    missing_value)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/plugin/addurls.py", line 407, in extract
    metacols = (c for c in sorted(rows[0].keys()) if c != urlcol)
IndexError: list index out of range
@yarikoptic

I haven't used datalad-osf myself yet. From the traceback it sounds like there is an empty row somewhere?
If you run with datalad --dbg you could get into pdb and troubleshoot the details, or just check out that csv?
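
e.g., something like this (just a sketch) should drop you into pdb right at the failure:

import pdb
import traceback

import datalad_osf

try:
    datalad_osf.update_recursive('cj2dr', 'eeg_matchingpennies')
except Exception:
    traceback.print_exc()
    pdb.post_mortem()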

@sappelhoff
Contributor Author

or just check out that csv?

there is no csv in my dataset 🤔

BUT having talked to @jasmainak a bit it seems like my premise is wrong.

I thought I could create a git annex repo that would look JUST LIKE my real dataset, but instead of the real data, it would contain symbolic links pointing to the OSF data.

And then I would be able to host that git annex repo (very low size) on GitHub, allow people to pull it with datalad, and use datalad.api.get() to download the data from OSF.

According to Mainak I would need my own git server to do something like that.

Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).

@yarikoptic

According to Mainak I would need my own git server to do something like that.

I don't think so. git-annex will just contain URLs pointing to OSF.

Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).

Yep, and then you can publish that repository to GitHub, along with the git-annex branch (datalad publish does that), so anyone who clones it should be able to get the actual files from OSF using git annex get or datalad get. So the premise is right as far as I can see ;)
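
Roughly (a sketch; the GitHub URL is a placeholder):

# inside the dataset
git remote add github https://github.com/<user>/<repo>
datalad publish --to github     # pushes master plus the git-annex branch

# anyone else:
datalad install https://github.com/<user>/<repo>
cd <repo>
datalad get .                   # or: git annex get .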

@yarikoptic

I didn't look into anything else, but just FYI: the fetched csv has only the header.

$> cat eeg_matchingpennies 
name,url,location,sha256,path

As for datalad crashing instead of just exiting silently or issuing a warning that no records were received, I filed datalad/datalad#3577.

kyleam added a commit to kyleam/datalad that referenced this issue Aug 1, 2019
extract() fails with an index error if the stream is valid JSON or CSV
but lacks any rows (i.e. for JSON an empty list or for CSV a
header-only file).  Update extract() to issue a warning and return an
empty list of rows.

Re: templateflow/datalad-osf#1
Closes datalad#3577.
@sappelhoff
Contributor Author

didn't look into anything else but just FYI that the fetched csv has only the header.

Mh, yes - this is a bug; the test example from the main README fails as well. Perhaps we should wait for @rciric to work this out.

In the meantime, do you have a pointer to docs / tutorials on how to do what I want (see my "premise" above) using just datalad?

@kyleam

kyleam commented Aug 2, 2019

In the meantime, do you have a pointer to docs / tutorials how to do what I want to (see above, my "premise") using just datalad?

I don't believe we have a high-level tutorial on addurls yet. But here's a quick example using a couple of the URLs from the OSF directory that you pointed to. This skips past the more involved task, which IIUC datalad-osf handles, of getting a set of stable URLs and putting them into either a .json or .csv file that addurls() understands.

example
#!/bin/sh

set -eu

datalad create someds
cd someds

cat >files.csv <<EOF
sub,url
sub-06,https://osf.io/9q8r2/download
sub-05,https://osf.io/5br27/download
EOF

datalad save -m"add list of URLs" files.csv

# Tell addurls to download the content from the "url" field to a
# file named by the "sub" field.
datalad addurls files.csv "{url}" "{sub}"

After executing the above, you end up with a dataset that has two files downloaded and annexed.

someds
|-- files.csv -> .git/annex/objects/[...]7f913d093561b0b385d076a32d1ea9f1.csv
|-- sub-05 -> .git/annex/objects/[...]f68f6c37ac758d82cd8c7d95dee70bbf
`-- sub-06 -> .git/annex/objects/[...]ecda9020e4f012517f531e5be571e8db

The public URLs for these files have been registered with git-annex:

> someds $ git annex whereis
whereis sub-05 (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	6411a76e-97a7-4c98-80a7-a9832599ddff -- kyle@hylob:~/scratch/dl/addurls-examples/someds [here]

  web: https://osf.io/5br27/download
ok
whereis sub-06 (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	6411a76e-97a7-4c98-80a7-a9832599ddff -- kyle@hylob:~/scratch/dl/addurls-examples/someds [here]

  web: https://osf.io/9q8r2/download
ok

This means that you can publish the repository without the data and people who have cloned it will be able to get the files with {git annex,datalad} get. (This requires publishing the git-annex branch.)

You can verify locally that this works by cloning the repo and then dropping the origin remote, so the only place annex can get the content from is the web.

$ datalad install -s someds clone
$ cd clone 
$ git annex dead origin
$ git remote rm origin 
$ git annex whereis
whereis sub-05 (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/5br27/download
ok
whereis sub-06 (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/9q8r2/download
ok
$ git annex get sub-05
get sub-05 (from web...) 
(checksum...) ok
(recording state in git...)
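
For completeness, the same two records in the JSON form that addurls also accepts would look roughly like this:

cat >files.json <<EOF
[
  {"sub": "sub-06", "url": "https://osf.io/9q8r2/download"},
  {"sub": "sub-05", "url": "https://osf.io/5br27/download"}
]
EOF

datalad addurls files.json "{url}" "{sub}"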

@sappelhoff
Contributor Author

Wow, this is really great @kyleam thanks!

@sappelhoff
Contributor Author

sappelhoff commented Aug 3, 2019

This seems to have worked to a large extent! I have made a CSV file with my file paths and URLs, "mp.csv", and made a datalad dataset:

CSV content for convenience
fpath,url
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_channels.tsv,https://osf.io/wdb42/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.eeg,https://osf.io/3at5h/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.vhdr,https://osf.io/3m8et/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.vmrk,https://osf.io/7gq4s/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_events.tsv,https://osf.io/9q8r2/download
eeg_matchingpennies/sourcedata/sub-05/eeg/sub-05_task-matchingpennies_eeg.xdf,https://osf.io/agj2q/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_channels.tsv,https://osf.io/256sk/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.eeg,https://osf.io/p52dn/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.vhdr,https://osf.io/jk649/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.vmrk,https://osf.io/wdjk9/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_events.tsv,https://osf.io/5br27/download
eeg_matchingpennies/sourcedata/sub-06/eeg/sub-06_task-matchingpennies_eeg.xdf,https://osf.io/rj3nf/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_channels.tsv,https://osf.io/qvze6/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.eeg,https://osf.io/z792x/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.vhdr,https://osf.io/2an4r/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.vmrk,https://osf.io/u7v2g/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_events.tsv,https://osf.io/uyhtd/download
eeg_matchingpennies/sourcedata/sub-07/eeg/sub-07_task-matchingpennies_eeg.xdf,https://osf.io/aqesz/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_channels.tsv,https://osf.io/4safg/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.eeg,https://osf.io/dg9b4/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.vhdr,https://osf.io/w6kn2/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.vmrk,https://osf.io/mrkag/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_events.tsv,https://osf.io/u76fs/download
eeg_matchingpennies/sourcedata/sub-08/eeg/sub-08_task-matchingpennies_eeg.xdf,https://osf.io/6t5vg/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_channels.tsv,https://osf.io/nqjfm/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.eeg,https://osf.io/6m5ez/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.vhdr,https://osf.io/btv7d/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.vmrk,https://osf.io/daz4f/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_events.tsv,https://osf.io/ue7ah/download
eeg_matchingpennies/sourcedata/sub-09/eeg/sub-09_task-matchingpennies_eeg.xdf,https://osf.io/59zde/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_channels.tsv,https://osf.io/5cfmh/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.eeg,https://osf.io/ya8kr/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.vhdr,https://osf.io/he3c2/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.vmrk,https://osf.io/bw6fp/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_events.tsv,https://osf.io/r5ydt/download
eeg_matchingpennies/sourcedata/sub-10/eeg/sub-10_task-matchingpennies_eeg.xdf,https://osf.io/gfsnv/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv,https://osf.io/6p8vr/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.eeg,https://osf.io/ywnpg/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.vhdr,https://osf.io/p7xk2/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.vmrk,https://osf.io/8u5fm/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_events.tsv,https://osf.io/rjzhy/download
eeg_matchingpennies/sourcedata/sub-11/eeg/sub-11_task-matchingpennies_eeg.xdf,https://osf.io/4m3g5/download
eeg_matchingpennies/.bidsignore,https://osf.io/6thgf/download
eeg_matchingpennies/CHANGES,https://osf.io/ckmbf/download
eeg_matchingpennies/dataset_description.json,https://osf.io/tsy4c/download
eeg_matchingpennies/LICENSE,https://osf.io/mkhd4/download
eeg_matchingpennies/participants.tsv,https://osf.io/6mceu/download
eeg_matchingpennies/participants.json,https://osf.io/ku2dn/download
eeg_matchingpennies/README,https://osf.io/k8hjf/download
eeg_matchingpennies/task-matchingpennies_eeg.json,https://osf.io/qf5d8/download
eeg_matchingpennies/task-matchingpennies_events.json,https://osf.io/3qztv/download
eeg_matchingpennies/stimuli/left_hand.png,https://osf.io/g45de/download
eeg_matchingpennies/stimuli/right_hand.png,https://osf.io/2r9zd/download

datalad create eeg_matchingpennies
datalad addurls mp.csv "{url}" "{fpath}" -d eeg_matchingpennies/

(Note: I did not commit the CSV to the repo, because I thought it was not necessary.)

There seems to be a bug, however, with some of the files:

cd eeg_matchingpennies
git annex whereis

For some files, this prints several links, all but one of which are wrong, e.g.:

whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/4safg/download
  web: https://osf.io/5cfmh/download
  web: https://osf.io/6p8vr/download
  web: https://osf.io/nqjfm/download
  web: https://osf.io/qvze6/download
ok

I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error, @yarikoptic @kyleam?


Separate question: I continued as @kyleam suggested to make a local clone and remove the origin, to get a publishable git-annex dataset with only the "web" source of the data.

See: https://github.com/sappelhoff/bogus

Apparently something went wrong - can you tell me what I should do?

After cloning and removing the origin, I did (with the clone):

  1. Make a new GitHub repository
  2. In the clone, run git remote add origin https://github.com/sappelhoff/bogus
  3. Run git push origin master

When I realized that this does not look right, I figured that datalad publish might be the way to go, so I tried (on top of the previous steps):

  1. From the root of the clone: datalad publish . --to origin --force

But all that gave me was a cryptic "git-annex" branch ...

I now want to use datalad install https://github.com/sappelhoff/bogus. Do I first have to merge the git-annex branch into master, or do I leave both branches untouched?

Is this the right way to go at all?

@yarikoptic

Just go ahead with
datalad install https://github.com/sappelhoff/bogus

The git-annex branch should never be merged into any normal branch. Leave it for git-annex to deal with.

@kyleam

kyleam commented Aug 3, 2019

@sappelhoff:

for some files, this prints several links, all except one are wrong, E.g.:
[...]
I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error @yarikoptic @kyleam ?

Hrm that's odd.

I tried with --fast first, and all of the URLs look OK on my end (i.e., I see only one web entry for each URL). Here's the one from the example:

$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/6p8vr/download
ok

I'm trying now without --fast.

I'm running this with datalad 0.11.6 and git-annex 7.20190730+git2-ga63bf35dc-1~ndall+1 on GNU/Linux. What's your version info?
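
(E.g., datalad --version and git annex version should show it; datalad wtf gives a fuller report.)

datalad --version
git annex version | head -n1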

@sappelhoff
Contributor Author

Thanks Yaroslav, I'll try that later!

@kyleam I am using:

  • datalad 0.12.0rc4.dev311 (installed via pip install -e. from my clone of master)
  • git-annex version: 7.20190730-g1030771 (installed from conda-forge)
  • operating system: linux x86_64 (Ubuntu 18.04)

Good to hear that it works with --fast ... I am curious what you'll see without it.

However, reading what --fast does, I should perhaps have used that in the first place, because I am purging the local data later on anyhow :-)

@kyleam

kyleam commented Aug 3, 2019

Good to hear that it works with --fast ... I am curious what you'll see without it.

Without --fast I see repeats, including the example you point to:

$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (2 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	24081d41-a5ee-434b-a58a-4401106dc189 -- foo [here]

  web: https://osf.io/4safg/download
  web: https://osf.io/5cfmh/download
  web: https://osf.io/6p8vr/download
  web: https://osf.io/nqjfm/download
  web: https://osf.io/qvze6/download
ok

It seems there has to be something going wrong in the underlying git annex addurl --batch call, but I don't know whether it's on our end (in AnnexRepo, not addurls.py) or git-annex's. Some time next week I'll try to see if I can trigger the issue using git-annex directly.
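
Something along these lines (an untested sketch; the file names are arbitrary) should exercise the same batch call directly:

# feed "url filename" pairs to addurl in batch mode
printf '%s\n' \
  'https://osf.io/6p8vr/download sub-11_channels.tsv' \
  'https://osf.io/qvze6/download sub-07_channels.tsv' \
  | git annex addurl --batch --with-files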

@kyleam

kyleam commented Aug 3, 2019

Aah, it should've occurred to me sooner, but that could happen if those files have the same content, and the files indeed point to the same key for all the cases I've checked. So I think things are working as expected.
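
(A quick way to check, as a sketch: compare the annex keys of two of the suspect files; identical output means identical content.)

git annex lookupkey \
  eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv \
  eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_channels.tsv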

@kyleam

kyleam commented Aug 3, 2019

@sappelhoff:

However, reading what --fast does, I should perhaps have used that in the first place, because I am purging the local data later on anyhow :-)

It's more expensive, but leaving out --fast buys you a content guarantee. With --fast, future downloads will only verify that the file has the expected size.

You can see this difference by looking at the link targets. Without --fast, you get a file that points to the key generated from the file's content:

test3 -> .git/annex/objects/wj/6x/SHA256E-s250--dd8[...]7a0/SHA256E-s250--dd8[...]7a0

With --fast, the target encodes only the URL and size, not a content checksum:

test4 -> '.git/annex/objects/81/K7/URL-s250--https&c%%osf.io%5cfmh%download/URL-s250--https&c%%osf.io%5cfmh%download'
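
If you start with --fast and later decide you do want the checksum, I think you can rekey after downloading (a sketch; I haven't checked whether the registered web URLs carry over to the new key):

git annex get test4        # fetch the content
git annex migrate test4    # rekey from the URL backend to the default checksum backend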

@sappelhoff
Contributor Author

that could happen if those files have the same content, and the files indeed point to the same key for all the cases I've checked

Interesting, thanks for the detective work!

leaving out --fast buys you a content guarantee.

Okay, that's something I would like. That also explains why we don't see duplicates with --fast.

@sappelhoff
Contributor Author

I think I found the reason why my CSV was never populated ...

It seems like this repo is MRI-centric and only .nii.gz files were expected to be loaded from OSF:

if item['attributes']['kind'] == 'file' and ext == '.nii.gz':
    sha = item['attributes']['extra']['hashes']['sha256']
    url = item['links']['download']
    path = item['attributes']['materialized']
    path = re.sub(subset_re, '', path)[1:] if subset else path[1:]
    f.write('{},{},{},{},{}\n'.format(name, url, url, sha, path))

That should be easy to fix!
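
E.g., something like this (an untested sketch) would write rows for all file types while still allowing an optional extension filter:

allowed_ext = None  # or e.g. ('.nii.gz', '.tsv', '.vhdr') to keep filtering
if item['attributes']['kind'] == 'file' and (allowed_ext is None or ext in allowed_ext):
    sha = item['attributes']['extra']['hashes']['sha256']
    url = item['links']['download']
    path = item['attributes']['materialized']
    path = re.sub(subset_re, '', path)[1:] if subset else path[1:]
    f.write('{},{},{},{},{}\n'.format(name, url, url, sha, path))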
