Usage of pipeline ... and error when trying recursive update #1
Myself, I haven't used datalad-osf yet. From the traceback it sounds like some empty row? |
there is no csv in my dataset 🤔 BUT having talked to @jasmainak a bit, it seems like my premise is wrong. I thought I could create a git annex repo that would look JUST LIKE my real dataset, but instead of the real data it would contain symbolic links pointing to the OSF data. And then I would be able to host that git annex repo (very low size) on GitHub, allow people to pull it with datalad, and use …

According to Mainak I would need my own git server to do something like that. Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files). |
I don't think so. git-annex will just contain URLs pointing to OSF.
yeap, and then you can publish that repository to GitHub, along with the git-annex branch (…) |
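To make that concrete, here is a minimal sketch of publishing such a repository together with its git-annex branch; the remote name and URL are placeholders, not taken from this thread:

```sh
# Hypothetical sketch: push both the regular branch and the git-annex branch
# (which carries the registered URL/location metadata) to a GitHub remote.
git remote add github git@github.com:USER/mydataset.git
git push github master git-annex
```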
didn't look into anything else, but just FYI the fetched csv has only the header:

```
$> cat eeg_matchingpennies
name,url,location,sha256,path
```

As for datalad crashing instead of just silently exiting or issuing a warning that no records were received, I filed datalad/datalad#3577 |
extract() fails with an index error if the stream is valid JSON or CSV but lacks any rows (i.e. for JSON an empty list or for CSV a header-only file). Update extract() to issue a warning and return an empty list of rows. Re: templateflow/datalad-osf#1 Closes datalad#3577.
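For illustration, a hypothetical reproduction of the case this fix targets; the dataset and file names here are made up:

```sh
# A header-only CSV is valid but has no rows. With the fix described above,
# addurls should warn and add nothing instead of failing with an IndexError.
datalad create demo
cd demo
printf 'name,url\n' > empty.csv          # valid CSV, header only, zero rows
datalad addurls empty.csv '{url}' '{name}'
```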
Mh, yes - this is a bug; the test example from the main README also fails. Perhaps we should wait for @rciric to work this out. In the meantime, do you have a pointer to docs / tutorials on how to do what I want (see above, my "premise") using just datalad? |
I don't believe we have a high-level tutorial on addurls yet. But here's a quick example using a couple of the URLs from the OSF directory that you pointed to. This skips past the more involved task, which IIUC datalad-osf handles, of getting a set of stable URLs and putting them into either a .json or .csv file that addurls() understands.

Example:

```sh
#!/bin/sh
set -eu

datalad create someds
cd someds

cat >files.csv <<EOF
sub,url
sub-06,https://osf.io/9q8r2/download
sub-05,https://osf.io/5br27/download
EOF
datalad save -m"add list of URLs" files.csv

# Tell addurls to download the content from the "url" field to a
# file named by the "sub" field.
datalad addurls files.csv "{url}" "{sub}"
```

After executing the above, you end up with a dataset that has two files downloaded and annexed.
The public URLs for these files have been registered with git-annex.
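As a rough illustration (not actual output from this example), that registration could be checked like this, reusing the sub-06 file name from the script above:

```sh
# The "web" special remote should list the osf.io download URL for the file.
git annex whereis sub-06
```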
This means that you can publish the repository without the data, and people who have cloned it will be able to get the files. You can verify locally that this works by cloning the repo and then dropping the origin remote, so that the only place annex can get the content from is the web. |
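A hedged sketch of that verification, reusing the someds and sub-06 names from the example above:

```sh
# Clone the dataset, remove origin so the registered web URLs are the only
# remaining source, then fetch content from them.
datalad install -s someds someds-clone
cd someds-clone
git remote remove origin
datalad get sub-06    # content must now come from the registered OSF URLs
```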
Wow, this is really great @kyleam thanks! |
This seems to have worked to a large extent! I have made a CSV file with my file paths and URLs ("mp.csv") and made a datalad dataset. CSV content for convenience:
(Note: I did not commit the csv to the repo, because I thought it was not necessary.) There seems to be a bug, however, with some of the files:
For some files, this prints several links, all except one of which are wrong. E.g.:
I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error, @yarikoptic @kyleam?

Separate question: I continued as @kyleam suggested, making a local clone and removing the origin, to get a publishable git-annex dataset with only the "web" source of the data. See: https://github.com/sappelhoff/bogus. Apparently something went wrong; can you tell me what I should do? After cloning and removing the origin, I did (with the clone):
When I realized that this does not look right, I figured that …

But all that gave me was a cryptic "git-annex" branch ... I now want to use … Is this the right way to go at all? |
Just go ahead with it; the git-annex branch should never be merged into any normal branch. Leave it for git-annex to deal with. |
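As a rough illustration of that advice (the file path is a placeholder), the git-annex branch can be inspected without ever merging or checking it out:

```sh
# Bookkeeping branch managed entirely by git-annex; look, but don't merge.
git branch                           # git-annex shows up next to master
git log --oneline git-annex | head   # commits carrying location/URL metadata
git annex whereis path/to/file       # the supported way to query that metadata
```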
Hrm, that's odd. I tried with:
I'm trying now without … I'm running this with datalad 0.11.6 and git-annex 7.20190730+git2-ga63bf35dc-1~ndall+1 on GNU/Linux. What's your version info? |
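For reference, a trivial sketch of how those version details can be gathered (commands only, output omitted):

```sh
datalad --version
git annex version | head -n1
```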
Thanks Yaroslav, I'll try that later! @kyleam I am using:
good to hear that it works with … However, reading what … |
Without …

It seems there has to be something going wrong in the underlying … |
Aah, it should've occurred to me sooner, but that could happen if those keys have the same content, and the files indeed point to the same key for all the cases I've checked. So I think things are working as expected. |
It's more expensive, but leaving out --fast buys you a content guarantee. With --fast, future downloads will only verify that the file has the expected size. You can see this difference by looking at the link targets. Without --fast, you get a file that points to the key generated from the file's content:
With --fast, the target only encodes the size:
|
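A hypothetical illustration of that difference; the URL and file names below are placeholders, and the keys sketched in the comments show only the general shape, not real values:

```sh
# Without --fast, the content is downloaded and the key is checksum-based.
git annex addurl --file with-checksum.dat https://example.com/data.dat
git annex lookupkey with-checksum.dat    # e.g. SHA256E-s<size>--<sha256>.dat

# With --fast, only the URL and size are recorded, so later downloads can
# only be verified by size.
git annex addurl --fast --file size-only.dat https://example.com/data.dat
git annex lookupkey size-only.dat        # e.g. URL-s<size>--https&c%%example.com%data.dat
```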
Interesting, thanks for the detective work!
okay, that's something I should like. That also makes sense then why we don't see duplicates with … |
I think I found the error, and why my CSV was never populated ... It seems like this repo is MRI-centric and only … (see datalad-osf/datalad_osf/utils.py, lines 91 to 96 at 42a2b93).
That should be easy to fix! |
Hi, I am trying to turn an OSF directory into a git annex repository and datalad-osf seems to be great for this.
I am not entirely sure whether this would work as I think it would; basically I expect:

1. datalad create
2. update_recursive with my OSF key and the directory I want to be git annexed
3. datalad install and get data (that is stored on OSF, but indexed in my git annex repository)

Can someone tell me whether I am just completely misunderstanding / misusing the pipeline? @yarikoptic @rciric
Is there a simpler way to achieve what I want?
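A rough, hedged sketch of what step 3 above might look like on the consumer side; the repository URL and path are placeholders:

```sh
# Clone the lightweight repository and fetch selected content, which
# git-annex would retrieve from OSF via the registered URLs.
datalad install https://github.com/USER/mydataset
cd mydataset
datalad get sub-01
```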
Apart from this, this is also a bug report. Here are the steps to reproduce:
1. mkdir mystuff
2. cd mystuff
3. datalad create
4. into mystuff, put the following Python file try.py
5. in mystuff, run python try.py
... after a considerable amount of time, this provides me with the following error message: