Implement `dataverse-import-files` #292

mih · 2023-04-07T08:13:12Z

Dataverse provides full dataset (version) file listings that also include md5 sums (and others). Therefore it would be fairly simple to support sucking in a filetree without having to go through the full complexity of supporting git-annex's importtree. datalad-ebrains pretty much has the blueprint for that.
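A minimal sketch of how such a listing could be turned into a table of paths, access URLs, and checksums. It assumes that the native API file listing (`GET /api/datasets/:persistentId/versions/:latest/files`) returns entries with a `label`, an optional `directoryLabel`, and a `dataFile` record holding `id`, `filesize`, and `md5`. The helpers `listing_to_rows` and `rows_to_csv` are hypothetical and not part of datalad-dataverse, and the sample record is made up for illustration.

```python
import csv
import io
from pathlib import PurePosixPath


def listing_to_rows(listing, base_url):
    """Convert a Dataverse file listing into path/url/md5/size rows."""
    rows = []
    for entry in listing:
        datafile = entry["dataFile"]
        # Assumption: directoryLabel is absent for files at the dataset root
        path = PurePosixPath(entry.get("directoryLabel", "")) / entry["label"]
        rows.append({
            "path": str(path),
            "url": f"{base_url}/api/access/datafile/{datafile['id']}",
            "md5": datafile["md5"],
            "size": datafile["filesize"],
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows into CSV text with a path,url,md5,size header."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["path", "url", "md5", "size"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative record only, shaped like the assumed API response
sample_listing = [{
    "label": "dataset_description.json",
    "dataFile": {
        "id": 46664,
        "filesize": 1455,
        "md5": "7d63a6dd379b3b81574e31240433e008",
    },
}]
rows = listing_to_rows(sample_listing, "https://dataverse.nl")
print(rows_to_csv(rows))
```

A table like this could then be fed to a tool that registers keys and URLs, without ever downloading file content.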

It is unclear to me whether such a starting point could be coupled with an export/filetree-only setup provided by

datalad add-sibling-dataverse --mode filetree-only URL PID

but the immediate answer is no. Git-annex refuses to try, because it has no export location on record.

Faking an export (or performing an empty one) also does not work, because a datalad dataset will contain files that are not on dataverse (and possibly cannot be, i.e. the importing agent has no write permissions).

A different approach would be to populate a dataset with keys that have attached URLs that point to the data access API of the respective dataverse instance. The uncurl special remote would then be able to take care of them. Possibly a dedicated handler needs to be implemented that performs the auth correctly. Such a handler can be configured in the dataset, specifically for the respective dataverse instance.

Here is a sketch

git annex initremote uncurl type=external externaltype=uncurl encryption=none

git annex registerurl SHA256E-s26309--6ba60e2f73d403beecd5e50afa8affa824e21150558f0b333e209dc4427604c8.tsv https://data.fz-juelich.de/api/access/datafile/2694

git annex fromkey SHA256E-s26309--6ba60e2f73d403beecd5e50afa8affa824e21150558f0b333e209dc4427604c8.tsv sub-042/eeg/sub-042_task-extstim_events.tsv --force
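The same two calls could be generated per file from a checksum, a size, and an access URL. The following is a rough sketch under that assumption. `annex_key` and `annex_commands` are hypothetical helpers. Taking only the final suffix as the key extension is a simplification of what git-annex actually does for its `E` backends.

```python
import shlex
from pathlib import PurePosixPath


def annex_key(backend, size, checksum, path):
    """Build a key name of the form <BACKEND>-s<size>--<checksum><ext>.

    Simplification: only the last suffix of the path is used as the
    extension, and only for backends whose name ends in 'E'.
    """
    ext = PurePosixPath(path).suffix if backend.endswith("E") else ""
    return f"{backend}-s{size}--{checksum}{ext}"


def annex_commands(key, url, path):
    """Return the registerurl/fromkey command lines for one file."""
    return [
        shlex.join(["git", "annex", "registerurl", key, url]),
        shlex.join(["git", "annex", "fromkey", key, path, "--force"]),
    ]


key = annex_key(
    "SHA256E",
    26309,
    "6ba60e2f73d403beecd5e50afa8affa824e21150558f0b333e209dc4427604c8",
    "sub-042/eeg/sub-042_task-extstim_events.tsv",
)
for cmd in annex_commands(
        key,
        "https://data.fz-juelich.de/api/access/datafile/2694",
        "sub-042/eeg/sub-042_task-extstim_events.tsv"):
    print(cmd)
```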

For public datasets (no auth), uncurl is not even needed. The built-in web special remote handles such URLs just fine.

jsheunis · 2023-07-04T20:59:26Z

I tried this approach with a dataset with restricted files (https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/R1TNL8&version=1.3), but to no avail:

> datalad create dvtest

> cd dvtest

> git annex initremote uncurl type=external externaltype=uncurl encryption=none autoenable=true

# not really sure if the next line is necessary, but I ran it
> git annex enableremote uncurl 

# files.csv has header:  path,url,md5,size
# an example row: dataset_description.json,https://dataverse.nl/api/access/datafile/46664,7d63a6dd379b3b81574e31240433e008,1455
> datalad addurls --key 'et:MD5-s{size}--{md5}' files.csv '{url}' '{path}'
# no error results here, all good

# And then finally the get fails:
> datalad get task-fingerTapping_events.json
get(error): task-fingerTapping_events.json (file) [not available]

Then I ran fsck, which failed:

> git annex fsck --fast -f uncurl
fsck dataset_description.json (Failed to find 'MD5E-s1455--7d63a6dd379b3b81574e31240433e008.json' at any of ['https://dataverse.nl/api/access/datafile/46664']) failed
...

At this point I wondered if the file access urls were wrong for some reason, and I tested with download and my dataverse API token stored as a credential, which also failed:

> datalad download --credential dataversenl 'https://dataverse.nl/api/access/datafile/46664 dataset_descr.json'
download(error): dataset_descr.json [download failure] [403 Client Error: Forbidden for url: https://dataverse.nl/api/access/datafile/46664]

Another thing I tried was to navigate to the file access url (https://dataverse.nl/api/access/datafile/46664) in a browser tab where I had already logged in to the dataverse instance, and therefore the files were not restricted (I have admin privileges for the dataset). The browser downloaded the file automatically and successfully. As expected, doing this in an incognito tab fails:

{"status":"ERROR","code":403,"message":"Not authorized to access this object via this API endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"https://dataverse.nl/api/v1/access/datafile/46664","requestMethod":"GET"}

I'm out of ideas at the moment...

mih · 2023-07-05T07:29:43Z

Have you checked how it tries to authenticate? Above I wrote:

> Possibly a dedicated handler needs to be implemented that performs the auth correctly.

And this is likely what you are seeing here. If you check the code here:

datalad-dataverse/datalad_dataverse/dataset.py

Lines 183 to 188 in d793dda

    
    def remove_file(self, fid: int):
        status = delete_request(
            f'{self._api.base_url}/dvn/api/data-deposit/v1.1/swordv2/'
            f'edit-media/file/{fid}',
            # this relies on having established the NativeApi in prepare()
            auth=HTTPBasicAuth(self._api.api_token, ''))

you'll see that the SWORD API wants the API token to be given as the user (not the password) with HTTP Basic auth.

The dataverse special remote primarily uses the main API via pyDataverse, which provides the API token via a key parameter:

https://github.com/gdcc/pyDataverse/blob/master/src/pyDataverse/api.py#L117-L118
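To make the difference between the two schemes concrete, here is a small sketch using only the standard library. It assumes the SWORD API takes the token as the Basic auth user with an empty password (as in the snippet above), and that the native API accepts the token as a `key` query parameter. Both function names are hypothetical, and the token is a placeholder.

```python
import base64
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sword_basic_auth_header(api_token):
    """HTTP Basic auth header with the token as user, empty password."""
    raw = f"{api_token}:".encode()
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def with_key_param(url, api_token):
    """Append the token as a 'key' query parameter, keeping existing ones."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [("key", api_token)]
    return urlunsplit(parts._replace(query=urlencode(query)))


token = "abc"  # placeholder, not a real API token
print(sword_basic_auth_header(token))
print(with_key_param("https://dataverse.nl/api/access/datafile/46664", token))
```

A generic HTTP client has no way to guess which of the two a given endpoint expects, which is the reason a dedicated handler is needed.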

I am not aware of any way this could all be inferred automatically. It needs a dedicated handler for such URLs.

jsheunis · 2023-07-05T07:44:29Z

thanks for the pointer, I will look into this

jsheunis · 2023-07-06T12:21:22Z

The preferred way to authenticate with v5 native APIs is using "X-Dataverse-key":"$API_TOKEN" in the http request header: https://guides.dataverse.org/en/5.10.1/api/auth.html#id3 (although the key url parameter is still supported).
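As a minimal illustration of the header-based variant, the sketch below builds such a request with the standard library only. The function names are hypothetical and the token is a placeholder. `fetch_datafile` is not called here, because it would need network access and a valid token.

```python
from urllib.request import Request, urlopen


def datafile_request(base_url, file_id, api_token):
    """Build a GET request for a datafile with the token in the header."""
    url = f"{base_url}/api/access/datafile/{file_id}"
    # urllib stores the header name as 'X-dataverse-key'; HTTP header
    # names are case-insensitive, so this does not matter to the server
    return Request(url, headers={"X-Dataverse-key": api_token})


def fetch_datafile(base_url, file_id, api_token):
    """Download the file content (requires network and a valid token)."""
    with urlopen(datafile_request(base_url, file_id, api_token)) as resp:
        return resp.read()


req = datafile_request("https://dataverse.nl", 46664, "xxxx-placeholder")
print(req.full_url)
print(req.header_items())
```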

I'm still trying to figure out where exactly a dedicated handler would need to fit in, and how to implement it. My current understanding is this:

1. If the uncurl special remote is initialized for a dataset (which requires installation of datalad-next), then any datalad get operation will somehow check with the uncurl special remote how it should try to retrieve a given file. (I don't know where exactly the connection happens between a user running datalad get and some code in uncurl.py being executed. Is this going through git-annex?)
2. uncurl will, through the AnyUrlOperations class, access the specific URL handler relevant for accessing the file: https://github.com/datalad/datalad-next/blob/0cb44b0b5185a89b01d8a045ef1ea770c7cc1cd1/datalad_next/annexremotes/uncurl.py#L303
3. uncurl will also check with the -next-based credential system for relevant credentials (this I understood from the docstring, but haven't checked how the code works).

IIUC a handler for dataverse urls would essentially be the same as the current http, but with some differences:

- The way the URL is recognized as a dataverse instance URL needs to be defined somehow.
- X-Dataverse-key and the API token (which needs to be queried from the credential system?) need to be passed in the header.

So should there be a new urloperations class in next? I don't think so, because HttpUrlOperations already does all that is generally necessary. For dataverse specifically we just need to pass the correct headers. So the question is then how uncurl should know that it is dealing with a dataverse URL. Do we need to update the handler registry here? https://github.com/datalad/datalad-next/blob/0cb44b0b5185a89b01d8a045ef1ea770c7cc1cd1/datalad_next/url_operations/any.py#L41-L45

So it would be something like the following:

'dataverse.nl/api/access/datafile': ('datalad_next.url_operations.http.HttpUrlOperations',headers),

where uncurl would already have to know that it's a dataverse URL, so that it can query for the API token and add it to the header. This last part is still very blurry to me.
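The idea of matching a URL against a pattern and deriving extra headers from a credential can be sketched independently of datalad-next. Everything below is hypothetical: `HEADER_RULES`, `headers_for_url`, and the `lookup_token` callable (a stand-in for a credential system) are illustration only and do not reflect the actual format of the handler registry.

```python
import re

# Hypothetical registry: URL pattern -> function producing extra headers
HEADER_RULES = [
    (re.compile(r"^https://dataverse\.nl/api/access/datafile/\d+$"),
     lambda token: {"X-Dataverse-key": token}),
]


def headers_for_url(url, lookup_token):
    """Return extra request headers for url, or {} if no rule matches.

    lookup_token is a callable mapping a URL to an API token; it is only
    consulted when a rule matches.
    """
    for pattern, make_headers in HEADER_RULES:
        if pattern.match(url):
            return make_headers(lookup_token(url))
    return {}


def fake_lookup(url):
    # Stand-in for a credential query; returns a placeholder token
    return "xxxx-placeholder"


print(headers_for_url(
    "https://dataverse.nl/api/access/datafile/46664", fake_lookup))
print(headers_for_url("https://example.com/some/file", fake_lookup))
```

In this shape the pattern decides that a URL belongs to a dataverse instance, and the token is only requested once that decision has been made.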

Is this all on the right track?

adswa · 2023-07-06T12:59:36Z

I just want to register the thought that we should write this up once we have all the steps, either in the datalad-dataverse docs, in datalad-next, or maybe in the handbook.

mih mentioned this issue Apr 11, 2023

Create DataLad dataset from Dataverse dataset psychoinformatics-de/knowledge-base#8

Closed

4 tasks

jsheunis mentioned this issue Jul 19, 2023

KBI0028: Notes on Sciebo/Nextcloud share URLs psychoinformatics-de/knowledge-base#104

Merged
