Implement dataverse-import-files #292

Open
mih opened this issue Apr 7, 2023 · 5 comments

Comments

@mih
Member

mih commented Apr 7, 2023

Dataverse provides full dataset (version) file listings that also include md5 sums (and others). Therefore it would be fairly simple to support sucking in a filetree without having to go through the full complexity of supporting git-annex's importtree. datalad-ebrains pretty much has the blueprint for that.
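As a rough illustration of how such a listing could be turned into a table of paths, URLs, and checksums, here is a minimal Python sketch. It assumes the listing is the JSON returned by the native API's dataset version file listing, and that each entry carries `directoryLabel` and a `dataFile` record with `id`, `filename`, `md5`, and `filesize`. Those field names, and the helper names `listing_to_rows` and `rows_to_csv`, are assumptions for illustration and should be checked against the actual instance (some instances may report a checksum type other than MD5).

```python
import csv
import io


def listing_to_rows(listing, base_url):
    """Convert a (parsed) Dataverse file listing into path/url/md5/size rows.

    The JSON field names used here are assumptions about the listing format.
    """
    rows = []
    for entry in listing["data"]:
        datafile = entry["dataFile"]
        # directoryLabel is assumed to be absent for files at the dataset root
        directory = entry.get("directoryLabel", "")
        filename = datafile["filename"]
        path = f"{directory}/{filename}" if directory else filename
        rows.append({
            "path": path,
            "url": f"{base_url}/api/access/datafile/{datafile['id']}",
            "md5": datafile["md5"],
            "size": datafile["filesize"],
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a path,url,md5,size header."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["path", "url", "md5", "size"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The resulting table has one row per file and could serve as input for a key/URL registration step.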

It is unclear to me whether such a starting point could be coupled with an export/filetree-only setup provided by

datalad add-sibling-dataverse --mode filetree-only URL PID

but the immediate answer is no. Git-annex refuses to try, because it has no export location on record.

Faking an export (or performing an empty one) does not work either, because a datalad dataset will contain files that are not on dataverse (and possibly cannot be, i.e., the importing agent has no write permissions).


A different approach would be to populate a dataset with keys that have attached URLs pointing to the data access API of the respective dataverse instance. The uncurl special remote would then be able to take care of them. Possibly a dedicated handler needs to be implemented that performs the auth correctly. Such a handler can be configured in the dataset for the specific dataverse instance.

Here is a sketch

git annex initremote uncurl type=external externaltype=uncurl encryption=none

git annex registerurl SHA256E-s26309--6ba60e2f73d403beecd5e50afa8affa824e21150558f0b333e209dc4427604c8.tsv https://data.fz-juelich.de/api/access/datafile/2694

git annex fromkey SHA256E-s26309--6ba60e2f73d403beecd5e50afa8affa824e21150558f0b333e209dc4427604c8.tsv sub-042/eeg/sub-042_task-extstim_events.tsv --force

For public datasets (no auth), uncurl is not even needed. web does things alright.
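The sketch above could be generated per file along these lines. This is a minimal, hedged Python example: the helper names `annex_key` and `sketch_commands` are made up for illustration, and the extension handling is a simplification (it only keeps the last extension, whereas git-annex applies its own rules for `E` backends).

```python
import os


def annex_key(backend, size, checksum, filename):
    """Build a git-annex key name like BACKEND-sSIZE--CHECKSUM[.ext].

    Simplification: for backends ending in 'E' only the last file
    extension is appended.
    """
    ext = os.path.splitext(filename)[1] if backend.endswith("E") else ""
    return f"{backend}-s{size}--{checksum}{ext}"


def sketch_commands(key, url, path):
    """Return the two git-annex calls from the sketch as argument lists."""
    return [
        ["git", "annex", "registerurl", key, url],
        ["git", "annex", "fromkey", key, path, "--force"],
    ]


key = annex_key(
    "SHA256E",
    26309,
    "6ba60e2f73d403beecd5e50afa8affa824e21150558f0b333e209dc4427604c8",
    "sub-042/eeg/sub-042_task-extstim_events.tsv",
)
commands = sketch_commands(
    key,
    "https://data.fz-juelich.de/api/access/datafile/2694",
    "sub-042/eeg/sub-042_task-extstim_events.tsv",
)
```

The argument lists could be passed to `subprocess.run` inside a dataset; they are only constructed here, not executed.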

@jsheunis
Member

jsheunis commented Jul 4, 2023

I tried this approach with a dataset with restricted files (https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/R1TNL8&version=1.3), but to no avail:

> datalad create dvtest

> cd dvtest

> git annex initremote uncurl type=external externaltype=uncurl encryption=none autoenable=true

# not really sure if the next line is necessary, but I ran it
> git annex enableremote uncurl 

# files.csv has header:  path,url,md5,size
# an example row: dataset_description.json,https://dataverse.nl/api/access/datafile/46664,7d63a6dd379b3b81574e31240433e008,1455
> datalad addurls --key 'et:MD5-s{size}--{md5}' files.csv '{url}' '{name}'
# no error results here, all good

# And then finally the get fails:
> datalad get task-fingerTapping_events.json
get(error): task-fingerTapping_events.json (file) [not available]

Then I ran fsck, which failed:

> git annex fsck --fast -f uncurl
fsck dataset_description.json (Failed to find 'MD5E-s1455--7d63a6dd379b3b81574e31240433e008.json' at any of ['https://dataverse.nl/axpi/access/datafile/46664']) failed
...

At this point I wondered if the file access urls were wrong for some reason, and I tested with download and my dataverse API token stored as a credential, which also failed:

> datalad download --credential dataversenl 'https://dataverse.nl/api/access/datafile/46664 dataset_descr.json'
download(error): dataset_descr.json [download failure] [403 Client Error: Forbidden for url: https://dataverse.nl/api/access/datafile/46664]

Another thing I tried was to navigate to the file access url (https://dataverse.nl/api/access/datafile/46664) in a browser tab where I had already logged in to the dataverse instance, and therefore the files were not restricted (I have admin privileges for the dataset). The browser downloaded the file automatically and successfully. As expected, doing this in an incognito tab fails:

{"status":"ERROR","code":403,"message":"Not authorized to access this object via this API endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"https://dataverse.nl/api/v1/access/datafile/46664","requestMethod":"GET"}

I'm out of ideas at the moment...

@mih
Member Author

mih commented Jul 5, 2023

Have you checked how it tries to authenticate? Above I wrote:

Possibly a dedicated handler needs to be implemented that performs the auth correctly.

And this is likely what you are seeing here. If you check the code here:

def remove_file(self, fid: int):
    status = delete_request(
        f'{self._api.base_url}/dvn/api/data-deposit/v1.1/swordv2/'
        f'edit-media/file/{fid}',
        # this relies on having established the NativeApi in prepare()
        auth=HTTPBasicAuth(self._api.api_token, ''))

you'll see that the SWORD API wants the API token to be given as the user (not the password) with HTTP Basic auth.

The dataverse special remote primarily uses the main API via pyDataverse. It provides the API token via a key parameter:

https://github.com/gdcc/pyDataverse/blob/master/src/pyDataverse/api.py#L117-L118

I am not aware of any way this could all be inferred magically. It needs a dedicated handler for such URLs.
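To make the difference between the two authentication styles mentioned above concrete, here is a small Python sketch using only the standard library. The function names are hypothetical. It shows the token being sent as the Basic-auth user with an empty password (the SWORD style), versus being appended as a `key` query parameter (the style used for the main API).

```python
import base64
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sword_basic_auth_header(api_token):
    """HTTP Basic auth header with the token as user and an empty password."""
    raw = f"{api_token}:".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def url_with_key_param(url, api_token):
    """Append the token as a 'key' query parameter, keeping existing ones."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query)
    query.append(("key", api_token))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A generic HTTP handler has no way of knowing which of these a given server expects, which is why the choice has to be encoded somewhere explicitly.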

@jsheunis
Member

jsheunis commented Jul 5, 2023

thanks for the pointer, I will look into this

@jsheunis
Member

jsheunis commented Jul 6, 2023

The preferred way to authenticate with v5 native APIs is using "X-Dataverse-key":"$API_TOKEN" in the http request header: https://guides.dataverse.org/en/5.10.1/api/auth.html#id3 (although the key url parameter is still supported).

I'm still trying to figure out where exactly a dedicated handler would need to fit in, and how to implement it. My current understanding is this:

  • if the uncurl special remote is initialized for a dataset (which requires installation of datalad-next), then any datalad get operation will somehow check with the uncurl special remote how it should try to retrieve a given file (I don't know where exactly this connection happens between a user running datalad get and some code in uncurl.py being executed, is this going through git annex?)
  • uncurl will, through the AnyUrlOperations class, access the specific url-handler relevant for accessing the file: https://github.com/datalad/datalad-next/blob/0cb44b0b5185a89b01d8a045ef1ea770c7cc1cd1/datalad_next/annexremotes/uncurl.py#L303
  • uncurl will also check with the -next-based credential system for relevant credentials (this I understood from the docstring, but haven't checked how the code works)

IIUC a handler for dataverse urls would essentially be the same as the current http, but with some differences:

  • The way the URL is recognized as a dataverse instance url needs to be defined somehow
  • X-Dataverse-key and api token (which needs to be queried from the credential system?) need to be passed in the header
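For the header part, a minimal Python sketch with the standard library might look as follows. It only constructs the request object and does not send it. The function name `dataverse_request` is hypothetical, and where the token comes from (for example, the credential system) is left open here.

```python
from urllib.request import Request


def dataverse_request(url, api_token=None):
    """Build a GET request, adding X-Dataverse-key when a token is given."""
    headers = {}
    if api_token:
        # Header name as given in the Dataverse API guide
        headers["X-Dataverse-key"] = api_token
    return Request(url, headers=headers, method="GET")


def header_lookup(request):
    """Return the request headers with lower-cased names for easy lookup."""
    return {name.lower(): value for name, value in request.header_items()}
```

Passing the result to `urllib.request.urlopen` would perform the download. For a restricted file without a valid token, a 403 response like the one quoted earlier would be expected.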

So should there be a new urloperations class in next? I don't think so, because HttpUrlOperations already does all that is generally necessary. For dataverse specifically we just need to pass the correct headers. So the question is then: how should uncurl know that it's dealing with a dataverse URL? Do we need to update the handler registry here: https://github.com/datalad/datalad-next/blob/0cb44b0b5185a89b01d8a045ef1ea770c7cc1cd1/datalad_next/url_operations/any.py#L41-L45

So it would be something like the following:

'dataverse.nl/api/access/datafile': ('datalad_next.url_operations.http.HttpUrlOperations',headers),

where uncurl would already have to know that it's a dataverse URL, so that it can query for the API token and add it to the header. This last part is still very blurry to me.
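To illustrate the idea of such a registry entry, here is a self-contained Python sketch of a pattern-to-handler lookup. It is not the datalad-next implementation: the registry layout, the longest-match rule, the `find_handler` name, and the assumption that the handler class would accept a `headers` argument are all made up for illustration, and `<API_TOKEN>` is a placeholder.

```python
import re

# Hypothetical registry: URL regex -> (handler class path, extra kwargs)
HANDLERS = {
    r"https?://": (
        "datalad_next.url_operations.http.HttpUrlOperations",
        {},
    ),
    r"https://dataverse\.nl/api/access/datafile/": (
        "datalad_next.url_operations.http.HttpUrlOperations",
        {"headers": {"X-Dataverse-key": "<API_TOKEN>"}},
    ),
}


def find_handler(url, registry=HANDLERS):
    """Return the entry whose pattern matches the longest prefix of url."""
    best, best_len = None, -1
    for pattern, entry in registry.items():
        match = re.match(pattern, url)
        if match and len(match.group(0)) > best_len:
            best, best_len = entry, len(match.group(0))
    return best
```

With a longest-match rule, the dataverse-specific entry wins over the generic `https?://` entry for data access URLs, while all other HTTP URLs keep the default behavior. The token itself would still have to be looked up at runtime rather than stored in the registry.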

Is this all on the right track?

@adswa
Member

adswa commented Jul 6, 2023

I just want to register the thought that we should write this up once we have all the steps. Either in the datalad-dataverse docs, or datalad-next, or maybe the handbook
