Add support in datasets API for persistent id (doi) #1837

Closed
scolapasta opened this issue Apr 1, 2015 · 21 comments

@scolapasta
Contributor

scolapasta commented Apr 1, 2015

Our APIs currently use the database ID, but we also need to support the persistent ID.

@scolapasta scolapasta added this to the 4.0.1 milestone Apr 1, 2015
@scolapasta
Contributor Author

From #1717:

However, the DOI naming scheme contains slashes, and so does not play well with a REST API. We could introduce a scheme where the DOI is escaped, or the slashes are replaced with dashes, or maybe the whole thing is base64-encoded. Not sure any of these is a good idea - at least, it's not a very intuitive one. We could offer another endpoint that converts global IDs to local ones.
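To make the three encodings mentioned in that quote concrete, here is a minimal Python sketch using only the standard library. The helper names are hypothetical and the DOI is the example used elsewhere in this thread; none of this is Dataverse code.

```python
import base64
from urllib.parse import quote, unquote

DOI = "doi:10.7910/DVN/UXTXA"  # example DOI taken from this thread


def percent_encode(pid: str) -> str:
    """Escape every reserved character, including ':' and '/'."""
    return quote(pid, safe="")


def slashes_to_dashes(pid: str) -> str:
    """Replace slashes with dashes (ambiguous if the DOI already has a '-')."""
    return pid.replace("/", "-")


def base64_encode(pid: str) -> str:
    """URL-safe base64, whose alphabet contains neither '/' nor '+'."""
    return base64.urlsafe_b64encode(pid.encode("utf-8")).decode("ascii")


def base64_decode(token: str) -> str:
    """Inverse of base64_encode."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


if __name__ == "__main__":
    for encode in (percent_encode, slashes_to_dashes, base64_encode):
        print(f"{encode.__name__:>18}: {encode(DOI)}")
```

Percent-encoding and base64 both round-trip cleanly. The dash replacement cannot be reversed reliably when the identifier already contains a dash, which is one reason it is the least attractive of the three.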

@pdurbin
Member

pdurbin commented Apr 9, 2015

I definitely have a need to figure out internal dataset ID numbers when working with APIs.

For a long time I've been doing this at https://github.com/IQSS/dataverse/blob/master/scripts/search/assumptions

export FIRST_FINCH_DATASET_ID=$(curl -s "http://localhost:8080/api/dataverses/finches/contents?key=$FINCHKEY" | jq '.data[0].id')
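For readers working in Python rather than shell, the same lookup can be sketched as follows. The URL layout and the `key` parameter come from the command above, and the `data[0].id` shape comes from the `jq` filter; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def contents_url(base_url: str, alias: str, api_key: str) -> str:
    """Build the same URL that the curl command above requests."""
    query = urlencode({"key": api_key})
    return f"{base_url}/api/dataverses/{alias}/contents?{query}"


def first_item_id(payload: dict) -> int:
    """Equivalent of the jq filter '.data[0].id'."""
    return payload["data"][0]["id"]


def fetch_first_item_id(base_url: str, alias: str, api_key: str) -> int:
    """Perform the request; this needs a running installation."""
    with urlopen(contents_url(base_url, alias, api_key)) as response:
        return first_item_id(json.load(response))
```

Like the shell version, this only finds the first item in one dataverse, so it is a workaround rather than a lookup by DOI.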

And more recently I've been using an undocumented feature of the Search API to expose database IDs (looking them up by globalId/persistentId/DOI) but this requires turning on an experimental feature I haven't fully implemented at #1299 -

Anyway, my point is that this is an important endpoint for sure. /cc @rliebz

@scolapasta scolapasta modified the milestones: 4.0.1, In Review - Short Term Apr 18, 2015
@garthg

garthg commented May 14, 2015

Hi,

This is a blocker for my project: without the IDs I can't perform metadata updates, and I can't get the IDs because the get_contents() call takes too long to complete. I will try some of the workarounds described here, so thank you to the folks who posted those!

One possible suggestion for a simple solution here would be to URL-escape the DOIs and then use them in the REST format as usual, so you'd get something like https://dataverse.harvard.edu/api/datasets/doi%3A10.7910%2FDVN%2FUXTXA/versions/:latest

Anyway, if anyone has any additional suggestions for how to find the IDs or how to perform metadata updates using only DOI, I would love to hear them!

Thanks,

Garth

pdurbin added a commit that referenced this issue May 29, 2015
@pdurbin
Member

pdurbin commented May 29, 2015

While I was just trying to write a test for #2222 it was driving me crazy (again) that I can't see the dataset entity/database IDs from SWORD. I just pushed a proof of concept to correct this in 639d8c3.

pdurbin added a commit to IQSS/dataverse-apitester that referenced this issue Jun 1, 2015
Disabled because we still need a way to find a dataset id based on a
DOI: IQSS/dataverse#1837
@pdurbin
Member

pdurbin commented Jun 29, 2015

the get_contents() call takes too long to complete

Right, get_contents is a method @garthg is calling from https://github.com/IQSS/dataverse-client-python and the corresponding issue about this slowness on the API side is #2122

@pdurbin
Member

pdurbin commented Jul 15, 2015

Without the ability to look up datasets via DOI, the native "datasets" API ( http://guides.dataverse.org/en/4.0/api/native-api.html#datasets ) is far less useful. An example use case today from @aawinburn was "How do I get the file ID of this PDF in my unpublished dataset?" Good question, and #1795 was supposed to be the answer, but you have to know the database ID of the dataset. I've also answered this question at https://groups.google.com/d/msg/dataverse-community/fFrJi7NnBus/JUdOlOmhtQgJ encouraging people (for now) to get a list of file IDs via the SWORD statement ( http://guides.dataverse.org/en/latest/api/sword.html#display-a-dataset-statement ), mostly because SWORD operates via DOIs. See also infsci2711/MultiDBs-FilesAPIs2DBs-WebClient#6

@pdurbin
Member

pdurbin commented Sep 4, 2015

As I just mentioned in a thread on the Dataverse Google Group, #2416 was opened recently which is about how hard it is to discover file IDs from the GUI.

In addition #2438 is a new issue about what persistent IDs we could/should use for files.

@pdurbin
Member

pdurbin commented Oct 10, 2015

Developers of the Dataverse client for Python would like the ability to use DOIs (not just database IDs) to operate on the native API. IQSS/dataverse-client-python#28 has some discussion on this.

@leeper
Member

leeper commented Nov 14, 2015

This would also be useful for the R client.

@leeper
Member

leeper commented Nov 14, 2015

I should elaborate: there's a tension between the Native API's ability to get versions of a dataset (but only by dataset ID) and the SWORD API's ability to retrieve a dataset by DOI. It would be nice for these to be able to play together, particularly given that the Native API doesn't require an API key to view the contents of a public dataset, but the SWORD API does.

@RinkeHoekstra

This is a blocker for my project as well, and I do not see why the Search API does not expose the dataset IDs by default.

As it turns out, several Dataverse installations I've tested do provide the IDs when the 'show_entity_ids=true' parameter is passed in the URL. However, this feature is not mentioned in the API docs.
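A minimal sketch of building such a request in Python follows. Only `show_entity_ids=true` is taken from this comment; the `/api/search` path and the `q` parameter are assumptions about the Search API, the function name is hypothetical, and since the flag is undocumented it may not work on every installation.

```python
from urllib.parse import urlencode


def search_url(base_url: str, query: str, show_entity_ids: bool = True) -> str:
    """Build a search URL, optionally asking for database IDs in the results."""
    params = {"q": query}  # 'q' is assumed, not confirmed in this thread
    if show_entity_ids:
        params["show_entity_ids"] = "true"  # the undocumented flag
    return f"{base_url}/api/search?{urlencode(params)}"


if __name__ == "__main__":
    print(search_url("https://dataverse.example.edu", "finch"))
```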

@pdurbin
Member

pdurbin commented Dec 2, 2015

See also #1717 which spawned this ticket. I think @michbarsinai @scolapasta and I need to get together and decide on an approach to try. Options include:

  • put the DOI in a query parameter: /api/datasets?persistentId=doi:10.7910/DVN/UXTXA
  • escape the DOI, keeping it where it is in the path: /api/datasets/doi%3A10.7910%2FDVN%2FUXTXA
  • put the DOI at the end of path: /api/datasets/versions/:latest/doi:10.7910/DVN/UXTXA

@garthg means well when he suggests escaping the DOI in the URL like /api/datasets/doi%3A10.7910%2FDVN%2FUXTXA/versions/:latest (and @michbarsinai suggested the same at #1717 (comment) ) but my goodness is that hard on the eyes. I would much prefer using a query parameter like this: /api/datasets?persistentId=doi:10.7910/DVN/UXTXA which is exactly what we do on the dataset page: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UXTXA

Another approach would be to put the DOI at the end of the URL, like we do with SWORD ( /dvn/api/data-deposit/v1.1/swordv2/edit/study/doi:10.7910/DVN/UXTXA ) but I favor the query parameter approach.

Whatever we decide on we would, of course, continue to support the old way for a while. And I think we should continue to support looking up a dataset by id, even if we use a query parameter (/api/datasets?id=42).
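The three URL shapes under discussion can be put side by side with a short Python sketch. The function names are hypothetical; the paths are the ones listed in this comment, and nothing here reflects a decision about which one Dataverse implements.

```python
from urllib.parse import quote

PID = "doi:10.7910/DVN/UXTXA"  # example DOI from this thread


def query_parameter_style(pid: str) -> str:
    """DOI carried in a query parameter, as on the dataset page."""
    return f"/api/datasets?persistentId={pid}"


def escaped_path_style(pid: str, version: str = ":latest") -> str:
    """DOI percent-encoded and kept in its usual place in the path."""
    return f"/api/datasets/{quote(pid, safe='')}/versions/{version}"


def trailing_path_style(pid: str, version: str = ":latest") -> str:
    """DOI appended unescaped at the end of the path, as SWORD does."""
    return f"/api/datasets/versions/{version}/{pid}"


if __name__ == "__main__":
    for build in (query_parameter_style, escaped_path_style, trailing_path_style):
        print(f"{build.__name__:>22}: {build(PID)}")
```

Only the escaped style needs the client to transform the DOI; the other two leave it readable, which is the readability argument made above.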

@michbarsinai
Member

Another option is to have a DOI endpoint. This would also allow a DOI to point to different types of items, which is, I think, one of the main goals of the DOI project.

Something along the lines of:

/api/doi/10.7910/DVN/UXTXA

Not sure how to deal with versions there - we could append them (/api/doi/12.3456/DVNE/UXTXA/versions/:latest) and use some semi-clever URL parsing. Or we could return a list of the versions, and have the client access a specific version via the existing API.
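One way the "semi-clever URL parsing" could look is sketched below in Python. It assumes the `/api/doi/` prefix and the `/versions/` suffix from the examples above; the names `DoiRequest` and `parse_doi_path` are hypothetical.

```python
from typing import NamedTuple, Optional

PREFIX = "/api/doi/"
VERSIONS_MARKER = "/versions/"


class DoiRequest(NamedTuple):
    doi: str
    version: Optional[str]  # None when the path names no version


def parse_doi_path(path: str) -> DoiRequest:
    """Split a path of the proposed form into the DOI and an optional version."""
    if not path.startswith(PREFIX):
        raise ValueError(f"not a DOI path: {path!r}")
    rest = path[len(PREFIX):]
    # Split on the LAST marker so slashes inside the DOI are left alone.
    doi, marker, version = rest.rpartition(VERSIONS_MARKER)
    if not marker:
        return DoiRequest(rest, None)
    return DoiRequest(doi, version)


if __name__ == "__main__":
    print(parse_doi_path("/api/doi/10.7910/DVN/UXTXA"))
    print(parse_doi_path("/api/doi/12.3456/DVNE/UXTXA/versions/:latest"))
```

The weak point of this design is a DOI whose own suffix ends in something like `/versions/x`: the parser would misread it as a version request, which is the kind of ambiguity a query parameter avoids.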

@garthg

garthg commented Dec 3, 2015

@RinkeHoekstra In case it's helpful, I wrote some Python that does cached lookup of dataverse IDs to make it slightly easier to manage this issue. Some code is on pastebin at: http://pastebin.com/ipdhEPXA . Obviously that's not a substitute for proper implementation through the API, but I wanted to pass it along just in case it's helpful.
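The pastebin code is not reproduced here. As a general illustration of the cached-lookup idea, a minimal sketch might look like this, with `DatasetIdCache` and the injected `lookup` callable both hypothetical:

```python
from typing import Callable, Dict


class DatasetIdCache:
    """Remember database IDs so each persistent ID is resolved only once."""

    def __init__(self, lookup: Callable[[str], int]) -> None:
        # 'lookup' is whatever slow call resolves a persistent ID to a
        # database ID, e.g. a scan of a dataverse's contents.
        self._lookup = lookup
        self._ids: Dict[str, int] = {}

    def get(self, pid: str) -> int:
        """Return the cached ID, calling the slow lookup only on a miss."""
        if pid not in self._ids:
            self._ids[pid] = self._lookup(pid)
        return self._ids[pid]


if __name__ == "__main__":
    cache = DatasetIdCache(lambda pid: 42)  # stand-in for a real lookup
    print(cache.get("doi:10.7910/DVN/UXTXA"))
```

Injecting the lookup keeps the cache independent of any particular client library, so the slow call can be swapped out once a direct DOI lookup exists.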

@RinkeHoekstra

@garthg thanks! I found similar code somewhere on Github and now have a workaround.

A separate issue is that the search API is rather picky as to how the DOI is quoted. For instance Python requests always quotes the query parameters in a GET request, but the API then searches for the quoted string rather than unquoting it first. But that is a separate issue ...
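The mismatch described here can be illustrated with the standard library alone. This sketch does not call the Search API or the requests library; it only shows why a server that compares the still-quoted string against stored DOIs finds nothing, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote

STORED_DOI = "doi:10.7910/DVN/UXTXA"  # what the server has on record


def as_sent_by_client(value: str) -> str:
    """Percent-encode the value the way an HTTP client typically does."""
    return quote(value, safe="")


def naive_match(raw_parameter: str) -> bool:
    """Compare the raw, still-quoted parameter to the stored DOI."""
    return raw_parameter == STORED_DOI


def decoded_match(raw_parameter: str) -> bool:
    """Unquote first, then compare."""
    return unquote(raw_parameter) == STORED_DOI


if __name__ == "__main__":
    raw = as_sent_by_client(STORED_DOI)
    print(raw, naive_match(raw), decoded_match(raw))
```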

@michbarsinai
Member

URL scheme for external persistent ids:

http://dataverse.org/api/datasets/:persistentid/:draft?persistentid=doi:10.2.3.4./open/ended/notation*
  • The notation is open-ended as long as every character is legal in a URL parameter, so we can't support, e.g., &.
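Why an unescaped `&` is a problem can be seen with Python's standard query-string parser. This is a sketch of the general URL rule, not of how Dataverse parses requests; the helper name `read_pid` and the sample identifier are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode


def read_pid(query_string: str) -> Optional[str]:
    """Return the 'persistentid' value from a query string, if present."""
    values = parse_qs(query_string).get("persistentid", [])
    return values[0] if values else None


if __name__ == "__main__":
    pid = "doi:10.1234/a&b"  # made-up identifier containing '&'

    # Sent raw, '&' is read as a parameter separator and the value is cut short.
    print(read_pid("persistentid=" + pid))

    # Percent-encoded by the client, the full identifier survives.
    print(read_pid(urlencode({"persistentid": pid})))
```

So identifiers containing `&` are only usable if the client escapes them, which gives up part of the readability that motivated the query-parameter scheme.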

@michbarsinai michbarsinai self-assigned this Dec 16, 2015
michbarsinai added a commit that referenced this issue Dec 18, 2015
@michbarsinai michbarsinai removed their assignment Dec 19, 2015
@pdurbin
Member

pdurbin commented Jan 4, 2016

@scolapasta this is one of the issues I mentioned this morning for which code has been pushed to a branch made from 4.2.3. A decision should be made on whether or not to merge it into the 4.2.3 branch.

@pdurbin pdurbin modified the milestones: 4.2.3, Not Assigned to a Release Jan 4, 2016
@scolapasta scolapasta modified the milestones: 4.3, 4.2.3 Jan 5, 2016
@scolapasta scolapasta assigned kcondon and unassigned scolapasta Feb 26, 2016
@pdurbin
Member

pdurbin commented Mar 1, 2016

Most recently, this issue is affecting this user:

I'm replying with workarounds but really we should just fix this issue. @michbarsinai implemented a fix at #1837 (comment) and it has since become pull request #2893.

scolapasta added a commit that referenced this issue Mar 11, 2016
@scolapasta
Contributor Author

Tested and merged.

@pdurbin
Member

pdurbin commented Mar 16, 2016

You can see the fix in production at https://dataverse.harvard.edu/api/datasets/:persistentId?persistentId=doi:10.7910/DVN/ARKOTI

(That's the dataset @monogan said we could test with at IQSS/dataverse-client-r#2 (comment) .)

Docs at http://guides.dataverse.org/en/4.3/api/native-api.html#datasets
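A minimal Python sketch of calling the endpoint as it appears in the production URL above follows. The `:persistentId` path segment and the `persistentId` parameter are taken from that URL; the function names are hypothetical, and the shape of the JSON response is not assumed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def dataset_by_pid_url(base_url: str, pid: str) -> str:
    """Build the lookup URL, leaving ':' and '/' in the DOI readable."""
    query = urlencode({"persistentId": pid}, safe=":/")
    return f"{base_url}/api/datasets/:persistentId?{query}"


def fetch_dataset(base_url: str, pid: str) -> dict:
    """Fetch and decode the JSON response; needs network access."""
    with urlopen(dataset_by_pid_url(base_url, pid)) as response:
        return json.load(response)


if __name__ == "__main__":
    print(dataset_by_pid_url("https://dataverse.harvard.edu", "doi:10.7910/DVN/ARKOTI"))
```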


8 participants