Files: Need Persistent Identifiers/URL's for Data Files #2700
Comments
@monogan thank you very much for opening this issue! I'm hoping other people in the community add their comments. @bencomp just suggested that we should clarify issue titles and I completely agree so would you agree to changing the title from "URL Referencing Data Sets" to "URL Referencing Files"? This goes back to my post on terminology we use in Dataverse. I think what you're calling a data set we call a file. Sorry if this is confusing! If you agree to this change to the title, please go ahead and change it. I did go ahead and edit the markdown slightly and linked to my commit with the R script I wrote. Also, I wanted to mention that I opened a couple of related issues yesterday based on this one:
As I wrote there, I think the best solution is to encourage the use of language-specific libraries so that people are insulated from API changes. Finally, the original thread that kicked off this discussion may be of interest, as may the internal ticket (for those who can access it).
Yes, @pdurbin and I indeed chatted about this issue and about updating issue titles to keep an overview of open issues. Could you summarise the issue as "Dataverse should provide persistent URLs for files" and, if yes, could you update the issue title accordingly?
@pdurbin @bencomp Yes, I agree with this change, so I have made it based on Ben's revised suggestion. Thank you both for all of your help on this! I definitely think this is something that a lot of folks will be interested in for various settings, as I've heard about many students in test settings saying that they wanted to be able to just grab Dataverse files directly in R. So thank you all, and please let me know if I can do anything to assist.
Is this the same as #2438?
@monogan thanks for updating the title. @eaquigley #2438 about unique identifiers for files is definitely related! Thanks for linking to it. I meant to do that myself. We can all think about this: What are the pieces of information necessary to download a file? I would say:
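As a hedged illustration of the question above, the pieces of information arguably reduce to a server hostname plus a stable file identifier. This sketch assumes the `/api/access/datafile/` path of the Dataverse access API and uses the numeric file id from the file page linked later in this thread; neither detail is spelled out in the comment itself.

```r
# Hedged sketch: building a direct-download URL from a server name and a
# numeric file id. The id 2692295 comes from the file page URL mentioned
# later in this thread; the /api/access/datafile/ path is an assumption
# about the access API, not something stated in this comment.
server  <- "https://dataverse.harvard.edu"
file_id <- 2692295
url     <- paste0(server, "/api/access/datafile/", file_id)
url

# To actually fetch the file (requires network access):
# download.file(url, destfile = "BPchap7.tab")
```

The point of the exercise is that if either piece (hostname or id) changes across Dataverse versions, every printed URL built this way breaks, which is exactly the concern this issue raises.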
#2416 is also related and provides an example of a URL from which you can download a file from Zenodo. 30 years from now, will you be able to download "MHC_PosSel_dataset.zip" from the URL https://zenodo.org/record/18898/files/MHC_PosSel_dataset.zip ? (This file is part of https://zenodo.org/record/18898#.VjK086KpOL6 .) Is that file URL stable? Will it work in perpetuity (to use @monogan's phrase)? @timbl has taught us that Cool URIs don't change, and in a perfect world you'd never find a dead link. I say this not to knock Zenodo (they have really cool integration with GitHub and provide a great service). I'm just trying to reason about what forever means and what sort of promises can be made when it comes to URLs.
@monogan heads up that @leeper just pushed an initial draft to https://github.com/IQSS/dataverse-client-r ! I'm sure he'd appreciate feedback.
I'd like to request this feature as well. I see that it's been marked as "Triaged": is that equivalent to wont-fix or to will-consider-again-in-next-version?
I've closed #2416 because it's very similar. @jeisner - sorry to miss your last question on this issue. The "triaged" label is left over from a previous process and it is not currently in use. The best way to see what the team is currently working on is on our waffle board: https://waffle.io/IQSS/dataverse The best way to see what the team is planning to work on next is our releases roadmap page (soon to be updated):
I would definitely like this feature sooner rather than later. I think that it is essential for datasets with lots of files (like mine). I want to be able to blog/send posts about (and link to) specific files.
@lwo commented on Sep 29, 2015 in issue #2438
I am moving that comment here, and closing that issue as a duplicate of this.
@djbrooke well, @monogan first opened this issue so maybe we should check with him. There is a new file page for the file he mentioned in the description: https://dataverse.harvard.edu/file.xhtml?fileId=2692295&version=1.0 @monogan what do you think?
@pdurbin @djbrooke I've worked several hours on this, and I'm struggling pretty badly. My goal has been to see if I can get @leeper's program to work on this, but I cannot seem to install his "dataverse" package. Maybe one of you will have more luck. Here's what my planned code was: I installed Rtools at one point thinking that might help, but this fails on the second line. For what it's worth, this is my error message:
@monogan I think you're taking the right approach by using Dataverse client libraries such as the "dataverse" R package to insulate you from any changes in the Dataverse APIs. I'm not sure how to get that package installed on Windows and I just opened IQSS/dataverse-client-r#11 to try to find out how best to get help. Please do not worry about the DOI not working. The only thing that was added recently to Dataverse is a landing page per file. The APIs themselves didn't change and we intend to keep APIs as stable as possible.
Just to confirm, this R code should work for the example file being discussed:
library("dataverse")
f <- get_file("BPchap7.tab", "doi:10.7910/DVN/ARKOTI")
# optionally, load the data:
dat <- haven::read_dta(f)
dplyr::glimpse(haven::read_dta(f))
## Observations: 854
## Variables: 14
## $ st_fip <dbl> 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ female <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 9, 1, 0, 1...
## $ hrs_allev <dbl> 8.0, 3.0, 26.0, 44.0, 8.0, 4.0, 8.0, 4.0, 8.0, 8.0, 8.0, 5.5, 8.0, 21.0, 12.0, 9.5, 1...
## $ evol_course <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0...
## $ phase1 <dbl> 0.2010949, 0.2010949, 0.2010949, 0.2010949, -1.3663943, -1.3663943, -1.3663943, -1.36...
## $ senior_c <dbl> 1, 0, 0, 1, 1, 0, 1, -2, -1, 1, -2, -2, 1, 1, 1, 0, 0, 1, 0, -2, -1, 1, 0, 0, 0, -1, ...
## $ ph_senior <dbl> 0.2010949, 0.0000000, 0.0000000, 0.2010949, -1.3663943, 0.0000000, -1.3663943, 2.7327...
## $ notest_p <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ph_notest_p <dbl> 0.2010949, 0.2010949, 0.2010949, 0.2010949, 0.0000000, 0.0000000, 0.0000000, 0.000000...
## $ idsci_trans <dbl> 9.639460e-01, 5.422196e-01, 9.639460e-01, 1.710664e-01, 6.852229e-01, 3.391082e-01, 8...
## $ biocred3 <dbl> 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 0, 1, 1, 2, 1...
## $ certified <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1...
## $ degr3 <dbl> 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0...
## $ confident   <dbl> 0, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0...
@monogan now that you've got the "dataverse" R package installed and the script from @leeper above works for you, as you reported at IQSS/dataverse-client-r#11 (comment) (thanks for all the help, Thomas), do you have any more thoughts on this issue? Again, I think you're using the right approach by encouraging people to use a layer of abstraction such as the R dataverse client, which can evolve along with Dataverse APIs.
@pdurbin @leeper @jeisner @smakonin I'll admit that I'm not exactly sure how much @leeper had to do to allow the extraction to work off of a DOI. Essentially, I think it's great if an R package like Thomas's can pull the file as easily as this does--I can very easily teach an undergraduate how to get data from Dataverse if it's as easy as @leeper's software makes it. (Thank you again Thomas for such great, easy software.) So am I correct in understanding that Thomas was able to update the software so that the same DOI could get the same data, even as the API changed? Am I also correct in understanding that DOIs stay stable even as Dataverse's API updates? Essentially, if the DOI is stable and there's always some abstraction that can easily get the data, I think that's a great situation. My main goal all along was to be able to print in ink a way to get a specific data file from Dataverse, and for that code in ink to work in perpetuity so that the printed material was still good. The great work Thomas has been doing seems to be helping that cause. I'm not sure how easily API adjustments can keep it easy for software like the "dataverse" package to continue to work. Does that get at what you were wondering, @pdurbin? Again, thank you all for everything you do.
@monogan yes, exactly. The same DOI will always reference the same dataset. Also, #3584 is somewhat related in the sense that we are going to soon expose a URL for public files on the new file landing page.
@pdurbin Awesome, Philip. Sounds good to close it to me. Thank you.
At the top of the ticket, what appears to be asked for is a persistent identifier or DOI that would allow download access in perpetuity. A download API link is not that.
Dear colleagues:
One thing I've been trying to be able to do (and Phil Durbin has spent a lot of time helping on this) is reference a data file from Dataverse by a URL. The tricky thing about this is that I'd like to be able to print code that directly references a file that works in perpetuity. The motivator for this was a new R book I wrote, so once the code was printed, I wouldn't have the chance to update it.
Phil has come up with something like this (function defined and then file loaded):
The biggest issue is that versions of Dataverse may change in the future, so the URLs may change as well. So if there's some way we could guarantee a URL that would work for a Dataverse-based file forever, that would be ideal. Thank you for any possible help!
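The function Phil wrote is omitted above, but as a hedged sketch, a helper of roughly this shape would separate the part that might change across Dataverse versions (the URL pattern) from the part printed in the book (a single function call). The function name and the `/api/access/datafile/` path are assumptions for illustration, not Phil's actual code.

```r
# Hypothetical helper (NOT the function Phil wrote, which is not shown in
# this issue): turn a numeric file id into a download URL. If the API path
# ever changes, only this one function needs updating, not the printed code
# that calls it.
dataverse_file_url <- function(file_id,
                               server = "https://dataverse.harvard.edu") {
  paste0(server, "/api/access/datafile/", file_id)
}

u <- dataverse_file_url(2692295)  # file id from the file page in this thread
# dat <- read.delim(u)  # would fetch the tab-delimited file over the network
```

Of course, this only relocates the problem the next paragraph describes: the helper still breaks if the underlying URL scheme changes, which is why a guaranteed-stable URL (or a maintained client library) is the real ask.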