Files: Need Persistent Identifiers/URL's for Data Files #2700

Closed
ghost opened this issue Oct 27, 2015 · 27 comments

@ghost

ghost commented Oct 27, 2015

Dear colleagues:

One thing I've been trying to do (and Phil Durbin has spent a lot of time helping on this) is reference a data file from Dataverse by a URL. The tricky part is that I'd like to be able to print code that references a file directly and keeps working in perpetuity. The motivation was a new R book I wrote: once the code was printed, I wouldn't have the chance to update it.

Phil has come up with something like this (function defined and then file loaded):

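# Download a tab-separated file from a Dataverse Data Access API URL
# and return it as a data frame.
download.dataverse.file <- function(url) {
  if (missing(url) || length(url) == 0L) {
    # Fail loudly instead of returning the message as if it were data
    stop(
      "Please provide a URL to a file: http://guides.dataverse.org/en/latest/api/dataaccess.html"
    )
  }
  # Use a temporary file so nothing in the working directory is overwritten
  tsvfile <- tempfile(fileext = ".tsv")
  on.exit(unlink(tsvfile))
  download.file(url = url, destfile = tsvfile, method = "curl")
  read.table(tsvfile, header = TRUE, sep = "\t")
}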

# Load the file
evolution <- download.dataverse.file(url = 'https://dataverse.harvard.edu/api/v1/access/datafile/2692295')

The biggest issue is that versions of Dataverse may change in the future, so the URLs may change as well. So if there's some way we could guarantee a URL that would work for a Dataverse-based file forever, that would be ideal. Thank you for any possible help!
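
The URL in the example above has a fixed shape: the server, the path /api/v1/access/datafile/, and a numeric file id. As a minimal sketch (the helper name dataverse_file_url is hypothetical and not part of Dataverse or any package), that pattern could be kept in one place, so that a change to the API path means one edit rather than many:

```r
# Hypothetical helper: build a Data Access API URL from a numeric file id.
# The path "/api/v1/access/datafile/" is taken from the example URL above;
# the function name and arguments are assumptions made for illustration.
dataverse_file_url <- function(file_id, server = "https://dataverse.harvard.edu") {
  id <- suppressWarnings(as.numeric(file_id))
  if (length(id) != 1L || is.na(id) || id <= 0 || id != floor(id)) {
    stop("file_id must be a single positive whole number")
  }
  # Drop any trailing slash so the pieces join cleanly
  server <- sub("/+$", "", server)
  # format() avoids scientific notation for large ids
  paste0(server, "/api/v1/access/datafile/",
         format(id, scientific = FALSE, trim = TRUE))
}

file_url <- dataverse_file_url(2692295)
```

The resulting string can be passed to download.dataverse.file() as before. This does not make the URL itself persistent; it only limits how much code has to change if the path does.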

@pdurbin
Member

pdurbin commented Oct 29, 2015

@monogan thank you very much for opening this issue! I'm hoping other people in the community add their comments. @bencomp just suggested that we should clarify issue titles and I completely agree so would you agree to changing the title from "URL Referencing Data Sets" to "URL Referencing Files"? This goes back to my post on terminology we use in Dataverse. I think what you're calling a data set we call a file. Sorry if this is confusing! If you agree to this change to the title, please go ahead and change it. I did go ahead and edit the markdown slightly and linked to my commit with the R script I wrote.

Also, I wanted to mention that I opened a couple of related issues yesterday based on this one:

As I wrote there I think the best solution is to encourage the use of language-specific libraries so that people are insulated from API changes.

Finally, the original thread that kicked off this discussion may be of interest, as well as the internal ticket (for those who can access it).

@bencomp
Contributor

bencomp commented Oct 29, 2015

Yes, @pdurbin and I indeed chatted about this issue and about updating issue titles to keep an overview of open issues. Could you summarise the issue as "Dataverse should provide persistent URLs for files"? If yes, could you update the issue title accordingly?

@ghost changed the title from "URL Referencing Data Sets" to "Dataverse should provide persistent URLs for files" Oct 29, 2015
@ghost
Author

ghost commented Oct 29, 2015

@pdurbin @bencomp Yes, I agree with this change, so I have made it based on Ben's revised suggestion. Thank you both for all of your help on this! I definitely think this is something that a lot of folks will be interested in for various settings, as I've heard about many students in test settings saying that they wanted to be able to just grab Dataverse files directly in R. So thank you all, and please let me know if I can do anything to assist.

@eaquigley
Contributor

Is this the same as #2438?

@pdurbin
Member

pdurbin commented Oct 29, 2015

@monogan thanks for updating the title.

@eaquigley #2438 about unique identifiers for files is definitely related! Thanks for linking to it. I meant to do that myself.

We can all think about this: What are the pieces of information necessary to download a file? I would say:

@pdurbin
Member

pdurbin commented Oct 30, 2015

#2416 is also related and provides an example of a URL from which you can download a file from Zenodo. 30 years from now will you be able to download "MHC_PosSel_dataset.zip" from the URL https://zenodo.org/record/18898/files/MHC_PosSel_dataset.zip ? (This file is part of https://zenodo.org/record/18898#.VjK086KpOL6 .) Is that file URL stable? Will it work in perpetuity (to use @monogan 's phrase)? @timbl has taught us that Cool URIs don't change and in a perfect world you'd never find a dead link. I say this not to knock Zenodo (they have really cool integration with GitHub and provide a great service). I'm just trying to reason about what forever means and what sort of promises can be made when it comes to URLs.

@pdurbin
Member

pdurbin commented Nov 8, 2015

As I wrote there I think the best solution is to encourage the use of language-specific libraries so that people are insulated from API changes.

@monogan heads up that @leeper just pushed an initial draft to https://github.com/IQSS/dataverse-client-r ! I'm sure he'd appreciate feedback.

@ghost
Author

ghost commented Nov 10, 2015

@pdurbin @leeper This looks very promising!

@mercecrosas mercecrosas modified the milestone: In Review Nov 30, 2015
@scolapasta scolapasta modified the milestone: Not Assigned to a Release Jan 28, 2016
@jeisner

jeisner commented Aug 27, 2016

I'd like to request this feature as well. I see that it's been marked as "Triaged": is that equivalent to wont-fix or to will-consider-again-in-next-version?

@djbrooke
Contributor

djbrooke commented Oct 7, 2016

I've closed #2416 because it's very similar.

@jeisner - sorry to miss your last question on this issue. The "triaged" label is left over from a previous process and it is not currently in use.

The best way to see what the team is currently working on is on our waffle board:

https://waffle.io/IQSS/dataverse

The best way to see what the team is planning to work on next is our releases roadmap page (soon to be updated):

http://dataverse.org/releases-roadmap

@smakonin

I would definitely like this feature sooner rather than later. I think that it is essential for datasets with lots of files (like mine). I want to be able to blog/send posts about (and link to) specific files.

@mheppler
Contributor

@lwo commented on Sep 29, 2015 in issue #2438

I am entering via: https://groups.google.com/forum/#!msg/dataverse-community/gtz2npccWjU/i7_EVs2LBgAJ

... I think persistent identifiers should not be derived from the local file id. An organization may want to migrate from their repository solution to dataverse and be able to import their PIDs if they already have them. And then rebind them to the dataverse dataset and future file pages. That would be possible with a String type, but not with a number.

I am moving that comment here, and closing that issue as a duplicate of this.

@mheppler changed the title from "Dataverse should provide persistent URLs for files" to "Files: Need Persistent Identifiers/URL's for Data Files" Oct 27, 2016
@djbrooke djbrooke modified the milestone: 4.6.2 - Provenance Jan 17, 2017
@djbrooke
Contributor

@pdurbin @mheppler - was this delivered with the new file page?

@pdurbin
Member

pdurbin commented Jan 18, 2017

@djbrooke well, @monogan first opened this issue so maybe we should check with him. There is a new file page for the file he mentioned in the description: https://dataverse.harvard.edu/file.xhtml?fileId=2692295&version=1.0

@monogan what do you think?

@ghost
Author

ghost commented Jan 20, 2017

@pdurbin @djbrooke I've worked several hours on this, and I'm struggling pretty badly. My goal has been to see if I can get @leeper 's program to work on this, but I cannot seem to install his "dataverse" package. Maybe one of you has more luck. Here's what my planned code was:
library(ghit)
ghit::install_github("iqss/dataverse-client-r")
library("dataverse")
get_dataset("doi:10.7910/DVN/ARKOTI")

I installed Rtools at one point thinking that might help, but this fails on the second line. For what it's worth, this is my error message:
Error in build_and_insert(p$pkgname, d, vers, build_args, verbose = verbose) :
  Package build for dataverse failed!
In addition: Warning message:
running command '"C:/Users/COX'SG~1/DOCUME~1/R/R-33~1.2/bin/x64/R" CMD build C:\Users\COX'SG~1\AppData\Local\Temp\RtmpGACfp0\dataverse-client-r1ca4271417e6' had status 1
Can any of you get my code to run? Does anyone have any idea why I can't install the "dataverse" package @leeper wrote? I hate that I'm failing before I even get to the starting line, but any advice would be appreciated. My hope is that, once I get the "dataverse" package installed, the DOI still works with the new page.

@pdurbin
Member

pdurbin commented Jan 20, 2017

@monogan I think you're taking the right approach by using Dataverse client libraries such as the "dataverse" R package to insulate you from any changes in the Dataverse APIs. I'm not sure how to get that package installed on Windows and I just opened IQSS/dataverse-client-r#11 to try to find out how best to get help.

Please do not worry about the DOI not working. The only thing that was added recently to Dataverse is a landing page per file. The APIs themselves didn't change and we intend to keep APIs as stable as possible.

@ghost
Author

ghost commented Jan 20, 2017

@pdurbin Thank you for opening up that comment. I appreciate your help in trying to sort this out. Is there anything else I need to be testing right now? Also very grateful to @leeper for all of the work he has done on this issue.

@leeper
Member

leeper commented Jan 22, 2017

Just to confirm, this R code should work for the example file being discussed:

library("dataverse")
f <- get_file("BPchap7.tab", "doi:10.7910/DVN/ARKOTI")

# optionally, load the data:
dat <- haven::read_dta(f)
dplyr::glimpse(haven::read_dta(f))
## Observations: 854
## Variables: 14
## $ st_fip      <dbl> 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ female      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 9, 1, 0, 1...
## $ hrs_allev   <dbl> 8.0, 3.0, 26.0, 44.0, 8.0, 4.0, 8.0, 4.0, 8.0, 8.0, 8.0, 5.5, 8.0, 21.0, 12.0, 9.5, 1...
## $ evol_course <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0...
## $ phase1      <dbl> 0.2010949, 0.2010949, 0.2010949, 0.2010949, -1.3663943, -1.3663943, -1.3663943, -1.36...
## $ senior_c    <dbl> 1, 0, 0, 1, 1, 0, 1, -2, -1, 1, -2, -2, 1, 1, 1, 0, 0, 1, 0, -2, -1, 1, 0, 0, 0, -1, ...
## $ ph_senior   <dbl> 0.2010949, 0.0000000, 0.0000000, 0.2010949, -1.3663943, 0.0000000, -1.3663943, 2.7327...
## $ notest_p    <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ph_notest_p <dbl> 0.2010949, 0.2010949, 0.2010949, 0.2010949, 0.0000000, 0.0000000, 0.0000000, 0.000000...
## $ idsci_trans <dbl> 9.639460e-01, 5.422196e-01, 9.639460e-01, 1.710664e-01, 6.852229e-01, 3.391082e-01, 8...
## $ biocred3    <dbl> 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 0, 1, 1, 2, 1...
## $ certified   <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1...
## $ degr3       <dbl> 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0...
## $ confident   <dbl> 0, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0...

@pdurbin
Member

pdurbin commented Jan 25, 2017

@monogan now that you've got the "dataverse" R package installed and the script from @leeper above works for you, as you reported at IQSS/dataverse-client-r#11 (comment) (thanks for all the help, Thomas), do you have any more thoughts on this issue? Again, I think you're using the right approach by encouraging people to use a layer of abstraction such as the R dataverse client, which can evolve along with Dataverse APIs.

@pdurbin
Member

pdurbin commented Jan 25, 2017

@jeisner @smakonin I'd be happy to hear your latest thoughts on this issue as well.

@ghost
Author

ghost commented Jan 30, 2017

@pdurbin @leeper @jeisner @smakonin I'll admit that I'm not exactly sure how much @leeper had to do to allow the extraction to work off of a DOI. Essentially, I think it's great if an R package like Thomas's can pull the file as easily as this does--I can very easily teach an undergraduate how to get data from Dataverse if it's as easy as Thomas @leeper 's software makes it. (Thank you again Thomas for such great, easy software.)

So am I correct in understanding that Thomas was able to update the software so that the same DOI could get the same data, even as the API changed? Am I also correct in understanding that DOIs stay stable even as Dataverse's API updates? Essentially, if the DOI is stable and there's always some abstraction that can easily get the data, I think that's a great situation.

My main goal all along was to be able to print in ink a way to get a specific data file from Dataverse, and for that printed code to work in perpetuity so that the printed material stays good. The great work Thomas has been doing seems to be helping that cause. I'm not sure how easily software like the "dataverse" package can keep working as the API is adjusted.

Does that get at what you were wondering, @pdurbin? Again, thank you all for everything you do.

@pdurbin
Member

pdurbin commented Feb 6, 2017

@monogan yes, exactly. The same DOI will always reference the same dataset. As get_file("BPchap7.tab", "doi:10.7910/DVN/ARKOTI") demonstrates, the combination of the DOI and the filename is enough to download a file. Yes, DOIs are stable. I'm actually going to suggest that we close this issue if that's ok with you.
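
A minimal sketch of how the DOI-plus-filename lookup could be wrapped so that teaching code only ever mentions those two pieces. The wrapper name read_dataverse_file and its reader and fetch arguments are hypothetical. It assumes that get_file() from the "dataverse" R package returns the file contents as a raw vector, and it leaves the choice of parser to the caller, because the format of the downloaded bytes depends on the file:

```r
# Hypothetical wrapper: fetch a file by filename + dataset DOI, then parse it.
# Assumption: `fetch` returns the file contents as a raw vector, which is
# what dataverse::get_file() is assumed to do here.
read_dataverse_file <- function(filename, doi, reader,
                                fetch = dataverse::get_file) {
  raw_bytes <- fetch(filename, doi)
  # Write the bytes to a temporary file that is removed when we are done
  tmp <- tempfile(fileext = paste0(".", tools::file_ext(filename)))
  on.exit(unlink(tmp))
  writeBin(raw_bytes, tmp)
  # `reader` is any function that takes a file path, e.g. haven::read_dta
  reader(tmp)
}

# Possible usage (needs network access and the "dataverse" and "haven" packages):
# dat <- read_dataverse_file("BPchap7.tab", "doi:10.7910/DVN/ARKOTI",
#                            reader = haven::read_dta)
```

Because the default for fetch is only evaluated when no replacement is supplied, the wrapper can be exercised offline by passing a stub function in its place.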

#3584 is somewhat related in the sense that we are going to soon expose a URL for public files on the new file landing page.

@ghost
Author

ghost commented Feb 6, 2017

@pdurbin Awesome, Philip. Sounds good to close it to me. Thank you.

@leeper
Member

leeper commented Feb 7, 2017

Thanks, @pdurbin and @monogan! It's been a useful discussion for the development of the R package.

@pdurbin
Member

pdurbin commented Feb 7, 2017

@leeper absolutely, I've connected this issue to my pull request for #3584 which is #3608 and will put this in QA to be closed, potentially.

@kcondon
Contributor

kcondon commented Feb 7, 2017

At the top of the ticket, what appears to be asked for is a persistent identifier or DOI that would allow download access in perpetuity. A download API link is not that.

@pdurbin
Member

pdurbin commented Feb 17, 2017

@djbrooke and I discussed this issue last week. I'm closing it because @monogan seems to be all set. At the same time I'm reopening #2438 which is the "DOI for files" issue, something that @kcondon points out is not supported in Dataverse.

@pdurbin pdurbin closed this as completed Feb 17, 2017