Files: Need Persistent Identifiers/URL's for Data Files #2700

ghost · 2015-10-27T20:12:58Z

Dear colleagues:

One thing I've been trying to be able to do (and Phil Durbin has spent a lot of time helping on this) is reference a data file from Dataverse by a URL. The tricky thing about this is that I'd like to be able to print code that directly references a file that works in perpetuity. The motivator for this was a new R book I wrote, so once the code was printed, I wouldn't have the chance to update it.

Phil has come up with something like this (function defined and then file loaded):

download.dataverse.file <- function(url) {
  # Fail loudly when no URL is supplied, instead of returning a message string
  if (missing(url) || length(url) == 0L) {
    stop(
      "Please provide a URL to a file: http://guides.dataverse.org/en/latest/api/dataaccess.html"
    )
  }
  url.to.download = url
  tsvfile = 'file.tsv'
  download.file(url = url.to.download, destfile =
                  tsvfile, method = 'curl')
  mydata <- read.table(tsvfile, header = TRUE, sep = "\t")
  unlink(tsvfile)
  return(mydata)
}

#LOAD FILE#
evolution <- download.dataverse.file(url = 'https://dataverse.harvard.edu/api/v1/access/datafile/2692295')

The biggest issue is that versions of Dataverse may change in the future, so the URLs may change as well. So if there's some way we could guarantee a URL that would work for a Dataverse-based file forever, that would be ideal. Thank you for any possible help!
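One way to limit the damage if the URL scheme ever changes is to keep the server and path in a single place rather than hard-coding the full URL wherever a file is loaded. The sketch below assumes the path shape of the URL quoted above stays the same; `dataverse_file_url` and `read_dataverse_tsv` are hypothetical helper names for illustration, not part of Dataverse or of any R package.

```r
# Hypothetical helper: build a download URL from a numeric file id.
# The default path mirrors the URL used above; whether a future Dataverse
# release keeps this shape is an assumption, not a guarantee.
dataverse_file_url <- function(file_id,
                               server = "https://dataverse.harvard.edu",
                               path = "/api/v1/access/datafile/") {
  stopifnot(length(file_id) == 1L, !is.na(file_id))
  # format() avoids scientific notation for large numeric ids
  paste0(server, path, format(file_id, scientific = FALSE))
}

# Hypothetical helper: download a tab-separated file and read it.
# A temporary file is used and removed even if reading fails.
read_dataverse_tsv <- function(url) {
  tmp <- tempfile(fileext = ".tsv")
  on.exit(unlink(tmp), add = TRUE)
  utils::download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  utils::read.delim(tmp, header = TRUE, sep = "\t")
}
```

With these in place, the printed call would be `read_dataverse_tsv(dataverse_file_url(2692295))`. This does not make the URL permanent; it only means a reader has one default to change if the scheme moves.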


pdurbin · 2015-10-29T15:53:13Z

@monogan thank you very much for opening this issue! I'm hoping other people in the community add their comments. @bencomp just suggested that we should clarify issue titles and I completely agree so would you agree to changing the title from "URL Referencing Data Sets" to "URL Referencing Files"? This goes back to my post on terminology we use in Dataverse. I think what you're calling a data set we call a file. Sorry if this is confusing! If you agree to this change to the title, please go ahead and change it. I did go ahead and edit the markdown slightly and linked to my commit with the R script I wrote.

Also, I wanted to mention that I opened a couple of related issues yesterday based on this one:

As I wrote there I think the best solution is to encourage the use of language-specific libraries so that people are insulated from API changes.

Finally, the original thread that kicked off this discussion may be of interest, as well as the internal ticket (for those who can access it).

bencomp · 2015-10-29T16:46:49Z

Yes, @pdurbin and I indeed chatted about this issue and about updating issue titles to keep an overview of open issues. Could you summarise the issue as "Dataverse should provide persistent URLs for files"? If yes, could you update the issue title accordingly?

ghost · 2015-10-29T20:21:23Z

@pdurbin @bencomp Yes, I agree with this change, so I have made it based on Ben's revised suggestion. Thank you both for all of your help on this! I definitely think this is something that a lot of folks will be interested in for various settings, as I've heard about many students in test settings saying that they wanted to be able to just grab Dataverse files directly in R. So thank you all, and please let me know if I can do anything to assist.

eaquigley · 2015-10-29T20:24:13Z

Is this the same as #2438?

pdurbin · 2015-10-29T20:35:54Z

@monogan thanks for updating the title.

@eaquigley #2438 about unique identifiers for files is definitely related! Thanks for linking to it. I meant to do that myself.

We can all think about this: What are the pieces of information necessary to download a file? I would say:

The DOI of the dataset
The name of the file within a version (since you can rename files and bump the version). But! While it's true today (Dataverse 4.2.1) that filenames are unique within a version, people like @dpwrussell and @cchoirat have pointed out in #2249 ("File Hierarchy: I want to be able to preserve my dataset's files' directory structure, for easy import, computation, and navigation.") that they don't really like the flat file structure (everything in one "folder") in Dataverse. So right now, there can only be one "readme.txt" per dataset version, but if we ever support a folder structure, you can imagine there being multiple "readme.txt" files in a version of a dataset.
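To make the filename concern concrete, here is a small sketch with two made-up file listings for a single dataset version. The listings and the helper `name_is_unambiguous` are hypothetical; the point is only that a bare file name stops being a usable key once folders are allowed.

```r
# Hypothetical file listings for one dataset version.
flat_listing   <- c("readme.txt", "survey.tab", "codebook.pdf")
nested_listing <- c("readme.txt", "data/readme.txt", "data/survey.tab")

# TRUE when every bare file name (folder stripped) occurs at most once,
# i.e. when DOI + version + file name is enough to pick out one file.
name_is_unambiguous <- function(paths) {
  anyDuplicated(basename(paths)) == 0L
}

name_is_unambiguous(flat_listing)    # flat structure
name_is_unambiguous(nested_listing)  # folder structure
```

The flat listing passes the check and the nested one does not, because "readme.txt" appears twice once the folder is ignored. With folders, the key would need to be the full path within the version rather than the file name alone.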

pdurbin · 2015-10-30T00:19:53Z

#2416 is also related and provides an example of a URL from which you can download a file from Zenodo. 30 years from now will you be able to download "MHC_PosSel_dataset.zip" from the URL https://zenodo.org/record/18898/files/MHC_PosSel_dataset.zip ? (This file is part of https://zenodo.org/record/18898#.VjK086KpOL6 .) Is that file URL stable? Will it work in perpetuity (to use @monogan 's phrase)? @timbl has taught us that Cool URIs don't change and in a perfect world you'd never find a dead link. I say this not to knock Zenodo (they have really cool integration with GitHub and provide a great service). I'm just trying to reason about what forever means and what sort of promises can be made when it comes to URLs.

pdurbin · 2015-11-08T19:22:35Z

> As I wrote there I think the best solution is to encourage the use of language-specific libraries so that people are insulated from API changes.

@monogan heads up that @leeper just pushed an initial draft to https://github.com/IQSS/dataverse-client-r ! I'm sure he'd appreciate feedback.

ghost · 2015-11-10T03:59:10Z

@pdurbin @leeper This looks very promising!

jeisner · 2016-08-27T05:44:30Z

I'd like to request this feature as well. I see that it's been marked as "Triaged": is that equivalent to wont-fix or to will-consider-again-in-next-version?

djbrooke · 2016-10-07T17:27:56Z

I've closed #2416 because it's very similar.

@jeisner - sorry to miss your last question on this issue. The "triaged" label is left over from a previous process and it is not currently in use.

The best way to see what the team is currently working on is on our waffle board:

https://waffle.io/IQSS/dataverse

The best way to see what the team is planning to work on next is our releases roadmap page (soon to be updated):

http://dataverse.org/releases-roadmap

smakonin · 2016-10-27T16:58:44Z

I would definitely like this feature sooner rather than later. I think that it is essential for datasets with lots of files (like mine). I want to be able to blog/send posts about (and link to) specific files.

mheppler · 2016-10-27T19:49:29Z

@lwo commented on Sep 29, 2015 in issue #2438

I am entering via: https://groups.google.com/forum/#!msg/dataverse-community/gtz2npccWjU/i7_EVs2LBgAJ

... I think persistent identifiers should not be derived from the local file id. An organization may want to migrate from their repository solution to dataverse and be able to import their PIDs if they already have them. And then rebind them to the dataverse dataset and future file pages. That would be possible with a String type, but not with a number.

I am moving that comment here, and closing that issue as a duplicate of this.
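The argument for string-typed identifiers can be sketched as a lookup table in which the persistent identifier is the key and the local file id is only the value. Everything below is hypothetical (the identifiers, the ids, and the `rebind` helper); it is not how Dataverse stores identifiers, only an illustration of why an imported PID can survive a migration when it is not derived from the local id.

```r
# Hypothetical registry: persistent identifier (string) -> local file id.
# The identifiers are placeholders, not real handles or DOIs.
registry <- c("hdl:00000/EXAMPLE-1" = 101L,
              "doi:10.0000/EXAMPLE/2" = 102L)

# After a migration the local id changes, but the published key does not.
rebind <- function(registry, pid, new_local_id) {
  stopifnot(is.character(pid), length(pid) == 1L, pid %in% names(registry))
  registry[pid] <- as.integer(new_local_id)
  registry
}

migrated <- rebind(registry, "hdl:00000/EXAMPLE-1", 2692295)
migrated[["hdl:00000/EXAMPLE-1"]]
```

If the identifier were the local number itself, there would be nothing left to rebind: the published reference would change whenever the storage did.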

djbrooke · 2017-01-17T22:47:55Z

@pdurbin @mheppler - was this delivered with the new file page?

pdurbin · 2017-01-18T01:04:37Z

@djbrooke well, @monogan first opened this issue so maybe we should check with him. There is a new file page for the file he mentioned in the description: https://dataverse.harvard.edu/file.xhtml?fileId=2692295&version=1.0

@monogan what do you think?

ghost · 2017-01-20T00:56:52Z

@pdurbin @djbrooke I've worked several hours on this, and I'm struggling pretty badly. My goal has been to see if I can get @leeper 's program to work on this, but I cannot seem to install his "dataverse" package. Maybe one of you has more luck. Here's what my planned code was:
library(ghit); ghit::install_github("iqss/dataverse-client-r"); library("dataverse"); get_dataset("doi:10.7910/DVN/ARKOTI")

I installed Rtools at one point thinking that might help, but this fails on the second line. For what it's worth, this is my error message:
Error in build_and_insert(p$pkgname, d, vers, build_args, verbose = verbose) : Package build for dataverse failed! In addition: Warning message: running command '"C:/Users/COX'SG~1/DOCUME~1/R/R-33~1.2/bin/x64/R" CMD build C:\Users\COX'SG~1\AppData\Local\Temp\RtmpGACfp0\dataverse-client-r1ca4271417e6' had status 1
Can any of you get my code to run? Anyone have any idea why I can't install the "dataverse" package @leeper wrote? So I hate that I'm failing before I even get to the starting line, but any advice would be appreciated. My hope would be that, once I got the "dataverse" package installed that DOI still works at the new page.

pdurbin · 2017-01-20T11:55:50Z

@monogan I think you're taking the right approach by using Dataverse client libraries such as the "dataverse" R package to insulate you from any changes in the Dataverse APIs. I'm not sure how to get that package installed on Windows and I just opened IQSS/dataverse-client-r#11 to try to find out how best to get help.

Please do not worry about the DOI not working. The only thing that was added recently to Dataverse is a landing page per file. The APIs themselves didn't change and we intend to keep APIs as stable as possible.

ghost · 2017-01-20T22:34:33Z

@pdurbin Thank you for opening up that comment. I appreciate your help in trying to sort this out. Is there anything else I need to be testing right now? Also very grateful to @leeper for all of the work he has done on this issue.

leeper · 2017-01-22T18:18:40Z

Just to confirm, this R code should work for the example file being discussed:

library("dataverse")
f <- get_file("BPchap7.tab", "doi:10.7910/DVN/ARKOTI")

# optionally, load the data:
dat <- haven::read_dta(f)
dplyr::glimpse(haven::read_dta(f))
## Observations: 854
## Variables: 14
## $ st_fip      <dbl> 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ female      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 9, 1, 0, 1...
## $ hrs_allev   <dbl> 8.0, 3.0, 26.0, 44.0, 8.0, 4.0, 8.0, 4.0, 8.0, 8.0, 8.0, 5.5, 8.0, 21.0, 12.0, 9.5, 1...
## $ evol_course <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0...
## $ phase1      <dbl> 0.2010949, 0.2010949, 0.2010949, 0.2010949, -1.3663943, -1.3663943, -1.3663943, -1.36...
## $ senior_c    <dbl> 1, 0, 0, 1, 1, 0, 1, -2, -1, 1, -2, -2, 1, 1, 1, 0, 0, 1, 0, -2, -1, 1, 0, 0, 0, -1, ...
## $ ph_senior   <dbl> 0.2010949, 0.0000000, 0.0000000, 0.2010949, -1.3663943, 0.0000000, -1.3663943, 2.7327...
## $ notest_p    <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ph_notest_p <dbl> 0.2010949, 0.2010949, 0.2010949, 0.2010949, 0.0000000, 0.0000000, 0.0000000, 0.000000...
## $ idsci_trans <dbl> 9.639460e-01, 5.422196e-01, 9.639460e-01, 1.710664e-01, 6.852229e-01, 3.391082e-01, 8...
## $ biocred3    <dbl> 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 0, 1, 1, 2, 1...
## $ certified   <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1...
## $ degr3       <dbl> 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0...
## $ confident   <dbl> 0, 0, 2, 2, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0...

pdurbin · 2017-01-25T14:06:03Z

@monogan now that you've got the "dataverse" R package installed and the script from @leeper above works for you, as you reported at IQSS/dataverse-client-r#11 (comment) (thanks for all the help, Thomas), do you have any more thoughts on this issue? Again, I think you're using the right approach by encouraging people to use a layer of abstraction such as the R dataverse client, which can evolve along with Dataverse APIs.

pdurbin · 2017-01-25T14:07:43Z

@jeisner @smakonin I'd be happy to hear your latest thoughts on this issue as well.

ghost · 2017-01-30T22:06:44Z

@pdurbin @leeper @jeisner @smakonin I'll admit that I'm not exactly sure how much @leeper had to do to allow the extraction to work off of a DOI. Essentially, I think it's great if an R package like Thomas's can pull the file as easily as this does--I can very easily teach an undergraduate how to get data from Dataverse if it's as easy as Thomas @leeper 's software makes it. (Thank you again Thomas for such great, easy software.)

So am I correct in understanding that Thomas was able to update the software so that the same DOI could get the same data, even as the API changed? Am I also correct in understanding that DOI's stay stable even as Dataverse's API updates? Essentially, if the DOI is stable and there's always some abstraction that can easily get the data, I think that's a great situation.

My main goal all along was to be able to print in ink a way to get a specific data file from Dataverse, and for that printed code to work in perpetuity so that the printed material stays good. The great work Thomas has been doing seems to be helping that cause. I'm not sure how easy it will be for software like the "dataverse" package to keep working as the API changes.

Does that get at what you were wondering, @pdurbin? Again, thank you all for everything you do.

pdurbin · 2017-02-06T21:31:03Z

@monogan yes, exactly. The same DOI will always reference the same dataset. As get_file("BPchap7.tab", "doi:10.7910/DVN/ARKOTI") demonstrates, the combination of the DOI and the filename is enough to download a file. Yes, DOIs are stable. I'm actually going to suggest that we close this issue if that's ok with you.
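For code that has to be printed, the DOI-plus-filename call can be wrapped so that the package check and the server setting sit in one place. This is a sketch under assumptions: `fetch_by_doi` is a hypothetical wrapper name, the server default is the Harvard installation discussed in this thread, and it relies on the "dataverse" package reading its target installation from the `DATAVERSE_SERVER` environment variable.

```r
# Hypothetical wrapper around the "dataverse" package: fetch one file
# identified by the dataset DOI and the file name within that dataset.
fetch_by_doi <- function(filename, doi, server = "dataverse.harvard.edu") {
  if (!requireNamespace("dataverse", quietly = TRUE)) {
    stop("Install the 'dataverse' package first.", call. = FALSE)
  }
  # Assumption: the package picks up the installation from this variable.
  Sys.setenv(DATAVERSE_SERVER = server)
  # Same positional call as in the example earlier in this thread.
  dataverse::get_file(filename, doi)
}
```

Usage would be `f <- fetch_by_doi("BPchap7.tab", "doi:10.7910/DVN/ARKOTI")`. The wrapper adds no stability of its own; the stable parts are the DOI and the file name, and the package is the piece expected to absorb API changes.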

#3584 is somewhat related in the sense that we are going to soon expose a URL for public files on the new file landing page.

ghost · 2017-02-06T23:44:39Z

@pdurbin Awesome, Philip. Sounds good to close it to me. Thank you.

leeper · 2017-02-07T06:34:21Z

Thanks, @pdurbin and @monogan! It's been a useful discussion for the development of the R package.

pdurbin · 2017-02-07T16:12:59Z

@leeper absolutely, I've connected this issue to my pull request for #3584 which is #3608 and will put this in QA to be closed, potentially.

kcondon · 2017-02-07T20:16:59Z

At the top of the ticket, what appears to be asked for is a persistent identifier or DOI that would allow download access in perpetuity. A download API link is not that.
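The distinction can be seen by looking at what a dataset DOI turns into when it goes through the public DOI resolver. The helper `doi_to_resolver_url` is a hypothetical name; the only assumption is the standard `https://doi.org/` resolver prefix.

```r
# Hypothetical helper: turn "doi:10.7910/DVN/ARKOTI" into its resolver URL.
doi_to_resolver_url <- function(doi) {
  stopifnot(is.character(doi), length(doi) == 1L)
  bare <- sub("^doi:", "", doi, ignore.case = TRUE)
  paste0("https://doi.org/", bare)
}

doi_to_resolver_url("doi:10.7910/DVN/ARKOTI")
```

The resulting URL identifies the dataset as a whole and resolves to its landing page, not to a single file. A download API link points at one file but carries no persistence guarantee, which is the gap a file-level DOI would have to fill.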

pdurbin · 2017-02-17T11:44:32Z

@djbrooke and I discussed this issue last week. I'm closing it because @monogan seems to be all set. At the same time I'm reopening #2438 which is the "DOI for files" issue, something that @kcondon points out is not supported in Dataverse.

pdurbin referenced this issue Oct 28, 2015

added script to download file via R #2438

812424a

This was referenced Oct 28, 2015

Add integration test to download a file by filename IQSS/dataverse-client-python#29

Open

Add integration test to download a file by filename IQSS/dataverse-client-r#2

Closed

ghost changed the title ~~URL Referencing Data Sets~~ Dataverse should provide persistent URLs for files Oct 29, 2015

pdurbin mentioned this issue Oct 30, 2015

Hovering mouse over Download button does not reveal the URL of the file and the URL does not contain the file name #2416

Closed

pdurbin added a commit that referenced this issue Nov 9, 2015

official R package for Dataverse 4 shipped! #2700

99913fc

mercecrosas modified the milestone: In Review Nov 30, 2015

mheppler added Feature: File Upload & Handling Component: Code Infrastructure formerly "Feature: Code Infrastructure" labels Jan 28, 2016

scolapasta added Status: Triaged and removed Status: Dev labels Jan 28, 2016

scolapasta modified the milestone: Not Assigned to a Release Jan 28, 2016

mheppler mentioned this issue Oct 27, 2016

File: Individual Landing Pages for Files #2465

Closed

mheppler changed the title ~~Dataverse should provide persistent URLs for files~~ Files: Need Persistent Identifiers/URL's for Data Files Oct 27, 2016

mheppler removed the Status: Triaged label Oct 27, 2016

mheppler mentioned this issue Oct 27, 2016

Files: Persistent Identifiers for Files (DOIs for files) #2438

Closed

mheppler mentioned this issue Dec 21, 2016

Provide File-Level Data Citation for All Files #1400

Closed

djbrooke modified the milestone: 4.6.2 - Provenance Jan 17, 2017

pdurbin mentioned this issue Jan 20, 2017

Trouble installing dataverse R package IQSS/dataverse-client-r#11

Closed

pdurbin mentioned this issue Feb 6, 2017

show URL for file on file landing page #3584 #3608

Merged

11 tasks

pdurbin added the Status: QA label Feb 7, 2017

kcondon added Status: Code Review and removed Status: QA labels Feb 7, 2017

djbrooke self-assigned this Feb 8, 2017

djbrooke removed the Status: Code Review label Feb 8, 2017

pdurbin closed this as completed Feb 17, 2017
