roadoi interacts with the Unpaywall API, a simple web interface that links DOIs to open access versions of scholarly works. The API powers Unpaywall.
This client supports the most recent API Version 2.
API Documentation: http://unpaywall.org/products/api
Use the oadoi_fetch() function in this package to get open access status information and full-text links from Unpaywall.
roadoi::oadoi_fetch(dois = c("10.1038/ng.3260", "10.1093/nar/gkr1047"),
                    email = "[email protected]")
#> # A tibble: 2 x 18
#> doi best_oa_location oa_locations data_standard is_oa genre oa_status
#> <chr> <list> <list> <int> <lgl> <chr> <chr>
#> 1 10.1… <tibble [1 × 11… <tibble [1 … 2 TRUE jour… green
#> 2 10.1… <tibble [1 × 9]> <tibble [6 … 2 TRUE jour… gold
#> # … with 11 more variables: has_repository_copy <lgl>,
#> # journal_is_oa <lgl>, journal_is_in_doaj <lgl>, journal_issns <chr>,
#> # journal_issn_l <chr>, journal_name <chr>, publisher <chr>,
#> # title <chr>, year <chr>, updated <chr>, authors <list>
There are no API restrictions. However, providing an email address is required, and a rate limit of 100,000 requests per day is suggested. If you need to access more data, use the data dump instead.
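One way to stay within the suggested limit is to split a long DOI vector into batches and send one batch per day. The following is a minimal sketch in base R, assuming the limit is counted per day; `batch_dois()` is a hypothetical helper, not part of roadoi, and the DOIs are made up.

```r
# Hypothetical helper (not part of roadoi): split DOIs into batches that each
# stay within a daily request budget.
batch_dois <- function(dois, per_day = 100000L) {
  stopifnot(is.character(dois), per_day > 0)
  dois <- unique(dois)
  # Batch index per DOI: 1 for the first `per_day` DOIs, 2 for the next, ...
  batch_id <- ceiling(seq_along(dois) / per_day)
  unname(split(dois, batch_id))
}

# Made-up DOIs and a tiny budget, so the batching is visible.
example_dois <- sprintf("10.1000/example.%d", 1:5)
batches <- batch_dois(example_dois, per_day = 2)
lengths(batches)
```

Each element of `batches` could then be passed to oadoi_fetch() on a separate day.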
This package also has an RStudio Addin for easily finding free full-texts in RStudio.
Install and load from CRAN:
install.packages("roadoi")
library(roadoi)
To install the development version, use the devtools package:
devtools::install_github("ropensci/roadoi")
library(roadoi)
Open access copies of scholarly publications are sometimes hard to find. Some are published in open access journals. Others are made freely available as preprints before publication, and others are deposited in institutional repositories, digital archives maintained by universities and research institutions. This document guides you through roadoi, an R client that makes it easy to search for these open access copies by interfacing with the Unpaywall service, where DOIs are matched with freely available full-texts from open access journals and archives.
Unpaywall, developed and maintained by the team of Impactstory, is a non-profit service that finds open access copies of scholarly literature simply by looking up a DOI (Digital Object Identifier). It not only returns open access full-text links, but also helpful metadata about the open access status of a publication such as licensing or provenance information.
Unpaywall uses different data sources to find open access full-texts including:
- Crossref: a DOI registration agency serving major scholarly publishers.
- Directory of Open Access Journals (DOAJ): a registry of open access journals.
- Various OAI-PMH metadata sources. OAI-PMH is a protocol often used by open access journals and repositories such as arXiv and PubMed Central.
See Piwowar et al. (2018) for a comprehensive overview of Unpaywall.
There is one major function to talk with Unpaywall, oadoi_fetch(), taking a character vector of DOIs and your email address as required arguments.
library(roadoi)
roadoi::oadoi_fetch(dois = c("10.1186/s12864-016-2566-9",
                             "10.1103/physreve.88.012814"),
                    email = "[email protected]")
#> # A tibble: 2 x 18
#> doi best_oa_location oa_locations data_standard is_oa genre oa_status
#> <chr> <list> <list> <int> <lgl> <chr> <chr>
#> 1 10.1… <tibble [1 × 9]> <tibble [5 … 2 TRUE jour… gold
#> 2 10.1… <tibble [1 × 9]> <tibble [2 … 2 TRUE jour… hybrid
#> # … with 11 more variables: has_repository_copy <lgl>,
#> # journal_is_oa <lgl>, journal_is_in_doaj <lgl>, journal_issns <chr>,
#> # journal_issn_l <chr>, journal_name <chr>, publisher <chr>,
#> # title <chr>, year <chr>, updated <chr>, authors <list>
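DOIs collected from spreadsheets or reference managers often carry resolver prefixes, stray whitespace or mixed case. The sketch below tidies such input in base R before it is passed to oadoi_fetch(). `normalize_dois()` is a hypothetical helper, and its pattern is only a rough heuristic for what looks like a DOI, not the validation Unpaywall applies.

```r
# Hypothetical helper (not part of roadoi): clean a character vector of DOIs.
normalize_dois <- function(x) {
  x <- trimws(tolower(x))
  # Drop a leading resolver URL or "doi:" prefix, if present.
  x <- sub("^(https?://(dx\\.)?doi\\.org/|doi:[[:space:]]*)", "", x)
  # Keep strings shaped like "10.<registrant>/<suffix>" (heuristic only).
  x <- x[grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", x)]
  unique(x)
}

clean_dois <- normalize_dois(c(
  " 10.1186/S12864-016-2566-9",
  "https://doi.org/10.1103/physreve.88.012814",
  "10.1103/PhysRevE.88.012814",
  "ldld"
))
clean_dois
```

Lowercasing mirrors the doi column, which is always returned in lowercase, so duplicates that differ only in case collapse into one request.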
The client supports API version 2. According to the Unpaywall Data Format, the following variables are returned:
Column | Description |
---|---|
doi | DOI (always in lowercase). |
best_oa_location | List-column describing the best OA location. The algorithm prioritizes publisher-hosted content (e.g. hybrid or gold). |
oa_locations | List-column of all the OA locations. |
data_standard | Indicates the data collection approaches used for this resource. 1 mostly uses Crossref for hybrid detection. 2 uses more comprehensive hybrid detection methods. |
is_oa | Is there an OA copy (logical)? |
genre | Publication type. |
oa_status | Classifies OA resources by location and license terms as one of: gold, hybrid, bronze, green or closed. See https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean- for more information. |
has_repository_copy | Is a full-text available in a repository? |
journal_is_oa | Is the article published in a fully OA journal? Uses the Directory of Open Access Journals (DOAJ) as source. |
journal_is_in_doaj | Is the journal listed in the Directory of Open Access Journals (DOAJ)? |
journal_issns | ISSNs, i.e. unique codes to identify journals. |
journal_issn_l | Linking ISSN. |
journal_name | Journal title. |
publisher | Publisher. |
title | Publication title. |
year | Year published. |
published_date | Date published. |
updated | Time when the data for this resource was last updated. |
authors | Lists authors (if available). |
The columns best_oa_location and oa_locations are list-columns that contain useful metadata about the OA sources found by Unpaywall. These are:
Column | Description |
---|---|
evidence | How the OA location was found and how Unpaywall characterizes it. |
host_type | OA full-text provided by publisher or repository. |
license | The license under which this copy is published. |
url | The URL where you can find this OA copy. |
version | The content version accessible at this location following the DRIVER 2.0 Guidelines (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings). |
Note that the Unpaywall schema is only informally described. See also https://unpaywall.org/data-format.
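Because the schema is only informally described, downstream code may want to check that the columns it relies on are actually present. A minimal base R sketch follows; `missing_columns()` is a hypothetical helper, and `result` is a hand-built stand-in for a returned data frame, not real API output.

```r
# Hypothetical helper (not part of roadoi): which expected columns are absent?
missing_columns <- function(result, expected) {
  setdiff(expected, names(result))
}

# Stand-in for a result; `oa_status` is deliberately left out.
result <- data.frame(doi = "10.1000/example.1", is_oa = TRUE,
                     stringsAsFactors = FALSE)
absent <- missing_columns(result, c("doi", "is_oa", "oa_status"))
if (length(absent) > 0) {
  message("Columns not returned: ", paste(absent, collapse = ", "))
}
```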
There are at least two ways to simplify these list-columns.
To get the full-text links from the list-column best_oa_location, you may want to use purrr::map_chr().
library(dplyr)
roadoi::oadoi_fetch(dois = c("10.1186/s12864-016-2566-9",
                             "10.1103/physreve.88.012814"),
                    email = "[email protected]") %>%
  dplyr::mutate(
    urls = purrr::map(best_oa_location, "url") %>%
      purrr::map_if(purrr::is_empty, ~ NA_character_) %>%
      purrr::flatten_chr()
  ) %>%
  .$urls
#> [1] "https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/s12864-016-2566-9"
#> [2] "https://link.aps.org/accepted/10.1103/PhysRevE.88.012814"
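The same idea can be tried without calling the API. The sketch below builds a small stand-in with a list-column by hand and extracts one URL per row in base R, returning NA where no OA location exists. `oa_mock` and `first_url()` are hypothetical names and all values are made up; the real nested tibbles carry more columns.

```r
# Hand-built stand-in for a fetch result (made-up values).
oa_mock <- data.frame(
  doi = c("10.1000/example.1", "10.1000/example.2"),
  stringsAsFactors = FALSE
)
# List-column: one data frame per DOI, with zero rows when no OA copy exists.
oa_mock$best_oa_location <- list(
  data.frame(url = "https://example.org/fulltext.pdf",
             host_type = "publisher", stringsAsFactors = FALSE),
  data.frame(url = character(0),
             host_type = character(0), stringsAsFactors = FALSE)
)

# Return the first URL of a location, or NA when the location is empty.
first_url <- function(location) {
  if (is.null(location) || nrow(location) == 0) {
    NA_character_
  } else {
    location$url[1]
  }
}

oa_mock$url <- vapply(oa_mock$best_oa_location, first_url, character(1))
oa_mock[, c("doi", "url")]
```

Using `vapply()` with `character(1)` guarantees a plain character vector of the same length as the input, which is what the empty-to-NA step in the pipeline above also ensures.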
If you want to gather all full-text links and explore where these links are hosted, simplify the list-column oa_locations with tidyr::unnest(). Note that the column updated appears in both the main data.frame and the nested list-column, which causes an error when flattening into regular columns. Either de-select updated or set the argument names_repair.
library(dplyr)
library(tidyr)
roadoi::oadoi_fetch(dois = c("10.1186/s12864-016-2566-9",
                             "10.1103/physreve.88.012814"),
                    email = "[email protected]") %>%
  tidyr::unnest(oa_locations, names_repair = "universal") %>%
  dplyr::mutate(
    hostname = purrr::map(url, httr::parse_url) %>%
      purrr::map_chr(., "hostname", .default = NA_character_)
  ) %>%
  # drop a leading "www." only; the dot is escaped so it matches literally
  dplyr::mutate(hostname = gsub("^www\\.", "", hostname)) %>%
  dplyr::group_by(hostname) %>%
  dplyr::summarize(hosts = n())
#> # A tibble: 7 x 2
#> hostname hosts
#> <chr> <int>
#> 1 arxiv.org 1
#> 2 bmcgenomics.biomedcentral.com 1
#> 3 doi.org 1
#> 4 europepmc.org 1
#> 5 link.aps.org 1
#> 6 ncbi.nlm.nih.gov 1
#> 7 pub.uni-bielefeld.de 1
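The hostname step can also be sketched in base R on a plain character vector. The URLs below are made up, and the pattern is deliberately simple (it does not handle user info such as `user@host`), so treat it as an illustration rather than a replacement for httr::parse_url().

```r
# Made-up URLs standing in for the `url` column after unnesting.
urls <- c(
  "https://www.example.org/article/1.pdf",
  "http://repository.example.edu/record/42",
  "https://example.org/article/2"
)

# Strip the scheme, then everything from the first "/", ":", "?" or "#" onwards.
hostname <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", urls)
hostname <- sub("[/:?#].*$", "", hostname)
# Remove a leading "www." only; the escaped dot matches a literal dot.
hostname <- sub("^www\\.", "", hostname)

# Count links per host.
host_counts <- as.data.frame(table(hostname), responseName = "hosts",
                             stringsAsFactors = FALSE)
host_counts
```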
Note that the fields returned might change according to the Unpaywall API specs.
There are no API restrictions. However, Unpaywall requires an email address when using its API. If you are too tired to type in your email address every time, you can store the email in the .Renviron file with the option roadoi_email:
```
roadoi_email = "[email protected]"
```
You can open your `.Renviron` file by calling
```r
file.edit("~/.Renviron")
```
Save the file and restart your R session. To stop sharing the email when using roadoi, delete it from your .Renviron file.
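To check that the variable is visible to R after a restart, it can be read with `Sys.getenv()`. The sketch below wraps this in a hypothetical helper, `get_unpaywall_email()`, that fails early when nothing is set. The address is a placeholder, and `Sys.setenv()` is used here only so the example runs without editing .Renviron.

```r
# Hypothetical helper (not part of roadoi): read the stored email or stop.
get_unpaywall_email <- function() {
  email <- Sys.getenv("roadoi_email", unset = "")
  if (!nzchar(email)) {
    stop("No email found. Set roadoi_email in ~/.Renviron and restart R.",
         call. = FALSE)
  }
  email
}

# For illustration only: set the variable for the current session.
Sys.setenv(roadoi_email = "name@example.org")
get_unpaywall_email()
```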
To follow your API call, and to estimate the time until completion, use the .progress parameter inherited from plyr to display a progress bar.
roadoi::oadoi_fetch(dois = c("10.1186/s12864-016-2566-9",
                             "10.1103/physreve.88.012814"),
                    email = "[email protected]",
                    .progress = "text")
#>
|
| | 0%
|
|================================ | 50%
|
|=================================================================| 100%
#> # A tibble: 2 x 18
#> doi best_oa_location oa_locations data_standard is_oa genre oa_status
#> <chr> <list> <list> <int> <lgl> <chr> <chr>
#> 1 10.1… <tibble [1 × 9]> <tibble [5 … 2 TRUE jour… gold
#> 2 10.1… <tibble [1 × 9]> <tibble [2 … 2 TRUE jour… hybrid
#> # … with 11 more variables: has_repository_copy <lgl>,
#> # journal_is_oa <lgl>, journal_is_in_doaj <lgl>, journal_issns <chr>,
#> # journal_issn_l <chr>, journal_name <chr>, publisher <chr>,
#> # title <chr>, year <chr>, updated <chr>, authors <list>
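For readers who loop over DOIs themselves, a comparable text progress bar is available in base R through `utils::txtProgressBar()`. This is a sketch of the general pattern, not of how roadoi implements .progress; the loop body is a placeholder for the real per-DOI work, and the DOIs are made up.

```r
dois <- sprintf("10.1000/example.%d", 1:3)  # made-up DOIs
pb <- utils::txtProgressBar(min = 0, max = length(dois), style = 3)
results <- vector("list", length(dois))

for (i in seq_along(dois)) {
  # Placeholder for the real per-DOI work, e.g. one API request.
  results[[i]] <- toupper(dois[i])
  utils::setTxtProgressBar(pb, i)
}
close(pb)
```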
Unpaywall is a reliable API. However, this client follows Hadley Wickham's Best practices for writing an API package and throws an error when the API does not return valid JSON or is not available. To catch these errors, you may want to use purrr's safely() function:
random_dois <- c("ldld", "10.1038/ng.3260", "§dldl ")
my_data <- purrr::map(random_dois,
  .f = purrr::safely(function(x) roadoi::oadoi_fetch(x, email = "[email protected]")))
# return results as data.frame
purrr::map_df(my_data, "result")
#> # A tibble: 1 x 18
#> doi best_oa_location oa_locations data_standard is_oa genre oa_status
#> <chr> <list> <list> <int> <lgl> <chr> <chr>
#> 1 10.1… <tibble [1 × 11… <tibble [1 … 2 TRUE jour… green
#> # … with 11 more variables: has_repository_copy <lgl>,
#> # journal_is_oa <lgl>, journal_is_in_doaj <lgl>, journal_issns <chr>,
#> # journal_issn_l <chr>, journal_name <chr>, publisher <chr>,
#> # title <chr>, year <chr>, updated <chr>, authors <list>
# show errors
purrr::map(my_data, "error")
#> [[1]]
#> <simpleError: Unpaywall request failed [404]
#> 'ldld' is an invalid doi. See https://doi.org/ldld>
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> <simpleError: Unpaywall request failed [404]
#> '§dldl' is an invalid doi. See https://doi.org/§dldl>
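The result/error pairing that safely() produces can be reproduced with `tryCatch()` in base R, which may help when purrr is not available. In this sketch `fake_fetch()` is a hypothetical stand-in that fails on malformed DOIs, so that the example runs offline; it does not reproduce Unpaywall's actual responses.

```r
# Hypothetical stand-in for a fetch function that fails on malformed DOIs.
fake_fetch <- function(doi) {
  if (!grepl("^10\\.", doi)) {
    stop("'", doi, "' is an invalid doi.", call. = FALSE)
  }
  data.frame(doi = doi, is_oa = NA, stringsAsFactors = FALSE)
}

# Always return a list with `result` and `error`, one of which is NULL.
safe_fetch <- function(doi) {
  tryCatch(
    list(result = fake_fetch(doi), error = NULL),
    error = function(e) list(result = NULL, error = conditionMessage(e))
  )
}

out <- lapply(c("ldld", "10.1038/ng.3260"), safe_fetch)

# Bind the successful results and keep the error messages separately.
results <- do.call(rbind, Filter(Negate(is.null), lapply(out, `[[`, "result")))
errors <- Filter(Negate(is.null), lapply(out, `[[`, "error"))
```

Keeping failures as data rather than letting them abort the loop means one malformed DOI does not discard the results already collected.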
An increasing number of universities, research organisations and funders have launched open access policies in recent years. Using roadoi together with other R packages makes it easy to examine how and to what extent researchers comply with these policies in a reproducible and transparent manner. In particular, the rcrossref package, maintained by rOpenSci, provides many helpful functions for this task.
DOIs have become essential for referencing scholarly publications, and thus many digital libraries and institutional databases keep track of these persistent identifiers. For the sake of this vignette, instead of starting with a pre-defined set of publications originating from these sources, we simply generate a random sample of 50 DOIs registered with Crossref by using the rcrossref package.
library(dplyr)
library(rcrossref)
# get a random sample of DOIs and metadata describing these works
random_dois <- rcrossref::cr_r(sample = 50)
Now let's call Unpaywall. We are capturing possible errors.
oa_df <- purrr::map(random_dois, .f = purrr::safely(
  function(x) roadoi::oadoi_fetch(x, email = "[email protected]")
)) %>%
  purrr::map_df("result")
After obtaining the data, reporting with R is straightforward. You can even generate dynamic reports using R Markdown and related packages, thus making your study reproducible and transparent.
The following displays how many full-text links were found as a nicely formatted Markdown table, using the knitr package:
if (!is.null(oa_df))
  oa_df %>%
    group_by(is_oa) %>%
    summarise(Articles = n()) %>%
    mutate(Proportion = Articles / sum(Articles)) %>%
    arrange(desc(Articles)) %>%
    knitr::kable()
is_oa | Articles | Proportion |
---|---|---|
FALSE | 33 | 0.66 |
TRUE | 17 | 0.34 |
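The proportions in this table are simply each group's count divided by the total. As a quick check in base R, the sketch below rebuilds an `is_oa` vector with the same counts as the table above (33 FALSE, 17 TRUE) and recomputes them; the vector is reconstructed from the table, not fetched from the API.

```r
# Rebuild a logical vector with the counts from the table above.
is_oa <- rep(c(FALSE, TRUE), times = c(33, 17))

# Count per group, then divide by the total number of articles.
articles <- c(`FALSE` = sum(!is_oa), `TRUE` = sum(is_oa))
proportion <- articles / sum(articles)
round(proportion, 2)
```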
How did Unpaywall find those Open Access full-texts, which were characterized as best matches, and how are these OA types distributed over publication types?
if (!is.null(oa_df))
  oa_df %>%
    filter(is_oa == TRUE) %>%
    select(best_oa_location, genre) %>%
    tidyr::unnest(best_oa_location) %>%
    group_by(evidence, genre) %>%
    summarise(Articles = n()) %>%
    arrange(desc(Articles)) %>%
    knitr::kable()
evidence | genre | Articles |
---|---|---|
open (via free pdf) | journal-article | 7 |
open (via page says license) | journal-article | 6 |
oa repository (via OAI-PMH doi match) | journal-article | 1 |
oa repository (via OAI-PMH title and first author match) | journal-article | 1 |
open (via free pdf) | book-chapter | 1 |
open (via page says license) | proceedings-article | 1 |
For more examples, see Piwowar et al. (2018). Together with the article, the authors shared their analysis of Unpaywall data as an R Markdown supplement.
This blog post describes how to analyze the Unpaywall data dump with R: https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_evidence/
Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., Norlander, B., … Haustein, S. (2018). The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles. PeerJ, 6, e4375. https://doi.org/10.7717/peerj.4375
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
License: MIT
Please use the issue tracker for bug reporting and feature requests.