-
Notifications
You must be signed in to change notification settings - Fork 2
/
covidlit.Rmd
100 lines (84 loc) · 3.22 KB
/
covidlit.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
---
title: "Growth of the COVID19 literature"
output:
html_document:
df_print: paged
---
```{r}
library(tidyverse)
```
Updated 13 June 2020: Graph axes changed to more clearly reflect that these are cumulative article counts
Load in data from the NIH COVID-19 portfolio (https://icite.od.nih.gov/covid19/search/). This dataset was downloaded 9 June 2020, including the `DOI`, `PMCID`, `PMID`, `Publication Date`, `Publication Types`, `Source`, and `Journal Name` fields, and includes all articles through 2020-06-09.
```{r}
covidlit <- read_csv('data/covidlit-COVID-19_Portfolio-export_2020-06-09-13-52-30.csv')
```
How many total articles?
```{r}
nrow(covidlit)
```
We want to count published articles and preprints. What kind of sources do we have?
```{r}
unique(covidlit$Source)
```
Let's count articles by type...
```{r}
all_types <- unique(covidlit$Source)
preprint_types <- all_types[!all_types == "Peer reviewed (PubMed)"]
for (src in all_types) {
print(paste(src, nrow(covidlit[covidlit$Source == src,])))
}
```
For each date in the `Publication Date` column, count number of articles and number of preprints...
```{r}
# convert 'Publication Date' to a Date field
covidlit$`Publication Date` <- as.Date(covidlit$`Publication Date`, format = "%Y-%m-%d")
dates <- sort(unique(as.Date(covidlit$`Publication Date`)))
# data.frame to hold the data
df <- data.frame(date = NA, all = NA, jrnl = NA, pp = NA, stringsAsFactors = FALSE)
for (d in as.list(dates)) {
tmp <- covidlit[covidlit$`Publication Date` == d,]
df <- rbind(df, data.frame(date = d,
all = nrow(tmp),
jrnl = nrow(tmp[tmp$Source == 'Peer reviewed (PubMed)',]),
pp = nrow(tmp[tmp$Source %in% preprint_types,]),
stringsAsFactors = FALSE))
}
# when creating a data.frame, dates get mangled, so reformat
df$date <- as.Date(df$date, origin="1970-01-01")
# remove the first row, all NA
df <- df[-1,]
# get cumulative # of journal articles and preprints
df$cumsum_all <- cumsum(df$all)
df$cumsum_jrnl <- cumsum(df$jrnl)
df$cumsum_pp <- cumsum(df$pp)
df
```
Make the data 'tidy'...
```{r}
# get rid of columns we don't need
# select pubs after 1 Jan 2020, but skip the last day, whose data hasn't been fully updated
df <- df %>%
select(-c("all", "jrnl", "pp")) %>%
filter(date >= "2020-01-01", date <= "2020-06-08")
# rename cols (note, these are cumulative totals)
colnames(df) <- c("date","All Articles", "Published", "Preprints")
# use {tidyr} to convert a 'wide' table to 'long' for easier plotting (see https://tidyr.tidyverse.org/)
mytable <- df %>% pivot_longer(cols = c('All Articles', 'Published', 'Preprints'), names_to = "Article type",
values_to = "count")
mytable
```
Graph the data...
```{r}
p <- ggplot(mytable) +
geom_line(aes(x = date, y = count, color = `Article type`)) +
labs(title = "Growth of the COVID-19 literature",
subtitle = "Source: https://icite.od.nih.gov/covid19/search/",
y = "Cumulative Number of Articles", x = "Publication Date") +
theme_minimal() +
theme(legend.position = "right", legend.title = element_blank())
p
```
Document session for computational reproducibility
```{r}
sessionInfo()
```