covidlit.Rmd

---
title: "Growth of the COVID19 literature"
output:
  html_document:
    df_print: paged
---


```{r}
library(tidyverse)
```
Updated 13 June 2020: Graph axes changed to more clearly reflect that these are cumulative article counts

Load in data from the NIH COVID-19 portfolio (https://icite.od.nih.gov/covid19/search/). This dataset was downloaded 9 June 2020, including the `DOI`, `PMCID`, `PMID`, `Publication Date`, `Publication Types`, `Source`, and `Journal Name` fields, and includes all articles through 2020-06-09.
```{r}
covidlit <- read_csv('data/covidlit-COVID-19_Portfolio-export_2020-06-09-13-52-30.csv')
```

How many total articles?
```{r}
nrow(covidlit)
```

We want to count published articles and preprints. What kind of sources do we have?
```{r}
unique(covidlit$Source)
```

Let's count articles by type...
```{r}
all_types <- unique(covidlit$Source)
preprint_types <- all_types[!all_types == "Peer reviewed (PubMed)"]

for (src in all_types) {
  print(paste(src, nrow(covidlit[covidlit$Source == src,])))
}
```

For each date in the `Publication Date` column, count number of articles and number of preprints...
```{r}
# convert 'Publication Date' to a Date field
covidlit$`Publication Date` <- as.Date(covidlit$`Publication Date`, format = "%Y-%m-%d")
dates <- sort(unique(as.Date(covidlit$`Publication Date`)))
# data.frame to hold the data
df <- data.frame(date = NA, all = NA, jrnl = NA, pp = NA, stringsAsFactors = FALSE)
for (d in as.list(dates)) {
  tmp <- covidlit[covidlit$`Publication Date` == d,]
  df <- rbind(df, data.frame(date = d, 
                             all = nrow(tmp),
                             jrnl = nrow(tmp[tmp$Source == 'Peer reviewed (PubMed)',]),
                             pp = nrow(tmp[tmp$Source %in% preprint_types,]),
                             stringsAsFactors = FALSE))
}
# when creating a data.frame, dates get mangled, so reformat
df$date <- as.Date(df$date, origin="1970-01-01")
# remove the first row, all NA
df <- df[-1,]
# get cumulative # of journal articles and preprints
df$cumsum_all <- cumsum(df$all)
df$cumsum_jrnl <- cumsum(df$jrnl)
df$cumsum_pp <- cumsum(df$pp)

df
```

Make the data 'tidy'...
```{r}
# get rid of columns we don't need
# select pubs after 1 Jan 2020, but skip the last day, whose data hasn't been fully updated
df <- df %>% 
  select(-c("all", "jrnl", "pp")) %>% 
  filter(date >= "2020-01-01", date <= "2020-06-08")

# rename cols (note, these are cumulative totals)
colnames(df) <- c("date","All Articles", "Published", "Preprints")

# use {tidyr} to convert a 'wide' table to 'long' for easier plotting (see https://tidyr.tidyverse.org/)
mytable <- df %>% pivot_longer(cols = c('All Articles', 'Published', 'Preprints'), names_to = "Article type",
                               values_to = "count")
mytable
```

Graph the data...
```{r}
p <- ggplot(mytable) +
  geom_line(aes(x = date, y = count, color = `Article type`)) +
  labs(title = "Growth of the COVID-19 literature", 
       subtitle = "Source: https://icite.od.nih.gov/covid19/search/",
       y = "Cumulative Number of Articles", x = "Publication Date") +
  theme_minimal() +
  theme(legend.position = "right", legend.title = element_blank())

p
```

Document session for computational reproducibility
```{r}
sessionInfo()
```