diff --git a/.gitignore b/.gitignore index d1faf52f..60cc1cdc 100644 --- a/.gitignore +++ b/.gitignore @@ -8,3 +8,4 @@ inst/doc /doc/ /Meta/ scratch.R +pkgdown/favicon/ diff --git a/DESCRIPTION b/DESCRIPTION index 2e8bb0ad..63617bf9 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -54,7 +54,7 @@ Roxygen: list(markdown = TRUE) RoxygenNote: 7.2.3 Imports: checkmate, - stats, + stats Suggests: dplyr, tidyr, @@ -63,8 +63,8 @@ Suggests: forcats, ggplot2, pak, + owidR, testthat (>= 3.0.0), - covidregionaldata, usethis, knitr, bookdown, @@ -74,8 +74,7 @@ Suggests: incidence2, outbreaks Remotes: - epiverse-trace/epiparameter, - epiforecasts/covidregionaldata + epiverse-trace/epiparameter Config/testthat/edition: 3 Config/Needs/website: epiverse-trace/epiversetheme diff --git a/_pkgdown.yml b/_pkgdown.yml index cdc17a36..169313a2 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -1,3 +1,14 @@ url: https://epiverse-trace.github.io/cfr/ template: package: epiversetheme +articles: +- title: Package vignettes + navbar: Package vignettes + contents: + - estimate_static_severity + - estimate_time_varying_severity + - estimate_ascertainment +- title: Handling data from other packages + navbar: Handling data from other packages + contents: + - data_from_incidence2 diff --git a/datadelay.Rproj b/cfr.Rproj similarity index 100% rename from datadelay.Rproj rename to cfr.Rproj diff --git a/inst/WORDLIST b/inst/WORDLIST index 7f3d20a7..5ba183d8 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -18,7 +18,6 @@ Poisson Tidyverse WIP binom -codeblock codecov com covidregionaldata @@ -42,6 +41,7 @@ linelists lockdown md org +owidR packagename parameterised pkg diff --git a/vignettes/cfr.Rmd b/vignettes/cfr.Rmd index d077f9f2..9babf605 100644 --- a/vignettes/cfr.Rmd +++ b/vignettes/cfr.Rmd @@ -16,7 +16,9 @@ vignette: > It is important to understand the severity of a disease in terms of the case fatality rate in order to respond appropriately to an outbreak. During an outbreak there is often a delay between cases being reported, and the outcomes of those cases being known. + Knowing the distribution of these delays from previous outbreaks of the same (or similar) diseases, and accounting for them, can be useful in getting better estimates of disease severity. + The severity of a disease can be estimated while correcting for delays in reporting using methods outlines in @nishiura2009, and which are implemented in the _cfr_ package. ::: {.alert .alert-primary} @@ -50,7 +52,7 @@ library(cfr) ## Case and death data -Data on cases and deaths may be obtained from a number of publicly accessible sources, such as the [_covidregionaldata_ package for data on the Covid-19 outbreak](https://epiforecasts.io/covidregionaldata/). +Data on cases and deaths may be obtained from a number of publicly accessible sources, such as the [global Covid-19 dataset collected by Our World in Data](https://ourworldindata.org/coronavirus), and provided through the [_owidR_ package](https://cran.r-project.org/package=owidR). In an outbreak response scenario, such data may also be compiled and shared locally. @@ -66,6 +68,9 @@ Here, we use some data from the first Ebola outbreak, in the Democratic Republic ```{r} data("ebola1976") + +# view ebola dataset +head(ebola1976) ``` ## Obtaining data on reporting delays @@ -105,9 +110,12 @@ estimate_static( The `estimate_static()` function is well suited to small outbreaks. For larger outbreaks where it may be useful to know how severity has changed over time, the function `estimate_time_varying()` is available. This function is however not well suited to small outbreaks. More on this can be found on the [vignette on estimating how disease severity varies over the course of an outbreak](estimate_time_varying_severity.html). -## Estimate reporting rate +## Estimate ascertainment rate + +It is important to know what proportion of cases in an outbreak are being ascertained to muster the appropriate response, and to estimate the overall burden of the outbreak. -It is important to know what proportion of cases in an outbreak are being reported to muster the appropriate response, and to estimate the overall burden of the outbreak. +**Note** that the ascertainment rate may be affected by a number of factors. +When the main factor in low ascertainment is the lack of (access to) testing capacity, we refer to this as reporting or under-reporting. The `estimate_reporting()` function can help estimate this using the daily case and death data, the known severity of the disease from previous outbreaks, as well as a delay distribution of onset-to-death. diff --git a/vignettes/data_from_incidence2.Rmd b/vignettes/data_from_incidence2.Rmd index 298bf5ed..bc935de4 100644 --- a/vignettes/data_from_incidence2.Rmd +++ b/vignettes/data_from_incidence2.Rmd @@ -113,7 +113,7 @@ estimate_static( ## `` objects from aggregated case data -Aggregated case data such as that provided by the package [_covidregionaldata_](https://epiforecasts.io/covidregionaldata/) can be used with _cfr_ directly. +Aggregated case data such as the Covid-19 dataset provided by _incidence2_ can be used with _cfr_ directly. Such usage is shown in other vignettes such as the [vignette on estimating time-varying severity](estimate_time_varying_severity.html). diff --git a/vignettes/estimate_ascertainment.Rmd b/vignettes/estimate_ascertainment.Rmd index cf115f8a..7529245d 100644 --- a/vignettes/estimate_ascertainment.Rmd +++ b/vignettes/estimate_ascertainment.Rmd @@ -1,5 +1,5 @@ --- -title: "Estimating the proportion of cases that are reported during an outbreak" +title: "Estimating the proportion of cases that are ascertained during an outbreak" output: bookdown::html_vignette2: fig_caption: yes @@ -9,7 +9,7 @@ pkgdown: bibliography: resources/library.json link-citations: true vignette: > - %\VignetteIndexEntry{Estimating the proportion of cases that are reported during an outbreak} + %\VignetteIndexEntry{Estimating the proportion of cases that are ascertained during an outbreak} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: @@ -20,128 +20,105 @@ editor_options: knitr::opts_chunk$set( collapse = TRUE, comment = "#>", - fig.align = "left", - fig.width = 600 / 80, - fig.height = 400 / 80, - dpi = 80 + dpi = 300, + fig.width = 5, fig.height = 3 ) ``` -# Overview +Cases, infections or hospitalisations and deaths might be under-ascertained during an outbreak of an infectious disease for reasons including testing capacity, criteria not being sufficient to identify all infections, or symptom-based testing rather than random sampling, or changes in case definition over time. -This vignette outlines the function within _cfr_ used to estimate the -proportion of cases, infections or hospitalisations ascertained. +::: {.alert .alert-warning} +New to calculating disease severity using _cfr_? You might want to see the ["Get started" vignette first](cfr.html). +::: -First, we load the key packages required for this vignette: +::: {.alert .alert-primary} +## Use case {-} + +We wish to estimate one of the typical severity quantities used in epidemiology -- the *case fatality risk (CFR)*, *infection hospitality risk (IFR)* or *hospitalisation fatality risk (HFR)* -- using a method that uses a time series of cases, infections or hospitalisations (respectively) and deaths, _while correcting for the delay in reporting the outcomes of cases_. +::: + +::: {.alert .alert-secondary} +## What we have {-} + +* A time-series of cases, hospitalisations or some other proxy for infections over time; +* A time-series of deaths; +* A delay distribution, describing the probability an individual will die $t$ days after they were initially infected. + +The first two elements are expected be included in a dataframe with the columns "dates", "cases", and "deaths"; see below for examples. + +The delay distribution is expected to be specified as an object of the class `` from the package [_epiparameter_](https://epiverse-trace.github.io/epiparameter/). +::: + +This vignette shows how to use the `estimate_reporting()` function in _cfr_ to estimate the proportion of cases, infections or hospitalisations ascertained, and consequently, the extent of under-ascertainment or under-reporting. +Note that under-ascertainment may be due to a number of reasons (see above), while under-reporting is due to cases not being reported (such as due to a lack of testing capacity). +Users should take care to use the correct terminology for their specific situation. + +However, we have named the function `estimate_reporting()` as we expect that users will be most interested in situations where testing capacity is itself the limiting factor in ascertainment. +This was the case during the Covid-19 pandemic before testing capacity was substantially increased, and is expected to be the case for a future pandemic caused by a novel pathogen. + +First load _cfr_ and packages to access and plot data. ```{r, message = FALSE, warning=FALSE, eval = TRUE} +# load {cfr} and data packages library(cfr) library(epiparameter) -library(covidregionaldata) -``` +library(owidR) -and the four optional packages, to run the multiple examples in the last -section of this vignette: - -```{r} +# packages to wrangle and plot data library(dplyr) library(tidyr) library(purrr) library(scales) -library(ggplot2) library(forcats) +library(ggplot2) ``` -# Motivation +## Ascertainment for the Covid-19 pandemic in the U.K. -There are several reasons why cases, infections or hospitalisations and -deaths might be under-ascertained during an outbreak of an infectious disease. -Some examples include: +The function `estimate_reporting()` from the _cfr_ package estimates the proportion of cases, infections, hospitalisations -- or whichever proxy for infections is provided -- which have been ascertained. -* Testing capacity or criteria is not sufficient to identify all infections. +The methods are based on @nishiura2009 to estimate severity, and are extended by combining the resulting severity estimates with an assumed baseline severity estimate. -* Symptom-based testing rather than sampling at random in the -population. This can lead to biases due to some infections either being -asymptomatic or having a pre-symptomatic phase. +The proportion of cases, infections or other quantity provided that have been ascertained is given by the ratio of the (assumed) true baseline severity estimate, to the delay-adjusted severity estimate. -* Changes in case definition over time. +The delay-adjusted severity estimates can be calculated using either the `estimate_static()` or the `estimate_time_varying()` functions. +See the vignettes on [estimating a static measure of disease severity](estimate_static_severity.html) and [estimating a time-varying measure of disease severity](estimate_time_varying_severity.html), respectively, for more details on each of these functions. -For example, during the recent COVID-19 pandemic, the U.K. did not roll -out widespread testing until around May 2020. This means that during March and -April, infections were heavily under-ascertained. We focus on this period -within this vignette, as it illustrates the use for this analysis pipeline well. +### Preparing the raw data -# Methods +First we subset the data so that we focus on just the first few months of the COVID-19 outbreak in the U.K. +During this period test availability changed dramatically as a result of the vaccine campaign. -## Data required +We load the [global Covid-19 dataset collected by Our World in Data](https://ourworldindata.org/coronavirus), and provided through the [_owidR_ package](https://cran.r-project.org/package=owidR). -The data required to estimate how the severity of a disease changes over time -using the _cfr_ package includes: - -* A time-series of cases, hospitalisations or some other proxy for infections -over time; -* A time-series of deaths; -* A delay distribution, describing the probability an individual will die $t$ -days after they were initially exposed. Such distributions come from the -literature, where studies have typically fit distributions to data describing -the process. - -In practice, the time-series of cases and deaths will already be together, given -that they usually originate from sources that will have typically collated them -into a single data file. - -The package requires a `data.frame` of input data --- typically case and death -time series data --- and a delay distribution. The delay distribution we use -here comes from the _epiparameter_ package. - -## Estimating the proportion of cases or infections that have been ascertained - -The function `estimate_reporting()` from the _cfr_ package estimates the -proportion of cases, infections, hospitalisations --- or whichever proxy for -infections is provide --- which have been ascertained. The method used within -this function extends the methods outlined in the previous vignettes about -estimating the severity during an ongoing outbreak and measuring how the -severity changes over time. Both methods are based on the @nishiura2009 methods to estimate severity. We extend this by -combining the resulting severity estimates with an assumed baseline severity -estimate. The proportion of cases, infections or other quantity provided that -have been ascertained is given by the ratio of the assumed to be true baseline -severity estimate to the delay-adjusted severity estimate. The delay-adjusted -severity estimates can be calculated using either the `estimate_static()` or the -`estimate_time_varying()` functions. - -# Example with data from the ongoing COVID-19 pandemic in the U.K. - -We outline how the time-varying severity estimation works in *cfr* using -a number of examples. The data for all of the examples is from the ongoing -COVID-19 epidemic. Firstly, we analyse the U.K. data, then we pick three other -countries with large outbreaks to analyse. - -## Downloading the raw data - -First of all, we subset the data so that we focus on just the first few months -of the COVID-19 outbreak in the U.K. We do so, as test availability changed -dramatically as a result of the vaccine campaign. We download the data — -using the `covidregionaldata` package — with the following -command: +We introduce this dataset here, rather than its more U.K. focused equivalent from the _incidence2_ package, because we are going to use it further on for other countries. ```{r} -df_covid_uk <- get_national_data( - countries = "united kingdom", source = "who", verbose = FALSE +# get Covid data from {owidR} +df_covid <- owid_covid() + +# filter for the U.K +df_covid_uk <- filter(df_covid, location == "United Kingdom") + +# select and rename columns wth {dplyr} +df_covid_uk <- select( + df_covid_uk, date, + cases = new_cases, deaths = new_deaths ) -df_covid_uk <- rename(df_covid_uk, cases = cases_new, deaths = deaths_new) +# view the data format +df_covid_uk ``` -We then subset the data to focus on just the first few months of the outbreak -with the following command: +We then subset the data to focus on just the first few months of the outbreak. ```{r} +# the first few months of the pandemic df_covid_uk_subset <- filter(df_covid_uk, date <= "2020-05-31") ``` -## Plotting the raw data - -First, we plot case incidence data with following command: +Then, we plot case incidence data with following command. +Further plotting commands are hidden for brevity. ```{r fig.cap="Incidence of cases over time for the ongoing COVID-19 outbreak in the U.K.", class.source = 'fold-hide'} ggplot(df_covid_uk_subset) + @@ -149,10 +126,10 @@ ggplot(df_covid_uk_subset) + aes( x = date, y = cases ), - colour = "darkblue" + colour = "steelblue" ) + scale_x_date( - date_labels = "%d-%b-%Y" + date_labels = "%b-%Y" ) + scale_y_continuous( labels = comma @@ -170,10 +147,10 @@ ggplot(df_covid_uk_subset) + aes( x = date, y = deaths ), - colour = "red" + colour = "brown" ) + scale_x_date( - date_labels = "%d-%b-%Y" + date_labels = "%b-%Y" ) + scale_y_continuous( labels = comma @@ -183,11 +160,9 @@ ggplot(df_covid_uk_subset) + ) ``` -## The delay distribution +### Onset-to-death distribution for Covid-19 -We again retrieve the appropriate distribution — reported in @linton2020 — using the -`epidist_db()` function from the _epiparameter_ package, using the following -command: +We retrieve the appropriate distribution reported in @linton2020 using the `epidist_db()` function from the [_epiparameter_ package](https://epiverse-trace.github.io/epiparameter/). ```{r} onset_to_death_covid <- epidist_db( @@ -197,17 +172,19 @@ onset_to_death_covid <- epidist_db( ) ``` -To visualise this distribution, we evaluate it between 0 and 30 days, and plot -the results over time. We do so using the following command: +### Estimating the proportion of cases that have been ascertained -```{r fig.cap = "Example plot of the appropriate delay distribution for the COVID-19 epidemic in the U.K.We plot the onset-to-death distribution we use throughout this example for COVID-19, reported in https://doi.org/10.3390/jcm9020538."} -plot(onset_to_death_covid, day_range = seq_len(30)) -``` +We use the `estimate_reporting()` function within the _cfr_ package to calculate the time-varying CFR for the COVID-19 epidemic in the U.K. -## Estimating the proportion of cases that have been ascertained +The function includes a `type` argument, which determines whether `estimate_static()` or `estimate_time_varying()` is used to estimate the delay-adjusted severity of the disease. -We use the `estimate_reporting()` function within the _cfr_ package to -calculate the time-varying CFR for the COVID-19 epidemic in the U.K: +The ascertainment rate is calculated as the disease severity calculated from the data, divided by the 'known' disease severity; this is expected to be known or assumed from our best knowledge of the pathology of the disease. + +This known disease severity is passed to the `severity_baseline` argument in `estimate_reporting()`, and forms the denominator in the resulting under-ascertainment calculation. + +We assume that the 'true' CFR of Covid-19 is 0.014. + +The other arguments are the same as those found in `estimate_time_varying()`. ```{r } df_reporting_static <- estimate_reporting( @@ -219,14 +196,6 @@ df_reporting_static <- estimate_reporting( ) ``` -The function includes a `type` argument, which determines whether -`estimate_static()` or `estimate_time_varying()` is used to estimate the -delay-adjusted severity of the disease used as the numerator in the -under-ascertainment calculation. The `severity_baseline` argument in the -`estimate_reporting()` determines the denominator in the resulting -under-ascertainment calculation. The other arguments are the same as those -found in the `estimate_time_varying()`. - ```{r } df_reporting_varying <- estimate_reporting( data = df_covid_uk, @@ -250,44 +219,49 @@ df_reporting_static df_reporting_varying ``` -# Example with all countries with large early COVID-19 outbreaks +## Ascertainment in countries with large early COVID-19 outbreaks + +Finally, we estimate ascertainment for all countries with large early outbreaks of Covid-19. + +We define a large outbreak as one which has caused at least 100,000 deaths, and focus on the period between the start of each outbreak to the 1st June 2020. + +We do so as it was this period where under-ascertainment of cases and infections was likely the highest, as tests were still being developed and had not been made widely available in many countries. + +We load the [global Covid-19 dataset collected by Our World in Data](https://ourworldindata.org/coronavirus), and provided through the [_owidR_ package](https://cran.r-project.org/package=owidR). ```{r} -df_covid <- get_national_data( - source = "who", verbose = FALSE -) %>% - rename(cases = cases_new, deaths = deaths_new) +# get data from {owidR} +df_covid <- owid_covid() -df_covid_subset <- filter(df_covid, date <= "2020-05-31") +# filter out whole continents +df_covid <- filter( + df_covid, location != continent +) + +# rename the cases and deaths columns using {dplyr} +df_covid <- select( + df_covid, iso_code, location, continent, date, + cases = new_cases, deaths = new_deaths, + total_deaths +) ``` -Finally, we put the ascertainment estimates for all countries with large -outbreaks in a single table and summarise the results in a single figure. We -define a large outbreak as one which has caused at least 100,000 deaths. We -focus on these countries as the resulting table and figure become unwieldy if -they are made any larger. We focus on the period between the start of each -outbreak to the 1st June 2020. We do so as it was this period where -under-ascertainment of cases and infections was likely the highest, as tests -were still being developed and had not been made widely available in many -countries. - -To calculate all ascertainment estimates for all countries with -large COVID-19 outbreaks up to the 1st June 2020, we use the following command — -this codeblock depends on the package `dplyr`, for ease with manipulating -large data.frames which are structured by group: +We adopt data science tools to iteratively apply the `estimate_reporting()` function across data grouped by country. +We refer the user to the book [R for Data Science](https://r4ds.had.co.nz/) for a better explanation of some of the code used here, including from the packages in [the Tidyverse](https://www.tidyverse.org/). + ```{r} # calculate the total number deaths by country and filter for countries # with 100,000 deaths, and then filter for the first six months of 2020 df_reporting <- df_covid %>% group_by(iso_code) %>% - mutate(total_deaths = max(deaths_total)) %>% - filter(total_deaths > 100000 & !is.na(country)) %>% + mutate(total_deaths = max(total_deaths, na.rm = TRUE)) %>% + filter(total_deaths > 100000 & !is.na(location)) %>% filter(date < "2020-06-01") %>% ungroup() # nest the data -df_reporting <- nest(df_reporting, .by = country) +df_reporting <- nest(df_reporting, .by = location) # calculate the reporting rate in each country using # map on nested dataframes @@ -308,20 +282,17 @@ df_reporting <- unnest(df_reporting, cols = "reporting") head(df_reporting) ``` -Then, to plot all of these results on a single figure, we use the following -command — this codeblock depends on the commonly used package `ggplot2`, for -ease with plotting large data.frames: +Plot the ascertainment for each country. -```{r fig.cap = "Example plot of the corrected time-varying CFR. We calculate the time-varying CFR for the ongoing COVID-19 epidemic in United States, corrected for delays.", class.source = 'fold-hide'} +```{r fig.cap = "Example plot of the ascertainment rate by country during the early stages of the Covid-19 pandemic.", class.source = 'fold-hide'} df_reporting %>% ggplot() + geom_pointrange( aes( - x = fct_reorder(country, reporting_me), + x = fct_reorder(location, reporting_me), y = reporting_me, ymin = reporting_lo, - ymax = reporting_hi, - colour = country + ymax = reporting_hi ) ) + coord_flip() + diff --git a/vignettes/estimate_static_severity.Rmd b/vignettes/estimate_static_severity.Rmd index c4f2b6ec..470c8269 100644 --- a/vignettes/estimate_static_severity.Rmd +++ b/vignettes/estimate_static_severity.Rmd @@ -1,5 +1,5 @@ --- -title: "Calculating disease severity by adjusting for delays between onset and outcomes" +title: "Calculating a static, delay-adjusted estimate of disease severity" output: bookdown::html_vignette2: fig_caption: yes @@ -9,7 +9,7 @@ pkgdown: bibliography: resources/library.json link-citations: true vignette: > - %\VignetteIndexEntry{Calculating disease severity by adjusting for delays between onset and outcomes} + %\VignetteIndexEntry{Calculating a static, delay-adjusted estimate of disease severity} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: @@ -20,150 +20,71 @@ editor_options: knitr::opts_chunk$set( collapse = TRUE, comment = "#>", - fig.align = "left", - fig.width = 700 / 80, - fig.height = 400 / 80, - dpi = 80 + dpi = 300, + fig.width = 5, fig.height = 3 ) ``` -# Overview +It is important to understand the severity of a disease in terms of the case fatality rate in order to respond appropriately to an outbreak. +During an outbreak there is often a delay between cases being reported, and the outcomes of those cases being known. -This vignette outlines how to use _cfr_ to calculate disease severity -during an ongoing outbreak. We wish to estimate one of the typical severity -quantities used in epidemiology: the *case fatality risk (CFR)*, -*infection hospitality risk (IFR)* or *hospitalisation fatality risk (HFR)*. We -do so using a method that uses a time series of cases, infections of -hospitalisations (respectively) and deaths with an appropriate delay -distribution. The methods used here originate from @nishiura2009. +Knowing the distribution of these delays from previous outbreaks of the same (or similar) diseases, and accounting for them, can be useful in getting better estimates of disease severity. -This vignette demonstrates how to implement the core functions within -_cfr_. Specifically, we will cover +This vignette demonstrates how to calculate a static estimate of the severity of an outbreak using methods outlined in @nishiura2009, and implemented in _cfr_. -1. Why naive estimates of CFR or HFR generated during an ongoing outbreak -require adjusting, given out of sync time series are often being compared. +::: {.alert .alert-warning} +New to calculating disease severity using _cfr_? You might want to see the ["Get started" vignette first](cfr.html). +::: -2. Arrive at estimates over time for the number of symptomatic individuals -that will likely be known to have had an outcome $t$ days after symptom onset +::: {.alert .alert-primary} +## Use case {-} -3. Define an unbiased estimator to estimate the CFR or HFR from @nishiura2009 +We want a static estimate of the severity of an outbreak, using a method that uses a time series of cases, infections or hospitalisations (as appropriate) and deaths, _while correcting for the delay in reporting the outcomes of cases_. +::: -4. Numerically implement the unbiased estimator numerically using -maximum likelihood estimation to arrive at central estimates of the CFR or HFR, -along with a confidence interval. +::: {.alert .alert-secondary} +## What we have {-} -Now we load the three packages we need using the following commands: - -```{r, message = FALSE, warning=FALSE, eval = TRUE} -library(cfr) -library(epiparameter) -library(covidregionaldata) -library(dplyr) -library(scales) -library(ggplot2) -``` - -# Motivation - -Key epidemiological quantities estimated in real-time often suffer from biases. -In particular, dividing deaths-to-date by cases-to-date, infections-to-date or -hospitalisations-to-date leads to biased estimates of the CFR, IFR or -HFR (respectively). This is because this calculation does not account -for delays from confirmation of a case, infection or hospitalisation-to-death. - -If we had perfectly synchronised case and death incidence data, the basic -versions of the CFR, IFR and HFR — defined as either the ratio of the -total number of deaths and the total number of cases, infections or -hospitalisations — would be accurate. These basic estimates are referred to as -naive estimators. However, due to delays between onset or hospitalisation and -death for all known diseases, this synchronisation is never true. - -For example, suppose 10 people start showing symptoms of a specific disease on -a given day and the end of that day all remain alive. Suppose that for the next -5 days, the numbers of new cases continue to rise until they reach 100 new cases -on day 5. However, suppose that by day 5, all infected individuals remain alive. -The naive estimate of the CFR would be zero calculated at the end of the first 5 -days would be zero, because there would have been zero deaths in total. -Even after deaths begin to occur, this lag between the ascertainment of a case -or hospitalisation and outcome leads to a consistently biased estimate. Hence, -adjusting for such delays using an appropriate delay distribution -is essential for accurate estimates of severity. - -# Methods - -## Data required - -The data required to estimate the severity of a disease during an ongoing -outbreak using the _cfr_ package includes: - -* A time-series of cases, hospitalisations or some other proxy for infections -over time; +* A time-series of cases, hospitalisations or some other proxy for infections over time; * A time-series of deaths; -* A delay distribution, describing the probability an individual will die $t$ -days after they were initially exposed. Such distributions come from the -literature, where studies have typically fit distributions to data describing -the process. +* A delay distribution, describing the probability an individual will die $t$ days after they were initially infected. -In practice, the time-series of cases and deaths will already be together, given -that they usually originate from sources that will have typically collated them -into a single data file. +The first two elements are expected be included in a dataframe with the columns "dates", "cases", and "deaths"; see below for examples. -The function within _cfr_ used to estimate the severity of a disease at a -single time-point — `estimate_static()` — requires a `data.frame` of input data ---- typically case and death time series data --- and a delay distribution. The -delay distribution we use here comes from the _epiparameter_ package. +The delay distribution is expected to be specified as an object of the class `` from the package [_epiparameter_](https://epiverse-trace.github.io/epiparameter/). +::: -## Adjusting for delays between the two time series +::: {.alert .alert-info} +## Concept: Correcting for delays in CFR estimation {-} -The method used by this function follows @nishiura2009. -The function calculates a quantity $u_t$ for each day within the input data, -which represents the proportion of cases with a known outcome, on day $t$. -Following Nishiura et al., $u_t$ is calculated in the following way: +Simply dividing the number of deaths by the number of cases would obtain a CFR that is a _naive estimator_ of the true CFR. +Consider the example below to understand why. -\begin{equation} - u_t = \frac{\sum_{i = 0}^t - \sum_{j = 0}^\infty c_i f_{j - i}}{\sum_{i = 0} c_i}, -\end{equation} +Suppose 10 people start showing symptoms of a specific disease on a given day and the end of that day all remain alive. +Suppose that for the next 5 days, the numbers of new cases continue to rise until they reach 100 new cases on day 5. +However, suppose that by day 5, all infected individuals remain alive. -where $f_t$ is the value of the probability mass function at time $t$ and $c_t$, -$d_t$ are the number of new cases and new deaths at time $t$, (respectively). We -then use $u_t$ at the end of the outbreak in the following likelihood -function to estimate the severity of the disease in question. +The naive estimate of the CFR calculated at the end of the first 5 days would be _zero_, because there would have been zero deaths in total --- _at that point_. +That is to say, the _outcomes_ of cases (infections) would not be known. -\begin{equation} - L(\theta | y) = \log{\binom{u_tC}{D}} + D \log{\theta} + - (u_tC - D)\log{(1.0 - \theta)}, -\end{equation} - -$C$ and $D$ are the cumulative number of cases and deaths (respectively) up -until time $t$. Lastly, $\theta$ is the parameter we wish to estimate, the -severity of the disease. We estimate $\theta$ using simple maximum-likelihood -methods, allowing the functions within this package to be quick and easy tools -to use. - -## Interpreting the estimates - -The precise severity measure — CFR, IFR, HFR, etc — that \theta represents -depends upon the input data given by the user. For complete clarity, here are -the most common time series that users might calculate severity from and the -resulting severity estimate from such data: +Even after deaths begin to occur, this lag between the ascertainment of a case or hospitalisation and outcome leads to a consistently biased estimate. +Hence, adjusting for such delays using an appropriate delay distribution is essential for accurate estimates of severity. +::: -:Case and death incidence data, with a case-to-death delay distribution (or close approximation, such as onset-to-death) — Case Fatality Risk (CFR). +First load _cfr_ and packages to access and plot data. -:Infection and death incidence data, with an exposure-to-death delay distribution (or close approximation). - -:Hospitalisation Fatality Risk (HFR) --- Hospitalisation and death incidence data, delay distribution (or close approximation). +```{r, message = FALSE, warning=FALSE, eval = TRUE} +library(cfr) +library(epiparameter) +library(incidence2) +library(dplyr) +library(scales) +library(ggplot2) +``` # Severity of the 1976 Ebola Outbreak -We use case and death incidence data from the 1976 Ebola outbreak in the Democratic the overall severity of Ebola. We do so as though we were roughly half way -through the outbreak, emulating when the methods presented in this package are -arguably their most useful. - -First of all, we plot the data from the entire outbreak. Then we subset the -data, keeping only the first half of each time series, in order to emulate -attempting to calculate the severity of the disease when the effects of long -lag-times between case detection and death are most serious. +We use case and death incidence data from the 1976 Ebola outbreak in the Democratic the overall severity of Ebola. We do so as though we were roughly half way through the outbreak, emulating when the methods presented in this package are arguably their most useful. ## Plotting the raw data @@ -181,7 +102,7 @@ ggplot(ebola1976) + aes( x = date, y = cases ), - colour = "darkblue" + colour = "steelblue" ) + scale_x_date( date_labels = "%d-%b-%Y" @@ -199,7 +120,7 @@ ggplot(ebola1976) + aes( x = date, y = deaths ), - colour = "red" + colour = "brown" ) + scale_x_date( date_labels = "%d-%b-%Y" @@ -214,15 +135,14 @@ subsetting the data so that we only include days before the 30th September 1976, using `dplyr::filter()`. ```{r, message = FALSE, warning = FALSE, eval = TRUE} -df_ebola_subset <- dplyr::filter(ebola1976, date <= "1976-09-30") +df_ebola_subset <- filter(ebola1976, date <= "1976-09-30") ``` ## The delay distribution -For this example, given that we are using case data and detection of a case is -well-approximated by symptom onset, we use the distribution describing the delay -between onset-to-death. As such, we retrieve such a distribution from the -literature [@barry2018] using the `epidist_db()` function from the _epiparameter_ package with the following command: +For this example, given that we are using case data and detection of a case is well-approximated by symptom onset, we use the distribution describing the delay between onset-to-death. + +As such, we retrieve such a distribution from the literature [@barry2018] using the `epidist_db()` function from the _epiparameter_ package with the following command: ```{r} onset_to_death_ebola <- epidist_db( @@ -236,13 +156,12 @@ To visualise this distribution, we evaluate it between 0 and 30 days, and plot the results over time using the appropriate plot method from the _epiparameter_ package. ```{r, fig.cap = "Example plot of the appropriate delay distribution for the 1976 Ebola dataset.** We plot the onset-to-death distribution we use throughout this example for Ebola Virus Disease (EVD), reported [here](https://doi.org/10.1016/S0140-6736(18)31387-4)."} -plot(onset_to_death_ebola, day_range = seq_len(30), vb = FALSE) +plot(onset_to_death_ebola, day_range = seq_len(30), vb = FALSE, cex.main = 0.5) ``` ## Estimating incidence of cases (or similar time-series) with a known outcome -The function `known_outcomes()` from the _cfr_ package estimates the -number of cases which have a known outcome over time. +The function `known_outcomes()` from the _cfr_ package estimates the number of cases which have a known outcome over time. The resulting data frame contains two new columns, "known_outcomes", for the number of known outcomes (deaths) expected for each day, and "u_t", the under-reporting factor that estimates what proportion of cases have not been reported, given the estimated number of known outcomes. @@ -275,7 +194,7 @@ ggplot(df_known_outcomes_ebola) + ) + scale_colour_manual( values = c( - "Cases" = "darkblue", + "Cases" = "steelblue", "Known outcomes" = "darkgreen" ), name = NULL @@ -293,16 +212,11 @@ ggplot(df_known_outcomes_ebola) + ## Estimating the naive and corrected CFR -Once we calculate the proportion of cases with known outcomes, we apply the -proportion to the number of cases to correct for the delay between -onset-to-death. We do so by using the function `estimate_static()` from the -_cfr_ package. This function estimates the proportion of known outcomes -over time, and uses the estimate to correct the naive severity estimate. This is -controlled using the `correct_for_delays` boolean flag argument in the -`estimate_static()` function. Otherwise, it calculates a naive severity estimate, -which does not adjust for delays and is simply the total number of deaths in the -dataset divided by the total number of cases. We run this function for the naive -and corrected estimates using the following two commands (respectively): +Once we calculate the proportion of cases with known outcomes, we apply the proportion to the number of cases to correct for the delay between onset-to-death. +We do so by using the function `estimate_static()` from the _cfr_ package. +This function estimates the proportion of known outcomes over time, and uses the estimate to correct the naive severity estimate. +This is controlled using the `correct_for_delays` boolean flag argument in the `estimate_static()` function. +Otherwise, it calculates a naive severity estimate, which does not adjust for delays and is simply the total number of deaths in the dataset divided by the total number of cases. ```{r message = FALSE, warning = FALSE, eval = TRUE} # calculating the naive CFR @@ -321,24 +235,26 @@ estimate_static( # Severity of the COVID-19 pandemic in the U.K. -We now perform a similar analysis with all of the same steps, with data taken -from the ongoing COVID-19 epidemic in the U.K. For brevity, we describe the -steps without methodological explanations throughout this example. +We now perform a similar analysis with all of the same steps, with data taken from the ongoing COVID-19 epidemic in the U.K. +For brevity, we describe the steps without methodological explanations throughout this example. + +We get the data from the [_incidence2_ package](https://cran.r-project.org/package=incidence2) (but note that it was originally made available in the [_covidregionaldata_](https://github.com/epiforecasts/covidregionaldata) package which is no longer available on CRAN). ## Plotting the raw data -First of all, we subset the data so that we focus on just the first year of the -COVID-19 outbreak in the U.K. We do so, as the CFR changed dramatically as a -result of the vaccine campaign. The static severity calculations we are -performing in this vignette are not able to deal with changes in severity over -time. We download the data --- using the `covidregionaldata` package --- -change some default column names to match those required by _cfr_ and -subset the data.frame to focus on the first year of the pandemic in the U.K., -with the following commands: +We subset the data so that we focus on just the first year of the COVID-19 outbreak in the U.K. +We do so, as the CFR changed dramatically as a result of the vaccination campaign. +The static severity calculations we are performing in this vignette are not able to deal with changes in severity over time. + +We change some default column names to match those required by _cfr_ and subset the data.frame to focus on the first year of the pandemic in the U.K. ```{r, message = FALSE, warning = FALSE, eval = TRUE} -df_covid_uk <- get_national_data( - countries = "united kingdom", source = "who", verbose = FALSE +df_covid_uk <- covidregionaldataUK + +df_covid_uk <- aggregate( + data = df_covid_uk, + x = cbind(cases_new, deaths_new) ~ date, + FUN = function(x) sum(x, na.rm = TRUE) ) df_covid_uk <- rename( @@ -349,7 +265,7 @@ df_covid_uk <- rename( df_covid_uk_subset <- filter(df_covid_uk, date <= "2020-12-31") ``` -Then, we plot the subsetted case data with following command: +Then, we plot the subsetted case data. ```{r, fig.cap = "Incidence of cases over time for the ongoing COVID-19 outbreak in the U.K.", class.source = 'fold-hide'} ggplot(df_covid_uk_subset) + @@ -357,7 +273,7 @@ ggplot(df_covid_uk_subset) + aes( x = date, y = cases ), - colour = "darkblue" + colour = "steelblue" ) + scale_x_date( date_labels = "%d-%b-%Y" @@ -370,7 +286,7 @@ ggplot(df_covid_uk_subset) + ) ``` -Then, we plot the subsetted death data with following command: +Then, we plot the subsetted death data. ```{r, fig.cap = "Incidence of deaths over time for the ongoing COVID-19 outbreak in the U.K.", class.source = 'fold-hide'} ggplot(df_covid_uk_subset) + @@ -378,7 +294,7 @@ ggplot(df_covid_uk_subset) + aes( x = date, y = deaths ), - colour = "red" + colour = "brown" ) + scale_x_date( date_labels = "%d-%b-%Y" @@ -393,7 +309,7 @@ ggplot(df_covid_uk_subset) + ## The delay distribution -We again retrieve the appropriate distribution from @linton2020 using the `epidist_db()` function from the _epiparameter_ package, using the following command. +We again retrieve the appropriate distribution from @linton2020 using the `epidist_db()` function from the _epiparameter_ package. ```{r message = FALSE, warning = FALSE, eval = TRUE} onset_to_death_covid <- epidist_db( @@ -403,8 +319,7 @@ onset_to_death_covid <- epidist_db( ) ``` -To visualise this distribution, we evaluate it between 0 and 30 days, and plot -the results over time. We do so using the following command: +To visualise this distribution, we evaluate it between 0 and 30 days, and plot the results over time. ```{r fig.cap = "Example plot of the appropriate delay distribution for the COVID-19 epidemic in the U.K. We plot the onset-to-death distribution we use throughout this example for COVID-19, reported [here](https://doi.org/10.3390/jcm9020538)."} plot(onset_to_death_covid, day_range = seq_len(30), vb = FALSE) @@ -412,8 +327,7 @@ plot(onset_to_death_covid, day_range = seq_len(30), vb = FALSE) ## Estimating incidence of cases with a known outcome -We use the same method and implementation as in the Ebola example to calculate -the number of known outcomes over time, using the following command: +We use the same method and implementation as in the Ebola example to calculate the number of known outcomes over time. ```{r} df_known_outcomes_covid <- known_outcomes( @@ -438,7 +352,7 @@ ggplot(df_known_outcomes_covid) + ) + scale_colour_manual( values = c( - "Cases" = "darkblue", + "Cases" = "steelblue", "Known outcomes" = "darkgreen" ), name = NULL @@ -477,4 +391,36 @@ estimate_static( ) ``` +## Details: Adjusting for delays between two time series {-} + +The method used in `estimate_static()` function follows @nishiura2009. +The function calculates a quantity $u_t$ for each day within the input data, which represents the proportion of cases with a known outcome, on day $t$. +Following Nishiura et al., $u_t$ is calculated in the following way: + +\begin{equation} + u_t = \frac{\sum_{i = 0}^t + \sum_{j = 0}^\infty c_i f_{j - i}}{\sum_{i = 0} c_i}, +\end{equation} + +where $f_t$ is the value of the probability mass function at time $t$ and $c_t$, $d_t$ are the number of new cases and new deaths at time $t$, (respectively). +We then use $u_t$ at the end of the outbreak in the following likelihood function to estimate the severity of the disease in question. + +\begin{equation} + L(\theta | y) = \log{\binom{u_tC}{D}} + D \log{\theta} + + (u_tC - D)\log{(1.0 - \theta)}, +\end{equation} + +$C$ and $D$ are the cumulative number of cases and deaths (respectively) up until time $t$. +Lastly, $\theta$ is the parameter we wish to estimate, the severity of the disease. +We estimate $\theta$ using simple maximum-likelihood methods, allowing the functions within this package to be quick and easy tools to use. + +The precise severity measure — CFR, IFR, HFR, etc — that \theta represents depends upon the input data given by the user. +For complete clarity, here are the most common time series that users might calculate severity from and the resulting severity estimate from such data. + + - Case and death incidence data, with a case-to-death delay distribution (or close approximation, such as onset-to-death) — Case Fatality Risk (CFR). + + - Infection and death incidence data, with an exposure-to-death delay distribution (or close approximation). + + - Hospitalisation Fatality Risk (HFR) --- Hospitalisation and death incidence data, delay distribution (or close approximation). + ## References diff --git a/vignettes/estimate_time_varying_severity.Rmd b/vignettes/estimate_time_varying_severity.Rmd index df067415..b2e4196a 100644 --- a/vignettes/estimate_time_varying_severity.Rmd +++ b/vignettes/estimate_time_varying_severity.Rmd @@ -20,162 +20,97 @@ editor_options: knitr::opts_chunk$set( collapse = TRUE, comment = "#>", - fig.align = "left", - fig.width = 700 / 80, - fig.height = 400 / 80, - dpi = 80 + dpi = 300, + fig.width = 5, fig.height = 3 ) ``` -# Overview +There are many biological, epidemiological and behavioural reasons why the severity of a disease might change during the course of an outbreak, including the introduction of vaccines or therapeutics, reducing the relative risk of death, or the emergence of pathogen variants, which may alter the risk of hospitalisation or death associated with infection. -This vignette outlines how to use _cfr_ to estimate how the severity of a -disease changes over the course of an ongoing outbreak. We wish to estimate -how common severity quantities change over time: -the *case fatality risk (CFR)*, *infection hospitality risk (IFR)* or -*hospitalisation fatality risk (HFR)*. +Running the `estimate_static()` function on datasets corresponding to different stages of the outbreak would lead to two static estimates, which may or may not differ. +This could be useful, but a more detailed and easier to interpret approach is to use the function demonstrated in this vignette, `estimate_time_varying()`. -We do so using a method that analyses the time series of cases and outcomes -with an appropriate delay -distribution. The methods used here are documented in @nishiura2009. +This vignette outlines how to use _cfr_ to estimate how the severity of a disease changes over the course of an ongoing outbreak. -First, we load the five essential packages needed for this vignette: +::: {.alert .alert-warning} +New to calculating disease severity using _cfr_? You might want to see the ["Get started" vignette first](cfr.html). +::: -```{r, message = FALSE, warning=FALSE, eval = TRUE} -library(cfr) -library(epiparameter) -library(covidregionaldata) -``` - -and the four optional packages, to run the multiple examples in the last -section of this vignette: - -```{r, message = FALSE, warning=FALSE, eval = TRUE} -library(dplyr) -library(tidyr) -library(purrr) -library(scales) -library(ggplot2) -``` - -# Motivation - -There are many biological, epidemiological and behavioural reasons why the -severity of a disease might change during the course of an outbreak. We list a -few of the most common reasons here: +::: {.alert .alert-primary} +## Use case {-} -* The introduction of vaccines or therapeutics, reducing the relative risk of death compared to infection. +We wish to estimate how common severity quantities _change over time_ --- such as the case fatality risk (CFR) --- using a method that uses a time series of cases, infections or hospitalisations (as appropriate) and deaths, _while correcting for the delay in reporting the outcomes of cases_. +::: -* The emergence of pathogen variants, which may alter the risk of hospitalisation or death associated with infection. +::: {.alert .alert-secondary} +## What we have {-} -For example, suppose a new variant of a currently circulating pathogen has -emerged. Typically, estimating the severity of a the newly emerged variant is of -critical importance for surveillance and planning purposes. Running the -`estimate_static()` function on datasets corresponding to outbreaks caused by -the two variants would lead to two static estimates, which may or may not -differ. This could be useful, but a more detailed and easier to interpret -approach is to use the function demonstrated in this vignette: -`estimate_time_varying()`. - -# Methods - -## Data required - -The data required to estimate how the severity of a disease changes over time -using the _cfr_ package includes: - -* A time-series of cases, hospitalisations or some other proxy for infections -over time; +* A time-series of cases, hospitalisations or some other proxy for infections over time; * A time-series of deaths; -* A delay distribution, describing the probability an individual will die $t$ -days after they were initially exposed. Such distributions come from the -literature, where studies have typically fit distributions to data describing -the process. +* A delay distribution, describing the probability an individual will die $t$ days after they were initially infected. -In practice, the time-series of cases and deaths will already be together, given -that they usually originate from sources that will have typically collated them -into a single data file. +The first two elements are expected be included in a dataframe with the columns "dates", "cases", and "deaths"; see below for examples. -The package requires a `data.frame` of input data --- typically case and death -time series data --- and a delay distribution. The delay distribution we use -here comes from the _epiparameter_ package. +The delay distribution is expected to be specified as an object of the class `` from the package [_epiparameter_](https://epiverse-trace.github.io/epiparameter/). +::: -## Adjusting for delays between the two time series - -The function `estimate_time_varying()` from the _cfr_ package estimates -the number of cases which have a known outcome over time. The method used -within this function follows @nishiura2009. -The function calculates a quantity $k_t$ for each day within the input data, -which represents the number of cases with a known outcome, on day $t$. Following -Nishiura et al., $k_t$ is calculated in the following way: - -\begin{equation} - k_t = \sum_{j = 0}^t c_t f_{j - t}. -\end{equation} +First load _cfr_ and packages to access and plot data. -We then assume that the severity measure, for example CFR, of interest is -Binomially-distributed, in the following way +```{r, message = FALSE, warning=FALSE, eval = TRUE} +library(cfr) -\begin{equation} - d_t \sim \text{Binomial}(k_t, \theta_t). -\end{equation} +# packages to access delay data and case data +library(epiparameter) +library(owidR) -We use maximum-likelihood techniques to determine the value of $\theta_t$ for -each $t$, whereby $\theta$ represents the severity measure of interest. +# packages to wrangle and plot data +library(dplyr) +library(tidyr) +library(purrr) +library(scales) +library(ggplot2) +``` -## Interpreting the estimates +The `estimate_time_varying()` function requires a `data.frame` of input data --- typically case and death time series data --- and a delay distribution. +The delay distribution we use here comes from the [_epiparameter_ package](https://epiverse-trace.github.io/epiparameter/). -The precise severity measure — CFR, IFR, HFR, etc — that $\theta$ represents -depends upon the input data given by the user. For complete clarity, here are -the most common time series that users might calculate severity from and the -resulting severity estimate from such data: +## Changing severity of the COVID-19 pandemic in the U.K. -* Case and death incidence data, with a case-to-death delay distribution (or -close approximation, such as onset-to-death). This would -result in a Case Fatality Risk (CFR) estimate. +We outline how the time-varying severity estimation works in *cfr* using a number of examples. +The data for all of the examples is from the ongoing COVID-19 epidemic. +Firstly, we analyse the U.K. data, then we pick three other countries with large outbreaks to analyse. -* Infection and death incidence data, with an exposure-to-death delay -distribution (or close approximation). This would result in an Infection -Fatality Risk (IFR) estimate. +### Preparing the raw data -* Hospitalisation and death incidence data, with a hospitalisation-to-death -delay distribution (or close approximation). This would result in a -Hospitalisation Fatality Risk (HFR) estimate. +First we subset the data so that we focus on just the first year of the COVID-19 outbreak in the U.K. +During this period, the CFR changed dramatically as a result of a number of changes in policy, such as implementation and relaxation of lockdown; rollout of the vaccine; new variants emerging, etc. -# Changing severity of the COVID-19 pandemic in the U.K. +We load the [global Covid-19 dataset collected by Our World in Data](https://ourworldindata.org/coronavirus), and provided through the [_owidR_ package](https://cran.r-project.org/package=owidR). -We outline how the time-varying severity estimation works in *cfr* using a -number of examples. The data for all of the examples is from the ongoing -COVID-19 epidemic. Firstly, we analyse the U.K. data, then we pick three other -countries with large outbreaks to analyse. +We introduce this dataset here, rather than its more U.K. focused equivalent from the _incidence2_ package, because we are going to use it further on for other countries. -## Downloading the raw data +```{r} +# get Covid data from {owidR} +df_covid <- owid_covid() -First of all, we subset the data so that we focus on just the first year of the -COVID-19 outbreak in the U.K. We do so, as the CFR changed dramatically as a -result of a number of factors: changes in policy, such as implementation and -relaxation of lockdown; rollout of the vaccine; new variants emerging, etc. We -download the data --- using the [_covidregionaldata_ package](https://github.com/epiforecasts/covidregionaldata) --- with the -following command: +# filter for the U.K +df_covid_uk <- filter(df_covid, location == "United Kingdom") -```{r} -df_covid_uk <- get_national_data( - countries = "united kingdom", source = "who", verbose = FALSE +# select and rename columns wth {dplyr} +df_covid_uk <- select( + df_covid_uk, date, + cases = new_cases, deaths = new_deaths ) - -df_covid_uk <- rename(df_covid_uk, cases = cases_new, deaths = deaths_new) ``` -We then subset the data to focus on just the first few months of the outbreak -with the following command: +We then subset the data to focus on just the first few months of the outbreak. ```{r} df_covid_uk_subset <- filter(df_covid_uk, date <= "2020-12-31") ``` -## Plotting the raw data - -First, we plot case incidence data with following command: +Then, we plot case incidence data with following command. +Further plotting commands are hidden for brevity. ```{r, fig.cap = "Incidence of cases over time for the ongoing COVID-19 outbreak in the U.K."} ggplot(df_covid_uk_subset) + @@ -183,10 +118,10 @@ ggplot(df_covid_uk_subset) + aes( x = date, y = cases ), - colour = "darkblue" + colour = "steelblue" ) + scale_x_date( - date_labels = "%d-%b-%Y" + date_labels = "%b-%Y" ) + scale_y_continuous( labels = comma @@ -196,7 +131,7 @@ ggplot(df_covid_uk_subset) + ) ``` -Then, we plot the death incidence data with following command: +We also plot the death incidence data over time. ```{r, fig.cap = "Incidence of deaths over time for the ongoing COVID-19 outbreak in the U.K.", class.source = 'fold-hide'} ggplot(df_covid_uk_subset) + @@ -204,10 +139,10 @@ ggplot(df_covid_uk_subset) + aes( x = date, y = deaths ), - colour = "red" + colour = "brown" ) + scale_x_date( - date_labels = "%d-%b-%Y" + date_labels = "%b-%Y" ) + scale_y_continuous( labels = comma @@ -217,11 +152,9 @@ ggplot(df_covid_uk_subset) + ) ``` -## The delay distribution +### Onset-to-death distribution for Covid-19 -We again retrieve the appropriate distribution reported in @linton2020 using the -`epidist_db()` function from the _epiparameter_ package, using the following -command: +We retrieve the appropriate distribution reported in @linton2020 using the `epidist_db()` function from the _epiparameter_ package. ```{r message=FALSE, warning=FALSE, eval=TRUE} onset_to_death_covid <- epidist_db( @@ -231,17 +164,15 @@ onset_to_death_covid <- epidist_db( ) ``` -To visualise this distribution, we evaluate it between 0 and 30 days, and plot -the results over time. We do so using the following command: +To visualise this distribution, we evaluate it between 0 and 30 days, and plot the results over time. -```{r message=FALSE, warning=FALSE, eval=TRUE, fig.cap = "Example plot of the appropriate delay distribution for the COVID-19 epidemic in the U.K.** We plot the onset-to-death distribution we use throughout this example for COVID-19, reported in https://doi.org/10.3390/jcm9020538."} -plot(onset_to_death_covid, day_range = seq(30)) +```{r message=FALSE, warning=FALSE, eval=TRUE, fig.cap = "Example plot of the appropriate delay distribution for the COVID-19 epidemic in the U.K. We plot the onset-to-death distribution we use throughout this example for COVID-19, reported in @linton2020."} +plot(onset_to_death_covid, day_range = seq(30), cex.main = 0.5) ``` -## Estimating the naive and corrected CFR +### Estimating the naive and corrected CFR -We use the `estimate_time_varying()` function within the _cfr_ package -to calculate the time-varying CFR for the COVID-19 epidemic in the U.K: +We use the `estimate_time_varying()` function within the _cfr_ package to calculate the time-varying CFR for the COVID-19 epidemic in the U.K. ```{r} # calculating the naive time-varying CFR @@ -253,6 +184,7 @@ df_covid_cfr_uk_naive <- estimate_time_varying( correct_for_delays = FALSE ) +# calculating the corrected time-varying CFR df_covid_cfr_uk_corrected <- estimate_time_varying( df_covid_uk_subset, epi_dist = onset_to_death_covid, @@ -262,8 +194,7 @@ df_covid_cfr_uk_corrected <- estimate_time_varying( ) ``` -Once we have calculated both severity quantities over time, we plot the results. -First we plot the naive estimate, uncorrected for delays: +Once we have calculated both severity quantities over time, we plot the results; first the naive estimate. ```{r, fig.cap = "Example plot of the naive time-varying CFR.** We calculate the time-varying CFR for the ongoing COVID-19 epidemic in the U.K., uncorrected for delays.", class.source = 'fold-hide'} ggplot(df_covid_cfr_uk_naive) + @@ -277,10 +208,10 @@ ggplot(df_covid_cfr_uk_naive) + aes( x = date, y = severity_me ), - colour = "red" + colour = "brown" ) + scale_x_date( - date_labels = "%d-%b-%Y" + date_labels = "%b-%Y" ) + scale_y_continuous( labels = percent @@ -290,7 +221,7 @@ ggplot(df_covid_cfr_uk_naive) + ) ``` -Lastly, we plot the corrected — for delays — estimate: +Lastly, we plot the estimate corrected for reporting delays. ```{r, fig.cap = "Example plot of the corrected time-varying CFR.** We calculate the time-varying CFR for the ongoing COVID-19 epidemic in the U.K., corrected for delays.", class.source = 'fold-hide'} ggplot(df_covid_cfr_uk_corrected) + @@ -304,10 +235,10 @@ ggplot(df_covid_cfr_uk_corrected) + aes( x = date, y = severity_me ), - colour = "red" + colour = "brown" ) + scale_x_date( - date_labels = "%d-%b-%Y" + date_labels = "%b-%Y" ) + scale_y_continuous( labels = percent @@ -317,57 +248,58 @@ ggplot(df_covid_cfr_uk_corrected) + ) ``` -# Severity of COVID-19 in multiple countries +## Severity of COVID-19 in multiple countries -This section uses functions from the `dplyr` package. It does so for two -reasons: `covidregionaldata`, the package we use to download the raw data, -downloads it as a `tibble`, the in-built data.frame class of the `dplyr` -package and because the package includes functions which allow the user to -easily map a function across one `tibble`, separated by groups. In our case, we -wish the run the `estimate_time_varying()` function across all countries. The -`tibble` contains data from all countries. We could make a separate `tibble` -for each country we wish to run. However, as the package `dplyr` already -contains the functionality to run a function across a large `tibble`, separated -by a group, we use it. Furthermore, `dplyr`, as part of the `tidyverse` is a -common package, which many users will already have installed. +We now show how to calculate the time-varying CFR for Covid-19 in multiple countries with large outbreaks. +To do this, we adopt a data science approach to apply the `estimate_time_varying()` function across data grouped by country. +We refer the user to the book [R for Data Science](https://r4ds.had.co.nz/) for a better explanation of some of the code used here, including from the packages in [the Tidyverse](https://www.tidyverse.org/). -First of all, we retrieve daily case and death data for COVID-19 for all -countries, again using the `covidregionaldata` package: +First we retrieve daily case and death data for COVID-19 for all countries and locations from the `owidR` package. ```{r} -df_covid <- get_national_data( - source = "who", verbose = FALSE +# get data from {owidR} +df_covid <- owid_covid() + +# filter out whole continents +df_covid <- filter( + df_covid, location != continent ) -df_covid <- rename(df_covid, cases = cases_new, deaths = deaths_new) +# rename the cases and deaths columns using {dplyr} +df_covid <- select( + df_covid, iso_code, location, continent, date, + cases = new_cases, deaths = new_deaths, + total_deaths +) ``` -Next, we run some `dplyr` commands to group the data and map the -`estimate_time_varying()` function across each group within the `tibble`. We do -not focus on the `dplyr` commands in this vignette. Instead, we refer the user -to the `dplyr` [documentation](https://dplyr.tidyverse.org/). Note the below code takes a few moments to run. +Next, we group the data by country and apply the `estimate_time_varying()` function across each group (country). +This code takes a few moments to run. ```{r} # get the total number of deaths in each country with more than 100,000 deaths df_covid_cfr <- df_covid %>% group_by(iso_code) %>% - mutate(total_deaths = max(deaths_total)) %>% - filter(total_deaths > 100000 & !is.na(country)) %>% + mutate(total_deaths = max(total_deaths, na.rm = TRUE)) %>% + filter(total_deaths > 100000 & !is.na(location)) %>% ungroup() # for each country, get the time-varying severity estimate, # correcting for delays and smoothing the case and death data -# first nest the data +# first nest the data; nest() from {tidyr} df_covid_cfr <- nest( df_covid_cfr, - .by = country + .by = location ) +``` +```{r} # to each nested data frame, apply the function `estimate_time_varying` # overwrite the `data` column, as all data will be preserved df_covid_cfr <- mutate( df_covid_cfr, + # using map() from {purrr} data = map( .x = data, .f = estimate_time_varying, # arguments to the function @@ -376,126 +308,70 @@ df_covid_cfr <- mutate( ) ) -# unnest the cfr data +# unnest the cfr data; unnest() from {tidyr} df_covid_cfr <- unnest(df_covid_cfr, cols = data) ``` -Note, for simplicity, we use the same delay distribution between onset and -death for all locations. +For simplicity, we use the same delay distribution between onset and death for all locations. -We then plot the time-varying CFR for a selection of three more countries with -large outbreaks of COVID-19. First of all, we plot the time-varying CFR -estimate for the COVID-19 outbreak in Brazil: +Finally we plot the time-varying CFR for a selection of three more countries with large outbreaks of COVID-19: Brazil, India, and the United States. -```{r, fig.cap = "Example plot of the corrected time-varying CFR.** We calculate the time-varying CFR for the ongoing COVID-19 epidemic in Brazil, corrected for delays.", class.source = 'fold-hide'} -filter(df_covid_cfr, country == "Brazil") %>% +```{r, fig.cap = "Example plot of the corrected time-varying CFR. We calculate the time-varying CFR for the ongoing COVID-19 epidemic in Brazil, India, and the United States, corrected for delays.", class.source = 'fold-hide', fig.width=8} +filter(df_covid_cfr, location %in% c("Brazil", "India", "United States")) %>% ggplot() + geom_ribbon( aes( - x = date, ymin = severity_lo, ymax = severity_hi + x = date, ymin = severity_lo, ymax = severity_hi, + group = location ), - fill = "grey75" + fill = "grey" ) + geom_line( aes( - x = date, y = severity_me - ), - colour = "red" + x = date, y = severity_me, colour = location + ) ) + scale_x_date( - date_labels = "%d-%b-%Y" + date_labels = "%b-%Y" ) + scale_y_continuous( labels = percent ) + - labs( - x = "Date", y = "CFR (%)" - ) -``` - -Next, we plot the time-varying CFR estimate for the COVID-19 outbreak in India: - -```{r, fig.cap = "Example plot of the corrected time-varying CFR.** We calculate the time-varying CFR for the ongoing COVID-19 epidemic in Germany deaths in total, corrected for delays.", class.source = 'fold-hide'} -filter(df_covid_cfr, country == "Germany") %>% - ggplot() + - geom_ribbon( - aes( - x = date, ymin = severity_lo, ymax = severity_hi - ), - fill = "grey75" - ) + - geom_line( - aes( - x = date, y = severity_me - ), - colour = "red" - ) + - scale_x_date( - date_labels = "%d-%b-%Y" + scale_colour_brewer( + palette = "Dark2" ) + - scale_y_continuous( - labels = percent + coord_cartesian( + ylim = c(0, 0.2), + expand = FALSE ) + labs( x = "Date", y = "CFR (%)" + ) + + theme( + legend.position = "top" ) ``` -Next, we plot the time-varying CFR estimate for the COVID-19 outbreak in the -United States: +## Details: Adjusting for delays between two time series {-} -```{r, fig.cap = "Example plot of the corrected time-varying CFR.** We calculate the time-varying CFR for the ongoing COVID-19 epidemic in United States, corrected for delays.", class.source = 'fold-hide'} -filter(df_covid_cfr, country == "United States") %>% - ggplot() + - geom_ribbon( - aes( - x = date, ymin = severity_lo, ymax = severity_hi - ), - fill = "grey75" - ) + - geom_line( - aes( - x = date, y = severity_me - ), - colour = "red" - ) + - scale_x_date( - date_labels = "%d-%b-%Y" - ) + - scale_y_continuous( - labels = percent - ) + - labs( - x = "Date", y = "CFR (%)" - ) -``` +The function `estimate_time_varying()` from the _cfr_ package estimates the number of cases which have a known outcome over time, following @nishiura2009. +The function calculates a quantity $k_t$ for each day within the input data, which represents the number of cases with a known outcome, on day $t$. +Following @nishiura2009, $k_t$ is calculated in the following way: -Finally, we plot all of the countries we have calculated the time-varying CFR -for, using the package `ggplot2`, which contains the `facet_wrap()` function. -This allows us to loop over the different countries with ease. We do so with -the following commands: +\begin{equation} + k_t = \sum_{j = 0}^t c_t f_{j - t}. +\end{equation} -```{r, fig.height = 1000/80, fig.cap = "The corrected time-varying CFR for each country with a large outbreak. We calculate the time-varying CFR for the ongoing COVID-19 epidemic in each country with a large outbreak — defined as any country with more than 100,000 deaths in total — corrected for delays.", class.source = 'fold-hide'} -df_covid_cfr %>% - ggplot( - aes(x = date) - ) + - geom_line(aes(y = severity_me), linetype = "dashed", alpha = 0.25) + - geom_ribbon( - aes( - ymin = severity_lo, - ymax = severity_hi - ), - fill = "dodgerblue", alpha = 0.5 - ) + - coord_cartesian(clip = "off") + - facet_wrap(facets = vars(country), ncol = 2) + - theme_minimal() + - labs(x = "Date", y = "CFR (%)") + - scale_y_continuous(labels = percent, limits = c(0, 0.1)) -``` +We then assume that the severity measure, for example CFR, of interest is +Binomially-distributed, in the following way + +\begin{equation} + d_t \sim \text{Binomial}(k_t, \theta_t). +\end{equation} + +We use maximum-likelihood techniques to determine the value of $\theta_t$ for +each $t$, whereby $\theta$ represents the severity measure of interest. -We do not focus on the `ggplot2` commands in this vignette. Instead, we refer -the user to the `ggplot2` documentation [here](https://ggplot2.tidyverse.org/0). +The precise severity measure — CFR, IFR, HFR, etc — that $\theta$ represents depends upon the input data given by the user. ## References