diff --git a/README.Rmd b/README.Rmd index d1807eb57..98cb1f744 100644 --- a/README.Rmd +++ b/README.Rmd @@ -33,62 +33,34 @@ The goal of this package is to provide a tested and reliable collection of metri Predictions can be handled in various formats: `scoringutils` can handle probabilistic forecasts in either a sample based or a quantile based format. For more detail on the expected input formats please see below. True values can be integer, continuous or binary. -## Input formats -Most of the time, the `score()` function will be able to do the entire evaluation for you. All you need to do is to pass in a `data.frame` with the appropriate columns. Which columns are required depends on the format the forecasts come in. The forecast format can either be based on quantiles (see `example_quantile` for the expected format), based on predictive samples (see `example_continuous` and `example_integer` for the expected format in each case) or in a binary format. The following table gives an overview (pairwise comparisons will be explained in more detail below): - -```{r, echo=FALSE} -requirements <- data.table( - "Format" = c( - "quantile-based", "sample-based", "binary", "pairwise-comparisons" - ), - `Required columns` = c( - "'true_value', 'prediction', 'quantile'", "'true_value', 'prediction', - 'sample'", "'true_value', 'prediction'", "additionally a column 'model'" - ) -) -kable(requirements) -``` - -Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. It is important, that there are only columns present which are relevant in order to group forecasts. A combination of different columns should uniquely define the *unit of a single forecast*, meaning that a single forecast is defined by the values in the other columns. - - +## Installation +Install the CRAN version of this package using: -## Checking the input data - -The function `check_forecasts()` can be used to check the input data. It gives a summary of what `scoringutils` thinks you are trying to achieve. It infers the type of the prediction target, the prediction format, and the unit of a single forecasts, gives an overview of the number of unique values per column (helpful for spotting missing data) and returns warnings or errors. - -```{r} -head(example_quantile) +```{r, eval = FALSE} +install.packages("scoringutils") ``` -```{r} -check_forecasts(example_quantile) -``` +Install the stable development version of the package with: -If you are unsure what your input data should look like, have a look at the `example_quantile`, `example_integer`, `example_continuous` and `example_binary` data sets provided in the package. - -## Showing available forecasts +```{r, eval = FALSE} +install.packages("scoringutils", repos = "https://epiforecasts.r-universe.dev") +``` -The function `avail_forecasts()` may also be helpful to determine where forecasts are available. Using the `by` argument you can specify the level of summary. 
For example, to see how many forecasts there are per model and target_type, we can run +Install the unstable development from GitHub using the following, -```{r} -avail_forecasts(example_quantile, by = c("model", "target_type")) +```{r, eval = FALSE} +remotes::install_github("epiforecasts/scoringutils", dependencies = TRUE) ``` -We see that 'epiforecasts-EpiNow2' has some missing forecasts for the deaths forecast target and that UMass-MechBayes has no case forecasts. +## Quick start -This information can also be visualised using the `plot_avail_forecasts()` function: +In this quick start guide we explore some of the functionality of the `scoringutils` package using quantile forecasts from the [ECDC forecasting hub](https://covid19forecasthub.eu/) as an example. For more detailed documentation please see the package vignettes, and individual function documentation. -```{r, fig.width=11, fig.height=6} -example_quantile %>% - avail_forecasts(by = c("model", "forecast_date", "target_type")) %>% - plot_avail_forecasts() + - facet_wrap(~ target_type) -``` +### Plotting forecasts -You can also visualise forecasts directly using the `plot_predictions()` function: +As a first step to evaluating the forecasts we visualise them. For the purposes of this example here we make use of `plot_predictions()` to filter the available forecasts for a single model, and forecast date. ```{r, fig.width = 9, fig.height = 6} example_quantile %>% @@ -105,219 +77,40 @@ example_quantile %>% theme(legend.position = "bottom") ``` -## Scoring and summarising forecasts - -Forecasts can easily be scored using the `score()` function. This function returns unsumarised scores, which in most cases is not what the user wants. A second function, `summarise_scores` takes care of summarising these scores to the level specified by the user. +### Scoring forecasts -```{r} -score(example_quantile) %>% - head() -``` +Forecasts can be easily and quickly scored using the `score()` function. This function returns unsumarised scores, which in most cases is not what the user wants. Here we make use of additional functions from `scoringutils` to add empirical coverage-levels (`add_coverage()`), and scores relative to a baseline model (here chosen to be the EuroCOVIDhub-ensemble model). See the getting started vignette for more details. Finally we summarise these scores by model and target type. -```{r} +```{r score-example} example_quantile %>% score() %>% - summarise_scores(by = c("model", "target_type")) %>% - kable() -``` - -The `by` argument can be used to define the level of summary. By default, `by = NULL` is set to the unit of a single forecast. For quantile-based forecasts, unsummarised scores are returned for every quantile individually. It can therefore make sense to run `summarise_scores`even without any arguments provided. - -```{r} -score(example_quantile) %>% - summarise_scores() -``` - -### Adding empirical coverage - -For quantile-based forecasts we are often interested in specific coverage-levels, for example, what percentage of true values fell between all 50% or the 90% prediction intervals. We can add this information using the function `add_coverage()`. This function also requires a `by` argument which defines the level of grouping for which the percentage of true values covered by certain prediction intervals is computed. 
- -```{r} -score(example_quantile) %>% add_coverage(ranges = c(50, 90), by = c("model", "target_type")) %>% - summarise_scores(by = c("model", "target_type")) -``` - -### Adding relative scores - -In order to better compare models against each other we can use relative scores which are computed based on pairwise comparisons (see details below). Relative scores can be added to the evaluation using the function `summarise_scores()`. This requires a column called 'model' to be present. Pairwise comparisons are computed according to the grouping specified in `by`: essentially, the data.frame with all scores gets split into different data.frames according to the values specified in `by` and relative scores are computed for every individual group separately. The `baseline` argumen allows us to specify a baseline that can be used to scale relative scores (all scores are divided by the baseline relative score). For example, to obtain relative scores separately for different forecast targets, we can run - -```{r} -score(example_quantile) %>% - summarise_scores(by = c("model", "target_type"), - relative_skill = TRUE, - baseline = "EuroCOVIDhub-ensemble") -``` - - -## Visualising scores - -### Coloured table - -A simple coloured table can be produced based on the scores: - -```{r} -score(example_integer) %>% - summarise_scores(by = c("model")) %>% - plot_score_table() -``` - -### Score heatmap - -We can also summarise one particular metric across different categories using a simple heatmap: - -```{r, fig.width=11, fig.height=6} -score(example_continuous) %>% - summarise_scores(by = c("model", "location", "target_type")) %>% - plot_heatmap(x = "location", metric = "bias") + - facet_wrap(~ target_type) -``` - -### Weighted interval score components - -The weighted interval score can be split up into three components: Over-prediction, under-prediction and dispersion. These can be visualised separately in the following way: - -```{r} -score(example_quantile) %>% - summarise_scores(by = c("target_type", "model")) %>% - plot_wis() + - facet_wrap(~ target_type, scales = "free") -``` - -## Calibration - -Calibration is a measure statistical consistency between the forecasts and the observed values. The most common way of assessing calibration (more precisely: probabilistic calibration) are PIT histograms. The probability integral transform (PIT) is equal to the cumulative distribution function of a forecast evaluated at the true observed value. Ideally, pit values should be uniformly distributed after the transformation. - -We can compute pit values as such: - -```{r} -example_continuous %>% - pit(by = "model") -``` - -And visualise the results as such: - -```{r} -example_continuous %>% - pit(by = c("model", "target_type")) %>% - plot_pit() + - facet_grid(model ~ target_type) -``` - -Similarly for quantile-based forecasts: - -```{r} -example_quantile[quantile %in% seq(0.1, 0.9, 0.1), ] %>% - pit(by = c("model", "target_type")) %>% - plot_pit() + - facet_grid(model ~ target_type) -``` - -Another way to look at calibration are interval coverage and quantile coverage. Interval coverage is the percentage of true values that fall inside a given central prediction interval. Quantile coverage is the percentage of observed values that fall below a given quantile level. - -In order to plot interval coverage, you need to include "range" in the `by` argument to `summarise_scores()`. The green area on the plot marks conservative behaviour, i.e. your empirical coverage is greater than it nominally need be (e.g. 
55% of true values covered by all 50% central prediction intervals.) - -```{r} -example_quantile %>% - score() %>% - summarise_scores(by = c("model", "range")) %>% - plot_interval_coverage() -``` - -To visualise quantile coverage, you need to include "quantile" in `by`. Again, the green area corresponds to conservative forecasts, where central prediction intervals would cover more than needed. - -```{r} -example_quantile %>% - score() %>% - summarise_scores(by = c("model", "quantile")) %>% - plot_quantile_coverage() -``` - -## Pairwise comparisons - -Relative scores for different models can be computed using pairwise comparisons, a sort of pairwise tournament where all cominations of two models are compared against each other based on the overlapping set of available forecasts common to both models. Internally, a ratio of the mean scores of both models is computed. The relative score of a model is then the geometric mean of all mean score ratios which involve that model. When a baseline is provided, then that baseline is excluded from the relative scores for individual models (which therefore differ slightly from relative scores without a baseline) and all relative scores are scaled by (i.e. divided by) the relative score of the baseline model. - -In `scoringutils`, pairwise comparisons can be made in two ways: Through the standalone function `pairwise_comparison()` or from within `summarise_scores` which simply adds relative scores to an existing set of scores. - -```{r} -example_quantile %>% - score() %>% - pairwise_comparison(by = "model", baseline = "EuroCOVIDhub-baseline") -``` - -```{r} -example_quantile %>% - score() %>% - summarise_scores(by = "model", relative_skill = TRUE, baseline = "EuroCOVIDhub-baseline") -``` - -If using the `pairwise_comparison()` function, we can also visualise pairwise comparisons by showing the mean score ratios between models. By default, smaller values are better and the model we care about is showing on the y axis on the left, while the model against it is compared is shown on the x-axis on the bottom. In the example above, the EuroCOVIDhub-ensemble performs best (it only has values smaller 1), while the EuroCOVIDhub-baseline performs worst (and only has values larger than 1). For cases, the UMass-MechBayes model is of course excluded as there are no case forecasts available and therefore the set of overlapping forecasts is empty. - -```{r, fig.width=9, fig.height=7} -example_quantile %>% - score() %>% - pairwise_comparison(by = c("model", "target_type")) %>% - plot_pairwise_comparison() + - facet_wrap(~ target_type) -``` - - -## Additional analyses and visualisations - -### Correlation between scores - -It may sometimes be interesting to see how different scores correlate with each other. We can examine this using the function `correlation()`. When dealing with quantile-based forecasts, it is important to call `summarise_scorees()` before `correlation()` to summarise over quantiles before computing correlations. 
- -```{r} -example_quantile %>% - score() %>% - summarise_scores() %>% - correlation() -``` - -Visualisng correlations: - -```{r} -example_quantile %>% - score() %>% - summarise_scores() %>% - correlation() %>% - plot_correlation() -``` - -### Scores by interval ranges - -If you would like to see how different forecast interval ranges contribute to average scores, you can viusalise scores by interval range: - -```{r} -example_quantile %>% - score() %>% - summarise_scores(by = c("model", "range", "target_type")) %>% - plot_ranges() + - facet_wrap(~ target_type, scales = "free") + summarise_scores( + by = c("model", "target_type"), + relative_skill = TRUE, + baseline = "EuroCOVIDhub-ensemble" + ) %>% + kable() ``` -## Tips and tricks - converting to sample-based forecasts +`scoringutils` contains additional functionality to summarise these scores at different levels, to visualise them, and to explore the forecasts themselves. See the package vignettes and function documentation for more information. -Different metrics are available for different forecasting formats. In some cases, you may for example have forecasts in a sample-based format, but wish to make use of some of the functionality only available to quantile-based forecasts. For example, you may want to use the decomposition of the weighted interval score, or may like to compute interval coverage values. +## Citation -You can convert your forecasts into a sample-based format using the function `sample_to_quantile()`. There is, however, one caveat: Quantiles will be calculated based on the predictive samples, which may introduce a bias if the number of available samples is small. +If using `scoringutils` in your work please consider citing it using the following, -```{r} -example_integer %>% - sample_to_quantile(quantiles = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)) %>% - score() %>% - add_coverage(by = c("model", "target_type")) +```{r, echo = FALSE} +citation("scoringutils") ``` -## Available metrics +## How to make a bug report or feature request -An overview of available metrics can be found in the `metrics_summary` data set that is included in the package. +Please briefly describe your problem and what output you expect in an [issue](https://github.com/epiforecasts/scoringutils/issues). If you have a question, please don't open an issue. Instead, ask on our [Q and A page](https://github.com/epiforecasts/scoringutils/discussions/categories/q-a). -```{r} -metrics_summary -``` +## Contributing -## Lower level functions +We welcome contributions and new contributors! We particularly appreciate help on priority problems in the [issues](https://github.com/epiforecasts/scoringutils/issues). Please check and add to the issues, and/or add a [pull request](https://github.com/epiforecasts/scoringutils/pulls). -Most of these metrics are available as lower level functions and extensively documented. Have a look at the help files to understand these better. +## Code of Conduct + +Please note that the `scoringutils` project is released with a [Contributor Code of Conduct](https://epiforecasts.io/scoringutils/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms. 
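The quick start above notes that `scoringutils` can summarise scores at different levels and visualise them. As a minimal illustrative sketch (not taken from the diff itself; it assumes the packaged `example_quantile` data and the `interval_score` column that `score()` returns for quantile-based forecasts), the same scores could be summarised by model and location and drawn as a heatmap:

```r
library(scoringutils)
library(magrittr)

# summarise the same scores at a different level (model x location)
# and visualise a single metric as a heatmap
example_quantile %>%
  score() %>%
  summarise_scores(by = c("model", "location")) %>%
  plot_heatmap(x = "location", metric = "interval_score")
```

The same pattern should work with the other plotting helpers shown below, such as `plot_score_table()` or `plot_wis()`.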
diff --git a/vignettes/scoringutils.Rmd b/vignettes/getting-started.Rmd similarity index 53% rename from vignettes/scoringutils.Rmd rename to vignettes/getting-started.Rmd index 4cb2fa464..c25f5b7de 100644 --- a/vignettes/scoringutils.Rmd +++ b/vignettes/getting-started.Rmd @@ -1,56 +1,59 @@ --- -title: "scoringutils" +title: "Getting started" author: "Nikos Bosse" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{scoringutils} + %\VignetteIndexEntry{Getting started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- -```{r, include = FALSE} -knitr::opts_chunk$set( - fig.width = 7, - collapse = TRUE, - comment = "#>" -) +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, + fig.width = 7, + collapse = TRUE, + comment = "#>") library(scoringutils) library(magrittr) library(data.table) library(ggplot2) +library(knitr) ``` -The `scoringutils` package provides a collection of metrics and proper scoring rules that make it simple to score probabilistic forecasts against the true observed values. The `scoringutils` package offers convenient automated forecast evaluation in a data.table format (using the function `score()`), but also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matriced they can build upon in other applications. In addition it implements a wide range of flexible plots that are able to cover many use cases. +The `scoringutils` package provides a collection of metrics and proper scoring rules that make it simple to score probabilistic forecasts against the true observed values. The `scoringutils` package offers convenient automated forecast evaluation in a `data.table` format (using the function `score()`), but also provides experienced users with a set of reliable lower-level scoring metrics operating on vectors/matrices that they can build upon in other applications. In addition, it implements a wide range of flexible plots that cover many use cases. -The goal of this package is to provide a tested and reliable collection of metrics that can be used for scoring probabilistic forecasts (forecasts with a full predictive distribution, rather than point forecasts). It has a much stronger focus on convenience than e.g. the `scoringRules` package, which provides a comprehensive collection of proper scoring rules (also used in `scoringutils`). In contrast to `scoringutils`, there is little functionality to automatically evaluate forecasts, to visualise scores and to obtain relative scores between models. +The goal of this package is to provide a tested and reliable collection of metrics that can be used for scoring probabilistic forecasts (forecasts with a full predictive distribution, rather than point forecasts). It has a much stronger focus on convenience than e.g. the `scoringRules` package, which provides a comprehensive collection of proper scoring rules (also used in `scoringutils`). In contrast to other packages, `scoringutils` offers functionality to automatically evaluate forecasts, to visualise scores and to obtain relative scores between models. Predictions can be handled in various formats: `scoringutils` can handle probabilistic forecasts in either a sample based or a quantile based format. For more detail on the expected input formats please see below. True values can be integer, continuous or binary. ## Input formats -Most of the time, the `score` function will be able to do the entire evaluation for you. 
All you need to do is to pass in a `data.frame` with the appropriate columns. Which columns are required depends on the format the forecasts come in. The forecast format can either be based on quantiles (see `example_quantile`), based on predictive samples (see `example_continuous` and `example_integer`) or in a binary format. The following table gives an overview (pairwise comparisons will be explained in more detail below): +Most of the time, the `score()` function will be able to do the entire evaluation for you. All you need to do is to pass in a `data.frame` with the appropriate columns. Which columns are required depends on the format the forecasts come in. The forecast format can either be based on quantiles (see `example_quantile` for the expected format), based on predictive samples (see `example_continuous` and `example_integer` for the expected format in each case) or in a binary format. The following table gives an overview (pairwise comparisons will be explained in more detail below): ```{r, echo=FALSE} -requirements <- - data.table( - "Format" = c("quantile-based", "sample-based", "binary", "pairwise-comparisons"), - `Required columns` = c("'true_value', 'prediction', 'quantile'", - "'true_value', 'prediction', 'sample'", - "'true_value', 'prediction'", - "additionally a column 'model'") +requirements <- data.table( + "Format" = c( + "quantile-based", "sample-based", "binary", "pairwise-comparisons" + ), + `Required columns` = c( + "'true_value', 'prediction', 'quantile'", "'true_value', 'prediction', + 'sample'", "'true_value', 'prediction'", "additionally a column 'model'" ) -requirements +) +kable(requirements) ``` Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. It is important, that there are only columns present which are relevant in order to group forecasts. A combination of different columns should uniquely define the *unit of a single forecast*, meaning that a single forecast is defined by the values in the other columns. + + ## Checking the input data The function `check_forecasts()` can be used to check the input data. It gives a summary of what `scoringutils` thinks you are trying to achieve. It infers the type of the prediction target, the prediction format, and the unit of a single forecasts, gives an overview of the number of unique values per column (helpful for spotting missing data) and returns warnings or errors. ```{r} -example_quantile +head(example_quantile) ``` ```{r} @@ -64,46 +67,51 @@ If you are unsure what your input data should look like, have a look at the `exa The function `avail_forecasts()` may also be helpful to determine where forecasts are available. Using the `by` argument you can specify the level of summary. For example, to see how many forecasts there are per model and target_type, we can run ```{r} -avail_forecasts(example_quantile, - by = c("model", "target_type")) +avail_forecasts(example_quantile, by = c("model", "target_type")) ``` -We see that 'epiforecasts-EpiNow2' has some missing forecasts and that UMass-MechBayes has no case forecasts. +We see that 'epiforecasts-EpiNow2' has some missing forecasts for the deaths forecast target and that UMass-MechBayes has no case forecasts. 
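Returning briefly to the input requirements above: to make the required columns and the *unit of a single forecast* concrete, here is a minimal sketch with made-up numbers (the `toy` data table and all its values are purely illustrative) following the quantile-based format from the table:

```r
library(scoringutils)
library(data.table)

# a tiny quantile-based forecast set: 'model' and 'location' together
# define the unit of a single forecast (made-up values)
toy <- data.table(
  model      = "toy-model",
  location   = rep(c("A", "B"), each = 3),
  true_value = rep(c(12, 50), each = 3),
  quantile   = rep(c(0.25, 0.5, 0.75), times = 2),
  prediction = c(8, 10, 13, 40, 48, 60)
)

check_forecasts(toy) # infers target type, forecast format and forecast unit
score(toy)           # unsummarised scores, returned per quantile
```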
-This information can also be visualised: +This information can also be visualised using the `plot_avail_forecasts()` function: ```{r, fig.width=11, fig.height=6} -avail_forecasts(example_quantile, - by = c("model", "forecast_date", "target_type")) %>% - plot_avail_forecasts() + +example_quantile %>% + avail_forecasts(by = c("model", "forecast_date", "target_type")) %>% + plot_avail_forecasts() + facet_wrap(~ target_type) - ``` -You can visualise forecasts directly using the `plot_predictions` function: +You can also visualise forecasts directly using the `plot_predictions()` function: -```{r, fig.width=9, fig.height=6} -plot_predictions(data = example_quantile, - x = "target_end_date", - filter_truth = list('target_end_date <= "2021-07-15"', - 'target_end_date > "2021-05-22"'), - filter_forecasts = list("model == 'EuroCOVIDhub-ensemble'", - 'forecast_date == "2021-06-28"')) + +```{r, fig.width = 9, fig.height = 6} +example_quantile %>% + plot_predictions( + x = "target_end_date", + filter_truth = list( + 'target_end_date <= "2021-07-15"', 'target_end_date > "2021-05-22"' + ), + filter_forecasts = list( + "model == 'EuroCOVIDhub-ensemble'", 'forecast_date == "2021-06-28"' + ) + ) + facet_wrap(target_type ~ location, ncol = 4, scales = "free") + theme(legend.position = "bottom") ``` ## Scoring and summarising forecasts -Forecasts can easily be scored using the `score()` function. This function returns unsumarised scores, which in most cases is not what the user wants. A second function, `summarise_scores` takes care of the summary. +Forecasts can easily be scored using the `score()` function. This function returns unsumarised scores, which in most cases is not what the user wants. A second function, `summarise_scores` takes care of summarising these scores to the level specified by the user. ```{r} -score(example_quantile) +score(example_quantile) %>% + head() ``` ```{r} -score(example_quantile) %>% - summarise_scores(by = c("model", "target_type")) +example_quantile %>% + score() %>% + summarise_scores(by = c("model", "target_type")) %>% + kable() ``` The `by` argument can be used to define the level of summary. By default, `by = NULL` is set to the unit of a single forecast. For quantile-based forecasts, unsummarised scores are returned for every quantile individually. It can therefore make sense to run `summarise_scores`even without any arguments provided. @@ -144,7 +152,7 @@ A simple coloured table can be produced based on the scores: ```{r} score(example_integer) %>% summarise_scores(by = c("model")) %>% - plot_score_table() + plot_score_table() ``` ### Score heatmap @@ -194,7 +202,7 @@ Similarly for quantile-based forecasts: ```{r} example_quantile[quantile %in% seq(0.1, 0.9, 0.1), ] %>% pit(by = c("model", "target_type")) %>% - plot_pit() + + plot_pit() + facet_grid(model ~ target_type) ``` @@ -222,7 +230,7 @@ example_quantile %>% Relative scores for different models can be computed using pairwise comparisons, a sort of pairwise tournament where all cominations of two models are compared against each other based on the overlapping set of available forecasts common to both models. Internally, a ratio of the mean scores of both models is computed. The relative score of a model is then the geometric mean of all mean score ratios which involve that model. When a baseline is provided, then that baseline is excluded from the relative scores for individual models (which therefore differ slightly from relative scores without a baseline) and all relative scores are scaled by (i.e. 
divided by) the relative score of the baseline model. -In `scoringutils`, pairwise comparisons can be made in two ways: Through the standalone function `pairwise_comparison()` or from within `summarise_scores` which simply adds relative scores to an existing set of scores. +In `scoringutils`, pairwise comparisons can be made in two ways: Through the standalone function `pairwise_comparison()` or from within `summarise_scores()` which simply adds relative scores to an existing set of scores. ```{r} example_quantile %>% @@ -233,7 +241,9 @@ example_quantile %>% ```{r} example_quantile %>% score() %>% - summarise_scores(by = "model", relative_skill = TRUE, baseline = "EuroCOVIDhub-baseline") + summarise_scores( + by = "model", relative_skill = TRUE, baseline = "EuroCOVIDhub-baseline" + ) ``` If using the `pairwise_comparison()` function, we can also visualise pairwise comparisons by showing the mean score ratios between models. By default, smaller values are better and the model we care about is showing on the y axis on the left, while the model against it is compared is shown on the x-axis on the bottom. In the example above, the EuroCOVIDhub-ensemble performs best (it only has values smaller 1), while the EuroCOVIDhub-baseline performs worst (and only has values larger than 1). For cases, the UMass-MechBayes model is of course excluded as there are no case forecasts available and therefore the set of overlapping forecasts is empty. @@ -242,7 +252,7 @@ If using the `pairwise_comparison()` function, we can also visualise pairwise co example_quantile %>% score() %>% pairwise_comparison(by = c("model", "target_type")) %>% - plot_pairwise_comparison() + + plot_pairwise_comparison() + facet_wrap(~ target_type) ``` @@ -260,7 +270,7 @@ example_quantile %>% correlation() ``` -Visualisng correlations: +Visualising correlations: ```{r} example_quantile %>% @@ -278,7 +288,7 @@ If you would like to see how different forecast interval ranges contribute to av example_quantile %>% score() %>% summarise_scores(by = c("model", "range", "target_type")) %>% - plot_ranges() + + plot_ranges() + facet_wrap(~ target_type, scales = "free") ``` @@ -290,7 +300,9 @@ You can convert your forecasts into a sample-based format using the function `sa ```{r} example_integer %>% - sample_to_quantile(quantiles = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)) %>% + sample_to_quantile( + quantiles = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99) + ) %>% score() %>% add_coverage(by = c("model", "target_type")) ``` @@ -306,202 +318,3 @@ metrics_summary ## Lower level functions Most of these metrics are available as lower level functions and extensively documented. Have a look at the help files to understand these better. - - -# Scoring Forecasts Directly - -A variety of metrics and scoring rules can also be accessed directly through -the `scoringutils` package. - -The following gives an overview of (most of) the implemented metrics. - -## Bias - -The function `bias` determines bias from predictive Monte-Carlo samples, -automatically recognising whether forecasts are continuous or -integer valued. - -For continuous forecasts, Bias is measured as -$$B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))$$ - -where $P_t$ is the empirical cumulative distribution function of the -prediction for the true value $x_t$. Computationally, $P_t (x_t)$ is -just calculated as the fraction of predictive samples for $x_t$ -that are smaller than $x_t$. 
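To make the continuous bias formula above concrete, the empirical CDF at the true value can be computed by hand from toy samples (the numbers below are made up); the result should be close to what `bias_sample()` returns, although the package function may treat ties and discreteness slightly differently:

```r
library(scoringutils)

set.seed(123)
true_value <- 10
samples <- rnorm(1000, mean = 9, sd = 2) # predictive samples for one target

# P_t(x_t): fraction of predictive samples smaller than the true value
p_at_true <- mean(samples < true_value)
1 - 2 * p_at_true # bias computed by hand from the formula above

bias_sample(true_value, matrix(samples, nrow = 1)) # should be very similar
```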
- -For integer valued forecasts, Bias is measured as - -$$B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t + 1))$$ - -to adjust for the integer nature of the forecasts. In both cases, Bias can -assume values between -1 and 1 and is 0 ideally. - -```{r} -## integer valued forecasts -true_values <- rpois(30, lambda = 1:30) -predictions <- replicate(200, rpois(n = 30, lambda = 1:30)) -bias_sample(true_values, predictions) - -## continuous forecasts -true_values <- rnorm(30, mean = 1:30) -predictions <- replicate(200, rnorm(30, mean = 1:30)) -bias_sample(true_values, predictions) -``` - - -## Sharpness -Sharpness is the ability of the model to generate predictions within a -narrow range. It is a data-independent measure, and is purely a feature -of the forecasts themselves. - -Sharpness / dispersion of predictive samples corresponding to one single true value is -measured as the normalised median of the absolute deviation from -the median of the predictive samples. For details, see `?stats::mad` - -```{r} -predictions <- replicate(200, rpois(n = 30, lambda = 1:30)) -mad_sample(predictions) -``` - -## Calibration - -Calibration or reliability of forecasts is the ability of a model to -correctly identify its own uncertainty in making predictions. In a model -with perfect calibration, the observed data at each time point look as if -they came from the predictive probability distribution at that time. - -Equivalently, one can inspect the probability integral transform of the -predictive distribution at time t, - -$$u_t = F_t (x_t)$$ - -where $x_t$ is the observed data point at time $t \text{ in } t_1, …, t_n$, -n being the number of forecasts, and $F_t$ is the (continuous) predictive -cumulative probability distribution at time t. If the true probability -distribution of outcomes at time t is $G_t$ then the forecasts $F_t$ are -said to be ideal if $F_t = G_t$ at all times $t$. In that case, the -probabilities ut are distributed uniformly. - -In the case of discrete outcomes such as incidence counts, -the PIT is no longer uniform even when forecasts are ideal. -In that case a randomised PIT can be used instead: - -$$u_t = P_t(k_t) + v \cdot (P_t(k_t) - P_t(k_t - 1) )$$ - -where $k_t$ is the observed count, $P_t(x)$ is the predictive -cumulative probability of observing incidence $k$ at time $t$, -$P_t (-1) = 0$ by definition and $v$ is standard uniform and independent -of $k$. If $P_t$ is the true cumulative -probability distribution, then $u_t$ is standard uniform. - -The function checks whether integer or continuous forecasts were provided. -It then applies the (randomised) probability integral and tests -the values $u_t$ for uniformity using the -Anderson-Darling test. - -As a rule of thumb, there is no evidence to suggest a forecasting model is -miscalibrated if the p-value found was greater than a threshold of $p >= 0.1$, -some evidence that it was miscalibrated if $0.01 < p < 0.1$, and good -evidence that it was miscalibrated if $p <= 0.01$. -In this context it should be noted, though, that uniformity of the -PIT is a necessary but not sufficient condition of calibration. It should -als be noted that the test only works given sufficient samples, otherwise the -Null hypothesis will often be rejected outright. - - -## Continuous Ranked Probability Score (CRPS) -Wrapper around the `crps_sample()` function from the -`scoringRules` package. For more information look at the manuals from the -`scoringRules` package. The function can be used for continuous as well as -integer valued forecasts. 
Smaller values are better. - -```{r} -true_values <- rpois(30, lambda = 1:30) -predictions <- replicate(200, rpois(n = 30, lambda = 1:30)) -crps(true_values, predictions) -``` - - - -## Dawid-Sebastiani Score (DSS) -Wrapper around the `dss_sample()` function from the -`scoringRules` package. For more information look at the manuals from the -`scoringRules` package. The function can be used for continuous as well as -integer valued forecasts. Smaller values are better. - -```{r} -true_values <- rpois(30, lambda = 1:30) -predictions <- replicate(200, rpois(n = 30, lambda = 1:30)) -dss(true_values, predictions) -``` - -## Log Score -Wrapper around the `log_sample()` function from the -`scoringRules` package. For more information look at the manuals from the -`scoringRules` package. The function should not be used for integer valued -forecasts. While Log Scores are in principle possible for integer valued -forecasts they require a kernel density estimate which is not well defined -for discrete values. Smaller values are better. - -```{r} -true_values <- rnorm(30, mean = 1:30) -predictions <- replicate(200, rnorm(n = 30, mean = 1:30)) -logs(true_values, predictions) -``` - -## Brier Score -The Brier score is a proper score rule that assesses the accuracy of -probabilistic binary predictions. The outcomes can be either 0 or 1, -the predictions must be a probability that the true outcome will be 1. - -The Brier Score is then computed as the mean squared error between the -probabilistic prediction and the true outcome. - -$$\text{Brier_Score} = \frac{1}{N} \sum_{t = 1}^{n} (\text{prediction}_t - \text{outcome}_t)^2$$ - - -```{r} -true_values <- sample(c(0,1), size = 30, replace = TRUE) -predictions <- runif(n = 30, min = 0, max = 1) - -brier_score(true_values, predictions) -``` - -## Interval Score -The Interval Score is a Proper Scoring Rule to score quantile predictions, -following Gneiting and Raftery (2007). Smaller values are better. - -The score is computed as - -$$ \text{score} = (\text{upper} - \text{lower}) + \\ -\frac{2}{\alpha} \cdot (\text{lower} - \text{true_value}) \cdot 1(\text{true_values} < \text{lower}) + \\ -\frac{2}{\alpha} \cdot (\text{true_value} - \text{upper}) \cdot -1(\text{true_value} > \text{upper})$$ - - -where $1()$ is the indicator function and $\alpha$ is the decimal value that -indicates how much is outside the prediction interval. -To improve usability, the user is asked to provide an interval range in -percentage terms, i.e. interval_range = 90 (percent) for a 90 percent -prediction interval. Correspondingly, the user would have to provide the -5\% and 95\% quantiles (the corresponding alpha would then be 0.1). -No specific distribution is assumed, -but the range has to be symmetric (i.e you can't use the 0.1 quantile -as the lower bound and the 0.7 quantile as the upper). -Setting `weigh = TRUE` will weigh the score by $\frac{\alpha}{2}$ such that -the Interval Score converges to the CRPS for increasing number of quantiles. - - -```{r} -true_values <- rnorm(30, mean = 1:30) -interval_range <- 90 -alpha <- (100 - interval_range) / 100 -lower <- qnorm(alpha/2, rnorm(30, mean = 1:30)) -upper <- qnorm((1- alpha/2), rnorm(30, mean = 1:30)) - -interval_score(true_values = true_values, - lower = lower, - upper = upper, - interval_range = interval_range) -``` -
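As a closing arithmetic check of the interval score formula above (toy numbers, and assuming `weigh = FALSE` gives the unweighted score exactly as written in the formula): for a 90% prediction interval [5, 15] and an observed value of 18, only the penalty for the observation exceeding the upper bound contributes.

```r
library(scoringutils)

alpha <- (100 - 90) / 100                  # 90% interval corresponds to alpha = 0.1
manual <- (15 - 5) + 2 / alpha * (18 - 15) # width + penalty = 10 + 60 = 70
manual

interval_score(
  true_values = 18, lower = 5, upper = 15,
  interval_range = 90, weigh = FALSE
)
```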