diff --git a/examples/diagnostics_and_criticism/model_averaging.ipynb b/examples/diagnostics_and_criticism/model_averaging.ipynb
index da806ef10..37d5db3f2 100644
--- a/examples/diagnostics_and_criticism/model_averaging.ipynb
+++ b/examples/diagnostics_and_criticism/model_averaging.ipynb
@@ -7,7 +7,7 @@
"(model_averaging)=\n",
"# Model Averaging\n",
"\n",
- ":::{post} Aug 2022\n",
+ ":::{post} Aug 2024\n",
":tags: model comparison, model averaging\n",
":category: intermediate\n",
":author: Osvaldo Martin\n",
@@ -32,11 +32,13 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Running on PyMC v5.9.2\n"
+ "Running on PyMC v5.16.2+24.g799c98f41\n"
]
}
],
"source": [
+ "import os\n",
+ "\n",
"import arviz as az\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
@@ -61,8 +63,7 @@
},
"outputs": [],
"source": [
- "RANDOM_SEED = 8927\n",
- "np.random.seed(RANDOM_SEED)\n",
+ "rng = np.random.seed(2741)\n",
"az.style.use(\"arviz-darkgrid\")"
]
},
@@ -79,54 +80,57 @@
"tags": []
},
"source": [
- "When confronted with more than one model we have several options. One of them is to perform model selection, using for example a given Information Criterion as exemplified by the PyMC examples {ref}`pymc:model_comparison` and the {ref}`GLM-model-selection`. Model selection is appealing for its simplicity, but we are discarding information about the uncertainty in our models. This is somewhat similar to computing the full posterior and then just keeping a point-estimate like the posterior mean; we may become overconfident of what we really know. You can also browse the {doc}`blog/tag/model-comparison` tag to find related posts. \n",
+ "When confronted with more than one model we have several options. One of them is to perform model selection as exemplified by the PyMC examples {ref}`pymc:model_comparison` and the {ref}`GLM-model-selection`, usually is a good idea to also include posterior predictive checks in order to decide which model to keep. Discarding all models except one is equivalent to affirm that, among the evaluated models, one is correct (under some criteria) with probability 1 and the rest are incorrect. In most cases this will be an overstatment that ignores the uncertainty we have in our models. This is somewhat similar to computing the full posterior and then just keeping a point-estimate like the posterior mean; we may become overconfident of what we really know. You can also browse the {doc}`blog/tag/model-comparison` tag to find related posts. \n",
"\n",
- "One alternative is to perform model selection but to consider all the different models together with the computed values of a given Information Criterion. It is important to put all these numbers and tests in the context of our problem so that we and our audience can have a better feeling of the possible limitations and shortcomings of our methods. If you are in the academic world you can use this approach to add elements to the discussion section of a paper, presentation, thesis, and so on.\n",
+ "An alternative to this dilema is to perform model selection but to acknoledge the models we discared. If the number of models are not that large this can be part of a technical discussion on a paper, presentation, thesis, and so on. If the audience is not technical enough, this may not be a good idea.\n",
"\n",
- "Yet another approach is to perform model averaging. The idea now is to generate a meta-model (and meta-predictions) using a weighted average of the models. There are several ways to do this. PyMC includes three methods that will be briefly discussed in this notebook. You will find a more thorough explanation in the work by {cite:t}`Yao_2018`. PyMC integrates with ArviZ for model comparison. \n",
+ "Yet another alternative, the topic of this example, is to perform model averaging. The idea is to weight each model by its merit and generate predictions from each model, proportional to those weights. There are several ways to do this, including the three methods that will be briefly discussed in this notebook. You will find a more thorough explanation in the work by {cite:t}`Yao_2018` and {cite:t}`Yao_2022`. \n",
"\n",
"\n",
"## Pseudo Bayesian model averaging\n",
"\n",
- "Bayesian models can be weighted by their marginal likelihood, which is known as Bayesian Model Averaging. While this is theoretically appealing, it is problematic in practice: on the one hand the marginal likelihood is highly sensitive to the specification of the prior, in a way that parameter estimation is not, and on the other, computing the marginal likelihood is usually a challenging task. An alternative route is to use the values of WAIC (Widely Applicable Information Criterion) or LOO (pareto-smoothed importance sampling Leave-One-Out cross-validation), which we will call generically IC, to estimate weights. We can do this by using the following formula:\n",
+ "Bayesian models can be weighted by their marginal likelihood, which is known as Bayesian Model Averaging. While this is theoretically appealing, it is problematic in practice: on the one hand the marginal likelihood is highly sensitive to the specification of the prior, in a way that parameter estimation is not, and on the other, computing the marginal likelihood is usually a challenging task. Additionally, Bayesian model averaging is flawed in the $\\mathcal{M}$-open setting in which the true data-generating process is not one of the candidate models being fit {cite:t}`Yao_2018`. A more robust approach is to compute the expected log pointwise predictive density (ELPD).\n",
+ "\n",
+ "$$\n",
+ "\\sum_i^N \\log \\int \\ p(y_i \\mid \\theta) \\; p(\\theta \\mid y) d\\theta\n",
+ "$$\n",
+ "\n",
+ "where $N$ is the number of data points, $y_i$ is the i-th data point, $\\theta$ are the parameters of the model, $p(y_i \\mid \\theta)$ is the likelihood of the i-th data point given the parameters, and $p(\\theta \\mid y)$ is the posterior distribution.\n",
+ "\n",
+ "Once we have computed the ELPD for each model we can compute weights by doing\n",
"\n",
- "$$w_i = \\frac {e^{ - \\frac{1}{2} dIC_i }} {\\sum_j^M e^{ - \\frac{1}{2} dIC_j }}$$\n",
+ "$$w_i = \\frac {e^{dELPD_i}} {\\sum_j^M e^{dELPD_i}}$$\n",
"\n",
- "Where $dIC_i$ is the difference between the i-th information criterion value and the lowest one. Remember that the lower the value of the IC, the better. We can use any information criterion we want to compute a set of weights, but, of course, we cannot mix them. \n",
+ "Where $dELPD_i$ is the difference between the model with the best ELPD and the i-th model.\n",
"\n",
- "This approach is called pseudo Bayesian model averaging, or Akaike-like weighting and is an heuristic way to compute the relative probability of each model (given a fixed set of models) from the information criteria values. Note that the denominator is just a normalization term to ensure that the weights sum up to one.\n",
+ "This approach is called pseudo Bayesian model averaging, or Akaike-like weighting and is an heuristic to compute the relative probability of each model (given a fixed set of models). Note that we exponetiate to \"revert\" the effect of the logarithm in the ELPD formula and the denominator is a normalization term to ensure that the weights sum up to one. With a pinch of salt, we can interpret these weights as the probability of each model explaining the data.\n",
+ "\n",
+ "So far so good, but the ELPD is a theoretical quantity, and in practice we need to approximate it. To do so ArviZ offers two methods\n",
+ "\n",
+ "* WAIC, Widely Applicable Information Criterion\n",
+ "* LOO, Pareto-Smooth-Leave-One-Out-Cross-Validation.\n",
+ "\n",
+ "Both requiere and InferenceData with the log-likelihood group and are equally fast to compute. We recommend using LOO because it has better practical properties, and better diagnostics (so we known when we are having issues with the ELPD estimation).\n",
"\n",
"## Pseudo Bayesian model averaging with Bayesian Bootstrapping\n",
"\n",
- "The above formula for computing weights is a nice and simple approach, but with one major caveat: it does not take into account the uncertainty in the computation of the IC. We could compute the standard error of the IC (assuming a Gaussian approximation) and modify the above formula accordingly. Or we can do something more robust, like using a [Bayesian Bootstrapping](http://www.sumsar.net/blog/2015/04/the-non-parametric-bootstrap-as-a-bayesian-model/) to estimate, and incorporate this uncertainty.\n",
+ "The above formula for computing weights is a nice and simple approach, but with one major caveat: it does not take into account the uncertainty in the computation of the ELPD. We could compute the standard error of the ELPD value (assuming a Gaussian approximation) and modify the above formula accordingly. Or we can do something more robust, like using a [Bayesian Bootstrapping](http://www.sumsar.net/blog/2015/04/the-non-parametric-bootstrap-as-a-bayesian-model/) to estimate, and incorporate this uncertainty.\n",
"\n",
"## Stacking\n",
"\n",
- "The third approach implemented in PyMC is known as _stacking of predictive distributions_ by {cite:t}`Yao_2018`. We want to combine several models in a metamodel in order to minimize the divergence between the meta-model and the _true_ generating model. When using a logarithmic scoring rule this is equivalent to:\n",
+ "The third approach we will discuss is known as _stacking of predictive distributions_ by {cite:t}`Yao_2018`. We want to combine several models in a metamodel in order to minimize the divergence between the meta-model and the _true_ generating model. When using a logarithmic scoring rule this is equivalent to:\n",
"\n",
- "$$\\max_{w} \\frac{1}{n} \\sum_{i=1}^{n}log\\sum_{k=1}^{K} w_k p(y_i|y_{-i}, M_k)$$\n",
+ "$$\\max_{w} \\frac{1}{n} \\sum_{i=1}^{n}log\\sum_{k=1}^{K} w_k p(y_i \\mid y_{-i}, M_k)$$\n",
"\n",
"Where $n$ is the number of data points and $K$ the number of models. To enforce a solution we constrain $w$ to be $w_k \\ge 0$ and $\\sum_{k=1}^{K} w_k = 1$. \n",
"\n",
- "The quantity $p(y_i|y_{-i}, M_k)$ is the leave-one-out predictive distribution for the $M_k$ model. Computing it requires fitting each model $n$ times, each time leaving out one data point. Fortunately we can approximate the exact leave-one-out predictive distribution using LOO (or even WAIC), and that is what we do in practice.\n",
+ "The quantity $p(y_i \\mid y_{-i}, M_k)$ is the leave-one-out predictive distribution for the $M_k$ model. Computing it requires fitting each model $n$ times, each time leaving out one data point. Fortunately, this is exactly what LOO approximates in a very efficient way. So we can use LOO and stacking together. To be fair, we can also use WAIC, even when WAIC approximates the ELPD in a different way.\n",
"\n",
"## Weighted posterior predictive samples\n",
"\n",
- "Once we have computed the weights, using any of the above 3 methods, we can use them to get weighted posterior predictive samples. PyMC offers functions to perform these steps in a simple way, so let's see them in action using an example.\n",
- "\n",
- "The following example is taken from the superb book {cite:t}`mcelreath2018statistical` by Richard McElreath. You will find more PyMC examples from this book in the repository [Statistical-Rethinking-with-Python-and-PyMC](https://github.com/pymc-devs/pymc-resources/tree/main/Rethinking_2). We are going to explore a simplified version of it. Check the book for the whole example and a more thorough discussion of both the biological motivation for this problem and a theoretical/practical discussion of using Information Criteria to compare, select and average models.\n",
+ "Once we have computed the weights, using any of the above 3 methods, we can use them to get weighted posterior predictive samples. We will illustrate how to do it using the body fat dataset {cite}`penrose1985`. This dataset has measurements from 251 individuals, including their weight, height, the circumference of the abdomen, the circumference of the wrist etc. Our purpose is to predict the percentage of body fat, as estimated by the siri variable, also available from the dataset.\n",
"\n",
- "Briefly, our problem is as follows: We want to explore the composition of milk across several primate species. It is hypothesized that females from species of primates with larger brains produce more _nutritious_ milk (loosely speaking this is done _in order to_ support the development of such big brains). This is an important question for evolutionary biologists. To try to give an answer we will use 3 variables:\n",
- "* two predictor variables - the proportion of neocortex mass compared to the total mass of the brain, and the logarithm of the body mass of the mothers. \n",
- "* one predicted variable - the kilocalories per gram of milk. \n",
- "\n",
- "With these variables we are going to build 3 different linear models:\n",
- " \n",
- "1. A model using only the neocortex variable\n",
- "2. A model using only the logarithm of the mass variable\n",
- "3. A model using both variables\n",
- "\n",
- "Let start by uploading the data and centering the `neocortex` and `log mass` variables, for better sampling."
+ "Let's start by loading the data"
]
},
{
@@ -164,53 +168,126 @@
" \n",
" \n",
" \n",
" \n",
" \n",
- " kcal.per.g \n",
- " neocortex \n",
- " log_mass \n",
+ " siri \n",
+ " age \n",
+ " weight \n",
+ " height \n",
+ " neck \n",
+ " chest \n",
+ " abdomen \n",
+ " hip \n",
+ " thigh \n",
+ " knee \n",
+ " ankle \n",
+ " biceps \n",
+ " forearm \n",
+ " wrist \n",
"
\n", + "\n" ], "text/plain": [ - "
\n", + "\n" ], "text/plain": [ - "
\n", + "\n" ], "text/plain": [ - "
\n", + "\n" ], "text/plain": [ - "
\n", + " | rank | \n", + "elpd_loo | \n", + "p_loo | \n", + "elpd_diff | \n", + "weight | \n", + "se | \n", + "dse | \n", + "warning | \n", + "scale | \n", + "
---|---|---|---|---|---|---|---|---|---|
model_1 | \n", + "0 | \n", + "-817.216895 | \n", + "3.626704 | \n", + "0.000000 | \n", + "0.639236 | \n", + "10.496342 | \n", + "0.000000 | \n", + "False | \n", + "log | \n", + "
model_0 | \n", + "1 | \n", + "-825.344978 | \n", + "1.832909 | \n", + "8.128083 | \n", + "0.360764 | \n", + "9.970768 | \n", + "8.698358 | \n", + "False | \n", + "log | \n", + "
<xarray.Dataset>\n", + "Dimensions: (siri_dim_2: 251, sample: 3999)\n", + "Coordinates:\n", + " * siri_dim_2 (siri_dim_2) int64 0 1 2 3 4 5 6 ... 244 245 246 247 248 249 250\n", + " * sample (sample) object MultiIndex\n", + " * chain (sample) int64 2 3 1 2 1 2 3 3 3 3 3 3 ... 3 3 3 1 0 0 2 0 2 1 2\n", + " * draw (sample) int64 682 691 550 397 831 520 ... 638 997 295 483 606 9\n", + "Data variables:\n", + " siri (siri_dim_2, sample) float64 17.75 16.43 14.7 ... 30.98 27.67\n", + "Attributes:\n", + " created_at: 2024-08-23T16:10:41.836182+00:00\n", + " arviz_version: 0.20.0.dev0\n", + " inference_library: pymc\n", + " inference_library_version: 5.16.2+24.g799c98f41
\n", - " | rank | \n", - "elpd_loo | \n", - "p_loo | \n", - "elpd_diff | \n", - "weight | \n", - "se | \n", - "dse | \n", - "warning | \n", - "scale | \n", - "
---|---|---|---|---|---|---|---|---|---|
model_2 | \n", - "0 | \n", - "8.266521 | \n", - "3.253300 | \n", - "0.000000 | \n", - "1.000000e+00 | \n", - "2.557509 | \n", - "0.000000 | \n", - "False | \n", - "log | \n", - "
model_1 | \n", - "1 | \n", - "4.340585 | \n", - "2.122619 | \n", - "3.925936 | \n", - "0.000000e+00 | \n", - "2.074807 | \n", - "1.723294 | \n", - "False | \n", - "log | \n", - "
model_0 | \n", - "2 | \n", - "3.551017 | \n", - "1.988888 | \n", - "4.715504 | \n", - "1.221245e-14 | \n", - "1.587097 | \n", - "2.493630 | \n", - "False | \n", - "log | \n", - "
<xarray.Dataset>\n", + "Dimensions: (siri_dim_0: 251)\n", + "Coordinates:\n", + " * siri_dim_0 (siri_dim_0) int64 0 1 2 3 4 5 6 ... 244 245 246 247 248 249 250\n", + "Data variables:\n", + " siri (siri_dim_0) float64 12.3 6.1 25.3 10.4 ... 33.6 29.3 26.0 31.9\n", + "Attributes:\n", + " created_at: 2024-08-23T16:10:41.440917+00:00\n", + " arviz_version: 0.20.0.dev0\n", + " inference_library: pymc\n", + " inference_library_version: 5.16.2+24.g799c98f41
<xarray.Dataset>\n", - "Dimensions: (kcal_dim_2: 17, sample: 7999)\n", - "Coordinates:\n", - " * kcal_dim_2 (kcal_dim_2) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16\n", - " * sample (sample) object MultiIndex\n", - " * chain (sample) int64 0 2 0 0 0 0 0 1 3 1 2 3 ... 2 1 1 1 0 3 2 1 3 2 3\n", - " * draw (sample) int64 216 768 211 631 322 ... 1824 1709 95 1165 1267\n", - "Data variables:\n", - " kcal (kcal_dim_2, sample) float64 0.3795 0.3581 ... 0.3967 0.6766\n", - "Attributes:\n", - " created_at: 2023-11-20T05:39:30.790844\n", - " arviz_version: 0.17.0.dev0\n", - " inference_library: pymc\n", - " inference_library_version: 5.9.2" + ".xr-wrap{width:700px!important;} " ], "text/plain": [ - "