{WIP} Developer guide update #7510 (Draft)

fonnesbeck wants to merge 8 commits into pymc-devs:main from fonnesbeck:developer_guide_update

Commits (8):
- cdcc1da Updated text (fonnesbeck)
- e9ecab2 Merge branch 'main' into developer_guide_update (fonnesbeck)
- be1222e Updates (fonnesbeck)
- ca9ccce Additional updates (fonnesbeck)
- 3cc7039 Updated authorship (fonnesbeck)
- 36ec6f3 Additional updates to text (fonnesbeck)
- 0a1abe9 Merge branch 'developer_guide_update' of github.com:fonnesbeck/pymc i… (fonnesbeck)
- e877c74 Add ToC (fonnesbeck)

@@ -4,38 +4,70 @@ orphan: true

# PyMC Developer Guide

{doc}`PyMC <index>` is a Python package for Bayesian statistical modeling built on top of the {doc}`PyTensor <pytensor:index>` library.
This document explains the design and implementation of probabilistic programming in PyMC, with comparisons to other probabilistic programming libraries like TensorFlow Probability (TFP) and Pyro.
A user-facing API introduction can be found in the {ref}`API quickstart <pymc_overview>`.
An accessible introduction to building models with PyMC can be found in [our PyData London 2022 tutorial](https://github.com/fonnesbeck/probabilistic_python).

## Table of Contents

- [Distributions](#distributions)
- [Reflection](#reflection)
- [PyMC in Comparison](#pymc-in-comparison)
- [PyMC](#pymc)
- [Tensorflow Probability](#tensorflow-probability)
- [Pyro](#pyro)
- [Behind the scenes of the logp function](#behind-the-scenes-of-the-logp-function)
- [Model Context and Random Variables](#model-context-and-random-variables)
- [Additional things that pm.Model does](#additional-things-that-pmmodel-does)
- [Logp and dlogp](#logp-and-dlogp)
- [Inference](#inference)
- [MCMC](#mcmc)
- [Transition kernel](#transition-kernel)
- [Dynamic HMC](#dynamic-hmc)
- [Variational Inference (VI)](#variational-inference-vi)
- [Some challenges and insights from implementing VI](#some-challenges-and-insights-from-implementing-vi)
- [Forward sampling](#forward-sampling)
- [Extending PyMC](#extending-pymc)
- [What we got wrong](#what-we-got-wrong)
- [Shape](#shape)
- [Random methods in numpy](#random-methods-in-numpy)
- [Samplers are in Python](#samplers-are-in-python)

## Distributions

Probability distributions in PyMC are implemented as classes that inherit from {class}`~pymc.Continuous` or {class}`~pymc.Discrete`.
Both of these inherit from {class}`~pymc.Distribution`, which defines the high-level API.

For a detailed introduction on how a specific statistical distribution should be implemented, check out the {ref}`guide on implementing distributions <implementing_distribution>`.
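
As a quick orientation, here is a minimal sketch of that high-level API (assuming only ``import pymc as pm``; it is not part of the guide's running example):

```python
import pymc as pm

# Concrete distributions sit below the shared Distribution API
assert issubclass(pm.Normal, pm.Continuous)
assert issubclass(pm.Poisson, pm.Discrete)
assert issubclass(pm.Continuous, pm.Distribution)

# .dist() creates a standalone random variable, not registered with any model
rv = pm.Normal.dist(mu=0.0, sigma=1.0)
print(pm.draw(rv, draws=3))  # three random draws from the distribution
```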

## Reflection

Let's consider how the tensor/value semantics for probability distributions are enabled in PyMC.

Model random variables are created by calling probability distribution classes with parameters inside of a `pm.Model` context, using a syntax analogous to statistical notation. For example, a normal distribution with a specified mean and standard deviation is written as:

$$
z \sim \text{Normal}(0, 5)
$$

And in PyMC:

```python
with pm.Model():
    z = pm.Normal("z", 0, 5)
```

The context manager intercepts information about the distribution relevant to the model, such as the variable dimension and any transforms, and registers it with the model.

The call to a {class}`~pymc.Distribution` constructor returns a PyTensor {class}`~pytensor.tensor.TensorVariable`, which is a symbolic representation of the model variable and the graph of inputs it depends on.

```python
print(type(z))
# ==> <class 'pytensor.tensor.variable.TensorVariable'>
```

Under the hood, the variables are created through the {meth}`~pymc.Distribution.dist` classmethod, which calls the {class}`~pytensor.tensor.random.basic.RandomVariable` {class}`~pytensor.graph.op.Op` corresponding to the distribution.

At a high level of abstraction, the idea behind ``RandomVariable`` ``Op``s is to create symbolic variables (``TensorVariable``s) that can be associated with the properties of a probability distribution.
For example, the ``RandomVariable`` ``Op``, which becomes part of the symbolic computation graph, is associated with the random number generators or probability mass/density functions of the distribution.
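
To make that association concrete, the following sketch (purely illustrative, reusing the model above) pokes at the objects involved:

```python
import pymc as pm

with pm.Model():
    z = pm.Normal("z", 0, 5)

# The TensorVariable is the output of a RandomVariable Op in the graph
print(type(z.owner.op))        # the RandomVariable Op that created z
print(pm.draw(z))              # a draw, using the Op's random number generator
print(pm.logp(z, 2.5).eval())  # the log-density at 2.5, approx -2.653
```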

@@ -57,7 +89,7 @@ Now, because the ``NormalRV`` can be associated with the [probability density fu

```python
with pm.Model():
    z = pm.Normal("z", 0, 5)
    symbolic = pm.logp(z, 2.5)
symbolic.eval()
# array(-2.65337645)
```

@@ -92,14 +124,17 @@ $$

```python
with pm.Model() as model:
    z = pm.Normal('z', mu=0., sigma=5.)
    # ==> pytensor.tensor.var.TensorVariable
    x = pm.Normal('x', mu=z, sigma=1., observed=5.)
    # ==> pytensor.tensor.var.TensorVariable

# The log-prior of z=2.5
pm.logp(z, 2.5).eval()
# ==> -2.65337645
# The log-likelihood of x=5 given z=2.5
pm.logp(x, 5.).eval({z: 2.5})
# ==> -4.0439386
# The total model logp (log-prior plus log-likelihood)
model.compile_logp()({'z': 2.5})
# ==> -6.6973152
```

### Tensorflow Probability

@@ -110,25 +145,35 @@ import tensorflow.compat.v1 as tf

```python
import tensorflow.compat.v1 as tf
from tensorflow_probability import distributions as tfd

with tf.Session() as sess:
    z_dist = tfd.Normal(loc=0., scale=5.)
    # ==> <class 'tfp.python.distributions.normal.Normal'>
    z = z_dist.sample()
    # ==> <class 'tensorflow.python.framework.ops.Tensor'>
    x = tfd.Normal(loc=z, scale=1.).log_prob(5.)
    # ==> <class 'tensorflow.python.framework.ops.Tensor'>
    model_logp = z_dist.log_prob(z) + x
    print(sess.run(x, feed_dict={z: 2.5}))
    # ==> -4.0439386
    print(sess.run(model_logp, feed_dict={z: 2.5}))
    # ==> -6.6973152
```

### Pyro

```python
import torch
import pyro
import pyro.distributions as dist

z_dist = dist.Normal(loc=0., scale=5.)
# ==> <class 'pyro.distributions.torch.Normal'>
z = pyro.sample("z", z_dist)
# ==> <class 'torch.Tensor'>
# reset/specify value of z
z.data = torch.tensor(2.5)
x = dist.Normal(loc=z, scale=1.).log_prob(5.)
# ==> <class 'torch.Tensor'>
model_logp = z_dist.log_prob(z) + x
x
# ==> -4.0439386
model_logp
# ==> -6.6973152
```

@@ -159,21 +204,24 @@ As explained above, distribution in a ``pm.Model()`` context automatically turn

Getting the logp of a free RV is just a matter of evaluating the ``logp()`` [on itself](https://github.com/pymc-devs/pymc/blob/6d07591962a6c135640a3c31903eba66b34e71d8/pymc/model.py#L1212-L1213):

```python
class Normal(Continuous):
    def logp(self, value):
        mu = self.mu
        tau = self.tau
        return bound(
            (-tau * (value - mu) ** 2 + pt.log(tau / np.pi / 2.0)) / 2.0,
            tau > 0,
        )
```

The logp evaluations are represented as tensors (``RV.logpt``). When we combine different ``logp`` values (for example, by summing all ``RVs.logpt`` to obtain the total logp for the model), PyTensor manages the dependencies automatically during the graph construction and compilation process.
This dependence among nodes in the model graph means that whenever you want to generate a new function that takes new input tensors, you either need to regenerate the graph with the appropriate dependencies, or replace the node by editing the existing graph.
The latter is facilitated by PyTensor's ``pytensor.clone_replace()`` function.
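
As a standalone sketch of that second option (independent of any PyMC model, assuming only that PyTensor is installed), ``pytensor.clone_replace`` swaps a node in an existing graph for a new input:

```python
import pytensor
import pytensor.tensor as pt

x = pt.scalar("x")
y = x ** 2 + 1.0

# Build a copy of the graph of y in which x has been replaced by a new input
x_new = pt.scalar("x_new")
(y_new,) = pytensor.clone_replace([y], replace={x: x_new})

print(y_new.eval({x_new: 3.0}))  # ==> 10.0
```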

### Model Context and Random Variables

A signature feature of PyMC's syntax is the ``with pm.Model() ...`` expression, which extends the functionality of the context manager in Python to make expressing Bayesian models as natural as possible.

Essentially [what a context manager does](https://www.python.org/dev/peps/pep-0343/) is:

@@ -193,38 +241,33 @@ finally:

```python
VAR = EXPR
VAR.__enter__()
try:
    USERCODE
finally:
    VAR.__exit__()
```

or conceptually:

```python
with EXPR as VAR:
    # DO SOMETHING
    USERCODE
    # DO SOME ADDITIONAL THINGS
```

But what are the implications of this, besides the model instantiation ``model = pm.Model()``?
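
For intuition, here is a small illustration of what the context manager buys us in practice; the only assumption beyond the text above is that ``pm.Model.get_context()`` returns the model currently on top of the context stack, which is how distribution calls locate the active model:

```python
import pymc as pm

with pm.Model() as model:
    # Inside the block, the active model is discoverable from the context
    # stack, so pm.Normal does not need an explicit model argument
    assert pm.Model.get_context() is model
    x = pm.Normal("x", 0.0, 1.0)

# On exit, the variable has been registered with that model
print(model.free_RVs)  # ==> [x]
```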

### Random Variable

As we have seen already, when we call e.g. ``pm.Normal('x', ...)`` within a Model context, it returns a random variable.

```python
with pm.Model() as model:
    x = pm.Normal('x', mu=0., sigma=1.)

print(type(x))
# ==> <class 'pytensor.tensor.var.TensorVariable'>
print(model.free_RVs)
# ==> [x]
print(pm.logp(x, 5.0))
# ==> Elemwise{switch,no_inplace}.0
print(pm.logp(x, 5.).eval({}))
# ==> -13.418938533204672
print(model.compile_logp()({'x': 5.}))
# ==> -13.418938533204672
```

In general, if a variable has observations (``observed`` parameter), the RV is an observed RV; otherwise, if it has a ``transformed`` attribute (``transform`` parameter), it is a transformed RV; otherwise, it takes the most elementary form: a free RV.

Note that this means that random variables with observations cannot be transformed.
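
The following sketch illustrates that classification on a toy model; the exact name of the transformed value variable is an implementation detail and is shown only as an assumption:

```python
import pymc as pm

with pm.Model() as m:
    sd = pm.HalfNormal("sd", 1.0)                       # free RV, given a default log-transform
    y = pm.Normal("y", 0.0, sd, observed=[0.1, -0.3])   # observed RV, never transformed

print(m.free_RVs)      # ==> [sd]
print(m.observed_RVs)  # ==> [y]
print(m.value_vars)    # the (possibly transformed) value variables, e.g. [sd_log__]
```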

<!--

Below, I will take a deeper look into transformed RV. A normal user
might not necessarily come in contact with the concept, since a
transformed RV and ``TransformedDistribution`` are intentionally not

@@ -245,18 +288,11 @@ usually created in order to optimise performance. But getting a

possible (see also in
{ref}`doc <pymc_overview##Transformed-distributions-and-changes-of-variables>`):

```python
lognorm = Exp().apply(pm.Normal.dist(0., 1.))
lognorm
# <pymc.distributions.transforms.TransformedDistribution at 0x7f1536749b00>
```

Now, back to ``model.RV(...)`` - things returned from ``model.RV(...)``
are PyTensor tensor variables, and it is clear from looking at

@@ -305,29 +341,33 @@ transformation by nested applying multiple transforms to a Distribution

z2 = Exp().apply(z)
z2.transform is None # ==> True
-->

### Additional things that ``pm.Model`` does

In a way, ``pm.Model`` is a tape machine that records what is being added to the model; it keeps track of the random variables (observed or unobserved), potential terms (additional tensors to be added to the model logp), and deterministic transformations (as bookkeeping):

* named\_vars
* free\_RVs
* observed\_RVs
* deterministics
* potentials
* missing\_values

More importantly, a ``pm.Model()`` contains methods to compile PyTensor functions that take Random Variables (that are also initialised within the same model) as input, for example:

```python
from pymc.blocking import DictToArrayBijection

with pm.Model() as m:
    z = pm.Normal('z', 0., 10., shape=10)
    x = pm.Normal('x', z, 1., shape=10)

print(m.initial_point())
# {'z': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), 'x': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])}
raveled = DictToArrayBijection.map(m.initial_point())  # point dict -> flat array (RaveledVars)
print(raveled.data)
# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(DictToArrayBijection.rmap(raveled))  # flat array -> point dict
```
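
To see a couple of the bookkeeping lists above in action, here is a small, purely illustrative sketch (variable names are arbitrary):

```python
import pymc as pm

with pm.Model() as m:
    z = pm.Normal("z", 0.0, 1.0)
    d = pm.Deterministic("d", z ** 2)       # bookkeeping only; not part of the model logp
    pen = pm.Potential("penalty", -z ** 2)  # extra term added to the model logp

print(m.free_RVs)        # ==> [z]
print(m.deterministics)  # ==> [d]
print(m.potentials)      # ==> [penalty]
```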

@@ -336,26 +376,19 @@ print(m.bijection.rmap(np.arange(20)))

```python
list(filter(lambda x: "logp" in x, dir(pm.Model)))
# ['compile_d2logp',
#  'compile_dlogp',
#  'compile_logp',
#  'd2logp',
#  'datalogp',
#  'dlogp',
#  'logp',
#  'logp_dlogp_function',
#  'observedlogp',
#  'point_logps',
#  'potentiallogp',
#  'varlogp',
#  'varlogp_nojac']
```
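
A minimal sketch of two of the ``compile_*`` helpers in that list (using a fresh scalar model; the printed numbers are approximate):

```python
import pymc as pm

with pm.Model() as m:
    z = pm.Normal("z", 0.0, 5.0)

logp_fn = m.compile_logp()    # compiled function: point dict -> model log-density
dlogp_fn = m.compile_dlogp()  # compiled function: point dict -> gradient

print(logp_fn({"z": 2.5}))   # ==> approx -2.653
print(dlogp_fn({"z": 2.5}))  # ==> approx [-0.1]
```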

### Logp and dlogp

Review comment: This is not a thing anymore