diff --git a/docs/index.md b/docs/index.md
index 2ee142c9f..7879bc8a3 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,7 +4,7 @@
 GPJax is a didactic Gaussian process (GP) library in JAX, supporting GPU acceleration
 and just-in-time compilation. We seek to provide a flexible API to enable researchers
 to rapidly prototype and develop new ideas.
 
-![Gaussian process posterior.](./_static/GP.svg)
+![Gaussian process posterior.](static/GP.svg)
 
 ## "Hello, GP!"
 
@@ -40,7 +40,7 @@ would write on paper, as shown below.
 
 !!! Install
 
-    GPJax can be installed via pip. See our [installation guide](https://docs.jaxgaussianprocesses.com/installation/) for further details.
+    GPJax can be installed via pip. See our [installation guide](installation.md) for further details.
 
     ```bash
     pip install gpjax
@@ -48,7 +48,7 @@ would write on paper, as shown below.
 
 !!! New
 
-    New to GPs? Then why not check out our [introductory notebook](https://docs.jaxgaussianprocesses.com/examples/intro_to_gps/) that starts from Bayes' theorem and univariate Gaussian distributions.
+    New to GPs? Then why not check out our [introductory notebook](_examples/intro_to_gps.md) that starts from Bayes' theorem and univariate Gaussian distributions.
 
 !!! Begin
 
diff --git a/docs/sharp_bits.md b/docs/sharp_bits.md
index be88beb0c..72aeb726b 100644
--- a/docs/sharp_bits.md
+++ b/docs/sharp_bits.md
@@ -60,7 +60,7 @@ learning rate is greater is than 0.03, we would end up with a negative variance
 We visualise this issue below where the red cross denotes the invalid lengthscale
 value that would be obtained, were we to optimise in the unconstrained parameter space.
 
-![](_static/step_size_figure.svg)
+![](static/step_size_figure.svg)
 
 A simple but impractical solution would be to use a tiny learning rate which would
 reduce the possibility of stepping outside of the parameter's support. However, this
@@ -70,7 +70,7 @@ subspace of the real-line onto the entire real-line. Here, gradient updates are
 applied in the unconstrained parameter space before transforming the value back to the
 original support of the parameters. Such a transformation is known as a bijection.
 
-![](_static/bijector_figure.svg)
+![](static/bijector_figure.svg)
 
 To help understand this, we show the effect of using a log-exp bijector in the above
 figure. We have six points on the positive real line that range from 0.1 to 3 depicted
@@ -81,8 +81,7 @@ value, we apply the inverse of the bijector, which is the exponential function i
 case. This gives us back the blue cross.
 
 In GPJax, we supply bijective functions using [Tensorflow Probability](https://www.tensorflow.org/probability/api_docs/python/tfp/substrates/jax/bijectors).
-In our [PyTrees doc](examples/pytrees.md) document, we detail how the user can define
-their own bijectors and attach them to the parameter(s) of their model.
+
 
 ## Positive-definiteness
 
@@ -91,8 +90,7 @@ their own bijectors and attach them to the parameter(s) of their model.
 ### Why is positive-definiteness important?
 
 The Gram matrix of a kernel, a concept that we explore more in our
-[kernels notebook](examples/constructing_new_kernels.py) and our [PyTree notebook](examples/pytrees.md), is a
-symmetric positive definite matrix. As such, we
+[kernels notebook](_examples/constructing_new_kernels.md), is a symmetric positive definite matrix. As such, we
 have a range of tools at our disposal to make subsequent operations on the covariance
 matrix faster. One of these tools is the Cholesky factorisation that uniquely
 decomposes any symmetric positive-definite matrix $\mathbf{\Sigma}$ by
@@ -158,7 +156,7 @@ for some problems, this amount may need to be increased.
 ## Slow-to-evaluate
 
 Famously, a regular Gaussian process model (as detailed in
-[our regression notebook](examples/regression.py)) will scale cubically in the number of data points.
+[our regression notebook](_examples/regression.md)) will scale cubically in the number of data points.
 Consequently, if you try to fit your Gaussian process model to a data set containing
 more than several thousand data points, then you will likely incur a significant
 computational overhead. In such cases, we recommend using Sparse Gaussian processes to
@@ -168,7 +166,7 @@ When the data contains less than around 50000 data points, we recommend using
 the collapsed evidence lower bound objective [@titsias2009] to optimise the parameters
 of your sparse Gaussian process model. Such a model will scale linearly in the number
 of data points and quadratically in the number of inducing points. We demonstrate its use
-in [our sparse regression notebook](examples/collapsed_vi.py).
+in [our sparse regression notebook](_examples/collapsed_vi.md).
 
 For data sets exceeding 50000 data points, even the sparse Gaussian process outlined
 above will become computationally infeasible. In such cases, we recommend using the
@@ -176,4 +174,4 @@ uncollapsed evidence lower bound objective [@hensman2013gaussian] that allows st
 mini-batch optimisation of the parameters of your sparse Gaussian process model. Such
 a model will scale linearly in the batch size and quadratically in the number of
 inducing points. We demonstrate its use in
-[our sparse stochastic variational inference notebook](examples/uncollapsed_vi.py).
+[our sparse stochastic variational inference notebook](_examples/uncollapsed_vi.md).
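
A note on the bijection trick described in the `docs/sharp_bits.md` hunks above: constrained parameters such as lengthscales are log-transformed into an unconstrained space, updated there, and mapped back with the exponential. The sketch below illustrates the idea using TensorFlow Probability's JAX substrate, which is the library the docs point to; the parameter value, learning rate and gradient are made up, and this is not GPJax's actual training loop.

```python
import jax.numpy as jnp
from tensorflow_probability.substrates import jax as tfp

tfb = tfp.bijectors
bijector = tfb.Exp()  # forward: exp (unconstrained -> positive), inverse: log

lengthscale = jnp.array(0.5)                   # a positive, constrained parameter
unconstrained = bijector.inverse(lengthscale)  # log-transform into unconstrained space

# Gradient step in the unconstrained space; the step can be large without
# ever leaving the parameter's support (illustrative values only).
learning_rate, gradient = 0.03, 4.0
unconstrained = unconstrained - learning_rate * gradient

# Map back onto the positive real line: positive by construction.
lengthscale = bijector.forward(unconstrained)
```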
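
Similarly, the positive-definiteness hunks refer to an amount of jitter that "may need to be increased". A minimal sketch of that stabilisation trick, assuming a hand-built RBF Gram matrix rather than GPJax's own kernel machinery:

```python
import jax.numpy as jnp

# Hand-built RBF Gram matrix (illustrative only).
x = jnp.linspace(0.0, 1.0, 5)[:, None]
gram = jnp.exp(-0.5 * (x - x.T) ** 2)

# Positive definite in exact arithmetic, but round-off can push the smallest
# eigenvalues slightly negative. A small diagonal "jitter" restores
# positive-definiteness before the Cholesky factorisation.
jitter = 1e-6  # increase this if the factorisation still produces NaNs
L = jnp.linalg.cholesky(gram + jitter * jnp.eye(gram.shape[0]))

print(jnp.isnan(L).any(), jnp.max(jnp.abs(L @ L.T - gram)))
```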
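
Finally, the scalings quoted in the sparse-GP paragraphs can be made concrete with a back-of-the-envelope comparison; the numbers below are illustrative, not benchmarks:

```python
# Orders of magnitude only, constants ignored: exact GP inference is cubic in n,
# the collapsed ELBO is O(n * m^2), and the uncollapsed ELBO is O(b * m^2) per step.
n, m, b = 50_000, 200, 128   # data points, inducing points, mini-batch size

exact = n**3                 # exact GP regression
collapsed = n * m**2         # collapsed ELBO [@titsias2009]
uncollapsed = b * m**2       # uncollapsed ELBO [@hensman2013gaussian], per mini-batch

print(f"exact / collapsed       ~ {exact / collapsed:,.0f}x")
print(f"collapsed / uncollapsed ~ {collapsed / uncollapsed:,.0f}x per optimisation step")
```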