---
title: {Artificial, Biological} Neural Nets
---
In an energy-based model (EBM), we trade away the numerical precision of a probabilistic likelihood, which comes from normalizing a variable over a density, and instead are concerned with finding the dependence between two variables: our input $x$ and our output $y$.

We define an energy function as $F: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where the scalar $F(x, y)$ measures the compatibility of the pair $(x, y)$; low energy means high compatibility.

To perform inference, we search this function using gradient descent (or an alternative optimization method) to find compatible pairs: $\check{y} = \operatorname{argmin}_{y} F(x, y)$.

In a latent variable energy-based model, the output $y$ also depends on an unobserved variable $z$, the latent variable, which is minimized over during inference: $F(x, y) = \min_{z} E(x, y, z)$.
EBMs provide an alternative to probability density estimation by learning a manifold that captures the dependency between variables. In doing so, they provide a way to describe both probabilistic and non-probabilistic approaches to learning without needing to estimate the normalization constant required in probabilistic models, increasing the flexibility of the model.
The Gibbs-Boltzmann distribution was originally used in thermodynamics to find the probability that a system will be in a certain state given that state's energy.

Energies can then be thought of as being unnormalized negative log probabilities. That is, we may use the Gibbs-Boltzmann distribution to convert an energy function to its equivalent probabilistic representation after normalization, i.e.

$$P(y \mid x) = \frac{\exp(-\beta F(x, y))}{\int_{y'} \exp(-\beta F(x, y'))}$$

The derivation introduces a $\beta$ term, which acts as an inverse temperature. When $\beta \to \infty$, the distribution concentrates all of its mass on $\check{y} = \operatorname{argmin}_{y} F(x, y)$, recovering energy minimization. In physics, $\beta$ is inversely proportional to temperature, so $\beta \to \infty$ corresponds to the temperature going to zero.
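To make this concrete over a discrete set of outputs (where the integral above becomes a sum), converting energies to probabilities is just a softmax of the negative energies. A minimal sketch with made-up energy values:

```python
import numpy as np

def energies_to_probs(F, beta=1.0):
    """Gibbs-Boltzmann: P(y|x) = exp(-beta*F(x,y)) / sum_y' exp(-beta*F(x,y'))."""
    logits = -beta * F
    logits -= logits.max()             # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

F = np.array([0.5, 2.0, 3.0])          # energies of three candidate outputs y
print(energies_to_probs(F, beta=1.0))  # low energy -> high probability
print(energies_to_probs(F, beta=50.0)) # beta -> inf concentrates on argmin F
```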
There are two classes of learning methods for training an EBM, i.e. for parameterizing $F(x, y)$:
- Contrastive methods push down the energy of training data points, $F(x_i, y_i)$, while pushing up the energy everywhere else, $F(x_i, y')$. Types of contrastive methods differ in the way they pick the points to push up. For an example of this method, see the section on Generative Adversarial Networks.
- Regularized latent variable methods build the energy function $F(x, y)$ so that the volume of low-energy regions is limited or minimized by applying regularization. Types of architectural methods differ in the way they limit the information capacity of the code. For an example of this method, see the section on Variational Autoencoders.
Examples of contrastive learning methods include Contrastive Divergence, Ratio Matching, Noise Contrastive Estimation, and Minimum Probability Flow. The term contrastive sample is often used to refer to a data point causing an energy pull-up, such as the incorrect $y'$'s described above.
TODO: add sections on Contrastive embedding, Sparse Coding, Contrastive Divergence.
There are three ways to combine probability density models:
- Mixture -- Take a weighted average of the distributions. The mixture can never be sharper than the individual distributions, making this a very weak way to combine models.
- Product -- Multiply the distributions at each point and then renormalize. This is exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway.
- Composition -- Use the values of the latent variables of one model as the data for the next model. This works well for learning multiple layers of representation, but only if the individual models are undirected.
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. This technique is known as the product of experts (PoE).
As we've seen, energy-based models represent probability distributions over data by assigning an unnormalized probability scalar, i.e. an energy, to each input data point. This provides useful modeling flexibility since any arbitrary model that outputs a real number given an input can be used as an energy model. Since each model represents an unnormalized probability distribution, models can be naturally combined through product of experts or other hierarchical models.
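As a toy illustration of a product of experts over a discrete variable, the sketch below multiplies two made-up unnormalized expert distributions pointwise and renormalizes; note how the product concentrates where the experts agree:

```python
import numpy as np

def product_of_experts(unnormalized):
    """Multiply unnormalized expert distributions pointwise, then renormalize."""
    p = np.prod(unnormalized, axis=0)
    return p / p.sum()

# Two "experts" over 5 discrete states; each is sharp in a different region.
expert_a = np.array([1.0, 4.0, 4.0, 1.0, 0.5])
expert_b = np.array([0.5, 1.0, 4.0, 4.0, 1.0])
print(product_of_experts(np.stack([expert_a, expert_b])))
# The product is sharpest where the experts agree (state 2),
# unlike a mixture, which can never be sharper than its components.
```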
There are two categories of density models:
- Stochastic generative models using directed acyclic graphs (see the section on Bayes Networks). Generation from this type of model is easy, inference can be hard, and learning is easy after inference.

  $$P(v) = \sum_h P(h) P(v \mid h)$$

  where $v$ are the visible units and $h$ are the hidden units.

- Energy-based models that associate an energy with each data vector. Generation from this type of model is hard, inference can be easy, and learning is typically hard but varies.

  $$P(v, h) = \frac{\exp(-E(v, h))}{\sum_{u, g} \exp(-E(u, g))}$$

  The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.

  $$P(v) = \frac{\sum_{h} \exp(-E(v, h))}{\sum_{u, g} \exp(-E(u, g))}$$

  The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it. (Both formulas are checked numerically in the sketch after this list.)
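Both formulas can be verified by brute-force enumeration on a tiny model. The sketch below uses an arbitrary toy energy over two visible and two hidden binary units:

```python
import numpy as np
from itertools import product

def E(v, h):
    # Arbitrary toy energy over binary units (values in {0, 1}).
    return -(v[0]*h[0] + v[1]*h[1]) + 0.5*(v[0] + h[1])

states = list(product([0, 1], repeat=2))
Z = sum(np.exp(-E(v, h)) for v in states for h in states)  # partition function

def P_v(v):
    # Marginal: sum the joint over all hidden configurations containing v.
    return sum(np.exp(-E(v, h)) for h in states) / Z

for v in states:
    print(v, round(P_v(v), 4))
print(sum(P_v(v) for v in states))   # sanity check: marginals sum to 1
```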
Probabilities can be understood as integrating over a density. In Bayesian analysis, many of the densities are not analytically tractable or have a posterior that is expensive to compute. When it is intractable, we can try to programmatically simulate a random variable with the given density.
Markov chain Monte Carlo (MCMC) methods are algorithms for sampling from a probability distribution by constructing a Markov chain. Under certain conditions, we can build a memoryless (Markovian) process whose limiting distribution matches that of the random variable we're trying to simulate. Successive states of the chain are correlated, but once the process converges to its equilibrium distribution, a long random walk through the chain behaves like a set of draws from the target distribution, producing an estimate of an otherwise intractable posterior.
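A minimal sketch of one such method, the Metropolis algorithm, applied to an arbitrary unnormalized density: because acceptance depends only on the ratio of densities, the intractable normalization constant never needs to be computed.

```python
import numpy as np

def unnormalized_p(x):
    # Bimodal target density, known only up to a constant.
    return np.exp(-(x - 2)**2) + np.exp(-(x + 2)**2)

def metropolis(n_steps=10000, step=1.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + rng.normal(0, step)              # symmetric proposal
        if rng.random() < unnormalized_p(x_new) / unnormalized_p(x):
            x = x_new                                # accept; otherwise keep x
        samples.append(x)
    return np.array(samples)

samples = metropolis()
print(samples.mean(), samples.std())   # mean near 0, mass clustered at +/-2
```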
Surprisingly, it is unnecessary to allow the simulated physical process to reach the equilibrium distribution. If we start the process at an observed data vector and run it for a few steps, we can generate a "confabulation" that works well for adjusting the weights. If the Markov chain starts to diverge from the data in a systematic way, we already have evidence that the model is imperfect and that it can be improved (in this local region of the data space) by reducing the energy of the initial data vector and raising the energy of the confabulation.
The contrastive backpropagation learning procedure cycles through the observed data vectors, adjusting each weight by

$$\Delta w \;\propto\; \frac{\partial E(\mathbf{c})}{\partial w} - \frac{\partial E(\mathbf{d})}{\partial w}$$

where $\mathbf{d}$ is an observed data vector and $\mathbf{c}$ is the confabulation produced from it, so that the energy of the data is lowered and the energy of the confabulation is raised.
The basis of all Bayesian statistics can be interpreted through Bayes' Theorem, simply stated as

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

where $P(\theta)$ is the prior, $P(D \mid \theta)$ the likelihood, and $P(\theta \mid D)$ the posterior.
A Bayesian Network is a directed acyclic graph that represents a factorization of the joint probability of all random variables. If the random variables are $x_1, \ldots, x_n$, the joint probability factorizes as

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(x_i))$$

The general Bayesian workflow is as follows:
- Define the prior distribution that incorporates subjective beliefs about a parameter. If the prior is uninformative, the posterior is data-driven. If the prior is informative, the posterior is a mixture of the prior and the data.
- Collect data. The more informative the prior, the more data you need to "change" your beliefs. With a lot of data, the data will dominate the posterior distribution.
- Update the prior distribution with the collected data using Bayes' theorem to obtain a posterior distribution. The posterior distribution represents the updated beliefs about the parameter after having seen the data. In most cases, the posterior distribution has to be found via MCMC simulations, although for conjugate priors it is available in closed form, as in the sketch following this list.
- Analyze the posterior distribution and summarize it (e.g. by describing its mean, median, standard deviation, quantiles, etc.)
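As a minimal worked example, the sketch below uses a conjugate Beta prior on a coin's bias, so the posterior update from Bayes' theorem has a closed form and no MCMC is needed (the prior parameters and data here are invented):

```python
from scipy import stats

alpha, beta = 2, 2        # prior belief: bias theta is probably near 0.5
heads, tails = 70, 30     # collected data

# Bayes' theorem with a conjugate Beta prior and Bernoulli likelihood:
# the posterior is simply Beta(alpha + heads, beta + tails).
posterior = stats.beta(alpha + heads, beta + tails)

print("posterior mean:", posterior.mean())          # pulled toward 0.7 by data
print("95% credible interval:", posterior.interval(0.95))
```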
The Hopfield network is a complete undirected graph made of binary threshold neurons, usually having values of either 1 or -1, with all pairs of units connected by symmetric weighted edges. Hopfield networks serve as content-addressable memory and provide a model for understanding associative memory in humans.
The constraint that weights are symmetric guarantees that the energy function decreases monotonically under the update rule, so descending its manifold will converge to a local minimum. The energy is

$$E = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j + \sum_i \theta_i s_i$$

where $s_i$ is the state of unit $i$, $w_{ij}$ is the symmetric weight between units $i$ and $j$, and $\theta_i$ is the threshold of unit $i$.
Training a Hopfield network involves lowering the energy of states that
we want the network to remember. The network will converge to a
remembered state if it is given only part of the queried state, i.e.
given a distorted pattern it can recover a memorized one that is most
similar. Much like our own memory, the network may retrieve the wrong
pattern, known as a spurious pattern, from the wrong local minimum
instead of the intended pattern, known as retrieval states, at the
expected local minimum. For each stored pattern $x$, the negation $-x$ is also a spurious pattern.
Hopfield networks called dense associative memory (DAM) models use an energy function that can be expressed as the sum of interaction functions of the form $F(x) = x^n$, where $n$ is an integer; choosing $n > 2$ raises the storage capacity well beyond that of the classical network.
Updating units in the Hopfield network can be done one at a time, in a biologically plausible and asynchronous way, or they can be updated synchronously, all at the same time. Updating is performed using the following rule:

$$s_{i} \leftarrow \begin{cases} +1 & \mbox{if } \displaystyle\sum_{j} w_{ij} s_{j} \geq \theta_{i}, \\ -1 & \mbox{otherwise.} \end{cases}$$

The updating rule implies that neurons attract or repel each other in state space, i.e. the values of neurons will converge if the weight between them is positive or they will diverge if the weight is negative. Under repeated updates, the network will eventually converge to a stable state which is a local minimum or basin of attraction in the energy function.
A learning rule is used to store information in the memory. It must follow the constraints of only being local to a neuron and its adjacent neurons and must be incremental, meaning that updating weights to store a new pattern must depend only on values of the previously stored information. These rules enforce a sort of biological plausibility on the network but have been altered to improve functionality (see the section on continuous Hopfield networks).
The Hebbian theory of synaptic plasticity is closely related to the constraints of this model. The Hebbian rule is often summarized as "Neurons that fire together, wire together", and is both local and incremental like the standard Hopfield network. When learning $n$ binary patterns $\epsilon^{\mu}$, the Hebbian rule sets the weights to

$$w_{ij} = \frac{1}{n} \sum_{\mu=1}^{n} \epsilon_{i}^{\mu} \epsilon_{j}^{\mu}$$
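A minimal numpy sketch of Hebbian storage and asynchronous recall in a binary Hopfield network (the patterns and sizes are invented for illustration):

```python
import numpy as np

def store(patterns):
    """Hebbian storage: w_ij = (1/n) * sum_mu eps_i^mu * eps_j^mu."""
    n, d = patterns.shape
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0)                        # no self-connections
    return W

def recall(W, s, steps=100, theta=0.0):
    """Asynchronous updates until the state stops changing."""
    s = s.copy()
    for _ in range(steps):
        prev = s.copy()
        for i in np.random.permutation(len(s)):   # one unit at a time
            s[i] = 1 if W[i] @ s >= theta else -1
        if np.array_equal(s, prev):               # fixed point reached
            break
    return s

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])
W = store(patterns)
noisy = np.array([1, -1, 1, -1, 1, 1])            # corrupted first pattern
print(recall(W, noisy))                           # should recover pattern 0
```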
Spike-timing-dependent plasticity (STDP) is a biological process that adjusts the strength of connections between neurons in the brain. The process adjusts the connection strengths based on the relative timing of a particular neuron's output and input action potentials (or spikes).
Experiments that stimulated two connected neurons with varying interstimulus asynchrony confirmed the importance of temporal precedence implicit in Hebb's principle: the presynaptic neuron has to fire just before the postsynaptic neuron for the synapse to be potentiated. In addition, it has become evident that the presynaptic neural firing needs to consistently predict the postsynaptic firing for synaptic plasticity to occur robustly.
The free energy principle, introduced by the neuroscientist Karl Friston, gives a formal description of how systems minimize a free energy function of their internal states in an attempt to generate beliefs about hidden states in their environment.
In terms of the brain, we are given sensory information in current and past inputs while neurons are continuously performing inference by updating into configurations that better explain the observed sensory data. This theory of brain function, in which the brain is constantly generating and updating a mental model of the world, is known as predictive coding.
We can represent these configurations of internal neurons as continuous-valued latent variables that correspond to voltage potentials averaged across time, spikes, and possibly neurons in the same cortical minicolumn. Neural computation then corresponds to approximate inference and error backpropagation at the same time. It has been shown that the predictive coding model can approximate backpropagation along arbitrary computation graphs, making it biologically plausible, though it remains computationally infeasible on current hardware. An alternative neuromorphic hardware or wetware architecture might someday run massively parallelized computations using only local rules and outperform backpropagation-through-time methods.
When reformulated using a generalized notation, active inference in the brain can be formally described by the tuple $(\Omega, \Psi, S, A, R, p, q)$:
- A sample space $\Omega$, from which random perturbations $\omega \in \Omega$ are drawn
- Hidden or external states $\Psi : \Psi \times A \times \Omega \to \mathbb{R}$, which cause sensory states and depend on action
- Sensory states $S : \Psi \times A \times \Omega \to \mathbb{R}$, a probabilistic mapping from action and hidden states
- Action $A : S \times R \to \mathbb{R}$, which depends on sensory and internal states
- Internal states $R : R \times S \to \mathbb{R}$, which cause action and depend on sensory states
- Generative density $p(s, \psi \mid m)$, over sensory and hidden states under a generative model $m$
- Variational density $q(\psi \mid \mu)$, over hidden states $\psi \in \Psi$, parameterised by internal states $\mu \in R$
This closely resembles energy-based learning models, where predicted outputs are compared against observed sensory states and the internal states are adjusted to minimize a free-energy functional, much as latent variables are adjusted to minimize an energy function.
The notion that self-organising biological systems are continuously updating themselves in order to minimize variational free energy derives from the work of the German physicist Helmholtz. In particular, acknowledgments are made to Helmholtz's conception of unconscious inference, which encapsulates the view of the human perceptual system as a statistical inference engine whose core function is to infer the probable causes of sensory input.
Helmholtz machines are artificial neural networks that, through many cycles of sensing and generative dreaming, gradually learn how to converge their dreams onto reality. In this process they create a succinct internal model of a fluctuating world, making them highly suitable for unsupervised learning tasks. The Helmholtz machine contains two networks: a bottom-up recognition network that takes an external observation as input and produces a distribution over hidden variables, and a top-down generative network that generates values of the hidden variables and the data itself.
The Wake-Sleep algorithm, used for training, consists of two phases:

- In the wake phase, neurons are driven by recognition connections, and generative connections are adapted to increase the probability that they would reconstruct the correct activity vector in the layer below.
- In the sleep phase, neurons are driven by generative connections, and recognition connections are adapted to increase the probability that they would produce the correct activity vector in the layer above.
The Helmholtz Machine served as an important precursor to autoencoder networks (AEs), which are also discriminative models used for making predictions and are similarly composed of a sequence of two networks: one for encoding inputs (analogous to the recognition network) and another for decoding outputs (analogous to the generative network). The encoder network learns a representation of the data by mapping it to a compressed or latent space, and the decoder is trained to generate an output that minimizes the reconstruction loss.
On its own, this model would be prone to copying all the input features into the hidden layer and passing it directly as the output, essentially behaving like an identity function and not learning anything. This is avoided by applying an information bottleneck by feeding the coded input into hidden layers of lower dimension known as under-complete layers. An under-complete layer cannot behave as an identity function simply because the hidden layer doesn't have enough dimensions to copy the original input.
An affine transformation preserves lines and parallelism, i.e. it's a linear function with a translation constant. A very simple form of an AE does the following:
- In the encoder stage, the autoencoder takes in an input $x \in \mathbb{R}^n$ and applies an affine transformation, $W_h \in \mathbb{R}^{d \times n}$, resulting in an intermediate hidden layer $h \in \mathbb{R}^d$: $h = f(W_h x + b_h)$, where $f$ is an element-wise activation function and $h$ is called the code.
- In the decoder stage, another affine transformation, defined by $W_x \in \mathbb{R}^{n \times d}$, produces the output $\hat{x} \in \mathbb{R}^n$, which is the model's prediction/reconstruction of the input: $\hat{x} = g(W_x h + b_x)$, where $g$ is an activation function.
Instead of using the wake-sleep algorithm as was done in Helmholtz
machines, autoencoders are trained using backpropagation.
Cross-entropy can be used as a loss function when inputs are binary-valued, while mean squared error is commonly used for real-valued inputs.
Interpreted as an energy-based model, we train the system to produce an energy function that grows quadratically as the corrupted data moves away from the data manifold. For a denoising autoencoder, training can be done using contrastive divergence (CD) to handle the uncountable ways to corrupt a piece of data in high-dimensional space.
In a continuous space, CD first picks a training sample $y$ and lowers its energy; it then corrupts the sample with noise, moving it off the data manifold, and pushes up the energy of the corrupted point $\tilde{y}$.
Model
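A minimal PyTorch sketch of the under-complete autoencoder described above; the layer sizes, activations, and use of a single hidden layer are illustrative choices, not the only valid ones:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n=784, d=30):
        super().__init__()
        # Encoder: affine map W_h x + b_h followed by element-wise f.
        self.encoder = nn.Sequential(nn.Linear(n, d), nn.Tanh())
        # Decoder: affine map W_x h + b_x followed by element-wise g.
        self.decoder = nn.Sequential(nn.Linear(d, n), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)          # the code, dimension d < n
        return self.decoder(h)       # reconstruction x_hat
```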
Training
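A corresponding training sketch, continuing from the `Autoencoder` class above and using a random placeholder dataset in place of real inputs:

```python
import torch

model = Autoencoder()                      # defined under "Model" above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()             # or BCELoss for binary inputs

data = torch.rand(256, 784)                # placeholder dataset in [0, 1]
for epoch in range(10):
    x_hat = model(data)
    loss = criterion(x_hat, data)          # reconstruction loss
    opt.zero_grad()
    loss.backward()                        # trained by backpropagation
    opt.step()
```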
The development of sophisticated learning procedures has been built on a few key assumptions from early models of the brain, and these have turned out to be highly effective. The assumptions are as follows:
- The brain is a hierarchical or sequential composition of layered networks that specialize in certain functionality but are closely integrated with one another.
- There likely exists a network that processes complex sensory or semi-structured input into encoded or latent data before communicating it to layers above it, i.e. in a bottom-up manner. This latent representation of observations of reality can be thought of as a symbolic and denoised representation of the world.
- There exists a network that learns from latent representations and attempts to generate useful predictions of the world in a top-down manner. This can be hypothesized as being closely related to processes that occur in our mind during imagination and when dreaming.
- Although incomprehensible to us, structured data and knowledge appear to exist on a manifold in high-dimensional space. Supervised learning is a form of probing and exploring this hidden structure.
Variational Autoencoders (VAEs) are a type of generative model that aims to simulate the data-generating process. VAEs have conceptual similarities to autoencoders but very different formulations. The main difference is that VAEs are designed to create a better latent space for capturing underlying causal relations, which in turn enables them to perform better in the generative process.
- In the encoder stage, the input $x$ is passed to the encoder. Instead of generating a hidden representation $h$ (the code), as was done in an AE, the code in a VAE comprises two things: the mean, $\mathbb{E}(z)$, and the variance, $\mathbb{V}(z)$, where $z$ is the latent random variable following a Gaussian distribution. (Other distributions can be used, but a Gaussian is most common in practice. A reparameterization trick is used when sampling $z$; this is outlined in more detail below.) The encoder is a function from $\mathcal{X}$ to $\mathbb{R}^{2d}: x \mapsto h$, where $h$ represents the concatenation of $\mathbb{E}(z)$ and $\mathbb{V}(z)$.
- In the sampler stage, $z$ is sampled from the above distribution parametrized by the encoder. Specifically, $\mathbb{E}(z)$ and $\mathbb{V}(z)$ are passed into a sampler to generate the latent variable $z$.
- In the decoder stage, $z$ is passed into the decoder to generate $\hat{x}$. The decoder is a function from $\mathcal{Z}$ to $\mathbb{R}^{n}: z \mapsto \hat{x}$.
To obtain $z$, we use the reparameterization trick: $z = \mathbb{E}(z) + \epsilon \odot \sqrt{\mathbb{V}(z)}$, where $\epsilon \sim \mathcal{N}(0, I_d)$. This keeps sampling differentiable so the encoder can be trained with backpropagation.
To train the VAE, we want to minimize a loss function composed of a reconstruction term as well as a regularization term. Regularization is done by using a penalty term, $\ell_{KL}(z, \mathcal{N}(0, I_d))$, on the latent code. We can write the loss as

$$\ell(x, \hat{x}) = \ell_{\text{reconstruction}} + \beta\, \ell_{KL}(z, \mathcal{N}(0, I_d))$$

where $\beta$ is a hyperparameter balancing reconstruction against regularization.
In general, to go from the latent space to input space during the
generative process, we will need to either learn the underlying
distribution of the latent code or enforce some structure on the space.
In a VAE the regularization term is used to enforce a specific Gaussian structure on the latent space. The penalty term is the relative entropy (KL divergence), a measure of the difference between two distributions: here, the Gaussian distribution of the latent variables produced by the encoder and the standard normal distribution.
Interpreted as an energy-based model, Variational Autoencoders have an architecture similar to a Regularized Latent Variable EBM, where the flexibility of the latent variable is constrained in order to prevent the energy function from being 0 everywhere. This would otherwise happen because every true output could be reconstructed exactly through an unconstrained latent variable. The latent variable $z$ is therefore regularized by limiting its information capacity, in this case by adding Gaussian noise to the code and penalizing its divergence from a standard normal. The effects of regularization with Gaussian noise can be visualized as surrounding the code vectors with fuzzy spheres: the KL penalty pulls the spheres toward the origin and keeps them from shrinking to points, forcing codes for different inputs to remain distinct while covering the latent space.
Model
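A minimal PyTorch sketch of the VAE described above; for numerical stability the encoder outputs the log-variance rather than $\mathbb{V}(z)$ directly, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n=784, d=20):
        super().__init__()
        self.encoder = nn.Linear(n, 2 * d)   # outputs [E(z), log V(z)]
        self.decoder = nn.Sequential(nn.Linear(d, n), nn.Sigmoid())
        self.d = d

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)               # reparameterization trick:
        z = mu + eps * torch.exp(0.5 * logvar)   # z = E(z) + eps * sqrt(V(z))
        return self.decoder(z), mu, logvar
```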
Loss Function
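Continuing the sketch, the loss combines the reconstruction term with the closed-form KL divergence between $\mathcal{N}(\mathbb{E}(z), \mathbb{V}(z))$ and the standard normal:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL( N(mu, V) || N(0, I) ) in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```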
Training and Testing
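Training follows the usual backpropagation loop; at test time, generation samples $z$ from the standard normal prior and decodes it (again with placeholder data standing in for a real dataset):

```python
import torch

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.rand(256, 784)                 # placeholder dataset in [0, 1]
for epoch in range(10):
    x_hat, mu, logvar = model(data)
    loss = vae_loss(data, x_hat, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():                       # generation: sample z ~ N(0, I)
    samples = model.decoder(torch.randn(16, model.d))
```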
The basic idea of a generative adversarial network (GAN) is to simultaneously train a discriminator and a generator. The discriminator is trained to distinguish real samples of a dataset from fake samples produced by the generator. We can think of a GAN as a form of energy-based model using contrastive methods. In this light, it's known as an energy-based generative adversarial network (EBGAN). It pushes up the energy of contrastive samples and pushes down the energy of training samples. The generator produces contrastive samples intelligently while the discriminator, which is essentially a cost function, acts as an energy model. Both the generator and the discriminator are neural nets.
The two kinds of input to GANs are its training samples and the contrastive samples produced by the generator. For training samples, the GAN passes these samples through the discriminator and makes their energy go down. For contrastive samples, the GAN samples latent variables from some distribution, runs them through the generator to produce something similar to training samples, and passes them through the discriminator to make their energy go up. The loss function for the discriminator accordingly combines two terms: one that lowers the energy of training samples and one that raises the energy of the generated contrastive samples. The loss function used for the generator is the opposite: it adjusts the generator to produce contrastive samples to which the discriminator assigns low energy.
Viewing the discriminator as an energy function allows us to use a wider variety of architectures and loss functions in addition to the usual binary classifier with logistic output,

$$\mathcal{L} = \mathbb{E}_{x}[\log(D(\boldsymbol{x}))] + \mathbb{E}_{\hat{x}}[\log(1 - D(\boldsymbol{\hat{x}}))]$$
To generate samples from EBMs, we use an iterative refinement process based on Langevin dynamics. Informally, this involves performing noisy gradient descent on the energy function to arrive at low-energy configurations.
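A minimal sketch of this process, using a hand-written quadratic energy as a stand-in for a learned energy network; each step descends the energy gradient with injected Gaussian noise:

```python
import torch

def energy(x):
    # Stand-in for a learned energy network; minimum at x = (1, 1).
    return ((x - 1.0) ** 2).sum(dim=-1)

def langevin_sample(n=16, dim=2, steps=500, lam=0.01):
    x = torch.randn(n, dim)                       # start from noise
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x.detach() - 0.5 * lam * grad         # descend the energy...
        x = x + lam ** 0.5 * torch.randn_like(x)  # ...with injected noise
    return x

print(langevin_sample().mean(dim=0))   # samples concentrate near (1, 1)
```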
From this perspective, a GAN is very similar to a variational autoencoder but with a different training methodology and mental model, and it usually outperforms VAEs. Unfortunately, there are still many challenges with using GANs and EBGANs:
- Unstable convergence -- As a result of the adversarial nature of the interplay between the generator and discriminator, there is an unstable equilibrium point rather than a stable equilibrium.
- Vanishing gradient -- As the discriminator becomes more confident, the outputs of the cost network move into flatter regions where the gradients become more saturated. These flat regions provide small, vanishing gradients that hinder the generator network's training. Thus, when training a GAN, you want to make sure that the cost gradually increases as the discriminator becomes more confident.
- Mode collapse -- If a generator maps all $\vec{z}$'s from the sampler to a single $\vec{\hat{x}}$ that can fool the discriminator, then the generator will produce only that $\vec{\hat{x}}$. Eventually, the discriminator will learn to detect specifically this fake input. As a result, the generator simply finds the next most plausible $\vec{\hat{x}}$ and the cycle continues. Consequently, the discriminator gets trapped in local minima while cycling through fake $\vec{\hat{x}}$'s. A possible solution is to enforce a penalty on the generator for always giving the same output given different inputs.
- Energy smoothness -- In an EBGAN, problems arise when we have samples that are close to the true manifold. If we train the system so that the discriminator produces $0$ outside the manifold and an infinite probability on the manifold, the energy function becomes useless: we don't want the energy value to go from $0$ to infinity within a very small step. The Wasserstein GAN improves on this by limiting the size of the discriminator weights.
Below is a standard implementation of a Deep Convolutional Generative Adversarial Network (DCGAN).
Generator Model
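A compact sketch of the standard DCGAN generator, which upsamples a latent vector to a 64x64 image with transposed convolutions; the hyperparameters follow common defaults and are illustrative:

```python
import torch.nn as nn

nz, ngf, nc = 100, 64, 3   # latent dim, feature maps, image channels

def up_block(cin, cout, stride=2, pad=1):
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, 4, stride, pad, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(True))

generator = nn.Sequential(
    up_block(nz, ngf * 8, stride=1, pad=0),   # 1x1 -> 4x4
    up_block(ngf * 8, ngf * 4),               # 4x4 -> 8x8
    up_block(ngf * 4, ngf * 2),               # 8x8 -> 16x16
    up_block(ngf * 2, ngf),                   # 16x16 -> 32x32
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
    nn.Tanh())                                # 32x32 -> 64x64, outputs in [-1, 1]
```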
Discriminator Model
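The matching discriminator mirrors the generator with strided convolutions, mapping an image down to a single probability (again, the sizes follow common defaults):

```python
import torch.nn as nn

ndf, nc = 64, 3   # discriminator feature maps; image channels, as above

discriminator = nn.Sequential(
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),           # 64x64 -> 32x32
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),      # 32x32 -> 16x16
    nn.BatchNorm2d(ndf * 2),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),  # 16x16 -> 8x8
    nn.BatchNorm2d(ndf * 4),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),  # 8x8 -> 4x4
    nn.BatchNorm2d(ndf * 8),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),        # 4x4 -> 1x1 score
    nn.Sigmoid())
```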
Loss Function and Setup
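The usual binary cross-entropy loss and Adam optimizers, continuing from the networks above (the learning rate and betas are common DCGAN defaults):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()            # the binary classifier loss shown earlier
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
real_label, fake_label = 1.0, 0.0
```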
Training the Discriminator and the Generator
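A sketch of one training epoch, assuming a `dataloader` that yields batches of 3x64x64 images scaled to $[-1, 1]$:

```python
import torch

for real in dataloader:                 # assumed DataLoader of real images
    b = real.size(0)
    # 1) Update discriminator: maximize log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    loss_real = criterion(discriminator(real).view(-1),
                          torch.full((b,), real_label))
    z = torch.randn(b, nz, 1, 1)
    fake = generator(z)
    loss_fake = criterion(discriminator(fake.detach()).view(-1),
                          torch.full((b,), fake_label))
    (loss_real + loss_fake).backward()
    opt_d.step()
    # 2) Update generator: maximize log D(G(z)) (non-saturating loss).
    opt_g.zero_grad()
    loss_g = criterion(discriminator(fake).view(-1),
                       torch.full((b,), real_label))
    loss_g.backward()
    opt_g.step()
```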
The energy of binary modern Hopfield networks can be generalized to allow continuous states while keeping convergence and storage capacity properties. In addition to a new energy function, a new update rule must be used that minimizes the energy.
First, we define the convex log-sum-exp (lse) function as

$$\mathrm{lse}(\beta, z) = \beta^{-1} \log \sum_{i=1}^{N} \exp(\beta z_i)$$
Suppose there are $N$ stored continuous patterns $x_1, \ldots, x_N$, collected as the columns of a matrix $X = (x_1, \ldots, x_N)$.
Then, for the new energy of the generalized continuous Hopfield network, we take the logarithm of the negative energy of modern Hopfield networks and add a quadratic term of the current state:

$$E = -\mathrm{lse}(\beta, X^T \xi) + \frac{1}{2} \xi^T \xi + \beta^{-1} \log N + \frac{1}{2} M^2$$

where $\xi$ is the current state and $M$ is the largest norm among the stored patterns. The quadratic term ensures that the norm of the state vector $\xi$ remains finite, and the update rule that provably decreases this energy is

$$\xi^{\text{new}} = X \, \mathrm{softmax}(\beta X^T \xi)$$

as sketched below.
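A minimal numpy sketch of this retrieval dynamics with randomly generated stored patterns; a noisy query typically snaps back to its stored pattern within a few updates:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def update(X, xi, beta=8.0):
    """One continuous Hopfield update: xi_new = X softmax(beta X^T xi)."""
    return X @ softmax(beta * X.T @ xi)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))               # 10 stored patterns as columns
xi = X[:, 0] + 0.3 * rng.standard_normal(64)    # noisy query of pattern 0
for _ in range(3):
    xi = update(X, xi)
print(np.argmax(X.T @ xi))                      # retrieved pattern index: 0
```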
Yann LeCun, DS-GA 1008, DEEP LEARNING, NYU CENTER FOR DATA SCIENCE.
Geoffrey E. Hinton, CSC2535, Advanced Machine Learning, University of Toronto.
Friston, K.J., Stephan, K.E. Free-energy and the brain. Synthese 159, 417--458 (2007). https://doi.org/10.1007/s11229-007-9237-y
GE Hinton, P Dayan, BJ Frey, RM Neal, The "wake-sleep" algorithm for unsupervised neural networks, Science 26 May 1995.
Hinton G, Osindero S, Welling M, Teh YW. Unsupervised discovery of nonlinear structure using contrastive backpropagation.
Cogn Sci. 2006 Jul 8;30(4):725-31. doi: 10.1207/s15516709cog0000_76. PMID: 21702832.
Yoshua Bengio, Benjamin Scellier, Olexa Bilaniuk, João Sacramento, Walter Senn: Feedforward Initialization for Fast Inference of Deep Generative Networks is biologically plausible. CoRR abs/1606.01651 (2016)
Y. Du and I. Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
Igor Mordatch. Concept learning with energy-based models. CoRR, abs/1811.02486, 2018.
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Pavlovic, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., and Hochreiter, S. Hopfield networks is all you need. ArXiv, 2020.