Perceptrons and Neural Networks
📓 Notebook used during the workshop:
Biological motivation: Whether a neuron fires depends on the firing of the neurons connected to it, and on the strength of those connections. To implement this in a simple model, we can create a neuron that implements a threshold function: it fires if the weighted sum of its inputs exceeds some threshold.
Perceptrons are linear classifiers capable of performing binary classification tasks, where the output is 0 or 1. Any other binary classification task can be mapped onto this encoding: true/false, positive/negative, etc. In this example we use positive/1 and negative/0 interchangeably.
We have a neuron which takes as input the feature vector $\mathbf{x} = (x_1, \dots, x_n)$ and computes a weighted sum of its components plus a bias term:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

Equivalently, using vector notation:

$$z = \mathbf{w}^\top \mathbf{x} + b$$

Once we have calculated $z$, we predict the class label by applying the Heaviside step function:

$$\hat{y} = H(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
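In code, the forward pass is just a dot product, a bias, and a threshold. A minimal sketch (the function name is my own, not from the workshop notebook):

```python
import numpy as np

def perceptron_predict(w, b, x):
    """Heaviside threshold on the weighted sum z = w.x + b."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0
```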
For a moment, let's consider the geometric interpretation of the perceptron. Together, the weights and bias form a hyperplane in n-dimensional space. We will look at a simple example in two dimensions, where the hyperplane, or decision boundary, is just a line and can be written with the implicit equation

$$w_1 x_1 + w_2 x_2 + b = 0$$

The weights define the normal vector to the plane: $\mathbf{w} = (w_1, w_2)$.

The bias can be computed by picking a point $\mathbf{x}_0$ on the decision boundary and solving $\mathbf{w}^\top \mathbf{x}_0 + b = 0$, which gives $b = -\mathbf{w}^\top \mathbf{x}_0$.

Let's work out the math when we query a point. As an illustrative example, take $\mathbf{w} = (1, 1)$, $b = -1$, and the query point $\mathbf{x} = (2, 1)$:

$$z = \mathbf{w}^\top \mathbf{x} + b = (1)(2) + (1)(1) - 1 = 2$$

And the Heaviside function predicts that this point will have a positive class label: $H(2) = 1$.
Rosenblatt, 1958
Notice that in the classification example, all of the positive class training points fall on one side of the hyperplane, and all of the negative class training points fall on the other. This means the dataset is linearly separable. That is, there exists some hyperplane $(\mathbf{w}, b)$ that places every positive example on one side and every negative example on the other. On linearly separable data, the perceptron learning algorithm is guaranteed to converge.
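The training loop can be sketched as follows, using the classic Rosenblatt update rule: on each misclassified example, nudge the weights by $\eta(y - \hat{y})\,\mathbf{x}$ and the bias by $\eta(y - \hat{y})$, and stop once a full pass makes no mistakes. The function name and the `max_epochs` safety guard are my own additions:

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=1000):
    """Classic perceptron learning rule (Rosenblatt, 1958).

    X: (m, n) array of feature vectors; y: (m,) array of 0/1 labels.
    Converges only if the data are linearly separable, hence max_epochs.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
            if y_hat != yi:
                # Nudge the hyperplane toward the misclassified point.
                w += lr * (yi - y_hat) * xi
                b += lr * (yi - y_hat)
                errors += 1
        if errors == 0:  # a full pass with no mistakes: exit the loop
            break
    return w, b
```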
One common and simple example where a perceptron fails is the XOR function. There is clearly no single line that can perfectly separate the positive class from the negative class, and therefore the perceptron will never exit the training loop.
A single perceptron can implement a NAND gate, and NAND gates can be combined to form any other logic gate. Thus, by cleverly hooking up multiple perceptrons together, we can build more interesting decision boundaries, but they are still collections of hyperplanes in space.
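For instance, the weights below give a single perceptron that computes NAND (these particular values are one of many valid choices), and wiring four such gates together yields XOR, the very function a lone perceptron cannot learn:

```python
def nand(a, b):
    """A single perceptron computing NAND: w = (-2, -2), bias = 3."""
    return 1 if (-2 * a) + (-2 * b) + 3 >= 0 else 0

def xor(a, b):
    """XOR built from NAND gates, i.e., a tiny multi-layer network."""
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # prints 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```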
Another problem is that it is not clear what the perceptron is learning from its weights: the output is not a smooth function of them. Small perturbations in the weights either have no effect at all on the output, or completely flip the output from one class to another.
Neural networks extend the idea of perceptrons to use nonlinear activation functions and multiple layers. This allows us to perform classification, regression, dimensionality reduction, function approximation, and many more learning tasks. It's known that multilayer feedforward networks are universal function approximators.
Below is an example of a fully connected neural network with two hidden layers. It is called fully connected because each neuron in the hidden layer receives as input all of the outputs of the previous layer.
The input layer is the n-dimensional feature vector $\mathbf{x}$; in this example, n = 5.
In this example the network has 44 learnable parameters: the first hidden layer has 15 weights and 3 biases, the second hidden layer has 15 weights and 5 biases, and the output layer has 5 weights and 1 bias. Modern neural networks can have billions or even trillions of parameters to learn.
As stated, this network is designed for a regression task. For a classification task with three classes, the output would instead be a three-dimensional vector, with one output neuron per class.
As the number of neurons in a hidden layer increases, the network becomes 'wider', and as the number of hidden layers increases, it becomes 'deeper'.
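To make the architecture concrete, here is a sketch of that 5-3-5-1 network's forward pass in NumPy. The layer sizes are taken from the parameter count above; the sigmoid activation here is just a placeholder choice (activation functions are discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes implied by the 44-parameter count: 5 -> 3 -> 5 -> 1.
W1, b1 = rng.normal(size=(3, 5)), np.zeros(3)   # 15 weights + 3 biases
W2, b2 = rng.normal(size=(5, 3)), np.zeros(5)   # 15 weights + 5 biases
W3, b3 = rng.normal(size=(1, 5)), np.zeros(1)   # 5 weights + 1 bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Fully connected forward pass: each layer sees the whole previous layer."""
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    return W3 @ h2 + b3          # linear output for regression

n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(n_params)                   # 44
print(forward(rng.normal(size=5)))
```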
Instead of applying the Heaviside function to each neuron, neural networks use a variety of different activation functions. Because these are nonlinear functions, every hidden layer in the network performs a nonlinear transformation of the data, which allows the neural network to learn very complicated functions/decision boundaries.
Logistic/Sigmoid
The logistic function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

is a squashing function: its domain is the set of all real numbers, and its output is squashed into the interval [0, 1]. We can interpret this number as a probability. For classification tasks, we could use sigmoids on the last layer to obtain these pseudo-probabilities and then one-hot encode them to obtain the class prediction. Because the sigmoid has a very small gradient when x is very negative or very positive, it can lead to slow learning.
Rectified Linear Unit (ReLU)
ReLU, defined as $\text{ReLU}(x) = \max(0, x)$, acts like a threshold due to its piecewise nature: negative inputs are set to 0, and positive inputs pass through unchanged. Because it is linear on the positive half of its domain, the gradient there is the same constant value for all positive inputs, so it doesn't suffer from the slow learning exhibited by sigmoid or other saturating activation functions like tanh.
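A quick sketch of both activations and their derivatives, which we will need later for backpropagation (the vectorized NumPy forms are standard, but the function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # nearly 0 when x is very negative/positive

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # constant 1 for all positive inputs
```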
To start, randomly assign the weights and biases of the entire network. Then we need to learn the weights and biases that perform well on our task.
A loss function for an input expresses the error of our prediction in terms of the weights and biases of the network. One example of a loss function is L2 loss, the squared difference between the true label $y$ and the network's prediction $\hat{y}$:

$$L = (y - \hat{y})^2$$
Even with our tiny neural network example, this amounts to an optimization problem in 44 dimensions. Even for small networks meant for classification of handwritten digits, we reach network sizes on the order of 10,000 parameters. The takeaway: to train neural networks, we need to solve an optimization problem in extremely high dimensional space!
Credit: Jason Pacheco, CS480/580
In these high-dimensional, non-convex spaces, calculating the true global minimum is an intractable problem. Instead, we will use the gradient of the loss function to take steps towards a local minimum, updating each parameter $\theta$ as

$$\theta \leftarrow \theta - \eta \nabla_\theta L$$

where $\eta$ is the step size. The intuition behind this is that whenever we observe a new example and update the parameters, we want to tune the weights and biases to move in the direction of greatest improvement. If a neuron increases the probability of an incorrect prediction, turn down the weight of that edge; if a neuron increases the probability of a correct prediction, turn up the weight of that edge.
Note the functions in red, which show a closed form of the network activations at each layer. To calculate the gradient at every layer, we can compute the derivative using the chain rule. Backpropagation is simply repeatedly applying the chain rule to the entire network, working backwards from the output to the input. This requires access to the derivative of the nonlinear activation functions.
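To make the repeated chain rule concrete, consider a one-hidden-layer network with scalar weights (a simplification to keep the shapes trivial): $z_1 = w_1 x + b_1$, $h = \sigma(z_1)$, $\hat{y} = w_2 h + b_2$, and $L = (y - \hat{y})^2$. Then

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = 2(\hat{y} - y) \cdot w_2 \cdot \sigma'(z_1) \cdot x$$

Each factor is local to one layer, which is why working backwards from the output lets us reuse the upstream product at every step.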
There are a few different ways to do this. In batch gradient descent, we compute the loss for the entire training dataset for each step of the gradient descent. This is computationally expensive if the dataset is very large. On the other hand, stochastic gradient descent performs a step of the gradient descent for every individual sample, the order of which is randomized before training begins. Mini-batch gradient descent strikes a balance between the two by splitting the training dataset into a set of smaller batches, smoothing out some of the variance induced by the stochastic nature of the process while keeping the computational burden much smaller than batch gradient descent.
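A minimal sketch of the mini-batch variant, assuming parameters and gradients are stored as dicts of NumPy arrays. The helper names, batch size, and learning rate are illustrative, and `compute_gradients` stands in for the backpropagation step described above:

```python
import numpy as np

def minibatch_sgd(params, X, y, compute_gradients, lr=0.01,
                  batch_size=32, epochs=10):
    """Mini-batch gradient descent over a dataset (X, y).

    compute_gradients(params, X_batch, y_batch) is assumed to return
    gradients shaped like params (e.g., computed via backpropagation).
    """
    m = len(X)
    for _ in range(epochs):
        order = np.random.permutation(m)       # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            grads = compute_gradients(params, X[idx], y[idx])
            # Step each parameter against its gradient.
            for k in params:
                params[k] -= lr * grads[k]
    return params
```

Setting batch_size equal to the dataset size recovers batch gradient descent, and batch_size = 1 recovers stochastic gradient descent.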
There are other optimization schemes as well, such as AdaGrad and Adam. I found this article to be a helpful overview.
So far we have confined our discussion of training to tuning the network's own parameters. There are other knobs that affect training speed, network accuracy, etc., but aren't inside the neural network; these are known as hyperparameters. One example is the learning rate, which dictates how large of a step to take along the direction of the gradient during the gradient descent algorithm. If the learning rate is too small the network learns slowly, and if it is too large the network can fail to converge due to overshooting local minima in the optimization space.
UArizona DataLab, Data Science Institute, 2024