Perceptrons and Neural Networks

Neurons

Biological motivation: whether or not a neuron in your brain fires depends on whether the connected neurons are firing, and on the strength of the connections to those neighboring neurons. To implement this in a simple model, we can create a neuron that implements a threshold function, firing only if the weighted sum of its inputs exceeds the threshold.

Perceptrons

Perceptrons are linear classifiers capable of performing binary classification tasks, where the output is 0 or 1. Any other pair of binary labels can be mapped onto this one: true/false, positive/negative, etc. In this example we use positive/1 and negative/0 interchangeably.

Perceptron model

We have a neuron which takes as input the feature vector $\vec{x}$, an n-dimensional vector, and outputs a class prediction $y$. Each feature is weighted by the corresponding element of the weight vector $\vec{w}$, which has the same dimension. A bias $b$ is added to obtain the input to the threshold function, $a$. We can see that $a$ is simply a linear combination of the features with a bias applied.

$$a = \sum_i w_ix_i + b$$

Equivalently, using vector notation:

$$a = \vec{w}^T\vec{x} + b$$

Once we have calculated $a$, we use the Heaviside function $H$ as a threshold to calculate the class, either 0 or 1. Only positive values of $a$ produce a positive class prediction.

For a moment, let's consider the geometric interpretation of the perceptron. Together, the weights and bias form a hyperplane in n-dimensional space. We will look at a simple example in two dimensions, where the hyperplane, or decision boundary, is just a line and can be written with the implicit equation $\vec{w}^T\vec{x} + b = 0$. It separates the positive class predictions from the negative class predictions.

The weights define the normal vector to the plane: $\vec{w}^T = \begin{bmatrix} 1 & 1 \end{bmatrix}$

The bias can be computed by picking a point on the decision boundary and solving: $b = -\vec{w}^T\vec{x} = -\begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 4 \\ 0 \end{bmatrix} = -4$

Let's work out the math when we query the point $\vec{x} = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$: $a = \vec{w}^T\vec{x} + b = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 4 \\ 2 \end{bmatrix} - 4 = 2$

And the Heaviside function predicts that this point will have a positive class label: $H(2) = 1$
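The worked example is easy to check in code. Below is a minimal sketch in Python with NumPy; the weights, bias, and query point are taken from the example above, and the function names are our own:

```python
import numpy as np

def heaviside(a):
    """Heaviside step: 1 for positive input, 0 otherwise."""
    return 1 if a > 0 else 0

def perceptron_predict(w, b, x):
    """Compute a = w^T x + b, then threshold it to a class label."""
    a = np.dot(w, x) + b
    return heaviside(a)

w = np.array([1.0, 1.0])  # normal vector to the decision boundary
b = -4.0                  # so the boundary is x1 + x2 - 4 = 0
x = np.array([4.0, 2.0])  # query point from the example

print(perceptron_predict(w, b, x))  # a = 2, H(2) = 1: positive class
```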

Perceptron algorithm

The perceptron learning algorithm (Rosenblatt, 1958) starts from arbitrary weights and repeatedly sweeps over the training set. Whenever a point is misclassified, the weights and bias are nudged toward classifying it correctly:

$$\vec{w} \leftarrow \vec{w} + \eta(y - \hat{y})\vec{x}, \qquad b \leftarrow b + \eta(y - \hat{y})$$

where $\hat{y} = H(\vec{w}^T\vec{x} + b)$ is the current prediction and $\eta$ is a learning rate. The training loop exits once every training point is classified correctly.
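A minimal sketch of this loop in Python with NumPy (the learning rate, epoch cap, and the AND-gate demo data are illustrative choices, not from the original page):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Rosenblatt's rule: nudge w and b whenever a point is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b > 0 else 0
            if y_hat != yi:               # misclassified: update toward the true label
                w += lr * (yi - y_hat) * xi
                b += lr * (yi - y_hat)
                errors += 1
        if errors == 0:                   # every point classified correctly: converged
            break
    return w, b

# AND is linearly separable, so this converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))
```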

Linear separability and the XOR problem

Notice that in the classification example, all of the positive class training points fall on one side of the hyperplane, and all of the negative class training points fall on the other. This means the dataset is linearly separable. That is, there exist some $\vec{w}$ and $b$ such that for all $(\vec{x}, y)$ pairs, $H(\vec{w}^T\vec{x} + b) = y$. In fact, the perceptron algorithm only converges if the data is linearly separable.

One common and simple example where a perceptron fails is the XOR function, which outputs 1 exactly when one of its two binary inputs is 1 and the other is 0. There is clearly no single line that can perfectly separate the positive class from the negative class, and therefore the perceptron will never exit the training loop.

However, a single perceptron can implement a NAND gate, and NAND gates can be combined to form any other logic gate. Thus, by cleverly hooking up multiple perceptrons together, we can build more interesting decision boundaries, though they are still collections of hyperplanes in space. A sketch of this construction appears below.
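Here XOR is assembled from four NAND perceptrons; each gate is a single perceptron with fixed weights $(-1, -1)$ and bias $1.5$ (one standard choice of weights, not from the original page):

```python
def nand(x1, x2):
    """A single perceptron computing NAND: H(-x1 - x2 + 1.5)."""
    return 1 if -x1 - x2 + 1.5 > 0 else 0

def xor(x1, x2):
    """XOR built from four NAND perceptrons."""
    m = nand(x1, x2)
    return nand(nand(x1, m), nand(x2, m))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```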

Another problem is that it is not clear how to learn the weights gradually: because of the hard threshold, small perturbations in the weights either have no effect at all on the output, or completely flip it from one class to the other.

Neural Networks

Neural networks extend the idea of perceptrons to use nonlinear activation functions and multiple layers. This allows us to perform classification, regression, dimensionality reduction, function approximation, and many more learning tasks. It's known that multilayer feedforward networks are universal function approximators.

Below is an example of a fully connected neural network with two hidden layers. It is called fully connected because each neuron in the hidden layer receives as input all of the outputs of the previous layer.

The input layer is the n-dimensional $\vec{x}$. It is fed through two hidden layers (layers 2 and 3 of the network). The weights $w_{ij}^{(k)}$ follow the convention where $i$ is the neuron on the previous layer the weight connects from, $j$ is the neuron on the next layer it connects to, and the superscript $(k)$ denotes the layer the weight belongs to. The biases follow a similar convention: $b_j^{(k)}$. The final layer is the output layer, where the activations are collapsed into a single scalar, so this network performs a regression task.

In this example the network has 44 learnable parameters: the first hidden layer has 15 weights and 3 biases, the second hidden layer has 15 weights and 5 biases, and the output layer has 5 weights and 1 bias. Modern neural networks can have billions or even trillions of parameters to learn.
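A sketch of this architecture's forward pass in NumPy follows; the sigmoid hidden activation and random initialization are assumptions for illustration, and the layer sizes (5-3-5-1) are inferred from the parameter counts above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Layer sizes: 5 inputs -> 3 hidden -> 5 hidden -> 1 output
sizes = [5, 3, 5, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def forward(x):
    """Propagate x through the network: each layer computes activation(W a + b)."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(W @ a + b)
    return weights[-1] @ a + biases[-1]  # linear output layer for regression

n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
print(n_params)                          # 44, matching the count above
print(forward(rng.standard_normal(5)))   # scalar regression output
```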

As stated, this network is designed for a regression task. For a classification task with three classes, the output would instead be a three-dimensional $\vec{y}$, where $y_i$ stores the probability that the input $\vec{x}$ belongs to class $i$. These probabilities can then be collapsed into a class prediction by taking the one-hot encoding of the most likely class: set the index of the highest probability in $\vec{y}$ to 1, and the remaining indices to 0.
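For instance, a tiny sketch of that collapse (the probability vector is made up for illustration):

```python
import numpy as np

y = np.array([0.2, 0.7, 0.1])  # pseudo-probabilities for three classes
one_hot = np.zeros_like(y)
one_hot[np.argmax(y)] = 1.0    # [0., 1., 0.]: the prediction is class 1
print(one_hot)
```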

As the number of neurons in a hidden layer increases, the network is 'wider', and as the number of hidden layers increases, it becomes 'deeper'.

Activation functions

Instead of applying the Heaviside function to each neuron, neural networks use a variety of different activation functions. Because these are nonlinear functions, every hidden layer in the network performs a nonlinear transformation of the data, which allows the neural network to learn very complicated functions/decision boundaries.

Logistic/Sigmoid

The logistic function, also known as the sigmoid, is defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$. It is a squashing function: its domain is the set of all real numbers, which it maps to the interval $(0, 1)$. We can interpret this output as a probability. For classification tasks, we could use sigmoids on the last layer to obtain these pseudo-probabilities and then one-hot encode them to obtain the class prediction. Because the gradient is very small when $x$ is very negative or very positive (the function saturates), it can lead to slow learning.

Rectified Linear Unit (ReLU)

ReLU, defined as $\mathrm{ReLU}(x) = \max(0, x)$, acts like a threshold due to its piecewise nature. For negative inputs it outputs 0, while positive inputs are passed through linearly. Because it is linear on the positive side, the gradient is the same constant value for all positive inputs, so it doesn't have the same problem with slow learning exhibited by sigmoid or other saturating activation functions like tanh.
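A small sketch comparing the two gradients (the derivative formulas are standard; the sample points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 near 0, vanishes for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # constant 1 for every positive input

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))  # tiny at the tails: the sigmoid saturates
print(relu_grad(x))     # [0. 0. 1. 1.]
```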

Training

To start, randomly initialize the weights and biases of the entire network. Then, we need to somehow learn values for the weights and biases that perform well on our task.

Cost function: the cost is a function of all of the learnable parameters, as these are what we can change in order to influence the network's behavior. Even a simple MNIST classifier can have over 10,000 parameters, which means we are performing an optimization in an extremely high-dimensional space!
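For concreteness, one common choice of cost (an illustrative example, not the only option) is the mean squared error over the $N$ training pairs $(\vec{x}_i, y_i)$, viewed as a function of the parameters $\theta$ (all weights and biases), where $f_\theta$ is the network's output:

$$C(\theta) = \frac{1}{N}\sum_{i=1}^{N} \left( f_\theta(\vec{x}_i) - y_i \right)^2$$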

Gradient descent

Intuition: whenever we update the parameters, move in the direction of greatest improvement:

- if a neuron increases the probability of an incorrect prediction, turn down the weight of that edge
- if a neuron increases the probability of a correct prediction, turn up the weight of that edge

Performing gradient descent means calculating the gradient of your loss/cost function with respect to the parameters, then taking a step in the direction that decreases the cost.
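In symbols, each step moves every parameter $\theta$ against its gradient, scaled by a learning rate $\eta$ (this is the standard update rule; the notation is ours, not from the original page):

$$\theta \leftarrow \theta - \eta \nabla_\theta C$$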

A single training point can be misleading: if you have an orange point in a sea of blue points, the gradient computed from that one point will probably nudge the parameters in the wrong direction. For this reason we typically do minibatch gradient descent rather than pure stochastic gradient descent. The variants differ in how much data each update uses:

- batch: the whole dataset; low variance, but each step can take a long time if you have a lot of data
- stochastic: an individual training point per step; cheap but high variance
- minibatch: a compromise; less variance than stochastic, but doesn't need all the data like batch
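A sketch of the minibatch loop in NumPy (the linear least-squares toy problem and its closed-form gradient are made up for illustration; a real network would compute the gradient with backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: recover true_theta from noisy linear measurements
X = rng.standard_normal((1000, 5))
true_theta = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_theta + 0.1 * rng.standard_normal(1000)

theta = np.zeros(5)
lr, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # one minibatch
        residual = X[idx] @ theta - y[idx]
        grad = 2 * X[idx].T @ residual / len(idx)
        theta -= lr * grad                     # step against the gradient

print(theta)  # close to true_theta
```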

There are other optimization schemes as well (e.g., Adam, AdaGrad).

Hyperparameters

Weights and biases are parameters. There are other knobs you can tune that affect training and accuracy but aren't inside the neural network itself; these are hyperparameters. One example is the learning rate, which dictates how large of a step you take along the direction of the gradient. A well-chosen learning rate can help smooth out the path that stochastic gradient descent takes: too small and you learn slowly, too large and you overshoot local minima.
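A tiny sketch of that trade-off on a one-dimensional loss (the parabola and step counts are made up for illustration):

```python
def descend(lr, steps=20, theta=10.0):
    """Gradient descent on C(theta) = theta^2, whose gradient is 2*theta."""
    for _ in range(steps):
        theta -= lr * 2 * theta
    return theta

print(descend(0.01))  # ~6.7: too small a step, still far from the minimum at 0
print(descend(0.5))   # 0.0: converges immediately for this particular loss
print(descend(1.1))   # ~383: too large, every step overshoots and diverges
```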