-
Notifications
You must be signed in to change notification settings - Fork 0
Perceptrons and Neural Networks
📓 used during the workshop:
Biological motivation: whether or not a neuron in your brain fires depends on whether the connected neurons are firing, and the strength of those connections to the neighboring neurons. To implement this in a simple model, we can create a neuron that just implements some threshold function, and fires if the weighted input exceeds.
Perceptrons are linear classifiers that are capable of performing binary classification tasks, where the output is 0 or 1. All other binary classification tasks can be mapped onto this one: true/false, positive/negative, etc. In this example we use positive/1 and negative/0 interchangeably.
We have a neuron which takes as input the input/feature vector
Equivalently, using vector notation:
Once we have calculated
For a moment, lets the geometric interpretation of the perceptron. Together, the weights and biases together form a hyperplane in n-dimensional space. We will look at a simple example in two dimensions. The hyperplane, or decision boundary, is just a line and can be written with the implicit equation
The weights define the normal vector to the plane:
The bias can be computed by picking a point on the decision boundary:
Lets work out the math when we query a point:
And the Heaviside function predicts that this point will have a positive class label:
Rosenblatt, 1958
Notice that in the classification example, all of the positive class training points fall on one side of the hyperplane, and all of the negative class training points fall on the other. This means the dataset is linearly separable. That is, there exists some
One common and simple example of where a perceptron would fail is known as the XOR function. There is clearly no single line which can perfectly delineate the boundary between the positive and negative classes, and therefore the perceptron will never exit the training loop.
It is possible to structure multiple perceptrons together to obtain a NAND gate, and NAND gates can be used to form any other logic gate. Thus, by cleverly hooking up multiple perceptrons together, we can build more interesting decision boundaries, but they are still collections of hyperplanes in space.
Another problem is that it is not clear what we are learning from the weights. Small perturbations in the weights either has no effect at all on the output, or completely flips the output from one class to another.
Neural networks extend the idea of perceptrons to use nonlinear activation functions and multiple layers. This allows us to perform classification, regression, dimensionality reduction, function approximation, and many more learning tasks. It's known that multilayer feedforward networks are universal function approximators.
Below is an example of a fully connected neural network with two hidden layers. It is called fully connected because each neuron in the hidden layer receives as input all of the outputs of the previous layer.
The input layer is the n-dimensional
In this example the network has 44 learnable parameters: the first hidden layer has 15 weights and 3 biases, the second hidden layer has 15 weights and 5 biases, and the output layer has 5 weights and 1 bias. Modern neural networks have trillions of parameters to learn.
As stated, this network is designed for a regression task. For a classification task with three classes, the output would be a three dimensional
As the number of neurons in a hidden layer increases, the network is 'wider', and as the number of hidden layers increases, it becomes 'deeper'.
Instead of applying the Heaviside function to each neuron, neural networks use a variety of different activation functions. Because these are nonlinear functions, every hidden layer in the network performs a nonlinear transformation of the data, which allows the neural network to learn very complicated functions/decision boundaries.
Logistic/Sigmoid
Also known as sigmoid. This is a squashing function, which means that the domain is the set of all real numbers and is mapped to the interval [0,1]. We can interpret this number as a probability. For classification tasks, we could use these on the last layer to obtain these pseudo-probabilities and then one-hot encode them to obtain the class prediction. Because it has a very small gradient when x is very small or very large, it can lead to slow learning.
Rectified Linear Unit (ReLU)
ReLU acts like a threshold due to its piecewise nature. For negative inputs, it sets the output to 0, and positive inputs cause the output to scale linearly. Because it is a linear function, the gradient is the same value for all positive values, so it doesn't have the same problem with slow learning exhibited by sigmoid or other activation functions like tanh.
Randomly assign the weights and biases of the entire network to start. Then, we need to learn the weights and biases that perform well on our task.
A loss function for an input expresses the error of our prediction in terms of the weights and the biases of the network. One example of a loss function is L2 loss:
Even with our tiny neural network example, this amounts to a optimization problem in 44 dimensions. Even for small networks meant for classification of handwritten digits, we reach network sizes on the order of 10,000 parameters. The takeaway: to train neural networks, we need to solve an optimization problem in extremely high dimensional space!
Credit: Jason Pacheco, CS480/580
In these high dimensional non-convex spaces, calculating the true global minimum is an intractable problem. Instead, we will use the gradient of the function to take steps towards a local minimum. The intuition behind this is that whenever we observe a new example and update the parameters, we want to tune the weights and biases to move in the direction of greatest improvement. If a neuron increases the probability of an incorrect prediction, turn down the weight of that edge. If a neuron increases the probability of a correct prediction, turn up the weight of that edge.
Note the functions in red, which show a 'closed form' of the network activations at each layer. To calculate the gradient at every layer, we can compute the derivative using the chain rule. Backpropagation is simply repeatedly applying the chain rule to the entire network, working backwards from the output to the input. This requires access to the derivative of the nonlinear activation functions.
There are a few different ways to do this. In batch gradient descent, we compute the loss for the entire training dataset for each step of the gradient descent. This is computationally expensive if the dataset is very large. On the other hand, stochastic gradient descent performs a step of the gradient descent for every individual sample, the order of which are randomized before training begins. Mini-batch gradient strikes a balance between batch and stochastic gradient descent by splitting the training dataset into a set of smaller batches, smoothing out some of the variance induced by the stochastic nature of the process while keeping the computational burden much smaller than batch gradient descent.
There are other optimization schemes as well, such as AdaGrad, ADAM. I found this article to be a helpful overview.
So far we have contained our discussion of training to tuning of the network. There are other things you can tune that affect training speed, network accuracy, etc. but aren't inside the neural network. These are known as hyper parameters. One example is learning rate which dictates how large of a step you take along the direction of the gradient during the gradient descent algorithm. If the learning rate is too small the network learns slowly, and if it is too large the network can fail to converge due to overshooting local minima in the optimization space.
UArizona DataLab, Data Science Institute, 2024