Perceptrons and Neural Networks
Biological motivation: whether or not a neuron in your brain fires depends on whether its neighboring neurons are firing and on the strength of its connections to them. To implement this in a simple model, we can create a neuron that implements a threshold function and fires if the weighted input exceeds some threshold.
Perceptrons are linear classifiers that are capable of performing binary classification tasks, where the output is 0 or 1. All other binary classification tasks can be mapped onto this one: true/false, positive/negative, etc. In this example we use positive/1 and negative/0 interchangeably.
We have a neuron which takes as input the input/feature vector
Equivalently, using vector notation:
Once we have calculated
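As a minimal sketch, the computation just described (a weighted sum of the inputs plus a bias, thresholded by the Heaviside function) might look like this; the particular weights, bias, and inputs below are illustrative assumptions, not values from the text:

```python
import numpy as np

def heaviside(z):
    """Step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron_predict(w, b, x):
    """Weighted sum of the inputs plus bias, thresholded at zero."""
    return heaviside(np.dot(w, x) + b)

# Hypothetical weights and bias, for illustration only
w = np.array([2.0, -1.0])
b = 0.5
print(perceptron_predict(w, b, np.array([1.0, 1.0])))   # 2 - 1 + 0.5 = 1.5 >= 0, so class 1
print(perceptron_predict(w, b, np.array([-1.0, 1.0])))  # -2 - 1 + 0.5 = -2.5 < 0, so class 0
```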
For a moment, let's consider the geometric interpretation of the perceptron. The weights and bias together form a hyperplane in n-dimensional space. We will look at a simple example in two dimensions. The hyperplane, or decision boundary, is just a line and can be written with the implicit equation
The weights define the normal vector to the plane:
The bias can be computed by picking a point on the decision boundary:
Let's work out the math when we query a point:
And the Heaviside function predicts that this point will have a positive class label:
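A worked numeric version of this construction (the specific weights, boundary point, and query point are illustrative assumptions, not the values from the figure):

```python
import numpy as np

# Illustrative 2-D example:
w = np.array([1.0, 1.0])   # weights = normal vector to the decision boundary
p = np.array([2.0, 0.0])   # a point chosen to lie on the decision boundary
b = -np.dot(w, p)          # so that w.p + b = 0 on the boundary; here b = -2

x = np.array([3.0, 2.0])   # query point
z = np.dot(w, x) + b       # 1*3 + 1*2 - 2 = 3
label = 1 if z >= 0 else 0 # Heaviside: positive side of the plane -> class 1
print(z, label)            # 3.0 1
```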
Rosenblatt, 1958
Notice that in the classification example, all of the positive class training points fall on one side of the hyperplane, and all of the negative class training points fall on the other. This means the dataset is linearly separable. That is, there exists some
One common and simple example of where a perceptron would fail is known as the XOR function. There is clearly no single line which can perfectly delineate the boundary between the positive and negative classes, and therefore the perceptron will never exit the training loop.
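A sketch of Rosenblatt's error-driven learning rule (assuming the standard update: nudge the weights by the prediction error times the input). The epoch cap is what keeps the loop from running forever on non-separable data like XOR; on separable data such as AND, the loop exits once every point is classified correctly:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, max_epochs=100):
    """Perceptron learning rule: for each misclassified point,
    move w and b in the direction that corrects the error.
    Converges only if the data are linearly separable."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0
            if pred != yi:
                w += lr * (yi - pred) * xi
                b += lr * (yi - pred)
                errors += 1
        if errors == 0:  # exit the training loop: all points correct
            break
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])  # AND is linearly separable, so training converges
w, b = train_perceptron(X, y_and)
```

Running the same loop on XOR labels (0, 1, 1, 0) never reaches `errors == 0`, which is exactly the failure mode described above.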
It is possible to structure multiple perceptrons together to obtain a NAND gate, and NAND gates can be used to form any other logic gate. Thus, by cleverly hooking up multiple perceptrons together, we can build more interesting decision boundaries, but they are still collections of hyperplanes in space.
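One way to sketch this: a single perceptron with hand-picked weights (a standard choice, assumed here for illustration) implements NAND, and four NAND gates compose into XOR:

```python
def nand(a, b):
    """A single perceptron with fixed weights w = (-2, -2) and bias 3
    implements the NAND gate (weights are an assumed standard choice)."""
    return 1 if (-2 * a - 2 * b + 3) >= 0 else 0

def xor(a, b):
    """XOR built entirely from NAND gates (four of them)."""
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```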
Another problem is that it is not clear what we are learning from the weights. Small perturbations in the weights either have no effect at all on the output or completely flip the output from one class to the other.
Neural networks extend the idea of perceptrons to use nonlinear activation functions and multiple layers. This allows us to perform classification, regression, dimensionality reduction, function approximation, and many more learning tasks. It's known that multilayer feedforward networks are universal function approximators.
Below is an example of a fully connected neural network with two hidden layers. It is called fully connected because each neuron in the hidden layer receives as input all of the outputs of the previous layer.
The input layer is the n-dimensional
In this example the network has 44 learnable parameters: the first hidden layer has 15 weights and 3 biases, the second hidden layer has 15 weights and 5 biases, and the output layer has 5 weights and 1 bias. Modern neural networks can have billions or even trillions of parameters to learn.
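The parameter count can be checked programmatically. From the counts given, the layer sizes appear to be 5 inputs, hidden layers of 3 and 5 neurons, and 1 output (this layout is inferred from the stated weight counts, not given explicitly):

```python
def count_parameters(layer_sizes):
    """Total weights + biases in a fully connected network.
    Each layer contributes (inputs * outputs) weights and (outputs) biases."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Inferred architecture: 5 inputs -> 3 -> 5 -> 1 output
print(count_parameters([5, 3, 5, 1]))  # 44
```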
As stated, this network is designed for a regression task. For a classification task with three classes, the output would be a three dimensional
As the number of neurons in a hidden layer increases, the network is 'wider', and as the number of hidden layers increases, it becomes 'deeper'.
Instead of applying the Heaviside function to each neuron, neural networks use a variety of different activation functions. Because these are nonlinear functions, every hidden layer in the network performs a nonlinear transformation of the data, which allows the neural network to learn very complicated functions/decision boundaries.
Logistic/Sigmoid
Also known as sigmoid. This is a squashing function: its domain is the set of all real numbers, which it maps to the interval [0, 1]. We can interpret this number as a probability. For classification tasks, we could use these on the last layer to obtain pseudo-probabilities and then one-hot encode them to obtain the class prediction. Because it has a very small gradient when x is very negative or very positive, it can lead to slow learning.
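A small sketch of the sigmoid and its derivative, illustrating how the gradient shrinks toward zero for inputs of large magnitude (the cause of the slow learning mentioned above):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative s(z) * (1 - s(z)); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient vanishes for large-magnitude inputs, which slows learning:
for z in [0.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))
```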
Rectified Linear Unit (ReLU)
ReLU acts like a threshold due to its piecewise nature. For negative inputs, it sets the output to 0, and for positive inputs the output scales linearly. Because it is linear on the positive side, the gradient is the same constant value for all positive inputs, so it doesn't have the same problem with slow learning exhibited by sigmoid or other saturating activation functions like tanh.
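A corresponding sketch of ReLU and its gradient, showing the constant gradient on the positive side:

```python
import numpy as np

def relu(z):
    """max(0, z): zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient is a constant 1 for positive inputs and 0 for negative,
    so it does not vanish the way sigmoid's does."""
    return (z > 0).astype(float)

print(relu(np.array([-2.0, 3.0])))       # [0. 3.]
print(relu_grad(np.array([-2.0, 3.0])))  # [0. 1.]
```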
To start, randomly assign the weights and biases of the entire network. Then, we need to somehow learn the weights and biases that will perform well on our task.
Cost function: the cost is a function of all of the learnable parameters, as these are what we can change in order to influence the network's behavior. Even a simple MNIST network can have on the order of 10,000 parameters, which means we are trying to do an optimization in an extremely high-dimensional space!
Intuition: whenever we update the parameters, move in the direction of greatest improvement:
- if a neuron increases the probability of an incorrect prediction, turn down the weight of that edge
- if a neuron increases the probability of a correct prediction, turn up the weight of that edge
Performing gradient descent means calculating the gradient of your loss/cost function and then taking a step in the direction that minimizes it.
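A minimal sketch of this idea on a one-dimensional toy function (the function, starting point, and step size are illustrative assumptions):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step opposite the gradient of the loss."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approximately 3.0
```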
[get a picture of a surface in 3d] [make the point that if you have an orange point in a sea of blue points, that point will probably nudge you in the wrong direction, so typically we do minibatch gradient descent rather than pure stochastic gradient descent]
- batch: the whole dataset; low variance, but can take a long time if you have a lot of data
- stochastic: individual training examples; high variance
- minibatch: a compromise; less variance than stochastic, but doesn't need the whole dataset like batch
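A sketch of minibatch gradient descent on a toy linear regression problem (the data, learning rate, and batch size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.standard_normal(200)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        pred = w * X[batch, 0] + b
        err = pred - y[batch]
        # Gradient step using the average error over the minibatch
        # (the factor of 2 from the squared error is absorbed into lr)
        w -= lr * np.mean(err * X[batch, 0])
        b -= lr * np.mean(err)

print(w, b)  # close to 2 and 1
```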
There are other optimization schemes as well (Adam, Ada...).
Weights and biases are parameters. There are other things you can tune that affect training and accuracy but aren't inside the neural network; these are hyperparameters. One example is the learning rate, which dictates how large of a step you take along the direction of the gradient. Tuning it can help smooth out the path that stochastic gradient descent takes. Too small and you learn slowly; too large and you overshoot local minima.
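The effect of the learning rate can be seen on a toy function: on f(x) = x², each gradient step multiplies x by (1 - 2·lr), so a small rate converges while a too-large rate overshoots and diverges (the rates below are illustrative):

```python
def descend(lr, steps=20, x0=1.0):
    """Gradient descent on f(x) = x^2 (gradient 2x) with a given learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # each update multiplies x by (1 - 2*lr)
    return x

print(abs(descend(lr=0.1)))   # |1 - 0.2| = 0.8 per step: shrinks toward the minimum at 0
print(abs(descend(lr=1.1)))   # |1 - 2.2| = 1.2 per step: overshoots and diverges
```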
UArizona DataLab, Data Science Institute, 2024