Convolutional Neural Networks
Generally, we want to automate the derivation of useful information from images. Some example tasks include:
- Classification: given an image, predict a class label
- Object detection: generate a bounding box around the object
- Semantic segmentation: assign every pixel in the image a class label
- Instance segmentation: differentiate between multiple instances of the same semantic class
- Pose recognition: for example, estimating the pose of a person's head, which can be used to determine where they are looking
- Activity recognition: related to pose recognition, classify a pose or series of poses
- Object tracking: propose correspondence of detected objects across frames of a video
- Image restoration
- Feature matching: detection of features and correspondence between multiple views
In this workshop, we will use RGB images. Each image has dimensionality H x W x 3, where H is the height of the image, W is the width of the image, and every pixel has three color channels (Red, Green, Blue).
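As a quick check (a minimal sketch assuming NumPy and Pillow are available, with a hypothetical file name), loading an RGB image gives exactly this H x W x 3 layout:

```python
import numpy as np
from PIL import Image

# "cat.jpg" is a placeholder file name; any RGB image will do
img = np.array(Image.open("cat.jpg").convert("RGB"))
print(img.shape)   # (H, W, 3): height, width, and the R, G, B channels
print(img.dtype)   # uint8: each channel value is in [0, 255]
```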
To motivate the need for a neural network architecture for images, consider the case of classifying digits with MNIST using a fully connected neural network. In this case each pixel takes on only a single greyscale value, and thus the image has H x W values. In the fully connected neural network, we unravel the image into a one dimensional vector which is HW long, either by picking values row-wise or column-wise.
Now consider two toy images that are identical except that a single white pixel is shifted by one position. Despite both images having the same structure, the location of the white value is shifted in the flattened vector, meaning it interacts with a completely different set of weights and biases. This is a toy example of a larger problem: for classification, we would like to learn to identify an image that features a cat whether the cat is in the upper left or the lower right of the image. This property is known as translation invariance, and convolution is the linear operator that enables it in CNNs.
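The problem is easy to see with a small NumPy sketch (the 4x4 images below are made up for illustration): shifting the white pixel by one column moves it to a different entry in the flattened vector, so it meets an entirely different weight in a fully connected layer.

```python
import numpy as np

# Two 4x4 greyscale "images": the same bright pixel, shifted one column to the right
a = np.zeros((4, 4)); a[1, 1] = 1.0
b = np.zeros((4, 4)); b[1, 2] = 1.0

# Row-wise unravelling into a vector of length H*W = 16
va, vb = a.ravel(), b.ravel()
print(np.argmax(va), np.argmax(vb))  # 5 and 6: different entries, hence different weights
```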
Convolutional neural network architectures consist of convolutional and pooling layers in addition to fully connected layers, with a common activation function such as ReLU applied throughout. The convolution and pooling layers reduce the dimensionality of the input to a compact representation, which is then fed into the fully connected layers to produce the predicted class for the classification task.
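As an illustration, here is a minimal sketch of such an architecture in PyTorch; the layer sizes, the 32x32 input, and the 10-class output are assumptions for the example rather than values from the workshop code.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolution + ReLU + pooling layers reduce the spatial dimensions
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 input channels (RGB) -> 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # halve height and width
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layers map the compact representation to class scores
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64),                   # assumes 32x32 input images
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```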
In the convolution operation, we pass kernels (also known as filters) over an image. We take the inner product between the kernel and a patch of the image. The output of convolution is high where the underlying image resembles the filter, and low where it does not. Once this response is calculated, it is fed through an activation function like ReLU, similar to fully connected neural networks.
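Written out directly in NumPy (the vertical-edge kernel below is a made-up example, not a learned filter), the operation looks like this:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take the inner product at each position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # inner product of kernel and image patch
    return out

image = np.random.rand(8, 8)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])              # responds strongly to vertical edges

response = conv2d_valid(image, kernel)
activation = np.maximum(response, 0)            # ReLU applied to the filter response
print(activation.shape)                         # (6, 6)
```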
Before neural networks there was significant effort expended to design kernels to detect small features, such as edges, corners, etc and use these for computer vision tasks. Neural networks allow us to set up a convolutional architecture that both learns the kernels that are useful for the given task as well as the mapping from the feature space to the output - i.e. the network learns the parameters of each kernel.
Here's an example of the learned first-layer filters of AlexNet (Krizhevsky et al.), trained on ImageNet. Note that some of the filters are what we would expect: the network has learned to look for lines at various angles, as well as for dots. Not all of the filters are easily interpretable.
In addition to specifying the number of filters and the shape of the filters, we also need to specify the stride and padding. The stride specifies how many pixels the kernel shifts by as it slides across the input image. With zero-padding, a border of artificial black (zero-valued) pixels is added around the input, which controls the spatial size of the output and lets the kernel reach pixels at the edge of the image.
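The spatial size of the output follows directly from the kernel size, stride, and padding. A small sketch of the standard formula, checked against the AlexNet first-layer settings discussed below:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard formula: floor((N - K + 2P) / S) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

# AlexNet's first convolution: 227x227 input, 11x11 kernel, stride 4, no padding
print(conv_output_size(227, 11, stride=4, padding=0))  # 55

# With stride 1 and zero-padding of 1, a 3x3 kernel preserves the input size
print(conv_output_size(32, 3, stride=1, padding=1))    # 32
```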
These feature maps still have very high dimension, so we summarize the filter activations with a pooling layer. In max pooling, we take the maximum of the activations within each pooling window; in average pooling, we take their average. Max pooling is typically used. This layer has no learnable parameters.
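For example, a 2x2 max pool with stride 2 keeps only the largest activation in each non-overlapping 2x2 window, halving the height and width (a small NumPy sketch with made-up activations):

```python
import numpy as np

activations = np.array([[1., 3., 2., 0.],
                        [4., 2., 1., 1.],
                        [0., 1., 5., 6.],
                        [2., 2., 7., 3.]])

# 2x2 max pooling with stride 2: group into 2x2 blocks and take the max of each
pooled = activations.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4. 2.]
#  [2. 7.]]
```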
In the AlexNet architecture, which represented a huge leap forward in image classification, input images are 227x227x3. The first convolutional layer output is 55x55x96 (290,400 values), so 96 filters must be learned. Each filter is 11x11x3, so we learn 363 weights and 1 bias, for a total of 364 parameters per filter. Thus we need to learn 34,944 parameters in total for this convolutional layer.
If we used a fully connected layer instead, we would go from 227x227x3 (154,587 values) to 55x55x96 (290,400 values), which requires 154,587 x 290,400 weights and 290,400 biases. That's nearly 45 billion parameters for a single layer!
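These counts can be verified with a short sketch (assuming PyTorch for the convolution; the fully connected count is computed arithmetically, because such a layer would be far too large to actually build):

```python
import torch.nn as nn

# AlexNet's first convolution: 96 filters, each 11x11x3, stride 4, no padding
conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)
print(sum(p.numel() for p in conv.parameters()))   # 34944 = 96 * (11*11*3 + 1)

# The fully connected alternative: one weight per input-output pair, one bias per output
fc_weights = (227 * 227 * 3) * (55 * 55 * 96)
fc_biases = 55 * 55 * 96
print(f"{fc_weights + fc_biases:,}")               # 44,892,355,200 -- nearly 45 billion
```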
Also note that by learning filters that are shared across all spatial locations in the image, we have achieved our goal of translation invariance.
Intuition: the first layers of the network learn low level, simple features. As the input progresses deeper into the architecture of the network, these are combined to create more complex features and eventually a high level semantic description of the image is extracted. This high level semantic description is then used in the fully connected network for the learning task.
Such a network can classify images from the ImageNet visual database into 1000 classes with 92.7% accuracy. There are better models now that use more advanced architectures, but this approach is quite simple and effective.
- Stanford's CS231n, CNNs for Visual Recognition: an excellent resource for neural networks as well as CNNs
- ConvNetJS CIFAR-10: train and test a model on CIFAR-10 in your browser, with an excellent demonstration of the activations at every step in the architecture
- Comprehensive Guide to Convolutional Neural Networks - the ELI5 way
- A guide to convolution arithmetic for deep learning: a thorough overview of convolution, stride, and padding
UArizona DataLab, Data Science Institute, 2024