This project uses the Sign Language Digits Dataset to classify images of sign language digits. The task is similar to the classic MNIST dataset, which consists of grayscale images of handwritten digits from 0 to 9.
The goal is to build a convolutional neural network (CNN) model that classifies a sign language digit image as a digit from 0 to 9.
This project also demonstrates how I approach designing and developing a learning model: I outline the designs I considered and evaluate their performance on this dataset.
The dataset was provided by Turkey Ankara Ayrancı Anadolu High School, and I found it through Kaggle. The images are converted to 64 x 64 grayscale images.
This project is written in Python and uses the Keras framework to build the layers of the CNN model.
Type in the following command:
```
python run.py
```
Ensure that all requirements (found here) have been met in order to run the project.
The architecture used in this model is the following (a Keras sketch of this stack is shown after the list):
- Convolution 1D: 32 filters and 3 x 1 kernel size
- Maximum Pooling 1D: 2 x 1 kernel size
- Convolution 1D: 64 filters and 3 x 1 kernel size
- Maximum Pooling 1D: 2 x 1 kernel size
- Convolution 1D: 128 filters and 3 x 1 kernel size
- Maximum Pooling 1D: 2 x 1 kernel size
- Convolution 1D: 256 filters and 3 x 1 kernel size
- Maximum Pooling 1D: 2 x 1 kernel size
- Flatten
- Dense: 1024 hidden units
- Dropout: 0.5 hidden unit drop probability
- Dense: 512 hidden units
- Dropout: 0.5 hidden unit drop probability
- Dense: 256 hidden units
- Dense: 10 output units corresponding to digits 0 to 9
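The listed stack translates fairly directly into Keras. Below is a minimal sketch, assuming ReLU activations, a softmax output, the Adam optimizer, and the 64 x 64 grayscale images fed row-wise so the 1-D convolutions slide along image rows; none of these choices are stated above, and the actual implementation lives in run.py.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Flatten,
                                     Dense, Dropout)

# Sketch of the layer stack listed above. Activations, optimizer, and the
# row-wise treatment of the 64 x 64 images are assumptions, not taken from
# the original code.
model = Sequential([
    Conv1D(32, 3, activation="relu", input_shape=(64, 64)),
    MaxPooling1D(2),
    Conv1D(64, 3, activation="relu"),
    MaxPooling1D(2),
    Conv1D(128, 3, activation="relu"),
    MaxPooling1D(2),
    Conv1D(256, 3, activation="relu"),
    MaxPooling1D(2),
    Flatten(),
    Dense(1024, activation="relu"),
    Dropout(0.5),
    Dense(512, activation="relu"),
    Dropout(0.5),
    Dense(256, activation="relu"),
    Dense(10, activation="softmax"),  # output units for digits 0 to 9
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```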
This architecture is inspired by the VGG16 network (paper found here); configuration A from that paper was used as the starting point.
Because the dataset is small, the number of parameters has to be kept low so that the model does not overfit the training set. As a result, I have limited the number of convolutional layers and filters compared to the original VGG16 configuration.
In addition, the number of hidden units decreases toward the output layer, and dropout between the dense layers further reduces overfitting.
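Continuing from the sketch above, one quick sanity check on model size is the trainable parameter count reported by Keras; the exact number depends on the assumed input shape.

```python
# Print the layer-by-layer summary and the total trainable parameter count;
# keeping this count small (relative to the dataset size) is the design goal.
model.summary()
print(f"Trainable parameters: {model.count_params():,}")
```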
The highest test set accuracy achieved after 50 epochs is 93.46%.
The training set accuracy is 99.39% and the validation set accuracy is 88.48%.
The loss curves for the training and validation sets are shown here:
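As a rough illustration of how such curves can be produced, the sketch below fits the model for 50 epochs and plots the per-epoch training and validation loss. The variable names x_train and y_train, the 80/20 validation split, and the batch size are assumptions; the real training loop is in run.py.

```python
import matplotlib.pyplot as plt

# Hypothetical training run: x_train has shape (num_samples, 64, 64) and
# y_train holds one-hot labels for the 10 digit classes.
history = model.fit(x_train, y_train,
                    validation_split=0.2,
                    epochs=50,
                    batch_size=32)

# Plot the training and validation loss recorded at each epoch.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```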
The dataset and its original source can be found through Kaggle's website here.
The arXiv paper describing the VGG16 network can be found here.