Digital image processing investigates the representation and manipulation of pictorial data. With advances in machine learning, digital image processing keeps improving over time. This research is an example of Optical Character Recognition (OCR), where a Convolutional Neural Network (CNN) is used for the classification of Bengali numerals.
Topics of this study are listed below:
- Image Processing
- Introduction to Research Planning
- Implementation in MATLAB
- Result Analysis
- Conclusion
Image processing is the technology that analyses the features of a digital image and then predicts information about the image. Image processing uses different machine learning tools to classify images based on the available features.
These machine learning tools and algorithms include the Support Vector Machine (SVM), k-nearest neighbours (k-NN), Artificial Neural Networks (ANN), and others. Among all these tools, deep learning is the most efficient for image processing. There are many deep learning models, such as the Deep Belief Network (DBN), the Restricted Boltzmann Machine (RBM), and the Convolutional Neural Network (CNN).
The Convolutional Neural Network (CNN) offers unique features that make this model particularly suitable for image processing. One of the significant features of a CNN is that it eliminates the need for manual feature extraction. Another essential feature is that one CNN model can be used for different recognition tasks: an existing CNN model can learn to recognise new patterns when given new training data.
Image processing can recognise patterns in an image. First, the system is trained with a large set of images, from which it learns to classify and identify patterns. Then, when a test image is provided, the system can recognise the patterns the image contains. Similarly, a system can recognise characters in an image by treating the characters as patterns. This approach is known as Optical Character Recognition (OCR).
This research is based on Optical Character Recognition (OCR). In this study, a system will be designed to recognise Bengali numerals. The numerals are handwritten, and the dataset will be collected from a cloud-based repository. The data will be preprocessed first and then randomly split into training and testing sets. A Convolutional Neural Network (CNN) will then be designed and trained using the training set of handwritten Bengali numerals. After training, the CNN will be tested using the testing set to measure its recognition accuracy.
This study aims to deploy a CNN for Optical Character Recognition. The methodology of this research is shown in this figure.
Like other neural networks, the CNN consists of an input layer and an output layer. In between these two layers, there are many hidden layers: some hidden layers perform feature extraction, and one layer performs classification.
Details about each of these layers are given in the implementation sections below.
This research is implemented in MATLAB, version R2021b. MATLAB has excellent resources for deep neural networks and provides built-in functions for many neural network applications.
For this study, a dataset of handwritten Bengali numerals is required. The dataset is collected from a cloud-based repository, and images of the ten Bengali numerals are gathered from it. The collected dataset can be found in this GitHub repository. The number of images in each class is listed in Table 01, and a data-loading sketch follows the table.
Table 01: Number of images in each class
Class | Number of Images | Class | Number of Images |
---|---|---|---|
0 | 1982 | 5 | 1986 |
1 | 1982 | 6 | 1981 |
2 | 1953 | 7 | 1958 |
3 | 1975 | 8 | 1984 |
4 | 1980 | 9 | 1967 |
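A minimal MATLAB sketch of loading the dataset is shown below. The folder name and the one-sub-folder-per-digit layout are assumptions for illustration; `countEachLabel` reproduces per-class counts comparable to Table 01.

```matlab
% Minimal data-loading sketch. The folder name 'dataset' and the
% one-sub-folder-per-digit layout are assumptions, not stated in the text.
imds = imageDatastore('dataset', ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Per-class image counts, comparable to Table 01.
countEachLabel(imds)
```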
The collected data are preprocessed to meet the input requirements of the CNN. The designed CNN model requires images with dimensions of 150-by-150 and a single grey-scale colour channel. The collected images are already grey-scale, but they do not match the dimension requirement, so the images are resized to 150-by-150.
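A sketch of the resizing step, reusing the `imds` datastore from the sketch above; the custom read function is an assumed way of resizing every image to 150-by-150 as it is loaded.

```matlab
% Resize every image to the 150-by-150 input size required by the CNN.
% (The images are already grey-scale, so no colour conversion is needed.)
imds.ReadFcn = @(file) imresize(imread(file), [150 150]);
```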
After preprocessing, the data are split into training data and testing data: 75% of the data are used for training, and the remaining 25% are used for testing and validation.
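A sketch of the random 75/25 split, using `splitEachLabel` so that the split stays balanced per class; the variable names are illustrative.

```matlab
% Randomly assign 75% of each class to training and 25% to testing.
[imdsTrain, imdsTest] = splitEachLabel(imds, 0.75, 'randomized');
```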
The CNN model consists of four kinds of layers: an input layer, convolution layers, a classification layer, and an output layer.
The input layer parameter is [150 150 1], which means the image dimensions are 150-by-150, and the 1 means the layer processes grey-scale images.
The designed CNN has four convolution layers for feature extraction. Each layer has a different filter size, and the number of filters increases from layer to layer. A max-pooling layer between successive convolution layers performs the down-sampling operation.
There is a fully connected layer with ten classes, as Bengali numerals have ten digits. A softmax layer and a classification layer follow this layer.
The classification layer generates a prediction value for each of the classes, and the class with the maximum prediction is considered the output for the image being tested.
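The layer stack described above can be sketched in MATLAB as follows. The exact filter sizes and filter counts are assumptions, since the text only states that the filter size changes and the number of filters grows from layer to layer.

```matlab
% Sketch of the described architecture; filter sizes and counts are assumed.
layers = [
    imageInputLayer([150 150 1])                  % 150-by-150 grey-scale input

    convolution2dLayer(7, 16, 'Padding', 'same')  % convolution layer 1
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)             % down-sampling

    convolution2dLayer(5, 32, 'Padding', 'same')  % convolution layer 2
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)

    convolution2dLayer(3, 64, 'Padding', 'same')  % convolution layer 3
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)

    convolution2dLayer(3, 128, 'Padding', 'same') % convolution layer 4
    reluLayer

    fullyConnectedLayer(10)                       % ten Bengali digit classes
    softmaxLayer
    classificationLayer];
```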
The training process starts with a learning rate of 0.001. The maximum number of epochs is ten, and the data are shuffled after each epoch. The network is trained using the training data, which is 75% of the total data.
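A training sketch with the stated options (learning rate 0.001, ten epochs, shuffling every epoch, and the testing set used for validation at intervals). The choice of the sgdm solver and the validation frequency are assumptions.

```matlab
% Training options following the stated settings; solver and
% validation frequency are assumptions.
options = trainingOptions('sgdm', ...
    'InitialLearnRate', 0.001, ...
    'MaxEpochs', 10, ...
    'Shuffle', 'every-epoch', ...
    'ValidationData', imdsTest, ...
    'ValidationFrequency', 50, ...
    'Plots', 'training-progress', ...
    'Verbose', false);

% Train the network on the 75% training split.
net = trainNetwork(imdsTrain, layers, options);
```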
While training the network, the software calculates the accuracy on the testing data at regular intervals. At the end of the training, the software calculates the total accuracy.
The network is trained with images that have a height of 150 pixels, a width of 150 pixels, and a bit depth of 8 bits. The training images have a single grey-scale channel. The machine used for training has 8.00 GB of RAM, an Intel(R) Core(TM) i7-4500U CPU running at 1.80 GHz (up to 2.40 GHz), and a 64-bit operating system.
The details of the training progress are shown in this figure. The network training is completed in 10 epochs, with a total training time of 107 minutes and 25 seconds, and a learning rate of 0.001.
When training is complete, the Convolutional Neural Network achieves an accuracy of 95.05%. The recognition results for some randomly selected Bengali numerals from the testing data are shown in this figure.
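A sketch of evaluating the trained network on the held-out testing set and inspecting a few random predictions; variable names follow the earlier sketches.

```matlab
% Classify the whole testing set and compute the overall accuracy.
predicted = classify(net, imdsTest);
actual    = imdsTest.Labels;
accuracy  = mean(predicted == actual);   % fraction of correctly recognised digits
fprintf('Test accuracy: %.2f%%\n', 100 * accuracy);

% Display a few randomly selected test images with their predicted labels.
idx = randperm(numel(actual), 9);
figure;
for k = 1:9
    subplot(3, 3, k);
    imshow(readimage(imdsTest, idx(k)));
    title(string(predicted(idx(k))));
end
```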
In this research, a Convolutional Neural Network is designed for Optical Character Recognition of Bengali numerals. The designed CNN is independent of font and shape, meaning it can detect digits of any font style, size, and shape. Because a CNN is used, this study does not require manual feature extraction; the CNN extracts the features automatically, although this requires a large set of data during training. The CNN is trained using around 2,000 images per class. The network can process grey-scale images with dimensions of 150-by-150 and a bit depth of 8 bits, meaning the value of each pixel is in the range 0 to 255.