Speech Emotion Recognition Using Deep Learning: A Comprehensive Approach with MFCC Images, Data Combination, and Augmentation
Speech serves as one of the most fundamental methods of communication between humans, and as such, speech processing has become a critical area of research. Automatic recognition of emotions from voice signals is a growing field within Human-Computer Interaction (HCI), with applications in healthcare, emergency response, depression detection, behavior analysis, virtual reality, media recommendation, and human-robot interaction.
Traditional research in Speech Emotion Recognition (SER) has primarily focused on training models on isolated datasets with limited speech features. Additionally, a significant challenge in SER is the scarcity of diverse and balanced data. Our approach aims to address these limitations by combining voice samples from multiple datasets—namely IEMOCAP, RAVDESS, EMODB, and SAVEE—into a larger corpus. We also implement data augmentation techniques to enhance the dataset and improve generalization.
In contrast to the prevalent use of hand-selected Mel-Frequency Cepstral Coefficients (MFCC) features, this research explores the use of MFCC images, which provide a more comprehensive representation of the speech signal. This approach allows our deep learning model to capture richer features, leading to enhanced performance in recognizing emotions from speech.
Key Features:
- Dataset Combination: Merges datasets from IEMOCAP, RAVDESS, EMODB, and SAVEE to create a more extensive, diverse corpus.
- Data Augmentation: Applies augmentation techniques to overcome data scarcity and improve model robustness.
- MFCC Image Representation: Uses MFCC images instead of selected features, enabling the model to learn from a broader range of speech information.
- Deep Learning Model: Implements a deep learning-based architecture optimized for emotion recognition from speech.
Applications:
- Healthcare: Emotion recognition for mental health assessments, such as depression detection.
- Emergency Response: Emotion detection for improving the efficiency of emergency call centers.
- Virtual Reality: Emotion analysis to enhance user experience in virtual environments.
- Human-Robot Interaction: Enables robots to understand and respond to human emotions.
- Media Recommendation: Provides personalized media suggestions based on user emotions.
Datasets:
- IEMOCAP
- RAVDESS
- EMODB
- SAVEE
Methodology:
Data Preparation:
- Collect voice samples from multiple SER datasets to build a diverse and balanced corpus.
- Apply data augmentation techniques, such as Additive White Gaussian Noise (AWGN), to artificially expand the dataset and address data scarcity (a minimal sketch follows this list).
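As a concrete illustration, the sketch below adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio. The function name and default SNR are illustrative assumptions, not values taken from the repository:

```python
import numpy as np

def add_awgn(signal: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (dB).

    Illustrative helper; the repository's augmentation parameters may differ.
    """
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```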
MFCC Image Generation:
- Convert speech signals into MFCC images, providing a detailed representation of the voice input.
- Use these images instead of traditional hand-selected MFCC features, allowing the deep learning model to extract a richer set of acoustic patterns (see the example after this list).
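A minimal sketch of how a waveform can be rendered as an MFCC image using librosa and matplotlib; the number of coefficients and the figure size are assumptions for illustration:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_mfcc_image(wav_path: str, out_path: str, n_mfcc: int = 40) -> None:
    """Render the MFCC matrix of a wav file as an image file."""
    y, sr = librosa.load(wav_path, sr=None)       # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    fig, ax = plt.subplots(figsize=(3, 3))
    librosa.display.specshow(mfcc, sr=sr, ax=ax)  # time-coefficient heat map
    ax.set_axis_off()                             # image input only, no axes
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```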
Deep Learning Model:
- Develop a multi-step algorithm optimized for SER, using convolutional neural networks (CNNs) to process MFCC images.
- Train the model on the augmented dataset to enhance robustness and accuracy in emotion classification (an illustrative architecture follows this list).
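The compact Keras sketch below shows a CNN over MFCC images; the layer sizes, input shape, and number of emotion classes are placeholders, not the repository's exact architecture:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 3), num_classes=7):
    """Small convolutional classifier over MFCC images (illustrative sizes)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),  # regularization helps on small speech corpora
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```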
Experimentation:
- Conduct a series of experiments comparing MFCC-image input against raw audio input, each with and without Additive White Gaussian Noise (AWGN) augmentation.
- Test the effect of MFCC images, AWGN augmentation, and their combination on the performance of various deep learning models.
- Analyze the results to determine the influence of these techniques on the effectiveness of the models.
Comparison with Pre-trained Models:
- During training, the top layers of the pre-trained models are frozen so that training focuses on the remaining layers, which learn from the MFCC images.
- The pre-trained models are trained on the same dataset and compared against the proposed custom CNN model to evaluate their relative performance (a transfer-learning sketch follows this list).
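For the pre-trained comparison, a common transfer-learning pattern is sketched below: a pre-trained backbone (VGG16 here, as an assumption) is frozen and only a newly added classification head is trained on the MFCC images. The exact backbone and layer-freezing scheme used in the experiments may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_pretrained(input_shape=(128, 128, 3), num_classes=7):
    """Frozen pre-trained backbone plus a trainable classification head."""
    base = tf.keras.applications.VGG16(weights="imagenet",
                                       include_top=False,
                                       input_shape=input_shape)
    base.trainable = False  # freeze the pre-trained convolutional layers
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```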
Through a series of experiments, the following aspects were evaluated:
- The effectiveness of MFCC images in representing speech data, providing richer context than hand-selected features.
- The impact of AWGN augmentation on raw audio data and MFCC images, leading to improved model robustness against noise.
- The combination of MFCC images and AWGN augmentation, showing potential synergies that further enhance deep learning model performance.
- Pre-trained model evaluation compared against the custom CNN model to assess the benefit of MFCC image-based input and augmentation.
This research introduces a novel approach to SER by integrating MFCC image-based input and data augmentation techniques into deep learning models. Our experimental results demonstrate the effectiveness of using MFCC images, AWGN augmentation, and their combination in improving the accuracy and robustness of SER systems. Furthermore, the performance of pre-trained models was benchmarked against the proposed custom CNN model, highlighting the advantages of the custom approach. The proposed methodology offers a scalable solution to challenges such as data scarcity and limited feature selection.
Installation:
Clone the repository:
git clone https://github.com/sudhanshu-shukla-git/mfcc-image-based-ser.git
Install dependencies:
pip install -r requirements.txt
This project is licensed under the MIT License. See the LICENSE file for more details.