Speech Emotion Recognition Using Deep Learning: A Comprehensive Approach with MFCC Images, Data Combination, and Augmentation
Speech serves as one of the most fundamental methods of communication between humans, and as such, speech processing has become a critical area of research. Automatic recognition of emotions from voice signals is a growing field within Human-Computer Interaction (HCI), with applications in healthcare, emergency response, depression detection, behavior analysis, virtual reality, media recommendation, and human-robot interaction.
Traditional research in Speech Emotion Recognition (SER) has primarily focused on training models on isolated datasets with limited speech features. Additionally, a significant challenge in SER is the scarcity of diverse and balanced data. Our approach aims to address these limitations by combining voice samples from multiple datasets—namely IEMOCAP, RAVDESS, EMODB, and SAVEE—into a larger corpus. We also implement data augmentation techniques to enhance the dataset and improve generalization.
In contrast to the prevalent use of hand-selected Mel-Frequency Cepstral Coefficients (MFCC) features, this research explores the use of MFCC images, which provide a more comprehensive representation of the speech signal. This approach allows our deep learning model to capture richer features, leading to enhanced performance in recognizing emotions from speech.
Key Features:
- Dataset Combination: Merges datasets from IEMOCAP, RAVDESS, EMODB, and SAVEE to create a more extensive, diverse corpus.
- Data Augmentation: Applies augmentation techniques to overcome data scarcity and improve model robustness.
- MFCC Image Representation: Uses MFCC images instead of selected features, enabling the model to learn from a broader range of speech information.
- Deep Learning Model: Implements a deep learning-based architecture optimized for emotion recognition from speech.
Applications:
- Healthcare: Emotion recognition for mental health assessments, such as depression detection.
- Emergency Response: Emotion detection for improving the efficiency of emergency call centers.
- Virtual Reality: Emotion analysis to enhance user experience in virtual environments.
- Human-Robot Interaction: Enables robots to understand and respond to human emotions.
- Media Recommendation: Provides personalized media suggestions based on user emotions.
Datasets:
- IEMOCAP
- RAVDESS
- EMODB
- SAVEE
Methodology:
Data Preparation:
- Collect voice samples from multiple SER datasets to build a diverse and balanced corpus.
- Apply data augmentation techniques, such as Additive White Gaussian Noise (AWGN), to artificially expand the dataset and address data scarcity (a minimal sketch follows this list).
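As a concrete illustration, the sketch below adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio. The function name and default SNR are illustrative assumptions, not values taken from the repository:

```python
import numpy as np

def add_awgn(signal: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (dB).

    Illustrative helper; the repository's augmentation parameters may differ.
    """
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```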
MFCC Image Generation:
- Convert speech signals into MFCC images, providing a detailed representation of the voice input.
- Use these images instead of traditional hand-selected MFCC features, allowing the deep learning model to extract a richer set of acoustic patterns (see the example after this list).
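A minimal sketch of how a waveform can be rendered as an MFCC image using librosa and matplotlib; the number of coefficients and the figure size are assumptions for illustration:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_mfcc_image(wav_path: str, out_path: str, n_mfcc: int = 40) -> None:
    """Render the MFCC matrix of a wav file as an image file."""
    y, sr = librosa.load(wav_path, sr=None)       # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    fig, ax = plt.subplots(figsize=(3, 3))
    librosa.display.specshow(mfcc, sr=sr, ax=ax)  # time-coefficient heat map
    ax.set_axis_off()                             # image input only, no axes
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```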
Deep Learning Model:
- Develop a multi-step algorithm optimized for SER, using convolutional neural networks (CNNs) to process MFCC images.
- Train the model on the augmented dataset to enhance robustness and accuracy in emotion classification (an illustrative architecture follows this list).
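The compact Keras sketch below shows a CNN over MFCC images; the layer sizes, input shape, and number of emotion classes are placeholders, not the repository's exact architecture:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 3), num_classes=7):
    """Small convolutional classifier over MFCC images (illustrative sizes)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),  # regularization helps on small speech corpora
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```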
Experimentation:
- Conduct a series of experiments comparing MFCC-image input against raw audio input, each with and without Additive White Gaussian Noise (AWGN) augmentation.
- Test the effect of MFCC images, AWGN augmentation, and their combination on the performance of various deep learning models.
- Analyze the results to determine the influence of these techniques on the effectiveness of the models.
Comparison with Pre-trained Models:
- During training, the top layers of the pre-trained models are frozen so that training focuses on the remaining layers, which learn from the MFCC images.
- The pre-trained models are trained on the same dataset and compared against the proposed custom CNN model to evaluate their relative performance (a transfer-learning sketch follows this list).
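For the pre-trained comparison, a common transfer-learning pattern is sketched below: a pre-trained backbone (VGG16 here, as an assumption) is frozen and only a newly added classification head is trained on the MFCC images. The exact backbone and layer-freezing scheme used in the experiments may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_pretrained(input_shape=(128, 128, 3), num_classes=7):
    """Frozen pre-trained backbone plus a trainable classification head."""
    base = tf.keras.applications.VGG16(weights="imagenet",
                                       include_top=False,
                                       input_shape=input_shape)
    base.trainable = False  # freeze the pre-trained convolutional layers
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```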
Through a series of experiments, the following aspects were evaluated:
- The effectiveness of MFCC images in representing speech data, providing richer context than hand-selected features.
- The impact of AWGN augmentation on raw audio data and MFCC images, leading to improved model robustness against noise.
- The combination of MFCC images and AWGN augmentation, showing potential synergies that further enhance deep learning model performance.
- Pre-trained model evaluation compared against the custom CNN model to assess the benefit of MFCC image-based input and augmentation.
This research introduces a novel approach to SER by integrating MFCC image-based input and data augmentation techniques into deep learning models. Our experimental results demonstrate the effectiveness of using MFCC images, AWGN augmentation, and their combination in improving the accuracy and robustness of SER systems. Furthermore, the performance of pre-trained models was benchmarked against the proposed custom CNN model, highlighting the advantages of the custom approach. The proposed methodology offers a scalable solution to challenges such as data scarcity and limited feature selection.
Installation:
Clone the repository:
git clone https://github.com/sudhanshu-shukla-git/mfcc-image-based-ser.git
Install dependencies:
pip install -r requirements.txt
This project is licensed under the MIT License. See the LICENSE file for more details.