We introduce a multimodal emotion dataset comprising 30-channel electroencephalography (EEG), audio, and video recordings from 42 participants. Each participant engaged in a cue-based conversation scenario that elicited five distinct emotions: neutral (N), anger (A), happiness (H), sadness (S), and calmness (C).
Participants engaged in paired listen/speak sets with recordings of an experienced actor while seated in front of a 27-inch monitor displaying the visual stimuli. The experiment follows a pseudo-random class-iteration sequence: [A, A, C, C, S, S, H, A, C, H, H, S, S, A, A, C, C, H, H, S]. Each participant contributed 200 interactions (5 emotion classes × 2 tasks × 20 iterations), for a cumulative total of 8,400 interactions across all 42 participants. Please refer to the paper for more details: https://www.nature.com/articles/s41597-024-03838-4
Each 'Video' subfolder contains 200 video clips; each clip is 20 seconds long at 30 fps and corresponds to either a 'listening' or a 'speaking' task. The video data adopts the structure [5 emotion classes × 2 tasks × 20 iterations].
File format: .mp4
Baseline performance of DeepFace: Mean ACC = 52.8 %, Mean F1-score = 51.5 %
Each 'Audio' subfolder contains 100 audio files; each file is 20 seconds long and covers only the 'speaking' task. The audio data adopts the structure [5 classes × 1 task ('speaking') × 20 conversations].
File format: .wav
Baseline performance of SCNN: Mean ACC = 36.7 %, Mean F1-score = 34.1 %
Each 'EEG' subfolder contains 2 EEG data files. Each instance is 20 seconds long with an initial sampling rate of 500 Hz. Because recording was continuous, the processed EEG data adopts the structure [200 instances × 10,000 time points (20 s × 500 Hz) × 30 channels]. The labels use a one-hot encoding, structured as 200 trials × 10 classes (5 emotions × 2 tasks).
File format: .mat
Baseline performance of EEGNet: Mean ACC = 36.7 %, Mean F1-score = 34.1 %
Note that the label information can be applied across all modalities since all recordings, regardless of the modality, were conducted synchronously. This ensures uniform annotations throughout the dataset.
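As a quick illustration, one subject's processed EEG file can be inspected with SciPy. This is only a sketch: the file path and the variable keys (`'eeg'`, `'label'`) are assumptions and may not match the actual field names inside the .mat files.

```python
# Minimal inspection sketch; the path and .mat keys below are assumed, not verified.
import numpy as np
from scipy.io import loadmat

mat = loadmat(r'D:\EAV\subject_01\EEG\eeg_data.mat')  # hypothetical file path
eeg = mat['eeg']       # assumed key: (200, 10000, 30) = trials x time points x channels
labels = mat['label']  # assumed key: (200, 10) one-hot labels (5 emotions x 2 tasks)

class_idx = np.argmax(labels, axis=1)  # convert one-hot rows to class indices 0..9
print(eeg.shape, labels.shape, np.bincount(class_idx))
```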
- conda environment
  ```
  conda create --name eav python=3.10
  conda activate eav
  ```
- Clone the repo
  ```
  git clone https://github.com/nubcico/EAV.git
  cd EAV
  ```
- Install requirements
  ```
  pip install -r requirements.txt
  ```
The raw dataset, along with the pre-extracted features, can be accessed and downloaded from Zenodo.
After downloading the dataset, decide whether to use the raw dataset or the pre-extracted features, as this choice determines the subsequent steps.
If you opt for the raw dataset, only a minor modification is required in the `Dataload_Audio.py` file: set the `parent_directory` parameter of the `DataLoadAudio` class to the location of the "EAV" folder on your system.
Using the raw dataset lets you customize the training/testing split ratio through the `EAVDataSplit` class. In our case, we employed a 70/30 split, which corresponds to `h_idx = 56`. If `x` is your desired training-data percentage (e.g., `x = 70`), `h_idx` can be calculated using the formula `h_idx = (x * 80) / 100`.
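As a worked example of this formula, a tiny helper is shown below; the constant 80 is taken directly from the formula above (interpreting it as the number of trials per class is an assumption).

```python
# Helper mirroring the README formula: h_idx = (x * 80) / 100
def h_index(train_percent: int) -> int:
    # 80 is the constant from the formula above (assumed trials per class)
    return (train_percent * 80) // 100

print(h_index(70))  # -> 56, the 70/30 split used in the example
```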
If you decide to work with the pre-extracted features, modify the code as follows: comment out the lines currently used for the raw dataset (the block ending with the `aud_list.get_split` call), then uncomment the section for the pre-extracted features. Additionally, set the `direct` variable to the path containing the "Audio" directory on your system.
```python
aud_loader = DataLoadAudio(subject=sub, parent_directory=r'D:\EAV')
[data_aud, data_aud_y] = aud_loader.process()
aud_list = EAVDataSplit(data_aud, data_aud_y)
[tr_x_aud, tr_y_aud, te_x_aud, te_y_aud] = aud_list.get_split(h_idx=56)

# direct = r"D:\EAV\Inputs\Audio"
# file_name = f"subject_{sub:02d}_aud.pkl"
# file_ = os.path.join(direct, file_name)
# with open(file_, 'rb') as f:
#     aud_list = pickle.load(f)
# tr_x_aud, tr_y_aud, te_x_aud, te_y_aud = aud_list

data = [tr_x_aud, tr_y_aud, te_x_aud, te_y_aud]
```
The same adjustments should be applied for each modality.
For the classification of the audio modality, we employ the Audio Spectrogram Transformer (AST) model pretrained on the AudioSet dataset (the pre-trained model can be downloaded from Hugging Face), which we will subsequently fine-tune on our specific dataset, as implemented in the 'Dataload_audio.py' and 'Transformer_torch/Transformer_Audio.py' files.
```python
from Transformer_torch import Transformer_Audio
...
mod_path = os.path.join(os.getcwd(), 'ast-finetuned-audioset')
Trainer = Transformer_Audio.AudioModelTrainer(data, model_path=mod_path, sub=f"subject_{sub:02d}",
                                              num_classes=5, weight_decay=1e-5, lr=0.005, batch_size=8)
Trainer.train(epochs=10, lr=5e-4, freeze=True)
Trainer.train(epochs=15, lr=5e-6, freeze=False)
test_acc.append(Trainer.outputs_test)
```
The 'AudioModelTrainer' class is designed to train and fine-tune this model effectively. It leverages PyTorch and the Hugging Face Transformers library to adapt the AST model for the emotion classification task.
```python
from transformers import AutoModelForAudioClassification
...
class AudioModelTrainer:
    def __init__(self, DATA, model_path, sub='', num_classes=5, weight_decay=1e-5, lr=0.001, batch_size=128):
        ...
        self.model = AutoModelForAudioClassification.from_pretrained(model_path)
```
For the video and EEG modalities, the framework allows the choice between Transformer-based and CNN-based models. Specifically, for video, we utilize the Vision Transformer model, which is pretrained on the facial_emotions_image_detection dataset (the pre-trained model can be downloaded from Hugging Face). The following example from the 'Dataload_vision.py' file illustrates both options:
```python
# Transformer for Vision
from Transformer_torch import Transformer_Vision

mod_path = os.path.join('C:\\Users\\minho.lee\\Dropbox\\Projects\\EAV', 'facial_emotions_image_detection')
trainer = Transformer_Vision.ImageClassifierTrainer(data,
                                                    model_path=mod_path, sub=f"subject_{sub:02d}",
                                                    num_labels=5, lr=5e-5, batch_size=128)
trainer.train(epochs=10, lr=5e-4, freeze=True)
trainer.train(epochs=5, lr=5e-6, freeze=False)
trainer.outputs_test
```
Alternatively, the CNN-based model can be utilized as follows:
```python
# CNN for Vision
from CNN_torch.CNN_Vision import ImageClassifierTrainer

trainer = ImageClassifierTrainer(data, num_labels=5, lr=5e-5, batch_size=32)
trainer.train(epochs=3, lr=5e-4, freeze=True)
trainer.train(epochs=3, lr=5e-6, freeze=False)
trainer._delete_dataloader()
trainer.outputs_test
```
The same approach can be applied for the EEG modality, providing flexibility in choosing between Transformer and CNN architectures based on the requirements of the task.
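For illustration only, an EEG training loop would mirror the vision example above. The module and class names below (`Transformer_EEG`, `EEGClassifierTrainer`) are placeholders for this sketch and may not match the actual file or class names in the repository.

```python
# Hypothetical sketch: names are placeholders mirroring the vision example,
# not verified against the repository.
from Transformer_torch import Transformer_EEG  # placeholder module name

trainer = Transformer_EEG.EEGClassifierTrainer(data, num_labels=5, lr=5e-5, batch_size=32)  # placeholder class
trainer.train(epochs=10, lr=5e-4, freeze=True)
trainer.train(epochs=5, lr=5e-6, freeze=False)
trainer.outputs_test
```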
To execute the program from the command line, run the corresponding data-loading script for each modality, e.g. for audio:
```
python Dataload_audio.py
```
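The vision pipeline mentioned above can be launched the same way (an EEG counterpart script is not named in this document):

```
python Dataload_vision.py
```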
The data preprocessing for audio, EEG, and video modalities is designed to prepare the raw data for emotion classification. Each modality follows its own distinct workflow (a standalone sketch of the key preprocessing steps follows the list below):
- Audio Modality: The `DataLoadAudio` class handles audio data processing:
  - Data File Loading: The `data_files()` method retrieves audio file paths and their corresponding emotion labels.
  - Feature Extraction: The `feature_extraction()` method loads the audio files, resamples them to a target rate, and segments the audio into 5-second clips. Each segment is labeled according to the associated emotion.
  - Label Encoding: Emotion labels are converted to numerical indices for model compatibility.
  - Processing Coordination: The `process()` method orchestrates these steps, returning the extracted features and labels.

  ```python
  class DataLoadAudio:
      def process(self):
          self.data_files()          # Load audio file paths and labels
          self.feature_extraction()  # Extract audio features
          return self.feature, self.label_indexes  # Return features and labels
  ```
- EEG Modality: The `DataLoadEEG` class manages EEG data processing:
  - Data File Loading: The `data_mat()` method loads EEG data and labels from MAT files.
  - Downsampling: The `downsampling()` method reduces the sampling frequency to a target rate.
  - Bandpass Filtering: The `bandpass()` method applies a bandpass filter to retain frequencies of interest.
  - Segmentation: The `data_div()` method divides the data into smaller segments for analysis.
  - Data Preparation: The `data_prepare()` method coordinates the above steps, returning the processed EEG segments and labels.

  ```python
  class DataLoadEEG:
      def data_prepare(self):
          self.data_mat()      # Load EEG data and labels
          self.downsampling()  # Downsample the data
          self.bandpass()      # Apply bandpass filtering
          self.data_div()      # Divide the data into segments
          return self.seg_f_div, self.label_div  # Return filtered segments and labels
  ```
- Video Modality: The `DataLoadVision` class processes video data:
  - Data File Loading: The `data_files()` method gathers video file paths and emotion labels.
  - Face Detection and Frame Extraction: The `data_load()` method captures frames from each video, using face detection to align faces if enabled. It collects segments of 25 frames, each representing a 5-second interval.
  - Label Encoding: Emotion labels are converted to numerical indices for consistency.
  - Data Preparation: The `process()` method manages these tasks, returning the segmented images and their corresponding labels.

  ```python
  class DataLoadVision:
      def process(self):
          self.data_files()  # Load video file paths and labels
          self.data_load()   # Extract and process frames from videos
          return self.images, self.image_label_idx  # Return processed image segments and labels
  ```
This structured approach ensures that each modality's data is appropriately preprocessed, facilitating effective training and evaluation of the emotion classification model.
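To make these steps concrete, the sketch below strings together the core operations described above: 5-second audio segmentation, EEG downsampling plus bandpass filtering, and ~5 fps frame extraction for video. It uses generic librosa/SciPy/OpenCV calls with assumed parameter values (16 kHz audio rate, 100 Hz EEG target rate, 0.5-45 Hz band) rather than the exact settings of the EAV loaders.

```python
"""Standalone preprocessing sketch; parameter values are assumptions, not the EAV defaults."""
import cv2                    # frame extraction
import librosa                # audio loading / resampling
import numpy as np
from scipy.signal import butter, decimate, sosfiltfilt


def segment_audio(path, target_sr=16000, seg_sec=5):
    """Load one 20 s clip, resample it, and cut it into 5 s segments."""
    wav, _ = librosa.load(path, sr=target_sr)         # resample while loading
    seg_len = target_sr * seg_sec
    n_seg = len(wav) // seg_len                       # 4 segments for a 20 s clip
    return wav[:n_seg * seg_len].reshape(n_seg, seg_len)


def preprocess_eeg(eeg, fs=500, target_fs=100, band=(0.5, 45.0), seg_sec=5):
    """Downsample and bandpass-filter EEG of shape (trials, time, channels), then split each trial."""
    eeg_ds = decimate(eeg, q=fs // target_fs, axis=1)                 # 500 Hz -> 100 Hz
    sos = butter(4, band, btype='bandpass', fs=target_fs, output='sos')
    eeg_f = sosfiltfilt(sos, eeg_ds, axis=1)                          # zero-phase bandpass
    seg_len = target_fs * seg_sec
    n_seg = eeg_f.shape[1] // seg_len                                 # 4 segments per 20 s trial
    return eeg_f[:, :n_seg * seg_len, :].reshape(-1, seg_len, eeg_f.shape[2])


def sample_frames(path, out_fps=5):
    """Grab frames at roughly 5 fps from a 30 fps clip (25 frames per 5 s interval)."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(src_fps / out_fps)), 1)                      # keep every 6th frame at 30 fps
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return np.array(frames)
```

In the actual loaders these operations live inside `DataLoadAudio`, `DataLoadEEG`, and `DataLoadVision`, with face detection and alignment applied on top of the frame sampling for video.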
- CNN-based Emotion recognition on Video, Audio and EEG domains using TensorFlow
- CNN-based Emotion recognition on Video and EEG domains using PyTorch
- Transformer-based Emotion recognition on Video, Audio and EEG domains using PyTorch
- Create demo file
- Add .pkl files of preprocessed video data (Feature_vision folder)
- Add inference files
Minho Lee - [email protected]
Adai Shomanov - [email protected]
Zhuldyz Kabidenova - [email protected]
Adnan Yazici - [email protected]
Distributed under the MIT License. See LICENSE.txt for more information.