Home
MMSA-Feature Extraction Toolkit (MMSA-FET) extracts multimodal features for Multimodal Sentiment Analysis datasets. It integrates several commonly used tools for the visual, acoustic, and text modalities. The extracted features are compatible with the MMSA Framework and can be used with it directly. The tool can also extract features from a single video.
MMSA-Feature Extraction Toolkit is available from PyPI. Because of PyPI's package size limit, large model files are not shipped with the package. After installing, run the post-install command below to download them.
# Install package from PyPI
$ pip install MMSA-FET
# Download models & libraries from Google Drive. Use --proxy if needed.
$ python -m MSA_FET install
Note: For the OpenFaceExtractor to work on Linux, a few system-wide dependencies are needed. See Dependency Installation for more information.
The basic example below extracts features from a single video file and from a dataset folder.
Note: To extract features for a dataset, the dataset must be organized in a specific file structure and include a label.csv file. See Dataset and Structure for details. Raw video files and label files for MOSI, MOSEI and CH-SIMS can be downloaded here with code ``.
from MSA_FET import FeatureExtractionTool
# initialize with default librosa config which only extracts audio features
fet = FeatureExtractionTool("librosa")
# alternatively initialize with a custom config file
fet = FeatureExtractionTool("custom_config.json")
# extract features for single video
feature = fet.run_single("input.mp4")
print(feature)
# extract for dataset & save features to file
feature = fet.run_dataset(dataset_dir="~/MOSI", out_file="output/feature.pkl")
Here custom_config.json is the path to a custom config file; its format is described below.
For detailed usage, see APIs and Command Line Arguments.
MMSA-FET ships with a few example configs, which can be selected by name:
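The file written by run_dataset can be inspected before it is passed to a model. The sketch below assumes only that out_file is a standard Python pickle, as the .pkl extension suggests. The layout of the saved object is not described in this section, so the helpers (describe and summarize_pickle, both hypothetical names, not part of MMSA-FET) simply report whatever top-level keys, lengths, and shapes they find. Only unpickle files you produced yourself, since loading a pickle can run arbitrary code.

```python
import pickle
from pathlib import Path


def describe(value):
    """Describe a value by its shape or length if it has one, else by type name."""
    shape = getattr(value, "shape", None)  # NumPy arrays and tensors expose .shape
    if shape is not None:
        return f"{type(value).__name__} shape={tuple(shape)}"
    if isinstance(value, (list, tuple, dict, str)):
        return f"{type(value).__name__} len={len(value)}"
    return type(value).__name__


def summarize_pickle(path):
    """Load a pickle file and describe its top-level contents.

    Hypothetical helper: it makes no assumption about the keys MMSA-FET writes.
    """
    with Path(path).expanduser().open("rb") as f:
        data = pickle.load(f)
    if isinstance(data, dict):
        return {key: describe(value) for key, value in data.items()}
    return {"<root>": describe(data)}


if __name__ == "__main__":
    # Path taken from the run_dataset example above; skipped if not present.
    feature_file = Path("output/feature.pkl")
    if feature_file.exists():
        for key, summary in summarize_pickle(feature_file).items():
            print(f"{key}: {summary}")
```

Printing a summary like this first makes it easier to confirm that the expected modalities were extracted before loading the file elsewhere.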
# Each supported tool has an example config
fet = FeatureExtractionTool(config="librosa")
fet = FeatureExtractionTool(config="opensmile")
fet = FeatureExtractionTool(config="wav2vec")
fet = FeatureExtractionTool(config="openface")
fet = FeatureExtractionTool(config="mediapipe")
fet = FeatureExtractionTool(config="bert")
fet = FeatureExtractionTool(config="roberta")
For customized features, you can:
- Edit the default configs and pass a dictionary to the config parameter like the example below:
from MSA_FET import FeatureExtractionTool, get_default_config
# here we only extract audio and video features
config_a = get_default_config('opensmile')
config_v = get_default_config('openface')
# modify default config
config_a['audio']['args']['feature_level'] = 'LowLevelDescriptors'
# combine audio and video configs
config = {**config_a, **config_v}
# initialize
fet = FeatureExtractionTool(config=config)
- Provide a config JSON file. The example below extracts features for all three modalities. To extract features for fewer modalities, remove the unneeded sections from the file.
{
  "audio": {
    "tool": "librosa",
    "sample_rate": null,
    "args": {
      "mfcc": {
        "n_mfcc": 20,
        "htk": true
      },
      "rms": {},
      "zero_crossing_rate": {},
      "spectral_rolloff": {},
      "spectral_centroid": {}
    }
  },
  "video": {
    "tool": "openface",
    "fps": 25,
    "average_over": 3,
    "args": {
      "hogalign": false,
      "simalign": false,
      "nobadaligned": false,
      "landmark_2D": true,
      "landmark_3D": false,
      "pdmparams": false,
      "head_pose": true,
      "action_units": true,
      "gaze": true,
      "tracked": false
    }
  },
  "text": {
    "model": "bert",
    "device": "cpu",
    "pretrained": "models/bert_base_uncased",
    "args": {}
  }
}
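Both approaches above treat a config as a dictionary with one top-level section per modality ("audio", "video", "text"). The dictionary merge `{**config_a, **config_v}` works because the two configs use different section names; if both defined the same section, the later one would silently replace the earlier. The following sketch illustrates this with plain dictionaries and the standard library only. The helper names keep_modalities and merge_configs are hypothetical and not part of MMSA-FET, and the config is an abbreviated copy of the JSON above.

```python
import copy
import json

# Abbreviated copy of the three-modality config shown above.
FULL_CONFIG = json.loads("""
{
  "audio": {"tool": "librosa", "sample_rate": null,
            "args": {"mfcc": {"n_mfcc": 20, "htk": true}}},
  "video": {"tool": "openface", "fps": 25, "average_over": 3,
            "args": {"action_units": true}},
  "text": {"model": "bert", "device": "cpu",
           "pretrained": "models/bert_base_uncased", "args": {}}
}
""")


def keep_modalities(config, modalities):
    """Return a deep copy of config containing only the named top-level sections."""
    missing = [name for name in modalities if name not in config]
    if missing:
        raise KeyError(f"config has no section(s): {missing}")
    return {name: copy.deepcopy(config[name]) for name in modalities}


def merge_configs(*configs):
    """Combine single-modality configs, refusing to overwrite a section silently."""
    merged = {}
    for cfg in configs:
        clash = merged.keys() & cfg.keys()
        if clash:
            raise ValueError(f"section(s) defined twice: {sorted(clash)}")
        merged.update(copy.deepcopy(cfg))
    return merged


audio_only = keep_modalities(FULL_CONFIG, ["audio"])
audio_video = merge_configs(audio_only, keep_modalities(FULL_CONFIG, ["video"]))
print(json.dumps(audio_video, indent=2))
```

The resulting dictionary has the same shape as the one passed to the config parameter in the first approach, or it can be written out with json.dump and used as a config file.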
- Librosa (link)
  Supports all librosa features listed here, including mfcc, rms, zero_crossing_rate, spectral_rolloff, and spectral_centroid. Detailed configurations can be found here.
- openSMILE (link)
  Supports all feature sets listed here, including ComParE_2016, GeMAPS, eGeMAPS, and emobase. Detailed configurations can be found here.
- Wav2vec2 (link)
  Integrated from Hugging Face Transformers. Detailed configurations can be found here.
- OpenFace (link)
  Supports all features in OpenFace's FeatureExtraction binary, including 2D and 3D facial landmarks, head pose, gaze-related features, facial action units, and HOG binary files. Details of these features can be found in the OpenFace Wiki here and here. Detailed configurations can be found here.
- MediaPipe (link)
  Supports the face mesh and holistic (face, hand, pose) solutions. Detailed configurations can be found here.
- TalkNet (link)
  Used for Active Speaker Detection when a video contains multiple human faces.