DCASE2016 Baseline system

Audio Research Group / Tampere University of Technology

Python implementation

Systems:

Task 1 - Acoustic scene classification
Task 3 - Sound event detection in real life audio

Authors

Toni Heittola ([email protected], http://www.cs.tut.fi/~heittolt/)
Annamaria Mesaros ([email protected], http://www.cs.tut.fi/~mesaros/)
Tuomas Virtanen ([email protected], http://www.cs.tut.fi/~tuomasv/)

Introduction
Installation
Usage
System blocks
System evaluation
System parameters
Changelog
License
Introduction =============== This document describes the Python implementation of the baseline systems for the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE2016) challenge tasks 1 and task 3. The challenge consists of four tasks:
Acoustic scene classification
Sound event detection in synthetic audio
Sound event detection in real life audio
Domestic audio tagging

The baseline systems for task 1 and 3 shares the same basic approach: MFCC based acoustic features and GMM based classifier. The main motivation to have similar approaches for both tasks was to provide low entry level and allow easy switching between the tasks.

The dataset handling is hidden behind dataset access class, which should help DCASE challenge participants implementing their own systems.

The Matlab implementation is also available.

1.1. Acoustic scene classification

The acoustic features include MFCC static coefficients (with 0th coefficient), delta coefficients and acceleration coefficients. The system learns one acoustic model per acoustic scene class, and does the classification with maximum likelihood classification scheme.

1.2. Sound event detection

The acoustic features include MFCC static coefficients (0th coefficient omitted), delta coefficients and acceleration coefficients. The system has a binary classifier for each sound event class included. For the classifier, two acoustic models are trained from the mixture signals: one with positive examples (target sound event active) and one with negative examples (target sound event non-active). The classification is done between these two models as likelihood ratio. Post-processing is applied to get sound event detection output.

Installation ===============

The systems are developed for Python 2.7.0. Currently, the baseline system is tested only with Linux operating system.

Run to ensure that all external modules are installed

pip install -r requirements.txt

External modules required

numpy, scipy, scikit-learn pip install numpy scipy scikit-learn

Scikit-learn (version >= 0.16) is required for the machine learning implementations.

PyYAML pip install pyyaml

PyYAML is required for handling the configuration files.

librosa pip install librosa

Librosa is required for the feature extraction.

Usage ========

For each task there is separate executable (.py file):

task1_scene_classification.py, Acoustic scene classification
task3_sound_event_detection_in_real_life_audio.py, Real life audio sound event detection

Each system has two operating modes: Development mode and Challenge mode.

All the usage parameters are shown by python task1_scene_classification.py -h and python task3_sound_event_detection_in_real_life_audio.py -h

The system parameters are defined in task1_scene_classification.yaml and task3_sound_event_detection_in_real_life_audio.yaml.

With default parameter settings, the system will download needed dataset from Internet and extract it under directory data (storage path is controlled with parameter path->data).

Development mode

In this mode, the system is trained and evaluated with the development dataset. This is the default operating mode.

To run the system in this mode: python task1_scene_classification.py or python task1_scene_classification.py -development.

Challenge mode

In this mode, the system is trained with the provided development dataset and the evaluation dataset is run through the developed system. Output files are generated in correct format for the challenge submission. The system ouput is saved in the path specified with the parameter: path->challenge_results.

To run the system in this mode: python task1_scene_classification.py -challenge.

System blocks ================

The system implements following blocks:

Dataset initialization

Downloads the dataset from the Internet if needed
Extracts the dataset package if needed
Makes sure that the meta files are appropriately formated

Feature extraction (do_feature_extraction)

Goes through all the training material and extracts the acoustic features
Features are stored file-by-file on the local disk (pickle files)

Feature normalization (do_feature_normalization)

Goes through the training material in evaluation folds, and calculates global mean and std of the data.
Stores the normalization factors (pickle files)

System training (do_system_training)

Trains the system
Stores the trained models and feature normalization factors together on the local disk (pickle files)

System testing (do_system_testing)

Goes through the testing material and does the classification / detection
Stores the results (text files)

System evaluation (do_system_evaluation)

Reads the ground truth and the output of the system and calculates evaluation metrics

System evaluation ====================

Task 1 - Acoustic scene classification

Metrics

The scoring of acoustic scene classification will be based on classification accuracy: the number of correctly classified segments among the total number of segments. Each segment is considered an independent test sample.

Results

TUT Acoustic scenes 2016, development set

Dataset

Evaluation setup

4 cross-validation folds, average classification accuracy over folds
15 acoustic scene classes
Classification unit: one file (30 seconds of audio).

System parameters

Frame size: 40 ms (with 50% hop size)
Number of Gaussians per acoustic scene class model: 16
Feature vector: 20 MFCC static coefficients (including 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values
Trained and tested on full audio

Scene	Accuracy
Beach	63.3 %
Bus	79.6 %
Cafe/restaurant	83.2 %
Car	87.2 %
City center	85.5 %
Forest path	81.0 %
Grocery store	65.0 %
Home	82.1 %
Library	50.4 %
Metro station	94.7 %
Office	98.6 %
Park	13.9 %
Residential area	77.7 %
Train	33.6 %
Tram	85.4 %
Overall accuracy	72.5 %

DCASE 2013 Scene classification, development set

Dataset

Evaluation setup

- 5 fold average

10 acoustic scene classes
Classification unit: one file (30 seconds of audio).

System parameters

Frame size: 40 ms (with 50% hop size)
Number of Gaussians per acoustic scene class model: 16
Feature vector: 20 MFCC static coefficients (including 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values

Scene	Accuracy
Bus	93.3 %
Busy street	80.0 %
Office	86.7 %
Open air market	73.3 %
Park	26.7 %
Quiet street	53.3 %
Restaurant	40.0 %
Supermarket	26.7 %
Tube	66.7 %
Tube station	53.3 %
Overall accuracy	60.0 %

Task 3 - Real life audio sound event detection

Metrics

Segment-based metrics

Segment based evaluation is done in a fixed time grid, using segments of one second length to compare the ground truth and the system output.

Total error rate (ER) is the main metric for this task. Error rate as defined in Poliner2007 will be evaluated in one-second segments over the entire test set.
F-score is calculated over all test data based on the total number of false positive, false negatives and true positives.

Event-based metrics

Event-based evaluation considers true positives, false positives and false negatives with respect to event instances.

Definition: An event in the system output is considered correctly detected if its temporal position is overlapping with the temporal position of an event with the same label in the ground truth. A tolerance is allowed for the onset and offset (200 ms for onset and 200 ms or half length for offset)

Error rate calculated as described in Poliner2007 over all test data based on the total number of insertions, deletions and substitutions.
F-score is calculated over all test data based on the total number of false positive, false negatives and true positives.

Detailed description of metrics can be found from DCASE2016 website.

Results

TUT Sound events 2016, development set

Dataset

Evaluation setup

4 cross-validation folds

System parameters

Frame size: 40 ms (with 50% hop size)
Number of Gaussians per sound event model (positive and negative): 16
Feature vector: 20 MFCC static coefficients (excluding 0th) + 20 delta MFCC coefficients + 20 acceleration MFCC coefficients = 60 values
Decision_threshold: 140

Segment based metrics - overall

Scene	ER	ER / S	ER / D	ER / I	F1
Home	0.96	0.08	0.82	0.06	15.9 %
Residential area	0.86	0.05	0.74	0.07	31.5 %
Average	0.91				23.7 %

Segment based metrics - class-wise

Scene	ER	F1
Home	1.06	9.2 %
Residential area	1.03	17.6 %
Average	1.04	13.4 %

Event based metrics (onset-only) - overall

Scene	ER	F1
Home	1.28	4.7 %
Residential area	1.92	2.9 %
Average	1.60	3.8 %

Event based metrics (onset-only) - class-wise

Scene	ER	F1
Home	1.27	4.3 %
Residential area	1.97	1.5 %
Average	1.62	2.9 %

System parameters ==================== All the parameters are set in task1_scene_classification.yaml, and task3_sound_event_detection_in_real_life_audio.yaml.

Controlling the system flow

The blocks of the system can be controlled through the configuration file. Usually all of them can be kept on.

flow:
  initialize: true
  extract_features: true
  feature_normalizer: true
  train_system: true
  test_system: true
  evaluate_system: true

General parameters

The selection of used dataset.

general:
  development_dataset: TUTSoundEvents_2016_DevelopmentSet
  challenge_dataset: TUTSoundEvents_2016_EvaluationSet

  overwrite: false                                          # Overwrite previously stored data

development_dataset: TUTSoundEvents_2016_DevelopmentSet : The dataset handler class used while running the system in development mode. If one wants to handle a new dataset, inherit a new class from the Dataset class (src/dataset.py).

challenge_dataset: TUTSoundEvents_2016_EvaluationSet : The dataset handler class used while running the system in challenge mode. If one wants to handle a new dataset, inherit a new class from the Dataset class (src/dataset.py).

Available dataset handler classes:

DCASE 2016

TUTAcousticScenes_2016_DevelopmentSet
TUTAcousticScenes_2016_EvaluationSet
TUTSoundEvents_2016_DevelopmentSet
TUTSoundEvents_2016_EvaluationSet

DCASE 2013

DCASE2013_Scene_DevelopmentSet
DCASE2013_Scene_EvaluationSet
DCASE2013_Event_DevelopmentSet
DCASE2013_Event_EvaluationSet

overwrite: false : Switch to allow the system always to overwrite existing data on disk.

challenge_submission_mode: false : Switch to control where system output is saved. If true, path->challenge_results used, and all results are overwritten by default.

System paths

This section contains the storage paths.

path:
  data: data/

  base: system/baseline_dcase2016_task1/
  features: features/
  feature_normalizers: feature_normalizers/
  models: acoustic_models/
  results: evaluation_results/

  challenge_results: challenge_submission/task_1_acoustic_scene_classification/

These parameters defines the folder-structure to store acoustic features, feature normalization data, acoustic models and evaluation results.

data: data/ : Defines the path where the dataset data is downloaded and stored. Path is relative to the main script.

base: system/baseline_dcase2016_task1/ : Defines the base path where the system stores the data. Other paths are stored under this path. If specified directory does not exist it is created. Path is relative to the main script.

challenge_results: challenge_submission/task_1_acoustic_scene_classification/ : Defines where the system output is stored while running the system in challenge mode.

Feature extraction

This section contains the feature extraction related parameters.

features:
  fs: 44100
  win_length_seconds: 0.04
  hop_length_seconds: 0.02

  include_mfcc0: true           #
  include_delta: true           #
  include_acceleration: true    #

  mfcc:
    window: hamming_asymmetric  # [hann_asymmetric, hamming_asymmetric]
    n_mfcc: 20                  # Number of MFCC coefficients
    n_mels: 40                  # Number of MEL bands used
    n_fft: 2048                 # FFT length
    fmin: 0                     # Minimum frequency when constructing MEL bands
    fmax: 22050                 # Maximum frequency when constructing MEL band
    htk: false                  # Switch for HTK-styled MEL-frequency equation

  mfcc_delta:
    width: 9

  mfcc_acceleration:
    width: 9

fs: 44100 : Default sampling frequency. If given dataset does not fulfill this criteria the audio data is resampled.

win_length_seconds: 0.04 : Feature extraction frame length in seconds.

hop_length_seconds: 0.02 : Feature extraction frame hop-length in seconds.

include_mfcc0: true : Switch to include zeroth coefficient of static MFCC in the feature vector

include_delta: true : Switch to include delta coefficients to feature vector. Zeroth MFCC is always included in the delta coefficients. The width of delta-window is set in mfcc_delta->width: 9

include_acceleration: true : Switch to include acceleration (delta-delta) coefficients to feature vector. Zeroth MFCC is always included in the delta coefficients. The width of acceleration-window is set in mfcc_acceleration->width: 9

mfcc->n_mfcc: 16 : Number of MFCC coefficients

mfcc->fmax: 22050 : Maximum frequency for MEL band. Usually, this is set to a half of the sampling frequency.

Classifier

This section contains the frame classifier related parameters. These parameters are used when chosen classifier is trained.

classifier:
  method: gmm                   # The system supports only gmm

  audio_error_handling:         # Handling audio errors (temporary microphone failure and radio signal interferences from mobile phones)
    clean_data: false           # Exclude audio errors from training audio
  
  parameters: !!null            # Parameters are copied from classifier_parameters based on defined method

classifier_parameters:
  gmm:
    n_components: 16            # Number of Gaussian components
    covariance_type: diag       # Diagonal or full covariance matrix
    random_state: 0
    thresh: !!null
    tol: 0.001
    min_covar: 0.001
    n_iter: 40
    n_init: 1
    params: wmc
    init_params: wmc

audio_error_handling->clean_data: false : Some datasets provide audio error annotations. With this switch these annotations can be used to exclude the segments containing audio errors from the feature matrix fed to the classifier during training. Audio errors can be temporary microphone failure or radio signal interferences from mobile phones.

classifier_parameters->gmm->n_components: 16 : Number of Gaussians used in the modeling.

In order to add new classifiers to the system, add parameters under classifier_parameters with new tag. Set classifier->method and add appropriate code where classifier_method variable is used system block API (look into do_system_training and do_system_testing methods). In addition to this, one might want to modify filename methods (get_model_filename and get_result_filename) to allow multiple classifier methods co-exist in the system.

Recognizer

This section contains the sound recognition related parameters (used in task1_scene_classification.py).

recognizer:
  audio_error_handling:         # Handling audio errors (temporary microphone failure and radio signal interferences from mobile phones)
    clean_data: false           # Exclude audio errors from test audio

audio_error_handling->clean_data: false : Some datasets provide audio error annotations. With this switch these annotations can be used to exclude the segments containing audio errors from the feature matrix fed to the recognizer. Audio errors can be temporary microphone failure or radio signal interferences from mobile phones.

Detector

This section contains the sound event detection related parameters (used in task3_sound_event_detection_in_real_life_audio.py).

detector:
  decision_threshold: 140.0
  smoothing_window_length: 1.0  # seconds
  minimum_event_length: 0.1     # seconds
  minimum_event_gap: 0.1        # seconds

decision_threshold: 140.0 : Decision threshold used to do final classification. This can be used to control the sensitivity of the system. With log-likelihoods: event_activity = (positive - negative) > decision_threshold

smoothing_window_length: 1.0 : Size of sliding accumulation window (in seconds) used before frame-wise classification decision

minimum_event_length: 0.1 : Minimum length (in seconds) of outputted events. Events with shorter length than given are filtered out from the system output.

minimum_event_gap: 0.1 : Minimum gap (in seconds) between events from same event class in the output. Consecutive events (event with same event label) having shorter gaps between them than set parameter are merged together.

Changelog ============

1.2 / 2016-11-10

Added evaluation in challenge mode for task 1

1.1 / 2016-05-19

Added audio error handling

1.0 / 2016-02-08

Initial commit

License ==========

See file EULA.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DCASE2016 Baseline system

Table of Contents

1.1. Acoustic scene classification

1.2. Sound event detection

Development mode

Challenge mode

Task 1 - Acoustic scene classification

Metrics

Results

TUT Acoustic scenes 2016, development set

DCASE 2013 Scene classification, development set

Task 3 - Real life audio sound event detection

Metrics

Results

TUT Sound events 2016, development set

1.2 / 2016-11-10

1.1 / 2016-05-19

1.0 / 2016-02-08

About

Releases 8

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
src		src
.gitignore		.gitignore
EULA.pdf		EULA.pdf
README.html		README.html
README.md		README.md
requirements.txt		requirements.txt
task1_scene_classification.py		task1_scene_classification.py
task1_scene_classification.yaml		task1_scene_classification.yaml
task3_sound_event_detection_in_real_life_audio.py		task3_sound_event_detection_in_real_life_audio.py
task3_sound_event_detection_in_real_life_audio.yaml		task3_sound_event_detection_in_real_life_audio.yaml

TUT-ARG/DCASE2016-baseline-system-python

Folders and files

Latest commit

History

Repository files navigation

DCASE2016 Baseline system

Table of Contents

1.1. Acoustic scene classification

1.2. Sound event detection

Development mode

Challenge mode

Task 1 - Acoustic scene classification

Metrics

Results

TUT Acoustic scenes 2016, development set

DCASE 2013 Scene classification, development set

Task 3 - Real life audio sound event detection

Metrics

Results

TUT Sound events 2016, development set

1.2 / 2016-11-10

1.1 / 2016-05-19

1.0 / 2016-02-08

About

Topics

Resources

Stars

Watchers

Forks

Releases 8

Packages 0

Contributors 2

Languages

Packages