comparison of MFCC computation between librosa and essentia, for acoustic scene classification #525

Open
edufonseca opened this issue Dec 2, 2016 · 7 comments

@edufonseca

I compared the MFCC computation in librosa and essentia using data from the DCASE 2016 challenge and its baseline system (MFCC+GMM) for Task 1 - Acoustic scene classification.

Procedure:

  1. Matched the common input parameters in both libraries and used the same signal framing.
  2. Edited minor differences in librosa (20log10 and truncation of the lowest amplitude values) so that both libraries treat amplitude the same way.
  3. Did not look into the filterbank (in theory, both are based on Slaney’s).
  4. Essentia-specific parameters in the Windowing algorithm: zero-phase windowing disabled and normalization left at True (the default). Hence, window normalization appears to be the only major difference between the two computations, at least to the best of my knowledge.
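
As a rough illustration of step 4, the following NumPy sketch compares a plain Hamming window with a normalized one. It is not code from either library. It assumes that normalization rescales the window to unit area and then multiplies it by 2, which is how the essentia Windowing documentation describes `normalized=True`. `normalize_window` is a hypothetical helper, and `np.hamming` stands in for the Hamming windows of both libraries, which may differ slightly in practice.

```python
import numpy as np

def normalize_window(w):
    """Scale a window to unit area, then multiply by 2.

    Assumption: this mirrors what essentia's Windowing does when
    normalized=True. It is a sketch, not essentia's implementation.
    """
    return 2.0 * w / np.sum(w)

frame_size = 1024
rng = np.random.default_rng(0)
frame = rng.standard_normal(frame_size)  # synthetic frame, not DCASE audio

w_plain = np.hamming(frame_size)      # unnormalized window
w_norm = normalize_window(w_plain)    # normalized window (assumed behaviour)

mag_plain = np.abs(np.fft.rfft(frame * w_plain))
mag_norm = np.abs(np.fft.rfft(frame * w_norm))

# Normalization multiplies every spectral magnitude by the same factor,
# which becomes a constant offset in the 20*log10 amplitude domain.
scale = 2.0 / np.sum(w_plain)
offset_db = 20.0 * np.log10(scale)

print("magnitudes differ by a constant factor:",
      np.allclose(mag_norm, scale * mag_plain))
print("offset in dB:", offset_db)
```

For a 1024-sample Hamming window the offset is about -49 dB. In an idealized pipeline a constant offset in the log domain would only shift the first MFCC, so it becomes relevant mainly when low values are truncated or thresholded.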

I ran two simulations for Task 1 - Acoustic scene classification, with and without window normalization.
The difference in classification accuracy between the librosa-based and essentia-based systems was:

  • Normalized = True -> accuracy difference ~ 6 % (librosa based system performs better)
  • Normalized = False -> accuracy difference ~ ±0.3 %

The next plot shows the Hamming window used in librosa and in essentia (Normalized = True). Note the bottom of the plot.
[Image: hamming_ess_zpwoff_normon]

The next two plots show the mean and standard deviation of the MFCCs computed over 1500 frames of the same audio file, for librosa and essentia. Top: with window normalization. Bottom: without window normalization.
[Image: mfcc_file7_small]

[Image: mfcc_file7_small_nfalse]

Comment:
This occurs for this particular scenario, audio content (soundscapes) and classifier (GMM). Would something similar happen in a different scenario?

@dbogdanov dbogdanov added this to the 2.1 milestone Dec 4, 2016
@dbogdanov
Member

I've created a separate issue, #543, concerning changes in MFCC values due to signal level. Normalized windowing will further contribute to this problem by making the mel energy values even smaller.
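
A small sketch of why smaller mel energies could matter, under stated assumptions: the band energies, the scale factor and the floor below are all made-up values, and `dct2` and `log_energies_db` are hypothetical helpers rather than essentia or librosa functions. Without a floor, scaling the energies only changes the first cepstral coefficient. Once a floor clips the quietest bands, the other coefficients change too.

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II, written out so the sketch needs only NumPy."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def log_energies_db(energies, floor):
    """Log band energies in dB, with values below `floor` clipped to it."""
    return 10.0 * np.log10(np.maximum(energies, floor))

rng = np.random.default_rng(0)
band_energies = 10.0 ** rng.uniform(-6, 0, size=40)  # hypothetical mel energies
scale = 1e-5        # hypothetical energy scaling caused by a normalized window
tiny_floor = 1e-30  # effectively no thresholding
floor = 1e-9        # hypothetical silence threshold

c_ref = dct2(log_energies_db(band_energies, tiny_floor))
c_scaled = dct2(log_energies_db(band_energies * scale, tiny_floor))
c_clipped = dct2(log_energies_db(band_energies * scale, floor))

# Pure scaling adds a constant in the log domain: only coefficient 0 moves.
print("no floor, higher coefficients unchanged:",
      np.allclose(c_ref[1:], c_scaled[1:]))
# With the floor active, clipped bands distort the higher coefficients too.
print("with floor, higher coefficients unchanged:",
      np.allclose(c_ref[1:], c_clipped[1:]))
```

Whether this is what drives the accuracy difference reported above is not established here. The sketch only shows that a level change and a threshold can interact in this way.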

@dbogdanov
Member

We might want to change normalized to False by default.

@dbogdanov
Member

@edufonseca Do you still have your scripts, so that we can evaluate the accuracy difference with normalized windows again? (As we lowered the silence threshold in #543, maybe the normalization is not a problem any more.)

@ChenJunHero

ChenJunHero commented Jul 24, 2018

I also found large differences in the spectrum amplitude matrix between essentia and librosa. I suspect they are caused by "Padding" and "StartFromZero". I will try to get the formant frequencies and trace the difference in the results.

@sildeag

sildeag commented Jul 24, 2018

@edufonseca Did you compare with any other apps, e.g. OpenSmile?

@edufonseca
Author

edufonseca commented Jul 25, 2018 via email

@dbogdanov
Member

One of the main differences with librosa is in the silence threshold. We have made some updates related to that in the mfcc_thresholding branch, but it's not merged yet. You can try to compare with MFCCs computed using that branch.
