Eesen

Eesen is a toolkit for building automatic speech recognition (ASR) systems in a completely end-to-end fashion. The goal of Eesen is to simplify the existing complicated, expertise-intensive ASR pipeline into a straightforward learning problem. Acoustic modeling in Eesen amounts to training a single recurrent neural network (RNN) that models the sequence-to-sequence mapping from speech to transcripts. Eesen discards the following elements required by the existing ASR pipeline:

  • Hidden Markov models (HMMs)
  • Gaussian mixture models (GMMs)
  • Decision trees and phonetic questions
  • Dictionary, if characters are used as the modeling units
  • ...

Eesen is developed on the basis of the popular Kaldi toolkit. However, Eesen is fully self-contained and requires no Kaldi dependencies to function.

Eesen is released as an open-source project under the permissive Apache License, Version 2.0. We welcome community participation and contribution.

Key Components

Eesen contains three key components that enable end-to-end ASR (a conceptual sketch follows this list):

  • Acoustic Model -- Bi-directional RNNs with LSTM units.
  • Training -- Connectionist temporal classification (CTC) as the training objective.
  • Decoding -- A principled decoding approach based on Weighted Finite-State Transducers (WFSTs).
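
For intuition, the sketch below shows the same modeling idea -- a bidirectional LSTM acoustic model trained with the CTC objective -- in PyTorch. This is purely illustrative: Eesen implements these components in C++/CUDA, and all names, dimensions, and hyperparameters here are made up for the example.

    # Illustrative sketch only (not Eesen code): BiLSTM acoustic model + CTC loss.
    import torch
    import torch.nn as nn

    class BiLSTMAcousticModel(nn.Module):
        def __init__(self, feat_dim=40, hidden=320, num_layers=4, num_labels=46):
            super().__init__()
            # num_labels counts the CTC blank plus the phoneme or character set.
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                                bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * hidden, num_labels)

        def forward(self, feats):                     # feats: (batch, time, feat_dim)
            out, _ = self.lstm(feats)
            return self.proj(out).log_softmax(dim=-1) # (batch, time, num_labels)

    model = BiLSTMAcousticModel()
    ctc = nn.CTCLoss(blank=0)                 # CTC training objective, blank index 0

    feats = torch.randn(2, 100, 40)           # two toy utterances, 100 frames each
    targets = torch.randint(1, 46, (2, 20))   # toy label sequences (no blanks)
    log_probs = model(feats).transpose(0, 1)  # CTCLoss expects (time, batch, labels)
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((2,), 100),
               target_lengths=torch.full((2,), 20))
    loss.backward()                           # gradients for training

In Eesen, the analogous computation runs on the GPU and processes multiple utterances in parallel, as noted in the highlights below.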

Highlights of Eesen

  • The WFST-based decoding approach can incorporate lexicons and language models into CTC decoding in an effective and efficient way (see the decoding sketch after this list).
  • GPU implementation of RNN model training and CTC learning.
  • Multiple utterances are processed in parallel for training speed-up.
  • Inherits Kaldi's programming style, making it convenient to implement new modules.
  • Eesen's close connection with Kaldi makes the end-to-end systems directly comparable to Kaldi's hybrid HMM/DNN systems.
  • Fully-fledged example setups to demonstrate end-to-end system building, with both phonemes and characters as labels.
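
To make the decoding highlight concrete, the sketch below shows what plain best-path CTC decoding does when no lexicon or language model is involved: take the argmax label per frame, collapse repeats, and drop blanks. It is not Eesen code; Eesen's WFST-based decoder instead searches a composed graph so that lexicon and language-model constraints shape the hypotheses.

    # Illustrative sketch only (not Eesen code): naive best-path CTC decoding.
    import torch

    def greedy_ctc_decode(log_probs, blank=0):
        """log_probs: (time, num_labels) per-frame CTC label scores."""
        best = log_probs.argmax(dim=-1).tolist()   # best label per frame
        decoded, prev = [], blank
        for label in best:
            if label != blank and label != prev:   # collapse repeats, skip blanks
                decoded.append(label)
            prev = label
        return decoded

    # Toy usage with random scores; indices would map to phonemes or characters.
    print(greedy_ctc_decode(torch.randn(50, 46)))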

Experimental Results

Refer to RESULTS under each example setup.

To-Do List (short-term)

  • Create TIMIT and Switchboard example setups.
  • Add CPU-based training.
  • Add lattice-based decoding to example setups.
  • More Wiki pages/documentation, especially about training and decoding commands.

To-Do List (long-term)

  • Further improve Eesen's ASR accuracy in various respects, with the eventual goal of surpassing the state-of-the-art hybrid HMM/DNN pipeline.
  • Investigate the advantages and disadvantages of Eesen on different languages and speech conditions (noisy, far-field, etc.).
  • Accelerate model training by adopting better learning techniques or multi-GPU distributed learning.

Contact

Email Yajie Miao if you have any questions or suggestions.
