The `mozilla-deepspeech-0.6.1` model is a speech recognition neural network pre-trained by Mozilla. It is based on the DeepSpeech architecture (CTC decoder with beam search and n-gram language model) with a changed neural network topology.

For details on the original DeepSpeech, see the paper. For details on this model, see the repository.

Metric | Value |
---|---|
Type | Speech recognition |
GFlops per audio frame | 0.0472 |
GFlops per second of audio | 2.36 |
MParams | 47.2 |
Source framework | TensorFlow* |
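
The two compute figures in the table can be cross-checked with simple arithmetic. The sketch below derives the implied frame rate from their ratio and estimates the cost of a 16-frame section of audio. The frame rate is an inference from the table, not a value stated in it, and `gflops_for_frames` is a hypothetical helper name.

```python
# Figures copied from the specification table.
GFLOPS_PER_FRAME = 0.0472
GFLOPS_PER_SECOND_OF_AUDIO = 2.36

# Implied number of audio frames per second of audio.
# Derived from the ratio of the two figures; not stated in the table.
frames_per_second = GFLOPS_PER_SECOND_OF_AUDIO / GFLOPS_PER_FRAME


def gflops_for_frames(num_frames: int) -> float:
    """Estimated compute (in GFlops) for the given number of audio frames."""
    return num_frames * GFLOPS_PER_FRAME


print(round(frames_per_second))         # frames per second of audio
print(round(gflops_for_frames(16), 4))  # cost of one 16-frame section
```
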
Metric | Value | Parameters |
---|---|---|
WER @ Librispeech test-clean | 8.93% | with LM, beam_width = 32, Python CTC decoder |
WER @ Librispeech test-clean | 7.55% | with LM, beam_width = 500, C++ CTC decoder |
NB: `beam_width = 32` is a low value for a CTC decoder; it was used to keep evaluation time reasonable with the Python CTC decoder in Accuracy Checker. Increasing `beam_width` improves the WER metric but slows down decoding. The Speech Recognition DeepSpeech Demo has a faster C++ CTC decoder module.
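
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The following is a minimal sketch of that standard definition. It is not the Accuracy Checker implementation, which may normalize text differently, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the hat sat"))
```
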
The original model takes the following inputs:

- Audio MFCC coefficients, name: `input_node`, shape: `1, 16, 19, 26`, format: `B, N, T, C`, where:
  - `B` - batch size, fixed to 1
  - `N` - `input_lengths`, number of audio frames in this section of audio
  - `T` - context frames: along with the current frame, the network expects 9 preceding frames and 9 succeeding frames. Absent context frames are filled with zeros.
  - `C` - 26 MFCC coefficients per frame

  See `<omz_dir>/models/public/mozilla-deepspeech-0.6.1/accuracy-check.yml` for all audio preprocessing and feature extraction parameters.
- Number of audio frames, INT32 value, name: `input_lengths`, shape: `1`.
- LSTM in-state (c) and input (h, a.k.a. hidden state) vectors. Names: `previous_state_c` and `previous_state_h`, shapes: `1, 2048`, format: `B, C`.

When splitting a long audio into chunks, these inputs must be fed with the corresponding outputs from the previous chunk. Chunks must be processed in order, from early to late audio positions.
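
The context-frame layout described above can be sketched with NumPy. The example assumes the MFCC features are already available as a `[num_frames, 26]` array. `make_context_windows` and `split_into_chunks` are hypothetical helper names, and zero-padding the last chunk up to 16 frames is an assumption of this sketch, not something the model description specifies.

```python
import numpy as np

NUM_CONTEXT = 9                 # context frames on each side of the current frame
WINDOW = 2 * NUM_CONTEXT + 1    # 19 frames per window (dimension T)
NUM_MFCC = 26                   # MFCC coefficients per frame (dimension C)
CHUNK_FRAMES = 16               # frames per section of audio (dimension N)


def make_context_windows(mfcc: np.ndarray) -> np.ndarray:
    """Turn [num_frames, 26] features into [num_frames, 19, 26] windows.

    Context frames that fall outside the audio are filled with zeros.
    Expects at least one frame.
    """
    num_frames = mfcc.shape[0]
    padded = np.pad(mfcc, ((NUM_CONTEXT, NUM_CONTEXT), (0, 0)))
    return np.stack([padded[i:i + WINDOW] for i in range(num_frames)])


def split_into_chunks(windows: np.ndarray):
    """Yield (chunk, valid_frames) with chunk shaped [1, 16, 19, 26].

    The last chunk is zero-padded to 16 frames (assumption of this sketch).
    """
    for start in range(0, windows.shape[0], CHUNK_FRAMES):
        piece = windows[start:start + CHUNK_FRAMES]
        chunk = np.zeros((1, CHUNK_FRAMES, WINDOW, NUM_MFCC), dtype=np.float32)
        chunk[0, :piece.shape[0]] = piece
        yield chunk, piece.shape[0]


# Usage with placeholder features for 20 frames.
features = np.ones((20, NUM_MFCC), dtype=np.float32)
for chunk, valid in split_into_chunks(make_context_windows(features)):
    print(chunk.shape, valid)
```

The number of valid frames is returned alongside each chunk so that a caller can discard the outputs that correspond to padding.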
The converted model takes the following inputs:

- Audio MFCC coefficients, name: `input_node`, shape: `1, 16, 19, 26`, format: `B, N, T, C`, where:
  - `B` - batch size, fixed to 1
  - `N` - number of audio frames in this section of audio, fixed to 16
  - `T` - context frames: along with the current frame, the network expects 9 preceding frames and 9 succeeding frames. Absent context frames are filled with zeros.
  - `C` - 26 MFCC coefficients per frame

  See `<omz_dir>/models/public/mozilla-deepspeech-0.6.1/accuracy-check.yml` for all audio preprocessing and feature extraction parameters.
- LSTM in-state and input vectors. Names: `previous_state_c` and `previous_state_h`, shapes: `1, 2048`, format: `B, C`.

When splitting a long audio into chunks, these inputs must be fed with the corresponding outputs from the previous chunk. Chunks must be processed in order, from early to late audio positions.
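
The state hand-over between chunks can be sketched as a simple loop. Here `infer` is a hypothetical callable standing in for whatever inference API is used; it takes one chunk plus the two state vectors and returns the per-frame probabilities and the two new state vectors. Starting from all-zero states for the first chunk is an assumption of this sketch. `fake_infer` only mimics tensor shapes so that the example runs without a model.

```python
import numpy as np

STATE_SIZE = 2048


def run_chunks(infer, chunks):
    """Run chunks in order, feeding each chunk the states from the previous one."""
    # Assumption: the first chunk starts from zero LSTM states.
    state_c = np.zeros((1, STATE_SIZE), dtype=np.float32)
    state_h = np.zeros((1, STATE_SIZE), dtype=np.float32)
    all_probs = []
    for chunk in chunks:  # order matters: early audio first
        probs, state_c, state_h = infer(chunk, state_c, state_h)
        all_probs.append(probs)
    # Concatenate along the frame axis: [total_frames, 1, 29]
    return np.concatenate(all_probs, axis=0)


def fake_infer(chunk, state_c, state_h):
    """Stand-in for a real inference call; returns uniform probabilities."""
    probs = np.full((chunk.shape[1], 1, 29), 1.0 / 29, dtype=np.float32)
    return probs, state_c + 1, state_h + 1


chunks = [np.zeros((1, 16, 19, 26), dtype=np.float32) for _ in range(3)]
print(run_chunks(fake_infer, chunks).shape)
```
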
The original model produces the following outputs:

- Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: `16, 1, 29`, format: `N, B, C`, where:
  - `N` - number of audio frames in this section of audio
  - `B` - batch size, fixed to 1
  - `C` - alphabet size, including the CTC blank symbol

  The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

  NB: `logits` contains probabilities after softmax, despite its name.
- LSTM out-state and output vectors. Names: `new_state_c` and `new_state_h`, shapes: `1, 2048`, format: `B, C`. See the corresponding inputs.
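
To illustrate how the alphabet indices and the CTC blank symbol are used, here is a minimal greedy (best-path) CTC decoder. This is a deliberately simpler technique than the beam search with an n-gram language model that the accuracy figures rely on, so it is expected to give worse results. `ctc_greedy_decode` is a hypothetical name.

```python
import numpy as np

# Index 0 = space, 1..26 = "a".."z", 27 = apostrophe, 28 = CTC blank.
ALPHABET = " abcdefghijklmnopqrstuvwxyz'"
BLANK = 28


def ctc_greedy_decode(probs) -> str:
    """Decode [N, 1, 29] (or [N, 29]) per-frame probabilities.

    Takes the most probable symbol in each frame, collapses consecutive
    repeats, then drops blank symbols.
    """
    best = np.asarray(probs).reshape(-1, len(ALPHABET) + 1).argmax(axis=1)
    chars = []
    prev = BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)


# Usage: one-hot "probabilities" for the frame sequence h h _ e l _ l o
frames = [8, 8, 28, 5, 12, 28, 12, 15]
probs = np.eye(29, dtype=np.float32)[frames].reshape(len(frames), 1, 29)
print(ctc_greedy_decode(probs))
```

The blank between the two `l` frames is what allows a doubled letter to survive the repeat-collapsing step.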
The converted model produces the following outputs:

- Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: `16, 1, 29`, format: `N, B, C`, where:
  - `N` - number of audio frames in this section of audio, fixed to 16
  - `B` - batch size, fixed to 1
  - `C` - alphabet size, including the CTC blank symbol

  The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

  NB: `logits` contains probabilities after softmax, despite its name.
- LSTM out-state and output vectors. Names:
  - `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.2` for `new_state_c`
  - `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.1` for `new_state_h`

  Shapes: `1, 2048`, format: `B, C`. See the corresponding inputs.
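
Because the state outputs of the converted model carry long generated names, it can help to keep the name mapping in one place. The sketch below assumes the inference results are available as a dictionary keyed by output name; that dictionary layout and the helper `next_state_inputs` are assumptions for illustration, not part of any particular inference API.

```python
_PREFIX = "cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/"

# Logical state name -> output name in the converted model.
STATE_OUTPUT_NAMES = {
    "new_state_c": _PREFIX + "TensorIterator.2",
    "new_state_h": _PREFIX + "TensorIterator.1",
}

# Logical state output -> input that receives it for the next chunk.
STATE_FEEDBACK = {
    "new_state_c": "previous_state_c",
    "new_state_h": "previous_state_h",
}


def next_state_inputs(outputs: dict) -> dict:
    """Build the state inputs for the next chunk from this chunk's outputs.

    `outputs` is assumed to be a dict keyed by output name.
    """
    return {
        STATE_FEEDBACK[logical]: outputs[ir_name]
        for logical, ir_name in STATE_OUTPUT_NAMES.items()
    }


# Usage with placeholder values instead of real tensors.
outputs = {
    STATE_OUTPUT_NAMES["new_state_c"]: "c-tensor",
    STATE_OUTPUT_NAMES["new_state_h"]: "h-tensor",
    "logits": "probabilities",
}
print(next_state_inputs(outputs))
```
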
You can download models and, if necessary, convert them into Inference Engine format using the Model Downloader and other automation tools, as shown in the examples below.
An example of using the Model Downloader:
omz_downloader --name <model_name>
An example of using the Model Converter:
omz_converter --name <model_name>
The original model is distributed under the Mozilla Public License, Version 2.0. A copy of the license is provided in `<omz_dir>/models/public/licenses/MPL-2.0-Mozilla-Deepspeech.txt`.