Speech models and supporting files for voice2json.

Files are contained in `<LANGUAGE>/<LOCALE>` directories. Each locale directory should contain a `SOURCE` file describing where its contents were obtained, and a `LICENSE` file covering the artifacts for that specific profile.
- Directories with `pocketsphinx` contain CMU Sphinx acoustic models.
- Directories with `kaldi` contain Kaldi acoustic models (either `gmm` or `nnet3`).
- Directories with `deepspeech` contain Mozilla DeepSpeech acoustic models (version 0.6).
- Directories with `julius` contain Julius acoustic models (DNN, version 4.5).
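For illustration only, a locale directory might be laid out roughly like this (the locale name and all file names other than `SOURCE` and `LICENSE` are hypothetical, not an exact listing of any profile in this repository):

```
english/
└── en-us_kaldi-example/      # hypothetical locale directory
    ├── SOURCE                # where the model and dictionary were obtained
    ├── LICENSE               # license covering this profile's artifacts
    ├── profile.yml           # profile settings (illustrative name)
    ├── sentences.ini         # example voice command sentences
    ├── base_dictionary.txt   # pronunciation dictionary (illustrative name)
    └── acoustic_model/       # Kaldi acoustic model files
```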
Some files are split into multiple parts so that they can be uploaded to GitHub. This is done with the `split` command:

```sh
split -d -b 25M FILE FILE.part-
```

They can be recombined simply with:

```sh
cat FILE.part-* > FILE
```
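For example, splitting a hypothetical 60 MB model file named `final.mdl` this way produces numbered parts that concatenate back into the original file:

```sh
split -d -b 25M final.mdl final.mdl.part-   # creates final.mdl.part-00, -01, -02
cat final.mdl.part-* > final.mdl            # restores the original file
```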
`voice2json` supports the following languages/locales. I don't speak or write any language besides U.S. English very well, so please let me know if any profile is broken or could be improved!

Untested profiles (marked UNTESTED below) may work, but I don't have the necessary data or enough understanding of the language to test them.
| | | Language | Locale | System | Closed | Open |
|---|---|---|---|---|---|---|
| View | Download | Catalan | ca-es | pocketsphinx | UNTESTED | UNTESTED |
| View | Download | Czech | cs-cz | kaldi | UNTESTED | UNTESTED |
| View | Download | Dutch (Nederlands) | nl | kaldi | ★ ★ ★ ★ ★ (2x) | ☹ (1x) |
| View | Download | Dutch (Nederlands) | nl | pocketsphinx | ★ ★ ★ ★ (18x) | ☹ (3x) |
| View | Download | English | en-in | pocketsphinx | ☹ (4x) | ☹ (4x) |
| View | Download | English | en-us | deepspeech | ★ ★ ★ ★ ★ (1x) | ★ ★ ★ ★ (1x) |
| View | Download | English | en-us | julius | ★ ★ ★ ★ (1x) | UNTESTED |
| View | Download | English | en-us | kaldi | ★ ★ ★ ★ ★ (3x) | ★ ★ ★ ★ (1x) |
| View | Download | English | en-us | pocketsphinx | ★ ★ ★ ★ ★ (9x) | ★ ★ ★ ★ (2x) |
| View | Download | French (Français) | fr | kaldi | ★ ★ ★ ★ (4x) | ★ ★ ★ ★ (1x) |
| View | Download | French (Français) | fr | kaldi | ★ ★ ★ ★ ★ (3x) | ★ ★ ★ ★ ★ (0.5x) |
| View | Download | French (Français) | fr | pocketsphinx | ★ ★ ★ ★ (23x) | ☹ (3x) |
| View | Download | German (Deutsch) | de | pocketsphinx | ★ ★ ★ ★ ★ (17x) | ★ ★ ★ ★ ★ (3x) |
| View | Download | German (Deutsch) | de-DE | deepspeech | ★ ★ ★ ★ ★ (1x) | ★ ★ ★ ★ (1x) |
| View | Download | German (Deutsch) | de-DE | kaldi | ★ ★ ★ ★ ★ (4x) | ★ ★ ★ ★ (1x) |
| View | Download | Greek (Ελληνικά) | el-gr | pocketsphinx | ★ ★ ★ ★ ★ (15x) | ☹ (1x) |
| View | Download | Hindi (Devanagari) | hi | pocketsphinx | UNTESTED | UNTESTED |
| View | Download | Italian (Italiano) | it | pocketsphinx | ★ ★ ★ ★ ★ (21x) | ★ ★ ★ ★ ★ (7x) |
| View | Download | Italian (Italiano) | it | kaldi | ★ ★ ★ ★ ★ (1x) | ★ ★ ★ ★ ★ (1x) |
| View | Download | Kazakh (қазақша) | kz | pocketsphinx | UNTESTED | UNTESTED |
| View | Download | Korean | ko-kr | kaldi | ☹ (4x) | ☹ (4x) |
| View | Download | Mandarin | zh-cn | pocketsphinx | UNTESTED | UNTESTED |
| View | Download | Polish (polski) | pl | julius | UNTESTED | UNTESTED |
| View | Download | Portuguese (Português) | pt-br | pocketsphinx | ★ ★ ★ ★ (51x) | ☹ (11x) |
| View | Download | Russian (Русский) | ru | kaldi | ★ ★ ★ ★ ★ (2x) | ★ ★ ★ ★ ★ (0.5x) |
| View | Download | Russian (Русский) | ru | pocketsphinx | ★ ★ ★ ★ ★ (17x) | ☹ (1x) |
| View | Download | Spanish (Español) | es | kaldi | ★ ★ ★ ★ ★ (4x) | ★ ★ ★ ★ ★ (1x) |
| View | Download | Spanish (Español) | es | pocketsphinx | ★ ★ ★ ★ (25x) | ★ ★ ★ ★ (15x) |
| View | Download | Spanish | es-mexican | pocketsphinx | ★ ★ ★ ★ ★ (9x) | ★ ★ ★ ★ (2x) |
| View | Download | Swedish (svenska) | sv | kaldi | ★ ★ ★ ★ (3x) | ☹ (1x) |
| View | Download | Vietnamese (Tiếng Việt) | vi | kaldi | ★ ★ ★ ★ ★ (4x) | ☹ (1x) |
Each profile is given a ★ rating, indicating how accurate it was at transcribing a set of test WAV files. I'm considering anything below 75% accuracy to be effectively unusable (☹).
| Rating | Transcription Accuracy |
|---|---|
| ★ ★ ★ ★ ★ | [95%, 100%] |
| ★ ★ ★ ★ | [90%, 95%) |
| ★ ★ ★ | [85%, 90%) |
| ★ ★ | [80%, 85%) |
| ★ | [75%, 80%) |
| ☹ | [0%, 75%) |
Profiles are tested in two conditions:
- Closed
    - All example sentences from the profile's `sentences.ini` are run through Google WaveNet to produce synthetic speech
    - The profile is trained and tested on exactly the sentences it should recognize (the ideal case)
    - This resembles the intended use case of `voice2json`, though real-world speech will be less perfect
- Open
    - Speech examples are provided by contributors, VoxForge, or Mozilla Common Voice
    - The profile is tested on the sample WAV files with the `--open` flag (a command sketch follows this list)
    - This (usually) demonstrates why it's best to define voice commands first!
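As a rough sketch of the two conditions (the profile name and WAV file below are placeholders, not actual test assets):

```sh
# Train the profile on its own sentences.ini (closed condition)
voice2json --profile en-us_kaldi-example train-profile

# Transcribe a test recording with the trained profile (closed condition)
voice2json --profile en-us_kaldi-example transcribe-wav test.wav

# Transcribe the same recording against the general language model (open condition)
voice2json --profile en-us_kaldi-example transcribe-wav --open test.wav
```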
Transcription speed-up is given as (Nx), where N is the average ratio of a test WAV file's real-time duration to its transcription time. A value of 2x means that `voice2json` was able to transcribe the test WAV files twice as fast as their real-time durations on average. The reported values come from an Intel Core i7-based laptop with 16 GB of RAM, so expect slower transcriptions on a Raspberry Pi.
The acoustic models and pronunciation dictionaries come from the upstream sources listed in each profile's `SOURCE` file.
When language models or grapheme-to-phoneme models were unavailable, they were generated using:
- Data from Universal Dependencies
- The Phonetisaurus G2P tool
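As a rough sketch of how a grapheme-to-phoneme model can be built with Phonetisaurus (not necessarily the exact commands or file names used for these profiles; `estimate-ngram` comes from the separate MITLM toolkit):

```sh
# Align graphemes to phonemes using an existing pronunciation dictionary
phonetisaurus-align --input=base_dictionary.txt --ofile=aligned.corpus

# Train an n-gram model over the aligned corpus (MITLM)
estimate-ngram -o 8 -t aligned.corpus -wl g2p.arpa

# Convert the ARPA model into a WFST that can guess pronunciations for unknown words
phonetisaurus-arpa2wfst --lm=g2p.arpa --ofile=g2p.fst
```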