Speech Transcription with Vosk v1.0
Date completed | October 18, 2023 |
Release where first appeared | OpenWillis v1.6 |
Researcher / Developer | Vijay Yadav, Anzar Abbas |
import openwillis as ow
transcript_json, transcript_text = ow.speech_transcription_vosk(filepath = '', language = '', transcribe_interval = '')
This function transcribes speech into text using the Vosk Speech Recognition Toolkit, an open source model that allows for offline speech transcription without needing significant computational resources. It is best suited for researchers transcribing speech on their personal machines.
A limitation of this function is that it assumes the source audio contains speech from a single speaker only. The transcription output does not contain speaker labels and will not be able to distinguish speech from multiple speakers. For this functionality, see the Speech Transcription with Whisper or Speech Transcription with AWS functions in OpenWillis.
Vosk supports several languages/dialects, namely English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, and Polish.
By default, the function assumes the input language is English, the language code for which is en. See below for all other language codes.
Optionally, the user may specify a section of the source audio to transcribe by entering transcribe_interval = [a, b], where a is the start time and b is the end time, both in seconds.
The function provides two outputs. The first is transcript_json, which contains start and end time labels for each word as well as a confidence score on the transcription itself. The second is transcript_text, which is simply a string of the entire transcription, not including any punctuation.
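As a minimal sketch of how the call described above might be assembled, the helper below builds the keyword arguments and defers the openwillis import until transcription is actually requested. The names build_vosk_kwargs and transcribe and the file name interview.wav are hypothetical; only ow.speech_transcription_vosk and its three parameters come from this page.

```python
def build_vosk_kwargs(filepath, language="en", transcribe_interval=None):
    """Assemble keyword arguments for ow.speech_transcription_vosk.

    Hypothetical helper: only the three parameter names come from this page.
    """
    kwargs = {"filepath": filepath, "language": language}
    if transcribe_interval is not None:
        # [a, b] in seconds, or [a] to transcribe from a to the end of the file
        kwargs["transcribe_interval"] = list(transcribe_interval)
    return kwargs


def transcribe(filepath, language="en", transcribe_interval=None):
    """Run the transcription; requires openwillis to be installed."""
    import openwillis as ow  # imported lazily so the helper above works without it

    kwargs = build_vosk_kwargs(filepath, language, transcribe_interval)
    transcript_json, transcript_text = ow.speech_transcription_vosk(**kwargs)
    return transcript_json, transcript_text


# Example: arguments for the first 10 seconds of an English recording
print(build_vosk_kwargs("interview.wav", transcribe_interval=[0, 10]))
```

Calling transcribe("interview.wav", transcribe_interval=[0, 10]) would then return the two outputs described above, provided the file exists and OpenWillis is installed.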
filepath
Type | String |
Description | A local path to the file that the user wants to process. Most audio and video file types are supported. |
language
Type | String; optional, default is English, i.e., en |
Description | Language of the source audio file |
Language | Code |
English | en |
Indian English | en-in |
Chinese | cn |
Russian | ru |
French | fr |
German | de |
Spanish | es |
Portuguese or Brazilian Portuguese | pt |
Greek | gr |
Turkish | tr |
Vietnamese | vn |
Italian | it |
Dutch | nl |
Catalan | ca |
Arabic | ar |
Farsi | fa |
Filipino | ph |
Ukrainian | uk |
Kazakh | kz |
Swedish | sv |
Japanese | ja |
Esperanto | eo |
Hindi | hi |
Czech | cz |
Polish | pl |
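The language codes above do not always match ISO 639-1 (for example, cn, vn, and cz), so a small lookup can help avoid passing the wrong code. The sketch below copies the codes from the table; the dictionary name VOSK_LANGUAGE_CODES and the helper language_code are hypothetical and not part of OpenWillis.

```python
# Codes copied from the table above; names here are hypothetical, not part of OpenWillis.
VOSK_LANGUAGE_CODES = {
    "English": "en", "Indian English": "en-in", "Chinese": "cn",
    "Russian": "ru", "French": "fr", "German": "de", "Spanish": "es",
    "Portuguese": "pt", "Brazilian Portuguese": "pt", "Greek": "gr",
    "Turkish": "tr", "Vietnamese": "vn", "Italian": "it", "Dutch": "nl",
    "Catalan": "ca", "Arabic": "ar", "Farsi": "fa", "Filipino": "ph",
    "Ukrainian": "uk", "Kazakh": "kz", "Swedish": "sv", "Japanese": "ja",
    "Esperanto": "eo", "Hindi": "hi", "Czech": "cz", "Polish": "pl",
}


def language_code(name):
    """Return the code for a language name (case-insensitive).

    Raises ValueError if the language is not in the table.
    """
    lookup = {key.lower(): code for key, code in VOSK_LANGUAGE_CODES.items()}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {name!r}") from None


print(language_code("Vietnamese"))
```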
transcribe_interval
Type | List; optional, default is [0] |
Description | List specifying the start and end time (in seconds) of the section of the file that the user wants to transcribe. For example, the user can input [0, 10] to transcribe the first 10 seconds, [11, 20] to transcribe a chunk in the middle of the file, or [20] to transcribe everything after 20 seconds. |
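The interval semantics described above can be sketched as a small function that turns the list into explicit start and end times. This illustrates the documented behavior and is not OpenWillis's own code; the name resolve_interval, the duration argument, and the choice to clamp the end time to the file length are assumptions.

```python
def resolve_interval(transcribe_interval, duration):
    """Translate a transcribe_interval list into (start, end) in seconds.

    A sketch of the semantics described above, not OpenWillis's own code.
    `duration` is the total length of the audio in seconds.
    """
    if len(transcribe_interval) == 1:
        # [a] means "from a to the end of the file"; the default [0] is the whole file
        start, end = transcribe_interval[0], duration
    elif len(transcribe_interval) == 2:
        start, end = transcribe_interval
    else:
        raise ValueError("transcribe_interval must be [start] or [start, end]")
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # Assumption: an end time past the end of the file is clamped to the duration
    return start, min(end, duration)


# For a 60-second recording:
print(resolve_interval([0], 60))
print(resolve_interval([11, 20], 60))
print(resolve_interval([20], 60))
```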
transcript_json
Type | JSON |
Description | This is a word-by-word transcript saved as a JSON |
transcript_text
Type | String |
Description | The transcription compiled into a string |
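To illustrate how the word-by-word output might be used, the sketch below assumes transcript_json is a list of dictionaries with the keys word, start, end, and conf, which is the shape of Vosk's word-level results. That structure is an assumption here, so inspect the actual output before relying on it. The sample data and helper names are hypothetical.

```python
# Hypothetical sample in the assumed shape: one dict per word
sample_words = [
    {"word": "hello", "start": 0.12, "end": 0.48, "conf": 0.98},
    {"word": "there", "start": 0.51, "end": 0.90, "conf": 0.61},
    {"word": "doctor", "start": 1.30, "end": 1.85, "conf": 0.93},
]


def low_confidence_words(words, threshold=0.8):
    """Return the word entries whose confidence falls below the threshold."""
    return [w for w in words if w["conf"] < threshold]


def join_words(words):
    """Rebuild an unpunctuated transcript string from the word entries."""
    return " ".join(w["word"] for w in words)


def speech_duration(words):
    """Total time in seconds covered by the words, excluding gaps between them."""
    return sum(w["end"] - w["start"] for w in words)


print(join_words(sample_words))
print([w["word"] for w in low_confidence_words(sample_words)])
print(round(speech_duration(sample_words), 2))
```

Flagging low-confidence words this way can help decide which segments of a recording deserve manual review.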
Below are the dependencies specific to this function.
Dependency | License | Justification |
vosk-api | Apache 2.0 | Open source model for offline speech transcription that is the backbone of this function |
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.