Speech Transcription with Vosk v1.0

Michelle Worthington edited this page Aug 14, 2024 · 5 revisions
Date completed October 18, 2023
Release where first appeared OpenWillis v1.6
Researcher / Developer Vijay Yadav, Anzar Abbas

1 – Syntax

import openwillis as ow

transcript_json, transcript_text = ow.speech_transcription_vosk(filepath = '', language = '', transcribe_interval = '')

2 – Methods

This function transcribes speech into text using the Vosk Speech Recognition Toolkit, an open source model that allows for offline speech transcription without needing significant computational resources. It is best suited for researchers transcribing speech on their personal machines.

A limitation of this function is that it assumes the source audio contains speech from a single speaker. The transcription output does not contain speaker labels, so speech from multiple speakers cannot be told apart. For speaker labeling, see the Speech Transcription with Whisper or Speech Transcription with AWS functions in OpenWillis.

Vosk supports several languages/dialects, namely English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, and Polish.

By default, the function assumes the input language is English, the language code for which is en. See below for all other language codes.

Optionally, the user may specify a section of the source audio they want to transcribe by entering transcribe_interval = [a,b], where a is the start time and b is the end time, both in seconds.

The function returns two outputs. The first, transcript_json, contains start and end times for each word as well as a confidence score for the transcription. The second, transcript_text, is a single string containing the entire transcription, without punctuation.
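
As a minimal usage sketch of the call described above: the wrapper name transcribe_clip and the example path are hypothetical, and only ow.speech_transcription_vosk and its three parameters come from this page. OpenWillis is imported inside the function so that the definition itself loads even where the package is not installed.

```python
def transcribe_clip(filepath, start, end, language="en"):
    """Transcribe the [start, end] window (in seconds) of a single-speaker file.

    Hypothetical convenience wrapper; it only forwards its arguments to the
    documented OpenWillis function and returns both outputs unchanged.
    """
    import openwillis as ow  # the function first appeared in OpenWillis v1.6

    transcript_json, transcript_text = ow.speech_transcription_vosk(
        filepath=filepath,
        language=language,
        transcribe_interval=[start, end],
    )
    return transcript_json, transcript_text


# Example call (the path is a placeholder, not a file shipped with OpenWillis):
# words, text = transcribe_clip("interview.wav", 0, 10)
```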

3 – Inputs

3.1 – filepath

Type String
Description A local path to the file that the user wants to process. Most audio and video file types are supported.

3.2 – language

Type String; optional, default is English, i.e., en
Description Language of the source audio file
English en
Indian English en-in
Chinese cn
Russian ru
French fr
German de
Spanish es
Portuguese or Brazilian Portuguese pt
Greek gr
Turkish tr
Vietnamese vn
Italian it
Dutch nl
Catalan ca
Arabic ar
Farsi fa
Filipino ph
Ukrainian uk
Kazakh kz
Swedish sv
Japanese ja
Esperanto eo
Hindi hi
Czech cz
Polish pl
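
The codes listed above can be kept in a small lookup so that a misspelled language name fails early instead of reaching the transcription call. This is only a sketch: the dictionary name and the language_code helper are hypothetical, while the names and codes themselves are copied from the list.

```python
# Language names and codes copied from the list above.
VOSK_LANGUAGE_CODES = {
    "English": "en", "Indian English": "en-in", "Chinese": "cn",
    "Russian": "ru", "French": "fr", "German": "de", "Spanish": "es",
    "Portuguese": "pt", "Brazilian Portuguese": "pt", "Greek": "gr",
    "Turkish": "tr", "Vietnamese": "vn", "Italian": "it", "Dutch": "nl",
    "Catalan": "ca", "Arabic": "ar", "Farsi": "fa", "Filipino": "ph",
    "Ukrainian": "uk", "Kazakh": "kz", "Swedish": "sv", "Japanese": "ja",
    "Esperanto": "eo", "Hindi": "hi", "Czech": "cz", "Polish": "pl",
}


def language_code(name):
    """Return the code for a language name, ignoring case and outer spaces."""
    lookup = {key.lower(): code for key, code in VOSK_LANGUAGE_CODES.items()}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {name!r}") from None


print(language_code("Indian English"))  # en-in
```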

3.3 – transcribe_interval

Type List; optional, default is [0]
Description List specifying the start and end time (in seconds) of the portion of the file that the user wants to transcribe. For example, the user can input [0, 10] to transcribe the first 10 seconds, [11, 20] to transcribe a segment in the middle of the recording, or [20] to transcribe everything after 20 seconds.
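
The interval semantics described above can be illustrated with a short sketch. The resolve_interval helper is hypothetical and is not the OpenWillis implementation. It assumes the recording's duration is known and that an end time past the end of the file is clamped to that duration.

```python
def resolve_interval(transcribe_interval, duration):
    """Turn a transcribe_interval list into a (start, end) pair in seconds.

    [a, b] means from a to b; [a] means from a to the end of the recording.
    Hypothetical helper for illustration only.
    """
    if len(transcribe_interval) == 1:
        start, end = transcribe_interval[0], duration
    elif len(transcribe_interval) == 2:
        start, end = transcribe_interval
    else:
        raise ValueError("transcribe_interval must have one or two entries")
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # Assumption: an end time beyond the file is clamped to its duration.
    return start, min(end, duration)


print(resolve_interval([0, 10], duration=60))  # (0, 10)
print(resolve_interval([20], duration=60))     # (20, 60)
print(resolve_interval([0], duration=60))      # (0, 60)
```

Under this reading, the default of [0] covers the whole file, since it starts at 0 seconds and has no end time.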

4 – Outputs

4.1 – transcript_json

Description A word-by-word transcript in JSON format, with start and end time labels for each word and a confidence score

4.2 – transcript_text

Type String
Description The transcription compiled into a string
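
A sketch of how the word-level output might be post-processed is shown here. This page does not spell out the schema of transcript_json, so the sample data assumes it mirrors Vosk's native word-level results, i.e., a list of entries with word, start, end, and conf keys. The helper names and the 0.5 threshold are hypothetical.

```python
# ASSUMPTION: transcript_json is a list of word entries shaped like Vosk's
# native word-level results. The values below are made-up sample data.
sample_transcript_json = [
    {"word": "hello", "start": 0.12, "end": 0.48, "conf": 0.98},
    {"word": "there", "start": 0.51, "end": 0.90, "conf": 0.42},
    {"word": "world", "start": 1.05, "end": 1.50, "conf": 0.91},
]


def low_confidence_words(words, threshold=0.5):
    """Return the entries whose confidence falls below the threshold."""
    return [entry for entry in words if entry["conf"] < threshold]


def spoken_span(words):
    """Seconds between the start of the first word and the end of the last."""
    if not words:
        return 0.0
    return words[-1]["end"] - words[0]["start"]


flagged = low_confidence_words(sample_transcript_json)
print([entry["word"] for entry in flagged])  # ['there']
```

Flagging low-confidence words in this way can help decide which parts of a transcript deserve a manual check before further analysis.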

5 – Dependencies

Below are the dependencies specific to this function.

Dependency vosk-api
License Apache 2.0
Justification Open source model for offline speech transcription that is the backbone of this function