
Speech Transcription with Whisper v1.1

GeorgiosEfstathiadis edited this page May 29, 2024 · 4 revisions
Date completed March 19, 2024
Release where first appeared OpenWillis v2.1
Researcher / Developer Vijay Yadav, Anzar Abbas, Georgios Efstathiadis

1 – Use

import openwillis as ow

transcript_json, transcript_text = ow.speech_transcription_whisper(filepath = '', model = '', compute_type = '', batch_size = '', hf_token = '', language = '', min_speakers = '', max_speakers = '', context = '')

2 – Methods

This function transcribes speech into text using WhisperX, an open source library built atop Whisper, a speech transcription model developed by OpenAI. The function allows for offline speech transcription where computational resources are available. It is best suited for researchers transcribing speech on institutional or cloud-based machines, ideally with GPU access.

The function supports multiple models: tiny, small, medium, large, and large-v2.

  • The tiny model, with 39M parameters, is suitable for personal machines and can handle relatively short audio files.
  • The small model, with 244M parameters, offers a balance of performance and file length.
  • The medium model, with 769M parameters, is a step up for processing longer audio files.
  • The large and large-v2 models, with 1550M parameters, demand GPU resources for efficient processing of large recordings.

Transcription accuracy generally improves with model size. The default compute type and batch size for the transcription can also be changed. Refer to the WhisperX documentation for further information on the computational resources required.
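As a rough illustration of the trade-off described above, the sketch below records the parameter counts listed for each model and applies a simple rule of thumb for choosing one. The helper pick_model and its two flags are hypothetical and not part of OpenWillis; the thresholds are assumptions a user would want to adapt to their own hardware.

```python
# Parameter counts (in millions) as listed in the model descriptions above.
MODEL_PARAMS_M = {
    "tiny": 39,
    "small": 244,
    "medium": 769,
    "large": 1550,
    "large-v2": 1550,
}


def pick_model(has_gpu: bool, long_recordings: bool) -> str:
    """Hypothetical rule of thumb for choosing a model name.

    Assumption: a GPU is treated as a prerequisite for the large models,
    and the medium model is preferred for longer files on CPU-only machines.
    """
    if has_gpu:
        return "large-v2"
    return "medium" if long_recordings else "tiny"


chosen = pick_model(has_gpu=False, long_recordings=False)
print(chosen, MODEL_PARAMS_M[chosen])
```

The returned name could then be passed as the model argument of the function call shown in section 1.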

The user will also need a Hugging Face token to access the underlying models. To acquire a token, they will need to create an account on Hugging Face, accept user conditions for the segmentation, voice activity detection, and diarization models, and go to account settings to access their token.

Whisper can support several languages/dialects, namely Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh. The language codes for each can be found below.

If no language is specified, Whisper will automatically detect the language and process it as such. However, language classification accuracy may vary depending on the language and its similarity to other languages. We recommend always specifying the language, if known.
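One way to follow this recommendation is to normalize the language to a code before calling the function. The sketch below is a minimal example under stated assumptions: resolve_language is a hypothetical helper, and LANGUAGE_CODES holds only a handful of entries copied from the code list in section 3.6. An empty string is passed through unchanged so that auto-detection still applies.

```python
# A small subset of the language codes listed in section 3.6.
LANGUAGE_CODES = {
    "english": "en",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "chinese": "zh",
}


def resolve_language(language: str) -> str:
    """Hypothetical helper: map a language name or code to a code.

    An empty string is returned as-is, which leaves the language
    to auto-detection.
    """
    key = language.strip().lower()
    if key == "":
        return ""
    if key in LANGUAGE_CODES.values():
        return key
    if key in LANGUAGE_CODES:
        return LANGUAGE_CODES[key]
    raise ValueError(f"Unknown language: {language!r}")


print(resolve_language("English"))
```

Raising an error for an unrecognized value, rather than silently falling back to auto-detection, makes a typo in the language name visible before a long transcription job starts.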

2.2.1 – Transcription only

In case of a straightforward transcription, the function simply passes the source audio through Whisper.

In transcript_json, a word-by-word and phrase-by-phrase transcript is saved as a JSON.

In transcript_text, a string of the entire transcript, including punctuation, is saved.
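Both outputs can be written to disk for later analysis. The sketch below assumes that transcript_json is a JSON-serializable Python object and that transcript_text is a plain string; the helper save_transcript and the file names are illustrative and not part of OpenWillis.

```python
import json
from pathlib import Path


def save_transcript(transcript_json, transcript_text, out_dir, stem="transcript"):
    """Hypothetical helper: write both outputs next to each other.

    Assumes transcript_json can be serialized with json.dumps.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{stem}.json"
    text_path = out / f"{stem}.txt"
    # ensure_ascii=False keeps non-English transcripts human-readable.
    json_path.write_text(
        json.dumps(transcript_json, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    text_path.write_text(transcript_text, encoding="utf-8")
    return json_path, text_path
```

A call such as save_transcript(transcript_json, transcript_text, 'outputs') would then produce outputs/transcript.json and outputs/transcript.txt.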

2.2.2 – Transcription + Speaker labeling

When the source audio contains multiple speakers, the function will automatically label each speaker in the output JSON as speaker0, speaker1, speaker2, and so on at both the word and the phrase level.

If the number of speakers in the source audio is known to the user, they may specify max_speakers = n to limit the number of speakers that will be labeled. Please note that this is only an upper limit. If the model detects fewer than the expected number of speakers, it will output the JSON as such.

As is to be expected, speaker labeling is not 100% accurate.

For phrases that are labeled as one speaker but contain words labeled as a different speaker, each mismatched word’s speaker label is reassigned to the phrase-level speaker label.
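The reassignment rule can be sketched as follows. The data layout here (a list of phrases, each with a "speaker" key and a "words" list whose items also carry a "speaker" key) is an assumption made for illustration; the actual structure of the output JSON may differ, and align_word_speakers is a hypothetical name.

```python
def align_word_speakers(phrases):
    """Return a copy in which every word carries its phrase's speaker label.

    Assumed layout: each phrase is a dict with "speaker" and "words",
    and each word is a dict with its own "speaker" key.
    """
    aligned = []
    for phrase in phrases:
        # Overwrite each word-level label with the phrase-level label.
        words = [dict(word, speaker=phrase["speaker"]) for word in phrase["words"]]
        aligned.append(dict(phrase, words=words))
    return aligned


phrases = [
    {
        "speaker": "speaker0",
        "words": [
            {"word": "how", "speaker": "speaker0"},
            {"word": "are", "speaker": "speaker1"},  # mismatched label
            {"word": "you", "speaker": "speaker0"},
        ],
    }
]
aligned = align_word_speakers(phrases)
print([w["speaker"] for w in aligned[0]["words"]])
```

The function builds new dictionaries rather than editing in place, so the original word-level labels remain available for comparison.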

2.2.3 – Transcription + Speaker labeling + Speaker identification

If the source audio is a recording of a structured clinical interview, the user may specify the scale being administered using the context parameter and the function will automatically identify which of the speakers is the clinician and which of the speakers is the participant.

Then, at the word and phrase level, the speaker label is changed to either clinician or participant.

When context is specified, the user must set max_speakers = 2.

Currently supported contexts are the following clinical scales:

  • MADRS, conducted in accordance with SIGMA, for which context = 'madrs'.
  • PANSS, conducted in accordance with the SCI-PANSS, for which context = 'panss'.
  • HAM-A, conducted in accordance with the Structured Interview Guide for the Hamilton Anxiety Rating Scale (SIGH-A), for which context = 'ham-a'.
  • CAPS past week, conducted in accordance with the Clinician-Administered PTSD Scale for DSM-5 (CAPS-5) Past Week Version, for which context = 'caps-past-week'.
  • CAPS past month, conducted in accordance with the Clinician-Administered PTSD Scale for DSM-5 (CAPS-5) Past Month Version, for which context = 'caps-past-month'.
  • CAPS DSM-IV, conducted in accordance with the Clinician-Administered PTSD Scale for DSM-IV, for which context = 'caps-iv'.
  • MINI, conducted in accordance with Version 7.0.2 for DSM-5, for which context = 'mini'.
  • CAINS, conducted in accordance with CAINS (v1.0), for which context = 'cains'.
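Before starting a long transcription job, it can help to check the context value against the list above. The following is a minimal sketch; check_context is a hypothetical helper and not part of OpenWillis, and it assumes that an empty string means no context is being used.

```python
# Context values copied from the list of supported clinical scales above.
SUPPORTED_CONTEXTS = {
    "madrs",
    "panss",
    "ham-a",
    "caps-past-week",
    "caps-past-month",
    "caps-iv",
    "mini",
    "cains",
}


def check_context(context: str, max_speakers: int) -> None:
    """Hypothetical pre-flight check for the context parameter."""
    if context == "":
        return  # assumption: an empty string means no speaker identification
    if context not in SUPPORTED_CONTEXTS:
        raise ValueError(f"Unsupported context: {context!r}")
    if max_speakers != 2:
        raise ValueError("max_speakers must be 2 when context is specified")


check_context("madrs", max_speakers=2)  # passes silently
```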

Speaker identification is done by comparing the transcribed text with the rater prompts expected for the clinical scale. The comparison uses a pre-trained multilingual sentence transformer model, which maps sentences into an embedding space based on their underlying meaning. The embeddings are then compared using cosine similarity: the higher the similarity between two embeddings, the closer the sentences are in meaning. The speaker whose transcribed speech more closely matches the expected rater prompts is labeled as the clinician, while the other speaker is labeled as the participant. Because the model is multilingual, this comparison is agnostic of the language the speech is in.
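The comparison step can be illustrated with a small, self-contained sketch. It does not load a sentence transformer; the two-dimensional vectors below are made-up stand-ins for real embeddings. The way scores are aggregated per speaker (averaging each utterance's best match against the prompts) is an assumption for illustration and may differ from what OpenWillis does internally. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rater_score(utterance_embeddings, prompt_embeddings):
    """Assumed aggregation: mean of each utterance's best prompt match."""
    best = [
        max(cosine_similarity(u, p) for p in prompt_embeddings)
        for u in utterance_embeddings
    ]
    return sum(best) / len(best)


def identify_clinician(speaker_embeddings, prompt_embeddings):
    """Label the speaker closest to the rater prompts as the clinician."""
    scores = {
        speaker: rater_score(embeddings, prompt_embeddings)
        for speaker, embeddings in speaker_embeddings.items()
    }
    clinician = max(scores, key=scores.get)
    return {s: "clinician" if s == clinician else "participant" for s in scores}


# Toy vectors standing in for sentence embeddings.
prompts = [[1.0, 0.0], [0.8, 0.6]]
speakers = {
    "speaker0": [[0.9, 0.1], [0.7, 0.7]],
    "speaker1": [[0.0, 1.0], [-0.5, 0.5]],
}
labels = identify_clinician(speakers, prompts)
print(labels)
```

In practice the vectors would come from encoding the transcribed phrases and the scale's expected prompts with the same multilingual model, which is what makes the comparison independent of language.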


3 – Inputs

3.1 – filepath

Type String
Description A local path to the file that the user wants to process. Most audio and video file types are supported.

3.2 – model

Type String; default is tiny
Description Name of model the user wants to utilize, with the options being tiny, small, medium, large and large-v2.

3.3 – compute_type

Type String; default is int16
Description The compute type to be used during transcription. This setting applies only when running on CPU; when running on GPU, compute_type will be float16.
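The rule described for compute_type can be written out as a small sketch. resolve_compute_type is a hypothetical helper that only mirrors the behavior stated above; it is not the function's internal logic.

```python
def resolve_compute_type(on_gpu: bool, compute_type: str = "int16") -> str:
    """Mirror the documented rule: GPU runs use float16,
    CPU runs use the requested compute type (default int16)."""
    return "float16" if on_gpu else compute_type


print(resolve_compute_type(on_gpu=True))
print(resolve_compute_type(on_gpu=False))
```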

3.4 – batch_size

Type Integer; default is 16
Description The number of audio samples to be processed together in one batch during transcription. Adjusting this can help manage memory usage and processing speed.

3.5 – hf_token

Type String
Description The user’s Hugging Face token, as described in the function methods.

3.6 – language

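Type String; optional. If left empty, the function resorts to auto-detection.
Description Language of the source audio file, specified using one of the codes below. Each line gives the language name followed by its code.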
english en
chinese zh
german de
spanish es
russian ru
korean ko
french fr
japanese ja
portuguese pt
turkish tr
polish pl
catalan ca
dutch nl
arabic ar
swedish sv
italian it
indonesian id
hindi hi
finnish fi
vietnamese vi
hebrew he
ukrainian uk
greek el
malay ms
czech cs
romanian ro
danish da
hungarian hu
tamil ta
norwegian no
thai th
urdu ur
croatian hr
bulgarian bg
lithuanian lt
latin la
maori mi
malayalam ml
welsh cy
slovak sk
telugu te
persian fa
latvian lv
bengali bn
serbian sr
azerbaijani az
slovenian sl
kannada kn
estonian et
macedonian mk
breton br
basque eu
icelandic is
armenian hy
nepali ne
mongolian mn
bosnian bs
kazakh kk
albanian sq
swahili sw
galician gl
marathi mr
punjabi pa
sinhala si
khmer km
shona sn
yoruba yo
somali so
afrikaans af
occitan oc
georgian ka
belarusian be
tajik tg
sindhi sd
gujarati gu
amharic am
yiddish yi
lao lo
uzbek uz
faroese fo
haitian creole ht
pashto ps
turkmen tk
nynorsk nn
maltese mt
sanskrit sa
luxembourgish lb
myanmar my
tibetan bo
tagalog tl
malagasy mg
assamese as
tatar tt
hawaiian haw
lingala ln
hausa ha
bashkir ba
javanese jw
sundanese su

3.7 – max_speakers

Type Integer; optional
Description This parameter sets the maximum number of speakers to be identified and labeled in audio or video transcriptions.

3.8 – context

Type String; optional
Description If the source audio is a recording of a known clinical scale, the name of that clinical scale. If a scale is provided, max_speakers is assumed to be 2.

4 – Outputs

4.1 – transcript_json

Type JSON
Description A word-wise and phrase-wise transcript, saved as a JSON.

4.2 – transcript_text

Type String
Description The entire transcription, including punctuation, compiled into a string.

5 – Dependencies

Below are the dependencies specific to this function.

Dependency | License | Justification
WhisperX | BSD-4 | Library for transcribing and diarizing speakers in recordings.
Whisper | MIT | Library for transcribing speech.
Pyannote | MIT | Library for speaker diarization.
SentenceTransformer | Apache 2.0 | A pre-trained BERT-based sentence transformer model used to compute the similarity between the transcribed speech and the expected rater prompts.