Speech Transcription with Whisper v1.1
Date completed | March 19, 2024 |
Release where first appeared | OpenWillis v2.1 |
Researcher / Developer | Vijay Yadav, Anzar Abbas, Georgios Efstathiadis |
```python
import openwillis as ow

transcript_json, transcript_text = ow.speech_transcription_whisper(
    filepath='', model='', compute_type='', batch_size='', hf_token='',
    language='', min_speakers='', max_speakers='', context=''
)
```
This function transcribes speech into text using WhisperX, an open source library built atop Whisper, a speech transcription model developed by OpenAI. The function allows for offline speech transcription where computational resources are available. It is best suited for researchers transcribing speech on institutional or cloud-based machines, ideally with GPU access.
The function allows for multiple models: `tiny`, `small`, `medium`, `large`, and `large-v2`.
- The tiny model, with 39M parameters, is suitable for personal machines and can handle relatively short audio files.
- The small model, with 244M parameters, offers a balance of performance and file length.
- The medium model, with 769M parameters, is a step up for processing longer audio files.
- On the other hand, the large and large-v2 models, with 1550M parameters, demand GPU resources for efficient processing of large recordings.
Naturally, transcription accuracy will be higher with the larger models. There is also the option to change the default compute type and batch size for the transcription. Refer to the WhisperX instruction set for further information on the computational resources required.
The user will also need a Hugging Face token to access the underlying models. To acquire a token, they will need to create an account on Hugging Face, accept user conditions for the segmentation, voice activity detection, and diarization models, and go to account settings to access their token.
Whisper can support several languages/dialects, namely Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh. The language codes for each can be found below.
If no language is specified, Whisper will automatically detect the language and process it as such. However, language classification accuracy may vary depending on the language and its similarity to other languages. We recommend always specifying the language, if known.
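For illustration, below is a minimal sketch of a basic call that follows the signature shown above; the file path and Hugging Face token are hypothetical placeholders.

```python
import openwillis as ow

# Hypothetical example: transcribe an English recording on CPU with the
# medium model. The file path and token below are placeholders.
transcript_json, transcript_text = ow.speech_transcription_whisper(
    filepath='session_recording.wav',  # hypothetical local audio file
    model='medium',                    # tiny / small / medium / large / large-v2
    compute_type='int16',              # default for CPU; float16 when running on GPU
    batch_size=16,                     # default batch size
    hf_token='hf_xxxxxxxxxxxx',        # placeholder Hugging Face access token
    language='en'                      # specify the language when known
)
```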
In case of a straightforward transcription, the function simply passes the source audio through Whisper.
In `transcript_json`, a word-by-word and phrase-by-phrase transcript is saved as a JSON. In `transcript_text`, a string of the entire transcript, including punctuation, is saved.
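For example, assuming the call shown earlier has already run, the two outputs could be written to disk as follows; the file names are illustrative.

```python
import json

# Save the word- and phrase-level transcript as JSON and the full
# transcript string as plain text. File names are illustrative.
with open('session_transcript.json', 'w', encoding='utf-8') as f:
    json.dump(transcript_json, f, ensure_ascii=False, indent=2)

with open('session_transcript.txt', 'w', encoding='utf-8') as f:
    f.write(transcript_text)
```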
When the source audio contains multiple speakers, the function will automatically label each speaker in the output JSON as `speaker0`, `speaker1`, `speaker2`, and so on at both the word and the phrase level.
If the number of speakers in the source audio is known to the user, they may specify `max_speakers = n` to limit the number of speakers that will be labeled. Please note that this is only an upper limit; if the model detects fewer speakers, the output JSON will reflect only those it detects.
As is to be expected, speaker labeling is not 100% accurate. When a phrase is assigned to one speaker but contains words labeled as a different speaker, those words are reassigned the phrase-level speaker label.
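As a sketch, a recording known to contain at most two speakers could be processed as follows; the file path and token are hypothetical placeholders, and the `segments` / `speaker` key names used when inspecting the output are assumptions made for illustration.

```python
import openwillis as ow

# Hypothetical two-speaker recording; max_speakers is an upper limit only,
# so fewer speakers may appear in the output if fewer are detected.
transcript_json, transcript_text = ow.speech_transcription_whisper(
    filepath='dialogue_recording.wav',  # hypothetical local audio file
    model='large-v2',
    hf_token='hf_xxxxxxxxxxxx',         # placeholder token
    language='en',
    max_speakers=2
)

# Inspect which speaker labels were assigned at the phrase level.
# The 'segments' and 'speaker' keys are assumed here for illustration
# and may differ from the actual JSON layout.
labels = {segment.get('speaker') for segment in transcript_json.get('segments', [])}
print(sorted(label for label in labels if label))
```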
If the source audio is a recording of a structured clinical interview, the user may specify the scale being administered using the `context` parameter, and the function will automatically identify which speaker is the clinician and which is the participant. Then, at the word and phrase level, the speaker label is changed to either `clinician` or `participant`.
When `context` is specified, the user must set `speaker_labels = 'True'` and `max_speakers = 2` (see the example call after the list of supported contexts below).
Currently supported contexts are the following clinical scales:
- MADRS, conducted in accordance with the SIGMA, for which `context = 'madrs'`.
- PANSS, conducted in accordance with the SCI-PANSS, for which `context = 'panss'`.
- HAM-A, conducted in accordance with the Hamilton Anxiety Rating Scale (SIGH-A), for which `context = 'ham-a'`.
- CAPS past week, conducted in accordance with the CAPS for DSM-5 (CAPS-5) Past Week Version, for which `context = 'caps-past-week'`.
- CAPS past month, conducted in accordance with the CAPS for DSM-5 (CAPS-5) Past Month Version, for which `context = 'caps-past-month'`.
- CAPS DSM-IV, conducted in accordance with the Clinician-Administered PTSD Scale for DSM-IV, for which `context = 'caps-iv'`.
- MINI, conducted in accordance with Version 7.0.2 for DSM-5, for which `context = 'mini'`.
- CAINS, conducted in accordance with the CAINS (v1.0), for which `context = 'cains'`.
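As a sketch, a MADRS interview could be processed as follows, using the parameters from the signature above; the file path and token are hypothetical placeholders.

```python
import openwillis as ow

# Hypothetical structured MADRS interview with one clinician and one
# participant. With context specified, speakers are relabeled as
# 'clinician' and 'participant' in the output JSON.
transcript_json, transcript_text = ow.speech_transcription_whisper(
    filepath='madrs_interview.wav',   # hypothetical local audio file
    model='large-v2',
    hf_token='hf_xxxxxxxxxxxx',       # placeholder token
    language='en',
    max_speakers=2,                   # required upper limit when using context
    context='madrs'                   # scale administered in the recording
)
```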
Speaker identification is done by comparing the transcribed text with expected rater prompts from the clinical scale. This comparison is done using a pre-trained multilingual sentence transformer model, which maps sentences to a pre-trained embedding space based on their underlying meaning. The embeddings are compared using cosine similarity: the closer two embeddings are, the more similar the sentences are in meaning. The speaker whose speech more closely matches the expected rater prompts is labeled as the `clinician`, while the other speaker is labeled as the `participant`. This comparison is agnostic of the language the speech is in.
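The snippet below is a rough, standalone sketch of this idea using the sentence-transformers package; it is not the function's internal implementation, and the model name, prompts, and utterances are assumptions chosen for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# Rough sketch of prompt-based speaker identification, not the internal
# implementation. Model name, prompts, and utterances are illustrative.
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

expected_rater_prompts = [
    "Have you felt sad or low in spirits over the past week?",
    "How has your sleep been recently?",
]
speaker_utterances = {
    "speaker0": ["Over the last week, have you been feeling down or sad?"],
    "speaker1": ["I have been feeling pretty low most days."],
}

prompt_embeddings = model.encode(expected_rater_prompts, convert_to_tensor=True)

scores = {}
for speaker, utterances in speaker_utterances.items():
    utterance_embeddings = model.encode(utterances, convert_to_tensor=True)
    # Average cosine similarity between this speaker's speech and the prompts
    scores[speaker] = util.cos_sim(utterance_embeddings, prompt_embeddings).mean().item()

clinician = max(scores, key=scores.get)   # most prompt-like speaker
participant = min(scores, key=scores.get)
print(f"clinician: {clinician}, participant: {participant}")
```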
`filepath`
Type | String |
Description | A local path to the file that the user wants to process. Most audio and video file types are supported. |
`model`
Type | String; default is `tiny` |
Description | Name of the model the user wants to utilize, with the options being `tiny`, `small`, `medium`, `large`, and `large-v2`. |
`compute_type`
Type | String; default is `int16` |
Description | The type of computation to be used during transcription. The default `int16` applies when running on CPU; when running on GPU, `compute_type` should be `float16`. |
`batch_size`
Type | Integer; default is 16 |
Description | The number of audio samples to be processed together in one batch during transcription. Adjusting this can help manage memory usage and processing speed. |
`hf_token`
Type | String |
Description | The user’s Hugging Face token, as described in the function methods above |
`language`
Type | String; optional; if left empty, resorts to language auto-detection |
Description | Language of the source audio file, specified using one of the codes below |
Language | Code |
english | en |
chinese | zh |
german | de |
spanish | es |
russian | ru |
korean | ko |
french | fr |
japanese | ja |
portuguese | pt |
turkish | tr |
polish | pl |
catalan | ca |
dutch | nl |
arabic | ar |
swedish | sv |
italian | it |
indonesian | id |
hindi | hi |
finnish | fi |
vietnamese | vi |
hebrew | he |
ukrainian | uk |
greek | el |
malay | ms |
czech | cs |
romanian | ro |
danish | da |
hungarian | hu |
tamil | ta |
norwegian | no |
thai | th |
urdu | ur |
croatian | hr |
bulgarian | bg |
lithuanian | lt |
latin | la |
maori | mi |
malayalam | ml |
welsh | cy |
slovak | sk |
telugu | te |
persian | fa |
latvian | lv |
bengali | bn |
serbian | sr |
azerbaijani | az |
slovenian | sl |
kannada | kn |
estonian | et |
macedonian | mk |
breton | br |
basque | eu |
icelandic | is |
armenian | hy |
nepali | ne |
mongolian | mn |
bosnian | bs |
kazakh | kk |
albanian | sq |
swahili | sw |
galician | gl |
marathi | mr |
punjabi | pa |
sinhala | si |
khmer | km |
shona | sn |
yoruba | yo |
somali | so |
afrikaans | af |
occitan | oc |
georgian | ka |
belarusian | be |
tajik | tg |
sindhi | sd |
gujarati | gu |
amharic | am |
yiddish | yi |
lao | lo |
uzbek | uz |
faroese | fo |
haitian creole | ht |
pashto | ps |
turkmen | tk |
nynorsk | nn |
maltese | mt |
sanskrit | sa |
luxembourgish | lb |
myanmar | my |
tibetan | bo |
tagalog | tl |
malagasy | mg |
assamese | as |
tatar | tt |
hawaiian | haw |
lingala | ln |
hausa | ha |
bashkir | ba |
javanese | jw |
sundanese | su |
`max_speakers`
Type | Integer; optional |
Description | The maximum number of speakers to be identified and labeled in audio or video transcriptions. |
`context`
Type | String; optional |
Description | If the source audio is a recording of a known clinical scale, the scale being administered. If a scale is provided, `max_speakers` is assumed to be 2. |
`transcript_json`
Type | JSON |
Description | A word-wise and phrase-wise transcript saved as a JSON |
`transcript_text`
Type | String |
Description | The transcription, compiled into a string |
Below are dependencies specific to calculation of this measure.
Dependency | License | Justification |
WhisperX | BSD-4 | Library for transcribing and diarizing speakers in recordings. |
Whisper | MIT | Library for transcribing speech |
Pyannote | MIT | Library for speaker diarization |
SentenceTransformer | Apache 2.0 | A pre-trained BERT-based sentence transformer model used to compute the similarity between transcribed speech and expected rater prompts. |
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.