
Speaker separation v1.2

Date completed September 26th, 2023
Release where first appeared OpenWillis v1.4
Researcher / Developer Vijay Yadav

1 – Use

Local processing

With the json input coming from the local Speech Transcription function:

import openwillis as ow

signal_label = ow.speaker_separation(filepath = 'data.wav', json_response = '{}', hf_token = 'abc', model = 'pyannote', c_scale = '')

Saving audio files

To use the signal_label output to save the separated audio files:

import openwillis as ow

ow.to_audio(filepath = 'data.wav', signal_label = signal_label, out_dir = out_dir)

2 – Methods

This Speaker Separation function separates an audio file with two speakers into two audio signals, each containing speech only from one of the two speakers. The to_audio function is used to save that audio signal as an audio file on the user’s drive.

By default, this function uses the pyannote speaker diarization model to determine the timepoints when each speaker starts and stops speaking. Alternatively, the user can choose a faster but potentially less accurate model (pyannote-diart) by specifying it in the model parameter.

Additionally, the function requires a transcribed json response from the local Speech Transcription function (note: not the Speech Transcription Cloud function). Using the word timepoints in the json, the function separates the audio signal for each speaker and returns a dictionary. The dictionary's keys are the labels 'speaker0' and 'speaker1', and the values hold each speaker's audio signal in numpy array format.
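As a minimal sketch, the returned dictionary can be inspected as follows (signal_label is the output of the speaker_separation call shown in Section 1):

import numpy as np

# signal_label comes from the speaker_separation call in Section 1
for label, signal in signal_label.items():
    # label is 'speaker0' or 'speaker1'; signal holds that speaker's audio samples
    print(label, np.asarray(signal).shape)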

If the audio is from a structured clinical interview and the user wants to distinguish the clinician from the participant, they can provide an optional input parameter (Section 3.5) to specify the clinical scale. The function has built-in knowledge of the supported scales and automatically identifies each speaker, returning a dictionary with the keys 'clinician' and 'participant', whose values hold the audio signals in numpy array format. Currently, PANSS and MADRS are supported.
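As an illustration, the call from Section 1 with a clinical scale specified might look like the sketch below; json_response stands in for the output of the local Speech Transcription function:

import openwillis as ow

# same arguments as in Section 1, but with a supported scale passed to c_scale
signal_label = ow.speaker_separation(filepath = 'data.wav', json_response = json_response, hf_token = 'abc', model = 'pyannote', c_scale = 'panss')

# keys are now 'clinician' and 'participant' rather than 'speaker0' and 'speaker1'
clinician_signal = signal_label['clinician']
participant_signal = signal_label['participant']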

To do so, the function takes the json response generated by the Speech Transcription function, the audio file, and the chosen diarization model (pyannote/pyannote-diart) as inputs. Based on the diarization timepoints, it filters the transcription content and prepares a separate transcription for each speaker. It then compares each speaker's transcription with the expected clinician prompts for the specified clinical scale.

To compare the transcriptions, the function searches for each expected clinician prompt within the transcribed text and calculates probability scores for the top five prompts that are most likely to be present. These scores are averaged, and the transcription with the higher overall probability score is labeled as belonging to the clinician, while the other is labeled as belonging to the participant. Note that speaker identification relies on speaker timepoints and transcription timepoints provided by separate models, so there may be a slight lag of a few words when filtering the transcription for each speaker.
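The scoring step might look roughly like the sketch below. This is not the library's actual implementation: the sentence-transformer model name and all variable names are assumptions; only the top-five averaging logic described above is reproduced.

from sentence_transformers import SentenceTransformer, util

def clinician_score(speaker_text, scale_prompts, st_model):
    # similarity between the speaker's transcript and every expected clinician prompt
    text_emb = st_model.encode(speaker_text, convert_to_tensor=True)
    prompt_emb = st_model.encode(scale_prompts, convert_to_tensor=True)
    sims = util.cos_sim(prompt_emb, text_emb).squeeze(1)
    # average the five best-matching prompts
    top5 = sims.sort(descending=True).values[:5]
    return float(top5.mean())

# hypothetical inputs: per-speaker transcripts and the expected prompts for the scale
scale_prompts = ['<expected clinician prompt 1>', '<expected clinician prompt 2>']
text_speaker0 = '<transcription filtered for speaker0>'
text_speaker1 = '<transcription filtered for speaker1>'

st_model = SentenceTransformer('all-MiniLM-L6-v2')  # model choice is an assumption
score0 = clinician_score(text_speaker0, scale_prompts, st_model)
score1 = clinician_score(text_speaker1, scale_prompts, st_model)
# the transcript with the higher average score is labeled as the clinician
labels = ('clinician', 'participant') if score0 >= score1 else ('participant', 'clinician')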

The to_audio function exports audio signals from a dictionary to individual WAV files. It takes a file path, a dictionary containing labeled speakers and their respective audio signals as numpy arrays, and an output directory. A WAV file for each speaker is saved in the specified output directory with a unique name in the format "filename_speakerlabel.wav", where "filename" refers to the original file name and "speakerlabel" is the speaker's label.
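For reference, a rough hand-rolled equivalent of that export and naming convention is sketched below, using the soundfile package (an assumption; to_audio handles this internally):

import os
import soundfile as sf

filepath = 'data.wav'
out_dir = 'separated_audio'
os.makedirs(out_dir, exist_ok=True)

# reuse the source file's sample rate for the exported files
_, sample_rate = sf.read(filepath)
filename = os.path.splitext(os.path.basename(filepath))[0]

# signal_label is the dictionary returned by speaker_separation (Section 4.1)
for speaker_label, signal in signal_label.items():
    out_path = os.path.join(out_dir, filename + '_' + speaker_label + '.wav')
    sf.write(out_path, signal, sample_rate)  # e.g. data_speaker0.wav, data_speaker1.wav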


3 – Inputs

3.1 – filepath

Type str
Description path to audio file to be separated

3.2 – hf_token

Type str
Description Hugging Face token necessary to use the underlying model(s). To acquire a token, create an account on Hugging Face, then accept the user conditions for the models found at https://huggingface.co/pyannote/segmentation, https://huggingface.co/pyannote/embedding, and https://huggingface.co/pyannote/speaker-diarization. The token to be passed to this function can be found in your account settings under access tokens (create a new one if you don't already have one).

3.3 – json_response

Type json
Description json output that lists each word transcribed, the confidence level associated with that word’s transcription, its utterance start time, and its utterance end time.
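Purely as an illustration of the fields listed above (the exact key names depend on the Speech Transcription function's output and may not match this sketch):

# illustrative only: key names are assumptions
json_response = {
    'result': [
        {'word': 'hello', 'conf': 0.98, 'start': 0.42, 'end': 0.77},
        {'word': 'there', 'conf': 0.95, 'start': 0.81, 'end': 1.10}
    ]
}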

3.4 – model

Type str; optional, default model is 'pyannote'
Description Can be left at the default ('pyannote') or set to model = 'pyannote-diart'. The former is slower but more accurate; the latter allows batch processing for quicker results with comparatively lower accuracy.

3.5 – c_scale

Type str; optional, default c_scale = ''
Description If the user wants to identify the speakers (clinician vs. participant) in the audio file, they can enter the name of the clinical scale that the audio file is capturing as an optional parameter. The function currently supports PANSS (c_scale = 'panss') and MADRS (c_scale = 'madrs').

4 – Outputs

4.1 – signal_label

Type dictionary
Description A dictionary with the speaker label as the key and the audio signal numpy array as the value.
signal_label = {'speaker0': [2, 35, 56, -52 … 13, -14], 'speaker1': [12, 45, 26, -12 … 43, -54]}

^ What the dictionary looks like


5 – Dependencies

Below are dependencies specific to calculation of this measure.

Dependency License Justification
diart MIT Offers a good balance of diarization speed and accuracy while still running locally under an open-source license
pyannote MIT A Python library for speaker diarization, speaker recognition, and speech activity detection, with pre-trained models and tools for these tasks, as well as the ability to train your own models.
SentenceTransformer Apache 2.0 A pre-trained BERT-based sentence transformer model used to compute the similarity between the transcribed speech and the expected clinician prompts.