
Speaker separation v1.2

Date completed September 26th, 2023
Release where first appeared OpenWillis v1.4
Researcher / Developer Vijay Yadav

1 – Use

Local processing

With the json input coming from the local Speech Transcription function:

import openwillis as ow

signal_label = ow.speaker_separation(filepath = 'data.wav', json_response = '{}', hf_token = 'abc', model = 'pyannote', c_scale = '')

Saving audio files

To use the signal_label output to save the separated audio files:

import openwillis as ow

ow.to_audio(filepath = 'data.wav', signal_label = signal_label, out_dir = out_dir)

2 – Methods

This Speaker Separation function separates an audio file with two speakers into two audio signals, each containing speech only from one of the two speakers. The to_audio function is used to save that audio signal as an audio file on the user’s drive.

By default, this function uses the pyannote speaker diarization model to determine the timepoints when each speaker starts and stops speaking. Alternatively, the user can choose a faster but potentially less accurate model (pyannote-diart) by specifying it in the model parameter.

Additionally, the function requires a transcribed json response from the local Speech Transcription function (note: not the Speech Transcription Cloud function). Using the word timepoints in the json, the function separates the audio signal for each speaker and returns a dictionary. The dictionary's keys are the labels 'speaker0' and 'speaker1', and the values hold each speaker's audio signal in numpy array format.
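As a minimal sketch, the returned dictionary can be inspected as follows (signal_label is the output of the speaker_separation call shown in Section 1):

import numpy as np

# signal_label comes from the speaker_separation call in Section 1
for label, signal in signal_label.items():
    # label is 'speaker0' or 'speaker1'; signal holds that speaker's audio samples
    print(label, np.asarray(signal).shape)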

If the audio is from a structured clinical interview and the user wants to distinguish the clinician from the participant, they can provide an optional input parameter (Section 3.5) to specify the clinical scale. The function has built-in knowledge of the supported scales and automatically identifies each speaker, returning a dictionary with the keys 'clinician' and 'participant', whose values hold the audio signals in numpy array format. Currently, PANSS and MADRS are supported.
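As an illustration, the call from Section 1 with a clinical scale specified might look like the sketch below; json_response stands in for the output of the local Speech Transcription function:

import openwillis as ow

# same arguments as in Section 1, but with a supported scale passed to c_scale
signal_label = ow.speaker_separation(filepath = 'data.wav', json_response = json_response, hf_token = 'abc', model = 'pyannote', c_scale = 'panss')

# keys are now 'clinician' and 'participant' rather than 'speaker0' and 'speaker1'
clinician_signal = signal_label['clinician']
participant_signal = signal_label['participant']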

To do so, the function takes the json response generated by the Speech Transcription function, the audio file, and the chosen diarization model (pyannote/pyannote-diart) as inputs. Based on the diarization timepoints, it filters the transcription content and prepares a separate transcription for each speaker. It then compares each speaker's transcription with the expected clinician prompts for the specified clinical scale.

To compare the transcriptions, the function searches for each expected clinician prompt within the transcribed text and calculates probability scores for the top five prompts that are most likely to be present. These scores are averaged, and the transcription with the higher overall probability score is labeled as belonging to the clinician, while the other is labeled as belonging to the participant. Note that speaker identification relies on speaker timepoints and transcription timepoints provided by separate models, so there may be a slight lag of a few words when filtering the transcription for each speaker.
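The scoring step might look roughly like the sketch below. This is not the library's actual implementation: the sentence-transformer model name and all variable names are assumptions; only the top-five averaging logic described above is reproduced.

from sentence_transformers import SentenceTransformer, util

def clinician_score(speaker_text, scale_prompts, st_model):
    # similarity between the speaker's transcript and every expected clinician prompt
    text_emb = st_model.encode(speaker_text, convert_to_tensor=True)
    prompt_emb = st_model.encode(scale_prompts, convert_to_tensor=True)
    sims = util.cos_sim(prompt_emb, text_emb).squeeze(1)
    # average the five best-matching prompts
    top5 = sims.sort(descending=True).values[:5]
    return float(top5.mean())

# hypothetical inputs: per-speaker transcripts and the expected prompts for the scale
scale_prompts = ['<expected clinician prompt 1>', '<expected clinician prompt 2>']
text_speaker0 = '<transcription filtered for speaker0>'
text_speaker1 = '<transcription filtered for speaker1>'

st_model = SentenceTransformer('all-MiniLM-L6-v2')  # model choice is an assumption
score0 = clinician_score(text_speaker0, scale_prompts, st_model)
score1 = clinician_score(text_speaker1, scale_prompts, st_model)
# the transcript with the higher average score is labeled as the clinician
labels = ('clinician', 'participant') if score0 >= score1 else ('participant', 'clinician')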

The to_audio function exports audio signals from a dictionary to individual WAV files. It takes a file path, a dictionary containing labeled speakers and their respective audio signals as numpy arrays, and an output directory. A WAV file for each speaker is saved in the specified output directory with a unique name in the format "filename_speakerlabel.wav", where "filename" refers to the original file name and "speakerlabel" is the speaker's label.
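For reference, a rough hand-rolled equivalent of that export and naming convention is sketched below, using the soundfile package (an assumption; to_audio handles this internally):

import os
import soundfile as sf

filepath = 'data.wav'
out_dir = 'separated_audio'
os.makedirs(out_dir, exist_ok=True)

# reuse the source file's sample rate for the exported files
_, sample_rate = sf.read(filepath)
filename = os.path.splitext(os.path.basename(filepath))[0]

# signal_label is the dictionary returned by speaker_separation (Section 4.1)
for speaker_label, signal in signal_label.items():
    out_path = os.path.join(out_dir, filename + '_' + speaker_label + '.wav')
    sf.write(out_path, signal, sample_rate)  # e.g. data_speaker0.wav, data_speaker1.wav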


3 – Inputs

3.1 – filepath

Type str
Description path to audio file to be separated

3.2 – hf_token

Type str
Description Hugging Face token necessary to use the underlying model(s). To acquire a token, create an account on Hugging Face, then accept the user conditions for the models found at https://huggingface.co/pyannote/segmentation, https://huggingface.co/pyannote/embedding, and https://huggingface.co/pyannote/speaker-diarization. The token to be passed to this function can be found in your account settings under access tokens (create a new one if you don't already have one).

3.3 – json_response

Type json
Description json output that lists each word transcribed, the confidence level associated with that word’s transcription, its utterance start time, and its utterance end time.
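Purely as an illustration of the fields listed above (the exact key names depend on the Speech Transcription function's output and may not match this sketch):

# illustrative only: key names are assumptions
json_response = {
    'result': [
        {'word': 'hello', 'conf': 0.98, 'start': 0.42, 'end': 0.77},
        {'word': 'there', 'conf': 0.95, 'start': 0.81, 'end': 1.10}
    ]
}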

3.4 – model

Type str; optional, default model is 'pyannote'
Description Can be left at the default ('pyannote') or set to model = 'pyannote-diart'. The former is slower but more accurate; the latter allows batch processing for quicker results with comparatively lower accuracy.

3.5 – c_scale

Type str; optional, default c_scale = ''
Description If the user wants to identify the speakers (clinician vs. participant) in the audio file, they can enter the name of the clinical scale that the audio file is capturing as an optional parameter. The function currently supports PANSS (c_scale = 'panss') and MADRS (c_scale = 'madrs').

4 – Outputs

4.1 – signal_label

Type dictionary
Description A dictionary with the speaker label as the key and the audio signal numpy array as the value.
signal_label = {'speaker0': [2, 35, 56, -52 … 13, -14], 'speaker1': [12, 45, 26, -12 … 43, -54]}

^ What the dictionary looks like


5 – Dependencies

Below are dependencies specific to calculation of this measure.

Dependency License Justification
diart MIT Offers a good balance of diarization speed and accuracy while still running locally under an open-source license
pyannote MIT A Python library for speaker diarization, speaker recognition, and speech activity detection, with pre-trained models and tools for these tasks, as well as the ability to train your own models.
SentenceTransformer Apache 2.0 A pre-trained BERT-based sentence transformer model used to compute the similarity between the transcribed speech and the expected clinician prompts.