Speaker separation v1.2
Date completed | September 26th, 2023 |
Release where first appeared | OpenWillis v1.4 |
Researcher / Developer | Vijay Yadav |
With the json input coming from the local Speech Transcription function:

```python
import openwillis as ow

signal_label = ow.speaker_separation(filepath = 'data.wav', json_response = '{}', hf_token = 'abc', model = 'pyannote', c_scale = '')
```
To use the signal_label output to save the separated audio files:

```python
import openwillis as ow

out_dir = 'separated_audio/'  # placeholder output directory
ow.to_audio('data.wav', signal_label, out_dir)
```
This Speaker Separation function separates an audio file with two speakers into two audio signals, each containing speech only from one of the two speakers. The to_audio function is used to save those audio signals as audio files on the user’s drive.
By default, this function uses the pyannote speaker diarization model to determine the timepoints when each speaker starts and finishes speaking. Alternatively, the user can choose a faster but potentially less accurate model (pyannote-diart) by specifying it as the model parameter.
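For example, to opt into the faster model (the file path, token, and json values below are placeholders, as in the usage example above):

```python
import openwillis as ow

# Request the faster, batch-capable diarization model instead of the default.
signal_label = ow.speaker_separation(filepath = 'data.wav', json_response = '{}', hf_token = 'abc', model = 'pyannote-diart')
```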
Additionally, the function requires a transcribed json response from the local Speech Transcription function (note: not the Speech Transcription Cloud function). Using the word timepoints in the json, the function separates the audio signal for each speaker and returns a dictionary. The dictionary's keys are the labels 'speaker0' and 'speaker1', and the values hold each speaker's audio signal in numpy array format.
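As a quick illustration of this output, below is a minimal sketch that inspects the returned dictionary; the sampling rate is an assumption, so use the rate of your own input file:

```python
import numpy as np

sample_rate = 16000  # assumed sampling rate of 'data.wav'

# signal_label is the dictionary returned by ow.speaker_separation above.
for label, signal in signal_label.items():
    signal = np.asarray(signal)
    print(f"{label}: {len(signal)} samples, ~{len(signal) / sample_rate:.1f} seconds of speech")
```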
If the audio is from a structured clinical interview and the user wants to distinguish the clinician from the participant, they can provide an optional input parameter (c_scale, Section 3.6) to specify the scale. The function has built-in knowledge of the supported scales and automatically identifies each speaker, returning a dictionary with keys labeled 'clinician' and 'participant' that hold the audio signals in numpy array format. Currently, PANSS and MADRS are supported.
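For example, for a PANSS interview (the file path, token, and json values are placeholders):

```python
import openwillis as ow

# With a scale specified, the dictionary keys become 'clinician' and 'participant'.
signal_label = ow.speaker_separation(filepath = 'panss_interview.wav', json_response = '{}', hf_token = 'abc', c_scale = 'panss')

clinician_signal = signal_label['clinician']
participant_signal = signal_label['participant']
```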
To identify the speakers, the function takes the json response generated by the Speech Transcription function, the audio file, and the chosen diarization model (pyannote/pyannote-diart) as inputs. Based on the diarization timepoints, it filters the transcription content and prepares a separate transcription for each speaker. It then compares each speaker's transcription with the expected rater prompts for the specified clinical scale.
To compare the transcriptions, the function searches for each expected clinician prompt within the transcribed text and calculates probability scores for the five prompts that are most likely to be present. These scores are averaged, and the transcribed text with the higher overall probability score is labeled as belonging to the clinician, while the other is labeled as belonging to the participant. Note that speaker identification depends on the diarization timepoints and the transcription timepoints, which come from separate models, so there may be a slight lag of a few words when filtering the transcription for each speaker.
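The sketch below illustrates this scoring idea only; it is not the library's internal implementation, and the sentence-transformer model name, prompt list, and transcripts are assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model for illustration; any sentence embedding model works similarly.
embedder = SentenceTransformer('all-MiniLM-L6-v2')

def clinician_score(transcript: str, expected_prompts: list) -> float:
    """Average the similarity scores of the expected clinician prompts that
    best match this speaker's transcript (up to five prompts)."""
    prompt_emb = embedder.encode(expected_prompts, convert_to_tensor=True)
    text_emb = embedder.encode(transcript, convert_to_tensor=True)
    scores = util.cos_sim(prompt_emb, text_emb).squeeze(1).cpu().numpy()
    return float(np.mean(np.sort(scores)[-5:]))

# Hypothetical transcripts and prompts; the speaker with the higher score is the clinician.
speaker0_text = "how have you been sleeping over the past week"
speaker1_text = "not great, I keep waking up in the middle of the night"
example_prompts = ["How has your sleep been?", "How has your appetite been?"]

if clinician_score(speaker0_text, example_prompts) > clinician_score(speaker1_text, example_prompts):
    print("speaker0 is the clinician")
else:
    print("speaker1 is the clinician")
```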
The to_audio function exports the audio signals in such a dictionary to individual WAV files. It takes a file path, a dictionary containing labeled speakers and their respective audio signals as numpy arrays, and an output directory. One WAV file per speaker is saved in the specified output directory with a unique name in the format "filename_speakerlabel.wav", where "filename" refers to the original file name and "speakerlabel" is the label of the speaker.
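For reference, here is a minimal sketch of the saving behavior described above, not the library's implementation; the soundfile dependency and the sampling rate are assumptions, and the real to_audio function should be used in practice:

```python
import os

import soundfile as sf  # assumed here only for illustration

def save_speakers(filepath: str, speaker_dict: dict, out_dir: str, sample_rate: int = 16000):
    """Write one WAV file per speaker as '<filename>_<speakerlabel>.wav' in out_dir."""
    base = os.path.splitext(os.path.basename(filepath))[0]
    os.makedirs(out_dir, exist_ok=True)
    for label, signal in speaker_dict.items():
        out_path = os.path.join(out_dir, f"{base}_{label}.wav")  # e.g. data_speaker0.wav
        sf.write(out_path, signal, sample_rate)
```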
filepath
Type | str |
Description | path to the audio file to be separated |
hf_token
Type | str |
Description | Hugging Face token necessary to use the underlying model(s). To acquire a token, create an account on Hugging Face, then accept the user conditions for the models found under https://huggingface.co/pyannote/segmentation, https://huggingface.co/pyannote/embedding, and https://huggingface.co/pyannote/speaker-diarization. The token to be inserted as a parameter in this function can be found in your account settings under access tokens (create a new one if you don’t already have one). |
json_response
Type | json |
Description | json output that lists each word transcribed, the confidence level associated with that word’s transcription, its utterance start time, and its utterance end time. |
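For illustration, a single word entry in such a json response might look like the following; the field names are assumptions, so consult the Speech Transcription documentation for the exact schema:

```python
# Hypothetical word-level entry; actual field names may differ.
example_word = {
    "word": "hello",
    "conf": 0.94,   # transcription confidence
    "start": 1.24,  # utterance start time in seconds
    "end": 1.58,    # utterance end time in seconds
}
```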
model
Type | str; optional, default model = 'pyannote' |
Description | Can be either the default model ('pyannote') or model = 'pyannote-diart', with the former being slower but more accurate and the latter allowing batch processing for quicker results with comparatively lower accuracy. |
c_scale
Type | str; optional, default c_scale = '' |
Description | In case the user wants to identify the speakers (clinician vs. participant) in the audio file, they can enter the name of the clinical scale that the audio file is capturing as an optional parameter. The function currently supports PANSS, in which case c_scale = 'panss', and MADRS, in which case c_scale = 'madrs'. |
signal_label
Type | dictionary |
Description | A dictionary with the speaker label as the key and the audio signal numpy array as the value. |

Example of the output dictionary:

```
labels = {'speaker0': [2, 35, 56, -52, …, 13, -14], 'speaker1': [12, 45, 26, -12, …, 43, -54]}
```
Below are the dependencies specific to the calculation of this measure.
Dependency | License | Justification |
diart | MIT | A nice balance of diarization speed and accuracy while still being able to run locally and having an open source license |
pyannote | MIT | A Python library for speaker diarization, speaker recognition, and speech activity detection tasks, with pre-trained models and tools for these tasks, as well as the ability to train your own models. |
SentenceTransformer | Apache 2.0 | A pre-trained BERT-based sentence transformer model used to compute the similarity between each speaker's transcribed speech and the expected clinician prompts. |