Speech Characteristics v3.2
| Date completed | Sep 20, 2024 |
| --- | --- |
| Release where first appeared | OpenWillis v2.3 |
| Researcher / Developer | Georgios Efstathiadis, Vijay Yadav, Michelle Worthington |
```python
import openwillis as ow

words, turns, summary = ow.speech_characteristics(
    json_conf='', language='', speaker_label='',
    min_turn_length='', min_coherence_turn_length='',
    option='coherence'
)
```
This function measures speech characteristics from a transcript of an individual's speech. The input transcript is the JSON output from any OpenWillis speech transcription function (Vosk, Whisper, or AWS).

By default, the function assumes the transcript contains speech in English, and all measures listed below will be populated. If the transcript is not in English, the user must set the `language` argument to something other than English, and only language-independent measures will be populated.

By default, the function also assumes the transcript contains speech from one speaker, and all measures below are calculated from the entire transcript. Per-turn measures are not populated in this case.

If multiple speakers are labeled in the transcript, the user must specify which speaker to quantify speech characteristics from using the `speaker_label` argument. All measures are then calculated only for that speaker, and per-turn measures are also populated.
The user may wish to calculate per-turn measures only on turns meeting a minimum length requirement, to focus on more substantive speech samples (for example, to avoid including one-word responses such as “yes” or “OK”). To do so, they may specify the minimum number of words a turn must have using `min_turn_length`. If `'coherence'` measures are also calculated, the user can additionally apply a minimum-word filter to the coherence measures only, using `min_coherence_turn_length`. By default this is set to 5, since most of these measures are only relevant when calculated on larger text segments.
The `option` parameter can be set to `'simple'`, which calculates only the measures listed here that are computationally inexpensive, or to `'coherence'` (the default), which additionally calculates the higher-order linguistic variables related to speech coherence.
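To make the argument interactions concrete, here is a hedged example for a diarized English transcript. The file path and speaker label are placeholders, and it assumes `json_conf` accepts the loaded JSON object returned by one of the transcription functions.

```python
import json

import openwillis as ow

# Placeholder path: the JSON produced by an OpenWillis speech transcription
# function (Vosk, Whisper, or AWS), saved to disk.
with open('transcript.json') as f:
    transcript_json = json.load(f)

words, turns, summary = ow.speech_characteristics(
    json_conf=transcript_json,
    language='en',
    speaker_label='speaker0',     # placeholder label from the transcript
    min_turn_length=5,            # ignore turns shorter than 5 words
    option='coherence'            # also compute the LLM-based coherence measures
)

print(summary[['speech_length_minutes', 'words_per_min']])
```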
The function's first output is a `words` dataframe, which contains a row for each word and measures specific to that word. These include:
- `pre_word_pause`: pause time before the word in seconds. Individual word timestamps in the input JSON are used to calculate pause lengths before each word. To avoid measuring potentially long silences prior to the start of speech in an audio file, the `pre_word_pause` for the first word in every file is set to `NaN`. To distinguish `pre_word_pause` from `pre_turn_pause` as defined later in this document, `pre_word_pause` for the first word in each turn is also set to `NaN`.
- `num_syllables`: number of syllables, identified using NLTK's SyllableTokenizer. This is an English-only measure.
- `part_of_speech`: part of speech associated with the word, identified using NLTK (Andreasen & Pfohl, 1976; Tang et al., 2021). This is an English-only measure. Possible values: `noun`, `verb`, `adjective`, `pronoun`, `adverb`, `determiner`.
- `first_person`: flag identifying whether the word is a first-person singular pronoun, i.e. 'I', 'me', 'my', or 'mine' (Andreasen & Pfohl, 1976; Tang et al., 2021). This is an English-only measure.
- `verb_tense`: tense of verbs identified using NLTK, i.e. whether the verb is in the past or present tense. This is an English-only measure.
- Speech coherence measures are also calculated at the word level (these measures are only calculated if the language is in this list):
  - `word_coherence`: LLM measure of word coherence, indicating the semantic similarity of each word to the immediately preceding word (Parola et al., 2023). A minimal sketch of this type of measure is shown after this list.
  - `word_coherence_5`: LLM measure of word coherence with a 5-word window, indicating the semantic similarity of each word to the words in a 5-word window (Parola et al., 2023).
  - `word_coherence_10`: LLM measure of word coherence with a 10-word window, indicating the semantic similarity of each word to the words in a 10-word window (Parola et al., 2023).
  - `word_coherence_variability_k` for k from 2 to 10: LLM measure of word-to-word variability at k inter-word distances, indicating the semantic similarity between each word and the following word at inter-word distance k (Parola et al., 2023).
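At their core, the word-level coherence measures are embedding similarities between nearby words. The sketch below illustrates the idea for `word_coherence` (similarity to the immediately preceding word) using a generic sentence-transformers embedding model; the embedding model and windowing logic used inside OpenWillis may differ.

```python
# Illustrative sketch only: cosine similarity between embeddings of consecutive
# words. The model choice (all-MiniLM-L6-v2) is an assumption for this example.
import numpy as np
from sentence_transformers import SentenceTransformer

def word_coherence(words, distance=1):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(words)                                # one vector per word
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize
    sims = [np.nan] * distance                               # no preceding word yet
    for i in range(distance, len(words)):
        sims.append(float(emb[i] @ emb[i - distance]))       # cosine similarity
    return sims

# e.g. word_coherence("the cat sat on the mat".split())
```

Setting `distance` greater than 1 gives a rough analogue of the `word_coherence_variability_k` measures.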
The function's second output is a `turns` dataframe, which is populated when a `speaker_label` is specified. A turn is defined as a segment of speech by one speaker, i.e., everything that individual says before a different speaker starts speaking. The turn-level output contains a row for each turn over the course of the file. Note that phrase-related measures work for other languages but are better suited for English. Per-turn measures include:
- `pre_turn_pause`: pause time before the turn in seconds. In case the first turn for the speaker is also the first turn in the audio file, this is set to `NaN` to correct for any potential delays between the start of the recording and the first detection of speech.
- `turn_length_minutes`: total length of the turn in minutes. This is the time from the beginning of the first word in the turn to the end of the last word in the turn, meaning it includes both the words spoken and the pauses between those words.
- `turn_length_words`: total number of words in the turn. This is simply a count of the number of words detected in the turn.
- `words_per_min`: rate of speech in words per minute. This is calculated by dividing `turn_length_words` by `turn_length_minutes`.
- `syllables_per_min`: articulation rate, i.e., syllables per minute. This is calculated by summing `num_syllables` from the per-word measures and dividing the sum by `turn_length_minutes`. This is an English-only measure.
- `speech_percentage`: speech percentage, i.e., time spoken over total time. This is the sum of the time taken to speak every word in the turn divided by `turn_length_minutes`, quantifying how much of the turn was spent speaking versus in the silences between words.
- `mean_pause_length`: mean pause time between words in seconds. This is the mean of all the `pre_word_pause` measurements for all words in the turn.
- `pause_variability`: pause variability in seconds. This is the variance of all the `pre_word_pause` measurements for all words in the turn.
- Emotional valence associated with the turn, calculated using vaderSentiment. To use these measures, we recommend that the `min_turn_length` argument be set to at least 5 words. These are English-only measures.
  - `sentiment_pos`: degree of positive valence, ranging from 0 to 1
  - `sentiment_neg`: degree of negative valence, ranging from 0 to 1
  - `sentiment_neu`: degree of neutral valence, ranging from 0 to 1
  - `sentiment_overall`: degree of overall valence, ranging from 0 to 1
- `mattr_5`, `mattr_10`, `mattr_25`, `mattr_50`, `mattr_100`: lexical diversity as measured by the moving average type-token ratio (MATTR) score with different window sizes. This is calculated by first lemmatizing words with `nltk`. To use this measure, we recommend that the `min_turn_length` argument be set to at least 5 words. These are English-only measures.
- `first_person_percentage`: first-person singular pronoun percentage, measured as the percentage of pronouns in the utterance that are first-person singular pronouns.
- First-person singular pronoun percentage x sentiment, measuring the interaction between the speech's sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023):
  - `first_person_sentiment_positive`: positive sentiment, measured as `(100-first_person_percentage)*sentiment_pos`.
  - `first_person_sentiment_negative`: negative sentiment, measured as `first_person_percentage*sentiment_neg`.
- `word_repeat_percentage`: repeated-word percentage, measured as the percentage of words in the utterance that are repeated (Stasak et al., 2019). This is calculated using a sliding window of 10 words to adjust for utterance length.
- `phrase_repeat_percentage`: repeated-phrase percentage, measured as the percentage of phrases in the utterance that are repeated (Stasak et al., 2019). This is calculated using a sliding window of 3 phrases to adjust for utterance length.
- Speech coherence measures calculated at the turn level:
  - `first_order_sentence_tangeniality`: LLM measure of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn (Tang et al., 2021; He et al., 2024; Elvevåg et al., 2007; Parola et al., 2023). This measure aims to answer the question: how much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question? This measure is only calculated if the language is in this list.
  - `second_order_sentence_tangeniality`: LLM measure of second-order sentence tangentiality; same as above, but instead of consecutive sentences it compares sentences separated by one sentence in the middle (Parola et al., 2023). This measure is only calculated if the language is in this list.
  - `turn_to_turn_tangeniality`: LLM measure of the similarity of the response to the other speaker, measured as the semantic similarity of the current turn to the previous turn of the other speaker (Tang et al., 2021). This measure is most meaningful in a clinical interview context where the analysis is run on the participant, and aims to answer the question: how much sense does an answer to the interviewer's question make? This measure is only calculated if the language is in this list.
  - `semantic_perplexity`: LLM measure of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is (Parola et al., 2023). This measure is only calculated if the language is in this list.
  - `semantic_perplexity_5`: LLM measure of semantic perplexity using a window of 5 words. This measure is only calculated if the language is in this list.
  - `semantic_perplexity_11`: LLM measure of semantic perplexity using a window of 11 words. This measure is only calculated if the language is in this list.
  - `semantic_perplexity_15`: LLM measure of semantic perplexity using a window of 15 words. This measure is only calculated if the language is in this list.
  The exact mathematical formula used to calculate the perplexity measures is the pseudo-perplexity

  $$\mathrm{PPPL}(W) = \exp\left(-\frac{1}{|W|}\sum_{i=1}^{|W|}\log P_{\mathrm{MLM}}\left(w_i \mid W_{\setminus i}\right)\right)$$

  where $W = (w_1, \ldots, w_{|W|})$ is the sequence of words in the turn (or window), $W_{\setminus i}$ is that sequence with the $i$-th word masked out, and $P_{\mathrm{MLM}}(w_i \mid W_{\setminus i})$ is the probability the masked language model assigns to the true word at position $i$. A minimal code sketch of this computation is shown after this list.
- `interrupt_flag`: flag for pre-turn pauses that are zero or negative, i.e. interruptions. This is set to `True` if the pre-turn pause is negative or zero (indicative of an interruption), and to `False` otherwise. This variable is used in the `summary` output to calculate `num_interrupts`.
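As a concrete illustration of the pseudo-perplexity formula above, the sketch below masks one token at a time with a masked language model and averages the resulting negative log-likelihoods. The specific model (`bert-base-uncased`) and the windowing behaviour are assumptions for this example and may not match what OpenWillis uses internally.

```python
# Illustrative sketch of pseudo-perplexity; not the exact OpenWillis implementation.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pseudo_perplexity(text, model_name="bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    with torch.no_grad():
        # Mask each token in turn (skipping [CLS]/[SEP]) and score the true token.
        for i in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[input_ids[i]].item())
    # exp of the mean negative log-likelihood, matching the formula above
    return float(torch.exp(torch.tensor(sum(nlls) / len(nlls))))

# e.g. pseudo_perplexity("I went to the store and bought some milk.")
```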
The `summary` dataframe compiles file-level information. For a single speaker, these measures are calculated by compiling information from the per-word output. For multiple speakers, they are calculated by compiling information from the per-turn output. Note that phrase-related measures work for other languages but are better suited for English.
- `file_length`: total length of the audio file in minutes.
- `speech_length_minutes`: total length of speech in minutes. For a single speaker, this is the time from the beginning of the first word to the end of the last word. For multiple speakers, this is the sum of `turn_length_minutes` across all turns.
- `speech_length_words`: total number of words spoken across the whole file. For a single speaker, this is the total count of words in the file. For multiple speakers, this is the sum of `turn_length_words` across all turns.
- `words_per_min`: rate of speech in words per minute across the whole file: `speech_length_words` / `speech_length_minutes`.
- `syllables_per_min`: articulation rate, i.e. syllables per minute across the whole file. For a single speaker, this is `num_syllables` summed over all words / `speech_length_minutes`. For multiple speakers, this is `num_syllables` summed over filtered turns / `speech_length_minutes`.
- `speech_percentage`: speech percentage, i.e. the time spoken divided by the total time (`speech_length_minutes` / `file_length`).
- `mean_pause_length`: mean pause time between words in seconds. This is the mean of `pre_word_pause` across all words in the file.
- `mean_pause_variability`: pause variability in seconds. This is the variability of `pre_word_pause` across all words in the file.
- Emotional valence associated with the speech, calculated using vaderSentiment. These measures are calculated on a string of the entire transcript. To use these measures, we recommend that the `min_turn_length` argument be set to at least 5 words. These are English-only measures.
  - `sentiment_pos`: degree of positive valence, ranging from 0 to 1
  - `sentiment_neg`: degree of negative valence, ranging from 0 to 1
  - `sentiment_neu`: degree of neutral valence, ranging from 0 to 1
  - `sentiment_overall`: degree of overall valence, ranging from 0 to 1
- `mattr_5`, `mattr_10`, `mattr_25`, `mattr_50`, `mattr_100`: lexical diversity as measured by the moving average type-token ratio (MATTR) score with different window sizes, calculated on a string of the entire transcript. These are English-only measures.
- `first_person_percentage`: first-person singular pronoun percentage, measured as the percentage of pronouns in the speech that are first-person singular pronouns, calculated on a string of the entire transcript.
- First-person singular pronoun percentage x sentiment, measuring the interaction between the speech's sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023):
  - `first_person_sentiment_positive`: positive sentiment, measured as `(100-first_person_percentage)*sentiment_pos`.
  - `first_person_sentiment_negative`: negative sentiment, measured as `first_person_percentage*sentiment_neg`.
  - `first_person_sentiment_overall`: overall sentiment, measured as a mixed average of the other two measures (i.e. `first_person_sentiment_positive` if a turn is positive, or `first_person_sentiment_negative` if a turn is negative, averaged across turns when there are multiple speakers or across the entire text otherwise). For a single speaker, this is calculated using the summary measures. For multiple speakers, this measure is averaged across turns (more meaningful when there are multiple turns).
- `word_repeat_percentage`: repeated-word percentage, measured as the percentage of words in the speech that are repeated. For a single speaker, this is calculated using the full text. For multiple speakers, this measure is averaged across turns.
- `phrase_repeat_percentage`: repeated-phrase percentage, measured as the percentage of phrases in the speech that are repeated. For a single speaker, this is calculated using the full text. For multiple speakers, this measure is averaged across turns.
- Means and variances of the speech coherence measures calculated at the word level (these measures are only calculated if the language is in this list):
  - `word_coherence_mean` and `word_coherence_var`: LLM measure of word coherence, indicating the semantic similarity of each word to the immediately preceding word.
  - `word_coherence_5_mean` and `word_coherence_5_var`: LLM measure of word coherence with a 5-word window, indicating the semantic similarity of each word to the words in a 5-word window.
  - `word_coherence_10_mean` and `word_coherence_10_var`: LLM measure of word coherence with a 10-word window, indicating the semantic similarity of each word to the words in a 10-word window.
  - `word_coherence_variability_k_mean` and `word_coherence_variability_k_var` for k from 2 to 10: LLM measure of word-to-word variability at k inter-word distances, indicating the semantic similarity between each word and the following word at inter-word distance k.
In addition to the variables above, files with multiple speakers and identified turns will also have the following variables populated:

- `num_turns`: number of turns that met the minimum length threshold.
- `num_one_word_turns`: number of one-word turns. **Note:** this variable is not interpretable when the minimum turn length is larger than 1.
- `mean_turn_length_minutes`: mean length of turns in minutes.
- `mean_turn_length_words`: mean length of turns in words spoken.
- `mean_pre_turn_pause`: mean pause time before each turn.
- `speaker_percentage`: speaker percentage. This is the percentage of the entire file that contained speech from this speaker rather than other speakers. It is calculated by dividing `speech_length_minutes` from the `summary` output by `file_length`. **Note:** this variable is not interpretable when the minimum turn length is larger than 1.
- Means and variances of the speech coherence measures calculated at the turn level:
  - `first_order_sentence_tangeniality_mean` and `first_order_sentence_tangeniality_var`: LLM measures of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn. This measure aims to answer the question: how much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question? These measures are only calculated if the language is in this list.
  - `second_order_sentence_tangeniality_mean` and `second_order_sentence_tangeniality_var`: LLM measures of second-order sentence tangentiality; same as above, but instead of consecutive sentences it compares sentences separated by one sentence in the middle. These measures are only calculated if the language is in this list.
  - `turn_to_turn_tangeniality_mean` and `turn_to_turn_tangeniality_var`: LLM measures of the similarity of the response to the other speaker, measured as the semantic similarity of the current turn to the previous turn of the other speaker. This measure is most meaningful in a clinical interview context where the analysis is run on the participant, and aims to answer the question: how much sense does an answer to the interviewer's question make? These measures are only calculated if the language is in this list.
  - `semantic_perplexity_mean` and `semantic_perplexity_var`: LLM measures of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is. These measures are only calculated if the language is in this list.
  - `semantic_perplexity_5_mean` and `semantic_perplexity_5_var`: LLM measures of semantic perplexity using a window of 5 words. These measures are only calculated if the language is in this list.
  - `semantic_perplexity_11_mean` and `semantic_perplexity_11_var`: LLM measures of semantic perplexity using a window of 11 words. These measures are only calculated if the language is in this list.
  - `semantic_perplexity_15_mean` and `semantic_perplexity_15_var`: LLM measures of semantic perplexity using a window of 15 words. These measures are only calculated if the language is in this list.
  The perplexity measures use the same pseudo-perplexity formula given in the turn-level section above.
- `turn_to_turn_tangeniality_slope`: LLM measure of the slope of the response-to-other-speaker similarity, calculated as the slope of the `turn_to_turn_tangeniality` measure over the duration of the interview. It aims to answer the question: does the similarity of responses to the interviewer degrade over time? This measure is only calculated if the language is in this list. A minimal sketch of this type of calculation is shown after this list.
- `num_interrupts`: number of interruptions, i.e. negative pre-turn pauses; the sum of interrupt flags from the `turns` output.
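For the slope measure above, a simple way to picture the calculation is an ordinary least-squares fit of `turn_to_turn_tangeniality` against turn order, as sketched below; the exact regression and time axis used by OpenWillis may differ.

```python
# Illustrative sketch only: OLS slope of turn-to-turn similarity over turn order.
import numpy as np

def tangentiality_slope(turns_df):
    vals = turns_df["turn_to_turn_tangeniality"].dropna().to_numpy()
    if len(vals) < 2:
        return float("nan")          # a slope needs at least two turns
    slope, _intercept = np.polyfit(np.arange(len(vals)), vals, deg=1)
    return float(slope)

# e.g. tangentiality_slope(turns)   # using the turns dataframe returned by the function
```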
Input parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `json_conf` | JSON | Output from a speech transcription function |
| `language` | String; optional, default = `'en'` | The language for which speech characteristics will be calculated. If the language is English, all variables listed above will be calculated. If the language is not English, only language-independent variables will be calculated. |
| `speaker_label` | String; optional, default = `None` | The speaker label from the JSON file for which the speech characteristics are calculated |
| `min_turn_length` | Integer; optional, default = 1 | The minimum length in words a turn needs to be for per-turn measures to be calculated |
| `min_coherence_turn_length` | Integer; optional, default = 5 | The minimum length in words a turn needs to be for coherence measures to be calculated |
| `option` | String; optional, default = `'coherence'` | Determines which measures are calculated; can be `'simple'` or `'coherence'` |

| Option | List of variables calculated |
| --- | --- |
| `'simple'` | Amount of speech, pause measures, sentiment, lexical richness, part of speech, and repetition measures |
| `'coherence'` | Simple measures + speech coherence measures |
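If only the computationally inexpensive measures are needed, the call can be limited to the `'simple'` option, as in the hedged example below (reusing the placeholder `transcript_json` and speaker label from the earlier example):

```python
# Skip the LLM-based coherence and perplexity measures.
words, turns, summary = ow.speech_characteristics(
    json_conf=transcript_json,     # loaded transcription JSON (placeholder)
    speaker_label='speaker0',      # placeholder speaker label
    option='simple'
)
```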
Outputs:

| Output | Type | Description |
| --- | --- | --- |
| `words` | pandas.DataFrame | Per-word measures of speech characteristics |
| `turns` | pandas.DataFrame or None | Per-turn measures of speech characteristics, populated when the input JSON contains speech from multiple speakers and a speaker is identified using the `speaker_label` parameter |
| `summary` | pandas.DataFrame | File-level measures of speech characteristics |
Below are the dependencies specific to the calculation of these measures.
| Dependency | License | Justification |
| --- | --- | --- |
| NLTK | Apache 2.0 | Well-established library for commonly measured natural language characteristics |
| LexicalRichness | MIT | Straightforward implementation of methods for calculation of the MATTR score |
| vaderSentiment | MIT | Widely used library for sentiment analysis, trained on a large and heterogeneous dataset |
| transformers | Apache 2.0 | Library for accessing pre-trained language models; used for calculation of word coherence measures and semantic perplexity |
| sentence_transformers | Apache 2.0 | Library for accessing pre-trained language models for sentences and paragraphs; used for calculation of sentence and turn tangentiality measures |
| spacy | MIT | Natural language processing library used for lemmatizing words in the MATTR calculation |
Andreasen, N., & Pfohl, B. (1976). Linguistic Analysis of Speech in Affective Disorders. Archives of General Psychiatry, 33(11), 1361. https://doi.org/10.1001/archpsyc.1976.01770110089009
Dikaios, K., Rempel, S., Dumpala, S. H., Oore, S., Kiefte, M., & Uher, R. (2023). Applications of Speech Analysis in Psychiatry. Harvard Review of Psychiatry, 31(1), 1–13. https://doi.org/10.1097/hrp.0000000000000356
Elvevåg, B., Foltz, P. W., Weinberger, D. R., & Goldberg, T. E. (2007). Quantifying incoherence in speech: An automated methodology and novel application to schizophrenia. Schizophrenia Research, 93(1–3), 304–316. https://doi.org/10.1016/j.schres.2007.03.001
He, R., Palominos, C., Zhang, H., Alonso-Sánchez, M. F., Palaniyappan, L., & Hinzen, W. (2024). Navigating the semantic space: unraveling the structure of meaning in psychosis using different computational language models. Psychiatry Research, 333, 115752. https://doi.org/10.1016/j.psychres.2024.115752
Parola, A., Lin, J. M., Simonsen, A., Bliksted, V., Zhou, Y., Wang, H., Inoue, L., Koelkebeck, K., & Fusaroli, R. (2023). Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of NLP automated measures of coherence. Schizophrenia Research, 259, 59–70. https://doi.org/10.1016/j.schres.2022.07.002
Stasak, B., Epps, J., & Goecke, R. (2019). Automatic depression classification based on affective read sentences: Opportunities for text-dependent analysis. Speech Communication, 115, 1–14. https://doi.org/10.1016/j.specom.2019.10.003
Tang, S. X., Kriz, R., Cho, S., Park, S. J., Harowitz, J., Gur, R. E., Bhati, M. T., Wolf, D. H., Sedoc, J., & Liberman, M. (2021). Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders. Npj Schizophrenia, 7(1). https://doi.org/10.1038/s41537-021-00154-3
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.