# Speech characteristics v3.0
| | |
| --- | --- |
| Date completed | March 19, 2024 |
| Release where first appeared | v2.1 |
| Researcher / Developer | Georgios Efstathiadis, Vijay Yadav |
```python
import openwillis as ow

words, turns, summary = ow.speech_characteristics(json_conf='', language='en', speaker_label='', min_turn_length=1)
```
This function measures speech characteristics from a transcript of an individual's speech. The input transcript is the JSON output of any OpenWillis speech transcription function (Vosk, Whisper, or AWS).

By default, the function assumes the transcript contains English speech, and all measures listed below are populated. If the transcript is not in English, the user must set the language argument accordingly, and only language-independent measures are populated.

By default, the function also assumes the transcript contains speech from a single speaker, and all measures below are calculated from the entire transcript. Per-turn measures are not populated in this case.

When multiple speakers are labeled in the transcript, the user must specify which speaker to quantify using the `speaker_label` argument. All measures are then calculated for that speaker only, and per-turn measures are populated.

The user may wish to calculate per-turn measures only on turns meeting a minimum length requirement. To do so, they can specify the minimum number of words a turn must contain using `min_turn_length`.
The function's first output is a `words` dataframe, which contains a row for each word and measures specific to that word. These include:
- Pause time before the word in seconds (`pre_word_pause`)
  Individual word timestamps in the input JSON are used to calculate the pause before each word. To avoid measuring potentially long silences before speech starts in an audio file, `pre_word_pause` for the first word in every file is set to `NaN`. To distinguish `pre_word_pause` from `pre_turn_pause` (defined later in this document), `pre_word_pause` for the first word in each turn is also set to `NaN`.
- Number of syllables, identified using NLTK's SyllableTokenizer (`num_syllables`). This is an English-only measure.
- Part of speech associated with the word, identified using NLTK, as specified in the `part_of_speech` column (Andreasen & Pfohl, 1976; Tang et al., 2021). These are English-only measures: `noun`, `verb`, `adjective`, `pronoun`, `adverb`, `determiner`.
- Flag identifying whether the word is a first-person singular pronoun, i.e. 'I', 'me', 'my', or 'mine' (`first_person`) (Andreasen & Pfohl, 1976; Tang et al., 2021). This is an English-only measure.
- Tense of verbs identified using NLTK, i.e. whether the verb is in the past or present tense (`verb_tense`). This is an English-only measure.
- Speech coherence measures, also calculated at the word level (these are only calculated if language is in this list):
  - LLM measure of word coherence, indicating semantic similarity of each word to the immediately preceding word (`word_coherence`) (Parola et al., 2023).
  - LLM measure of word coherence over a 5-word window, indicating semantic similarity of each word to the words in a 5-word window around it (`word_coherence_5`) (Parola et al., 2023).
  - LLM measure of word coherence over a 10-word window, indicating semantic similarity of each word to the words in a 10-word window around it (`word_coherence_10`) (Parola et al., 2023).
  - LLM measure of word-to-word variability at k inter-word distances (for k from 2 to 10), indicating semantic similarity between each word and the word k positions later (`word_coherence_variability_k` for k from 2 to 10) (Parola et al., 2023).
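The `pre_word_pause` rules above can be sketched in a few lines. This is a hypothetical simplification (the real function reads per-word timestamps from the transcription JSON), using `(start_sec, end_sec, turn_id)` tuples to stand in for word rows:

```python
import math

def pre_word_pauses(words):
    """Pause before each word, following the pre_word_pause rules above.

    The first word of the file and the first word of each turn get NaN,
    so leading silences and pre-turn pauses are excluded from the
    word-level pause measures.
    """
    pauses = []
    for i, (start, _end, turn_id) in enumerate(words):
        if i == 0 or words[i - 1][2] != turn_id:
            pauses.append(float("nan"))  # first word of file or of a turn
        else:
            pauses.append(start - words[i - 1][1])  # gap since previous word ended
    return pauses

words = [(0.5, 0.9, 0), (1.2, 1.5, 0), (3.0, 3.4, 1), (3.6, 4.0, 1)]
print(pre_word_pauses(words))  # [nan, ~0.3, nan, ~0.2]
```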
When a `speaker_label` is specified, per-turn measures will be populated.

A turn is defined as a segment of speech by one speaker, i.e. everything that individual says before a different speaker starts speaking. The turn-level output contains a row for each turn over the course of the file. Note that phrase-related measures work for other languages but are better suited to English. Per-turn measures include:
- Pause time before the turn in seconds (`pre_turn_pause`)
  If the speaker's first turn is also the first turn in the audio file, this is set to `NaN` to correct for any delay between the start of the recording and the first detected speech.
- Total length of the turn in minutes (`turn_length_minutes`)
  This is the time from the beginning of the first word in the turn to the end of the last word, so it includes both the words spoken and the pauses between them.
- Total number of words in the turn (`turn_length_words`)
  This is simply a count of the words detected in the turn.
- Rate of speech in words per minute (`words_per_min`)
  Calculated by dividing `turn_length_words` by `turn_length_minutes`.
- Articulation rate, i.e. syllables per minute (`syllables_per_min`)
  Calculated by summing `num_syllables` across the per-word measures and dividing by `turn_length_minutes`. This is an English-only measure.
- Speech percentage, i.e. time spoken over total time (`speech_percentage`)
  The sum of the time taken to speak each word in the turn, divided by `turn_length_minutes`, quantifying how much of the turn was spent speaking versus in the silences between words.
- Mean pause time between words in seconds (`mean_pause_length`)
  The mean of the `pre_word_pause` measurements for all words in the turn.
- Pause variability in seconds (`pause_variability`)
  The variance of the `pre_word_pause` measurements for all words in the turn.
- Emotional valence associated with the turn, calculated using vaderSentiment
  To use these measures, we recommend setting the `min_turn_length` argument to at least 5 words. These are English-only measures.
  - Degree of positive valence, ranging from 0-1 (`sentiment_pos`)
  - Degree of negative valence, ranging from 0-1 (`sentiment_neg`)
  - Degree of neutral valence, ranging from 0-1 (`sentiment_neu`)
  - Degree of overall valence, ranging from 0-1 (`sentiment_overall`)
- Lexical diversity as measured by the moving average type-token ratio (MATTR) score (`mattr`)
  To use this measure, we recommend setting the `min_turn_length` argument to at least 5 words. This is an English-only measure.
- First-person singular pronoun percentage, measured as the percentage of pronouns in the utterance that are first-person singular (`first_person_percentage`)
- First-person singular pronoun percentage x sentiment, measuring the interaction between the speech's sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023):
  - Positive, measured as `(100-first_person_percentage)*sentiment_pos` (`first_person_sentiment_positive`)
  - Negative, measured as `first_person_percentage*sentiment_neg` (`first_person_sentiment_negative`)
- Repeated-word percentage, measured as the percentage of words in the utterance that are repeated (`word_repeat_percentage`) (Stasak et al., 2019).
- Repeated-phrase percentage, measured as the percentage of phrases in the utterance that are repeated (`phrase_repeat_percentage`) (Stasak et al., 2019).
- Speech coherence measures calculated at the turn level:
  - LLM measure of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn (`first_order_sentence_tangeniality`) (Tang et al., 2021; He et al., 2024; Elvevåg et al., 2007; Parola et al., 2023).
    This measure aims to answer the question: how much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question?
    This measure is only calculated if language is in this list.
  - LLM measure of second-order sentence tangentiality, the same as above but comparing sentences separated by one intervening sentence (`second_order_sentence_tangeniality`) (Parola et al., 2023).
    This measure is only calculated if language is in this list.
  - LLM measure of similarity of the response to the other speaker, measured as the semantic similarity of the current turn to the previous turn of the other speaker (`turn_to_turn_tangeniality`) (Tang et al., 2021).
    This measure makes the most sense in a clinical interview context, where the analysis is run on the participant; it aims to answer the question: how much sense does an answer to the interviewer's question make?
    This measure is only calculated if language is in this list.
  - LLM measure of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is (`semantic_perplexity`) (Parola et al., 2023). It is calculated as the standard pseudo-perplexity, `PPPL(W) = exp(-(1/|W|) Σ_t log P(w_t | W\t))`, where `W` is the sequence of words in the turn and `P(w_t | W\t)` is the language model's probability of word `w_t` given the turn with `w_t` masked.
    This measure is only calculated if language is in this list.
- Flag identifying pre-turn pauses that are zero or negative, i.e. interruptions (`interrupt_flag`)
  This is set to `True` if the pre-turn pause is zero or negative (indicative of an interruption) and `False` otherwise. This variable is used in the `summary` output to calculate `num_interrupts`.
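Several of the per-turn measures above are simple ratios over the word rows. A minimal sketch, assuming hypothetical field names (`start`, `end` in seconds, `num_syllables`, `pre_word_pause`) standing in for the per-word output:

```python
import statistics

def turn_measures(word_rows):
    """Sketch of the per-turn rate and pause measures described above."""
    turn_minutes = (word_rows[-1]["end"] - word_rows[0]["start"]) / 60.0
    spoken_minutes = sum(w["end"] - w["start"] for w in word_rows) / 60.0
    pauses = [w["pre_word_pause"] for w in word_rows if w["pre_word_pause"] is not None]
    return {
        "turn_length_minutes": turn_minutes,
        "turn_length_words": len(word_rows),
        "words_per_min": len(word_rows) / turn_minutes,
        "syllables_per_min": sum(w["num_syllables"] for w in word_rows) / turn_minutes,
        "speech_percentage": spoken_minutes / turn_minutes,
        "mean_pause_length": statistics.mean(pauses) if pauses else float("nan"),
        "pause_variability": statistics.pvariance(pauses) if pauses else float("nan"),
    }

turn = [
    {"start": 0.0, "end": 6.0, "num_syllables": 2, "pre_word_pause": None},
    {"start": 12.0, "end": 18.0, "num_syllables": 3, "pre_word_pause": 6.0},
]
m = turn_measures(turn)
print(m["words_per_min"], m["speech_percentage"])
```

Note that `turn_length_minutes` spans from the start of the first word to the end of the last, so pauses inside the turn lower both `words_per_min` and `speech_percentage`.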
The `summary` dataframe compiles file-level information. For a single speaker, these measures are calculated by compiling information from the per-word output; for multiple speakers, from the per-turn output. Note that phrase-related measures work for other languages but are better suited to English.
- File length (`file_length`)
  The total length of the audio file in minutes.
- Total length of speech in minutes (`speech_length_minutes`)
  Single speaker: time from the beginning of the first word to the end of the last word.
  Multiple speakers: sum of `turn_length_minutes` across all turns.
- Total number of words spoken across the whole file (`speech_length_words`)
  Single speaker: total count of words in the file.
  Multiple speakers: sum of `turn_length_words` across all turns.
- Rate of speech in words per minute across the whole file (`words_per_min`)
  `speech_length_words` / `speech_length_minutes`
- Articulation rate, i.e. syllables per minute, across the whole file (`syllables_per_min`)
  Single speaker: sum of `num_syllables` from all words / `speech_length_minutes`.
  Multiple speakers: sum of `num_syllables` from filtered turns / `speech_length_minutes`.
- Speech percentage, i.e. time spoken over total time (`speech_percentage`)
  `speech_length_minutes` / `file_length`
- Mean pause time between words in seconds (`mean_pause_length`)
  Mean of `pre_word_pause` across all words in the file.
- Pause variability in seconds (`mean_pause_variability`)
  Variance of `pre_word_pause` across all words in the file.
- Emotional valence associated with the speech, calculated using vaderSentiment
  Calculated on a string of the entire transcript. To use these measures, we recommend setting the `min_turn_length` argument to at least 5 words. These are English-only measures.
  - Degree of positive valence, ranging from 0-1 (`sentiment_pos`)
  - Degree of negative valence, ranging from 0-1 (`sentiment_neg`)
  - Degree of neutral valence, ranging from 0-1 (`sentiment_neu`)
  - Degree of overall valence, ranging from 0-1 (`sentiment_overall`)
- Lexical diversity as measured by the moving average type-token ratio (MATTR) score (`mattr`)
  Calculated on a string of the entire transcript. To use this measure, we recommend setting the `min_turn_length` argument to at least 5 words. This is an English-only measure.
- First-person singular pronoun percentage, measured as the percentage of pronouns in the speech that are first-person singular (`first_person_percentage`)
  Calculated on a string of the entire transcript.
- First-person singular pronoun percentage x sentiment, measuring the interaction between the speech's sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023):
  - Positive, measured as `(100-first_person_percentage)*sentiment_pos` (`first_person_sentiment_positive`)
  - Negative, measured as `first_person_percentage*sentiment_neg` (`first_person_sentiment_negative`)
  - Overall, measured as a mixed average of the other two (i.e. `first_person_sentiment_positive` if a turn is positive, `first_person_sentiment_negative` if it is negative, averaged across turns for multiple speakers or across the entire text) (`first_person_sentiment_overall`)

  Single speaker: calculated using summary measures.
  Multiple speakers: averaged across turns (more meaningful when there are multiple utterances/turns).
- Repeated-word percentage, measured as the percentage of words in the speech that are repeated (`word_repeat_percentage`)
  Single speaker: calculated using the full text.
  Multiple speakers: averaged across turns.
- Repeated-phrase percentage, measured as the percentage of phrases in the speech that are repeated (`phrase_repeat_percentage`)
  Single speaker: calculated using the full text.
  Multiple speakers: averaged across turns.
- Means and variances of the word-level speech coherence measures (these are only calculated if language is in this list):
  - LLM measure of word coherence, indicating semantic similarity of each word to the immediately preceding word (`word_coherence_mean` and `word_coherence_var`)
  - LLM measure of word coherence over a 5-word window (`word_coherence_5_mean` and `word_coherence_5_var`)
  - LLM measure of word coherence over a 10-word window (`word_coherence_10_mean` and `word_coherence_10_var`)
  - LLM measure of word-to-word variability at k inter-word distances (`word_coherence_variability_k_mean` and `word_coherence_variability_k_var` for k from 2 to 10)
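The MATTR score used above can be illustrated with a small sketch. The released function relies on the LexicalRichness package; this hypothetical re-implementation just averages the type-token ratio over a sliding window, and the window size shown is an assumption:

```python
def mattr(tokens, window=10):
    """Moving-average type-token ratio: mean TTR over sliding windows."""
    if len(tokens) < window:
        window = len(tokens)  # fall back to plain TTR for short texts
    ttrs = [
        len(set(tokens[i:i + window])) / window  # unique types / window size
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)

print(round(mattr("the cat saw the other cat".split(), window=4), 4))  # 0.9167
```

Unlike the raw type-token ratio, the windowed average does not shrink simply because the text is longer, which is why it is preferred for comparing utterances of different lengths.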
In addition to the variables above, files with multiple speakers and identified turns will also have the following variables populated:
- Number of turns that met the minimum length threshold (`num_turns`)
- Number of one-word turns (`num_one_word_turns`)
  *Not interpretable when the minimum turn length is larger than 1.
- Mean length of turns in minutes (`mean_turn_length_minutes`)
- Mean length of turns in words spoken (`mean_turn_length_words`)
- Mean pause time before each turn (`mean_pre_turn_pause`)
- Speaker percentage (`speaker_percentage`)
  This is the percentage of the entire file that contained speech from this speaker rather than other speakers. It is calculated by dividing `speech_length_minutes` from the `summary` output by `file_length`.
- Means and variances of the turn-level speech coherence measures:
  - LLM measure of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn (`first_order_sentence_tangeniality_mean` and `first_order_sentence_tangeniality_var`)
    This measure aims to answer the question: how much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question?
    This measure is only calculated if language is in this list.
  - LLM measure of second-order sentence tangentiality, the same as above but comparing sentences separated by one intervening sentence (`second_order_sentence_tangeniality_mean` and `second_order_sentence_tangeniality_var`)
    This measure is only calculated if language is in this list.
  - LLM measure of similarity of the response to the other speaker, measured as the semantic similarity of the current turn to the previous turn of the other speaker (`turn_to_turn_tangeniality_mean` and `turn_to_turn_tangeniality_var`)
    This measure makes the most sense in a clinical interview context, where the analysis is run on the participant; it aims to answer the question: how much sense does an answer to the interviewer's question make?
    This measure is only calculated if language is in this list.
  - LLM measure of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is (`semantic_perplexity_mean` and `semantic_perplexity_var`). It is calculated as the standard pseudo-perplexity, `PPPL(W) = exp(-(1/|W|) Σ_t log P(w_t | W\t))`, where `W` is the sequence of words in the turn and `P(w_t | W\t)` is the language model's probability of word `w_t` given the turn with `w_t` masked.
    This measure is only calculated if language is in this list.
  - LLM measure of the slope of response-to-other-speaker similarity, calculated as the slope of the `turn_to_turn_tangeniality` measure over the duration of the interview (`turn_to_turn_tangeniality_slope`)
    This aims to answer the question: does the similarity of responses to the interviewer degrade over time?
- Number of interruptions, i.e. negative pre-turn pauses (`num_interrupts`); the sum of interrupt flags from the `turns` output.
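The multiple-speaker summary variables above reduce the chosen speaker's per-turn rows to file-level counts and means. A minimal sketch, with hypothetical field names matching the per-turn output; expressing `speaker_percentage` on a 0-100 scale is an assumption here:

```python
def speaker_summary(turn_rows, file_length_minutes):
    """Sketch of the multiple-speaker summary variables described above."""
    speech_minutes = sum(t["turn_length_minutes"] for t in turn_rows)
    return {
        "num_turns": len(turn_rows),
        "mean_turn_length_minutes": speech_minutes / len(turn_rows),
        "mean_turn_length_words": sum(t["turn_length_words"] for t in turn_rows) / len(turn_rows),
        # share of the file containing this speaker's speech
        "speaker_percentage": 100.0 * speech_minutes / file_length_minutes,
        # interrupt_flag marks turns with zero or negative pre-turn pause
        "num_interrupts": sum(1 for t in turn_rows if t["interrupt_flag"]),
    }

turns = [
    {"turn_length_minutes": 0.5, "turn_length_words": 60, "interrupt_flag": True},
    {"turn_length_minutes": 1.0, "turn_length_words": 150, "interrupt_flag": False},
]
s = speaker_summary(turns, file_length_minutes=5.0)
print(s["speaker_percentage"], s["num_interrupts"])  # 30.0 1
```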
`json_conf`
| Type | JSON |
| --- | --- |
| Description | Output from a speech transcription function |

`language`
| Type | String, optional, default = `'en'` |
| --- | --- |
| Description | The language for which speech characteristics will be calculated. If the language is English, all variables shown are calculated; if not, only language-independent variables are calculated |

`speaker_label`
| Type | String, optional, default = `None` |
| --- | --- |
| Description | The speaker label from the JSON file for which speech characteristics are calculated |

`min_turn_length`
| Type | Integer, optional, default = 1 |
| --- | --- |
| Description | The minimum length in words a turn must have for per-turn measures to be calculated |

`words`
| Type | pandas.DataFrame |
| --- | --- |
| Description | Per-word measures of speech characteristics |

`turns`
| Type | pandas.DataFrame or None |
| --- | --- |
| Description | Per-turn measures of speech characteristics, populated when the input JSON contains speech from multiple speakers and a speaker is identified via the `speaker_label` parameter |

`summary`
| Type | pandas.DataFrame |
| --- | --- |
| Description | File-level measures of speech characteristics |
Below are dependencies specific to calculation of this measure.
| Dependency | License | Justification |
| --- | --- | --- |
| NLTK | Apache 2.0 | Well-established library for commonly measured natural language characteristics |
| LexicalRichness | MIT | Straightforward implementation of methods for calculating the MATTR score |
| vaderSentiment | MIT | Widely used library for sentiment analysis, trained on a large and heterogeneous dataset |
Andreasen, N., & Pfohl, B. (1976). Linguistic Analysis of Speech in Affective Disorders. _Archives of General Psychiatry_, _33_(11), 1361. https://doi.org/10.1001/archpsyc.1976.01770110089009
Dikaios, K., Rempel, S., Dumpala, S. H., Oore, S., Kiefte, M., & Uher, R. (2023). Applications of Speech Analysis in Psychiatry. _Harvard Review of Psychiatry_, _31_(1), 1–13. https://doi.org/10.1097/hrp.0000000000000356
Elvevåg, B., Foltz, P. W., Weinberger, D. R., & Goldberg, T. E. (2007). Quantifying incoherence in speech: An automated methodology and novel application to schizophrenia. _Schizophrenia Research_, _93_(1–3), 304–316. https://doi.org/10.1016/j.schres.2007.03.001
He, R., Palominos, C., Zhang, H., Alonso-Sánchez, M. F., Palaniyappan, L., & Hinzen, W. (2024). Navigating the semantic space: unraveling the structure of meaning in psychosis using different computational language models. _Psychiatry Research_, _333_, 115752. https://doi.org/10.1016/j.psychres.2024.115752
Parola, A., Lin, J. M., Simonsen, A., Bliksted, V., Zhou, Y., Wang, H., Inoue, L., Koelkebeck, K., & Fusaroli, R. (2023). Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of NLP automated measures of coherence. _Schizophrenia Research_, _259_, 59–70. https://doi.org/10.1016/j.schres.2022.07.002
Stasak, B., Epps, J., & Goecke, R. (2019). Automatic depression classification based on affective read sentences: Opportunities for text-dependent analysis. _Speech Communication_, _115_, 1–14. https://doi.org/10.1016/j.specom.2019.10.003
Tang, S. X., Kriz, R., Cho, S., Park, S. J., Harowitz, J., Gur, R. E., Bhati, M. T., Wolf, D. H., Sedoc, J., & Liberman, M. (2021). Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders. _Npj Schizophrenia_, _7_(1). https://doi.org/10.1038/s41537-021-00154-3
OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.