
Speech Characteristics v3.2

Michelle Worthington edited this page Nov 15, 2024 · 4 revisions
Date completed Sep 20, 2024
Release where first appeared OpenWillis v2.3
Researcher / Developer Georgios Efstathiadis, Vijay Yadav, Michelle Worthington

1 – Syntax

import openwillis as ow

words, turns, summary = ow.speech_characteristics(json_conf = '', language = 'en', speaker_label = None, min_turn_length = 1, min_coherence_turn_length = 5, option = 'coherence')

2 – Methods

This function measures speech characteristics from a transcript of an individual’s speech. The input transcript is the JSON output from any OpenWillis speech transcription function (Vosk, Whisper, or AWS).

By default, the function assumes the transcript contains speech in English and all measures listed below will be populated. If the transcript is not in English, the user must specify the language argument as something other than English and only language-independent measures will be populated.

By default, the function also assumes the transcript contains speech from one speaker and all measures below are calculated from the entire transcript. Per-turn measures are not populated in this case.

In the case of multiple speakers labeled in the transcript, the user must define which speaker they want to quantify speech characteristics from using the speaker_label argument. All measures calculated will be only for that speaker. When a speaker is specified, per-turn measures are also calculated.

The user may wish to calculate per-turn measures only on turns meeting a minimum length requirement to focus on more substantive speech samples (for example, to avoid including one-word responses such as “yes” or “OK”). To do so, they may specify the minimum number of words a turn must have using min_turn_length. If 'coherence' measures are also calculated, then the user can also specify whether they want to apply a minimum number of words filter for calculating the coherence measures only, using min_coherence_turn_length. By default this is set to 5, since most of these measures are only relevant when calculated on larger text segments.

The option parameter can be set to 'simple', which calculates only the measures that are computationally inexpensive. It can also be set to 'coherence' (the default), which additionally calculates the higher-order linguistic variables related to speech coherence.

2.1 – Per-word measures

The function’s first output is a words dataframe, which contains a row for each word and measures specific to that word. This includes:

  • pre_word_pause: pause time before the word in seconds. Individual word timestamps in the input JSON are used to calculate pause lengths before each word. To avoid measurement of potentially long silences prior to the start of speech in an audio file, the pre_word_pause for the first word in every file is set to NaN. To distinguish pre_word_pause from pre_turn_pause as defined later in this document, pre_word_pause for the first word in each turn is also set to NaN.
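The pause computation above can be sketched as follows; the "start"/"end" field names are illustrative, not the exact transcription JSON schema:

```python
def pre_word_pauses(words):
    # words: list of dicts with "start" and "end" timestamps in seconds,
    # as parsed from the transcription JSON (field names are illustrative).
    pauses = [None]  # first word in the file: pause is undefined (NaN)
    for prev, cur in zip(words, words[1:]):
        # pause before the current word = gap since the previous word ended
        pauses.append(cur["start"] - prev["end"])
    return pauses
```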
  • num_syllables: number of syllables, identified using NLTK’s SyllableTokenizer. This is an English-only measure.
  • part_of_speech: part of speech associated with the word, identified using NLTK (Andreasen & Pfohl, 1976; Tang et al., 2021). This is an English-only measure; possible values are:
    • noun
    • verb
    • adjective
    • pronoun
    • adverb
    • determiner
  • first_person: flag to identify whether the word is a first-person singular pronoun, i.e. 'I,' 'me,' 'my,' or 'mine' (Andreasen & Pfohl, 1976; Tang et al., 2021). This is an English-only measure.
  • verb_tense: tense of verbs identified using NLTK, i.e. whether the verb is in the past or present tense. This is an English-only measure.
  • Speech coherence measures are also calculated at the word level (these measures are only calculated if the language is in this list):
    • word_coherence: LLM measure of word coherence, indicating the semantic similarity of each word to the immediately preceding word (Parola et al., 2023).
    • word_coherence_5: LLM measure of word coherence over a 5-word window, indicating the semantic similarity of each word to the words within that window (Parola et al., 2023).
    • word_coherence_10: LLM measure of word coherence over a 10-word window, indicating the semantic similarity of each word to the words within that window (Parola et al., 2023).
    • word_coherence_variability_k for k from 2 to 10: LLM measure of word-to-word variability at inter-word distance k, indicating the semantic similarity between each word and the word k positions later (Parola et al., 2023).
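The word-coherence idea can be sketched as a cosine similarity between consecutive word embeddings. In practice the embeddings come from a pretrained language model; the toy vectors here only illustrate the computation:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_coherence(embeddings):
    # Semantic similarity of each word to the immediately preceding word;
    # the first word has no predecessor, so its coherence is undefined.
    return [None] + [
        cosine_similarity(embeddings[i - 1], embeddings[i])
        for i in range(1, len(embeddings))
    ]
```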

2.2 – Per-turn measures

When a speaker_label is specified, per-turn measures will be populated.

A turn is defined as a segment of speech by one speaker, i.e., everything that individual says before a different speaker starts speaking. The turn-level output contains a row for each turn over the course of the file. Note that phrase-related measures work for other languages but are better suited to English. Per-turn measures include:

  • pre_turn_pause: pause time before the turn in seconds. In case the first turn for the speaker is also the first turn in the audio file, this is set to NaN to correct for any potential delays between the start of the recording and the first detection of speech.
  • turn_length_minutes: total length of the turn in minutes. This is the time between the beginning of the first word in the turn to the end of the last word in the turn, meaning it considers both words spoken and the pauses between those words.
  • turn_length_words: total number of words in the turn. This is simply a count of the number of words detected in the turn.
  • words_per_min: rate of speech in words per minute. This is calculated by simply dividing turn_length_words by turn_length_minutes.
  • syllables_per_min: articulation rate, i.e., syllables per minute. This is calculated by summing num_syllables from the per-word measures and dividing the sum by turn_length_minutes. This is an English-only measure.
  • speech_percentage: speech percentage, i.e., time spoken over total time. This is the sum of the time taken to speak every word in the turn divided by turn_length_minutes, quantifying how much of the turn was spent speaking versus in the silences between words.
  • mean_pause_length: mean pause time between words in seconds. This is the mean of all the pre_word_pause measurements for all words in the turn.
  • pause_variability: pause variability in seconds. This is the variance of all the pre_word_pause measurements for all words in the turn.
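The two pause statistics above can be sketched with the standard library; whether OpenWillis uses population or sample variance internally is not specified here, so population variance is assumed:

```python
from statistics import mean, pvariance

def pause_stats(pre_word_pauses):
    # Drop undefined pauses (e.g. the first word of a turn) before aggregating
    pauses = [p for p in pre_word_pauses if p is not None]
    # mean_pause_length and pause_variability (population variance assumed)
    return mean(pauses), pvariance(pauses)
```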
  • Emotional valence associated with the turn, calculated using vaderSentiment:
    To use these measures, we recommend that the min_turn_length argument be set to at least 5 words. These are English-only measures.
    • sentiment_pos: degree of positive valence ranging from 0-1
    • sentiment_neg: degree of negative valence ranging from 0-1
    • sentiment_neu: degree of neutral valence ranging from 0-1
    • sentiment_overall: degree of overall valence ranging from 0-1
  • mattr_5, mattr_10, mattr_25, mattr_50, mattr_100: lexical diversity as measured by the moving average type token ratio (MATTR) score with different window sizes. This is calculated by first lemmatizing words with nltk. To use this measure, we recommend that the min_turn_length argument be set to at least 5 words. These are English-only measures.
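MATTR can be sketched as follows. OpenWillis lemmatizes words first and uses the LexicalRichness package; this simplified version operates on raw tokens to show the windowed averaging:

```python
def mattr(tokens, window=5):
    # Moving-average type-token ratio: average the type-token ratio
    # over all fixed-size windows sliding across the token sequence.
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)
```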
  • first_person_percentage: first-person singular pronouns percentage, measured as the percentage of pronouns in the utterance that are first-person singular pronouns.
  • First-person singular pronoun percentage x sentiment, measuring the interaction between the speech’s sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023):
    • first_person_sentiment_positive: positive sentiment measured as (100-first_person_percentage)*sentiment_pos.
    • first_person_sentiment_negative: negative measured as first_person_percentage*sentiment_neg.
  • word_repeat_percentage: repeated words percentage, measured as the percentage of words in the utterance that are repeated (Stasak et al., 2019). This is calculated by using a sliding window of 10 words to adjust for utterance length.
  • phrase_repeat_percentage: repeated phrase percentage, measured as the percentage of phrases in the utterance that are repeated (Stasak et al., 2019). This is calculated by using a sliding window of 3 phrases to adjust for utterance length.
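The sliding-window repetition measure can be sketched like this; the exact matching rules used by OpenWillis may differ:

```python
def word_repeat_percentage(words, window=10):
    # A word counts as repeated if it already appeared within the
    # preceding window - 1 words.
    repeats = sum(
        1 for i, w in enumerate(words)
        if w in words[max(0, i - window + 1):i]
    )
    return 100 * repeats / len(words)
```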
  • Speech coherence measures calculated at the turn level:
    • first_order_sentence_tangeniality: LLM measure of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn (Tang et al., 2021; He et al., 2024; Elvevåg et al., 2007; Parola et al., 2023). This measure aims to answer the question: How much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question?
      This measure is only calculated if language is in this list

    • second_order_sentence_tangeniality: LLM measure of second-order sentence tangentiality, same as above, but comparing sentences separated by one intervening sentence rather than consecutive sentences (Parola et al., 2023).
      This measure is only calculated if language is in this list

    • turn_to_turn_tangeniality: LLM measure of similarity of the response to the other speaker, measured as the semantic similarity of the current turn to the previous turn of the other speaker (Tang et al., 2021). This measure makes more sense in a clinical interview context where the analysis is being run on the participant and aims to answer the question: How much sense does an answer to a question of the interviewer make?
      This measure is only calculated if language is in this list

    • semantic_perplexity: LLM measure of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is (Parola et al., 2023).
      This measure is only calculated if language is in this list

    • semantic_perplexity_5: LLM measure of semantic perplexity using a window of 5 words.
      This measure is only calculated if language is in this list

    • semantic_perplexity_11: LLM measure of semantic perplexity using a window of 11 words.
      This measure is only calculated if language is in this list

    • semantic_perplexity_15: LLM measure of semantic perplexity using a window of 15 words.
      This measure is only calculated if language is in this list

The exact mathematical formula used to calculate perplexity measures is

$$ \text{perplexity}(\text{Turn}) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \log P_{LM}(t_i|\text{Turn}_{\backslash i})\right) $$

where $\text{Turn} = (t_1, \ldots, t_n)$ and $P_{LM}$ is the language model used to assign probabilities (e.g., BERT).
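Given per-token pseudo-log-likelihoods, the formula reduces to the sketch below. Obtaining the log-probabilities themselves requires running a masked language model (e.g., BERT) with each token masked in turn, which is omitted here:

```python
import math

def pseudo_perplexity(token_log_probs):
    # token_log_probs[i] = log P_LM(t_i | Turn with t_i masked out)
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```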

  • interrupt_flag: flag for pre-turn pauses that are zero or negative, i.e. interruptions. This is set to True if pre-turn pause is negative or zero (indicative of an interruption) and otherwise, it is set to False. This variable is used in the summary output to calculate num_interrupts.

2.3 – Summary

The summary dataframe compiles file-level information. In the case of a single speaker, these measures are calculated by compiling information from the per-word measures output. In the case of multiple speakers, these are calculated by compiling information from the per-turn measures output. Note that phrase-related measures work for other languages but are better suited to English.

  • file_length: this is the total length of the audio file in minutes.

  • speech_length_minutes: total length of speech in minutes. For a single speaker, this is the time from the beginning of the first word to end of the last word. For multiple speakers, this is the sum of turn_length_minutes across all turns.

  • speech_length_words: total number of words spoken across the whole file. For a single speaker, this is the total count of words in the file. For multiple speakers, this is the sum of turn_length_words across all turns.

  • words_per_min: rate of speech in words per minute across the whole file:
    speech_length_words / speech_length_minutes.

  • syllables_per_min: articulation rate, i.e., syllables per minute across the whole file. For a single speaker, this is the sum of num_syllables across all words / speech_length_minutes. For multiple speakers, this is the sum of num_syllables across filtered turns / speech_length_minutes.

  • speech_percentage: speech percentage, i.e., time spoken divided by total time (speech_length_minutes / file_length).

  • mean_pause_length: mean pause time between words in seconds. This is the mean of pre_word_pause across all words in file.

  • mean_pause_variability: pause variability in seconds. This is the variance of pre_word_pause across all words in the file.

  • Emotional valence associated with the speech, calculated using vaderSentiment. These measures are calculated on a string of the entire transcript. To use these measures, we recommend that the min_turn_length argument be set to at least 5 words. These are English-only measures.

    • sentiment_pos: degree of positive valence ranging from 0-1
    • sentiment_neg: degree of negative valence ranging from 0-1
    • sentiment_neu: degree of neutral valence ranging from 0-1
    • sentiment_overall: degree of overall valence ranging from 0-1
  • mattr_5, mattr_10, mattr_25, mattr_50, mattr_100: lexical diversity as measured by the moving average type token ratio (MATTR) score with different window sizes. Calculated on a string of the entire transcript. These are English-only measures.

  • first_person_percentage: first-person singular pronouns percentage, measured as the percentage of pronouns in the speech that are first-person singular pronouns. Calculated on a string of the entire transcript.

  • First-person singular pronoun percentage x sentiment, measuring the interaction between the speech’s sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023):

    • first_person_sentiment_positive: positive sentiment measured as (100-first_person_percentage)*sentiment_pos.
    • first_person_sentiment_negative: negative sentiment measured as first_person_percentage*sentiment_neg.
    • first_person_sentiment_overall: overall measure, computed as a mixed average of the other two (first_person_sentiment_positive where a turn’s sentiment is positive, first_person_sentiment_negative where it is negative).

    For a single speaker, this is calculated using summary measures. For multiple speakers, this measure is averaged across turns (more meaningful when there are multiple turns).
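The two interaction terms are straightforward products of the formulas stated above (a sketch):

```python
def first_person_sentiment(first_person_percentage, sentiment_pos, sentiment_neg):
    # positive component: sentiment weighted by non-first-person pronoun use
    positive = (100 - first_person_percentage) * sentiment_pos
    # negative component: sentiment weighted by first-person pronoun use
    negative = first_person_percentage * sentiment_neg
    return positive, negative
```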

  • word_repeat_percentage: repeated words percentage, measured as the percentage of words in the speech that are repeated. For a single speaker, this is calculated using the full text. For multiple speakers, this measure is averaged across turns.

  • phrase_repeat_percentage: repeated phrase percentage, measured as the percentage of phrases in the speech that are repeated. For a single speaker, this is calculated using the full text. For multiple speakers, this measure is averaged across turns.

  • Means and variances of speech coherence measures calculated at the word level (These measures are only calculated if language is in this list):

    • word_coherence_mean and word_coherence_var: LLM measure of word coherence, indicating the semantic similarity of each word to the immediately preceding word.
    • word_coherence_5_mean and word_coherence_5_var: LLM measure of word coherence over a 5-word window, indicating the semantic similarity of each word to the words within that window.
    • word_coherence_10_mean and word_coherence_10_var: LLM measure of word coherence over a 10-word window, indicating the semantic similarity of each word to the words within that window.
    • word_coherence_variability_k_mean and word_coherence_variability_k_var for k from 2 to 10: LLM measure of word-to-word variability at inter-word distance k, indicating the semantic similarity between each word and the word k positions later.

In addition to the variables above, files with multiple speakers and identified turns will also have the following variables populated:

  • num_turns: number of turns that met the minimum length threshold
  • num_one_word_turns: number of one-word turns. **Note:** this variable is not interpretable when min_turn_length is larger than 1.
  • mean_turn_length_minutes: mean length of turns in minutes
  • mean_turn_length_words: mean length of turns in words spoken
  • mean_pre_turn_pause: mean pause time before each turn
  • speaker_percentage: speaker percentage. This is the percentage of the entire file that contained speech from this speaker rather than other speakers. It is calculated by dividing speech_length_minutes from the summary output by file_length. **Note:** this variable is not interpretable when min_turn_length is larger than 1.
  • Means and variances of speech coherence measures calculated at the turn level:
    • first_order_sentence_tangeniality_mean and first_order_sentence_tangeniality_var: LLM measures of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn. This measure aims to answer the question: How much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question?
      This measure is only calculated if language is in this list

    • second_order_sentence_tangeniality_mean and second_order_sentence_tangeniality_var: LLM measures of second-order sentence tangentiality, same as above, but comparing sentences separated by one intervening sentence rather than consecutive sentences.
      This measure is only calculated if language is in this list

    • turn_to_turn_tangeniality_mean and turn_to_turn_tangeniality_var: LLM measures of response to the other speaker similarity, measured as the semantic similarity of current turn to previous turn of the other speaker. This measure makes more sense in a clinical interview context where the analysis is being run on the participant and aims to answer the question: How much sense does an answer to a question of the interviewer make?
      This measure is only calculated if language is in this list

    • semantic_perplexity_mean and semantic_perplexity_var: LLM measures of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is.
      This measure is only calculated if language is in this list

    • semantic_perplexity_5_mean and semantic_perplexity_5_var: LLM measures of semantic perplexity, but using a window of 5 words.
      This measure is only calculated if language is in this list

    • semantic_perplexity_11_mean and semantic_perplexity_11_var: LLM measures of semantic perplexity, but using a window of 11 words.
      This measure is only calculated if language is in this list

    • semantic_perplexity_15_mean and semantic_perplexity_15_var: LLM measures of semantic perplexity, but using a window of 15 words.
      This measure is only calculated if language is in this list

The exact mathematical formula used to calculate perplexity measures is

$$ \text{perplexity}(\text{Turn}) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \log P_{LM}(t_i|\text{Turn}_{\backslash i})\right) $$

where $\text{Turn} = (t_1, \ldots, t_n)$ and $P_{LM}$ is the language model used to assign probabilities (e.g., BERT).

  • turn_to_turn_tangeniality_slope: LLM measure of response to the other speaker similarity slope, calculated as the slope of the turn_to_turn_tangeniality measure on the duration of the interview. Aims to answer the question: Does the response to interviewer similarity degrade over time?
    This measure is only calculated if language is in this list

  • num_interrupts: number of interruptions, i.e. negative pre-turn pauses; the sum of interrupt flags from the turns summary.
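num_interrupts follows directly from the per-turn interrupt flags (a sketch):

```python
def num_interrupts(pre_turn_pauses):
    # An interruption is a zero or negative pre-turn pause; undefined
    # pauses (None, e.g. for a file-initial turn) are ignored.
    return sum(1 for p in pre_turn_pauses if p is not None and p <= 0)
```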


3 – Inputs

3.1 – json_conf

Type JSON
Description output from speech transcription function

3.2 – language

Type String; optional, default = 'en'
Description The language for which speech characteristics will be calculated. If the language is English, all shown variables will be calculated. If the language is not English, only language-independent variables will be calculated

3.3 – speaker_label

Type String; optional, default = None
Description The speaker label from the JSON file for which the speech characteristics are calculated

3.4 – min_turn_length

Type Integer; optional, default = 1
Description The minimum length in words a turn needs to be for per-turn measures to be calculated

3.5 – min_coherence_turn_length

Type Integer; optional, default = 5
Description The minimum length in words a turn needs to be for coherence measures to be calculated

3.6 – option

Type String; optional, default = 'coherence'
Description String that determines measures calculated; can be 'simple' or 'coherence'
Option List of variables calculated
‘simple’ Amount of speech, pause measures, sentiment, lexical richness, part of speech and repetition measures
‘coherence’ Simple measures + speech coherence measures

4 – Outputs

4.1 – words

Type pandas.DataFrame
Description Per-word measures of speech characteristics

4.2 – turns

Type pandas.DataFrame or None
Description Per-turn measures of speech characteristics in case the input JSON contains speech from multiple speakers and a speaker is identified using the speaker_label parameter

4.3 – summary

Type pandas.DataFrame
Description File-level measures of speech characteristics

5 – Dependencies

Below are dependencies specific to calculation of this measure.

Dependency License Justification
NLTK Apache 2.0 Well-established library for commonly measured natural language characteristics
LexicalRichness MIT Straightforward implementation of methods for calculation of MATTR score
vaderSentiment MIT Widely used library for sentiment analysis, trained on a large and heterogeneous dataset
transformers Apache 2.0 Library for accessing pre-trained language models. Used for calculation of word coherence measures and semantic perplexity.
sentence_transformers Apache 2.0 Library for accessing pre-trained language models for sentences and paragraphs. Used for calculation of sentence and turn tangentiality measures.
spacy MIT Natural language processing library used for lemmatizing words in MATTR calculation.

6 – References

Andreasen, N., & Pfohl, B. (1976). Linguistic Analysis of Speech in Affective Disorders. Archives of General Psychiatry, 33(11), 1361. https://doi.org/10.1001/archpsyc.1976.01770110089009

Dikaios, K., Rempel, S., Dumpala, S. H., Oore, S., Kiefte, M., & Uher, R. (2023). Applications of Speech Analysis in Psychiatry. Harvard Review of Psychiatry, 31(1), 1–13. https://doi.org/10.1097/hrp.0000000000000356

Elvevåg, B., Foltz, P. W., Weinberger, D. R., & Goldberg, T. E. (2007). Quantifying incoherence in speech: An automated methodology and novel application to schizophrenia. Schizophrenia Research, 93(1–3), 304–316. https://doi.org/10.1016/j.schres.2007.03.001

He, R., Palominos, C., Zhang, H., Alonso-Sánchez, M. F., Palaniyappan, L., & Hinzen, W. (2024). Navigating the semantic space: unraveling the structure of meaning in psychosis using different computational language models. Psychiatry Research, 333, 115752. https://doi.org/10.1016/j.psychres.2024.115752

Parola, A., Lin, J. M., Simonsen, A., Bliksted, V., Zhou, Y., Wang, H., Inoue, L., Koelkebeck, K., & Fusaroli, R. (2023). Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of NLP automated measures of coherence. Schizophrenia Research, 259, 59–70. https://doi.org/10.1016/j.schres.2022.07.002

Stasak, B., Epps, J., & Goecke, R. (2019). Automatic depression classification based on affective read sentences: Opportunities for text-dependent analysis. Speech Communication, 115, 1–14. https://doi.org/10.1016/j.specom.2019.10.003

Tang, S. X., Kriz, R., Cho, S., Park, S. J., Harowitz, J., Gur, R. E., Bhati, M. T., Wolf, D. H., Sedoc, J., & Liberman, M. (2021). Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders. Npj Schizophrenia, 7(1). https://doi.org/10.1038/s41537-021-00154-3
