
Speech characteristics v3.0

GeorgiosEfstathiadis edited this page Jun 6, 2024 · 4 revisions
Date completed March 19, 2024
Release where first appeared v2.1
Researcher / Developer Georgios Efstathiadis, Vijay Yadav

1 – Use

import openwillis as ow

words, turns, summary = ow.speech_characteristics(json_conf = '', language = '', speaker_label = '', min_turn_length = '')

2 – Methods

This function measures speech characteristics from a transcript of an individual’s speech. The input transcript is the JSON output from any OpenWillis speech transcription function (Vosk, Whisper, or AWS).

By default, the function assumes the transcript contains speech in English and all measures listed below will be populated. If the transcript is not in English, the user must specify the language argument as something other than English and only language-independent measures will be populated.

By default, the function also assumes the transcript contains speech from one speaker and all measures below are calculated from the entire transcript. Per-turn measures are not populated in this case.

In the case of multiple speakers labeled in the transcript, the user must use the speaker_label argument to specify which speaker to quantify speech characteristics for. All measures are then calculated for that speaker only, and per-turn measures are populated.

The user may wish to calculate per-turn measures only on turns meeting a minimum length requirement. To do so, they may specify the minimum number of words a turn must have using min_turn_length.
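
As a rough illustration of how speaker_label and min_turn_length interact, the sketch below groups consecutive words from the same speaker into turns and keeps only the chosen speaker's turns that meet the minimum word count. The dictionary keys ("word", "speaker") and the helper name are hypothetical and chosen for this example; they do not reflect the actual transcript JSON schema or the OpenWillis internals.

```python
from itertools import groupby

def turns_for_speaker(words, speaker_label, min_turn_length=1):
    """Group consecutive words by speaker and keep the chosen speaker's turns.

    `words` is a list of dicts with hypothetical keys "word" and "speaker".
    """
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turn = [w["word"] for w in group]
        # Keep only turns from the requested speaker that are long enough
        if speaker == speaker_label and len(turn) >= min_turn_length:
            turns.append(turn)
    return turns

words = [
    {"word": "how", "speaker": "speaker0"},
    {"word": "are", "speaker": "speaker0"},
    {"word": "you", "speaker": "speaker0"},
    {"word": "fine", "speaker": "speaker1"},
    {"word": "good", "speaker": "speaker0"},
    {"word": "I", "speaker": "speaker1"},
    {"word": "slept", "speaker": "speaker1"},
    {"word": "well", "speaker": "speaker1"},
]

# The one-word turn "fine" is dropped because it is shorter than 2 words
kept = turns_for_speaker(words, "speaker1", min_turn_length=2)
```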

2.1 – Per-word measures

The function’s first output is a words dataframe, which contains a row for each word and measures specific to that word. This includes:

  • Pause time before the word in seconds (pre_word_pause)
    Individual word timestamps in the input JSON are used to calculate pause lengths before each word. To avoid measurement of potentially long silences prior to the start of speech in an audio file, the pre_word_pause for the first word in every file is set to NaN. To distinguish pre_word_pause from pre_turn_pause as defined later in this document, pre_word_pause for the first word in each turn is also set to NaN.
  • Number of syllables, identified using NLTK’s SyllableTokenizer (num_syllables). This is an English-only measure.
  • Part of speech associated with the word, identified using NLTK, as specified in the part_of_speech column (Andreasen & Pfohl, 1976; Tang et al., 2021); these are English-only measures:
    • noun
    • verb
    • adjective
    • pronoun
    • adverb
    • determiner
  • Flag to identify whether the word is a first-person singular pronoun, i.e. 'I,' 'me,' 'my,' or 'mine' (Andreasen & Pfohl, 1976; Tang et al., 2021). This is an English-only measure. (first_person)
  • Tense of verbs identified using NLTK, i.e. whether the verb is in the past or present tense. This is an English-only measure. (verb_tense)
  • Speech coherence measures are also calculated at the word level (These measures are only calculated if language is in this list):
    • LLM measure of word coherence, indicating semantic similarity of each word to the immediately preceding word (Parola et al., 2023). (word_coherence)
    • LLM measure of word coherence (5-word window), indicating semantic similarity of each word within a 5-word window (Parola et al., 2023). (word_coherence_5)
    • LLM measure of word coherence (10-word window), indicating semantic similarity of each word within a 10-word window (Parola et al., 2023). (word_coherence_10)
    • LLM measure of word-to-word variability at k inter-word distances (for k from 2 to 10), indicating semantic similarity between each word and the word k positions after it (Parola et al., 2023). (word_coherence_variability_k for k from 2 to 10)
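
The pre_word_pause rules described above can be sketched as follows. This is a minimal illustration and not the library's implementation; the keys "start", "end", and "turn" are hypothetical names for the word timestamps (in seconds) and a turn index.

```python
import math

def pre_word_pauses(words):
    """Return the pause before each word, in seconds.

    The first word in the file and the first word of each turn get NaN,
    following the rules described in the text.
    """
    pauses = []
    for i, word in enumerate(words):
        first_in_file = i == 0
        first_in_turn = i > 0 and word["turn"] != words[i - 1]["turn"]
        if first_in_file or first_in_turn:
            pauses.append(math.nan)
        else:
            # Gap between the end of the previous word and the start of this one
            pauses.append(word["start"] - words[i - 1]["end"])
    return pauses

words = [
    {"word": "so", "start": 1.50, "end": 1.75, "turn": 0},
    {"word": "today", "start": 2.00, "end": 2.50, "turn": 0},
    {"word": "yes", "start": 3.00, "end": 3.25, "turn": 1},
    {"word": "exactly", "start": 3.75, "end": 4.25, "turn": 1},
]

pauses = pre_word_pauses(words)
```

Here the first and third words receive NaN (start of file and start of a new turn), while the other two receive the gap to the preceding word.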

2.2 – Per-turn measures

When a speaker_label is specified, per-turn measures will be populated.

A turn is defined as a segment of speech by one speaker, i.e., everything that individual says before a different speaker starts speaking. The turn-level output contains a row for each turn over the course of the file. Note that phrase-related measures work for other languages but are better suited to English. Per-turn measures include:

  • Pause time before the turn in seconds (pre_turn_pause)
    In case the first turn for the speaker is also the first turn in the audio file, this is set to NaN to correct for any potential delays between the start of the recording and the first detection of speech.
  • Total length of the turn in minutes (turn_length_minutes)
    This is the time between the beginning of the first word in the turn to the end of the last word in the turn, meaning it considers both words spoken and the pauses between those words.
  • Total number of words in the turn (turn_length_words)
    This is simply a count of the number of words detected in the turn.
  • Rate of speech in words per minute (words_per_min)
    This is calculated by dividing turn_length_words by turn_length_minutes.
  • Articulation rate, i.e., syllables per minute (syllables_per_min)
    This is calculated by summing num_syllables from the per-word measures and dividing the sum by turn_length_minutes. This is an English-only measure.
  • Speech percentage i.e., time spoken over total time (speech_percentage)
    This is the sum of time taken to speak every word in the turn divided by turn_length_minutes to quantify how much of the turn was spent speaking versus in the silences between words.
  • Mean pause time between words in seconds (mean_pause_length)
    This is the mean of all the pre_word_pause measurements for all words in the turn.
  • Pause variability in seconds (pause_variability)
    This is the variance of all the pre_word_pause measurements for all words in the turn.
  • Emotional valence associated with the turn, calculated using vaderSentiment:
    To use these measures, we recommend that the min_turn_length argument be set to at least 5 words. These are English-only measures.
    • Degree of positive valence ranging from 0-1 (sentiment_pos)
    • Degree of negative valence ranging from 0-1 (sentiment_neg)
    • Degree of neutral valence ranging from 0-1 (sentiment_neu)
    • Degree of overall valence ranging from 0-1 (sentiment_overall)
  • Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)
    To use this measure, we recommend that the min_turn_length argument be set to at least 5 words. This is an English-only measure.
  • First-person singular pronouns percentage, measured as the percentage of pronouns in the utterance that are first-person singular pronouns (first_person_percentage)
  • First-person singular pronouns percentage x sentiment separated, measuring the interaction between the speech’s sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023)
    • Positive, measured as (100-first_person_percentage)*sentiment_pos. (first_person_sentiment_positive)
    • Negative, measured as first_person_percentage*sentiment_neg. (first_person_sentiment_negative)
  • Repeated words percentage, measured as the percentage of words in the utterance that are repeated (Stasak et al., 2019). (word_repeat_percentage)
  • Repeated phrase percentage, measured as the percentage of phrases in the utterance that are repeated (Stasak et al., 2019). (phrase_repeat_percentage)
  • Speech coherence measures calculated at the turn level:
    • LLM measure of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn (Tang et al., 2021; He et al., 2024; Elvevåg et al., 2007; Parola et al., 2023). (first_order_sentence_tangeniality)

      This measure aims to answer the question: How much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question?

      This measure is only calculated if language is in this list

    • LLM measure of second-order sentence tangentiality, same as above, but instead of consecutive sentences it compares sentences separated by one intervening sentence (Parola et al., 2023). (second_order_sentence_tangeniality)

      This measure is only calculated if language is in this list

    • LLM measure of response to the other speaker similarity, measured as the semantic similarity of current turn to previous turn of the other speaker (Tang et al., 2021). (turn_to_turn_tangeniality)

      This measure makes more sense in a clinical interview context where the analysis is being run on the participant and aims to answer the question: How much sense does an answer to a question of the interviewer make?
      This measure is only calculated if language is in this list

    • LLM measure of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is (Parola et al., 2023). The exact mathematical formula used to calculate this measure is

$$ \text{perplexity}(\text{Turn}) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \log P_{LM}(t_i|\text{Turn}_{\backslash i})\right) $$

where $\text{Turn} = (t_1, \ldots, t_n)$, $\text{Turn}_{\backslash i}$ is the turn with token $t_i$ masked out, and $P_{LM}$ is the probability assigned by the language model used (e.g., BERT). (semantic_perplexity) This measure is only calculated if language is in this list

  • Flag for pre-turn pauses that are zero or negative, i.e. interruptions (interrupt_flag)
    This is set to True if pre-turn pause is negative or zero (indicative of an interruption) and otherwise, it is set to False. This variable is used in the summary output to calculate num_interrupts.
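
The timing-based per-turn measures above can be illustrated with a short sketch. It is not the OpenWillis implementation: the "start" and "end" keys are hypothetical, speech_percentage is scaled to 0-100 as an assumption, and the sample variance is used for pause_variability because the text does not state which variance estimator is applied.

```python
import math
import statistics

def turn_measures(words, previous_turn_end=None):
    """Per-turn measures from one turn's word timestamps (in seconds)."""
    turn_seconds = words[-1]["end"] - words[0]["start"]
    spoken_seconds = sum(w["end"] - w["start"] for w in words)
    # Pauses between consecutive words inside the turn
    pauses = [nxt["start"] - cur["end"] for cur, nxt in zip(words, words[1:])]
    # NaN when there is no earlier turn to measure from
    pre_turn_pause = (
        math.nan if previous_turn_end is None
        else words[0]["start"] - previous_turn_end
    )
    turn_length_minutes = turn_seconds / 60
    return {
        "pre_turn_pause": pre_turn_pause,
        "turn_length_minutes": turn_length_minutes,
        "turn_length_words": len(words),
        "words_per_min": len(words) / turn_length_minutes,
        # Assumption: expressed as a percentage between 0 and 100
        "speech_percentage": 100 * spoken_seconds / turn_seconds,
        "mean_pause_length": statistics.mean(pauses) if pauses else math.nan,
        # Assumption: sample variance; needs at least two pauses
        "pause_variability": statistics.variance(pauses) if len(pauses) > 1 else math.nan,
        # A NaN pause compares as False, so a first turn is never flagged
        "interrupt_flag": pre_turn_pause <= 0,
    }

words = [
    {"start": 10.0, "end": 10.5},
    {"start": 11.0, "end": 11.5},
    {"start": 12.5, "end": 13.0},
]

# The previous speaker finished at 10.25 s, after this turn had already started
measures = turn_measures(words, previous_turn_end=10.25)
```

In this example the turn starts 0.25 seconds before the other speaker stops, so pre_turn_pause is negative and interrupt_flag is True.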

2.3 – Summary

The summary dataframe compiles file-level information. In the case of a single speaker, these measures are calculated by compiling information from the per-word measures output. In the case of multiple speakers, these are calculated by compiling information from the per-turn measures output. Note that phrase-related measures work for other languages but are better suited to English.

  • File length (file_length)
    This is the total length of the audio file in minutes.

  • Total length of speech in minutes (speech_length_minutes)
    Single speaker: time from the beginning of the first word to the end of the last word
    Multiple speakers: sum of turn_length_minutes across all turns

  • Total number of words spoken (speech_length_words) across the whole file
    Single speaker: total count of words in the file
    Multiple speakers: sum of turn_length_words across all turns

  • Rate of speech in words per minute (words_per_min) across the whole file
    speech_length_words / speech_length_minutes

  • Articulation rate i.e. syllables per minute (syllables_per_min) across the whole file
    Single speaker: num_syllables from all words / speech_length_minutes
    Multiple speakers: num_syllables from filtered turns / speech_length_minutes

  • Speech percentage i.e., time spoken over total time (speech_percentage)
    speech_length_minutes / file_length

  • Mean pause time between words in seconds (mean_pause_length)
    Mean of pre_word_pause across all words in file

  • Pause variability in seconds (mean_pause_variability)
    Variability of pre_word_pause across all words in file

  • Emotional valence associated with the speech, calculated using vaderSentiment
    Calculated on a string of the entire transcript
    To use these measures, we recommend that the min_turn_length argument be set to at least 5 words. These are English-only measures.

    • Degree of positive valence ranging from 0-1 (sentiment_pos)
    • Degree of negative valence ranging from 0-1 (sentiment_neg)
    • Degree of neutral valence ranging from 0-1 (sentiment_neu)
    • Degree of overall valence ranging from 0-1 (sentiment_overall)
  • Lexical diversity as measured by the moving average type token ratio (MATTR) score (mattr)
    Calculated on a string of the entire transcript
    To use this measure, we recommend that the min_turn_length argument be set to at least 5 words. This is an English-only measure.

  • First-person singular pronouns percentage, measured as the percentage of pronouns in the speech that are first-person singular pronouns (first_person_percentage)

    Calculated on a string of the entire transcript

  • First-person singular pronouns percentage x sentiment separated, measuring the interaction between the speech’s sentiment and the use of first-person pronouns, split into two variables (Dikaios et al., 2023)

    • Positive, measured as (100-first_person_percentage)*sentiment_pos. (first_person_sentiment_positive)
    • Negative, measured as first_person_percentage*sentiment_neg. (first_person_sentiment_negative)
    • Overall, measured as a mixed average of the other two measures: first_person_sentiment_positive is used where a turn is positive and first_person_sentiment_negative where it is negative, averaged across turns for multiple speakers or taken over the entire text otherwise. (first_person_sentiment_overall)

    Single speaker: calculated using summary measures

    Multiple speakers: averaged measure across turns (more meaningful when multiple utterances/turns)

  • Repeated words percentage, measured as the percentage of words in the speech that are repeated. (word_repeat_percentage)

    Single speaker: calculated using full text

    Multiple speakers: averaged measure across turns

  • Repeated phrase percentage, measured as the percentage of phrases in the speech that are repeated. (phrase_repeat_percentage)

    Single speaker: calculated using full text

    Multiple speakers: averaged measure across turns

  • Means and variances of speech coherence measures calculated at the word level (These measures are only calculated if language is in this list):

    • LLM measure of word coherence, indicating semantic similarity of each word to the immediately preceding word. (word_coherence_mean and word_coherence_var)
    • LLM measure of word coherence (5-word window), indicating semantic similarity of each word within a 5-word window. (word_coherence_5_mean and word_coherence_5_var)
    • LLM measure of word coherence (10-word window), indicating semantic similarity of each word within a 10-word window. (word_coherence_10_mean and word_coherence_10_var)
    • LLM measure of word-to-word variability at k inter-word distances (for k from 2 to 10), indicating semantic similarity between each word and the word k positions after it. (word_coherence_variability_k_mean and word_coherence_variability_k_var for k from 2 to 10)
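
To make the word-level coherence summaries more concrete, the sketch below computes the similarity of each word to the immediately preceding word and then the mean and variance across the file. Several things here are assumptions for illustration only: cosine similarity as the similarity function, toy 2-dimensional vectors in place of real language-model embeddings, NaN for the first word, and the sample variance as the estimator.

```python
import math
import statistics

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def adjacent_word_coherence(embeddings):
    """Similarity of each word to the preceding one; NaN for the first word."""
    scores = [math.nan]
    for prev, curr in zip(embeddings, embeddings[1:]):
        scores.append(cosine_similarity(prev, curr))
    return scores

# Toy vectors standing in for real word embeddings (hypothetical values)
embeddings = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 2.0)]

word_coherence = adjacent_word_coherence(embeddings)

# Summary statistics ignore the undefined first value
valid = [s for s in word_coherence if not math.isnan(s)]
word_coherence_mean = statistics.mean(valid)
word_coherence_var = statistics.variance(valid)
```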

In addition to the variables above, files with multiple speakers and identified turns will also have the following variables populated:

  • Number of turns that met the minimum length threshold (num_turns)
  • Number of one-word turns (num_one_word_turns)
    *variable not interpretable when min_turn_length is larger than 1
  • Mean length of turns in minutes (mean_turn_length_minutes)
  • Mean length of turns in words spoken (mean_turn_length_words)
  • Mean pause time before each turn (mean_pre_turn_pause)
  • Speaker percentage (speaker_percentage)
    This is the percentage of the entire file that contained speech from this speaker rather than other speakers. It is calculated by dividing speech_length_minutes from the summary output by file_length.
    *variable not interpretable when min_turn_length is larger than 1
  • Means and variances of speech coherence measures calculated at the turn level:
    • LLM measure of sentence tangentiality, measured as the average semantic similarity of each phrase in the turn to the previous phrase in the turn. (first_order_sentence_tangeniality_mean and first_order_sentence_tangeniality_var)

      This measure aims to answer the question: How much sense do consecutive sentences make? Do they veer off-topic, failing to address the original point or question?
      This measure is only calculated if language is in this list

    • LLM measure of second-order sentence tangentiality, same as above, but instead of consecutive sentences it compares sentences separated by one intervening sentence. (second_order_sentence_tangeniality_mean and second_order_sentence_tangeniality_var)
      This measure is only calculated if language is in this list

    • LLM measure of response to the other speaker similarity, measured as the semantic similarity of current turn to previous turn of the other speaker. (turn_to_turn_tangeniality_mean and turn_to_turn_tangeniality_var)

      This measure makes more sense in a clinical interview context where the analysis is being run on the participant and aims to answer the question: How much sense does an answer to a question of the interviewer make?
      This measure is only calculated if language is in this list

    • LLM measure of semantic perplexity, measured as the pseudo-perplexity of the turn and indicating how predictable the turn is. The exact mathematical formula used to calculate this measure is

$$ \text{perplexity}(\text{Turn}) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \log P_{LM}(t_i|\text{Turn}_{\backslash i})\right) $$

where $\text{Turn} = (t_1, \ldots, t_n)$, $\text{Turn}_{\backslash i}$ is the turn with token $t_i$ masked out, and $P_{LM}$ is the probability assigned by the language model used (BERT). (semantic_perplexity_mean and semantic_perplexity_var) This measure is only calculated if language is in this list

  • LLM measure of response to the other speaker similarity slope, calculated as the slope of the turn_to_turn_tangeniality measure on the duration of the interview (turn_to_turn_tangeniality_slope)

    Aims to answer the question: Does the response to interviewer similarity degrade over time?

  • Number of interruptions, i.e. zero or negative pre-turn pauses (num_interrupts); this is the sum of interrupt_flag across the per-turn output.
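
The pseudo-perplexity formula used for the semantic perplexity measures can be worked through numerically. The sketch below only evaluates the formula; the per-token probabilities are made up for illustration, whereas in practice each one would come from a masked language model scoring token t_i given the rest of the turn.

```python
import math

def pseudo_perplexity(token_probs):
    """exp of the negative mean log-probability of each token given the rest of the turn."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical probabilities P_LM(t_i | Turn without t_i) for a 4-token turn
token_probs = [0.5, 0.25, 0.5, 0.25]
ppl = pseudo_perplexity(token_probs)

# A turn whose tokens are all equally likely at 1/4 has perplexity 4
uniform_ppl = pseudo_perplexity([0.25] * 4)
```

Lower values indicate a more predictable turn: if every token had probability 1, the result would be exactly 1.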


3 – Inputs

3.1 – json_conf

Type JSON
Description output from speech transcription function

3.2 – language

Type String, optional, default = 'en'
Description The language for which speech characteristics will be calculated. If the language is English, all variables listed above will be calculated. If the language is not English, only language-independent variables will be calculated.

3.3 – speaker_label

Type String, optional, default = None
Description The speaker label from the JSON file for which the speech characteristics are calculated

3.4 – min_turn_length

Type Integer, optional, default = 1
Description The minimum length in words a turn needs to be for per-turn measures to be calculated

4 – Outputs

4.1 – words

Type pandas.DataFrame
Description Per-word measures of speech characteristics

4.2 – turns

Type pandas.DataFrame or None
Description Per-turn measures of speech characteristics in case the input JSON contains speech from multiple speakers and a speaker is identified using the speaker_label parameter

4.3 – summary

Type pandas.DataFrame
Description File-level measures of speech characteristics

5 – Dependencies

Below are dependencies specific to calculation of this measure.

| Dependency | License | Justification |
|---|---|---|
| NLTK | Apache 2.0 | Well-established library for commonly measured natural language characteristics |
| LexicalRichness | MIT | Straightforward implementation of methods for calculation of MATTR score |
| vaderSentiment | MIT | Widely used library for sentiment analysis, trained on a large and heterogeneous dataset |

6 – References

Andreasen, N., & Pfohl, B. (1976). Linguistic Analysis of Speech in Affective Disorders. _Archives of General Psychiatry_, _33_(11), 1361. https://doi.org/10.1001/archpsyc.1976.01770110089009


Dikaios, K., Rempel, S., Dumpala, S. H., Oore, S., Kiefte, M., & Uher, R. (2023). Applications of Speech Analysis in Psychiatry. _Harvard Review of Psychiatry_, _31_(1), 1–13. https://doi.org/10.1097/hrp.0000000000000356


Elvevåg, B., Foltz, P. W., Weinberger, D. R., & Goldberg, T. E. (2007). Quantifying incoherence in speech: An automated methodology and novel application to schizophrenia. _Schizophrenia Research_, _93_(1–3), 304–316. https://doi.org/10.1016/j.schres.2007.03.001


He, R., Palominos, C., Zhang, H., Alonso-Sánchez, M. F., Palaniyappan, L., & Hinzen, W. (2024). Navigating the semantic space: unraveling the structure of meaning in psychosis using different computational language models. _Psychiatry Research_, _333_, 115752. https://doi.org/10.1016/j.psychres.2024.115752


Parola, A., Lin, J. M., Simonsen, A., Bliksted, V., Zhou, Y., Wang, H., Inoue, L., Koelkebeck, K., & Fusaroli, R. (2023). Speech disturbances in schizophrenia: Assessing cross-linguistic generalizability of NLP automated measures of coherence. _Schizophrenia Research_, _259_, 59–70. https://doi.org/10.1016/j.schres.2022.07.002


Stasak, B., Epps, J., & Goecke, R. (2019). Automatic depression classification based on affective read sentences: Opportunities for text-dependent analysis. _Speech Communication_, _115_, 1–14. https://doi.org/10.1016/j.specom.2019.10.003


Tang, S. X., Kriz, R., Cho, S., Park, S. J., Harowitz, J., Gur, R. E., Bhati, M. T., Wolf, D. H., Sedoc, J., & Liberman, M. (2021). Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders. _Npj Schizophrenia_, _7_(1). https://doi.org/10.1038/s41537-021-00154-3