
Research guidelines


1 – Introduction

Although the documentation thus far provides technical instructions on how to install and use OpenWillis, there are scientific considerations to keep in mind when designing and conducting experiments that use OpenWillis. These guidelines walk through best practices that will help ensure high-quality measurements from data processed using OpenWillis.

2 – General video guidelines

The functions are designed to calculate behavioral characteristics from videos of individuals, specifically from the head/face region. Hence, the assumption is that the videos being passed through are indeed of a single individual, facing the camera for at least part of the video, with sufficient data quality to make out a face. Below are some general guidelines to consider.

2.1 – Head direction

The underlying computer vision models work best when the individual is facing the camera.

If collecting data for an experiment, it would be ideal if the camera was placed as directly in front of the individual as possible (or, for example, right over the shoulder of the interviewer, if there is one).

If working with previously collected videos, such as those from clinical interviews, the functions should still work fairly well as long as the individual is facing mostly towards the lens. If they’re looking away sometimes, that’s okay. The functions will not recognize a face in those frames and will not process any variables from them (as will be reflected in the framewise output). As long as there are sufficient frames where the individual is facing the lens, the summary statistics can still be relied upon for downstream analyses.
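
Before relying on the summary statistics, it can be worth checking what proportion of frames actually had a detectable face. Below is a minimal sketch of that check, assuming the framewise output has been loaded as a table in which frames without a detected face hold missing values; the file name and column name here are hypothetical.

```python
# Minimal sketch: check how many frames had a detectable face before trusting
# the summary statistics. The file and column names are hypothetical; the
# assumption is that frames without a face are reported as missing values (NaN).
import pandas as pd

framewise = pd.read_csv("framewise_output.csv")

# Proportion of frames in which a face was detected
detected = framewise["expressivity"].notna().mean()
print(f"Face detected in {detected:.0%} of frames")

# Summary statistics over frames with a detected face (pandas skips NaN by default)
print(framewise["expressivity"].describe())
```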

2.2 – Persons in video

Currently, the functions assume there is only one person in the video. They are unable to process data from two or more faces separately, and in the case of multiple faces, there is no way for the user to control which face is processed for data.

If collecting data for an experiment, it is best to collect it in a way that only the individual whose behavior needs to be measured is in the video. Alternatively, it should be collected in a way where the second (or third, or fourth, etc.) person can be cropped out.

If working with existing videos that happen to have multiple individuals in the frame (or videos that cut between closeups on different individuals), then it is the user’s responsibility to crop and/or trim the video to only contain frames of the individual whose behavior they want to measure.
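
One common way to do that trimming and cropping is with ffmpeg. Below is a sketch that assumes ffmpeg is installed and available on the PATH; the timestamps and crop region are placeholders to be replaced with values appropriate to the video at hand.

```python
# Sketch: trim a video to the segment containing the target individual and crop
# the frame to the region around them (assumes ffmpeg is installed).
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "interview.mp4",
    "-ss", "00:01:30",          # start of the segment with the target individual
    "-to", "00:04:45",          # end of the segment
    "-vf", "crop=640:720:0:0",  # width:height:x:y of the region containing their face
    "trimmed_cropped.mp4",
], check=True)
```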

2.3 – Data quality

Folks often think image resolution is the biggest barrier to data quality. In reality, as long as you’re not working with flip-phone footage from the early 2000s, you’re probably fine. Super high resolution will not make a difference either; the models work by cropping the image around the face and resampling it to 256x256 per frame. So, as long as you have that much resolution around the face, you’re in good shape. Any modern camera where the individual is the subject of the video will suffice.

More important factors for data quality are: objects (hats, glasses, etc.) that may obscure parts of the face; poor lighting, particularly shadows across the face, which can confuse the models; and other objects in the video that could be mistaken for faces (pictures in the background, stuffed toys). It’s best to stick to plain videos with not much else going on.

3 – Processing facial expressivity

Beyond general video guidelines, there are specific recommendations for when the user is analyzing facial expressivity as part of an experiment. We've learned the hard way that incorporating baseline measures can be helpful when looking at intra- or inter-individual variability and is necessary in certain patient populations. Baseline measures are highly recommended for all clinical experiments.

3.1 – Using baseline measures

You’ll notice that the facial landmark and facial emotion functions allow for a baseline video to be included as one of the inputs. When a baseline video is included, the function first calculates mean facial expressivity in the baseline video, then calculates framewise expressivity in the main video, and finally corrects the framewise values against the baseline. The resulting measures are then expressivity relative to the baseline.

Without the baseline, the function will calculate facial expressivity as-is. This is sufficient in some contexts. However, most of the time, an individual’s facial or emotional expressivity will differ based on factors that may be confounding variables in your experiment. A baseline video can correct for that and exclude the effect of those confounding variables.
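
To make the idea concrete, here is a purely illustrative sketch of baseline correction. OpenWillis applies its own correction internally when a baseline video is supplied; this example simply normalizes framewise values by the baseline mean to show the concept, which may not match the exact correction OpenWillis performs.

```python
# Purely illustrative: the concept behind baseline correction. The numbers are
# made up, and the simple normalization shown here is not necessarily the exact
# correction OpenWillis applies internally.
import numpy as np

baseline_framewise = np.array([0.21, 0.19, 0.22, 0.20])  # hypothetical baseline values
main_framewise = np.array([0.35, 0.41, 0.30, 0.38])      # hypothetical main-video values

baseline_mean = baseline_framewise.mean()
relative_expressivity = main_framewise / baseline_mean   # expressivity relative to baseline
print(relative_expressivity)
```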

Below are a couple of examples that hopefully make a good case for baselining in most scenarios.

Depression – let’s say we’re looking at emotional expressivity during a response to an open-ended question in depressed individuals receiving treatment over several weeks. A variable that may influence facial expressivity measurements is fatigue (affected by factors unrelated to underlying disease progression, such as time of day or events at work). If we measure facial expressivity as-is, we’re going to get a noisy measurement that varies as a consequence of fatigue and not necessarily of the motor retardation that exists as a consequence of the neurobiology of depression. But if we collect a baseline at every time point, where the individual is asked to read out a sentence or short passage, baselining would remove the effects of whatever confounding variables may be affecting facial expressivity at that time point and give a cleaner measure of expressivity as it is affected by depression. An example of baselining in depression can be found in this paper.

Schizophrenia – let’s say we want to quantify flattened facial affect in individuals with schizophrenia at just one time point and compare it to healthy volunteers. Folks with schizophrenia are likely on an antipsychotic of some sort, and many antipsychotics can lead to a side effect known as tardive dyskinesia, which in turn can lead to involuntary movements in the face. These involuntary movements show up as heightened facial expressivity and may make it seem (in the resulting measures) like someone is being very expressive when, in reality, it’s just the involuntary facial movements that are being quantified. But if we collect a video where the individual is simply asked to sit and look at the camera for 10 seconds and use it to baseline the actual video of them responding to, say, an open-ended question, then we can somewhat remove the effect of tardive dyskinesia from the resulting measures of facial expressivity. If we don’t do this baselining, the effect of tardive dyskinesia may make it seem like individuals with schizophrenia are more expressive than healthy volunteers, which wouldn’t make sense. An example of this can be found in this paper.

As you can see, what the baseline video contains will differ case by case, and the best person to make that judgment is the scientist designing the experiment, because they will be most aware of possible confounding factors for the measurement. We’re here to help, so don’t hesitate to get in touch via [email protected] if you want another perspective. But hopefully the case for baselining is by now strong enough for you to do it.

4 – General audio guidelines

Similar to video analysis, audio analysis is designed to quantify properties of human voice and speech. So, there is a basic assumption that the inputted audio file contains human voice and not much else; other sounds will still produce output measures, but they may be nonsensical. Here are some things to consider:

4.1 – Dealing with background noise

The vocal_acoustics function essentially conducts signal processing on the waveform of the audio file, irrespective of what the waveform contains. Hence, if there is background noise mixed in with the individual’s voice, it’s going to add noise to the measurement. So, if the experimenter has enough control over the data collection to minimize background noise, they should absolutely do that. If background noise is unavoidable (e.g., the data has already been collected, or the data is being collected in an uncontrolled environment), there are a couple of things the user can do and/or hope for.

  1. Apply some kind of background noise cancellation function (see the sketch after this list). There are a few of these out there and plenty of research on effective ways to do this. We plan on integrating background noise cancellation as part of audio processing in future versions of OpenWillis, but for now it’s up to the user to conduct background noise cancellation on the file before processing measures.
  2. Hope that the signal-to-noise ratio is good enough. That is to say, background noise may be affecting the measures to some extent; we can acknowledge that and still maintain that we’ll see an effect in our analysis, as long as the underlying acoustic property is measured clearly enough. Even with background noise present, perhaps 90% of the waveform still represents voice, and that may be enough.
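
As one way to approach option 1, a general-purpose denoising package can be applied to the file before it is passed to OpenWillis. The sketch below uses the open-source noisereduce package as an example; it is one possible tool among many, not a prescribed part of the OpenWillis workflow.

```python
# Sketch: reduce stationary background noise in a (mono) recording before
# processing, using the open-source noisereduce package as one example tool.
import noisereduce as nr
import soundfile as sf

audio, sample_rate = sf.read("raw_recording.wav")

# Estimate the noise profile from the signal and subtract it
cleaned = nr.reduce_noise(y=audio, sr=sample_rate)

sf.write("cleaned_recording.wav", cleaned, sample_rate)
```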

As far as the speech_transcription functions go, the more noise there is, the more likely there will be transcription errors, which in turn affect the measures outputted by the speech_characteristics function.

4.2 – Trimming the ends

Some of the output variables, such as the length of pauses or the time spent speaking versus not speaking, distinguish between audio frames where voice was detected and frames where it was not. When using these measures, the experimenter should trim the ends of the audio file so that the bookends of the collected audio do not affect the measurement. It may be that the individual started speaking a few seconds after the recording started and the experimenter stopped the recording a few seconds after the individual was done speaking; we don’t want the few seconds at the beginning and end of the recording to affect downstream measures of voice/speech prevalence or pause characteristics.
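
This trimming can be done by hand in any audio editor, or scripted. Below is a sketch that removes quiet stretches at the beginning and end of a file using librosa’s silence-trimming utility; the top_db threshold is a starting point that may need tuning for a given recording setup.

```python
# Sketch: trim leading and trailing silence before computing pause or
# speech-prevalence measures. The top_db threshold may need tuning.
import librosa
import soundfile as sf

audio, sample_rate = librosa.load("response.wav", sr=None)

# Remove quiet stretches at the beginning and end of the recording
trimmed, _ = librosa.effects.trim(audio, top_db=30)

sf.write("response_trimmed.wav", trimmed, sample_rate)
```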

4.3 – Persons in audio

Similar to video, when measuring vocal acoustic properties, the assumption is that the audio file contains speech from one individual. If the file contains two people's voices, the function will not distinguish between them and will give measures for all the speech in the file.

It's for this very scenario that we developed the speaker_separation function, which should be able to automate the separation of a single audio file with two speakers into two audio files, each containing the voice of only one individual. All downstream processing would then happen on those files separately.

As far as speech characteristics are concerned, the updates in OpenWillis v1.2 allow for processing speech characteristics from audio files with up to two speakers (see function documentation).

4.4 – Transcription accuracy

Regardless of how clean or crisp your audio file is, always expect there to be some transcription errors. This may be surprising because your Alexa understands you perfectly and the methods we use are not far off. But note that we speak differently, and more clearly, when we know that a machine will be transcribing our speech than when we’re talking to other humans. It's the latter kind of speech that we will most likely be processing through OpenWillis, and that kind of speech is more prone to transcription errors. There's not much we can do about this (except reduce background noise and other factors that would lead to poor data quality) but accept and acknowledge that there is likely noise in our signal.

4.5 – Recording setup

It's well understood that the recording hardware you’re using has an effect on the variables you’re measuring. The same sound, analyzed using the same code but recorded by two different types of microphones, will lead to different measures. This sucks and is a consequence of these measures being quite sensitive. The best thing the experimenter can do is to be consistent with the hardware used. That's easy if the experimenter is also the one collecting the data. But if data is collected through phone calls, or individuals submit their own data through their own phones or computers, then the experimenter does not have any control over the type of microphone used. In those scenarios, it’s important to state that as a confounding variable in any findings communicated from the research.

5 – Interpreting vocal acoustic variables

As promised, part of the benefit of using this Python library is that we’re going to help make some sense of the vocal acoustic variables that it outputs. After all, anyone who’s not a speech expert probably doesn’t know what to make of a glottal-to-noise excitation ratio.

We try to link as much documentation as we can on each of the features in the methods descriptions. We hope to publish more comprehensive content to help users understand and interpret these variables.

For now, all we want to say is that though these variables may seem quite obscure, they’re actually well documented in the scientific literature. There's a whole world of speech research that has defined each of these variables, and there are plenty of papers, probably in the patient population you’re studying, where the relationship between each of these variables and disease severity has been explored. Even Wikipedia will do an adequate job defining the variables, and a quick lit review will give you a good idea of how these variables should behave in your population of interest and what kind of voice data they are best calculated from. For example, fundamental frequency is more useful when measured from sustained vowel phonation than from free speech. That's not obvious to the layperson, but it will be obvious to you when you simply look for papers that measure it in your population of interest.
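
As an illustration of how approachable these variables are with standard tooling, here is a sketch that estimates fundamental frequency (F0) from a recording of sustained vowel phonation using librosa’s pYIN implementation. The frequency bounds are rough defaults for adult speech, not prescriptions, and this is independent of how OpenWillis computes its own acoustic measures.

```python
# Sketch: estimate mean fundamental frequency (F0) from a sustained vowel
# recording using pYIN. Frequency bounds are rough defaults for adult speech.
import librosa
import numpy as np

audio, sample_rate = librosa.load("sustained_vowel.wav", sr=None)

f0, voiced_flag, _ = librosa.pyin(audio, fmin=65.0, fmax=400.0, sr=sample_rate)

# Average F0 over voiced frames only
print(f"Mean F0: {np.nanmean(f0[voiced_flag]):.1f} Hz")
```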

5.1 – Advanced speech analysis

When using the speech_characteristics function, the user will get some simple variables characterizing speech, such as pause characteristics or rate of speech, as well as some higher-level features, like the emotional valence of speech or the parts of speech used. Rate of speech is easy to calculate since it just requires counting words spoken per minute; higher-level features are not as simple to calculate since they depend on pre-trained models. Hence, when using them, some things have to be taken into consideration.
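
To show just how simple the simple end of that spectrum is, the arithmetic behind rate of speech is sketched below, with made-up values for the transcript and its duration.

```python
# The arithmetic behind rate of speech: words spoken per minute.
# The transcript and duration below are made-up illustrative values.
transcript = "I have been feeling a little more tired than usual this week"
duration_seconds = 4.8

words_per_minute = len(transcript.split()) / (duration_seconds / 60)
print(f"{words_per_minute:.0f} words per minute")
```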

5.2 – Pre-trained models are not perfect

Something like deducing the emotional valence of speech relies on pre-trained machine learning models that were not, and need not be, trained by our team. These models already exist and have been trained on large datasets collected far beyond the scope of a digital measurement library for clinical functioning like OpenWillis.

That being said, these methods are just that: trained models making predictions, and they have never been claimed to be 100% accurate all of the time. We picked these models rather than others because we feel confident in their ability to provide a reliable output, but we can’t personally vouch that they’ll always work. Keep in mind that this is still the cutting edge of natural language processing, and it’s worth calculating these measures and seeing if they correlate with (for example) disease severity in the direction we would hypothesize.

5.3 – Pre-trained models aren’t language-agnostic

The models will have been trained on data from a specific language, most likely English. This means that even though OpenWillis can transcribe multiple languages, it cannot calculate the same list of speech characteristics for all of those languages. The current setup of the speech_characteristics function processes a different list of speech characteristics for data inputted in English versus data inputted in other languages.

If you are aware of any advanced speech characteristics that can be acquired from reliable pre-trained models in other languages, please get in touch.

6 – Consideration of different behaviors

This is perhaps one of the most important considerations when analyzing data using OpenWillis. The nature of the behavior the individual is participating in has a significant impact on how the calculated variables can be used in an experiment and how they should subsequently be interpreted.

For example, facial expressivity calculated when someone is quietly watching a video versus when they are talking about something is not the same variable, and in most cases the two are not comparable. Rate of speech calculated when someone is reading a passage out loud versus when they’re asked to speak freely about a topic is likewise not the same variable.

Hence, when designing an experiment or analyzing data, the researcher must think actively about the behavior the individual is participating in and the effect that may have on the measurement.

6.1 – Splitting behaviors in advance

OpenWillis of course does not distinguish between different behaviors or calculate variables for them separately; it is up to the user to calculate variables for individual behaviors separately. For example, when processing video from a clinical interview, there may be times when the patient is (a) listening to the interviewer, (b) responding to open-ended questions, or (c) being asked to repeat phrases. In this case, the user will have to trim the video for each of these three ‘behaviors’, process them separately, and consequently handle the calculated variables separately (see the sketch below). To conflate them all would lead to a noisy variable and perhaps too low a signal-to-noise ratio to see an effect. The same applies to voice and speech analysis. Variables will differ based on whether the individual is speaking freely, repeating phrases, reading out loud, being asked to make sustained vowel sounds, etc.
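
One way to organize this splitting is to define the segments up front and cut each one into its own clip before processing. The sketch below does this with ffmpeg (assumed to be installed); the labels and timestamps are placeholders for the actual structure of the interview.

```python
# Sketch: split a recording into behavior-specific clips, one per segment.
# Assumes ffmpeg is installed; labels and timestamps are placeholders.
import subprocess

segments = {
    "listening":  ("00:00:10", "00:02:00"),
    "open_ended": ("00:02:05", "00:06:30"),
    "repetition": ("00:06:40", "00:08:00"),
}

for label, (start, end) in segments.items():
    subprocess.run([
        "ffmpeg", "-i", "interview.mp4",
        "-ss", start, "-to", end,
        "-c", "copy",  # stream copy is fast; re-encode if frame-accurate cuts are needed
        f"interview_{label}.mp4",
    ], check=True)
```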

The question then is at what granularity behaviors need to be separated and which behaviors need to be processed independently of each other. Unfortunately, there isn’t really one answer to that question, and the best people to have a perspective on this are the experimenters themselves. This is because they understand better than anyone else the population they are studying and the dataset they are using, and they can best determine which behaviors need to be processed separately.

Though digital phenotyping of this kind is a relatively novel field, there is precedent in the literature for analysis of visual and auditory behaviors, and our primary advice on splitting behaviors would be to dive into the literature and see how others have done this in the past, whether in your own patient population or generally for the measure you are interested in. We've learned that besides simple common sense (i.e., not comparing facial expressivity between individuals that have been asked to be quiet versus individuals that are responding to questions), there are a lot of other scenarios where behaviors can be split (e.g., analyzing the emotional valence of speech when describing positively versus negatively valenced stimuli).

6.2 – Multiplication of variables by behavior

As one can imagine, when the experimenter splits behaviors and processes variables from them separately, it multiplies the number of variables they are working with, because now they don’t just have a measure of facial expressivity or happiness expressivity; they have that measure for each behavior they captured. So, even though OpenWillis captures (at present) a handful of behavioral variables, the number of variables the user can end up with as part of their experiment can multiply to a number much larger than that.

We wanted to take a second to point this out explicitly because the field of digital phenotyping isn’t as complicated as people make it out to be. There are digital measurement folks who provide these services who like to boast having hundreds or even thousands of variables. A lot of the time, they’re just multiplying the variables by the different behaviors they can be captured in; ultimately, the measures themselves are quite simple.

7 – Amount of data needed

One of the most common questions we receive has to do with the amount of behavioral data needed to calculate the variables reliably.

The bad news, of course, is that the answer to that question depends on several factors, and it is the experimenter, rather than the developers of the methods, who will have the better perspective on this.

The good news, however, is that the answer is not as much data as one would think. Calculating vocal acoustic variables does not need hours or even several minutes of data; there’s precedent in the literature of 60-second recordings being sufficient to see an effect. When it comes to speech data, about the length of a paragraph is enough speech to capture most speech characteristics of relevance. When it comes to facial expressivity, oculomotor behavior, or other types of movement behaviors, the answer is the same: there’s precedent in the literature that effects can be quantified quite quickly. This makes sense because if we ask ourselves how long it takes us as human observers to notice the same behaviors we’re trying to quantify using OpenWillis, we’d say it doesn’t really take that long to assess someone’s mood or sense something off about their voice or what they’re saying. The same would be true for OpenWillis. Further good news is that, for most if not all of these variables, there is precedent in the scientific literature for how much data is sufficient, and that can be relied upon greatly by the experimenter.
