This repository contains an R package which is a torch R wrapper around the Silero Voice Activity Detection model.
The package was created with as main goal to remove non-speech audio segments before doing an automatic transcription using audio.whisper to avoid transcription hallucinations. It contains
- functions to detect the location of voice in audio using a the .jit model from Silero implemented in torch
- it has similar functionality as audio.vadwebrtc
- The package is currently not on CRAN
- For the development version of this package:
remotes::install_github("bnosac/audio.vadsilero")
Look to the documentation of the functions: help(package = "audio.vadsilero")
Get a audio file in 16 bit with mono PCM samples (pcm_s16le codec) with a sampling rate of either 8Khz, 16KHz or 32Khz
library(audio.vadsilero)
file <- system.file(package = "audio.vadsilero", "extdata", "test_wav.wav")
vad <- silero(file, milliseconds = 96)
vad
Voice Activity Detection
- file: D:/Jan/R/win-library/4.1/audio.vadsilero/extdata/test_wav.wav
- sample rate: 16000
- VAD type: silero, VAD mode: , VAD by milliseconds: 96, VAD frame_length: 1536
- Percent of audio containing a voiced signal: 76.6%
- Seconds voiced: 4.7
- Seconds unvoiced: 1.4
vad$vad_segments
vad_segment start end has_voice
1 0.000 0.672 FALSE
2 0.768 2.592 TRUE
3 2.688 2.688 FALSE
4 2.784 3.168 TRUE
5 3.264 3.648 FALSE
6 3.744 4.896 TRUE
7 4.992 4.992 FALSE
8 5.088 6.432 TRUE
9 6.528 6.912 FALSE
Example of a simple plot of these audio and voice segments
library(wav)
x <- read_wav(file)
plot(seq_along(x) / 16000, x, type = "l", xlab = "Seconds", ylab = "Signal")
lines(vad$vad$millisecond / 1000, vad$vad$probability * max(x), type = "l", col = "red", lwd = 2)
Need support in text mining? Contact BNOSAC: http://www.bnosac.be