Integrating audio.whisper and audio.vadsilero #62

jmgirard · 2024-03-26T15:02:42Z

Does this look correct for what predict.whisper() is looking for in its sections argument?

convert_silero <- function(infile) {
  # Read in output from audio.vadsilero::silero()
  sdf <- readRDS(infile)
  # Extract segments information
  sections <- sdf$vad_segments
  # Drop non-voiced segments
  out <- sections[sections$has_voice == TRUE, ]
  # Convert to milliseconds
  out$start <- out$start * 1000
  out$end <- out$end * 1000
  # Calculate duration
  out$duration <- out$end - out$start
  # Drop segments with no duration
  out <- out[out$duration > 0, ]
  # Format output
  out <- out[, c("start", "duration")]

  out
}

jwijffels · 2024-03-26T20:46:05Z

Yes, something like that. I've made a generic function is.voiced, which is part of audio.vadwebrtc and also works on the output of silero. It lets you be a bit more liberal with small sections, e.g. treating non-voiced segments shorter than 1 second as voiced. We can be liberal in identifying voiced segments; we only need to remove the larger chunks of silence to avoid hallucinations on non-voiced audio.
See https://github.com/bnosac/audio.vadwebrtc/blob/0ca8192268b74f37a5b068775bad2f596cd339d5/R/vad.R#L133
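To make the idea of treating short silences as voiced concrete, here is a minimal base R sketch. It is not the audio.vadwebrtc implementation: merge_short_silences is a hypothetical helper, and it assumes a data frame with start and end in seconds plus a logical has_voice column, ordered in time and covering the audio without gaps.

```r
# Hypothetical helper, NOT the is.voiced implementation from audio.vadwebrtc:
# relabel short non-voiced gaps as voiced, then merge neighbouring segments
# that carry the same label.
merge_short_silences <- function(segs, silence_min = 1) {
  # segs: data.frame with start, end (seconds) and has_voice (logical),
  # assumed to be ordered in time and contiguous
  short_gap <- !segs$has_voice & (segs$end - segs$start) < silence_min
  segs$has_voice[short_gap] <- TRUE
  # Run-length encode the labels so consecutive equal labels become one row
  runs  <- rle(segs$has_voice)
  last  <- cumsum(runs$lengths)
  first <- last - runs$lengths + 1
  data.frame(
    start     = segs$start[first],
    end       = segs$end[last],
    has_voice = runs$values
  )
}

# Toy input: a 0.4 s pause between two voiced stretches, then a long silence
segs <- data.frame(
  start     = c(0.0, 2.0, 2.4, 5.0),
  end       = c(2.0, 2.4, 5.0, 9.0),
  has_voice = c(TRUE, FALSE, TRUE, FALSE)
)
merged <- merge_short_silences(segs, silence_min = 1)
merged
```

With this toy input the 0.4 second pause is absorbed, leaving one voiced segment followed by the long silence. For real use, is.voiced is the function to rely on.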

Examples are at audio.whisper/R/whisper.R, line 216 in cbf7c00:

#' ## Provide multiple offsets and durations to get the segments in there

These sections or offsets/durations can come from a VAD model or from the result of is.voiced.

I also tend to prefer the offset/duration arguments instead of the sections argument as it tends to be able to recover in a new section if a previous section had repetitions.

jwijffels · 2024-03-26T21:14:37Z

Probably the VAD can be used for better diarization as well if we do VAD by channel and see which section in the transcription corresponds to voiced elements as detected by the VAD
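As a rough illustration of that idea, the sketch below labels each transcript segment with the channel whose voiced time overlaps it most. Everything here is hypothetical: overlap_ms and assign_channel are made-up helpers, the transcript is assumed to have numeric from/to columns in milliseconds, and each channel's VAD output is assumed to be a data frame of voiced intervals with start/end in milliseconds. It is not how audio.whisper does it.

```r
# Hypothetical sketch: total overlap (ms) between one transcript segment
# [from, to] and all voiced intervals of one channel.
overlap_ms <- function(from, to, vad) {
  sum(pmax(0, pmin(to, vad$end) - pmax(from, vad$start)))
}

# Assign each transcript row to the channel with the largest voiced overlap.
assign_channel <- function(transcript, vad_by_channel) {
  # transcript: data.frame with numeric from, to (ms)  [assumed layout]
  # vad_by_channel: named list of data.frames with start, end (ms), voiced only
  scores <- sapply(vad_by_channel, function(vad) {
    mapply(overlap_ms, transcript$from, transcript$to,
           MoreArgs = list(vad = vad))
  })
  # Force a rows-by-channels matrix even when there is a single segment
  scores <- matrix(scores, nrow = nrow(transcript),
                   dimnames = list(NULL, names(vad_by_channel)))
  transcript$channel <- colnames(scores)[max.col(scores, ties.method = "first")]
  # Segments that overlap no voiced interval on any channel stay unlabelled
  transcript$channel[rowSums(scores) == 0] <- NA
  transcript
}

transcript <- data.frame(from = c(0, 3000), to = c(2500, 6000))
vad_by_channel <- list(
  left  = data.frame(start = 100,  end = 2400),
  right = data.frame(start = 3200, end = 5900)
)
labelled <- assign_channel(transcript, vad_by_channel)
labelled
```

Picking the channel by maximum overlap is only one possible rule; overlapping speech on both channels would need something more careful.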

jmgirard · 2024-03-26T22:45:03Z

Ok, good to know. I'll try offset and duration instead.

jmgirard · 2024-03-27T00:01:16Z

Hmm... offset and duration seem to be running a lot slower than without them. Has that been your experience too?

jwijffels · 2024-03-27T06:18:00Z

For this you need to understand that whisper runs in chunks of 30 seconds.

The behaviour is different for the two approaches.

The sections argument creates a new audio file from the voiced sections and transcribes that file.
The offset/duration arguments look at each offset/duration section, take the 30-second window it falls in, transcribe that window, and limit the output to the requested period. So the same audio may be transcribed several times if many voiced sections fall within the same 30-second window.
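A small sketch can show why many short sections make the offset/duration route slow. It simply counts how many voiced sections start inside each 30-second window. The toy vad data frame is made up, and aligning windows to multiples of 30 seconds is an assumption for illustration only, not a statement about how the windows are actually placed.

```r
# Illustration only: count voiced sections per 30-second window.
# Assumption: windows are aligned at multiples of 30 s from the file start.
window_ms <- 30000

# Toy VAD output in milliseconds (made-up values)
vad <- data.frame(
  start    = c(1000, 6000, 12000, 40000),
  duration = c(2000, 3000,  5000,  4000)
)

# Integer division gives the index of the window each section starts in
window_id <- vad$start %/% window_ms
passes_per_window <- table(window_id)
passes_per_window

# One pass per section versus the number of distinct windows touched
n_passes  <- nrow(vad)
n_windows <- length(passes_per_window)
```

Here three of the four sections start in the first window, so under this assumption that window would be processed three times. Merging nearby sections first reduces the number of passes.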

Feel free to provide feedback on how the transcription works on your audio.

jmgirard · 2024-03-27T12:13:56Z

Gotcha. In my use case, sections took about 20m to run one file whereas offset/duration took several hours. The output for sections looks good so far, but I'll do a more thorough check once more files are processed.

jwijffels · 2024-03-27T12:31:41Z

Would be good if you can test if the timepoints on the output when using sections are ok.

Regarding speed, that's normal. With sections you basically remove the non-voiced audio, so it will be faster than transcribing the full audio file. You probably fed the VAD output directly in there, but the VAD can produce many small chunks, so it makes sense to combine them a bit. The function is.voiced, which is part of audio.vadwebrtc (and works on the output of silero as well), combines them into somewhat larger chunks: https://github.com/bnosac/audio.vadwebrtc/blob/master/R/vad.R#L126-L184
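For checking timepoints when using sections, it can help to compute where a position in the concatenated voiced-only audio falls in the original file. The helper below is a hypothetical sketch for that sanity check, assuming sections holds start and duration in milliseconds, ordered in time and non-overlapping. It makes no claim about how audio.whisper maps timestamps internally.

```r
# Hypothetical helper for sanity checks: map a time t (ms) in the
# concatenated voiced-only audio back to the original timeline.
to_original_ms <- function(t, sections) {
  # Position at which each section begins in the concatenated audio
  offset_new <- c(0, cumsum(sections$duration))
  # Index of the section that contains t (the last section is closed on the right)
  i <- findInterval(t, offset_new, rightmost.closed = TRUE)
  # Clamp times outside the concatenated audio to the first/last section
  i <- pmin(pmax(i, 1), nrow(sections))
  sections$start[i] + (t - offset_new[i])
}

# Two voiced sections: 1.0-3.0 s and 10.0-13.0 s in the original audio
sections <- data.frame(start = c(1000, 10000), duration = c(2000, 3000))
mapped <- to_original_ms(c(500, 2500, 5000), sections)
mapped
```

Comparing such expected positions against the from/to values in the transcript for a few known utterances is one way to spot drift.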

jmgirard · 2024-03-27T13:50:04Z

Ok, is.voiced() is useful. Trying this now:

convert_silero <- function(vadfile, smin = 500, vmin = 500) {
  # Extract segments information
  sections <- audio.vadwebrtc::is.voiced(
    readRDS(vadfile), 
    units = "milliseconds", 
    silence_min = smin, 
    voiced_min = vmin
  )
  # Drop non-voiced segments
  out <- sections[sections$has_voice == TRUE, ]
  # Format output
  out <- out[, c("start", "duration")]
  return(out)
}

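# Note: `model` is not an argument here; it must already exist in the
# calling environment as a loaded whisper model.
transcribe_file <- function(infile, outfile, vadfile, approach = "offsets", ...) {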
  approach <- match.arg(approach, choices = c("sections", "offsets"), several.ok = FALSE)
  if (file.exists(outfile)) {
    return("skipped")
  }
  vad_segments <- convert_silero(vadfile, ...)
  switch (approach,
    sections = {
      transcript <- predict(
        model,
        infile,
        type = "transcribe",
        language = "en",
        n_threads = 1,
        n_processors = 1,
        sections = vad_segments,
        trace = FALSE
      )
    },
    offsets = {
      transcript <- predict(
        model,
        infile,
        type = "transcribe",
        language = "en",
        n_threads = 1,
        n_processors = 1,
        offset = vad_segments$start,
        duration = vad_segments$duration,
        trace = FALSE
      )
    }
  )
  saveRDS(transcript, file = outfile, compress = "gzip")
  return("created")
}

jwijffels · 2024-05-06T13:00:10Z

Quoting my earlier comment: "Probably the VAD can be used for better diarization as well if we do VAD by channel and see which section in the transcription corresponds to voiced elements as detected by the VAD"

Added predict.whisper_transcription for this in version 0.4.1.

jmgirard mentioned this issue Mar 29, 2024

Notes on repetitions #38

Open
