Whisper model returns incorrect transcription for Japanese speech and is slow to return results #2377

dsnsabari · 2024-10-07T01:13:55Z

Issue Description:

I am using the Whisper model to recognize Japanese speech. Most of the time, however, it returns the transcription "ご視聴ありがとうございました" ("Thank you for watching"), which is incorrect for the input speech I am testing. The model also takes a considerable amount of time to return the transcription.

Steps to Reproduce:

Use the Whisper model to transcribe Japanese speech.
Observe the returned transcription and the time taken to generate it.

Expected Behavior:

The model should return an accurate transcription of the Japanese speech input.
The time taken for transcription should be reduced.

Audio files: 11543.zip

Actual Behavior:

The returned transcription is often "ご視聴ありがとうございました" regardless of the input.
The model takes too long to return the results.

Code Snippet:

import torch
import whisper

# Use the GPU when one is available; otherwise fall back to the CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

audio_file_path = "/home/ubuntu/sabari/7.mp3"

# Load the model onto the selected device.
model1 = whisper.load_model("large-v3", device=DEVICE)

# Perform transcription
result = model1.transcribe(audio_file_path, language="ja")

# Print the transcription
print(result["text"])

Environment:

Whisper model version: '20240930'
Python version: 3.9.12
Hardware specifications: NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6

glangford · 2024-10-07T12:30:14Z

See this past discussion for some other Whisper options to try:

Best prompt to transcribe Japanese? #2151

@shonokin any other suggestions based on your testing?

2 replies

shonokin Oct 7, 2024

Sorry, not really.
I'd started using "faster whisper" with model -v2 or -v3 in the program SubtitleEdit. For some reason that returns the best results for me, but it still takes a lot of post-processing adjustments. I haven't really experimented further in the past 6 months.

misutoneko Oct 7, 2024

So any idea how faster whisper manages to use large-v3? Do they have a better version of the model, or is it relying on VAD or something?
EDIT: I see Silero-VAD mentioned... I wonder if the solution relies wholly on that, or if there are other tricks too.
(What would happen if the VAD is disabled?)
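
One way to check the VAD question empirically is to run the same file through faster-whisper twice, once with `vad_filter=True` and once with `vad_filter=False`, and compare the text. The sketch below is only an illustration, not something tested in this thread: the helper names `load_faster_whisper` and `compare_vad` are made up here, and the `device` and `compute_type` values are assumptions about the machine.

```python
def load_faster_whisper(model_size="large-v3"):
    """Load a faster-whisper model (assumes a CUDA GPU with float16 support)."""
    # Imported lazily so compare_vad() can be used without the package installed.
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, device="cuda", compute_type="float16")


def compare_vad(model, audio_path, language="ja"):
    """Transcribe audio_path with and without the VAD filter.

    Returns a dict mapping the vad_filter setting (True/False) to the
    concatenated segment text, so the two runs can be compared directly.
    """
    texts = {}
    for use_vad in (True, False):
        # transcribe() returns a lazy iterator of segments plus an info object.
        segments, _info = model.transcribe(
            audio_path, language=language, vad_filter=use_vad
        )
        texts[use_vad] = "".join(segment.text for segment in segments)
    return texts
```

For example, `compare_vad(load_faster_whisper(), "/home/ubuntu/sabari/7.mp3")` would return both transcriptions side by side. If the "ご視聴ありがとうございました" output only shows up in the `False` entry, that would suggest the VAD step is doing most of the work.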

misutoneko · 2024-10-07T12:36:38Z

Hi, thank you for the samples.
I don't have any solution for large-v3 or turbo, but the small model works with these command line switches:
--language ja --model small --suppress_tokens 50364

I guess you can do the same in Python by manipulating suppress_tokens[].

EDIT: Note that --suppress_tokens "" does not work in this case, for some reason. Could depend on the model though.
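
A rough Python equivalent of the command line switches above might look like the following. It is a minimal sketch, assuming that `transcribe()` forwards `suppress_tokens` to the decoder as a decoding option; the helper name `transcribe_suppressing` is hypothetical, and only the small model with token 50364 was reported to work here.

```python
def transcribe_suppressing(model, audio_path, token_ids=(50364,), language="ja"):
    """Transcribe audio_path while suppressing the given token ids.

    `model` is expected to be a loaded Whisper model, e.g. the result of
    whisper.load_model("small"). The default token id is the one from the
    --suppress_tokens switch above.
    """
    result = model.transcribe(
        audio_path,
        language=language,
        # Passed as an explicit list of ints; this replaces the default
        # suppression setting rather than adding to it.
        suppress_tokens=list(token_ids),
    )
    return result["text"]
```

Usage would be along the lines of `transcribe_suppressing(whisper.load_model("small"), "/home/ubuntu/sabari/7.mp3")`. Whether the same token id helps with large-v3 or turbo is not established in this thread.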

0 replies
