-
Before fine-tuning it might be useful to experiment using …
-
I’m curious about other folks’ thoughts here too. I haven’t had great luck in longer files with the initial prompt. I was thinking of adding a post-processing step that runs the outputs through spaCy or NLTK to extract named entities, then uses regex to replace the misrecognized ones.
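Not anyone's actual pipeline, just a minimal sketch of what that post-processing step could look like with spaCy. It assumes an English model is installed (`python -m spacy download en_core_web_sm`) and that `known_names` is a hand-maintained mapping from spellings Whisper tends to produce to the correct domain terms; the names in the mapping are made up, and whether spaCy tags a given misspelling as an entity depends on the model.

```python
import re
import spacy

# Hypothetical mapping from spellings Whisper tends to produce
# to the correct domain-specific terms.
known_names = {
    "fonder lion": "von der Leyen",
    "acme corp": "AcmeCorp",
}

nlp = spacy.load("en_core_web_sm")

def fix_entities(transcript: str) -> str:
    """Replace misrecognized named entities with known correct spellings."""
    doc = nlp(transcript)
    fixed = transcript
    for ent in doc.ents:
        # Only touch person/org entities; leave the rest of the text alone.
        if ent.label_ in {"PERSON", "ORG"}:
            replacement = known_names.get(ent.text.lower())
            if replacement:
                # Case-insensitive, whole-entity replacement.
                fixed = re.sub(re.escape(ent.text), replacement, fixed,
                               flags=re.IGNORECASE)
    return fixed

print(fix_entities("Today Fonder Lion met with Acme Corp."))
```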
-
Thanks a lot!
-
Completely anecdotal: in a non-English transcript today (large model, beam search) the surname "von der Leyen" was transcribed as "FonderLion". Running it again with …
-
+1. I agree that having a list of custom vocab would be incredibly useful, likely for most applications. Use cases for ASR are often domain-specific. In my case, I'm using it for dictation, and being able to specify the names of people I work with and team lingo would be super helpful.
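There isn't a dedicated custom-vocab list in Whisper today, but the `initial_prompt` argument to `transcribe()` in the openai/whisper package can be used to nudge decoding toward specific spellings. A minimal sketch under that assumption; the vocab entries, file path, and prompt wording below are placeholders:

```python
import whisper

# Hypothetical domain vocab: names of people and team lingo we want
# Whisper to spell correctly. Seeding the decoder with them via
# initial_prompt biases it toward these spellings (no guarantee).
custom_vocab = ["von der Leyen", "AcmeCorp", "kubeflow", "Priyanka Nair"]

model = whisper.load_model("small")
result = model.transcribe(
    "dictation.wav",  # placeholder audio file
    initial_prompt="Glossary: " + ", ".join(custom_vocab),
)
print(result["text"])
```

One caveat, echoing the comment above about longer files: the initial prompt directly conditions only the first audio window, so its effect tends to fade over long recordings.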
-
I am interested in domain-specific fine-tuning.
In my audio, there is a certain number of brand names, person names, and domain-specific jargon. Whisper generally transcribes the text fine but sometimes gets the specific vocab wrong.
Is this the right approach to fix this:
1. Take my audio samples for which I have GT (ground-truth) transcriptions.
2. Run Whisper on them and get the generated text.
3. Compare the generated text with the GT transcripts and gather those that mismatch.
4. Fine-tune Whisper ONLY on the audio samples and GT transcripts for which a mismatch was found.
So in short, fine-tune only on the corrected errors, not on the entire corpus? I imagine that would reduce the fine-tuning time.
Is there any caveat to my approach?
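No view either way on the caveats, but here is a rough sketch of the selection step being described: transcribe each sample, normalize both sides, and keep only the (audio, ground-truth) pairs where the output disagrees, e.g. above some error threshold. The file layout, normalization, and threshold are assumptions, and `jiwer` is just one common way to compute WER.

```python
import whisper
import jiwer

model = whisper.load_model("small")

# Hypothetical dataset: list of (audio_path, ground_truth_text) pairs.
dataset = [
    ("clips/0001.wav", "Order placed with AcmeCorp for von der Leyen."),
    ("clips/0002.wav", "Schedule the kubeflow pipeline for Monday."),
]

# Normalize casing, punctuation, and whitespace before comparing.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

mismatched = []
for audio_path, gt_text in dataset:
    hyp = model.transcribe(audio_path)["text"]
    # Keep only samples where Whisper's output disagrees with the
    # ground truth after normalization (threshold is arbitrary here).
    wer = jiwer.wer(normalize(gt_text), normalize(hyp))
    if wer > 0.0:
        mismatched.append((audio_path, gt_text))

print(f"{len(mismatched)} of {len(dataset)} samples selected for fine-tuning")
```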