
Option to suppress tokens #1697

Closed
pprobst opened this issue Dec 28, 2023 · 10 comments

@pprobst (Contributor) commented Dec 28, 2023

Hello. As far as I know, whispercpp does not support the option to suppress particular tokens / numerals during inference, like WhisperX does. This is particularly useful, for example, if we want to transcribe numbers literally, e.g., "one" instead of "1".

Is there any interest in adding support for this feature?

@ggerganov (Owner) commented

Have you tried the grammar functionality (#1229)? It could be useful.

@pprobst (Contributor, author) commented Dec 29, 2023

No, I didn't know it was a thing; I'll look into it. If I understood correctly from a brief read, this is awesome: given a context-specific corpus, we can use it to "boost the weight" of certain words. There should be an example or a note in the README to draw more attention to this feature, because it's a game-changer 👀

@pprobst (Contributor, author) commented Dec 29, 2023

Oh! I just found exactly what I needed in the source code (the grammar functionality will still be handy for me, though, once I get it running for my own needs).

whisper.cpp/whisper.cpp, lines 4473 to 4478 at 2623640:

```cpp
static const std::vector<std::string> non_speech_tokens = {
    "\"", "#", "(", ")", "*", "+", "/", ":", ";", "<", "=", ">", "@", "[", "\\", "]", "^",
    "_", "`", "{", "|", "}", "~", "「", "」", "『", "』", "<<", ">>", "<<<", ">>>", "--",
    "---", "-(", "-[", "('", "(\"", "((", "))", "(((", ")))", "[[", "]]", "{{", "}}", "♪♪",
    "♪♪♪", "♩", "♪", "♫", "♬", "♭", "♮", "♯"
};
```

And

whisper.cpp/whisper.cpp, lines 4567 to 4575 at 2623640:

```cpp
if (params.suppress_non_speech_tokens) {
    for (const std::string & token : non_speech_tokens) {
        const std::string suppress_tokens[] = {token, " " + token};
        for (const std::string & suppress_token : suppress_tokens) {
            if (vocab.token_to_id.find(suppress_token) != vocab.token_to_id.end()) {
                logits[vocab.token_to_id.at(suppress_token)] = -INFINITY;
            }
        }
    }
}
```

Very handy. Just need to add the tokens I won't use. I'm closing this issue now.

@pprobst closed this as completed Dec 29, 2023
@flatsiedatsie commented

Would it be possible to load such lists from an external file instead of having to recompile?

@josharian (Contributor) commented

Seconded. I have the same use case (suppressing 0123456789%$£, as in whisperx). It'd be nice to have this available in vanilla whisper.cpp via the command line. If there's openness, I'm game to try my hand at a PR. The main question is what the flags should be. (Should it be a simple --suppress-digits, like whisperx? Or take a string, which we split into tokens? Etc.)

@josharian (Contributor) commented

Hmmm. It's more than what @pprobst found, because there are lots of all-numeric tokens, such as "500" (which is not just "5", "0", "0"). Grammars might indeed be a better choice here...

@pprobst (Contributor, author) commented Mar 4, 2024

> Hmmm. It's more than what @pprobst found, because there are lots of all-numeric tokens, such as "500" (which is not just "5", "0", "0"). Grammars might indeed be a better choice here...

Yeah, I noticed that afterward. I then took the list of token IDs that whisperx returned when I used the suppress_numerals option and hardcoded them into whisper.cpp. Ugly, but it worked for me. It would be cool if whisper.cpp had a similar option.

@josharian (Contributor) commented Mar 5, 2024

While I work on grammars, here's a quick patch folks can apply as desired:

```cpp
static const std::string numbery = "0123456789%$£";
for (int i = 0; i < vocab.token_beg; i++) {
    const std::string & token = vocab.id_to_token.at(i);
    if (token.find_first_of(numbery) != std::string::npos) {
        logits[i] = -INFINITY;
    }
}
```

(This goes just after the "suppress non-speech tokens" block.)

It appears to cost about 5% of runtime, which is not inconsiderable.

If we wanted proper support for a simple "omit these ASCII characters" option, without grammars, this could be made much cheaper by doing the find_first_of computation once, rather than at every decoding step.

@wttdotm commented Sep 27, 2024

@josharian I just want to say thank you for this simple and goated patch. I was having a nightmare of a time dealing with word-level timestamps for dollar amounts (e.g., "$45" being three different words but treated as one by whisper), and after hours and hours of banging my head against unhelpful searches and repos, this works perfectly.

@josharian (Contributor) commented

Thanks. :)

If you want to get rid of the performance penalty for that patch, you can try pulling in this still-pretty-small commit, which I just rebased onto master: josharian@c664398. (It worked months ago when I wrote it, but I haven't tested it since.)
