
Option to suppress tokens #1697

Closed
pprobst opened this issue Dec 28, 2023 · 10 comments

@pprobst (Contributor) commented Dec 28, 2023

Hello. As far as I know, whispercpp does not support the option to suppress particular tokens / numerals during inference, like WhisperX does. This is particularly useful, for example, if we want to transcribe numbers literally, e.g., "one" instead of "1".

Is there any interest in adding support for this feature?

@ggerganov (Owner) commented

Have you tried the grammar functionality (#1229)? It could be useful.

@pprobst (Contributor, author) commented Dec 29, 2023

No, I didn't know it was a thing; I'll look into it. If I understood correctly from a brief read, this is awesome: given a context-specific corpus, we can use it to "boost the weight" of certain words. There should be an example or a note in the README to draw more attention to this feature, because it's a game-changer 👀

@pprobst (Contributor, author) commented Dec 29, 2023

Oh! I just found exactly what I needed in the source code (the grammar functionality will still be handy for me, though, once I get it running for my own needs).

whisper.cpp/whisper.cpp, lines 4473 to 4478 at 2623640:

```cpp
static const std::vector<std::string> non_speech_tokens = {
    "\"", "#", "(", ")", "*", "+", "/", ":", ";", "<", "=", ">", "@", "[", "\\", "]", "^",
    "_", "`", "{", "|", "}", "~", "「", "」", "『", "』", "<<", ">>", "<<<", ">>>", "--",
    "---", "-(", "-[", "('", "(\"", "((", "))", "(((", ")))", "[[", "]]", "{{", "}}", "♪♪",
    "♪♪♪", "♩", "♪", "♫", "♬", "♭", "♮", "♯"
};
```

And

whisper.cpp/whisper.cpp, lines 4567 to 4575 at 2623640:

```cpp
if (params.suppress_non_speech_tokens) {
    for (const std::string & token : non_speech_tokens) {
        const std::string suppress_tokens[] = {token, " " + token};
        for (const std::string & suppress_token : suppress_tokens) {
            if (vocab.token_to_id.find(suppress_token) != vocab.token_to_id.end()) {
                logits[vocab.token_to_id.at(suppress_token)] = -INFINITY;
            }
        }
    }
}
```

Very handy. Just need to add the tokens I won't use. I'm closing this issue now.

@pprobst closed this as completed Dec 29, 2023
@flatsiedatsie commented

Would it be possible to load such lists from an external file instead of having to recompile?

@josharian (Contributor) commented

Seconded. I have the same use case (suppressing 0123456789%$£, as in whisperx). It'd be nice to have this available in vanilla whisper.cpp via the command line. If there's openness, I'm game to try my hand at a PR. The main question is what the flags should be. (Should it be a simple --suppress-digits, like whisperx? Or take a string, which we split into tokens? Etc.)

@josharian (Contributor) commented

Hmmm. It's more than what @pprobst found, because there are lots of all-numeric tokens, such as "500" (which is not just "5", "0", "0"). Grammars might indeed be a better choice here...

@pprobst (Contributor, author) commented Mar 4, 2024

> Hmmm. It's more than what @pprobst found, because there are lots of all-numeric tokens, such as "500" (which is not just "5", "0", "0"). Grammars might indeed be a better choice here...

Yeah, I noticed that afterward. I then took the list of token IDs that whisperx returned when I used the suppress_numerals option and hardcoded them into whisper.cpp. Ugly, but it worked for me. It would be cool if whisper.cpp had a similar option.

@josharian (Contributor) commented Mar 5, 2024

While I work on grammars, here's a quick patch folks can apply as desired:

```cpp
static const std::string numbery = "0123456789%$£";
for (int i = 0; i < vocab.token_beg; i++) {
    const std::string & token = vocab.id_to_token.at(i);
    if (token.find_first_of(numbery) != std::string::npos) {
        logits[i] = -INFINITY;
    }
}
```

(This goes just after the "suppress non-speech tokens" block.)

It appears to cost about 5% of runtime, which is not inconsiderable.

If we wanted proper support for a simple "omit these ASCII characters" option, without grammars, this could be made much cheaper by doing the find_first_of computation once, rather than at every decoding step.

@wttdotm commented Sep 27, 2024

@josharian I just want to say thank you for this simple and goated patch. I was having a nightmare of a time dealing with word-level timestamps for dollar amounts (e.g., "$45" being three different words but treated as one by whisper), and after hours and hours of banging my head against unhelpful searches and repos, this works perfectly.

@josharian (Contributor) commented

Thanks. :)

If you want to get rid of the performance penalty for that patch, you can try pulling in this still-pretty-small commit, which I just rebased onto master: josharian@c664398. (It worked months ago when I wrote it, but I haven't tested it since.)
