Prompt tokenization does not match openai/whisper #1098

Open
iceychris opened this issue Jul 12, 2023 · 1 comment · May be fixed by #1118 or #1768
Labels: bug (Something isn't working)

Comments

@iceychris (Contributor)

Hey there!

When passing a prompt via --prompt, the tokenized word-piece IDs do not seem to match openai/whisper.
This leads to the decoder producing garbage output, probably because it receives combinations of token IDs it has never seen before.

I think one way to resolve this would be to port the openai/tiktoken tokenizer encode implementation to whisper.cpp.

whisper.cpp

Tokenizer encode implementation (whisper.cpp/whisper.cpp, lines 2597 to 2623 at commit 4774d2f):

// find the longest tokens that form the words:
std::vector<whisper_vocab::id> tokens;
for (const auto & word : words) {
    if (word.empty()) continue;

    int i = 0;
    int n = word.size();
    while (i < n) {
        int j = n;
        bool found = false;
        while (j > i) {
            auto sub = word.substr(i, j-i);
            auto it = vocab.token_to_id.find(sub);
            if (it != vocab.token_to_id.end()) {
                tokens.push_back(it->second);
                i = j;
                found = true;
                break;
            }
            --j;
        }
        if (!found) {
            fprintf(stderr, "unknown token \n");
            ++i;
        }
    }
}
$ make && ./main -nt -nf -bs 1 --prompt " hallo" -l de -m models/ggml-tiny.bin samples/jfk.wav
...
whisper_full_with_state: prompt[0] = 50361 | [_PREV_]
whisper_full_with_state: prompt[1] = 6500 |  hall
whisper_full_with_state: prompt[2] = 78 | o
whisper_full_with_state: prompt[3] = 50258 | [_SOT_]
...
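
For reference, the greedy longest-prefix loop above can be reproduced in a few lines of Python. This is an illustrative sketch only: the toy vocab holds just the four entries visible in the logs in this issue, with their real IDs.

# Sketch: whisper.cpp-style greedy longest-prefix tokenization.
# The vocab is a hypothetical subset; the IDs are taken from the log output above.
vocab = {" hall": 6500, "o": 78, " ha": 324, "llo": 1913}

def greedy_encode(word):
    tokens = []
    i, n = 0, len(word)
    while i < n:
        for j in range(n, i, -1):   # try the longest substring first
            sub = word[i:j]
            if sub in vocab:
                tokens.append(vocab[sub])
                i = j
                break
        else:
            i += 1                  # unknown character: skip it
    return tokens

print(greedy_encode(" hallo"))  # [6500, 78], i.e. " hall" + "o"

At position 0 the longest matching vocab entry is " hall", so the greedy loop commits to it and only "o" remains, reproducing the prompt tokens in the log above.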

openai/whisper

from whisper.tokenizer import get_tokenizer

prompt = " hallo"
tokenizer = get_tokenizer(multilingual=True, language="de", task="transcribe")
ids = tokenizer.encode(prompt)
tokens = [tokenizer.decode([i]) for i in ids]
print(list(zip(ids, tokens)))
# output: [(324, ' ha'), (1913, 'llo')]
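
The mismatch is expected from the algorithms alone: tiktoken-style BPE builds tokens bottom-up by repeatedly applying the learned merge with the lowest rank, rather than scanning for the longest matching vocab entry at each position. The toy sketch of rank-ordered pair merging below shows how BPE can settle on " ha" + "llo" even though a longer entry like " hall" exists in the vocab. The merge table and ranks here are hypothetical; only the two resulting IDs come from the output above.

# Sketch: tiktoken-style BPE via rank-ordered pair merges.
# The merge ranks are made up for illustration; the real ranks come from the
# model's merge table as shipped with openai/tiktoken.
merges = {(" ", "h"): 0, ("l", "l"): 1, (" h", "a"): 2, ("ll", "o"): 3}
vocab = {" ha": 324, "llo": 1913}

def bpe_encode(word):
    parts = list(word)  # real BPE starts from bytes; characters suffice here
    while True:
        # pick the adjacent pair with the lowest merge rank, if any
        best = min(
            (i for i in range(len(parts) - 1) if (parts[i], parts[i + 1]) in merges),
            key=lambda i: merges[(parts[i], parts[i + 1])],
            default=None,
        )
        if best is None:
            break
        parts[best:best + 2] = [parts[best] + parts[best + 1]]
    return [vocab[p] for p in parts]

print(bpe_encode(" hallo"))  # [324, 1913], i.e. " ha" + "llo"

A faithful fix in whisper.cpp would therefore need the model's actual merge ranks rather than longest-match over the vocab, which is what porting the tiktoken encode implementation amounts to.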
iceychris added a commit to iceychris/whisper.cpp that referenced this issue Jul 18, 2023
iceychris linked a pull request Jul 18, 2023 that will close this issue
@emcodem commented Aug 2, 2023

Just keep in mind that tokenizer.decode drops all token IDs at or above the first timestamp token, so what you see might not be the full truth.
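
For example, with an openai-whisper install, decode_with_timestamps should reveal what plain decode filters out. The token IDs below reuse the ones from this issue; the two timestamp positions are arbitrary.

from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True, language="de", task="transcribe")

# " ha" + "llo" wrapped in two timestamp tokens (0.00 s and 0.50 s, chosen arbitrarily)
ids = [tokenizer.timestamp_begin, 324, 1913, tokenizer.timestamp_begin + 25]

print(tokenizer.decode(ids))                  # " hallo" (timestamp tokens dropped)
print(tokenizer.decode_with_timestamps(ids))  # "<|0.00|> hallo<|0.50|>"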

bobqianic added the bug label Oct 24, 2023
bobqianic linked a pull request Jan 17, 2024 that will close this issue