Prompt tokenization does not match openai/whisper #1098

Open
iceychris opened this issue Jul 12, 2023 · 1 comment · May be fixed by #1118 or #1768
Labels: bug (Something isn't working)

Comments

@iceychris (Contributor)

Hey there!

When passing a prompt via --prompt, the tokenized word-piece IDs do not seem to match openai/whisper.
This leads to the decoder producing garbage output, probably because it receives combinations of token IDs it has never seen before.

I think one way to resolve this would be to port the openai/tiktoken tokenizer encode implementation to whisper.cpp.

whisper.cpp

Tokenizer encode implementation (whisper.cpp/whisper.cpp, lines 2597 to 2623 at commit 4774d2f):

// find the longest tokens that form the words:
std::vector<whisper_vocab::id> tokens;
for (const auto & word : words) {
    if (word.empty()) continue;

    int i = 0;
    int n = word.size();
    while (i < n) {
        int j = n;
        bool found = false;
        while (j > i) {
            auto sub = word.substr(i, j-i);
            auto it = vocab.token_to_id.find(sub);
            if (it != vocab.token_to_id.end()) {
                tokens.push_back(it->second);
                i = j;
                found = true;
                break;
            }
            --j;
        }
        if (!found) {
            fprintf(stderr, "unknown token \n");
            ++i;
        }
    }
}
$ make && ./main -nt -nf -bs 1 --prompt " hallo" -l de -m models/ggml-tiny.bin samples/jfk.wav
...
whisper_full_with_state: prompt[0] = 50361 | [_PREV_]
whisper_full_with_state: prompt[1] = 6500 |  hall
whisper_full_with_state: prompt[2] = 78 | o
whisper_full_with_state: prompt[3] = 50258 | [_SOT_]
...
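
For reference, the greedy longest-prefix loop above can be reproduced in a few lines of Python. This is an illustrative sketch only: the toy vocab holds just the four entries visible in the logs in this issue, with their real IDs.

# Sketch: whisper.cpp-style greedy longest-prefix tokenization.
# The vocab is a hypothetical subset; the IDs are taken from the log output above.
vocab = {" hall": 6500, "o": 78, " ha": 324, "llo": 1913}

def greedy_encode(word):
    tokens = []
    i, n = 0, len(word)
    while i < n:
        for j in range(n, i, -1):   # try the longest substring first
            sub = word[i:j]
            if sub in vocab:
                tokens.append(vocab[sub])
                i = j
                break
        else:
            i += 1                  # unknown character: skip it
    return tokens

print(greedy_encode(" hallo"))  # [6500, 78], i.e. " hall" + "o"

At position 0 the longest matching vocab entry is " hall", so the greedy loop commits to it and only "o" remains, reproducing the prompt tokens in the log above.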

openai/whisper

from whisper.tokenizer import get_tokenizer

prompt = " hallo"
tokenizer = get_tokenizer(multilingual=True, language="de", task="transcribe")
ids = tokenizer.encode(prompt)
tokens = [tokenizer.decode([i]) for i in ids]
print(list(zip(ids, tokens)))
# output: [(324, ' ha'), (1913, 'llo')]
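
The mismatch is expected from the algorithms alone: tiktoken-style BPE builds tokens bottom-up by repeatedly applying the learned merge with the lowest rank, rather than scanning for the longest matching vocab entry at each position. The toy sketch of rank-ordered pair merging below shows how BPE can settle on " ha" + "llo" even though a longer entry like " hall" exists in the vocab. The merge table and ranks here are hypothetical; only the two resulting IDs come from the output above.

# Sketch: tiktoken-style BPE via rank-ordered pair merges.
# The merge ranks are made up for illustration; the real ranks come from the
# model's merge table as shipped with openai/tiktoken.
merges = {(" ", "h"): 0, ("l", "l"): 1, (" h", "a"): 2, ("ll", "o"): 3}
vocab = {" ha": 324, "llo": 1913}

def bpe_encode(word):
    parts = list(word)  # real BPE starts from bytes; characters suffice here
    while True:
        # pick the adjacent pair with the lowest merge rank, if any
        best = min(
            (i for i in range(len(parts) - 1) if (parts[i], parts[i + 1]) in merges),
            key=lambda i: merges[(parts[i], parts[i + 1])],
            default=None,
        )
        if best is None:
            break
        parts[best:best + 2] = [parts[best] + parts[best + 1]]
    return [vocab[p] for p in parts]

print(bpe_encode(" hallo"))  # [324, 1913], i.e. " ha" + "llo"

A faithful fix in whisper.cpp would therefore need the model's actual merge ranks rather than longest-match over the vocab, which is what porting the tiktoken encode implementation amounts to.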
iceychris added a commit to iceychris/whisper.cpp that referenced this issue Jul 18, 2023
iceychris linked a pull request Jul 18, 2023 that will close this issue
@emcodem commented Aug 2, 2023

Just keep in mind that tokenizer.decode drops all token IDs at or above the first timestamp token, so what you see might not be the full truth.
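
For example, with an openai-whisper install, decode_with_timestamps should reveal what plain decode filters out. The token IDs below reuse the ones from this issue; the two timestamp positions are arbitrary.

from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True, language="de", task="transcribe")

# " ha" + "llo" wrapped in two timestamp tokens (0.00 s and 0.50 s, chosen arbitrarily)
ids = [tokenizer.timestamp_begin, 324, 1913, tokenizer.timestamp_begin + 25]

print(tokenizer.decode(ids))                  # " hallo" (timestamp tokens dropped)
print(tokenizer.decode_with_timestamps(ids))  # "<|0.00|> hallo<|0.50|>"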

bobqianic added the bug label Oct 24, 2023
bobqianic linked a pull request Jan 17, 2024 that will close this issue