whisper : add grammar-based sampling #1229
Conversation
Oh boy, this is super cool! Thank you for doing it - can't wait to play with it.
Sadly I am not exactly sure how to reproduce, but after some commands were recognized and I said something like "Thank you" instead of an actual command present in the grammar, I sometimes ran into this crash:
I ran it with:
Thanks for reporting that - I believe I've seen this as well. Will look into it.
I managed to reproduce the exception - here is a stack trace:
lldb ./bin/command
(lldb) target create "./bin/command"
Current executable set to '/Users/ggerganov/development/github/whisper.cpp/build-rwdi/bin/command' (arm64).
(lldb) r -m ../models/ggml-base.en.bin -t 8 --grammar 'root ::= "Ok Whisper, start Listening for commands. " ("Red" | "Green" | "blue" | "Thank you") ' --grammar-penalty 1000.0
error: shell expansion failed (reason: lldb-argdumper exited with error 127). consider launching with 'process launch'.
(lldb) process l
Available completions:
launch -- Launch the executable in the debugger.
load -- Load a shared library into the current process.
(lldb) process launch
Process 6351 launched: '/Users/ggerganov/development/github/whisper.cpp/build-rwdi/bin/command' (arm64)
whisper_init_from_file_no_state: loading model from '../models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2
whisper_model_load: mem required = 310.00 MB (+ 6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.66 MB
whisper_model_load: model size = 140.54 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
main: processing, 8 threads, lang = en, task = transcribe, timestamps = 0 ...
2023-09-06 13:57:00.825138+0300 command[6351:79069] [plugin] AddInstanceForFactory: No factory registered for id <CFUUID 0x60000020c140> F8BB1C28-BAE8-11D6-9C31-00039315CD46
init: found 1 capture devices:
init: - Capture device #0: 'Georgi’s iPhone Microphone'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init: - sample rate: 16000
init: - format: 33056 (required: 33056)
init: - channels: 1 (required: 1)
init: - samples per frame: 1024
main: grammar:
root ::= [O] [k] [ ] [W] [h] [i] [s] [p] [e] [r] [,] [ ] [s] [t] [a] [r] [t] [ ] [L] [i] [s] [t] [e] [n] [i] [n] [g] [ ] [f] [o] [r] [ ] [c] [o] [m] [m] [a] [n] [d] [s] [.] [ ] root_1
root_1 ::= [R] [e] [d] | [G] [r] [e] [e] [n] | [b] [l] [u] [e] | [T] [h] [a] [n] [k] [ ] [y] [o] [u]
process_general_transcription: general-purpose mode
process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'
process_general_transcription: Speech detected! Processing ...
process_general_transcription: Heard 'Ok Whisper', (t = 362 ms)
process_general_transcription: WARNING: prompt not recognized, try again
process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'
process_general_transcription: Speech detected! Processing ...
process_general_transcription: Heard 'Ok Whisper, start Listening for commands', (t = 448 ms)
process_general_transcription: The prompt has been recognized!
process_general_transcription: Waiting for voice commands ...
process_general_transcription: Speech detected! Processing ...
libc++abi: terminating due to uncaught exception of type std::out_of_range: basic_string
Process 6351 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x0000000189984764 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`:
-> 0x189984764 <+8>: b.lo 0x189984784 ; <+40>
0x189984768 <+12>: pacibsp
0x18998476c <+16>: stp x29, x30, [sp, #-0x10]!
0x189984770 <+20>: mov x29, sp
Target 0: (command) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x0000000189984764 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x00000001899bbc28 libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x00000001898c9ae8 libsystem_c.dylib`abort + 180
frame #3: 0x0000000189974b84 libc++abi.dylib`abort_message + 132
frame #4: 0x00000001899643b4 libc++abi.dylib`demangling_terminate_handler() + 320
frame #5: 0x000000018963b03c libobjc.A.dylib`_objc_terminate() + 160
frame #6: 0x0000000189973f48 libc++abi.dylib`std::__terminate(void (*)()) + 16
frame #7: 0x0000000189976d34 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
frame #8: 0x0000000189976ce0 libc++abi.dylib`__cxa_throw + 140
frame #9: 0x00000001898ef71c libc++.1.dylib`std::__1::__throw_out_of_range[abi:v15006](char const*) + 72
frame #10: 0x00000001898eb680 libc++.1.dylib`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__throw_out_of_range[abi:v15006]() const + 24
frame #11: 0x00000001898ec79c libc++.1.dylib`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::basic_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, unsigned long, std::__1::allocator<char> const&) + 208
frame #12: 0x0000000100008af0 command`process_general_transcription(whisper_context*, audio_async&, whisper_params const&) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::substr[abi:v15006](this="Ok W", __pos=32, __n=18446744073709551615) const at string:3573:12 [opt]
frame #13: 0x0000000100008ad8 command`process_general_transcription(ctx=0x00000001003046a0, audio=0x000000016fdfede8, params=0x000000016fdfed28) at command.cpp:603:60 [opt]
frame #14: 0x0000000100009654 command`main(argc=<unavailable>, argv=<unavailable>) at command.cpp:688:23 [opt]
frame #15: 0x0000000189663f28 dyld`start + 2236
(lldb)
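For what it's worth, the trace points at the substr call in command.cpp (frame #12: __pos=32 on the 4-character string "Ok W"), i.e. the heard text is shorter than the offset being stripped. A minimal sketch of the kind of guard that avoids the throw - the function and variable names here are made up, not the actual fix from this PR:

```cpp
#include <iostream>
#include <string>

// Hypothetical guard (not the actual fix from this PR): only cut off the
// activation prompt when the transcription is at least as long as it,
// so substr() can never be called with an out-of-range position.
static std::string command_after_prompt(const std::string & heard, size_t prompt_len) {
    if (heard.size() < prompt_len) {
        return ""; // heard less than the prompt itself -> no command
    }
    return heard.substr(prompt_len); // prompt_len <= heard.size(): no throw
}

int main() {
    // The crashing case from the trace: only "Ok W" was heard,
    // but the code tried to cut at offset 32.
    std::cout << "[" << command_after_prompt("Ok W", 32) << "]\n"; // prints []
}
```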
whisper.cpp (outdated review comment)
    for (const auto & reject : rejects) {
        if (logits[reject.id] > 0) {
            logits[reject.id] /= params.grammar_penalty;
        } else {
            logits[reject.id] *= params.grammar_penalty;
        }
    }
I'm currently experimenting with the following penalty and I think it works better:
    -    for (const auto & reject : rejects) {
    -        if (logits[reject.id] > 0) {
    -            logits[reject.id] /= params.grammar_penalty;
    -        } else {
    -            logits[reject.id] *= params.grammar_penalty;
    -        }
    -    }
    +    for (const auto & reject : rejects) {
    +        logits[reject.id] -= params.grammar_penalty;
    +    }
Not sure where this asymmetric scaling came from in the LLM world, but I think it's wrong.
Here is some more discussion on this topic: ggerganov/llama.cpp#2970
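To make the difference concrete, here is a small standalone sketch (the logit values are made up, not taken from whisper.cpp): subtracting a constant from a logit scales that token's post-softmax probability by roughly the same factor no matter what the logit was, while the divide-if-positive / multiply-if-negative variant barely suppresses a rejected token whose logit happens to be small and positive.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax over raw logits (numerically stabilized).
static std::vector<double> softmax(std::vector<double> v) {
    double mx = *std::max_element(v.begin(), v.end());
    double sum = 0.0;
    for (double & x : v) { x = std::exp(x - mx); sum += x; }
    for (double & x : v) { x /= sum; }
    return v;
}

int main() {
    const double penalty = 100.0; // illustrative value
    // Three candidate tokens; token 0 is the one the grammar rejects.
    for (double rejected_logit : { 2.0, -2.0 }) {
        std::vector<double> base = { rejected_logit, 1.0, 0.5 };

        std::vector<double> asym = base; // divide if > 0, multiply if < 0
        asym[0] = asym[0] > 0 ? asym[0] / penalty : asym[0] * penalty;

        std::vector<double> sub = base;  // subtract a constant in log space
        sub[0] -= penalty;

        std::printf("rejected logit %+.1f: p(rejected) base=%.3f asym=%.3f sub=%.3g\n",
                    rejected_logit, softmax(base)[0], softmax(asym)[0], softmax(sub)[0]);
    }
}
```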
Sounds good! Honestly I don't have a great understanding of the statistics to know what penalization function makes sense.
I'm still playing with this and so far have really good impressions. The API is perfect. AFAICT this approach works on the letter level and not on the token level:
Let's say, for example, that at the current moment the grammar allows a certain letter, given what has been decoded so far: 'Ok Whis'. Which tokens are we going to penalize? Is it going to penalize ...?
Edit: nvm, you actually did it the best way :)
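Roughly what "letter level" means here, as a toy sketch with a purely literal rule (this is not the actual whisper.cpp grammar engine; token_matches and the candidate list are made up): after decoding "Ok Whis", each candidate token's text is checked character by character against what the grammar still expects, so both a multi-character token like "per" and the single character "p" stay allowed, while anything that doesn't continue the expected characters gets penalized.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Toy check: with a purely literal grammar rule, a candidate token is
// acceptable if its text is a prefix of the characters the grammar still
// expects after what has been decoded so far.
static bool token_matches(const std::string & expected_rest, const std::string & token_text) {
    return expected_rest.compare(0, token_text.size(), token_text) == 0;
}

int main() {
    // Decoded so far: "Ok Whis"; the rule is the literal
    // "Ok Whisper, start Listening for commands. ", so the grammar still expects:
    const std::string expected_rest = "per, start Listening for commands. ";

    const std::vector<std::string> candidates = { "per", "p", "per,", " per", "tle" };
    for (const std::string & tok : candidates) {
        std::cout << "token '" << tok << "' -> "
                  << (token_matches(expected_rest, tok) ? "allowed" : "penalized") << "\n";
    }
}
```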
I just realized that even though Whisper is designed for audio transcription, it's fundamentally built on a transformer architecture. This makes prompts an incredibly useful tool; for instance, they can guide the model in correctly spelling specific nouns. So my question is: under what circumstances would grammar-based sampling be more effective than using prompts?
AFAIK, applying grammar constraints to the Whisper decoder is a new area yet to be studied.
This weekend I'll be looking into this and hopefully merging it. Thinking about whether we should just merge the grammar parser straight into …
No, really - I was just hesitant to add all that extra code to …
I'm not too sure about it either. I haven't really looked into grammar-based sampling. We can talk about it after it's merged :)
One approach is to move the grammar stuff (both impl + parsing) into … I will now try to merge the parsing into …
- option to read grammar from file
- add sample grammars for colors and chess moves
- fine-tune the performance further
I'm looking for a way to just slightly nudge Whisper towards these tokens; that way I can continue using it as a general-purpose transcription tool while simultaneously using it as a voice assistant. So far my major blocker to using this seems to be, as mentioned in ejones#1, the false-positive tokens. For my use case, a good workaround would be some way to let Whisper abandon the grammar sampling earlier, perhaps through a configuration option on whisper_full_params.
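Purely as an illustration of that idea, a hypothetical shape for such an escape hatch - none of these fields exist in whisper_full_params today, and the names are made up:

```cpp
// Hypothetical knobs (NOT existing whisper.cpp API) for relaxing or
// abandoning the grammar constraint during decoding.
struct grammar_params_sketch {
    const char * grammar_rules   = nullptr; // grammar to apply, if any
    float        grammar_penalty = 100.0f;  // how hard non-matching tokens are pushed down
    float        abandon_gap     = 5.0f;    // if the best grammar-conforming token falls this
                                            // far (in logits) below the best unconstrained
                                            // token, stop applying the grammar for the segment
};
```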
whisper : fine-tuning grammar functionality
@ggerganov I tested this branch with your chess and assistant cases from ejones#1. I had a similar experience as you - tiny fairly consistently matches the grammar, and invalid commands tend to produce an empty string (or …).
I didn't test this configuration - will do so. My guess is grammar will definitely help, especially in situations where certain things sound similar. I imagine a use case where the grammar describes only the legal moves on the chess board at a given moment. In that case, it will help to disambiguate moves that sound similar but could be invalid (e.g. …).
Ah, good point.
Hi, what's the state on this? Would try helping out to get this in...
I'm not entirely certain either. Over the past two to three weeks, there has been a fascinating discussion regarding the detection of wake words in #1232. @isaac-mcfadyen contributed a truly intriguing perspective.
Sorry for the delays - I've been travelling recently and now I'm catching up with lots of things. This PR is one of the top priorities. Hoping to find the time this week.
* whisper : add grammar-based sampling
* build : fix after master merge
* command : fix exception when recognizing the command
* whisper : fine-tuning grammar functionality
* command : grammar-related improvements
  - option to read grammar from file
  - add sample grammars for colors and chess moves
  - fine-tune the performance further
* grammars : add assistant + update comments
* command : enable beam-search, add "no_timestamps", add "context", add p
* whisper : remove comment
---------
Co-authored-by: Georgi Gerganov <[email protected]>
Ports grammar-based sampling from llama.cpp. Most of the code is simply copied over with s/llama/whisper/. Unlike llama.cpp, where sampling functions are part of the API, the grammar functionality here is wrapped up in whisper_full (the grammar state is attached to each whisper_decoder). More notably, the approach is more forgiving here: tokens not matching the grammar are scaled down rather than masked out entirely (grammar_penalty), special tokens are ignored, and parse failures simply (in theory) revert to unconstrained sampling.
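A rough sketch of that adjustment (simplified and not the actual whisper.cpp code; apply_grammar_penalty, grammar_reject and first_special_id are made-up names, and treating ids above a threshold as special tokens is an assumption): rejected tokens get a soft penalty instead of being masked to -inf, and special tokens are skipped.

```cpp
#include <vector>

// Simplified sketch of the soft grammar penalty described above; not the
// real whisper.cpp implementation. Token ids >= first_special_id stand in
// for whatever check the library uses to recognize special tokens.
struct grammar_reject { int token_id; };

static void apply_grammar_penalty(std::vector<float> & logits,
                                  const std::vector<grammar_reject> & rejects,
                                  float grammar_penalty,
                                  int first_special_id) {
    for (const auto & r : rejects) {
        if (r.token_id >= first_special_id) {
            continue; // special tokens are ignored by the grammar
        }
        // Soft penalty (the exact form - scale vs. subtract - is discussed
        // in the review comments above), rather than masking with -INFINITY.
        logits[r.token_id] -= grammar_penalty;
    }
}
```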
To demonstrate the functionality, I've added grammars to command. Probably needs more testing and refining but early results look promising! This demo shows constrained sampling on the left vs unconstrained on the right:
whisper-chess.mp4
Edit by @ggerganov:
More examples: