Add more tokenizer tests #3742
I'm testing
for codepoint

Restricting
fixes

After applying your change to the test, starcoder passes too.

Interesting. I see
Hm.

Pay no attention to neox; it's not relevant here (it uses an old model that was failing on the map illegal access).

Should I also include your patch to restrict tokenizer tests to unicode planes in this PR?

Yes, please.
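The patch that restricts the tokenizer tests to Unicode planes is not reproduced in this thread, so the following is only a minimal C++ sketch of what such a restriction could look like. It assumes the intent is to skip the surrogate range U+D800 to U+DFFF, which has no valid UTF-8 encoding. All names in it (`codepoint_to_utf8`, `is_testable_codepoint`, `testable_inputs`) are hypothetical and not taken from llama.cpp.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-8.
std::string codepoint_to_utf8(uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Assumption: a code point is worth testing only if it is a valid
// Unicode scalar value, i.e. at most U+10FFFF and not a surrogate.
bool is_testable_codepoint(uint32_t cp) {
    if (cp > 0x10FFFF) return false;
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    return true;
}

// Collect the UTF-8 form of every testable code point in [first, last].
// Callers are expected to pass last <= 0x10FFFF.
std::vector<std::string> testable_inputs(uint32_t first, uint32_t last) {
    std::vector<std::string> inputs;
    for (uint32_t cp = first; cp <= last; ++cp) {
        if (is_testable_codepoint(cp)) {
            inputs.push_back(codepoint_to_utf8(cp));
        }
    }
    return inputs;
}
```

A round-trip test would then tokenize and detokenize each returned string and compare the result with the input.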
Nicely done!
@Galunid: sorry for mixing it up when using the GitHub online editor (will never try again ;). May I ask you to repair the mess? I don't know enough about git to fix it on your branch. Thanks again.
Force-pushed from efd3c22 to 1244b00.
@goerch Done, no worries ;)

OK, I'll wait until tomorrow morning (+10 hours) before merging.
* master: (350 commits)
  speculative : ensure draft and target model vocab matches (ggerganov#3812)
  llama : correctly report GGUFv3 format (ggerganov#3818)
  simple : fix batch handling (ggerganov#3803)
  cuda : improve text-generation and batched decoding performance (ggerganov#3776)
  server : do not release slot on image input (ggerganov#3798)
  batched-bench : print params at start
  log : disable pid in log filenames
  server : add parameter -tb N, --threads-batch N (ggerganov#3584) (ggerganov#3768)
  server : do not block system prompt update (ggerganov#3767)
  sync : ggml (conv ops + cuda MSVC fixes) (ggerganov#3765)
  cmake : add missed dependencies (ggerganov#3763)
  cuda : add batched cuBLAS GEMM for faster attention (ggerganov#3749)
  Add more tokenizer tests (ggerganov#3742)
  metal : handle ggml_scale for n%4 != 0 (close ggerganov#3754)
  Revert "make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)"
  issues : separate bug and enhancement template + no default title (ggerganov#3748)
  Update special token handling in conversion scripts for gpt2 derived tokenizers (ggerganov#3746)
  llama : remove token functions with `context` args in favor of `model` (ggerganov#3720)
  Fix baichuan convert script not detecing model (ggerganov#3739)
  make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)
  ...
Conversion scripts used 96981f3
Persimmon

Issues:
- Persimmon script doesn't allow for `--vocab-only`.
- GPT-Neox tokenizer fails with `std::unordered_map` illegal access seen in other gpt2 tokenizer based models. I applied the fix from Missing tokenizer tests #3730 (comment) and the test passed. I didn't include the passing version here, only the failing one; let me know which one you want, @goerch.
- Refact fails with `byte not found in vocab` (you can see in CI).
- Starcoder fails with `byte not found in vocab` (you can see in CI).

Models used:
closes #3730
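
The thread does not spell out the exact cause of the `std::unordered_map` illegal access mentioned in the issues above. As a general illustration only, the sketch below shows how a lookup of a missing key can fail and how a checked lookup avoids that. The `vocab_map` type and the `find_token` helper are hypothetical and do not mirror llama.cpp's actual vocabulary structures.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical vocabulary type: token text -> token id.
using vocab_map = std::unordered_map<std::string, int>;

// Checked lookup: returns false instead of throwing when the text
// is not in the vocabulary; id_out is only written on success.
bool find_token(const vocab_map & vocab, const std::string & text, int & id_out) {
    auto it = vocab.find(text);
    if (it == vocab.end()) {
        return false;
    }
    id_out = it->second;
    return true;
}

// Unchecked-style lookup for comparison: std::unordered_map::at
// throws std::out_of_range when the key is missing.
bool lookup_throws(const vocab_map & vocab, const std::string & text) {
    try {
        (void) vocab.at(text);
    } catch (const std::out_of_range &) {
        return true;
    }
    return false;
}
```

With a checked lookup, a test can report which input was missing from the vocabulary rather than aborting on the first absent key.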