why the "vocab_size" in config file is 50272 but the len(tokenizer) is 50265. #469
Labels: bug (Something isn't working)

Comments
The tokenizer as saved has length 50265, but then we add 4 special tokens, which gives a dictionary vocab size of 50269 at this point. This is followed by a pad_to_multiple(8), which rounds 50269 up to 50272.
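For reference, here is a minimal sketch of that arithmetic, assuming the padding step simply rounds the vocabulary up to the next multiple of 8 (the helper name is illustrative, not the repository's actual API):

def padded_vocab_size(base_len, num_special=4, multiple=8):
    # Hypothetical helper: add the 4 special tokens, then round up to a multiple of 8.
    size = base_len + num_special      # 50265 + 4 = 50269
    remainder = size % multiple
    if remainder:
        size += multiple - remainder   # 50269 -> 50272
    return size

print(padded_vocab_size(50265))  # 50272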
@suchenzang - Thank you for your answer! It seems that the 4 special tokens are already among the 50265 tokens.
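One way to check this, assuming a Hugging Face tokenizer loaded from an OPT checkpoint (the checkpoint name and the token strings below are assumptions for illustration, not taken from this repository):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")  # assumed checkpoint
for tok in ["<s>", "</s>", "<pad>", "<unk>"]:  # illustrative special-token strings
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    print(tok, tok_id, tok_id < 50265)  # True means the token already sits inside the 50265 base vocab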
I have the same question. Is it OK to use a RoBERTa tokenizer instead?
Same question. Will it cause an index error?
🐛 Bug
The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other.To Reproduce
Steps to reproduce the behavior (always include the command you ran):
Code sample
model.resize_token_embeddings(len(tokenizer))

Expected behavior
The results seem good when I use the code above to align the model to the tokenizer, but I just wonder why the vocab size for training is 50272. Did I miss some important parameter?
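For completeness, a sketch of that workaround end to end, assuming the Hugging Face transformers API and an OPT checkpoint (the checkpoint name is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

print(model.config.vocab_size, len(tokenizer))    # e.g. 50272 vs 50265
model.resize_token_embeddings(len(tokenizer))     # shrink the embedding matrix to match the tokenizer
print(model.get_input_embeddings().weight.shape)  # now (50265, hidden_size)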
Environment

How you installed it (pip, source):

Additional context