
why the "vocab_size" in config file is 50272 but the len(tokenizer) is 50265. #469

Open
Zcchill opened this issue Oct 30, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@Zcchill

Zcchill commented Oct 30, 2022

🐛 Bug

The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run cmd '....'
  2. See error
None

Code sample

model.resize_token_embeddings(len(tokenizer))

Expected behavior

The results look fine when I use the code above to align the embedding size with the tokenizer, but I wonder why the vocab size used for training is 50272. Did I miss an important parameter?

Environment

  • metaseq Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux, Windows, MacOS):
  • How you installed metaseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@Zcchill Zcchill added the bug label on Oct 30, 2022
@suchenzang
Contributor

The saved tokenizer has length 50265, but then we add 4 special tokens:

for id in range(self.dictionary.nspecial, tok_vocab_size):

which gives us a dictionary vocab size of 50269 at this point. This is followed by a pad_to_multiple_(8):

self.dictionary.pad_to_multiple_(8)

which is why the vocab size ends up being 50272.
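
In other words, the count goes 50265 → 50269 → 50272. A minimal sketch of that arithmetic (not metaseq's actual code; the special-token names and "madeupword" placeholders are assumptions mirroring fairseq-style dictionaries):

def pad_to_multiple(symbols, multiple=8):
    # Append placeholder symbols until the vocabulary size is a multiple of `multiple`.
    i = 0
    while len(symbols) % multiple != 0:
        symbols.append(f"madeupword{i:04d}")
        i += 1
    return symbols

specials = ["<s>", "<pad>", "</s>", "<unk>"]     # 4 special tokens (assumed names)
bpe_vocab = [f"token{i}" for i in range(50265)]  # saved tokenizer: 50265 entries
dictionary = specials + bpe_vocab                # 50269 entries at this point
dictionary = pad_to_multiple(dictionary, 8)      # appends placeholders
print(len(dictionary))                           # 50272

Since 50269 is not a multiple of 8, three placeholder symbols are appended to reach 50272 (= 8 × 6284).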

@Zcchill
Author

Zcchill commented Oct 30, 2022

@suchenzang - Thank you for your answer! It seems that the 4 special tokens are already among the 50265 tokens.
So it seems that only pad_to_multiple_(8) takes the vocab size from 50265 to 50272. What I mean is: are ids 50265-50271 all "madeupword" tokens?

  • And does it mean that using model.resize_token_embeddings(len(tokenizer)) has no bad effect?


from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m', cache_dir='/ssdwork/cache/').cuda()

# Before resizing the token embeddings
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(output_decode)
# Result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.

# After resizing the token embeddings to match the tokenizer
model.resize_token_embeddings(len(tokenizer))
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(output_decode)
# Result: identical to the output before resizing.
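
For reference, a minimal sketch that makes the size mismatch visible (the printed values assume the stock facebook/opt-125m checkpoint and a contiguous tokenizer vocabulary):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

print(model.config.vocab_size)                      # 50272 (padded training vocab)
print(model.get_input_embeddings().weight.shape[0]) # 50272 rows in the embedding matrix
print(len(tokenizer))                               # 50265
print(max(tokenizer.get_vocab().values()))          # 50264, so ids >= 50265 are never produced

model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape[0]) # 50265 after resizing

Because the tokenizer never emits an id at or above 50265, the extra embedding rows are never read during generation, which would explain why the output above is unchanged after resizing.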

@baiyuting

baiyuting commented Mar 12, 2023

I have the same question. Also, is it OK to use a RoBERTa tokenizer instead?

@Gusicun

Gusicun commented Jul 26, 2023

Same question. Will it cause an IndexError?
