
why the "vocab_size" in config file is 50272 but the len(tokenizer) is 50265. #469

Open
Zcchill opened this issue Oct 30, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@Zcchill

Zcchill commented Oct 30, 2022

🐛 Bug

The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run cmd '....'
  2. See error
None

Code sample

model.resize_token_embeddings(len(tokenizer))

Expected behavior

The results look fine when I use the code above to align the embedding size with the tokenizer, but I wonder why the vocab size used for training is 50272. Did I miss an important parameter?

Environment

  • metaseq Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux, Windows, MacOS):
  • How you installed metaseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@Zcchill Zcchill added the bug label on Oct 30, 2022
@suchenzang
Contributor

The saved tokenizer has length 50265, but then we add 4 special tokens:

for id in range(self.dictionary.nspecial, tok_vocab_size):

which gives us a dictionary vocab size of 50269 at this point. This is followed by a pad_to_multiple_(8):

self.dictionary.pad_to_multiple_(8)

which is why the vocab size ends up being 50272.
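
In other words, the count goes 50265 → 50269 → 50272. A minimal sketch of that arithmetic (not metaseq's actual code; the special-token names and "madeupword" placeholders are assumptions mirroring fairseq-style dictionaries):

def pad_to_multiple(symbols, multiple=8):
    # Append placeholder symbols until the vocabulary size is a multiple of `multiple`.
    i = 0
    while len(symbols) % multiple != 0:
        symbols.append(f"madeupword{i:04d}")
        i += 1
    return symbols

specials = ["<s>", "<pad>", "</s>", "<unk>"]     # 4 special tokens (assumed names)
bpe_vocab = [f"token{i}" for i in range(50265)]  # saved tokenizer: 50265 entries
dictionary = specials + bpe_vocab                # 50269 entries at this point
dictionary = pad_to_multiple(dictionary, 8)      # appends placeholders
print(len(dictionary))                           # 50272

Since 50269 is not a multiple of 8, three placeholder symbols are appended to reach 50272 (= 8 × 6284).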

@Zcchill
Author

Zcchill commented Oct 30, 2022

@suchenzang - Thank you for your answer! It seems that the 4 special tokens are already among the 50265 tokens.
So it seems that only pad_to_multiple_(8) takes the vocab size from 50265 to 50272. What I mean is: are ids 50265-50271 all "madeupword" tokens?

  • And does it mean that using model.resize_token_embeddings(len(tokenizer)) has no bad effect?


from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m', cache_dir='/ssdwork/cache/').cuda()

# Before resizing the token embeddings
all_text = 'Which poem is the best one, and please write it to me.'
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(output_decode)
# Result: Which poem is the best one, and please write it to me.\nI'm not sure, but I think it's the one by the author of the poem.

# After resizing the token embeddings to match the tokenizer
model.resize_token_embeddings(len(tokenizer))
input_ids = tokenizer(all_text, return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, do_sample=False, max_length=256, num_beams=1)
output_decode = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(output_decode)
# Result: identical to the output before resizing.
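
For reference, a minimal sketch that makes the size mismatch visible (the printed values assume the stock facebook/opt-125m checkpoint and a contiguous tokenizer vocabulary):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')

print(model.config.vocab_size)                      # 50272 (padded training vocab)
print(model.get_input_embeddings().weight.shape[0]) # 50272 rows in the embedding matrix
print(len(tokenizer))                               # 50265
print(max(tokenizer.get_vocab().values()))          # 50264, so ids >= 50265 are never produced

model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape[0]) # 50265 after resizing

Because the tokenizer never emits an id at or above 50265, the extra embedding rows are never read during generation, which would explain why the output above is unchanged after resizing.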

@baiyuting

baiyuting commented Mar 12, 2023

I have the same question. Also, is it OK to use a RoBERTa tokenizer instead?

@Gusicun

Gusicun commented Jul 26, 2023

Same question. Will it cause an IndexError?
