Codes beyond 255 #2

avi-otterai · 2023-12-11T13:53:51Z

Thank you for releasing code and pretrained tokenizers!

Could you please elaborate why the codes sometimes are greater than 255? The simplest example is directly in the paper: melons → [261, 255, 209] - if the codebook size is 256 then shouldn't 255 be the maximum possible?

We found codes as far as 263 in practice.

davda54 · 2023-12-12T00:43:53Z

Hi, this a very good question, I'm sorry for the lack of documentation :) The VQ-VAE really uses only 256 codes, but for practical reasons (LM training), the tokenizer has special tokens in addition. This is defined on these lines, there are 8 special tokens and the total vocabulary size is therefore 256+8.

Thanks for letting me know about that example in the paper, I believe subtracting 8 from all these codes will make it less confusing for the reader?

avi-otterai · 2023-12-12T04:33:00Z

Thanks for the quick response David! I agree a first example without this special case would be good for the readers.

Quick clarifications:

Based on the referenced lines of code, is it true that only the R part of the code gets 256+8 tokens while the G and B indices get 256+1 (padding) tokens?
Could you confirm whether these are prepended or appended to the vocabulary? In other words, do the G/B codes have values 0-256 or 8-264?

davda54 · 2023-12-12T15:02:17Z

The special ids are in [0:8] for all factors. Then the actual codes are in [8:264] for all factors. The special symbols are currently defined to always use zero index for the G and B factors, the indices in [1:8] are unused.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Codes beyond 255 #2

Codes beyond 255 #2

avi-otterai commented Dec 11, 2023

davda54 commented Dec 12, 2023

avi-otterai commented Dec 12, 2023

davda54 commented Dec 12, 2023

Codes beyond 255 #2

Codes beyond 255 #2

Comments

avi-otterai commented Dec 11, 2023

davda54 commented Dec 12, 2023

avi-otterai commented Dec 12, 2023

davda54 commented Dec 12, 2023