Getting the Token Length Beforehand Without Using Any Encoding Model for Llama 3 #521

Open
ashreya2003 opened this issue Aug 27, 2024 · 1 comment

Comments

@ashreya2003

I’m working with the Llama 3 model and would like to compute the token length of any given text before passing it to the LLM. I want to do this with the tiktoken library, but without relying on any of its pre-built encodings (such as those for GPT-2 or GPT-3). Instead, I want a tokenization approach that is compatible with Llama 3’s BPE method.
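For context, Llama 3 ships its BPE merge ranks in tiktoken’s own file format, so they can already be read locally with tiktoken’s loader. A minimal sketch (the path is an assumption; `tokenizer.model` ships with the Meta-Llama-3 weights):

```python
from tiktoken.load import load_tiktoken_bpe

# Llama 3's tokenizer.model stores the BPE merge ranks in tiktoken's
# native file format, so no pre-built encoding (and no download) is needed.
mergeable_ranks = load_tiktoken_bpe("Meta-Llama-3-8B/tokenizer.model")  # path: assumption
print(len(mergeable_ranks))  # 128000 base tokens; Llama 3 adds 256 special tokens on top
```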

Feature Request:

1. Token Length Calculation: Implement a feature in tiktoken that computes token lengths specific to Llama 3’s tokenizer, without requiring one of the pre-built encodings.
2. BPE Tokenizer Support: Since Llama 3 uses a BPE (Byte Pair Encoding) tokenizer, this would need to:
• Define the pre-tokenization rules specific to Llama 3.
• Load or define the BPE merges and vocabulary specific to Llama 3 without relying on an external file download.
3. Special Tokens Handling: Include handling for the special tokens defined in Llama 3 (e.g., <|begin_of_text|>, <|end_of_text|>) and allow easy customization of these tokens. A sketch covering all three items follows this list.
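Putting the three items together, here is a minimal sketch of the workaround I have in mind: build a `tiktoken.Encoding` from Llama 3’s ranks, pre-tokenization regex, and special tokens, then count tokens locally. The regex and the special-token IDs below are copied from Meta’s reference `llama3/tokenizer.py`; the file path and the `count_tokens` helper are assumptions for illustration.

```python
import tiktoken
from tiktoken.load import load_tiktoken_bpe

# Pre-tokenization regex from Meta's reference llama3 tokenizer.
pat_str = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

mergeable_ranks = load_tiktoken_bpe("Meta-Llama-3-8B/tokenizer.model")  # path: assumption
num_base = len(mergeable_ranks)  # 128000

# A subset of Llama 3's 256 special tokens, with their reference IDs;
# the rest are <|reserved_special_token_N|> placeholders.
special_tokens = {
    "<|begin_of_text|>": num_base,        # 128000
    "<|end_of_text|>": num_base + 1,      # 128001
    "<|start_header_id|>": num_base + 6,  # 128006
    "<|end_header_id|>": num_base + 7,    # 128007
    "<|eot_id|>": num_base + 9,           # 128009
}

enc = tiktoken.Encoding(
    name="llama3",
    pat_str=pat_str,
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

def count_tokens(text: str) -> int:
    # Token length under Llama 3's BPE, computed entirely offline.
    return len(enc.encode(text, allowed_special="all"))

print(count_tokens("<|begin_of_text|>Hello, world!<|eot_id|>"))
```

If tiktoken offered this as a first-class “bring your own BPE” encoding (items 1 and 2) with configurable special tokens (item 3), the boilerplate above would largely disappear.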