Getting the Token Length Beforehand Without Using Any Encoding Model for Llama 3 #521

Open
ashreya2003 opened this issue Aug 27, 2024 · 1 comment

Comments

@ashreya2003

I’m working with the Llama 3 model and would like to compute the token length of any given text before passing it to the LLM. I want to do this with the tiktoken library, but without relying on any of its pre-built encodings (such as those for GPT-2 or GPT-3). Instead, I want a tokenization approach that is compatible with Llama 3’s BPE method.
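For context, Llama 3 ships its BPE merge ranks in tiktoken’s own file format, so they can already be read locally with tiktoken’s loader. A minimal sketch (the path is an assumption; `tokenizer.model` ships with the Meta-Llama-3 weights):

```python
from tiktoken.load import load_tiktoken_bpe

# Llama 3's tokenizer.model stores the BPE merge ranks in tiktoken's
# native file format, so no pre-built encoding (and no download) is needed.
mergeable_ranks = load_tiktoken_bpe("Meta-Llama-3-8B/tokenizer.model")  # path: assumption
print(len(mergeable_ranks))  # 128000 base tokens; Llama 3 adds 256 special tokens on top
```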

Feature Request:

1. Token Length Calculation: Implement a feature in tiktoken that computes token lengths specific to Llama 3’s tokenizer, without requiring one of the pre-built encodings.
2. BPE Tokenizer Support: Since Llama 3 uses a BPE (Byte Pair Encoding) tokenizer, this would need to:
• Define the pre-tokenization rules specific to Llama 3.
• Load or define the BPE merges and vocabulary specific to Llama 3 without relying on an external file download.
3. Special Tokens Handling: Include handling for the special tokens defined in Llama 3 (e.g., <|begin_of_text|>, <|end_of_text|>) and allow easy customization of these tokens. A sketch covering all three items follows this list.
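Putting the three items together, here is a minimal sketch of the workaround I have in mind: build a `tiktoken.Encoding` from Llama 3’s ranks, pre-tokenization regex, and special tokens, then count tokens locally. The regex and the special-token IDs below are copied from Meta’s reference `llama3/tokenizer.py`; the file path and the `count_tokens` helper are assumptions for illustration.

```python
import tiktoken
from tiktoken.load import load_tiktoken_bpe

# Pre-tokenization regex from Meta's reference llama3 tokenizer.
pat_str = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

mergeable_ranks = load_tiktoken_bpe("Meta-Llama-3-8B/tokenizer.model")  # path: assumption
num_base = len(mergeable_ranks)  # 128000

# A subset of Llama 3's 256 special tokens, with their reference IDs;
# the rest are <|reserved_special_token_N|> placeholders.
special_tokens = {
    "<|begin_of_text|>": num_base,        # 128000
    "<|end_of_text|>": num_base + 1,      # 128001
    "<|start_header_id|>": num_base + 6,  # 128006
    "<|end_header_id|>": num_base + 7,    # 128007
    "<|eot_id|>": num_base + 9,           # 128009
}

enc = tiktoken.Encoding(
    name="llama3",
    pat_str=pat_str,
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

def count_tokens(text: str) -> int:
    # Token length under Llama 3's BPE, computed entirely offline.
    return len(enc.encode(text, allowed_special="all"))

print(count_tokens("<|begin_of_text|>Hello, world!<|eot_id|>"))
```

If tiktoken offered this as a first-class “bring your own BPE” encoding (items 1 and 2) with configurable special tokens (item 3), the boilerplate above would largely disappear.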