I’m working with the Llama 3 model and would like to count the tokens in any given text before passing it to the LLM. I want to do this with the tiktoken library, but without relying on any of its pre-built encodings such as those for GPT-2 or GPT-3. Instead, I want a tokenization approach that is compatible with Llama 3’s BPE method.
Feature Request:
1. Token Length Calculation: Add a way in tiktoken to count tokens with Llama 3’s tokenizer, without requiring one of the pre-built encodings.
2. BPE Tokenizer Support: Since Llama 3 uses a BPE (Byte Pair Encoding) tokenizer, I need to:
• Define the tokenization rules (the pre-tokenization split pattern) specific to Llama 3.
• Load or define the BPE merges and vocabulary specific to Llama 3 without relying on an external file download.
3. Special Tokens Handling: Include handling for the special tokens defined in Llama 3 (e.g., <|begin_of_text|>, <|end_of_text|>), and allow these tokens to be customized easily.
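For illustration, here is a minimal sketch of how the three points above might be covered with tiktoken's existing `Encoding` constructor, which accepts an in-memory rank table and so needs no download. It rests on several assumptions: the split pattern is the one I believe Meta's reference tokenizer uses and should be checked against your checkpoint, the helper names (`parse_bpe_ranks`, `number_special_tokens`, `build_encoding`, `count_tokens`) are hypothetical, and the vocabulary text is expected in tiktoken's "base64 token, space, rank" line format, which is how the Llama 3 `tokenizer.model` appears to be distributed.

```python
import base64

# Assumption: split pattern as published with Meta's Llama 3 reference
# tokenizer; verify it against the checkpoint you actually use.
LLAMA3_PAT_STR = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}|"
    r" ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)


def parse_bpe_ranks(text: str) -> dict[bytes, int]:
    """Parse tiktoken-style vocabulary lines: '<base64 token> <rank>'."""
    ranks: dict[bytes, int] = {}
    for line in text.splitlines():
        if line.strip():
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def number_special_tokens(names: list[str], first_id: int) -> dict[str, int]:
    """Assign consecutive ids to special tokens, starting at first_id."""
    return {name: first_id + i for i, name in enumerate(names)}


def build_encoding(mergeable_ranks: dict[bytes, int], special_names: list[str]):
    """Build a tiktoken Encoding from in-memory ranks (hypothetical helper)."""
    # Imported here so the pure-Python helpers above work without tiktoken.
    import tiktoken

    # Special tokens are numbered directly after the regular vocabulary.
    special_tokens = number_special_tokens(special_names, len(mergeable_ranks))
    return tiktoken.Encoding(
        name="llama3-sketch",  # arbitrary label, not a registered encoding
        pat_str=LLAMA3_PAT_STR,
        mergeable_ranks=mergeable_ranks,
        special_tokens=special_tokens,
    )


def count_tokens(enc, text: str) -> int:
    """Count tokens, treating special-token strings in the text as specials."""
    return len(enc.encode(text, allowed_special="all"))
```

With a real vocabulary, usage would look roughly like `enc = build_encoding(parse_bpe_ranks(vocab_text), ["<|begin_of_text|>", "<|end_of_text|>"])` followed by `count_tokens(enc, "some text")`. The rank table has to contain all 256 single-byte tokens for arbitrary text to be encodable, and the special-token list has to match the model's in order and ids for the counts to agree with what the model sees.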