This project is a port of Andrej Karpathy's minbpe to Mojo; it is currently in beta. minbpe implements the Byte Pair Encoding (BPE) algorithm, which is commonly used for tokenization in large language models (LLMs). For a comprehensive explanation of the algorithm and the original project, visit its GitHub page at https://github.com/karpathy/minbpe.
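To make the core mechanism concrete, the following is a minimal, self-contained Mojo sketch of the two primitives at the heart of BPE training: counting adjacent token pairs and merging the most frequent pair into a new token. This is purely illustrative and is not the project's actual implementation (which, among other things, uses CompactDict rather than the stdlib Dict):

```mojo
from collections import Dict, List

fn count_pairs(ids: List[Int]) -> Dict[Int, Int]:
    # Count adjacent pairs; each pair (a, b) is packed into a single
    # Int key (a << 16 | b) so the stdlib Dict can be used here.
    var counts = Dict[Int, Int]()
    for i in range(len(ids) - 1):
        var key = (ids[i] << 16) | ids[i + 1]
        counts[key] = counts.get(key, 0) + 1
    return counts

fn merge(ids: List[Int], first: Int, second: Int, new_id: Int) -> List[Int]:
    # Replace every occurrence of the adjacent pair (first, second)
    # with the freshly minted token new_id.
    var out = List[Int]()
    var i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == first and ids[i + 1] == second:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

fn main() raises:
    # Start from the raw byte values of the toy string used further down.
    var text = String("aaabdaaabac")
    var ids = List[Int]()
    for i in range(len(text)):
        ids.append(ord(text[i]))
    var counts = count_pairs(ids)  # in a real training loop: pick the max pair
    # The most frequent pair here is ('a', 'a'); minting token 256 for it
    # turns the sequence into [256, 97, 98, 100, 256, 97, 98, 97, 99].
    ids = merge(ids, ord("a"), ord("a"), 256)
```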
Not all features of minbpe are available yet, but they will be introduced as the project evolves. Currently, the main focus is on enhancing the performance of the core functionality.
Due to differences in language capabilities, the architecture of this port has been modified to fit the constraints and features of Mojo. While the architecture is different, the core functionalities and behaviors of the application remain the same as in the original. As Mojo's language features continue to evolve, we expect to further refine and redesign the project.
Tokenizers in minbpe.mojo are implemented by conforming to the Tokenizer trait, which defines the required methods for the tokenization process.
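As a rough sketch of that shape (the signatures below are assumptions inferred from the quick start example further down; consult the repository for the authoritative trait definition):

```mojo
trait Tokenizer:
    # Sketch only: method names mirror the quick start example below;
    # the actual trait in the repository may differ in detail.
    fn train(inout self, text: String, vocab_size: Int) raises: ...
    fn encode(self, text: String) raises -> List[Int]: ...
    fn decode(self, ids: List[Int]) raises -> String: ...
    fn save(self, file_prefix: String) raises: ...
```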
- BasicTokenizer: the simplest implementation of the BPE algorithm, which runs directly on text.
- RegexTokenizer: further splits the input text with a regex pattern, a preprocessing stage that separates the input by category (think: letters, numbers, punctuation) before tokenization. This ensures that no merges happen across category boundaries. It was introduced in the GPT-2 paper and continues to be used as of GPT-4. This class also handles special tokens, if any; a usage sketch follows this list.
- GPT4Tokenizer: to be implemented.
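As a quick illustration of the regex pre-splitting, here is a hedged usage sketch. It assumes RegexTokenizer mirrors the original minbpe API, where the no-argument constructor falls back to a default (GPT-4-style) split pattern; verify the actual constructor in the repository:

```mojo
from mojobpe import RegexTokenizer
from mojobpe.utils.tat import print_list_int

fn main() raises:
    # Assumption: a no-argument constructor with a default split pattern,
    # as in the original minbpe.
    var tokenizer = RegexTokenizer()
    # "lol123!" is pre-split into chunks like "lol", "123", "!", so no
    # merge can span a letter/number/punctuation boundary.
    tokenizer.train("lol123! lol123! lol123!", 256 + 8)
    print_list_int(tokenizer.encode("lol123!"))
```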
- Ensure that the `magic` command line tool is installed by following the Modular Docs.
- Run `magic shell` within the root of the cloned repository to install the project's dependencies (Mojo 24.5 via MAX, Regex) and to activate the project's virtual environment, in which you can run the Mojo apps.
The quick start example from minbpe can be implemented with minbpe.mojo as follows:
```mojo
from mojobpe import Tokenizer, BasicTokenizer
from mojobpe.utils.tat import print_list_int

fn main() raises:
    var text = "aaabdaaabac"
    var tokenizer = BasicTokenizer()
    tokenizer.train(text, 256 + 3)  # 256 are the byte tokens, then do 3 merges
    print_list_int(tokenizer.encode(text))
    # [258, 100, 258, 97, 99]
    print(tokenizer.decode(List[Int](258, 100, 258, 97, 99)))
    # aaabdaaabac
    tokenizer.save("toy")
    # writes toy.model (for loading)
```
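If the port mirrors the original minbpe API, the saved model can then be reloaded into a fresh tokenizer. The load method below is an assumption, so verify the exact name in the repository:

```mojo
from mojobpe import BasicTokenizer

fn main() raises:
    # Assumption: a load() counterpart to save(), as in the original minbpe.
    var tokenizer = BasicTokenizer()
    tokenizer.load("toy.model")
    print(tokenizer.decode(tokenizer.encode("aaabdaaabac")))
    # aaabdaaabac
```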
A detailed benchmark analysis will be available soon.
For now, we have included a Mojo port of train.py from the original repository, which times the training of both the BasicTokenizer and the RegexTokenizer on the text of Taylor Swift's Wikipedia page. In our preliminary tests, the Mojo version proves to be approximately three times faster than the original Python implementation. You can run this training benchmark using the following commands:
```
magic shell
mojo train.mojo
```
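For a rough idea of what the benchmark does, a minimal timing loop in Mojo might look like the following sketch (the text path and vocabulary size are assumptions for illustration; train.mojo in the repository is the authoritative version):

```mojo
from time import now
from mojobpe import BasicTokenizer

fn main() raises:
    # Assumed path to the training text (taylorswift.txt in the original
    # minbpe repository); adjust to wherever the file lives locally.
    var f = open("taylorswift.txt", "r")
    var text = f.read()
    f.close()
    var tokenizer = BasicTokenizer()
    var start = now()  # nanoseconds
    tokenizer.train(text, 512)  # assumed vocab size for illustration
    print("BasicTokenizer training took", (now() - start) / 1_000_000, "ms")
```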
- 2024.10.09
  - Update to Mojo 24.5
- 2024.06.07
  - Update to Mojo 24.4
  - Performance improvements thanks to new features of CompactDict
- 2024.05.14
  - Status: Beta
  - Performance improvements
- 2024.05.12
  - Switch to MoString for String concatenation
- 2024.05.04
  - Initial repository setup and commit
- We achieved a significant performance boost by utilizing Maxim Zaks' exceptional Mojo library, CompactDict, which provides blazing-fast dictionary implementations. We've incorporated a slightly modified version of this library in the `mojobpe.utils` folder (`generic_dict` and `string_dict`); all credits go to him.
- Gregor Purdy has implemented an impressive Rust port of minbpe. In our initial tests, Gregor's port performs similarly to our current Mojo port.
MIT