bpe.c is a lightweight, minimal implementation of Byte Pair Encoding (BPE) in C, inspired by Andrej Karpathy's minbpe.
- Implements the Byte Pair Encoding algorithm
- Trains on input text to learn token merges
- Customizable vocabulary size
- Minimal dependencies (standard C libraries only)
- Initialization: The tokenizer starts with a basic vocabulary of 256 byte values.
- Training: It analyzes the input text, finding the most frequent pairs of tokens and merging them iteratively until the desired vocabulary size is reached.
- Encoding: Using the learned merges, it converts input text into a sequence of token IDs.
- Decoding: It can reconstruct the original text from a sequence of token IDs.
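The training and decoding steps above can be sketched in a few functions. This is a minimal illustration of the general BPE technique, not the code in bpe.c: the names `Merge`, `most_frequent_pair`, `merge_pair`, `train_sketch`, and `expand_token` are hypothetical, pair counting is a naive quadratic scan, and ties are broken in favor of the pair that appears first.

```c
#include <assert.h>
#include <stddef.h>

#define INITIAL_VOCAB_SIZE 256 /* one base token per possible byte value */

/* Hypothetical record of one learned merge: (left, right) -> new token. */
typedef struct { int left, right; } Merge;

/* Find the most frequent adjacent pair in ids[0..n) and return its count.
   Naive O(n^2) scan; the earliest pair wins ties. */
size_t most_frequent_pair(const int *ids, size_t n, int *a, int *b) {
    assert(ids != NULL && a != NULL && b != NULL);
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; ++i) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; ++j) {
            if (ids[j] == ids[i] && ids[j + 1] == ids[i + 1]) ++count;
        }
        if (count > best) { best = count; *a = ids[i]; *b = ids[i + 1]; }
    }
    return best;
}

/* Replace each non-overlapping occurrence of (a, b) with new_id, in place.
   Returns the new sequence length. */
size_t merge_pair(int *ids, size_t n, int a, int b, int new_id) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && ids[i] == a && ids[i + 1] == b) {
            ids[out++] = new_id;
            i += 2;
        } else {
            ids[out++] = ids[i++];
        }
    }
    return out;
}

/* Perform up to max_merges merges on ids[0..n). Merge k creates token
   INITIAL_VOCAB_SIZE + k. Returns the final sequence length. */
size_t train_sketch(int *ids, size_t n, Merge *merges,
                    size_t max_merges, size_t *num_merges) {
    size_t k = 0;
    while (k < max_merges && n >= 2) {
        int a = 0, b = 0;
        most_frequent_pair(ids, n, &a, &b);
        merges[k].left = a;
        merges[k].right = b;
        n = merge_pair(ids, n, a, b, INITIAL_VOCAB_SIZE + (int)k);
        ++k;
    }
    *num_merges = k;
    return n;
}

/* Decode one token by recursively expanding it into bytes.
   The caller must supply a large enough buffer. Returns bytes written. */
size_t expand_token(const Merge *merges, int id, char *out) {
    assert(id >= 0);
    if (id < INITIAL_VOCAB_SIZE) { out[0] = (char)id; return 1; }
    const Merge *m = &merges[id - INITIAL_VOCAB_SIZE];
    size_t n = expand_token(merges, m->left, out);
    return n + expand_token(merges, m->right, out + n);
}
```

For example, running three merges over the 11 bytes of "aaabdaaabac" leaves 5 tokens, and expanding those tokens with `expand_token` reproduces the original bytes.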
You can easily customize the tokenizer by modifying the following constants in the file:
- INITIAL_VOCAB_SIZE: The starting vocabulary size (default is 256, one token per possible byte value)
- MAX_TEXT_SIZE: The maximum length of text that can be processed
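As a rough sketch of how these two constants interact with the target vocabulary size: the value of `MAX_TEXT_SIZE` below is purely illustrative (the real value is whatever the file defines), and the helpers `merges_needed` and `text_fits` are hypothetical names that do not come from bpe.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INITIAL_VOCAB_SIZE 256 /* one token per possible byte value */
#define MAX_TEXT_SIZE 1024     /* illustrative value only, not taken from bpe.c */

/* Number of merges needed to grow the base vocabulary to vocab_size.
   Assumes each merge adds exactly one new token. */
size_t merges_needed(size_t vocab_size) {
    return vocab_size > INITIAL_VOCAB_SIZE ? vocab_size - INITIAL_VOCAB_SIZE : 0;
}

/* Check that text, plus its terminating NUL, fits in a buffer of
   MAX_TEXT_SIZE elements. Assumes buffers are sized by that constant. */
int text_fits(const char *text) {
    assert(text != NULL);
    return strlen(text) < MAX_TEXT_SIZE;
}
```

Under these assumptions, a target vocabulary of 300 corresponds to 44 merges, and any target at or below 256 corresponds to none.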
Modify the main function to experiment with different texts and vocabulary sizes.
int main() {
    BasicTokenizer *tokenizer = create_tokenizer();
    const char *text = "hello world the sky is blue";
    size_t vocab_size = 300;
    train(tokenizer, text, vocab_size, 1);

    // Encode the text
    int ids[MAX_TEXT_SIZE];
    size_t ids_size = 0;
    encode(tokenizer, text, ids, &ids_size);

    // Decode the ids
    char decoded_text[MAX_TEXT_SIZE];
    decode(tokenizer, ids, ids_size, decoded_text);

    printf("Encoded IDs:\n");
    for (size_t i = 0; i < ids_size; ++i) {
        printf("%d ", ids[i]);
    }
    printf("\nDecoded text: %s\n", decoded_text);

    clean_tokenizer(tokenizer);
    return 0;
}
If you use bpe.c in your research, please cite it as follows:
@misc{bpe.c2024,
author = {Ashvanth.S},
title = {bpe.c: Minimal implementation of Byte Pair Encoding (BPE) in C},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ash-01xor/bpe.c}},
}