bpe.c

bpe.c is a lightweight, minimal implementation of Byte Pair Encoding (BPE) in C, inspired by Andrej Karpathy's minbpe.

Features

Implements the Byte Pair Encoding algorithm
Trains on input text to learn token merges
Customizable vocabulary size
Minimal dependencies (standard C libraries only)

How It Works

Initialization: The tokenizer starts with a basic vocabulary of 256 byte values.
Training: It analyzes the input text, finding the most frequent pairs of tokens and merging them iteratively until the desired vocabulary size is reached.
Encoding: Using the learned merges, it converts input text into a sequence of token IDs.
Decoding: It can reconstruct the original text from a sequence of token IDs.
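The four steps above can be sketched in a few short functions. This is a minimal illustration, not the code in minbpe.c: the names most_frequent_pair, merge_pair, learn_merges, apply_merges, and expand_token, the BASE_VOCAB constant, and the merges[][2] table layout are all assumptions made for this example, and the pair search is a simple O(n^2) scan chosen for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>

#define BASE_VOCAB 256 /* hypothetical name: one starting token per byte value */

/* Find the most frequent adjacent pair in ids[0..n). Writes the pair to
   *a and *b and returns its count (0 if there are fewer than two tokens).
   On ties, the pair that appears first in the sequence wins. */
size_t most_frequent_pair(const int *ids, size_t n, int *a, int *b) {
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; ++i) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; ++j) {
            if (ids[j] == ids[i] && ids[j + 1] == ids[i + 1]) ++count;
        }
        if (count > best) {
            best = count;
            *a = ids[i];
            *b = ids[i + 1];
        }
    }
    return best;
}

/* Replace each left-to-right, non-overlapping occurrence of (a, b) with
   new_id, in place. Returns the new sequence length. */
size_t merge_pair(int *ids, size_t n, int a, int b, int new_id) {
    size_t out = 0;
    for (size_t i = 0; i < n;) {
        if (i + 1 < n && ids[i] == a && ids[i + 1] == b) {
            ids[out++] = new_id;
            i += 2;
        } else {
            ids[out++] = ids[i++];
        }
    }
    return out;
}

/* Training: learn up to num_merges merges from ids. merges[k] records the
   pair that became token BASE_VOCAB + k. Returns the number learned. */
size_t learn_merges(int *ids, size_t *n, int merges[][2], size_t num_merges) {
    size_t k = 0;
    for (; k < num_merges; ++k) {
        int a = 0, b = 0;
        if (most_frequent_pair(ids, *n, &a, &b) == 0) break;
        merges[k][0] = a;
        merges[k][1] = b;
        *n = merge_pair(ids, *n, a, b, BASE_VOCAB + (int)k);
    }
    return k;
}

/* Encoding: apply the learned merges, in the order they were learned, to a
   sequence of raw byte values. Returns the encoded length. */
size_t apply_merges(int *ids, size_t n, int merges[][2], size_t num_merges) {
    for (size_t k = 0; k < num_merges; ++k) {
        n = merge_pair(ids, n, merges[k][0], merges[k][1], BASE_VOCAB + (int)k);
    }
    return n;
}

/* Decoding: expand one token back into its bytes, written to out.
   The caller must provide a large enough buffer. Returns bytes written. */
size_t expand_token(int id, int merges[][2], unsigned char *out) {
    assert(id >= 0);
    if (id < BASE_VOCAB) {
        out[0] = (unsigned char)id;
        return 1;
    }
    size_t len = expand_token(merges[id - BASE_VOCAB][0], merges, out);
    return len + expand_token(merges[id - BASE_VOCAB][1], merges, out + len);
}
```

In this sketch the target vocabulary size maps to num_merges = vocab_size - 256, so a vocabulary of 300 would allow up to 44 merges. Training stops early if the sequence shrinks to fewer than two tokens.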

Usage

You can easily customize the tokenizer by modifying the following constants in the file:

INITIAL_VOCAB_SIZE: The starting vocabulary size (default is 256, one token per possible byte value)
MAX_TEXT_SIZE: The maximum length of text that can be processed

Modify the main function to experiment with different texts and vocabulary sizes.

int main() {
    BasicTokenizer *tokenizer = create_tokenizer();
    
    const char *text = "hello world the sky is blue";
    size_t vocab_size = 300;

    train(tokenizer, text, vocab_size, 1);

    // Encode the text
    int ids[MAX_TEXT_SIZE];
    size_t ids_size = 0;
    encode(tokenizer, text, ids, &ids_size);
    
    // Decode the ids
    char decoded_text[MAX_TEXT_SIZE];
    decode(tokenizer, ids, ids_size, decoded_text);
    
    printf("Encoded IDs:\n");
    for (size_t i = 0; i < ids_size; ++i) {
        printf("%d ", ids[i]);
    }
    printf("\nDecoded text: %s\n", decoded_text);
    
    clean_tokenizer(tokenizer);

    return 0;
}

The result: [image: program output]

Citation

If you use bpe.c in your research, please cite it as follows:

@misc{bpe.c2024,
  author = {Ashvanth.S},
  title = {bpe.c: Minimal implementation of Byte Pair Encoding (BPE) in C},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ash-01xor/bpe.c}},
}
