Byte Pair Encoding (BPE) Tokenization for Natural Language Processing

Byte Pair Encoding

Overview

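Byte-Pair Encoding (BPE) is a subword tokenization method widely used in natural language processing. It starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols into a new symbol, so common words end up as single tokens while rare words are split into smaller units. This Python implementation is inspired by the paper "Neural Machine Translation of Rare Words with Subword Units" (Sennrich, Haddow, and Birch) and guided by Lei Mao's educational tutorial.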
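The merge procedure described above can be sketched in a few lines of Python. This is a minimal illustration in the style of the reference code from the paper, not this repository's implementation: the function names (`get_pair_counts`, `merge_pair`, `learn_merges`), the toy vocabulary, and the `</w>` end-of-word marker are assumptions chosen for the example.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by each word's frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that the given pair becomes a single symbol."""
    # The lookarounds make sure we only match whole, space-delimited symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_merges(vocab, n_merges):
    """Greedily learn up to n_merges merge rules from a word-frequency table."""
    merges = []
    for _ in range(n_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair counted first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy vocabulary (hypothetical): space-separated symbols mapped to word counts.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, vocab = learn_merges(vocab, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

After three merges the frequent suffix `est</w>` has become a single symbol, which is the effect the `--n_merges` setting controls at a larger scale.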

Features

  • Tokenization: Efficient tokenization using Byte-Pair Encoding.
  • Vocabulary Management: Tools for managing and analyzing vocabulary.
  • Token Pair Frequency: Calculate token pair frequencies for subword units.
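To make the tokenization feature concrete, here is a hedged sketch of how a list of learned merges could be replayed to split an unseen word into subword units. The `tokenize_word` helper, the hard-coded merge list, and the `</w>` end-of-word marker are hypothetical and only illustrate the general technique; the repository's own API may differ.

```python
def tokenize_word(word, merges):
    """Split a word into subword tokens by replaying merges in learned order."""
    symbols = list(word) + ["</w>"]  # start from characters plus an end marker
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse the pair wherever it occurs; otherwise copy the symbol over.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Illustrative merge rules (assumed, not produced by this repository).
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
tokens = tokenize_word("lowest", merges)
print(tokens)  # ['low', 'est</w>']
```

Even if "lowest" never appeared in the training text, it is still covered by the two subword units learned from other words.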

Getting Started

To get started with the Byte-Pair Encoder, follow these steps:

  1. Clone the Repository

    git clone https://github.com/teleprint-me/byte-pair.git

  2. Install Dependencies

    virtualenv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

  3. Run the Code

    python -m byte_pair.encode --input_file samples/taming_shrew.md --output_file local/vocab.json --n_merges 5000

Usage

For the full list of command-line options, run the module with the --help flag:

    python -m byte_pair.encode --help
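For readers who want to see how a command line like the one above is typically wired up, the following is a rough sketch using Python's standard `argparse` module. Only the three flag names (`--input_file`, `--output_file`, `--n_merges`) come from the example command in Getting Started; the parser name, help strings, defaults, and required settings are assumptions, and the output of `--help` remains the authoritative reference.

```python
import argparse


def build_parser():
    """Build a stand-in parser that mirrors the flags shown in Getting Started."""
    parser = argparse.ArgumentParser(
        prog="example_encode",  # hypothetical name, not the real module
        description="Learn BPE merges from a text file (illustrative sketch).",
    )
    # Whether these are required, and their defaults, are assumptions.
    parser.add_argument("--input_file", required=True, help="Text file to learn from.")
    parser.add_argument("--output_file", required=True, help="Where to write the vocabulary.")
    parser.add_argument("--n_merges", type=int, default=5000, help="Number of merges to learn.")
    return parser


# Parse the same arguments used in the Getting Started example.
args = build_parser().parse_args(
    [
        "--input_file", "samples/taming_shrew.md",
        "--output_file", "local/vocab.json",
        "--n_merges", "5000",
    ]
)
print(args.input_file, args.output_file, args.n_merges)
```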

Documentation

Detailed information on how to use and contribute to the project is available in the documentation.

Contributing

Contributions are welcome! If you have suggestions, bug reports, or improvements, please don't hesitate to submit issues or pull requests.

License

This project is licensed under the AGPL (GNU Affero General Public License). For detailed information, see the LICENSE file.

Acknowledgments

Special thanks to Lei Mao for the blog tutorial that inspired this implementation.
