🙂 Fast state-of-the-art tokenizers for Ruby
Add this line to your application’s Gemfile:
gem "tokenizers"
Load a pretrained tokenizer
tokenizer = Tokenizers.from_pretrained("bert-base-cased")
Encode
encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.tokens
encoded.ids
Decode
tokenizer.decode(ids)
Create a tokenizer
tokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: "[UNK]"))
Set the pre-tokenizer
tokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new
Train the tokenizer (example data)
trainer = Tokenizers::Trainers::BpeTrainer.new(special_tokens: ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)
Encode
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
output.tokens
Save the tokenizer to a file
tokenizer.save("tokenizer.json")
Load a tokenizer from a file
tokenizer = Tokenizers.from_file("tokenizer.json")
Check out the Quicktour and equivalent Ruby code for more info
This library follows the Tokenizers Python API. You can follow Python tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test