Minhao Chou edited this page Apr 5, 2022 · 3 revisions

Guide to transformer tokenizers

BERT tokenizer

// Point the tokenizer at the BERT vocabulary file (placeholder path).
tokenizers::BertTokenizer::Options options{};
options.vocab_file = "/bert/vocab/file/path";
std::unique_ptr<tokenizers::BertTokenizer> tokenizer = tokenizers::BertTokenizer::CreateBertTokenizer(options);

// Encode a batch of texts in a single call.
std::vector<std::string> texts = {"bert tokenizer", "gpt tokenizer"};
int max_length = 512;  // maximum length passed to BatchEncode
std::vector<EncodeOutput> batch_outputs = tokenizer->BatchEncode(&texts, nullptr, max_length);