tokenizers.bpe - R package for Byte Pair Encoding

This repository contains an R package which is an Rcpp wrapper around the YouTokenToMe C++ library.

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency.
It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.].
YouTokenToMe is available at https://github.com/VKCOM/YouTokenToMe
Note that the flat_hash_map used in YouTokenToMe was replaced by parallel-hashmap
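
To make the idea behind Byte Pair Encoding concrete, here is a minimal base R sketch of a single merge step: count adjacent symbol pairs in a toy corpus and merge the most frequent pair into one symbol. The corpus and the helper names count_pairs and merge_pair are made up for illustration; this is not how YouTokenToMe implements training, only a sketch of the general technique.

```r
# Toy corpus: each word is split into single characters (illustrative data).
words <- list(c("l", "o", "w"), c("l", "o", "w", "e", "r"), c("l", "o", "t"))

# Count adjacent symbol pairs across all words (hypothetical helper).
count_pairs <- function(words) {
  pairs <- unlist(lapply(words, function(w) {
    if (length(w) < 2) return(character(0))
    paste(w[-length(w)], w[-1], sep = " ")
  }))
  table(pairs)
}

# Merge every occurrence of the pair (a, b) in one word (hypothetical helper).
merge_pair <- function(w, a, b) {
  out <- character(0)
  i <- 1
  while (i <= length(w)) {
    if (i < length(w) && w[i] == a && w[i + 1] == b) {
      out <- c(out, paste0(a, b))
      i <- i + 2
    } else {
      out <- c(out, w[i])
      i <- i + 1
    }
  }
  out
}

# One merge step: find the most frequent pair and apply the merge to each word.
counts <- count_pairs(words)
best <- strsplit(names(counts)[which.max(counts)], " ", fixed = TRUE)[[1]]
merged <- lapply(words, merge_pair, a = best[1], b = best[2])
print(merged)
```

Repeating this step until the vocabulary reaches a target size is the basic training loop of BPE; the merged symbols become the subword units.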

Features

The R package allows you to

build a Byte Pair Encoding (BPE) model
apply the model to encode text
apply the model to decode ids back to text

Installation

For regular users, install the package from your local CRAN mirror: install.packages("tokenizers.bpe")
To install the development version of this package: remotes::install_github("bnosac/tokenizers.bpe")

See the documentation of the functions:

help(package = "tokenizers.bpe")

Example

As an example, let's take some training data containing questions asked in the Belgian Parliament in 2017 and focus on the French text only.

library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
writeLines(text = x$text, con = "traindata.txt")

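Train a model on the text data and inspect the vocabulary. Here vocab_size sets the number of tokens in the final vocabulary and coverage sets the fraction of characters covered by the model.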

model <- bpe("traindata.txt", coverage = 0.999, vocab_size = 5000)
model

Byte Pair Encoding model trained with YouTokenToMe
  size of the vocabulary: 5000
  model stored at: C:/Users/Jan/Dropbox/Work/RForgeBNOSAC/BNOSAC/tokenizers.bpe/youtokentome.bpe

str(model$vocabulary)

'data.frame':	5000 obs. of  2 variables:
 $ id     : int  0 1 2 3 4 5 6 7 8 9 ...
 $ subword: chr  "<PAD>" "<UNK>" "<BOS>" "<EOS>" ...

Use the model to encode text

text <- c("L'appartement est grand & vraiment bien situe en plein centre",
          "Proportion de femmes dans les situations de famille monoparentale.")
bpe_encode(model, x = text, type = "subwords")

[[1]]
 [1] "▁L'"     "app"     "ar"      "tement"  "▁est"    "▁grand"  "▁"       "&"       "▁v"      "r"       "ai"      "ment"    "▁bien"   "▁situe"  "▁en"     "▁plein"  "▁centre"

[[2]]
 [1] "▁Pro"        "por"         "tion"        "▁de"         "▁femmes"     "▁dans"       "▁les"        "▁situations" "▁de"         "▁famille"    "▁mon"        "op"          "ar"          "ent"         "ale."

bpe_encode(model, x = text, type = "ids")

[[1]]
 [1]  421  327   98  554  178 1521    4    1  117   11  101   99  679 4599  113 3702 2126

[[2]]
 [1] 1529 4878   92   76 2321  162  108 4099   76 3218  791  312   98   87 2546

Use the model to decode byte pair encodings back to text

x <- bpe_encode(model, x = text, type = "ids")
bpe_decode(model, x)

[[1]]
[1] "L'appartement est grand <UNK> vraiment bien situe en plein centre"

[[2]]
[1] "Proportion de femmes dans les situations de famille monoparentale."
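
The <UNK> in the first decoded sentence can be traced through the ids: in the output above, "&" was encoded as id 1, which the vocabulary lists as "<UNK>". The sketch below shows that lookup on a toy vocabulary with the same two columns as model$vocabulary. The toy data frame is a stand-in assumed for illustration (its first four rows follow the str() output above, and the row for id 4 is inferred from the encoding output); with a trained model you would use model$vocabulary instead.

```r
# Toy stand-in for model$vocabulary (hypothetical; same columns: id, subword).
vocabulary <- data.frame(
  id = 0:4,
  subword = c("<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁"),
  stringsAsFactors = FALSE
)

# Ids 4 and 1 are the encodings of "▁" and "&" in the first example sentence.
ids <- c(4L, 1L)

# Map each id back to its subword by matching on the id column.
subwords <- vocabulary$subword[match(ids, vocabulary$id)]
print(subwords)
```

Because id 1 maps to "<UNK>", the original character cannot be recovered when decoding.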

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be
