BPE trainer ignoring special tokens. #1616

Open
henrycharlesworth opened this issue Aug 16, 2024 · 3 comments · May be fixed by #1617

henrycharlesworth commented Aug 16, 2024

I am trying to train a custom tokenizer. My use case involves assembly code, so I want merges to be possible across full instructions (potentially multiple "words"). To do this, I replace all spaces with a dummy token (e.g. "<space>") and use a pre-tokenizer that splits on "\n". This basically works, but my issue comes when I try to add special tokens. The following is a simple example that reproduces the issue:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip


corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(special_tokens)

tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])

trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")

An example segment of the corpus I am using for training looks something like this:

lea rsi,<code_addr_1> <string_literal><disasm_string_0></string_literal> <eoi>
mov edi, eax <eoi>
call <external>::<function_name><disasm_function_1></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_2></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rax, qword ptr [rax]<unk_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_3></function_name> <eoi>

So the aim is to ensure that e.g. <disasm_function_1> is always a single token. This works at test time (i.e. these special tokens are always tokenized as single tokens), but it is clearly not happening during BPE training. If I examine the tokens/merges I get out, many of them contain the special tokens within them, e.g. from the resulting JSON file:

"</return_val><space><calling_conv>stdcall</calling_conv><func_name><disasm_function_0></func_name><parameters>(": 370,
      "pop<space>r1": 371,
      "call<space><external>::<function_name><disasm_function_2></function_name><space><eoi>": 372,

You can see that these learned tokens contain the special tokens embedded within them.

Is this expected behaviour? My assumption was that the BPE trainer would prevent this from happening (I provide it with a list of the special tokens; why else would it need this argument?). It is also not desirable to fill up the vocab with lots of merges that are never going to be valid.

Is there any way to stop this from happening (or is there something I haven't set up properly)?
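
For reference, this is roughly how I am checking whether the trained vocab contains these cross-boundary merges (a minimal sketch, assuming the tokenizer and special_tokens variables from the training script above):

# Rough check: list learned tokens that embed a special token as a substring.
# Assumes `tokenizer` and `special_tokens` from the training script above.
vocab = tokenizer.get_vocab(with_added_tokens=False)
contaminated = [
    tok for tok in vocab
    if any(sp in tok and tok != sp for sp in special_tokens)
]
print(len(contaminated), contaminated[:5])

Anything that shows up in this list is a learned token whose merges crossed a special-token boundary.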

EDIT:

My current horrible workaround is to do:

tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
] + [Split(tok, behavior="isolated") for tok in special_tokens])

which seems to work, but can't be the best way.
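
To sanity-check the workaround, I just encode one of the corpus lines and confirm the special token comes back as a single token (a rough sketch reusing the trained tokenizer):

# Sanity check: the special token should survive encoding as a single token.
enc = tokenizer.encode("call <external>::<function_name><disasm_function_1></function_name> <eoi>")
print(enc.tokens)  # "<disasm_function_1>" should appear as one token here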

ArthurZucker (Collaborator) commented:

Hey! You are adding the tokens before initializing the normalizer; this worked for me:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip


corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
# tokenizer.add_special_tokens(special_tokens)  # removed from here

tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])
tokenizer.add_special_tokens(special_tokens)  # moved here, after the normalizer/pre-tokenizer are set
trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")
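
If it helps, a quick way to confirm the special tokens were actually registered after this reordering (a small sketch, separate from the training script itself):

# Each special token should already have an id before training starts.
print(tokenizer.token_to_id("<disasm_function_0>"))  # should not be None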

henrycharlesworth (Author) commented:

So I tried this and for me it still gives exactly the same result. It works at test time (as did the previous version), but during training it is still merging across the special tokens.

ArthurZucker linked a pull request Aug 19, 2024 that will close this issue.
ArthurZucker (Collaborator) commented:

You are right, sorry. Here is a PR with a fix; not sure why we never had that.
