BPE trainer ignoring special tokens. #1616

Open
henrycharlesworth opened this issue Aug 16, 2024 · 3 comments · May be fixed by #1617

henrycharlesworth commented Aug 16, 2024

I am trying to train a custom tokenizer. My use case involves assembly code, so I want merges to be possible across full instructions (potentially multiple "words"). To do this, I replace all spaces with a dummy token (e.g. "<space>") and use a pre-tokenizer that splits on "\n". This basically works, but my issue comes when I try to add special tokens. The following is a simple example that reproduces the issue:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip


corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(special_tokens)

tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])

trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")

An example segment of the corpus I am using for training looks something like this:

lea rsi,<code_addr_1> <string_literal><disasm_string_0></string_literal> <eoi>
mov edi, eax <eoi>
call <external>::<function_name><disasm_function_1></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_2></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rax, qword ptr [rax]<unk_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_3></function_name> <eoi>

So the aim is to ensure that e.g. <disasm_function_1> is always a single token. This works at test time (i.e. these special tokens are always tokenized as single tokens), but it is clearly not happening during BPE training. If I examine the tokens/merges I get out, many of them contain the special tokens within them, e.g. from the resulting JSON file:

"</return_val><space><calling_conv>stdcall</calling_conv><func_name><disasm_function_0></func_name><parameters>(": 370,
      "pop<space>r1": 371,
      "call<space><external>::<function_name><disasm_function_2></function_name><space><eoi>": 372,

You can see that these learned tokens contain the special tokens embedded within them.

Is this expected behaviour? My assumption was that the BPE trainer would prevent this from happening (I provide it with a list of the special tokens; why else would it need this argument?). It is also not desirable to fill up the vocab with lots of merges that are never going to be valid.

Is there any way to stop this from happening (or is there something I haven't set up properly)?
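
For reference, this is roughly how I am checking whether the trained vocab contains these cross-boundary merges (a minimal sketch, assuming the tokenizer and special_tokens variables from the training script above):

# Rough check: list learned tokens that embed a special token as a substring.
# Assumes `tokenizer` and `special_tokens` from the training script above.
vocab = tokenizer.get_vocab(with_added_tokens=False)
contaminated = [
    tok for tok in vocab
    if any(sp in tok and tok != sp for sp in special_tokens)
]
print(len(contaminated), contaminated[:5])

Anything that shows up in this list is a learned token whose merges crossed a special-token boundary.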

EDIT:

My current horrible workaround is to do:

tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
] + [Split(tok, behavior="isolated") for tok in special_tokens])

which seems to work, but can't be the best way.
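
To sanity-check the workaround, I just encode one of the corpus lines and confirm the special token comes back as a single token (a rough sketch reusing the trained tokenizer):

# Sanity check: the special token should survive encoding as a single token.
enc = tokenizer.encode("call <external>::<function_name><disasm_function_1></function_name> <eoi>")
print(enc.tokens)  # "<disasm_function_1>" should appear as one token here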

ArthurZucker (Collaborator) commented:

Hey! You are adding the tokens before initializing the normalizer; this worked for me:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip


corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
# tokenizer.add_special_tokens(special_tokens)  # removed from here

tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])
tokenizer.add_special_tokens(special_tokens)  # moved here, after the normalizer/pre-tokenizer are set
trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")
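
If it helps, a quick way to confirm the special tokens were actually registered after this reordering (a small sketch, separate from the training script itself):

# Each special token should already have an id before training starts.
print(tokenizer.token_to_id("<disasm_function_0>"))  # should not be None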

henrycharlesworth (Author) commented:

So I tried this and for me it still gives exactly the same result. It works at test time (as did the previous version), but during training it is still merging across the special tokens.

ArthurZucker linked a pull request Aug 19, 2024 that will close this issue.
ArthurZucker (Collaborator) commented:

You are right, sorry. Here is a PR with a fix; not sure why we never had that.
