Add the ability to serialize custom Python components #581

Closed
n1t0 opened this issue Jan 6, 2021 · 7 comments
Labels
enhancement (New feature or request), python (Issue related to the python binding)

Comments

@n1t0
Member

n1t0 commented Jan 6, 2021

It is currently impossible to serialize custom Python components, so if a Tokenizer embeds some of them, the user can't save it.

I haven't really dug into this, so I don't know exactly what the constraints/requirements would be, but this is something we should explore at some point.

n1t0 added the enhancement and python labels on Jan 6, 2021
@ibraheem-moosa

ibraheem-moosa commented Feb 13, 2022

This is a useful feature. We can probably serialize Python objects using pickle or dill. However, the serialization code is in Rust. Is it possible to serialize the custom Python components with pickle?
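For reference, the Python side of that suggestion is simple enough; this is only a sketch with a hypothetical custom component, and the JSON constraint on the Rust side is discussed in the next comment:

import pickle

class CustomPreTokenizer:  # hypothetical custom component
    def __init__(self, k: int):
        self.k = k

    def pre_tokenize(self, pretok):
        ...  # custom splitting logic would go here

# The Python object itself pickles fine...
blob = pickle.dumps(CustomPreTokenizer(k=4))
restored = pickle.loads(blob)

# ...but tokenizer.save() writes JSON, so the pickled bytes would have to live
# alongside the tokenizer file rather than inside it.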

@Narsil
Collaborator

Narsil commented Feb 14, 2022

The end result has to be saved as JSON, so I don't think it's doable.
Also, pickle is highly unsafe and not portable (despite being widely used).

Currently the workaround is to override the component before saving, and set it back after loading:

# The tokenizer currently has a custom, non-serializable pre-tokenizer
tokenizer.pre_tokenizer = Custom()
# Swap in a serializable built-in before saving
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.save("tok.json")

## Load later
tokenizer = Tokenizer.from_file("tok.json")
# Re-attach the custom component
tokenizer.pre_tokenizer = Custom()

It is a bit inconvenient but at least it's safe and portable.
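A minimal sketch of that pattern wrapped into a pair of helpers (the helper names are hypothetical, and it assumes the custom component can simply be re-instantiated on load):

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace

def save_tokenizer(tokenizer, path):
    # Swap in a serializable built-in pre-tokenizer before writing the JSON file
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.save(path)

def load_tokenizer(path, custom_pre_tokenizer):
    # Load the JSON file, then re-attach the non-serializable Python component
    tokenizer = Tokenizer.from_file(path)
    tokenizer.pre_tokenizer = PreTokenizer.custom(custom_pre_tokenizer)
    return tokenizer

# Usage (Custom() being the hypothetical custom pre-tokenizer from the snippet above):
# save_tokenizer(tokenizer, "tok.json")
# tokenizer = load_tokenizer("tok.json", Custom())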

@cceyda

cceyda commented Apr 6, 2023

You also can't load it as a PreTrainedTokenizerFast if you have a custom component.

from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

As a workaround I do:

from transformers import PreTrainedTokenizerFast
from tokenizers.pre_tokenizers import PreTokenizer

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())

but overriding things through the private _tokenizer attribute may be unpredictably problematic.
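If reaching into the private attribute is a concern, a small variation of the same workaround (assuming a transformers version that exposes the public backend_tokenizer property, and reusing the hypothetical CustomPreTokenizer) would be:

from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# backend_tokenizer returns the underlying tokenizers.Tokenizer object,
# so the custom component can be re-attached without touching _tokenizer directly
fast_tokenizer.backend_tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())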

@Narsil
Collaborator

Narsil commented Apr 7, 2023

Totally understandable.

What kind of pre-tokenizer are you saving?
If some building blocks are missing, we could add them to make the thing more composable/portable/shareable.

@luvwinnie

Is it now possible to save a custom pre-tokenizer?

@Narsil
Collaborator

Narsil commented Aug 28, 2023

No. Custom components are Python code, which is not serializable by nature.

@millanp95

millanp95 commented Oct 2, 2024

Totally understandable.

What kind of pre-tokenizer are you saving? If some building blocks are missing, we could add them to make the thing more composable/portable/shareable.

Hi @Narsil, k-mer tokenization is used in many bioinformatics applications. Right now I am doing the following to define my tokenizer and to save and load my model, which I now know is not ideal. I was wondering if there is a way to use serializable building blocks so the tokenizer can be saved/loaded like any other HF tokenizer. Thank you.

from itertools import product
from typing import List

from tokenizers import Tokenizer, PreTokenizedString, NormalizedString
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace
from tokenizers.models import WordLevel

# Define the pre-tokenization step (just split the string into chunks of size k)
class KmerPreTokenizer:
    def __init__(self, k: int, stride=None):
        self.k = k
        self.stride = k if not stride else stride

    def split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        seq = normalized.original
        # Slice the NormalizedString into (possibly overlapping) k-mers
        return [normalized[start:start + self.k]
                for start in range(0, len(seq) - self.k + 1, self.stride)]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.split)

class KmerDecoder:
    def decode(self, tokens: List[str]) -> str:
        return "".join(tokens)


# Build the vocabulary: k-mers without "N" first, then those containing "N"
k = 4
good_kmers = []
bad_kmers = []
kmers = [''.join(kmer) for kmer in product('ACGTN', repeat=k)]
for kmer in kmers:
    if "N" in kmer:
        bad_kmers.append(kmer)
    else:
        good_kmers.append(kmer)

kmers = good_kmers + bad_kmers
vocab = {word: i for i, word in enumerate(kmers)}
vocab["[UNK]"] = len(vocab)  # the WordLevel model below expects its unk_token in the vocab


# Use the vocab and the pre-tokenizer to get a customized k-mer tokenizer:
# create a WordLevel model from the vocabulary dict
tok = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tok.pre_tokenizer = PreTokenizer.custom(KmerPreTokenizer(k))
# tok.decoder = Decoder.custom(KmerDecoder())

# Optional: train the tokenizer (to add more tokens or further refine it)
# trainer = WordLevelTrainer(special_tokens=["<MASK>", "<CLS>", "[UNK]"])
# tok.train_from_iterator(kmer_iter, trainer)

# Saving directly fails because of the custom pre-tokenizer:
# tok.save("path/to/tokenizer.json")

sample = "ACGCGCGCGTGGAGCGCGATCGACTTT"
print("PreTokenize:", sample)
print(tok.pre_tokenizer.pre_tokenize_str(sample))


# Save the tokenizer: swap in a serializable pre-tokenizer first (the workaround above)
from transformers import PreTrainedTokenizerFast
tok.pre_tokenizer = Whitespace()
new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tok)
# new_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(KmerPreTokenizer(k))
# Save the fast tokenizer
new_tokenizer.save_pretrained("tokenizers")

# Load the tokenizer and re-attach the custom pre-tokenizer
from transformers import AutoTokenizer

loaded_tokenizer = AutoTokenizer.from_pretrained("tokenizers")
loaded_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(KmerPreTokenizer(k))
# Test the loaded tokenizer
input_text = "ACGCGCGCGTGGAGCGCGATCGACNTTTT"
print(loaded_tokenizer.tokenize(input_text))
print(loaded_tokenizer(input_text))
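For the non-overlapping case (stride == k), one fully serializable option might be the built-in Split pre-tokenizer with a regex that matches fixed-size chunks. This is only a sketch reusing the vocab built above; it does not cover overlapping k-mers (stride < k):

from tokenizers import Tokenizer, Regex
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

k = 4
serializable_tok = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# ".{4}" matches consecutive 4-character chunks; "isolated" keeps each match as its own split
serializable_tok.pre_tokenizer = Split(Regex(f".{{{k}}}"), behavior="isolated")

print(serializable_tok.pre_tokenizer.pre_tokenize_str("ACGCGCGC"))
# expected: [('ACGC', (0, 4)), ('GCGC', (4, 8))]

# Because every component is a built-in, the normal save/load round-trip works:
serializable_tok.save("kmer_tok.json")
reloaded = Tokenizer.from_file("kmer_tok.json")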
