Add the ability to serialize custom Python components #581

Closed
n1t0 opened this issue Jan 6, 2021 · 7 comments
Labels
enhancement (New feature or request), python (Issue related to the python binding)

Comments

@n1t0
Member

n1t0 commented Jan 6, 2021

It is currently impossible to serialize custom Python components, so if a Tokenizer embeds some of them, the user can't save it.

I haven't really dug into this, so I don't know exactly what the constraints/requirements would be, but this is something we should explore at some point.

n1t0 added the enhancement and python labels on Jan 6, 2021
@ibraheem-moosa

ibraheem-moosa commented Feb 13, 2022

This is a useful feature. We can probably serialize Python objects using pickle or dill. However, the serialization code is in Rust. Is it possible to serialize the custom Python components with pickle?
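For reference, the Python side of that suggestion is simple enough; this is only a sketch with a hypothetical custom component, and the JSON constraint on the Rust side is discussed in the next comment:

import pickle

class CustomPreTokenizer:  # hypothetical custom component
    def __init__(self, k: int):
        self.k = k

    def pre_tokenize(self, pretok):
        ...  # custom splitting logic would go here

# The Python object itself pickles fine...
blob = pickle.dumps(CustomPreTokenizer(k=4))
restored = pickle.loads(blob)

# ...but tokenizer.save() writes JSON, so the pickled bytes would have to live
# alongside the tokenizer file rather than inside it.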

@Narsil
Collaborator

Narsil commented Feb 14, 2022

The end result has to be saved as JSON, so I don't think it's doable.
Also, pickle is highly unsafe and not portable (despite being widely used).

Currently the workaround is to override the component before saving, and set it back after loading:

# The tokenizer currently has a custom, non-serializable pre-tokenizer
tokenizer.pre_tokenizer = Custom()
# Swap in a serializable built-in before saving
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.save("tok.json")

## Load later
tokenizer = Tokenizer.from_file("tok.json")
# Re-attach the custom component
tokenizer.pre_tokenizer = Custom()

It is a bit inconvenient but at least it's safe and portable.
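A minimal sketch of that pattern wrapped into a pair of helpers (the helper names are hypothetical, and it assumes the custom component can simply be re-instantiated on load):

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace

def save_tokenizer(tokenizer, path):
    # Swap in a serializable built-in pre-tokenizer before writing the JSON file
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.save(path)

def load_tokenizer(path, custom_pre_tokenizer):
    # Load the JSON file, then re-attach the non-serializable Python component
    tokenizer = Tokenizer.from_file(path)
    tokenizer.pre_tokenizer = PreTokenizer.custom(custom_pre_tokenizer)
    return tokenizer

# Usage (Custom() being the hypothetical custom pre-tokenizer from the snippet above):
# save_tokenizer(tokenizer, "tok.json")
# tokenizer = load_tokenizer("tok.json", Custom())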

@cceyda

cceyda commented Apr 6, 2023

You also can't load it as a PreTrainedTokenizerFast if you have a custom component.

from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

As a workaround I do:

from transformers import PreTrainedTokenizerFast
from tokenizers.pre_tokenizers import PreTokenizer

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())

but overriding things through the private _tokenizer attribute may be unpredictably problematic.
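If reaching into the private attribute is a concern, a small variation of the same workaround (assuming a transformers version that exposes the public backend_tokenizer property, and reusing the hypothetical CustomPreTokenizer) would be:

from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# backend_tokenizer returns the underlying tokenizers.Tokenizer object,
# so the custom component can be re-attached without touching _tokenizer directly
fast_tokenizer.backend_tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())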

@Narsil
Collaborator

Narsil commented Apr 7, 2023

Totally understandable.

What kind of pre-tokenizer are you saving?
If some building blocks are missing, we could add them to make the thing more composable/portable/shareable.

@luvwinnie

Is it now possible to save a custom pre-tokenizer?

@Narsil
Collaborator

Narsil commented Aug 28, 2023

No. Custom components are Python code, which is not serializable by nature.

@millanp95

millanp95 commented Oct 2, 2024

Totally understandable.

What kind of pre-tokenizer are you saving? If some building blocks are missing, we could add them to make the thing more composable/portable/shareable.

Hi @Narsil, k-mer tokenization is used in many bioinformatics applications. Right now I am doing the following to define my tokenizer and to save and load my model, which I now know is not ideal. I was wondering if there is a way to use serializable building blocks so the tokenizer can be saved/loaded like any other HF tokenizer. Thank you.

from itertools import product
from typing import List

from tokenizers import Tokenizer, PreTokenizedString, NormalizedString
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace
from tokenizers.models import WordLevel

# Define the pre-tokenization step (just split the string into chunks of size k)
class KmerPreTokenizer:
    def __init__(self, k: int, stride=None):
        self.k = k
        self.stride = k if not stride else stride

    def split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        seq = normalized.original
        # Slice the NormalizedString into (possibly overlapping) k-mers
        return [normalized[start:start + self.k]
                for start in range(0, len(seq) - self.k + 1, self.stride)]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.split)

class KmerDecoder:
    def decode(self, tokens: List[str]) -> str:
        return "".join(tokens)


# Build the vocabulary: k-mers without "N" first, then those containing "N"
k = 4
good_kmers = []
bad_kmers = []
kmers = [''.join(kmer) for kmer in product('ACGTN', repeat=k)]
for kmer in kmers:
    if "N" in kmer:
        bad_kmers.append(kmer)
    else:
        good_kmers.append(kmer)

kmers = good_kmers + bad_kmers
vocab = {word: i for i, word in enumerate(kmers)}
vocab["[UNK]"] = len(vocab)  # the WordLevel model below expects its unk_token in the vocab


# Use the vocab and the pre-tokenizer to get a customized k-mer tokenizer:
# create a WordLevel model from the vocabulary dict
tok = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tok.pre_tokenizer = PreTokenizer.custom(KmerPreTokenizer(k))
# tok.decoder = Decoder.custom(KmerDecoder())

# Optional: train the tokenizer (to add more tokens or further refine it)
# trainer = WordLevelTrainer(special_tokens=["<MASK>", "<CLS>", "[UNK]"])
# tok.train_from_iterator(kmer_iter, trainer)

# Saving directly fails because of the custom pre-tokenizer:
# tok.save("path/to/tokenizer.json")

sample = "ACGCGCGCGTGGAGCGCGATCGACTTT"
print("PreTokenize:", sample)
print(tok.pre_tokenizer.pre_tokenize_str(sample))


# Save the tokenizer: swap in a serializable pre-tokenizer first (the workaround above)
from transformers import PreTrainedTokenizerFast
tok.pre_tokenizer = Whitespace()
new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tok)
# new_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(KmerPreTokenizer(k))
# Save the fast tokenizer
new_tokenizer.save_pretrained("tokenizers")

# Load the tokenizer and re-attach the custom pre-tokenizer
from transformers import AutoTokenizer

loaded_tokenizer = AutoTokenizer.from_pretrained("tokenizers")
loaded_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(KmerPreTokenizer(k))
# Test the loaded tokenizer
input_text = "ACGCGCGCGTGGAGCGCGATCGACNTTTT"
print(loaded_tokenizer.tokenize(input_text))
print(loaded_tokenizer(input_text))
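For the non-overlapping case (stride == k), one fully serializable option might be the built-in Split pre-tokenizer with a regex that matches fixed-size chunks. This is only a sketch reusing the vocab built above; it does not cover overlapping k-mers (stride < k):

from tokenizers import Tokenizer, Regex
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

k = 4
serializable_tok = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
# ".{4}" matches consecutive 4-character chunks; "isolated" keeps each match as its own split
serializable_tok.pre_tokenizer = Split(Regex(f".{{{k}}}"), behavior="isolated")

print(serializable_tok.pre_tokenizer.pre_tokenize_str("ACGCGCGC"))
# expected: [('ACGC', (0, 4)), ('GCGC', (4, 8))]

# Because every component is a built-in, the normal save/load round-trip works:
serializable_tok.save("kmer_tok.json")
reloaded = Tokenizer.from_file("kmer_tok.json")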
