Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python version hangs on encode_batch when run in subprocess #187

Closed
kretes opened this issue Mar 4, 2020 · 15 comments
Closed

python version hangs on encode_batch when run in subprocess #187

kretes opened this issue Mar 4, 2020 · 15 comments

Comments

@kretes
Copy link

kretes commented Mar 4, 2020

Hello,

When I run encode_batch with at least two texts in an array in python subprocess - the call hangs.
This happens in real-life when used inside pytorch dataloaders with multiple workers.
A self-contained reproducible script is here: https://gist.github.com/kretes/1a51bb8b936fc4e6277f71931b886bed

@epwalsh
Copy link
Collaborator

epwalsh commented Mar 6, 2020

Hey @kretes, thanks for bringing this up and thanks for the reproducible example. I ran the gist and sure enough I saw the same results.

What I'm guessing is happening (and keep in mind I am no expert in multiprocessing so I could be way off) is that when a tokenizer is used from the main process, the Rust code behind the scenes is creating some resources that it needs for parallel processing (like a thread pool and I don't know what else - this stuff comes from https://github.com/rayon-rs/rayon and I'm not super familiar with it). Then when the process is forked and encode_batch is called, the child process now needs to access those resources, but they are locked by the parent process.

So as long as you don't call encode or encode_batch until after forking, it will work:

from multiprocessing import Process
from tokenizers.implementations import ByteLevelBPETokenizer

def encode():
    tok = ByteLevelBPETokenizer()
    print(tok.encode_batch(['ala']))
    print(tok.encode_batch(['ala', 'kot']))

p = Process(target=encode)
p.start()
p.join()

@ivanjacobsec
Copy link

Encode batch hangs for me in python even when not ran as a subprocess.

@epwalsh
Copy link
Collaborator

epwalsh commented May 2, 2020

@ivanjacobsec can you provide an example?

@ivanjacobsec
Copy link

ivanjacobsec commented May 3, 2020

First off all thank you for the great work and it is a inspring project.

I tried 2 approaches:

   data_path="../data/"
   dest_path='../data/models/'
   model_name='my-tokenizer'
   paths = [str(x) for x in Path(data_path).glob("**/*.txt")]
   from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors

   tokenizer = Tokenizer(BPE.empty())

   tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(MyPretokenizer())
   tokenizer.decoder=decoders.Decoder.custom(MoleculePretokenizer())
   tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(MoleculePretokenizer())
   tokenizer.decoder = decoders.Decoder.custom(MoleculePretokenizer())

   trainer = trainers.BpeTrainer(special_tokens=[ "<mask>",'<pad>'])
   tokenizer.train(trainer, paths)
   tokenizer.model.save(dest_path, model_name)
   print(tokenizer.encode_batch(["string","string string"]).ids))
class MyByteLevelBPETokenizer(BaseTokenizer):
    """ MyByteLevelBPETokenizer

    Represents a Byte-level BPE as introduced by OpenAI with their GPT-2 model
    """

    def __init__(
        self,
        vocab_file: Optional[str] = None,
        merges_file: Optional[str] = None,
        add_prefix_space: bool = False,
        lowercase: bool = False,
        dropout: Optional[float] = None,
        unicode_normalizer: Optional[str] = None,
        continuing_subword_prefix: Optional[str] = None,
        end_of_word_suffix: Optional[str] = None,
    ):
        if vocab_file is not None and merges_file is not None:
            tokenizer = Tokenizer(
                BPE.from_files(
                    vocab_file,
                    merges_file,
                    dropout=dropout,
                    continuing_subword_prefix=continuing_subword_prefix or "",
                    end_of_word_suffix=end_of_word_suffix or "",
                )
            )
        else:
            tokenizer = Tokenizer(BPE.empty())

        # Check for Unicode normalization first (before everything else)
        normalizers = []

        if unicode_normalizer:
            normalizers += [unicode_normalizer_from_str(unicode_normalizer)]

        if lowercase:
            normalizers += [Lowercase()]

        # Create the normalizer structure
        if len(normalizers) > 0:
            if len(normalizers) > 1:
                tokenizer.normalizer = Sequence(normalizers)
            else:
                tokenizer.normalizer = normalizers[0]

        tokenizer.pre_tokenizer =pre_tokenizers.PreTokenizer.custom(MyPretokenizer())
        tokenizer.decoder = decoders.Decoder.custom(MyPretokenizer())

        parameters = {
            "model": "ByteLevelBPE",
            "add_prefix_space": add_prefix_space,
            "lowercase": lowercase,
            "dropout": dropout,
            "unicode_normalizer": unicode_normalizer,
            "continuing_subword_prefix": continuing_subword_prefix,
            "end_of_word_suffix": end_of_word_suffix,
        }

        super().__init__(tokenizer, parameters)

    def train(
        self,
        files: Union[str, List[str]],
        vocab_size: int = 30000,
        min_frequency: int = 2,
        show_progress: bool = True,
        special_tokens: List[str] = [],
    ):
        """ Train the model using the given files """

        trainer = trainers.BpeTrainer(
            special_tokens=special_tokens,
            min_frequency=min_frequency,
            show_progress=show_progress
        )
        if isinstance(files, str):
            files = [files]
        self._tokenizer.train(trainer, files)

class MyPretokenizer(object):
    def __init__(self) -> None:
      
        pass
    def pre_tokenize(self, sequence: str) -> List[Tuple[str, Offsets]]:
      
       return  [(l,(i,i)) for i, l in enumerate(list(sequence.strip()))]

    def decode(self, tokens):
      
        return "".join(tokens)
data_path="../data/"
 dest_path='../data/models/'
 model_name='my-tokenizer'
 paths = [str(x) for x in Path(data_path).glob("**/*.txt")]
 tokenizer=MyByteLevelBPETokenizer()
 tokenizer.train(paths)
 tokenizer.save(dest_path,dodel_name)
 tokenizer.encode_batch(['string string','string string']

With both of them batch_encode hangs.
In both cases the merges.txt is empty. This is still no clear to me why and where to look for that.
I am not quite sure on the correct way to make the offsets havent had the time to read the paper. If you could point me to some python code where it is done allready it would be nice.

Other issues I encountered:

tokenizer = Tokenizer(BPE.empty())
tokenizer.normalizer = Strip()

produces this error:

    tokenizer.normalizer = Strip()
TypeError

tokenizer = Tokenizer(BPE())

produces this error:

  tokenizer = Tokenizer(BPE())
TypeError: cannot create 'BPE' instances

I would really like to see a full working example in python with custom pre-tokenizer,normalizer and decoder.

@BlueProphet
Copy link

BlueProphet commented May 28, 2020

I think I am having the same issue so I am not going to make a new one. But this really puzzled me for a long time.

Basically when I test the tokenizer and then try to use it in a torch data loader I get a freeze on the call to encode in data loader.

Here is a simple example

import torch
import tokenizers

#Just making a test tokenizer call even in a new class 
#will ruin the tokenizer when used later in torch data_loader
#TOKENIZER = transformers.RobertaTokenizerFast(
TEST_TOKENIZER = tokenizers.ByteLevelBPETokenizer(
    vocab_file=f"roberta-base/vocab.json", 
    merges_file=f"roberta-base/merges.txt", 
    lowercase=True,
    add_prefix_space=True
)

test = TEST_TOKENIZER.encode("this is a test.")

TOKENIZER = tokenizers.ByteLevelBPETokenizer(
    vocab_file=f"roberta-base/vocab.json", 
    merges_file=f"roberta-base/merges.txt", 
    lowercase=True,
    add_prefix_space=True
)

class Dataset:
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = TOKENIZER
    
    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        #This call freezes when test is run above.
        data = self.tokenizer.encode(self.texts[item])
        return data

texts = ['This is another test from data_loader', 
    'This is the second test from data_loader']
dataset = Dataset(texts=texts)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=1)
for d in data_loader:
    print (d)

@n1t0
Copy link
Member

n1t0 commented May 29, 2020

I spent quite some time digging this to try and understand what happens exactly.

@epwalsh got it right: When the process gets forked, we end up with two (or more) different processes, with different memory spaces, but the exact same content in memory (at the time of the fork). Now, this shouldn't be a problem in most cases, but sometimes the locks we use to support multi-threading contain some state to operate properly. When this state does not make sense in the new context (in the new process), this lock might get impossible to unlock. (cf this discussion on stack overflow for more info, along with this example)

There are two different places that seem to cause such behavior:

  • The lazy_static used in pre_tokenizers/byte_level.rs uses a std::sync:once type of lock that apparently doesn't support the fork. This is what happens in the example shown by @BlueProphet. Even if we only use encode here, the pre-tokenizer gets stuck when trying to access the lazy static properties.
  • The encode_batch, train, and any other methods that make use of rayon to parallelize the workload.

Now, this is actually quite easy to fix, with only one rule to follow: do not use a tokenizer in the main process if you want to use them with multiprocessing. I think in most cases, this is actually a best practice that should be followed regardless of this problem.

By following this rule, the above snippet will look like:

import torch
import tokenizers

class Dataset:
    def __init__(self, texts):
        self.texts = texts
        self.tokenizer = tokenizers.ByteLevelBPETokenizer(
            vocab_file=f"roberta-base/vocab.json", 
            merges_file=f"roberta-base/merges.txt", 
            lowercase=True,
            add_prefix_space=True
        )
    
    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        data = self.tokenizer.encode(self.texts[item])
        return data.ids

texts = [
    'This is another test from data_loader', 
    'This is the second test from data_loader'
]
dataset = Dataset(texts=texts)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=1)
for d in data_loader:
    print (d)

Any features of the tokenizers can be used by following this setup.

@BlueProphet
Copy link

Thanks for the reply.

But your code doesn't change the bug. You still get a freeze if you use any tokenizer outside the loader.

This is not an issue in normal code if you know this can happen. I experienced this working in a notebook and testing some code before running the loader. And thinking it was a tokenizer thread issue was not the first thing that came to mind.

I do not know Rust so I cannot speak to getting this fixed. Seems like a major bug but if you already know about it, then the workaround is easy.

@psinger
Copy link

psinger commented May 31, 2020

I have the same bug and only have it since upgrading from 0.0.5 to 0.0.7. Basically whenever I use encode within a dataset and use num_workers>0 in a dataloader, I end up in a deadlock. This appears to be a critical bug to me.

It only also happens if there are more than one word in the sequence to be tokenizes. So to me it seems that tokenizers is doing some internal multiprocessing which is also imminent from observing CPU usage, and this interferes with any other outside multiprocessing, e.g., on data loader level.

I am especially curious what changes you did from 0.0.5 to 0.0.7 that started to cause this. Is there a way to disable any internal multithreading/multiprocessing you are doing? This bug is causing me quite some issues.

@psinger
Copy link

psinger commented Jun 11, 2020

@n1t0 Any idea if and when this will be fixed? I need to go back to older versions or other alternatives otherwise.

@n1t0
Copy link
Member

n1t0 commented Jun 11, 2020

@psinger If you have an example of what you are saying, that with an older version it works, I'd love to see that. It could help in the process of debugging.

I'd love to be able to fix this, the thing is, I have no idea how to fix it right now. I found a huge number of articles (a good example here) online about the problems related to Python multiprocessing while using fork. Everybody seems to agree on one point: fork and multithreading don't mix well.
This is something that has to do with the way fork works and has nothing to do with any specific language. And you can even find it in the Python documentation: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

fork
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.

So, I agree it seems like a huge bug, and I'd love to be able to fix it, but I don't see any reasonable way to do this right now. Any help would be greatly appreciated! 😃

@epwalsh
Copy link
Collaborator

epwalsh commented Jun 12, 2020

@n1t0 I've been thinking a lot about this as well because we'd like to be able to load and tokenize data in multiple processes in AllenNLP, but at the same time need the tokenizers in the main process.

One thought I had would be to just disable using rayon adapters based on an environment or something. To keep this from getting unwieldy, we could have some wrapper trait with our own implementation of par_iter that would handle this logic, i.e. checking for the environment variable, then deciding whether to use rayon's par_iter or not.

I might have time to look into this next week.

@psinger
Copy link

psinger commented Jun 13, 2020

@n1t0 I am too unaware what multiprocessing/multithreading is being done within the Rust routines. Is there a chance to enable an option to disable any multithreading within the tokenizer?

I only encountered this bug after upgrading from 0.0.5 to 0.0.7. Before I could use any number of workers in the data loaders without having any deadlocks.

@n1t0
Copy link
Member

n1t0 commented Jun 15, 2020

Sounds good @epwalsh! That's a great idea! I suspect there might be some other roadblocks but it should cover the largest source of locks use. Plus, this is something that we will need anyway, for example, to allow using custom Python parts (PreTokenizer, Normalizer, ...) where the GIL won't let us do multithreading anyway. Let me know if I can be of any help!

@epwalsh
Copy link
Collaborator

epwalsh commented Jun 15, 2020

Ok, I've got a solution. PR coming soon!

@n1t0
Copy link
Member

n1t0 commented Jun 29, 2020

This should now be fixed with the 0.8.0 version that was recently released.

@psinger @BlueProphet We added an environment variable that allows us to disable the parallelism (TOKENIZERS_PARALLELISM=false), but it should detect that the process has been forked and prevent any deadlock. So this should avoid a lot of frustration in the future. Thank you for your patience!

And thank you very much @epwalsh for all your help on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants