RobertaTokenizerFast unexpectedly quits when creating a TextDataset #5904
This seems to work for me. I guess it crashes because you don't have enough memory. Unfortunately [...] Also, maybe the huggingface/nlp library might be better suited here. cc @lhoestq
You could try:

```python
from transformers import AutoTokenizer
from nlp import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
dataset = load_dataset("text", data_files="path/to/wiki.train.raw", split="train")
tokenized_dataset = dataset.map(lambda ex: tokenizer(ex["text"]), batched=True)
print(tokenized_dataset[0]["input_ids"])
```

We're still working on making it as fast as we can, but at least you won't have any memory issues.
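For readers wondering why a batched approach sidesteps the memory problem, here is a minimal sketch of the general idea, using only the standard library. It is not how nlp is implemented; it only assumes that reading the file lazily, tokenizing a fixed-size batch at a time, and writing each result to disk keeps peak memory proportional to the batch size rather than to the file size. The names `iter_batches`, `toy_tokenize`, and `tokenize_to_disk` are hypothetical, and `toy_tokenize` is a whitespace splitter standing in for a real tokenizer.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size items without materializing the input."""
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def toy_tokenize(batch: List[str]) -> List[List[str]]:
    # Hypothetical stand-in for a real tokenizer: plain whitespace splitting.
    return [line.split() for line in batch]


def tokenize_to_disk(
    src_path: str,
    dst_path: str,
    tokenize: Callable[[List[str]], List[List[str]]] = toy_tokenize,
    batch_size: int = 1000,
) -> int:
    """Tokenize src_path batch by batch, writing one JSON line per input line.

    Only one batch is held in memory at a time. Returns the number of lines written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # A file object is a lazy line iterator, so the file is never fully loaded.
        for batch in iter_batches(src, batch_size):
            for tokens in tokenize(batch):
                dst.write(json.dumps(tokens) + "\n")
                written += 1
    return written
```

Calling `tokenize_to_disk("path/to/wiki.train.raw", "tokens.jsonl")` (the output file name is made up for illustration) would process the file in chunks of 1000 lines. Building the whole list of tokenized examples in RAM before using it is the pattern this avoids.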
Re @n1t0's comment, "I guess it crashes because you don't have enough memory": this is correct. (I was hoping I could get away with 61.0 GiB, the standard for an AWS [...]

Re @lhoestq: your code ran without errors for me. Thanks! I did get a lot of the [...]
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
When creating a TextDataset using RobertaTokenizerFast my program unexpectedly dies. (Not so with RobertaTokenizer.)
Information
Model I am using: RoBERTa
Language I am using the model on: English
The problem arises when using:
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
Expected behavior
Creation of the training dataset, not having the process killed. eg:
Environment info
transformers version: 2.11.0