
RobertaTokenizerFast unexpectedly quits when creating a TextDataset #5904

Closed · 2 of 4 tasks
josiahdavis opened this issue Jul 20, 2020 · 5 comments

@josiahdavis

🐛 Bug

When creating a TextDataset using RobertaTokenizerFast, my program unexpectedly dies (this does not happen with RobertaTokenizer).

Information

Model I am using: RoBERTa

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: language modelling
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

from transformers import AutoTokenizer, TextDataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="/home/ubuntu/data/wikitext-103-raw/wiki.train.raw",
    block_size=-1,
    overwrite_cache=False,
)
print(train_dataset)

Expected behavior

Creation of the training dataset without the process being killed, e.g.:

<transformers.data.datasets.language_modeling.TextDataset object at 0x7f138a1fd2b0>

Environment info

  • transformers version: 2.11.0
  • Platform: Linux-5.3.0-1030-aws-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.5.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No
@josiahdavis josiahdavis changed the title RoBERTa fast tokenizer unexpectedly quits RobertaTokenizerFast unexpectedly quits when creating a TextDataset Jul 20, 2020
@patil-suraj
Contributor

@n1t0

@n1t0
Member

n1t0 commented Jul 21, 2020

This seems to work for me; I'd guess it crashes because you don't have enough memory. Unfortunately, TextDataset has not been optimized for fast tokenizers yet, so it does a lot more work than needed when using them. It's probably better to use the Python (slow) tokenizers with TextDataset for now.
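For example, a minimal sketch of that workaround, reusing the file path and arguments from the report above:

from transformers import RobertaTokenizer, TextDataset

# Use the Python ("slow") tokenizer, which TextDataset currently handles
# without the extra overhead it incurs with the fast tokenizers.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="/home/ubuntu/data/wikitext-103-raw/wiki.train.raw",
    block_size=-1,
    overwrite_cache=False,
)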

Also, the huggingface/nlp library might be better suited here. cc @lhoestq

@lhoestq
Member

lhoestq commented Jul 21, 2020

You could try

from transformers import AutoTokenizer
from nlp import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
dataset = load_dataset("text", data_files="path/to/wiki.train.raw", split="train")
tokenized_dataset = dataset.map(lambda ex: tokenizer(ex["text"]), batched=True)
print(tokenized_dataset[0]["input_ids"])

We're still working on making it as fast as we can, but at least you won't have any memory issues.

@josiahdavis
Author

josiahdavis commented Jul 29, 2020

Re @n1t0's comment ("I guess it crashes because you don't have enough memory"): this is correct. (I was hoping I could get away with 61.0 GiB, the standard for an AWS p3.2xlarge.)

Re @lhoestq: your code ran without errors for me. Thanks!

I did get a lot of "Token indices sequence length is longer than the specified maximum sequence length for this model (522 > 512). Running this sequence through the model will result in indexing errors" warnings, which I wasn't getting before.
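A minimal sketch of one way to avoid those warnings, assuming a transformers version whose tokenizer call accepts the truncation and max_length arguments:

# Truncate each example to the model's 512-token maximum while mapping,
# so no sequence exceeds what the model can index.
tokenized_dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True,
)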

@stale

stale bot commented Sep 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 27, 2020
@stale stale bot closed this as completed Oct 4, 2020