
RobertaTokenizerFast unexpectedly quits when creating a TextDataset #5904

Closed · 2 of 4 tasks
josiahdavis opened this issue Jul 20, 2020 · 5 comments

@josiahdavis

🐛 Bug

When creating a TextDataset using RobertaTokenizerFast, my program unexpectedly dies (this does not happen with RobertaTokenizer).

Information

Model I am using: RoBERTa

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: language modelling
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

from transformers import AutoTokenizer, TextDataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="/home/ubuntu/data/wikitext-103-raw/wiki.train.raw",
    block_size=-1,
    overwrite_cache=False,
)
print(train_dataset)

Expected behavior

Creation of the training dataset without the process being killed, e.g.:

<transformers.data.datasets.language_modeling.TextDataset object at 0x7f138a1fd2b0>

Environment info

  • transformers version: 2.11.0
  • Platform: Linux-5.3.0-1030-aws-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.5.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No
@josiahdavis josiahdavis changed the title RoBERTa fast tokenizer unexpectedly quits RobertaTokenizerFast unexpectedly quits when creating a TextDataset Jul 20, 2020
@patil-suraj
Contributor

@n1t0

@n1t0
Member

n1t0 commented Jul 21, 2020

This seems to work for me; I'd guess it crashes because you don't have enough memory. Unfortunately, TextDataset has not been optimized for fast tokenizers yet, so it does a lot more work than needed when using them. It's probably better to use the Python (slow) tokenizers with TextDataset for now.
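For example, a minimal sketch of that workaround, reusing the file path and arguments from the report above:

from transformers import RobertaTokenizer, TextDataset

# Use the Python ("slow") tokenizer, which TextDataset currently handles
# without the extra overhead it incurs with the fast tokenizers.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="/home/ubuntu/data/wikitext-103-raw/wiki.train.raw",
    block_size=-1,
    overwrite_cache=False,
)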

Also, the huggingface/nlp library might be better suited here. cc @lhoestq

@lhoestq
Member

lhoestq commented Jul 21, 2020

You could try

from transformers import AutoTokenizer
from nlp import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
dataset = load_dataset("text", data_files="path/to/wiki.train.raw", split="train")
tokenized_dataset = dataset.map(lambda ex: tokenizer(ex["text"]), batched=True)
print(tokenized_dataset[0]["input_ids"])

We're still working on making it as fast as we can, but at least you won't have any memory issues.

@josiahdavis
Author

josiahdavis commented Jul 29, 2020

Re @n1t0's comment ("I guess it crashes because you don't have enough memory"): this is correct. (I was hoping I could get away with 61.0 GiB, the standard for an AWS p3.2xlarge.)

Re @lhoestq: your code ran without errors for me. Thanks!

I did get a lot of "Token indices sequence length is longer than the specified maximum sequence length for this model (522 > 512). Running this sequence through the model will result in indexing errors" warnings, which I wasn't getting before.
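A minimal sketch of one way to avoid those warnings, assuming a transformers version whose tokenizer call accepts the truncation and max_length arguments:

# Truncate each example to the model's 512-token maximum while mapping,
# so no sequence exceeds what the model can index.
tokenized_dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True,
)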

@stale

stale bot commented Sep 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 27, 2020
@stale stale bot closed this as completed Oct 4, 2020