
token indices sequence length is longer than the specified maximum sequence length #1791

Closed
cswangjiawei opened this issue Nov 11, 2019 · 36 comments

Comments

@cswangjiawei

❓ Questions & Help

When I use BERT, I get "token indices sequence length is longer than the specified maximum sequence length for this model (1017 > 512)". How can I solve this error?

@LysandreJik
Member

This means you're encoding a sequence that is larger than the max sequence the model can handle (which is 512 tokens). This is not an error but a warning; if you pass that sequence to the model it will crash as it cannot handle such a long sequence.

You can truncate the sequence: seq = seq[:512] or use the max_length tokenizer parameter so that it handles it on its own.
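A minimal sketch of both options, assuming bert-base-uncased (newer tokenizer versions also expect truncation=True next to max_length):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
long_text = "a very long document " * 500

# option 1: truncate the encoded ids yourself
# (note: naive slicing can drop the trailing [SEP] token)
ids = tokenizer.encode(long_text, add_special_tokens=True)
ids = ids[:512]

# option 2: let the tokenizer truncate for you
ids = tokenizer.encode(long_text, add_special_tokens=True, truncation=True, max_length=512)
print(len(ids))  # 512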

@cswangjiawei
Author

Thank you. I truncated the sequence and it worked. But when I use the max_length parameter of the tokenizer's encode method, it does not work.

@LysandreJik
Member

LysandreJik commented Nov 12, 2019

Hi, could you show me how you're using the max_length parameter?

Edit:

The recommended way is to call the tokenizer directly instead of using the encode method:

from transformers import GPT2Tokenizer

text = "This is a sequence"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
x = tokenizer(text, truncation=True, max_length=2)

print(len(x["input_ids"]))  # 2

Previous answer:

If you use it as such it should truncate your sequences:

from transformers import GPT2Tokenizer

text = "This is a sequence"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
x = tokenizer.encode(text, max_length=2)

print(len(x))  # 2

@cswangjiawei
Author

cswangjiawei commented Nov 13, 2019

I use max_length as follows:

model_class, tokenizer_class, pretrained_weights = BertModel, BertTokenizer, 'bert-base-uncased'
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

text = "After a morning of Thrift Store hunting, a friend and I were thinking of lunch, and he ... " #this sentence is very long, Its length is more than 512. In order to save space, not all of them are shown here
# text = tokenizer.tokenize(text)
# if len(text) > 512:
#     text = text[:512]
#text = "After a morning of Thrift Store hunting, a friend and I were thinking of lunch"

text = tokenizer.encode(text, add_special_tokens=True, max_length=10)
print(text)
print(len(text))

It works. I previously set max_length to 512 and only printed the encoded list, so I didn't notice that the length had changed. But the warning still occurs:
[screenshot: the "Token indices sequence length is longer than the specified maximum sequence length" warning]

@LysandreJik
Member

Glad it works! Indeed, we should do something about this warning, it shouldn't appear when a max length is specified.

@cswangjiawei
Copy link
Author

Thank you very much!

@LukasMut

LukasMut commented Feb 3, 2020

What if I need the sequence length to be longer than 512 (e.g., to retrieve the answer in a QA model)?

@julien-c
Member

julien-c commented Feb 3, 2020

Hi @LukasMut this question might be better suited to Stack Overflow.

@paulogaspar

I have the same doubt as @LukasMut. Did you open a Stack Overflow question?

@nextjedi

Did you get a solution, @LukasMut @paulogaspar?

@paulogaspar

Not really. All solutions point to using only the 512 tokens and choosing what to place in them (for example, picking which part of the text).

@Kabongosalomon

Having the same issue, @paulogaspar. Any update on this? I'm having sequences with more than 512 tokens.

@paulogaspar

> Having the same issue, @paulogaspar. Any update on this? I'm having sequences with more than 512 tokens.

Take a look at my last answer, that's the point I'm at.

@SeanBannister

Also dealing with this issue, and I thought I'd post what's going through my head; correct me if I'm wrong, but I think the maximum sequence length is determined when the model is first trained? In which case training a model with a larger sequence length would be the solution. And I'm wondering if fine-tuning can be used to increase the sequence length.

@nabinkhadka

Same question. What to do if text is long?

@Ricocotam

That's a research question, guys.

@SeanBannister

SeanBannister commented Jun 2, 2020

This might help people looking for further details: facebookresearch/fairseq#1685 & google-research/bert#27

@ClementViricel

Hi,
The question I have is almost the same.
BERT has some configuration options. As far as I know, the transformer architecture itself is not constrained by sequence length at all.
Can I change the config to have more than 512 tokens?

@LysandreJik
Member

Most transformers are unfortunately completely constrained, which is the case for BERT (512 tokens max).

If you want to use transformers without being limited to a sequence length, you should take a look at Transformer-XL or XLNet.

@vr25

vr25 commented Jun 18, 2020

@LysandreJik

I thought XLNet has a max length of 512 as well.

Transformer-XL is still a mystery to me because it seems like the length is still 512 for downstream tasks, unlike language modeling (pre-training).

Please let me know if my understanding is incorrect.

Thanks!

@LysandreJik
Member

LysandreJik commented Jun 18, 2020

XLNet was pre-trained/fine-tuned with a maximum length of 512, indeed. However, the model is not limited to such a length:

from transformers import XLNetLMHeadModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

encoded = tokenizer.encode_plus("Alright, let's do this" * 500, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([1, 3503])
print(model(**encoded)[0].shape)  # torch.Size([1, 3503, 32000])

The model is not limited to a specific length because it doesn't leverage absolute positional embeddings, instead leveraging the same relative positional embeddings that Transformer-XL used. Please note that since the model isn't trained on sequences larger than 512, no results are guaranteed on larger sequences, even if the model can still handle them.

@cheery

cheery commented Oct 9, 2020

I was going to try this out, but after reading this a few times now, I still have no idea how I'm supposed to truncate the token stream for the pipeline.

I got some results by combining @cswangjiawei's advice of running the tokenizer, but it returns a truncated sequence that is slightly longer than the limit I set.

Otherwise the results are good, although they come out slowly and I may have to figure out how to activate CUDA in PyTorch.

Update: there is an article that shows how to run the summarizer on large texts; I got it working with this one: https://www.thepythoncode.com/article/text-summarization-using-huggingface-transformers-python
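For the pipeline itself, one approach that appears to work on recent transformers versions is to ask it to truncate at call time; a rough sketch (the checkpoint and lengths are only placeholders):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = "This is a very long document. " * 600  # far beyond the model's limit

# truncation=True is forwarded to the tokenizer, so the input is cut down
# to the model's maximum length instead of crashing
summary = summarizer(long_text, truncation=True, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])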

@KarteekMenda93

> What if I need the sequence length to be longer than 512 (e.g., to retrieve the answer in a QA model)?

You can go for BigBird, as it takes an input of 4096 tokens (but can take up to 16K).

@nasib-ullah

Let me help with what I have understood; correct me if I am wrong. The reason you can't use a sequence length longer than max_length is the positional encoding. Let's have a look at the positional encoding in the original Transformer paper:
[image: the sinusoidal positional encoding formulas from the Transformer paper]
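For reference, the sinusoidal encoding defined in that paper is (with $pos$ the token position, $i$ the dimension index, and $d_{model}$ the embedding size):

$PE_{(pos,\,2i)} = \sin\big(pos / 10000^{2i/d_{model}}\big)$
$PE_{(pos,\,2i+1)} = \cos\big(pos / 10000^{2i/d_{model}}\big)$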

So pos in the formula is the index of the word, and they set 10000 as the scale to cover the usual length of most sentences. Now, if you look at the visualization of these functions, you will notice that as long as the pos value is less than 10000 we get a unique temporal representation for each word. But once the length is more than 10000, the representation is no longer unique for each word (e.g. the 1st and the 10001st will have the same representation). So if max_length = scale (512 as discussed here) and sequence_length > max_length, the positional encoding will not work.
I didn't check what scale value BERT uses (you can check it yourself), but this may be the reason.


@MajaRolevski

> What if I need the sequence length to be longer than 512 (e.g., to retrieve the answer in a QA model)?

> You can go for BigBird, as it takes an input of 4096 tokens (but can take up to 16K).

The code and weights for BigBird haven't been published yet, am I right?

@KarteekMenda93

> What if I need the sequence length to be longer than 512 (e.g., to retrieve the answer in a QA model)?

> You can go for BigBird, as it takes an input of 4096 tokens (but can take up to 16K).

> The code and weights for BigBird haven't been published yet, am I right?

Yes, and in that case you have Longformer and Reformer, which can handle long sequences.
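As an illustration, a minimal sketch with Longformer, assuming the allenai/longformer-base-4096 checkpoint (which accepts up to 4096 tokens):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

# a toy input well beyond BERT's 512-token limit
encoded = tokenizer("a long document " * 700, truncation=True, max_length=4096, return_tensors="pt")
print(encoded["input_ids"].shape)  # roughly 2000 tokens, no 512 cap

outputs = model(**encoded)
print(outputs.last_hidden_state.shape)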

@etetteh

etetteh commented Dec 9, 2020

My model was pretrained with max_seq_len of 128 and max_posi_embeddings of 512 using the original BERT code release.
I am having the same problem here. I have tried a couple of fixes, but none of them is working for me.

export MAX_LENGTH=120
export MODEL="./bert-1.5M"

python3 preprocess.py ./data/train.txt $MODEL $MAX_LENGTH > train.txt
python3 preprocess.py ./data/dev.txt $MODEL $MAX_LENGTH > dev.txt
python3 preprocess.py ./data/test.txt $MODEL $MAX_LENGTH > test.txt

I am running the run_ner_old.py file.

Can anyone help?

@sugampath

> I use max_length as follows: … (full quote of @cswangjiawei's earlier comment, code, and screenshot)

How can I apply this method to a CSV file? I have a file "data.csv" whose 2nd column contains news text that needs to be passed to BERT with the 512-token limit.
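A minimal sketch of one way to do that, assuming pandas is available, the file is named data.csv, and the news text sits in the second column (adjust these to your data):

import pandas as pd
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

df = pd.read_csv("data.csv")
news_texts = df.iloc[:, 1].astype(str)  # second column

# truncate every article to BERT's 512-token limit
encoded = [
    tokenizer.encode(text, add_special_tokens=True, truncation=True, max_length=512)
    for text in news_texts
]
print(len(encoded[0]))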

@NightMachinery

I am trying to create an arbitrary length text summarizer using Huggingface; should I just partition the input text to the max model length, summarize each part to, say, half its original length, and repeat this procedure as long as necessary to reach the target length for the whole sequence?

It feels to me that this is quite a general problem. Shouldn't this be supported as part of the pipeline API itself? (I can do a PR if it's a good fit for the API.)

@sfbaker7

sfbaker7 commented Sep 23, 2022

Not sure if this is the best approach, but I did something like this and it solves the problem ^

import logging

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_text(text: str, max_len: int) -> str:
    try:
        summary = summarizer(text, max_length=max_len, min_length=10, do_sample=False)
        return summary[0]["summary_text"]
    except IndexError:
        logging.warning("Sequence length too large for model, cutting text in half and calling again")
        return summarize_text(text=text[:(len(text) // 2)], max_len=max_len // 2) + summarize_text(text=text[(len(text) // 2):], max_len=max_len // 2)

@FurkanGozukara

FurkanGozukara commented Oct 27, 2022

> (full quote of @sfbaker7's recursive summarize_text snippet above)

I have tested it and it works great. Awesome!

@maiduydung

Thanks to @sfbaker7's recursive suggestion, I made a similar function for translation:

model_name = "Helsinki-NLP/opus-mt-ja-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_text(text):
    try:
        translated = model.generate(**tokenizer(text, return_tensors="pt"))[0]
        return tokenizer.decode(translated, skip_special_tokens=True)
    except IndexError as err :  
        return translate_text(text[:(len(text) // 2)]) + translate_text(text[len(text) // 2:])

Hope this helps someone

@LuisAVasquez

LuisAVasquez commented Apr 14, 2023

Found an elegant solution.

If you study this line of the implementation of the pipeline API, you will notice that you can pass the arguments for tokenizer.encode directly when initializing the pipeline.

For a specific task, use:

pipeline(
    "summarization",                  # your preferred task
    model="facebook/bart-large-cnn",  # your preferred model
    # <other arguments>, e.g. top_k, ...
    max_length=30,                    # or 512, or whatever your cut-off is
    padding="max_length",
    truncation=True,
)

For loading a pretrained model, use:

model = <....>
tokenizer = <....>

pipeline(
    model=model,
    tokenizer=tokenizer,
    # <other arguments>, e.g. top_k, ...
    max_length=30,                    # or 512, or whatever your cut-off is
    padding="max_length",
    truncation=True,
)

This has worked for me with TextClassificationPipeline and BertTokenizer. I haven't tested other pipelines or tokenizers, but judging by the implementation, it should work.

@sajeedmehrab

sajeedmehrab commented Apr 25, 2023

I faced the same issue when using the fill-mask pipeline. fill_mask.py does not pass any preprocess_params or keyword arguments to the tokenizer, but it can be fixed as follows:

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/fill_mask.py#L96 ->
model_inputs = self.tokenizer(inputs, return_tensors=return_tensors, **preprocess_parameters)

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/fill_mask.py#L201 ->
def _sanitize_parameters(self, top_k=None, targets=None, **tokenizer_kwargs):
    preprocess_params = tokenizer_kwargs

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/fill_mask.py#L215 ->
return preprocess_params, {}, postprocess_params

After that you can create a pipeline as follows:

from transformers import pipeline

fill_mask_pipeline = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer,
    device=0,
)

tokenizer_kwargs = {"truncation": True, "max_length": 2048}
output = fill_mask_pipeline("Text to predict <mask>", **tokenizer_kwargs)

I will add a pull request too!

@vegansquirrel

> Glad it works! Indeed, we should do something about this warning, it shouldn't appear when a max length is specified.

I tried it and it worked; I think the problem is fixed.
I was also getting the max context length 512 error, setting the max tokens to {'max_new_tokens': 512}.

@lucifermorningstar1305

Hey everyone,

Just a quick question: how can I turn this warning off?
Token indices sequence length is longer than the specified maximum sequence length for this model (791 > 512). Running this sequence through the model will result in indexing errors

I have figured out how to deal with large context lengths with BERT models, so I want to switch off this warning.
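A minimal sketch of one way to silence it, assuming a reasonably recent transformers version (the warning is emitted through the library's own logging utilities):

from transformers.utils import logging

# lower the transformers log level so warnings are suppressed;
# only errors (and above) will still be printed
logging.set_verbosity_error()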
