
[BUG] Maximum Sequence Limit not set on CAMeLBERT Model #123

Closed
FDSRashid opened this issue Sep 18, 2023 · 2 comments
@FDSRashid

Describe the bug
The CAMeLBERT models on Hugging Face do not specify a maximum length, which causes longer tokenized sequences to fail. I am also unsure whether to use the CAMeLBERT models at all, since they were last updated over two years ago. I want to encode sentences from Classical Arabic texts, so if there are any models within camel-tools trained on Classical Arabic, that would be wonderful.

To Reproduce
I loaded the tokenizer and model with the from_pretrained methods:

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

Checking tokenizer.model_max_length returned an extremely large number, so I couldn't chunk sequences based on the model's maximum length. The error message mentioned that the tensor needed to be 512 in length; that's when I noticed the model maximum wasn't set.
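A minimal sketch of the observation above, assuming only the transformers library is installed; the printed value is the library's "unset" sentinel, not a real limit:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
# With no limit in the model's config, transformers falls back to a huge
# sentinel value (on the order of 1e30) instead of the 512 positions
# the underlying BERT model actually supports.
print(tokenizer.model_max_length)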

Expected behavior
Since the error message expects token lengths of 512, I would expect the model max length to be set to 512. However, if there is a newer model trained on Classical Arabic that is used in camel_tools, I apologize; I didn't find it. I just want to encode sentences from Classical Arabic texts.

Screenshots
No screenshots, unfortunately; I fixed the error by setting a maximum length manually in a separate variable.

However, this was the text of the error message:

RuntimeError: The size of tensor a (5338) must match the size of tensor b (512) at non-singleton dimension 1
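For reference, a minimal sketch of that manual workaround, with the 512 taken from the error message rather than from the config (the variable text is a placeholder for the input):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

text = '...'  # a (possibly long) Classical Arabic sentence
max_len = 512  # set manually, since the config leaves model_max_length unset

# Truncating at tokenization time keeps inputs within the positional limit,
# which avoids the tensor-size mismatch above.
inputs = tokenizer(text, truncation=True, max_length=max_len, return_tensors='pt')
outputs = model(**inputs)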

Desktop (please complete the following information):
Working from Google Colab

Additional context
None, but if there is an updated pretrained model in camel_tools that is trained on Classical Arabic, that would be amazing.

@owo
Collaborator

owo commented Sep 18, 2023

Hi @FDSRashid ,

This is an issue for CAMeLBERT. Can you please post the issue there? @balhafni will take a look.

@owo owo closed this as completed Sep 18, 2023
@balhafni
Member

Hi @FDSRashid,

This is an issue in the way the configs were created for the CAMeLBERT models. We recommend always specifying the max_length, which has a maximum value of 512, whenever you use a CAMeLBERT tokenizer:

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', max_length=512)
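An equivalent, hedged alternative (not from the maintainers' reply) is to set the limit on the tokenizer object itself, so truncation at call time picks it up automatically:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
tokenizer.model_max_length = 512  # plain attribute; overrides the unset sentinel

# With truncation=True and no explicit max_length, the tokenizer truncates
# to model_max_length.
inputs = tokenizer('...', truncation=True, return_tensors='pt')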
