Describe the bug
The CAMeLBERT transformers on Hugging Face do not specify a max length, which causes longer tokenized sentences to fail. I am unsure whether to actually use the CAMeLBERT models, because I saw that they were last updated over two years ago. I want to get encoded sentences from texts in Classical Arabic, so if there are any models within camel-tools that are trained on Classical Arabic, that would be wonderful.
To Reproduce
I used the AutoTokenizer.from_pretrained command:

```python
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
```
Checking tokenizer.model_max_length returned an extremely large number, so I couldn't chunk sequences based on the model's max length. The error message mentioned that the tensor needed to be 512 in length; that's when I noticed the model maximum wasn't set. A sketch of the failing case is below.
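For reference, a minimal sketch of the failure (assuming `long_text` is a placeholder for a long Classical Arabic document; not my exact code):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

print(tokenizer.model_max_length)  # prints a huge sentinel value instead of 512

# `long_text` is a placeholder for a long Classical Arabic document.
# Without truncation, the encoded input exceeds the model's 512 position
# embeddings, and the forward pass raises the RuntimeError quoted below.
inputs = tokenizer(long_text, return_tensors='pt')
outputs = model(**inputs)
```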
Expected behavior
Since the error message expects token lengths of 512, I would expect the model max length to be set to 512. However, if there is a newer model trained on Classical Arabic that is used in camel_tools, I apologize, I didn't find it. I just want to encode sentences from Classical Arabic texts.
Screenshots
No screenshots, unfortunately; I worked around the error by setting a max length manually in another variable and chunking by it (a rough sketch follows the error message below).
However, this was the text of the error message:

```
RuntimeError: The size of tensor a (5338) must match the size of tensor b (512) at non-singleton dimension 1
```
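Roughly, the workaround looked like this (a sketch, not my exact code; it reuses the `tokenizer`, `model`, and `long_text` from above):

```python
import torch

max_len = 512  # set by hand, since tokenizer.model_max_length is not populated
stride = max_len - 2  # leave room for the [CLS] and [SEP] special tokens

token_ids = tokenizer.encode(long_text, add_special_tokens=False)
for start in range(0, len(token_ids), stride):
    chunk = token_ids[start:start + stride]
    ids = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
    inputs = {
        'input_ids': torch.tensor([ids]),
        'attention_mask': torch.ones(1, len(ids), dtype=torch.long),
    }
    with torch.no_grad():
        outputs = model(**inputs)  # outputs.last_hidden_state: (1, len(ids), hidden_size)
```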
Desktop (please complete the following information):
Working from Google Colab
Additional context
None, but if there is a pretrained model in camel_tools that is trained on Classical Arabic, that would be amazing.
This is an issue in the way the configs were created for the CAMeLBERT models. We recommend always specifying the max_length, which has a maximum value of 512, whenever you use a CAMeLBERT tokenizer:
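For example (a sketch of that recommendation, where `text` stands in for your input string):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

# Always pass max_length (at most 512) together with truncation so long
# inputs are cut down to what the model's position embeddings can handle.
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
outputs = model(**inputs)
```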