Max length problem with bert-base-arabic-camelbert-mix-pos-msa #6
Comments
Hi @dearden! A few points based on what you provided:
from transformers import pipeline, AutoTokenizer
# Request truncation at 512 tokens when loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa', max_length=512, truncation=True)
# Pass the configured tokenizer to the token-classification pipeline
pos = pipeline('token-classification', model='CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa', tokenizer=tokenizer)
pos(text)
Hope this helps!
Hi @balhafni! Thanks for the quick response.
for the input " ".join(["هذه ثماني كلمات لطيفة يجب معالجتها."] * 64). Is that expected? My understanding is that setting
What I'm doing
I'm using CamelBERT PoS tagging to process modern standard Arabic text, and I'm doing so as follows.
The problem
When running the model on texts with >512 tokens, I get the following error.
As mentioned in this issue over in Camel Tools, it's a known CamelBERT problem and the solution is to use the tokeniser as follows:
However, this does not fix the whole pipeline, and running pipeline with max_length=512 results in an error because the parameter does not exist.
What I've tried
I've tried doing the following...
but that doesn't work either. There's this warning...
which suggests that the parameter is being ignored even when we specify max_length.
I've almost got it working by doing the tokenisation and model separately.
But then I get the output as tensors, and I'm not sure how to decode the output into human readable form.
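One way to turn those tensors into readable tags, as a minimal sketch: take the argmax over the label dimension for each token and look the index up in the model's label map. The helper `decode_token_labels` is hypothetical and works on plain Python lists so it can be read in isolation. It assumes the scores come from something like `model(**inputs).logits[0].tolist()`, the tokens from `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])`, and the label map from `model.config.id2label`. The toy labels below are made up.

```python
def decode_token_labels(tokens, logits, id2label,
                        special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Pair each token with its highest-scoring label.

    tokens:   list of subword strings, one per position
    logits:   list of per-position score lists (same length as tokens)
    id2label: dict mapping label index -> label string
    """
    if len(tokens) != len(logits):
        raise ValueError("tokens and logits must have the same length")
    decoded = []
    for token, scores in zip(tokens, logits):
        if token in special_tokens:
            continue  # special tokens carry no PoS tag
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        decoded.append((token, id2label[best]))
    return decoded


# Toy data standing in for real model output (assumed shapes, invented labels).
example = decode_token_labels(
    ["[CLS]", "هذه", "كلمات", "[SEP]"],
    [[0.1, 0.2], [2.0, -1.0], [-0.5, 1.5], [0.0, 0.0]],
    {0: "pron", 1: "noun"},
)
```

The result is still one tag per subword; pieces that start with `##` would need to be merged back into words in a further step.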
Question
Is there a known fix or workaround for this problem? The output from CamelBERT is super useful, but there are quite a lot of texts with >512 tokens.
Thanks! And apologies if I'm just missing something obvious.
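A possible workaround, sketched under assumptions rather than taken from the maintainers: split the text into chunks that each fit the 512-token limit and tag them one by one. The helper `chunk_words` is hypothetical; `count_subwords` stands for something like `lambda w: len(tokenizer.tokenize(w))`, and the default budget of 510 is an assumption that leaves room for `[CLS]` and `[SEP]`.

```python
def chunk_words(words, count_subwords, max_tokens=510):
    """Group words into consecutive chunks whose subword total stays within max_tokens."""
    chunks, current, used = [], [], 0
    for word in words:
        n = count_subwords(word)
        # Start a new chunk when adding this word would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(current)
    return chunks


# Toy counter: pretend every word splits into two subwords.
demo = chunk_words(["w"] * 7, lambda w: 2, max_tokens=6)
```

Each chunk can then be joined with spaces and passed to the pipeline separately. Cutting at arbitrary word boundaries may cost some context near the edges, so splitting at sentence boundaries first is probably safer. A single word longer than the budget still ends up in its own oversized chunk.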