bert: add conversion script for BERT Token Dropping TF2 checkpoints #17142
Conversation
Very clean, thanks for your contribution @stefan-it! Don't forget to fill the model card so that members of the community are aware of how your model was trained 😃
Hi @LysandreJik, thanks for the approval! I have also added a model card, and the model is also mentioned in our new "hmBERT: Historical Multilingual Language Models for Named Entity Recognition" paper, where it is used as the backbone language model for our winning NER models (English and French) :)
/cc @sgugger 🤗
Sorry this slipped through the cracks! Thanks a lot for your contribution!
…uggingface#17142)
* bert: add conversion script for BERT Token Dropping TF2 checkpoints
* bert: rename conversion script for BERT Token Dropping checkpoints
* bert: fix flake errors in BERT Token Dropping conversion script
* bert: make doc-builder happy!!1!11
* bert: fix pytorch_dump_path of BERT Token Dropping conversion script
Hi,
this PR adds a conversion script for BERT models that were trained with the recently introduced "Token Dropping for Efficient BERT Pretraining" approach, described in this paper:
Models are trained with the TensorFlow 2 implementation from the TensorFlow models repository, which can be found here. Note: The model architecture only needs changes during pre-training, but the final pre-trained model is compatible with the original BERT architecture!
Unfortunately, the authors do not plan to release pre-trained checkpoints.
However, I have pre-trained several models with their official implementation, and I've also released the checkpoints and the PyTorch-converted model weights on the Hugging Face Model Hub:
https://huggingface.co/dbmdz/bert-base-historic-multilingual-64k-td-cased
This is a multilingual model that was trained on ~130GB of historic, noisy OCR'ed texts with a 64k vocab.
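A minimal sketch of loading the converted checkpoint from the Hub with the standard auto classes (the model id is the one linked above; everything else is illustrative):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# The converted checkpoint behaves like a regular BERT model.
model_id = "dbmdz/bert-base-historic-multilingual-64k-td-cased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

print(model.config.vocab_size)  # should reflect the 64k vocab mentioned above
```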
Conversion Script Usage
The following commands can be used to test the conversion script.
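A minimal sketch of such a test run, assuming the script's `--tf_checkpoint_path`, `--bert_config_file` and `--pytorch_dump_path` arguments and placeholder local paths, executed from the root of the transformers repository:

```python
import subprocess

from transformers import BertForMaskedLM

# Convert the TF2 Token Dropping checkpoint to PyTorch weights.
# The checkpoint/config paths are placeholders; only the script path comes from this PR.
subprocess.run(
    [
        "python",
        "src/transformers/models/bert/convert_bert_token_dropping_original_tf2_checkpoint_to_pytorch.py",
        "--tf_checkpoint_path", "./tf2-checkpoint",
        "--bert_config_file", "./config.json",
        "--pytorch_dump_path", "./exported",
    ],
    check=True,
)

# Loading the converted weights should print the log shown below.
model = BertForMaskedLM.from_pretrained("./exported")
```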
This outputs:
All model checkpoint weights were used when initializing BertForMaskedLM. All the weights of BertForMaskedLM were initialized from the model checkpoint at ./exported. If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
Masked LM Predictions
The masked LM predictions are pretty good and are comparable with the multilingual model that was trained with the official BERT implementation. Just use the inference widget on the Hugging Face Model Hub.
In this example, the sentence
and I cannot conceive the reafon why [MASK] hath
is used to test the model. For a good comparison, the 32k hmBERT model that was trained with the official BERT implementation on the same corpus is used as a baseline:

With the 64k hmBERT model that was trained with the Token Dropping approach, the output is:
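A minimal sketch for reproducing the predictions locally with a fill-mask pipeline (the model id is the 64k Token Dropping checkpoint linked above; the output formatting is illustrative):

```python
from transformers import pipeline

# Fill-mask pipeline on the converted 64k Token Dropping checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="dbmdz/bert-base-historic-multilingual-64k-td-cased",
)

predictions = fill_mask("and I cannot conceive the reafon why [MASK] hath")
for prediction in predictions:
    print(f"{prediction['token_str']:>12}  {prediction['score']:.4f}")
```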
Downstream Task Performance
We have also used this model when participating in the HIPE-2022 Shared Task, and the BERT model pre-trained with the Token Dropping approach achieved very good results on the NER downstream task; see the results here:
bs4-e10-lr5e-05#4
bs8-e10-lr3e-05#3