
bert: add conversion script for BERT Token Dropping TF2 checkpoints #17142

Merged: 5 commits merged into huggingface:main on Jun 27, 2022

Conversation

@stefan-it (Collaborator) commented May 9, 2022

Hi,

this PR adds a conversion script for BERT models that were trained with the approach introduced in the recent paper "Token Dropping for Efficient BERT Pretraining".

Models are trained with the TensorFlow 2 implementation from the TensorFlow Models repository, which can be found here. Note: the model architecture only needs changes during pre-training; the final pre-trained model is compatible with the original BERT architecture!

Unfortunately, the authors do not plan to release pre-trained checkpoints.

However, I have pre-trained several models with their official implementation, and I've also released the checkpoints and the converted PyTorch model weights on the Hugging Face Model Hub:

https://huggingface.co/dbmdz/bert-base-historic-multilingual-64k-td-cased

This is a multilingual model that was trained on ~130GB of historic and noisy OCR'ed text with a 64k vocabulary.
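
The converted weights can also be loaded directly from the Hub with the Auto classes (a minimal sketch, assuming the tokenizer files are available in that repository as well):

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Model id of the released 64k Token Dropping hmBERT checkpoint on the Hub.
model_id = "dbmdz/bert-base-historic-multilingual-64k-td-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)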

Conversion Script Usage

The following commands can be used to test the conversion script:

wget "https://huggingface.co/dbmdz/bert-base-historic-multilingual-64k-td-cased/resolve/main/ckpt-1000000.data-00000-of-00001"
wget "https://huggingface.co/dbmdz/bert-base-historic-multilingual-64k-td-cased/resolve/main/ckpt-1000000.index"
wget "https://huggingface.co/dbmdz/bert-base-historic-multilingual-64k-td-cased/resolve/main/config.json"
python3 convert_bert_token_dropping_original_tf2_checkpoint_to_pytorch.py --tf_checkpoint_path ckpt-1000000 --bert_config_file config.json --pytorch_dump_path ./exported

This outputs:

All model checkpoint weights were used when initializing BertForMaskedLM.

All the weights of BertForMaskedLM were initialized from the model checkpoint at ./exported.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
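
As a quick sanity check, the exported model can then be loaded like any other local BERT checkpoint (a minimal sketch, using the ./exported dump path from the command above):

from transformers import BertForMaskedLM

# Load the converted weights from the local dump path used above.
model = BertForMaskedLM.from_pretrained("./exported")
print(f"Loaded BertForMaskedLM with {model.num_parameters():,} parameters")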

Masked LM Predictions

The masked LM predictions are pretty good and comparable with those of the multilingual model that was trained with the official BERT implementation. Just use the inference widget on the Hugging Face Model Hub.

In this example, the sentence "and I cannot conceive the reafon why [MASK] hath" is used to test the model. For a fair comparison, the 32k hmBERT model, which was trained with the official BERT implementation on the same corpus, is used; its output is:

[
  {
    "score": 0.3564337193965912,
    "token": 1349,
    "token_str": "she",
    "sequence": "and I cannot conceive the reafon why she hath"
  },
  {
    "score": 0.21097686886787415,
    "token": 903,
    "token_str": "it",
    "sequence": "and I cannot conceive the reafon why it hath"
  },
  {
    "score": 0.10645408183336258,
    "token": 796,
    "token_str": "he",
    "sequence": "and I cannot conceive the reafon why he hath"
  },
  {
    "score": 0.0170532688498497,
    "token": 1049,
    "token_str": "we",
    "sequence": "and I cannot conceive the reafon why we hath"
  },
  {
    "score": 0.01265314407646656,
    "token": 45,
    "token_str": "I",
    "sequence": "and I cannot conceive the reafon why I hath"
  }
]

With the 64k hmBERT model that was trained with the Token Dropping approach, the output is:

[
  {
    "score": 0.5147836804389954,
    "token": 796,
    "token_str": "he",
    "sequence": "and I cannot conceive the reafon why he hath"
  },
  {
    "score": 0.1566970944404602,
    "token": 1349,
    "token_str": "she",
    "sequence": "and I cannot conceive the reafon why she hath"
  },
  {
    "score": 0.08448878675699234,
    "token": 903,
    "token_str": "it",
    "sequence": "and I cannot conceive the reafon why it hath"
  },
  {
    "score": 0.020168323069810867,
    "token": 45,
    "token_str": "I",
    "sequence": "and I cannot conceive the reafon why I hath"
  },
  {
    "score": 0.01774059422314167,
    "token": 3560,
    "token_str": "God",
    "sequence": "and I cannot conceive the reafon why God hath"
  }
]
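
For reference, these predictions can also be reproduced locally with the fill-mask pipeline instead of the inference widget (a minimal sketch using the 64k Token Dropping model; the output entries have the same fields as the JSON above):

from transformers import pipeline

# Fill-mask pipeline with the converted 64k Token Dropping model from the Hub.
fill_mask = pipeline(
    "fill-mask",
    model="dbmdz/bert-base-historic-multilingual-64k-td-cased",
)

for prediction in fill_mask("and I cannot conceive the reafon why [MASK] hath"):
    print(prediction["score"], prediction["token_str"], prediction["sequence"])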

Downstream Task Performance

We have also used this model when participating in the HIPE-2022 Shared Task, and the BERT model pre-trained with the Token Dropping approach achieved very good results on the NER downstream task; see the results below:

| Backbone LM | Configuration | F1-Score (All, Development) | F1-Score (German, Development) | F1-Score (English, Development) | F1-Score (French, Development) | Model Hub Link |
| --- | --- | --- | --- | --- | --- | --- |
| hmBERT (32k) | bs4-e10-lr5e-05#4 | 87.64 | 89.26 | 88.78 | 84.80 | here |
| hmBERT (64k, token dropping) | bs8-e10-lr3e-05#3 | 87.02 | 88.89 | 86.63 | 85.50 | here |

@HuggingFaceDocBuilderDev commented May 9, 2022

The documentation is not available anymore as the PR was closed or merged.

@LysandreJik (Member) left a comment

Very clean, thanks for your contribution @stefan-it! Don't forget to fill the model card so that members of the community are aware of how your model was trained 😃

@stefan-it (Collaborator, Author)

Hi @LysandreJik

thanks for the approval! I have also added a model card, and the model is also mentioned in our new paper "hmBERT: Historical Multilingual Language Models for Named Entity Recognition", where it is used as the backbone language model for our winning NER models (English and French) :)

@stefan-it (Collaborator, Author)

/cc @sgugger 🤗

@sgugger merged commit 71b2839 into huggingface:main on Jun 27, 2022
@sgugger (Collaborator) commented Jun 27, 2022

Sorry this slipped through the cracks! Thanks a lot for your contribution!

@stefan-it deleted the add-bert-token-dropping-conversion-script branch on June 27, 2022 at 19:09
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request Jun 29, 2022
bert: add conversion script for BERT Token Dropping TF2 checkpoints (huggingface#17142)

* bert: add conversion script for BERT Token Dropping TF2 checkpoints

* bert: rename conversion script for BERT Token Dropping checkpoints

* bert: fix flake errors in BERT Token Dropping conversion script

* bert: make doc-builder happy!!1!11

* bert: fix pytorch_dump_path of BERT Token Dropping conversion script