From 9d7e60a3c0ee2989972cabf10afc362e6224c010 Mon Sep 17 00:00:00 2001
From: Will Rice <will@spokestack.io>
Date: Wed, 5 May 2021 21:18:32 -0400
Subject: [PATCH] Accept Upstream Changes (#1)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Add more metadata to the user agent (#10972)

* Add more metadata to the user agent

* Fix typo

* Use DISABLE_TELEMETRY

* Address review comments

* Use global env

* Add clean envs on circle CI

* Enforce string-formatting with f-strings (#10980)

* First third

* Styling and fix mistake

* Quality

* All the rest

* Treat %s and %d

* typo

* Missing )

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
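
For reference, a minimal sketch of the kind of conversion this PR enforces, including the `%s` and `%d` cases. The variable names are illustrative only and are not taken from the code base:

```python
# Illustrative values; any names here are assumptions made for the example.
name, count = "tokens", 3

# Older styles that the PR replaces.
percent_style = "Added %d %s" % (count, name)
format_style = "Added {} {}".format(count, name)

# Preferred style after the PR: an f-string.
f_string_style = f"Added {count} {name}"
```

All three expressions build the same string; only the formatting style changes.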

* add notebook (#10995)

* Merge trainers (#10975)

* Replace is_sagemaker_distributed_available

* Merge SageMakerTrainer into Trainer

* Test with shorter condition

* Put back deleted line

* Deprecate SageMakerTrainer and SageMakerTrainingArguments

* Apply suggestions from code review

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* add blog to docs (#10997)

* Update training_args.py (#11000)

In the group by length documentation length is misspelled as legnth

* Add `examples/language_modeling/run_mlm_no_trainer.py` (#11001)

* Add initial script for finetuning MLM models with accelerate

* Add evaluation metric calculation

* Fix bugs

* Use no_grad on evaluation

* update script docstring

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* PR feedback

* Fix CI failure

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix Adafactor documentation (recommend correct settings) (#10526)

* Update optimization.py

Fix documentation to reflect optimal settings for Adafactor

* update and expand on the recommendations

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* flip scale_parameter to True for the 2nd recommendation

Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Improve the speed of adding tokens from added_tokens.json (#10780)

* use bisect to add one token to unique_no_split_tokens

* fix style
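
A minimal sketch of the bisect idea, assuming the token list is kept sorted. The helper name `add_unique_sorted` is hypothetical and this is not the tokenizer's actual code:

```python
import bisect


def add_unique_sorted(sorted_tokens, token):
    """Insert `token` into an already-sorted list, skipping duplicates.

    Hypothetical helper: a binary search finds the insertion point, so the
    list does not have to be rebuilt and re-sorted for every added token.
    """
    index = bisect.bisect_left(sorted_tokens, token)
    if index < len(sorted_tokens) and sorted_tokens[index] == token:
        return  # token is already present, nothing to do
    sorted_tokens.insert(index, token)


tokens = ["<mask>", "<pad>", "<s>"]  # must already be sorted
add_unique_sorted(tokens, "<eos>")
add_unique_sorted(tokens, "<eos>")  # second call is a no-op
```

The speed-up comes from avoiding a full de-duplicate-and-sort pass each time a single token is added.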

* Add Vision Transformer and ViTFeatureExtractor (#10950)

* Squash all commits into one

* Update ViTFeatureExtractor to use image_utils instead of torchvision

* Remove torchvision and add Pillow

* Small docs improvement

* Address most comments by @sgugger

* Fix tests

* Clean up conversion script

* Pooler first draft

* Fix quality

* Improve conversion script

* Make style and quality

* Make fix-copies

* Minor docs improvements

* Should use fix-copies instead of manual handling

* Revert "Should use fix-copies instead of manual handling"

This reverts commit fd4e591bce4496d41406425c82606a8fdaf8a50b.

* Place ViT in alphabetical order

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* DebertaTokenizer Rework closes #10258 (#10703)

* closes #10258

* typo

* reworked deberta test

* implemented the comments from BigBird01 regarding sequence pair encoding of deberta

* Update style

* VOCAB_FILES_NAMES is now a oneliner as suggested by @sgugger

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* added #fmt: on as requested by @sgugger

* Style

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* minor typo fix

*negative* log-likelihood

* [doc] no more bucket

* added new notebook and merge of trainer (#11015)

* added new notebook and merge of trainer

* Update docs/source/sagemaker.md

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* fixed typo: logging instead of logger (#11025)

* Add a script to check inits are consistent (#11024)

* s|Pretrained|PreTrained| (#11048)

* [doc] update code-block rendering (#11053)

A double colon prevents the code-block section from being rendered, so use a single colon.

* Pin docutils (#11062)

* Pin docutils

* Versions table

* Remove unnecessary space (#11060)

* Some models have no tokenizers (#11064)

* Refactor AutoModel classes and add Flax Auto classes (#11027)

* Refactor AutoModel classes and add Flax Auto classes

* Add new objects to the init

* Fix hubconf and sort models

* Fix TF tests

* Missing comma

* Update src/transformers/models/auto/auto_factory.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Fix init

* Fix dummies

* Other init to fix

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Documentation about loading a fast tokenizer within Transformers (#11029)

* Documentation about loading a fast tokenizer within Transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add example for registering callbacks with trainers (#10928)

* Add example for callback registry

Resolves: #9036

* Update callback registry documentation

* Added comments for other ways to register callback

* Add `examples/language_modeling/run_clm_no_trainer.py` (#11026)

* Initial draft for clm no trainer

* Remove unwanted args

* Fix bug

* Update examples/language-modeling/run_clm_no_trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Replace pkg_resources with importlib_metadata (#11061)

* Replace pkg_resources with importlib_metadata

Fixes #10964. The other reason for this change is that pkg_resources has been [deprecated](https://github.com/pypa/setuptools/commit/8fe85c22cee7fde5e6af571b30f864bad156a010) in favor of importlib_metadata.

* Reduce to a single importlib_metadata import switch

* Trigger CI

Co-authored-by: Stas Bekman <stas@stason.org>
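
A sketch of what a single import switch can look like, assuming the `importlib_metadata` backport is installed on Python versions older than 3.8. The helper `installed_version` is hypothetical and only added to show usage:

```python
import sys

# One switch: use the standard-library module where it exists (Python 3.8+),
# otherwise fall back to the backport package.
if sys.version_info < (3, 8):
    import importlib_metadata
else:
    import importlib.metadata as importlib_metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return importlib_metadata.version(package_name)
    except importlib_metadata.PackageNotFoundError:
        return None
```

The rest of the code can then refer to `importlib_metadata` without caring which of the two modules was imported.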

* Add center_crop to ImageFeatureExtractionMixin (#11066)

* Document common config attributes (#11070)

* Fix distributed gather for tuples of tensors of varying sizes (#11071)

* Make a base init in FeatureExtractionMixin (#11074)

* Add Readme for language modeling scripts with accelerate (#11073)

* HF emoji unicode doesn't work in console (#11081)

It doesn't look like using 🤗 is a great idea for printing to console. See attachment.

This PR proposes to replace 🤗 with "HuggingFace" for an exception message.

@LysandreJik

* Link to new blog

* added social thumbnail for docs (#11083)

* added new merged Trainer test (#11090)

* [WIP] GPT Neo cleanup (#10985)

* better names

* add attention mixin

* all slow tests in one class

* make helper methods static so we can test

* add local attention tests

* better names

* doc

* apply review suggestions

* Release v4.5.0

* Development on v4.6.0dev0

* [doc] gpt-neo (#11098)

make the example work

* Auto feature extractor (#11097)

* AutoFeatureExtractor

* Init and first tests

* Tests

* Damn you gitignore

* Quality

* Defensive test for when not all backends are here

* Use pattern for Speech2Text models

* accelerate question answering examples with no trainer (#11091)

* accelerate question answering examples with no trainer

* removed train and eval flags also fixed fill np array function

* Update examples/question-answering/run_qa_beam_search_no_trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update examples/question-answering/run_qa_no_trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Style

* dead link fixed (#11103)

* GPTNeo: handle padded wte (#11079)

* GPTNeo: handle padded wte

* Switch to config.vocab_size

* apply review suggestion

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* fix: The 'warn' method is deprecated (#11105)

* The 'warn' method is deprecated

* fix test
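
For context, a small sketch of the replacement call. `Logger.warn` is the deprecated alias and `Logger.warning` is the supported method. The logger name and the `ListHandler` class are assumptions made so the example can be checked:

```python
import logging


class ListHandler(logging.Handler):
    """Collects emitted records in memory (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


logger = logging.getLogger("warn_example")
logger.propagate = False  # keep the example output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

# Use warning(), not the deprecated warn() alias.
logger.warning("checkpoint %s not found", "model.bin")
```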

* [examples] fix white space (#11099)

these get concatenated without whitespace, so fix it
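
A short illustration of the problem: adjacent string literals in Python are joined with nothing in between. The help text below is made up for the example:

```python
# Adjacent literals are concatenated as-is, so the missing trailing space
# glues two sentences together.
broken = (
    "The maximum total input sequence length after tokenization."
    "Sequences longer than this will be truncated."
)

# Fix: end each fragment with a space.
fixed = (
    "The maximum total input sequence length after tokenization. "
    "Sequences longer than this will be truncated."
)
```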

* Dummies multi backend (#11100)

* Replaces requires_xxx by one generic method

* Quality and update check_dummies

* Fix inits check

* Post-merge cleanup

* Some styling of the training table in Notebooks (#11118)

* Adds a note to resize the token embedding matrix when adding special … (#11120)

* Adds a note to resize the token embedding matrix when adding special tokens

* Remove superfluous space

* fix tests (#11109)

* [versions] handle version requirement ranges (#11110)

* handle version requirement ranges

* add mixed requirement test

* cleanup
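
A rough sketch of checking a ranged requirement such as `pkg>=1.0,<2.0`, assuming purely numeric dotted versions. The names `parse_version` and `satisfies` are hypothetical and this is not the library's implementation:

```python
import operator
import re

OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '0.10.1' into (0, 10, 1); trailing zeros are dropped so 4.5 == 4.5.0."""
    parts = [int(part) for part in text.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies(installed, requirement):
    """Check `installed` against every comma-separated clause of `requirement`."""
    match = re.match(r"^([A-Za-z0-9_.-]+)(.*)$", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    _, constraints = match.groups()
    for clause in filter(None, constraints.split(",")):
        parsed = re.match(r"^\s*(<=|>=|==|!=|<|>)\s*([0-9.]+)\s*$", clause)
        if parsed is None:
            raise ValueError(f"cannot parse clause: {clause!r}")
        op, wanted = parsed.groups()
        if not OPS[op](parse_version(installed), parse_version(wanted)):
            return False
    return True
```

Pre-release tags and other non-numeric version parts are out of scope for this sketch.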

* Adds use_auth_token with pipelines (#11123)

* added model_kwargs to infer_framework_from_model

* added model_kwargs to tokenizer

* added use_auth_token as named parameter

* added dynamic get for use_auth_token

* Fix and refactor check_repo (#11127)

* Fix typing error in Trainer class (prediction_step) (#11138)

* fix: docstrings in prediction_step

* ci: Satisfy line length requirements

* ci: character length requirements

* Typo fix of the name of BertLMHeadModel in BERT doc (#11133)

* [run_clm] clarify why we get the tokenizer warning on long input (#11145)

* clarify why we get the warning here

* Update examples/language-modeling/run_clm.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* wording

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [DeepSpeed] ZeRO Stage 3 (#10753)

* synced gpus

* fix

* fix

* need to use t5-small for quality tests

* notes

* complete merge

* fix a disappearing std stream problem

* start zero3 tests

* wip

* tune params

* sorting out the pre-trained model loading

* reworking generate loop wip

* wip

* style

* fix tests

* split the tests

* refactor tests

* wip

* parameterized

* fix

* work out the resume from non-ds checkpoint pass + test

* cleanup

* remove no longer needed code

* split getter/setter functions

* complete the docs

* suggestions

* gpus and their compute capabilities link

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* style

* remove invalid paramgd

* automatically configure zero3 params that rely on hidden size

* make _get_resized_embeddings zero3-aware

* add test exercising resize_token_embeddings()

* add docstring

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Add nvidia megatron models (#10911)

* Add support for NVIDIA Megatron models

* Add support for NVIDIA Megatron GPT2 and BERT

Add the megatron_gpt2 model. That model reuses the existing GPT2 model. This
commit includes a script to convert a Megatron-GPT2 checkpoint downloaded
from NVIDIA GPU Cloud. See examples/megatron-models/README.md for details.

Add the megatron_bert model. That model is implemented as a modification of
the existing BERT model in Transformers. This commit includes a script to
convert a Megatron-BERT checkpoint downloaded from NVIDIA GPU Cloud. See
examples/megatron-models/README.md for details.

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Remove model.half in tests + add "# Copied ..."

Remove the model.half() instruction which makes tests fail on the CPU.

Add a comment "# Copied ..." before many classes in the model to enable automatic
tracking in CI between the new Megatron classes and the original Bert ones.

* Fix issues

* Fix Flax/TF tests

* Fix copyright

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/megatron_bert.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update docs/source/model_doc/megatron_gpt2.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/__init__.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Resolve most of 'sgugger' comments

* Fix conversion issue + Run make fix-copies/quality/docs

* Apply suggestions from code review

* Causal LM & merge

* Fix init

* Add CausalLM to last auto class

Co-authored-by: Julien Demouth <jdemouth@nvidia.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

* [trainer] solve "scheduler before optimizer step" warning (#11144)

* solve "scheduler before optimizer step" warning

* style

* correct the state evaluation test

* Add fairscale and deepspeed back to the CI (#11147)

* Add fairscale and deepspeed back to the CI

* Add deepspeed to single GPU tests

* Updates SageMaker docs for updating DLCs (#11140)

* Don't duplicate logs in TensorBoard and handle --use_env (#11141)

* Run mlm pad to multiple for fp16 (#11128)

* Add mlm collator pad to multiple option (#10627)

* Use padding to 8x in run mlm (#10627)
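
A minimal sketch of padding to a multiple of 8. Sequence lengths that are multiples of 8 are generally a better fit for fp16 kernels on recent NVIDIA GPUs. The function names and the default `pad_token_id=0` are assumptions for illustration, not the collator's actual code:

```python
def padded_length(length, multiple=8):
    """Round `length` up to the next multiple of `multiple`."""
    return ((length + multiple - 1) // multiple) * multiple


def pad_batch(sequences, pad_token_id=0, pad_to_multiple_of=8):
    """Right-pad every sequence to a shared length that is a multiple of 8."""
    target = padded_length(max(len(seq) for seq in sequences), pad_to_multiple_of)
    return [list(seq) + [pad_token_id] * (target - len(seq)) for seq in sequences]
```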

* [tests] relocate core integration tests (#11146)

* relocate core integration tests

* add sys.path context manager

* cleanup

* try

* try2

* fix path

* doc

* style

* add dep

* add 2 more deps
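
A sketch of what a sys.path context manager can look like. The name `extended_sys_path` is hypothetical, and the helper added in this PR may differ:

```python
import contextlib
import sys


@contextlib.contextmanager
def extended_sys_path(path):
    """Temporarily put `path` at the front of sys.path, then restore it."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        # remove() drops the first match, which is the entry inserted above.
        sys.path.remove(path)
```

Tests can then import an example script from a relocated directory inside a `with extended_sys_path(...)` block without leaving the path modified afterwards.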

* [setup] extras[docs] must include 'all' (#11148)

* extras[doc] must include 'all'

* fix

* better

* regroup

* Add support for multiple models for one config in auto classes (#11150)

* Add support for multiple models for one config in auto classes

* Use get_values everywhere

* Prettier doc

* [setup] make fairscale and deepspeed setup extras (#11151)

* make fairscale and deepspeed setup extras

* fix default

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* no reason not to ask for the good version

* update the CIs

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Skip Megatron tests for now

* typo (#11152)

* typo

* style

* [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips (#11142)

* Add Wav2Vec Inference notebook

* Update docs/source/community.md

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Fix LogitsProcessor documentation (#11130)

* Change duplicated LogitsProcessor to LogitsWarper in LogitsProcessorList document

* Write more detailed information about LogitsProcessor's scores argument

* apply suggestion from review

* style

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Update README.md (#11161)

Corrected a typo ('Downlowd' to 'Download')

* Make `get_special_tokens_mask` consider all tokens (#11163)

* Add a special tokenizer for CPM model (#11068)

* Add a special tokenizer for CPM model

* make style

* fix

* Add docs

* styles

* cpm doc

* fix ci

* fix the overview

* add test

* make style

* typo

* Custom tokenizer flag

* Add README.md

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

* [examples/translation] support mBART-50 and M2M100 fine-tuning (#11170)

* keep a list of multilingual tokenizers

* add forced_bos_token argument

* [examples run_clm] fix _LazyModule hasher error (#11168)

* fix _LazyModule hasher error

* reword

* added json dump and extraction of train run time (#11167)

* added json dump and extraction of train run time

* make style happy

* Fix Typo

* Reactivate Megatron tests and use fewer workers

* Minor typos fixed (#11182)

* Fix style

* model_path should be ignored as the checkpoint path (#11157)

* model_path is referred to as the path of the trainer, and should be ignored as the checkpoint path.

* Improved according to Sgugger's comment.

* Added documentation for data collator. (#10941)

* Added documentation for data collator.

* Update docs/source/data_collator.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Added documentation for data collator.

* Added documentation for the data collator.

* Merge branch 'doc_DataCollator' of C:\Users\mahii\PycharmProjects\transformers with conflicts.

* Update documentation for the data collator.

* Update documentation for the data collator.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Amna <A.A.Ahmad@student.tudelft.nl>

* Fix typo (#11188)

* Add DeiT (PyTorch) (#11056)

* First draft of deit

* More improvements

* Remove DeiTTokenizerFast from init

* Conversion script works

* Add DeiT to ViT conversion script

* Add tests, add head model, add support for deit in vit conversion script

* Update model checkpoint names

* Update image_mean and image_std, set resample to bicubic

* Improve docs

* Docs improvements

* Add DeiTForImageClassificationWithTeacher to init

* Address comments by @sgugger

* Improve feature extractors

* Make fix-copies

* Minor fixes

* Address comments by @patil-suraj

* All models uploaded

* Fix tests

* Remove labels argument from DeiTForImageClassificationWithTeacher

* Fix-copies, style and quality

* Fix tests

* Fix typo

* Multiple docs improvements

* More docs fixes

* Replaced `which` with `who` (#11183)

* Import torch.utils.checkpoint in ProphetNet (#11214)

* Sagemaker test docs update for framework upgrade (#11206)

* increased train_runtime for model parallelism

* added documentation for framework upgrade

* Use MSELoss in (M)BartForSequenceClassification (#11178)

* wav2vec2 converter: create the proper vocab.json while converting fairseq wav2vec2 finetuned model (#11041)

* add vocab while converting wav2vec2 original finetuned model

* check save directory exists

* return_attention_mask fix

* quality

* Add Matt as the TensorFlow reference (#11212)

* Fix GPT-2 warnings (#11213)

* Fix GPT-2 warnings

* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* fix docstrings (#11221)

* Add documentation for BertJapanese (#11219)

* Start writing BERT-Japanese doc

* Fix typo, Update toctree

* Modify model file to use comment for document, Add examples

* Clean bert_japanese by make style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Split a big code block into two

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add prefix >>> to all lines in code blocks

* Clean bert_japanese by make fixup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Replace error by warning when loading an architecture in another (#11207)

* Replace error by warning when loading an architecture in another

* Style

* Style again

* Add a test

* Adapt old test

* Document v4.5.1

* Refactor GPT2 (#11225)

* refactor GPT2

* fix mlp and head pruning

* address Sylvain's comments

* apply suggestion from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Doc check: a bit of clean up (#11224)

* added cache_dir=model_args.cache_dir to all example with cache_dir arg (#11220)

* Avoid using no_sync on SageMaker DP (#11229)

* Indent code block in the documentation (#11233)

* Indent code block

* Indent code blocks version 2

* Quality

* Run CI on deepspeed and fairscale (#11172)

* Run CI on deepspeed and fairscale

* Test it on this branch :)

* Rename

* Update the CI image

* [Deepspeed] zero3 tests band aid (#11235)

* temp band-aid

* style

* Save the Wav2Vec2 processor before training starts (#10910)

Co-authored-by: nithin19 <nithin@amberscript.com>

* make embeddings plural in warning message (#11228)

* Stale bot updated (#10562)

* Updated stale bot

* Specify issue number

* Remove particular handling of assignees

* Unleash the stalebot

* Remove debug branch

* Close open files to suppress ResourceWarning (#11240)

Co-authored-by: Sudharsan Thirumalai <sudharsan.t@sprinklr.com>
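
A small sketch of the pattern: opening files in a `with` block closes them deterministically, which avoids the `ResourceWarning` raised for unclosed files. The `load_json` helper and the file contents are made up for the example:

```python
import json
import os
import tempfile


def load_json(path):
    """Read a JSON file and close the handle even if parsing fails."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"vocab_size": 8}, handle)
    loaded = load_json(path)
```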

* Fix dimention misspellings. (#11238)

* Update modeling_gpt_neo.py

dimention -> dimension

* Update configuration_speech_to_text.py

dimention -> dimension

* Add prefix to examples in model_doc rst (#11226)

* Add prefix to examples in model_doc rst

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [troubleshooting] add 2 points of reference to the offline mode (#11236)

* add 2 points of reference to the offline mode

* link the new doc

* add error message

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

* rename

* Trigger CI

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix #10128 (#11248)

* [deepspeed] test on one node 2 gpus max (#11237)

* test on one node 2 gpus max

* fix the other place

* refactor

* fix

* cleanup

* more exact version

* Trainer iterable dataset (#11254)

* IterableDatasetShard

* Test and integration in Trainer

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Style

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
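
A simplified sketch of sharding an iterable across processes, given here only to illustrate the idea. The function `shard_iterable` is hypothetical, it drops an incomplete final chunk, and it is not the `IterableDatasetShard` implementation:

```python
def shard_iterable(iterable, num_processes, process_index, batch_size=1):
    """Yield only the items that belong to `process_index`.

    Items are read in chunks of `batch_size * num_processes`; each process
    keeps its own `batch_size`-sized slice of every chunk.
    """
    chunk_size = batch_size * num_processes
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_size:
            start = process_index * batch_size
            yield from chunk[start:start + batch_size]
            chunk = []
    # Simplification: leftover items in an incomplete chunk are dropped.
```

Every process iterates over the full stream but keeps a disjoint slice, so no indexing into the dataset is needed.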

* Adding pipeline task aliases. (#11247)

* Adding task aliases and adding `token-classification` and
`text-classification` tasks.

* Cleaning docstring.
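
A sketch of how task aliases can be resolved before a pipeline is looked up. The alias table and the `resolve_task` helper are illustrative assumptions, not the exact code added here:

```python
# Illustrative alias table: older task names map to the new canonical names.
TASK_ALIASES = {
    "sentiment-analysis": "text-classification",
    "ner": "token-classification",
}


def resolve_task(task):
    """Return the canonical task name, leaving unknown names unchanged."""
    return TASK_ALIASES.get(task, task)
```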

* Support for set_epoch (#11258)

* Tokenizer fast save (#11234)

* Save fast tokenizers in both formats

* Fix for HerBERT

* Proper fix

* Properly test new behavior

* update dependency_versions_table (#11273)

Missed this update when bumping the version.

* Workflow fixes (#11270)

* Enabling multilingual models for translation pipelines. (#10536)

* [WIP] Enabling multilingual models for translation pipelines.

* decoder_input_ids -> forced_bos_token_id

* Improve docstring.

* Rebase

* Fixing 2 bugs

- Type token_ids coming from `_parse_and_tokenize`
- Wrong index from tgt_lang.

* Fixing black version.

* Adding tests for _build_translation_inputs and add them for all
tokenizers.

* Mbart actually puts the lang code at the end.

* Fixing m2m100.

* Adding TF support to `deep_round`.

* Update src/transformers/pipelines/text2text_generation.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding one line comment.

* Fixing M2M100 `_build_translation_input_ids`, and fix the call site.

* Fixing tests + deep_round -> nested_simplify

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix failing workflows

* Trainer support for IterableDataset for evaluation and predict (#11286)

* Bulk of the work

* Polish and tests

* Update QA Trainer

* Avoid breaking the predict method

* Deprecation warnings

* Store real eval dataloader

* Get eval dataset reference before wrap

* move device statements outside if statements (#11292)

* modify double considering special tokens in `language_modeling.py` (#11275)

* Update language_modeling.py

In "class TextDatasetForNextSentencePrediction(Dataset)", "self.tokenizer.num_special_tokens_to_add(pair=True)" was counted twice.

So I removed self.block_size and added a parameter to "def create_examples_from_document", as "class LineByLineWithSOPTextDataset" does.

* Update language_modeling.py

* [Trainer] fix the placement on device with fp16_full_eval (#11322)

* fix the placement on device with fp16_full_eval

* deepspeed never goes on device

* [Trainer] Add a progress bar for batches skipped (#11324)

* Load checkpoint without re-creating the model (#11318)

* Added translation example script  (#11196)

* initial changes

* modified evaluation

* updated evaluation

* updated evaluation on text translation example script

* added translation example script

* Formatted translation example script

* Reformatted translation example

* Fixed evaluation bug and added support for other tokenisers

* Fixed evaluation bug and added support for other tokenisers

* Added translation example script

* Formatted summarization example script

* Removed typos from summarization example script

* [Generate] Remove outdated code (#11331)

* remove update function

* update

* refactor more

* refactor

* [GPTNeo] create local attention mask ones (#11335)

* create local attention mask ones

* remove old method, address Patrick's comment

* Update to use datasets remove_columns method (#11343)

* Update to use datasets remove_columns method

* Quality

* Add an error message that fires when Reformer is not in training mode, but one runs .backward() (#11117)

* Removed `max_length` from being mandatory within `generate`. (#11314)

* Removed `max_length` from being mandatory within `generate`.

- Moving on to fully using `StoppingCriteria` for `greedy` and `sample`
modes.
- `max_length` still used for `beam_search` and `group_beam_search`
(Follow up PR)
- Fixes a bug with MaxLengthStoppingCriteria (we should stop as soon as
we hit the max_length, so the comparison needs to be greater-than-or-equal;
this affects the tests).
- Added options to use `logits_processor` and `stopping_criteria`
directly within `generate` function (so some users can define their own
`logits_processor` and `stopping_criteria`).
- Modified the backward compat tests to make sure we issue a warning.

* Fix `max_length` argument in `generate`.

* Moving validate to being functional.

- Renamed `smax_length` to `stopping_max_length`.

* Removing `logits_processor` and `stopping_criteria` from `generate`
arguments.

* Deepcopy.

* Fix global variable name.
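
A toy sketch of a max-length stopping criterion with the greater-than-or-equal comparison. The class and the loop below are simplified stand-ins written for this note, not the generation code itself:

```python
class MaxLengthCriteria:
    """Stop as soon as the sequence reaches `max_length` (simplified stand-in)."""

    def __init__(self, max_length):
        self.max_length = max_length

    def __call__(self, current_length):
        # ">=" rather than ">": with ">" one extra token would be generated.
        return current_length >= self.max_length


def toy_generate(start_length, should_stop):
    """Append one 'token' per step until the criterion fires."""
    length = start_length
    while not should_stop(length):
        length += 1
    return length


final_length = toy_generate(3, MaxLengthCriteria(5))
```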

* Honor contributors to models (#11329)

* Honor contributors to models

* Fix typo

* Address review comments

* Add more authors

* [deepspeed] fix resume from checkpoint (#11352)

This PR fixes a bug that most likely somehow got exposed (not caused) by https://github.com/huggingface/transformers/pull/11318 - surprisingly the same test worked just fine before that other PR.

* Examples reorg (#11350)

* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments and clean

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Extract metric_key_prefix during NotebookProgressCallback.on_evaluate (#11347)

* Pass metric_key_prefix as kwarg to on_evaluate

* Replace eval_loss with metric_key_prefix_loss

* Default to "eval" if metric_key_prefix not in kwargs

* Add kwargs to CallbackHandler.on_evaluate signature

* Revert "Add kwargs to CallbackHandler.on_evaluate signature"

This reverts commit 8d4c85ed512f558f7579d36771e907b3379947b7.

* Revert "Pass metric_key_prefix as kwarg to on_evaluate"

This reverts commit 7766bfe2718601230ae593d37b1317bd53cfc075.

* Extract metric_key_prefix from metrics
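
A sketch of deriving the prefix from the metric names, falling back to "eval" when no loss key is present. The helper name `extract_metric_key_prefix` is hypothetical:

```python
import re


def extract_metric_key_prefix(metrics, default="eval"):
    """Find a key such as 'test_loss' and return its prefix ('test')."""
    for key in metrics:
        if key.endswith("_loss"):
            return re.sub(r"_loss$", "", key)
    return default
```

This keeps the `on_evaluate` signature untouched, since the prefix is recovered from the metrics dictionary that the callback already receives.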

* [testing doc] bring doc up to date (#11359)

* bring doc up to date

* fix

* Merge new TF example script (#11360)

First of the new and more idiomatic TF examples!

* Remove boiler plate code (#11340)

* remove boiler plate code

* adapt roberta

* correct docs

* finish refactor

* Move old TF text classification script to legacy (#11361)

And update README to explain the work-in-progress!

* [contributing doc] explain/link to good first issue (#11346)

* explain/link to good first issue

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix token_type_ids error for big_bird model. (#11355)

* MOD: fit chinese wwm to new datasets

* MOD: move wwm to new folder

* MOD: format code

* Styling

* MOD add param and recover trainer

* MOD: add token_type_ids method for big bird

* MOD: format code

* MOD: format code

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>

* Add huggingface_hub dep for #11328

* Add in torchhub

* [Wav2Vec2] Fix special tokens for Wav2Vec2 tokenizer (#11349)

* fix wav2vec2 tok

* up

* [Flax] Correct typo (#11374)

* finish

* fix copy

* [run_translation.py] fix typo  (#11372)

fix typo

Co-authored-by: johnson <johnson@github.com>

* Add space (#11373)

* Correctly cast num_train_epochs to int (#11379)

* Fix typo (#11369)

* Fix Trainer with remove_unused_columns=False (#11382)

* Fix Trainer with remove_unused_columns=False

* Typo

* [Flax] Big FlaxBert Refactor (#11364)

* improve flax

* refactor

* typos

* Update src/transformers/modeling_flax_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_flax_utils.py

* fix typo

* improve error tolerance

* typo

* correct nasty saving bug

* fix from pretrained

* correct tree map

* add note

* correct weight tying

* correct typo (#11393)

* correct conversion (#11394)

* Fix typo in text (#11396)

* fixed typos (#11391)

* make blenderbot test slow (#11395)

* Fixed trainer total_flos reloading in distributed mode (#11383)

* Fixed trainer total_flos reloading in distributed mode

* logging flos at the end of training

* Trainer push to hub (#11328)

* Initial support for upload to hub

* push -> upload

* Fixes + examples

* Fix torchhub test

* Torchhub test I hate you

* push_model_to_hub -> push_to_hub

* Apply mixin to other pretrained models

* Remove ABC inheritance

* Add tests

* Typo

* Run tests

* Install git-lfs

* Change approach

* Add push_to_hub to all

* Staging test suite

* Typo

* Maybe like this?

* More deps

* Cache

* Adapt name

* Quality

* MOAR tests

* Put it in testing_utils

* Docs + torchhub last hope

* Styling

* Wrong method

* Typos

* Update src/transformers/file_utils.py

Co-authored-by: Julien Chaumond <julien@huggingface.co>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* push (#11400)

* added support for exporting of t5 to onnx with past_key_values (#10651)

* Fixing bug in generation (#11297)

When passing `inputs_embeds` and not `input_ids` (i.e. `input_ids=None`), the generation function fails because `input_ids` is created by the function when it should not be.

* Style

* Try to trigger failure more

* Wrong branch Sylvain...

* Fix cross-attention head mask for Torch encoder-decoder models (#10605)

* Fix cross-attention head mask for Torch BART models

* Fix head masking for cross-attention module for the following
models: BART, Blenderbot, Blenderbot_small, M2M_100, Marian, MBart,
Pegasus

* Enable test_headmasking for M2M_100 model

* Fix cross_head_mask for FSMT, LED and T5

* This commit fixes `head_mask` for cross-attention modules
in the following models: FSMT, LED, T5

* It also contains some smaller changes in the docs so that
it is perfectly clear that the shape of `cross_head_mask`
is the same as that of `decoder_head_mask`

* Update template

* Fix template for BartForCausalLM

* Fix cross_head_mask for Speech2Text models

* Fix cross_head_mask in templates

* Fix args order in BartForCausalLM template

* Fix doc in BART templates

* Make more explicit naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Fix doc

* make style quality

* Fix speech2text docstring

* Default to accuracy metric (#11405)

* Enable option for subword regularization in `XLMRobertaTokenizer` (#11149)

* enable subword regularization.

* fix tokenizer storage

* fix docstring formatting

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Stefan Schweter <stefan@schweter.it>

* fix docstring formatting

* add test for subword regularization tokenizer

* improve comments of test

* add sp_model_kwargs

* reformat docstring to match the style

* add some more documentation

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* improve docstring

* empty commit to trigger CI

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix docstring formatting for sphinx

Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Use 3 workers for torch tests

* documentation linked to the parent class PreTrainedTokenizerFast but it should be the slow tokenizer (#11410)

* Style

* Add head_mask, decoder_head_mask, cross_head_mask to ProphetNet (#9964)

* Add head_mask & decoder_head_mask + some corrections

* Fix head masking for N-grams

* Enable test_headmasking for encoder and decoder

* Fix one typo in modeling_prophetnet.py

* Enable test_headmasking for ProphetNetStandaloneDecoderModelTest
and ProphetNetStandaloneEncoderModelTest in test_modeling_prophetnet.py

* make style

* Fix cross_head_mask

* Fix attention head mask naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Still need to merge #10605 to master to pass the tests

* EncoderDecoderConfigs should not create new objects (#11300)

* removes the creation of separate config objects and uses the existing ones instead; also overwrites resize_token_embeddings from the parent class because it does not work for the EncoderDecoderModel

* rollback to current version of the huggingface master branch

* reworked version that ties the encoder and decoder config of the parent encoderdecoder instance

* overwrite of resize_token_embeddings throws an error now

* review comment suggestion

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* implemented warning in case encoderdecoder is created with differing configs of encoderdecoderconfig and decoderconfig or encoderconfig

* added test to avoid diverging configs of wrapper class and wrapped classes

* Update src/transformers/models/encoder_decoder/modeling_encoder_decoder.py

* make style

Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* updating the checkpoint for GPT2ForSequenceClassification to one with classification head (#11434)

* add pooling layer support (#11439)

* make style (#11442)

* Pin black to 20.8.b1

* With style

* Pin black to 21.4b0

* TF BART models - Add `cross_attentions` to model output and fix cross-attention head masking (#10699)

* Add cross_attn_head_mask to BART

* Fix cross_attentions in TFBart-like models

* This commit enables returning of `cross_attentions`
for TFBart-like models

* It also fixes attention head masking in cross-attention module

* Update TF model templates

* Fix missing , in TF model templates

* Fix typo: congig -> config

* Add basic support for FP16 in SageMaker model parallelism (#11407)

* Add FP16 support for SageMaker MP

* Add print debugs

* Squeeze

* Remove debug statements

* Add defensive check

* Typo

* docs(examples): fix link to TPU launcher script (#11427)

* fix some typos in docs, comments, logging/errors (#11432)

* Pass along seed to DistributedSampler (#11406)

* Pass along seed to DistributedSampler

* Add seed to DistributedLengthGroupedSampler

* Clarify description of the is_split_into_words argument (#11449)

* Improve documentation for is_split_into_words argument

* Change description wording

* [docs] fix invalid class name (#11438)

* fix invalid class name

* proper ref

* proper ref

* make sure to test against the local checkout (#11437)

* Style

* Give each test a different repo name (#11453)

* [Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)

* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom examples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval

* Variable Correction for Consistency in Distillation Example (#11444)

The error comes from an inconsistency between the name of the variable holding the number of GPUs in the parser ('gpus') and the name actually used in the train.py script ('n_gpu'). This correction makes the example work.

* [Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)

* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Remove max length beam scorer (#11378)

* removed max_len

* removed max_length from BeamSearchScorer

* correct max length

* finish

* del vim

* finish & add test

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update QuickTour docs to reflect model output object (#11462)

* update docs to reflect model output object

* run `make style`

* Finish Making Quick Tour respect the model object (#11467)

* finish quicktour

* fix import

* fix print

* explain config default better

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix docs for decoder_input_ids (#11466)

* fix docs for decoder_input_ids

* revert the changes for bart and mbart

* Update min versions in README and add Flax (#11472)

* Update min versions in README and add Flax

* Adapt index

* Update `PreTrainedTokenizerBase` to check/handle batch length for `text_pair` parameter (#11486)

* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add error message

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix #1149 (#11493)

* [Flax] Add docstrings & model outputs (#11498)

* add attentions & hidden states

* add model outputs + docs

* finish docs

* finish tests

* finish impl

* del @

* finish

* finish

* correct test

* apply sylvains suggestions

* Update src/transformers/models/bert/modeling_flax_bert.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* simplify more

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Reformat to make code clearer in tokenizer call (#11497)

* Reformat to make code clearer

* Reformat to make code clearer

* solved coefficient issue for the TF version of gelu_fast (#11514)

Co-authored-by: Michael Benayoun <michael@huggingface.co>
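
For reference, `gelu_fast` is the tanh approximation of GELU. The sketch below writes it in plain Python with the constants as I understand them from the PyTorch-side definition (0.7978845608, roughly sqrt(2/pi), and 0.044715); the commit message does not say which coefficient was wrong in the TF version, so this illustrates the formula rather than the patch itself:

```python
import math


def gelu_fast(x: float) -> float:
    """Tanh approximation of GELU; 0.7978845608 approximates sqrt(2 / pi)."""
    return 0.5 * x * (1.0 + math.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))


def gelu_exact(x: float) -> float:
    """Exact GELU via the Gaussian error function, used here only as a reference."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Comparing the two functions at a handful of points is a cheap way to catch a mistyped coefficient, since the approximation should stay within about 1e-3 of the exact value for moderate inputs.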

* Split checkpoint from model_name_or_path in examples (#11492)

* Split checkpoint from model_name_or_path in examples

* Address review comments

* Address review comments

* Patch notification service

* Pin HuggingFace Hub dependency (#11502)

* correct the dimension comment of matrix multiplication (#11494)

Co-authored-by: Frederik Bode <frederik@paperbox.ai>

* add sp_model_kwargs to unpickle of xlm roberta tok (#11430)

add test for pickle

simplify test

fix test code style

add missing pickle import

fix test

fix test

fix test

* make style (#11520)

* Update README.md (#11489)

Add link to code

* T5 Gradient Checkpointing (#11353)

* Implement gradient checkpointing for T5Stack

* A bit more robust type checking

* Add `gradient_checkpointing` to T5Config

* Formatting

* Set requires_grad only when training

* None return value will only cause problems when training

* Change the output tuple according to `use_cache`

* Enable gradient checkpointing for the decoder

Squashed commit of the following:

commit 658bdd0bd1215353a8770f558bda2ea69a0ad0c7
Author: Ceshine Lee <shuanck@gmail.com>
Date:   Sat Apr 24 14:08:17 2021 +0800

    Only set `require_grad` for gradient checkpointing

commit acaeee6b2e675045fb28ce2176444c1d63e908bd
Author: Ceshine Lee <shuanck@gmail.com>
Date:   Sat Apr 24 13:59:35 2021 +0800

    Make gradient checkpointing work with the decoder

* Formatting

* Adding `AutomaticSpeechRecognitionPipeline`. (#11337)

* Adding `AutomaticSpeechRecognitionPipeline`.

- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
  sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.

Future work:

- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Adding docs + main import + type checking + LICENSE.

* Doc style !.

* Fixing TYPE_HINT.

* Specifying waveform shape in the docs.

* Adding asserts + specify in the documentation the shape of the input
np.ndarray.

* Update src/transformers/pipelines/automatic_speech_recognition.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Adding require to tests + move the `feature_extractor` doc.

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Implement Fast Tokenization for Deberta (#11387)

* Accepts BatchEncoding in LengthSampler (#11431)

* Fix do_eval default value in training_args.py (#11511)

* Fix do_eval default value in training_args.py

* Update PULL_REQUEST_TEMPLATE.md

* Update TF text classification example (#11496)

Big refactor, fixes and multi-GPU/TPU support

* resize token embeds (#11524)

* Run model templates on master (#11527)

* [Examples] Added support for test-file in QA examples with no trainer (#11510)

* added support for test-file

* fixed typo

* added suggested changes

* reformatted code

* modified files

* fix post processing error

* Trigger CI

* removed extra lines

* Add Stas and Suraj as authors (#11526)

* Improve task summary docs (#11513)

* fix task summary docs

* refactor to use model.config.id2label instead of list

* fix nit

* Update docs/source/task_summary.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [debug utils] activation/weights underflow/overflow detector (#11274)

* sync

* add activation overflow debug utility

* cleanup

* document detect_overflow

* import torch

* add deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* convert to rst, add note

* add class

* fix docs

* improve the doc

* rework to dump a lot more info about each frame

* complete expansion

* cleanup

* format

* cleanup

* doesn't have to be transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* wrap long line

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [DeepSpeed] fp32 support (#11499)

* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made

* Fixed docs for the shape of `scores` in `generate()` (#10057)

* Fixed the doc for the shape of return scores tuples in generation_utils.py.

* Fix the output shape of `scores` for `DecoderOnlyOutput`.

* style fix

* Fix examples in M2M100 docstrings (#11540)

Replaces `tok` with `tokenizer` so examples can run with copy-paste

* [Flax BERT/Roberta] few small fixes (#11558)

* small fixes

* style

* [Wav2Vec2] Fix convert (#11562)

* push

* small change

* correct other typo

* Remove `datasets` submodule. (#11563)

* fix the mlm longformer example by changing [MASK] to <mask> (#11559)

* Add LUKE (#11223)

* Rebase with master

* Minor bug fix in docs

* Copy files from adding_luke_v2 and improve docs

* change the default value of use_entity_aware_attention to True

* remove word_hidden_states

* fix head models

* fix tests

* fix the conversion script

* add integration tests for the pretrained large model

* improve docstring

* Improve docs, make style

* fix _init_weights for pytorch 1.8

* improve docs

* fix tokenizer to construct entity sequence with [MASK] entity when entities=None

* Make fix-copies

* Make style & quality

* Bug fixes

* Add LukeTokenizer to init

* Address most comments by @patil-suraj and @LysandreJik

* rename _compute_extended_attention_mask to get_extended_attention_mask

* add comments to LukeSelfAttention

* fix the documentation of the tokenizer

* address comments by @patil-suraj, @LysandreJik, and @sgugger

* improve docs

* Make style, quality and fix-copies

* Improve docs

* fix docs

* add "entity_span_classification" task

* update example code for LukeForEntitySpanClassification

* improve docs

* improve docs

* improve the code example in luke.rst

* rename the classification layer in LukeForEntityClassification from typing to classifier

* add bias to the classifier in LukeForEntitySpanClassification

* update docs to use fine-tuned hub models in code examples of the head models

* update the example sentences

* Make style & quality

* Add require_torch to tokenizer tests

* Add require_torch to tokenizer tests

* Address comments by @sgugger and add community notebooks

* Make fix-copies

Co-authored-by: Ikuya Yamada <ikuya@ikuya.net>

* [Wav2vec2] Fixed tokenization mistakes while adding single-char tokens to tokenizer (#11538)

* Fixed tokenization mistakes while adding single-char tokens to tokenizer

* Added tests and Removed unnecessary comments.

* finalize wav2vec2 tok

* add more aggressive tests

* Apply suggestions from code review

* fix useless import

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Fix metric computation in `run_glue_no_trainer` (#11569)

* Fixes a useless warning. (#11566)

Fixes #11525

* Accumulate opt state dict on do_rank 0 (#11481)

* Update training tutorial (#11533)

* Update training tutorial

* Apply suggestions from code review

Co-authored-by: Hamel Husain <hamelsmu@github.com>

* Address review comments

* Update docs/source/training.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* More review comments

* Last review comments

Co-authored-by: Hamel Husain <hamelsmu@github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* fix resize_token_embeddings (#11572)

* Add multi-class, multi-label and regression to transformers (#11012)

* add to  bert

* review comments

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* self.config.problem_type

* fix style

* fix

* fin

* fix

* update doc

* fix

* test

* Test more problem types

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix

* remove

* fix

* quality

* make fix-copies

* remove test

Co-authored-by: abhishek thakur <abhishekkrthakur@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
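
A rough sketch of how a `problem_type` setting can be inferred when it is not set explicitly. The function name `infer_problem_type`, its arguments, and the three string values are assumptions made for illustration; the commit message itself only names `self.config.problem_type`:

```python
def infer_problem_type(num_labels: int, labels_are_integers: bool) -> str:
    """Guess the problem type from the label count and label dtype (illustrative only)."""
    if num_labels == 1:
        # A single output unit is treated as a regression target (MSE-style loss).
        return "regression"
    if labels_are_integers:
        # Integer class indices: one correct class per example (cross-entropy-style loss).
        return "single_label_classification"
    # Float label vectors: several classes may be active at once (binary cross-entropy-style loss).
    return "multi_label_classification"
```

The point of the design is that a single classification head can pick its loss from one config field instead of needing a separate model class per task type.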

* Enable added tokens (#11325)

* Fix tests

* Reorganize

* Update tests/test_modeling_mobilebert.py

* Remove unnecessary addition

* Make quality scripts work when one backend is missing. (#11573)

* Make quality scripts work when one backend is missing.

* Check env variable is properly set

* Add default

* With print statements

* Fix typo

* Set env variable

* Remove debug code

* [FlaxRoberta] Add FlaxRobertaModels & adapt run_mlm_flax.py (#11470)

* add flax roberta

* make style

* correct initialization

* modify model to save weights

* fix copied from

* fix copied from

* correct some more code

* add more roberta models

* Apply suggestions from code review

* merge from master

* finish

* finish docs

Co-authored-by: Patrick von Platen <patrick@huggingface.co>

* Removes SageMakerTrainer code but keeps class as wrapper (#11587)

* removed all old code

* make quality

* [Flax] Add Electra models (#11426)

* add electra model to flax

* Remove Electra Next Sentence Prediction model added by mistake

* fix parameter sharing and loosen equality threshold

* fix styling issues

* add mistakenly removed imports

* fix electra table

* Add FlaxElectra to automodels and fix docs

* fix issues pointed out in the PR

* fix flax electra to comply with latest changes

* remove stale class

* add copied from

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Reproducible checkpoint (#11582)

* Set generator in dataloader

* Use generator in all random samplers

* Checkpoint all RNG states

* Final version

* Quality

* Test

* Address review comments

* Quality

* Remove debug util

* Add python and numpy RNGs

* Split states in different files in distributed

* Quality

* local_rank for TPUs

* Only use generator when accepted

* Add test

* Set seed to avoid flakiness

* Make test less flaky

* Quality
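
The core idea behind checkpointing RNG states can be sketched with the standard library alone. This is a simplified illustration, not the Trainer's implementation: it covers only Python's `random` module, whereas the steps above also mention NumPy and per-process state files, and the helper names are hypothetical:

```python
import random


def save_rng_state() -> dict:
    """Capture the current state of Python's global random generator."""
    return {"python": random.getstate()}


def restore_rng_state(state: dict) -> None:
    """Restore a state captured by save_rng_state so later draws repeat exactly."""
    random.setstate(state["python"])
```

Saving the state alongside a checkpoint and restoring it on resume means that shuffling and sampling continue from the same point in the random stream as in the uninterrupted run.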

* [trainer] document resume randomness (#11588)

* document resume randomness

* fix link

* reword

* fix

* reword

* style

* copies need to be fixed too (#11585)

* add importlib_metadata and huggingface_hub as dependency in the conda recipe (#11591)

* add importlib_metadata as dependency (#11490)

Co-authored-by: Deepali Chourasia <deepch23@us.ibm.com>

* add huggingface_hub dependency

Co-authored-by: Deepali Chourasia <deepch23@us.ibm.com>

* Skip Funnel test

* Pytorch - Lazy initialization of models (#11471)

* lazy_init_weights

* remove ipdb

* save int

* add necessary code

* remove unnecessary utils

* Update src/transformers/models/t5/modeling_t5.py

* clean

* add tests

* correct

* finish tests

* finish tests

* fix some more tests

* fix xlnet & transfo-xl

* fix more tests

* make sure tests are independent

* fix tests more

* finish tests

* final touches

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* clean tests

* give arg positive name

* add more mock weights to xlnet

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Accept tensorflow-rocm package when checking TF availability (#11595)

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
Co-authored-by: JohnnyC08 <jcastaldo08@gmail.com>
Co-authored-by: Hemil Desai <hemil.desai10@gmail.com>
Co-authored-by: Josh <1113285+jsrozner@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: cchen-dialpad <47165889+cchen-dialpad@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: cronoik <johannes.schaffrath@mail.de>
Co-authored-by: Joe Davison <josephddavison@gmail.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: versis <versis791@gmail.com>
Co-authored-by: Eren Şahin <sahineren.09@gmail.com>
Co-authored-by: Amala Deshmukh <amala.d166@gmail.com>
Co-authored-by: konstin <konstin@mailbox.org>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: SHYAM SUNDER KUMAR <beingprofess@gmail.com>
Co-authored-by: Leo Gao <54557097+leogao2@users.noreply.github.com>
Co-authored-by: Vasudev Gupta <7vasudevgupta@gmail.com>
Co-authored-by: Jannis Born <jannis.born@gmx.de>
Co-authored-by: Yusuke Mori <mori@mi.t.u-tokyo.ac.jp>
Co-authored-by: Julien Demouth <julien.demouth@gmail.com>
Co-authored-by: Julien Demouth <jdemouth@nvidia.com>
Co-authored-by: Andrea Cappelli <ak314@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <62820084+Muennighoff@users.noreply.github.com>
Co-authored-by: Keisuke Hirota <tahiro.k.ad@gmail.com>
Co-authored-by: Saviour Owolabi <42647840+Seyviour@users.noreply.github.com>
Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
Co-authored-by: Masatoshi TSUCHIYA <tsuchm@users.noreply.github.com>
Co-authored-by: fghuman <f.z.ghuman@student.tudelft.nl>
Co-authored-by: Amna <A.A.Ahmad@student.tudelft.nl>
Co-authored-by: Takuya Makino <takuyamakino15@gmail.com>
Co-authored-by: calpt <calpt@mail.de>
Co-authored-by: Ceyda Cinarel <15624271+cceyda@users.noreply.github.com>
Co-authored-by: Nithin Holla <nithin.holla7@gmail.com>
Co-authored-by: nithin19 <nithin@amberscript.com>
Co-authored-by: Joel Stremmel <joelstremmel22@gmail.com>
Co-authored-by: Sudharsan S T <stsudharshan@gmail.com>
Co-authored-by: Sudharsan Thirumalai <sudharsan.t@sprinklr.com>
Co-authored-by: Thomas Wood <odell.wood@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: e <e_yi@foxmail.com>
Co-authored-by: TAE YOUNGDON <49802647+taepd@users.noreply.github.com>
Co-authored-by: rajvi-k <rajvi.kapadia01@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: wlhgtc <hgtcwl@foxmail.com>
Co-authored-by: johnson7788 <linghuchongxajh@gmail.com>
Co-authored-by: johnson <johnson@github.com>
Co-authored-by: PenutChen <penut85420@gmail.com>
Co-authored-by: Max Del <max.del.edu@gmail.com>
Co-authored-by: Yoshitomo Matsubara <yoshitomo-matsubara@users.noreply.github.com>
Co-authored-by: Teven <teven.lescao@gmail.com>
Co-authored-by: Kiran R <kiranr8k@gmail.com>
Co-authored-by: Nicola De Cao <nicola.decao@gmail.com>
Co-authored-by: Daniel Stancl <46073029+stancld@users.noreply.github.com>
Co-authored-by: Philip May <philip@may.la>
Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: abiolaTresor <48957493+abiolaTresor@users.noreply.github.com>
Co-authored-by: Amine Abdaoui <abdaoui@lirmm.fr>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: Kostas Stathoulopoulos <k.stathoylopoylos@gmail.com>
Co-authored-by: Bhadresh Savani <bhadreshpsavani@gmail.com>
Co-authored-by: Jaimeen Ahn <32367255+jaimeenahn@users.noreply.github.com>
Co-authored-by: Ashwin Geet D'Sa <win.12894@gmail.com>
Co-authored-by: Hamel Husain <hamelsmu@github.com>
Co-authored-by: Hamel Husain <hamel.husain@gmail.com>
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
Co-authored-by: Michael Benayoun <michael@huggingface.co>
Co-authored-by: Frederik Bode <fredo.bode@gmail.com>
Co-authored-by: Frederik Bode <frederik@paperbox.ai>
Co-authored-by: Manuel Romero <mrm8488@gmail.com>
Co-authored-by: CeShine Lee <ceshine@users.noreply.github.com>
Co-authored-by: Shubham Sanghavi <shubham.sanghavi@outlook.com>
Co-authored-by: bonniehyeon <50580028+bonniehyeon@users.noreply.github.com>
Co-authored-by: jingyihe <29100716+kylie-box@users.noreply.github.com>
Co-authored-by: Ikuya Yamada <ikuya@ikuya.net>
Co-authored-by: Muktan <muktan123@gmail.com>
Co-authored-by: abhishek thakur <1183441+abhi1thakur@users.noreply.github.com>
Co-authored-by: abhishek thakur <abhishekkrthakur@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick@huggingface.co>
Co-authored-by: Patrick Fernandes <pattuga@gmail.com>
Co-authored-by: Deepali <70963368+cdeepali@users.noreply.github.com>
Co-authored-by: Deepali Chourasia <deepch23@us.ibm.com>
Co-authored-by: Mats Sjöberg <mats.sjoberg@csc.fi>
---
 .../text-classification/run_tf_glue.py        | 265 ++++++++++++++++++
 model_cards/google/tapas-base/README.md       | 123 ++++++++
 2 files changed, 388 insertions(+)
 create mode 100755 examples/tensorflow/text-classification/run_tf_glue.py
 create mode 100644 model_cards/google/tapas-base/README.md

diff --git a/examples/tensorflow/text-classification/run_tf_glue.py b/examples/tensorflow/text-classification/run_tf_glue.py
new file mode 100755
index 00000000000000..5b6df337e91800
--- /dev/null
+++ b/examples/tensorflow/text-classification/run_tf_glue.py
@@ -0,0 +1,265 @@
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Fine-tuning the library models for sequence classification."""
+
+
+import logging
+import os
+from dataclasses import dataclass, field
+from enum import Enum
+from typing import Dict, Optional
+
+import numpy as np
+import tensorflow as tf
+import tensorflow_datasets as tfds
+
+from transformers import (
+    AutoConfig,
+    AutoTokenizer,
+    EvalPrediction,
+    HfArgumentParser,
+    PreTrainedTokenizer,
+    TFAutoModelForSequenceClassification,
+    TFTrainer,
+    TFTrainingArguments,
+    glue_compute_metrics,
+    glue_convert_examples_to_features,
+    glue_output_modes,
+    glue_processors,
+    glue_tasks_num_labels,
+)
+from transformers.utils import logging as hf_logging
+
+
+hf_logging.set_verbosity_info()
+hf_logging.enable_default_handler()
+hf_logging.enable_explicit_format()
+
+
+class Split(Enum):
+    train = "train"
+    dev = "validation"
+    test = "test"
+
+
+def get_tfds(
+    task_name: str,
+    tokenizer: PreTrainedTokenizer,
+    max_seq_length: Optional[int] = None,
+    mode: Split = Split.train,
+    data_dir: Optional[str] = None,
+):
+    if task_name == "mnli-mm" and mode == Split.dev:
+        tfds_name = "mnli_mismatched"
+    elif task_name == "mnli-mm" and mode == Split.train:
+        tfds_name = "mnli"
+    elif task_name == "mnli" and mode == Split.dev:
+        tfds_name = "mnli_matched"
+    elif task_name == "sst-2":
+        tfds_name = "sst2"
+    elif task_name == "sts-b":
+        tfds_name = "stsb"
+    else:
+        tfds_name = task_name
+
+    ds, info = tfds.load("glue/" + tfds_name, split=mode.value, with_info=True, data_dir=data_dir)
+    ds = glue_convert_examples_to_features(ds, tokenizer, max_seq_length, task_name)
+    ds = ds.apply(tf.data.experimental.assert_cardinality(info.splits[mode.value].num_examples))
+
+    return ds
+
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class GlueDataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+
+    Using `HfArgumentParser` we can turn this class
+    into argparse arguments to be able to specify them on
+    the command line.
+    """
+
+    task_name: str = field(metadata={"help": "The name of the task to train on: " + ", ".join(glue_processors.keys())})
+    data_dir: Optional[str] = field(default=None, metadata={"help": "The input/output data dir for TFDS."})
+    max_seq_length: int = field(
+        default=128,
+        metadata={
+            "help": "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+
+    def __post_init__(self):
+        self.task_name = self.task_name.lower()
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."})
+    # If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
+    # or just modify its tokenizer_config.json.
+    cache_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
+    )
+
+
+def main():
+    # See all possible arguments in src/transformers/training_args.py
+    # or by passing the --help flag to this script.
+    # We now keep distinct sets of args, for a cleaner separation of concerns.
+    parser = HfArgumentParser((ModelArguments, GlueDataTrainingArguments, TFTrainingArguments))
+    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    if (
+        os.path.exists(training_args.output_dir)
+        and os.listdir(training_args.output_dir)
+        and training_args.do_train
+        and not training_args.overwrite_output_dir
+    ):
+        raise ValueError(
+            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
+        )
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        level=logging.INFO,
+    )
+    logger.info(
+        f"n_replicas: {training_args.n_replicas}, distributed training: {bool(training_args.n_replicas > 1)}, "
+        f"16-bits training: {training_args.fp16}",
+    )
+    logger.info(f"Training/evaluation parameters {training_args}")
+
+    try:
+        num_labels = glue_tasks_num_labels["mnli" if data_args.task_name == "mnli-mm" else data_args.task_name]
+        output_mode = glue_output_modes[data_args.task_name]
+    except KeyError:
+        raise ValueError(f"Task not found: {data_args.task_name}")
+
+    # Load pretrained model and tokenizer
+    #
+    # Distributed training:
+    # The .from_pretrained methods guarantee that only one local process can concurrently
+    # download model & vocab.
+
+    config = AutoConfig.from_pretrained(
+        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
+        num_labels=num_labels,
+        finetuning_task=data_args.task_name,
+        cache_dir=model_args.cache_dir,
+    )
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+    )
+
+    with training_args.strategy.scope():
+        model = TFAutoModelForSequenceClassification.from_pretrained(
+            model_args.model_name_or_path,
+            from_pt=bool(".bin" in model_args.model_name_or_path),
+            config=config,
+            cache_dir=model_args.cache_dir,
+        )
+
+    # Get datasets
+    train_dataset = (
+        get_tfds(
+            task_name=data_args.task_name,
+            tokenizer=tokenizer,
+            max_seq_length=data_args.max_seq_length,
+            data_dir=data_args.data_dir,
+        )
+        if training_args.do_train
+        else None
+    )
+    eval_dataset = (
+        get_tfds(
+            task_name=data_args.task_name,
+            tokenizer=tokenizer,
+            max_seq_length=data_args.max_seq_length,
+            mode=Split.dev,
+            data_dir=data_args.data_dir,
+        )
+        if training_args.do_eval
+        else None
+    )
+
+    def compute_metrics(p: EvalPrediction) -> Dict:
+        if output_mode == "classification":
+            preds = np.argmax(p.predictions, axis=1)
+        elif output_mode == "regression":
+            preds = np.squeeze(p.predictions)
+        return glue_compute_metrics(data_args.task_name, preds, p.label_ids)
+
+    # Initialize our Trainer
+    trainer = TFTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        compute_metrics=compute_metrics,
+    )
+
+    # Training
+    if training_args.do_train:
+        trainer.train()
+        trainer.save_model()
+        tokenizer.save_pretrained(training_args.output_dir)
+
+    # Evaluation
+    results = {}
+    if training_args.do_eval:
+        logger.info("*** Evaluate ***")
+
+        result = trainer.evaluate()
+        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
+
+        with open(output_eval_file, "w") as writer:
+            logger.info("***** Eval results *****")
+
+            for key, value in result.items():
+                logger.info(f"  {key} = {value}")
+                writer.write(f"{key} = {value}\n")
+
+            results.update(result)
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
diff --git a/model_cards/google/tapas-base/README.md b/model_cards/google/tapas-base/README.md
new file mode 100644
index 00000000000000..9685f28566d499
--- /dev/null
+++ b/model_cards/google/tapas-base/README.md
@@ -0,0 +1,123 @@
+---
+language: en
+tags:
+- tapas
+- masked-lm
+license: apache-2.0
+---
+
+# TAPAS base model 
+
+This model corresponds to the `tapas_inter_masklm_base_reset` checkpoint of the [original GitHub repository](https://github.com/google-research/tapas).
+
+Disclaimer: The team releasing TAPAS did not write a model card for this model, so this model card has been written by
+the Hugging Face team and contributors.
+
+## Model description
+
+TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. 
+This means it was pretrained on the raw tables and associated texts only, with no humans labelling them in any way (which is why it
+can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
+was pretrained with two objectives:
+
+- Masked language modeling (MLM): taking a (flattened) table and associated context, the model randomly masks 15% of the words in 
+  the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words. 
+  This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, 
+  or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional 
+  representation of a table and associated text.
+- Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model by creating 
+  a balanced dataset of millions of synthetically created training examples. Here, the model must predict (classify) whether a sentence 
+  is supported or refuted by the contents of a table. The training examples are created based on synthetic as well as counterfactual statements.
+
+This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used 
+to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed
+or refuted by the contents of a table. Fine-tuning is done by adding classification heads on top of the pre-trained model, and then jointly
+training the randomly initialized classification heads with the base model on a labelled dataset. 
+
+## Intended uses & limitations
+
+You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. 
+See the [model hub](https://huggingface.co/models?filter=tapas) to look for fine-tuned versions on a task that interests you.
+
+
+Here is how to use this model to get the features of a given table-text pair in PyTorch:
+
+```python
+from transformers import TapasTokenizer, TapasModel
+import pandas as pd
+tokenizer = TapasTokenizer.from_pretrained("tapas-base")
+model = TapasModel.from_pretrained("tapas-base")
+data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
+         'Age': ["56", "45", "59"],
+         'Number of movies': ["87", "53", "69"]
+}
+table = pd.DataFrame.from_dict(data)
+queries = ["How many movies has George Clooney played in?"]
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(table=table, queries=queries, return_tensors='pt')
+output = model(**encoded_input)
+```
+
+## Training data
+
+For masked language modeling (MLM), a collection of 6.2 million tables was extracted from English Wikipedia: 3.3M of class [Infobox](https://en.wikipedia.org/wiki/Help:Infobox)
+and 2.9M of class WikiTable. The authors only considered tables with at most 500 cells. As a proxy for questions that appear in the 
+downstream tasks, the authors extracted the table caption, article title, article description, segment title and text of the segment 
+the table occurs in as relevant text snippets. In this way, 21.3M snippets were created. For more info, see the original [TAPAS paper](https://www.aclweb.org/anthology/2020.acl-main.398.pdf).
+
+For intermediate pre-training, 2 tasks are introduced: one based on synthetic statements and the other on counterfactual statements. The first one 
+generates a sentence by sampling from a set of logical expressions that filter, combine and compare the information on the table, which is 
+required in table entailment (e.g., knowing that Gerald Ford is taller than the average president requires summing
+the heights of all presidents and dividing by the number of presidents). The second one corrupts sentences about tables appearing on Wikipedia by swapping 
+entities for plausible alternatives. Examples of the two tasks can be seen in Figure 1. The procedure is described in detail in section 3 of 
+the [TAPAS follow-up paper](https://www.aclweb.org/anthology/2020.findings-emnlp.27.pdf).
+
+## Training procedure
+
+### Preprocessing
+
+The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
+then of the form:
+
+```
+[CLS] Context [SEP] Flattened table [SEP]
+```
+
+The details of the masking procedure for each sequence are the following:
+- 15% of the tokens are masked.
+- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
+- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+- In the 10% remaining cases, the masked tokens are left as is.
+
+The details of the creation of the synthetic and counterfactual examples can be found in the [follow-up paper](https://arxiv.org/abs/2010.00571). 
+
+### Pretraining
+
+The model was trained on 32 Cloud TPU v3 cores for one million steps with maximum sequence length 512 and batch size of 512.
+In this setup, pre-training takes around 3 days. The optimizer used is Adam with a learning rate of 5e-5, and a warmup ratio 
+of 0.10. 
+
+
+### BibTeX entry and citation info
+
+```bibtex
+@misc{herzig2020tapas,
+      title={TAPAS: Weakly Supervised Table Parsing via Pre-training}, 
+      author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
+      year={2020},
+      eprint={2004.02349},
+      archivePrefix={arXiv},
+      primaryClass={cs.IR}
+}
+```
+
+```bibtex
+@misc{eisenschlos2020understanding,
+      title={Understanding tables with intermediate pre-training}, 
+      author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
+      year={2020},
+      eprint={2010.00571},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
\ No newline at end of file