Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pytorch - Lazy initialization of models #11471

Merged
merged 29 commits into from
May 5, 2021

Conversation

patrickvonplaten
Copy link
Contributor

@patrickvonplaten patrickvonplaten commented Apr 27, 2021

🚨🚨🚨 Breaking seeded model initialization 🚨🚨🚨

As explained below this PR breaks seeded model initialization by default. To ensure the exact same model initialization as before, use:

torch.manual_seed(seed)
model = BertForSequenceClassification.from_pretrained("bert-base-cased", _fast_init=False)

What does this PR do?

This PR implements fast initializing by only initializing weights that need to be initialized. For every model two aggressive tests are added to make sure that the new "fast" initialization initializes the weights according to the exact same distributions as the previous init scheme.

IMO, it is not possible to ensure that:

from transformers import BertForSequenceClassification
import torch

torch.manual_seed(0)
model = BertForSequenceClassification.from_pretrained("bert-base-cased", _fast_init=False)  # this randomely inits the lm_head layer

yields the same results as the new "fast" init

torch.manual_seed(0)
model = BertForSequenceClassification.from_pretrained("bert-base-cased")  # this randomely inits the lm_head layer

since in the first case all layers are initialized so that the random number generator is called much more often, thus making it impossible to ensure identical weights initialization => compare to this post to better understand why this is probably not possible.

This became obvious in this PR since I had to change the random_seed for running the run_ner.py examples test to make it pass.

I guess it is therefore better to stick to "initializing all weights" for now and only when changing to version 5.0 making this breaking change. => Guess we should discuss this @sgugger @LysandreJik @stas00

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Fixes: #9205

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! I have a few comments on the design, let's continue the discussion on each one!

src/transformers/modeling_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_utils.py Outdated Show resolved Hide resolved
src/transformers/modeling_utils.py Outdated Show resolved Hide resolved
src/transformers/models/bert/modeling_bert.py Outdated Show resolved Hide resolved
Copy link
Contributor

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So glad we finally are trying to sort it out!

If we are still uncertain about some edge cases perhaps we can also empower the user to have a way to fully init weights and bypass this optimization

src/transformers/modeling_utils.py Outdated Show resolved Hide resolved
src/transformers/models/bert/modeling_bert.py Outdated Show resolved Hide resolved
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, great to see such a great speedup! I'd rather be on the more cautious side here and make sure that each weight is correctly initialized, even if it's done twice, rather than have some weights which are not initialized.

src/transformers/models/bert/modeling_bert.py Outdated Show resolved Hide resolved
src/transformers/modeling_utils.py Outdated Show resolved Hide resolved
@patrickvonplaten patrickvonplaten changed the title [POC] lazy_init_weights Lazy init Apr 30, 2021
@patrickvonplaten patrickvonplaten merged commit 3e3e41a into huggingface:master May 5, 2021
@patrickvonplaten patrickvonplaten deleted the lazy_init branch May 5, 2021 15:26
will-rice added a commit to will-rice/transformers that referenced this pull request May 6, 2021
* Add more metadata to the user agent (#10972)

* Add more metadata to the user agent

* Fix typo

* Use DISABLE_TELEMETRY

* Address review comments

* Use global env

* Add clean envs on circle CI

* Enforce string-formatting with f-strings (#10980)

* First third

* Styling and fix mistake

* Quality

* All the rest

* Treat %s and %d

* typo

* Missing )

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* add notebook (#10995)

* Merge trainers (#10975)

* Replace is_sagemaker_distributed_available

* Merge SageMakerTrainer into Trainer

* Test with shorter condition

* Put back deleted line

* Deprecate SageMakerTrainer and SageMakerTrainingArguments

* Apply suggestions from code review

Co-authored-by: Philipp Schmid <[email protected]>

Co-authored-by: Philipp Schmid <[email protected]>

* add blog to docs (#10997)

* Update training_args.py (#11000)

In the group by length documentation length is misspelled as legnth

* Add `examples/language_modeling/run_mlm_no_trainer.py` (#11001)

* Add initial script for finetuning MLM models with accelerate

* Add evaluation metric calculation

* Fix bugs

* Use no_grad on evaluation

* update script docstring

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* PR feedback

* Fix CI failure

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix Adafactor documentation (recommend correct settings) (#10526)

* Update optimization.py

Fix documentation to reflect optimal settings for Adafactor

* update and expand on the recommendations

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* flip scale_parameter to True for the 2nd recommendatoin

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Improve the speed of adding tokens from added_tokens.json (#10780)

* use bisect to add one token to unique_no_split_tokens

* fix style

* Add Vision Transformer and ViTFeatureExtractor (#10950)

* Squash all commits into one

* Update ViTFeatureExtractor to use image_utils instead of torchvision

* Remove torchvision and add Pillow

* Small docs improvement

* Address most comments by @sgugger

* Fix tests

* Clean up conversion script

* Pooler first draft

* Fix quality

* Improve conversion script

* Make style and quality

* Make fix-copies

* Minor docs improvements

* Should use fix-copies instead of manual handling

* Revert "Should use fix-copies instead of manual handling"

This reverts commit fd4e591bce4496d41406425c82606a8fdaf8a50b.

* Place ViT in alphabetical order

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* DebertaTokenizer Rework closes #10258 (#10703)

* closes #10258

* typo

* reworked deberta test

* implemented the comments from BigBird01 regarding sequence pair encoding of deberta

* Update style

* VOCAB_FILES_NAMES is now a oneliner as suggested by @sgugger

Co-authored-by: Sylvain Gugger <[email protected]>

* added #fmt: on as requested by @sgugger

* Style

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* minor typo fix

*negative* log-likelihood

* [doc] no more bucket

* added new notebook and merge of trainer (#11015)

* added new notebook and merge of trainer

* Update docs/source/sagemaker.md

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* fixed typo: logging instead of logger (#11025)

* Add a script to check inits are consistent (#11024)

* s|Pretrained|PreTrained| (#11048)

* [doc] update code-block rendering (#11053)

double : prevents code-block section to be rendered, so made it single :

* Pin docutils (#11062)

* Pin docutils

* Versions table

* Remove unnecessary space (#11060)

* Some models have no tokenizers (#11064)

* Refactor AutoModel classes and add Flax Auto classes (#11027)

* Refactor AutoModel classes and add Flax Auto classes

* Add new objects to the init

* Fix hubconf and sort models

* Fix TF tests

* Missing coma

* Update src/transformers/models/auto/auto_factory.py

Co-authored-by: Lysandre Debut <[email protected]>

* Fix init

* Fix dummies

* Other init to fix

Co-authored-by: Lysandre Debut <[email protected]>

* Documentation about loading a fast tokenizer within Transformers (#11029)

* Documentation about loading a fast tokenizer within Transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Add example for registering callbacks with trainers (#10928)

* Add example for callback registry

Resolves: #9036

* Update callback registry documentation

* Added comments for other ways to register callback

* Add `examples/language_modeling/run_clm_no_trainer.py` (#11026)

* Initial draft for clm no trainer

* Remove unwanted args

* Fix bug

* Update examples/language-modeling/run_clm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace pkg_resources with importlib_metadata (#11061)

* Replace pkg_resources with importlib_metadata

Fixes #10964. The other reason for this change is that pkg_resources has been [deprecated](https://github.com/pypa/setuptools/commit/8fe85c22cee7fde5e6af571b30f864bad156a010) in favor of importlib_metadata.

* Reduce to a single importlib_metadata import switch

* Trigger CI

Co-authored-by: Stas Bekman <[email protected]>

* Add center_crop to ImageFeatureExtractoMixin (#11066)

* Document common config attributes (#11070)

* Fix distributed gather for tuples of tensors of varying sizes (#11071)

* Make a base init in FeatureExtractionMixin (#11074)

* Add Readme for language modeling scripts with accelerate (#11073)

* HF emoji unicode doesn't work in console (#11081)

It doesn't look like using 🤗 is a great idea for printing to console. See attachment.

This PR proposes to replace 🤗 with "HuggingFace" for an exception message.

@LysandreJik

* Link to new blog

* added social thumbnail for docs (#11083)

* added new merged Trainer test (#11090)

* [WIP] GPT Neo cleanup (#10985)

* better names

* add attention mixin

* all slow tests in one class

* make helper methods static so we can test

* add local attention tests

* better names

* doc

* apply review suggestions

* Release v4.5.0

* Development on v4.6.0dev0

* [doc] gpt-neo (#11098)

make the example work

* Auto feature extractor (#11097)

* AutoFeatureExtractor

* Init and first tests

* Tests

* Damn you gitignore

* Quality

* Defensive test for when not all backends are here

* Use pattern for Speech2Text models

* accelerate question answering examples with no trainer (#11091)

* accelerate question answering examples with no trainer

* removed train and eval flags also fixed fill np array function

* Update examples/question-answering/run_qa_beam_search_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update examples/question-answering/run_qa_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Style

* dead link fixed (#11103)

* GPTNeo: handle padded wte (#11079)

* GPTNeo: handle padded wte

* Switch to config.vocab_size

* apply review suggestion

Co-authored-by: Suraj Patil <[email protected]>

* fix: The 'warn' method is deprecated (#11105)

* The 'warn' method is deprecated

* fix test

* [examples] fix white space (#11099)

these get concatenated without whitespace, so fix it

* Dummies multi backend (#11100)

* Replaces requires_xxx by one generic method

* Quality and update check_dummies

* Fix inits check

* Post-merge cleanup

* Some styling of the training table in Notebooks (#11118)

* Adds a note to resize the token embedding matrix when adding special … (#11120)

* Adds a note to resize the token embedding matrix when adding special tokens

* Remove superfluous space

* fix tests (#11109)

* [versions] handle version requirement ranges (#11110)

* handle version requirement ranges

* add mixed requirement test

* cleanup

* Adds use_auth_token with pipelines (#11123)

* added model_kwargs to infer_framework_from_model

* added model_kwargs to tokenizer

* added use_auth_token as named parameter

* added dynamic get for use_auth_token

* Fix and refactor check_repo (#11127)

* Fix typing error in Trainer class (prediction_step) (#11138)

* fix: docstrings in prediction_step

* ci: Satisfy line length requirements

* ci: character length requirements

* Typo fix of the name of BertLMHeadModel in BERT doc (#11133)

* [run_clm] clarify why we get the tokenizer warning on long input (#11145)

* clarify why we get the warning here

* Update examples/language-modeling/run_clm.py

Co-authored-by: Sylvain Gugger <[email protected]>

* wording

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] ZeRO Stage 3 (#10753)

* synced gpus

* fix

* fix

* need to use t5-small for quality tests

* notes

* complete merge

* fix a disappearing std stream problem

* start zero3 tests

* wip

* tune params

* sorting out the pre-trained model loading

* reworking generate loop wip

* wip

* style

* fix tests

* split the tests

* refactor tests

* wip

* parameterized

* fix

* workout the resume from non-ds checkpoint pass + test

* cleanup

* remove no longer needed code

* split getter/setter functions

* complete the docs

* suggestions

* gpus and their compute capabilities link

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* style

* remove invalid paramgd

* automatically configure zero3 params that rely on hidden size

* make _get_resized_embeddings zero3-aware

* add test exercising resize_token_embeddings()

* add docstring

Co-authored-by: Lysandre Debut <[email protected]>

* Add nvidia megatron models (#10911)

* Add support for NVIDIA Megatron models

* Add support for NVIDIA Megatron GPT2 and BERT

Add the megatron_gpt2 model. That model reuses the existing GPT2 model. This
commit includes a script to convert a Megatron-GPT2 checkpoint downloaded
from NVIDIA GPU Cloud. See examples/megatron-models/README.md for details.

Add the megatron_bert model. That model is implemented as a modification of
the existing BERT model in Transformers. This commit includes a script to
convert a Megatron-BERT checkpoint downloaded from NVIDIA GPU Cloud. See
examples/megatron-models/README.md for details.

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Remove model.half in tests + add "# Copied ..."

Remove the model.half() instruction which makes tests fail on the CPU.

Add a comment "# Copied ..." before many classes in the model to enable automatic
tracking in CI between the new Megatron classes and the original Bert ones.

* Fix issues

* Fix Flax/TF tests

* Fix copyright

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update docs/source/model_doc/megatron_bert.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update docs/source/model_doc/megatron_gpt2.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/__init__.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Resolve most of 'sgugger' comments

* Fix conversion issue + Run make fix-copies/quality/docs

* Apply suggestions from code review

* Causal LM & merge

* Fix init

* Add CausalLM to last auto class

Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* [trainer] solve "scheduler before optimizer step" warning (#11144)

* solve "scheduler before optimizer step" warning

* style

* correct the state evaluation test

* Add fairscale and deepspeed back to the CI (#11147)

* Add fairscale and deepspeed back to the CI

* Add deepspeed to single GPU tests

* Updates SageMaker docs for updating DLCs (#11140)

* Don't duplicate logs in TensorBoard and handle --use_env (#11141)

* Run mlm pad to multiple for fp16 (#11128)

* Add mlm collator pad to multiple option (#10627)

* Use padding to 8x in run mlm (#10627)

* [tests] relocate core integration tests (#11146)

* relocate core integration tests

* add sys.path context manager

* cleanup

* try

* try2

* fix path

* doc

* style

* add dep

* add 2 more deps

* [setup] extras[docs] must include 'all' (#11148)

* extras[doc] must include 'all'

* fix

* better

* regroup

* Add support for multiple models for one config in auto classes (#11150)

* Add support for multiple models for one config in auto classes

* Use get_values everywhere

* Prettier doc

* [setup] make fairscale and deepspeed setup extras (#11151)

* make fairscale and deepspeed setup extras

* fix default

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* no reason not to ask for the good version

* update the CIs

Co-authored-by: Sylvain Gugger <[email protected]>

* Skip Megatron tests for now

* typo (#11152)

* typo

* style

* [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips (#11142)

* Add Wav2Vec Inference notebook

* Update docs/source/community.md

Co-authored-by: Suraj Patil <[email protected]>

* Fix LogitsProcessor documentation (#11130)

* Change duplicated LogitsProcessor to LogitsWarper in LogitsProcessorList document

* Write more detailed information about LogitsProcessor's scores argument

* apply suggestion from review

* style

Co-authored-by: Suraj Patil <[email protected]>

* Update README.md (#11161)

Corrected a typo ('Downlowd' to 'Download')

* Make `get_special_tokens_mask` consider all tokens (#11163)

* Add a special tokenizer for CPM model (#11068)

* Add a special tokenizer for CPM model

* make style

* fix

* Add docs

* styles

* cpm doc

* fix ci

* fix the overview

* add test

* make style

* typo

* Custom tokenizer flag

* Add REAMDE.md

Co-authored-by: Lysandre <[email protected]>

* [examples/translation] support mBART-50 and M2M100 fine-tuning (#11170)

* keep a list of multilingual tokenizers

* add forced_bos_token argument

* [examples run_clm] fix _LazyModule hasher error (#11168)

* fix _LazyModule hasher error

* reword

* added json dump and extraction of train run time (#11167)

* added json dump and extraction of train run time

* make style happy

* Fix Typo

* Reactivate Megatron tests an use less workers

* Minor typos fixed (#11182)

* Fix style

* model_path should be ignored as the checkpoint path (#11157)

* model_path is refered as the path of the trainer, and should be ignored as the checkpoint path.

* Improved according to Sgugger's comment.

* Added documentation for data collator. (#10941)

* Added documentation for data collator.

* Update docs/source/data_collator.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Added documentation for data collator.

* Added documentation for the data collator.

* Merge branch 'doc_DataCollator' of C:\Users\mahii\PycharmProjects\transformers with conflicts.

* Update documentation for the data collator.

* Update documentation for the data collator.

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Amna <[email protected]>

* Fix typo (#11188)

* Add DeiT (PyTorch) (#11056)

* First draft of deit

* More improvements

* Remove DeiTTokenizerFast from init

* Conversion script works

* Add DeiT to ViT conversion script

* Add tests, add head model, add support for deit in vit conversion script

* Update model checkpoint names

* Update image_mean and image_std, set resample to bicubic

* Improve docs

* Docs improvements

* Add DeiTForImageClassificationWithTeacher to init

* Address comments by @sgugger

* Improve feature extractors

* Make fix-copies

* Minor fixes

* Address comments by @patil-suraj

* All models uploaded

* Fix tests

* Remove labels argument from DeiTForImageClassificationWithTeacher

* Fix-copies, style and quality

* Fix tests

* Fix typo

* Multiple docs improvements

* More docs fixes

* Replaced `which` with `who` (#11183)

* Import torch.utils.checkpoint in ProphetNet (#11214)

* Sagemaker test docs update for framework upgrade (#11206)

* increased train_runtime for model parallelism

* added documentation for framework upgrade

* Use MSELoss in (M)BartForSequenceClassification (#11178)

* wav2vec2 converter: create the proper vocab.json while converting fairseq wav2vec2 finetuned model (#11041)

* add vocab while converting wav2vec2 original finetuned model

* check save directory exists

* return_attention_mask fix

* quality

* Add Matt as the TensorFlow reference (#11212)

* Fix GPT-2 warnings (#11213)

* Fix GPT-2 warnings

* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Stas Bekman <[email protected]>

Co-authored-by: Stas Bekman <[email protected]>

* fix docstrings (#11221)

* Add documentation for BertJapanese (#11219)

* Start writing BERT-Japanese doc

* Fix typo, Update toctree

* Modify model file to use comment for document, Add examples

* Clean bert_japanese by make style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Split a big code block into two

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Add prefix >>> to all lines in code blocks

* Clean bert_japanese by make fixup

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace error by warning when loading an architecture in another (#11207)

* Replace error by warning when loading an architecture in another

* Style

* Style again

* Add a test

* Adapt old test

* Document v4.5.1

* Refactor GPT2 (#11225)

* refactor GPT2

* fix mlp and head pruning

* address Sylvains comments

* apply suggestion from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Doc check: a bit of clean up (#11224)

* added cache_dir=model_args.cache_dir to all example with cache_dir arg (#11220)

* Avoid using no_sync on SageMaker DP (#11229)

* Indent code block in the documentation (#11233)

* Indent code block

* Indent code blocks version 2

* Quality

* Run CI on deepspeed and fairscale (#11172)

* Run CI on deepspeed and fairscale

* Test it on this branch :)

* Rename

* Update the CI image

* [Deepspeed] zero3 tests band aid (#11235)

* temp band-aid

* style

* Save the Wav2Vec2 processor before training starts (#10910)

Co-authored-by: nithin19 <[email protected]>

* make embeddings plural in warning message (#11228)

* Stale bot updated (#10562)

* Updated stale bot

* Specify issue number

* Remove particular handling of assignees

* Unleash the stalebot

* Remove debug branch

* Close open files to suppress ResourceWarning (#11240)

Co-authored-by: Sudharsan Thirumalai <[email protected]>

* Fix dimention misspellings. (#11238)

* Update modeling_gpt_neo.py

dimention -> dimension

* Update configuration_speech_to_text.py

dimention -> dimension

* Add prefix to examples in model_doc rst (#11226)

* Add prefix to examples in model_doc rst

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [troubleshooting] add 2 points of reference to the offline mode (#11236)

* add 2 points of reference to the offline mode

* link the new doc

* add error message

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* style

* rename

* Trigger CI

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix #10128 (#11248)

* [deepspeed] test on one node 2 gpus max (#11237)

* test on one node 2 gpus max

* fix the other place

* refactor

* fix

* cleanup

* more exact version

* Trainer iterable dataset (#11254)

* IterableDatasetShard

* Test and integration in Trainer

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <[email protected]>

* Style

Co-authored-by: Lysandre Debut <[email protected]>

* Adding pipeline task aliases. (#11247)

* Adding task aliases and adding `token-classification` and
`text-classification` tasks.

* Cleaning docstring.

* Support for set_epoch (#11258)

* Tokenizer fast save (#11234)

* Save fast tokenizers in both formats

* Fix for HerBERT

* Proper fix

* Properly test new behavior

* update dependency_versions_table (#11273)

missed this updating when bumped the version.

* Workflow fixes (#11270)

* Enabling multilingual models for translation pipelines. (#10536)

* [WIP] Enabling multilingual models for translation pipelines.

* decoder_input_ids -> forced_bos_token_id

* Improve docstring.

* Rebase

* Fixing 2 bugs

- Type token_ids coming from `_parse_and_tokenize`
- Wrong index from tgt_lang.

* Fixing black version.

* Adding tests for _build_translation_inputs and add them for all
tokenizers.

* Mbart actually puts the lang code at the end.

* Fixing m2m100.

* Adding TF support to `deep_round`.

* Update src/transformers/pipelines/text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Adding one line comment.

* Fixing M2M100 `_build_translation_input_ids`, and fix the call site.

* Fixing tests + deep_round -> nested_simplify

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix failing workflows

* Trainer support for IterableDataset for evaluation and predict (#11286)

* Bulk of the work

* Polish and tests

* Update QA Trainer

* Avoid breaking the predict method

* Deprecation warnings

* Store real eval dataloder

* Get eval dataset reference before wrap

* move device statements outside if statements (#11292)

* modify double considering special tokens in `language_modeling.py` (#11275)

* Update language_modeling.py

in "class TextDatasetForNextSentencePrediction(Dataset)", double considering "self.tokenizer.num_special_tokens_to_add(pair=True)" 

so, i remove self.block_size, and add parameter for "def create_examples_from_document". like "class LineByLineWithSOPTextDataset" do

* Update language_modeling.py

* [Trainer] fix the placement on device with fp16_full_eval (#11322)

* fix the placement on device with fp16_full_eval

* deepspeed never goes on device

* [Trainer] Add a progress bar for batches skipped (#11324)

* Load checkpoint without re-creating the model (#11318)

* Added translation example script  (#11196)

* initial changes

* modified evaluation

* updated evaluation

* updated evaluation on text translation example script

* added translation example script

* Formatted translation example script

* Reformatted translation example

* Fixed evaluation bug and added support for other tokenisers

* Fixed evaluation bug and added support for other tokenisers

* Added translation example script

* Formatted summarization example script

* Removed typos from summarization example script

* [Generate] Remove outdated code (#11331)

* remove update function

* update

* refactor more

* refactor

* [GPTNeo] create local attention mask ones (#11335)

* create local attention mask ones

* remove old method, address patricks comment

* Update to use datasets remove_cloumns method (#11343)

* Update to use datasets remove_cloumns method

* Quality

* Add an error message that fires when Reformer is not in training mode, but one runs .backward() (#11117)

* Removed `max_length` from being mandatory within `generate`. (#11314)

* Removed `max_length` from being mandatory within `generate`.

- Moving on to fully using `StoppingCriteria` for `greedy` and `sample`
modes.
- `max_length` still used for `beam_search` and `group_beam_search`
(Follow up PR)
- Fixes a bug with MaxLengthStoppingCriteria (we should stop as soon a
we hit the max_length, the comparison needs to be or equal, that affects
the tests).
- Added options to use `logits_processor` and `stopping_criteria`
directly within `generate` function (so some users can define their own
`logits_processor` and `stopping_criteria`).
- Modified the backward compat tests to make sure we issue a warning.

* Fix `max_length` argument in `generate`.

* Moving validate to being functional.

- Renamed `smax_length` to `stoppping_max_length`.

* Removing `logits_processor` and `stopping_criteria` from `generate`
arguments.

* Deepcopy.

* Fix global variable name.

* Honor contributors to models (#11329)

* Honor contributors to models

* Fix typo

* Address review comments

* Add more authors

* [deepspeed] fix resume from checkpoint (#11352)

This PR fixes a bug that most likely somehow got exposed (not caused) by https://github.com/huggingface/transformers/pull/11318 - surprisingly the same test worked just fine before that other PR.

* Examples reorg (#11350)

* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address review comments and clean

Co-authored-by: Lysandre Debut <[email protected]>

* Extract metric_key_prefix during NotebookProgressCallback.on_evaluate (#11347)

* Pass metric_key_prefix as kwarg to on_evaluate

* Replace eval_loss with metric_key_prefix_loss

* Default to "eval" if metric_key_prefix not in kwargs

* Add kwargs to CallbackHandler.on_evaluate signature

* Revert "Add kwargs to CallbackHandler.on_evaluate signature"

This reverts commit 8d4c85ed512f558f7579d36771e907b3379947b7.

* Revert "Pass metric_key_prefix as kwarg to on_evaluate"

This reverts commit 7766bfe2718601230ae593d37b1317bd53cfc075.

* Extract metric_key_prefix from metrics

* [testing doc] bring doc up to date (#11359)

* bring doc up to date

* fix

* Merge new TF example script (#11360)

First of the new and more idiomatic TF examples!

* Remove boiler plate code (#11340)

* remove boiler plate code

* adapt roberta

* correct docs

* finish refactor

* Move old TF text classification script to legacy (#11361)

And update README to explain the work-in-progress!

* [contributing doc] explain/link to good first issue (#11346)

* explain/link to good first issue

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix token_type_ids error for big_bird model. (#11355)

* MOD: fit chinese wwm to new datasets

* MOD: move wwm to new folder

* MOD: formate code

* Styling

* MOD add param and recover trainer

* MOD: add token_type_ids method for big bird

* MOD: format code

* MOD: format code

Co-authored-by: Sylvain Gugger <[email protected]>

* Add huggingface_hub dep for #11328

* Add in torchhub

* [Wav2Vec2] Fix special tokens for Wav2Vec2 tokenizer (#11349)

* fix wav2vec2 tok

* up

* [Flax] Correct typo (#11374)

* finish

* fix copy

* [run_translation.py] fix typo  (#11372)

fix typo

Co-authored-by: johnson <[email protected]>

* Add space (#11373)

* Correctly cast num_train_epochs to int (#11379)

* Fix typo (#11369)

* Fix Trainer with remove_unused_columns=False (#11382)

* Fix Trainer with remove_unused_columns=False

* Typo

* [Flax] Big FlaxBert Refactor (#11364)

* improve flax

* refactor

* typos

* Update src/transformers/modeling_flax_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_flax_utils.py

* fix typo

* improve error tolerance

* typo

* correct nasty saving bug

* fix from pretrained

* correct tree map

* add note

* correct weight tying

* correct typo (#11393)

* correct conversion (#11394)

* Fix typo in text (#11396)

* fixed typos (#11391)

* make blenderbot test slow (#11395)

* Fixed trainer total_flos relaoding in distributed mode (#11383)

* Fixed trainer total_flos relaoding in distributed mode

* logging flos at the end of training

* Trainer push to hub (#11328)

* Initial support for upload to hub

* push -> upload

* Fixes + examples

* Fix torchhub test

* Torchhub test I hate you

* push_model_to_hub -> push_to_hub

* Apply mixin to other pretrained models

* Remove ABC inheritance

* Add tests

* Typo

* Run tests

* Install git-lfs

* Change approach

* Add push_to_hub to all

* Staging test suite

* Typo

* Maybe like this?

* More deps

* Cache

* Adapt name

* Quality

* MOAR tests

* Put it in testing_utils

* Docs + torchhub last hope

* Styling

* Wrong method

* Typos

* Update src/transformers/file_utils.py

Co-authored-by: Julien Chaumond <[email protected]>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* push (#11400)

* added support for exporting of t5 to onnx with past_key_values (#10651)

* Fixing bug in generation (#11297)

When passing `inputs_embeds` and not `input_ids=None` the generation function fails because `input_ids` is created but the function but it should not.

* Style

* Try to trigger failure more

* Wrong branch Sylvain...

* Fix cross-attention head mask for Torch encoder-decoder models (#10605)

* Fix cross-attention head mask for Torch BART models

* Fix head masking for cross-attention module for the following
models: BART, Blenderbot, Blenderbot_small, M2M_100, Marian, MBart,
Pegasus

* Enable test_headmasking for M2M_100 model

* Fix cross_head_mask for FSMT, LED and T5

* This commit fixes `head_mask` for cross-attention modules
in the following models: FSMT, LED, T5

* It also contains some smaller changes in doc so that
it is be perfectly clear the shape of `cross_head_mask`
is the same as of `decoder_head_mask`

* Update template

* Fix template for BartForCausalLM

* Fix cross_head_mask for Speech2Text models

* Fix cross_head_mask in templates

* Fix args order in BartForCausalLM template

* Fix doc in BART templates

* Make more explicit naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Fix doc

* make style quality

* Fix speech2text docstring

* Default to accuracy metric (#11405)

* Enable option for subword regularization in `XLMRobertaTokenizer` (#11149)

* enable subword regularization.

* fix tokenizer storage

* fix docstring formatting

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Stefan Schweter <[email protected]>

* fix docstring formatting

* add test for subword regularization tokenizer

* improve comments of test

* add sp_model_kwargs

* reformat docstring to match the style

* add some more documentation

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* improve docstring

* empty commit to trigger CI

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docstring formatting for sphinx

Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Use 3 workers for torch tests

* documentation linked to the parent class PreTrainedTokenizerFast but it should be the slow tokenizer (#11410)

* Style

* Add head_mask, decoder_head_mask, cross_head_mask to ProphetNet (#9964)

* Add head_mask & decoder_head_mask + some corrections

* Fix head masking for N-grams

* Enable test_headmasking for encoder and decod

* Fix one typo regarding in modeling_propgetnet.py

* Enable test_headmasking for ProphetNetStandaloneDecoderModelTest
and ProphetNetStandaloneEncoderModelTest in test_modeling_prophetnet.py

* make style

* Fix cross_head_mask

* Fix attention head mask naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Still need to merge #10605 to master to pass the tests

* EncoderDecoderConfigs should not create new objects (#11300)

* removes the creation of separate config objects and uses the existing ones instead+overwrite resize_token_embeddings from parent class because it is not working for the EncoderDecoderModel

* rollback to current version of the huggingface master branch

* reworked version that ties the encoder and decoder config of the parent encoderdecoder instance

* overwrite of resize_token_embeddings throws an error now

* review comment suggestion

Co-authored-by: Suraj Patil <[email protected]>

* implemented warning in case encoderdecoder is created with differing configs of encoderdecoderconfig and decoderconfig or encoderconfig

* added test to avoid diverging configs of wrapper class and wrapped classes

* Update src/transformers/models/encoder_decoder/modeling_encoder_decoder.py

* make style

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* updating the checkpoint for GPT2ForSequence Classification to one with classification head (#11434)

* add pooling layer support (#11439)

* make style (#11442)

* Pin black to 20.8.b1

* With style

* Pin black to 21.4b0

* TF BART models - Add `cross_attentions` to model output and fix cross-attention head masking (#10699)

* Add cross_attn_head_mask to BART

* Fix cross_attentions in TFBart-like models

* This commit enables returning of `cross_attentions`
for TFBart-like models

* It also fixes attention head masking in cross-attenion module

* Update TF model templates

* Fix missing , in TF model templates

* Fix typo: congig -> config

* Add basic support for FP16 in SageMaker model parallelism (#11407)

* Add FP16 support for SageMaker MP

* Add print debugs

* Squeeze

* Remove debug statements

* Add defensive check

* Typo

* docs(examples): fix link to TPU launcher script (#11427)

* fix some typos in docs, comments, logging/errors (#11432)

* Pass along seed to DistributedSampler (#11406)

* Pass along seed to DistributedSampler

* Add seed to DistributedLengthGroupedSampler

* Clarify description of the is_split_into_words argument (#11449)

* Improve documentation for is_split_into_words argument

* Change description wording

* [docs] fix invalid class name (#11438)

* fix invalid class name

* proper ref

* proper ref

* make sure to test against the local checkout (#11437)

* Style

* Give each test a different repo name (#11453)

* [Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)

* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom exmaples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval

* Variable Correction for Consistency in Distillation Example (#11444)

As the error comes from the inconsistency of variable meaning number of gpus in parser and its actual usage in the train.py script, 'gpus' and 'n_gpu' respectively,  the correction makes the example work

* [Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)

* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Remove max length beam scorer (#11378)

* removed max_len

* removed max_length from BeamSearchScorer

* correct max length

* finish

* del vim

* finish & add test

Co-authored-by: Patrick von Platen <[email protected]>

* update QuickTour docs to reflect model output object (#11462)

* update docs to reflect model output object

* run make style`

* Finish Making Quick Tour respect the model object (#11467)

* finish quicktour

* fix import

* fix print

* explain config default better

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docs for decoder_input_ids (#11466)

* fix docs for decoder_input_ids

* revert the changes for bart and mbart

* Update min versions in README and add Flax (#11472)

* Update min versions in README and add Flax

* Adapt index

* Update `PreTrainedTokenizerBase` to check/handle batch length for `text_pair` parameter (#11486)

* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <[email protected]>

* add error message

Co-authored-by: Sylvain Gugger <[email protected]>

* fix #1149 (#11493)

* [Flax] Add docstrings & model outputs (#11498)

* add attentions & hidden states

* add model outputs + docs

* finish docs

* finish tests

* finish impl

* del @

* finish

* finish

* correct test

* apply sylvains suggestions

* Update src/transformers/models/bert/modeling_flax_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* simplify more

Co-authored-by: Sylvain Gugger <[email protected]>

* Reformat to make code clearer in tokenizer call (#11497)

* Reformat to make code clearer

* Reformat to make code clearer

* solved coefficient issue for the TF version of gelu_fast (#11514)

Co-authored-by: Michael Benayoun <[email protected]>

* Split checkpoint from model_name_or_path in examples (#11492)

* Split checkpoint from model_name_or_path in examples

* Address review comments

* Address review comments

* Patch notification service

* Pin HuggingFace Hub dependency (#11502)

* correct the dimension comment of matrix multiplication (#11494)

Co-authored-by: Frederik Bode <[email protected]>

* add sp_model_kwargs to unpickle of xlm roberta tok (#11430)

add test for pickle

simplify test

fix test code style

add missing pickle import

fix test

fix test

fix test

* make style (#11520)

* Update README.md (#11489)

Add link to code

* T5 Gradient Checkpointing (#11353)

* Implement gradient checkpoinging for T5Stack

* A bit more robust type checking

* Add `gradient_checkpointing` to T5Config

* Formatting

* Set requires_grad only when training

* None return value will only cause problems when training

* Change the output tuple according to `use_cache`

* Enable gradient checkpointing for the decoder

Squashed commit of the following:

commit 658bdd0bd1215353a8770f558bda2ea69a0ad0c7
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 14:08:17 2021 +0800

    Only set `require_grad` for gradient checkpointing

commit acaeee6b2e675045fb28ce2176444c1d63e908bd
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 13:59:35 2021 +0800

    Make gradient checkpointing work with the decoder

* Formatting

* Adding `AutomaticSpeechRecognitionPipeline`. (#11337)

* Adding `AutomaticSpeechRecognitionPipeline`.

- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
  sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.

Future work:

- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Lysandre Debut <[email protected]>

* Adding docs + main import + type checking + LICENSE.

* Doc style !.

* Fixing TYPE_HINT.

* Specifying waveform shape in the docs.

* Adding asserts + specify in the documentation the shape of the input
np.ndarray.

* Update src/transformers/pipelines/automatic_speech_recognition.py

Co-authored-by: Patrick von Platen <[email protected]>

* Adding require to tests + move the `feature_extractor` doc.

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* Implement Fast Tokenization for Deberta (#11387)

* Accepts BatchEncoding in LengthSampler (#11431)

* Fix do_eval default value in training_args.py (#11511)

* Fix do_eval default value in training_args.py

* Update PULL_REQUEST_TEMPLATE.md

* Update TF text classification example (#11496)

Big refactor, fixes and multi-GPU/TPU support

* reszie token embeds (#11524)

* Run model templates on master (#11527)

* [Examples] Added support for test-file in QA examples with no trainer (#11510)

* added support for test-file

* fixed typo

* added suggested changes

* reformatted code

* modifed files

* fix post processing error

* Trigger CI

* removed extra lines

* Add Stas and Suraj as authors (#11526)

* Improve task summary docs (#11513)

* fix task summary docs

* refactor to use model.config.id2label instead of list

* fix nit

* Update docs/source/task_summary.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [debug utils] activation/weights underflow/overflow detector (#11274)

* sync

* add activation overflow debug utility

* cleanup

* document detect_overflow

* import torch

* add deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* convert to rst, add note

* add class

* fix docs

* improve the doc

* rework to dump a lot more info about each frame

* complete expansion

* cleanup

* format

* cleanup

* doesn't have to be transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* wrap long line

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] fp32 support (#11499)

* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made

* Fixed docs for the shape of `scores` in `generate()` (#10057)

* Fixed the doc for the shape of return scores tuples in generation_utils.py.

* Fix the output shape of `scores` for `DecoderOnlyOutput`.

* style fix

* Fix examples in M2M100 docstrings (#11540)

Replaces `tok` with `tokenizer` so examples can run with copy-paste

* [Flax BERT/Roberta] few small fixes (#11558)

* small fixes

* style

* [Wav2Vec2] Fix convert (#11562)

* push

* small change

* correct other typo

* Remove `datasets` submodule. (#11563)

* fix the mlm longformer example by changing [MASK] to <mask> (#11559)

* Add LUKE (#11223)

* Rebase with master

* Minor bug fix in docs

* Copy files from adding_luke_v2 and improve docs

* change the default value of use_entity_aware_attention to True

* remove word_hidden_states

* fix head models

* fix tests

* fix the conversion script

* add integration tests for the pretrained large model

* improve docstring

* Improve docs, make style

* fix _init_weights for pytorch 1.8

* improve docs

* fix tokenizer to construct entity sequence with [MASK] entity when entities=None

* Make fix-copies

* Make style & quality

* Bug fixes

* Add LukeTokenizer to init

* Address most comments by @patil-suraj and @LysandreJik

* rename _compute_extended_attention_mask to get_extended_attention_mask

* add comments to LukeSelfAttention

* fix the documentation of the tokenizer

* address comments by @patil-suraj, @LysandreJik, and @sgugger

* improve docs

* Make style, quality and fix-copies

* Improve docs

* fix docs

* add "entity_span_classification" task

* update example code for LukeForEntitySpanClassification

* improve docs

* improve docs

* improve the code example in luke.rst

* rename the classification layer in LukeForEntityClassification from typing to classifier

* add bias to the classifier in LukeForEntitySpanClassification

* update docs to use fine-tuned hub models in code examples of the head models

* update the example sentences

* Make style & quality

* Add require_torch to tokenizer tests

* Add require_torch to tokenizer tests

* Address comments by @sgugger and add community notebooks

* Make fix-copies

Co-authored-by: Ikuya Yamada <[email protected]>

* [Wav2vec2] Fixed tokenization mistakes while adding single-char tokens to tokenizer (#11538)

* Fixed tokenization mistakes while adding single-char tokens to tokenizer

* Added tests and Removed unnecessary comments.

* finalize wav2vec2 tok

* add more aggressive tests

* Apply suggestions from code review

* fix useless import

Co-authored-by: Patrick von Platen <[email protected]>

* Fix metric computation in `run_glue_no_trainer` (#11569)

* Fixes a useless warning. (#11566)

Fixes #11525

* Accumulate opt state dict on do_rank 0 (#11481)

* Update training tutorial (#11533)

* Update training tutorial

* Apply suggestions from code review

Co-authored-by: Hamel Husain <[email protected]>

* Address review comments

* Update docs/source/training.rst

Co-authored-by: Lysandre Debut <[email protected]>

* More review comments

* Last review comments

Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* fix resize_token_embeddings (#11572)

* Add multi-class, multi-label and regression to transformers (#11012)

* add to  bert

* review comments

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* self.config.problem_type

* fix style

* fix

* fin

* fix

* update doc

* fix

* test

* Test more problem types

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix

* remove

* fix

* quality

* make fix-copies

* remove test

Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* Enable added tokens (#11325)

* Fix tests

* Reorganize

* Update tests/test_modeling_mobilebert.py

* Remove unnecessary addition

* Make quality scripts work when one backend is missing. (#11573)

* Make quality scripts work when one backend is missing.

* Check env variable is properly set

* Add default

* With print statements

* Fix typo

* Set env variable

* Remove debug code

* [FlaxRoberta] Add FlaxRobertaModels & adapt run_mlm_flax.py (#11470)

* add flax roberta

* make style

* correct initialiazation

* modify model to save weights

* fix copied from

* fix copied from

* correct some more code

* add more roberta models

* Apply suggestions from code review

* merge from master

* finish

* finish docs

Co-authored-by: Patrick von Platen <[email protected]>

* Removes SageMakerTrainer code but keeps class as wrapper (#11587)

* removed all old code

* make quality

* [Flax] Add Electra models (#11426)

* add electra model to flax

* Remove Electra Next Sentence Prediction model added by mistake

* fix parameter sharing and loosen equality threshold

* fix styling issues

* add mistaken removen imports

* fix electra table

* Add FlaxElectra to automodels and fixe docs

* fix issues pointed out the PR

* fix flax electra to comply with latest changes

* remove stale class

* add copied from

Co-authored-by: Patrick von Platen <[email protected]>

* Reproducible checkpoint (#11582)

* Set generator in dataloader

* Use generator in all random samplers

* Checkpoint all RNG states

* Final version

* Quality

* Test

* Address review comments

* Quality

* Remove debug util

* Add python and numpy RNGs

* Split states in different files in distributed

* Quality

* local_rank for TPUs

* Only use generator when accepted

* Add test

* Set seed to avoid flakiness

* Make test less flaky

* Quality

* [trainer] document resume randomness (#11588)

* document resume randomness

* fix link

* reword

* fix

* reword

* style

* copies need to be fixed too (#11585)

* add importlib_metadata and huggingface_hub as dependency in the conda recipe (#11591)

* add importlib_metadata as dependency (#11490)

Co-authored-by: Deepali Chourasia <[email protected]>

* add huggingface_hub dependency

Co-authored-by: Deepali Chourasia <[email protected]>

* Skip Funnel test

* Pytorch - Lazy initialization of models (#11471)

* lazy_init_weights

* remove ipdb

* save int

* add necessary code

* remove unnecessary utils

* Update src/transformers/models/t5/modeling_t5.py

* clean

* add tests

* correct

* finish tests

* finish tests

* fix some more tests

* fix xlnet & transfo-xl

* fix more tests

* make sure tests are independent

* fix tests more

* finist tests

* final touches

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* clean tests

* give arg positive name

* add more mock weights to xlnet

Co-authored-by: Stas Bekman <[email protected]>

* Accept tensorflow-rocm package when checking TF availability (#11595)

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Philipp Schmid <[email protected]>
Co-authored-by: JohnnyC08 <[email protected]>
Co-authored-by: Hemil Desai <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: cchen-dialpad <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Lysandre <[email protected]>
Co-authored-by: cronoik <[email protected]>
Co-authored-by: Joe Davison <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: versis <[email protected]>
Co-authored-by: Eren Şahin <[email protected]>
Co-authored-by: Amala Deshmukh <[email protected]>
Co-authored-by: konstin <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: SHYAM SUNDER KUMAR <[email protected]>
Co-authored-by: Leo Gao <[email protected]>
Co-authored-by: Vasudev Gupta <[email protected]>
Co-authored-by: Jannis Born <[email protected]>
Co-authored-by: Yusuke Mori <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Andrea Cappelli <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: Keisuke Hirota <[email protected]>
Co-authored-by: Saviour Owolabi <[email protected]>
Co-authored-by: Kevin Canwen Xu <[email protected]>
Co-authored-by: Masatoshi TSUCHIYA <[email protected]>
Co-authored-by: fghuman <[email protected]>
Co-authored-by: Amna <[email protected]>
Co-authored-by: Takuya Makino <[email protected]>
Co-authored-by: calpt <[email protected]>
Co-authored-by: Ceyda Cinarel <[email protected]>
Co-authored-by: Nithin Holla <[email protected]>
Co-authored-by: nithin19 <[email protected]>
Co-authored-by: Joel Stremmel <[email protected]>
Co-authored-by: Sudharsan S T <[email protected]>
Co-authored-by: Sudharsan Thirumalai <[email protected]>
Co-authored-by: Thomas Wood <[email protected]>
Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: e <[email protected]>
Co-authored-by: TAE YOUNGDON <[email protected]>
Co-authored-by: rajvi-k <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Matt <[email protected]>
Co-authored-by: wlhgtc <[email protected]>
Co-authored-by: johnson7788 <[email protected]>
Co-authored-by: johnson <[email protected]>
Co-authored-by: PenutChen <[email protected]>
Co-authored-by: Max Del <[email protected]>
Co-authored-by: Yoshitomo Matsubara <[email protected]>
Co-authored-by: Teven <[email protected]>
Co-authored-by: Kiran R <[email protected]>
Co-authored-by: Nicola De Cao <[email protected]>
Co-authored-by: Daniel Stancl <[email protected]>
Co-authored-by: Philip May <[email protected]>
Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: abiolaTresor <[email protected]>
Co-authored-by: Amine Abdaoui <[email protected]>
Co-authored-by: LSinev <[email protected]>
Co-authored-by: Kostas Stathoulopoulos <[email protected]>
Co-authored-by: Bhadresh Savani <[email protected]>
Co-authored-by: Jaimeen Ahn <[email protected]>
Co-authored-by: Ashwin Geet D'Sa <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Manuel Romero <[email protected]>
Co-authored-by: CeShine Lee <[email protected]>
Co-authored-by: Shubham Sanghavi <[email protected]>
Co-authored-by: bonniehyeon <[email protected]>
Co-authored-by: jingyihe <[email protected]>
Co-authored-by: Ikuya Yamada <[email protected]>
Co-authored-by: Muktan <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Patrick Fernandes <[email protected]>
Co-authored-by: Deepali <[email protected]>
Co-authored-by: Deepali Chourasia <[email protected]>
Co-authored-by: Mats Sjöberg <[email protected]>
will-rice added a commit to will-rice/transformers that referenced this pull request Jun 16, 2021
* Add more metadata to the user agent (#10972)

* Add more metadata to the user agent

* Fix typo

* Use DISABLE_TELEMETRY

* Address review comments

* Use global env

* Add clean envs on circle CI

* Enforce string-formatting with f-strings (#10980)

* First third

* Styling and fix mistake

* Quality

* All the rest

* Treat %s and %d

* typo

* Missing )

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* add notebook (#10995)

* Merge trainers (#10975)

* Replace is_sagemaker_distributed_available

* Merge SageMakerTrainer into Trainer

* Test with shorter condition

* Put back deleted line

* Deprecate SageMakerTrainer and SageMakerTrainingArguments

* Apply suggestions from code review

Co-authored-by: Philipp Schmid <[email protected]>

Co-authored-by: Philipp Schmid <[email protected]>

* add blog to docs (#10997)

* Update training_args.py (#11000)

In the group by length documentation length is misspelled as legnth

* Add `examples/language_modeling/run_mlm_no_trainer.py` (#11001)

* Add initial script for finetuning MLM models with accelerate

* Add evaluation metric calculation

* Fix bugs

* Use no_grad on evaluation

* update script docstring

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* PR feedback

* Fix CI failure

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix Adafactor documentation (recommend correct settings) (#10526)

* Update optimization.py

Fix documentation to reflect optimal settings for Adafactor

* update and expand on the recommendations

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* flip scale_parameter to True for the 2nd recommendatoin

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Improve the speed of adding tokens from added_tokens.json (#10780)

* use bisect to add one token to unique_no_split_tokens

* fix style

* Add Vision Transformer and ViTFeatureExtractor (#10950)

* Squash all commits into one

* Update ViTFeatureExtractor to use image_utils instead of torchvision

* Remove torchvision and add Pillow

* Small docs improvement

* Address most comments by @sgugger

* Fix tests

* Clean up conversion script

* Pooler first draft

* Fix quality

* Improve conversion script

* Make style and quality

* Make fix-copies

* Minor docs improvements

* Should use fix-copies instead of manual handling

* Revert "Should use fix-copies instead of manual handling"

This reverts commit fd4e591bce4496d41406425c82606a8fdaf8a50b.

* Place ViT in alphabetical order

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* DebertaTokenizer Rework closes #10258 (#10703)

* closes #10258

* typo

* reworked deberta test

* implemented the comments from BigBird01 regarding sequence pair encoding of deberta

* Update style

* VOCAB_FILES_NAMES is now a oneliner as suggested by @sgugger

Co-authored-by: Sylvain Gugger <[email protected]>

* added #fmt: on as requested by @sgugger

* Style

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* minor typo fix

*negative* log-likelihood

* [doc] no more bucket

* added new notebook and merge of trainer (#11015)

* added new notebook and merge of trainer

* Update docs/source/sagemaker.md

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* fixed typo: logging instead of logger (#11025)

* Add a script to check inits are consistent (#11024)

* s|Pretrained|PreTrained| (#11048)

* [doc] update code-block rendering (#11053)

double : prevents code-block section to be rendered, so made it single :

* Pin docutils (#11062)

* Pin docutils

* Versions table

* Remove unnecessary space (#11060)

* Some models have no tokenizers (#11064)

* Refactor AutoModel classes and add Flax Auto classes (#11027)

* Refactor AutoModel classes and add Flax Auto classes

* Add new objects to the init

* Fix hubconf and sort models

* Fix TF tests

* Missing coma

* Update src/transformers/models/auto/auto_factory.py

Co-authored-by: Lysandre Debut <[email protected]>

* Fix init

* Fix dummies

* Other init to fix

Co-authored-by: Lysandre Debut <[email protected]>

* Documentation about loading a fast tokenizer within Transformers (#11029)

* Documentation about loading a fast tokenizer within Transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Add example for registering callbacks with trainers (#10928)

* Add example for callback registry

Resolves: #9036

* Update callback registry documentation

* Added comments for other ways to register callback

* Add `examples/language_modeling/run_clm_no_trainer.py` (#11026)

* Initial draft for clm no trainer

* Remove unwanted args

* Fix bug

* Update examples/language-modeling/run_clm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace pkg_resources with importlib_metadata (#11061)

* Replace pkg_resources with importlib_metadata

Fixes #10964. The other reason for this change is that pkg_resources has been [deprecated](https://github.com/pypa/setuptools/commit/8fe85c22cee7fde5e6af571b30f864bad156a010) in favor of importlib_metadata.

* Reduce to a single importlib_metadata import switch

* Trigger CI

Co-authored-by: Stas Bekman <[email protected]>

* Add center_crop to ImageFeatureExtractoMixin (#11066)

* Document common config attributes (#11070)

* Fix distributed gather for tuples of tensors of varying sizes (#11071)

* Make a base init in FeatureExtractionMixin (#11074)

* Add Readme for language modeling scripts with accelerate (#11073)

* HF emoji unicode doesn't work in console (#11081)

It doesn't look like using 🤗 is a great idea for printing to console. See attachment.

This PR proposes to replace 🤗 with "HuggingFace" for an exception message.

@LysandreJik

* Link to new blog

* added social thumbnail for docs (#11083)

* added new merged Trainer test (#11090)

* [WIP] GPT Neo cleanup (#10985)

* better names

* add attention mixin

* all slow tests in one class

* make helper methods static so we can test

* add local attention tests

* better names

* doc

* apply review suggestions

* Release v4.5.0

* Development on v4.6.0dev0

* [doc] gpt-neo (#11098)

make the example work

* Auto feature extractor (#11097)

* AutoFeatureExtractor

* Init and first tests

* Tests

* Damn you gitignore

* Quality

* Defensive test for when not all backends are here

* Use pattern for Speech2Text models

* accelerate question answering examples with no trainer (#11091)

* accelerate question answering examples with no trainer

* removed train and eval flags also fixed fill np array function

* Update examples/question-answering/run_qa_beam_search_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update examples/question-answering/run_qa_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Style

* dead link fixed (#11103)

* GPTNeo: handle padded wte (#11079)

* GPTNeo: handle padded wte

* Switch to config.vocab_size

* apply review suggestion

Co-authored-by: Suraj Patil <[email protected]>

* fix: The 'warn' method is deprecated (#11105)

* The 'warn' method is deprecated

* fix test

* [examples] fix white space (#11099)

these get concatenated without whitespace, so fix it

* Dummies multi backend (#11100)

* Replaces requires_xxx by one generic method

* Quality and update check_dummies

* Fix inits check

* Post-merge cleanup

* Some styling of the training table in Notebooks (#11118)

* Adds a note to resize the token embedding matrix when adding special … (#11120)

* Adds a note to resize the token embedding matrix when adding special tokens

* Remove superfluous space

* fix tests (#11109)

* [versions] handle version requirement ranges (#11110)

* handle version requirement ranges

* add mixed requirement test

* cleanup

* Adds use_auth_token with pipelines (#11123)

* added model_kwargs to infer_framework_from_model

* added model_kwargs to tokenizer

* added use_auth_token as named parameter

* added dynamic get for use_auth_token

* Fix and refactor check_repo (#11127)

* Fix typing error in Trainer class (prediction_step) (#11138)

* fix: docstrings in prediction_step

* ci: Satisfy line length requirements

* ci: character length requirements

* Typo fix of the name of BertLMHeadModel in BERT doc (#11133)

* [run_clm] clarify why we get the tokenizer warning on long input (#11145)

* clarify why we get the warning here

* Update examples/language-modeling/run_clm.py

Co-authored-by: Sylvain Gugger <[email protected]>

* wording

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] ZeRO Stage 3 (#10753)

* synced gpus

* fix

* fix

* need to use t5-small for quality tests

* notes

* complete merge

* fix a disappearing std stream problem

* start zero3 tests

* wip

* tune params

* sorting out the pre-trained model loading

* reworking generate loop wip

* wip

* style

* fix tests

* split the tests

* refactor tests

* wip

* parameterized

* fix

* workout the resume from non-ds checkpoint pass + test

* cleanup

* remove no longer needed code

* split getter/setter functions

* complete the docs

* suggestions

* gpus and their compute capabilities link

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* style

* remove invalid paramgd

* automatically configure zero3 params that rely on hidden size

* make _get_resized_embeddings zero3-aware

* add test exercising resize_token_embeddings()

* add docstring

Co-authored-by: Lysandre Debut <[email protected]>

* Add nvidia megatron models (#10911)

* Add support for NVIDIA Megatron models

* Add support for NVIDIA Megatron GPT2 and BERT

Add the megatron_gpt2 model. That model reuses the existing GPT2 model. This
commit includes a script to convert a Megatron-GPT2 checkpoint downloaded
from NVIDIA GPU Cloud. See examples/megatron-models/README.md for details.

Add the megatron_bert model. That model is implemented as a modification of
the existing BERT model in Transformers. This commit includes a script to
convert a Megatron-BERT checkpoint downloaded from NVIDIA GPU Cloud. See
examples/megatron-models/README.md for details.

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Remove model.half in tests + add "# Copied ..."

Remove the model.half() instruction which makes tests fail on the CPU.

Add a comment "# Copied ..." before many classes in the model to enable automatic
tracking in CI between the new Megatron classes and the original Bert ones.

* Fix issues

* Fix Flax/TF tests

* Fix copyright

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update docs/source/model_doc/megatron_bert.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update docs/source/model_doc/megatron_gpt2.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/__init__.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Resolve most of 'sgugger' comments

* Fix conversion issue + Run make fix-copies/quality/docs

* Apply suggestions from code review

* Causal LM & merge

* Fix init

* Add CausalLM to last auto class

Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* [trainer] solve "scheduler before optimizer step" warning (#11144)

* solve "scheduler before optimizer step" warning

* style

* correct the state evaluation test

* Add fairscale and deepspeed back to the CI (#11147)

* Add fairscale and deepspeed back to the CI

* Add deepspeed to single GPU tests

* Updates SageMaker docs for updating DLCs (#11140)

* Don't duplicate logs in TensorBoard and handle --use_env (#11141)

* Run mlm pad to multiple for fp16 (#11128)

* Add mlm collator pad to multiple option (#10627)

* Use padding to 8x in run mlm (#10627)

* [tests] relocate core integration tests (#11146)

* relocate core integration tests

* add sys.path context manager

* cleanup

* try

* try2

* fix path

* doc

* style

* add dep

* add 2 more deps

* [setup] extras[docs] must include 'all' (#11148)

* extras[doc] must include 'all'

* fix

* better

* regroup

* Add support for multiple models for one config in auto classes (#11150)

* Add support for multiple models for one config in auto classes

* Use get_values everywhere

* Prettier doc

* [setup] make fairscale and deepspeed setup extras (#11151)

* make fairscale and deepspeed setup extras

* fix default

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* no reason not to ask for the good version

* update the CIs

Co-authored-by: Sylvain Gugger <[email protected]>

* Skip Megatron tests for now

* typo (#11152)

* typo

* style

* [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips (#11142)

* Add Wav2Vec Inference notebook

* Update docs/source/community.md

Co-authored-by: Suraj Patil <[email protected]>

* Fix LogitsProcessor documentation (#11130)

* Change duplicated LogitsProcessor to LogitsWarper in LogitsProcessorList document

* Write more detailed information about LogitsProcessor's scores argument

* apply suggestion from review

* style

Co-authored-by: Suraj Patil <[email protected]>

* Update README.md (#11161)

Corrected a typo ('Downlowd' to 'Download')

* Make `get_special_tokens_mask` consider all tokens (#11163)

* Add a special tokenizer for CPM model (#11068)

* Add a special tokenizer for CPM model

* make style

* fix

* Add docs

* styles

* cpm doc

* fix ci

* fix the overview

* add test

* make style

* typo

* Custom tokenizer flag

* Add REAMDE.md

Co-authored-by: Lysandre <[email protected]>

* [examples/translation] support mBART-50 and M2M100 fine-tuning (#11170)

* keep a list of multilingual tokenizers

* add forced_bos_token argument

* [examples run_clm] fix _LazyModule hasher error (#11168)

* fix _LazyModule hasher error

* reword

* added json dump and extraction of train run time (#11167)

* added json dump and extraction of train run time

* make style happy

* Fix Typo

* Reactivate Megatron tests an use less workers

* Minor typos fixed (#11182)

* Fix style

* model_path should be ignored as the checkpoint path (#11157)

* model_path is refered as the path of the trainer, and should be ignored as the checkpoint path.

* Improved according to Sgugger's comment.

* Added documentation for data collator. (#10941)

* Added documentation for data collator.

* Update docs/source/data_collator.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Added documentation for data collator.

* Added documentation for the data collator.

* Merge branch 'doc_DataCollator' of C:\Users\mahii\PycharmProjects\transformers with conflicts.

* Update documentation for the data collator.

* Update documentation for the data collator.

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Amna <[email protected]>

* Fix typo (#11188)

* Add DeiT (PyTorch) (#11056)

* First draft of deit

* More improvements

* Remove DeiTTokenizerFast from init

* Conversion script works

* Add DeiT to ViT conversion script

* Add tests, add head model, add support for deit in vit conversion script

* Update model checkpoint names

* Update image_mean and image_std, set resample to bicubic

* Improve docs

* Docs improvements

* Add DeiTForImageClassificationWithTeacher to init

* Address comments by @sgugger

* Improve feature extractors

* Make fix-copies

* Minor fixes

* Address comments by @patil-suraj

* All models uploaded

* Fix tests

* Remove labels argument from DeiTForImageClassificationWithTeacher

* Fix-copies, style and quality

* Fix tests

* Fix typo

* Multiple docs improvements

* More docs fixes

* Replaced `which` with `who` (#11183)

* Import torch.utils.checkpoint in ProphetNet (#11214)

* Sagemaker test docs update for framework upgrade (#11206)

* increased train_runtime for model parallelism

* added documentation for framework upgrade

* Use MSELoss in (M)BartForSequenceClassification (#11178)

* wav2vec2 converter: create the proper vocab.json while converting fairseq wav2vec2 finetuned model (#11041)

* add vocab while converting wav2vec2 original finetuned model

* check save directory exists

* return_attention_mask fix

* quality

* Add Matt as the TensorFlow reference (#11212)

* Fix GPT-2 warnings (#11213)

* Fix GPT-2 warnings

* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Stas Bekman <[email protected]>

Co-authored-by: Stas Bekman <[email protected]>

* fix docstrings (#11221)

* Add documentation for BertJapanese (#11219)

* Start writing BERT-Japanese doc

* Fix typo, Update toctree

* Modify model file to use comment for document, Add examples

* Clean bert_japanese by make style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Split a big code block into two

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Add prefix >>> to all lines in code blocks

* Clean bert_japanese by make fixup

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace error by warning when loading an architecture in another (#11207)

* Replace error by warning when loading an architecture in another

* Style

* Style again

* Add a test

* Adapt old test

* Document v4.5.1

* Refactor GPT2 (#11225)

* refactor GPT2

* fix mlp and head pruning

* address Sylvains comments

* apply suggestion from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Doc check: a bit of clean up (#11224)

* added cache_dir=model_args.cache_dir to all example with cache_dir arg (#11220)

* Avoid using no_sync on SageMaker DP (#11229)

* Indent code block in the documentation (#11233)

* Indent code block

* Indent code blocks version 2

* Quality

* Run CI on deepspeed and fairscale (#11172)

* Run CI on deepspeed and fairscale

* Test it on this branch :)

* Rename

* Update the CI image

* [Deepspeed] zero3 tests band aid (#11235)

* temp band-aid

* style

* Save the Wav2Vec2 processor before training starts (#10910)

Co-authored-by: nithin19 <[email protected]>

* make embeddings plural in warning message (#11228)

* Stale bot updated (#10562)

* Updated stale bot

* Specify issue number

* Remove particular handling of assignees

* Unleash the stalebot

* Remove debug branch

* Close open files to suppress ResourceWarning (#11240)

Co-authored-by: Sudharsan Thirumalai <[email protected]>

* Fix dimention misspellings. (#11238)

* Update modeling_gpt_neo.py

dimention -> dimension

* Update configuration_speech_to_text.py

dimention -> dimension

* Add prefix to examples in model_doc rst (#11226)

* Add prefix to examples in model_doc rst

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [troubleshooting] add 2 points of reference to the offline mode (#11236)

* add 2 points of reference to the offline mode

* link the new doc

* add error message

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* style

* rename

* Trigger CI

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix #10128 (#11248)

* [deepspeed] test on one node 2 gpus max (#11237)

* test on one node 2 gpus max

* fix the other place

* refactor

* fix

* cleanup

* more exact version

* Trainer iterable dataset (#11254)

* IterableDatasetShard

* Test and integration in Trainer

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <[email protected]>

* Style

Co-authored-by: Lysandre Debut <[email protected]>

* Adding pipeline task aliases. (#11247)

* Adding task aliases and adding `token-classification` and
`text-classification` tasks.

* Cleaning docstring.

* Support for set_epoch (#11258)

* Tokenizer fast save (#11234)

* Save fast tokenizers in both formats

* Fix for HerBERT

* Proper fix

* Properly test new behavior

* update dependency_versions_table (#11273)

missed this updating when bumped the version.

* Workflow fixes (#11270)

* Enabling multilingual models for translation pipelines. (#10536)

* [WIP] Enabling multilingual models for translation pipelines.

* decoder_input_ids -> forced_bos_token_id

* Improve docstring.

* Rebase

* Fixing 2 bugs

- Type token_ids coming from `_parse_and_tokenize`
- Wrong index from tgt_lang.

* Fixing black version.

* Adding tests for _build_translation_inputs and add them for all
tokenizers.

* Mbart actually puts the lang code at the end.

* Fixing m2m100.

* Adding TF support to `deep_round`.

* Update src/transformers/pipelines/text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Adding one line comment.

* Fixing M2M100 `_build_translation_input_ids`, and fix the call site.

* Fixing tests + deep_round -> nested_simplify

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix failing workflows

* Trainer support for IterableDataset for evaluation and predict (#11286)

* Bulk of the work

* Polish and tests

* Update QA Trainer

* Avoid breaking the predict method

* Deprecation warnings

* Store real eval dataloder

* Get eval dataset reference before wrap

* move device statements outside if statements (#11292)

* modify double considering special tokens in `language_modeling.py` (#11275)

* Update language_modeling.py

in "class TextDatasetForNextSentencePrediction(Dataset)", double considering "self.tokenizer.num_special_tokens_to_add(pair=True)" 

so, i remove self.block_size, and add parameter for "def create_examples_from_document". like "class LineByLineWithSOPTextDataset" do

* Update language_modeling.py

* [Trainer] fix the placement on device with fp16_full_eval (#11322)

* fix the placement on device with fp16_full_eval

* deepspeed never goes on device

* [Trainer] Add a progress bar for batches skipped (#11324)

* Load checkpoint without re-creating the model (#11318)

* Added translation example script  (#11196)

* initial changes

* modified evaluation

* updated evaluation

* updated evaluation on text translation example script

* added translation example script

* Formatted translation example script

* Reformatted translation example

* Fixed evaluation bug and added support for other tokenisers

* Fixed evaluation bug and added support for other tokenisers

* Added translation example script

* Formatted summarization example script

* Removed typos from summarization example script

* [Generate] Remove outdated code (#11331)

* remove update function

* update

* refactor more

* refactor

* [GPTNeo] create local attention mask ones (#11335)

* create local attention mask ones

* remove old method, address patricks comment

* Update to use datasets remove_cloumns method (#11343)

* Update to use datasets remove_cloumns method

* Quality

* Add an error message that fires when Reformer is not in training mode, but one runs .backward() (#11117)

* Removed `max_length` from being mandatory within `generate`. (#11314)

* Removed `max_length` from being mandatory within `generate`.

- Moving on to fully using `StoppingCriteria` for `greedy` and `sample`
modes.
- `max_length` still used for `beam_search` and `group_beam_search`
(Follow up PR)
- Fixes a bug with MaxLengthStoppingCriteria (we should stop as soon a
we hit the max_length, the comparison needs to be or equal, that affects
the tests).
- Added options to use `logits_processor` and `stopping_criteria`
directly within `generate` function (so some users can define their own
`logits_processor` and `stopping_criteria`).
- Modified the backward compat tests to make sure we issue a warning.

* Fix `max_length` argument in `generate`.

* Moving validate to being functional.

- Renamed `smax_length` to `stoppping_max_length`.

* Removing `logits_processor` and `stopping_criteria` from `generate`
arguments.

* Deepcopy.

* Fix global variable name.

* Honor contributors to models (#11329)

* Honor contributors to models

* Fix typo

* Address review comments

* Add more authors

* [deepspeed] fix resume from checkpoint (#11352)

This PR fixes a bug that most likely somehow got exposed (not caused) by https://github.com/huggingface/transformers/pull/11318 - surprisingly the same test worked just fine before that other PR.

* Examples reorg (#11350)

* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address review comments and clean

Co-authored-by: Lysandre Debut <[email protected]>

* Extract metric_key_prefix during NotebookProgressCallback.on_evaluate (#11347)

* Pass metric_key_prefix as kwarg to on_evaluate

* Replace eval_loss with metric_key_prefix_loss

* Default to "eval" if metric_key_prefix not in kwargs

* Add kwargs to CallbackHandler.on_evaluate signature

* Revert "Add kwargs to CallbackHandler.on_evaluate signature"

This reverts commit 8d4c85ed512f558f7579d36771e907b3379947b7.

* Revert "Pass metric_key_prefix as kwarg to on_evaluate"

This reverts commit 7766bfe2718601230ae593d37b1317bd53cfc075.

* Extract metric_key_prefix from metrics

* [testing doc] bring doc up to date (#11359)

* bring doc up to date

* fix

* Merge new TF example script (#11360)

First of the new and more idiomatic TF examples!

* Remove boiler plate code (#11340)

* remove boiler plate code

* adapt roberta

* correct docs

* finish refactor

* Move old TF text classification script to legacy (#11361)

And update README to explain the work-in-progress!

* [contributing doc] explain/link to good first issue (#11346)

* explain/link to good first issue

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix token_type_ids error for big_bird model. (#11355)

* MOD: fit chinese wwm to new datasets

* MOD: move wwm to new folder

* MOD: formate code

* Styling

* MOD add param and recover trainer

* MOD: add token_type_ids method for big bird

* MOD: format code

* MOD: format code

Co-authored-by: Sylvain Gugger <[email protected]>

* Add huggingface_hub dep for #11328

* Add in torchhub

* [Wav2Vec2] Fix special tokens for Wav2Vec2 tokenizer (#11349)

* fix wav2vec2 tok

* up

* [Flax] Correct typo (#11374)

* finish

* fix copy

* [run_translation.py] fix typo  (#11372)

fix typo

Co-authored-by: johnson <[email protected]>

* Add space (#11373)

* Correctly cast num_train_epochs to int (#11379)

* Fix typo (#11369)

* Fix Trainer with remove_unused_columns=False (#11382)

* Fix Trainer with remove_unused_columns=False

* Typo

* [Flax] Big FlaxBert Refactor (#11364)

* improve flax

* refactor

* typos

* Update src/transformers/modeling_flax_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_flax_utils.py

* fix typo

* improve error tolerance

* typo

* correct nasty saving bug

* fix from pretrained

* correct tree map

* add note

* correct weight tying

* correct typo (#11393)

* correct conversion (#11394)

* Fix typo in text (#11396)

* fixed typos (#11391)

* make blenderbot test slow (#11395)

* Fixed trainer total_flos relaoding in distributed mode (#11383)

* Fixed trainer total_flos relaoding in distributed mode

* logging flos at the end of training

* Trainer push to hub (#11328)

* Initial support for upload to hub

* push -> upload

* Fixes + examples

* Fix torchhub test

* Torchhub test I hate you

* push_model_to_hub -> push_to_hub

* Apply mixin to other pretrained models

* Remove ABC inheritance

* Add tests

* Typo

* Run tests

* Install git-lfs

* Change approach

* Add push_to_hub to all

* Staging test suite

* Typo

* Maybe like this?

* More deps

* Cache

* Adapt name

* Quality

* MOAR tests

* Put it in testing_utils

* Docs + torchhub last hope

* Styling

* Wrong method

* Typos

* Update src/transformers/file_utils.py

Co-authored-by: Julien Chaumond <[email protected]>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* push (#11400)

* added support for exporting of t5 to onnx with past_key_values (#10651)

* Fixing bug in generation (#11297)

When passing `inputs_embeds` and not `input_ids=None` the generation function fails because `input_ids` is created but the function but it should not.

* Style

* Try to trigger failure more

* Wrong branch Sylvain...

* Fix cross-attention head mask for Torch encoder-decoder models (#10605)

* Fix cross-attention head mask for Torch BART models

* Fix head masking for cross-attention module for the following
models: BART, Blenderbot, Blenderbot_small, M2M_100, Marian, MBart,
Pegasus

* Enable test_headmasking for M2M_100 model

* Fix cross_head_mask for FSMT, LED and T5

* This commit fixes `head_mask` for cross-attention modules
in the following models: FSMT, LED, T5

* It also contains some smaller changes in doc so that
it is be perfectly clear the shape of `cross_head_mask`
is the same as of `decoder_head_mask`

* Update template

* Fix template for BartForCausalLM

* Fix cross_head_mask for Speech2Text models

* Fix cross_head_mask in templates

* Fix args order in BartForCausalLM template

* Fix doc in BART templates

* Make more explicit naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Fix doc

* make style quality

* Fix speech2text docstring

* Default to accuracy metric (#11405)

* Enable option for subword regularization in `XLMRobertaTokenizer` (#11149)

* enable subword regularization.

* fix tokenizer storage

* fix docstring formatting

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Stefan Schweter <[email protected]>

* fix docstring formatting

* add test for subword regularization tokenizer

* improve comments of test

* add sp_model_kwargs

* reformat docstring to match the style

* add some more documentation

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* improve docstring

* empty commit to trigger CI

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docstring formatting for sphinx

Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Use 3 workers for torch tests

* documentation linked to the parent class PreTrainedTokenizerFast but it should be the slow tokenizer (#11410)

* Style

* Add head_mask, decoder_head_mask, cross_head_mask to ProphetNet (#9964)

* Add head_mask & decoder_head_mask + some corrections

* Fix head masking for N-grams

* Enable test_headmasking for encoder and decod

* Fix one typo regarding in modeling_propgetnet.py

* Enable test_headmasking for ProphetNetStandaloneDecoderModelTest
and ProphetNetStandaloneEncoderModelTest in test_modeling_prophetnet.py

* make style

* Fix cross_head_mask

* Fix attention head mask naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Still need to merge #10605 to master to pass the tests

* EncoderDecoderConfigs should not create new objects (#11300)

* removes the creation of separate config objects and uses the existing ones instead+overwrite resize_token_embeddings from parent class because it is not working for the EncoderDecoderModel

* rollback to current version of the huggingface master branch

* reworked version that ties the encoder and decoder config of the parent encoderdecoder instance

* overwrite of resize_token_embeddings throws an error now

* review comment suggestion

Co-authored-by: Suraj Patil <[email protected]>

* implemented warning in case encoderdecoder is created with differing configs of encoderdecoderconfig and decoderconfig or encoderconfig

* added test to avoid diverging configs of wrapper class and wrapped classes

* Update src/transformers/models/encoder_decoder/modeling_encoder_decoder.py

* make style

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* updating the checkpoint for GPT2ForSequence Classification to one with classification head (#11434)

* add pooling layer support (#11439)

* make style (#11442)

* Pin black to 20.8.b1

* With style

* Pin black to 21.4b0

* TF BART models - Add `cross_attentions` to model output and fix cross-attention head masking (#10699)

* Add cross_attn_head_mask to BART

* Fix cross_attentions in TFBart-like models

* This commit enables returning of `cross_attentions`
for TFBart-like models

* It also fixes attention head masking in cross-attenion module

* Update TF model templates

* Fix missing , in TF model templates

* Fix typo: congig -> config

* Add basic support for FP16 in SageMaker model parallelism (#11407)

* Add FP16 support for SageMaker MP

* Add print debugs

* Squeeze

* Remove debug statements

* Add defensive check

* Typo

* docs(examples): fix link to TPU launcher script (#11427)

* fix some typos in docs, comments, logging/errors (#11432)

* Pass along seed to DistributedSampler (#11406)

* Pass along seed to DistributedSampler

* Add seed to DistributedLengthGroupedSampler

* Clarify description of the is_split_into_words argument (#11449)

* Improve documentation for is_split_into_words argument

* Change description wording

* [docs] fix invalid class name (#11438)

* fix invalid class name

* proper ref

* proper ref

* make sure to test against the local checkout (#11437)

* Style

* Give each test a different repo name (#11453)

* [Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)

* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom exmaples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval

* Variable Correction for Consistency in Distillation Example (#11444)

As the error comes from the inconsistency of variable meaning number of gpus in parser and its actual usage in the train.py script, 'gpus' and 'n_gpu' respectively,  the correction makes the example work

* [Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)

* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Remove max length beam scorer (#11378)

* removed max_len

* removed max_length from BeamSearchScorer

* correct max length

* finish

* del vim

* finish & add test

Co-authored-by: Patrick von Platen <[email protected]>

* update QuickTour docs to reflect model output object (#11462)

* update docs to reflect model output object

* run make style`

* Finish Making Quick Tour respect the model object (#11467)

* finish quicktour

* fix import

* fix print

* explain config default better

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docs for decoder_input_ids (#11466)

* fix docs for decoder_input_ids

* revert the changes for bart and mbart

* Update min versions in README and add Flax (#11472)

* Update min versions in README and add Flax

* Adapt index

* Update `PreTrainedTokenizerBase` to check/handle batch length for `text_pair` parameter (#11486)

* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <[email protected]>

* add error message

Co-authored-by: Sylvain Gugger <[email protected]>

* fix #1149 (#11493)

* [Flax] Add docstrings & model outputs (#11498)

* add attentions & hidden states

* add model outputs + docs

* finish docs

* finish tests

* finish impl

* del @

* finish

* finish

* correct test

* apply sylvains suggestions

* Update src/transformers/models/bert/modeling_flax_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* simplify more

Co-authored-by: Sylvain Gugger <[email protected]>

* Reformat to make code clearer in tokenizer call (#11497)

* Reformat to make code clearer

* Reformat to make code clearer

* solved coefficient issue for the TF version of gelu_fast (#11514)

Co-authored-by: Michael Benayoun <[email protected]>

* Split checkpoint from model_name_or_path in examples (#11492)

* Split checkpoint from model_name_or_path in examples

* Address review comments

* Address review comments

* Patch notification service

* Pin HuggingFace Hub dependency (#11502)

* correct the dimension comment of matrix multiplication (#11494)

Co-authored-by: Frederik Bode <[email protected]>

* add sp_model_kwargs to unpickle of xlm roberta tok (#11430)

add test for pickle

simplify test

fix test code style

add missing pickle import

fix test

fix test

fix test

* make style (#11520)

* Update README.md (#11489)

Add link to code

* T5 Gradient Checkpointing (#11353)

* Implement gradient checkpoinging for T5Stack

* A bit more robust type checking

* Add `gradient_checkpointing` to T5Config

* Formatting

* Set requires_grad only when training

* None return value will only cause problems when training

* Change the output tuple according to `use_cache`

* Enable gradient checkpointing for the decoder

Squashed commit of the following:

commit 658bdd0bd1215353a8770f558bda2ea69a0ad0c7
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 14:08:17 2021 +0800

    Only set `require_grad` for gradient checkpointing

commit acaeee6b2e675045fb28ce2176444c1d63e908bd
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 13:59:35 2021 +0800

    Make gradient checkpointing work with the decoder

* Formatting

* Adding `AutomaticSpeechRecognitionPipeline`. (#11337)

* Adding `AutomaticSpeechRecognitionPipeline`.

- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
  sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.

Future work:

- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Lysandre Debut <[email protected]>

* Adding docs + main import + type checking + LICENSE.

* Doc style !.

* Fixing TYPE_HINT.

* Specifying waveform shape in the docs.

* Adding asserts + specify in the documentation the shape of the input
np.ndarray.

* Update src/transformers/pipelines/automatic_speech_recognition.py

Co-authored-by: Patrick von Platen <[email protected]>

* Adding require to tests + move the `feature_extractor` doc.

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* Implement Fast Tokenization for Deberta (#11387)

* Accepts BatchEncoding in LengthSampler (#11431)

* Fix do_eval default value in training_args.py (#11511)

* Fix do_eval default value in training_args.py

* Update PULL_REQUEST_TEMPLATE.md

* Update TF text classification example (#11496)

Big refactor, fixes and multi-GPU/TPU support

* reszie token embeds (#11524)

* Run model templates on master (#11527)

* [Examples] Added support for test-file in QA examples with no trainer (#11510)

* added support for test-file

* fixed typo

* added suggested changes

* reformatted code

* modifed files

* fix post processing error

* Trigger CI

* removed extra lines

* Add Stas and Suraj as authors (#11526)

* Improve task summary docs (#11513)

* fix task summary docs

* refactor to use model.config.id2label instead of list

* fix nit

* Update docs/source/task_summary.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [debug utils] activation/weights underflow/overflow detector (#11274)

* sync

* add activation overflow debug utility

* cleanup

* document detect_overflow

* import torch

* add deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* convert to rst, add note

* add class

* fix docs

* improve the doc

* rework to dump a lot more info about each frame

* complete expansion

* cleanup

* format

* cleanup

* doesn't have to be transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* wrap long line

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] fp32 support (#11499)

* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made

* Fixed docs for the shape of `scores` in `generate()` (#10057)

* Fixed the doc for the shape of return scores tuples in generation_utils.py.

* Fix the output shape of `scores` for `DecoderOnlyOutput`.

* style fix

* Fix examples in M2M100 docstrings (#11540)

Replaces `tok` with `tokenizer` so examples can run with copy-paste

* [Flax BERT/Roberta] few small fixes (#11558)

* small fixes

* style

* [Wav2Vec2] Fix convert (#11562)

* push

* small change

* correct other typo

* Remove `datasets` submodule. (#11563)

* fix the mlm longformer example by changing [MASK] to <mask> (#11559)

* Add LUKE (#11223)

* Rebase with master

* Minor bug fix in docs

* Copy files from adding_luke_v2 and improve docs

* change the default value of use_entity_aware_attention to True

* remove word_hidden_states

* fix head models

* fix tests

* fix the conversion script

* add integration tests for the pretrained large model

* improve docstring

* Improve docs, make style

* fix _init_weights for pytorch 1.8

* improve docs

* fix tokenizer to construct entity sequence with [MASK] entity when entities=None

* Make fix-copies

* Make style & quality

* Bug fixes

* Add LukeTokenizer to init

* Address most comments by @patil-suraj and @LysandreJik

* rename _compute_extended_attention_mask to get_extended_attention_mask

* add comments to LukeSelfAttention

* fix the documentation of the tokenizer

* address comments by @patil-suraj, @LysandreJik, and @sgugger

* improve docs

* Make style, quality and fix-copies

* Improve docs

* fix docs

* add "entity_span_classification" task

* update example code for LukeForEntitySpanClassification

* improve docs

* improve docs

* improve the code example in luke.rst

* rename the classification layer in LukeForEntityClassification from typing to classifier

* add bias to the classifier in LukeForEntitySpanClassification

* update docs to use fine-tuned hub models in code examples of the head models

* update the example sentences

* Make style & quality

* Add require_torch to tokenizer tests

* Add require_torch to tokenizer tests

* Address comments by @sgugger and add community notebooks

* Make fix-copies

Co-authored-by: Ikuya Yamada <[email protected]>

* [Wav2vec2] Fixed tokenization mistakes while adding single-char tokens to tokenizer (#11538)

* Fixed tokenization mistakes while adding single-char tokens to tokenizer

* Added tests and Removed unnecessary comments.

* finalize wav2vec2 tok

* add more aggressive tests

* Apply suggestions from code review

* fix useless import

Co-authored-by: Patrick von Platen <[email protected]>

* Fix metric computation in `run_glue_no_trainer` (#11569)

* Fixes a useless warning. (#11566)

Fixes #11525

* Accumulate opt state dict on do_rank 0 (#11481)

* Update training tutorial (#11533)

* Update training tutorial

* Apply suggestions from code review

Co-authored-by: Hamel Husain <[email protected]>

* Address review comments

* Update docs/source/training.rst

Co-authored-by: Lysandre Debut <[email protected]>

* More review comments

* Last review comments

Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* fix resize_token_embeddings (#11572)

* Add multi-class, multi-label and regression to transformers (#11012)

* add to  bert

* review comments

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* self.config.problem_type

* fix style

* fix

* fin

* fix

* update doc

* fix

* test

* Test more problem types

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix

* remove

* fix

* quality

* make fix-copies

* remove test

Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* Enable added tokens (#11325)

* Fix tests

* Reorganize

* Update tests/test_modeling_mobilebert.py

* Remove unnecessary addition

* Make quality scripts work when one backend is missing. (#11573)

* Make quality scripts work when one backend is missing.

* Check env variable is properly set

* Add default

* With print statements

* Fix typo

* Set env variable

* Remove debug code

* [FlaxRoberta] Add FlaxRobertaModels & adapt run_mlm_flax.py (#11470)

* add flax roberta

* make style

* correct initialiazation

* modify model to save weights

* fix copied from

* fix copied from

* correct some more code

* add more roberta models

* Apply suggestions from code review

* merge from master

* finish

* finish docs

Co-authored-by: Patrick von Platen <[email protected]>

* Removes SageMakerTrainer code but keeps class as wrapper (#11587)

* removed all old code

* make quality

* [Flax] Add Electra models (#11426)

* add electra model to flax

* Remove Electra Next Sentence Prediction model added by mistake

* fix parameter sharing and loosen equality threshold

* fix styling issues

* add mistaken removen imports

* fix electra table

* Add FlaxElectra to automodels and fixe docs

* fix issues pointed out the PR

* fix flax electra to comply with latest changes

* remove stale class

* add copied from

Co-authored-by: Patrick von Platen <[email protected]>

* Reproducible checkpoint (#11582)

* Set generator in dataloader

* Use generator in all random samplers

* Checkpoint all RNG states

* Final version

* Quality

* Test

* Address review comments

* Quality

* Remove debug util

* Add python and numpy RNGs

* Split states in different files in distributed

* Quality

* local_rank for TPUs

* Only use generator when accepted

* Add test

* Set seed to avoid flakiness

* Make test less flaky

* Quality

* [trainer] document resume randomness (#11588)

* document resume randomness

* fix link

* reword

* fix

* reword

* style

* copies need to be fixed too (#11585)

* add importlib_metadata and huggingface_hub as dependency in the conda recipe (#11591)

* add importlib_metadata as dependency (#11490)

Co-authored-by: Deepali Chourasia <[email protected]>

* add huggingface_hub dependency

Co-authored-by: Deepali Chourasia <[email protected]>

* Skip Funnel test

* Pytorch - Lazy initialization of models (#11471)

* lazy_init_weights

* remove ipdb

* save int

* add necessary code

* remove unnecessary utils

* Update src/transformers/models/t5/modeling_t5.py

* clean

* add tests

* correct

* finish tests

* finish tests

* fix some more tests

* fix xlnet & transfo-xl

* fix more tests

* make sure tests are independent

* fix tests more

* finist tests

* final touches

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* clean tests

* give arg positive name

* add more mock weights to xlnet

Co-authored-by: Stas Bekman <[email protected]>

* Accept tensorflow-rocm package when checking TF availability (#11595)

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Philipp Schmid <[email protected]>
Co-authored-by: JohnnyC08 <[email protected]>
Co-authored-by: Hemil Desai <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: cchen-dialpad <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Lysandre <[email protected]>
Co-authored-by: cronoik <[email protected]>
Co-authored-by: Joe Davison <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: versis <[email protected]>
Co-authored-by: Eren Şahin <[email protected]>
Co-authored-by: Amala Deshmukh <[email protected]>
Co-authored-by: konstin <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: SHYAM SUNDER KUMAR <[email protected]>
Co-authored-by: Leo Gao <[email protected]>
Co-authored-by: Vasudev Gupta <[email protected]>
Co-authored-by: Jannis Born <[email protected]>
Co-authored-by: Yusuke Mori <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Andrea Cappelli <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: Keisuke Hirota <[email protected]>
Co-authored-by: Saviour Owolabi <[email protected]>
Co-authored-by: Kevin Canwen Xu <[email protected]>
Co-authored-by: Masatoshi TSUCHIYA <[email protected]>
Co-authored-by: fghuman <[email protected]>
Co-authored-by: Amna <[email protected]>
Co-authored-by: Takuya Makino <[email protected]>
Co-authored-by: calpt <[email protected]>
Co-authored-by: Ceyda Cinarel <[email protected]>
Co-authored-by: Nithin Holla <[email protected]>
Co-authored-by: nithin19 <[email protected]>
Co-authored-by: Joel Stremmel <[email protected]>
Co-authored-by: Sudharsan S T <[email protected]>
Co-authored-by: Sudharsan Thirumalai <[email protected]>
Co-authored-by: Thomas Wood <[email protected]>
Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: e <[email protected]>
Co-authored-by: TAE YOUNGDON <[email protected]>
Co-authored-by: rajvi-k <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Matt <[email protected]>
Co-authored-by: wlhgtc <[email protected]>
Co-authored-by: johnson7788 <[email protected]>
Co-authored-by: johnson <[email protected]>
Co-authored-by: PenutChen <[email protected]>
Co-authored-by: Max Del <[email protected]>
Co-authored-by: Yoshitomo Matsubara <[email protected]>
Co-authored-by: Teven <[email protected]>
Co-authored-by: Kiran R <[email protected]>
Co-authored-by: Nicola De Cao <[email protected]>
Co-authored-by: Daniel Stancl <[email protected]>
Co-authored-by: Philip May <[email protected]>
Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: abiolaTresor <[email protected]>
Co-authored-by: Amine Abdaoui <[email protected]>
Co-authored-by: LSinev <[email protected]>
Co-authored-by: Kostas Stathoulopoulos <[email protected]>
Co-authored-by: Bhadresh Savani <[email protected]>
Co-authored-by: Jaimeen Ahn <[email protected]>
Co-authored-by: Ashwin Geet D'Sa <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Manuel Romero <[email protected]>
Co-authored-by: CeShine Lee <[email protected]>
Co-authored-by: Shubham Sanghavi <[email protected]>
Co-authored-by: bonniehyeon <[email protected]>
Co-authored-by: jingyihe <[email protected]>
Co-authored-by: Ikuya Yamada <[email protected]>
Co-authored-by: Muktan <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Patrick Fernandes <[email protected]>
Co-authored-by: Deepali <[email protected]>
Co-authored-by: Deepali Chourasia <[email protected]>
Co-authored-by: Mats Sjöberg <[email protected]>
@stas00
Copy link
Contributor

stas00 commented Jun 18, 2021

FYI, pytorch has just added torch.nn.utils.skip_init() to handle similar situations:
https://pytorch.org/tutorials/prototype/skip_param_init.html
it should appear probably around pt-1.9.1.

Note that torch.nn.utils.skip_init() is even more efficient as it doesn't allocate any storage at all! So there is not even an overhead of creating any weights until they are loaded from state_dict. https://pytorch.org/tutorials/prototype/skip_param_init.html#implementation-details

will-rice added a commit to will-rice/transformers that referenced this pull request Jun 21, 2021
* Add more metadata to the user agent (#10972)

* Add more metadata to the user agent

* Fix typo

* Use DISABLE_TELEMETRY

* Address review comments

* Use global env

* Add clean envs on circle CI

* Enforce string-formatting with f-strings (#10980)

* First third

* Styling and fix mistake

* Quality

* All the rest

* Treat %s and %d

* typo

* Missing )

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* add notebook (#10995)

* Merge trainers (#10975)

* Replace is_sagemaker_distributed_available

* Merge SageMakerTrainer into Trainer

* Test with shorter condition

* Put back deleted line

* Deprecate SageMakerTrainer and SageMakerTrainingArguments

* Apply suggestions from code review

Co-authored-by: Philipp Schmid <[email protected]>

Co-authored-by: Philipp Schmid <[email protected]>

* add blog to docs (#10997)

* Update training_args.py (#11000)

In the group by length documentation length is misspelled as legnth

* Add `examples/language_modeling/run_mlm_no_trainer.py` (#11001)

* Add initial script for finetuning MLM models with accelerate

* Add evaluation metric calculation

* Fix bugs

* Use no_grad on evaluation

* update script docstring

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* PR feedback

* Fix CI failure

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix Adafactor documentation (recommend correct settings) (#10526)

* Update optimization.py

Fix documentation to reflect optimal settings for Adafactor

* update and expand on the recommendations

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* flip scale_parameter to True for the 2nd recommendatoin

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Improve the speed of adding tokens from added_tokens.json (#10780)

* use bisect to add one token to unique_no_split_tokens

* fix style

* Add Vision Transformer and ViTFeatureExtractor (#10950)

* Squash all commits into one

* Update ViTFeatureExtractor to use image_utils instead of torchvision

* Remove torchvision and add Pillow

* Small docs improvement

* Address most comments by @sgugger

* Fix tests

* Clean up conversion script

* Pooler first draft

* Fix quality

* Improve conversion script

* Make style and quality

* Make fix-copies

* Minor docs improvements

* Should use fix-copies instead of manual handling

* Revert "Should use fix-copies instead of manual handling"

This reverts commit fd4e591bce4496d41406425c82606a8fdaf8a50b.

* Place ViT in alphabetical order

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* DebertaTokenizer Rework closes #10258 (#10703)

* closes #10258

* typo

* reworked deberta test

* implemented the comments from BigBird01 regarding sequence pair encoding of deberta

* Update style

* VOCAB_FILES_NAMES is now a oneliner as suggested by @sgugger

Co-authored-by: Sylvain Gugger <[email protected]>

* added #fmt: on as requested by @sgugger

* Style

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* minor typo fix

*negative* log-likelihood

* [doc] no more bucket

* added new notebook and merge of trainer (#11015)

* added new notebook and merge of trainer

* Update docs/source/sagemaker.md

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* fixed typo: logging instead of logger (#11025)

* Add a script to check inits are consistent (#11024)

* s|Pretrained|PreTrained| (#11048)

* [doc] update code-block rendering (#11053)

double : prevents code-block section to be rendered, so made it single :

* Pin docutils (#11062)

* Pin docutils

* Versions table

* Remove unnecessary space (#11060)

* Some models have no tokenizers (#11064)

* Refactor AutoModel classes and add Flax Auto classes (#11027)

* Refactor AutoModel classes and add Flax Auto classes

* Add new objects to the init

* Fix hubconf and sort models

* Fix TF tests

* Missing coma

* Update src/transformers/models/auto/auto_factory.py

Co-authored-by: Lysandre Debut <[email protected]>

* Fix init

* Fix dummies

* Other init to fix

Co-authored-by: Lysandre Debut <[email protected]>

* Documentation about loading a fast tokenizer within Transformers (#11029)

* Documentation about loading a fast tokenizer within Transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Add example for registering callbacks with trainers (#10928)

* Add example for callback registry

Resolves: #9036

* Update callback registry documentation

* Added comments for other ways to register callback

* Add `examples/language_modeling/run_clm_no_trainer.py` (#11026)

* Initial draft for clm no trainer

* Remove unwanted args

* Fix bug

* Update examples/language-modeling/run_clm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace pkg_resources with importlib_metadata (#11061)

* Replace pkg_resources with importlib_metadata

Fixes #10964. The other reason for this change is that pkg_resources has been [deprecated](https://github.com/pypa/setuptools/commit/8fe85c22cee7fde5e6af571b30f864bad156a010) in favor of importlib_metadata.

* Reduce to a single importlib_metadata import switch

* Trigger CI

Co-authored-by: Stas Bekman <[email protected]>

* Add center_crop to ImageFeatureExtractoMixin (#11066)

* Document common config attributes (#11070)

* Fix distributed gather for tuples of tensors of varying sizes (#11071)

* Make a base init in FeatureExtractionMixin (#11074)

* Add Readme for language modeling scripts with accelerate (#11073)

* HF emoji unicode doesn't work in console (#11081)

It doesn't look like using 🤗 is a great idea for printing to console. See attachment.

This PR proposes to replace 🤗 with "HuggingFace" for an exception message.

@LysandreJik

* Link to new blog

* added social thumbnail for docs (#11083)

* added new merged Trainer test (#11090)

* [WIP] GPT Neo cleanup (#10985)

* better names

* add attention mixin

* all slow tests in one class

* make helper methods static so we can test

* add local attention tests

* better names

* doc

* apply review suggestions

* Release v4.5.0

* Development on v4.6.0dev0

* [doc] gpt-neo (#11098)

make the example work

* Auto feature extractor (#11097)

* AutoFeatureExtractor

* Init and first tests

* Tests

* Damn you gitignore

* Quality

* Defensive test for when not all backends are here

* Use pattern for Speech2Text models

* accelerate question answering examples with no trainer (#11091)

* accelerate question answering examples with no trainer

* removed train and eval flags also fixed fill np array function

* Update examples/question-answering/run_qa_beam_search_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update examples/question-answering/run_qa_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Style

* dead link fixed (#11103)

* GPTNeo: handle padded wte (#11079)

* GPTNeo: handle padded wte

* Switch to config.vocab_size

* apply review suggestion

Co-authored-by: Suraj Patil <[email protected]>

* fix: The 'warn' method is deprecated (#11105)

* The 'warn' method is deprecated

* fix test

* [examples] fix white space (#11099)

these get concatenated without whitespace, so fix it

* Dummies multi backend (#11100)

* Replaces requires_xxx by one generic method

* Quality and update check_dummies

* Fix inits check

* Post-merge cleanup

* Some styling of the training table in Notebooks (#11118)

* Adds a note to resize the token embedding matrix when adding special … (#11120)

* Adds a note to resize the token embedding matrix when adding special tokens

* Remove superfluous space

* fix tests (#11109)

* [versions] handle version requirement ranges (#11110)

* handle version requirement ranges

* add mixed requirement test

* cleanup

* Adds use_auth_token with pipelines (#11123)

* added model_kwargs to infer_framework_from_model

* added model_kwargs to tokenizer

* added use_auth_token as named parameter

* added dynamic get for use_auth_token

* Fix and refactor check_repo (#11127)

* Fix typing error in Trainer class (prediction_step) (#11138)

* fix: docstrings in prediction_step

* ci: Satisfy line length requirements

* ci: character length requirements

* Typo fix of the name of BertLMHeadModel in BERT doc (#11133)

* [run_clm] clarify why we get the tokenizer warning on long input (#11145)

* clarify why we get the warning here

* Update examples/language-modeling/run_clm.py

Co-authored-by: Sylvain Gugger <[email protected]>

* wording

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] ZeRO Stage 3 (#10753)

* synced gpus

* fix

* fix

* need to use t5-small for quality tests

* notes

* complete merge

* fix a disappearing std stream problem

* start zero3 tests

* wip

* tune params

* sorting out the pre-trained model loading

* reworking generate loop wip

* wip

* style

* fix tests

* split the tests

* refactor tests

* wip

* parameterized

* fix

* workout the resume from non-ds checkpoint pass + test

* cleanup

* remove no longer needed code

* split getter/setter functions

* complete the docs

* suggestions

* gpus and their compute capabilities link

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* style

* remove invalid paramgd

* automatically configure zero3 params that rely on hidden size

* make _get_resized_embeddings zero3-aware

* add test exercising resize_token_embeddings()

* add docstring

Co-authored-by: Lysandre Debut <[email protected]>

* Add nvidia megatron models (#10911)

* Add support for NVIDIA Megatron models

* Add support for NVIDIA Megatron GPT2 and BERT

Add the megatron_gpt2 model. That model reuses the existing GPT2 model. This
commit includes a script to convert a Megatron-GPT2 checkpoint downloaded
from NVIDIA GPU Cloud. See examples/megatron-models/README.md for details.

Add the megatron_bert model. That model is implemented as a modification of
the existing BERT model in Transformers. This commit includes a script to
convert a Megatron-BERT checkpoint downloaded from NVIDIA GPU Cloud. See
examples/megatron-models/README.md for details.

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Remove model.half in tests + add "# Copied ..."

Remove the model.half() instruction which makes tests fail on the CPU.

Add a comment "# Copied ..." before many classes in the model to enable automatic
tracking in CI between the new Megatron classes and the original Bert ones.

* Fix issues

* Fix Flax/TF tests

* Fix copyright

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update docs/source/model_doc/megatron_bert.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update docs/source/model_doc/megatron_gpt2.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/__init__.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Resolve most of 'sgugger' comments

* Fix conversion issue + Run make fix-copies/quality/docs

* Apply suggestions from code review

* Causal LM & merge

* Fix init

* Add CausalLM to last auto class

Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* [trainer] solve "scheduler before optimizer step" warning (#11144)

* solve "scheduler before optimizer step" warning

* style

* correct the state evaluation test

* Add fairscale and deepspeed back to the CI (#11147)

* Add fairscale and deepspeed back to the CI

* Add deepspeed to single GPU tests

* Updates SageMaker docs for updating DLCs (#11140)

* Don't duplicate logs in TensorBoard and handle --use_env (#11141)

* Run mlm pad to multiple for fp16 (#11128)

* Add mlm collator pad to multiple option (#10627)

* Use padding to 8x in run mlm (#10627)

* [tests] relocate core integration tests (#11146)

* relocate core integration tests

* add sys.path context manager

* cleanup

* try

* try2

* fix path

* doc

* style

* add dep

* add 2 more deps

* [setup] extras[docs] must include 'all' (#11148)

* extras[doc] must include 'all'

* fix

* better

* regroup

* Add support for multiple models for one config in auto classes (#11150)

* Add support for multiple models for one config in auto classes

* Use get_values everywhere

* Prettier doc

* [setup] make fairscale and deepspeed setup extras (#11151)

* make fairscale and deepspeed setup extras

* fix default

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* no reason not to ask for the good version

* update the CIs

Co-authored-by: Sylvain Gugger <[email protected]>

* Skip Megatron tests for now

* typo (#11152)

* typo

* style

* [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips (#11142)

* Add Wav2Vec Inference notebook

* Update docs/source/community.md

Co-authored-by: Suraj Patil <[email protected]>

* Fix LogitsProcessor documentation (#11130)

* Change duplicated LogitsProcessor to LogitsWarper in LogitsProcessorList document

* Write more detailed information about LogitsProcessor's scores argument

* apply suggestion from review

* style

Co-authored-by: Suraj Patil <[email protected]>

* Update README.md (#11161)

Corrected a typo ('Downlowd' to 'Download')

* Make `get_special_tokens_mask` consider all tokens (#11163)

* Add a special tokenizer for CPM model (#11068)

* Add a special tokenizer for CPM model

* make style

* fix

* Add docs

* styles

* cpm doc

* fix ci

* fix the overview

* add test

* make style

* typo

* Custom tokenizer flag

* Add REAMDE.md

Co-authored-by: Lysandre <[email protected]>

* [examples/translation] support mBART-50 and M2M100 fine-tuning (#11170)

* keep a list of multilingual tokenizers

* add forced_bos_token argument

* [examples run_clm] fix _LazyModule hasher error (#11168)

* fix _LazyModule hasher error

* reword

* added json dump and extraction of train run time (#11167)

* added json dump and extraction of train run time

* make style happy

* Fix Typo

* Reactivate Megatron tests an use less workers

* Minor typos fixed (#11182)

* Fix style

* model_path should be ignored as the checkpoint path (#11157)

* model_path is refered as the path of the trainer, and should be ignored as the checkpoint path.

* Improved according to Sgugger's comment.

* Added documentation for data collator. (#10941)

* Added documentation for data collator.

* Update docs/source/data_collator.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Added documentation for data collator.

* Added documentation for the data collator.

* Merge branch 'doc_DataCollator' of C:\Users\mahii\PycharmProjects\transformers with conflicts.

* Update documentation for the data collator.

* Update documentation for the data collator.

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Amna <[email protected]>

* Fix typo (#11188)

* Add DeiT (PyTorch) (#11056)

* First draft of deit

* More improvements

* Remove DeiTTokenizerFast from init

* Conversion script works

* Add DeiT to ViT conversion script

* Add tests, add head model, add support for deit in vit conversion script

* Update model checkpoint names

* Update image_mean and image_std, set resample to bicubic

* Improve docs

* Docs improvements

* Add DeiTForImageClassificationWithTeacher to init

* Address comments by @sgugger

* Improve feature extractors

* Make fix-copies

* Minor fixes

* Address comments by @patil-suraj

* All models uploaded

* Fix tests

* Remove labels argument from DeiTForImageClassificationWithTeacher

* Fix-copies, style and quality

* Fix tests

* Fix typo

* Multiple docs improvements

* More docs fixes

* Replaced `which` with `who` (#11183)

* Import torch.utils.checkpoint in ProphetNet (#11214)

* Sagemaker test docs update for framework upgrade (#11206)

* increased train_runtime for model parallelism

* added documentation for framework upgrade

* Use MSELoss in (M)BartForSequenceClassification (#11178)

* wav2vec2 converter: create the proper vocab.json while converting fairseq wav2vec2 finetuned model (#11041)

* add vocab while converting wav2vec2 original finetuned model

* check save directory exists

* return_attention_mask fix

* quality

* Add Matt as the TensorFlow reference (#11212)

* Fix GPT-2 warnings (#11213)

* Fix GPT-2 warnings

* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Stas Bekman <[email protected]>

Co-authored-by: Stas Bekman <[email protected]>

* fix docstrings (#11221)

* Add documentation for BertJapanese (#11219)

* Start writing BERT-Japanese doc

* Fix typo, Update toctree

* Modify model file to use comment for document, Add examples

* Clean bert_japanese by make style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Split a big code block into two

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Add prefix >>> to all lines in code blocks

* Clean bert_japanese by make fixup

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace error by warning when loading an architecture in another (#11207)

* Replace error by warning when loading an architecture in another

* Style

* Style again

* Add a test

* Adapt old test

* Document v4.5.1

* Refactor GPT2 (#11225)

* refactor GPT2

* fix mlp and head pruning

* address Sylvains comments

* apply suggestion from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Doc check: a bit of clean up (#11224)

* added cache_dir=model_args.cache_dir to all example with cache_dir arg (#11220)

* Avoid using no_sync on SageMaker DP (#11229)

* Indent code block in the documentation (#11233)

* Indent code block

* Indent code blocks version 2

* Quality

* Run CI on deepspeed and fairscale (#11172)

* Run CI on deepspeed and fairscale

* Test it on this branch :)

* Rename

* Update the CI image

* [Deepspeed] zero3 tests band aid (#11235)

* temp band-aid

* style

* Save the Wav2Vec2 processor before training starts (#10910)

Co-authored-by: nithin19 <[email protected]>

* make embeddings plural in warning message (#11228)

* Stale bot updated (#10562)

* Updated stale bot

* Specify issue number

* Remove particular handling of assignees

* Unleash the stalebot

* Remove debug branch

* Close open files to suppress ResourceWarning (#11240)

Co-authored-by: Sudharsan Thirumalai <[email protected]>

* Fix dimention misspellings. (#11238)

* Update modeling_gpt_neo.py

dimention -> dimension

* Update configuration_speech_to_text.py

dimention -> dimension

* Add prefix to examples in model_doc rst (#11226)

* Add prefix to examples in model_doc rst

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [troubleshooting] add 2 points of reference to the offline mode (#11236)

* add 2 points of reference to the offline mode

* link the new doc

* add error message

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* style

* rename

* Trigger CI

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix #10128 (#11248)

* [deepspeed] test on one node 2 gpus max (#11237)

* test on one node 2 gpus max

* fix the other place

* refactor

* fix

* cleanup

* more exact version

* Trainer iterable dataset (#11254)

* IterableDatasetShard

* Test and integration in Trainer

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <[email protected]>

* Style

Co-authored-by: Lysandre Debut <[email protected]>

* Adding pipeline task aliases. (#11247)

* Adding task aliases and adding `token-classification` and
`text-classification` tasks.

* Cleaning docstring.

* Support for set_epoch (#11258)

* Tokenizer fast save (#11234)

* Save fast tokenizers in both formats

* Fix for HerBERT

* Proper fix

* Properly test new behavior

* update dependency_versions_table (#11273)

missed this updating when bumped the version.

* Workflow fixes (#11270)

* Enabling multilingual models for translation pipelines. (#10536)

* [WIP] Enabling multilingual models for translation pipelines.

* decoder_input_ids -> forced_bos_token_id

* Improve docstring.

* Rebase

* Fixing 2 bugs

- Type token_ids coming from `_parse_and_tokenize`
- Wrong index from tgt_lang.

* Fixing black version.

* Adding tests for _build_translation_inputs and add them for all
tokenizers.

* Mbart actually puts the lang code at the end.

* Fixing m2m100.

* Adding TF support to `deep_round`.

* Update src/transformers/pipelines/text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Adding one line comment.

* Fixing M2M100 `_build_translation_input_ids`, and fix the call site.

* Fixing tests + deep_round -> nested_simplify

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix failing workflows

* Trainer support for IterableDataset for evaluation and predict (#11286)

* Bulk of the work

* Polish and tests

* Update QA Trainer

* Avoid breaking the predict method

* Deprecation warnings

* Store real eval dataloder

* Get eval dataset reference before wrap

* move device statements outside if statements (#11292)

* modify double considering special tokens in `language_modeling.py` (#11275)

* Update language_modeling.py

in "class TextDatasetForNextSentencePrediction(Dataset)", double considering "self.tokenizer.num_special_tokens_to_add(pair=True)" 

so, i remove self.block_size, and add parameter for "def create_examples_from_document". like "class LineByLineWithSOPTextDataset" do

* Update language_modeling.py

* [Trainer] fix the placement on device with fp16_full_eval (#11322)

* fix the placement on device with fp16_full_eval

* deepspeed never goes on device

* [Trainer] Add a progress bar for batches skipped (#11324)

* Load checkpoint without re-creating the model (#11318)

* Added translation example script  (#11196)

* initial changes

* modified evaluation

* updated evaluation

* updated evaluation on text translation example script

* added translation example script

* Formatted translation example script

* Reformatted translation example

* Fixed evaluation bug and added support for other tokenisers

* Fixed evaluation bug and added support for other tokenisers

* Added translation example script

* Formatted summarization example script

* Removed typos from summarization example script

* [Generate] Remove outdated code (#11331)

* remove update function

* update

* refactor more

* refactor

* [GPTNeo] create local attention mask ones (#11335)

* create local attention mask ones

* remove old method, address patricks comment

* Update to use datasets remove_cloumns method (#11343)

* Update to use datasets remove_cloumns method

* Quality

* Add an error message that fires when Reformer is not in training mode, but one runs .backward() (#11117)

* Removed `max_length` from being mandatory within `generate`. (#11314)

* Removed `max_length` from being mandatory within `generate`.

- Moving on to fully using `StoppingCriteria` for `greedy` and `sample`
modes.
- `max_length` still used for `beam_search` and `group_beam_search`
(Follow up PR)
- Fixes a bug with MaxLengthStoppingCriteria (we should stop as soon a
we hit the max_length, the comparison needs to be or equal, that affects
the tests).
- Added options to use `logits_processor` and `stopping_criteria`
directly within `generate` function (so some users can define their own
`logits_processor` and `stopping_criteria`).
- Modified the backward compat tests to make sure we issue a warning.

* Fix `max_length` argument in `generate`.

* Moving validate to being functional.

- Renamed `smax_length` to `stoppping_max_length`.

* Removing `logits_processor` and `stopping_criteria` from `generate`
arguments.

* Deepcopy.

* Fix global variable name.

* Honor contributors to models (#11329)

* Honor contributors to models

* Fix typo

* Address review comments

* Add more authors

* [deepspeed] fix resume from checkpoint (#11352)

This PR fixes a bug that most likely somehow got exposed (not caused) by https://github.com/huggingface/transformers/pull/11318 - surprisingly the same test worked just fine before that other PR.

* Examples reorg (#11350)

* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address review comments and clean

Co-authored-by: Lysandre Debut <[email protected]>

* Extract metric_key_prefix during NotebookProgressCallback.on_evaluate (#11347)

* Pass metric_key_prefix as kwarg to on_evaluate

* Replace eval_loss with metric_key_prefix_loss

* Default to "eval" if metric_key_prefix not in kwargs

* Add kwargs to CallbackHandler.on_evaluate signature

* Revert "Add kwargs to CallbackHandler.on_evaluate signature"

This reverts commit 8d4c85ed512f558f7579d36771e907b3379947b7.

* Revert "Pass metric_key_prefix as kwarg to on_evaluate"

This reverts commit 7766bfe2718601230ae593d37b1317bd53cfc075.

* Extract metric_key_prefix from metrics

* [testing doc] bring doc up to date (#11359)

* bring doc up to date

* fix

* Merge new TF example script (#11360)

First of the new and more idiomatic TF examples!

* Remove boiler plate code (#11340)

* remove boiler plate code

* adapt roberta

* correct docs

* finish refactor

* Move old TF text classification script to legacy (#11361)

And update README to explain the work-in-progress!

* [contributing doc] explain/link to good first issue (#11346)

* explain/link to good first issue

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix token_type_ids error for big_bird model. (#11355)

* MOD: fit chinese wwm to new datasets

* MOD: move wwm to new folder

* MOD: formate code

* Styling

* MOD add param and recover trainer

* MOD: add token_type_ids method for big bird

* MOD: format code

* MOD: format code

Co-authored-by: Sylvain Gugger <[email protected]>

* Add huggingface_hub dep for #11328

* Add in torchhub

* [Wav2Vec2] Fix special tokens for Wav2Vec2 tokenizer (#11349)

* fix wav2vec2 tok

* up

* [Flax] Correct typo (#11374)

* finish

* fix copy

* [run_translation.py] fix typo  (#11372)

fix typo

Co-authored-by: johnson <[email protected]>

* Add space (#11373)

* Correctly cast num_train_epochs to int (#11379)

* Fix typo (#11369)

* Fix Trainer with remove_unused_columns=False (#11382)

* Fix Trainer with remove_unused_columns=False

* Typo

* [Flax] Big FlaxBert Refactor (#11364)

* improve flax

* refactor

* typos

* Update src/transformers/modeling_flax_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_flax_utils.py

* fix typo

* improve error tolerance

* typo

* correct nasty saving bug

* fix from pretrained

* correct tree map

* add note

* correct weight tying

* correct typo (#11393)

* correct conversion (#11394)

* Fix typo in text (#11396)

* fixed typos (#11391)

* make blenderbot test slow (#11395)

* Fixed trainer total_flos relaoding in distributed mode (#11383)

* Fixed trainer total_flos relaoding in distributed mode

* logging flos at the end of training

* Trainer push to hub (#11328)

* Initial support for upload to hub

* push -> upload

* Fixes + examples

* Fix torchhub test

* Torchhub test I hate you

* push_model_to_hub -> push_to_hub

* Apply mixin to other pretrained models

* Remove ABC inheritance

* Add tests

* Typo

* Run tests

* Install git-lfs

* Change approach

* Add push_to_hub to all

* Staging test suite

* Typo

* Maybe like this?

* More deps

* Cache

* Adapt name

* Quality

* MOAR tests

* Put it in testing_utils

* Docs + torchhub last hope

* Styling

* Wrong method

* Typos

* Update src/transformers/file_utils.py

Co-authored-by: Julien Chaumond <[email protected]>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* push (#11400)

* added support for exporting of t5 to onnx with past_key_values (#10651)

* Fixing bug in generation (#11297)

When passing `inputs_embeds` and not `input_ids=None` the generation function fails because `input_ids` is created but the function but it should not.

* Style

* Try to trigger failure more

* Wrong branch Sylvain...

* Fix cross-attention head mask for Torch encoder-decoder models (#10605)

* Fix cross-attention head mask for Torch BART models

* Fix head masking for cross-attention module for the following
models: BART, Blenderbot, Blenderbot_small, M2M_100, Marian, MBart,
Pegasus

* Enable test_headmasking for M2M_100 model

* Fix cross_head_mask for FSMT, LED and T5

* This commit fixes `head_mask` for cross-attention modules
in the following models: FSMT, LED, T5

* It also contains some smaller changes in doc so that
it is be perfectly clear the shape of `cross_head_mask`
is the same as of `decoder_head_mask`

* Update template

* Fix template for BartForCausalLM

* Fix cross_head_mask for Speech2Text models

* Fix cross_head_mask in templates

* Fix args order in BartForCausalLM template

* Fix doc in BART templates

* Make more explicit naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Fix doc

* make style quality

* Fix speech2text docstring

* Default to accuracy metric (#11405)

* Enable option for subword regularization in `XLMRobertaTokenizer` (#11149)

* enable subword regularization.

* fix tokenizer storage

* fix docstring formatting

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Stefan Schweter <[email protected]>

* fix docstring formatting

* add test for subword regularization tokenizer

* improve comments of test

* add sp_model_kwargs

* reformat docstring to match the style

* add some more documentation

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* improve docstring

* empty commit to trigger CI

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docstring formatting for sphinx

Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Use 3 workers for torch tests

* documentation linked to the parent class PreTrainedTokenizerFast but it should be the slow tokenizer (#11410)

* Style

* Add head_mask, decoder_head_mask, cross_head_mask to ProphetNet (#9964)

* Add head_mask & decoder_head_mask + some corrections

* Fix head masking for N-grams

* Enable test_headmasking for encoder and decod

* Fix one typo regarding in modeling_propgetnet.py

* Enable test_headmasking for ProphetNetStandaloneDecoderModelTest
and ProphetNetStandaloneEncoderModelTest in test_modeling_prophetnet.py

* make style

* Fix cross_head_mask

* Fix attention head mask naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Still need to merge #10605 to master to pass the tests

* EncoderDecoderConfigs should not create new objects (#11300)

* removes the creation of separate config objects and uses the existing ones instead+overwrite resize_token_embeddings from parent class because it is not working for the EncoderDecoderModel

* rollback to current version of the huggingface master branch

* reworked version that ties the encoder and decoder config of the parent encoderdecoder instance

* overwrite of resize_token_embeddings throws an error now

* review comment suggestion

Co-authored-by: Suraj Patil <[email protected]>

* implemented warning in case encoderdecoder is created with differing configs of encoderdecoderconfig and decoderconfig or encoderconfig

* added test to avoid diverging configs of wrapper class and wrapped classes

* Update src/transformers/models/encoder_decoder/modeling_encoder_decoder.py

* make style

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* updating the checkpoint for GPT2ForSequence Classification to one with classification head (#11434)

* add pooling layer support (#11439)

* make style (#11442)

* Pin black to 20.8.b1

* With style

* Pin black to 21.4b0

* TF BART models - Add `cross_attentions` to model output and fix cross-attention head masking (#10699)

* Add cross_attn_head_mask to BART

* Fix cross_attentions in TFBart-like models

* This commit enables returning of `cross_attentions`
for TFBart-like models

* It also fixes attention head masking in cross-attenion module

* Update TF model templates

* Fix missing , in TF model templates

* Fix typo: congig -> config

* Add basic support for FP16 in SageMaker model parallelism (#11407)

* Add FP16 support for SageMaker MP

* Add print debugs

* Squeeze

* Remove debug statements

* Add defensive check

* Typo

* docs(examples): fix link to TPU launcher script (#11427)

* fix some typos in docs, comments, logging/errors (#11432)

* Pass along seed to DistributedSampler (#11406)

* Pass along seed to DistributedSampler

* Add seed to DistributedLengthGroupedSampler

* Clarify description of the is_split_into_words argument (#11449)

* Improve documentation for is_split_into_words argument

* Change description wording

* [docs] fix invalid class name (#11438)

* fix invalid class name

* proper ref

* proper ref

* make sure to test against the local checkout (#11437)

* Style

* Give each test a different repo name (#11453)

* [Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)

* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom exmaples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval

* Variable Correction for Consistency in Distillation Example (#11444)

As the error comes from the inconsistency of variable meaning number of gpus in parser and its actual usage in the train.py script, 'gpus' and 'n_gpu' respectively,  the correction makes the example work

* [Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)

* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Remove max length beam scorer (#11378)

* removed max_len

* removed max_length from BeamSearchScorer

* correct max length

* finish

* del vim

* finish & add test

Co-authored-by: Patrick von Platen <[email protected]>

* update QuickTour docs to reflect model output object (#11462)

* update docs to reflect model output object

* run make style`

* Finish Making Quick Tour respect the model object (#11467)

* finish quicktour

* fix import

* fix print

* explain config default better

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docs for decoder_input_ids (#11466)

* fix docs for decoder_input_ids

* revert the changes for bart and mbart

* Update min versions in README and add Flax (#11472)

* Update min versions in README and add Flax

* Adapt index

* Update `PreTrainedTokenizerBase` to check/handle batch length for `text_pair` parameter (#11486)

* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <[email protected]>

* add error message

Co-authored-by: Sylvain Gugger <[email protected]>

* fix #1149 (#11493)

* [Flax] Add docstrings & model outputs (#11498)

* add attentions & hidden states

* add model outputs + docs

* finish docs

* finish tests

* finish impl

* del @

* finish

* finish

* correct test

* apply sylvains suggestions

* Update src/transformers/models/bert/modeling_flax_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* simplify more

Co-authored-by: Sylvain Gugger <[email protected]>

* Reformat to make code clearer in tokenizer call (#11497)

* Reformat to make code clearer

* Reformat to make code clearer

* solved coefficient issue for the TF version of gelu_fast (#11514)

Co-authored-by: Michael Benayoun <[email protected]>

* Split checkpoint from model_name_or_path in examples (#11492)

* Split checkpoint from model_name_or_path in examples

* Address review comments

* Address review comments

* Patch notification service

* Pin HuggingFace Hub dependency (#11502)

* correct the dimension comment of matrix multiplication (#11494)

Co-authored-by: Frederik Bode <[email protected]>

* add sp_model_kwargs to unpickle of xlm roberta tok (#11430)

add test for pickle

simplify test

fix test code style

add missing pickle import

fix test

fix test

fix test

* make style (#11520)

* Update README.md (#11489)

Add link to code

* T5 Gradient Checkpointing (#11353)

* Implement gradient checkpoinging for T5Stack

* A bit more robust type checking

* Add `gradient_checkpointing` to T5Config

* Formatting

* Set requires_grad only when training

* None return value will only cause problems when training

* Change the output tuple according to `use_cache`

* Enable gradient checkpointing for the decoder

Squashed commit of the following:

commit 658bdd0bd1215353a8770f558bda2ea69a0ad0c7
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 14:08:17 2021 +0800

    Only set `require_grad` for gradient checkpointing

commit acaeee6b2e675045fb28ce2176444c1d63e908bd
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 13:59:35 2021 +0800

    Make gradient checkpointing work with the decoder

* Formatting

* Adding `AutomaticSpeechRecognitionPipeline`. (#11337)

* Adding `AutomaticSpeechRecognitionPipeline`.

- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
  sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.

Future work:

- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Lysandre Debut <[email protected]>

* Adding docs + main import + type checking + LICENSE.

* Doc style !.

* Fixing TYPE_HINT.

* Specifying waveform shape in the docs.

* Adding asserts + specify in the documentation the shape of the input
np.ndarray.

* Update src/transformers/pipelines/automatic_speech_recognition.py

Co-authored-by: Patrick von Platen <[email protected]>

* Adding require to tests + move the `feature_extractor` doc.

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* Implement Fast Tokenization for Deberta (#11387)

* Accepts BatchEncoding in LengthSampler (#11431)

* Fix do_eval default value in training_args.py (#11511)

* Fix do_eval default value in training_args.py

* Update PULL_REQUEST_TEMPLATE.md

* Update TF text classification example (#11496)

Big refactor, fixes and multi-GPU/TPU support

* reszie token embeds (#11524)

* Run model templates on master (#11527)

* [Examples] Added support for test-file in QA examples with no trainer (#11510)

* added support for test-file

* fixed typo

* added suggested changes

* reformatted code

* modifed files

* fix post processing error

* Trigger CI

* removed extra lines

* Add Stas and Suraj as authors (#11526)

* Improve task summary docs (#11513)

* fix task summary docs

* refactor to use model.config.id2label instead of list

* fix nit

* Update docs/source/task_summary.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [debug utils] activation/weights underflow/overflow detector (#11274)

* sync

* add activation overflow debug utility

* cleanup

* document detect_overflow

* import torch

* add deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* convert to rst, add note

* add class

* fix docs

* improve the doc

* rework to dump a lot more info about each frame

* complete expansion

* cleanup

* format

* cleanup

* doesn't have to be transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* wrap long line

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] fp32 support (#11499)

* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made

* Fixed docs for the shape of `scores` in `generate()` (#10057)

* Fixed the doc for the shape of return scores tuples in generation_utils.py.

* Fix the output shape of `scores` for `DecoderOnlyOutput`.

* style fix

* Fix examples in M2M100 docstrings (#11540)

Replaces `tok` with `tokenizer` so examples can run with copy-paste

* [Flax BERT/Roberta] few small fixes (#11558)

* small fixes

* style

* [Wav2Vec2] Fix convert (#11562)

* push

* small change

* correct other typo

* Remove `datasets` submodule. (#11563)

* fix the mlm longformer example by changing [MASK] to <mask> (#11559)

* Add LUKE (#11223)

* Rebase with master

* Minor bug fix in docs

* Copy files from adding_luke_v2 and improve docs

* change the default value of use_entity_aware_attention to True

* remove word_hidden_states

* fix head models

* fix tests

* fix the conversion script

* add integration tests for the pretrained large model

* improve docstring

* Improve docs, make style

* fix _init_weights for pytorch 1.8

* improve docs

* fix tokenizer to construct entity sequence with [MASK] entity when entities=None

* Make fix-copies

* Make style & quality

* Bug fixes

* Add LukeTokenizer to init

* Address most comments by @patil-suraj and @LysandreJik

* rename _compute_extended_attention_mask to get_extended_attention_mask

* add comments to LukeSelfAttention

* fix the documentation of the tokenizer

* address comments by @patil-suraj, @LysandreJik, and @sgugger

* improve docs

* Make style, quality and fix-copies

* Improve docs

* fix docs

* add "entity_span_classification" task

* update example code for LukeForEntitySpanClassification

* improve docs

* improve docs

* improve the code example in luke.rst

* rename the classification layer in LukeForEntityClassification from typing to classifier

* add bias to the classifier in LukeForEntitySpanClassification

* update docs to use fine-tuned hub models in code examples of the head models

* update the example sentences

* Make style & quality

* Add require_torch to tokenizer tests

* Add require_torch to tokenizer tests

* Address comments by @sgugger and add community notebooks

* Make fix-copies

Co-authored-by: Ikuya Yamada <[email protected]>

* [Wav2vec2] Fixed tokenization mistakes while adding single-char tokens to tokenizer (#11538)

* Fixed tokenization mistakes while adding single-char tokens to tokenizer

* Added tests and Removed unnecessary comments.

* finalize wav2vec2 tok

* add more aggressive tests

* Apply suggestions from code review

* fix useless import

Co-authored-by: Patrick von Platen <[email protected]>

* Fix metric computation in `run_glue_no_trainer` (#11569)

* Fixes a useless warning. (#11566)

Fixes #11525

* Accumulate opt state dict on do_rank 0 (#11481)

* Update training tutorial (#11533)

* Update training tutorial

* Apply suggestions from code review

Co-authored-by: Hamel Husain <[email protected]>

* Address review comments

* Update docs/source/training.rst

Co-authored-by: Lysandre Debut <[email protected]>

* More review comments

* Last review comments

Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* fix resize_token_embeddings (#11572)

* Add multi-class, multi-label and regression to transformers (#11012)

* add to  bert

* review comments

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* self.config.problem_type

* fix style

* fix

* fin

* fix

* update doc

* fix

* test

* Test more problem types

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix

* remove

* fix

* quality

* make fix-copies

* remove test

Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* Enable added tokens (#11325)

* Fix tests

* Reorganize

* Update tests/test_modeling_mobilebert.py

* Remove unnecessary addition

* Make quality scripts work when one backend is missing. (#11573)

* Make quality scripts work when one backend is missing.

* Check env variable is properly set

* Add default

* With print statements

* Fix typo

* Set env variable

* Remove debug code

* [FlaxRoberta] Add FlaxRobertaModels & adapt run_mlm_flax.py (#11470)

* add flax roberta

* make style

* correct initialiazation

* modify model to save weights

* fix copied from

* fix copied from

* correct some more code

* add more roberta models

* Apply suggestions from code review

* merge from master

* finish

* finish docs

Co-authored-by: Patrick von Platen <[email protected]>

* Removes SageMakerTrainer code but keeps class as wrapper (#11587)

* removed all old code

* make quality

* [Flax] Add Electra models (#11426)

* add electra model to flax

* Remove Electra Next Sentence Prediction model added by mistake

* fix parameter sharing and loosen equality threshold

* fix styling issues

* add mistaken removen imports

* fix electra table

* Add FlaxElectra to automodels and fixe docs

* fix issues pointed out the PR

* fix flax electra to comply with latest changes

* remove stale class

* add copied from

Co-authored-by: Patrick von Platen <[email protected]>

* Reproducible checkpoint (#11582)

* Set generator in dataloader

* Use generator in all random samplers

* Checkpoint all RNG states

* Final version

* Quality

* Test

* Address review comments

* Quality

* Remove debug util

* Add python and numpy RNGs

* Split states in different files in distributed

* Quality

* local_rank for TPUs

* Only use generator when accepted

* Add test

* Set seed to avoid flakiness

* Make test less flaky

* Quality

* [trainer] document resume randomness (#11588)

* document resume randomness

* fix link

* reword

* fix

* reword

* style

* copies need to be fixed too (#11585)

* add importlib_metadata and huggingface_hub as dependency in the conda recipe (#11591)

* add importlib_metadata as dependency (#11490)

Co-authored-by: Deepali Chourasia <[email protected]>

* add huggingface_hub dependency

Co-authored-by: Deepali Chourasia <[email protected]>

* Skip Funnel test

* Pytorch - Lazy initialization of models (#11471)

* lazy_init_weights

* remove ipdb

* save int

* add necessary code

* remove unnecessary utils

* Update src/transformers/models/t5/modeling_t5.py

* clean

* add tests

* correct

* finish tests

* finish tests

* fix some more tests

* fix xlnet & transfo-xl

* fix more tests

* make sure tests are independent

* fix tests more

* finist tests

* final touches

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* clean tests

* give arg positive name

* add more mock weights to xlnet

Co-authored-by: Stas Bekman <[email protected]>

* Accept tensorflow-rocm package when checking TF availability (#11595)

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Philipp Schmid <[email protected]>
Co-authored-by: JohnnyC08 <[email protected]>
Co-authored-by: Hemil Desai <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: cchen-dialpad <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Lysandre <[email protected]>
Co-authored-by: cronoik <[email protected]>
Co-authored-by: Joe Davison <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: versis <[email protected]>
Co-authored-by: Eren Şahin <[email protected]>
Co-authored-by: Amala Deshmukh <[email protected]>
Co-authored-by: konstin <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: SHYAM SUNDER KUMAR <[email protected]>
Co-authored-by: Leo Gao <[email protected]>
Co-authored-by: Vasudev Gupta <[email protected]>
Co-authored-by: Jannis Born <[email protected]>
Co-authored-by: Yusuke Mori <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Andrea Cappelli <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: Keisuke Hirota <[email protected]>
Co-authored-by: Saviour Owolabi <[email protected]>
Co-authored-by: Kevin Canwen Xu <[email protected]>
Co-authored-by: Masatoshi TSUCHIYA <[email protected]>
Co-authored-by: fghuman <[email protected]>
Co-authored-by: Amna <[email protected]>
Co-authored-by: Takuya Makino <[email protected]>
Co-authored-by: calpt <[email protected]>
Co-authored-by: Ceyda Cinarel <[email protected]>
Co-authored-by: Nithin Holla <[email protected]>
Co-authored-by: nithin19 <[email protected]>
Co-authored-by: Joel Stremmel <[email protected]>
Co-authored-by: Sudharsan S T <[email protected]>
Co-authored-by: Sudharsan Thirumalai <[email protected]>
Co-authored-by: Thomas Wood <[email protected]>
Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: e <[email protected]>
Co-authored-by: TAE YOUNGDON <[email protected]>
Co-authored-by: rajvi-k <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Matt <[email protected]>
Co-authored-by: wlhgtc <[email protected]>
Co-authored-by: johnson7788 <[email protected]>
Co-authored-by: johnson <[email protected]>
Co-authored-by: PenutChen <[email protected]>
Co-authored-by: Max Del <[email protected]>
Co-authored-by: Yoshitomo Matsubara <[email protected]>
Co-authored-by: Teven <[email protected]>
Co-authored-by: Kiran R <[email protected]>
Co-authored-by: Nicola De Cao <[email protected]>
Co-authored-by: Daniel Stancl <[email protected]>
Co-authored-by: Philip May <[email protected]>
Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: abiolaTresor <[email protected]>
Co-authored-by: Amine Abdaoui <[email protected]>
Co-authored-by: LSinev <[email protected]>
Co-authored-by: Kostas Stathoulopoulos <[email protected]>
Co-authored-by: Bhadresh Savani <[email protected]>
Co-authored-by: Jaimeen Ahn <[email protected]>
Co-authored-by: Ashwin Geet D'Sa <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Manuel Romero <[email protected]>
Co-authored-by: CeShine Lee <[email protected]>
Co-authored-by: Shubham Sanghavi <[email protected]>
Co-authored-by: bonniehyeon <[email protected]>
Co-authored-by: jingyihe <[email protected]>
Co-authored-by: Ikuya Yamada <[email protected]>
Co-authored-by: Muktan <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Patrick Fernandes <[email protected]>
Co-authored-by: Deepali <[email protected]>
Co-authored-by: Deepali Chourasia <[email protected]>
Co-authored-by: Mats Sjöberg <[email protected]>
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* lazy_init_weights

* remove ipdb

* save int

* add necessary code

* remove unnecessary utils

* Update src/transformers/models/t5/modeling_t5.py

* clean

* add tests

* correct

* finish tests

* finish tests

* fix some more tests

* fix xlnet & transfo-xl

* fix more tests

* make sure tests are independent

* fix tests more

* finist tests

* final touches

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* clean tests

* give arg positive name

* add more mock weights to xlnet

Co-authored-by: Stas Bekman <[email protected]>
will-rice added a commit to will-rice/transformers that referenced this pull request Jul 19, 2021
* Add more metadata to the user agent (#10972)

* Add more metadata to the user agent

* Fix typo

* Use DISABLE_TELEMETRY

* Address review comments

* Use global env

* Add clean envs on circle CI

* Enforce string-formatting with f-strings (#10980)

* First third

* Styling and fix mistake

* Quality

* All the rest

* Treat %s and %d

* typo

* Missing )

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* add notebook (#10995)

* Merge trainers (#10975)

* Replace is_sagemaker_distributed_available

* Merge SageMakerTrainer into Trainer

* Test with shorter condition

* Put back deleted line

* Deprecate SageMakerTrainer and SageMakerTrainingArguments

* Apply suggestions from code review

Co-authored-by: Philipp Schmid <[email protected]>

Co-authored-by: Philipp Schmid <[email protected]>

* add blog to docs (#10997)

* Update training_args.py (#11000)

In the group by length documentation length is misspelled as legnth

* Add `examples/language_modeling/run_mlm_no_trainer.py` (#11001)

* Add initial script for finetuning MLM models with accelerate

* Add evaluation metric calculation

* Fix bugs

* Use no_grad on evaluation

* update script docstring

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* PR feedback

* Fix CI failure

* Update examples/language-modeling/run_mlm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix Adafactor documentation (recommend correct settings) (#10526)

* Update optimization.py

Fix documentation to reflect optimal settings for Adafactor

* update and expand on the recommendations

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* flip scale_parameter to True for the 2nd recommendatoin

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Improve the speed of adding tokens from added_tokens.json (#10780)

* use bisect to add one token to unique_no_split_tokens

* fix style

* Add Vision Transformer and ViTFeatureExtractor (#10950)

* Squash all commits into one

* Update ViTFeatureExtractor to use image_utils instead of torchvision

* Remove torchvision and add Pillow

* Small docs improvement

* Address most comments by @sgugger

* Fix tests

* Clean up conversion script

* Pooler first draft

* Fix quality

* Improve conversion script

* Make style and quality

* Make fix-copies

* Minor docs improvements

* Should use fix-copies instead of manual handling

* Revert "Should use fix-copies instead of manual handling"

This reverts commit fd4e591bce4496d41406425c82606a8fdaf8a50b.

* Place ViT in alphabetical order

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* DebertaTokenizer Rework closes #10258 (#10703)

* closes #10258

* typo

* reworked deberta test

* implemented the comments from BigBird01 regarding sequence pair encoding of deberta

* Update style

* VOCAB_FILES_NAMES is now a oneliner as suggested by @sgugger

Co-authored-by: Sylvain Gugger <[email protected]>

* added #fmt: on as requested by @sgugger

* Style

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* minor typo fix

*negative* log-likelihood

* [doc] no more bucket

* added new notebook and merge of trainer (#11015)

* added new notebook and merge of trainer

* Update docs/source/sagemaker.md

Co-authored-by: Lysandre Debut <[email protected]>

Co-authored-by: Lysandre Debut <[email protected]>

* fixed typo: logging instead of logger (#11025)

* Add a script to check inits are consistent (#11024)

* s|Pretrained|PreTrained| (#11048)

* [doc] update code-block rendering (#11053)

double : prevents code-block section to be rendered, so made it single :

* Pin docutils (#11062)

* Pin docutils

* Versions table

* Remove unnecessary space (#11060)

* Some models have no tokenizers (#11064)

* Refactor AutoModel classes and add Flax Auto classes (#11027)

* Refactor AutoModel classes and add Flax Auto classes

* Add new objects to the init

* Fix hubconf and sort models

* Fix TF tests

* Missing coma

* Update src/transformers/models/auto/auto_factory.py

Co-authored-by: Lysandre Debut <[email protected]>

* Fix init

* Fix dummies

* Other init to fix

Co-authored-by: Lysandre Debut <[email protected]>

* Documentation about loading a fast tokenizer within Transformers (#11029)

* Documentation about loading a fast tokenizer within Transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Add example for registering callbacks with trainers (#10928)

* Add example for callback registry

Resolves: #9036

* Update callback registry documentation

* Added comments for other ways to register callback

* Add `examples/language_modeling/run_clm_no_trainer.py` (#11026)

* Initial draft for clm no trainer

* Remove unwanted args

* Fix bug

* Update examples/language-modeling/run_clm_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace pkg_resources with importlib_metadata (#11061)

* Replace pkg_resources with importlib_metadata

Fixes #10964. The other reason for this change is that pkg_resources has been [deprecated](https://github.com/pypa/setuptools/commit/8fe85c22cee7fde5e6af571b30f864bad156a010) in favor of importlib_metadata.

* Reduce to a single importlib_metadata import switch

* Trigger CI

Co-authored-by: Stas Bekman <[email protected]>

* Add center_crop to ImageFeatureExtractoMixin (#11066)

* Document common config attributes (#11070)

* Fix distributed gather for tuples of tensors of varying sizes (#11071)

* Make a base init in FeatureExtractionMixin (#11074)

* Add Readme for language modeling scripts with accelerate (#11073)

* HF emoji unicode doesn't work in console (#11081)

It doesn't look like using 🤗 is a great idea for printing to console. See attachment.

This PR proposes to replace 🤗 with "HuggingFace" for an exception message.

@LysandreJik

* Link to new blog

* added social thumbnail for docs (#11083)

* added new merged Trainer test (#11090)

* [WIP] GPT Neo cleanup (#10985)

* better names

* add attention mixin

* all slow tests in one class

* make helper methods static so we can test

* add local attention tests

* better names

* doc

* apply review suggestions

* Release v4.5.0

* Development on v4.6.0dev0

* [doc] gpt-neo (#11098)

make the example work

* Auto feature extractor (#11097)

* AutoFeatureExtractor

* Init and first tests

* Tests

* Damn you gitignore

* Quality

* Defensive test for when not all backends are here

* Use pattern for Speech2Text models

* accelerate question answering examples with no trainer (#11091)

* accelerate question answering examples with no trainer

* removed train and eval flags also fixed fill np array function

* Update examples/question-answering/run_qa_beam_search_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update examples/question-answering/run_qa_no_trainer.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Style

* dead link fixed (#11103)

* GPTNeo: handle padded wte (#11079)

* GPTNeo: handle padded wte

* Switch to config.vocab_size

* apply review suggestion

Co-authored-by: Suraj Patil <[email protected]>

* fix: The 'warn' method is deprecated (#11105)

* The 'warn' method is deprecated

* fix test

* [examples] fix white space (#11099)

these get concatenated without whitespace, so fix it

* Dummies multi backend (#11100)

* Replaces requires_xxx by one generic method

* Quality and update check_dummies

* Fix inits check

* Post-merge cleanup

* Some styling of the training table in Notebooks (#11118)

* Adds a note to resize the token embedding matrix when adding special … (#11120)

* Adds a note to resize the token embedding matrix when adding special tokens

* Remove superfluous space

* fix tests (#11109)

* [versions] handle version requirement ranges (#11110)

* handle version requirement ranges

* add mixed requirement test

* cleanup

* Adds use_auth_token with pipelines (#11123)

* added model_kwargs to infer_framework_from_model

* added model_kwargs to tokenizer

* added use_auth_token as named parameter

* added dynamic get for use_auth_token

* Fix and refactor check_repo (#11127)

* Fix typing error in Trainer class (prediction_step) (#11138)

* fix: docstrings in prediction_step

* ci: Satisfy line length requirements

* ci: character length requirements

* Typo fix of the name of BertLMHeadModel in BERT doc (#11133)

* [run_clm] clarify why we get the tokenizer warning on long input (#11145)

* clarify why we get the warning here

* Update examples/language-modeling/run_clm.py

Co-authored-by: Sylvain Gugger <[email protected]>

* wording

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] ZeRO Stage 3 (#10753)

* synced gpus

* fix

* fix

* need to use t5-small for quality tests

* notes

* complete merge

* fix a disappearing std stream problem

* start zero3 tests

* wip

* tune params

* sorting out the pre-trained model loading

* reworking generate loop wip

* wip

* style

* fix tests

* split the tests

* refactor tests

* wip

* parameterized

* fix

* workout the resume from non-ds checkpoint pass + test

* cleanup

* remove no longer needed code

* split getter/setter functions

* complete the docs

* suggestions

* gpus and their compute capabilities link

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* style

* remove invalid paramgd

* automatically configure zero3 params that rely on hidden size

* make _get_resized_embeddings zero3-aware

* add test exercising resize_token_embeddings()

* add docstring

Co-authored-by: Lysandre Debut <[email protected]>

* Add nvidia megatron models (#10911)

* Add support for NVIDIA Megatron models

* Add support for NVIDIA Megatron GPT2 and BERT

Add the megatron_gpt2 model. That model reuses the existing GPT2 model. This
commit includes a script to convert a Megatron-GPT2 checkpoint downloaded
from NVIDIA GPU Cloud. See examples/megatron-models/README.md for details.

Add the megatron_bert model. That model is implemented as a modification of
the existing BERT model in Transformers. This commit includes a script to
convert a Megatron-BERT checkpoint downloaded from NVIDIA GPU Cloud. See
examples/megatron-models/README.md for details.

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Remove model.half in tests + add "# Copied ..."

Remove the model.half() instruction which makes tests fail on the CPU.

Add a comment "# Copied ..." before many classes in the model to enable automatic
tracking in CI between the new Megatron classes and the original Bert ones.

* Fix issues

* Fix Flax/TF tests

* Fix copyright

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/configuration_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Lysandre Debut <[email protected]>

* Update docs/source/model_doc/megatron_bert.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update docs/source/model_doc/megatron_gpt2.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/__init__.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/models/megatron_bert/modeling_megatron_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Resolve most of 'sgugger' comments

* Fix conversion issue + Run make fix-copies/quality/docs

* Apply suggestions from code review

* Causal LM & merge

* Fix init

* Add CausalLM to last auto class

Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* [trainer] solve "scheduler before optimizer step" warning (#11144)

* solve "scheduler before optimizer step" warning

* style

* correct the state evaluation test

* Add fairscale and deepspeed back to the CI (#11147)

* Add fairscale and deepspeed back to the CI

* Add deepspeed to single GPU tests

* Updates SageMaker docs for updating DLCs (#11140)

* Don't duplicate logs in TensorBoard and handle --use_env (#11141)

* Run mlm pad to multiple for fp16 (#11128)

* Add mlm collator pad to multiple option (#10627)

* Use padding to 8x in run mlm (#10627)

* [tests] relocate core integration tests (#11146)

* relocate core integration tests

* add sys.path context manager

* cleanup

* try

* try2

* fix path

* doc

* style

* add dep

* add 2 more deps

* [setup] extras[docs] must include 'all' (#11148)

* extras[doc] must include 'all'

* fix

* better

* regroup

* Add support for multiple models for one config in auto classes (#11150)

* Add support for multiple models for one config in auto classes

* Use get_values everywhere

* Prettier doc

* [setup] make fairscale and deepspeed setup extras (#11151)

* make fairscale and deepspeed setup extras

* fix default

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* no reason not to ask for the good version

* update the CIs

Co-authored-by: Sylvain Gugger <[email protected]>

* Skip Megatron tests for now

* typo (#11152)

* typo

* style

* [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips (#11142)

* Add Wav2Vec Inference notebook

* Update docs/source/community.md

Co-authored-by: Suraj Patil <[email protected]>

* Fix LogitsProcessor documentation (#11130)

* Change duplicated LogitsProcessor to LogitsWarper in LogitsProcessorList document

* Write more detailed information about LogitsProcessor's scores argument

* apply suggestion from review

* style

Co-authored-by: Suraj Patil <[email protected]>

* Update README.md (#11161)

Corrected a typo ('Downlowd' to 'Download')

* Make `get_special_tokens_mask` consider all tokens (#11163)

* Add a special tokenizer for CPM model (#11068)

* Add a special tokenizer for CPM model

* make style

* fix

* Add docs

* styles

* cpm doc

* fix ci

* fix the overview

* add test

* make style

* typo

* Custom tokenizer flag

* Add REAMDE.md

Co-authored-by: Lysandre <[email protected]>

* [examples/translation] support mBART-50 and M2M100 fine-tuning (#11170)

* keep a list of multilingual tokenizers

* add forced_bos_token argument

* [examples run_clm] fix _LazyModule hasher error (#11168)

* fix _LazyModule hasher error

* reword

* added json dump and extraction of train run time (#11167)

* added json dump and extraction of train run time

* make style happy

* Fix Typo

* Reactivate Megatron tests an use less workers

* Minor typos fixed (#11182)

* Fix style

* model_path should be ignored as the checkpoint path (#11157)

* model_path is refered as the path of the trainer, and should be ignored as the checkpoint path.

* Improved according to Sgugger's comment.

* Added documentation for data collator. (#10941)

* Added documentation for data collator.

* Update docs/source/data_collator.rst

Co-authored-by: Sylvain Gugger <[email protected]>

* Added documentation for data collator.

* Added documentation for the data collator.

* Merge branch 'doc_DataCollator' of C:\Users\mahii\PycharmProjects\transformers with conflicts.

* Update documentation for the data collator.

* Update documentation for the data collator.

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Amna <[email protected]>

* Fix typo (#11188)

* Add DeiT (PyTorch) (#11056)

* First draft of deit

* More improvements

* Remove DeiTTokenizerFast from init

* Conversion script works

* Add DeiT to ViT conversion script

* Add tests, add head model, add support for deit in vit conversion script

* Update model checkpoint names

* Update image_mean and image_std, set resample to bicubic

* Improve docs

* Docs improvements

* Add DeiTForImageClassificationWithTeacher to init

* Address comments by @sgugger

* Improve feature extractors

* Make fix-copies

* Minor fixes

* Address comments by @patil-suraj

* All models uploaded

* Fix tests

* Remove labels argument from DeiTForImageClassificationWithTeacher

* Fix-copies, style and quality

* Fix tests

* Fix typo

* Multiple docs improvements

* More docs fixes

* Replaced `which` with `who` (#11183)

* Import torch.utils.checkpoint in ProphetNet (#11214)

* Sagemaker test docs update for framework upgrade (#11206)

* increased train_runtime for model parallelism

* added documentation for framework upgrade

* Use MSELoss in (M)BartForSequenceClassification (#11178)

* wav2vec2 converter: create the proper vocab.json while converting fairseq wav2vec2 finetuned model (#11041)

* add vocab while converting wav2vec2 original finetuned model

* check save directory exists

* return_attention_mask fix

* quality

* Add Matt as the TensorFlow reference (#11212)

* Fix GPT-2 warnings (#11213)

* Fix GPT-2 warnings

* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Stas Bekman <[email protected]>

Co-authored-by: Stas Bekman <[email protected]>

* fix docstrings (#11221)

* Add documentation for BertJapanese (#11219)

* Start writing BERT-Japanese doc

* Fix typo, Update toctree

* Modify model file to use comment for document, Add examples

* Clean bert_japanese by make style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Split a big code block into two

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Add prefix >>> to all lines in code blocks

* Clean bert_japanese by make fixup

Co-authored-by: Sylvain Gugger <[email protected]>

* Replace error by warning when loading an architecture in another (#11207)

* Replace error by warning when loading an architecture in another

* Style

* Style again

* Add a test

* Adapt old test

* Document v4.5.1

* Refactor GPT2 (#11225)

* refactor GPT2

* fix mlp and head pruning

* address Sylvains comments

* apply suggestion from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Doc check: a bit of clean up (#11224)

* added cache_dir=model_args.cache_dir to all example with cache_dir arg (#11220)

* Avoid using no_sync on SageMaker DP (#11229)

* Indent code block in the documentation (#11233)

* Indent code block

* Indent code blocks version 2

* Quality

* Run CI on deepspeed and fairscale (#11172)

* Run CI on deepspeed and fairscale

* Test it on this branch :)

* Rename

* Update the CI image

* [Deepspeed] zero3 tests band aid (#11235)

* temp band-aid

* style

* Save the Wav2Vec2 processor before training starts (#10910)

Co-authored-by: nithin19 <[email protected]>

* make embeddings plural in warning message (#11228)

* Stale bot updated (#10562)

* Updated stale bot

* Specify issue number

* Remove particular handling of assignees

* Unleash the stalebot

* Remove debug branch

* Close open files to suppress ResourceWarning (#11240)

Co-authored-by: Sudharsan Thirumalai <[email protected]>

* Fix dimention misspellings. (#11238)

* Update modeling_gpt_neo.py

dimention -> dimension

* Update configuration_speech_to_text.py

dimention -> dimension

* Add prefix to examples in model_doc rst (#11226)

* Add prefix to examples in model_doc rst

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [troubleshooting] add 2 points of reference to the offline mode (#11236)

* add 2 points of reference to the offline mode

* link the new doc

* add error message

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* style

* rename

* Trigger CI

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix #10128 (#11248)

* [deepspeed] test on one node 2 gpus max (#11237)

* test on one node 2 gpus max

* fix the other place

* refactor

* fix

* cleanup

* more exact version

* Trainer iterable dataset (#11254)

* IterableDatasetShard

* Test and integration in Trainer

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <[email protected]>

* Style

Co-authored-by: Lysandre Debut <[email protected]>

* Adding pipeline task aliases. (#11247)

* Adding task aliases and adding `token-classification` and
`text-classification` tasks.

* Cleaning docstring.

* Support for set_epoch (#11258)

* Tokenizer fast save (#11234)

* Save fast tokenizers in both formats

* Fix for HerBERT

* Proper fix

* Properly test new behavior

* update dependency_versions_table (#11273)

missed this updating when bumped the version.

* Workflow fixes (#11270)

* Enabling multilingual models for translation pipelines. (#10536)

* [WIP] Enabling multilingual models for translation pipelines.

* decoder_input_ids -> forced_bos_token_id

* Improve docstring.

* Rebase

* Fixing 2 bugs

- Type token_ids coming from `_parse_and_tokenize`
- Wrong index from tgt_lang.

* Fixing black version.

* Adding tests for _build_translation_inputs and add them for all
tokenizers.

* Mbart actually puts the lang code at the end.

* Fixing m2m100.

* Adding TF support to `deep_round`.

* Update src/transformers/pipelines/text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Adding one line comment.

* Fixing M2M100 `_build_translation_input_ids`, and fix the call site.

* Fixing tests + deep_round -> nested_simplify

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix failing workflows

* Trainer support for IterableDataset for evaluation and predict (#11286)

* Bulk of the work

* Polish and tests

* Update QA Trainer

* Avoid breaking the predict method

* Deprecation warnings

* Store real eval dataloder

* Get eval dataset reference before wrap

* move device statements outside if statements (#11292)

* modify double considering special tokens in `language_modeling.py` (#11275)

* Update language_modeling.py

in "class TextDatasetForNextSentencePrediction(Dataset)", double considering "self.tokenizer.num_special_tokens_to_add(pair=True)" 

so, i remove self.block_size, and add parameter for "def create_examples_from_document". like "class LineByLineWithSOPTextDataset" do

* Update language_modeling.py

* [Trainer] fix the placement on device with fp16_full_eval (#11322)

* fix the placement on device with fp16_full_eval

* deepspeed never goes on device

* [Trainer] Add a progress bar for batches skipped (#11324)

* Load checkpoint without re-creating the model (#11318)

* Added translation example script  (#11196)

* initial changes

* modified evaluation

* updated evaluation

* updated evaluation on text translation example script

* added translation example script

* Formatted translation example script

* Reformatted translation example

* Fixed evaluation bug and added support for other tokenisers

* Fixed evaluation bug and added support for other tokenisers

* Added translation example script

* Formatted summarization example script

* Removed typos from summarization example script

* [Generate] Remove outdated code (#11331)

* remove update function

* update

* refactor more

* refactor

* [GPTNeo] create local attention mask ones (#11335)

* create local attention mask ones

* remove old method, address patricks comment

* Update to use datasets remove_cloumns method (#11343)

* Update to use datasets remove_cloumns method

* Quality

* Add an error message that fires when Reformer is not in training mode, but one runs .backward() (#11117)

* Removed `max_length` from being mandatory within `generate`. (#11314)

* Removed `max_length` from being mandatory within `generate`.

- Moving on to fully using `StoppingCriteria` for `greedy` and `sample`
modes.
- `max_length` still used for `beam_search` and `group_beam_search`
(Follow up PR)
- Fixes a bug with MaxLengthStoppingCriteria (we should stop as soon a
we hit the max_length, the comparison needs to be or equal, that affects
the tests).
- Added options to use `logits_processor` and `stopping_criteria`
directly within `generate` function (so some users can define their own
`logits_processor` and `stopping_criteria`).
- Modified the backward compat tests to make sure we issue a warning.

* Fix `max_length` argument in `generate`.

* Moving validate to being functional.

- Renamed `smax_length` to `stoppping_max_length`.

* Removing `logits_processor` and `stopping_criteria` from `generate`
arguments.

* Deepcopy.

* Fix global variable name.

* Honor contributors to models (#11329)

* Honor contributors to models

* Fix typo

* Address review comments

* Add more authors

* [deepspeed] fix resume from checkpoint (#11352)

This PR fixes a bug that most likely somehow got exposed (not caused) by https://github.com/huggingface/transformers/pull/11318 - surprisingly the same test worked just fine before that other PR.

* Examples reorg (#11350)

* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address review comments and clean

Co-authored-by: Lysandre Debut <[email protected]>

* Extract metric_key_prefix during NotebookProgressCallback.on_evaluate (#11347)

* Pass metric_key_prefix as kwarg to on_evaluate

* Replace eval_loss with metric_key_prefix_loss

* Default to "eval" if metric_key_prefix not in kwargs

* Add kwargs to CallbackHandler.on_evaluate signature

* Revert "Add kwargs to CallbackHandler.on_evaluate signature"

This reverts commit 8d4c85ed512f558f7579d36771e907b3379947b7.

* Revert "Pass metric_key_prefix as kwarg to on_evaluate"

This reverts commit 7766bfe2718601230ae593d37b1317bd53cfc075.

* Extract metric_key_prefix from metrics

* [testing doc] bring doc up to date (#11359)

* bring doc up to date

* fix

* Merge new TF example script (#11360)

First of the new and more idiomatic TF examples!

* Remove boiler plate code (#11340)

* remove boiler plate code

* adapt roberta

* correct docs

* finish refactor

* Move old TF text classification script to legacy (#11361)

And update README to explain the work-in-progress!

* [contributing doc] explain/link to good first issue (#11346)

* explain/link to good first issue

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Fix token_type_ids error for big_bird model. (#11355)

* MOD: fit chinese wwm to new datasets

* MOD: move wwm to new folder

* MOD: formate code

* Styling

* MOD add param and recover trainer

* MOD: add token_type_ids method for big bird

* MOD: format code

* MOD: format code

Co-authored-by: Sylvain Gugger <[email protected]>

* Add huggingface_hub dep for #11328

* Add in torchhub

* [Wav2Vec2] Fix special tokens for Wav2Vec2 tokenizer (#11349)

* fix wav2vec2 tok

* up

* [Flax] Correct typo (#11374)

* finish

* fix copy

* [run_translation.py] fix typo  (#11372)

fix typo

Co-authored-by: johnson <[email protected]>

* Add space (#11373)

* Correctly cast num_train_epochs to int (#11379)

* Fix typo (#11369)

* Fix Trainer with remove_unused_columns=False (#11382)

* Fix Trainer with remove_unused_columns=False

* Typo

* [Flax] Big FlaxBert Refactor (#11364)

* improve flax

* refactor

* typos

* Update src/transformers/modeling_flax_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_flax_utils.py

* fix typo

* improve error tolerance

* typo

* correct nasty saving bug

* fix from pretrained

* correct tree map

* add note

* correct weight tying

* correct typo (#11393)

* correct conversion (#11394)

* Fix typo in text (#11396)

* fixed typos (#11391)

* make blenderbot test slow (#11395)

* Fixed trainer total_flos relaoding in distributed mode (#11383)

* Fixed trainer total_flos relaoding in distributed mode

* logging flos at the end of training

* Trainer push to hub (#11328)

* Initial support for upload to hub

* push -> upload

* Fixes + examples

* Fix torchhub test

* Torchhub test I hate you

* push_model_to_hub -> push_to_hub

* Apply mixin to other pretrained models

* Remove ABC inheritance

* Add tests

* Typo

* Run tests

* Install git-lfs

* Change approach

* Add push_to_hub to all

* Staging test suite

* Typo

* Maybe like this?

* More deps

* Cache

* Adapt name

* Quality

* MOAR tests

* Put it in testing_utils

* Docs + torchhub last hope

* Styling

* Wrong method

* Typos

* Update src/transformers/file_utils.py

Co-authored-by: Julien Chaumond <[email protected]>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* push (#11400)

* added support for exporting of t5 to onnx with past_key_values (#10651)

* Fixing bug in generation (#11297)

When passing `inputs_embeds` and not `input_ids=None` the generation function fails because `input_ids` is created but the function but it should not.

* Style

* Try to trigger failure more

* Wrong branch Sylvain...

* Fix cross-attention head mask for Torch encoder-decoder models (#10605)

* Fix cross-attention head mask for Torch BART models

* Fix head masking for cross-attention module for the following
models: BART, Blenderbot, Blenderbot_small, M2M_100, Marian, MBart,
Pegasus

* Enable test_headmasking for M2M_100 model

* Fix cross_head_mask for FSMT, LED and T5

* This commit fixes `head_mask` for cross-attention modules
in the following models: FSMT, LED, T5

* It also contains some smaller changes in doc so that
it is be perfectly clear the shape of `cross_head_mask`
is the same as of `decoder_head_mask`

* Update template

* Fix template for BartForCausalLM

* Fix cross_head_mask for Speech2Text models

* Fix cross_head_mask in templates

* Fix args order in BartForCausalLM template

* Fix doc in BART templates

* Make more explicit naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Fix doc

* make style quality

* Fix speech2text docstring

* Default to accuracy metric (#11405)

* Enable option for subword regularization in `XLMRobertaTokenizer` (#11149)

* enable subword regularization.

* fix tokenizer storage

* fix docstring formatting

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Stefan Schweter <[email protected]>

* fix docstring formatting

* add test for subword regularization tokenizer

* improve comments of test

* add sp_model_kwargs

* reformat docstring to match the style

* add some more documentation

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* improve docstring

* empty commit to trigger CI

* Update src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docstring formatting for sphinx

Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>

* Use 3 workers for torch tests

* documentation linked to the parent class PreTrainedTokenizerFast but it should be the slow tokenizer (#11410)

* Style

* Add head_mask, decoder_head_mask, cross_head_mask to ProphetNet (#9964)

* Add head_mask & decoder_head_mask + some corrections

* Fix head masking for N-grams

* Enable test_headmasking for encoder and decod

* Fix one typo regarding in modeling_propgetnet.py

* Enable test_headmasking for ProphetNetStandaloneDecoderModelTest
and ProphetNetStandaloneEncoderModelTest in test_modeling_prophetnet.py

* make style

* Fix cross_head_mask

* Fix attention head mask naming

* `cross_head_mask` -> `cross_attn_head_mask`

* `cross_layer_head_mask` -> `cross_attn_layer_head_mask`

* Still need to merge #10605 to master to pass the tests

* EncoderDecoderConfigs should not create new objects (#11300)

* removes the creation of separate config objects and uses the existing ones instead+overwrite resize_token_embeddings from parent class because it is not working for the EncoderDecoderModel

* rollback to current version of the huggingface master branch

* reworked version that ties the encoder and decoder config of the parent encoderdecoder instance

* overwrite of resize_token_embeddings throws an error now

* review comment suggestion

Co-authored-by: Suraj Patil <[email protected]>

* implemented warning in case encoderdecoder is created with differing configs of encoderdecoderconfig and decoderconfig or encoderconfig

* added test to avoid diverging configs of wrapper class and wrapped classes

* Update src/transformers/models/encoder_decoder/modeling_encoder_decoder.py

* make style

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* updating the checkpoint for GPT2ForSequence Classification to one with classification head (#11434)

* add pooling layer support (#11439)

* make style (#11442)

* Pin black to 20.8.b1

* With style

* Pin black to 21.4b0

* TF BART models - Add `cross_attentions` to model output and fix cross-attention head masking (#10699)

* Add cross_attn_head_mask to BART

* Fix cross_attentions in TFBart-like models

* This commit enables returning of `cross_attentions`
for TFBart-like models

* It also fixes attention head masking in cross-attenion module

* Update TF model templates

* Fix missing , in TF model templates

* Fix typo: congig -> config

* Add basic support for FP16 in SageMaker model parallelism (#11407)

* Add FP16 support for SageMaker MP

* Add print debugs

* Squeeze

* Remove debug statements

* Add defensive check

* Typo

* docs(examples): fix link to TPU launcher script (#11427)

* fix some typos in docs, comments, logging/errors (#11432)

* Pass along seed to DistributedSampler (#11406)

* Pass along seed to DistributedSampler

* Add seed to DistributedLengthGroupedSampler

* Clarify description of the is_split_into_words argument (#11449)

* Improve documentation for is_split_into_words argument

* Change description wording

* [docs] fix invalid class name (#11438)

* fix invalid class name

* proper ref

* proper ref

* make sure to test against the local checkout (#11437)

* Style

* Give each test a different repo name (#11453)

* [Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)

* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom exmaples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval

* Variable Correction for Consistency in Distillation Example (#11444)

As the error comes from the inconsistency of variable meaning number of gpus in parser and its actual usage in the train.py script, 'gpus' and 'n_gpu' respectively,  the correction makes the example work

* [Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)

* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* Remove max length beam scorer (#11378)

* removed max_len

* removed max_length from BeamSearchScorer

* correct max length

* finish

* del vim

* finish & add test

Co-authored-by: Patrick von Platen <[email protected]>

* update QuickTour docs to reflect model output object (#11462)

* update docs to reflect model output object

* run make style`

* Finish Making Quick Tour respect the model object (#11467)

* finish quicktour

* fix import

* fix print

* explain config default better

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* fix docs for decoder_input_ids (#11466)

* fix docs for decoder_input_ids

* revert the changes for bart and mbart

* Update min versions in README and add Flax (#11472)

* Update min versions in README and add Flax

* Adapt index

* Update `PreTrainedTokenizerBase` to check/handle batch length for `text_pair` parameter (#11486)

* Update tokenization_utils_base.py

* add assertion

* check batch len

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <[email protected]>

* add error message

Co-authored-by: Sylvain Gugger <[email protected]>

* fix #1149 (#11493)

* [Flax] Add docstrings & model outputs (#11498)

* add attentions & hidden states

* add model outputs + docs

* finish docs

* finish tests

* finish impl

* del @

* finish

* finish

* correct test

* apply sylvains suggestions

* Update src/transformers/models/bert/modeling_flax_bert.py

Co-authored-by: Sylvain Gugger <[email protected]>

* simplify more

Co-authored-by: Sylvain Gugger <[email protected]>

* Reformat to make code clearer in tokenizer call (#11497)

* Reformat to make code clearer

* Reformat to make code clearer

* solved coefficient issue for the TF version of gelu_fast (#11514)

Co-authored-by: Michael Benayoun <[email protected]>

* Split checkpoint from model_name_or_path in examples (#11492)

* Split checkpoint from model_name_or_path in examples

* Address review comments

* Address review comments

* Patch notification service

* Pin HuggingFace Hub dependency (#11502)

* correct the dimension comment of matrix multiplication (#11494)

Co-authored-by: Frederik Bode <[email protected]>

* add sp_model_kwargs to unpickle of xlm roberta tok (#11430)

add test for pickle

simplify test

fix test code style

add missing pickle import

fix test

fix test

fix test

* make style (#11520)

* Update README.md (#11489)

Add link to code

* T5 Gradient Checkpointing (#11353)

* Implement gradient checkpoinging for T5Stack

* A bit more robust type checking

* Add `gradient_checkpointing` to T5Config

* Formatting

* Set requires_grad only when training

* None return value will only cause problems when training

* Change the output tuple according to `use_cache`

* Enable gradient checkpointing for the decoder

Squashed commit of the following:

commit 658bdd0bd1215353a8770f558bda2ea69a0ad0c7
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 14:08:17 2021 +0800

    Only set `require_grad` for gradient checkpointing

commit acaeee6b2e675045fb28ce2176444c1d63e908bd
Author: Ceshine Lee <[email protected]>
Date:   Sat Apr 24 13:59:35 2021 +0800

    Make gradient checkpointing work with the decoder

* Formatting

* Adding `AutomaticSpeechRecognitionPipeline`. (#11337)

* Adding `AutomaticSpeechRecognitionPipeline`.

- Because we added everything to enable this pipeline, we probably
should add it to `transformers`.
- This PR tries to limit the scope and focuses only on the pipeline part
(what should go in, and out).
- The tests are very specific for S2T and Wav2vec2 to make sure both
architectures are supported by the pipeline. We don't use the mixin for
tests right now, because that requires more work in the `pipeline`
function (will be done in a follow up PR).
- Unsure about the "helper" function `ffmpeg_read`. It makes a lot of
  sense from a user perspective, it does not add any additional
dependencies (as in hard dependency, because users can always use their
own load mechanism). Meanwhile, it feels slightly clunky to have so much
optional preprocessing.
- The pipeline is not done to support streaming audio right now.

Future work:

- Add `automatic-speech-recognition` as a `task`. And add the
FeatureExtractor.from_pretrained within `pipeline` function.
- Add small models within tests
- Add the Mixin to tests.
- Make the logic between ForCTC vs ForConditionalGeneration better.

* Update tests/test_pipelines_automatic_speech_recognition.py

Co-authored-by: Lysandre Debut <[email protected]>

* Adding docs + main import + type checking + LICENSE.

* Doc style !.

* Fixing TYPE_HINT.

* Specifying waveform shape in the docs.

* Adding asserts + specify in the documentation the shape of the input
np.ndarray.

* Update src/transformers/pipelines/automatic_speech_recognition.py

Co-authored-by: Patrick von Platen <[email protected]>

* Adding require to tests + move the `feature_extractor` doc.

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>

* Implement Fast Tokenization for Deberta (#11387)

* Accepts BatchEncoding in LengthSampler (#11431)

* Fix do_eval default value in training_args.py (#11511)

* Fix do_eval default value in training_args.py

* Update PULL_REQUEST_TEMPLATE.md

* Update TF text classification example (#11496)

Big refactor, fixes and multi-GPU/TPU support

* reszie token embeds (#11524)

* Run model templates on master (#11527)

* [Examples] Added support for test-file in QA examples with no trainer (#11510)

* added support for test-file

* fixed typo

* added suggested changes

* reformatted code

* modifed files

* fix post processing error

* Trigger CI

* removed extra lines

* Add Stas and Suraj as authors (#11526)

* Improve task summary docs (#11513)

* fix task summary docs

* refactor to use model.config.id2label instead of list

* fix nit

* Update docs/source/task_summary.rst

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* [debug utils] activation/weights underflow/overflow detector (#11274)

* sync

* add activation overflow debug utility

* cleanup

* document detect_overflow

* import torch

* add deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* convert to rst, add note

* add class

* fix docs

* improve the doc

* rework to dump a lot more info about each frame

* complete expansion

* cleanup

* format

* cleanup

* doesn't have to be transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* wrap long line

* style

Co-authored-by: Sylvain Gugger <[email protected]>

* [DeepSpeed] fp32 support (#11499)

* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made

* Fixed docs for the shape of `scores` in `generate()` (#10057)

* Fixed the doc for the shape of return scores tuples in generation_utils.py.

* Fix the output shape of `scores` for `DecoderOnlyOutput`.

* style fix

* Fix examples in M2M100 docstrings (#11540)

Replaces `tok` with `tokenizer` so examples can run with copy-paste

* [Flax BERT/Roberta] few small fixes (#11558)

* small fixes

* style

* [Wav2Vec2] Fix convert (#11562)

* push

* small change

* correct other typo

* Remove `datasets` submodule. (#11563)

* fix the mlm longformer example by changing [MASK] to <mask> (#11559)

* Add LUKE (#11223)

* Rebase with master

* Minor bug fix in docs

* Copy files from adding_luke_v2 and improve docs

* change the default value of use_entity_aware_attention to True

* remove word_hidden_states

* fix head models

* fix tests

* fix the conversion script

* add integration tests for the pretrained large model

* improve docstring

* Improve docs, make style

* fix _init_weights for pytorch 1.8

* improve docs

* fix tokenizer to construct entity sequence with [MASK] entity when entities=None

* Make fix-copies

* Make style & quality

* Bug fixes

* Add LukeTokenizer to init

* Address most comments by @patil-suraj and @LysandreJik

* rename _compute_extended_attention_mask to get_extended_attention_mask

* add comments to LukeSelfAttention

* fix the documentation of the tokenizer

* address comments by @patil-suraj, @LysandreJik, and @sgugger

* improve docs

* Make style, quality and fix-copies

* Improve docs

* fix docs

* add "entity_span_classification" task

* update example code for LukeForEntitySpanClassification

* improve docs

* improve docs

* improve the code example in luke.rst

* rename the classification layer in LukeForEntityClassification from typing to classifier

* add bias to the classifier in LukeForEntitySpanClassification

* update docs to use fine-tuned hub models in code examples of the head models

* update the example sentences

* Make style & quality

* Add require_torch to tokenizer tests

* Add require_torch to tokenizer tests

* Address comments by @sgugger and add community notebooks

* Make fix-copies

Co-authored-by: Ikuya Yamada <[email protected]>

* [Wav2vec2] Fixed tokenization mistakes while adding single-char tokens to tokenizer (#11538)

* Fixed tokenization mistakes while adding single-char tokens to tokenizer

* Added tests and Removed unnecessary comments.

* finalize wav2vec2 tok

* add more aggressive tests

* Apply suggestions from code review

* fix useless import

Co-authored-by: Patrick von Platen <[email protected]>

* Fix metric computation in `run_glue_no_trainer` (#11569)

* Fixes a useless warning. (#11566)

Fixes #11525

* Accumulate opt state dict on do_rank 0 (#11481)

* Update training tutorial (#11533)

* Update training tutorial

* Apply suggestions from code review

Co-authored-by: Hamel Husain <[email protected]>

* Address review comments

* Update docs/source/training.rst

Co-authored-by: Lysandre Debut <[email protected]>

* More review comments

* Last review comments

Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

* fix resize_token_embeddings (#11572)

* Add multi-class, multi-label and regression to transformers (#11012)

* add to  bert

* review comments

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* self.config.problem_type

* fix style

* fix

* fin

* fix

* update doc

* fix

* test

* Test more problem types

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* fix

* remove

* fix

* quality

* make fix-copies

* remove test

Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>

* Enable added tokens (#11325)

* Fix tests

* Reorganize

* Update tests/test_modeling_mobilebert.py

* Remove unnecessary addition

* Make quality scripts work when one backend is missing. (#11573)

* Make quality scripts work when one backend is missing.

* Check env variable is properly set

* Add default

* With print statements

* Fix typo

* Set env variable

* Remove debug code

* [FlaxRoberta] Add FlaxRobertaModels & adapt run_mlm_flax.py (#11470)

* add flax roberta

* make style

* correct initialiazation

* modify model to save weights

* fix copied from

* fix copied from

* correct some more code

* add more roberta models

* Apply suggestions from code review

* merge from master

* finish

* finish docs

Co-authored-by: Patrick von Platen <[email protected]>

* Removes SageMakerTrainer code but keeps class as wrapper (#11587)

* removed all old code

* make quality

* [Flax] Add Electra models (#11426)

* add electra model to flax

* Remove Electra Next Sentence Prediction model added by mistake

* fix parameter sharing and loosen equality threshold

* fix styling issues

* add mistaken removen imports

* fix electra table

* Add FlaxElectra to automodels and fixe docs

* fix issues pointed out the PR

* fix flax electra to comply with latest changes

* remove stale class

* add copied from

Co-authored-by: Patrick von Platen <[email protected]>

* Reproducible checkpoint (#11582)

* Set generator in dataloader

* Use generator in all random samplers

* Checkpoint all RNG states

* Final version

* Quality

* Test

* Address review comments

* Quality

* Remove debug util

* Add python and numpy RNGs

* Split states in different files in distributed

* Quality

* local_rank for TPUs

* Only use generator when accepted

* Add test

* Set seed to avoid flakiness

* Make test less flaky

* Quality

* [trainer] document resume randomness (#11588)

* document resume randomness

* fix link

* reword

* fix

* reword

* style

* copies need to be fixed too (#11585)

* add importlib_metadata and huggingface_hub as dependency in the conda recipe (#11591)

* add importlib_metadata as dependency (#11490)

Co-authored-by: Deepali Chourasia <[email protected]>

* add huggingface_hub dependency

Co-authored-by: Deepali Chourasia <[email protected]>

* Skip Funnel test

* Pytorch - Lazy initialization of models (#11471)

* lazy_init_weights

* remove ipdb

* save int

* add necessary code

* remove unnecessary utils

* Update src/transformers/models/t5/modeling_t5.py

* clean

* add tests

* correct

* finish tests

* finish tests

* fix some more tests

* fix xlnet & transfo-xl

* fix more tests

* make sure tests are independent

* fix tests more

* finist tests

* final touches

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* clean tests

* give arg positive name

* add more mock weights to xlnet

Co-authored-by: Stas Bekman <[email protected]>

* Accept tensorflow-rocm package when checking TF availability (#11595)

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Philipp Schmid <[email protected]>
Co-authored-by: JohnnyC08 <[email protected]>
Co-authored-by: Hemil Desai <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: cchen-dialpad <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Lysandre <[email protected]>
Co-authored-by: cronoik <[email protected]>
Co-authored-by: Joe Davison <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: versis <[email protected]>
Co-authored-by: Eren Şahin <[email protected]>
Co-authored-by: Amala Deshmukh <[email protected]>
Co-authored-by: konstin <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: SHYAM SUNDER KUMAR <[email protected]>
Co-authored-by: Leo Gao <[email protected]>
Co-authored-by: Vasudev Gupta <[email protected]>
Co-authored-by: Jannis Born <[email protected]>
Co-authored-by: Yusuke Mori <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Julien Demouth <[email protected]>
Co-authored-by: Andrea Cappelli <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: Keisuke Hirota <[email protected]>
Co-authored-by: Saviour Owolabi <[email protected]>
Co-authored-by: Kevin Canwen Xu <[email protected]>
Co-authored-by: Masatoshi TSUCHIYA <[email protected]>
Co-authored-by: fghuman <[email protected]>
Co-authored-by: Amna <[email protected]>
Co-authored-by: Takuya Makino <[email protected]>
Co-authored-by: calpt <[email protected]>
Co-authored-by: Ceyda Cinarel <[email protected]>
Co-authored-by: Nithin Holla <[email protected]>
Co-authored-by: nithin19 <[email protected]>
Co-authored-by: Joel Stremmel <[email protected]>
Co-authored-by: Sudharsan S T <[email protected]>
Co-authored-by: Sudharsan Thirumalai <[email protected]>
Co-authored-by: Thomas Wood <[email protected]>
Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: e <[email protected]>
Co-authored-by: TAE YOUNGDON <[email protected]>
Co-authored-by: rajvi-k <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Matt <[email protected]>
Co-authored-by: wlhgtc <[email protected]>
Co-authored-by: johnson7788 <[email protected]>
Co-authored-by: johnson <[email protected]>
Co-authored-by: PenutChen <[email protected]>
Co-authored-by: Max Del <[email protected]>
Co-authored-by: Yoshitomo Matsubara <[email protected]>
Co-authored-by: Teven <[email protected]>
Co-authored-by: Kiran R <[email protected]>
Co-authored-by: Nicola De Cao <[email protected]>
Co-authored-by: Daniel Stancl <[email protected]>
Co-authored-by: Philip May <[email protected]>
Co-authored-by: Stefan Schweter <[email protected]>
Co-authored-by: abiolaTresor <[email protected]>
Co-authored-by: Amine Abdaoui <[email protected]>
Co-authored-by: LSinev <[email protected]>
Co-authored-by: Kostas Stathoulopoulos <[email protected]>
Co-authored-by: Bhadresh Savani <[email protected]>
Co-authored-by: Jaimeen Ahn <[email protected]>
Co-authored-by: Ashwin Geet D'Sa <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Hamel Husain <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Michael Benayoun <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Frederik Bode <[email protected]>
Co-authored-by: Manuel Romero <[email protected]>
Co-authored-by: CeShine Lee <[email protected]>
Co-authored-by: Shubham Sanghavi <[email protected]>
Co-authored-by: bonniehyeon <[email protected]>
Co-authored-by: jingyihe <[email protected]>
Co-authored-by: Ikuya Yamada <[email protected]>
Co-authored-by: Muktan <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: abhishek thakur <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Patrick Fernandes <[email protected]>
Co-authored-by: Deepali <[email protected]>
Co-authored-by: Deepali Chourasia <[email protected]>
Co-authored-by: Mats Sjöberg <[email protected]>
@john-heyer
Copy link

Hey all - this introduced a pretty nasty bug for us, that took a while to figure out. Here's a case where this initialization of the missing_keys, which normally shouldn't matter, broke our model after version 4.6. We have a custom model that optionally initializes some of its weights from a separate module. Calling from_pretrained and passing this separate module used to "work", but now those weights are overwritten. See MWE:

from torch.nn import Linear
from transformers import BertModel


class MyCustomModel(BertModel):
    def __init__(self, config, custom_layer=None):
        super().__init__(config)
        if custom_layer is not None:
            self.custom_layer = custom_layer
        else:
            self.custom_layer = Linear(1024, 1024)


if __name__ == "__main__":
    import transformers
    print(transformers.__version__)
    layer = Linear(1024, 1024)
    print(layer.weight.sum())
    custom_model = MyCustomModel.from_pretrained('bert-base-uncased', custom_layer=layer)
    # used to be the same as the layer above, but it is "re-initialized" in the from_pretrained method
    print(custom_model.custom_layer.weight.sum())

Result:

4.11.3
tensor(5.9874, grad_fn=<SumBackward0>)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing MyCustomModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MyCustomModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MyCustomModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of MyCustomModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.custom_layer.bias', 'bert.custom_layer.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
tensor(17.4651, grad_fn=<SumBackward0>)

This used to "just work" in version < 4.6. Perhaps we were relying on an unintended "feature". Setting _fast_init=False does fix things, but it's a bit hacky as it only applies to the initialization of this custom module that is called upstream in our service. Additionally, we don't know what happens if we'll need to rely on this feature in the future, but it goes away.

Can you comment on this? Thanks!

@patrickvonplaten
Copy link
Contributor Author

Hey @john-heyer,

Thanks for the feedback. We indeed didn't take account the effect this would have on custom models that inherit from transformers models like BERT. Just to understand better, the problem was that before 4.6, the custom_layer was not (re-)initialized when calling MyCustomModel.from_pretrained(...) - however after 4.6 it was initialized twice once before calling from_pretrained(...) and once after calling it?

@patrickvonplaten
Copy link
Contributor Author

@john-heyer - I answered in-detail here: #17370

@john-heyer
Copy link

thanks @patrickvonplaten - yes, that is correct - before 4.6 custom_layer was not re-initialized, and now it is! Thanks for opening the other thread. I'll follow there.

@vadimkantorov
Copy link

Can this be used to load pretrained params directly to the GPU models without keeping the full copy of params on CPU?

E.g. https://github.com/huggingface/diffusers has unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church").to(torch_device) while theoretically one could have unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church"m device = torch_device)`

@patrickvonplaten
Copy link
Contributor Author

Hey @vadimkantorov, could you maybe open an issue under diffusers instead? :-)

@vadimkantorov
Copy link

Is transfomers following a different design? I assumed diffusers just copied the original design from transformers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[model_utils] very slow model instantiation
6 participants