Merge v3.0 pre-release into master, prepare for full v3.0 release (#2683)

* [`v3`] Training refactor - MultiGPU, loss logging, bf16, etc. (#2449) (minimal training sketch after this list)
* See #1638: Adds huggingface trainer for sentence transformers
* Fix type of tokenizer
* Get the trainer using the feature collation
* Update the docstring to reflect changes
* Initial draft for refactoring training using the Transformers Trainer
* Separate 'fit' functionality (new and old) into a mixin
* Resolve test issues
* Reformat
* Update the imports
* Add TODO regarding custom label columns
* Remove dead code
* Don't provide the trainer to the eval sampler
* Introduce datasets as a dependency
* Introduce "accelerate" as a dependency
* Avoid use_amp on CPU tests
* Specify that SentenceTransformer is a class, not a module
* Avoid circular import
* Remove | used as an "or" operator in typing
* Use test evaluator after training, as intended
* Use tokenize function instead of tokenizer; add EvaluatorCallback which calls the evaluator on every epoch (for BC); stop saving "do_lower_case" from Transformer
* Reformat
* Revert Transformer tokenizer changes
* Add support for the tokenizer to return more than just input_ids & attention_masks
  Required for LSTM
* Use the test evaluators after training the examples
* Use pure torch for BoW tokenization
* Use dev evaluator for BiLSTM - test fails
* Add Trainer support for BoW-based models
* Pass epoch to evaluator in every-epoch callback
  For fit backwards compatibility
* Run formatting
* Use steps_per_epoch to set max_steps if possible
* Ignore extracting dataloader arguments for now
* Remove dead code
* Allow both "label" and "score" columns for labels
* Reformatting
* Improve errors if datasets don't match the loss dictionary well
* Made tests more consistent; list instead of set
* Simplify trainer with DatasetDict
* Implement a proportional sampler in addition to round robin
* Add CLIP finetuning support to the Trainer
* Start updating evaluators to return dictionaries
* Reformat
* Hackishly insert the DataParallel model into the loss function
* Allow for fsdp=["full_shard", "auto_wrap"] with fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}
* Re-add support for DataParallel
* Use 'ParallelMode.NOT_PARALLEL'
* Prevent crash with DDP & an evaluation set
* When training with multiple datasets, add "dataset_name" column (multi-dataset sketch after this list)
  Rather than relying on some Batch Sampler hacking (which fails with some distributed training approaches)
* Update type hints: make loss & evaluator optional
  Co-authored-by: Wang Bo <[email protected]>
* Set correct superclasses for samplers
* Override 'accelerator.even_batches' as it's incompatible with multi-dataset
* Throw exception if "return_loss" or "dataset_name" columns are used
* Set min. version for accelerate
* Heavily extend model card generation
* Remove some dead code
* Fix evaluator type hints
* Ensure that 'model_card_template.md' is included in the built package
* Rephrase comments slightly
* Heavily refactor samplers; add no-duplicates/group-by-label samplers
* Ensure that data_loader.dataset exists in FitMixin
* Adopt 8 as the default batch size
* Fix logging error in example
* Remove the deprecated correct_bias
* Simplify with walrus operator
* Fix some bugs in set_widget_examples with short datasets
* Improve docstring slightly
* Add edge case handling in case training data has an unrecognized format
* Fix extracting dataset metadata
* Remove moot TYPE_CHECKING
* Also set base model when loading a ST model
* Add test_dataloader, add prefetch_factor to dataloaders
* Resolve predict_example fix; fix newlines in text
* Fix bug in compute_dataset_metrics examples
* Add call to action in ValueError
* Reuse original model card if no training is done
* Also collect nested losses (e.g. MatryoshkaLoss) and include losses in tags
* Remove "generated" tag; keep "loss:" prefix on tags
* Remove unused arguments
* Add support for "best model step" in model card
* Make hyperparameters code-formatted
* Fix load_best_model for Transformers models, prevent for non-Transformers
* Store base_model_revision in model_card_data
* Prevent crash when loading a local model
* Allow for bfloat16 inference

---------
Co-authored-by: Matthew Franglen <[email protected]>
Co-authored-by: Wang Bo <[email protected]>

* [`v3`] Add `similarity` and `similarity_pairwise` methods to Sentence Transformers (#2615) (sketch after this list)
* Add similarity function to model configuration
* Add more tests
* Replace util.cos_sim with model.similarity in some examples
* Reintroduce evaluation.SimilarityFunction
* Remove last references of score function in ST class
* Add similarity_fn_name to model card
* Add save_pretrained alias for save
* Introduce DOT alias for DOT_PRODUCT
* [`v3`] Fix various model card errors (#2616)
* Prevent model card save failure
* Print exceptions in more detail when they occur
* Fix edge case if dataset language is None
* [`v3`] Fix trainer `compute_loss` when evaluating/predicting if the `loss` updated the inputs in-place (#2617)
* Recompute the features if return_output
* Add SimilarityFunction to __init__, increment dev version
* Never return None in infer_datasets (#2620)
* Implement resume_from_checkpoint (#2621)
* [`v3`] Update example scripts to the new v3 training format (#2622)
* Update example scripts to the new v3 training format
* Add distillation training examples
* Add Matryoshka training examples
* Add NLI training examples
* Add STS training scripts
* Fix accidentally overriding eval set
* Update paraphrases multi-dataset training script
* Convert regular dicts to DatasetDict on Trainer init
* Update Quora duplicate training scripts
* Update "other" training scripts
* Update multilingual conversion script
* Add example scripts to Evaluators
* Add example to ST class itself
* Update docs formatting slightly
* Fix model card snippet
* Add short docstring for similarity_fn_name property
* Remove "return_outputs" as it's not strictly necessary; avoids OOM & speeds up training (#2633)
* Fix crash from inferring the dataset_id from a local dataset (#2636)
  See #2635
* Fix multilingual conversion script; extend MSELoss to multi-column (#2641)
  Also remove the now-unnecessary make_multilingual_sys.py
* Update evaluation scripts to use HF Datasets (#2642)
* Increment the version in setup.py (as well)
* Fix resume_from_checkpoint by also updating the loss (#2648)
  I'm not very sure if updating the potentially wrapped model like this will also work; it seems a bit risky, but it's equally risky not to do it.
* Fix an issue with in-place variable overriding preventing backwards passes on MSELoss (#2647)
  Only when there are multiple columns
* Simplify load_from_checkpoint using load_state_dict (#2650)
  Overriding the model has several downsides, e.g. regarding model card generation
* Don't override the labels variable to avoid an in-place operation (#2651)
* Resolve "one of the variables needed for gradient computation has been modified by an inplace operation." (#2654)
* [`v3`] Add hyperparameter optimization support by letting `loss` be a Callable that accepts a `model` (#2655) (sketch after this list)
* Add HPO support by letting the 'loss' be a function
* Only add "dataset_name" column if required by the loss function
* Add tag hinting at the number of training samples (#2660)
* [`v3`] For the Cached losses, ignore gradients if grad is disabled (e.g. eval) (#2668)
* For the Cached losses, ignore gradients if grad is disabled (e.g. eval)
* Warn that Matryoshka/AdaptiveLayer losses are not compatible with Cached losses
* [`docs`] Rewrite the https://sbert.net documentation for v3.0 (#2632)
* Start restructuring/rewriting the docs
* Update Pretrained Models section for ST
* Update & add many docstrings
* Completely overhaul "Training Overview" docs page for ST
* Update dataset overview
* Remove kwargs from paraphrase_mining signature
* Add "aka sbert"
* Remove Hugging Face docs page
* Update ST Usages
* Fix some links
* Use the training examples corresponding to that model type
* Add hyperparameter optimization example script + docs
* Add distributed training docs
* Complete rewrite for the Sentence Transformer docs portion
* Update the CE part of the docs
* Specify if __name__ == "__main__" & dataloader_drop_last with DDP
* Update the entire project to Google-style docstrings
* Remove contact page
* Update README with updated links, etc.
* Update the loss examples
* Fix formatting
* Add remove_columns/select_columns tip to dataset overview
* [`v3`] Chore - include import sorting in ruff (#2672)
* Include import sorting in ruff
* Remove deprecated ignore-init-module-imports
* Remove --select I from ruff.toml again after CI issues
* [`v3`] Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy (#2673)
* Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy
* Reformat
* [`v3`] Add various useful Sphinx packages (copy code, link to code, nicer tabs) (#2674)
* No longer hide toctrees in API Reference
* Add linkcode support
  It's not perfect, as it'll always link to 'master', but it'll do pretty nicely for the most part.
* Add copy button to all code blocks
* Add nicer tabs
* Reformatted
* [`v3`] Make the "primary_metric" for evaluators a bit more robust (#2675)
* Make the "primary_metric" for evaluators a bit more robust
* Also remove some other TODOs that are not very important or are already done
* Set `broadcast_buffers = False` when training with DDP (#2663)
* [`v3`] Warn about using DP instead of DDP + set dataloader_drop_last with DDP (#2677) (multi-GPU arguments sketch after this list)
* Warn about using DP instead of DDP + set dataloader_drop_last with DDP
* Prevent duplicate warnings
* Remove note, done automatically now
* Avoid inequality comparison to True
* [`v3`] Add warning that Evaluators only run on 1 GPU during multi-GPU training (#2678)
* Add warning that Evaluators only run on 1 GPU during multi-GPU training
* Also add a note in the distributed training docs
* [`v3`] Move training dependencies into a "train" extra (#2676)
* Move training dependencies into a "train" extra
* Install the train extra with the CI tests
* Simplify dev install: also include train deps there
* Implement is_..._available in ST instead; add is_training_available
* Update references to the API ref (#2679)
* [`v3`] Add "dataset_size:" to the tag denoting the number of training samples (#2680)
* Prepend "dataset_size:" instead
  I can always change the look of this later on the HF side
* Fix formatting of Python modules
* Docs: pairwise_cosine_similarity -> pairwise_similarity
* Link to the yet-to-be-released release notes instead
* Update phrasing on local_files_only docstring
* Link directly to the 2DMSE preprint
* Add missing subset in quora-duplicates
* Add missing docstring arguments for Cached... losses
* Update training overview docs based on the blogpost reviews

---------
Co-authored-by: Matthew Franglen <[email protected]>
Co-authored-by: Wang Bo <[email protected]>
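The training refactor in #2449 and the updated example scripts in #2622 move Sentence Transformers onto a Hugging Face style trainer. A minimal sketch of that flow, where the base checkpoint (`microsoft/mpnet-base`), the `sentence-transformers/all-nli` pair split, and the output paths are illustrative choices rather than required ones:

```python
# Minimal v3-style training sketch (illustrative model, dataset, and paths).
from datasets import load_dataset

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="output/mpnet-base-all-nli",  # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=32,
    bf16=True,          # bf16 training is one of the #2449 additions; needs supported hardware
    logging_steps=100,  # loss logging, also part of the refactor
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.save_pretrained("output/mpnet-base-all-nli/final")
```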
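Multi-dataset training (the proportional/round-robin samplers and the "dataset_name" column mentioned above) pairs a dictionary of datasets with a dictionary of losses keyed the same way. A sketch under that assumption, with the dataset choices and loss pairing purely illustrative:

```python
# Multi-dataset sketch: the keys of `train_dataset` and `loss` must match.
from datasets import load_dataset

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CoSENTLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

train_dataset = {
    "all-nli": load_dataset("sentence-transformers/all-nli", "pair", split="train[:5000]"),
    "stsb": load_dataset("sentence-transformers/stsb", split="train"),
}
loss = {
    "all-nli": MultipleNegativesRankingLoss(model),  # (anchor, positive) pairs
    "stsb": CoSENTLoss(model),                       # (sentence1, sentence2) + score
}

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(output_dir="output/multi-dataset-run"),
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```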
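The `similarity` and `similarity_pairwise` methods from #2615 expose the model's configured similarity function directly on the model. A short sketch; the checkpoint and sentences are illustrative:

```python
# Sketch of the #2615 similarity API.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
])

# Full similarity matrix between two sets of embeddings.
scores = model.similarity(embeddings, embeddings)  # shape (3, 3)

# Element-wise similarity between aligned pairs of embeddings.
pairwise = model.similarity_pairwise(embeddings[:2], embeddings[1:])  # shape (2,)

# The similarity function is stored in the model configuration and the model card.
print(model.similarity_fn_name)  # e.g. "cosine"
```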
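Several of the multi-GPU items above boil down to training arguments. A sketch collecting them, where the FSDP values are the ones quoted in the #2449 notes, `dataloader_drop_last` follows the DDP recommendation from #2677, and the output directory, batch size, and launch command are illustrative:

```python
# Multi-GPU related training arguments; launch with a distributed launcher,
# e.g. `torchrun --nproc_per_node=4 train_script.py`.
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/ddp-run",
    per_device_train_batch_size=32,
    dataloader_drop_last=True,  # recommended with DDP so every rank sees equally sized batches
    # For FSDP instead of plain DDP (values quoted from the changelog above):
    # fsdp=["full_shard", "auto_wrap"],
    # fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"},
)
# Note from #2678: evaluators still run on a single GPU during multi-GPU training.
```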
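For the hyperparameter optimization support from #2655, `loss` may be a callable that receives the freshly initialized model, so every trial gets its own loss instance. A hedged sketch assuming Optuna is installed as the search backend; `model_init`, `loss_init`, the search space, and the dataset slices are all illustrative, not part of a fixed API:

```python
# HPO sketch: `loss` as a callable that accepts the model (per #2655).
from datasets import load_dataset

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train[:2000]")
eval_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="dev[:500]")

def model_init() -> SentenceTransformer:
    # A fresh model is created for every trial.
    return SentenceTransformer("microsoft/mpnet-base")

def loss_init(model: SentenceTransformer) -> MultipleNegativesRankingLoss:
    # Called with the trial's model, so the loss is rebuilt per trial as well.
    return MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=None,  # created per trial via model_init
    model_init=model_init,
    args=SentenceTransformerTrainingArguments(output_dir="output/hpo-run"),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss_init,
)

best_run = trainer.hyperparameter_search(
    hp_space=lambda trial: {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
    },
    n_trials=10,
    direction="minimize",  # the default objective minimizes the evaluation loss
)
print(best_run)
```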