Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v3] How do you envision this working with local datasets? #2635

Closed
smerrill opened this issue May 9, 2024 · 3 comments
Closed

[v3] How do you envision this working with local datasets? #2635

smerrill opened this issue May 9, 2024 · 3 comments

Comments

@smerrill
Copy link

smerrill commented May 9, 2024

Hi, and thanks very much for this awesome library. I'm attempting to update my training pipeline to v3, but I've hit a snag; when I try to load a local Parquet file as a dataset, the model card generator attempts to hit the HF hub to look for my dataset, which it can't find. For now I could fake the name to be the same as an existing dataset, but what would you like the workflow to be for this in the future?

My test code:

dataset = load_dataset("./training_dataset/", data_files={'data.parquet'})['train'].train_test_split(test_size=0.2)

train_dataset = dataset["train"]
eval_dataset = dataset["test"]

model = SentenceTransformer(MODEL)
guide_model = SentenceTransformer(GUIDE_MODEL)

# 3. Define a loss function
loss = losses.CachedGISTEmbedLoss(model, mini_batch_size=16, guide=guide_model)

training_args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=10,
    per_device_train_batch_size=2048,
    per_device_eval_batch_size=2048,
    warmup_ratio=0.2,
    bf16=True
)

# 4. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    args=training_args
)
trainer.train()

And the end of my stack trace looks like the following:

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-663d4582-26fdd83d076b6389554b8f7b;3ae1d179-d45b-41f0-aee4-2e2302089c7b)

Repository Not Found for url: https://huggingface.co/api/datasets/training_dataset.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
@smerrill
Copy link
Author

smerrill commented May 9, 2024

For now I added a try: except: pass stanza around https://github.com/UKPLab/sentence-transformers/blob/v3.0-pre-release/sentence_transformers/model_card.py#L69 and https://github.com/UKPLab/sentence-transformers/blob/v3.0-pre-release/sentence_transformers/model_card.py#L74.

What do you think about a fetch_dataset_metadata property on SentenceTransformerTrainingArguments? If that sounds agreeable I can work up a PR.

@tomaarsen
Copy link
Collaborator

Hello!

This sounds like an edge case that I've missed. Could you give a slightly more detailed traceback? I'd love to see which call to Hugging Face failed. I have a suspicion that it might be in a location where potential failures should be caught. After all, the "extract metadata" method just exists to try and fetch some metadata, but that's not always possible - if it's not possible, it should just quietly fail and continue.

Also, thanks for experimenting with v3! These kinds of issues is how I can try and prevent big issues in the full release 🤗

  • Tom Aarsen

@tomaarsen
Copy link
Collaborator

I'll close this as I merged the fix into v3.0-pre-release via #2636. Thanks for reporting and feel free to share if you have more feedback (positive or negative).

  • Tom Aarsen

tomaarsen added a commit that referenced this issue May 28, 2024
)

* [`v3`] Training refactor - MultiGPU, loss logging, bf16, etc. (#2449)

* See #1638: Adds huggingface trainer for sentence transformers

* Fix type of tokenizer

* Get the trainer using the feature collation

* Update the docstring to reflect changes

* Initial draft for refactoring training usig the Transformers Trainer

* Separate 'fit' functionality (new and old) into a mixin

* Resolve test issues

* Reformat

* Update the imports

* Add TODO regarding custom label columns

* Remove dead code

* Don't provide the trainer to the eval sampler

* Introduce datasets as a dependency

* Introduce "accelerate" as a dependency

* Avoid use_amp on CPU tests

* Specify that SentenceTransformer is a class, not a module

* Avoid circular import

* Remove | used as an "or" operator in typing

* Use test evaluator after training, as intended

* Use tokenize function instead of tokenizer;

Add EvaluatorCallback which calls the evaluator on every epoch (for BC);
Stop saving "do_lower_case" from Transformer;

* Reformat

* Revert Transformer tokenizer changes

* Add support for the tokenizer to return more than just input_ids & attention_masks

Required for LSTM

* Use the test evaluators after training the examples

* Use pure torch for BoW tokenization

* Use dev evaluator for BiLSTM - test fails

* Add Trainer support for BoW-based models

* Pass epoch to evaluator in every-epoch callback

For fit backwards compatibility

* Run formatting

* Use steps_per_epoch to set max_steps if possible

* Ignore extracting dataloader arguments for now

* Remove dead code

* Allow both "label" and "score" columns for labels

* Reformatting

* Improve errors if datasets don't match with loss dictionary well

* Made tests more consistent; list instead of set

* Simplify trainer with DatasetDict

* Implement a proportional sampler in addition to round robin

* Add CLIP finetuning support to the Trainer

* Start updating evaluators to return dictionaries

* Reformat

* Hackishly insert the DataParallel model into the loss function

* Allow for fsdp=["full_shard", "auto_wrap"]

with fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}

* Re-add support for DataParallel

* Use 'ParallelMode.NOT_PARALLEL'

* Prevent crash with DDP & an evaluation set

* When training with multiple datasets, add "dataset_name" column

Rather than relying on some Batch Sampler hacking (which fails with some distributed training approaches)

* Update type hints: make loss & evaluator optional

Co-authored-by: Wang Bo <[email protected]>

* Set correct superclasses for samplers

* Override 'accelerator.even_batches' as it's incompatible with multi-dataset

* Throw exception if "return_loss" or "dataset_name" columns are used

* Set min. version for accelerate

* Heavily extend model card generation

* Remove some dead code

* Fix evaluator type hints

* Ensure that 'model_card_template.md' is included in the built package

* Rephrase comments slightly

* Heavily refactor samplers; add no duplicates/group by label samplers

* Ensure that data_loader.dataset exists in FitMixin

* Adopt 8 as the default batch

* Fix logging error in example

* Remove the deprecated correct_bias

* Simplify with walrus operator

* Fix some bugs in set_widget_examples with short datasets

* Improve docstring slightly

* Add edge case in case training data has an unrecognized format

* Fix extracting dataset metadata

* Remove moot TYPE_CHECKING

* Set base model when loading a ST model also

* Add test_dataloader, add prefetch_factor to dataloaders

* Resolve predict_example fix; fix newlines in text

* Fix bug in compute_dataset_metrics examples

* Add call to action in ValueError

* Reuse original model card if no training is done

* Also collect nested losses (e.g. MatryoshkaLoss) and make losses in tags

* Remove generated tag; keep loss: prefix on tags

* Remove unused arguments

* Add support for "best model step" in model card

* Make hyperparameters code-formatted

* Fix load_best_model for Transformers models, prevent for non-Transformers

* Store base_model_revision in model_card_data

* Prevent crash when loading a local model

* Allow for bfloat16 inference

---------

Co-authored-by: Matthew Franglen <[email protected]>
Co-authored-by: Wang Bo <[email protected]>

* [`v3`] Add `similarity` and `similarity_pairwise` methods to Sentence Transformers (#2615)

* Add similarity function to model configuration

* Add more tests

* Replace util.cos_sim with model.similarity in some examples

* Reintroduce evaluation.SimilarityFunction

* Remove last references of score function in ST class

* Add similarity_fn_name to model card

* Add save_pretrained alias for save

* Introduce DOT alias for DOT_PRODUCT

* [`v3`] Fix various model card errors (#2616)

* Prevent model card save failure

* Print exceptions in more detail when they occur

* Fix edge case if dataset language is None

* [`v3`] Fix trainer `compute_loss` when evaluating/predicting if the `loss` updated the inputs in-place (#2617)

* Recompute the features if return_output

* Add SimilarityFunction to __init__, increment dev version

* Never return None in infer_datasets (#2620)

* Implement resume_from_checkpoint (#2621)

* [`v3`] Update example scripts to the new v3 training format (#2622)

* Update example scripts to the new v3 training format

* Add distillation training examples

* Add Matryoshka training examples

* Add NLI training examples

* Add STS training scripts

* Fix accidentally overriding eval set

* Update paraphrases multi-dataset training script

* Convert regular dicts to DatasetDict on Trainer init

* Update Quora duplicate training scripts

* Update "other" training scripts

* Update multilingual conversion script

* Add example scripts to Evaluators

* Add example to ST class itself

* Update docs formatting slightly

* Fix model card snippet

* Add short docstring for similarity_fn_name property

* Remove "return_outputs" as it's not strictly necessary. Avoids OOM & speeds up training (#2633)

* Fix crash from inferring the dataset_id from a local dataset (#2636)

See #2635

* Fix multilingual conversion script; extend MSELoss to multi-column (#2641)

And remove the now-unnecessary make_multilingual_sys.py

* Update evaluation scripts to use HF Datasets (#2642)

* Increment the version in setup.py (as well)

* Fix resume_from_checkpoint by also updating the loss (#2648)

I'm not very sure if updating the potential wrapped model like this will also work; it seems a bit risky, but it's equally risky to not do it.

* Fix an issue with in-place variable overriding preventing backwards passes on MSELoss (#2647)

Only when there's multiple columns

* Simplify load_from_checkpoint using load_state_dict (#2650)

Overriding the model has several downsides, e.g. regarding the model card generation

* Don't override the labels variable to avoid inplace operation (#2651)

* Resolve "one of the variables needed for gradient computation has been modified by an inplace operation." (#2654)

* [`v3`] Add hyperparameter optimization support by letting `loss` be a Callable that accepts a `model` (#2655)

* Add HPO support by letting the 'loss' be a function

* Only add "dataset_name" column if required by the loss function

* Add tag hinting at the number of training samples (#2660)

* [`v3`] For the Cached losses; ignore gradients if grad is disabled (e.g. eval) (#2668)

* For the Cached losses; ignore gradients if grad is disabled (e.g. eval)

* Warn that Matryoshka/AdaptiveLayer losses are not compatible with Cached

* [`docs`] Rewrite the https://sbert.net documentation for v3.0 (#2632)

* Start restructuring/rewriting the docs

* Update Pretrained Models section for ST

* Update & add many docstrings

* Completely overhaul "Training Overview" docs page for ST

* Update dataset overview

* Remove kwargs from paraphrase_mining signature

* Add "aka sbert"

* Remove Hugging Face docs page

* Update ST Usages

* Fix some links

* Use the training examples corresponding to that model type

* Add hyperparameter optimization example script + docs

* Add distributed training docs

* Complete rewrite for the Sentence Transformer docs portion

* Update the CE part of the docs

* Specify if __name__ == "__main__" & dataloader_drop_last with DDP

* Update the entire project to Google-style docstring

* Remove contact page

* Update README with updated links, etc.

* Update the loss examples

* Fix formatting

* Add remove_columns/select_columns tip to dataset overview

* [`v3`] Chore - include import sorting in ruff (#2672)

* Include import sorting in ruff

* Remove deprecated ignore-init-module-imports

* Remove --select I from ruff.toml again after CI issues

* [`v3`] Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy (#2673)

* Prevent warning with 'model.fit' with transformers >= 4.41.0 due to evaluation_strategy

* Reformat

* [`v3`] Add various useful Sphinx packages (copy code, link to code, nicer tabs) (#2674)

* No longer hide toctrees in API Reference

* Add linkcode support

It's not perfect, as it'll always link to 'master', but it'll do pretty nicely for the most part.

* Add copy button to all code blocks

* Add nicer tabs

* Reformatted

* [`v3`] Make the "primary_metric" for evaluators a bit more robust (#2675)

* Make the "primary_metric" for evaluators a bit more robust

* Also remove some other TODOs that are not very important or already done

* Set `broadcast_buffers = False` when training with DDP (#2663)

* [`v3`] Warn about using DP instead of DDP + set dataloader_drop_last with DDP (#2677)

* Warn about using DP instead of DDP + set dataloader_drop_last with DDP

* Prevent duplicate warnings

* Remove note, done automatically now

* Avoid inequality comparison to True

* [`v3`] Add warning that Evaluators only run on 1 GPU when multi-GPU training (#2678)

* Add warning that Evaluators only run on 1 GPU when multi-GPU training

* Also add a note in the distributed training docs

* [`v3`] Move training dependencies into a "train" extra (#2676)

* Move training dependencies into a "train" extra

* Install the train extra with the CI tests

* Simplify dev install: also include train deps there

* Implement is_..._available in ST instead; add is_training_available

* Update references to the API ref (#2679)

* [`v3`] Add "dataset_size:" to the tag denoting the number of training samples (#2680)

* Prepend "dataset_size:" instead. I can always change the look of this later

On the HF side

* Fix formatting of Python modules

* Docs: pairwise_cosine_similarity -> pairwise_similarity

* Link to the yet-to-be-released release notes instead

* Update phrasing on local_files_only docstring

* Link directly to the 2DMSE preprint

* Add missing subset in quora-duplicates

* Add missing docstrings arguments for Cached... losses

* Update training overview docs based on the blogpost reviews

---------

Co-authored-by: Matthew Franglen <[email protected]>
Co-authored-by: Wang Bo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants