[`v3`] How do you envision this working with local datasets? #2635

smerrill · 2024-05-09T22:05:26Z

Hi, and thanks very much for this awesome library. I'm attempting to update my training pipeline to v3, but I've hit a snag; when I try to load a local Parquet file as a dataset, the model card generator attempts to hit the HF hub to look for my dataset, which it can't find. For now I could fake the name to be the same as an existing dataset, but what would you like the workflow to be for this in the future?

My test code:

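from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

dataset = load_dataset("./training_dataset/", data_files={'data.parquet'})['train'].train_test_split(test_size=0.2)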

train_dataset = dataset["train"]
eval_dataset = dataset["test"]

model = SentenceTransformer(MODEL)
guide_model = SentenceTransformer(GUIDE_MODEL)

# 3. Define a loss function
loss = losses.CachedGISTEmbedLoss(model, mini_batch_size=16, guide=guide_model)

training_args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=10,
    per_device_train_batch_size=2048,
    per_device_eval_batch_size=2048,
    warmup_ratio=0.2,
    bf16=True
)

# 4. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    args=training_args
)
trainer.train()

And the end of my stack trace looks like the following:

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-663d4582-26fdd83d076b6389554b8f7b;3ae1d179-d45b-41f0-aee4-2e2302089c7b)

Repository Not Found for url: https://huggingface.co/api/datasets/training_dataset.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

smerrill · 2024-05-09T22:21:53Z

For now I added a `try: ... except: pass` stanza around https://github.com/UKPLab/sentence-transformers/blob/v3.0-pre-release/sentence_transformers/model_card.py#L69 and https://github.com/UKPLab/sentence-transformers/blob/v3.0-pre-release/sentence_transformers/model_card.py#L74.

What do you think about a fetch_dataset_metadata property on SentenceTransformerTrainingArguments? If that sounds agreeable I can work up a PR.
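To make the proposal concrete, here is a minimal sketch of how such an opt-out flag could behave. Everything in it is hypothetical: `HypotheticalTrainingArgs`, `maybe_fetch_metadata`, and the injected `fetch` callable are stand-ins for illustration, not the actual `SentenceTransformerTrainingArguments` or model card code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HypotheticalTrainingArgs:
    """Stand-in for the real training arguments class (illustration only)."""

    output_dir: str
    # Proposed flag: set to False to skip every Hub lookup for dataset metadata.
    fetch_dataset_metadata: bool = True


def maybe_fetch_metadata(
    args: HypotheticalTrainingArgs,
    dataset_id: str,
    fetch: Callable[[str], dict],
) -> Optional[dict]:
    """Call `fetch` only when the user has not opted out."""
    if not args.fetch_dataset_metadata:
        return None
    return fetch(dataset_id)
```

With `fetch_dataset_metadata=False`, the lookup function is never called, so a local dataset such as `./training_dataset/` would not trigger a request at all.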

tomaarsen · 2024-05-09T22:46:02Z

Hello!

This sounds like an edge case that I've missed. Could you give a slightly more detailed traceback? I'd love to see which call to Hugging Face failed. I have a suspicion that it might be in a location where potential failures should be caught. After all, the "extract metadata" method just exists to try and fetch some metadata, but that's not always possible - if it's not possible, it should just quietly fail and continue.
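The "quietly fail and continue" behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `model_card.py`: `fetch_metadata_quietly` is a hypothetical helper, and `fetch` stands in for whatever function performs the Hub request.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def fetch_metadata_quietly(
    dataset_id: str,
    fetch: Callable[[str], dict],
) -> Optional[dict]:
    """Best-effort lookup: return the metadata, or None if anything goes wrong."""
    try:
        return fetch(dataset_id)
    except Exception as exc:
        # Metadata is optional, so a failed lookup (e.g. a local dataset
        # that does not exist on the Hub) must never interrupt training.
        logger.debug("Could not fetch metadata for %r: %s", dataset_id, exc)
        return None
```

A caller would then treat `None` as "no metadata available" and generate the model card without the dataset details.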

Also, thanks for experimenting with v3! These kinds of issues are how I can try to prevent big problems in the full release 🤗

Tom Aarsen

tomaarsen · 2024-05-10T14:09:42Z

I'll close this as I merged the fix into v3.0-pre-release via #2636. Thanks for reporting and feel free to share if you have more feedback (positive or negative).

Tom Aarsen

tomaarsen added a commit to tomaarsen/sentence-transformers that referenced this issue May 10, 2024

Fix crash from inferring the dataset_id from a local dataset

c8f8e30

See UKPLab#2635

tomaarsen mentioned this issue May 10, 2024

[v3] Fix crash from inferring the dataset_id from a local dataset #2636

Merged

tomaarsen added a commit that referenced this issue May 10, 2024

Fix crash from inferring the dataset_id from a local dataset (#2636)

e88c3f4

See #2635

tomaarsen closed this as completed May 10, 2024
