
See #1446: Adds huggingface trainer for sentence transformers #1733

Closed

Conversation

matthewfranglen
Contributor

This implements a Hugging Face Transformers-compatible trainer that works for a task like the CosineSimilarityLoss example in the Training documentation. It should be easy to extend to multi-task training if desired; a rough sketch follows the example below.

You would use it as follows:

from typing import Dict

import datasets
from transformers import EvalPrediction, TrainingArguments

from sentence_transformers import SentenceTransformer, evaluation, losses

# SentenceTransformersTrainer and SentenceTransformersCollator are the two
# classes added by this PR.

sick_ds = datasets.load_dataset("sick")

training_args = TrainingArguments(
    output_dir=...,
    num_train_epochs=10,
    seed=33,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=100,
    optim="adamw_torch",
    #
    # checkpoint settings
    # (load_best_model_at_end requires matching evaluation and save strategies)
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir=...,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="cosine_similarity",
    greater_is_better=True,
    #
    # needed to get sentence_A and sentence_B
    remove_unused_columns=False,
)

model = SentenceTransformer("nli-distilroberta-base-v2")
tokenizer = model.tokenizer
loss = losses.CosineSimilarityLoss(model)
evaluator = evaluation.EmbeddingSimilarityEvaluator(
    sick_ds["validation"]["sentence_A"],
    sick_ds["validation"]["sentence_B"],
    sick_ds["validation"]["label"],
    main_similarity=evaluation.SimilarityFunction.COSINE,
)
# The EvalPrediction argument is ignored: the evaluator re-encodes the
# validation split with the current model itself.
def compute_metrics(predictions: EvalPrediction) -> Dict[str, float]:
    return {
        "cosine_similarity": evaluator(model)
    }

# Collates raw text rows into tokenized features for the two sentence columns.
data_collator = SentenceTransformersCollator(
    tokenizer=tokenizer,
    text_columns=["sentence_A", "sentence_B"],
)

trainer = SentenceTransformersTrainer(
    model=model,
    args=training_args,
    train_dataset=sick_ds["train"],
    eval_dataset=sick_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    # custom arguments
    loss=loss,
    text_columns=["sentence_A", "sentence_B"],
)

trainer.train()
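
Multi-task training could be handled by subclassing the trainer and routing each batch to a per-task loss. A rough, hypothetical sketch, continuing from the example above (nothing below is in this PR: the losses_by_task mapping, the batch-level "task" key, and the collect_features helper are all assumptions):

from typing import Dict

from torch import nn

class MultiTaskSentenceTransformersTrainer(SentenceTransformersTrainer):
    """Hypothetical multi-task variant: picks a loss per batch by task name."""

    def __init__(self, *args, losses_by_task: Dict[str, nn.Module], **kwargs):
        super().__init__(*args, **kwargs)
        self.losses_by_task = losses_by_task

    def compute_loss(self, model, inputs, return_outputs=False):
        # Assumes the collator attaches a batch-level "task" key, and that a
        # collect_features helper splits the inputs into features and labels.
        task = inputs.pop("task")
        features, labels = self.collect_features(inputs)
        loss = self.losses_by_task[task](features, labels)
        return (loss, features) if return_outputs else loss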

@matthewfranglen
Contributor Author

Ooh, I might've linked the wrong issue. I think #1446 is more appropriate.

@matthewfranglen matthewfranglen changed the title See #1638: Adds huggingface trainer for sentence transformers See #1446: Adds huggingface trainer for sentence transformers Oct 26, 2022
vaibhavad added a commit to vaibhavad/sentence-transformers that referenced this pull request Jan 7, 2024
@tomaarsen
Collaborator

Hello!

This has been fully extended and implemented in the v3.0 refactor via #2449. Thanks a bunch for starting this work.
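
For reference, the equivalent run under the v3.0 SentenceTransformerTrainer looks roughly like the following (a minimal sketch; the select_columns call is an assumption about how the sick columns map onto the loss inputs, so check the v3.0 training docs for the exact API):

from datasets import load_dataset

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("nli-distilroberta-base-v2")
sick_ds = load_dataset("sick")

# The v3.0 trainer maps dataset columns onto the loss inputs positionally,
# so keep only the two text columns and the label (mirroring the PR example).
train_dataset = sick_ds["train"].select_columns(["sentence_A", "sentence_B", "label"])
eval_dataset = sick_ds["validation"].select_columns(["sentence_A", "sentence_B", "label"])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=CosineSimilarityLoss(model),
)
trainer.train()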

  • Tom Aarsen

@tomaarsen tomaarsen closed this Jun 4, 2024