GPU out of memory issues #242
Comments
Hello @chschroeder, Thank you for the heads up! Example script that can't reproduce endless memory growth:
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
import torch
from setfit import SetFitModel, SetFitTrainer, sample_dataset
# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")
# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"]
# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=32,
    num_iterations=5,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)
for i in range(10):
    trainer.train()
    model.predict(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
    print(torch.cuda.memory_allocated(0))
This constantly prints
Similarly, if I run the hyperparameter tuning from the README, except on the
Would it be possible to provide a script using either SetFit or a SentenceTransformer where the endless growth is reported? That way it can eventually be used to verify that a potential fix works or modified into a test over on the sentence-transformer repository. Feel free to post the response on the aforementioned issue on the sentence-transformers repository, too.
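A minimal sketch of how the per-iteration check in the script above could be turned into a reusable growth test. The helper names `record_memory` and `grows_steadily` are hypothetical and not part of SetFit or sentence-transformers. The measurement function is passed in, so on a GPU machine it could be `lambda: torch.cuda.memory_allocated(0)`; here a plain Python list stands in for the leak so the example runs anywhere.

```python
from typing import Callable, List


def record_memory(step: Callable[[], None], measure: Callable[[], int], iterations: int = 10) -> List[int]:
    """Run `step` repeatedly and record `measure()` after each run."""
    readings = []
    for _ in range(iterations):
        step()
        readings.append(measure())
    return readings


def grows_steadily(readings: List[int]) -> bool:
    """True only if every reading is strictly larger than the one before it."""
    return len(readings) > 1 and all(b > a for a, b in zip(readings, readings[1:]))


# Stand-in for a leaking train/predict step: each call keeps 1 KiB alive.
leaked = []
readings = record_memory(
    step=lambda: leaked.append(bytearray(1024)),
    measure=lambda: sum(len(chunk) for chunk in leaked),
    iterations=5,
)
print(readings, grows_steadily(readings))
```

A flat series of readings, as reported for the script above, would make `grows_steadily` return `False`; a script that reproduces the leak should make it return `True`.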
|
Sure! I am on it. My first attempt where I used a setfit example from scratch did not show the same behaviour, which is why I have to take a step back and take a closer look at my own code. I guess it might be related to the garbage collection because my own setup brings its own abstractions so the references might be cleaned up later. "Normal" transformer models as a cross check work flawlessly using this setup which is why I still think there is some problem elsewhere. I will investigate further and provide a script or notebook. |
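To illustrate the suspicion above that extra abstractions can delay cleanup, here is a small sketch using only the standard library. `FakeModel` and `Wrapper` are hypothetical stand-ins, not classes from SetFit or any active learning library; the point is only that deleting a local name does not free an object while another layer still references it.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a model object that would hold GPU tensors."""


class Wrapper:
    """Hypothetical abstraction layer that keeps its own reference to the model."""

    def __init__(self, model):
        self.model = model


model = FakeModel()
probe = weakref.ref(model)  # lets us observe whether the object is still alive
wrapper = Wrapper(model)

del model  # the local name is gone ...
gc.collect()
alive_after_del = probe() is not None  # ... but the wrapper still holds the object

wrapper.model = None  # drop the last strong reference
gc.collect()
alive_after_release = probe() is not None

print(alive_after_del, alive_after_release)
```

With real models the same pattern would mean that GPU tensors stay allocated until every holder of the model lets go of it.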
Update: I continued investigating my own example (active learning instead of the setfit-only example):
My assessment so far:
I might defer further investigation but I will report back in this issue once I find some time to continue investigating here. |
I just ran into this in SageMaker Studio instance (4 vCPU + 16 GiB + 1 GPU). I trained a model before, using default dataset from
|
@anjanvb If I understand you correctly, your error likely does not occur over time but instantly. This is unrelated to the specific issue described above. (You have, however, guessed correctly that a) there is a token limit, and b) the GPU memory usage scales with the text length.) So your problem is likely caused by the length of your input data. Search for the arguments
Regarding the original issue: Sorry, I did not have time to investigate this further. Feel free to close the issue if you prefer; as long as everything is documented here it might still be useful in the future. |
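As a rough illustration of the point about input length, the sketch below clips overly long texts before they reach the model. `truncate_words` is a hypothetical helper, and whitespace-separated words are only a crude proxy for model tokens, so the model's own truncation settings remain the proper place to enforce the limit.

```python
from typing import List


def truncate_words(texts: List[str], max_words: int) -> List[str]:
    """Keep at most `max_words` whitespace-separated words per text."""
    return [" ".join(text.split()[:max_words]) for text in texts]


texts = ["short text", "a much longer text " * 200]
clipped = truncate_words(texts, max_words=128)
print([len(t.split()) for t in clipped])
```

Short inputs pass through unchanged, while the 800-word input is cut down to the chosen limit.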
I'll close this for now, as the issue will still be accessible here. Should you find anything new on the topic that is worth our attention, then do not hesitate to reopen this issue or make a new one altogether. I'm personally interested in any memory leak issues that may exist, and I'd like to thank you for the time that you spent looking into this.
|
Thank you, Tom! Unfortunately I could not manage to produce a failing self-contained example (given my limited current time budget at least). I think I know how to fix the problem (namely by adding explicit delete statements), but such a code change is hard to justify if we cannot reliably verify it. Also, the problem is located in sentence-transformers and not in setfit, but nevertheless, it is good that this is documented here as well. For future reference: just two days ago I encountered the problem again, so it is still there. This notebook failed in 3 out of 5 tries and succeeded otherwise. The notebook was run on a 2080 with 11GB. |
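For readers wondering what "explicit delete statements" could look like in practice, here is a minimal sketch. It is not the fix proposed for sentence-transformers; `release` and `Holder` are hypothetical names, and the `torch` import is optional so the snippet also runs without a GPU stack installed.

```python
import gc


def release(holder, attr: str = "model") -> bool:
    """Delete `holder.<attr>` if present, then force collection and cache cleanup."""
    had_attr = hasattr(holder, attr)
    if had_attr:
        delattr(holder, attr)
    gc.collect()
    try:
        import torch  # optional: only needed to release cached, unused GPU memory
    except ImportError:
        return had_attr
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return had_attr


class Holder:
    """Hypothetical object that owns a model between iterations."""


holder = Holder()
holder.model = object()
print(release(holder), hasattr(holder, "model"))
```

Calling such a helper between the iterations of a training loop only helps if no other object still references the model.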
I ran into this OOM error myself, doing HPO with SetFit.
My case: OOM error using:
from optuna import Trial
from setfit import Trainer, SetFitModel, sample_dataset

# Optional, but for test purposes 8 ex. per class
# (dataset and eval_dataset are assumed to be loaded beforehand)
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8, seed=40)

def model_init(params):
    params = params or {}
    max_iter = params.get("max_iter", 100)
    solver = params.get("solver", "liblinear")
    params = {
        "head_params": {
            "max_iter": max_iter,
            "solver": solver,
        }
    }
    return SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2", **params)

def hp_space(trial):
    """ Define hyperparams search space (Optuna) """
    return {
        # Embeddings fine-tuning phase params :
        "body_learning_rate": trial.suggest_float("body_learning_rate", 1e-6, 1e-3, log=True),
        "num_epochs": trial.suggest_int("num_epochs", 1, 3),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32]),
        "seed": trial.suggest_int("seed", 1, 40),
        # LogisticRegression head params :
        "max_iter": trial.suggest_int("max_iter", 50, 300),
        "solver": trial.suggest_categorical("solver", ["newton-cg", "liblinear", "lbfgs"]),
    }

trainer = Trainer(
    model_init=model_init,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"comment": "text", "label": "label"},
)
best_run = trainer.hyperparameter_search(direction="maximize", hp_space=hp_space, n_trials=4)
I could extend the trials to more (at least 10), doing some memory management. (Partial) fix to extend the time before running OOM:
import gc
import torch
from optuna import Trial
from setfit import Trainer, SetFitModel, sample_dataset
from setfit.utils import BestRun  # needed for the return value below
import time

# Model initialization function
def model_init(params):
    params = params or {}
    max_iter = params.get("max_iter", 100)
    solver = params.get("solver", "liblinear")
    params = {
        "head_params": {
            "max_iter": max_iter,
            "solver": solver,
        }
    }
    # memory management
    gc.collect()
    torch.cuda.empty_cache()
    return SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2", **params)

# Hyperparameter space definition
def hp_space(trial):
    """ Define hyperparams search space (Optuna) """
    return {
        # Embeddings fine-tuning phase params :
        "body_learning_rate": trial.suggest_float("body_learning_rate", 1e-7, 1e-5, log=True),  # 1e-6, 1e-3
        # "num_epochs": trial.suggest_int("num_epochs", 1, 2),
        "max_steps": trial.suggest_int("max_steps", 650, 800),  # 200, 900
        "batch_size": trial.suggest_categorical("batch_size", [16]),
        "seed": trial.suggest_int("seed", 1, 40),
        # LogisticRegression head params :
        "max_iter": trial.suggest_int("max_iter", 120, 126),  # 100, 200
        "solver": trial.suggest_categorical("solver", ["liblinear"]),  # "newton-cg",'lbfgs'
    }

# Customized run_hp_search_optuna function
def run_hp_search_optuna_modified(trainer, n_trials, direction, **kwargs):
    import optuna

    def _objective(trial):
        trainer.objective = None
        trainer.train(trial=trial)
        # Evaluate if needed (this still requires trainer.model, so it runs before the cleanup)
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
            trainer.objective = trainer.compute_objective(metrics)
        # memory management
        del trainer.model
        gc.collect()
        torch.cuda.empty_cache()
        return trainer.objective

    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    study = optuna.create_study(direction=direction, **kwargs)
    # memory management : overkill, but also adding gc_after_trial=True in study.optimize()
    study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs, gc_after_trial=True)
    best_trial = study.best_trial
    return BestRun(str(best_trial.number), best_trial.value, best_trial.params, study)

# Initialize Trainer
trainer = Trainer(
    model_init=model_init,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"comment": "text", "label": "label"},
)
# Replace the run_hp_search_optuna method with the modified one
trainer.run_hp_search_optuna = run_hp_search_optuna_modified
# Run hyperparameter search
best_run = trainer.hyperparameter_search(direction="maximize", hp_space=hp_space, n_trials=3)
|
Thanks @matthieuvion but didn't seem to make much difference for me - every implementation seems to have memory issues. Running 3 trials of HPO on 64 short paragraphs of text with
Maybe I'm missing something more fundamental? |
Had the exact same error. Honestly, I spent too much time trying to understand where it could come from, and even after a week of trials your issue could be different. That's why I still posted my half-baked "solution" here. It does not prevent saturating the VRAM eventually. Just did a quick search with "e5 OOM" and there are plenty of possibilities. |
There are some other tricks you can try:
I don't really have time to dive into setfit source code, so can only say:
But you are right, if you're able to work within these limits, setfit is an awesome package. We use it a lot for training production models. |
Hi,
When repeatedly using SetFit's train()/predict() inside a loop (for active learning) the GPU memory usage steadily grows (despite that all results have been correctly transferred to the CPU). Eventually, "OutOfMemoryError: CUDA out of memory." is raised.
This is caused by sentence-transformers and I have reported it here in detail:
UKPLab/sentence-transformers/issues/1793
Just wanted to mention it here as well for the purposes of documentation.