[BUG] A bug while fine-tuning the model by iteratively training and evaluating using a sliding time window #783

hk63560892 · 2024-07-17T01:44:48Z

Bug description

I found that valid.parquet contains no labels.

Steps/Code to reproduce bug

When I run this code:
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set data
    time_index_train = time_index
    time_index_eval = time_index + 1
    train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on day related to time_index
    print('*' * 20)
    print("Launch training for day %s are:" % time_index)
    print('*' * 20 + '\n')
    trainer.train_dataset_or_path = train_paths
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step += 1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = eval_paths
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*' * 20)
    print("Eval results for day %s are:\t" % time_index_eval)
    print('\n' + '*' * 20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key])))
    wipe_memory()
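
As a diagnostic sketch rather than a confirmed fix: `glob.glob` returns an empty list when nothing matches, so if a day folder lacked `train.parquet` or `valid.parquet`, the loop above would hand an empty path list to the trainer without complaint. This does not by itself explain the traceback, but it rules out one cause early. The helper name `find_window_file` is hypothetical and not part of Transformers4Rec.

```python
import glob
import os


def find_window_file(output_dir, time_index, filename):
    """Return the parquet paths for one time window, or raise if none exist.

    Hypothetical helper for illustration; not a Transformers4Rec API.
    """
    pattern = os.path.join(output_dir, str(time_index), filename)
    paths = sorted(glob.glob(pattern))
    if not paths:
        # glob.glob silently returns [] when nothing matches the pattern.
        raise FileNotFoundError(f"No file matches {pattern!r}")
    return paths
```

Inside the loop, `find_window_file(OUTPUT_DIR, time_index_eval, "valid.parquet")` could stand in for the bare `glob.glob` call, so a missing file surfaces before `trainer.evaluate` runs.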

The following error appears:

Launch training for day 1 are:

/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
{'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'train_loss': 10.525657145182292, 'epoch': 60.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:04<00:00, 14.92it/s]
TrainOutput(global_step=60, training_loss=10.525657145182292, metrics={'train_runtime': 4.0234, 'train_samples_per_second': 3817.691, 'train_steps_per_second': 14.913, 'total_flos': 0.0, 'train_loss': 10.525657145182292})
Traceback (most recent call last):
  File "", line 17, in
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2932, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/trainer.py", line 515, in evaluation_loop
    metrics_results_detailed = model.calculate_metrics(preds, labels)
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py", line 616, in calculate_metrics
    head.calculate_metrics(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py", line 453, in calculate_metrics
    task.calculate_metrics(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/prediction_task.py", line 489, in calculate_metrics
    result = metric(predictions, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 301, in forward
    self._forward_cache = self._forward_full_state_update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 316, in _forward_full_state_update
    self.update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py", line 465, in wrapped_func
    update(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/ranking_metric.py", line 56, in update
    metric = self._metric(
  File "/usr/local/lib/python3.10/dist-packages/transformers4rec/torch/ranking_metric.py", line 137, in _metric
    if rel_indices.shape[0] > 0:
IndexError: tuple index out of range
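
For context on the final line of the traceback: in Python, `IndexError: tuple index out of range` is the message raised when indexing past the end of a tuple, including an empty one. A shape of `()`, as a zero-dimensional array or tensor would have, is one way `shape[0]` can fail like this. Whether that is what happens inside `ranking_metric.py` here is an assumption; nothing in this thread confirms it. A minimal pure-Python illustration, using the hypothetical helper `first_dim`:

```python
def first_dim(shape):
    """Return shape[0], or None when the shape has no dimensions."""
    try:
        return shape[0]
    except IndexError as err:
        # Indexing an empty tuple raises "tuple index out of range".
        print(f"IndexError: {err}")
        return None


first_dim((3, 20))  # a 2-D shape: returns 3
first_dim(())       # a 0-D shape: prints the error and returns None
```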

Expected behavior

I expected labels to be available for evaluation.

Environment details

Transformers4Rec version: 23.12
Platform: Docker
Python version: 3.10
Huggingface Transformers version: 4.27.1
PyTorch version (GPU?): 2.1.0a0+4136153
Tensorflow version (GPU?):
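
Since exact versions matter for reproducing this, the installed package versions can be read from inside the container with the standard library's `importlib.metadata`. A minimal sketch: the distribution names listed are assumed to match how the packages are installed in the image, and `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


# Assumed distribution names; adjust if the image installs them differently.
for name in ("transformers4rec", "transformers", "torch", "torchmetrics"):
    print(f"{name}: {installed_version(name)}")
```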

Additional context

rnyak · 2024-07-18T18:25:36Z

@hk63560892 could you please share the link to the example notebook you are running, and tell us which Docker image you are using?

hk63560892 · 2024-07-22T04:25:35Z

Link:
https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/tutorial/03-Session-based-recsys.ipynb
Docker:
docker run -it --gpus device=0 -p 8000:8000 -p 8001:8001 -p 8002:8002 -p 8888:8888 -v <path_to_data>:/workspace/data/ nvcr.io/nvidia/merlin/merlin-pytorch:23.XX

Thank you!!

rnyak · 2024-07-22T17:06:27Z

@hk63560892 which Docker image tag are you using? That is, which 23.XX? We have several tags that start with 23, so please be specific.

Also note that the tutorials have not been maintained for a while, so you may want to refer to the other example notebooks in the examples directory instead.

hk63560892 added the bug and status/needs-triage labels on Jul 17, 2024