🐛 Bug
When running the benchmarks for Mixtral-8x7B-v0.1 in eager mode, we get this error:
```
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 887, in benchmark_main
0: [rank0]:     print(f"Tokens/s: {benchmark.perf_metrics['tokens_per_sec']:.02f}")
0: [rank0]: TypeError: unsupported format string passed to NoneType.__format__
```
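The failing f-string can be reproduced in isolation: when `perf_metrics['tokens_per_sec']` is left as `None` (because throughput collection was skipped), applying a float format spec to it raises exactly this `TypeError`:

```python
# Minimal reproduction: a float format spec cannot be applied to None.
metrics = {"tokens_per_sec": None}  # value left unset when throughput collection is skipped
try:
    print(f"Tokens/s: {metrics['tokens_per_sec']:.02f}")
except TypeError as exc:
    print(exc)  # unsupported format string passed to NoneType.__format__
```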
I see in the log that there was a message:
```
Model Flops/Throughput calculation failed for model Mixtral-8x7B-v0.1. Skipping throughput metric collection.
```
It might be caused by this code in benchmark_litgpt.py:
```python
try:
    # Calculate the model FLOPs
    self.calculate_model_flops()
    # Setup throughput Collection
    self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters, world_size=world_size)
except:
    self.throughput = None
    print(
        f"Model Flops/Throughput calculation failed for model {self.model_name}. Skipping throughput metric collection."
    )
```
Both `self.calculate_model_flops()` and the `Throughput` setup live in the same try/except block. I would keep only `calculate_model_flops()` inside it, but maybe there were problems constructing `Throughput` that I'm simply not aware of.
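A minimal sketch of that first option. The classes here are stand-ins for illustration (the real `Throughput` comes from lightning); only the attribute names mirror the snippet above:

```python
# Hypothetical sketch: guard only the FLOPs calculation, so a failure there
# no longer silently disables throughput collection. Throughput is a stand-in
# class, not the real lightning monitor.
class Throughput:
    def __init__(self, window_size, world_size):
        self.window_size, self.world_size = window_size, world_size

class Benchmark:
    def __init__(self):
        self.model_name = "Mixtral-8x7B-v0.1"
        self.max_iters, self.warmup_iters = 10, 2
        self.model_flops = None

    def calculate_model_flops(self):
        # Simulates the failure seen in the log.
        raise RuntimeError("FLOPs estimation failed for this architecture")

    def setup_metrics(self, world_size):
        try:
            self.model_flops = None
            self.calculate_model_flops()
        except Exception:
            print(f"Model Flops calculation failed for model {self.model_name}. "
                  "Skipping FLOPs metric collection.")
        # Throughput setup no longer dies together with the FLOPs calculation.
        self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters,
                                     world_size=world_size)

bench = Benchmark()
bench.setup_metrics(world_size=64)
print(bench.throughput is not None)  # True: throughput survives the FLOPs failure
```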
Another possible fix is to check whether `tokens_per_sec` is actually set in the dictionary before formatting it.
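And a sketch of that second option, guarding the print in `benchmark_main` instead of the setup code. The dictionary shape and the helper name are assumptions based on the traceback, not copied from the script:

```python
def format_tokens_per_sec(perf_metrics):
    """Return a printable Tokens/s line, tolerating a missing or None metric.

    Hypothetical helper; perf_metrics layout is assumed from the traceback.
    """
    tokens_per_sec = perf_metrics.get("tokens_per_sec")
    if tokens_per_sec is None:
        return "Tokens/s: n/a (throughput collection was skipped)"
    return f"Tokens/s: {tokens_per_sec:.02f}"

print(format_tokens_per_sec({"tokens_per_sec": None}))    # Tokens/s: n/a (throughput collection was skipped)
print(format_tokens_per_sec({"tokens_per_sec": 1234.5}))  # Tokens/s: 1234.50
```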
This is an issue in the benchmark_litgpt.py script itself. I know one possible fix for it, so I can prepare a PR around Wednesday, but it won't recover the missing results from the calculate_model_flops function.
To Reproduce
Please use:
8 node(s), each with 8 GPUs.
Image "INTERNAL_IMAGE:pjnl-20241001"
Training script:
```shell
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name Mixtral-8x7B-v0.1 \
    --distributed_mode fsdp \
    --shard_mode zero3 \
    --compile eager \
    --checkpoint_activations True \
    --low_precision_mode none \
    --micro_batch_size 1
```
Expected behavior
We should be able to run the benchmarking script even if we are not able to print a few of the metrics.
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.2.004
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.7
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.13+git4cbd7a4
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.6.0a0+gitd6d9183
libraries.pip.torchmetrics 1.4.2
libraries.pip.torchvision 0.19.0a0+d23a6e1