
TypeError for Mixtral-8x7B-v0.1: unsupported format string passed to NoneType.__format__ #1267

Open
mpatel31415 opened this issue Oct 7, 2024 · 2 comments · May be fixed by #1347
Labels
mixology Issues that the mixology team has surfaced

Comments

@mpatel31415
Contributor

mpatel31415 commented Oct 7, 2024

🐛 Bug

When running the benchmarks for Mixtral-8x7B-v0.1 in eager mode, we get this error:

0: [rank0]: File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 887, in benchmark_main
0: [rank0]: print(f"Tokens/s: {benchmark.perf_metrics['tokens_per_sec']:.02f}")
0: [rank0]: TypeError: unsupported format string passed to NoneType.__format__
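
For context, the failing f-string format can be reproduced in isolation (a minimal sketch, not taken from the benchmark; it assumes the key exists but its value is None because throughput collection was skipped):

    # Standalone reproduction of the TypeError seen above.
    perf_metrics = {"tokens_per_sec": None}  # assumed shape of benchmark.perf_metrics
    print(f"Tokens/s: {perf_metrics['tokens_per_sec']:.02f}")
    # TypeError: unsupported format string passed to NoneType.__format__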

I see in the log that there was a message:

Model Flops/Throughput calculation failed for model Mixtral-8x7B-v0.1. Skipping throughput metric collection.

It might be caused by the fact that, in this code in benchmark_litgpt.py:

    try:
        # Calculate the model FLOPs
        self.calculate_model_flops()
        # Setup throughput Collection
        self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters, world_size=world_size)
    except:
        self.throughput = None
        print(
            f"Model Flops/Throughput calculation failed for model {self.model_name}. Skipping throughput metric collection."
        )

we have both self.calculate_model_flops() and the Throughput setup in the same try/except block. I'd keep only calculate_model_flops() inside the try, but maybe there were problems with constructing Throughput and I'm just not aware of them.
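
A minimal sketch of that first option (names taken from the snippet above; this is an untested suggestion, not the actual patch):

    try:
        # Guard only the FLOPs calculation; a failure here should not
        # prevent throughput collection from being set up.
        self.calculate_model_flops()
    except Exception:
        print(f"Model Flops calculation failed for model {self.model_name}. Skipping FLOPs metric collection.")
    # Set up throughput collection unconditionally (assumes constructing
    # Throughput itself is not expected to fail).
    self.throughput = Throughput(window_size=self.max_iters - self.warmup_iters, world_size=world_size)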

Another possible fix is to check whether tokens_per_sec is present (and not None) in perf_metrics before formatting it.
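
A minimal sketch of that check (it assumes perf_metrics is a plain dict on the benchmark object, with the key and attribute names taken from the traceback above):

    # Only format the metric when it is actually available.
    tokens_per_sec = benchmark.perf_metrics.get("tokens_per_sec")
    if tokens_per_sec is not None:
        print(f"Tokens/s: {tokens_per_sec:.02f}")
    else:
        print("Tokens/s: not available (throughput collection was skipped)")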

To Reproduce

Please use:

8 node(s), each with 8 GPUs.
Image "INTERNAL_IMAGE:pjnl-20241001"

Training script:
    python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
        --model_name Mixtral-8x7B-v0.1 \
        --distributed_mode fsdp \
        --shard_mode zero3 \
        --compile eager \
        --checkpoint_activations True \
        --low_precision_mode none \
        --micro_batch_size 1

Expected behavior

We should be able to run the benchmarking script even if we are not able to print a few metrics.

Environment

system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.2.004
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.7
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.13+git4cbd7a4
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.6.0a0+gitd6d9183
libraries.pip.torchmetrics 1.4.2
libraries.pip.torchvision 0.19.0a0+d23a6e1

@tfogal tfogal added the mixology Issues that the mixology team has surfaced label Oct 11, 2024
@tfogal
Collaborator

tfogal commented Oct 11, 2024

Hey @eqy, this seems to be an eager-mode bug, not related to thunder at all.
Could you or your group take a look at this?

@mpatel31415
Contributor Author

mpatel31415 commented Oct 14, 2024

Actually, it's related to the benchmark_litgpt.py script. I know one possible fix for it, so I can prepare a PR around Wednesday, but it won't solve the missing results from the calculate_model_flops function.

@mpatel31415 mpatel31415 linked a pull request Oct 24, 2024 that will close this issue