fix: estimate_mfu dt ZeroDivisionError #446

HildaM · 2024-03-02T03:06:22Z

The current estimate_mfu function can raise ZeroDivisionError

At line 301 of model.py, flops_achieved = flops_per_iter * (1.0/dt) raises ZeroDivisionError when dt is zero. This happens when the interval between two consecutive calls to time.time() is smaller than the clock can resolve, so both calls return the same value and their difference is exactly 0.0.
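One way to avoid the division is to guard against a non-positive dt before computing throughput. The snippet below is a minimal sketch of that idea, not necessarily the change made in this PR: safe_flops_achieved is a hypothetical helper name, the 1e12 FLOPs figure is a placeholder, and returning 0.0 for an unmeasurable interval is an assumption about the desired behavior. It also times the step with time.perf_counter(), which Python documents as the highest-resolution clock available for measuring short durations.

```python
import time


def safe_flops_achieved(flops_per_iter: float, dt: float) -> float:
    """Return FLOPs per second, or 0.0 when dt is not a positive duration.

    Hypothetical helper for illustration; it is not part of model.py.
    """
    if dt <= 0.0:
        # The clock could not resolve the interval, so throughput is unknown.
        return 0.0
    return flops_per_iter / dt


# Time one step with a high-resolution, monotonic clock.
t0 = time.perf_counter()
# ... one training iteration would run here ...
dt = time.perf_counter() - t0

# Placeholder FLOPs count; the real value comes from the model configuration.
print(safe_flops_achieved(1e12, dt))
```

With this guard, an iteration whose measured dt is 0.0 reports zero throughput for that step instead of aborting the training run.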

Reproducing the problem

iter 800: loss 1.4306, time 20.79ms, mfu 18.58%
iter 810: loss 1.4020, time 31.59ms, mfu 17.90%
iter 820: loss 1.4028, time 15.12ms, mfu 18.58%
iter 830: loss 1.3907, time 17.64ms, mfu 18.83%
Traceback (most recent call last):
  File "D:\Coding\AILearning\LLM\LLM_Learning\nanoGPT\train.py", line 325, in <module>
    mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Coding\AILearning\LLM\LLM_Learning\nanoGPT\model.py", line 302, in estimate_mfu
    flops_achieved = flops_per_iter * (1.0/dt) # per second
                                       ~~~^~~
ZeroDivisionError: float division by zero

I am training on a single 4090 card, and the ZeroDivisionError occurs every time I start the training code.

fix: estimate_mfu dt ZeroDivisionError

2732405
