Replies: 6 comments 7 replies
-
@mrluin, to optimize parameter partitioning, we trace the sequence of submodule/parameter fetches during a training/inference iteration and store this fetch sequence in a trace cache. I don't think there is anything to be fixed, since the message indicates that the optimization is being disabled to preserve correctness. Are you observing any incorrectness in your model results?
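To make the mechanism concrete, here is a rough sketch of how such a fetch-order trace cache could behave. This is illustrative only; the class and method names are assumptions, not DeepSpeed's actual implementation. The first step records the order in which modules fetch parameters; later steps replay that trace, and any mismatch invalidates the cache (the condition the warning reports):

```python
class TraceCache:
    """Toy model of a fetch-order trace cache (not DeepSpeed's real code)."""

    def __init__(self):
        self.trace = []        # module ids recorded during the first step
        self.recording = True
        self.pos = 0

    def on_fetch(self, module_id):
        if self.recording:
            self.trace.append(module_id)
            return "record"
        expected = self.trace[self.pos] if self.pos < len(self.trace) else None
        if module_id != expected:
            # Analogous to "Invalidate trace cache: expected module X, but got module Y"
            self.trace.clear()
            self.recording = True
            self.pos = 0
            return "invalidate"
        self.pos += 1
        return "replay"

    def end_step(self):
        # Switch from recording to replaying for subsequent steps.
        self.recording = False
        self.pos = 0

cache = TraceCache()
for mid in (1, 2, 3):
    cache.on_fetch(mid)        # first step: record modules 1, 2, 3
cache.end_step()

print(cache.on_fetch(1))       # "replay": matches the recorded order
print(cache.on_fetch(5))       # "invalidate": order changed, cache is dropped
```

The key point is that invalidation is a correctness safeguard, not an error: prefetching is simply disabled until a stable order can be re-recorded.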
-
I am not saying there is an error in your training. All this message indicates is that your module sequence is changing across training iterations, so you can decide whether this change is intentional or not. Here are two ideas for checking ZeRO-3 correctness:
```json
"stage3_prefetch_bucket_size": 0,
"stage3_max_live_parameters": 0,
"stage3_max_reuse_distance": 0,
```
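For context, these keys live under `zero_optimization` in the DeepSpeed config; a minimal illustrative fragment is shown below (the surrounding structure is a typical setup, not a prescribed one). Setting all three to 0 disables parameter prefetching, caching of live parameters, and reuse-distance-based retention, so every parameter is fetched on demand and a changing module order cannot produce stale prefetches:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0
  }
}
```

This trades throughput for a correctness check: if results change with these settings, the trace cache (or the changing module order) is worth investigating.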
-
Yes, MoE could cause this issue, since different iterations can execute different expert modules. I don't believe we have evaluated MoE with ZeRO-3, so I am not sure whether that works correctly. However, ZeRO-2 works well with MoE.
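A tiny sketch of why MoE breaks a fixed fetch order (this assumes a simple top-1 router and is not DeepSpeed's MoE implementation): which expert runs depends on the input, so the sequence of executed submodules, and therefore of parameter fetches, can differ between steps:

```python
def top1_route(token_score, num_experts=4):
    # Pick one expert per token from its (toy) routing score.
    return int(token_score) % num_experts

# Two training steps with different inputs select different experts,
# so the executed module sequence differs and any cached trace is invalid.
step_a = [top1_route(s) for s in (0, 5, 2)]   # experts [0, 1, 2]
step_b = [top1_route(s) for s in (3, 3, 7)]   # experts [3, 3, 3]
print(step_a != step_b)                        # True: different module sequence
```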
-
@mrluin Hello. I encountered a similar warning when using DeepSpeed with PyTorch Lightning. In my project, the cause was that torchmetrics overrides comparison operators like == and !=, producing a CompositionalMetric rather than a boolean result. Consequently, the error occurs at line 169 in partitioned_param_coordinator.py. I hope my experience helps.
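The pitfall generalizes beyond torchmetrics: any object that overloads `==`/`!=` to return something other than a bool can silently break code that expects a plain comparison. A minimal self-contained illustration (deliberately not using torchmetrics itself; `LazyMetric` is a hypothetical stand-in):

```python
class LazyMetric:
    """Stand-in for a metric whose comparisons build a new object, not a bool."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Returns a composed object, NOT a bool -- mirroring how
        # torchmetrics comparisons produce a CompositionalMetric.
        return LazyMetric(self.value == getattr(other, "value", other))

m = LazyMetric(3.0)

result = m == 2.0
print(type(result).__name__)   # LazyMetric, not bool
print(bool(result))            # True -- plain objects are always truthy,
                               # even though the values differ!

# Fix: compare the underlying value explicitly.
print(m.value == 3.0)          # True
```

Because `bool(result)` is always `True`, an equality check like the one in `partitioned_param_coordinator.py` can take the wrong branch without raising any visible error.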
-
This should now work for MoE thanks to #4966 |
-
Hi all, I faced the same issue and found a solution: you should also check the loss returned by the sub-MoE layers.
-
Hello community,
During training, I get a weird message every two steps:
Invalidate trace cache @ step 13: expected module 5775, but got module 5771
Does this message affect optimization? And, by the way, how can I fix it?
Thanks in advance!