Replies: 6 comments 7 replies
-
@mrluin, to optimize parameter partitioning, we trace the sequence of submodule/parameter fetches during a training/inference iteration and store this fetch sequence in a trace cache. I don't think there is anything to be fixed, since the message indicates that the optimization is being disabled to preserve correctness. Are you observing any incorrectness in your model results?
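To make the mechanism concrete, here is a rough sketch of how such a fetch-order trace cache could behave. This is illustrative only; the class and method names are assumptions, not DeepSpeed's actual implementation. The first step records the order in which modules fetch parameters; later steps replay that trace, and any mismatch invalidates the cache (the condition the warning reports):

```python
class TraceCache:
    """Toy model of a fetch-order trace cache (not DeepSpeed's real code)."""

    def __init__(self):
        self.trace = []        # module ids recorded during the first step
        self.recording = True
        self.pos = 0

    def on_fetch(self, module_id):
        if self.recording:
            self.trace.append(module_id)
            return "record"
        expected = self.trace[self.pos] if self.pos < len(self.trace) else None
        if module_id != expected:
            # Analogous to "Invalidate trace cache: expected module X, but got module Y"
            self.trace.clear()
            self.recording = True
            self.pos = 0
            return "invalidate"
        self.pos += 1
        return "replay"

    def end_step(self):
        # Switch from recording to replaying for subsequent steps.
        self.recording = False
        self.pos = 0

cache = TraceCache()
for mid in (1, 2, 3):
    cache.on_fetch(mid)        # first step: record modules 1, 2, 3
cache.end_step()

print(cache.on_fetch(1))       # "replay": matches the recorded order
print(cache.on_fetch(5))       # "invalidate": order changed, cache is dropped
```

The key point is that invalidation is a correctness safeguard, not an error: prefetching is simply disabled until a stable order can be re-recorded.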
-
I am not saying there is an error in your training. All this message indicates is that your module sequence is changing across training iterations, so you can decide whether this change is intentional or not. Here are two ideas for checking ZeRO-3 correctness:
```json
"stage3_prefetch_bucket_size": 0,
"stage3_max_live_parameters": 0,
"stage3_max_reuse_distance": 0,
```
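For context, these keys live under `zero_optimization` in the DeepSpeed config; a minimal illustrative fragment is shown below (the surrounding structure is a typical setup, not a prescribed one). Setting all three to 0 disables parameter prefetching, caching of live parameters, and reuse-distance-based retention, so every parameter is fetched on demand and a changing module order cannot produce stale prefetches:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 0,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0
  }
}
```

This trades throughput for a correctness check: if results change with these settings, the trace cache (or the changing module order) is worth investigating.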
-
Yes, MoE could cause this issue, since different iterations can execute different expert modules. I don't believe we have evaluated MoE with ZeRO-3, so I am not sure whether that works correctly. However, ZeRO-2 works well with MoE.
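A tiny sketch of why MoE breaks a fixed fetch order (this assumes a simple top-1 router and is not DeepSpeed's MoE implementation): which expert runs depends on the input, so the sequence of executed submodules, and therefore of parameter fetches, can differ between steps:

```python
def top1_route(token_score, num_experts=4):
    # Pick one expert per token from its (toy) routing score.
    return int(token_score) % num_experts

# Two training steps with different inputs select different experts,
# so the executed module sequence differs and any cached trace is invalid.
step_a = [top1_route(s) for s in (0, 5, 2)]   # experts [0, 1, 2]
step_b = [top1_route(s) for s in (3, 3, 7)]   # experts [3, 3, 3]
print(step_a != step_b)                        # True: different module sequence
```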
-
@mrluin Hello. I encountered a similar warning when using DeepSpeed with PyTorch Lightning. In my project, the cause was that torchmetrics overrides comparison operators like == and !=, producing a CompositionalMetric rather than a boolean result. Consequently, the error occurs at line 169 in partitioned_param_coordinator.py. I hope my experience helps.
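The pitfall generalizes beyond torchmetrics: any object that overloads `==`/`!=` to return something other than a bool can silently break code that expects a plain comparison. A minimal self-contained illustration (deliberately not using torchmetrics itself; `LazyMetric` is a hypothetical stand-in):

```python
class LazyMetric:
    """Stand-in for a metric whose comparisons build a new object, not a bool."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Returns a composed object, NOT a bool -- mirroring how
        # torchmetrics comparisons produce a CompositionalMetric.
        return LazyMetric(self.value == getattr(other, "value", other))

m = LazyMetric(3.0)

result = m == 2.0
print(type(result).__name__)   # LazyMetric, not bool
print(bool(result))            # True -- plain objects are always truthy,
                               # even though the values differ!

# Fix: compare the underlying value explicitly.
print(m.value == 3.0)          # True
```

Because `bool(result)` is always `True`, an equality check like the one in `partitioned_param_coordinator.py` can take the wrong branch without raising any visible error.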
-
This should now work for MoE thanks to #4966 |
-
Hi all, I faced the same issue and found a solution: you should also check the loss returned by the sub-MoE layers.
-
Hello community,
During training, I get a weird message every two steps:
Invalidate trace cache @ step 13: expected module 5775, but got module 5771
Does this message affect optimization? And, by the way, how can I fix it?
Thanks in advance!