Add API to set a module as a leaf node when recursively setting Z3 hooks #4966
Conversation
@tohtana Thanks for the PR. However, when I use ZeRO-3 to train Mixtral with `set_z3_leaf_modules(model, [MixtralSparseMoeBlock])`, it hangs with an NCCL timeout during backward. Any clues?
@xs1997zju Thank you for sharing! Can you show a simple repro?
OK, here is my script:

And the DeepSpeed config is:
@xs1997zju Sorry, I missed one of your messages. I used your script but changed the config to reduce the model size:

config = AutoConfig.from_pretrained(model_id)
config.num_hidden_layers = 2
model = MixtralForCausalLM(config)

Then I ran this with 2xA100. In this case, it keeps running for around 20 steps.
Why change num_hidden_layers? What will happen if it is not changed?
@awzhgw It is for debugging purposes. I wanted to debug with fewer GPUs and launch faster, as long as the issue is still reproduced.
If you use the pretrained model, you can't.
When I add this code:

config = transformers.AutoConfig.from_pretrained(model_args.model_name_or_path)
config.num_hidden_layers = 2
model = LlavaMixtralForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    **bnb_model_from_pretrained_args
)
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

training hangs at step 270, and after waiting 30 minutes there is an NCCL timeout. My DeepSpeed version is 0.13.1, and my DeepSpeed config is:

{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"steps_per_print": 1e5,
"wall_clock_breakdown": false
}

How can I resolve this?
I have the same issue: it just hangs after step 270, then NCCL times out. The log shows `Invalidate trace cache @ step 1: expected module 25, but got module 323`. How can I resolve this?
I have the same issue. How can I resolve it? I am using DeepSpeed 0.13.1.
@awzhgw @xs1997zju I opened #5008 to address the issue. Please feel free to try it.
I tried it and hit this issue too. A trick that works around the problem is to add a special token for each expert at the beginning of input_ids. Here is my code:
I ran this script and got an error like this:

I only changed L50 and L143 for the model init. Could I ask what these errors are referring to?
I have the same issue. How can I resolve it?
I tested four scenarios:

- With the zero2 or zero2_offload training script and config.num_hidden_layers set to 2, training runs normally.
- With the zero2 training script and config.num_hidden_layers left unchanged, model loading hits an out-of-memory (OOM) error.
- With the zero3 or zero3_offload training script and config.num_hidden_layers set to 2, training hangs after 270 steps.
- With the zero3 training script and config.num_hidden_layers left unchanged, training also hangs after 270 steps.

The GPU state looks like this: power drops to 100W until the NCCL timeout kills the process. Can you help me? What should I do?
ZeRO3 sets hooks on parameters to run reduce-scatter. This is often problematic for MoE models: data-parallel processes may activate different sets of experts, but a hook fires only if its expert is activated in the forward pass, so the reduce-scatter is called on only some of the processes. This PR delays reduce-scatter for ZeRO3 leaf modules (refer to #4966) to address the issue. We no longer set reduce-scatter hooks on the parameters of leaf modules. Instead, we launch reduce-scatter on all parameters belonging to the leaf module when exiting the module during the backward pass.

Co-authored-by: Olatunji Ruwase <[email protected]>
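To make that mechanism concrete, here is a minimal, illustrative sketch of the idea; it is not DeepSpeed's actual implementation, the function name is made up, and `all_reduce` stands in for the real reduce-scatter:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def sync_leaf_module_grads_on_exit(module: nn.Module, process_group=None):
    """Illustrative only: reduce all of a leaf module's gradients when the
    backward pass exits the module, instead of hooking each parameter."""
    params = [p for p in module.parameters() if p.requires_grad]

    def _hook(mod, grad_input, grad_output):
        # Runs once per backward pass through the module. Experts that were
        # not activated on this rank have p.grad == None, so substitute
        # zeros; every rank then issues the same collectives in the same
        # order and no rank is left waiting.
        for p in params:
            grad = p.grad if p.grad is not None else torch.zeros_like(p)
            dist.all_reduce(grad, group=process_group)  # stand-in for reduce-scatter
            p.grad = grad

    # Fires after gradients w.r.t. the module's inputs are computed,
    # i.e. when the backward pass leaves the module.
    module.register_full_backward_hook(_hook)
```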
Add API to set a module as a leaf node when recursively setting Z3 hooks (microsoft#4966)

ZeRO3 does not work with MoE models because the order of executing modules can change at every forward/backward pass (microsoft#4094, microsoft#4808). This PR adds an API to stop breaking down a module for parameter fetching. The following shows an example of the usage:

```python
import torch
import deepspeed
import deepspeed.comm as dist
from transformers.deepspeed import HfDeepSpeedConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

model_id = "mistralai/Mixtral-8x7B-v0.1"
ds_config = {
    "bf16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 3,
    },
    "train_micro_batch_size_per_gpu": 1,
}
hfdsc = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
model.eval()

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module

inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=200)
output_str = tokenizer.decode(outputs[0])
if dist.get_rank() == 0:
    print(f"output: {output_str}")
```

By passing names of modules to `set_z3_leaf_modules`, the DeepSpeed engine stops breaking down the module. In this example, `MixtralSparseMoeBlock` has multiple experts as its submodules. Using `set_z3_leaf_modules`, the DeepSpeed engine fetches the parameters of all the submodules when pre-fetching the parameters of `MixtralSparseMoeBlock`.
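For intuition only, here is a rough sketch of the kind of bookkeeping such an API implies; the flag and helper names below are hypothetical and this is not DeepSpeed's actual internals:

```python
import torch.nn as nn

def mark_leaf_modules(model: nn.Module, leaf_classes):
    """Hypothetical sketch: flag every instance of the given classes so that
    a ZeRO-3-style fetcher can treat the whole subtree as one unit (gather
    all of its parameters at once) instead of hooking each child module."""
    marked = []
    for module in model.modules():
        if isinstance(module, tuple(leaf_classes)):
            module._is_leaf_for_fetching = True  # hypothetical flag name
            marked.append(module)
    return marked

def is_leaf_for_fetching(module: nn.Module) -> bool:
    # Hypothetical query helper that fetch logic could consult.
    return getattr(module, "_is_leaf_for_fetching", False)
```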
For Mixtral MoE with ZeRO-3, a simple workaround would be to enlarge the total batch size (increase the gradient accumulation steps).
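As a concrete illustration of that workaround (the numbers below are made up; adjust them to your setup), one would raise `gradient_accumulation_steps` in the DeepSpeed config while keeping the micro-batch size fixed, so each optimizer step covers more tokens and every expert is more likely to be activated on every rank:

```python
# Hypothetical values for illustration only.
world_size = 8          # number of GPUs (assumption)
micro_batch_size = 1
grad_accum_steps = 16   # raised, e.g. from 2, to enlarge the effective batch

ds_config_overrides = {
    "train_micro_batch_size_per_gpu": micro_batch_size,
    "gradient_accumulation_steps": grad_accum_steps,
    # DeepSpeed requires: train_batch_size == micro_batch * grad_accum * world_size
    "train_batch_size": micro_batch_size * grad_accum_steps * world_size,
}
```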
I think the hacks done to allow MoE to be executed are still not enough. In my use case, I am evaluating Mixtral MoE, and I see consistent behaviour that, after the warning:
I had the same problem, but I fixed it by enlarging the micro batch size from 1 to 4. How does the batch size affect it?
My colleague has confirmed that:
allows normal training. It works fine with MBS=1 or higher. @tohtana, first of all, thank you for the fix. How would users know about it?
@JinpilChoi An alternative solution is to involve all experts in the computation. For instance, consider the following modification of the expert loop in Mixtral's MoE block (experts that received no tokens still run on one dummy token, with their output multiplied by zero):

for expert_idx in range(self.num_experts):
    expert_layer = self.experts[expert_idx]
    idx, top_x = torch.where(expert_mask[expert_idx])
    if top_x.shape[0] == 0:
        # continue
        # NOTE: make all experts have gradients!
        # NOTE: important for training with a small batch size
        zero_idx, zero_top_x = torch.where(expert_mask[expert_idx] == 0)
        first_zero_idx = zero_idx[:1]
        first_zero_top_x = zero_top_x[:1]
        zero_top_x_list = first_zero_top_x.tolist()
        zero_idx_list = first_zero_idx.tolist()
        current_state = hidden_states[None, zero_top_x_list].reshape(-1, hidden_dim)
        current_hidden_states = expert_layer(current_state) * routing_weights[zero_top_x_list, zero_idx_list, None]
        # Multiply by 0 so the dummy token contributes nothing to the output,
        # while the expert's parameters still participate in the autograd graph.
        final_hidden_states.index_add_(0, first_zero_top_x, current_hidden_states.to(hidden_states.dtype) * 0.)
    else:
        # In torch it is faster to index using lists than torch tensors.
        top_x_list = top_x.tolist()
        idx_list = idx.tolist()
        # Index the correct hidden states and compute the expert hidden state for
        # the current expert. We need to make sure to multiply the output hidden
        # states by `routing_weights` on the corresponding tokens (top-1 and top-2).
        current_state = hidden_states[None, top_x_list].reshape(-1, hidden_dim)
        current_hidden_states = expert_layer(current_state) * routing_weights[top_x_list, idx_list, None]
        # However, `index_add_` only supports torch tensors for indexing, so we use
        # the `top_x` tensor here.
        final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))