
[Deepspeed] [performance] inefficient load with from_pretrained w/ zero3 #12273

Open · stas00 (Contributor) opened this issue on Jun 20, 2021 · 0 comments
Labels: DeepSpeed, WIP
stas00 commented Jun 20, 2021

🚀 Feature request

Currently, under DeepSpeed ZeRO stage 3, with from_pretrained we:

a. loop over each sub-module in zero.Init

  1. init the sub-module
  2. shard and scatter the shards

b. then, to load the pre-trained weights, we loop over each sub-module (see the sketch after this list):

  1. gather the shards
  2. load_state_dict for that one layer
  3. shard and scatter the shards

c. for any sub-module params that weren't in the pretrained state_dict:

  1. run the postponed module_init, as it was done in Pytorch - Lazy initialization of models #11471
  2. shard and scatter the shards (XXX: I actually don't think deepspeed.zero.GatheredParameters was handled here, so these params don't get ZeRO'ed; need to fix that: [Deepspeed zero3] lazy weights init #12272)
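Roughly, step (b) amounts to a per-sub-module gather/load/scatter like the following. This is only a hedged sketch, not the exact transformers code: it assumes torch.distributed is already initialized, a flat state_dict keyed by full parameter names, and uses PyTorch's internal `_load_from_state_dict`:

```python
import torch
import deepspeed

def load_pretrained_zero3(model, state_dict):
    # sketch of the current flow; function name is illustrative
    for name, module in model.named_modules():
        params = list(module.parameters(recurse=False))
        if not params:
            continue
        prefix = name + "." if name else ""
        # b.1 gather the shards: every rank temporarily holds the full params
        with deepspeed.zero.GatheredParameters(params, modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                # b.2 load_state_dict for this one sub-module only
                module._load_from_state_dict(
                    state_dict, prefix, {}, False, [], [], [])
        # b.3 on exiting the context DeepSpeed re-shards and scatters the
        #     updated params back to all ranks
```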

Because we unnecessarily do scatter/gather/scatter, this takes much longer than just doing the following (sketched after this list):

a. init the modules w/o allocating any storage, as implemented in pt-1.9.0/1.9.1: https://pytorch.org/tutorials/prototype/skip_param_init.html#implementation-details

b. for each sub-module with pretrained weights:

  1. load_state_dict
  2. shard and scatter the shards

c. for any sub-module params that weren't in the pretrained state_dict:

  1. materialize and module_init
  2. shard and scatter the shards
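A hedged sketch of this proposed flow, assuming a PyTorch recent enough to build the model on the meta device via the torch.device("meta") context manager (the pt-1.9 skip-init feature relies on the same mechanism). MyModel, config and state_dict are placeholders, and the final single shard/scatter step is exactly the piece that needs DeepSpeed support:

```python
import torch

# a. init the modules w/o allocating any storage (params land on `meta`)
with torch.device("meta"):
    model = MyModel(config)  # placeholder model class/config

for name, module in model.named_modules():
    named_params = list(module.named_parameters(recurse=False))
    if not named_params:
        continue
    prefix = name + "." if name else ""
    # materialize uninitialized storage for just this sub-module's own params
    for pname, p in named_params:
        setattr(module, pname,
                torch.nn.Parameter(torch.empty_like(p, device="cuda")))
    if all(prefix + pname in state_dict for pname, _ in named_params):
        # b.1 load the pre-trained weights for this one sub-module
        module._load_from_state_dict(state_dict, prefix, {}, False, [], [], [])
    elif hasattr(module, "reset_parameters"):
        # c.1 run the postponed module_init for params w/o pre-trained weights
        module.reset_parameters()
    # b.2 / c.2 shard and scatter this sub-module's params exactly once;
    # this single-scatter step is what needs DeepSpeed support (see below).
```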

Solving this will most likely require support from DeepSpeed (microsoft/DeepSpeed#1142), or perhaps we can simply skip zero.Init when the weights aren't materialized during model creation, so that the very first sharding is postponed to the load_state_dict stage (and to module_init for the sub-modules that don't have pre-trained weights).
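A rough sketch of that alternative, assuming zero.Init's `enabled` flag is all we need and omitting its config arguments; MyModel, config and the flag name are placeholders:

```python
import deepspeed

def create_model(config, materialize_weights_at_init: bool):
    # When the weights will come from a pretrained state_dict, skip zero.Init
    # entirely so the very first shard/scatter happens later, at the
    # load_state_dict stage (or at module_init for sub-modules without
    # pre-trained weights).
    with deepspeed.zero.Init(enabled=materialize_weights_at_init):
        return MyModel(config)  # placeholder model class/config
```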
