[performance] from_pretrained is still much slower than torch.load and seems to be initializing weights #21913
Comments
Thank you for trying to analyse this, @moyix, and for wanting to make things faster. I dug into it and here is what I have to share with you.

What's happening for real

It's pretty clear from your profiler report that the diff comes from the weights init which, as you said, gets overwritten with the pretrained weights. Indeed, this is what's happening here, except you are mixing 2 things. As you discovered, lazy model init was implemented in #11471 and it was later improved upon in multiple PRs. But this was done only for transformers' own custom weight init. You're forgetting about calls like the ones in transformers/src/transformers/models/codegen/modeling_codegen.py, lines 117 to 119 (at 37e0974),
which of course by default call their init functions.
So that overhead all comes from pytorch's own module init. You're wanting to use a huge 14GB model and it surely adds some 30 sec to init it. The problem is that you're comparing loading the weights only against instantiating the model plus loading the weights, so of course they aren't the same thing. But we agree that it's a pointless waste of compute and time to init weights that are going to be overwritten moments later. To test, I changed pytorch's init for `torch.nn.Linear` to a no-op
and did the same for the other modules used here. Hint: perhaps you can use this as a hack until a better solution is provided - simply monkey patch the init functions with a no-op (I hope I covered the ones that are used here); see the sketch below.
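A sketch of that hack (the exact set of modules to patch is my guess for a CodeGen-style model - adjust it to whatever your model actually uses):

```python
import torch

def _no_init(self):
    # no-op: the pretrained checkpoint will overwrite these weights anyway
    pass

# monkey patch the per-module init functions with a no-op before building the model
torch.nn.Linear.reset_parameters = _no_init
torch.nn.LayerNorm.reset_parameters = _no_init
torch.nn.Embedding.reset_parameters = _no_init
```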
Of course, I assume you are either doing inference or you have all the weights in the distributed file - so no important init is missed. This, I think, should give you a speed much closer to `torch.load`'s.

What can be done

But why, you'd say, can't you just skip those inits? We actually are able to do so since pytorch-1.10, where special functionality was added for creating modules without running their weight init.
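I take this to be `torch.nn.utils.skip_init` and the meta device underneath it; a tiny example with placeholder sizes:

```python
import torch

# pytorch>=1.10: construct a module without running its parameter init;
# the weights are allocated but left uninitialized
layer = torch.nn.utils.skip_init(torch.nn.Linear, 4096, 4096)
```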
Looking at the requirements, it actually appears to be possible despite needing to support pytorch<1.10 as well. The modules will have to be adapted to meet 2 requirements (the ones pytorch documents for skipping init):

1. the module must accept a `device` kwarg in its constructor and pass it on to any parameters or buffers it creates;
2. the module must not perform any computation on its parameters in its constructor other than initialization (i.e. functions from `torch.nn.init`).
The first one is certainly possible and should be backward compatible, since the `device` argument can simply default to `None`.
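For instance (placeholder sizes, using `torch.nn.Linear` as the example module):

```python
import torch

# existing calls keep working: device simply defaults to None
layer = torch.nn.Linear(1024, 1024)

# new-style construction passes the target device straight through; the meta
# device allocates no storage, so no real init work is done
layer_meta = torch.nn.Linear(1024, 1024, device="meta")
```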
I think the 2nd requirement should be somewhat possible too, but I can't speak for the multitude of models we have. Once this is done, the rest of the `from_pretrained` loading logic can take advantage of it and skip the init entirely,
but of course it will only kick in for recent enough pytorch versions. I think this needs to happen sooner rather than later, as it'd greatly simplify the various juggling we have during the loading process (after updating all the models). So now let me bring in @sgugger and @patrickvonplaten to take over, as I'm currently working on a different project; they can decide whether the project is ready for this major change or not quite yet, and until then you can use my hack ;)

p.s. BTW, while studying your report I have invalidated your suggestion that there was a general regression.

edit: additional solutions are added in a comment below.
I'm curious, are you doing inference or finetuning? Because for the latter the init overhead is usually irrelevant. Fast loading is also important for debug, though. I think I'm going to propose a new feature to pytorch for this - it would just work and be really fast, without the overhead of init'ing weights that will then be overwritten by the pretrained weights.
Thanks for the very comprehensive answer! That makes perfect sense :) I am indeed doing inference and trying to get the batch size right - so having to wait a long time for the model to load on each attempt (only to get a CUDA out-of-memory error) was a bit painful. That hack helps a lot for now, thanks!
Some additional solutions coming from pytorch-slack where I asked this question:
1. torch.device can now be used as a context manager (in recent pytorch versions), so you can do:
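Something along these lines (the model name just mirrors the one from this issue):

```python
import torch
from transformers import CodeGenForCausalLM

# everything created inside the block is placed on cuda by default
with torch.device("cuda"):
    model = CodeGenForCausalLM.from_pretrained("Salesforce/codegen-6B-mono")
```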
so it instantiates the model directly on your gpu and all the inits run much faster. This solution is just a bit slower than cancelling out the init functions, plus your model will already be on gpu, so there is no copying overhead from cpu. Instead of using the context manager you can just set the default device, like so:
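For example (again, needs a recent pytorch):

```python
import torch

# same effect as the context manager, but applied globally
torch.set_default_device("cuda")
```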
and you no longer need to indent your existing code.

1b. Using materialization on the meta device:
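Roughly like this (a sketch; the config-based construction is just one way to build the model under the meta device):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Salesforce/codegen-6B-mono")

# the parameters are created on the meta device: nothing is allocated and no init runs
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)
```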
but the resulting model isn't usable right away and requires additional manipulations to materialize it on the target device with the preloaded weights. This would most likely have to be done by `from_pretrained` itself.

credits: @albanD and @stephenroller
credits: @cbalioglu
As an extension of @stas00's number one, one might enhance the context manager solution with a diversion of the
@stas00 your solution is great - I tested it a bit. Is there any timeline for this feature, and could one help with the integration? I would be interested to know the team's thoughts on integrating this feature into `from_pretrained`.
For the timeline question we need to ask @sgugger
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
…ng (#279) This applies a [newly introduced context manager](huggingface/transformers#21913 (comment)) that skips the overhead of loading models into CPU by loading them directly into the GPU.
I know this issue is closed but here is some relevant feedback: I'm also facing extremely slow performance with `from_pretrained`. TL;DR: for a chunk of users (anyone who has to use a conda environment) the packaged version of transformers is badly out of date.
Hey @tomwagstaff-opml, thanks for reporting. I believe you're using a conda channel that carries an outdated version of transformers. In our README we indicate that you should use the huggingface channel in order to install the package. Please install it as such:
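That is:

```bash
conda install -c huggingface transformers
```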
or, alternatively, use the conda-forge channel, which also has the latest version:
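Which would be:

```bash
conda install -c conda-forge transformers
```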
Thanks for your help @LysandreJik - installing transformers from the huggingface channel worked for us.
@cbalioglu the torch.device context manager does not seem to work with `from_pretrained`. This does put the model on cuda:
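For instance, a plain module created under the context manager:

```python
import torch

# a plain pytorch module created under the context manager does land on cuda
with torch.device("cuda"):
    layer = torch.nn.Linear(8, 8)

print(layer.weight.device)  # cuda:0
```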
This keeps it on CPU:
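Whereas something like this (model name is arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM

# loading through from_pretrained under the same context manager
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained("gpt2")

print(model.device)  # reportedly prints: cpu
```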
Observing similar behavior:
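A sketch of the kind of check being described (model name is arbitrary):

```python
import torch
from transformers import AutoModel

with torch.device("cuda"):
    model = AutoModel.from_pretrained("bert-base-uncased")

print(next(model.parameters()).device)
```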
OUTPUT: cpu
System Info

transformers version: 4.26.1

Who can help?

@stas00, @patrickvonplaten

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Loading a model with `from_pretrained` takes much longer than the underlying `torch.load`. For example, for the `Salesforce/codegen-6B-mono` model, `CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')` takes ~38 seconds, whereas `torch.load()` on its `pytorch_model.bin` takes just ~5.4 seconds. This is very similar to #9205, but is happening with the latest transformers from pip (4.26.1), so possibly a regression?

Short repro:
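A sketch of the `from_pretrained` timing:

```python
import time
from transformers import CodeGenForCausalLM

start = time.time()
model = CodeGenForCausalLM.from_pretrained("Salesforce/codegen-6B-mono")
print(f"Load took {time.time() - start} seconds")
```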
Prints `Load took 37.78910255432129 seconds`
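And a sketch of the bare `torch.load` timing:

```python
import time
import torch

# the path to pytorch_model.bin is an assumption - point it at the locally
# downloaded checkpoint file
start = time.time()
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
print(f"Load took {time.time() - start} seconds")
```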
Prints `Load took 5.443041801452637 seconds`
Based on profiling the HF `from_pretrained` call, it seems like ~75% of the time is being spent doing random initialization of weights that are about to be overwritten. This is the same problem that was fixed in PR #11471, so I'm not sure what's going on here.
Here's the cProfile output and output from gprof2dot:
loadmodel_profile.txt
hf_loadmodel_new.pdf
Expected behavior
`from_pretrained` should skip weight initialization when loading a pretrained model.