[performance] from_pretrained is still much slower than torch.load and seems to be initializing weights #21913
Comments
Thank you for trying to analyse this, @moyix, and for wanting to make things faster. I dug into it and here is what I have to share with you.

What's happening for real

It's pretty clear from your profiler report that the diff comes from the weights init which, as you said, gets overwritten with the pretrained weights. Indeed, this is what's happening here, except you are mixing 2 things. As you discovered, lazy model init was implemented in #11471 and it was later improved upon in multiple PRs. But this was done only for transformers' own custom weight init. You're forgetting about calls like the ones in transformers/src/transformers/models/codegen/modeling_codegen.py, lines 117 to 119 (at 37e0974),
which of course by default call their init functions.
So that overhead all comes from pytorch's own module init. You're wanting to use a huge 14GB model and it surely adds some 30 sec to init it. The problem is that you're comparing loading the weights only against instantiating the model plus loading the weights, so of course they aren't the same thing. But we agree that it's a pointless waste of compute and time to init weights that are going to be overwritten moments later. To test, I changed pytorch's init for `torch.nn.Linear` to a no-op
and did the same for the other modules used here. Hint: perhaps you can use this as a hack until a better solution is provided - simply monkey patch the init functions with a no-op (I hope I covered the ones that are used here); see the sketch below.
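A sketch of that hack (the exact set of modules to patch is my guess for a CodeGen-style model - adjust it to whatever your model actually uses):

```python
import torch

def _no_init(self):
    # no-op: the pretrained checkpoint will overwrite these weights anyway
    pass

# monkey patch the per-module init functions with a no-op before building the model
torch.nn.Linear.reset_parameters = _no_init
torch.nn.LayerNorm.reset_parameters = _no_init
torch.nn.Embedding.reset_parameters = _no_init
```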
Of course, I assume you are either doing inference or you have all the weights in the distributed file - so no important init is missed. This, I think, should give you a speed much closer to `torch.load`'s.

What can be done

But why, you'd say, can't you just skip those inits? We actually are able to do so since pytorch-1.10, where special functionality was added for creating modules without running their weight init.
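I take this to be `torch.nn.utils.skip_init` and the meta device underneath it; a tiny example with placeholder sizes:

```python
import torch

# pytorch>=1.10: construct a module without running its parameter init;
# the weights are allocated but left uninitialized
layer = torch.nn.utils.skip_init(torch.nn.Linear, 4096, 4096)
```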
Looking at the requirements, it actually appears to be possible despite needing to support pytorch<1.10 as well. The modules will have to be adapted to meet 2 requirements (the ones pytorch documents for skipping init):

1. the module must accept a `device` kwarg in its constructor and pass it on to any parameters or buffers it creates;
2. the module must not perform any computation on its parameters in its constructor other than initialization (i.e. functions from `torch.nn.init`).
The first one is certainly possible and should be backward compatible, since the `device` argument can simply default to `None`.
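For instance (placeholder sizes, using `torch.nn.Linear` as the example module):

```python
import torch

# existing calls keep working: device simply defaults to None
layer = torch.nn.Linear(1024, 1024)

# new-style construction passes the target device straight through; the meta
# device allocates no storage, so no real init work is done
layer_meta = torch.nn.Linear(1024, 1024, device="meta")
```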
I think the 2nd requirement should be somewhat possible too, but I can't speak for the multitude of models we have. Once this is done, the rest of the `from_pretrained` loading logic can take advantage of it and skip the init entirely,
but of course it will only kick in for recent enough pytorch versions. I think this needs to happen sooner rather than later, as it'd greatly simplify the various juggling we have during the loading process (after updating all the models). So now let me bring in @sgugger and @patrickvonplaten to take over, as I'm currently working on a different project; they can decide whether the project is ready for this major change or not quite yet, and until then you can use my hack ;)

p.s. BTW, while studying your report I have invalidated your suggestion that there was a general regression.

edit: additional solutions are added in a comment below.
I'm curious, are you doing inference or finetuning? Because for the latter the init overhead is usually irrelevant. Fast loading is also important for debug, though. I think I'm going to propose a new feature to pytorch for this - it would just work and be really fast, without the overhead of init'ing weights that will then be overwritten by the pretrained weights.
Thanks for the very comprehensive answer! That makes perfect sense :) I am indeed doing inference and trying to get the batch size right - so having to wait a long time for the model to load on each attempt (only to get a CUDA out-of-memory error) was a bit painful. That hack helps a lot for now, thanks!
Some additional solutions coming from pytorch-slack where I asked this question:
1. torch.device can now be used as a context manager (in recent pytorch versions), so you can do:
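Something along these lines (the model name just mirrors the one from this issue):

```python
import torch
from transformers import CodeGenForCausalLM

# everything created inside the block is placed on cuda by default
with torch.device("cuda"):
    model = CodeGenForCausalLM.from_pretrained("Salesforce/codegen-6B-mono")
```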
so it instantiates the model directly on your gpu and all the inits run much faster. This solution is just a bit slower than cancelling out the init functions, plus your model will already be on gpu, so there is no copying overhead from cpu. Instead of using the context manager you can just set the default device, like so:
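For example (again, needs a recent pytorch):

```python
import torch

# same effect as the context manager, but applied globally
torch.set_default_device("cuda")
```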
and you no longer need to indent your existing code.

1b. Using materialization on the meta device:
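Roughly like this (a sketch; the config-based construction is just one way to build the model under the meta device):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Salesforce/codegen-6B-mono")

# the parameters are created on the meta device: nothing is allocated and no init runs
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)
```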
but the resulting model isn't usable right away and requires additional manipulations to materialize it on the target device with the preloaded weights. This would most likely have to be done by `from_pretrained` itself.

credits: @albanD and @stephenroller
credits: @cbalioglu
As an extension of @stas00's number one, one might enhance the context manager solution with a diversion of the
@stas00 your solution is great - I tested it a bit. Is there any timeline for this feature, and could one help with the integration? I would be interested to know the team's thoughts on integrating this feature into `from_pretrained`.
For the timeline question we need to ask @sgugger
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
…ng (#279) This applies a [newly introduced context manager](huggingface/transformers#21913 (comment)) that skips the overhead of loading models into CPU by loading them directly into the GPU.
I know this issue is closed but here is some relevant feedback: I'm also facing extremely slow performance with `from_pretrained`. TL;DR: for a chunk of users (anyone who has to use a conda environment) the packaged version of transformers is badly out of date.
Hey @tomwagstaff-opml, thanks for reporting. I believe you're using a conda channel that carries an outdated version of transformers. In our README we indicate that you should use the huggingface channel in order to install the package. Please install it as such:
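That is:

```bash
conda install -c huggingface transformers
```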
or, alternatively, use the conda-forge channel, which also has the latest version:
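Which would be:

```bash
conda install -c conda-forge transformers
```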
Thanks for your help @LysandreJik - installing transformers from the huggingface channel worked for us.
@cbalioglu the torch.device context manager does not seem to work with `from_pretrained`. This does put the model on cuda:
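For instance, a plain module created under the context manager:

```python
import torch

# a plain pytorch module created under the context manager does land on cuda
with torch.device("cuda"):
    layer = torch.nn.Linear(8, 8)

print(layer.weight.device)  # cuda:0
```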
This keeps it on CPU:
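Whereas something like this (model name is arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM

# loading through from_pretrained under the same context manager
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained("gpt2")

print(model.device)  # reportedly prints: cpu
```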
Observing similar behavior:
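A sketch of the kind of check being described (model name is arbitrary):

```python
import torch
from transformers import AutoModel

with torch.device("cuda"):
    model = AutoModel.from_pretrained("bert-base-uncased")

print(next(model.parameters()).device)
```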
OUTPUT: cpu
System Info

transformers version: 4.26.1

Who can help?

@stas00, @patrickvonplaten

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Loading a model with `from_pretrained` takes much longer than the underlying `torch.load`. For example, for the `Salesforce/codegen-6B-mono` model, `CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')` takes ~38 seconds, whereas `torch.load()` on its `pytorch_model.bin` takes just ~5.4 seconds. This is very similar to #9205, but is happening with the latest transformers from pip (4.26.1), so possibly a regression?

Short repro:
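A sketch of the `from_pretrained` timing:

```python
import time
from transformers import CodeGenForCausalLM

start = time.time()
model = CodeGenForCausalLM.from_pretrained("Salesforce/codegen-6B-mono")
print(f"Load took {time.time() - start} seconds")
```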
Prints `Load took 37.78910255432129 seconds`
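And a sketch of the bare `torch.load` timing:

```python
import time
import torch

# the path to pytorch_model.bin is an assumption - point it at the locally
# downloaded checkpoint file
start = time.time()
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
print(f"Load took {time.time() - start} seconds")
```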
Prints `Load took 5.443041801452637 seconds`
Based on profiling the HF `from_pretrained` call, it seems like ~75% of the time is being spent doing random initialization of weights that are about to be overwritten. This is the same problem that was fixed in PR #11471, so I'm not sure what's going on here.
Here's the cProfile output and output from gprof2dot:
loadmodel_profile.txt
hf_loadmodel_new.pdf
Expected behavior
`from_pretrained` should skip weight initialization when loading a pretrained model.