
could not support gelu? #776

Closed
daeing opened this issue Dec 17, 2021 · 5 comments · Fixed by #845
Labels: No Activity · question (Further information is requested)

Comments

daeing commented Dec 17, 2021

I used the docker image you suggested (nvcr.io/nvidia/pytorch:21.11-py3) to test Torch-TensorRT, but I could not convert my PyTorch model to a TorchScript model. It seems GELU is not supported. However, with the pytorch:20.12-py3 docker image, the same conversion works fine.

File "/opt/conda/lib/python3.8/site-packages/torch/jit/_serialization.py", line 161, in load
cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError:
Arguments for call are not valid.
The following variants are available:

aten::gelu(Tensor self, bool approximate) -> (Tensor):
Argument approximate not provided.

aten::gelu.out(Tensor self, bool approximate, *, Tensor(a!) out) -> (Tensor(a!)):
Argument approximate not provided.

The original call is:

tools/pytorch2torchscript.py(123): pytorch2libtorch
tools/pytorch2torchscript.py(186):
Serialized File "code/torch/torch/nn/modules/activation.py", line 27
def forward(self: torch.torch.nn.modules.activation.GELU,
argument_1: Tensor) -> Tensor:
return torch.gelu(argument_1)
~~~~~~~~~~ <--- HERE
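For context, the `approximate` argument that the loader complains about selects between the exact GELU and its tanh approximation. A minimal pure-Python sketch of the two variants (standard formulas; note that in stock PyTorch releases the argument eventually landed as the string `approximate='none'`/`'tanh'` rather than a bool):

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation of GELU (what the `approximate` flag selects).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print(gelu_exact(1.0), gelu_tanh(1.0))  # the two agree to ~3 decimal places
```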

daeing added the question (Further information is requested) label on Dec 17, 2021
@naor2013

It seems the issue is that NVIDIA's PyTorch build includes the unmerged pull request pytorch/pytorch#61439. This means that if you train your model (or at least convert it to TorchScript) with "regular" (non-NVIDIA) PyTorch and then try to use it with NVIDIA's builds, it does not work.

Hopefully, NVIDIA will either release a PyTorch build that is compatible with regular PyTorch, or the author of that pull request will make sure it doesn't break compatibility with regular PyTorch.

For now, using NVIDIA's PyTorch for everything will probably solve your issue.

Hope that helps.
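As a quick sanity check before committing to one environment, you can round-trip a scripted GELU module through save/load; loading the serialized module in a different PyTorch build is exactly where the schema mismatch above surfaces. A minimal sketch (assuming `torch` is importable in both environments):

```python
import io

import torch

# Script a module that uses GELU in this environment.
scripted = torch.jit.script(torch.nn.GELU())

# Serialize to an in-memory buffer, then load it back. Performing the
# load step in a different PyTorch build (e.g. the NVIDIA container vs.
# stock PyTorch) is where the aten::gelu schema mismatch shows up.
buf = io.BytesIO()
torch.jit.save(scripted, buf)
buf.seek(0)
reloaded = torch.jit.load(buf)

x = torch.zeros(3)
print(reloaded(x))  # GELU(0) == 0
```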

daeing (Author) commented Dec 19, 2021


Many thanks, I'll try using the same docker image for both training and inference.

peri044 (Collaborator) commented Dec 19, 2021

@daeing Yes. The GELU implementation changed on master during 1.10 development, I think (a bool parameter was added and then removed). So please use the same docker container for inference. The 21.11 container has the bool parameter, but the next release, 21.12, should have gelu without the bool parameter (matching regular PyTorch).
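One way to tell which variant a given container ships is to inspect the registered schema for aten::gelu directly. A sketch using an internal `torch._C` helper, so treat it as a debugging aid rather than a stable API:

```python
import torch

# Print every schema registered for aten::gelu in this build. The 21.11
# container shows a `bool approximate` argument; stock PyTorch 1.10 has
# no such argument (later stock releases use a string instead).
schemas = torch._C._jit_get_schemas_for_operator("aten::gelu")
for schema in schemas:
    print(schema)
```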

peri044 self-assigned this on Dec 19, 2021

naor2013 commented Dec 21, 2021

@peri044 21.12 was just released, and it doesn't seem to be fixed.
It seems PyTorch isn't taken from master but from a fork with the unfinished pull request (linked in my previous comment) built in. The pull request was even updated a while ago to be backwards-compatible, but NVIDIA's PyTorch isn't using the latest iteration of it, even in 21.12.

This seems like a major issue, because since 21.09 (at least) some models (including BERT models) are not compatible with "regular" PyTorch. All the issues I've seen raised about it are only answered with "try the new release" (in another post someone suggested 21.11, and you suggested 21.12), without verifying that it is actually fixed.

Is there any way to re-release 21.12, or maybe ship a 21.12.1, to fix this? We have been running into this issue for a while now and have made many hacks to work around it, but it really breaks our automated pipeline and prevents us from releasing new models as often as we would like.

Note: I only tested this against the PyTorch code in NVIDIA's PyTorch 21.12 container. I assume NVIDIA compiles the code in this image and uses it in all their other products; if that assumption is incorrect, it may be fixed in NVIDIA's other 21.12 products, but I don't think that is the case.

@github-actions

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
