
could not support gelu? #776

Closed
daeing opened this issue Dec 17, 2021 · 5 comments · Fixed by #845
Labels: No Activity · question (Further information is requested)

Comments

daeing commented Dec 17, 2021

I used the docker image you suggested (nvcr.io/nvidia/pytorch:21.11-py3) to test Torch-TensorRT, but I could not convert my PyTorch model to a TorchScript model. It seems GELU is not supported. However, with the pytorch:20.12-py3 docker image, the same conversion works fine.

File "/opt/conda/lib/python3.8/site-packages/torch/jit/_serialization.py", line 161, in load
cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError:
Arguments for call are not valid.
The following variants are available:

aten::gelu(Tensor self, bool approximate) -> (Tensor):
Argument approximate not provided.

aten::gelu.out(Tensor self, bool approximate, *, Tensor(a!) out) -> (Tensor(a!)):
Argument approximate not provided.

The original call is:

tools/pytorch2torchscript.py(123): pytorch2libtorch
tools/pytorch2torchscript.py(186):
Serialized File "code/torch/torch/nn/modules/activation.py", line 27
def forward(self: torch.torch.nn.modules.activation.GELU,
argument_1: Tensor) -> Tensor:
return torch.gelu(argument_1)
~~~~~~~~~~ <--- HERE
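For context, the `approximate` argument that the loader complains about selects between the exact GELU and its tanh approximation. A minimal pure-Python sketch of the two variants (standard formulas; note that in stock PyTorch releases the argument eventually landed as the string `approximate='none'`/`'tanh'` rather than a bool):

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation of GELU (what the `approximate` flag selects).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print(gelu_exact(1.0), gelu_tanh(1.0))  # the two agree to ~3 decimal places
```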

daeing added the question (Further information is requested) label on Dec 17, 2021
@naor2013

It seems the issue is that NVIDIA's PyTorch build includes the unmerged pull request pytorch/pytorch#61439. This means that if you train your model (or at least convert it to TorchScript) with "regular" (non-NVIDIA) PyTorch and then try to use it with NVIDIA's builds, it does not work.

Hopefully, NVIDIA will either release a PyTorch build that is compatible with regular PyTorch, or the author of that pull request will make sure it doesn't break compatibility with regular PyTorch.

For now, using NVIDIA's PyTorch for everything will probably solve your issue.

Hope that helps.
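As a quick sanity check before committing to one environment, you can round-trip a scripted GELU module through save/load; loading the serialized module in a different PyTorch build is exactly where the schema mismatch above surfaces. A minimal sketch (assuming `torch` is importable in both environments):

```python
import io

import torch

# Script a module that uses GELU in this environment.
scripted = torch.jit.script(torch.nn.GELU())

# Serialize to an in-memory buffer, then load it back. Performing the
# load step in a different PyTorch build (e.g. the NVIDIA container vs.
# stock PyTorch) is where the aten::gelu schema mismatch shows up.
buf = io.BytesIO()
torch.jit.save(scripted, buf)
buf.seek(0)
reloaded = torch.jit.load(buf)

x = torch.zeros(3)
print(reloaded(x))  # GELU(0) == 0
```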

daeing (Author) commented Dec 19, 2021


Many thanks, I'll try using the same docker image for both training and inference.

peri044 (Collaborator) commented Dec 19, 2021

@daeing Yes. The GELU implementation changed on master during 1.10 development, I think (a bool parameter was added and then removed). So please use the same docker container for inference. The 21.11 container has the bool parameter, but the next release, 21.12, should have gelu without the bool parameter (matching regular PyTorch).
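One way to tell which variant a given container ships is to inspect the registered schema for aten::gelu directly. A sketch using an internal `torch._C` helper, so treat it as a debugging aid rather than a stable API:

```python
import torch

# Print every schema registered for aten::gelu in this build. The 21.11
# container shows a `bool approximate` argument; stock PyTorch 1.10 has
# no such argument (later stock releases use a string instead).
schemas = torch._C._jit_get_schemas_for_operator("aten::gelu")
for schema in schemas:
    print(schema)
```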

peri044 self-assigned this on Dec 19, 2021

naor2013 commented Dec 21, 2021

@peri044 21.12 was just released, and it doesn't seem to be fixed.
It seems PyTorch isn't taken from master but from a fork with the unfinished pull request (linked in my previous comment) built in. The pull request was even updated a while ago to be backwards-compatible, but NVIDIA's PyTorch isn't using the latest iteration of it, even in 21.12.

This seems like a major issue, because since 21.09 (at least) some models (including BERT models) are not compatible with "regular" PyTorch. All the issues I've seen raised about it are only answered with "try the new release" (in another post someone suggested 21.11, and you suggested 21.12), without verifying that it is actually fixed.

Is there any way to re-release 21.12, or maybe ship a 21.12.1, to fix this? We have been running into this issue for a while now and have made many hacks to work around it, but it really breaks our automated pipeline and prevents us from releasing new models as often as we would like.

Note: I only tested this against the PyTorch code in NVIDIA's PyTorch 21.12 container. I assume NVIDIA compiles the code in this image and uses it in all their other products; if that assumption is incorrect, it may be fixed in NVIDIA's other 21.12 products, but I don't think that is the case.

@github-actions

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
