❓ [Question] Running LayerNorm in fp16 #2730

Open · Tomiinek opened this issue Apr 5, 2024 · 9 comments · Fixed by #2755
Labels: question (Further information is requested)

Comments

Tomiinek commented Apr 5, 2024

❓ Question

What you have already tried

I am trying to convert a transformer model to TRT in fp16 (fp32 works fine 🙂). It includes a bunch of LayerNorms, all of which explicitly cast their inputs to fp32, i.e.:

class LayerNormFP32(nn.LayerNorm):
    def forward(self, x):
        return super().forward(x.float()).type(x.dtype)

I am getting warnings about the precision of the layers:

WARNING: [Torch-TensorRT TorchScript Conversion Context] - Detected layernorm nodes in FP16: %126 : Tensor = aten::layer_norm(%input.9, %127, %self.decoder.layers.0.attn_ln.weight.1, %370, %129, %130), scope: __module.decoder/__module.decoder.layers.0/__module.decoder.layers.0.attn_ln
...
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT encountered issues when converting weights between types and that could affect accuracy.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Check verbose logs for the list of affected weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - 2 weights are affected by this issue: Detected FP32 infinity values and converted them to corresponding FP16 infinity.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - 27 weights are affected by this issue: Detected subnormal FP16 values.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - 3 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.

I checked the dtype of the mentioned weights in the trace that I pass to torch_tensorrt.compile, and they are correctly in fp32, even though the warnings state the opposite.
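The check itself is simple; a minimal sketch, assuming trace is the ScriptModule passed to torch_tensorrt.compile:

# Minimal sketch of the dtype check on the traced module's parameters.
for name, param in trace.named_parameters():
    print(name, param.dtype)  # the LayerNorm weights report torch.float32 here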

The warning suggests two solutions (use the INormalizationLayer or force FP32 precision), but I have no idea how to achieve either of them.
This might be related: #2509 (or NVIDIA/TensorRT#3101)

Any ideas how to resolve or debug this issue?
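For context, the second suggestion (pinning individual layers to FP32) is something the raw TensorRT Python API exposes; torch_tensorrt builds the network internally, so the sketch below only illustrates what the warning is referring to and is not something that can be dropped into the compile() call. It assumes a trt.INetworkDefinition named network and a trt.IBuilderConfig named config already exist.

import tensorrt as trt

# Hedged sketch: force normalization layers to FP32 in a TensorRT network.
# Assumes `network` (trt.INetworkDefinition) and `config` (trt.IBuilderConfig)
# were obtained from a trt.Builder; torch_tensorrt creates these internally.
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type == trt.LayerType.NORMALIZATION:
        layer.precision = trt.DataType.FLOAT          # run the layer in FP32
        layer.set_output_type(0, trt.DataType.FLOAT)  # keep its output in FP32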

Environment

  • Python 3.11.8
  • torch 2.2.1
  • torch_tensorrt 2.2.0
  • A100
Tomiinek added the question label on Apr 5, 2024
Tomiinek (Author) commented Apr 5, 2024

Here is a minimal reproducible example:

import torch
import torch.nn as nn


class LayerNormFP32(nn.LayerNorm):
    def forward(self, x):
        return super().forward(x.float()).type(x.dtype)


class Model(nn.Module):
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.ln = LayerNormFP32(hidden_dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.ln(x)

    def to_jit_trace(
        self,
        device: str = "cpu",
        dtype: torch.dtype = torch.float,
        batch_size: int = 2,
    ) -> tuple[torch.jit.ScriptModule, torch.Tensor]:

        dummy_inputs = torch.randn((batch_size, self.hidden_dim), dtype=dtype, device=device)

        self.to(device)
        self.eval()

        with torch.no_grad():
            outputs1 = self(dummy_inputs)
            trace = torch.jit.trace(self, dummy_inputs, check_trace=False)
            outputs2 = trace(dummy_inputs)
        assert torch.allclose(outputs1, outputs2)

        return trace, dummy_inputs

    def to_tensorrt(
        self,
        batch_size,
        precisions: set[torch.dtype] = {
            torch.float,
            torch.half
        },
    ):
        import torch_tensorrt

        dtype = torch.float
        if torch.half in precisions:
            dtype = torch.half

        with torch.cuda.amp.autocast(enabled=True):
            trace, dummy_inputs = self.to_jit_trace("cuda", dtype, batch_size=batch_size)
        
        trt = torch_tensorrt.compile(
            trace,
            input_signature=(torch_tensorrt.Input(shape=dummy_inputs.shape, dtype=dummy_inputs.dtype),),
            enabled_precisions=precisions,
            require_full_compilation=True,
            truncate_long_and_double=True,
        )
            
        return trt

fp32 gives the same outputs, while fp16 does not (and produces the warnings above):

model = Model()
batch_size = 1

trt_16 = model.to_tensorrt(batch_size=batch_size, precisions={torch.float, torch.half})
with torch.cuda.amp.autocast(enabled=True):
    trace_fp16, dummy_inputs_16 = model.to_jit_trace("cuda", torch.half, batch_size=batch_size)

trt_32 = model.to_tensorrt(batch_size=batch_size, precisions={torch.float})
trace_fp32, dummy_inputs_32 = model.to_jit_trace("cuda", torch.float, batch_size=batch_size)

with torch.no_grad():

    # False
    # tensor(0.0020, device='cuda:0', dtype=torch.float16)
    print(torch.allclose(trace_fp16(dummy_inputs_16), trt_16(dummy_inputs_16)))
    print((trace_fp16(dummy_inputs_16) - trt_16(dummy_inputs_16)).abs().max())

    # True
    # tensor(2.9802e-08, device='cuda:0')
    print(torch.allclose(trace_fp32(dummy_inputs_32), trt_32(dummy_inputs_32)))
    print((trace_fp32(dummy_inputs_32) - trt_32(dummy_inputs_32)).abs().max())

zewenli98 (Collaborator) commented:

Hi @Tomiinek, I refactored layer norm to use the INormalizationLayer. Could you confirm whether this works for you? Thanks!

Tomiinek (Author) commented Apr 17, 2024

Hello @zewenli98, thank you!

I am having issues compiling the latest code in my environment (Python 3.11, torch 2.2), so I tried to use the wheel from the GitHub Actions run associated with the PR (this one: https://github.com/pytorch/TensorRT/actions/runs/8711801688/artifacts/1419799870), but also without success. Simply patching the file in site-packages of the latest release did not help either (i.e. the fp16 issue persists).

Is there another way to check it out or to catch it in tests?

zewenli98 (Collaborator) commented:

@Tomiinek It seems the trace you pass into torch_tensorrt.compile() has type _ModuleType.ts, which means it will be compiled with the TorchScript frontend. Can you try using the dynamo frontend instead, since dynamo is better supported? The function _get_target_fe() in TensorRT/py/torch_tensorrt/_compile.py may be helpful.
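For reference, a minimal sketch of what switching to the dynamo frontend could look like through the public API, using the Model class from the repro above. The ir="dynamo" argument, torch_tensorrt.Input, and enabled_precisions are real parameters of torch_tensorrt.compile; the shapes and dtypes here are only taken from the repro and are illustrative.

import torch
import torch_tensorrt

# Hedged sketch: pass the nn.Module itself (not a jit trace) and request the
# dynamo frontend explicitly. `Model` is the repro module posted above.
model = Model().eval().cuda()   # LayerNorm weights stay in fp32

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch_tensorrt.Input(shape=(1, 1024), dtype=torch.half)],
    enabled_precisions={torch.float, torch.half},
)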

Tomiinek (Author) commented Apr 22, 2024

Hi @zewenli98, thank you for your patience.

I tried something like:

model_ = torch.export.export(model, (dummy_inputs,))
trt = torch_tensorrt.compile(
    model_,
    input_signature=(torch_tensorrt.Input(shape=dummy_inputs.shape, dtype=dummy_inputs.dtype),),
    enabled_precisions={
        torch.float,
        torch.half
    },
    require_full_compilation=True,
    truncate_long_and_double=True,
)

but it says

ValueError: Input graph is an ExportedProgram which is not currently supported. Please provide torch.nn.Module or torch.fx.GraphModule as inputs

because I am still on 2.2.0.

So I tried to upgrade to 2.3.0dev, but I am not able to import the package:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/__init__.py", line 84, in <module>
    from torch_tensorrt._compile import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/_compile.py", line 9, in <module>
    import torch_tensorrt.ts
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/ts/__init__.py", line 1, in <module>
    from torch_tensorrt.ts._compile_spec import TensorRTCompileSpec  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/ts/_compile_spec.py", line 7, in <module>
    import torch_tensorrt._C.ts as _ts_C
ImportError: /fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/lib/libtorchtrt.so: undefined symbol: _ZN3c104cuda9GetDeviceEPi

Do you have any tips on how to install or try out the latest and greatest code or builds?
What is the preferred way of using the dynamo frontend?

These are my versions:

tensorrt==8.6.1.post1
tensorrt-bindings==8.6.1
tensorrt-libs==8.6.1
torch-tensorrt==2.3.0.dev20240110+cu121

zewenli98 (Collaborator) commented:

Hi @Tomiinek, for this error:

ImportError: /fsx_home/homes/tomiinek/prdel/lib/python3.11/site-packages/torch_tensorrt/lib/libtorchtrt.so: undefined symbol: _ZN3c104cuda9GetDeviceEPi

This is probably because you installed a mismatched libtorch version. You can replace the corresponding part in WORKSPACE with the URLs of the correct libtorch version, e.g.:

http_archive(
    name = "libtorch",
    build_file = "@//third_party/libtorch:BUILD",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/test/cu121/libtorch-cxx11-abi-shared-with-deps-2.3.0%2Bcu121.zip"],
    # urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-cxx11-abi-shared-with-deps-latest.zip"],
)

http_archive(
    name = "libtorch_pre_cxx11_abi",
    build_file = "@//third_party/libtorch:BUILD",
    strip_prefix = "libtorch",
    urls = ["https://download.pytorch.org/libtorch/test/cu121/libtorch-shared-with-deps-2.3.0%2Bcu121.zip"],
    # urls = ["https://download.pytorch.org/libtorch/nightly/cu121/libtorch-shared-with-deps-latest.zip"],
)

and then build torch-tensorrt again with:

python setup.py develop

Besides, you can try to use:

exp_program = torch_tensorrt.dynamo._tracer.trace(module, torchtrt_inputs, **kwargs)
trt_graph_module = torch_tensorrt.dynamo._compiler.compile(
    exp_program,
    inputs=torchtrt_inputs,
    enabled_precisions=enabled_precisions_set,
    **kwargs,
)
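(For what it's worth, these appear to be the same entry points that torch_tensorrt.compile(..., ir="dynamo") dispatches to in _compile.py, so going through the public compile() API with ir="dynamo" should be an equivalent way to exercise the dynamo path.)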

srdecny commented Apr 24, 2024

Hi @zewenli98, thanks for your responses! I'm trying to create a wheel for @Tomiinek to test out the fix. I'm opting for Docker, as local compilation gave me some weird errors about incompatible hashes when downloading tarballs from NVIDIA.

I've changed the libtorch sections per your suggestion, checked out your PR branch, and ran DOCKER_BUILDKIT=1 docker build --build-arg TENSORRT_VERSION=8.6 --build-arg CUDNN_VERSION=8.9 -f docker/Dockerfile -t torch_tensorrt:latest . The container builds; however, running python3 -c "import torch_tensorrt" in the container still errors out:

root@ond-g5-1gpu-dy-g5-4xlarge-16cpu-1:~/.pyenv/versions/3.10.14/lib/python3.10/site-packages/tensorrt# python3 -c "import torch_tensorrt"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/__init__.py", line 84, in <module>
    from torch_tensorrt._compile import *  # noqa: F403
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/_compile.py", line 9, in <module>
    import torch_tensorrt.ts
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/ts/__init__.py", line 1, in <module>
    from torch_tensorrt.ts._compile_spec import TensorRTCompileSpec  # noqa: F401
  File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/torch_tensorrt/ts/_compile_spec.py", line 8, in <module>
    import torch_tensorrt._C.ts as _ts_C
ImportError: /opt/python3/site-packages/torch_tensorrt/lib/libtorchtrt.so: undefined symbol: _ZN5torch3jit11parseSchemaERKSs

Perhaps it would be easier to merge the PR, and we'll then test whether the nightly wheel of Torch-TensorRT works? Compiling Torch-TensorRT locally seems to be pretty complicated.

Tomiinek (Author) commented Apr 29, 2024

Hello @zewenli98, I installed the current release with Python 3.10 so that I can at least try out dynamo.

I tried to compile a single linear layer with the torchscript frontend. In fp32 the compiled module gives correct outputs (i.e. the same as the raw module), but in fp16 it does not, which I believe is a change from the previous release, which gave correct outputs but ignored the casting in layer norms.

I tried to compile a single linear layer with dynamo in fp32. I am not getting correct outputs, and the compiled module is 3x slower than the one compiled with the torchscript frontend.

The layernorm issue persists with torchscript, and dynamo does not produce warnings but still produces weird outputs.

I am really confused. Could you please help me and provide code snippets that I could run and that also work for you? Specifically:

  • how to compile a single linear layer with torchscript
  • how to compile a single linear layer with dynamo while getting the same inference speed as with ts
  • how to compile a single linear layer with whatever, but in fp16 while getting correct outputs
  • how to compile a single layer norm with fp16 inputs and an internal fp32 cast while getting correct outputs and speedups (a sketch of this check follows below)

Or at least tell me if the code I posted above works for you with the latest release, or what I am doing wrong in there 🤷

CC: @narendasan
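For reference, a minimal sketch of the last item on the list above: compiling the LayerNormFP32-based Model from the repro with the dynamo frontend in fp16 and comparing against eager execution. The compile arguments are real torch_tensorrt parameters; the tolerance is arbitrary, and whether the outputs actually match is exactly what this issue is about.

import torch
import torch_tensorrt

# Hedged sketch: layer norm with fp16 inputs and an internal fp32 cast,
# compiled with the dynamo frontend and compared against eager execution.
# `Model` is the repro module posted earlier in this thread.
model = Model().eval().cuda()                               # weights stay in fp32
x = torch.randn(1, 1024, dtype=torch.half, device="cuda")

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[x],
    enabled_precisions={torch.float, torch.half},
)

with torch.no_grad():
    ref = model(x)       # eager: cast to fp32 inside, output cast back to fp16
    out = trt_model(x)
    print(torch.allclose(ref, out, atol=1e-3), (ref - out).abs().max())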

zewenli98 (Collaborator) commented:

@narendasan @peri044 Can you guys take a look?
