[Crash][pp_liteseg] CUDNN error(9), CUDNN_STATUS_NOT_SUPPORTED #50853

Closed
engineer1109 opened this issue Feb 24, 2023 · 23 comments
Labels
PFCC Paddle Framework Contributor Club,https://github.com/PaddlePaddle/community/tree/master/pfcc status/close (closed) type/bug-report (bug report)

Comments

@engineer1109
Contributor

Describe the Bug

The model "pp_liteseg_stdc1_cityscapes_1024x512_scale0.5_160k" from PaddleSeg now crashes.
It is running on Paddle Inference in pure C++.
I am sure the model did not crash about a month ago.

Model Link
Link: https://pan.baidu.com/s/1Z6N4XW3r11ln7qoUML9qGQ?pwd=gpi3 Extraction code: gpi3

Crash point: conv2d_fusion.
phi::dynload::cudnnConvolutionBiasActivationForward
conv_fusion_kernel.cu:593

Paddle Infer Config:

    paddle_infer::Config config;
    config.SetModel(m_pdmodelPath, m_pdiparamsPath);
    config.EnableUseGpu(100, 0, paddle::AnalysisConfig::Precision::kFloat32);
    config.EnableCUDNN();
    m_predictor = paddle_infer::CreatePredictor(config);

Crash log with GLOG_v=4:

I0224 09:24:40.566617 328095 conv_fusion_kernel.cu:403] Compute ConvFusionOp with cuDNN: data_format=NCHW compute_format=NCHW
I0224 09:24:40.566874 328095 operator.cc:286] Place(gpu:0) Op(conv2d_fusion), inputs:{Bias[fuse_conv_bn/conv2d_eltwise_y_in/30:float[2]({})(Place(gpu:0))], Filter[conv2d_43.w_0:float[2, 4, 3, 3]({})(Place(gpu:0))], Input[concat_6.tmp_0:float[1, 4, 16, 32]({})(Place(gpu:0))], ResidualData[]}, outputs:{Output[relu_31.tmp_0:float[1, 2, 16, 32]({})(Place(gpu:0))]}.
I0224 09:24:40.566902 328095 naive_executor.cc:61] 140736645738496 run Op(conv2d_fusion), inputs:{Bias[fuse_conv_bn/conv2d_eltwise_y_in/31:float[1]({})(Place(gpu:0))], Filter[conv2d_44.w_0:float[1, 2, 3, 3]({})(Place(gpu:0))], Input[relu_31.tmp_0:float[1, 2, 16, 32]({})(Place(gpu:0))], ResidualData[]}, outputs:{Output[sigmoid_0.tmp_0:[0]({})()]}. on scope 0x5555943c8c30
I0224 09:24:40.566920 328095 operator.cc:219] Place(gpu:0) Op(conv2d_fusion), inputs:{Bias[fuse_conv_bn/conv2d_eltwise_y_in/31:float[1]({})(Place(gpu:0))], Filter[conv2d_44.w_0:float[1, 2, 3, 3]({})(Place(gpu:0))], Input[relu_31.tmp_0:float[1, 2, 16, 32]({})(Place(gpu:0))], ResidualData[]}, outputs:{Output[sigmoid_0.tmp_0:[0]({})()]}.
I0224 09:24:40.566931 328095 cuda_info.cc:250] SetDeviceId 0
I0224 09:24:40.566947 328095 operator.cc:2208] op type:conv2d_fusion, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[CUDNN]}
I0224 09:24:40.566962 328095 operator.cc:3162] Done inputs
I0224 09:24:40.566965 328095 operator.cc:3169] Output Outputs not found
I0224 09:24:40.566967 328095 operator.cc:3225] Done outputs
I0224 09:24:40.566973 328095 operator.cc:3466] Done attributes
I0224 09:24:40.566975 328095 operator.cc:3043] Runtime attr `is_test` is passed to GPUDNNContext.
I0224 09:24:40.566979 328095 operator.cc:3043] Runtime attr `fuse_relu_before_depthwise_conv` is passed to GPUDNNContext.
I0224 09:24:40.566983 328095 operator.cc:3043] Runtime attr `use_addto` is passed to GPUDNNContext.
I0224 09:24:40.566987 328095 operator.cc:3043] Runtime attr `workspace_size_MB` is passed to GPUDNNContext.
I0224 09:24:40.566990 328095 operator.cc:3043] Runtime attr `exhaustive_search` is passed to GPUDNNContext.
I0224 09:24:40.566992 328095 operator.cc:3516] Done runtime attributes
I0224 09:24:40.566994 328095 operator.cc:3546] Done runtime extra inputs
I0224 09:24:40.567003 328095 conv_fusion_kernel.cu:403] Compute ConvFusionOp with cuDNN: data_format=NCHW compute_format=NCHW
I0224 09:24:40.568835 328095 op_call_stack.cc:62] ExternalError: CUDNN error(9), CUDNN_STATUS_NOT_SUPPORTED. 
  [Hint: Please search for the error code(9) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /media/wjl/D2/github/fork/7/Paddle/paddle/phi/kernels/fusion/gpu/conv_fusion_kernel.cu:593)
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():  In user code:

    File "export.py", line 144, in <module>
      main(args)
    File "export.py", line 123, in main
      paddle.jit.save(new_net, save_path)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/jit.py", line 631, in wrapper
      func(layer, path, input_spec, **configs)
    File "/home/wjl/.local/lib/python3.8/site-packages/decorator.py", line 232, in fun
      return caller(func, *(extras + args), **kw)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/jit.py", line 860, in save
      concrete_program = static_func.concrete_program_specify_input_spec(
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 527, in concrete_program_specify_input_spec
      concrete_program, _ = self.get_concrete_program(
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 436, in get_concrete_program
      concrete_program, partial_program_layer = self._program_cache[cache_key]
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 801, in __getitem__
      self._caches[item_id] = self._build_once(item)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 785, in _build_once
      concrete_program = ConcreteProgram.from_func_spec(
    File "/home/wjl/.local/lib/python3.8/site-packages/decorator.py", line 232, in fun
      return caller(func, *(extras + args), **kw)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 733, in from_func_spec
      outputs = static_func(*inputs)
    File "export.py", line 74, in forward
      outs = self.net(x)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/home/wjl/github/PaddleSeg/paddleseg/models/pp_liteseg.py", line 114, in forward
      feats_head = self.ppseg_head(feats_selected)  # [..., x8, x16, x32]
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/home/wjl/github/PaddleSeg/paddleseg/models/pp_liteseg.py", line 191, in forward
      high_feat = arm(low_feat, high_feat)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/home/wjl/github/PaddleSeg/paddleseg/models/layers/tensor_fusion.py", line 75, in forward
      out = self.fuse(x, y)
    File "/home/wjl/github/PaddleSeg/paddleseg/models/layers/tensor_fusion.py", line 182, in fuse
      atten = F.sigmoid(self.conv_xy_atten(atten))
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/container.py", line 98, in forward
      input = layer(input)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/home/wjl/github/PaddleSeg/paddleseg/models/layers/layer_libs.py", line 107, in forward
      x = self._conv(x)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/nn/layer/conv.py", line 666, in forward
      out = F.conv._conv_nd(
    File "/usr/local/lib/python3.8/dist-packages/paddle/nn/functional/conv.py", line 168, in _conv_nd
      helper.append_op(
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/layer_helper.py", line 44, in append_op
      return self.main_program.current_block().append_op(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/framework.py", line 3615, in append_op
      op = Operator(
    File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/framework.py", line 2635, in __init__
      for frame in traceback.extract_stack():

    ExternalError: CUDNN error(9), CUDNN_STATUS_NOT_SUPPORTED. 
      [Hint: Please search for the error code(9) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /media/wjl/D2/github/fork/7/Paddle/paddle/phi/kernels/fusion/gpu/conv_fusion_kernel.cu:593)
      [operator < conv2d_fusion > error]

Code is on develop 605242a
System Ubuntu 20.04
GCC 9.4.0
CUDA 11.7
CUDNN 8.7.0

Additional Supplementary Information

No response

@paddle-bot

paddle-bot bot commented Feb 24, 2023

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please make sure that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also check out the API docs, FAQ, GitHub Issues, and AI community to find an answer. Have a nice day!

@paddle-bot paddle-bot bot added the PFCC Paddle Framework Contributor Club,https://github.com/PaddlePaddle/community/tree/master/pfcc label Feb 24, 2023
@engineer1109 engineer1109 changed the title from "[PaddleSeg]" to "[PaddleSeg] CUDNN error(9), CUDNN_STATUS_NOT_SUPPORTED" Feb 24, 2023
@engineer1109
Contributor Author

The model was exported from PaddleSeg, and it is a model developed in-house by Baidu.

@engineer1109 engineer1109 changed the title from "[PaddleSeg] CUDNN error(9), CUDNN_STATUS_NOT_SUPPORTED" to "[Crash][pp_liteseg] CUDNN error(9), CUDNN_STATUS_NOT_SUPPORTED" Feb 24, 2023
@SunNy820828449
Contributor

Could your local cuDNN version be unsuitable? This should not be a problem with the model.

@engineer1109
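Contributor Author

> Could your local cuDNN version be unsuitable? This should not be a problem with the model.

Of course it is not a problem with the model; the problem is in the inference library code.

An unsuitable cuDNN version is unlikely: I build directly from source, earlier commits did not have this problem, and the environment has not changed. Is cuDNN 8.7 not new enough? Other models work fine.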

@engineer1109
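Contributor Author

@SunNy820828449
Link: https://pan.baidu.com/s/19clK3-C4t6JCYmDKqKcPPg Password: cmoa
It is now certain that the recent IR optimization code has a bug.
config.SwitchIrOptim(false);
Adding this line, which turns IR optimization off, lets the model run normally. Without it, the crash above occurs.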
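The workaround above can be sketched in C++ as follows. This is a minimal sketch, not the fix that later landed: `DisableIrOptim`, `FakeConfig`, and `CheckWorkaround` are hypothetical names introduced here, and the helper is a template only so that the block compiles without Paddle headers. In a real build you would pass the `paddle_infer::Config` object from the issue description before calling `paddle_infer::CreatePredictor`.

```cpp
#include <cassert>

// Workaround reported in this thread: switch IR optimization off before the
// predictor is created. ConfigT is a template parameter only so that this
// sketch builds without Paddle; the real type would be paddle_infer::Config.
template <typename ConfigT>
void DisableIrOptim(ConfigT& config) {
  config.SwitchIrOptim(false);
}

// Hypothetical stand-in for paddle_infer::Config (NOT a Paddle type). It
// mirrors only the one call used above so the helper can be exercised here.
struct FakeConfig {
  bool ir_optim = true;  // assumption: IR optimization starts enabled
  void SwitchIrOptim(bool enabled) { ir_optim = enabled; }
};

// Small self-check: after the helper runs, IR optimization is off.
inline void CheckWorkaround() {
  FakeConfig config;
  DisableIrOptim(config);
  assert(!config.ir_optim);
}
```

Turning IR optimization off also skips the graph-level fusion passes, so inference may be slower; the thread does not measure that cost.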

@SunNy820828449
Contributor

I have already passed this problem on to the inference team.

@paddle-bot paddle-bot bot added status/following-up (following up) and removed status/new-issue (new) labels Mar 2, 2023
@2054686334

paddlepaddle-gpu==2.4.2
cuda==11.7.1
cudnn==8.8.0
python310
I hit the same error when using PP-Human.

@sunjinghua

paddlepaddle-gpu==2.4.2
cuda==11.6
cudnn==8.7
python3.7
The problem occurs here too.

@yanghebao

paddlepaddle-gpu==2.4.1
cuda==11.6
cudnn==8.8
python==3.9
The problem occurs when config.switch_ir_optim(True) is set.

@engineer1109
Contributor Author

More networks are hitting this; PicoDet is affected too. @luotao1

@engineer1109
Contributor Author

I updated to CUDA 12.1 and cuDNN 8.9, and the problem is still there.

@jimmyflycv

jimmyflycv commented Jul 12, 2023

I have two environments. The first one produces this error:
paddle-bfloat 0.1.7
paddleocr 2.6.1.3
paddlepaddle-gpu 2.3.2.post112
cuda 11.3
cudnn 8.9

The second one does not produce this error:
paddle-bfloat 0.1.7
paddleocr 2.6.1.3
paddlepaddle-gpu 2.3.2.post112
cuda 11.3
cudnn 8.2

@engineer1109
Contributor Author

@jzhang533 Can someone look into this problem? It has been sitting for half a year.

@jzhang533
Contributor

I'll see whether I can find someone.

@yuanlehome
Contributor

yuanlehome commented Jul 13, 2023

I am following up and will see whether I can reproduce and fix it.

@yuanlehome
Contributor

yuanlehome commented Jul 13, 2023

So far this looks like a bug in cuDNN >= 8.7; 8.6 and below are fine. The fix, PR #55407, will be merged into the 2.5 branch shortly.
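The version boundary stated above can be written as a small predicate, for example to decide at startup whether a workaround is needed. This is a sketch under assumptions: `IsAffectedCudnn` is a hypothetical helper, not part of Paddle or cuDNN, and treating every later version (including later major versions) as affected is only a literal reading of ">= 8.7". The major and minor numbers have to come from your own environment, for instance the `CUDNN_MAJOR` and `CUDNN_MINOR` macros in the cuDNN headers you build against.

```cpp
// Returns true for cuDNN versions this thread reports as affected (>= 8.7).
// Hypothetical helper; the caller supplies the version numbers.
constexpr bool IsAffectedCudnn(int major, int minor) {
  return major > 8 || (major == 8 && minor >= 7);
}

// Versions mentioned in the thread, checked at compile time.
static_assert(IsAffectedCudnn(8, 7), "8.7 is reported as crashing");
static_assert(IsAffectedCudnn(8, 9), "8.9 is reported as crashing");
static_assert(!IsAffectedCudnn(8, 6), "8.6 is reported as fine");
static_assert(!IsAffectedCudnn(8, 2), "8.2 is reported as fine");
```

Passing major and minor separately avoids depending on how a particular cuDNN release packs its version into a single integer.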

@engineer1109
Contributor Author

Thanks for the fix.

@liangbaikaizzzZZZ

paddlepaddle-gpu==2.5.1
cuda==11.6
cudnn==8.7
python3.8
The problem still occurs.

@engineer1109
Contributor Author

@liangbaikaizzzZZZ The release is not stable; try develop.

@xiemeilong

xiemeilong commented Sep 27, 2023

Same problem:
paddlepaddle-gpu==2.5.1
cuda==12.2
cudnn==8.9
python3.10

@engineer1109
Contributor Author

@xiemeilong As already said, the release is not stable; use develop.

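@WangShengFeng1

> So far this looks like a bug in cuDNN >= 8.7; 8.6 and below are fine. The fix, PR #55407, will be merged into the 2.5 branch shortly.

I am running on the AI Studio platform. How can I downgrade the cuDNN version there?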

@Ultraman6

So how exactly was this resolved? Using Paddle 2.5 or later brings another bug: it cannot run alongside PyTorch.
