Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pipline 部署 cascade 模型 http 分析了多张图像后报 Segmentation fault #1276

Closed
okooo5km opened this issue Jun 9, 2021 · 8 comments
Assignees
Labels
question Further information is requested

Comments

@okooo5km
Copy link

okooo5km commented Jun 9, 2021

如题,使用 pipline 的方式部署了 cascade 服务,使用 http 接口进行图像预测,使用多线程方式调接口,分析了多张图像之后出现段错误。

测试环境

  • CUDA 11.2
  • 显卡:RTX 3090
  • python 3.7.0
  • PaddlePaddle 2.1.0.post112
  • paddle-serving-server-gpu 0.6.0.post11
  • paddle_serving_app 0.6.0

web_service.py

import sys
import base64
import logging

import cv2
import numpy as np
from paddle_serving_app.reader import *
from paddle_serving_server.web_service import WebService, Op

NAME = "bank"

class CascadeBankOp(Op):
    def init_op(self):
        self.img_preprocess = Sequential([
            BGR2RGB(), Div(255.0),
            Normalize([0.4966, 0.4876, 0.4861], [0.0270, 0.0245, 0.0238], False),
            Resize((720, 1280)), Transpose((2, 0, 1)), PadStride(32)
        ])
        self.img_postprocess = RCNNPostprocess("label_list.txt", "output")

    def preprocess(self, input_dicts, data_id, log_id):
        (_, input_dict), = input_dicts.items()
        imgs = []
        #print("keys", input_dict.keys())
        for key in input_dict.keys():
            data = base64.b64decode(input_dict[key].encode('utf8'))
            data = np.fromstring(data, np.uint8)
            im = cv2.imdecode(data, cv2.IMREAD_COLOR)
            im = self.img_preprocess(im)
            imgs.append({
              "image": im[np.newaxis,:],
              "im_shape": np.array(list(im.shape[1:])).reshape(-1)[np.newaxis,:],
              "scale_factor": np.array([1.0, 1.0]).reshape(-1)[np.newaxis,:],
            })
        feed_dict = {
            "image": np.concatenate([x["image"] for x in imgs], axis=0),
            "im_shape": np.concatenate([x["im_shape"] for x in imgs], axis=0),
            "scale_factor": np.concatenate([x["scale_factor"] for x in imgs], axis=0)
        }
        #for key in feed_dict.keys():
        #    print(key, feed_dict[key].shape)
        return feed_dict, False, None, ""

    def postprocess(self, input_dicts, fetch_dict, log_id):
        # print(fetch_dict)
        bbox_result = self.img_postprocess(fetch_dict, visualize=False)
        bbox_result = list(filter(lambda x: x.get("score") > 0.5, bbox_result))
        res_dict = {"bbox_result": str(bbox_result)}
        return res_dict, None, ""


class CascadeBankService(WebService):
    def get_pipeline_response(self, read_op):
        cascade_bank_op = CascadeBankOp(name=NAME, input_ops=[read_op])
        return cascade_bank_op


cascade_bank_service = CascadeBankService(name=NAME)
cascade_bank_service.prepare_pipeline_config("config.yml")
cascade_bank_service.run_service()

config.yaml

dag:
  is_thread_op: false
  tracer:
    interval_s: 30
http_port: 9292
op:
  bank:
    concurrency: 4

    local_service_conf:
      client_type: local_predictor
      device_type: 1
      devices: '8'
      fetch_list:
      - save_infer_model/scale_0.tmp_1
      model_config: serving_server/
rpc_port: 9998
worker_num: 32

错误信息

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::AnalysisPredictor::ZeroCopyRun()
1   paddle::framework::NaiveExecutor::Run()
2   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
3   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
5   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CPUPlace, false, 2ul, paddle::operators::SliceKernel<paddle::platform::C
PUDeviceContext, int>, paddle::operators::SliceKernel<paddle::platform::CPUDeviceContext, long>, paddle::operators::SliceKernel<paddle::platform::CPUDeviceContext, float>, paddle::operators::SliceKernel<paddle::p
latform::CPUDeviceContext, double>, paddle::operators::SliceKernel<paddle::platform::CPUDeviceContext, paddle::platform::complex64>, paddle::operators::SliceKernel<paddle::platform::CPUDeviceContext, paddle::plat
form::complex128> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6   void paddle::operators::SliceKernel<paddle::platform::CPUDeviceContext, float>::SliceCompute<2ul>(paddle::framework::ExecutionContext const&) const
7   Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<int, 2> const, Eigen::DSizes<int, 2> const
, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, int>, 16, Eigen::MakePointer> const> const> const, Eigen::DefaultDevice, true, (Eigen::internal::TiledEvaluation)1>::run(Eigen::TensorAssignOp<Eigen::TensorMap<
Eigen::Tensor<float, 2, 1, int>, 16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<int, 2> const, Eigen::DSizes<int, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, int>, 16, Eigen::MakePo
inter> const> const> const&, Eigen::DefaultDevice const&)
8   paddle::framework::SignalHandle(char const*, int)
9   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1623228748 (unix time) try "date -d @1623228748" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x7fd13d3f1600) received by PID 17800 (TID 0x7fd2b7261740) from PID 1027544576 ***]
@github-actions
Copy link

github-actions bot commented Jun 9, 2021

Message that will be displayed on users' first issue

@okooo5km okooo5km changed the title pipline 部署 cascade 模型 http 分析了上千张图像后报 Segmentation fault pipline 部署 cascade 模型 http 分析了多张图像后报 Segmentation fault Jun 9, 2021
@TeslaZhao TeslaZhao self-assigned this Jun 9, 2021
@TeslaZhao
Copy link
Collaborator

您好,今晚我复现一下。图片是公共数据集吗?

@okooo5km
Copy link
Author

okooo5km commented Jun 9, 2021

您好,今晚我复现一下。图片是公共数据集吗?

好的,我训练的模型是自训练的,不知道paddle model zoo 中 cascade 模型是否可以复现,没测试!

@TeslaZhao
Copy link
Collaborator

您好,我跑通了Serving目录下cascade示例在Pipeline上部署,你训练的模型是在Paddle 2.1上训练的吗?从报错信息上看是slice op运行报错,你看一下PipelineServingLogs/pipeline.log.wf里是否有更多的错误信息。打印一下image名称,image_shape, scale_factor,看下是否在特定图片报错,image_shape, scale_factor是否正确?

@okooo5km
Copy link
Author

您好,我跑通了Serving目录下cascade示例在Pipeline上部署,你训练的模型是在Paddle 2.1上训练的吗?从报错信息上看是slice op运行报错,你看一下PipelineServingLogs/pipeline.log.wf里是否有更多的错误信息。打印一下image名称,image_shape, scale_factor,看下是否在特定图片报错,image_shape, scale_factor是否正确?

感谢您的回复,在您的提示下,找到了原因,是 web_service.py 中图像预处理的 Resize 参数的问题,写反了:

class CascadeBankOp(Op):
    def init_op(self):
        self.img_preprocess = Sequential([
            BGR2RGB(), Div(255.0),
            Normalize([0.4966, 0.4876, 0.4861], [0.0270, 0.0245, 0.0238], False),
            Resize((720, 1280)), Transpose((2, 0, 1)), PadStride(32)
        ])
...

改正之后:

class CascadeBankOp(Op):
    def init_op(self):
        self.img_preprocess = Sequential([
            BGR2RGB(), Div(255.0),
            Normalize([0.4966, 0.4876, 0.4861], [0.0270, 0.0245, 0.0238], False),
            Resize((1280, 720)), Transpose((2, 0, 1)), PadStride(32)
        ])

可以正常稳定运行了,再次感谢 🤙🏻🤙🏻🤙🏻

@okooo5km
Copy link
Author

okooo5km commented Jun 11, 2021

@TeslaZhao 您好,我这边又进行了批量图像分析测试,发现有一张图分析的时候仍然会导致段错误,看图像没什么异常,不知道为什么只有这一张图像有问题,单独只使用这一张图像调分析接口,返回的错误是:

{
    'err_no': 8, 
    'err_msg': "(data_id=0 log_id=0) [bank|0] Failed to postprocess: 'save_infer_model/scale_0.tmp_1.lod'", 
    'key': [], 
    'value': []
}

打印了下 python3.7/site-packages/paddle_serving_app/reader/image_reader.py 中 (第 344 行) def _get_bbox_result(self, fetch_map, fetch_name, clsid2catid) 方法中 fetch_map

{'save_infer_model/scale_0.tmp_1': array([[-1.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)}

确实是没有 'save_infer_model/scale_0.tmp_1.lod',尝试打印能正常分析图像的 fetch_map:

{'save_infer_model/scale_0.tmp_1': array([[0.0000000e+00, 9.8990679e-01, 6.6382672e+02, 4.6241437e+02,
        8.3422443e+02, 7.1902985e+02],
       [1.0000000e+00, 7.3206198e-01, 6.5045532e+02, 6.0466644e+02,
        7.2246985e+02, 7.0166632e+02]], dtype=float32), 'save_infer_model/scale_0.tmp_1.lod': array([0, 2], dtype=int32)}

@okooo5km okooo5km reopened this Jun 11, 2021
@TeslaZhao
Copy link
Collaborator

您好,从错误信息上看缺少模型结果缺少save_infer_model/scale_0.tmp_1.lod,您可以直接修改一下python3.7/site-packages/paddle_serving_app/reader/image_reader.py中_get_bbox_result的代码,判断lod存在在取值。至于为什么会出现lod不存在的情况,我们要和模型的同学反馈一下,再决定如何修改。

@TeslaZhao TeslaZhao added the question Further information is requested label Jun 16, 2021
@HuiHuiSun
Copy link

您好,从错误信息上看缺少模型结果缺少save_infer_model/scale_0.tmp_1.lod,您可以直接修改一下python3.7/site-packages/paddle_serving_app/reader/image_reader.py中_get_bbox_result的代码,判断lod存在在取值。至于为什么会出现lod不存在的情况,我们要和模型的同学反馈一下,再决定如何修改。

您好,这个问题原因找到了吗?

@paddle-bot paddle-bot bot closed this as completed Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants