Bad performance for some operation like transforming image to ndarray #1278

Closed
hongyaohongyao opened this issue Oct 8, 2021 · 10 comments
Labels
enhancement New feature or request

Comments

@hongyaohongyao

hongyaohongyao commented Oct 8, 2021

I measured the processing time of several models with DJL and with libtorch from Python.
I'm sure DJL matches C++ or Python when only the pure inference time is counted, but some DJL operations perform poorly.
For example, Image.toNDArray takes nearly 0.01 s, while the pure inference time of yolov5s is only 0.008 s. The similar operation in Python (to_tensor from torchvision) takes only 0.005 s.
Is there any way to improve the performance?

@hongyaohongyao hongyaohongyao added the enhancement New feature or request label Oct 8, 2021
@hongyaohongyao
Author

hongyaohongyao commented Oct 8, 2021

The operations I mentioned above are not the only slow ones. For example, the YoloTranslator provided in this project is generally slower (0.027 s) than a full object detection pipeline in Python (from OpenCV to NMS, 0.014 s).

@hongyaohongyao hongyaohongyao changed the title Bad performance for some operation like transform image to ndarray Bad performance for some operation like transforming image to ndarray Oct 8, 2021
@hongyaohongyao
Author

The code used to test Image.toNDArray:

    @Test
    public void Image2NDArrayTest() throws Exception {
        int inpNum = 1000;
        int height = 640, width = 640;

        // Scalar(0, 0, 0) is black, not white. ImageUtils.mat2Image is not shown in this issue.
        Scalar black = new Scalar(0, 0, 0);
        Image img = ImageUtils.mat2Image(new Mat(height, width, CvType.CV_8UC3, black));

        // Close the base manager when the test ends so its native resources are released.
        try (NDManager ndManager = NDManager.newBaseManager(Device.gpu())) {
            System.out.printf("start(%d times)%n", inpNum);
            long startTime = System.currentTimeMillis();
            for (int i = 0; i < inpNum; i++) {
                // Each sub-manager frees the NDArray created in that iteration.
                try (NDManager subManager = ndManager.newSubManager()) {
                    img.toNDArray(subManager);
                }
            }
            long endTime = System.currentTimeMillis();
            System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
        }
    }

The code used to test the translator:

    public <IN, OUT> void test(ZooModel<IN, OUT> model, IN inp, int inpNum, int warmupNum) throws Exception {
        try (Predictor<IN, OUT> predictor = model.newPredictor()) {
            // Warm-up runs are excluded from the timed loop below.
            if (warmupNum > 0) {
                System.out.printf("warming up(%d times)%n", warmupNum);
                for (int i = 0; i < warmupNum; i++) {
                    predictor.predict(inp);
                }
            }
            System.out.printf("testing(%d times)%n", inpNum);
            long startTime = System.currentTimeMillis();
            for (int i = 0; i < inpNum; i++) {
                predictor.predict(inp);
            }
            long endTime = System.currentTimeMillis();
            System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
        }
    }

The code used to test yolov5 in Python:

    import time

    import numpy as np
    import torch


    def yolov5_test():
        yolov5_weight = './weights/yolov5s.torchscript.pt'
        device = 'cuda'
        imgs_num = 1000
        height, width = 640, 640
        detector = YoloV5Detector(yolov5_weight, device)
        # Each call builds a fresh all-zero 640x640 image as input.
        test(detector.detect, lambda: np.zeros((height, width, 3), int), imgs_num)


    def test(model, inp, inp_num, warmup_num=50):
        if warmup_num > 0:
            print(f"warming({warmup_num}times)")
            for _ in range(warmup_num):
                model(inp())
        print(f"testing({inp_num}times)")
        # Wait for pending GPU work so the timer covers only the loop below.
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(inp_num):
            model(inp())
        torch.cuda.synchronize()
        end_time = time.time()
        print(f"time: {(end_time - start_time) / inp_num} s/img")

The implementation of YoloV5Detector:

    import torch
    from torchvision import transforms

    # non_max_suppression is not shown in this issue; it must be imported separately.


    class YoloV5Detector:
        def __init__(self, weights, device):
            self.device = device
            self.model = torch.jit.load(weights).to(device)
            self.conf_thres = 0.35
            self.iou_thres = 0.45
            self.agnostic_nms = False
            self.max_det = 1000
            self.classes = [0]
            self.transformer = transforms.Compose([transforms.ToTensor()])
            # Warm up
            _ = self.model(torch.zeros(1, 3, 640, 480).to(self.device))

        def preprocess_img(self, img):
            # Reverse the channel order, convert to a tensor, add a batch dimension, move to the device.
            return self.transformer(img[:, :, ::-1].copy()).unsqueeze(0).to(self.device, dtype=torch.float32)

        def detect(self, img):
            # Preprocess
            img = self.preprocess_img(img)
            # Detect
            pred = self.model(img)[0]
            # NMS
            pred = non_max_suppression(pred, self.conf_thres, self.iou_thres, self.classes, self.agnostic_nms,
                                       max_det=self.max_det)
            pred = pred[0].detach().cpu()
            return pred

@frankfliu
Contributor

@hongyaohongyao Thanks for reporting this issue. We will take a look at the Image.toNDArray() performance issue.

@hongyaohongyao
Author

> @hongyaohongyao Thanks for reporting this issue. We will take a look at the Image.toNDArray() performance issue.

I ran a further test today. The problem is java.awt.BufferedImage.getRGB(), which costs more than 0.006 s.
I think some objects in DJL are over-encapsulated. It may be better to let programmers operate directly on NDList/NDArray or on a customized lightweight intermediate data type.
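For readers who want to check the getRGB() cost on their own images, here is a minimal pure-JDK sketch (no DJL involved) of two ways to pull RGB bytes out of a BufferedImage. The class and method names are hypothetical and are not part of DJL. The second method assumes a TYPE_3BYTE_BGR image that was created directly rather than through getSubimage(), so its backing byte array is tightly packed in B, G, R order. Bulk getRGB() generally goes through the image's ColorModel for each pixel, which is one plausible source of the overhead described above; timing both methods on your own data is the only way to know.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

final class PixelReadSketch {

    /** Baseline: bulk getRGB(), then unpack the packed ARGB ints into interleaved RGB bytes. */
    static byte[] viaGetRgb(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[] argb = img.getRGB(0, 0, w, h, null, 0, w);
        byte[] rgb = new byte[w * h * 3];
        for (int i = 0; i < argb.length; i++) {
            int p = argb[i];
            rgb[3 * i] = (byte) (p >> 16);     // red
            rgb[3 * i + 1] = (byte) (p >> 8);  // green
            rgb[3 * i + 2] = (byte) p;         // blue
        }
        return rgb;
    }

    /** Alternative: read the backing byte array directly. Assumes TYPE_3BYTE_BGR, not a sub-image. */
    static byte[] viaRaster(BufferedImage img) {
        if (img.getType() != BufferedImage.TYPE_3BYTE_BGR) {
            throw new IllegalArgumentException("this sketch expects TYPE_3BYTE_BGR");
        }
        byte[] bgr = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        byte[] rgb = new byte[bgr.length];
        for (int i = 0; i < bgr.length; i += 3) {
            rgb[i] = bgr[i + 2];      // red is stored last
            rgb[i + 1] = bgr[i + 1];  // green
            rgb[i + 2] = bgr[i];      // blue is stored first
        }
        return rgb;
    }
}
```

Both methods should return the same bytes for a qualifying image, so they can be swapped inside a timing loop like the one in the test above.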

@frankfliu
Contributor

frankfliu commented Oct 9, 2021

@hongyaohongyao If BufferedImage is the bottleneck, you can consider creating your own ImageFactory using a high-performance native implementation such as OpenCV.

@hongyaohongyao
Author

> @hongyaohongyao If BufferedImage is the bottleneck, you can consider creating your own ImageFactory using a high-performance native implementation such as OpenCV.

Thanks for the reply, I get it.

@xwaeaewcrhomesysplug

Sorry, I'm a bit curious: how is ImageFactory related to imageobj.toNDArray()? I thought it only uses NDManager. Also, why does he need to create a new sub-manager? I thought that would create more overhead, especially inside a try block.

@hongyaohongyao
Author

hongyaohongyao commented Oct 20, 2021

> Sorry, I'm a bit curious: how is ImageFactory related to imageobj.toNDArray()? I thought it only uses NDManager. Also, why does he need to create a new sub-manager? I thought that would create more overhead, especially inside a try block.

I guess:

  1. You can see the implementation of BufferedImageFactory.
  2. The sub-manager inherits the environment from the parent manager, releases NDArrays automatically, and prevents memory leaks.

@steinhae

steinhae commented Nov 2, 2021

Removing the parallel from BufferedImageFactory.fromNDArray:111 did the trick for me. It seems the overhead of the parallel operation is bigger than the gain here.
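To illustrate the kind of change described here, the following is a sketch only, not the actual BufferedImageFactory code. It packs planar R, G, B channels into ARGB ints twice: once with a plain loop and once with a parallel IntStream. The class and method names are hypothetical. The work per pixel is a few shifts and ORs, so for small images the cost of splitting and scheduling the stream may outweigh the arithmetic; whether it does on a given machine has to be measured.

```java
import java.util.stream.IntStream;

final class ParallelOverheadSketch {

    /** Combine one pixel's channels into an opaque ARGB int. */
    static int pack(byte r, byte g, byte b) {
        return 0xFF000000 | ((r & 0xFF) << 16) | ((g & 0xFF) << 8) | (b & 0xFF);
    }

    /** Sequential version: a plain indexed loop. */
    static int[] packSequential(byte[] r, byte[] g, byte[] b) {
        int[] out = new int[r.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = pack(r[i], g[i], b[i]);
        }
        return out;
    }

    /** Parallel version: same result, but the index range is split across the common fork-join pool. */
    static int[] packParallel(byte[] r, byte[] g, byte[] b) {
        int[] out = new int[r.length];
        // Each index is written by exactly one task, so the writes do not race.
        IntStream.range(0, out.length).parallel().forEach(i -> out[i] = pack(r[i], g[i], b[i]));
        return out;
    }
}
```

Both methods return identical arrays, so either can be dropped into a warm-up-then-time loop like the translator test above to compare them on a 640x640 input.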

@frankfliu
Contributor

We now have an OpenCV extension.
