Bad performance for some operation like transforming image to ndarray #1278

hongyaohongyao · 2021-10-08T15:04:26Z

I compared the processing time of some models on DJL and on libtorch from Python.
I'm sure DJL matches the performance of C++ or Python if only the pure inference time is counted. But some operations in DJL perform badly.
For example, Image.toNDArray costs nearly 0.01 s, while the pure inference time of yolov5s is only 0.008 s. The similar operation in Python (to_tensor of torchvision) costs only 0.005 s.
Is there any way to improve the performance?

hongyaohongyao · 2021-10-08T15:10:25Z

The operations I mentioned above are not the only ones with poor performance. For example, the YoloTranslator provided in this project is generally slower (0.027 s) than a full object detection pipeline in Python (from OpenCV to NMS, 0.014 s).

hongyaohongyao · 2021-10-08T15:30:16Z

The code used to test Image.toNDArray:

    @Test
    public void Image2NDArrayTest() throws Exception {
        int inpNum = 1000;
        int height = 640, width = 640;

        // Scalar(0, 0, 0) is black, so the test image is an all-black 640x640 frame
        Scalar black = new Scalar(0, 0, 0);
        Image img = ImageUtils.mat2Image(new Mat(height, width, CvType.CV_8UC3, black));

        NDManager ndManager = NDManager.newBaseManager(Device.gpu());

        System.out.printf("start(%d times)\n", inpNum);
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < inpNum; i++) {
            try (NDManager subManager = ndManager.newSubManager()) {
                img.toNDArray(subManager);
            }
        }
        long endTime = System.currentTimeMillis();
        System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
    }

The code for the translator test:

    public <IN, OUT> void test(ZooModel<IN, OUT> model, IN inp, int inpNum, int warmupNum) throws Exception {

        try (Predictor<IN, OUT> predictor = model.newPredictor()) {
            if (warmupNum > 0) {
                System.out.printf("warming up(%dtimes)\n", warmupNum);
                for (int i = 0; i < warmupNum; i++) {
                    predictor.predict(inp);
                }
            }
            System.out.printf("testing(%dtimes)\n", inpNum);
            long startTime = System.currentTimeMillis();
            for (int i = 0; i < inpNum; i++) {
                predictor.predict(inp);
            }
            long endTime = System.currentTimeMillis();
            System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
        }
    }

The code for the YOLOv5 test in Python:

    import time

    import numpy as np
    import torch


    def yolov5_test():
        yolov5_weight = './weights/yolov5s.torchscript.pt'
        device = 'cuda'
        # Benchmark settings
        imgs_num = 1000
        height, width = 640, 640
        detector = YoloV5Detector(yolov5_weight, device)
        test(detector.detect, lambda: np.zeros((height, width, 3), int), imgs_num)


    def test(model, inp, inp_num, warmup_num=50):
        if warmup_num > 0:
            print(f"warming up ({warmup_num} times)")
            for _ in range(warmup_num):
                model(inp())
        print(f"testing ({inp_num} times)")
        # Wait for pending GPU work so it is not counted in the timed region
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(inp_num):
            model(inp())
        # Wait for all queued GPU work to finish before stopping the clock
        torch.cuda.synchronize()
        end_time = time.time()
        print(f"time: {(end_time - start_time) / inp_num} s/img")

The implementation of YoloV5Detector:

    import torch
    from torchvision import transforms

    # non_max_suppression is assumed to be the NMS helper from the YOLOv5 repository


    class YoloV5Detector:
        def __init__(self, weights, device):
            self.device = device
            self.model = torch.jit.load(weights).to(device)
            self.conf_thres = 0.35
            self.iou_thres = 0.45
            self.agnostic_nms = False
            self.max_det = 1000
            self.classes = [0]
            self.transformer = transforms.Compose([transforms.ToTensor()])
            # Warm up
            _ = self.model(torch.zeros(1, 3, 640, 480).to(self.device))

        def preprocess_img(self, img):
            # Reverse the channel axis (BGR to RGB), convert to a tensor, add a batch dimension
            return self.transformer(img[:, :, ::-1].copy()).unsqueeze(0).to(self.device, dtype=torch.float32)

        def detect(self, img):
            # Preprocess
            img = self.preprocess_img(img)
            # Detect
            pred = self.model(img)[0]
            # NMS
            pred = non_max_suppression(pred, self.conf_thres, self.iou_thres, self.classes, self.agnostic_nms,
                                       max_det=self.max_det)
            pred = pred[0].detach().cpu()
            return pred

frankfliu · 2021-10-09T15:54:37Z

@hongyaohongyao Thanks for reporting this issue. We will take a look at the Image.toNDArray() performance issue.

hongyaohongyao · 2021-10-09T16:14:47Z

> @hongyaohongyao Thanks for reporting this issue. We will take a look at the Image.toNDArray() performance issue.

I ran a further test today. The problem is java.awt.BufferedImage.getRGB(), which costs more than 0.006 s.
I think some objects in DJL are over-encapsulated; it may be better to let programmers operate on NDList/NDArray or on lightweight custom intermediate data directly.
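
To make the getRGB() cost concrete, here is a minimal sketch using only the JDK. It is not DJL code: `PixelReadSketch`, `viaGetRgb`, and `viaRaster` are hypothetical names, and the direct read assumes a freshly created `TYPE_3BYTE_BGR` image (not a subimage). It compares reading pixels through `getRGB()` with reading the backing byte array of the raster directly. Timings will vary by machine and JVM, so no numbers are claimed here.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

/** Hypothetical helper, not part of DJL: two ways to read RGB bytes from a BufferedImage. */
final class PixelReadSketch {

    /** Goes through getRGB(), which converts every pixel to a packed ARGB int first. */
    static byte[] viaGetRgb(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[] argb = img.getRGB(0, 0, w, h, null, 0, w);
        byte[] rgb = new byte[w * h * 3];
        for (int i = 0; i < argb.length; i++) {
            rgb[3 * i] = (byte) (argb[i] >> 16);     // red
            rgb[3 * i + 1] = (byte) (argb[i] >> 8);  // green
            rgb[3 * i + 2] = (byte) argb[i];         // blue
        }
        return rgb;
    }

    /** Reads the backing byte array directly; assumes a non-subimage TYPE_3BYTE_BGR image. */
    static byte[] viaRaster(BufferedImage img) {
        if (img.getType() != BufferedImage.TYPE_3BYTE_BGR) {
            throw new IllegalArgumentException("expected TYPE_3BYTE_BGR");
        }
        byte[] bgr = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        byte[] rgb = new byte[bgr.length];
        for (int i = 0; i < bgr.length; i += 3) {
            rgb[i] = bgr[i + 2];     // red
            rgb[i + 1] = bgr[i + 1]; // green
            rgb[i + 2] = bgr[i];     // blue
        }
        return rgb;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(640, 640, BufferedImage.TYPE_3BYTE_BGR);
        int runs = 200;
        // Warm up so the JIT has compiled both paths before timing
        for (int i = 0; i < 50; i++) {
            viaGetRgb(img);
            viaRaster(img);
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            viaGetRgb(img);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            viaRaster(img);
        }
        long t2 = System.nanoTime();
        System.out.printf("getRGB: %.6f s/img%n", (t1 - t0) / 1e9 / runs);
        System.out.printf("raster: %.6f s/img%n", (t2 - t1) / 1e9 / runs);
    }
}
```

Both methods return the same bytes for this image type, so the sketch isolates the cost of the access path rather than any difference in output.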

frankfliu · 2021-10-09T16:30:28Z

@hongyaohongyao If BufferedImage is the bottleneck, you can consider creating your own ImageFactory using a high-performance native implementation like OpenCV.

hongyaohongyao · 2021-10-09T17:07:11Z

> @hongyaohongyao If BufferedImage is the bottleneck, you can consider creating your own ImageFactory using a high-performance native implementation like OpenCV.

Thanks for the reply, I get it.

xwaeaewcrhomesysplug · 2021-10-20T13:10:20Z

Sorry, a bit curious, but how is ImageFactory related to imageobj.toNDArray()? I thought it only uses NDManager. Also, why does he need to create a new sub-manager? I thought that would add more overhead, especially inside a try block.

hongyaohongyao · 2021-10-20T13:27:32Z

> Sorry, a bit curious, but how is ImageFactory related to imageobj.toNDArray()? I thought it only uses NDManager. Also, why does he need to create a new sub-manager? I thought that would add more overhead, especially inside a try block.

I guess:

1. You can see the implementation of BufferedImageFactory.
2. The sub-manager inherits the environment from the parent manager, releases NDArrays automatically, and prevents memory leaks.
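
The scoping idea behind the sub-manager can be sketched in plain Java. This is only an analogy, not DJL's NDManager: `ScopedManagerSketch` and its methods are hypothetical. It shows how a child scope opened in try-with-resources releases everything attached to it when the block exits, so a loop that allocates per iteration does not accumulate resources in the parent.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical analogy for a hierarchical resource manager; not DJL's NDManager. */
final class ScopedManagerSketch implements AutoCloseable {
    private final ScopedManagerSketch parent;
    private final List<AutoCloseable> resources = new ArrayList<>();
    private boolean closed;

    ScopedManagerSketch(ScopedManagerSketch parent) {
        this.parent = parent;
        if (parent != null) {
            parent.attach(this); // the parent also owns its children
        }
    }

    ScopedManagerSketch newSubManager() {
        return new ScopedManagerSketch(this);
    }

    /** Registers a resource to be closed together with this manager. */
    void attach(AutoCloseable resource) {
        resources.add(resource);
    }

    int attachedCount() {
        return resources.size();
    }

    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        // Copy first so children that detach themselves do not disturb the iteration
        List<AutoCloseable> toClose = new ArrayList<>(resources);
        resources.clear();
        for (AutoCloseable r : toClose) {
            try {
                r.close();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        if (parent != null) {
            parent.resources.remove(this); // detach from the parent once released
        }
    }

    public static void main(String[] args) {
        ScopedManagerSketch base = new ScopedManagerSketch(null);
        for (int i = 0; i < 3; i++) {
            try (ScopedManagerSketch sub = base.newSubManager()) {
                sub.attach(() -> System.out.println("released per-iteration resource"));
            }
        }
        System.out.println("still attached to base: " + base.attachedCount());
    }
}
```

Without the per-iteration scope, every resource would stay attached to the base manager until the base itself is closed, which is the leak the try block in the benchmark avoids.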

steinhae · 2021-11-02T17:58:45Z

Removing the parallel from BufferedImageFactory.fromNDArray:111 did the trick for me. It seems the overhead of the parallel operation is bigger than the gain here.
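
The trade-off can be checked with a small JDK-only sketch. This is not the DJL source: `ParallelOverheadSketch` and its methods are hypothetical, and the per-pixel work (packing three RGB bytes into one ARGB int) is only assumed to be similar in cost to what an image conversion does. When the work per element is this small, the fork/join scheduling of a parallel stream can cost more than it saves; the result depends on image size, core count, and JVM, so no timings are claimed.

```java
import java.util.stream.IntStream;

/** Hypothetical micro-benchmark: sequential loop vs. parallel stream for cheap per-pixel work. */
final class ParallelOverheadSketch {

    /** Packs the i-th RGB triple into an opaque ARGB int. */
    static int pack(byte[] rgb, int i) {
        int r = rgb[3 * i] & 0xFF;
        int g = rgb[3 * i + 1] & 0xFF;
        int b = rgb[3 * i + 2] & 0xFF;
        return 0xFF000000 | (r << 16) | (g << 8) | b;
    }

    static int[] packSequential(byte[] rgb) {
        int[] out = new int[rgb.length / 3];
        for (int i = 0; i < out.length; i++) {
            out[i] = pack(rgb, i);
        }
        return out;
    }

    static int[] packParallel(byte[] rgb) {
        int[] out = new int[rgb.length / 3];
        // Each index is written by exactly one task, so no synchronization is needed
        IntStream.range(0, out.length).parallel().forEach(i -> out[i] = pack(rgb, i));
        return out;
    }

    public static void main(String[] args) {
        byte[] rgb = new byte[640 * 640 * 3];
        int runs = 200;
        // Warm up both paths before timing
        for (int i = 0; i < 50; i++) {
            packSequential(rgb);
            packParallel(rgb);
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            packSequential(rgb);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            packParallel(rgb);
        }
        long t2 = System.nanoTime();
        System.out.printf("sequential: %.6f s/img%n", (t1 - t0) / 1e9 / runs);
        System.out.printf("parallel:   %.6f s/img%n", (t2 - t1) / 1e9 / runs);
    }
}
```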

frankfliu · 2021-11-30T17:19:44Z

We now have an OpenCV extension.

hongyaohongyao added the enhancement label Oct 8, 2021

hongyaohongyao changed the title ~~Bad performance for some operation like transform image to ndarray~~ Bad performance for some operation like transforming image to ndarray Oct 8, 2021

zachgk mentioned this issue Nov 2, 2021

Make BufferedImageFactory.fromNDArray synchronous #1339

Merged

frankfliu closed this as completed Nov 30, 2021

maaquib pushed a commit to maaquib/djl that referenced this issue Mar 8, 2024

fix the bug in the cuda compat settings (deepjavalibrary#1278)

68056dd
