Bad performance for some operation like transforming image to ndarray #1278
Comments
The operations I mentioned above are not the only ones with poor performance. For example, the YoloTranslator provided in this project is generally slower (0.027 s) than a full object-detection pipeline in Python (from OpenCV preprocessing through NMS, 0.014 s).

The code used for the test of `Image.toNDArray`:

```java
@Test
public void Image2NDArrayTest() throws Exception {
    int inpNum = 1000;
    int height = 640, width = 640;
    Scalar black = new Scalar(0, 0, 0); // all-zero pixels (black)
    Image img = ImageUtils.mat2Image(new Mat(height, width, CvType.CV_8UC3, black));
    NDManager ndManager = NDManager.newBaseManager(Device.gpu());
    System.out.printf("start (%d times)%n", inpNum);
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < inpNum; i++) {
        try (NDManager subManager = ndManager.newSubManager()) {
            img.toNDArray(subManager);
        }
    }
    long endTime = System.currentTimeMillis();
    System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
}
```

The code for the test of the translator:

```java
public <IN, OUT> void test(ZooModel<IN, OUT> model, IN inp, int inpNum, int warmupNum) throws Exception {
    try (Predictor<IN, OUT> predictor = model.newPredictor()) {
        if (warmupNum > 0) {
            System.out.printf("warming up (%d times)%n", warmupNum);
            for (int i = 0; i < warmupNum; i++) {
                predictor.predict(inp);
            }
        }
        System.out.printf("testing (%d times)%n", inpNum);
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < inpNum; i++) {
            predictor.predict(inp);
        }
        long endTime = System.currentTimeMillis();
        System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
    }
}
```

The code for the test of yolov5 in Python:

```python
def yolov5_test():
    yolov5_weight = './weights/yolov5s.torchscript.pt'
    device = 'cuda'

    imgs_num = 1000
    height, width = 640, 640
    detector = YoloV5Detector(yolov5_weight, device)
    test(detector.detect, lambda: np.zeros((height, width, 3), int), imgs_num)

def test(model, inp, inp_num, warmup_num=50):
    if warmup_num > 0:
        print(f"warming up ({warmup_num} times)")
        for _ in range(warmup_num):
            model(inp())
    print(f"testing ({inp_num} times)")
    torch.cuda.synchronize()
    start_time = time.time()

    for _ in range(inp_num):
        model(inp())

    torch.cuda.synchronize()
    end_time = time.time()
    print(f"time: {(end_time - start_time) / inp_num} s/img")
```

The implementation of `YoloV5Detector`:

```python
class YoloV5Detector:
    def __init__(self, weights, device):
        self.device = device
        self.model = torch.jit.load(weights).to(device)
        self.conf_thres = 0.35
        self.iou_thres = 0.45
        self.agnostic_nms = False
        self.max_det = 1000
        self.classes = [0]
        self.transformer = transforms.Compose([transforms.ToTensor()])
        # warm up
        _ = self.model(torch.zeros(1, 3, 640, 480).to(self.device))

    def preprocess_img(self, img):
        return self.transformer(img[:, :, ::-1].copy()).unsqueeze(0).to(self.device, dtype=torch.float32)

    def detect(self, img):
        # preprocessing
        img = self.preprocess_img(img)
        # inference
        pred = self.model(img)[0]
        # NMS
        pred = non_max_suppression(pred, self.conf_thres, self.iou_thres, self.classes, self.agnostic_nms,
                                   max_det=self.max_det)
        pred = pred[0].detach().cpu()
        return pred
```
@hongyaohongyao Thanks for reporting this issue. I will take a look at the `Image.toNDArray()` performance issue.
I ran a further test today. The problem is `java.awt.BufferedImage.getRGB()`, which alone costs more than 0.006 s.
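For comparison, here is a minimal, self-contained sketch (not DJL's code) showing why `getRGB()` is slow: it converts every pixel through the `ColorModel`, whereas for a `TYPE_3BYTE_BGR` image you can read the backing `byte[]` of the raster directly and assemble the same pixel value yourself:

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

public class PixelAccess {
    public static void main(String[] args) {
        int width = 640, height = 640;
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR);
        img.setRGB(10, 20, 0x00112233); // R=0x11, G=0x22, B=0x33

        // Slow path: getRGB() goes through the ColorModel for every pixel.
        long t0 = System.nanoTime();
        int[] argb = img.getRGB(0, 0, width, height, null, 0, width);
        long slow = System.nanoTime() - t0;

        // Fast path: for TYPE_3BYTE_BGR the raster is a plain byte[] in B,G,R order.
        t0 = System.nanoTime();
        byte[] bgr = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        long fast = System.nanoTime() - t0;

        int i = (20 * width + 10) * 3; // byte offset of pixel (10, 20)
        int fromBytes = ((bgr[i + 2] & 0xFF) << 16) | ((bgr[i + 1] & 0xFF) << 8) | (bgr[i] & 0xFF);
        System.out.println((argb[20 * width + 10] & 0xFFFFFF) == fromBytes); // prints "true"
        System.out.printf("getRGB: %d ns, raster access: %d ns%n", slow, fast);
    }
}
```

The direct raster read only works when the image type is known; a general-purpose library has to handle arbitrary `BufferedImage` types, which is part of why the generic path is expensive.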
@hongyaohongyao If BufferedImage is the bottleneck, you can consider creating your own ImageFactory.
Thanks for the reply, I get it.
Sorry, a bit curious, but how is the ImageFactory related to `imageobj.toNDArray()`? I thought it uses the NDManager only. Also, why does he need to create a new sub-manager? I thought that would add more overhead, especially inside a try-with-resources block.
|
Removing the parallel from `BufferedImageFactory.fromNDArray:111` did the trick for me. It seems the overhead of the parallel operation is bigger than the gain here.
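That matches the usual guidance for `java.util.stream`: when the per-element work is trivial (such as copying one pixel), the fork/join scheduling overhead of `.parallel()` can outweigh the speedup. A minimal, self-contained illustration (not DJL code; timings are machine-dependent):

```java
import java.util.stream.IntStream;

public class ParallelOverhead {
    public static void main(String[] args) {
        int n = 640 * 640; // roughly one image worth of pixels
        int[] src = new int[n];
        int[] dst = new int[n];
        for (int i = 0; i < n; i++) src[i] = i;

        // Sequential stream: trivial work per element.
        long t0 = System.nanoTime();
        IntStream.range(0, n).forEach(i -> dst[i] = src[i]);
        long seq = System.nanoTime() - t0;

        // Parallel stream: the same work plus fork/join overhead.
        t0 = System.nanoTime();
        IntStream.range(0, n).parallel().forEach(i -> dst[i] = src[i]);
        long par = System.nanoTime() - t0;

        // For loop bodies this cheap, the parallel version is often no faster.
        System.out.printf("sequential: %d ns, parallel: %d ns%n", seq, par);
    }
}
```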
Now we have an OpenCV extension.
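For anyone landing here later: the OpenCV extension provides an OpenCV-backed image factory in place of the `BufferedImage`-based one. As a sketch, the Maven dependency looks like the following (the version below is a placeholder; check the DJL documentation for the release matching your DJL version):

```xml
<dependency>
    <groupId>ai.djl.opencv</groupId>
    <artifactId>opencv</artifactId>
    <!-- placeholder version; use the one matching your DJL release -->
    <version>0.20.0</version>
</dependency>
```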
I tested the processing time of some models on DJL and on libtorch from Python.
I'm sure DJL matches the performance of C++ or Python if you only count the pure inference time, but some operations in DJL perform badly.
For example, `Image.toNDArray` costs nearly 0.01 s, even though the pure inference time of yolov5s is only 0.008 s; the similar operation in Python (torchvision's `to_tensor`) costs only 0.005 s.
Is there any solution to improve the performance?
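For reference, torchvision's `to_tensor` essentially converts an HWC `uint8` array into a CHW `float32` array scaled to [0, 1]. A NumPy-only sketch of that conversion with a simple timing loop (a rough harness, not the torchvision implementation; numbers will vary by machine):

```python
import time
import numpy as np

def to_tensor_like(img):
    # HWC uint8 -> CHW float32 in [0, 1], mirroring what ToTensor does
    return np.ascontiguousarray(img.transpose(2, 0, 1)).astype(np.float32) / 255.0

img = np.zeros((640, 640, 3), np.uint8)
n = 100
start = time.time()
for _ in range(n):
    out = to_tensor_like(img)
print(f"time: {(time.time() - start) / n:.6f} s/img")
print(out.shape, out.dtype)  # (3, 640, 640) float32
```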