Inference time issue #13

SoonminHwang · 2017-10-12T19:29:19Z

Thanks for quick reply!

I tried to change compiler which supports std:regex such as gcc-4.9 or clang-3.5 & libcxx.
But the polly.py seems not to support gcc-4.9.
(I cannot find gcc-4-9 in the list when I type polly.py --help)

In the case of libcxx toolchain, I failed to build with some error messages. Here is log file.

Anyway, my first goal is to compare running time to piotr's matlab implementation.
I commented cxxopts things in acf.cpp and measure inference time using gettimeofday function.

Even though the inference time of the classifier heavily depends on the image content and casc thershold, somthings are wrong.
It takes 54ms for lena512color.png using drishti_face_gray_80x80.cpb.
(As you know, ~100ms in piotr's MATLAB code for 640x480 image)

I expect <1ms with my GPU (Titan X Pascal).

I think, I turn on the flag to use GPU.

acf/CMakeLists.txt

Line 91 in 17c2689

option(ACF_BUILD_OGLES_GPGPU "Build with OGLES_GPGPU" ON)

How about the inference time on your machine?

The text was updated successfully, but these errors were encountered:

headupinclouds · 2017-10-13T00:20:52Z

But the polly.py seems not to support gcc-4.9.

I'm using a gcc 5 toolchain in the travis tests, which is working fine:

acf/.travis.yml

Line 46 in 17c2689

env: CONFIG=Release TOOLCHAIN=gcc-5-pic-hid-sections-lto INSTALL=--strip

If you need gcc 4.9 it should be easy to create one from this recipe:

https://github.com/ruslo/polly/blob/master/gcc-5-pic-hid-sections-lto.cmake

headupinclouds · 2017-10-13T00:36:27Z

In the case of libcxx toolchain, I failed to build with some error messages. Here is log file.

I will try adding libcxx to the CI tests: #14

UPDATE: Clang 3.8 is now building fine in the travis Unbutu Trusty (14.04) image:

https://travis-ci.org/elucideye/acf/jobs/287454354ttps://travis-ci.org/elucideye/acf/jobs/287454354

@SoonminHwang ☝️

headupinclouds · 2017-10-13T01:17:49Z

It takes 54ms for lena512color.png using drishti_face_gray_80x80.cpb. (As you know, ~100ms in piotr's MATLAB code for 640x480 image). I expect <1ms with my GPU (Titan X Pascal).

TL;DR: The shader implementation is geared towards optimized feature computation on mobile GPUs. The detection itself doesn't map well to simple GLSL processing, so the features must be transfered from GPU->CPU (slow) for CPU based detection (fast). On a desktop, the full process could be executed on the GPU.

The console app doesn't currently use the OpenGL ES 2.0 shader acceleration, so I'm sure you are running a CPU only benchmark. I recently migrated this stuff from drishti for general purpose use and improvements, and it will be added to the Hunter package manger once it is cleaned up a little more. I originally needed this for mobile platforms, so OpenGL ES 2.0 was the lowest common denominator that could support both iOS and Android platforms. The main drawback with this approach is the 8 bit channel output limitation (it can be improved with 4x8 -> 32 bit packing). Caveat: Due the the above mentioned limitation, the GLSL output is currently only an approximation of the CPU floating point output, and it needs to be improved (there will be a measurable performance hit). For desktop use, it is probably better to write it in OpenCL or something higher level that doesn't have these limitations. (I recently came across Halide, which seems like an excellent path for cross platform optimization, but I currently have no experience with it.)

The GLSL code is all in this file https://github.com/elucideye/acf/blob/master/src/lib/acf/GPUACF.h, which is currently separate from the ACF detection class. To use that class, you will need to manage your own OpenGL context. It uses https://github.com/hunter-packages/ogles_gpgpu to manage a shader pipeline that computes the features. The expensive part on mobile platforms is the GPU->CPU transfer, so one frame of latency is added to the pipeline, such that ACF pyramids can be computed on the GPU for frame N ("for free"), and they are available for processing at time N+1 with no added CPU cost. In this workflow, the precomputed ACF pyramid is passed in for detection in place of the RGB image. The face detection/search on the precomputed pyramids then runs in a few milliseconds on an iPhone 7. For pedestrian detection the extra frame of latency might not be suitable. The SDK call is shown here:

acf/src/lib/acf/ACF.h

Line 392 in 17c2689

int operator()(const Pyramid& P, RectVec& objects, RealVec* scores = 0);

There is a small unit test that illustrates what the basic process would look like: 1) compute acf::Detector::Pyramid objects on the GPU and then; 2) feed them to acf::Detector:

acf/src/lib/acf/ut/test-acf.cpp

Lines 444 to 464 in 17c2689

    
           TEST_F(ACFTest, ACFDetectionGPU10) 
        
           { 
        
               acf::Detector::Pyramid Pgpu; 
        
               initGPUAndCreatePyramid(Pgpu); 
        
               ASSERT_NE(m_detector, nullptr); 
        
               ASSERT_NE(m_acf, nullptr); 
        
               std::vector<double> scores; 
        
               std::vector<cv::Rect> objects; 
        
               (*m_detector)(Pgpu, objects); 
        
           #if ACF_TEST_DISPLAY_OUTPUT 
        
               WaitKey waitKey; 
        
               cv::Mat canvas = image.clone(); 
        
               draw(canvas, objects); 
        
               cv::imshow("acf_gpu_detections", canvas); 
        
           #endif 
        
               ASSERT_GT(objects.size(), 0); // Very weak test!!! 
        
           } 
        
           #endif // defined(ACF_DO_GPU)

The above test uses the Hunter aglet package to manage the OpenGL context (just glfw for PC builds).

That test could be used for some initial benchmarks, and perhaps it could be added to the console application for additional testing. I'll try to take a look in the next few days, unless you want to try it sooner.

It would be nice to automate the GPGPU processing at the API level. Actually there was an issue for this elucideye/drishti#373 here. I'll migrate it to the new repository. A cv::UMat OpenCL interface would be cool, but I'm primarily interested in mobile platforms where this isn't really an option.

headupinclouds · 2017-10-13T15:24:01Z

As a temporary GPU benchmark, I've added a timer class that can be enabled w/ an option in the unit test. The is currently sitting in this PR: #16

See define ACF_LOG_GPU_TIME 1 here: 0b95cb7#r144582514

That will print the GPGPU pyramid compute (shaders, read, and "fill" to memory)

ACF::fill(): 512 0.00225127

As well as the detection time

 acf::Detector::operator():0.00207684

On my desktop these each take about 2 milliseconds (2+2=4 ms) with a GEFORCE GTX TITAN X. The detection time is comparable on my 2013 MacBook, but the fill operation is more like 20 ms (nearly all of that is spent in the texture read). Actually if we use the PBO reads in ogles_gpgpu in the OpenGL ES 3.0 compatibility mode (or desktop) that should be faster. I'll add an option for it.

This isn't a proper benchmark, but it can provide some info in the short term.

1: Test timeout computed to be: 9.99988e+06
1: [==========] Running 7 tests from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 7 tests from ACFTest
1: [ RUN      ] ACFTest.ACFSerializeCereal
1: [       OK ] ACFTest.ACFSerializeCereal (334 ms)
1: [ RUN      ] ACFTest.ACFDetectionCPUMat
1: [       OK ] ACFTest.ACFDetectionCPUMat (106 ms)
1: [ RUN      ] ACFTest.ACFDetectionCPUMatP
1: [       OK ] ACFTest.ACFDetectionCPUMatP (84 ms)
1: [ RUN      ] ACFTest.ACFChannelsCPU
1: [       OK ] ACFTest.ACFChannelsCPU (89 ms)
1: [ RUN      ] ACFTest.ACFPyramidCPU
1: [       OK ] ACFTest.ACFPyramidCPU (101 ms)
1: [ RUN      ] ACFTest.ACFPyramidGPU10
1: ACF::fill(): 512 0.00246045
1: [       OK ] ACFTest.ACFPyramidGPU10 (305 ms)
1: [ RUN      ] ACFTest.ACFDetectionGPU10
1: ACF::fill(): 512 0.00225127
1: acf::Detector::operator():0.00207684
1: [       OK ] ACFTest.ACFDetectionGPU10 (153 ms)
1: [----------] 7 tests from ACFTest (1173 ms total)
1:
1: [----------] Global test environment tear-down
1: [==========] 7 tests from 1 test case ran. (1173 ms total)
1: [  PASSED  ] 7 tests.
1/1 Test #1: ACFTest ..........................   Passed    1.19 sec

headupinclouds · 2017-10-16T14:07:26Z

@SoonminHwang : I hope this answers your question. I'm going to close this for now. Since one of the strong advantages of this packages is size + speed, it might make sense to add some targeted google benchmarks.

headupinclouds mentioned this issue Oct 13, 2017

Make ACF GPU (OpenGL ES) acceleration transparent #15

Closed

headupinclouds added the discussion label Oct 13, 2017

headupinclouds closed this as completed Oct 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference time issue #13

Inference time issue #13

SoonminHwang commented Oct 12, 2017

headupinclouds commented Oct 13, 2017 •

edited

Loading

headupinclouds commented Oct 13, 2017 •

edited

Loading

headupinclouds commented Oct 13, 2017 •

edited

Loading

headupinclouds commented Oct 13, 2017 •

edited

Loading

headupinclouds commented Oct 16, 2017

Inference time issue #13

Inference time issue #13

Comments

SoonminHwang commented Oct 12, 2017

headupinclouds commented Oct 13, 2017 • edited Loading

headupinclouds commented Oct 13, 2017 • edited Loading

headupinclouds commented Oct 13, 2017 • edited Loading

headupinclouds commented Oct 13, 2017 • edited Loading

headupinclouds commented Oct 16, 2017

headupinclouds commented Oct 13, 2017 •

edited

Loading

headupinclouds commented Oct 13, 2017 •

edited

Loading

headupinclouds commented Oct 13, 2017 •

edited

Loading

headupinclouds commented Oct 13, 2017 •

edited

Loading