This repository has been archived by the owner on Apr 1, 2021. It is now read-only.

No significant change in iters/sec while comparing cpu vs gpu performance #138

Open
hemantranvir opened this issue Nov 1, 2019 · 5 comments



hemantranvir commented Nov 1, 2019

I have installed torch_tvm with CUDA/OpenCL support by enabling the following options:
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L32
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L129
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L132

Trying to compare the cpu vs gpu performance by running the following test: https://github.com/pytorch/tvm/blob/master/test/benchmarks.py
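For context, the iter/s numbers that benchmarks.py reports come from a warmup phase followed by timed runs. A minimal pure-Python sketch of that measurement pattern (helper name and workload are hypothetical; the real script is at the URL above):

```python
import time

def benchmark(fn, warmup=10, runs=10):
    """Return iterations/second of a zero-arg callable.

    Mirrors the warmup-then-measure pattern the script logs as
    "Warming ... up with 10 runs" / "Running ... 10 times".
    """
    for _ in range(warmup):          # warm up caches / JIT compilation
        fn()
    start = time.perf_counter()
    for _ in range(runs):            # timed section only
        fn()
    return runs / (time.perf_counter() - start)

# Example: benchmark a trivial workload
print("%.2f iter/s" % benchmark(lambda: sum(range(10000))))
```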

  • CPU version:
$ CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.134256974191366 iter/s
TVM: 62.80919757107452 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 
  • GPU version:
    Edit line 39 of benchmarks.py to torch_tvm.enable(opt_level=3, device_type='cuda')
$ CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
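The GPU-mode change above is a single call; a sketch of the edited line, with the import guarded so this also runs where torch_tvm is not installed:

```python
# Sketch of the one-line change in benchmarks.py that switches the TVM
# backend to CUDA (per the instructions above). torch_tvm may not be
# installed in every environment, so the import is guarded here.
try:
    import torch_tvm
    torch_tvm.enable(opt_level=3, device_type='cuda')
    backend = 'cuda'
except ImportError:
    backend = None  # torch_tvm not available in this environment
print("torch_tvm backend:", backend)
```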

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.478923510188096 iter/s
TVM: 64.52328684937197 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 

As seen above, there is no significant change in iter/s between the two configurations:
CPU version: 62.80919757107452 iter/s
GPU version: 64.52328684937197 iter/s

If I check GPU memory usage with the nvidia-smi command, the GPU is idle, consistent with the nearly identical numbers above.
Is there any other configuration necessary to enable GPU backend?

(Beyond setting set(USE_CUDA ON), set(USE_CUDNN ON), and set(USE_CUBLAS ON) in https://github.com/dmlc/tvm/blob/master/cmake/config.cmake, and calling torch_tvm.enable(opt_level=3, device_type='cuda') in https://github.com/pytorch/tvm/blob/master/test/benchmarks.py.)
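Before hunting for further configuration, one sanity check is whether the TVM runtime underneath was actually built with CUDA and can see the device. A hedged sketch (tvm.gpu(0).exist and tvm.module.enabled are the TVM 0.6-era Python API; the import is guarded so this also runs where TVM is absent):

```python
# Hypothetical sanity check: was the underlying TVM build compiled with
# CUDA support, and is GPU 0 actually visible to it?
try:
    import tvm
    cuda_enabled = tvm.module.enabled("gpu")   # runtime built with CUDA?
    gpu_present = tvm.gpu(0).exist             # device 0 visible?
except ImportError:
    cuda_enabled = gpu_present = None          # TVM not installed here
print("CUDA runtime enabled:", cuda_enabled, "| GPU visible:", gpu_present)
```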


yinghai commented Nov 1, 2019

I don't think the current integration supports CUDA yet, but we have something WIP. @ilia-cher

ilia-cher (Contributor) commented

I have a local patch that adds support for CUDA; ETA to send it is next week.

hemantranvir (Author) commented

@ilia-cher Thanks for your response!
If the modification is small, and given that TVM itself already supports CUDA, could you please describe how to enable CUDA support in torch_tvm?
Excuse my hastiness; I am in a bit of a hurry.

ilia-cher (Contributor) commented

I plan to send the CUDA support PR this week.

hemantranvir (Author) commented

@ilia-cher Any updates? Thanks!
