This repository has been archived by the owner on Apr 1, 2021. It is now read-only.

No significant change in iters/sec while comparing cpu vs gpu performance #138

Open
hemantranvir opened this issue Nov 1, 2019 · 5 comments



hemantranvir commented Nov 1, 2019

I have installed torch_tvm with CUDA/OpenCL support by enabling the following options:
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L32
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L129
https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L132

Trying to compare the cpu vs gpu performance by running the following test: https://github.com/pytorch/tvm/blob/master/test/benchmarks.py
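For context, the iter/s numbers that benchmarks.py reports come from a warmup phase followed by timed runs. A minimal pure-Python sketch of that measurement pattern (helper name and workload are hypothetical; the real script is at the URL above):

```python
import time

def benchmark(fn, warmup=10, runs=10):
    """Return iterations/second of a zero-arg callable.

    Mirrors the warmup-then-measure pattern the script logs as
    "Warming ... up with 10 runs" / "Running ... 10 times".
    """
    for _ in range(warmup):          # warm up caches / JIT compilation
        fn()
    start = time.perf_counter()
    for _ in range(runs):            # timed section only
        fn()
    return runs / (time.perf_counter() - start)

# Example: benchmark a trivial workload
print("%.2f iter/s" % benchmark(lambda: sum(range(10000))))
```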

  • CPU version:
$ CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.134256974191366 iter/s
TVM: 62.80919757107452 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 
  • GPU version:
    Edit line 39 of benchmarks.py to torch_tvm.enable(opt_level=3, device_type='cuda')
$ CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
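The GPU-mode change above is a single call; a sketch of the edited line, with the import guarded so this also runs where torch_tvm is not installed:

```python
# Sketch of the one-line change in benchmarks.py that switches the TVM
# backend to CUDA (per the instructions above). torch_tvm may not be
# installed in every environment, so the import is guarded here.
try:
    import torch_tvm
    torch_tvm.enable(opt_level=3, device_type='cuda')
    backend = 'cuda'
except ImportError:
    backend = None  # torch_tvm not available in this environment
print("torch_tvm backend:", backend)
```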

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.478923510188096 iter/s
TVM: 64.52328684937197 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 

As seen above, there is no significant change in iter/s between the two configurations:
CPU version: 62.80919757107452 iter/s
GPU version: 64.52328684937197 iter/s

If I check GPU memory usage with the nvidia-smi command, the GPU is idle, consistent with the nearly identical numbers above.
Is there any other configuration necessary to enable GPU backend?

(Beyond setting set(USE_CUDA ON), set(USE_CUDNN ON), and set(USE_CUBLAS ON) in https://github.com/dmlc/tvm/blob/master/cmake/config.cmake, and calling torch_tvm.enable(opt_level=3, device_type='cuda') in https://github.com/pytorch/tvm/blob/master/test/benchmarks.py.)
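Before hunting for further configuration, one sanity check is whether the TVM runtime underneath was actually built with CUDA and can see the device. A hedged sketch (tvm.gpu(0).exist and tvm.module.enabled are the TVM 0.6-era Python API; the import is guarded so this also runs where TVM is absent):

```python
# Hypothetical sanity check: was the underlying TVM build compiled with
# CUDA support, and is GPU 0 actually visible to it?
try:
    import tvm
    cuda_enabled = tvm.module.enabled("gpu")   # runtime built with CUDA?
    gpu_present = tvm.gpu(0).exist             # device 0 visible?
except ImportError:
    cuda_enabled = gpu_present = None          # TVM not installed here
print("CUDA runtime enabled:", cuda_enabled, "| GPU visible:", gpu_present)
```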


yinghai commented Nov 1, 2019

I don't think the current integration supports CUDA yet, but we have something WIP. @ilia-cher

ilia-cher (Contributor) commented

I have a local patch that adds support for CUDA; ETA to send it is next week.

hemantranvir (Author) commented

@ilia-cher Thanks for your response!
If the modification is small, and given that TVM itself already supports CUDA, could you please describe how to enable CUDA support in torch_tvm?
Excuse my hastiness; I am in a bit of a hurry.

ilia-cher (Contributor) commented

I plan to send the CUDA support PR this week.

hemantranvir (Author) commented

@ilia-cher Any updates? Thanks!
