Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TOPI][OP] Use Thrust sort for argsort and topk #5097

Merged
merged 10 commits into from
Mar 20, 2020
Merged

Conversation

kazum
Copy link
Contributor

@kazum kazum commented Mar 19, 2020

The current GPU sort implementation (odd-even transposition sort) is too slow when the number of elements is large. This PR introduces Thrust implementation of sort which is much faster.

Note that this change requires CMake 3.8 3.13 or later since we have to use nvcc to compile a thrust code.

  • benchmark script

    import tvm
    from tvm import relay
    from tvm.contrib import graph_runtime
    import numpy as np
    
    target = 'cuda'
    ctx = tvm.gpu(0)
    n = 100000
    
    x = relay.var("x", shape=(n,))
    out = relay.topk(x)
    func = relay.Function([x], out[0])
    
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(func, target)
    
    module = graph_runtime.create(graph, lib, ctx)
    
    print("Evaluate inference time cost...")
    ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=3)
    prof_res = np.array(ftimer().results) * 1000
    print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
          (np.mean(prof_res), np.std(prof_res)))
  • result (NVIDIA P100, without thrust)

    Evaluate inference time cost...
    Mean inference time (std dev): 2058.89 ms (0.07 ms)
    
  • result (NVIDIA P100, with thrust)

    Evaluate inference time cost...
    Mean inference time (std dev): 1.11 ms (0.03 ms)
    

@icemelon9 @vinx13 @masahi could you help to review?

kazum added 5 commits March 19, 2020 17:56
The current GPU sort implementation (odd-even transposition sort) is too slow
when the number of elements is large.  This PR introduces Thrust implementation
of sort which is much faster.

Note that this change requires CMake 3.8 or later since we have to use nvcc to
compile a thrust code.
@Laurawly
Copy link
Contributor

Is this cuda 10.1 compatible? I tried to build it on my machine with cuda 10.1 and it gives me the following errors:

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(42): error: namespace "thrust" has no member "device_ptr"

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(42): error: type name is not allowed

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(43): error: namespace "thrust" has no member "device_ptr"

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(43): error: type name is not allowed

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(44): error: namespace "thrust" has no member "device_ptr"

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(44): error: type name is not allowed

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(52): error: identifier "data_ptr" is undefined

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(52): error: identifier "values_ptr" is undefined

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(55): error: identifier "indices_ptr" is undefined

/home/ubuntu/workplace/tvm-1/src/runtime/contrib/thrust/thrust.cu(42): error: identifier "data_ptr" is undefined

@kazum
Copy link
Contributor Author

kazum commented Mar 19, 2020

@Laurawly Thanks, I've update the code and confirmed that it can be compiled on Ubuntu 18.04 and CUDA 10.1.

Copy link
Member

@masahi masahi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Member

@icemelon icemelon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Laurawly
Copy link
Contributor

Could you also add USE_THRUST option in config.cmake file?

@Laurawly Laurawly merged commit 2e8f3a9 into apache:master Mar 20, 2020
@Laurawly
Copy link
Contributor

Laurawly commented Mar 20, 2020

Thanks @kazum @masahi @icemelon9

@kazum kazum deleted the thrust branch March 21, 2020 01:50
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
* [TOPI][OP] Use Thrust sort for argsort and topk

The current GPU sort implementation (odd-even transposition sort) is too slow
when the number of elements is large.  This PR introduces Thrust implementation
of sort which is much faster.

Note that this change requires CMake 3.8 or later since we have to use nvcc to
compile a thrust code.

* cmake: make CUDA optional

* allow .cu file to be into the repository

* pylint fix and cleanup

* require cmake 3.8 only when thrust is enabled

* fix nvcc compiler error when passing -pthread

* add missing include

* add USE_THRUST option in config.cmake

* retrigger CI

* retrigger CI
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020
* [TOPI][OP] Use Thrust sort for argsort and topk

The current GPU sort implementation (odd-even transposition sort) is too slow
when the number of elements is large.  This PR introduces Thrust implementation
of sort which is much faster.

Note that this change requires CMake 3.8 or later since we have to use nvcc to
compile a thrust code.

* cmake: make CUDA optional

* allow .cu file to be into the repository

* pylint fix and cleanup

* require cmake 3.8 only when thrust is enabled

* fix nvcc compiler error when passing -pthread

* add missing include

* add USE_THRUST option in config.cmake

* retrigger CI

* retrigger CI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants