slow maximum function #24

Open
mehrhardt opened this issue Jul 25, 2017 · 17 comments

@mehrhardt
I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?

import odl

# CPU (numpy) space
X = odl.rn(300 * 10**6, dtype='float32')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)

# GPU (odlcuda) space
X = odl.rn(300 * 10**6, dtype='float32', impl='cuda')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)
# CPU, maximum:
CPU times: user 346 ms, sys: 200 µs, total: 346 ms
Wall time: 347 ms
# CPU, log:
CPU times: user 1.44 s, sys: 0 ns, total: 1.44 s
Wall time: 1.43 s
# CUDA, maximum:
CPU times: user 838 ms, sys: 341 ms, total: 1.18 s
Wall time: 1.18 s
# CUDA, log:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 91.1 µs
@mehrhardt
Author

I kind of get the feeling that I should write my own kernels for functions where I want very good performance. Do you have any pointers on how to do this with ODL?

@adler-j
Member

adler-j commented Aug 21, 2017

Sorry for the delay here.

Sadly, there are basically two ways to do this:

  1. Use pure C++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but it takes some time to do.

  2. Use Python and do some wrapping. I'd prefer using something like Theano or Numba.

Something like this should work:

import odl
import numba
import numba.cuda
import ctypes
import numpy as np


def as_numba_arr(el):
    """Convert ODL element ``el`` to a numba device array (no copy)."""
    # Wrap the raw GPU pointer; MemoryPointer takes the size in bytes.
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size * el.dtype.itemsize)

    # View the memory as a contiguous 1d device array.
    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),
        dtype=el.dtype,
        gpu_data=gpu_data)

# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap
numba_el = as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))

This seems to be doing quite well:

spc = odl.rn(10**8, impl='cuda')
el = odl.phantom.white_noise(spc)
numba_el = as_numba_arr(el)

with odl.util.Timer('numpy'):
    for i in range(100):
        el2 = np.maximum(el, 3.0)

with odl.util.Timer('numba'):
    for i in range(100):
        el2 = maximum(numba_el, 3.0)

with odl.util.Timer('numba in place'):
    for i in range(100):
        maximum(numba_el, 3.0, out=numba_el)

This gives:

                         numpy :     39.821 
                         numba :      0.652 
                numba in place :      0.589 

That is a 67x speedup.

@adler-j
Member

adler-j commented Aug 21, 2017

I've now added util.as_numba_arr in 5738eb6; the above code then becomes simply:

import odl
import odlcuda.util
import numba

# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap (no copy)
numba_el = odlcuda.util.as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))

@mehrhardt
Author

Great, I think I will first give the second option a shot.

Regarding the first one:

Use pure c++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but takes some time to do.

I am not sure I follow completely. By PR you mean pull request, right? Where can I find what you have done so far in this direction?

@adler-j
Member

adler-j commented Aug 21, 2017

I am not sure I follow completely. By PR you mean pull request, right? Where can I find what you have done so far in this direction?

Sure, I mean pull request, and there is this file:

https://github.com/odlgroup/odlcuda/blob/master/odlcuda/cuda/UFunc.cu

@kohr-h
Member

kohr-h commented Aug 21, 2017

Random remark: this will be super-fast with the new GPU backend and ufuncs, on the order of 30 ms for the above code (a 1200x speedup). Working on it.

@mehrhardt
Author

That sounds amazing, can't wait to have it! What is a realistic time frame for this? Days, weeks, months?

@kohr-h
Member

kohr-h commented Aug 21, 2017

Well, the big chunk of work for the CPU tensors is done and sitting in a PR ready for review. Another PR, which requires work on the order of weeks, adds the GPU backend as a full-fledged NumPy alternative.
The only issue is that the current version of that backend has no native ufuncs, which is yet another PR I'm working on. There are some issues to solve there, but probably also on the order of weeks (optimistic as I am :-))

@mehrhardt
Author

Any news on this front @kohr-h, @adler-j ?

@mehrhardt
Author

While your example works for me

spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)

doing the same with uniform_discr does not:

spc = odl.uniform_discr([0], [1], [5], impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)

File "", line 1, in
numba_el = odlcuda.util.as_numba_arr(el)

File "/home/me404/.local/lib/python2.7/site-packages/odlcuda-0.5.0-py2.7.egg/odlcuda/util.py", line 37, in as_numba_arr
pointer=ctypes.c_ulong(el.data_ptr),

AttributeError: 'DiscreteLpElement' object has no attribute 'data_ptr'

@adler-j
Member

adler-j commented Nov 1, 2017

It seems the as_numba_arr function needs a special case, e.g. if isinstance(el, DiscreteLpElement): el = el.ntuple; that should solve the problem. A sketch follows below.

Regarding updates, we're working on it :)
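
Something like this might work as the special case (an untested sketch; the import path for DiscreteLpElement and the byte-size computation are assumptions on my part):

import ctypes
import numba.cuda
# Assumed import path for DiscreteLpElement:
from odl.discr.lp_discr import DiscreteLpElement


def as_numba_arr(el):
    """Convert ODL element ``el`` to a numba device array (no copy)."""
    # Unwrap discretized elements to the underlying ntuple,
    # which carries the raw data_ptr.
    if isinstance(el, DiscreteLpElement):
        el = el.ntuple

    # Wrap the raw GPU pointer; MemoryPointer takes the size in bytes.
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size * el.dtype.itemsize)

    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),
        dtype=el.dtype,
        gpu_data=gpu_data)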

@mehrhardt
Author

Great, I am making progress on this end with numba. One proposal and one question:

How about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

Second, what about transferring the numba array back to ODL?

@adler-j
Member

adler-j commented Nov 3, 2017

How about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

That would have to be something like asnumbacuda or so, but why not!

Second, what about transferring the numba array back to ODL?

The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a numpy array and then assign it to the ODL vector. Somewhat more advanced, we could override the __setitem__ function to work with numba arrays, which would allow odl_el[:] = numba_el to work.
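
For the simple option, a sketch (copy_to_host is standard numba device-array API; numba_el and spc are as in the snippets above):

# Device -> host: copy the numba device array into a numpy array.
host_arr = numba_el.copy_to_host()

# Host -> ODL: wrap it in a new element ...
odl_el = spc.element(host_arr)

# ... or overwrite an existing element in place.
odl_el[:] = host_arr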

@mehrhardt
Author

mehrhardt commented Nov 6, 2017

How about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

That would have to be something like asnumbacuda or so, but why not!

Maybe even better, what about el.asarray(impl='numbacuda')?

Second, what about transferring the numba array back to ODL?

The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a numpy array and then assign it to the ODL vector. Somewhat more advanced, we could override the __setitem__ function to work with numba arrays, which would allow odl_el[:] = numba_el to work.

Yes, this would be a good option.
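
Just to illustrate the proposed interface (a purely hypothetical sketch; neither the impl keyword nor this dispatch exists in ODL, the names are invented):

# Hypothetical free function; the 'impl' values are made up for illustration.
def asarray(el, impl='numpy'):
    if impl == 'numpy':
        return el.asarray()       # existing behavior, copies to host
    elif impl == 'numbacuda':
        return as_numba_arr(el)   # zero-copy numba device view
    else:
        raise ValueError('unknown impl {!r}'.format(impl))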

@kohr-h
Member

kohr-h commented Nov 6, 2017

Maybe even better, what about el.asarray(impl='numbacuda')?

Hm, I've always thought that asarray is so darn unspecific about implementation; maybe we can exploit that now?

@adler-j
Member

adler-j commented Nov 8, 2017

I agree with the asarray proposal; it looks good and reduces clutter!

@mehrhardt
Author

I also realized that this functionality is useful on the CPU, too! I'm not sure how much speedup to expect, but numba allows you to compute things more elegantly and memory-efficiently.
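
For example, the same kernel with numba's default CPU target (a plain-numpy sketch, independent of odlcuda):

import numba
import numpy as np

# Same kernel as above, compiled for the CPU (default target).
@numba.vectorize(['float32(float32, float32)'])
def maximum_cpu(a, b):
    return max(a, b)

x = np.random.rand(10**6).astype('float32')
out = np.empty_like(x)

# The generated ufunc supports in-place output, avoiding extra allocations.
maximum_cpu(x, 3.0, out=out)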
