slow maximum function #24

Open
mehrhardt opened this issue Jul 25, 2017 · 17 comments

@mehrhardt
I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?

import odl

# CPU (numpy) space
X = odl.rn(300 * 10**6, dtype='float32')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)

# GPU (odlcuda) space
X = odl.rn(300 * 10**6, dtype='float32', impl='cuda')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)
# CPU, maximum:
CPU times: user 346 ms, sys: 200 µs, total: 346 ms
Wall time: 347 ms
# CPU, log:
CPU times: user 1.44 s, sys: 0 ns, total: 1.44 s
Wall time: 1.43 s
# CUDA, maximum:
CPU times: user 838 ms, sys: 341 ms, total: 1.18 s
Wall time: 1.18 s
# CUDA, log:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 91.1 µs
@mehrhardt
Author

I kind of get the feeling that I should write my own kernels for functions where I want very good performance. Do you have any pointers on how to do this with ODL?

@adler-j
Member

adler-j commented Aug 21, 2017

Sorry for the delay here.

Sadly, there are basically two ways to do this:

  1. Use pure C++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but it takes some time to do.

  2. Use Python and do some wrapping. I'd prefer using something like Theano or Numba.

Something like this should work:

import odl
import numba
import numba.cuda
import ctypes
import numpy as np


def as_numba_arr(el):
    """Convert ODL element ``el`` to a numba device array (no copy)."""
    # Wrap the raw GPU pointer; MemoryPointer takes the size in bytes.
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size * el.dtype.itemsize)

    # View the memory as a contiguous 1d device array.
    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),
        dtype=el.dtype,
        gpu_data=gpu_data)

# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap
numba_el = as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))

This seems to be doing quite well:

spc = odl.rn(10**8, impl='cuda')
el = odl.phantom.white_noise(spc)
numba_el = as_numba_arr(el)

with odl.util.Timer('numpy'):
    for i in range(100):
        el2 = np.maximum(el, 3.0)

with odl.util.Timer('numba'):
    for i in range(100):
        el2 = maximum(numba_el, 3.0)

with odl.util.Timer('numba in place'):
    for i in range(100):
        maximum(numba_el, 3.0, out=numba_el)

This gives:

                         numpy :     39.821 
                         numba :      0.652 
                numba in place :      0.589 

That is a 67x speedup.

@adler-j
Member

adler-j commented Aug 21, 2017

I've now added util.as_numba_arr in 5738eb6; the above code then becomes simply:

import odl
import odlcuda.util
import numba

# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap (no copy)
numba_el = odlcuda.util.as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))

@mehrhardt
Author

Great, I think I will first give the second option a shot.

Regarding the first one:

Use pure c++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but takes some time to do.

I am not sure I follow completely. By PR you mean pull request, right? Where can I find what you have done so far in this direction?

@adler-j
Member

adler-j commented Aug 21, 2017

I am not sure I follow completely. By PR you mean pull request, right? Where can I find what you have done so far in this direction?

Sure, I mean pull request, and there is this file:

https://github.com/odlgroup/odlcuda/blob/master/odlcuda/cuda/UFunc.cu

@kohr-h
Member

kohr-h commented Aug 21, 2017

Random remark: this will be super-fast with the new GPU backend and ufuncs, on the order of 30 ms for the above code (a 1200x speedup). Working on it.

@mehrhardt
Author

That sounds amazing, can't wait to have it! What is a realistic time frame for this? Days, weeks, months?

@kohr-h
Member

kohr-h commented Aug 21, 2017

Well, the big chunk of work for the CPU tensors is done and sitting in a PR ready for review. Another PR, which requires work on the order of weeks, adds the GPU backend as a full-fledged NumPy alternative.
The only issue is that the current version of that backend has no native ufuncs, which is yet another PR I'm working on. There are some issues to solve there, but probably also on the order of weeks (optimistic as I am :-))

@mehrhardt
Author

Any news on this front @kohr-h, @adler-j ?

@mehrhardt
Author

While your example works for me

spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)

doing the same with uniform_discr does not:

spc = odl.uniform_discr([0], [1], [5], impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)

File "", line 1, in
numba_el = odlcuda.util.as_numba_arr(el)

File "/home/me404/.local/lib/python2.7/site-packages/odlcuda-0.5.0-py2.7.egg/odlcuda/util.py", line 37, in as_numba_arr
pointer=ctypes.c_ulong(el.data_ptr),

AttributeError: 'DiscreteLpElement' object has no attribute 'data_ptr'

@adler-j
Member

adler-j commented Nov 1, 2017

It seems the as_numba_arr function needs a special case, e.g. if isinstance(el, DiscreteLpElement): el = el.ntuple; that should solve the problem. A sketch follows below.

Regarding updates, we're working on it :)
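
Something like this might work as the special case (an untested sketch; the import path for DiscreteLpElement and the byte-size computation are assumptions on my part):

import ctypes
import numba.cuda
# Assumed import path for DiscreteLpElement:
from odl.discr.lp_discr import DiscreteLpElement


def as_numba_arr(el):
    """Convert ODL element ``el`` to a numba device array (no copy)."""
    # Unwrap discretized elements to the underlying ntuple,
    # which carries the raw data_ptr.
    if isinstance(el, DiscreteLpElement):
        el = el.ntuple

    # Wrap the raw GPU pointer; MemoryPointer takes the size in bytes.
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size * el.dtype.itemsize)

    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),
        dtype=el.dtype,
        gpu_data=gpu_data)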

@mehrhardt
Author

Great, I am making progress on this end with numba. One proposal and one question:

How about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

Second, what about transferring the numba array back to ODL?

@adler-j
Member

adler-j commented Nov 3, 2017

How about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

That would have to be something like asnumbacuda or so, but why not!

Second, what about transferring the numba array back to ODL?

The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a numpy array and then assign it to the ODL vector. Somewhat more advanced, we could override the __setitem__ function to work with numba arrays, which would allow odl_el[:] = numba_el to work.
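
For the simple option, a sketch (copy_to_host is standard numba device-array API; numba_el and spc are as in the snippets above):

# Device -> host: copy the numba device array into a numpy array.
host_arr = numba_el.copy_to_host()

# Host -> ODL: wrap it in a new element ...
odl_el = spc.element(host_arr)

# ... or overwrite an existing element in place.
odl_el[:] = host_arr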

@mehrhardt
Author

mehrhardt commented Nov 6, 2017

How about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

That would have to be something like asnumbacuda or so, but why not!

Maybe even better, what about el.asarray(impl='numbacuda')?

Second, what about transferring the numba array back to ODL?

The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a numpy array and then assign it to the ODL vector. Somewhat more advanced, we could override the __setitem__ function to work with numba arrays, which would allow odl_el[:] = numba_el to work.

Yes, this would be a good option.
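
Just to illustrate the proposed interface (a purely hypothetical sketch; neither the impl keyword nor this dispatch exists in ODL, the names are invented):

# Hypothetical free function; the 'impl' values are made up for illustration.
def asarray(el, impl='numpy'):
    if impl == 'numpy':
        return el.asarray()       # existing behavior, copies to host
    elif impl == 'numbacuda':
        return as_numba_arr(el)   # zero-copy numba device view
    else:
        raise ValueError('unknown impl {!r}'.format(impl))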

@kohr-h
Member

kohr-h commented Nov 6, 2017

Maybe even better, what about el.asarray(impl='numbacuda')?

Hm, I've always thought that asarray is so darn unspecific about implementation; maybe we can exploit that now?

@adler-j
Member

adler-j commented Nov 8, 2017

I agree with the asarray proposal; it looks good and reduces clutter!

@mehrhardt
Author

I also realized that this functionality is useful on the CPU, too! I'm not sure how much speedup to expect, but numba allows you to compute things more elegantly and memory-efficiently.
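
For example, the same kernel with numba's default CPU target (a plain-numpy sketch, independent of odlcuda):

import numba
import numpy as np

# Same kernel as above, compiled for the CPU (default target).
@numba.vectorize(['float32(float32, float32)'])
def maximum_cpu(a, b):
    return max(a, b)

x = np.random.rand(10**6).astype('float32')
out = np.empty_like(x)

# The generated ufunc supports in-place output, avoiding extra allocations.
maximum_cpu(x, 3.0, out=out)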
