slow maximum function #24

I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?
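(A hypothetical reconstruction of such a test case — the concrete snippet is not shown — built only from calls that appear later in this thread: odl.rn, odl.phantom.white_noise, odl.util.Timer and np.maximum:)

```python
import numpy as np
import odl

# CUDA-backed ODL element and a plain numpy copy of the same data
spc = odl.rn(10**7, impl='cuda')
el = odl.phantom.white_noise(spc)
arr = el.asarray()

# Elementwise maximum via the GPU-backed element ...
with odl.util.Timer('maximum, odlcuda'):
    for _ in range(10):
        np.maximum(el, 3.0)

# ... versus a plain numpy array on the CPU
with odl.util.Timer('maximum, numpy cpu'):
    for _ in range(10):
        np.maximum(arr, 3.0)
```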
I kind of get the feeling that I should write my own kernels for some functions where I need very good performance. Do you have any pointers on how to do this with ODL?
Sorry for the delay here. Sadly, there are basically two ways to do this:

1. Add the ufunc to the odlcuda CUDA code itself (see the UFunc.cu file linked further down) and contribute it as a PR.
2. Wrap the CUDA array as a numba device array and write the kernel with numba.
Something like this should work:

```python
import odl
import numba
import numba.cuda
import ctypes
import numpy as np


def as_numba_arr(el):
    """Convert ``el`` to a numba device array."""
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size * el.dtype.itemsize)  # size is in bytes
    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),  # contiguous 1d array
        dtype=el.dtype,
        gpu_data=gpu_data)


# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap as numba device array (no copy, just a view)
numba_el = as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))
```

This seems to be doing quite well:

```python
spc = odl.rn(10**8, impl='cuda')
el = odl.phantom.white_noise(spc)
numba_el = as_numba_arr(el)

with odl.util.Timer('numpy'):
    for i in range(100):
        el2 = np.maximum(el, 3.0)

with odl.util.Timer('numba'):
    for i in range(100):
        el2 = maximum(numba_el, 3.0)

with odl.util.Timer('numba in place'):
    for i in range(100):
        maximum(numba_el, 3.0, out=numba_el)
```

which gives a 67x speedup.
I now added this as odlcuda.util.as_numba_arr, so the example becomes:

```python
# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap
numba_el = odlcuda.util.as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))
```
Great, I think I will first give the second option a shot. Regarding the first one: I am not sure I follow completely. By PR you mean pull request, right? Where do I find what you have done so far in this direction?
Sure, I mean PR, and there is this file: https://github.com/odlgroup/odlcuda/blob/master/odlcuda/cuda/UFunc.cu
Random remark: this will be super-fast with the new GPU backend and ufuncs, on the order of 30 ms for the above code (a 1200x speedup). Working on it.
That sounds amazing, can't wait to have it! What is a realistic time frame for this? Days, weeks, months?
Well, the big chunk of work for the CPU tensors is done and lying around as a PR ready for review. Another PR, which requires work on the order of weeks, adds the GPU backend as a full-value NumPy alternative.
While your example works for me, doing the same with […] does not.
It seems the […]. Regarding updates: we're working on it :)
Great, I am making progress on this end with numba. One proposal and one question: First, what if we write the numba transfer as element.asnumba() to be more in line with el.asarray()? Second, what about transferring the numba array back to ODL?
That would have to be something like […].

The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a numpy array and then assign that to the ODL vector. Somewhat more advanced, we could override the […].
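(A minimal sketch of that "simple" copy-back route, assuming the numba_el view from above; copy_to_host is numba's device-to-host copy, and slice assignment into an ODL element is standard:)

```python
# Copy the numba device array back to a host numpy array ...
host_arr = numba_el.copy_to_host()
# ... and assign it back into the ODL element
el[:] = host_arr
```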
Maybe even better, what about el.asarray(impl='numbacuda')?
Yes, this would be a good option.
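(For illustration, such a dispatch could look roughly like the following; this is purely hypothetical — neither the impl keyword nor 'numbacuda' exist in ODL's asarray:)

```python
import numpy as np

def asarray(el, impl='numpy'):
    """Hypothetical sketch of an asarray with an impl switch."""
    if impl == 'numpy':
        return np.asarray(el)
    elif impl == 'numbacuda':
        # Reuse the wrapper from above: a no-copy device view
        return as_numba_arr(el)
    else:
        raise ValueError('unknown impl {!r}'.format(impl))
```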
Hm, I've always thought that […]
I agree with the asarray proposal, looks good and reduces clutter!
I also realized that this functionality is useful for the CPU, too! I'm not sure how much speedup to expect, but numba allows you to compute things in a more elegant and memory-efficient way.
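(For instance, a CPU variant of the same kernel — a sketch using numba.vectorize with target='cpu' — fuses the operation into a single pass and can write into a preallocated buffer, avoiding temporaries:)

```python
import numba
import numpy as np

@numba.vectorize(['float64(float64, float64)'], target='cpu')
def maximum_cpu(a, b):
    return max(a, b)

x = np.random.rand(10**6)
out = np.empty_like(x)
maximum_cpu(x, 3.0, out=out)  # writes into the preallocated buffer
```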