
Any tips for keeping frames in GPU when using the python wrapper? #7824

Closed
smartin015 opened this issue Nov 20, 2020 · 13 comments

@smartin015

Hey all,

I'm doing some custom CUDA processing using a RealSense D435 and an NVIDIA Jetson Nano - I have an example which works here.

I use numba for JIT compilation of the CUDA kernel, and would like to pass the incoming depth frame to the kernel without first copying it out of GPU memory. In outline, it's roughly:

from numba import cuda
import numpy as np

#<realsense import & pipeline setup stuff>

@cuda.jit
def rvl_kernel(im, _):
    #<definition of kernel in python>

# block_per_grid / CUDA_THREAD_DIM are the usual numba launch configuration
_, frame = pipeline.try_wait_for_frames()
data = np.asanyarray(frame.get_depth_frame().get_data())
rvl_kernel[block_per_grid, CUDA_THREAD_DIM](data, _)

I suspect that my call to get_data() to convert the frame to numpy is causing an extra copy out of GPU memory that could be avoided. I saw #7816 and the ::gl namespace in C++, but that doesn't appear to be accessible from the python wrapper, so I can't pass the mapped memory into the kernel.

Can you please advise? Thanks!

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Nov 21, 2020

Hi @smartin015 I researched your question deeply. Accessing the SDK's C++ implementation of GLSL from Python did not seem to be practical. I also investigated the possibility of applying GLSL in Python from outside of the SDK, perhaps through Pyglet (since the Python wrapper has a Pyglet viewer example).

An example of Python, Pyglet and GLSL:

https://www.pythonstuff.org/glsl/example_2_glsl_with_pyglet.html

It may be best though to instead tackle the suspected root cause of your problem: the conversion of SDK frames to numpy. This is something that some RealSense Python users have been experimenting with for a while (both frame-to-numpy and numpy-to-frame conversion). There is no definitive solution at the time of writing, though the subject has been referred to Intel to investigate, according to @RealSenseSupport.

#5784

#2551

@smartin015
Author

Looking around a bit, I found the cuda array interface spec from Numba, which is also used in CuPy, PyTorch, JAX, etc.

It sounds as though, if realsense frame objects had this standard __cuda_array_interface__ dict attribute, I could pass them to the kernel directly without calling get_data(). Does that sound right to you? I could try monkey-patching this to see if it works... but I don't know how to access the data pointer of the existing frame. Do you have any tips on how I could do this?
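For reference, here's a rough sketch of what that attribute would need to contain for a Z16 depth frame. device_ptr, width and height are placeholders - as far as I can tell pyrealsense2 doesn't expose a GPU-visible pointer, so this is only to illustrate the idea:

# Illustrative only: the __cuda_array_interface__ dict a wrapper object would
# need to expose for a Z16 depth frame. device_ptr is a hypothetical pointer
# to GPU-visible memory, which pyrealsense2 does not currently provide.
class CudaDepthFrame:
    def __init__(self, device_ptr, width, height):
        self.__cuda_array_interface__ = {
            "shape": (height, width),     # rows, cols
            "typestr": "<u2",             # little-endian uint16 (Z16 depth)
            "data": (device_ptr, False),  # (pointer, read_only)
            "version": 2,
        }

# Numba (and CuPy, PyTorch, ...) would then accept such an object directly:
# rvl_kernel[block_per_grid, CUDA_THREAD_DIM](CudaDepthFrame(ptr, w, h), _)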

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Nov 23, 2020

I do not personally have knowledge of the advanced workings of the CUDA support in librealsense, so I cannot offer an educated opinion on that subject.

In regard to a data pointer though, librealsense's frames have a frame handle that acts as a 'smart pointer'.

#5087 (comment)

Data can be retrieved from the frame handle with get_data() (the instruction that you already mentioned):

https://intelrealsense.github.io/librealsense/doxygen/classrs2_1_1frame.html#a4b373fc81617be881b691a97b0f8358c

#6261

@dorodnic
Contributor

For this use case, get_data -> numpy array -> CUDA is the way to go. I don't see benefits to using ::gl here - depth frames originate in main memory, so something has to copy them to GPU memory, and GLSL will not do that faster than the built-in CUDA methods.
Since Z16 does not go through any CUDA processing inside the SDK, there is also no opportunity for re-use.
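In code, the recommended path is roughly the sketch below (out_buf stands in for whatever second argument the kernel expects):

import numpy as np
from numba import cuda

_, frames = pipeline.try_wait_for_frames()
host = np.asanyarray(frames.get_depth_frame().get_data())  # SDK frame -> host numpy array
dev = cuda.to_device(host)                                 # explicit host -> device copy
rvl_kernel[block_per_grid, CUDA_THREAD_DIM](dev, out_buf)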

@smartin015
Author

With the Jetson Nano, there's actually no such thing as main vs GPU memory - the CPU and GPU physically share the same system memory. But if librealsense, numpy, and numba/CUDA don't know about it (which is what I'm assuming), it causes an unnecessary copy to "load the data onto the GPU", which wastes both time and memory.

Efficiently using the system memory requires allocating "mapped" pinned memory. This is apparently done by replacing malloc() calls with cudaMallocManaged(). It's hard to tell from searching the librealsense repo how frame buffers are allocated, but I'm guessing it doesn't do this by default, as this is rather specific/specialized.

Ignoring the ability to do any of this in python for the moment... is there a way to explicitly create the frame buffer and configure the pipeline to use it? That would allow me to use cudaMallocManaged(), pass it along for librealsense to use, and then turn around and pass it to numba/CUDA to run the custom kernel.
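For what it's worth, numba already appears to expose the relevant allocators - cuda.mapped_array() (pinned host memory mapped into the device address space) and cuda.managed_array() (cudaMallocManaged-style unified memory). The missing piece is getting librealsense to write into such a buffer; the shape below is just an example:

from numba import cuda
import numpy as np

# Both of these are visible to CPU and GPU without an explicit copy on a
# unified-memory device like the Jetson Nano (848x480 is an example depth shape):
mapped = cuda.mapped_array((480, 848), dtype=np.uint16)   # pinned + mapped host memory
managed = cuda.managed_array((480, 848), dtype=np.uint16) # managed/unified memory

# What I can't find is a public librealsense API to make the pipeline write
# its depth frames directly into one of these buffers.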

@MartyG-RealSense
Collaborator

@smartin015 I must defer again to the CUDA expertise of @dorodnic on this matter.

@smartin015
Author

smartin015 commented Dec 2, 2020

I also stumbled across ENABLE_ZERO_COPY in https://github.com/IntelRealSense/librealsense/wiki/Build-Configuration, which mentions that when it is enabled "the rs2::frame object will track native handle to the underlying OS resource" and "frames... will not be mem-copied on arrival". However, it's marked as deprecated, I don't see any references to it in the repo, and there are no docs about it. Is it still possible to use?

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Dec 2, 2020

@smartin015 I located information from @dorodnic about ENABLE_ZERO_COPY:

"Zero Copy feature is for now not functional. The idea was that rs2::frame object could track the underlying Kernel resource instead of making a copy, but this does not always play well with the rest of the SDK. We might re-enable it at some point, but for now there seem to be little need for it".

#4132 (comment)

@smartin015
Author

Hey Marty - I'm still blocked on what I'm trying to do: use librealsense2 with mapped, pinned memory to eliminate unnecessary copying of depth frames on the NVIDIA Jetson Nano.

@MartyG-RealSense
Collaborator

Hi @smartin015 Considering that progress could not be made the last time this issue was looked at, do you wish to continue with it? Thanks!

@smartin015
Author

I suppose not. It's unfortunate the realsense library doesn't support this optimization, but I don't know where to start and it sounds like there's no interest in implementing it. I'll go ahead and close the issue.

@smartin015
Author

smartin015 commented Dec 29, 2020

One last update for folks who might come across this thread - I managed to speed up my code 10x (!) by following the conversation to add CUDA UVM support to Numba.

The trick was replacing my calls to np.array() with cuda.mapped_array() for the arrays passed into the CUDA kernel. I copy the realsense frame into one of these mapped arrays with something like

arr = cuda.mapped_array((w, h), dtype=...)  # pinned, GPU-mapped; shape/dtype must match the depth frame
...
_, frame = self.pipeline.try_wait_for_frames()
...
arr[:,:] = np.asanyarray(frame.get_depth_frame().get_data())  # one host-side copy into the mapped buffer

and then running my code as normal.
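Concretely, the mapped array goes straight into the kernel, with no cuda.to_device() round trip (out here stands in for whatever output buffer the kernel takes):

# arr is visible to the GPU directly, so no explicit host -> device transfer is needed:
rvl_kernel[block_per_grid, CUDA_THREAD_DIM](arr, out)
cuda.synchronize()  # with mapped memory, wait for the kernel to finish before reading results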

So there's still one copy to get from librealsense into numba land, but it's apparently not the expensive one(s) that fake "loading" the data into and out of GPU memory.

@MartyG-RealSense
Collaborator

Thanks so much for sharing your solution @smartin015 :)
