
Any tips for keeping frames in GPU when using the python wrapper? #7824

Closed
smartin015 opened this issue Nov 20, 2020 · 13 comments

@smartin015

Hey all,

I'm doing some custom CUDA processing using a RealSense D435 and an NVIDIA Jetson Nano - I have an example which works here.

I use numba for JIT compilation of the CUDA kernel, and would like to pass the incoming depth frame to the kernel without first copying it out of GPU memory. In outline, it's roughly:

from numba import cuda
import numpy as np

#<realsense import & pipeline setup stuff>

@cuda.jit
def rvl_kernel(im, _):
    #<definition of kernel in python>

# block_per_grid / CUDA_THREAD_DIM are the usual numba launch configuration
_, frame = pipeline.try_wait_for_frames()
data = np.asanyarray(frame.get_depth_frame().get_data())
rvl_kernel[block_per_grid, CUDA_THREAD_DIM](data, _)

I suspect that my call to get_data() to convert the frame to numpy is causing an extra copy out of GPU memory that could be avoided. I saw #7816 and the ::gl namespace in C++, but that doesn't appear to be accessible from the python wrapper, so I can't pass the mapped memory into the kernel.

Can you please advise? Thanks!

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Nov 21, 2020

Hi @smartin015 I researched your question deeply. Accessing the SDK's C++ implementation of GLSL from Python did not seem to be practical. I also investigated the possibility of applying GLSL in Python from outside of the SDK, perhaps through Pyglet (since the Python wrapper has a Pyglet viewer example).

An example of Python, Pyglet and GLSL:

https://www.pythonstuff.org/glsl/example_2_glsl_with_pyglet.html

It may be best though to instead tackle the suspected root cause of your problem: the conversion of SDK frames to numpy. This is something that some RealSense Python users have been experimenting with for a while (both frame-to-numpy and numpy-to-frame conversion). There is no definitive solution at the time of writing, though the subject has been referred to Intel to investigate, according to @RealSenseSupport.

#5784

#2551

@smartin015
Author

Looking around a bit, I found the cuda array interface spec from Numba, which is also used in CuPy, PyTorch, JAX, etc.

It sounds as though, if realsense frame objects had this standard __cuda_array_interface__ dict attribute, I could pass them to the kernel directly without calling get_data(). Does that sound right to you? I could try monkey-patching this to see if it works... but I don't know how to access the data pointer of the existing frame. Do you have any tips on how I could do this?
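For reference, here's a rough sketch of what that attribute would need to contain for a Z16 depth frame. device_ptr, width and height are placeholders - as far as I can tell pyrealsense2 doesn't expose a GPU-visible pointer, so this is only to illustrate the idea:

# Illustrative only: the __cuda_array_interface__ dict a wrapper object would
# need to expose for a Z16 depth frame. device_ptr is a hypothetical pointer
# to GPU-visible memory, which pyrealsense2 does not currently provide.
class CudaDepthFrame:
    def __init__(self, device_ptr, width, height):
        self.__cuda_array_interface__ = {
            "shape": (height, width),     # rows, cols
            "typestr": "<u2",             # little-endian uint16 (Z16 depth)
            "data": (device_ptr, False),  # (pointer, read_only)
            "version": 2,
        }

# Numba (and CuPy, PyTorch, ...) would then accept such an object directly:
# rvl_kernel[block_per_grid, CUDA_THREAD_DIM](CudaDepthFrame(ptr, w, h), _)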

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Nov 23, 2020

I do not personally have knowledge of the advanced workings of the CUDA support in librealsense, so I cannot offer an educated opinion on that subject.

In regard to a data pointer though, librealsense's frames have a frame handle that acts as a 'smart pointer'.

#5087 (comment)

Data can be retrieved from the frame handle with get_data() (the instruction that you already mentioned):

https://intelrealsense.github.io/librealsense/doxygen/classrs2_1_1frame.html#a4b373fc81617be881b691a97b0f8358c

#6261

@dorodnic
Contributor

For this use case, get_data -> numpy array -> CUDA is the way to go. I don't see benefits to using ::gl here - depth frames originate in main memory, so something has to copy them to GPU memory, and GLSL will not do that faster than the built-in CUDA methods.
Since Z16 does not go through any CUDA processing inside the SDK, there is also no opportunity for re-use.
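In code, the recommended path is roughly the sketch below (out_buf stands in for whatever second argument the kernel expects):

import numpy as np
from numba import cuda

_, frames = pipeline.try_wait_for_frames()
host = np.asanyarray(frames.get_depth_frame().get_data())  # SDK frame -> host numpy array
dev = cuda.to_device(host)                                 # explicit host -> device copy
rvl_kernel[block_per_grid, CUDA_THREAD_DIM](dev, out_buf)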

@smartin015
Author

With the Jetson Nano, there's actually no such thing as main vs GPU memory - the CPU and GPU physically share the same system memory. But if librealsense, numpy, and numba/CUDA don't know about it (which is what I'm assuming), it causes an unnecessary copy to "load the data onto the GPU", which wastes both time and memory.

Efficiently using the system memory requires allocating "mapped" pinned memory. This is apparently done by replacing malloc() calls with cudaMallocManaged(). It's hard to tell from searching the librealsense repo how frame buffers are allocated, but I'm guessing it doesn't do this by default, as this is rather specific/specialized.

Ignoring the ability to do any of this in python for the moment... is there a way to explicitly create the frame buffer and configure the pipeline to use it? That would allow me to use cudaMallocManaged(), pass it along for librealsense to use, and then turn around and pass it to numba/CUDA to run the custom kernel.
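For what it's worth, numba already appears to expose the relevant allocators - cuda.mapped_array() (pinned host memory mapped into the device address space) and cuda.managed_array() (cudaMallocManaged-style unified memory). The missing piece is getting librealsense to write into such a buffer; the shape below is just an example:

from numba import cuda
import numpy as np

# Both of these are visible to CPU and GPU without an explicit copy on a
# unified-memory device like the Jetson Nano (848x480 is an example depth shape):
mapped = cuda.mapped_array((480, 848), dtype=np.uint16)   # pinned + mapped host memory
managed = cuda.managed_array((480, 848), dtype=np.uint16) # managed/unified memory

# What I can't find is a public librealsense API to make the pipeline write
# its depth frames directly into one of these buffers.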

@MartyG-RealSense
Collaborator

@smartin015 I must defer again to the CUDA expertise of @dorodnic on this matter.

@smartin015
Author

smartin015 commented Dec 2, 2020

I also stumbled across ENABLE_ZERO_COPY in https://github.com/IntelRealSense/librealsense/wiki/Build-Configuration, which mentions that when it is enabled "the rs2::frame object will track native handle to the underlying OS resource" and "frames... will not be mem-copied on arrival". However, it's marked as deprecated, I don't see any references to it in the repo, and there are no docs about it. Is it still possible to use?

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Dec 2, 2020

@smartin015 I located information from @dorodnic about ENABLE_ZERO_COPY:

"Zero Copy feature is for now not functional. The idea was that rs2::frame object could track the underlying Kernel resource instead of making a copy, but this does not always play well with the rest of the SDK. We might re-enable it at some point, but for now there seem to be little need for it".

#4132 (comment)

@smartin015
Author

Hey Marty - I'm still blocked on what I'm trying to do: use librealsense2 with mapped, pinned memory to eliminate unnecessary copying of depth frames on the NVIDIA Jetson Nano.

@MartyG-RealSense
Collaborator

Hi @smartin015 Considering that progress could not be made the last time this issue was looked at, do you wish to continue with it? Thanks!

@smartin015
Author

I suppose not. It's unfortunate the realsense library doesn't support this optimization, but I don't know where to start and it sounds like there's no interest in implementing it. I'll go ahead and close the issue.

@smartin015
Author

smartin015 commented Dec 29, 2020

One last update for folks who might come across this thread - I managed to speed up my code 10x (!) by following the conversation to add CUDA UVM support to Numba.

The trick was replacing my calls to np.array() with cuda.mapped_array() for the arrays passed into the CUDA kernel. I copy the realsense frame into one of these mapped arrays with something like

arr = cuda.mapped_array((w, h), dtype=...)  # pinned, GPU-mapped; shape/dtype must match the depth frame
...
_, frame = self.pipeline.try_wait_for_frames()
...
arr[:,:] = np.asanyarray(frame.get_depth_frame().get_data())  # one host-side copy into the mapped buffer

and then running my code as normal.
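Concretely, the mapped array goes straight into the kernel, with no cuda.to_device() round trip (out here stands in for whatever output buffer the kernel takes):

# arr is visible to the GPU directly, so no explicit host -> device transfer is needed:
rvl_kernel[block_per_grid, CUDA_THREAD_DIM](arr, out)
cuda.synchronize()  # with mapped memory, wait for the kernel to finish before reading results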

So there's still one copy to get from librealsense into numba land, but it's apparently not the expensive one(s) that fake "loading" the data into and out of GPU memory.

@MartyG-RealSense
Collaborator

Thanks so much for sharing your solution @smartin015 :)
