Prevent `DeviceBuffer` DeviceMemoryResource premature release #931

viclafargue · 2021-12-06T16:24:52Z

DeviceBuffer deallocation sometimes fails with a segfault. I haven't been able to create a minimal reproducer yet. However, a possible source of the problem is the way Cython releases the memory: the DeviceBuffer's DeviceMemoryResource appears to sometimes be released before the device_buffer. This causes a segfault in the device_buffer destructor, as the device memory resource pointer is set to null.

The code in this PR seems to fix the issue. It uses a holder class that seems to enforce the proper destruction order of the class members.

shwina · 2021-12-06T19:46:04Z

Thanks for the PR!

I think we need a better understanding of the problem, exactly when it arises, and why the changes in this PR fix it. If we haven't been able to find a minimal reproducer yet, do you perhaps have a larger program that reproduces the segfault?

Do you think maybe simply reversing the order of declaration of c_obj and mr here would resolve the issue you are seeing? cdef classes in Cython translate into C++ structs, and the order of deletion of members is in the reverse order of declaration.

viclafargue · 2021-12-07T12:15:57Z

Thanks for taking a look!

> Do you think maybe simply reversing the order of declaration of c_obj and mr here would resolve the issue you are seeing?

I remember trying that; unfortunately, it didn't seem to work.

What I noticed is that the DeviceMemoryResource's shared_ptr refcount seems to always be set to 1, and could, in my understanding, as well be defined as a unique_ptr. DeviceMemoryResource copies in the code seem to be done by reference at the Python level. Because of this, what actually matters is the Python refcount.
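To illustrate the point about Python-level references, here is a minimal sketch in plain Python. `FakeMemoryResource` and `FakeBuffer` are hypothetical stand-ins, not RMM classes; the sketch only shows that storing an object as an attribute adds one Python reference to the same object rather than making a copy.

```python
import sys


class FakeMemoryResource:
    """Hypothetical stand-in for a Python-level memory resource wrapper."""


class FakeBuffer:
    """Hypothetical stand-in for an object that keeps a reference to its resource."""

    def __init__(self, mr):
        # Stores a reference to the same object, not a copy.
        self.mr = mr


mr = FakeMemoryResource()
before = sys.getrefcount(mr)
buf = FakeBuffer(mr)
after = sys.getrefcount(mr)

same_object = buf.mr is mr          # the buffer holds the very same object
refcount_delta = after - before     # one extra Python reference, held by buf
```

Under this model, the lifetime of the resource is governed by how many Python references point at it, which is why the order in which those references are dropped matters.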

In the C++ code, we can see the following:

struct __pyx_obj_3rmm_4_lib_13device_buffer_DeviceBuffer {
  PyObject_HEAD
  struct __pyx_vtabstruct_3rmm_4_lib_13device_buffer_DeviceBuffer *__pyx_vtab;
  std::unique_ptr<rmm::device_buffer>  c_obj;
  struct __pyx_obj_3rmm_4_lib_15memory_resource_DeviceMemoryResource *mr;
  struct __pyx_obj_3rmm_5_cuda_6stream_Stream *stream;
};

While c_obj, which holds the native device_buffer, is a C++ object, mr (the DeviceMemoryResource extension type) is a pointer to a struct that seems to be managed by the Python layer. And, as discussed in the Cython documentation:

By the time your `__dealloc__()` method is called, the object may already have been partially destroyed and may not be in a valid state as far as Python is concerned

In my understanding, mr, the DeviceMemoryResource class member, might already have been destroyed by the time c_obj is destroyed.

> If we haven't been able to find a minimal reproducer yet, do you perhaps have a larger program that reproduces the segfault?

Sure, I can help you reproduce it through NVIDIA's internal communication means.

shwina · 2021-12-10T20:18:42Z

As an update: using no_gc_clear with DeviceBuffer also fixes the issue. However, a simple test program in which a DeviceBuffer lives in a reference cycle does not reproduce the issue. Continuing to investigate.

shwina · 2021-12-14T21:37:47Z

I believe this should be a more minimal reproducer of the problem:

import rmm
import gc

class Foo():
    def __init__(self, x):
        self.x = x

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())

dbuf1 = rmm.DeviceBuffer(size=10)

# Make dbuf1 part of a reference cycle:
l1 = Foo(dbuf1)
l2 = Foo(dbuf1)
l1.l2 = l2
l2.l1 = l1

# due to the reference cycle, the device buffer doesn't actually get
# cleaned up until later, after we invoke `gc.collect()`:
del dbuf1, l1, l2

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())

# by now, the only remaining reference to the *original* memory
# resource should be in `dbuf1`. However, the cyclic garbage collector
# will eliminate that reference when it clears the object via its
# `tp_clear` method.  Later, when `tp_dealloc` attempts to actually
# deallocate `dbuf1` (which needs the MR alive), a segfault occurs.

gc.collect()
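The deferred cleanup that the comments above rely on can be checked without a GPU. The following is a sketch in plain Python: `FakeBuffer` and `Holder` are hypothetical stand-ins for `rmm.DeviceBuffer` and `Foo`, and a weak reference is used as a probe to see whether the buffer is still alive.

```python
import gc
import weakref


class FakeBuffer:
    """Hypothetical stand-in for rmm.DeviceBuffer (no GPU required)."""


class Holder:
    def __init__(self, x):
        self.x = x


def cycle_demo():
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collection from running mid-demo
    try:
        buf = FakeBuffer()
        probe = weakref.ref(buf)  # lets us observe buf without keeping it alive

        # Make buf reachable only through a reference cycle.
        a, b = Holder(buf), Holder(buf)
        a.other, b.other = b, a

        del buf, a, b
        alive_after_del = probe() is not None

        gc.collect()
        alive_after_collect = probe() is not None
        return alive_after_del, alive_after_collect
    finally:
        if gc_was_enabled:
            gc.enable()


result = cycle_demo()
```

Here `result` is `(True, False)`: the buffer survives `del` because the cycle keeps it reachable, and is only released once the cyclic collector runs.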

As suspected, we are seeing exactly the situation described in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#disabling-cycle-breaking-tp-clear. Setting no_gc_clear seems to be the recommended fix and eliminates the segfault, both in the original issue and the smaller repro.
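The ordering hazard can be modelled in plain Python. This is only a simulation under stated assumptions: `simulated_tp_clear` and `simulated_dealloc` are hypothetical methods that mimic the two phases, the `no_gc_clear` flag mimics the effect of the Cython decorator, and a `RuntimeError` stands in for the segfault. It is not how CPython or Cython implement these slots.

```python
class FakeMemoryResource:
    """Hypothetical stand-in for a memory resource."""

    def deallocate(self, nbytes):
        return f"freed {nbytes} bytes"


class FakeBuffer:
    """Hypothetical stand-in for a buffer that needs its resource to free memory."""

    def __init__(self, mr, nbytes, no_gc_clear=False):
        self.mr = mr
        self.nbytes = nbytes
        self.no_gc_clear = no_gc_clear

    def simulated_tp_clear(self):
        # Models the cycle breaker dropping the object's Python references.
        if not self.no_gc_clear:
            self.mr = None

    def simulated_dealloc(self):
        # Models the destructor, which needs the resource to still be alive.
        if self.mr is None:
            raise RuntimeError("memory resource released before buffer")
        return self.mr.deallocate(self.nbytes)


def collect(buf):
    """Run the two phases in the order the cyclic collector would."""
    buf.simulated_tp_clear()
    return buf.simulated_dealloc()


try:
    outcome_default = collect(FakeBuffer(FakeMemoryResource(), 10))
except RuntimeError as exc:
    outcome_default = str(exc)

outcome_protected = collect(FakeBuffer(FakeMemoryResource(), 10, no_gc_clear=True))
```

With clearing enabled the simulated deallocation fails because the resource reference is already gone; with clearing disabled the reference survives until deallocation and the memory is freed normally.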

The one caveat with no_gc_clear is:

If you use no_gc_clear, it is important that any given reference cycle contains at least one object without no_gc_clear. Otherwise, the cycle cannot be broken, which is a memory leak.

I cannot think of a way to construct such a reference cycle.

shwina · 2021-12-15T15:19:10Z

@viclafargue I think we should avoid the indirection via NativeDeviceBufferHolder if possible. Adding @no_gc_clear to DeviceBuffer tackles the problem more directly, although we should add a comment explaining why it is needed.

viclafargue · 2021-12-15T17:18:44Z

Wrote a fix as discussed with @shwina through DM. It stores a shared_ptr to a device_memory_resource in the DeviceBuffer as its reference to the memory resource. Previously, the reference was stored as a DeviceMemoryResource object, which unfortunately could be released before the device buffer itself was deallocated.

viclafargue · 2021-12-17T15:03:30Z

The solution actually resulted in segfaults when using an UpstreamResourceAdaptor: the upstream device memory resource was released before the device buffer. I will use the no_gc_clear decorator instead.

viclafargue · 2021-12-21T14:23:26Z

rerun tests

shwina

This looks good to me. Going to merge!

shwina · 2022-01-05T18:59:27Z

@gpucibot merge

harrism · 2022-01-10T23:05:32Z

Please remove WIP from PRs before merging.

harrism · 2022-01-10T23:07:29Z

Gave this PR a new title, posthumously. Would hate to end up with just "Fix for DeviceBuffer" in our release notes! @viclafargue Please be thoughtful and descriptive in PR titles.

viclafargue requested a review from a team as a code owner December 6, 2021 16:24

github-actions bot added the Python Related to RMM Python API label Dec 6, 2021

Fix for DeviceBuffer

75c0e2f

viclafargue force-pushed the devicebuffer-fix branch from 37a5171 to 75c0e2f Compare December 15, 2021 17:12

Update doc

59836ec

shwina reviewed Dec 15, 2021

shwina reviewed Dec 15, 2021

viclafargue added 2 commits December 16, 2021 15:35

Review requests

69c2a7b

Adding test

41fcdab

viclafargue requested a review from shwina December 16, 2021 14:55

shwina reviewed Dec 16, 2021

Update doc

0515d60

shwina reviewed Dec 16, 2021

viclafargue added 2 commits December 16, 2021 16:41

Modify imports

9050447

Use no_gc_clear decorator

507fa06

viclafargue requested a review from shwina December 17, 2021 15:00

shwina reviewed Dec 17, 2021

Addition to RMM test doc

2cf2dbb

viclafargue requested a review from shwina January 5, 2022 16:54

shwina added the non-breaking Non-breaking change label Jan 5, 2022

shwina added the bug Something isn't working label Jan 5, 2022

shwina approved these changes Jan 5, 2022

rapids-bot bot merged commit 5a239d2 into rapidsai:branch-22.02 Jan 5, 2022

pentschev mentioned this pull request Jan 10, 2022

Failing test_proxy.py::test_sizeof_cudf rapidsai/dask-cuda#824

Closed

harrism changed the title ~~[WIP] Fix for DeviceBuffer~~ Fix for DeviceBuffer Jan 10, 2022

harrism changed the title ~~Fix for DeviceBuffer~~ Prevent DeviceBuffer DeviceMemoryResource premature release Jan 10, 2022

jakirkham mentioned this pull request Feb 1, 2022

[BUG] Segfault when a device_buffer is released as its memory resource has already been released #943

Closed

viclafargue mentioned this pull request Apr 20, 2022

knn predict wrong and varying predictions, cudaErrorIllegalAddress, or core dump rapidsai/cuml#4629

Open

tfeher mentioned this pull request Jul 27, 2022

[BUG] Memory leak when mdarray is used in a file that is used by cython/python interface rapidsai/raft#740

Closed

nickjcroucher mentioned this pull request Oct 16, 2022

Lineage model fitting - PopPUNK changes bacpop/PopPUNK#232

Merged
