Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent DeviceBuffer DeviceMemoryResource premature release #931

Merged
merged 8 commits into from
Jan 5, 2022

Conversation

viclafargue
Copy link
Contributor

DeviceBuffer deallocation sometimes fails with a segfault. I wasn't able to create a minimal reproducer for now. However, a possible source for the problem is the way Cython releases the memory. Indeed the DeviceBuffer's DeviceMemoryResource appears to be sometimes released before the device_buffer. This causes a segfault in the the device_buffer deconstructor as the device memory resource pointer is set to null.

The code of this PR seems to fix the issue. It makes use of a holder class that seem to allow proper ordering of the class members destructions.

@viclafargue viclafargue requested a review from a team as a code owner December 6, 2021 16:24
@github-actions github-actions bot added the Python Related to RMM Python API label Dec 6, 2021
@shwina
Copy link
Contributor

shwina commented Dec 6, 2021

Thanks for the PR!

I think we need a better understanding of the problem, exactly when it arises, and why the changes in this PR fix it. If we haven't been able to find a minimal reproducer yet, do you perhaps have a larger program that reproduces the segfault?

Do you think maybe simply reversing the order of declaration of c_obj and mr here would resolve the issue you are seeing? cdef classes in Cython translate into C++ structs, and the order of deletion of members is in the reverse order of declaration.

@viclafargue
Copy link
Contributor Author

Thanks for taking a look!

Do you think maybe simply reversing the order of declaration of c_obj and mr here would resolve the issue you are seeing?

I remember trying that, unfortunately, it didn't seemed to work.

What I noticed is that the DeviceMemoryResource's shared_ptr refcount seems to always be set to 1, and could, in my understanding, as well be defined as a unique_ptr. DeviceMemoryResource copies in the code seem to be done by reference at the Python level. Because of this, what actually matters is the Python refcount.

In the C++ code, we can see the following:

struct __pyx_obj_3rmm_4_lib_13device_buffer_DeviceBuffer {
  PyObject_HEAD
  struct __pyx_vtabstruct_3rmm_4_lib_13device_buffer_DeviceBuffer *__pyx_vtab;
  std::unique_ptr<rmm::device_buffer>  c_obj;
  struct __pyx_obj_3rmm_4_lib_15memory_resource_DeviceMemoryResource *mr;
  struct __pyx_obj_3rmm_5_cuda_6stream_Stream *stream;
};

While c_obj holding the native device_buffer is a C++ object, mr the DeviceMemoryResource extension type is a pointer to a struct that seems to be regulated by to the Python layer. And, as discussed in the Cython documentation:

By the time your dealloc() method is called, the object may already have been partially destroyed and may not be in a valid state as far as Python is concerned

In my understanding, mr, the DeviceMemoryResource class member might have been destroyed by the time the c_obj is deconstructed.

If we haven't been able to find a minimal reproducer yet, do you perhaps have a larger program that reproduces the segfault?

Sure, I can help you reproduce it through NVIDIA's internal communication means.

@shwina
Copy link
Contributor

shwina commented Dec 10, 2021

As an update: using no_gc_clear with DeviceBuffer also fixes the issue. Although, a simple test program where a DeviceBuffer lives in a reference cycle does not reproduce the issue. Continuing to investigate.

@shwina
Copy link
Contributor

shwina commented Dec 14, 2021

I believe this should be a more minimal reproducer of the problem:

import rmm
import gc

class Foo():
    def __init__(self, x):
        self.x = x

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())

dbuf1 = rmm.DeviceBuffer(size=10)

# Make dbuf1 part of a reference cycle:
l1 = Foo(dbuf1)
l2 = Foo(dbuf1)
l1.l2 = l2
l2.l1 = l1

# due to the reference cycle, the device buffer doesn't actually get
# cleaned up until later, after we invoke `gc.collect()`:
del dbuf1, l1, l2

rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())

# by now, the only remaining reference to the *original* memory
# resource should be in `dbuf1`. However, the cyclic garbage collector
# will eliminate that reference when it clears the object via its
# `tp_clear` method.  Later, when `tp_dealloc` attemps to actually
# deallocate `dbuf1` (which needs the MR alive), a segfault occurs.

gc.collect()

As suspected, we are seeing exactly the situation described in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#disabling-cycle-breaking-tp-clear. Setting no_gc_clear seems to be the recommended fix and eliminates the segfault, both in the original issue and the smaller repro.

The one caveat with no_gc_clear is:

If you use no_gc_clear, it is important that any given reference cycle contains at least one object without no_gc_clear. Otherwise, the cycle cannot be broken, which is a memory leak.

I cannot think of a way to construct such a reference cycle.

@shwina
Copy link
Contributor

shwina commented Dec 15, 2021

@viclafargue I think we should avoid the indirection via NativeDeviceBufferHolder if possible. Adding a @no_gc_clear to DeviceBuffer tackles the problem more directly, although we should add a descriptive comment describing the need for it.

@viclafargue
Copy link
Contributor Author

Wrote a fix as discussed with @shwina through DM. It involves storing a shared_ptr to a device_memory_resource as a reference to the memory resource in the DeviceBuffer. Previously the reference was stored as a DeviceMemoryResource object that was unfortunately subject to being released before the device buffer in itself could be deallocated.

python/rmm/_lib/device_buffer.pxd Outdated Show resolved Hide resolved
python/rmm/_lib/device_buffer.pxd Outdated Show resolved Hide resolved
@viclafargue
Copy link
Contributor Author

The solution actually resulted in segfaults when using a UpstreamResourceAdaptor. Indeed, the upstream device memory resource was released before the device buffer. Will use the no_gc_clear decorator instead.

@viclafargue
Copy link
Contributor Author

rerun tests

@shwina shwina added the non-breaking Non-breaking change label Jan 5, 2022
@shwina shwina added the bug Something isn't working label Jan 5, 2022
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Going to merge!

@shwina
Copy link
Contributor

shwina commented Jan 5, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5a239d2 into rapidsai:branch-22.02 Jan 5, 2022
@harrism
Copy link
Member

harrism commented Jan 10, 2022

Please remove WIP from PRS before merging.

@harrism harrism changed the title [WIP] Fix for DeviceBuffer Fix for DeviceBuffer Jan 10, 2022
@harrism harrism changed the title Fix for DeviceBuffer Prevent DeviceBuffer DeviceMemoryResource premature release Jan 10, 2022
@harrism
Copy link
Member

harrism commented Jan 10, 2022

Gave this PR a new title, posthumously. Would hate to end up with just "Fix for DeviceBuffer" in our release notes! @viclafargue Please be thoughtful and descriptive in PR titles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Related to RMM Python API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants