diff --git a/docs/cudf/source/developer_guide/library_design.md b/docs/cudf/source/developer_guide/library_design.md index 54a28db1b58..f9a51f005cb 100644 --- a/docs/cudf/source/developer_guide/library_design.md +++ b/docs/cudf/source/developer_guide/library_design.md @@ -316,3 +316,180 @@ The pandas API also includes a number of helper objects, such as `GroupBy`, `Rol cuDF implements corresponding objects with the same APIs. Internally, these objects typically interact with cuDF objects at the Frame layer via composition. However, for performance reasons they frequently access internal attributes and methods of `Frame` and its subclasses. + + +## Copy-on-write + + +Copy-on-write (COW) is designed to reduce memory footprint on GPUs. With this feature, a copy (`.copy(deep=False)`) is only really made whenever +there is a write operation on a column. It is first recommended to see +the public usage [here](copy-on-write-user-doc) of this functionality before reading through the internals +below. + +The core copy-on-write implementation relies on the `CopyOnWriteBuffer` class. This class stores the pointer to the device memory and size. +With the help of `CopyOnWriteBuffer.ptr` we generate [weak references](https://docs.python.org/3/library/weakref.html) of `CopyOnWriteBuffer` and store it in `CopyOnWriteBuffer._instances`. +This is a mapping from `ptr` keys to `WeakSet`s containing references to `CopyOnWriterBuffer` objects. This +means all the new `CopyOnWriteBuffer`s that are created map to the same key in `CopyOnWriteBuffer._instances` if they have same `.ptr` +i.e., if they are all pointing to the same device memory. + +When the cudf option `"copy_on_write"` is `True`, `as_buffer` will always return a `CopyOnWriteBuffer`. This class contains all the +mechanisms to enable copy-on-write for all buffers. When a `CopyOnWriteBuffer` is created, its weakref is generated and added to the `WeakSet` which is in turn stored in `CopyOnWriterBuffer._instances`. This will later serve as an indication of whether or not to make a copy when a +when write operation is performed on a `Column` (see below). + + +### Eager copies when exposing to third-party libraries + +If `Column`/`CopyOnWriteBuffer` is exposed to a third-party library via `__cuda_array_interface__`, we are no longer able to track whether or not modification of the buffer has occurred without introspection. Hence whenever +someone accesses data through the `__cuda_array_interface__`, we eagerly trigger the copy by calling +`_unlink_shared_buffers` which ensures a true copy of underlying device data is made and +unlinks the buffer from any shared "weak" references. Any future shallow-copy requests must also trigger a true physical copy (since we cannot track the lifetime of the third-party object), to handle this we also mark the `Column`/`CopyOnWriteBuffer` as +`obj._zero_copied=True` thus indicating any future shallow-copy requests will trigger a true physical copy +rather than a copy-on-write shallow copy with weak references. + +### How to obtain read-only object? + +A read-only object can be quite useful for operations that will not +mutate the data. This can be achieved by calling `._get_readonly_proxy_obj` +API, this API will return a proxy object that has `__cuda_array_interface__` +implemented and will not trigger a deep copy even if the `CopyOnWriteBuffer` +has weak references. It is only recommended to use this API as long as +the objects/arrays created with this proxy object gets cleaned up during +the developer code execution. 
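For example, a developer routine can wrap the proxy with `numba.cuda.as_cuda_array` to build a temporary device view, and even copy it to host, without unlinking any shared buffers. The following is a minimal illustrative sketch; the `Series._column` access is internal and the variable names are hypothetical:

```python
>>> import cudf
>>> from numba import cuda
>>> cudf.set_option("copy_on_write", True)
>>> col = cudf.Series([1, 2, 3])._column
>>> # The read-only proxy does not trigger a copy even if the buffer is shared
>>> view = cuda.as_cuda_array(col.data._get_readonly_proxy_obj).view(col.dtype)
>>> host_values = view.copy_to_host()  # device-to-host copy; device data untouched
```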
We currently use this API for device to host +copies like in `ColumnBase._data_array_view` which is used for `Column.values_host`. + +Notes: +1. Weak references are implemented only for fixed-width data types as these are only column +types that can be mutated in place. +2. Deep copies of variable width data types return shallow-copies of the Columns, because these +types don't support real in-place mutations to the data. We just mimic in such a way that it looks +like an in-place operation using `_mimic_inplace`. + + +### Examples + +When copy-on-write is enabled, taking a shallow copy of a `Series` or a `DataFrame` does not +eagerly create a copy of the data. Instead, it produces a view that will be lazily +copied when a write operation is performed on any of its copies. + +Let's create a series: + +```python +>>> import cudf +>>> cudf.set_option("copy_on_write", True) +>>> s1 = cudf.Series([1, 2, 3, 4]) +``` + +Make a copy of `s1`: +```python +>>> s2 = s1.copy(deep=False) +``` + +Make another copy, but of `s2`: +```python +>>> s3 = s2.copy(deep=False) +``` + +Viewing the data and memory addresses show that they all point to the same device memory: +```python +>>> s1 +0 1 +1 2 +2 3 +3 4 +dtype: int64 +>>> s2 +0 1 +1 2 +2 3 +3 4 +dtype: int64 +>>> s3 +0 1 +1 2 +2 3 +3 4 +dtype: int64 + +>>> s1.data._ptr +139796315897856 +>>> s2.data._ptr +139796315897856 +>>> s3.data._ptr +139796315897856 +``` + +Now, when we perform a write operation on one of them, say on `s2`, a new copy is created +for `s2` on device and then modified: + +```python +>>> s2[0:2] = 10 +>>> s2 +0 10 +1 10 +2 3 +3 4 +dtype: int64 +>>> s1 +0 1 +1 2 +2 3 +3 4 +dtype: int64 +>>> s3 +0 1 +1 2 +2 3 +3 4 +dtype: int64 +``` + +If we inspect the memory address of the data, `s1` and `s3` still share the same address but `s2` has a new one: + +```python +>>> s1.data._ptr +139796315897856 +>>> s3.data._ptr +139796315897856 +>>> s2.data._ptr +139796315899392 +``` + +Now, performing write operation on `s1` will trigger a new copy on device memory as there +is a weak reference being shared in `s3`: + +```python +>>> s1[0:2] = 11 +>>> s1 +0 11 +1 11 +2 3 +3 4 +dtype: int64 +>>> s2 +0 10 +1 10 +2 3 +3 4 +dtype: int64 +>>> s3 +0 1 +1 2 +2 3 +3 4 +dtype: int64 +``` + +If we inspect the memory address of the data, the addresses of `s2` and `s3` remain unchanged, but `s1`'s memory address has changed because of a copy operation performed during the writing: + +```python +>>> s2.data._ptr +139796315899392 +>>> s3.data._ptr +139796315897856 +>>> s1.data._ptr +139796315879723 +``` + +cudf Copy-on-write implementation is motivated by pandas Copy-on-write proposal here: +1. [Google doc](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.iexejdstiz8u) +2. [Github issue](https://github.com/pandas-dev/pandas/issues/36195) diff --git a/docs/cudf/source/user_guide/copy-on-write.md b/docs/cudf/source/user_guide/copy-on-write.md new file mode 100644 index 00000000000..14ca3656250 --- /dev/null +++ b/docs/cudf/source/user_guide/copy-on-write.md @@ -0,0 +1,169 @@ +(copy-on-write-user-doc)= + +# Copy-on-write + +Copy-on-write reduces GPU memory usage when copies(`.copy(deep=False)`) of a column +are made. 
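For instance, a shallow copy initially shares the original column's device memory and only gets its own allocation when one of the two objects is written to. A quick sketch follows; the `.data._ptr` attribute is internal and used here only to show that memory is shared:

```python
>>> import cudf
>>> cudf.set_option("copy_on_write", True)
>>> s = cudf.Series([1, 2, 3])
>>> c = s.copy(deep=False)
>>> s.data._ptr == c.data._ptr  # both point to the same device memory
True
>>> c[0] = 10                   # writing to the copy triggers a real copy
>>> s.data._ptr == c.data._ptr
False
```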
+ +| | Copy-on-Write enabled | Copy-on-Write disabled (default) | +|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------| +| `.copy(deep=True)` | A true copy is made and changes don't propagate to the original object. | A true copy is made and changes don't propagate to the original object. | +| `.copy(deep=False)` | Memory is shared between the two objects and but any write operation on one object will trigger a true physical copy before the write is performed. Hence changes will not propagate to the original object. | Memory is shared between the two objects and changes performed on one will propagate to the other object. | + +## How to enable it + +i. Use `cudf.set_option`: + +```python +>>> import cudf +>>> cudf.set_option("copy_on_write", True) +``` + +ii. Set the environment variable ``CUDF_COPY_ON_WRITE`` to ``1`` prior to the +launch of the Python interpreter: + +```bash +export CUDF_COPY_ON_WRITE="1" python -c "import cudf" +``` + + +## Making copies + +There are no additional changes required in the code to make use of copy-on-write. + +```python +>>> series = cudf.Series([1, 2, 3, 4]) +``` + +Performing a shallow copy will create a new Series object pointing to the +same underlying device memory: + +```python +>>> copied_series = series.copy(deep=False) +>>> series +0 1 +1 2 +2 3 +3 4 +dtype: int64 +>>> copied_series +0 1 +1 2 +2 3 +3 4 +dtype: int64 +``` + +When a write operation is performed on either ``series`` or +``copied_series``, a true physical copy of the data is created: + +```python +>>> series[0:2] = 10 +>>> series +0 10 +1 10 +2 3 +3 4 +dtype: int64 +>>> copied_series +0 1 +1 2 +2 3 +3 4 +dtype: int64 +``` + + +## Notes + +When copy-on-write is enabled, there is no concept of views. i.e., modifying any view created inside cudf will not actually not modify +the original object it was viewing and thus a separate copy is created and then modified. + +## Advantages + +1. With copy-on-write enabled and by requesting `.copy(deep=False)`, the GPU memory usage can be reduced drastically if you are not performing +write operations on all of those copies. This will also increase the speed at which objects are created for execution of your ETL workflow. +2. With the concept of views going away, every object is a copy of it's original object. This will bring consistency across operations and cudf closer to parity with +pandas. 
Following is one of the inconsistency: + +```python + +>>> import pandas as pd +>>> s = pd.Series([1, 2, 3, 4, 5]) +>>> s1 = s[0:2] +>>> s1[0] = 10 +>>> s1 +0 10 +1 2 +dtype: int64 +>>> s +0 10 +1 2 +2 3 +3 4 +4 5 +dtype: int64 + +>>> import cudf +>>> s = cudf.Series([1, 2, 3, 4, 5]) +>>> s1 = s[0:2] +>>> s1[0] = 10 +>>> s1 +0 10 +1 2 +>>> s +0 1 +1 2 +2 3 +3 4 +4 5 +dtype: int64 +``` + +The above inconsistency is solved when Copy-on-write is enabled: + +```python +>>> import pandas as pd +>>> pd.set_option("mode.copy_on_write", True) +>>> s = pd.Series([1, 2, 3, 4, 5]) +>>> s1 = s[0:2] +>>> s1[0] = 10 +>>> s1 +0 10 +1 2 +dtype: int64 +>>> s +0 1 +1 2 +2 3 +3 4 +4 5 +dtype: int64 + + +>>> import cudf +>>> cudf.set_option("copy_on_write", True) +>>> s = cudf.Series([1, 2, 3, 4, 5]) +>>> s1 = s[0:2] +>>> s1[0] = 10 +>>> s1 +0 10 +1 2 +dtype: int64 +>>> s +0 1 +1 2 +2 3 +3 4 +4 5 +dtype: int64 +``` + +## How to disable it + + +Copy-on-write can be disable by setting ``copy_on_write`` cudf option to ``False``: + +```python +>>> cudf.set_option("copy_on_write", False) +``` diff --git a/docs/cudf/source/user_guide/index.md b/docs/cudf/source/user_guide/index.md index 86168f0d81b..f39f3b49671 100644 --- a/docs/cudf/source/user_guide/index.md +++ b/docs/cudf/source/user_guide/index.md @@ -13,4 +13,5 @@ guide-to-udfs cupy-interop options PandasCompat +copy-on-write ``` diff --git a/python/cudf/cudf/_lib/column.pyx b/python/cudf/cudf/_lib/column.pyx index ec7d2570708..8f09a0f41f4 100644 --- a/python/cudf/cudf/_lib/column.pyx +++ b/python/cudf/cudf/_lib/column.pyx @@ -1,4 +1,6 @@ -# Copyright (c) 2020-2022, NVIDIA CORPORATION. +# Copyright (c) 2020-2023, NVIDIA CORPORATION. + +import inspect import cupy as cp import numpy as np @@ -10,6 +12,7 @@ import cudf._lib as libcudf from cudf.api.types import is_categorical_dtype from cudf.core.buffer import ( Buffer, + CopyOnWriteBuffer, SpillableBuffer, SpillLock, acquire_spill_lock, @@ -203,9 +206,21 @@ cdef class Column: "The value for mask is smaller than expected, got {} bytes, " "expected " + str(required_num_bytes) + " bytes." ) + + # Because hasattr will trigger invocation of + # `__cuda_array_interface__` which could + # be expensive in CopyOnWriteBuffer case. + value_cai = inspect.getattr_static( + value, + "__cuda_array_interface__", + None + ) + if value is None: mask = None - elif hasattr(value, "__cuda_array_interface__"): + elif type(value_cai) is property: + if isinstance(value, CopyOnWriteBuffer): + value = value._get_readonly_proxy_obj if value.__cuda_array_interface__["typestr"] not in ("|i1", "|u1"): if isinstance(value, Column): value = value.data_array_view @@ -303,6 +318,7 @@ cdef class Column: instead replaces the Buffers and other attributes underneath the column object with the Buffers and attributes from the other column. 
""" + if inplace: self._offset = other_col.offset self._size = other_col.size @@ -318,6 +334,7 @@ cdef class Column: return self._view(libcudf_types.UNKNOWN_NULL_COUNT).null_count() cdef mutable_column_view mutable_view(self) except *: + if is_categorical_dtype(self.dtype): col = self.base_children[0] else: @@ -331,12 +348,8 @@ cdef class Column: if col.base_data is None: data = NULL - elif isinstance(col.base_data, SpillableBuffer): - data = (col.base_data).get_ptr( - spill_lock=get_spill_lock() - ) else: - data = (col.base_data.ptr) + data = (col.base_data.mutable_ptr) cdef Column child_column if col.base_children: @@ -396,7 +409,11 @@ cdef class Column: spill_lock=get_spill_lock() ) else: - data = (col.base_data.ptr) + # Shouldn't access `.ptr`, because in case + # of `CopyOnWriteBuffer` that could trigger + # a copy, which isn't required to create a + # view that is read only. + data = (col.base_data._ptr) cdef Column child_column if col.base_children: @@ -526,6 +543,12 @@ cdef class Column: rmm.DeviceBuffer(ptr=data_ptr, size=(size+offset) * dtype_itemsize) ) + elif column_owner and isinstance(data_owner, CopyOnWriteBuffer): + # TODO: In future, see if we can just pass on the + # CopyOnWriteBuffer reference to another column + # and still create a weak reference. + # With the current design that's not possible. + data = data_owner.copy(deep=False) elif ( # This is an optimization of the most common case where # from_column_view creates a "view" that is identical to @@ -552,7 +575,9 @@ cdef class Column: owner=data_owner, exposed=True, ) - if isinstance(data_owner, SpillableBuffer): + if isinstance(data_owner, CopyOnWriteBuffer): + data_owner.ptr # accessing the pointer marks it exposed. + elif isinstance(data_owner, SpillableBuffer): if data_owner.is_spilled: raise ValueError( f"{data_owner} is spilled, which invalidates " diff --git a/python/cudf/cudf/core/buffer/__init__.py b/python/cudf/cudf/core/buffer/__init__.py index 49f2c57b17f..0d433509497 100644 --- a/python/cudf/cudf/core/buffer/__init__.py +++ b/python/cudf/cudf/core/buffer/__init__.py @@ -1,6 +1,7 @@ -# Copyright (c) 2022, NVIDIA CORPORATION. +# Copyright (c) 2022-2023, NVIDIA CORPORATION. from cudf.core.buffer.buffer import Buffer, cuda_array_interface_wrapper +from cudf.core.buffer.cow_buffer import CopyOnWriteBuffer from cudf.core.buffer.spillable_buffer import SpillableBuffer, SpillLock from cudf.core.buffer.utils import ( acquire_spill_lock, diff --git a/python/cudf/cudf/core/buffer/buffer.py b/python/cudf/cudf/core/buffer/buffer.py index ebc4d76b6a0..5479dc1fd50 100644 --- a/python/cudf/cudf/core/buffer/buffer.py +++ b/python/cudf/cudf/core/buffer/buffer.py @@ -192,6 +192,28 @@ def __getitem__(self, key: slice) -> Buffer: raise ValueError("slice must be C-contiguous") return self._getitem(offset=start, size=stop - start) + def copy(self, deep: bool = True): + """ + Return a copy of Buffer. + + Parameters + ---------- + deep : bool, default True + If True, returns a deep copy of the underlying Buffer data. + If False, returns a shallow copy of the Buffer pointing to + the same underlying data. 
+ + Returns + ------- + Buffer + """ + if not deep: + return self[:] + else: + return self._from_device_memory( + rmm.DeviceBuffer(ptr=self.ptr, size=self.size) + ) + @property def size(self) -> int: """Size of the buffer in bytes.""" @@ -207,22 +229,44 @@ def ptr(self) -> int: """Device pointer to the start of the buffer.""" return self._ptr + @property + def mutable_ptr(self) -> int: + """Device pointer to the start of the buffer.""" + return self._ptr + @property def owner(self) -> Any: """Object owning the memory of the buffer.""" return self._owner @property - def __cuda_array_interface__(self) -> Mapping: - """Implementation of the CUDA Array Interface.""" + def __cuda_array_interface__(self) -> dict: + """Implementation for the CUDA Array Interface.""" + return self._get_cuda_array_interface(readonly=False) + + def _get_cuda_array_interface(self, readonly=False): return { - "data": (self.ptr, False), + "data": (self.ptr, readonly), "shape": (self.size,), "strides": None, "typestr": "|u1", "version": 0, } + @property + def _get_readonly_proxy_obj(self) -> dict: + """ + Returns a proxy object with a read-only CUDA Array Interface. + """ + return cuda_array_interface_wrapper( + ptr=self.ptr, + size=self.size, + owner=self, + readonly=True, + typestr="|u1", + version=0, + ) + def memoryview(self) -> memoryview: """Read-only access to the buffer through host memory.""" host_buf = host_memory_allocation(self.size) diff --git a/python/cudf/cudf/core/buffer/cow_buffer.py b/python/cudf/cudf/core/buffer/cow_buffer.py new file mode 100644 index 00000000000..e322912ed4f --- /dev/null +++ b/python/cudf/cudf/core/buffer/cow_buffer.py @@ -0,0 +1,211 @@ +# Copyright (c) 2022-2023, NVIDIA CORPORATION. + +from __future__ import annotations + +import weakref +from collections import defaultdict +from typing import Any, DefaultDict, Tuple, Type, TypeVar +from weakref import WeakSet + +import rmm + +from cudf.core.buffer.buffer import Buffer, cuda_array_interface_wrapper + +T = TypeVar("T", bound="CopyOnWriteBuffer") + + +def _keys_cleanup(ptr): + weak_set_values = CopyOnWriteBuffer._instances[ptr] + if ( + len(weak_set_values) == 1 + and next(iter(weak_set_values.data))() is None + ): + # When the last remaining reference is being cleaned up we will still + # have a dead reference in `weak_set_values`. If that is the case, then + # we can safely clean up the key + del CopyOnWriteBuffer._instances[ptr] + + +class CopyOnWriteBuffer(Buffer): + """A Copy-on-write buffer that implements Buffer. + + This buffer enables making copies of data only when there + is a write operation being performed. + + See `Copy-on-write` section in `library_design.md` for + detailed information on `CopyOnWriteBuffer`. + + Use the factory function `as_buffer` to create a CopyOnWriteBuffer + instance. + """ + + # This dict keeps track of all instances that have the same `ptr` + # and `size` attributes. Each key of the dict is a `(ptr, size)` + # tuple and the corresponding value is a set of weak references to + # instances with that `ptr` and `size`. + _instances: DefaultDict[Tuple, WeakSet] = defaultdict(WeakSet) + + # TODO: This is synonymous to SpillableBuffer._exposed attribute + # and has to be merged. 
+ _zero_copied: bool + + def _finalize_init(self): + self.__class__._instances[self._ptr].add(self) + self._instances = self.__class__._instances[self._ptr] + self._zero_copied = False + weakref.finalize(self, _keys_cleanup, self._ptr) + + @classmethod + def _from_device_memory( + cls: Type[T], data: Any, *, exposed: bool = False + ) -> T: + """Create a Buffer from an object exposing `__cuda_array_interface__`. + + No data is being copied. + + Parameters + ---------- + data : device-buffer-like + An object implementing the CUDA Array Interface. + exposed : bool, optional + Mark the buffer as zero copied. + + Returns + ------- + Buffer + Buffer representing the same device memory as `data` + """ + + # Bypass `__init__` and initialize attributes manually + ret = super()._from_device_memory(data) + ret._finalize_init() + ret._zero_copied = exposed + return ret + + @classmethod + def _from_host_memory(cls: Type[T], data: Any) -> T: + ret = super()._from_host_memory(data) + ret._finalize_init() + return ret + + @property + def _is_shared(self): + """ + Return `True` if `self`'s memory is shared with other columns. + """ + return len(self._instances) > 1 + + @property + def ptr(self) -> int: + """Device pointer to the start of the buffer. + + This will trigger a deep copy if there are any weak references. + The Buffer would be marked as zero copied. + """ + self._unlink_shared_buffers() + self._zero_copied = True + return self._ptr + + @property + def mutable_ptr(self) -> int: + """Device pointer to the start of the buffer. + + This will trigger a deep copy if there are any weak references. + """ + # Shouldn't need to mark the Buffer as zero copied, + # because this API is used by libcudf only to create + # mutable views. + self._unlink_shared_buffers() + return self._ptr + + def _getitem(self, offset: int, size: int) -> Buffer: + """ + Helper for `__getitem__` + """ + return self._from_device_memory( + cuda_array_interface_wrapper( + ptr=self._ptr + offset, size=size, owner=self.owner + ) + ) + + def copy(self, deep: bool = True): + """ + Return a copy of Buffer. + + Parameters + ---------- + deep : bool, default True + If True, returns a deep-copy of the underlying Buffer data. + If False, returns a shallow-copy of the Buffer pointing to + the same underlying data. + + Returns + ------- + Buffer + """ + if deep or self._zero_copied: + return self._from_device_memory( + rmm.DeviceBuffer(ptr=self._ptr, size=self.size) + ) + else: + copied_buf = CopyOnWriteBuffer.__new__(CopyOnWriteBuffer) + copied_buf._ptr = self._ptr + copied_buf._size = self._size + copied_buf._owner = self._owner + copied_buf._finalize_init() + return copied_buf + + @property + def __cuda_array_interface__(self) -> dict: + # Unlink if there are any weak references. + self._unlink_shared_buffers() + # Mark the Buffer as ``zero_copied=True``, + # which will prevent any copy-on-write + # mechanism post this operation. + # This is done because we don't have any + # control over knowing if a third-party library + # has modified the data this Buffer is + # pointing to. + self._zero_copied = True + return self._get_cuda_array_interface(readonly=False) + + def _get_cuda_array_interface(self, readonly=False): + return { + "data": (self._ptr, readonly), + "shape": (self.size,), + "strides": None, + "typestr": "|u1", + "version": 0, + } + + @property + def _get_readonly_proxy_obj(self) -> dict: + """ + Returns a proxy object with a read-only CUDA Array Interface. 
+ + See `Copy-on-write` section in `library_design.md` for + more information on this API. + """ + return cuda_array_interface_wrapper( + ptr=self._ptr, + size=self.size, + owner=self, + readonly=True, + typestr="|u1", + version=0, + ) + + def _unlink_shared_buffers(self): + """ + Unlinks a Buffer if it is shared with other buffers by + making a true deep-copy. + """ + if not self._zero_copied and self._is_shared: + # make a deep copy of existing DeviceBuffer + # and replace pointer to it. + current_buf = rmm.DeviceBuffer(ptr=self._ptr, size=self._size) + new_buf = current_buf.copy() + self._ptr = new_buf.ptr + self._size = new_buf.size + self._owner = new_buf + self._finalize_init() diff --git a/python/cudf/cudf/core/buffer/spillable_buffer.py b/python/cudf/cudf/core/buffer/spillable_buffer.py index 6b99b875572..d22fb6fdc20 100644 --- a/python/cudf/cudf/core/buffer/spillable_buffer.py +++ b/python/cudf/cudf/core/buffer/spillable_buffer.py @@ -62,7 +62,7 @@ def __getitem__(self, i): class SpillableBuffer(Buffer): - """A spillable buffer that implements DeviceBufferLike. + """A spillable buffer that implements Buffer. This buffer supports spilling the represented data to host memory. Spilling can be done manually by calling `.spill(target="cpu")` but @@ -258,6 +258,10 @@ def ptr(self) -> int: self._last_accessed = time.monotonic() return self._ptr + @property + def mutable_ptr(self) -> int: + return self.get_ptr(spill_lock=SpillLock()) + def spill_lock(self, spill_lock: SpillLock) -> None: """Spill lock the buffer diff --git a/python/cudf/cudf/core/buffer/utils.py b/python/cudf/cudf/core/buffer/utils.py index 062e86d0cb1..fc67138de42 100644 --- a/python/cudf/cudf/core/buffer/utils.py +++ b/python/cudf/cudf/core/buffer/utils.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION. +# Copyright (c) 2022-2023, NVIDIA CORPORATION. 
from __future__ import annotations @@ -7,8 +7,10 @@ from typing import Any, Dict, Optional, Tuple, Union from cudf.core.buffer.buffer import Buffer, cuda_array_interface_wrapper +from cudf.core.buffer.cow_buffer import CopyOnWriteBuffer from cudf.core.buffer.spill_manager import get_global_manager from cudf.core.buffer.spillable_buffer import SpillableBuffer, SpillLock +from cudf.options import get_option def as_buffer( @@ -71,6 +73,14 @@ def as_buffer( "`data` is a buffer-like or array-like object" ) + if get_option("copy_on_write"): + if isinstance(data, Buffer) or hasattr( + data, "__cuda_array_interface__" + ): + return CopyOnWriteBuffer._from_device_memory(data, exposed=exposed) + if exposed: + raise ValueError("cannot created exposed host memory") + return CopyOnWriteBuffer._from_host_memory(data) if get_global_manager() is not None: if hasattr(data, "__cuda_array_interface__"): return SpillableBuffer._from_device_memory(data, exposed=exposed) diff --git a/python/cudf/cudf/core/column/categorical.py b/python/cudf/cudf/core/column/categorical.py index ef9f515fff7..4b53c3ccd92 100644 --- a/python/cudf/cudf/core/column/categorical.py +++ b/python/cudf/cudf/core/column/categorical.py @@ -733,7 +733,7 @@ def categories(self) -> ColumnBase: @categories.setter def categories(self, value): - self.dtype = CategoricalDtype( + self._dtype = CategoricalDtype( categories=value, ordered=self.dtype.ordered ) @@ -1275,31 +1275,12 @@ def _get_decategorized_column(self) -> ColumnBase: return out def copy(self, deep: bool = True) -> CategoricalColumn: + result_col = super().copy(deep=deep) if deep: - copied_col = libcudf.copying.copy_column(self) - copied_cat = libcudf.copying.copy_column(self.dtype._categories) - - return column.build_categorical_column( - categories=copied_cat, - codes=column.build_column( - copied_col.base_data, dtype=copied_col.dtype - ), - offset=copied_col.offset, - size=copied_col.size, - mask=copied_col.base_mask, - ordered=self.dtype.ordered, - ) - else: - return column.build_categorical_column( - categories=self.dtype.categories._values, - codes=column.build_column( - self.codes.base_data, dtype=self.codes.dtype - ), - mask=self.base_mask, - ordered=self.dtype.ordered, - offset=self.offset, - size=self.size, + result_col.categories = libcudf.copying.copy_column( + self.dtype._categories ) + return result_col @cached_property def memory_usage(self) -> int: diff --git a/python/cudf/cudf/core/column/column.py b/python/cudf/cudf/core/column/column.py index e3c0abde66c..1f0b675f536 100644 --- a/python/cudf/cudf/core/column/column.py +++ b/python/cudf/cudf/core/column/column.py @@ -120,6 +120,30 @@ def data_array_view(self) -> "cuda.devicearray.DeviceNDArray": """ return cuda.as_cuda_array(self.data).view(self.dtype) + @property + def _data_array_view(self) -> "cuda.devicearray.DeviceNDArray": + """ + Internal implementation for viewing the data as a device array object + without triggering a deep-copy. + """ + return cuda.as_cuda_array( + self.data._get_readonly_proxy_obj + if self.data is not None + else None + ).view(self.dtype) + + @property + def _mask_array_view(self) -> "cuda.devicearray.DeviceNDArray": + """ + Internal implementation for viewing the mask as a device array object + without triggering a deep-copy. 
+ """ + return cuda.as_cuda_array( + self.mask._get_readonly_proxy_obj + if self.mask is not None + else None + ).view(mask_dtype) + @property def mask_array_view(self) -> "cuda.devicearray.DeviceNDArray": """ @@ -163,7 +187,7 @@ def values_host(self) -> "np.ndarray": if self.has_nulls(): raise ValueError("Column must have no nulls.") - return self.data_array_view.copy_to_host() + return self._data_array_view.copy_to_host() @property def values(self) -> "cupy.ndarray": @@ -365,24 +389,59 @@ def nullmask(self) -> Buffer: raise ValueError("Column has no null mask") return self.mask_array_view + @property + def _nullmask(self) -> Buffer: + """The gpu buffer for the null-mask""" + if not self.nullable: + raise ValueError("Column has no null mask") + return self._mask_array_view + + def force_deep_copy(self: T) -> T: + """ + A method to create deep copy irrespective of whether + `copy-on-write` is enable. + """ + result = libcudf.copying.copy_column(self) + return cast(T, result._with_type_metadata(self.dtype)) + def copy(self: T, deep: bool = True) -> T: - """Columns are immutable, so a deep copy produces a copy of the - underlying data and mask and a shallow copy creates a new column and - copies the references of the data and mask. + """ + Makes a copy of the Column. + + Parameters + ---------- + deep : bool, default True + If True, a true physical copy of the column + is made. + If False and `copy_on_write` is False, the same + memory is shared between the buffers of the Column + and changes made to one Column will propagate to + it's copy and vice-versa. + If False and `copy_on_write` is True, the same + memory is shared between the buffers of the Column + until there is a write operation being performed on + them. """ if deep: - result = libcudf.copying.copy_column(self) - return cast(T, result._with_type_metadata(self.dtype)) + return self.force_deep_copy() else: return cast( T, build_column( - self.base_data, - self.dtype, - mask=self.base_mask, + data=self.base_data + if self.base_data is None + else self.base_data.copy(deep=False), + dtype=self.dtype, + mask=self.base_mask + if self.base_mask is None + else self.base_mask.copy(deep=False), size=self.size, offset=self.offset, - children=self.base_children, + children=tuple( + col.copy(deep=False) for col in self.base_children + ) + if cudf.get_option("copy_on_write") + else self.base_children, ), ) @@ -1301,11 +1360,12 @@ def column_empty_like( ): column = cast("cudf.core.column.CategoricalColumn", column) codes = column_empty_like(column.codes, masked=masked, newsize=newsize) + return build_column( data=None, dtype=dtype, mask=codes.base_mask, - children=(as_column(codes.base_data, dtype=codes.dtype),), + children=(codes,), size=codes.size, ) diff --git a/python/cudf/cudf/core/column/datetime.py b/python/cudf/cudf/core/column/datetime.py index 56436ac141d..449691750d8 100644 --- a/python/cudf/cudf/core/column/datetime.py +++ b/python/cudf/cudf/core/column/datetime.py @@ -1,4 +1,4 @@ -# Copyright (c) 2019-2022, NVIDIA CORPORATION. +# Copyright (c) 2019-2023, NVIDIA CORPORATION. 
from __future__ import annotations @@ -279,6 +279,7 @@ def as_numerical(self) -> "cudf.core.column.NumericalColumn": @property def __cuda_array_interface__(self) -> Mapping[str, Any]: + output = { "shape": (len(self),), "strides": (self.dtype.itemsize,), diff --git a/python/cudf/cudf/core/column/lists.py b/python/cudf/cudf/core/column/lists.py index 3bd8f99c863..92bb0d2e4c5 100644 --- a/python/cudf/cudf/core/column/lists.py +++ b/python/cudf/cudf/core/column/lists.py @@ -198,6 +198,11 @@ def _with_type_metadata( return self + def copy(self, deep: bool = True): + # Since list columns are immutable, both deep and shallow copies share + # the underlying device data and mask. + return super().copy(deep=False) + def leaves(self): if isinstance(self.elements, ListColumn): return self.elements.leaves() diff --git a/python/cudf/cudf/core/column/numerical.py b/python/cudf/cudf/core/column/numerical.py index 7943135afe1..37d1d7b6e61 100644 --- a/python/cudf/cudf/core/column/numerical.py +++ b/python/cudf/cudf/core/column/numerical.py @@ -1,4 +1,4 @@ -# Copyright (c) 2018-2022, NVIDIA CORPORATION. +# Copyright (c) 2018-2023, NVIDIA CORPORATION. from __future__ import annotations @@ -13,7 +13,6 @@ cast, ) -import cupy import numpy as np import pandas as pd @@ -110,8 +109,8 @@ def __contains__(self, item: ScalarLike) -> bool: # Handles improper item types # Fails if item is of type None, so the handler. try: - if np.can_cast(item, self.data_array_view.dtype): - item = self.data_array_view.dtype.type(item) + if np.can_cast(item, self.dtype): + item = self.dtype.type(item) else: return False except (TypeError, ValueError): @@ -165,6 +164,7 @@ def __setitem__(self, key: Any, value: Any): @property def __cuda_array_interface__(self) -> Mapping[str, Any]: + output = { "shape": (len(self),), "strides": (self.dtype.itemsize,), @@ -573,14 +573,14 @@ def _find_value( found = 0 if len(self): found = find( - self.data_array_view, + self._data_array_view, value, mask=self.mask, ) if found == -1: if self.is_monotonic_increasing and closest: found = find( - self.data_array_view, + self._data_array_view, value, mask=self.mask, compare=compare, @@ -724,7 +724,7 @@ def to_pandas( pandas_array = pandas_nullable_dtype.__from_arrow__(arrow_array) pd_series = pd.Series(pandas_array, copy=False) elif str(self.dtype) in NUMERIC_TYPES and not self.has_nulls(): - pd_series = pd.Series(cupy.asnumpy(self.values), copy=False) + pd_series = pd.Series(self.values_host, copy=False) else: pd_series = self.to_arrow().to_pandas(**kwargs) diff --git a/python/cudf/cudf/core/column/string.py b/python/cudf/cudf/core/column/string.py index c5d5b396ebc..da717edf0e2 100644 --- a/python/cudf/cudf/core/column/string.py +++ b/python/cudf/cudf/core/column/string.py @@ -5251,6 +5251,11 @@ def __init__( self._start_offset = None self._end_offset = None + def copy(self, deep: bool = True): + # Since string columns are immutable, both deep + # and shallow copies share the underlying device data and mask. + return super().copy(deep=False) + @property def start_offset(self) -> int: if self._start_offset is None: diff --git a/python/cudf/cudf/core/column/struct.py b/python/cudf/cudf/core/column/struct.py index 69d70cf427f..6838d711641 100644 --- a/python/cudf/cudf/core/column/struct.py +++ b/python/cudf/cudf/core/column/struct.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020-2022, NVIDIA CORPORATION. +# Copyright (c) 2020-2023, NVIDIA CORPORATION. 
from __future__ import annotations from functools import cached_property @@ -95,7 +95,9 @@ def __setitem__(self, key, value): super().__setitem__(key, value) def copy(self, deep=True): - result = super().copy(deep=deep) + # Since struct columns are immutable, both deep and + # shallow copies share the underlying device data and mask. + result = super().copy(deep=False) if deep: result = result._rename_fields(self.dtype.fields.keys()) return result diff --git a/python/cudf/cudf/core/column/timedelta.py b/python/cudf/cudf/core/column/timedelta.py index 901547d94a9..74cb1efb3a5 100644 --- a/python/cudf/cudf/core/column/timedelta.py +++ b/python/cudf/cudf/core/column/timedelta.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020-2022, NVIDIA CORPORATION. +# Copyright (c) 2020-2023, NVIDIA CORPORATION. from __future__ import annotations @@ -129,7 +129,7 @@ def to_arrow(self) -> pa.Array: mask = None if self.nullable: mask = pa.py_buffer(self.mask_array_view.copy_to_host()) - data = pa.py_buffer(self.as_numerical.data_array_view.copy_to_host()) + data = pa.py_buffer(self.as_numerical.values_host) pa_dtype = np_to_pa_dtype(self.dtype) return pa.Array.from_buffers( type=pa_dtype, diff --git a/python/cudf/cudf/core/column_accessor.py b/python/cudf/cudf/core/column_accessor.py index e1d12807fa8..707eda3f5e6 100644 --- a/python/cudf/cudf/core/column_accessor.py +++ b/python/cudf/cudf/core/column_accessor.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021-2022, NVIDIA CORPORATION. +# Copyright (c) 2021-2023, NVIDIA CORPORATION. from __future__ import annotations @@ -319,9 +319,9 @@ def copy(self, deep=False) -> ColumnAccessor: """ Make a copy of this ColumnAccessor. """ - if deep: + if deep or cudf.get_option("copy_on_write"): return self.__class__( - {k: v.copy(deep=True) for k, v in self._data.items()}, + {k: v.copy(deep=deep) for k, v in self._data.items()}, multiindex=self.multiindex, level_names=self.level_names, ) diff --git a/python/cudf/cudf/core/dataframe.py b/python/cudf/cudf/core/dataframe.py index b24895805a9..d073f17089c 100644 --- a/python/cudf/cudf/core/dataframe.py +++ b/python/cudf/cudf/core/dataframe.py @@ -6533,12 +6533,14 @@ def to_struct(self, name=None): field_names = [str(name) for name in self._data.names] col = cudf.core.column.build_struct_column( - names=field_names, children=self._data.columns, size=len(self) + names=field_names, + children=tuple( + [col.copy(deep=True) for col in self._data.columns] + ), + size=len(self), ) return cudf.Series._from_data( - cudf.core.column_accessor.ColumnAccessor( - {name: col.copy(deep=True)} - ), + cudf.core.column_accessor.ColumnAccessor({name: col}), index=self.index, name=name, ) diff --git a/python/cudf/cudf/core/frame.py b/python/cudf/cudf/core/frame.py index 32764c6c2f0..2e85d75eb71 100644 --- a/python/cudf/cudf/core/frame.py +++ b/python/cudf/cudf/core/frame.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020-2022, NVIDIA CORPORATION. +# Copyright (c) 2020-2023, NVIDIA CORPORATION. from __future__ import annotations @@ -1448,7 +1448,7 @@ def searchsorted( # Return result as cupy array if the values is non-scalar # If values is scalar, result is expected to be scalar. - result = cupy.asarray(outcol.data_array_view) + result = cupy.asarray(outcol._data_array_view) if scalar_flag: return result[0].item() else: @@ -1701,8 +1701,8 @@ def _colwise_binop( # that nulls that are present in both left_column and # right_column are not filled. 
if left_column.nullable and right_column.nullable: - lmask = as_column(left_column.nullmask) - rmask = as_column(right_column.nullmask) + lmask = as_column(left_column._nullmask) + rmask = as_column(right_column._nullmask) output_mask = (lmask | rmask).data left_column = left_column.fillna(fill_value) right_column = right_column.fillna(fill_value) @@ -1756,7 +1756,7 @@ def _apply_cupy_ufunc_to_operands( cupy_inputs = [] for inp in (left, right) if ufunc.nin == 2 else (left,): if isinstance(inp, ColumnBase) and inp.has_nulls(): - new_mask = as_column(inp.nullmask) + new_mask = as_column(inp._nullmask) # TODO: This is a hackish way to perform a bitwise and # of bitmasks. Once we expose diff --git a/python/cudf/cudf/core/multiindex.py b/python/cudf/cudf/core/multiindex.py index 7c4eac59eab..783c3996400 100644 --- a/python/cudf/cudf/core/multiindex.py +++ b/python/cudf/cudf/core/multiindex.py @@ -1,4 +1,4 @@ -# Copyright (c) 2019-2022, NVIDIA CORPORATION. +# Copyright (c) 2019-2023, NVIDIA CORPORATION. from __future__ import annotations @@ -156,7 +156,7 @@ def __init__( source_data = {} for i, (column_name, col) in enumerate(codes._data.items()): - if -1 in col.values: + if -1 in col: level = cudf.DataFrame( {column_name: [None] + list(levels[i])}, index=range(-1, len(levels[i])), diff --git a/python/cudf/cudf/core/reshape.py b/python/cudf/cudf/core/reshape.py index 64622bde86e..df1a543c4aa 100644 --- a/python/cudf/cudf/core/reshape.py +++ b/python/cudf/cudf/core/reshape.py @@ -584,9 +584,7 @@ def _tile(A, reps): mdata[var_name] = cudf.Series( cudf.core.column.build_categorical_column( categories=value_vars, - codes=cudf.core.column.as_column( - temp._column.base_data, dtype=temp._column.dtype - ), + codes=temp._column, mask=temp._column.base_mask, size=temp._column.size, offset=temp._column.offset, diff --git a/python/cudf/cudf/core/series.py b/python/cudf/cudf/core/series.py index faad5275abd..74dfb69593e 100644 --- a/python/cudf/cudf/core/series.py +++ b/python/cudf/cudf/core/series.py @@ -365,6 +365,9 @@ class Series(SingleColumnFrame, IndexedFrame, Serializable): name : str, optional The name to give to the Series. + copy : bool, default False + Copy input data. Only affects Series or 1d ndarray input. + nan_as_null : bool, Default True If ``None``/``True``, converts ``np.nan`` values to ``null`` values. 
@@ -490,6 +493,7 @@ def __init__( index=None, dtype=None, name=None, + copy=False, nan_as_null=True, ): if isinstance(data, pd.Series): @@ -520,6 +524,8 @@ def __init__( if name is None: name = data.name data = data._column + if copy: + data = data.copy(deep=True) if dtype is not None: data = data.astype(dtype) @@ -538,7 +544,21 @@ def __init__( data = {} if not isinstance(data, ColumnBase): - data = column.as_column(data, nan_as_null=nan_as_null, dtype=dtype) + has_cai = ( + type( + inspect.getattr_static( + data, "__cuda_array_interface__", None + ) + ) + is property + ) + data = column.as_column( + data, + nan_as_null=nan_as_null, + dtype=dtype, + ) + if copy and has_cai: + data = data.copy(deep=True) else: if dtype is not None: data = data.astype(dtype) @@ -4959,10 +4979,10 @@ def isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False): index = as_index(a.index) a_col = column.as_column(a) - a_array = cupy.asarray(a_col.data_array_view) + a_array = cupy.asarray(a_col._data_array_view) b_col = column.as_column(b) - b_array = cupy.asarray(b_col.data_array_view) + b_array = cupy.asarray(b_col._data_array_view) result = cupy.isclose( a=a_array, b=b_array, rtol=rtol, atol=atol, equal_nan=equal_nan diff --git a/python/cudf/cudf/options.py b/python/cudf/cudf/options.py index f80ad92879d..d98b194722a 100644 --- a/python/cudf/cudf/options.py +++ b/python/cudf/cudf/options.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION. +# Copyright (c) 2022-2023, NVIDIA CORPORATION. import os import textwrap @@ -150,6 +150,33 @@ def _validator(val): return _validator +def _cow_validator(val): + if get_option("spill") and val: + raise ValueError( + "Copy-on-write is not supported when spilling is enabled. " + "Please set `spill` to `False`" + ) + if val not in {False, True}: + raise ValueError( + f"{val} is not a valid option. Must be one of {{False, True}}." + ) + + +def _spill_validator(val): + try: + if get_option("copy_on_write") and val: + raise ValueError( + "Spilling is not supported when Copy-on-write is enabled. " + "Please set `copy_on_write` to `False`" + ) + except KeyError: + pass + if val not in {False, True}: + raise ValueError( + f"{val} is not a valid option. Must be one of {{False, True}}." + ) + + def _integer_validator(val): try: int(val) @@ -205,7 +232,6 @@ def _integer_and_none_validator(val): _make_contains_validator([None, 32, 64]), ) - _register_option( "spill", _env_get_bool("CUDF_SPILL", False), @@ -215,9 +241,28 @@ def _integer_and_none_validator(val): \tValid values are True or False. Default is False. """ ), - _make_contains_validator([False, True]), + _spill_validator, ) + +_register_option( + "copy_on_write", + _env_get_bool("CUDF_COPY_ON_WRITE", False), + textwrap.dedent( + """ + Default behavior of performing shallow copies. + If set to `False`, each shallow copy will perform a true shallow copy. + If set to `True`, each shallow copy will perform a shallow copy + with underlying data actually referring to the actual column, in this + case a copy is only made when there is a write operation performed on + the column. + \tValid values are True or False. Default is False. 
+ """ + ), + _cow_validator, +) + + _register_option( "spill_on_demand", _env_get_bool("CUDF_SPILL_ON_DEMAND", True), diff --git a/python/cudf/cudf/testing/_utils.py b/python/cudf/cudf/testing/_utils.py index cbaf47a4c68..c27e1eaab36 100644 --- a/python/cudf/cudf/testing/_utils.py +++ b/python/cudf/cudf/testing/_utils.py @@ -346,6 +346,11 @@ def get_ptr(x) -> int: assert len(lhs.base_children) == len(rhs.base_children) for lhs_child, rhs_child in zip(lhs.base_children, rhs.base_children): assert_column_memory_eq(lhs_child, rhs_child) + if isinstance(lhs, cudf.core.column.CategoricalColumn) and isinstance( + rhs, cudf.core.column.CategoricalColumn + ): + assert_column_memory_eq(lhs.categories, rhs.categories) + assert_column_memory_eq(lhs.codes, rhs.codes) def assert_column_memory_ne( diff --git a/python/cudf/cudf/tests/test_copying.py b/python/cudf/cudf/tests/test_copying.py index 1d3d9e91ae2..1dd73a69384 100644 --- a/python/cudf/cudf/tests/test_copying.py +++ b/python/cudf/cudf/tests/test_copying.py @@ -1,5 +1,6 @@ -# Copyright (c) 2020-2022, NVIDIA CORPORATION. +# Copyright (c) 2020-2023, NVIDIA CORPORATION. +import cupy as cp import numpy as np import pandas as pd import pytest @@ -52,3 +53,238 @@ def test_null_copy(): col = Series(np.arange(2049)) col[:] = None assert len(col) == 2049 + + +@pytest.mark.parametrize("copy_on_write", [True, False]) +def test_series_setitem_cow(copy_on_write): + cudf.set_option("copy_on_write", copy_on_write) + actual = cudf.Series([1, 2, 3, 4, 5]) + new_copy = actual.copy(deep=False) + + actual[1] = 100 + assert_eq(actual, cudf.Series([1, 100, 3, 4, 5])) + if copy_on_write: + assert_eq(new_copy, cudf.Series([1, 2, 3, 4, 5])) + else: + assert_eq(new_copy, cudf.Series([1, 100, 3, 4, 5])) + + actual = cudf.Series([1, 2, 3, 4, 5]) + new_copy = actual.copy(deep=False) + + actual[slice(0, 2, 1)] = 100 + assert_eq(actual, cudf.Series([100, 100, 3, 4, 5])) + if copy_on_write: + assert_eq(new_copy, cudf.Series([1, 2, 3, 4, 5])) + else: + assert_eq(new_copy, cudf.Series([100, 100, 3, 4, 5])) + + new_copy[slice(2, 4, 1)] = 300 + if copy_on_write: + assert_eq(actual, cudf.Series([100, 100, 3, 4, 5])) + else: + assert_eq(actual, cudf.Series([100, 100, 300, 300, 5])) + + if copy_on_write: + assert_eq(new_copy, cudf.Series([1, 2, 300, 300, 5])) + else: + assert_eq(new_copy, cudf.Series([100, 100, 300, 300, 5])) + + actual = cudf.Series([1, 2, 3, 4, 5]) + new_copy = actual.copy(deep=False) + + new_copy[slice(2, 4, 1)] = 300 + if copy_on_write: + assert_eq(actual, cudf.Series([1, 2, 3, 4, 5])) + else: + assert_eq(actual, cudf.Series([1, 2, 300, 300, 5])) + assert_eq(new_copy, cudf.Series([1, 2, 300, 300, 5])) + + new_slice = actual[2:] + assert new_slice._column.base_data._ptr == actual._column.base_data._ptr + new_slice[0:2] = 10 + assert_eq(new_slice, cudf.Series([10, 10, 5], index=[2, 3, 4])) + if copy_on_write: + assert_eq(actual, cudf.Series([1, 2, 3, 4, 5])) + else: + assert_eq(actual, cudf.Series([1, 2, 10, 10, 5])) + + +def test_multiple_series_cow(): + cudf.set_option("copy_on_write", True) + s = cudf.Series([10, 20, 30, 40, 50]) + s1 = s.copy(deep=False) + s2 = s.copy(deep=False) + s3 = s.copy(deep=False) + s4 = s2.copy(deep=False) + s5 = s4.copy(deep=False) + s6 = s3.copy(deep=False) + + s1[0:3] = 10000 + assert_eq(s1, cudf.Series([10000, 10000, 10000, 40, 50])) + for ser in [s, s2, s3, s4, s5, s6]: + assert_eq(ser, cudf.Series([10, 20, 30, 40, 50])) + + s6[0:3] = 3000 + assert_eq(s1, cudf.Series([10000, 10000, 10000, 40, 50])) + assert_eq(s6, 
cudf.Series([3000, 3000, 3000, 40, 50])) + for ser in [s2, s3, s4, s5]: + assert_eq(ser, cudf.Series([10, 20, 30, 40, 50])) + + s2[1:4] = 4000 + assert_eq(s2, cudf.Series([10, 4000, 4000, 4000, 50])) + assert_eq(s1, cudf.Series([10000, 10000, 10000, 40, 50])) + assert_eq(s6, cudf.Series([3000, 3000, 3000, 40, 50])) + for ser in [s3, s4, s5]: + assert_eq(ser, cudf.Series([10, 20, 30, 40, 50])) + + s4[2:4] = 5000 + assert_eq(s4, cudf.Series([10, 20, 5000, 5000, 50])) + assert_eq(s2, cudf.Series([10, 4000, 4000, 4000, 50])) + assert_eq(s1, cudf.Series([10000, 10000, 10000, 40, 50])) + assert_eq(s6, cudf.Series([3000, 3000, 3000, 40, 50])) + for ser in [s3, s5]: + assert_eq(ser, cudf.Series([10, 20, 30, 40, 50])) + + s5[2:4] = 6000 + assert_eq(s5, cudf.Series([10, 20, 6000, 6000, 50])) + assert_eq(s4, cudf.Series([10, 20, 5000, 5000, 50])) + assert_eq(s2, cudf.Series([10, 4000, 4000, 4000, 50])) + assert_eq(s1, cudf.Series([10000, 10000, 10000, 40, 50])) + assert_eq(s6, cudf.Series([3000, 3000, 3000, 40, 50])) + for ser in [s3]: + assert_eq(ser, cudf.Series([10, 20, 30, 40, 50])) + + s7 = s5.copy(deep=False) + assert_eq(s7, cudf.Series([10, 20, 6000, 6000, 50])) + s7[1:3] = 55 + assert_eq(s7, cudf.Series([10, 55, 55, 6000, 50])) + + assert_eq(s4, cudf.Series([10, 20, 5000, 5000, 50])) + assert_eq(s2, cudf.Series([10, 4000, 4000, 4000, 50])) + assert_eq(s1, cudf.Series([10000, 10000, 10000, 40, 50])) + assert_eq(s6, cudf.Series([3000, 3000, 3000, 40, 50])) + for ser in [s3]: + assert_eq(ser, cudf.Series([10, 20, 30, 40, 50])) + + del s2 + + assert_eq(s1, cudf.Series([10000, 10000, 10000, 40, 50])) + assert_eq(s3, cudf.Series([10, 20, 30, 40, 50])) + assert_eq(s4, cudf.Series([10, 20, 5000, 5000, 50])) + assert_eq(s5, cudf.Series([10, 20, 6000, 6000, 50])) + assert_eq(s6, cudf.Series([3000, 3000, 3000, 40, 50])) + assert_eq(s7, cudf.Series([10, 55, 55, 6000, 50])) + + del s4 + del s1 + + assert_eq(s3, cudf.Series([10, 20, 30, 40, 50])) + assert_eq(s5, cudf.Series([10, 20, 6000, 6000, 50])) + assert_eq(s6, cudf.Series([3000, 3000, 3000, 40, 50])) + assert_eq(s7, cudf.Series([10, 55, 55, 6000, 50])) + + del s + del s6 + + assert_eq(s3, cudf.Series([10, 20, 30, 40, 50])) + assert_eq(s5, cudf.Series([10, 20, 6000, 6000, 50])) + assert_eq(s7, cudf.Series([10, 55, 55, 6000, 50])) + + del s5 + + assert_eq(s3, cudf.Series([10, 20, 30, 40, 50])) + assert_eq(s7, cudf.Series([10, 55, 55, 6000, 50])) + + del s3 + assert_eq(s7, cudf.Series([10, 55, 55, 6000, 50])) + + +@pytest.mark.parametrize("copy_on_write", [True, False]) +def test_series_zero_copy(copy_on_write): + cudf.set_option("copy_on_write", copy_on_write) + s = cudf.Series([1, 2, 3, 4, 5]) + s1 = s.copy(deep=False) + cp_array = cp.asarray(s) + + assert_eq(s, cudf.Series([1, 2, 3, 4, 5])) + assert_eq(s1, cudf.Series([1, 2, 3, 4, 5])) + assert_eq(cp_array, cp.array([1, 2, 3, 4, 5])) + + cp_array[0:3] = 10 + + assert_eq(s, cudf.Series([10, 10, 10, 4, 5])) + if copy_on_write: + assert_eq(s1, cudf.Series([1, 2, 3, 4, 5])) + else: + assert_eq(s1, cudf.Series([10, 10, 10, 4, 5])) + assert_eq(cp_array, cp.array([10, 10, 10, 4, 5])) + + s2 = cudf.Series(cp_array) + assert_eq(s2, cudf.Series([10, 10, 10, 4, 5])) + s3 = s2.copy(deep=False) + cp_array[0] = 20 + + assert_eq(s, cudf.Series([20, 10, 10, 4, 5])) + if copy_on_write: + assert_eq(s1, cudf.Series([1, 2, 3, 4, 5])) + else: + assert_eq(s1, cudf.Series([20, 10, 10, 4, 5])) + assert_eq(cp_array, cp.array([20, 10, 10, 4, 5])) + assert_eq(s2, cudf.Series([20, 10, 10, 4, 5])) + if copy_on_write: + 
assert_eq(s3, cudf.Series([10, 10, 10, 4, 5])) + else: + assert_eq(s3, cudf.Series([20, 10, 10, 4, 5])) + + s4 = cudf.Series([10, 20, 30, 40, 50]) + s5 = cudf.Series(s4) + assert_eq(s5, cudf.Series([10, 20, 30, 40, 50])) + s5[0:2] = 1 + assert_eq(s5, cudf.Series([1, 1, 30, 40, 50])) + assert_eq(s4, cudf.Series([1, 1, 30, 40, 50])) + + +@pytest.mark.parametrize("copy_on_write", [True, False]) +def test_series_str_copy(copy_on_write): + cudf.set_option("copy_on_write", copy_on_write) + s = cudf.Series(["a", "b", "c", "d", "e"]) + s1 = s.copy(deep=True) + s2 = s.copy(deep=True) + + assert_eq(s, cudf.Series(["a", "b", "c", "d", "e"])) + assert_eq(s1, cudf.Series(["a", "b", "c", "d", "e"])) + assert_eq(s2, cudf.Series(["a", "b", "c", "d", "e"])) + + s[0:3] = "abc" + + assert_eq(s, cudf.Series(["abc", "abc", "abc", "d", "e"])) + assert_eq(s1, cudf.Series(["a", "b", "c", "d", "e"])) + assert_eq(s2, cudf.Series(["a", "b", "c", "d", "e"])) + + s2[1:4] = "xyz" + + assert_eq(s, cudf.Series(["abc", "abc", "abc", "d", "e"])) + assert_eq(s1, cudf.Series(["a", "b", "c", "d", "e"])) + assert_eq(s2, cudf.Series(["a", "xyz", "xyz", "xyz", "e"])) + + +@pytest.mark.parametrize("copy_on_write", [True, False]) +def test_series_cat_copy(copy_on_write): + cudf.set_option("copy_on_write", copy_on_write) + s = cudf.Series([10, 20, 30, 40, 50], dtype="category") + s1 = s.copy(deep=True) + s2 = s1.copy(deep=True) + s3 = s1.copy(deep=True) + + s[0] = 50 + assert_eq(s, cudf.Series([50, 20, 30, 40, 50], dtype=s.dtype)) + assert_eq(s1, cudf.Series([10, 20, 30, 40, 50], dtype="category")) + assert_eq(s2, cudf.Series([10, 20, 30, 40, 50], dtype="category")) + assert_eq(s3, cudf.Series([10, 20, 30, 40, 50], dtype="category")) + + s2[3] = 10 + s3[2:5] = 20 + assert_eq(s, cudf.Series([50, 20, 30, 40, 50], dtype=s.dtype)) + assert_eq(s1, cudf.Series([10, 20, 30, 40, 50], dtype=s.dtype)) + assert_eq(s2, cudf.Series([10, 20, 30, 10, 50], dtype=s.dtype)) + assert_eq(s3, cudf.Series([10, 20, 20, 20, 20], dtype=s.dtype)) diff --git a/python/cudf/cudf/tests/test_index.py b/python/cudf/cudf/tests/test_index.py index 5b0eca1c635..32d83bb1c83 100644 --- a/python/cudf/cudf/tests/test_index.py +++ b/python/cudf/cudf/tests/test_index.py @@ -392,6 +392,7 @@ def test_index_copy_category(name, dtype, deep=True): with pytest.warns(FutureWarning): cidx_copy = cidx.copy(name=name, deep=deep, dtype=dtype) + assert_column_memory_ne(cidx._values, cidx_copy._values) assert_eq(pidx_copy, cidx_copy) @@ -410,7 +411,18 @@ def test_index_copy_category(name, dtype, deep=True): def test_index_copy_deep(idx, deep): """Test if deep copy creates a new instance for device data.""" idx_copy = idx.copy(deep=deep) - if not deep: + + if ( + isinstance(idx, cudf.StringIndex) + or not deep + or (cudf.get_option("copy_on_write") and not deep) + ): + # StringColumn is immutable hence, deep copies of a + # StringIndex will share the same StringColumn. + + # When `copy_on_write` is turned on, Index objects will + # have unique column object but they all point to same + # data pointers. 
assert_column_memory_eq(idx._values, idx_copy._values) else: assert_column_memory_ne(idx._values, idx_copy._values) diff --git a/python/cudf/cudf/tests/test_multiindex.py b/python/cudf/cudf/tests/test_multiindex.py index d27d6732226..ee7c10d607a 100644 --- a/python/cudf/cudf/tests/test_multiindex.py +++ b/python/cudf/cudf/tests/test_multiindex.py @@ -781,13 +781,15 @@ def test_multiindex_copy_sem(data, levels, codes, names): ), ], ) +@pytest.mark.parametrize("copy_on_write", [True, False]) @pytest.mark.parametrize("deep", [True, False]) -def test_multiindex_copy_deep(data, deep): +def test_multiindex_copy_deep(data, copy_on_write, deep): """Test memory identity for deep copy Case1: Constructed from GroupBy, StringColumns Case2: Constructed from MultiIndex, NumericColumns """ - same_ref = not deep + cudf.set_option("copy_on_write", copy_on_write) + same_ref = (not deep) or (cudf.get_option("copy_on_write") and not deep) if isinstance(data, dict): import operator @@ -804,10 +806,10 @@ def test_multiindex_copy_deep(data, deep): lchildren = reduce(operator.add, lchildren) rchildren = reduce(operator.add, rchildren) - lptrs = [child.base_data.ptr for child in lchildren] - rptrs = [child.base_data.ptr for child in rchildren] + lptrs = [child.base_data._ptr for child in lchildren] + rptrs = [child.base_data._ptr for child in rchildren] - assert all((x == y) == same_ref for x, y in zip(lptrs, rptrs)) + assert all((x == y) for x, y in zip(lptrs, rptrs)) elif isinstance(data, cudf.MultiIndex): mi1 = data diff --git a/python/cudf/cudf/tests/test_series.py b/python/cudf/cudf/tests/test_series.py index 6a1245a1582..b3c7c9ac9bb 100644 --- a/python/cudf/cudf/tests/test_series.py +++ b/python/cudf/cudf/tests/test_series.py @@ -2080,3 +2080,25 @@ def test_series_duplicated(data, index, keep): ps = gs.to_pandas() assert_eq(gs.duplicated(keep=keep), ps.duplicated(keep=keep)) + + +@pytest.mark.parametrize( + "data", + [ + [1, 2, 3, 4], + [10, 20, None, None], + ], +) +@pytest.mark.parametrize("copy", [True, False]) +def test_series_copy(data, copy): + psr = pd.Series(data) + gsr = cudf.from_pandas(psr) + + new_psr = pd.Series(psr, copy=copy) + new_gsr = cudf.Series(gsr, copy=copy) + + new_psr.iloc[0] = 999 + new_gsr.iloc[0] = 999 + + assert_eq(psr, gsr) + assert_eq(new_psr, new_gsr) diff --git a/python/cudf/cudf/tests/test_stats.py b/python/cudf/cudf/tests/test_stats.py index 715d17169a6..6478fbaad95 100644 --- a/python/cudf/cudf/tests/test_stats.py +++ b/python/cudf/cudf/tests/test_stats.py @@ -531,7 +531,6 @@ def test_nans_stats(data, ops, skipna): getattr(psr, ops)(skipna=skipna), getattr(gsr, ops)(skipna=skipna) ) - psr = _create_pandas_series(data) gsr = cudf.Series(data, nan_as_null=False) # Since there is no concept of `nan_as_null` in pandas, # nulls will be returned in the operations. So only diff --git a/python/cudf/cudf/utils/applyutils.py b/python/cudf/cudf/utils/applyutils.py index 89331b933a8..b7677f7b43e 100644 --- a/python/cudf/cudf/utils/applyutils.py +++ b/python/cudf/cudf/utils/applyutils.py @@ -1,4 +1,4 @@ -# Copyright (c) 2018-2022, NVIDIA CORPORATION. +# Copyright (c) 2018-2023, NVIDIA CORPORATION. import functools from typing import Any, Dict @@ -110,7 +110,7 @@ def make_aggregate_nullmask(df, columns=None, op="__and__"): col = cudf.core.dataframe.extract_col(df, k) if not col.nullable: continue - nullmask = df[k].nullmask + nullmask = cudf.Series(df[k]._column._nullmask) if out_mask is None: out_mask = column.as_column(