[REVIEW] Copy on write implementation (#11718)

Initial copy-on-write implementation
rapidsai · Jan 13, 2023 · bd72a17
1 parent ec7f8c6
commit bd72a17

Showing 31 changed files with 1,132 additions and 95 deletions.
diff --git a/docs/cudf/source/developer_guide/library_design.md b/docs/cudf/source/developer_guide/library_design.md
@@ -316,3 +316,180 @@ The pandas API also includes a number of helper objects, such as `GroupBy`, `Rol
 cuDF implements corresponding objects with the same APIs.
 Internally, these objects typically interact with cuDF objects at the Frame layer via composition.
 However, for performance reasons they frequently access internal attributes and methods of `Frame` and its subclasses.
+
+
+## Copy-on-write
+
+
+Copy-on-write (COW) is designed to reduce the memory footprint on GPUs. With this feature, the data behind a shallow copy (`.copy(deep=False)`) is
+physically copied only when a write operation is performed on a column. We recommend first reading
+the [public usage guide](copy-on-write-user-doc) for this functionality before going through the internals
+below.
+
+The core copy-on-write implementation relies on the `CopyOnWriteBuffer` class. This class stores the pointer to the device memory and its size.
+Using `CopyOnWriteBuffer.ptr` as a key, we generate [weak references](https://docs.python.org/3/library/weakref.html) to each `CopyOnWriteBuffer` and store them in `CopyOnWriteBuffer._instances`,
+a mapping from `ptr` keys to `WeakSet`s containing references to `CopyOnWriteBuffer` objects. This
+means that all `CopyOnWriteBuffer`s that are created map to the same key in `CopyOnWriteBuffer._instances` if they have the same `.ptr`,
+i.e., if they all point to the same device memory.
+
+When the cudf option `"copy_on_write"` is `True`, `as_buffer` will always return a `CopyOnWriteBuffer`. This class contains all the
+mechanisms to enable copy-on-write for all buffers. When a `CopyOnWriteBuffer` is created, a weak reference to it is generated and added to the `WeakSet`, which is in turn stored in `CopyOnWriteBuffer._instances`. This later serves as an indication of whether or not to make a copy
+when a write operation is performed on a `Column` (see below).
+
+
+### Eager copies when exposing to third-party libraries
+
+If a `Column`/`CopyOnWriteBuffer` is exposed to a third-party library via `__cuda_array_interface__`, we are no longer able to track whether or not modification of the buffer has occurred without introspection. Hence, whenever
+someone accesses data through `__cuda_array_interface__`, we eagerly trigger the copy by calling
+`_unlink_shared_buffers`, which ensures that a true copy of the underlying device data is made and
+unlinks the buffer from any shared "weak" references. Because we cannot track the lifetime of the third-party object, we also mark the `Column`/`CopyOnWriteBuffer` with
+`obj._zero_copied=True`, indicating that any future shallow-copy request will trigger a true physical copy
+rather than a copy-on-write shallow copy with weak references.
+
+### How to obtain a read-only object?
+
+A read-only object can be quite useful for operations that will not
+mutate the data. One can be obtained by calling the `._get_readonly_proxy_obj`
+API, which returns a proxy object that has `__cuda_array_interface__`
+implemented and will not trigger a deep copy even if the `CopyOnWriteBuffer`
+has weak references. Using this API is recommended only when
+the objects/arrays created from the proxy object are cleaned up during
+the developer's code execution. We currently use this API for device-to-host
+copies, such as in `ColumnBase._data_array_view`, which is used for `Column.values_host`.
+
+Notes:
+1. Weak references are implemented only for fixed-width data types, as these are the only column
+types that can be mutated in place.
+2. Deep copies of variable-width data types return shallow copies of the columns, because these
+types don't support real in-place mutation of the data. We only mimic in-place behavior
+using `_mimic_inplace`.
+
+
+### Examples
+
+When copy-on-write is enabled, taking a shallow copy of a `Series` or a `DataFrame` does not
+eagerly create a copy of the data. Instead, it produces a view that will be lazily
+copied when a write operation is performed on any of its copies.
+
+Let's create a series:
+
+```python
+>>> import cudf
+>>> cudf.set_option("copy_on_write", True)
+>>> s1 = cudf.Series([1, 2, 3, 4])
+```
+
+Make a copy of `s1`:
+```python
+>>> s2 = s1.copy(deep=False)
+```
+
+Make another copy, but of `s2`:
+```python
+>>> s3 = s2.copy(deep=False)
+```
+
+Viewing the data and memory addresses shows that they all point to the same device memory:
+```python
+>>> s1
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+>>> s2
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+>>> s3
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+
+>>> s1.data._ptr
+139796315897856
+>>> s2.data._ptr
+139796315897856
+>>> s3.data._ptr
+139796315897856
+```
+
+Now, when we perform a write operation on one of them, say on `s2`, a new copy is created
+for `s2` on device and then modified:
+
+```python
+>>> s2[0:2] = 10
+>>> s2
+0    10
+1    10
+2     3
+3     4
+dtype: int64
+>>> s1
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+>>> s3
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+```
+
+If we inspect the memory address of the data, `s1` and `s3` still share the same address but `s2` has a new one:
+
+```python
+>>> s1.data._ptr
+139796315897856
+>>> s3.data._ptr
+139796315897856
+>>> s2.data._ptr
+139796315899392
+```
+
+Now, performing a write operation on `s1` will trigger a new copy in device memory, as
+a weak reference is still shared with `s3`:
+
+```python
+>>> s1[0:2] = 11
+>>> s1
+0    11
+1    11
+2     3
+3     4
+dtype: int64
+>>> s2
+0    10
+1    10
+2     3
+3     4
+dtype: int64
+>>> s3
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+```
+
+If we inspect the memory address of the data, the addresses of `s2` and `s3` remain unchanged, but `s1`'s memory address has changed because of the copy performed during the write:
+
+```python
+>>> s2.data._ptr
+139796315899392
+>>> s3.data._ptr
+139796315897856
+>>> s1.data._ptr
+139796315879723
+```
+
+The cudf copy-on-write implementation is motivated by the pandas copy-on-write proposal:
+1. [Google doc](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.iexejdstiz8u)
+2. [GitHub issue](https://github.com/pandas-dev/pandas/issues/36195)
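
The bookkeeping described in the developer guide above can be sketched in plain Python without a GPU. This is a minimal toy model under stated assumptions, not cudf's implementation: `ToyBuffer`, `write`, and `expose` are hypothetical names, a `bytearray` stands in for device memory, and `id()` of that `bytearray` stands in for the device pointer. Only `_instances`, `ptr`, `_unlink_shared_buffers`, and `_zero_copied` are names taken from the text.

```python
import weakref
from collections import defaultdict


class ToyBuffer:
    """Hypothetical stand-in for CopyOnWriteBuffer (illustration only)."""

    # Maps a "pointer" to the set of live buffers that share that memory.
    # WeakSet entries disappear on their own once a buffer is garbage collected.
    _instances = defaultdict(weakref.WeakSet)

    def __init__(self, memory):
        self._memory = memory        # bytearray playing the role of device memory
        self._zero_copied = False    # set once the memory has been exposed
        ToyBuffer._instances[self.ptr].add(self)

    @property
    def ptr(self):
        # Assumption: id() of the backing bytearray mimics a device pointer.
        return id(self._memory)

    def _is_shared(self):
        return len(ToyBuffer._instances[self.ptr]) > 1

    def _unlink_shared_buffers(self):
        # Make a true copy only if another live buffer shares this memory.
        if self._is_shared():
            ToyBuffer._instances[self.ptr].discard(self)
            self._memory = bytearray(self._memory)
            ToyBuffer._instances[self.ptr].add(self)

    def copy(self, deep=False):
        # Once exposed, even a shallow-copy request makes a true copy.
        if deep or self._zero_copied:
            return ToyBuffer(bytearray(self._memory))
        return ToyBuffer(self._memory)

    def write(self, index, value):
        # Copy-on-write: unlink from any sharers before mutating.
        self._unlink_shared_buffers()
        self._memory[index] = value

    def expose(self):
        # Mimics access through __cuda_array_interface__: copy eagerly,
        # then remember that later writes can no longer be tracked.
        self._unlink_shared_buffers()
        self._zero_copied = True
        return self._memory


b1 = ToyBuffer(bytearray([1, 2, 3, 4]))
b2 = b1.copy()            # shallow: shares memory with b1
b3 = b2.copy()            # shallow: shares memory with b1 and b2
shared_before = b1.ptr == b2.ptr == b3.ptr

b2.write(0, 10)           # b2 is shared, so it copies before writing
exposed = b1.expose()     # b1 unlinks from b3 and is marked zero-copied
b4 = b1.copy()            # shallow request, but a true copy is made
```

In this sketch `b1` and `b3` keep their original values after `b2.write(0, 10)`, and after `b1.expose()` every buffer ends up with its own memory. Keeping the registry weak is the design point: a shallow copy that goes out of scope drops out of the set, so the last remaining owner can write in place without copying.
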
diff --git a/docs/cudf/source/user_guide/copy-on-write.md b/docs/cudf/source/user_guide/copy-on-write.md
@@ -0,0 +1,169 @@
+(copy-on-write-user-doc)=
+
+# Copy-on-write
+
+Copy-on-write reduces GPU memory usage when shallow copies (`.copy(deep=False)`) of a column
+are made.
+
+|                     | Copy-on-Write enabled                                                                                                                                                                                          | Copy-on-Write disabled (default)                                                                               |
+|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
+| `.copy(deep=True)`  | A true copy is made and changes don't propagate to the original object.                                                                                                                            | A true copy is made and changes don't propagate to the original object.                  |
+| `.copy(deep=False)` | Memory is shared between the two objects, but any write operation on one object will trigger a true physical copy before the write is performed. Hence changes will not propagate to the original object. | Memory is shared between the two objects and changes performed on one will propagate to the other object. |
+
+## How to enable it
+
+i. Use `cudf.set_option`:
+
+```python
+>>> import cudf
+>>> cudf.set_option("copy_on_write", True)
+```
+
+ii. Set the environment variable ``CUDF_COPY_ON_WRITE`` to ``1`` prior to the
+launch of the Python interpreter:
+
+```bash
+CUDF_COPY_ON_WRITE="1" python -c "import cudf"
+```
+
+
+## Making copies
+
+There are no additional changes required in the code to make use of copy-on-write.
+
+```python
+>>> series = cudf.Series([1, 2, 3, 4])
+```
+
+Performing a shallow copy will create a new Series object pointing to the
+same underlying device memory:
+
+```python
+>>> copied_series = series.copy(deep=False)
+>>> series
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+>>> copied_series
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+```
+
+When a write operation is performed on either ``series`` or
+``copied_series``, a true physical copy of the data is created:
+
+```python
+>>> series[0:2] = 10
+>>> series
+0    10
+1    10
+2     3
+3     4
+dtype: int64
+>>> copied_series
+0    1
+1    2
+2    3
+3    4
+dtype: int64
+```
+
+
+## Notes
+
+When copy-on-write is enabled, there is no concept of views, i.e., modifying a view created inside cudf will not modify
+the original object it was viewing; instead, a separate copy is created and then modified.
+
+## Advantages
+
+1. With copy-on-write enabled, requesting `.copy(deep=False)` can reduce GPU memory usage drastically if you are not performing
+write operations on all of those copies. This also increases the speed at which objects are created during execution of your ETL workflow.
+2. With the concept of views going away, every object is a copy of its original object. This brings consistency across operations and brings cudf closer to parity with
+pandas. The following is one such inconsistency:
+
+```python
+
+>>> import pandas as pd
+>>> s = pd.Series([1, 2, 3, 4, 5])
+>>> s1 = s[0:2]
+>>> s1[0] = 10
+>>> s1
+0    10
+1     2
+dtype: int64
+>>> s
+0    10
+1     2
+2     3
+3     4
+4     5
+dtype: int64
+
+>>> import cudf
+>>> s = cudf.Series([1, 2, 3, 4, 5])
+>>> s1 = s[0:2]
+>>> s1[0] = 10
+>>> s1
+0    10
+1     2
+dtype: int64
+>>> s
+0    1
+1    2
+2    3
+3    4
+4    5
+dtype: int64
+```
+
+The above inconsistency is resolved when copy-on-write is enabled:
+
+```python
+>>> import pandas as pd
+>>> pd.set_option("mode.copy_on_write", True)
+>>> s = pd.Series([1, 2, 3, 4, 5])
+>>> s1 = s[0:2]
+>>> s1[0] = 10
+>>> s1
+0    10
+1     2
+dtype: int64
+>>> s
+0    1
+1    2
+2    3
+3    4
+4    5
+dtype: int64
+
+
+>>> import cudf
+>>> cudf.set_option("copy_on_write", True)
+>>> s = cudf.Series([1, 2, 3, 4, 5])
+>>> s1 = s[0:2]
+>>> s1[0] = 10
+>>> s1
+0    10
+1     2
+dtype: int64
+>>> s
+0    1
+1    2
+2    3
+3    4
+4    5
+dtype: int64
+```
+
+## How to disable it
+
+
+Copy-on-write can be disabled by setting the ``copy_on_write`` cudf option to ``False``:
+
+```python
+>>> cudf.set_option("copy_on_write", False)
+```
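
The user guide above says `CUDF_COPY_ON_WRITE` must be set before the Python interpreter is launched. A minimal sketch of doing that from Python, by starting a child interpreter with the variable already in its environment, is shown here. It assumes nothing about cudf itself: the child only echoes the variable, and a real workflow would run `import cudf` in its place.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child process only.
child_env = dict(os.environ, CUDF_COPY_ON_WRITE="1")

# The child prints the value it sees at startup. In a real workflow the
# command would import cudf instead (not done here, as cudf may be absent).
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('CUDF_COPY_ON_WRITE'))"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = result.stdout.strip()
```

Passing the variable through `env=` scopes it to that one child process, which is the same effect as prefixing a single shell command with the assignment.
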
diff --git a/docs/cudf/source/user_guide/index.md b/docs/cudf/source/user_guide/index.md
@@ -13,4 +13,5 @@ guide-to-udfs
 cupy-interop
 options
 PandasCompat
+copy-on-write
 ```