[REVIEW] Copy on write implementation #11718

Merged
merged 193 commits into from
Jan 13, 2023
Commits
d780606
initial commit
galipremsagar Jun 23, 2022
9baaa08
initial commit
galipremsagar Jun 23, 2022
ec5d461
merge
galipremsagar Jun 28, 2022
dc94cde
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w
galipremsagar Jun 29, 2022
98d3cae
fix
galipremsagar Jun 29, 2022
5bd9dfa
import
galipremsagar Jun 29, 2022
6a941a4
fix
galipremsagar Jul 1, 2022
29491f0
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w
galipremsagar Jul 6, 2022
98426c3
fix
galipremsagar Jul 6, 2022
827f52a
push down to column
galipremsagar Jul 7, 2022
f9e81cf
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-1
galipremsagar Jul 8, 2022
5a81c2d
cleanup
galipremsagar Jul 8, 2022
c10fb77
cleanup
galipremsagar Jul 8, 2022
f6d1003
changes
galipremsagar Jul 8, 2022
212af2e
cleanup
galipremsagar Jul 8, 2022
6e56c86
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-1
galipremsagar Jul 8, 2022
a1eef3d
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-1
galipremsagar Jul 8, 2022
6fff02a
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-1
galipremsagar Jul 11, 2022
744446f
use base_data
galipremsagar Jul 11, 2022
d869dda
handle strings
galipremsagar Jul 11, 2022
11709bc
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-2
galipremsagar Jul 14, 2022
219ee1b
struct & list
galipremsagar Jul 14, 2022
3c3934c
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-2
galipremsagar Jul 18, 2022
b86e020
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-2
galipremsagar Jul 19, 2022
4b423a4
cleanup
galipremsagar Jul 19, 2022
4c7e9cd
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-2
galipremsagar Jul 25, 2022
aa4620f
Merge remote-tracking branch 'upstream/branch-22.08' into c-o-w-2
galipremsagar Aug 1, 2022
c66656b
Merge remote-tracking branch 'upstream/branch-22.10' into c-o-w-2
galipremsagar Sep 12, 2022
497f5df
Merge remote-tracking branch 'upstream/branch-22.10' into c-o-w-2
galipremsagar Sep 18, 2022
83269ea
Merge remote-tracking branch 'upstream/branch-22.10' into c-o-w-2
galipremsagar Sep 19, 2022
ff59856
add copy_on_write option
galipremsagar Sep 19, 2022
3d177e0
simply has_a_weakref
galipremsagar Sep 19, 2022
56e4af5
internalize detach_refs
galipremsagar Sep 19, 2022
49994d5
cleanup
galipremsagar Sep 19, 2022
f3bbdfc
Merge remote-tracking branch 'upstream/branch-22.10' into c-o-w-2
galipremsagar Sep 19, 2022
de6c9dc
fix
galipremsagar Sep 20, 2022
e2b746e
cleanup
galipremsagar Sep 20, 2022
7698958
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Sep 28, 2022
3bbc9f9
Fix non cow tests
galipremsagar Sep 28, 2022
79c5f17
Fix non cow tests
galipremsagar Sep 28, 2022
79cc09f
pytest fix
galipremsagar Sep 28, 2022
fdc8043
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 5, 2022
5521016
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 5, 2022
a902285
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 6, 2022
6621dc3
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 6, 2022
385b7de
Handle categoricals
galipremsagar Oct 6, 2022
2d24882
style
galipremsagar Oct 6, 2022
f755758
style
galipremsagar Oct 6, 2022
ff5e3cd
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 6, 2022
1cebe44
struct fix
galipremsagar Oct 6, 2022
5385e32
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 7, 2022
0326763
detach in CAI
galipremsagar Oct 7, 2022
2dab262
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 10, 2022
160ba75
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 10, 2022
3ef8ae4
add Buffer._detach
galipremsagar Oct 11, 2022
e62c9be
add Buffer._detach
galipremsagar Oct 11, 2022
6a63ce1
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 12, 2022
709dd62
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 17, 2022
c5e27c3
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 18, 2022
a0f90d3
Lower weakref to Buffer class
galipremsagar Oct 21, 2022
82585e1
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 24, 2022
6028691
cleanup
galipremsagar Oct 24, 2022
6dc355f
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 24, 2022
5904c8e
cleanup
galipremsagar Oct 24, 2022
0e4ce26
cleanup
galipremsagar Oct 24, 2022
995b66c
docstrings
galipremsagar Oct 24, 2022
ae4b5e0
Move detach_refs to mutable_view()
galipremsagar Oct 24, 2022
97584de
refactor
galipremsagar Oct 24, 2022
804a121
changes
galipremsagar Oct 25, 2022
ef21cdb
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 25, 2022
d916a24
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 25, 2022
bf449be
design docs and improvements
galipremsagar Oct 25, 2022
dd47e67
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 25, 2022
a0d4fd4
revert
galipremsagar Oct 25, 2022
e36e553
Add user facing docs
galipremsagar Oct 26, 2022
97dd34a
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Oct 26, 2022
e611c0b
improvements
galipremsagar Oct 26, 2022
4b2fd7f
improvements
galipremsagar Oct 26, 2022
27009e6
update tests
galipremsagar Oct 26, 2022
c8490ff
make get_weakref internal
galipremsagar Oct 26, 2022
18146a9
Apply suggestions from code review
galipremsagar Oct 26, 2022
c748a00
merge
galipremsagar Nov 2, 2022
1ef8349
align with pandas
galipremsagar Nov 3, 2022
58f6245
merge
galipremsagar Nov 7, 2022
0642c2e
fix tests
galipremsagar Nov 7, 2022
9603c86
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Nov 9, 2022
c8c917d
Merge remote-tracking branch 'upstream/branch-22.12' into c-o-w-2
galipremsagar Nov 16, 2022
0dcc7cc
changes
galipremsagar Nov 18, 2022
c11dfde
fix
galipremsagar Nov 18, 2022
f36fa25
Handle more cases
galipremsagar Nov 18, 2022
8c04594
merge with spilling changes
galipremsagar Nov 19, 2022
92fe5d7
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Nov 30, 2022
e97c851
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 1, 2022
f01a017
disable spilling + cow
galipremsagar Dec 1, 2022
18fda73
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 5, 2022
e18a9d9
fix issues
galipremsagar Dec 5, 2022
e250f6d
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 7, 2022
7411805
Move copy on write logic to a separate Buffer implementation
galipremsagar Dec 7, 2022
56ef249
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 7, 2022
e72eaa3
Use CachedInstanceMeta to ensure BufferWeakref is a singleton
galipremsagar Dec 7, 2022
978f379
type
galipremsagar Dec 7, 2022
87f7641
read only cai
galipremsagar Dec 7, 2022
057967b
add slots
galipremsagar Dec 7, 2022
2e8a929
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 7, 2022
ce12519
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 7, 2022
082202f
more validation
galipremsagar Dec 7, 2022
b55f039
More validation and rename cai to readonly cai
galipremsagar Dec 8, 2022
35a64a5
Refactor 1
shwina Dec 8, 2022
1abaca0
Rename
shwina Dec 8, 2022
eba1525
Refactor 2
shwina Dec 8, 2022
65cb7ac
Refactor 3
shwina Dec 8, 2022
2e8fc97
Apply suggestions from code review
galipremsagar Dec 13, 2022
b415fa1
Merge pull request #3 from shwina/c-o-w-2
galipremsagar Dec 13, 2022
1138c1f
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 13, 2022
2a72d9a
fix naming errors
galipremsagar Dec 14, 2022
3789255
Apply suggestions from code review
galipremsagar Dec 14, 2022
7ea425c
address review
galipremsagar Dec 14, 2022
b0ab29e
Merge branch 'c-o-w-2' of https://github.com/galipremsagar/cudf into …
galipremsagar Dec 14, 2022
ccfd064
Apply suggestions from code review
galipremsagar Dec 14, 2022
7b9b574
Merge branch 'c-o-w-2' of https://github.com/galipremsagar/cudf into …
galipremsagar Dec 14, 2022
93d449e
fix deep copies
galipremsagar Dec 14, 2022
6dbdf2f
Make _is_shared a property
galipremsagar Dec 14, 2022
4376702
use WeakKeyDictionary
galipremsagar Dec 14, 2022
87cadfe
rename
galipremsagar Dec 14, 2022
61260da
rename to _is_internally_referenced and _is_externally_referenced
galipremsagar Dec 14, 2022
499c902
make methods with no params as properties
galipremsagar Dec 14, 2022
4dd8927
revert
galipremsagar Dec 14, 2022
25e1be8
updates
galipremsagar Dec 14, 2022
35d513d
add Series copy tests
galipremsagar Dec 14, 2022
d4a9a3f
docs
galipremsagar Dec 14, 2022
dd2f61a
docs
galipremsagar Dec 14, 2022
3721450
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 15, 2022
71d7d88
update library design doc
galipremsagar Dec 15, 2022
2c95e85
Updated end user docs
galipremsagar Dec 15, 2022
d3bdd86
update docs
galipremsagar Dec 15, 2022
a43f1f9
review
galipremsagar Dec 15, 2022
8968adb
fix tracking of weakreferences
galipremsagar Dec 15, 2022
16aab17
Merge branch 'branch-23.02' into c-o-w-2
galipremsagar Dec 16, 2022
8e2fb6a
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 16, 2022
5bec54e
Fix data array view
galipremsagar Dec 16, 2022
937939b
Merge branch 'c-o-w-2' of https://github.com/galipremsagar/cudf into …
galipremsagar Dec 16, 2022
3639564
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Dec 28, 2022
b423b30
Merge branch 'branch-23.02' into c-o-w-2
galipremsagar Jan 3, 2023
97b4922
Merge branch 'c-o-w-2' of https://github.com/galipremsagar/cudf into …
galipremsagar Jan 3, 2023
1edef2b
style
galipremsagar Jan 3, 2023
bce9781
Hide cow details from column
galipremsagar Jan 3, 2023
ce32cf5
cleanup
galipremsagar Jan 3, 2023
2bad64f
Merge branch 'branch-23.02' into c-o-w-2
galipremsagar Jan 4, 2023
1106f95
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Jan 5, 2023
25511fc
use getattr_static
galipremsagar Jan 5, 2023
22198a3
update comment
galipremsagar Jan 5, 2023
80c48f1
use iter
galipremsagar Jan 5, 2023
0a21c83
separate setting _zero_copied from _unlink_shared_buffers
galipremsagar Jan 5, 2023
9d6c09e
Apply suggestions from code review
galipremsagar Jan 5, 2023
2514142
Merge branch 'c-o-w-2' of https://github.com/galipremsagar/cudf into …
galipremsagar Jan 5, 2023
141ca49
drop extra args for as_column
galipremsagar Jan 5, 2023
a93709e
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Jan 5, 2023
12ab4c6
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Jan 6, 2023
6356765
Apply suggestions from code review
galipremsagar Jan 6, 2023
56eb809
rename copy on write for consistency
galipremsagar Jan 6, 2023
6d5e121
merge
galipremsagar Jan 6, 2023
863da0b
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Jan 9, 2023
da343c5
address reviews
galipremsagar Jan 9, 2023
751294b
fix categorical copy
galipremsagar Jan 9, 2023
c0fe955
refactor option validators
galipremsagar Jan 9, 2023
2237af2
fix
galipremsagar Jan 9, 2023
e2900d8
update comments
galipremsagar Jan 9, 2023
5cacc1f
rename
galipremsagar Jan 9, 2023
4387eac
add more clarification
galipremsagar Jan 9, 2023
21f93fb
add to advantages
galipremsagar Jan 9, 2023
662756a
add weakref url
galipremsagar Jan 9, 2023
e78871c
Handle slice operation properly
galipremsagar Jan 9, 2023
d02a936
add a table
galipremsagar Jan 9, 2023
8fe99f4
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Jan 11, 2023
3f05a63
Apply suggestions from code review
galipremsagar Jan 11, 2023
67660a6
Merge branch 'c-o-w-2' of https://github.com/galipremsagar/cudf into …
galipremsagar Jan 11, 2023
59ac57e
Add self._instances
galipremsagar Jan 11, 2023
2a2876f
Add docstring
galipremsagar Jan 11, 2023
2fcb1f0
Add _get_cuda_array_interface
galipremsagar Jan 11, 2023
201d423
skip copy for host objects
galipremsagar Jan 12, 2023
8b589c2
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Jan 12, 2023
86d580c
Merge remote-tracking branch 'upstream/branch-23.02' into c-o-w-2
galipremsagar Jan 12, 2023
f123ad8
simplify copy
galipremsagar Jan 12, 2023
3f65c7a
docstring update
galipremsagar Jan 12, 2023
9a16606
add test
galipremsagar Jan 12, 2023
71f3ff4
flip if condition
galipremsagar Jan 12, 2023
71c4473
add comment
galipremsagar Jan 12, 2023
863a7ae
add more docstring
galipremsagar Jan 12, 2023
c230e94
Apply suggestions from code review
galipremsagar Jan 12, 2023
6e51a5a
update and address docs reviews
galipremsagar Jan 12, 2023
ab44c9e
Address doc reviews
galipremsagar Jan 12, 2023
2474f04
Merge branch 'c-o-w-2' of https://github.com/galipremsagar/cudf into …
galipremsagar Jan 12, 2023
f4c9114
Use only `ptr` as key (#5)
galipremsagar Jan 13, 2023
177 changes: 177 additions & 0 deletions docs/cudf/source/developer_guide/library_design.md
Expand Up @@ -316,3 +316,180 @@ The pandas API also includes a number of helper objects, such as `GroupBy`, `Rol
cuDF implements corresponding objects with the same APIs.
Internally, these objects typically interact with cuDF objects at the Frame layer via composition.
However, for performance reasons they frequently access internal attributes and methods of `Frame` and its subclasses.


## Copy-on-write


Copy-on-write (COW) is designed to reduce memory footprint on GPUs. With this feature enabled, a shallow copy (`.copy(deep=False)`) shares device memory with the original,
and a physical copy is made only when a write operation is performed on a column. We recommend first reading
the [user-facing documentation](copy-on-write-user-doc) of this functionality before reading through the internals
below.

The core copy-on-write implementation relies on the `CopyOnWriteBuffer` class. This class stores the pointer to the device memory and its size.
Using `CopyOnWriteBuffer.ptr` as a key, we generate [weak references](https://docs.python.org/3/library/weakref.html) to each `CopyOnWriteBuffer` and store them in `CopyOnWriteBuffer._instances`.
This is a mapping from `ptr` keys to `WeakSet`s containing references to `CopyOnWriteBuffer` objects. This
means that all new `CopyOnWriteBuffer`s map to the same key in `CopyOnWriteBuffer._instances` if they have the same `.ptr`,
i.e., if they all point to the same device memory.
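
The bookkeeping described above can be sketched in plain Python. This is a toy model under stated assumptions, not cuDF's actual code: `ToyBuffer`, its integer `ptr`, and the `is_shared` property are hypothetical names chosen for illustration; only the idea of a `ptr`-keyed mapping of `WeakSet`s comes from the text.

```python
import gc
import weakref
from collections import defaultdict


class ToyBuffer:
    """Hypothetical stand-in for a buffer; `ptr` is just an integer here."""

    # Maps ptr -> WeakSet of live ToyBuffer objects pointing at that memory.
    _instances = defaultdict(weakref.WeakSet)

    def __init__(self, ptr, size):
        self.ptr = ptr
        self.size = size
        # Register under the pointer. A WeakSet does not keep the buffer
        # alive, so the entry disappears once the buffer is collected.
        ToyBuffer._instances[ptr].add(self)

    @property
    def is_shared(self):
        # Shared means more than one live buffer is registered for this ptr.
        return len(ToyBuffer._instances[self.ptr]) > 1


a = ToyBuffer(ptr=0x1000, size=32)
b = ToyBuffer(ptr=0x1000, size=32)  # same (pretend) memory as `a`
c = ToyBuffer(ptr=0x2000, size=32)  # different memory

shared_before = (a.is_shared, b.is_shared, c.is_shared)

del b          # drop the only strong reference to `b`
gc.collect()   # make collection explicit for non-refcounting interpreters
shared_after = a.is_shared
```

Because the set holds only weak references, no explicit "unregister" step is needed when a buffer goes away, which is the property the design relies on.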

When the cudf option `"copy_on_write"` is `True`, `as_buffer` will always return a `CopyOnWriteBuffer`. This class contains all the
mechanisms to enable copy-on-write for all buffers. When a `CopyOnWriteBuffer` is created, its weakref is generated and added to the `WeakSet`, which is in turn stored in `CopyOnWriteBuffer._instances`. This later serves as an indication of whether or not to make a copy
when a write operation is performed on a `Column` (see below).
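
To illustrate how that indication might drive the copy decision, here is a minimal sketch in which a `bytearray` plays the role of device memory and `id()` plays the role of `ptr`. The class `ToyCOWBuffer` and its `shallow_copy`, `write`, and `read` methods are assumptions made for this example and do not mirror cuDF's real method names.

```python
import weakref
from collections import defaultdict


class ToyCOWBuffer:
    """Hypothetical model: a bytearray stands in for device memory."""

    _instances = defaultdict(weakref.WeakSet)  # key: id() of the memory

    def __init__(self, memory):
        self._memory = memory
        ToyCOWBuffer._instances[self.ptr].add(self)

    @property
    def ptr(self):
        return id(self._memory)

    def shallow_copy(self):
        # No data is copied: the new buffer points at the same memory.
        return ToyCOWBuffer(self._memory)

    def write(self, index, value):
        siblings = ToyCOWBuffer._instances[self.ptr]
        if len(siblings) > 1:
            # Shared: leave the old group, copy the data, join a new group.
            siblings.discard(self)
            self._memory = bytearray(self._memory)
            ToyCOWBuffer._instances[self.ptr].add(self)
        self._memory[index] = value

    def read(self):
        return bytes(self._memory)


original = ToyCOWBuffer(bytearray([1, 2, 3, 4]))
copy = original.shallow_copy()
same_before = original.ptr == copy.ptr

copy.write(0, 10)                 # shared, so a true copy is made first
original_after = original.read()  # the original is untouched

ptr_before = original.ptr
original.write(1, 20)             # no longer shared, so written in place
ptr_unchanged = original.ptr == ptr_before
```

The second write shows the other half of the rule: once a buffer is the only one left in its group, writes go straight to the existing memory.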


### Eager copies when exposing to third-party libraries

If a `Column`/`CopyOnWriteBuffer` is exposed to a third-party library via `__cuda_array_interface__`, we are no longer able to track whether or not the buffer has been modified without introspection. Hence, whenever
someone accesses data through the `__cuda_array_interface__`, we eagerly trigger the copy by calling
`_unlink_shared_buffers`, which ensures a true copy of the underlying device data is made and
unlinks the buffer from any shared "weak" references. Any future shallow-copy request must also trigger a true physical copy, since we cannot track the lifetime of the third-party object. To handle this, we also set
`obj._zero_copied=True` on the `Column`/`CopyOnWriteBuffer`, indicating that any future shallow-copy request will trigger a true physical copy
rather than a copy-on-write shallow copy with weak references.
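
The eager-copy rule can be modeled roughly as follows. This is a sketch only: `ToyExposableBuffer` and its `expose` method are hypothetical stand-ins for handing memory out through `__cuda_array_interface__`, while `_unlink_shared_buffers` and `_zero_copied` reuse the names from the text without claiming to match their real implementations.

```python
import weakref
from collections import defaultdict


class ToyExposableBuffer:
    """Hypothetical model of the eager-copy rule; not cuDF's actual class."""

    _instances = defaultdict(weakref.WeakSet)

    def __init__(self, memory):
        self._memory = memory
        self._zero_copied = False
        ToyExposableBuffer._instances[id(memory)].add(self)

    def _unlink_shared_buffers(self):
        group = ToyExposableBuffer._instances[id(self._memory)]
        if len(group) > 1:
            group.discard(self)
            self._memory = bytearray(self._memory)  # true copy
            ToyExposableBuffer._instances[id(self._memory)].add(self)

    def expose(self):
        """Stand-in for exposing memory to a third-party consumer."""
        self._unlink_shared_buffers()
        self._zero_copied = True
        return self._memory  # the consumer may now mutate this freely

    def copy(self, deep=False):
        if deep or self._zero_copied:
            # Exposed memory can change behind our back, so copy eagerly.
            return ToyExposableBuffer(bytearray(self._memory))
        return ToyExposableBuffer(self._memory)


buf = ToyExposableBuffer(bytearray([1, 2, 3]))
shallow = buf.copy()      # shares memory with `buf`
external = buf.expose()   # eager copy: `buf` stops sharing with `shallow`
external[0] = 99          # a third-party write we cannot observe
later = buf.copy()        # deep=False, yet a true copy is made
```

Here `shallow` keeps the original values even though the external consumer wrote through `buf`, and `later` is physically separate from `buf` despite being requested as a shallow copy.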

### How to obtain a read-only object?

A read-only object can be quite useful for operations that will not
mutate the data. One can be obtained by calling the `._get_readonly_proxy_obj`
API, which returns a proxy object that has `__cuda_array_interface__`
implemented and will not trigger a deep copy even if the `CopyOnWriteBuffer`
has weak references. We recommend using this API only when
the objects/arrays created from the proxy object are cleaned up during
the developer code execution. We currently use this API for device-to-host
copies, such as in `ColumnBase._data_array_view`, which is used for `Column.values_host`.
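
As a rough analogy for the read-only path, the sketch below uses Python's built-in `memoryview.toreadonly()` in place of a GPU proxy object. Everything here is an assumption for illustration: `ToyProxyBuffer`, its `shared` flag, `exposed_view`, `readonly_proxy`, and the `copies_made` counter are hypothetical and say nothing about how `_get_readonly_proxy_obj` is actually implemented.

```python
class ToyProxyBuffer:
    """Hypothetical model; `copies_made` counts true copies for illustration."""

    def __init__(self, memory):
        self._memory = memory
        self.shared = False   # pretend flag: other buffers use this memory
        self.copies_made = 0

    def exposed_view(self):
        # Writable exposure: detach from any sharers first.
        if self.shared:
            self._memory = bytearray(self._memory)
            self.shared = False
            self.copies_made += 1
        return memoryview(self._memory)

    def readonly_proxy(self):
        # Read-only exposure: no copy is needed, even when shared.
        return memoryview(self._memory).toreadonly()


buf = ToyProxyBuffer(bytearray([1, 2, 3]))
buf.shared = True

write_rejected = False
# The `with` block releases the view promptly, mirroring the advice that
# objects created from the proxy should be short-lived.
with buf.readonly_proxy() as view:
    host_values = list(view)   # a "device to host" style read
    try:
        view[0] = 42
    except TypeError:
        write_rejected = True

copies_after_read = buf.copies_made
buf.exposed_view()             # the writable path does pay for a copy
copies_after_write_view = buf.copies_made
```

The contrast between the two counters is the point: reading through the proxy costs no copy, whereas handing out a writable view of shared memory does.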

Notes:
1. Weak references are implemented only for fixed-width data types, as these are the only column
types that can be mutated in place.
2. Deep copies of variable-width data types return shallow copies of the Columns, because these
types don't support real in-place mutation of the data. We only mimic an in-place operation
using `_mimic_inplace`.


### Examples

When copy-on-write is enabled, taking a shallow copy of a `Series` or a `DataFrame` does not
eagerly create a copy of the data. Instead, it produces a view that will be lazily
copied when a write operation is performed on any of its copies.

Let's create a series:

```python
>>> import cudf
>>> cudf.set_option("copy_on_write", True)
>>> s1 = cudf.Series([1, 2, 3, 4])
```

Make a copy of `s1`:
```python
>>> s2 = s1.copy(deep=False)
```

Make another copy, but of `s2`:
```python
>>> s3 = s2.copy(deep=False)
```

Viewing the data and memory addresses shows that they all point to the same device memory:
```python
>>> s1
0 1
1 2
2 3
3 4
dtype: int64
>>> s2
0 1
1 2
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64

>>> s1.data._ptr
139796315897856
>>> s2.data._ptr
139796315897856
>>> s3.data._ptr
139796315897856
```

Now, when we perform a write operation on one of them, say on `s2`, a new copy is created
for `s2` on device and then modified:

```python
>>> s2[0:2] = 10
>>> s2
0 10
1 10
2 3
3 4
dtype: int64
>>> s1
0 1
1 2
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64
```

If we inspect the memory address of the data, `s1` and `s3` still share the same address but `s2` has a new one:

```python
>>> s1.data._ptr
139796315897856
>>> s3.data._ptr
139796315897856
>>> s2.data._ptr
139796315899392
```

Now, performing a write operation on `s1` will trigger a new copy in device memory, because
its memory is still shared with `s3` through a weak reference:

```python
>>> s1[0:2] = 11
>>> s1
0 11
1 11
2 3
3 4
dtype: int64
>>> s2
0 10
1 10
2 3
3 4
dtype: int64
>>> s3
0 1
1 2
2 3
3 4
dtype: int64
```

If we inspect the memory address of the data, the addresses of `s2` and `s3` remain unchanged, but `s1`'s memory address has changed because of a copy operation performed during the writing:

```python
>>> s2.data._ptr
139796315899392
>>> s3.data._ptr
139796315897856
>>> s1.data._ptr
139796315879723
```

The cudf copy-on-write implementation is motivated by the pandas copy-on-write proposal:
1. [Google doc](https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.iexejdstiz8u)
2. [Github issue](https://github.com/pandas-dev/pandas/issues/36195)
169 changes: 169 additions & 0 deletions docs/cudf/source/user_guide/copy-on-write.md
@@ -0,0 +1,169 @@
(copy-on-write-user-doc)=

# Copy-on-write

Copy-on-write reduces GPU memory usage when shallow copies (`.copy(deep=False)`) of a column
are made.

| | Copy-on-Write enabled | Copy-on-Write disabled (default) |
|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `.copy(deep=True)` | A true copy is made and changes don't propagate to the original object. | A true copy is made and changes don't propagate to the original object. |
| `.copy(deep=False)` | Memory is shared between the two objects, but any write operation on one object will trigger a true physical copy before the write is performed. Hence changes will not propagate to the original object. | Memory is shared between the two objects and changes performed on one will propagate to the other object. |

## How to enable it

i. Use `cudf.set_option`:

```python
>>> import cudf
>>> cudf.set_option("copy_on_write", True)
```

ii. Set the environment variable ``CUDF_COPY_ON_WRITE`` to ``1`` prior to the
launch of the Python interpreter:

```bash
CUDF_COPY_ON_WRITE="1" python -c "import cudf"
```


## Making copies

There are no additional changes required in the code to make use of copy-on-write.

```python
>>> series = cudf.Series([1, 2, 3, 4])
```

Performing a shallow copy will create a new Series object pointing to the
same underlying device memory:

```python
>>> copied_series = series.copy(deep=False)
>>> series
0 1
1 2
2 3
3 4
dtype: int64
>>> copied_series
0 1
1 2
2 3
3 4
dtype: int64
```

When a write operation is performed on either ``series`` or
``copied_series``, a true physical copy of the data is created:

```python
>>> series[0:2] = 10
>>> series
0 10
1 10
2 3
3 4
dtype: int64
>>> copied_series
0 1
1 2
2 3
3 4
dtype: int64
```


## Notes

When copy-on-write is enabled, there is no concept of views, i.e., modifying a view created inside cudf will not modify
the original object it was viewing; instead, a separate copy is created and then modified.

## Advantages

1. With copy-on-write enabled, requesting `.copy(deep=False)` can reduce GPU memory usage substantially if you do not perform
write operations on all of those copies. It also speeds up object creation in your ETL workflow, since no data is copied up front.

Review comment (Contributor): Is this something that happens a lot in workflows? It feels like quite a weak reason.

2. With the concept of views going away, every object is a copy of its original object. This brings consistency across operations and brings cudf closer to parity with
pandas. The following shows one such inconsistency, with copy-on-write disabled in both libraries:

Review comment (Contributor), on lines +86 to +87: I don't think this example shows what you want it to.

  1. It is unclear to the reader if copy-on-write is enabled or disabled in this scenario
  2. The behaviour looks like copy-on-write is enabled, but I think it is disabled so the inconsistency with pandas is that cudf is making a deep copy when it is not expected
  3. How would enabling copy-on-write fix this inconsistency? The cudf result would remain unchanged, but now it is "expected"?

I think you need to (at least) link to the pandas docs on copy-on-write.

Reply (Contributor Author): Made some updates to clarify a bit more. Let me know what you think and how you would want it to improve, if at all.
Unfortunately, copy-on-write is undocumented in the pandas public docs :(


```python

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 10
1 2
2 3
3 4
4 5
dtype: int64

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64
```

The above inconsistency is resolved when copy-on-write is enabled:

```python
>>> import pandas as pd
>>> pd.set_option("mode.copy_on_write", True)
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64


>>> import cudf
>>> cudf.set_option("copy_on_write", True)
>>> s = cudf.Series([1, 2, 3, 4, 5])
>>> s1 = s[0:2]
>>> s1[0] = 10
>>> s1
0 10
1 2
dtype: int64
>>> s
0 1
1 2
2 3
3 4
4 5
dtype: int64
```

## How to disable it


Copy-on-write can be disabled by setting the ``copy_on_write`` cudf option to ``False``:

```python
>>> cudf.set_option("copy_on_write", False)
```
1 change: 1 addition & 0 deletions docs/cudf/source/user_guide/index.md
Expand Up @@ -13,4 +13,5 @@ guide-to-udfs
cupy-interop
options
PandasCompat
copy-on-write
```