
[REVIEW] Copy on write implementation #11718

Merged: 193 commits, Jan 13, 2023

Conversation

@galipremsagar (Contributor) commented Sep 20, 2022

This PR introduces copy-on-write. As the name suggests, when copy-on-write is enabled and a shallow copy of a column is made, both columns share the same memory; a true copy is triggered only when a write operation is performed on either the parent or any of its copies. Copy-on-write (c-o-w) can be enabled in two ways:

  1. Setting the CUDF_COPY_ON_WRITE environment variable to 1 enables c-o-w; unsetting it disables c-o-w.
  2. Setting the copy_on_write option in cudf options: cudf.set_option("copy_on_write", True) enables it and cudf.set_option("copy_on_write", False) disables it.

Note: Copy-on-write is not enabled by default; it is being introduced as an opt-in.
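The semantics can be illustrated with a minimal pure-Python sketch (all names here are hypothetical stand-ins, not cudf's actual classes): shallow copies share one buffer and a shared refcount, and the first write by any owner triggers the one true copy.

```python
class CowColumn:
    """Toy copy-on-write column: shallow copies share one list until a write."""

    def __init__(self, data, shared=None):
        self._data = data
        # All owners of this data share a single refcount cell.
        self._shared = shared if shared is not None else [1]

    def copy(self, deep=False):
        if deep:
            return CowColumn(list(self._data))
        self._shared[0] += 1
        return CowColumn(self._data, self._shared)  # zero-copy view

    def __setitem__(self, index, value):
        if self._shared[0] > 1:            # memory is shared: detach first
            self._shared[0] -= 1
            self._data = list(self._data)  # the true copy happens here
            self._shared = [1]
        self._data[index] = value


s = CowColumn([0, 1, 2])
s_copy = s.copy(deep=False)
assert s_copy._data is s._data      # same memory, no copy yet
s[1] = 99                           # write triggers the true copy
assert s._data == [0, 99, 2]
assert s_copy._data == [0, 1, 2]    # the shallow copy is left untouched
```

This mirrors the memory numbers in the session below: the shallow copy costs nothing until the `s[2] = 10001` write, at which point both sides pay for their own buffer.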

A valid performance comparison can be done only with copy_on_write=OFF + .copy(deep=True) vs copy_on_write=ON + .copy(deep=False):

In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000))

# branch-23.02 : 1209MiB
# This-PR : 1209MiB

In [3]: s_copy = s.copy(deep=True) #branch-23.02
In [3]: s_copy = s.copy(deep=False) #This-PR

# branch-23.02 : 1973MiB
# This-PR : 1209MiB

In [4]: s
Out[4]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [5]: s_copy
Out[5]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [6]: s[2] = 10001

# branch-23.02 : 3121MiB
# This-PR : 3121MiB

In [7]: s
Out[7]: 
0                  0
1                  1
2              10001
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

In [8]: s_copy
Out[8]: 
0                  0
1                  1
2                  2
3                  3
4                  4
              ...   
99999995    99999995
99999996    99999996
99999997    99999997
99999998    99999998
99999999    99999999
Length: 100000000, dtype: int64

Stats around the performance and memory gains:

  • New copies add zero GPU memory overhead, i.e., users save 2x, 5x, 10x, ... 20x memory when making 2x, 5x, 10x, ... 20x deep copies respectively. So, the more you copy the more you save 😉 (as long as you don't write to all of them).
  • Copying times are cut by 99% for all dtypes when copy-on-write is enabled (copy_on_write=OFF + .copy(deep=True) vs copy_on_write=ON + .copy(deep=False)).
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a': range(0, 1000000)})

In [3]: df = cudf.DataFrame({'a': range(0, 100000000)})

In [4]: df['b'] = df.a.astype('str')

# GPU memory usage
# branch-23.02 : 2345MiB
# This-PR : 2345MiB

In [5]: df
Out[5]: 
                 a         b
0                0         0
1                1         1
2                2         2
3                3         3
4                4         4
...            ...       ...
99999995  99999995  99999995
99999996  99999996  99999996
99999997  99999997  99999997
99999998  99999998  99999998
99999999  99999999  99999999

[100000000 rows x 2 columns]

In [6]: def make_two_copies(df, deep):
   ...:     return df.copy(deep=deep), df.copy(deep=deep)
   ...: 

In [7]: x, y = make_two_copies(df, deep=True) # branch-23.02
In [7]: x, y = make_two_copies(df, deep=False) # This PR

# GPU memory usage
# branch-23.02 : 6147MiB
# This-PR : 2345MiB

In [8]: %timeit make_two_copies(df, deep=True) # branch-23.02
In [8]: %timeit make_two_copies(df, deep=False) # This PR

# Execution times
# branch-23.02 : 135 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 100 µs ± 879 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
  • Even when copy-on-write is disabled, deep copies of string, list & struct columns are now 99% faster
In [1]: import cudf

In [2]: s = cudf.Series(range(0, 100000000), dtype='str')

In [3]: %timeit s.copy(deep=True)


# branch-23.02 : 28.3 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This PR : 19.9 µs ± 93.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [9]: s = cudf.Series([[1, 2], [2, 3], [3, 4], [4, 5], [6, 7]]* 10000000)

In [10]: %timeit s.copy(deep=True)
# branch-23.02 : 25.7 ms ± 5.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 44.2 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [4]: df = cudf.DataFrame({'a': range(0, 100000000), 'b': range(0, 100000000)[::-1]})

In [5]: s = df.to_struct()

In [6]: %timeit s.copy(deep=True)

# branch-23.02 : 42.5 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This-PR : 89.7 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
  • Add pytests
  • Docs page explaining copy on write and how to enable/disable it.

@vyasr (Contributor) left a comment

Overall I'm pretty happy with this PR. The level of complexity is significantly lower than I had originally feared, especially after looking through spilling. My main outstanding concerns are around 1) the different new properties introduced for the data and mask, and 2) the potential confusion around the different methods for accessing pointers, especially as it pertains to their different behaviors in base/COW/spilling. We also need to deal with the issue Lawrence raised about slices, but hopefully my suggestion there will be sufficient. I don't want the PR's scope to sprawl, but I also don't want to merge anything that adds even more complexity in those areas, especially if it does so in ways that make the code harder to understand even when COW or spilling are disabled. We may want to meet and just discuss minimal solutions for those problems.

Aside from the API-level implications of COW, I would also like for us to come out of this project with a consistent story on column mutability (or at least a plan for having one). @shwina and I have previously discussed that we would love for columns to be immutable, and that has been about 90% true for a long time. COW gets us closer, but the addition of the mutable_ptr attribute for buffers is a clear indicator that we have one important exception (__setitem__ calls) that may always have to exist. I think finding a clear way not only to document this, but to design the API around this fact, should help reduce some of the confusion in concern (2) above regarding the different accessors.

Regarding the docs guides, I'm happy to make PRs into yours with proposed changes to the docs so that you don't have to implement all of my suggestions (they're definitely quite a bit of work). I didn't make any new comments there in this review, but there are still some outstanding major requests from the first round (like structuring the developer docs around having already read the user guide) that I am happy to draft for review to help move things forward.

python/cudf/cudf/core/buffer/utils.py
docs/cudf/source/user_guide/copy-on-write.md
else:
data = <void*><uintptr_t>(col.base_data.ptr)
data = <void*><uintptr_t>(col.base_data.mutable_ptr)
Contributor:

I agree, my first review was me getting a lot of ideas out without really thinking through scoping, but I'm certainly in favor of limiting the scope of this PR and not trying to rewrite existing internals more than absolutely necessary. We can do that in follow-ups.

python/cudf/cudf/_lib/column.pyx
python/cudf/cudf/core/buffer/cow_buffer.py
def _data_array_view(self) -> "cuda.devicearray.DeviceNDArray":
"""
Internal implementation for viewing the data as a device array object
without triggering a deep-copy.
Contributor:

I want to limit the scope of this PR, but TBH I also don't want to merge it until we at least have a plan for how to handle this. The spilling code already makes buffer internals too hard to follow without some serious care. I think we need to put a hard cap on any changes that increase the complexity further without having a plan for how to fix it. Otherwise we will quickly find ourselves in a state where nobody will be able to debug this code without spending way too much time relearning each time (I still feel like I have to relearn the spilling internals every time I review a spilling PR, which I find quite concerning and I think we need to prioritize reducing that complexity and mitigating the fallout).

python/cudf/cudf/core/column/column.py Outdated Show resolved Hide resolved
"Spilling is not supported when Copy-on-write is enabled. "
"Please set `copy_on_write` to `False`"
)
except KeyError:
Contributor:

IIUC the option being registered below should indeed guarantee that we won't get a KeyError

rhs, cudf.core.column.CategoricalColumn
):
assert_column_memory_eq(lhs.categories, rhs.categories)
assert_column_memory_eq(lhs.codes, rhs.codes)
Contributor:

Why was this change necessary when it wasn't before? What difference does COW make here?

@galipremsagar (Contributor, Author) — Jan 12, 2023:

This isn't COW related. This was necessary to specifically test the special handling in CategoricalColumn.copy, where we particularly need to deep-copy the categories. See test_index_copy_category below, which now uses this.

Contributor:

Ah got it that makes sense.

@@ -186,14 +208,21 @@ def ptr(self) -> int:
"""Device pointer to the start of the buffer."""
return self._ptr

@property
def mutable_ptr(self) -> int:
Contributor:

I think it would be worth us meeting (with other interested parties like @wence- and @shwina) to discuss the minimal changes needed to address the issues with confusing APIs introduced in this PR.

Comment on lines +210 to +217
# Because hasattr will trigger invocation of
# `__cuda_array_interface__` which could
# be expensive in CopyOnWriteBuffer case.
value_cai = inspect.getattr_static(
value,
"__cuda_array_interface__",
None
)
Member:

What about something like this?

Suggested change
# Because hasattr will trigger invocation of
# `__cuda_array_interface__` which could
# be expensive in CopyOnWriteBuffer case.
value_cai = inspect.getattr_static(
value,
"__cuda_array_interface__",
None
)
has_cai = isinstance(value, Buffer) or hasattr(value, "__cuda_array_interface__")

Or even better, we could create a small PR that implements a utility function to access __cuda_array_interface__, which we could then use exclusively. This would also simplify SpillableBuffer.
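The distinction the comment draws can be demonstrated in isolation: `inspect.getattr_static` (a real stdlib function) retrieves an attribute without triggering descriptors, so an expensive `__cuda_array_interface__` property is never invoked. The class below is a stand-in, not cudf's buffer:

```python
import inspect


class FakeBuffer:
    calls = 0  # counts how many times the property body actually runs

    @property
    def __cuda_array_interface__(self):
        # Pretend this is expensive (e.g. triggers a copy or an unspill).
        FakeBuffer.calls += 1
        return {"shape": (3,), "typestr": "<i8", "data": (0, False), "version": 3}


buf = FakeBuffer()

# hasattr invokes the property...
hasattr(buf, "__cuda_array_interface__")
assert FakeBuffer.calls == 1

# ...but getattr_static only inspects the class, never calling the getter.
cai = inspect.getattr_static(buf, "__cuda_array_interface__", None)
assert FakeBuffer.calls == 1   # property body was not invoked again
assert cai is not None         # attribute presence is still detected
```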

Contributor (Author):

I'm planning to add a utility to remove this code duplication across the repo.

@galipremsagar galipremsagar changed the base branch from branch-23.02 to copy-on-write January 13, 2023 19:03
* Use only ptr as key

* assert ptrs

* use self._ptr

* Update python/cudf/cudf/_lib/column.pyx
@galipremsagar (Contributor, Author) commented Jan 13, 2023

@shwina, @vyasr, and I synced up offline and came up with the following to-do items:

  • @shwina created a feature branch copy-on-write, we will merge this PR into it.
  • Prem to open a follow-up PR into copy-on-write making buffer.get_ptr(mode=r/w) and Column.data_array_view(mode=r/w) the APIs. No Buffer.ptr or .mutable_ptr should remain. If things break we can fix them in other repos, but we expect little disruption outside of RAPIDS. This also means getting rid of the spill_lock argument to SpillableBuffer.get_ptr(); in its place, we want get_ptr() to always be able to call get_spill_lock(), which means only using it in an acquire_spill_lock context. [REVIEW] Change ways to access ptr in Buffer #12587
  • Add stronger reasons to the Advantages section in user docs for COW.
  • Needs some final passes through docs.
  • Once all of the above are completed merge back copy-on-write to branch-23.02

Far in the future items(next releases):

  • To think of how to avoid copying of an entire Buffer when we are modifying a slice of it.
  • Try to get rid of cuda_array_interface at Buffer level and see its feasibility.
  • Ensuring that dask-cuda and ucx-py aren't interacting with cudf at the buffer level.
  • Devising a single long-term API to expose cudf's data pointers externally.
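A rough sketch of the proposed `get_ptr(mode=...)` shape from the to-do list above, assuming the COW detach-on-write behavior described earlier (the class body is illustrative only, not the implementation that landed in #12587):

```python
class Buffer:
    """Toy buffer exposing a single pointer accessor with an access mode."""

    def __init__(self, ptr):
        self._ptr = ptr
        self._detached = False

    def get_ptr(self, mode):
        if mode not in ("read", "write"):
            raise ValueError("mode must be 'read' or 'write'")
        if mode == "write":
            # Under COW, a write-mode access must detach from any
            # shared owners before handing out the pointer.
            self._unlink_shared_buffers()
        return self._ptr

    def _unlink_shared_buffers(self):
        # A real implementation would copy the memory if it is shared.
        self._detached = True


buf = Buffer(ptr=0x1000)
assert buf.get_ptr(mode="read") == 0x1000   # read access: no detach
assert not buf._detached
assert buf.get_ptr(mode="write") == 0x1000  # write access: detach first
assert buf._detached
```

A single accessor with an explicit mode replaces the `ptr` / `mutable_ptr` pair, so callers state their intent at the call site instead of choosing between properties.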

@shwina (Contributor) left a comment

I'm good with merging this to the copy-on-write branch.

@shwina shwina merged commit bd72a17 into rapidsai:copy-on-write Jan 13, 2023
rapids-bot bot pushed a commit that referenced this pull request Jan 19, 2023
This simplifies the code and is a step towards `copy-on-write`: #11718 (comment)

cc. @galipremsagar, @shwina, @vyasr

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ashwin Srinath (https://github.com/shwina)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #12564
rapids-bot bot pushed a commit that referenced this pull request Jan 26, 2023
This PR:
With the introduction of `copy-on-write` (#11718) and `spilling`, there are different ways each `Buffer` implementation expects the code to behave when `.ptr` is accessed: in the case of COW, a `ptr` can be accessed for read or write purposes; in the case of spilling, the ptr can be accessed with a spill lock, and the buffer becomes spillable again after the spill lock goes out of scope. For these reasons we introduced `ptr`, `mutable_ptr`, `data_array_view`, `_data_array_view`, `mask_array_view`, and `_mask_array_view`. With so many ways to access buffer pointers and array views, it has become quite difficult for devs to know when to use what unless they are fully familiar with the implementation details. It would also lead to a lot of special-case handling for `Buffer`, `SpillableBuffer`, and `CopyOnWriteBuffer`. For this reason, we have decided to simplify fetching the pointer with a `get_ptr(mode="read"/"write")` API; fetching data & mask array views will also become methods that accept `mode`, like `data_array_view(mode="read"/"write")` & `mask_array_view(mode="read"/"write")`. The expectation is that when the caller passes "read", they don't tamper with the buffer or the memory it points to; in the case of "write", they are free to mutate the memory it points to.

Note that even with `mode="read"/"write"` the caller should still, if appropriate, acquire a spill lock for the duration of the access. If this is not done, and the buffer is a `SpillableBuffer`, it will permanently be marked as unspillable.

- [x] Introduces `get_ptr()` to replace `ptr` property. 
- [x] Replaces `data_array_view` & `mask_array_view` methods with `data_array_view(mode=r/w)` & `mask_array_view(mode=r/w)`
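The spill-lock caveat above can be modeled in a few lines (a toy stand-in, not the real `SpillableBuffer`): exposing a raw pointer without an active spill lock permanently pins the buffer, since the pointer could outlive any safe window for moving the memory.

```python
class ToySpillableBuffer:
    """Toy model of the spill-lock rule described in the commit message."""

    def __init__(self):
        self.spillable = True
        self._spill_locked = False

    def spill_lock(self):
        # The real code scopes this with an acquire_spill_lock context.
        self._spill_locked = True

    def get_ptr(self, mode):
        if not self._spill_locked:
            # No spill lock: permanently pin the buffer in device memory,
            # because the returned pointer may be held indefinitely.
            self.spillable = False
        return 0x1000  # stand-in device pointer


unlocked = ToySpillableBuffer()
unlocked.get_ptr(mode="read")
assert not unlocked.spillable       # marked unspillable forever

locked = ToySpillableBuffer()
locked.spill_lock()
locked.get_ptr(mode="read")
assert locked.spillable             # still spillable once the lock is released
```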

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - https://github.com/brandon-b-miller
  - Lawrence Mitchell (https://github.com/wence-)
  - Ashwin Srinath (https://github.com/shwina)

URL: #12587
rapids-bot bot pushed a commit that referenced this pull request Feb 16, 2023
This PR contains the changes from #11718, primarily to enable the copy-on-write feature in cudf.

(The commit message restates the PR description above verbatim.)

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12619
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Python) Reviewer labels Feb 23, 2024
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.