Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: New copy / view semantics using Copy-on-Write #46958

Merged
merged 37 commits into from
Aug 20, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
ea0bff8
API: New copy / view semantics using Copy-on-Write
jorisvandenbossche May 6, 2022
74981f9
fix more tests
jorisvandenbossche May 7, 2022
36faac9
Handle CoW in BM.iset
jorisvandenbossche May 9, 2022
ffa24a4
Handle CoW in xs
jorisvandenbossche May 9, 2022
8c00bbd
add bunch of todo comments and usage warnings
jorisvandenbossche May 9, 2022
af5a02c
Insert None ref in BM.insert
jorisvandenbossche May 9, 2022
a64f310
Ensure to not assume copy_on_write is set in case of ArrayManager
jorisvandenbossche May 9, 2022
e6c0baa
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche May 10, 2022
977cbb8
Handle refs in BM._combine / test CoW in select_dtypes
jorisvandenbossche May 10, 2022
cc367f5
handle get_numeric_data for single block manager
jorisvandenbossche May 10, 2022
53a8273
fix test_internals (get_numeric_data now uses CoW)
jorisvandenbossche May 11, 2022
d3d26f4
handle refs in consolidation
jorisvandenbossche May 20, 2022
5360e57
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche May 20, 2022
e431bc4
fix deep=None for ArrayManager
jorisvandenbossche May 20, 2022
a2d8fde
update copy/view tests from other PR
jorisvandenbossche May 20, 2022
81f6eae
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche May 23, 2022
b2a1428
clean-up fast_xs workarounds now it returns a SingleBlockManager
jorisvandenbossche May 23, 2022
4b1ccf6
tracks refs in to_frame
jorisvandenbossche May 23, 2022
40f2b24
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche Jun 8, 2022
9339f15
fixup after updata main and column_setitem + iloc inplace setitem cha…
jorisvandenbossche Jul 16, 2022
358deaf
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche Jul 16, 2022
085d15b
fix inplace fillna + fixup new tests
jorisvandenbossche Jul 16, 2022
377f789
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche Jul 17, 2022
7dcefe8
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche Jul 21, 2022
773f03f
address comments + update some todo comments
jorisvandenbossche Jul 21, 2022
bfc2117
Update pandas/core/internals/managers.py
jorisvandenbossche Jul 21, 2022
89c4fff
fixup linting
jorisvandenbossche Jul 21, 2022
b2e56ea
update new copy_view tests to use get_array helper
jorisvandenbossche Jul 21, 2022
fdeb154
add comment to setitem
jorisvandenbossche Jul 21, 2022
c521161
switch default to False, ensure CoW copies only happen when enabled +…
jorisvandenbossche Jul 21, 2022
fb297df
update type annotations
jorisvandenbossche Jul 21, 2022
9ab9df5
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche Aug 10, 2022
14f753e
Fix stata issue to avoid SettingWithCopyWarning in read_stata
jorisvandenbossche Aug 10, 2022
f7389b1
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche Aug 17, 2022
764838f
update type + option comment
jorisvandenbossche Aug 17, 2022
efde1bf
Merge remote-tracking branch 'upstream/main' into blockmanager-cow
jorisvandenbossche Aug 18, 2022
3d1377b
fixup new rename test
jorisvandenbossche Aug 19, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .github/workflows/ubuntu.yml
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,10 @@ jobs:
extra_apt: "language-pack-zh-hans"
lang: "zh_CN.utf8"
lc_all: "zh_CN.utf8"
- name: "Copy-on-Write"
env_file: actions-310.yaml
pattern: "not slow and not network and not single_cpu"
pandas_copy_on_write: "1"
- name: "Data Manager"
env_file: actions-38.yaml
pattern: "not slow and not network and not single_cpu"
Expand Down Expand Up @@ -84,6 +88,7 @@ jobs:
LC_ALL: ${{ matrix.lc_all || '' }}
PANDAS_TESTING_MODE: ${{ matrix.pandas_testing_mode || '' }}
PANDAS_DATA_MANAGER: ${{ matrix.pandas_data_manager || 'block' }}
PANDAS_COPY_ON_WRITE: ${{ matrix.pandas_copy_on_write || '0' }}
TEST_ARGS: ${{ matrix.test_args || '' }}
PYTEST_WORKERS: ${{ contains(matrix.pattern, 'not single_cpu') && 'auto' || '1' }}
PYTEST_TARGET: ${{ matrix.pytest_target || 'pandas' }}
Expand Down
13 changes: 9 additions & 4 deletions pandas/_libs/internals.pyx
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
from collections import defaultdict
import weakref

cimport cython
from cpython.slice cimport PySlice_GetIndicesEx
Expand Down Expand Up @@ -674,8 +675,9 @@ cdef class BlockManager:
public list axes
public bint _known_consolidated, _is_consolidated
public ndarray _blknos, _blklocs
public list refs

def __cinit__(self, blocks=None, axes=None, verify_integrity=True):
def __cinit__(self, blocks=None, axes=None, refs=None, verify_integrity=True):
# None as defaults for unpickling GH#42345
if blocks is None:
# This adds 1-2 microseconds to DataFrame(np.array([]))
Expand All @@ -687,6 +689,7 @@ cdef class BlockManager:

self.blocks = blocks
self.axes = axes.copy() # copy to make sure we are not remotely-mutable
self.refs = refs
mroeschke marked this conversation as resolved.
Show resolved Hide resolved

# Populate known_consolidate, blknos, and blklocs lazily
self._known_consolidated = False
Expand Down Expand Up @@ -795,12 +798,14 @@ cdef class BlockManager:
ndarray blknos, blklocs

nbs = []
nrefs = []
for blk in self.blocks:
nb = blk.getitem_block_index(slobj)
nbs.append(nb)
nrefs.append(weakref.ref(blk))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason to make the reference to the blk instead of its array blk.values? will that make a difference in the cases where blk.values is re-set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I don't remember if there was a specific technical reason to do so, but I think it seemed the easier option (since this is all handled at the BlockManager level). When keeping those references / when checking for references, it would otherwise be an additional level of indirection to check the blk.values instead of blk itself.

I suppose this would generally be the same, but indeed except in places where blk.values is re-set in place. I actually added one such case in this PR in Block.set_values (which is called in BlockManager.iset). In that case I should probably rather copy the Block and replace the block in the BM with the copied block, instead of copying the values in the existing block.
Do you know of other places where we currently we re-set blk.values?


new_axes = [self.axes[0], self.axes[1]._getitem_slice(slobj)]
mgr = type(self)(tuple(nbs), new_axes, verify_integrity=False)
mgr = type(self)(tuple(nbs), new_axes, nrefs, verify_integrity=False)

# We can avoid having to rebuild blklocs/blknos
blklocs = self._blklocs
Expand All @@ -813,7 +818,7 @@ cdef class BlockManager:
def get_slice(self, slobj: slice, axis: int = 0) -> BlockManager:

if axis == 0:
new_blocks = self._slice_take_blocks_ax0(slobj)
new_blocks, new_refs = self._slice_take_blocks_ax0(slobj)
elif axis == 1:
return self._get_index_slice(slobj)
else:
Expand All @@ -822,4 +827,4 @@ cdef class BlockManager:
new_axes = list(self.axes)
new_axes[axis] = new_axes[axis]._getitem_slice(slobj)

return type(self)(tuple(new_blocks), new_axes, verify_integrity=False)
return type(self)(tuple(new_blocks), new_axes, new_refs, verify_integrity=False)
2 changes: 1 addition & 1 deletion pandas/compat/pickle_compat.py
Original file line number Diff line number Diff line change
Expand Up @@ -224,7 +224,7 @@ def load_newobj(self):
arr = np.array([], dtype="m8[ns]")
obj = cls.__new__(cls, arr, arr.dtype)
elif cls is BlockManager and not args:
obj = cls.__new__(cls, (), [], False)
obj = cls.__new__(cls, (), [], None, False)
else:
obj = cls.__new__(cls, *args)

Expand Down
9 changes: 3 additions & 6 deletions pandas/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,10 +29,7 @@
from decimal import Decimal
import operator
import os
from typing import (
Callable,
Literal,
)
from typing import Callable

from dateutil.tz import (
tzlocal,
Expand Down Expand Up @@ -1844,8 +1841,8 @@ def using_array_manager():


@pytest.fixture
def using_copy_on_write() -> Literal[False]:
def using_copy_on_write() -> bool:
"""
Fixture to check if Copy-on-Write is enabled.
"""
return False
return pd.options.mode.copy_on_write and pd.options.mode.data_manager == "block"
mroeschke marked this conversation as resolved.
Show resolved Hide resolved
20 changes: 20 additions & 0 deletions pandas/core/config_init.py
Original file line number Diff line number Diff line change
Expand Up @@ -540,6 +540,26 @@ def use_inf_as_na_cb(key) -> None:
)


# TODO better name?
copy_on_write_doc = """
: bool
Use new copy-view behaviour using Copy-on-Write. Defaults to False,
unless overridden by the 'PANDAS_COPY_ON_WRITE' environment variable
(if set to "1" for True, needs to be set before pandas is imported).
"""


with cf.config_prefix("mode"):
cf.register_option(
"copy_on_write",
# Get the default from an environment variable, if set, otherwise defaults
# to False. This environment variable can be set for testing.
os.environ.get("PANDAS_COPY_ON_WRITE", "0") == "1",
copy_on_write_doc,
validator=is_bool,
)


# user warnings
chained_assignment = """
: string
Expand Down
18 changes: 12 additions & 6 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3708,13 +3708,19 @@ def _get_column_array(self, i: int) -> ArrayLike:
"""
Get the values of the i'th column (ndarray or ExtensionArray, as stored
in the Block)

Warning! The returned array is a view but doesn't handle Copy-on-Write,
so this should be used with caution (for read-only purposes).
"""
return self._mgr.iget_values(i)

def _iter_column_arrays(self) -> Iterator[ArrayLike]:
"""
Iterate over the arrays of all columns in order.
This returns the values as stored in the Block (ndarray or ExtensionArray).

Warning! The returned array is a view but doesn't handle Copy-on-Write,
so this should be used with caution (for read-only purposes).
"""
for i in range(len(self.columns)):
yield self._get_column_array(i)
Expand Down Expand Up @@ -5145,7 +5151,7 @@ def set_axis(
"labels",
[
("method", None),
("copy", True),
("copy", None),
("level", None),
("fill_value", np.nan),
("limit", None),
Expand Down Expand Up @@ -5370,7 +5376,7 @@ def rename(
index: Renamer | None = ...,
columns: Renamer | None = ...,
axis: Axis | None = ...,
copy: bool = ...,
copy: bool | None = ...,
inplace: Literal[True],
level: Level = ...,
errors: IgnoreRaise = ...,
Expand All @@ -5385,7 +5391,7 @@ def rename(
index: Renamer | None = ...,
columns: Renamer | None = ...,
axis: Axis | None = ...,
copy: bool = ...,
copy: bool | None = ...,
inplace: Literal[False] = ...,
level: Level = ...,
errors: IgnoreRaise = ...,
Expand All @@ -5400,7 +5406,7 @@ def rename(
index: Renamer | None = ...,
columns: Renamer | None = ...,
axis: Axis | None = ...,
copy: bool = ...,
copy: bool | None = ...,
inplace: bool = ...,
level: Level = ...,
errors: IgnoreRaise = ...,
Expand All @@ -5414,7 +5420,7 @@ def rename(
index: Renamer | None = None,
columns: Renamer | None = None,
axis: Axis | None = None,
copy: bool = True,
copy: bool | None = None,
mroeschke marked this conversation as resolved.
Show resolved Hide resolved
inplace: bool = False,
level: Level = None,
errors: IgnoreRaise = "ignore",
Expand Down Expand Up @@ -6288,7 +6294,7 @@ class max type
if inplace:
new_obj = self
else:
new_obj = self.copy()
new_obj = self.copy(deep=None)
if allow_duplicates is not lib.no_default:
allow_duplicates = validate_bool_kwarg(allow_duplicates, "allow_duplicates")

Expand Down
16 changes: 10 additions & 6 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1034,7 +1034,7 @@ def _rename(
index: Renamer | None = None,
columns: Renamer | None = None,
axis: Axis | None = None,
copy: bool_t = True,
copy: bool_t | None = None,
inplace: bool_t = False,
level: Level | None = None,
errors: str = "ignore",
Expand Down Expand Up @@ -4145,6 +4145,12 @@ def _check_setitem_copy(self, t="setting", force=False):
df.iloc[0:5]['group'] = 'a'

"""
if (
config.get_option("mode.copy_on_write")
and config.get_option("mode.data_manager") == "block"
):
return

# return early if the check is not needed
if not (force or self._is_copy):
return
Expand Down Expand Up @@ -5245,7 +5251,7 @@ def reindex(self: NDFrameT, *args, **kwargs) -> NDFrameT:
axes, kwargs = self._construct_axes_from_arguments(args, kwargs)
method = missing.clean_reindex_fill_method(kwargs.pop("method", None))
level = kwargs.pop("level", None)
copy = kwargs.pop("copy", True)
copy = kwargs.pop("copy", None)
limit = kwargs.pop("limit", None)
tolerance = kwargs.pop("tolerance", None)
fill_value = kwargs.pop("fill_value", None)
Expand All @@ -5270,9 +5276,7 @@ def reindex(self: NDFrameT, *args, **kwargs) -> NDFrameT:
for axis, ax in axes.items()
if ax is not None
):
if copy:
return self.copy()
return self
return self.copy(deep=copy)

# check if we are a multi reindex
if self._needs_reindex_multi(axes, method, level):
Expand Down Expand Up @@ -6249,7 +6253,7 @@ def astype(
return cast(NDFrameT, result)

@final
def copy(self: NDFrameT, deep: bool_t = True) -> NDFrameT:
def copy(self: NDFrameT, deep: bool_t | None = True) -> NDFrameT:
"""
Make a copy of this object's indices and data.

Expand Down
10 changes: 10 additions & 0 deletions pandas/core/internals/array_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -527,6 +527,11 @@ def copy(self: T, deep=True) -> T:
-------
BlockManager
"""
if deep is None:
# ArrayManager does not yet support CoW, so deep=None always means
# deep=True for now
deep = True

# this preserves the notion of view copying of axes
if deep:
# hit in e.g. tests.io.json.test_pandas
Expand Down Expand Up @@ -591,6 +596,11 @@ def _reindex_indexer(
pandas-indexer with -1's only.
"""
if copy is None:
# ArrayManager does not yet support CoW, so deep=None always means
# deep=True for now
copy = True

if indexer is None:
if new_axis is self._axes[axis] and not copy:
return self
Expand Down
11 changes: 9 additions & 2 deletions pandas/core/internals/blocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -839,17 +839,22 @@ def _slice(

return self.values[slicer]

def set_inplace(self, locs, values: ArrayLike) -> None:
def set_inplace(self, locs, values: ArrayLike, copy: bool = False) -> None:
"""
Modify block values in-place with new item value.

If copy=True, first copy the underlying values in place before modifying
(for Copy-on-Write).

Notes
-----
`set_inplace` never creates a new array or new Block, whereas `setitem`
_may_ create a new array and always creates a new Block.

Caller is responsible for checking values.dtype == self.dtype.
"""
if copy:
self.values = self.values.copy()
self.values[locs] = values

def take_nd(
Expand Down Expand Up @@ -1665,9 +1670,11 @@ def iget(self, i: int | tuple[int, int] | tuple[slice, int]):
raise IndexError(f"{self} only contains one item")
return self.values

def set_inplace(self, locs, values: ArrayLike) -> None:
def set_inplace(self, locs, values: ArrayLike, copy: bool = False) -> None:
# When an ndarray, we should have locs.tolist() == [0]
# When a BlockPlacement we should have list(locs) == [0]
if copy:
self.values = self.values.copy()
self.values[:] = values

def _maybe_squeeze_arg(self, arg):
Expand Down
Loading