
Potential perf regressions introduced by Copy-on-Write #57431

Open
22 of 50 tasks
rhshadrach opened this issue Feb 15, 2024 · 9 comments
Labels
Copy / view semantics · Performance · Regression

Comments


rhshadrach commented Feb 15, 2024

PR #56633 may have induced a performance regression. If it was a necessary behavior change, this may have been expected and everything is okay.

Please check the links below. If any ASVs are parameterized, the parameter combinations for which a regression was detected appear as sub-bullets.

Subsequent benchmark runs may have skipped some commits. The link below lists the commits between the two benchmark runs where the regression was identified.

Commit Range

cc @phofl

@rhshadrach rhshadrach added the Performance, Regression, and Copy / view semantics labels on Feb 15, 2024

jorisvandenbossche commented Feb 16, 2024

Thanks for the list!
I was just planning to look at the ASV results for CoW today. Here is an investigation of a few items from the list:

reshape.Unstack.time_full_product

import numpy as np
import pandas as pd

m = 100
n = 1000

levels = np.arange(m)
index = pd.MultiIndex.from_product([levels] * 2)
columns = np.arange(n)
values = np.arange(m * m * n).reshape(m * m, n)
df = pd.DataFrame(values, index, columns)

%timeit df.unstack()

I am using pandas 2.2.0 to be able to easily switch between the default and CoW mode. With that, I can reproduce a slowdown in unstack. Profiling this with CoW enabled, the slowdown seems to be caused by a slower DataFrame construction of the final result in Unstacker.get_result. I assume this is because the DataFrame constructor now copies an input numpy array, and in this case we can pass copy=False, as we know we created the new values ourselves:

return self.constructor(
    values, index=index, columns=columns, dtype=values.dtype
)
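A sketch of that suggested change, assuming the constructor's copy keyword is simply forwarded here (untested against the actual Unstacker.get_result code path):

return self.constructor(
    values, index=index, columns=columns, dtype=values.dtype, copy=False
)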

arithmetic.FrameWithFrameWide.time_op_different_blocks op=; shape=(1000000, 10)

import numpy as np
import pandas as pd
from pandas import DataFrame

n_rows, n_cols = 1_000_000, 10

# construct dataframe with 2 blocks
arr1 = np.random.randn(n_rows, n_cols // 2).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("f4")
df = pd.concat([DataFrame(arr1), DataFrame(arr2)], axis=1, ignore_index=True)
df._consolidate_inplace()

arr1 = np.random.randn(n_rows, max(n_cols // 4, 3)).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("i8")
arr3 = np.random.randn(n_rows, n_cols // 4).astype("f8")
df2 = pd.concat(
    [DataFrame(arr1), DataFrame(arr2), DataFrame(arr3)],
    axis=1,
    ignore_index=True,
)
df2._consolidate_inplace()

%timeit df > df2

This I can also reproduce, and a profile indicates the slowdown comes entirely from the actual > operation on the underlying data. The one difference seems to be that with CoW, the underlying data is laid out row-major from the DataFrame's perspective (the transposed Block.values are column-major, i.e. F-contiguous), while without CoW we get the expected column-major layout.
Simpler example:

>>> pd.options.mode.copy_on_write = False
>>> arr = np.random.randn(n_rows, n_cols)
>>> df = pd.concat([pd.DataFrame(arr), pd.DataFrame(arr)], axis=1, ignore_index=True)
>>> df._mgr.blocks[0].values.shape
(10, 1000000)
>>> df._mgr.blocks[0].values.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

>>> pd.options.mode.copy_on_write = True
>>> df = pd.concat([pd.DataFrame(arr), pd.DataFrame(arr)], axis=1, ignore_index=True)
>>> df._mgr.blocks[0].values.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

So something seems to go wrong in the DataFrame constructor. With CoW enabled, it should make a copy by default, but when making this copy, it seems not to use the correct C vs F order.
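For intuition (plain NumPy, not the exact pandas code path): column-wise access over a row-major array touches strided memory and is noticeably slower than over a column-major array, which is why the wrong layout hurts per-column operations:

import numpy as np

arr_c = np.random.randn(1_000_000, 10)  # C order: rows contiguous
arr_f = np.asfortranarray(arr_c)        # F order: columns contiguous

# Column slices of the C-order array are strided, so this is slower:
# %timeit [arr_c[:, i] > 0 for i in range(arr_c.shape[1])]
# %timeit [arr_f[:, i] > 0 for i in range(arr_f.shape[1])]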

series_methods.ToNumpy.time_to_numpy

import numpy as np
import pandas as pd

N = 1_000_000
ser = pd.Series(np.random.randn(N))

%timeit ser.to_numpy()

gives

In [80]: %timeit ser.to_numpy()
846 ns ± 57.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)  # without CoW
2.38 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)   # with CoW

In this case there is some extra work being done to return a read-only view:

pandas/pandas/core/base.py

Lines 667 to 669 in f538741

if using_copy_on_write() and not copy:
    result = result.view()
    result.flags.writeable = False

So it might be just that (but I haven't checked in detail; from a quick profile nothing really stood out apart from the body of to_numpy itself taking a bit longer).
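The per-call cost of that branch can be approximated in isolation (a rough sketch; the real to_numpy does more than this):

import numpy as np

arr = np.random.randn(1_000_000)

def readonly_view(a):
    # Mirrors the CoW branch: a cheap view plus flipping the writeable flag.
    v = a.view()
    v.flags.writeable = False
    return v

# %timeit arr.view()          # baseline: view creation only
# %timeit readonly_view(arr)  # view + flag toggle, the extra CoW work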


phofl commented Feb 18, 2024

Starting a list here of what is a false positive or expected:

  • both to_dict cases were only that fast because of the item_cache, which is something we removed, so there is nothing we can do

  • time_iteritems_indexing: also the item_cache; iterating over the columns once is actually twice as fast now, but doing it repeatedly now scales linearly, which is expected (see the sketch after this list)

  • groupby.Fillna.time_srs_bfill and ffill are actually faster now if you increase the object size; the benchmark object is tiny, so the constant cost probably caused the increase (we are 20% faster if we create an object that runs around 100 ms)

  • time_loc scales well; it is only on small DataFrames that we see the slowdown
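A sketch of the item_cache effect mentioned above (a hypothetical benchmark, not the ASV code): without the cache, every df[col] builds a fresh Series, so repeated column iteration pays the construction cost each time:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 100))

# Each pass now costs the same; previously the first pass filled the
# item_cache (making it slower) and later passes were nearly free.
# %timeit [df[c] for c in df.columns]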

@jorisvandenbossche jorisvandenbossche changed the title Potential regression induced by PR #56633 Potential regression introduced by Copy-on-Write Feb 19, 2024
@jorisvandenbossche (Member)

For the several time_stack ones (e.g. reshape.ReshapeMaskedArrayDtype.time_stack - dtype='Float64'), it seems that the CoW PR only gave a small perf regression (which I think is again due to the item_cache being gone), but those benchmarks show a much bigger slowdown some commits later (I see @rhshadrach and @DeaMariaLeon were discussing those yesterday on Slack).
Given that those are not related to CoW, marking them as "done" in the top post here.


jorisvandenbossche commented Feb 19, 2024

For the various arithmetic.NumericInferOps.time_add - dtype=<class 'numpy.int16'> cases (also for time_multiply and time_subtract, I assume): the small slowdown is also due to the item_cache removal.
And the fact that we only see it for the lower-bitwidth dtypes is because the actual addition for those is much faster, so the overhead of getting the column becomes relatively larger, making it noticeable in the benchmark.
Marked those as "done".
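A rough sketch of that effect in plain NumPy (illustrative numbers only): the add itself is cheaper for narrow dtypes, so any fixed per-column overhead is a larger fraction of the total:

import numpy as np

a16 = np.ones(100_000, dtype=np.int16)
a64 = np.ones(100_000, dtype=np.int64)

# Less memory traffic makes the int16 add much cheaper, so a constant
# column-lookup overhead of a few microseconds is far more visible there:
# %timeit a16 + a16
# %timeit a64 + a64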

@jorisvandenbossche (Member)

For strings.Construction.time_construction, the reason for the slowdown is that Series(nparr) now makes a copy by default, so that is expected (given the benchmark does not use copy=False).
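A sketch of how a benchmark could opt out of the defensive copy, assuming sharing the input array is acceptable:

import numpy as np
import pandas as pd

arr = np.array(["a", "b", "c"] * 1_000, dtype=object)

ser_copied = pd.Series(arr)              # CoW default: copies the input
ser_shared = pd.Series(arr, copy=False)  # reuses the input array
assert np.shares_memory(ser_shared.to_numpy(), arr)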

@jorisvandenbossche jorisvandenbossche changed the title Potential regression introduced by Copy-on-Write Potential perf regressions introduced by Copy-on-Write Feb 19, 2024
@jorisvandenbossche (Member)

stat_ops.FrameOps.time_op

import numpy as np
import pandas as pd

dtype = "float"
axis = 0
values = np.random.randn(100000, 4)
df = pd.DataFrame(values).astype(dtype)

%timeit df.sum(axis=axis)

This was also because the constructor copied the array without creating the correct columnar layout, so this should be fixed on main by #57459 (confirmed locally; in a few days we should also see it in the ASV results).

@jorisvandenbossche (Member)

series_methods.Fillna.time_fillna - dtype='string'

import numpy as np
import pandas as pd

dtype = "string"
N = 10**6
data = np.array([str(i) * 5 for i in range(N)], dtype=object)
na_value = pd.NA
fill_value = data[0]
ser = pd.Series(data, dtype=dtype)
ser[::2] = na_value

%timeit ser.fillna(fill_value)

This is a slowdown that I can reproduce, and it is due to the additional hasna check that we added (linking to the 2.2 code, so it's clearer what runs with CoW vs before):

if using_cow and self._can_hold_na and not self.values._hasna:
    refs = self.refs
    new_values = self.values
else:
    copy, refs = self._get_refs_and_copy(using_cow, inplace)
    try:
        new_values = self.values.fillna(
            value=value, method=None, limit=limit, copy=copy
        )

So for CoW, we added this check to see whether there are actually NAs to fill, and if not, we just return the original values plus the correct refs.
But for the case where we are actually filling NAs, this _hasna check seems to give significant overhead (around 30% in this example, and at least for string dtype, since that's the only dtype showing a regression in the ASV results).

Now, this only affects our own object-backed string dtype, so given that we have pyarrow-backed strings as a faster alternative if you care about performance, this is probably an OK trade-off (the hasna check has other benefits, like avoiding a copy when there are no NAs).
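The overhead roughly corresponds to one extra NA scan over the values before filling; a sketch to measure that scan in isolation (using the public isna as a stand-in for the internal _hasna check):

import numpy as np
import pandas as pd

N = 10**6
data = np.array([str(i) * 5 for i in range(N)], dtype=object)
ser = pd.Series(data, dtype="string")
ser[::2] = pd.NA

# One full NA scan, comparable to the check done before filling:
# %timeit ser.isna().any()
# %timeit ser.fillna(data[0])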


jorisvandenbossche commented Mar 5, 2024

Although, after taking a quick look at isna for strings, it is also easy to speed that up to offset the additional overhead a bit -> #57733

@rhshadrach (Member, Author)

I've looked through the most recent benchmarks, confirmed #57431 (comment), and checked off the box.
