
Potential perf regressions introduced by Copy-on-Write #57431

Open
22 of 50 tasks
rhshadrach opened this issue Feb 15, 2024 · 9 comments
Labels
Copy / view semantics · Performance · Regression

Comments


rhshadrach commented Feb 15, 2024

PR #56633 may have induced a performance regression. If it was a necessary behavior change, this may have been expected and everything is okay.

Please check the links below. If any ASVs are parameterized, the parameter combinations for which a regression was detected appear as sub-bullets.

Subsequent benchmark runs may have skipped some commits. The link below lists the commits between the two benchmark runs where the regression was identified.

Commit Range

cc @phofl

@rhshadrach rhshadrach added the Performance, Regression, and Copy / view semantics labels on Feb 15, 2024

jorisvandenbossche commented Feb 16, 2024

Thanks for the list!
I was just planning to look at the ASV results for CoW today. Here is an investigation of a few items from the list:

reshape.Unstack.time_full_product

import numpy as np
import pandas as pd

m = 100
n = 1000

levels = np.arange(m)
index = pd.MultiIndex.from_product([levels] * 2)
columns = np.arange(n)
values = np.arange(m * m * n).reshape(m * m, n)
df = pd.DataFrame(values, index, columns)

%timeit df.unstack()

I am using pandas 2.2.0 to be able to easily switch between the default and CoW mode. With that, I can reproduce a slowdown in unstack. Profiling this with CoW enabled, the slowdown seems to be caused by a slower DataFrame construction of the final result in Unstacker.get_result. I assume this is because the DataFrame constructor now copies an input numpy array, and in this case we can pass copy=False, as we know we created the new values ourselves:

return self.constructor(
    values, index=index, columns=columns, dtype=values.dtype
)
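A sketch of that suggested change, assuming the constructor's copy keyword is simply forwarded here (untested against the actual Unstacker.get_result code path):

return self.constructor(
    values, index=index, columns=columns, dtype=values.dtype, copy=False
)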

arithmetic.FrameWithFrameWide.time_op_different_blocks op=; shape=(1000000, 10)

import numpy as np
import pandas as pd
from pandas import DataFrame

n_rows, n_cols = 1_000_000, 10

# construct dataframe with 2 blocks
arr1 = np.random.randn(n_rows, n_cols // 2).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("f4")
df = pd.concat([DataFrame(arr1), DataFrame(arr2)], axis=1, ignore_index=True)
df._consolidate_inplace()

arr1 = np.random.randn(n_rows, max(n_cols // 4, 3)).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("i8")
arr3 = np.random.randn(n_rows, n_cols // 4).astype("f8")
df2 = pd.concat(
    [DataFrame(arr1), DataFrame(arr2), DataFrame(arr3)],
    axis=1,
    ignore_index=True,
)
df2._consolidate_inplace()

%timeit df > df2

This I can also reproduce, and a profile indicates the slowdown comes entirely from the actual > operation on the underlying data. The one difference seems to be that with CoW, the underlying data is laid out row-major from the DataFrame's perspective (the transposed Block.values are column-major, i.e. F-contiguous), while without CoW we get the expected column-major layout.
Simpler example:

>>> pd.options.mode.copy_on_write = False
>>> arr = np.random.randn(n_rows, n_cols)
>>> df = pd.concat([pd.DataFrame(arr), pd.DataFrame(arr)], axis=1, ignore_index=True)
>>> df._mgr.blocks[0].values.shape
(10, 1000000)
>>> df._mgr.blocks[0].values.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

>>> pd.options.mode.copy_on_write = True
>>> df = pd.concat([pd.DataFrame(arr), pd.DataFrame(arr)], axis=1, ignore_index=True)
>>> df._mgr.blocks[0].values.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

So something seems to go wrong in the DataFrame constructor. With CoW enabled, it should make a copy by default, but when making this copy, it seems not to use the correct C vs F order.
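For intuition (plain NumPy, not the exact pandas code path): column-wise access over a row-major array touches strided memory and is noticeably slower than over a column-major array, which is why the wrong layout hurts per-column operations:

import numpy as np

arr_c = np.random.randn(1_000_000, 10)  # C order: rows contiguous
arr_f = np.asfortranarray(arr_c)        # F order: columns contiguous

# Column slices of the C-order array are strided, so this is slower:
# %timeit [arr_c[:, i] > 0 for i in range(arr_c.shape[1])]
# %timeit [arr_f[:, i] > 0 for i in range(arr_f.shape[1])]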

series_methods.ToNumpy.time_to_numpy

import numpy as np
import pandas as pd

N = 1_000_000
ser = pd.Series(np.random.randn(N))

%timeit ser.to_numpy()

gives

In [80]: %timeit ser.to_numpy()
846 ns ± 57.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)  # without CoW
2.38 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)   # with CoW

In this case there is some extra work being done to return a read-only view:

pandas/pandas/core/base.py

Lines 667 to 669 in f538741

if using_copy_on_write() and not copy:
    result = result.view()
    result.flags.writeable = False

So it might be just that (but I haven't checked in detail; from a quick profile nothing really stood out apart from the body of to_numpy itself taking a bit longer).
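The per-call cost of that branch can be approximated in isolation (a rough sketch; the real to_numpy does more than this):

import numpy as np

arr = np.random.randn(1_000_000)

def readonly_view(a):
    # Mirrors the CoW branch: a cheap view plus flipping the writeable flag.
    v = a.view()
    v.flags.writeable = False
    return v

# %timeit arr.view()          # baseline: view creation only
# %timeit readonly_view(arr)  # view + flag toggle, the extra CoW work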


phofl commented Feb 18, 2024

Starting a list here of what is a false positive or expected:

  • both to_dict cases were only that fast because of the item_cache, which is something we removed, so there is nothing we can do

  • time_iteritems_indexing: also the item_cache; iterating over the columns once is actually twice as fast now, but doing it repeatedly now scales linearly, which is expected (see the sketch after this list)

  • groupby.Fillna.time_srs_bfill and ffill are actually faster now if you increase the object size; the benchmark object is tiny, so the constant cost probably caused the increase (we are 20% faster if we create an object that runs around 100 ms)

  • time_loc scales well; it is only on small DataFrames that we see the slowdown
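A sketch of the item_cache effect mentioned above (a hypothetical benchmark, not the ASV code): without the cache, every df[col] builds a fresh Series, so repeated column iteration pays the construction cost each time:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 100))

# Each pass now costs the same; previously the first pass filled the
# item_cache (making it slower) and later passes were nearly free.
# %timeit [df[c] for c in df.columns]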

@jorisvandenbossche jorisvandenbossche changed the title Potential regression induced by PR #56633 Potential regression introduced by Copy-on-Write Feb 19, 2024
@jorisvandenbossche (Member)

For the several time_stack ones (e.g. reshape.ReshapeMaskedArrayDtype.time_stack - dtype='Float64'), it seems that the CoW PR only gave a small perf regression (which I think is again due to the item_cache being gone), but those benchmarks show a much bigger slowdown some commits later (I see @rhshadrach and @DeaMariaLeon were discussing those yesterday on Slack).
Given that those are not related to CoW, marking them as "done" in the top post here.


jorisvandenbossche commented Feb 19, 2024

For the various arithmetic.NumericInferOps.time_add - dtype=<class 'numpy.int16'> cases (also for time_multiply and time_subtract, I assume): the small slowdown is also due to the item_cache removal.
And the fact that we only see it for the lower-bitwidth dtypes is because the actual addition for those is much faster, so the overhead of getting the column becomes relatively larger, making it noticeable in the benchmark.
Marked those as "done".
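A rough sketch of that effect in plain NumPy (illustrative numbers only): the add itself is cheaper for narrow dtypes, so any fixed per-column overhead is a larger fraction of the total:

import numpy as np

a16 = np.ones(100_000, dtype=np.int16)
a64 = np.ones(100_000, dtype=np.int64)

# Less memory traffic makes the int16 add much cheaper, so a constant
# column-lookup overhead of a few microseconds is far more visible there:
# %timeit a16 + a16
# %timeit a64 + a64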

@jorisvandenbossche (Member)

For strings.Construction.time_construction, the reason for the slowdown is that Series(nparr) now makes a copy by default, so that is expected (given the benchmark does not use copy=False).
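A sketch of how a benchmark could opt out of the defensive copy, assuming sharing the input array is acceptable:

import numpy as np
import pandas as pd

arr = np.array(["a", "b", "c"] * 1_000, dtype=object)

ser_copied = pd.Series(arr)              # CoW default: copies the input
ser_shared = pd.Series(arr, copy=False)  # reuses the input array
assert np.shares_memory(ser_shared.to_numpy(), arr)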

@jorisvandenbossche jorisvandenbossche changed the title Potential regression introduced by Copy-on-Write Potential perf regressions introduced by Copy-on-Write Feb 19, 2024
@jorisvandenbossche (Member)

stat_ops.FrameOps.time_op

import numpy as np
import pandas as pd

dtype = "float"
axis = 0
values = np.random.randn(100000, 4)
df = pd.DataFrame(values).astype(dtype)

%timeit df.sum(axis=axis)

This was also because the constructor copied the array without creating the correct columnar layout, so this should be fixed on main by #57459 (confirmed locally; in a few days we should also see it in the ASV results).

@jorisvandenbossche (Member)

series_methods.Fillna.time_fillna - dtype='string'

import numpy as np
import pandas as pd

dtype = "string"
N = 10**6
data = np.array([str(i) * 5 for i in range(N)], dtype=object)
na_value = pd.NA
fill_value = data[0]
ser = pd.Series(data, dtype=dtype)
ser[::2] = na_value

%timeit ser.fillna(fill_value)

This is a slowdown that I can reproduce, and it is due to the additional hasna check that we added (linking to the 2.2 code, so it's clearer what runs with CoW vs before):

if using_cow and self._can_hold_na and not self.values._hasna:
    refs = self.refs
    new_values = self.values
else:
    copy, refs = self._get_refs_and_copy(using_cow, inplace)
    try:
        new_values = self.values.fillna(
            value=value, method=None, limit=limit, copy=copy
        )

So for CoW, we added this check to see whether there are actually NAs to fill, and if not, we just return the original values plus the correct refs.
But for the case where we are actually filling NAs, this _hasna check seems to give significant overhead (around 30% in this example, and at least for string dtype, since that's the only dtype showing a regression in the ASV results).

Now, this only affects our own object-backed string dtype, so given that we have pyarrow-backed strings as a faster alternative if you care about performance, this is probably an OK trade-off (the hasna check has other benefits, like avoiding a copy when there are no NAs).
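The overhead roughly corresponds to one extra NA scan over the values before filling; a sketch to measure that scan in isolation (using the public isna as a stand-in for the internal _hasna check):

import numpy as np
import pandas as pd

N = 10**6
data = np.array([str(i) * 5 for i in range(N)], dtype=object)
ser = pd.Series(data, dtype="string")
ser[::2] = pd.NA

# One full NA scan, comparable to the check done before filling:
# %timeit ser.isna().any()
# %timeit ser.fillna(data[0])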


jorisvandenbossche commented Mar 5, 2024

Although, after taking a quick look at isna for strings, it is also easy to speed that up to offset the additional overhead a bit -> #57733

@rhshadrach (Member, Author)

I've looked through the most recent benchmarks, confirmed #57431 (comment), and checked off the box.
