
Support pandas copy-on-write behaviour #8846

Merged · 6 commits into pydata:main · Mar 18, 2024
Conversation

dcherian (Contributor)

import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)

from xarray.core.variable import _possibly_convert_objects

string_var = np.array(["a", "bc", "def"], dtype=object)
datetime_var = np.array(
    ["2019-01-01", "2019-01-02", "2019-01-03"], dtype="datetime64[ns]"
)
# With copy-on-write enabled, both assertions fail: pandas hands back
# read-only arrays.
assert _possibly_convert_objects(string_var).flags.writeable
assert _possibly_convert_objects(datetime_var).flags.writeable

The core issue is that we now get read-only arrays back from pandas here:

def _possibly_convert_objects(values):
    """Convert arrays of datetime.datetime and datetime.timedelta objects into
    datetime64 and timedelta64, according to the pandas convention. For the time
    being, convert any non-nanosecond precision DatetimeIndex or TimedeltaIndex
    objects to nanosecond precision. While pandas is relaxing this in version
    2.0.0, in xarray we will need to make sure we are ready to handle
    non-nanosecond precision datetimes or timedeltas in our code before allowing
    such values to pass through unchanged. Converting to nanosecond precision
    through pandas.Series objects ensures that datetimes and timedeltas are
    within the valid date range for ns precision, as pandas will raise an error
    if they are not.
    """
    as_series = pd.Series(values.ravel(), copy=False)
    if as_series.dtype.kind in "mM":
        as_series = _as_nanosecond_precision(as_series)
    return np.asarray(as_series).reshape(values.shape)
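The symptom can be reproduced without xarray at all — this is a minimal standalone sketch (not the xarray function itself) of the same Series round-trip, assuming pandas >= 2.0 where the `mode.copy_on_write` option exists (it is a no-op or removed once copy-on-write is always on):

```python
import numpy as np
import pandas as pd

# The option exists in pandas 2.x; newer versions may reject it because
# copy-on-write is always enabled there.
try:
    pd.set_option("mode.copy_on_write", True)
except Exception:
    pass

values = np.array(["2019-01-01", "2019-01-02"], dtype="datetime64[ns]")

# Round-trip through a Series the way _possibly_convert_objects does.
as_series = pd.Series(values.ravel(), copy=False)
result = np.asarray(as_series).reshape(values.shape)

# Under copy-on-write, `result` can be a read-only view of the Series'
# internal buffer, so in-place writes to it would raise ValueError.
print(result.flags.writeable)
```

The values themselves round-trip unchanged; only the `writeable` flag on the returned array differs between copy-on-write and legacy pandas.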

@phofl is this expected?

@dcherian dcherian added the run-upstream Run upstream CI label Mar 16, 2024
@phofl (Contributor) commented Mar 16, 2024

Yes, pandas now avoids copies wherever possible, which means an in-place modification made outside of pandas could modify an arbitrary number of pandas objects. That's why we return read-only arrays (the same as you would now get with arrow-backed arrays). You can either copy the array or reset its writeable flag manually if you want to get rid of that.
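The two remedies mentioned above can be sketched as follows — a minimal illustration, not the fix this PR actually applies, assuming pandas >= 2.0 (where the `mode.copy_on_write` option exists):

```python
import numpy as np
import pandas as pd

try:
    pd.set_option("mode.copy_on_write", True)  # no-op/absent in newer pandas
except Exception:
    pass

values = np.array(["2019-01-01", "2019-01-02"], dtype="datetime64[ns]")
arr = np.asarray(pd.Series(values, copy=False))

# Option 1: take an explicit copy. The copy is always writeable and is
# detached from any buffer that pandas objects might share.
copied = arr.copy()
assert copied.flags.writeable

# Option 2: reset the flag in place. NumPy allows this when the underlying
# base buffer is writeable, but writes may then be visible to pandas
# objects sharing that buffer -- use only if you accept that.
arr2 = np.asarray(pd.Series(values, copy=False))
if not arr2.flags.writeable:
    arr2.flags.writeable = True
```

Copying is the safer default; resetting the flag avoids the copy but reintroduces exactly the aliasing that copy-on-write is designed to prevent.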

Review comment on xarray/tests/__init__.py (outdated, resolved)
@dcherian dcherian marked this pull request as ready for review March 16, 2024 03:38
@dcherian dcherian marked this pull request as draft March 16, 2024 04:05
@dcherian dcherian marked this pull request as ready for review March 16, 2024 04:15
@dcherian dcherian added plan to merge Final call for comments and removed needs review labels Mar 18, 2024
@dcherian (Contributor, Author)

Merging so we can get more useful upstream failure reports.

@dcherian dcherian merged commit c6c01b1 into pydata:main Mar 18, 2024
27 of 30 checks passed
@dcherian dcherian deleted the fix-pd-cow branch March 18, 2024 16:00
dcherian added a commit to kmsquire/xarray that referenced this pull request Mar 21, 2024
* upstream/main: (765 commits)
  increase typing annotations coverage in `xarray/core/indexing.py` (pydata#8857)
  pandas 3 MultiIndex fixes (pydata#8847)
  FIX: adapt handling of copy keyword argument in scipy backend for numpy >= 2.0dev (pydata#8851)
  FIX: do not cast _FillValue/missing_value in CFMaskCoder if _Unsigned is provided (pydata#8852)
  Implement setitem syntax for `.oindex` and `.vindex` properties (pydata#8845)
  Support pandas copy-on-write behaviour (pydata#8846)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  ...
Labels: plan to merge (Final call for comments), run-upstream (Run upstream CI)
Linked issue (may be closed by this pull request): Get ready for pandas 3 copy-on-write
3 participants