Merge remote-tracking branch 'upstream/main' into chunk-by-frequency

* upstream/main:
  [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  Undo custom padding-top. (pydata#9107)
  add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  Add user survey announcement to docs (pydata#9101)
  skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  Adds Matt Savoie to CITATION.cff (pydata#9103)
  [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
  Migrate datatree io.py and common.py into xarray/core (pydata#9011)
  Micro optimizations to improve indexing (pydata#9002)
  (fix): don't handle time-dtypes as extension arrays in `from_dataframe` (pydata#9042)
dcherian · Jun 13, 2024 · commit 566fd37
2 parents 8a980ef + 6554855
Showing 20 changed files with 556 additions and 308 deletions.
diff --git a/.github/workflows/hypothesis.yaml b/.github/workflows/hypothesis.yaml
@@ -39,9 +39,9 @@ jobs:
     if: |
         always()
         && (
-            (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
-            || needs.detect-ci-trigger.outputs.triggered == 'true'
-            || contains( github.event.pull_request.labels.*.name, 'run-slow-hypothesis')
+            needs.detect-ci-trigger.outputs.triggered == 'false'
+            && ( (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
+                || contains( github.event.pull_request.labels.*.name, 'run-slow-hypothesis'))
         )
     defaults:
       run:
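
The hunk above changes how the job's `if:` condition combines its three signals. A minimal Python sketch of the old and new expressions may make the difference easier to see. The function and parameter names are hypothetical, the constant `always()` term is omitted, and Python's `in` is an exact match, which simplifies the workflow's `contains(...)`. What `detect-ci-trigger` actually looks for is not shown in this hunk, so `triggered` is treated only as the string it is compared against.

```python
def should_run_old(event_name: str, triggered: str, labels: list[str]) -> bool:
    """Old condition: any one of the three signals is enough to run the job."""
    return (
        event_name in ("schedule", "workflow_dispatch")
        or triggered == "true"
        or "run-slow-hypothesis" in labels
    )


def should_run_new(event_name: str, triggered: str, labels: list[str]) -> bool:
    """New condition: `triggered` must be 'false', and then either the event
    type or the pull-request label has to ask for the run."""
    return triggered == "false" and (
        event_name in ("schedule", "workflow_dispatch")
        or "run-slow-hypothesis" in labels
    )
```

Under this reading, `triggered == 'true'` used to start the job on its own and now prevents it from starting, even on a scheduled run.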

diff --git a/CITATION.cff b/CITATION.cff
@@ -84,6 +84,11 @@ authors:
 - family-names: "Scheick"
   given-names: "Jessica"
   orcid: "https://orcid.org/0000-0002-3421-4459"
+- family-names: "Savoie"
+  given-names: "Matthew"
+  orcid: "https://orcid.org/0000-0002-8881-2550"
+- family-names: "Littlejohns"
+  given-names: "Owen"
 title: "xarray"
 abstract: "N-D labeled arrays and datasets in Python."
 license: Apache-2.0

diff --git a/doc/_static/style.css b/doc/_static/style.css
@@ -7,9 +7,8 @@ table.docutils td {
     word-wrap: break-word;
 }
 
-div.bd-header-announcement {
-  background-color: unset;
-  color: #000;
+.bd-header-announcement {
+  background-color: var(--pst-color-info-bg);
 }
 
 /* Reduce left and right margins */
@@ -222,8 +221,6 @@ main *:target::before {
 }
 
 body {
-  /* Add padding to body to avoid overlap with navbar. */
-  padding-top: var(--navbar-height);
   width: 100%;
 }
 

diff --git a/doc/conf.py b/doc/conf.py
@@ -242,7 +242,7 @@
     Theme by the <a href="https://ebp.jupyterbook.org">Executable Book Project</a></p>""",
     twitter_url="https://twitter.com/xarray_dev",
     icon_links=[],  # workaround for pydata/pydata-sphinx-theme#1220
-    announcement="🍾 <a href='https://github.com/pydata/xarray/discussions/8462'>Xarray is now 10 years old!</a> 🎉",
+    announcement="<a href='https://forms.gle/KEq7WviCdz9xTaJX6'>Xarray's 2024 User Survey is live now. Please take ~5 minutes to fill it out and help us improve Xarray.</a>",
 )
 
 

diff --git a/doc/whats-new.rst b/doc/whats-new.rst
@@ -17,7 +17,7 @@ What's New
 
 .. _whats-new.2024.05.1:
 
-v2024.05.1 (unreleased)
+v2024.06 (unreleased)
 -----------------------
 
 New Features
@@ -28,6 +28,10 @@ Performance
 
 - Small optimization to the netCDF4 and h5netcdf backends (:issue:`9058`, :pull:`9067`).
   By `Deepak Cherian <https://github.com/dcherian>`_.
+- Small optimizations to help improve indexing speed of datasets (:pull:`9002`).
+  By `Mark Harfouche <https://github.com/hmaarrfk>`_.
+- Performance improvement in `open_datatree` method for Zarr, netCDF4 and h5netcdf backends (:issue:`8994`, :pull:`9014`).
+  By `Alfonso Ladino <https://github.com/aladinor>`_.
 
 
 Breaking changes
@@ -40,6 +44,9 @@ Deprecations
 
 Bug fixes
 ~~~~~~~~~
+- Preserve conversion of timezone-aware pandas Datetime arrays to numpy object arrays
+  (:issue:`9026`, :pull:`9042`).
+  By `Ilan Gold <https://github.com/ilan-gold>`_.
 
 - :py:meth:`DataArrayResample.interpolate` and :py:meth:`DatasetResample.interpolate` method now
   support arbitrary kwargs such as ``order`` for polynomial interpolation. (:issue:`8762`).
@@ -54,6 +61,10 @@ Documentation
 
 Internal Changes
 ~~~~~~~~~~~~~~~~
+- Migrates remainder of ``io.py`` to ``xarray/core/datatree_io.py`` and
+  ``TreeAttrAccessMixin`` into ``xarray/core/common.py`` (:pull:`9011`).
+  By `Owen Littlejohns <https://github.com/owenlittlejohns>`_ and
+  `Tom Nicholas <https://github.com/TomNicholas>`_.
 
 
 .. _whats-new.2024.05.0:
@@ -136,10 +147,9 @@ Internal Changes
   By `Owen Littlejohns <https://github.com/owenlittlejohns>`_, `Matt Savoie
   <https://github.com/flamingbear>`_ and `Tom Nicholas <https://github.com/TomNicholas>`_.
 - ``transpose``, ``set_dims``, ``stack`` & ``unstack`` now use a ``dim`` kwarg
-  rather than ``dims`` or ``dimensions``. This is the final change to unify
-  xarray functions to use ``dim``. Using the existing kwarg will raise a
-  warning.
-  By `Maximilian Roos <https://github.com/max-sixty>`_
+  rather than ``dims`` or ``dimensions``. This is the final change to make xarray methods
+  consistent with their use of ``dim``. Using the existing kwarg will raise a
+  warning. By `Maximilian Roos <https://github.com/max-sixty>`_
 
 .. _whats-new.2024.03.0:
 
@@ -2903,7 +2913,7 @@ Bug fixes
   process (:issue:`4045`, :pull:`4684`). It also enables encoding and decoding standard
   calendar dates with time units of nanoseconds (:pull:`4400`).
   By `Spencer Clark <https://github.com/spencerkclark>`_ and `Mark Harfouche
-  <http://github.com/hmaarrfk>`_.
+  <https://github.com/hmaarrfk>`_.
 - :py:meth:`DataArray.astype`, :py:meth:`Dataset.astype` and :py:meth:`Variable.astype` support
   the ``order`` and ``subok`` parameters again. This fixes a regression introduced in version 0.16.1
   (:issue:`4644`, :pull:`4683`).

diff --git a/properties/test_pandas_roundtrip.py b/properties/test_pandas_roundtrip.py
@@ -9,6 +9,7 @@
 import pytest
 
 import xarray as xr
+from xarray.tests import has_pandas_3
 
 pytest.importorskip("hypothesis")
 import hypothesis.extra.numpy as npst  # isort:skip
@@ -30,6 +31,16 @@
 )
 
 
+datetime_with_tz_strategy = st.datetimes(timezones=st.timezones())
+dataframe_strategy = pdst.data_frames(
+    [
+        pdst.column("datetime_col", elements=datetime_with_tz_strategy),
+        pdst.column("other_col", elements=st.integers()),
+    ],
+    index=pdst.range_indexes(min_size=1, max_size=10),
+)
+
+
 @st.composite
 def datasets_1d_vars(draw) -> xr.Dataset:
     """Generate datasets with only 1D variables
@@ -98,3 +109,19 @@ def test_roundtrip_pandas_dataframe(df) -> None:
     roundtripped = arr.to_pandas()
     pd.testing.assert_frame_equal(df, roundtripped)
     xr.testing.assert_identical(arr, roundtripped.to_xarray())
+
+
+@pytest.mark.skipif(
+    has_pandas_3,
+    reason="fails to roundtrip on pandas 3 (see https://github.com/pydata/xarray/issues/9098)",
+)
+@given(df=dataframe_strategy)
+def test_roundtrip_pandas_dataframe_datetime(df) -> None:
+    # Need to name the indexes, otherwise Xarray names them 'dim_0', 'dim_1'.
+    df.index.name = "rows"
+    df.columns.name = "cols"
+    dataset = xr.Dataset.from_dataframe(df)
+    roundtripped = dataset.to_dataframe()
+    roundtripped.columns.name = "cols"  # why?
+    pd.testing.assert_frame_equal(df, roundtripped)
+    xr.testing.assert_identical(dataset, roundtripped.to_xarray())
diff --git a/xarray/backends/api.py b/xarray/backends/api.py
@@ -36,7 +36,7 @@
 from xarray.core.dataarray import DataArray
 from xarray.core.dataset import Dataset, _get_chunk, _maybe_chunk
 from xarray.core.indexes import Index
-from xarray.core.types import ZarrWriteModes
+from xarray.core.types import NetcdfWriteModes, ZarrWriteModes
 from xarray.core.utils import is_remote_uri
 from xarray.namedarray.daskmanager import DaskManager
 from xarray.namedarray.parallelcompat import guess_chunkmanager
@@ -1120,7 +1120,7 @@ def open_mfdataset(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: str | os.PathLike | None = None,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
@@ -1138,7 +1138,7 @@ def to_netcdf(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: None = None,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
@@ -1155,7 +1155,7 @@ def to_netcdf(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: str | os.PathLike,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
@@ -1173,7 +1173,7 @@ def to_netcdf(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: str | os.PathLike,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
@@ -1191,7 +1191,7 @@ def to_netcdf(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: str | os.PathLike,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
@@ -1209,7 +1209,7 @@ def to_netcdf(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: str | os.PathLike,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
@@ -1226,7 +1226,7 @@ def to_netcdf(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: str | os.PathLike | None,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
@@ -1241,7 +1241,7 @@ def to_netcdf(
 def to_netcdf(
     dataset: Dataset,
     path_or_file: str | os.PathLike | None = None,
-    mode: Literal["w", "a"] = "w",
+    mode: NetcdfWriteModes = "w",
     format: T_NetcdfTypes | None = None,
     group: str | None = None,
     engine: T_NetcdfEngine | None = None,
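
The hunks above replace the repeated inline `Literal["w", "a"]` with a single named alias imported from `xarray.core.types`. As a rough sketch of the pattern, and assuming the alias carries the same two values it replaces (its definition is not part of this diff), it could look like the following. `check_mode` is a hypothetical helper added only to show that the allowed values can be recovered from the alias at runtime. It is not part of xarray.

```python
from typing import Literal, get_args

# Assumed definition: the diff swaps Literal["w", "a"] for this name,
# so the alias presumably holds the same two values.
NetcdfWriteModes = Literal["w", "a"]


def check_mode(mode: str) -> str:
    """Hypothetical runtime check; the alias itself is only enforced by type checkers."""
    allowed = get_args(NetcdfWriteModes)
    if mode not in allowed:
        raise ValueError(f"mode must be one of {allowed}, got {mode!r}")
    return mode
```

Defining the literal once means a future change to the allowed modes would touch one line instead of every `to_netcdf` overload.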

diff --git a/xarray/backends/common.py b/xarray/backends/common.py
@@ -19,9 +19,6 @@
 if TYPE_CHECKING:
     from io import BufferedIOBase
 
-    from h5netcdf.legacyapi import Dataset as ncDatasetLegacyH5
-    from netCDF4 import Dataset as ncDataset
-
     from xarray.core.dataset import Dataset
     from xarray.core.datatree import DataTree
     from xarray.core.types import NestedSequence
@@ -131,33 +128,6 @@ def _decode_variable_name(name):
     return name
 
 
-def _open_datatree_netcdf(
-    ncDataset: ncDataset | ncDatasetLegacyH5,
-    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
-    **kwargs,
-) -> DataTree:
-    from xarray.backends.api import open_dataset
-    from xarray.core.datatree import DataTree
-    from xarray.core.treenode import NodePath
-
-    ds = open_dataset(filename_or_obj, **kwargs)
-    tree_root = DataTree.from_dict({"/": ds})
-    with ncDataset(filename_or_obj, mode="r") as ncds:
-        for path in _iter_nc_groups(ncds):
-            subgroup_ds = open_dataset(filename_or_obj, group=path, **kwargs)
-
-            # TODO refactor to use __setitem__ once creation of new nodes by assigning Dataset works again
-            node_name = NodePath(path).name
-            new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
-            tree_root._set_item(
-                path,
-                new_node,
-                allow_overwrite=False,
-                new_nodes_along_path=True,
-            )
-    return tree_root
-
-
 def _iter_nc_groups(root, parent="/"):
     from xarray.core.treenode import NodePath
 

diff --git a/xarray/backends/h5netcdf_.py b/xarray/backends/h5netcdf_.py
@@ -3,15 +3,14 @@
 import functools
 import io
 import os
-from collections.abc import Iterable
+from collections.abc import Callable, Iterable
 from typing import TYPE_CHECKING, Any
 
 from xarray.backends.common import (
     BACKEND_ENTRYPOINTS,
     BackendEntrypoint,
     WritableCFDataStore,
     _normalize_path,
-    _open_datatree_netcdf,
     find_root_and_group,
 )
 from xarray.backends.file_manager import CachingFileManager, DummyFileManager
@@ -431,11 +430,58 @@ def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporti
     def open_datatree(
         self,
         filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
+        *,
+        mask_and_scale=True,
+        decode_times=True,
+        concat_characters=True,
+        decode_coords=True,
+        drop_variables: str | Iterable[str] | None = None,
+        use_cftime=None,
+        decode_timedelta=None,
+        group: str | Iterable[str] | Callable | None = None,
         **kwargs,
     ) -> DataTree:
-        from h5netcdf.legacyapi import Dataset as ncDataset
+        from xarray.backends.api import open_dataset
+        from xarray.backends.common import _iter_nc_groups
+        from xarray.core.datatree import DataTree
+        from xarray.core.treenode import NodePath
+        from xarray.core.utils import close_on_error
 
-        return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)
+        filename_or_obj = _normalize_path(filename_or_obj)
+        store = H5NetCDFStore.open(
+            filename_or_obj,
+            group=group,
+        )
+        if group:
+            parent = NodePath("/") / NodePath(group)
+        else:
+            parent = NodePath("/")
+
+        manager = store._manager
+        ds = open_dataset(store, **kwargs)
+        tree_root = DataTree.from_dict({str(parent): ds})
+        for path_group in _iter_nc_groups(store.ds, parent=parent):
+            group_store = H5NetCDFStore(manager, group=path_group, **kwargs)
+            store_entrypoint = StoreBackendEntrypoint()
+            with close_on_error(group_store):
+                ds = store_entrypoint.open_dataset(
+                    group_store,
+                    mask_and_scale=mask_and_scale,
+                    decode_times=decode_times,
+                    concat_characters=concat_characters,
+                    decode_coords=decode_coords,
+                    drop_variables=drop_variables,
+                    use_cftime=use_cftime,
+                    decode_timedelta=decode_timedelta,
+                )
+                new_node: DataTree = DataTree(name=NodePath(path_group).name, data=ds)
+                tree_root._set_item(
+                    path_group,
+                    new_node,
+                    allow_overwrite=False,
+                    new_nodes_along_path=True,
+                )
+        return tree_root
 
 
 BACKEND_ENTRYPOINTS["h5netcdf"] = ("h5netcdf", H5netcdfBackendEntrypoint)
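
The new `open_datatree` opens the store once, walks its groups with `_iter_nc_groups(store.ds, parent=parent)`, and reuses `store._manager` for each group. The removed `_open_datatree_netcdf` helper instead called `open_dataset(filename_or_obj, group=path, **kwargs)` for every group. The body of `_iter_nc_groups` is not shown in this diff, so the following is only a sketch of how such a recursive walk over an already-open handle might look. `FakeGroup` and `iter_groups` are hypothetical stand-ins, and `PurePosixPath` is used here in place of xarray's `NodePath`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class FakeGroup:
    """Hypothetical stand-in for an open netCDF group; only `.groups` is modelled."""

    groups: dict[str, FakeGroup] = field(default_factory=dict)


def iter_groups(root: FakeGroup, parent: str = "/"):
    """Yield the absolute path of every descendant group, parents before children."""
    parent_path = PurePosixPath(parent)
    for name, child in root.groups.items():
        path = parent_path / name
        yield str(path)
        # Recurse into the child that is already in memory; nothing is reopened.
        yield from iter_groups(child, parent=str(path))


# One "open" root handle with nested groups /a, /a/b and /c.
root = FakeGroup({"a": FakeGroup({"b": FakeGroup()}), "c": FakeGroup()})
paths = list(iter_groups(root))
```

Here `paths` lists each group once, with parents ahead of their children, so a tree can be built by attaching nodes in iteration order.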