
Merge branch 'main' into fix-duplicate-dimensions
* main:
  new whats-new section (pydata#9115)
  release v2024.06.0 (pydata#9113)
  release notes for 2024.06.0 (pydata#9092)
  [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  Undo custom padding-top. (pydata#9107)
  add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  Add user survey announcement to docs (pydata#9101)
  skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  Adds Matt Savoie to CITATION.cff (pydata#9103)
  [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
dcherian committed Jun 13, 2024
2 parents af380cf + 9237f90 commit 42326c3
Showing 11 changed files with 345 additions and 155 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/hypothesis.yaml
@@ -39,9 +39,9 @@ jobs:
if: |
always()
&& (
(github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
|| needs.detect-ci-trigger.outputs.triggered == 'true'
|| contains( github.event.pull_request.labels.*.name, 'run-slow-hypothesis')
needs.detect-ci-trigger.outputs.triggered == 'false'
&& ( (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
|| contains( github.event.pull_request.labels.*.name, 'run-slow-hypothesis'))
)
defaults:
run:
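The reworked `if:` condition inverts the trigger logic: instead of running whenever any trigger fired, the job now runs only when `detect-ci-trigger` did *not* fire (here assumed to reflect the `[skip-ci]` keyword detection) *and* one of the schedule/dispatch/label triggers holds. A minimal stdlib sketch modeling that boolean logic (the function name and inputs are illustrative, not part of the workflow):

```python
def should_run(event_name: str, triggered: str, labels: list[str]) -> bool:
    """Model of the fixed `if:` condition in hypothesis.yaml.

    `triggered` is the string output of the detect-ci-trigger job
    ('true'/'false'); the fix requires it to be 'false' before any of the
    other triggers (schedule, manual dispatch, or PR label) can start the job.
    """
    scheduled = event_name in ("schedule", "workflow_dispatch")
    labeled = "run-slow-hypothesis" in labels
    return triggered == "false" and (scheduled or labeled)

# Nightly schedule, no skip-ci trigger detected: the job runs.
print(should_run("schedule", "false", []))  # True
# detect-ci-trigger fired: the job is skipped even with the label present.
print(should_run("pull_request", "true", ["run-slow-hypothesis"]))  # False
```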
5 changes: 5 additions & 0 deletions CITATION.cff
@@ -84,6 +84,11 @@ authors:
- family-names: "Scheick"
given-names: "Jessica"
orcid: "https://orcid.org/0000-0002-3421-4459"
- family-names: "Savoie"
given-names: "Matthew"
orcid: "https://orcid.org/0000-0002-8881-2550"
- family-names: "Littlejohns"
given-names: "Owen"
title: "xarray"
abstract: "N-D labeled arrays and datasets in Python."
license: Apache-2.0
7 changes: 2 additions & 5 deletions doc/_static/style.css
@@ -7,9 +7,8 @@ table.docutils td {
word-wrap: break-word;
}

div.bd-header-announcement {
background-color: unset;
color: #000;
.bd-header-announcement {
background-color: var(--pst-color-info-bg);
}

/* Reduce left and right margins */
@@ -222,8 +221,6 @@
}

body {
/* Add padding to body to avoid overlap with navbar. */
padding-top: var(--navbar-height);
width: 100%;
}

2 changes: 1 addition & 1 deletion doc/conf.py
@@ -242,7 +242,7 @@
Theme by the <a href="https://ebp.jupyterbook.org">Executable Book Project</a></p>""",
twitter_url="https://twitter.com/xarray_dev",
icon_links=[], # workaround for pydata/pydata-sphinx-theme#1220
announcement="🍾 <a href='https://github.com/pydata/xarray/discussions/8462'>Xarray is now 10 years old!</a> 🎉",
announcement="<a href='https://forms.gle/KEq7WviCdz9xTaJX6'>Xarray's 2024 User Survey is live now. Please take ~5 minutes to fill it out and help us improve Xarray.</a>",
)


60 changes: 43 additions & 17 deletions doc/whats-new.rst
@@ -15,22 +15,14 @@ What's New
np.random.seed(123456)
.. _whats-new.2024.05.1:
.. _whats-new.2024.06.1:

v2024.06 (unreleased)
v2024.06.1 (unreleased)
-----------------------

New Features
~~~~~~~~~~~~

Performance
~~~~~~~~~~~

- Small optimization to the netCDF4 and h5netcdf backends (:issue:`9058`, :pull:`9067`).
  By `Deepak Cherian <https://github.com/dcherian>`_.
- Small optimizations to help reduce indexing speed of datasets (:pull:`9002`).
  By `Mark Harfouche <https://github.com/hmaarrfk>`_.


Breaking changes
~~~~~~~~~~~~~~~~
@@ -40,14 +32,45 @@ Deprecations
~~~~~~~~~~~~


Bug fixes
~~~~~~~~~


Documentation
~~~~~~~~~~~~~


Internal Changes
~~~~~~~~~~~~~~~~


.. _whats-new.2024.06.0:

v2024.06.0 (Jun 13, 2024)
-------------------------
This release brings various performance optimizations and compatibility with the upcoming numpy 2.0 release.

Thanks to the 22 contributors to this release:
Alfonso Ladino, David Hoese, Deepak Cherian, Eni Awowale, Ilan Gold, Jessica Scheick, Joe Hamman, Justus Magin, Kai Mühlbauer, Mark Harfouche, Mathias Hauser, Matt Savoie, Maximilian Roos, Mike Thramann, Nicolas Karasiak, Owen Littlejohns, Paul Ockenfuß, Philippe THOMY, Scott Henderson, Spencer Clark, Stephan Hoyer and Tom Nicholas

Performance
~~~~~~~~~~~

- Small optimization to the netCDF4 and h5netcdf backends (:issue:`9058`, :pull:`9067`).
  By `Deepak Cherian <https://github.com/dcherian>`_.
- Small optimizations to help reduce indexing speed of datasets (:pull:`9002`).
  By `Mark Harfouche <https://github.com/hmaarrfk>`_.
- Performance improvement in `open_datatree` method for Zarr, netCDF4 and h5netcdf backends (:issue:`8994`, :pull:`9014`).
  By `Alfonso Ladino <https://github.com/aladinor>`_.


Bug fixes
~~~~~~~~~
- Preserve conversion of timezone-aware pandas Datetime arrays to numpy object arrays
(:issue:`9026`, :pull:`9042`).
  By `Ilan Gold <https://github.com/ilan-gold>`_.

- :py:meth:`DataArrayResample.interpolate` and :py:meth:`DatasetResample.interpolate` method now
support aribtrary kwargs such as ``order`` for polynomial interpolation. (:issue:`8762`).
support arbitrary kwargs such as ``order`` for polynomial interpolation (:issue:`8762`).
  By `Nicolas Karasiak <https://github.com/nkarasiak>`_.

- Allow chunking for arrays with duplicated dimension names (:issue:`8759`, :pull:`9099`).
@@ -56,16 +79,18 @@ Bug fixes

Documentation
~~~~~~~~~~~~~
- Add link to CF Conventions on packed data and sentence on type determination in doc/user-guide/io.rst (:issue:`9041`, :pull:`9045`).
- Add link to CF Conventions on packed data and sentence on type determination in the I/O user guide (:issue:`9041`, :pull:`9045`).
  By `Kai Mühlbauer <https://github.com/kmuehlbauer>`_.


Internal Changes
~~~~~~~~~~~~~~~~
- Migrates remainder of ``io.py`` to ``xarray/core/datatree_io.py`` and
``TreeAttrAccessMixin`` into ``xarray/core/common.py`` (:pull: `9011`)
``TreeAttrAccessMixin`` into ``xarray/core/common.py`` (:pull:`9011`).
  By `Owen Littlejohns <https://github.com/owenlittlejohns>`_ and
  `Tom Nicholas <https://github.com/TomNicholas>`_.
- Compatibility with numpy 2 (:issue:`8844`, :pull:`8854`, :pull:`8946`).
  By `Justus Magin <https://github.com/keewis>`_ and `Stephan Hoyer <https://github.com/shoyer>`_.


.. _whats-new.2024.05.0:
@@ -124,8 +149,8 @@ Bug fixes
<https://github.com/pandas-dev/pandas/issues/56147>`_ to
:py:func:`pandas.date_range`, date ranges produced by
:py:func:`xarray.cftime_range` with negative frequencies will now fall fully
within the bounds of the provided start and end dates (:pull:`8999`). By
`Spencer Clark <https://github.com/spencerkclark>`_.
within the bounds of the provided start and end dates (:pull:`8999`).
  By `Spencer Clark <https://github.com/spencerkclark>`_.

Internal Changes
~~~~~~~~~~~~~~~~
@@ -150,7 +175,8 @@ Internal Changes
- ``transpose``, ``set_dims``, ``stack`` & ``unstack`` now use a ``dim`` kwarg
rather than ``dims`` or ``dimensions``. This is the final change to make xarray methods
consistent with their use of ``dim``. Using the existing kwarg will raise a
  warning. By `Maximilian Roos <https://github.com/max-sixty>`_
warning.
  By `Maximilian Roos <https://github.com/max-sixty>`_

.. _whats-new.2024.03.0:

5 changes: 5 additions & 0 deletions properties/test_pandas_roundtrip.py
@@ -9,6 +9,7 @@
import pytest

import xarray as xr
from xarray.tests import has_pandas_3

pytest.importorskip("hypothesis")
import hypothesis.extra.numpy as npst # isort:skip
@@ -110,6 +111,10 @@ def test_roundtrip_pandas_dataframe(df) -> None:
xr.testing.assert_identical(arr, roundtripped.to_xarray())


@pytest.mark.skipif(
has_pandas_3,
reason="fails to roundtrip on pandas 3 (see https://github.com/pydata/xarray/issues/9098)",
)
@given(df=dataframe_strategy)
def test_roundtrip_pandas_dataframe_datetime(df) -> None:
# Need to name the indexes, otherwise Xarray names them 'dim_0', 'dim_1'.
30 changes: 0 additions & 30 deletions xarray/backends/common.py
@@ -19,9 +19,6 @@
if TYPE_CHECKING:
from io import BufferedIOBase

from h5netcdf.legacyapi import Dataset as ncDatasetLegacyH5
from netCDF4 import Dataset as ncDataset

from xarray.core.dataset import Dataset
from xarray.core.datatree import DataTree
from xarray.core.types import NestedSequence
@@ -131,33 +128,6 @@ def _decode_variable_name(name):
return name


def _open_datatree_netcdf(
ncDataset: ncDataset | ncDatasetLegacyH5,
filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
**kwargs,
) -> DataTree:
from xarray.backends.api import open_dataset
from xarray.core.datatree import DataTree
from xarray.core.treenode import NodePath

ds = open_dataset(filename_or_obj, **kwargs)
tree_root = DataTree.from_dict({"/": ds})
with ncDataset(filename_or_obj, mode="r") as ncds:
for path in _iter_nc_groups(ncds):
subgroup_ds = open_dataset(filename_or_obj, group=path, **kwargs)

# TODO refactor to use __setitem__ once creation of new nodes by assigning Dataset works again
node_name = NodePath(path).name
new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
tree_root._set_item(
path,
new_node,
allow_overwrite=False,
new_nodes_along_path=True,
)
return tree_root


def _iter_nc_groups(root, parent="/"):
from xarray.core.treenode import NodePath

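`_iter_nc_groups`, which stays in `common.py`, walks nested netCDF groups depth-first and yields slash-separated paths for each one. A dict-based sketch of the same traversal (the mock `groups` mapping stands in for a real netCDF/HDF5 group object; names are illustrative):

```python
def iter_groups(node: dict, parent: str = "/"):
    """Yield the path of every subgroup below `node`, depth-first."""
    for name, child in node.get("groups", {}).items():
        path = f"{parent.rstrip('/')}/{name}"
        yield path
        yield from iter_groups(child, parent=path)

# Mock hierarchy: /a, /a/b, /c
tree = {"groups": {"a": {"groups": {"b": {}}}, "c": {}}}
print(list(iter_groups(tree)))  # ['/a', '/a/b', '/c']
```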
54 changes: 50 additions & 4 deletions xarray/backends/h5netcdf_.py
@@ -3,15 +3,14 @@
import functools
import io
import os
from collections.abc import Iterable
from collections.abc import Callable, Iterable
from typing import TYPE_CHECKING, Any

from xarray.backends.common import (
BACKEND_ENTRYPOINTS,
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
find_root_and_group,
)
from xarray.backends.file_manager import CachingFileManager, DummyFileManager
@@ -431,11 +430,58 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
def open_datatree(
self,
filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
*,
mask_and_scale=True,
decode_times=True,
concat_characters=True,
decode_coords=True,
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
group: str | Iterable[str] | Callable | None = None,
**kwargs,
) -> DataTree:
from h5netcdf.legacyapi import Dataset as ncDataset
from xarray.backends.api import open_dataset
from xarray.backends.common import _iter_nc_groups
from xarray.core.datatree import DataTree
from xarray.core.treenode import NodePath
from xarray.core.utils import close_on_error

return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)
filename_or_obj = _normalize_path(filename_or_obj)
store = H5NetCDFStore.open(
filename_or_obj,
group=group,
)
if group:
parent = NodePath("/") / NodePath(group)
else:
parent = NodePath("/")

manager = store._manager
ds = open_dataset(store, **kwargs)
tree_root = DataTree.from_dict({str(parent): ds})
for path_group in _iter_nc_groups(store.ds, parent=parent):
group_store = H5NetCDFStore(manager, group=path_group, **kwargs)
store_entrypoint = StoreBackendEntrypoint()
with close_on_error(group_store):
ds = store_entrypoint.open_dataset(
group_store,
mask_and_scale=mask_and_scale,
decode_times=decode_times,
concat_characters=concat_characters,
decode_coords=decode_coords,
drop_variables=drop_variables,
use_cftime=use_cftime,
decode_timedelta=decode_timedelta,
)
new_node: DataTree = DataTree(name=NodePath(path_group).name, data=ds)
tree_root._set_item(
path_group,
new_node,
allow_overwrite=False,
new_nodes_along_path=True,
)
return tree_root


BACKEND_ENTRYPOINTS["h5netcdf"] = ("h5netcdf", H5netcdfBackendEntrypoint)
53 changes: 49 additions & 4 deletions xarray/backends/netCDF4_.py
@@ -3,7 +3,7 @@
import functools
import operator
import os
from collections.abc import Iterable
from collections.abc import Callable, Iterable
from contextlib import suppress
from typing import TYPE_CHECKING, Any

@@ -16,7 +16,6 @@
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
find_root_and_group,
robust_getitem,
)
@@ -672,11 +671,57 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
def open_datatree(
self,
filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
*,
mask_and_scale=True,
decode_times=True,
concat_characters=True,
decode_coords=True,
drop_variables: str | Iterable[str] | None = None,
use_cftime=None,
decode_timedelta=None,
group: str | Iterable[str] | Callable | None = None,
**kwargs,
) -> DataTree:
from netCDF4 import Dataset as ncDataset
from xarray.backends.api import open_dataset
from xarray.backends.common import _iter_nc_groups
from xarray.core.datatree import DataTree
from xarray.core.treenode import NodePath

return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)
filename_or_obj = _normalize_path(filename_or_obj)
store = NetCDF4DataStore.open(
filename_or_obj,
group=group,
)
if group:
parent = NodePath("/") / NodePath(group)
else:
parent = NodePath("/")

manager = store._manager
ds = open_dataset(store, **kwargs)
tree_root = DataTree.from_dict({str(parent): ds})
for path_group in _iter_nc_groups(store.ds, parent=parent):
group_store = NetCDF4DataStore(manager, group=path_group, **kwargs)
store_entrypoint = StoreBackendEntrypoint()
with close_on_error(group_store):
ds = store_entrypoint.open_dataset(
group_store,
mask_and_scale=mask_and_scale,
decode_times=decode_times,
concat_characters=concat_characters,
decode_coords=decode_coords,
drop_variables=drop_variables,
use_cftime=use_cftime,
decode_timedelta=decode_timedelta,
)
new_node: DataTree = DataTree(name=NodePath(path_group).name, data=ds)
tree_root._set_item(
path_group,
new_node,
allow_overwrite=False,
new_nodes_along_path=True,
)
return tree_root


BACKEND_ENTRYPOINTS["netcdf4"] = ("netCDF4", NetCDF4BackendEntrypoint)
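Both new `open_datatree` implementations replace the shared `_open_datatree_netcdf` helper with the same per-backend pattern: open the root group as a dataset, then open each subgroup through a store that reuses the root's file manager and graft it onto the tree without overwriting existing nodes. A stdlib sketch of that assembly loop (the `open_group` callable and the dict-backed "tree" stand in for the real store and `DataTree`; all names are illustrative):

```python
def build_tree(group_paths, open_group):
    """Assemble a path -> dataset mapping the way open_datatree grafts nodes."""
    tree = {"/": open_group("/")}  # root dataset first
    for path in group_paths:       # then each subgroup, in traversal order
        if path in tree:
            # mirrors _set_item(..., allow_overwrite=False)
            raise ValueError(f"duplicate group {path!r}")
        tree[path] = open_group(path)
    return tree

# Mock per-group datasets keyed by path.
datasets = {"/": "root_ds", "/a": "ds_a", "/a/b": "ds_ab"}
tree = build_tree(["/a", "/a/b"], datasets.__getitem__)
print(sorted(tree))  # ['/', '/a', '/a/b']
```

In the real backends the traversal order comes from `_iter_nc_groups`, and each `open_group` call is a `StoreBackendEntrypoint.open_dataset` on a group-scoped store, which is what avoids reopening the file per group.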
