
Rely on NEP-18 to dispatch to dask in duck_array_ops #5571

Merged

Conversation

TomNicholas
Member

@TomNicholas TomNicholas commented Jul 3, 2021

Removes special-casing for dask in duck_array_ops.py, instead relying on NEP-18 to call it when the input is a dask array.

We probably don't actually need the _dask_or_eager_func() (now _module_func()) helper function at all, because all remaining instances look like pandas_isnull = _module_func("isnull", module=pd), which could simply be pandas_isnull = pd.isnull.
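For context, NEP-18 dispatch works because NumPy checks its arguments for an __array_function__ method before running its own implementation; dask arrays define one, so calls like np.sum on a dask array stay lazy. A minimal illustration of the protocol with a toy class (DuckArray is an invented name, not dask):

```python
import numpy as np

class DuckArray:
    """Toy array implementing the NEP-18 __array_function__ protocol."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap DuckArray arguments, run the real numpy function,
        # and re-wrap the result so our type survives the call.
        unwrapped = [a.data if isinstance(a, DuckArray) else a for a in args]
        return DuckArray(func(*unwrapped, **kwargs))

x = DuckArray([1.0, 2.0, 3.0])
result = np.mean(x)  # numpy dispatches to DuckArray.__array_function__
print(type(result).__name__)  # DuckArray
```

Because the wrapper type survives np.* calls, duck_array_ops can call plain numpy functions and let the argument decide how to execute them.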

The only problem is that I seem to have broken one (parameterized) test: test_duck_array_ops.py::test_min_count[True-True-None-sum-True-bool_-1] fails with

    @pytest.mark.parametrize("dim_num", [1, 2])
    @pytest.mark.parametrize("dtype", [float, int, np.float32, np.bool_])
    @pytest.mark.parametrize("dask", [False, True])
    @pytest.mark.parametrize("func", ["sum", "prod"])
    @pytest.mark.parametrize("aggdim", [None, "x"])
    @pytest.mark.parametrize("contains_nan", [True, False])
    @pytest.mark.parametrize("skipna", [True, False, None])
    def test_min_count(dim_num, dtype, dask, func, aggdim, contains_nan, skipna):
        if dask and not has_dask:
            pytest.skip("requires dask")
    
        da = construct_dataarray(dim_num, dtype, contains_nan=contains_nan, dask=dask)
        min_count = 3
    
        # If using Dask, the function call should be lazy.
        with raise_if_dask_computes():
>           actual = getattr(da, func)(dim=aggdim, skipna=skipna, min_count=min_count)

/home/tegn500/Documents/Work/Code/xarray/xarray/tests/test_duck_array_ops.py:578: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/tegn500/Documents/Work/Code/xarray/xarray/core/common.py:56: in wrapped_func
    return self.reduce(func, dim, axis, skipna=skipna, **kwargs)
/home/tegn500/Documents/Work/Code/xarray/xarray/core/dataarray.py:2638: in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
/home/tegn500/Documents/Work/Code/xarray/xarray/core/variable.py:1725: in reduce
    data = func(self.data, **kwargs)
/home/tegn500/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:328: in f
    return func(values, axis=axis, **kwargs)
/home/tegn500/Documents/Work/Code/xarray/xarray/core/nanops.py:106: in nansum
    a, mask = _replace_nan(a, 0)
/home/tegn500/Documents/Work/Code/xarray/xarray/core/nanops.py:23: in _replace_nan
    mask = isnull(a)
/home/tegn500/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:83: in isnull
    return pandas_isnull(data)
/home/tegn500/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:40: in f
    return getattr(module, name)(*args, **kwargs)
/home/tegn500/miniconda3/envs/py38-mamba/lib/python3.8/site-packages/pandas/core/dtypes/missing.py:127: in isna
    return _isna(obj)
/home/tegn500/miniconda3/envs/py38-mamba/lib/python3.8/site-packages/pandas/core/dtypes/missing.py:166: in _isna
    return _isna_ndarraylike(np.asarray(obj), inf_as_na=inf_as_na)
/home/tegn500/miniconda3/envs/py38-mamba/lib/python3.8/site-packages/numpy/core/_asarray.py:102: in asarray
    return array(a, dtype, copy=False, order=order)
/home/tegn500/miniconda3/envs/py38-mamba/lib/python3.8/site-packages/dask/array/core.py:1502: in __array__
    x = self.compute()
/home/tegn500/miniconda3/envs/py38-mamba/lib/python3.8/site-packages/dask/base.py:285: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/home/tegn500/miniconda3/envs/py38-mamba/lib/python3.8/site-packages/dask/base.py:567: in compute
    results = schedule(dsk, keys, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xarray.tests.CountingScheduler object at 0x7f0804db2310>
dsk = {('xarray-<this-array>-29953318277423606f95b509ad1a9aa7', 0): array([False, False, False, False], dtype=object), ('xar...pe=object), ('xarray-<this-array>-29953318277423606f95b509ad1a9aa7', 3): array([nan, False, False, nan], dtype=object)}
keys = [[('xarray-<this-array>-29953318277423606f95b509ad1a9aa7', 0), ('xarray-<this-array>-29953318277423606f95b509ad1a9aa7'...array-<this-array>-29953318277423606f95b509ad1a9aa7', 2), ('xarray-<this-array>-29953318277423606f95b509ad1a9aa7', 3)]]
kwargs = {}

    def __call__(self, dsk, keys, **kwargs):
        self.total_computes += 1
        if self.total_computes > self.max_computes:
>           raise RuntimeError(
                "Too many computes. Total: %d > max: %d."
                % (self.total_computes, self.max_computes)
            )
E           RuntimeError: Too many computes. Total: 1 > max: 0.

/home/tegn500/Documents/Work/Code/xarray/xarray/tests/__init__.py:118: RuntimeError
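The RuntimeError at the bottom comes from the CountingScheduler in xarray's test helpers (xarray/tests/__init__.py), which is installed as the dask scheduler so that any accidental compute during a supposedly lazy operation raises immediately. A simplified standalone sketch of the mechanism, based on the __call__ shown in the traceback:

```python
class CountingScheduler:
    """Raise if more than `max_computes` dask computes are triggered."""

    def __init__(self, max_computes=0):
        self.total_computes = 0
        self.max_computes = max_computes

    def __call__(self, dsk, keys, **kwargs):
        self.total_computes += 1
        if self.total_computes > self.max_computes:
            raise RuntimeError(
                "Too many computes. Total: %d > max: %d."
                % (self.total_computes, self.max_computes)
            )
        # only reached when computes are allowed
        import dask
        return dask.get(dsk, keys, **kwargs)

# With max_computes=0, the very first compute attempt raises:
sched = CountingScheduler(max_computes=0)
try:
    sched({}, [])
except RuntimeError as err:
    print(err)  # Too many computes. Total: 1 > max: 0.
```

raise_if_dask_computes() in the test is a context manager that temporarily swaps this scheduler in, so the np.asarray call deep inside pandas.isna trips it.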

@pep8speaks

pep8speaks commented Jul 3, 2021

Hello @TomNicholas! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-29 17:13:58 UTC

@TomNicholas
Member Author

Right, I understand the failure now: everywhere that _dask_or_eager_func was dispatching numpy -> dask is working fine via NEP-18, but my change made pandas.isnull no longer dispatch to dask.isnull, which then computes (and raises, because we want it to be lazy). The solution is to special-case that one use of pandas (and perhaps any other uses of pandas?).

@github-actions
Contributor

github-actions bot commented Jul 3, 2021

Unit Test Results

6 files ±0   6 suites ±0   56m 27s ⏱️ ±0s
16 228 tests ±0   14 493 ✔️ ±0   1 735 💤 ±0   0 ❌ ±0
90 564 runs ±0   82 384 ✔️ ±0   8 180 💤 ±0   0 ❌ ±0

Results for commit ebfc6a3. ± Comparison against base commit ebfc6a3.

♻️ This comment has been updated with latest results.

@TomNicholas
Member Author

A net-negative pull request 🤯

@max-sixty
Collaborator

Looks great! I don't know the details of this code well, but at the conceptual level it looks good!

A net-negative pull request 🤯

The best type!

from numpy import take, tensordot, transpose, unravel_index # noqa
from numpy import where as _where
from numpy import zeros_like # noqa
from numpy.ma import masked_invalid # noqa
Contributor

do masked arrays support NEP-18?

Member Author

Hmm, I don't see an __array_function__ method in MaskedArray, but it does inherit from ndarray?

Member Author

It does have an __array_function__ attribute

Collaborator

right, it does support NEP-18.

However, for some functions (e.g. np.median) it returns a strange MaskedConstant object if the result would be a constant. Not sure if it needs special care, but something to be aware of.
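Both observations can be checked directly: MaskedArray inherits ndarray's __array_function__, so NEP-18 dispatch succeeds, and functions that defer to methods (like np.mean) respect the mask, while an all-masked reduction produces the MaskedConstant mentioned above (shown here with np.ma.median, as an illustration):

```python
import numpy as np

m = np.ma.masked_invalid([1.0, np.nan, 3.0])

# NEP-18 support is inherited from ndarray:
print(hasattr(m, "__array_function__"))  # True

# np.mean defers to MaskedArray.mean, which skips masked entries
print(np.mean(m))  # 2.0

# reducing an all-masked array yields the odd MaskedConstant
print(repr(np.ma.median(np.ma.masked_invalid([np.nan]))))
```

So masked arrays pass through NEP-18 dispatch fine, but constant-valued results may need special care downstream.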

@TomNicholas TomNicholas mentioned this pull request Jul 8, 2021

zeros_like = _dask_or_eager_func("zeros_like")
# Requires special-casing because pandas won't automatically dispatch to dask.isnull via NEP-18
def _dask_or_eager_isnull(obj):
    if is_duck_dask_array(obj):
Collaborator

does dask_array.isnull dispatch for duck dask arrays, or does this also create e.g. dask(pint(dask)) objects?

Member Author

It does this:

import dask.array as dask_array
import pint

d = dask_array.array([1, 2, 3])
q = pint.Quantity(d, units="m")
dask_array.isnull(q)

raises warnings

/home/tegn500/miniconda3/envs/xarray-testing-min-all-deps-py37/lib/python3.7/site-packages/dask/array/core.py:2756: UserWarning: Passing an object to dask.array.from_array which is already a Dask collection. This can lead to unexpected behavior.
  "Passing an object to dask.array.from_array which is already a "
/home/tegn500/miniconda3/envs/xarray-testing-min-all-deps-py37/lib/python3.7/site-packages/numpy/core/_asarray.py:85: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return array(a, dtype, copy=False, order=order)

and returns the result

dask.array<_asarray_isnull, shape=(3,), dtype=bool, chunksize=(3,), chunktype=numpy.ndarray>

I think that's a sensible result of calling isnull on a chunked pint array?

@TomNicholas
Member Author

Now there is one more mystery: why do some of the padding tests fail with output that is 0.5 away from what's expected?

_____________ TestVariableWithDask.test_pad[xr_arg3-np_arg3-mean] ______________
[gw3] linux -- Python 3.7.10 /usr/share/miniconda/envs/xarray-tests/bin/python

self = <xarray.tests.test_variable.TestVariableWithDask object at 0x7f10bd848990>
mode = 'mean', xr_arg = {'x': (3, 1), 'z': (2, 0)}
np_arg = ((3, 1), (0, 0), (2, 0))

    @pytest.mark.parametrize(
        "mode",
        [
            "mean",
            pytest.param(
                "median",
                marks=pytest.mark.xfail(reason="median is not implemented by Dask"),
            ),
            pytest.param(
                "reflect", marks=pytest.mark.xfail(reason="dask.array.pad bug")
            ),
            "edge",
            pytest.param(
                "linear_ramp",
                marks=pytest.mark.xfail(
                    reason="pint bug: https://github.com/hgrecco/pint/issues/1026"
                ),
            ),
            "maximum",
            "minimum",
            "symmetric",
            "wrap",
        ],
    )
    @pytest.mark.parametrize("xr_arg, np_arg", _PAD_XR_NP_ARGS)
    @pytest.mark.filterwarnings(
        r"ignore:dask.array.pad.+? converts integers to floats."
    )
    def test_pad(self, mode, xr_arg, np_arg):
        data = np.arange(4 * 3 * 2).reshape(4, 3, 2)
        v = self.cls(["x", "y", "z"], data)
    
        actual = v.pad(mode=mode, **xr_arg)
        expected = np.pad(data, np_arg, mode=mode)
   
>       assert_array_equal(actual, expected)
E       AssertionError: 
E       Arrays are not equal
E       
E       Mismatched elements: 48 / 96 (50%)
E       Max absolute difference: 0.5
E       Max relative difference: 0.25
E        x: array([[[ 9.5,  9.5,  9. , 10. ],
E               [11.5, 11.5, 11. , 12. ],
E               [13.5, 13.5, 13. , 14. ]],...
E        y: array([[[10, 10,  9, 10],
E               [12, 12, 11, 12],
E               [14, 14, 13, 14]],...

@dcherian
Contributor

dcherian commented Jul 23, 2021

for padding fail with output that is 0.5 away from what's expected

We work around some dask bugs with dask_array_compat.pad. I think you'll need to revert to that and not use np.pad until we can bump dask.

EDIT: though I don't see this kind of thing being fixed there.
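The compat pattern being described is a shim that routes pad to whichever library behaves correctly for the array type. A hypothetical sketch (pad_compat is an invented name; the real helper is xarray's dask_array_compat.pad, which also patched specific dask bugs):

```python
import numpy as np

def pad_compat(array, pad_width, mode="constant", **kwargs):
    """Route pad to dask for dask collections, numpy otherwise."""
    if hasattr(array, "__dask_graph__"):  # dask collection protocol
        import dask.array
        return dask.array.pad(array, pad_width, mode=mode, **kwargs)
    return np.pad(array, pad_width, mode=mode, **kwargs)

print(pad_compat(np.arange(3), (1, 1)))          # zero-padded ndarray
print(pad_compat(np.array([1, 2]), (0, 2), mode="edge"))
```

Once the minimum dask version is new enough that dask.array.pad matches np.pad, the shim (and this kind of 0.5-off discrepancy) can go away.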

@Illviljan
Contributor

The issue mentioned in dask_array_compat.pad was closed around May 2020 (dask/dask#6213).
The minimum version of dask can be bumped to 2.24 tomorrow (from 2.15 and 2.9); maybe that's enough to fix the errors?

@Illviljan
Contributor

@TomNicholas try merging main now. Let's see if there are any errors left.

@TomNicholas
Member Author

@Illviljan that actually solved those padding errors! Awesome!

The tests still fail because of something going on with np.around in the doctests though 😕

@dcherian
Contributor

I think you just need to fix the spaces in the expected output

@Illviljan
Contributor

___________ [doctest] xarray.core._typed_ops.DataArrayOpsMixin.round ___________
EXAMPLE LOCATION UNKNOWN, not showing all tests of that example
??? >>> np.around([0.37, 1.64])
Expected:
    array([0.,  2.])
Got:
    array([0., 2.])

/home/runner/work/xarray/xarray/xarray/core/_typed_ops.py:None: DocTestFailure

The expected output had 2 spaces after the comma for some reason. I think the actual output makes more sense now, so it's fine to just accept this difference and change the expected output.
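The reason a single extra space fails at all is that doctest compares expected and actual output character-for-character unless a flag like NORMALIZE_WHITESPACE is set. This can be demonstrated with doctest's own OutputChecker:

```python
import doctest

checker = doctest.OutputChecker()
want = "array([0.,  2.])\n"  # the docstring's expected output (two spaces)
got = "array([0., 2.])\n"    # what numpy actually prints (one space)

# Exact comparison fails on the extra space...
print(checker.check_output(want, got, optionflags=0))
# ...but passes once whitespace is normalized
print(checker.check_output(want, got, doctest.NORMALIZE_WHITESPACE))
```

xarray could in principle run these doctests with NORMALIZE_WHITESPACE instead of patching the docstring, at the cost of loosening all whitespace checks.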

@Illviljan
Contributor

Illviljan commented Aug 26, 2021

Seems to be a problem in numpy: np.around fails the doctest, see numpy/numpy#19759.

I wonder if the docstring can be ignored if it is copied from somewhere else?

I think an easy workaround is simply replacing the docstring:

around = _dask_or_eager_func("around")

around = _dask_or_eager_func("around")
# np.around has failing doctests, overwrite it so they pass:
around.__doc__ = """
    Evenly round to the given number of decimals.

    Parameters
    ----------
    a : array_like
        Input data.
    decimals : int, optional
        Number of decimal places to round to (default: 0).  If
        decimals is negative, it specifies the number of positions to
        the left of the decimal point.
    out : ndarray, optional
        Alternative output array in which to place the result. It must have
        the same shape as the expected output, but the type of the output
        values will be cast if necessary. See :ref:`ufuncs-output-type` for more
        details.

    Returns
    -------
    rounded_array : ndarray
        An array of the same type as `a`, containing the rounded values.
        Unless `out` was specified, a new array is created.  A reference to
        the result is returned.

        The real and imaginary parts of complex numbers are rounded
        separately.  The result of rounding a float is a float.

    See Also
    --------
    ndarray.round : equivalent method

    ceil, fix, floor, rint, trunc


    Notes
    -----
    For values exactly halfway between rounded decimal values, NumPy
    rounds to the nearest even value. Thus 1.5 and 2.5 round to 2.0,
    -0.5 and 0.5 round to 0.0, etc.

    ``np.around`` uses a fast but sometimes inexact algorithm to round
    floating-point datatypes. For positive `decimals` it is equivalent to
    ``np.true_divide(np.rint(a * 10**decimals), 10**decimals)``, which has
    error due to the inexact representation of decimal fractions in the IEEE
    floating point standard [1]_ and errors introduced when scaling by powers
    of ten. For instance, note the extra "1" in the following:

        >>> np.round(56294995342131.5, 3)
        56294995342131.51

    If your goal is to print such values with a fixed number of decimals, it is
    preferable to use numpy's float printing routines to limit the number of
    printed decimals:

        >>> np.format_float_positional(56294995342131.5, precision=3)
        '56294995342131.5'

    The float printing routines use an accurate but much more computationally
    demanding algorithm to compute the number of digits after the decimal
    point.

    Alternatively, Python's builtin `round` function uses a more accurate
    but slower algorithm for 64-bit floating point values:

        >>> round(56294995342131.5, 3)
        56294995342131.5
        >>> np.round(16.055, 2), round(16.055, 2)  # equals 16.0549999999999997
        (16.06, 16.05)


    References
    ----------
    .. [1] "Lecture Notes on the Status of IEEE 754", William Kahan,
           https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
    .. [2] "How Futile are Mindless Assessments of
           Roundoff in Floating-Point Computation?", William Kahan,
           https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf

    Examples
    --------
    >>> np.around([0.37, 1.64])
    array([0., 2.])
    >>> np.around([0.37, 1.64], decimals=1)
    array([0.4, 1.6])
    >>> np.around([.5, 1.5, 2.5, 3.5, 4.5]) # rounds to nearest even value
    array([0., 2., 2., 4., 4.])
    >>> np.around([1,2,3,11], decimals=1) # ndarray of ints is returned
    array([ 1,  2,  3, 11])
    >>> np.around([1,2,3,11], decimals=-1)
    array([ 0,  0,  0, 10])

    """

@Illviljan
Contributor

Saw dask does similar fixes too:
https://github.com/dask/dask/blob/85f0b14bd36a5135ce51aeee067b6207374b00c4/dask/array/wrap.py#L183

Here's a version inspired by that one:

around = _dask_or_eager_func("around")
# np.around has failing doctests, overwrite it so they pass:
# https://github.com/numpy/numpy/issues/19759
around.__doc__ = (
    around.__doc__.replace("array([0.,  2.])", "array([0., 2.])")
    .replace("array([0.4,  1.6])", "array([0.4, 1.6])")
    .replace("array([0.,  2.,  2.,  4.,  4.])", "array([0., 2., 2., 4., 4.])")
)

@dcherian
Contributor

Thanks @Illviljan

@Illviljan
Contributor

I must say it gets kind of annoying having such pedantic doctests and doc generation when upstream modules aren't as picky.

@Illviljan
Contributor

Illviljan commented Aug 27, 2021

I don't understand why the doctests don't go through the if path; is it using tricks that can make around.__doc__ == None?

Anyway, I think just ignoring these typing errors might be easier.

Readthedocs error:

--------
/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/5571/xarray/core/_typed_ops.py:docstring of xarray.core._typed_ops.DataArrayOpsMixin.round:65: WARNING: Explicit markup ends without a blank line; unexpected unindent.
/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/5571/xarray/core/_typed_ops.py:docstring of xarray.core._typed_ops.DataArrayOpsMixin.round:65: WARNING: Unexpected section title or transition.

@TomNicholas
Member Author

Doctests passed! Thanks so much @Illviljan !

Contributor

@dcherian dcherian left a comment

LGTM though maybe we should merge main again before merging..

@TomNicholas
Member Author

Thanks for the reminder @dcherian - I merged main and all the tests pass so I'll merge this PR now!

@TomNicholas TomNicholas merged commit cd6065e into pydata:main Sep 29, 2021
@TomNicholas
Member Author

The what's new entry for this went in under the wrong edition - I fixed it in ebfc6a3

@TomNicholas TomNicholas mentioned this pull request Oct 23, 2021
snowman2 pushed a commit to snowman2/xarray that referenced this pull request Feb 9, 2022
* basic test for the mean

* minimum to get mean working

* don't even need to call dask specifically

* remove reference to dask when dispatching to modules

* fixed special case of pandas vs dask isnull

* removed _dask_or_eager_func completely

* noqa

* pre-commit

* what's new

* linting

* properly import dask for test

* fix iris conversion error by rolling back treatment of np.ma.masked_invalid

* linting

* Update xarray/core/duck_array_ops.py

Co-authored-by: Illviljan <[email protected]>

* Update xarray/core/duck_array_ops.py

Co-authored-by: Illviljan <[email protected]>

* Update xarray/core/duck_array_ops.py

Co-authored-by: Illviljan <[email protected]>

Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Illviljan <[email protected]>
@TomNicholas TomNicholas deleted the duck_array_ops_dont_treat_dask_special branch July 9, 2022 18:12
Successfully merging this pull request may close these issues:

UserWarning when wrapping pint & dask arrays together