
[BUG] dask-cudf .describe() broken with NumPy 1.20 #7289

Closed
shwina opened this issue Feb 3, 2021 · 15 comments
Assignees
Labels
bug Something isn't working dask Dask issue Python Affects Python cuDF API.

Comments

@shwina
Contributor

shwina commented Feb 3, 2021

The .describe() method of dask-cudf fails with cudf 0.18 (nightly) and NumPy v1.20. Minimal repro:

In [2]: import dask_cudf

In [3]: import cudf

In [4]: ddf = dask_cudf.from_cudf(cudf.DataFrame({'a': [1, 2, 3]}), npartitions=2)

In [5]: ddf.describe().compute()

<truncated>

~/rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/dask/array/percentile.py in merge_percentiles(finalq, qs, vals, interpolation, Ns)
    233     combined_vals, combined_counts = zip(*combined_vals_counts)
    234
--> 235     combined_vals = np.array([combined_vals])
    236     combined_counts = np.array(combined_counts)
    237

cupy/core/core.pyx in cupy.core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
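The failure mechanism can be illustrated without a GPU. The DeviceArray class below is a hypothetical stand-in for cupy.ndarray, whose `__array__` hook raises TypeError to forbid silent device-to-host copies; np.asarray (which np.array([combined_vals]) also triggers for its elements) hits exactly that hook:

```python
import numpy as np

# Hypothetical stand-in for cupy.ndarray: CuPy's __array__ raises TypeError
# to prevent implicit device-to-host copies.
class DeviceArray:
    def __init__(self, data):
        self._data = np.asarray(data)  # stands in for GPU memory

    def __array__(self, dtype=None):
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )

    def get(self):
        # Explicit device-to-host copy, as CuPy's error message suggests.
        return self._data.copy()

arr = DeviceArray([1, 2, 3])

try:
    np.asarray(arr)  # implicit conversion, like np.array([combined_vals])
except TypeError as exc:
    print(type(exc).__name__)  # TypeError

host = arr.get()  # the explicit path
```

This is why the fix belongs in Dask's merge_percentiles rather than in CuPy: the conversion has to be made explicit (or avoided entirely) on the caller's side.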
@shwina shwina added bug Something isn't working Needs Triage Need team to review and classify labels Feb 3, 2021
@galipremsagar galipremsagar self-assigned this Feb 3, 2021
@galipremsagar galipremsagar added Python Affects Python cuDF API. dask-cudf and removed Needs Triage Need team to review and classify labels Feb 3, 2021
@kkraus14 kkraus14 assigned shwina and unassigned galipremsagar Feb 3, 2021
@pentschev
Member

Should be fixed by dask/dask#7162.

@github-actions

github-actions bot commented Mar 5, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@pentschev
Member

dask/dask#7162 has been merged; can we close this, @shwina?

@randerzander
Contributor

@pentschev with the latest nightlies, this is still broken:

Dask: '2021.03.0'
cudf: '0.19.0a+269.g444b889a05'
dask-cudf: '0.19.0a+269.g444b889a05'

Snippet:

import dask_cudf, cudf

df = cudf.DataFrame({'id': [0, 1, 2], 'val': [0, 1, 2]})
ddf = dask_cudf.from_cudf(df, npartitions=2)

ddf.describe().compute()

Result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-c5a19e4ab2f2> in <module>
      4 ddf = dask_cudf.from_cudf(df, npartitions=2)
      5 
----> 6 ddf.describe().compute()

/conda/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
    281         dask.base.compute
    282         """
--> 283         (result,) = compute(self, traverse=False, **kwargs)
    284         return result
    285 

/conda/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    563         postcomputes.append(x.__dask_postcompute__())
    564 
--> 565     results = schedule(dsk, keys, **kwargs)
    566     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    567 

/conda/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2652                     should_rejoin = False
   2653             try:
-> 2654                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2655             finally:
   2656                 for f in futures.values():

/conda/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1961             else:
   1962                 local_worker = None
-> 1963             return self.sync(
   1964                 self._gather,
   1965                 futures,

/conda/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    835             return future
    836         else:
--> 837             return sync(
    838                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    839             )

/conda/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    349     if error[0]:
    350         typ, exc, tb = error[0]
--> 351         raise exc.with_traceback(tb)
    352     else:
    353         return result[0]

/conda/lib/python3.8/site-packages/distributed/utils.py in f()
    332             if callback_timeout is not None:
    333                 future = asyncio.wait_for(future, callback_timeout)
--> 334             result[0] = yield future
    335         except Exception as exc:
    336             error[0] = sys.exc_info()

/conda/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/conda/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1826                             exc = CancelledError(key)
   1827                         else:
-> 1828                             raise exception.with_traceback(traceback)
   1829                         raise exc
   1830                     if errors == "skip":

/conda/lib/python3.8/site-packages/dask/array/percentile.py in merge_percentiles()
    232     sort_order = np.argsort(combined_vals)
    233     combined_vals = np.take(combined_vals, sort_order)
--> 234     combined_counts = np.take(combined_counts, sort_order)
    235 
    236     # percentile-like, but scaled by total number of observations

<__array_function__ internals> in take()

/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py in take()
    189            [5, 7]])
    190     """
--> 191     return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
    192 
    193 

/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapfunc()
     65         # Call _wrapit from within the except clause to ensure a potential
     66         # exception has a traceback chain.
---> 67         return _wrapit(obj, method, *args, **kwds)
     68 
     69 

/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapit()
     42     except AttributeError:
     43         wrap = None
---> 44     result = getattr(asarray(obj), method)(*args, **kwds)
     45     if wrap:
     46         if not isinstance(result, mu.ndarray):

cupy/core/core.pyx in cupy.core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

@pentschev
Member

@randerzander could you check that you have at least the versions below? It works for me on a new environment created just now:

In [1]: import dask_cudf, cudf, numpy as np, dask, cupy as cp

In [2]: np.__version__
Out[2]: '1.20.1'

In [3]: cp.__version__
Out[3]: '8.5.0'

In [4]: dask.__version__
Out[4]: '2021.03.0'

In [5]: cudf.__version__
Out[5]: '0.19.0a+270.g267d29ba5a'

In [6]: dask_cudf.__version__
Out[6]: '0.19.0a+270.g267d29ba5a'

In [7]: df = cudf.DataFrame({'id': [0, 1, 2], 'val': [0, 1, 2]})
   ...: ddf = dask_cudf.from_cudf(df, npartitions=2)
   ...:
   ...: ddf.describe().compute()
   ...:
Out[7]:
        id  val
count  3.0  3.0
mean   1.0  1.0
std    1.0  1.0
min    0.0  0.0
25%    0.5  0.5
50%    1.0  1.0
75%    1.5  1.5
max    2.0  2.0

@pentschev
Member

Oh, I see what happened: it's actually now broken for NumPy < 1.20, since dask/dask#7172. @shwina the line https://github.com/dask/dask/pull/7172/files#diff-b0bd5609b1b3853c06a0e8bbe312694bffc952fd1b116bd4ecb6f85a0ea7874bR231 doesn't work for NumPy < 1.20 with CuPy, because those versions lack NEP-35 (a.k.a. the like= kwarg). As a result, the combined_counts variable ends up as an np.ndarray, and when combined_counts = np.take(combined_counts, sort_order) is called we have combined_counts as an np.ndarray and sort_order as a cp.ndarray, raising that error on the CuPy side.
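For reference, NEP-35's like= kwarg (NumPy >= 1.20) dispatches array creation to the library of the reference array. A minimal sketch with a NumPy reference; in merge_percentiles the reference would be the CuPy combined_vals array, making the result a cupy.ndarray on the GPU:

```python
import numpy as np  # requires NumPy >= 1.20 for the like= kwarg

counts = [3, 2, 4]
ref = np.arange(1)  # stands in for the CuPy combined_vals array

# Creation is dispatched to type(ref)'s library; with a cupy.ndarray
# reference this would return a cupy.ndarray instead of an np.ndarray.
combined_counts = np.asarray(counts, like=ref)
print(type(combined_counts).__name__)  # ndarray
```

On NumPy < 1.20 the like= kwarg simply doesn't exist, so the conversion stays on the host, which is exactly the type mismatch described above.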

I see 4 options:

  1. Dropping support for NumPy < 1.20;
  2. Changing the code back to how it was before, but this could mean a performance loss;
  3. Trying to find a performant way that also works with older NumPy;
  4. Special-casing that piece of code for new vs old NumPy (this may get some criticism from Dask folks).
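Option 4 could look roughly like the sketch below. asarray_maybe_like is a hypothetical helper, not Dask code: it gates on the installed NumPy version and falls back to a plain host-side conversion where like= is unavailable:

```python
import numpy as np
from numpy.lib import NumpyVersion

def asarray_maybe_like(seq, ref):
    # NumPy >= 1.20: NEP-35 dispatch keeps the result in ref's library
    # (e.g. a cupy.ndarray when ref is a cupy.ndarray).
    if NumpyVersion(np.__version__) >= "1.20.0":
        return np.asarray(seq, like=ref)
    # Older NumPy: no like= support; result is always a host np.ndarray,
    # so callers must not mix it with device arrays afterwards.
    return np.asarray(seq)

combined_counts = asarray_maybe_like([3, 2, 4], np.empty(0))
```

The drawback, as noted above, is carrying a version special-case inside Dask itself.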

cc @kkraus14 @quasiben @jakirkham

@jakirkham
Member

Requiring NumPy 1.20+ makes a lot of sense to me. There are a lot of really important improvements there, particularly for RAPIDS (of course you already know this, Peter), and I think we are going to find it hard to get things working for older NumPy versions.

@pentschev
Member

Requiring NumPy 1.20+ makes a lot of sense to me. There are a lot of really important improvements there, particularly for RAPIDS (of course you already know this, Peter), and I think we are going to find it hard to get things working for older NumPy versions.

This is my personal preference too, but I guess we need to know whether there are users who, for whatever reason, can't upgrade to NumPy 1.20+. If not, then we should explicitly pin numpy>=1.20, which we clearly don't do at this time.

@jakirkham
Member

Yep I'm happy to make that PR if we go that route 🙂

@kkraus14
Collaborator

kkraus14 commented Mar 24, 2021

It looks like the code was already using the like= kwarg before the linked PR. Was that introduced in another PR that broke numpy < 1.20 support? I'm hesitant about requiring numpy >= 1.20, as we go to pretty big lengths in cuDF Python to support Pandas 1.0+, and I imagine numpy is a similar situation.

@jakirkham
Member

So do you have thoughts about 2-4 then?

@kkraus14
Collaborator

I'm most in favor of number 3 personally, but not sure how much a rabbit hole that is.

@jakirkham
Member

@shwina as you worked on this do you see a way of approaching 3?

@pentschev
Member

It looks like the code was already using the like= kwarg before the linked PR. Was that introduced in another PR that broke the numpy < 1.20?

It's true, but in the previous code both combined_vals[0] and combined_counts[0] were already CuPy arrays as results of merge_sorted/zip, whereas dask/dask#7172 now attempts to convert counts (a list) into a CuPy array (with like=combined_vals).
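The mixed host/device failure in np.take can also be reproduced without a GPU. DeviceIndices below is a hypothetical stand-in for the cupy.ndarray sort_order: NumPy must convert the foreign indices to an integer array, which triggers the forbidden implicit conversion seen in the traceback above:

```python
import numpy as np

# Hypothetical stand-in for a cupy.ndarray used as sort_order: like CuPy,
# it refuses implicit conversion to a NumPy array.
class DeviceIndices:
    def __array__(self, dtype=None):
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )

combined_counts = np.array([3, 2, 4])  # host array (the NumPy < 1.20 path)
sort_order = DeviceIndices()           # device array from argsort

try:
    np.take(combined_counts, sort_order)  # mixed host/device -> TypeError
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```

Keeping both operands in the same library, as the like= conversion does on NumPy >= 1.20, avoids this entirely.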

@kkraus14
Collaborator

kkraus14 commented Apr 6, 2021

This is resolved

@vyasr vyasr added dask Dask issue and removed dask-cudf labels Feb 23, 2024