
[BUG] dask-cudf .describe() broken with NumPy 1.20 #7289

Closed
shwina opened this issue Feb 3, 2021 · 15 comments
Assignees
Labels
bug Something isn't working dask Dask issue Python Affects Python cuDF API.

Comments

@shwina
Contributor

shwina commented Feb 3, 2021

The .describe() method of dask-cudf fails with cudf 0.18 (nightly) and NumPy v1.20. Minimal repro:

In [2]: import dask_cudf

In [3]: import cudf

In [4]: ddf = dask_cudf.from_cudf(cudf.DataFrame({'a': [1, 2, 3]}), npartitions=2)

In [5]: ddf.describe().compute()

<truncated>

~/rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/dask/array/percentile.py in merge_percentiles(finalq, qs, vals, interpolation, Ns)
    233     combined_vals, combined_counts = zip(*combined_vals_counts)
    234
--> 235     combined_vals = np.array([combined_vals])
    236     combined_counts = np.array(combined_counts)
    237

cupy/core/core.pyx in cupy.core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
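The failure mechanism can be illustrated without a GPU. The DeviceArray class below is a hypothetical stand-in for cupy.ndarray, whose `__array__` hook raises TypeError to forbid silent device-to-host copies; np.asarray (which np.array([combined_vals]) also triggers for its elements) hits exactly that hook:

```python
import numpy as np

# Hypothetical stand-in for cupy.ndarray: CuPy's __array__ raises TypeError
# to prevent implicit device-to-host copies.
class DeviceArray:
    def __init__(self, data):
        self._data = np.asarray(data)  # stands in for GPU memory

    def __array__(self, dtype=None):
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )

    def get(self):
        # Explicit device-to-host copy, as CuPy's error message suggests.
        return self._data.copy()

arr = DeviceArray([1, 2, 3])

try:
    np.asarray(arr)  # implicit conversion, like np.array([combined_vals])
except TypeError as exc:
    print(type(exc).__name__)  # TypeError

host = arr.get()  # the explicit path
```

This is why the fix belongs in Dask's merge_percentiles rather than in CuPy: the conversion has to be made explicit (or avoided entirely) on the caller's side.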
@shwina shwina added bug Something isn't working Needs Triage Need team to review and classify labels Feb 3, 2021
@galipremsagar galipremsagar self-assigned this Feb 3, 2021
@galipremsagar galipremsagar added Python Affects Python cuDF API. dask-cudf and removed Needs Triage Need team to review and classify labels Feb 3, 2021
@kkraus14 kkraus14 assigned shwina and unassigned galipremsagar Feb 3, 2021
@pentschev
Member

Should be fixed by dask/dask#7162.

@github-actions

github-actions bot commented Mar 5, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@pentschev
Member

dask/dask#7162 has been merged; can we close this, @shwina?

@randerzander
Contributor

@pentschev with the latest nightlies, this is still broken:

Dask: '2021.03.0'
cudf: '0.19.0a+269.g444b889a05'
dask-cudf: '0.19.0a+269.g444b889a05'

Snippet:

import dask_cudf, cudf

df = cudf.DataFrame({'id': [0, 1, 2], 'val': [0, 1, 2]})
ddf = dask_cudf.from_cudf(df, npartitions=2)

ddf.describe().compute()

Result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-c5a19e4ab2f2> in <module>
      4 ddf = dask_cudf.from_cudf(df, npartitions=2)
      5 
----> 6 ddf.describe().compute()

/conda/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
    281         dask.base.compute
    282         """
--> 283         (result,) = compute(self, traverse=False, **kwargs)
    284         return result
    285 

/conda/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    563         postcomputes.append(x.__dask_postcompute__())
    564 
--> 565     results = schedule(dsk, keys, **kwargs)
    566     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    567 

/conda/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2652                     should_rejoin = False
   2653             try:
-> 2654                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2655             finally:
   2656                 for f in futures.values():

/conda/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1961             else:
   1962                 local_worker = None
-> 1963             return self.sync(
   1964                 self._gather,
   1965                 futures,

/conda/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    835             return future
    836         else:
--> 837             return sync(
    838                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    839             )

/conda/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    349     if error[0]:
    350         typ, exc, tb = error[0]
--> 351         raise exc.with_traceback(tb)
    352     else:
    353         return result[0]

/conda/lib/python3.8/site-packages/distributed/utils.py in f()
    332             if callback_timeout is not None:
    333                 future = asyncio.wait_for(future, callback_timeout)
--> 334             result[0] = yield future
    335         except Exception as exc:
    336             error[0] = sys.exc_info()

/conda/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/conda/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1826                             exc = CancelledError(key)
   1827                         else:
-> 1828                             raise exception.with_traceback(traceback)
   1829                         raise exc
   1830                     if errors == "skip":

/conda/lib/python3.8/site-packages/dask/array/percentile.py in merge_percentiles()
    232     sort_order = np.argsort(combined_vals)
    233     combined_vals = np.take(combined_vals, sort_order)
--> 234     combined_counts = np.take(combined_counts, sort_order)
    235 
    236     # percentile-like, but scaled by total number of observations

<__array_function__ internals> in take()

/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py in take()
    189            [5, 7]])
    190     """
--> 191     return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
    192 
    193 

/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapfunc()
     65         # Call _wrapit from within the except clause to ensure a potential
     66         # exception has a traceback chain.
---> 67         return _wrapit(obj, method, *args, **kwds)
     68 
     69 

/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapit()
     42     except AttributeError:
     43         wrap = None
---> 44     result = getattr(asarray(obj), method)(*args, **kwds)
     45     if wrap:
     46         if not isinstance(result, mu.ndarray):

cupy/core/core.pyx in cupy.core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

@pentschev
Member

@randerzander could you check that you have at least the versions below? It works for me on a new environment created just now:

In [1]: import dask_cudf, cudf, numpy as np, dask, cupy as cp

In [2]: np.__version__
Out[2]: '1.20.1'

In [3]: cp.__version__
Out[3]: '8.5.0'

In [4]: dask.__version__
Out[4]: '2021.03.0'

In [5]: cudf.__version__
Out[5]: '0.19.0a+270.g267d29ba5a'

In [6]: dask_cudf.__version__
Out[6]: '0.19.0a+270.g267d29ba5a'

In [7]: df = cudf.DataFrame({'id': [0, 1, 2], 'val': [0, 1, 2]})
   ...: ddf = dask_cudf.from_cudf(df, npartitions=2)
   ...:
   ...: ddf.describe().compute()
   ...:
Out[7]:
        id  val
count  3.0  3.0
mean   1.0  1.0
std    1.0  1.0
min    0.0  0.0
25%    0.5  0.5
50%    1.0  1.0
75%    1.5  1.5
max    2.0  2.0

@pentschev
Member

Oh, I see what happened: it's actually now broken for NumPy < 1.20, since dask/dask#7172. @shwina the line https://github.com/dask/dask/pull/7172/files#diff-b0bd5609b1b3853c06a0e8bbe312694bffc952fd1b116bd4ecb6f85a0ea7874bR231 doesn't work for NumPy < 1.20 with CuPy, because those versions lack NEP-35 (a.k.a. the like= kwarg). As a result, the combined_counts variable ends up as an np.ndarray, and when combined_counts = np.take(combined_counts, sort_order) is called we have combined_counts as an np.ndarray and sort_order as a cp.ndarray, raising that error on the CuPy side.
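For reference, NEP-35's like= kwarg (NumPy >= 1.20) dispatches array creation to the library of the reference array. A minimal sketch with a NumPy reference; in merge_percentiles the reference would be the CuPy combined_vals array, making the result a cupy.ndarray on the GPU:

```python
import numpy as np  # requires NumPy >= 1.20 for the like= kwarg

counts = [3, 2, 4]
ref = np.arange(1)  # stands in for the CuPy combined_vals array

# Creation is dispatched to type(ref)'s library; with a cupy.ndarray
# reference this would return a cupy.ndarray instead of an np.ndarray.
combined_counts = np.asarray(counts, like=ref)
print(type(combined_counts).__name__)  # ndarray
```

On NumPy < 1.20 the like= kwarg simply doesn't exist, so the conversion stays on the host, which is exactly the type mismatch described above.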

I see 4 options:

  1. Dropping support for NumPy < 1.20;
  2. Changing the code back to how it was before, but this could mean a performance loss;
  3. Trying to find a performant way that also works with older NumPy;
  4. Special-casing that piece of code for new vs old NumPy (this may get some criticism from Dask folks).
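Option 4 could look roughly like the sketch below. asarray_maybe_like is a hypothetical helper, not Dask code: it gates on the installed NumPy version and falls back to a plain host-side conversion where like= is unavailable:

```python
import numpy as np
from numpy.lib import NumpyVersion

def asarray_maybe_like(seq, ref):
    # NumPy >= 1.20: NEP-35 dispatch keeps the result in ref's library
    # (e.g. a cupy.ndarray when ref is a cupy.ndarray).
    if NumpyVersion(np.__version__) >= "1.20.0":
        return np.asarray(seq, like=ref)
    # Older NumPy: no like= support; result is always a host np.ndarray,
    # so callers must not mix it with device arrays afterwards.
    return np.asarray(seq)

combined_counts = asarray_maybe_like([3, 2, 4], np.empty(0))
```

The drawback, as noted above, is carrying a version special-case inside Dask itself.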

cc @kkraus14 @quasiben @jakirkham

@jakirkham
Member

Requiring NumPy 1.20+ makes a lot of sense to me. There are a lot of really important improvements there, particularly for RAPIDS (of course you already know this, Peter), and I think we are going to find it hard to get things working for older NumPy versions.

@pentschev
Member

Requiring NumPy 1.20+ makes a lot of sense to me. There are a lot of really important improvements there, particularly for RAPIDS (of course you already know this, Peter), and I think we are going to find it hard to get things working for older NumPy versions.

This is my personal preference too, but I guess we need to know whether there are users who, for whatever reason, can't upgrade to NumPy 1.20+. If not, then we should explicitly pin numpy>=1.20, which we clearly don't do at this time.

@jakirkham
Member

Yep I'm happy to make that PR if we go that route 🙂

@kkraus14
Collaborator

kkraus14 commented Mar 24, 2021

It looks like the code was already using the like= kwarg before the linked PR. Was that introduced in another PR that broke numpy < 1.20 support? I'm hesitant about requiring numpy >= 1.20, as we go to pretty big lengths in cuDF Python to support Pandas 1.0+, and I imagine numpy is a similar situation.

@jakirkham
Member

So do you have thoughts about 2-4 then?

@kkraus14
Collaborator

I'm most in favor of number 3 personally, but not sure how much a rabbit hole that is.

@jakirkham
Member

@shwina as you worked on this do you see a way of approaching 3?

@pentschev
Member

It looks like the code was already using the like= kwarg before the linked PR. Was that introduced in another PR that broke the numpy < 1.20?

It's true, but in the previous code both combined_vals[0] and combined_counts[0] were already CuPy arrays as results of merge_sorted/zip, whereas dask/dask#7172 now attempts to convert counts (a list) into a CuPy array (with like=combined_vals).
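The mixed host/device failure in np.take can also be reproduced without a GPU. DeviceIndices below is a hypothetical stand-in for the cupy.ndarray sort_order: NumPy must convert the foreign indices to an integer array, which triggers the forbidden implicit conversion seen in the traceback above:

```python
import numpy as np

# Hypothetical stand-in for a cupy.ndarray used as sort_order: like CuPy,
# it refuses implicit conversion to a NumPy array.
class DeviceIndices:
    def __array__(self, dtype=None):
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )

combined_counts = np.array([3, 2, 4])  # host array (the NumPy < 1.20 path)
sort_order = DeviceIndices()           # device array from argsort

try:
    np.take(combined_counts, sort_order)  # mixed host/device -> TypeError
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```

Keeping both operands in the same library, as the like= conversion does on NumPy >= 1.20, avoids this entirely.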

@kkraus14
Collaborator

kkraus14 commented Apr 6, 2021

This is resolved

@vyasr vyasr added dask Dask issue and removed dask-cudf labels Feb 23, 2024