KeyError when using distributed scheduler with __array_function__ #5224

Closed
gforsyth opened this issue Aug 17, 2021 · 8 comments

@gforsyth
Contributor

gforsyth commented Aug 17, 2021

The _get_computation_code method introduced in #5001 raises an error when relying on NumPy duck typing with a dask.dataframe.

What happened: I get a KeyError when I use numpy.where on a column of a dask.dataframe

What you expected to happen: Forced evaluation of column

Minimal Complete Verifiable Example:
No issue (without distributed scheduler)

In [1]: import dask.datasets
In [2]: ddf = dask.datasets.timeseries(seed=123)
In [3]: import numpy
In [4]: numpy.where(ddf.id > 1000)
Out[4]: (array([      1,       3,       6, ..., 2591988, 2591990, 2591997]),)

with distributed scheduler

In [1]: from distributed.client import Client
In [2]: import numpy
In [3]: import dask.datasets
In [4]: client = Client()
In [5]: ddf = dask.datasets.timeseries(seed=123)
In [6]: numpy.where(ddf.id > 1000)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-6492628d3a98> in <module>
----> 1 numpy.where(ddf.id > 1000)

<__array_function__ internals> in where(*args, **kwargs)

~/miniforge3/envs/dasklatest/lib/python3.8/site-packages/dask/dataframe/core.py in __array__(self, dtype, **kwargs)
    413 
    414     def __array__(self, dtype=None, **kwargs):
--> 415         self._computed = self.compute()
    416         x = np.array(self._computed)
    417         return x

~/miniforge3/envs/dasklatest/lib/python3.8/site-packages/dask/base.py in compute(self, **kwargs)
    284         dask.base.compute
    285         """
--> 286         (result,) = compute(self, traverse=False, **kwargs)
    287         return result
    288 

~/miniforge3/envs/dasklatest/lib/python3.8/site-packages/dask/base.py in compute(*args, **kwargs)
    566         postcomputes.append(x.__dask_postcompute__())
    567 
--> 568     results = schedule(dsk, keys, **kwargs)
    569     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    570 

~/miniforge3/envs/dasklatest/lib/python3.8/site-packages/distributed/client.py in get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2650         Client.compute : Compute asynchronous collections
   2651         """
-> 2652         futures = self._graph_to_futures(
   2653             dsk,
   2654             keys=set(flatten([keys])),

~/miniforge3/envs/dasklatest/lib/python3.8/site-packages/distributed/client.py in _graph_to_futures(self, dsk, keys, workers, allow_other_workers, priority, user_priority, resources, retries, fifo_timeout, actors)
   2589                     "fifo_timeout": fifo_timeout,
   2590                     "actors": actors,
-> 2591                     "code": self._get_computation_code(),
   2592                 }
   2593             )

~/miniforge3/envs/dasklatest/lib/python3.8/site-packages/distributed/client.py in _get_computation_code()
   2524             breakpoint()
   2525             if pattern is None or (
-> 2526                 not pattern.match(fr.f_globals["__name__"])
   2527                 and fr.f_code.co_name not in ("<listcomp>", "<dictcomp>")
   2528             ):

KeyError: '__name__'

Anything else we need to know?:

Environment:

  • Dask version: 2021.8.0
  • Python version: 3.8.10
  • Operating System: OSX
  • Install method (conda, pip, source): conda
@jrbourbeau
Member

Thanks for reporting @gforsyth, I'm able to reproduce. cc @fjetter

@mrocklin
Member

mrocklin commented Aug 17, 2021 via email

@jrbourbeau
Member

Thanks Matt. I believe he'll be out starting the day after tomorrow, so I'm hoping he might be able to take a look at this tomorrow (though it's entirely possible he might be fully saturated with other things).

A more thoughtful solution is definitely welcome, but we could always punt for now with this diff:

diff --git a/distributed/client.py b/distributed/client.py
index e7389220..92f778a6 100644
--- a/distributed/client.py
+++ b/distributed/client.py
@@ -2522,7 +2522,7 @@ class Client:

         for fr, _ in traceback.walk_stack(None):
             if pattern is None or (
-                not pattern.match(fr.f_globals["__name__"])
+                not pattern.match(fr.f_globals.get("__name__", ""))
                 and fr.f_code.co_name not in ("<listcomp>", "<dictcomp>")
             ):
                 try:
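
To illustrate why the .get fallback in the diff above avoids the error, here is a minimal sketch. It assumes the failing frame belongs to a function created with exec in a globals dict that has no "__name__" entry, which is likely how the <__array_function__ internals> frame in the traceback comes about. The helper names (module_names_strict, module_names_tolerant, wrapper) are hypothetical and are not part of distributed.

```python
import sys
import traceback


def module_names_strict():
    """Mimic the original lookup: index f_globals["__name__"] directly."""
    names = []
    for frame, _lineno in traceback.walk_stack(sys._getframe()):
        names.append(frame.f_globals["__name__"])  # KeyError if the key is absent
    return names


def module_names_tolerant():
    """Mimic the patched lookup: fall back to "" when "__name__" is absent."""
    names = []
    for frame, _lineno in traceback.walk_stack(sys._getframe()):
        names.append(frame.f_globals.get("__name__", ""))
    return names


# exec() adds "__builtins__" to this dict but not "__name__", so any function
# defined here runs in a frame whose globals have no "__name__" key.
bare_globals = {}
exec("def wrapper(callback):\n    return callback()\n", bare_globals)
wrapper = bare_globals["wrapper"]

tolerant = wrapper(module_names_tolerant)

strict_error = None
try:
    wrapper(module_names_strict)
except KeyError as exc:
    strict_error = exc
```

In this sketch the tolerant walk reports the wrapper frame as an empty string, while the strict walk stops with the same KeyError: '__name__' shown in the report.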

@pentschev
Member

It seems like @jrbourbeau already has a potential solution, but it's worth noting that although the np.where example above works, I don't think it's generally safe to assume __array_function__ will work with dataframes/series. AFAIK, there's no explicit support for that in pandas or in dask.dataframe.
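
For context, the traceback in the report shows why np.where forces evaluation at all: NumPy coerces its argument through __array__, and the dask.dataframe implementation of __array__ calls self.compute(). Below is a minimal sketch of that coercion path, assuming NumPy is installed. LazyColumn is a hypothetical stand-in and is not a dask or pandas class.

```python
import numpy as np


class LazyColumn:
    """Hypothetical lazy container that only materializes on request."""

    def __init__(self, values):
        self._values = values
        self.compute_calls = 0

    def compute(self):
        # Stand-in for an expensive evaluation step.
        self.compute_calls += 1
        return list(self._values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this when it needs a concrete array.
        return np.array(self.compute(), dtype=dtype)


col = LazyColumn([False, True, False, True])

# np.where has no special knowledge of LazyColumn; it converts the argument
# to an ndarray via __array__, which triggers compute().
(indices,) = np.where(col)
```

This only shows the implicit conversion. It does not imply that other NumPy functions behave sensibly on dataframe-like objects, which is the caveat raised above.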

@jakirkham
Member

Yep, Peter is right, this is unsupported by DataFrame libraries. For example, see this pandas issue (pandas-dev/pandas#26380).

There is a larger discussion about creating a common DataFrame API, but I don't think we have discussed using Array API functions on DataFrames. I have raised an issue (data-apis/dataframe-api#50) about this use case.

@jrbourbeau
Member

Thanks for the feedback @pentschev @jakirkham! I'm proposing the above solution over in #5236. While it's a great point that we can't assume __array_function__ will work in all these situations, distributed also shouldn't raise an error.

FWIW I also ran into the same KeyError: '__name__' when using @eriknw's afar library, which has been fun to play around with

@jakirkham
Member

FWIW I also ran into the same KeyError: '__name__' when using @eriknw's afar library, which has been fun to play around with

Off-topic, but that's neat Erik made a library for that. Saw some of the discussion, but didn't see there was now a library that did this.

@jrbourbeau
Member

Closed via #5236
