KeyError: ('fetch', 'memory') in Dask 2021.7.2 (execution sometimes hangs) #5152
Comments
Once the MCV code finishes, the Dask dashboard reports 13 GB of "unmanaged old" memory used by the cluster. Does this mean there is a memory leak somewhere?
This example does not reproduce the error on my machine. Truth be told, it barely works, since my machine is a bit too small for it; I spend more time spilling than doing anything else.
Yes and no. The KeyError is raised because this transition is not implemented. However, the problem is not the missing implementation but rather that this transition should never be allowed to occur in the first place. It indicates an inconsistency in the worker state machine, where a worker is trying to do two things simultaneously for a given task, e.g. compute it and fetch it. Long story short, that inconsistency can produce such a transition and raise this exception. While not pretty, this should be harmless. The title of your issue says "execution sometimes hangs", which is likely a different problem.
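As a rough illustration (this is not the actual distributed.worker source, only the dispatch pattern visible in the traceback, with made-up class and state names), the worker looks up a handler in a dict keyed by (start, finish) state pairs, so any pair that was never registered raises exactly this KeyError:

```python
# Minimal sketch of a (start, finish)-keyed transition table; names are
# illustrative only and do not come from the distributed codebase.
class TinyWorkerStateMachine:
    def __init__(self):
        self._transitions = {
            ("fetch", "flight"): self.transition_fetch_flight,
            ("flight", "memory"): self.transition_flight_memory,
            # ("fetch", "memory") is deliberately absent: the state machine
            # assumes a task never jumps straight from "fetch" to "memory".
        }

    def transition(self, start, finish):
        func = self._transitions[start, finish]  # KeyError if the pair is unknown
        func()
        return finish

    def transition_fetch_flight(self):
        print("request the key from a peer worker")

    def transition_flight_memory(self):
        print("store the received value locally")


sm = TinyWorkerStateMachine()
sm.transition("fetch", "flight")   # fine
sm.transition("fetch", "memory")   # raises KeyError: ('fetch', 'memory')
```

In that light, the fix described above is about preventing the inconsistent state, not about adding a ('fetch', 'memory') handler.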
It might be the case that some workers never release their tasks properly. There is likely not a "true" memory leak.
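One way to sanity-check whether workers are still holding on to task results (rather than leaking memory outright) is to compare each worker's managed data with its process RSS. A hedged sketch using the public Client.run API and psutil, assuming an already-connected client named `client`:

```python
# Hedged sketch: count how many keys each worker still holds and compare
# with its process RSS; `client` is assumed to be an existing distributed.Client.
import psutil


def worker_memory_report(dask_worker):
    # `dask_worker` is injected by Client.run; dask_worker.data maps task
    # keys to their stored (possibly spilled) values.
    rss = psutil.Process().memory_info().rss
    return {"n_keys": len(dask_worker.data), "rss_bytes": rss}


print(client.run(worker_memory_report))  # {worker_address: {...}, ...}
```

If `n_keys` stays high after the computation finishes, workers are retaining task results; if `n_keys` is small but RSS stays large, the memory is unmanaged from Dask's point of view.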
FWIW, I found an issue that can cause a deadlock associated with this exception. I'm working on a patch.
FYI, I've also been seeing this issue intermittently in CuPy-backed workflows; I'll make sure to test #5157.
I have experienced this as well. I am not planning to share a reproducible example, as my workflow is too complex to narrow the problem down easily, but I thought I'd still share the fact that I seem to get this error even after updating.
Error:
distributed.utils - ERROR - ('fetch', 'memory')
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils.py", line 638, in log_errors
yield
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2411, in gather_dep
self.transition(ts, "memory", value=data[d])
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1692, in transition
func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f60aa93dcd0>>, <Task finished name='Task-1600' coro=<Worker.gather_dep() done, defined at /root/miniconda3/lib/python3.9/site-packages/distributed/worker.py:2267> exception=KeyError(('fetch', 'memory'))>)
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2411, in gather_dep
self.transition(ts, "memory", value=data[d])
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 1692, in transition
func = self._transitions[start, finish]
KeyError: ('fetch', 'memory')
What happened: When running the code below, Dask reports the KeyError: ('fetch', 'memory') error shown in the traceback above.
Sometimes the execution hangs with one task left, and sometimes it finishes. Changing the number of workers seems to affect the outcome (i.e., whether the code finishes or hangs).
In addition to the above error, I also get the following errors.
EDIT - Added some additional errors that happen when running the MCV code on fake data.
What you expected to happen: I did not expect any errors to be reported.
Minimal Complete Verifiable Example:
EDIT - Added MCV code that will hopefully reproduce the problem.
Anything else we need to know?: In #4721, a user reported that after upgrading to 2021.07.01 they started seeing the same error.
Environment:
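For the version part of the environment section, something like the following can be used to report what is installed; Client.get_versions is a public distributed API, and `client` is assumed to be an existing connected client:

```python
# Quick way to gather the version info asked for under "Environment";
# assumes an existing distributed.Client named `client`.
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)
print(client.get_versions(check=True))  # check=True errors on scheduler/worker version mismatch
```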