dask.dataframe.read_csv('./filepath/*.csv') returning tuple #7777
Comments
Thanks for reporting @evanharwin! Are you able to provide a minimal reproducer (see https://blog.dask.org/2018/02/28/minimal-bug-reports)? I'm able to generate an error (discussed below) with a similar code snippet, but not the same error you're seeing. With the current

from dask.datasets import timeseries
from distributed import Client

if __name__ == "__main__":
    # This snippet runs successfully with `processes=True` (the default value)
    # but fails when `processes=False`, with tasks reporting
    # TypeError('cannot unpack non-iterable Serialize object')
    client = Client(processes=False)
    df = timeseries()
    result = df.sample(frac=0.01).drop(["x", "y"], 1).corr().compute()
    print(f"{result = }")

results in tasks failing with

distributed.worker - WARNING - Compute Failed
Function: subgraph_callable-ac5c875a-e373-4f75-befa-b213a8ee
args: (<Serialize: ([Timestamp('2000-01-23 00:00:00', freq='D'), Timestamp('2000-01-24 00:00:00', freq='D')], 1765816192)>)
kwargs: {}
Exception: TypeError('cannot unpack non-iterable Serialize object')

A couple of interesting things to note:
cc @rjzamora for visibility
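The exception text in the log above is the message Python raises whenever code tuple-unpacks an object that is not iterable. As a minimal standard-library sketch of that failure mode (the `Serialize` class and `subgraph_callable` function below are hypothetical stand-ins written for illustration, not the actual distributed code), a callable that expects a plain two-element tuple but receives a still-wrapped argument fails in the same way:

```python
class Serialize:
    """Hypothetical stand-in for a wrapper object around a task argument."""

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"<Serialize: {self.data!r}>"


def subgraph_callable(arg):
    # Hypothetical task body: it assumes `arg` is a plain two-element tuple.
    first, second = arg
    return first, second


payload = (["2000-01-23", "2000-01-24"], 1765816192)

# A plain tuple unpacks without trouble.
ok = subgraph_callable(payload)

# A wrapper that was never unwrapped is not iterable, so unpacking raises.
try:
    subgraph_callable(Serialize(payload))
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

This only reproduces the shape of the error; it says nothing about where in dask or distributed the argument is left wrapped.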
Hmm - This looks like a familiar problem @madsbk and I were running into. I will investigate, but I suspect the culprit is that using
FYI: dask/distributed#4897 should fix this issue; however, it might take some time before the PR is merged.
Thanks all for your input. I solved my issue by calculating the correlations on individual partitions of my dataset using However, I'll keep an eye on that PR and transition to whole dataset correlations when it is merged.
I can confirm this bug exists and I solved it by just removing the
I run into a similar issue with Set
I'd like to close this issue in favor of #8581. Although this issue has useful discussion, that bug report is a bit more focused on the underlying serialization issue (and the likely fix in distributed).
What happened:
Loading a dataframe seemingly returned a tuple, rather than a dask.dataframe, as an exception was thrown: AttributeError: 'tuple' object has no attribute 'sample'
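The AttributeError quoted above is the generic message Python produces when `.sample` is looked up on a tuple. A small standard-library sketch of a check that reports what a loader call actually returned before any attribute is used; `describe_loaded` is a hypothetical helper name, and the tuple passed in is a placeholder, not what `read_csv` returned in the report:

```python
def describe_loaded(obj):
    """Return a short description of the object a loader call handed back."""
    kind = f"{type(obj).__module__}.{type(obj).__qualname__}"
    has_sample = hasattr(obj, "sample")
    return f"type={kind}, has .sample={has_sample}"


# Placeholder value standing in for an unexpected return value.
loaded = ("placeholder", "values")
report = describe_loaded(loaded)

# Looking up .sample on a tuple raises the same AttributeError as in the report.
try:
    loaded.sample
except AttributeError as exc:
    message = str(exc)

print(report)
print(message)
```

Logging a description like this in the failing environment would show whether the object really is a tuple and which module its type comes from.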
What you expected to happen:
I expected the code below to return a pandas.DataFrame with the correlations that I'm looking for!
Minimal Complete Verifiable Example:
Anything else we need to know?:
The example runs fine on my local machine (Windows 10, Dask 2021.1.1, Python 3.8.5), it is just failing when run in containerised compute provided by Azure.
The full traceback is here:
Environment: