Creating dask geodataframe from_dask_dataframe deadlocks #197
Comments
@ian-r-rose and I looked at this with @rrpelgrim. The problem is that the single `finalize` task requires gathering all of the data onto one worker. The reason this causes a deadlock (as opposed to the worker running out of memory and dying) is a bit deployment-specific: dask/distributed#6110 (comment), dask/distributed#6177. On the dask-geopandas side, we should just focus on why the graph for this workload funnels into a single task.
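To make the failure mode concrete, here is a toy, hypothetical task graph (plain Python functions, not dask-geopandas code) in which one final task depends on every partition, so the scheduler must co-locate all of the data on whichever worker runs it:

```python
import dask

# Hypothetical stand-in tasks: each "load" produces one partition,
# and a single "finalize" task depends on all of them.
def load(i):
    return list(range(i * 3, i * 3 + 3))

def finalize(*parts):
    combined = []
    for part in parts:
        combined.extend(part)
    return combined

graph = {f"load-{i}": (load, i) for i in range(4)}
graph["finalize"] = (finalize, "load-0", "load-1", "load-2", "load-3")

# Every partition must be materialized together to run "finalize";
# at scale, this is what exhausts a single worker's memory.
result = dask.get(graph, "finalize")  # synchronous scheduler
print(result)  # -> [0, 1, 2, ..., 11]
```

Whether this manifests as an out-of-memory kill or an apparent hang then depends on the deployment, per the linked distributed issues.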
Thanks for the investigation! Visualizing the graph for the local reproducer, I can also see the single `finalize` task. The relevant code is in dask-geopandas/dask_geopandas/core.py, line 781 (at 82da8f1).
Would this be a general issue with that method?
I took a closer look at this today, and the problem is indeed in Dask. In particular, we've run into this issue, which I consider to be a bug in Dask (or at least highly surprising and undesirable behavior). Unfortunately, the fix for that would be very invasive to Dask internals.

@rrpelgrim, in the very short term, I have a workaround that should hopefully unblock you until we can work out a better fix:

```python
import dask.dataframe as dd

# load some data
ddf = dd.read_csv(
    "s3://nyc-tlc/csv_backup/yellow_tripdata_2012-01.csv",
)
# subset for faster iteration
ddf = ddf.partitions[0:5]

# convert to dask geodataframe
import dask_geopandas

# Assign the geometry column using vanilla Dask
ddf = ddf.assign(
    geometry=dask_geopandas.points_from_xy(ddf, "pickup_longitude", "pickup_latitude")
)
# Refer to the geometry column by name
ddf = dask_geopandas.from_dask_dataframe(ddf, geometry="geometry")

ddf.head()
```

cc @rjzamora, who might find this real-world example of the limitation interesting.
I'm running into issues when I create a dask geodataframe from a regular dask dataframe. After some debugging, it looks like the `.finalize()` task is causing issues. The local reproducer below shows that the `finalize` task is created. This is not a problem when the data fits in memory. However, when working at scale, this `finalize` call causes workers to run out of memory. These workers are not killed but become unresponsive, causing a virtual 'deadlock' where the `finalize` task gets endlessly shipped around workers.

Local Reproducer (does not hang because data fits in memory)
Cloud-Based Reproducer (hangs because data does not fit into memory)