Creating dask geodataframe from_dask_dataframe deadlocks #197

Closed
avriiil opened this issue Jun 13, 2022 · 3 comments · Fixed by #205

Comments


avriiil commented Jun 13, 2022

I'm running into issues when creating a Dask GeoDataFrame from a regular Dask DataFrame. After some debugging, it looks like the finalize task is the culprit.

The local reproducer below shows that the finalize task is created. This is not a problem when the data fits in memory.

However, when working at scale, this finalize call causes workers to run out of memory. These workers are not killed but become unresponsive, causing a virtual 'deadlock' in which the finalize task gets endlessly shipped around between workers.

Local Reproducer (does not hang because data fits in memory)

import dask.dataframe as dd
from distributed import Client

client = Client(n_workers=8)

# load some data
ddf = dd.read_csv(
    "s3://nyc-tlc/csv_backup/yellow_tripdata_2012-01.csv",
)

# subset for faster iteration
ddf = ddf.partitions[0:5]

# convert to dask geodataframe
import dask_geopandas
ddf = dask_geopandas.from_dask_dataframe(
    ddf,
    geometry=dask_geopandas.points_from_xy(ddf, "pickup_longitude", "pickup_latitude"),
)

ddf.head()

Cloud-Based Reproducer (hangs because data does not fit into memory)

import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/*",
    engine="pyarrow",
    storage_options={"anon": True},
)

import dask_geopandas

ddf = dask_geopandas.from_dask_dataframe(
    ddf,
    geometry=dask_geopandas.points_from_xy(ddf, "pickup_longitude", "pickup_latitude"),
)

ddf.head()


@gjoseph92 commented

@ian-r-rose and I looked at this with @rrpelgrim. The problem is that the single finalize task is taking every single partition as input, in order to concatenate them. This means trying to move 150GB to a single worker. This should definitely not happen when you do something like head (and the output of finalize should probably not be the input to head?).

The reason this causes a deadlock (as opposed to the worker running out of memory and dying) is a bit deployment-specific: dask/distributed#6110 (comment), dask/distributed#6177. On the dask-geopandas side, we should just focus on why the graph for this head operation is creating this finalize task that takes every partition as input.

@jorisvandenbossche (Member) commented

Thanks for the investigation! Visualizing the graph for the local reproducer, I can also see the finalize bottleneck step:
[task graph image: every partition feeds into a single finalize task]

Now, under the hood, the from_dask_dataframe method is a simple call to dask's map_partitions method:

return df.map_partitions(geopandas.GeoDataFrame, geometry=geometry)

Would this be a general issue with that method?

@ian-r-rose (Contributor) commented

I took a closer look at this today, and the problem is indeed in Dask. In particular, we've run into a known issue where map_partitions treats positional arguments differently from keyword arguments. If a dataframe-like argument is passed in as a positional arg, the function is correctly mapped across the partitions. If it's passed as a keyword arg, it is computed and concatenated before being passed in as a single, non-partitioned series.

I consider this to be a bug in Dask (or at least highly surprising and undesirable behavior). Unfortunately, fixing it would be very invasive to the internal Blockwise implementation and would probably take some time. I can, however, open a PR with a reasonable workaround in dask-geopandas. Briefly: we can avoid referring to the geometry column via a GeoSeries and instead refer to it by name.

@rrpelgrim, in the very short term, I have a workaround that should hopefully unblock you until we can work out a better fix:

import dask.dataframe as dd

# load some data
ddf = dd.read_csv(
    "s3://nyc-tlc/csv_backup/yellow_tripdata_2012-01.csv",
)

# subset for faster iteration
ddf = ddf.partitions[0:5]

# convert to dask geodataframe
import dask_geopandas
# Assign the geometry column using vanilla Dask
ddf = ddf.assign(geometry=dask_geopandas.points_from_xy(ddf, "pickup_longitude", "pickup_latitude"))
# Refer to the geometry column by name
ddf = dask_geopandas.from_dask_dataframe(ddf, geometry="geometry")

ddf.head()

cc @rjzamora, who might find this real-world example of the limitation in map_partitions interesting.
