Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Cannot select struct typed columns in Dask cuDF #8657

Closed
beckernick opened this issue Jul 6, 2021 · 0 comments · Fixed by #8675
Closed

[FEA] Cannot select struct typed columns in Dask cuDF #8657

beckernick opened this issue Jul 6, 2021 · 0 comments · Fixed by #8675
Assignees
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Jul 6, 2021

Today, in cuDF Python I can manipulate struct columns. But, I can't with Dask. One hurdle appears to be that we cannot do column selection on a struct typed column.

I believe this is best framed as a feature request, as this is a new type.

import cudf
import dask_cudfdf = cudf.DataFrame(
    {"col": [
        {"a":5, "b":10},
        {"a":3, "b":7},
        {"a":-3, "b":11}
    ]}
)
ddf = dask_cudf.from_cudf(df, 2)
ddf.col
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_22770/1289136161.py in <module>
     10 )
     11 ddf = dask_cudf.from_cudf(df, 2)
---> 12 ddf.col

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/core.py in __getattr__(self, key)
   3991     def __getattr__(self, key):
   3992         if key in self.columns:
-> 3993             return self[key]
   3994         else:
   3995             raise AttributeError("'DataFrame' object has no attribute %r" % key)

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/core.py in __getitem__(self, key)
   3912             dsk = partitionwise_graph(operator.getitem, name, self, key)
   3913             graph = HighLevelGraph.from_collections(name, dsk, dependencies=[self])
-> 3914             return new_dd_object(graph, name, meta, self.divisions)
   3915         elif isinstance(key, slice):
   3916             from pandas.api.types import is_float_dtype

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/core.py in new_dd_object(dsk, name, meta, divisions, parent_meta)
   6802     """
   6803     if has_parallel_type(meta):
-> 6804         return get_parallel_type(meta)(dsk, name, meta, divisions)
   6805     elif is_arraylike(meta) and meta.shape:
   6806         import dask.array as da

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask_cudf/core.py in __init__(self, dsk, name, meta, divisions)
     63         self.dask = dsk
     64         self._name = name
---> 65         meta = dask_make_meta(meta)
     66         if not isinstance(meta, self._partition_type):
     67             raise TypeError(

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/dispatch.py in make_meta(x, index, parent_meta)
    125 
    126     try:
--> 127         return make_meta_dispatch(x, index=index)
    128     except TypeError:
    129         if parent_meta is not None:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/utils.py in __call__(self, arg, *args, **kwargs)
    573         """
    574         meth = self.dispatch(type(arg))
--> 575         return meth(arg, *args, **kwargs)
    576 
    577     @property

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask_cudf/backends.py in make_meta_cudf(x, index)
    130 @make_meta_dispatch.register((cudf.Series, cudf.DataFrame))
    131 def make_meta_cudf(x, index=None):
--> 132     return x.head(0)
    133 
    134 

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/series.py in head(self, n)
   1169         dtype: object
   1170         """
-> 1171         return self.iloc[:n]
   1172 
   1173     def tail(self, n=5):

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
     91         if isinstance(arg, tuple):
     92             arg = list(arg)
---> 93         data = self._sr._column[arg]
     94 
     95         if (

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/struct.py in __getitem__(self, args)
     82 
     83     def __getitem__(self, args):
---> 84         result = super().__getitem__(args)
     85         if isinstance(result, dict):
     86             return {

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in __getitem__(self, arg)
    511         elif isinstance(arg, slice):
    512             start, stop, stride = arg.indices(len(self))
--> 513             return self.slice(start, stop, stride)
    514         else:
    515             arg = as_column(arg)

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in slice(self, start, stop, stride)
    495             stop = stop + len(self)
    496         if (stride > 0 and start >= stop) or (stride < 0 and start <= stop):
--> 497             return column_empty(0, self.dtype, masked=True)
    498         # compute mask slice
    499         if stride == 1:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in column_empty(row_count, dtype, masked)
   1365         mask = None
   1366 
-> 1367     return build_column(
   1368         data, dtype, mask=mask, size=row_count, children=children
   1369     )

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in build_column(data, dtype, size, mask, offset, null_count, children)
   1474         if size is None:
   1475             raise TypeError("Must specify size")
-> 1476         return cudf.core.column.StructColumn(
   1477             data=data,
   1478             dtype=dtype,

cudf/_lib/column.pyx in cudf._lib.column.Column.__init__()

cudf/_lib/column.pyx in cudf._lib.column.Column.set_base_mask()

ValueError: The Buffer for mask is smaller than expected, got 0 bytes, expected 64 bytes.

Note: The mask is expected to be sized according to the base allocation as opposed to the offsetted or sized allocation.

This does not happen for list typed columns.

conda list | grep "cudf\|dask\|pandas\|numpy\|arrow"
arrow-cpp                 4.0.1           py38hf0991f3_4_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
cudf                      21.08.00a210706 cuda_11.2_py38_g3ee264c4c3_244    rapidsai-nightly
cudf_kafka                21.08.00a210706 py38_g3ee264c4c3_244    rapidsai-nightly
dask                      2021.6.2           pyhd8ed1ab_0    conda-forge
dask-core                 2021.6.2           pyhd8ed1ab_0    conda-forge
dask-cuda                 21.08.00a210706         py38_32    rapidsai-nightly
dask-cudf                 21.08.00a210706 py38_g3ee264c4c3_244    rapidsai-nightly
geopandas                 0.9.0              pyhd8ed1ab_1    conda-forge
geopandas-base            0.9.0              pyhd8ed1ab_1    conda-forge
libcudf                   21.08.00a210706 cuda11.2_g3ee264c4c3_244    rapidsai-nightly
libcudf_kafka             21.08.00a210706 g3ee264c4c3_244    rapidsai-nightly
numpy                     1.21.0           py38h9894fe3_0    conda-forge
pandas                    1.2.5            py38h1abd341_0    conda-forge
pyarrow                   4.0.1           py38hb53058b_4_cuda    conda-forge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants