Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Struct accessor from dask-cudf #8658

Closed
beckernick opened this issue Jul 6, 2021 · 4 comments
Closed

[FEA] Struct accessor from dask-cudf #8658

beckernick opened this issue Jul 6, 2021 · 4 comments
Assignees
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Jul 6, 2021

Today, in cuDF Python I can extract individual fields from a struct column with the struct.field(key) API. As manipulating struct column is common in big data processing frameworks, we should support this accessor in Dask-cuDF (like we do with the list accessor, shown below).

This is likely blocked by #8657 , based on the traceback

import cudf
import dask_cudfs = cudf.Series([
    {"a":5, "b":10},
    {"a":3, "b":7},
    {"a":-3, "b":11}
])
s.struct.field("a")
0    5
1    3
2   -3
dtype: int64
import cudf
import dask_cudf

df = cudf.DataFrame(
    {"col": [
        {"a":5, "b":10},
        {"a":3, "b":7},
        {"a":-3, "b":11}
    ]}
)
ddf = dask_cudf.from_cudf(df, 2)
ddf.col.struct
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_22770/2588260011.py in <module>
     10 )
     11 ddf = dask_cudf.from_cudf(df, 2)
---> 12 ddf.col.struct

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/core.py in __getattr__(self, key)
   3991     def __getattr__(self, key):
   3992         if key in self.columns:
-> 3993             return self[key]
   3994         else:
   3995             raise AttributeError("'DataFrame' object has no attribute %r" % key)

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/core.py in __getitem__(self, key)
   3912             dsk = partitionwise_graph(operator.getitem, name, self, key)
   3913             graph = HighLevelGraph.from_collections(name, dsk, dependencies=[self])
-> 3914             return new_dd_object(graph, name, meta, self.divisions)
   3915         elif isinstance(key, slice):
   3916             from pandas.api.types import is_float_dtype

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/core.py in new_dd_object(dsk, name, meta, divisions, parent_meta)
   6802     """
   6803     if has_parallel_type(meta):
-> 6804         return get_parallel_type(meta)(dsk, name, meta, divisions)
   6805     elif is_arraylike(meta) and meta.shape:
   6806         import dask.array as da

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask_cudf/core.py in __init__(self, dsk, name, meta, divisions)
     63         self.dask = dsk
     64         self._name = name
---> 65         meta = dask_make_meta(meta)
     66         if not isinstance(meta, self._partition_type):
     67             raise TypeError(

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/dataframe/dispatch.py in make_meta(x, index, parent_meta)
    125 
    126     try:
--> 127         return make_meta_dispatch(x, index=index)
    128     except TypeError:
    129         if parent_meta is not None:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask/utils.py in __call__(self, arg, *args, **kwargs)
    573         """
    574         meth = self.dispatch(type(arg))
--> 575         return meth(arg, *args, **kwargs)
    576 
    577     @property

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/dask_cudf/backends.py in make_meta_cudf(x, index)
    130 @make_meta_dispatch.register((cudf.Series, cudf.DataFrame))
    131 def make_meta_cudf(x, index=None):
--> 132     return x.head(0)
    133 
    134 

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/series.py in head(self, n)
   1169         dtype: object
   1170         """
-> 1171         return self.iloc[:n]
   1172 
   1173     def tail(self, n=5):

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
     91         if isinstance(arg, tuple):
     92             arg = list(arg)
---> 93         data = self._sr._column[arg]
     94 
     95         if (

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/struct.py in __getitem__(self, args)
     82 
     83     def __getitem__(self, args):
---> 84         result = super().__getitem__(args)
     85         if isinstance(result, dict):
     86             return {

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in __getitem__(self, arg)
    511         elif isinstance(arg, slice):
    512             start, stop, stride = arg.indices(len(self))
--> 513             return self.slice(start, stop, stride)
    514         else:
    515             arg = as_column(arg)

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in slice(self, start, stop, stride)
    495             stop = stop + len(self)
    496         if (stride > 0 and start >= stop) or (stride < 0 and start <= stop):
--> 497             return column_empty(0, self.dtype, masked=True)
    498         # compute mask slice
    499         if stride == 1:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in column_empty(row_count, dtype, masked)
   1365         mask = None
   1366 
-> 1367     return build_column(
   1368         data, dtype, mask=mask, size=row_count, children=children
   1369     )

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in build_column(data, dtype, size, mask, offset, null_count, children)
   1474         if size is None:
   1475             raise TypeError("Must specify size")
-> 1476         return cudf.core.column.StructColumn(
   1477             data=data,
   1478             dtype=dtype,

cudf/_lib/column.pyx in cudf._lib.column.Column.__init__()

cudf/_lib/column.pyx in cudf._lib.column.Column.set_base_mask()

ValueError: The Buffer for mask is smaller than expected, got 0 bytes, expected 64 bytes.

Note: The mask is expected to be sized according to the base allocation as opposed to the offsetted or sized allocation.

List accessor

import cudf
import dask_cudfdf = cudf.DataFrame(
    {"col": [
        [0,1],
        [1,2],
        [2,3]
    ]}
)
ddf = dask_cudf.from_cudf(df, 2)
ddf.col.list
<dask_cudf.accessors.ListMethods at 0x7f6eb02222e0>
conda list | grep "cudf\|dask\|pandas\|numpy\|arrow"
arrow-cpp                 4.0.1           py38hf0991f3_4_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
cudf                      21.08.00a210706 cuda_11.2_py38_g3ee264c4c3_244    rapidsai-nightly
cudf_kafka                21.08.00a210706 py38_g3ee264c4c3_244    rapidsai-nightly
dask                      2021.6.2           pyhd8ed1ab_0    conda-forge
dask-core                 2021.6.2           pyhd8ed1ab_0    conda-forge
dask-cuda                 21.08.00a210706         py38_32    rapidsai-nightly
dask-cudf                 21.08.00a210706 py38_g3ee264c4c3_244    rapidsai-nightly
geopandas                 0.9.0              pyhd8ed1ab_1    conda-forge
geopandas-base            0.9.0              pyhd8ed1ab_1    conda-forge
libcudf                   21.08.00a210706 cuda11.2_g3ee264c4c3_244    rapidsai-nightly
libcudf_kafka             21.08.00a210706 g3ee264c4c3_244    rapidsai-nightly
numpy                     1.21.0           py38h9894fe3_0    conda-forge
pandas                    1.2.5            py38h1abd341_0    conda-forge
pyarrow                   4.0.1           py38hb53058b_4_cuda    conda-forge
@NV-jpt
Copy link
Contributor

NV-jpt commented Jul 9, 2021

I'm beginning to work on this issue

@rjzamora
Copy link
Member

Any update here @NV-jpt? (No worries if not)

@NV-jpt
Copy link
Contributor

NV-jpt commented Jul 27, 2021

I've created a draft pull-request for this issue! #8874

One of the tests I wrote is failing, so I expect we still need to do a bit more digging, but I think we are just about done!

rapids-bot bot pushed a commit that referenced this issue Aug 24, 2021
This PR implements 'Struct Accessor' requested feature in dask-cudf (Issue [#8658](#8658))

StructMethod class implemented to expose 'field(key)' method in dask-cudf

        Examples
        --------
        >>> s = cudf.Series([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
        >>> ds = dask_cudf.from_cudf(s, 2)
        >>> ds.struct.field(0).compute()
        0    1
        1    3
        dtype: int64
        >>> ds.struct.field('a').compute()
        0    1
        1    3
        dtype: int64

Authors:
  - https://github.com/NV-jpt
  - https://github.com/shaneding

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Ashwin Srinath (https://github.com/shwina)

URL: #8874
@beckernick
Copy link
Member Author

This was implemented in #8874 . Closing

import cudf
import dask_cudfdf = cudf.DataFrame(
    {"col": [
        {"a":5, "b":10},
        {"a":3, "b":7},
        {"a":-3, "b":11}
    ]}
)
ddf = dask_cudf.from_cudf(df, 2)
ddf.col.struct.field("a").compute()
0    5
1    3
2   -3
Name: col, dtype: int64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

3 participants