Add `type_metadata` property to `ColumnBase` #8333

charlesbluca · 2021-05-24T18:10:27Z

Adds a type_metadata property to ColumnBase, which returns a dictionary with the dtype of the column along with metadata required to construct a copy of this column (if any).

Also adds method _copy_type_metadata_from_dict() to allow a ColumnBase to apply a type metadata dictionary to itself, and refactors _copy_type_metadata() to use this method/property when copying metadata from one column to another.

The motivator here is #8153, which requires us to construct copies of columns with dtype metadata without actually having access to the original columns.

kkraus14 · 2021-05-24T18:12:44Z

python/cudf/cudf/core/column/column.py

+        if isinstance(self, cudf.core.column.CategoricalColumn):
+            metadata.update(
+                {"categories": self.categories, "ordered": self.ordered}
+            )
+        if isinstance(self, cudf.core.column.StructColumn):
+            metadata["field_keys"] = self.dtype.fields.keys()
+        if isinstance(self, cudf.core.column.DecimalColumn):
+            metadata["precision"] = self.dtype.precision


For Structs / Decimals the info is already attached to the dtype here. For Categoricals I believe the categories are attached to the dtype as well, so we should always have the info to reconstruct a column given its dtype, no?

That's correct, but wouldn't we need access to self to do that?

The main function of these changes would be that we could make a copy of a column that is not in scope - for example, when unpacking a PackedColumns object from #8153, we need to apply the correct dtype metadata to each resulting column, but we probably don't want to have to access the original DataFrame (before packing) to do this.

With these changes, we could store each column's type_metadata in the PackedColumns object, and use that for reconstruction instead.
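A minimal sketch of the idea in the comment above, using toy stand-ins rather than cuDF classes. `ToyColumn`, `ToyPackedColumns`, `pack` and `unpack` are hypothetical names invented for illustration; only the shape of the metadata dictionary (dtype plus `categories` / `ordered`) follows the diff under review.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyColumn:
    """Hypothetical stand-in for a column: raw values plus type metadata."""

    values: List[Any]
    dtype: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def type_metadata(self) -> Dict[str, Any]:
        # The dtype plus whatever extra information is needed to rebuild it.
        return {"dtype": self.dtype, **self.metadata}


@dataclass
class ToyPackedColumns:
    """Hypothetical packed form: untyped buffers plus per-column metadata."""

    buffers: List[List[Any]]
    type_metadata: List[Dict[str, Any]]


def pack(columns: List[ToyColumn]) -> ToyPackedColumns:
    return ToyPackedColumns(
        buffers=[list(c.values) for c in columns],
        type_metadata=[c.type_metadata for c in columns],
    )


def unpack(packed: ToyPackedColumns) -> List[ToyColumn]:
    # Rebuild typed columns from the stored metadata alone; the original
    # columns are never consulted.
    columns = []
    for values, meta in zip(packed.buffers, packed.type_metadata):
        meta = dict(meta)  # copy so the stored metadata is not mutated
        dtype = meta.pop("dtype")
        columns.append(ToyColumn(values, dtype, meta))
    return columns


codes = ToyColumn(
    [0, 1, 0], "category", {"categories": ["a", "b"], "ordered": False}
)
restored = unpack(pack([codes]))[0]
```

Here `restored` is built purely from what the packed object carries, which is the property the PackedColumns use case in #8153 is after.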

I guess my point is: why do we need a new function associated with the Column class to do this when all of the information already exists on the dtype? I.e., I believe we already have __serialize__ and __deserialize__ implemented for all of our dtypes, so we should already be able to do serialize(column.dtype) to shove into the PackedColumns object.
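To illustrate the point that the dtype alone can carry everything needed, here is a toy round trip. `ToyCategoricalDtype` is a hypothetical class, and the header/frames split is an assumption made for the sketch; it is not a claim about how cuDF dtypes actually serialize.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToyCategoricalDtype:
    """Hypothetical dtype that owns its categories and ordering."""

    categories: Tuple[str, ...]
    ordered: bool = False

    def serialize(self) -> Tuple[dict, List[bytes]]:
        # Small scalar fields go in the header, bulk data in the frames.
        header = {"type": "category", "ordered": self.ordered}
        frames = [json.dumps(list(self.categories)).encode()]
        return header, frames

    @classmethod
    def deserialize(
        cls, header: dict, frames: List[bytes]
    ) -> "ToyCategoricalDtype":
        categories = tuple(json.loads(frames[0].decode()))
        return cls(categories=categories, ordered=header["ordered"])


dtype = ToyCategoricalDtype(("low", "high"), ordered=True)
header, frames = dtype.serialize()
roundtripped = ToyCategoricalDtype.deserialize(header, frames)
```

If every dtype round-trips like this, a packed object only needs to store the serialized dtypes, and no column-level metadata property is required.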

That makes sense - in that case, I think a good roadmap would be to add the serialize/deserialize methods to the StructDtype so that all the types affected here are serializable, and then implement the relevant as_*_column() methods so that we can cast the columns with the dtype alone.

EDIT: just realized we shouldn't need any casting for the struct/decimal columns - we only need to apply the metadata. Would we want a utility function to apply metadata from a Dtype for this, or is it suitable to just do this conditionally in unpack()?

charlesbluca · 2021-05-25T14:13:43Z

Closing this in favor of just including the columns' dtypes in the PackedColumns object in #8153, and sorting out the associated serialization/deserialization in another PR.

shwina · 2021-05-25T14:15:37Z

I think the logic will end up being very similar to _copy_type_metadata, so it may still be worth investigating how to reuse some of it. Especially given that we now also have a _copy_type_metadata_from_arrow, we'll end up having three pieces of very similar-looking code.

charlesbluca · 2021-05-25T14:21:44Z

Agreed - I think that a general function that is able to apply metadata to a column taking a dtype as input could work for all these cases, although I'm not sure how to handle typing if we want it to work for Arrow arrays.

Would fleshing out the casting functions here be helpful for this purpose?

cudf/python/cudf/cudf/core/column/column.py

Lines 965 to 993 in dd5eecd

    
    def as_numerical_column(
        self, dtype: Dtype
    ) -> "cudf.core.column.NumericalColumn":
        raise NotImplementedError

    def as_datetime_column(
        self, dtype: Dtype, **kwargs
    ) -> "cudf.core.column.DatetimeColumn":
        raise NotImplementedError

    def as_interval_column(
        self, dtype: Dtype, **kwargs
    ) -> "cudf.core.column.IntervalColumn":
        raise NotImplementedError

    def as_timedelta_column(
        self, dtype: Dtype, **kwargs
    ) -> "cudf.core.column.TimeDeltaColumn":
        raise NotImplementedError

    def as_string_column(
        self, dtype: Dtype, format=None
    ) -> "cudf.core.column.StringColumn":
        raise NotImplementedError

    def as_decimal_column(
        self, dtype: Dtype, **kwargs
    ) -> "cudf.core.column.DecimalColumn":
        raise NotImplementedError

shwina · 2021-05-25T14:30:43Z

I think all three cases could be implemented in terms of something like _apply_type_metadata.

class ColumnBase:

    def _apply_type_metadata(self, dtype: Dtype) -> ColumnBase:
        # return a new column composed of the data from `self`
        # and metadata from `dtype`
        pass

    def _copy_type_metadata(self, other: ColumnBase) -> ColumnBase:
        # copy the type metadata from the column `other` onto `self`
        return self._apply_type_metadata(other.dtype)
        
    def _copy_type_metadata_from_arrow(self, other: pa.Array) -> ColumnBase:
        # copy the type metadata from the pa.Array `other` onto `self`
        return self._apply_type_metadata(cudf_dtype_from_pa_type(other.type))

shwina · 2021-05-25T14:32:25Z

I'm not necessarily advocating that we need three distinct methods though. I think just having the first one is enough.
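A runnable toy version of the design sketched in the comments above, under stated assumptions: every class here (`ToyDecimalDtype`, `ToyArrowDecimal`, `ToyArrowArray`, `ToyColumn`) and the helper `toy_dtype_from_arrow_type` are hypothetical stand-ins, with the helper playing the role that `cudf_dtype_from_pa_type` has in the sketch. It only shows how both copy paths can reduce to one dtype-driven method.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyDecimalDtype:
    precision: int
    scale: int


@dataclass(frozen=True)
class ToyArrowDecimal:
    """Hypothetical stand-in for an Arrow decimal type."""

    precision: int
    scale: int


@dataclass(frozen=True)
class ToyArrowArray:
    type: ToyArrowDecimal


def toy_dtype_from_arrow_type(typ: ToyArrowDecimal) -> ToyDecimalDtype:
    # Translate the (toy) Arrow type into the (toy) column dtype.
    return ToyDecimalDtype(typ.precision, typ.scale)


@dataclass(frozen=True)
class ToyColumn:
    data: Tuple[int, ...]
    dtype: ToyDecimalDtype

    def _apply_type_metadata(self, dtype: ToyDecimalDtype) -> "ToyColumn":
        # Return a new column: data from `self`, metadata from `dtype`.
        return replace(self, dtype=dtype)

    def _copy_type_metadata(self, other: "ToyColumn") -> "ToyColumn":
        return self._apply_type_metadata(other.dtype)

    def _copy_type_metadata_from_arrow(
        self, other: ToyArrowArray
    ) -> "ToyColumn":
        return self._apply_type_metadata(
            toy_dtype_from_arrow_type(other.type)
        )


col = ToyColumn((12345, 67890), ToyDecimalDtype(precision=18, scale=2))
donor = ToyColumn((), ToyDecimalDtype(precision=7, scale=2))
from_column = col._copy_type_metadata(donor)
from_arrow = col._copy_type_metadata_from_arrow(
    ToyArrowArray(ToyArrowDecimal(7, 2))
)
```

Both wrappers are one-liners over `_apply_type_metadata`, which is consistent with the remark that the first method may be the only one worth keeping.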

isVoid · 2021-05-25T21:44:48Z

The difficulty with reusing a shared recursive backbone is that both cudf and pyarrow have nested column structures, where each column has its own dtype/type attribute. _apply_type_metadata should apply the dtype to the current level, and recursively call the correct _copy_type_metadata* depending on the child type (whether a cudf column or an arrow array).

shwina · 2021-05-25T21:52:15Z

Right, I imagine that _apply_type_metadata would need to be recursive. In the arrow case, cudf_dtype_from_pa_type(other.type) will recursively construct the appropriate cuDF dtype from the arrow type. Would that work, or maybe I've missed something?

isVoid · 2021-05-25T22:02:34Z

I could be wrong. As a concrete example, say we have a column of lists of lists of ints.
In cudf:

>>> x = cudf.Series([[[1, 2, 3]]])

It is true that x's dtype needs to be recursively constructed, so you have:

>>> x.dtype
ListDtype(ListDtype(int64))

However, x's child column (which is a separate Column instance in cudf) also has a dtype that needs to be recursively constructed:

>>> x._column.children[1].dtype
ListDtype(int64)

This would be better illustrated with a more deeply nested type (e.g. structs of lists) rather than just a ListDtype(int64) column, but for simplicity this should do.

For pyarrow it's pretty much the same.

>>> x = pa.array([[[1, 2, 3]]])
>>> x.type
ListType(list<item: list<item: int64>>)
>>> x.values.type
ListType(list<item: int64>)
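The nesting concern above can be sketched with a toy recursive method. This assumes a simplified model in which a list column's children are `(offsets, elements)`, mirroring the `children[1]` access in the cudf example; `ToyListDtype`, `ToyColumn` and `with_type_metadata` are hypothetical names, not the cuDF implementation.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class ToyListDtype:
    element_type: Union["ToyListDtype", str]


@dataclass(frozen=True)
class ToyColumn:
    dtype: Union[ToyListDtype, str, None]
    children: Tuple["ToyColumn", ...] = ()

    def with_type_metadata(
        self, dtype: Union[ToyListDtype, str]
    ) -> "ToyColumn":
        # Leaf column: only this level's dtype changes.
        if not isinstance(dtype, ToyListDtype):
            return ToyColumn(dtype, self.children)
        # List column: apply the dtype here, then recurse into the
        # elements child with the element type. Offsets are left alone.
        offsets, elements = self.children
        return ToyColumn(
            dtype,
            (offsets, elements.with_type_metadata(dtype.element_type)),
        )


# A list-of-list-of-int column whose dtypes are missing at every level.
leaf = ToyColumn(None)
inner = ToyColumn(None, (ToyColumn("int32"), leaf))
outer = ToyColumn(None, (ToyColumn("int32"), inner))

target = ToyListDtype(ToyListDtype("int64"))
typed = outer.with_type_metadata(target)
```

After the call, each level carries the dtype that matches its depth: the outer column has the full nested dtype, its elements child has the inner list dtype, and the leaf has `"int64"`. A mixed cudf/arrow hierarchy would additionally need to pick the right conversion per child, which this sketch does not attempt.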

Based on discussion on #8333:

- adds `_with_type_metadata()` to `ColumnBase` to return a new column with the metadata of `dtype` applied
- removes `_copy_type_metadata[_from_arrow]()` and uses this function in their place

These changes would be helpful for #8153, as we want to be able to copy metadata from one column to another using only the dtype object.

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Michael Wang (https://github.com/isVoid)

URL: #8373

Add type_metadata property and associated methods

d0280a4

charlesbluca requested a review from a team as a code owner May 24, 2021 18:10

charlesbluca requested review from shwina and isVoid May 24, 2021 18:10

github-actions bot added the Python (Affects Python cuDF API.) label May 24, 2021

kkraus14 reviewed May 24, 2021


charlesbluca closed this May 25, 2021

charlesbluca mentioned this pull request May 26, 2021

Add functionality to apply Dtype metadata to ColumnBase #8373

Merged

charlesbluca deleted the type-metadata-prop branch August 3, 2021 17:48
