Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type_metadata property to ColumnBase #8333

Closed

Conversation

charlesbluca
Copy link
Member

@charlesbluca charlesbluca commented May 24, 2021

Adds a type_metadata property to ColumnBase, which returns a dictionary with the dtype of the column along with metadata required to construct a copy of this column (if any).

Also adds method _copy_type_metadata_from_dict() to allow a ColumnBase to apply a type metadata dictionary to itself, and refactors _copy_type_metadata() to use this method/property when copying metadata from one column to another.

The motivator here is #8153, which requires us to construct copies of columns with dtype metadata without actually having access to the original columns.

@charlesbluca charlesbluca requested a review from a team as a code owner May 24, 2021 18:10
@github-actions github-actions bot added the Python Affects Python cuDF API. label May 24, 2021
Comment on lines +1290 to +1297
if isinstance(self, cudf.core.column.CategoricalColumn):
metadata.update(
{"categories": self.categories, "ordered": self.ordered}
)
if isinstance(self, cudf.core.column.StructColumn):
metadata["field_keys"] = self.dtype.fields.keys()
if isinstance(self, cudf.core.column.DecimalColumn):
metadata["precision"] = self.dtype.precision
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Structs / Decimals the info is already attached to the dtype here. For Categoricals I believe the categories are attached to the dtype as well, where we should always have the info to reconstruct a column given its dtype, no?

Copy link
Member Author

@charlesbluca charlesbluca May 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, but wouldn't we need access to self to do that?

The main function of these changes would be that we could make a copy of a column that is not in scope - for example, when unpacking a PackedColumns object from #8153, we need to apply the correct dtype metadata to each resulting column, but we probably don't want to have to access the original DataFrame (before packing) to do this.

With these changes, we could store each columns' type_metadata in the PackedColumns object, and use that for reconstruction instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my point is why do we need a new function associated with the Column class to do this when all of the information already exists on the dtype? I.E. I believe we already have __serialize__ and __deserialize__ implemented for all of our dtypes, so we should already be able to do serialize(column.dtype) to shove into the PackedColumns object.

Copy link
Member Author

@charlesbluca charlesbluca May 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense - in that case, I think a good roadmap would be to add the serialize/deserialize methods to the StructDtype so that all the types affected here are serializable, and then implement the relevant as_*_column() methods so that we can cast the columns with the dtype alone.

EDIT: just realized we shouldn't need any casting for the struct/decimal columns - we only need to apply the metadata. Would we want a utility function to apply metadata from a Dtype for this, or is it suitable to just do this conditionally in unpack()?

@charlesbluca
Copy link
Member Author

Closing this in favor of just including the columns' dtypes in the PackedColumns object in #8153, and sorting out the associated serialization/deserialization in another PR.

@shwina
Copy link
Contributor

shwina commented May 25, 2021

I think the logic will end up being very similar to _copy_type_metadata, where it may still be worth investigating how to reuse some of it. Especially given that we now also have a _copy_type_metadata_from_arrow, we'll end up having three pieces of very similar looking code.

@charlesbluca
Copy link
Member Author

Agreed - I think that a general function that is able to apply metadata to a column taking a dtype as input could work for all these cases, although I'm not sure how to handling typing if we want it to work for Arrow arrays.

Would fleshing out the casting functions here be helpful for this purpose?

def as_numerical_column(
self, dtype: Dtype
) -> "cudf.core.column.NumericalColumn":
raise NotImplementedError
def as_datetime_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.DatetimeColumn":
raise NotImplementedError
def as_interval_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.IntervalColumn":
raise NotImplementedError
def as_timedelta_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.TimeDeltaColumn":
raise NotImplementedError
def as_string_column(
self, dtype: Dtype, format=None
) -> "cudf.core.column.StringColumn":
raise NotImplementedError
def as_decimal_column(
self, dtype: Dtype, **kwargs
) -> "cudf.core.column.DecimalColumn":
raise NotImplementedError

@shwina
Copy link
Contributor

shwina commented May 25, 2021

I think all three cases could be implemented in terms of something like _apply_type_metadata.

class ColumnBase:

    def _apply_type_metadata(self, dtype: Dtype) -> ColumnBase:
        # return a new column composed of the data from `self`
        # and metadata from `dtype`
        pass

    def _copy_type_metadata(self, other: ColumnBase) -> ColumnBase:
        # copy the type metadata from the column `other` onto `self`
        return self._apply_type_metadata(other.dtype)
        
    def _copy_type_metadata_from_arrow(self, other: pa.Array) -> ColumnBase:
        # copy the type metadata from the pa.Array `other` onto `self`
        return self._apply_type_metadata(cudf_dtype_from_pa_type(other.type))

@shwina
Copy link
Contributor

shwina commented May 25, 2021

I'm not necessarily advocating that we need three distinct methods though. I think just having the first one is enough.

@isVoid
Copy link
Contributor

isVoid commented May 25, 2021

The difficult thing of reusing shared recursive backbone is that both cudf and pyarrow have nested column structure, where each column has there own dtype/type attribute. _apply_type_metadata should apply dtype to the current level, and recursively calls the correct _copy_type_metadata* depending on the child type (whether a cudf column or arrow array).

@shwina
Copy link
Contributor

shwina commented May 25, 2021

Right, I imagine that _apply_type_metadata would need to be recursive. In the arrow case, cudf_dtype_from_pa_type(other.type) will recursively construct the appropriate cuDF dtype from the arrow type. Would that work, or maybe I've missed something?

@isVoid
Copy link
Contributor

isVoid commented May 25, 2021

I could be wrong. For concrete example, say we have a column of list of list of ints:
In cudf,

>>> x = cudf.Series([[[1, 2, 3]]])

It is true that x's dtype needs to be recursively constructed, so you have:

>>> x.dtype
ListDtype(ListDtype(int64))

However, x's child column (which is a separate Column instance in cudf), also has a dtype that requires to be recursively constructed:

>>> s._column.children[1].dtype
ListDtype(int64)

This is best illustrated if the nested type is not only a ListDtype(int64) column, but a more deeply nested (e.g. structs of lists). But for simplicity this should do the justice.

For pyarrow it's pretty much the same.

>>> x = pa.array([[[1, 2, 3]]])
>>> x.type
ListType(list<item: list<item: int64>>)
>>> x.values.type
ListType(list<item: int64>)

rapids-bot bot pushed a commit that referenced this pull request Jun 15, 2021
Based on discussion on #8333:

- adds `_with_type_metadata()` to `ColumnBase` to return a new column with the metadata of `dtype` applied
- removes `_copy_type_metadata[_from_arrow]()` and uses this function in their place

These changes would be helpful for #8153, as we want to be able to copy metadata from one column to another using only the dtype object.

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Michael Wang (https://github.com/isVoid)

URL: #8373
@charlesbluca charlesbluca deleted the type-metadata-prop branch August 3, 2021 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants