Add type_metadata property to ColumnBase #8333

Closed
91 changes: 63 additions & 28 deletions python/cudf/cudf/core/column/column.py
@@ -1273,44 +1273,79 @@ def scatter_to_table(
}
)

-    def _copy_type_metadata(self: T, other: ColumnBase) -> ColumnBase:
+    @property
+    def type_metadata(self):
+        """
+        Return metadata relevant for constructing a copy of this column.
+
+        The metadata will always contain the dtype of the column, in addition
+        to:
+
+        * the categories and ordering, if ``self`` is a ``CategoricalColumn``
+        * the field names, if ``self`` is a ``StructColumn``
+        * the precision, if ``self`` is a ``DecimalColumn``
+        """
+        metadata = {"type": self.__class__}
+
+        if isinstance(self, cudf.core.column.CategoricalColumn):
+            metadata.update(
+                {"categories": self.categories, "ordered": self.ordered}
+            )
+        if isinstance(self, cudf.core.column.StructColumn):
+            metadata["field_keys"] = self.dtype.fields.keys()
+        if isinstance(self, cudf.core.column.DecimalColumn):
+            metadata["precision"] = self.dtype.precision
Comment on lines +1290 to +1297
Collaborator

For structs and decimals, the info is already attached to the dtype here. For categoricals, I believe the categories are attached to the dtype as well, so we should always have the info needed to reconstruct a column given its dtype, no?

Member Author

@charlesbluca, May 24, 2021

That's correct, but wouldn't we need access to self to do that?

The main benefit of these changes is that we could copy type metadata from a column that is no longer in scope. For example, when unpacking a PackedColumns object from #8153, we need to apply the correct dtype metadata to each resulting column, but we probably don't want to have to access the original DataFrame (before packing) to do this.

With these changes, we could store each column's type_metadata in the PackedColumns object, and use that for reconstruction instead.
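
The idea in the comment above can be sketched without a GPU. This is a toy model, not cuDF code: `ToyColumn`, `pack`, and `unpack` are hypothetical names invented for illustration, and only the `type_metadata` property name is taken from the diff. It shows that a metadata dict stored beside the raw values is enough to rebuild a column after the original has gone away.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyColumn:
    """Hypothetical stand-in for a column: raw values plus optional categories."""

    values: list
    categories: Optional[list] = None
    ordered: bool = False

    @property
    def type_metadata(self) -> dict:
        # Mirrors the shape of the dict in the diff, using a string tag
        # instead of a column class.
        meta = {"type": "categorical" if self.categories is not None else "plain"}
        if self.categories is not None:
            meta.update({"categories": list(self.categories), "ordered": self.ordered})
        return meta


def pack(columns: dict) -> dict:
    """Separate raw values from type information, keeping both."""
    return {
        "buffers": {name: list(col.values) for name, col in columns.items()},
        "metadata": {name: col.type_metadata for name, col in columns.items()},
    }


def unpack(packed: dict) -> dict:
    """Rebuild columns from the packed payload alone."""
    restored = {}
    for name, values in packed["buffers"].items():
        meta = packed["metadata"][name]
        restored[name] = ToyColumn(
            values, meta.get("categories"), meta.get("ordered", False)
        )
    return restored


original = {
    "color": ToyColumn([0, 1, 0], categories=["red", "blue"]),
    "n": ToyColumn([1, 2, 3]),
}
packed = pack(original)
del original  # the source columns are no longer in scope
restored = unpack(packed)
```

Here `restored["color"]` carries its categories again even though `original` was deleted before `unpack` ran.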

Collaborator

I guess my point is: why do we need a new function associated with the Column class to do this when all of the information already exists on the dtype? That is, I believe we already have __serialize__ and __deserialize__ implemented for all of our dtypes, so we should already be able to call serialize(column.dtype) and store the result in the PackedColumns object.
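
A rough sketch of this alternative, again with toy stand-ins rather than cuDF classes: `ToyCategoricalDtype`, `ToyDecimalDtype`, `serialize_dtype`, and `deserialize_dtype` are hypothetical, and the JSON layout is an assumption made for the example. It makes no claim about the signatures of cuDF's own dtype serialization methods. The point is only that a dtype object can round-trip on its own, with no column involved.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyCategoricalDtype:
    """Hypothetical dtype that owns its categories and ordering."""

    categories: tuple
    ordered: bool = False


@dataclass(frozen=True)
class ToyDecimalDtype:
    """Hypothetical dtype that owns its precision and scale."""

    precision: int
    scale: int


# Assumed registry mapping a tag in the payload back to a dtype class.
_KINDS = {"categorical": ToyCategoricalDtype, "decimal": ToyDecimalDtype}


def serialize_dtype(dtype) -> str:
    kind = next(k for k, cls in _KINDS.items() if isinstance(dtype, cls))
    return json.dumps({"kind": kind, "fields": asdict(dtype)})


def deserialize_dtype(payload: str):
    doc = json.loads(payload)
    fields = doc["fields"]
    if "categories" in fields:
        # JSON has no tuple type, so restore it after loading.
        fields["categories"] = tuple(fields["categories"])
    return _KINDS[doc["kind"]](**fields)


# Only the serialized dtype is stored next to the raw buffer.
packed_entry = {
    "buffer": [0, 1, 0],
    "dtype": serialize_dtype(ToyCategoricalDtype(("red", "blue"))),
}
restored_dtype = deserialize_dtype(packed_entry["dtype"])
```

Compared with a per-column metadata dict, this keeps a single source of truth for type information, at the cost of requiring every affected dtype to be serializable.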

Member Author

@charlesbluca, May 24, 2021

That makes sense - in that case, I think a good roadmap would be to add the serialize/deserialize methods to the StructDtype so that all the types affected here are serializable, and then implement the relevant as_*_column() methods so that we can cast the columns with the dtype alone.

EDIT: just realized we shouldn't need any casting for the struct/decimal columns - we only need to apply the metadata. Would we want a utility function to apply metadata from a Dtype for this, or is it suitable to just do this conditionally in unpack()?
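
One possible shape for the utility function asked about here, sketched with toy classes. `ToyColumn`, the three `Toy*Dtype` classes, and `apply_dtype_metadata` are all hypothetical; the three branches follow the cases listed in the docstring of the diff (categorical, struct, decimal), and the function returns a new column instead of mutating its input.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyCategoricalDtype:
    categories: tuple
    ordered: bool = False


@dataclass(frozen=True)
class ToyStructDtype:
    field_names: tuple


@dataclass(frozen=True)
class ToyDecimalDtype:
    precision: int
    scale: int


@dataclass(frozen=True)
class ToyColumn:
    data: tuple
    dtype: object


def apply_dtype_metadata(col: ToyColumn, dtype) -> ToyColumn:
    """Return a column with metadata from `dtype` applied; never mutates `col`."""
    if isinstance(dtype, ToyCategoricalDtype) and not isinstance(
        col.dtype, ToyCategoricalDtype
    ):
        # Treat the existing data as codes into dtype.categories.
        return replace(col, dtype=dtype)
    if isinstance(dtype, ToyStructDtype) and isinstance(col.dtype, ToyStructDtype):
        # Only the field names change; the data is untouched.
        return replace(col, dtype=dtype)
    if isinstance(dtype, ToyDecimalDtype) and isinstance(col.dtype, ToyDecimalDtype):
        # Copy the precision only; the column keeps its own scale.
        return replace(col, dtype=replace(col.dtype, precision=dtype.precision))
    return col


codes = ToyColumn((0, 1, 0), "int8")
categorical = apply_dtype_metadata(codes, ToyCategoricalDtype(("red", "blue")))

decimal = ToyColumn((125,), ToyDecimalDtype(precision=18, scale=2))
narrowed = apply_dtype_metadata(decimal, ToyDecimalDtype(precision=9, scale=5))
```

Whether such a helper is worth having over a few conditionals inside `unpack()` depends on how many call sites need it, which is the open question in the comment.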


+        return metadata
+
+    def _copy_type_metadata_from_dict(self: ColumnBase, metadata: dict):
         """
-        Copies type metadata from self onto other, returning a new column.
-
-        * when `self` is a CategoricalColumn and `other` is not, we assume
-          other is a column of codes, and return a CategoricalColumn composed
-          of `other` and the categories of `self`.
-        * when both `self` and `other` are StructColumns, rename the fields
-          of `other` to the field names of `self`.
-        * when both `self` and `other` are DecimalColumns, copy the precision
-          from self.dtype to other.dtype
-        * when `self` and `other` are nested columns of the same type,
-          recursively apply this function on the children of `self` to the
-          and the children of `other`.
-        * if none of the above, return `other` without any changes
+        Applies metadata extracted from another column onto ``self``.
+
+        * when ``metadata["type"]`` is a ``CategoricalColumn`` and ``self`` is
+          not, we assume ``self`` is a column of codes, and return a
+          ``CategoricalColumn`` composed of ``self`` and the categories listed
+          in ``metadata["categories"]``.
+        * when both ``metadata["type"]`` and ``self`` are ``StructColumns``,
+          rename the fields of `self` to the field names listed in
+          ``metadata["field_keys"]``.
+        * when both ``metadata["type"]`` and ``self`` are ``DecimalColumns``,
+          copy the precision from ``metadata["precision"]`` to ``self.dtype``
+        * if none of the above, no changes are applied
         """
-        if isinstance(self, cudf.core.column.CategoricalColumn) and not (
-            isinstance(other, cudf.core.column.CategoricalColumn)
+        if metadata["type"] is cudf.core.column.CategoricalColumn and not (
+            isinstance(self, cudf.core.column.CategoricalColumn)
         ):
-            other = build_categorical_column(
-                categories=self.categories,
-                codes=as_column(other.base_data, dtype=other.dtype),
-                mask=other.base_mask,
-                ordered=self.ordered,
-                size=other.size,
-                offset=other.offset,
-                null_count=other.null_count,
+            self = build_categorical_column(
+                categories=metadata["categories"],
+                codes=as_column(self.base_data, dtype=self.dtype),
+                mask=self.base_mask,
+                ordered=metadata["ordered"],
+                size=self.size,
+                offset=self.offset,
+                null_count=self.null_count,
             )

-        if isinstance(other, cudf.core.column.StructColumn) and isinstance(
+        if metadata["type"] is cudf.core.column.StructColumn and isinstance(
             self, cudf.core.column.StructColumn
         ):
-            other = other._rename_fields(self.dtype.fields.keys())
+            self = self._rename_fields(metadata["field_keys"])

-        if isinstance(other, cudf.core.column.DecimalColumn) and isinstance(
+        if metadata["type"] is cudf.core.column.DecimalColumn and isinstance(
             self, cudf.core.column.DecimalColumn
         ):
-            other.dtype.precision = self.dtype.precision
+            self.dtype.precision = metadata["precision"]
+
+    def _copy_type_metadata(self: T, other: ColumnBase) -> ColumnBase:
+        """
+        Copies type metadata from ``self`` onto ``other``, returning a new
+        column.
+
+        When ``self`` and ``other`` are nested columns of the same type,
+        recursively apply this function on the children of ``self`` and
+        ``other``.
+        """
+        other._copy_type_metadata_from_dict(self.type_metadata)
+
         if type(self) is type(other):
             if self.base_children and other.base_children:
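
A note on the visible part of the hunk: `_copy_type_metadata_from_dict` assigns the rebuilt categorical or struct column to `self` and shows no `return` statement, and `_copy_type_metadata` does not capture a result from the call. In Python, assigning to `self` inside a method only rebinds a local name, so the caller's object would be unaffected in those two branches (the decimal branch mutates `self.dtype` in place, which does take effect). The snippet below demonstrates the language behavior with a toy class; `ToyColumn`, `rebind`, and `rebuild` are hypothetical names, and the rest of the function lies outside the hunk shown here.

```python
class ToyColumn:
    """Hypothetical stand-in used only to show name rebinding."""

    def __init__(self, label: str):
        self.label = label

    def rebind(self) -> None:
        # Rebinds the local name `self`; the caller's object is untouched.
        self = ToyColumn("replacement")

    def rebuild(self) -> "ToyColumn":
        # Returning the new object lets the caller adopt it.
        return ToyColumn("replacement")


col = ToyColumn("original")
col.rebind()
label_after_rebind = col.label  # still "original"

col = col.rebuild()
label_after_rebuild = col.label  # now "replacement"
```

Returning the new column from the helper and assigning it at the call site would match the "return a ``CategoricalColumn``" wording in the docstring.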