
Add functionality to apply Dtype metadata to ColumnBase #8373

Merged: 19 commits merged into rapidsai:branch-21.08 from apply-type-meta on Jun 15, 2021

Conversation

@charlesbluca (Member) commented May 26, 2021:

Based on discussion on #8333:

  • adds _with_type_metadata() to ColumnBase to return a new column with the metadata of dtype applied
  • removes _copy_type_metadata[_from_arrow]() and uses this function in their place

These changes would be helpful for #8153, as we want to be able to copy metadata from one column to another using only the dtype object.
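For illustration, a minimal sketch of the idea (assumed body, not the merged cuDF implementation): the hook lives on the base class as a no-op, and subclasses override it to attach whatever metadata the dtype carries.

class ColumnBase:
    def _with_type_metadata(self, dtype):
        # Return a column with ``dtype``'s metadata applied. The base
        # implementation is a no-op; subclasses (categorical, list, struct,
        # decimal, ...) would override it to rebuild themselves with the
        # extra information the dtype carries (categories, element types,
        # field names, ...).
        return self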

@charlesbluca requested a review from a team as a code owner May 26, 2021 19:50
@github-actions bot added the "Python" (Affects Python cuDF API) label May 26, 2021
@charlesbluca (Member Author) commented:

Just realized this is targeting branch-21.06; could someone retarget to branch-21.08?

@kkraus14 changed the base branch from branch-21.06 to branch-21.08 May 26, 2021 19:53
@charlesbluca (Member Author) commented:

Just noticed that a pretty significant refactor of the dtype metadata copying got merged yesterday (#8278), and now the _copy_type_metadata functions are implemented separately for each ColumnBase subclass. @vyasr, do you have any thoughts on how an _apply_type_metadata(ColumnBase, Dtype) function should be implemented to reflect these changes? Currently, I have it implemented in ColumnBase mirroring what _copy_type_metadata() used to do.
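For concreteness, one hypothetical shape such a dtype-driven hook could take alongside per-subclass implementations (assumed names such as ListDtype.element_type and the children attributes; this is a sketch, not the actual cuDF code):

class ColumnBase:
    def _apply_type_metadata(self, dtype):
        # Plain columns carry no extra metadata to apply.
        return self


class ListColumn(ColumnBase):
    def __init__(self, offsets, elements):
        self.offsets = offsets
        self.elements = elements

    def _apply_type_metadata(self, dtype):
        # Assume ``dtype`` behaves like a ListDtype exposing element_type;
        # recurse so metadata on nested children is applied all the way down.
        self.elements = self.elements._apply_type_metadata(dtype.element_type)
        return self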

@vyasr (Contributor) commented May 27, 2021:

I just chatted briefly with @charlesbluca about this code. I'm not super familiar with this function yet, so @shwina it would be good to sync with you before I make a final recommendation because I suspect that I'm missing some details here. What I'm noticing thus far suggests that _copy_type_metadata_from_arrow merits some significant reworking, though.

The fact that it actually copies the passed column rather than simply applying metadata changes is an unexpected and, IMO, undesirable divergence from _copy_type_metadata. This function is only ever called once, in ColumnBase.from_arrow, where we are generating a new column anyway, which makes me think we should just be able to modify that column's metadata in place rather than making a new column object. Furthermore, a lot of the branches in ColumnBase.from_arrow corresponding to the types checked in _copy_type_metadata_from_arrow return a column immediately, so _copy_type_metadata_from_arrow is never actually called for them. Unless pa.types.is_struct(arrow_array.type) can return True even when isinstance(array.type, pa.StructType) returns False, the only reachable conditional in _copy_type_metadata_from_arrow is the one for ListColumn. All of this leads me to think that we should be able to push any important logic from this function into the from_arrow methods of the corresponding column types and get rid of it entirely.
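To make the reachability point concrete, a quick standalone pyarrow check (assuming the branches behave as described above): for a struct-typed array both predicates agree, so an isinstance-based early return in from_arrow would shadow the struct branch of _copy_type_metadata_from_arrow.

import pyarrow as pa

arr = pa.array([{"a": 1, "b": 2.0}])
print(pa.types.is_struct(arr.type))         # True
print(isinstance(arr.type, pa.StructType))  # True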

@shwina (Contributor) commented May 27, 2021:

@vyasr as I mentioned in #8333 (comment), I really think we just need the one method _apply_type_metadata. I might be missing some details here too, so maybe best we three sync offline?

python/cudf/cudf/core/column/categorical.py (outdated review thread, resolved)
python/cudf/cudf/core/column/column.py (outdated review thread, resolved)

for i in range(len(self.base_children))
)
other.set_base_children(base_children)
other = other._apply_type_metadata(self.dtype)
@charlesbluca (Member Author) commented:

This method, along with _copy_type_metadata_from_arrow, is now a one-liner since all the logic is handled in apply. Is it worth it to keep the copy functions, or should we just replace all their appearances with the corresponding apply method call?
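Roughly what the wrappers collapse to once the dtype-driven method carries the logic (assumed signatures and contract, not the exact diff):

class ColumnBase:
    def _apply_type_metadata(self, dtype):
        return self  # real logic lives in the subclasses

    def _copy_type_metadata(self, other):
        # Assuming the old contract was "apply other's dtype metadata to
        # self", the copy helper reduces to delegating to the new hook.
        return self._apply_type_metadata(other.dtype)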

Reply from a Contributor:

If they are really the same thing, I'd vote to remove the old ones: less clutter, less confusion.

Reply from a Contributor:

I think it's better to avoid the indirection and just delete the methods.

pa.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
cudf.core.column.as_column(
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
),
@charlesbluca (Member Author) commented:

This test gave a somewhat cryptic warning:

cudf/tests/test_column.py::test_as_column_arrow_array[data2-expected2]
  /home/charlesbluca/Documents/GitHub/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pandas/core/dtypes/missing.py:484: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
    if np.any(np.asarray(left_value != right_value)):

-- Docs: https://docs.pytest.org/en/stable/warnings.html

Is there something happening in the as_column operation here that would be causing the elementwise comparison to fail?

Reply from a Contributor:

assert_eq does its final checking in a loop in Python/pandas land. It's probably trying to compare two non-scalar objects, and something weird is happening. I'd suggest calling to_pandas on the cuDF side of the comparison and seeing what the compared objects actually look like in the end.

@charlesbluca (Member Author) commented:

Checking with to_pandas shows that the two objects do look the same:

>>> cudf.Series(actual_column).to_pandas()
0       [[1, 2, 3], [4, 5, 6]]
1    [[7, 8, 9], [10, 11, 12]]
dtype: object
>>> cudf.Series(expected).to_pandas()
0       [[1, 2, 3], [4, 5, 6]]
1    [[7, 8, 9], [10, 11, 12]]
dtype: object

Could this warning just be a consequence of comparing list columns with nested lists?

@isVoid (Contributor) commented Jun 3, 2021:

Probably. When cuDF objects are converted to pandas, they first go through pyarrow, and the arrow-to-pandas conversion turns nested lists into nested arrays:

>>> arr = pa.array([[1, 2, 3]])
>>> arr
<pyarrow.lib.ListArray object at 0x7f43187e8a60>
[
  [
    1,
    2,
    3
  ]
]
>>> arr.to_pandas()
0    [1, 2, 3]
dtype: object
>>> arr.to_pandas().iloc[0]
array([1, 2, 3])

# in some cases, libcudf will return an empty ListColumn with no
# indices; in these cases, we must manually set the base_size to 0 to
# avoid it being negative
return max(0, len(self.base_children[0]) - 1)
@charlesbluca (Member Author) commented Jun 4, 2021:

Just noting this change: @shwina and I noticed that in some test cases, libcudf was returning an empty ListColumn with no indices (i.e. self.base_children[0] is just an empty NumericalColumn). Previously, this resulted in a ListColumn with a base_size of -1, which broke some test cases where this refactor required us to reconstruct the empty ListColumn from that erroneous size.

This change ensures that ListColumn.base_size will always be at least 0.
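A minimal sketch of the guarded property (assumed class and attribute names; the guard itself is the line shown in the diff above):

class ListColumn:
    def __init__(self, base_children):
        # base_children[0] is the offsets column
        self.base_children = base_children

    @property
    def base_size(self):
        # With no offsets, len(...) - 1 would otherwise evaluate to -1.
        return max(0, len(self.base_children[0]) - 1)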

@codecov bot commented Jun 4, 2021:

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@72d8de5).
The diff coverage is n/a.

❗ Current head 507f452 differs from pull request most recent head ad80721. Consider uploading reports for the commit ad80721 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #8373   +/-   ##
===============================================
  Coverage                ?   82.89%           
===============================================
  Files                   ?      110           
  Lines                   ?    18099           
  Branches                ?        0           
===============================================
  Hits                    ?    15004           
  Misses                  ?     3095           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72d8de5...ad80721.

@charlesbluca added the labels "4 - Needs cuDF (Python) Reviewer", "improvement" (Improvement / enhancement to an existing function), and "non-breaking" (Non-breaking change) Jun 10, 2021
@brandon-b-miller (Contributor) commented:

@charlesbluca Sorry it's been a few days since I looked at this - can you merge the latest and rerun tests?

@brandon-b-miller (Contributor) commented:

ugh I did the thing where I said "rerun t*sts" in a comment so it did it anyways :(

@charlesbluca (Member Author) commented:

No worries 😄 looks like all tests are passing.

@shwina (Contributor) left a review:

LGTM! Great work, @charlesbluca!

@isVoid (Contributor) left a review:

Looks great!

@isVoid added the "5 - Ready to Merge" (Testing and reviews complete, ready to merge) label and removed the "4 - Needs cuDF (Python) Reviewer" label Jun 15, 2021
@isVoid (Contributor) commented Jun 15, 2021:

@gpucibot merge

@rapids-bot bot merged commit 884f98f into rapidsai:branch-21.08 Jun 15, 2021
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Jul 2, 2021
…ing (#430)

This PR contains three distinct changes required to get cuspatial builds working and tests passing again:
1. RMM switched to rapids-cmake (rapidsai/rmm#800), which requires CMake 3.20.1, so this PR includes the required updates for that.
2. The Arrow upgrade in cudf also moved the location of testing utilities (rapidsai/cudf#7495). Long term cuspatial needs to move away from use of the testing utilities, which are not part of cudf's public API, but we are currently blocked by rapidsai/cudf#8646, so this PR just imports the internal `assert_eq` method as a stopgap to get tests passing.
3. The changes in rapidsai/cudf#8373 altered the way that metadata is propagated to libcudf outputs from previously existing cuDF Python objects. The new code paths require cuspatial to override metadata copying at the GeoDataFrame rather than the GeoColumn level in order to ensure that information about column types is not lost in the libcudf round trip, and the metadata copying functions are now called on the output DataFrame rather than the input one.

This PR supersedes #427, #428, and #429, all of which can now be closed.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Christopher Harris (https://github.com/cwharris)

URL: #430
@charlesbluca deleted the apply-type-meta branch August 3, 2021 17:48
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge); improvement (Improvement / enhancement to an existing function); non-breaking (Non-breaking change); Python (Affects Python cuDF API)