
Add functionality to apply Dtype metadata to ColumnBase #8373

Merged: 19 commits merged into rapidsai:branch-21.08 from apply-type-meta on Jun 15, 2021

Conversation

@charlesbluca (Member) commented May 26, 2021:

Based on discussion on #8333:

  • adds _with_type_metadata() to ColumnBase to return a new column with the metadata of dtype applied
  • removes _copy_type_metadata[_from_arrow]() and uses this function in their place

These changes would be helpful for #8153, as we want to be able to copy metadata from one column to another using only the dtype object.
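For illustration, a minimal sketch of the idea (assumed body, not the merged cuDF implementation): the hook lives on the base class as a no-op, and subclasses override it to attach whatever metadata the dtype carries.

class ColumnBase:
    def _with_type_metadata(self, dtype):
        # Return a column with ``dtype``'s metadata applied. The base
        # implementation is a no-op; subclasses (categorical, list, struct,
        # decimal, ...) would override it to rebuild themselves with the
        # extra information the dtype carries (categories, element types,
        # field names, ...).
        return self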

@charlesbluca requested a review from a team as a code owner May 26, 2021 19:50
@github-actions bot added the "Python" (Affects Python cuDF API) label May 26, 2021
@charlesbluca (Member Author) commented:

Just realized this is targeting branch-21.06; could someone retarget to branch-21.08?

@kkraus14 changed the base branch from branch-21.06 to branch-21.08 May 26, 2021 19:53
@charlesbluca (Member Author) commented:

Just noticed that a pretty significant refactor of the dtype metadata copying got merged yesterday (#8278), and now the _copy_type_metadata functions are implemented separately for each ColumnBase subclass. @vyasr, do you have any thoughts on how an _apply_type_metadata(ColumnBase, Dtype) function should be implemented to reflect these changes? Currently, I have it implemented in ColumnBase mirroring what _copy_type_metadata() used to do.
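For concreteness, one hypothetical shape such a dtype-driven hook could take alongside per-subclass implementations (assumed names such as ListDtype.element_type and the children attributes; this is a sketch, not the actual cuDF code):

class ColumnBase:
    def _apply_type_metadata(self, dtype):
        # Plain columns carry no extra metadata to apply.
        return self


class ListColumn(ColumnBase):
    def __init__(self, offsets, elements):
        self.offsets = offsets
        self.elements = elements

    def _apply_type_metadata(self, dtype):
        # Assume ``dtype`` behaves like a ListDtype exposing element_type;
        # recurse so metadata on nested children is applied all the way down.
        self.elements = self.elements._apply_type_metadata(dtype.element_type)
        return self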

@vyasr (Contributor) commented May 27, 2021:

I just chatted briefly with @charlesbluca about this code. I'm not super familiar with this function yet, so @shwina it would be good to sync with you before I make a final recommendation because I suspect that I'm missing some details here. What I'm noticing thus far suggests that _copy_type_metadata_from_arrow merits some significant reworking, though.

The fact that it actually copies the passed column rather than simply applying metadata changes is an unexpected and, IMO, undesirable divergence from _copy_type_metadata. This function is only ever called once, in ColumnBase.from_arrow, where we are generating a new column anyway, which makes me think we should just be able to modify that column's metadata in place rather than making a new column object. Furthermore, a lot of the branches in ColumnBase.from_arrow corresponding to the types checked in _copy_type_metadata_from_arrow return a column immediately, so _copy_type_metadata_from_arrow is never actually called for them. Unless pa.types.is_struct(arrow_array.type) can return True even when isinstance(array.type, pa.StructType) returns False, the only reachable conditional in _copy_type_metadata_from_arrow is the one for ListColumn. All of this leads me to think that we should be able to push any important logic from this function into the from_arrow methods of the corresponding column types and get rid of it entirely.
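To make the reachability point concrete, a quick standalone pyarrow check (assuming the branches behave as described above): for a struct-typed array both predicates agree, so an isinstance-based early return in from_arrow would shadow the struct branch of _copy_type_metadata_from_arrow.

import pyarrow as pa

arr = pa.array([{"a": 1, "b": 2.0}])
print(pa.types.is_struct(arr.type))         # True
print(isinstance(arr.type, pa.StructType))  # True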

@shwina (Contributor) commented May 27, 2021:

@vyasr as I mentioned in #8333 (comment), I really think we just need the one method _apply_type_metadata. I might be missing some details here too, so maybe best we three sync offline?

python/cudf/cudf/core/column/categorical.py (outdated review thread, resolved)
python/cudf/cudf/core/column/column.py (outdated review thread, resolved)

for i in range(len(self.base_children))
)
other.set_base_children(base_children)
other = other._apply_type_metadata(self.dtype)
@charlesbluca (Member Author) commented:

This method, along with _copy_type_metadata_from_arrow, is now a one-liner since all the logic is handled in apply. Is it worth it to keep the copy functions, or should we just replace all their appearances with the corresponding apply method call?
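Roughly what the wrappers collapse to once the dtype-driven method carries the logic (assumed signatures and contract, not the exact diff):

class ColumnBase:
    def _apply_type_metadata(self, dtype):
        return self  # real logic lives in the subclasses

    def _copy_type_metadata(self, other):
        # Assuming the old contract was "apply other's dtype metadata to
        # self", the copy helper reduces to delegating to the new hook.
        return self._apply_type_metadata(other.dtype)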

Reply from a Contributor:

If they are really the same thing, I'd vote to remove the old ones: less clutter, less confusion.

Reply from a Contributor:

I think it's better to avoid the indirection and just delete the methods.

pa.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
cudf.core.column.as_column(
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
),
@charlesbluca (Member Author) commented:

This test gave a somewhat cryptic warning:

cudf/tests/test_column.py::test_as_column_arrow_array[data2-expected2]
  /home/charlesbluca/Documents/GitHub/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pandas/core/dtypes/missing.py:484: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
    if np.any(np.asarray(left_value != right_value)):

-- Docs: https://docs.pytest.org/en/stable/warnings.html

Is there something happening in the as_column operation here that would be causing the elementwise comparison to fail?

Reply from a Contributor:

assert_eq does its final checking in a loop in Python/pandas land. It's probably trying to compare two non-scalar objects, and something weird is happening. I'd suggest calling to_pandas on the cuDF side of the comparison and seeing what the compared objects actually look like in the end.

@charlesbluca (Member Author) commented:

Checking with to_pandas shows that the two objects do look the same:

>>> cudf.Series(actual_column).to_pandas()
0       [[1, 2, 3], [4, 5, 6]]
1    [[7, 8, 9], [10, 11, 12]]
dtype: object
>>> cudf.Series(expected).to_pandas()
0       [[1, 2, 3], [4, 5, 6]]
1    [[7, 8, 9], [10, 11, 12]]
dtype: object

Could this warning just be a consequence of comparing list columns with nested lists?

@isVoid (Contributor) commented Jun 3, 2021:

Probably. When cuDF objects are converted to pandas, they first go through pyarrow, and the arrow-to-pandas conversion turns nested lists into nested arrays:

>>> arr = pa.array([[1, 2, 3]])
>>> arr
<pyarrow.lib.ListArray object at 0x7f43187e8a60>
[
  [
    1,
    2,
    3
  ]
]
>>> arr.to_pandas()
0    [1, 2, 3]
dtype: object
>>> arr.to_pandas().iloc[0]
array([1, 2, 3])

# in some cases, libcudf will return an empty ListColumn with no
# indices; in these cases, we must manually set the base_size to 0 to
# avoid it being negative
return max(0, len(self.base_children[0]) - 1)
@charlesbluca (Member Author) commented Jun 4, 2021:

Just noting this change: @shwina and I noticed that in some test cases, libcudf was returning an empty ListColumn with no indices (i.e. self.base_children[0] is just an empty NumericalColumn). Previously, this resulted in a ListColumn with a base_size of -1, which broke some test cases where this refactor required us to reconstruct the empty ListColumn from that erroneous size.

This change ensures that ListColumn.base_size will always be at least 0.
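A minimal sketch of the guarded property (assumed class and attribute names; the guard itself is the line shown in the diff above):

class ListColumn:
    def __init__(self, base_children):
        # base_children[0] is the offsets column
        self.base_children = base_children

    @property
    def base_size(self):
        # With no offsets, len(...) - 1 would otherwise evaluate to -1.
        return max(0, len(self.base_children[0]) - 1)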

@codecov bot commented Jun 4, 2021:

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@72d8de5).
The diff coverage is n/a.

❗ Current head 507f452 differs from pull request most recent head ad80721. Consider uploading reports for the commit ad80721 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #8373   +/-   ##
===============================================
  Coverage                ?   82.89%           
===============================================
  Files                   ?      110           
  Lines                   ?    18099           
  Branches                ?        0           
===============================================
  Hits                    ?    15004           
  Misses                  ?     3095           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72d8de5...ad80721.

@charlesbluca added the labels "4 - Needs cuDF (Python) Reviewer", "improvement" (Improvement / enhancement to an existing function), and "non-breaking" (Non-breaking change) Jun 10, 2021
@brandon-b-miller (Contributor) commented:

@charlesbluca Sorry it's been a few days since I looked at this - can you merge the latest and rerun tests?

@brandon-b-miller (Contributor) commented:

ugh I did the thing where I said "rerun t*sts" in a comment so it did it anyways :(

@charlesbluca (Member Author) commented:

No worries 😄 looks like all tests are passing.

@shwina (Contributor) left a review:

LGTM! Great work, @charlesbluca!

@isVoid (Contributor) left a review:

Looks great!

@isVoid added the "5 - Ready to Merge" (Testing and reviews complete, ready to merge) label and removed the "4 - Needs cuDF (Python) Reviewer" label Jun 15, 2021
@isVoid (Contributor) commented Jun 15, 2021:

@gpucibot merge

@rapids-bot bot merged commit 884f98f into rapidsai:branch-21.08 Jun 15, 2021
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Jul 2, 2021
…ing (#430)

This PR contains three distinct changes required to get cuspatial builds working and tests passing again:
1. RMM switched to rapids-cmake (rapidsai/rmm#800), which requires CMake 3.20.1, so this PR includes the required updates for that.
2. The Arrow upgrade in cudf also moved the location of testing utilities (rapidsai/cudf#7495). Long term cuspatial needs to move away from use of the testing utilities, which are not part of cudf's public API, but we are currently blocked by rapidsai/cudf#8646, so this PR just imports the internal `assert_eq` method as a stopgap to get tests passing.
3. The changes in rapidsai/cudf#8373 altered the way that metadata is propagated to libcudf outputs from previously existing cuDF Python objects. The new code paths require cuspatial to override metadata copying at the GeoDataFrame rather than the GeoColumn level in order to ensure that information about column types is not lost in the libcudf round trip, and the metadata copying functions are now called on the output DataFrame rather than the input one.

This PR supersedes #427, #428, and #429, all of which can now be closed.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Christopher Harris (https://github.com/cwharris)

URL: #430
@charlesbluca deleted the apply-type-meta branch August 3, 2021 17:48
Labels
5 - Ready to Merge (Testing and reviews complete, ready to merge); improvement (Improvement / enhancement to an existing function); non-breaking (Non-breaking change); Python (Affects Python cuDF API)