Minor cleanup of unused Python functions #9974

Merged
merged 7 commits into from
Jan 6, 2022

Conversation

vyasr
Contributor

@vyasr vyasr commented Jan 5, 2022

This PR just removes some unused internal functions and inlines some single-use functions that were defined at the wrong levels of the class hierarchy (largely Frame internal methods that were exclusively called in a single DataFrame method).

@vyasr vyasr added the 3 - Ready for Review, Python, tech debt, improvement, and non-breaking labels Jan 5, 2022
@vyasr vyasr added this to the CuDF Python Refactoring milestone Jan 5, 2022
@vyasr vyasr self-assigned this Jan 5, 2022
@vyasr vyasr requested a review from a team as a code owner January 5, 2022 19:23
@codecov

codecov bot commented Jan 5, 2022

Codecov Report

Merging #9974 (194e209) into branch-22.02 (967a333) will decrease coverage by 0.05%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #9974      +/-   ##
================================================
- Coverage         10.49%   10.43%   -0.06%     
================================================
  Files               119      119              
  Lines             20305    20444     +139     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18310     +135     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eba4f03...194e209. Read the comment docs.

Comment on lines +505 to +507
# Note that both Series and DataFrame return Series objects from this
# calculation, necessitating the unfortunate circular reference to the
# child class here.
Contributor

Is this somehow worse than the previous code referring to Series within the Series.hash_values? It doesn't seem like a problem to me - the comment block might be unnecessary.

Contributor Author

@vyasr vyasr Jan 6, 2022

Yes, because Series inherits from IndexedFrame. This is code in a parent class that relies on knowing a specific child exists, how to instantiate it, etc.

Contributor

Are you saying this is circular (or awkward) from the perspective of Python imports, or from a perspective of object-oriented purity? This implementation looks correct and feels only slightly awkward to me. This could be phrased as, "A _ThingContainer has derived classes SingleThingContainer and MultiThingContainer. A _ThingContainer has a method collapseThings that returns a SingleThingContainer, which can be called by its derived classes," with which I have no qualms.
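The hierarchy described above can be sketched in plain Python. This is only a toy illustration: the class bodies, the `collapse_things` spelling, and the choice of "sum each row" as the collapsing operation are assumptions made for the example, not cuDF code. It shows that a parent method can name a child class because the name is resolved when the method is called, not when the parent is defined.

```python
class _ThingContainer:
    """Hypothetical base class holding named columns of equal length."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def collapse_things(self):
        # The parent refers to a child class by name. The lookup happens at
        # call time, after SingleThingContainer has been defined below.
        rows = zip(*self._columns.values())
        return SingleThingContainer([sum(row) for row in rows])


class SingleThingContainer(_ThingContainer):
    """Hypothetical child holding exactly one column."""

    def __init__(self, values, name="collapsed"):
        super().__init__({name: list(values)})

    @property
    def values(self):
        (column,) = self._columns.values()
        return column


class MultiThingContainer(_ThingContainer):
    """Hypothetical child holding any number of columns."""


multi = MultiThingContainer({"a": [1, 2], "b": [10, 20]})
single = SingleThingContainer([1, 2])

# Both calls return a SingleThingContainer, whichever subclass made the call.
collapsed_multi = multi.collapse_things()
collapsed_single = single.collapse_things()
```

Note that for the single-column case the result has the same shape as the input, so "collapse" is a loose name there.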

Regardless, I don't think there is a good way around it. If this were indicative of a design flaw that we could fix, then I might say we should keep the comment (as a sort of TODO that indicates the circularity should be removed with some improved design). However, I hesitate to call this "unfortunate" if it is necessary.

(This is not a big deal - happy to close the conversation and let you resolve however you wish. Just wanted to clarify my thoughts.)

Contributor Author

@vyasr vyasr Jan 6, 2022

My issue with this is that your description isn't quite correct. The result of DataFrame.hash_values is reduced in dimensionality from the input because it results in a Series, whereas Series.hash_values produces an output of the same dimensionality because a Series is in this case treated like a single column, not a single row. SingleThingContainer.collapseThings() does not actually collapse things. So yes, this is about object-oriented purity, but it is also something of a design flaw because in some sense we're saying that IndexedFrame.hash_values is actually semantically different for different subclasses of IndexedFrame.

From a purist perspective, this discussion is actually making me want to undo this one change in this PR and move the method back to the two child classes for this reason.

Contributor

@bdice bdice Jan 6, 2022

Interesting viewpoint. What you're describing sounds more like a choice of how to leverage a type system (e.g. principles of covariance / contravariance) with the class hierarchy and methods' return types.

Here's why I think this is fine to keep in IndexedFrame, if you wish:

  • Many methods of IndexedFrame (or its parent Frame) have covariant return types and return an instance of self.__class__ (e.g. round)
  • The return type of hash_values is invariant to the choice of subclass. Other return-type-invariant methods like Frame.__len__ can return an int or a str or any type that doesn't depend on the choice of subclass. It just happens that the invariant return type of IndexedFrame.hash_values is Series, a child class of IndexedFrame.

It is partly a question of what semantics we choose to adhere to in IndexedFrame. Regardless, not all methods of IndexedFrame have covariant return types (and it would be silly to require that), so I'm not sure if any reasoning from the type system would justify removing this from IndexedFrame, especially since the underlying Cython hash doesn't care what subclass of Frame it receives. Methods with invariant return types are perfectly fine to include, and I think this is one of them.
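To make the distinction between the two kinds of return type concrete, here is a small sketch with toy classes. `ToyFrame`, `ToySeries`, `ToyDataFrame`, and `toy_hash_values` are hypothetical names invented for this example, and Python's built-in `hash` stands in for the real hashing; none of this is the cuDF implementation.

```python
from typing import TypeVar

T = TypeVar("T", bound="ToyFrame")


class ToyFrame:
    """Hypothetical parent class wrapping a flat list of values."""

    def __init__(self, values):
        self._values = list(values)

    def round(self: T) -> T:
        # Covariant return type: the result has the caller's own class.
        return type(self)([round(v) for v in self._values])

    def __len__(self) -> int:
        # Invariant return type: always int, whatever the subclass.
        return len(self._values)

    def toy_hash_values(self) -> "ToySeries":
        # Also invariant: always ToySeries. The only unusual part is that
        # the fixed return type happens to be a subclass of ToyFrame.
        return ToySeries([hash(v) for v in self._values])


class ToySeries(ToyFrame):
    pass


class ToyDataFrame(ToyFrame):
    pass


frame = ToyDataFrame([1.4, 2.6])
series = ToySeries([1.4, 2.6])

rounded_frame = frame.round()             # a ToyDataFrame
rounded_series = series.round()           # a ToySeries
hashed_frame = frame.toy_hash_values()    # a ToySeries
hashed_series = series.toy_hash_values()  # a ToySeries
```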

Contributor

@shwina shwina Jan 6, 2022

I think it's fine to keep here. I don't think the following is necessarily true:

> IndexedFrame.hash_values is actually semantically different for different subclasses of IndexedFrame.

hash_values hashes each row of a Frame. A Series is viewed as a Frame with N rows, rather than a Frame with N columns. I would consider it surprising behaviour if Series.hash_values instead returned a scalar.
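The row-wise reading can be illustrated with a minimal pure-Python sketch. The `hash_rows` helper is hypothetical, and SHA-256 over each value's `repr` is an arbitrary stand-in chosen for the example, not the hash cuDF computes. The point is only the shape of the result: one hash per row, whether the input has several columns or a single one.

```python
import hashlib


def hash_rows(columns):
    """Return one hex digest per row across a list of equal-length columns."""
    digests = []
    for row in zip(*columns):
        h = hashlib.sha256()
        for value in row:
            h.update(repr(value).encode("utf-8"))
            h.update(b"\x00")  # separator between the values of a row
        digests.append(h.hexdigest())
    return digests


frame_columns = [[1, 2, 3], ["x", "y", "z"]]  # 3 rows, 2 columns
series_columns = [[1, 2, 3]]                  # 3 rows, 1 column

frame_hashes = hash_rows(frame_columns)    # 3 digests
series_hashes = hash_rows(series_columns)  # 3 digests, not a single scalar
```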

Contributor

@shwina shwina left a comment

LGTM!

Contributor

@bdice bdice left a comment

LGTM.

@galipremsagar galipremsagar added the 5 - Ready to Merge label and removed the 3 - Ready for Review label Jan 6, 2022
@galipremsagar
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a61fc55 into rapidsai:branch-22.02 Jan 6, 2022
@vyasr vyasr deleted the refactor/more_cleanup branch January 14, 2022 18:03
Labels
5 - Ready to Merge, improvement, non-breaking, Python